Kernel machines - PART II

Transcription

September, 2008
MLSS’08
Stéphane Canu
[email protected]
Roadmap
1 Kernels and the learning problem
Three learning problems
Learning from data: the problem
Kernelizing the linear regression
Kernel machines: a definition
2 Tools: the functional framework
In the beginning was the kernel
Kernel and hypothesis set
Optimization, loss function and the regularization cost
3 Kernel machines
Non sparse kernel machines
Sparse kernel machines
practical SVM
4 Conclusion
Conclusion
In the beginning was the kernel...

Definition (Kernel)
a function of two variables k from X × X to IR

Definition (Positive kernel)
A kernel k(s, t) on X is said to be positive
- if it is symmetric: k(s, t) = k(t, s)
- and if for any finite positive integer n:
  ∀ {α_i}_{i=1,n} ∈ IR, ∀ {x_i}_{i=1,n} ∈ X,   ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k(x_i, x_j) ≥ 0

It is strictly positive if, for α_i ≠ 0,
  ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k(x_i, x_j) > 0
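As a quick numerical illustration (not in the original slides), the following numpy sketch evaluates the double sum ∑_i ∑_j α_i α_j k(x_i, x_j) = αᵀKα for a Gaussian kernel; the kernel, data and coefficients are arbitrary choices made for the example.

```python
import numpy as np

def gaussian_kernel(s, t, b=1.0):
    """Gaussian kernel exp(-||s - t||^2 / b)."""
    return np.exp(-np.sum((s - t) ** 2) / b)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))     # n = 20 points of IR^3
alpha = rng.normal(size=20)      # arbitrary coefficients

# Gram matrix K_ij = k(x_i, x_j) and the quadratic form of the definition
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
print(alpha @ K @ alpha >= -1e-12)   # True: the double sum is non negative
```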
Examples of positive kernels

the linear kernel: s, t ∈ IR^d,   k(s, t) = s^T t
- symmetric: s^T t = t^T s
- positive:
  ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k(x_i, x_j) = ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j x_i^T x_j
                                              = ( ∑_{i=1}^{n} α_i x_i )^T ( ∑_{j=1}^{n} α_j x_j )
                                              = ‖ ∑_{i=1}^{n} α_i x_i ‖^2  ≥ 0

the product kernel: k(s, t) = g(s) g(t) for some g : IR^d → IR
- symmetric by construction
- positive:
  ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k(x_i, x_j) = ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j g(x_i) g(x_j)
                                              = ( ∑_{i=1}^{n} α_i g(x_i) ) ( ∑_{j=1}^{n} α_j g(x_j) )
                                              = ( ∑_{i=1}^{n} α_i g(x_i) )^2  ≥ 0

k is positive ⇔ (its square root exists) ⇔ k(s, t) = ⟨φ_s, φ_t⟩

J.P. Vert, 2006
Example: finite kernel

let φ_j, j = 1, ..., p be a finite dictionary of functions from X to IR (polynomials, wavelets...)

feature map:   Φ : X → IR^p,   s ↦ Φ(s) = (φ_1(s), ..., φ_p(s))

linear kernel in the feature space:
  k(s, t) = (φ_1(s), ..., φ_p(s))^T (φ_1(t), ..., φ_p(t))

e.g. the quadratic kernel: s, t ∈ IR^d,   k(s, t) = (s^T t + b)^2

feature map:   Φ : IR^d → IR^p,   p = 1 + d + d(d+1)/2
  s ↦ Φ(s) = (1, √2 s_1, ..., √2 s_j, ..., √2 s_d, s_1^2, ..., s_j^2, ..., s_d^2, ..., √2 s_i s_j, ...)

p multiplications vs. d + 1: use the kernel to save computation
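A minimal sketch (added, not part of the slides) checking that the explicit quadratic feature map and the kernel evaluation (sᵀt + 1)² agree, here for b = 1 and arbitrary random vectors; the kernel side only needs d + 1 multiplications.

```python
import numpy as np
from itertools import combinations

def quad_feature_map(s):
    """Explicit feature map of the quadratic kernel (s^T t + 1)^2 (b = 1)."""
    cross = [np.sqrt(2) * s[i] * s[j] for i, j in combinations(range(len(s)), 2)]
    return np.concatenate(([1.0], np.sqrt(2) * s, s ** 2, cross))

rng = np.random.default_rng(1)
s, t = rng.normal(size=4), rng.normal(size=4)

k_explicit = quad_feature_map(s) @ quad_feature_map(t)   # p = 1 + d + d(d+1)/2 terms
k_trick = (s @ t + 1.0) ** 2                              # d + 1 multiplications
print(np.isclose(k_explicit, k_trick))                    # True
```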
positive definite kernel (PDK) algebra (closure)

if k_1(s, t) and k_2(s, t) are two positive kernels
- PDK form a convex cone: ∀ a_1 ∈ IR+,   a_1 k_1(s, t) + k_2(s, t)
- for any measurable function ψ from X to IR:   k(s, t) = ψ(s) ψ(t)
- product kernel:   k_1(s, t) k_2(s, t)

proofs
- by linearity:
  ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j ( a_1 k_1(i, j) + k_2(i, j) ) = a_1 ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k_1(i, j) + ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k_2(i, j)
- as a square:
  ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j ψ(x_i) ψ(x_j) = ( ∑_{i=1}^{n} α_i ψ(x_i) ) ( ∑_{j=1}^{n} α_j ψ(x_j) )
- assuming ∃ ψ_ℓ such that k_1(s, t) = ∑_ℓ ψ_ℓ(s) ψ_ℓ(t):
  ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k_1(x_i, x_j) k_2(x_i, x_j) = ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j ∑_ℓ ψ_ℓ(x_i) ψ_ℓ(x_j) k_2(x_i, x_j)
                                                              = ∑_ℓ ∑_{i=1}^{n} ∑_{j=1}^{n} ( α_i ψ_ℓ(x_i) ) ( α_j ψ_ℓ(x_j) ) k_2(x_i, x_j)  ≥ 0  since k_2 is positive

N. Cristianini and J. Shawe-Taylor, Kernel Methods for Pattern Analysis, 2004
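These closure properties can be checked numerically on Gram matrices; the sketch below is an added illustration, with arbitrary data and two standard kernels, verifying that the sum and the elementwise (Schur) product of two positive Gram matrices stay positive.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))

# two positive kernels: linear and Gaussian
K1 = X @ X.T
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = np.exp(-sq)

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

# closure properties: sum and elementwise product stay positive
print(min_eig(2.0 * K1 + K2) >= -1e-8)   # convex cone
print(min_eig(K1 * K2) >= -1e-8)         # product kernel (Schur product)
```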
Kernel engineering: building PDK

- for any polynomial φ with positive coefficients, from IR to IR:   φ( k(s, t) )
- if Ψ is a function from IR^d to IR^d:   k( Ψ(s), Ψ(t) )
- if ϕ, from IR^d to IR+, has its minimum in 0:   k(s, t) = ϕ(s + t) − ϕ(s − t)
- the convolution of two positive kernels is a positive kernel:   K_1 ⋆ K_2

the Gaussian kernel is a PDK
  exp(−‖s − t‖^2) = exp(−‖s‖^2 − ‖t‖^2 + 2 s^T t)
                  = exp(−‖s‖^2) exp(−‖t‖^2) exp(2 s^T t)

- s^T t is a PDK, and exp is the limit of a series expansion with positive coefficients, so exp(2 s^T t) is a PDK
- exp(−‖s‖^2) exp(−‖t‖^2) is a PDK as a product kernel
- the product of two PDKs is a PDK

O. Catoni, master lecture, 2005
an attempt at classifying PD kernels

- stationary kernels (also called translation invariant):   k(s, t) = k_s(s − t)
  - radial (isotropic):  Gaussian: exp(−r²/b),  r = ‖s − t‖
  - with compact support:  c.s. Matèrn: max(0, 1 − r/b)^κ (r/b)^k B_k(r/b),  κ ≥ (d + 1)/2
- locally stationary kernels:   k(s, t) = k_1(s + t) k_s(s − t)
  where k_1 is a non negative function and k_s a radial kernel
- non stationary (projective) kernels:   k(s, t) = k_p(s^T t)
- separable kernels:   k(s, t) = k_1(s) k_2(t) with k_1 and k_2 PDK
  in this case K = k_1 k_2^T, where k_1 = (k_1(x_1), ..., k_1(x_n))^T

M.G. Genton, Classes of Kernels for Machine Learning: A Statistics Perspective, JMLR, 2002
Kernel spectral representation

- stationary kernels (Bochner's theorem):
    k_s(s − t) = ∫_{IR^d} cos( ω^T (s − t) ) F(dω)
  with F a positive finite measure
- radial kernels:
    k_r(‖s − t‖) = ∫_0^∞ Ψ_d( ω ‖s − t‖ ) F(dω)
  where F is non decreasing and bounded, and Ψ_d a specific function
- non stationary (projective) kernels:
    k_p(s, t) = ∫_{IR^d} ∫_{IR^d} cos( ω_1^T s − ω_2^T t ) F(dω_1, dω_2)
  with F a positive bounded symmetric measure

Fourier transform may help to design PD kernels
some examples of PD kernels...

type         name           k(s, t)
radial       gaussian       exp(−r²/b),   r = ‖s − t‖
radial       laplacian      exp(−r/b)
radial       rational       1 − r²/(r² + b)
radial       loc. gauss.    max(0, 1 − r/(3b))^d exp(−r²/b)
non stat.    χ²             exp(−r/b),   r = ∑_k (s_k − t_k)² / (s_k + t_k)
projective   polynomial     (s^T t)^p
projective   affine         (s^T t + b)^p
projective   cosine         s^T t / (‖s‖ ‖t‖)
projective   correlation    exp( s^T t / (‖s‖ ‖t‖) − b )

Most of these kernels depend on a quantity b called the bandwidth.
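For reference, here is one possible numpy rendering (added, not from the slides) of a few kernels from the table, vectorized over pairs of points; the function names and bandwidth defaults are illustrative.

```python
import numpy as np

def gaussian(S, T, b=1.0):
    """Radial Gaussian kernel exp(-||s - t||^2 / b) for all pairs of rows."""
    r2 = np.sum((S[:, None, :] - T[None, :, :]) ** 2, axis=-1)
    return np.exp(-r2 / b)

def laplacian(S, T, b=1.0):
    r = np.sqrt(np.sum((S[:, None, :] - T[None, :, :]) ** 2, axis=-1))
    return np.exp(-r / b)

def affine(S, T, b=1.0, p=2):
    """Projective kernel (s^T t + b)^p."""
    return (S @ T.T + b) ** p

def cosine(S, T):
    G = S @ T.T
    return G / (np.linalg.norm(S, axis=1)[:, None] * np.linalg.norm(T, axis=1)[None, :])

X = np.random.default_rng(3).normal(size=(5, 2))
for k in (gaussian, laplacian, affine, cosine):
    print(k.__name__, k(X, X).shape)   # each returns a 5 x 5 Gram matrix
```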
the importance of the kernel bandwidth

for the affine kernel: bandwidth = bias
  k(s, t) = (s^T t + b)^p = b^p ( s^T t / b + 1 )^p

for the gaussian kernel: bandwidth = influence zone
  k(s, t) = (1/Z) exp( −‖s − t‖² / (2σ²) ),   with b = 2σ²

Illustration: 1-d density estimation
+ data (x_1, x_2, ..., x_n)
– Parzen estimate:   P̂(x) = (1/Z) ∑_{i=1}^{n} k(x, x_i)
(shown for b = 1/2 and b = 2)
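A minimal sketch of the Parzen estimate above (added illustration; the bimodal sample and the bandwidth values are arbitrary), showing that the estimate integrates to about 1 for any b while its smoothness changes with b.

```python
import numpy as np

def parzen(x_grid, samples, b):
    """1-d Parzen density estimate with the Gaussian kernel exp(-(x - xi)^2 / b)."""
    K = np.exp(-(x_grid[:, None] - samples[None, :]) ** 2 / b)
    Z = len(samples) * np.sqrt(np.pi * b)   # integral of exp(-u^2 / b) is sqrt(pi * b)
    return K.sum(axis=1) / Z

rng = np.random.default_rng(4)
samples = np.concatenate([rng.normal(-2, 0.5, 50), rng.normal(1, 0.8, 50)])
x = np.linspace(-5, 4, 400)

for b in (0.5, 2.0):    # small b: spiky estimate; large b: oversmoothed
    p = parzen(x, samples, b)
    print(b, (p * (x[1] - x[0])).sum())   # both integrate to about 1
```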
kernels for objects and structures

kernels on histograms and probability distributions
  k( p(x), q(x) ) = ∫ k_i( p(x), q(x) ) IP(x) dx

kernels on strings
- spectral string kernel, using sub-sequences:   k(s, t) = ∑_u φ_u(s) φ_u(t)
- similarities by alignments:   k(s, t) = ∑_π exp( β(s, t, π) )

kernels on graphs
- the pseudo inverse of the (regularized) graph Laplacian
    L = D − A     (A is the adjacency matrix, D the degree matrix)
- diffusion kernels:   (1/Z(b)) exp(bL)
- subgraph kernel convolution (using random walks)

and kernels on heterogeneous data (images), HMM, automata...

Shawe-Taylor & Cristianini's book, 2004; J.P. Vert, 2006
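As an added illustration of the graph case, the sketch below builds the Laplacian L = D − A of a small path graph and a diffusion (heat) kernel; it uses the common exp(−bL) convention (sign conventions vary in the literature), and the toy graph is arbitrary.

```python
import numpy as np
from scipy.linalg import expm

# a small toy graph: 5 nodes on a path, adjacency and degree matrices
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1
D = np.diag(A.sum(axis=1))
L = D - A                      # graph Laplacian

b = 0.5
K = expm(-b * L)               # diffusion (heat) kernel on the graph
print(np.allclose(K, K.T))                 # symmetric
print(np.linalg.eigvalsh(K).min() > 0)     # positive definite
```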
Gram matrix

Definition (Gram matrix)
let k(s, t) be a positive kernel on X and (x_i)_{i=1,n} a sequence on X.
The Gram matrix is the square matrix K of dimension n with general term K_ij = k(x_i, x_j).

practical trick to check kernel positivity:
K is positive ⇔ its eigenvalues λ_i are positive: if K u_i = λ_i u_i, i = 1, n, then
  u_i^T K u_i = λ_i u_i^T u_i = λ_i ≥ 0

in practice, the matrix K is the one to be used
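The eigenvalue trick translates directly into code; this is an added sketch where `is_positive_kernel` is a hypothetical helper and the sample is arbitrary.

```python
import numpy as np

def is_positive_kernel(k, X, tol=1e-10):
    """Check positivity of k on a sample via the eigenvalues of its Gram matrix."""
    K = np.array([[k(s, t) for t in X] for s in X])
    return np.linalg.eigvalsh(K).min() >= -tol

X = list(np.random.default_rng(6).normal(size=(15, 2)))
print(is_positive_kernel(lambda s, t: np.exp(-np.sum((s - t) ** 2)), X))  # True: Gaussian kernel
print(is_positive_kernel(lambda s, t: -np.sum((s - t) ** 2), X))          # False: only conditionally positive
```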
Examples of Gram matrices with different bandwidths
(Figure: raw data and the Gram matrices obtained for b = .5, b = 2 and b = 10)
different points of view about kernels

kernel and scalar product:   k(s, t) = ⟨φ(s), φ(t)⟩_H

kernel and distance:   d(s, t)² = k(s, s) + k(t, t) − 2 k(s, t)

kernel and covariance: a positive matrix is a covariance matrix
  IP(f) = (1/Z) exp( −½ (f − f_0)^T K^{-1} (f − f_0) )
  if f_0 = 0 and f = Kα:   IP(α) = (1/Z) exp( −½ α^T K α )

kernel and regularity (Green's function):   k(s, t) = P* P δ_{s−t}   for some operator P (e.g. some differential operator)
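An added sketch of the distance and covariance points of view: the kernel-induced squared distance is non negative, and sampling from N(0, K) draws smooth random functions (a small jitter is added to K for numerical stability); the grid and bandwidth are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 1, 50)

# Gaussian kernel Gram matrix, seen as a covariance matrix
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.05)

# kernel-induced distance d(s,t)^2 = k(s,s) + k(t,t) - 2 k(s,t)
d2 = K.diagonal()[:, None] + K.diagonal()[None, :] - 2 * K
print(d2.min() >= -1e-12)          # squared distances are non negative

# sample f ~ N(0, K): smooth random functions (covariance point of view)
f = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
print(f.shape)                      # (3, 50): three sampled functions on the grid
```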
Let's summarize

- positive kernels
- there are a lot of them
- they can be rather complex
- 2 classes: radial / projective
- the bandwidth matters (more than the kernel itself)
- the Gram matrix summarizes the pairwise comparisons
Roadmap
1 Kernels and the learning problem
Three learning problems
Learning from data: the problem
Kernelizing the linear regression
Kernel machines: a definition
2 Tools: the functional framework
In the beginning was the kernel
Kernel and hypothesis set
Optimization, loss function and the regularization cost
3 Kernel machines
Non sparse kernel machines
Sparse kernel machines
practical SVM
4 Conclusion
Conclusion
From kernel to functions

  H_0 = { f | m_f < ∞; f_j ∈ IR; t_j ∈ X;  f(x) = ∑_{j=1}^{m_f} f_j k(x, t_j) }

define the bilinear form (with g(x) = ∑_{i=1}^{m_g} g_i k(x, s_i)):
  ∀ f, g ∈ H_0,   ⟨f, g⟩_{H_0} = ∑_{j=1}^{m_f} ∑_{i=1}^{m_g} f_j g_i k(t_j, s_i)

Evaluation functional:  ∀ x ∈ X,   f(x) = ⟨f(·), k(x, ·)⟩_{H_0}

from k to H: with any positive kernel, a hypothesis set H can be constructed, together with its metric
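The evaluation functional can be checked numerically; in this added sketch, f is a small expansion f(x) = ∑_j f_j k(x, t_j) with arbitrary coefficients, and its value at x coincides with the inner product ⟨f, k(x, ·)⟩_{H_0}.

```python
import numpy as np

rng = np.random.default_rng(8)
k = lambda s, t: np.exp(-(s - t) ** 2)          # a positive kernel on IR

# f(x) = sum_j f_j k(x, t_j): an element of H0
t, f_coef = rng.normal(size=4), rng.normal(size=4)

def f(x):
    return sum(fj * k(x, tj) for fj, tj in zip(f_coef, t))

# inner product <f, k(x, .)>_{H0} = sum_j f_j k(t_j, x)
x = 0.3
inner = sum(fj * k(tj, x) for fj, tj in zip(f_coef, t))
print(np.isclose(f(x), inner))   # True: evaluation = inner product with k(x, .)
```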
RKHS

Definition (reproducing kernel Hilbert space (RKHS))
a Hilbert space H endowed with the inner product ⟨·, ·⟩_H is said to have a reproducing kernel if there exists a positive kernel k such that
  ∀ s ∈ X, k(·, s) ∈ H   and   ∀ f ∈ H,  f(s) = ⟨f(·), k(s, ·)⟩_H

positive kernel ⇔ RKHS
- any function is pointwise defined
- k defines the inner product
- it defines the regularity (smoothness) of the hypothesis set
functional differentiation in RKHS

Let J be a functional:   J : H → IR,   f ↦ J(f)
examples:   J_1(f) = ‖f‖²,   J_2(f) = f(x)

directional derivative of J in direction g at point f:
  dJ(f, g) = lim_{ε→0} ( J(f + εg) − J(f) ) / ε

Gradient ∇_J(f):
  ∇_J : H → H,   f ↦ ∇_J(f)   such that   dJ(f, g) = ⟨∇_J(f), g⟩_H

exercise: find ∇_{J_1}(f) and ∇_{J_2}(f)

Hint:   dJ(f, g) = d/dε [ J(f + εg) ] |_{ε=0}
Solution

dJ_1(f, g) = lim_{ε→0} ( ‖f + εg‖² − ‖f‖² ) / ε
           = lim_{ε→0} ( ‖f‖² + ε²‖g‖² + 2ε⟨f, g⟩_H − ‖f‖² ) / ε
           = lim_{ε→0} ( ε‖g‖² + 2⟨f, g⟩_H )
           = ⟨2f, g⟩_H
  ⇔  ∇_{J_1}(f) = 2f

dJ_2(f, g) = lim_{ε→0} ( f(x) + εg(x) − f(x) ) / ε
           = g(x)
           = ⟨k(x, ·), g⟩_H
  ⇔  ∇_{J_2}(f) = k(x, ·)

Minimize_{f ∈ H} J(f)   ⇔   ∀ g ∈ H, dJ(f, g) = 0   ⇔   ∇_J(f) = 0
Remark about the regularity matter: from H to k

How to build H together with k?
let (Γ_t), t ∈ X, be a family in L_2 and define the following mapping:
  S_t : L_2 → IR,   g ↦ S_t(g) = ∫ g(x) Γ_t(x) dx,   so that   f(t) = S_t(g) = ⟨g, Γ_t⟩_{L_2}

the following RKHS can be constructed:
  H = Im(S),   ⟨f_1, f_2⟩_H = ⟨g_1, g_2⟩_{L_2},   k(s, t) = ⟨Γ_s, Γ_t⟩_{L_2}

the reproducing property is verified:
  ⟨f(·), k(·, t)⟩_H = ⟨g, Γ_t⟩_{L_2} = f(t)
Γ_x examples and associated kernels

                  Γ_x(u)                              K(x, y)
Cameron-Martin    1I_{x ≤ u}                          min(x, y)
Polynomial        e_0(u) + ∑_{i=1}^{d} x_i e_i(u)     x^T y + 1
Gaussian          (1/Z) exp(−(x − u)²/2)              (1/Z') exp(−(x − y)²/4)

{e_i}_{i=1,d} is a finite subsample of an orthonormal basis of L_2

the Cameron-Martin operator: ∀ f ∈ H, ∃ g ∈ L_2 such that
  f(t) = (Sg)(t) = ∫ Γ_t(u) g(u) du = ∫ 1I_{t ≤ u} g(u) du = G(t)      (S integrates)
  ‖f‖²_H = ‖Sg‖²_H = ‖g‖²_{L_2} = ‖f'‖²_{L_2}                          (S = P⁻¹, with P a generalized differentiation)
other kernels (what really matters)

- finite kernels:   k(s, t) = (φ_1(s), ..., φ_p(s))^T (φ_1(t), ..., φ_p(t))
- Mercer kernels:   positive on a compact set  ⇔  k(s, t) = ∑_{j=1}^{p} λ_j φ_j(s) φ_j(t)
- positive kernels: positive semi-definite
- conditionally positive (for some functions p_j):
    ∀ {x_i}_{i=1,n}, ∀ α_i with ∑_i α_i p_j(x_i) = 0, j = 1, p:   ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k(x_i, x_j) ≥ 0
- symmetric non positive:   k(s, t) = tanh(s^T t + α_0)
- non symmetric, non positive

the key property ∇_{J_t}(f) = k(t, ·) still holds

C. Ong et al., ICML, 2004
Let's summarize

- positive kernels ⇔ RKHS = H ⇔ regularity ‖f‖²_H
- the key property ∇_{J_t}(f) = k(t, ·) holds not only for positive kernels
- f(x_i) exists (pointwise defined functions)
- universal consistency in RKHS
- the Gram matrix summarizes the pairwise comparisons
Roadmap
1 Kernels and the learning problem
Three learning problems
Learning from data: the problem
Kernelizing the linear regression
Kernel machines: a definition
2 Tools: the functional framework
In the beginning was the kernel
Kernel and hypothesis set
Optimization, loss function and the regularization cost
3 Kernel machines
Non sparse kernel machines
Sparse kernel machines
practical SVM
4 Conclusion
Conclusion
Convex optimization in one slide

The problem: H is a Hilbert space (IR^n or a RKHS)

  (OP)   min_{f ∈ H}  J(f)                          (objective)
         such that    P_i(f) ≤ 0,  i = 1, p         (constraints)
         and          R_j(f) = 0,  j = 1, q
  ⇔   min_{f ∈ D ⊂ H}  J(f)

with J, P_i and R_j from H → IR

questions:
- modelization (transformations)
- optimization theory (existence, uniqueness, characterization of the solution)
- algorithms (no analytical solution) / implementations

The importance of being convex:
- convex OP: the objective J and the constraints P_i are convex, the R_j are affine
- why convex? It is the solvable class:
  ⇒ the solution exists, is unique and can be found efficiently

Boyd and Vandenberghe, 2004; Bonnans et al., 2006
examples of convex OP

Linear programming (LP, standard form): linear objective and linear constraints
  min_{x ∈ IR^n}  c^T x               min_{f ∈ H}  ⟨f, k⟩_H
  s.t.  Ax = b                        s.t.  Tf = y,   T : H → IR^n linear
  and   x ⪰ 0   (x_i ≥ 0, i = 1, n)

Quadratic programming (QP): quadratic objective and linear constraints
  min_{x ∈ IR^n}  ½ x^T C x − d^T x
  s.t.  Ax ⪯ b

Second order cone programming (SOCP): second order cone constraints
  min_{x ∈ IR^n}  c^T x
  s.t.  ‖A_i x − b_i‖_2 ≤ d_i^T x + e_i,  i = 1, p
  and   Fx = g
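As an added illustration of the LP standard form, here is a tiny instance solved with scipy; the data c, A, b are arbitrary and `linprog` with the `highs` method is just one possible solver choice.

```python
import numpy as np
from scipy.optimize import linprog

# min c^T x  s.t.  A x = b,  x >= 0   (LP standard form)
c = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

res = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3, method="highs")
print(res.x, res.fun)   # all mass on the cheapest coordinate: x = [0, 0, 1], cost 0
```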
Convex optimization in three slides

- Initial problem (primal):
    min_{f ∈ H}  J(f)                        (primal objective)
    s.t.  P_i(f) ≤ 0,  i = 1, p
    and   R_j(f) = 0,  j = 1, q

- Lagrangian:
    L(f, α, β) = J(f) + ∑_{i=1}^{p} α_i P_i(f) + ∑_{j=1}^{q} β_j R_j(f),   with α_i ≥ 0

- Lagrange dual function:   Q(α, β) = min_{f ∈ H} L(f, α, β)

- with linear constraints Tf = y and p = 0:
    Q(α) = −y^T α − J*(−T*α)
  where J* denotes the conjugate of J

- for any feasible f̂ (such that P_i(f̂) ≤ 0 and R_j(f̂) = 0):
    Q(α, β) = min_{f ∈ H} L(f, α, β) ≤ L(f̂, α, β) ≤ J(f̂)

- Lagrange dual problem:
    max_{α, β}  Q(α, β)
    s.t.  α ⪰ 0
Convex optimization in four slides

Optimality conditions (Karush-Kuhn-Tucker, KKT):

(f*, α*, β*) optimum  ⇔
  ∇_f J(f*) + ∑_{i=1}^{p} α_i* ∇_f P_i(f*) + ∑_{j=1}^{q} β_j* ∇_f R_j(f*) = 0     (stationarity)
  P_i(f*) ≤ 0,   R_j(f*) = 0                                                      (primal feasibility)
  α_i* ≥ 0                                                                         (dual feasibility)
  α_i* P_i(f*) = 0                                                                 (complementary slackness)

duality gap:   J(f*) − Q(α*, β*) = 0

Optimization summary
1. (try to) transform your problem into a convex one (standard form)
2. more variables or more constraints: is the dual simpler?
3. solve it: check the KKT conditions
loss function: the fitting cost

  ℓ : H × X × Y → IR+,   (f, x, y) ↦ ℓ(y, f(x)),       with (x)_+ = max(x, 0)

            Convex                                  Non convex
regular     L2 (gaussian):   (f(x) − y)²            Cauchy:          log(1 + (f(x) − y)²)
            logistic:        log(1 + exp(−y f(x)))  sigmoid:         1 − tanh(y f(x))
singular    hinge:           (1 − y f(x))_+         0/1:             |sign(f(x)) − y|
            L1 (laplacian):  |f(x) − y|             Lp square root:  |f(x) − y|^{1/2}
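An added sketch writing a few of these losses as plain functions of the label y and the prediction f(x); the 0/1 loss is coded as an indicator, equivalent up to a constant factor for y ∈ {−1, 1}.

```python
import numpy as np

# a few fitting costs from the table, as functions of y and f(x)
def hinge(y, fx):     return np.maximum(0.0, 1.0 - y * fx)        # singular, convex
def logistic(y, fx):  return np.log1p(np.exp(-y * fx))            # regular, convex
def squared(y, fx):   return (fx - y) ** 2                        # L2 (gaussian)
def absolute(y, fx):  return np.abs(fx - y)                       # L1 (laplacian)
def cauchy(y, fx):    return np.log1p((fx - y) ** 2)              # regular, non convex
def zero_one(y, fx):  return (np.sign(fx) != y).astype(float)     # 0/1, singular, non convex

y, fx = 1.0, np.linspace(-2.0, 2.0, 5)
for loss in (hinge, logistic, squared, absolute, cauchy, zero_one):
    print(loss.__name__, np.round(loss(y, fx), 2))
```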
(Figure: fidelity to the data — the loss functions plotted)
regularity criterion: regularization

Problem: H is too big...
let S_0 = { f ∈ H | ∑_{i=1}^{n} ℓ(f, x_i, y_i) = 0 }
the problem is ill posed: the solution is not unique.
we have to choose one! f_I: the minimal norm solution,   min_{f ∈ S_0} ‖f‖²

regularize: build a regularization path,
a sequence of problems whose solutions f_λ converge towards f_I

3 ways to do it:
1. penalization R(f):   min_{f ∈ H}  ( ∑_{i=1}^{n} ℓ(f, x_i, y_i),  R(f) )
2. subspaces H_1 ⊂ ... H_k ... ⊂ H:   min_{f ∈ H_k}  ∑_{i=1}^{n} ℓ(f, x_i, y_i)
3. iterative approach: gradient (Landweber-Friedman) or conjugate gradient (Krylov subspace)
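As an added illustration of route 1 (penalization), a minimal kernel ridge regression sketch: with f(x) = ∑_j α_j k(x, x_j), the squared loss plus λ‖f‖²_H reduces to the linear system (K + λI)α = y; the data, kernel and λ here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(-3, 3, size=40)
y = np.sin(x) + 0.1 * rng.normal(size=40)

def gauss_K(a, b, h=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / h)

K = gauss_K(x, x)
lam = 1e-2
# penalized problem  min_f  sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2
# with f = sum_j alpha_j k(., x_j), it reduces to the linear system below
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)

x_test = np.linspace(-3, 3, 5)
f_test = gauss_K(x_test, x) @ alpha
print(np.round(f_test, 2), np.round(np.sin(x_test), 2))   # close to sin(x)
```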
penalization choice

              regular                                   singular
convex        L2:  ‖f‖²₂                                L1:  ‖f‖₁
non convex    normalized-SCAD:  ∑_j α_j² / (1 + α_j²)    Lp, p < 1:  ‖f‖_p^p

with f(x) = ∑_{i=1}^{n} α_i k(x, x_i):
  ‖f‖²_H = α^T K α,    ‖f‖₁ = ∑_{i=1}^{n} |α_i|,    ‖f‖_p^p = ∑_{i=1}^{n} |α_i|^p

no longer convex when p < 1

J. Weston et al., JMLR 2003; A. Ng, ICML 2004
an attempt at classifying some kernel learning algorithms

ℓ           h(H)           Regression             Classification
singular    singular L1    K Dantzig Selector     LP SVM
regular     singular L1    K LASSO, K LARS        K reg. log. L1
singular    regular L2     SVR                    SVM
regular     regular L2     Splines                K-logistic reg., Lagrangian SVM

Table: SVM and SVR stand for support vector machine and support vector regression; LP: linear programming; LARS: least angle regression stagewise; reg. log.: logistic regression. "K" denotes the kernelized version of the linear algorithm.
things are changing: why ℓ1?

"The Gaussian Hare and the Laplacian Tortoise: Computability of ℓ1 vs. ℓ2 Regression Estimators", Portnoy & Koenker, 1997
ℓ1 gives sparsity: even faster!

Definition (strongly homogeneous set of variables)
  I_0 = { i ∈ {1, ..., d} | β_i = 0 }

Theorem
- Regular: if S(β) + λT(β) is differentiable at 0, with I_0(y) ≠ ∅,
    ∀ ε > 0, ∃ y_0 ∈ B(y, ε) such that I_0(y_0) ≠ I_0(y)
- Singular: if S(β) + λT(β) is NOT differentiable at 0, with I_0(y) ≠ ∅,
    ∃ ε > 0, ∀ y_0 ∈ B(y, ε), I_0(y_0) = I_0(y)

a criterion that is non smooth at zero ⟹ sparsity

Nikolova, 2000
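An added one-dimensional illustration of why non-smoothness at zero gives sparsity: the proximal step of the ℓ1 penalty (soft thresholding) returns exact zeros, while the ℓ2 penalty only shrinks; the grid and λ are arbitrary.

```python
import numpy as np

# 1-d illustration: b* = argmin_b (b - z)^2 / 2 + lam * pen(b)
z = np.linspace(-2.0, 2.0, 9)
lam = 0.7

ridge = z / (1.0 + 2.0 * lam)                          # pen(b) = b^2: shrinks, never exactly 0
lasso = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)  # pen(b) = |b|: soft thresholding, exact zeros

print(np.round(ridge, 2))
print(np.round(lasso, 2))   # entries with |z| <= lam are set exactly to 0
```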
let's summarize

- hypothesis set: k and H
- data fidelity (loss): ℓ convex
- regularity: learning as a multicriterion optimization
- ℓ1 = sparsity + convexity = some efficiency