Kernel machines - PART II

Transcription

September, 2008
MLSS’08
Stéphane Canu
[email protected]
Roadmap
1 Kernels and the learning problem
Three learning problems
Learning from data: the problem
Kernelizing the linear regression
Kernel machines: a definition
2 Tools: the functional framework
In the beginning was the kernel
Kernel and hypothesis set
Optimization, loss function and the regularization cost
3 Kernel machines
Non sparse kernel machines
Sparse kernel machines
practical SVM
4 Conclusion
Conclusion
In the beginning was the kernel...

Definition (Kernel)
a function of two variables k from X × X to IR

Definition (Positive kernel)
A kernel k(s, t) on X is said to be positive
- if it is symmetric: k(s, t) = k(t, s)
- and if for any finite positive integer n:
  ∀ {α_i}_{i=1,n} ∈ IR, ∀ {x_i}_{i=1,n} ∈ X,   ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k(x_i, x_j) ≥ 0

It is strictly positive if, for α_i ≠ 0,
  ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k(x_i, x_j) > 0
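As a quick numerical illustration (not in the original slides), the following numpy sketch evaluates the double sum ∑_i ∑_j α_i α_j k(x_i, x_j) = αᵀKα for a Gaussian kernel; the kernel, data and coefficients are arbitrary choices made for the example.

```python
import numpy as np

def gaussian_kernel(s, t, b=1.0):
    """Gaussian kernel exp(-||s - t||^2 / b)."""
    return np.exp(-np.sum((s - t) ** 2) / b)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))     # n = 20 points of IR^3
alpha = rng.normal(size=20)      # arbitrary coefficients

# Gram matrix K_ij = k(x_i, x_j) and the quadratic form of the definition
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
print(alpha @ K @ alpha >= -1e-12)   # True: the double sum is non negative
```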
Examples of positive kernels

the linear kernel: s, t ∈ IR^d,   k(s, t) = s^T t
- symmetric: s^T t = t^T s
- positive:
  ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k(x_i, x_j) = ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j x_i^T x_j
                                              = ( ∑_{i=1}^{n} α_i x_i )^T ( ∑_{j=1}^{n} α_j x_j )
                                              = ‖ ∑_{i=1}^{n} α_i x_i ‖^2  ≥ 0

the product kernel: k(s, t) = g(s) g(t) for some g : IR^d → IR
- symmetric by construction
- positive:
  ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k(x_i, x_j) = ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j g(x_i) g(x_j)
                                              = ( ∑_{i=1}^{n} α_i g(x_i) ) ( ∑_{j=1}^{n} α_j g(x_j) )
                                              = ( ∑_{i=1}^{n} α_i g(x_i) )^2  ≥ 0

k is positive ⇔ (its square root exists) ⇔ k(s, t) = ⟨φ_s, φ_t⟩

J.P. Vert, 2006
Example: finite kernel

let φ_j, j = 1, ..., p be a finite dictionary of functions from X to IR (polynomials, wavelets...)

feature map:   Φ : X → IR^p,   s ↦ Φ(s) = (φ_1(s), ..., φ_p(s))

linear kernel in the feature space:
  k(s, t) = (φ_1(s), ..., φ_p(s))^T (φ_1(t), ..., φ_p(t))

e.g. the quadratic kernel: s, t ∈ IR^d,   k(s, t) = (s^T t + b)^2

feature map:   Φ : IR^d → IR^p,   p = 1 + d + d(d+1)/2
  s ↦ Φ(s) = (1, √2 s_1, ..., √2 s_j, ..., √2 s_d, s_1^2, ..., s_j^2, ..., s_d^2, ..., √2 s_i s_j, ...)

p multiplications vs. d + 1: use the kernel to save computation
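A minimal sketch (added, not part of the slides) checking that the explicit quadratic feature map and the kernel evaluation (sᵀt + 1)² agree, here for b = 1 and arbitrary random vectors; the kernel side only needs d + 1 multiplications.

```python
import numpy as np
from itertools import combinations

def quad_feature_map(s):
    """Explicit feature map of the quadratic kernel (s^T t + 1)^2 (b = 1)."""
    cross = [np.sqrt(2) * s[i] * s[j] for i, j in combinations(range(len(s)), 2)]
    return np.concatenate(([1.0], np.sqrt(2) * s, s ** 2, cross))

rng = np.random.default_rng(1)
s, t = rng.normal(size=4), rng.normal(size=4)

k_explicit = quad_feature_map(s) @ quad_feature_map(t)   # p = 1 + d + d(d+1)/2 terms
k_trick = (s @ t + 1.0) ** 2                              # d + 1 multiplications
print(np.isclose(k_explicit, k_trick))                    # True
```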
positive definite kernel (PDK) algebra (closure)

if k_1(s, t) and k_2(s, t) are two positive kernels
- PDK form a convex cone: ∀ a_1 ∈ IR+,   a_1 k_1(s, t) + k_2(s, t)
- for any measurable function ψ from X to IR:   k(s, t) = ψ(s) ψ(t)
- product kernel:   k_1(s, t) k_2(s, t)

proofs
- by linearity:
  ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j ( a_1 k_1(i, j) + k_2(i, j) ) = a_1 ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k_1(i, j) + ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k_2(i, j)
- as a square:
  ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j ψ(x_i) ψ(x_j) = ( ∑_{i=1}^{n} α_i ψ(x_i) ) ( ∑_{j=1}^{n} α_j ψ(x_j) )
- assuming ∃ ψ_ℓ such that k_1(s, t) = ∑_ℓ ψ_ℓ(s) ψ_ℓ(t):
  ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k_1(x_i, x_j) k_2(x_i, x_j) = ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j ∑_ℓ ψ_ℓ(x_i) ψ_ℓ(x_j) k_2(x_i, x_j)
                                                              = ∑_ℓ ∑_{i=1}^{n} ∑_{j=1}^{n} ( α_i ψ_ℓ(x_i) ) ( α_j ψ_ℓ(x_j) ) k_2(x_i, x_j)  ≥ 0  since k_2 is positive

N. Cristianini and J. Shawe-Taylor, Kernel Methods for Pattern Analysis, 2004
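These closure properties can be checked numerically on Gram matrices; the sketch below is an added illustration, with arbitrary data and two standard kernels, verifying that the sum and the elementwise (Schur) product of two positive Gram matrices stay positive.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))

# two positive kernels: linear and Gaussian
K1 = X @ X.T
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = np.exp(-sq)

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

# closure properties: sum and elementwise product stay positive
print(min_eig(2.0 * K1 + K2) >= -1e-8)   # convex cone
print(min_eig(K1 * K2) >= -1e-8)         # product kernel (Schur product)
```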
Kernel engineering: building PDK

- for any polynomial φ with positive coefficients, from IR to IR:   φ( k(s, t) )
- if Ψ is a function from IR^d to IR^d:   k( Ψ(s), Ψ(t) )
- if ϕ, from IR^d to IR+, has its minimum in 0:   k(s, t) = ϕ(s + t) − ϕ(s − t)
- the convolution of two positive kernels is a positive kernel:   K_1 ⋆ K_2

the Gaussian kernel is a PDK
  exp(−‖s − t‖^2) = exp(−‖s‖^2 − ‖t‖^2 + 2 s^T t)
                  = exp(−‖s‖^2) exp(−‖t‖^2) exp(2 s^T t)

- s^T t is a PDK, and exp is the limit of a series expansion with positive coefficients, so exp(2 s^T t) is a PDK
- exp(−‖s‖^2) exp(−‖t‖^2) is a PDK as a product kernel
- the product of two PDKs is a PDK

O. Catoni, master lecture, 2005
an attempt at classifying PD kernels

- stationary kernels (also called translation invariant):   k(s, t) = k_s(s − t)
  - radial (isotropic):  Gaussian: exp(−r²/b),  r = ‖s − t‖
  - with compact support:  c.s. Matèrn: max(0, 1 − r/b)^κ (r/b)^k B_k(r/b),  κ ≥ (d + 1)/2
- locally stationary kernels:   k(s, t) = k_1(s + t) k_s(s − t)
  where k_1 is a non negative function and k_s a radial kernel
- non stationary (projective) kernels:   k(s, t) = k_p(s^T t)
- separable kernels:   k(s, t) = k_1(s) k_2(t) with k_1 and k_2 PDK
  in this case K = k_1 k_2^T, where k_1 = (k_1(x_1), ..., k_1(x_n))^T

M.G. Genton, Classes of Kernels for Machine Learning: A Statistics Perspective, JMLR, 2002
Kernel spectral representation

- stationary kernels (Bochner's theorem):
    k_s(s − t) = ∫_{IR^d} cos( ω^T (s − t) ) F(dω)
  with F a positive finite measure
- radial kernels:
    k_r(‖s − t‖) = ∫_0^∞ Ψ_d( ω ‖s − t‖ ) F(dω)
  where F is non decreasing and bounded, and Ψ_d a specific function
- non stationary (projective) kernels:
    k_p(s, t) = ∫_{IR^d} ∫_{IR^d} cos( ω_1^T s − ω_2^T t ) F(dω_1, dω_2)
  with F a positive bounded symmetric measure

Fourier transform may help to design PD kernels
some examples of PD kernels...

type         name           k(s, t)
radial       gaussian       exp(−r²/b),   r = ‖s − t‖
radial       laplacian      exp(−r/b)
radial       rational       1 − r²/(r² + b)
radial       loc. gauss.    max(0, 1 − r/(3b))^d exp(−r²/b)
non stat.    χ²             exp(−r/b),   r = ∑_k (s_k − t_k)² / (s_k + t_k)
projective   polynomial     (s^T t)^p
projective   affine         (s^T t + b)^p
projective   cosine         s^T t / (‖s‖ ‖t‖)
projective   correlation    exp( s^T t / (‖s‖ ‖t‖) − b )

Most of these kernels depend on a quantity b called the bandwidth.
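For reference, here is one possible numpy rendering (added, not from the slides) of a few kernels from the table, vectorized over pairs of points; the function names and bandwidth defaults are illustrative.

```python
import numpy as np

def gaussian(S, T, b=1.0):
    """Radial Gaussian kernel exp(-||s - t||^2 / b) for all pairs of rows."""
    r2 = np.sum((S[:, None, :] - T[None, :, :]) ** 2, axis=-1)
    return np.exp(-r2 / b)

def laplacian(S, T, b=1.0):
    r = np.sqrt(np.sum((S[:, None, :] - T[None, :, :]) ** 2, axis=-1))
    return np.exp(-r / b)

def affine(S, T, b=1.0, p=2):
    """Projective kernel (s^T t + b)^p."""
    return (S @ T.T + b) ** p

def cosine(S, T):
    G = S @ T.T
    return G / (np.linalg.norm(S, axis=1)[:, None] * np.linalg.norm(T, axis=1)[None, :])

X = np.random.default_rng(3).normal(size=(5, 2))
for k in (gaussian, laplacian, affine, cosine):
    print(k.__name__, k(X, X).shape)   # each returns a 5 x 5 Gram matrix
```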
the importance of the kernel bandwidth

for the affine kernel: bandwidth = bias
  k(s, t) = (s^T t + b)^p = b^p ( s^T t / b + 1 )^p

for the gaussian kernel: bandwidth = influence zone
  k(s, t) = (1/Z) exp( −‖s − t‖² / (2σ²) ),   with b = 2σ²

Illustration: 1-d density estimation
+ data (x_1, x_2, ..., x_n)
– Parzen estimate:   P̂(x) = (1/Z) ∑_{i=1}^{n} k(x, x_i)
(shown for b = 1/2 and b = 2)
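A minimal sketch of the Parzen estimate above (added illustration; the bimodal sample and the bandwidth values are arbitrary), showing that the estimate integrates to about 1 for any b while its smoothness changes with b.

```python
import numpy as np

def parzen(x_grid, samples, b):
    """1-d Parzen density estimate with the Gaussian kernel exp(-(x - xi)^2 / b)."""
    K = np.exp(-(x_grid[:, None] - samples[None, :]) ** 2 / b)
    Z = len(samples) * np.sqrt(np.pi * b)   # integral of exp(-u^2 / b) is sqrt(pi * b)
    return K.sum(axis=1) / Z

rng = np.random.default_rng(4)
samples = np.concatenate([rng.normal(-2, 0.5, 50), rng.normal(1, 0.8, 50)])
x = np.linspace(-5, 4, 400)

for b in (0.5, 2.0):    # small b: spiky estimate; large b: oversmoothed
    p = parzen(x, samples, b)
    print(b, (p * (x[1] - x[0])).sum())   # both integrate to about 1
```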
kernels for objects and structures

kernels on histograms and probability distributions
  k( p(x), q(x) ) = ∫ k_i( p(x), q(x) ) IP(x) dx

kernels on strings
- spectral string kernel, using sub-sequences:   k(s, t) = ∑_u φ_u(s) φ_u(t)
- similarities by alignments:   k(s, t) = ∑_π exp( β(s, t, π) )

kernels on graphs
- the pseudo inverse of the (regularized) graph Laplacian
    L = D − A     (A is the adjacency matrix, D the degree matrix)
- diffusion kernels:   (1/Z(b)) exp(bL)
- subgraph kernel convolution (using random walks)

and kernels on heterogeneous data (images), HMM, automata...

Shawe-Taylor & Cristianini's book, 2004; J.P. Vert, 2006
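As an added illustration of the graph case, the sketch below builds the Laplacian L = D − A of a small path graph and a diffusion (heat) kernel; it uses the common exp(−bL) convention (sign conventions vary in the literature), and the toy graph is arbitrary.

```python
import numpy as np
from scipy.linalg import expm

# a small toy graph: 5 nodes on a path, adjacency and degree matrices
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1
D = np.diag(A.sum(axis=1))
L = D - A                      # graph Laplacian

b = 0.5
K = expm(-b * L)               # diffusion (heat) kernel on the graph
print(np.allclose(K, K.T))                 # symmetric
print(np.linalg.eigvalsh(K).min() > 0)     # positive definite
```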
Gram matrix

Definition (Gram matrix)
let k(s, t) be a positive kernel on X and (x_i)_{i=1,n} a sequence on X.
The Gram matrix is the square matrix K of dimension n with general term K_ij = k(x_i, x_j).

practical trick to check kernel positivity:
K is positive ⇔ its eigenvalues λ_i are positive: if K u_i = λ_i u_i, i = 1, n, then
  u_i^T K u_i = λ_i u_i^T u_i = λ_i ≥ 0

in practice, the matrix K is the one to be used
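The eigenvalue trick translates directly into code; this is an added sketch where `is_positive_kernel` is a hypothetical helper and the sample is arbitrary.

```python
import numpy as np

def is_positive_kernel(k, X, tol=1e-10):
    """Check positivity of k on a sample via the eigenvalues of its Gram matrix."""
    K = np.array([[k(s, t) for t in X] for s in X])
    return np.linalg.eigvalsh(K).min() >= -tol

X = list(np.random.default_rng(6).normal(size=(15, 2)))
print(is_positive_kernel(lambda s, t: np.exp(-np.sum((s - t) ** 2)), X))  # True: Gaussian kernel
print(is_positive_kernel(lambda s, t: -np.sum((s - t) ** 2), X))          # False: only conditionally positive
```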
Examples of Gram matrices with different bandwidths
(Figure: raw data and the Gram matrices obtained for b = .5, b = 2 and b = 10)
different points of view about kernels

kernel and scalar product:   k(s, t) = ⟨φ(s), φ(t)⟩_H

kernel and distance:   d(s, t)² = k(s, s) + k(t, t) − 2 k(s, t)

kernel and covariance: a positive matrix is a covariance matrix
  IP(f) = (1/Z) exp( −½ (f − f_0)^T K^{-1} (f − f_0) )
  if f_0 = 0 and f = Kα:   IP(α) = (1/Z) exp( −½ α^T K α )

kernel and regularity (Green's function):   k(s, t) = P* P δ_{s−t}   for some operator P (e.g. some differential operator)
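An added sketch of the distance and covariance points of view: the kernel-induced squared distance is non negative, and sampling from N(0, K) draws smooth random functions (a small jitter is added to K for numerical stability); the grid and bandwidth are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 1, 50)

# Gaussian kernel Gram matrix, seen as a covariance matrix
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.05)

# kernel-induced distance d(s,t)^2 = k(s,s) + k(t,t) - 2 k(s,t)
d2 = K.diagonal()[:, None] + K.diagonal()[None, :] - 2 * K
print(d2.min() >= -1e-12)          # squared distances are non negative

# sample f ~ N(0, K): smooth random functions (covariance point of view)
f = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
print(f.shape)                      # (3, 50): three sampled functions on the grid
```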
Let's summarize

- positive kernels
- there are a lot of them
- they can be rather complex
- 2 classes: radial / projective
- the bandwidth matters (more than the kernel itself)
- the Gram matrix summarizes the pairwise comparisons
Roadmap
1 Kernels and the learning problem
Three learning problems
Learning from data: the problem
Kernelizing the linear regression
Kernel machines: a definition
2 Tools: the functional framework
In the beginning was the kernel
Kernel and hypothesis set
Optimization, loss function and the regularization cost
3 Kernel machines
Non sparse kernel machines
Sparse kernel machines
practical SVM
4 Conclusion
Conclusion
From kernel to functions

  H_0 = { f | m_f < ∞; f_j ∈ IR; t_j ∈ X;  f(x) = ∑_{j=1}^{m_f} f_j k(x, t_j) }

define the bilinear form (with g(x) = ∑_{i=1}^{m_g} g_i k(x, s_i)):
  ∀ f, g ∈ H_0,   ⟨f, g⟩_{H_0} = ∑_{j=1}^{m_f} ∑_{i=1}^{m_g} f_j g_i k(t_j, s_i)

Evaluation functional:  ∀ x ∈ X,   f(x) = ⟨f(·), k(x, ·)⟩_{H_0}

from k to H: with any positive kernel, a hypothesis set H can be constructed, together with its metric
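The evaluation functional can be checked numerically; in this added sketch, f is a small expansion f(x) = ∑_j f_j k(x, t_j) with arbitrary coefficients, and its value at x coincides with the inner product ⟨f, k(x, ·)⟩_{H_0}.

```python
import numpy as np

rng = np.random.default_rng(8)
k = lambda s, t: np.exp(-(s - t) ** 2)          # a positive kernel on IR

# f(x) = sum_j f_j k(x, t_j): an element of H0
t, f_coef = rng.normal(size=4), rng.normal(size=4)

def f(x):
    return sum(fj * k(x, tj) for fj, tj in zip(f_coef, t))

# inner product <f, k(x, .)>_{H0} = sum_j f_j k(t_j, x)
x = 0.3
inner = sum(fj * k(tj, x) for fj, tj in zip(f_coef, t))
print(np.isclose(f(x), inner))   # True: evaluation = inner product with k(x, .)
```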
RKHS

Definition (reproducing kernel Hilbert space (RKHS))
a Hilbert space H endowed with the inner product ⟨·, ·⟩_H is said to have a reproducing kernel if there exists a positive kernel k such that
  ∀ s ∈ X, k(·, s) ∈ H   and   ∀ f ∈ H,  f(s) = ⟨f(·), k(s, ·)⟩_H

positive kernel ⇔ RKHS
- any function is pointwise defined
- k defines the inner product
- it defines the regularity (smoothness) of the hypothesis set
functional differentiation in RKHS

Let J be a functional:   J : H → IR,   f ↦ J(f)
examples:   J_1(f) = ‖f‖²,   J_2(f) = f(x)

directional derivative of J in direction g at point f:
  dJ(f, g) = lim_{ε→0} ( J(f + εg) − J(f) ) / ε

Gradient ∇_J(f):
  ∇_J : H → H,   f ↦ ∇_J(f)   such that   dJ(f, g) = ⟨∇_J(f), g⟩_H

exercise: find ∇_{J_1}(f) and ∇_{J_2}(f)

Hint:   dJ(f, g) = d/dε [ J(f + εg) ] |_{ε=0}
Solution

dJ_1(f, g) = lim_{ε→0} ( ‖f + εg‖² − ‖f‖² ) / ε
           = lim_{ε→0} ( ‖f‖² + ε²‖g‖² + 2ε⟨f, g⟩_H − ‖f‖² ) / ε
           = lim_{ε→0} ( ε‖g‖² + 2⟨f, g⟩_H )
           = ⟨2f, g⟩_H
  ⇔  ∇_{J_1}(f) = 2f

dJ_2(f, g) = lim_{ε→0} ( f(x) + εg(x) − f(x) ) / ε
           = g(x)
           = ⟨k(x, ·), g⟩_H
  ⇔  ∇_{J_2}(f) = k(x, ·)

Minimize_{f ∈ H} J(f)   ⇔   ∀ g ∈ H, dJ(f, g) = 0   ⇔   ∇_J(f) = 0
Remark about the regularity matter: from H to k

How to build H together with k?
let (Γ_t), t ∈ X, be a family in L_2 and define the following mapping:
  S_t : L_2 → IR,   g ↦ S_t(g) = ∫ g(x) Γ_t(x) dx,   so that   f(t) = S_t(g) = ⟨g, Γ_t⟩_{L_2}

the following RKHS can be constructed:
  H = Im(S),   ⟨f_1, f_2⟩_H = ⟨g_1, g_2⟩_{L_2},   k(s, t) = ⟨Γ_s, Γ_t⟩_{L_2}

the reproducing property is verified:
  ⟨f(·), k(·, t)⟩_H = ⟨g, Γ_t⟩_{L_2} = f(t)
Γ_x examples and associated kernels

                  Γ_x(u)                              K(x, y)
Cameron-Martin    1I_{x ≤ u}                          min(x, y)
Polynomial        e_0(u) + ∑_{i=1}^{d} x_i e_i(u)     x^T y + 1
Gaussian          (1/Z) exp(−(x − u)²/2)              (1/Z') exp(−(x − y)²/4)

{e_i}_{i=1,d} is a finite subsample of an orthonormal basis of L_2

the Cameron-Martin operator: ∀ f ∈ H, ∃ g ∈ L_2 such that
  f(t) = (Sg)(t) = ∫ Γ_t(u) g(u) du = ∫ 1I_{t ≤ u} g(u) du = G(t)      (S integrates)
  ‖f‖²_H = ‖Sg‖²_H = ‖g‖²_{L_2} = ‖f'‖²_{L_2}                          (S = P⁻¹, with P a generalized differentiation)
other kernels (what really matters)

- finite kernels:   k(s, t) = (φ_1(s), ..., φ_p(s))^T (φ_1(t), ..., φ_p(t))
- Mercer kernels:   positive on a compact set  ⇔  k(s, t) = ∑_{j=1}^{p} λ_j φ_j(s) φ_j(t)
- positive kernels: positive semi-definite
- conditionally positive (for some functions p_j):
    ∀ {x_i}_{i=1,n}, ∀ α_i with ∑_i α_i p_j(x_i) = 0, j = 1, p:   ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k(x_i, x_j) ≥ 0
- symmetric non positive:   k(s, t) = tanh(s^T t + α_0)
- non symmetric, non positive

the key property ∇_{J_t}(f) = k(t, ·) still holds

C. Ong et al., ICML, 2004
Let's summarize

- positive kernels ⇔ RKHS = H ⇔ regularity ‖f‖²_H
- the key property ∇_{J_t}(f) = k(t, ·) holds not only for positive kernels
- f(x_i) exists (pointwise defined functions)
- universal consistency in RKHS
- the Gram matrix summarizes the pairwise comparisons
Roadmap
1 Kernels and the learning problem
Three learning problems
Learning from data: the problem
Kernelizing the linear regression
Kernel machines: a definition
2 Tools: the functional framework
In the beginning was the kernel
Kernel and hypothesis set
Optimization, loss function and the regularization cost
3 Kernel machines
Non sparse kernel machines
Sparse kernel machines
practical SVM
4 Conclusion
Conclusion
Convex optimization in one slide

The problem: H is a Hilbert space (IR^n or a RKHS)

  (OP)   min_{f ∈ H}  J(f)                          (objective)
         such that    P_i(f) ≤ 0,  i = 1, p         (constraints)
         and          R_j(f) = 0,  j = 1, q
  ⇔   min_{f ∈ D ⊂ H}  J(f)

with J, P_i and R_j from H → IR

questions:
- modelization (transformations)
- optimization theory (existence, uniqueness, characterization of the solution)
- algorithms (no analytical solution) / implementations

The importance of being convex:
- convex OP: the objective J and the constraints P_i are convex, the R_j are affine
- why convex? It is the solvable class:
  ⇒ the solution exists, is unique and can be found efficiently

Boyd and Vandenberghe, 2004; Bonnans et al., 2006
examples of convex OP

Linear programming (LP, standard form): linear objective and linear constraints
  min_{x ∈ IR^n}  c^T x               min_{f ∈ H}  ⟨f, k⟩_H
  s.t.  Ax = b                        s.t.  Tf = y,   T : H → IR^n linear
  and   x ⪰ 0   (x_i ≥ 0, i = 1, n)

Quadratic programming (QP): quadratic objective and linear constraints
  min_{x ∈ IR^n}  ½ x^T C x − d^T x
  s.t.  Ax ⪯ b

Second order cone programming (SOCP): second order cone constraints
  min_{x ∈ IR^n}  c^T x
  s.t.  ‖A_i x − b_i‖_2 ≤ d_i^T x + e_i,  i = 1, p
  and   Fx = g
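As an added illustration of the LP standard form, here is a tiny instance solved with scipy; the data c, A, b are arbitrary and `linprog` with the `highs` method is just one possible solver choice.

```python
import numpy as np
from scipy.optimize import linprog

# min c^T x  s.t.  A x = b,  x >= 0   (LP standard form)
c = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

res = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3, method="highs")
print(res.x, res.fun)   # all mass on the cheapest coordinate: x = [0, 0, 1], cost 0
```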
Convex optimization in three slides

- Initial problem (primal):
    min_{f ∈ H}  J(f)                        (primal objective)
    s.t.  P_i(f) ≤ 0,  i = 1, p
    and   R_j(f) = 0,  j = 1, q

- Lagrangian:
    L(f, α, β) = J(f) + ∑_{i=1}^{p} α_i P_i(f) + ∑_{j=1}^{q} β_j R_j(f),   with α_i ≥ 0

- Lagrange dual function:   Q(α, β) = min_{f ∈ H} L(f, α, β)

- with linear constraints Tf = y and p = 0:
    Q(α) = −y^T α − J*(−T*α)
  where J* denotes the conjugate of J

- for any feasible f̂ (such that P_i(f̂) ≤ 0 and R_j(f̂) = 0):
    Q(α, β) = min_{f ∈ H} L(f, α, β) ≤ L(f̂, α, β) ≤ J(f̂)

- Lagrange dual problem:
    max_{α, β}  Q(α, β)
    s.t.  α ⪰ 0
Convex optimization in four slides

Optimality conditions (Karush-Kuhn-Tucker, KKT):

(f*, α*, β*) optimum  ⇔
  ∇_f J(f*) + ∑_{i=1}^{p} α_i* ∇_f P_i(f*) + ∑_{j=1}^{q} β_j* ∇_f R_j(f*) = 0     (stationarity)
  P_i(f*) ≤ 0,   R_j(f*) = 0                                                      (primal feasibility)
  α_i* ≥ 0                                                                         (dual feasibility)
  α_i* P_i(f*) = 0                                                                 (complementary slackness)

duality gap:   J(f*) − Q(α*, β*) = 0

Optimization summary
1. (try to) transform your problem into a convex one (standard form)
2. more variables or more constraints: is the dual simpler?
3. solve it: check the KKT conditions
loss function: the fitting cost

  ℓ : H × X × Y → IR+,   (f, x, y) ↦ ℓ(y, f(x)),       with (x)_+ = max(x, 0)

            Convex                                  Non convex
regular     L2 (gaussian):   (f(x) − y)²            Cauchy:          log(1 + (f(x) − y)²)
            logistic:        log(1 + exp(−y f(x)))  sigmoid:         1 − tanh(y f(x))
singular    hinge:           (1 − y f(x))_+         0/1:             |sign(f(x)) − y|
            L1 (laplacian):  |f(x) − y|             Lp square root:  |f(x) − y|^{1/2}
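An added sketch writing a few of these losses as plain functions of the label y and the prediction f(x); the 0/1 loss is coded as an indicator, equivalent up to a constant factor for y ∈ {−1, 1}.

```python
import numpy as np

# a few fitting costs from the table, as functions of y and f(x)
def hinge(y, fx):     return np.maximum(0.0, 1.0 - y * fx)        # singular, convex
def logistic(y, fx):  return np.log1p(np.exp(-y * fx))            # regular, convex
def squared(y, fx):   return (fx - y) ** 2                        # L2 (gaussian)
def absolute(y, fx):  return np.abs(fx - y)                       # L1 (laplacian)
def cauchy(y, fx):    return np.log1p((fx - y) ** 2)              # regular, non convex
def zero_one(y, fx):  return (np.sign(fx) != y).astype(float)     # 0/1, singular, non convex

y, fx = 1.0, np.linspace(-2.0, 2.0, 5)
for loss in (hinge, logistic, squared, absolute, cauchy, zero_one):
    print(loss.__name__, np.round(loss(y, fx), 2))
```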
(Figure: fidelity to the data — the loss functions plotted)
regularity criterion: regularization

Problem: H is too big...
let S_0 = { f ∈ H | ∑_{i=1}^{n} ℓ(f, x_i, y_i) = 0 }
the problem is ill posed: the solution is not unique.
we have to choose one! f_I: the minimal norm solution,   min_{f ∈ S_0} ‖f‖²

regularize: build a regularization path,
a sequence of problems whose solutions f_λ converge towards f_I

3 ways to do it:
1. penalization R(f):   min_{f ∈ H}  ( ∑_{i=1}^{n} ℓ(f, x_i, y_i),  R(f) )
2. subspaces H_1 ⊂ ... H_k ... ⊂ H:   min_{f ∈ H_k}  ∑_{i=1}^{n} ℓ(f, x_i, y_i)
3. iterative approach: gradient (Landweber-Friedman) or conjugate gradient (Krylov subspace)
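As an added illustration of route 1 (penalization), a minimal kernel ridge regression sketch: with f(x) = ∑_j α_j k(x, x_j), the squared loss plus λ‖f‖²_H reduces to the linear system (K + λI)α = y; the data, kernel and λ here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(-3, 3, size=40)
y = np.sin(x) + 0.1 * rng.normal(size=40)

def gauss_K(a, b, h=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / h)

K = gauss_K(x, x)
lam = 1e-2
# penalized problem  min_f  sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2
# with f = sum_j alpha_j k(., x_j), it reduces to the linear system below
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)

x_test = np.linspace(-3, 3, 5)
f_test = gauss_K(x_test, x) @ alpha
print(np.round(f_test, 2), np.round(np.sin(x_test), 2))   # close to sin(x)
```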
penalization choice

              regular                                   singular
convex        L2:  ‖f‖²₂                                L1:  ‖f‖₁
non convex    normalized-SCAD:  ∑_j α_j² / (1 + α_j²)    Lp, p < 1:  ‖f‖_p^p

with f(x) = ∑_{i=1}^{n} α_i k(x, x_i):
  ‖f‖²_H = α^T K α,    ‖f‖₁ = ∑_{i=1}^{n} |α_i|,    ‖f‖_p^p = ∑_{i=1}^{n} |α_i|^p

no longer convex when p < 1

J. Weston et al., JMLR 2003; A. Ng, ICML 2004
an attempt at classifying some kernel learning algorithms

ℓ           h(H)           Regression             Classification
singular    singular L1    K Dantzig Selector     LP SVM
regular     singular L1    K LASSO, K LARS        K reg. log. L1
singular    regular L2     SVR                    SVM
regular     regular L2     Splines                K-logistic reg., Lagrangian SVM

Table: SVM and SVR stand for support vector machine and support vector regression; LP: linear programming; LARS: least angle regression stagewise; reg. log.: logistic regression. "K" denotes the kernelized version of the linear algorithm.
things are changing: why ℓ1?

"The Gaussian Hare and the Laplacian Tortoise: Computability of ℓ1 vs. ℓ2 Regression Estimators", Portnoy & Koenker, 1997
ℓ1 gives sparsity: even faster!

Definition (strongly homogeneous set of variables)
  I_0 = { i ∈ {1, ..., d} | β_i = 0 }

Theorem
- Regular: if S(β) + λT(β) is differentiable at 0, with I_0(y) ≠ ∅,
    ∀ ε > 0, ∃ y_0 ∈ B(y, ε) such that I_0(y_0) ≠ I_0(y)
- Singular: if S(β) + λT(β) is NOT differentiable at 0, with I_0(y) ≠ ∅,
    ∃ ε > 0, ∀ y_0 ∈ B(y, ε), I_0(y_0) = I_0(y)

a criterion that is non smooth at zero ⟹ sparsity

Nikolova, 2000
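An added one-dimensional illustration of why non-smoothness at zero gives sparsity: the proximal step of the ℓ1 penalty (soft thresholding) returns exact zeros, while the ℓ2 penalty only shrinks; the grid and λ are arbitrary.

```python
import numpy as np

# 1-d illustration: b* = argmin_b (b - z)^2 / 2 + lam * pen(b)
z = np.linspace(-2.0, 2.0, 9)
lam = 0.7

ridge = z / (1.0 + 2.0 * lam)                          # pen(b) = b^2: shrinks, never exactly 0
lasso = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)  # pen(b) = |b|: soft thresholding, exact zeros

print(np.round(ridge, 2))
print(np.round(lasso, 2))   # entries with |z| <= lam are set exactly to 0
```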
let's summarize

- hypothesis set: k and H
- data fidelity (loss): ℓ convex
- regularity: learning as a multicriterion optimization
- ℓ1 = sparsity + convexity = some efficiency