Modern Optimization Techniques for Big Data Machine Learning
Tong Zhang
Rutgers University & Baidu Inc.

Transcription

Outline

Background:
big data optimization in machine learning: special structure

Single machine optimization
stochastic gradient (1st order) versus batch gradient: pros and cons
algorithm 1: SVRG (Stochastic Variance Reduced Gradient)
algorithm 2: SDCA (Stochastic Dual Coordinate Ascent)
algorithm 3: accelerated SDCA (with Nesterov acceleration)

Distributed optimization
algorithm 4: minibatch SDCA
algorithm 5: DANE (Distributed Approximate NEwton-type method); behaves like 2nd order stochastic sampling
other methods

Mathematical Problem

Big Data Optimization Problem in machine learning:

    min_w f(w),   f(w) = (1/n) Σ_{i=1}^n f_i(w)

Special structure: sum over data, with large n.

Assumptions on the loss function

λ-strong convexity:

    f(w') ≥ f(w) + ∇f(w)^T (w' − w) + (λ/2) ||w' − w||_2^2
    (the last term gives a quadratic lower bound)

L-smoothness:

    f_i(w') ≤ f_i(w) + ∇f_i(w)^T (w' − w) + (L/2) ||w' − w||_2^2
    (the last term gives a quadratic upper bound)

Example: Computational Advertising

Large scale regularized logistic regression:

    min_w (1/n) Σ_{i=1}^n f_i(w),   f_i(w) = ln(1 + e^{−w^T x_i y_i}) + (λ/2) ||w||_2^2

data (x_i, y_i) with y_i ∈ {±1}
parameter vector w
The objective is λ-strongly convex and L-smooth with L = 0.25 max_i ||x_i||_2^2 + λ.
big data: n ∼ 10−100 billion
high dimension: dim(x_i) ∼ 10−100 billion

How to solve big optimization problems efficiently?

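To make the objective concrete, here is a minimal NumPy sketch of one term f_i(w) and its gradient for this regularized logistic loss; the argument names (x_i, y_i, lam) and the interface are illustrative, not code from the talk.

```python
import numpy as np

def f_i(w, x_i, y_i, lam):
    """One term of the objective: logistic loss on (x_i, y_i) plus the L2 penalty."""
    margin = y_i * np.dot(w, x_i)
    return np.log1p(np.exp(-margin)) + 0.5 * lam * np.dot(w, w)

def grad_f_i(w, x_i, y_i, lam):
    """Gradient of f_i(w): -y_i * x_i * sigmoid(-margin) + lam * w."""
    margin = y_i * np.dot(w, x_i)
    return -y_i * x_i / (1.0 + np.exp(margin)) + lam * w

# The smoothness constant quoted above, L = 0.25 * max_i ||x_i||^2 + lam, comes from
# the fact that the second derivative of ln(1 + e^{-m}) is at most 0.25.
```
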
Optimization Problem: Communication Complexity

From simple to complex:

Single machine, single core: can employ sequential algorithms
Single machine, multi-core: relatively cheap communication
Multi-machine (synchronous): expensive communication
Multi-machine (asynchronous): break synchronization to reduce communication

We want to solve the simple problems well first, then the more complex ones.

Batch Optimization Method: Gradient Descent

Solve

    w_* = arg min_w f(w),   f(w) = (1/n) Σ_{i=1}^n f_i(w).

Gradient Descent (GD):

    w_k = w_{k−1} − η_k ∇f(w_{k−1}).

How fast does this method converge to the optimal solution?

The convergence rate depends on the conditions satisfied by f(·).
For λ-strongly convex and L-smooth problems, the rate is linear:

    f(w_k) − f(w_*) = O((1 − ρ)^k),

where ρ = O(λ/L) is the inverse condition number.

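A minimal sketch of the GD update above; grad_f is assumed to be a function returning the full gradient ∇f(w), and a constant step size eta (for example, on the order of 1/L) is used for simplicity.

```python
import numpy as np

def gradient_descent(grad_f, w0, eta, num_iters):
    """Batch gradient descent: w_k = w_{k-1} - eta * grad_f(w_{k-1})."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - eta * grad_f(w)
    return w
```
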
Stochastic Approximate Gradient Computation

If

    f(w) = (1/n) Σ_{i=1}^n f_i(w),

GD requires the computation of the full gradient, which is extremely costly:

    ∇f(w) = (1/n) Σ_{i=1}^n ∇f_i(w).

Idea: stochastic optimization employs a random sample (mini-batch) B to approximate

    ∇f(w) ≈ (1/|B|) Σ_{i∈B} ∇f_i(w).

It is an unbiased estimator
more efficient computation but introduces variance

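A sketch of the two estimators side by side; grad_fi(w, i) is assumed to return the per-example gradient ∇f_i(w), and rng is a NumPy random generator. The names and interface are illustrative.

```python
import numpy as np

def full_gradient(w, grad_fi, n):
    """Exact gradient: average all n per-example gradients (one full pass over the data)."""
    return sum(grad_fi(w, i) for i in range(n)) / n

def minibatch_gradient(w, grad_fi, n, batch_size, rng):
    """Unbiased estimate of the gradient from a random mini-batch B."""
    batch = rng.choice(n, size=batch_size, replace=False)
    return sum(grad_fi(w, i) for i in batch) / batch_size

# usage sketch:
# rng = np.random.default_rng(0)
# g_hat = minibatch_gradient(w, grad_fi, n, 128, rng)
```
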
SGD versus GD

SGD:
faster computation per step
sublinear convergence, due to the variance of the gradient approximation: f(w_t) − f(w_*) = Õ(1/t)

GD:
slower computation per step
linear convergence: f(w_t) − f(w_*) = O((1 − ρ)^t)

Improve SGD via variance reduction:
SGD uses an unbiased statistical estimator of the gradient with large variance.
Smaller variance implies faster convergence.
Idea: design other unbiased gradient estimators with small variance.

Improving SGD using Variance Reduction

This idea leads to modern stochastic algorithms for big data machine learning with fast convergence rates:

Collins et al. (2008): for special problems, with a relatively complicated algorithm (Exponentiated Gradient on the dual)
Le Roux, Schmidt, Bach (NIPS 2012): a variant of SGD called SAG (stochastic average gradient)
Johnson and Zhang (NIPS 2013): SVRG (Stochastic Variance Reduced Gradient)
Shalev-Shwartz and Zhang (JMLR 2013): SDCA (Stochastic Dual Coordinate Ascent)

Stochastic Variance Reduced Gradient: Derivation

Objective function:

    f(w) = (1/n) Σ_{i=1}^n f_i(w) = (1/n) Σ_{i=1}^n f̃_i(w),

where

    f̃_i(w) = f_i(w) − (∇f_i(w̃) − ∇f(w̃))^T w
    (the correction terms sum to zero over i)

and w̃ is picked to be an approximate solution (close to w_*).

SVRG rule (small variance):

    w_t = w_{t−1} − η_t ∇f̃_i(w_{t−1}) = w_{t−1} − η_t [∇f_i(w_{t−1}) − ∇f_i(w̃) + ∇f(w̃)].

Compare to the SGD rule (large variance):

    w_t = w_{t−1} − η_t ∇f_i(w_{t−1}).

SVRG Algorithm

Procedure SVRG
Parameters: update frequency m and learning rate η
Initialize w̃_0
Iterate: for s = 1, 2, . . .
    w̃ = w̃_{s−1}
    µ̃ = (1/n) Σ_{i=1}^n ∇ψ_i(w̃)
    w_0 = w̃
    Iterate: for t = 1, 2, . . . , m
        Randomly pick i_t ∈ {1, . . . , n} and update the weight:
        w_t = w_{t−1} − η (∇ψ_{i_t}(w_{t−1}) − ∇ψ_{i_t}(w̃) + µ̃)
    end
    Set w̃_s = w_m
end

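A minimal single-process sketch of the procedure above, with the loss terms ψ_i represented by a per-example gradient function grad_psi(w, i); the data access pattern, step size, and epoch length m are illustrative assumptions.

```python
import numpy as np

def svrg(grad_psi, n, w0, eta, m, num_epochs, rng):
    """SVRG: each outer epoch recomputes the full gradient at the snapshot w_tilde,
    then performs m variance-reduced stochastic updates."""
    w_tilde = np.asarray(w0, dtype=float)
    for _ in range(num_epochs):
        mu = sum(grad_psi(w_tilde, i) for i in range(n)) / n  # full gradient at the snapshot
        w = w_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            # unbiased, variance-reduced gradient estimate
            g = grad_psi(w, i) - grad_psi(w_tilde, i) + mu
            w = w - eta * g
        w_tilde = w  # as on the slide: the next snapshot is the last inner iterate
    return w_tilde
```
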
SVRG vs. Batch Gradient Descent: fast convergence

Number of examples needed to achieve accuracy ε:

    Batch GD: Õ(n · (L/λ) · log(1/ε))
    SVRG:     Õ((n + L/λ) · log(1/ε))

Assume L-smooth losses f_i and a λ-strongly convex objective function.
SVRG has fast convergence: the condition number is effectively reduced.
The gain of SVRG over the batch algorithm is significant when n is large.

Motivation of SDCA: regularized loss minimization

Assume we want to solve the Lasso problem:

    min_w [ (1/n) Σ_{i=1}^n (w^T x_i − y_i)^2 + λ ||w||_1 ]

or the ridge regression problem:

    min_w [ (1/n) Σ_{i=1}^n (w^T x_i − y_i)^2 + (λ/2) ||w||_2^2 ]
    (the first term is the loss, the second the regularization)

Goal: solve regularized loss minimization problems as fast as we can.
Solution: proximal Stochastic Dual Coordinate Ascent (Prox-SDCA).
One can show fast convergence of SDCA.

General Problem

Want to solve:

    min_w P(w) := (1/n) Σ_{i=1}^n φ_i(X_i^T w) + λ g(w),

where the X_i are matrices and g(·) is strongly convex.

Examples:

Multi-class logistic loss:

    φ_i(X_i^T w) = ln Σ_{ℓ=1}^K exp(w^T X_{i,ℓ}) − w^T X_{i,y_i}.

L1−L2 regularization:

    g(w) = (1/2) ||w||_2^2 + (σ/λ) ||w||_1

Dual Formulation

Primal:

    min_w P(w) := (1/n) Σ_{i=1}^n φ_i(X_i^T w) + λ g(w)

Dual:

    max_α D(α) := (1/n) Σ_{i=1}^n [ −φ_i^*(−α_i) ] − λ g^*( (1/(λn)) Σ_{i=1}^n X_i α_i )

with the relationship

    w = ∇g^*( (1/(λn)) Σ_{i=1}^n X_i α_i ).

The convex conjugate (dual) is defined as φ_i^*(a) = sup_z (a z − φ_i(z)).

SDCA: randomly pick i and optimize D(α) by varying α_i while keeping the other dual variables fixed.

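For intuition, here is a sketch of the SDCA coordinate step specialized to ridge regression (squared loss with g(w) = (1/2)||w||_2^2), where the single-coordinate maximization of D(α) has a closed form; this concrete specialization and its closed-form update are a standard illustration assumed here, not spelled out on the slide.

```python
import numpy as np

def sdca_ridge(X, y, lam, num_iters, rng):
    """SDCA for min_w (1/n) sum_i 0.5*(w^T x_i - y_i)^2 + (lam/2)*||w||^2.
    Maintains the dual vector alpha and the primal w = (1/(lam*n)) * sum_i alpha_i x_i."""
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    for _ in range(num_iters):
        i = rng.integers(n)
        # closed-form maximizer of the dual objective over the single coordinate alpha_i
        delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + (X[i] @ X[i]) / (lam * n))
        alpha[i] += delta
        w += (delta / (lam * n)) * X[i]  # keep the primal-dual relationship in sync
    return w
```

Each step touches one example, so an epoch of n such steps costs roughly as much as one full gradient computation.
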
Example: L1−L2 Regularized Logistic Regression

Primal:

    P(w) = (1/n) Σ_{i=1}^n ln(1 + e^{−w^T X_i Y_i}) + (λ/2) w^T w + σ ||w||_1,

with φ_i(w) = ln(1 + e^{−w^T X_i Y_i}) and λ g(w) = (λ/2) w^T w + σ ||w||_1.

Dual: with α_i Y_i ∈ [0, 1],

    D(α) = (1/n) Σ_{i=1}^n [ −α_i Y_i ln(α_i Y_i) − (1 − α_i Y_i) ln(1 − α_i Y_i) ] − (λ/2) ||trunc(v, σ/λ)||_2^2
    (the bracketed term comes from the convex conjugate φ_i^* of the logistic loss)

    s.t. v = (1/(λn)) Σ_{i=1}^n α_i X_i;   w = trunc(v, σ/λ)

where

    trunc(u, δ)_j = u_j − δ   if u_j > δ
                    0         if |u_j| ≤ δ
                    u_j + δ   if u_j < −δ

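The trunc(·, δ) map defined above is coordinate-wise soft-thresholding; a one-function sketch:

```python
import numpy as np

def trunc(u, delta):
    """Soft-threshold: shift each coordinate toward zero by delta, zeroing out |u_j| <= delta."""
    u = np.asarray(u, dtype=float)
    return np.sign(u) * np.maximum(np.abs(u) - delta, 0.0)
```
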
Proximal-SDCA for L1−L2 Regularization

Algorithm:
Keep the dual α and v = (λn)^{−1} Σ_i α_i X_i
Randomly pick i
Find ∆_i by approximately maximizing

    −φ_i^*(α_i + ∆_i) − trunc(v, σ/λ)^T X_i ∆_i − (1/(2λn)) ||X_i||_2^2 ∆_i^2,

where φ_i^*(α_i + ∆) = (α_i + ∆) Y_i ln((α_i + ∆) Y_i) + (1 − (α_i + ∆) Y_i) ln(1 − (α_i + ∆) Y_i)
α = α + ∆_i · e_i
v = v + (λn)^{−1} ∆_i · X_i
Let w = trunc(v, σ/λ).

Fast Convergence of SDCA

The number of iterations needed to achieve accuracy ε:

For L-smooth loss:

    Õ( (n + L/λ) log(1/ε) )

For non-smooth but G-Lipschitz loss (bounded gradient):

    Õ( n + G^2/(λε) )

Similar to that of SVRG, and effective when n is large.

Solving L1 with Smooth Loss

Want to solve L1 regularization to accuracy ε with smooth φ_i:

    (1/n) Σ_{i=1}^n φ_i(w) + σ ||w||_1.

Apply Prox-SDCA with the extra term 0.5 λ ||w||_2^2, where λ = O(ε):
the number of iterations needed by Prox-SDCA is Õ(n + 1/ε).

Compare to (in terms of the number of examples that must be processed):
Dual Averaging SGD (Xiao): Õ(1/ε^2)
FISTA (Nesterov's batch accelerated proximal gradient): Õ(n/√ε)

Prox-SDCA wins in the statistically interesting regime ε > Ω(1/n^2).
One can design an accelerated Prox-SDCA that is always superior to FISTA.

Accelerated Prox-SDCA

Solving:

    P(w) := (1/n) Σ_{i=1}^n φ_i(X_i^T w) + λ g(w)

The convergence rate of Prox-SDCA depends on O(1/λ).
This is inferior to acceleration when λ is very small, O(1/n): accelerated methods have an O(1/√λ) dependency.

Inner-outer Iteration Accelerated Prox-SDCA
Pick a suitable κ = Θ(1/n) and β
For t = 2, 3, . . . (outer iteration):
    Let g̃_t(w) = λ g(w) + 0.5 κ ||w − y^{(t−1)}||_2^2   (κ-strongly convex)
    Let P̃_t(w) = P(w) − λ g(w) + g̃_t(w)   (redefines P(·); κ-strongly convex)
    Approximately solve P̃_t(w) for (w^{(t)}, α^{(t)}) with Prox-SDCA (inner iterations)
    Let y^{(t)} = w^{(t)} + β (w^{(t)} − w^{(t−1)})   (acceleration)

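A schematic sketch of this inner-outer scheme, assuming a routine prox_sdca_solve(y, kappa) that approximately minimizes P(w) + 0.5·κ·||w − y||^2 and returns the new w; the routine name and calling convention are placeholders, not the authors' code.

```python
import numpy as np

def accelerated_prox_sdca(prox_sdca_solve, w_init, kappa, beta, num_outer):
    """Outer loop: add the proximal term 0.5*kappa*||w - y||^2, solve the resulting
    kappa-strongly-convex problem approximately with prox-SDCA, then extrapolate."""
    w_prev = np.asarray(w_init, dtype=float)
    y = w_prev.copy()
    for _ in range(num_outer):
        w = prox_sdca_solve(y, kappa)   # inner iterations (prox-SDCA)
        y = w + beta * (w - w_prev)     # Nesterov-style acceleration step
        w_prev = w
    return w_prev
```
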
Performance Comparisons

Runtimes (ε denotes the target accuracy; constants and log factors are omitted):

Problem            | Algorithm                      | Runtime
-------------------|--------------------------------|------------------------------
SVM                | SGD                            | 1/(λε)
                   | AGD (Nesterov)                 | n √(1/(λε))
                   | Acc-Prox-SDCA                  | n + min{ 1/(λε), √(n/(λε)) }
Lasso              | SGD and variants               | 1/ε^2
                   | Stochastic Coordinate Descent  | n/ε
                   | FISTA                          | n √(1/ε)
                   | Acc-Prox-SDCA                  | n + min{ 1/ε, √(n/ε) }
Ridge Regression   | SGD, SDCA                      | n + 1/(λε)
                   | AGD                            | n √(1/(λε))
                   | Acc-Prox-SDCA                  | n + min{ 1/(λε), √(n/(λε)) }

Additional Related Work on Acceleration

Methods achieving fast accelerated convergence comparable to Acc-Prox-SDCA:

Qihang Lin, Zhaosong Lu, Lin Xiao. An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization, arXiv, 2014.
Yuchen Zhang, Lin Xiao. Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization, arXiv, 2014.

Distributed Computing: Distribution Schemes

Distribute data (data parallelism):
all machines have the same parameters
each machine has a different set of data

Distribute features (model parallelism):
all machines have the same data
each machine has a different set of parameters

Distribute data and features (data & model parallelism):
each machine has a different set of data
each machine has a different set of parameters

Main Issues in Distributed Large Scale Learning

System design and network communication:
data parallelism: need to transfer a reasonably sized chunk of data each time (mini-batch)
model parallelism: distributed parameter vector (parameter server)

Model update strategy:
synchronous
asynchronous

MiniBatch

Vanilla SDCA (or SGD) is difficult to parallelize.
Solution: use minibatches (thousands to hundreds of thousands of examples).
Problem: a simple minibatch implementation slows down convergence, so there is limited gain from parallel computing.
Solutions:
use Nesterov acceleration
use second order information (e.g., approximate Newton steps)

MiniBatch SDCA with Acceleration

Parameters: scalars λ, γ and θ ∈ [0, 1]; mini-batch size b
Initialize α_1^{(0)} = · · · = α_n^{(0)} = ᾱ^{(0)} = 0, w^{(0)} = 0
Iterate: for t = 1, 2, . . .
    u^{(t−1)} = (1 − θ) w^{(t−1)} + θ ᾱ^{(t−1)}
    Randomly pick a subset I ⊂ {1, . . . , n} of size b and update
        α_i^{(t)} = (1 − θ) α_i^{(t−1)} − θ ∇f_i(u^{(t−1)})/(λn)   for i ∈ I
        α_j^{(t)} = α_j^{(t−1)}   for j ∉ I
    ᾱ^{(t)} = ᾱ^{(t−1)} + Σ_{i∈I} (α_i^{(t)} − α_i^{(t−1)})
    w^{(t)} = (1 − θ) w^{(t−1)} + θ ᾱ^{(t)}
end

Better than vanilla block SDCA, and allows a large batch size.

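A literal NumPy transcription of the updates above, treating each α_i as a vector of the same dimension as w and ᾱ as their running sum; grad_fi(u, i) is assumed to return ∇f_i(u). The γ listed among the slide's parameters does not appear in the displayed updates and is omitted; the batch size and θ are illustrative.

```python
import numpy as np

def accel_minibatch_sdca(grad_fi, n, d, lam, theta, b, num_iters, rng):
    """Accelerated mini-batch SDCA following the update rules on the slide."""
    alpha = np.zeros((n, d))    # one dual vector per example
    alpha_bar = np.zeros(d)     # running sum of the alpha_i
    w = np.zeros(d)
    for _ in range(num_iters):
        u = (1.0 - theta) * w + theta * alpha_bar
        batch = rng.choice(n, size=b, replace=False)
        for i in batch:
            new_alpha_i = (1.0 - theta) * alpha[i] - theta * grad_fi(u, i) / (lam * n)
            alpha_bar += new_alpha_i - alpha[i]
            alpha[i] = new_alpha_i
        w = (1.0 - theta) * w + theta * alpha_bar
    return w
```
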
Example

[Figure: primal suboptimality versus number of processed examples for accelerated minibatch SDCA with minibatch sizes m = 52, 523, 5229, compared with AGD and vanilla SDCA.]

MiniBatch SDCA with acceleration can employ a large minibatch size.

Communication Efficient Distributed Computing

Assume: data distributed over machines
m processors
each has n/m examples

Simple computational strategy: One Shot Averaging (OSA)
run the optimization on the m machines separately, obtaining parameters w^{(1)}, . . . , w^{(m)}
average the parameters: w̄ = (1/m) Σ_{i=1}^m w^{(i)}

Improvement

Advantages of the OSA strategy:
machines run independently
simple and computationally efficient; asymptotically good in theory

Disadvantage:
practically inferior to training on all examples on a single machine

Traditional solution in optimization: ADMM
New idea: 2nd order gradient sampling, via Distributed Approximate NEwton (DANE)

Distribution Scheme

Assume: data distributed over machines, with the problem decomposed as

    f(w) = Σ_{ℓ=1}^m f^{(ℓ)}(w).

m processors
each f^{(ℓ)}(w) has n/m randomly partitioned examples
each machine holds a complete set of parameters

DANE

Start with w̃ obtained using OSA.
Iterate:
    Take w̃ and define
        f̃^{(ℓ)}(w) = f^{(ℓ)}(w) − (∇f^{(ℓ)}(w̃) − ∇f(w̃))^T w
    Each machine solves
        w^{(ℓ)} = arg min_w f̃^{(ℓ)}(w)
    independently.
    Take the partial average as the next w̃.

This leads to fast convergence: O((1 − ρ)^t) after t iterations, with ρ ≈ 1.

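A serial simulation of one DANE round, taking the global objective as the average of the local ones and assuming each machine exposes a gradient oracle and a local solver that minimizes the corrected local objective; all interface names are placeholders, not the authors' code.

```python
import numpy as np

def dane_round(local_grads, local_solves, w_tilde):
    """One DANE iteration, simulated serially over the m 'machines'.

    local_grads[l](w)            -> gradient of the local objective f^(l) at w
    local_solves[l](correction)  -> argmin_w  f^(l)(w) - correction^T w
    """
    m = len(local_grads)
    # one round of communication: gradient of the global objective at w_tilde
    grad_global = sum(g(w_tilde) for g in local_grads) / m
    solutions = []
    for l in range(m):
        # correction shifts the local gradient to match the global one at w_tilde
        correction = local_grads[l](w_tilde) - grad_global
        solutions.append(local_solves[l](correction))
    return sum(solutions) / m   # average of the local solutions is the next w_tilde
```
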
Reason: Approximate Newton Step

On each machine, we solve

    min_w f̃^{(ℓ)}(w).

It can be regarded as approximate minimization of

    min_w [ f(w̃) + ∇f(w̃)^T (w − w̃) + (1/2) (w − w̃)^T ∇²f^{(ℓ)}(w̃) (w − w̃) ],

where ∇²f^{(ℓ)}(w̃) is a 2nd order gradient sample of ∇²f(w̃).

This is an approximate Newton step with a sampled approximation of the Hessian.

Comparisons

[Figure: three panels (COV1, MNIST-47, ASTRO) plotting the objective value against the iteration count t for DANE, ADMM, OSA, and the optimum (Opt).]

Summary

Optimization in machine learning: sum-over-data structure
Traditional methods: gradient based batch algorithms
do not take advantage of the special structure
Recent progress: stochastic optimization with fast rates
takes advantage of the special structure; suitable for a single machine

Distributed computing (data parallelism and synchronous updates)
minibatch SDCA
DANE (batch algorithm on each machine + synchronization)

Other approaches
algorithmic side: ADMM, asynchronous updates (Hogwild!), etc.
system side: distributed vector computing (parameter servers); Baidu has an industry-leading solution

Fast developing field; many exciting new ideas

References

Rie Johnson and TZ. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, NIPS 2013.
Lin Xiao and TZ. A Proximal Stochastic Gradient Method with Progressive Variance Reduction, SIAM Journal on Optimization, to appear.
Shai Shalev-Shwartz and TZ. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization, JMLR 14:567-599, 2013.
Shai Shalev-Shwartz and TZ. Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization, Mathematical Programming, to appear.
Shai Shalev-Shwartz and TZ. Accelerated Mini-Batch Stochastic Dual Coordinate Ascent, NIPS 2013.
Ohad Shamir, Nathan Srebro, and TZ. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method, ICML 2014.