Modern Optimization Techniques for Big Data Machine Learning
Tong Zhang
Rutgers University & Baidu Inc.

Transcription

Outline

Background:
big data optimization in machine learning: special structure

Single machine optimization
stochastic gradient (1st order) versus batch gradient: pros and cons
algorithm 1: SVRG (Stochastic Variance Reduced Gradient)
algorithm 2: SDCA (Stochastic Dual Coordinate Ascent)
algorithm 3: accelerated SDCA (with Nesterov acceleration)

Distributed optimization
algorithm 4: minibatch SDCA
algorithm 5: DANE (Distributed Approximate NEwton-type method); behaves like 2nd order stochastic sampling
other methods

Mathematical Problem

Big Data Optimization Problem in machine learning:

    min_w f(w),   f(w) = (1/n) Σ_{i=1}^n f_i(w)

Special structure: sum over data, with large n.

Assumptions on the loss function

λ-strong convexity:

    f(w') ≥ f(w) + ∇f(w)^T (w' − w) + (λ/2) ||w' − w||_2^2
    (the last term gives a quadratic lower bound)

L-smoothness:

    f_i(w') ≤ f_i(w) + ∇f_i(w)^T (w' − w) + (L/2) ||w' − w||_2^2
    (the last term gives a quadratic upper bound)

Example: Computational Advertising

Large scale regularized logistic regression:

    min_w (1/n) Σ_{i=1}^n f_i(w),   f_i(w) = ln(1 + e^{−w^T x_i y_i}) + (λ/2) ||w||_2^2

data (x_i, y_i) with y_i ∈ {±1}
parameter vector w
The objective is λ-strongly convex and L-smooth with L = 0.25 max_i ||x_i||_2^2 + λ.
big data: n ∼ 10−100 billion
high dimension: dim(x_i) ∼ 10−100 billion

How to solve big optimization problems efficiently?

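To make the objective concrete, here is a minimal NumPy sketch of one term f_i(w) and its gradient for this regularized logistic loss; the argument names (x_i, y_i, lam) and the interface are illustrative, not code from the talk.

```python
import numpy as np

def f_i(w, x_i, y_i, lam):
    """One term of the objective: logistic loss on (x_i, y_i) plus the L2 penalty."""
    margin = y_i * np.dot(w, x_i)
    return np.log1p(np.exp(-margin)) + 0.5 * lam * np.dot(w, w)

def grad_f_i(w, x_i, y_i, lam):
    """Gradient of f_i(w): -y_i * x_i * sigmoid(-margin) + lam * w."""
    margin = y_i * np.dot(w, x_i)
    return -y_i * x_i / (1.0 + np.exp(margin)) + lam * w

# The smoothness constant quoted above, L = 0.25 * max_i ||x_i||^2 + lam, comes from
# the fact that the second derivative of ln(1 + e^{-m}) is at most 0.25.
```
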
Optimization Problem: Communication Complexity

From simple to complex:

Single machine, single core: can employ sequential algorithms
Single machine, multi-core: relatively cheap communication
Multi-machine (synchronous): expensive communication
Multi-machine (asynchronous): break synchronization to reduce communication

We want to solve the simple problems well first, then the more complex ones.

Batch Optimization Method: Gradient Descent

Solve

    w_* = arg min_w f(w),   f(w) = (1/n) Σ_{i=1}^n f_i(w).

Gradient Descent (GD):

    w_k = w_{k−1} − η_k ∇f(w_{k−1}).

How fast does this method converge to the optimal solution?

The convergence rate depends on the conditions satisfied by f(·).
For λ-strongly convex and L-smooth problems, the rate is linear:

    f(w_k) − f(w_*) = O((1 − ρ)^k),

where ρ = O(λ/L) is the inverse condition number.

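A minimal sketch of the GD update above; grad_f is assumed to be a function returning the full gradient ∇f(w), and a constant step size eta (for example, on the order of 1/L) is used for simplicity.

```python
import numpy as np

def gradient_descent(grad_f, w0, eta, num_iters):
    """Batch gradient descent: w_k = w_{k-1} - eta * grad_f(w_{k-1})."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - eta * grad_f(w)
    return w
```
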
Stochastic Approximate Gradient Computation

If

    f(w) = (1/n) Σ_{i=1}^n f_i(w),

GD requires the computation of the full gradient, which is extremely costly:

    ∇f(w) = (1/n) Σ_{i=1}^n ∇f_i(w).

Idea: stochastic optimization employs a random sample (mini-batch) B to approximate

    ∇f(w) ≈ (1/|B|) Σ_{i∈B} ∇f_i(w).

It is an unbiased estimator
more efficient computation but introduces variance

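A sketch of the two estimators side by side; grad_fi(w, i) is assumed to return the per-example gradient ∇f_i(w), and rng is a NumPy random generator. The names and interface are illustrative.

```python
import numpy as np

def full_gradient(w, grad_fi, n):
    """Exact gradient: average all n per-example gradients (one full pass over the data)."""
    return sum(grad_fi(w, i) for i in range(n)) / n

def minibatch_gradient(w, grad_fi, n, batch_size, rng):
    """Unbiased estimate of the gradient from a random mini-batch B."""
    batch = rng.choice(n, size=batch_size, replace=False)
    return sum(grad_fi(w, i) for i in batch) / batch_size

# usage sketch:
# rng = np.random.default_rng(0)
# g_hat = minibatch_gradient(w, grad_fi, n, 128, rng)
```
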
SGD versus GD

SGD:
faster computation per step
sublinear convergence, due to the variance of the gradient approximation: f(w_t) − f(w_*) = Õ(1/t)

GD:
slower computation per step
linear convergence: f(w_t) − f(w_*) = O((1 − ρ)^t)

Improve SGD via variance reduction:
SGD uses an unbiased statistical estimator of the gradient with large variance.
Smaller variance implies faster convergence.
Idea: design other unbiased gradient estimators with small variance.

Improving SGD using Variance Reduction

This idea leads to modern stochastic algorithms for big data machine learning with fast convergence rates:

Collins et al. (2008): for special problems, with a relatively complicated algorithm (Exponentiated Gradient on the dual)
Le Roux, Schmidt, Bach (NIPS 2012): a variant of SGD called SAG (stochastic average gradient)
Johnson and Zhang (NIPS 2013): SVRG (Stochastic Variance Reduced Gradient)
Shalev-Shwartz and Zhang (JMLR 2013): SDCA (Stochastic Dual Coordinate Ascent)

Stochastic Variance Reduced Gradient: Derivation

Objective function:

    f(w) = (1/n) Σ_{i=1}^n f_i(w) = (1/n) Σ_{i=1}^n f̃_i(w),

where

    f̃_i(w) = f_i(w) − (∇f_i(w̃) − ∇f(w̃))^T w
    (the correction terms sum to zero over i)

and w̃ is picked to be an approximate solution (close to w_*).

SVRG rule (small variance):

    w_t = w_{t−1} − η_t ∇f̃_i(w_{t−1}) = w_{t−1} − η_t [∇f_i(w_{t−1}) − ∇f_i(w̃) + ∇f(w̃)].

Compare to the SGD rule (large variance):

    w_t = w_{t−1} − η_t ∇f_i(w_{t−1}).

SVRG Algorithm

Procedure SVRG
Parameters: update frequency m and learning rate η
Initialize w̃_0
Iterate: for s = 1, 2, . . .
    w̃ = w̃_{s−1}
    µ̃ = (1/n) Σ_{i=1}^n ∇ψ_i(w̃)
    w_0 = w̃
    Iterate: for t = 1, 2, . . . , m
        Randomly pick i_t ∈ {1, . . . , n} and update the weight:
        w_t = w_{t−1} − η (∇ψ_{i_t}(w_{t−1}) − ∇ψ_{i_t}(w̃) + µ̃)
    end
    Set w̃_s = w_m
end

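A minimal single-process sketch of the procedure above, with the loss terms ψ_i represented by a per-example gradient function grad_psi(w, i); the data access pattern, step size, and epoch length m are illustrative assumptions.

```python
import numpy as np

def svrg(grad_psi, n, w0, eta, m, num_epochs, rng):
    """SVRG: each outer epoch recomputes the full gradient at the snapshot w_tilde,
    then performs m variance-reduced stochastic updates."""
    w_tilde = np.asarray(w0, dtype=float)
    for _ in range(num_epochs):
        mu = sum(grad_psi(w_tilde, i) for i in range(n)) / n  # full gradient at the snapshot
        w = w_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            # unbiased, variance-reduced gradient estimate
            g = grad_psi(w, i) - grad_psi(w_tilde, i) + mu
            w = w - eta * g
        w_tilde = w  # as on the slide: the next snapshot is the last inner iterate
    return w_tilde
```
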
SVRG vs. Batch Gradient Descent: fast convergence

Number of examples needed to achieve accuracy ε:

    Batch GD: Õ(n · (L/λ) · log(1/ε))
    SVRG:     Õ((n + L/λ) · log(1/ε))

Assume L-smooth losses f_i and a λ-strongly convex objective function.
SVRG has fast convergence: the condition number is effectively reduced.
The gain of SVRG over the batch algorithm is significant when n is large.

Motivation of SDCA: regularized loss minimization

Assume we want to solve the Lasso problem:

    min_w [ (1/n) Σ_{i=1}^n (w^T x_i − y_i)^2 + λ ||w||_1 ]

or the ridge regression problem:

    min_w [ (1/n) Σ_{i=1}^n (w^T x_i − y_i)^2 + (λ/2) ||w||_2^2 ]
    (the first term is the loss, the second the regularization)

Goal: solve regularized loss minimization problems as fast as we can.
Solution: proximal Stochastic Dual Coordinate Ascent (Prox-SDCA).
One can show fast convergence of SDCA.

General Problem

Want to solve:

    min_w P(w) := (1/n) Σ_{i=1}^n φ_i(X_i^T w) + λ g(w),

where the X_i are matrices and g(·) is strongly convex.

Examples:

Multi-class logistic loss:

    φ_i(X_i^T w) = ln Σ_{ℓ=1}^K exp(w^T X_{i,ℓ}) − w^T X_{i,y_i}.

L1−L2 regularization:

    g(w) = (1/2) ||w||_2^2 + (σ/λ) ||w||_1

Dual Formulation

Primal:

    min_w P(w) := (1/n) Σ_{i=1}^n φ_i(X_i^T w) + λ g(w)

Dual:

    max_α D(α) := (1/n) Σ_{i=1}^n [ −φ_i^*(−α_i) ] − λ g^*( (1/(λn)) Σ_{i=1}^n X_i α_i )

with the relationship

    w = ∇g^*( (1/(λn)) Σ_{i=1}^n X_i α_i ).

The convex conjugate (dual) is defined as φ_i^*(a) = sup_z (a z − φ_i(z)).

SDCA: randomly pick i and optimize D(α) by varying α_i while keeping the other dual variables fixed.

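For intuition, here is a sketch of the SDCA coordinate step specialized to ridge regression (squared loss with g(w) = (1/2)||w||_2^2), where the single-coordinate maximization of D(α) has a closed form; this concrete specialization and its closed-form update are a standard illustration assumed here, not spelled out on the slide.

```python
import numpy as np

def sdca_ridge(X, y, lam, num_iters, rng):
    """SDCA for min_w (1/n) sum_i 0.5*(w^T x_i - y_i)^2 + (lam/2)*||w||^2.
    Maintains the dual vector alpha and the primal w = (1/(lam*n)) * sum_i alpha_i x_i."""
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    for _ in range(num_iters):
        i = rng.integers(n)
        # closed-form maximizer of the dual objective over the single coordinate alpha_i
        delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + (X[i] @ X[i]) / (lam * n))
        alpha[i] += delta
        w += (delta / (lam * n)) * X[i]  # keep the primal-dual relationship in sync
    return w
```

Each step touches one example, so an epoch of n such steps costs roughly as much as one full gradient computation.
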
Example: L1−L2 Regularized Logistic Regression

Primal:

    P(w) = (1/n) Σ_{i=1}^n ln(1 + e^{−w^T X_i Y_i}) + (λ/2) w^T w + σ ||w||_1,

with φ_i(w) = ln(1 + e^{−w^T X_i Y_i}) and λ g(w) = (λ/2) w^T w + σ ||w||_1.

Dual: with α_i Y_i ∈ [0, 1],

    D(α) = (1/n) Σ_{i=1}^n [ −α_i Y_i ln(α_i Y_i) − (1 − α_i Y_i) ln(1 − α_i Y_i) ] − (λ/2) ||trunc(v, σ/λ)||_2^2
    (the bracketed term comes from the convex conjugate φ_i^* of the logistic loss)

    s.t. v = (1/(λn)) Σ_{i=1}^n α_i X_i;   w = trunc(v, σ/λ)

where

    trunc(u, δ)_j = u_j − δ   if u_j > δ
                    0         if |u_j| ≤ δ
                    u_j + δ   if u_j < −δ

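The trunc(·, δ) map defined above is coordinate-wise soft-thresholding; a one-function sketch:

```python
import numpy as np

def trunc(u, delta):
    """Soft-threshold: shift each coordinate toward zero by delta, zeroing out |u_j| <= delta."""
    u = np.asarray(u, dtype=float)
    return np.sign(u) * np.maximum(np.abs(u) - delta, 0.0)
```
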
Proximal-SDCA for L1−L2 Regularization

Algorithm:
Keep the dual α and v = (λn)^{−1} Σ_i α_i X_i
Randomly pick i
Find ∆_i by approximately maximizing

    −φ_i^*(α_i + ∆_i) − trunc(v, σ/λ)^T X_i ∆_i − (1/(2λn)) ||X_i||_2^2 ∆_i^2,

where φ_i^*(α_i + ∆) = (α_i + ∆) Y_i ln((α_i + ∆) Y_i) + (1 − (α_i + ∆) Y_i) ln(1 − (α_i + ∆) Y_i)
α = α + ∆_i · e_i
v = v + (λn)^{−1} ∆_i · X_i
Let w = trunc(v, σ/λ).

Fast Convergence of SDCA

The number of iterations needed to achieve accuracy ε:

For L-smooth loss:

    Õ( (n + L/λ) log(1/ε) )

For non-smooth but G-Lipschitz loss (bounded gradient):

    Õ( n + G^2/(λε) )

Similar to that of SVRG, and effective when n is large.

Solving L1 with Smooth Loss

Want to solve L1 regularization to accuracy ε with smooth φ_i:

    (1/n) Σ_{i=1}^n φ_i(w) + σ ||w||_1.

Apply Prox-SDCA with the extra term 0.5 λ ||w||_2^2, where λ = O(ε):
the number of iterations needed by Prox-SDCA is Õ(n + 1/ε).

Compare to (in terms of the number of examples that must be processed):
Dual Averaging SGD (Xiao): Õ(1/ε^2)
FISTA (Nesterov's batch accelerated proximal gradient): Õ(n/√ε)

Prox-SDCA wins in the statistically interesting regime ε > Ω(1/n^2).
One can design an accelerated Prox-SDCA that is always superior to FISTA.

Accelerated Prox-SDCA

Solving:

    P(w) := (1/n) Σ_{i=1}^n φ_i(X_i^T w) + λ g(w)

The convergence rate of Prox-SDCA depends on O(1/λ).
This is inferior to acceleration when λ is very small, O(1/n): accelerated methods have an O(1/√λ) dependency.

Inner-outer Iteration Accelerated Prox-SDCA
Pick a suitable κ = Θ(1/n) and β
For t = 2, 3, . . . (outer iteration):
    Let g̃_t(w) = λ g(w) + 0.5 κ ||w − y^{(t−1)}||_2^2   (κ-strongly convex)
    Let P̃_t(w) = P(w) − λ g(w) + g̃_t(w)   (redefines P(·); κ-strongly convex)
    Approximately solve P̃_t(w) for (w^{(t)}, α^{(t)}) with Prox-SDCA (inner iterations)
    Let y^{(t)} = w^{(t)} + β (w^{(t)} − w^{(t−1)})   (acceleration)

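A schematic sketch of this inner-outer scheme, assuming a routine prox_sdca_solve(y, kappa) that approximately minimizes P(w) + 0.5·κ·||w − y||^2 and returns the new w; the routine name and calling convention are placeholders, not the authors' code.

```python
import numpy as np

def accelerated_prox_sdca(prox_sdca_solve, w_init, kappa, beta, num_outer):
    """Outer loop: add the proximal term 0.5*kappa*||w - y||^2, solve the resulting
    kappa-strongly-convex problem approximately with prox-SDCA, then extrapolate."""
    w_prev = np.asarray(w_init, dtype=float)
    y = w_prev.copy()
    for _ in range(num_outer):
        w = prox_sdca_solve(y, kappa)   # inner iterations (prox-SDCA)
        y = w + beta * (w - w_prev)     # Nesterov-style acceleration step
        w_prev = w
    return w_prev
```
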
Performance Comparisons

Runtimes (ε denotes the target accuracy; constants and log factors are omitted):

Problem            | Algorithm                      | Runtime
-------------------|--------------------------------|------------------------------
SVM                | SGD                            | 1/(λε)
                   | AGD (Nesterov)                 | n √(1/(λε))
                   | Acc-Prox-SDCA                  | n + min{ 1/(λε), √(n/(λε)) }
Lasso              | SGD and variants               | 1/ε^2
                   | Stochastic Coordinate Descent  | n/ε
                   | FISTA                          | n √(1/ε)
                   | Acc-Prox-SDCA                  | n + min{ 1/ε, √(n/ε) }
Ridge Regression   | SGD, SDCA                      | n + 1/(λε)
                   | AGD                            | n √(1/(λε))
                   | Acc-Prox-SDCA                  | n + min{ 1/(λε), √(n/(λε)) }

Additional Related Work on Acceleration

Methods achieving fast accelerated convergence comparable to Acc-Prox-SDCA:

Qihang Lin, Zhaosong Lu, Lin Xiao. An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization, arXiv, 2014.
Yuchen Zhang, Lin Xiao. Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization, arXiv, 2014.

Distributed Computing: Distribution Schemes

Distribute data (data parallelism):
all machines have the same parameters
each machine has a different set of data

Distribute features (model parallelism):
all machines have the same data
each machine has a different set of parameters

Distribute data and features (data & model parallelism):
each machine has a different set of data
each machine has a different set of parameters

Main Issues in Distributed Large Scale Learning

System design and network communication:
data parallelism: need to transfer a reasonably sized chunk of data each time (mini-batch)
model parallelism: distributed parameter vector (parameter server)

Model update strategy:
synchronous
asynchronous

MiniBatch

Vanilla SDCA (or SGD) is difficult to parallelize.
Solution: use minibatches (thousands to hundreds of thousands of examples).
Problem: a simple minibatch implementation slows down convergence, so there is limited gain from parallel computing.
Solutions:
use Nesterov acceleration
use second order information (e.g., approximate Newton steps)

MiniBatch SDCA with Acceleration

Parameters: scalars λ, γ and θ ∈ [0, 1]; mini-batch size b
Initialize α_1^{(0)} = · · · = α_n^{(0)} = ᾱ^{(0)} = 0, w^{(0)} = 0
Iterate: for t = 1, 2, . . .
    u^{(t−1)} = (1 − θ) w^{(t−1)} + θ ᾱ^{(t−1)}
    Randomly pick a subset I ⊂ {1, . . . , n} of size b and update
        α_i^{(t)} = (1 − θ) α_i^{(t−1)} − θ ∇f_i(u^{(t−1)})/(λn)   for i ∈ I
        α_j^{(t)} = α_j^{(t−1)}   for j ∉ I
    ᾱ^{(t)} = ᾱ^{(t−1)} + Σ_{i∈I} (α_i^{(t)} − α_i^{(t−1)})
    w^{(t)} = (1 − θ) w^{(t−1)} + θ ᾱ^{(t)}
end

Better than vanilla block SDCA, and allows a large batch size.

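A literal NumPy transcription of the updates above, treating each α_i as a vector of the same dimension as w and ᾱ as their running sum; grad_fi(u, i) is assumed to return ∇f_i(u). The γ listed among the slide's parameters does not appear in the displayed updates and is omitted; the batch size and θ are illustrative.

```python
import numpy as np

def accel_minibatch_sdca(grad_fi, n, d, lam, theta, b, num_iters, rng):
    """Accelerated mini-batch SDCA following the update rules on the slide."""
    alpha = np.zeros((n, d))    # one dual vector per example
    alpha_bar = np.zeros(d)     # running sum of the alpha_i
    w = np.zeros(d)
    for _ in range(num_iters):
        u = (1.0 - theta) * w + theta * alpha_bar
        batch = rng.choice(n, size=b, replace=False)
        for i in batch:
            new_alpha_i = (1.0 - theta) * alpha[i] - theta * grad_fi(u, i) / (lam * n)
            alpha_bar += new_alpha_i - alpha[i]
            alpha[i] = new_alpha_i
        w = (1.0 - theta) * w + theta * alpha_bar
    return w
```
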
Example

[Figure: primal suboptimality versus number of processed examples for accelerated minibatch SDCA with minibatch sizes m = 52, 523, 5229, compared with AGD and vanilla SDCA.]

MiniBatch SDCA with acceleration can employ a large minibatch size.

Communication Efficient Distributed Computing

Assume: data distributed over machines
m processors
each has n/m examples

Simple computational strategy: One Shot Averaging (OSA)
run the optimization on the m machines separately, obtaining parameters w^{(1)}, . . . , w^{(m)}
average the parameters: w̄ = (1/m) Σ_{i=1}^m w^{(i)}

Improvement

Advantages of the OSA strategy:
machines run independently
simple and computationally efficient; asymptotically good in theory

Disadvantage:
practically inferior to training on all examples on a single machine

Traditional solution in optimization: ADMM
New idea: 2nd order gradient sampling, via Distributed Approximate NEwton (DANE)

Distribution Scheme

Assume: data distributed over machines, with the problem decomposed as

    f(w) = Σ_{ℓ=1}^m f^{(ℓ)}(w).

m processors
each f^{(ℓ)}(w) has n/m randomly partitioned examples
each machine holds a complete set of parameters

DANE

Start with w̃ obtained using OSA.
Iterate:
    Take w̃ and define
        f̃^{(ℓ)}(w) = f^{(ℓ)}(w) − (∇f^{(ℓ)}(w̃) − ∇f(w̃))^T w
    Each machine solves
        w^{(ℓ)} = arg min_w f̃^{(ℓ)}(w)
    independently.
    Take the partial average as the next w̃.

This leads to fast convergence: O((1 − ρ)^t) after t iterations, with ρ ≈ 1.

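A serial simulation of one DANE round, taking the global objective as the average of the local ones and assuming each machine exposes a gradient oracle and a local solver that minimizes the corrected local objective; all interface names are placeholders, not the authors' code.

```python
import numpy as np

def dane_round(local_grads, local_solves, w_tilde):
    """One DANE iteration, simulated serially over the m 'machines'.

    local_grads[l](w)            -> gradient of the local objective f^(l) at w
    local_solves[l](correction)  -> argmin_w  f^(l)(w) - correction^T w
    """
    m = len(local_grads)
    # one round of communication: gradient of the global objective at w_tilde
    grad_global = sum(g(w_tilde) for g in local_grads) / m
    solutions = []
    for l in range(m):
        # correction shifts the local gradient to match the global one at w_tilde
        correction = local_grads[l](w_tilde) - grad_global
        solutions.append(local_solves[l](correction))
    return sum(solutions) / m   # average of the local solutions is the next w_tilde
```
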
Reason: Approximate Newton Step

On each machine, we solve

    min_w f̃^{(ℓ)}(w).

It can be regarded as approximate minimization of

    min_w [ f(w̃) + ∇f(w̃)^T (w − w̃) + (1/2) (w − w̃)^T ∇²f^{(ℓ)}(w̃) (w − w̃) ],

where ∇²f^{(ℓ)}(w̃) is a 2nd order gradient sample of ∇²f(w̃).

This is an approximate Newton step with a sampled approximation of the Hessian.

Comparisons

[Figure: three panels (COV1, MNIST-47, ASTRO) plotting the objective value against the iteration count t for DANE, ADMM, OSA, and the optimum (Opt).]

Summary

Optimization in machine learning: sum-over-data structure
Traditional methods: gradient based batch algorithms
do not take advantage of the special structure
Recent progress: stochastic optimization with fast rates
takes advantage of the special structure; suitable for a single machine

Distributed computing (data parallelism and synchronous updates)
minibatch SDCA
DANE (batch algorithm on each machine + synchronization)

Other approaches
algorithmic side: ADMM, asynchronous updates (Hogwild!), etc.
system side: distributed vector computing (parameter servers); Baidu has an industry-leading solution

Fast developing field; many exciting new ideas

References

Rie Johnson and TZ. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, NIPS 2013.
Lin Xiao and TZ. A Proximal Stochastic Gradient Method with Progressive Variance Reduction, SIAM Journal on Optimization, to appear.
Shai Shalev-Shwartz and TZ. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization, JMLR 14:567-599, 2013.
Shai Shalev-Shwartz and TZ. Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization, Mathematical Programming, to appear.
Shai Shalev-Shwartz and TZ. Accelerated Mini-Batch Stochastic Dual Coordinate Ascent, NIPS 2013.
Ohad Shamir, Nathan Srebro, and TZ. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method, ICML 2014.