Chapter 8 Proximal Point Method and Augmented Lagrangian Method
Linearly Constrained Problem

Consider the linearly constrained problem:

    min f(x)  s.t.  Ax = b,

where A ∈ R^{p×n}. A natural thought is the gradient projection method:

    x^{k+1} := P_{x: Ax=b}(x^k − t_k ∇f(x^k))

But projection onto the affine set {x : Ax = b} is not easy:

    P_{x: Ax=b}(u) = argmin_{Ax=b} ||x − u||_2^2 = u + A^T (A A^T)^{−1} (b − Au)

This is inexpensive only if p ≪ n, or A A^T = I, etc. Usually, we want to avoid computing the projection onto an affine set.
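When the projection is affordable, it amounts to a single linear solve. Below is a minimal numpy sketch of the formula above (the name project_affine is illustrative, not from the lecture); it solves the p×p system (A A^T) y = b − Au rather than forming the inverse.

```python
import numpy as np

def project_affine(u, A, b):
    """Projection of u onto {x : Ax = b}, i.e. u + A^T (A A^T)^{-1} (b - Au).

    Solves the p-by-p system (A A^T) y = b - Au instead of forming the
    inverse; this O(p^3) solve is cheap only when p is small (p << n)
    or A A^T has special structure, which is the point made above.
    """
    y = np.linalg.solve(A @ A.T, b - A @ u)
    return u + A.T @ y

# sanity check: the projection satisfies the constraint
A = np.random.randn(3, 10); b = np.random.randn(3); u = np.random.randn(10)
assert np.allclose(A @ project_affine(u, A, b), b)
```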
Augmented Lagrangian Method

For the linearly constrained problem, the augmented Lagrangian method (ALM) is usually used. Define the augmented Lagrangian function:

    L_t(x; λ) := f(x) − λ^T (Ax − b) + (t/2) ||Ax − b||_2^2

where λ is called the Lagrange multiplier or the dual variable. The augmented Lagrangian method is (starting with λ^0 = 0):

    x^{k+1} := argmin_x L_t(x; λ^k)
    λ^{k+1} := λ^k − t (Ax^{k+1} − b)

ALM is in fact the proximal point method applied to the dual problem.

8.1. Proximal Point Method

    min_x f(x)

where f(x) is a closed convex function. The proximal point method (PPM) iterates:

    x^k = prox_{t_k f}(x^{k−1}) = argmin_u { f(u) + (1/(2t_k)) ||u − x^{k−1}||_2^2 }

• can be viewed as the proximal gradient method with g(x) = 0
• of interest if the prox operator is much easier than minimizing f directly
• a practical algorithm if inexact prox evaluations are used
• step size t_k > 0 affects the number of iterations
• a small code sketch of this iteration follows the convergence proof below

Convergence

assumptions
• f is closed and convex (hence prox_{tf}(x) is uniquely defined for all x)
• the optimal value f* is finite and attained at x*

result

    f(x^k) − f* ≤ ||x^0 − x*||_2^2 / (2 Σ_{i=1}^k t_i),  for k ≥ 1

• implies convergence if Σ_i t_i → ∞
• the rate is 1/k if t_i is fixed, or variable but bounded away from 0
• t_i is arbitrary; however, the number of iterations depends on t_i

Proof: apply the analysis of the proximal gradient method with g(x) = 0
• since g is zero, inequality (1) on page 7-4 holds for any t > 0
• from page 7-6, f(x^i) is nonincreasing and

    t_i (f(x^i) − f*) ≤ (1/2) ( ||x^{i−1} − x*||_2^2 − ||x^i − x*||_2^2 )

• combine the inequalities for i = 1 to i = k to get

    (Σ_{i=1}^k t_i) (f(x^k) − f*) ≤ Σ_{i=1}^k t_i (f(x^i) − f*) ≤ (1/2) ||x^0 − x*||_2^2
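As promised above, here is a minimal sketch of the plain iteration, assuming numpy and a prox oracle; proximal_point and the soft-thresholding example are illustrative names, not part of the lecture.

```python
import numpy as np

def proximal_point(prox, x0, step, num_iters=100):
    """Proximal point method: x^k = prox_{t_k f}(x^{k-1}).

    prox(x, t) must return argmin_u f(u) + ||u - x||_2^2 / (2 t);
    step(k) returns the step size t_k > 0.
    """
    x = x0
    for k in range(1, num_iters + 1):
        x = prox(x, step(k))
    return x

# example: f(x) = ||x||_1, whose prox is componentwise soft-thresholding
soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
x = proximal_point(soft, np.array([3.0, -1.5, 0.2]), step=lambda k: 1.0)
```

With this fixed step the iterates shrink to the minimizer x = 0 of ||x||_1 after a few prox evaluations, consistent with the 1/k bound above.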
Accelerated proximal point algorithms

FISTA (take g(x) = 0 on p. 7-20): choose x^0 = x^{−1} and for k ≥ 1

    x^k = prox_{t_k f} ( x^{k−1} + θ_k ((1 − θ_{k−1})/θ_{k−1}) (x^{k−1} − x^{k−2}) )

Nesterov's 2nd method (p. 7-34): choose x^0 = v^0 and for k ≥ 1

    v^k = prox_{(t_k/θ_k) f}(v^{k−1}),   x^k = (1 − θ_k) x^{k−1} + θ_k v^k

possible choices of parameters
• fixed steps: t_k = t and θ_k = 2/(k + 1)
• variable steps: choose any t_k > 0, θ_1 = 1, and for k > 1 solve θ_k from

    (1 − θ_k) t_k / θ_k^2 = t_{k−1} / θ_{k−1}^2

Convergence

assumptions
• f is closed and convex
• the optimal value f* is finite and attained at x*

result

    f(x^k) − f* ≤ 2 ||x^0 − x*||_2^2 / ( 2√t_1 + Σ_{i=2}^k √t_i )^2,  for k ≥ 1

• implies convergence if Σ_i √t_i → ∞
• the rate is 1/k^2 if t_i is fixed, or variable but bounded away from zero

Proof: follows from the analysis in lecture 7 with g(x) = 0
• since g is zero, the first inequalities on p. 7-28 and p. 7-36 hold for any t > 0
• therefore the conclusions on p. 7-29 and p. 7-37 hold:

    f(x^k) − f* ≤ (θ_k^2 / (2 t_k)) ||x^0 − x*||_2^2

• for fixed step size t_k = t, θ_k = 2/(k + 1):

    θ_k^2 / (2 t_k) = 2 / ((k + 1)^2 t)

• for variable step size, we proved in lecture 7 that

    θ_k^2 / (2 t_k) ≤ 2 / ( 2√t_1 + Σ_{i=2}^k √t_i )^2
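Here is a minimal sketch of the FISTA-type variant with the fixed parameters t_k = t, θ_k = 2/(k + 1); accelerated_ppm is an illustrative name, and the prox oracle has the same interface as in the earlier sketch.

```python
import numpy as np

def accelerated_ppm(prox, x0, t, num_iters=100):
    """FISTA with g = 0: extrapolate with coefficient
    theta_k (1/theta_{k-1} - 1), then apply prox_{t f}.
    Uses fixed steps t_k = t and theta_k = 2/(k+1)."""
    x_prev = x = np.asarray(x0, dtype=float)
    theta_prev = 1.0          # theta_1 = 1 and x^0 = x^{-1}: no extrapolation at k = 1
    for k in range(1, num_iters + 1):
        theta = 2.0 / (k + 1)
        y = x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
        x_prev, x = x, prox(y, t)
        theta_prev = theta
    return x
```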
8.2. Augmented Lagrangian Method

primal problem

    min f(x)  s.t.  Ax = b

Lagrangian function

    L(x; λ) = f(x) − λ^T (Ax − b)

dual problem

    max_λ inf_x L(x; λ) = −f*(A^T λ) + b^T λ

optimality conditions: x, λ are optimal if
• x ∈ dom f, Ax = b
• A^T λ ∈ ∂f(x)

The augmented Lagrangian method is the proximal point method applied to the dual problem.

Proximal mapping of dual function

The proximal mapping of h(λ) = f*(A^T λ) − b^T λ is

    prox_{th}(λ) = argmin_u { f*(A^T u) − b^T u + (1/(2t)) ||u − λ||_2^2 }

dual expression: prox_{th}(λ) = λ − t(Ax̂ − b), where

    x̂ = argmin_x { f(x) − λ^T (Ax − b) + (t/2) ||Ax − b||_2^2 }

i.e., x̂ minimizes the augmented Lagrangian function.

The 1st proof

    min_u f*(A^T u) − b^T u + (1/(2t)) ||u − λ||_2^2
    ⇔ min_u max_x ⟨x, A^T u⟩ − f(x) − b^T u + (1/(2t)) ||u − λ||_2^2
    ⇔ max_x min_u ⟨x, A^T u⟩ − f(x) − b^T u + (1/(2t)) ||u − λ||_2^2

The inner problem gives u = λ − t(Ax − b). Thus, the outer problem becomes

    min_x f(x) − ⟨λ, Ax − b⟩ + (t/2) ||Ax − b||_2^2

The 2nd proof

• write the augmented Lagrangian minimization as

    min_{x,z} f(x) + (t/2) ||z||_2^2  s.t.  Ax − b − λ/t = z

• optimality conditions (u is the multiplier for the equality constraint):

    tz + u = 0,   Ax − b − λ/t = z,   A^T u ∈ ∂f(x)

• eliminating x, z gives u = λ − t(Ax − b) and

    0 ∈ A ∂f*(A^T u) − b + (u − λ)/t

and this is the optimality condition for u = prox_{th}(λ)

Augmented Lagrangian Method

choose an initial λ^0 and repeat:
1. minimize the augmented Lagrangian function

    x^{k+1} = argmin_x f(x) − ⟨λ^k, Ax − b⟩ + (t/2) ||Ax − b||_2^2

2. update the Lagrange multiplier (dual variable)

    λ^{k+1} = λ^k − t(Ax^{k+1} − b)

• also known as the method of multipliers, or Bregman iteration
• the proximal point method applied to the dual problem
• as variants, one can apply the fast proximal point methods to the dual
• usually implemented with inexact minimization in step 1

Example: basis pursuit

A variant of ℓ1 minimization in compressed sensing and Lasso:

    min ||x||_1  s.t.  Ax = b,

where A ∈ R^{m×n} and b ∈ R^m. Augmented Lagrangian method:
1. minimize the augmented Lagrangian function

    x^{k+1} = argmin_x ||x||_1 − ⟨λ^k, Ax − b⟩ + (t/2) ||Ax − b||_2^2

2. update the Lagrange multiplier (dual variable)

    λ^{k+1} = λ^k − t(Ax^{k+1} − b)

• How to solve the problem in step 1?
• note that it is equivalent to

    min_x ||x||_1 + (t/2) ||Ax − b − λ^k/t||_2^2

• so one can use the proximal gradient method (ISTA) to solve it to a certain accuracy (a sketch combining the two loops appears after the comparison below)

ALM vs PGM

Three variants of the ℓ1 minimization problem:
1. ℓ1-norm regularized problem

    min τ ||x||_1 + (1/2) ||Ax − b||_2^2

2. ℓ1-ball constrained problem

    min ||Ax − b||_2^2  s.t.  ||x||_1 ≤ η

3. basis pursuit problem

    min ||x||_1  s.t.  Ax = b

solvers:
• use the proximal gradient method to solve 1 (involves computing the prox operator of the ℓ1 norm)
• use the gradient projection method to solve 2 (involves computing the projection onto the ℓ1 ball)
• use ALM to solve 3 (involves inexactly minimizing the augmented Lagrangian function)

advantages:
• the basis pursuit problem requires the equality to hold (important for many applications)
• ALM converges such that Ax^k − b → 0
• while the other 2 models cannot guarantee this
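As referenced above, here is a minimal sketch of ALM for basis pursuit with the step-1 subproblem solved inexactly by ISTA; alm_basis_pursuit and the fixed iteration counts are illustrative choices, not prescriptions from the lecture.

```python
import numpy as np

def soft(x, t):
    """Prox of t * ||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def alm_basis_pursuit(A, b, t=1.0, outer_iters=50, inner_iters=200):
    """ALM for min ||x||_1 s.t. Ax = b.

    Step 1 solves min_x ||x||_1 + (t/2) ||Ax - b - lam/t||_2^2 inexactly
    by ISTA; step 2 is the multiplier update lam <- lam - t (Ax - b).
    """
    m, n = A.shape
    x, lam = np.zeros(n), np.zeros(m)
    L = t * np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the smooth part
    for _ in range(outer_iters):
        c = b + lam / t                      # shifted right-hand side b + lam/t
        for _ in range(inner_iters):         # ISTA on the step-1 subproblem
            grad = t * A.T @ (A @ x - c)
            x = soft(x - grad / L, 1.0 / L)
        lam = lam - t * (A @ x - b)          # dual update
    return x
```

On a random instance with b = A x_true for a sparse x_true, the residual ||Ax^k − b||_2 should be driven toward zero across the outer iterations, which is exactly the guarantee highlighted above for model 3.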
8.3. Moreau-Yosida Smoothing

The Moreau-Yosida regularization (Moreau envelope) of a closed convex f is (with t > 0)

    f_t(x) = inf_u { f(u) + (1/(2t)) ||u − x||_2^2 }
           = f(prox_{tf}(x)) + (1/(2t)) ||prox_{tf}(x) − x||_2^2

immediate properties
• f_t is convex (an infimum over u of a convex function of x, u)
• the domain of f_t is R^n (recall that prox_{tf}(x) is defined for all x)

Examples

indicator function: the smoothed f is the squared Euclidean distance

    f(x) = I_C(x),   f_t(x) = (1/(2t)) d(x)^2

1-norm: the smoothed function is the Huber penalty

    f(x) = ||x||_1,   f_t(x) = Σ_{k=1}^n φ_t(x_k)

where

    φ_t(z) = z^2/(2t)  if |z| ≤ t,   |z| − t/2  if |z| ≥ t

Conjugate of Moreau Envelope

    f_t(x) = inf_u { f(u) + (1/(2t)) ||u − x||_2^2 }

• f_t can be equivalently written as

    f_t(x) = inf_{u+v=x} { f(u) + (1/(2t)) ||v||_2^2 }

• calculus rule for the conjugate function: if f(x) = inf_{u+v=x} (g(u) + h(v)), then f*(y) = g*(y) + h*(y)
• thus

    (f_t)*(y) = f*(y) + (t/2) ||y||_2^2

• hence the conjugate is strongly convex with parameter t

Gradient of Moreau envelope

    f_t(x) = sup_y { x^T y − f*(y) − (t/2) ||y||_2^2 }

• the maximizer in the definition is unique and satisfies

    x − ty ∈ ∂f*(y)  ⇔  y ∈ ∂f(x − ty)

• the maximizing y is the gradient of f_t:

    ∇f_t(x) = (1/t) (x − prox_{tf}(x)) = prox_{(1/t)f*}(x/t)

• the gradient ∇f_t is Lipschitz continuous with constant 1/t (follows from the nonexpansiveness of prox)

Interpretation of proximal point algorithm

apply the gradient method to minimize the Moreau envelope

    min_x f_t(x) = inf_u { f(u) + (1/(2t)) ||u − x||_2^2 }

this is an exact smooth reformulation of the problem of minimizing f(x):
• the solution x is a minimizer of f
• f_t is differentiable with a Lipschitz continuous gradient (L = 1/t)

gradient update: with fixed t_k = 1/L = t,

    x^{k+1} = x^k − t ∇f_t(x^k) = prox_{tf}(x^k)

this is the proximal point update with constant step size t_k = t

Interpretation of ALM

    x^{k+1} = argmin_x f(x) − ⟨λ^k, Ax − b⟩ + (t/2) ||Ax − b||_2^2
    λ^{k+1} = λ^k − t(Ax^{k+1} − b)

• the dual problem is max_λ inf_x L(x; λ), i.e.,

    min_λ − inf_x L(x; λ)

• the dual function (to be minimized) is −inf_x L(x; λ)
• the smoothed dual function is

    f_t(λ) = inf_u { −inf_x L(x; u) + (1/(2t)) ||u − λ||_2^2 }

• the gradient of f_t is ∇f_t(λ) = (λ − u)/t, where u is the optimal solution of

    max_u { inf_x L(x; u) − (1/(2t)) ||u − λ||_2^2 }

i.e., −(Ax − b) − (u − λ)/t = 0 and x minimizes

    f(x) − ⟨u, Ax − b⟩ − (1/(2t)) ||u − λ||_2^2

so that u = λ − t(Ax − b)
• thus, with fixed t, the dual update of ALM is a gradient step applied to the smoothed dual
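To close, a minimal numerical check of the 1-norm example and the gradient formula, assuming numpy; moreau_env_l1 and huber are illustrative names. It verifies that f_t equals the Huber penalty and that both expressions for ∇f_t agree, using the fact that for f = ||·||_1 the conjugate f* is the indicator of the unit ℓ∞ ball, so prox_{(1/t)f*}(x/t) is the projection of x/t onto that ball.

```python
import numpy as np

def soft(x, t):
    """prox_{t ||.||_1}(x): componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def moreau_env_l1(x, t):
    """f_t(x) = f(prox_{tf}(x)) + ||prox_{tf}(x) - x||_2^2 / (2t), f = ||.||_1."""
    u = soft(x, t)
    return np.sum(np.abs(u)) + np.sum((u - x) ** 2) / (2 * t)

def huber(x, t):
    """Sum of phi_t(x_k): z^2/(2t) for |z| <= t, |z| - t/2 otherwise."""
    return np.sum(np.where(np.abs(x) <= t, x**2 / (2 * t), np.abs(x) - t / 2))

t = 0.7
x = np.array([2.0, -0.3, 0.5, -4.0])

# the Moreau envelope of the 1-norm is the Huber penalty
assert np.isclose(moreau_env_l1(x, t), huber(x, t))

# both gradient expressions agree: (x - prox_{tf}(x))/t equals the
# projection of x/t onto the unit l-infinity ball (prox of (1/t) f*)
assert np.allclose((x - soft(x, t)) / t, np.clip(x / t, -1.0, 1.0))
```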