Convex Optimization Lecture Notes (Incomplete)
[email protected]
January 21, 2015

Contents

1 Introduction
2 Convex Sets
  2.1 Types of Sets
    2.1.1 Affine Sets
    2.1.2 Convex Sets
    2.1.3 Cones
  2.2 Important Sets
    2.2.1 Norm Balls / Norm Cones
    2.2.2 Positive Semidefinite Cone
  2.3 Operations that Preserve Convexity
    2.3.1 Linear-fractional and Perspective Functions
  2.4 Generalized Inequalities
    2.4.1 Positive Semidefinite Cone
    2.4.2 Minimum and minimal Elements
  2.5 Supporting Hyperplane Theorem
  2.6 Dual Cones
    2.6.1 Dual Generalized Inequality
    2.6.2 Dual Generalized Inequality and Minimum/Minimal Elements
3 Convex Functions
  3.1 Basic Properties and Definitions
    3.1.1 Extended-value Extensions
    3.1.2 Equivalent Conditions
    3.1.3 Examples
    3.1.4 Epigraph
    3.1.5 Jensen's Inequality
  3.2 Operations That Preserve Convexity
    3.2.1 Composition
  3.3 The Conjugate Function
  3.4 Quasiconvex Functions
    3.4.1 Properties
    3.4.2 Operations That Preserve Quasiconvexity
  3.5 Log-concavity and Log-convexity
4 Convex Optimization Problems
  4.1 Optimization Problems
    4.1.1 Feasibility Problem
    4.1.2 Transformations and Equivalent Problems
  4.2 Convex Optimization
    4.2.1 Optimality Conditions
    4.2.2 Equivalent Convex Problems
  4.3 Quasiconvex Optimization
    4.3.1 Solving Quasiconvex Optimization Problems
    4.3.2 Example: Convex over Concave
  4.4 Linear Optimization Problems
    4.4.1 Examples
    4.4.2 Linear-fractional Programming
  4.5 Quadratic Optimization
    4.5.1 Examples
    4.5.2 Second Order Cone Programming
    4.5.3 SOCP: Robust Linear Programming
    4.5.4 SOCP: Stochastic Inequality Constraints
  4.6 Geometric Programming
    4.6.1 Monomials and Posynomials
    4.6.2 Geometric Programming
    4.6.3 Convex Transformation
  4.7 Generalized Inequality Constraints
    4.7.1 Conic Form Problems
    4.7.2 SDP: Semidefinite Programming
  4.8 Vector Optimization
    4.8.1 Optimal Values
    4.8.2 Pareto Optimality
    4.8.3 Scalarization for Pareto Optimality
    4.8.4 Examples
5 Duality
  5.1 The Lagrangian And The Dual
    5.1.1 The Lagrangian
    5.1.2 The Lagrangian Dual Function
    5.1.3 Lagrangian Dual and Lower Bounds
    5.1.4 Intuitions Behind Lagrangian Dual
    5.1.5 LP example and Finite Dual Conditions
    5.1.6 Conjugate Functions and Lagrange Dual
  5.2 The Lagrange Dual Problem
    5.2.1 Duality Gap
    5.2.2 Strong Duality And Slater's Constraint Qualification
    5.2.3 Examples
  5.3 Geometric Interpretation
    5.3.1 Strong Duality of Convex Problems
  5.4 Optimality Conditions
    5.4.1 Certificate of Suboptimality and Stopping Criteria
    5.4.2 Complementary Slackness
    5.4.3 KKT Optimality Conditions
  5.5 Solving The Primal Problem via The Dual
  5.6 Sensitivity Analysis
    5.6.1 Global Sensitivity
    5.6.2 Local Sensitivity
  5.7 Examples and Reformulating Problems
    5.7.1 Introducing variables
    5.7.2 Making explicit constraints implicit
    5.7.3 Transforming the objective
  5.8 Generalized Inequalities
6 Approximation and Fitting
  6.1 Norm Approximation
    6.1.1 Examples
    6.1.2 Different Penalty Functions and Their Consequences
    6.1.3 Outliers and Robustness
    6.1.4 Least-norm Problems
  6.2 Regularized Approximations
    6.2.1 Bi-criterion Formulation
    6.2.2 Regularization
    6.2.3 ℓ1 Regularization
    6.2.4 Signal Reconstruction Problem
  6.3 Robust Approximation
    6.3.1 Stochastic Formulation
    6.3.2 Worst-case Robust Approximation
  6.4 Function Fitting
    6.4.1 Constraints
    6.4.2 Sparse Descriptions and Basis Pursuit
    6.4.3 Checking Model Consistency
7 Statistical Estimation
  7.1 Parametric Distribution Estimation
    7.1.1 Logistic Regression Example
    7.1.2 MAP Estimation
  7.2 Nonparametric Distribution Estimation
    7.2.1 Priors
    7.2.2 Objectives
  7.3 Optimal Detector Design And Hypothesis Testing
  7.4 Chebyshev and Chernoff Bounds
    7.4.1 Chebyshev Bounds
    7.4.2 Chernoff Bounds
  7.5 Experiment Design
    7.5.1 Further Modeling
8 Geometric Problems
  8.1 Point-to-Set Distance
    8.1.1 PCA Example
  8.2 Distance between Sets
  8.3 Euclidean Distance and Angle Problems
    8.3.1 Expressing Constraints in Terms of G
    8.3.2 Well-Condition Constraints
    8.3.3 Examples
  8.4 Extremal Volume Ellipsoids
    8.4.1 Lowner-John Ellipsoid
    8.4.2 Maximum Volume Inscribed Ellipsoid
    8.4.3 Affine Invariance
  8.5 Centering
    8.5.1 Chebychev Center
    8.5.2 Maximum Volume Ellipsoid Center
    8.5.3 Analytic Center
  8.6 Classification
    8.6.1 Linear Discrimination
    8.6.2 Robust Linear Discrimination
    8.6.3 Nonlinear Discrimination
  8.7 Placement and Location
  8.8 Floor Planning
9 Numerical Linear Algebra Background
10 Unconstrained Optimization
  10.1 Unconstrained Minimization Problems
    10.1.1 Strong Convexity
    10.1.2 Condition Number of Sublevel Sets
  10.2 Descent Methods
  10.3 Gradient Descent
    10.3.1 Performance Analysis on Toy Problems
  10.4 Steepest Descent
    10.4.1 Steepest Descent With an ℓ1 Norm
    10.4.2 Performance and Choice of Norm
  10.5 Newton's Method
    10.5.1 The Newton Decrement
    10.5.2 Newton's Method
    10.5.3 Convergence Analysis
    10.5.4 Summary
  10.6 Self-Concordant Functions

1 Introduction

2 Convex Sets

This chapter introduces numerous convex sets.

2.1 Types of Sets

2.1.1 Affine Sets

A set is affine when it contains every line that passes through any two points belonging to the set. Examples:

• Hyperplanes
• Solution sets of systems of linear equations

An affine combination is a linear combination whose coefficients sum to 1. An affine set is closed under affine combinations.

2.1.2 Convex Sets

A set is convex when it contains the line segment between any two points belonging to the set. A convex combination is an affine combination with nonnegative coefficients. A convex set is closed under convex combinations. All affine sets are convex.

2.1.3 Cones

A set is a cone when every nonnegative scalar multiple of any of its elements belongs to the set. A cone looks like an ice-cream cone: it starts at the origin and gets wider as you move away from the origin. It is analogous to a pie slice. Given a point in a cone, the cone contains the ray (half-line) that originates at the origin and passes through that point. A conic combination is a linear combination with nonnegative coefficients. A convex cone is closed under conic combinations.

2.2 Important Sets

• Hyperplanes are affine.
• Halfspaces are convex. • Polyhedra/polyhedron are intersections of halfspaces, so are convex. • Euclidean balls are defined by ||2 . They are convex. Elipsoids are related to balls, related by a positive definite matrix P . 5 2.2.1 Norm Balls / Norm Cones Given a norm, a norm ball replaces the L2 norm for Euclidean balls. A norm cone is a different beast; it is the set C = {(x, t) | kxk ≤ t, t ≥ 0} ∈ Rn+1 where x ∈ Rn . The norm cone with an L2 norm is called the second-order cone. 2.2.2 Positive Semidefinite Cone The following notations are used: • Sn is the set of all symmetric n × n matrices. • Sn+ is the set of all positive semidefinite n × n matrices. • Sn++ are positive definite. Because θ1 , θ2 ≥ 0, A, B ∈ Sn+ =⇒ θ1 A + θ2 B ∈ Sn+ the set is a convex cone. 2.3 Operations that Preserve Convexity Some functions preserve convex-ness of the set. • Intersections • Taking image under an affine function 2.3.1 Linear-fractional and Perspective Functions The perspective function P (z, t) = z/t preserves convexity. (The domain needs that t > 0) Similarly, a linear-fractional function, f (x) = Ax + b cT x + b which is formed by combining the perspective function with an affine function, preserves convexity as well. 2.4 Generalized Inequalities A cone can be used to define inequalities, if it meets certain criteria. It goes like: x K y ⇐⇒ y − x ∈ K 6 where K is a cone. It certainly makes sense; since K contains any rays, we get transitivity. If K is pointed at the origin, we get asymmetry. Since K contains the origin, we get reflexivity. This is an useful concept that will be exploited later in the course a lot. 2.4.1 Positive Semidefinite Cone Actually, using the positive semidefinite cone to compare matrices is a standard practice so for the rest of the book, matrix inequalities are automatically done used PSD cone. 2.4.2 Minimum and minimal Elements Generalized inequalities do not always give you a single element that is the minimum; we sometimes get a class of elements that are not smaller than any other elements, and are incomparable to each other. 2.5 Supporting Hyperplane Theorem This has not proved to be useful in the course yet; when it is needed I will come back and fill it in... 2.6 Dual Cones Let K be a cone. Then the dual cone is that K ∗ = y|xT y ≥ 0 ∀x ∈ K The idea of it is hard to express intuitively. When the cone is sharp, the dual cone will be obstuse, and vice versa. Also, when K is a proper cone, K∗∗ = K. 2.6.1 Dual Generalized Inequality The relationship K∗ is a generalized inequality induced by the dual cone K∗. In some ways it can relate to the original inequality K . Most notably, x K y ⇐⇒ y − x ∈ K ⇐⇒ λT (y − x) ≥ 0 ⇐⇒ λT x ≤ λT y where λ is any element of K∗. The takeaway of this is that we can use dual cones to compare values with respect to the original cone. See below for how this is used. 2.6.2 Dual Generalized Inequality and Minimum/Minimal Elements Minimum Element From above, we can see x ∈ S will be a minimum element with respect to K, when it is a unique minimizer λT x for all λ ∈ K∗. Geometrically, when x is a minimum element, the hyperplane passing through x with λ as a normal vector z|λT (x − z) = 0 is a strict supporting hyperplane; it touches S only at x. To see this: say x 6= y ∈ S. Since λT x < λT y by our assumption, we have λT (y − z) > 0. Minimal Elements Here, we have a gap between necessary and sufficient conditions. Necessity If x is a minimizer of λT x for all λ ∈ K∗, x is a minimal element. 
Sufficiency Even if x is a minimal element, it is possible that x is not a minimizer of λT x for some λ ∈ K∗. 7 However, if S is convex, the two conditions are indeed equivalent. 3 Convex Functions 3.1 Basic Properties and Definitions A function f is convex if the domain is convex, and for any x, y ∈ domf and θ ∈ [0, 1] we have θ · f (x) + (1 − θ) · f (y) ≥ f (θx + (1 − θ) y) Colloquially, you can say “the chord is above the curve”. 3.1.1 Extended-value Extensions We can augment a function f : f (x) x ∈ domf f˜ (x) = ∞ otherwise which is called the extended-value extension of f . Care must be taken in ensuring the extension is still convex/concave: the extension can break such properties. 3.1.2 Equivalent Conditions The following conditions, paired with the domain being convex, is equivalent to convexity. First Order Condition T f (y) ≥ f (x) + ∇f (x) (y − x) Geometrically, you can explain this as the tangential hyperplane which meets a convex function at a certain point, it will be a global lower bound in the entire domain. It has outstanding consequences; by examining a single point, we get information about the entire function. Second Order Condition ∇2 f (x) 0 which says the Hessian is positive semidefinite. 3.1.3 Examples • ln x is concave on R++ • eax is convex on R • xa is convex on R++ if a ≥ 1, concave 0 ≤ a ≤ 1 a • |x| is convex on R • Negative entropy: x log x is convex on R++ 8 • Norms: every norm is convex on Rn • Max functions: max {x1 , x2, x3 , · · · } is convex in Rn • Quadratic-over-linear: x2 /y is convex in {y > 0} • Log-sum-exp: as a “soft” approximation of the max function, f (x) = log ( • Geometric mean: ( Qn i=1 xi ) 1/n P exi ) is convex on Rn . is concave on Rn++ • Log-determinant: log det X is convex on positive definite X. 3.1.4 Epigraph An epigraph for a function f : Rn → R is a set of points “above” the graph. Naturally, it’s a subset of Rn+1 and sometimes is an useful way to think about convex functions. A function is convex iff its epigraph is convex! 3.1.5 Jensen’s Inequality The definition of convex function extends to multiple, even infinite, sums: f as long as 3.2 P X X xi θ i ≤ f (xi ) θi θ = 1. This can be used to prove a swath of other inequalities. Operations That Preserve Convexity • Nonnegative weighted sums • Composition with an affine mapping: if f is convex/concave, f (Ax + b) also is. • Pointwise maximum/supremum: f (x) = maxi fi (x) is convex if all fi are convex. This can be used to prove that: – Sum of k largest elements is convex: it’s the max of n k combinations. • Max eigenvalue of symmetric matrix is convex. • Minimization: g (x) = inf f (x, y) y∈C is convex if f is convex. You can prove this by using epigraphs – you are slicing the epigraph of g. • Perspective function 9 3.2.1 Composition Setup: h: Rk → R and g: Rn → Rk . The composition f (x) = h (g (x)). Then: • f is convex: – if h is convex and nondecreasing, and g is convex, or – if h is convex and nonincreasing, and g is concave • f is concave: – if h is concave and nondecreasing, and g is concave, or – if h is concave and nonincreasing, and g is convex To summarize: if h is nondecreasing, and h and g has same curvature (convexity/concavity), f follows. If h is nonincreasing, and h and g have differing curvature, f follows h. When determining nonincreasing/nondecreasing property of h, use its extended-value extension. Since extended-value extension can break nonincreasing/nondecreasing properties, you might want to come up with alternative definitions which are defined everywhere. 
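These composition rules are exactly what "disciplined convex programming" (DCP) checkers in modeling tools implement. As a minimal sketch (my own, not from the notes), assuming the Python package cvxpy is available, we can ask it for the curvature of a few composed expressions and compare the answers with the rules above:

    import cvxpy as cp

    x = cp.Variable(3)

    e1 = cp.exp(cp.norm(x, 2))        # convex, nondecreasing h of a convex g -> convex
    e2 = cp.log(cp.min(x) + 5)        # concave, nondecreasing h of a concave g -> concave
    e3 = cp.inv_pos(cp.min(x))        # convex, nonincreasing h of a concave g -> convex
    e4 = cp.sqrt(cp.sum_squares(x))   # concave, nondecreasing h of a convex g -> not certified

    for e in (e1, e2, e3, e4):
        print(e.curvature, e.is_convex())   # e1-e3 get curvature certificates; e4 does not

The last expression is in fact convex (it equals the ℓ2 norm of x), but the composition rules alone cannot certify it; rewriting it as cp.norm(x, 2) would.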
Composition Examples • exp g (x) is convex if g is. • log g (x) is concave if g is concave and positive. • 1 g(x) is convex if g is positive and concave. p • g (x) is convex if p ≥ 1 and g is convex and nonnegative. Vector Composition If f (x) = h (g1 (x) , g2 (x) , g3 (x) , · · · ) the above rules still hold, except: • All gs need to have the same curvature. • h need to be nonincreasing/nondecreasing with respect to every input. 3.3 The Conjugate Function Given an f : Rn → R, the function f ∗: Rn → R is the conjugate: f ∗ (y) = sup y T x − f (x) x∈domf It is closely related to Lagrange Duals. I will omit further materials on this. 10 3.4 Quasiconvex Functions Quasiconvex functions are defined by having all its sublevel sets convex. Colloquially, they are “unimodal” functions. Quasiconvex functions are solved by bisection method + solving feasibility problems. The linear-fractional function f (x) = aT x + b cT x + b is quasiconvex. (Can prove: let f (x) = α and try to come up with the definition of the sublevel set.) Remember how we solved this by bisection methods? :-) 3.4.1 Properties T • First-order condition: f (y) ≤ f (x) =⇒ ∇f (x) (y − x) ≤ 0. Note: the gradient defines a supporting hyperplane. Since f is quasiconvex, all points y with f (y) ≤ f (x) must lie in one side of the hyperplane. • Second-order condition: f 00 (x) ≥ 0 3.4.2 Operations That Preserve Quasiconvexity • Nonnegative weighted sum • Composition: if h is nondecreasing and g is quasiconvex, then h (g (x)) is quasiconvex. • Composition with affine/linear fractional transformation. • Minimization along an axis: g (x) = miny f (x, y) is quasiconvex if f is. 3.5 Log-concavity and Log-convexity The function is log-concave or log-convex if its log is concave/convex. • The pdf of a Gaussian distribution is log-concave. • The gamma function is log-concave. 4 4.1 Convex Optimization Problems Optimization Problems A typical optimization problem formulation looks like: minimize f0 (x) subject to fi (x) ≤ 0 (i = 1, · · · , m) hi (x) = 0 (i = 1, · · · , p) 4.1.1 Feasibility Problem There are cases where you want to find a single x which satisfies all equality and inequality constraints. These are feasibility problems. 11 4.1.2 Transformations and Equivalent Problems Each problem can have multiple representations which are same in nature but expressed differently. Different expression can have different properties. • Nonzero equivalence constraints: move everything to LHS. • Minimization/maximization: flip signs. • Transformation of objective/constraint function: transforming objective functions through monotonic functions can yield an equivalent problem. • Slack variables: f (x) ≤ 0 is swapped out by f (x) + s = 0 and s ≥ 0. • Swapping implicit/explicit constraint: move an implicit constraint to explicit, by using extended value extension. 4.2 Convex Optimization A convex optimization problem looks just like the above definition, but have a few differences: • All f s are convex. • All gs are affine. Thus, the set of equivalence constraints can be expressed by aTi x = bi thus Ax = b Note all convex optimization problems do not have any locally optimal points; all optimals are global. Also, another important property arises from them: the feasible set of a convex optimization problem is convex, because it is an intersection of convex and affine sets – a sublevel set of a convex function is convex, and the feasible set is an intersection of sublevel sets and an affine set. 
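To make the standard form concrete, here is a minimal sketch (my own, not from the notes) of a small convex problem in exactly this shape, written with the cvxpy modeling package and random placeholder data, both of which are assumptions:

    import cvxpy as cp
    import numpy as np

    np.random.seed(0)
    A = np.random.randn(3, 5)
    b = A @ np.random.randn(5)           # chosen so the equality constraint is satisfiable
    x = cp.Variable(5)

    prob = cp.Problem(
        cp.Minimize(cp.sum_squares(x)),  # f0: convex objective
        [cp.norm(x, 1) - 10 <= 0,        # f1(x) <= 0 with f1 convex
         A @ x == b])                    # affine equality constraints Ax = b
    prob.solve()
    print(prob.status, prob.value)       # any optimum the solver reports is global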
Another important thing to note is that convexity is a function of the problem description; different formulations can make a non-convex problem convex and vice versa. It’s one of the most important points of the course. 4.2.1 Optimality Conditions If the objective function f0 is differentiable, x is the unique optimal solution if for all feasible y, we have T ∇f0 (x) (y − x) ≥ 0 this comes trivially from the first order condition of a quasiconvex function (a convex function is always quasiconvex). Optimality conditions for some special (mostly trivial) cases of problems are discussed: • Unconstrained problem: Set gradient to 0. • Only equality constraint: We can derive the below from the general optimality condition described above. ∇f0 (x) + AT ν = 0 (ν ∈ Rp ) T Here’s a brief outline: for any feasible y, we need to have ∇f0 (x) (y − x) ≥ 0. Note that y − x ∈ N (A) ⊥ (the null space) because Ax = b = Ay =⇒ A (x − y) = 0. Now, this means ∇f0 (x) ∈ N (A) = R AT , the last term being the column space of AT . Now, we can let 12 ∇f0 (x) = AT (−ν) for some ν and we are done. • Minimize over nonnegative orthant: for each i, we need to have ∇f (x) = 0 0 i ∇f (x) ≥ 0 0 i if xi > 0 if xi = 0 This is both intuitive, and relates to a concept (KKT conditions) which is discussed later. If xi is at the boundary, it can have some positive gradient along that axis; we can still be optimal because decreasing xi will make it infeasible. Otherwise, we need to have zero gradients. 4.2.2 Equivalent Convex Problems Just like what we discussed for general optimization problems, but specially for convex problems. • Eliminating affine constraint: Ax = b is equivalent to x ∈ F z + x0 where the column space of F is N (A) and x0 is a particular solution of Ax = b. So we can just put f0 (F z + x0 ) in place of f0 (x). • “Uneliminating” affine constraint: going in the opposite direction; if we are dealing with f0 (Ai x + bi ), let Ai x + bi = yi . – On eliminating/uneliminating affine constraint: on a naive view, eliminating affine constraint always seem like a good idea. However, it isn’t so; it is usually better to keep the affine constraint, and only do the elimination if it is immediately computationally advantageous. (This will be discussed later in the course.. but I don’t remember where) • Slack variables • Epigraph form: minimize t subject to f0 (x) − t ≤ 0 is effectively minimizing f0 (x). This seems stupid, but this gives us a convenient framework, because we can make objectives linear. 4.3 Quasiconvex Optimization When the objective function is quasiconvex, it is called a quasiconvex optimization problem. The biggest difference is that we will now have local optimal points; quasiconvex functions are allowed to have “flat” portions which give rise to local optimal points. 4.3.1 Solving Quasiconvex Optimization Problems Quasiconvex optimization problems are solved by bisection methods; at each iteration we ask if the sublevel set empty for given threshold. We can solve this by a convex feasibility problem. 13 4.3.2 Example: Convex over Concave Say p (x) is convex, q (x) is concave. Then f (x) = p (x) /q (x) is quasiconvex! How do you know? Consider the sublevel set: {x : f (x) ≤ t} = {x : p (x) /q (x) ≤ t} = {x : p (x) − t · q (x) ≤ 0} and p (x) − t · q (x) is convex! So the sublevel sets are convex. 4.4 Linear Optimization Problems In an LP problem, objectives and constraints are all affine functions. LP algorithms are very, very advanced and all these problems are readily solvable in today’s computers. 
It is a very mature technology. 4.4.1 Examples • Chebyshev Center of a Polyhedron: note that the ball lying inside a halfplane aTi x ≤ bi can be represented as kuk2 ≤ r =⇒ aTi (xc + u) ≤ bi Since sup aTi u = r kai k2 kuk2 ≤r we can rewrite the constraint as aTi xc + r kai k2 ≤ bi which is a linear constraint on xc and r. Therefore, having this inequality constraint for all sides of the polyhedron gives a LP problem. • Piecewise-linear minimization. Minimize: max aTi x + bi i This is equivalent to LP: minimize t subject to aTi x + bi ≤ t! This can be a quick, dirty, cheap way to solve convex optimization problems. 4.4.2 Linear-fractional Programming If the objective is linear-fractional, while the constraints are affine, it becomes a LFP problem. This is a quasiconvex problem, but it can also be translated into a LP problem. I will skip the formulation here. 14 4.5 Quadratic Optimization QP is a special kind of convex optimization where the objective is a convex quadratic function, and the constraint functions are affine. 1 minimize xT P x + q T x + r 2 subject to Gx h Ax = b and P ∈ Sn+ . When the inequality constraint is quadratic as well, it becomes a QCQP (Quadratically Constrainted Quadratic Programming) problem. 4.5.1 Examples • Least squares: needs no more introduction. When linear inequality constraints are added, it is no longer analytically solvable, but still is very tractable. • Isotonic regression: we add the following constraint to a least squares algorithm: x1 ≤ x2 ≤ · · · ≤ xn . This is still very easy in QP! • Distance between polyhedra: Minimizing Euclidean distance is a QP problem, and the constraints (two polyhedras) are convex. • Classic Markowitz Portfolio Optimization – Given an expected return vector p¯ and the covariance matrix Σ, find the minimum variance portfolio with expected return greater than, or equal to, rmin . This is trivially representable in QP. – Many extensions are possible; allow short positions, transaction costs, etc. 4.5.2 Second Order Cone Programming SOCP is closely related to QP. It has a linear objective, but a second-order cone inequality constraint: minimize f T x subject to |Ai x + bi |2 ≤ cTi x + di Fx = g The inequality constraint forces the tuple Ai x + bi , cTi + di to lie in the second-order cone in Rn+1 . When ci = 0 for all i, we can make them regular quadratic constraints and this becomes a QCQP. So basically, using a second-order cone instead of a (possibly open) polyhedra in LP. Note that the linear objective does not make SOCP weaker than QCQP. You can minimize t where f0 (x) ≤ t. 15 4.5.3 SOCP: Robust Linear Programming Suppose we have a LP minimize cT x subject to aTi x ≤ bi but the numbers given in the problem could be inaccurate. As an example, let’s just assume that the true value of ai can lie in a ellipsoid defined by Pi , centered at the given value: ai ∈ E = {a¯i + Pi u| kuk2 ≤ 1} and other values (c and bi ) are fixed. We want the inequalities to hold for all possible value of a. The inequality constraint can be cast as sup (a¯i + Pi u) = a¯i T x + PiT x2 ≤ bi kuk2 ≤1 which is actually a SOCP constraint. Note that the additional norm term PiT x2 acts as a regularization term; they prevent x from being large in directions with considerable uncertainty in the parameters ai . 
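A small numerical sketch of this robust LP (my own, not from the notes), with cvxpy assumed and made-up data; the extra norm bound on x is only there to keep the toy instance bounded:

    import cvxpy as cp
    import numpy as np

    np.random.seed(1)
    m, n = 5, 3
    a_bar = np.random.randn(m, n)                         # nominal a_i
    P = [0.1 * np.random.randn(n, n) for _ in range(m)]   # ellipsoid shapes P_i
    b = np.random.rand(m) + 1.0                           # positive, so x = 0 is feasible
    c = np.random.randn(n)

    x = cp.Variable(n)
    # a^T x <= b_i for every a in the ellipsoid  <=>  a_bar_i^T x + ||P_i^T x||_2 <= b_i
    cons = [a_bar[i] @ x + cp.norm(P[i].T @ x, 2) <= b[i] for i in range(m)]
    cons += [cp.norm(x, 2) <= 10]                         # keeps this toy instance bounded
    prob = cp.Problem(cp.Minimize(c @ x), cons)
    prob.solve()
    print(prob.status, prob.value)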
4.5.4 SOCP: Stochastic Inequality Constraints When ai are normally distributed vectors with mean a¯i and covariance matrix Σi , the following constraint P aTi x ≤ bi ≥ η says that a linear inequality constraint will hold with a probability of η or better. This can be cast as a SOCP constraint as well. Since x will be concrete numbers, we can say aTi x ∼ n u¯i , σ 2 . Then P aTi x ≤ bi = P ui − u¯i bi − u¯i ≤ σ σ =P Z≤ bi − u¯i σ ≥ η ⇐⇒ Φ bi − u ¯i σ ≥ η ⇐⇒ bi − µ¯i ≥ Φ−1 (η) σ The last condition can be rephrased as 1/2 a¯i T x + Φ−1 (η) Σi x ≤ bi 2 which is a SOCP constraint. 4.6 Geometric Programming Geometric programming problems involve products of powers of variables, not weighted sums of variables. 4.6.1 Monomials and Posynomials A monomial function f is a product of powers of variables in the form 16 f (x) = cxa1 1 xa2 2 · · · xann where c > 0. A sum of monomials are called a posynomial; which looks like f (x) = K X ck xa1 1k xa2 2k · · · xannk k 4.6.2 Geometric Programming A GP problem looks like: minimize f0 (x) subject to fi (x) ≤ 1 (i = 1, 2, 3, · · · , m) hi (x) = 1 (i = 1, 2, 3, · · · , p) where f are posynomials and h are monomials. The domain of this problem is Rn++ . 4.6.3 Convex Transformation GP problems are not convex in general, but a change of variables will turn a GP into a convex optimization problem. Letting yi = log xi ⇐⇒ xi = eyi yields a monomial f (x) to f (x1 , x2 , x3 , · · · ) = f (ey1 , ey2 , ey3 , · · · ) = c · ea1 y1 · ea2 y2 · ea3 y3 · · · = exp aT y + b which is now an exponential of affine function. Similarly, a posynomial will be converted into a sum of exponentials of affine functions. Now, taking log of the objective and the constraints. The posynomials turn into log-sum-exp (which are convex), the monomials will be become affine. Thus, this is our regular convex problem now. 4.7 Generalized Inequality Constraints 4.7.1 Conic Form Problems Conic form problem is a generalization of LP, replacing componentwise inequality with generalized linear inequality with a cone K. minimize cT x subject to F x + g K 0 Ax = b 17 The SOCP can be expressed as a conic form problem if we set Ki to be a second-order cone in Rni +1 : minimize cT x subject to − Ai x + bi , cTi x + di Ki 0 (i = 1, · · · , m) Fx = g from which the name of SOCP comes. 4.7.2 SDP: Semidefinite Programming A special form of conic program, where K is Sn+ , which is the set of positive semidefinite matrices, is called a SDP. It has the form: minimize cT x subject to x1 F1 + x2 F2 + · · · + xn Fn 0 Ax = b 4.8 Vector Optimization We can generalize the regular convex optimization by letting the objective function take vector values; we can now use proper cones and generalized inequalities to find the best vector value. These are called vector optimization problems. 4.8.1 Optimal Values When a point x∗ is better or equal to than every other point in the domain of the problem, x∗ is called the optimal. In a vector optimization problem, if an optimal exists, it is unique. (Why? Vector optimization requires a proper cone; proper cones are pointed – they do not contain lines. However, if x1 and x2 are both optimal, p = x1 − x2 and −p are both in the cone, making it improper.) 4.8.2 Pareto Optimality In many problems we do not have a minimum value achievable, but a set of minimal values. They are incomparable to each other. A point x ∈ D is pareto-optimal when for all y that f0 (y) K f0 (x) implies f0 (y) = f0 (x). Note that there can be multiple values with the same minimal value. 
Note that every pareto value has to lie on the boundary of the set of achievable values. 4.8.3 Scalarization for Pareto Optimality A standard technique for finding pareto optimal points is to scalarize vector objectives by taking a weighted sum. This can be explained in terms of dual generalized inequality. Pick any λ ∈ K∗ and solve the following problem: 18 minimize λT f0 (x) subject to fi (x) ≤ 0 hi (x) = 0 By what we discussed in 2.6.2, a pareto optimal point must be a minimizer of this objective for any λ ∈ K∗. Now what happens when the problem is convex? Each λ with λ K∗ 0 will likely give us a different pareto point. Note λ K∗ 0 might not give us such guarantee, some elements might not be pareto optimal. 4.8.4 Examples • Regularized linear regression tries to minimize RMSE and norm of the coefficient at the same time 2 so we optimize kAx − bk2 + λxT x. Changing λ lets us explore all pareto optimal points. 5 Duality This chapter explores many important ideas. Duality is introduced and used as a tool to derive optimality conditions. KKT conditions are explained. 5.1 The Lagrangian And The Dual 5.1.1 The Lagrangian The Lagrangian L associated with a convex optimization problem minimize f0 (x) subject to fi (x) ≤ 0 (i = 1, · · · , m) hi (x) = 0 (i = 1, · · · , p) is a function taking x and the weights as input, and returning a weighted sum of the objective and constraints: L (x, λ, ν) = f0 (x) + m X λi fi (x) + i=1 p X νi hi (x) i=1 So positive values of fi (x) are going to penalize the objective function. The weights are called dual variables or Lagrangian multiplier vectors. 5.1.2 The Lagrangian Dual Function The Lagrangian dual function takes λ and ν, and minimizes L over all possible x. g (λ, ν) = inf L (x, λ, ν) = inf x∈D x∈D f0 (x) + m X i=1 19 λi fi (x) + p X i=1 ! νi hi (x) 5.1.3 Lagrangian Dual and Lower Bounds It is easy to see that for any elementwise positive λ, the Lagrangian dual function provides a lower bound on the optimal value p∗ of the original problem. This is very easy to see; if x is any feasible point, the dual function value is the sum of (possibly suboptimal) value p; if xp is the feasible optimal point, we have negative values of fi (i ≥ 1) and zeros for hi . Then, g (λ, ν) = inf L (x, λ, ν) ≤ L (xp , λ, ν) ≤ f (xp ) = p∗ x∈D 5.1.4 Intuitions Behind Lagrangian Dual An alternative way to express constraints is to introduce indicator functions in the objectives: minimize f0 (x) + m X I− (fi (x)) + i=1 p X I0 (hi (x)) i=1 the indicator functions will have a value of 0 when the constraint is met, ∞ otherwise. Now, these represent how much you are irritated by a violated constraint. We can replace them with a linear function - just a different set of preferences. Instead of hard constraints, we are imposing soft constraints. 5.1.5 LP example and Finite Dual Conditions A linear program’s lagrange dual function is T g (λ, ν) = inf L (x, λ, ν) = −bT ν + inf c + AT ν − λ x x x The dual value can be found analytically, since it is a affine function of x. Whenever any element of c + AT ν − λ is nonzero, we can manipulate x to make the dual value −∞. So it is finite only on a line where c + AT ν − λ = 0, which is a surprisingly common occurrence. 5.1.6 Conjugate Functions and Lagrange Dual The two functions are closely related, and Lagrangian dual can be expressed in terms of the conjugate function of the objective function, which makes duals easier to derive if the conjugate is readily known. 
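To see the lower-bound property in action on this LP, here is a small check (my own sketch, with cvxpy and random data assumed, not something from the notes). For the primal "minimize c^T x subject to Ax = b, x >= 0", any nu with lam = c + A^T nu >= 0 is dual feasible and gives g(lam, nu) = -b^T nu <= p*:

    import cvxpy as cp
    import numpy as np

    np.random.seed(2)
    m, n = 3, 6
    A = np.random.randn(m, n)
    b = A @ np.abs(np.random.randn(n))   # makes the primal feasible
    c = np.random.rand(n) + 0.5          # positive c keeps the LP bounded below

    x = cp.Variable(n)
    p_star = cp.Problem(cp.Minimize(c @ x), [A @ x == b, x >= 0]).solve()

    for _ in range(5):
        nu = 0.1 * np.random.randn(m)
        lam = c + A.T @ nu               # g is finite only when this is elementwise >= 0
        if np.all(lam >= 0):
            print(-b @ nu, "<=", p_star) # the lower bound always holds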
5.2 The Lagrange Dual Problem There’s one more thing which is named Lagrangian: the dual problem. The dual problem is the optimization problem maximize g (λ, ν) subject to λ 0 A pair of (λ, ν) is called dual feasible if it is a feasible point of this problem. The solution of this problem, (λ∗, ν∗) is called dual optimal or optimal Lagrange multipliers. The dual problem is always convex; whether or not the primal problem is convex. Why? g is a pointwise infimum of affine functions of λ and ν. Note the langrange dual for many problems were bounded only for a subset of the domain. We can bake this restriction into the problem explicitly, as a constraint. 20 5.2.1 Duality Gap Langrangian dual’s solution d∗ are related to the solution of the primal problem p∗, notably: d∗ ≤ p∗ Regardless of the original problem being convex. When the inequality is not strict, this is called a weak duality. The difference p ∗ −d∗ is called the duality gap. Duality can be used to provide lower bound of the primal p roblem. 5.2.2 Strong Duality And Slater’s Constraint Qualification When the duality gap is 0, we say strong duality holds for the problem. That means the lower bound obtained from the dual equals to the optimal solution of the problem; therefore solving the dual is (sort of) same as solving the primal. For obvious reasons, strong duality is very desirable but it doesn’t hold in general. But for convex problems, we usually (but not always) have strong duality. Given a convex problem, how do we know if strong duality holds? There are many qualifications, which ensures strong duality if the qualifications are satisfied. The text discusses one such qualification; Slater’s constraint qualification. The condition is quiet simple: if there exists x ∈ relintD such that all inequality conditions are strictly held, we have strong duality. Put another way: fi (x) < 0 (i = 0, 1, 2, · · · ) , Ax = b Also, it is noted that affine inequality constraints are allowed to held weakly. 5.2.3 Examples • Least-squares: since there are no infeasibility constraints, Slater’s condition just equals feasibility: so as long as the primal problem is feasible, strong duality holds. • QCQP: The Lagrangian is a quadratic form. When all λs are nonnegative, we have a positive semidefinite form and we can solve minimization over x analytically. • Nonconvex example: Minimizing a nonconvex quadratic function over the unit ball has strong duality. 5.3 Geometric Interpretation This section introduces some ways to think about Lagrange dual functions, which offer some intuition about why Slater’s condition works, and why most convex problems have strong duality. 5.3.1 Strong Duality of Convex Problems Let’s try to explain figure 5.3 to 5.5 from the book. Consider following set G: G = {(f1 (x) , · · · , fm (x) , h1 (x) , · · · , hp (x) , f0 (x)) |x ∈ D} = {(u, v, t) |x ∈ D} Note ui = fi (x), vi = gi (x) and t = f0 (x). Now the Lagrangian of this problem 21 L (λ, ν, x) = X λi ui + X ν i vi + t can be interpreted as a hyperplane passing through x with normal vector (λ, ν, 1)1 and that hyperplane will meet t-axis at the value of the Lagrangian. (See figure 5.3 from the book.) Now, the Lagrange dual function g (λ, ν) = inf L (λ, ν, x) x∈D will find x in the border of D: intuitively, the Lagrangian can still be decreased by wiggling x if x ∈ relintD. Therefore, the value of the Lagrange dual function can now be interpreted as a supporting hyperplane with normal vector (λ, ν, 1). 
Next, we solve the Lagrange dual problem which maximizes the position where the hyperplane hits the t-axis. Can we hit p∗, the optimal value? When G is convex, the feasible portion of G (i.e. u 0 and ν = 0) is convex again, and we can find a supporting hyperplane that meets G at the optimal point! But when G is not, p∗ can hide in a “nook” inside G and the supporting hyperplane might not meet p∗ at all. 5.4 Optimality Conditions 5.4.1 Certificate of Suboptimality and Stopping Criteria We know, without assuming strong duality, g (λ, ν) ≤ p? ≤ f0 (x) Now, f0 (x) − g (λ, ν) gives a upper bound on f0 (x) − p?, the quantity which shows how suboptimal x is. This gives us a stopping criteria for iterative algorithms; when f0 (x) − g (λ, ν) ≤ , it is a certificate that x is less than suboptimal. The quantity will never drop below the duality gap, so if you want this to work for arbitrarily small we would need strong duality. 5.4.2 Complementary Slackness Suppose the primal and dual optimal values are attained and equal. Then, f0 (x? ) = g (λ? , ν ? ) X X = inf f0 (x) + λi fi (x) + νi hi (x) x X X ≤ f0 (x? ) + λi fi (x? ) + νi hi (x? ) ≤ f0 (x? ) (assumed 0 duality gap) (definition of Lagrangian dual function) (taking infimum is less than or equal to any x) (λ?i are nonnegative, fi values are nonpositive and hi values are 0) So all inequalities can be replaced by equalities! In particular, it means two things. First, x? minimizes the Lagrangian. Next, X λ?i fi (x? ) = 0 Since each term in this sum is nonpositive, we can conclude all terms are 0: so for all i ∈ [1, m] we have: 1 Yep, there are some notation abusing here since λ and ν themselves are vectors. 22 λ? = 0 i either f (x? ) = 0 i This condition is called complementary slackness. 5.4.3 KKT Optimality Conditions KKT is a set of conditions for a tuple (x? , λ? , ν ? ) which are primal and dual feasible. It is a necessary condition for x∗ and (λ∗, ν∗) being optimal points for their respective problems with zero duality gap. That is, all optimal points must satisfy these conditions. KKT condition is: • x? is prime feasible: fi (x? ) ≤ 0 for all i, hi (x? ) = 0 for all i. • (λ? , ν ? ) is dual feasible: λ?i ≥ 0 • Complementary slackness: λ?i fi (x?i ) = 0 • Gradient of Lagrangian disappears: ∇f0 (x? ) + P λ?i ∇fi (x? ) + P νi? ∇hi (x? ) = 0 Note the last condition is something we didn’t see before. It makes intuitive sense though - the optimal point for the dual problem must minimize the Lagrangian. Since the primal problem is convex, the Lagrangian is convex - and the only point with 0 gradient is the minimum. KKT and Convex Problems When the primal problem is convex, KKT condition is necessary-sufficient for optimality. This has immense importance. We can frame solving convex optimization problems by solving KKT conditions. Sometimes KKT condition might be solvable analytically, giving us closed form solution for the optimization problem. The text also mentions that when Slater’s condition is satisfied for a convex problem, we can say that arbitrary x is primal optimal iff there are (λ, ν) that satisfies KKT along with x. I actually am not sure about why Slater’s condition is needed for this claim but the lecture doesn’t make a big deal out of it, so meh.. 5.5 Solving The Primal Problem via The Dual When we have strong duality and dual problem is easier to solve(due to some exploitable structure or analytical solution), one might solve the dual first to find the dual optimal point (λ? , ν ? ) and find the x that minimizes the Lagrangian. 
If this x is feasible, we have a solution! Otherwise, what do we do? If the Lagrangian is strictly convex, then there is a unique minimum: if this minimum is infeasible, then we conclude the primal optimal is unattainable. 5.6 Sensitivity Analysis The Lagrange multipliers for the dual problem can be used to infer the sensitivity of the optimal value with respect to perturbations of the constraints. What kind of perterbations? We can tighten or relax constraints for an arbitrary optimization problem, by changing the constraints to: 23 f (x) ≤ u i i h (x) = v i (i = 1, 2, · · · , m) (i = 1, 2, · · · , p) i Letting ui > 0 means we have more freedom regarding the value of fi ; ui < 0 otherwise. 5.6.1 Global Sensitivity The Lagrange multipliers will give you information about how the optimal value will change when we do this. Let’s denote the optimal value of the perturbed problem as a function of u and v: p? (u, v). We have a lower bound for this value when strong duality holds: f0 (x) ≥ p? (0, 0) − λ?T u − ν ?T v which can be obtained by manipulating the definitions. Using this lower bound, we can make some inferences about how f0 (x) will change with respect to u and v. Basically, when lower bound increases greatly, we make an inference the optimal value will increase greatly. However, when lower bound decreases, we don’t have such an assurance2 . Examples: • When λ?i is large, and we tighten the constraint (ui < 0), this will increase the lower bound a lot; the optimal value will increase greatly. • When λ?i is is small, and we loosen the constraint (ui > 0), this will decrease the lower bound a bit, but this might not decrease the optimal value a lot. 5.6.2 Local Sensitivity The text shows an interesting identity: λ?i = − ∂p? (0, 0) ∂ui Now, λ?i gives you the slope of the optimal value with respect to the particular constraint. All these, along with complementary slackness, can be used to interpret Lagrange multipliers; they tell you how “tight” a given inequality constraint is. Suppose we found that λ?1 = 0.1, and λ?2 = 100 after solving a problem. By complementary slackness, we know f1 (x? ) = f2 (x? ) = 0 and they are both tight. However, when we do decrease u2 , we know p? will move much more abruptly, because of the slope interpretation above. On the other hand, what happens when we increase u2 ? Locally we know p? will start to decrease fast; but it doesn’t tell us how it will behave when we keep increasing u2 . 5.7 Examples and Reformulating Problems As different formulations can change convex problems to non-convex and vice versa, dual problems are affected by how the problem is exactly formulated. Because of this, a problem that looks unnecessarily complicated might end up to be a better representation. The text gives some examples of this. 2 Actually I’m a bit curious regarding this as well - lower bound increasing might not increase the optimal value when optimal value was well above the lower bound to begin with. 24 5.7.1 Introducing variables 5.7.2 Making explicit constraints implicit 5.7.3 Transforming the objective 5.8 Generalized Inequailities How does the idea of Lagrangian dual extend to problems with vector inequalities? Well, it generalizes pretty well - we can define everything pretty similarly. Except how the nonnegative restriction for λ becomes a nonnegative restriction with dual cone. Here are some intuition behind this difference. Say we have the following problem: minimize f0 (x) s.t. 
fi (x) Ki 0 (i = 1, 2, · · · , m) hi (x) = 0 (i = 1, 2, · · · , p) Now the Lagrangian multiplier λ is vector valued. The Lagrangian becomes: f0 (x) + X λTi fi (x) + X νi hi (x) fi (x) is nonpositive with respect to Ki means that −fi (x) ∈ Ki . Remember we want each product λTi fi (x) to be nonpositive - otherwise this dual won’t be a lower bound anymore. Now we try to find the set of λ that makes λT y negative for all −y ∈ K. We will need to make λT y positive for all y ∈ K. What is this set? The dual cone. The dual of SDP is also given as an example. The actual derivation involves more linear algebra than I am comfortable with (shameful) so I’m skipping things here. 6 Approximation and Fitting With this chapter begins part II of the book on applications of convex optimization. Hopefully, I will be less confused/frustrated by materials in this part. :-) 6.1 Norm Approximation This section discusses various forms of the linear approximation problem: minx |Ax − b| with different norms and constraints. Without doubt, this is one of the most important optimization problems. 6.1.1 Examples • `2 norm: we get least squares. • `∞ norm: Chebyshev approximation problem. Reduces to an LP which is as easy as least squares, but no one discusses it! 25 • `1 norm: Sum of absolute residuals norm. Also reduces to LP, extremely interesting, as we will discuss further in this chapter. 6.1.2 Different Penalty Functions and Their Consequences The shape of the norm used in the approximation affects the results tremendously. The most common norms are `p norms - given a residual vector r, !1/p X p |ri | i We can ignore the powering by 1/p and just minimize the base of the exponentiation. Now we can think of `p norms giving separate penalties to each of the residual vector. Note that most norms do the same - so we can think in terms of a penalty function φ (r) when we think about norms. The text examines a few notable penalty functions: • Linear: sum of absolute values; associated with `1 norm. • Quadratic: sum of squared errors; associated with `2 norm. • Deadzone-linear: zero penalty for small enough residuals; grows linearly after the barrier. • Log-barrier: grows infinitely as we get near the preset barrier. Now how does these affect our solution? The penalty function measures our level of irritation with regard to the residual. When φ (r) grows rapidly as r becomes large, we are immensly irritated. When φ (r) shrinks rapidly as r becomes small, we don’t care as much. This simple description actually explains the stark difference between `1 and `2 norms. With a `1 norm, the slope of the penalty does not change when the residual gets smaller. Therefore, we still have enough urge to shrink the residual until it becomes 0. On the other hand, with a `2 norm the penalty will quickly get smaller when the residual gets smaller than 1. Now, once we go below 1, we do not have as much motivation to shrink it further - the penalty does not decrease as much. What happens when the residual is large? Then, `1 is actually less irritated by `2 ; the penalty grows much more rapidly. These explanations let us predict how the residuals from both penalty functions will be distributed. `1 will give us a lot of zeros, and a handful of very large residuals. `2 will only have a very small number of large residuals; and it won’t have as many zeros - many residuals will be “near” zero, but not exactly. The figures in the text confirms this theory. This actually was one of the most valuable intuitions I got out of this course. Awesome. 
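This prediction is easy to check numerically. The sketch below (mine, not from the text; cvxpy and random data are assumptions) fits the same overdetermined system under the ℓ1 and ℓ2 norms and counts the residuals that are essentially zero:

    import cvxpy as cp
    import numpy as np

    np.random.seed(3)
    m, n = 100, 30
    A = np.random.randn(m, n)
    b = np.random.randn(m)
    x = cp.Variable(n)

    cp.Problem(cp.Minimize(cp.norm(A @ x - b, 1))).solve()
    r1 = A @ x.value - b                 # l1 residuals
    cp.Problem(cp.Minimize(cp.norm(A @ x - b, 2))).solve()
    r2 = A @ x.value - b                 # l2 (least squares) residuals

    print("zeros:  ", np.sum(np.abs(r1) < 1e-6), np.sum(np.abs(r2) < 1e-6))
    print("largest:", np.abs(r1).max(), np.abs(r2).max())

With 30 unknowns and 100 equations, the ℓ1 fit typically drives about 30 residuals to exactly zero while tolerating a few large ones; the ℓ2 fit leaves essentially no residual at zero but keeps its largest residual smaller.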
Another little gem discussed in the lecture is that, contrary to the classic approach to fitting problems, the actual algorithms that find the x are not your tools anymore - they are standard now. The penalty function is your tool - you shape your problem to fit your actual needs. This is a very interesting, and at the same time very powerful perspective! 6.1.3 Outliers and Robustness Different penalty functions behave differently when outliers are present. As we can guess, quadratic loss functions are affected much worse than linear losses. When a penalty function is not sensitive to outliers, it is called robust. Linear loss function is an obvious example of this. The text introduces another robust penalty function, which is the Huber penalty function. It is a hybrid between quadratic and linear losses. 26 φhub (u) = u2 (|u| < M ) M (2 |u| − M ) otherwise Huber function grows linearly after the preset barrier. It is the closest thing to a “constant-beyondbarrier” loss function, without losing convexity. When all the residuals are small, we get the exact least square results - but if there are large residuals, we don’t go nuts with it. It is said in the lecture that 80% of all applications of linear regression could benefit from this. A bold, but very interesting claim. 6.1.4 Least-norm Problems A closely related problem is least-norm problem which has the following form: minimize |x| subject to Ax = b which obviously is meaningful only when Ax = b is underdetermined. This can be cast as a norm approximation problem by noting that the solution space is given by a particular solution, and the null space of A. Let Z consist of column vectors that are basis for N (A), and we minimize: |x0 + Zu| Two concrete examples are discussed in the lecture. • If we use `2 norm, we have a closed form solution using the KKT conditions. • If we use `1 norm, it can be modeled as an LP. This approach is in vogue, say in last 10 years or so. We are now looking for a sparse x. 6.2 Regularized Approximations Regularization is a practice of minimizing the norm of the coefficient |x|, as well as the norm of the residual. It is a popular practice in multiple disciplines. Why do it? The text introduces a few examples. First of all, it can be a way to express our prior knowledge or preference towards smaller coefficients. There might be cases where our model is not a good approximation of reality when x gets larger. Personally, this made the most sense to me; it can be a way of taking variations/errors of the matrix A into account. For example, say we assume an error ∆ in our matrix A. So we are minimizing (A + ∆) x−b = Ax − b + ∆x; the error is multiplied by x! We don’t want a large x. 6.2.1 Bi-criterion Formulation Regularization can be cast as a bi-criterion problem, as we have two objectives to minimize. We can trace the optimal trade-off curve between the two objectives. On one end, where |x| = 0, we have Ax = 0 and the residual norm is |b|. At the other end, there can be multiple Pareto-optimal points which minimize |Ax − b|. (When both norms are `2 , it is unique) 27 6.2.2 Regularization The actual practice of regularization is more concrete than merely trying to minimize the two objectives; it is a scalarization method. We minimize |Ax − b| + γ |x| where γ is a problem parameter. (Which, in practice, is typically set by cross validation or manual intervention) Practically, γ is the knobs we turn to solve the problem. 
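Here is a small sketch of this scalarization (my own, with cvxpy and random data assumed): sweeping γ traces out points on the trade-off curve from the bi-criterion formulation above.

    import cvxpy as cp
    import numpy as np

    np.random.seed(5)
    m, n = 40, 20
    A = np.random.randn(m, n)
    b = np.random.randn(m)

    x = cp.Variable(n)
    gamma = cp.Parameter(nonneg=True)
    prob = cp.Problem(cp.Minimize(cp.norm(A @ x - b, 2) + gamma * cp.norm(x, 2)))

    for g in [0.01, 0.1, 1.0, 10.0]:
        gamma.value = g
        prob.solve()
        print(g, cp.norm(A @ x - b, 2).value, cp.norm(x, 2).value)

As γ grows, the coefficient norm shrinks toward zero and the residual norm rises toward ||b||, the two endpoints described in the bi-criterion discussion.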
Another common practice is taking a weighted sum of squared norms:

‖Ax − b‖^2 + δ‖x‖^2

Note that it is not obvious that the two problems sweep out the same trade-off curve. (They do, and you can find the mapping between γ and δ for a specific problem.) The most prominent scheme is Tikhonov regularization/ridge regression. We minimize:

‖Ax − b‖_2^2 + δ‖x‖_2^2

which even has an analytic solution. The text also mentions a "smoothing" regularization scheme - the penalty is on Dx instead of x, where D can change depending on your criterion of fitness of solutions. For example, if we want x to be smooth, we can roughly penalize its second derivative by choosing D as the banded (Toeplitz) second-difference matrix

D = [ 1 −2  1  0  ...  0 ]
    [ 0  1 −2  1  ...  0 ]
    [        ...         ]

so that the elements of Dx are (up to sign) approximately the second differences 2x_i − x_{i−1} − x_{i+1}.

6.2.3 ℓ1 Regularization

ℓ1 regularization is introduced as a heuristic for finding sparse solutions. We minimize:

‖Ax − b‖_2 + γ‖x‖_1

The optimal trade-off curve here can be seen as an approximation of the optimal trade-off curve between ‖Ax − b‖_2 and the cardinality card x, the number of nonzero elements of x. The ℓ1-regularized problem can be solved as an SOCP.

6.2.4 Signal Reconstruction Problem

An important class of problems is introduced: signal reconstruction. There is an underlying signal x which is observed with some noise, resulting in a corrupted observation. x is assumed to be smooth. What is the most plausible guess for the time series x? This can be cast as a bi-criterion problem: first we want to minimize ‖x̂ − x_cor‖_2, where x̂ is our guess and x_cor is the corrupted observation. On the other hand, we think smooth x̂ are more likely, so we also minimize a penalization function φ(x̂). Two penalization schemes are introduced, quadratic smoothing and total variation reconstruction. In short, they are ℓ2 and ℓ1 penalties on the differences, respectively. When the underlying process has some jumps, as you can expect, total variation reconstruction preserves those jumps, while quadratic smoothing tries to smooth out the transitions. Some more insights are shared in the lecture videos. Recall that ℓ1 regularization gives you a small number of nonzero regularized terms. So if you are penalizing

φ(x̂) = Σ_i |x̂_{i+1} − x̂_i|

the first difference is going to be sparse. What does the resulting function look like? Piecewise constant. Similarly, say we penalize the approximate second difference |2x̂_i − x̂_{i−1} − x̂_{i+1}|? We get piecewise linear! The theme goes on - if we take the third difference, we get piecewise quadratic (actually, splines).

6.3 Robust Approximation

How do we solve the approximation problem when A is noisy? Say A = Ā + U, where Ā is the componentwise mean and U is the random component with zero mean. How do we handle this? The prevalent method is to ignore the fact that A has possible errors. That is okay, as long as you do a "posterior analysis" of the result: change A by a small amount, redo the approximation, and see how the solution changes.

6.3.1 Stochastic Formulation

A reasonable formulation is to minimize the expected error:

minimize E‖Ax − b‖

This is intractable in general, but tractable in some special cases, including the squared ℓ2 norm:

minimize E‖Ax − b‖_2^2

Then:

E‖Ax − b‖_2^2 = E (Āx − b + Ux)^T (Āx − b + Ux)
             = (Āx − b)^T (Āx − b) + E x^T U^T U x
             = ‖Āx − b‖_2^2 + x^T P x          (P = E U^T U)
             = ‖Āx − b‖_2^2 + ‖P^{1/2} x‖_2^2

Tada, we got Tikhonov regularization!
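A quick numerical sanity check of this identity (a throwaway sketch; the data are made up, and U is assumed to have iid zero-mean entries so that P = E U^T U is a multiple of the identity):

```python
import numpy as np

np.random.seed(0)
m, n = 40, 10
A_bar = np.random.randn(m, n)
b = np.random.randn(m)
x = np.random.randn(n)
sigma = 0.3                      # std. dev. of each entry of the error U

# For U with iid zero-mean entries of variance sigma^2, P = E[U^T U] = m*sigma^2*I.
P = m * sigma**2 * np.eye(n)

# Monte Carlo estimate of E || (A_bar + U) x - b ||_2^2.
mc = np.mean([np.linalg.norm((A_bar + sigma * np.random.randn(m, n)) @ x - b)**2
              for _ in range(20000)])

# Closed form: || A_bar x - b ||_2^2 + x^T P x.
closed = np.linalg.norm(A_bar @ x - b)**2 + x @ P @ x
print(mc, closed)   # the two numbers should agree up to Monte Carlo error
```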
This makes perfect sense - increasing the magnitude of x increases the variation of Ax, which in turn increases the average value of ‖Ax − b‖ (by Jensen's inequality). So we are trying to balance making ‖Āx − b‖ small against making the variance small. This is a nice interpretation of Tikhonov regularization as well.

6.3.2 Worst-case Robust Approximation

Instead of taking the expected value of the error, we can try to minimize the supremum of the error over a set 𝒜 of possible values of A. The text describes several types of 𝒜 for which we can come up with explicit solutions:

• When 𝒜 is a finite set.
• When 𝒜 is a mean Ā plus an error U lying in a norm ball.
• When each row of A is its mean plus an ellipsoidal uncertainty described by P_i.
• More examples...

Worst-case robust least squares is mentioned in the lecture. It is not a convex problem, but it can be solved exactly; in fact, any optimization problem with two quadratic functions can be solved exactly (see the appendix of the book).

6.4 Function Fitting

In a function fitting problem, we try to approximate an unknown function by a linear combination of basis functions. We determine the coefficient vector x, which yields the function

f(u) = Σ_i x_i f_i(u)

where f_i(·) is the i-th basis function. A typical choice of basis functions is the powers of u, so the possible set of f is the set of polynomials. You can also use piecewise-linear or piecewise-polynomial bases; piecewise polynomials give you spline functions.

6.4.1 Constraints

We can impose various constraints on the function being fitted. The text introduces some tractable ones:

• Function value interpolation: the function value at a given point, f(v) = Σ_i x_i f_i(v), is a linear function of x. Therefore equality and inequality constraints on f(v) are actually linear constraints on x.
• Derivative constraints: the derivative at a given point, ∇f(v) = Σ_i x_i ∇f_i(v), is also a linear function of x.

6.4.2 Sparse Descriptions and Basis Pursuit

In basis pursuit problems, we want to find a sparse f out of a very large number of basis functions. By a sparse f, we mean a coefficient vector x with only a few nonzero entries. Mathematically this is equivalent to the regressor selection problem (quite unsurprisingly), so a similar set of heuristics can be used. First, we can use ℓ1 regularization as a proxy for optimizing card x.

6.4.3 Checking Model Consistency

The text introduces an interesting problem - given a set of data points, is there a convex function that fits all of them? Fortunately, recall the first-order convexity condition from 3.1.2 - using it, certifying convexity reduces to finding gradients at the data points such that the first-order condition is satisfied. We want to find g_1, ..., g_m so that

y_j ≥ y_i + g_i^T (u_j − u_i)

for every pair i, j.

Fitting a Convex Function to the Data

We can "fit" a convex function to the data by finding fitted values ŷ and ensuring the above condition holds for the fitted values. Formally, solve:

minimize Σ_i (y_i − ŷ_i)^2
subject to ŷ_j ≥ ŷ_i + g_i^T (u_j − u_i) for every pair i, j

This is a regular QP. Note that the result is not a functional representation of the fitted function, as in regular regression problems. Rather, we get the values of the function at the data points - a point-value representation.

Bounding Values

Say we want to find out whether a new data point is "irregular" - is it consistent with what we saw earlier? In other words, given a new u_new, what is the range of values possible given the previous data? We can minimize/maximize ŷ_new subject to the first-order constraints to find the range. These problems are LPs.
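A minimal sketch of this QP (cvxpy, with one-dimensional toy data invented for illustration); it fits convex point values ŷ with subgradients g, then uses them to lower-bound the value at a new point - a simplified variant of the bounding idea above:

```python
import numpy as np
import cvxpy as cp

np.random.seed(0)
m = 30
u = np.sort(np.random.uniform(-2, 2, m))          # 1-D sample points
y = u**2 + 0.3 * np.random.randn(m)               # noisy convex data

yhat = cp.Variable(m)
g = cp.Variable(m)                                 # (sub)gradient at each point
cons = [yhat[j] >= yhat[i] + g[i] * (u[j] - u[i])
        for i in range(m) for j in range(m) if i != j]
cp.Problem(cp.Minimize(cp.sum_squares(y - yhat)), cons).solve()

# Any convex function consistent with the fit satisfies
# f(u_new) >= yhat_i + g_i (u_new - u_i), so the best lower bound is an LP.
u_new = 1.5
y_new = cp.Variable()
lower = cp.Problem(cp.Minimize(y_new),
                   [y_new >= yhat.value[i] + g.value[i] * (u_new - u[i])
                    for i in range(m)]).solve()
print("lower bound at u_new:", round(lower, 3))
```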
7 Statistical Estimation

I was stuck in this chapter for too long. It's time to finish it no matter what. This chapter shows some example applications of convex optimization in statistical settings.

7.1 Parametric Distribution Estimation

The first example is maximum likelihood (ML) fitting - the most obvious, but the most useful. We of course require the constraints on x to be convex-optimization friendly. A linear model with IID noise is discussed:

y_i = a_i^T x + v_i

The MLE is of course

x_ml = argmax_x ℓ(x) = argmax_x log p_x(y)

where p_x(y) depends on the distribution of the v_i. Different assumptions on this distribution lead to different fitting methods:

• Gaussian noise gives you ordinary least squares.
• Laplacian noise gives you ℓ1-norm (sum of absolute residuals) regression. (Of course: the Laplacian distribution has a sharp peak at 0, which equates to a high incentive to reduce a residual when it is already really small.)
• Uniform noise gives an ℓ∞-type feasibility problem (any x keeping every residual inside the noise range is an ML estimate).

Also note that we need log p_x(y) to be concave in x, not in y - and exponential families of distributions meet this criterion. In many cases your natural choice of parameters might not yield a concave log-likelihood; usually a change of variables achieves it. We also discuss that these noise distributions are equivalent to different penalty schemes - as demonstrated by the equivalence of ℓ2 with Gaussian and ℓ1 with Laplacian noise. There is a 1:1 correspondence: if you have a penalty function φ(v), the corresponding noise density is e^{−φ(v)}, normalized!

7.1.1 Logistic Regression Example

We model p = S(a^T u + b) = exp(a^T u + b) / (1 + exp(a^T u + b)), where u are the explanatory variables and a, b are model parameters. Say we have n = q + m examples, the first q of them having y_i = 1 and the next m having y_i = 0. Then the likelihood function is

Π_{i=1}^{q} p_i · Π_{i=q+1}^{n} (1 − p_i)

Taking the log and plugging in the expression for p gives the following concave function:

Σ_{i=1}^{q} (a^T u_i + b) − Σ_{i=1}^{n} log(1 + exp(a^T u_i + b))

7.1.2 MAP Estimation

MAP is the Bayesian equivalent of MLE. The underlying philosophy is vastly different, but the optimization technicalities remain more or less the same, except for an extra term describing the prior distribution.

7.2 Nonparametric Distribution Estimation

A nonparametric distribution is one for which we don't have any closed formula. We estimate a vector p ∈ R^n with prob(X = α_k) = p_k.

7.2.1 Priors

• The expected value of any function is linear in p, so such prior knowledge is easy to express as a linear equality constraint.
• The variance of the random variable is a concave function of p, so a lower bound on the variance can be expressed in the convex setting.
• The entropy of X is concave in p, so we can impose a lower bound on it as well.
• The KL divergence between p and q is convex, so we can impose an upper bound on it.

7.2.2 Objectives

• We can minimize/maximize expected values, because they are affine in p.
• We can find the MLE, because the log-likelihood for p in this setting is always concave.
• We can find the maximum-entropy distribution (see the sketch below).
• We can find the distribution minimizing the KL divergence from a given q.
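Here is a minimal maximum-entropy sketch (cvxpy; the support points, the prior-knowledge functions, and their expected values are all invented for illustration): find the distribution p on given support points that maximizes entropy subject to a known mean and a bound on the second moment.

```python
import numpy as np
import cvxpy as cp

alpha = np.linspace(-2.0, 2.0, 41)     # support points alpha_k (illustrative)
p = cp.Variable(len(alpha))

constraints = [
    p >= 0, cp.sum(p) == 1,            # p is a probability distribution
    alpha @ p == 0.3,                  # assumed prior knowledge: E[X] = 0.3
    (alpha**2) @ p <= 1.0,             # assumed upper bound on E[X^2]
]
# Entropy sum_k -p_k log p_k is concave, so this is a convex problem.
cp.Problem(cp.Maximize(cp.sum(cp.entr(p))), constraints).solve()

print("max-entropy distribution:", np.round(p.value, 4))
```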
7.3 Optimal Detector Design and Hypothesis Testing

I will only cover this section briefly. Problem setup: the parameter θ can take m values; we call each θ a different hypothesis. For each value of θ we have a nonparametric distribution over n possible values α_1, ..., α_n, so the probabilities can be represented by a matrix P ∈ R^{n×m}. We want to find which θ generated a given sample, so the detector we want to design is a function from a sample to θ. We can create either deterministic or probabilistic detectors; as in game theory, introducing extra randomness can improve the detector in many ways. For a simple and convincing example, say we have a binary problem. Draw an ROC curve showing the trade-off between false positive and false negative errors. Depending on the distributions, a deterministic detector might not be able to hit the sweet spot where p_fn = p_fp - but a probabilistic detector can.

7.4 Chebyshev and Chernoff Bounds

7.4.1 Chebyshev Bounds

Chebyshev bounds give an upper bound on the probability of a set based on known quantities; many classical inequalities follow this form. For example, Markov's inequality says: if X ∈ R_+ has EX = µ, then prob(X ≥ 1) ≤ µ. (Of course, this inequality is completely useless when µ > 1, but that's how all these inequalities are.) This section looks at cases where we can find such bounds using convex optimization. In this setup, our prior knowledge is represented as a collection of functions f_i and their expected values a_i = E f_i(X). The set whose probability we want to bound is C ⊆ S. We want something like

prob(X ∈ C) ≤ E f(X)

for some function f whose expectation we can compute. The recipe is to concoct an f which is a linear combination f = Σ_i x_i f_i of the prior-knowledge functions; then E f(X) = Σ_i a_i x_i is simply a linear combination of the known expectations. How do we ensure this expectation sits above prob(X ∈ C)? We impose f(z) ≥ 1_C(z) pointwise, where 1_C is the indicator function of C. We can now state the following problem:

minimize Σ_i a_i x_i   (= E f(X))
subject to f(z) = Σ_i x_i f_i(z) ≥ 1 for z ∈ C
           f(z) = Σ_i x_i f_i(z) ≥ 0 for z ∈ S∖C

This is a convex optimization problem, since the constraints are convex in x. For example, the first constraint can be recast as

g_1(x) = 1 − inf_{z∈C} f(z) ≤ 0

which is convex, because inf_{z∈C} Σ_i x_i f_i(z) is a concave function of x. There is another formulation for the case where the first two moments are specified, but I am omitting it.

7.4.2 Chernoff Bounds

This section deals with Chernoff bounds, which have a different form but the same underlying idea.
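A toy illustration of the 7.4.1 recipe (a sketch; the finite sample space, the basis functions, and the moment values are all invented): when S is finite and the prior knowledge consists of the first two moments, the bound is just an LP in the coefficients x.

```python
import numpy as np
import cvxpy as cp

# Finite sample space S and the event C = {X >= 6} (all values invented).
S = np.linspace(0.0, 10.0, 101)
idx_C = np.where(S >= 6.0)[0]
idx_rest = np.where(S < 6.0)[0]

# Prior knowledge: E f_i(X) for f_0 = 1, f_1(z) = z, f_2(z) = z^2.
F = np.vstack([np.ones_like(S), S, S**2])   # F[i, k] = f_i(z_k)
a = np.array([1.0, 3.0, 10.0])              # assumed E[1], E[X], E[X^2]

x = cp.Variable(3)
f_vals = F.T @ x                            # f(z_k) = sum_i x_i f_i(z_k)
constraints = [f_vals[idx_C] >= 1,          # f >= 1 on C
               f_vals[idx_rest] >= 0]       # f >= 0 on S \ C
cp.Problem(cp.Minimize(a @ x), constraints).solve()

print("upper bound on prob(X >= 6):", round(float(a @ x.value), 4))
```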
7.5 Experiment Design

We discuss various solutions to the experiment design problem as an application. The setup: we have a fixed menu of p different experiments, represented by vectors a_i (1 ≤ i ≤ p). We will perform m experiments, each taken from the menu. Each experiment yields a measurement y_i = a_i^T x + w_i, where the w_i are independent unit-variance Gaussian noises. The maximum likelihood estimate of x is of course given by least squares, and the associated error e = x̂ − x has zero mean and covariance matrix

E = E ee^T = (Σ a_i a_i^T)^{-1}

where the sum runs over the experiments actually performed. How do we make E small? What metric do we use?

7.5.1 Further Modeling

First, this is an offline problem and we don't actually care about the order in which the experiments are performed; all that matters is, for each experiment on the menu, how many times we run it. So the optimization variables are nonnegative integers m_i summing to m. This problem is combinatorially hard, so we relax it a bit by modeling the fraction of the m runs allocated to each experiment. Still, the objective E is not a scalar (it's a matrix), so we need a scalarization scheme to minimize it. The text discusses some strategies, including:

• D-optimal design: minimize the determinant of E. Since the determinant measures volume, we are in effect minimizing the volume of the confidence ellipsoid.
• E-optimal design: minimize the largest eigenvalue of E. Rationale: the diameter of the confidence ellipsoid is proportional to the (spectral) norm of the matrix.
• A-optimal design: minimize the trace. This is, effectively, minimizing the expected squared error.

8 Geometric Problems

8.1 Point-to-Set Distance

The projection of a point x_0 onto a closed set C is defined as the closest point in C, i.e., the point minimizing the distance from x_0. When C is closed and convex and the norm is strictly convex (e.g., Euclidean), we can prove the projection is unique. When the set C is convex, finding the projection is a convex optimization problem. Some examples are discussed - hyperplanes, halfspaces, and a proper cone. Finding a separating hyperplane between a point and a convex set is discussed as well. With the Euclidean norm there is a geometric, intuitive construction: take x_0 and its projection p(x_0), and use the hyperplane normal to p(x_0) − x_0 passing through the midpoint of the two points. For other norms, however, we have to construct such a hyperplane from the dual problem: if we find a Lagrange multiplier for which the dual problem is feasible, that multiplier constitutes a separating hyperplane.

8.1.1 PCA Example

Consider the set C of m × n matrices with rank at most k. The projection of X_0 onto C minimizing the Euclidean (Frobenius) norm is achieved by the truncated SVD - yes, PCA!

8.2 Distance between Sets

Finding the distance between two convex sets is a convex optimization problem, of course. The dual of this problem can be interpreted as finding a separating hyperplane between the two sets. The argument: if strong duality holds, a positive distance implies the existence of a separating hyperplane.

8.3 Euclidean Distance and Angle Problems

This section deals with problems where the Euclidean distances and angles between vectors are constrained. Setup: n vectors a_1, ..., a_n in R^n, whose Euclidean lengths are assumed known: l_i = ‖a_i‖_2. Distance and angle constraints can be cast as constraints on G, the Gram matrix of the matrix A having the a_i as columns:

G = A^T A

G will be our optimization variable; after the optimization we can back out the vectors of interest by Cholesky factorization. The resulting problems are SDPs, since G is always positive semidefinite.

8.3.1 Expressing Constraints in Terms of G

• The diagonal entries give the squared lengths: G_ii = l_i^2.
• The distance d_ij between vectors i and j can be written as
  d_ij = ‖a_i − a_j‖_2 = (l_i^2 + l_j^2 − 2a_i^T a_j)^{1/2} = (l_i^2 + l_j^2 − 2G_ij)^{1/2},
  so G_ij = (l_i^2 + l_j^2 − d_ij^2)/2 is an affine function of d_ij^2. This means range constraints on d_ij^2 become a pair of linear constraints on G_ij.
• G_ij is an affine function of the correlation coefficient ρ_ij.
• G_ij is also an affine function of the cosine of the angle α_ij between the two vectors. Since cos^{-1} is monotonic, we can use this to constrain the range of α_ij.

8.3.2 Well-Conditionedness Constraints

The condition number of A, σ_1/σ_n, is a quasiconvex function of G, so we can impose a maximum value on it, or minimize it by quasiconvex optimization. Two additional approaches to well-conditionedness are discussed - dual bases and maximizing log det G.

8.3.3 Examples

• When we only care about the angles between vectors (or correlations), we can set l_i = 1 for all i.
• When we only care about the distances between vectors, we can assume the mean of the vectors is 0 and treat the squared lengths z_i = l_i^2 as additional optimization variables: since G_ij = (l_i^2 + l_j^2 − d_ij^2)/2, we get G = (z1^T + 1z^T − D)/2, where D is the matrix of squared distances d_ij^2; this G must be PSD. (A small Gram-matrix sketch follows.)
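A minimal Gram-matrix sketch (cvxpy; the instance - four unit vectors with bounded pairwise distances - is invented for illustration). It finds a feasible G and then backs out vectors with G = A^T A:

```python
import numpy as np
import cvxpy as cp

# Toy instance: 4 unit vectors whose pairwise distances all lie in [1.0, 1.5].
n = 4
G = cp.Variable((n, n), PSD=True)

cons = [cp.diag(G) == 1]                      # unit lengths: G_ii = l_i^2 = 1
for i in range(n):
    for j in range(i + 1, n):
        # d_ij^2 = l_i^2 + l_j^2 - 2 G_ij = 2 - 2 G_ij
        cons += [2 - 2 * G[i, j] >= 1.0**2,
                 2 - 2 * G[i, j] <= 1.5**2]

# Any objective works for a pure feasibility problem; this one just picks
# a "spread out" configuration by minimizing the sum of inner products.
cp.Problem(cp.Minimize(cp.sum(G)), cons).solve()

# Back out vectors a_i with G = A^T A (an eigendecomposition is a robust
# alternative to Cholesky when G is only positive semidefinite).
w, V = np.linalg.eigh(G.value)
A = np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T   # columns are the vectors a_i
print(np.round(A.T @ A, 3))                       # ~= G
```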
8.4 Extremal Volume Ellipsoids

This section deals with problems that "approximate" given sets with ellipsoids.

8.4.1 Löwner-John Ellipsoid

The Löwner-John (LJ) ellipsoid of a set C is defined as the minimum-volume ellipsoid that contains C. Computing it can be cast as a convex optimization problem, but it is only tractable when C itself is tractable. (Of course - if C is given as some arbitrary infinite collection of points or whatever, it's not going to be tractable.) We parametrize the ellipsoid by A and b such that:

ε_lj = {v : ‖Av + b‖_2 ≤ 1}

The volume of this ellipsoid is proportional to det A^{-1}, so that's what we optimize. We minimize

log det A^{-1} subject to sup_{v∈C} ‖Av + b‖_2 ≤ 1.

As a trivial example, when C is a finite set of size m, the constraint translates into m convex constraints on A and b. A notable feature of the LJ ellipsoid is that its efficiency can be bounded: if you shrink the LJ ellipsoid by a factor of n (the dimension), it is guaranteed to fit inside C (provided C is bounded with nonempty interior). So roughly we have a factor-of-n approximation of the set. (Argh - the proof is tricky; it uses the KKT conditions of a modified problem.) When the set is symmetric about a point x_0, the factor 1/n improves to 1/√n.

8.4.2 Maximum Volume Inscribed Ellipsoid

A related problem finds the maximum-volume ellipsoid lying inside a bounded convex set C with nonempty interior. We use a different parametrization of the ellipsoid now - as the forward image of the unit ball:

ε = {Bu + d : ‖u‖_2 ≤ 1}

Its volume is proportional to det B. The constraint is

sup_{‖u‖_2 ≤ 1} I_C(Bu + d) ≤ 0

where I_C is the indicator function of C.

Max Ellipsoid Inside a Polyhedron

A polyhedron described by m linear inequalities, C = {x : a_i^T x ≤ b_i}, is a tractable case. We optimize over B and d, translating the constraint as:

sup_{‖u‖_2 ≤ 1} I_C(Bu + d) ≤ 0  ⟺  sup_{‖u‖_2 ≤ 1} a_i^T (Bu + d) ≤ b_i for all i  ⟺  ‖B a_i‖_2 + a_i^T d ≤ b_i for all i

which is a convex constraint on B and d.

8.4.3 Affine Invariance

If T is an invertible matrix, it is stated that the LJ ellipsoid of the transformed set is the transformed LJ ellipsoid - it still covers the set after the transformation. The same holds for the maximum-volume inscribed ellipsoid.

8.5 Centering

8.5.1 Chebyshev Center

Given a bounded set C ⊆ R^n with nonempty interior, the Chebyshev centering problem finds a point of maximum depth, where

depth(x, C) = dist(x, R^n ∖ C)

So it's the point farthest from the exterior of C. This is not always tractable. Suppose C is defined by convex inequalities f_i(x) ≤ 0. Then the Chebyshev center could be found by solving:

maximize R subject to g_i(x, R) ≤ 0 (i = 0, 1, ...)

where g_i(x, R) is the pointwise maximum of f_i(x + Ru) over ‖u‖_2 ≤ 1. Since f_i is convex and x + Ru is affine in (x, R), each g_i is convex. However, g_i is hard to evaluate in general, since it is a pointwise maximum of convex functions. Therefore the Chebyshev centering problem is tractable only for specific classes of C - for example, when C is a polyhedron (an LP solves that case).

8.5.2 Maximum Volume Ellipsoid Center

A generalization of the Chebyshev center is the MVE center: the center of the maximum-volume inscribed ellipsoid. Whenever the MVE problem is solvable, the MVE center comes for free.
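A minimal sketch of the inscribed-ellipsoid problem from 8.4.2 (cvxpy; the polyhedron below is a made-up 2-D example). The optimal d is exactly the MVE center just mentioned.

```python
import numpy as np
import cvxpy as cp

# Polyhedron {x : a_i^T x <= b_i} -- an arbitrary, bounded 2-D example.
A = np.array([[1.0, 1.0], [-1.0, 2.0], [0.0, -1.0], [-2.0, -1.0]])
b = np.array([3.0, 4.0, 1.0, 2.0])
n = A.shape[1]

B = cp.Variable((n, n), PSD=True)   # ellipsoid {Bu + d : ||u||_2 <= 1}
d = cp.Variable(n)

# ||B a_i||_2 + a_i^T d <= b_i for each inequality.
cons = [cp.norm(B @ A[i], 2) + A[i] @ d <= b[i] for i in range(len(b))]

# Volume is proportional to det B, so maximize log det B.
cp.Problem(cp.Maximize(cp.log_det(B)), cons).solve()

print("MVE center d =", np.round(d.value, 3))
print("B =\n", np.round(B.value, 3))
```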
8.5.3 Analytic Center

The analytic center works with the logarithmic barrier. If C is defined by f_i(x) ≤ 0 for all i, the analytic center minimizes

− Σ_i log(−f_i(x))

This makes sense: when x is feasible, the absolute value of f_i(x) roughly measures the margin between x and the i-th infeasible region, and the analytic center maximizes the product of those margins. The analytic center is not invariant under different representations of the same set C, obviously.

8.6 Classification

This section deals with two sets of data, {x_1, ..., x_N} and {y_1, ..., y_M}. We want to find a function f such that f(x_i) > 0 and f(y_i) < 0.

8.6.1 Linear Discrimination

Linear discrimination finds an affine function f(x) = a^T x − b satisfying the above requirements. Since the requirements are homogeneous in a and b, we can scale them so that the following hold:

a^T x_i − b ≥ 1,    a^T y_i − b ≤ −1

8.6.2 Robust Linear Discrimination

If two sets can be linearly discriminated, there will always be multiple functions separating them. One way to choose among them is to maximize the minimum distance from the separating hyperplane to the samples - in other words, the maximum margin, or the "thickest slab". This leads to the problem:

maximize t
subject to a^T x_i − b ≥ t, a^T y_i − b ≤ −t, ‖a‖_2 ≤ 1

Note the last requirement: we normalize a, since otherwise we could make t arbitrarily large just by scaling a.

Support Vector Classifier

When the two sets cannot be linearly separated, we can relax the constraints f(x_i) ≥ 1 and f(y_i) ≤ −1 by rewriting them as

f(x_i) ≥ 1 − u_i,    f(y_i) ≤ −1 + v_i

where u_i and v_i are nonnegative. These numbers can be interpreted as measures of how much each constraint is violated. We can try to make the violations sparse by minimizing their sum; this is an ℓ1-type penalty, so u and v will (hopefully) be sparse.

Support Vector Machine

The above are two approaches to robust linear discrimination: the first maximizes the width of the slab, the second minimizes (a proxy of) the number of misclassified points. We can consider the trade-off between the two. Note that the width of the slab

{z : −1 ≤ a^T z − b ≤ 1}

can be calculated as the distance between the two hyperplanes a^T z = b − 1 and a^T z = b + 1: take a^T z_1 = b − 1 and a^T z_2 = b + 1, so a^T (z_2 − z_1) = 2, and the width is ‖z_2 − z_1‖_2 = 2/‖a‖_2 for the closest such pair. Now we can solve the following scalarized multicriterion problem:

minimize ‖a‖_2 + γ(1^T u + 1^T v)
subject to a^T x_i − b ≥ 1 − u_i, a^T y_i − b ≤ −1 + v_i, u ⪰ 0, v ⪰ 0

We have the SVM!

Logistic Regression

Another way to do approximate linear discrimination is logistic regression. This should be very familiar by now; the negative log-likelihood function is convex.

8.6.3 Nonlinear Discrimination

We can create nonlinear decision boundaries by introducing quadratic or higher-order polynomial features. For polynomial discrimination, we can bisect on the degree to find the smallest degree that separates the input.

8.7 Placement and Location

The placement problem deals with n points in R^k, where some locations are given and the rest are the optimization variables. The treatment is rather basic. In essence:

• You can minimize the sum of distances between connected nodes when the distance measure is convex (see the sketch after this list).
• You can place upper bounds on distances between pairs of points, or on the lengths of certain paths.
• When the underlying connectivity is a DAG, you can also minimize the maximum distance from a source node to a sink node using a DP-like argument.
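A minimal placement sketch (cvxpy; the graph, the fixed anchor points, and the choice of Euclidean distance are all made up for illustration):

```python
import numpy as np
import cvxpy as cp

# Fixed anchor points in R^2 and two free points to place (illustrative).
anchors = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
free = cp.Variable((2, 2))          # rows are the free points p0, p1

# Edges between free points and anchors: (free index, anchor index).
edges = [(0, 0), (0, 1), (1, 1), (1, 2)]

# Sum of Euclidean distances over all edges, plus the edge p0 -- p1.
cost = cp.norm(free[0] - free[1], 2)
for i, j in edges:
    cost += cp.norm(free[i] - anchors[j], 2)

cp.Problem(cp.Minimize(cost)).solve()
print("placed points:\n", np.round(free.value, 3))
```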
8.8 Floor Planning

Floor planning tries to place a number of axis-aligned rectangles without overlaps, optimizing the size of the bounding rectangle. This is a hard combinatorial optimization problem in general, but specifying the relative positioning of the boxes can make these problems convex. A relative positioning constraint specifies how a pair of rectangles is positioned - for example, rectangle i must be above, below, left of, or right of rectangle j. These can be cast as linear inequalities; for example, "rectangle i is left of rectangle j" becomes

x_i + w_i ≤ x_j

Some other constraints we can use:

• A minimum area for each rectangle.
• Aspect ratio constraints, which are simple linear (in)equalities.
• Alignment constraints: for example, two rectangles centered on the same line.
• Symmetry constraints.
• Distance constraints: given the relative positioning constraints, ℓ1 or ℓ∞ distance constraints can be cast pretty easily.

Optimizing for the area of the bounding box gives you a geometric programming problem.

9 Numerical Linear Algebra Background

10 Unconstrained Optimization

Welcome to part 3! For the rest of the material I plan to skip over the theoretical parts, covering only the motivation and rationale of the algorithms.

10.1 Unconstrained Minimization Problems

An unconstrained minimization problem has no constraints, just a function f we need to minimize. We assume f is convex and differentiable, so optimality can be checked by looking at the gradient ∇f. The problem can sometimes be solved analytically (for example, least squares), but in general we need to resort to an iterative method (as in, for example, geometric programming or analytic centering).

10.1.1 Strong Convexity

For most of this chapter we assume that f is strongly convex, which means that there exists m > 0 such that ∇²f(x) ⪰ mI for all x in the relevant (sublevel) set. This feels like an analogue of having a positive second-order coefficient - is it different from having a positive definite Hessian? (It is: the Hessian can be positive definite everywhere while its smallest eigenvalue approaches zero, e.g. f(x) = e^{−x} on R, so no single m > 0 works. Hopefully the video lecture provides more insight.) Anyway, this is an extremely strong assumption, and we can't in general expect our functions to be strongly convex. Then why assume it? We are looking at theoretical convergence, which is already unattainable in practice (no algorithm runs forever). The professor says it's more of a "feel good" result, so we may as well make assumptions that shorten the proofs. Strong convexity has interesting consequences; the usual convexity bound improves to:

f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)‖y − x‖_2^2

We can analytically minimize the RHS over y and plug the minimizer back in to get a lower bound on f(y):

f(y) ≥ f(x) − (1/(2m))‖∇f(x)‖_2^2

So this practically means that we have a near-optimal point whenever the gradient is small. When we know m, this is a hard guarantee, but m is not attainable in general. Therefore we resort to making ‖∇f(x)‖_2^2 small enough that we have a high chance of being near-optimal.
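A tiny numerical illustration of this suboptimality bound (plain numpy; the quadratic test function and its data are invented): for f(x) = ½ x^T P x + q^T x with P ≻ 0, the strong convexity constant is m = λ_min(P), and f(x) − p* ≤ ‖∇f(x)‖²/(2m) at every x.

```python
import numpy as np

np.random.seed(0)
n = 5
M = np.random.randn(n, n)
P = M @ M.T + np.eye(n)          # positive definite Hessian
q = np.random.randn(n)

f = lambda x: 0.5 * x @ P @ x + q @ x
grad = lambda x: P @ x + q

x_star = np.linalg.solve(P, -q)  # exact minimizer
p_star = f(x_star)
m = np.linalg.eigvalsh(P).min()  # strong convexity constant

for _ in range(5):
    x = np.random.randn(n)
    gap = f(x) - p_star
    cert = np.linalg.norm(grad(x))**2 / (2 * m)
    print(f"suboptimality {gap:10.4f} <= certificate {cert:10.4f}: {gap <= cert + 1e-9}")
```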
10.1.2 Condition Number of Sublevel Sets

The condition numbers of the sublevel sets strongly affect the efficiency of some algorithms. The condition number of a set is the ratio between its maximum and minimum widths, where the width of a convex set C along a direction q (‖q‖_2 = 1) is

W(C, q) = sup_{z∈C} q^T z − inf_{z∈C} q^T z

10.2 Descent Methods

This is the family of iterative algorithms that generate a new iterate x^(k+1) from x^(k) by taking

x^(k+1) = x^(k) + t^(k) ∆x^(k)

where t^(k) is called the step size and ∆x^(k) the search direction. Depending on how we choose t and ∆x, we get different algorithms. There are two popular ways of choosing t:

• Exact line search minimizes g(t) = f(x^(k) + t ∆x^(k)) exactly, by either analytic or iterative means. It is used when this one-dimensional minimization can be solved efficiently.
• Backtracking line search just looks for a t at which the objective decreases sufficiently. The exact details aren't very important; it is employed when the minimization is harder to solve. It is governed by two parameters which, in practice, do not drastically change the performance of the search.

10.3 Gradient Descent

Taking ∆x^(k) = −∇f(x^(k)) gives the gradient descent algorithm. Some convergence results are shown. The number of iterations required is at most

log((f(x^(0)) − p*)/ε) / log(1/c)

where p* is the optimal value and we stop when f(x^(k)) − p* ≤ ε. The numerator is intuitive; the denominator involves the condition number and is roughly equal to m/M. Therefore, as the condition number increases, the number of required iterations grows linearly in it. For a fixed condition number, the bound shows that the error decreases exponentially - for some reason this is called linear convergence in the optimization context.

10.3.1 Performance Analysis on Toy Problems

Exact line search and backtracking line search are compared on toy problems; the iteration counts differ by a factor of 2 or so. We also look at an example where the condition number of the Hessian of f is varied, and the number of iterations can really blow up.

10.4 Steepest Descent

The steepest descent (SD) algorithm generalizes gradient descent by employing a different norm. Given a norm, the normalized steepest descent direction is

∆x = argmin{∇f(x)^T v : ‖v‖ = 1}

Geometrically, we look at all directions in the unit ball of that norm around the current x and pick the one that decreases f the most (to first order). With the Euclidean norm we recover gradient descent. In some cases we can think of SD as GD after a change of coordinates (intuitively this makes sense, because using a different norm essentially means viewing the coordinate system differently).

10.4.1 Steepest Descent with an ℓ1 Norm

When we use the ℓ1 norm, SD essentially becomes the coordinate descent algorithm. It can be shown trivially: take the basis direction with the largest gradient component, and minimize along that direction. Within the ℓ1 unit ball, we can never find a steeper first-order descent.

10.4.2 Performance and Choice of Norm

Without any problem-specific assumptions, SD is essentially the same as GD. However, remember that the condition number greatly affects the performance of GD - and a change of coordinates changes the condition number of the sublevel sets. Therefore, if we can choose a norm such that the sublevel sets become approximately ellipsoidal/spherical, SD works very well. The Hessian at the optimal point, if attainable, is a norm choice that reduces the condition number greatly.
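Since the notes gloss over the details, here is a minimal sketch of gradient descent with backtracking line search (plain numpy; the test function and the parameters α = 0.25, β = 0.5 are illustrative choices, not prescribed by the text):

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha=0.25, beta=0.5, tol=1e-8, max_iter=1000):
    """Gradient descent with backtracking line search."""
    x = x0.astype(float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:          # stop when the gradient is small
            break
        t = 1.0
        # Backtrack until the sufficient-decrease condition holds.
        while f(x - t * g) > f(x) - alpha * t * g @ g:
            t *= beta
        x = x - t * g
    return x, k

# Poorly conditioned quadratic: condition number 100.
P = np.diag([1.0, 100.0])
f = lambda x: 0.5 * x @ P @ x
grad = lambda x: P @ x
x_opt, iters = gradient_descent(f, grad, np.array([1.0, 1.0]))
print("solution:", x_opt, "iterations:", iters)
```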
10.5 Newton's Method

Newton's method is the workhorse of convex optimization. The major motivation is that it minimizes the quadratic approximation of f at the current x. To do this, we choose the Newton step ∆x_nt:

∆x_nt = −∇²f(x)^{-1} ∇f(x)

Several properties and interpretations are discussed:

• The Newton step minimizes the second-order Taylor approximation of f. So when f roughly follows a quadratic form, Newton's method is tremendously efficient.
• It is the steepest descent direction in the quadratic norm defined by the Hessian. Recall that the Hessian at the optimal point is a great choice of norm for SD - so near the optimum, this choice reduces the effective condition number greatly.
• It solves the linearized optimality condition: we want v such that ∇f(x + v) = 0, and approximately
  ∇f(x + v) ≈ ∇f(x) + ∇²f(x) v = 0
  The Newton step is the solution of this equation.
• The Newton step is affinely invariant; for example, scaling a single coordinate by a constant factor will not change the convergence. This is a big advantage over the usual gradient descent: Newton's method is much more resistant to sublevel sets with high condition numbers. In practice an extremely high condition number can still hinder us because of finite-precision arithmetic, yet it is still a big improvement.

10.5.1 The Newton Decrement

The Newton decrement is a closely related scalar value; among other things, it is used as a stopping criterion:

λ(x) = (∇f(x)^T ∇²f(x)^{-1} ∇f(x))^{1/2}

It is related to our estimate of the error f(x) − p* by the relationship

f(x) − p* ≈ f(x) − inf_y f̂(y) = λ²/2

where f̂ is the second-order approximation. We stop when this value (λ²/2) is less than ε.

10.5.2 Newton's Method

Newton's method closely follows the gradient descent template, except that it uses the Newton step and the Newton decrement as the stopping criterion, checked before making the update.

10.5.3 Convergence Analysis

The story told by the convergence analysis is interesting. There is a threshold on ‖∇f(x)‖_2 below which the algorithm converges quadratically, and once the gradient drops below the threshold, it stays below it in all further iterations. Therefore the algorithm works in two separate stages:

• In the damped Newton phase, line search can give a step size t < 1, and f decreases by at least some constant γ per iteration.
• The pure Newton phase follows, where only full steps (t = 1) are taken and we get quadratic convergence.

10.5.4 Summary

• Very fast convergence - in particular, quadratic convergence as we approach the optimal point.
• Affine invariance: much more resistant to high condition numbers.
• Performance does not depend much on the correct choice of parameters, unlike SD.

10.6 Self-Concordant Functions

This section covers an alternative assumption on f which allows a better (or more elegant) analysis of the performance of Newton's method. This seems like more of an aesthetic, theoretical result, so unless some insights come up in the video lectures, I am going to skip it.
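To close out the chapter, a minimal sketch of the damped Newton iteration described in 10.5.2 (plain numpy; the smooth convex test function and the backtracking parameters are made up for illustration):

```python
import numpy as np

# Smooth convex test function: f(x) = sum_i exp(a_i^T x + b_i).
A = np.array([[1.0, 3.0], [1.0, -3.0], [-1.0, 0.0]])
b = np.array([-0.1, -0.1, -0.1])

f = lambda x: np.sum(np.exp(A @ x + b))
def grad(x): return A.T @ np.exp(A @ x + b)
def hess(x): return A.T @ (np.exp(A @ x + b)[:, None] * A)

def newton(x, eps=1e-10, alpha=0.25, beta=0.5):
    while True:
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)           # Newton step
        lam2 = -g @ dx                        # Newton decrement squared, lambda^2
        if lam2 / 2 <= eps:                   # stopping criterion: lambda^2/2 <= eps
            return x
        t = 1.0                               # damped phase: backtracking line search
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx

print("minimizer:", np.round(newton(np.zeros(2)), 6))
```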