On how to solve large-scale log-determinant optimization problems Chengjing Wang January 25, 2014
∗ Chengjing Wang: School of Mathematics and Institute of Applied Mathematics, Southwest Jiaotong University, No.999, Xian Road, Chengdu 611756, China ([email protected]). The author's research was supported by the National Natural Science Foundation of China under grant 11201382, the Youth Fund of Humanities and Social Sciences of the Ministry of Education under grant 12YJC910008, the project of the science and technology department of Sichuan province under grant 2012ZR0154, and the Fundamental Research Funds for the Central Universities under grants SWJTU12CX055 and SWJTU12ZT15.

Abstract

We propose a proximal augmented Lagrangian method and a hybrid method to solve large-scale nonlinear semidefinite programming problems whose objective functions are the sum of a convex quadratic function and a log-determinant term. The hybrid method employs the proximal augmented Lagrangian method to generate a good initial point and then the Newton-CG augmented Lagrangian method to obtain a highly accurate solution. We demonstrate that the algorithms can supply a high-quality solution efficiently, even for some ill-conditioned problems.

Keywords: Quadratic programming, Log-determinant optimization problem, Proximal augmented Lagrangian method, Augmented Lagrangian method, Newton-CG method

1 Introduction

In this paper, with the convention $\log 0 := -\infty$, we consider the following standard primal and dual nonlinear semidefinite programming problems whose objective functions are the sum of a convex quadratic function and a log-determinant term (QP-Logdet):

\[ (P) \quad \min_{X} \Big\{ \tfrac{1}{2}\langle X, \mathcal{Q}(X)\rangle + \langle C, X\rangle - \mu \log\det X \;:\; \mathcal{A}(X) = b,\ X \succeq 0 \Big\}, \]

\[ (D) \quad \max_{X,y,Z} \Big\{ -\tfrac{1}{2}\langle X, \mathcal{Q}(X)\rangle + b^T y + \mu \log\det Z + n\mu(1 - \log\mu) \;:\; -\mathcal{Q}(X) + \mathcal{A}^* y + Z = C,\ Z \succeq 0 \Big\}, \]
where $\mathcal{Q} : \mathcal{S}^n \to \mathcal{S}^n$ is a given self-adjoint positive semidefinite linear operator, $C \in \mathcal{S}^n$, $b \in \mathbb{R}^m$, $\mu \ge 0$ is a given parameter, $\mathcal{A} : \mathcal{S}^n \to \mathbb{R}^m$ is a linear map and $\mathcal{A}^* : \mathbb{R}^m \to \mathcal{S}^n$ is the adjoint of $\mathcal{A}$. Note that the linear maps $\mathcal{A}$ and $\mathcal{A}^*$ can be expressed, respectively, as

\[ \mathcal{A}(X) = \big[ \langle A_1, X\rangle, \ldots, \langle A_m, X\rangle \big]^T, \qquad \mathcal{A}^*(y) = \sum_{k=1}^{m} y_k A_k, \tag{1} \]

where $A_k$, $k = 1, \ldots, m$, are given matrices in $\mathcal{S}^n$. For an explanation of the other main notation, see Subsection 1.1.

The perturbed QP-Logdet problem has the form

\[ (P^\varepsilon) \quad \min_{X} \Big\{ \tfrac{1}{2}\langle X, \mathcal{Q}(X)\rangle + \langle C, X\rangle - \mu \log\det X \;:\; \mathcal{A}(X) = b,\ X \succeq \varepsilon I \Big\}, \]

\[ (D^\varepsilon) \quad \max_{X,y,Z} \Big\{ -\tfrac{1}{2}\langle X, \mathcal{Q}(X)\rangle + b^T y + \langle Z, \varepsilon I\rangle + \mu \log\det\big(\mathcal{Q}(X) + C - \mathcal{A}^* y - Z\big) + n\mu(1 - \log\mu) \;:\; -\mathcal{Q}(X) + \mathcal{A}^* y + Z + \mu X^{-1} = C,\ Z \succeq 0 \Big\}. \]

The QP-Logdet problem (P) is itself a classical convex model problem in optimization theory. It can be regarded as an extension of both the quadratic semidefinite programming problem (QSDP) and the log-determinant (Logdet) problem, so it shares the structure of both, and it is of considerable interest in its own right. The QSDP is a core problem in nonlinear semidefinite programming and has been considered by Toh [35], Toh, Tütüncü and Todd [36, 37], Zhao [45], Jiang, Sun and Toh [14], etc. The Logdet problem has a very important application in covariance selection [5] and has been intensively studied over the past several years; see the work of Dahl, Vandenberghe and Roychowdhury [4], d'Aspremont, Banerjee and El Ghaoui [6], Li and Toh [15], Lu [16, 17], Lu and Zhang [18], Olsen, Oztoprak, Nocedal and Rennie [24], Scheinberg, Ma and Goldfarb [30], Scheinberg and Rish [31], Toh [34], Wang, Sun and Toh [40], Yang, Sun and Toh [41], Yang, Shen, Wonka, Lu and Ye [43], Yuan [44], etc. As far as the QP-Logdet problem is concerned, it also arises in many practical applications such as robust simulation of global warming policies [13], speech recognition [39], and so on.
Thus algorithms developed to solve this kind of problem can potentially find wide application. For QP-Logdet problems of small and medium size, the interior-point method (IPM) with a direct solver is certainly an efficient and robust approach. For large-scale problems, however, the IPM becomes impractical because it needs to compute, store, and factorize Schur matrices that are typically dense. In view of this, we need to design new approaches.

The proximal augmented Lagrangian (PAL) method proposed in [7] is a fast primal-dual approach. The key idea is to apply the proximal technique to the augmented Lagrangian function of the primal problem at every iteration, so that the simplified inner problem has an analytical or at least a semi-analytical solution that can be computed very fast, and then to update the dual variables. The convergence of the PAL method has been shown in [7]. The biggest advantage of the PAL method is that there is no need to solve any linear system of equations to obtain step directions, which makes it a good approach for generating an initial point. Furthermore, for some ill-conditioned inner problems whose Hessian matrices are of nearly low rank, the proximal technique of the PAL method is actually better than the Newton-CG method, since it can be regarded as a regularization in some sense. In view of these advantages, we apply the PAL method to the QP-Logdet problem, especially for generating an initial point.

Although the PAL method is a good choice for the QP-Logdet problem, it is after all a gradient method. Once the iterate is near the optimal solution, we may consider an approach with a faster local convergence rate. The augmented Lagrangian (AL) method [29] is well suited to play this role. It is a classical method that can be viewed as a special form of the proximal point algorithm (PPA).
(The PPA was proposed by Martinet [19] and further studied by Rockafellar [28, 29].) Although the AL method for convex programs is a gradient ascent method applied to the corresponding dual problems [27], Sun, Sun and Zhang revealed that under the constraint nondegeneracy conditions [1], it can be locally regarded as an approximate generalized Newton method applied to a semismooth equation. This is the reason that inspired us to apply the AL method to the QP-Logdet problem. Successful applications of the AL method to large-scale semidefinite programming problems can also be seen in [46, 40, 41] and the references therein.

When the AL method is applied to problem (P), we have to solve inner problems that have no closed-form solutions. The well-tested, quadratically convergent semismooth Newton method introduced by Qi and Sun [25] is an ideal approach for this task. To compute the semismooth Newton direction, the preconditioned conjugate gradient (PCG) method is a good choice because the Newton system of equations is large and a direct solver is not appropriate. Based mainly on Rockafellar's theoretical framework, the global convergence and local convergence rate of the sequence generated by the Newton-CG AL (NAL) method can be established.

The numerical results demonstrate that the proposed approach can be very efficient and robust in supplying highly accurate solutions for large-scale problems, for both synthetic and real data. In this paper, we solve synthetic QP-Logdet problems with n up to 2,000 and m up to 1,186,173 in about 2 hours.

The remaining parts of this paper are organized as follows. In Section 2, we give a brief introduction to some preliminaries. In Section 3, we describe the algorithms in detail. In Section 4, we establish the convergence theory of the proposed approaches. In Section 5, we describe some numerical issues in the semismooth Newton-CG method.
In Section 6, we present the numerical results. Finally, we give some concluding remarks.

1.1 Additional Notations

In this paper, all vectors are assumed to be finite dimensional. The symbol $\mathbb{R}^n$ denotes the n-dimensional Euclidean space, and $\mathcal{O}^n$ denotes the set of $n \times n$ orthogonal matrices. The set of all $m \times n$ matrices with real entries is denoted by $\mathbb{R}^{m\times n}$, $\langle\cdot,\cdot\rangle$ stands for the standard trace inner product in $\mathcal{S}^n$, $\|\cdot\|$ denotes the induced Frobenius norm, and $\mathcal{S}^n_+$ and $\mathcal{S}^n_{++}$ denote the sets of positive semidefinite matrices and positive definite matrices, respectively. The symbol $\circ$ denotes the Hadamard product, and $\otimes$ denotes the Kronecker product. The symbol $\mathcal{F}_{P^\varepsilon} := \{X \in \mathcal{S}^n : \mathcal{A}(X) = b,\ X \succeq \varepsilon I\}$ denotes the feasible set of problem $(P^\varepsilon)$, and $\Pi_K(\cdot)$ denotes the metric projector onto the closed convex set $K$. Let $\mathrm{vec} : \mathbb{R}^{m\times n} \to \mathbb{R}^{mn}$ be the vectorization operator defined by stacking the columns of a matrix, one by one, into a long vector.

2 Preliminaries

To present our ideas more clearly, we first introduce some concepts related to the AL method, based on the classical papers by Rockafellar [28, 29].

2.1 Maximal monotone operators

Let $\mathcal{H}$ be a real Hilbert space with inner product $\langle\cdot,\cdot\rangle$. A multifunction $T : \mathcal{H} \rightrightarrows \mathcal{H}$ is said to be a monotone operator if

\[ \langle z - z', w - w'\rangle \ge 0 \quad \text{whenever } w \in T(z),\ w' \in T(z'). \tag{2} \]

It is said to be maximal monotone if, in addition, the graph $G(T) = \{(z, w) \in \mathcal{H}\times\mathcal{H} \mid w \in T(z)\}$ is not properly contained in the graph of any other monotone operator $T' : \mathcal{H} \rightrightarrows \mathcal{H}$. For example, if $T$ is the subdifferential $\partial f$ of a lower semicontinuous convex function $f : \mathcal{H} \to (-\infty, +\infty]$, $f \not\equiv +\infty$, then $T$ is maximal monotone (see Minty [21] or Moreau [22]), and the relation $0 \in T(z)$ means that $f(z) = \min f$.

2.2 Representations in terms of maximal monotone operators

The following Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient for the optimality of (P) and (D):

\[ \mathcal{A}(X) - b = 0, \quad \mathcal{Q}(X) + C - \mathcal{A}^* y - Z = 0, \quad XZ = \mu I, \quad X \succeq 0, \quad Z \succeq 0. \]
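As a quick sanity check of the KKT system for (P) and (D), consider a deliberately trivial special case that is an illustrative assumption and is not treated separately in the paper: $\mathcal{Q} = 0$, no equality constraints, $C \succ 0$ and $\mu > 0$. The KKT conditions can then be solved by hand, and the primal and dual objective values coincide, which also shows where the constant $n\mu(1 - \log\mu)$ in (D) comes from.

```latex
\begin{align*}
C - Z = 0,\quad XZ = \mu I
  &\;\Longrightarrow\; Z = C,\qquad X = \mu C^{-1},\\[2pt]
\text{primal value:}\quad
\langle C, \mu C^{-1}\rangle - \mu\log\det(\mu C^{-1})
  &= n\mu - n\mu\log\mu + \mu\log\det C,\\[2pt]
\text{dual value:}\quad
\mu\log\det Z + n\mu(1-\log\mu)
  &= \mu\log\det C + n\mu - n\mu\log\mu .
\end{align*}
```

Both values agree, so the duality gap is zero at the KKT point in this special case.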
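The following KKT conditions are also necessary and sufficient for the optimality of $(P^\varepsilon)$ and $(D^\varepsilon)$:

\[ \mathcal{A}(X) - b = 0, \quad \mathcal{Q}(X) + C - \mathcal{A}^* y - Z - \mu X^{-1} = 0, \quad \langle X - \varepsilon I, Z\rangle = 0, \quad X - \varepsilon I \succeq 0, \quad Z \succeq 0. \]

Throughout this paper, the following conditions for $(P^\varepsilon)$ and $(D^\varepsilon)$ are assumed to hold.

Assumption 2.1. Problem $(P^\varepsilon)$ satisfies the condition: there exists $X_0 \in \mathcal{S}^n_{++}$ such that $\mathcal{A}(X_0) = b$, $X_0 \succ \varepsilon I$.

Assumption 2.2. For problem $(P^\varepsilon)$, there exists an $\alpha \in \mathbb{R}$ such that the level set $\{X \mid f_0(X) \le \alpha,\ \mathcal{A}(X) = b,\ X \succeq \varepsilon I\}$ is nonempty and bounded, where $f_0(X)$ stands for the objective function.

Let $l(X; y, Z) : \mathcal{S}^n \times \mathbb{R}^m \times \mathcal{S}^n \to \mathbb{R}$ be the ordinary Lagrangian function for $(P^\varepsilon)$ in the extended form:

\[ l(X; y, Z) = \begin{cases} L_0(X; y, Z) & \text{if } X \in \mathcal{F}_{P^\varepsilon} \text{ and } (y, Z) \in \mathbb{R}^m \times \mathcal{S}^n_+, \\ -\infty & \text{if } X \in \mathcal{F}_{P^\varepsilon} \text{ and } (y, Z) \notin \mathbb{R}^m \times \mathcal{S}^n_+, \\ +\infty & \text{if } X \notin \mathcal{F}_{P^\varepsilon}, \end{cases} \]

where

\[ L_0(X; y, Z) = \tfrac{1}{2}\langle X, \mathcal{Q}(X)\rangle + \langle C, X\rangle - \mu\log\det X - \langle y, \mathcal{A}(X) - b\rangle - \langle Z, X - \varepsilon I\rangle. \]

The essential objective function in $(P^\varepsilon)$ is given by

\[ f(X) = \sup_{y \in \mathbb{R}^m,\, Z \in \mathcal{S}^n} l(X; y, Z) = \begin{cases} \tfrac{1}{2}\langle X, \mathcal{Q}(X)\rangle + \langle C, X\rangle - \mu\log\det X & \text{if } X \in \mathcal{F}_{P^\varepsilon}, \\ +\infty & \text{otherwise}, \end{cases} \]

while the essential objective function in $(D^\varepsilon)$ is given by

\[ g(y, Z) = \inf_{X \in \mathcal{S}^n} l(X; y, Z) = \begin{cases} g_0(y, Z) & \text{if } (y, Z) \in \mathbb{R}^m \times \mathcal{S}^n_+, \\ -\infty & \text{if } (y, Z) \notin \mathbb{R}^m \times \mathcal{S}^n_+. \end{cases} \]

As in [29], we can define the following operators:

\[ T_l(X, y, Z) = \{(V, u, W) \in \mathcal{S}^n \times \mathbb{R}^m \times \mathcal{S}^n \mid (V, -u, -W) \in \partial l(X; y, Z)\}, \quad (X, y, Z) \in \mathcal{S}^n \times \mathbb{R}^m \times \mathcal{S}^n, \]
\[ T_f(X) = \{V \in \mathcal{S}^n \mid V \in \partial f(X)\}, \quad X \in \mathcal{S}^n, \]
\[ T_g(y, Z) = \{(u, W) \in \mathbb{R}^m \times \mathcal{S}^n \mid (-u, -W) \in \partial g(y, Z)\}, \quad (y, Z) \in \mathbb{R}^m \times \mathcal{S}^n. \]

We can observe that $l$ is a closed proper saddle function in the sense of [26, p.363], and the mapping $T_l$ is a maximal monotone operator in $\mathcal{S}^n \times \mathbb{R}^m \times \mathcal{S}^n$ [26, Cor.37.5.2]. Meanwhile, $f$ is a closed proper convex function, so $T_f$ is a maximal monotone operator in $\mathcal{S}^n$ [21, 22]. Similarly, $g$ is a closed proper concave function, so $T_g$ is a maximal monotone operator in $\mathbb{R}^m \times \mathcal{S}^n$.

To discuss the rate of convergence, we introduce some related concepts and conclusions.

Definition 2.1. (cf.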
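[28]) For a maximal monotone operator $T$, we say that its inverse $T^{-1}$ is Lipschitz continuous at the origin (with modulus $a \ge 0$) if there is a unique solution $\bar z$ to $z = T^{-1}(0)$, and for some $\tau > 0$ we have

\[ \|z - \bar z\| \le a\|w\| \quad \text{whenever } z \in T^{-1}(w) \text{ and } \|w\| \le \tau. \]

Assumption 2.3. Let $\bar X$ be the optimal solution to problem $(P^\varepsilon)$ and let $(\bar y, \bar Z)$ be the corresponding Lagrangian multiplier. We say that the primal constraint nondegeneracy [1] holds at $\bar X$ for problem $(P^\varepsilon)$ if

\[ \begin{pmatrix} \mathcal{A} \\ \mathcal{I} \end{pmatrix} \mathcal{S}^n + \begin{pmatrix} \{0\} \\ \mathrm{lin}\big(T_{\mathcal{S}^n_+}(\bar X - \varepsilon I)\big) \end{pmatrix} = \begin{pmatrix} \mathbb{R}^m \\ \mathcal{S}^n \end{pmatrix}, \]

where $\mathrm{lin}(T_{\mathcal{S}^n_+}(\bar X - \varepsilon I))$ denotes the lineality space of $T_{\mathcal{S}^n_+}(\bar X - \varepsilon I)$, and $T_{\mathcal{S}^n_+}(\bar X - \varepsilon I)$ denotes the tangent cone of $\mathcal{S}^n_+$ at $\bar X - \varepsilon I$.

Assumption 2.4. Let $\bar X$ be the optimal solution to problem $(P^\varepsilon)$ and let $(\bar y, \bar Z)$ be the corresponding Lagrangian multiplier. We say that the strong second-order sufficient condition [3] holds at $\bar X$ if

\[ \langle H, \nabla^2_{XX} L_0(\bar X; \bar y, \bar Z)(H)\rangle + \Upsilon_{(\bar X - \varepsilon I)}(\bar Z, H) > 0, \quad \forall\, H \in \mathrm{aff}(\mathcal{C}(\bar X)) \setminus \{0\}, \]

where the linear-quadratic function $\Upsilon_B : \mathcal{S}^n \times \mathcal{S}^n \to \mathbb{R}$ is defined by

\[ \Upsilon_B(\Gamma, A) := 2\langle \Gamma, A B^{\dagger} A\rangle, \quad (\Gamma, A) \in \mathcal{S}^n \times \mathcal{S}^n, \]

and $B^{\dagger}$ is the Moore-Penrose pseudo-inverse of $B$.

Next we present a direct condition for the Lipschitz continuity of $T_g^{-1}$, due to Rockafellar [29, Prop.2].

Proposition 2.1. If the strong second-order sufficient condition and the primal constraint nondegeneracy condition are satisfied, then $T_l^{-1}$ is actually single-valued and continuously differentiable on a neighborhood of the origin, and so are $T_f^{-1}$ and $T_g^{-1}$. In particular, these mappings are all Lipschitz continuous at the origin.

Remark 2.1. We emphasize that the so-called strong second-order condition in [29] actually includes the primal constraint nondegeneracy condition.

3 The algorithms

3.1 The PAL method

In this subsection, we describe the PAL method in detail. The augmented Lagrangian function of (P) is

\[ L_\sigma(X; y) = \tfrac{1}{2}\langle X, \mathcal{Q}(X)\rangle + \langle C, X\rangle - \mu\log\det X - \langle y, \mathcal{A}(X) - b\rangle + \tfrac{\sigma}{2}\|\mathcal{A}(X) - b\|^2, \quad X \succeq 0. \]

At the kth iteration, we solve the following subproblem:

\[ \mathop{\mathrm{argmin}} \Big\{ L_{\sigma_k}(X; y^k) + \tfrac{1}{2}\|X - X^k\|^2_{\mathcal{T}_k} \;\Big|\; X \succeq 0 \Big\} = \mathop{\mathrm{argmin}} \Big\{ \tfrac{1}{2}\langle X, (\mathcal{Q} + \sigma_k \mathcal{A}^*\mathcal{A} + \mathcal{T}_k)(X)\rangle - \langle M^k, X\rangle - \mu\log\det X \;\Big|\; X \succeq 0 \Big\}, \tag{3} \]

where $M^k = \mathcal{A}^* y^k + \sigma_k \mathcal{A}^* b + \mathcal{T}_k(X^k) - C$. Pick $\mathcal{T}_k \succeq 0$ such that $\mathcal{Q} + \sigma_k \mathcal{A}^*\mathcal{A} + \mathcal{T}_k = \alpha_k \mathcal{I}$, with $\alpha_k = \lambda_{\max}(\mathcal{Q} + \sigma_k \mathcal{A}^*\mathcal{A})$. Then

\[ M^k = -\mathcal{Q}(X^k) - C + \alpha_k X^k + \mathcal{A}^*\big(y^k + \sigma_k(b - \mathcal{A}(X^k))\big), \]

and (3) simplifies to

\[ \mathop{\mathrm{argmin}} \Big\{ \tfrac{1}{2}\alpha_k \big\|X - \tfrac{1}{\alpha_k} M^k\big\|^2 - \mu\log\det X \;\Big|\; X \succeq 0 \Big\}. \tag{4} \]

Based on Lemma 2.1 in [40], problem (4) has the closed-form solution

\[ X^{k+1} = \phi^+_{\gamma_k}\big(\tfrac{1}{\alpha_k} M^k\big) := P^k \mathrm{diag}\big(\phi^+_{\gamma_k}(d^k)\big)(P^k)^T, \tag{5} \]

and further

\[ Z^{k+1} = \phi^-_{\gamma_k}\big(\tfrac{1}{\alpha_k} M^k\big) := P^k \mathrm{diag}\big(\phi^-_{\gamma_k}(d^k)\big)(P^k)^T, \tag{6} \]

where $\tfrac{1}{\alpha_k} M^k = P^k D^k (P^k)^T$ is the eigenvalue decomposition of $\tfrac{1}{\alpha_k} M^k$ with $P^k \in \mathcal{O}^n$ and $D^k = \mathrm{diag}(d^k)$, $d^k \in \mathbb{R}^n$; $\phi^+_{\gamma_k}(\tfrac{1}{\alpha_k} M^k)$ and $\phi^-_{\gamma_k}(\tfrac{1}{\alpha_k} M^k)$ are matrix-valued functions; $\phi^+_{\gamma_k}(d^k)$ and $\phi^-_{\gamma_k}(d^k)$ are vector-valued functions whose scalar counterparts are

\[ \phi^+_{\gamma_k}(x) = \frac{\sqrt{x^2 + 4\gamma_k} + x}{2} \quad \text{and} \quad \phi^-_{\gamma_k}(x) = \frac{\sqrt{x^2 + 4\gamma_k} - x}{2}, \]

respectively, and $\gamma_k = \mu/\alpha_k$. From (5) and (6), we can see that $X^{k+1}$ and $Z^{k+1}$ are automatically positive definite. Once $X^{k+1}$ is obtained, one may update the dual variable to get $y^{k+1}$. To compute $\lambda_{\max}(\mathcal{Q} + \sigma_k \mathcal{A}^*\mathcal{A})$, we may apply the power method. We now summarize the PAL method.

Algorithm 1: The PAL method.
Input $X^0 \in \mathcal{S}^n_{++}$, $y^0 \in \mathbb{R}^m$, $\sigma_0 = 10$. Iterate the following steps:
Step 1. Compute $\alpha_k = \lambda_{\max}(\mathcal{Q} + \sigma_k \mathcal{A}^*\mathcal{A})$.
Step 2. Set $\mathcal{T}_k = \alpha_k \mathcal{I} - \mathcal{Q} - \sigma_k \mathcal{A}^*\mathcal{A}$ and $M^k = \mathcal{A}^* y^k + \sigma_k \mathcal{A}^* b + \mathcal{T}_k(X^k) - C$. Compute $X^{k+1} = \phi^+_{\gamma_k}(\tfrac{1}{\alpha_k} M^k)$ and $Z^{k+1} = \phi^-_{\gamma_k}(\tfrac{1}{\alpha_k} M^k)$.
Step 3. Set $y^{k+1} = y^k - \tau\sigma_k(\mathcal{A}(X^{k+1}) - b)$, where $\tau \in [1, 2)$ is a given parameter.
Step 4. Compute
\[ R_P^{k+1} = \frac{\|b - \mathcal{A}(X^{k+1})\|}{\max\{1, \|b\|\}}, \qquad R_D^{k+1} = \frac{\|\mathcal{Q}(X^{k+1}) + C - \mathcal{A}^* y^{k+1} - Z^{k+1}\|}{\max\{1, \|C\|\}}. \]
If $\max\{R_P^{k+1}, R_D^{k+1}\} <$ Tol1, stop; else, choose $\sigma_{k+1}$; end.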
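The closed-form update in Step 2 of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's Matlab code: the function names `phi_plus`, `phi_minus` and `pal_prox_step` are hypothetical, the scalar maps are taken to be $(\sqrt{x^2+4\gamma} \pm x)/2$, and the matrix $M^k$ and scalar $\alpha_k$ are assumed to have been formed already.

```python
import numpy as np

def phi_plus(x, gamma):
    # Scalar map (sqrt(x^2 + 4*gamma) + x) / 2, applied elementwise.
    return 0.5 * (np.sqrt(x * x + 4.0 * gamma) + x)

def phi_minus(x, gamma):
    # Scalar map (sqrt(x^2 + 4*gamma) - x) / 2, applied elementwise.
    return 0.5 * (np.sqrt(x * x + 4.0 * gamma) - x)

def pal_prox_step(M, alpha, mu):
    """Closed-form minimizer of (alpha/2)*||X - M/alpha||^2 - mu*log det X.

    Returns (X, Z) with X = phi_plus(M/alpha) and Z = phi_minus(M/alpha),
    both evaluated through one symmetric eigenvalue decomposition.
    """
    W = M / alpha
    W = 0.5 * (W + W.T)              # guard against round-off asymmetry
    d, P = np.linalg.eigh(W)         # W = P diag(d) P^T
    gamma = mu / alpha
    X = (P * phi_plus(d, gamma)) @ P.T
    Z = (P * phi_minus(d, gamma)) @ P.T
    return X, Z
```

By construction, the two outputs satisfy X - Z = M/alpha and X Z = (mu/alpha) I, so both are positive definite whenever mu > 0, and the only expensive operation per iteration is the eigenvalue decomposition.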
3.2 The NAL method

In this subsection, we apply the NAL method to the perturbed problem $(P^\varepsilon)$, since this method cannot automatically guarantee the positive definiteness of the variable $X$. Given a penalty parameter $\sigma > 0$, the augmented Lagrangian function for problem $(P^\varepsilon)$ is defined as

\[ L_\sigma(X; y, Z) = \tfrac{1}{2}\langle X, \mathcal{Q}(X)\rangle + \langle C, X\rangle - \mu\log\det X - \langle y, \mathcal{A}(X) - b\rangle + \tfrac{\sigma}{2}\|\mathcal{A}(X) - b\|^2 + \tfrac{1}{2\sigma}\big\|\Pi_{\mathcal{S}^n_+}(Z - \sigma(X - \varepsilon I))\big\|^2 - \tfrac{1}{2\sigma}\|Z\|^2. \]

At the kth iteration, we solve the following subproblem:

\[ \mathop{\mathrm{argmin}} \big\{ L_{\sigma_k}(X; y^k, Z^k) \;\big|\; X \succeq \varepsilon I \big\}. \tag{7} \]

Since the subproblem (7) has no closed-form solution, it requires further elaboration. The first-order derivative of $\Theta_k(X) := L_{\sigma_k}(X; y^k, Z^k)$ with respect to $X$ is

\[ \nabla\Theta_k(X) = \mathcal{Q}(X) + C - \mathcal{A}^* y^k - \mu X^{-1} + \sigma_k \mathcal{A}^*(\mathcal{A}(X) - b) - \Pi_{\mathcal{S}^n_+}(Z^k - \sigma_k(X - \varepsilon I)). \]

Furthermore, based on [26, Thm. 23.8] and [32, Lem. 2.1], we can calculate the Clarke generalized Jacobian $\partial^2\Theta_k(X) := \partial\nabla\Theta_k(X)$ as

\[ \partial^2\Theta_k(X)[H] = (\mathcal{Q} + \sigma_k \mathcal{A}^*\mathcal{A})[H] + \mu X^{-1} H X^{-1} + \sigma_k\, \partial\Pi_{\mathcal{S}^n_+}(Z^k - \sigma_k(X - \varepsilon I))[H], \quad \forall\, H \in \mathcal{S}^n. \]

From [20, Prop.1], every element in $\partial\Pi_{\mathcal{S}^n_+}(\cdot)$ is positive semidefinite, so every element in $\partial^2\Theta_k(X)$ is positive definite. Therefore we can apply the semismooth Newton method developed in [25], together with a line search technique that guarantees global convergence.
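We obtain the Newton direction by solving the linear system of equations

\[ \widehat{V}^0_{\sigma_k}[H] = -\nabla\Theta_k(X^k), \]

where

\[ \widehat{V}^0_{\sigma_k} := \mathcal{Q} + \sigma_k \mathcal{A}^*\mathcal{A} + \mu (X^k)^{-1} \otimes (X^k)^{-1} + V^0_{\sigma_k}. \]

Here $V^0_{\sigma_k}$, built from an element of $\partial\Pi_{\mathcal{S}^n_+}(Z^k - \sigma_k(X^k - \varepsilon I))$, may be chosen in the same way as in [46]:

\[ V^0_{\sigma_k}[H] = \sigma_k P^k\big(\Omega^k \circ ((P^k)^T H P^k)\big)(P^k)^T, \]

where $P^k \in \mathcal{O}^n$ and $Z^k - \sigma_k(X^k - \varepsilon I) = P^k \Lambda^k (P^k)^T$, with $\Lambda^k$ the diagonal matrix whose diagonal entries are the eigenvalues $\lambda^k_1 \ge \lambda^k_2 \ge \cdots \ge \lambda^k_n$ of $Z^k - \sigma_k(X^k - \varepsilon I)$. Define the three index sets $\alpha_k := \{i \mid \lambda^k_i > 0\}$, $\beta_k := \{i \mid \lambda^k_i = 0\}$, and $\gamma_k := \{i \mid \lambda^k_i < 0\}$. Then

\[ \Omega^k = \begin{bmatrix} E_{\bar\gamma_k \bar\gamma_k} & \nu_{\bar\gamma_k \gamma_k} \\ \nu_{\bar\gamma_k \gamma_k}^T & 0 \end{bmatrix}, \qquad \nu_{ij} := \frac{\lambda^k_i}{\lambda^k_i - \lambda^k_j}, \quad i \in \bar\gamma_k,\ j \in \gamma_k, \]

where $\bar\gamma_k = \{1, \ldots, n\} \setminus \gamma_k$, and $E_{\bar\gamma_k \bar\gamma_k} \in \mathcal{S}^{|\bar\gamma_k|}$ is the matrix of ones. We now summarize the algorithm.

Algorithm 2: The NAL method.
Input $X^0 \in \mathcal{F}_{P^\varepsilon}$, $y^0 \in \mathbb{R}^m$, $Z^0 \in \mathcal{S}^n_+$, $\sigma_0 = 10$. Iterate the following steps:
Step 1. Find an approximate minimizer
\[ X^{k+1} \approx \mathop{\mathrm{argmin}}_{X \in \mathcal{S}^n_{++}} \Theta_k(X). \tag{8} \]
Step 2. Compute $y^{k+1} = y^k - \sigma_k(\mathcal{A}(X^{k+1}) - b)$ and $Z^{k+1} = \Pi_{\mathcal{S}^n_+}(Z^k - \sigma_k(X^{k+1} - \varepsilon I))$.
Step 3. Compute
\[ R_P^{k+1} = \frac{\|b - \mathcal{A}(X^{k+1})\|}{\max\{1, \|b\|\}}, \qquad R_D^{k+1} = \frac{\|\mathcal{Q}(X^{k+1}) + C - \mathcal{A}^* y^{k+1} - Z^{k+1} - \mu (X^{k+1})^{-1}\|}{\max\{1, \|C\|\}}. \]
If $\max\{R_P^{k+1}, R_D^{k+1}\} <$ Tol2, stop; else, $\sigma_{k+1} = 2\sigma_k$; end.

In Algorithm 2, the principal computational cost lies in Step 1, i.e., computing the approximate minimizer of the inner problem (8). We therefore state in detail the algorithm used to compute this approximate minimizer.

Algorithm 3: The semismooth Newton-CG method.
Step 1. Given $\zeta \in (0, \tfrac{1}{2})$, $\tau_1, \tau_2 \in (0, 1)$, and $\delta \in (0, 1)$, choose $X^0$ $(\succeq \varepsilon I)$.
Step 2. For $j = 0, 1, 2, \ldots$,
Step 2.1. Apply the PCG method to find an approximate solution $H^j$ to
\[ (\widehat{V}^0_{\sigma_k} + \epsilon_j \mathcal{I})[H] = -\nabla\Theta_k(X^j), \tag{9} \]
where $\epsilon_j := \tau_1 \min\{\tau_2, \|\nabla\Theta_k(X^j)\|\}$.
Step 2.2. Set $\alpha_j = \delta^{m_j}$, where $m_j$ is the first nonnegative integer $m$ for which
\[ \Theta_k(X^j + \delta^m H^j) \le \Theta_k(X^j) + \zeta\delta^m \langle \nabla\Theta_k(X^j), H^j\rangle. \]
Step 2.3. Set $X^{j+1} = X^j + \alpha_j H^j$.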
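The matrix $\Omega^k$ and the action $H \mapsto P^k(\Omega^k \circ ((P^k)^T H P^k))(P^k)^T$ used in the Newton system can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions: the helper names `proj_psd`, `omega_matrix` and `proj_psd_jacobian_apply` are hypothetical, the factor $\sigma_k$ is left to the caller, and zero eigenvalues are grouped with the positive ones as in the definition of $\bar\gamma_k$ above.

```python
import numpy as np

def proj_psd(W):
    """Metric projection of a symmetric matrix onto the PSD cone."""
    lam, P = np.linalg.eigh(W)
    return (P * np.maximum(lam, 0.0)) @ P.T

def omega_matrix(lam):
    """Build Omega from eigenvalues lam (any order).

    Entries are 1 on the (gamma-bar, gamma-bar) block, 0 on the
    (gamma, gamma) block, and lam_i / (lam_i - lam_j) for i in gamma-bar
    and j in gamma, where gamma = {i : lam_i < 0}.
    """
    lam = np.asarray(lam, dtype=float)
    neg = lam < 0                     # index set gamma
    rest = ~neg                       # index set gamma-bar
    Omega = np.zeros((lam.size, lam.size))
    Omega[np.ix_(rest, rest)] = 1.0
    # Denominators are strictly positive because lam_i >= 0 > lam_j.
    nu = lam[rest][:, None] / (lam[rest][:, None] - lam[neg][None, :])
    Omega[np.ix_(rest, neg)] = nu
    Omega[np.ix_(neg, rest)] = nu.T
    return Omega

def proj_psd_jacobian_apply(W, H):
    """Apply P (Omega o (P^T H P)) P^T, where W = P diag(lam) P^T."""
    lam, P = np.linalg.eigh(W)
    Omega = omega_matrix(lam)
    return P @ (Omega * (P.T @ H @ P)) @ P.T
```

When the argument has no zero eigenvalue, the projection is differentiable there and this map coincides with its derivative, which gives a simple finite-difference check for an implementation. Applying the map costs a few dense matrix products, so the operator in (9) never has to be formed as an $n^2 \times n^2$ matrix.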
From Algorithm 3, we can see that the main computational cost lies in solving the linear operator equation (9). The linear operator corresponds to a fourth-order n-dimensional tensor or an $n^2 \times n^2$ matrix. For large-scale problems, n is very large, so direct solvers are not suitable and PCG solvers are the natural choice for (9). Among various PCG solvers, we adopt the symmetric QMR algorithm [9]. Note that, compared with the algorithm in [9], we replace the matrix by the linear operator in (9) and replace the vectors by the corresponding matrices. As these changes are only a trivial generalization of the symmetric QMR algorithm, we do not go into details.

4 Convergence analyses

The convergence analyses of these two methods can be derived without much difficulty from the paper of Fazel, Pong, Sun and Tseng [7] and the papers of Rockafellar [28, 29], respectively. For the sake of completeness, we present these results below.

4.1 Convergence analysis for the PAL method

Theorem 4.1. Assume that the solution set of (P) is nonempty and Assumption 2.1 holds. Assume that $\mathcal{Q} + \sigma_k \mathcal{A}^*\mathcal{A} + \mathcal{T}_k$ is positive definite. Let $\{X^k, y^k, Z^k\}$ be generated by Algorithm 1. If $\tau \in (0, \tfrac{1+\sqrt{5}}{2})$, then the sequence $\{X^k\}$ converges to the optimal solution to (P) and $\{y^k, Z^k\}$ converges to the optimal solution to the dual problem (D).

Proof. Since $\mathcal{Q} + \sigma_k \mathcal{A}^*\mathcal{A} + \mathcal{T}_k = \lambda_{\max}(\mathcal{Q} + \sigma_k \mathcal{A}^*\mathcal{A})\mathcal{I} \succ 0$, the conclusion follows from Theorem B.1 in [7] and the KKT conditions.

4.2 Convergence analysis for the NAL method

Since we cannot solve the inner problems exactly, we use the following stopping criteria considered by Rockafellar [28, 29] for terminating Algorithms 2 and 3:

\[ (A) \quad \Theta_k(X^{k+1}) - \inf \Theta_k(X) \le \epsilon_k^2 / (2\sigma_k), \quad \epsilon_k \ge 0, \quad \sum_{k=0}^{\infty} \epsilon_k < \infty; \]
\[ (B) \quad \Theta_k(X^{k+1}) - \inf \Theta_k(X) \le \frac{\delta_k^2}{2\sigma_k} \|(y^{k+1}, Z^{k+1}) - (y^k, Z^k)\|^2, \quad \delta_k \ge 0, \quad \sum_{k=0}^{\infty} \delta_k < \infty; \]
\[ (B') \quad \|\nabla\Theta_k(X^{k+1})\| \le \frac{\delta'_k}{\sigma_k} \|(y^{k+1}, Z^{k+1}) - (y^k, Z^k)\|, \quad 0 \le \delta'_k \to 0. \]
We directly obtain from [28, 29] the following convergence results.

Theorem 4.2. Let Algorithm 2 be executed with stopping criterion (A). If Assumption 2.1 is satisfied, then the generated sequence $\{(y^k, Z^k)\}$ is bounded and converges to $(\bar y, \bar Z)$, where $(\bar y, \bar Z)$ is the unique optimal solution to $(D^\varepsilon)$, and $\{X^k\}$ is asymptotically minimizing for $(P^\varepsilon)$ with $\max(D^\varepsilon) = \inf(P^\varepsilon)$. The boundedness of $\{(y^k, Z^k)\}$ under (A) is actually equivalent to the existence of an optimal solution to $(D^\varepsilon)$. If $\{(y^k, Z^k)\}$ is bounded and Assumption 2.2 is satisfied, then the sequence $\{X^k\}$ is also bounded, and the accumulation point of the sequence $\{X^k\}$ is the unique optimal solution to $(P^\varepsilon)$.

Theorem 4.3. Let Algorithm 2 be executed with stopping criteria (A) and (B). If Assumption 2.1 holds and $T_g^{-1}$ is Lipschitz continuous at the origin with modulus $a_g$, then $\{(y^k, Z^k)\}$ converges to the unique optimal solution $(\bar y, \bar Z)$ with $\max(D^\varepsilon) = \inf(P^\varepsilon)$, and for all k sufficiently large,

\[ \|(y^{k+1}, Z^{k+1}) - (\bar y, \bar Z)\| \le \theta_k \|(y^k, Z^k) - (\bar y, \bar Z)\|, \]

where

\[ \theta_k = \big[a_g (a_g^2 + \sigma_k^2)^{-1/2} + \delta_k\big](1 - \delta_k)^{-1} \to \theta_\infty = a_g (a_g^2 + \sigma_\infty^2)^{-1/2} < 1, \quad \sigma_k \to \sigma_\infty. \]

Moreover, the conclusions of Theorem 4.2 about $\{(y^k, Z^k)\}$ are valid. If, in addition to (B) and the condition on $T_g^{-1}$, one has (B') and the stronger condition that $T_l^{-1}$ is Lipschitz continuous at the origin with modulus $a_l$ $(\ge a_g)$, then $X^k \to \bar X$, where $\bar X$ is the unique optimal solution to $(P^\varepsilon)$, and for all k sufficiently large,

\[ \|X^{k+1} - \bar X\| \le \theta'_k \|(y^{k+1}, Z^{k+1}) - (y^k, Z^k)\|, \]

where $\theta'_k = a_l(1 + \delta'_k)/\sigma_k \to \theta'_\infty = a_l/\sigma_\infty$.

Remark 4.1. In Algorithm 2 we can also add the term $\tfrac{1}{2\sigma_k}\|X - X^k\|^2$ to $\Theta_k(X)$. In our Matlab code, this term can optionally be added. It corresponds to the proximal method of multipliers considered in [29, Section 5]. The convergence analysis for this variant can be conducted in parallel with that for Algorithm 2.
Note that in the stopping criteria (A) and (B), $\inf \Theta_k(X)$ is an unknown value. Since $\widehat\Theta_k(X) := \Theta_k(X) + \tfrac{1}{2\sigma_k}\|X - X^k\|^2$ is a strongly convex function with modulus $\tfrac{1}{\sigma_k}$, one has the estimate

\[ \widehat\Theta_k(X^{k+1}) - \inf \widehat\Theta_k(X) \le \frac{1}{2\sigma_k}\|\nabla\widehat\Theta_k(X^{k+1})\|^2, \]

so criteria (A) and (B) can be modified in practice as follows:

\[ \|\nabla\widehat\Theta_k(X^{k+1})\| \le \epsilon_k, \quad \epsilon_k \ge 0, \quad \sum_{k=0}^{\infty} \epsilon_k < \infty; \]
\[ \|\nabla\widehat\Theta_k(X^{k+1})\| \le \delta_k \|(y^{k+1}, Z^{k+1}) - (y^k, Z^k)\|, \quad \delta_k \ge 0, \quad \sum_{k=0}^{\infty} \delta_k < \infty. \]

Remark 4.2. The condition that $T_g^{-1}$ is Lipschitz continuous at the origin is not easy to verify. However, if Assumptions 2.3 and 2.4 are satisfied, then $T_l^{-1}$, $T_f^{-1}$ and $T_g^{-1}$ are all single-valued and continuously differentiable on a neighborhood of the origin, and in particular they are all Lipschitz continuous at the origin.

5 Numerical issues in the associated semismooth Newton-CG method

When applying Algorithm 3 to solve the inner problem (7), the most expensive step is solving the linear system of equations

\[ (M + \varepsilon I)\,\mathrm{vec}(H) = \mathrm{vec}(-\nabla\Theta_k(X)), \tag{10} \]

where

\[ M := \mathbf{Q} + \sigma \mathbf{A}^T\mathbf{A} + \mu X^{-1} \otimes X^{-1} + \sigma (P \otimes P)\,\mathrm{diag}(\mathrm{vec}(\Omega))\,(P \otimes P)^T, \tag{11} \]

and $\mathbf{Q}$ and $\mathbf{A}$ denote the matrix representations of $\mathcal{Q}$ and $\mathcal{A}$, to obtain the approximate Newton direction. To solve (10) as efficiently as possible, it is necessary to analyze the condition number of the coefficient matrix and to design an efficient preconditioner.

5.1 Conditioning of the coefficient matrix

Let $S = X - \varepsilon I$. For simplicity, we suppose that strict complementarity holds for $Z$, $S$, i.e., $Z + S \succ 0$. From the fact that $ZS = 0$, we have

\[ Z - \sigma S = P \begin{bmatrix} \Lambda^Z & 0 \\ 0 & -\sigma\Lambda^S \end{bmatrix} P^T, \tag{12} \]

where $\Lambda^Z = \mathrm{diag}(\lambda^Z) \in \mathbb{R}^{r\times r}$ and $\Lambda^S = \mathrm{diag}(\lambda^S) \in \mathbb{R}^{(n-r)\times(n-r)}$ are the diagonal matrices of positive eigenvalues of $Z$ and $S$, respectively. Define the index sets $\bar\gamma := \{1, \ldots, r\}$ and $\gamma := \{r+1, \ldots, n\}$. Let

\[ \Omega = \begin{bmatrix} E_{\bar\gamma\bar\gamma} & \nu_{\bar\gamma\gamma} \\ \nu_{\bar\gamma\gamma}^T & 0 \end{bmatrix}, \qquad \nu_{ij} := \frac{\lambda^Z_i}{\lambda^Z_i + \sigma\lambda^S_{j-r}}, \quad i \in \bar\gamma,\ j \in \gamma, \tag{13} \]

and

\[ c_1 = \frac{\min(\lambda^Z)}{\min(\lambda^Z)/\sigma + \max(\lambda^S)}, \qquad c_2 = \frac{\max(\lambda^Z)}{\max(\lambda^Z)/\sigma + \min(\lambda^S)} < \sigma. \]

Then we have

Proposition 5.1.
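Suppose that strict complementarity holds for $Z$ and $S$. Then we have the following bound on the condition number of $M$:

\[ \kappa(M) \le \frac{\lambda_{\max}(\mathbf{Q}) + \sigma\lambda_{\max}(\mathbf{A}^T\mathbf{A}) + \mu/\lambda^2_{\min}(X) + \sigma\lambda_{\max}(\widetilde P_1\widetilde P_1^T + \widetilde P_2\widetilde P_2^T + \widetilde P_3\widetilde P_3^T)}{\lambda_{\min}(\mathbf{Q}) + \sigma\lambda_{\min}(\mathbf{A}^T\mathbf{A}) + \mu/\lambda^2_{\max}(X) + c_1\lambda_{\min}(\widetilde P_1\widetilde P_1^T + \widetilde P_2\widetilde P_2^T + \widetilde P_3\widetilde P_3^T)}, \tag{14} \]

where $\widetilde P_1 = P_{\bar\gamma} \otimes P_{\bar\gamma}$, $\widetilde P_2 = P_\gamma \otimes P_{\bar\gamma}$, $\widetilde P_3 = P_{\bar\gamma} \otimes P_\gamma$, and $P_{\bar\gamma}$ and $P_\gamma$ are the submatrices of $P = [P_{\bar\gamma}\ P_\gamma]$.

Proof. From (11), (12) and (13), we obtain

\[ M = \mathbf{Q} + \sigma\mathbf{A}^T\mathbf{A} + \mu X^{-1} \otimes X^{-1} + \sigma\big(\widetilde P_1\widetilde P_1^T + \widetilde P_2 D_2 \widetilde P_2^T + \widetilde P_3 D_3 \widetilde P_3^T\big), \]

where $D_2 = \mathrm{diag}(\mathrm{vec}(\nu_{\bar\gamma\gamma}))$ and $D_3 = \mathrm{diag}(\mathrm{vec}(\nu_{\bar\gamma\gamma}^T))$. Since $c_1 \le \sigma\nu_{ij} \le c_2$ for $i \in \bar\gamma$, $j \in \gamma$, we can derive

\[ c_1\big(\widetilde P_1\widetilde P_1^T + \widetilde P_2\widetilde P_2^T + \widetilde P_3\widetilde P_3^T\big) \preceq \sigma\big(\widetilde P_1\widetilde P_1^T + \widetilde P_2 D_2 \widetilde P_2^T + \widetilde P_3 D_3 \widetilde P_3^T\big) \preceq \sigma\big(\widetilde P_1\widetilde P_1^T + \widetilde P_2\widetilde P_2^T + \widetilde P_3\widetilde P_3^T\big). \]

Furthermore, from [11, Thm 4.2.12],

\[ \frac{\mu}{\lambda^2_{\max}(X)} I \preceq \mu X^{-1} \otimes X^{-1} \preceq \frac{\mu}{\lambda^2_{\min}(X)} I. \]

Hence, from [10, Thm 4.3.7],

\[ \Big[\lambda_{\min}(\mathbf{Q}) + \sigma\lambda_{\min}(\mathbf{A}^T\mathbf{A}) + \frac{\mu}{\lambda^2_{\max}(X)} + c_1\lambda_{\min}\big(\widetilde P_1\widetilde P_1^T + \widetilde P_2\widetilde P_2^T + \widetilde P_3\widetilde P_3^T\big)\Big] I \preceq M \preceq \Big[\lambda_{\max}(\mathbf{Q}) + \sigma\lambda_{\max}(\mathbf{A}^T\mathbf{A}) + \frac{\mu}{\lambda^2_{\min}(X)} + \sigma\lambda_{\max}\big(\widetilde P_1\widetilde P_1^T + \widetilde P_2\widetilde P_2^T + \widetilde P_3\widetilde P_3^T\big)\Big] I. \]

Taking the ratio of the upper bound to the lower bound gives the bound (14) on the condition number of $M$.

The upper bound in (14) suggests the following. (i) As $\sigma$ and $1/c_1$ increase, the condition number $\kappa(M)$ may increase. (ii) When $\mathbf{Q}$, $\mathbf{A}^T\mathbf{A}$ and $(P \otimes P)\,\mathrm{diag}(\mathrm{vec}(\Omega))\,(P \otimes P)^T$ are of (nearly) low rank, the term $\mu X^{-1} \otimes X^{-1}$ can potentially improve the condition number, because $(\mu/\lambda^2_{\max}(X)) I$ can "lift" the minimal eigenvalue of $M$. In other words, if $\mathbf{Q}$, $\mathbf{A}^T\mathbf{A}$ and $(P \otimes P)\,\mathrm{diag}(\mathrm{vec}(\Omega))\,(P \otimes P)^T$ are of (nearly) low rank, then the condition number $\kappa(M)$ is likely to increase as $\mu$ decreases.

5.2 A diagonal preconditioner

To achieve faster convergence of the PCG method when solving (10), one may select a suitable preconditioner. In our implementation, we devise an easy-to-compute diagonal preconditioner by using an idea first developed in [8].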
The preconditioner has the following form:

\[ M_D := \mathrm{diag}(\mathbf{Q}) + \sigma\,\mathrm{diag}(\mathbf{A}^T\mathbf{A}) + \mu\,\mathrm{diag}(\mathrm{vec}(\Psi)) + \sigma\,\mathrm{diag}(d), \]

where $\Psi \in \mathcal{S}^n$, $d \in \mathbb{R}^{n^2}$, and

\[ (\Psi)_{ij} = (X^{-1})_{ii}(X^{-1})_{jj} + (X^{-1})^2_{ij}, \qquad d_{(ij)} = \big((P \circ P)\,\Omega\,(P \circ P)^T\big)_{ij}, \quad 1 \le i, j \le n. \]

The biggest advantage of the preconditioner $M_D$ is that only $O(n^3)$ flops are needed to compute it.

6 Numerical experiments

In this section, we present numerical results to demonstrate the performance of the approaches on both synthetic and real data. We implemented the approaches in Matlab R2013a. All runs were performed on a PC (Intel Core 2 Duo 2.60 GHz with 12 GB RAM).

We measure the infeasibilities for the primal and dual problems by $R_P$ and $R_D$, which are described in Algorithms 1 and 2, and the optimality by the relative gap $R_G$ defined below. In general, we use the PAL method to generate an initial point and then switch to the NAL method to accelerate the local convergence. We stop the PAL method when $\max\{R_P, R_D\} <$ Tol1 and stop the NAL method when $\max\{R_P, R_D\} <$ Tol2, where Tol1 and Tol2 are pre-specified accuracy tolerances with Tol1 $= 5 \times 10^{-3}$ and Tol2 $= 10^{-6}$ as the defaults. In some hard cases, we use the PAL method only and terminate it with Tol1 $= 10^{-6}$. Furthermore, we use the relative gap

\[ R_G := \frac{|\mathrm{pobj} - \mathrm{dobj}|}{1 + |\mathrm{pobj}| + |\mathrm{dobj}|} \]

to measure the quality of the solution, where pobj and dobj denote the primal and dual objective function values, respectively. For the PAL method, we cap the number of iterations at 1000; for the Newton-CG method, we set the maximal number of outer iterations to 100 and the maximal number of inner iterations to 15. We choose the initial iterate $X^0 = I$ and $\sigma_0 = 10$. If the NAL method is applied to the above problems, the inequality constraint $X \succeq 0$ is replaced by $X \succeq \varepsilon I$ with $\varepsilon = 10^{-16}$.
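The accuracy measures used in the experiments are simple to compute once the objective values and residual norms are available. The following is a minimal sketch with hypothetical helper names (`relative_gap`, `relative_residual`); the sample values are taken from the first row of Table 1, reading the reported "1.51611644 3" as $1.51611644 \times 10^3$, which is an assumption about the table notation.

```python
def relative_gap(pobj, dobj):
    """Relative gap R_G = |pobj - dobj| / (1 + |pobj| + |dobj|)."""
    return abs(pobj - dobj) / (1.0 + abs(pobj) + abs(dobj))

def relative_residual(residual_norm, data_norm):
    """Scaled residual, e.g. R_P = ||b - A(X)|| / max(1, ||b||)."""
    return residual_norm / max(1.0, data_norm)

# Objective values of rand-A1-0500 from Table 1 (PAL method).
gap = relative_gap(1.51611644e3, 1.51612384e3)
print(f"{gap:.1e}")
```

Under that reading, the computed gap rounds to the value reported in the table for this instance (2.4-6, i.e. about $2.4 \times 10^{-6}$).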
6.1 Synthetic experiments I

In this subsection, we focus our numerical experiments on the following special problems:

\[ (A1) \quad \min_X \Big\{ \tfrac{1}{2}\langle X, H \circ X\rangle + \langle C, X\rangle - \mu\log\det X \;\Big|\; X_{ii} = 1,\ i = 1, 2, \ldots, n,\ X \succeq 0 \Big\}, \]
\[ (A2) \quad \min_X \Big\{ \tfrac{1}{2}\langle X, H \circ X\rangle + \langle C, X\rangle - \mu\log\det X \;\Big|\; X_{ij} = X_{ji} = 0,\ \forall\, (i, j) \in E,\ X_{ii} = 1,\ i = 1, 2, \ldots, n,\ X \succeq 0 \Big\}, \]
\[ (A3) \quad \min_X \Big\{ \tfrac{1}{2}\langle X, H \circ X\rangle + \langle C, X\rangle - \mu\log\det X \;\Big|\; \mathrm{Tr}(X) = 1,\ X \succeq 0 \Big\}, \]
\[ (A4) \quad \min_X \Big\{ \tfrac{1}{2}\langle X, H \circ X\rangle + \langle C, X\rangle - \mu\log\det X \;\Big|\; X_{ij} = X_{ji} = 0,\ \forall\, (i, j) \in E,\ \mathrm{Tr}(X) = 1,\ X \succeq 0 \Big\}. \]

The matrices $H$ and $C$ are generated randomly, and $E$ is a random subset of $\{(i, j) \mid 1 \le i < j,\ j = 2, \ldots, n\}$.

The performance of the algorithms is presented in the following tables. For each instance, we report the matrix dimension (n) and the number of equality constraints (m); the number of outer iterations (it) and the total number of inner iterations (itsub); the primal (pobj) and dual (dobj) objective values; the primal infeasibility ($R_P$), the dual infeasibility ($R_D$), and the relative gap ($R_G$); and the time taken (in the format hours:minutes:seconds).

First, we present the performance of the pure PAL method and the hybrid method (i.e., the PAL method followed by the NAL method) on the problems (A1) and (A2) with $\mu = 1$ in Tables 1 and 2, respectively. For the hybrid method, we only report the details of the Newton-CG method. From the two tables, we can see that although the PAL method can solve the problems rapidly, the hybrid method is still superior to it for all instances. For instances with relatively few constraints in particular, the hybrid method clearly outperforms the PAL method. The superiority of the hybrid method becomes even more pronounced if a highly accurate solution is required.

Table 1: Performances of the PAL method on problems (A1)-(A4) with µ = 1.
(In the tables, pobj and dobj are listed as a mantissa followed by a power-of-ten exponent, e.g. "1.51611644 3" stands for 1.51611644 × 10^3, and an entry such as "4.2-7" stands for 4.2 × 10^-7.)

Problem | n | m | it | pobj | dobj | RP | RD | RG | Time
rand-A1-0500 | 500 | 500 | 93 | 1.51611644 3 | 1.51612384 3 | 4.2-7 | 9.4-7 | 2.4-6 | 18
rand-A1-1000 | 1000 | 1000 | 88 | 2.88889067 3 | 2.88891640 3 | 6.8-7 | 9.9-7 | 4.5-6 | 1:20
rand-A1-1500 | 1500 | 1500 | 88 | 4.02954208 3 | 4.02959372 3 | 8.9-7 | 9.9-7 | 6.4-6 | 3:41
rand-A1-2000 | 2000 | 2000 | 95 | 5.12194289 3 | 5.12194882 3 | 9.2-7 | 2.5-7 | 5.8-7 | 8:13
rand-A2-0500 | 500 | 12322 | 106 | 1.48549608 3 | 1.48549478 3 | 5.1-7 | 9.2-7 | 4.4-7 | 20
rand-A2-1000 | 1000 | 25086 | 102 | 2.92216880 3 | 2.92216820 3 | 2.7-7 | 9.6-7 | 1.0-7 | 1:47
rand-A2-1500 | 1500 | 38389 | 99 | 4.07660888 3 | 4.07660986 3 | 6.6-8 | 9.9-7 | 1.2-7 | 4:28
rand-A2-2000 | 2000 | 51322 | 95 | 5.14445107 3 | 5.14444160 3 | 3.3-7 | 9.1-7 | 9.2-7 | 8:53
rand-A3-0500 | 500 | 1 | 570 | 3.10830391 3 | 3.10830441 3 | 9.8-7 | 9.3-7 | 7.9-8 | 1:08
rand-A3-1000 | 1000 | 1 | 1000 | 6.90874980 3 | 6.90875463 3 | 4.8-6 | 4.6-6 | 3.5-7 | 10:36
rand-A3-1500 | 1500 | 1 | 1000 | 1.09705221 4 | 1.09708284 4 | 2.0-4 | 1.9-4 | 1.4-5 | 31:24
rand-A3-2000 | 2000 | 1 | 1000 | 1.52001337 4 | 1.52027947 4 | 1.3-3 | 1.3-3 | 8.8-5 | 1:16:31
rand-A4-0500 | 500 | 74080 | 1000 | 3.10830458 3 | 3.10830554 3 | 1.6-4 | 3.5-4 | 1.5-7 | 3:29
rand-A4-1000 | 1000 | 296243 | 1000 | 6.90848660 3 | 6.90875346 3 | 2.8-4 | 4.7-4 | 1.9-5 | 17:55
rand-A4-1500 | 1500 | 666966 | 1000 | 1.09609128 4 | 1.09707760 4 | 6.6-3 | 7.0-3 | 4.5-4 | 46:26
rand-A4-2000 | 2000 | 1186173 | 1000 | 1.51822498 4 | 1.52026740 4 | 1.0-2 | 7.5-3 | 6.7-4 | 1:41:03

Table 2: Performances of the hybrid method on problems (A1)-(A4) with µ = 1.
Problem | n | m | it | itsub | pobj | dobj | RP | RD | RG | Time
rand-A1-0500 | 500 | 500 | 5 | 8 | 1.51610927 3 | 1.51593972 3 | 5.3-7 | 1.7-7 | 5.6-5 | 08
rand-A1-1000 | 1000 | 1000 | 6 | 9 | 2.88889217 3 | 2.88900007 3 | 1.3-7 | 3.9-8 | 1.9-5 | 41
rand-A1-1500 | 1500 | 1500 | 6 | 8 | 4.02952847 3 | 4.02940576 3 | 3.2-7 | 8.2-8 | 1.5-5 | 1:47
rand-A1-2000 | 2000 | 2000 | 6 | 7 | 5.12195286 3 | 5.12224022 3 | 5.1-7 | 1.4-7 | 2.8-5 | 3:44
rand-A2-0500 | 500 | 12322 | 10 | 11 | 1.48549372 3 | 1.48549527 3 | 7.7-7 | 6.4-7 | 5.2-7 | 15
rand-A2-1000 | 1000 | 25086 | 11 | 12 | 2.92217094 3 | 2.92217815 3 | 3.8-7 | 2.0-7 | 1.2-6 | 1:20
rand-A2-1500 | 1500 | 38389 | 10 | 11 | 4.07661066 3 | 4.07661215 3 | 3.8-7 | 2.3-7 | 1.8-7 | 3:26
rand-A2-2000 | 2000 | 51322 | 10 | 11 | 5.14445263 3 | 5.14445533 3 | 4.5-7 | 2.4-7 | 2.6-7 | 7:50
rand-A3-0500 | 500 | 1 | 13 | 15 | 3.10830441 3 | 3.10830442 3 | 3.6-9 | 5.8-7 | 2.4-9 | 35
rand-A3-1000 | 1000 | 1 | 15 | 17 | 6.90875466 3 | 6.90875467 3 | 1.3-9 | 4.1-7 | 7.4-10 | 5:25
rand-A3-1500 | 1500 | 1 | 20 | 22 | 1.09708296 4 | 1.09708296 4 | 2.8-9 | 9.0-7 | 9.6-10 | 22:55
rand-A3-2000 | 2000 | 1 | 16 | 19 | 1.52028038 4 | 1.52028038 4 | 1.2-9 | 7.9-7 | 6.6-10 | 1:07:14
rand-A4-0500 | 500 | 74080 | 32 | 55 | 3.10830709 3 | 3.10830709 3 | 1.4-10 | 7.6-7 | 1.5-12 | 1:37
rand-A4-1000 | 1000 | 296243 | 36 | 63 | 6.90875652 3 | 6.90875652 3 | 5.5-11 | 7.4-7 | 4.4-13 | 13:39
rand-A4-1500 | 1500 | 666966 | 52 | 92 | 1.09708312 4 | 1.09708312 4 | 8.7-11 | 6.9-7 | 3.6-13 | 59:34
rand-A4-2000 | 2000 | 1186173 | 79 | 140 | 1.52028052 4 | 1.52028052 4 | 1.9-10 | 9.1-7 | 5.9-13 | 2:09:36

Second, we present the performance of the PAL method and the hybrid method on problem (A1) with various µ. From Tables 3 and 4, we can see that for the PAL method the computing time increases only mildly as µ decreases. In contrast, for the hybrid method the computing time increases markedly as µ decreases.
The rationale is analyzed in the discussion following Proposition 5.1: in this problem, Q, A^T A and P ⊗ P diag(vec(Ω))P ⊗ P are nearly of low rank, and a small µ makes the inner problem ill-conditioned. The NAL method therefore needs many PCG iterations to solve the linear system of equations, whereas the PAL method directly gives a closed-form solution of a regularized problem.

Table 3: Performances of the PAL method on problem (A1) with various µ.

Problem       n | m          µ        it    pobj            dobj            RP | RD | RG             Time
rand1000-1    1000 | 1000    1.0000   88    2.88889067 3    2.88891640 3    6.8-7 | 9.9-7 | 4.5-6    1:20
rand1000-2    1000 | 1000    0.7500   113   2.84053903 3    2.84053619 3    8.2-7 | 1.9-7 | 5.0-7    1:42
rand1000-3    1000 | 1000    0.5000   112   2.77900646 3    2.77901061 3    8.8-7 | 5.5-7 | 7.5-7    1:41
rand1000-4    1000 | 1000    0.2500   126   2.69409906 3    2.69412261 3    1.6-7 | 9.9-7 | 4.4-6    1:57
rand1000-5    1000 | 1000    0.1250   171   2.63413549 3    2.63413210 3    8.6-7 | 2.2-7 | 6.4-7    2:45
rand1000-6    1000 | 1000    0.0625   169   2.59461152 3    2.59461261 3    8.8-7 | 4.7-7 | 2.1-7    2:41
rand1000-7    1000 | 1000    0.0313   179   2.56991656 3    2.56991919 3    7.7-7 | 4.9-7 | 5.1-7    2:44
rand1000-8    1000 | 1000    0.0157   179   2.55504884 3    2.55505345 3    9.4-7 | 6.4-7 | 9.0-7    2:47
rand1000-9    1000 | 1000    0.0078   180   2.54622819 3    2.54623730 3    7.7-7 | 7.2-7 | 1.8-6    2:45
rand1000-10   1000 | 1000    0.0039   180   2.54123460 3    2.54124523 3    8.2-7 | 8.0-7 | 2.1-6    2:45

Table 4: Performances of the hybrid method on problem (A1) with various µ.
Problem       n | m          µ        it | itsub   pobj            dobj            RP | RD | RG             Time
rand1000-1    1000 | 1000    1.0000   6 | 9        2.88889217 3    2.88900007 3    1.3-7 | 3.9-8 | 1.9-5    43
rand1000-2    1000 | 1000    0.7500   6 | 9        2.84053768 3    2.84064543 3    1.4-7 | 3.6-8 | 1.9-5    45
rand1000-3    1000 | 1000    0.5000   6 | 11       2.77900549 3    2.77914546 3    1.9-7 | 5.6-8 | 2.5-5    59
rand1000-4    1000 | 1000    0.2500   12 | 24      2.69409749 3    2.69409705 3    9.5-8 | 4.7-7 | 8.2-8    3:18
rand1000-5    1000 | 1000    0.1250   12 | 24      2.63412896 3    2.63412821 3    1.6-7 | 8.3-7 | 1.4-7    3:51
rand1000-6    1000 | 1000    0.0625   11 | 22      2.59460453 3    2.59460367 3    1.9-7 | 9.8-7 | 1.7-7    5:12
rand1000-7    1000 | 1000    0.0313   11 | 22      2.56991042 3    2.56990961 3    1.8-7 | 9.3-7 | 1.6-7    6:59
rand1000-8    1000 | 1000    0.0157   11 | 22      2.55504168 3    2.55504091 3    1.7-7 | 9.1-7 | 1.5-7    9:08
rand1000-9    1000 | 1000    0.0078   11 | 26      2.54622239 3    2.54622168 3    1.6-7 | 8.5-7 | 1.4-7    12:17
rand1000-10   1000 | 1000    0.0039   11 | 30      2.54122850 3    2.54122780 3    1.5-7 | 8.4-7 | 1.4-7    15:50

[Figure 1 plots omitted: two panels, each showing computing time (seconds) against µ on a logarithmic axis from 10^0 to 10^-2.]

Figure 1: Comparison of the performances of two methods on problem (A1) with various µ.

6.2 Synthetic experiments II

In this subsection, we focus the numerical experiments on the following problems:

\[
\text{(B1)} \quad \min_{X} \Big\{ \tfrac{1}{2} \| H \circ (X - G) \|^2 - \mu \log\det X \;\Big|\; X_{ii} = 1,\ i = 1, 2, \dots, n,\ X \succeq 0 \Big\},
\]

\[
\text{(B2)} \quad \min_{X} \Big\{ \tfrac{1}{2} \| H \circ (X - G) \|^2 - \mu \log\det X \;\Big|\; X_{ij} = X_{ji} = 0\ \ \forall\, (i, j) \in E,\ X_{ii} = 1,\ i = 1, 2, \dots, n,\ X \succeq 0 \Big\}.
\]

The matrices H and G are generated randomly, where H is a weight matrix whose entries are between 0 and 1, with many entries nearly zero, and E is a random subset of {(i, j) | 1 ≤ i < j, j = 2, . . . , n}. The objective function of (B1) or (B2) is

\[
\tfrac{1}{2} \langle X, H \circ H \circ X \rangle - \langle H \circ H \circ G, X \rangle + \tfrac{1}{2} \langle H \circ G, H \circ G \rangle - \mu \log\det X, \tag{15}
\]

which is a special case of (P), so the algorithms we developed can be applied to solve (B1)-(B2). Tables 5 and 6 below list the performances of the PAL method and the hybrid method on problems (B1) and (B2).
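As a quick numerical sanity check of the expansion in (15), the following NumPy sketch compares the quadratic term ½∥H ◦ (X − G)∥² with its expanded form. The function names `quad_direct` and `quad_expanded`, the matrix size and the random test data are illustrative assumptions and are not taken from the experiments above; the norm is assumed to be the Frobenius norm.

```python
import numpy as np


def quad_direct(H, X, G):
    """Quadratic part of (B1)/(B2): 0.5 * ||H o (X - G)||_F^2."""
    return 0.5 * np.sum((H * (X - G)) ** 2)


def quad_expanded(H, X, G):
    """Expanded form from (15), without the log-det term."""
    HH = H * H  # Hadamard square H o H
    return (0.5 * np.sum(X * (HH * X))      # 0.5 * <X, H o H o X>
            - np.sum((HH * G) * X)          # - <H o H o G, X>
            + 0.5 * np.sum((H * G) ** 2))   # 0.5 * <H o G, H o G>


def random_symmetric(rng, n, low=None):
    """Hypothetical helper: a random symmetric n x n test matrix."""
    if low is None:
        M = rng.standard_normal((n, n))
    else:
        M = rng.uniform(low, 1.0, size=(n, n))
    return (M + M.T) / 2


rng = np.random.default_rng(0)
n = 6
H = random_symmetric(rng, n, low=0.0)  # weights in [0, 1]
G = random_symmetric(rng, n)
X = random_symmetric(rng, n)

# The two forms should agree up to floating-point rounding.
gap = abs(quad_direct(H, X, G) - quad_expanded(H, X, G))
print(f"difference between the two forms: {gap:.2e}")
```

Reading off the expansion, (B1)-(B2) fit the form of (P) with Q(X) = H ◦ H ◦ X and C = −H ◦ H ◦ G, up to the constant term ½⟨H ◦ G, H ◦ G⟩.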
For these two problems, the PAL method is much faster than the hybrid method. The reason is that the entries of the weight matrix H in (15) are small, so the entries of H ◦ H are even smaller, which makes the Hessian of the inner problem more ill-conditioned; in this situation the NAL method has no advantage.

Table 5: Performances of the PAL method on problems (B1)-(B2) with µ = 1.

Problem        n | m           it    pobj            dobj            RP | RD | RG             Time
rand-B1-0500   500 | 500       147   3.90611146 3    3.90613831 3    8.9-7 | 3.1-7 | 3.4-6    29
rand-B1-1000   1000 | 1000     114   1.76566465 4    1.76570729 4    7.5-7 | 9.7-7 | 1.2-5    1:49
rand-B1-1500   1500 | 1500     106   4.38503029 4    4.38510787 4    2.5-7 | 9.4-7 | 8.8-6    4:47
rand-B1-2000   2000 | 2000     107   8.41256796 4    8.41269558 4    1.6-7 | 9.6-7 | 7.6-6    9:47
rand-B2-0500   500 | 12322     175   5.30756232 3    5.30756434 3    8.6-7 | 2.1-7 | 1.9-7    35
rand-B2-1000   1000 | 25086    143   1.91004704 4    1.91005117 4    4.7-7 | 9.8-7 | 1.1-6    2:21
rand-B2-1500   1500 | 38389    135   4.53956279 4    4.53957194 4    3.4-8 | 9.9-7 | 1.0-6    6:05
rand-B2-2000   2000 | 51322    129   8.56585712 4    8.56586838 4    2.3-8 | 9.6-7 | 6.6-7    11:53

Table 6: Performances of the hybrid method on problems (B1)-(B2) with µ = 1.
Problem        n | m           it | itsub   pobj            dobj            RP | RD | RG             Time
rand-B1-0500   500 | 500       9 | 25       3.90611894 3    3.90609819 3    6.9-7 | 7.6-7 | 2.7-6    1:24
rand-B1-1000   1000 | 1000     7 | 20       1.76566628 4    1.76566528 4    2.7-7 | 5.2-7 | 2.9-7    7:22
rand-B1-1500   1500 | 1500     7 | 19       4.38503287 4    4.38502532 4    3.7-7 | 6.5-7 | 8.6-7    22:01
rand-B1-2000   2000 | 2000     7 | 20       8.41257271 4    8.41255864 4    4.8-7 | 7.8-7 | 8.4-7    58:27
rand-B2-0500   500 | 12322     6 | 17       5.30756255 3    5.30757122 3    3.5-7 | 9.8-7 | 8.2-7    1:01
rand-B2-1000   1000 | 25086    7 | 19       1.91004712 4    1.91004680 4    1.4-7 | 2.8-7 | 8.2-8    6:51
rand-B2-1500   1500 | 38389    7 | 19       4.53956281 4    4.53956184 4    1.7-7 | 3.1-7 | 1.1-7    22:13
rand-B2-2000   2000 | 51322    7 | 19       8.56585840 4    8.56585294 4    2.7-7 | 4.4-7 | 3.2-7    51:15

Based on the synthetic experiments in the above two subsections, we conclude that if the Hessian of the inner problem is well-conditioned, we prefer to adopt the hybrid method, but if the Hessian is ill-conditioned, the PAL method alone is a good choice.

6.3 Real data experiments

In this subsection, we consider the following model problem, which is a special case of problem (P) with Q ≡ 0:

\[
\text{(C)} \quad \min_{X} \big\{ \langle C, X \rangle - \log\det X \;\big|\; X_{ij} = X_{ji} = 0\ \ \forall\, (i, j) \in E,\ X \succeq 0 \big\},
\]

where the matrix C is built from real data coming from two gene expression data sets, and E is a predetermined index set. The model (C) finds wide applications in covariance selection. One data set is the Rosetta Inpharmatics Compendium of gene expression profiles described by Hughes et al. [12]. The data set contains 253 samples with n = 6136 variables. We aim to estimate the covariance matrix of a Gaussian graphical model whose conditional independence structure is unknown. The other data set is the Iconix microarray data obtained from 255 drug-treated rat livers; see Natsoulis et al. [23] for details. For both data sets, although our method can handle problems with larger matrix dimensions, we test only on a subset of the data.
We create five subsets by taking the 500, 1000, 1500, 2000 and 2500 variables with the highest variances, respectively. As the variances vary widely, we normalize the sample covariance matrices to have unit variances on the diagonal. Since the model (C) is a special case of (P) with Q ≡ 0, the Hessian of the inner problem is likely to be ill-conditioned, so we apply only the PAL method to solve it. Furthermore, we compare the performance of the PAL method with that of the PPA proposed in [40].

Table 7: Performances of the PAL method on problem (C).

Problem        n | m          it   pobj            dobj            RP | RD | RG             Time
Rosetta-0500   500 | 999      13   2.29472323 1    2.29473435 1    4.3-7 | 5.1-7 | 2.4-6    03
Rosetta-1000   1000 | 1999    12   3.06492978 1    3.06495129 1    4.7-7 | 8.7-7 | 3.5-6    16
Rosetta-1500   1500 | 2999    12   3.56384267 1    3.56386620 1    3.7-7 | 6.7-7 | 3.3-6    46
Rosetta-2000   2000 | 3999    12   3.94407754 1    3.94410344 1    3.3-7 | 5.9-7 | 3.2-6    1:49
Rosetta-2500   2500 | 4999    12   4.19969175 1    4.19971940 1    2.9-7 | 5.4-7 | 3.3-6    3:27
Iconix-0500    500 | 999      28   1.30457645 2    1.30457705 2    3.8-7 | 7.9-7 | 2.3-7    08
Iconix-1000    1000 | 1999    25   2.28035658 2    2.28035773 2    1.3-7 | 9.0-7 | 2.5-7    28
Iconix-1500    1500 | 2999    21   2.92514554 2    2.92514672 2    1.6-7 | 8.2-7 | 2.0-7    58
Iconix-2000    2000 | 3999    21   3.42933318 2    3.42933507 2    6.6-8 | 9.7-7 | 2.8-7    2:14
Iconix-2500    2500 | 4999    16   3.84427349 2    3.84427544 2    7.5-8 | 9.1-7 | 2.5-7    3:05

Table 8: Comparison of the PAL method and the PPA on problem (C) with accuracy 10^-6.
Problem        n | m          it (PAL)   it (PPA)   pobj (PAL)      pobj (PPA)      Time (PAL)   Time (PPA)
Rosetta-0500   500 | 999      13         2          2.29472323 1    2.29473468 1    03           08
Rosetta-1000   1000 | 1999    12         2          3.06492978 1    3.06495844 1    16           43
Rosetta-1500   1500 | 2999    12         2          3.56384267 1    3.56387453 1    46           2:01
Rosetta-2000   2000 | 3999    12         2          3.94407754 1    3.94411131 1    1:49         4:51
Rosetta-2500   2500 | 4999    12         3          4.19969175 1    4.19979828 1    3:27         13:36
Iconix-0500    500 | 999      28         3          1.30457645 2    1.30457676 2    08           14
Iconix-1000    1000 | 1999    25         3          2.28035658 2    2.28035671 2    28           1:20
Iconix-1500    1500 | 2999    21         4          2.92514554 2    2.92514506 2    58           5:59
Iconix-2000    2000 | 3999    21         4          3.42933318 2    3.42933324 2    2:14         13:27
Iconix-2500    2500 | 4999    16         4          3.84427349 2    3.84427356 2    3:05         25:45

Tables 7 and 8 show not only that the PAL method solves the real data problems very efficiently, but also that it is about 2 to 7 times faster than the PPA.

Concluding remarks

In this paper, we designed a PAL method and a hybrid method to solve log-determinant optimization problems. We established rigorous convergence results based on the fundamental theoretical frameworks. Extensive numerical experiments conducted on both synthetic problems and real data problems demonstrated that our methods are very robust and efficient. Based on the results of the numerical experiments, we conclude that for general problems we prefer to adopt the hybrid method, but for some ill-conditioned problems the PAL method may be a better choice.

Acknowledgements

I sincerely thank the Institute for Mathematical Sciences, National University of Singapore, for supporting my visit to the institute and my attendance at the workshop “Optimization: Computation, Theory and Modeling” in 2012, which gave me the opportunity to have fruitful discussions with Professors Defeng Sun and Kim-Chuan Toh on the PAL method. I also thank Dr. Xinyuan Zhao at Beijing University of Technology for many discussions on this topic.

References

[1] F. Alizadeh, J. P. A. Haeberly, and O. L.
Overton, Complementarity and nondegeneracy in semidefinite programming, Math. Programming, 77 (1997), 111–128.
[2] O. Banerjee, L. El Ghaoui, and A. d’Aspremont, Model selection through sparse maximum likelihood estimation, J. Mach. Learn. Res., 9 (2008), 485–516.
[3] J. F. Bonnans and A. Shapiro, Perturbation Analysis of Optimization Problems, Springer, New York, 2000.
[4] J. Dahl, L. Vandenberghe, and V. Roychowdhury, Covariance selection for nonchordal graphs via chordal embedding, Optim. Methods Softw., 23 (2008), 501–520.
[5] A. Dempster, Covariance selection, Biometrics, 28 (1972), 157–175.
[6] A. d’Aspremont, O. Banerjee, and L. El Ghaoui, First-order methods for sparse covariance selection, SIAM J. Matrix Anal. Appl., 30 (2008), 56–66.
[7] M. Fazel, T.-K. Pong, D. Sun, and P. Tseng, Hankel matrix rank minimization with applications to system identification and realization, SIAM J. Matrix Anal. Appl., 34 (2013), 946–977.
[8] Y. Gao and D. Sun, Calibrating least squares semidefinite programming with equality and inequality constraints, SIAM J. Matrix Anal. Appl., 31 (2009), 1432–1457.
[9] R. W. Freund and N. M. Nachtigal, A new Krylov subspace method for symmetric indefinite linear systems, ORNL/TM-12754, 1994.
[10] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, Cambridge, UK, 1985.
[11] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis, Cambridge University Press, Cambridge, UK, 1991.
[12] T. R. Hughes, M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. A. Bennett, E. Coffey, H. Dai, Y. D. He, M. J. Kidd, A. M. King, M. R. Meyer, D. Slade, P. Y. Lum, S. B. Stepaniants, D. D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard, and S. H. Friend, Functional discovery via a compendium of expression profiles, Cell, 102 (2000), 109–126.
[13] Z. Hu, J. Cao, and L. J. Hong, Robust simulation of global warming policies using the DICE model, Manage. Sci., 58 (2012), 1–17.
[14] K. F. Jiang, D. F.
Sun, and K.-C. Toh, An inexact accelerated proximal gradient method for large scale linearly constrained convex SDP, SIAM J. Optim., 22 (2012), 1042–1064.
[15] L. Li and K.-C. Toh, An inexact interior point method for L1-regularized sparse covariance selection, Math. Program. Comput., 2 (2010), 291–315.
[16] Z. Lu, Smooth optimization approach for sparse covariance selection, SIAM J. Optim., 19 (2009), 1807–1827.
[17] Z. Lu, Adaptive first-order methods for general sparse inverse covariance selection, SIAM J. Matrix Anal. Appl., 31 (2010), 2000–2016.
[18] Z. Lu and Y. Zhang, Penalty decomposition methods for L0-norm minimization, Proceedings of Neural Information Processing Systems (NIPS), 2011: 46–54.
[19] B. Martinet, Regularisation d’inéquations variationelles par approximations successives, Rev. Française d’Informat. Recherche Opérationnelle, (1970), 154–159.
[20] F. Meng, D. Sun, and G. Zhao, Semismoothness of solutions to generalized equations and the Moreau–Yosida regularization, Math. Programming, 104 (2005), 561–581.
[21] G. J. Minty, On the monotonicity of the gradient of a convex function, Pacific J. Math., 14 (1964), 243–247.
[22] J. J. Moreau, Proximité et dualité dans un espace Hilbertien, Bull. Soc. Math. France, 93 (1965), 273–299.
[23] G. Natsoulis, C. I. Pearson, J. Gollub, B. P. Eynon, J. Ferng, R. Nair, R. Idury, M. D. Lee, M. R. Fielden, R. J. Brennan, A. H. Roter, and K. Jarnagin, The liver pharmacological and xenobiotic gene response repertoire, Mol. Syst. Biol., 175 (2008), 1–12.
[24] P. Olsen, F. Oztoprak, J. Nocedal, and S. Rennie, Newton-like methods for sparse inverse covariance estimation, available at http://www.optimizationonline.org/DB HTML/2012/06/3506.html.
[25] H. Qi and D. Sun, A quadratically convergent Newton method for computing the nearest correlation matrix, SIAM J. Matrix Anal. Appl., 28 (2006), 360–385.
[26] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, 1970.
[27] R. T.
Rockafellar, A dual approach to solving nonlinear programming problems by unconstrained optimization, Math. Programming, 5 (1973), 354–373.
[28] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), 877–898.
[29] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), 97–116.
[30] K. Scheinberg, S. Ma, and D. Goldfarb, Sparse inverse covariance selection via alternating linearization methods, Twenty-Fourth Annual Conference on Neural Information Processing Systems (NIPS), 2010: 2101–2109.
[31] K. Scheinberg and I. Rish, Learning sparse Gaussian Markov networks using a greedy coordinate ascent approach, in Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Comput. Sci. 6323, J. L. Balcazar, F. Bonchi, A. Gionis, and M. Sebag, eds., Springer-Verlag, Berlin, 2010, 196–212.
[32] D. Sun, The strong second order sufficient condition and constraint nondegeneracy in nonlinear semidefinite programming and their implications, Math. Oper. Res., 31 (2006), 761–776.
[33] D. Sun, J. Sun, and L. Zhang, The rate of convergence of the augmented Lagrangian method for nonlinear semidefinite programming, Math. Programming, 114 (2008), 349–391.
[34] K.-C. Toh, Primal-dual path-following algorithms for determinant maximization problems with linear matrix inequalities, Comput. Optim. Appl., 14 (1999), 309–330.
[35] K.-C. Toh, An inexact primal-dual path following algorithm for convex quadratic SDP, Math. Programming, 112 (2008), 221–254.
[36] R. H. Tütüncü, K.-C. Toh, and M. J. Todd, Solving semidefinite-quadratic-linear programs using SDPT3, Math. Programming, 95 (2003), 189–217.
[37] K.-C. Toh, R. H. Tütüncü, and M. J. Todd, Inexact primal-dual path-following algorithms for a special class of convex quadratic SDP and related problems, Pac. J. Optim., 3 (2007), 135–164.
[38] N.-K. Tsing, M. K. H. Fan, and E. I.
Verriest, On analyticity of functions involving eigenvalues, Linear Algebra Appl., 207 (1994), 159–180.
[39] B. Varadarajan, D. Povey, and S. M. Chu, Quick fmllr for speaker adaptation in speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.
[40] C. Wang, D. Sun, and K.-C. Toh, Solving log-determinant optimization problems by a Newton-CG primal proximal point algorithm, SIAM J. Optim., 20 (2010), 2994–3013.
[41] J. Yang, D. Sun, and K.-C. Toh, A proximal point algorithm for log-determinant optimization with group Lasso regularization, SIAM J. Optim., 23 (2013), 857–893.
[42] K. Yosida, Functional Analysis, Springer Verlag, Berlin, 1964.
[43] S. Yang, X. Shen, P. Wonka, Z. Lu, and J. Ye, Fused multiple graphical Lasso, available at http://people.math.sfu.ca/∼zhaosong/ResearchPapers/FMGL.pdf.
[44] X. Yuan, Alternating direction methods for sparse covariance selection, J. Sci. Comput., 51 (2012), 261–273.
[45] X.-Y. Zhao, A Semismooth Newton-CG Augmented Lagrangian Method for Large Scale Linear and Convex Quadratic SDPs, PhD thesis, National University of Singapore, 2009.
[46] X.-Y. Zhao, D. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), 1737–1765.