Tight Continuous Relaxation of the Balanced k-Cut Problem

Syama Sundar Rangapuram, Pramod Kaushik Mudrakarta and Matthias Hein
Department of Mathematics and Computer Science, Saarland University, Saarbrücken

Abstract

Spectral clustering as a relaxation of the normalized/ratio cut has become one of the standard graph-based clustering methods. Existing methods for the computation of multiple clusters, corresponding to a balanced k-cut of the graph, are either based on greedy techniques or heuristics which have only a weak connection to the original motivation of minimizing the normalized cut. In this paper we propose a new tight continuous relaxation for any balanced k-cut problem and show that a related, recently proposed relaxation is in most cases loose, leading to poor performance in practice. For the optimization of our tight continuous relaxation we propose a new algorithm for the difficult sum-of-ratios minimization problem which achieves monotonic descent. Extensive comparisons show that our method outperforms all existing approaches for ratio cut and other balanced k-cut criteria.

1 Introduction

Graph-based techniques for clustering have become very popular in machine learning as they allow for an easy integration of pairwise relationships in data. The problem of finding k clusters in a graph can be formulated as a balanced k-cut problem [1, 2, 3, 4], where ratio and normalized cut are famous instances of balanced graph cut criteria employed for clustering, community detection and image segmentation. The balanced k-cut problem is known to be NP-hard [4] and thus in practice relaxations [4, 5] or greedy approaches [6] are used for finding the optimal multi-cut. The most famous approach is spectral clustering [7], which corresponds to the spectral relaxation of the ratio/normalized cut and uses k-means in the embedding of the vertices given by the first k eigenvectors of the graph Laplacian in order to obtain the clustering. However, the spectral relaxation has been shown to be loose for k = 2 [8], and for k > 2 no guarantees are known on the quality of the obtained k-cut with respect to the optimal one. Moreover, in practice even greedy approaches [6] frequently outperform spectral clustering.

This paper is motivated by another line of recent work [9, 10, 11, 12] where it has been shown that an exact continuous relaxation for the two-cluster case (k = 2) is possible for a quite general class of balancing functions. Moreover, efficient algorithms for its optimization have been proposed which produce much better cuts than the standard spectral relaxation. However, the multi-cut problem still has to be solved via a greedy recursive splitting technique.

Inspired by the recent approach in [13], in this paper we tackle the general balanced k-cut problem directly, based on a new tight continuous relaxation. We show that the relaxation for the asymmetric ratio Cheeger cut proposed recently in [13] is loose when the data does not contain k well-separated clusters and thus leads to poor performance in practice. Similar to [13] we can also integrate label information, leading to a transductive clustering formulation. Moreover, we propose an efficient algorithm for the minimization of our continuous relaxation for which we can prove monotonic descent. This is in contrast to the algorithm proposed in [13], for which no such guarantee holds. In extensive experiments we show that our method outperforms all existing methods in terms of the achieved balanced k-cuts.
Moreover, our clustering error is competitive with respect to several other clustering techniques based on balanced k-cuts and recently proposed approaches based on nonnegative matrix factorization. We also observe that already with a small amount of label information the clustering error improves significantly.

2 Balanced Graph Cuts

Graphs are typically used in machine learning as similarity graphs, that is, the weight of an edge between two instances encodes their similarity. Given such a similarity graph of the instances, the problem of clustering into k sets can be transformed into a graph partitioning problem, where the goal is to construct a partition of the graph into k sets such that the cut, that is, the sum of weights of the edges from each set to all other sets, is small and all sets in the partition are roughly of equal size.

Before we introduce balanced graph cuts, we briefly fix the setting and notation. Let $G(V, W)$ denote an undirected, weighted graph with vertex set $V$, $n = |V|$ vertices and symmetric weight matrix $W \in \mathbb{R}^{n \times n}_+$, $W = W^T$. There is an edge between two vertices $i, j \in V$ if $w_{ij} > 0$. The cut between two sets $A, B \subset V$ is defined as $\mathrm{cut}(A, B) = \sum_{i \in A, j \in B} w_{ij}$, and we write $\mathbf{1}_A$ for the indicator vector of a set $A \subset V$. A collection of $k$ sets $(C_1, \ldots, C_k)$ is a partition of $V$ if $\cup_{i=1}^k C_i = V$, $C_i \cap C_j = \emptyset$ for $i \neq j$ and $|C_i| \geq 1$ for $i = 1, \ldots, k$. We denote the set of all $k$-partitions of $V$ by $\mathcal{P}_k$. Furthermore, we denote by $\Delta_k$ the simplex $\{x \in \mathbb{R}^k : x \geq 0,\ \sum_{i=1}^k x_i = 1\}$.

Finally, a set function $\hat{S} : 2^V \to \mathbb{R}$ is called submodular if for all $A, B \subset V$, $\hat{S}(A \cup B) + \hat{S}(A \cap B) \leq \hat{S}(A) + \hat{S}(B)$. Furthermore, we need the concept of the Lovász extension of a set function.

Definition 1. Let $\hat{S} : 2^V \to \mathbb{R}$ be a set function with $\hat{S}(\emptyset) = 0$. Let $f \in \mathbb{R}^V$ be ordered in increasing order, $f_1 \leq f_2 \leq \ldots \leq f_n$, and define $C_i = \{j \in V \mid f_j > f_i\}$, where $C_0 = V$. Then $S : \mathbb{R}^V \to \mathbb{R}$ given by
$$S(f) = \sum_{i=1}^n f_i \big(\hat{S}(C_{i-1}) - \hat{S}(C_i)\big)$$
is called the Lovász extension of $\hat{S}$. Note that $S(\mathbf{1}_A) = \hat{S}(A)$ for all $A \subset V$.

The Lovász extension of a set function is convex if and only if the set function is submodular [14]. The cut function $\mathrm{cut}(C, \overline{C})$, where $\overline{C} = V \setminus C$, is submodular, and its Lovász extension is the total variation $\mathrm{TV}(f) = \frac{1}{2} \sum_{i,j=1}^n w_{ij} |f_i - f_j|$.

2.1 Balanced k-cuts

The balanced k-cut problem is defined as
$$\min_{(C_1, \ldots, C_k) \in \mathcal{P}_k} \; \sum_{i=1}^k \frac{\mathrm{cut}(C_i, \overline{C_i})}{\hat{S}(C_i)} =: \mathrm{BCut}(C_1, \ldots, C_k) \qquad (1)$$
where $\hat{S} : 2^V \to \mathbb{R}_+$ is a balancing function whose goal is that all sets $C_i$ are of the same "size". In this paper, we assume that $\hat{S}(\emptyset) = 0$ and that $\hat{S}(C) \geq m$ for some $m > 0$ and any $C \subsetneq V$, $C \neq \emptyset$. In the literature one finds mainly the following submodular balancing functions (in brackets the name of the overall balanced graph cut criterion $\mathrm{BCut}(C_1, \ldots, C_k)$):
$$\hat{S}(C) = |C| \qquad \text{(Ratio Cut)}, \qquad (2)$$
$$\hat{S}(C) = \min\{|C|,\ |\overline{C}|\} \qquad \text{(Ratio Cheeger Cut)},$$
$$\hat{S}(C) = \min\{(k-1)|C|,\ |\overline{C}|\} \qquad \text{(Asymmetric Ratio Cheeger Cut)}.$$
The Ratio Cut is well studied in the literature, e.g. [3, 7, 6], and corresponds to a balancing function without bias towards a particular size of the sets, whereas the Asymmetric Ratio Cheeger Cut recently proposed in [13] has a bias towards sets of size $\frac{|V|}{k}$ ($\hat{S}(C)$ attains its maximum at this point), which makes perfect sense if one expects clusters of roughly equal size. An intermediate version between the two is the Ratio Cheeger Cut, which has a symmetric balancing function and strongly penalizes overly large clusters. For the ease of presentation we restrict ourselves to these balancing functions. However, we can also handle the corresponding weighted cases, e.g., $\hat{S}(C) = \mathrm{vol}(C) = \sum_{i \in C} d_i$, where $d_i = \sum_{j=1}^n w_{ij}$, leading to the normalized cut [4].
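To make Definition 1 and the objective (1) concrete, the following minimal Python sketch (the helper names are ours, not from the paper) evaluates the Lovász extension of a set function via the sorted-suffix form of Definition 1 and the balanced k-cut objective for the balancing functions in (2):

```python
import numpy as np

def lovasz_extension(f, S_hat):
    """S(f) = sum_i f_i (S_hat(C_{i-1}) - S_hat(C_i)) from Definition 1;
    ties in f are broken arbitrarily."""
    order = np.argsort(f)              # indices of f in increasing order
    suffix = list(order)               # C_0 = V
    val = 0.0
    for i in order:
        C_prev = frozenset(suffix)     # C_{i-1}
        suffix = suffix[1:]            # C_i: indices ranked strictly above i
        val += f[i] * (S_hat(C_prev) - S_hat(frozenset(suffix)))
    return val

def cut(W, A, B):
    """cut(A, B) = sum of weights of edges from A to B."""
    return W[np.ix_(list(A), list(B))].sum()

def bcut(W, partition, S_hat):
    """Balanced k-cut objective BCut from (1)."""
    V = set(range(W.shape[0]))
    return sum(cut(W, C, V - set(C)) / S_hat(frozenset(C)) for C in partition)

# Balancing functions from (2); n = |V|, k = number of clusters.
ratio = lambda C: len(C)                                                  # Ratio Cut
ratio_cheeger = lambda n: lambda C: min(len(C), n - len(C))               # Ratio Cheeger Cut
asym_cheeger = lambda n, k: lambda C: min((k - 1) * len(C), n - len(C))   # Asymmetric

# Sanity check: the Lovász extension of C -> cut(C, complement) is the
# total variation TV(f) = 1/2 sum_ij w_ij |f_i - f_j|.
rng = np.random.default_rng(0)
W = rng.random((6, 6)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
f = rng.random(6)
cut_fn = lambda C: cut(W, C, set(range(6)) - set(C))
tv = 0.5 * (W * np.abs(f[:, None] - f[None, :])).sum()
assert np.isclose(tv, lovasz_extension(f, cut_fn))
```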
3 Tight Continuous Relaxation for the Balanced k-Cut Problem

In this section we discuss our proposed relaxation of the balanced k-cut problem (1). It turns out that a crucial question towards a tight multi-cut relaxation is the choice of the constraints so that the continuous problem also yields a partition (together with a suitable rounding scheme). The motivation for our relaxation is taken from the recent work of [9, 10, 11], where exact relaxations are shown for the case k = 2. Basically, they replace the ratio of set functions with the ratio of the corresponding Lovász extensions. We use the same idea for the objective of our continuous relaxation of the k-cut problem (1), which is given as
$$\min_{\substack{F = (F_1, \ldots, F_k),\\ F \in \mathbb{R}^{n \times k}_+}} \; \sum_{l=1}^k \frac{\mathrm{TV}(F_l)}{S(F_l)} \qquad (3)$$
subject to:
$F_{(i)} \in \Delta_k, \quad i = 1, \ldots, n$, (simplex constraints)
$\max\{F_{(i)}\} = 1, \quad \forall i \in I$, (membership constraints)
$S(F_l) \geq m, \quad l = 1, \ldots, k$, (size constraints)
where $S$ is the Lovász extension of the set function $\hat{S}$ and $m = \min_{C \subsetneq V,\, C \neq \emptyset} \hat{S}(C)$. We have $m = 1$ for Ratio Cut and Ratio Cheeger Cut, whereas $m = k - 1$ for the Asymmetric Ratio Cheeger Cut. Note that TV is the Lovász extension of the cut functional $\mathrm{cut}(C, \overline{C})$. In order to simplify notation we denote, for a matrix $F \in \mathbb{R}^{n \times k}$, by $F_l$ the $l$-th column of $F$ and by $F_{(i)}$ the $i$-th row of $F$. Note that the rows of $F$ correspond to the vertices of the graph and the $j$-th column of $F$ corresponds to the set $C_j$ of the desired partition. The set $I \subset V$ in the membership constraints is chosen adaptively by our method during the sequential optimization described in Section 4.

An obvious question is how to get from the continuous solution $F^*$ of (3) to a partition $(C_1, \ldots, C_k) \in \mathcal{P}_k$, which is typically called rounding. Given $F^*$, we construct the sets by assigning each vertex $j$ to the column where the $j$-th row attains its maximum. Formally,
$$C_i = \{j \in V \mid i = \arg\max_{s = 1, \ldots, k} F^*_{js}\}, \quad i = 1, \ldots, k, \qquad \text{(Rounding)} \quad (4)$$
where ties are broken randomly. If there exists a row for which the rounding is not unique, we say that the solution is weakly degenerated. If, furthermore, the resulting sets $(C_1, \ldots, C_k)$ do not form a partition, that is, one of the sets is empty, then we say that the solution is strongly degenerated.
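A minimal sketch of the rounding step (4), including checks for the two degeneracy notions just defined (again our own illustrative code, not the authors' implementation):

```python
import numpy as np

def round_to_partition(F, rng=np.random.default_rng()):
    """Rounding (4): assign vertex j to the column where row F_(j) is maximal."""
    n, k = F.shape
    row_max = F.max(axis=1, keepdims=True)
    weakly_degenerate = bool(np.any((F == row_max).sum(axis=1) > 1))  # non-unique argmax
    # break ties randomly among the maximal entries of each row
    labels = np.array([rng.choice(np.flatnonzero(F[j] == row_max[j])) for j in range(n)])
    clusters = [set(np.flatnonzero(labels == l)) for l in range(k)]
    strongly_degenerate = any(len(C) == 0 for C in clusters)          # empty set: no partition
    return clusters, weakly_degenerate, strongly_degenerate
```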
First, we connect our relaxation to the previous work of [11] for the case k = 2. Indeed, for symmetric balancing functions such as the Ratio Cheeger Cut, our continuous relaxation (3) is exact even without membership and size constraints.

Theorem 1. Let $\hat{S}$ be a non-negative symmetric balancing function, $\hat{S}(C) = \hat{S}(\overline{C})$, and denote by $p^*$ the optimal value of (3) without membership and size constraints for $k = 2$. Then it holds that
$$p^* = \min_{(C_1, C_2) \in \mathcal{P}_2} \; \sum_{i=1}^2 \frac{\mathrm{cut}(C_i, \overline{C_i})}{\hat{S}(C_i)}.$$
Furthermore, there exists a solution $F^*$ of (3) such that $F^* = [\mathbf{1}_{C^*}, \mathbf{1}_{\overline{C^*}}]$, where $(C^*, \overline{C^*})$ is the optimal balanced 2-cut partition.

Proof: Note that $\mathrm{cut}(C, \overline{C})$ is a symmetric set function, and so is $\hat{S}$ by assumption. Thus with $C_2 = \overline{C_1}$,
$$\frac{\mathrm{cut}(C_1, \overline{C_1})}{\hat{S}(C_1)} + \frac{\mathrm{cut}(C_2, \overline{C_2})}{\hat{S}(C_2)} = 2\, \frac{\mathrm{cut}(C_1, \overline{C_1})}{\hat{S}(C_1)}.$$
Moreover, $\mathrm{TV}(\alpha f + \beta \mathbf{1}) = |\alpha|\, \mathrm{TV}(f)$, and by symmetry of $\hat{S}$ also $S(\alpha f + \beta \mathbf{1}) = |\alpha|\, S(f)$ (see [14, 11]). The simplex constraint implies $F_2 = \mathbf{1} - F_1$ and thus
$$\frac{\mathrm{TV}(F_2)}{S(F_2)} = \frac{\mathrm{TV}(\mathbf{1} - F_1)}{S(\mathbf{1} - F_1)} = \frac{\mathrm{TV}(F_1)}{S(F_1)}.$$
Thus we can write problem (3) equivalently as $\min_{f \in [0,1]^V} 2\, \frac{\mathrm{TV}(f)}{S(f)}$. As for all $A \subset V$, $\mathrm{TV}(\mathbf{1}_A) = \mathrm{cut}(A, \overline{A})$ and $S(\mathbf{1}_A) = \hat{S}(A)$, we have
$$\min_{f \in [0,1]^V} \frac{\mathrm{TV}(f)}{S(f)} \leq \min_{C \subset V} \frac{\mathrm{cut}(C, \overline{C})}{\hat{S}(C)}.$$
However, it has been shown in [11] that $\min_{f \in \mathbb{R}^V} \frac{\mathrm{TV}(f)}{S(f)} = \min_{C \subset V} \frac{\mathrm{cut}(C, \overline{C})}{\hat{S}(C)}$ and that there exists a continuous solution $f^* = \mathbf{1}_{C^*}$, where $C^* = \arg\min_{C \subset V} \frac{\mathrm{cut}(C, \overline{C})}{\hat{S}(C)}$. As $F^* = [f^*, \mathbf{1} - f^*] = [\mathbf{1}_{C^*}, \mathbf{1}_{\overline{C^*}}]$, this finishes the proof.

Note that rounding trivially yields a solution in the setting of the previous theorem. A second result shows that our proposed optimization problem (3) is indeed a relaxation of the balanced k-cut problem (1). Furthermore, the relaxation is exact if $I = V$.

Proposition 1. The continuous problem (3) is a relaxation of the k-cut problem (1). The relaxation is exact, i.e., both problems are equivalent, if $I = V$.

Proof: For any k-way partition $(C_1, \ldots, C_k)$, we can construct $F = (\mathbf{1}_{C_1}, \ldots, \mathbf{1}_{C_k})$. It obviously satisfies the membership and size constraints, and the simplex constraints are satisfied as $\cup_i C_i = V$ and $C_i \cap C_j = \emptyset$ for $i \neq j$. Thus $F$ is feasible for problem (3) and has the same objective value, because $\mathrm{TV}(\mathbf{1}_C) = \mathrm{cut}(C, \overline{C})$ and $S(\mathbf{1}_C) = \hat{S}(C)$. Thus problem (3) is a relaxation of (1). If $I = V$, then the simplex constraints together with the membership constraints imply that each row $F_{(i)}$ contains exactly one non-zero element, which equals 1, i.e., $F \in \{0, 1\}^{n \times k}$. Define, for $l = 1, \ldots, k$, $C_l = \{i \in V \mid F_{il} = 1\}$ (i.e., $F_l = \mathbf{1}_{C_l}$); then it holds that $\cup_l C_l = V$ and $C_l \cap C_j = \emptyset$, $l \neq j$. From the size constraints, we have for $l = 1, \ldots, k$ that $0 < m \leq S(F_l) = S(\mathbf{1}_{C_l}) = \hat{S}(C_l)$. Thus $\hat{S}(C_l) > 0$, $l = 1, \ldots, k$, which by the assumption on $\hat{S}$ implies that each $C_l$ is non-empty. Hence the only feasible points allowed are indicators of k-way partitions, and the equivalence of (1) and (3) follows.

The row-wise simplex and membership constraints enforce that each vertex in $I$ belongs to exactly one component. Note that these constraints alone (even if $I = V$) still cannot guarantee that $F$ corresponds to a k-way partition, since an entire column of $F$ can be zero. This is avoided by the column-wise size constraints, which enforce that each component has at least one vertex. If $I = V$, it is immediate from the proof that problem (3) is no longer a continuous problem, as the feasible set consists only of the indicator matrices of partitions. In this case rounding trivially yields a partition. On the other hand, if $I = \emptyset$ (i.e., no membership constraints) and $k > 2$, it is not guaranteed that rounding the solution of the continuous problem yields a partition. Indeed, we will see in the following that for symmetric balancing functions one can, under these conditions, show that the solution is always strongly degenerated and rounding does not yield a partition (see Theorem 2). Thus we observe that the index set $I$ controls the degree to which the partition constraint is enforced. The idea behind our suggested relaxation is that it is well known in image processing that minimizing the total variation yields piecewise constant solutions (in fact, this follows from seeing the total variation as the Lovász extension of the cut). Thus if $|I|$ is sufficiently large, the vertices whose values are fixed to 0 or 1 propagate this to their neighboring vertices and finally to the whole graph. We discuss the choice of $I$ in more detail in Section 4.
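As a quick illustration of the first part of Proposition 1, any partition can be mapped to a feasible point of (3) whose continuous objective equals BCut; a sketch reusing the hypothetical helpers from the Section 2 sketch:

```python
import numpy as np

def partition_to_F(partition, n):
    """F = (1_{C_1}, ..., 1_{C_k}), the feasible point from the proof of Prop. 1."""
    F = np.zeros((n, len(partition)))
    for l, C in enumerate(partition):
        F[list(C), l] = 1.0
    return F

# Since TV(1_C) = cut(C, complement) and S(1_C) = S_hat(C), the objective of (3)
# at F coincides with BCut(C_1, ..., C_k); shown here for the Ratio Cut.
partition = [{0, 1, 2}, {3, 4}, {5}]
F = partition_to_F(partition, n=6)
obj = sum(lovasz_extension(F[:, l], cut_fn) / lovasz_extension(F[:, l], ratio)
          for l in range(F.shape[1]))
assert np.isclose(obj, bcut(W, partition, ratio))
```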
Simplex constraints alone are not sufficient to yield a partition: Our approach has been inspired by [13], who proposed the following continuous relaxation for the Asymmetric Ratio Cheeger Cut:
$$\min_{\substack{F = (F_1, \ldots, F_k),\\ F \in \mathbb{R}^{n \times k}_+}} \; \sum_{l=1}^k \frac{\mathrm{TV}(F_l)}{\big\| F_l - \mathrm{quant}_{k-1}(F_l)\,\mathbf{1} \big\|_1} \qquad (5)$$
subject to: $F_{(i)} \in \Delta_k$, $i = 1, \ldots, n$, (simplex constraints)
where $S(f) = \| f - \mathrm{quant}_{k-1}(f)\,\mathbf{1} \|_1$ is the Lovász extension of $\hat{S}(C) = \min\{(k-1)|C|,\ |\overline{C}|\}$ and $\mathrm{quant}_{k-1}(f)$ is the $(k-1)$-quantile of $f \in \mathbb{R}^n$. Note that in their approach no membership constraints and no size constraints are present.

We now show that the usage of simplex constraints alone in the optimization problem (3) is not sufficient to guarantee that the solution $F^*$ can be rounded to a partition for any symmetric balancing function in (1). For asymmetric balancing functions, as employed for the Asymmetric Ratio Cheeger Cut by [13] in their relaxation (5), we can prove such a strong result only in the case where the graph is disconnected. However, note that if the number of connected components of the graph is less than the number of desired clusters k, the multi-cut problem is still non-trivial.

Theorem 2. Let $\hat{S}(C)$ be any non-negative symmetric balancing function. Then the continuous relaxation
$$\min_{\substack{F = (F_1, \ldots, F_k),\\ F \in \mathbb{R}^{n \times k}_+}} \; \sum_{l=1}^k \frac{\mathrm{TV}(F_l)}{S(F_l)} \qquad (6)$$
subject to: $F_{(i)} \in \Delta_k$, $i = 1, \ldots, n$, (simplex constraints)
of the balanced k-cut problem (1) is void in the sense that the optimal solution $F^*$ of the continuous problem can be constructed from the optimal solution of the 2-cut problem and $F^*$ cannot be rounded into a k-way partition, see (4). If the graph is disconnected, then the same holds also for any non-negative asymmetric balancing function.

Proof: First, we derive a lower bound on the optimum of the continuous relaxation (6). Then we construct a feasible point for (6) that achieves this lower bound but cannot yield a partitioning, thus finishing the proof. Let $(C^*, \overline{C^*}) = \arg\min_{C \subset V} \frac{\mathrm{cut}(C, \overline{C})}{\hat{S}(C)}$ be an optimal 2-way partition for the given graph. Using the exact relaxation result for the balanced 2-cut problem in Theorem 3.1 of [11], we have
$$\min_{F : F_{(i)} \in \Delta_k} \; \sum_{l=1}^k \frac{\mathrm{TV}(F_l)}{S(F_l)} \;\geq\; \sum_{l=1}^k \min_{f \in \mathbb{R}^n} \frac{\mathrm{TV}(f)}{S(f)} \;=\; \sum_{l=1}^k \min_{C \subset V} \frac{\mathrm{cut}(C, \overline{C})}{\hat{S}(C)} \;=\; k\, \frac{\mathrm{cut}(C^*, \overline{C^*})}{\hat{S}(C^*)}.$$
Now define $F_1 = \mathbf{1}_{C^*}$ and $F_l = \alpha_l \mathbf{1}_{\overline{C^*}}$, $l = 2, \ldots, k$, such that $\sum_{l=2}^k \alpha_l = 1$, $\alpha_l > 0$. Clearly $F = (F_1, \ldots, F_k)$ is feasible for problem (6), and the corresponding objective value is
$$\frac{\mathrm{TV}(\mathbf{1}_{C^*})}{S(\mathbf{1}_{C^*})} + \sum_{l=2}^k \frac{\alpha_l\, \mathrm{TV}(\mathbf{1}_{\overline{C^*}})}{\alpha_l\, S(\mathbf{1}_{\overline{C^*}})} \;=\; k\, \frac{\mathrm{cut}(C^*, \overline{C^*})}{\hat{S}(C^*)},$$
where we used the 1-homogeneity of TV and S [14] and the symmetry of cut and $\hat{S}$. Thus the solution $F$ constructed as above from the 2-cut problem is indeed optimal for the continuous relaxation (6), and it is not possible to obtain a k-way partition from this solution, as there will be $k - 2$ sets that are empty. Finally, the argument can be extended to asymmetric set functions if there exists a set $C$ such that $\mathrm{cut}(C, \overline{C}) = 0$, as in this case it does not matter that $\hat{S}(C) \neq \hat{S}(\overline{C})$ for the argument to hold.

The proof of Theorem 2 shows additionally that, for any balancing function, if the graph is disconnected then the solution of the continuous relaxation (6) is always zero, while the solution of the balanced k-cut problem clearly need not be zero. This shows that the relaxation can be arbitrarily bad in this case. In fact, the relaxation for the asymmetric case can even fail if the graph is not disconnected but there exists a cut of the graph which is very small, as the following corollary indicates.
[Figure 1: Toy example illustrating that the relaxation of [13] converges to a degenerate solution when applied to a graph with a dominating 2-cut. (a) 10NN-graph generated from three Gaussians in 10 dimensions; (b) continuous solution of (5) from [13] for k = 3; (c) rounding of the continuous solution of [13] does not yield a 3-partition; (d) continuous solution found by our method together with the vertices i ∈ I (black) where the membership constraint is enforced; our continuous solution already corresponds to a partition; (e) clustering found by rounding of our continuous solution (trivial, as we have converged to a partition). In (b)-(e), data point i is colored according to F_(i) ∈ R^3.]

Corollary 1. Let $\hat{S}$ be an asymmetric balancing function, let $C^* = \arg\min_{C \subset V} \frac{\mathrm{cut}(C, \overline{C})}{\hat{S}(C)}$, and suppose that
$$\phi^* := \frac{\mathrm{cut}(C^*, \overline{C^*})}{\hat{S}(C^*)} + (k-1)\, \frac{\mathrm{cut}(C^*, \overline{C^*})}{\hat{S}(\overline{C^*})} \;<\; \min_{(C_1, \ldots, C_k) \in \mathcal{P}_k} \sum_{i=1}^k \frac{\mathrm{cut}(C_i, \overline{C_i})}{\hat{S}(C_i)}.$$
Then there exists a feasible $F$ for (6) with $F_1 = \mathbf{1}_{C^*}$ and $F_l = \alpha_l \mathbf{1}_{\overline{C^*}}$, $l = 2, \ldots, k$, such that $\sum_{l=2}^k \alpha_l = 1$, $\alpha_l > 0$, which has objective $\sum_{i=1}^k \frac{\mathrm{TV}(F_i)}{S(F_i)} = \phi^*$ and which cannot be rounded to a k-way partition.

Proof: Let $F_1 = \mathbf{1}_{C^*}$ and $F_l = \alpha_l \mathbf{1}_{\overline{C^*}}$, $l = 2, \ldots, k$, such that $\sum_{l=2}^k \alpha_l = 1$, $\alpha_l > 0$. Clearly $F = (F_1, \ldots, F_k)$ is feasible for problem (6), and the corresponding objective value is
$$\sum_{l=1}^k \frac{\mathrm{TV}(F_l)}{S(F_l)} = \frac{\mathrm{TV}(\mathbf{1}_{C^*})}{S(\mathbf{1}_{C^*})} + \sum_{l=2}^k \frac{\alpha_l\, \mathrm{TV}(\mathbf{1}_{\overline{C^*}})}{\alpha_l\, S(\mathbf{1}_{\overline{C^*}})} = \frac{\mathrm{cut}(C^*, \overline{C^*})}{\hat{S}(C^*)} + (k-1)\, \frac{\mathrm{cut}(C^*, \overline{C^*})}{\hat{S}(\overline{C^*})},$$
where we used the 1-homogeneity of TV and S [14] and the symmetry of the cut. This $F$ cannot be rounded into a k-way partition, as there will be $k - 2$ sets that are empty.

Theorem 2 shows that the membership and size constraints which we have introduced in our relaxation (3) are essential to obtain a partition for symmetric balancing functions. For asymmetric balancing functions, failure of the relaxation (6), and thus also of the relaxation (5) of [13], is only guaranteed for disconnected graphs. However, Corollary 1 indicates that degenerate solutions should also be a problem when the graph is connected but there exists a dominating cut. We illustrate this with a toy example in Figure 1, where the algorithm of [13] for solving (5) fails, as it converges exactly to the solution predicted by Corollary 1 and thus produces only a 2-partition instead of the desired 3-partition. The algorithm for our relaxation, enforcing membership constraints, converges to a continuous solution which is in fact a partition matrix, so that no rounding is necessary.

4 Monotonic Descent Method for Minimization of a Sum of Ratios

Apart from the new relaxation, another key contribution of this paper is the derivation of an algorithm which yields a sequence of feasible points for the difficult non-convex problem (3) and monotonically reduces the corresponding objective. We would like to note that the algorithm proposed by [13] for (5) does not yield monotonic descent. In fact, it is unclear what the derived guarantee for the algorithm in [13] implies for the generated sequence. Moreover, our algorithm works for any non-negative submodular balancing function.

The key insight in order to derive a monotonic descent method for solving the sum-of-ratios minimization problem (3) is to eliminate the ratio by introducing a new set of variables $\beta = (\beta_1, \ldots, \beta_k)$:
$$\min_{\substack{F = (F_1, \ldots, F_k),\ F \in \mathbb{R}^{n \times k}_+,\\ \beta \in \mathbb{R}^k_+}} \; \sum_{l=1}^k \beta_l \qquad (7)$$
subject to:
$\mathrm{TV}(F_l) \leq \beta_l\, S(F_l), \quad l = 1, \ldots, k$, (descent constraints)
$F_{(i)} \in \Delta_k, \quad i = 1, \ldots, n$, (simplex constraints)
$\max\{F_{(i)}\} = 1, \quad \forall i \in I$, (membership constraints)
$S(F_l) \geq m, \quad l = 1, \ldots, k$. (size constraints)

Note that for the optimal solution $(F^*, \beta^*)$ of this problem it holds that $\mathrm{TV}(F_l^*) = \beta_l^*\, S(F_l^*)$, $l = 1, \ldots, k$ (otherwise one could decrease $\beta_l^*$ and hence the objective), and thus equivalence holds. This is still a non-convex problem, as the descent, membership and size constraints are non-convex. Our algorithm proceeds in a sequential manner. At each iterate we construct a convex inner approximation of the constraint set, that is, a convex approximation which is a subset of the non-convex constraint set, based on the current iterate $(F^t, \beta^t)$. Then we solve the resulting convex optimization problem and repeat the process. In this way we obtain a sequence of feasible points for the original problem (7), for which we will prove monotonic descent in the sum of ratios.

Convex approximation: As $\hat{S}$ is submodular, $S$ is convex. Let $s_l^t \in \partial S(F_l^t)$ be an element of the subdifferential of $S$ at the current iterate $F_l^t$. By Prop. 3.2 in [14], we have $(s_l^t)_{j_i} = \hat{S}(C_{i-1}^l) - \hat{S}(C_i^l)$, where $j_i$ is the index of the $i$-th smallest component of $F_l^t$ and $C_i^l = \{j \in V \mid (F_l^t)_j > (F_l^t)_{j_i}\}$, with $C_0^l = V$. Moreover, using the definition of the subgradient and the positive 1-homogeneity of $S$, we have
$$S(F_l) \geq S(F_l^t) + \langle s_l^t,\, F_l - F_l^t \rangle = \langle s_l^t,\, F_l \rangle.$$
For the descent constraints, let $\lambda_l^t = \frac{\mathrm{TV}(F_l^t)}{S(F_l^t)}$ and introduce new variables $\delta_l = \beta_l - \lambda_l^t$ that capture the amount of change in each ratio. We further decompose $\delta_l$ as $\delta_l = \delta_l^+ - \delta_l^-$, $\delta_l^+ \geq 0$, $\delta_l^- \geq 0$. Let $M = \max_{f \in [0,1]^n} S(f) = \max_{C \subset V} \hat{S}(C)$; then, for $S(F_l) \geq m$,
$$\mathrm{TV}(F_l) - \beta_l\, S(F_l) \;\leq\; \mathrm{TV}(F_l) - \lambda_l^t \langle s_l^t, F_l \rangle - \delta_l^+ S(F_l) + \delta_l^- S(F_l) \;\leq\; \mathrm{TV}(F_l) - \lambda_l^t \langle s_l^t, F_l \rangle - \delta_l^+ m + \delta_l^- M.$$
Finally, note that because of the simplex constraints, the membership constraints can be rewritten as $\max\{F_{(i)}\} \geq 1$. Let $i \in I$ and define $j_i := \arg\max_j F_{ij}^t$ (ties are broken randomly). Then the membership constraints can be relaxed as follows: since $1 - \max\{F_{(i)}\} \leq 1 - F_{i j_i}$, the constraint $F_{i j_i} \geq 1$ implies $\max\{F_{(i)}\} \geq 1$, and as $F_{ij} \leq 1$ we get $F_{i j_i} = 1$. Thus the convex approximation of the membership constraints fixes the assignment of the $i$-th point to a cluster and can be interpreted as a "label constraint". However, unlike in the transductive setting, the labels for the vertices in $I$ are chosen automatically by our method. The actual choice of the set $I$ will be discussed in Section 4.1. We use the notation $L = \{(i, j_i) \mid i \in I\}$ for the label set generated from $I$ (note that $L$ is fixed once $I$ is fixed).

Descent algorithm: Our descent algorithm for minimizing (7) solves at each iteration $t$ the following convex optimization problem:
$$\min_{\substack{F \in \mathbb{R}^{n \times k}_+,\\ \delta^+ \in \mathbb{R}^k_+,\ \delta^- \in \mathbb{R}^k_+}} \; \sum_{l=1}^k \big(\delta_l^+ - \delta_l^-\big) \qquad (8)$$
subject to:
$\mathrm{TV}(F_l) \leq \lambda_l^t \langle s_l^t, F_l \rangle + \delta_l^+ m - \delta_l^- M, \quad l = 1, \ldots, k$, (descent constraints)
$F_{(i)} \in \Delta_k, \quad i = 1, \ldots, n$, (simplex constraints)
$F_{i j_i} = 1, \quad \forall (i, j_i) \in L$, (label constraints)
$\langle s_l^t, F_l \rangle \geq m, \quad l = 1, \ldots, k$. (size constraints)
Note that $S(F_l) \geq \langle s_l^t, F_l \rangle$, so the latter constraint is an inner approximation of the original size constraint. As its solution $F^{t+1}$ is feasible for (3), we update $\lambda_l^{t+1} = \frac{\mathrm{TV}(F_l^{t+1})}{S(F_l^{t+1})}$ and $s_l^{t+1} \in \partial S(F_l^{t+1})$, $l = 1, \ldots, k$, and repeat the process until the sequence terminates, that is, no further descent is possible (as the following theorem states), or until the relative descent in $\sum_{l=1}^k \lambda_l^t$ is smaller than a predefined $\epsilon$.
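For concreteness, a small sketch (our own naming, hedged) of the subgradient $s_l^t \in \partial S(F_l^t)$ computed via the formula from Prop. 3.2 in [14] quoted above; by construction it satisfies $\langle s, f \rangle = S(f)$, which is exactly what the linearization $S(F_l) \geq \langle s_l^t, F_l \rangle$ uses:

```python
import numpy as np

def lovasz_subgradient(f, S_hat):
    """Subgradient s of the Lovász extension S at f (Prop. 3.2 in [14]):
    s_{j_i} = S_hat(C_{i-1}) - S_hat(C_i), where j_i is the index of the
    i-th smallest entry of f and C_i the set of indices ranked above it."""
    order = np.argsort(f)            # j_1, ..., j_n (ties broken arbitrarily)
    s = np.zeros(len(f))
    suffix = list(order)             # C_0 = V
    for j in order:
        C_prev = frozenset(suffix)
        suffix = suffix[1:]          # C_i
        s[j] = S_hat(C_prev) - S_hat(frozenset(suffix))
    return s                         # telescoping gives <s, f> = S(f)
```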
The following Theorem 3 shows the monotonic descent property of our algorithm.

Theorem 3. The sequence $\{F^t\}$ produced by the above algorithm satisfies
$$\sum_{l=1}^k \frac{\mathrm{TV}(F_l^{t+1})}{S(F_l^{t+1})} \;<\; \sum_{l=1}^k \frac{\mathrm{TV}(F_l^t)}{S(F_l^t)}$$
for all $t \geq 0$, or the algorithm terminates.

Proof: Let $(F^{t+1}, \delta^{+,t+1}, \delta^{-,t+1})$ be the optimal solution of the inner problem (8). By the feasibility of $(F^{t+1}, \delta^{+,t+1}, \delta^{-,t+1})$ and $S(F_l^{t+1}) \geq m$,
$$\frac{\mathrm{TV}(F_l^{t+1})}{S(F_l^{t+1})} \;\leq\; \frac{\lambda_l^t \langle s_l^t, F_l^{t+1} \rangle + m\,\delta_l^{+,t+1} - M\,\delta_l^{-,t+1}}{S(F_l^{t+1})} \;\leq\; \lambda_l^t + \frac{m\,\delta_l^{+,t+1} - M\,\delta_l^{-,t+1}}{S(F_l^{t+1})} \;\leq\; \lambda_l^t + \delta_l^{+,t+1} - \delta_l^{-,t+1}.$$
Summing over all ratios, we have
$$\sum_{l=1}^k \frac{\mathrm{TV}(F_l^{t+1})}{S(F_l^{t+1})} \;\leq\; \sum_{l=1}^k \lambda_l^t + \sum_{l=1}^k \big(\delta_l^{+,t+1} - \delta_l^{-,t+1}\big).$$
Noting that $\delta^+ = \delta^- = 0$, $F = F^t$ is feasible for (8), the optimal value has to be non-positive: either it is strictly negative, in which case we have strict descent,
$$\sum_{l=1}^k \frac{\mathrm{TV}(F_l^{t+1})}{S(F_l^{t+1})} \;<\; \sum_{l=1}^k \lambda_l^t,$$
or the previous iterate $F^t$ together with $\delta^+ = \delta^- = 0$ is already optimal, and hence the algorithm terminates.

The inner problem (8) is convex but contains the non-smooth term TV in the constraints. We eliminate the non-smoothness by introducing additional variables and derive an equivalent linear programming (LP) formulation. We solve this LP via the PDHG algorithm [15, 16]. The LP and the exact iterates can be found in the supplementary material.

Lemma 1. The convex inner problem (8) is equivalent to the following linear optimization problem, where $E$ is the set of edges of the graph and $w \in \mathbb{R}^{|E|}$ are the edge weights:
$$\min_{\substack{F \in \mathbb{R}^{n \times k}_+,\ \alpha \in \mathbb{R}^{|E| \times k}_+,\\ \delta^+ \in \mathbb{R}^k_+,\ \delta^- \in \mathbb{R}^k_+}} \; \sum_{l=1}^k \big(\delta_l^+ - \delta_l^-\big) \qquad (9)$$
subject to:
$\langle w, \alpha_l \rangle \leq \lambda_l^t \langle s_l^t, F_l \rangle + \delta_l^+ m - \delta_l^- M, \quad l = 1, \ldots, k$, (descent constraints)
$F_{(i)} \in \Delta_k, \quad i = 1, \ldots, n$, (simplex constraints)
$F_{i j_i} = 1, \quad \forall (i, j_i) \in L$, (label constraints)
$\langle s_l^t, F_l \rangle \geq m, \quad l = 1, \ldots, k$, (size constraints)
$-(\alpha_l)_{ij} \leq F_{il} - F_{jl} \leq (\alpha_l)_{ij}, \quad l = 1, \ldots, k,\ \forall (i, j) \in E.$

Proof: We define new variables $\alpha_l \in \mathbb{R}^{|E|}$ for each column $l$ and introduce the constraints $(\alpha_l)_{ij} = |(F_l)_i - (F_l)_j|$, which allows us to rewrite $\mathrm{TV}(F_l)$ as $\langle w, \alpha_l \rangle$. These equality constraints can be replaced by the inequality constraints $(\alpha_l)_{ij} \geq |(F_l)_i - (F_l)_j|$ without changing the optimum of the problem, because at the optimum these constraints are active; otherwise one could decrease $(\alpha_l)_{ij}$ while remaining feasible, since $w$ is non-negative. Finally, these inequality constraints are rewritten using the fact that $|x| \leq y \Leftrightarrow -y \leq x \leq y$ for $y \geq 0$.

4.0.1 Solving the LP via PDHG

Recently, first-order primal-dual hybrid gradient (PDHG) methods have been proposed [17, 15] to efficiently solve a class of convex optimization problems that can be rewritten as the saddle-point problem
$$\min_{x \in X} \max_{y \in Y} \; \langle Ax, y \rangle + G(x) - \Phi^*(y),$$
where $X$ and $Y$ are finite-dimensional vector spaces, $A : X \to Y$ is a linear operator, and $G$ and $\Phi^*$ are convex functions. It has been shown that the PDHG algorithm achieves good performance in solving huge linear programming problems that appear in computer vision applications.

We now show how the linear programming problem
$$\min_{x \geq 0} \; \langle c, x \rangle \quad \text{subject to:} \quad A_1 x \leq b_1, \quad A_2 x = b_2$$
can be rewritten as a saddle-point problem so that PDHG can be applied. By introducing the Lagrange multipliers $y$, the optimal value of the LP can be written as
$$\min_{x \geq 0} \; \langle c, x \rangle + \max_{y_1 \geq 0,\ y_2} \; \langle y_1, A_1 x - b_1 \rangle + \langle y_2, A_2 x - b_2 \rangle$$
$$= \min_x \max_{y_1, y_2} \; \langle c, x \rangle + \iota_{x \geq 0}(x) + \langle y_1, A_1 x \rangle + \langle y_2, A_2 x \rangle - \langle b_1, y_1 \rangle - \langle b_2, y_2 \rangle + \iota_{y_1 \geq 0}(y_1),$$
where $\iota_{\cdot \geq 0}$ is the indicator function that takes the value 0 on the non-negative orthant and $\infty$ elsewhere. Define $b = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}$, $A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix}$ and $y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}$.
Then the saddle-point problem corresponding to the LP is given by
$$\min_x \max_{y_1, y_2} \; \langle c, x \rangle + \iota_{x \geq 0}(x) + \langle y, Ax \rangle - \langle b, y \rangle + \iota_{y_1 \geq 0}(y_1).$$
The primal and dual iterates for this saddle-point problem can be obtained as
$$x^{r+1} = \max\{0,\ x^r - \tau (A^T y^r + c)\},$$
$$y_1^{r+1} = \max\{0,\ y_1^r + \sigma (A_1 \bar{x}^{r+1} - b_1)\},$$
$$y_2^{r+1} = y_2^r + \sigma (A_2 \bar{x}^{r+1} - b_2),$$
where $\bar{x}^{r+1} = 2x^{r+1} - x^r$. Here the primal and dual step sizes $\tau$ and $\sigma$ are chosen such that $\tau \sigma \|A\|^2 < 1$, where $\|\cdot\|$ denotes the operator norm. Instead of the global step sizes $\tau$ and $\sigma$, we use in our implementation the diagonal preconditioning matrices introduced in [16], as this is shown to improve the practical performance of PDHG. The diagonal elements of these preconditioning matrices $\tau$ and $\sigma$ are given by
$$\tau_j = \frac{1}{\sum_{i=1}^{n_r} |A_{ij}|}, \ \forall j \in \{1, \ldots, n_c\}, \qquad \sigma_i = \frac{1}{\sum_{j=1}^{n_c} |A_{ij}|}, \ \forall i \in \{1, \ldots, n_r\},$$
where $n_r$, $n_c$ are the number of rows and the number of columns of the matrix $A$.

For completeness, we now present the explicit form of the primal and dual iterates of the preconditioned PDHG for the LP (9). Let $\theta \in \mathbb{R}^k$, $\mu \in \mathbb{R}^n$, $\zeta \in \mathbb{R}^{|L|}$, $\nu \in \mathbb{R}^k$, $\eta_l \in \mathbb{R}^{|E|}$, $\xi_l \in \mathbb{R}^{|E|}$, $\forall l \in \{1, \ldots, k\}$, be the Lagrange multipliers corresponding to the descent, simplex, label, size and the two sets of additional constraints (introduced to eliminate the non-smoothness), respectively. Let $B : \mathbb{R}^{|E|} \to \mathbb{R}^{|V|}$ be the linear mapping defined by $(Bz)_i = \sum_{j : (i,j) \in E} z_{ij} - z_{ji}$, and let $\mathbf{1}_n \in \mathbb{R}^n$ denote the vector of all ones. Then the primal iterates for the LP (9) are given by
$$F_l^{r+1} = \max\Big\{0,\ F_l^r - \tau_{F,l} \big( (-\theta_l^r \lambda_l^t - \nu_l^r)\, s_l^t + \mu^r + Z_l^r + B(\eta_l^r - \xi_l^r) \big) \Big\}, \quad \forall l \in \{1, \ldots, k\},$$
$$\alpha_l^{r+1} = \max\big\{0,\ \alpha_l^r - \tau_{\alpha,l} \big( \theta_l^r w - \eta_l^r - \xi_l^r \big) \big\}, \quad \forall l \in \{1, \ldots, k\},$$
$$\delta^{+,r+1} = \max\big\{0,\ \delta^{+,r} - \tau_{\delta^+} \big( -m\theta^r + \mathbf{1}_k \big) \big\},$$
$$\delta^{-,r+1} = \max\big\{0,\ \delta^{-,r} - \tau_{\delta^-} \big( M\theta^r - \mathbf{1}_k \big) \big\},$$
where $Z_l^r \in \mathbb{R}^n$, $l = 1, \ldots, k$, is given by $(Z_l^r)_i = \zeta_{il}^r$ if $(i, l) \in L$ and 0 otherwise. Here $\tau_{F,l}$, $\tau_{\alpha,l}$, $\tau_{\delta^+}$, $\tau_{\delta^-}$ are the diagonal preconditioning matrices whose diagonal elements are given by
$$(\tau_{F,l})_i = \frac{1}{(1 + \lambda_l^t)\,|(s_l^t)_i| + 2d_i + \rho_{il} + 1}, \ \forall i \in \{1, \ldots, n\}, \qquad (\tau_{\alpha,l})_{ij} = \frac{1}{w_{ij} + 2}, \ \forall (i,j) \in E,$$
$$(\tau_{\delta^+})_l = \frac{1}{m}, \qquad (\tau_{\delta^-})_l = \frac{1}{M}, \quad \forall l \in \{1, \ldots, k\},$$
where $d_i$ is the number of vertices adjacent to the $i$-th vertex and $\rho_{il} = 1$ if $(i, l) \in L$ and 0 otherwise. The dual iterates are given by
$$\theta_l^{r+1} = \max\Big\{0,\ \theta_l^r + \sigma_{\theta,l} \big( \langle w, \bar{\alpha}_l^{r+1} \rangle - \lambda_l^t \langle s_l^t, \bar{F}_l^{r+1} \rangle - m\, \bar{\delta}_l^{+,r+1} + M\, \bar{\delta}_l^{-,r+1} \big) \Big\}, \quad l = 1, \ldots, k,$$
$$\mu^{r+1} = \mu^r + \sigma_\mu \big( \bar{F}^{r+1} \mathbf{1}_k - \mathbf{1}_n \big),$$
$$\zeta_{il}^{r+1} = \zeta_{il}^r + \sigma_\zeta \big( \bar{F}_{il}^{r+1} - 1 \big), \quad \forall (i, l) \in L,$$
$$\nu_l^{r+1} = \max\big\{0,\ \nu_l^r + \sigma_{\nu,l} \big( -\langle s_l^t, \bar{F}_l^{r+1} \rangle + m \big) \big\}, \quad \forall l \in \{1, \ldots, k\},$$
$$\eta_l^{r+1} = \max\big\{0,\ \eta_l^r + \sigma_{\eta,l} \big( -\bar{\alpha}_l^{r+1} + \bar{F}_{il}^{r+1} - \bar{F}_{jl}^{r+1} \big) \big\}, \quad \forall l \in \{1, \ldots, k\},$$
$$\xi_l^{r+1} = \max\big\{0,\ \xi_l^r + \sigma_{\xi,l} \big( -\bar{\alpha}_l^{r+1} - \bar{F}_{il}^{r+1} + \bar{F}_{jl}^{r+1} \big) \big\}, \quad \forall l \in \{1, \ldots, k\},$$
where
$$\sigma_{\theta,l} = \frac{1}{\langle w, \mathbf{1} \rangle + \lambda_l^t \sum_{i=1}^n |(s_l^t)_i| + m + M}, \qquad \sigma_\zeta = 1, \qquad \sigma_{\nu,l} = \frac{1}{\sum_{i=1}^n |(s_l^t)_i|},$$
and $\sigma_\mu$, $\sigma_{\eta,l}$, $\sigma_{\xi,l}$ are the diagonal preconditioning matrices whose diagonal elements are given by
$$(\sigma_\mu)_i = \frac{1}{k}, \ \forall i \in \{1, \ldots, n\}, \qquad (\sigma_{\eta,l})_{ij} = (\sigma_{\xi,l})_{ij} = \frac{1}{3}, \ \forall (i,j) \in E.$$
From the iterates, one sees that the computational cost per iteration is $O(|E|)$. In our implementation, we further reformulated the LP (9) by directly integrating the label constraints, thereby reducing the problem size and getting rid of the dual variable $\zeta$.
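The following minimal dense-matrix sketch shows the preconditioned PDHG scheme on the generic LP $\min_{x \geq 0} \langle c, x \rangle$ s.t. $A_1 x \leq b_1$, $A_2 x = b_2$. This is our own illustration of the iterates above, not the authors' implementation, which exploits the sparse structure of (9) and costs only $O(|E|)$ per iteration:

```python
import numpy as np

def pdhg_lp(c, A1, b1, A2, b2, iters=5000):
    """Diagonally preconditioned PDHG [15, 16] for
    min <c, x>  s.t.  A1 x <= b1,  A2 x = b2,  x >= 0."""
    A = np.vstack([A1, A2])
    b = np.concatenate([b1, b2])
    m1 = A1.shape[0]
    # diagonal step sizes: tau_j = 1 / sum_i |A_ij|, sigma_i = 1 / sum_j |A_ij|
    tau = 1.0 / np.maximum(np.abs(A).sum(axis=0), 1e-12)
    sigma = 1.0 / np.maximum(np.abs(A).sum(axis=1), 1e-12)
    x = np.zeros(A.shape[1]); y = np.zeros(A.shape[0])
    for _ in range(iters):
        x_new = np.maximum(0.0, x - tau * (A.T @ y + c))  # primal step + projection
        x_bar = 2 * x_new - x                             # extrapolation
        y = y + sigma * (A @ x_bar - b)                   # dual ascent step
        y[:m1] = np.maximum(0.0, y[:m1])                  # y1 >= 0 for inequalities
        x = x_new
    return x
```

In practice one would monitor a primal-dual residual instead of running a fixed number of iterations.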
4.1 Choice of the membership constraints I

The overall algorithmic scheme for solving problem (1) is given in Algorithm 1 below (see also the supplementary material). For the membership constraints we start initially with $I^0 = \emptyset$ and sequentially solve the inner problem (8). From its solution $F^{t+1}$ we construct $(C_1, \ldots, C_k)$ via rounding, see (4). We repeat this process until we either do not improve the resulting balanced k-cut or $(C_1, \ldots, C_k)$ is not a partition. In this case we update $I^{t+1}$ and double the number of membership constraints. Let $(C_1^*, \ldots, C_k^*)$ be the currently optimal partition. For each $l \in \{1, \ldots, k\}$ and $i \in C_l^*$ we compute
$$b_{li}^* = \frac{\mathrm{cut}\big(C_l^* \setminus \{i\},\ \overline{C_l^*} \cup \{i\}\big)}{\hat{S}\big(C_l^* \setminus \{i\}\big)} + \min_{s \neq l} \frac{\mathrm{cut}\big(C_s^* \cup \{i\},\ \overline{C_s^*} \setminus \{i\}\big)}{\hat{S}\big(C_s^* \cup \{i\}\big)} \qquad (10)$$
and define $O_l = \big(\pi_1, \ldots, \pi_{|C_l^*|}\big)$ with $b_{l\pi_1}^* \geq b_{l\pi_2}^* \geq \ldots \geq b_{l\pi_{|C_l^*|}}^*$. The top-ranked vertices in $O_l$ correspond to the ones which lead to the largest minimal increase in BCut when moved from $C_l^*$ to another component and thus are most likely to belong to their current component. Thus it is natural to fix the top-ranked vertices of each component first. Note that the rankings $O_l$, $l = 1, \ldots, k$, are updated whenever a better partition is found. Thus the membership constraints always correspond to the vertices which lead to the largest minimal increase in BCut when moved to another component. In Figure 1 one can observe that the fixed labeled points lie close to the centers of the found clusters. The number of membership constraints required depends on the graph: the better separated the clusters are, the fewer membership constraints need to be enforced in order to avoid degenerate solutions. Finally, we stop the algorithm if we see no more improvement in the cut or the continuous objective and the continuous solution corresponds to a partition.

Algorithm 1: Minimization of the balanced k-cut problem (1)
1: Initialization: $F^0 \in \mathbb{R}^{n \times k}_+$ such that $F^0 \mathbf{1}_k = \mathbf{1}_n$; $\lambda_l^0 = \frac{\mathrm{TV}(F_l^0)}{S(F_l^0)}$, $l = 1, \ldots, k$; $\gamma^0 = \sum_{l=1}^k \lambda_l^0$; $I^0 = \emptyset$; $L = \emptyset$; $p = 0$
Output: partition $(C_1^*, \ldots, C_k^*)$
2: repeat
3:   let $(F^{t+1}, \delta^{+,t+1}, \delta^{-,t+1})$ be the optimal solution of the inner problem (8)
4:   $\lambda_l^{t+1} = \frac{\mathrm{TV}(F_l^{t+1})}{S(F_l^{t+1})}$, $l = 1, \ldots, k$; $\gamma^{t+1} = \sum_{l=1}^k \lambda_l^{t+1}$
5:   $\chi^{t+1} = \sum_{l=1}^k \frac{\mathrm{cut}(C_l^{t+1}, \overline{C_l^{t+1}})}{\hat{S}(C_l^{t+1})}$, where $(C_1^{t+1}, \ldots, C_k^{t+1})$ is obtained from $F^{t+1}$ via rounding (4)
6:   if $\chi^{t+1} < \chi^t$ and $(C_1^{t+1}, \ldots, C_k^{t+1})$ is a k-partition then
7:     $(C_1^*, \ldots, C_k^*) = (C_1^{t+1}, \ldots, C_k^{t+1})$
8:     compute the new orderings $O_l$, $l = 1, \ldots, k$, for $(C_1^*, \ldots, C_k^*)$ according to (10)
9:     $I^{t+1} = \cup_{l=1}^k O_l^p$, where $O_l^p$ denotes the $p$ top-ranked vertices in $O_l$
10:    $L = \{(i, j_i) \mid i \in I^{t+1},\ j_i = \arg\max_j F_{ij}^{t+1}\}$
11:  else
12:    $p = \max\{2|I^t|, 1\}$ (double the number of membership constraints)
13:    $I^{t+1} = \cup_{l=1}^k O_l^p$, where $O_l^p$ denotes the $p$ top-ranked vertices in $O_l$
14:    $L = \{(i, j_i) \mid i \in I^{t+1},\ j_i = \arg\max_j F_{ij}^t\}$
15:    $F^{t+1} = F^t$; then set $F_{ij}^{t+1} = 0$, $\forall i \in I^{t+1}$, $\forall j \in \{1, \ldots, k\}$, and $F_{i j_i}^{t+1} = 1$, $\forall (i, j_i) \in L$
16:    $\lambda_l^{t+1} = \frac{\mathrm{TV}(F_l^{t+1})}{S(F_l^{t+1})}$, $l = 1, \ldots, k$
17:  end if
18: until $\chi^{t+1} = \sum_{l=1}^k \lambda_l^{t+1}$ and $\gamma^{t+1} = \gamma^t$
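A sketch of the ranking score (10) used to select the membership constraints, reusing the hypothetical cut helper from the earlier sketch (we assume every cluster has at least two vertices, so that $\hat{S}(C_l^* \setminus \{i\}) > 0$):

```python
import numpy as np

def membership_scores(W, partition, S_hat):
    """b*_{li} from (10): resulting BCut contribution when vertex i is moved
    from its cluster C_l to the best alternative cluster; a high score means
    i is confidently placed in C_l."""
    V = set(range(W.shape[0]))
    scores = {}
    for l, C_l in enumerate(partition):
        for i in C_l:
            C_minus = set(C_l) - {i}                       # C_l* \ {i}, assumed non-empty
            term_l = cut(W, C_minus, V - C_minus) / S_hat(frozenset(C_minus))
            term_s = min(
                cut(W, set(C_s) | {i}, V - set(C_s) - {i})
                / S_hat(frozenset(set(C_s) | {i}))
                for s, C_s in enumerate(partition) if s != l)
            scores[(l, i)] = term_l + term_s
    return scores

def orderings(scores, k):
    """O_l: vertices of cluster l sorted by decreasing score b*_{li}."""
    return [sorted((i for (m, i) in scores if m == l),
                   key=lambda i: -scores[(l, i)]) for l in range(k)]
```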
5 Experiments

We evaluate our method against a diverse selection of state-of-the-art clustering methods: spectral clustering (Spec) [7], BSpec [11], Graclus¹ [6], the NMF-based approaches PNMF [19], NSC [20], ONMF [21], LSD [22] and NMFR [23], and MTV [13], which optimizes (5). We used the publicly available code [23, 13] with default settings. We run our method using 5 random initializations and 7 initializations based on the spectral clustering solution, similar to [13] (who use 30 such initializations). In addition to the datasets provided in [13], we also selected a variety of datasets from the UCI repository, shown below. For all the datasets not in [13], symmetric k-NN graphs are built with Gaussian weights $\exp\big(-\frac{s\,\|x - y\|^2}{\min\{\sigma_{x,k}^2,\ \sigma_{y,k}^2\}}\big)$, where $\sigma_{x,k}$ is the k-NN distance of point $x$ (a construction sketch is given after the quantitative results below). We chose the parameters $s$ and $k$ in a method-independent way by testing for each dataset several graphs using all the methods over the choices $k \in \{3, 5, 7, 10, 15, 20, 40, 60, 80, 100\}$ and $s \in \{0.1, 1, 4\}$. The best choice in terms of the clustering error across all methods and datasets is $s = 1$, $k = 15$.

             Iris  wine  vertebral  ecoli  4moons  webkb4  optdigits  USPS  pendigits  20news  MNIST
# vertices    150   178        310    336    4000    4196       5620  9298      10992   19928  70000
# classes       3     3          3      6       4       4         10    10         10      20     10

¹ Since Graclus [6], a multi-level algorithm directly minimizing Rcut/Ncut, is shown to be superior to METIS [18], we do not compare with [18].

Quantitative results: In our first experiment we evaluate our method in terms of solving the balanced k-cut problem for various balancing functions, datasets and graph parameters. The following table reports the fraction of times a method achieves the best as well as the strictly best balanced k-cut over all constructed graphs and datasets (in total 30 graphs per dataset). For reference, we also report the cuts obtained by clustering methods that do not directly minimize the respective criterion (set in italics in the original table); methods that directly optimize the criterion are shown there in normal font. Our algorithm can handle all balancing functions and significantly outperforms all other methods across all criteria. For the ratio and normalized cut cases we achieve better results than [7, 11, 6], which directly optimize this criterion. This shows that the greedy recursive bi-partitioning badly affects the performance of [11], which otherwise has been shown to obtain the best cuts on several benchmark datasets [24]. This further shows the need for methods that directly minimize the multi-cut. It is striking that the competing method of [13], which directly minimizes the asymmetric ratio cut, is beaten significantly by Graclus as well as by our method. As this clear trend is less visible in the qualitative experiments, we suspect that extreme graph parameters lead to fast convergence to a degenerate solution.

Best (%) / Strictly Best (%)
            Ours           MTV            BSpec          Spec          Graclus       PNMF         NSC          ONMF         LSD          NMFR
RCC-asym    80.54 / 44.97  25.50 / 10.74  23.49 /  1.34   7.38 / 0.00  38.26 / 4.70  2.01 / 0.00  5.37 / 0.00  2.01 / 0.00  4.03 / 0.00  1.34 / 0.00
RCC-sym     94.63 / 61.74   8.72 /  0.00  19.46 /  0.67   6.71 / 0.00  37.58 / 4.70  0.67 / 0.00  4.03 / 0.00  0.00 / 0.00  0.67 / 0.00  0.67 / 0.00
NCC-asym    93.29 / 56.38  13.42 /  2.01  20.13 /  0.00  10.07 / 0.00  38.26 / 2.01  0.67 / 0.00  5.37 / 0.00  2.01 / 0.67  4.70 / 0.00  2.01 / 1.34
NCC-sym     98.66 / 59.06  10.07 /  0.00  20.81 /  0.00   9.40 / 0.00  40.27 / 1.34  1.34 / 0.00  4.03 / 0.00  0.67 / 0.00  3.36 / 0.00  1.34 / 0.00
Rcut        85.91 / 58.39   7.38 /  0.00  20.13 /  2.68  10.07 / 2.01  32.89 / 8.72  0.67 / 0.00  4.03 / 0.00  0.00 / 0.00  1.34 / 0.00  1.34 / 0.67
Ncut        95.97 / 61.07  10.07 /  0.00  20.13 /  0.00   9.40 / 0.00  37.58 / 4.03  1.34 / 0.00  4.70 / 0.00  0.67 / 0.00  3.36 / 0.00  0.67 / 0.00
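For reproducibility, a sketch of the graph construction described above (how the k-NN neighbourhoods are symmetrized is not stated in the paper; we take the union of the k-NN relations, which is one common choice):

```python
import numpy as np

def knn_graph(X, k=15, s=1.0):
    """Symmetric k-NN graph with Gaussian weights
    w_ij = exp(-s * ||x_i - x_j||^2 / min(sigma_{i,k}, sigma_{j,k})^2),
    where sigma_{i,k} is the distance of x_i to its k-th nearest neighbour."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    n = X.shape[0]
    nn = np.argsort(D, axis=1)              # column 0 is the point itself
    sigma = D[np.arange(n), nn[:, k]]       # k-NN distance of each point
    W = np.zeros((n, n))
    for i in range(n):
        for j in nn[i, 1:k + 1]:
            w = np.exp(-s * D[i, j] ** 2 / min(sigma[i], sigma[j]) ** 2)
            W[i, j] = W[j, i] = w           # union of the two neighbourhoods
    return W
```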
Qualitative results: In the following table we report the clustering errors and the balanced k-cuts obtained by all methods using the graphs built with $k = 15$, $s = 1$ for all datasets. As the main goal is to compare to [13], we choose their balancing function (RCC-asym). Again, our method always achieves the best cuts across all datasets. In three cases, the best cut also corresponds to the best clustering performance. In the case of vertebral, 20news and webkb4, the best cuts actually result in high errors. However, we see in our next experiment that integrating ground-truth label information helps in these cases to improve the clustering performance significantly.

Err(%) / BCut
         Iris           wine           vertebral      ecoli          4moons         webkb4         optdigits      USPS           pendigits      20news         MNIST
BSpec    23.33 / 1.495  37.64 / 6.417  50.00 / 1.890  19.35 / 2.550  36.33 / 0.634  60.46 / 1.056  11.30 / 0.386  20.09 / 0.822  17.59 / 0.081  84.21 / 0.966  11.82 / 0.471
Spec     22.00 / 1.783  20.22 / 5.820  48.71 / 1.950  14.88 / 2.759  31.45 / 0.917  60.32 / 1.520   7.81 / 0.442  21.05 / 0.873  16.75 / 0.141  79.10 / 1.170  22.83 / 0.707
PNMF     22.67 / 1.508  27.53 / 4.916  50.00 / 2.250  16.37 / 2.652  35.23 / 0.737  60.94 / 3.520  10.37 / 0.548  24.07 / 1.180  17.93 / 0.415  66.00 / 2.924  12.80 / 0.934
NSC      23.33 / 1.518  17.98 / 5.140  50.00 / 2.046  14.88 / 2.754  32.05 / 0.933  59.49 / 3.566   8.24 / 0.482  20.53 / 0.850  19.81 / 0.101  78.86 / 2.233  21.27 / 0.688
ONMF     23.33 / 1.518  28.09 / 4.881  50.65 / 2.371  16.07 / 2.633  35.35 / 0.725  60.94 / 3.621  10.37 / 0.548  24.14 / 1.183  22.82 / 0.548  69.02 / 3.058  27.27 / 1.575
LSD      23.33 / 1.518  17.98 / 5.399  39.03 / 2.557  18.45 / 2.523  35.68 / 0.782  47.93 / 2.082   8.42 / 0.483  22.68 / 0.918  13.90 / 0.188  67.81 / 2.056  24.49 / 0.959
NMFR     22.00 / 1.627  11.24 / 4.318  38.06 / 2.713  22.92 / 2.556  36.33 / 0.840  40.73 / 1.467   2.08 / 0.369  22.17 / 0.992  13.13 / 0.240  39.97 / 1.241  fail
Graclus  23.33 / 1.534   8.43 / 4.293  49.68 / 1.890  16.37 / 2.414   0.45 / 0.589  39.97 / 1.581   1.67 / 0.350  19.75 / 0.815  10.93 / 0.092  60.69 / 1.431   2.43 / 0.440
MTV      22.67 / 1.508  18.54 / 5.556  34.52 / 2.433  22.02 / 2.500   7.72 / 0.774  48.40 / 2.346   4.11 / 0.374  15.13 / 0.940  20.55 / 0.193  72.18 / 3.291   3.77 / 0.458
Ours     23.33 / 1.495   6.74 / 4.168  50.00 / 1.890  16.96 / 2.399   0.45 / 0.589  60.46 / 1.056   1.71 / 0.350  19.72 / 0.802  19.95 / 0.079  79.51 / 0.895   2.37 / 0.439

Transductive Setting: We evaluate our method against [13] in a transductive setting. As in [13], we randomly sample either one label or a fixed percentage of labels per class from the ground truth. We report clustering errors and cuts (RCC-asym) for both methods for different amounts of label information. For the label experiments, their initialization strategy seems to work better, as their cuts improve compared to the unlabeled case. However, observe that in some cases their method seems to fail completely (Iris and 4moons for one label per class).
Err(%) / BCut
Labels  Method  Iris           wine           vertebral      ecoli          4moons         webkb4         optdigits      USPS           pendigits      20news         MNIST
1       MTV     33.33 / 3.855   9.55 / 4.288  42.26 / 2.244  13.99 / 2.430  35.75 / 0.723  51.98 / 1.596   1.69 / 0.352  12.91 / 0.846  14.49 / 0.127  50.96 / 1.286   2.45 / 0.439
        Ours    22.67 / 1.571   8.99 / 4.234  50.32 / 2.265  15.48 / 2.432   0.57 / 0.610  45.11 / 1.471   1.69 / 0.352  12.98 / 0.812  10.98 / 0.113  68.53 / 1.057   2.36 / 0.439
1%      MTV     33.33 / 3.855  10.67 / 4.277  39.03 / 2.300  14.29 / 2.429   0.45 / 0.589  48.38 / 1.584   1.67 / 0.354   5.21 / 0.789   7.75 / 0.129  40.18 / 1.208   2.41 / 0.443
        Ours    22.67 / 1.571   6.18 / 4.220  41.29 / 2.288  13.99 / 2.419   0.45 / 0.589  41.63 / 1.462   1.67 / 0.354   5.13 / 0.789   7.75 / 0.128  37.42 / 1.157   2.33 / 0.442
5%      MTV     17.33 / 1.685   7.87 / 4.330  40.65 / 2.701  14.58 / 2.462   0.45 / 0.589  40.09 / 1.763   1.51 / 0.369   4.85 / 0.812   1.79 / 0.188  31.89 / 1.254   2.18 / 0.455
        Ours    17.33 / 1.685   6.74 / 4.224  37.10 / 2.724  13.99 / 2.461   0.45 / 0.589  38.04 / 1.719   1.53 / 0.369   4.85 / 0.811   1.76 / 0.188  30.07 / 1.210   2.18 / 0.455
10%     MTV     18.67 / 1.954   7.30 / 4.332  39.03 / 3.187  13.39 / 2.776   0.38 / 0.592  40.63 / 2.057   1.41 / 0.377   4.19 / 0.833   1.24 / 0.197  27.80 / 1.346   2.03 / 0.465
        Ours    14.67 / 1.960   6.74 / 4.194  33.87 / 3.134  13.10 / 2.778   0.38 / 0.592  41.97 / 1.972   1.41 / 0.377   4.25 / 0.833   1.24 / 0.197  26.55 / 1.314   2.02 / 0.465

6 Conclusion

We presented a framework for directly minimizing the balanced k-cut problem based on a new tight continuous relaxation. Apart from the standard ratio/normalized cut, our method can also handle new application-specific balancing functions. Moreover, in contrast to a recursive splitting approach [25], our method enables the direct integration of prior information available in the form of must/cannot-link constraints, which is an interesting topic for future research. Finally, the monotonic descent algorithm proposed for the difficult sum-of-ratios problem is another key contribution of the paper which is of independent interest.

References

[1] W. E. Donath and A. J. Hoffman. Lower bounds for the partitioning of graphs. IBM J. Res. Develop., 17:420-425, 1973.
[2] A. Pothen, H. D. Simon, and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl., 11(3):430-452, 1990.
[3] L. Hagen and A. B. Kahng. Fast spectral methods for ratio cut partitioning and clustering. In ICCAD, pages 10-13, 1991.
[4] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22:888-905, 2000.
[5] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849-856, 2001.
[6] I. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell., pages 1944-1957, 2007.
[7] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395-416, 2007.
[8] S. Guattery and G. Miller. On the quality of spectral separators. SIAM J. Matrix Anal. Appl., 19:701-719, 1998.
[9] A. Szlam and X. Bresson. Total variation and Cheeger cuts. In ICML, pages 1039-1046, 2010.
[10] M. Hein and T. Bühler. An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In NIPS, pages 847-855, 2010.
[11] M. Hein and S. Setzer. Beyond spectral clustering - tight relaxations of balanced graph cuts. In NIPS, pages 2366-2374, 2011.
[12] X. Bresson, T. Laurent, D. Uminsky, and J. H. von Brecht. Convergence and energy landscape for Cheeger cut clustering. In NIPS, pages 1394-1402, 2012.
[13] X. Bresson, T. Laurent, D. Uminsky, and J. H. von Brecht. Multiclass total variation clustering. In NIPS, pages 1421-1429, 2013.
[14] F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145-373, 2013.
[15] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. of Math. Imaging and Vision, 40:120-145, 2011.
[16] T. Pock and A. Chambolle. Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In ICCV, pages 1762-1769, 2011.
[17] E. Esser, X. Zhang, and T. F. Chan. A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM J. on Imaging Sciences, 3(4):1015-1046, 2010.
[18] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20(1):359-392, 1998.
[19] Z. Yang and E. Oja. Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks, 21(5):734-749, 2010.
[20] C. Ding, T. Li, and M. I. Jordan. Nonnegative matrix factorization for combinatorial optimization: Spectral clustering, graph matching, and clique finding. In ICDM, pages 183-192, 2008.
[21] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In KDD, pages 126-135, 2006.
[22] R. Arora, M. R. Gupta, A. Kapila, and M. Fazel. Clustering by left-stochastic matrix factorization. In ICML, pages 761-768, 2011.
[23] Z. Yang, T. Hao, O. Dikmen, X. Chen, and E. Oja. Clustering by nonnegative matrix factorization using graph random walk. In NIPS, pages 1088-1096, 2012.
[24] A. J. Soper, C. Walshaw, and M. Cross. A combined evolutionary search and multilevel optimisation approach to graph-partitioning. J. of Global Optimization, 29(2):225-241, 2004.
[25] S. S. Rangapuram and M. Hein. Constrained 1-spectral clustering. In AISTATS, pages 1143-1151, 2012.