Tight Continuous Relaxation of the Balanced k-Cut Problem

Syama Sundar Rangapuram, Pramod Kaushik Mudrakarta and Matthias Hein
Department of Mathematics and Computer Science
Saarland University, Saarbrücken
Abstract
Spectral clustering, as a relaxation of the normalized/ratio cut, has become one of
the standard graph-based clustering methods. Existing methods for the computation of multiple clusters, corresponding to a balanced k-cut of the graph, are
either based on greedy techniques or on heuristics which have only a weak connection to
the original motivation of minimizing the normalized cut. In this paper we propose a new tight continuous relaxation for any balanced k-cut problem and show
that a related, recently proposed relaxation is in most cases loose, leading to poor
performance in practice. For the optimization of our tight continuous relaxation
we propose a new algorithm for the difficult sum-of-ratios minimization problem
which achieves monotonic descent. Extensive comparisons show that our method
outperforms all existing approaches for ratio cut and other balanced k-cut criteria.
1 Introduction
Graph-based techniques for clustering have become very popular in machine learning as they allow for an easy integration of pairwise relationships in data. The problem of finding k clusters in
a graph can be formulated as a balanced k-cut problem [1, 2, 3, 4], where ratio and normalized
cut are famous instances of balanced graph cut criteria employed for clustering, community detection and image segmentation. The balanced k-cut problem is known to be NP-hard [4], and thus in
practice relaxations [4, 5] or greedy approaches [6] are used for finding the optimal multi-cut. The
most famous approach is spectral clustering [7], which corresponds to the spectral relaxation of the
ratio/normalized cut and uses k-means in the embedding of the vertices given by the first k eigenvectors of the graph Laplacian in order to obtain the clustering. However, the spectral relaxation has
been shown to be loose for k = 2 [8], and for k > 2 no guarantees are known on the quality of the
obtained k-cut with respect to the optimal one. Moreover, in practice even greedy approaches [6]
frequently outperform spectral clustering.
This paper is motivated by another line of recent work [9, 10, 11, 12] in which it has been shown that
an exact continuous relaxation for the two-cluster case (k = 2) is possible for a quite general class of
balancing functions. Moreover, efficient algorithms for its optimization have been proposed which
produce much better cuts than the standard spectral relaxation. However, the multi-cut problem
still has to be solved via a greedy recursive splitting technique.
Inspired by the recent approach in [13], in this paper we directly tackle the general balanced k-cut
problem based on a new tight continuous relaxation. We show that the relaxation for the asymmetric
ratio Cheeger cut proposed recently in [13] is loose when the data does not contain k well-separated
clusters and thus leads to poor performance in practice. Similar to [13] we can also integrate label
information, leading to a transductive clustering formulation. Moreover, we propose an efficient
algorithm for the minimization of our continuous relaxation for which we can prove monotonic
descent. This is in contrast to the algorithm proposed in [13], for which no such guarantee holds.
In extensive experiments we show that our method outperforms all existing methods in terms of the
achieved balanced k-cuts. Moreover, our clustering error is competitive with respect to several other
clustering techniques based on balanced k-cuts and with recently proposed approaches based on non-negative matrix factorization. We also observe that already a small amount of label information
improves the clustering error significantly.
2 Balanced Graph Cuts
Graphs are typically used in machine learning as similarity graphs, that is, the weight of an edge
between two instances encodes their similarity. Given such a similarity graph of the instances, the
problem of clustering into k sets can be transformed into a graph partitioning problem, where the goal
is to construct a partition of the graph into k sets such that the cut, that is, the sum of weights of the
edges from each set to all other sets, is small and all sets in the partition are roughly of equal size.
Before we introduce balanced graph cuts, we briefly fix the setting and notation. Let $G(V, W)$ denote an undirected, weighted graph with vertex set $V$, $n = |V|$ vertices and weight matrix $W \in \mathbb{R}^{n \times n}_+$ with $W = W^T$. There is an edge between two vertices $i, j \in V$ if $w_{ij} > 0$. The cut between two sets $A, B \subset V$ is defined as $\operatorname{cut}(A, B) = \sum_{i \in A, j \in B} w_{ij}$, and we write $\mathbf{1}_A$ for the indicator vector of a set $A \subset V$. A collection of $k$ sets $(C_1, \ldots, C_k)$ is a partition of $V$ if $\cup_{i=1}^k C_i = V$, $C_i \cap C_j = \emptyset$ for $i \neq j$ and $|C_i| \geq 1$, $i = 1, \ldots, k$. We denote the set of all $k$-partitions of $V$ by $\mathcal{P}_k$. Furthermore, we denote by $\Delta_k$ the simplex $\{x \in \mathbb{R}^k : x \geq 0, \sum_{i=1}^k x_i = 1\}$.
Finally, a set function $\hat{S} : 2^V \to \mathbb{R}$ is called submodular if for all $A, B \subset V$, $\hat{S}(A \cup B) + \hat{S}(A \cap B) \leq \hat{S}(A) + \hat{S}(B)$. Furthermore, we need the concept of the Lovász extension of a set function.

Definition 1 Let $\hat{S} : 2^V \to \mathbb{R}$ be a set function with $\hat{S}(\emptyset) = 0$. Let $f \in \mathbb{R}^V$ be ordered in increasing order $f_1 \leq f_2 \leq \ldots \leq f_n$ and define $C_i = \{j \in V \mid f_j > f_i\}$, where $C_0 = V$. Then $S : \mathbb{R}^V \to \mathbb{R}$ given by
$$S(f) = \sum_{i=1}^n f_i \big( \hat{S}(C_{i-1}) - \hat{S}(C_i) \big)$$
is called the Lovász extension of $\hat{S}$. Note that $S(\mathbf{1}_A) = \hat{S}(A)$ for all $A \subset V$.
The Lovász extension of a set function is convex if and only if the set function is submodular [14]. The cut function $\operatorname{cut}(C, \overline{C})$, where $\overline{C} = V \setminus C$, is submodular, and its Lovász extension is given by $\operatorname{TV}(f) = \frac{1}{2} \sum_{i,j=1}^n w_{ij} |f_i - f_j|$.
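To make Definition 1 concrete, the following minimal Python sketch (our illustration, not from the paper; only numpy is assumed) evaluates the Lovász extension of an arbitrary set function by sorting, and checks that for the cut function it reproduces TV(f):

```python
import numpy as np

def lovasz_extension(f, S_hat):
    """S(f) = sum_i f_{j_i} (S_hat(C_{i-1}) - S_hat(C_i)) over the threshold sets C_i."""
    order = np.argsort(f)                        # j_1, ..., j_n: increasing order of f
    val = 0.0
    for pos, idx in enumerate(order):
        C_prev = set(order[pos:].tolist())       # C_{i-1}: current element and above
        C_curr = set(order[pos + 1:].tolist())   # C_i: strictly above
        val += f[idx] * (S_hat(C_prev) - S_hat(C_curr))
    return val

def tv(f, W):
    """Lovasz extension of the cut: TV(f) = 0.5 * sum_ij w_ij |f_i - f_j|."""
    return 0.5 * np.sum(W * np.abs(f[:, None] - f[None, :]))

# sanity check on a small random graph
rng = np.random.default_rng(0)
W = rng.random((5, 5)); W = np.triu(W, 1); W = W + W.T        # symmetric weights
cut = lambda C: sum(W[i, j] for i in C for j in set(range(5)) - C)
f = rng.random(5)
assert np.isclose(lovasz_extension(f, cut), tv(f, W))
```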
2.1 Balanced k-cuts
The balanced k-cut problem is defined as
$$\min_{(C_1, \ldots, C_k) \in \mathcal{P}_k} \; \sum_{i=1}^k \frac{\operatorname{cut}(C_i, \overline{C_i})}{\hat{S}(C_i)} \;=:\; \operatorname{BCut}(C_1, \ldots, C_k) \qquad (1)$$
where $\hat{S} : 2^V \to \mathbb{R}_+$ is a balancing function whose goal is that all sets $C_i$ are of the same "size". In this paper, we assume that $\hat{S}(\emptyset) = 0$ and that for any $C \subsetneq V$, $C \neq \emptyset$, $\hat{S}(C) \geq m$ for some $m > 0$. In the literature one finds mainly the following submodular balancing functions (in brackets the name of the overall balanced graph cut criterion $\operatorname{BCut}(C_1, \ldots, C_k)$):
$$\hat{S}(C) = |C| \qquad \text{(Ratio Cut)}, \qquad (2)$$
$$\hat{S}(C) = \min\{|C|, |\overline{C}|\} \qquad \text{(Ratio Cheeger Cut)},$$
$$\hat{S}(C) = \min\{(k-1)|C|, |\overline{C}|\} \qquad \text{(Asymmetric Ratio Cheeger Cut)}.$$
The Ratio Cut is well studied in the literature, e.g. [3, 7, 6], and corresponds to a balancing function without bias towards a particular size of the sets, whereas the Asymmetric Ratio Cheeger Cut recently proposed in [13] has a bias towards sets of size $\frac{|V|}{k}$ ($\hat{S}(C)$ attains its maximum at this point), which makes perfect sense if one expects clusters of roughly equal size. An intermediate version between the two is the Ratio Cheeger Cut, which has a symmetric balancing function and strongly penalizes overly large clusters. For ease of presentation we restrict ourselves to these balancing functions. However, we can also handle the corresponding weighted cases, e.g., $\hat{S}(C) = \operatorname{vol}(C) = \sum_{i \in C} d_i$, where $d_i = \sum_{j=1}^n w_{ij}$, leading to the normalized cut [4].
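As an illustration (ours, not from the paper), the balanced cut objective (1) with any of the balancing functions in (2) can be evaluated directly for a given partition; a minimal sketch assuming a dense weight matrix W:

```python
import numpy as np

def cut(W, C):
    """cut(C, complement of C) on the graph with weight matrix W."""
    C = np.asarray(sorted(C))
    comp = np.setdiff1d(np.arange(W.shape[0]), C)
    return W[np.ix_(C, comp)].sum()

def bcut(W, partition, balance):
    """BCut(C_1, ..., C_k) = sum_l cut(C_l, complement) / S_hat(C_l), eq. (1)."""
    n = W.shape[0]
    return sum(cut(W, C) / balance(len(C), n) for C in partition)

# the three balancing functions of eq. (2), as functions of |C| and n = |V|
ratio        = lambda c, n: c                             # Ratio Cut
cheeger      = lambda c, n: min(c, n - c)                 # Ratio Cheeger Cut
asym_cheeger = lambda c, n, k=3: min((k - 1) * c, n - c)  # Asymmetric Ratio Cheeger Cut
```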
3 Tight Continuous Relaxation for the Balanced k-Cut Problem
In this section we discuss our proposed relaxation of the balanced k-cut problem (1). It turns out
that a crucial question towards a tight multi-cut relaxation is the choice of the constraints so that
the continuous problem also yields a partition (together with a suitable rounding scheme). The
motivation for our relaxation is taken from the recent work of [9, 10, 11], where exact relaxations
are shown for the case k = 2. Basically, they replace the ratio of set functions with the ratio of
the corresponding Lovász extensions. We use the same idea for the objective of our continuous
relaxation of the k-cut problem (1), which is given as
$$\min_{\substack{F = (F_1, \ldots, F_k), \\ F \in \mathbb{R}^{n \times k}_+}} \; \sum_{l=1}^k \frac{\operatorname{TV}(F_l)}{S(F_l)} \qquad (3)$$
$$\text{subject to: } \quad F_{(i)} \in \Delta_k, \quad i = 1, \ldots, n, \qquad \text{(simplex constraints)}$$
$$\max\{F_{(i)}\} = 1, \quad \forall i \in I, \qquad \text{(membership constraints)}$$
$$S(F_l) \geq m, \quad l = 1, \ldots, k, \qquad \text{(size constraints)}$$
where $S$ is the Lovász extension of the set function $\hat{S}$ and $m = \min_{C \subsetneq V,\, C \neq \emptyset} \hat{S}(C)$. We have $m = 1$ for the Ratio Cut and the Ratio Cheeger Cut, whereas $m = k - 1$ for the Asymmetric Ratio Cheeger Cut. Note that TV is the Lovász extension of the cut functional $\operatorname{cut}(C, \overline{C})$. In order to simplify notation we denote, for a matrix $F \in \mathbb{R}^{n \times k}$, by $F_l$ the $l$-th column of $F$ and by $F_{(i)}$ the $i$-th row of $F$. Note that the rows of $F$ correspond to the vertices of the graph and the $j$-th column of $F$ corresponds to the set $C_j$ of the desired partition. The set $I \subset V$ in the membership constraints is chosen adaptively by our method during the sequential optimization described in Section 4.
An obvious question is how to get from the continuous solution $F^*$ of (3) to a partition $(C_1, \ldots, C_k) \in \mathcal{P}_k$; this step is typically called rounding. Given $F^*$, we construct the sets by assigning each vertex to the column in which its row attains its maximum. Formally,
$$C_i = \{ j \in V \mid i = \operatorname*{arg\,max}_{s = 1, \ldots, k} F^*_{js} \}, \quad i = 1, \ldots, k, \qquad \text{(Rounding)} \quad (4)$$
where ties are broken randomly. If there exists a row such that the rounding is not unique, we say that the solution is weakly degenerated. If furthermore the resulting sets $(C_1, \ldots, C_k)$ do not form a partition, that is, one of the sets is empty, then we say that the solution is strongly degenerated.
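A minimal sketch of the rounding step (4) together with the two degeneracy checks (our illustration; F is an n x k numpy array, and unlike (4) numpy's argmax breaks ties by the first index rather than randomly):

```python
import numpy as np

def round_to_partition(F):
    """Round a continuous solution F to sets via eq. (4) and report degeneracy."""
    n, k = F.shape
    ties = (F == F.max(axis=1, keepdims=True)).sum(axis=1) > 1  # non-unique row maxima
    labels = F.argmax(axis=1)
    sets = [np.where(labels == l)[0] for l in range(k)]
    weakly_degenerated = bool(ties.any())
    strongly_degenerated = any(len(C) == 0 for C in sets)       # some set is empty
    return sets, weakly_degenerated, strongly_degenerated
```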
First, we connect our relaxation to the previous work of [11] for the case k = 2. Indeed, for a symmetric balancing function such as the Ratio Cheeger Cut, our continuous relaxation (3) is exact even without membership and size constraints.
Theorem 1 Let $\hat{S}$ be a non-negative symmetric balancing function, $\hat{S}(C) = \hat{S}(\overline{C})$, and denote by $p^*$ the optimal value of (3) without membership and size constraints for $k = 2$. Then it holds that
$$p^* = \min_{(C_1, C_2) \in \mathcal{P}_2} \; \sum_{i=1}^2 \frac{\operatorname{cut}(C_i, \overline{C_i})}{\hat{S}(C_i)}.$$
Furthermore, there exists a solution $F^*$ of (3) such that $F^* = [\mathbf{1}_{C^*}, \mathbf{1}_{\overline{C^*}}]$, where $(C^*, \overline{C^*})$ is the optimal balanced 2-cut partition.
Proof: Note that $\operatorname{cut}(C, \overline{C})$ is a symmetric set function, and so is $\hat{S}$ by assumption. Thus with $C_2 = \overline{C_1}$,
$$\frac{\operatorname{cut}(C_1, \overline{C_1})}{\hat{S}(C_1)} + \frac{\operatorname{cut}(C_2, \overline{C_2})}{\hat{S}(C_2)} = 2\, \frac{\operatorname{cut}(C_1, \overline{C_1})}{\hat{S}(C_1)}.$$
Moreover, $\operatorname{TV}(\alpha f + \beta \mathbf{1}) = |\alpha| \operatorname{TV}(f)$, and by symmetry of $\hat{S}$ also $S(\alpha f + \beta \mathbf{1}) = |\alpha| S(f)$ (see [14, 11]). The simplex constraint implies $F_2 = \mathbf{1} - F_1$ and thus
$$\frac{\operatorname{TV}(F_2)}{S(F_2)} = \frac{\operatorname{TV}(\mathbf{1} - F_1)}{S(\mathbf{1} - F_1)} = \frac{\operatorname{TV}(F_1)}{S(F_1)}.$$
Thus we can write problem (3) equivalently as
$$\min_{f \in [0,1]^V} \; 2\, \frac{\operatorname{TV}(f)}{S(f)}.$$
As $\operatorname{TV}(\mathbf{1}_A) = \operatorname{cut}(A, \overline{A})$ and $S(\mathbf{1}_A) = \hat{S}(A)$ for all $A \subset V$, we have
$$\min_{f \in [0,1]^V} \frac{\operatorname{TV}(f)}{S(f)} \;\leq\; \min_{C \subset V} \frac{\operatorname{cut}(C, \overline{C})}{\hat{S}(C)}.$$
However, it has been shown in [11] that $\min_{f \in \mathbb{R}^V} \frac{\operatorname{TV}(f)}{S(f)} = \min_{C \subset V} \frac{\operatorname{cut}(C, \overline{C})}{\hat{S}(C)}$ and that there exists a continuous solution such that $f^* = \mathbf{1}_{C^*}$, where $C^* = \operatorname*{arg\,min}_{C \subset V} \frac{\operatorname{cut}(C, \overline{C})}{\hat{S}(C)}$. As $F^* = [f^*, \mathbf{1} - f^*] = [\mathbf{1}_{C^*}, \mathbf{1}_{\overline{C^*}}]$, this finishes the proof.
Note that rounding trivially yields a solution in the setting of the previous theorem.
A second result shows that indeed our proposed optimization problem (3) is a relaxation of the
balanced k-cut problem (1). Furthermore, the relaxation is exact if I = V .
Proposition 1 The continuous problem (3) is a relaxation of the k-cut problem (1). The relaxation
is exact, i.e., both problems are equivalent, if I = V .
Proof: For any $k$-way partition $(C_1, \ldots, C_k)$ we can construct $F = (\mathbf{1}_{C_1}, \ldots, \mathbf{1}_{C_k})$. It obviously satisfies the membership and size constraints, and the simplex constraint is satisfied as $\cup_i C_i = V$ and $C_i \cap C_j = \emptyset$ for $i \neq j$. Thus $F$ is feasible for problem (3) and has the same objective value because $\operatorname{TV}(\mathbf{1}_C) = \operatorname{cut}(C, \overline{C})$ and $S(\mathbf{1}_C) = \hat{S}(C)$. Thus problem (3) is a relaxation of (1).

If $I = V$, then the simplex constraints together with the membership constraints imply that each row $F_{(i)}$ contains exactly one non-zero element, which equals 1, i.e., $F \in \{0,1\}^{n \times k}$. Define for $l = 1, \ldots, k$, $C_l = \{i \in V \mid F_{il} = 1\}$ (i.e., $F_l = \mathbf{1}_{C_l}$); then it holds that $\cup_l C_l = V$ and $C_l \cap C_j = \emptyset$, $l \neq j$. From the size constraints we have, for $l = 1, \ldots, k$, $0 < m \leq S(F_l) = S(\mathbf{1}_{C_l}) = \hat{S}(C_l)$. Thus $\hat{S}(C_l) > 0$, $l = 1, \ldots, k$, which by assumption on $\hat{S}$ implies that each $C_l$ is non-empty. Hence the only feasible points allowed are indicators of $k$-way partitions, and the equivalence of (1) and (3) follows.
The row-wise simplex and membership constraints enforce that each vertex in I belongs to exactly
one component. Note that these constraints alone (even if I = V) still cannot guarantee that F
corresponds to a k-way partition, since an entire column of F can be zero. This is prevented by the
column-wise size constraints, which enforce that each component contains at least one vertex.

If I = V, it is immediate from the proof that problem (3) is no longer a continuous problem, as the
feasible set consists only of indicator matrices of partitions. In this case rounding trivially yields a partition.
On the other hand, if I = ∅ (i.e., no membership constraints) and k > 2, it is not guaranteed
that rounding of the solution of the continuous problem yields a partition. Indeed, we will see in
the following that for symmetric balancing functions one can show that under these conditions
the solution is always strongly degenerated and rounding does not yield a partition (see Theorem
2). Thus we observe that the index set I controls the degree to which the partition constraint is
enforced. The idea behind our suggested relaxation is that it is well known in image processing that
minimizing the total variation yields piecewise constant solutions (in fact this follows from seeing
the total variation as the Lovász extension of the cut). Thus if |I| is sufficiently large, the vertices where
the values are fixed to 0 or 1 propagate this to their neighboring vertices and finally to the whole
graph. We discuss the choice of I in more detail in Section 4.
Simplex constraints alone are not sufficient to yield a partition: Our approach has been inspired by [13], who proposed the following continuous relaxation for the Asymmetric Ratio Cheeger Cut:
$$\min_{\substack{F = (F_1, \ldots, F_k), \\ F \in \mathbb{R}^{n \times k}_+}} \; \sum_{l=1}^k \frac{\operatorname{TV}(F_l)}{\big\lVert F_l - \operatorname{quant}_{k-1}(F_l)\, \mathbf{1} \big\rVert_1} \qquad (5)$$
$$\text{subject to: } \quad F_{(i)} \in \Delta_k, \quad i = 1, \ldots, n, \qquad \text{(simplex constraints)}$$
where $S(f) = \lVert f - \operatorname{quant}_{k-1}(f)\, \mathbf{1} \rVert_1$ is the Lovász extension of $\hat{S}(C) = \min\{(k-1)|C|, |\overline{C}|\}$ and $\operatorname{quant}_{k-1}(f)$ is the $(k-1)$-quantile of $f \in \mathbb{R}^n$. Note that in their approach no membership constraints and no size constraints are present.
We now show that using simplex constraints alone in the optimization problem (3) is not sufficient
to guarantee that the solution F* can be rounded to a partition for any symmetric balancing function
in (1). For asymmetric balancing functions, as employed for the Asymmetric Ratio Cheeger Cut by
[13] in their relaxation (5), we can prove such a strong result only in the case where the graph is
disconnected. However, note that if the number of connected components of the graph is smaller than the
number of desired clusters k, the multi-cut problem is still non-trivial.
Theorem 2 Let $\hat{S}(C)$ be any non-negative symmetric balancing function. Then the continuous relaxation
$$\min_{\substack{F = (F_1, \ldots, F_k), \\ F \in \mathbb{R}^{n \times k}_+}} \; \sum_{l=1}^k \frac{\operatorname{TV}(F_l)}{S(F_l)} \qquad (6)$$
$$\text{subject to: } \quad F_{(i)} \in \Delta_k, \quad i = 1, \ldots, n, \qquad \text{(simplex constraints)}$$
of the balanced $k$-cut problem (1) is void in the sense that the optimal solution $F^*$ of the continuous problem can be constructed from the optimal solution of the 2-cut problem, and $F^*$ cannot be rounded into a $k$-way partition, see (4). If the graph is disconnected, then the same holds also for any non-negative asymmetric balancing function.
Proof: First, we derive a lower bound on the optimum of the continuous relaxation (6). Then we construct a feasible point for (6) that achieves this lower bound but cannot yield a partitioning, thus finishing the proof.

Let $(C^*, \overline{C^*})$, with $C^* = \operatorname*{arg\,min}_{C \subset V} \frac{\operatorname{cut}(C, \overline{C})}{\hat{S}(C)}$, be an optimal 2-way partition of the given graph. Using the exact relaxation result for the balanced 2-cut problem in Theorem 3.1 of [11], we have
$$\min_{F :\, F_{(i)} \in \Delta_k} \; \sum_{l=1}^k \frac{\operatorname{TV}(F_l)}{S(F_l)} \;\geq\; \sum_{l=1}^k \min_{f \in \mathbb{R}^n} \frac{\operatorname{TV}(f)}{S(f)} \;=\; \sum_{l=1}^k \min_{C \subset V} \frac{\operatorname{cut}(C, \overline{C})}{\hat{S}(C)} \;=\; k\, \frac{\operatorname{cut}(C^*, \overline{C^*})}{\hat{S}(C^*)}.$$
Now define $F_1 = \mathbf{1}_{C^*}$ and $F_l = \alpha_l \mathbf{1}_{\overline{C^*}}$, $l = 2, \ldots, k$, such that $\sum_{l=2}^k \alpha_l = 1$, $\alpha_l > 0$. Clearly $F = (F_1, \ldots, F_k)$ is feasible for problem (6), and the corresponding objective value is
$$\frac{\operatorname{TV}(\mathbf{1}_{C^*})}{S(\mathbf{1}_{C^*})} + \sum_{l=2}^k \frac{\alpha_l \operatorname{TV}(\mathbf{1}_{\overline{C^*}})}{\alpha_l S(\mathbf{1}_{\overline{C^*}})} \;=\; k\, \frac{\operatorname{cut}(C^*, \overline{C^*})}{\hat{S}(C^*)},$$
where we used the 1-homogeneity of TV and $S$ [14] and the symmetry of cut and $\hat{S}$. Thus the solution $F$ constructed as above from the 2-cut problem is indeed optimal for the continuous relaxation (6), and it is not possible to obtain a $k$-way partition from this solution as there will be $k - 2$ sets that are empty. Finally, the argument extends to asymmetric set functions if there exists a set $C$ such that $\operatorname{cut}(C, \overline{C}) = 0$, since in this case it does not matter that $\hat{S}(C) \neq \hat{S}(\overline{C})$ for the argument to hold.
The proof of Theorem 2 additionally shows that, for any balancing function, if the graph is disconnected the solution of the continuous relaxation (6) is always zero, while clearly the solution of the
balanced k-cut problem need not be zero. This shows that the relaxation can be arbitrarily bad in
this case. In fact, the relaxation for the asymmetric case can even fail if the graph is not disconnected
but there exists a very small cut of the graph, as the following corollary indicates.
[Figure 1: Toy example illustrating that the relaxation of [13] converges to a degenerate solution when applied to a graph with a dominating 2-cut. (a) 10NN-graph generated from three Gaussians in 10 dimensions; (b) continuous solution of (5) from [13] for k = 3; (c) rounding of the continuous solution of [13] does not yield a 3-partition; (d) continuous solution found by our method together with the vertices i ∈ I (black) where the membership constraint is enforced; our continuous solution already corresponds to a partition; (e) clustering found by rounding of our continuous solution (trivial as we have converged to a partition). In (b)-(e), data point i is colored according to $F_{(i)} \in \mathbb{R}^3$.]
Corollary 1 Let $\hat{S}$ be an asymmetric balancing function, let $C^* = \operatorname*{arg\,min}_{C \subset V} \frac{\operatorname{cut}(C, \overline{C})}{\hat{S}(C)}$, and suppose that
$$\phi^* := (k-1)\, \frac{\operatorname{cut}(C^*, \overline{C^*})}{\hat{S}(\overline{C^*})} + \frac{\operatorname{cut}(C^*, \overline{C^*})}{\hat{S}(C^*)} \;<\; \min_{(C_1, \ldots, C_k) \in \mathcal{P}_k} \; \sum_{i=1}^k \frac{\operatorname{cut}(C_i, \overline{C_i})}{\hat{S}(C_i)}.$$
Then there exists an $F$ feasible for (6), with $F_1 = \mathbf{1}_{C^*}$ and $F_l = \alpha_l \mathbf{1}_{\overline{C^*}}$, $l = 2, \ldots, k$, such that $\sum_{l=2}^k \alpha_l = 1$, $\alpha_l > 0$, which has objective $\sum_{i=1}^k \frac{\operatorname{TV}(F_i)}{S(F_i)} = \phi^*$ and which cannot be rounded to a $k$-way partition.
Pk
Proof: Let F1 = 1C ∗ and Fl = αl 1C ∗ , l = 2, . . . , k such that l=2 αl = 1, αl > 0. Clearly
F = (F1 , . . . , Fk ) is feasible for the problem (6) and the corresponding objective value is
k
X
TV(Fl )
l=1
S(Fl )
k
=
TV(1C ∗ ) X αl TV(1C ∗ )
+
S(1C ∗ )
αl S(1C ∗ )
l=2
∗
=
, C ∗)
cut(C
ˆ ∗)
S(C
+ (k − 1)
cut(C ∗ , C ∗ )
,
ˆ ∗)
S(C
where we used the 1-homogeneity of TV and S [14] and the symmetry of cut. This F cannot be
rounded into a k-way partition as there will be k − 2 sets that are empty.
Theorem 2 shows that the membership and size constraints which we have introduced in our relaxation (3) are essential to obtain a partition for symmetric balancing functions. For the asymmetric
balancing function, failure of the relaxation (6), and thus also of the relaxation (5) of [13], is guaranteed only for disconnected graphs. However, Corollary 1 indicates that degenerate solutions should
also be a problem when the graph is connected but a dominating cut exists. We illustrate
this with a toy example in Figure 1, where the algorithm of [13] for solving (5) fails as it converges
exactly to the solution predicted by Corollary 1 and thus produces only a 2-partition instead of the
desired 3-partition. The algorithm for our relaxation, enforcing membership constraints, converges to
a continuous solution which is in fact a partition matrix, so that no rounding is necessary.
4 Monotonic Descent Method for Minimization of a Sum of Ratios
Apart from the new relaxation, another key contribution of this paper is the derivation of an algorithm
which yields a sequence of feasible points for the difficult non-convex problem (3) and monotonically
reduces the corresponding objective. We would like to note that the algorithm proposed in
[13] for (5) does not yield monotonic descent. In fact, it is unclear what the guarantee derived for
the algorithm in [13] implies for the generated sequence. Moreover, our algorithm works for any
non-negative submodular balancing function.
The key insight in order to derive a monotonic descent method for solving the sum-of-ratios minimization problem (3) is to eliminate the ratios by introducing a new set of variables $\beta = (\beta_1, \ldots, \beta_k)$:
$$\min_{\substack{F = (F_1, \ldots, F_k),\; F \in \mathbb{R}^{n \times k}_+, \\ \beta \in \mathbb{R}^k_+}} \; \sum_{l=1}^k \beta_l \qquad (7)$$
$$\text{subject to: } \quad \operatorname{TV}(F_l) \leq \beta_l S(F_l), \quad l = 1, \ldots, k, \qquad \text{(descent constraints)}$$
$$F_{(i)} \in \Delta_k, \quad i = 1, \ldots, n, \qquad \text{(simplex constraints)}$$
$$\max\{F_{(i)}\} = 1, \quad \forall i \in I, \qquad \text{(membership constraints)}$$
$$S(F_l) \geq m, \quad l = 1, \ldots, k. \qquad \text{(size constraints)}$$
Note that for the optimal solution $(F^*, \beta^*)$ of this problem it holds that $\operatorname{TV}(F_l^*) = \beta_l^* S(F_l^*)$, $l = 1, \ldots, k$ (otherwise one could decrease $\beta_l^*$ and hence the objective), and thus equivalence holds. This is still a non-convex problem, as the descent, membership and size constraints are non-convex. Our algorithm now proceeds in a sequential manner. At each iterate we build a convex inner approximation of the constraint set, that is, the convex approximation is a subset of the non-convex constraint set, based on the current iterate $(F^t, \beta^t)$. Then we solve the resulting convex optimization problem and repeat the process. In this way we obtain a sequence of feasible points for the original problem (7) for which we will prove monotonic descent in the sum of ratios.
Convex approximation: As $\hat{S}$ is submodular, $S$ is convex. Let $s_l^t \in \partial S(F_l^t)$ be an element of the sub-differential of $S$ at the current iterate $F_l^t$. By Prop. 3.2 in [14] we have $(s_l^t)_{j_i} = \hat{S}(C^t_{l,i-1}) - \hat{S}(C^t_{l,i})$, where $j_i$ is the $i$-th smallest component of $F_l^t$ and $C^t_{l,i} = \{j \in V \mid (F_l^t)_j > (F_l^t)_i\}$. Moreover, using the definition of the subgradient, we have $S(F_l) \geq S(F_l^t) + \langle s_l^t, F_l - F_l^t \rangle = \langle s_l^t, F_l \rangle$.

For the descent constraints, let $\lambda_l^t = \frac{\operatorname{TV}(F_l^t)}{S(F_l^t)}$ and introduce new variables $\delta_l = \beta_l - \lambda_l^t$ that capture the amount of change in each ratio. We further decompose $\delta_l$ as $\delta_l = \delta_l^+ - \delta_l^-$, $\delta_l^+ \geq 0$, $\delta_l^- \geq 0$. Let $M = \max_{f \in [0,1]^n} S(f) = \max_{C \subset V} \hat{S}(C)$; then for $S(F_l) \geq m$,
$$\operatorname{TV}(F_l) - \beta_l S(F_l) \;\leq\; \operatorname{TV}(F_l) - \lambda_l^t \langle s_l^t, F_l \rangle - \delta_l^+ S(F_l) + \delta_l^- S(F_l) \;\leq\; \operatorname{TV}(F_l) - \lambda_l^t \langle s_l^t, F_l \rangle - \delta_l^+ m + \delta_l^- M.$$
Finally, note that because of the simplex constraints the membership constraints can be rewritten as $\max\{F_{(i)}\} \geq 1$. Let $i \in I$ and define $j_i := \operatorname*{arg\,max}_j F_{ij}^t$ (ties are broken randomly). Then the membership constraints can be relaxed as follows: $0 \geq 1 - \max\{F_{(i)}\} \geq 1 - F_{ij_i} \implies F_{ij_i} \geq 1$. As $F_{ij} \leq 1$, we get $F_{ij_i} = 1$. Thus the convex approximation of the membership constraints fixes the assignment of the $i$-th point to a cluster and can therefore be interpreted as a "label constraint". However, unlike in the transductive setting, the labels for the vertices in $I$ are automatically chosen by our method. The actual choice of the set $I$ will be discussed in Section 4.1. We use the notation $L = \{(i, j_i) \mid i \in I\}$ for the label set generated from $I$ (note that $L$ is fixed once $I$ is fixed).
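The subgradient $s_l^t$ can be computed in closed form by sorting $F_l^t$, following Prop. 3.2 in [14]; a small Python sketch of this computation (our illustration, reusing the threshold-set notation above):

```python
import numpy as np

def lovasz_subgradient(f, S_hat):
    """Subgradient s of the Lovasz extension S at f:
    (s)_{j_i} = S_hat(C_{i-1}) - S_hat(C_i), with C_i the threshold sets of f."""
    order = np.argsort(f)                        # j_1, ..., j_n: increasing order of f
    s = np.zeros(len(f))
    for pos, idx in enumerate(order):
        C_prev = set(order[pos:].tolist())       # C_{i-1}
        C_curr = set(order[pos + 1:].tolist())   # C_i
        s[idx] = S_hat(C_prev) - S_hat(C_curr)
    return s  # satisfies <s, f> = S(f) and S(g) >= <s, g> for all g

# e.g. for the Ratio Cut balancing S_hat(C) = |C| the subgradient is the all-ones vector
f = np.random.rand(6)
assert np.allclose(lovasz_subgradient(f, lambda C: len(C)), np.ones(6))
```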
Descent algorithm: Our descent algorithm for minimizing (7) solves at each iteration $t$ the following convex optimization problem (8):
$$\min_{\substack{F \in \mathbb{R}^{n \times k}_+, \\ \delta^+ \in \mathbb{R}^k_+,\; \delta^- \in \mathbb{R}^k_+}} \; \sum_{l=1}^k \delta_l^+ - \delta_l^- \qquad (8)$$
$$\text{subject to: } \quad \operatorname{TV}(F_l) \leq \lambda_l^t \langle s_l^t, F_l \rangle + \delta_l^+ m - \delta_l^- M, \quad l = 1, \ldots, k, \qquad \text{(descent constraints)}$$
$$F_{(i)} \in \Delta_k, \quad i = 1, \ldots, n, \qquad \text{(simplex constraints)}$$
$$F_{ij_i} = 1, \quad \forall (i, j_i) \in L, \qquad \text{(label constraints)}$$
$$\langle s_l^t, F_l \rangle \geq m, \quad l = 1, \ldots, k. \qquad \text{(size constraints)}$$
As its solution $F^{t+1}$ is feasible for (3), we update $\lambda_l^{t+1} = \frac{\operatorname{TV}(F_l^{t+1})}{S(F_l^{t+1})}$ and $s_l^{t+1} \in \partial S(F_l^{t+1})$, $l = 1, \ldots, k$, and repeat the process until the sequence terminates, that is, no further descent is possible (as the following theorem states), or until the relative descent in $\sum_{l=1}^k \lambda_l^t$ is smaller than a predefined $\epsilon$. The following Theorem 3 shows the monotonic descent property of our algorithm.
Theorem 3 The sequence $\{F^t\}$ produced by the above algorithm satisfies
$$\sum_{l=1}^k \frac{\operatorname{TV}(F_l^{t+1})}{S(F_l^{t+1})} \;<\; \sum_{l=1}^k \frac{\operatorname{TV}(F_l^{t})}{S(F_l^{t})}$$
for all $t \geq 0$, or the algorithm terminates.
Proof: Let $(F^{t+1}, \delta^{+,t+1}, \delta^{-,t+1})$ be the optimal solution of the inner problem (8). By the feasibility of $(F^{t+1}, \delta^{+,t+1}, \delta^{-,t+1})$ and $S(F_l^{t+1}) \geq m$,
$$\frac{\operatorname{TV}(F_l^{t+1})}{S(F_l^{t+1})} \;\leq\; \frac{\lambda_l^t \langle s_l^t, F_l^{t+1} \rangle + m\, \delta_l^{+,t+1} - M\, \delta_l^{-,t+1}}{S(F_l^{t+1})} \;\leq\; \lambda_l^t + \frac{m\, \delta_l^{+,t+1} - M\, \delta_l^{-,t+1}}{S(F_l^{t+1})} \;\leq\; \lambda_l^t + \delta_l^{+,t+1} - \delta_l^{-,t+1}.$$
Summing over all ratios, we have
$$\sum_{l=1}^k \frac{\operatorname{TV}(F_l^{t+1})}{S(F_l^{t+1})} \;\leq\; \sum_{l=1}^k \lambda_l^t + \sum_{l=1}^k \delta_l^{+,t+1} - \delta_l^{-,t+1}.$$
Noting that $\delta_l^+ = \delta_l^- = 0$, $F = F^t$ is feasible for (8), the optimal value $\sum_{l=1}^k \delta_l^{+,t+1} - \delta_l^{-,t+1}$ has to be either strictly negative, in which case we have strict descent
$$\sum_{l=1}^k \frac{\operatorname{TV}(F_l^{t+1})}{S(F_l^{t+1})} \;<\; \sum_{l=1}^k \lambda_l^t,$$
or the previous iterate $F^t$ together with $\delta_l^+ = \delta_l^- = 0$ is already optimal, and hence the algorithm terminates.
The inner problem (8) is convex but contains the non-smooth term TV in the constraints. We
eliminate the non-smoothness by introducing additional variables and derive an equivalent linear
programming (LP) formulation, which we solve via the PDHG algorithm [15, 16]. The LP and the
exact iterates are given below.
Lemma 1 The convex inner problem (8) is equivalent to the following linear optimization problem, where $E$ is the set of edges of the graph and $w \in \mathbb{R}^{|E|}$ are the edge weights:
$$\min_{\substack{F \in \mathbb{R}^{n \times k}_+,\; \alpha \in \mathbb{R}^{|E| \times k}_+, \\ \delta^+ \in \mathbb{R}^k_+,\; \delta^- \in \mathbb{R}^k_+}} \; \sum_{l=1}^k \delta_l^+ - \delta_l^- \qquad (9)$$
$$\text{subject to: } \quad \langle w, \alpha_l \rangle \leq \lambda_l^t \langle s_l^t, F_l \rangle + \delta_l^+ m - \delta_l^- M, \quad l = 1, \ldots, k, \qquad \text{(descent constraints)}$$
$$F_{(i)} \in \Delta_k, \quad i = 1, \ldots, n, \qquad \text{(simplex constraints)}$$
$$F_{ij_i} = 1, \quad \forall (i, j_i) \in L, \qquad \text{(label constraints)}$$
$$\langle s_l^t, F_l \rangle \geq m, \quad l = 1, \ldots, k, \qquad \text{(size constraints)}$$
$$-(\alpha_l)_{ij} \leq F_{il} - F_{jl} \leq (\alpha_l)_{ij}, \quad l = 1, \ldots, k, \; \forall (i, j) \in E.$$
Proof: We define new variables $\alpha_l \in \mathbb{R}^{|E|}$ for each column $l$ and introduce the constraints $(\alpha_l)_{ij} = |(F_l)_i - (F_l)_j|$, which allows us to rewrite $\operatorname{TV}(F_l)$ as $\langle w, \alpha_l \rangle$. These equality constraints can be replaced by the inequality constraints $(\alpha_l)_{ij} \geq |(F_l)_i - (F_l)_j|$ without changing the optimum of the problem, because at the optimum these constraints are active; otherwise one could decrease $(\alpha_l)_{ij}$ while remaining feasible, since $w$ is non-negative. Finally, these inequality constraints are rewritten using the fact that $|x| \leq y \Leftrightarrow -y \leq x \leq y$ for $y \geq 0$.
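To make the LP (9) concrete, here is a reference sketch (ours, not the paper's PDHG solver) that assembles the constraint matrices explicitly and solves the inner problem with scipy.optimize.linprog; it is only practical for small graphs:

```python
import numpy as np
from scipy.optimize import linprog

def inner_lp(W, edges, lam, s, m, M, labels, k):
    """Assemble and solve the inner LP (9) with scipy (illustrative, small graphs only).
    W: n x n weights; edges: list of (i, j); lam, s: lambda_l^t and subgradients s_l^t;
    labels: dict i -> j_i; m, M: min and max of the balancing function."""
    n, E = W.shape[0], len(edges)
    w = np.array([W[i, j] for i, j in edges])
    nF, nA = n * k, E * k                              # sizes of the F and alpha blocks
    nx = nF + nA + 2 * k                               # plus delta+ and delta-
    c = np.zeros(nx)
    c[nF + nA:nF + nA + k] = 1.0                       # objective: sum delta+ ...
    c[nF + nA + k:] = -1.0                             # ... minus sum delta-
    Fi = lambda i, l: l * n + i                        # flat index of F_{il}
    Ai = lambda e, l: nF + l * E + e                   # flat index of (alpha_l)_e
    A_ub, b_ub, A_eq, b_eq = [], [], [], []
    for l in range(k):
        row = np.zeros(nx)                             # descent constraint of column l
        row[[Ai(e, l) for e in range(E)]] = w
        row[[Fi(i, l) for i in range(n)]] = -lam[l] * s[l]
        row[nF + nA + l], row[nF + nA + k + l] = -m, M
        A_ub.append(row); b_ub.append(0.0)
        row = np.zeros(nx)                             # size constraint <s_l, F_l> >= m
        row[[Fi(i, l) for i in range(n)]] = -s[l]
        A_ub.append(row); b_ub.append(-m)
        for e, (i, j) in enumerate(edges):             # -alpha <= F_il - F_jl <= alpha
            for sgn in (1.0, -1.0):
                row = np.zeros(nx)
                row[Fi(i, l)], row[Fi(j, l)], row[Ai(e, l)] = sgn, -sgn, -1.0
                A_ub.append(row); b_ub.append(0.0)
    for i in range(n):                                 # simplex: each row of F sums to 1
        row = np.zeros(nx); row[[Fi(i, l) for l in range(k)]] = 1.0
        A_eq.append(row); b_eq.append(1.0)
    for i, ji in labels.items():                       # label constraints F_{i j_i} = 1
        row = np.zeros(nx); row[Fi(i, ji)] = 1.0
        A_eq.append(row); b_eq.append(1.0)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                  A_eq=np.array(A_eq), b_eq=b_eq, bounds=[(0, None)] * nx)
    return res.x[:nF].reshape(k, n).T, res             # F^{t+1} (n x k) and status
```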
4.0.1 Solving the LP via PDHG
Recently, first-order primal-dual hybrid gradient (PDHG) methods have been proposed [17, 15] to efficiently solve a class of convex optimization problems that can be rewritten as the following saddle-point problem:
$$\min_{x \in X} \max_{y \in Y} \; \langle Ax, y \rangle + G(x) - \Phi^*(y),$$
where $X$ and $Y$ are finite-dimensional vector spaces, $A : X \to Y$ is a linear operator, and $G$ and $\Phi^*$ are convex functions. It has been shown that the PDHG algorithm achieves good performance in solving huge linear programming problems that appear in computer vision applications. We now show how the linear programming problem
$$\min_{x \geq 0} \; \langle c, x \rangle \qquad \text{subject to: } A_1 x \leq b_1, \quad A_2 x = b_2$$
can be rewritten as a saddle-point problem so that PDHG can be applied.
By introducing the Lagrange multipliers $y$, the optimal value of the LP can be written as
$$\min_{x \geq 0} \; \langle c, x \rangle + \max_{y_1 \geq 0,\, y_2} \; \langle y_1, A_1 x - b_1 \rangle + \langle y_2, A_2 x - b_2 \rangle$$
$$= \min_{x} \max_{y_1, y_2} \; \langle c, x \rangle + \iota_{x \geq 0}(x) + \langle y_1, A_1 x \rangle + \langle y_2, A_2 x \rangle - \langle b_1, y_1 \rangle - \langle b_2, y_2 \rangle + \iota_{y_1 \geq 0}(y_1),$$
where $\iota_{\cdot \geq 0}$ is the indicator function that takes the value 0 on the non-negative orthant and $\infty$ elsewhere. Define $b = \binom{b_1}{b_2}$, $A = \binom{A_1}{A_2}$ and $y = \binom{y_1}{y_2}$. Then the saddle-point problem corresponding to the LP is given by
$$\min_{x} \max_{y_1, y_2} \; \langle c, x \rangle + \iota_{x \geq 0}(x) + \langle y, Ax \rangle - \langle b, y \rangle + \iota_{y_1 \geq 0}(y_1).$$
The primal and dual iterates for this saddle-point problem are given by
$$x^{r+1} = \max\{0,\; x^r - \tau (A^T y^r + c)\},$$
$$y_1^{r+1} = \max\{0,\; y_1^r + \sigma (A_1 \bar{x}^{r+1} - b_1)\},$$
$$y_2^{r+1} = y_2^r + \sigma (A_2 \bar{x}^{r+1} - b_2),$$
where $\bar{x}^{r+1} = 2 x^{r+1} - x^r$. Here the primal and dual step sizes $\tau$ and $\sigma$ are chosen such that $\tau \sigma \lVert A \rVert^2 < 1$, where $\lVert \cdot \rVert$ denotes the operator norm.
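A minimal numpy sketch of exactly these iterates for a generic LP (our illustration; scalar step sizes, a fixed iteration count, and no stopping criterion):

```python
import numpy as np

def pdhg_lp(c, A1, b1, A2, b2, iters=50000):
    """PDHG iterates for  min_{x >= 0} <c, x>  s.t.  A1 x <= b1,  A2 x = b2."""
    A = np.vstack([A1, A2])
    b = np.concatenate([b1, b2])
    m1 = A1.shape[0]                                  # number of inequality rows
    tau = sigma = 0.9 / np.linalg.norm(A, 2)          # ensures tau*sigma*||A||^2 < 1
    x = np.zeros(A.shape[1]); y = np.zeros(A.shape[0])
    for _ in range(iters):
        x_new = np.maximum(0.0, x - tau * (A.T @ y + c))   # primal step, keeps x >= 0
        x_bar = 2 * x_new - x                              # extrapolation
        y = y + sigma * (A @ x_bar - b)                    # dual ascent step
        y[:m1] = np.maximum(0.0, y[:m1])                   # project y1 >= 0 (inequalities)
        x = x_new
    return x, y

# toy LP: min x1 + 2*x2  s.t.  x1 + x2 = 1, x >= 0;  x should approach (1, 0)
x, _ = pdhg_lp(np.array([1.0, 2.0]), np.zeros((1, 2)), np.zeros(1),
               np.array([[1.0, 1.0]]), np.array([1.0]))
```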
Instead of the global step sizes $\tau$ and $\sigma$, we use in our implementation the diagonal preconditioning matrices introduced in [16], as this has been shown to improve the practical performance of PDHG. The diagonal elements of these preconditioning matrices $\tau$ and $\sigma$ are given by
$$\tau_j = \frac{1}{\sum_{i=1}^{n_r} |A_{ij}|}, \; \forall j \in \{1, \ldots, n_c\}, \qquad \sigma_i = \frac{1}{\sum_{j=1}^{n_c} |A_{ij}|}, \; \forall i \in \{1, \ldots, n_r\},$$
where $n_r$, $n_c$ are the number of rows and the number of columns of the matrix $A$.
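Continuing the sketch above, these step sizes are one line each in numpy (the small floor guarding against empty rows or columns is our addition):

```python
# diagonal preconditioners of [16]: column sums for tau, row sums for sigma
tau   = 1.0 / np.maximum(np.abs(A).sum(axis=0), 1e-12)   # one entry per primal variable
sigma = 1.0 / np.maximum(np.abs(A).sum(axis=1), 1e-12)   # one entry per dual variable
```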
For completeness, we now present the explicit form of the primal and dual iterates of the preconditioned PDHG for the LP (9). Let $\theta \in \mathbb{R}^k$, $\mu \in \mathbb{R}^n$, $\zeta \in \mathbb{R}^{|L|}$, $\nu \in \mathbb{R}^k$, $\eta_l \in \mathbb{R}^{|E|}$, $\xi_l \in \mathbb{R}^{|E|}$, $\forall l \in \{1, \ldots, k\}$, be the Lagrange multipliers corresponding to the descent, simplex, label, size and the two sets of additional constraints (introduced to eliminate the non-smoothness), respectively. Let $B : \mathbb{R}^{|E|} \to \mathbb{R}^{|V|}$ be the linear mapping defined as $(Bz)_i = \sum_{j : (i,j) \in E} z_{ij} - z_{ji}$, and let $\mathbf{1}_n \in \mathbb{R}^n$ denote the vector of all ones. Then the primal iterates for the LP (9) are given by
$$F_l^{r+1} = \max\Big\{0,\; F_l^r - \tau_{F,l} \big( (-\theta_l^r \lambda_l^t - \nu_l^r)\, s_l^t + \mu^r + Z_l^r + B(\eta_l^r - \xi_l^r) \big)\Big\}, \quad \forall l \in \{1, \ldots, k\},$$
$$\alpha_l^{r+1} = \max\big\{0,\; \alpha_l^r - \tau_{\alpha,l} \big( \theta_l^r w - \eta_l^r - \xi_l^r \big)\big\}, \quad \forall l \in \{1, \ldots, k\},$$
$$\delta^{+,r+1} = \max\big\{0,\; \delta^{+,r} - \tau_{\delta^+} \big( -m\, \theta^r + \mathbf{1}_k \big)\big\},$$
$$\delta^{-,r+1} = \max\big\{0,\; \delta^{-,r} - \tau_{\delta^-} \big( M\, \theta^r - \mathbf{1}_k \big)\big\},$$
where $Z_l^r \in \mathbb{R}^n$, $l = 1, \ldots, k$, is given by $(Z_l^r)_i = \zeta_{il}^r$ if $(i, l) \in L$ and 0 otherwise. Here $\tau_{F,l}, \tau_{\alpha,l}, \tau_{\delta^+}, \tau_{\delta^-}$ are the diagonal preconditioning matrices whose diagonal elements are given by
$$(\tau_{F,l})_i = \frac{1}{(1 + \lambda_l^t)\, |(s_l^t)_i| + 2 d_i + \rho_{il} + 1}, \quad \forall i \in \{1, \ldots, n\},$$
$$(\tau_{\alpha,l})_{ij} = \frac{1}{w_{ij} + 2}, \quad \forall (i, j) \in E, \qquad (\tau_{\delta^+})_l = \frac{1}{m}, \qquad (\tau_{\delta^-})_l = \frac{1}{M}, \quad \forall l \in \{1, \ldots, k\},$$
where $d_i$ is the number of vertices adjacent to the $i$-th vertex and $\rho_{il} = 1$ if $(i, l) \in L$ and 0 otherwise.
The dual iterates are given by
$$\theta_l^{r+1} = \max\Big\{0,\; \theta_l^r + \sigma_{\theta,l} \big( \langle w, \bar{\alpha}_l^{r+1} \rangle - \lambda_l^t \langle s_l^t, \bar{F}_l^{r+1} \rangle - m\, \bar{\delta}_l^{+,r+1} + M\, \bar{\delta}_l^{-,r+1} \big)\Big\}, \quad l = 1, \ldots, k,$$
$$\mu^{r+1} = \mu^r + \sigma_\mu \big( \bar{F}^{r+1} \mathbf{1}_k - \mathbf{1}_n \big),$$
$$\zeta_{il}^{r+1} = \zeta_{il}^r + \sigma_\zeta \big( \bar{F}_{il}^{r+1} - 1 \big), \quad \forall (i, l) \in L,$$
$$\nu_l^{r+1} = \max\big\{0,\; \nu_l^r + \sigma_{\nu,l} \big( -\langle s_l^t, \bar{F}_l^{r+1} \rangle + m \big)\big\}, \quad \forall l \in \{1, \ldots, k\},$$
$$\eta_l^{r+1} = \max\big\{0,\; \eta_l^r + \sigma_{\eta,l} \big( -\bar{\alpha}_l^{r+1} + \bar{F}_{il}^{r+1} - \bar{F}_{jl}^{r+1} \big)\big\}, \quad \forall l \in \{1, \ldots, k\},$$
$$\xi_l^{r+1} = \max\big\{0,\; \xi_l^r + \sigma_{\xi,l} \big( -\bar{\alpha}_l^{r+1} - \bar{F}_{il}^{r+1} + \bar{F}_{jl}^{r+1} \big)\big\}, \quad \forall l \in \{1, \ldots, k\},$$
where
$$\sigma_{\theta,l} = \frac{1}{\langle w, \mathbf{1} \rangle + \lambda_l^t \sum_{i=1}^n |(s_l^t)_i| + m + M}, \qquad \sigma_\zeta = 1, \qquad \sigma_{\nu,l} = \frac{1}{\sum_{i=1}^n |(s_l^t)_i|},$$
and $\sigma_\mu$, $\sigma_{\eta,l}$, $\sigma_{\xi,l}$ are the diagonal preconditioning matrices whose diagonal elements are given by
$$(\sigma_\mu)_i = \frac{1}{k}, \quad \forall i \in \{1, \ldots, n\}, \qquad (\sigma_{\eta,l})_{ij} = (\sigma_{\xi,l})_{ij} = \frac{1}{3}, \quad \forall (i, j) \in E.$$
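The mapping B above is a signed edge-incidence operator; assuming an edge list `edges` in which each undirected edge (i, j) appears once, it can be materialized in a few lines (our notation, dense for clarity):

```python
import numpy as np

def incidence(n, edges):
    """Signed incidence B with (B z)_i = sum_{(i,j) in E} z_ij - z_ji."""
    B = np.zeros((n, len(edges)))
    for e, (i, j) in enumerate(edges):
        B[i, e], B[j, e] = 1.0, -1.0
    return B
```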
From the iterates one sees that the computational cost per iteration is O(|E|). In our implementation, we further reformulated the LP (9) by directly integrating the label constraints, thereby reducing
the problem size and eliminating the dual variable ζ.
4.1 Choice of membership constraints I
The overall algorithm scheme for solving problem (1) is given in Algorithm 1 below. For the membership constraints we start initially with $I^0 = \emptyset$ and sequentially solve the inner problem (8). From its solution $F^{t+1}$ we construct a $P_k' = (C_1, \ldots, C_k)$ via rounding, see (4). We repeat this process until we either do not improve the resulting balanced $k$-cut or $P_k'$ is not a partition. In this case we update $I^{t+1}$ and double the number of membership constraints. Let $(C_1^*, \ldots, C_k^*)$ be the currently optimal partition. For each $l \in \{1, \ldots, k\}$ and $i \in C_l^*$ we compute
$$b_{li}^* = \frac{\operatorname{cut}\big(C_l^* \setminus \{i\},\; \overline{C_l^*} \cup \{i\}\big)}{\hat{S}(C_l^* \setminus \{i\})} + \min_{s \neq l} \frac{\operatorname{cut}\big(C_s^* \cup \{i\},\; \overline{C_s^*} \setminus \{i\}\big)}{\hat{S}(C_s^* \cup \{i\})} \qquad (10)$$
and define $O_l = (\pi_1, \ldots, \pi_{|C_l^*|})$ with $b_{l\pi_1}^* \geq b_{l\pi_2}^* \geq \ldots \geq b_{l\pi_{|C_l^*|}}^*$. The top-ranked vertices in $O_l$ correspond to the ones which lead to the largest minimal increase in BCut when moved from $C_l^*$ to another component and are thus most likely to belong to their current component. Thus it is natural to fix the top-ranked vertices for each component first. Note that the rankings $O_l$, $l = 1, \ldots, k$, are updated whenever a better partition is found. Thus the membership constraints always correspond to the vertices which lead to the largest minimal increase in BCut when moved to another component. In Figure 1 one can observe that the fixed labeled points lie close to the centers of the found clusters. The number of membership constraints depends on the graph: the better separated the clusters are, the fewer membership constraints need to be enforced in order to avoid degenerate solutions. Finally, we stop the algorithm if we see no further improvement in the cut or the continuous objective and the continuous solution corresponds to a partition.
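A sketch of how the scores (10) and the top-ranked vertices could be computed (our illustration; it reuses the `cut` helper and a `balance` function from the earlier sketch, assumes a dense W, and assumes |C_l| > 1 so that the denominators stay positive):

```python
import numpy as np

def membership_scores(W, partition, balance):
    """b*_{li} of eq. (10): minimal increase in BCut when vertex i leaves C_l."""
    n = W.shape[0]
    scores = {}
    for l, C in enumerate(partition):
        for i in C:
            C_minus = [v for v in C if v != i]
            stay = cut(W, C_minus) / balance(len(C_minus), n)
            move = min(cut(W, list(Cs) + [i]) / balance(len(Cs) + 1, n)
                       for s, Cs in enumerate(partition) if s != l)
            scores[(l, i)] = stay + move
    return scores

def top_ranked(scores, l, p):
    """The p top-ranked vertices O_l^p of component l (largest b*_{li} first)."""
    ranked = sorted(((b, i) for (ll, i), b in scores.items() if ll == l), reverse=True)
    return [i for _, i in ranked[:p]]
```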
Algorithm 1 for solving (1)
1:  Initialization: $F^0 \in \mathbb{R}^{n \times k}_+$ such that $F^0 \mathbf{1}_k = \mathbf{1}_n$, $\lambda_l^0 = \frac{\operatorname{TV}(F_l^0)}{S(F_l^0)}$, $l = 1, \ldots, k$, $\gamma^0 = \sum_{l=1}^k \lambda_l^0$, $I^0 = \emptyset$, $L = \emptyset$, $p = 0$
2:  Output: partition $(C_1^*, \ldots, C_k^*)$
3:  repeat
4:    let $(F^{t+1}, \delta^{+,t+1}, \delta^{-,t+1})$ be the optimal solution of the inner problem (8)
5:    $\lambda_l^{t+1} = \frac{\operatorname{TV}(F_l^{t+1})}{S(F_l^{t+1})}$, $l = 1, \ldots, k$, $\gamma^{t+1} = \sum_{l=1}^k \lambda_l^{t+1}$
6:    $\chi^{t+1} = \sum_{l=1}^k \frac{\operatorname{cut}(C_l^{t+1}, \overline{C_l^{t+1}})}{\hat{S}(C_l^{t+1})}$, where $(C_1^{t+1}, \ldots, C_k^{t+1})$ is obtained from $F^{t+1}$ via rounding (4)
7:    if $\chi^{t+1} < \chi^t$ and $(C_1^{t+1}, \ldots, C_k^{t+1})$ is a $k$-partition then
8:      $(C_1^*, \ldots, C_k^*) = (C_1^{t+1}, \ldots, C_k^{t+1})$
9:      compute new orderings $O_l$, $\forall l = 1, \ldots, k$, for $(C_1^*, \ldots, C_k^*)$ according to (10)
10:     $I^{t+1} = \cup_{l=1}^k O_l^p$, where $O_l^p$ denotes the $p$ top-ranked vertices in $O_l$
11:     $L = \{(i, j_i) \mid i \in I^{t+1},\; j_i = \operatorname{arg\,max}_j F_{ij}^{t+1}\}$
12:   else
13:     $p = \max\{2 |I^t|, 1\}$ (double the number of membership constraints)
14:     $I^{t+1} = \cup_{l=1}^k O_l^p$, where $O_l^p$ denotes the $p$ top-ranked vertices in $O_l$
15:     $L = \{(i, j_i) \mid i \in I^{t+1},\; j_i = \operatorname{arg\,max}_j F_{ij}^{t}\}$
16:     $F^{t+1} = F^t$, then set $F_{ij}^{t+1} = 0$, $\forall i \in I^{t+1}$, $\forall j \in \{1, \ldots, k\}$, and $F_{i j_i}^{t+1} = 1$, $\forall (i, j_i) \in L$
17:     $\lambda_l^{t+1} = \frac{\operatorname{TV}(F_l^{t+1})}{S(F_l^{t+1})}$, $l = 1, \ldots, k$
18:   end if
19: until $\chi^{t+1} = \sum_{l=1}^k \lambda_l^{t+1}$ and $\gamma^{t+1} = \gamma^t$
5 Experiments
We evaluate our method against a diverse selection of state-of-the-art clustering methods: spectral clustering (Spec) [7], BSpec [11], Graclus¹ [6], the NMF-based approaches PNMF [19], NSC [20], ONMF [21], LSD [22] and NMFR [23], and MTV [13], which optimizes (5). We used the publicly available code [23, 13] with default settings. We run our method with 5 random initializations and 7 initializations based on the spectral clustering solution, similar to [13] (who use 30 such initializations). In addition to the datasets provided in [13], we also selected a variety of datasets from the UCI repository, shown below. For all the datasets not in [13], symmetric k-NN graphs are built with Gaussian weights $\exp\big( - \frac{s \lVert x - y \rVert^2}{\min\{\sigma_{x,k}^2,\, \sigma_{y,k}^2\}} \big)$, where $\sigma_{x,k}$ is the k-NN distance of point $x$. We chose the parameters $s$ and $k$ in a method-independent way by testing for each dataset several graphs using all the methods over different choices of $k \in \{3, 5, 7, 10, 15, 20, 40, 60, 80, 100\}$ and $s \in \{0.1, 1, 4\}$. The best choice in terms of the clustering error across all methods and datasets is $s = 1$, $k = 15$.
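A sketch of this graph construction (our illustration using plain numpy; for large datasets a k-d tree or scikit-learn's nearest-neighbor routines would be the practical choice):

```python
import numpy as np

def knn_graph(X, k=15, s=1.0):
    """Symmetric k-NN graph with weights exp(-s ||x-y||^2 / min(sigma_x^2, sigma_y^2)),
    where sigma_x is the distance of x to its k-th nearest neighbor."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    np.fill_diagonal(D, np.inf)
    idx = np.argsort(D, axis=1)[:, :k]                          # k nearest neighbors
    sigma = np.take_along_axis(D, idx[:, -1:], axis=1).ravel()  # k-NN distance sigma_x
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in idx[i]:
            w = np.exp(-s * D[i, j] ** 2 / min(sigma[i] ** 2, sigma[j] ** 2))
            W[i, j] = W[j, i] = w                               # symmetrize: union of edges
    return W
```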
            Iris  wine  vertebral  ecoli  4moons  webkb4  optdigits  USPS  pendigits  20news  MNIST
# vertices   150   178        310    336    4000    4196       5620  9298      10992   19928  70000
# classes      3     3          3      6       4       4         10    10         10      20     10
Quantitative results: In our first experiment we evaluate our method in terms of solving the balanced k-cut problem for various balancing functions, datasets and graph parameters. The following table reports the fraction of times a method achieves the best as well as the strictly best balanced k-cut over all constructed graphs and datasets (in total 30 graphs per dataset). For reference, we also report the obtained cuts for other clustering methods, although they do not directly minimize this criterion; methods that directly optimize the criterion are shown in normal font, the others in italic. Our algorithm can handle all balancing functions and significantly outperforms all other methods across all criteria. For the ratio and normalized cut cases we achieve better results than [7, 11, 6], which directly optimize this criterion. This shows that greedy recursive bi-partitioning badly affects the performance of [11], which otherwise was shown to obtain the best cuts on several benchmark datasets [24]. This further shows the need for methods that directly minimize the multi-cut. It is striking that the competing method of [13], which directly minimizes the asymmetric ratio cut, is beaten significantly by Graclus as well as by our method. As this clear trend is less visible in the qualitative experiments, we suspect that extreme graph parameters lead to fast convergence to a degenerate solution.

¹ Since [6], a multi-level algorithm directly minimizing Rcut/Ncut, has been shown to be superior to METIS [18], we do not compare with [18].
                             Ours    MTV  BSpec   Spec  Graclus  PNMF   NSC  ONMF   LSD  NMFR
RCC-asym  Best (%)          80.54  25.50  23.49   7.38    38.26  2.01  5.37  2.01  4.03  1.34
          Strictly Best (%) 44.97  10.74   1.34   0.00     4.70  0.00  0.00  0.00  0.00  0.00
RCC-sym   Best (%)          94.63   8.72  19.46   6.71    37.58  0.67  4.03  0.00  0.67  0.67
          Strictly Best (%) 61.74   0.00   0.67   0.00     4.70  0.00  0.00  0.00  0.00  0.00
NCC-asym  Best (%)          93.29  13.42  20.13  10.07    38.26  0.67  5.37  2.01  4.70  2.01
          Strictly Best (%) 56.38   2.01   0.00   0.00     2.01  0.00  0.00  0.67  0.00  1.34
NCC-sym   Best (%)          98.66  10.07  20.81   9.40    40.27  1.34  4.03  0.67  3.36  1.34
          Strictly Best (%) 59.06   0.00   0.00   0.00     1.34  0.00  0.00  0.00  0.00  0.00
Rcut      Best (%)          85.91   7.38  20.13  10.07    32.89  0.67  4.03  0.00  1.34  1.34
          Strictly Best (%) 58.39   0.00   2.68   2.01     8.72  0.00  0.00  0.00  0.00  0.67
Ncut      Best (%)          95.97  10.07  20.13   9.40    37.58  1.34  4.70  0.67  3.36  0.67
          Strictly Best (%) 61.07   0.00   0.00   0.00     4.03  0.00  0.00  0.00  0.00  0.00
Qualitative results: In the following table we report the clustering errors and the balanced k-cuts
obtained by all methods using the graphs built with k = 15, s = 1 for all datasets. As the main goal
is to compare to [13], we choose their balancing function (RCC-asym). Again, our method always
achieved the best cuts across all datasets. In three cases, the best cut also corresponds to the best
clustering performance. In the case of vertebral, 20news, and webkb4 the best cuts actually result in
high errors. However, we see in our next experiment that integrating ground-truth label information
helps in these cases to improve the clustering performance significantly.
                 Iris   wine  vertebral  ecoli  4moons  webkb4  optdigits   USPS  pendigits  20news  MNIST
BSpec   Err(%)  23.33  37.64      50.00  19.35   36.33   60.46      11.30  20.09      17.59   84.21  11.82
        BCut    1.495  6.417      1.890  2.550   0.634   1.056      0.386  0.822      0.081   0.966  0.471
Spec    Err(%)  22.00  20.22      48.71  14.88   31.45   60.32       7.81  21.05      16.75   79.10  22.83
        BCut    1.783  5.820      1.950  2.759   0.917   1.520      0.442  0.873      0.141   1.170  0.707
PNMF    Err(%)  22.67  27.53      50.00  16.37   35.23   60.94      10.37  24.07      17.93   66.00  12.80
        BCut    1.508  4.916      2.250  2.652   0.737   3.520      0.548  1.180      0.415   2.924  0.934
NSC     Err(%)  23.33  17.98      50.00  14.88   32.05   59.49       8.24  20.53      19.81   78.86  21.27
        BCut    1.518  5.140      2.046  2.754   0.933   3.566      0.482  0.850      0.101   2.233  0.688
ONMF    Err(%)  23.33  28.09      50.65  16.07   35.35   60.94      10.37  24.14      22.82   69.02  27.27
        BCut    1.518  4.881      2.371  2.633   0.725   3.621      0.548  1.183      0.548   3.058  1.575
LSD     Err(%)  23.33  17.98      39.03  18.45   35.68   47.93       8.42  22.68      13.90   67.81  24.49
        BCut    1.518  5.399      2.557  2.523   0.782   2.082      0.483  0.918      0.188   2.056  0.959
NMFR    Err(%)  22.00  11.24      38.06  22.92   36.33   40.73       2.08  22.17      13.13   39.97   fail
        BCut    1.627  4.318      2.713  2.556   0.840   1.467      0.369  0.992      0.240   1.241      -
Graclus Err(%)  23.33   8.43      49.68  16.37    0.45   39.97       1.67  19.75      10.93   60.69   2.43
        BCut    1.534  4.293      1.890  2.414   0.589   1.581      0.350  0.815      0.092   1.431  0.440
MTV     Err(%)  22.67  18.54      34.52  22.02    7.72   48.40       4.11  15.13      20.55   72.18   3.77
        BCut    1.508  5.556      2.433  2.500   0.774   2.346      0.374  0.940      0.193   3.291  0.458
Ours    Err(%)  23.33   6.74      50.00  16.96    0.45   60.46       1.71  19.72      19.95   79.51   2.37
        BCut    1.495  4.168      1.890  2.399   0.589   1.056      0.350  0.802      0.079   0.895  0.439
Transductive setting: We evaluate our method against [13] in a transductive setting. As in [13], we
randomly sample either one label or a fixed percentage of labels per class from the ground truth. We
report clustering errors and the cuts (RCC-asym) for both methods for different choices of labels.
For the label experiments their initialization strategy seems to work better, as the cuts improve compared
to the unlabeled case. However, observe that in some cases their method seems to fail completely
(Iris and 4moons for one label per class).
Labels              Iris   wine  vertebral  ecoli  4moons  webkb4  optdigits   USPS  pendigits  20news  MNIST
1    MTV   Err(%)  33.33   9.55      42.26  13.99   35.75   51.98       1.69  12.91      14.49   50.96   2.45
           BCut    3.855  4.288      2.244  2.430   0.723   1.596      0.352  0.846      0.127   1.286  0.439
     Ours  Err(%)  22.67   8.99      50.32  15.48    0.57   45.11       1.69  12.98      10.98   68.53   2.36
           BCut    1.571  4.234      2.265  2.432   0.610   1.471      0.352  0.812      0.113   1.057  0.439
1%   MTV   Err(%)  33.33  10.67      39.03  14.29    0.45   48.38       1.67   5.21       7.75   40.18   2.41
           BCut    3.855  4.277      2.300  2.429   0.589   1.584      0.354  0.789      0.129   1.208  0.443
     Ours  Err(%)  22.67   6.18      41.29  13.99    0.45   41.63       1.67   5.13       7.75   37.42   2.33
           BCut    1.571  4.220      2.288  2.419   0.589   1.462      0.354  0.789      0.128   1.157  0.442
5%   MTV   Err(%)  17.33   7.87      40.65  14.58    0.45   40.09       1.51   4.85       1.79   31.89   2.18
           BCut    1.685  4.330      2.701  2.462   0.589   1.763      0.369  0.812      0.188   1.254  0.455
     Ours  Err(%)  17.33   6.74      37.10  13.99    0.45   38.04       1.53   4.85       1.76   30.07   2.18
           BCut    1.685  4.224      2.724  2.461   0.589   1.719      0.369  0.811      0.188   1.210  0.455
10%  MTV   Err(%)  18.67   7.30      39.03  13.39    0.38   40.63       1.41   4.19       1.24   27.80   2.03
           BCut    1.954  4.332      3.187  2.776   0.592   2.057      0.377  0.833      0.197   1.346  0.465
     Ours  Err(%)  14.67   6.74      33.87  13.10    0.38   41.97       1.41   4.25       1.24   26.55   2.02
           BCut    1.960  4.194      3.134  2.778   0.592   1.972      0.377  0.833      0.197   1.314  0.465
6 Conclusion

We presented a framework for directly minimizing the balanced k-cut problem based on a new tight
continuous relaxation. Apart from the standard ratio/normalized cut, our method can also handle
new, application-specific balancing functions. Moreover, in contrast to a recursive splitting approach
[25], our method enables the direct integration of prior information available in the form of must/cannot-link
constraints, which is an interesting topic for future research. Finally, the monotonic descent
algorithm proposed for the difficult sum-of-ratios problem is another key contribution of the paper
and is of independent interest.
References
[1] W. E. Donath and A. J. Hoffman. Lower bounds for the partitioning of graphs. IBM J. Res. Develop.,
17:420–425, 1973.
[2] A. Pothen, H. D. Simon, and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM
J. Matrix Anal. Appl., 11(3):430–452, 1990.
[3] L. Hagen and A. B. Kahng. Fast spectral methods for ratio cut partitioning and clustering. In ICCAD,
pages 10–13, 1991.
[4] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell.,
22:888–905, 2000.
[5] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages
849–856, 2001.
[6] I. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: A multilevel approach.
IEEE Trans. Pattern Anal. Mach. Intell., pages 1944–1957, 2007.
[7] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395–416, 2007.
[8] S. Guattery and G. Miller. On the quality of spectral separators. SIAM J. Matrix Anal. Appl., 19:701–719,
1998.
[9] A. Szlam and X. Bresson. Total variation and Cheeger cuts. In ICML, pages 1039–1046, 2010.
[10] M. Hein and T. Bühler. An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In NIPS, pages 847–855, 2010.
[11] M. Hein and S. Setzer. Beyond spectral clustering - tight relaxations of balanced graph cuts. In NIPS,
pages 2366–2374, 2011.
[12] X. Bresson, T. Laurent, D. Uminsky, and J. H. von Brecht. Convergence and energy landscape for Cheeger
cut clustering. In NIPS, pages 1394–1402, 2012.
[13] X. Bresson, T. Laurent, D. Uminsky, and J. H. von Brecht. Multiclass total variation clustering. In NIPS,
pages 1421–1429, 2013.
[14] F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and
Trends in Machine Learning, 6(2-3):145–373, 2013.
[15] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to
imaging. J. of Math. Imaging and Vision, 40:120–145, 2011.
[16] T. Pock and A. Chambolle. Diagonal preconditioning for first order primal-dual algorithms in convex
optimization. In ICCV, pages 1762–1769, 2011.
[17] E. Esser, X. Zhang, and T. F. Chan. A general framework for a class of first order primal-dual algorithms
for convex optimization in imaging science. SIAM J. on Imaging Sciences, 3(4):1015–1046, 2010.
[18] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs.
SIAM J. Sci. Comput., 20(1):359–392, 1998.
[19] Z. Yang and E. Oja. Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions
on Neural Networks, 21(5):734–749, 2010.
[20] C. Ding, T. Li, and M. I. Jordan. Nonnegative matrix factorization for combinatorial optimization: Spectral clustering, graph matching, and clique finding. In ICDM, pages 183–192, 2008.
[21] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In
KDD, pages 126–135, 2006.
[22] R. Arora, M. R. Gupta, A. Kapila, and M. Fazel. Clustering by left-stochastic matrix factorization. In
ICML, pages 761–768, 2011.
[23] Z. Yang, T. Hao, O. Dikmen, X. Chen, and E. Oja. Clustering by nonnegative matrix factorization using
graph random walk. In NIPS, pages 1088–1096, 2012.
[24] A. J. Soper, C. Walshaw, and M. Cross. A combined evolutionary search and multilevel optimisation
approach to graph-partitioning. J. of Global Optimization, 29(2):225–241, 2004.
[25] S. S. Rangapuram and M. Hein. Constrained 1-spectral clustering. In AISTATS, pages 1143–1151, 2012.