What is a Cluster?
Perspectives from Game Theory
Marcello Pelillo and Samuel Rota Bulò
Department of Computer Science
Ca’ Foscari University, Venice, Italy
{pelillo,srotabul}@dsi.unive.it
Abstract. Contrary to the vast majority of approaches to clustering,
which view the problem as one of partitioning a set of observations into
coherent classes, thereby obtaining the clusters as a by-product of the
partitioning process, we propose to reverse the terms of the problem and
attempt instead to derive a rigorous formulation of the very notion of a
cluster. In our endeavor to provide an answer to this question, we found
that game theory offers a very elegant and general perspective that serves
well our purposes. Accordingly, we formulate the clustering problem as
a non-cooperative clustering game. Within this context, the notion of a
cluster turns out to be equivalent to a classical equilibrium concept from
(evolutionary) game theory.
1 Motivations
There is no shortage of clustering algorithms, and recently a new wave of excitement has spread across the machine learning community mainly because of
the important development of spectral methods. At the same time, there is also
growing interest around fundamental questions pertaining to the very nature of
the clustering problem [1–3]. Yet, despite the tremendous progress in the field,
the clustering problem remains elusive and a satisfactory answer even to the
most basic questions is still to come.
Upon scrutinizing the relevant literature on the subject, it becomes apparent
that the vast majority of the existing approaches deal with a very specific version
of the problem, which asks for partitioning the input data into coherent classes.
In fact, almost invariably, the problem of clustering is defined as a partitioning
problem, and even the classical distinction between hierarchical and partitional
algorithms [4] seems to suggest the idea that partitioning data is, in essence,
what clustering is all about (as hierarchies are but nested partitions). This is
unfortunate, because it has drawn the community’s attention away from different, and more general, variants of the problem and has led people to neglect
underdeveloped foundational issues. As J. Hartigan clearly put it more than a
decade ago: “We pay too much attention to the details of algorithms. [...] We
must begin to subordinate engineering to philosophy.” [5, p. 3].
The partitional paradigm (as we will call it, following Kuhn) is attractive as
it leads to elegant mathematical and algorithmic treatments and allows us to employ powerful ideas from such sophisticated fields as linear algebra, graph theory,
optimization, statistics, information theory, etc. However, there are several (far
too often neglected) reasons for feeling uncomfortable with this oversimplified
formulation. Probably the best-known limitation of the partitional approach is
the typical (algorithmic) requirement that the number of clusters be known in
advance, but there is more than that.
To begin, the very idea of a partition implies that all the input data will have
to get assigned to some class. This subsumes the old philosophical view which
gives categories an a priori ontological status, namely that they exist independent of human experience, a view which has now been discredited by cognitive
scientists, linguists, philosophers, and machine learning researchers alike (see,
e.g., [6–8]). Further, there are various applications for which it makes little sense
to force all data items to belong to some group, a process which might result
either in poorly-coherent clusters or in the creation of extra spurious classes.
As an extreme example, consider the classical figure/ground separation problem in computer vision which asks for extracting a coherent region (the figure)
from a noisy background [9, 10]. It is clear that, due to their intrinsic nature,
partitional algorithms have no chance of satisfactorily solving this problem, being, as they are, explicitly designed to partition all the input data, and hence
the unstructured clutter items too, into coherent groups. More recently, motivated by practical applications arising in document retrieval and bioinformatics,
a conceptually identical problem has attracted some attention within the machine learning community and is generally known under the name of one-class
clustering [11, 12].
The second intrinsic limitation of the partitional paradigm is even more severe as it imposes that each element cannot belong to more than one cluster.
There are a variety of important applications, however, where this requirement is
too restrictive. Examples abound and include, e.g., clustering micro-array gene
expression data (wherein a gene often participates in more than one process),
clustering documents into topic categories, perceptual grouping, and segmentation of images with transparent surfaces. In fact, the importance of dealing
with overlapping clusters has been recognized long ago [13] and recently, in the
machine learning community, there has been a renewed interest around this
problem [14, 15]. Typically, this is solved by relaxing the constraints imposed by
crisp partitions in such a way as to have “soft” boundaries between clusters.
Finally, we would like to mention another limitation of current state-of-the-art approaches to clustering which, admittedly, is not caused in any direct way
by the partitioning assumption but, rather, by the intrinsic nature of the technical tools typically used to attack the problem. This is the symmetry assumption,
namely the requirement that the similarities between the data being clustered
be symmetric (and non-negative). Indeed, since Tversky’s classical work [16],
it is widely recognized by psychologists that similarity is an asymmetric relation. Further, there are many practical applications where asymmetric (or, more
generally, non-metric) similarities do arise quite naturally. For example, such
(dis)similarity measures are typically derived when images, shapes or sequences
are aligned in a template matching process. In image and video processing, these
measures are preferred in the presence of partially occluded objects [17]. Other
examples include pairwise structural alignments of proteins that focus on local
similarity [18], variants of the Hausdorff distance [19], normalized edit-distances,
and probabilistic measures such as the Kullback-Leibler divergence. A common
method to deal with asymmetric affinities is simply to symmetrize them, but in
so doing we might lose important information that resides in the asymmetry. As
argued in [17], the violation of metricity is often not an artifact of poor choice of
features or algorithms, but it is inherent in the problem of robust matching when
different parts of objects (shapes) are matched to different images. The same argument may hold for any type of local alignments. Corrections or simplifications
of the original affinity matrix may therefore destroy essential information.
Although probabilistic model-based approaches do not suffer from several of
the limitations mentioned above, here we will suggest an alternative strategy.
Instead of insisting on the idea of determining a partition of the input data,
and hence obtaining the clusters as a by-product of the partitioning process, we
propose to reverse the terms of the problem and attempt instead to derive a rigorous formulation of the very notion of a cluster. Clearly, the conceptual question
“what is a cluster?” is as hopeless, in its full generality, as is its companion “what
is an optimal clustering?” which has dominated the literature in the past few
decades, both being two sides of the same coin. An attempt to answer the former
question, however, besides shedding fresh light into the nature of the clustering
problem, would allow us, as a consequence, to naturally overcome the major
limitations of the partitional approach alluded to above, and to deal with more
general problems where, e.g., clusters may overlap and clutter elements may get
unassigned, thereby hopefully reducing the gap between theory and practice.
In our endeavor to provide an answer to the question raised above, we found
that game theory offers a very elegant and general perspective that serves well
our purposes [20–22]. The starting point is the elementary observation that a
“cluster” may be informally defined as a maximally coherent set of data items,
i.e., as a subset of the input data C which satisfies both an internal criterion
(all elements belonging to C should be highly similar to each other) and an
external one (no larger cluster should contain C as a proper subset). We then
formulate the clustering problem as a non-cooperative clustering game. Within
this context, the notion of a cluster turns out to be equivalent to a classical
equilibrium concept from (evolutionary) game theory, as the latter reflects both
the internal and external cluster conditions mentioned above.
2 The clustering game
The (pairwise) clustering problem can be formulated as the following (two-player) game. Assume a pre-existing set of objects O and a (possibly asymmetric
and even negative) matrix of affinities A between the elements of O. Two players with complete knowledge of the setup play by simultaneously selecting an
element of O. After both have shown their choice, each player receives a payoff,
monetary or otherwise, proportional to the affinity that the chosen element has
with respect to the element chosen by the opponent, except in the case when
the selected objects coincide, which leads to zero reward. Clearly, it is in each
player’s interest to pick an element that is strongly supported by the elements
that the adversary is likely to choose. As an example, let us assume that our
clustering problem is one of figure/ground discrimination, that is, the objects
in O consist of a cohesive group with high mutual affinity (figure) and of non-structured noise (ground). Being non-structured, the noise gives equal average affinity to elements of the figure as to elements of the ground. Informally, assuming no prior knowledge of the inclination of the adversary, a player will be better off selecting elements of the figure rather than of the ground.
The clustering game is thus a two-player symmetric game, where O = {1, . . . , n} is the set of pure strategies and A is the payoff matrix. Mixed strategies will be elements of the standard simplex ∆ of R^n, which is defined as

    ∆ = { x ∈ R^n : Σ_{i=1}^{n} x_i = 1, x_i ≥ 0 for all i = 1, . . . , n } .
The support σ(x) of a mixed strategy x is the set σ(x) = {i ∈ O : x_i > 0}. The average payoff that a player selecting pure strategy i ∈ O obtains against an opponent playing mixed strategy x ∈ ∆ is (Ax)_i = Σ_{j∈O} a_ij x_j, while the average payoff that he obtains by playing mixed strategy y ∈ ∆ is y^T Ax = Σ_{j∈O} y_j (Ax)_j. The set of best replies β(x) to a mixed strategy x is given by β(x) = argmax_{y∈∆} y^T Ax, while the set of pure best replies to x is given by Ω(x) = {i ∈ O : e_i ∈ β(x)}, where e_i is the i-th column of the n × n identity matrix. If x ∈ β(x) then x is said to be a Nash equilibrium. It is straightforward to verify that if x ∈ ∆ is a Nash equilibrium then (Ax)_i ≤ x^T Ax for all i ∈ O, with equality if i ∈ σ(x).
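The Nash condition above is easy to check numerically. The following minimal sketch (assuming NumPy; the 3-object affinity matrix is a made-up illustration, not taken from the paper) verifies that (Ax)_i ≤ x^T Ax for all i, with equality on the support:

```python
import numpy as np

def is_nash(x, A, tol=1e-9):
    """Check the Nash condition for a mixed strategy x in the simplex:
    (Ax)_i <= x^T A x for every pure strategy i, with equality on the
    support of x."""
    Ax = A @ x
    avg = x @ Ax                      # average payoff x^T A x
    support = x > tol
    # No pure strategy may earn more than the average payoff ...
    if np.any(Ax > avg + tol):
        return False
    # ... and every strategy in the support must earn exactly the average.
    return bool(np.allclose(Ax[support], avg, atol=1e-6))

# Toy affinities: objects 0 and 1 are mutually similar, object 2 is clutter.
A = np.array([[0.0, 1.0, 0.1],
              [1.0, 0.0, 0.1],
              [0.1, 0.1, 0.0]])

print(is_nash(np.array([0.5, 0.5, 0.0]), A))  # True: uniform on {0, 1}
print(is_nash(np.array([1.0, 0.0, 0.0]), A))  # False: 1 is a better reply
```

Here the mixture concentrated on the coherent pair {0, 1} passes the test, while any pure strategy fails it.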
An interesting refinement of the Nash equilibrium is that of an evolutionary
stable strategy (ESS). This concept has been introduced in evolutionary game
theory, a field that originated in the early 1970’s as an attempt to apply the
principles and tools of game theory to biological contexts, with a view to model
the evolution of animal, as opposed to human, behavior [23]. It considers an
idealized scenario whereby pairs of individuals are repeatedly drawn at random
from a large, ideally infinite, population to play a symmetric two-player game. In
contrast to conventional game theory, here players are not supposed to behave
rationally or to have complete knowledge of the details of the game. They act
instead according to an inherited behavioral pattern, or pure strategy, and it
is supposed that some evolutionary selection process operates over time on the
distribution of behaviors. We refer the reader to [24, 25] for classical introductions
to this rapidly expanding field.
An ESS is essentially a Nash equilibrium satisfying an additional stability
property which guarantees that if an ESS is established in a population, and
if a small proportion of the population adopts some mutant behavior, then the
selection process will eventually drive them to extinction. Specifically, x ∈ ∆ is
an ESS if x is a Nash equilibrium and y ∈ β(x)\{x} ⇒ x^T Ay > y^T Ay (stability property).
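The stability property can be probed numerically as well. In the sketch below (a made-up 3-object game, for illustration only), x = (1/2, 1/2, 0) is a Nash equilibrium whose best replies are exactly the mixtures of objects 0 and 1, so the mutants y ∈ β(x)\{x} can be parameterized by a single number t:

```python
import numpy as np

# Toy affinities: objects 0 and 1 are mutually similar, object 2 is clutter.
A = np.array([[0.0, 1.0, 0.1],
              [1.0, 0.0, 0.1],
              [0.1, 0.1, 0.0]])
x = np.array([0.5, 0.5, 0.0])

# Best replies to x are the mixtures y = (t, 1-t, 0); test the ESS
# stability inequality x^T A y > y^T A y for mutants with t != 0.5.
ok = True
for t in np.linspace(0.0, 1.0, 11):
    if abs(t - 0.5) < 1e-12:
        continue  # y = x is excluded from the stability condition
    y = np.array([t, 1.0 - t, 0.0])
    ok = ok and (x @ A @ y) > (y @ A @ y)
print(ok)  # True: every mutant best reply does worse against itself
```

Since x^T A y = 1/2 while y^T A y = 2t(1 − t) < 1/2 for t ≠ 1/2, the inequality holds and x is an ESS of this toy game.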
In the next section we will provide a combinatorial characterization of the ESSs of a clustering game, which motivates their use as a notion of a cluster.
3 Clusters as ESSs
Consider a clustering problem where O = {1, . . . , n} is the set of the objects to
cluster and A = (aij ) is the n × n similarity matrix. Let C ⊆ O be a non-empty
subset of objects. The (average) weighted in-degree of i ∈ O with respect to C
is defined as:
    awindeg_C(i) = (1/|C|) Σ_{j∈C} a_ij ,
where |C| denotes the cardinality of C. Moreover, if j ∈ C we define:
    φ_C(i, j) = a_ij − awindeg_C(j) ,
which is a measure of the similarity of object i with object j with respect to the
average similarity of object j with elements in C. The weight of i with respect
to C is
    W_C(i) = 1                                             if |C| = 1,
    W_C(i) = Σ_{j ∈ C\{i}} φ_{C\{i}}(i, j) W_{C\{i}}(j)    otherwise,
while the total weight of C is defined as:
    W(C) = Σ_{i∈C} W_C(i) .
Intuitively, W_C(i) gives us a measure of the support that object i receives from the objects in C \ {i} relative to the overall mutual similarity of the objects in C \ {i}. Here positive values indicate that i has high similarity to C \ {i}.
A non-empty subset of objects C ⊆ O, such that W(T) > 0 for any non-empty T ⊆ C, is said to be a dominant set if:
1. W_C(i) > 0, for all i ∈ C,
2. W_{C∪{i}}(i) ≤ 0, for all i ∉ C.
The two previous conditions correspond to the two main properties of a
cluster. The first regards internal homogeneity: each object in the cluster C
is positively supported by all other objects in C. The second regards external
heterogeneity: any extension of the cluster C with an external object i will
undermine the internal coherency, since the new cluster C ∪ {i} will negatively
support the novel entry i.
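These definitions translate almost verbatim into code. The following sketch (NumPy, with a made-up 3-object affinity matrix; the recursion is exponential and meant purely as an illustration of the definitions, not as a practical algorithm) computes the weights W_C(i) and tests the two dominant-set conditions:

```python
import numpy as np

def awindeg(A, i, C):
    """Average weighted in-degree of object i with respect to the set C."""
    return sum(A[i, j] for j in C) / len(C)

def phi(A, C, i, j):
    """phi_C(i, j) = a_ij - awindeg_C(j)."""
    return A[i, j] - awindeg(A, j, C)

def W(A, C, i):
    """Weight of i with respect to C, following the recursive definition."""
    if len(C) == 1:
        return 1.0
    R = C - {i}
    return sum(phi(A, R, i, j) * W(A, R, j) for j in R)

def is_dominant(A, C, n):
    """Check the internal and external dominant-set conditions.
    (The precondition W(T) > 0 for every non-empty T subset of C is not
    re-verified here, for brevity.)"""
    internal = all(W(A, C, i) > 0 for i in C)
    external = all(W(A, C | {i}, i) <= 0 for i in set(range(n)) - C)
    return internal and external

# Toy affinities: {0, 1} is a tight group, object 2 is weakly related clutter.
A = np.array([[0.0, 1.0, 0.1],
              [1.0, 0.0, 0.1],
              [0.1, 0.1, 0.0]])

print(is_dominant(A, {0, 1}, 3))  # True: a dominant set (a cluster)
print(is_dominant(A, {0}, 3))     # False: adding object 1 still gets support
```

In this toy example, {0, 1} satisfies both conditions, while {0} alone fails the external one, since the candidate entry 1 would receive positive support.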
The next result provides a one-to-one correspondence between ESSs and
dominant sets, thereby motivating the use of ESSs as clusters.
Theorem 1. If C ⊆ O is a dominant set with respect to similarity matrix A, then x^C is an ESS for a two-player game with payoff matrix A, where x^C is the vector defined as

    x^C_i = W_C(i)/W(C) if i ∈ C, and x^C_i = 0 otherwise.
Conversely, if x is an ESS for a two-person game with payoff matrix A, then
C = σ(x) is a dominant set with respect to A, provided that C = Ω(x).
Evolutionary game theory provides us with an algorithmic means of extracting clusters. Indeed, the hypotheses that each object belongs to a cluster compete
with one another, each obtaining support from similar objects and competitive
pressure from the others. Competition will reduce the population of individuals
that assume weakly supported hypotheses, while allowing populations assuming
hypotheses with strong support to thrive. Eventually, all inconsistent hypotheses
will be driven to extinction, while all the surviving ones will reach an equilibrium whereby they will all receive the same average support, hence exhibiting
the internal coherency characterizing a cluster (this derives from the Nash condition). As for the extinct hypotheses, they will provably have a lower support,
thereby hinting at external incoherency. The stable strategies can be found using
(discrete-time) replicator dynamics,

    x_i^{(t+1)} = x_i^{(t)} (A x^{(t)})_i / (x^{(t)T} A x^{(t)}) ,
which is a classic formalization of a natural selection process [25, 24].
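A minimal implementation of these dynamics is sketched below (NumPy; the affinity matrix is a toy example of our own devising, and we assume a non-negative payoff matrix, which keeps the iterates inside the simplex). Starting near the barycenter of ∆, the population concentrates on the coherent pair and the clutter hypothesis is driven to extinction:

```python
import numpy as np

def replicator(A, x0, iters=1000):
    """Discrete-time replicator dynamics: x_i <- x_i (Ax)_i / (x^T A x).
    Assumes a non-negative payoff matrix A (otherwise a constant offset
    can be added to A first)."""
    x = x0.copy()
    for _ in range(iters):
        Ax = A @ x
        x = x * Ax / (x @ Ax)
    return x

# Toy affinities: objects 0 and 1 form a coherent group, object 2 is clutter.
A = np.array([[0.0, 1.0, 0.1],
              [1.0, 0.0, 0.1],
              [0.1, 0.1, 0.0]])

x = replicator(A, np.array([0.34, 0.33, 0.33]))
cluster = {i for i in range(3) if x[i] > 1e-4}
print(cluster)  # {0, 1}: the surviving hypotheses form the cluster
```

The surviving support of the limit point is read off as the extracted cluster; peeling it away from O and re-running the dynamics would extract further clusters sequentially.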
4 Applications and extensions
The proposed game-theoretic framework for clustering has been used to tackle a variety of problems. Indeed, we find applications in image and video segmentation [26, 20], perceptual grouping [21], analysis of fMRI data [27, 28], content-based image retrieval [29], detection of anomalous activities in video streams [30], bioinformatics [31] and human action recognition [32].
In [33], we show how this framework can also be used for addressing matching
problems, where the goal is to find correspondences between two sets of objects
O1 and O2 satisfying some desired criteria. The formulation we derive is general
and application-independent. The main idea is to model the set of possible correspondences between two sets of objects O1 and O2 as a set of game strategies.
The matching task becomes thus a non-cooperative game where the potential
associations between objects to be matched correspond to strategies, while the
payoffs reflect the degree of compatibility between competing hypotheses. Within
this formulation, the solutions to the matching problem correspond to ESSs of
the underlying “matching game”. A distinguishing feature of this formulation is
that it allows one to naturally deal with general many-to-many matching problems even in the presence of asymmetric compatibilities. Note that, within the
context of the clustering framework introduced in the previous sections, this is
equivalent to finding a cluster over the set of feasible associations, where similarities reflect the association compatibilities. The potential of the proposed
approach is demonstrated in [33] via two sets of image matching experiments,
both of which show that our results outperform those obtained using well-known
domain-specific algorithms.
Our game-theoretic framework for clustering has also been extended to the case of high-order similarities [22]. Indeed, object similarities are typically expressed
as pairwise relations, but in some applications higher-order relations are more
appropriate, and approximating them in terms of pairwise interactions can lead
to substantial loss of information. Consider for instance the problem of clustering a given set of d-dimensional Euclidean points into lines. As every pair of
data points trivially defines a line, there does not exist a meaningful pairwise
measure of similarity for this problem. In [22], we show that our game-theoretic
clustering framework can be naturally adapted to cases when high-order similarities are involved. In this case we refer to the hypergraph clustering problem,
which is the process of extracting maximally coherent groups from a set of objects using high-order (rather than pairwise) similarities. As in the pairwise case, traditional approaches to this problem are based on the idea of
partitioning the input data into a user-defined number of classes, thereby obtaining the clusters as a by-product of the partitioning process. Moreover, most
of them cast the hypergraph clustering problem into a graph clustering problem,
by approximating higher-order similarities with pairwise ones. In contrast
to the classical approach, we show that the hypergraph clustering problem can
be naturally cast into a non-cooperative multi-player clustering game (without
resorting to approximations), whereby the notion of a cluster is equivalent to a
classical game-theoretic equilibrium concept. In this case the number of players
involved in the game is determined by the arity of the similarity relations, while
the payoff function is driven by the weights of the high-order similarities. From
the computational viewpoint, we show that the problem of finding the equilibria
of our clustering game is, in the case of supersymmetric similarities, equivalent
to locally optimizing a polynomial function over the standard simplex, and we
provide a discrete-time dynamics to perform this optimization, which generalizes the replicator dynamics. In [22], we assess the superiority of the proposed
approach by performing synthetic experiments on line clustering with different
types of noise as well as real experiments on illuminant-invariant face clustering.
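As a rough illustration of the multi-player case, the sketch below runs a Baum-Eagon-type dynamics over a supersymmetric third-order similarity tensor; both the toy tensor and the specific update rule are our own illustrative assumptions, not the exact construction of [22]:

```python
import numpy as np
from itertools import permutations

def hypergraph_replicator(S, x0, iters=100):
    """Baum-Eagon-type dynamics x_i <- x_i * g_i / (x . g), where
    g_i = sum_{j,k} S[i,j,k] x_j x_k is (up to a factor) the gradient of
    the cubic polynomial sum S[i,j,k] x_i x_j x_k over the simplex.
    A sketch only: the dynamics used in [22] may differ in detail."""
    x = x0.copy()
    for _ in range(iters):
        g = np.einsum('ijk,j,k->i', S, x, x)
        x = x * g / (x @ g)
    return x

# Four objects; the triple {0, 1, 2} is maximally coherent, every other
# triple has zero similarity (a made-up, line-clustering-style example).
n = 4
S = np.zeros((n, n, n))
for p in permutations((0, 1, 2)):
    S[p] = 1.0          # supersymmetric: all orderings of the triple

x = hypergraph_replicator(S, np.array([0.26, 0.25, 0.25, 0.24]))
print({i for i in range(n) if x[i] > 1e-4})  # the coherent triple survives
```

On this toy tensor the dynamics converge to the uniform distribution over {0, 1, 2}, which maximizes the associated polynomial over the simplex, while the fourth object receives zero support and goes extinct.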
5 Conclusions
In this paper we have introduced a game-theoretic notion of a cluster, which has
found applications in fields as diverse as computer vision and bioinformatics. In
a nutshell, our game-theoretic perspective has the following attractive features:
1. it makes no assumption on the underlying (individual) data representation:
like spectral (and, more generally, graph-based) clustering, it does not require
that the elements to be clustered be represented as points in a vector space;
2. it makes no assumption on the structure of the affinity matrix, as it is able to work with asymmetric and even negative similarity functions alike;
3. it does not require a priori knowledge on the number of clusters (since it
extracts them sequentially);
4. it leaves clutter elements unassigned (useful, e.g., in figure/ground separation
or one-class clustering problems);
5. it allows extracting overlapping clusters [34];
6. it generalizes naturally to hypergraph clustering problems, i.e., in the presence of high-order affinities [22], in which case the clustering game is played
by more than two players.
The approach outlined above is but one example of using purely game-theoretic concepts to model generic machine learning problems (see [35] for
another such example in a totally different context), and the potential of game
theory for machine learning is yet to be fully explored. Other areas where game
theory could potentially offer a fresh and powerful perspective include, e.g., semi-supervised learning, multi-similarity learning, multi-task learning, learning with
incomplete information, learning with context-dependent similarities. The concomitant increasing interest around the algorithmic aspects of game theory [36]
is certainly beneficial in this respect, as it will allow useful cross-fertilization of
ideas.
6 Acknowledgments
We acknowledge financial support from the FET programme within EU FP7,
under the SIMBAD project (contract 213250), and we are grateful to Mário
Figueiredo and Tiberio Caetano for providing useful comments on a preliminary
version of this paper.
References
1. Kleinberg, J.M.: An impossibility theorem for clustering. In: Adv. in Neural
Inform. Proces. Syst. (NIPS). (2002)
2. Ackerman, M., Ben-David, S.: Measures of clustering quality: A working set of
axioms for clustering. In: Adv. in Neural Inform. Proces. Syst. (NIPS). (2008)
3. Zadeh, R.B., Ben-David, S.: A uniqueness theorem for clustering. In: Uncertainty
in Artif. Intell. (2009)
4. Jain, A.K., Dubes, R.C.: Algorithms for data clustering. Prentice-Hall (1988)
5. Hartigan, J.: Introduction. In Arabie, P., Hubert, L.J., de Soete, G., eds.: Clustering and Classification, River Edge, NJ, World Scientific (1996)
6. Lakoff, G.: Women, Fire, and Dangerous Things: What Categories Reveal about
the Mind. The University of Chicago Press (1987)
7. Eco, U.: Kant and the Platypus: Essays on Language and Cognition. Harvest
Books (2000)
8. Guyon, I., von Luxburg, U., Williamson, R.C.: Clustering: Science or art? (Opinion
paper, 2009)
9. Herault, L., Horaud, R.: Figure-ground discrimination: a combinatorial optimization approach. IEEE Trans. Pattern Anal. Machine Intell. 15(9) (1993) 899–914
10. Shashua, A., Ullman, S.: Structural saliency: The detection of globally salient
features using a locally connected network. In: Int. Conf. Comp. Vision (ICCV).
(1988)
11. Gupta, G., Ghosh, J.: Robust one-class clustering using hybrid global and local
search. In: Int. Conf. on Mach. Learning. (2005)
12. Crammer, K., Talukdar, P.P., Pereira, F.: A rate-distortion one-class model and
its applications to clustering. In: Int. Conf. on Mach. Learning. (2008)
13. Jardine, N., Sibson, R.: The construction of hierarchic and non-hierarchic classifications. Computer J. 11 (1968) 177–184
14. Banerjee, A., Krumpelman, C., Basu, S., Mooney, R.J., Ghosh, J.: Model-based
overlapping clustering. In: Int. Conf. on Knowledge Discovery and Data Mining.
(2005) 532 – 537
15. Heller, K., Ghahramani, Z.: A nonparametric Bayesian approach to modeling overlapping clusters. In: Int. Conf. AI and Statistics. (2007)
16. Tversky, A.: Features of similarity. Psychological Review 84 (1977) 327–352
17. Jacobs, D.W., Weinshall, D., Gdalyahu, Y.: Classification with nonmetric distances. IEEE Trans. Pattern Anal. Machine Intell. 22(6) (2000) 583–600
18. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local
alignment search tool. J. Molec. Biology 215(3) (1990) 403–410
19. Dubuisson, M.P., Jain, A.K.: A modified Hausdorff distance for object matching.
In: Int. Conf. Patt. Recogn. (ICPR). (1994) 566–568
20. Pavan, M., Pelillo, M.: Dominant sets and pairwise clustering. IEEE Trans. Pattern
Anal. Machine Intell. 29(1) (2007) 167–172
21. Torsello, A., Rota Bulò, S., Pelillo, M.: Grouping with asymmetric affinities: a
game-theoretic perspective. In: IEEE Conf. Computer Vision and Patt. Recogn.
(CVPR). (2006) 292–299
22. Rota Bulò, S., Pelillo, M.: A game-theoretic approach to hypergraph clustering.
In: Adv. in Neural Inform. Proces. Syst. (NIPS). (2009) In press.
23. Maynard Smith, J.: Evolution and the theory of games. Cambridge University
Press (1982)
24. Hofbauer, J., Sigmund, K.: Evolutionary games and population dynamics. Cambridge University Press (1998)
25. Weibull, J.W.: Evolutionary game theory. Cambridge University Press (1995)
26. Pavan, M., Pelillo, M.: A new graph-theoretic approach to clustering and segmentation. In: IEEE Conf. Computer Vision and Patt. Recogn. (CVPR). Volume 1.
(2003) 145–152
27. Müller, K., Neumann, J., Grigutsch, M., von Cramon, D.Y., Lohmann, G.: Detecting groups of coherent voxels in functional MRI data using spectral analysis
and replicator dynamics. J. Magn. Reson. Imaging 26(6) (2007) 1642–1650
28. Neumann, J., von Cramon, D.Y., Forstmann, B.U., Zysset, S., Lohmann, G.: The
parcellation of cortical areas using replicator dynamics in fMRI. NeuroImage 32(1)
(2006) 208–219
29. Wang, M., Ye, Z.L., Wang, Y., Wang, S.X.: Dominant sets clustering for image
retrieval. Signal Process. 88(11) (2008) 2843–2849
30. Hamid, R., Johnson, A., Batta, S., Bobick, A., Isbell, C., Coleman, G.: Detection
and explanation of anomalous activities: representing activities as bags of event
n-grams. In: IEEE Conf. Computer Vision and Patt. Recogn. (CVPR). Volume 1.
(2005) 20–25
31. Frommlet, F.: Tag SNP selection based on clustering according to dominant sets
found using replicator dynamics. Adv. in Data Analysis and Classif. (2010) (in
press).
32. Wei, Q.D., Hu, W.M., Zhang, X.Q., Luo, G.: Dominant sets-based action recognition using image sequence matching. In: Int. Conf. Image Processing (ICIP).
Volume 6. (2007) 133–136
33. Albarelli, A., Torsello, A., Rota Bulò, S., Pelillo, M.: Matching as a non-cooperative
game. In: Int. Conf. Comp. Vision (ICCV). (2009)
34. Torsello, A., Rota Bulò, S., Pelillo, M.: Beyond partitions: allowing overlapping
groups in pairwise clustering. In: Int. Conf. Patt. Recogn. (ICPR). (2008)
35. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press (2006)
36. Nisan, N., Roughgarden, T., Tardos, É., Vazirani, V., eds.: Algorithmic Game Theory. Cambridge University Press (2007)