Community Detection
Proseminar - Elementary Data Mining Techniques
by Simon Grätzer
Friday, 1 February 2013
Content
What is Community Detection?
Motivation
Defining a community
Methods to find communities
Overlapping communities
Clique percolation method
Finding a community with query nodes
Conclusion
What is Community Detection?
Different from traditional clustering
Algorithms use the properties of the graph itself
Graphs with a "natural" origin have a structure that is not random
We try to find these structures by analyzing the graph
A "perfect" solution has yet to be found
Motivation
Communities can represent parts of a larger system (like organs in the human body)
Communities can be considered a summary of the graph
Communities make it easier to visualize and understand complex systems
Communities on the web might represent pages on related topics
Communities can reveal properties of the network without disclosing private information about individuals
Defining a Community
There is no exact definition of a community in a graph
It depends on the application
A general definition:
Separation between nodes in different communities
Cohesion between nodes in the same community
The differences between algorithms come down to the precise definition
Basics
For a graph G = (V, E) and a subgraph C ⊆ G with |G| = |V| = n and |C| = n_c
φ_int(C) should be higher than the corresponding value for the whole graph, and φ_ext(C) should be much lower
Local definitions see a community as an autonomous entity within a larger system
Global definitions see communities as essential parts of a larger system
Vertex similarity: compare individual nodes and group them based on a similarity measure
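The slides do not spell out φ_int and φ_ext. The sketch below assumes they are the intra-cluster and inter-cluster densities from Fortunato's review of community detection: the fraction of possible edges inside C that exist, and the fraction of possible edges between C and the rest of the graph that exist. The function name `densities` and the toy graph are illustrative only.

```python
def densities(edges, nodes, community):
    """Return (phi_int, phi_ext, graph_density) for an undirected simple graph.

    Assumes 2 <= |community| < |nodes| so that no denominator is zero.
    """
    nodes, community = set(nodes), set(community)
    n, n_c = len(nodes), len(community)
    # Edges with both endpoints inside the community.
    internal = sum(1 for u, v in edges if u in community and v in community)
    # Edges with exactly one endpoint inside the community.
    boundary = sum(1 for u, v in edges if (u in community) != (v in community))
    phi_int = internal / (n_c * (n_c - 1) / 2)   # share of possible internal edges
    phi_ext = boundary / (n_c * (n - n_c))       # share of possible outgoing edges
    graph_density = len(edges) / (n * (n - 1) / 2)
    return phi_int, phi_ext, graph_density


# Toy graph: two triangles joined by the single edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
phi_int, phi_ext, rho = densities(edges, range(1, 7), {1, 2, 3})
print(phi_int, phi_ext, rho)
```

For the triangle {1, 2, 3}, the internal density lies above the density of the whole graph and the external density lies below it, which is the pattern the slide asks for.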
Methods
Finding overlapping communities
Clique percolation method (CPM)
Finding communities with query nodes
Clique Percolation Method
CPM is based on the idea that communities are likely to consist of cliques
Assumption: every node in a community is connected to nearly every other node in it
A community is built up from a chain of adjacent k-cliques
Two k-cliques are adjacent if they share k-1 nodes
The largest possible chain is defined as a community
This is a local definition
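The definition above can be turned into code almost literally. The following is a brute-force sketch, not the method used by CFinder: it enumerates every k-clique, links two cliques when they share k-1 nodes, and merges linked cliques with a union-find structure. The helper names and the toy graph are assumptions made for illustration, and the enumeration is only feasible for small graphs.

```python
from itertools import combinations


def build_adj(edges):
    """Adjacency sets of an undirected graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def k_clique_communities(adj, k):
    """Communities as unions of chains of adjacent k-cliques (brute force)."""
    # A k-clique is a k-subset of nodes in which every pair is connected.
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(v in adj[u] for u, v in combinations(c, 2))]
    parent = list(range(len(cliques)))  # union-find over clique indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:  # adjacent k-cliques
            parent[find(i)] = find(j)

    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(communities.values(), key=sorted)


# Toy graph: triangles {1,2,3} and {2,3,4} share two nodes; {4,5,6} shares only node 4.
adj = build_adj([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)])
communities = k_clique_communities(adj, 3)
print(communities)
```

With k = 3, node 4 ends up in both communities, which illustrates how CPM produces overlapping communities.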
Implementation of CPM
The number of possible k-cliques in a graph is quite high
Implementations therefore search for maximal cliques instead (an NP-hard problem)
We build a clique-clique overlap matrix O
Off-diagonal entries smaller than k-1 and diagonal entries smaller than k are removed
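A minimal sketch of the overlap-matrix step, assuming the maximal cliques have already been found by some other routine and are passed in as a list. Here O[i][j] counts the nodes shared by cliques i and j, and the diagonal holds the clique sizes; the communities are taken to be the connected components that remain after thresholding. The function name and the input list are hypothetical.

```python
def cpm_from_maximal_cliques(maximal_cliques, k):
    """Build the clique-clique overlap matrix and read off k-clique communities."""
    cliques = [set(c) for c in maximal_cliques]
    m = len(cliques)
    # O[i][j] = number of shared nodes; O[i][i] = size of clique i.
    O = [[len(cliques[i] & cliques[j]) for j in range(m)] for i in range(m)]
    # Keep diagonal entries >= k and off-diagonal entries >= k - 1.
    keep = [[O[i][j] >= (k if i == j else k - 1) for j in range(m)]
            for i in range(m)]
    seen, communities = set(), []
    for start in range(m):
        if start in seen or not keep[start][start]:
            continue
        # Depth-first search over the cliques that survived the threshold.
        stack, members = [start], set()
        seen.add(start)
        while stack:
            i = stack.pop()
            members |= cliques[i]
            for j in range(m):
                if j not in seen and keep[j][j] and keep[i][j]:
                    seen.add(j)
                    stack.append(j)
        communities.append(members)
    return O, communities


# Hypothetical maximal cliques of a small graph; {6, 7} is too small for k = 3.
O, communities = cpm_from_maximal_cliques([{1, 2, 3}, {2, 3, 4}, {4, 5, 6}, {6, 7}], 3)
for row in O:
    print(row)
print(communities)
```

Working with maximal cliques avoids enumerating every k-clique: a maximal clique of size s >= k already contains all of its k-subsets, and two maximal cliques sharing at least k-1 nodes contain adjacent k-cliques.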
[Figure: results of processing the example graph with the CFinder software, for parameters k = 3 and k = 4]
Drawbacks
Even though the underlying problem is NP-hard, the algorithm is reasonably fast on large sparse graphs
Some cases lead to useless results:
It looks for cliques, not for dense subgraphs in general
It requires the graph to contain a large number of cliques, but not too many
Finding a community with query nodes
The goal is to find a subgraph H that contains a given set Q of query nodes and is densely connected
A density function f is maximized over all possible choices of H
In this case we choose the minimum degree for f
Additionally, we add a distance constraint d that limits how far the nodes of H may be from the query nodes
Greedy algorithm without size restriction
Choose f(H) = minimum degree of a node in H
We set G_0 = G, then repeat the following steps:
Obtain G_{t+1} from G_t by removing a node that violates the distance constraint or has minimum degree
Terminate if one of the query nodes has minimum degree or the query nodes are no longer connected
Among the graphs G_t, we choose the component containing Q for which the minimum degree f(H) is maximized
This can be implemented in O(n + m) time, where m = |E|
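The steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the distance constraint is left out, nodes outside the component containing Q are dropped at every step, and degrees are recomputed from scratch, so the sketch does not reach the O(n + m) bound (that would need a bucket queue over degrees). The function names and the toy graph are hypothetical.

```python
from collections import deque


def build_adj(edges):
    """Adjacency sets of an undirected graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def component_of(adj, alive, source):
    """Nodes reachable from source using only nodes in `alive` (BFS)."""
    seen, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in alive and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def greedy_community(adj, query):
    """Return (nodes, min_degree) of the best subgraph seen, or (None, -1)."""
    query = set(query)
    alive = set(adj)
    best, best_score = None, -1
    while True:
        comp = component_of(adj, alive, next(iter(query)))
        if not query <= comp:
            break  # the query nodes are no longer connected
        alive = comp  # only the component containing Q matters
        degree = {u: sum(1 for v in adj[u] if v in alive) for u in alive}
        score = min(degree.values())  # f(H) = minimum degree
        if score > best_score:
            best, best_score = set(alive), score
        if any(degree[q] == score for q in query):
            break  # a query node has minimum degree
        alive.remove(min(alive, key=degree.get))  # drop a minimum-degree node
    return best, best_score


# Toy graph: a 4-clique on {1, 2, 3, 4} with a pendant path 4-5-6.
adj = build_adj([(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)])
community, score = greedy_community(adj, {1, 2})
print(community, score)
```

On the toy graph the pendant nodes 6 and 5 are peeled off first, after which the remaining 4-clique has minimum degree 3 and the loop stops because a query node now has minimum degree.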
[Figure: the greedy algorithm, without size constraint, applied to the example graph with Q = {1, 2, 3}]
Communities with size restriction
A size constraint k makes the problem NP-hard (this can be shown via a reduction from the Steiner tree problem)
But it can be assumed that the size of the result set is correlated with the distance constraint
The paper (Sozio and Gionis, KDD '10) proposes two heuristics:
GreedyDist repeatedly executes Greedy and decreases d until the size k' of the resulting graph is small enough
GreedyFast restricts the graph to the k' closest nodes to the query nodes, then invokes Greedy
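The preprocessing step of GreedyFast might look like the sketch below. Two points are assumptions of this sketch rather than statements from the slides: closeness is measured as the sum of squared hop distances to the query nodes, and the query nodes are always kept even if they would not rank among the k' closest. The function names and the toy graph are hypothetical; the returned subgraph would then be handed to Greedy.

```python
from collections import deque


def build_adj(edges):
    """Adjacency sets of an undirected graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist, queue = {source: 0}, deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def restrict_to_closest(adj, query, k_prime):
    """Induced subgraph on the query nodes plus the closest nodes, up to k_prime."""
    per_query = [bfs_distances(adj, q) for q in query]
    # Only nodes reachable from every query node are candidates.
    reachable = [u for u in adj if all(u in d for d in per_query)]
    # Assumed closeness score: sum of squared distances to the query nodes.
    score = {u: sum(d[u] ** 2 for d in per_query) for u in reachable}
    keep = set(query)
    for u in sorted(reachable, key=lambda u: (score[u], u)):
        if len(keep) >= k_prime:
            break
        keep.add(u)
    return {u: adj[u] & keep for u in keep}


# Toy graph: the path 1-2-3-4-5-6 with a single query node 3.
adj = build_adj([(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)])
sub = restrict_to_closest(adj, [3], 3)
print(sub)
```

Because the restricted graph has at most k' nodes (or |Q| nodes if that is larger), any community Greedy finds in it respects the size bound by construction.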
Evaluation with the DBLP dataset
The goal was to find a network of scientific collaboration around Christos Papadimitriou
Conclusion
A really broad topic with lots of applications
Each algorithm is built with different problems in mind
Algorithms are difficult to compare; there is no standard way of testing them