Transcription
Community Detection
Proseminar - Elementary Data Mining Techniques
by Simon Grätzer
Friday, 1 February 2013

Content
- What is Community Detection?
- Motivation
- Defining a community
- Methods to find communities
- Overlapping communities
- Clique percolation method
- Finding a community with query nodes
- Conclusion

What is Community Detection?
- Different from traditional clustering
- Algorithms use the properties of the graph
- Graphs with a "natural" origin have a structure that is not random
- We try to find these structures by analyzing the graph
- A "perfect" solution has yet to be found

Motivation
- Communities can represent parts of a larger system (like organs in the human body)
- Communities can be considered a summary of the graph
- Communities make it easier to visualize and understand complex systems
- Communities on the web might represent pages on related topics
- Communities can reveal properties of a network without releasing individuals' private information

Defining a Community
- There is no exact definition of a community in a graph; it depends on the application
- A general definition:
  - Separation between nodes in different communities
  - Cohesion between nodes in the same community
- The differences between algorithms come down to the precise definition

Basics
- For a graph G = (V, E) and a subgraph C ⊆ G with |G| = |V| = n and |C| = nc
- The internal density φint(C) should be higher than the density of the whole graph, and the external density φext(C) should be much lower
- Local definitions see a community as an autonomous entity within a larger system
- Global definitions see communities as essential parts of a larger system
- Vertex similarity: compare individual nodes and group them based on a similarity measure

Methods
- Finding overlapping communities
  - Clique percolation method (CPM)
- Finding communities with query nodes

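The Basics slide uses φint(C) and φext(C) without spelling them out. The following is a minimal sketch, assuming the intra- and inter-cluster density definitions from Fortunato's survey: φint(C) is the number of edges inside C divided by nc(nc-1)/2, and φext(C) is the number of edges leaving C divided by nc(n-nc). The function names and the toy graph are hypothetical and only serve as an illustration.

```python
def densities(edges, community, n):
    """Return (phi_int, phi_ext) for a node set in an undirected graph with n nodes.

    Assumed definitions (not stated on the slide):
      phi_int = internal edges / (nc * (nc - 1) / 2)
      phi_ext = edges leaving the community / (nc * (n - nc))
    """
    c = set(community)
    nc = len(c)
    if nc < 2 or nc >= n:
        raise ValueError("community must have at least 2 nodes and fewer than n")
    internal = sum(1 for u, v in edges if u in c and v in c)
    external = sum(1 for u, v in edges if (u in c) != (v in c))
    return internal / (nc * (nc - 1) / 2), external / (nc * (n - nc))


def graph_density(edges, n):
    """Edge density of the whole graph, used as the reference value."""
    return len(edges) / (n * (n - 1) / 2)


# Hypothetical toy graph: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
phi_int, phi_ext = densities(edges, {0, 1, 2}, n=6)
overall = graph_density(edges, n=6)
```

For the triangle {0, 1, 2} the internal density lies above the density of the whole graph and the external density lies below it, which is the pattern the slide asks for.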
Clique Percolation Method
- CPM is based on the idea that communities are likely to consist of cliques
- Assumption: every node in a community is connected to nearly every other node in it
- A community is built up from a chain of adjacent k-cliques
- Two k-cliques are adjacent if they share k-1 nodes
- The largest possible chain is defined as a community
- This is a local definition

Implementation of CPM
- The number of possible k-cliques in a graph is quite high
- Implementations therefore search for maximal cliques (an NP-hard problem)
- We build a clique-clique overlap matrix O, where entry O[i][j] is the number of nodes shared by cliques i and j
- Off-diagonal entries smaller than k-1 (and diagonal entries smaller than k) are removed

Parameter k = 3; k = 4
[Figure: the results of processing the example graph with the CFinder software]

Drawbacks
- Even though the underlying problem is NP-hard, the algorithm is reasonably fast on large sparse graphs
- Some cases lead to useless results:
  - It looks for cliques, not dense subgraphs
  - It requires a large number of cliques, but not too many

Finding a community with query nodes
- The goal is to find a subgraph H that contains a given set Q of query nodes and is densely connected
- A density function f is maximized over all possible choices for H
- In this case we choose the minimum degree for f
- Additionally, we add a distance constraint d

Without size restriction
- Greedy algorithm
- Choose f(H) = minimum degree of a node in H
- We set G0 = G, then repeat the steps:
  - Obtain Gt+1 by removing a node that violates the distance constraint or has the minimum degree
  - Terminate if one of the query nodes has minimum degree or the query nodes are no longer connected
- Among all Gt, we choose the connected component containing the query nodes for which the minimum degree f(H) is maximized
- This can be implemented in O(n+m)

Q = {1, 2, 3}
[Figure: the greedy algorithm, without size constraint, applied to the example graph]

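The steps on the "Implementation of CPM" slide can be sketched in a few lines. This is a minimal illustration, not the CFinder implementation: it assumes a basic Bron-Kerbosch recursion for the maximal cliques and replaces the explicit overlap matrix with a union-find over clique pairs. All function names and the example graph are hypothetical.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_percolation(edges, k):
    """Return the k-clique communities as sorted node lists."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Diagonal rule: keep only maximal cliques with at least k nodes.
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Off-diagonal rule: two cliques stay linked if they share >= k-1 nodes.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)
    # Each connected group of cliques forms one community.
    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical graph: triangles {1,2,3} and {2,3,4} share an edge,
# triangle {4,5,6} shares only node 4 with them.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]
communities_k3 = clique_percolation(edges, 3)
communities_k4 = clique_percolation(edges, 4)
```

With k = 3 the first two triangles merge into one community while the third stays separate, and node 4 belongs to both, which is the overlapping behaviour CPM is meant to capture. With k = 4 the graph contains no clique that is large enough, so no community is found.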
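The greedy procedure from the "Without size restriction" slide can be sketched as follows. Several things here are assumptions rather than the paper's method: the distance constraint is simplified to "every node lies within max_dist hops of every query node", node labels are assumed to be comparable (for deterministic tie-breaking), and the loop recomputes distances in every round, so it is a readable quadratic version and not the O(n+m) implementation mentioned on the slide. The names greedy_community and bfs_dist and the example graph are hypothetical.

```python
from collections import deque


def bfs_dist(adj, src):
    """Hop distances from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def greedy_community(edges, query, max_dist=None):
    """Return (nodes, min_degree) of the best subgraph seen while peeling."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    query = set(query)
    for q in query:
        adj.setdefault(q, set())
    best_nodes, best_score = None, -1
    while True:
        dists = {q: bfs_dist(adj, q) for q in query}
        component = set(next(iter(dists.values())))
        if not query <= component:
            break  # query nodes are no longer connected
        far = []
        if max_dist is not None:
            far = [v for v in component
                   if max(dists[q][v] for q in query) > max_dist]
        if far:
            # Remove the node that violates the distance constraint the most.
            victim = max(far, key=lambda v: (max(dists[q][v] for q in query), v))
        else:
            min_deg = min(len(adj[v]) for v in component)
            if min_deg > best_score:
                best_nodes, best_score = set(component), min_deg
            if any(len(adj[q]) == min_deg for q in query):
                break  # a query node has minimum degree
            victim = min(component, key=lambda v: (len(adj[v]), v))
        if victim in query:
            break  # the constraint cannot be met without dropping a query node
        for w in adj.pop(victim):
            adj[w].discard(victim)
    return best_nodes, best_score


# Hypothetical graph: a 4-clique {1,2,3,4} with a tail 4-5-6.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
nodes, score = greedy_community(edges, [1, 2, 3])
near_nodes, near_score = greedy_community(edges, [1], max_dist=1)
```

On this graph the tail nodes 6 and 5 are peeled off first, after which every remaining node has degree 3 and the procedure stops because a query node now has minimum degree.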
Communities with size restriction
- A size constraint k makes the problem NP-hard (this can be shown via a reduction from the Steiner tree problem)
- But it can be assumed that the size of the result set is correlated with the distance constraint
- The paper proposes two heuristics:
  - GreedyDist repeatedly executes Greedy and decreases d until the size k' of the graph is small enough
  - GreedyFast restricts the graph to the k' closest nodes to the query nodes; then Greedy is invoked

Evaluation with the DBLP dataset
- The goal was to find a network of scientific collaboration around Christos Papadimitriou

Conclusion
- A really broad topic with lots of applications
- Each algorithm is built with different problems in mind
- Algorithms are difficult to compare; there is no standard way of testing

Bibliography
[1] P. Erdős and A. Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5:17-61, 1960.
[2] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75-174, 2010.
[3] P. F. Jonsson and P. A. Bates. Global topological features of cancer proteins in the human interactome. Bioinformatics, 2291-2297, 2006.
[4] T. Heimo, J. Saramäki, J.-P. Onnela, and K. Kaski. Spectral and network methods in the analysis of correlation matrices of stock returns. Physica A, 383:147-151, 2007.
[5] J. M. Kumpula, M. Kivelä, K. Kaski, and J. Saramäki. Sequential algorithm for fast clique percolation. Phys. Rev. E, 78:026109, Aug 2008.
[6] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435:814-818, June 2005.
[7] M. A. Porter, J.-P. Onnela, and P. J. Mucha. Communities in networks. Notices of the American Mathematical Society, 1164-1166, 2009.
[8] M. Sozio and A. Gionis. The community-search problem and how to plan a successful cocktail party.
In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, 939-948, New York, NY, USA, 2010. ACM.
[9] K.-F. W. Wei Gao. Information Retrieval Technology. Springer Berlin Heidelberg, 2008.