Mining and Analysis of Online Social Networks

Emilio Ferrara
Department of Mathematics, University of Messina
Supervisor: Prof. Giacomo Fiumara

A thesis submitted for the degree of Philosophiæ Doctor (PhD) in Mathematics
February 2012

1. Reviewer: Dr. Robert Baumgartner, Vienna Technische Universität
2. Reviewer: Dr. Haixuan Yang, Royal Holloway University of London

Day of the defense: 26 March 2012

Abstract

Social media and, in particular, Online Social Networks (OSNs) have acquired huge popularity and represent one of the most important social and Computer Science phenomena of recent years. This dissertation presents a comprehensive study of the process of mining information from Online Social Networks and of analyzing the structure of the networks themselves. To this purpose, several methods are adopted, ranging from Web Mining techniques, to graph-theoretical models, to the statistical analysis of network features from a quantitative and qualitative perspective. The origin, distribution and sheer size of the data involved make each of these standard methods either moot or inapplicable at the required scale. New methods are proposed and their effectiveness is assessed against relevant data samples. The content of the present dissertation can be summarized in three main parts: (i) In the first part we discuss the problem of mining Web sources from an algorithmic perspective; different techniques, largely adopted in Web data extraction tasks, are discussed, and a novel approach to refine the process of automatic extraction of information from Web pages is presented, which forms the core of a platform for sampling data from OSNs. (ii) The second part of this Thesis discusses the analysis of a large dataset acquired from the most representative (and largest) OSN to date: Facebook. This platform gathers hundreds of millions of users, and its modeling and analysis is possible by means of Social Network Analysis techniques. The investigation of topological features is also extended to other OSN datasets available online to the scientific community. Several features of these networks, such as the well-known small world effect, scale-free distributions and community structure, are characterized and analyzed. At the same time, our analysis provides quantitative clues to verify the validity of different sociological theories on large-scale social networks (for example, the six degrees of separation or the strength of weak ties). In particular, the problem of community detection on massive OSNs is redefined and solved by means of a new algorithm. This result puts into evidence the need for computationally efficient, even if heuristic, measures to assess the importance of individuals in the network. (iii) The last part of the Thesis is devoted to presenting a novel, efficient measure of centrality for social networks, whose rationale is grounded in random walk theory. Its validity is assessed against massive OSN datasets; it becomes the basis for a novel community detection algorithm which is shown to work surprisingly well in different contexts, such as social and biological network analysis.

To my family. Without you united in waiting for me at the finish line, no achievement would matter.

Acknowledgements

This Thesis would not have been possible without the support of many people whose help has been fundamental during the years of my Ph.D. studies. I mostly owe my gratitude to my Supervisor, Prof. Giacomo Fiumara.
I received his personal support during several hard periods and I would like to thank him for his continuous efforts to show me the path to follow. This Thesis represents my commitment to shape his brilliant and creative scientific ideas and his multidisciplinary vision of how to apply Computer Science, Physics and Mathematics to real-world research challenges. It was an honor for me to work with him. I am indebted to Prof. Francesco Oliveri, Head of the Ph.D. School, who believed in me from the very beginning of my Ph.D. studies and gave me the chance to pursue this goal, supporting me throughout these years. Without his trust it would not have been possible for me to visit and stay at the Vienna Technische Universität and at the Royal Holloway University of London as a visiting Ph.D. student. During my studies I had the pleasure to work with two fantastic people of the Computer Science research group of my University: Prof. Alessandro Provetti and Prof. Pasquale De Meo. I owe my deepest gratitude to Prof. Provetti, who showed me the importance of establishing international contacts with the research community and of spending time studying abroad. In particular, I would like to thank him for his efforts to give me the opportunity to stay in Vienna and London. Prof. De Meo is simply one of the most talented scientists I have ever met. With his brilliant intuitions and hard work he greatly contributed to several topics discussed in this Thesis. It was a pleasure for me to work with him, and without his support I would not even have faced some scientific problems whose solution appeared hard to me. During 2010 I had the chance to spend four months in Vienna, collaborating with the DBAI group of the Vienna Technische Universität and with Lixto GmbH, under the supervision of Dr. Robert Baumgartner. I owe my gratitude to him for a number of reasons. First of all, it was a great pleasure to work with him: in just a few months he taught me the fundamental concepts of Web data extraction and put me in the position to contribute to this research field. In addition, he personally supported me on several occasions, and I am particularly grateful to him since he agreed to be a Reviewer of this Thesis, investing a significant amount of his time in revising the work and providing precious suggestions to improve it. I would also like to express my gratitude to several people who in different ways helped me during this experience in Vienna. From the DBAI group I am particularly grateful to Ruslan Fayzrakhmanov and Dr. Bernhard Kruepl, whose suggestions helped me to speed up my research activity. From Lixto, Gerald Ledermueller and Serkan Avci supported me during the initial “bootstrap” period, helping me to understand the Lixto framework for Web data extraction and the rationale behind its functioning. Between the end of 2011 and the beginning of 2012 I spent four months at the Royal Holloway University of London, working under the supervision of Dr. Alberto Paccanaro in the Centre for Systems and Synthetic Biology at the Department of Computer Science. I had the pleasure of joining a wonderful group of young scientists led by a fantastic person. Dr. Paccanaro is among the most brilliant and incredible people I have ever met in Academia. In such a short period I was exposed to an impressive number of new ideas, and his research group reflects the true passion he puts into everything he does, including research in Computational Biology.
I had the honor to collaborate with the colleagues of the PaccanaroLab, in particular with Dr. Alfonso E. Romero, Dr. Haixuan Yang, Dr. Prajwal Bhat, Sandra Smieszek and Horacio Caniza. I am particularly indebted to Dr. Yang, who expressed his willingness to be a Reviewer for this Thesis: his suggestions helped me to improve the quality of this work, in particular regarding the last part of the Thesis. I am glad to have had the chance to work with Dr. Romero and Dr. Bhat on a number of research projects, and I am grateful to both of them for their precious help, without which it would not have been possible for me to grasp the fundamental concepts of Bioinformatics and Computational Biology in such a short period of time. I would like to express my gratitude for their work to all my other coauthors: Francesco Pagano, Salvatore Catanese, Dr. Angela Ricciardello, Dr. Giovanni Quattrone, Dr. Licia Capra, Prof. Domenico Ursino, Dr. Fabian Abel, Prof. Lora Aroyo, Prof. Geert-Jan Houben. Without their efforts and precious contributions and ideas, all the work done during my Ph.D. studies would not have been possible. I owe my gratitude to all my colleagues of the Ph.D. School who gladdened these years of study in Messina, and to all my friends who supported me even during those periods in which I was too absorbed in my work to reciprocate.

I dedicate this Thesis to my family. To my parents, who taught me what it means to set a goal and work hard to reach it, who have always supported and encouraged me throughout my studies, showing me that where there is a will there is a way. To my sister, the most brilliant person I have ever known. A bright future awaits her.

Contents

List of Figures
List of Tables
1 Introduction
2 Fundamentals
  2.1 Formal Conventions
  2.2 Graph Theory
    2.2.1 Notion of Graph and Main Properties
    2.2.2 Centrality Measures
3 Information Extraction from Web Sources
  3.1 Background and Related Literature
  3.2 Web Data Extraction Systems
    3.2.1 Definition
    3.2.2 Classification Criteria
  3.3 Applications
    3.3.1 Enterprise Applications
    3.3.2 Social Applications
    3.3.3 A Glance on the Future
  3.4 Techniques
    3.4.1 Used Approaches
    3.4.2 Wrappers
    3.4.3 Semi-Automatic Wrapper Generation
    3.4.4 Automatic Wrapper Generation
    3.4.5 Wrapper Induction
    3.4.6 Wrapper Maintenance
  3.5 Automatic Wrapper Adaptation
    3.5.1 Primary Goals
    3.5.2 Details
    3.5.3 Simple Tree Matching
    3.5.4 Weighted Tree Matching
    3.5.5 Web Wrappers
    3.5.6 Automatic Adaptation of Web Wrappers
    3.5.7 Experimentation
    3.5.8 Discussion of Results
4 Mining and Analysis of Facebook
  4.1 Background and Related Literature
    4.1.1 Data Collection from Online Social Networks
    4.1.2 Similarity Detection
    4.1.3 Influential User Detection
  4.2 Sampling the Facebook Social Graph
    4.2.1 The Structure of the Social Network
    4.2.2 The Sampling Architecture
    4.2.3 Breadth-first-search Sampling
    4.2.4 Uniform Sampling
    4.2.5 Data Preparation
  4.3 Network Analysis Aspects
    4.3.1 Definitions
    4.3.2 Experimentation
    4.3.3 Privacy Settings
    4.3.4 Degree Distribution
    4.3.5 Diameter and Clustering Coefficient
    4.3.6 Connected Components
5 Network Analysis and Models of Online Social Networks
  5.1 Background and Related Literature
    5.1.1 Social Networks and Models
    5.1.2 Recent Studies and Current Trends
  5.2 Features of Social Networks
    5.2.1 The “Small-World”
    5.2.2 Scale-free Degree Distributions
    5.2.3 Emergence of a Community Structure
  5.3 Models of Social Networks
    5.3.1 The Erdős-Rényi Model
    5.3.2 The Watts-Strogatz Model
    5.3.3 The Barabási-Albert Model
  5.4 Community Structure
    5.4.1 Definition of Community Structure
    5.4.2 Discovering Communities
    5.4.3 Models Representing the Community Structure
  5.5 Experimental Evaluation
    5.5.1 Description of Adopted Online Social Network Datasets
    5.5.2 Topological Properties
6 Community Structure in Facebook
  6.1 Background and Related Literature
    6.1.1 Community Detection in Literature
  6.2 Community Structure Discovery
    6.2.1 Label Propagation Algorithm
    6.2.2 Fast Network Community Algorithm
    6.2.3 Experimentation
    6.2.4 Methodology of Investigation
  6.3 Community Structure
    6.3.1 Building the Community Meta-network
    6.3.2 Meta-network Analysis
    6.3.3 Discussion of Results
  6.4 The Strength of Weak Ties
    6.4.1 Methodology
    6.4.2 Experiments
7 A Novel Centrality Measure for Social Networks
  7.1 Background and Related Literature
  7.2 Centrality Measures and Applications
    7.2.1 Centrality Measure in Social Networks
    7.2.2 Recent Approaches for Computing Betweenness Centrality
    7.2.3 Application of Centrality Measures in Social Network Analysis
  7.3 Measuring Edge Centrality
    7.3.1 Design Goals
    7.3.2 κ-Path Centrality
    7.3.3 The Algorithm for Computing the κ-Path Edge Centrality
    7.3.4 Novelties Introduced by our Approach
    7.3.5 Comparison of the ERW-Kpath and WERW-Kpath algorithms
  7.4 Experimentation
    7.4.1 Robustness
    7.4.2 Performance
    7.4.3 Analysis of Edge Centrality Distributions
  7.5 Applications of our approach
    7.5.1 Data Clustering
    7.5.2 Semantic Web
    7.5.3 Understanding User Relationships in Virtual Communities
  7.6 Fast Community Structure Detection
    7.6.1 Background
    7.6.2 Design Goals
    7.6.3 Fast κ-path Community Detection
  7.7 Experimental Results
    7.7.1 Synthetic Networks
    7.7.2 Online Social Networks
    7.7.3 Extension to Biological Networks
8 Conclusions
  8.1 Findings
  8.2 Future Work
  8.3 List of Publications
Bibliography

List of Figures

3.1 Examples of XPaths over trees, selecting one (A) or multiple (B) items.
3.2 A and B are two similar labeled rooted trees.
3.3 Robust Web object detection in Lixto VD.
3.4 Configuration of wrapper adaptation in Lixto VD.
3.5 Wrapper adaptation process.
3.6 Diagram of the Web wrapper creation, execution and maintenance flow.
4.1 Architecture of the data mining platform.
4.2 State diagram of the data mining process.
4.3 Screenshot of the Facebook visual crawler.
4.4 Node degree distribution BFS vs. UNI Facebook sample.
4.5 CCDF node degree distribution BFS vs. UNI Facebook sample.
4.6 Node degree probability distribution BFS vs. UNI Facebook sample.
4.7 Hops and diameter in Facebook.
4.8 Clustering coefficient in Facebook.
4.9 Connected components in Facebook.
4.10 Degree vs betweenness centrality in Facebook.
5.1 Generative model: Erdős-Rényi (94).
5.2 Generative model: Newman-Watts-Strogatz (219).
5.3 Generative model: Watts-Strogatz (274).
5.4 Generative model: Barabási-Albert (14).
5.5 Generative model: Holme-Kim (149).
5.6 Community structure of the Erdős-Rényi (94) model.
5.7 Community structure of the Newman-Watts-Strogatz (219) model.
5.8 Community structure of the Watts-Strogatz (274) model.
5.9 Community structure of the Barabási-Albert (14) model.
5.10 Community structure of the Holme-Kim (149) model.
5.11 Node degree distributions (log–log scale).
5.12 Effective diameters (log-normal scale).
5.13 Community structure analysis (log–log scale).
6.1 FNCA power law distributions on the “Uniform” sample.
6.2 LPA power law distributions on the “Uniform” sample.
6.3 FNCA power law distribution on the BFS sample.
6.4 LPA power law distribution on the BFS sample.
6.5 FNCA vs. LPA (UNI).
6.6 FNCA vs. LPA (BFS).
6.7 Jaccard distribution: FNCA vs. LPA (UNI).
6.8 Jaccard distribution: FNCA vs. LPA (BFS).
6.9 Heat-map: FNCA vs. LPA (UNI).
6.10 Heat-map: FNCA vs. LPA (BFS).
6.11 Meta-network representing the community structure (UNI with LPA).
6.12 Meta-network degree and clustering coefficient distribution (UNI).
6.13 Meta-network hops and shortest paths distribution (UNI).
6.14 Meta-network weights vs. strengths distribution (UNI).
6.15 Meta-network heat-map of the distribution of connections (UNI).
6.16 Distribution of strong vs. weak ties in Facebook.
6.17 CCDF of strong vs. weak ties in Facebook.
6.18 Density of weak ties among communities.
6.19 Link fraction as a function of the community size.
7.1 Example of assignment of normalized degrees and initial edge weights.
7.2 Robustness test on Wiki-Vote.
7.3 Execution time with respect to network size.
7.4 κ-paths centrality values distribution on Wiki-Vote.
7.5 κ-paths centrality values distribution on CA-HepPh.
7.6 κ-paths centrality values distribution on CA-CondMat.
7.7 κ-paths centrality values distribution on Cit-HepTh.
7.8 κ-paths centrality values distribution on Facebook.
7.9 κ-paths centrality values distribution on Youtube.
7.10 Effect of different κ = 5, 10, 20 on Wiki-Vote.
7.11 Effect of different κ = 5, 10, 20 on CA-HepPh.
7.12 Effect of different κ = 5, 10, 20 on CA-CondMat.
7.13 Effect of different κ = 5, 10, 20 on Cit-HepTh.
7.14 Effect of different κ = 5, 10, 20 on Facebook.
7.15 Effect of different κ = 5, 10, 20 on Youtube.
7.16 Normalized mutual information test using the synthetic benchmarks.
7.17 Arabidopsis Thaliana gene-coexpression network (cluster 1).
7.18 Arabidopsis Thaliana gene-coexpression network (cluster 2).
7.19 Arabidopsis Thaliana gene-coexpression network (cluster 3).
List of Tables

3.1 W and M matrices for each matching subtree.
3.2 Experimental results of automatic wrapper adaptation.
4.1 HTTP requests flow of the crawler: authentication and mining steps.
4.2 BFS dataset description (crawling period: 08/01-10/2010).
4.3 “Uniform” dataset description (crawling period: 08/11-20/2010).
5.1 Datasets and results: d(q) is the effective diameter, γ and σ, respectively, the exponents of the power law node degree and community size distributions, Q the network modularity.
6.1 Results of the community detection on Facebook.
6.2 Representation of community structures.
6.3 Similarity degree of community structures.
6.4 The presence of outliers in our community structures.
6.5 Features of the meta-networks representing the community structure for the uniform sample.
7.1 Datasets adopted in our experimentation.
7.2 Analysis by using similarity coefficient J(τn), correlation ρX,Y and Euclidean distance L2(X, Y).
7.3 Results of the FKCD algorithm on the adopted datasets.

1 Introduction

The increasing popularity of Online Social Networks (OSNs) is witnessed by the huge number of users that Facebook, Twitter, etc. acquired in a short amount of time. The growing accessibility of the Web, through several media, allows most users a 24/7 online presence and encourages them to build an online mesh of relationships. As OSNs become the tools of choice for connecting people, we expect that their structure will increasingly mirror real-life society and relationships. At the same time, with an estimated 13 million transactions per second (at peak), Facebook is one of the most challenging Computer Science artifacts, posing several optimization, scalability and robustness challenges. The essential feature of Online Social Networks is the friendship relation between participants. It consists, mainly, of a permission to consult each other's friends list and posted content: news, photos, links, blog posts, etc.; such permission can be mutual. In this Thesis we collect data from OSNs and we analyze their structure by adopting graph-theoretic models; for example, we consider the Facebook friendship network as the (undirected) graph having Facebook users as vertices and edges representing their friendship relations.

The analysis of OSN connections is a fascinating topic on multiple levels. First, a complete study of the structure of large real (i.e., offline) social communities was impossible, or at least very expensive, before, even at fractions of the scale considered in OSN analysis. Second, OSN data are clearly delimited by structural constraints provided by the OSN platform itself, whereas real-life relations are often hard to identify precisely.
The interpretation of these data opens up new fascinating research questions, for example: (i) is it possible to study OSNs with the tools of traditional Social Network Analysis, as in (272) and (203)? (ii) To what extent is the behavior of OSN users comparable to that of people in real-life social networks (118)? (iii) What are the topological characteristics of the relationship network (for example, friendship, in the case of Facebook) of OSNs (4)? (iv) And what about their structure and evolution (169)? To address these questions, further Computer Science research is needed to design and develop the tools required to acquire and analyze data from massive OSNs. First, scalability is an issue faced by anyone who wants to study a large OSN independently of the commercial organization that owns and operates it. Moreover, proper social metrics need to be introduced in order to identify and evaluate features of the considered OSN. In 2010, some authors (125) estimated the crawling overhead needed to collect the whole Facebook graph at 44 Terabytes of data. Even when such data could be acquired and stored locally (which however raises storage issues related to social network compression (34, 35)), it is non-trivial to devise and implement effective functions that traverse and visit the graph or even evaluate simple metrics. In the literature, extensive research has been conducted on sampling techniques for large graphs; only recently, however, have studies shed light on the bias that those methodologies may introduce (170). That is, depending on the method by which the graph has been explored, certain features may be over- or under-represented with respect to the actual graph. Our long-term research on these topics is presented in this Thesis. We describe in detail the architecture and functioning modes of our ad hoc Web crawler designed to extract data from Online Social Networks (such as Facebook), by which, even on modest computational resources, we can extract large samples containing several million profiles and the connections among them. Two recently collected samples of Facebook containing about 8 million nodes each are described and analyzed in detail. To comply with the Facebook end-user license, data are anonymized upon extraction, hence we never store users' sensitive data. Next, we describe similar experiments performed on different OSNs, whose datasets have been made freely available on the Web. Moreover, this Thesis focuses on the problem of community structure detection inside Online Social Networks. A community is formally defined as a sub-structure of the network of connections among users in which the density of relationships within the members of the community is much greater than the density of connections among communities. From a structural perspective, this is reflected by a graph which is very sparse almost everywhere but dense in local areas, corresponding to the communities. Different motivations to investigate the community structure of a network exist. For example, it is possible to put into evidence interesting properties or hidden information about the network itself. Moreover, individuals that belong to the same community may share some similarities, possibly have common interests, or be connected by a specific relationship in the real world.
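To make the density-based notion of community concrete, the following short Python sketch (purely illustrative; the toy graph, the partition into communities and all identifiers are assumptions of this example, not data from the Thesis) compares the density of edges falling inside communities with the density of edges running between them.

```python
from itertools import combinations

# Hypothetical toy friendship graph: two dense groups joined by one bridge edge.
edges = {(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)}
communities = [{1, 2, 3}, {4, 5, 6}]  # assumed known partition

def has_edge(u, v):
    return (u, v) in edges or (v, u) in edges

def density(present, possible):
    """Fraction of the possible edges that are actually present."""
    return present / possible if possible else 0.0

# Edges whose endpoints share a community, over all intra-community pairs.
intra_present = sum(has_edge(u, v)
                    for c in communities for u, v in combinations(sorted(c), 2))
intra_possible = sum(len(c) * (len(c) - 1) // 2 for c in communities)

# Edges across the two groups, over all inter-community pairs.
inter_present = sum(has_edge(u, v) for u in communities[0] for v in communities[1])
inter_possible = len(communities[0]) * len(communities[1])

print("intra-community density:", density(intra_present, intra_possible))  # 1.0
print("inter-community density:", density(inter_present, inter_possible))  # ~0.11
```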
These aspects give rise to a number of commercial and scientific applications; in the first category we count, for example, marketing and competitive intelligence investigations and recommender systems. In fact, users belonging to the same community could share tastes or interests in similar products. In the latter category, models of disease propagation and of information diffusion have been largely investigated in the context of social networks. The two different samples of Facebook we collected have been analyzed in order to detect and describe the underlying community structure of this Online Social Network, highlighting its features with respect to existing mathematical models that try to describe the community structure of social networks. Our findings show that the community structure of the network emerges both from a quantitative and a qualitative perspective. In this panorama, not only from a scientific perspective but also for commercial or strategic motivations, the identification of the principal actors inside a network or inside a community is very important. Such an identification requires the definition of an importance measure (also called centrality) and the ranking of nodes and/or edges of the network graph on the basis of such a measure. The simplest approaches for computing centrality consider only the local topological properties of a node/edge in the social network graph: for instance, the most intuitive node centrality measure is the degree of a node, i.e., the number of social contacts of a user. Unfortunately, local measures of centrality, whose estimation is computationally feasible even on large networks, do not produce very faithful results (56). For these reasons, many authors suggested considering the whole social network topology to compute centrality values. This consideration generated a new family of centrality measures, called global measures. Some examples of global centrality measures are closeness (248) and betweenness centrality (for nodes (112) and edges (124)). Unfortunately, the problem of computing the exact value of centrality for each node/edge of a given graph is computationally demanding – or even unfeasible – as the size of the analyzed network grows. Therefore, the need for fast, even if heuristic, techniques to compute centrality arises, and this is currently a relevant research topic in Social Network Analysis. For this reason, the last part of this Thesis is devoted to introducing a novel measure of centrality for the edges of a social network. This measure is called κ-path edge centrality. In our approach, the computation of edge centrality is viewed as an information propagation problem. In detail, if we assume that multiple messages are generated and propagated within a social network, an edge is considered “central” if it is frequently exploited to diffuse information. Relying on this idea, we simulate message propagation through random walks on the social network graph. In addition, we assume that random walks are simple and of bounded length, up to a constant and user-defined value κ. The former assumption is made because loops should not be allowed, in order to prevent messages from getting trapped; the latter because, as in (115), we assume that the more distant two nodes are, the less they influence each other.

The main contributions of this Thesis, therefore, can be summarized as follows:

1. In Chapter 2 we introduce some fundamental concepts that will be widely adopted throughout the Thesis.
First, we define some formal mathematical conventions used to fix the terminology, the notation and a few other mathematical devices typical of graph theory. Moreover, we formalize the notion of a graph and its properties, and we introduce some of the metrics for measuring the characteristics of a graph.

2. Chapter 3 is intended as a brief survey of the problem of the extraction of information from Web sources, in particular concerning fields of application and the approaches and techniques developed over the years. Particular attention is given to the problems related to the extraction of information from Web Social Media, and in particular from Online Social Networks. Most of the techniques discussed have been applied in order to devise a platform to extract information from Online Social Networks in an automatic and robust way.

Contribution and Impact (1)
• Our research line on the automatic extraction of information from Web sources focused in particular on the problem of the automatic adaptation of data extraction procedures. In this context we devised a novel algorithmic solution that improves the state of the art of algorithms for the comparison of tree data structures.
• Our research results have been published in Lecture Notes in Computer Science (104), as a book chapter (102), and presented at an Artificial Intelligence conference (103). Moreover, a brief survey of the state of the art of the Web data extraction discipline has been compiled and is currently under review (106).
• Our technique of automatic wrapper adaptation has been harnessed in commercial products such as Lixto (http://www.lixto.com) and in the context of Web content acquisition to build digital earth geo-spatial platforms (284).

3. In Chapter 4 we discuss the architecture of the Web mining platform that we devised, also called a crawler, which permitted us to extract different samples of the Facebook social network and to analyze the topological features of this graph. In particular, we investigate two different techniques of Web mining, the first based on the concept of visual extraction and the latter based on a more efficient but less accurate sampling procedure. Moreover, two different sampling algorithms are devised and applied. The first one, often referred to as uniform sampling, is a rejection-based sampling algorithm that is known to produce a sample unaffected by bias by construction – we used this sample as a ground truth. The second algorithm is the well-known breadth-first traversal, which has recently been shown to possibly introduce bias in the case of incomplete visits (170). We verified the structural differences between the two acquired samples, and we put into evidence the topological features that describe these samples with respect to mathematical models often adopted as generative models for social networks.

Contribution and Impact (2)
• Our research activity on mining and analyzing social networks has been published in the context of international conferences on Web mining (57, 58) and as a book chapter (59).
• The published papers have been reported and cited on different occasions, such as in a large-scale verification of the topological features of the Facebook social graph (269) and in an important study on the verification of the strength of weak ties theory (136).
• Our techniques of social network mining and analysis have been applied to develop a tool which has been exploited in different contexts, such as the forensic analysis of call networks (60). Some studies on the similarity of Facebook users have also been presented as a book chapter (78).
• The datasets acquired during the mining of the Facebook social network have been released in an anonymized format, freely available to the research community for further study.

4. The findings presented in the previous point are extended in Chapter 5 to other Online Social Networks. In particular, special attention is given to the problem of characterizing the scale-free degree distribution, the small-world property and the composition of the community structure of different social networks other than Facebook, such as Arxiv, Youtube and Wikipedia. Different mathematical models are compared against real-world data, putting into evidence the lack of models that incorporate all the features of actual data, which makes it difficult to formalize with mathematical methods those features that make these Online Social Networks unique. This aspect puts into evidence that at the moment it is still necessary to collect data from Web sources in order to correctly analyze these networks, and leaves space for further research in the area of mathematical modeling of social networks.

Contribution and Impact (3)
• The analysis of the topological features of different social networks has been published in the journal Communications in Applied and Industrial Mathematics (105).
• Some further investigation of the behavior of social network users across different social systems has been presented in the journal ACM Transactions on Intelligent Systems and Technology (77).

5. Chapter 6 presents our findings regarding the community structure of the Facebook social network. First of all, we show that the community structure of this Online Social Network presents a clear power law distribution of community sizes. This result is independent of the algorithm adopted to discover the community structure, and even (although in a less evident way) of the sampling methodology adopted to collect the datasets. As far as the qualitative analysis of the results is concerned, we also put into evidence that this community structure is qualitatively well defined. We finally investigate the validity of the strength of weak ties theory on this social network. It is well known that this theory is strictly related to the community structure of a network, and our findings support this aspect by providing quantitative proofs.

Contribution and Impact (4)
• Our research on the characteristics of the community structure of the Facebook social network has been published in the International Journal of Social Network Mining (99) and submitted to the journal PLoS ONE (101).
• The data on the community structure of Facebook have been released and have been exploited in the research activity of different international groups, such as the Information Systems Group of the University of Oxford.

6. In Chapter 7 we propose an approach based on random walks to compute edge centrality and we present an algorithm to efficiently compute an approximation of the proposed measure. We provide results of the performed experimentation, showing that our approach is able to generate reproducible results even if it relies on random walks.
In conclusion, we discuss a possible application of this measure to devise a novel community detection algorithm.

Contribution and Impact (5)
• The contribution to the research community of the novel measure of κ-path edge centrality has been presented in the journal Knowledge-Based Systems (81). Moreover, the community detection algorithm based on this technique has been discussed at an international conference on intelligent systems (79). Some further extensions have been applied to improve recommendation in social network platforms that allow tagging resources, whose results have been presented at the same conference (80).
• A clustering algorithm based on the concept of κ-path edge centrality has also been successfully applied in the context of Bioinformatics, in particular for the clustering of gene co-expression networks (100), i.e., networks representing the correlation among a set of genes that cooperate, in response to a given event happening in the cell, to produce a response. The affinity of this problem with the problem of finding clusters of users (i.e., communities) inside a social network is straightforward.

7. Finally, in Chapter 8 the Thesis concludes, presenting some future directions of research.

2 Fundamentals

In this Chapter we introduce some fundamental concepts that will be widely adopted throughout the Thesis. First, we recall some formal mathematical conventions used to define the terminology and notation, and a few other mathematical devices, in Section 2.1. Some simple definitions and notations typical of graph theory are discussed in Section 2.2. In that Section we formalize the characteristics of a graph and its properties, and we introduce some of the metrics for measuring the characteristics of a graph.

2.1 Formal Conventions

Here we introduce the Landau notation, adopted throughout this Thesis to describe the asymptotic behavior of functions when evaluating the computational complexity of the introduced algorithms. Given two functions f : N → N and g : N → N we say that:
• f ∈ O(g) if there exist n0 ∈ N and c ∈ R, c > 0, such that f(n) ≤ c · g(n) holds ∀n ≥ n0.
• f ∈ Ω(g) if g ∈ O(f).
• f ∈ Θ(g) if there exist n0 ∈ N and c0, c1 ∈ R, c0, c1 > 0, such that c0 · g(n) ≤ f(n) ≤ c1 · g(n) holds ∀n ≥ n0.

2.2 Graph Theory

A network is defined as an object composed of entities interacting with each other through some connections. The natural means to mathematically represent a network is a graph model.

2.2.1 Notion of Graph and Main Properties

In graph theory we define a graph G = (V, E) as an abstract representation of a set of objects V, namely vertices (or nodes), and a set E of edges which connect pairs of vertices. We denote by V(G) the set of vertices V and, similarly, by E(G) the set of edges of the graph G. The numbers of vertices and edges, namely the cardinalities of V and E, are commonly represented by n and m, or denoted by |V| and |E|. Two vertices connected by an edge e are called endvertices of e, and they are adjacent or neighbors.

Directed and undirected graphs

Graphs can be classified considering the nature of the edges: i) undirected and ii) directed. In undirected graphs the edges connect a pair of vertices without considering the order of the endvertices. Given two vertices u and v, an undirected edge connecting them is denoted by {u, v} or, in short notation, uv or euv.
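As a concrete, purely illustrative rendering of these definitions, the following Python sketch stores a small undirected graph as a dictionary of neighbor sets; the example graph and helper names are hypothetical and not part of the formal treatment.

```python
# A small undirected graph G = (V, E) stored as an adjacency structure:
# each vertex is mapped to the set of its neighbors N(v).
V = {1, 2, 3, 4}
E = {(1, 2), (1, 3), (2, 3), (3, 4)}

adjacency = {v: set() for v in V}
for u, v in E:
    # An undirected edge {u, v} makes u and v neighbors of each other.
    adjacency[u].add(v)
    adjacency[v].add(u)

n, m = len(V), len(E)          # cardinalities |V| and |E|
print("n =", n, "m =", m)      # n = 4 m = 4
print("N(3) =", adjacency[3])  # neighbors of vertex 3: {1, 2, 4}
```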
In directed graphs the edges (also called arcs) connect pairs of vertices in a way that takes their order into account. Each edge leaves a vertex u, namely the origin (or tail), and reaches a destination v (or head), and is represented by the notation (u, v) or, in short notation, uv (which is different from vu) or euv (which differs from evu). The irreversible transformation of a directed graph G = (V, E) into its underlying undirected graph G′ = (V, E′) maintains the same set of vertices V and generates a new set of edges E′ which contains an undirected edge between two vertices u, v ∈ V if (u, v) ∈ E or (v, u) ∈ E, for all the edges in E.

Multigraphs and loops

A set of edges E is a multiset if it contains multiple instances of the same edge. Two identical instances of the same edge are called parallel edges. If a graph contains parallel edges it is called a multigraph. This may happen both for undirected and directed graphs. On the other hand, a graph is called simple if there is only one instance of each edge, so it does not contain parallel edges. If an edge connects a vertex to itself it is called a loop. Because loops create unpleasant effects on the behavior of most of the algorithms working on graphs, the standard assumption throughout this Thesis will be that graphs do not contain loops unless otherwise specified; a graph having this property is called loop-free.

Weighted graphs

Assigning a weight to the edges of a graph can be useful for several purposes. Let ω : E → R be a weight function defined for each edge e ∈ E of a graph G = (V, E), which assigns a weight to the edges. This property can be adopted to describe particular characteristics of edges, such as costs, e.g. the physical distance between two vertices in G, the monetary cost of traveling through that particular connection, the traffic through that specific link, etc. Moreover, weights are often used to describe the capacity of a connection, as in network flow problems. Graphs which include a weight function ω for characterizing edges are called weighted graphs. Unweighted graphs can be considered as a special case of weighted graphs in which the weight function assumes the value ω(e) = 1 ∀e ∈ E.

Degrees

The notion of degree of vertices in a graph requires some distinctions among undirected and directed graphs, weighted and unweighted graphs, and multigraphs. First of all, the degree d(v) of a vertex v in an undirected graph G = (V, E) is equal to the number of edges e ∈ E that have v as an endvertex. The set of all vertices connected to v, i.e. its neighborhood, is denoted by N(v). For directed graphs, the concept of degree is split into two categories: i) out-degree and ii) in-degree. The out-degree of a vertex v in a directed graph G = (V, E) is the number of edges e ∈ E that have v as tail, and its notation is d+(v). Similarly, the in-degree of the same vertex is the number of edges that have v as head, and it is denoted by d−(v). In weighted graphs, the notion of degree is calculated by summing the weight of each considered edge, instead of just counting their number. In unweighted multigraphs, parallel edges are counted according to the number of instances that appear in the graph; in weighted ones, the weights of the parallel edges are summed instead.
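The sketch below (illustrative only; the example graphs and function names are assumptions of this sketch) computes the quantities just defined: the degree d(v) in an undirected graph and the out-degree d+(v) and in-degree d−(v) in a directed graph.

```python
# Undirected graph: degree d(v) = number of incident edges.
undirected_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]

def degree(v, edges):
    """d(v): number of undirected edges having v as an endvertex."""
    return sum(v in e for e in edges)

# Directed graph: out-degree d+(v) and in-degree d-(v).
directed_edges = [(1, 2), (2, 1), (2, 3), (3, 1)]

def out_degree(v, arcs):
    """d+(v): number of arcs leaving v (v is the tail)."""
    return sum(u == v for u, _ in arcs)

def in_degree(v, arcs):
    """d-(v): number of arcs entering v (v is the head)."""
    return sum(w == v for _, w in arcs)

print(degree(3, undirected_edges))    # 3
print(out_degree(2, directed_edges))  # 2
print(in_degree(1, directed_edges))   # 2
```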
For undirected graphs we can calculate the mean degree of the graph G as

d(G) = \frac{1}{|V|} \sum_{v \in V} d(v) = \frac{2E}{V}    (2.1)

We indicate with the notation ∆(G) the maximum degree of such an undirected graph, and with δ(G) the minimum degree. Thus, we define a graph as regular if all the vertices have the same degree, and k-regular if this degree is equal to k. In the case of directed graphs we define the mean in-degree – Eq. (2.2) – and the mean out-degree – Eq. (2.3) – as

d_I(G) = \frac{1}{|V|} \sum_{v \in V} d_I(v)    (2.2)

d_O(G) = \frac{1}{|V|} \sum_{v \in V} d_O(v)    (2.3)

Density

A measure related to the degree of vertices in a graph G = (V, E) is the density, i.e. the proportion between the actual number of edges and the maximum possible number of edges with respect to the number of vertices. Since the maximum possible number of edges among V vertices is given by the binomial coefficient \binom{V}{2} = \frac{V(V-1)}{2}, the density is

\Delta = \frac{E}{\binom{V}{2}} = \frac{2E}{V(V-1)}    (2.4)

and is normalized in the interval [0, 1]. If all the edges are present, the graph is said to be complete, its density is equal to 1 and all the node degrees are equal to V − 1. There is a direct relationship between the density of a graph and the mean degree of its vertices. The sum of the degrees is equal to 2E (because each edge is counted twice); thus, combining Eq. 2.1 and Eq. 2.4 we obtain

\Delta = \frac{d(G)}{V - 1}    (2.5)

which defines the correlation between ∆ and d(G); the density of a graph thus assumes the meaning of the average proportion of edges incident with vertices in the graph.

Subgraphs

The notion of subgraph is defined by taking a subset of vertices and edges of a graph G = (V, E), say G′ = (V′, E′), such that V′ ⊆ V, E′ ⊆ E and each e ∈ E′ has its endvertices in V′. Let W be a proper subset of the vertices V of the graph G = (V, E); then G − W is the graph obtained by deleting the vertices contained in W and all their incident edges. Similarly, if F is a proper subset of the edges E, G − F results in the graph G′ = (V, E − F).

Walks and paths

A walk over a graph G = (V, E) is a sequence of vertices v and edges e, in alternation, such that the vertices and edges taken at each step are adjacent. A walk from the vertex v0 to vn is denoted by the sequence v0, e1, v1, ..., en, vn, where ei = {vi−1, vi} in the case of undirected graphs and ei = (vi−1, vi) otherwise. For generic graphs, the length l(w) of a walk w is given by the number of edges it contains, while for weighted graphs the weight ω(w) of a walk is represented by the sum of the weights of the included edges. A walk is defined as a path if all the included edges ei are distinct, and is defined as simple if it does not include repeated vertices. If the starting vertex coincides with the destination of the path, it is called a cycle. A cycle with no repeated vertex is called a simple cycle. There always exists a path between any pair of vertices in a connected (see further) undirected graph.

Random walks and Markov chains

We recall the definition of discrete probability space, Markov chains and random walks as in (49). Let (Ω, P) be a discrete probability space, with Ω a (non-empty) finite or countably infinite set and P a mapping from the power set ℘(Ω) of Ω to the real numbers such that:
• P(A) ≥ 0, ∀A ⊆ Ω
• P(Ω) = 1
• P\left(\bigcup_{i \in \mathbb{N}} A_i\right) = \sum_{i \in \mathbb{N}} P(A_i), for all sequences (A_i)_{i \in \mathbb{N}} of pairwise disjoint sets from ℘(Ω).

Ω is said to be a sample space and any subset of Ω is said to be an event.
Let X be a random variable, i.e., a mapping from the sample space Ω to the real numbers, whose image is denoted by I_X = X(Ω). The expected value of a random variable X is

E(X) = \sum_{\omega \in \Omega} X(\omega) \cdot P(\omega).

Let (X_t)_{t \in \mathbb{N}_0} be a sequence of random variables X_t with I_{X_t} ⊆ S, where S is a state set, together with an initial distribution q_0 that maps S to \mathbb{R}^+_0 and satisfies \sum_{s \in S} q_0(s) = 1. Such a sequence is said to be a Markov chain iff it satisfies the so-called Markov condition, that is, ∀t > 0, ∀I ⊆ {0, 1, ..., t − 1} and ∀i, j, s_k ∈ S,

P(X_{t+1} = j \mid X_t = i, \forall k \in I : X_k = s_k) = P(X_{t+1} = j \mid X_t = i)

holds true. A random walk on a simple directed graph G = (V, E) is a Markov chain with S = V and

P(X_{t+1} = v \mid X_t = u) = \begin{cases} \frac{1}{d^+(u)} & \text{if } (u, v) \in E \\ 0 & \text{otherwise.} \end{cases}

At each step, the random walk selects a random outgoing edge e from the current vertex and moves to the destination vertex reached by means of e. In an analogous fashion it is possible to define a random walk on undirected graphs.

Connected components

An undirected graph G = (V, E) is defined as connected if there exists a path connecting every pair of vertices within the graph. Otherwise, the graph is called disconnected. It is possible to induce a connected subgraph from a disconnected graph. If the graph G = (V, E) is disconnected, a subgraph G′ = (V′, E′) is defined as a connected component if it is a connected subgraph of G. Moreover, G′ is defined as the largest connected component if it is the maximal connected subgraph inducible over G. It is possible to check whether an undirected graph is connected or, otherwise, to compute the largest connected component by adopting two classical algorithms, namely depth-first search (DFS) and breadth-first search (BFS) (72), with cost O(n + m). Directed graphs are defined as strongly connected if there exists a directed path connecting each ordered pair of vertices. Similarly to the undirected case, it is possible to induce the strongly connected components of a directed graph by finding strongly connected subgraphs (the largest strongly connected component is the maximal one). It is possible to find the strongly connected components of a directed graph by adopting an improved DFS algorithm (264). A directed graph whose underlying undirected graph is connected is called weakly connected.

Shortest paths and single-source shortest paths

For a weighted graph, the weight ω(p) of a path p is defined as the sum of the weights of the edges included in p. The shortest path between a pair of vertices u and v with respect to the weight function ω is the path with the smallest weight among all the paths connecting u and v. For unweighted graphs, the shortest path is simply the path which includes the smallest number of edges. Given a graph G = (V, E), a weight function ω : E → R and a source vertex s ∈ V, the single-source shortest path problem (SSSP) consists in computing all the shortest paths from s to any other vertex in G. SSSP can be solved with the classical Dijkstra's algorithm (87) in O(m + n log n) time if the edge weights are non-negative; otherwise, the Bellman-Ford algorithm (72) can be used, with cost O(mn), provided the graph does not contain cycles of negative weight. For a given unweighted graph the problem can be solved using the BFS algorithm in O(m + n) time (72).
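As a minimal illustration of the random-walk definition above (the graph, the walk length and all identifiers are hypothetical choices made for this example), the following Python sketch simulates a short random walk on a directed graph by repeatedly moving along a uniformly chosen outgoing arc.

```python
import random

# Hypothetical directed graph stored as out-neighbor lists.
out_neighbors = {
    1: [2, 3],
    2: [3],
    3: [1, 4],
    4: [1],
}

def random_walk(start, steps, rng=random):
    """Simulate a random walk: from u, move to v with probability 1/d+(u)."""
    walk = [start]
    current = start
    for _ in range(steps):
        successors = out_neighbors.get(current, [])
        if not successors:                 # dangling vertex: the walk stops
            break
        current = rng.choice(successors)   # uniform over outgoing arcs
        walk.append(current)
    return walk

random.seed(42)                            # reproducible example
print(random_walk(start=1, steps=6))       # e.g. [1, 3, 4, 1, 2, 3, 1]
```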
Geodesic, distance and eccentricity

The geodesic is defined as the shortest path between two vertices u and v in a graph G = (V, E). The geodesic distance, or simply the distance, between these two vertices is defined as the number of edges in the geodesic connecting them (44). If there is no path between two vertices, the geodesic distance between them is infinite (in a disconnected graph there exists at least one pair of vertices whose distance is infinite). The eccentricity of a vertex v is the greatest geodesic distance between v and any other vertex in G. The eccentricity can range in the interval [1, V − 1]. The diameter and several measures of centrality (see further), such as the center and the centroid of a graph, are based on the concept of eccentricity.

Diameter and all-pairs shortest paths

The concept of diameter of a graph is important because it quantifies how far apart the farthest two vertices in the graph are. This information is fundamental for several applications of networks (see for example (7, 38, 67)). The diameter D of a graph G = (V, E) is equal to the maximum value of eccentricity over the vertices of the graph. In other words, it is the greatest distance between any pair of vertices. The diameter can range in the interval [1, V − 1]. In a disconnected graph the diameter is infinite, but it is possible to find the diameter of its largest connected component. Similarly, it is possible to find the diameter of a subgraph. To find the value of the diameter, the so-called all-pairs shortest paths (APSP) problem must be solved: the greatest value among the APSP is the diameter of the graph. The APSP problem can be solved using classical algorithms such as Floyd-Warshall (72) in O(n^3) time, by solving the SSSP problem n times, with a cost of O(mn + n^2 log n), or, in the special case of unweighted undirected graphs, with the Seidel algorithm (252), whose cost is O(M(n) log n), where M(n) is the cost of multiplying two n × n matrices containing small integers (e.g. by using the Coppersmith-Winograd algorithm (71), whose cost is O(n^{2.376})).

Matrix representation for graphs

The information contained in a graph G = (V, E) can be stored in several ways, for example in matrix form. The most common ways to store a graph are the so-called sociomatrix and the incidence matrix. Here we describe the matrices used to store simple unweighted undirected graphs, and then a generalization to treat directed graphs and weighted graphs.

The Sociomatrix

The sociomatrix, or adjacency matrix, denoted by X, contains entries which indicate whether two vertices are adjacent or not. A sociomatrix X of size n × n can efficiently describe an unweighted undirected graph G = (V, E) containing n vertices. Rows and columns of the sociomatrix both represent the index of each vertex in the graph, and are labeled 1, 2, ..., n. Each entry xij of the sociomatrix indicates whether the pair of vertices ni and nj is adjacent or not. Usually, there is a 1 in the (i, j)-th cell if there is an edge connecting ni and nj in the graph, and a 0 otherwise. Thus, if vertices ni and nj are adjacent xij = 1, otherwise xij = 0. Because the graph is undirected, the matrix is symmetric with respect to its diagonal, thus xij = xji ∀i ≠ j. Sociomatrices are widely adopted for storing undirected network structures because of some particular properties; for example, social networks (see further) give rise to sparse sociomatrices, thus it is convenient to adopt techniques of compact matrix decomposition (256, 262) for efficiently storing data.
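A small, purely illustrative Python sketch tying these notions together: it builds the sociomatrix of a hypothetical undirected graph and then computes geodesic distances, eccentricities and the diameter by running a BFS from every vertex (all names and the example graph are assumptions of this sketch, not part of the formal definitions).

```python
from collections import deque

n = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3)]  # hypothetical undirected graph

# Sociomatrix (adjacency matrix): symmetric, x[i][j] = 1 iff {i, j} is an edge.
x = [[0] * n for _ in range(n)]
for i, j in edges:
    x[i][j] = x[j][i] = 1

def bfs_distances(source):
    """Geodesic distances from `source` via BFS (O(n+m) with adjacency lists;
    here the sociomatrix is scanned row by row for clarity)."""
    dist = [None] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if x[u][v] == 1 and dist[v] is None:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Eccentricity of v = greatest geodesic distance from v; diameter = max eccentricity.
eccentricities = [max(bfs_distances(v)) for v in range(n)]
print("eccentricities:", eccentricities)  # [3, 2, 2, 2, 3]
print("diameter:", max(eccentricities))   # 3
```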
The incidence matrix

Another possible representation of an undirected graph G = (V, E) through a matrix is the incidence matrix, usually denoted by I. It stores which edges are incident with which vertices, indexing the former on the columns and the latter on the rows; thus the dimension of the matrix I is |V| × |E|. A matrix entry I_{ij} contains 1 if the vertex n_i is incident with the edge e_j, and 0 otherwise. Both the incidence matrix and the sociomatrix contain all the information required to describe the represented graph.

Generalization for directed graphs

The sociomatrix form can be intuitively extended to represent directed graphs. In this case, the sociomatrix X has elements x_{ij} equal to 1 if there exists an arc connecting the vertex n_i (tail) to the vertex n_j (head), and 0 otherwise. Formally, x_{ij} = 1 if (n_i, n_j) ∈ E(G). In other words, the (i, j)-th cell of X contains 1 only if the directed edge e_{ij} connects vertex n_i to vertex n_j; because the graph is directed, the entry x_{ij} may be different from the entry x_{ji}. Thus, a sociomatrix representing a directed graph is not, in general, symmetric.

Generalization for weighted graphs

Sociomatrices can also be adapted to represent weighted graphs. The entry in the cell x_{ij} represents the weight ω(e_{ij}) associated to the edge e_{ij} connecting vertices n_i and n_j, both for undirected and directed graphs. The sociomatrices for weighted graphs have the same properties as the related unweighted versions; thus, a sociomatrix representing a weighted undirected graph is symmetric with respect to its diagonal.

2.2.2 Centrality Measures

Within graph theory and network analysis, there are various measures of the centrality of a vertex within a graph that determine the relative importance of that vertex within the graph. Four measures of centrality are widely used in network analysis: degree centrality, closeness, betweenness, and eigenvector centrality.

Degree centrality

The most intuitive measure of centrality of a vertex in a network is the degree centrality. Given a graph G = (V, E) represented by means of its adjacency matrix A, in which a given entry A_{ij} = 1 if and only if i and j are connected by an edge, and A_{ij} = 0 otherwise, the degree centrality C_D(v_i) of a vertex v_i ∈ V is defined as

C_D(v_i) = d(v_i) = \sum_j A_{ij}.    (2.6)

The idea behind the degree centrality is that the importance of a vertex is determined by the number of vertices adjacent to it, i.e., the larger the degree, the more important the vertex. Even though in real-world networks only a small number of vertices have high degrees, and the degree centrality is a rough measure, it is adopted very often because of the low computational cost required for its computation. There exists a normalized version of the degree centrality, defined as follows:

C'_D(v_i) = \frac{d(v_i)}{n - 1}

where n represents the number of vertices in the network.

Closeness centrality

A more accurate measure of centrality of a vertex is represented by the closeness centrality (248). The closeness centrality relies on the concept of average distance, defined as

D_{avg}(v_i) = \frac{1}{n - 1} \sum_{j \ne i}^{n} g(v_i, v_j)

where g(v_i, v_j) represents the geodesic distance between vertices v_i and v_j. The closeness centrality C_C(v_i) of a vertex v_i is defined as

C_C(v_i) = \left[ \frac{1}{n - 1} \sum_{j \ne i}^{n} g(v_i, v_j) \right]^{-1}.    (2.7)

In practice, the closeness centrality measures the importance of a vertex based on how close the given vertex is to the other vertices. "Central" vertices, with respect to this measure, are important as they can reach the whole network more quickly than non-central vertices. Different generalizations of this measure for weighted and disconnected graphs have been proposed in (223).
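A compact Python sketch of the two measures just introduced (Equations 2.6 and 2.7) may be useful; it assumes an unweighted connected graph, given both as an adjacency matrix and as adjacency lists, and the toy data are illustrative.

import numpy as np
from collections import deque

def geodesic_distances(adj, source):
    """Unweighted geodesic distances from source, computed via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree_centrality(A):
    """C_D(v_i) = sum_j A_ij (Eq. 2.6); divide by n - 1 for the normalized version."""
    return A.sum(axis=1)

def closeness_centrality(adj, i, n):
    """C_C(v_i) = [ (1/(n-1)) * sum_j g(v_i, v_j) ]^{-1} (Eq. 2.7), connected graphs only."""
    dist = geodesic_distances(adj, i)
    return (n - 1) / sum(d for v, d in dist.items() if v != i)

# Toy connected undirected graph (illustrative): the path 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
print(degree_centrality(A))                 # [1 2 2 1]
print(closeness_centrality(adj, 1, n=4))    # 3 / (1 + 1 + 2) = 0.75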
Betweenness centrality

A more complex measure of centrality is the betweenness centrality (112, 113). It relies on the concept of shortest paths, previously introduced: in order to compute the betweenness centrality of a vertex, it is necessary to count the number of shortest paths that pass across the given vertex. The betweenness centrality C_B(v_i) of a vertex v_i is computed as

C_B(v_i) = \sum_{v_s \ne v_i \ne v_t \in V} \frac{\sigma_{st}(v_i)}{\sigma_{st}}    (2.8)

where σ_{st} is the number of shortest paths between vertices v_s and v_t, and σ_{st}(v_i) is the number of shortest paths between v_s and v_t that pass through v_i. Vertices with high values of betweenness centrality are important because they maintain an efficient way of communication inside a network and foster the diffusion of information.

Eigenvector centrality

Another way to assign centrality to a vertex is based on the idea that a vertex with many central neighbors should be central as well. This measure is called eigenvector centrality and establishes that the importance of a vertex is determined by the importance of its neighbors. The eigenvector centrality C_E(v_i) of a given vertex v_i is

C_E(v_i) ∝ \sum_{v_j \in N_i} A_{ij} C_E(v_j)    (2.9)

where N_i is the neighborhood of the vertex v_i; in matrix form this reads x ∝ Ax, which implies Ax = λx. The centrality scores correspond to the top (principal) eigenvector of the adjacency matrix A.

Conclusion

In this Chapter we introduced the formal conventions that will be used throughout the rest of the Thesis and the fundamental concepts of graph theory that underlie the formalization of problems related to network analysis. In detail, in the first part we discussed notions related to graphs, their structure and their properties; in the second part we introduced the centrality measures. These concepts will be extensively adopted in those Chapters of this Thesis that cover the topological analysis of different Online Social Networks.

3 Information Extraction from Web Sources

This Chapter is intended as a brief survey on the problem of the extraction of information from Web sources, in particular concerning fields of application, approaches and techniques developed in recent years. Particular attention is given to the problems related to the extraction of information from social media, and in particular from Online Social Networks. This Chapter is structured as follows. In the first part, we focus on fields of application of Web information extraction tools, developed with classic and novel techniques. In particular, we discuss enterprise, social and scientific applications that are strictly interconnected with Web data extraction tasks. In the second part, we introduce in more detail the techniques related to the functioning of a Web information extraction platform, discussing the concept of Web wrappers and the problems related to their generation and maintenance. Summarizing, in Section 3.1 we present related work, providing references to useful surveys of this discipline. In Section 3.2 we track a complete profile of Web data extraction systems and in Section 3.3 we classify fields of application of Web data extraction techniques, focusing, in particular, on enterprise and social applications.
In Section 3.4 we discuss in detail the problem of wrapper generation, induction and maintenance, and other notable approaches. Finally, in Section 3.5 we present our solution of automatic adaptation of Web wrappers. 3.1 Background and Related Literature The Computer Science scientific literature counts many valid surveys on the Web data extraction problem. Laender et al. (174), in 2002, presented a notable survey, offering a rigorous taxonomy to classify Web data extraction systems. They introduced a set of criteria and a qualitative analysis of various Web data extraction tools. In the same year Kushmerick (173) tracked a profile of finite-state approaches to the problem, including the analysis of wrapper induction and maintenance, natural language processing and hidden Markov models. Kuhlins and Tredwell (168) surveyed tools to generate wrappers already 17 3. INFORMATION EXTRACTION FROM WEB SOURCES in 2003: information could not be up-to-date but analyzing the approach is still very interesting. Again on the wrapper induction problem, Flesca et al. (108) and Kaiser and Miksch (157) discussed approaches, techniques and tools. The latter in particular modeled a representation of an Information Extraction system architecture. Chang et al. (62) introduced a tri-dimensional categorization of Web data extraction systems, based on task difficulties, techniques used and degree of automation. Fiumara (107) applied these criteria to classify four new tools that are also presented here. Sarawagi published an illuminating work on Information Extraction (250): anybody who intends to approach this discipline should read it. To the best of our knowledge, the work from Baumgartner et al. (26) is the most recent short survey on the state-of-the-art of the discipline. 3.2 3.2.1 Web Data Extraction Systems Definition We can generically define a Web data extraction system as a sequence of procedures that extracts information from Web sources (174). From this generic definition, we can infer two fundamental aspects of the problem: • Interaction with Web pages • Generation of a wrapper Baumgartner et al. (26) define a Web data extraction system, as “a software extracting, automatically and repeatedly, data from Web pages with changing contents, and that delivers extracted data to a database or some other application”. This is the definition that better fits the modern view of the problem of the Web data extraction as it introduces three important aspects: • Automation and scheduling • Data transformation, and • Use of the extracted data The following five points cover techniques used to solve the problem of Web data extraction. Interaction with Web pages The first phase of a generic Web data extraction system is the Web interaction (271): Web sources, usually represented as Web pages, but also as RSS/Atom feeds (142), Microformats (160) and so on, could be visited by users, both in visual and textual mode, or just simply inputted to the system by the URL of the document(s) containing the information. Some commercial systems, Lixto1 for first but also Kapow Mashup Server2 , include a Graphical User Interface for fully visual and interactive navigation of HTML pages, integrated with data extraction tools. 1 http://www.lixto.com/ 2 http://kapowtech.com/ 18 3.2 Web Data Extraction Systems The state-of-the-art is represented by systems that support the extraction of data from pages reached by deep Web navigation (21), i.e. 
simulating the activity of users clicking on DOM elements of pages, through macros or, more simply, filling HTML forms. These systems also support the extraction of information from dynamically generated Web pages, usually built at run-time as a consequence of the user request, filling a template page with data from some database. The other kind of pages are commonly called static Web pages, because of their static content. Generation of a wrapper Just for now, we generically define the concept of wrapper as a procedure extracting unstructured information from a source and transforming them into structured data (153, 291). A Web data extraction system must implement the support for wrapper generation and wrapper execution. We will cover approaches and techniques, used by several systems, later. Automation and scheduling The automation of page access, localization and extraction is one of the most important features included in recent Web data extraction systems (232): the capability to create macros to execute multiple instances of the same task, including the possibility to simulate the click stream of the user, filling forms and selecting menus and buttons, the support for AJAX technology (117) to handle the asynchronous updating of the page, etc. are only some of the most important automation features. Also the scheduling is important, e.g. if a user wants to extract data from a news website updated every 5 minutes, many of the most recent tools let her/him to setup a scheduler, working like a cron, launching macros and executing scripts automatically and periodically. Data transformation Information could be wrapped from multiple sources, which means using different wrappers and also, probably, obtaining different structures of extracted data. The steps between extraction and delivering are called data transformation: during these phases, such as data cleaning (240) and conflict resolution (204), users reach the target to obtain homogeneous information under a unique resulting structure. The most powerful Web data extraction systems provide tools to perform automatic schema matching from multiple wrappers (239), then packaging data into a desired format (e.g. a database, XML, etc.) to make it possible to query data, normalize structure and de-duplicate tuples. Use of extracted data When the extraction task is complete and acquired data are packaged in the required format, this information is ready to be used; the last step is to deliver the package, now represented by structured data, to a managing system (e.g. a native XML DBMS, a RDBMS, a data warehouse, a CMS, etc.). In addition to all the specific fields of application covered later in this 19 3. INFORMATION EXTRACTION FROM WEB SOURCES work, acquired data can be also generically used for analytical (55) or statistical purposes (30) or simply to republish them under a structured format. 3.2.2 Classification Criteria A taxonomy for characterizing Web data extraction tools Laender et al. (174) presented a widely accepted taxonomy to classify systems, according to techniques used to generate wrappers: Languages for Wrapper Development: Before the birth of some languages specifically studied for wrapper generation (e.g. Elog, the Lixto Web extraction language (22)) extraction systems relied on standard scripting languages, like Perl, or general purpose programming languages, like Java to create the environment for the wrapper execution. 
HTML-aware Tools: Some tools rely on the intrinsic formal structure of HTML to extract data, using HTML tags to build the DOM tree (e.g. RoadRunner (74), Lixto (23) and W4F (249)). Boronat (42) analyzed and compared the performance of common Web data extraction tools.

NLP-based Tools: Natural Language Processing techniques were born in the context of Information Extraction (IE) (29, 194, 278). They were applied to the Web data extraction problem in order to solve specific problems such as the extraction of facts from speech transcriptions in forums, email messages, newspaper articles, resumes, etc.

Wrapper Induction Tools: These tools generate rule-based wrappers, automatically or semi-automatically: usually they rely on delimiter-based extraction criteria inferred from formatting features (12).

Modeling-based Tools: Relying on a set of primitives to compare with the structure of the given page, these tools can find one or more objects in the page matching the primitive items (93, 121). Strong domain knowledge is needed, but it is a good approach for the extraction of data from Web sources based on templates and dynamically generated pages.

Ontology-based Tools: These techniques do not rely on the page structure, but directly on the data. Ontologies can be applied successfully to specific well-known application domains, e.g. social networks and communities (201) or bio-informatics (151). Some works try to apply the ontological approach to generic domains of Web data extraction (144) or to tables (263).

Qualitative analysis criteria

Laender et al. (174) remarked that this taxonomy is not intended to strictly classify Web data extraction systems, because it is common that some tools fit well in two or more groups. Maybe for this reason, they extended these criteria including:

Degree of automation: Determines the amount of human effort needed to run a Web data extraction tool.

Support for complex objects: Nowadays, Web pages are based on the rich-content paradigm, so the objects included in Web sources can be complex. Only some systems can handle these kinds of data.

Page contents: Page contents can be divided into two categories: unstructured text and semi-structured data. The first fits better with NLP-based and Ontology-based tools, the latter with the others.

Ease of use: Availability of a Graphical User Interface (GUI) is a must for last-generation tools. Platforms often feature wizards to create wrappers, WYSIWYG editor interfaces, integration with Web browsers, etc. Lixto, Denodo1, Kapow Mashup Server, WebQL2, Mozenda3 and Visual Web Ripper4 all use advanced GUIs to ease the user experience.

XML output: XML is simply the standard, according to the W3C5, for the semantic Web representation of data. The capability to output the extracted data in XML format is nowadays a requirement, at least for commercial software.

Support for Non-HTML sources: NLP-based tools fit better in this domain: this is a great advantage, because a very large amount of data is stored on the Web in semi-structured texts (emails, documentation, logs, etc.).

Resilience and adaptiveness: Web sources are usually updated without any forewarning, i.e. the frequency of updates is not known a priori, thus systems that generate wrappers with a high degree of resilience show better performance. Also the adaptiveness of the wrapper, when moving from a specific Web source to another within the same domain, is a great advantage.
1 http://www.denodo.com/ 2 http://www.ql2.com/ 3 http://www.mozenda.com 4 http://www.visualwebripper.com/ 5 http://www.w3.org/ 21 3. INFORMATION EXTRACTION FROM WEB SOURCES 3.3 Applications On the one hand the Web is moving to semantics and enabling machine-to-machine communication: it is a slow, long term evolution, but it has started, in fact. Extracting data from Web sources is one of the most important steps of this process, because it is the key to build a solid level of reliable semantic information. On the other hand, Web 2.0 extends the way humans consume the Web with social networks, rich client technologies, and the consumer as producer philosophy. Hence, new evolvements put further requirements on Web data extraction rules, including to understand the logic of Web applications. In the literature of the Web data extraction discipline, many works cover approaches and techniques adopted to solve some particular problems related to a single or, sometimes, a couple of fields of application. The aim of this Section is to survey and analyze some of the possible applications that are strictly interconnected with Web data extraction tasks. In the following, we describe a taxonomy in which key application fields, heavily involved with data extraction from Web sources, are divided into two families, enterprise applications and social applications. 3.3.1 Enterprise Applications We classify here software applications and procedures with a direct, subsequent or final commercial scope. Context-aware advertising Thanks to Applied Semantic, Inc.1 first, and Google, who bought their ’AdSense’ advertising solution later, this field captured a great attention. The main underlying principle is to present to the final user, commercial thematized advertisements together with the content of the Web page the user is reading, ensuring a potential increase of the interest in the ad. This aim can be reached analyzing the semantic content of the page, extracting relevant information, both in the structure and in the data, and then contextualizing the ads content and placement in the same page. Contextual advertising, compared to the old concept of Web advertising, represents an intelligent approach to provide useful information to the user, statistically more interested in thematized ads, and a better source of income for advertisers. Customer care Usually medium/big-sized companies, with customers support, receive a lot of unstructured information like emails, support forum discussions, documentation, shipment address information, credit card transfer reports, phone conversation transcripts, etc.: the capability of extracting this information eases their categorization, inferring underlying relationships, populating own structured databases and ontologies, etc. Actually NLP-based techniques are the best approach to solve these problems. 1 http://www.appliedsemantics.com/ 22 3.3 Applications Database building This is a key concept in the Web marketing sector: generically we can define the concept of database building as the activity of building a database of information about a particular domain. Fields of application are countless: financial companies could be interested in extracting financial data from the Web, e.g. scheduling these activities to be executed automatically and periodically. Also the real estate market is very florid: acquiring data from multiple Web sources is an important task for a real estate company, for comparison, pricing, co-offering, etc. 
Companies selling products or services probably want to compare their pricing with other competitors: the extraction of products pricing is an interesting application of Web data extraction systems. Finally we can list other related tasks involved in the Web data extraction: duplicating an on-line database, extracting dating sites information, capturing auction information and prices from on-line auction sites, acquiring job postings from job sites, comparing betting information and prices, etc. Software Engineering Extracting data from websites became interesting also for Software Engineering: Web 2.0 is usually strictly related to the concept of Rich Internet Applications (RIAs), Web applications characterized by an high degree of interaction and usability, inherited from the similarity to desktop applications. Amalfitano et al. (9) are developing a reverse engineering approach to abstract finite states machines representing the client-side behavior offered by RIAs. Business Intelligence and Competitive Intelligence Baumgartner et al. (24, 25, 27) deeply analyzed how to apply Web data extraction techniques and tools to improve the process of acquiring market information. A solid layer of knowledge is fundamental to optimize the decision-making activities and a huge amount of public information could be retrieved on the Web. They illustrate how to acquire these unstructured and semistructured information; using Lixto to access, extract, clean and deliver data, it is possible to gather, transform and obtain information useful to business purposes. It is also possible to integrate these data with other common platforms for Business Intelligence (BI), like SAP1 or Microsoft Analysis Services (199). Wider, the process of gathering and analyzing information for business purposes is commonly called Competitive Intelligence (CI), and is strictly related to data mining (145). Zanasi (288) was the first to introduce the possibility of acquiring these data, through data mining processes, on public domain information. Chen et al. (64) developed a platform, that works more like a spider than like a Web data extraction system, which represents a useful tool to support operations of CI providing data from the Web. In BI scenarios the main requirements include scalability and efficient planning strategies to extract as much data as possible with the smallest number of possible resources in time and space. The requirements for tools in the area of Web application testing are to deal well with Ajax/dynamic HTML, to create robust test scripts, to efficiently maintain test scripts, to execute test runs 1 http://www.sap.com 23 3. INFORMATION EXTRACTION FROM WEB SOURCES and create meaningful reports, and, unlike other application areas, the support of multiple state-of-the-art browsers in various versions is an absolute must. One widely used open source tool for Web application testing is Selenium 1 . 3.3.2 Social Applications One can say that social is the engine of Web 2.0: many websites evolved into Web applications built around users, letting them to create a Web of links between people, to share thoughts, opinions, photos, travel tips, etc. Here we are mainly listing all these kind of applications born and grown in the Web and through the Web, thanks to User-Generated Contents (UGC), built from users for users. 
Online Social networks Online Social Networks are the most important expression of change in the use of the World Wide Web, and are often considered a key step of the evolution to Web 2.0: millions of people creating a digital social structure, made of nodes (individuals, entities, groups or organizations) and connections (i.e. ties), - representing relationships between nodes, sometimes implementing hierarchies - sharing personal information, relying on platforms of general purpose (e.g. Facebook, MySpace, etc.) or thematized (e.g. Twitter for micro-blogging, Flickr for photo-sharing, etc.), all sharing a common principle: on-line socialization. Online Social Networks attracted an enormous attention, by both academic and industries, and many works studied several aspects of the phenomenon: extracting relevant data from social networks is a new interesting problem and field of application for Web data extraction systems. This thesis focuses even on this problem as a part of the process of mining and analyzing data from Online Social Networks. Actually does not exist a tool specifically studied to approach and solve this problem: people are divided by ethics on mining personal data from Online Social Networks. Regardless of moral disputes, some interesting applications of the Web data extraction from Online Social Networks are discussed in this Thesis in the following Chapters, and can be summarized as: i) acquiring information from relationships between nodes in order to study topological features of Online Social Networks; ii) analyzing statistical data in order to infer new information, to support better recommendations to users and to find users with similar tastes or interests; iii) discovering the community structure of a given Online Social Network . Social bookmarks Another form of social application is the social bookmarking, a new kind of knowledge sharing: users post links to Web sources of interest, into platforms with the capability of creating folksonomies, collaboratively tagging contents. Extracting relevant information from social bookmarks should be faster and easier than in other fields: HTML-aware and model-based extraction systems should fit very well with the semi-structured templates used by most common social bookmarking services. Once extracted, information is used to retrieve resources and to provide recommendations to users of the social bookmark websites (235, 236). Sometimes data 1 http://seleniumhq.org/ 24 3.3 Applications are distributed under structured formats like RSS/Atom so acquiring this information is easier than with traditional HTML sources. Comparison shopping One of the most appreciated among Web social services is the comparison shopping, through platforms with the capability to compare products or services, going from simple prices comparison to features comparison, technical sheets comparison, user experiences comparison, etc. These services heavily rely on Web data extraction, using websites as sources for data mining and a custom internal engine to make possible the comparison of similar items. Many Web stores today also offer personalization forms that make the extraction tasks more difficult: for this reason many last-generation commercial Web data extraction systems (e.g. Lixto, Kapow Mashup Server, UnitMiner1 , Bget2 ) provide support for deep navigation and dynamic content pages. 
Opinion sharing Complementary to comparison shopping, there exist the opinion sharing services: users want to express opinions on products, experiences, services they enjoyed, etc. The most common form of opinion sharing is represented by blogs, containing articles, reviews, comments, tags, polls, charts, etc. All this information usually lacks of structure, so its extraction is a huge problem, also for current systems, because of the billions of Web sources currently available. Sometimes model-based tools fit well, taking advantage of common templates (e.g. Wordpress, Blogger, etc.), other times natural language processing techniques fit better. Kushal et al. (76) approached the problem of opinion extraction and the subsequent semantic classification of reviews of products. Another form of opinion sharing in semi-structured platforms is represented by Web portals that let users to write unmoderated opinions on various topics and products. Citation databases Citation database building is one of the most intensive Web data extraction fields of application: CiteSeer3 , Google Scholar, DBLP4 and Publish or Perish5 are brilliant examples of applying Web data extraction to approach and solve the problem of collect digital publications, extract relevant data – for example, references and citations – and build a structured database, where users can perform searches, comparisons, count of citations, cross-references, etc. 1 http://www.qualityunit.com/unitminer/ 2 http://www.bget.com/ 3 http://citeseer.ist.psu.edu/ 4 http://www.informatik.uni-trier.de/ley/db 5 http://www.harzing.com/pop.html 25 3. INFORMATION EXTRACTION FROM WEB SOURCES 3.3.3 A Glance on the Future Bio-informatics and Scientific Computing A growing field of application of the Web data extraction is bio-informatics: on the World Wide Web it is very common to find medical sources, in particular regarding bio-chemistry and genetics. Bio-informatics is an excellent example of the application of scientific computing – refer e.g. to (85) for a selected scientific computing project. Plake et al. (233) worked on PubMed1 - the biggest repository of medical-scientific works that covers a broad range of topics - extracting information and relationships to create a graph; this structure could be a good starting point to proceed in extracting data about proteins and protein interactions. This information can be usually found, not in Web pages, rather it is available as the PDF of the corresponding scientific papers. In the future, Web data extraction could be extensively used also to classify these documents: approaches to solve this problem are going to be developed, inherited, both from Information Extraction and Web data extraction systems, because of the semi-structured format of PostScript-based files. On the other hand, Web services play a dominant role in this area as well, and another important challenge is the intelligent and efficient querying of Web services as investigated by the ongoing SeCo project2 . Web harvesting One of the most attractive future applications of the Web data extraction is Web Harvesting (276): Gatterbauer (119) defines it as “the process of gathering and integrating data from various heterogeneous Web sources”. The most important aspect (although partially different from specific Web data extraction) is that, during the last phase of data transformation, the amount of gathered data is many times greater than the extracted data. 
The work of filtering and refining information from Web sources ensures that extracted data lie in the domain of interest and are relevant for users: this step is called integration. The Web harvesting remains an open problem with large margin of improvement: because of the billions of Web pages, it is a computational problem, also for restricted domains, to crawl enough sources from the Web to build a solid ontological base. There is also a human engagement problem, correlated to the degree of automation of the process: when and where humans should interact with the system of Web harvesting? Should be a fully automatic process? What degree of precision could be accepted for the harvesting? All these questions are still open for future works. Projects such as the DIADEM3 at Oxford University tackle the challenge for fully automatic generation of wrappers for restricted domains such as real estate. 3.4 Techniques This Section focuses in particular on the techniques adopted to design Web mining platforms. Concepts such as the Web wrappers, that will be extensively adopted in the next Chapters, are here introduced in details. 1 http://www.pubmed.com/ 2 http://www.search-computing.it/ 3 http://web.comlab.ox.ac.uk/projects/DIADEM/ 26 3.4 Techniques 3.4.1 Used Approaches The techniques first used to extract data from Web pages were inherited from Information Extraction (IE) approaches. Kaiser and Miksch (157) categorized them into two groups: learning techniques and knowledge engineering techniques. Sarawagi (250) calls them hand-coded or learning-based approach and rule-based or statistical approach respectively. These definitions explain the same concept: the first method is used to develop a system that requires human expertise to define rules (usually regular expressions or program snippets) to perform the extraction. Both in hand-coded and learning-based approaches domain expertise is needed: people writing rules and training the system must have programming experience and a good knowledge of the domain. Also in some approaches of the latter family, in particular the rule-based ones, a strong familiarity with both the requirements and the functions is needed, so the human engagement is essential. Statistical methods are more effective and reliable in domains of unstructured data (like natural language processing problems, facts extraction from speeches, and automated text categorization (251)). 3.4.2 Wrappers A wrapper is a procedure that implements a family of algorithms, that seek and find the information the user needs to extract from an unstructured source, and transform them into structured data, merging and unifying this information for future processing. A wrapper life-cycle starts with its generation: it could be described and implemented manually, e.g. using regular expressions, or in an inductive way; wrapper induction (171) is one of the most interesting aspects of this discipline, because it introduces high level automation algorithms implementation; we can also count on hybrid approaches that make possible for users to generate semi-automatic wrappers by means of visual interfaces. Web pages change without forewarning, so wrapper maintenance is an outstanding aspect for ensuring the regular working of wrapperbased systems. Wrappers fit very well to the Web data extraction problem because HTML pages, although lacking in semantic structure, are syntactically structured: HTML is just a presentation markup language, but wrappers can use tags to infer underlying information. 
Wrappers succeeded where IE NLP-based techniques failed: often Web pages do not own a rich grammar structure, so NLP cannot be applied with good results. 3.4.3 Semi-Automatic Wrapper Generation Visual Extraction Under this category we classify techniques that make it possible to the users to build wrappers from Web pages of interest using a GUI and interactively, without any deep understanding of the wrapper programming language, as wrappers are generated automatically by the system relying on users’ directives. 27 3. INFORMATION EXTRACTION FROM WEB SOURCES Regular-Expression-Based approach: One of the most common approaches is based on regular expressions, which are an old, but still powerful, formal language used to identify strings or patterns in unstructured text, defining matching criteria. Rules could be complex so, writing them manually, could require too much time and a great expertise: wrappers based on regular expressions dynamically generate rules to extract desired data from Web pages. Usually writing regular expressions on HTML pages relies on the following criteria: word boundaries, HTML tags, tables structure, etc. A notable tool implementing regular-expression-based extraction is W4F (249), adopting an annotation approach: instead of putting users facing the HTML code, W4F eases the design of the wrapper through a wizard that allows users to select and annotate elements directly on the page; W4F builds the regular expression extraction rules of the annotated items and presents them to the user demanding her/him the optimization step. W4F extraction rules, besides match, could also implement the split expression, which separates words, annotating different elements on the same string. Logic-Based approach: Tools based on this approach successfully build wrappers through a wrapper programming language, considering Web pages not as simply text strings but as preparsed trees, representing the DOM of the page. Gottlob and Koch (134) formalized the first wrapping language, suitable for being incorporated into visual tools, satisfying the condition that all its constructs can be implemented through corresponding visual primitives: starting from the unranked labeled tree representing the DOM of the Web page, the algorithm relabels nodes, truncates the irrelevant ones, and finally returns a subset of original tree nodes, representing the selected data extracted. The extraction function of all these operations relies on the Monadic Datalogs (135). Authors demonstrated that Monadic Datalog over tree structures is equivalent to Monadic Second Order logic (MSO), and hence very expressive. However, unlike MSO, a wrapper in Monadic Datalog can be modeled nicely in a visual and interactive step-by-step manner. Baumgartner et al. (23) developed the Elog wrapping language as a possible implementation of a monadic datalog with minor restrictions, using it as the core extraction function of the Lixto Visual Wrapper (22): this tool provides a GUI to select, through visual specification, patterns in Web pages, in hierarchical order, highlighting elements of the document and specifying relationships among them; information identified in this way could be too general, so the system permits to add some restricting conditions, e.g. before/after, not-before/not-after, internal and range conditions. Finally, selected information are translated into XML using pattern names as XML element names, obtaining structured data from unstructured pages. 
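To give a flavor of the two families of extraction rules discussed above, the following Python sketch applies a regular-expression rule and an equivalent DOM/XPath rule to the same invented HTML fragment; the lxml library is assumed to be available, and neither rule is taken from W4F or Lixto.

import re
from lxml import html   # assumed available; any DOM library would do

page = """<html><body>
  <table id="results">
    <tr><td class="title">Product A</td><td class="price">10.50</td></tr>
    <tr><td class="title">Product B</td><td class="price">7.99</td></tr>
  </table>
</body></html>"""

# Regular-expression-based rule: matching criteria anchored on HTML tags and word boundaries.
prices_re = re.findall(r'<td class="price">([\d.]+)</td>', page)

# Tree-based rule: the same field located on the pre-parsed DOM tree via an XPath.
tree = html.fromstring(page)
prices_xp = tree.xpath('//table[@id="results"]/tr/td[@class="price"]/text()')

print(prices_re)   # ['10.50', '7.99']
print(prices_xp)   # ['10.50', '7.99']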
Spatial Reasoning A completely different approach, called Visual Box Model, exploits visual cues to understand the presence of tabular data in HTML documents, not strictly represented under the <table> element (120, 167): the technique is based on the X-Y cut OCR algorithm, relying on the Gecko1 rendering engine used by the Mozilla Web browser, to extract the CSS 2.0 visual box model, accessing the positional information through XPCOM. Cuts are recursively applied to the bitmap image (the rendering of the page) and stored into an X-Y tree, building a tree where ancestor nodes with leaves represent not-empty tables. Some secondary operations check 1 https://developer.mozilla.org/en/Gecko 28 3.4 Techniques that extracted tables contain useful information, because usually, although it is a deprecated practice, many Web developers use tables for structural and graphical issues. 3.4.4 Automatic Wrapper Generation By definition, the automatic wrapper generation implies no human interaction: techniques that recognize and extract relevant data were autonomously developed and systems with an advanced degree of automation and a high level of independent decision capability represent the state-of-the-art of the automatic wrapper generation approach. Automatic Matching RoadRunner (73, 74) is an interesting example of automatic wrapper generator: this system is oriented to data-intensive websites based on templates or regular structures. This system tackles the problem of automatic matching bypassing common features used by standard wrappers, typically using additional information, provided by users, labeling example pages, or by the system through automatic labeling, or a priori knowledge on the schema, e.g. on the page structure/template. In particular, RoadRunner relies on the fundamental idea of working with two HTML pages at a time in order to discover patterns while analyzing similarities and differences between structure and content of pages. Essentially RoadRunner can extract relevant information from any website containing at least two pages with similar structure: usually Web pages are dynamically generated and relevant data are positioned in the same area of the page, excluding small differences due, for example, to missing values. those kind of Web sources characterized by a common generation script are called class of pages. The problem is reduced to extracting the source dataset, thus generating a wrapper starting from the inference of a common structure from the two-page-based comparison. This system can handle missing and optional values and also small structural differences, adapting very well to all kinds of dataintensive Web sources (e.g. based on templates), relying on a solid theoretical background, that ensures a high degree of precision of the matching technique. Partial Tree Alignment Zhai and Liu (289, 290) theorized the partial tree alignment technique and developed a Web data extraction system based on it. This technique relies on the idea that information in Web documents usually are collected in contiguous regions of the page, called data record regions. Partial tree alignment consists in extracting these regions, e.g. using a tree matching algorithm, called tree edit distance. This approach works in two steps: i) segmentation, and ii) partial tree alignment. 
In the first phase the Web page is split in segments, without extracting data; this pre-processing step is fundamental because the system does not simply perform an analysis based on the DOM tree, but also relies on visual cues, like the spatial reasoning technique, trying to identify gaps between data records; it is useful also because it helps the process of extracting structural information from the HTML tag tree, in those situations when the HTML syntax is abused, e.g. using tabular structure instead of CSS to arrange the graphical aspect of the page. After that, the partial tree alignment algorithm is applied to data records earlier identified: each data record is extracted from its DOM sub-tree position, constituting the root of a new single tree, because each data record could be contained in more than one non-contiguous sub-tree in the original tag tree. Partial tree alignment approach implies the 29 3. INFORMATION EXTRACTION FROM WEB SOURCES alignment of data fields with certainty, excluding those which cannot be aligned, to ensure a high degree of precision; during this process no data items are involved. This, because partial tree alignment works only on tree tags matching, represented as the minimum cost, in terms of operations (e.g. node removal, node insertion, node replacement), to transform one node in another one. 3.4.5 Wrapper Induction Wrapper induction techniques differ from the latter essentially for the degree of automation: most of the wrapper induction systems need labeled examples provided during the training sessions, thus requiring human engagement. Actually, a couple of systems can obtain this information autonomously, representing, de facto, a hybrid approach between wrapper induction and automatic wrapper generation. In wrapper induction, extraction rules are learned during training sessions and then applied to extract data from Web pages similar to the example provided. Machine-Learning-Based approach Standard machine-learning-based techniques rely on training sessions to let the system acquire a domain expertise. Training a Web data extraction system, based on the machine-learning approach, requires a huge amount of labeled Web pages, before starting to work with an acceptable degree of precision. Manual labeling should provide both positive and negative examples, especially for different websites but also in the same website, in pages with different structure, because, usually, templates or patterns differ, and the machine-learning-based system should learn how to extract information in these cases. Statistical machine-learning-based systems were developed relying on conditional models (232) or adaptive search (268) as an alternative solution to human knowledge and interaction. Many wrapper induction systems were developed relying on the machine-learning approach: Flesca et al. (108) classified ShopBot, WIEN, SoftMealy, STALKER, RAPIER, SRV and WHISK (257), analyzing some particular features like support for HTML-documents, NLP, texts etc. Kushmerick developed the first wrapper induction system, WIEN (172), based on a couple of brilliant inductive learning techniques that enable the system to automatically label training pages, representing, de facto, a hybrid approach to speed-up the learning process. Although these hybrid features, WIEN has many limitations, e.g. it cannot handle missing values. 
SoftMealy, developed by Hsu and Dung (150), was the first wrapper induction system specifically designed for the Web data extraction: relying on non-deterministic finite state automata, SoftMealy also uses a bottom-up inductive learning approach to extract wrapping rules. During the training session the system acquires training pages represented as an automaton on all the possible permutations of Web pages: states represent extracted data, while state transitions represent extraction rules. STALKER (206) is a system for learning supervised wrappers with some affinity with SoftMealy but differing in the relevant data specification: a set of tokens is manually placed on the Web page identifying information that should be extracted, ensuring the capability of handling empty values, hierarchical structures and unordered items. Bossa et al. (43) developed a token based lightweight Web data extraction system called Dynamo that differs from STALKER because tokens are placed during the Web pages building to identify 30 3.5 Automatic Wrapper Adaptation elements on the page and relevant information. This system is viable strictly in such situations in which webmasters can modify the structure of Web pages providing tokens placement to help the extraction system. 3.4.6 Wrapper Maintenance Wrapper building, regardless the technique applied to generate it, is only one aspect of the problem of data extraction from Web sources: unlike static documents, Web pages dynamically change and evolve, and their structure may change, sometimes with the consequence that wrappers cannot successfully extract data. Actually, a critical step of the Web data extraction process is the wrapper maintenance: this can be performed manually, updating or rewriting the wrapper each time Web pages change; this approach could fit well for small problems, but is not-trivial if the pool of Web pages is extended (for example, a regular data extraction task include hundred thousand pages, usually dynamically generated and frequently updated). Kushmerick (173) defined the wrapper verification problem and, shortly, a couple of manual wrapper maintenance techniques were developed to handle simple problems. In the following, we analyze a viable practice presented in literature to automatically solve the problem of the wrapper maintenance, called schema-guided wrapper maintenance. Finally, we propose a novel technique of automatic wrapper adaptation that contributes to the state-of-the-art in this field. Schema-Guided Wrapper Maintenance Meng et al. (200) developed the SG-WRAM (Schema-Guided WRApper Maintenance) for Web data extraction starting from the observation that, changes in Web pages, even substantial, always preserve syntactic features (i.e. syntactic characteristics of data items like data patterns, string lengths, etc.), hyperlinks and annotations (e.g. descriptive information representing the semantic meaning of a piece of information in its context). They developed a Web data extraction system, providing schemes, starting from the wrapper generation, until to the wrapper maintenance: during the generation the user provides HTML documents and XML schemas, specifying mappings between them. Later, the system will generate extraction rules and then it will execute the wrapper to extract data, building a XML document with the specified XML schema; the wrapper maintainer checks extraction issues and provide an automatic repairing protocol for wrappers which fail the task because Web pages changed. 
The XML schema is in the format of a DTD (Document Type Definition) and the HTML document is represented as a DOM tree: SG-WRAM builds corresponding mappings between them and generates extraction rules in the format of an XQuery expression.

3.5 Automatic Wrapper Adaptation

We developed a novel method of automatic wrapper adaptation relying on the analysis of structural similarities between different versions of the same Web page. Our idea is to compare some helpful structural information, stored by applying the wrapper on the original version of the Web page, searching for similarities in the new one.

Figure 3.1: Examples of XPaths over trees, selecting one item (A) /html[1]/body[1]/table[1]/tr[1]/td[1] or multiple items (B) /html[1]/body[1]/table[1]/tr[2]/td.

3.5.1 Primary Goals

Regardless of the method of extraction implemented by the wrapping system (e.g. we can consider a simple XPath), the elements identified and represented as subtrees of the DOM tree of the Web page can be exploited to find similarities between two different versions. In the simplest case, the XPath identifies just a single element on the Web page (Figure 3.1.A); our idea is to look for elements, in the new Web page, sharing similarities with the original one, evaluating comparable features (e.g. subtrees, attributes, etc.); we call these elements candidates; among the candidates, the one showing the highest degree of similarity possibly represents the new version of the original element. It is possible to extend the same approach to the common case in which the XPath identifies multiple similar elements on the original page (e.g. an XPath selecting the results of a search in an online retail shop, represented as table rows, divs or list items) (Figure 3.1.B); it is possible to identify multiple elements sharing a similar structure in the new page, within a custom level of accuracy (e.g. establishing a threshold value of similarity). Once identified, the elements in the new version of the Web page can be extracted as usual, for example just by re-inducing the XPath (for the sake of simplicity let us assume that the given wrappers rely on XPath(s) to identify and extract elements of a Web page; since the provided model is very general, it could work with any DOM-based model of data extraction and its adoption is straightforward in most cases). Our purpose is to define some general rules to enable the wrapper to face the problem of automatically adapting itself to extract information from the new version of the Web page. We implemented this approach in a commercial tool, Lixto.

The most efficient way to acquire some structural information about the elements the original wrapper extracts is to store it inside the definition of the wrapper itself, for example by generating signatures representing the DOM subtrees of the extracted elements from the original Web page, stored as a tree diagram, a simple XML document or even the HTML itself. This avoids the need to store the whole original page, ensuring better performance and efficiency. This technique requires just a few settings during the wrapper definition step: the user enables the automatic wrapper adaptation feature and sets an accuracy threshold.
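As a rough sketch of how such signatures could be stored inside a wrapper definition, assuming XPath-based rules and the lxml library, the following Python fragment serializes the subtrees of the currently matched elements; the page, the dictionary layout and the threshold value are illustrative assumptions, not the Lixto format.

from lxml import html, etree   # assumed available

def store_signatures(page_source, xpath):
    """Store, inside the wrapper definition, a signature (the serialized DOM subtree)
    of every element currently matched by the given XPath (illustrative sketch)."""
    tree = html.fromstring(page_source)
    return [etree.tostring(el, encoding="unicode") for el in tree.xpath(xpath)]

original_page = "<html><body><table><tr><td>item 1</td></tr><tr><td>item 2</td></tr></table></body></html>"
wrapper = {
    "xpath": "/html/body/table/tr/td",
    "adaptation_enabled": True,     # the few settings mentioned above
    "similarity_threshold": 0.85,   # illustrative accuracy threshold
}
wrapper["signatures"] = store_signatures(original_page, wrapper["xpath"])
print(wrapper["signatures"][0])     # '<td>item 1</td>'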
During the execution of the wrapper, if some XPath definition does not match a node, the wrapper adaptation algorithm automatically starts and tries to find the new version of the missing node.

3.5.2 Details

First of all, to establish a measure of similarity between two trees we need to find some comparable properties between them. In HTML Web pages, each node of the DOM tree represents an HTML element defined by a tag (or, otherwise, free text). The simplest way to evaluate the similarity between two elements is to compare their tag names. Elements also carry some common attributes (e.g. id, class, etc.) and some type-related attributes (e.g. href for anchors, src for images, etc.); it is possible to exploit this information for additional checks, constraints and comparisons. The algorithm selects candidates among subtrees sharing the same root element or, in some cases, comparable but not identical elements, analyzing tags. This is very effective in cases of deep modification of the structure of an object (e.g. conversion of tables into divs). As discussed in the Section above, several approaches have been developed to analyze similarities between HTML trees; for our purpose we improved a version of the simple tree matching algorithm, originally introduced by Selkow (253); we call it weighted tree matching. There are two important novel aspects we introduce in facing the problem of automatic wrapper adaptation: first, we exploit previously acquired information through a smart and focused usage of the tree similarity comparison, thus adopting a consolidated approach in a new field of application; moreover, we contributed some focused changes to the algorithm itself, improving its behavior in measuring the similarity of HTML trees.

3.5.3 Simple Tree Matching

Let d(n) be the degree of a node n (i.e. the number of first-level children) and let T(i) be the i-th subtree of the tree rooted at node T; a possible implementation of the simple tree matching is given in Algorithm 1.

Algorithm 1 SimpleTreeMatching(T', T'')
1: if T' has the same label of T'' then
2:   m ← d(T')
3:   n ← d(T'')
4:   for i = 0 to m do
5:     M[i][0] ← 0;
6:   end for
7:   for j = 0 to n do
8:     M[0][j] ← 0;
9:   end for
10:  for all i such that 1 ≤ i ≤ m do
11:    for all j such that 1 ≤ j ≤ n do
12:      M[i][j] ← Max(M[i][j−1], M[i−1][j], M[i−1][j−1] + W[i][j]) where W[i][j] = SimpleTreeMatching(T'(i−1), T''(j−1))
13:    end for
14:  end for
15:  return M[m][n] + 1
16: else
17:  return 0
18: end if

Figure 3.2: A and B are two similar labeled rooted trees.

Advantages of adopting this algorithm, which has been shown to be quite effective for Web data extraction (162, 289), are multiple: for example, the simple tree matching algorithm evaluates the similarity between two trees by producing the maximum matching through dynamic programming, without computing insertion, relabeling and deletion operations; moreover, approximate tree edit distance algorithms rely on complex implementations to achieve good performance, whereas the simple tree matching and similar algorithms are very simple to devise.
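A direct Python transcription of Algorithm 1 may help in reading it; the minimal node representation (a label plus a list of children) and the toy trees below are illustrative assumptions.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def simple_tree_matching(t1, t2):
    """Python transcription of Algorithm 1: maximum matching via dynamic programming."""
    if t1.label != t2.label:
        return 0
    m, n = len(t1.children), len(t2.children)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(t1.children[i - 1], t2.children[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[m][n] + 1

# Two toy labeled trees differing by one leaf (illustrative).
a = Node("a", [Node("b", [Node("d"), Node("e")]), Node("c")])
b = Node("a", [Node("b", [Node("d")]), Node("c")])
print(simple_tree_matching(a, b))   # 4 matched nodes: a, b, d, c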
The computational cost is O(n^2 · max(leaves(T'), leaves(T'')) · max(depth(T'), depth(T''))), thus ensuring good performance when applied to HTML trees. There are some limitations; most of them are not relevant in our context, but there is an important one: this approach cannot match permutations of nodes. Despite this intrinsic limit, the technique appears to fit our purpose of measuring the similarity of HTML trees very well.

3.5.4 Weighted Tree Matching

Let t(n) be the total number of siblings of a node n, including itself.

Algorithm 2 WeightedTreeMatching(T', T'')
1: {Change line 15 of Algorithm 1, i.e. the statement return M[m][n] + 1, with the following code}
2: if m > 0 AND n > 0 then
3:   return M[m][n] * 1 / Max(t(T'), t(T''))
4: else
5:   return M[m][n] + 1 / Max(t(T'), t(T''))
6: end if

In order to better reflect a good measure of similarity between HTML trees, we applied some focused changes to the way a value is assigned to each matching node. In the simple tree matching algorithm the assigned matching value is always 1. After some analysis and considerations on the structure of HTML pages, our intuition was to assign a weighted value, with the purpose of attributing less importance to slight changes in the structure of the tree when they occur in deep sublevels (e.g. missing/added leaves, small truncated/added branches, etc.) and when they occur in sublevels with many nodes, because these mainly represent HTML lists of items, table rows, etc., which are more likely to be modified. In the weighted tree matching, the weighted value assigned to a match between two nodes is 1 divided by the greater number of siblings of the two compared nodes, counting the nodes themselves (e.g. Figure 3.2.A, 3.2.B), thus reducing the impact of missing/added nodes. Before assigning a weight, the algorithm checks whether it is comparing two leaves, a leaf with a node that has children, or two nodes that both have children. The final contribution of a sublevel of leaves is the sum of the weighted values assigned to each leaf (cfr. Code Lines 4-5); thus, the contribution of the parent node of those leaves is equal to its weighted value multiplied by the sum of the contributions of its children (cfr. Code Lines 2-3). This choice produces an effect of clustering the matching process, subtree by subtree; this implies that, for each sublevel of leaves, the maximum sum of assigned values is 1; thus, for each parent node of that sublevel, the maximum value of the multiplication of its contribution with the sum of the contributions of its children is 1; each cluster, singly considered, contributes with a maximum value of 1. In the last recursion of this top-down algorithm, the two roots are evaluated. The resulting value at the end of the process is the measure of similarity between the two trees, expressed in the interval [0, 1]: the closer the final value is to 1, the more similar the two trees.
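Under the same assumptions as the previous sketch, the weighted variant can be written in Python as follows; the sibling counts t(·) are passed down by the caller, and the toy trees and the expected output are illustrative.

class Node:
    def __init__(self, label, children=None):
        self.label, self.children = label, children or []

def weighted_tree_matching(t1, t2, t1_siblings=1, t2_siblings=1):
    """Python sketch of Algorithm 2: as simple tree matching, but each match is
    weighted by 1 / max(sibling count of the two compared nodes, themselves included)."""
    if t1.label != t2.label:
        return 0.0
    m, n = len(t1.children), len(t2.children)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # children of t1 have m siblings in total, children of t2 have n
            w = weighted_tree_matching(t1.children[i - 1], t2.children[j - 1], m, n)
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    weight = 1.0 / max(t1_siblings, t2_siblings)
    if m > 0 and n > 0:
        return M[m][n] * weight       # internal node: its weight scales the children's contribution
    return M[m][n] + weight           # leaf on at least one side: weighted unit contribution

a = Node("a", [Node("b", [Node("d"), Node("e")]), Node("c")])
b = Node("a", [Node("b", [Node("d")]), Node("c")])
print(round(weighted_tree_matching(a, b), 3))   # 0.75: the missing leaf (e) lowers the b-subtree cluster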
a value of 1/3 is established for nodes (h), (i) and (j), although two of them are missing in 3.2.B). Going up to the parents, the summation of the contributions of matching leaves is multiplied by the relative value of each node (e.g. in the first sublevel, the contribution of each node is 1/4 because of the four first-sublevel nodes in 3.2.A). Once these operations are completed for all nodes of the sublevel, the values are added and the final measure of similarity for the two trees is obtained. Intuitively, in more complex and deeper trees, this process is iteratively executed for all the sublevels. The deeper a mismatch is found, the less its missing contribution affects the final measure of similarity. Analogous considerations hold for missing/added nodes and branches, sublevels with many nodes, etc. Table 3.1 shows the M and W matrices containing contributions and weights.

Table 3.1: W and M matrices for each matching subtree.

In this example, WeightedTreeMatching(3.2.A, 3.2.B) returns a measure of similarity of 3/8 (0.375), whereas SimpleTreeMatching(3.2.A, 3.2.B) would return a mapping value of 7; the main difference between the results provided by these two algorithms is the following: our weighted tree matching intrinsically produces an absolute measure of similarity between the two compared trees, whereas the simple tree matching returns the mapping value and then needs subsequent operations to establish the measure of similarity. Hypothetically, in the simple tree matching case, we could establish a rough estimate of similarity by dividing the mapping value by the total number of nodes of the tree with more nodes; however, a value calculated in this way would be linear with respect to the number of nodes, thus ignoring important information such as the position of mismatches, the number of mismatches with respect to the total number of subnodes/leaves in a particular sublevel, etc. In this case, for example, the measure of similarity between 3.2.A and 3.2.B, applying this approach, would be 7/14 (0.5). A greater value of similarity could wrongly suggest that this approach is more accurate.

Experimentation showed us that the closer the measure of similarity is in reflecting changes in complex structures, the higher the accuracy of the matching process. This fits particularly well HTML trees, which often show very rich and articulated structures. The main advantage of using the weighted tree matching algorithm is that, the more the structure of the considered trees is complex and similar, the more accurate the measure of similarity. On the other hand, for simple and quite different trees the accuracy of this approach is lower than the one ensured by the simple tree matching. But, as already underlined, most of the changes in Web pages are usually minor changes, thus weighted tree matching appears to be a valid technique to achieve a reliable process of automatic wrapper adaptation.
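As a sketch of how the weighted variant might be coded, again over a hypothetical (label, children) tuple representation, the weighting of Algorithm 2 can be obtained by passing the sibling counts down from the parent call; this only illustrates the idea and is not the code used in our platform.

```python
def weighted_tree_matching(t1, t2, s1=1, s2=1):
    """Weighted tree matching over (label, children) tuples; s1 and s2 are the
    numbers of siblings of t1 and t2 (including the nodes themselves)."""
    if t1[0] != t2[0]:
        return 0.0
    m, n = len(t1[1]), len(t2[1])
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # children of t1 have m siblings, children of t2 have n siblings
            w = weighted_tree_matching(t1[1][i - 1], t2[1][j - 1], m, n)
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    if m > 0 and n > 0:
        return M[m][n] / max(s1, s2)        # internal node: weight its children's sum
    return M[m][n] + 1.0 / max(s1, s2)      # leaf (or leaf vs. node): weighted unit value
```

With this weighting, each sublevel contributes at most 1 and the value returned for the two roots lies in the interval [0,1].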
3.5.5 Web Wrappers

In supervised and interactive wrapper generation, the application designer is in charge of deciding how to characterize the Web objects that are used for traversing the Web and for extracting information. Being resilient against changes (both changes over time and variations across similarly structured pages) is one of the most important properties of a wrapper, and part of the robustness of a data extractor depends on how the application designer configures it. However, it is crucial that the wrapper generation system assists the wrapper designer and suggests how to make the identification of Web objects and trails through Web sites as stable as possible.

Robust XPath generation and fall-back strategies

In Lixto Visual Developer (VD) (26), a number of mechanisms are offered to create a resilient wrapper. During recording, one task is to generate a robust XPath or regular expression, interactively and supported by the system. During wrapper generation, in many cases only one labeled example object is available, especially in automatically recorded deep Web navigation sequences. In such cases, efficient heuristics in XPath generation and fallback strategies during replay are required. Typical heuristics during recording for reliably identifying such single Web objects include:

• Generalization of a chosen XPath by using form properties, element properties, textual properties and formatting properties. During replay, these ingredients are used as input for an algorithm that checks in which constellation to best apply this property information to satisfy the integrity constraints imposed on a rule (e.g., as a result a single instance is required).

• DOM Structural Generalization: starting from the full path, several generalized paths are created, using only characteristic elements and characteristic element sequences. A number of stable anchor points are identified and stored, from which relative paths to the object are created. Typical stable anchor points are automatically identified and include, e.g., the outermost table structure and the main content area (chosen upon factors such as the longest content).

• Positional information is considered if the structurally generalized paths identify more than one element. In this case, during execution, variations of the XPath generated with this "index heuristics" are applied on the active Web page, removing indexes until the integrity constraints of the current rule are satisfied.

• Attributes and properties of elements are taken into account, in particular of the element of choice, but we also consider ancestor attributes if the element attributes are not sufficient.

• Attributes that make an element unique are preferred, i.e., similar elements are checked for distinguishing criteria.

• Attribute values are considered if attribute names are not sufficient. Attribute value fragments are considered if attribute values are not sufficient (using regular expressions).

• The id attributes are used as far as possible. If an id is unique and meaningful for characterizing an element, it is considered in the fallback strategies with a high weight.

• Textual information and label information is used only if explicitly turned on (since this might fail in case of a language switch).

The output of the heuristic step is a "best XPath" shown to the wrapper designer, and a set of XPath expressions and priorities regarding when to use which fallback strategy, stored in the configuration.
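As a rough illustration of the fallback mechanism (not the actual Lixto VD implementation), the following sketch tries a prioritized list of XPath expressions on a page until the cardinality constraint of the rule is satisfied; the example expressions are hypothetical.

```python
from lxml import html

def locate(page_source, xpath_candidates, expected_count=1):
    """Evaluate the 'best XPath' first, then the fallback expressions in priority
    order, returning the first result set that satisfies the cardinality constraint."""
    tree = html.fromstring(page_source)
    for expr in xpath_candidates:
        nodes = tree.xpath(expr)
        if len(nodes) == expected_count:
            return nodes
    return []   # all strategies failed: wrapper adaptation would be triggered here

# Hypothetical usage: best XPath based on the id attribute, then generalized fallbacks.
# locate(page, ["//*[@id='country']", "//form//select[1]", "//select"])
```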
Figure 3.3 illustrates which information is stored by the system during recording. In this case, a drop-down was selected by the application designer, and the system decided that the "id" attribute is the most reliable one, choosing it as the best XPath. If this evaluation fails, the system applies heuristics based on the (in this example, three) stored fallback XPaths, which mainly exploit form and index properties. In case one of the heuristics generates results that do not invalidate the defined integrity constraints, these Web objects are considered as the result.

Figure 3.3: Robust Web object detection in Lixto VD.

During the generation of rules (e.g., "extract") and actions (e.g., "click"), the wrapper designer imposes constraints on the results to be obtained, such as:

• Cardinality Constraints: restrictions on the number of results, e.g., exactly one element or at least one element must be matched.

• Data Type Constraints: restrictions on the data type of a result, e.g., a result must be of type integer or match a particular regular expression.

Constraints can be defined individually per rule and action, or defined globally by using a schema on the output data model.

Configuring adaptable wrappers

The procedures described in the previous section do not adapt the wrapper: they address situations in which the initially chosen XPath no longer matches, and simply try different expressions derived from it. In the configuration of wrapper adaptation, we go one step beyond: on the one hand we exploit tree and string similarity techniques to find the most similar Web object(s) on the new page, and on the other hand, in case the adaptation is triggered, the wrapper is changed on the fly using the new configuration created by the adaptation algorithms.

As before, integrity constraints can be imposed on extraction and navigation rules. Moreover, the application designer can choose whether to use wrapper adaptation on a particular rule in case the constraints are violated during runtime. When adaptation is chosen, as an alternative to using XPath-based means to identify Web objects, we store the actual result subtree. In the case of HTML leaf elements, which are usually the elements under consideration for navigation actions, we instead store the tree rooted at the n-th ancestor of the element, together with the information of where the result element is located within this tree. In this way, tree matching can also be exploited for HTML leaf elements. Wrapper designers can choose among various similarity measures: these include in particular the Simple Tree Matching algorithm (253) and the Weighted Tree Matching algorithm previously described. In the future, further algorithms will extend the capabilities of the tool, e.g., a bigram-based tree matching capable of dealing with node permutations in a more favorable fashion. In addition to the similarity function, one can choose certain parameters, e.g., whether to use the HTML element name as node label or instead to use attributes such as the class and id attributes. Figure 3.4 illustrates the configuration of wrapper adaptation in Lixto VD.

Figure 3.4: Configuration of wrapper adaptation in Lixto VD.

3.5.6 Automatic Adaptation of Web Wrappers

Self-repairing rules

Figure 3.5 describes the adaptation process. The wrapper adaptation process is triggered upon violation of defined constraints.
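To give a flavor of the kind of check that raises this trigger, here is a minimal, hypothetical sketch of a cardinality and data-type constraint validation; in the real system these constraints are expressed in the rule configuration or in a schema on the output data model.

```python
import re

def satisfies_constraints(results, min_count=1, max_count=None, pattern=None):
    """Return True if the extracted results respect the cardinality constraints
    and, optionally, a regular-expression data-type constraint."""
    if len(results) < min_count:
        return False
    if max_count is not None and len(results) > max_count:
        return False
    if pattern is not None:
        return all(re.fullmatch(pattern, str(r)) for r in results)
    return True

# e.g. exactly one integer price is expected; a False result would trigger adaptation:
# adapt = not satisfies_constraints(prices, min_count=1, max_count=1, pattern=r"\d+")
```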
If in the initial wrapper an element is detected with an XPath, the adaptation procedure substitutes this definition by storing the subtree of a matched element. If the wrapper definition already stores the example tree, and the similarity computation returns results that violate the defined constraints, the threshold is lowered or raised until a perfect match is generated. During runtime, the stored tree is compared to the elements on the new page, and the best fitting element(s) are considered as extraction results. During configuration, wrapper designers can choose an algorithm (such as the Weighted Tree Matching) and a similarity threshold. The similarity threshold can be constant, or defined to be within an interval of acceptable thresholds. During execution, various thresholds within the allowed range are considered, and the one generating the best fit with respect to the defined constraints is chosen.

As a next step, the stored tree is refined and generalized so that it maximizes the matching value for both the original subtree and the new trees, reflecting the changes of a Web page over time. This generalization process generates a simple tree grammar, a "tree template" that is allowed to use occurrence indicators (one or more elements, at least one element, etc.) and optional depth levels. In further runs, the tree template is compared against the subtrees of an active Web page during execution. First, the algorithm checks which trees on the new page satisfy the tree template. If the results are within the defined integrity constraints, no further action is taken. If the results are not satisfying, the system searches for the most similar trees based on the defined distance metrics; in this case, the wrapper is auto-adapted, the tree template is further refined and the threshold or threshold interval is automatically re-adjusted. At the very end of the process, the corrected wrapper is stored in the wrapper repository and committed to a versioning system to keep track of all changes.

Figure 3.5: Wrapper adaptation process.

Wrapper re-induction

In practice, single adaptation steps of rules and actions are embedded into the whole execution process of a wrapper, and the adapted wrapper is stored in the repository after all adaptation steps have been concluded. The need for adapting a particular rule influences the further execution steps. Usually, wrapper generation in VD is a hierarchical top-down process: e.g., first a "hotel record" is characterized and, inside the hotel record, entities such as "rating" and "room types". To define a rule that matches such entities, the wrapper designer visually selects an example and, together with system suggestions, generalizes the rule configuration until the desired instances are matched. To support the automatic adaptation process during runtime, as described above, the wrapper designer further specifies what it means that extraction failed. In general, this means wrong or missing data, and with integrity constraints one can give indications of what correct results look like. The upper half of Figure 3.6 summarizes the wrapper generation.

Figure 3.6: Diagram of the Web wrapper creation, execution and maintenance flow.

During wrapper creation, the application designer provides a number of configuration settings to this process. These include:

• Threshold Values.

• Priorities/Order of the Adaptation Algorithms used.
• Flags of the chosen algorithm (e.g., using the HTML element name as node label, using id/class attributes as node labels, etc.).

• Triggers for bottom-up, top-down and process flow adaptation bubbling.

• Whether stored tree-grams and XPath statements are updated based on adaptation results, to be additionally used as inputs in future adaptation procedures (reflecting and addressing regular slight changes of a Web page over time).

Triggers in the Adaptation Settings can be used to force adaptation of further fragments of the wrapper, as depicted in the lower half of Figure 3.6.

• Top-down: forcing adaptation of all/some descendant rules (e.g., adapt the "price" rule as well, to identify prices within a record, if the "record" rule was adapted).

• Bottom-up: forcing adaptation of a parent rule in case the adaptation of a particular rule was not successful. Experimental evaluation pointed out that in such cases the problem is often that the parent rule already provides wrong or missing results (even if matched by the integrity constraints) and has to be adapted first.

• Process flow: it might happen that particular rule matches can no longer be detected because the wrapper evaluates on the wrong page. Hence, there is the need to use variations in the deep Web navigation actions. In particular, a simple approach explored at this time is to use a switch window or back step action to check whether the previous window or another tab/popup provides the required information.

3.5.7 Experimentation

In this Section we discuss some experimentation performed on common fields of application and the results that followed. We tried to automatically adapt wrappers, previously built to extract information from particular Web pages, after some -often minor- structural changes. All the following are real use cases: we did not modify any Web page, the original owners did, thus re-publishing pages with changes and altering the behavior of old wrappers. These real use cases confirmed our expectations and the simulations on ad hoc examples we prepared to test the algorithms. We obtained an acceptable degree of precision using the simple tree matching and a great rate of precision/recall using the weighted tree matching. Precision, Recall and F-Measure summarize these results, shown in Table 3.2. We focused on the following areas, of wide interest for Web data extraction:

• News and information: Google News is a valid use case for wrapper adaptation; templates change frequently and sometimes it is not possible to identify elements with old wrappers.

• Web search: Google Search completely rebuilt the results page layout in the same period in which we started our experimentation¹; we exploited the possibility of automatically adapting wrappers built on the old version of the results page.

• Social networks: another great example of continuous restyling is represented by the most popular social network, Facebook; we successfully adapted wrappers extracting friend lists, also exploiting additional checks performed on attributes.

• Social bookmarking: building folksonomies and tagging contents is a common behavior of Web 2.0 users. Several Websites provide platforms to aggregate and classify sources of information, and these could be extracted; so, as usual, wrapper adaptation is needed to face changes. We chose Delicious for our experimentation, obtaining stunning results.

¹ http://googleblog.blogspot.com/2010/05/spring-metamorphosis-googles-new-look.html
• Retail: these Websites are common fields of application of data extraction, and eBay is a nice real use case for wrapper adaptation, continuously showing often almost invisible structural changes which require wrappers to be adapted in order to continue working correctly.

• Comparison shopping: related to the previous category, many Websites provide tools to compare prices and features of products. Often it is interesting to extract this information, and sometimes this task requires the adaptation of wrappers to structural changes of Web pages. Kelkoo¹ provided us with a good use case to test our approach.

• Journals and communities: Web data extraction tasks can also be performed on the millions of online Web journals, blogs and forums, based on open source blog publishing applications (e.g. Wordpress, Serendipity², etc.), CMSs (e.g. Joomla³, Drupal⁴, etc.) and community management systems (e.g. phpBB⁵, SMF⁶, etc.). These platforms allow changing templates, and often this implies that wrappers must be adapted. We led the automatic adaptation process on Techcrunch⁷, a tech journal built on Wordpress.

¹ http://shopping.kelkoo.co.uk
² http://www.s9y.org
³ http://www.joomla.org
⁴ http://drupal.org
⁵ http://www.phpbb.com
⁶ http://www.simplemachines.org
⁷ http://www.techcrunch.com

We adapted wrappers for these 7 use cases considering 70 Web pages; Table 3.2 summarizes the results obtained by comparing the two algorithms applied on the same pages, with the same configuration (threshold, additional checks, etc.). The threshold represents the minimum value of similarity required to match two trees. The columns true pos., false pos. and false neg. report the true positive, false positive and false negative items extracted from Web pages through adapted wrappers.

URL              threshold   true pos. (Simple TM)   true pos. (Weighted TM)
news.google.com     90%            604                     644
google.com          80%            100                     136
facebook.com        65%            240                     240
delicious.com       40%            100                     100
ebay.com            85%            200                     196
kelkoo.co.uk        40%             60                      58
techcrunch.com      85%             52                      80
Total                -            1356                    1454

Table 3.2: Experimental results of automatic wrapper adaptation.

Overall, the simple tree matching totals 1356 true positives, 92 false positives and 140 false negatives (Recall 90.64%, Precision 93.65%, F-Measure 92.13%), while the weighted tree matching totals 1454 true positives, 12 false positives and 42 false negatives (Recall 97.19%, Precision 99.18%, F-Measure 98.18%).

3.5.8 Discussion of Results

In some situations of deep changes (Facebook, Kelkoo, Delicious) we had to lower the threshold in order to correctly match most of the results. Both algorithms show a great elasticity, and it is possible to adapt wrappers with a high degree of reliability; the simple tree matching approach shows a weaker recall value, whereas the performance of the weighted tree matching is striking (an F-Measure greater than 98% is an impressive result). Sometimes, additional checks on node attributes are performed to refine the results of both algorithms. For example, we can additionally include attributes as part of the node label (e.g. id, name and class) to refine results. Even without including these additional checks, most of the time the false positive results are very limited in number (cfr. the Facebook use case).

Conclusion

In this Chapter we briefly discussed the current panorama regarding the techniques and the fields of application of Web mining platforms, with particular attention to the themes related to the extraction of data from social media sources, such as Online Social Networks, social bookmarking services, and so on.
The procedures referred to in this Chapter as Web wrappers are the basis of the platform for Web data extraction from Online Social Networks discussed in the next Chapters. Our research activity provided a significant contribution to the field of Web data extraction systems, in particular in the area concerning automatic wrapper maintenance (102, 103, 104). The details regarding our algorithmic and technical solutions to this problem have been described in the final part of this Chapter.

4 Mining and Analysis of Facebook

This Chapter is organized as follows: Section 4.1 presents a summary of the most representative related work, in particular the studies on Facebook and other OSNs based on the concept of friendship. Moreover, we discuss existing projects on data extraction and analysis of OSNs. Section 4.2 describes the methodology we used to conduct the analysis of the Facebook social network; in particular, we discuss the architecture of the Web mining platform and the algorithms and techniques exploited. We define the technical challenges underlying the process of information extraction from Facebook and describe in detail the design and implementation of our application, called crawling agent. Some experimental details regarding Social Network Analysis aspects are discussed in Section 4.3. The most important results are summarized by describing the analysis of the topological features of the Facebook social network.

4.1 Background and Related Literature

The task of extracting and analyzing data from Online Social Networks has attracted the interest of many researchers, e.g. in (8, 118, 286). In this Section we review some relevant literature directly related to our approach. In particular, we first discuss techniques to crawl large social networks and collect data from them (see Section 4.1.1). Collected data are usually mapped onto graph data structures (and sometimes hypergraphs) with the goal of analyzing their structural properties. The ultimate goal of these efforts is perhaps best laid out by Kleinberg (165): topological properties of graphs may be reliable indicators of human behaviors. For instance, several studies show that the node degree distribution follows a power law, both in real and in Online Social Networks. That feature points to the fact that most social network participants are often inactive, while a few key users generate a large portion of data/traffic. As a consequence, many researchers leverage the tools provided by graph theory to analyze the social network graph with the goal -among others- of better interpreting personal and collective behaviors on a large scale. The list of potential research questions arising from the analysis of OSN graphs is very long. As discussed in the introduction of this dissertation, we point out the following themes, which are directly relevant to our research:

i) Data collection from OSNs, i.e., the process of acquisition of relevant data from OSN platforms by means of Web mining techniques;

ii) Node similarity detection, i.e., the task of assessing the degree of similarity of two users in OSNs (see Section 4.1.2);

iii) Influential user detection, i.e., the task of identifying users capable of stimulating other users to join activities/discussions in their OSN (see Section 4.1.3).
4.1.1 Data Collection from Online Social Networks

Most works focusing on data collection adopt Web information extraction techniques to crawl the front-end of Websites; this is because OSN datasets are usually not publicly accessible: data reside in back-end databases that are accessible only through the Web interface. In (184) the authors discussed the problem of sampling from large graphs adopting several graph mining techniques, in order to establish whether it is possible to avoid bias when acquiring a subset of the whole graph of a social network. The main outcome of the analysis in (184) is that a sample whose size is 15% of the whole graph preserves most of its properties. In (203), the authors crawled data from large Online Social Networks like Orkut, Flickr and LiveJournal. They carried out an in-depth analysis of OSN topological properties (e.g., link symmetry, power law node degrees, group formation) and discussed the challenges arising from large-scale crawling of OSNs. (286) considered the problem of crawling OSNs, analyzing quantitative aspects like the efficiency of the adopted visiting algorithms and the bias of data produced by different crawling approaches.

The work by Gjoka et al. (125) on OSN graphs is perhaps the most similar to our approach. Gjoka et al. have sampled and analyzed the Facebook friendship graph with different visiting algorithms (namely BFS, Random Walk and Metropolis-Hastings Random Walk). Our objectives differ from those of Gjoka et al. because their goal is to produce a consistent sample of the Facebook graph. A sample is defined consistent when some of its key structural properties, i.e., node degree distribution, assortativity and clustering coefficient, approximate fairly well the corresponding properties of the original Facebook graph. Vice versa, our work aims at crawling a portion of the Facebook graph and at analytically studying the structural properties of the crawled data. A further difference with Gjoka et al. lies in the strategy for selecting which nodes to visit: Gjoka's strategy requires knowing in advance the degree of the considered nodes. Nodes with the highest degree are selected and visited at each stage of the sampling. In the Facebook context, the node degree represents the number of friends a user has; such information is available in advance by querying the profile of the user. Such an assumption, however, is not applicable if we consider other Online Social Networks. Hence, to know the degree of a node we should preliminarily perform a complete visit of the graph, which may not be feasible for large-scale OSNs.

4.1.2 Similarity Detection

Finding similar users in a given OSN is a key issue in several research fields like Recommender Systems, especially Collaborative Filtering (CF) Recommender Systems (3). In the context of social networks, the simplest way to compute user similarities is by means of similarity metrics such as the Jaccard coefficient (146). In particular, given two users $u_i$ and $u_l$ of a given social network, the simplest and most intuitive way to compute their similarity requires the computation of the Jaccard coefficient of the sets of their neighbors. However, the usage of the Jaccard coefficient is often not satisfactory because it considers only the acquaintances of a user in a social network (and, therefore, local information) and does not take global information into account.
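For concreteness, a minimal sketch of this neighbor-based computation follows (the user identifiers and neighbor sets are hypothetical):

```python
def jaccard_similarity(neighbors_i, neighbors_l):
    """Jaccard coefficient of the neighbor sets of two users u_i and u_l."""
    union = neighbors_i | neighbors_l
    if not union:
        return 0.0
    return len(neighbors_i & neighbors_l) / len(union)

# e.g. jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}) == 0.5
```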
A further drawback consists of the fact that users with a large number of acquaintances have a higher probability of sharing some of them with respect to users with a small number of acquaintances; therefore, they are likely to be regarded as similar even if no real similarity exists between them. (1) proposed that the similarity of two users increases if they share acquaintances who, in turn, have a low number of acquaintances themselves. In order to consider global network properties, many approaches rely on the idea of regular equivalence, i.e., on the idea that two users are similar if their acquaintances are similar too. In (20) the problem of computing user similarities is formalized as an optimization problem. Other approaches compute similarities by exploiting matrix-based methods. For instance, the approaches of (182) use a modified version of the Katz coefficient. SimRank (154) provides an iterative fixpoint method. The approach of (31) operates on directed graphs and uses an iterative approach relying on their spectral properties.

Some authors studied the computational complexity of social network analysis with an emphasis on the problem of discovering links between social network users (255, 256). To describe these approaches, consider a social network and let G = (V, E) be the graph representing it; each node in V represents a user, whereas an edge specifies a tie between a pair of users (in particular, the fact that a user knows another user). In the first stage, Formal Concept Analysis is applied to map G onto a graph G′. The graph G′ is more compact than G (i.e., it contains fewer nodes and edges than G) but it is still sparse, i.e., a node in G′ still has few connections with other nodes. As a consequence, the task of predicting whether two nodes are similar is quite hard, and comparing the number of friends/acquaintances they share is not effective because, in most cases, two users do not share any common friend and, therefore, the similarity degree of an arbitrary pair of users will be close to 0. To alleviate sparsity, Singular Value Decomposition (SVD) (133) is applied. Experiments provided in (255) show that the usage of the SVD is effective in producing a more detailed and refined analysis of social network data.

The SVD is a technique from Linear Algebra which has been successfully employed in many fields of Computer Science, like Information Retrieval; in particular, the SVD allows a matrix A to be decomposed as

$A = U \Sigma V^{T}$

where U and V are two orthogonal matrices (i.e., the columns of U and V are pairwise orthogonal); the matrix $\Sigma$ is a diagonal matrix whose elements coincide with the square roots of the eigenvalues of the matrix $A A^{T}$; as usual, the symbol $A^{T}$ denotes the transpose of the matrix A. The SVD decomposes a matrix A into the product of three matrices and, if we multiply these three matrices, we reconstruct the original matrix A. As a consequence, if A is the adjacency matrix of a social network, any operation carried out on A can be equivalently performed on the three matrices U, $\Sigma$ and V into which A has been decomposed. The main advantage of such a transformation is that the matrices U and V are dense and, therefore, we can compute the similarity degree of two users even if the number of friends they share is close or equal to zero.
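A minimal sketch of this idea, assuming a small dense adjacency matrix and a plain truncated SVD (the approaches cited above are more elaborate), could look as follows:

```python
import numpy as np

def svd_similarities(A, k=20):
    """Project users into a k-dimensional latent space obtained from the SVD of
    the adjacency matrix A and return the pairwise cosine similarities."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    latent = U[:, :k] * s[:k]                     # dense latent representation of each user
    norms = np.linalg.norm(latent, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                       # guard against isolated users
    latent = latent / norms
    return latent @ latent.T                      # entry (i, j): similarity of users i and j
```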
4.1.3 Influential User Detection

A recent trend in OSN analysis is the identification of influential users (122, 222). Influential users are those capable of stimulating others to join OSN activities and/or to actively contribute to them. In Weblog (blog) analysis, there is a special emphasis on the so-called leader identification. In particular, (258) suggested modeling the blogosphere as a graph (whose nodes represent bloggers, whereas edges model the fact that a blogger cites another one). In (195), the authors introduce the concept of starter, i.e., a user who first generates information that catches the interest of fellow users/readers. Among others, the approach of (195) deploys the Random Walk technique to find starters. Researchers from HP Labs analyzed user behaviors on Twitter (246); they found that influential users should not only catch attention from other users but should also overcome the passivity of other users and spur them to get involved in OSN activities. To this purpose, they developed an algorithm, based on the HITS algorithm of (164), to assess the degree of influence of a user. Experimental results show that high levels of popularity of a user do not necessarily imply high values of the degree of influence.

4.2 Sampling the Facebook Social Graph

Our work on OSN analysis began with the goal of understanding the organization of popular OSNs, and as of 2010 (the time of the data collection) Facebook was by far the largest and most studied. As of December 2011 Facebook gathered more than 720 million active users (269), and its growth rate has proved to be the highest among all its competitors in the last few years. More than 50% of users log on to the platform on any given day. Coarse statistics about the usage of the social network are provided by the company itself¹. Our study is interested in analyzing the characteristics and the properties of this network on a large scale. In order to reach this goal, first of all we had to acquire data from this platform, and later we proceeded to their analysis.

¹ Please refer to http://www.facebook.com/press/info.php?statistics

4.2.1 The Structure of the Social Network

The structure of the Facebook social network is simple. Each subscribed user can be connected to others via friendship relationships. The concept of friendship is bilateral: users must confirm the relationships between them. Connections among users do not follow any particular hierarchy, thus we define the social network as unimodal. This network can be represented as a graph G = (V, E) whose nodes V represent users and whose edges E represent friendship connections among them. Because of the assumption on the bilateralness of relationships, the graph is undirected. Moreover, the graph we consider is unweighted, because all the friendship connections have the same value within the network. However, it would be possible to assign a weight to each friendship relation, for instance by considering the frequency of interaction between each pair of users, or different criteria. Considering the assumption that loops are not allowed, we conclude that in our case it is possible to use a simple unweighted undirected graph to represent the Facebook social network. The adoption of this model has been proved to be optimal for several social networks to this purpose (see (130)).

Although choosing the model for representing a network could appear to be simple, this phase is important and can be non-trivial. Compared to Facebook, the structure of other social networks requires a more complex representative model.
For example, Twitter should be represented using a multiplex network, because it introduces different types of connections among users ("following", "reply to" and "mention") (247). Moreover, there is no mutuality in user relationships, thus its representation requires a directed graph. Similar considerations hold for other OSNs, such as aNobii (5), Flickr and YouTube (203), etc.

How to get information about the structure of the network

One important aspect to be considered for representing the model of a social network is the amount of information about its structure we have access to. The ideal condition would be to have access to the whole network data, for example acquiring them directly from the company which manages the social networking service. For several reasons (see further), most of the time this solution is not viable; this is the case for Facebook. Another option is to obtain the data required to reconstruct the model of the network by acquiring them directly from the platform itself, exploiting its public interface. In other words, a viable solution is to collect a representative sample of the network in order to reproduce its structure. To this purpose, it is possible to exploit Web data mining techniques to extract data from the front-end of the social network Websites. This implies that, for very large OSNs, such as Facebook, Twitter, etc., it is hard or even impossible to collect a complete sample of the network. The first limitation is related to the computational overhead of a large-scale Web mining task. In the case of Facebook, for example, crawling the friend-list Web page (dimension ≃ 200 KB) for half a billion users would approximately require downloading more than 200 KB · 500 M = 100 Terabytes of HTML data. Even if possible, the acquired sample would be a snapshot of the structure of the graph at the time of the data collection process. Moreover, during the sampling process the structure of the network slightly changes: even if short, the data mining process requires a non-negligible time, during which the connections among users evolve, and thus the social network and its structure change accordingly. For example, the growth rate of Facebook has been estimated in the order of 0.2% per day by Gjoka et al. In other words, not even all these efforts could ensure the acquisition of a perfect sample. For these reasons, a widely adopted approach is to collect small samples of a network, trying to preserve the characteristics of its structure. There are several different sampling techniques that can be exploited; each algorithm ensures different performance, and possibly introduces bias in the data.

For our experimentation we collected two significant samples of the structure of the social network, of a size comparable to other similar studies (63, 277). In particular, we adopted two different sampling algorithms, namely "breadth-first-search" and "Uniform". The first has been proved to introduce bias in certain conditions (e.g., in incomplete visits) towards high degree nodes (170). The latter has been proved to be unbiased by construction by Gjoka et al. Once collected, the data are compared and analyzed in order to establish their quality, and to study their properties and characteristics. We consider two quality criteria to evaluate the samples: i) statistical significance with respect to mathematical/statistical models; ii) congruency with results reported by similar studies.
Considerations about the characteristics of both the "breadth-first-search" and the "Uniform" samples follow.

How to extract data from Facebook

Companies providing online social networking services, such as Facebook, Twitter, etc., do not have an economic interest in sharing their data about users, because their business model mostly relies on advertising. For example, exploiting this information, Facebook provides unique and efficient services to advertising companies. Moreover, questions about the protection of these data have been raised, for privacy reasons, in particular for Facebook (139, 196). In this social network, for example, information about users and the interconnections among them, their activities, etc. can only be accessed through the interface of the platform. To preserve this condition some constraints have been implemented. Among others, a limit is imposed on the amount of information accessible from profiles of users not in friendship relations with each other. There are also some technical limitations, e.g. the friend-list is dispatched through an asynchronous script, so as to prevent naive crawling techniques. Some Web services, such as the "Graph API"¹, have been provided during the last months of 2010 by the Facebook developers team, but they do not bypass these limitations (and they eventually add even more restrictions). As of 2011, the structure of this social network can be accessed only by exploiting techniques typical of Web data mining.

¹ Available from http://developers.facebook.com/docs/api

4.2.2 The Sampling Architecture

In order to collect data from the Facebook platform, we designed a Web data mining architecture, which is composed of the following elements (see Figure 4.1): i) a server running the mining agent(s); ii) a cross-platform Java application, which implements the logic of the agent; and iii) an Apache interface, which manages the information transfer through the Web. While running, the agent(s) query the Facebook server(s), obtaining the friend-list Web pages of specific requested users (this aspect depends on the implemented sampling algorithm) and reconstructing the structure of the relationships among them. Collected data are stored on the server and, after a post-processing step (see Section 4.2.5), they are delivered (possibly represented using an XML standard format (48)) for further experimentation.

Figure 4.1: Architecture of the data mining platform.

The Facebook crawler

The cross-platform Java agent which crawls the Facebook front-end is the core of our mining platform. The logic of the developed agent, regardless of the sampling algorithm implemented, is depicted in Figure 4.2. The first preparatory step for the agent execution includes choosing the sampling methodology and configuring some technical parameters, such as the termination criterion/a, the maximum running time, etc. Then, the crawling task can start or be resumed from a previous point. During its execution the crawler visits the friend-list page of a user, following the chosen sampling algorithm directives for traversing the social graph. Data about newly discovered nodes and connections among them are stored in a compact format, in order to save I/O operations. The process of crawling concludes when the termination criterion/a is/are met.

Figure 4.2: State diagram of the data mining process.

During the data mining step, the platform exploits the Apache HTTP Request Library¹ to communicate with the Facebook server(s).

¹ http://httpd.apache.org/apreq
After an authentication phase, which uses a secure connection and cookies for logging into the Facebook platform, the HTML friend-list Web pages are obtained via HTTP requests. This process is described in Table 4.1.

N.  Action                                           Protocol  Method  URI                                          KB
1   open the Facebook page                           HTTP      GET     www.facebook.com/                            242
2   login providing credentials                      HTTPS     POST    login.facebook.com/login.php                 234
                                                     HTTP      GET     /home.php                                    87
3   visit the friend-list page of a specific user    HTTP      GET     /friends/ajax/friends.php?id=#&filter=afp    224

Table 4.1: HTTP requests flow of the crawler: authentication and mining steps.

Regarding the data extraction, the crawler implements a Web wrapper, as discussed in the previous Chapter, which includes automatic adaptation features. The wrapper exploits an XPath that is selected by the user during the configuration phase, so as to specify which elements must be extracted. The crawler provides two running modes: i) visual extraction and ii) HTTP request-based extraction. In the visual extraction mode, depicted in Figure 4.3, the crawler embeds a Firefox browser interfaced through XPCOM² and XULRunner³. The advantage of this solution is its ability to perform asynchronous requests, such as AJAX scripts (that, in the case of Facebook, are useful to fill those friend-list pages exceeding a certain size, as further described in the next Section). Its drawback is clearly a slower execution, since the rendering of the Web pages is required and time-consuming. Thus, for our large-scale extraction we adopted the HTTP request-based solution.

² https://developer.mozilla.org/en/xpcom
³ https://developer.mozilla.org/en/XULRunner

Figure 4.3: Screenshot of the Facebook visual crawler.

Limitations

During the data mining task we noticed a technical limitation imposed by Facebook on the dimension of the friend-list Web pages dispatched via HTTP requests. To reduce the traffic through its network, Facebook provides shortened friend-lists not exceeding 400 friends. During a normal navigation experience on the Website, if the dimension of the friend-list Web page exceeds 400 friends, an asynchronous script fills the page with the remaining ones. This result is not reproducible using an agent based on HTTP requests. This problem can be avoided using a different mining approach, for example adopting the visual crawler to integrate the missing data. However, this approach is not viable for large-scale mining tasks, due to its cost, even though we proved its functioning in a smaller experimentation (57). In Section 4.3.4 we investigate the impact of this limitation on the samples.

4.2.3 Breadth-first-search Sampling

The breadth-first-search (BFS) is an uninformed traversal algorithm which aims to visit a graph. Starting from a "seed node", it explores its neighborhood; then, for each neighbor, it visits its unexplored neighbors, and so on, until the whole graph is visited (or, alternatively, until a termination criterion is met). This sampling technique shows several advantages: i) ease of implementation; ii) optimal solution for unweighted graphs; iii) efficiency. For these reasons it has been adopted in a variety of OSN mining studies, including (57, 58, 63, 203, 277, 286).
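A minimal sketch of this traversal strategy is reported below; `fetch_friend_list` is a hypothetical callable standing in for the crawler's HTTP layer, and the termination criterion is simplified to a maximum number of visited users.

```python
from collections import deque

def bfs_sample(seed, fetch_friend_list, max_visited=63400):
    """Breadth-first sampling of a friendship graph starting from a seed user-ID."""
    queue, visited, edges = deque([seed]), set(), set()
    while queue and len(visited) < max_visited:
        user = queue.popleft()
        if user in visited:
            continue
        visited.add(user)
        for friend in fetch_friend_list(user):
            edges.add(frozenset((user, friend)))   # undirected edge; duplicates collapse
            if friend not in visited:
                queue.append(friend)
    return visited, edges
```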
In the last year, the hypothesis that the BFS algorithm produces data biased toward high-degree nodes, if adopted for partial graph traversals, has been advanced by (170). This is because, in the same (partial) graph obtained by adopting a BFS visiting algorithm, both nodes which have been visited (high-degree nodes) and nodes which have just been discovered as neighbors of visited ones (low-degree nodes) are represented. One important aspect of our experimentation has been to verify this hypothesis, in order to highlight which properties of a partial graph obtained using the BFS sampling are preserved and which are biased. To do so, we had to acquire a comparable sample which is certainly unbiased by construction (see further).

Description of the breadth-first-search crawler

The BFS sampling methodology is implemented as one of the possible visiting algorithms in our Facebook crawler, described before. While using this algorithm, the crawler first extracts the friend-list of the "seed node", which is represented by the user actually logged on to the Facebook platform. The user-IDs of the contacts in its friend-list are stored in a FIFO queue. Then, the friend-lists of these users are visited, and so on. In our experimentation, the process continued until two termination criteria were met: i) at least the third sub-level of friendship was completely covered; ii) the mining process exceeded 240 hours of running time. As discussed before, the time constraint is adopted in order to observe a short mining interval, so that the temporal evolution of the network is minimal (in the order of 2%) and can be ignored. The obtained graph is a partial reconstruction of the Facebook network structure, and its dimension is used as a yardstick for configuring the "Uniform" sampling (see further).

Characteristics of the breadth-first-search dataset

This crawler has been executed during the first part of August 2010. At that time the number of users subscribed to Facebook was roughly 540 million. The acquired sample covers about 12 million friendship connections among about 8 million users. Among these users, we performed the complete visit of about 63.4 thousand of them, thus resulting in an average degree $d = 2|E|/|V_v| \simeq 396.4$, where $V_v$ denotes the set of visited users. The overall mean degree, considering $V_t$ as the set of total nodes in the graph (visited users + discovered neighbors), is $o = 2|E|/|V_t| \simeq 3.064$. The expected density of the graph is $\Delta = 2|E|/(|V_v| \cdot (|V_v|-1)) \simeq 0.006259 \simeq 0.626\%$, considering $V_v$ as the set of visited nodes. We can combine the previous equations obtaining $\Delta = d/(|V_v|-1)$: the expected density of a graph is the average proportion of edges incident with nodes in the graph. In our case, the value $\delta = o/d = |V_v|/|V_t| \simeq 0.007721 \simeq 0.772\%$, which we introduce here, represents the effective density of the obtained graph. The distance between the effective and the expected density of the graph is computed as $\partial = 100 - \frac{\Delta \cdot 100}{\delta} \simeq 18.94\%$.

This result means that the obtained graph is slightly more connected than expected, with respect to the number of unique users it contains. This consideration is also compatible with the hypothesis advanced in (170). The effective diameter of this (partial) graph is 8.75, which is compliant with the "six degrees of separation" theory (15, 202, 216, 266). The largest connected component covers almost the whole graph (99.98%). The small number of disconnected nodes can be intuitively ascribed to some collisions caused by the hash function exploited to de-duplicate and anonymize user-IDs during the data cleaning step (see Section 4.2.5).
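Using the figures reported above, these density measures can be reproduced with a few lines of arithmetic (a rough sketch, with the sample sizes written as plain numbers):

```python
def density_metrics(num_edges, num_visited, num_total):
    """Average degree, overall mean degree, expected density, effective density
    and their percentage distance, as defined in the text."""
    d = 2.0 * num_edges / num_visited                               # average degree of visited users
    o = 2.0 * num_edges / num_total                                 # overall mean degree
    expected = 2.0 * num_edges / (num_visited * (num_visited - 1))  # Delta
    effective = float(num_visited) / num_total                      # delta = |V_v| / |V_t|
    distance = 100.0 - expected * 100.0 / effective                 # density distance (in %)
    return d, o, expected, effective, distance

# BFS sample: ~12.58M edges, ~63.4K visited users, ~8.2M total nodes
print(density_metrics(12.58e6, 63.4e3, 8.21e6))
```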
Some interesting considerations hold for the obtained clustering coefficient. It lies in the lower part of the interval [0.05, 0.18] reported by (125) and, similarly, of the interval [0.05, 0.35] reported by (277), using the same sampling methodology. The characteristics of the collected sample are summarized in Table 4.2.

No. visited users: 63.4K      No. discovered neighbors: 8.21M     No. edges: 12.58M
Avg. deg.: 396.8              Eff. diam.: 8.69                    Conn. comp.: 98.98%
Bigg. eigenval.: 68.93        Avg. clust. coef.: 0.0188           Density: 0.626%

Table 4.2: BFS dataset description (crawling period: 08/01-10/2010)

4.2.4 Uniform Sampling

To acquire a comparable sample, unbiased by construction, we exploited a rejection-based sampling methodology. This technique has been applied to Facebook by Gjoka et al., where the authors proved its correctness. Its efficiency relies on the following assumptions: 1. it is possible to generate uniformly sampled values for the domain of interest; 2. these values are not sparse with respect to the dimension of the domain; 3. it is possible to sample these values from the domain. In Facebook, each user is identified by a 32-bit user-ID. Considering that user-IDs lie in the interval $[0, 2^{32}-1]$, the highest possible number of assignable user-IDs using this system is $H \simeq 4.295 \cdot 10^9$. Since the space of identifiers is currently filling up (the actual number of assigned user-IDs, $R \simeq 5.4 \cdot 10^8$, roughly equals the 540 million currently subscribed users¹,²), the two domains are comparable and the rejection sampling is viable. We generated an arbitrary number of random 32-bit user-IDs, querying Facebook for their existence (and, eventually, obtaining their friend-lists). This sampling methodology shows two advantages: i) we can statistically estimate the probability $R/H \simeq 12.5\%$ of getting an existing user; thus, ii) we can generate an arbitrary number of user-IDs in order to acquire a sample of the desired dimension. Moreover, the distribution of user-IDs is completely independent with respect to the graph structure.

¹ As of August 2010, http://www.facebook.com/press/info.php?statistics
² http://www.google.com/adplanner/static/top1000/

Description of the "Uniform" crawler

The "Uniform" sampling is the second algorithm implemented in the Facebook crawler we developed. Differently from the BFS sampler, when adopting this algorithm it is possible to parallelize the process of extraction, because the user-IDs to be requested can be stored in different "queues". We designed the uniform sampling task starting from these assumptions: i) the number of subscribed users is $2^{29} \simeq 5.368 \cdot 10^8$; ii) this value is comparable with the highest possible assignable number of user-IDs, $2^{32} \simeq 4.295 \cdot 10^9$; thus iii) we can statistically assert that the probability of querying Facebook for an existing user-ID is $2^{29}/2^{32} = 1/8$ (12.5%). To this purpose, we generated eight different queues, each containing $2^{16} \simeq 65.5$K ($\approx$ 63.4K, the number of visited users of the BFS sample) random user-IDs, used to feed eight parallel crawlers. This number has been chosen in order to obtain a sample whose size is comparable with the BFS sample.
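A minimal sketch of this rejection-based generation follows; `lookup_user` is a hypothetical callable that queries the platform and returns the friend-list of an existing user-ID, or None for a rejected (non-assigned) one.

```python
import random

def uniform_sample(num_queries, lookup_user, id_space=2**32):
    """Rejection sampling over the 32-bit user-ID space: draw IDs uniformly at
    random and keep only those that correspond to existing users (~12.5%)."""
    sample = {}
    for _ in range(num_queries):
        candidate = random.randrange(id_space)
        friends = lookup_user(candidate)
        if friends is not None:
            sample[candidate] = friends
    return sample
```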
No. visited users: 48.1K      No. discovered neighbors: 7.69M     No. edges: 7.84M
Avg. deg.: 326.0              Eff. diam.: 14.72                   Conn. comp.: 94.96%
Bigg. eigenval.: 23.63        Avg. clust. coef.: 0.0014           Density: 0.678%

Table 4.3: "Uniform" dataset description (crawling period: 08/11-20/2010)

Characteristics of the "Uniform" dataset

The uniform sampling process has been executed during the second part of August 2010. The crawler collected a sample which contains almost 8 million friendship connections among a similar number of users. The acquired number of nodes differs from the expected one because of the privacy policy adopted by those users who prevent their friend-lists from being visited. This privacy policy aspect is discussed in Section 4.3.3. The total number of visited users has been about 48.1 thousand, thus resulting in an average degree of $d = 2|E|/|V_v| \simeq 326.0$, where $V_v$ denotes the set of visited users. Under the same assumptions, the expected density of the graph is $\Delta = 2|E|/(|V_v| \cdot (|V_v|-1)) \simeq 0.006777 \simeq 0.678\%$. If we consider $V_t$ as the set of total nodes (visited users + discovered neighbors), the overall mean degree is $o = 2|E|/|V_t| \simeq 2.025$. The effective density of the graph, previously introduced, is $\delta = |V_v|/|V_t| \simeq 0.006214 \simeq 0.621\%$. The distance between the effective and the expected density of the graph is $\partial = 100 - \frac{\Delta \cdot 100}{\delta} \simeq -9.06\%$. This can be intuitively interpreted as a slight lack of connectivity of this sample with respect to the theoretical expectation.

Some considerations hold when comparing this sample against the BFS one: the average degree is slightly smaller (326.0 vs. 396.8), but the effective diameter is almost double (14.72 vs. 8.69). We hypothesize that this is due to its size, which is insufficient to faithfully reflect the structure of the network. Our hypothesis is also supported by the dimension of the largest connected component, which excludes about 5% of the sample. Finally, the clustering coefficient, smaller than that of the BFS sample (0.0471 vs. 0.0789), is still comparable with respect to previously considered studies (125, 277).

4.2.5 Data Preparation

During the data mining process it may happen that redundant information is stored. In particular, while extracting friend-lists, a crawler could save multiple instances of the same edge (i.e., a parallel edge), if both the connected users are visited; this is related to the fact that we adopted an undirected graph representation. We adopted a hashing-based algorithm which cleans the data in O(N) time, removing duplicate edges. Another step of the data preparation is the anonymization: user-IDs are "encrypted" adopting a 48-bit hybrid rotative and additive hash function (229), in order to obtain anonymized datasets. The final step was to verify the integrity and the congruency of the data. We found that the usage of the hash function caused occasional collisions (0.0002%). Finally, some datasets of small sub-graphs (e.g., ego-networks) have been post-processed and stored using the GraphML format (48).

4.3 Network Analysis Aspects

During the last years, important achievements have been reached in understanding the structural properties of several complex real networks. The availability of large-scale empirical data, on the one hand, and the advances in computational resources, on the other, made it possible to discover and understand interesting statistical properties commonly shared among different real-world social, biological and technological networks. Among others, some important examples are: the World Wide Web (7), the Internet (97), metabolic networks (155), scientific collaboration networks (17, 208), citation networks (243), etc.
Indeed, during the last years even social networks have strongly imposed themselves as complex networks described by very specific models and properties. For example, some studies (10, 202, 266) proved the validity of the well-known "small-world" effect in complex social networks. Others (2, 228) assert that "scale-free" complex networks exhibit a "power law" distribution describing the behavior of node degrees. We can conclude that the topology of a network usually provides useful information about the dynamics of the network entities and the interactions among them. The study of complex networks has led to important results in some specific contexts, such as the social sciences. A branch of network analysis applied to the social sciences is Social Network Analysis (SNA). From a different perspective with respect to the analysis of complex networks, which mainly aims at analyzing the structural properties of networks, SNA focuses on studying the nature of the relationships among the entities of the network and, in the case of social networks, on investigating the motivational aspect of these connections.

4.3.1 Definitions

In this Section we describe some of the common structural properties which are usually observed in several complex networks. The early introduction of some concepts, such as clustering, the "small-world" effect and scale-free distributions, is instrumental for the discussion of further experiments on Facebook. Indeed, the same concepts will be extended in full detail in the next Chapter of this dissertation.

Clustering

In several networks it is shown that, if a node i is connected to a node j, which in its turn is connected to a node k, then there is a heightened probability that node i will also be connected to node k. From a social network perspective, a friend of your friend is likely also to be your friend. In terms of network topology, transitivity means the presence of a heightened number of triangles in the network, i.e. sets of three nodes connected to each other (210). The global clustering coefficient is defined by

$C_g = \dfrac{3 \times \text{no. of triangles in } G}{\text{no. of connected triples}}$    (4.1)

where a connected triple represents a pair of nodes connected to another node. $C_g$ is the mean probability that two persons who have a common friend are also friends of each other. An alternative definition of the clustering coefficient C has been provided by Watts and Strogatz (274):

$C_i = \dfrac{\text{no. of triangles connected to } i}{\text{no. of triples centered on } i}$    (4.2)

where the denominator is equal to $k_i(k_i-1)/2$ for the degree $k_i$ of the node i. For $k_i = 0$ and $1$, $C_i = 0$ by convention. The averaged clustering coefficient is then defined by $C = \sum_i C_i / N$. The local clustering coefficient $C_i$ has a strong dependence on the degree $k_i$. To quantify it, one usually defines $C(k) = \langle C_i \rangle |_{k_i = k}$. During our experimentation we investigated the clustering effect on the Facebook network (see Section 4.3.5).
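A minimal sketch of how the local coefficients of Equation 4.2 (and, by averaging, C) can be computed on a small graph is given below, assuming the graph is stored as a dictionary of neighbor sets:

```python
def local_clustering(adj):
    """Local clustering coefficient C_i for each node of an undirected graph
    given as {node: set_of_neighbors}; averaging the values gives C."""
    C = {}
    for i, neigh in adj.items():
        k = len(neigh)
        if k < 2:
            C[i] = 0.0                           # convention for k_i = 0 or 1
            continue
        links = sum(1 for u in neigh for v in neigh if u < v and v in adj[u])
        C[i] = 2.0 * links / (k * (k - 1))       # triangles through i over possible pairs
    return C

# e.g. a triangle plus a pendant node:
# adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
# local_clustering(adj) -> {1: 1.0, 2: 1.0, 3: 0.333..., 4: 0.0}
```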
The Facebook social network reflects the “small world” effect, as discussed in Section 4.3.5.

Scale-free degree distributions

In a random graph (94) the node degree is characterized by a distribution function P(k) which defines the probability that a randomly chosen node has exactly k edges. Because the distribution of edges in a random graph is aleatory, most of the nodes have approximately the same degree, close to the mean degree ⟨k⟩ of the network. Thus, the degree distribution of a random graph is well described by a Poisson distribution law, with a peak at P(⟨k⟩). Recent empirical results show that in most real-world networks the degree distribution significantly differs from a Poisson distribution. In particular, for several large-scale networks, such as the World Wide Web (7), Internet (97) and metabolic networks (155), the degree distribution follows a power law

P(k) ∼ k^−λ   (4.3)

This power law distribution falls off more gradually than an exponential one, allowing for a few nodes of very large degree to exist. Since these power laws are free of any characteristic scale, such a network with a power law degree distribution is called a scale-free network (14). We proved that Facebook is a scale-free network well described by a power law degree distribution, as discussed in Section 4.3.4.

¹ Note: it will be extensively introduced in the next Chapter.

4.3.2 Experimentation

We describe some interesting experimental results in the following. To compute the overall statistics and centrality measures, such as degree and betweenness, we have adopted the Stanford Network Analysis Platform (SNAP) (183), a general purpose network analysis library.

4.3.3 Privacy Settings

We investigated the adoption of restrictive privacy policies by users: our statistical expectation using the “Uniform” crawler was to acquire about 2^16 ≃ 65.5K users. Instead, the actual number of collected users was 48.1K. Because of the restrictive privacy settings chosen by users, the discrepancy between the expected number of acquired users and the actual number was about 26.6%. In other words, about a quarter of Facebook users adopt privacy policies which prevent other users (except for those in their friendship network) from visiting their friend-list.

4.3.4 Degree Distribution

A first description of the network topology of the Facebook friendship graph can be obtained from the degree distribution. According to Equation 4.3, a relatively small number of nodes exhibit a very large number of links. An alternative approach involves the Complementary Cumulative Distribution Function (CCDF), defined as

℘(k) = ∫_k^∞ P(k′) dk′ ∼ k^−α ∼ k^−(γ−1)   (4.4)

When calculated for scale-free networks, the CCDF shows up as a straight line in a log-log plot, while the exponent of the power law distribution only varies the height (not the shape) of the curve. In Figure 4.4 the degree distribution is plotted, as obtained from the BFS and “Uniform” sampling techniques. The limitation due to the dimension of the cache which contains the friend-lists, upper bounded at 400, is evident. The CCDF is shown, for the same samples, in Figure 4.5. From these data it emerges that the degree distribution is not clearly defined by a strict power law. Rather, it emerges that different regimes can be identified for both samples.
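As a minimal, hedged sketch of how the degree distribution and the CCDF of Equation 4.4 can be computed on a sampled graph (here a small synthetic scale-free graph stands in for the Facebook samples, an assumption for demonstration), one may proceed as follows; log-log plotting is optional and omitted.

```python
import numpy as np
import networkx as nx

# Synthetic scale-free graph as a stand-in for a crawled sample (assumption).
G = nx.barabasi_albert_graph(5000, 3, seed=42)

degrees = np.array([d for _, d in G.degree()])
k_values, counts = np.unique(degrees, return_counts=True)
P_k = counts / counts.sum()                  # empirical degree distribution P(k)

# Complementary Cumulative Distribution Function: P(degree >= k), Eq. 4.4.
ccdf = 1.0 - np.cumsum(P_k) + P_k            # include the probability mass at k itself

for k, p, c in list(zip(k_values, P_k, ccdf))[:10]:
    print(f"k={k:3d}  P(k)={p:.4f}  CCDF(k)={c:.4f}")
```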
In detail, roughly dividing the domain into two intervals, tentatively 1 ≤ x ≤ 10 and 10 ≤ x ≤ 400, there exist two clear regimes, whose exponents are λ1 = 2.45 and λ2 = 0.6 for the BFS sample, and λ1 = 2.91 and λ2 = 0.2 for the Uniform sample. Figure 4.6 summarizes the previous findings, depicting the probability P(x) of finding a given number of nodes with a specific degree.

4.3.5 Diameter and Clustering Coefficient

It is well known that most real-world networks exhibit a relatively small diameter. A graph has diameter D if every pair of nodes can be connected by a path of length of at most D edges. However, the diameter may be affected by outliers. A robust measure of the pairwise distances between nodes in a graph is the effective diameter, which is the minimum number of links (steps/hops) within which some fraction (or quantile q, say q = 0.9) of all connected pairs of nodes can reach each other. The effective diameter has been found to be small for large real-world graphs, like Internet and the Web, real-life networks and OSNs (8, 185, 202).

Figure 4.4: Node degree distribution, BFS vs. UNI Facebook sample.
Figure 4.5: CCDF of the node degree distribution, BFS vs. UNI Facebook sample.
Figure 4.6: Node degree probability distribution, BFS vs. UNI Facebook sample.

The hop-plot extends the notion of diameter by plotting the number of reachable pairs g(h) within h hops, as a function of the number of hops h (228). It gives us a sense of how quickly the neighborhoods of nodes expand with the number of hops. In Figure 4.7 the number of pairs of nodes is plotted as a function of the number of hops required to connect each pair. As a consequence of the more “compact” structure of the graph, the BFS sample shows a faster convergence to the asymptotic value listed in Table 4.2. Often it is insightful to examine not only the mean clustering coefficient (see Section 4.3.1), but also its distribution. Figure 4.8 shows the average clustering coefficient plotted as a function of the node degree for the two sampling techniques. As a consequence of the more systematic approach of extraction, the distribution of the clustering coefficient of the BFS sample shows a smooth behavior. The following considerations hold for the diameter and hops: the BFS sample may be affected by the “wavefront expansion” behavior of the visiting algorithm, while the “Uniform” sample may still be too small to represent a faithful estimate of the diameter (this hypothesis is supported by the dimension of the largest connected component, which does not cover the whole graph, as discussed in the next paragraph). Different conclusions can be derived for the clustering coefficient property. It is important to observe that the average values of the BFS sample fluctuate in an interval similar to those reported by recent studies on OSNs (i.e., [0.05, 0.18] by Wilson et al., [0.05, 0.35] by Gjoka et al.), confirming that this property is preserved by the BFS sampling technique. On the contrary, due to the intrinsic features of the uniform sampling, the clustering coefficient is not sufficiently well represented at this scale.

Figure 4.7: Hops and diameter in Facebook.
Figure 4.8: Clustering coefficient in Facebook.

4.3.6 Connected Components

A connected component is a maximal set of nodes such that for each pair of nodes there exists a path connecting them.
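The following is a small, hedged sketch of how connected components can be extracted and the coverage of the largest one measured with NetworkX, in the spirit of the analysis reported below; the example graph (a random graph plus a few isolated dyads) is purely illustrative.

```python
import networkx as nx

# Illustrative graph: a large random component plus a few isolated dyads.
G = nx.erdos_renyi_graph(1000, 0.01, seed=1)
G.add_edges_from([(2000, 2001), (2002, 2003)])   # small disconnected components

components = sorted(nx.connected_components(G), key=len, reverse=True)
largest = components[0]
coverage = 100.0 * len(largest) / G.number_of_nodes()

print(f"number of connected components: {len(components)}")
print(f"largest connected component covers {coverage:.2f}% of the nodes")
```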
As shown in Tables 4.2 and 4.3, the largest connected components cover 99.98% of the BFS graph and 94.96% of the “Uniform” graph. In Figure 4.9, the scattered points in the left part of the plot have a different meaning for each sampling technique. In the “Uniform” case, the sampling picked up disconnected nodes. In the BFS case, disconnected nodes are meaningless, as they are due to some collisions of the hashing function during the de-duplication phase of the data-cleaning step. This interpretation is supported by their small number (29 collisions over 12.58 million hashed edges), involving only 0.0002% of the total edges. However, the quality of the sample is not affected. These conclusions are confirmed in Figure 4.10, where the betweenness centrality (BC) is plotted as a function of the degree, on a log-log scale. The BC shows a linearly proportional behavior with respect to the degree, which means that it follows a power law distribution p(g) ∼ g^−η. In our opinion, this implies a high degree of connectedness of the sample, since a high value of BC is related to a high value of the degree of the nodes. Moreover, it is well known that the BC distribution follows a power law distribution for scale-free networks (128). Similarly to the degree exponent case, in general, the BC exponents increase for node and link sampling and decrease for snowball sampling as the sampling fraction gets lower. The correlation between degree and BC of nodes (19), shown in Figure 4.10, could explain the same direction of changes of the degree and BC exponents. We found that the best fitting function describing the BC for Facebook has an exponent η = 0.61.

Figure 4.9: Connected components in Facebook.
Figure 4.10: Degree vs. betweenness centrality in Facebook.

Conclusion

Extraction and analysis of OSN data describing social networks poses both a technological and an interpretation challenge. We have presented in this Chapter our implemented system, namely the ad hoc Facebook crawler that has been developed to comply with the increasingly strict terms of the Facebook end-user license, i.e., to create large, fully anonymous samples that can be employed for scientific purposes. Two different sampling techniques have been implemented in order to explore the graph of friendships of Facebook, since the BFS visiting algorithm is known to introduce a bias in case of an incomplete visit. Analysis of such large samples was tackled using concepts and algorithms typical of graph theory, namely users were represented by nodes of a graph and relations among users were represented by edges. Social Network Analysis concepts, such as degree distribution, diameter, centrality metrics and clustering coefficient distribution, have been considered for such samples, highlighting those features which are believed to be preserved and those which are affected by some bias due to the partial sampling of the given OSN.

5 Network Analysis and Models of Online Social Networks

This Chapter is organized as follows: Section 5.1 presents related literature on the topics of social networks, their analysis – in particular regarding Online Social Networks – and the latest works, which define directions of Social Network Analysis. Section 5.2 introduces the key features reflected by most of the Online Social Networks.
In Section 5.3 we describe the generative models proposed to represent social networks, putting into evidence those aspects that could fit well to represent Online Social Networks and those in which they could fail. Results of our experimentation, presented in Section 5.5, depict the topological features of the other studied Online Social Networks, such as Arxiv, Wikipedia and Youtube.

5.1 Background and Related Literature

Studying large-scale Online Social Networks, and their evolution, can be useful to investigate similarities and differences with real-life social networks. Some interesting aspects in the study of social networks are defined by Social Network Analysis (SNA), a novel branch of the Computational Social Sciences. It provides techniques to evaluate networks, both from a quantitative (e.g., defining properties, characteristics and metrics) and a qualitative perspective. In this Chapter we face the problem of analyzing the topological features of some popular Online Social Networks other than Facebook, focusing on the graphs which represent these networks. To do so, we adopt some specific topological measures, such as the diameter, and we study the degree distribution. Moreover, we investigate the emerging community structure characterizing these networks.

5.1.1 Social Networks and Models

Literature about social network models is rooted in the social sciences: in the sixties, Milgram and Travers (202, 266) analyzed characteristics of real-life social networks, conducting several social experiments and, in conclusion, proposing the well-known “small-world” model (see Section 5.2.1). Kleinberg (165) analyzed the “small-world” effect from an algorithmic perspective, providing important algorithms to compute metrics on graphs representing social networks, the so-called social graphs. Another important concept, introduced by Zachary (287), is the community structure (see Section 5.2.3). The author analyzed a small real-life social community (the members of a karate club), defining a model which describes the clusterization of social networks via cuts and fissions into sub-groups. One of the first models is the so-called Erdős-Rényi model (see Section 5.3.1), which employs random graphs in order to represent real networks. Watts and Strogatz (273, 274) provided a one-parameter model that interpolates between an ordered finite-dimensional lattice and a random graph (see Section 5.3.2). This is because they empirically found that real-world social networks are well connected and have a short average path length like random graphs, but they also have exceptionally large clustering coefficients, a feature which is not reflected by random graph models. Barabási and Albert (7, 8, 15) introduced different models that can be applied to friendship networks, the World Wide Web, business and commerce networks, etc., proving that they all share similar properties (see Section 5.3.3).

5.1.2 Recent Studies and Current Trends

Some of the current trends in the analysis of social networks are summarized as follows: a) Some works (4, 203) investigate topological features of social networks by means of measurements, studying link symmetries, degree distributions, clustering coefficients, group formations, etc., usually on a large scale, by analyzing Online Social Networks. b) Another trend in current research is the analysis of evolutionary aspects of social networks. In this context, Kumar et al.
(169) defined a generative model which describes the structure of OSNs and their dynamics over time. This model has been compared against actual data in order to validate its reliability. Similarly, Leskovec (184) analyzed evolutionary aspects of social networks, trying to describe the dynamic and structural features which influence the growth of communities, in particular when considering large social networks. c) Graph mining techniques assume growing importance because of the computational complexity of studying large-scale social graphs with millions of nodes and edges. Some authors (125, 184) faced the problem of sampling from large graphs adopting different techniques, in order to establish whether it is possible to avoid bias in the data by studying sub-graphs of social networks. They found that Random Walks and Metropolis-Hastings algorithms perform better, respectively, for static and dynamic graphs, concluding that samples of size of 15% of a social graph preserve most of the properties. d) Some authors (129, 189) try to identify which characteristics of the network could suggest which nodes are more likely to be connected by trusted relationships, the so-called link prediction problem. This is of great interest for different commercial reasons, which will be discussed in detail in the next Chapter of this dissertation.

Applications of Social Network Analysis research

Possible applications of information acquired from social networks have been investigated by Staab et al. (260): methodologies for exploiting discovered data were defined, for marketing purposes, recommendation and trust analysis, etc. Recently, several marketing and commercial studies have been applied to OSNs, in particular to discover efficient channels to distribute information (159) and users who share similar tastes and preferences, in order to suggest them useful recommendations (84). This Thesis provides useful information in all these directions, identifying interesting characteristics of Online Social Networks and considering the topological features that could affect how efficiently nodes and edges carry information through the networks.

5.2 Features of Social Networks

In this Section we put into evidence three key features that characterize social networks, i.e., i) the “small-world” effect, ii) scale-free degree distributions and iii) the emergence of a community structure. During our experimentation, we take these features into account in order to establish whether Online Social Networks show these well-known characteristics. A social network can be defined by means of a graph G = (V, E) whose set of vertices V represents the nodes of the network (i.e., the individuals), and whose set of edges E represents the connections (i.e., the social ties) among nodes of the network.

5.2.1 The “Small-World”

The study of the “small-world” effect on social networks is rooted in the Social Sciences (202, 266). The authors put into evidence that, despite their large size, social networks usually show a common feature: there exists a relatively short path connecting any pair of nodes within the network. Formally, a “small-world” network is defined as a graph in which most nodes are not neighbors of one another, but can be reached from every other node through a small number of hops.
The diameter ℓ, which reflects the so-called “small-world” effect, scales proportionally to the logarithm of the dimension of the network, which is formalized as

ℓ ∝ log(|V|)   (5.1)

where |V| represents the cardinality of V. Some characteristics of many real-world networks are well modeled by means of “small-world” networks, such as OSNs (274), Internet (230), the World Wide Web (7) and biological networks (124).

5.2.2 Scale-free Degree Distributions

An important feature that is reflected by several generative models of social networks is the degree distribution of nodes. This is because this feature characterizes the way the nodes are interconnected in the social network. On the one hand, in a random graph (94) the node degree is characterized by a distribution function P(k) well described by a Poisson law, with a peak at P(⟨k⟩). On the other hand, recent empirical results show that in most real-world networks the degree distribution follows a power law

P(k) ∼ k^−γ   (5.2)

Power-law-based models (see Section 5.3.3) apparently depict well the node degree distributions of large-scale social networks. Since these power laws are free of any characteristic scale, such a network with a power law degree distribution is called a scale-free network (14).

5.2.3 Emergence of a Community Structure

Another aspect to take into account when studying social networks is the emergence of a community structure: the more evident this structural characteristic is, the more the network tends to divide into groups of nodes whose connections are denser among entities belonging to the same group and sparser otherwise. Not all network models are able to represent this characteristic. In fact, the Erdős-Rényi model (see Section 5.3.1) and the Barabási-Albert model (see Section 5.3.3) cannot meaningfully represent the concept of community structure that emerges from the empirical analysis of social networks. The community structure of Online Social Networks is widely described in the following.

5.3 Models of Social Networks

Concepts such as the short path length, the clustering and the scale-free degree distribution have been applied to rigorously model social networks. Different models have been presented, but in this dissertation we focus on the three most widely exploited modeling paradigms: i) random graphs, ii) “small-world” networks and iii) power law networks. Random graphs represent an evolution of the Erdős-Rényi model and are widely used in several empirical studies because of their ease of adoption. After the discovery of the clustering effect, a new class of models, namely “small-world” networks, has been introduced. Similarly, the power law degree distribution emerging from real-world social networks led to the modeling of the homonymous networks, which are adopted to describe scale-free behaviors. These models focus on the dynamics of the network, in order to explain phenomena such as power laws and other non-Poisson degree distributions.

5.3.1 The Erdős-Rényi Model

Erdős and Rényi (94) proposed one of the first network models, the random graph. They defined two models: the simplest one consists of a graph containing n vertices connected randomly. The commonly adopted model, instead, is defined as a graph G_{n,p} in which each possible edge between two vertices may be included in the graph with probability p (and may not be included with probability (1 − p)).
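A minimal, hedged sketch of the G_{n,p} model just described, using NetworkX (the parameter values are arbitrary assumptions): it generates a random graph and compares its mean degree with the theoretical expectation ⟨k⟩ = p·(n − 1); its Poisson-shaped degree distribution is precisely the point of contrast with real social networks.

```python
import networkx as nx

n, p = 2000, 0.005                      # illustrative parameters
G = nx.erdos_renyi_graph(n, p, seed=7)  # G_{n,p}: each edge kept with probability p

mean_degree = 2 * G.number_of_edges() / n
print(f"empirical mean degree: {mean_degree:.2f}")
print(f"expected mean degree p*(n-1): {p * (n - 1):.2f}")

# Degree histogram: for G_{n,p} it concentrates around <k> (Poisson-like),
# unlike the heavy-tailed distributions observed in real social networks.
hist = nx.degree_histogram(G)
for k, count in enumerate(hist):
    if count:
        print(k, count)
```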
Although random graphs have been widely adopted because their properties ease the work of modeling networks (for example, random graphs have small diameters), they do not properly reflect the structure of real-world large-scale networks, mainly for two reasons: i) the degree distribution of random graphs follows a Poisson law, which substantially differs from the power law distribution shown by empirical data; ii) they do not reflect the clustering phenomenon, considering all the nodes of the network with the same weight and reducing, de facto, the network to a giant cluster. This emerges by considering Figure 5.1, where an Erdős-Rényi graph generated by adopting n = 30 and p = 0.25 is shown. Most of the nodes have similar closeness centrality (which is related to their degree), identified by the gray color in a gray-scale, and this means that all the nodes have relatively similar features (which is consistent with the formulation of the graph model, according to the Poisson distribution of node degrees). Social networks exhibit a rather different behavior, making this model unfeasible for modern studies, although it has been widely adopted in the past.

Figure 5.1: Generative model: Erdős-Rényi (94).
Figure 5.2: Generative model: Newman-Watts-Strogatz (219).
Figure 5.3: Generative model: Watts-Strogatz (274).
Figure 5.4: Generative model: Barabási-Albert (14).
Figure 5.5: Generative model: Holme-Kim (149).

5.3.2 The Watts-Strogatz Model

Real-world social networks are well connected and have a short average path length like random graphs, but they also have exceptionally large clustering coefficients, a characteristic that is not reflected by the Erdős-Rényi model or by other random graph models. Watts and Strogatz proposed a one-parameter model that interpolates between an ordered finite-dimensional lattice and a random graph (274). Starting from a ring lattice with n vertices and k edges per vertex, each edge is rewired at random with probability p, with p ranging from 0 (regular network) to 1 (random network). Focusing on two quantities, namely the characteristic path length L(p) (defined as the number of edges in the shortest path between two vertices) and the clustering coefficient C(p), some authors (148) found that L ∼ n/2k ≫ 1 and C ∼ 3/4 as p tends to 0, while L ≈ L_random ∼ ln(n)/ln(k) and C ≈ C_random ∼ k/n ≪ 1 as p tends to 1. The Watts-Strogatz model is therefore suitable for explaining such properties in many real-world examples. The model has been widely studied since its details have been published. Its role is important in the study of the “small-world” theory. Some relevant theories, such as Kleinberg’s work (164, 165), are based on this model and its variants. The disadvantage of the model, however, is that it is not able to capture the power law degree distribution presented by most real-world social networks. A strong structural difference is evident between the Watts-Strogatz model (274) and its variant Newman-Watts-Strogatz (219), presented in Figures 5.3 and 5.2 respectively, when compared with the Erdős-Rényi graph. First of all, it emerges that the centrality of nodes is more heterogeneous, covering the whole gray-scale.
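The following hedged sketch reproduces, on a small scale, the qualitative behavior of L(p) and C(p) described above for the Watts-Strogatz model; the parameters n = 1000 and k = 10 and the set of rewiring probabilities are illustrative assumptions.

```python
import networkx as nx

n, k = 1000, 10                      # ring lattice size and edges per vertex (assumed)
for p in (0.0, 0.01, 0.1, 1.0):      # from regular lattice to random graph
    G = nx.connected_watts_strogatz_graph(n, k, p, seed=3)
    L = nx.average_shortest_path_length(G)   # characteristic path length L(p)
    C = nx.average_clustering(G)             # clustering coefficient C(p)
    print(f"p={p:<5}  L={L:6.2f}  C={C:.3f}")
# Expected trend: L drops quickly as p grows, while C stays high for small p
# and collapses only for large p -- the "small-world" regime lies in between.
```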
On the other hand, it is evident, if compared with the other models, that it cannot properly reflect the power law distribution of node degrees experimentally shown by the data, even if a community structure is well represented (see Section 5.5.2).

5.3.3 The Barabási-Albert Model

The two previously discussed theories observe properties of real-world networks and attempt to create models that incorporate those characteristics. However, they do not help in understanding the origin of social networks and how those properties evolve. The Barabási-Albert model suggests that the two main ingredients of the self-organization of a network into a scale-free structure are growth and preferential attachment. These point to the fact that most networks continuously grow by the addition of new nodes, which are preferentially attached to existing nodes with large numbers of connections. The generation scheme of a Barabási-Albert scale-free model is as follows: (i) Growth: let p_k be the fraction of nodes in the undirected network of size n with degree k, so that Σ_k p_k = 1 and therefore the mean degree m of the network is (1/2) Σ_k k·p_k. Starting with a small number of nodes, at each time step we add a new node with m edges linked to nodes already part of the system; (ii) Preferential attachment: the probability Π_i that the new node will be connected to the node i (one of the n already existing nodes) depends on the degree k_i of the node i, in such a way that Π_i = k_i / Σ_j k_j. Models based on preferential attachment operate in the following way. Nodes are added one at a time. When a new node u has to be added to the network, it creates m edges (m is a parameter and it is constant for all nodes). The edges are not placed uniformly at random but preferentially, i.e., the probability that a new edge of u is placed to a node v of degree d(v) is proportional to its degree, p_u(v) ∝ d(v). This simple behavior leads to power law degree tails with exponent γ ≈ 3. Moreover, it also leads to low diameters. While the model captures the power law tail of the degree distribution, it has other properties that may or may not agree with empirical results on real networks. Recent analytical research on the average path length indicates that ℓ ∼ ln(|V|)/ln ln(|V|). Thus the model has a much shorter ℓ with respect to a random graph. The clustering coefficient decreases with the network size, following approximately a power law C ∼ |V|^−0.75. Though greater than that of random graphs, it depends on the size of the network, which is not true for real-world social networks. Figures 5.4 and 5.5 propose two examples of graphs generated by using the Barabási-Albert scale-free model (14) and a variant by Holme and Kim (149). It is evident that the structure of these networks is much more compact than that of the Watts-Strogatz models but, on the other hand, there are a few nodes that have a very high centrality (which is proportional to their degree) while most of the others have very low degrees (those depicted in dark gray). It is also possible to put into evidence that, due to the spring layout given by the Fruchterman-Reingold algorithm (116), those nodes with low degrees (belonging to the tail of the power law) are represented in peripheral positions with respect to central nodes. On the other hand, this model fails in representing a meaningful community structure of the network, differently from the Watts-Strogatz-based models (see further).
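To make the growth and preferential attachment rules above concrete, here is a small, hedged sketch of the Barabási-Albert process; the explicit implementation (rather than NetworkX's built-in generator) is only meant to mirror the attachment probability Π_i = k_i / Σ_j k_j, and the values of n and m are illustrative assumptions.

```python
import random
import networkx as nx

def barabasi_albert(n: int, m: int, seed: int = 0) -> nx.Graph:
    """Grow a graph by adding nodes one at a time; each new node attaches m
    edges to existing nodes chosen with probability proportional to degree."""
    rng = random.Random(seed)
    G = nx.complete_graph(m + 1)          # small seed network
    # Pool with each node repeated as many times as its degree: sampling
    # uniformly from it realizes the preferential attachment rule.
    targets_pool = [v for v, d in G.degree() for _ in range(d)]
    for u in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:            # sample m distinct targets
            chosen.add(rng.choice(targets_pool))
        for v in chosen:
            G.add_edge(u, v)
            targets_pool.extend([u, v])   # keep the pool degree-weighted
    return G

G = barabasi_albert(5000, 3)
print("max degree:", max(d for _, d in G.degree()))   # heavy-tailed: a few hubs
```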
5.4 Community Structure

The concept of community structure in social networks is crucial and this Section is devoted to: i) formally defining the meaning of community and community structure of a social network, ii) introducing the problem of community detection in social networks – including those problems related to massive OSNs – and, finally, iii) discussing the mathematical models introduced above and their inability to represent the community structure of social networks.

5.4.1 Definition of Community Structure

We define the community structure of a network as in (109). In a random graph (94), the distribution of edges among the vertices is highly homogeneous, since it follows a Poissonian distribution (as previously discussed), so most nodes have equal or similar degree. Real-world networks are not random graphs, as they display large inhomogeneities, revealing a high level of order and organization. The degree distribution is broad and scale-free: therefore, many nodes with low degree coexist with few nodes with large degree. Furthermore, the distribution of edges is not only globally, but also locally inhomogeneous, with high concentrations of edges within special groups of nodes and low concentrations between these groups. This feature of real-world networks is called community structure, or clustering effect.

5.4.2 Discovering Communities

The problem of unveiling the community structure of a network is called community detection. In the context of community detection, two main types of algorithms exist: i) partitioning algorithms; ii) overlapping-nodes community detection algorithms.

Partitioning Algorithms

In its general formulation, the problem of finding communities in a network is intended as a data clustering problem, thus solvable by assigning each vertex of the network to a cluster in a meaningful way. There are essentially two different and widely adopted approaches to solve this problem: the first is spectral clustering (141), which relies on optimizing the process of cutting the graph; the second is based on the concept of network modularity. The problem of minimizing the graph cuts is NP-hard, thus an approximation of the exact solution can be obtained by using spectral clustering (220), exploiting the eigenvectors of the Laplacian matrix of the network. We recall that the Laplacian matrix L of a given graph has components L_ij = k_i·δ(i, j) − A_ij, where k_i is the degree of node i, δ(i, j) is the Kronecker delta (that is, δ(i, j) = 1 if and only if i = j) and A_ij is the adjacency matrix representing the graph connections. This process can be performed using the concept of ratio cut (141, 275), a function which can be minimized in order to obtain large clusters with a minimum number of outgoing interconnections among them. The main limitation of spectral clustering is that it requires the number of communities present in the network, and their size, to be defined in advance. This makes it unsuitable if one wants to discover the number and the features of the existing communities in a given network. Moreover, as demonstrated by (254), it does not work very well in detecting small communities within densely connected networks.
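As a hedged illustration of the spectral idea sketched above (not of the specific ratio-cut algorithms cited), the following snippet bisects a small graph using the eigenvector associated with the second-smallest eigenvalue of the Laplacian L = D − A (the Fiedler vector); the example graph and the simple sign-based split are assumptions made for demonstration.

```python
import numpy as np
import networkx as nx

# Two planted groups joined by a few edges (illustrative graph).
G = nx.planted_partition_graph(2, 20, 0.5, 0.02, seed=5)

L = nx.laplacian_matrix(G).toarray().astype(float)   # L = D - A
eigvals, eigvecs = np.linalg.eigh(L)                  # symmetric eigendecomposition
fiedler = eigvecs[:, 1]                               # eigenvector of 2nd-smallest eigenvalue

# The sign of the Fiedler vector gives a two-way partition of the vertices.
part_a = [n for n, x in zip(G.nodes(), fiedler) if x >= 0]
part_b = [n for n, x in zip(G.nodes(), fiedler) if x < 0]
print(len(part_a), "nodes vs", len(part_b), "nodes")
print("edges crossing the cut:",
      sum(1 for u, v in G.edges() if (u in part_a) != (v in part_a)))
```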
The network modularity concept can be explained as in (217): let us consider a network, represented by means of a graph G = (V, E), which has been partitioned into m communities; its corresponding value of network modularity is

Q = Σ_{s=1}^{m} [ l_s/|E| − ( d_s/(2|E|) )² ]   (5.3)

assuming l_s the number of edges between vertices belonging to the s-th community and d_s the sum of the degrees of the vertices in the s-th community. Intuitively, high values of Q imply high values of l_s for each discovered community; thus, detected communities are dense within their structure and weakly coupled among each other. Because the task of maximizing the function Q is NP-hard, several approximate techniques have been presented during the last years. Let us consider the Girvan-Newman (GN) algorithm (214, 217). It first calculates the edge betweenness B(e) of any given edge e in a network G = (V, E), defined as

B(e) = Σ_{n_i ∈ V} Σ_{n_l ∈ V} np_e(n_i, n_l) / np(n_i, n_l)   (5.4)

where n_i and n_l are vertices of G, np(n_i, n_l) is the number of shortest paths between n_i and n_l, and np_e(n_i, n_l) is the number of shortest paths between n_i and n_l containing e. The GN algorithm is based on the assumption that it is possible to maximize the value of Q by deleting edges with a high value of betweenness, because they connect vertices belonging to different communities. Starting from this intuition, the algorithm first ranks all the edges with respect to their betweenness, then it removes the most influential one, calculates the value of Q and iterates the process until a significant increase of Q is obtained. At each iteration, each connected component of G identifies a community. Its cost is O(n³), n being the number of vertices in G; intuitively, it is unsuitable for large-scale networks. A large number of improved versions of this approach have been provided in the last years, such as the fast clustering algorithm provided by (68, 69), running in O(n log n) on sparse graphs; the extremal optimization method proposed by (91), based on a fast agglomerative approach with O(n² log n) time complexity; the Newman-Leicht (218) mixture model based on statistical inference; and other maximization techniques by (213) based on eigenvectors and matrices. Concluding, another different partitioning approach is the “core-periphery” one, introduced by (41, 95); it relies on separating a tight core from a sparse periphery. The next Chapter of this dissertation is completely devoted to the problem of community detection in massive OSNs, and a number of details will be further provided there.

Overlapping Nodes Community Detection

Recently, the problem of discovering the community structure of a network has been extended to include the possibility of finding overlapping nodes belonging to different communities at the same time. One of the first approaches has been presented by (227) and has attracted a lot of attention from the scientific community. A lot of effort has been spent in order to advance novel possible strategies. For example, an interesting approach has been proposed by (138), which is based on an extension of the Label Propagation Algorithm. On the other hand, an approach in which hierarchical clustering is instrumental to finding the overlapping community structure has been proposed by (175). Finally, during the latest years some novel techniques have been proposed (181, 197).
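A brief, hedged sketch of how the Girvan-Newman procedure and the modularity Q of Equation 5.3 can be evaluated with NetworkX's built-in implementations; the stopping rule (keep the partition maximizing Q over the first few splits) and the example graph are simplifying assumptions rather than the exact procedure of the cited works.

```python
from itertools import islice
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()   # classic Zachary karate club example

best_q, best_partition = -1.0, None
# girvan_newman yields successive partitions obtained by repeatedly
# removing the edges with the highest betweenness.
for partition in islice(community.girvan_newman(G), 10):
    q = community.modularity(G, partition)   # Eq. 5.3 on the current partition
    if q > best_q:
        best_q, best_partition = q, partition

print(f"best modularity Q = {best_q:.3f} with {len(best_partition)} communities")
for c in best_partition:
    print(sorted(c))
```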
5.4.3 Models Representing the Community Structure

From the perspective of the models representing the community structure of a network, we can infer the following information. From Figure 5.6, where the community structure of an Erdős-Rényi model is represented, the result appears random, in accordance with the formulation of the model and with its expected behavior when the calculated network modularity function Q (Equation 5.3) is analyzed. From Figures 5.7–5.8, at a first glance, it emerges that the community structure of the Watts-Strogatz models is very regular and there is a balance between communities with tighter connections and those with weaker connections. This reflects the formulation of the model but does not depict well the community structure exhibited by scale-free networks. Finally, Figures 5.9–5.10 appear more compact and densely connected, features that are not reflected by experimental data. Even if they well represent the “small-world” effect and the power law distribution of degrees, the Barabási-Albert model and its variants appear inefficient at representing the community structure of Online Social Networks.

Figure 5.6: Community structure of the Erdős-Rényi (94) model.
Figure 5.7: Community structure of the Newman-Watts-Strogatz (219) model.
Figure 5.8: Community structure of the Watts-Strogatz (274) model.
Figure 5.9: Community structure of the Barabási-Albert (14) model.
Figure 5.10: Community structure of the Holme-Kim (149) model.

no.  Network        no. nodes  no. edges  Dir.  Type         d(q)  γ     σ     Q      Ref
1    CA-AstroPh        18,772    396,160  No    Collaborat.   5.3  2.23  1.50  0.628  (184)
2    CA-CondMat        23,133    186,932  No    Collaborat.   7.9  2.65  1.49  0.731  (184)
3    CA-GrQc            5,242     28,980  No    Collaborat.   8.9  2.12  1.48  0.861  (184)
4    CA-HepPh          12,008    237,010  No    Collaborat.   6.6  1.71  1.46  0.659  (184)
5    CA-HepTh           9,877     51,971  No    Collaborat.   8.4  2.63  1.46  0.768  (184)
6    Cit-HepTh         27,770    352,807  Yes   Citation      6.5  3.28  1.48  0.658  (184)
7    Email-Enron       36,692    377,662  Yes   Collaborat.   5.4  1.84  1.48  0.615  (184)
8    Facebook          63,731  1,545,684  Yes   Online Com.   6.8  2.91  1.48  0.634  (203)
9    Youtube        1,138,499  4,945,382  Yes   Online Com.   7.6  2.05  –     0.447  (203)
10   Wiki-Vote          7,115    103,689  Yes   Collaborat.   4.5  1.38  –     0.418  (184)

Table 5.1: Datasets and results: d(q) is the effective diameter; γ and σ are, respectively, the exponents of the power law node degree and community size distributions; Q is the network modularity.

5.5 Experimental Evaluation

Our experimentation has been conducted on different Online Social Networks whose datasets are available online and are discussed in the following.

5.5.1 Description of Adopted Online Social Network Datasets

Datasets 1–5 are taken from Arxiv¹ datasets, as of April 2003, of papers in the fields of, respectively: 1) “Astro Physics”, 2) “Condensed Matter Physics”, 3) “General Relativity and Quantum Cosmology”, 4) “High Energy Physics - Phenomenology”, and 5) “High Energy Physics - Theory”. Dataset 6 represents a network of scientific citations among papers belonging to the Arxiv “High Energy Physics - Theory” field. Dataset 7 illustrates the email communications among the Federal Energy Regulatory Commission members (184). Dataset 8 describes a sample of the Facebook friendship network, representing its social graph. Dataset 9 depicts the social graph of YouTube as of 2007 (203).
Finally, dataset 10 depicts the voting system of Wikipedia for the elections of administrators that occurred in January 2008. The adopted datasets are summarized in Table 5.1.

5.5.2 Topological Properties

Several measures are usually needed in order to study the topological features of social networks. To this purpose, for example, Carrington et al. (56) propose a list of some of them, including, amongst others, node/edge degree distributions, diameter, clustering coefficients, and more. In this experiment, the following features have been investigated for all the datasets discussed above: i) node degree distribution; ii) diameter and hops; iii) community structure.

Degree distribution

The first interesting feature we analyzed is the degree distribution, which is reflected in the topology of the network. The literature reports that social networks are usually described by power law degree distributions, P(k) ∼ k^−γ, where k is the node degree and γ ≤ 3. We already found that this indication holds true for the Facebook samples we collected and discussed in the previous Chapter, even if with some corrections, by identifying multiple regimes. We recall that the degree distribution can also be represented by using the distribution function called Complementary Cumulative Distribution Function (CCDF), previously defined as ℘(k) = ∫_k^∞ P(k′) dk′ ∼ k^−α ∼ k^−(γ−1). In Figure 5.11 we show the degree distribution and the corresponding CCDF evaluated on our Online Social Networks¹. For the networks that are directed, the out-degree is represented. All the plots are drawn on a log-log scale, in order to put into evidence the scale-free behavior shown by these networks. In particular, for each of these distributions we estimated the value of the exponent γ of the power law distribution (as in Equation 5.2). Values of γ are reported in Table 5.1. Online Social Networks can be classified into two categories: i) networks that are properly described by a power law distribution; ii) networks that show some fluctuations with respect to the power law distributions that best fit the real data. We discuss these two categories separately. The networks that are well described by a power law distribution, such as those depicting datasets 7–10, are all Online Communities (i.e., networks of individuals connected by social ties such as friendship relations) characterized by one fact: most of the users are rather inactive and, thus, have few connections with other members of the network. This phenomenon shapes a very consistent, long tail and a short head of the power law degree distribution (as graphically depicted by the respective plots in Figure 5.11). The latter category, instead, includes the co-authorship networks (datasets 1–5), which are collaboration networks, and a citation network (dataset 6). Plotting these data against the power law distributions that best fit them shows some fluctuations, in particular in the head of the distribution, where, apparently, the behavior of the distribution is not properly described. The rationale behind this phenomenon lies in the intrinsic features of these networks, which are slightly different with respect to Online Communities.

¹ Arxiv (http://arxiv.org/) is an online archive for scientific preprints in the fields of Mathematics, Physics and Computer Science, amongst others.
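As a hedged sketch of how an exponent γ as in Equation 5.2 can be estimated from an empirical degree sequence, the following uses the standard maximum-likelihood estimator for a power law tail, γ̂ = 1 + n / Σ ln(k_i / k_min); the choice of k_min and the synthetic degree sequence are illustrative assumptions, not the procedure used to produce Table 5.1.

```python
import math
import networkx as nx

def powerlaw_gamma_mle(degrees, k_min=10):
    """MLE of the power law exponent for the tail k >= k_min:
    gamma_hat = 1 + n / sum(ln(k_i / k_min))."""
    tail = [k for k in degrees if k >= k_min]
    n = len(tail)
    return 1.0 + n / sum(math.log(k / k_min) for k in tail)

# Synthetic scale-free graph standing in for one of the datasets (assumption).
G = nx.barabasi_albert_graph(20000, 4, seed=11)
degrees = [d for _, d in G.degree()]
print(f"estimated gamma: {powerlaw_gamma_mle(degrees):.2f}   (BA model expects ~3)")
```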
For example, regarding the co-authorship networks that represent collaborations among Physics scientists, most of the papers are characterized by the following behavior, which can be inferred from the analysis of the real data: the number of co-authors tends to increase up to 3 (on average), then it slowly slopes down to a dozen, and finally it quickly decreases. A similar interpretation holds for the citation network, which is usually intended as a network in which a very small number of papers gather most of the citations, while a huge number of papers have few or even no citations at all. This is a well-known phenomenon, called the “first-mover effect” (215). Intuitively, from a modeling perspective, the only viable solution to capture these scale-free degree distribution behaviors would be by means of a Barabási-Albert preferential attachment model. On the one hand, by using this model it would be possible to reproduce the power law degree distribution of all the Online Social Networks depicted above. Similarly, even the “small-world” effect that describes networks with small diameters would be captured. On the other hand, this model would fail in depicting the community structure of those networks, whose existence has been put into evidence both in this study (see further) and in other works (186, 209).

¹ For each network we plot the data, the best fitting power law function and the complementary cumulative distribution function (all the plots use the same scale as the first one).

Figure 5.11: Node degree distributions (log-log scale).

Diameter and hops

Most real-world social networks exhibit a relatively small diameter, but the diameter is susceptible to outliers. A more reliable measure of the pairwise distances between nodes in a graph is the effective diameter, already introduced previously. In Figure 5.12 the number of pairs of nodes reachable is plotted as a function of the number of hops required to connect them, for each given network¹. As a consequence of the compact structure of these networks (highlighted by the scale-free distributions and the “small-world” effect, discussed above), the diameters show a fast convergence to the asymptotic values listed in Table 5.1. From a modeling standpoint, as for the degree distributions, the previous considerations hold true. Both the Watts-Strogatz and the Barabási-Albert models could efficiently depict the “small-world” feature of these Online Social Networks, and, most importantly, the empirical data verify the so-called “six degrees of separation” theory, which is strictly related to the “small-world” formulation. In fact, it is evident that, regardless of the large scale of the networks analyzed, the effective diameters are really small (and, on average, close to 6), which has been proved for real-world social networks (18, 274).

¹ The number of pairs of nodes reachable is plotted against the number of hops required, with q = 0.9.

Community Structure Analysis

From our experimental analysis on real datasets, by analyzing the community structures obtained using the Louvain method² (32), we focus on the study of the distribution of the sizes of the obtained communities (i.e., the number of members constituting each detected community) (109).

² The Louvain method is a community detection algorithm which is discussed in detail in Chapter 7. Since at this point it is not crucial to understand its functioning, we defer its discussion to that Chapter.

Recently, Fortunato and Barthelemy (110) put into evidence a resolution limit when adopting network modularity as the maximization function for community detection. In detail, the authors
found that modularity optimization may fail in the discovery of communities whose size is smaller than a given threshold. This value is strictly correlated to the characteristics of the given network. This resolution limit results in the creation of large communities incorporating an important part of the nodes of the network. In practice, in some particular cases it is possible that the clustering process produces a small number of communities of big size. This could possibly affect results in two ways: i) enlarging the tail of the power law distribution of the community sizes, or ii) producing a clustering of the network that is not significant. Because the clustering algorithm adopted (i.e., the Louvain method) is a modularity maximization technique, we investigated the effect of the resolution limit on our datasets. We found that in two cases (i.e., for datasets 9–10) the clustering obtained was biased by the resolution limit, and we excluded these networks from our analysis. In the following we investigate the behavior of the distribution of the size of the communities in our networks. In Figure 5.13, on the x-axis we plot the size of the community, and on the y-axis the probability P(x) of finding a community of the given size in the network. For each distribution we provide the best fitting power law function with a given exponent σ (which always ranges in the interval [1.4, 1.5]) that well approximates the behavior of the community size. In the figure the data are plotted as points, and it is possible to highlight some communities whose size is larger than that expressed by the expected power law function (plotted as a red line), constituting the heavy tail of the power law distribution. The depicted results show that, within large Online Social Networks, there is a high probability of finding a large number of communities that contain few individuals and a relatively low probability of finding communities constituted by a large number of members. This confirms that individuals are more likely to aggregate in small communities, such as those representing family, friends, colleagues, etc., rather than in large communities (99, 234). Fascinatingly, a similar phenomenon happens for co-authorship and citations of scientific papers. Moreover, from Figure 5.13 we can put into evidence that large Online Communities, for example Facebook and the scientific collaboration networks, show a very tight community structure (a fact proved also by the high values of network modularity, reported as Q in Table 5.1). For example, regarding the collaboration networks, intuitively, we interpret this fact considering that scientists, even when co-authoring different works with different persons, usually work on papers signed by only a small number of co-authors. It is very likely that these co-authors tend to group together (for example, if they co-authored several works) in the corresponding scientific communities. On the other hand, for some networks such as the citation network and the email network, Figure 5.13 shows that there exists an important number of communities constituted by a large number of individuals, forming the heavy, long tail of the power law distribution¹. Also this aspect has an intuitive explanation.
In fact, if we consider a network of scientific citations, there is a small number of papers with a huge number of citations (which are very central in the topology of the network and, thus, are aggregated in the same communities) and most of the others have very few citations, thus forming small communities among each other (or single entities).

¹ The probability P(x) of finding a community of a given size in the network is plotted against the size of the community. In red, the best fitting power law distribution functions are depicted.

Figure 5.12: Effective diameters (log-normal scale).
Figure 5.13: Community structure analysis (log-log scale).

Conclusion

In this Chapter we put into evidence those models which try to efficiently and faithfully represent the topological features of Online Social Networks. Several models have been presented in the literature, and we focused our attention on the three most exploited ones, i.e., i) Erdős-Rényi random graphs, ii) Watts-Strogatz and iii) Barabási-Albert preferential attachment. Each model, even if it describes well some specific characteristics, fails in faithfully representing all the three main features we identified that characterize Online Social Networks, namely i) the “small-world” effect, ii) scale-free degree distributions and, finally, iii) the emergence of a community structure. We analyzed the topological features of several real-world Online Social Networks, fitting real data to the models and putting into evidence which characteristics are preserved and which cannot faithfully be represented by using these models.

6 Community Structure in Facebook

This Chapter is organized as follows. Section 6.1 covers the background and the related work about detecting the community structure within a network, with particular attention to the specific area of Online Social Networks. Section 6.2 introduces some details about two fast community detection algorithms we have adopted to detect the community structure of Facebook. Experimental results, performance evaluation and data analysis are shown. We describe the methodology behind this work, illustrating the aspects on which we focused during our experimentation. Details related to the formulation of the problem and the choice of the solutions are illustrated. In detail, in Section 6.3 we describe the community structure of Facebook, the process of building a meta-network from the communities we discovered, and the analysis of the topological features of this artifact. Finally, Section 6.4 presents some clues in the direction of the quantitative assessment of the renowned sociological theory of the strength of weak ties (137).

6.1 Background and Related Literature

The social role of Online Social Networks is to help people enhance their connections with each other in the context of the Internet. On the one hand, these relationships are very tight in some areas of the social life of each user, such as family, colleagues, friends, and so on. On the other, outgoing connections with other individuals not belonging to any of these categories are less likely to happen. This effect is reflected in a phenomenon called community structure.
We recall that a community is formally defined as a sub-structure present in the network, representing connections among users, in which the density of relationships within the members of the community is much greater than the density of connections among communities. From a structural perspective, this is reflected by a graph which is very sparse almost everywhere but dense in local areas, corresponding to the communities (also called clusters). A lot of different motivations to investigate the community structure of a network exist. From a scientific perspective, it is possible to put into evidence interesting properties or hidden information about the network itself. Moreover, individuals that belong to the same community may share some similarities and possibly have common interests, or are connected by a specific relationship in the real world. These aspects give rise to a lot of commercial and scientific applications; in the first category we cite, for example, marketing and competitive intelligence investigations and recommender systems. In fact, users belonging to the same community could share tastes or interests in similar products. In the latter category, models of disease propagation and of distribution of information have been largely investigated in the context of social networks. The problem of discovering the community structure of a network has been approached in several different ways. A common formulation of this problem is to find a partitioning V = (V1 ∪ V2 ∪ · · · ∪ Vn) into disjoint subsets of vertices of the graph G = (V, E) representing the network, in a meaningful manner. Two intuitive problems can already be sketched. The first one arises when partitioning the vertices into disjoint subsets, because each entity of the network could possibly belong to several different communities. The problem of overlapping communities has already been investigated in the literature (181, 197, 227) and was presented in the previous Chapter. The second problem is represented by networks in which it makes sense that an individual does not belong to any group. In the formulation introduced above, we imposed that, regardless of whether overlapping communities are considered or not, each individual is required to belong to at least one group. This requirement could make sense for several networks, but is unaffordable in those cases in which some individuals could remain isolated from the rest of the network, as recently put into evidence by (152). Such a case commonly happens in real and Online Social Networks, as reported by recent social studies (143). In this Chapter we analyze the community structure of Facebook on a large scale. We recall that we collected two different samples of the network of relationships among the users of the social network. Each of them contains millions of nodes and edges and, for this reason, we adopt two fast and efficient community detection algorithms optimized for massive networks, working without any a priori knowledge, in order to discover the emergent community structure of Facebook.

6.1.1 Community Detection in Literature

Several studies have been conducted in order to investigate the community structure of real and Online Social Networks (109, 158, 234, 254, 265, 293). They all rely on the algorithmic background of detecting communities in a network. There are several comprehensive surveys on this problem, addressed to non-practitioner readers, such as (109, 234).
The problem of detecting groups of related nodes in a single social network has been largely analyzed in the Physics, Bioinformatics and Computer Science literature and is often known as community detection (210, 216), studied, among others, by Borgatti et al. (41). The computational complexity of the Girvan-Newman (GN) algorithm introduced before is O(n³), n being the number of nodes of a graph G = (V, E). The cubic complexity algorithm may not be scalable enough for the size of Online Social Networks, but a more efficient – O(n log² n) – implementation of GN can be found in (69). (237) illustrates an algorithm which strongly resembles GN. In particular, for each edge e ∈ E of G, it computes the so-called edge clustering coefficient of e, defined as the ratio of the number of cycles containing e to the maximum number of cycles which could potentially contain it. Next, GN is applied with the edge clustering coefficient (rather than edge betweenness) as the parameter of reference. The most important advantage of this approach is that the computational cost of the edge clustering coefficient is significantly smaller than that of edge betweenness. All the approaches described above use greedy techniques to maximize Q. In (140), the authors propose a different approach which maximizes Q by means of the simulated annealing technique. That approach achieves a higher accuracy but can be computationally very expensive. (227) describes CFinder, which, to the best of our knowledge, is the first attempt to find overlapping communities, i.e., communities which may share some nodes. In CFinder communities are detected by finding cliques of size k, where k is a parameter provided by the user. Such a problem is computationally expensive, but experiments showed that it scales well on real networks and achieves a great accuracy. The approach of (218) uses a Bayesian probabilistic model to represent an Online Social Network. The parameters of this model are determined by means of the Expectation Maximization algorithm. An interesting feature of (218) is the capability of finding group structures, i.e., relationships among the users of a social network which go beyond those characterizing conventional communities. For instance, this approach is capable of detecting groups of users who show forms of aversion towards each other rather than just users who are willing to interact. Experimental comparisons of various approaches to finding communities in OSNs are reported in (109, 188). In (161) the authors propose CHRONICLE, an algorithm to find time-evolving communities in a social network. CHRONICLE operates in two stages: in the first one it considers T “snapshots” of the social network in correspondence of T different timestamps. For each timestamp it applies a density-based clustering algorithm on each snapshot to find communities in the social network. After this, it builds a T-partite graph GT which consists of T layers, each containing the communities of nodes detected in the corresponding timestamp. It also adds some edges linking adjacent layers: they indicate that two communities, detected in correspondence of consecutive timestamps, share some similarities. As a consequence, the edges and the paths in GT identify similarities among communities over time.

6.2 Community Structure Discovery

The detection of the community structure within a large network is a complex and computationally expensive task.
Community detection algorithms such as those originally presented by Girvan and Newman or by (141) are not viable solutions, respectively because they are too expensive for the large scale of the Facebook sample we gathered, or because they require a priori knowledge. Fortunately, several optimizations have been proposed during the latest years. For our purposes, we adopted two fast and efficient optimized algorithms, whose performance is among the best proposed in the literature to date. LPA (Label Propagation Algorithm), presented by (238), and FNCA (Fast Network Community Algorithm), more recently described by (156), have been adopted to detect communities from the collected samples of the network. A description of their functioning follows, in particular in the context of our study.

6.2.1 Label Propagation Algorithm

LPA (Label Propagation Algorithm) (238) is a near linear time algorithm for community detection. Its functioning is very simple, given its computational efficiency. LPA uses only the network structure as its guide, is optimized for large-scale networks, does not follow any a-priori defined objective function and does not require any prior information about the communities. In addition, this technique does not require defining in advance the number of communities present in the network or their size. Labels represent unique identifiers, assigned to each vertex of the network. Its functioning is reported as described in (238):

Step 1 To initialize, each vertex is given a unique label;
Step 2 Repeatedly, each vertex updates its label with the one used by the greatest number of neighbors. If more than one label is used by the same maximum number of neighbors, one is chosen randomly. After several iterations, the same label tends to become associated with all the members of a community;
Step 3 Vertices labeled alike are grouped into one community.

The authors themselves proved that this process, under specific conditions, may not converge. In order to avoid deadlocks and to guarantee an efficient network clustering, they suggested adopting an “asynchronous” update of the labels, considering the values of some neighbors at the previous iteration and some at the current one. This precaution ensures the convergence of the process, usually in a few steps. (238) report that five iterations are sufficient to correctly classify 95% of the vertices of the network. After some experimentation, we found that this forecast is too optimistic, thus we raised the maximum number of iterations to 50, finding a good compromise between quality of results and amount of time required for computation. A characteristic of this approach is that it produces groups that are not necessarily contiguous, thus there could exist a path connecting a pair of vertices in a group that passes through vertices belonging to different groups. Although in our case this condition would be acceptable, we adopted the suggestion provided by the authors to devise a final step to split the groups into one or more contiguous communities. The authors proved its near linear computational cost.
Recently, great attention has been attracted by the possibility of discovering the community structure of a network by finding overlapping nodes belonging to different communities at the same time. An interesting approach has been proposed by (138), which is based on an extension of the Label Propagation Algorithm previously described.
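As a purely illustrative aid (not part of the original experimental platform), the label-update rule of Steps 1–3 above can be sketched in a few lines of Python; the adjacency-list representation and the function name are hypothetical choices made here for clarity.

    import random
    from collections import Counter

    def label_propagation(adj, max_iter=50, seed=0):
        """Minimal LPA sketch: adj maps each vertex to the list of its neighbors."""
        rng = random.Random(seed)
        labels = {v: v for v in adj}             # Step 1: a unique label per vertex
        vertices = list(adj)
        for _ in range(max_iter):                # bounded number of iterations, as in our experimentation
            rng.shuffle(vertices)                # asynchronous, random-order updates
            changed = False
            for v in vertices:
                if not adj[v]:
                    continue
                counts = Counter(labels[u] for u in adj[v])
                top = max(counts.values())
                new_label = rng.choice([l for l, c in counts.items() if c == top])
                if new_label != labels[v]:       # Step 2: adopt the most frequent neighbor label
                    labels[v], changed = new_label, True
            if not changed:
                break                            # Step 3: vertices sharing a label form a community
        return labels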
6.2.2 Fast Network Community Algorithm

The second efficient algorithm that has been chosen for our analysis is called FNCA (Fast Network Community Algorithm) (156). The main advantage of FNCA is that it does not require defining in advance the number of communities present in the network, or their size. This aspect makes it suitable for the investigation of the unknown community structure of a large network, such as in the case of Facebook.
FNCA is an optimization algorithm which aims at maximizing the value of the network modularity function, in order to detect the community structure of a given network. The network modularity function has been introduced by (217) and has been largely adopted in the last few years by the scientific community (33, 90, 109). Given an undirected, unweighted network G = (V, E), let i ∈ V be a vertex belonging to the community r(i), denoted by c_{r(i)}; the network modularity function can be written as follows

Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(r(i), r(j))   (6.1)

where A_{ij} is the element of the adjacency matrix A = (A_{ij})_{n \times n} representing the network, whose value is A_{ij} = 1 if i and j are tied by an edge, A_{ij} = 0 otherwise. The function δ(u, v), namely the Kronecker delta, is equal to 1 if u = v and 0 otherwise. The value k_i represents the degree of a vertex i, defined as k_i = \sum_j A_{ij}, while m is the total number of edges in the network, defined as m = \frac{1}{2} \sum_{ij} A_{ij}. Equation 6.1 can be rewritten as

Q = \frac{1}{2m} \sum_i f_i,   f_i = \sum_{j \in c_{r(i)}} \left[ A_{ij} - \frac{k_i k_j}{2m} \right]   (6.2)

where the function f represents the difference between the actual and the expected number of edges which fall within communities, from the “perspective” of each node of the network, thus indicating how strong the community structure is. Any node of the network can evaluate the value of its f function considering only local information (i.e., information about its community). Moreover, if the local effect of relabeling a node, without changing the labels of the others, is that the value of its f function increases, the global effect is that the network modularity increases as well.
Given these assumptions, (156) devised a fast community detection algorithm, optimized for complex networks, adopting local information strategies. FNCA relies on the consideration that, in networks with an emergent community structure, each node should be labeled like one of its neighbors, otherwise it is a cluster by itself. Thus, each node needs to calculate its f function only for the labels of its neighbors, instead of for all the nodes of the network. Moreover, the authors put into evidence that, if the labels of the neighbors of one node did not change at the last iteration, the label of that node is less likely to change in the current iteration. This provides a speed-up strategy, putting nodes which satisfy this condition in an “inactive” state, not requiring the update of their labels. Because this weak condition may fail, it is important to immediately “wake up”, at each iteration, those nodes which do not satisfy this constraint anymore. This algorithm too, like LPA, may fail to converge. In our experimentation we defined a termination criterion of 50 iterations, obtaining good results also with our large-scale samples. The time complexity of FNCA is O(T · n · k · c), where T is the maximum number of iterations, n the total number of nodes, k the average degree of the nodes, and c the average community size at the end of the algorithm execution.
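To make Equation 6.2 concrete, the following sketch (illustrative only, with the same hypothetical adjacency-list representation used above) evaluates Q by accumulating the local contributions f_i that each node can compute from information about its own community.

    def modularity(adj, labels):
        """Network modularity Q of Eq. 6.2, accumulated from the local terms f_i."""
        m = sum(len(neigh) for neigh in adj.values()) / 2.0       # total number of edges
        degree = {v: len(neigh) for v, neigh in adj.items()}
        members = {}                                              # community label -> set of vertices
        for v, lab in labels.items():
            members.setdefault(lab, set()).add(v)
        q = 0.0
        for i in adj:
            community = members[labels[i]]
            actual = sum(1 for j in adj[i] if j in community)                  # sum_j A_ij within c_r(i)
            expected = degree[i] * sum(degree[j] for j in community) / (2 * m) # sum_j k_i k_j / 2m
            q += actual - expected                                             # this is f_i
        return q / (2 * m)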
Furthermore, with the support of the analysis in the literature (187), for large-scale networks FNCA is a near linear algorithm.

6.2.3 Experimentation

The experimental results obtained by using LPA and FNCA on the Facebook network are reported in Table 6.1. Both algorithms show good performance when applied to this network.

Table 6.1: Results of the community detection on Facebook

  Sample                                    Algorithm   No. Communities     Q        Time (s)
  BFS (8.21 M vertices, 12.58 M edges)      FNCA        50,156              0.6867   5.97e+004
                                            LPA         48,750              0.6963   2.27e+004
  Uniform (7.69 M vertices, 7.84 M edges)   FNCA        40,700              0.9650   3.77e+004
                                            LPA         48,022              0.9749   2.32e+004

A very compact community structure has been highlighted by using both algorithms. In detail, the resulting values of Q are almost identical within each considered sample; moreover, the number of detected communities is very similar.

6.2.4 Methodology of Investigation

By analyzing the obtained community structures we considered the following aspects: i) the distribution of the dimensions of the obtained clusters (i.e., the number of members constituting each detected community), and ii) the qualitative composition of the communities and the degree of similarity among different sample sets (i.e., those obtained by using different algorithms and sampling techniques).

Community distribution: Uniform sample

The analysis of the community structure of a network from a quantitative perspective may start with the study of the distribution of the dimension of the communities. Our investigation started by considering the “Uniform” sample, which is known to be unbiased by construction. The results obtained are then used to investigate the possible bias introduced by the BFS sampling technique, as discussed in the following. As depicted in Figures 6.1 and 6.2, the results obtained by using the two different algorithms on the “Uniform” sample are interesting and deserve explanation. In detail, the analytical results (as reported in Table 6.1) and the figures put into evidence that both algorithms identified a similar amount of communities, which is reflected by almost identical values of network modularity on the two different sets. Moreover, the identified communities are, most of the time, of the same dimensions, regardless of the adopted community detection algorithm. These aspects lead us to advance different hypotheses on the characteristics of the community structure of Facebook detected on the unbiased “Uniform” sample.
The first consideration regards the distribution of the size of the communities. Both the distributions obtained by using the LPA and the FNCA algorithm show a characteristic power law behavior. This is emphasized by Figures 6.1 and 6.2, which represent the distributions of the dimension of communities obtained by using, respectively, FNCA and LPA, applied on the “Uniform” sample. In Figure 6.1, the cluster size distribution obtained by using FNCA is fitted to a power law function (γ = 0.45) which effectively approximates its behavior. Similarly, Figure 6.2 represents the cluster size distribution produced by LPA, which gives results in a shorter interval (i.e., [0, 500] with respect to [0, 1000] used in Figure 6.1), well fitting a power law function (γ = 0.37).
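The exponents reported above (γ = 0.45 and γ = 0.37) can be reproduced in spirit with a simple log-log regression of the empirical community-size distribution; this sketch is only indicative, since the fitting procedure used in the analysis is not prescribed here, and the function name is hypothetical.

    import numpy as np
    from collections import Counter

    def fit_power_law_exponent(community_sizes):
        """Rough estimate of gamma in P(s) ~ s^-gamma via least squares in log-log space."""
        counts = Counter(community_sizes)
        sizes = sorted(counts)                                   # distinct community sizes
        freq = np.array([counts[s] for s in sizes], dtype=float)
        prob = freq / freq.sum()                                 # empirical probability of each size
        slope, _ = np.polyfit(np.log(sizes), np.log(prob), 1)
        return -slope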
A first consideration is that the communities detected by using the LPA algorithm appear to be slightly displaced towards bigger values with respect to those produced by FNCA in the first quartile, while the number of communities greater than 400 members quickly decreases. These results permit us to draw the following conclusions:

• On a large scale, to the best of our knowledge, this is the first experimental analysis that proves that the size of the communities emerging in an Online Social Network follows a well-defined power law distribution. This result is novel and validates the hypothesis, proved on a small scale on several real-world social networks (for example, (265)), that not only the degree distribution follows a scale-free behavior, but even the processes of aggregation among individuals of an Online Social Network can be effectively described by communities whose dimensions follow a power law. This leads to the following point.
• Our analysis puts into evidence that, even on a large scale well represented by an OSN such as Facebook, people tend to aggregate principally in a large number of small communities instead of in very large communities.
• A rejection-based sampling methodology (such as the “Uniform” sampling) is appropriate to describe the community structure emerging from a large sample of an Online Social Network. Differently from other approaches, it seems to preserve those characteristics that influence the distribution of the friendship relations, thus well representing the community structure of large networks.

Figure 6.1: FNCA power law distribution on the “Uniform” sample.
Figure 6.2: LPA power law distribution on the “Uniform” sample.

Community distribution: BFS sample

The results obtained by analyzing the BFS sample show partially different characteristics. Figures 6.3 and 6.4 show the cluster dimension distributions obtained by using, respectively, FNCA and LPA applied to the BFS sample. Both these distributions show some fluctuations if compared to the power law distribution adopted as a possible fitting function. By using FNCA (see Figure 6.3), the peak of the distribution is represented by those communities constituted of 10–30 members; then it sharply slopes down, depicting a first fluctuation around clusters of about a hundred members, and a second minor fluctuation around three hundred members. A similar behavior is shown by the LPA algorithm (see Figure 6.4).
The differences in behavior between the BFS and “Uniform” sample distributions accord with the adopted sampling techniques. In fact, in (125, 170) the influence of the adopted sampling methods on the characteristics of the obtained sets has been put into evidence, in particular focusing on the possible bias introduced by the BFS algorithm towards high degree nodes, if the BFS visit is incomplete (such as in our case). We can draw the conclusion that the adoption of the BFS sampling technique is not very appropriate when one wants to investigate the community structure of a large network whose complete sampling is not feasible, for example because of constraints imposed by the network itself or by its dimension (such as in the case of Facebook). On the other hand, BFS sampling has been proved to be effective and efficient in the opposite cases.

Figure 6.3: FNCA power law distribution on the BFS sample.
Figure 6.4: LPA power law distribution on the BFS sample.

Overlapping Rate between Distributions

The idea that two different algorithms could produce different community structures is not counterintuitive, but in our case we have some indications that the obtained results could share a high degree of similarity. To this purpose, in the following we investigate the similarities among the community structures obtained by using the two different algorithms, FNCA and LPA. This is represented by the overlapping rate, calculated by considering the distributions of the community dimensions from a quantitative perspective. To do so, we adopt a divergence measure, called the Kullback-Leibler divergence, defined as

D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}   (6.3)

where P and Q represent, respectively, the probability distributions that characterize the behavior of the LPA and the FNCA community sizes, calculated on a given sample. In detail, let i be a given size, so that P(i) and Q(i) represent the probability that a community of size i exists in the distributions P and Q. The KL divergence is helpful if one would like to calculate how different a distribution is with respect to a given one. In particular, the KL divergence being defined in the interval 0 ≤ D_{KL} ≤ ∞, the smaller the value of the KL divergence between two distributions, the more similar they are. In the light of this assumption, we calculated the pairwise KL divergences between the distributions discussed above, finding the following results:

• On the “Uniform” sample:
  – D_{KL}(LPA || FNCA) = 0.007722
  – D_{KL}(FNCA || LPA) = 0.007542
• On the BFS sample:
  – D_{KL}(LPA || FNCA) = 0.003764
  – D_{KL}(FNCA || LPA) = 0.004292

The values found by adopting the KL divergence put into evidence a strong correlation between the distributions calculated by using the two different algorithms on the two different samples. From a graphical standpoint, we put into evidence the correlation found by means of the KL divergence, as follows. In Figures 6.5 and 6.6 a semi-logarithmic scale has been adopted. In Figure 6.5 we plotted together the distributions depicted in Figures 6.1 and 6.2, which represent the community structure of the “Uniform” sample. Similarly, Figure 6.6 shows the distributions presented in Figures 6.3 and 6.4 regarding the BFS sample. By analyzing the distribution of the community sizes of the “Uniform” set, a perfectly linear behavior emerges, which characterizes both the FNCA and the LPA results. This agrees with the power law distributions previously emphasized, which well depict the behavior of the emergent community structure in that sample. Additionally, the two distributions are almost overlapping. A similar consideration holds for the BFS sample. Even though the distributions suffer from the spikes previously discussed, a strong correlation between them has been put into evidence both by the KL divergence and by the graphical representation. These indications gave us the opportunity of investigating from a qualitative perspective the characteristics of the community structure of Facebook.
In detail, a different consideration regarding the qualitative analysis of the similarity of the two different community structures is provided in the next Section. That kind of investigation aims at evaluating what members constitute the communities detected by adopting the algorithms previously introduced.
Our findings prove that, regardless of the adopted community detection algorithm, the communities discovered are not only characterized by similar size distributions, but are also mainly constituted of the same members. This finding proves that the emergent community structure in Facebook is well characterized and defined, in accordance with the quantitative results we discussed above.

Figure 6.5: FNCA vs. LPA (UNI).

Community structure similarity

In this Section we introduce the methodology of investigation of the similarity among different community structures. A community structure is represented by a list of vectors identified by a “community-ID”; each vector contains the list of user-IDs (in anonymized format) of the users belonging to that specific community; an example is depicted in Table 6.2.

Table 6.2: Representation of community structures

  Community-ID     List of Members
  community-ID1    {user-IDa; user-IDb; . . . ; user-IDc}
  community-ID2    {user-IDi; user-IDj; . . . ; user-IDk}
  . . .            {. . .}
  community-IDN    {user-IDx; user-IDy; . . . ; user-IDz}

Figure 6.6: FNCA vs. LPA (BFS).

In order to evaluate the similarity of the community structures obtained by using the two algorithms, FNCA and LPA, a coarse-grained way to compare the two sample sets would be to adopt a simple measure of similarity such as the Jaccard coefficient, defined as

J(A, B) = \frac{|A \cap B|}{|A \cup B|}   (6.4)

where A and B represent the two community structures. While calculating the intersection of the two sets, communities differing even by only one member would be considered different, even though a high degree of similarity among them could be envisaged. A more convenient way to compute the similarity among these sets is to evaluate the Jaccard coefficient at the finest level, comparing each vector of the former set against all the vectors of the latter set, in order to “match” the most similar ones. Under these assumptions, the Jaccard coefficient can be rewritten in its vectorial formulation as

J(v, w) = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}   (6.5)

where M_{11} represents the total number of shared elements between vectors v and w, M_{01} represents the total number of elements belonging to w and not to v, and, finally, M_{10} the vice-versa. The result lies in [0, 1]. The more two compared communities are similar, or, in other words, the more the constituting members of two compared communities overlap, the higher the value of the Jaccard coefficient computed this way. An almost equivalent way to compute the similarity with a high degree of accuracy would be to apply the Cosine similarity to each possible pair of vectors belonging to the two sets. The Cosine similarity is defined as

\cos(\Theta) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}}   (6.6)

where A_i and B_i represent the binary frequency vectors computed on the member lists over i. Once the most similar pairs of communities between the two compared sets have been matched, the mean degree of similarity is computed as

\frac{\sum_{i=1}^{N} \max(J(v, w)_i)}{N}  and  \frac{\sum_{i=1}^{N} \max(\cos(\Theta)_i)}{N}   (6.7)

where max(J(v, w)_i) and max(cos(Θ)_i) denote the highest value of similarity chosen among those calculated by combining the vector i of the former set A with all the vectors of the latter set B, respectively adopting the Jaccard coefficient and the Cosine similarity.
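The matching procedure of Equations 6.5–6.7 can be summarized by the following sketch, in which each community is held as a set of anonymized user-IDs; names are illustrative, and a real implementation over tens of thousands of communities would index members to avoid the quadratic pairwise comparison. Note that, on binary membership vectors, the Cosine similarity of Equation 6.6 reduces to |A ∩ B| / sqrt(|A| · |B|).

    def jaccard(a, b):
        """Vectorial Jaccard coefficient of Eq. 6.5 on two communities given as sets."""
        shared = len(a & b)                                        # M11
        return shared / (len(a) + len(b) - shared) if (a or b) else 0.0

    def mean_similarity(structure_a, structure_b):
        """Mean of the best-match similarities between two community structures (Eq. 6.7)."""
        best = [max(jaccard(c, other) for other in structure_b) for c in structure_a]
        return sum(best) / len(best)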
We obtained the results reported in Table 6.3.

Table 6.3: Similarity degree of community structures (degree of similarity FNCA vs. LPA; metric: Jaccard coefficient J)

  Dataset    In Common   Mean     Median   Std. D.
  BFS        2.45%       73.28%   74.24%   18.76%
  Uniform    35.57%      91.53%   98.63%   15.98%

As can be deduced by analyzing Figures 6.5 and 6.6, not only do the community structures calculated by using the two different algorithms, FNCA and LPA, follow similar distributions with respect to the dimensions, but the communities themselves are also constituted mostly of the same members, or of a congruous amount of common members. From the results it emerges that both algorithms produce a faithful and reliable clustering representing the community structure of the Facebook network. Moreover, while the number of identical communities between the two sets obtained by using the BFS and the “Uniform” sampling is not so high (i.e., respectively, ≈2% and ≈35%), the overall mean degree of similarity is, on the contrary, very high (i.e., ≈73% and ≈91%). Considering the way we compute the mean similarity degree, as in Equation 6.7, this is due to the high number of communities which differ only by a very small number of components. Finally, the fact that the median, which identifies the second quartile of the samples, is, respectively, ≈75% and ≈99% demonstrates the strong similarity of the produced result sets. All these considerations graphically emerge by analyzing Figures 6.7 and 6.8, in which the higher the degree of similarity, calculated by using the Jaccard coefficient, the denser the distribution, in particular in the first quartile, becoming evident for values near to 1. The unbiased characteristics of the “Uniform” sample are reflected also in Figure 6.7, in which the similarity degree of the community structure is evident because most of the values lie in the boundary zone near to 1. The degree of similarity of the community structure of the BFS sample, shown in Figure 6.8, appears more spread all over the second half of the distribution, becoming denser in the first quartile. Finally, Figures 6.9 and 6.10 summarize these findings. The interpretation of these heat-maps is as follows: the higher the degree of similarity between the compared community structures, the higher the heat-map scores. The similarity becomes graphically evident considering that the heat values shown in the figures are very high over most of the map.

Figure 6.7: Jaccard distribution: FNCA vs. LPA (UNI).
Figure 6.8: Jaccard distribution: FNCA vs. LPA (BFS).
Figure 6.9: Heat-map: FNCA vs. LPA (UNI).
Figure 6.10: Heat-map: FNCA vs. LPA (BFS).

Resolution limit and outliers

As previously discussed, community detection algorithms based on the network modularity maximization paradigm may suffer from a resolution limit. In (110), the authors proved that modularity optimization could fail in the detection of communities smaller than a given threshold. As an effect, we obtain the creation of large communities which incorporate smaller ones, compromising the final quality of the clustering. We investigated the effect of the resolution limit put into evidence by (110) on our datasets, for the purpose of assessing the quality of our analysis. The results of this investigation, respectively on the BFS and the “Uniform” samples, can be discussed separately. On the former dataset, a small number of communities whose dimensions exceed those obtained in the distributions previously discussed have been identified.
Possibly, these large communities have been identified because of the resolution limit. Table 6.4 reports the amount of outliers, i.e., those communities that statistically exceed the average dimension and are suspected of suffering from the problem of the resolution limit.

Table 6.4: The presence of outliers in our community structures (amount with respect to number of members)

  Set   Alg.    ≥ 1K   ≥ 5K   ≥ 10K   ≥ 50K   ≥ 100K
  BFS   FNCA    4      1      2       1       1
  BFS   LPA     1      0      2       0       1
  UNI   FNCA    81     0      0       0       0
  UNI   LPA     0      0      0       0       0

From this analysis it emerges that a smaller number of outliers has been found by using the LPA method, with respect to the adoption of the FNCA algorithm, in the context of the BFS sample. This could indicate that FNCA, which is a modularity maximization algorithm, may suffer from the resolution limit. On the contrary, the “Uniform” sample apparently does not cause any problem of resolution limit. By using FNCA on the “Uniform” sample, a large number of communities whose dimension is slightly greater than one thousand members appears, coinciding with the final part of the tail of the power law distribution depicted in Figure 6.1. The LPA method applied to the “Uniform” sample provides possibly the most reliable results, without incurring any possible outliers.

6.3 Community Structure

6.3.1 Building the Community Meta-network

Once we verified the quality of the community detection which unveiled the community structure of Facebook, we proceed with its analysis. First of all, we build a meta-network of the community structure, as follows. We generate a new weighted undirected graph G′ = (V′, E′, ω), whose set of nodes is represented by the communities constituting the given community structure. In G′ there exists an edge e′_{uv} ∈ E′ connecting a pair of nodes u, v ∈ V′ if and only if there exists in the social network graph G = (V, E) at least one edge e_{ij} ∈ E which connects a pair of vertices i, j ∈ V such that i ∈ u and j ∈ v (i.e., user i belongs to community u and user j belongs to community v). The weight function is simply defined as ω_{u,v} = \sum_{i \in u, j \in v} e_{ij} (i.e., the total number of edges connecting users belonging to u to users belonging to v); a construction sketch is reported below. Table 6.5 summarizes the results obtained for the uniform sample by using FNCA and LPA. Something which immediately emerges is that the results obtained by using the two different community detection methods are very similar. The number of nodes in the meta-networks is smaller than the total number of communities discovered by the algorithms, because we excluded all those communities containing only one member, which are reasonably believed to be constituted by inactive users. We discuss the features of the community structure meta-network in Section 6.3.2.

Table 6.5: Features of the meta-networks representing the community structure for the uniform sample

  Feature                        FNCA              LPA
  No. nodes/edges                36,248/836,130    35,276/785,751
  Min./Max./Avg. weight          1/16,088/1.47     1/7,712/1.47
  Size largest conn. comp.       99.76%            99.75%
  Avg. degree                    46.13             44.54
  2nd largest eigenvalue         171.54            23.63
  Effective diameter             4.85              4.45
  Avg. clustering coefficient    0.1236            0.1318
  Density                        0.127%            0.126%

Figure 6.11, depicted by using Cvis (https://sites.google.com/site/andrealancichinetti/cvis) – a hierarchical-based circular visualization algorithm –, represents the community structure unveiled by LPA from the uniform sample. It is possible to appreciate that there exists a tight core of communities which occupies a central position in the meta-network.
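A minimal sketch of the construction of G′ described above is the following (using networkx purely for illustration; function and variable names are hypothetical). Intra-community edges are skipped, inter-community edges accumulate the weight ω_{u,v}, and singleton communities are discarded as in the text.

    import networkx as nx
    from collections import Counter

    def build_meta_network(G, labels):
        """Collapse each community of G into a node of G'; edge weights count inter-community edges."""
        meta = nx.Graph()
        for i, j in G.edges():
            u, v = labels[i], labels[j]
            if u == v:
                continue                                   # intra-community edge: no edge in G'
            if meta.has_edge(u, v):
                meta[u][v]["weight"] += 1                  # accumulate omega_{u,v}
            else:
                meta.add_edge(u, v, weight=1)
        sizes = Counter(labels.values())
        meta.remove_nodes_from([c for c in list(meta) if sizes[c] <= 1])   # drop singleton communities
        return meta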
Moreover, an in-depth analysis reveals that the positioning of the communities is generally irrespective of their size. This means that there are several different small communities which play a dominant role in the network. Similarly, the periphery of the graph is constituted by both small and larger communities. The visual analysis of large-scale networks is usually unfeasible when managing samples of such a size, but by adopting the meta-network representation of the community structure we are able to infer additional insights about the structure of the original network.

Figure 6.11: Meta-network representing the community structure (UNI with LPA).

6.3.2 Meta-network Analysis

Node degree and clustering coefficient

Figure 6.12 depicts the node degree probability distribution and the average clustering coefficient plotted as a function of the node degree for the two community detection techniques. Analyzing the degree distribution of the community structure meta-network we find a very peculiar feature. In detail, the distribution is clearly identified by two different regimes, roughly 1 ≤ x < 10² and x ≥ 10². Both the probability distributions fit well to a power law P(x) ∝ x^{−γ}, with γ = 0.56 for the former and γ = 3.51 for the latter regime. Such a particular behavior has been previously found in Facebook regarding the social graph.
The clustering coefficient of a node is the ratio of the number of existing links over the number of possible links between its neighbors. Given a network G = (V, E), we recall the definition of the clustering coefficient C_i of node i ∈ V as

C_i = \frac{2 \, |\{(v, w) : (i, v), (i, w), (v, w) \in E\}|}{k_i (k_i - 1)}

where k_i is the degree of node i. In our case, it can be interpreted as the probability that any two randomly chosen communities that share a common neighbor also have a link between them. The high values of the average clustering coefficient obtained for the community structure meta-network are an interesting indicator that the communities are well connected to each other. This is a peculiar feature, which reflects the small world effect, well known in social networks. Moreover, this means that, for two randomly chosen, not directly connected communities, it is very likely that a very short path connecting their members exists. Finally, the clear power law distribution which describes the average clustering coefficient for this network has an exponent γ = 0.48.

Figure 6.12: Meta-network degree and clustering coefficient distribution (UNI).

Hops and shortest paths distribution

In the following we analyze the effective diameter and the shortest path distribution in the community structure. Figure 6.13 represents the probability distribution of the shortest paths against the path length, and, concurrently, the number of pairs of nodes connected by paths of a given length. The interesting behavior which emerges from the analysis is that the shortest path probability distribution reaches a peak for paths of length 2 and 3. In correspondence with this peak, the number of connected pairs of communities quickly grows, reaching the effective diameter of the networks with a value slightly above 4. This finding has an important impact on the features of the overall social graph. In fact, if we suppose that all the nodes belonging to a given community are well connected by very short paths, or even directly connected, this would result in a very short diameter of the social graph itself.
In fact, there will always exist a very short path connecting the communities of any pair of randomly chosen members of the social network. Moreover, this result has recently been assessed by using heuristic techniques on the whole Facebook network (269).

Figure 6.13: Meta-network hops and shortest paths distribution (UNI).

Weight and strength distribution

The analysis of the weight and strength probability distributions is depicted in Figure 6.14. We recall that the strength s_ω(v) (or weighted degree) of a given node v is determined as the sum of the weights of all edges incident on v

s_\omega(v) = \sum_{e \in I(v)} \omega(e)

where ω(e) is the weight of a given edge e and I(v) the set of edges incident on v. In detail, both distributions resemble a power law behavior. The former is defined by a single regime, clearly described by a coefficient γ = 1.45. The latter is better described by two different regimes, in intervals roughly similar to those of the node degree probability distribution, by the two coefficients γ = 1.50 and γ = 3.12. Figure 6.15 represents a heat-map of the distribution of connections among the communities in the meta-network. It emerges that the links mainly connect communities of medium-large dimension. This aspect is important because it highlights the role of the links with high weight and strength. For example, they efficiently connect communities containing many members that, otherwise, would be far from each other. On the other hand, according to the strength of weak ties theory (137), weak links typically occur among communities that do not share a large amount of neighbors and are important to keep the network proficiently connected.

Figure 6.14: Meta-network weights vs. strengths distribution (UNI).
Figure 6.15: Meta-network heat-map of the distribution of connections (UNI).

6.3.3 Discussion of Results

A summary of the results achieved with our analysis of the Facebook community structure follows. First of all, in this Section we put into evidence that the community structure of the Facebook social network presents a clear power law distribution of the dimension of the communities, similarly to other large social networks (176). This result is independent of the algorithm adopted to discover the community structure, and even (but in a less evident way) of the sampling methodology adopted to collect the datasets. On the other hand, this is the first experimental work that proves on a large scale the hypothesis, theoretically advanced by (170), of the possible bias towards high degree nodes introduced by the BFS sampling methodology for incomplete visits of large graphs. Regarding the qualitative analysis of our results, it emerges that the community structure is well defined. In fact, regardless of the algorithm adopted for their discovery, the communities share a high degree of similarity among different datasets, which means that they emerge clearly in the topology of the network. As for the community detection algorithm, we found that the LPA method represents a feasible choice among the heuristic methods based on local information in order to unveil the community structure of a large network. Results compared against FNCA appear slightly better,
in particular if we consider the well-known problem of the resolution limit (110) that affects the process of community detection on a large scale for modularity optimization algorithms. The performance provided by the two algorithms is comparable and reasonable for large-scale analysis. Even if the computational cost of these two techniques is very similar, we experienced that the LPA method performs slightly better than FNCA on our datasets. Finally, the analysis of the community structure meta-network puts into evidence different mesoscopic features. For example, we discovered that the community structure is characterized by a power law probability distribution of node degree/weight and we found that it reflects the well-known small world effect.

6.4 The Strength of Weak Ties

This Section introduces some experiments in the direction of the quantitative assessment of the theory known as the strength of weak ties, whose foundations lie in Sociology (137). In particular, by means of the data previously acquired and exploiting the analysis carried out regarding the community structure of Facebook, we have been able to assess some features of this theory in the context of a real-world large-scale social network like Facebook. We try to capture the original intuition underlying the so-called weak ties and their role in complex social networks. In particular, we have been concerned with the experimental assessment of the importance, foreseen by the early works of Mark Granovetter (137), of weak ties, i.e. human relationships (acquaintance, loose friendship, etc.) that are less binding than family and close friendship but might, according to Granovetter, yield better access to information and opportunities. Facebook is organized around the recording of just one type of relationship, i.e. the friendship. Of course, Facebook friendship captures several degrees and nuances of the human relationships that are hard to separate and characterize within data analysis. However, weak ties have a clear and valuable interpretation: friendships between individuals who otherwise belong to distant areas of the friendship graph or, in other words, who happen to have most of their other relationships in different national/linguistic/age/common-experience groups. Such weak ties have strength precisely because they connect distant areas of the network, thus yielding several interesting properties, which will be discussed in the following.

6.4.1 Methodology

The classical definition of the strength of a social tie has been provided by Granovetter (137):

The strength of a tie is a (probably linear) combination of the amount of time, the emotional intensity, the intimacy (mutual confiding), and the reciprocal services which characterize the tie.

This definition introduces some important features of a social tie that will be discussed later, in particular: (i) the intensity of the connection, and (ii) the mutuality of the relationship. Granovetter's paper gives a formal definition of strong and weak ties by introducing the concept of bridge:

A bridge is a line in a network which provides the only path between two points. Since, in general, each person has a great many contacts, a bridge between A and B provides the only route along which information or influence can flow from any contact of A to any contact of B.

From this definition it emerges that – at least in the context of social networks – no strong tie is a bridge.
However, that is not sufficient to affirm that each weak tie is a bridge; what is important is that all bridges are weak ties. Granovetter's definition of bridge is restrictive and unsuitable for the analysis of large-scale social networks. In fact, because of well-known features such as the small world effect and the scale-free degree distribution, it is unlikely to find an edge whose deletion would make it impossible for two nodes to connect by means of alternative paths. On the other hand, without loss of generality on a large scale, we can define a shortcut bridge as a link connecting a pair of nodes whose deletion would cause an increase of the distance between them, the distance of two nodes being the length of the shortest path linking them. Unfortunately, this definition too leads to two relevant problems. The former is due to the introduction of the concept of shortest paths; the latter is due to the arbitrariness of the concept of distance between nodes. In detail, regarding the shortest paths, the computation of all-pairs shortest paths has a high computational cost which makes it unfeasible even on networks of modest size – even worse when considering large social networks. Regarding the second aspect, in the context of shortest paths the distance could be considered as the number of hops required to connect two given nodes. Alternatively, it could be possible to assign a value of strength (i.e., a weight) to each edge of the network and to define the distance of two nodes as the cost of the cheapest path joining them (in this context, measuring the strength of the edges in online social networks has recently been advanced by (123, 231, 281)). In such a case, however, we do not know whether this definition of distance is better than the previous one, and its computation remains excessively expensive in real-life networks.
In the light of the considerations above, we suspect that the problem of discriminating weak and strong ties in a social network is not trivial, at least on a large scale. To this purpose, in the following we give a definition of weak ties from a different perspective, trying not to distort Granovetter's original intuition. In particular, recalling that weak ties are considered as loose connections between any given individual and her/his acquaintances with whom she/he seldom interacts and who belong to different areas of the social graph, we define weak ties as those ties that connect pairs of nodes belonging to different communities. Note that our definition is more relaxed than that provided by Granovetter. In detail, the fact that two nodes connected by a tie belong to different communities does not necessarily imply that the connection between them is a bridge, nor a shortcut bridge, since its deletion might not increase the length of the path connecting them (there could still exist another path of the same length). On the other hand, in our opinion, it is a reasonable assumption, at least in the context of large social networks, since it has been proved that the edges connecting different communities are bottlenecks (217) and their iterative deletion causes the fragmentation of the network into disconnected components. One of the most important characteristics of weak ties is that those which are bridges create more, and shorter, paths.
The effect of the deletion of a weak tie would be more disruptive than the removal of a strong tie, from a community structure perspective (for this reason, weak ties have recently been proved to be very effective in the diffusion of information and in rumor spreading through social networks (61, 292)).

Experimental Set Up

In order to verify the strength of weak ties theory on a large scale, we initially and carefully analyzed the features of existing online social networks, considering some requirements that come directly from Granovetter's seminal work (137):

Ties discussed in this paper are assumed to be positive and symmetric. Discussion of operational measures of and weights attaching to each of the four elements is postponed to future empirical studies.

Granovetter introduces two concepts that are crucial to understanding weak ties. The first is related to the symmetry of the relationship between two individuals of the network. This concept is strongly interconnected with the definition of the mutual friendship relation which characterizes several online social networks. In detail, a friendship connection is symmetric (i.e., mutual) if there is no directionality in the relation between two individuals – otherwise the relation is asymmetric – of which Facebook friendship is perhaps the best-known example. While in real-world social networks the classification of a relation between individuals can be non-trivial, online social network platforms permit to clearly and uniquely define different types of connections among users. For example, in Twitter the concept of relation between two individuals intrinsically holds a directionality. In fact, each user can be a follower of others, can retweet their tweets and can mention them. Recently, research has started on assessing the strength of weak ties in the context of a directed network (136, 226). In directed networks, however, what is also important is the weight assigned to connections. Even if the possibility of weighting connections among users of social networks has been recently envisaged by us (81) as well as by other authors (123, 231, 281), we consider a network represented by an unweighted graph the most appropriate setting for a quantitative validation of the theory. Facebook arguably represents an ideal setting for the validation of the strength of weak ties theory. In fact, both of Granovetter's requirements are satisfied in the Facebook friendship network because:

• it is naturally represented as an undirected graph: friendship in Facebook is symmetric, and
• it can be represented by adopting an unweighted graph (of course, this is not necessarily the only valid representation of the Facebook network, since it would be possible to adopt a weighted network where edge weights represent, for example, the intensity of the relations among users).

To sum up, our definition of the Facebook social graph is simply an unweighted, undirected graph G = (V, E) where vertices v ∈ V represent Facebook users and edges e ∈ E represent the friendship connections among them. In this context, we define as weak ties those edges that, after dividing the network structure into communities (obtaining the so-called community structure), connect nodes belonging to different communities. Vice versa, we classify as strong ties the intra-community edges.

6.4.2 Experiments

Recently, several works have focused on the Facebook social graph (58, 126, 269) and on its community structure (99, 101, 205), but none of them has been carried out to assess the validity of the strength of weak ties theory.
In this Section: (i) first, we investigate the presence and behavior of strong and weak ties in such a network; (ii) then, we try to describe the density of weak ties among communities and the way in which they are distributed as a function of the size of the communities themselves.

Distribution and CCDF of strong and weak ties

This experiment is devoted to understanding the presence and the distribution of strong and weak ties among communities. To this purpose, we consider the community structure discussed above, classifying as weak ties those edges connecting nodes belonging to different communities, and as strong ties the vice-versa. Intuitively, given the power law distribution of the size of communities (and, coincidentally, the power law distribution of node degrees), the number of weak ties will be much greater than the number of strong ties. Even though this effect could appear counter-intuitive (for example, we could suppose that weak ties are much rarer than strong ties on a large scale), we should recall that some sociological theories (for example, cognitive balance (147, 207), triadic closure (137) and homophily (198)) assume that individuals tend to aggregate in small communities; according to these theories, the intensity of human relations is very tight in small groups of individuals and decreases towards individuals belonging to distant communities. In other words, most of the connections among individuals are weak ties in Granovetter's sense – small amount of contacts, low frequency of interactions, etc.
These intuitions are reflected by Figure 6.16. For each node v ∈ V of the graph G = (V, E), Figure 6.16 depicts the amount of strong and weak ties incident on v. It is evident that the weak ties are much more numerous than the strong ties. The two distributions tend to behave quite similarly, but they maintain a certain constant offset which represents the ratio between strong and weak ties in this network. This ratio is roughly 80%–20% and also carries an important social interpretation. In fact, it is closely related to the concept of rich club – deriving from the renowned Pareto principle (212) – whose validity has been recently proved for complex networks (70) (for example for the Internet (294) and scientific collaboration networks (224)). In addition, since both the distributions recall a straight line (which, in a log-log plot, points to scale-free behavior), we can assume that the distribution of weak and strong ties is also well described by a power law, as in the case of node degree and size of communities. As a different perspective on the same picture, Figure 6.17 represents the CCDF of the probability of finding a given number of strong and weak ties in the network. From its analysis, an important difference between the behavior of the weak and the strong ties emerges. In detail, the cumulative probability of finding a node with an increasing number of strong ties quickly decreases. Tentatively, it is possible to identify in k ≈ 5 the tipping point from which the presence of weak ties quickly overcomes that of strong ties, making the latter less numerous in nodes with degree higher than k.

Figure 6.16: Distribution of strong vs. weak ties in Facebook.
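The per-node counts plotted in Figure 6.16 follow directly from the community assignment: an edge is classified as weak if its endpoints carry different community labels, strong otherwise. A sketch of this classification (with illustrative names) is:

    from collections import defaultdict

    def tie_counts(edges, labels):
        """For every node, count incident strong (intra-community) and weak (inter-community) ties."""
        strong, weak = defaultdict(int), defaultdict(int)
        for i, j in edges:
            bucket = strong if labels[i] == labels[j] else weak
            bucket[i] += 1
            bucket[j] += 1
        return strong, weak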
Density of weak ties among communities and link fraction

The last experiment discussed here is devoted to understanding the density of weak ties connecting communities in Facebook. In particular, we are interested in determining to what extent a weak tie links communities of comparable size. To do so, we considered each weak tie in the network and we computed the size of the community to which the source node of the weak tie belongs. Similarly, we computed the size of the target community.

Figure 6.17: CCDF of strong vs. weak ties in Facebook.

Figure 6.18 represents a density map of the distribution of weak ties among communities. First, we highlight that the map is symmetric with respect to the diagonal, reflecting the fact that the graph is undirected and each weak tie is counted twice, once for each end-vertex. From the analysis of this figure, it clearly emerges that the weak ties mainly connect nodes belonging to small communities. To a certain extent, this could be intuitive, since the number of communities of small size, according to their power law distribution, is much greater than the number of large communities. On the other hand, it is an important assessment since similar results have recently been described for Twitter (136). As further analysis, we carried out another investigation oriented to the evaluation of the amount of weak ties that fall in each given community with respect to its size. The results of this assessment are reported in Figure 6.19. The interpretation of this plot is the following: the y-axis represents the fraction of weak ties per community as a function of the size of the community itself, reported on the x-axis. It emerges that the distribution of the link fraction against the size of the communities also resembles a power law. Indeed, this result is different from that recently proved for Twitter (136), in which a Gaussian-like distribution has been discovered. This is probably due to the intrinsic characteristics of the networks, which are topologically dissimilar (i.e., Twitter is represented by a directed graph with multiple types of edges), and also because the interpretation itself of a social tie is different. In fact, Twitter represents somewhat hierarchical connections (in the form of follower and followed users), while Facebook tries to reflect a friendship social structure which better represents the community structure of real social networks.

Figure 6.18: Density of weak ties among communities.
Figure 6.19: Link fraction as a function of the community size.

Conclusion

In this Chapter we presented a large-scale community structure investigation of the Facebook social network. We adopted two fast and efficient algorithms already presented in the literature, specifically optimized to detect the community structure of large-scale networks, such as Facebook, consisting of millions of nodes and edges. A very strong community structure emerges from our analysis, and several characteristics have been highlighted by our experimentation, such as a typical power law distribution of the dimension of the clusters. We also investigated the degree of similarity of the different community structures obtained by using the two algorithms on the respective samples, putting into evidence strong similarities.
Once the presence of the community structure has been assessed, we studied the mesoscopic characteristics of the community structure meta-network, verifying that most of the features presented by the original social graph hold in the community structure. In particular, we verified the presence of a power law distribution of the community size and degree, and the clustering effect in this network. Moreover, we encountered the presence of a small world effect which contributes to the existence of a very small diameter and short paths between each pair of communities. Finally, we investigated the validity of the strength of weak ties sociological theory on Facebook. Since it is well known that this theory is strictly related to the community structure, our findings support this aspect, providing several quantitative clues which testify the presence and the importance of weak ties in the network.

7 A Novel Centrality Measure for Social Networks

The Chapter is organized as follows: Section 7.1 presents the literature related to the problem of computing centrality on graphs. In Section 7.2 we provide some background information on the problems related to centrality measures. Section 7.3 presents our novel κ-path edge centrality, including the fast algorithm for its computation. An extensive experimental evaluation of the performance of this strategy is discussed in Section 7.4. In Section 7.5 we discuss different fields of application of our approach; in Section 7.6 we describe its adoption to devise a new efficient technique of community detection well suited for the investigation of the community structure of large networks. The Chapter concludes with Section 7.7, in which we report the results of the experimentation of this algorithm applied to different social and biological network problems.

7.1 Background and Related Literature

In the context of social knowledge management, not only from a scientific perspective but also for commercial or strategic motivations, the identification of the principal actors inside a network is very important. Such an identification requires defining an importance measure (also referred to as centrality) to weight nodes and/or edges of a given network. The simplest approaches to computing centrality consider only the local topological properties of a node/edge in the social network graph: for instance, the most intuitive node centrality measure is represented by the degree of a node, i.e., the number of social contacts of a user. Unfortunately, local measures of centrality, whose estimation is computationally feasible even on large networks, do not produce very faithful results (56). For this reason, many authors suggested considering the whole social network topology to compute centrality values. A new family of centrality measures was born, called global measures. Some examples of global centrality measures are closeness (248) and betweenness centrality (for nodes (112), and edges (11, 124)). Betweenness centrality is one of the most popular measures and its computation is the core component of a range of algorithms and applications. It relies on the idea that, in social networks, information flows along shortest paths: as a consequence, a node/edge has a high betweenness centrality if a large number of shortest paths crosses it. Some authors, however, raised some concerns about the effectiveness of betweenness centrality.
First of all, the problem of computing the exact value of betweenness centrality for each node/edge of a given graph is computationally demanding – or even unfeasible – as the size of the analyzed network grows. Therefore, the need of finding fast, even if approximate, techniques to compute betweenness centrality arises, and it is currently a relevant research topic in Social Network Analysis. A further issue is that the assumption that information in social networks propagates only along shortest paths may not be true (261). By contrast, information propagation models have been provided in which information, encoded as messages generated in a source node and directed toward a target node in the network, may flow along arbitrary paths. In the spirit of such a model, some authors (211, 221) suggested performing random walks on the social network to compute centrality values. A prominent approach following this research line is the work proposed in (6). In that work, the authors introduced a novel node centrality measure known as κ-path centrality. In detail, the authors suggested using self-avoiding random walks (192) of length κ (being κ a suitable integer) to compute centrality values. They provided an approximate algorithm running in O(κ³ n^{2−2α} log n), being n the number of nodes and α ∈ [−1/2, 1/2].
In this Chapter we extend that work (6) by introducing a novel measure of edge centrality. This measure is called κ-path edge centrality. In our approach, the procedure of computing edge centrality is viewed as an information propagation problem. In detail, if we assume that multiple messages are generated and propagated within a social network, an edge is considered as “central” if it is frequently exploited to diffuse information. Relying on this idea, we simulate message propagation through random walks on the social network graphs. In our simulation, in addition, we assume that random walks are simple and of bounded length, up to a constant and user-defined value κ. The former assumption is because a random walk should be forced to pass no more than once through an edge; the latter because, as in (115), we assume that the more distant two nodes are, the less they influence each other.
The computation of edge centrality has many practical applications in a wide range of contexts and, in particular, in the area of Knowledge-Based (KB) Systems. For instance, in KB systems in which data can be conveniently managed through graphs, the procedure of weighting edges plays a key role in identifying communities, i.e., groups of nodes densely connected to each other and weakly coupled with nodes residing outside the community itself (259, 280). This is useful to better organize available knowledge: think, for instance, of an e-commerce platform and observe that we could partition customer communities into smaller groups and selectively forward messages (like commercial advertisements) only to groups whose members are actually interested in them. In addition, in the context of the Semantic Web, edge centralities are useful to quantify the strength of the relationships linking two objects and, therefore, can be useful to discover new knowledge (245). Finally, in the context of social networks, edge centralities are helpful to model the intensity of the social tie between two individuals (88): in such a case, we could extract patterns of interactions among users in virtual communities and analyze them to understand how a user is able to influence another one.
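Before listing the contributions, we stress that the following fragment is only a naive illustration of the general idea just described – counting how often edges are traversed by simple random walks of bounded length κ – and not the κ-path edge centrality algorithm of Section 7.3, whose precise weighting and normalization are defined there; all names and parameter values are hypothetical.

    import random

    def naive_edge_traversal_scores(adj, kappa=5, walks_per_node=20, seed=0):
        """Count traversals of each edge by simple (edge-self-avoiding) random walks of length <= kappa."""
        rng = random.Random(seed)
        score = {}
        for start in adj:
            for _ in range(walks_per_node):
                node, visited_edges = start, set()
                for _ in range(kappa):
                    frontier = [u for u in adj[node] if frozenset((node, u)) not in visited_edges]
                    if not frontier:
                        break
                    nxt = rng.choice(frontier)
                    edge = frozenset((node, nxt))
                    visited_edges.add(edge)               # walks are simple: never reuse an edge
                    score[edge] = score.get(edge, 0) + 1
                    node = nxt
        total = float(sum(score.values())) or 1.0
        return {e: c / total for e, c in score.items()}   # normalized traversal frequency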
The main contributions of this Chapter are the following:

• We propose an approach based on random walks consisting of up to κ edges to compute edge centrality. We observe that many approaches in the literature have been proposed to compute node centrality but, comparatively, there are few studies on edge centrality computation (among them we cite the edge betweenness centrality introduced in the Girvan-Newman algorithm). In addition, some authors (50, 211, 221) successfully applied random walks to compute node centrality in networks. We extend these ideas in the direction of edge centrality and, therefore, this work is the first attempt to compute edge centrality by means of random walks.

• We design an algorithm to efficiently compute edge centrality. The worst-case time complexity of our algorithm is O(κm), where m is the number of edges in the social network graph and κ is a (typically small) parameter. Therefore, the running time of our algorithm scales linearly with the number of edges of a social network. This is an interesting improvement of the state of the art: exact algorithms for computing centrality run in O(n^3) and, with some ingenious optimizations, in O(nm) (45). Unfortunately, real-life social networks consist of up to millions of nodes/edges (203) and, therefore, these approaches may not scale well. By contrast, our algorithm works fairly well also on large real-life social networks, even in presence of limited computing resources.

• We provide the results of the performed experimentation, showing that our approach generates reproducible results even though it relies on random walks. Several experiments have been carried out to show that the κ-path edge centrality computation is feasible even on large social networks. The properties shown by this measure are discussed, in order to characterize each of the studied networks.

• Finally, we design a novel, computationally efficient community detection algorithm based on the κ-path edge centrality, and we apply it to social and biological networks with encouraging results.

7.2 Centrality Measures and Applications

In this Section we review the concept of centrality measure and illustrate some recent approaches to compute it.

7.2.1 Centrality Measures in Social Networks

One of the first (and the most popular) node centrality measures is the betweenness centrality (112). We recall its definition:

Definition 1. (Betweenness centrality) Given a graph G = ⟨V, E⟩, the betweenness centrality of the node v ∈ V is defined as

\[ C_B^n(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}} \qquad (7.1) \]

where s and t are nodes in V, σ_st is the number of shortest paths connecting s to t, and σ_st(v) is the number of shortest paths connecting s to t passing through the node v. If there is no path joining s and t, we conventionally set σ_st(v)/σ_st = 0.

The concept of centrality has also been defined for the edges of a graph. From a historical standpoint, the first approach to compute edge centrality was proposed in 1971 by J.M. Anthonisse (11, 166) and was implemented in the GRADAP software package. In this approach, edge centrality is interpreted as a "flow centrality" measure. To define it, let us consider a graph G = ⟨V, E⟩ and let s ∈ V, t ∈ V be a fixed pair of nodes.
Assume that a "unit of flow" is injected into the network by picking s as the source node, and assume that this unit flows in G along the shortest paths. The rush index associated with the pair ⟨s, t⟩ and the edge e ∈ E is defined as

\[ \delta_{st}(e) = \frac{\sigma_{st}(e)}{\sigma_{st}} \]

where, as before, σ_st is the number of shortest paths connecting s to t, and σ_st(e) is the number of shortest paths connecting s to t passing through the edge e. As in the previous case, we conventionally set δ_st(e) = 0 if there is no path joining s and t. The rush index of an edge e ranges from 0 (if e does not belong to any shortest path joining s and t) to 1 (if e belongs to all the shortest paths joining s and t). Therefore, the higher δ_st(e), the more relevant the contribution of e to the transfer of a unit of flow from s to t. The centrality of e can be defined by considering all the pairs ⟨s, t⟩ of nodes and by computing, for each pair, the rush index δ_st(e); the centrality C_R^e(e) of e is the sum of all these contributions

\[ C_R^e(e) = \sum_{s \in V} \sum_{t \in V} \delta_{st}(e) \]

More recently, in 2002, Girvan and Newman proposed a definition of edge betweenness centrality which strongly resembles that provided by Anthonisse. According to the notation introduced above, the edge betweenness centrality of the edge e ∈ E is defined as

\[ C_B^e(e) = \sum_{s \neq t \in V} \frac{\sigma_{st}(e)}{\sigma_{st}} \qquad (7.2) \]

and it differs from that of Anthonisse because the source node s and the target node t must be different. Other, marginally different, definitions of betweenness centrality have been proposed in (46), such as bounded-distance, distance-scaled, edge and group betweenness, and stress and load centrality.

Although the appropriateness of the betweenness centrality in representing the "importance" of a node/edge inside the network is evident, its adoption is not always the best solution to a given problem. For example, as already pointed out in (261), the first limit of the betweenness centrality is that influence or information does not propagate along shortest paths only. With regard to influence propagation, it is also evident that the more distant two nodes are, the less they influence each other, as stated in (115). Additionally, in real applications (such as those described in Section 7.2.3) it is usually not required to compute the exact betweenness centrality ranking of each node/edge of the network; it is more useful to identify the top percentage of nodes/edges which are most relevant to the specific problem (e.g., the study of the propagation of information, the identification of key actors, etc.).

7.2.2 Recent Approaches for Computing Betweenness Centrality

To date, several algorithms to compute the betweenness centrality (of nodes) in a graph have been presented. The most efficient has been proposed by (45): it runs in O(nm) for unweighted graphs and in O(nm + n^2 log n) for weighted graphs containing n nodes and m edges. The computational complexity of these approaches makes them unfeasible for the analysis of large networks. To this purpose, different approximate solutions have been proposed. Amongst others, (51) developed a randomized algorithm (namely, "RA-Brandes") and, similarly, by using adaptive techniques, (13) proposed another approximate version (called "AS-Bader").
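To make the cost of the exact computation concrete, the following sketch (ours, not part of the original experimentation) computes the exact edge betweenness of Equation (7.2) with the NetworkX library on a small toy graph; the approximate techniques cited above exist precisely because this computation does not scale to networks with millions of edges.

```python
# Illustrative sketch: exact edge betweenness centrality (Equation 7.2)
# computed with NetworkX on a small toy graph.
import networkx as nx

G = nx.karate_club_graph()  # toy network, used only as an example

# Exact edge betweenness via the O(nm) Brandes-style algorithm cited above.
ebc = nx.edge_betweenness_centrality(G, normalized=False)

# In practice one is often interested only in the top-ranked edges.
top = sorted(ebc.items(), key=lambda kv: kv[1], reverse=True)[:5]
for (u, v), score in top:
    print(f"({u},{v}) -> {score:.1f}")
```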
In (211), Newman devised a random-walk based algorithm to compute betweenness centrality which shares similarities to our approach, starting from the concept of message propagation along random paths. From the same concept, (6) proposed the κ-path centrality measure (for nodes) and developed a O(κ3 n2−2α log n) algorithm (namely, “RA-κpath”) to compute it. 7.2.3 Application of Centrality Measures in Social Network Analysis Applications of centrality information acquired from social networks have been investigated by (260). The authors defined different methodologies to exploit discovered data, e.g., for marketing purposes, recommendation and trust analysis. Several marketing and commercial studies have been applied to Online Social Networks (OSNs), in particular to discover efficient channels to distribute information (52, 267) and to study the spread of influence (159). Potentially, our study could provide useful information to all these applied research directions, identifying those interesting edges with high κ-path edge centrality, which emphasizes their importance within the social network. Those nodes interconnected by high central edges are important because of the position they “topologically” occupy. Moreover, they could efficiently carry information to their neighborhood. 7.3 7.3.1 Measuring Edge Centrality Design Goals Before to providing a formal description of our algorithm, we illustrate the main ideas behind it. We start from a real-life example and we use it to derive some “requirements” our algorithm should satisfy. Let us consider a network of devices. In this context, without loss of generality, we can assume that the simplest “piece” of information is a message. In addition, each device has an address book storing the devices with which it can exchange messages. A device can both receive and transmit messages to other devices appearing in its address book. 119 7. A NOVEL CENTRALITY MEASURE FOR SOCIAL NETWORKS The purpose of our algorithm is to rank links of the network on the basis of their aptitude of favoring the diffusion of information. In detail, the higher the rank of a link, the higher its ability of propagating a message. Henceforth, we refer to this problem as link ranking. The link ranking problem in our scenario can be viewed as the problem of computing edge centrality in social networks. We guess that some of the hypotheses/procedures adopted to compute edge centrality can be applied to solve the link ranking problem. We suggest to extend these techniques in a number of ways. In detail, we guess that the algorithm to compute the link ranking should satisfy the following requirements: Requirement 1 - Simulation of Message Propagation by using Random Walks. As shown in Section 7.2, some authors assume that information flows on a network along the shortest paths. Such an intuition is formally captured by Equation (7.1). However, as observed in (114, 211), centrality measures based on shortest paths can provide some counterintuitive results. In detail, (114, 211) present some simple examples showing that the application of Equation (7.1) would lead to assign excessively low centrality scores to some nodes. To this purpose, some authors (114) provided a more refined definition of centrality relying on the concept of flow in a graph. To define this measure, assume that each edge in the network can carry one or more messages; we are interested in finding those edges capable of transferring the largest amount of messages between a source node s and a target node t. 
The centrality of a vertex v can be computed by considering all the pairs hs, ti of nodes and, for each pair, by computing the amount of flow passing through v. In the light of such a definition, in the computation of node centrality also non-shortest paths are considered. However, in (211), Newman shows that centrality measures based on the concept of flow are not exempt from odd effects. To this purpose, the author suggested to consider a random walker which is not forced to move along the shortest paths of a network to compute the centrality of nodes. The Newman’s strategy has been designed to compute node centrality, whereas our approach targets at computing edge centrality. Despite this difference, we believe that the idea of using random walks in place of shortest paths can be successful even when applied to the link ranking problem. In our scenario, if a device wants to propagate a message, it is generally not aware of the whole network topology, and therefore it is not aware of the shortest paths to route the message. In fact, each device is only aware of the devices appearing in its address book. As a consequence, the device selects, according to its own criteria, one (or more) of its contacts and sends them the message in the hope that they will further continue the propagation. In order to simulate the message propagation, our first requirement is to exploit random walks. Requirement 2 - Dynamic Update of Ranking. Ideally, if we would simulate the propagation of multiple messages on our network of devices, it could happen that an edge is selected more frequently than others. Edges appearing more frequently than others show a better aptitude to spread messages and, therefore, their rank should be higher than others. As a consequence, our mechanism to rank edges should be dynamic: at the beginning, all the edges are equally likely to propagate a message and, therefore, they have the same rank. At each step of the simulation, if an edge is selected, it must be awarded by getting a “bonus score”. Requirement 3 - Simple Paths. The procedure of simulating message propagation through random walks described above could imply that a message can pass through an edge more than once. In such a case, the rank of edges which are traversed multiple times would be dispropor- 120 7.3 Measuring Edge Centrality tionately inflated whereas the rank of edges rarely (or never) visited could be underestimated. The global effect would be that the ranking produced by this approach would not be correct. As a consequence, another requirement is that the paths exploited by our algorithm must be simple. Requirement 4 - Bounded Length Paths. As shown in (115), the more distant two nodes are, the less they influence each other. The usage of paths of bounded length has been already explored to compute node centrality (40, 96). A first relevant example is provided in (96); in that paper the authors observe that methods to compute node centralities like those based on eigenvectors can lead to counterintuitive results. In fact, those methods take the whole network topology into account and, therefore, they compute the centrality of a node on a global scale. It may happen that a node could have a big impact on a small scale (think of a well-respected researcher working on a niche topic) but a limited visibility on a large scale. Therefore, the approach of (96) suggested to compute node centralities in local networks and they considered ego networks. 
An ego network is defined as a network consisting of a single node (the ego) together with the nodes it is connected to (the alters) and all the links among those alters. The diameter of an ego network is 2 and, therefore, computing node centrality in such a network only requires paths of length at most 2. In (40) the authors extended these concepts by considering paths up to a length k. We agree with the observations above and assume that two nodes are considered distant if the shortest path connecting them is longer than κ hops, κ being the established threshold. Under this assumption, the only effective paths are those whose length is at most κ. We adopt this requirement and, in our simulation procedure, we consider paths of bounded length. In the next sections we discuss how our algorithm incorporates the requirements illustrated above.

7.3.2 κ-Path Centrality

In this section we introduce the concepts of κ-path node centrality and κ-path edge centrality. The notion of κ-path node centrality, introduced in (6), is defined as follows:

Definition 2. (κ-path node centrality) For each node v of a graph G = ⟨V, E⟩, the κ-path node centrality C^κ(v) of v is defined as the sum, over all possible source nodes s, of the frequency with which a message originated from s goes through v, assuming that the message traversals are only along random simple paths of at most κ edges. It can be formalized, for an arbitrary node v ∈ V, as

\[ C^{\kappa}(v) = \sum_{s \in V} \frac{\sigma_s^{\kappa}(v)}{\sigma_s^{\kappa}} \qquad (7.3) \]

where s ranges over all the possible source nodes, σ_s^κ(v) is the number of κ-paths originating from s and passing through v, and σ_s^κ is the overall number of κ-paths originating from s.

Observe that Equation (7.3) resembles the definition of betweenness centrality provided in Equation (7.1). In fact, the structure of the two equations coincides if we replace the concept of shortest paths (adopted in the betweenness centrality) with the concept of κ-paths, which is the core of our definition of κ-path centrality.

The possibility of extending the concept of "centrality" from nodes to edges has already been exploited by Girvan and Newman. In particular, they generalized the formulation of the "betweenness centrality" (referred to nodes), introducing the novel concept of "edge betweenness centrality". Similarly, we extend Definition 2 in order to define a novel edge centrality index, baptized κ-path edge centrality.

Definition 3. (κ-path edge centrality) For each edge e of a graph G = ⟨V, E⟩, the κ-path edge centrality L^κ(e) of e is defined as the sum, over all possible source nodes s, of the frequency with which a message originated from s traverses e, assuming that the message traversals are only along random simple paths of at most κ edges. The κ-path edge centrality is formalized, for an arbitrary edge e, as follows

\[ L^{\kappa}(e) = \sum_{s \in V} \frac{\sigma_s^{\kappa}(e)}{\sigma_s^{\kappa}} \qquad (7.4) \]

where s ranges over all the possible source nodes, σ_s^κ(e) is the number of κ-paths originating from s and traversing the edge e and, finally, σ_s^κ is the number of κ-paths originating from s.

In practical cases, the direct application of Equation (7.4) is not feasible because it requires counting all the κ-paths originating from all the source nodes s, and such a number can be exponential in the number of nodes of G. To this purpose, we need to design algorithms capable of efficiently approximating the value of the κ-path edge centrality. These algorithms are introduced and discussed in the following.
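As an illustration of Definition 3 (ours, not part of the thesis software), the following naive Python sketch enumerates every edge-simple path of at most κ edges from every source node and accumulates the frequencies of Equation (7.4). Its cost grows exponentially, which is precisely why the approximate algorithms presented next are needed; counting every prefix of a path as a distinct κ-path is our reading of the definition.

```python
# Naive reference computation of the kappa-path edge centrality of
# Definition 3: enumerate every edge-simple path of at most kappa edges
# from every source node (exponential cost, toy graphs only).
from collections import defaultdict

def kpath_edge_centrality_exact(adj, kappa):
    """adj maps each node to the set of its neighbours (undirected graph)."""
    centrality = defaultdict(float)

    for source in adj:
        paths_from_s = 0                 # sigma_s^kappa
        traversals = defaultdict(int)    # sigma_s^kappa(e)

        def extend(node, path_edges):
            nonlocal paths_from_s
            if len(path_edges) == kappa:
                return
            for neighbour in adj[node]:
                edge = frozenset((node, neighbour))
                if edge in path_edges:   # Requirement 3: edge-simple paths
                    continue
                new_path = path_edges + [edge]
                paths_from_s += 1        # one more kappa-path from source
                for e in new_path:       # the path traverses all its edges
                    traversals[e] += 1
                extend(neighbour, new_path)

        extend(source, [])
        if paths_from_s > 0:
            for edge, count in traversals.items():
                centrality[edge] += count / paths_from_s

    return dict(centrality)

# toy example: a square with one diagonal
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"a", "c"}}
print(kpath_edge_centrality_exact(adj, kappa=2))
```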
7.3.3 The Algorithm for Computing the κ-Path Edge Centrality

In this Section we discuss an algorithm, called Edge Random Walk κ-Path Centrality (or, shortly, ERW-Kpath), to efficiently compute edge centrality values. It consists of two main steps: (i) node and edge weight assignment and (ii) simulation of message propagations through random simple paths. In the ERW-Kpath algorithm, the probability of selecting a node or an edge is uniform; we also provide another version of the algorithm (called WERW-Kpath - Weighted Edge Random Walk κ-Path Centrality) in which the node/edge probabilities are not uniform. It has been proved (81) that both the ERW-Kpath and the WERW-Kpath algorithms return, as output, an approximate value of the edge centrality index of Definition 3. In the following we discuss the ERW-Kpath algorithm by illustrating each of the two steps composing it; after that, we introduce the WERW-Kpath algorithm as a generalization of ERW-Kpath.

Step 1: node and edge weight assignment

In the first stage of our algorithm, we assign a weight to both the nodes and the edges of the graph G = ⟨V, E⟩ representing our social network. Weights on nodes are used to select the source nodes from which each message propagation simulation starts. Weights on edges represent the initial values of edge centrality and, to comply with Requirement 2, they are updated during the execution of the algorithm.

To compute node weights, we introduce the normalized degree δ(v_n) of a node v_n ∈ V as follows:

Definition 4. (Normalized degree) Given an undirected graph G = ⟨V, E⟩ and a node v_n ∈ V, its normalized degree δ(v_n) is

\[ \delta(v_n) = \frac{|I(v_n)|}{|V|} \qquad (7.5) \]

where I(v_n) represents the set of edges incident on v_n.

The normalized degree δ(v_n) relates the degree of v_n to the total number of nodes in the network. Intuitively, it represents how much a node contributes to the overall connectivity of the graph. Its value belongs to the interval [0, 1] and the higher δ(v_n), the better v_n is connected in the graph.

Regarding edge weights, we introduce the following definition:

Definition 5. (Initial edge weight) Given an undirected graph G = ⟨V, E⟩ and an edge e_m ∈ E, its initial edge weight ω_0(e_m) is

\[ \omega_0(e_m) = \frac{1}{|E|} \qquad (7.6) \]

Intuitively, the meaning of Equation (7.6) is as follows: we initially manage a "budget" consisting of |E| points; these points are equally divided among all the edges; the amount of points received by an edge represents its initial rank.

Figure 7.1: Example of assignment of normalized degrees and initial edge weights (a graph with 11 nodes a–k, each labeled with its normalized degree, e.g. 3/11, and 12 edges, each carrying the initial weight 1/12).

Step 2: simulation of message propagations through random simple κ-paths

In the second step we simulate multiple random walks on the graph G; this is consistent with Requirement 1. To this purpose, our algorithm iterates the following sub-steps a number of times equal to a fixed value ρ. We will later provide a practical rule for tuning ρ. At each iteration, our algorithm performs the following operations:

1. A node v_n ∈ V is selected according to one of the following two possible strategies:

a. uniformly at random, with probability

\[ P(v_n) = \frac{1}{|V|} \qquad (7.7) \]

b.
with a probability proportional to its normalized degree δ(v_n), given by

\[ P(v_n) = \frac{\delta(v_n)}{\sum_{v_k \in V} \delta(v_k)} \qquad (7.8) \]

2. All the edges in G are marked as not traversed.

3. The procedure MessagePropagation is invoked. It generates a simple random walk whose length is not greater than κ, satisfying Requirement 3.

Let us describe the procedure MessagePropagation. This procedure carries out a loop as long as both the following conditions hold true:

• The length of the path generated so far is smaller than κ. This is managed through a length counter N.

• Assuming that the walk has reached the node v_n, there must exist at least one edge incident on v_n which has not yet been traversed. To check this, we attach a flag T(e_m) to each edge e_m ∈ E, such that

\[ T(e_m) = \begin{cases} 1 & \text{if } e_m \text{ has already been traversed} \\ 0 & \text{otherwise} \end{cases} \]

and we require that the following condition holds

\[ |I(v_n)| > \sum_{e_k \in I(v_n)} T(e_k) \qquad (7.9) \]

where I(v_n) is the set of edges incident on v_n.

The former condition complies with Requirement 4 (i.e., it allows us to consider only paths of length up to κ). The latter condition, instead, prevents the message from passing more than once through an edge, thus satisfying Requirement 3. If the conditions above are satisfied, the MessagePropagation procedure selects an edge e_m by applying one of two strategies:

a. uniformly at random, with probability

\[ P(e_m) = \frac{1}{|I(v_n)| - \sum_{e_k \in I(v_n)} T(e_k)} \qquad (7.10) \]

among all the edges e_m ∈ {I(v_n) | T(e_m) = 0} incident on v_n (i.e., excluding the already traversed edges);

b. with a probability proportional to the edge weight ω_l(e_m), given by

\[ P(e_m) = \frac{\omega_l(e_m)}{\sum_{e_m \in \hat{I}(v_n)} \omega_l(e_m)} \qquad (7.11) \]

where Î(v_n) = {e_k ∈ I(v_n) | T(e_k) = 0} and ω_l(e_m) = ω_{l−1}(e_m) + β · T(e_m) if 1 ≤ l ≤ κρ.

Let e_m be the selected edge and let v_{n+1} be the node reached from v_n by means of e_m. The MessagePropagation procedure awards a bonus β to e_m, sets T(e_m) = 1 and increases the counter N by 1. The message propagation activity continues from v_{n+1}. At the end, each edge e ∈ E is assigned a centrality index L^κ(e) equal to its final weight ω_{κρ}(e).

The values of β and ρ can, in principle, be fixed in an arbitrary fashion, but we provide a simple practical rule to tune them. Our experimentation shows that it is convenient to set ρ ≃ |E|. In particular, if we set ρ = |E| − 1 and β = 1/|E| we get a nice result: the edge centrality indexes always range in [1/|E|, 1] and, ideally, the centrality index of a given edge equals 1 if (and only if) that edge is selected in every message propagation simulation. In fact, each edge initially receives a default score equal to 1/|E| and, each time it is selected in a subsequent trial, its score increases by β = 1/|E|. Intuitively, if an edge is selected in all the trials, its final score will be equal to

\[ \frac{1}{|E|} + \rho \cdot \frac{1}{|E|} = \frac{1}{|E|} + \frac{|E| - 1}{|E|} = 1. \]

The time complexity of this algorithm is O(κρ). If we fix ρ = |E| − 1, we achieve a good trade-off between accuracy and computational cost: in such a case, the worst-case time complexity of the ERW-Kpath algorithm is O(κ|E|) and, since in real social networks |E| is of the same order of magnitude as |V|, the time complexity of our approach is nearly linear in the number of nodes. This makes our approach computationally feasible also for large Online Social Networks.
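The following Python sketch (an illustrative re-implementation under our own naming, not the released Java code) condenses the two steps just described in their weighted variant: initial weights 1/|E|, source nodes drawn proportionally to the normalized degree as in Equation (7.8), and edges drawn proportionally to their current weight as in Equation (7.11). The formal pseudocode is given in Algorithms 3 and 4 below.

```python
# Illustrative sketch of the WERW-Kpath procedure described above.
# Step 1 assigns normalized degrees and initial edge weights 1/|E|;
# step 2 runs rho message propagations along edge-simple paths of at
# most kappa edges, awarding a bonus beta to every traversed edge.
import random
from collections import defaultdict

def werw_kpath(nodes, edges, kappa=20):
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append((u, v))
        adj[v].append((u, v))

    rho = len(edges) - 1
    beta = 1.0 / len(edges)
    weight = {e: 1.0 / len(edges) for e in edges}         # omega_0 (Eq. 7.6)
    delta = {n: len(adj[n]) / len(nodes) for n in nodes}   # normalized degree (Eq. 7.5)

    for _ in range(rho):
        # source node chosen proportionally to its normalized degree (Eq. 7.8)
        source = random.choices(list(nodes), weights=[delta[n] for n in nodes])[0]
        traversed = set()
        node = source
        for _ in range(kappa):
            candidates = [e for e in adj[node] if e not in traversed]
            if not candidates:
                break
            # edge chosen proportionally to its current weight (Eq. 7.11)
            edge = random.choices(candidates, weights=[weight[e] for e in candidates])[0]
            weight[edge] += beta          # award the bonus
            traversed.add(edge)
            node = edge[1] if edge[0] == node else edge[0]
    return weight  # final weights approximate the kappa-path edge centralities

# tiny usage example
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
print(werw_kpath(nodes, edges, kappa=3))
```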
The version of the algorithm shown in Algorithms 3 and 4 adopts uniform probability distribution functions in order to choose nodes and edges purely at random and, as said before, it is called ERW-Kpath. A weighted version of the same algorithm, called WERW-Kpath, differs only in line 5 of Algorithm 3 and line 2 of Algorithm 4, where the weighted functions specified in Equations (7.8) and (7.11) are adopted. During our experimentation we always adopted the WERW-Kpath algorithm.

Algorithm 3 ERW-Kpath(Graph G = ⟨V, E⟩, int κ, int ρ, float β)
1: Assign each node v_n ∈ V its normalized degree
2: Assign each edge e_m ∈ E the uniform probability function as weight
3: for i = 1 to ρ do
4:   N ← 0 {a counter to check the length of the κ-path}
5:   v_n ← a node chosen uniformly at random in V
6:   MessagePropagation(v_n, N, κ, β)
7: end for

Algorithm 4 MessagePropagation(Node v_n, int N, int κ, float β)
1: while N < κ and |I(v_n)| > Σ_{e ∈ I(v_n)} T(e) do
2:   e_m ← an edge in {e ∈ I(v_n) | T(e) = 0}, chosen uniformly at random
3:   Let v_{n+1} be the node reached from v_n through e_m
4:   ω(e_m) ← ω(e_m) + β
5:   T(e_m) ← 1
6:   v_n ← v_{n+1}
7:   N ← N + 1
8: end while

7.3.4 Novelties Introduced by our Approach

In this Section we discuss the main novelties introduced by our ERW-Kpath and WERW-Kpath algorithms. First of all, we observe that our approach is flexible, in the sense that it can easily be modified to incorporate new models describing the spread of a message in a network. For instance, we can define multiple strategies to select the source node from which each message propagation simulation starts. In particular, in this Chapter we considered two options, namely: (i) the probability of selecting a node s as the source is uniform across all the nodes of the network (and this is at the basis of the ERW-Kpath algorithm), or (ii) the probability of selecting a node s as the source is proportional to the degree of s (and this is at the basis of the WERW-Kpath algorithm). It would be easy to select a different probability distribution, if necessary. In an analogous fashion, in the ERW-Kpath and WERW-Kpath algorithms we defined two strategies to select the node receiving a message; of course, other and more complex strategies could be implemented to replace those described in this Chapter. In addition, observe that the ERW-Kpath and WERW-Kpath algorithms provide a unicast propagation model in which any sender node is in charge of selecting exactly one receiving node. We could easily modify our algorithms so as to support a multicast propagation model in which a node can issue a message to multiple receivers.

A further novelty is that we use multiple random walks to simulate the propagation of messages and assume that the frequency with which an edge e is selected in these walks is a measure of its centrality. An approach similar to ours was presented in (111), but it assumes that messages propagate along shortest paths. In detail, given a pair of nodes i and j, the approach of (111) introduces a parameter, called network efficiency ε_ij, defined as the inverse of the length of the shortest path(s) connecting i and j. After that, it provides a new parameter, called information centrality; the information centrality IC_e of an edge e is defined as the relative drop in the network efficiency generated by the removal of e from the network.
Our approach provides some novelties in comparison with that of (111): in fact, in our approach a network is viewed as a decentralized system in which there is no user having a complete knowledge of the network topology. Due to this incomplete knowledge, users are not able to identify shortest path and, therefore, they use a probabilistic model to spread messages. This yields also relevant computational consequences: the identification of all the pairs of shortest paths in a network is computationally expensive and it could be unfeasible on networks containing millions of nodes. By contrast, our approach scales almost linearly with the number of edges and, therefore, it can easily run also over large networks. 126 7.3 Measuring Edge Centrality Finally, despite our approach relies on the concept of message propagation which requires an orientation on edges, it can work also on undirected networks. In fact, the ERW-Kpath (resp., WERW-Kpath) algorithm selects at the beginning a source node s that decides the node v to which a message has to be forwarded. Therefore, at run-time, the ERW-Kpath (resp., WERWKpath) algorithm induces an orientation on the edge linking s and v which coincides with the direction of the message sent by s; such a process does not require to operate on directed networks, even if it could intrinsically work well with such a type of networks. 7.3.5 Comparison of the ERW-Kpath and WERW-Kpath algorithms In this Section we provide a comparison between ERW-Kpath and WERW-Kpath. First of all, we would like to observe that both the two algorithms are capable of correctly approximating the κ-path centrality values provided in Definition 3. Despite the two algorithms are formally correct, however, we observe that the WERW-Kpath algorithm should be preferred to ERW-Kpath. In fact, in the ERW-Kpath algorithm, we assume that each node can select, at random, any edge (among those that have not yet been selected) to propagate a message. Such an assumption could be, however, too strong in real-life social networks. To better clarify this concept, consider Online Social Networks like Facebook or Twitter. In both of these networks a single user may have a large number of contacts with whom she/he can exchange information (e.g., a wall post on Facebook or a tweet on Twitter). However, sociological studies reveal that there is an upper limit to the number of people with whom a user could maintain stable social relationships and this number is known as Dunbar number (92). For instance, in Facebook, we found that the average number of friends of a user is more than 300. On the other hand, it has been reported that male users actively communicate with only 10 of them, whereas female users with 161 . This implies that there are preferential edges along which information flows in social networks. The ERW-Kpath algorithm is simple and easy to implement but it could fail to identify preferential edges along which messages propagate. By contrast, in the WERW-Kpath algorithm, the probability of selecting an edge is proportional to the weight already acquired by that edge. This weight, therefore, has to be intended as the frequency with which two nodes exchanged messages in the past. Such a property has also a relevant implication and makes feasible some applications which could not be implemented by the ERW-Kpath algorithm. In fact, our approach, to some extent can be exploited to recommend/predict links in a social network. 
The problem of recommending/predicting links plays a key role in Computer Science and Sociology and is often known in the literature as the link prediction problem (189). In the link prediction problem, the network topology is analyzed to find pairs of non-connected nodes which could profit from the creation of a social link. Various measures can be exploited to assess whether a link should be recommended between a pair of nodes u and v; the simplest is the Jaccard coefficient J(u, v) computed on the neighbors of u and v. The larger the number of neighboring nodes shared by u and v, the larger J(u, v); in such a case it is convenient to add an edge linking u and v. Further (and more complex) measures take the whole network topology into account to recommend links. For instance, the Katz coefficient (189) considers the whole ensemble of paths running between u and v to decide whether a link between them should be recommended.

1 http://www.economist.com/node/13176775?story_id=13176775

The WERW-Kpath algorithm can be exploited to address the link prediction problem. In detail, by means of WERW-Kpath, we can handle not only topological information but we can also quantify the strength of the relationship joining two nodes. So, we know that two nodes u and v are connected and, in addition, we know how frequently they exchange information. This allows us to extend the measures introduced above: for instance, if we would like to use the Jaccard coefficient, we can consider only those edges (called strong edges) incident on u (resp., v) whose weight is greater than a given threshold. This is equivalent to filtering out all the edges which are rarely employed to spread information. As a consequence, the Jaccard coefficient can be computed only on strong edges. For these reasons, in the following experiments we focused only on the WERW-Kpath algorithm.

7.4 Experimentation

Our experimentation has been conducted on different Online Social Networks whose datasets are publicly available. The adopted datasets are summarized in Table 7.1.

Table 7.1: Datasets adopted in our experimentation.
N. | Network    | No. Nodes | No. Edges | Directed | Type       | Ref
1  | Wiki-Vote  | 7,115     | 103,689   | Yes      | Elections  | (183)
2  | CA-HepPh   | 12,008    | 237,010   | No       | Co-authors | (183)
3  | CA-CondMat | 23,133    | 186,932   | No       | Co-authors | (183)
4  | Cit-HepTh  | 27,770    | 352,807   | Yes      | Citations  | (183)
5  | Facebook   | 63,731    | 1,545,684 | Yes      | Online SN  | (270)
6  | Youtube    | 1,138,499 | 4,945,382 | No       | Online SN  | (270)

Dataset 1 depicts the voting system of Wikipedia for the elections of January 2008. Datasets 2 and 3 represent the archives of Arxiv (http://arxiv.org/, an online archive of scientific preprints in the fields of Mathematics, Physics and Computer Science, amongst others) for papers in the fields of, respectively, High Energy Physics (Phenomenology) and Condensed Matter Physics, as of April 2003. Dataset 4 represents a network of scientific citations among papers belonging to the Arxiv High Energy Physics (Theory) field. Dataset 5 describes a small sample of the Facebook network, representing its friendship graph. Finally, Dataset 6 depicts a fragment of the YouTube social graph as of 2007.

7.4.1 Robustness

A quality required of a good random-walk based algorithm is the robustness of its results: it is important that the results obtained are consistent across different iterations of the algorithm when the initial conditions are the same. In order to verify that our WERW-Kpath algorithm produces reliable results, we performed a quantitative and a qualitative analysis, as follows.
In the quantitative analysis we are interested in checking whether the algorithm produces the same results in different runs. In the qualitative analysis, instead, we study whether different values of κ deeply impact the ranking of the edges.

Quantitative analysis of results

Our first experiment verifies that, over different iterations with the same configuration, results are consistent. It is possible to highlight this aspect by running the WERW-Kpath algorithm several times on the same dataset, with the same configuration. Regarding ρ, in the experimentation we adopt ρ = |E| − 1; according to this choice, the bonus awarded is fixed to β = 1/|E|. As for the maximum length of the κ-paths, we chose a value of κ = 20.

Our quantitative analysis highlights that the distributions of values are almost completely overlapping over different runs on each of the datasets considered in Table 7.1. In Figure 7.2 we graphically report the distribution of edge centrality values for the "Wiki-Vote" dataset. Results are from four different runs of the algorithm on the same dataset with the same configuration. Data are plotted using a semi-logarithmic scale in order to highlight the "high" part of the distribution, where edges with high κ-path edge centrality lie. Similar results are obtained by performing the same test on each considered dataset, but they are not reported due to space limitations.

Figure 7.2: Robustness test on Wiki-Vote.

The robustness property is necessary but not sufficient to ensure the correctness of our algorithm. In fact, the quantitative evaluation we performed ensures that the centrality values produced by WERW-Kpath are consistent over different runs of the algorithm, but it does not ensure that, for example, a given edge e ∈ E has, after Run 1, a centrality value which is the same as (or, at least, very similar to) the one it has after Run 2. In other words, centrality values that overlap in different distributions may not refer to the same edges. To investigate this aspect we analyze the results from a qualitative perspective, as follows.

Qualitative analysis of results

Our random-walk-based approach ensures minimal fluctuations of the centrality values assigned to each edge across different runs, if the configuration of each run is the same. To verify this aspect, we calculate the similarity of the distributions obtained by running WERW-Kpath four times on each dataset, using the same configuration, and compare the results by adopting different measures. For this experiment, we considered different settings for the length of the exploited κ-paths, i.e., κ = 5, 10, 20, in order to investigate also its impact.

The first measure considered is a variant of the Jaccard coefficient, classically defined as

\[ J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \qquad (7.12) \]

where X and Y represent, in our case, a pair of compared distributions of κ-path edge centrality values. In order to define the Jaccard coefficient in our context we need to take into account the following considerations. Let us consider two runs of our algorithm, say X and Y, and let us first consider an edge e; let us denote by ω_X(e) (resp., ω_Y(e)) the centrality index of e in the run X (resp., Y); intuitively, the performance of our algorithm is "good" if ω_X(e) is close to ω_Y(e).
However, a direct comparison of the two values may make no sense because, for instance, the edge e could have the highest weight in both runs while ω_X(e) still differs significantly from ω_Y(e). Therefore, we consider the normalized values ω_X(e)/max_{e∈X} ω(e) and ω_Y(e)/max_{e∈Y} ω(e), and we assume that the algorithm yields good results if these values are "close". To make this definition more rigorous we define

\[ \Lambda(e) = \left| \frac{\omega_X(e)}{\max_{e \in X} \omega(e)} - \frac{\omega_Y(e)}{\max_{e \in Y} \omega(e)} \right| \]

and we say that the algorithm produces good results if Λ(e) is smaller than a threshold ε. Now, in order to fix the value of ε, let us consider the values achieved by Λ(e) for each e ∈ E. We can provide an upper bound Λ̄ on Λ(e) by considering two extremal cases: (i) ω_X(e) = max_{e∈X} ω(e) and ω_Y(e) = min_{e∈Y} ω(e) or, vice versa, (ii) ω_X(e) = min_{e∈X} ω(e) and ω_Y(e) = max_{e∈Y} ω(e). For the sake of simplicity, assume that case (i) occurs; of course, the following considerations hold true also in case (ii). In such a case we obtain

\[ \bar{\Lambda} = 1 - \frac{\min_{e \in Y} \omega(e)}{\max_{e \in Y} \omega(e)}. \]

As discussed in the following (see Figures 7.4–7.9 and 7.10–7.15), edge centralities are distributed according to a power law and, therefore, the value of min_{e∈Y} ω(e) is some orders of magnitude smaller than max_{e∈Y} ω(e). Therefore, the ratio of min_{e∈Y} ω(e) to max_{e∈Y} ω(e) tends to 0 and Λ̄ tends to 1. According to these considerations, we computed how many times the condition Λ(e) ≤ τ Λ̄ holds true, where 0 < τ ≤ 1 is a tolerance threshold. Since Λ̄ ≃ 1, this amounts to counting how many times Λ(e) ≤ τ. Therefore, we can define the modified Jaccard coefficient as follows

\[ J^{\tau}(X, Y) = \frac{\left| \left\{ e : \left| \frac{\omega_X(e)}{\max_{e \in X} \omega(e)} - \frac{\omega_Y(e)}{\max_{e \in Y} \omega(e)} \right| \le \tau \right\} \right|}{|X \cup Y|} \qquad (7.13) \]

In our tests we considered the tolerance values τ = 0.01, 0.05, 0.10, which identify a 1%, 5% and 10% maximum accepted variation of the edge centrality value assigned to a given edge across different runs with the same configuration. A mean degree of similarity, avg(J^τ), is obtained by averaging over the \binom{4}{2} = 6 possible pairs of distributions produced by the four runs on the datasets discussed above.

The second measure we consider is the Pearson correlation, adopted to evaluate the correlation between the two distributions. It is defined as

\[ \rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X) \cdot \mathrm{var}(Y)}} \qquad (7.14) \]

and its values lie in the interval [−1, +1], with the following interpretation:

• ρ_{X,Y} > 0: the distributions are directly correlated; in particular:
  – ρ_{X,Y} > 0.7: strongly correlated;
  – 0.3 < ρ_{X,Y} < 0.7: moderately correlated;
  – 0 < ρ_{X,Y} < 0.3: weakly correlated;
• ρ_{X,Y} = 0: not correlated;
• ρ_{X,Y} < 0: inversely correlated.

Clearly, the higher ρ_{X,Y}, the better the WERW-Kpath algorithm works. Observe, however, that the ρ_{X,Y} coefficient only tells us whether the two distributions X and Y are (linearly) related or not. It could therefore happen that the WERW-Kpath algorithm, in two different runs, generates two edge centrality distributions X and Y such that Y = aX, a being a real coefficient. In such a case the ρ_{X,Y} coefficient would be 1, but we could not conclude that the algorithm works properly: the coefficient a could be very small (or, in the opposite case, very large) and, therefore, the two distributions would significantly differ even though they would preserve the same edge ranking. To this purpose, we consider a third measure to compute the distance between the two distributions X and Y. To do so, we adopt the Euclidean distance L_2(X, Y), defined as

\[ L_2(X, Y) = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2} \qquad (7.15) \]

As emerges from the distributions shown in Figure 7.2, almost all the terms in Equation (7.15) cancel each other and, therefore, the final value of L_2(X, Y) is dominated by the difference of the κ-path centrality values associated with the few top-ranked edges. To obtain the average distance between two points of the distributions X and Y in a given dataset, we simply divide L_2(X, Y) by the number of edges in that dataset.
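A small sketch of the three comparison measures, assuming two runs of WERW-Kpath are available as dictionaries mapping each edge to its centrality value (names and data layout are ours):

```python
# Sketch of the three comparison measures used above (modified Jaccard J^tau,
# Pearson correlation, Euclidean distance), for two runs of WERW-Kpath given
# as dictionaries with the same edge keys.
import math

def compare_runs(run_x, run_y, tau=0.05):
    edges = sorted(run_x)                      # same edge set in both runs
    x = [run_x[e] for e in edges]
    y = [run_y[e] for e in edges]

    # modified Jaccard coefficient (Eq. 7.13): fraction of edges whose
    # normalized centralities differ by at most tau
    max_x, max_y = max(x), max(y)
    close = sum(1 for xi, yi in zip(x, y) if abs(xi / max_x - yi / max_y) <= tau)
    j_tau = close / len(edges)

    # Pearson correlation (Eq. 7.14)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    pearson = cov / math.sqrt(var_x * var_y)

    # Euclidean distance (Eq. 7.15) and its per-edge average
    l2 = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
    return j_tau, pearson, l2, l2 / len(edges)
```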
Intrinsic characteristics of the analyzed datasets do not influence the robustness of the results: even considering datasets representing different kinds of social network (e.g., collaboration networks, citation networks and online communities), WERW-Kpath produces highly overlapping results over different runs.

Table 7.2: Analysis by using the similarity coefficient J^τ, the correlation ρ_{X,Y} and the Euclidean distance L_2(X, Y).
Dataset    | κ  | τ = 0.01 | τ = 0.05 | τ = 0.10 | ρ_{X,Y} | L_2(X, Y)  | avg(L_2(X, Y))
Wiki-Vote  | 5  | 43.52%   | 98.49%   | 99.91%   | 0.67    | 1.61·10^-2 | 1.55·10^-7
Wiki-Vote  | 10 | 61.13%   | 98.86%   | 99.98%   | 0.69    | 2.37·10^-2 | 2.28·10^-7
Wiki-Vote  | 20 | 70.68%   | 99.96%   | 99.98%   | 0.70    | 3.48·10^-2 | 3.35·10^-7
CA-HepPh   | 5  | 52.63%   | 96.11%   | 99.53%   | 0.92    | 1.18·10^-2 | 4.97·10^-8
CA-HepPh   | 10 | 70.45%   | 99.02%   | 99.88%   | 0.95    | 1.23·10^-2 | 5.18·10^-8
CA-HepPh   | 20 | 75.65%   | 99.51%   | 99.87%   | 0.96    | 2.90·10^-2 | 1.22·10^-7
CA-CondMat | 5  | 22.23%   | 80.51%   | 96.98%   | 0.73    | 1.39·10^-2 | 7.43·10^-8
CA-CondMat | 10 | 35.16%   | 93.72%   | 99.40%   | 0.79    | 2.18·10^-2 | 1.16·10^-7
CA-CondMat | 20 | 35.63%   | 95.80%   | 99.44%   | 0.83    | 3.40·10^-2 | 1.81·10^-7
Cit-HepTh  | 5  | 47.62%   | 97.76%   | 99.78%   | 0.78    | 0.92·10^-2 | 2.60·10^-8
Cit-HepTh  | 10 | 60.61%   | 99.45%   | 99.93%   | 0.83    | 1.36·10^-2 | 3.85·10^-8
Cit-HepTh  | 20 | 63.68%   | 99.62%   | 99.93%   | 0.85    | 2.04·10^-2 | 5.78·10^-8
Facebook   | 5  | 56.98%   | 97.34%   | 99.36%   | 0.79    | 1.01·10^-2 | 5.11·10^-9
Facebook   | 10 | 56.85%   | 98.49%   | 99.76%   | 0.84    | 1.87·10^-2 | 1.20·10^-8
Facebook   | 20 | 68.58%   | 99.39%   | 99.90%   | 0.84    | 2.67·10^-2 | 1.72·10^-8
Youtube    | 5  | 11.74%   | 44.28%   | 72.41%   | 0.49    | 1.31·10^-3 | 2.64·10^-10
Youtube    | 10 | 13.18%   | 59.40%   | 84.91%   | 0.75    | 1.87·10^-3 | 3.78·10^-10
Youtube    | 20 | 27.92%   | 82.29%   | 96.17%   | 0.89    | 2.83·10^-3 | 5.72·10^-10

The results are reported in Table 7.2. Even adopting a low tolerance, such as τ = 0.01 or τ = 0.05, the values of κ-path edge centrality are highly overlapping, and the overlap improves with the length of the adopted κ-paths. By increasing the tolerance and/or the length of the κ-paths, an almost full overlap becomes evident. The same considerations hold true for the Pearson correlation coefficient, which identifies strong correlations among all the different distributions. Finally, as for the Euclidean distance, the returned values are always small: in every case the distance is on the order of 10^-3–10^-2 and the average distance is on the order of 10^-10–10^-7.

7.4.2 Performance

All the experiments have been carried out on a standard personal computer equipped with an Intel i5 processor and 4 GB of RAM. The implementation of the WERW-Kpath algorithm adopted in the following experiments, developed in Java 1.6, has been publicly released (http://www.emilio.ferrara.name/werw-kpath/) and its adoption is strongly encouraged. As shown in Figure 7.3, the execution time of WERW-Kpath scales very well (i.e., almost linearly) with the chosen length of the κ-paths and with the number of edges of the given network.
This means that this approach is feasible also for the analysis of large networks, making it possible to compute an efficient centrality measure for edges in all those cases in which it would be very difficult or even unfeasible, for the computational cost, to calculate the exact edge-betweenness. The importance of this aspect is evident if we consider that there exist several Social Network 1 http://www.emilio.ferrara.name/werw-kpath/ 132 7.4 Experimentation Analysis tools, that implement different algorithms to compute centrality indices on network nodes/edges. Our novel measure could be integrated in such tools (e.g., NodeXL1 , Pajek2 , NWB3 , and so on), in order to allow social network analysts, to manage (possibly, even larger) social networks in order to study the centrality of edges. Figure 7.3: Execution time with respect to network size. 7.4.3 Analysis of Edge Centrality Distributions In this Section we study the distribution of edge centrality values computed by the WERWKpath algorithm. In detail, we present the results of two experiments. In the first experiment we ran our algorithm four times. In addition, we varied the value of κ = 5, 10, 20. We averaged the κ-path centrality values at each iteration and we plotted the edge centrality distribution; on the horizontal axis we reported the identifier of each edge. The results are reported in Figures 7.4–7.9 by exploiting a logarithmic scale. Each figure has the following interpretation: on the x-axis it represents each edge of the given network, on the y-axis its corresponding value of κ-path edge centrality. The usage of a logarithmic scale highlights a power law distribution for the centrality values. In fact, when the behavior in a log-log scale resembles a straight line, the distribution could be well approximated by using a power law function f (x) ∝ x−α . As a result, for the all considered datasets, there are few edges with high centrality values whereas a large fraction of edges presents low (or very low) centrality values. Such a result can be explained by recalling that, at the beginning, our algorithm considers all the edges on an equal foot and provides them with an initial score which is the same for all the edges. However, during the algorithm execution, it happens that few edges (which are actually the most central edges in a social network) are frequently selected and, therefore, their centrality index is frequently updated. By contrast, many edges are seldom selected and, therefore, their centrality index is rarely increased. This process yields a power law distribution in edge centrality values. 1 http://nodexl.codeplex.com/ 2 http://pajek.imfm.si/doku.php?id=pajek 3 http://nwb.cns.iu.edu/ 133 7. A NOVEL CENTRALITY MEASURE FOR SOCIAL NETWORKS In the second experiment, we studied how the value of κ impacted on edge centrality. In detail, we considered the datasets separately and repeated the experiments described above. Also for this experiment we considered three different values for κ, namely κ = 5, 10, 20. The corresponding results are plotted in Figures 7.10–7.15, where the probability P of finding an edge in the network which has the given value of centrality is plotted as a function of the κ-path centrality. Each plot adopts a log-log scale. The analysis of each figure highlights three relevant facts: • The probability of finding edges in the network with the lowest κ-path edge centrality values is smaller than finding edges with relatively higher centrality values. 
This means that the most of the edges are exploited for the message propagation by the random walks a number of times greater than zero. • The power law distribution in edge centrality emerges even more for different values of κ and in presence of different datasets. In other words, if we use different values of κ the centrality indexes may change (see below); however, as emerges from Figures 7.4–7.9, for each considered dataset, the curves representing κ path centrality values are straight and parallel lines with the exception of the latest part. This implies that, for a fixed value of κ, say κ = 5, an edge e will have a particular centrality score. If κ passes from 5 to 10 and, then, from 10 to 20, the centrality of e will be increased by a constant factor. This implies that the ordering of the edges remains unchanged and, therefore, the edge having the highest centrality at κ = 5 will continue to be the most central edges also when κ = 10 and κ = 20. This highlights a nice feature of WERW-Kpath: potential uncertainties on the tuning of the parameter κ do not have a devastating impact on the process of identifying the highest ranked edges. • The higher κ, the higher the value of centrality indexes. This has an intuitive explanation. If κ increases, our algorithm manages longer paths to compute centrality values. Therefore, the chance that an edge is selected multiple times increases too. Each time an edge is selected, our algorithm awards it by a bonus score (equal to β). As a consequence, the larger κ, the higher the number of times an edge with high centrality will be selected, and ultimately, the higher its final centrality index. Such a consideration provides a practical criterion for tuning κ. In fact, if we select high values of κ, we are able to better discriminate edges with high centrality from edges with low centrality. By contrast, in presence of low values of κ, edge centrality indexes tend to edge flatten in a small interval and it is harder to distinguish high centrality edges from low centrality ones. On the one hand, therefore, it would be fine to fix κ as high as possible. On the other, since the complexity of our algorithm is O(κm), large values of κ negatively impact on the performance of our algorithm. A good trade-off (explained by the experiments showed in this Section) is to fix κ = 20. 7.5 Applications of our approach In this Section we detail some possible applications of our approach to rank edges in social networks in the area of Knowledge-Based systems (hereafter, KBS). 134 7.5 Applications of our approach Figure 7.4: κ-paths centrality values distribution on Wiki-Vote. Figure 7.5: κ-paths centrality values distribution on CA-HepPh. Figure 7.6: κ-paths centrality values distribution on CA-CondMat. 135 7. A NOVEL CENTRALITY MEASURE FOR SOCIAL NETWORKS Figure 7.7: κ-paths centrality values distribution on Cit-HepTh. Figure 7.8: κ-paths centrality values distribution on Facebook. Figure 7.9: κ-paths centrality values distribution on Youtube. 136 7.5 Applications of our approach Figure 7.10: Effect of different κ = 5, 10, 20 on Wiki-Vote. Figure 7.11: Effect of different κ = 5, 10, 20 on CA-HepPh. Figure 7.12: Effect of different κ = 5, 10, 20 on CA-CondMat. 137 7. A NOVEL CENTRALITY MEASURE FOR SOCIAL NETWORKS Figure 7.13: Effect of different κ = 5, 10, 20 on Cit-HepTh. Figure 7.14: Effect of different κ = 5, 10, 20 on Facebook. Figure 7.15: Effect of different κ = 5, 10, 20 on Youtube. 
138 7.5 Applications of our approach In detail, we shall focus on three possible applications. The first is data clustering and we will show how our approach can be employed in conjunction with a clustering algorithm with the aim of better organizing data available in a KBS. The second is related to the Semantic Web and we will show how our approach can be used to assess the strength of the semantic association between two objects and how this feature is useful to improve the task of discovering new knowledge in a KBS. The third, finally, is related to better understand the relationship and the roles of user in virtual communities; in this case we show that our approach is useful to elucidate relationships like trust ones. 7.5.1 Data Clustering A central theme in KBS-related research is the design and implementation of effective data clustering algorithms (259). In fact, if a KBS has to manage massive datasets (potentially split across multiple data sources), clustering algorithms can be used to organize available data at different levels of abstraction. The end user (both a human user or a software program) can focus only on the portion of data which are the most relevant to her/him rather than exploring the whole data space managed by a KBS (54, 193, 259). If we ideally assume that any data managed by a KBS is mapped onto a point of a multidimensional space, the task of clustering available data requires to compute the mutual distance existing between any pair of data points. Such a task, however, in many cases is unfeasible. In fact, the computation of the distance can be prohibitively time-consuming if the number of data points is very large. In addition, KBS often manage data which are related each other but, for these kind of data, the computation of a distance could make no-sense: think, for instance, of data on health status of a person and her/his demographic data like age or gender. Therefore, many authors suggested to represent data as graphs such that each node represents a data point and each edge specifies the type of relationships binding two nodes. The problem of clustering graphs has been extensively studied in the past and several algorithms have been proposed. In particular, the graph clustering problem in the social network literature is also known as community detection problem (109). One of the early algorithms to find communities in graphs/networks was proposed by Girvan and Newman. Unfortunately, due to its high computational complexity, the Girvan-Newman algorithm can not be applied on very large and complex data repositories consisting of million of information objects. Our algorithm, instead, can be employed to rank edges in networks and to find communities. This is an ongoing research effort and the first results are quite encouraging (79). Once a community finding algorithm is available we can design complex applications to effectively manage data in a KBS. For instance, in (280) the authors focused on Online Social Networks like Internet newsgroups and chat rooms. They analyzed through semantic tools the text comments posted by users and this allowed large Online Social Networks to be mapped onto weighted graphs. The authors showed that the discovery of the latent communities is a useful way to better understand patterns of interactions among users and how opinions spread in the network. We then describe two use cases possibly benefiting from community detection algorithms. 
In the first case, consider a social network in which users fill a profile specifying their interests. A graph can be constructed which records users (mapped onto nodes) and relationship among 139 7. A NOVEL CENTRALITY MEASURE FOR SOCIAL NETWORKS them (e.g., an edge between two nodes may indicate that two users share at least one interest). Our algorithm, therefore, could identify group of users showing the same interests. Therefore, given an arbitrary message (for instance a commercial advertisement) we could identify groups of users interested to it and we could selectively send the message only to interested groups. As an opposite application, we can consider the objects generated within a social media platform. These objects could be for instance photos in a platform like Flickr or musical tracks in a platform like Last.fm. We can map the space of user generated contents onto a graph and apply on it our community detection algorithm. In this way we could design advanced query tools: in fact, once a user issues a query, a KBS may retrieve not only the objects exactly labeled by the keywords composing user queries but also objects falling in the same community of the retrieved objects. In this way, users can retrieve objects of their interest even if they are not aware about their existence. 7.5.2 Semantic Web A further research scenario that can take advantage from our research work is represented by the Semantic Web. In detail, Semantic Web tools like RDF allow complex and real-life scenarios to be modeled by means of networks. In many cases these networks are called multi-relational networks (or semantic networks) because they consist of heterogeneous objects and many type of relationships can exist among them (244). For instance, an RDF knowledge base in the e-learning domain (180) could consist of students, instructors and learning materials in a University. In this case, the RDF knowledge base could be converted to a semantic network in which nodes are the players described above. Of course, an edge may link two students (for instance, if they are friends or if they are enrolled in the same BsC programme), a student and a learning object (if a student is interested in that learning object), an instructor and a learning material (if the instructor authored that learning material) and so on (82). A relevant theme in Semantic Web is to assess the weight of the relationships binding two objects because this is beneficial to discover new knowledge. For instance, in the case of the e-learning example described above, if a student has downloaded multiple learning objects on the same topic, the weight of an edge linking the student and a learning material would reflect the relevance of that learning material to the student. Therefore, learning materials can be ranked on the basis of their relevance to the user and only the most relevant learning materials can be suggested to the user. An approach like ours, therefore, could have a relevant impact in this application scenario because we could find interesting associations among items by automatically computing the weight of the ties connecting them. To the best of our knowledge there are few works on the computation of node centrality in semantic networks (244) but, recently some authors suggested to extend parameters introduced in Social Network Analysis like the concept of shortest path to multi-relational networks (245). Therefore, we plan to extend our approach to the context of semantic networks. 
7.5.3 Understanding User Relationships in Virtual Communities

A central theme in KBS research is represented by the extraction of patterns of interaction among humans in a virtual community and their analysis with the goal of understanding how humans influence each other. A relevant problem is represented by the classification of the relationships among humans on the basis of their intensity. For instance, in (88) the authors focus on the criminal justice domain and on the identification of social ties playing a crucial role in the transmission of sensitive information. In (279), the author provides a belief propagation algorithm which exploits social ties among members of a criminal social network to identify criminals. Our approach resembles that of (88) because both are able to associate each edge in a network with a score indicating the strength of the association between the nodes linked by that edge.

A special case occurs when we assume that the edge connecting two nodes specifies a trust relationship (83, 163). In (83), the authors suggested propagating trust values along paths in the social network graph. In an analogous fashion, the approach of (163) uses paths in the social network graph to propagate trust values and infer trust relationships between pairs of unknown users. Finally, Reinforcement Learning techniques are applied to estimate to what extent an inferred trust relationship should be considered credible. Our approach is similar to those presented above because both rely on a diffusion model. In (83, 163), the main assumption is that trust satisfies the transitive property, i.e., if a user x trusts a user y who, in her/his turn, trusts a user z, then we can assume that x trusts z too. In our approach, we exploit connections among nodes to propagate messages by using random walks of bounded length. There are, however, some relevant differences: in the approaches devoted to computing trust, all the paths of any arbitrary length are, in principle, useful to compute trust values, even if the contribution brought in by long paths is considered less relevant than that of short paths. Vice versa, in our approach, the length of a path is bounded by a fixed constant κ.

7.6 Fast Community Structure Detection

In the following, we present a novel algorithm to calculate the community structure of a network. It is baptized Fast κ-path Community Detection (or, shortly, FKCD). The strategy relies on three steps: i) ranking edges by using the WERW-Kpath algorithm; ii) calculating the proximity (the inverse of the distance) between each pair of connected nodes; iii) partitioning the network into communities so as to optimize the network modularity, according to the Louvain method (32). The algorithm is discussed as follows.

7.6.1 Background

The strategy exploited in the following adopts the paradigm of the maximization of the network modularity. It can be explained as follows: let us consider a network, represented by means of a graph G = (V, E), partitioned into m communities; assuming that l_s is the number of edges between nodes belonging to the s-th community and d_s is the sum of the degrees of the nodes in the s-th community,
we recall the definition of the network modularity

Q = \sum_{s=1}^{m} \left[ \frac{l_s}{|E|} - \left( \frac{d_s}{2|E|} \right)^2 \right]    (7.16)

Intuitively, high values of Q imply high values of l_s for each discovered community; thus, detected communities are dense within their structure and weakly coupled with each other. Equation 7.16 reveals a possible maximization strategy: in order to increase the value of the first term (namely, the coverage), the highest possible number of edges should fall within each given community, whereas the minimization of the second term is obtained by dividing the network into several communities with small total degrees. The problem of maximizing the network modularity has been proved to be NP-complete (47).

The state-of-the-art approximate technique is called the Louvain method (LM) (32). This strategy is based on local information and is well-suited for analyzing large weighted networks. It is based on two simple steps: i) each node is assigned to a community chosen in order to maximize the network modularity Q; the gain derived from moving a node i into a community C can simply be calculated as (32)

\Delta Q = \left[ \frac{\Sigma_C + k_{i,C}}{2m} - \left( \frac{\Sigma_{\hat{C}} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_C}{2m} - \left( \frac{\Sigma_{\hat{C}}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right]    (7.17)

where \Sigma_C is the sum of the weights of the edges inside C, \Sigma_{\hat{C}} is the sum of the weights of the edges incident to nodes in C, k_i is the sum of the weights of the edges incident to node i, k_{i,C} is the sum of the weights of the edges from i to nodes in C, and m is the sum of the weights of all the edges in the network; ii) the second step simply builds a new network whose nodes are the communities previously found. The process then iterates as long as a significant improvement of the network modularity is obtained.

In the following we present an efficient community detection algorithm which represents a generalization of the LM. In fact, it can be applied even to unweighted networks and, most importantly, it exploits both global and local information. To make this possible, our strategy computes the pairwise distance between nodes of the network. To do so, edges are weighted by using a global feature which represents their aptitude to propagate information through the network. The edge weighting is based on the κ-path edge centrality. Thus, the partition of the network is obtained by improving the LM. Details of our strategy are explained in the following.
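To make Equation 7.16 concrete, the following minimal Python sketch (an unweighted toy example under our own assumptions; networkx and its built-in routines are used only for illustration and as a cross-check, not as part of the method proposed here) computes Q for a given partition directly from the quantities l_s and d_s.

    import networkx as nx
    from networkx.algorithms import community

    def modularity_eq_7_16(G, communities):
        E = G.number_of_edges()
        Q = 0.0
        for nodes in communities:
            sub = G.subgraph(nodes)
            l_s = sub.number_of_edges()                  # edges inside the s-th community
            d_s = sum(d for _, d in G.degree(nodes))     # total degree of its nodes
            Q += l_s / E - (d_s / (2 * E)) ** 2
        return Q

    G = nx.karate_club_graph()
    partition = community.greedy_modularity_communities(G)
    print(modularity_eq_7_16(G, partition))              # value from Equation 7.16
    print(community.modularity(G, partition))            # networkx cross-check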
7.6.2 Design Goals

In this Section we briefly and informally discuss the ideas behind our strategy. First of all, we explain the principal motivations that make our approach suitable, in particular but not only, for the analysis of the community structure of social networks. To this purpose, we introduce a real-life example from which we infer some features of our approach.

Let us consider a social network in which users are connected to one another by friendship relations. In this context, we can assume that one of the principal activities is exchanging information. Thus, let us assume that a "message" (which could be, for example, a wall post on Facebook or a tweet on Twitter) represents the simplest "piece" of information and that users of this network can exchange messages by means of their connections. This means that a user can directly send and receive information only to/from the people in her neighborhood. In fact, this assumption will be fundamental (see further) in order to define the concepts of community and community structure.

Intuitively, a community is defined as a group of individuals in which the interconnections are denser than outside the group (in fact, this maximizes the benefit function Q). The aim of our community detection algorithm is to identify the partitioning of the network into communities such that the network modularity is optimal. To do so, our strategy is to rank the links of the network on the basis of their aptitude to favor the diffusion of information. In detail, the higher the aptitude of an edge to propagate a message, the higher its centrality in the network. This is important because, as already proved by (124, 217), the higher the centrality of an edge, the higher the probability that it connects different communities. Our algorithm adopts different optimizations in order to efficiently compute the link ranking. Once we have defined an optimized strategy for ranking links, we can compute the pairwise distances between nodes and, finally, the partitioning of the network, according to the LM. The goodness of the partitioning into communities is evaluated by adopting the measure of the network modularity Q. In the next sections we discuss how our algorithm is able to incorporate these requirements.

In Section 7.3.2, we provided a definition of centrality of edges in social networks based on the propagation of messages by using simple random walks of length at most κ (called, hereafter, κ-path edge centrality). Then, we provided a description of an efficient algorithm to approximate it, running in O(κ|E|), where |E| is the number of edges in the network. In the following, we discuss our novel community detection algorithm.

7.6.3 Fast κ-path Community Detection

First of all, our Fast κ-path Community Detection (henceforth, FKCD) needs a ranking criterion to compute the aptitude of all the edges to propagate information through the network. To do so, FKCD invokes the WERW-Kpath algorithm, previously described. Once all the edges have been labeled with their κ-path edge centrality, a ranking in decreasing order of centrality can be obtained. This is not fundamental, but could be useful in some applications. Similarly, before proceeding, a first estimate of the network modularity (hereafter, Q) could be calculated. This helps to put into evidence how Q increases during the next steps. With respect to Q, we recall that the higher Q, the more evident the community structure of the network. The computational cost of this first step is O(κ|E|), where κ is the length of the κ-paths and |E| is the cardinality of E.

The second step consists of calculating the proximity between each pair of connected nodes. This is done by using an L2 (i.e., Euclidean) distance calculated as

r_{ij} = \sqrt{ \sum_{k=1}^{n} \frac{ \left( L^{\kappa}(e_{ik}) - L^{\kappa}(e_{kj}) \right)^2 }{ d(k) } }    (7.18)

where L^{\kappa}(e_{ik}) (resp., L^{\kappa}(e_{kj})) is the κ-path edge centrality of the edge e_{ik} (resp., e_{kj}) and d(k) is the degree of node k. We put into evidence that, even though the L2 measure would return a distance, in our case the higher L^{\kappa}(e_{ik}) (resp., L^{\kappa}(e_{kj})), the nearer, rather than the more distant, the nodes are. This important aspect leads us to consider the results of Equation 7.18 as the pairwise proximities of nodes. This step is theoretically computationally expensive, because it would require O(|V|^2) iterations, but in practice, by adopting optimization techniques, its near linear cost is O(d(v)|V|), where d(v) is the mean degree of the nodes of the network (and it is usually small in social networks).
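The following minimal Python sketch illustrates Equation 7.18 under our own reading of it: the sum is restricted to the nodes k adjacent to i or j (absent edges contribute a zero centrality), the centralities are stored in a hypothetical edge attribute named "kpath", and the toy values stand in for the output of WERW-Kpath.

    import math
    import networkx as nx

    def proximity(G, i, j, attr="kpath"):
        # Equation 7.18, summing only over nodes adjacent to i or j.
        total = 0.0
        for k in set(G[i]) | set(G[j]):
            l_ik = G[i][k][attr] if G.has_edge(i, k) else 0.0
            l_kj = G[k][j][attr] if G.has_edge(k, j) else 0.0
            total += (l_ik - l_kj) ** 2 / G.degree(k)
        return math.sqrt(total)

    # Toy graph with hand-assigned centralities (normally produced by WERW-Kpath).
    G = nx.Graph()
    G.add_weighted_edges_from(
        [("a", "b", 0.4), ("b", "c", 0.3), ("a", "c", 0.2), ("c", "d", 0.1)],
        weight="kpath",
    )
    print(proximity(G, "a", "c"))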
The last step is the network partitioning. The main idea is inspired by the LM (32) for detecting the community structure of weighted networks in near linear time. The partitioning is an iterative process. At each iteration, two simple steps occur: i) each node is assigned to a community chosen in order to maximize the network modularity Q; the possible increase of Q derived from moving a node i into a community C is calculated according to Equation 7.17; ii) the second step produces a meta-network whose nodes are the communities previously found. The partitioning ends when no further improvement of Q can be obtained. This results in splitting communities connected by edges with high proximity, which is a global feature, thus maximizing the network modularity. Its cost is O(γ|V|), where |V| is the cardinality of V and γ is the number of iterations required by the algorithm to converge (in our experience, usually, γ < 5).

The FKCD is schematized in Algorithm 5. We recall that CalculateDistance computes the pairwise node distances by using Equation 7.18, Partition extracts the communities according to the LM described above and NetworkModularity calculates the value of the network modularity by using Equation 7.16. The computational cost of our strategy is near linear. In fact, O(κ|E| + d(v)|V| + γ|V|) = O(Γ|E|), by adopting an efficient graph memorization in order to minimize the execution time for the computation of Equations 7.16 and 7.18.

Algorithm 5 FKCD(Graph G = (V, E), int κ)
1: WERW-Kpath(G, κ)
2: CalculateDistance(G)
3: while Q increases at least by ε (arbitrarily small) do
4:   P = Partition(G)
5:   Q ← NetworkModularity(P)
6: end while

7.7 Experimental Results

Our experimentation has been conducted on both synthetic and real-world Online Social Networks, whose datasets are available online. All the experiments have been carried out by using a standard Personal Computer equipped with an Intel i5 Processor and 4 GB of RAM.

7.7.1 Synthetic Networks

The method proposed to evaluate the quality of the community structure detected by using FKCD exploits the technique presented by Lancichinetti et al. (177). We generated the same synthetic networks reported in (177), adopting the following configuration: i) N = 1000 nodes; ii) the four pairs of networks identified by (γ, β) = (2, 1), (2, 2), (3, 1), (3, 2), where γ represents the exponent of the power law distribution of node degrees and β the exponent of the power law distribution of the community sizes; iii) for each pair of exponents, three values of average degree ⟨k⟩ = 15, 20, 25; iv) for each of the combinations above, we generated six networks by varying the mixing parameter µ = 0.1, 0.2, . . . , 0.6 (note: the threshold value µ = 0.5 is the border beyond which communities are no longer defined in the strong sense, i.e., such that each node has more neighbors in its own community than in the others (237)).

Figure 7.16 highlights the quality of the obtained results. The measure adopted is the normalized mutual information (75). The values obtained put into evidence that our strategy yields fairly good results, avoiding the well-known effect due to the resolution limit of modularity optimization (110). Moreover, a classification of results as in Table 7.3 (discussed later) is omitted because the values of Q obtained by using FKCD and the LM in the case of these quite small synthetic networks are very similar.

Figure 7.16: Normalized mutual information test using the synthetic benchmarks.
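The evaluation protocol just described can be sketched in Python as follows. The sketch is illustrative only: networkx's LFR generator and a greedy modularity heuristic stand in for the original benchmark suite and for FKCD, and the generator parameters are the small, known-to-converge values from the library documentation rather than the exact configuration of (177).

    import networkx as nx
    from networkx.algorithms import community
    from sklearn.metrics import normalized_mutual_info_score

    # One LFR benchmark graph (documentation-sized parameters, for illustration).
    G = nx.LFR_benchmark_graph(
        n=250, tau1=3, tau2=1.5, mu=0.1,
        average_degree=5, min_community=20, seed=10,
    )
    G.remove_edges_from(nx.selfloop_edges(G))

    # The planted (ground-truth) community of each node is stored by the
    # generator in the node attribute "community"; use its smallest member as a label.
    truth = {v: min(G.nodes[v]["community"]) for v in G}

    # Stand-in detection method (FKCD is not reimplemented here).
    detected = community.greedy_modularity_communities(G)
    pred = {v: cid for cid, nodes in enumerate(detected) for v in nodes}

    nodes = sorted(G)
    nmi = normalized_mutual_info_score([truth[v] for v in nodes],
                                       [pred[v] for v in nodes])
    print("NMI =", round(nmi, 3))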
7.7.2 Online Social Networks

Results obtained by analyzing several Online Social Networks datasets (184, 270) are summarized in Table 7.3. This experimentation has been carried out to qualitatively analyze the performance of our strategy. The obtained results, measured by means of the network modularity calculated by our algorithm (FKCD), are compared against those obtained by using the original LM. Our analysis puts into evidence the following observations: i) classic, non-optimized algorithms (for example Girvan-Newman) are unfeasible for large network analysis; ii) the results obtained by using the LM are slightly higher than those obtained by using FKCD; on the other hand, the LM adopts only local information in order to optimize the network modularity, while our strategy exploits both local and global information; this results in identified community structures that are (possibly) more convenient for some applications; iii) the performance of FKCD slightly increases when using longer κ-paths; iv) both of the compared efficient strategies are feasible even when analyzing large networks with standard computational resources (i.e., an ordinary personal computer).

Network       No. nodes   No. edges    No. comm.   FKCD (κ=5)   FKCD (κ=20)   LM
CA-GrQc       5,242       28,980       883         0.734        0.786         0.816
CA-HepTh      9,877       51,971       1,501       0.585        0.648         0.768
CA-HepPh      12,008      237,010      1,243       0.565        0.598         0.659
CA-AstroPh    18,772      396,160      1,552       0.486        0.568         0.628
CA-CondMat    23,133      186,932      2,819       0.546        0.599         0.731
Facebook      63,731      1,545,684    6,484       0.414        0.444         0.634

Table 7.3: Results of the FKCD algorithm on the adopted datasets.
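The kind of LM baseline reported in Table 7.3 can be approximately reproduced with the following minimal Python sketch. It assumes that a SNAP edge-list file such as "CA-GrQc.txt" has been downloaded locally (the file name is a placeholder) and that a recent networkx (>= 2.8), which ships a Louvain implementation, is available; the resulting modularity will not match Table 7.3 exactly, since implementations and preprocessing differ.

    import networkx as nx
    from networkx.algorithms import community

    # Load a SNAP collaboration network from a local edge list (placeholder path).
    G = nx.read_edgelist("CA-GrQc.txt", comments="#")
    G.remove_edges_from(nx.selfloop_edges(G))

    partition = community.louvain_communities(G, seed=1)
    Q = community.modularity(G, partition)

    print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
    print(len(partition), "communities, Q =", round(Q, 3))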
7.7.3 Extension to Biological Networks

The community detection method presented has also been extended to different fields of application. The algorithm we devised, in fact, can be considered a powerful technique to obtain a meaningful clustering of any given network. The original limitation of the Louvain method has been overcome by extending its applicability to both weighted and unweighted networks, and its use on directed networks is straightforward. In this Section we introduce an example application of our method to a slightly different field, namely the application of the network analysis approach to a bio-informatics problem: the analysis of gene co-expression networks.

A gene co-expression network can be informally defined as a network representing the interactions among genes within the cell. More in detail, a gene co-expression network represents the behavior of a set of genes that cooperate, in response to a given event (e.g., a stress condition), to perform a given task. The data analyzed in this kind of task come from micro-array experiments, in which the expression of each gene (i.e., the amount of gene product in the cell) at a given time point is sampled, during a control phase and a stress phase, whose lengths are usually the same. In the case of our experimentation, we consider the response of the model organism Arabidopsis thaliana to a stress condition simulating drought. A set of 1,217 genes (among the more than 29 thousand characterizing this plant) has been monitored over 28 time points, 14 of them under the control condition and 14 under the drought stress condition. For each gene, the expression at each time point is sampled, obtaining a matrix of values E of size G × T, in which G is the number of genes (i.e., 1,217) and T is the number of time points considered (i.e., 28).

The matrix E is exploited to calculate the pairwise correlation between each pair of genes, in our case by adopting the Pearson correlation. The values obtained range in the interval [−1, 1], although it is common to consider their absolute value, taking into account the fact that two genes are highly correlated even when the correlation is inverse (i.e., when, in response to a given stress, the expression of one gene grows in the same proportion in which the expression of another gene decreases). At the end of this process we obtain a matrix C of size G × G representing the correlation of each pair of genes. The network obtained by using the matrix C as a weighted adjacency matrix is a fully connected network containing n(n−1)/2 edges, which is usually a large network. In order to reduce the size of the network, it is possible to apply a thresholding operation that discards correlations between genes below a certain value (for example, t = 0.7). This yields a network in which each edge represents a strong weighted correlation between genes during the response to the given event. In our case, the gene co-expression network obtained for Arabidopsis thaliana in response to drought stress is represented by a graph G = (V, E) with |V| = 1,217 genes and |E| = 278,374 edges representing the strong correlations existing among the genes.

To cluster this network, we adopted our method and compared it against two other state-of-the-art techniques, i.e., the already discussed Louvain method and the OSLOM algorithm (178), an overlapping community detection algorithm recently developed by Lancichinetti, Radicchi and Ramasco. The results obtained by using our method largely outperform those of these two renowned methods, both in terms of the quality of the discovered clusters and in the number of clusters obtained. In Figures 7.17–7.19 we depict the three largest clusters of genes discovered in the analyzed network. In detail, each plot represents the expression profiles of the genes, i.e., the graphical representation of the amount of expression of each gene at each time point. First of all, it is evident that all the genes classified in these clusters present a behavior that is very similar to each other and to the average of the gene expression values. More importantly, by using techniques such as over-representation analysis, biologists verified that each of the discovered clusters meaningfully represents processes activated by the plant in response to the administered drought stress.

Not only is the quality of the clusters produced by our method better than that of the clusters produced by the other methods, but the number of recognized clusters is also more appropriate. In fact, our technique recognized 24 clusters, 11 of which are above the size (in terms of contained genes) that biologists consider meaningful to define a cluster of gene co-expression (28, 86). By using the classic Louvain method we obtained just 6 clusters, 4 of which are of acceptable size. The OSLOM algorithm identified 11 clusters, 6 of which are representative. Moreover, the over-representation analysis revealed, in particular for the Louvain method, that it was possible to identify several different processes inside each of the 4 clusters defined by this technique, i.e., the results produced were not reliable in terms of biological meaning. This is possibly due to the well-known problem of the resolution limit that may affect the Louvain method.

Figure 7.17: Arabidopsis thaliana gene-coexpression network (cluster 1).
Figure 7.18: Arabidopsis thaliana gene-coexpression network (cluster 2).
Figure 7.19: Arabidopsis thaliana gene-coexpression network (cluster 3).
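The construction of the co-expression network described above can be sketched in Python as follows (illustrative only: a small random matrix stands in for the real 1,217 × 28 expression matrix E, and the threshold t = 0.7 is the example value mentioned in the text).

    import numpy as np
    import networkx as nx

    rng = np.random.default_rng(0)
    expr = rng.normal(size=(50, 28))          # toy stand-in for the G x T matrix E

    C = np.abs(np.corrcoef(expr))             # G x G matrix of |Pearson correlation|
    t = 0.7                                    # correlation threshold

    G = nx.Graph()
    G.add_nodes_from(range(expr.shape[0]))
    rows, cols = np.where(np.triu(C, k=1) >= t)   # keep each strong pair once
    G.add_weighted_edges_from((i, j, C[i, j]) for i, j in zip(rows, cols))

    print(G.number_of_nodes(), "genes,", G.number_of_edges(),
          "strong co-expression edges")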
Conclusion

In this Chapter we introduced a novel edge centrality measure for social networks, called the κ-path edge centrality index. Its computation is feasible even on large scale networks by using the algorithm we provided, which performs multiple random walks on the social network graph that are simple and whose length is bounded by a factor κ. We showed that the worst-case time complexity of our algorithm is O(κm), where m is the number of edges in the social network graph. We discussed experimental results obtained by applying our method to different Online Social Networks. Finally, we showed that our centrality measure can be used to detect communities in large networks.

In the last part we presented a novel strategy that has two advantages. The former is that it exploits both local and global information. The latter is that, by using some optimizations, it efficiently provides good results. In this way, our approach is able to discover the community structure in possibly large networks. Our experimental evaluation, carried out over both synthetic and Online Social Networks datasets, proves the efficiency and the robustness of the proposed strategy. Encouraging results have also been put into evidence when applying our algorithm to the problem of clustering gene co-expression biological networks, compared against the results provided by other state-of-the-art techniques.

8 Conclusions

The final Chapter of this Thesis concludes the dissertation by discussing and summarizing (i) the main findings and the contribution to the state of the art in the disciplines covered by this work, and (ii) as for future work, those directions that our research will undertake and the topics, among those discussed in this Thesis, which deserve further investigation.

8.1 Findings

The main motivation underlying the research work conducted during this Thesis is the increasing popularity of social phenomena such as social media and, in particular, Online Social Networks (OSNs). In detail, in this work we presented a comprehensive analysis of the principal tasks related to research on social networks, namely (i) the extraction (also called Web mining) of data from Online Social Networks, (ii) the analysis of the network structure representing OSNs, and, finally, (iii) the development of efficient algorithms for the computation of Social Network Analysis measures on massive OSNs.
The main findings of this work can be discussed separately, corresponding to the three main parts already presented in the introduction of this work, namely:

(i) A first part, in which we discussed the problems related to mining massive Web sources, such as Online Social Networks. In particular, different techniques have been presented and we focused on the so-called Web wrappers and the problems related to their automatic adaptation, in order to refine the process of automatic extraction of information from Web pages and make it more robust.

(ii) A second part, in which we discussed the analysis of a large sample acquired from Facebook, which is, to date, the largest and most representative OSN and a relevant artifact from a Computer Science perspective. This analysis makes it possible to investigate topological features of the network, such as the well-known small world effect, scale-free distributions and community structure, and, in addition, to verify the validity of different sociological theories such as the six degrees of separation or the strength of weak ties. The assessment of different aspects of the analysis of the community structure of this network is also discussed in detail.

(iii) A final part, in which we contribute the development of a novel, computationally efficient measure of centrality for social networks. The functioning of this measure is rooted in random walk theory and its evaluation is computationally feasible even on a large scale. It is the core of different applications, for example a community detection algorithm whose performance is discussed in different fields of applicability, such as social and biological networks.

In conclusion, we recall the main contributions of this dissertation:

(a) We discussed the current panorama of the field of Web information mining platforms. The procedures considered, namely Web wrappers, represent the core on which our platform of Web mining for Online Social Networks has been built. Among the main findings, a relevant contribution in this field is represented by the solution for automatic wrapper adaptation. It relies on a tree-edit distance-based algorithm which is very extensible and powerful. Its performance has been assessed in different fields of application regarding social media and Web platforms, and this algorithm is able to provide high values of precision and recall, actually making the process of automatic extraction of information more robust.

(b) We presented our platform of Web mining for Online Social Networks, which has been able to extract millions of user profiles and the friendship connections among them. The system, an ad hoc Facebook crawler developed to comply with the strict privacy terms of Facebook, supported us in the task of creating a large scale sample that has been adopted for scientific purposes and publicly released for the scientific community. In detail, once two different sampling algorithms presented in the literature had been implemented, we explored the social graph of friendship relations among Facebook users and we assessed several topological characteristics of the obtained samples. Important graph-theoretical and Social Network Analysis features, such as degree distribution, diameter, centrality metrics, clustering coefficient and so on, have been considered during the analysis of these datasets.

(c) We put into evidence the mathematical models which try to faithfully represent the topological features of Online Social Networks.
In detail, once we had presented models and methods proper to Social Network Analysis, we focused our attention on different Online Social Networks whose datasets were freely available online. We considered three main generative models, i.e., i) Erdős-Rényi random graphs, ii) Watts-Strogatz and iii) Barabási-Albert preferential attachment. These models have been compared against real data in order to assess their reliability. Our results show that each model is only able to correctly describe a few characteristics, but they all fail in faithfully representing all the main features of OSNs we previously identified, namely i) the small world effect, ii) scale-free degree distributions and, finally, iii) the emergence of a community structure.

(d) A large number of original experimental results have been presented regarding the community structure of Facebook. In detail, we carried out a large-scale community structure investigation in order to study the problem of community detection on massive OSNs. Given the size of our datasets, we exploited different computationally efficient algorithms available in the literature, optimized to detect the community structure of large-scale networks, like Facebook, containing millions of nodes and edges. Both from a quantitative and a qualitative perspective, a very strong community structure emerges from our analysis. We have been able to highlight different features of the community structure of Facebook, for example a typical power law distribution of the community sizes. In addition, the communities unveiled by our analysis reveal a very high degree of similarity, regardless of the methodology adopted for the community detection. This aspect underlines the clear emergence of a community structure for massive OSNs, which eases the task of unveiling communities in large networks. Once the presence of the community structure had been assessed, we investigated the mesoscopic features of this characteristic artifact, building a community structure meta-network, that is, a network whose nodes represent communities of the main graph and whose edges represent connections among individuals belonging to the given communities. The most important finding is that most of the features shown by the original social graph are still clearly visible in the community structure meta-network. In detail, we assessed the presence of a power law distribution of the community degrees, a clustering effect, the small world effect that explains the existence of very short paths among all the communities, and a diameter smaller than 5. These results pushed us to verify the validity of a renowned sociological theory, the strength of weak ties, which is well known to be related to the community structure of a network. Our findings provide quantitative support, through several clues that testify to the presence and the importance of weak ties in the network.

(e) Related to the importance of individuals and connections in a given social network, our final contribution in this work is the definition of a novel measure of edge centrality particularly well suited for social networks. It has been baptized the κ-path edge centrality index. The computation of this measure is feasible even on large scale networks by means of an algorithm we provided, whose rationale is based on random walk theory. It carries out multiple random walks on the network, which are simple and whose length is bounded by a factor κ.
The worst-case time complexity of our algorithm is O(κm), where m is the number of edges in the social network graph. The validity of this algorithm has been tested on different massive OSNs, since the algorithm provides an approximation of the measure we defined. Finally, we provided a real-world application of our centrality measure, devising a novel community detection method able to work with large networks. Its advantages with respect to other solutions are several: first of all, the algorithm exploits both local and global information. Its evaluation has been carried out over both synthetic and real-world networks, proving the efficiency and the robustness of the proposed strategy. In addition, by using some optimizations, it efficiently provides good results also in different contexts, such as biological applications. In fact, relevant results have been obtained by applying this method to the problem of clustering gene co-expression networks.

8.2 Future Work

In this Thesis we introduced a number of concepts and research lines which deserve further study, and this final Section is devoted to discussing some of the most relevant future works closely related to this dissertation.

A first promising research line is related to the simulation of the diffusion of information on Online Social Networks or, more generally, on social networks built by combining knowledge from different networks (for example, the social network representing the friendship relationships among Facebook users and the geographical network representing the physical locations of the given set of users). In particular, one goal could be to verify the most efficient way to choose a small subset of nodes of the network from which to spread the information, so as to maximize a certain function, for example the coverage (1). The process of diffusion of information over the network could be modeled, simulated and studied by exploiting different paradigms presented in the literature (for example, the Independent Cascade Model (131, 132, 159)). Closely related to this topic, another relevant problem is the ability to maximize the spread of influence on such kinds of networks (66, 159). This problem is particularly important since it has immediate economic applications, for example related to the ability to advertise new products in a more efficient way (89, 179, 190) – for example by means of the well-known phenomenon of word-of-mouth (53).

(1) For the sake of simplicity, the coverage can be considered as the number of nodes in the network which are informed about the given information at a certain time.

Still on the analysis of information propagation on Online Social Networks, another trending topic is leveraging the behavior of users on a large scale to study and model the spread of information through the network. This is applied for different purposes, from marketing trends to the analysis of political consensus diffusion. In the former category we include the sentiment analysis of the public mood related to socio-economic phenomena (36, 37) and the study of the diffusion of viral marketing campaigns (39, 65). In the latter, the identification of deceptive attempts to diffuse defamatory or misleading political information, an illicit practice usually called astroturfing (241, 242).

Considering the research line related to the novel centrality measure we defined and to community detection algorithms, as future work we have planned a long-term research evaluation of our method, in order to cover different domains of application and to face several scientific challenges.
For example, in the context of bio-informatics – in which the proposed algorithm has already proved to work well in some contexts – our method will be applied to the study of human disease networks (16, 127). To this purpose, it could be applied to unveil the modular structure of disease networks (225), for example to understand how disease gene modules are preserved during the evolution of organisms (98). Moreover, we are planning to extend this method in a number of ways. First, the novel centrality measure we defined in this work can be used to detect overlapping communities in large social networks. Such a task is currently unfeasible on a large scale, even when adopting the state-of-the-art overlapping community detection algorithms presented in the literature, such as COPRA (138), OSLOM (175) or SLPA (282, 283). In fact, to the best of our knowledge, efficient algorithms do not currently exist that estimate the community structure of a large network based on global topological information, and our strategy could fit this purpose well. In addition, based on this measure, we plan to design an algorithm to estimate the strength of ties between two social network actors: for instance, in social networks like Facebook this is equivalent to estimating the friendship degree between each pair of users. Finally, we point out that some researchers have studied how to design parallel algorithms to compute centrality measures. For instance, (191) proposed a fast and parallel algorithm to compute betweenness centrality. We believe that a new, interesting research opportunity could be to design a parallel algorithm to compute the κ-path edge centrality.

8.3 List of Publications

In the following, a list of publications of the author related to this dissertation: E. Ferrara. Community structure discovery in Facebook. International Journal of Social Network Mining. 1(1):67–90, (2012). E. Ferrara and G. Fiumara. Topological features of Online Social Networks. Communications on Applied and Industrial Mathematics. 2(2):1–20, (2011). P. De Meo, E. Ferrara, and G. Fiumara. A novel measure of edge centrality in social networks. Knowledge-based Systems DOI: 10.1016/j.knosys.2012.01.007, (2012). S. Catanese, E. Ferrara, and G. Fiumara. Forensic analysis of phone call networks. Social Network Analysis and Mining. (Accepted) S. Catanese, P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti. Extraction and analysis of Facebook friendship relations. In Computational Social Networks: Mining and Visualization. Springer Verlag, (In press). P. De Meo, E. Ferrara, and G. Fiumara. Finding similar users in Facebook. In Social Networking and Community Behavior Modeling: Qualitative and Quantitative Measurement, pages 304–323. IGI Global, 2011. E. Ferrara and R. Baumgartner. Automatic wrapper adaptation by tree edit distance matching. In Combinations of Intelligent Methods and Applications, pages 41–54. Springer Verlag, 2011. E. Ferrara and R. Baumgartner. Intelligent self-repairable web wrappers. In Lecture Notes in Computer Science, volume 6934, pages 274–285. Springer Verlag, 2011. P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti. Generalized Louvain method for community detection in large networks. In ISDA '11: Proceedings of the 11th International Conference on Intelligent Systems Design and Applications, pages 88–93, 2011. S.
Catanese, P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti. Crawling Facebook for social network analysis purposes. In WIMS ’11: Proceedings of the International Conference on Web Intelligence, Mining and Semantics, pages 52:1–52:8. ACM, 2011. E. Ferrara and R. Baumgartner. Design of automatically adaptable web wrappers. In ICAART ’11: Proceedings of the 3rd International Conference on Agents and Artificial Intelligence, pages 211–217, 2011. S. Catanese, P. De Meo, E. Ferrara, and G. Fiumara. Analyzing the Facebook friendship graph. In MIFI ’10: Proceedings of the 1st International Workshop on Mining the Future Internet, volume 685, pages 14–19, 2010. P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti. Strength of weak ties in Online Social Networks. Physical Review E (Under review) E. Ferrara, G. Fiumara, and R. Baumgartner. Web data extraction, applications and techniques: a survey. ACM Computing Surveys (Under review). 155 8. CONCLUSIONS P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti. Enhancing community detection using a network weighting method. Information Sciences (Under review). P. De Meo, E. Ferrara, F. Abel, L. Aroyo, and G. J. Houben. Analyzing user behavior across Social Web environments. ACM Transactions on Intelligent Systems and Technology (Under review). E. Ferrara. A large-scale community structure analysis in Facebook. In PLoS ONE (Under review). E. Ferrara. CONCLUDE: complex network cluster detection for social and biological applications. In WWW ’12 PhD Symposium (Under review). The following references are related to publications of the author regarding with subjects not included in this dissertation: G. Quattrone, L. Capra, P. De Meo, E. Ferrara, and D. Ursino. Effective retrieval of resources in folksonomies using a new tag similarity measure. In CIKM ’11: Proceedings of the 20th ACM Conference on Information and Knowledge Management, pages 545–550, 2011. G. Quattrone, E. Ferrara, P. De Meo, and L. Capra. Measuring similarity in largescale folksonomies. In SEKE ’11: Proceedings of the 23rd International Conference on Software Engineering and Knowledge Engineering, pages 385–391, 2011. P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti. Improving recommendation quality by merging collaborative filtering and social relationships. In ISDA ’11: Proceedings of the 11th International Conference on Intelligent Systems Design and Applications, pages 587–592, 2011. S. Catanese, E. Ferrara, G. Fiumara, and F. Pagano. A framework for designing 3D virtual environments. In Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering. Springer Verlag, (In press). S. Catanese, E. Ferrara, G. Fiumara, and F. Pagano. Rendering of 3D dynamic virtual environments. In Simutools ’11: Proceedings of the 4th International ICST Conference on Simulation Tools and Techniques, 2011. E. Ferrara, G. Fiumara, and F. Pagano. Living city, a collaborative browserbased massively multiplayer online game. In Simutools ’10: Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques, pages 1–8, ICST, 2010. 156 Bibliography [1] Adamic, L., Adar, E.: Friends and neighbors on the web. Social networks 25(3), 211–230 (2003) 47 [2] Adamic, L., et al.: Power-law distribution of the world wide web. Science 287(5461), 2115 (2000) 56 [3] Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering pp. 
734–749 (2005) 46 [4] Ahn, Y., Han, S., Kwak, H., Moon, S., Jeong, H.: Analysis of topological characteristics of huge online social networking services. In: Proc. of the 16th international conference on World Wide Web, pp. 835–844. ACM (2007) 1, 66 [5] Aiello, L.M., Barrat, A., Cattuto, C., Ruffo, G., Schifanella, R.: Link creation and profile alignment in the aNobii social network. In: Proc. of the 2nd International Conference on Social Computing, pp. 249–256 (2010) 49 [6] Alahakoon, T., Tripathi, R., Kourtellis, N., Simha, R., Iamnitchi, A.: K-path centrality: A new centrality measure in social networks. In: Proc. of the 4th Workshop on Social Network Systems, pp. 1–6 (2011) 116, 119, 121 [7] Albert, R.: Diameter of the World Wide Web. Nature 401(6749), 130 (1999) 12, 56, 57, 66, 67 [8] Albert, R., Barabási, A.: Statistical mechanics of complex networks. Reviews of Modern Physics 74(1), 47–97 (2002) 45, 60, 66 [9] Amalfitano, D., Fasolino, A.R., Tramontana, P.: Reverse engineering finite state machines from rich internet applications. In: Proc. of the 15th Working Conference on Reverse Engineering, pp. 69–73. IEEE (2008) 23 [10] Amaral, L., Scala, A., Barthélémy, M., Stanley, H.: Classes of small-world networks. Proceedings of the National Academy of Sciences 97(21), 11,149 (2000) 56 [11] Anthonisse, J.: The rush in a directed graph. Tech. Rep. BN/9/71, Stichting Mathematisch Centrum, Amsterdam, The Netherlands (1971) 115, 118 [12] Anton, T.: XPath-Wrapper Induction by generalizing tree traversal patterns. MIT Press (2004) 20 157 BIBLIOGRAPHY [13] Bader, D., Kintali, S., Madduri, K., Mihail, M.: Approximating betweenness centrality. Algorithms and Models for the Web-Graph pp. 124–137 (2007) 119 [14] Barabási, A., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509 (1999) ix, 57, 68, 70, 72, 77 [15] Barabási, A., Crandall, R.: Linked: The new science of networks. American Journal of Physics 71, 409 (2003) 53, 66 [16] Barabási, A., Gulbahce, N., Loscalzo, J.: Network medicine: a network-based approach to human disease. Nature Reviews Genetics 12(1), 56–68 (2011) 154 [17] Barabási, A., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and its Applications 311(3-4), 590–614 (2002) 56 [18] Barrat, A., Weigt, M.: On the properties of small-world network models. The European Physical Journal B - Condensed Matter and Complex Systems 13(3), 547–560 (2000) 80 [19] Barthelemy, M.: Betweenness Centrality in Large Complex Networks. European Physical Journal B 38, 163–168 (2004) 62 [20] Batagelj, V., Doreian, P., Ferligoj, A.: An optimizational approach to regular equivalence. Social Networks 14(1-2), 121–135 (1992) 47 [21] Baumgartner, R., Ceresna, M., Ledermuller, G.: Deepweb navigation in web data extraction. In: Proc. of the International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, pp. 698–703. IEEE (2005) 19 [22] Baumgartner, R., Flesca, S., Gottlob, G.: The elog web extraction language. In: Proc. of the Artificial Intelligence on Logic for Programming, pp. 548–560. Springer Verlag (2001) 20, 28 [23] Baumgartner, R., Flesca, S., Gottlob, G.: Visual web information extraction with lixto. In: Proc. of the 27th International Conference on Very Large Data Bases, pp. 119–128. Morgan Kaufmann Publishers Inc. 
(2001) 20, 28 [24] Baumgartner, R., Frölich, O., Gottlob, G., Harz, P., Herzog, M., Lehmann, P.: Web data extraction for business intelligence: the lixto approach. Datenbanksysteme in Business, Technologie und Web 11, 30–47 (2005) 23 [25] Baumgartner, R., Fröschl, K., Hronsky, M., Pöttler, M., Walchhofer, N.: Semantic online tourism market monitoring. Proc. of the 17th eTourism International Conference (2010) 23 [26] Baumgartner, R., Gatterbauer, W., Gottlob, G.: Web data extraction system. Encyclopedia of Database Systems pp. 3465–3471 (2009) 18, 37 [27] Baumgartner, R., Gottlob, G., Herzog, M.: Scalable web data extraction for online market intelligence. Proceedings of the VLDB Endowment 2(2), 1512–1523 (2009) 23 [28] Ben-Dor, A., Shamir, R., Yakhini, Z.: Clustering gene expression patterns. Journal of computational biology 6(3-4), 281–297 (1999) 147 158 BIBLIOGRAPHY [29] Berger, A., Pietra, V., Pietra, S.: A maximum entropy approach to natural language processing. Computational linguistics 22(1), 39–71 (1996) 20 [30] Berthold, M., Hand, D.J.: Intelligent Data Analysis: An Introduction. Springer Verlag (1999) 20 [31] Blondel, V., Gajardo, A., Heymans, M., Senellart, P., Van Dooren, P.: A measure of similarity between graph vertices: Applications to synonym extraction and web searching. Siam Review pp. 647–666 (2004) 47 [32] Blondel, V., Guillaume, J., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008, P10,008 (2008) 80, 141, 142, 144 [33] Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., Hwang, D.: Complex networks: Structure and dynamics. Physics Reports 424(4-5), 175–308 (2006) 88 [34] Boldi, P., Rosa, M., Santini, M., Vigna, S.: Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In: Proc. of the 20th international conference on World wide web, pp. 587–596. ACM (2011) 2 [35] Boldi, P., Vigna, S.: The webgraph framework i: compression techniques. In: Proc. of the 13th international conference on World Wide Web, pp. 595–602. ACM (2004) 2 [36] Bollen, J., Goncalves, B., Ruan, G., Mao, H.: Happiness is assortative in online social networks. Artificial Life 17(3), 237–251 (2011) 154 [37] Bollen, J., Pepe, A., Mao, H.: Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. In: Proc. 5th International AAAI Conference on Weblogs and Social Media, pp. 17–21 (2011) 154 [38] Bollobás, B., Riordan, O.: The diameter of a scale-free random graph. Combinatorica 24(1), 5–34 (2004) 12 [39] Bonchi, F., Castillo, C., Ienco, D.: Meme ranking to maximize posts virality in microblogging platforms. Journal of Intelligent Information Systems pp. 1–29 (2011) 154 [40] Borgatti, S., Everet, M.: A graph-theoretic perspective on centrality. Social Networks 28(4), 466–484 (2006) 121 [41] Borgatti, S., Everett, M.: Models of core/periphery structures. Social networks 21(4), 375–395 (2000) 74, 86 [42] Boronat, X.: A comparison of html-aware tools for web data extraction. Master’s thesis, Universität Leipzig, Fakultät für Mathematik und Informatik (2008). Abteilung Datenbanken 20 [43] Bossa, S., Fiumara, G., Provetti, A.: A lightweight architecture for rss polling of arbitrary web sources. In: Proc. of WOA conference (2006) 30 [44] Bouttier, J., Di Francesco, P., Guitter, E.: Geodesic distance in planar graphs. Nuclear Physics B 663(3), 535–567 (2003) 12 [45] Brandes, U.: A faster algorithm for betweenness centrality. 
Journal of Mathematical Sociology 25(2), 163–177 (2001) 117, 119 159 BIBLIOGRAPHY [46] Brandes, U.: On variants of shortest-path betweenness centrality and their generic computation. Social Networks 30(2), 136–145 (2008) 118 [47] Brandes, U., Delling, D., Gaertler, M., Görke, R., Hoefer, M., Nikoloski, Z., Wagner, D.: On finding graph clusterings with maximum modularity. In: Graph-Theoretic Concepts in Computer Science, pp. 121–132 (2007) 142 [48] Brandes, U., Eiglsperger, M., Herman, I., Himsolt, M., Marshall, M.: GraphML progress report structural layer proposal. In: Graph Drawing, pp. 109–112. Springer (2002) 50, 55 [49] Brandes, U., Erlebach, T.: Fundamentals, Lecture Notes in Computer Science, vol. Network Analysis: Methodological Foundations, chap. 2, pp. 8–15. Springer (2005) 10 [50] Brandes, U., Fleischer, D.: Centrality measures based on current flow. In: Proc. of the 22nd Symposium Theoretical Aspects of Computer Science, pp. 533–544. Springer (2005) 117 [51] Brandes, U., Pich, C.: Centrality estimation in large networks. International Journal of Bifurcation and Chaos 17(7), 2303–2318 (2007) 119 [52] Brown, J., Broderick, A., Lee, N.: Word of mouth communication within online communities: Conceptualizing the online social network. Journal of interactive marketing 21(3), 2–20 (2007) 119 [53] Brown, J., Reingen, P.: Social ties and word-of-mouth referral behavior. Journal of Consumer Research pp. 350–362 (1987) 154 [54] Cafarella, M., Halevy, A., Khoussainova, N.: Data integration for the relational web. Proceedings of the VLDB Endowment 2(1), 1090–1101 (2009) 139 [55] de Campos, L., Fernández-Luna, J., Huete, J., Romero, A.: Probabilistic methods for structured document classification at inex07. Focused Access to XML Documents pp. 195–206 (2008) 20 [56] Carrington, P., Scott, J., Wasserman, S.: Models and methods in social network analysis. Cambridge University Press (2005) 2, 78, 115 [57] Catanese, S., De Meo, P., Ferrara, E., Fiumara, G.: Analyzing the facebook friendship graph. In: Proc. of the 1st International Workshop on Mining the Future Internet, vol. 685, pp. 14–19 (2010) 4, 52 [58] Catanese, S., De Meo, P., Ferrara, E., Fiumara, G., Provetti, A.: Crawling facebook for social network analysis purposes. In: Proc. of the International Conference on Web Intelligence, Mining and Semantics, pp. 52:1–52:8. ACM (2011) 4, 52, 109 [59] Catanese, S., De Meo, P., Ferrara, E., Fiumara, G., Provetti, A.: Extraction and analysis of facebook friendship relations. Computational Social Networks: Mining and Visualization (2012) 4 [60] Catanese, S., Ferrara, E., Fiumara, G.: Forensic analysis of phone call networks. Social Network Analysis and Mining (In press) 4 [61] Centola, D.: The spread of behavior in an online social network experiment. Science 329(5996), 1194 (2010) 108 160 BIBLIOGRAPHY [62] Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering 18(10), 1411–1428 (2006) 18 [63] Chau, D., Pandit, S., Wang, S., Faloutsos, C.: Parallel crawling for online social networks. In: Proc. of the 16th International Conference on the World Wide Web, pp. 1283–1284 (2007) 49, 52 [64] Chen, H., Chau, M., Zeng, D.: Ci spider: a tool for competitive intelligence on the web. Decision Support Systems 34(1), 1–17 (2002) 23 [65] Chen, W., Wang, C., Wang, Y.: Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: Proc. 
of the 16th SIGKDD international conference on Knowledge discovery and data mining, pp. 1029–1038. ACM (2010) 154 [66] Chen, W., Wang, Y., Yang, S.: Efficient influence maximization in social networks. In: Proc. of the 15th SIGKDD international conference on Knowledge discovery and data mining, pp. 199–208. ACM (2009) 154 [67] Chung, F., Lu, L.: The diameter of sparse random graphs. Advances in Applied Mathematics 26(4), 257–279 (2001) 12 [68] Clauset, A.: Finding local community structure in networks. Physical Review E 72(2), 026,132 (2005) 74 [69] Clauset, A., Newman, M., Moore, C.: Finding community structure in very large networks. Physical review E 70(6), 066,111 (2004) 74, 86 [70] Colizza, V., Flammini, A., Serrano, M., Vespignani, A.: Detecting rich-club ordering in complex networks. Nature Physics 2(2), 110–115 (2006) 110 [71] Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetic progressions. Journal of symbolic computation 9(3), 251–280 (1990) 12 [72] Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms. MIT Press, 2nd edition (2001) 11, 12 [73] Crescenzi, V., Mecca, G.: Automatic information extraction from large websites. Journal of the ACM 51(5), 731–779 (2004) 29 [74] Crescenzi, V., Mecca, G., Merialdo, P.: Roadrunner: Towards automatic data extraction from large web sites. In: Proc. of the 27th International Conference on Very Large Data Bases, pp. 109–118. Morgan Kaufmann Publishers Inc. (2001) 20, 29 [75] Danon, L., Dı́az-Guilera, A., Duch, J., Arenas, A.: Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09,008 (2005) 145 [76] Dave, K., Lawrence, S., Pennock, D.M.: Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In: Proc. of the 12th international conference on World Wide Web, pp. 519–528. ACM (2003) 25 [77] De Meo, P., Ferrara, E., Abel, F., Aroyo, L., Houben, G.J.: Analyzing user behavior across social web environments. ACM Transactions on Intelligent Systems and Technology (Under Review) 4 161 BIBLIOGRAPHY [78] De Meo, P., Ferrara, E., Fiumara, G.: Finding similar users in facebook. Social Networking and Community Behavior Modeling: Qualitative and Quantitative Measurement pp. 304–323 (2011) 4 [79] De Meo, P., Ferrara, E., Fiumara, G., Provetti, A.: Generalized louvain method for community detection in large networks. In: Proc. of the 11th International Conference on Intelligent Systems Design and Applications, pp. 88–93 (2011) 5, 139 [80] De Meo, P., Ferrara, E., Fiumara, G., Provetti, A.: Improving recommendation quality by merging collaborative filtering and social relationships. In: Proc. of the 11th International Conference on Intelligent Systems Design and Applications, pp. 587–592 (2011) 5 [81] De Meo, P., Ferrara, E., Fiumara, G., Ricciardello, A.: A novel measure of edge centrality in social networks. Knowledge-based Systems (2012). DOI 10.1016/j.knosys.2012.01.007 5, 109, 122 [82] De Meo, P., Garro, A., Terracina, G., Ursino, D.: X-learn: an xml-based, multi-agent system for supporting user-device adaptive e-learning. On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE pp. 739–756 (2003) 140 [83] De Meo, P., Nocera, A., Quattrone, G., Rosaci, D., Ursino, D.: Finding reliable users and social networks in a social internetworking system. In: Proc. of the International Database Engineering & Applications Symposium, pp. 173–181. 
ACM (2009) 141 [84] De Meo, P., Nocera, A., Terracina, G., Ursino, D.: Recommendation of similar users, resources and social networks in a social internetworking scenario. Information Sciences 181(7), 1285–1305 (2011) 67 [85] Descher, M., Feilhauer, T., Ludescher, T., Masser, P., Wenzel, B., Brezany, P., Elsayed, I., Wöhrer, A., Tjoa, A., Huemer, D.: Position paper: Secure infrastructure for scientific data life cycle management. In: International Conference on Availability, Reliability and Security, pp. 606–611. IEEE (2009) 26 [86] D’haeseleer, P.: How does gene expression clustering work? Nature Biotechnology 23(12), 1499–1502 (2005) 147 [87] Dijkstra, E.: A note on two problems in connexion with graphs. Numerische mathematik 1(1), 269–271 (1959) 12 [88] Ding, L., Steil, D., Dixon, B., Parrish, A., Brown, D.: A relation context oriented approach to identify strong ties in social networks. Knowledge-Based Systems 24(8), 1187– 1195 (2011) 116, 141 [89] Dodson Jr, J., Muller, E.: Models of new product diffusion through advertising and word-of-mouth. Management Science pp. 1568–1578 (1978) 154 [90] Du, N., Wu, B., Pei, X., Wang, B., Xu, L.: Community detection in large-scale social networks. In: Proc. of the 9th WebKDD and 1st SNA-KDD workshop on Web mining and social network analysis, pp. 16–25. ACM (2007) 88 [91] Duch, J., Arenas, A.: Community detection in complex networks using extremal optimization. Physical Review E 72(2), 027,104 (2005) 74 [92] Dunbar, R.: Grooming, gossip, and the evolution of language. Harvard University Press (1998) 127 162 BIBLIOGRAPHY [93] Embley, D., Campbell, D., Jiang, Y., Liddle, S., Lonsdale, D., Ng, Y., Smith, R.: Conceptual-model-based data extraction from multiple-record web pages. Data & Knowledge Engineering 31(3), 227–251 (1999) 20 [94] Erdős, P., Rényi, A.: On random graphs. Publicationes Mathematicae 6(26), 290–297 (1959) ix, 57, 67, 68, 69, 73, 75 [95] Everett, M., Borgatti, S.: Peripheries of cohesive subsets. Social Networks 21, 397–407 (1999) 74 [96] Everett, M., Borgatti, S.: Ego network betweenness. Social Networks 27(1), 31–38 (2005) 121 [97] Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the internet topology. In: ACM SIGCOMM Computer Communication Review, vol. 29, pp. 251–262. ACM (1999) 56, 57 [98] Feldman, I., Rzhetsky, A., Vitkup, D.: Network properties of genes harboring inherited disease mutations. Proceedings of the National Academy of Sciences 105(11), 4323 (2008) 154 [99] Ferrara, E.: Community structure discovery in facebook. International Journal of Social Network Mining 1(1), 67–90 (2012) 5, 81, 109 [100] Ferrara, E.: Conclude: Complex network cluster detection for social and biological applications. In: WWW ’12 PhD Symposium (Under Review) 5 [101] Ferrara, E.: A large-scale community structure analysis in facebook. PLoS ONE (Under Review) 5, 109 [102] Ferrara, E., Baumgartner, R.: Automatic wrapper adaptation by tree edit distance matching. Combinations of Intelligent Methods and Applications pp. 41–54 (2011) 3, 44 [103] Ferrara, E., Baumgartner, R.: Design of automatically adaptable web wrappers. In: Proc. of the 3rd International Conference on Agents and Artificial Intelligence, pp. 211– 217 (2011) 3, 44 [104] Ferrara, E., Baumgartner, R.: Intelligent self-repairable web wrappers. In: Lecture Notes in Computer Science (Proc. of the 12th International Conference on Advances in Artificial Intelligence), vol. 6934, pp. 
274–285 (2011) 3, 44 [105] Ferrara, E., Fiumara, G.: Topological features of online social networks. Communications on Applied and Industrial Mathematics 2(2), 1–20 (2011) 4 [106] Ferrara, E., Fiumara, G., Baumgartner, R.: Web data extraction, applications and techniques: A survey. ACM Computing Surveys (Under Review) 3 [107] Fiumara, G.: Automated information extraction from web sources: a survey. Proc. of Between Ontologies and Folksonomies Workshop in 3rd International Conference on Communities and Technology pp. 1–9 (2007) 18 [108] Flesca, S., Manco, G., Masciari, E., Rende, E., Tagarelli, A.: Web wrapper induction: a brief survey. AI Communications 17(2), 57–61 (2004) 18, 30 [109] Fortunato, S.: Community detection in graphs. Physics Reports 486, 75–174 (2010) 73, 80, 86, 87, 88, 139 [110] Fortunato, S., Barthélemy, M.: Resolution limit in community detection. Proceedings of the National Academy of Sciences 104(1), 36 (2007) 80, 100, 106, 145 [111] Fortunato, S., Latora, V., Marchiori, M.: Method to find community structures based on information centrality. Physical Review E 70(5), 056104 (2004) 126 [112] Freeman, L.: A set of measures of centrality based on betweenness. Sociometry 40(1), 35–41 (1977) 3, 14, 115, 117 [113] Freeman, L.: Centrality in social networks conceptual clarification. Social Networks 1(3), 215–239 (1979) 14 [114] Freeman, L., Borgatti, S., White, D.: Centrality in valued graphs: A measure of betweenness based on network flow. Social Networks 13(2), 141–154 (1991) 120 [115] Friedkin, N.: Horizons of observability and limits of informal control in organizations. Social Forces 62(1), 55–77 (1983) 3, 116, 118, 121 [116] Fruchterman, T., Reingold, E.: Graph drawing by force-directed placement. Software: Practice and Experience 21(11), 1129–1164 (1991) 72 [117] Garrett, J.J.: Ajax: A new approach to web applications. Tech. rep., Adaptive Path (2005). URL http://www.adaptivepath.com/ideas/essays/archives/000385.php 19 [118] Garton, L., Haythornthwaite, C., Wellman, B.: Studying online social networks. Journal of Computer-Mediated Communication 3(1) (1997) 1, 45 [119] Gatterbauer, W.: Web harvesting. Encyclopedia of Database Systems pp. 3472–3473 (2009) 26 [120] Gatterbauer, W., Bohunsky, P.: Table extraction using spatial reasoning on the CSS2 visual box model. In: AAAI ’06 Proc. of the 21st national conference on Artificial intelligence, pp. 1313–1318. AAAI Press (2006) 28 [121] Gatterbauer, W., Bohunsky, P., Herzog, M., Krüpl, B., Pollak, B.: Towards domain-independent information extraction from web tables. In: Proc. of the 16th international conference on World Wide Web, pp. 71–80. ACM (2007) 20 [122] Ghosh, R., Lerman, K.: Predicting influential users in online social networks. In: Proc. of KDD workshop on Social Network Analysis (2010) 48 [123] Gilbert, E., Karahalios, K.: Predicting tie strength with social media. In: Proc. of the 27th international conference on Human factors in computing systems, pp. 211–220. ACM (2009) 107, 109 [124] Girvan, M., Newman, M.: Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99(12), 7821 (2002) 3, 67, 115, 143 [125] Gjoka, M., Kurant, M., Butts, C., Markopoulou, A.: Walking in Facebook: a case study of unbiased sampling of OSNs. In: Proc. of the 29th conference on Information communications, pp. 2498–2506.
IEEE (2010) 1, 46, 53, 55, 66, 92 [126] Gjoka, M., Kurant, M., Butts, C., Markopoulou, A.: Practical recommendations on crawling online social networks. IEEE Journal on Selected Areas in Communications 29(9), 1872–1892 (2011) 109 [127] Goh, K., Cusick, M., Valle, D., Childs, B., Vidal, M., Barabási, A.: The human disease network. Proceedings of the National Academy of Sciences 104(21), 8685 (2007) 154 [128] Goh, K., Kahng, B., Kim, D.: Universal behavior of load distribution in scale-free networks. Physical Review Letters 87(27), 278701 (2001) 62 [129] Golbeck, J., Hendler, J.: Inferring binary trust relationships in web-based social networks. Transactions on Internet Technology 6(4), 497–529 (2006) 66 [130] Goldenberg, A., Zheng, A., Fienberg, S., Airoldi, E.: A survey of statistical network models. Foundations and Trends in Machine Learning 2(2), 129–233 (2010) 49 [131] Goldenberg, J., Libai, B., Muller, E.: Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters 12(3), 211–223 (2001) 154 [132] Goldenberg, J., Libai, B., Muller, E.: Using complex systems analysis to advance marketing theory development: Modeling heterogeneity effects on new product growth through stochastic cellular automata. Academy of Marketing Science Review 9(3), 1–18 (2001) 154 [133] Golub, G., Van Loan, C.: Matrix computations, vol. 3. Johns Hopkins University Press (1996) 47 [134] Gottlob, G., Koch, C.: Logic-based web information extraction. ACM SIGMOD Record 33(2), 87–94 (2004) 28 [135] Gottlob, G., Koch, C.: Monadic datalog and the expressive power of languages for web information extraction. Journal of the ACM 51(1), 74–113 (2004) 28 [136] Grabowicz, P., Ramasco, J., Moro, E., Pujol, J., Eguiluz, V.: Social features of online networks: The strength of intermediary ties in online social media. PLoS ONE 7(1), e29358 (2012) 4, 108, 111 [137] Granovetter, M.: The strength of weak ties. American Journal of Sociology pp. 1360–1380 (1973) 85, 105, 106, 107, 108, 109 [138] Gregory, S.: An algorithm to find overlapping community structure in networks. Knowledge Discovery in Databases: PKDD 2007 pp. 91–102 (2007) 75, 88, 154 [139] Gross, R., Acquisti, A.: Information revelation and privacy in online social networks. In: Proc. of the Workshop on Privacy in the Electronic Society, pp. 71–80. ACM (2005) 50 [140] Guimera, R., Amaral, L.: Functional cartography of complex metabolic networks. Nature 433(7028), 895 (2005) 87 [141] Hagen, L., Kahng, A.: New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 11(9), 1074–1085 (1992) 73, 87 [142] Hammersley, B.: Developing feeds with RSS and Atom. O’Reilly (2005) 18 [143] Hampton, K., Sessions, L., Her, E., Rainie, L.: Social isolation and new technology. Pew Research Center 4 (2007) 86 [144] Han, H.: Conceptual modeling and ontology extraction for web information. Ph.D. thesis (2002). Supervisor-Elmasri, Ramez 20 [145] Han, J., Kamber, M.: Data mining: concepts and techniques. Morgan Kaufmann Publishers Inc. (2000) 23 [146] Han, J., Kamber, M., Pei, J.: Data mining: concepts and techniques. Morgan Kaufmann Pub (2011) 47 [147] Heider, F.: The psychology of interpersonal relations. Lawrence Erlbaum (1982) 109 [148] Holland, P., Leinhardt, S.: Transitivity in structural models of small groups. Comparative Group Studies (1971) 71 [149] Holme, P., Kim, B.: Growing scale-free networks with tunable clustering.
Physical Review E 65(2), 026107 (2002) ix, 71, 72, 77 [150] Hsu, C.N., Dung, M.T.: Generating finite-state transducers for semi-structured data extraction from the web. Information Systems 23(9), 521–538 (1998) 30 [151] Hu, X., Lin, T.Y., Song, I.Y., Lin, X., Yoo, I., Lechner, M., Song, M.: Ontology-based scalable and portable information extraction system to extract biological knowledge from huge collection of biomedical web documents. In: Proc. of the International Conference on Web Intelligence, pp. 77–83. IEEE (2004) 20 [152] Hunter, D., Goodreau, S., Handcock, M.: Goodness of fit of social network models. Journal of the American Statistical Association 103(481), 248–258 (2008) 86 [153] Irmak, U., Suel, T.: Interactive wrapper generation with minimal user effort. In: Proc. of the 15th international conference on World Wide Web, pp. 553–563. ACM (2006) 19 [154] Jeh, G., Widom, J.: SimRank: a measure of structural-context similarity. In: Proc. of the 8th SIGKDD international conference on Knowledge discovery and data mining, pp. 538–543. ACM (2002) 47 [155] Jeong, H., Tombor, B., Albert, R., Oltvai, Z., Barabási, A.: The large-scale organization of metabolic networks. Nature 407(6804), 651–654 (2000) 56, 57 [156] Jin, D., Liu, D., Yang, B., Liu, J.: Fast Complex Network Clustering Algorithm Using Agents. In: Proc. of the 8th International Conference on Dependable, Autonomic and Secure Computing, pp. 615–619 (2009) 87, 88, 89 [157] Kaiser, K., Miksch, S.: Information extraction. A survey. Tech. rep., E188 - Institut für Softwaretechnik und Interaktive Systeme; Technische Universität Wien (2005) 18, 27 [158] Karrer, B., Levina, E., Newman, M.: Robustness of community structure in networks. Physical Review E 77(4), 046119 (2008) 86 [159] Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a social network. In: Proc. of 9th international conference on Knowledge discovery and data mining, pp. 137–146 (2003) 67, 119, 154 [160] Khare, R., Çelik, T.: Microformats: a pragmatic path to the semantic web. In: Proc. of the 15th international conference on World Wide Web, pp. 865–866. ACM (2006) 18 [161] Kim, M., Han, J.: CHRONICLE: A Two-Stage Density-Based Clustering Algorithm for Dynamic Networks. In: Proc. of the International Conference on Discovery Science, Lecture Notes in Computer Science, pp. 152–167. Springer (2009) 87 [162] Kim, Y., Park, J., Kim, T., Choi, J.: Web information extraction by HTML tree edit distance matching. In: International Conference on Convergence Information Technology, pp. 2455–2460. IEEE (2007) 35 [163] Kim, Y.A., Song, H.S.: Strategies for predicting local trust based on trust propagation in social networks. Knowledge-Based Systems 24(8), 1360–1371 (2011) 141 [164] Kleinberg, J.: Authoritative sources in a hyperlinked environment. Journal of the ACM 46(5), 604–632 (1999) 48, 71 [165] Kleinberg, J.: The small-world phenomenon: an algorithmic perspective. In: Proc. of the 32nd annual symposium on Theory of computing, pp. 163–170. ACM (2000) 45, 66, 71 [166] Koschützki, D., Lehmann, K.A., Peeters, L., Richter, S., Tenfelde-Podehl, D., Zlotowski, O.: Centrality indices. In: Network Analysis: Methodological Foundations, Lecture Notes in Computer Science, chap. 3, pp. 16–61. Springer (2005) 118 [167] Krüpl, B., Herzog, M., Gatterbauer, W.: Using visual cues for extraction of tabular data from arbitrary HTML documents. In: Special interest tracks and posters of the 14th international conference on World Wide Web, pp. 1000–1001.
ACM (2005) 28 [168] Kuhlins, S., Tredwell, R.: Toolkits for generating wrappers. In: Revised Papers from the International Conference NetObjectDays on Objects, Components, Architectures, Services, and Applications for a Networked World, pp. 184–198. Springer Verlag (2003) 17 [169] Kumar, R., Novak, J., Tomkins, A.: Structure and evolution of online social networks. Link Mining: Models, Algorithms, and Applications pp. 337–357 (2010) 1, 66 [170] Kurant, M., Markopoulou, A., Thiran, P.: On the bias of breadth first search (BFS) and of other graph sampling techniques. In: Proc. of the 22nd International Teletraffic Congress, pp. 1–8 (2010) 2, 4, 49, 52, 53, 92, 105 [171] Kushmerick, N.: Wrapper induction for information extraction. Ph.D. thesis, University of Washington (1997). Chairperson-Weld, Daniel S. 27 [172] Kushmerick, N.: Wrapper induction: efficiency and expressiveness. Artificial Intelligence 118(1-2), 15–68 (2000) 30 [173] Kushmerick, N.: Finite-state approaches to web information extraction. Proc. of 3rd Summer Convention on Information Extraction pp. 77–91 (2002) 17, 31 [174] Laender, A.H.F., Ribeiro-Neto, B.A., da Silva, A.S., Teixeira, J.S.: A brief survey of web data extraction tools. ACM SIGMOD Record 31(2), 84–93 (2002) 17, 18, 20, 21 [175] Lancichinetti, A., Fortunato, S., Kertész, J.: Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics 11, 033015 (2009) 75, 154 [176] Lancichinetti, A., Kivelä, M., Saramäki, J.: Characterizing the community structure of complex networks. PLoS ONE 5(8), e11976 (2010) 105 [177] Lancichinetti, A., Radicchi, F.: Benchmark graphs for testing community detection algorithms. Physical Review E 78(4), 046110 (2008) 144 [178] Lancichinetti, A., Radicchi, F., Ramasco, J.: Finding statistically significant communities in networks. PLoS ONE 6(4), e18961 (2011) 147 [179] Lappas, T., Terzi, E., Gunopulos, D., Mannila, H.: Finding effectors in social networks. In: Proc. of the 16th SIGKDD international conference on Knowledge discovery and data mining, pp. 1059–1068. ACM (2010) 154 [180] Lau, A., Tsui, E.: Knowledge management perspective on e-learning effectiveness. Knowledge-Based Systems 22(4), 324–325 (2009) 140 [181] Lee, C., Reid, F., McDaid, A., Hurley, N.: Detecting highly overlapping community structure by greedy clique expansion. In: Proc. of the 4th Workshop on Social Network Mining and Analysis. ACM (2010) 75, 86 [182] Leicht, E., Holme, P., Newman, M.: Vertex similarity in networks. Physical Review E 73(2), 026120 (2006) 47 [183] Leskovec, J.: Stanford Network Analysis Package (SNAP). URL http://snap.stanford.edu/ 58, 128 [184] Leskovec, J., Faloutsos, C.: Sampling from large graphs. In: Proc. of the 12th SIGKDD international conference on Knowledge discovery and data mining, pp. 631–636. ACM (2006) 46, 66, 78, 145 [185] Leskovec, J., Kleinberg, J., Faloutsos, C.: Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proc. of the 11th SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 177–187 (2005) 60 [186] Leskovec, J., Lang, K., Dasgupta, A., Mahoney, M.: Statistical properties of community structure in large social and information networks. In: Proc. of the 17th international conference on World Wide Web, pp. 695–704 (2008) 79 [187] Leskovec, J., Lang, K., Dasgupta, A., Mahoney, M.: Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters.
Internet Mathematics 6(1), 29–123 (2009) 89 [188] Leskovec, J., Lang, K., Mahoney, M.: Empirical comparison of algorithms for network community detection. In: Proc. of the 19th international conference on World Wide Web, pp. 631–640. ACM (2010) 87 [189] Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology 58(7), 1019–1031 (2007) 66, 127 [190] Ma, H., Yang, H., Lyu, M., King, I.: Mining social networks using heat diffusion processes for marketing candidates selection. In: Proc. of the 17th conference on Information and knowledge management, pp. 233–242. ACM (2008) 154 [191] Madduri, K., Ediger, D., Jiang, K., Bader, D., Chavarria-Miranda, D.: A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets. In: Proc. of the International Symposium on Parallel & Distributed Processing, pp. 1–8. IEEE (2009) 154 [192] Madras, N., Slade, G.: The self-avoiding walk. Birkhäuser (1996) 116 [193] Mahmoud, H., Aboulnaga, A.: Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems. In: Proc. of the international conference on Management of data, pp. 411–422. ACM (2010) 139 [194] Manning, C.D., Schütze, H.: Foundations of statistical natural language processing. MIT Press (1999) 20 [195] Mathioudakis, M., Koudas, N.: Efficient identification of starters and followers in social media. In: Proc. of the International Conference on Extending Database Technology, pp. 708–719. ACM (2009) 48 [196] McCown, F., Nelson, M.: What happens when Facebook is gone? In: Proc. of the 9th Joint Conference on Digital Libraries, pp. 251–254. ACM (2009) 50 [197] McDaid, A., Hurley, N.: Detecting highly overlapping communities with model-based overlapping seed expansion. In: 2010 International Conference on Advances in Social Networks Analysis and Mining, pp. 112–119. IEEE (2010) 75, 86 [198] McPherson, M., Smith-Lovin, L., Cook, J.: Birds of a feather: Homophily in social networks. Annual Review of Sociology pp. 415–444 (2001) 109 [199] Melomed, E., Gorbach, I., Berger, A., Bateman, P.: Microsoft SQL Server 2005 Analysis Services (SQL Server Series). Sams (2006) 23 [200] Meng, X., Hu, D., Li, C.: Schema-guided wrapper maintenance for web-data extraction. In: Proc. of the 5th international workshop on Web information and data management, pp. 1–8. ACM (2003) 31 [201] Mika, P.: Ontologies are us: A unified model of social networks and semantics. Web Semantics: Science, Services and Agents on the World Wide Web 5(1), 5–15 (2007) 20 [202] Milgram, S.: The small world problem. Psychology Today 2(1), 60–67 (1967) 53, 56, 57, 60, 65, 67 [203] Mislove, A., Marcon, M., Gummadi, K., Druschel, P., Bhattacharjee, B.: Measurement and analysis of online social networks. In: Proc. of the 7th SIGCOMM conference on Internet measurement, pp. 29–42. ACM (2007) 1, 46, 49, 52, 66, 78, 117 [204] Monge, A.E.: Matching algorithm within a duplicate detection system. IEEE Data Engineering Bulletin 23(4) (2000) 19 [205] Mucha, P., Richardson, T., Macon, K., Porter, M., Onnela, J.: Community structure in time-dependent, multiscale, and multiplex networks. Science 328(5980), 876 (2010) 109 [206] Muslea, I., Minton, S., Knoblock, C.: A hierarchical approach to wrapper induction. In: Proc. of the 3rd annual conference on Autonomous Agents, pp. 190–197. ACM (1999) 30 [207] Newcomb, T.: The acquaintance process.
(1961) 109 [208] Newman, M.: Scientific collaboration networks. I. Network Construction and Fundamental Results. Physical Review E 64(1), 016131 (2001) 56 [209] Newman, M.: The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences 98(2), 404 (2001) 79 [210] Newman, M.: The Structure and Function of Complex Networks. SIAM Review 45(2), 167 (2003) 56, 86 [211] Newman, M.: A measure of betweenness centrality based on random walks. Social Networks 26(2), 175–188 (2004) 116, 117, 119, 120 [212] Newman, M.: Power laws, Pareto distributions and Zipf's law. Contemporary Physics 46(5), 323–351 (2005) 110 [213] Newman, M.: Finding community structure in networks using the eigenvectors of matrices. Physical Review E 74(3), 036104 (2006) 74 [214] Newman, M.: Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103(23), 8577 (2006) 74 [215] Newman, M.: The first-mover advantage in scientific publication. Europhysics Letters 86, 68001 (2009) 79 [216] Newman, M., Barabási, A., Watts, D.: The structure and dynamics of networks. Princeton University Press (2006) 53, 86 [217] Newman, M., Girvan, M.: Finding and evaluating community structure in networks. Physical Review E 69(2), 026113 (2004) 74, 88, 108, 143 [218] Newman, M., Leicht, E.: Mixture models and exploratory analysis in networks. Proceedings of the National Academy of Sciences 104(23), 9564 (2007) 74, 87 [219] Newman, M., Watts, D.: Renormalization group analysis of the small-world network model. Physics Letters A 263(4-6), 341–346 (1999) ix, 69, 72, 76 [220] Ng, A., Jordan, M., Weiss, Y.: On Spectral Clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems 14 (2001) 73 [221] Noh, J.D., Rieger, H.: Random walks on complex networks. Physical Review Letters 92, 118701 (2004) 116, 117 [222] Onnela, J., Reed-Tsochas, F.: The spontaneous emergence of social influence in online systems. Proceedings of the National Academy of Sciences 107, 18375 (2010) 48 [223] Opsahl, T., Agneessens, F., Skvoretz, J.: Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks 32(3), 245–251 (2010) 14 [224] Opsahl, T., Colizza, V., Panzarasa, P., Ramasco, J.: Prominence and control: The weighted rich-club effect. Physical Review Letters 101(16), 168702 (2008) 110 [225] Oti, M., Brunner, H.: The modular nature of genetic diseases. Clinical Genetics 71(1), 1–11 (2007) 154 [226] Pajevic, S., Plenz, D.: The organization of strong links in complex networks. arXiv preprint arXiv:1109.2577 (2011) 108 [227] Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435, 9 (2005) 74, 86, 87 [228] Palmer, C., Steffan, J.: Generating network topologies that obey power laws. In: Global Telecommunications Conference, vol. 1, pp. 434–438. IEEE (2002) 56, 60 [229] Partow, A.: General Purpose Hash Function Algorithms. URL http://www.partow.net/programming/hashfunctions/ 55 [230] Pastor-Satorras, R., Vázquez, A., Vespignani, A.: Dynamical and correlation properties of the Internet. Physical Review Letters 87(25), 258701 (2001) 67 [231] Petróczi, A., Nepusz, T., Bazsó, F.: Measuring tie-strength in virtual social networks. Connections 27(2), 39–52 (2006) 107, 109 [232] Phan, X., Horiguchi, S., Ho, T.: Automated data extraction from the web with conditional models.
International Journal of Business Intelligence and Data Mining 1(2), 194–209 (2005) 19, 30 [233] Plake, C., Schiemann, T., Pankalla, M., Hakenberg, J., Leser, U.: AliBaba: PubMed as a graph. Bioinformatics 22(19), 2444–2445 (2006) 26 [234] Porter, M., Onnela, J., Mucha, P.: Communities in networks. Notices of the American Mathematical Society 56(9), 1082–1097 (2009) 81, 86 [235] Quattrone, G., Capra, L., De Meo, P., Ferrara, E., Ursino, D.: Effective retrieval of resources in folksonomies using a new tag similarity measure. In: Proc. of the 20th Conference on Information and Knowledge Management, pp. 545–550. ACM (2011) 24 [236] Quattrone, G., Ferrara, E., De Meo, P., Capra, L.: Measuring similarity in large-scale folksonomies. In: Proc. of the 23rd International Conference on Software Engineering and Knowledge Engineering, pp. 385–391 (2011) 24 [237] Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., Parisi, D.: Defining and identifying communities in networks. Proceedings of the National Academy of Sciences 101(9), 2658 (2004) 86, 145 [238] Raghavan, U., Albert, R., Kumara, S.: Near linear time algorithm to detect community structures in large-scale networks. Physical Review E 76(3), 036106 (2007) 87, 88 [239] Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. The VLDB Journal 10(4), 334–350 (2001) 19 [240] Rahm, E., Do, H.H.: Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin 23(4) (2000) 19 [241] Ratkiewicz, J., Conover, M., Meiss, M., Goncalves, B., Flammini, A., Menczer, F.: Detecting and tracking political abuse in social media. In: Proc. 5th International AAAI Conference on Weblogs and Social Media (2011) 154 [242] Ratkiewicz, J., Conover, M., Meiss, M., Gonçalves, B., Patil, S., Flammini, A., Menczer, F.: Truthy: Mapping the spread of astroturf in microblog streams. In: Proc. of the 20th international conference companion on World wide web, pp. 249–252. ACM (2011) 154 [243] Redner, S.: How popular is your paper? An empirical study of the citation distribution. The European Physical Journal B 4(2), 131–134 (1998) 56 [244] Rodriguez, M.: Grammar-based random walkers in semantic networks. Knowledge-Based Systems 21(7), 727–739 (2008) 140 [245] Rodriguez, M., Watkins, J.: Grammar-based geodesics in semantic networks. Knowledge-Based Systems 23(8), 844–855 (2010) 116, 140 [246] Romero, D., Galuba, W., Asur, S., Huberman, B.: Influence and passivity in social media. In: Proc. of the 20th International Conference Companion on World Wide Web, pp. 113–114. ACM (2011) 48 [247] Romero, D., Kleinberg, J.: The Directed Closure Process in Hybrid Social-Information Networks, with an Analysis of Link Formation on Twitter. In: Proc. of the 4th International Conference on Weblogs and Social Media (2010) 49 [248] Sabidussi, G.: The centrality index of a graph. Psychometrika 31(4), 581–603 (1966) 3, 14, 115 [249] Sahuguet, A., Azavant, F.: Building light-weight wrappers for legacy web data-sources using W4F. In: Proc. of the 25th International Conference on Very Large Data Bases, pp. 738–741. Morgan Kaufmann Publishers Inc. (1999) 20, 28 [250] Sarawagi, S.: Information extraction. Foundations and Trends in Databases 1(3), 261–377 (2008) 18, 27 [251] Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1), 1–47 (2002) 27 [252] Seidel, R.: On the all-pairs-shortest-path problem. In: Proc. of the 24th Symposium on Theory of Computing, pp. 745–749.
ACM (1992) 12 [253] Selkow, S.: The tree-to-tree editing problem. Information Processing Letters 6(6), 184–186 (1977) 33, 40 [254] Shah, D., Zaman, T.: Community detection in networks: The leader-follower algorithm. In: Proc. of the Workshop on Networks Across Disciplines: Theory and Applications, pp. 1–8 (2010) 74, 86 [255] Snasel, V., Horak, Z., Abraham, A.: Understanding social networks using formal concept analysis. In: Proc. of the International Conference on Web Intelligence and Intelligent Agent Technology, vol. 3, pp. 390–393. IEEE (2008) 47 [256] Snasel, V., Horak, Z., Kocibova, J., Abraham, A.: Reducing social network dimensions using matrix factorization methods. In: International Conference on Advances in Social Network Analysis and Mining, pp. 348–351. IEEE (2009) 13, 47 [257] Soderland, S.: Learning information extraction rules for semi-structured and free text. Machine Learning 34(1), 233–272 (1999) 30 [258] Song, X., Chi, Y., Hino, K., Tseng, B.: Identifying opinion leaders in the blogosphere. In: Proc. of the 16th Conference on Information and Knowledge Management, pp. 971–974. ACM (2007) 48 [259] Sridhar, V., Narasimha Murty, M.: Knowledge-based clustering approach for data abstraction. Knowledge-Based Systems 7(2), 103–113 (1994) 116, 139 [260] Staab, S., Domingos, P., Mika, P., Golbeck, J., Ding, L., Finin, T., Joshi, A., Nowak, A., Vallacher, R.: Social networks applied. IEEE Intelligent Systems 20(1), 80–93 (2005) 66, 119 [261] Stephenson, K., Zelen, M.: Rethinking centrality: Methods and examples. Social Networks 11(1), 1–37 (1989) 116, 118 [262] Sun, J., Xie, Y., Zhang, H., Faloutsos, C.: Less is more: Compact matrix decomposition for large sparse graphs. Statistical Analysis and Data Mining 1, 6–22 (2008) 13 [263] Tanaka, M., Ishida, T.: Ontology extraction from tables on the web. In: Proc. of the International Symposium on Applications on Internet, pp. 284–290. IEEE (2006) 20 [264] Tarjan, R.: Depth-first search and linear graph algorithms. In: Conference Record 12th Annual Symposium on Switching and Automata Theory, pp. 114–121. IEEE (1971) 11 [265] Traud, A., Kelsic, E., Mucha, P., Porter, M.: Comparing Community Structure to Characteristics in Online Collegiate Social Networks. SIAM Review pp. 1–17 (2011) 86, 91 [266] Travers, J., Milgram, S.: An experimental study of the small world problem. Sociometry 32(4), 425–443 (1969) 53, 56, 57, 65, 67 [267] Trusov, M., Bucklin, R., Pauwels, K.: Effects of word-of-mouth versus traditional marketing: Findings from an internet social networking site. Journal of Marketing 73(5), 90–102 (2009) 119 [268] Turmo, J., Ageno, A., Català, N.: Adaptive information extraction. ACM Computing Surveys 38(2), 4 (2006) 30 [269] Ugander, J., Karrer, B., Backstrom, L., Marlow, C.: The anatomy of the Facebook social graph. arXiv preprint arXiv:1111.4503 (2011) 4, 48, 104, 109 [270] Viswanath, B., Mislove, A., Cha, M., Gummadi, K.P.: On the evolution of user interaction in Facebook. In: Proc. of the 2nd SIGCOMM Workshop on Social Networks (2009) 128, 145 [271] Wang, P., Hawk, W., Tenopir, C.: Users’ interaction with world wide web resources: an exploratory study using a holistic approach. Information Processing & Management 36(2), 229–251 (2000) 18 [272] Wasserman, S., Faust, K.: Social network analysis: Methods and applications. Cambridge University Press (1994) 1 [273] Watts, D.: Small worlds: the dynamics of networks between order and randomness.
Princeton University Press (2004) 66 [274] Watts, D., Strogatz, S.: Collective dynamics of small-world networks. Nature 393(6684), 440–442 (1998) ix, 57, 66, 67, 70, 71, 72, 76, 80 [275] Wei, Y., Cheng, C.: Towards efficient hierarchical designs by ratio cut partitioning. In: Proc. of the International Conference on Computer-Aided Design, pp. 298–301 (1989) 73 [276] Weikum, G.: Harvesting, searching, and ranking knowledge on the web: invited talk. In: Proc. of the 2nd International Conference on Web Search and Data Mining, pp. 3–4. ACM (2009) 26 [277] Wilson, C., Boe, B., Sala, A., Puttaswamy, K., Zhao, B.: User interactions in social networks and their implications. In: Proc. of the 4th European Conference on Computer Systems, pp. 205–218. ACM (2009) 49, 52, 53, 55 [278] Winograd, T.: Understanding natural language. Cognitive Psychology 3(1), 1–191 (1972) 20 [279] Xia, Z.: Fighting criminals: Adaptive inferring and choosing the next investigative objects in the criminal network. Knowledge-Based Systems 21(5), 434–442 (2008) 141 [280] Xia, Z., Bu, Z.: Community detection based on a semantic network. Knowledge-Based Systems, in press (2011) 116, 139 [281] Xiang, R., Neville, J., Rogati, M.: Modeling relationship strength in online social networks. In: Proc. of the 19th international conference on World wide web, pp. 981–990. ACM (2010) 107, 109 [282] Xie, J., Kelley, S., Szymanski, B.: Overlapping community detection in networks: the state of the art and comparative study. arXiv preprint arXiv:1110.5813 (2011) 154 [283] Xie, J., Szymanski, B., Liu, X.: SLPA: Uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process. In: Proc. of the Workshop on Data Mining Technologies for Computational Collective Intelligence (2011) 154 [284] Xu, Y., Weng, J., Sharma, A., Yussupov, D.: Web content acquisition in web content aggregation service based on digital earth geospatial framework. In: Proc. of the 19th International Conference on Geoinformatics, pp. 1–5. IEEE (2011) 3 [285] Yang, W.: Identifying syntactic differences between two programs. Software: Practice and Experience 21(7), 739–755 (1991) 36 [286] Ye, S., Lang, J., Wu, F.: Crawling Online Social Graphs. In: Proc. of the 12th International Asia-Pacific Web Conference, pp. 236–242. IEEE (2010) 45, 46, 52 [287] Zachary, W.: An information flow model for conflict and fission in small groups. Journal of Anthropological Research 33(4), 452–473 (1977) 66 [288] Zanasi, A.: Competitive intelligence through data mining public sources. Competitive Intelligence Review 9(1), 44–54 (1998) 23 [289] Zhai, Y., Liu, B.: Web data extraction based on partial tree alignment. In: Proc. of the 14th international conference on World Wide Web, pp. 76–85. ACM (2005) 29, 35, 36 [290] Zhai, Y., Liu, B.: Structured data extraction from the web based on partial tree alignment. IEEE Transactions on Knowledge and Data Engineering 18(12), 1614–1628 (2006) 29 [291] Zhao, H.: Automatic wrapper generation for the extraction of search result records from search engines. Ph.D. thesis, State University of New York at Binghamton (2007). Adviser-Meng, Weiyi 19 [292] Zhao, J., Wu, J., Xu, K.: Weak ties: Subtle role of information diffusion in online social networks. Physical Review E 82(1), 016105 (2010) 108 [293] Zhao, Y., Levina, E., Zhu, J.: Community extraction for social networks. In: Proc.
of the Joint Statistical Meetings (2011) 86 [294] Zhou, S., Mondragón, R.: The rich-club phenomenon in the internet topology. IEEE Communications Letters 8(3), 180–182 (2004) 110
Declaration
I herewith declare that I have produced this paper without the prohibited assistance of third parties and without making use of aids other than those specified; notions taken over directly or indirectly from other sources have been identified as such. This Thesis has not previously been presented in identical or similar form to any other Italian or foreign examination board. The thesis work was conducted from January 2009 to December 2011 under the supervision of Prof. Giacomo Fiumara at the University of Messina.