Mining and Analysis of
Online Social Networks
Emilio Ferrara
Department of Mathematics
University of Messina
Supervisor
Prof. Giacomo Fiumara
A thesis submitted for the degree of
Philosophiæ Doctor (PhD) in Mathematics
February 2012
1. Reviewer: Dr. Robert Baumgartner, Vienna Technische Universität
2. Reviewer: Dr. Haixuan Yang, Royal Holloway University of London
Day of the defense: 26 March 2012
Abstract
Social media and, in particular, Online Social Networks (OSNs) acquired a huge
popularity and represent one of the most important social and Computer Science
phenomena of these years.
This dissertation presents a comprehensive study of the process of mining information from Online Social Networks and analyzing the structure of the networks themselves. To this purpose, several methods are adopted, ranging from Web Mining techniques to graph-theoretical models and, finally, statistical analysis of network features, from both a quantitative and a qualitative perspective.
The origin, distribution and sheer size of the data involved make many existing methods either moot or inapplicable at the required scale. New methods are therefore proposed and their effectiveness is assessed against relevant data samples.
The content of the present dissertation can be organized into three main parts:
(i) In the first we discuss the problem of mining Web sources from an algorithmic perspective; different techniques, largely adopted in Web data extraction tasks, are discussed and a novel approach to refine the process of automatic extraction of information from Web pages is presented, which forms the core of a platform for sampling data from OSNs.
(ii) The second part of this Thesis discusses the analysis of a large dataset acquired from the most representative (and largest) OSN to date: Facebook. This platform gathers hundreds of millions of users, and its modeling and analysis are possible by means of Social Network Analysis techniques. The investigation of topological features is also extended to other OSN datasets available online to the scientific community. Several features of these networks, such as the well-known small-world effect, scale-free distributions and community structure, are characterized and analyzed. At the same time, our analysis provides quantitative clues to verify the validity of different sociological theories on large-scale social networks (for example, the six degrees of separation or the strength of weak ties). In particular, the problem of community detection on massive OSNs is redefined and solved by means of a new algorithm. This result highlights the need to define computationally efficient, even if heuristic, measures to assess the importance of individuals in the network.
(iii) The last part of the Thesis is devoted to presenting a novel, efficient measure of centrality for social networks, whose rationale is grounded in random walk theory. Its validity is assessed against massive OSN datasets, and it becomes the basis for a novel community detection algorithm which is shown to work surprisingly well in different contexts, such as social and biological network analysis.
To my family.
Without you, united in waiting for me at the finish line,
no achievement would matter.
Acknowledgements
This Thesis would not have been possible without the support of a lot of people
whose help has been fundamental during the years of my Ph.D. studies.
I mostly owe my gratitude to my Supervisor, Prof. Giacomo Fiumara. I received
his personal support during several hard periods and I would like to thank him
for his continuous efforts to show me the path to follow. This Thesis represents
my commitment to shape his brilliant and creative scientific ideas and his multidisciplinary vision on how to apply Computer Science, Physics and Mathematics to
real-world research challenges. It was an honor for me to work with him.
I am indebted to Prof. Francesco Oliveri, Head of the Ph.D. School, who believed in me from the very beginning of my Ph.D. studies and gave me the chance to pursue this goal, supporting me throughout these years. Without his trust it would
not have been possible for me to visit and stay at the Vienna Technische Universität
and at the Royal Holloway University of London as a Ph.D. visiting student.
During my studies I had the pleasure to work with two fantastic persons of the
Computer Science research group of my University: Prof. Alessandro Provetti and
Prof. Pasquale De Meo. I owe my deepest gratitude to Prof. Provetti, who showed me the importance of establishing international contacts with the research community and of spending time studying abroad. In particular, I would like to thank him
for his efforts to give me the opportunity to stay in Vienna and London.
Prof. De Meo is simply one of the most talented scientists I have ever met. With his
brilliant intuitions and hard work he greatly contributed to several topics discussed
in this Thesis. It was a pleasure for me to work with him, and without his support I would not even have faced some scientific problems whose solution appeared hard to me.
During 2010 I had the chance to spend four months in Vienna, collaborating with
the DBAI group of the Vienna Technische Universität and with the Lixto GmbH,
under the supervision of Dr. Robert Baumgartner. I owe my gratitude to him for a
number of reasons. First of all, it was a great pleasure to work with him: in just a
few months he taught me the fundamental concepts of Web data extraction and enabled me to contribute to this research field. In addition, he personally supported me on several occasions, and I am particularly grateful to him since he agreed to be a Reviewer of this Thesis, investing a significant amount of his time in revising the work and providing precious suggestions to improve it. I would
also like to express my gratitude to several people that in different ways helped me
during this experience in Vienna. From the DBAI group I am particularly grateful to
Ruslan Fayzrakhmanov and Dr. Bernhard Kruepl, whose suggestions helped me to
speed up my research activity. From Lixto, Gerald Ledermueller and Serkan Avci supported me during the initial “bootstrap” period, helping me understand the Lixto framework for Web data extraction and the rationale behind its functioning.
During the end of 2011 and the beginning of 2012 I spent four months at the
Royal Holloway University of London, working under the supervision of Dr. Alberto
Paccanaro in the Centre for Systems and Synthetic Biology at the Department
of Computer Science. I had the pleasure of joining a wonderful group of young
scientists led by a fantastic person. Dr. Paccanaro is amongst the most brilliant and incredible persons I have ever met in Academia. During such a short period I was exposed to an impressive number of new ideas, and his research group reflects the true passion he puts into everything he does, including research in Computational
Biology. I had the honor to collaborate with the colleagues of the PaccanaroLab,
in particular with Dr. Alfonso E. Romero, Dr. Haixuan Yang, Dr. Prajwal Bhat,
Sandra Smieszek and Horacio Caniza. I am particularly indebted to Dr. Yang, who offered to be a Reviewer for this Thesis: his suggestions helped me to improve the quality of this work, in particular regarding the last part of the Thesis. I am glad to have had the chance to work with Dr. Romero and Dr. Bhat in a number of research projects, and I am grateful to both of them for their precious help, without which it would not have been possible for me to grasp the fundamental concepts of Bio-informatics and Computational Biology in such a short period of time.
I would like to express my gratitude for their work to all my other coauthors:
Francesco Pagano, Salvatore Catanese, Dr. Angela Ricciardello, Dr. Giovanni
Quattrone, Dr. Licia Capra, Prof. Domenico Ursino, Dr. Fabian Abel, Prof. Lora
Aroyo, Prof. Geert-Jan Houben. Without their efforts and precious contributions
and ideas, all the work done during my Ph.D. studies would not have been possible.
I owe my gratitude to all my colleagues of the Ph.D. School who gladdened these years of studies in Messina, and to all my friends who supported me even during those periods in which I was too absorbed by my work to reciprocate.
I dedicate this Thesis to my family.
To my parents, who taught me what it means to set a goal and to work hard to achieve it. Who have always supported and encouraged me throughout my studies, showing me that where there is a will, there is a way.
To my sister, the most brilliant person I have ever known.
A bright future awaits her.
Contents

List of Figures

List of Tables

1 Introduction

2 Fundamentals
  2.1 Formal Conventions
  2.2 Graph Theory
    2.2.1 Notion of Graph and Main Properties
    2.2.2 Centrality Measures

3 Information Extraction from Web Sources
  3.1 Background and Related Literature
  3.2 Web Data Extraction Systems
    3.2.1 Definition
    3.2.2 Classification Criteria
  3.3 Applications
    3.3.1 Enterprise Applications
    3.3.2 Social Applications
    3.3.3 A Glance on the Future
  3.4 Techniques
    3.4.1 Used Approaches
    3.4.2 Wrappers
    3.4.3 Semi-Automatic Wrapper Generation
    3.4.4 Automatic Wrapper Generation
    3.4.5 Wrapper Induction
    3.4.6 Wrapper Maintenance
  3.5 Automatic Wrapper Adaptation
    3.5.1 Primary Goals
    3.5.2 Details
    3.5.3 Simple Tree Matching
    3.5.4 Weighted Tree Matching
    3.5.5 Web Wrappers
    3.5.6 Automatic Adaptation of Web Wrappers
    3.5.7 Experimentation
    3.5.8 Discussion of Results

4 Mining and Analysis of Facebook
  4.1 Background and Related Literature
    4.1.1 Data Collection from Online Social Networks
    4.1.2 Similarity Detection
    4.1.3 Influential User Detection
  4.2 Sampling the Facebook Social Graph
    4.2.1 The Structure of the Social Network
    4.2.2 The Sampling Architecture
    4.2.3 Breadth-first-search Sampling
    4.2.4 Uniform Sampling
    4.2.5 Data Preparation
  4.3 Network Analysis Aspects
    4.3.1 Definitions
    4.3.2 Experimentation
    4.3.3 Privacy Settings
    4.3.4 Degree Distribution
    4.3.5 Diameter and Clustering Coefficient
    4.3.6 Connected Components

5 Network Analysis and Models of Online Social Networks
  5.1 Background and Related Literature
    5.1.1 Social Networks and Models
    5.1.2 Recent Studies and Current Trends
  5.2 Features of Social Networks
    5.2.1 The “Small-World”
    5.2.2 Scale-free Degree Distributions
    5.2.3 Emergence of a Community Structure
  5.3 Models of Social Networks
    5.3.1 The Erdős-Rényi Model
    5.3.2 The Watts-Strogatz Model
    5.3.3 The Barabási-Albert Model
  5.4 Community Structure
    5.4.1 Definition of Community Structure
    5.4.2 Discovering Communities
    5.4.3 Models Representing the Community Structure
  5.5 Experimental Evaluation
    5.5.1 Description of Adopted Online Social Network Datasets
    5.5.2 Topological Properties

6 Community Structure in Facebook
  6.1 Background and Related Literature
    6.1.1 Community Detection in Literature
  6.2 Community Structure Discovery
    6.2.1 Label Propagation Algorithm
    6.2.2 Fast Network Community Algorithm
    6.2.3 Experimentation
    6.2.4 Methodology of Investigation
  6.3 Community Structure
    6.3.1 Building the Community Meta-network
    6.3.2 Meta-network Analysis
    6.3.3 Discussion of Results
  6.4 The Strength of Weak Ties
    6.4.1 Methodology
    6.4.2 Experiments

7 A Novel Centrality Measure for Social Networks
  7.1 Background and Related Literature
  7.2 Centrality Measures and Applications
    7.2.1 Centrality Measure in Social Networks
    7.2.2 Recent Approaches for Computing Betweenness Centrality
    7.2.3 Application of Centrality Measures in Social Network Analysis
  7.3 Measuring Edge Centrality
    7.3.1 Design Goals
    7.3.2 κ-Path Centrality
    7.3.3 The Algorithm for Computing the κ-Path Edge Centrality
    7.3.4 Novelties Introduced by our Approach
    7.3.5 Comparison of the ERW-Kpath and WERW-Kpath algorithms
  7.4 Experimentation
    7.4.1 Robustness
    7.4.2 Performance
    7.4.3 Analysis of Edge Centrality Distributions
  7.5 Applications of our approach
    7.5.1 Data Clustering
    7.5.2 Semantic Web
    7.5.3 Understanding User Relationships in Virtual Communities
  7.6 Fast Community Structure Detection
    7.6.1 Background
    7.6.2 Design Goals
    7.6.3 Fast κ-path Community Detection
  7.7 Experimental Results
    7.7.1 Synthetic Networks
    7.7.2 Online Social Networks
    7.7.3 Extension to Biological Networks

8 Conclusions
  8.1 Findings
  8.2 Future Work
  8.3 List of Publications

Bibliography
List of Figures

3.1 Examples of XPaths over trees, selecting one (A) or multiple (B) items.
3.2 A and B are two similar labeled rooted trees.
3.3 Robust Web object detection in Lixto VD.
3.4 Configuration of wrapper adaptation in Lixto VD.
3.5 Wrapper adaptation process.
3.6 Diagram of the Web wrapper creation, execution and maintenance flow.

4.1 Architecture of the data mining platform.
4.2 State diagram of the data mining process.
4.3 Screenshot of the Facebook visual crawler.
4.4 Node degree distribution BFS vs. UNI Facebook sample.
4.5 CCDF node degree distribution BFS vs. UNI Facebook sample.
4.6 Node degree probability distribution BFS vs. UNI Facebook sample.
4.7 Hops and diameter in Facebook.
4.8 Clustering coefficient in Facebook.
4.9 Connected components in Facebook.
4.10 Degree vs betweenness centrality in Facebook.

5.1 Generative model: Erdős-Rényi (94).
5.2 Generative model: Newman-Watts-Strogatz (219).
5.3 Generative model: Watts-Strogatz (274).
5.4 Generative model: Barabási-Albert (14).
5.5 Generative model: Holme-Kim (149).
5.6 Community structure of the Erdős-Rényi (94) model.
5.7 Community structure of the Newman-Watts-Strogatz (219) model.
5.8 Community structure of the Watts-Strogatz (274) model.
5.9 Community structure of the Barabási-Albert (14) model.
5.10 Community structure of the Holme-Kim (149) model.
5.11 Node degree distributions (log–log scale).
5.12 Effective diameters (log-normal scale).
5.13 Community structure analysis (log–log scale).

6.1 FNCA power law distributions on the “Uniform” sample.
6.2 LPA power law distributions on the “Uniform” sample.
6.3 FNCA power law distribution on the BFS sample.
6.4 LPA power law distribution on the BFS sample.
6.5 FNCA vs. LPA (UNI).
6.6 FNCA vs. LPA (BFS).
6.7 Jaccard distribution: FNCA vs. LPA (UNI).
6.8 Jaccard distribution: FNCA vs. LPA (BFS).
6.9 Heat-map: FNCA vs. LPA (UNI).
6.10 Heat-map: FNCA vs. LPA (BFS).
6.11 Meta-network representing the community structure (UNI with LPA).
6.12 Meta-network degree and clustering coefficient distribution (UNI).
6.13 Meta-network hops and shortest paths distribution (UNI).
6.14 Meta-network weights vs. strengths distribution (UNI).
6.15 Meta-network heat-map of the distribution of connections (UNI).
6.16 Distribution of strong vs. weak ties in Facebook.
6.17 CCDF of strong vs. weak ties in Facebook.
6.18 Density of weak ties among communities.
6.19 Link fraction as a function of the community size.

7.1 Example of assignment of normalized degrees and initial edge weights.
7.2 Robustness test on Wiki-Vote.
7.3 Execution time with respect to network size.
7.4 κ-paths centrality values distribution on Wiki-Vote.
7.5 κ-paths centrality values distribution on CA-HepPh.
7.6 κ-paths centrality values distribution on CA-CondMat.
7.7 κ-paths centrality values distribution on Cit-HepTh.
7.8 κ-paths centrality values distribution on Facebook.
7.9 κ-paths centrality values distribution on Youtube.
7.10 Effect of different κ = 5, 10, 20 on Wiki-Vote.
7.11 Effect of different κ = 5, 10, 20 on CA-HepPh.
7.12 Effect of different κ = 5, 10, 20 on CA-CondMat.
7.13 Effect of different κ = 5, 10, 20 on Cit-HepTh.
7.14 Effect of different κ = 5, 10, 20 on Facebook.
7.15 Effect of different κ = 5, 10, 20 on Youtube.
7.16 Normalized mutual information test using the synthetic benchmarks.
7.17 Arabidopsis Thaliana gene-coexpression network (cluster 1).
7.18 Arabidopsis Thaliana gene-coexpression network (cluster 2).
7.19 Arabidopsis Thaliana gene-coexpression network (cluster 3).
List of Tables

3.1 W and M matrices for each matching subtree.
3.2 Experimental results of automatic wrapper adaptation.

4.1 HTTP requests flow of the crawler: authentication and mining steps.
4.2 BFS dataset description (crawling period: 08/01-10/2010).
4.3 “Uniform” dataset description (crawling period: 08/11-20/2010).

5.1 Datasets and results: d(q) is the effective diameter, γ and σ, resp., the exponents of the power law node degree and community size distributions, Q the network modularity.

6.1 Results of the community detection on Facebook.
6.2 Representation of community structures.
6.3 Similarity degree of community structures.
6.4 The presence of outliers in our community structures.
6.5 Features of the meta-networks representing the community structure for the uniform sample.

7.1 Datasets adopted in our experimentation.
7.2 Analysis by using similarity coefficient J(τn), correlation ρX,Y and Euclidean distance L2(X, Y).
7.3 Results of the FKCD algorithm on the adopted datasets.
1 Introduction
The increasing popularity of Online Social Networks (OSNs) is witnessed by the huge number of
users that Facebook, Twitter, etc. acquired in a short amount of time. The growing accessibility
of the Web, through several media, allows most users a 24/7 online presence and encourages
them to build an online mesh of relationships.
As OSNs become the tools of choice for connecting people, we expect that their structure will
increasingly mirror real-life society and relationships. At the same time, with an estimated 13 million transactions per second (at peak), Facebook is one of the most challenging computer
science artifacts, posing several optimization, scalability and robustness challenges.
The essential feature of Online Social Networks is the friendship relation between participants.
It consists, mainly, of permission to consult each other’s friend lists and posted content: news, photos, links, blog posts, etc.; such permission can be mutual. In this Thesis we collect
data from OSNs and we analyze their structure adopting graph theory models; for example, we
consider the Facebook friendship network as the (undirected) graph having Facebook users as
vertices and edges representing their friendship relations.
The analysis of OSN connections is a fascinating topic on multiple levels. First, a complete
study of the structure of large real (i.e., off line) social communities was impossible or at least
very expensive before, even at fractions of the scale considered in OSN analysis. Second, OSN data are clearly delimited by structural constraints, provided by the OSN platform itself, whereas real-life relations are hard to identify precisely.
The interpretation of these data opens up new fascinating research questions, for example (i)
is it possible to study OSNs with the tools of traditional Social Network Analysis, as in (272)
and (203)? (ii) To what extent is the behavior of OSN users comparable to that of people in
real-life social networks (118)? (iii) What are the topological characteristics of the relationships
network (for example, friendship, in the case of Facebook) of OSNs (4)? (iv) And what about
their structure and evolution (169)?
To address these questions, further Computer Science research is needed to design and develop
those tools required to acquire and analyze data from massive OSNs. First, scalability is an
issue faced by anyone who wants to study a large OSN independently from the commercial
organization that owns and operates it. Moreover, proper social metrics need to be introduced,
in order to identify and evaluate features of the considered OSN. In 2010, some authors (125)
estimated the crawling overhead needed to collect the whole Facebook graph in 44 Terabytes
of data. Even when such data could be acquired and stored locally (which however raises
storage issues related to the social network compression (34, 35)), it is non-trivial to devise and
implement effective functions that traverse and visit the graph or even evaluate simple metrics.
In the literature, extensive research has been conducted on sampling techniques for large graphs;
only recently, however, studies have shed light on the bias that those methodologies may introduce (170). That is, depending on the method by which the graph has been explored, certain
features may be over- or under-represented with respect to the actual graph.
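To make the notion of an incomplete visit concrete, the following minimal Python sketch (ours; function and variable names are hypothetical, and this is not the crawler described later in this Thesis) performs a breadth-first exploration of a toy graph under a fixed node budget. Stopping before the frontier is exhausted is precisely the situation in which BFS-like crawls may over-represent high-degree regions of the graph.

from collections import deque

def bfs_sample(adj, seed_node, max_nodes):
    # Breadth-first exploration with a node budget: an incomplete visit
    # of the graph, as performed by BFS-based crawlers of large OSNs.
    visited = {seed_node}
    queue = deque([seed_node])
    edges = []
    while queue and len(visited) < max_nodes:
        u = queue.popleft()
        for v in adj[u]:
            edges.append((u, v))
            if v not in visited and len(visited) < max_nodes:
                visited.add(v)
                queue.append(v)
    return visited, edges

# Toy usage: explore at most 4 nodes starting from node 1.
adj = {1: [2, 3], 2: [1, 4], 3: [1, 4, 5], 4: [2, 3], 5: [3]}
nodes, sampled_edges = bfs_sample(adj, seed_node=1, max_nodes=4)
print(sorted(nodes))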
Our long-term research on these topics is presented in this Thesis. We describe in detail
the architecture and functioning modes of our ad hoc Web crawler designed to extract data
from Online Social Networks (such as Facebook), by which, even on modest computational
resources, we can extract large samples containing several millions of profiles and connections
among them. Two recently collected samples of Facebook containing about 8 million nodes each are described and analyzed in detail. To comply with the Facebook end-user license, data are anonymized upon extraction, hence we never store users’ sensitive data. Next, we describe similar experiments performed on different OSNs, whose datasets have been made
freely available on the Web.
Moreover, this Thesis focuses on the problem of community structure detection inside Online Social Networks. A community is formally defined as a substructure of the network of connections among users, in which the density of relationships among the members of the community is much greater than the density of connections between communities. From a structural perspective, this is reflected by a graph which is very sparse
almost everywhere but dense in local areas, corresponding to the communities.
Different motivations to investigate the community structure of a network exist. For example, it
is possible to highlight interesting properties or hidden information about the network itself. Moreover, individuals that belong to the same community may share some similarities, possibly have common interests, or be connected by a specific relationship in the real world. These aspects give rise to many commercial and scientific applications; in the first category we count, for example, marketing and competitive intelligence investigations and recommender systems. In fact, users belonging to the same community could share tastes or interests in similar products. In the latter, models of disease propagation and information diffusion have been largely
investigated in the context of social networks.
The two different samples of Facebook we collected have been analyzed in order to detect
and describe the underlying community structure of this Online Social Network, highlighting
its features with respect to existing mathematical models that try to describe the community
structure of social networks. Our findings show that the community structure of the network
emerges both from a quantitative and a qualitative perspective.
In this panorama, not only from a scientific perspective but also for commercial or strategic
motivations, the identification of the principal actors inside a network or inside a community is
very important. Such an identification requires defining an importance measure (also called centrality) and ranking nodes and/or edges of the network graph on the basis of such a measure.
The simplest approaches for computing centrality consider only the local topological properties
of a node/edge in the social network graph: for instance, the most intuitive node centrality
measure is represented by the degree of a node, i.e., the number of social contacts of a user.
Unfortunately, local measures of centrality, whose estimation is computationally feasible even on
large networks, do not produce very faithful results (56). Due to these reasons, many authors
suggested to consider the whole social network topology to compute centrality values. This
consideration generated a new family of centrality measures, called global measures. Some
examples of global centrality measures are closeness (248) and betweenness centrality (for nodes
(112), and edges (124)). Unfortunately, the problem of computing the exact value of centrality
for each node/edge of a given graph is computationally demanding – or even unfeasible – as
the size of the analyzed network grows. Therefore, the need to define fast, even if heuristic, techniques to compute centrality arises, and this is currently a relevant research topic in Social
Network Analysis.
For this reason, the last part of this Thesis is devoted to introducing a novel measure of centrality
for edges of a social network. This measure is called κ-path edge centrality. In our approach, the
procedure of computing the edge centrality is viewed as an information propagation problem.
In detail, if we assume that multiple messages are generated and propagated within a social
network, an edge is considered as “central” if it is frequently exploited to diffuse information.
Relying on this idea, we simulate message propagations through random walks on the social
network graphs. In addition, we assume that random walks are simple and of bounded length
up to a constant and user-defined value κ. The former assumption is made because loops should not be allowed, in order to avoid messages getting trapped; the latter because, as in (115), we
assume that the more distant two nodes are, the less they influence each other.
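As a purely illustrative sketch of this intuition (ours; function names are hypothetical, and this is not the WERW-Kpath algorithm detailed in Chapter 7), one can approximate such an edge centrality by repeatedly running simple random walks of length at most κ and counting how often each edge is traversed:

import random
from collections import defaultdict

def approximate_kpath_edge_centrality(adj, kappa=5, n_walks=10000, seed=42):
    # Estimate edge centrality by counting traversals of simple random
    # walks of bounded length kappa (illustrative sketch only).
    rng = random.Random(seed)
    nodes = list(adj)
    counts = defaultdict(int)
    for _ in range(n_walks):
        current = rng.choice(nodes)          # random source of the "message"
        visited = {current}                  # walks must be simple (no loops)
        for _ in range(kappa):
            candidates = [v for v in adj[current] if v not in visited]
            if not candidates:
                break
            nxt = rng.choice(candidates)     # propagate the message one hop
            edge = tuple(sorted((current, nxt)))
            counts[edge] += 1                # the edge was used to diffuse information
            visited.add(nxt)
            current = nxt
    # Normalize traversal counts by the number of simulated walks.
    return {e: c / n_walks for e, c in counts.items()}

if __name__ == "__main__":
    # Toy undirected graph as an adjacency list.
    adj = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 4], 4: [2, 3, 5], 5: [4]}
    scores = approximate_kpath_edge_centrality(adj)
    for edge, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(edge, round(score, 3))

Edges that lie on many short propagation routes (here, those around vertices 2, 3 and 4) accumulate higher normalized counts, which is the sense in which they are "frequently exploited to diffuse information".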
The main contributions of this Thesis, therefore, can be summarized as follows:
1. In Chapter 2 we introduce some fundamental concepts that will be widely adopted
throughout the Thesis. First, we define some formal mathematics conventions used to
define the terminology, notations and a few other mathematical devices typical of the
graph theory. Moreover, we formalize the characteristics of a graph and its properties.
Thus, we introduce some of the metrics for measuring the characteristics of a graph.
2. Chapter 3 is intended as a brief survey on the problem of the extraction of information
from Web sources, in particular concerning fields of application, approaches and techniques
developed over the years. Particular attention is given to the problems related to the extraction of information from Web Social Media, and in particular from Online Social Networks. Most of the techniques discussed have been applied in order to devise a platform to extract information from Online Social Networks in an automatic and
robust way.
Contribution and Impact (1)
• Our research line on the automatic extraction of information from Web sources
focused in particular on the problem of the automatic adaptation of the procedures of
data extraction. In this context we devised a novel algorithmic solution that improves
the state-of-the-art of the algorithms for the comparison of tree data structures.
• Our research results have been published in Lecture Notes in Computer Science (104),
as a book chapter (102) and presented in the context of a conference of Artificial
Intelligence (103). Moreover, a brief survey on the state-of-the-art in the discipline
of the Web data extraction has been compiled and is currently under review (106).
• Our technique of automatic wrapper adaptation has been harnessed in commercial
products such as Lixto1 and in the context of Web content acquisition to build
digital earth geo-spatial platforms (284).
1 http://www.lixto.com
3. In Chapter 4 we discuss the architecture of the Web mining platform (or crawler) that we devised, which allowed us to extract different samples of the Facebook social network and to analyze the topological features of this graph. In particular, we investigate two different techniques of Web mining, the first based on the concept of visual extraction and the second based on a more efficient but less accurate sampling procedure. Moreover, two different sampling algorithms are devised and applied. The first one, often referred to as uniform sampling, is a rejection-based sampling algorithm that is known, by construction, to produce an unbiased sample – we used this sample as a ground truth. The second algorithm is the well-known breadth-first traversal, which has recently been reported to possibly introduce bias in the case of incomplete visits (170). We verified the structural differences between the two acquired samples, and we highlighted the topological features that describe these samples with respect to mathematical models
often adopted as generative models for social networks.
Contribution and Impact (2)
• Our research activity on mining and analyzing social networks has been published
in the context of international conferences on Web mining (57, 58) and as a book
chapter (59).
• The published papers have been reported and cited on different occasions, such as
in a large scale verification of the topological features of the Facebook social graph
(269) and in an important study of the verification of the strength of weak ties theory
(136).
• Our techniques of social network mining and analysis have been applied to develop
a tool that has been adopted in different contexts, such as the forensic analysis of call networks (60). Some studies on the similarity of Facebook users have
been also presented as a book chapter (78).
• The datasets acquired during the mining of the Facebook social network have been
released in an anonymized format, freely available to the research community for further study.
4. The findings presented in the previous point are extended in Chapter 5 to other Online Social Networks. In particular, special attention is given to the problem of characterizing the scale-free degree distribution, the small-world property and the composition of the community structure of social networks other than Facebook, such as Arxiv, Youtube and Wikipedia. Different mathematical models are compared against real-world data, highlighting the lack of models that incorporate all the features of actual data and the consequent difficulty of formalizing with mathematical methods those features that make these Online Social Networks unique. This highlights that, at the moment, it is still necessary to collect data from Web sources in order to correctly analyze these networks, and leaves room for further research in the area of mathematical modeling of social networks.
Contribution and Impact (3)
• The analysis of the topological features of different social networks has been published
in the journal Communications in Applied and Industrial Mathematics (105).
• Some further investigation of the behavior of social network users across different
social systems has been presented in the context of the journal ACM Transactions
on Intelligent Systems and Technology (77).
5. Chapter 6 presents our findings regarding the community structure of the Facebook social
network. First of all, we prove that the community structure of this Online Social Network
presents a clear power law distribution of community sizes. This result is independent of the algorithm adopted to discover the community structure and, to a lesser extent, of the sampling methodology adopted to collect the datasets. As far as the qualitative analysis of results is concerned, we also show that this community structure is well defined. We finally investigate the validity of the strength of weak ties theory on this social network. It is well known that this theory is strictly related to the community structure of a network, and our findings support this aspect by providing quantitative evidence.
Contribution and Impact (4)
• Our research study on the characteristics of the community structure of the Facebook
social network has been published in the International Journal of Social Network Mining (99) and submitted to the journal PLoS ONE (101).
• The data on the community structure of Facebook have been released and have been
exploited during the research activity of different international groups, such as the
Information Systems Group of the University of Oxford.
6. In Chapter 7 we propose an approach based on random walks to compute edge centrality
and we present an algorithm to efficiently compute an approximation of the proposed
measure. We provide results of the performed experimentation, showing that our approach
is able to generate reproducible results even if it relies on random walks. Concluding,
we discuss a possible application of this measure to devise a novel community detection
algorithm.
Contribution and Impact (5)
• The contribution to the research community of the novel measure of κ-path edge
centrality has been presented in the context of the journal Knowledge-based Systems
(81). Moreover, the community detection algorithm based on this technique has
been discussed at an international conference on intelligent systems (79). Some further extensions have been applied to improve recommendation in the context of social network platforms that allow tagging resources, whose results have been presented at the same conference (80).
• A clustering algorithm based on the concept of κ-path edge centrality has also been successfully applied in the context of Bio-informatics, in particular for the clustering of gene co-expression networks (100), i.e., those networks representing the correlation among a set of genes that cooperate, in response to a given event happening in the cell, in order to produce a response. The affinity of this problem with the problem of finding clusters of users (i.e., communities) inside a social network is
straightforward.
7. Finally, in Chapter 8 the Thesis concludes, showing some future directions of research.
2 Fundamentals
In this Chapter we introduce some fundamental concepts that will be widely adopted throughout the Thesis. First, we recall some formal mathematical conventions used to define the
terminology and notations, and a few other mathematical devices in Section 2.1.
Some simple definitions and notations typical of graph theory are discussed in Section 2.2. In that Section we formalize the characteristics of a graph and its properties. Then, we introduce some of the metrics for measuring the characteristics of a graph.
2.1 Formal Conventions
Here we introduce the Landau notation, adopted throughout this Thesis to describe the asymptotic behavior of functions for evaluating the computational complexity of introduced algorithms. Given two functions f : N → N and g : N → N we say that:
• f ∈ O(g) if ∃n0 ∈ N and ∃c ∈ R, c > 0, such that f(n) ≤ c · g(n) holds ∀n ≥ n0.
• f ∈ Ω(g) if g ∈ O(f ).
• f ∈ Θ(g) if ∃n0 ∈ N and ∃c0, c1 ∈ R, c0, c1 > 0, such that c0 · g(n) ≤ f(n) ≤ c1 · g(n) holds ∀n ≥ n0.
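For instance (a worked example of ours, not from the original text): taking f(n) = 3n² + 5n, one can choose n0 = 5 and c = 4, since 5n ≤ n² for every n ≥ 5 and hence f(n) ≤ 4n² ∀n ≥ 5; therefore f ∈ O(n²). Since n² ≤ f(n) also holds, f ∈ Ω(n²) and thus f ∈ Θ(n²).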
2.2 Graph Theory
A network is defined as an object composed of entities interacting with each other through some connections. The natural means to mathematically represent a network is a graph model.
2.2.1 Notion of Graph and Main Properties
In graph theory we define a graph G = (V, E) as an abstract representation of a set of objects
V , namely vertices (or nodes), and a set E of edges which connect pairs of vertices together.
Thus, we denote as V (G) such set of vertices V and, similarly, as E(G) the set of edges within
the graph G.
The numbers of vertices and edges, namely the cardinalities of V and E, are commonly represented by n and m, or denoted by |V| and |E|. Two vertices connected by an edge e are
called endvertices for e, and they are adjacent or neighbors.
Directed and undirected graphs
Graphs can be classified considering the nature of the edges: i) undirected and, ii) directed.
In undirected graphs the edges connect a pair of vertices without considering the order of the
endvertices. Given two vertices u and v, an undirected edge connecting them is denoted by
{u, v}, or, in short notation, uv or euv .
In directed graphs the edges (also called arcs) connect pairs of vertices in an ordered way, i.e., their direction is meaningful. Each edge comes out of a vertex u, namely the origin (or tail), and reaches a destination v (or head), and is represented by the notation (u, v), or in short notation uv (which is different from vu) or euv (which differs from evu).
The irreversible transformation of a directed graph G = (V, E) into its underlying undirected graph G′ = (V, E′) maintains the same set of vertices V and generates a new set of edges E′ which contains an undirected edge between two vertices u, v ∈ V if ∃(u, v) ∈ E or ∃(v, u) ∈ E,
for all the edges in E.
Multigraphs and loops
A set of edges E is a multiset if it contains multiple instances of a same edge. Two identical
instances of the same edge are called parallel edges. If a graph contains parallel edges it is called a multigraph. This may happen both for undirected and directed graphs. On the other hand, a graph is called simple if there is only one instance of each edge, so it does not contain parallel edges. If an edge connects a vertex to itself it is called a loop. Because loops create some unpleasant effects on the behavior of most of the algorithms working on graphs, the standard assumption throughout this Thesis will be that graphs do not contain loops unless
otherwise specified; a graph having this property is called loop-free.
Weighted graphs
Assigning a weight to the edges of a graph can be useful for several purposes. Let ω : E → R be a weight function defined on the edges of a graph G = (V, E), which assigns a weight to each edge e ∈ E. This property can be adopted to describe particular characteristics of edges, such as costs, e.g., for representing the physical distance between two vertices in G, the
monetary cost to travel through that particular connection, the traffic through that specific link,
etc. Moreover, weights are often used to describe the capacity of a connection, as in network
flow problems. Graphs which include a weight function ω for characterizing edges are called
weighted graphs. Unweighted graphs can be considered as a special case of weighted graphs in
which the weight function assumes the value ω(e) = 1 ∀e ∈ E.
Degrees
The notion of degree of vertices in a graph requires some distinctions among undirected and
directed graphs, weighted and unweighted graphs, and multigraphs. First of all, the degree
d(v) of a vertex v in an undirected graph G = (V, E) is equal to the number of edges e ∈ E that
have v as an endvertex. All the vertices connected to v, i.e. the neighborhood, are denoted
by N (v). For directed graphs, the concept of degree is split into two categories: i) out-degree
and, ii) in-degree. The out-degree of a vertex v in a directed graph G = (V, E) represents the
number of edges e ∈ E that have v as tail, and its notation is d+ (v). Similarly, the in-degree of
the same vertex is the number of edges that have v as head, and it is denoted by d− (v).
In weighted graphs, the notion of degree is calculated taking the weight of each considered edge,
instead of just counting their number. In undirected multigraphs, the value of parallel edges is
counted according to the number of instances that appear in the graph. In directed ones, the
weight of parallel edges is counted instead.
For undirected graphs we can calculate the mean degree of the graph G as

d(G) = \frac{1}{|V|} \sum_{v \in V} d(v) = \frac{2E}{V} \qquad (2.1)

We indicate with the notation ∆(G) the maximum degree of such an undirected graph, and with δ(G) the minimum degree. Thus, we define a graph as regular if all the vertices have the same degree, and k-regular if this degree is equal to k.
In the case of directed graphs we define the mean in-degree, Eq. (2.2), and the mean out-degree, Eq. (2.3), as

d_I(G) = \frac{1}{|V|} \sum_{v \in V} d_I(v) \qquad (2.2)

d_O(G) = \frac{1}{|V|} \sum_{v \in V} d_O(v) \qquad (2.3)
Density
A measure related to the degree of vertices in a graph G = (V, E) is the density, i.e., the proportion between the actual number of edges and the maximum possible number of edges with respect to the number of vertices. Since the maximum possible number of edges generated by V vertices is given by the binomial coefficient \binom{V}{2} = \frac{V(V-1)}{2}, the proportion is denoted as

\Delta = \frac{E}{\binom{V}{2}} = \frac{2E}{V(V-1)} \qquad (2.4)

and is normalized in the interval [0, 1]. If all the edges are present, the graph is said to be complete, its density is equal to 1 and all the node degrees are equal to V − 1.
There is a direct relationship between the density of a graph and the mean degree of its vertices. The sum of the degrees is equal to 2E (because each edge is counted twice), thus combining Eq. 2.1 and Eq. 2.4 we obtain

\Delta = \frac{d(G)}{V - 1} \qquad (2.5)

which defines the relationship between ∆ and d(G); thus, the density of a graph can be interpreted as the average proportion of edges incident with the vertices of the graph.
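As a quick numerical illustration of Eqs. (2.1), (2.4) and (2.5) (our own sketch, not part of the original text), the following Python fragment computes the mean degree and density of a small undirected graph and checks that ∆ = d(G)/(V − 1):

def mean_degree_and_density(edges, n_vertices):
    # d(G) = 2E/|V| (Eq. 2.1) and Delta = 2E/(V(V-1)) (Eq. 2.4)
    # for a simple undirected graph given as an edge list.
    m = len(edges)
    mean_degree = 2 * m / n_vertices
    density = 2 * m / (n_vertices * (n_vertices - 1))
    return mean_degree, density

# Triangle plus a pendant vertex: V = 4, E = 4.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
d_G, delta = mean_degree_and_density(edges, n_vertices=4)
print(d_G, delta, abs(delta - d_G / (4 - 1)) < 1e-12)  # Eq. (2.5) holds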
Subgraphs
The notion of subgraph is defined by taking a subset of vertices and edges of a graph G = (V, E), say G′ = (V′, E′) such that V′ ⊆ V, E′ ⊆ E and each e ∈ E′ has its endvertices in V′. Let W be a proper subset of the vertices V of the graph G = (V, E); then, G − W is the graph obtained by deleting the vertices contained in W and all their incident edges. Similarly, if F is a proper subset of the edges E, G − F results in the graph G′ = (V, E − F).
Walks and paths
A walk over a graph G = (V, E) is an alternating sequence of vertices and edges, such that the vertices and edges taken at each step are adjacent. A walk from the vertex v0 to vn
is denoted by the sequence v0 , e1 , v1 , . . . , en , vn , where ei = {vi−1 , vi } in the case of undirected
graphs and ei = (vi−1 , vi ) otherwise. For generic graphs, the length l(w) of a walk w is given
by the number of edges contained, while for weighted graphs the weight ω(w) of a walk is
represented by the sum of the weights of each included edge. A walk is defined as a path if
all the included edges ei are distinct, and is defined as simple if it does not include repeated
vertices. If the starting vertex coincides with the destination of the path, it is called a cycle. A
cycle with no repeated vertex is called a simple cycle. A path always exists between any pair of vertices in a connected (see further) undirected graph.
Random walks and Markov chains
We recall the definition of discrete probability space, Markov chains and random walks as in
(49).
Let (Ω, P) be a discrete probability space, where Ω is a (non-empty) finite or countably infinite set and P is a mapping from the power set ℘(Ω) of Ω to the real numbers such that:
• P(A) ≥ 0, ∀A ⊆ Ω;
• P(Ω) = 1;
• P(\bigcup_{i \in N} A_i) = \sum_{i \in N} P(A_i), for all sequences (A_i)_{i \in N} of pairwise disjoint sets from ℘(Ω).
Ω is called a sample space and any subset of Ω is called an event.
Let X be a random variable, i.e., a mapping from the sample space Ω to the real numbers, whose image is denoted by I_X = X(Ω).
The expected value of a random variable X is E(X) = \sum_{\omega \in \Omega} X(\omega) \cdot P(\omega).
Let (X_t)_{t \in N_0} be a sequence of random variables X_t with I_{X_t} ⊆ S, where S is a state set, and let q_0 be an initial distribution that maps S to R^+_0 and satisfies \sum_{s \in S} q_0(s) = 1. Such a sequence is said to be a Markov chain iff it satisfies the so-called Markov condition, that is, ∀t > 0, ∀I ⊆ {0, 1, . . . , t − 1} and ∀i, j, s_k ∈ S,

P(X_{t+1} = j | X_t = i, ∀k ∈ I : X_k = s_k) = P(X_{t+1} = j | X_t = i)

holds true.
A random walk on a simple directed graph G = (V, E) is a Markov chain with S = V and

P(X_{t+1} = v | X_t = u) = \begin{cases} \frac{1}{d^+(u)} & \text{if } (u, v) \in E \\ 0 & \text{otherwise.} \end{cases}
At each step, the random walk selects a random outgoing edge e from the current vertex and
moves to the destination vertex reached by means of e. In an analogous fashion it is possible
to define a random walk on undirected graphs.
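A minimal sketch (ours, purely illustrative) of such a random walk on a directed graph, where at each step the next vertex is chosen uniformly among the out-neighbors of the current one, i.e., with probability 1/d⁺(u):

import random

def random_walk(out_adj, start, steps, seed=0):
    # Simulate a random walk on a directed graph: at each step move to a
    # uniformly chosen out-neighbor of the current vertex (probability
    # 1/d+(u)); stop early if the current vertex has no outgoing edges.
    rng = random.Random(seed)
    path = [start]
    current = start
    for _ in range(steps):
        neighbors = out_adj.get(current, [])
        if not neighbors:
            break
        current = rng.choice(neighbors)
        path.append(current)
    return path

out_adj = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
print(random_walk(out_adj, 'a', steps=6))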
Connected components
An undirected graph G = (V, E) is defined as connected if there exists a path connecting all the pairs of vertices within the graph. Otherwise, the graph is called disconnected. It is possible to induce a connected subgraph from a disconnected graph. If the graph G = (V, E) is disconnected, a subgraph G′ = (V′, E′) is defined as a connected component if it is a connected subgraph of G. Moreover, G′ is defined as the largest connected component if it is the maximal connected subgraph that can be induced over G. It is possible to evaluate whether an undirected graph is connected or, otherwise, to calculate the largest connected component by adopting two classical algorithms, namely depth-first search (DFS) and breadth-first search (BFS) (72), with cost O(n + m).
Directed graphs are defined as strongly connected if there exists a directed path connecting each pair of vertices. Similarly to the undirected case, it is possible to induce the strongly connected components of a directed graph by finding its strongly connected subgraphs (the largest strongly connected component being the maximal one). It is possible to find the strongly connected components of a directed graph by adopting the improved DFS algorithm (264). If the underlying undirected graph is connected, the original directed graph is called weakly connected.
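For illustration (our own sketch), the connected components of an undirected graph can be computed with BFS in O(n + m) time as follows:

from collections import deque

def connected_components(adj):
    # Return the connected components of an undirected graph
    # (adjacency-list dict) using breadth-first search.
    seen = set()
    components = []
    for source in adj:
        if source in seen:
            continue
        component = {source}
        queue = deque([source])
        seen.add(source)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components

adj = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
print(max(connected_components(adj), key=len))  # largest connected component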
Shortest paths and single-source shortest paths
For a weighted graph, the weight ω(p) of a path p is defined as the sum of the weights of the
edges included in p. It is possible to find the path between a pair of vertices u and v with
the minimal weight, namely the shortest path with respect to the weight function ω as the
one with the smallest weight among all the paths connecting u and v. For unweighted graphs,
the shortest path is simply the path which includes the smallest number of edges.
Given a graph G = (V, E), a weight function ω : E → R and a source vertex s ∈ V , the
single-source shortest path problem (SSSP) is solved computing all the shortest paths from s
to any other vertex in G. SSSP can be solved with the classical Dijkstra’s algorithm (87) in O(m + n log n) time if the graph does not contain negative-weight edges; otherwise, the Bellman-Ford algorithm (72) can be used (provided there are no negative-weight cycles), with cost O(mn). For a given unweighted graph the
problem can be solved using the BFS algorithm in O(m + n) time (72).
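As an illustration of the unweighted case (our own sketch), SSSP can be solved with BFS, where dist[v] is the number of edges on a shortest path from the source to v:

from collections import deque

def bfs_shortest_paths(adj, source):
    # Single-source shortest paths on an unweighted graph via BFS.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist  # unreachable vertices are absent (distance infinite)

adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2], 5: []}
print(bfs_shortest_paths(adj, 1))  # {1: 0, 2: 1, 3: 1, 4: 2}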
Geodesic, distance and eccentricity
The geodesic is defined as the shortest path between two vertices u and v in a graph G =
(V, E). The geodesic distance or simply the distance between these two vertices is defined as
the number of edges in the geodesic connecting them (44). If there is no path between two
vertices, the geodesic distance between them is infinite (in a disconnected graph there exists at least one pair of vertices whose distance is infinite). The eccentricity of a vertex
v is represented by the greatest geodesic distance between v and any other vertex in G. The
eccentricity can range in the interval [1, V − 1]. The diameter and several measures of centrality
(see further) are based on the concept of eccentricity, such as the center and the centroid of a
graph.
Diameter and all-pairs shortest paths
The concept of diameter of a graph is important because it quantifies how far apart the farthest
two vertices in such a graph are. This information is fundamental for several applications of
networks (see for example (7, 38, 67)). The diameter D of a graph G = (V, E) is equal to the maximum eccentricity over all vertices of the graph. In other words, it is the
greatest distance between any pair of vertices. The diameter can range in the interval [1, V − 1].
In a disconnected graph the diameter is infinite, but it is possible to find the diameter of its
largest connected component. Similarly, it is possible to find the diameter of a subgraph. To
find the value of the diameter, the so-called problem of the all-pairs shortest paths (APSP) in a
graph must be solved. The greatest value among the APSP is the diameter of the graph. The
APSP problem can be solved using classical algorithms such as the Floyd-Warshall (72) in O(n3 )
time, by solving n times the SSSP problem, with a cost of O(mn + n2 log n) or, in the special
case of unweighted undirected graphs, the Seidel algorithm (252) with cost O(M (n) log n) with
M (n) the cost of multiplying two n × n matrices containing small integers (e.g. by using the
Coppersmith-Winograd algorithm (71) whose cost is O(n2.376 )).
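For unweighted graphs, a simple (although not asymptotically optimal) way to obtain the diameter is to solve the APSP problem by running a BFS from every vertex and taking the maximum eccentricity; the following Python sketch assumes a connected graph stored as adjacency lists.

from collections import deque

def eccentricity(adj, v):
    # greatest BFS distance from v (the graph is assumed to be connected and unweighted)
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

def diameter(adj):
    # maximum eccentricity over all vertices, with overall cost O(n(n + m))
    return max(eccentricity(adj, v) for v in adj)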
Matrix representation for graphs
The information contained in a graph G = (V, E) can be stored in several ways, for example
using the matrix form. The most common way to store a graph is using the so-called sociomatrix
and the incidence matrix form. Here we describe the matrices to store simple unweighted
undirected graphs, and then a generalization to treat directed graphs and weighted graphs.
The Sociomatrix The sociomatrix, or adjacency matrix, denoted by X contains entries
which indicate whether two vertices are adjacent or not. A sociomatrix X of size n × n can
efficiently describe an unweighted undirected graph G = (V, E) containing n vertices. Rows
and columns of the sociomatrix both represent the index of each vertex in the graph, and
are labeled as 1, 2, . . . n. Each entry xij in the sociomatrix represents if the indicated pair of
vertices ni and nj are adjacent or not. Usually, there is a 1 in the (i, j)th cell if there is an edge
connecting ni and nj in the graph, or a 0 otherwise. Thus, if vertices ni and nj are adjacent
xij = 1, otherwise xij = 0. Because the graph is undirected, the matrix is symmetric with respect
to its diagonal, thus xij = xji ∀ i ≠ j. Sociomatrices are widely adopted for storing undirected
network structures because of some particular properties; for example, social networks (see
further) give rise to sparse sociomatrices, thus it is convenient to adopt techniques of compact
matrix decomposition (256, 262) for efficiently storing data.
The incidence matrix Another possible representation of an undirected graph G = (V, E)
through a matrix is called incidence matrix, usually denoted by I. It stores which edges are
incident with which vertices, indexing the former on the columns and the latter on the rows,
thus the dimension of the matrix I is V × E. A matrix entry Iij contains 1 if the vertex ni
is incident with the edge ej, or 0 otherwise. Both the incidence matrix and the sociomatrix
contain all the information required to describe the represented graph.
Generalization for directed graphs The sociomatrix form can be intuitively extended to
represent directed graphs. In this case, the sociomatrix X has elements xij equal to 1 if there
exists an arc connecting the vertex ni (tail) to the vertex nj (head), or 0 otherwise. Formally,
xij = 1 if ∃(xi , xj ) in E(G). In other words, the (i, j)th cell of X contains 1 only if the
directed edge eij connects vertices ni and nj ; because the graph is directed, the entry in xij
may be different from the entry in xji . Thus, a sociomatrix representing directed graphs is not
symmetric.
Generalization for weighted graphs Sociomatrices can also be adapted to represent
weighted graphs. The entry in the cell xij represents the weight ω(eij) associated with the edge eij
connecting vertices ni and nj, both for undirected and directed graphs. The sociomatrices for
weighted graphs have the same properties as the related unweighted version; thus, a sociomatrix
representing a weighted undirected graph is symmetric with respect to its diagonal.
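A minimal sketch of how a sociomatrix and an incidence matrix could be built from an edge list follows; the plain list-of-lists representation and the vertex labels 0, ..., n−1 are assumptions made here for illustration.

def sociomatrix(n, edges):
    # adjacency (socio)matrix of an unweighted undirected graph on vertices 0..n-1
    X = [[0] * n for _ in range(n)]
    for i, j in edges:
        X[i][j] = 1
        X[j][i] = 1  # the graph is undirected, hence the matrix is symmetric
    return X

def incidence_matrix(n, edges):
    # V x E matrix: entry [v][e] is 1 iff vertex v is incident with edge e
    I = [[0] * len(edges) for _ in range(n)]
    for e, (i, j) in enumerate(edges):
        I[i][e] = 1
        I[j][e] = 1
    return I

# a weighted graph would store w(e_ij) in place of 1; a directed graph would set only X[i][j]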
2.2.2
Centrality Measures
Within graph theory and network analysis, there are various measures of the centrality of
a vertex that determine its relative importance within the graph.
Four measures of centrality are widely used in network analysis: degree centrality,
closeness, betweenness, and eigenvector centrality.
Degree centrality
The most intuitive measure of centrality of a vertex in a network is called degree centrality.
Given a graph G = (V, E) represented by means of its adjacency matrix A, in which a given
entry Aij = 1 if and only if i and j are connected by an edge, and Aij = 0 otherwise, the degree
centrality CD (vi ) of a vertex vi ∈ V is defined as
C_D(v_i) = d(v_i) = \sum_j A_{ij}    (2.6)
The idea behind the degree centrality is that the importance of a vertex is determined by the
number of vertices adjacent to it, i.e., the larger the degree, the more important the vertex is.
In real-world networks only a small number of vertices have high degrees; the degree centrality is thus a rough measure, but it is adopted very often because of the low computational
cost required for its computation.
There exists a normalized version of the degree centrality, defined as follows

C'_D(v_i) = \frac{d(v_i)}{n - 1}
where n represents the number of vertices in the network.
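A minimal sketch of the computation of Equation (2.6), and of its normalized version, from a sociomatrix stored as a list of lists (an illustrative assumption) follows.

def degree_centrality(X, normalized=False):
    # X: sociomatrix; the degree of vertex i is the sum of the i-th row (Eq. 2.6)
    n = len(X)
    degrees = [sum(row) for row in X]
    if normalized:
        return [d / (n - 1) for d in degrees]
    return degrees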
Closeness centrality
A more accurate measure of centrality of a vertex is represented by the closeness centrality
(248).
The closeness centrality relies on the concept of average distance, defined as

D_{avg}(v_i) = \frac{1}{n - 1} \sum_{j \neq i} g(v_i, v_j)
where g(vi , vj ) represents the geodesic distance between vertices vi and vj .
The closeness centrality C_C(v_i) of a vertex v_i is defined as

C_C(v_i) = \left[ \frac{1}{n - 1} \sum_{j \neq i} g(v_i, v_j) \right]^{-1}    (2.7)
In practice, the closeness centrality quantifies the importance of a vertex based on how close the given
vertex is to the other vertices. “Central” vertices, with respect to this measure, are important
as they can reach the whole network more quickly than non-central vertices.
Different generalizations of this measure for weighted and disconnected graphs have been
proposed in (223).
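A minimal sketch of Equation (2.7) for a connected unweighted graph follows; the geodesic distances g(v_i, v_j) are obtained through a BFS, and the adjacency-list representation is an assumption made here.

from collections import deque

def closeness_centrality(adj, v):
    # reciprocal of the average geodesic distance of v from the other n - 1 vertices (Eq. 2.7)
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    n = len(adj)
    return (n - 1) / sum(dist[u] for u in dist if u != v)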
Betweenness centrality
A more complex measure of centrality is the betweenness centrality (112, 113). It relies on the
concept of shortest paths, previously introduced. In detail, in order to compute the betweenness
centrality of a vertex, it is necessary to count the number of shortest paths that pass across the
given vertex.
The betweenness centrality C_B(v_i) of a vertex v_i is computed as

C_B(v_i) = \sum_{v_s \neq v_i \neq v_t \in V} \frac{\sigma_{st}(v_i)}{\sigma_{st}}    (2.8)
where σst is the number of shortest paths between vertices vs and vt and σst (vi ) is the number
of shortest paths between vs and vt that pass through vi .
Vertices with high values of betweenness centrality are important because they maintain an efficient
way of communication inside a network and foster information diffusion.
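Equation (2.8) can be evaluated efficiently with Brandes' algorithm (a standard technique, named here as an addition and not the formulation adopted in the cited works); a minimal sketch for unweighted undirected graphs, under the usual adjacency-list assumption, is the following.

from collections import deque

def betweenness_centrality(adj):
    CB = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting the geodesics sigma and recording the predecessors
        stack, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # back-propagation of the pair dependencies
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w])
            if w != s:
                CB[w] += delta[w]
    # on undirected graphs each unordered pair {s, t} is counted twice
    return {v: c / 2.0 for v, c in CB.items()}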
Eigenvector centrality
Another way to assign the centrality to a vertex is based on the idea that if a vertex has many
central neighbors, it should be central as well. This measure is called eigenvector centrality and
establishes that the importance of a vertex is determined by the importance of its neighbors.
The eigenvector centrality C_E(v_i) of a given vertex v_i is

C_E(v_i) \propto \sum_{v_j \in N_i} A_{ij} C_E(v_j)    (2.9)

where N_i is the neighborhood of the vertex v_i; in matrix form, x \propto Ax, which implies Ax = \lambda x. The
centrality corresponds to the leading eigenvector of the adjacency matrix A.
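In practice the leading eigenvector can be approximated by power iteration; a minimal sketch, assuming a connected undirected graph given by its sociomatrix, follows.

def eigenvector_centrality(X, iterations=100):
    # repeatedly apply x <- Ax and rescale; x converges to the leading eigenvector of A (Eq. 2.9)
    n = len(X)
    x = [1.0 / n] * n
    for _ in range(iterations):
        y = [sum(X[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return x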
Conclusion
In this Chapter we introduced the formal conventions that will be used throughout the rest of the
Thesis and the fundamental concepts of graph theory that underlie the formalization of problems related to network analysis. In detail, in the first part we discussed the notions
related to graphs, their structure and their properties; in the second part we introduced
the centrality measures. These concepts will be extensively adopted in those Chapters of this
Thesis that cover the topological analysis of different Online Social Networks.
3
Information Extraction from
Web Sources
This Chapter is intended as a brief survey on the problem of the extraction of information from
Web sources, in particular concerning fields of application, approaches and techniques developed
during recent years. Particular attention is given to the problems related to the extraction
of information from social media, and in particular from Online Social Networks.
This Chapter is structured as follows. In the first part, we focus on fields of application of Web
information extraction tools, developed with classic and novel techniques. In particular, we
discuss enterprise, social and scientific applications that are strictly interconnected with Web
data extraction tasks.
In the second part, we introduce in more detail those techniques related to the functioning
of a Web information extraction platform, discussing the concepts of Web wrappers and those
problems related to their generation and maintenance.
Summarizing, in Section 3.1 we present related works, providing references to useful surveys of
this discipline. In Section 3.2 we draw a complete profile of Web data extraction systems and
in Section 3.3 we classify fields of application of Web data extraction techniques, focusing, in
particular, on enterprise and social applications. In Section 3.4 we discuss in detail the problem
of wrapper generation, induction and maintenance, and other notable approaches. Finally, in
Section 3.5 we present our solution of automatic adaptation of Web wrappers.
3.1
Background and Related Literature
The Computer Science literature includes many valuable surveys on the Web data extraction
problem. Laender et al. (174), in 2002, presented a notable survey, offering a rigorous taxonomy
to classify Web data extraction systems. They introduced a set of criteria and a qualitative
analysis of various Web data extraction tools.
In the same year Kushmerick (173) tracked a profile of finite-state approaches to the problem,
including the analysis of wrapper induction and maintenance, natural language processing and
hidden Markov models. Kuhlins and Tredwell (168) surveyed tools to generate wrappers already
in 2003: the information may no longer be up-to-date, but the approach is still very interesting to analyze.
Again on the wrapper induction problem, Flesca et al. (108) and Kaiser and Miksch (157)
discussed approaches, techniques and tools. The latter in particular modeled a representation
of an Information Extraction system architecture.
Chang et al. (62) introduced a tri-dimensional categorization of Web data extraction systems,
based on task difficulties, techniques used and degree of automation. Fiumara (107) applied
these criteria to classify four new tools that are also presented here. Sarawagi published an
illuminating work on Information Extraction (250): anybody who intends to approach this
discipline should read it. To the best of our knowledge, the work from Baumgartner et al. (26)
is the most recent short survey on the state-of-the-art of the discipline.
3.2
Web Data Extraction Systems
3.2.1
Definition
We can generically define a Web data extraction system as a sequence of procedures that extracts
information from Web sources (174).
From this generic definition, we can infer two fundamental aspects of the problem:
• Interaction with Web pages
• Generation of a wrapper
Baumgartner et al. (26) define a Web data extraction system, as “a software extracting, automatically and repeatedly, data from Web pages with changing contents, and that delivers
extracted data to a database or some other application”.
This is the definition that better fits the modern view of the problem of the Web data extraction
as it introduces three important aspects:
• Automation and scheduling
• Data transformation, and
• Use of the extracted data
The following five points cover techniques used to solve the problem of Web data extraction.
Interaction with Web pages
The first phase of a generic Web data extraction system is the Web interaction (271): Web
sources, usually represented as Web pages, but also as RSS/Atom feeds (142), Microformats
(160) and so on, could be visited by users, both in visual and textual mode, or just simply
inputted to the system by the URL of the document(s) containing the information.
Some commercial systems, Lixto1 for first but also Kapow Mashup Server2 , include a Graphical
User Interface for fully visual and interactive navigation of HTML pages, integrated with data
extraction tools.
1 http://www.lixto.com/
2 http://kapowtech.com/
The state-of-the-art is represented by systems that support the extraction of data from pages
reached by deep Web navigation (21), i.e. simulating the activity of users clicking on DOM
elements of pages, through macros or, more simply, filling HTML forms.
These systems also support the extraction of information from dynamically generated Web
pages, usually built at run-time as a consequence of the user request, filling a template page
with data from some database. The other kind of pages are commonly called static Web pages,
because of their static content.
Generation of a wrapper
Just for now, we generically define the concept of wrapper as a procedure extracting unstructured information from a source and transforming it into structured data (153, 291). A
Web data extraction system must implement the support for wrapper generation and wrapper
execution. We will cover approaches and techniques, used by several systems, later.
Automation and scheduling
The automation of page access, localization and extraction is one of the most important features
included in recent Web data extraction systems (232): the capability to create macros to execute
multiple instances of the same task, including the possibility to simulate the click stream of the
user, filling forms and selecting menus and buttons, the support for AJAX technology (117)
to handle the asynchronous updating of the page, etc. are only some of the most important
automation features.
Scheduling is also important: e.g., if a user wants to extract data from a news website
updated every 5 minutes, many of the most recent tools let her/him set up a scheduler,
working like a cron, launching macros and executing scripts automatically and periodically.
Data transformation
Information could be wrapped from multiple sources, which means using different wrappers and
also, probably, obtaining different structures of extracted data. The steps between extraction
and delivering are called data transformation: during these phases, such as data cleaning (240)
and conflict resolution (204), users reach the target to obtain homogeneous information under
a unique resulting structure.
The most powerful Web data extraction systems provide tools to perform automatic schema
matching from multiple wrappers (239), then packaging data into a desired format (e.g. a
database, XML, etc.) to make it possible to query data, normalize structure and de-duplicate
tuples.
Use of extracted data
When the extraction task is complete and acquired data are packaged in the required format,
this information is ready to be used; the last step is to deliver the package, now represented
by structured data, to a managing system (e.g. a native XML DBMS, a RDBMS, a data
warehouse, a CMS, etc.). In addition to all the specific fields of application covered later in this
work, acquired data can be also generically used for analytical (55) or statistical purposes (30)
or simply to republish them under a structured format.
3.2.2
Classification Criteria
A taxonomy for characterizing Web data extraction tools
Laender et al. (174) presented a widely accepted taxonomy to classify systems, according to
techniques used to generate wrappers:
Languages for Wrapper Development: Before the birth of some languages specifically
studied for wrapper generation (e.g. Elog, the Lixto Web extraction language (22)) extraction
systems relied on standard scripting languages, like Perl, or general purpose programming
languages, like Java to create the environment for the wrapper execution.
HTML-aware Tools: Some tools rely on the intrinsic formal structure of the HTML to
extract data, using HTML tags to build the DOM tree (e.g. RoadRunner (74), Lixto (23)
and W4F (249)). Boronat (42) analyzed and compared performances of common Web data
extraction tools.
NLP-based Tools: Natural Language Processing techniques were born in the context of Information Extraction (IE) (29, 194, 278). They were applied to the Web data extraction problem
in order to solve specific problems such as the extraction of facts from speech transcriptions in
forums, email messages, newspaper articles, resumes etc.
Wrapper Induction Tools: These tools generate rule-based wrappers, automatically or
semi-automatically: usually they rely on delimiter-based extraction criteria inferred from formatting features (12).
Modeling-based Tools: Relying on a set of primitives to compare with the structure of the
given page, these tools can find one or more objects in the page matching the primitive items
(93, 121). A strong domain knowledge is needed, but it is a good approach for the extraction
of data from Web sources based on templates and dynamically generated pages.
Ontology-based Tools: These techniques do not rely on the page structure, but directly
on the data. Actually, ontologies can be applied successfully on specific well-known domain
applications, e.g. social networks and communities (201) or bio-informatics (151). Some works
try to apply the ontological approach to generic domains of Web data extraction (144) or to
tables (263).
Qualitative analysis criteria
Laender et al. (174) remarked that this taxonomy is not intended to strictly classify a Web
data extraction system, because it is common that some tools fit well in two or more groups.
Perhaps for this reason, they extended these criteria to include:
Degree of automation: Just determines the amount of human effort needed to run a Web data extraction tool.
Support for complex objects: Nowadays, Web pages are based on the rich-content paradigm,
so objects included in the Web sources could be complex. Only some systems can handle these
kind of data.
Page contents: Page contents can be distinguished in two categories: unstructured text and
semi-structured data. The first family fits better with NLP-based tools and Ontology-based
tools, the latter with the others.
Ease of use: Availability of a Graphical User Interface (GUI) is a must for latest-generation
tools. Platforms often feature wizards to create wrappers, WYSIWYG editor interfaces, integration with Web browsers, etc. Lixto, Denodo1, Kapow Mashup Server, WebQL2, Mozenda3,
Visual Web Ripper4 all use advanced GUI to ease the user experience.
XML output: XML is simply the standard, according to W3C5 , for the semantic Web representation of data. The capability to output the extracted data in XML format nowadays is
a requirement, at least for commercial software.
Support for Non-HTML sources: NLP-based tools fit better in this domain: this is a
great advantage, because a very large amount of data is stored on the Web in semi-structured
texts (emails, documentation, logs, etc.).
Resilience and adaptiveness: Web sources are usually updated without any forewarning,
i.e. the frequency of updates is not known a priori, thus systems that generate wrappers with
a high degree of resilience show better performances. Also the adaptiveness of the wrapper,
moving from a specific Web source to another, within the same domain, is a great advantage.
1 http://www.denodo.com/
2 http://www.ql2.com/
3 http://www.mozenda.com
4 http://www.visualwebripper.com/
5 http://www.w3.org/
3.3
Applications
On the one hand the Web is moving to semantics and enabling machine-to-machine communication: it is a slow, long-term evolution, but it has indeed started. Extracting data from Web
sources is one of the most important steps of this process, because it is the key to build a solid
level of reliable semantic information. On the other hand, Web 2.0 extends the way humans
consume the Web with social networks, rich client technologies, and the consumer as producer
philosophy. Hence, new developments put further requirements on Web data extraction rules,
including the need to understand the logic of Web applications.
In the literature of the Web data extraction discipline, many works cover approaches and
techniques adopted to solve some particular problems related to a single or, sometimes, a
couple of fields of application. The aim of this Section is to survey and analyze some of the
possible applications that are strictly interconnected with Web data extraction tasks. In the
following, we describe a taxonomy in which key application fields, heavily involved with data
extraction from Web sources, are divided into two families, enterprise applications and social
applications.
3.3.1
Enterprise Applications
We classify here software applications and procedures with a direct, subsequent or final commercial scope.
Context-aware advertising
Thanks to Applied Semantics, Inc.1 first, and Google, which later bought their 'AdSense' advertising
solution, this field has captured great attention. The main underlying principle is to present
to the final user, commercial thematized advertisements together with the content of the Web
page the user is reading, ensuring a potential increase of the interest in the ad.
This aim can be reached analyzing the semantic content of the page, extracting relevant information, both in the structure and in the data, and then contextualizing the ads content and
placement in the same page.
Contextual advertising, compared to the old concept of Web advertising, represents an intelligent approach to provide useful information to the user, statistically more interested in
thematized ads, and a better source of income for advertisers.
Customer care
Usually medium/big-sized companies, with customer support, receive a lot of unstructured information like emails, support forum discussions, documentation, shipment address information,
credit card transfer reports, phone conversation transcripts, etc.: the capability of extracting
this information eases their categorization, inferring underlying relationships, populating own
structured databases and ontologies, etc.
Currently, NLP-based techniques are the best approach to solve these problems.
1 http://www.appliedsemantics.com/
Database building
This is a key concept in the Web marketing sector: generically we can define the concept
of database building as the activity of building a database of information about a particular
domain. Fields of application are countless: financial companies could be interested in extracting
financial data from the Web, e.g. scheduling these activities to be executed automatically and
periodically. Also the real estate market is very florid: acquiring data from multiple Web sources
is an important task for a real estate company, for comparison, pricing, co-offering, etc.
Companies selling products or services probably want to compare their pricing with other competitors: the extraction of products pricing is an interesting application of Web data extraction
systems. Finally we can list other related tasks involved in the Web data extraction: duplicating an on-line database, extracting dating sites information, capturing auction information
and prices from on-line auction sites, acquiring job postings from job sites, comparing betting
information and prices, etc.
Software Engineering
Extracting data from websites became interesting also for Software Engineering: Web 2.0 is
usually strictly related to the concept of Rich Internet Applications (RIAs), Web applications
characterized by a high degree of interaction and usability, inherited from the similarity to
desktop applications. Amalfitano et al. (9) are developing a reverse engineering approach to
abstract finite states machines representing the client-side behavior offered by RIAs.
Business Intelligence and Competitive Intelligence
Baumgartner et al. (24, 25, 27) deeply analyzed how to apply Web data extraction techniques
and tools to improve the process of acquiring market information. A solid layer of knowledge is
fundamental to optimize the decision-making activities and a huge amount of public information
could be retrieved on the Web. They illustrate how to acquire this unstructured and semi-structured information; using Lixto to access, extract, clean and deliver data, it is possible to
gather, transform and obtain information useful to business purposes. It is also possible to
integrate these data with other common platforms for Business Intelligence (BI), like SAP1 or
Microsoft Analysis Services (199).
Wider, the process of gathering and analyzing information for business purposes is commonly
called Competitive Intelligence (CI), and is strictly related to data mining (145). Zanasi (288)
was the first to introduce the possibility of acquiring these data, through data mining processes,
on public domain information. Chen et al. (64) developed a platform, that works more like
a spider than like a Web data extraction system, which represents a useful tool to support
operations of CI providing data from the Web.
In BI scenarios the main requirements include scalability and efficient planning strategies to
extract as much data as possible with the smallest number of possible resources in time and
space.
The requirements for tools in the area of Web application testing are to deal well with Ajax/dynamic
HTML, to create robust test scripts, to efficiently maintain test scripts, to execute test runs
1 http://www.sap.com
and create meaningful reports, and, unlike other application areas, the support of multiple
state-of-the-art browsers in various versions is an absolute must. One widely used open source
tool for Web application testing is Selenium 1 .
3.3.2
Social Applications
One can say that social is the engine of Web 2.0: many websites evolved into Web applications
built around users, letting them create a Web of links between people, to share thoughts,
opinions, photos, travel tips, etc. Here we are mainly listing all those kinds of applications born
and grown in the Web and through the Web, thanks to User-Generated Contents (UGC), built
from users for users.
Online Social networks
Online Social Networks are the most important expression of change in the use of the World
Wide Web, and are often considered a key step of the evolution to Web 2.0: millions of people
creating a digital social structure, made of nodes (individuals, entities, groups or organizations)
and connections (i.e. ties), - representing relationships between nodes, sometimes implementing
hierarchies - sharing personal information, relying on platforms of general purpose (e.g. Facebook, MySpace, etc.) or thematized (e.g. Twitter for micro-blogging, Flickr for photo-sharing,
etc.), all sharing a common principle: on-line socialization.
Online Social Networks attracted an enormous attention, by both academic and industries, and
many works studied several aspects of the phenomenon: extracting relevant data from social
networks is a new interesting problem and field of application for Web data extraction systems.
This thesis focuses even on this problem as a part of the process of mining and analyzing data
from Online Social Networks.
Currently, no tool exists that is specifically designed to approach and solve this problem: opinions
are divided on the ethics of mining personal data from Online Social Networks. Regardless of
moral disputes, some interesting applications of the Web data extraction from Online Social
Networks are discussed in this Thesis in the following Chapters, and can be summarized as: i)
acquiring information from relationships between nodes in order to study topological features
of Online Social Networks; ii) analyzing statistical data in order to infer new information, to
support better recommendations to users and to find users with similar tastes or interests; iii)
discovering the community structure of a given Online Social Network.
Social bookmarks
Another form of social application is social bookmarking, a new kind of knowledge sharing: users post links to Web sources of interest into platforms with the capability of creating
folksonomies, collaboratively tagging contents. Extracting relevant information from social
bookmarks should be faster and easier than in other fields: HTML-aware and model-based extraction systems should fit very well with the semi-structured templates used by most common
social bookmarking services. Once extracted, information is used to retrieve resources and to
provide recommendations to users of the social bookmark websites (235, 236). Sometimes data
1 http://seleniumhq.org/
are distributed under structured formats like RSS/Atom so acquiring this information is easier
than with traditional HTML sources.
Comparison shopping
One of the most appreciated among Web social services is the comparison shopping, through
platforms with the capability to compare products or services, going from simple prices comparison to features comparison, technical sheets comparison, user experiences comparison, etc.
These services heavily rely on Web data extraction, using websites as sources for data mining
and a custom internal engine to make possible the comparison of similar items. Many Web
stores today also offer personalization forms that make the extraction tasks more difficult: for
this reason many last-generation commercial Web data extraction systems (e.g. Lixto, Kapow
Mashup Server, UnitMiner1 , Bget2 ) provide support for deep navigation and dynamic content
pages.
Opinion sharing
Complementary to comparison shopping, there exist opinion sharing services: users want
to express opinions on products, experiences, services they enjoyed, etc. The most common
form of opinion sharing is represented by blogs, containing articles, reviews, comments, tags,
polls, charts, etc. All this information usually lacks structure, so its extraction is a huge
problem, also for current systems, because of the billions of Web sources currently available.
Sometimes model-based tools fit well, taking advantage of common templates (e.g. Wordpress,
Blogger, etc.), other times natural language processing techniques fit better. Kushal et al. (76)
approached the problem of opinion extraction and the subsequent semantic classification of
reviews of products.
Another form of opinion sharing in semi-structured platforms is represented by Web portals
that let users write unmoderated opinions on various topics and products.
Citation databases
Citation database building is one of the most intensive Web data extraction fields of application:
CiteSeer3 , Google Scholar, DBLP4 and Publish or Perish5 are brilliant examples of applying
Web data extraction to approach and solve the problem of collecting digital publications, extracting
relevant data – for example, references and citations – and building a structured database, where
users can perform searches, comparisons, count of citations, cross-references, etc.
1 http://www.qualityunit.com/unitminer/
2 http://www.bget.com/
3 http://citeseer.ist.psu.edu/
4 http://www.informatik.uni-trier.de/ley/db
5 http://www.harzing.com/pop.html
3.3.3
A Glance on the Future
Bio-informatics and Scientific Computing
A growing field of application of the Web data extraction is bio-informatics: on the World
Wide Web it is very common to find medical sources, in particular regarding bio-chemistry and
genetics. Bio-informatics is an excellent example of the application of scientific computing –
refer e.g. to (85) for a selected scientific computing project.
Plake et al. (233) worked on PubMed1 - the biggest repository of medical-scientific works that
covers a broad range of topics - extracting information and relationships to create a graph;
this structure could be a good starting point to proceed in extracting data about proteins and
protein interactions. This information is usually found not in Web pages, but rather in the PDFs
of the corresponding scientific papers. In the future, Web data extraction
could be extensively used also to classify these documents: approaches to solve this problem are
going to be developed, inherited, both from Information Extraction and Web data extraction
systems, because of the semi-structured format of PostScript-based files. On the other hand,
Web services play a dominant role in this area as well, and another important challenge is the
intelligent and efficient querying of Web services as investigated by the ongoing SeCo project2 .
Web harvesting
One of the most attractive future applications of the Web data extraction is Web Harvesting
(276): Gatterbauer (119) defines it as “the process of gathering and integrating data from
various heterogeneous Web sources”. The most important aspect (although partially different
from specific Web data extraction) is that, during the last phase of data transformation, the
amount of gathered data is many times greater than the extracted data. The work of filtering
and refining information from Web sources ensures that extracted data lie in the domain of
interest and are relevant for users: this step is called integration. Web harvesting remains
an open problem with a large margin for improvement: because of the billions of Web pages, it is
a computational problem, even for restricted domains, to crawl enough sources from the Web
to build a solid ontological base. There is also a human engagement problem, correlated with the
degree of automation of the process: when and where should humans interact with the Web
harvesting system? Should it be a fully automatic process? What degree of precision is
acceptable for the harvesting? All these questions are still open for future work. Projects such
as the DIADEM3 at Oxford University tackle the challenge for fully automatic generation of
wrappers for restricted domains such as real estate.
3.4
Techniques
This Section focuses in particular on the techniques adopted to design Web mining platforms.
Concepts such as Web wrappers, which will be extensively adopted in the next Chapters, are
introduced here in detail.
1 http://www.pubmed.com/
2 http://www.search-computing.it/
3 http://web.comlab.ox.ac.uk/projects/DIADEM/
3.4.1
Used Approaches
The techniques first used to extract data from Web pages were inherited from Information
Extraction (IE) approaches.
Kaiser and Miksch (157) categorized them into two groups: learning techniques and knowledge
engineering techniques.
Sarawagi (250) calls them hand-coded or learning-based approach and rule-based or statistical
approach respectively. These definitions explain the same concept: the first method is used
to develop a system that requires human expertise to define rules (usually regular expressions
or program snippets) to perform the extraction. Both in hand-coded and learning-based approaches domain expertise is needed: people writing rules and training the system must have
programming experience and a good knowledge of the domain.
Also in some approaches of the latter family, in particular the rule-based ones, a strong familiarity with both the requirements and the functions is needed, so the human engagement is
essential. Statistical methods are more effective and reliable in domains of unstructured data
(like natural language processing problems, facts extraction from speeches, and automated text
categorization (251)).
3.4.2
Wrappers
A wrapper is a procedure, implementing a family of algorithms, that seeks and finds the
information the user needs to extract from an unstructured source, and transforms it into
structured data, merging and unifying this information for future processing.
A wrapper life-cycle starts with its generation: it could be described and implemented manually,
e.g. using regular expressions, or in an inductive way; wrapper induction (171) is one of the most
interesting aspects of this discipline, because it introduces a high level of automation in the implementation of algorithms; we can also count on hybrid approaches that make it possible for users to generate
semi-automatic wrappers by means of visual interfaces. Web pages change without forewarning,
so wrapper maintenance is an essential aspect for ensuring the regular working of wrapper-based systems.
Wrappers fit very well to the Web data extraction problem because HTML pages, although
lacking in semantic structure, are syntactically structured: HTML is just a presentation markup
language, but wrappers can use tags to infer underlying information. Wrappers succeeded where
IE NLP-based techniques failed: often Web pages do not have a rich grammatical structure, so NLP
cannot be applied with good results.
3.4.3
Semi-Automatic Wrapper Generation
Visual Extraction
Under this category we classify techniques that make it possible for users to build wrappers
from Web pages of interest interactively, using a GUI, without any deep understanding of
the wrapper programming language, as wrappers are generated automatically by the system
relying on users’ directives.
Regular-Expression-Based approach: One of the most common approaches is based on
regular expressions, which are an old, but still powerful, formal language used to identify strings
or patterns in unstructured text, defining matching criteria. Rules can be complex, so writing
them manually can require too much time and great expertise: wrappers based on regular
expressions dynamically generate rules to extract desired data from Web pages. Usually writing
regular expressions on HTML pages relies on the following criteria: word boundaries, HTML
tags, tables structure, etc.
A notable tool implementing regular-expression-based extraction is W4F (249), adopting an
annotation approach: instead of putting users facing the HTML code, W4F eases the design of
the wrapper through a wizard that allows users to select and annotate elements directly on the
page; W4F builds the regular expression extraction rules of the annotated items and presents
them to the user demanding her/him the optimization step. W4F extraction rules, besides
match, could also implement the split expression, which separates words, annotating different
elements on the same string.
Logic-Based approach: Tools based on this approach successfully build wrappers through
a wrapper programming language, considering Web pages not simply as text strings but as pre-parsed trees, representing the DOM of the page. Gottlob and Koch (134) formalized the first
wrapping language, suitable for being incorporated into visual tools, satisfying the condition
that all its constructs can be implemented through corresponding visual primitives: starting
from the unranked labeled tree representing the DOM of the Web page, the algorithm relabels nodes, truncates the irrelevant ones, and finally returns a subset of original tree nodes,
representing the selected data extracted. The extraction function of all these operations relies on
the Monadic Datalogs (135). Authors demonstrated that Monadic Datalog over tree structures
is equivalent to Monadic Second Order logic (MSO), and hence very expressive. However,
unlike MSO, a wrapper in Monadic Datalog can be modeled nicely in a visual and interactive
step-by-step manner.
Baumgartner et al. (23) developed the Elog wrapping language as a possible implementation of
a monadic datalog with minor restrictions, using it as the core extraction function of the Lixto
Visual Wrapper (22): this tool provides a GUI to select, through visual specification, patterns
in Web pages, in hierarchical order, highlighting elements of the document and specifying
relationships among them; information identified in this way could be too general, so the system
permits adding some restricting conditions, e.g. before/after, not-before/not-after, internal and
range conditions. Finally, the selected information is translated into XML using pattern names
as XML element names, obtaining structured data from unstructured pages.
Spatial Reasoning
A completely different approach, called Visual Box Model, exploits visual cues to understand
the presence of tabular data in HTML documents, not strictly represented under the <table>
element (120, 167): the technique is based on the X-Y cut OCR algorithm, relying on the
Gecko1 rendering engine used by the Mozilla Web browser, to extract the CSS 2.0 visual box
model, accessing the positional information through XPCOM. Cuts are recursively applied to
the bitmap image (the rendering of the page) and stored into an X-Y tree, building a tree
where ancestor nodes with leaves represent not-empty tables. Some secondary operations check
1 https://developer.mozilla.org/en/Gecko
that extracted tables contain useful information, because usually, although it is a deprecated
practice, many Web developers use tables for structural and layout purposes.
3.4.4
Automatic Wrapper Generation
By definition, the automatic wrapper generation implies no human interaction: techniques
that recognize and extract relevant data were autonomously developed and systems with an
advanced degree of automation and a high level of independent decision capability represent
the state-of-the-art of the automatic wrapper generation approach.
Automatic Matching
RoadRunner (73, 74) is an interesting example of automatic wrapper generator: this system
is oriented to data-intensive websites based on templates or regular structures. This system
tackles the problem of automatic matching by bypassing features commonly used by standard wrappers, which typically rely on additional information provided by users labeling example pages, or by
the system through automatic labeling, or a priori knowledge on the schema, e.g. on the page
structure/template. In particular, RoadRunner relies on the fundamental idea of working with
two HTML pages at a time in order to discover patterns while analyzing similarities and differences between structure and content of pages. Essentially RoadRunner can extract relevant
information from any website containing at least two pages with similar structure: usually Web
pages are dynamically generated and relevant data are positioned in the same area of the page,
excluding small differences due, for example, to missing values. Those kinds of Web sources
characterized by a common generation script are called a class of pages. The problem is reduced
to extracting the source dataset, thus generating a wrapper starting from the inference of a
common structure from the two-page-based comparison. This system can handle missing and
optional values and also small structural differences, adapting very well to all kinds of data-intensive Web sources (e.g. based on templates), relying on a solid theoretical background that
ensures a high degree of precision of the matching technique.
Partial Tree Alignment
Zhai and Liu (289, 290) theorized the partial tree alignment technique and developed a Web
data extraction system based on it. This technique relies on the idea that information in Web
documents usually are collected in contiguous regions of the page, called data record regions.
Partial tree alignment consists in extracting these regions, e.g. using a tree matching algorithm,
called tree edit distance. This approach works in two steps: i) segmentation, and ii) partial tree
alignment. In the first phase the Web page is split in segments, without extracting data; this
pre-processing step is fundamental because the system does not simply perform an analysis
based on the DOM tree, but also relies on visual cues, like the spatial reasoning technique,
trying to identify gaps between data records; it is useful also because it helps the process
of extracting structural information from the HTML tag tree, in those situations when the
HTML syntax is abused, e.g. using tabular structure instead of CSS to arrange the graphical
aspect of the page. After that, the partial tree alignment algorithm is applied to data records
earlier identified: each data record is extracted from its DOM sub-tree position, constituting
the root of a new single tree, because each data record could be contained in more than one
non-contiguous sub-tree in the original tag tree. Partial tree alignment approach implies the
alignment of data fields with certainty, excluding those which cannot be aligned, to ensure a
high degree of precision; during this process no data items are involved. This is because partial
tree alignment works only on tag tree matching, represented as the minimum cost, in terms
of operations (e.g. node removal, node insertion, node replacement), required to transform one node
into another.
3.4.5
Wrapper Induction
Wrapper induction techniques differ from the previous ones essentially in the degree of automation:
most wrapper induction systems need labeled examples provided during the training
sessions, thus requiring human engagement. However, a couple of systems can obtain this
information autonomously, representing, de facto, a hybrid approach between wrapper induction
and automatic wrapper generation. In wrapper induction, extraction rules are learned during
training sessions and then applied to extract data from Web pages similar to the example
provided.
Machine-Learning-Based approach
Standard machine-learning-based techniques rely on training sessions to let the system acquire
a domain expertise. Training a Web data extraction system, based on the machine-learning
approach, requires a huge amount of labeled Web pages, before starting to work with an acceptable degree of precision. Manual labeling should provide both positive and negative examples,
especially for different websites but also in the same website, in pages with different structure,
because, usually, templates or patterns differ, and the machine-learning-based system should
learn how to extract information in these cases. Statistical machine-learning-based systems
were developed relying on conditional models (232) or adaptive search (268) as an alternative
solution to human knowledge and interaction.
Many wrapper induction systems were developed relying on the machine-learning approach:
Flesca et al. (108) classified ShopBot, WIEN, SoftMealy, STALKER, RAPIER, SRV and
WHISK (257), analyzing some particular features like support for HTML-documents, NLP,
texts etc.
Kushmerick developed the first wrapper induction system, WIEN (172), based on a couple of
brilliant inductive learning techniques that enable the system to automatically label training
pages, representing, de facto, a hybrid approach to speed up the learning process. Despite
these hybrid features, WIEN has many limitations, e.g. it cannot handle missing values.
SoftMealy, developed by Hsu and Dung (150), was the first wrapper induction system specifically designed for the Web data extraction: relying on non-deterministic finite state automata,
SoftMealy also uses a bottom-up inductive learning approach to extract wrapping rules. During
the training session the system acquires training pages represented as an automaton on all the
possible permutations of Web pages: states represent extracted data, while state transitions
represent extraction rules. STALKER (206) is a system for learning supervised wrappers with
some affinity with SoftMealy but differing in the relevant data specification: a set of tokens is
manually placed on the Web page identifying information that should be extracted, ensuring
the capability of handling empty values, hierarchical structures and unordered items. Bossa et
al. (43) developed a token based lightweight Web data extraction system called Dynamo that
differs from STALKER because tokens are placed during the Web pages building to identify
elements on the page and relevant information. This system is viable strictly in such situations
in which webmasters can modify the structure of Web pages providing tokens placement to help
the extraction system.
3.4.6
Wrapper Maintenance
Wrapper building, regardless of the technique applied to generate it, is only one aspect of the
problem of data extraction from Web sources: unlike static documents, Web pages dynamically
change and evolve, and their structure may change, sometimes with the consequence that
wrappers cannot successfully extract data. Indeed, a critical step of the Web data extraction
process is the wrapper maintenance: this can be performed manually, updating or rewriting
the wrapper each time Web pages change; this approach may fit well for small problems,
but is non-trivial if the pool of Web pages is large (for example, a regular data extraction
task may include hundreds of thousands of pages, usually dynamically generated and frequently updated).
Kushmerick (173) defined the wrapper verification problem and, shortly thereafter, a couple of manual
wrapper maintenance techniques were developed to handle simple problems. In the following,
we analyze a viable practice presented in literature to automatically solve the problem of the
wrapper maintenance, called schema-guided wrapper maintenance. Finally, we propose a novel
technique of automatic wrapper adaptation that contributes to the state-of-the-art in this field.
Schema-Guided Wrapper Maintenance
Meng et al. (200) developed SG-WRAM (Schema-Guided WRApper Maintenance) for Web
data extraction starting from the observation that changes in Web pages, even substantial ones, always preserve syntactic features (i.e. syntactic characteristics of data items like data patterns,
string lengths, etc.), hyperlinks and annotations (e.g. descriptive information representing the
semantic meaning of a piece of information in its context). They developed a Web data extraction system, providing schemes, from the wrapper generation up to the wrapper
maintenance: during the generation the user provides HTML documents and XML schemas,
specifying mappings between them. Later, the system generates extraction rules and then
executes the wrapper to extract data, building an XML document with the specified XML
schema; the wrapper maintainer checks extraction issues and provides an automatic repairing
protocol for wrappers which fail the task because Web pages changed. The XML schema is in
the format of a DTD (Document Type Definition) and the HTML document is represented as a
DOM tree: SG-WRAM builds corresponding mappings between them and generates extraction
rules in the format of an XQuery expression.
3.5
Automatic Wrapper Adaptation
We developed a novel method of automatic wrapper adaptation relying on the analysis of
structural similarities between different versions of the same Web page. Our idea is to compare
some helpful structural information stored by applying the wrapper on the original version of
the Web page, searching for similarities in the new one.
Figure 3.1: Examples of XPaths over trees, selecting one (A) /html[1]/body[1]/table[1]/tr[1]/td[1] or multiple (B) /html[1]/body[1]/table[1]/tr[2]/td items.
3.5.1
Primary Goals
Regardless of the method of extraction implemented by the wrapping system (e.g., we can consider
a simple XPath), elements identified and represented as subtrees of the DOM tree of the Web
page can be exploited to find similarities between two different versions.
In the simplest case, the XPath identifies just a single element on the Web page (Figure 3.1.A);
our idea is to look for some elements, in the new Web page, sharing similarities with the original
one, evaluating comparable features (e.g. subtrees, attributes, etc.); we call these elements
candidates; among candidates, the one showing the higher degree of similarity – possibly –
represents the new version of the original element.
It is possible to extend the same approach in the common case in which the XPath identifies
multiple similar elements on the original page (e.g. an XPath selecting results of a search in a
retail online shop, represented as table rows, divs or list items) (Figure 3.1.B); it is possible to
identify multiple elements sharing a similar structure in the new page, within a custom level of
accuracy (e.g. establishing a threshold value of similarity).
Once identified, elements in the new version of the Web page can be extracted as usual, for
example just re-inducing the XPath1. Our purpose is to define some general rules to enable
the wrapper to face the problem of automatically adapting itself to extract information from
the new version of the Web page.
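A minimal sketch of the two cases of Figure 3.1, assuming the availability of the lxml library and a toy document invented for the example, is the following; it only illustrates how an XPath may identify a single element or multiple structurally similar elements.

from lxml import etree

page = etree.fromstring(
    "<html><body><table>"
    "<tr><td>single item</td></tr>"
    "<tr><td>result 1</td><td>result 2</td></tr>"
    "</table></body></html>"
)

# (A) an XPath identifying a single element of the page
single = page.xpath("/html/body/table/tr[1]/td[1]")
# (B) an XPath identifying multiple, structurally similar elements
many = page.xpath("/html/body/table/tr[2]/td")

print([e.text for e in single])  # ['single item']
print([e.text for e in many])    # ['result 1', 'result 2']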
We implemented this approach in a commercial tool – Lixto. The most efficient way to acquire
some structural information about elements the original wrapper extracts, is to store them
inside the definition of the wrapper itself. For example, generating signatures representing the
DOM subtree of extracted elements from the original Web page, stored as a tree diagram, or
a simple XML document (or, even, the HTML itself). This shrewdness avoids the need to
store the whole original page, ensuring better performance and efficiency.
This technique requires just a few settings during the wrapper definition step: the
user enables the automatic wrapper adaptation feature and sets an accuracy threshold. During
1 For the sake of simplicity let us assume that the given wrappers rely on XPath(s) to identify and extract
elements of a Web page. Since the provided model is very general, it could work with any DOM-based model of
data extraction and its adoption is straightforward in most cases.
the execution of the wrapper, if some XPath definition does not match a node, the wrapper
adaptation algorithm automatically starts and tries to find the new version of the missing node.
3.5.2
Details
First of all, to establish a measure of similarity between two trees we need to find some comparable properties between them. In HTML Web pages, each node of the DOM tree represents
an HTML element defined by a tag (or, otherwise, free text). The simplest way to evaluate
similarity between two elements is to compare their tag name. Elements own some particular
common attributes (e.g. id, class, etc.) and some type-related attributes (e.g. href for anchors,
src for images, etc.); it is possible to exploit this information for additional checks, constraints
and comparisons.
The algorithm selects candidates among subtrees sharing the same root element or, in some
cases, comparable (but not identical) elements, analyzing their tags. This is very effective in
cases of deep modification of the structure of an object (e.g. conversion of tables into divs).
As discussed in the Section above, several approaches have been developed to analyze similarities
between HTML trees; for our purpose we improved a version of the simple tree matching
algorithm, originally introduced by Selkow (253); we call it weighted tree matching. There are two
important novel aspects we are introducing in facing the problem of the automatic wrapper
adaptation: first of all, exploiting previously acquired information through a smart and focused
usage of the tree similarity comparison; thus adopting a consolidated approach in a new field
of application. Moreover, we contributed by applying some particular and useful changes to the
algorithm itself, improving its behavior in the HTML trees similarity measurement.
Algorithm 1 SimpleTreeMatching(T', T'')
1:  if T' has the same label of T'' then
2:    m ← d(T')
3:    n ← d(T'')
4:    for i = 0 to m do
5:      M[i][0] ← 0;
6:    end for
7:    for j = 0 to n do
8:      M[0][j] ← 0;
9:    end for
10:   for all i such that 1 ≤ i ≤ m do
11:     for all j such that 1 ≤ j ≤ n do
12:       M[i][j] ← Max(M[i][j − 1], M[i − 1][j], M[i − 1][j − 1] + W[i][j]) where W[i][j] = SimpleTreeMatching(T'(i − 1), T''(j − 1))
13:     end for
14:   end for
15:   return M[m][n] + 1
16: else
17:   return 0
18: end if
Figure 3.2: A and B are two similar labeled rooted trees.
3. INFORMATION EXTRACTION FROM WEB SOURCES
3.5 Automatic Wrapper Adaptation
3.5.3 Simple Tree Matching
Let d(n) be the degree of a node n (i.e., the number of its first-level children) and let T(i) be the i-th subtree of the tree rooted at node T; a possible implementation of the simple tree matching is shown in Algorithm 1.
The advantages of adopting this algorithm, which has been shown to be quite effective for Web data extraction (162, 289), are multiple: the simple tree matching algorithm evaluates the similarity between two trees by producing the maximum matching through dynamic programming, without computing insertion, relabeling and deletion operations; moreover, approximate tree edit distance algorithms rely on complex implementations to achieve good performance, whereas the simple tree matching and similar algorithms are very simple to implement.
The computational cost is O(n^2 · max(leaves(T'), leaves(T'')) · max(depth(T'), depth(T''))), thus ensuring good performance when applied to HTML trees. There are some limitations; most of them are not relevant in our context, but an important one is that this approach cannot match permutations of nodes. Despite this intrinsic limit, the technique appears to fit very well our purpose of measuring the similarity of HTML trees.
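To make the matching procedure concrete, the following is a minimal Python sketch of Algorithm 1; it is an illustrative re-implementation (the actual platform is written in Java), and the Node class is a hypothetical, minimal tree representation introduced only for this example.

```python
class Node:
    """A minimal rooted, ordered, labeled tree (illustrative only)."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []


def simple_tree_matching(t1, t2):
    """Algorithm 1: size of the maximum matching between two trees,
    computed top-down via dynamic programming over first-level subtrees."""
    if t1.label != t2.label:
        return 0
    m, n = len(t1.children), len(t2.children)
    # M[i][j]: best matching between the first i subtrees of t1
    # and the first j subtrees of t2 (row and column 0 stay at 0).
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            W = simple_tree_matching(t1.children[i - 1], t2.children[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + W)
    return M[m][n] + 1  # +1 accounts for the matched roots
```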
3.5.4 Weighted Tree Matching
Let t(n) be the number of total siblings of a node n, including itself.

Algorithm 2 WeightedTreeMatching(T', T'')
1: {Replace the return statement at line 15 of Algorithm 1 with the following code}
2: if m > 0 AND n > 0 then
3:   return M[m][n] * 1 / Max(t(T'), t(T''))
4: else
5:   return M[m][n] + 1 / Max(t(T'), t(T''))
6: end if
In order to better reflect a good measure of similarity between HTML trees, we applied some focused changes to the way a value is assigned to each matching node.
In the simple tree matching algorithm the assigned matching value is always 1. After some analysis and considerations on the structure of HTML pages, our intuition was to assign a weighted value, with the purpose of attributing less importance to slight changes in the structure of the tree when they occur in deep sublevels (e.g., missing/added leaves, small truncated/added branches, etc.) and also when they occur in sublevels with many nodes, because these mainly represent HTML lists of items, table rows, etc., which are more prone to modifications.
In the weighted tree matching, the weighted value assigned to a match between two nodes is 1 divided by the greater number of siblings of the two compared nodes, counting the nodes themselves (e.g., Figure 3.2.A, 3.2.B); this reduces the impact of missing/added nodes. Before assigning a weight, the algorithm checks whether it is comparing two leaves, a leaf with a node which has children, or two nodes which both have children. The final contribution of a sublevel of leaves is the sum of the weighted values assigned to each leaf (cfr. Algorithm 2, lines 4-5); thus, the contribution of the parent node of those leaves is equal to its weighted value multiplied by the sum of the contributions of its children (cfr. Algorithm 2, lines 2-3). This choice produces an effect of clustering the matching process, subtree by subtree; this implies that, for each sublevel of leaves, the maximum sum of assigned values is 1; thus, for each parent node of that sublevel, the maximum value of the multiplication of its contribution with the sum of the contributions of its children is 1; each cluster, singly considered, contributes with a maximum value of 1. In the last recursion of this top-down algorithm, the two roots are evaluated. The resulting value at the end of the process is the measure of similarity between the two trees, expressed in the interval [0, 1]. The closer the final value is to 1, the more similar the two trees.
Let us analyze the behavior of the algorithm with a classic example, already used by (285) and (289) to explain the simple tree matching (Figure 3.2): 3.2.A and 3.2.B are two very simple generic rooted labeled trees (i.e., with the same structure of HTML trees). They show several similarities, except for some missing nodes/branches.
Applying the weighted tree matching algorithm, in the first step (Figure 3.2.A, 3.2.B) the contributions assigned to leaves, with respect to matches between the two trees, reflect the previous considerations (e.g., a value of 1/3 is established for nodes (h), (i) and (j), although two of them are missing in 3.2.B). Going up to the parents, the sum of the contributions of matching leaves is multiplied by the relative value of each node (e.g., in the first sublevel, the contribution of each node is 1/4 because of the four first-sublevel nodes in 3.2.A).
Once these operations are completed for all nodes of the sublevel, values are added and the final measure of similarity for the two trees is obtained. Intuitively, in more complex and deeper trees, this process is iteratively executed for all the sublevels. The deeper a mismatch is found, the less its missing contribution will affect the final measure of similarity. Analogous considerations hold for missing/added nodes and branches, sublevels with many nodes, etc. Table 3.1 shows the M and W matrices containing contributions and weights.
Table 3.1: W and M matrices for each matching subtree.
In this example, WeightedTreeMatching(3.2.A, 3.2.B) returns a measure of similarity of 3/8 (0.375), whereas SimpleTreeMatching(3.2.A, 3.2.B) would return a mapping value of 7; the main difference between the results provided by these two algorithms is the following: our weighted tree matching intrinsically produces an absolute measure of similarity between the two compared trees; the simple tree matching, instead, returns a mapping value and then needs subsequent operations to establish the measure of similarity.
Hypothetically, in the simple tree matching case, we could establish a rough estimate of similarity by dividing the mapping value by the total number of nodes of the larger tree; however, a value calculated in this way would be linear with respect to the number of nodes, thus ignoring important information such as the position of mismatches, the number of mismatches with respect to the total number of subnodes/leaves in a particular sublevel, etc.
In this case, for example, the measure of similarity between 3.2.A and 3.2.B, applying this approach, would be 7/14 (0.5). A greater value of similarity could wrongly suggest that this approach is more accurate. Experimentation showed us that the closer the measure of similarity is to reflecting changes in complex structures, the higher the accuracy of the matching process. This fits particularly well HTML trees, which often show very rich and articulated structures.
The main advantage of using the weighted tree matching algorithm is that the more the structures of the considered trees are complex and similar, the more accurate the measure of similarity will be. On the other hand, for simple and quite different trees the accuracy of this approach is lower than the one ensured by the simple tree matching. But, as already underlined, most of the changes in Web pages are usually minor changes, thus weighted tree matching appears to be a valid technique to achieve a reliable process of automatic wrapper adaptation.
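Continuing the Python sketch of Section 3.5.3, the weighted variant only changes the value returned at the end of each recursion, as in Algorithm 2. The two example trees below are a reconstruction of the shape of Figure 3.2, consistent with the values discussed above (a mapping value of 7 for the simple tree matching and a similarity of 3/8 for the weighted tree matching); they are provided only for illustration.

```python
def weighted_tree_matching(t1, t2, s1=1, s2=1):
    """Algorithm 2: the same dynamic programming as simple_tree_matching,
    but each match is weighted by 1 / max(t(T'), t(T'')), where t(n) is the
    number of siblings of n including itself (passed here as s1 and s2).
    The root call uses s1 = s2 = 1, so the result lies in [0, 1]."""
    if t1.label != t2.label:
        return 0.0
    m, n = len(t1.children), len(t2.children)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # children of t1 have m siblings in total, children of t2 have n
            W = weighted_tree_matching(t1.children[i - 1], t2.children[j - 1], m, n)
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + W)
    if m > 0 and n > 0:
        return M[m][n] / max(s1, s2)     # internal nodes: weight the whole cluster
    return M[m][n] + 1.0 / max(s1, s2)   # leaf case: add its weighted contribution


# Trees shaped like Figure 3.2 (labels only, node identifiers omitted)
A = Node('a', [Node('b', [Node('d'), Node('e')]),
               Node('c', [Node('f')]),
               Node('b', [Node('e'), Node('d')]),
               Node('c', [Node('g', [Node('h'), Node('i'), Node('j')])])])
B = Node('a', [Node('b', [Node('d'), Node('e')]),
               Node('c', [Node('g', [Node('h')]), Node('f')])])

print(simple_tree_matching(A, B))    # 7
print(weighted_tree_matching(A, B))  # 0.375
```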
3.5.5 Web Wrappers
In supervised and interactive wrapper generation, the application designer is in charge of deciding how to characterize the Web objects that are used for traversing the Web and for extracting information. Being resilient against changes (both changes over time and variations across similarly structured pages) is one of the most important properties of a wrapper, and part of the robustness of a data extractor depends on how the application designer configures it. However, it is crucial that the wrapper generation system assists the wrapper designer and suggests how to make the identification of Web objects and trails through Web sites as stable as possible.
Robust XPath generation and fall-back strategies
In Lixto Visual Developer (VD) (26), a number of mechanisms are offered to create a resilient
wrapper. During recording, one task is to generate a robust XPath or regular expression,
interactively and supported by the system. During wrapper generation, in many cases only one
labeled example object is available, especially in automatically recorded deep Web navigation
sequences. In such cases, efficient heuristics in XPath generation and fallback strategies during
replay, are required. Typical heuristics during recording for reliably identifying such single Web
objects include:
• Generalization of a chosen XPath by using form properties, element properties, textual properties and formatting properties. During replay, these ingredients are used as input for an algorithm that checks in which combination this property information is best applied to satisfy the integrity constraints imposed on a rule (e.g., that exactly one instance is required as a result).
• DOM Structural Generalization – starting from the full path, several generalized paths
are created, using only characteristic elements and characteristic element sequences. A
number of stable anchor points are identified and stored, from which relative paths to this
object are created. Typical stable anchor points are automatically identified and include,
e.g., the outermost table structure and the main content area (being chosen upon factors
such as the longest content).
• Positional information is considered if the structurally generalized paths identify more
than one element. In this case, during execution, variations of the XPath generated with
this “index heuristics” are applied on the active Web page, removing indexes until the
integrity constraints of the current rule are satisfied.
• Attributes and properties of elements are taken into account, in particular of the element of
choice, but we also consider ancestor attributes if the element attributes are not sufficient.
• Attributes that make an element unique are preferred, i.e., similar elements are checked
for distinguishing criteria.
• Attribute Values are considered, if attribute names are not sufficient. Attribute Value
Fragments are considered, if attribute values are not sufficient (using regular expressions).
• The id attributes are used as far as possible. If an id is unique and meaningful for
characterizing an element it is considered in the fallback strategies with a high weight.
• Textual information and label information is used, only if explicitly turned on (since this
might fail in case of a language switch).
The output of the heuristic step is a “best XPath” shown to the wrapper designer, and a set of
XPath expressions and priorities regarding when to use which fallback strategy, stored in the
configuration. Figure 3.3 illustrates which information is stored by the system during recording.
In this case, a drop down was selected by the application designer, and the system decided that
the “id” attribute is the most reliable one and is chosen as best XPath. If this evaluation fails,
the system will apply heuristics based on the (in this example, three) stored fallback XPaths,
which mainly exploit form and index properties. In case one of the heuristics generates results
that do not invalidate the defined integrity constraints, these Web objects are considered as
result.
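The following is a minimal sketch of the fall-back idea under the heuristics listed above (it is not Lixto's actual implementation): a list of XPath expressions, ordered by priority, is tried until one yields a result set that satisfies a simple integrity constraint. The expressions and the exactly_one constraint are hypothetical examples.

```python
from lxml import html


def evaluate_with_fallbacks(page_html, xpaths, constraint):
    """Try each XPath in priority order; return the first result set
    that satisfies the integrity constraint, or None if all of them fail."""
    tree = html.fromstring(page_html)
    for xpath in xpaths:
        result = tree.xpath(xpath)
        if constraint(result):
            return result
    return None


# Hypothetical configuration: the "best XPath" first, generalized fallbacks after.
xpaths = [
    "//select[@id='destination']",          # unique id attribute
    "//form//select[@name='destination']",  # form-property generalization
    "//form//select[1]",                    # index heuristics
]
exactly_one = lambda nodes: len(nodes) == 1  # cardinality constraint

# matched = evaluate_with_fallbacks(page_source, xpaths, exactly_one)
```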
During generation of rules (e.g., “extract”) and actions (e.g., “click”), the wrapper designer
imposes constraints on the results to be obtained, such as:
• Cardinality Constraints: restrictions on the number of results, e.g., exactly one element
or at least one element must be matched.
• Data Type Constraints: restrictions on the data type of a result, e.g., a result must be of
type integer or match a particular regular expression.
Constraints can be defined individually per rule and action, or defined globally by using a
schema on the output data model.
Configuring adaptable wrappers
The procedures described in the previous section do not adapt the wrapper, but address situations in which the initially chosen XPath no longer matches, simply trying different expressions derived from it. In the configuration of wrapper adaptation, we go one step further: on the one hand, we exploit tree and string similarity techniques to find the most similar Web object(s) on the new page; on the other hand, in case the adaptation is triggered, the wrapper is changed on the fly using the new configuration created by the adaptation algorithms.
As before, integrity constraints can be imposed on extraction and navigation rules. Moreover, the application designer can choose whether to use wrapper adaptation on a particular rule in case the constraints are violated during runtime. When adaptation is chosen, instead of using XPath-based means to identify Web objects, we store the actual result subtree. In the case of HTML leaf elements, which are usually the elements under consideration for navigation actions, we instead store the tree rooted at the n-th ancestor of the element, together with the information of where the result element is located within this tree. In this way, tree matching can also be exploited for HTML leaf elements.

Figure 3.3: Robust Web object detection in Lixto VD.

Figure 3.4: Configuration of wrapper adaptation in Lixto VD.
Wrapper designers can choose among various similarity measures: these include in particular the Simple Tree Matching algorithm (253) and the Weighted Tree Matching algorithm previously described. In the future, further algorithms will extend the capabilities of the tool, e.g., a bigram-based tree matching that is capable of dealing with node permutations in a more favorable fashion. In addition to the similarity function, one can choose certain parameters, e.g., whether to use the HTML element name as node label or instead to also use identifying attributes, such as the class and id attributes. Figure 3.4 illustrates the configuration of wrapper adaptation in Lixto VD.
3.5.6 Automatic Adaptation of Web Wrappers
Self-repairing rules
Figure 3.5 describes the adaptation process. The wrapper adaptation process is triggered upon
violation of defined constraints. In case in the initial wrapper an element is detected with an
XPath, the adaptation procedure substitutes this by storing the subtree of a matched element.
In case the wrapper definition already stores the example tree, and the similarity computation
returns results that violate the defined constraints, the threshold is lowered or raised until a
perfect match is generated.
During runtime, the stored tree is compared to the elements on the new page, and the best
fitting element(s) are considered as extraction results. During configuration, wrapper designers
can choose an algorithm (such as the Weighted Tree Matching), and a similarity threshold.
The similarity threshold can be constant, or defined to be within an interval of acceptable
thresholds. During execution, various thresholds within the allowed range are considered, and
the one generating the best fit with respect to the defined constraints is chosen.
As a next step, the stored tree is refined and generalized so that it maximizes the matching
value for both the original subtree and the new trees, reflecting the changes of a Web page
over time. This generalization process generates a simple tree grammar, a “tree template” that
is allowed to use occurrence indicators (one or more element, at least one element, etc.) and
optional depth levels. In further runs, the tree template is compared against the sub trees of
an active Web page during execution. First, the algorithm checks which trees on the new page
satisfy the tree template. In case the results are within the defined integrity constraints, no
further action is taken. In case the results are not satisfying, the system searches for most
similar trees based on the defined distance metrics; in this case, the wrapper is auto-adapted,
the tree template is further refined and the threshold or threshold interval is automatically
re-adjusted. At the very end of the process, the corrected wrapper is stored in the wrapper
repository and committed to a versioning system to keep track of all changes.
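A minimal sketch of the runtime matching step described above, reusing the weighted_tree_matching function and the Node trees from the earlier sketches: all subtrees of the new page are compared against the stored example tree, and the candidates whose similarity clears one of the allowed thresholds are returned as the adapted extraction result. The helper names, the threshold interval and the omitted integrity-constraint check are illustrative assumptions.

```python
def iter_subtrees(root):
    """Yield every subtree of a page tree (depth-first)."""
    yield root
    for child in root.children:
        yield from iter_subtrees(child)


def adapt_rule(stored_tree, new_page_root, thresholds=(0.9, 0.8, 0.7)):
    """Return the best-matching subtree(s) of the new page, trying
    progressively lower similarity thresholds within the allowed interval."""
    scored = [(weighted_tree_matching(stored_tree, candidate), candidate)
              for candidate in iter_subtrees(new_page_root)]
    for threshold in thresholds:
        matches = [c for score, c in scored if score >= threshold]
        if matches:  # here the rule's integrity constraints would be checked
            return matches
    return []
```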
Wrapper re-induction
In practice, single adaptation steps of rules and actions are embedded into the whole execution process of a wrapper, and the adapted wrapper is stored in the repository after all adaptation steps have been concluded. The need for adapting a particular rule influences the further execution steps.

Figure 3.5: Wrapper adaptation process.
Usually, wrapper generation in VD is a hierarchical top-down process: first, for example, a “hotel record” is characterized and, inside the hotel record, entities such as “rating” and “room types”. To define a rule to match such entities, the wrapper designer visually selects an example and, together with system suggestions, generalizes the rule configuration until the desired instances are matched. To support the automatic adaptation process during runtime, as described above, the wrapper designer further specifies what it means that extraction failed. In general, this means wrong or missing data, and with integrity constraints one can indicate what correct results look like. The upper half of Figure 3.6 summarizes the wrapper generation.

Figure 3.6: Diagram of the Web wrapper creation, execution and maintenance flow.
During wrapper creation, the application designer provides a number of configuration settings
to this process. This includes:
• Threshold Values.
• Priorities/Order of Adaptation Algorithms used.
• Flags of the chosen algorithm (e.g., using HTML element name as node label, using
id/class attributes as node labels, etc.).
• Triggers for bottom-up, top-down and process flow adaptation bubbling.
• Whether stored tree-grams and XPath statements are updated based on adaptation results to be additionally used as inputs in future adaptation procedures (reflecting and
addressing regular slight changes of a Web page over time).
Triggers in Adaptation Settings can be used to force adaptation of further fragments of the
wrapper as depicted in the lower half of Figure 3.6.
• Top-down: forcing adaptation of all/some descendant rules (e.g., adapt the “price” rule
as well to identify prices within a record if the “record” rule was adapted).
• Bottom-up: forcing adaptation of a parent rule in case the adaptation of a particular rule was not successful. Experimental evaluation pointed out that in such cases the problem is often that the parent rule already provides wrong or missing results (even if they match the integrity constraints) and has to be adapted first.
• Process flow: it might happen that particular rule matches can no longer be detected because the wrapper evaluates on the wrong page. Hence, there is the need to use variations
in the deep web navigation actions. In particular, a simple approach explored at this time
is to use a switch window or back step action to check if the previous window or another
tab/popup provides the required information.
3.5.7 Experimentation
In this section we discuss some experiments performed on common fields of application and the corresponding results. We tried to automatically adapt wrappers, previously built to extract information from particular Web pages, after some (often minor) structural changes.
All the following are real use cases: we did not modify any Web page; the original owners did, republishing the pages with changes and thus altering the behavior of the old wrappers. These real use cases confirmed our expectations and the simulations on ad hoc examples we had prepared to test the algorithms.
We obtained an acceptable degree of precision using the simple tree matching and a high precision/recall using the weighted tree matching. Precision, Recall and F-Measure summarize these results, shown in Table 3.2. We focused on the following areas, in which Web data extraction is widely applied:
• News and information: Google News is a valid use case for wrapper adaptation; templates change frequently and sometimes it is not possible to identify elements with the old wrappers.
• Web search: Google Search completely rebuilt the layout of its results page in the same period in which we started our experimentation 1; we exploited the possibility of automatically adapting wrappers built on the old version of the results page.

1 http://googleblog.blogspot.com/2010/05/spring-metamorphosis-googles-new-look.html
• Social networks: another great example of continuous restyling is represented by the most popular social network, Facebook; we successfully adapted wrappers extracting friend-lists, also exploiting additional checks performed on attributes.
• Social bookmarking: building folksonomies and tagging content is a common behavior of Web 2.0 users. Several Websites provide platforms to aggregate and classify sources of information, and these can be extracted; so, as usual, wrapper adaptation is needed to face changes. We chose Delicious for our experimentation, obtaining excellent results.
• Retail: these Websites are a common field of application of data extraction, and eBay is a nice real use case for wrapper adaptation, continuously showing (often almost invisible) structural changes which require wrappers to be adapted in order to keep working correctly.
• Comparison shopping: related to the previous category, many Websites provide tools to compare prices and features of products. Often it is interesting to extract this information, and sometimes this task requires the adaptation of wrappers to structural changes of the Web pages. Kelkoo 1 provided us with a good use case to test our approach.
• Journals and communities: Web data extraction tasks can also be performed on the millions of online Web journals, blogs and forums based on open source blog publishing applications (e.g., Wordpress, Serendipity 2, etc.), CMSs (e.g., Joomla 3, Drupal 4, etc.) and community management systems (e.g., phpBB 5, SMF 6, etc.). These platforms allow changing templates, and this often implies that wrappers must be adapted. We ran the automatic adaptation process on TechCrunch 7, a tech journal built on Wordpress.
We adapted wrappers for these 7 use cases, considering 70 Web pages; Table 3.2 summarizes the results obtained by comparing the two algorithms applied to the same pages, with the same configuration (threshold, additional checks, etc.). Threshold represents the minimum value of similarity required to match two trees. The columns true pos., false pos. and false neg. report the true positive, false positive and false negative items extracted from the Web pages through the adapted wrappers.
3.5.8 Discussion of Results
In some situations of deep changes (Facebook, Kelkoo, Delicious) we had to lower the threshold in order to correctly match most of the results. Both algorithms show great elasticity, and it is possible to adapt wrappers with a high degree of reliability; the simple tree matching approach shows a weaker recall value, whereas the performance of the weighted tree matching is remarkable (an F-Measure greater than 98% is an impressive result). Sometimes, additional checks on node attributes are performed to refine the results of both algorithms. For example, we can include attributes (e.g., id, name and class) as part of the node label to refine results. Even without these additional checks, most of the time the false positives are very limited in number (cfr. the Facebook use case).
1 http://shopping.kelkoo.co.uk
2 http://www.s9y.org
3 http://www.joomla.org
4 http://drupal.org
5 http://www.phpbb.com
6 http://www.simplemachines.org
7 http://www.techcrunch.com
URL               threshold   STM true pos.   WTM true pos.
news.google.com   90%         604             644
google.com        80%         100             136
facebook.com      65%         240             240
delicious.com     40%         100             100
ebay.com          85%         200             196
kelkoo.co.uk      40%         60              58
techcrunch.com    85%         52              80
Total             -           1356            1454

Simple Tree Matching (STM) totals: 92 false pos., 140 false neg.; Recall 90.64%, Precision 93.65%, F-Measure 92.13%.
Weighted Tree Matching (WTM) totals: 12 false pos., 42 false neg.; Recall 97.19%, Precision 99.18%, F-Measure 98.18%.

Table 3.2: Experimental results of automatic wrapper adaptation.
Conclusion
In this Chapter we briefly discussed the current panorama of techniques and fields of application of Web mining platforms, with particular attention to the themes related to the extraction of data from social media sources, such as Online Social Networks, social bookmarking services, and so on.
The procedures referred to in this Chapter as Web wrappers are the basis of the platform for Web data extraction from Online Social Networks discussed in the next Chapters. Our research activity provided a substantial contribution to the field of Web data extraction systems, in particular in the area concerning automatic wrapper maintenance (102, 103, 104). The details of our algorithmic and technical solution to this problem have been described in the final part of this Chapter.
4 Mining and Analysis of Facebook
This Chapter is organized as follows: Section 4.1 presents a summary of the most representative related work, in particular the studies on Facebook and other OSNs based on the concept of friendship. Moreover, we discuss existing projects on data extraction and analysis of OSNs. Section 4.2 describes the methodology we used to conduct the analysis of the Facebook social network; in particular, we discuss the architecture of the Web mining platform and the algorithms and techniques exploited. We define the technical challenges underlying the process of information extraction from Facebook and describe in detail the design and implementation of our application, the crawling agent. Some experimental details regarding Social Network Analysis aspects are discussed in Section 4.3. The most important results are summarized by describing the analysis of the topological features of the Facebook social network.
4.1 Background and Related Literature
The task of extracting and analyzing data from Online Social Networks has attracted the
interest of many researchers e.g. in (8, 118, 286). In this Section we review some relevant
literature directly related to our approach.
In particular, we first discuss techniques to crawl large social networks and collect data from
them (see Section 4.1.1). Collected data are usually mapped onto graph data structures (and
sometimes hypergraphs) with the goal of analyzing their structural properties.
The ultimate goal of these efforts is perhaps best laid out by Kleinberg (165): topological
properties of graphs may be reliable indicators of human behaviors. For instance, several studies
show that node degree distribution follows a power law, both in real and Online Social Networks.
That feature points to the fact that most social network participants are often inactive, while
few key users generate a large portion of data/traffic. As a consequence, many researchers
leverage tools provided by graph theory to analyze the social network graph with the goal, among others, of better interpreting personal and collective behaviors on a large scale. The
list of potential research questions arising from the analysis of OSN graphs is very long. As
discussed in the introduction of this dissertation, we point out the following themes which are
directly relevant to our research:
i) Data collection from OSNs, i.e., the process of acquisition of relevant data from OSN platforms by means of Web mining techniques;
ii) Node similarity detection, i.e., the task of assessing the degree of similarity of two users in OSNs (see Section 4.1.2);
iii) Influential user detection, i.e., the task of identifying users capable of stimulating other users to join activities/discussions in their OSN (see Section 4.1.3).
4.1.1 Data Collection from Online Social Networks
Most works focusing on data collection adopt Web information extraction techniques to crawl the front-end of Websites; this is because OSN datasets are usually not publicly accessible: data rest in back-end databases that are accessible only through the Web interface.
In (184) the authors discussed the problem of sampling from large graphs adopting several
graph mining techniques, in order to establish whether it is possible to avoid bias in acquiring
a subset of the whole graph of a social network. The main outcome of the analysis in (184) is
that a sample of about 15% of the whole graph preserves most of its properties.
In (203), the authors crawled data from large Online Social Networks like Orkut, Flickr and
LiveJournal. They carried out an in-depth analysis of OSN topological properties (e.g., link
symmetry, power law node degrees, groups formation) and discussed challenges arising from
large-scale crawling of OSNs.
(286) considered the problem of crawling OSNs analyzing quantitative aspects like the efficiency
of the adopted visiting algorithms, and bias of data produced by different crawling approaches.
The work by Gjoka et al. (125) on OSN graphs is perhaps the most similar to our approach.
Gjoka et al. have sampled and analyzed the Facebook friendship graph with different visiting
algorithms (namely BFS, Random Walk and Metropolis-Hastings Random Walks). Our objectives differ from those of Gjoka et al. because their goal is to produce a consistent sample of
the Facebook graph. A sample is defined consistent when some of its key structural properties,
i.e., node degree distribution, assortativity and clustering coefficient approximate fairly well
the corresponding properties of the original Facebook graph. Vice versa, our work aims at
crawling a portion of the Facebook graph and to analytically study the structural properties of
the crawled data.
A further difference with Gjoka et al. is in the strategy for selecting which nodes to visit: Gjoka's strategy requires knowing in advance the degree of the considered nodes. Nodes with the highest degree are selected and visited at each stage of the sampling. In the Facebook context, the node degree represents the number of friends a user has; such information is available in advance by querying the profile of the user. Such an assumption, however, does not hold if we consider other Online Social Networks. Hence, to know the degree of a node we should preliminarily perform a complete visit of the graph, which may not be feasible for large-scale OSNs.
4.1.2 Similarity Detection
Finding similar users in a given OSN is a key issue in several research fields like Recommender Systems, especially Collaborative Filtering (CF) Recommender Systems (3). In the context of social networks, the simplest way to compute user similarities is by means of similarity metrics such as the Jaccard coefficient (146). In particular, given two users u_i and u_l of a given social network, the simplest and most intuitive way to compute their similarity requires the computation of the Jaccard coefficient of the sets of their neighbors. However, the usage of the Jaccard coefficient is often not satisfactory because it considers only the acquaintances of a user in a social network (and, therefore, local information) and does not take global information into account. A further drawback is that users with a large number of acquaintances have a higher probability of sharing some of them compared to users with a small number of acquaintances; therefore, they are likely to be regarded as similar even if no real similarity exists between them. (1) proposed that the similarity of two users increases if they share acquaintances who, in turn, have a low number of acquaintances themselves.
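A minimal sketch of this local measure, assuming the neighbor sets of the two users are available as Python sets (the user names are purely illustrative):

```python
def jaccard_similarity(neighbors_a, neighbors_b):
    """Jaccard coefficient of two users' neighbor sets: |A ∩ B| / |A ∪ B|."""
    if not neighbors_a and not neighbors_b:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)


# Illustrative example: u_i and u_l share one acquaintance out of five overall.
neighbors_ui = {"alice", "bob", "carol"}
neighbors_ul = {"carol", "dave", "eve"}
print(jaccard_similarity(neighbors_ui, neighbors_ul))  # 0.2
```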
In order to consider global network properties, many approaches rely on the idea of regular
equivalence, i.e., on the idea that two users are similar if their acquaintances are similar too.
In (20) the problem of computing user similarities is formalized as an optimization problem.
Other approaches compute similarities by exploiting matrix based methods. For instance, the
approaches of (182) use a modified version of the Katz coefficient. SimRank (154) provides
an iterative fixpoint method. The approach of (31) operates on directed graphs and uses an
iterative approach relying on their spectral properties.
Some authors studied computational complexity of social network analysis with an emphasis
on the problem of discovering links between social network users (255, 256).
To describe these approaches, consider a social network and let G = (V, E) be the graph representing it; each node in V represents a user, whereas an edge specifies a tie between a pair of users (in particular, the fact that a user knows another user).
In the first stage, Formal Concept Analysis is applied to map G onto a graph G'. The graph G' is more compact than G (i.e., it contains fewer nodes and edges than G) but it is still sparse, i.e., a node in G' has few connections with other nodes. As a consequence, the task of predicting whether two nodes are similar is quite hard, and comparing the number of friends/acquaintances they share is not effective because, in most cases, two users do not share any common friend and, therefore, the similarity degree of an arbitrary pair of users will be close to 0. To alleviate sparsity, Singular Value Decomposition (SVD) (133) is applied. Experiments provided in (255) show that the usage of SVD is effective in producing a more detailed and refined analysis of social network data.
The SVD is a technique from Linear Algebra which has been successfully employed in many fields of Computer Science, like Information Retrieval; in particular, given a matrix A, the SVD allows the matrix A to be decomposed as follows:

A = U Σ V^T

where U and V are two orthogonal matrices (i.e., the columns of U and V are pairwise orthogonal); the matrix Σ is a diagonal matrix whose elements coincide with the square roots of the eigenvalues of the matrix A A^T; as usual, the symbol A^T denotes the transpose of the matrix A.
The SVD decomposes a matrix A into the product of three matrices; if we multiplied these three matrices, we would reconstruct the original matrix A. As a consequence, if A is the adjacency matrix of a social network, any operation carried out on A can be equivalently performed on the three matrices U, Σ and V into which A has been decomposed.
The main advantage of such a transformation is that the matrices U and V are dense and, therefore, we can compute the similarity degree of two users even if the number of friends they share is close or equal to zero.
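A minimal NumPy sketch of this idea: decompose a small adjacency matrix, keep the largest singular values, and compare users through the resulting dense low-rank representation. The toy matrix, the choice of k and the use of cosine similarity are illustrative assumptions, not prescriptions of the cited approaches.

```python
import numpy as np

# Adjacency matrix of a tiny undirected social graph (illustrative data).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(A)    # A = U Σ V^T
k = 2                          # keep the two largest singular values
users = U[:, :k] * s[:k]       # dense low-rank representation of each user


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similarity of users 0 and 3 in the reduced space, computable even when
# the users share few (or no) direct friends.
print(cosine(users[0], users[3]))
```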
4.1.3 Influential User Detection
A recent trend in OSN analysis is the identification of influential users (122, 222). Influential users are those capable of stimulating others to join OSN activities and/or to actively contribute to them.
In Weblog (blog) analysis, there is a special emphasis on so-called leader identification. In particular, (258) suggested modeling the blogosphere as a graph (whose nodes represent bloggers, whereas edges model the fact that a blogger cites another one). In (195), the authors introduce the concept of starter, i.e., a user who first generates information that catches the interest of fellow users/readers. Among others, the approach of (195) deploys the Random Walk technique to find starters. Researchers from HP Labs analyzed user behaviors on Twitter (246); they found that influential users should not only catch the attention of other users, but should also overcome the passivity of other users and spur them to get involved in OSN activities. To this purpose, they developed an algorithm, based on the HITS algorithm of (164), to assess the degree of influence of a user. Experimental results show that a high level of popularity of a user does not necessarily imply a high degree of influence.
4.2 Sampling the Facebook Social Graph
Our work on OSN analysis began with the goal of understanding the organization of popular OSNs; as of 2010 (the time of the data collection), Facebook was by far the largest and most studied. As of December 2011, Facebook gathered more than 720 million active users (269), and its growth rate has proved to be the highest among all its competitors in the last few years. More than 50% of users log on to the platform in any given day. Coarse statistics about the usage of the social network are provided by the company itself 1. Our study is interested in analyzing the characteristics and the properties of this network on a large scale. In order to reach this goal, first of all we had to acquire data from the platform, and then we proceeded to their analysis.

1 Please refer to http://www.facebook.com/press/info.php?statistics
4.2.1 The Structure of the Social Network
The structure of the Facebook social network is simple. Each subscribed user can be connected to others via friendship relationships. The concept of friendship is bilateral: users must confirm their relationships. Connections among users do not follow any particular hierarchy, thus we define the social network as unimodal.
This network can be represented as a graph G = (V, E) whose nodes V represent users and whose edges E represent friendship connections among them. Because of the assumption on the bilateralness of relationships, the graph is undirected. Moreover, the graph we consider is unweighted, because all friendship connections have the same value within the network. However, it would be possible to assign a weight to each friendship relation, for instance by considering the frequency of interaction between each pair of users, or other criteria. Considering the assumption that loops are not allowed, we conclude that in our case it is possible to use a simple unweighted undirected graph to represent the Facebook social network. The adoption of this model has been proved to be optimal for several social networks to this purpose (see (130)).
Although choosing the model for representing a network might appear simple, this phase is important and can be non-trivial. Compared to Facebook, the structure of other social networks requires a more complex representative model. For example, Twitter should be represented using a multiplex network, because it introduces different types of connections among users (“following”, “reply to” and “mention”) (247). Moreover, there is no mutuality in user relationships, thus its representation requires a directed graph. Similar considerations hold for other OSNs, such as aNobii (5), Flickr and YouTube (203), etc.
How to get information about the structure of the network
One important aspect to be considered when modeling a social network is the amount of information about its structure we have access to. The ideal condition would be to have access to the whole network data, for example by acquiring them directly from the company which manages the social networking service. For several reasons (see further), most of the time this solution is not viable; this is the case for Facebook.
Another option is to obtain the data required to reconstruct the model of the network directly from the platform itself, exploiting its public interface. In other words, a viable solution is to collect a representative sample of the network that reproduces its structure. To this purpose, it is possible to exploit Web data mining techniques to extract data from the front-end of the social network Websites. This implies that, for very large OSNs such as Facebook, Twitter, etc., it is hard or even impossible to collect a complete sample of the network. The first limitation is related to the computational overhead of a large-scale Web mining task. In the case of Facebook, for example, crawling the friend-list Web page (approximately 200 KB) for half a billion users would require downloading roughly 200 KB · 500 M = 100 Terabytes of HTML data. Even if this were possible, the acquired sample would only be a snapshot of the structure of the graph at the time of the data collection process. Moreover, during the sampling process the structure of the network slightly changes: even if short, the data mining process requires a non-negligible time, during which the connections among users evolve and the social network, together with its structure, changes accordingly. For example, the growth rate of Facebook has been estimated in the order of 0.2% per day by Gjoka et al. In other words, not even all these efforts could ensure the acquisition of a perfect sample. For these reasons, a widely adopted approach is to collect small samples of a network, trying to preserve the characteristics of its structure. There are several different sampling techniques that can be exploited; each algorithm ensures different performance and possibly introduces bias in the data.
For our experimentation we collected two significant samples of the structure of the social network, of a size comparable to that of other similar studies (63, 277). In particular, we adopted two different sampling algorithms, namely “breadth-first-search” and “Uniform”. The former has been proved to introduce bias in certain conditions (e.g., in incomplete visits) towards high-degree nodes (170). The latter is proved to be unbiased by construction by Gjoka et al.
Once collected, the data are compared and analyzed in order to establish their quality and to study their properties and characteristics. We consider two quality criteria to evaluate the samples: i) statistical significance with respect to mathematical/statistical models; ii) congruency with results reported by similar studies. Considerations about the characteristics of both the “breadth-first-search” and the “Uniform” samples follow.
How to extract data from Facebook
Companies providing online social networking services, such as Facebook, Twitter, etc., do not have an economic interest in sharing their data about users, because their business model mostly relies on advertising. For example, exploiting this information, Facebook provides unique and efficient services to advertising companies. Moreover, questions about the protection of these data have been raised, for privacy reasons, in particular for Facebook (139, 196).
In this social network, for example, information about users and the interconnections among them, their activities, etc. can only be accessed through the interface of the platform. To preserve this condition, some constraints have been implemented. Among others, a limit is imposed on the amount of information accessible from the profiles of users who are not in a friendship relation with each other. There are also some technical limitations, e.g., the friend-list is dispatched through an asynchronous script, so as to prevent naive crawling techniques. Some Web services, such as the “Graph API” 1, have been provided by the Facebook developers team during the last months of 2010, but they do not bypass these limitations (and they eventually add even more restrictions). As of 2011, the structure of this social network can be accessed only by exploiting techniques typical of Web data mining.
4.2.2 The Sampling Architecture
In order to collect data from the Facebook platform, we designed a Web data mining architecture composed of the following elements (see Figure 4.1): i) a server running the mining agent(s); ii) a cross-platform Java application, which implements the logic of the agent; and iii) an Apache interface, which manages the information transfer through the Web. While running, the agent(s) query the Facebook server(s), obtaining the friend-list Web pages of specific requested users (this aspect depends on the implemented sampling algorithm) and reconstructing the structure of the relationships among them. Collected data are stored on the server and, after a post-processing step (see Section 4.2.5), they are delivered (eventually represented using a standard XML format (48)) for further experimentation.

Figure 4.1: Architecture of the data mining platform.

1 Available from http://developers.facebook.com/docs/api
The Facebook crawler
The cross-platform Java agent which crawls the Facebook front-end is the core of our mining platform. The logic of the developed agent, regardless of the sampling algorithm implemented, is depicted in Figure 4.2. The first preparatory step for the agent execution includes choosing the sampling methodology and configuring some technical parameters, such as the termination criterion/a, the maximum running time, etc. Then, the crawling task can start or be resumed from a previous point. During its execution the crawler visits the friend-list page of a user, following the directives of the chosen sampling algorithm for traversing the social graph. Data about newly discovered nodes and the connections among them are stored in a compact format, in order to save I/O operations. The process of crawling concludes when the termination criterion/a is/are met.

Figure 4.2: State diagram of the data mining process.

During the data mining step, the platform exploits the Apache HTTP Request Library 1 to communicate with the Facebook server(s). After an authentication phase, which uses a secure connection and “cookies” for logging into the Facebook platform, the HTML friend-list Web pages are obtained via HTTP requests. This process is described in Table 4.1.
N.  Action                                          Protocol   Method   URI                                          KB
1   open the Facebook page                          HTTP       GET      www.facebook.com/                            242
2   login providing credentials                     HTTPS      POST     login.facebook.com/login.php                 234
                                                    HTTP       GET      /home.php                                    87
3   visit the friend-list page of a specific user   HTTP       GET      /friends/ajax/friends.php?id=#&filter=afp    224

Table 4.1: HTTP request flow of the crawler: authentication and mining steps.
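The same flow, sketched with Python's requests library in place of the Apache HTTP Request Library used by the Java agent; the endpoints are those listed in Table 4.1 (valid at the time of the study), while the login form field names and the user-ID are hypothetical placeholders.

```python
import requests

session = requests.Session()  # keeps the authentication cookies across requests

# 1) open the Facebook page (obtains the initial cookies)
session.get("http://www.facebook.com/")

# 2) login providing credentials over HTTPS, then load the home page
session.post("https://login.facebook.com/login.php",
             data={"email": "user@example.com", "pass": "secret"})  # hypothetical field names
session.get("http://www.facebook.com/home.php")

# 3) request the friend-list page of a specific user-ID
response = session.get("http://www.facebook.com/friends/ajax/friends.php",
                       params={"id": 1234567890, "filter": "afp"})  # hypothetical user-ID
friend_list_html = response.text  # raw HTML, handed over to the Web wrapper
```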
Regarding the data extraction, the crawler implements a Web wrapper, as discussed in the previous Chapter, which includes automatic adaptation features. The wrapper exploits an XPath that is selected by the user during the configuration phase, so that it is possible to specify which elements must be extracted. The crawler provides two running modes: i) visual extraction and ii) HTTP request-based extraction. In the visual extraction mode, depicted in Figure 4.3, the crawler embeds a Firefox browser interfaced through XPCOM 2 and XULRunner 3. The advantage of this solution is its ability to perform asynchronous requests, such as AJAX scripts (which, in the case of Facebook, are useful to fill those friend-list pages exceeding a certain size, as further described in the next Section). Its drawback is clearly a slower execution, since the rendering of the Web pages is required and time-consuming. Thus, for our large-scale extraction we adopted the HTTP request-based solution.
1 http://httpd.apache.org/apreq
2 https://developer.mozilla.org/en/xpcom
3 https://developer.mozilla.org/en/XULRunner
Figure 4.3: Screenshot of the Facebook visual crawler
Limitations
During the data mining task we noticed a technical limitation imposed by Facebook on the size of the friend-list Web pages dispatched via HTTP requests. To reduce the traffic through its network, Facebook provides shortened friend-lists not exceeding 400 friends. During a normal navigation experience on the Website, if the friend-list of a user exceeds 400 friends, an asynchronous script fills the page with the remaining ones. This behavior is not reproducible using an agent based on HTTP requests. The problem can be avoided using a different mining approach, for example adopting the visual crawler to integrate the missing data. However, this approach is not viable for large-scale mining tasks, due to its cost, even though we proved its functioning in a smaller experimentation (57). In Section 4.3.4 we investigate the impact of this limitation on the samples.
4.2.3 Breadth-first-search Sampling
The breadth-first-search (BFS) is an uninformed traversal algorithm whose aim is to visit a graph. Starting from a “seed node”, it explores its neighborhood; then, for each neighbor, it visits its unexplored neighbors, and so on, until the whole graph is visited (or, alternatively, a termination criterion is met). This sampling technique shows several advantages: i) ease of implementation; ii) optimal solution for unweighted graphs; iii) efficiency. For these reasons it has been adopted in a variety of OSN mining studies, including (57, 58, 63, 203, 277, 286). In the last year, the hypothesis that the BFS algorithm produces data biased towards high-degree nodes, if adopted for partial graph traversals, has been advanced by (170). This is because, in the same (partial) graph obtained by a BFS visit, there are both nodes which have been visited (high-degree nodes) and nodes which have just been discovered as neighbors of visited ones (low-degree nodes). One important aspect of our experimentation has been to verify this hypothesis, in order to highlight which properties of a partial graph obtained using BFS sampling are preserved, and which are biased. To do so, we had to acquire a comparable sample which is certainly unbiased by construction (see further).
Description of the breadth-first-search crawler
The BFS sampling methodology is implemented as one of the possible visiting algorithms in our Facebook crawler, described before. While using this algorithm, the crawler first extracts the friend-list of the “seed node”, which is represented by the user actually logged on the Facebook platform. The user-IDs of the contacts in its friend-list are stored in a FIFO queue. Then, the friend-lists of these users are visited, and so on. In our experimentation, the process continued until two termination criteria had been met: i) at least the third sub-level of friendship was completely covered; ii) the mining process exceeded 240 hours of running time. As discussed before, the time constraint is adopted in order to observe a short mining interval, so that the temporal evolution of the network is minimal (in the order of 2%) and can be ignored. The obtained graph is a partial reconstruction of the Facebook network structure, and its dimension is used as a yardstick for configuring the “Uniform” sampling (see further).
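A compact sketch of the BFS visiting logic described above, with the page download and friend-list parsing abstracted behind a hypothetical fetch_friend_ids function; the two termination criteria (coverage depth and maximum running time) mirror those used in our experimentation.

```python
import time
from collections import deque


def bfs_sample(seed_id, fetch_friend_ids, max_depth=3, max_hours=240):
    """Breadth-first sampling of the friendship graph from a seed user."""
    deadline = time.time() + max_hours * 3600
    queue = deque([(seed_id, 0)])  # FIFO queue of (user-ID, depth) pairs
    visited, edges = set(), set()
    while queue and time.time() < deadline:
        user_id, depth = queue.popleft()
        if user_id in visited or depth > max_depth:
            continue
        visited.add(user_id)
        for friend_id in fetch_friend_ids(user_id):  # one friend-list page per call
            edges.add(frozenset((user_id, friend_id)))  # undirected, de-duplicated edge
            if friend_id not in visited:
                queue.append((friend_id, depth + 1))
    return visited, edges
```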
Characteristics of the breadth-first-search dataset
This crawler has been executed during the first part of August 2010. At that time the number of users subscribed to Facebook was roughly 540 million. The acquired sample covers about 12 million friendship connections among about 8 million users. Among these users, we performed the complete visit of about 63.4 thousand of them, thus resulting in an average degree d = 2·|E| / |V_v| ≈ 396.4, considering V_v as the set of visited users.
The overall mean degree, considering V_t as the set of total nodes in the graph (visited users + discovered neighbors), is o = 2·|E| / |V_t| ≈ 3.064. The expected density of the graph is Δ = 2·|E| / (|V_v| · (|V_v| − 1)) ≈ 0.006259 ≈ 0.626%, considering V_v as the set of visited nodes. We can combine the previous equations, obtaining Δ = d / (|V_v| − 1): the expected density of a graph is the average proportion of edges incident with nodes in the graph.
In our case, the value δ = o/d = |V_v| / |V_t| ≈ 0.007721 ≈ 0.772%, which we introduce here, represents the effective density of the obtained graph.
The distance between the effective and the expected density of the graph is computed as ∂ = 100 − (Δ · 100) / δ ≈ 18.94%.
This result means that the obtained graph is slightly more connected than expected, with respect to the number of unique users it contains. This consideration is also compatible with the hypothesis advanced in (170). The effective diameter of this (partial) graph is 8.75, which is compliant with the “six degrees of separation” theory (15, 202, 216, 266).
The largest connected component of the graph covers almost the whole sample (99.98%). The small number of disconnected nodes can be intuitively ascribed to some collisions caused by the hash function exploited to de-duplicate and anonymize user-IDs during the data cleaning step (see Section 4.2.5). Some interesting considerations hold for the obtained clustering coefficient: it lies in the lower part of the interval [0.05, 0.18] reported by (125) and, similarly, of the interval [0.05, 0.35] reported by (277) using the same sampling methodology. The characteristics of the collected sample are summarized in Table 4.2.
No. visited users          63.4K    Avg. deg.    396.8   Bigg. eigenval.    68.93
No. discovered neighbors   8.21M    Eff. diam.   8.69    Avg. clust. coef.  0.0188
No. edges                  12.58M   Conn. comp.  98.98%  Density            0.626%

Table 4.2: BFS dataset description (crawling period: 08/01-10/2010).
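The degree and density figures above follow directly from the node and edge counts; a quick numerical check of the formulas introduced in this Section (values are those of the BFS sample):

```python
# BFS sample statistics (Table 4.2), re-derived from the raw counts.
visited = 63.4e3   # |V_v|: completely visited users
total   = 8.21e6   # |V_t|: visited users + discovered neighbors
edges   = 12.58e6  # |E|: friendship connections

d = 2 * edges / visited                                      # average degree, ~396.8
o = 2 * edges / total                                        # overall mean degree, ~3.06
expected_density = 2 * edges / (visited * (visited - 1))     # Delta, ~0.63%
effective_density = visited / total                          # delta = o/d, ~0.77%
distance = 100 - expected_density * 100 / effective_density  # ~18.9%
print(d, o, expected_density, effective_density, distance)
```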
4.2.4 Uniform Sampling
To acquire a comparable sample, unbiased by construction, we exploited a rejection-based sampling methodology. This technique has been applied to Facebook by Gjoka et al., who proved its correctness. Its efficiency relies on the following assumptions:
1. it is possible to generate uniformly distributed sample values for the domain of interest;
2. these values are not sparse with respect to the dimension of the domain; and
3. it is possible to sample these values from the domain.
In Facebook, each user is identified by a 32-bit user-ID. Considering that user-IDs lie in the interval [0, 2^32 − 1], the highest possible number of assignable user-IDs using this system is H ≈ 4.295e9.
The space of user-IDs is currently filling up: since the actual number of assigned user-IDs, R ≈ 5.4e8, roughly equals the 540 million currently subscribed users 1,2, the two domains are comparable and rejection sampling is viable. We generated an arbitrary number of random 32-bit user-IDs, querying Facebook for their existence (and, eventually, obtaining their friend-lists). This sampling methodology shows two advantages: i) we can statistically estimate the probability R/H ≈ 12.5% of getting an existing user; and ii) we can generate an arbitrary number of user-IDs in order to acquire a sample of the desired dimension. Moreover, the distribution of user-IDs is completely independent of the graph structure.
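A sketch of the rejection step, with the existence query abstracted behind a hypothetical user_exists predicate (in our setting it corresponds to requesting the friend-list page of the candidate user-ID and checking whether the profile exists):

```python
import random


def uniform_sample(user_exists, wanted, id_space=2**32):
    """Rejection sampling of user-IDs: draw uniformly from [0, 2^32 - 1]
    and keep only the IDs corresponding to existing users (~12.5% hit rate)."""
    sampled = set()
    while len(sampled) < wanted:
        candidate = random.randrange(id_space)
        if user_exists(candidate):  # HTTP query against the platform
            sampled.add(candidate)
    return sampled
```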
Description of the “Uniform” crawler
The “Uniform” sampling is the second algorithm implemented in the Facebook crawler we
developed. Differently with respect to the BFS sampler, if adopting this algorithm, it is possible
to parallelize the process of extraction. This is because user-IDs to be requested can be stored
in different “queues”. We designed the uniform sampling task starting from these assumptions:
i) the number of subscribed users is 229 ' 5.368e8; ii) this value is comparable with the highest
possible assignable number of user-IDs, 232 ' 4.295e9, thus iii) we can statistically assert that
29
the possibility of querying Facebook for an existing user-ID is 2232 = 18 (12.5%). To this purpose,
we generated eight different queues, each containing 216 ' 65.5K ∼
= 63.4K random user-IDs
(the number of visited users of the BFS sample), used to feed eight parallel crawlers. This
number has been chosen in order to obtain a sample whose size was comparable with the BFS
sample.
1 As
of August 2010, http://www.facebook.com/press/info.php?statistics
2 http://www.google.com/adplanner/static/top1000/
No. visited users          48.1K   Avg. deg.    326.0   Bigg. eigenval.    23.63
No. discovered neighbors   7.69M   Eff. diam.   14.72   Avg. clust. coef.  0.0014
No. edges                  7.84M   Conn. comp.  94.96%  Density            0.678%

Table 4.3: “Uniform” dataset description (crawling period: 08/11-20/2010).
Characteristics of the “Uniform” dataset
The uniform sampling process has been executed during the second half of August 2010. The crawler collected a sample which contains almost 8 million friendship connections among a similar number of users. The acquired number of nodes differs from the expected one due to the privacy policy adopted by those users who prevent their friend-lists from being visited. The privacy policy aspect is discussed in Section 4.3.3.
The total number of visited users has been about 48.1 thousand, thus resulting in an average degree of d = 2·|E|/|Vv| ≃ 326.0, considering Vv as the set of visited users. Under the same assumptions, the expected density of the graph is ∆ = 2·|E| / (|Vv|·(|Vv|−1)) ≃ 0.006777 ≃ 0.678%.
If we consider Vt as the set of total nodes (visited users + discovered neighbors), the overall mean degree is o = 2·|E|/|Vt| ≃ 2.025. The effective density of the graph, previously introduced, is δ = |Vv|/|Vt| ≃ 0.006214 ≃ 0.621%. The distance between the effective and the expected density of the graph is ∂ = 100 − (∆·100)/δ ≃ −9.06%. This can be intuitively interpreted as a slight lack of connection of this sample with respect to the theoretical expectation.
Some considerations hold when comparing this sample against the BFS one: the average degree is slightly smaller (326.0 vs. 396.8), but the effective diameter is almost double (14.72 vs. 8.69). We hypothesize that this is due to its size, which is insufficient to faithfully reflect the structure of the network. Our hypothesis is also supported by the dimension of the largest connected component, which leaves out about 5% of the sample. Finally, the clustering coefficient, smaller than that of the BFS sample (0.0471 vs. 0.0789), is still comparable with the values reported by previously considered studies (125, 277).
4.2.5 Data Preparation
During the data mining process, redundant information may be stored. In particular, while extracting friend-lists, a crawler could save multiple instances of the same edge (i.e., a parallel edge) if both the connected users are visited; this is related to the fact that we adopted an undirected graph representation. We adopted a hashing-based algorithm which cleans data in O(N) time, removing duplicate edges. Another step of the data preparation is the anonymization: user-IDs are “encrypted” adopting a 48-bit hybrid rotative and additive hash function (229), to obtain anonymized datasets. The final step was to verify the integrity and the congruency of the data. We found that the usage of the hashing function caused occasional collisions (0.0002%). Finally, some datasets of small sub-graphs (e.g., ego-networks) have been post-processed and stored using the GraphML format (48).
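A minimal sketch of the de-duplication and anonymization steps is the following, assuming the edge list fits in memory; the 48-bit hash shown here is a generic stand-in, not the exact hybrid rotative and additive function of (229).

```python
import hashlib

def anonymize(user_id):
    """Map a user-ID to a 48-bit pseudonym (generic stand-in for the hash of (229))."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:6], "big")       # keep 48 bits

def deduplicate(edges):
    """Remove parallel edges in O(N) using a hash set of canonical pairs."""
    seen = set()
    cleaned = []
    for u, v in edges:
        key = (u, v) if u <= v else (v, u)         # undirected: order endpoints
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned

edges = [(1, 2), (2, 1), (2, 3)]                   # (2, 1) duplicates (1, 2)
unique = deduplicate(edges)
anon = [(anonymize(u), anonymize(v)) for u, v in unique]
print(len(unique), "unique edges")                 # -> 2 unique edges
```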
4.3 Network Analysis Aspects
During the last years, important achievements have been reached in understanding the structural properties of several complex real networks. The availability of large-scale empirical data,
on the one hand, and the advances in computational resources, on the other, made it possible
to discover and understand interesting statistical properties commonly shared among different real-world social, biological and technological networks. Among others, some important
examples are: the World Wide Web (7), Internet (97), metabolic networks (155), scientific collaboration networks (17, 208), citation networks (243), etc. In recent years, social networks have increasingly been described as complex networks with very specific models and properties. For example, some studies (10, 202, 266) proved the validity of the well-known “small-world” effect in complex social networks. Others (2, 228) assert that “scale-free” complex networks exhibit a “power law” distribution of node degrees. We can conclude that the topology of a network usually provides useful information about the dynamics of the network entities and the interactions among them.
The study of complex networks led to important results in some specific contexts, such as the
social sciences. A branch of the network analysis applied to social sciences is the Social Network
Analysis (SNA). From a different perspective with respect to the analysis of complex networks,
which mainly aims to analyze structural properties of networks, the SNA focuses on studying
the nature of relationships among entities of the network and, in the case of social networks,
investigating the motivational aspect of these connections.
4.3.1 Definitions
In this Section we describe some of the common structural properties which are usually observed in several complex networks. The early introduction of some concepts such as clustering,
“small-world” effect and scale-free distributions, is instrumental for the discussion of further
experiments on Facebook. The same concepts will be treated in greater detail in the next Chapter of this dissertation.
Clustering
In several networks it has been observed that, if a node i is connected to a node j, which in turn is connected to a node k, then there is a heightened probability that node i is also connected to node k. From a social network perspective, a friend of your friend is likely also to be your friend. In terms of network topology, transitivity means the presence of a heightened number of triangles in the network, i.e., sets of three nodes connected to each other (210).
The global clustering coefficient is defined by

Cg = (3 × no. of triangles in G) / (no. of connected triples)     (4.1)

where a connected triple represents a pair of nodes connected to another node. Cg is the mean probability that two persons who have a common friend are also friends with each other. An alternative
definition of clustering coefficient C has been provided by Watts and Strogatz (274)
Ci = (num. of triangles connected to i) / (num. of triples centered on i)     (4.2)

where the denominator is equal to ki(ki − 1)/2 for the degree ki of node i. For ki = 0 and ki = 1, Ci = 0 by convention. The averaged clustering coefficient is then defined by C = Σi Ci / N.
The local clustering coefficient Ci has a strong dependence on the degree ki. To quantify it, one usually defines C(k) = ⟨Ci⟩|ki=k.
During our experimentation we investigated the clustering effect on the Facebook network (see
Section 4.3.5).
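As an illustration of the two definitions above, the following sketch computes Cg, the averaged coefficient C and C(k) on a toy graph using networkx; the actual experiments were run with SNAP, so this is only a didactic stand-in.

```python
import networkx as nx
from collections import defaultdict

G = nx.karate_club_graph()                 # toy graph for illustration

Cg = nx.transitivity(G)                    # Eq. 4.1: 3*triangles / connected triples
local = nx.clustering(G)                   # Eq. 4.2: Ci for every node i
C_avg = sum(local.values()) / G.number_of_nodes()

# C(k): mean local clustering coefficient of nodes having degree k
by_degree = defaultdict(list)
for node, ci in local.items():
    by_degree[G.degree(node)].append(ci)
C_of_k = {k: sum(v) / len(v) for k, v in by_degree.items()}

print(f"global Cg = {Cg:.3f}, average C = {C_avg:.3f}")
print("C(k) for the smallest degrees:",
      {k: round(C_of_k[k], 3) for k in sorted(C_of_k)[:5]})
```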
The “small world” effect
It is well known in the literature that most large-scale networks, despite their huge size, share a common property: there exists a relatively short path which connects any pair of nodes within the network. This characteristic, called the small-world effect1, scales proportionally to the logarithm of the size of the network. The study of this phenomenon is rooted in the social sciences (202, 266) and is strictly interconnected with the notion of diameter we introduced before. The Facebook social network reflects the “small world” effect, as discussed in Section 4.3.5.
Scale-free degree distributions
In a random graph (94) the node degree is characterized by a distribution function P(k) which defines the probability that a randomly chosen node has exactly k edges. Because the distribution of edges in a random graph is random, most of the nodes have approximately the same degree, close to the mean degree ⟨k⟩ of the network. Thus, the degree distribution of a random graph is well described by a Poisson distribution, with a peak in P(⟨k⟩). Recent empirical results show that in most real-world networks the degree distribution significantly differs from a Poisson distribution. In particular, for several large-scale networks, such as the World Wide Web (7), Internet (97) and metabolic networks (155), the degree distribution follows a power law

P(k) ∼ k^−λ     (4.3)
This power law distribution falls off more gradually than an exponential one, allowing for a few
nodes of very large degree to exist. Since these power laws are free of any characteristic scale,
such a network with a power law degree distribution is called a scale-free network (14). We
proved that Facebook is a scale-free network well described by a power law degree distribution,
as discussed in Section 4.3.4.
1 Note: it will be extensively introduced in the next Chapter.
4.3.2 Experimentation
We describe some interesting experimental results as follows. To compute the following overall statistics and centrality measures, such as degree and betweenness, we have adopted the
Stanford Network Analysis Platform (SNAP) (183), a general purpose network analysis library.
4.3.3 Privacy Settings
We investigated the adoption of restrictive privacy policies by users: our statistical expectation using the “Uniform” crawler was to acquire 8 · 2^16/2^3 ≃ 65.5K users. Instead, the actual number of collected users was 48.1K. Because of the restrictive privacy settings chosen by users, the discrepancy between the expected number of acquired users and the actual number was about 26.6%. In other words, about a quarter of Facebook users adopt privacy policies which prevent other users (except for those in their friendship network) from visiting their friend-list.
4.3.4 Degree Distribution
A first description of the network topology of the Facebook friendship graph can be obtained
from the degree distribution. According to Equation 4.3, a relatively small number of nodes
exhibit a very large number of links. An alternative approach involves the Complementary Cumulative Distribution Function (CCDF), defined as

℘(k) = ∫_k^∞ P(k′) dk′ ∼ k^−α ∼ k^−(γ−1)     (4.4)
When calculated for scale-free networks, the CCDF shows up as a straight line in a log-log plot,
while the exponent of the power law distribution only varies the height (not the shape) of the
curve.
In Figure 4.4 the degree distribution is plotted, as obtained from the BFS and “Uniform” sampling techniques. The limitations due to the size of the cache which contains the friend-lists, upper bounded at 400, are evident. The CCDF is shown, for the same samples, in Figure 4.5. From these data it emerges that the degree distribution is not clearly defined by a strict power law. Rather, different regimes can be identified for both samples. In detail, roughly dividing the domain into two intervals, tentatively 1 ≤ x ≤ 10 and 10 ≤ x ≤ 400, there exist two clear regimes whose exponents are λ1^BFS = 2.45, λ2^BFS = 0.6 and λ1^UNI = 2.91, λ2^UNI = 0.2, respectively for the BFS and the Uniform sample. Figure 4.6 summarizes the previous findings, depicting the probability P(x) of finding a given number of nodes with a specific degree.
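One simple way to estimate such exponents, sketched below under the assumption that a least-squares fit of the histogram in log-log space is acceptable (a maximum-likelihood estimator would be more rigorous), is to fit each regime separately; the degree sequence used here is synthetic and purely illustrative.

```python
import numpy as np

def fit_power_law_exponent(degrees, k_min, k_max):
    """Estimate lambda in P(k) ~ k^-lambda over [k_min, k_max]
    via a least-squares fit of the empirical histogram in log-log space."""
    degrees = np.asarray(degrees)
    ks, counts = np.unique(degrees[(degrees >= k_min) & (degrees <= k_max)],
                           return_counts=True)
    pk = counts / counts.sum()
    slope, _ = np.polyfit(np.log(ks), np.log(pk), 1)
    return -slope

# synthetic degree sequence drawn from a heavy-tailed distribution (illustrative)
rng = np.random.default_rng(0)
sample = np.floor(rng.pareto(1.5, size=50_000) + 1).astype(int)

lam1 = fit_power_law_exponent(sample, 1, 10)      # first regime
lam2 = fit_power_law_exponent(sample, 10, 400)    # second regime
print(f"lambda_1 ~ {lam1:.2f}, lambda_2 ~ {lam2:.2f}")
```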
4.3.5 Diameter and Clustering Coefficient
It is well known that most real-world networks exhibit a relatively small diameter. A graph has diameter D if every pair of nodes can be connected by a path of length of at most D edges. However, the diameter may be affected by outliers. A robust measure of the pairwise distances between nodes in a graph is the effective diameter, which is the minimum number of links (steps/hops) within which some fraction (or quantile q, say q = 0.9) of all connected
Figure 4.4: Node degree distribution BFS vs. UNI Facebook sample.
Figure 4.5: CCDF node degree distribution BFS vs. UNI Facebook sample.
Figure 4.6: Node degree probability distribution BFS vs. UNI Facebook sample.
pairs of nodes can reach each other. The effective diameter has been found to be small for large
real-world graphs, like Internet and the Web, real-life and OSNs (8, 185, 202).
The hop-plot extends the notion of diameter by plotting the number of reachable pairs g(h) within h hops, as a function of the number of hops h (228). It gives a sense of how quickly the neighborhoods of nodes expand with the number of hops. In Figure 4.7 the number of pairs of nodes is plotted as a function of the number of hops required to connect each pair. As a consequence of the more “compact” structure of the graph, the BFS sample shows a faster convergence to the asymptotic value listed in Table 4.2.
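The hop-plot and the effective diameter can be approximated as in the sketch below, assuming the graph fits in memory and BFS is run from every node; on samples of the size discussed here one would rather use sampling or approximate neighborhood functions.

```python
import networkx as nx
import numpy as np

def hop_plot(G):
    """Return g(h): the number of connected node pairs within h hops."""
    counts = {}
    for source in G:
        for _, dist in nx.single_source_shortest_path_length(G, source).items():
            if dist > 0:
                counts[dist] = counts.get(dist, 0) + 1
    hops = np.arange(1, max(counts) + 1)
    pairs = np.cumsum([counts.get(h, 0) for h in hops])
    return hops, pairs

def effective_diameter(G, q=0.9):
    """Smallest h such that a fraction q of connected pairs is within h hops."""
    hops, pairs = hop_plot(G)
    return hops[np.searchsorted(pairs, q * pairs[-1])]

G = nx.erdos_renyi_graph(500, 0.02, seed=1)   # toy graph for illustration
print("effective diameter (q=0.9):", effective_diameter(G))
```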
Often, it is insightful to examine not only the mean clustering coefficient (see Section 4.3.1), but also its distribution. Figure 4.8 shows the average clustering coefficient plotted as a function of the node degree for the two sampling techniques. As a consequence of the more systematic approach of extraction, the distribution of the clustering coefficient of the BFS sample shows a smooth behavior.
The following considerations hold for the diameter and hops: the BFS sample may be affected by
the “wavefront expansion” behavior of the visiting algorithm, while the “Uniform” sample may
still be too small to represent a faithful estimate of the diameter (this hypothesis is supported
by the dimension of the largest connected component which does not cover the whole graph,
as discussed in the next paragraph). Different conclusions can be derived for the clustering
coefficient property. It is important to observe that the average values of the BFS sample
fluctuate in a similar interval reported by recent similar studies on OSNs (i.e., [0.05, 0.18] by
Wilson et al., [0.05, 0.35] by Gjoka et al.), confirming that this property is preserved by the
BFS sampling technique. On the contrary, due to the intrinsic feature of the uniform sampling,
the clustering coefficient is not sufficiently well represented at this scale.
Figure 4.7: Hops and diameter in Facebook.
Figure 4.8: Clustering coefficient in Facebook.
4.3.6 Connected Components
A connected component is a maximal set of nodes where for each pair of nodes there exists a
path connecting them.
As shown in Tables 4.2 and 4.3, the largest connected components cover 99.98% of the BFS graph and 94.96% of the “Uniform” graph. In Figure 4.9, the scattered points in the left part of the plot have a different meaning for each sampling technique. In the “Uniform” case, the sampling picked up disconnected nodes. In the BFS case, disconnected nodes are meaningless, as they are due to collisions of the hashing function during the de-duplication phase of the data-cleaning step. This interpretation is supported by their small number (29 collisions over 12.58 million hashed edges), involving only 0.02% of the total edges. However, the quality of the sample is not affected.
These conclusions are confirmed in Figure 4.10, where the betweenness centrality (BC) is plotted
as a function of the degree, in a log-log scale. The BC shows a linearly proportional behavior with respect to the degree, which means that it follows a power law distribution p(g) ∼ g^−η.
In our opinion, this implies a high degree of connectedness of the sample, since a high value of
BC is related to a high value of the degree of the nodes.
Moreover, it is well-known that the BC distribution follows a power law distribution for scale-free
networks (128). Similarly to the degree exponent case, in general, the BC exponents increase for
node and link sampling and decrease for snowball sampling as the sampling fraction gets lower.
The correlation between degree and BC of nodes (19), shown in Figure 4.10, could explain the
same direction of changes of degree and BC exponents.
We found that the best fitting function to describe the BC for Facebook has an exponent
η = 0.61.
Figure 4.9: Connected components in Facebook.
Figure 4.10: Degree vs betweenness centrality in Facebook.
Conclusion
Extraction and analysis of OSN data describing social networks pose both a technological and an interpretative challenge. In this Chapter we have presented our implemented system, an ad hoc Facebook crawler developed to comply with the increasingly strict terms of the Facebook end-user license, i.e., to create large, fully anonymized samples that can be employed for scientific purposes. Two different sampling techniques have been implemented
in order to explore the graph of friendships of Facebook, since the BFS visiting algorithm is
known to introduce a bias in case of an incomplete visit.
Analysis of such large samples was tackled using concepts and algorithms typical of graph theory: users were represented by nodes of a graph and relations among users were represented by edges. Social Network Analysis concepts, such as degree distribution, diameter,
centrality metrics, clustering coefficient distribution have been considered for such samples,
highlighting those features which are believed to be preserved and those which are affected by
some bias due to the partial sampling of the given OSN.
5 Network Analysis and Models of Online Social Networks
This Chapter is organized as follows: Section 5.1 presents related literature on the topics of
social networks, their analysis – in particular regarding Online Social Networks – and the latest
works, which define directions of Social Network Analysis. Section 5.2 introduces the key features shared by most Online Social Networks. In Section 5.3 we describe the generative models proposed to represent social networks, putting into evidence those aspects in which they could fit well to represent Online Social Networks and those in which they could fail.
Results of our experimentation, presented in Section 5.5, depict the topological features of the
other studied Online Social Networks, such as Arxiv, Wikipedia and Youtube.
5.1 Background and Related Literature
Studying large-scale Online Social Networks, and their evolution, can be useful to investigate
similarities and differences with real-life social networks. Some interesting aspects in the study
of social networks are defined by the Social Network Analysis (SNA), a novel branch of Computational Social Sciences. It provides techniques to evaluate networks, both from a quantitative
(e.g. defining properties, characteristics and metrics) and a qualitative perspective.
In this Chapter we face the problem of analyzing the topological features of some popular Online
Social Networks other than Facebook, focusing on the graphs which represent these networks.
To do so, we adopt some specific topological measures, such as the diameter, and we study the degree distribution. Moreover, we investigate the emerging community structure characterizing
these networks.
5.1.1 Social Networks and Models
Literature about social network models is rooted in social sciences: in the sixties, Milgram and
Travers (202, 266) analyzed characteristics of real-life social networks, conducting several social
experiments, and in conclusion, proposing the well known “small-world” model (see Section
5.2.1).
Kleinberg (165) analyzed the “small-world” effect from an algorithmic perspective, providing
important algorithms to compute metrics on graphs representing social networks, the so called
social graphs.
Another important concept, introduced by Zachary (287), is the community structure (see
Section 5.2.3). The author analyzed a small real-life social community (the components of a
karate club), defining a model which describes the clusterization of social networks via cuts and
fissions in sub-groups.
One of the first models is the so-called Erdős-Rényi model (see Section 5.3.1), which employs random graphs in order to represent real networks. Watts and Strogatz (273, 274) furnished a one-parameter model that interpolates between an ordered finite-dimensional lattice and a random graph (see Section 5.3.2). This is because they empirically found that real-world social
networks are well connected and have a short average path length like random graphs, but they
also have exceptionally large clustering coefficients, a feature which is not reflected by random
graph models. Barabási and Albert (7, 8, 15) introduced different models that can be applied
to friendship networks, the World Wide Web, business and commerce networks, etc., proving
that they all share similar properties (see Section 5.3.3).
5.1.2 Recent Studies and Current Trends
Some of the current trends in the analysis of social networks are summarized as follows:
a) Some works (4, 203) investigate topological features of social networks by means of measurements, studying link symmetries, degree distributions, clustering coefficients, group formations, etc., usually on a large scale, by analyzing Online Social Networks.
b) Another trend in current research is the analysis of evolutionary aspects of social networks.
In this context, Kumar et al. (169) defined a generative model which describes the structure of
OSNs and their dynamics over time. This model has been compared against actual data, in order to validate its reliability. Similarly, Leskovec (184) analyzed evolutionary aspects of social networks, trying to describe the dynamic and structural features which influence the growth of communities, in particular when considering large social networks.
c) Graph mining techniques assume growing importance because of the computational complexity of studying large-scale social graphs with millions of nodes and edges. Some authors
(125, 184) faced the problem of sampling from large graphs adopting different techniques, in order to establish whether it is possible to avoid data bias when studying sub-graphs of social networks. They found that Random Walk and Metropolis-Hastings algorithms perform better, respectively, for static and dynamic graphs, concluding that samples covering 15% of a social graph preserve most of its properties.
d) Some authors (129, 189) try to identify which characteristics of the network could suggest which nodes are more likely to be connected by trusted relationships, the so-called link prediction problem. This is of great interest for different commercial reasons, which will be discussed in detail in the next Chapter of this dissertation.
Applications of Social Network Analysis research
Possible applications of information acquired from social networks have been investigated by
Staab et al. (260): methodologies for exploiting discovered data were defined, for marketing
purposes, recommendation and trust analysis, etc. Recently, several marketing and commercial
studies have been applied to OSNs, in particular to discover efficient channels to distribute
information (159) and users who share similar tastes and preferences in order to suggest them
useful recommendations (84). This Thesis provides useful information in all these directions, identifying interesting characteristics of Online Social Networks and considering the topological features that affect how efficiently nodes and edges can carry information through the network.
5.2 Features of Social Networks
In this Section we put into evidence three key features that characterize social networks, i.e., i)
“small-world”, ii) scale-free degree distributions and, iii) emergence of a community structure.
During our experimentation, we take into account these features in order to establish if Online
Social Networks show these well-known characteristics.
A social network can be defined by means of a graph G = (V, E) whose set of vertices V
represents nodes of the network (i.e., the individuals), and whose set of edges E represents
connections (i.e., the social ties) among nodes of the network.
5.2.1 The “Small-World”
The study of the “small-world” effect on social networks is rooted in Social Sciences (202, 266).
Authors put into evidence that, despite their large size, social networks usually show a common feature: there exists a relatively short path connecting any pair of nodes within the network.
Formally, a “small-world” network is defined as a graph in which most nodes are not neighbors of one another, but can be reached from every other node by a small number of hops. The diameter ℓ, which reflects the so-called “small-world” effect, scales proportionally to the logarithm of the size of the network, which is formalized as

ℓ ∝ log(|V|)     (5.1)
where |V | represents the cardinality of V . Some characteristics of many real-world networks are
well modeled by means of “small-world” networks, such as OSNs (274), Internet (230), World
Wide Web (7), biological networks (124).
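As a quick illustration of Equation 5.1, the following sketch (assuming networkx; the graph sizes are arbitrary) compares the average shortest-path length of small-world graphs of growing size with log(|V|).

```python
import math
import networkx as nx

# Watts-Strogatz small-world graphs of growing size: the average shortest-path
# length l should grow roughly like log(|V|), as stated in Equation 5.1.
for n in (100, 500, 2000):
    G = nx.connected_watts_strogatz_graph(n=n, k=6, p=0.1, seed=42)
    l = nx.average_shortest_path_length(G)
    print(f"|V| = {n:>5}  l = {l:5.2f}  log(|V|) = {math.log(n):5.2f}")
```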
5.2.2 Scale-free Degree Distributions
An important feature reflected by several generative models of social networks is the degree distribution of nodes, because this feature characterizes the way nodes are interconnected in the social network. On the one hand, in a random graph (94) the node degree is characterized by a distribution function P(k) well described by a Poisson law, with a peak in P(⟨k⟩). On the other hand, recent empirical results show that in most real-world networks the degree distribution follows a power law

P(k) ∼ k^−γ     (5.2)
Power law based models (see Section 5.3.3) apparently well depict the node degree distributions
of large-scale social networks. Since these power laws are free of any characteristic scale, such
a network with a power law degree distribution is called a scale-free network (14).
5.2.3 Emergence of a Community Structure
Another aspect to take into account when studying social networks is the emergence of a community structure: the more evident this structural characteristic is, the more the network tends to divide into groups of nodes whose connections are denser among entities belonging to the same group and sparser otherwise. Not all network models are able to represent this characteristic. In fact, the Erdős-Rényi model (see Section 5.3.1) and the Barabási-Albert model (see Section 5.3.3) cannot meaningfully represent the concept of community structure, which emerges from the empirical analysis of social networks. The community structure of Online Social Networks is widely described in the following.
5.3 Models of Social Networks
Concepts such as the short path length, the clustering and the scale-free degree distribution have
been applied to rigorously model the social networks. Different models have been presented, but
in this dissertation we focus on the three most widely exploited modeling paradigms: i) random
graphs, ii) “small-world” networks and, iii) power law networks. Random graphs represent an
evolution of the Erdős-Rényi model, and are widely used in several empirical studies, because of
their ease of adoption. After the discovery of the clustering effect, a new class of models, namely
“small-world” networks, has been introduced. Similarly, the power law degree distribution emerging from real-world social networks led to the modeling of power law networks, which are adopted to describe scale-free behaviors. These models focus on the dynamics of the network, in order to explain phenomena such as power laws and other non-Poisson degree distributions.
5.3.1 The Erdős-Rényi Model
Erdős and Rényi (94) proposed one of the first network models, the random graph. They defined two models: the simpler one consists of a graph containing n vertices connected randomly. The commonly adopted model, instead, is defined as a graph G(n, p) in which each possible edge between two vertices is included in the graph with probability p (and excluded with probability 1 − p).
Although random graphs have been widely adopted because their properties ease the work of
modeling networks (for example, random graphs have small diameters), they do not properly
reflect the structure of real-world large-scale networks, mainly for two reasons: i) the degree
distribution of random graphs follows a Poisson law, which substantially differs from the power
law distribution shown by empirical data; ii) they do not reflect the clustering phenomenon,
considering all the nodes of the network with the same weight, and reducing, de facto, the
network to a giant cluster.
This emerges from Figure 5.1, which shows an Erdős-Rényi graph generated with n = 30 and p = 0.25. Most of the nodes have similar closeness centrality (which is related to their degree), identified by the gray color in a gray-scale, and this means that
Figure 5.1: Generative model: Erdős-Rényi (94).
Figure 5.2: Generative model: Newman-Watts-Strogatz (219).
Figure 5.3: Generative model: Watts-Strogatz (274).
Figure 5.4: Generative model: Barabási-Albert (14).
Figure 5.5: Generative model: Holme-Kim (149).
all the nodes have relatively similar features (which is consistent with the formulation of the graph model, according to the Poisson distribution of node degrees). Social networks exhibit a rather different behavior, making this model unsuitable for modern studies, although it has been widely adopted in the past.
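A minimal sketch of the model, assuming networkx is available, is the following: it generates a G(n, p) graph with the same parameters as Figure 5.1 and shows that the degrees concentrate around the mean ⟨k⟩ = p(n − 1), consistently with the Poisson-like behavior described above.

```python
import networkx as nx
from collections import Counter

n, p = 30, 0.25                            # same parameters as Figure 5.1
G = nx.gnp_random_graph(n, p, seed=7)      # Erdos-Renyi G(n, p) model

degrees = [d for _, d in G.degree()]
mean_k = p * (n - 1)                       # expected mean degree <k>
print(f"expected <k> = {mean_k:.2f}, observed mean = {sum(degrees) / n:.2f}")

# degree histogram: most nodes cluster around <k>
for k, count in sorted(Counter(degrees).items()):
    print(f"degree {k:2d}: {'*' * count}")
```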
5.3.2 The Watts-Strogatz Model
Real-world social networks are well connected and have a short average path length like random
graphs, but they also have exceptionally large clustering coefficients, a characteristic that is
not reflected by the Erdős-Rényi model or by other random graph models. Watts and Strogatz
proposed a one-parameter model that interpolates between an ordered finite dimensional lattice
and a random graph (274). Starting from a ring lattice with n vertices and k edges per vertex,
each edge is rewired at random with probability p, p ranging from 0 (regular network) to 1
(random network). Focusing on two quantities, namely the characteristic path length L(p)
(defined as the number of edges in the shortest path between two vertices) and the clustering
coefficient C(p), some authors (148) found that L ∼ n/2k ≥ 1 and C ∼ 3/4 as p tends to 0,
while L = Lrandom ln(n)/ln(k) and C = Crandom k/n ≤ 1 as p tends to 1. The Watts-Strogatz
model is therefore suitable for explaining such properties in many real-world examples.
The model has been widely studied since the details have been published. Its role is important
in the study of the “small-world” theory. Some relevant theories, such as Kleinberg’s work
(164, 165), are based on this model and its variants. The disadvantage of the model, however,
is that it is not able to capture the power law degree distribution as presented in most real-world
social networks.
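The interpolation can be illustrated with the following sketch (assuming networkx; the parameters are arbitrary): as the rewiring probability p grows, the clustering coefficient drops while the average path length quickly approaches that of a random graph.

```python
import networkx as nx

n, k = 1000, 10
for p in (0.0, 0.01, 0.1, 1.0):
    # Watts-Strogatz ring lattice with k neighbors, rewired with probability p
    G = nx.connected_watts_strogatz_graph(n=n, k=k, p=p, seed=3)
    C = nx.average_clustering(G)
    L = nx.average_shortest_path_length(G)
    print(f"p = {p:<5} C(p) = {C:.3f}  L(p) = {L:.2f}")
```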
A strong structural difference is evident between the Watts-Strogatz model (274) and its variant Newman-Watts-Strogatz (219), presented in Figures 5.2 and 5.3, when compared with the Erdős-Rényi graph. First of all, it emerges that the centrality of nodes is more heterogeneous, covering the whole gray-scale. On the other hand, it is evident, if compared with the other models, that it cannot well reflect the power law distribution of node degrees experimentally shown by data, even if a community structure is well represented (see Section 5.5.2).
5.3.3 The Barabási-Albert Model
The two previously discussed theories observe properties of real-world networks and attempt
to create models that incorporate those characteristics. However, they do not help in understanding the origin of social networks and how those properties evolve.
The Barabási-Albert model suggests that two main ingredients of the self-organization of a network into a scale-free structure are growth and preferential attachment. These point to the fact that most networks continuously grow by the addition of new nodes, which are preferentially attached to existing nodes with large numbers of connections. The generation scheme of a Barabási-Albert scale-free model is as follows: (i) Growth: let pk be the fraction of nodes in the undirected network of size n with degree k, so that Σk pk = 1 and therefore the mean degree m of the network is (1/2) Σk k·pk. Starting with a small number of nodes, at each time step we add a new node with m edges linked to nodes already part of the system; (ii) Preferential attachment: the probability Πi that a new node will be connected to node i (one of the n already existing nodes) depends on the degree ki of node i, in such a way that

Πi = ki / Σj kj
Models based on preferential attachment operate in the following way. Nodes are added one at a time. When a new node u has to be added to the network, it creates m edges (m is a parameter and it is constant for all nodes). The edges are not placed uniformly at random but preferentially, i.e., the probability that a new edge of u is placed to a node v of degree d(v) is proportional to its degree, pu(v) ∝ d(v). This simple behavior leads to power law degree tails with exponent γ ≈ 3. Moreover, it also leads to low diameters. While the model captures the power law tail of the degree distribution, it has other properties that may or may not agree with empirical results in real networks. Recent analytical research on the average path length indicates that ℓ ∼ ln(|V|)/ln ln(|V|). Thus the model has a much shorter ℓ with respect to a random graph. The clustering coefficient decreases with the network size, following approximately a power law C ∼ V^−0.75. Though greater than that of random graphs, it depends on the size of the network, which is not true for real-world social networks.
Figures 5.4 and 5.5 propose two examples of graphs generated by using the Barabási-Albert scale-free model (14) and a variant by Holme and Kim (149). It is evident that the structure of these networks is much more compact than that of the Watts-Strogatz models but, on the other hand, there are a few nodes with a very high centrality (proportional to their degree) while most of the others have very low degrees (those depicted in dark gray). It is also possible to put into evidence that, due to the spring layout given by the Fruchterman-Reingold algorithm (116), the nodes with low degrees (belonging to the tail of the power law) are represented in peripheral positions with respect to central nodes. On the other hand, this model fails in representing a meaningful community structure of the network, differently from the Watts-Strogatz based models (see further).
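The growth and preferential-attachment mechanism can be illustrated with the sketch below (assuming networkx; the sizes are arbitrary): a Barabási-Albert graph is generated and the heavy tail of its degree distribution is contrasted with the concentrated degrees of a random graph of comparable density.

```python
import networkx as nx

n, m = 10_000, 3
BA = nx.barabasi_albert_graph(n, m, seed=5)           # growth + preferential attachment
ER = nx.gnp_random_graph(n, 2 * m / (n - 1), seed=5)  # random graph, similar mean degree

for name, G in (("Barabasi-Albert", BA), ("Erdos-Renyi", ER)):
    degrees = sorted((d for _, d in G.degree()), reverse=True)
    print(f"{name:16s} max degree = {degrees[0]:4d}  top-10 hubs = {degrees[:10]}")
# The BA hubs are far larger than any ER degree: a signature of the power law
# tail with exponent close to 3.
```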
5.4 Community Structure
The concept of community structure in social networks is crucial, and this Section is devoted to: i) formally defining the meaning of community and community structure of a social network; ii) introducing the problem of community detection in social networks, including those problems related to massive OSNs; and, finally, iii) discussing the mathematical models introduced above and their inability to represent the community structure of social networks.
5.4.1 Definition of Community Structure
We define the community structure of a network as in (109). In a random graph (94), the
distribution of edges among the vertices is highly homogeneous, since it follows a Poissonian
distribution (as previously discussed), so most nodes have equal or similar degree. Real-world
networks are not random graphs, as they display big inhomogeneities, revealing a high level of order and organization. The degree distribution is broad and scale-free: many nodes with low degree coexist with a few nodes with large degree.
Furthermore, the distribution of edges is not only globally, but also locally inhomogeneous, with
high concentrations of edges within special groups of nodes, and low concentrations between
these groups. This feature of real-world networks is called community structure, or clustering
effect.
5.4.2 Discovering Communities
The problem of unveiling the community structure of a network is called community detection. In the context of community detection, two main types of algorithms exist: i) partitioning algorithms; ii) overlapping-nodes community detection algorithms.
Partitioning Algorithms
In its general formulation, the problem of finding communities in a network is intended as a data clustering problem, thus solvable by assigning each vertex of the network to a cluster in a meaningful way. There are essentially two different and widely adopted approaches to solve this problem: the first is spectral clustering (141), which relies on optimizing the process of cutting the graph; the second is based on the concept of network modularity.
The problem of minimizing the graph-cuts is NP-hard, thus an approximation of the exact
solution can be obtained by using the spectral clustering (220), exploiting the eigenvectors of
the Laplacian matrix of the network. We recall that the Laplacian matrix L of a given graph
has components Lij = ki δ(i, j) − Aij , where ki is the degree of a node i, δ(i, j) is the Kronecker
delta (that is, δ(i, j) = 1 if and only if i = j) and Aij is the adjacency matrix representing the
graph connections. This process can be performed using the concept of ratio cut (141, 275), a
function which can be minimized in order to obtain large clusters with a minimum number of
outgoing interconnections among them. The main limitation of spectral clustering is that it requires defining in advance the number of communities present in the network and their size. This makes it unsuitable if one wants to discover the number and the features of the existing
communities in a given network. Moreover, as demonstrated by (254), it does not work very
well in detecting small communities within densely connected networks.
The network modularity concept can be explained as in (217): let us consider a network, represented by means of a graph G = (V, E), which has been partitioned into m communities; its corresponding value of network modularity is

Q = Σ_{s=1..m} [ ls/|E| − (ds/(2|E|))^2 ]     (5.3)
assuming ls the number of edges between vertices belonging to the s-th community and ds the
sum of the degrees of the vertices in the s-th community. Intuitively, high values of Q imply
high values of ls for each discovered community; thus, detected communities are dense within
their structure and weakly coupled among each other. Because the task of maximizing the
function Q is NP-hard, several approximate techniques have been presented during the last
years.
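A direct, illustrative implementation of Equation 5.3 is sketched below, assuming an undirected graph and a partition given as a list of node sets; it is not the optimized routine used in the experiments.

```python
import networkx as nx
from networkx.algorithms.community import modularity as nx_modularity

def modularity_q(G, communities):
    """Network modularity Q of a partition (Equation 5.3)."""
    m_edges = G.number_of_edges()                      # |E|
    q = 0.0
    for community in communities:
        ls = G.subgraph(community).number_of_edges()   # intra-community edges
        ds = sum(d for _, d in G.degree(community))    # sum of member degrees
        q += ls / m_edges - (ds / (2 * m_edges)) ** 2
    return q

G = nx.karate_club_graph()
split = [{n for n, data in G.nodes(data=True) if data["club"] == "Mr. Hi"},
         {n for n, data in G.nodes(data=True) if data["club"] == "Officer"}]
print(f"Q = {modularity_q(G, split):.3f}")
print(f"Q (library) = {nx_modularity(G, split):.3f}")  # cross-check
```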
Let us consider the Girvan-Newman (GN) algorithm (214, 217). It first calculates the edge
betweenness B(e) of any given edge e in a network G = (V, E), defined as
B(e) = Σ_{ni, nl ∈ V} npe(ni, nl) / np(ni, nl)     (5.4)
where ni and nl are vertices of G, np(ni , nl ) is the number of the shortest paths between ni
and nl and npe (ni , nl ) is the number of the shortest paths between ni and nl containing e.
The GN algorithm is based on the assumption that it is possible to maximize the value of Q by deleting edges with a high value of betweenness, because they connect vertices belonging to different communities. Starting from this intuition, the algorithm first ranks all the edges with respect to their betweenness, then removes the most influential one, calculates the value of Q and iterates the process until a significant increase of Q is obtained. At each iteration, each connected component of G identifies a community. Its cost is O(n^3), n being the number of vertices in G; intuitively, it is unsuitable for large-scale networks.
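For small graphs the procedure can be reproduced with the following sketch, assuming networkx is available; it relies on the library's girvan_newman generator and keeps the partition with the highest modularity, mirroring the stopping criterion described above.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

G = nx.karate_club_graph()                 # small toy graph: O(n^3) is affordable

best_q, best_partition = -1.0, None
for partition in girvan_newman(G):         # successive removals of max-betweenness edges
    communities = [set(c) for c in partition]
    q = modularity(G, communities)
    if q > best_q:
        best_q, best_partition = q, communities

print(f"best Q = {best_q:.3f} with {len(best_partition)} communities")
```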
A large number of improved versions of this approach have been provided in recent years, such as the fast clustering algorithm provided by (68, 69), running in O(n log n) on sparse graphs; the extremal optimization method proposed by (91), based on a fast agglomerative approach with O(n^2 log n) time complexity; the Newman-Leicht mixture model (218) based on statistical inference; and other maximization techniques by (213) based on eigenvectors and matrices.
To conclude, another approach to partitioning is the “core-periphery” one, introduced by (41, 95); it relies on separating a tight core from a sparse periphery.
The next Chapter of this dissertation is completely devoted to the problem of community
detection in massive OSNs and a number of details will be further provided.
Overlapping Nodes Community Detection
Recently, the problem of discovering the community structure of a network has been extended to include the possibility of finding overlapping nodes belonging to different communities at the same time. One of the first approaches was presented by (227) and has attracted a lot of attention from the scientific community. A lot of effort has been spent in order to advance novel possible strategies. For example, an interesting approach has been proposed by (138), based on an extension of the Label Propagation Algorithm. On the other hand, an approach in which hierarchical clustering is instrumental to finding the overlapping community structure has been proposed by (175). Finally, in recent years some novel techniques have been proposed (181, 197).
5.4.3 Models Representing the Community Structure
From the perspective of the models representing the community structure of a network, we can infer the following information: from Figure 5.6, where the community structure of an Erdős-Rényi model is represented, the result appears random, according to the formulation of the model and its expected behavior when the calculated network modularity Q function (Equation 5.3) is analyzed. From Figures 5.7–5.8, at a first glance, it emerges that the community structure
of Watts-Strogatz models is very regular and there is a balance between communities with
tighter connections and those with weaker connections. This reflects the formulation of the
model but does not well depict the community structure represented by scale-free networks.
Finally, Figures 5.9–5.10 appear more compact and densely connected, features that are not
reflected by experimental data.
Even if well-representing the “small-world” effect and the power law distribution of degrees, the
Barabási-Albert model and its variants appear inefficient to represent the community structure
of Online Social Networks.
Figure 5.6: Community structure of the Erdős-Rényi (94) model.
Figure 5.7: Community structure of the Newman-Watts-Strogatz (219) model.
Figure 5.8: Community structure of the Watts-Strogatz (274) model.
Figure 5.9: Community structure of the Barabási-Albert (14) model.
Figure 5.10: Community structure of the Holme-Kim (149) model.
no. | Network     | no. nodes | no. edges | Dir. | Type        | d(q) | γ    | σ    | Q     | Ref
1   | CA-AstroPh  | 18,772    | 396,160   | No   | Collaborat. | 5.3  | 2.23 | 1.50 | 0.628 | (184)
2   | CA-CondMat  | 23,133    | 186,932   | No   | Collaborat. | 7.9  | 2.65 | 1.49 | 0.731 | (184)
3   | CA-GrQc     | 5,242     | 28,980    | No   | Collaborat. | 8.9  | 2.12 | 1.48 | 0.861 | (184)
4   | CA-HepPh    | 12,008    | 237,010   | No   | Collaborat. | 6.6  | 1.71 | 1.46 | 0.659 | (184)
5   | CA-HepTh    | 9,877     | 51,971    | No   | Collaborat. | 8.4  | 2.63 | 1.46 | 0.768 | (184)
6   | Cit-HepTh   | 27,770    | 352,807   | Yes  | Citation    | 6.5  | 3.28 | 1.48 | 0.658 | (184)
7   | Email-Enron | 36,692    | 377,662   | Yes  | Collaborat. | 5.4  | 1.84 | 1.48 | 0.615 | (184)
8   | Facebook    | 63,731    | 1,545,684 | Yes  | Online Com. | 6.8  | 2.91 | 1.48 | 0.634 | (203)
9   | Youtube     | 1,138,499 | 4,945,382 | Yes  | Online Com. | 7.6  | 2.05 | –    | 0.447 | (203)
10  | Wiki-Vote   | 7,115     | 103,689   | Yes  | Collaborat. | 4.5  | 1.38 | –    | 0.418 | (184)
Table 5.1: Datasets and results: d(q) is the effective diameter, γ and σ, resp., the exponent of the power law node degree and community size distributions, Q the network modularity.
5.5 Experimental Evaluation
Our experimentation has been conducted on different Online Social Networks whose datasets are available online and are discussed in the following.
5.5.1 Description of Adopted Online Social Network Datasets
Datasets 1 − 5 are taken from Arxiv1 datasets, as of April 2003, of papers in the field of, respectively: 1) “Astro Physics”, 2) “Condensed Matter Physics”, 3) “General Relativity and Quantum Cosmology”; 4) “High Energy Physics - Phenomenology”, and 5) “High Energy Physics
- Theory”. Dataset 6 represents a network of scientific citations among papers belonging to
the Arxiv “High Energy Physics - Theory” field. Dataset 7 illustrates the email communications among the Federal Energy Regulatory Commission members (184). Dataset 8 describes
a sample of the Facebook friendship network, representing its social graph. Dataset 9 depicts
the social graph of YouTube as of 2007 (203). Finally, dataset 10 depicts the voting system of
Wikipedia for the elections of administrators that occurred in January 2008. Adopted datasets
have been summarized in Table 5.1.
5.5.2 Topological Properties
Several measures are usually needed in order to study the topological features of social networks.
To this purpose, for example, Carrington et al. (56) propose a list of some of them, including, amongst others, node/edge degree distributions, diameter, clustering coefficients, and more.
In this experiment, the following features have been investigated for all the datasets discussed
above: i) node degree distribution; ii) diameter and hops; iii) community structure.
Degree distribution
The first interesting feature we analyzed is the degree distribution, which is reflected in the topology of the network. The literature reports that social networks are usually described by power law degree distributions, P(k) ∼ k^−γ, where k is the node degree and γ ≤ 3.
1 Arxiv (http://arxiv.org/) is an Online archive for scientific preprints in the fields of Mathematics, Physics
and Computer Science, amongst others.
We already found that this indication holds true for the Facebook samples we collected and discussed in the previous Chapter, even if with some corrections obtained by identifying multiple regimes. We recall that the degree distribution can also be represented by using the distribution function called Complementary Cumulative Distribution Function (CCDF), previously defined as ℘(k) = ∫_k^∞ P(k′) dk′ ∼ k^−α ∼ k^−(γ−1).
In Figure 5.11 we show the degree distribution and the corresponding CCDF evaluated on our Online Social Networks1. For those networks that are directed, the out-degree is represented.
All the plots are represented by using a log-log scale, in order to put into evidence the scale-free
behavior shown by these networks. In particular, for each of these distributions we estimated
the value of the exponent γ of the power law distribution (as in Equation 5.2). Values of γ are
reported in Table 5.1.
Online Social Networks can be classified into two categories: i) networks that are properly
described by a power law distribution; ii) networks that show some fluctuations with respect
to those power law distributions that best fit to the real data. We discuss these two categories
separately.
The networks that are well described by a power law distribution, such as those depicting datasets 7–10, are all Online Communities (i.e., networks of individuals connected by social ties such as friendship relations) characterized by one fact: most of the users are rather inactive and, thus, have few connections with other members of the network. This phenomenon shapes a very long tail and a short head of the power law distribution of the degree (as graphically depicted by the respective plots in Figure 5.11).
The latter category, instead, includes the co-authorship networks (datasets 1–5), which are collaboration networks, and a citation network (dataset 6). Plotting these data against the power law distributions that best fit them shows some fluctuations, in particular in the head of the distribution, where, apparently, the behavior of the distribution is not properly described. The rationale behind this phenomenon lies in the intrinsic features of these networks, which are slightly different from those of Online Communities.
For example, regarding the co-authorship networks that represent collaborations among Physics scientists, most of the papers are characterized by the following behavior, which can be inferred from the analysis of the real data: the number of co-authors tends to increase up to 3 (on average), then it slowly slopes down to a dozen, and finally it quickly decreases.
A similar interpretation holds for the citation network, which is usually intended as a network in which a very small number of papers gathers most of the citations, while a huge number of papers have few or even no citations at all. This is a well-known phenomenon, called the “first-mover effect” (215).
Intuitively, from a modeling perspective, the only viable solution to capture these scale-free
degree distribution behaviors would be by means of a Barabási-Albert preferential attachment
model. On the one hand, by using this model it would be possible to reproduce the power
law degree distribution of all the Online Social Networks depicted above. Similarly, even the
“small-world” effect that describes networks with small diameters would be captured. On the
other hand, this model would fail in depicting the community structure of those networks,
whose existence has been put into evidence, both in this study (see further) and in other works
(186, 209).
1 For each network we plot the data, the best fitting power law function and complementary cumulative
distribution function (all the plots use the same scale of the first one).
Figure 5.11: Node degree distributions (log–log scale).
Diameter and hops
Most real-world social networks exhibit a relatively small diameter, but the diameter is susceptible to outliers. A more reliable measure of the pairwise distances between nodes in a graph is
the effective diameter, already previously introduced.
In Figure 5.12 the number of pairs of nodes reachable is plotted as a function of the number of hops necessary to connect them, for each given network1. As a consequence of the compact structure of these networks (highlighted by the scale-free distributions and the “small-world” effect, discussed above), diameters show a fast convergence to the asymptotic values listed in Table 5.1.
From a modeling standpoint, as for the degree distributions, the previous considerations hold true. Both the Watts-Strogatz and the Barabási-Albert models could efficiently depict the “small-world” feature of these Online Social Networks and, most importantly, empirical data verify the so-called “six degrees of separation” theory, which is strictly related to the “small-world” formulation. In fact, it is evident that, regardless of the large scale of the networks analyzed, the effective diameters are really small (on average, close to 6), as has been shown for real-world social networks (18, 274).
Community Structure Analysis
In our experimental analysis on real datasets, we analyze the community structures obtained by using the Louvain method2 (32), focusing on the study of the distribution of the sizes of the detected communities (i.e., the number of members constituting each detected community) (109).
Recently, Fortunato and Barthelemy (110) put into evidence a resolution limit while adopting
network modularity as maximization function for the community detection. In detail, authors
1 The number of pairs of nodes reachable is plotted against the number of hops required, with q = 0.9.
2 The Louvain method is a community detection algorithm which is discussed in detail in Chapter 7. Since at this point it is not crucial to understand its functioning, we defer its discussion to that Chapter.
found that modularity optimization may fail in the discovery of communities whose size is
smaller than a given threshold. This value is strictly correlated to the characteristics of the
given network. This resolution limit results in the creation of large communities incorporating
an important part of the nodes of the network. In practice, in some particular cases it is possible that the clustering process produces a small number of communities of large size. This would possibly affect results in two ways: i) enlarging the tail of the power law distribution of the community sizes, or ii) producing a non-significant clustering of the network.
Because the clustering algorithm adopted (i.e., the Louvain method) is a modularity maximization technique, we investigated the effect of the resolution limit on our datasets. We found that in two cases (i.e., for datasets 9–10) the clustering obtained was biased by the resolution limit, and we excluded these networks from our analysis.
In the following we investigate the behavior of the distribution of the size of the communities
in our networks.
In Figure 5.13, on the x-axis we plot the size of the community, and on the y-axis the probability P(x) of finding a community of the given size in the network. For each distribution we provide the best-fitting power law function with a given exponent σ (which always lies in the interval [1.4, 1.5]) that well approximates the behavior of the community size distribution.
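The kind of analysis behind Figure 5.13 can be reproduced with the minimal sketch below, assuming a networkx version that ships a Louvain implementation (the experiments described here used a separate implementation of the Louvain method); it extracts the communities of a toy graph and tabulates the distribution of their sizes.

```python
import networkx as nx
from collections import Counter
from networkx.algorithms.community import louvain_communities, modularity

G = nx.les_miserables_graph()                 # small toy graph for illustration

# Louvain-style modularity maximization (requires networkx >= 2.8)
communities = louvain_communities(G, seed=11)

sizes = [len(c) for c in communities]
distribution = Counter(sizes)                 # community size -> frequency
for size in sorted(distribution):
    print(f"size {size:3d}: P(x) = {distribution[size] / len(sizes):.2f}")
print("modularity Q =", round(modularity(G, communities), 3))
```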
In the figure the data are plotted as points and it is possible to highlight some communities
whose size is larger than that expressed by the expected power law function (plotted as a
red line), that constitute the heavy tail of the power law distribution. The depicted results
show that, within large Online Social Networks, there is a high probability of finding a large
number of communities that contain few individuals and a relatively low probability of finding
communities constituted by a large number of members. This confirms that individuals are more
likely to aggregate in small communities, such as those representing family, friends, colleagues,
etc., rather than in large communities (99, 234). Fashinating, a similar phenomenon happens
for co-authors and citations on scientific papers.
Moreover, from Figure 5.13 we can put into evidence that large Online Communities, for example Facebook and the scientific collaboration networks, show a very tight community structure
(a fact proved also by the high values of network modularity, reported as Q in Table 5.1).
For example, regarding the collaboration networks, we intuitively interpret this fact by considering that scientists, although usually co-authoring different works with different persons, work on papers signed by only a small number of co-authors. It is very likely that these co-authors tend to group together (for example, if they co-authored several works) in the corresponding scientific communities.
On the other hand, for some networks such as the citation network and the email network, Figure 5.13 shows that there exists a significant number of communities constituted by a large number of individuals, forming the heavy tail of the power law distribution1. This aspect also has an intuitive explanation. In fact, if we consider a network of scientific citations, there is a small number of papers with a huge number of citations (which are very central in the topology of the network and, thus, are aggregated in the same communities), while most of the others have very few citations, forming small communities among each other (or remaining single entities).
1 The probability P(x) of finding a community of a given size in the network is plotted against the size of the community. In red, the best fitting power law distribution functions are depicted.
Figure 5.12: Effective diameters (log-normal scale).
Figure 5.13: Community structure analysis (log–log scale).
Conclusion
In this Chapter we put into evidence those models which try to efficiently and faithfully represent
the topological features of Online Social Networks.
Several models have been presented in the literature, and we focused our attention on the three most exploited ones, i.e., i) Erdős-Rényi random graphs, ii) Watts-Strogatz and, iii) Barabási-Albert preferential attachment. Each model, even if it describes some specific characteristics well, fails to faithfully represent all three of the main features we identified that characterize Online Social Networks, namely i) the “small-world” effect, ii) scale-free degree distributions and, finally, iii) the emergence of a community structure.
We analyzed the topological features of several real-world Online Social Networks, fitting real data to models and putting into evidence which characteristics are preserved and which cannot faithfully be represented by using these models.
6 Community Structure in Facebook
This Chapter is organized as follows. Section 6.1 covers the background and the related work
about detecting the community structure within a network, with particular attention to the
specific area of Online Social Networks. Section 6.2 introduces some details about two fast community detection algorithms we have adopted to detect the community structure of Facebook.
Experimental results, performance evaluation and data analysis are shown. We describe the
methodology behind this work, illustrating the aspects on which we focused during our experimentation. Details related to the formulation of the problem and the choice of the solutions
are illustrated. In detail, in Section 6.3 we describe the community structure of Facebook, the
process of building a meta-network from the communities we discovered and the analysis of the
topological features of this artifact. Finally, Section 6.4 presents some clues in the direction
of the quantitative assessment of the renowned sociological theory of the strength of weak ties
(137).
6.1 Background and Related Literature
The social role of Online Social Networks is to help people enhance their connections with each other in the context of the Internet. On the one hand, these relationships are very tight over some areas of the social life of each user, such as family, colleagues, friends, and so on. On the other, the outgoing connections with other individuals not belonging to any of these categories are less likely to occur. This effect is reflected in a phenomenon called community structure. We recall that a community is formally defined as a sub-structure present in the network that represents connections among users, in which the density of relationships within the members of the community is much greater than the density of connections among communities. From a structural perspective, this is reflected by a graph which is very sparse almost everywhere but dense in local areas, corresponding to the communities (also called clusters).
There are many different motivations to investigate the community structure of a network. From a scientific perspective, it is possible to put into evidence interesting properties or hidden information about the network itself. Moreover, individuals that belong to a same community
may share some similarities, possibly have common interests, or be connected by a specific relationship in the real world. These aspects give rise to many commercial and scientific applications; in the first category we cite, for example, marketing and competitive intelligence investigations and recommender systems. In fact, users belonging to a same community could share tastes or interests in similar products. In the latter, models of disease propagation and information diffusion have been largely investigated in the context of social networks.
The problem of discovering the community structure of a network has been approached in
several different ways. A common formulation of this problem is to find a partitioning V =
(V1 ∪ V2 ∪ · · · ∪ Vn ) of disjoint subsets of vertices of the graph G = (V, E) representing the
network, in a meaningful manner.
Two intuitive problems can already be sketched. The first arises when partitioning the vertices into disjoint subsets, because each entity of the network could possibly belong to several different communities. The problem of overlapping communities has already been investigated in the literature (181, 197, 227) and presented in the previous Chapter. The second problem arises in networks in which it makes sense that an individual does not belong to any group. In the formulation introduced above, we imposed that, regardless of whether overlapping communities are considered or not, each individual is required to belong to at least one group. This requirement could make sense for several networks, but is unrealistic in those cases in which some individuals could remain isolated from the rest of the network, as recently put into evidence by (152). Such a case commonly happens in real and Online Social Networks, as reported by recent social studies (143).
In this Chapter we analyze the community structure of Facebook on a large scale. We recall
that we collected two different samples of the network of relationships among the users of the
social network. Each of them contains millions of nodes and edges and, for this reason, we
adopt two fast and efficient community detection algorithms optimized for massive networks,
working without any a-priori knowledge, in order to discover the emergent community structure
of Facebook.
6.1.1 Community Detection in Literature
Several studies have been conducted in order to investigate the community structure of real
and Online Social Networks (109, 158, 234, 254, 265, 293). They all rely on the algorithmic
background of detecting communities in a network. There are several comprehensive surveys on this problem, addressed to non-practitioner readers, such as (109, 234).
The problem of detecting groups of related nodes in a single social network has been largely
analyzed in the Physics, Bioinformatics and Computer Science literature and is often known as
community detection (210, 216) and studied, among others, by Borgatti et al. (41).
The computational complexity of the Girvan-Newman (GN) algorithm introduced before is O(n³), where n is the number of nodes of the graph G = (V, E). A cubic complexity may not be scalable enough for the size of Online Social Networks, but a more efficient – O(n log² n) – implementation of GN can be found in (69).
(237) illustrates an algorithm which strongly resembles GN. In particular, for each edge e ∈ E
of G, it computes the so-called edge clustering coefficient of e, defined as the ratio of the
number of cycles containing e to the maximum number of cycles which could potentially contain
it. Next, GN is applied with the edge clustering coefficient (rather than edge betweenness)
86
6.2 Community Structure Discovery
as the parameter of reference. The most important advantage of this approach is that the
computational cost of the edge clustering coefficient is significantly smaller than that of edge
betweenness.
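To make the idea more concrete, the following minimal Python sketch computes the triangle-based variant (cycles of length three) of the edge clustering coefficient on a toy graph; the function name and the example graph are illustrative assumptions, not code taken from the works cited above.

```python
from collections import defaultdict

def edge_clustering_coefficient(adj, i, j):
    """Triangle-based edge clustering coefficient of the edge (i, j):
    (z + 1) / min(k_i - 1, k_j - 1), where z is the number of triangles
    containing the edge and the denominator is the maximum number of
    triangles the edge could belong to."""
    z = len(adj[i] & adj[j])                           # common neighbors = triangles through (i, j)
    denom = min(len(adj[i]) - 1, len(adj[j]) - 1)
    return float("inf") if denom == 0 else (z + 1) / denom

# Toy usage: a GN-like procedure would repeatedly remove the edge with the
# smallest coefficient and recompute it for the edges affected by the removal.
adj = defaultdict(set)
for u, v in [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]:
    adj[u].add(v)
    adj[v].add(u)
print(edge_clustering_coefficient(adj, 1, 2))   # intra-triangle edge: high value
print(edge_clustering_coefficient(adj, 3, 4))   # bridge-like edge: low value
```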
All approaches described above use greedy techniques to maximize Q. In (140), the authors
propose a different approach which maximizes Q by means of the simulated annealing technique.
That approach achieves a higher accuracy but can be computationally very expensive.
(227) describes CFinder, which, to the best of our knowledge, is the first attempt to find overlapping communities, i.e., communities which may share some nodes. In CFinder communities
are detected by finding cliques of size k, where k is a parameter provided by the user. Such
a problem is computationally expensive but experiments showed that it scales well on real networks and achieves good accuracy.
The approach of (218) uses a Bayesian probabilistic model to represent an Online Social Network. The parameters of this model are determined by means of the Expectation Maximization
algorithm. An interesting feature of (218) is the capability of finding group structures, i.e., relationships among the users of a social network which go beyond those characterizing conventional
communities. For instance, this approach is capable of detecting groups of users who show forms
of aversion with each other rather than just users who are willing to interact. Experimental
comparisons of various approaches to finding communities in OSNs are reported in (109, 188).
In (161) the authors propose CHRONICLE, an algorithm to find time-evolving communities in a
social network. CHRONICLE operates in two stages: in the first one it considers T “snapshots”
of the social network in correspondence of T different timestamps. For each timestamp it
applies a density-based clustering algorithm on each snapshot to find communities in the social
network. After this, it builds a T -partite graph GT which consists of T layers each containing
the communities of nodes detected in the corresponding timestamp. It adds also some edges
linking adjacent layers: they indicate that two communities, detected in correspondence of
consecutive timestamps, share some similarities. As a consequence, the edges and the paths in
GT identify similarities among communities over time.
6.2 Community Structure Discovery
The detection of the community structure within a large network is a complex and computationally expensive task. Community detection algorithms such as those originally presented by Girvan and Newman or by (141) are not viable solutions, respectively because they are too expensive for the large scale of the Facebook samples we gathered, or because they require a priori knowledge. Fortunately, several optimizations have been proposed in recent years. For our purposes, we adopted two fast and efficient optimized algorithms, whose performance is among the best proposed in the literature to date. LPA (Label Propagation Algorithm), presented by (238), and FNCA (Fast Network Community Algorithm), more recently described by (156), have been adopted to detect communities from the collected samples of the network. A description of their functioning follows, in particular in the context of our study.
6.2.1 Label Propagation Algorithm
LPA (Label Propagation Algorithm) (238) is a near linear time algorithm for community detection. Its functioning is remarkably simple, given its computational efficiency. LPA uses only the network structure as its guide, is optimized for large-scale networks, does not follow any a priori defined objective function and does not require any prior information about the communities. In addition, this technique does not require one to define in advance the number of communities present in the network or their size. Labels represent unique identifiers, assigned to each vertex of the network. Its functioning is reported as described in (238):
Step 1 To initialize, each vertex is given a unique label;
Step 2 Repeatedly, each vertex updates its label with the one used by the greatest number of
neighbors. If more than one label is used by the same maximum number of neighbors, one
is chosen randomly. After several iterations, the same label tends to become associated
with all the members of a community;
Step 3 Vertices labeled alike are added to one community.
The authors themselves proved that this process, under specific conditions, might not converge. In order to avoid deadlocks and to guarantee an efficient network clustering, they suggested adopting an "asynchronous" update of the labels, considering the values of some neighbors at the previous iteration and some at the current one. This precaution ensures the convergence of the process, usually in a few steps. The authors of (238) claim that five iterations are sufficient to correctly classify 95% of the vertices of the network. After some experimentation, we found that this forecast is too optimistic, thus we raised the maximum number of iterations to 50, finding a good compromise between the quality of results and the amount of time required for computation.
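A minimal, illustrative Python sketch of the label propagation procedure just described (unique initial labels, asynchronous updates in random order, random tie-breaking, and a cap of 50 iterations) is reported below; it is not the implementation used in our experiments, and the function and variable names are purely illustrative.

```python
import random
from collections import Counter

def label_propagation(adj, max_iterations=50):
    """adj: dict mapping each vertex to the set of its neighbors.
    Returns a dict {vertex: label}; vertices sharing a label form a community."""
    labels = {v: v for v in adj}                       # Step 1: a unique label per vertex
    for _ in range(max_iterations):
        changed = False
        order = list(adj)
        random.shuffle(order)                          # asynchronous update in random order
        for v in order:
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            new_label = random.choice([l for l, c in counts.items() if c == top])  # Step 2
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        if not changed:                                # labels are stable: stop early
            break
    return labels                                      # Step 3: equal labels = same community
```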
A characteristic of this approach is that it produces groups that are not necessarily contiguous,
thus there could exist a path connecting a pair of vertices in a group passing through vertices
belonging to different groups. Although in our case this condition would be acceptable, we
adopted the suggestion provided by the authors to devise a final step to split the groups into
one or more contiguous communities. The authors proved its near linear computational cost.
Recently, great attention has been devoted to the possibility of discovering the community structure of a network by finding overlapping nodes belonging to different communities at the same time. An interesting approach, based on an extension of the Label Propagation Algorithm previously described, has been proposed by (138).
6.2.2 Fast Network Community Algorithm
The second efficient algorithm that has been chosen for our analysis is called FNCA (Fast
Network Community Algorithm) (156). The main advantage of FNCA is that it does not require one to define in advance the number of communities present in the network, or their size.
This aspect makes it suitable for the investigation of the unknown community structure of a
large network, such as in the case of Facebook.
FNCA is an optimization algorithm which aims to maximize the value of the network modularity function, in order to detect the community structure of a given network. The network
modularity function has been introduced by (217) and has been largely adopted in the last few
years by the scientific community (33, 90, 109).
Given an undirected, unweighted network G = (V, E), let i ∈ V be a vertex belonging to the community r(i), denoted by c_{r(i)}; the network modularity function can be written as follows

Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(r(i), r(j))     (6.1)

where A_{ij} is the element of the adjacency matrix A = (A_{ij})_{n×n} representing the network, whose value is A_{ij} = 1 if i and j are tied by an edge, and A_{ij} = 0 otherwise. The function δ(u, v), namely the Kronecker delta, is equal to 1 if u = v and 0 otherwise. The value k_i represents the degree of vertex i, defined as k_i = \sum_j A_{ij}, while m is the total number of edges in the network, defined as m = \frac{1}{2} \sum_{ij} A_{ij}. Equation 6.1 can be rewritten as

Q = \frac{1}{2m} \sum_i f_i, \qquad f_i = \sum_{j \in c_{r(i)}} \left( A_{ij} - \frac{k_i k_j}{2m} \right)     (6.2)
where the function f represents the difference between the actual and the expected number of edges which fall within communities, from the "perspective" of each node of the network, thus indicating how strong the community structure is.
Any node of the network can evaluate the value of its f function by considering only local information (i.e., information about its community). Moreover, if the local effect of relabeling a node, without changing the labels of the others, is that the value of its f function increases, the global effect is that the network modularity also increases.
Given these assumptions, (156) devised a fast community detection algorithm, optimized for complex networks, adopting local information strategies. FNCA relies on the consideration that, in networks with an emergent community structure, each node should be labeled like one of its neighbors; otherwise, it forms a cluster by itself. Thus, each node needs to calculate its f function only for the labels of its neighbors, instead of for all the nodes of the network.
Moreover, the authors put into evidence that, if the labels of the neighbors of a node did not change in the last iteration, the label of that node is less likely to change in the current iteration. This provides a speed-up strategy: nodes which satisfy this condition are put into an "inactive" state and do not require the update of their labels. Because this weak condition may fail, it is important to immediately "wake up", at each iteration, those nodes which no longer satisfy this constraint.
This algorithm, like LPA, might also not converge. In our experimentation we defined a termination criterion of 50 iterations, obtaining good results even with our large-scale samples.
The time complexity of FNCA is O(T · n · k · c), where T is the maximum number of iterations, n the total number of nodes, k the average degree of the nodes, and c the average community size at the end of the algorithm execution. Furthermore, supported by the analysis in the literature (187), for large-scale networks FNCA is a near linear algorithm.
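The core of the local update rule can be sketched as follows; this is a simplified illustration of a single relabeling sweep driven by the f function of Equation 6.2, assuming an adjacency-set representation, and it omits the "inactive node" speed-up and the other optimizations of the actual FNCA implementation.

```python
from collections import defaultdict

def fnca_sweep(adj, labels, degree, m):
    """One relabeling sweep in the spirit of FNCA: each node evaluates its
    f function (Equation 6.2) only for the labels carried by its neighbors
    and keeps the label that maximizes it; increasing f locally also
    increases the network modularity Q."""
    # sum of degrees per label, kept incrementally to avoid rescanning all nodes
    label_degree = defaultdict(float)
    for v in adj:
        label_degree[labels[v]] += degree[v]
    changed = 0
    for i in adj:
        best_label, best_f = labels[i], float("-inf")
        for label in {labels[j] for j in adj[i]}:
            links = sum(1 for j in adj[i] if labels[j] == label)          # sum of A_ij over the label
            tot = label_degree[label] - (degree[i] if labels[i] == label else 0)
            f = links - degree[i] * tot / (2.0 * m)                       # Equation 6.2 restricted to i
            if f > best_f:
                best_f, best_label = f, label
        if best_label != labels[i]:
            label_degree[labels[i]] -= degree[i]
            label_degree[best_label] += degree[i]
            labels[i] = best_label
            changed += 1
    return changed
```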
6.2.3 Experimentation
The experimental results obtained by using LPA and FNCA on the Facebook network are reported in Table 6.1. Both algorithms show good performance when applied to this network.
Algorithm    No. Communities    Q        Time (s)
BFS (8.21 M vertices, 12.58 M edges)
FNCA         50,156             0.6867   5.97e+004
LPA          48,750             0.6963   2.27e+004
Uniform (7.69 M vertices, 7.84 M edges)
FNCA         40,700             0.9650   3.77e+004
LPA          48,022             0.9749   2.32e+004

Table 6.1: Results of the community detection on Facebook
A very compact community structure has been highlighted by using both algorithms. In detail, the resulting values of Q are almost identical for each considered sample; moreover, the number of detected communities is very similar.
6.2.4 Methodology of Investigation
By analyzing the obtained community structures we considered the following aspects: i) the
distribution of the dimensions of the obtained clusters (i.e. the number of members constituting
each detected community), and, ii) the qualitative composition of the communities and the degree of similarity among different sample sets (i.e., those obtained by using different algorithms
and sampling techniques).
Community distribution: Uniform sample
The analysis of the community structure of a network from a quantitative perspective may start with the study of the distribution of the dimension of the communities. Our investigation started by considering the "Uniform" sample, which is known to be unbiased by construction. The results obtained are then adopted to investigate the possible bias introduced by the BFS sampling technique, as discussed in the following.
As depicted by Figures 6.1 and 6.2, the results obtained by using the two different algorithms on the "Uniform" sample are interesting and deserve explanation. In detail, the analytical results (as reported in Table 6.1) and the figures put into evidence that both algorithms identified a similar number of communities, which is reflected by almost identical values of network modularity in the two different sets. Moreover, the identified communities are themselves, most of the time, of the same dimensions, regardless of the adopted community detection algorithm.
These aspects lead us to advance different hypotheses on the characteristics of the community structure of Facebook detected on the unbiased "Uniform" sample. The first consideration regards the distribution of the size of the communities. Both the distributions obtained by using the LPA and the FNCA algorithms show a characteristic power law behavior. This is emphasized by Figures 6.1 and 6.2, which represent the distributions of the dimension of the communities obtained by using, respectively, FNCA and LPA, applied on the "Uniform" sample. In Figure 6.1, the cluster size distribution obtained by using FNCA is fitted to a power law function (γ = 0.45) which effectively approximates its behavior. Similarly, Figure 6.2 represents the cluster size distribution produced by LPA, which gives results in a shorter interval (i.e., [0,500] with respect to [0,1000] used in Figure 6.1), well fitting a power law function (γ = 0.37). A first consideration is that the communities detected by using the LPA algorithm appear to be slightly displaced towards bigger values with respect to those produced by FNCA in the first quartile, while the number of communities with more than 400 members quickly decreases.
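As an illustration of the kind of fit mentioned above, the following sketch estimates the exponent γ of a power law by a least-squares fit on the log–log histogram of community sizes; the thesis does not specify the exact fitting procedure adopted, so both the method and the synthetic data below are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def fit_power_law_exponent(community_sizes):
    """Estimates gamma in P(s) ~ s^(-gamma) with a least-squares fit of the
    empirical size distribution on a log-log scale. This is only one of the
    possible estimation procedures (a maximum-likelihood fit would be another)."""
    counts = Counter(int(s) for s in community_sizes)
    sizes = np.array(sorted(counts))
    prob = np.array([counts[s] for s in sizes], dtype=float)
    prob /= prob.sum()
    slope, _ = np.polyfit(np.log10(sizes), np.log10(prob), 1)
    return -slope

# Toy usage on synthetic, power-law-like data (not the Facebook samples)
sizes = np.random.zipf(2.0, 10_000)
print(round(fit_power_law_exponent(sizes), 2))
```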
These results permit us to draw the following conclusions:
• On a large scale, to the best of our knowledge, this is the first experimental analysis that proves that the size of the communities emerging in an Online Social Network follows a well-defined power law distribution. This result is novel and validates the hypothesis, proved on a small scale on several real-world social networks (for example, (265)), that not only does the degree distribution follow a scale-free behavior, but even the processes of aggregation among individuals of an Online Social Network can be effectively described by communities whose dimensions follow a power law. This leads to the following point.
• Our analysis puts into evidence that, even on a large scale that is well represented by an OSN such as Facebook, people tend to aggregate principally in a large number of small communities rather than in very large communities.
• A rejection-based sampling methodology (such as the "Uniform" sampling) is appropriate to describe the community structure emerging from a large sample of an Online Social Network. Unlike other approaches, it seems to preserve those characteristics that influence the distribution of the friendship relations, thus well representing the community structure of large networks.
Figure 6.1: FNCA power law distributions on the “Uniform” sample.
Figure 6.2: LPA power law distributions on the “Uniform” sample.
Community distribution: BFS sample
The results obtained by analyzing the BFS sample show partially different characteristics. Figures 6.3 and 6.4 show the cluster dimension distribution obtained by using, respectively, FNCA and LPA applied to the BFS sample. Both these distributions show some fluctuations if compared to the power law distribution adopted as a possible fitting function. By using FNCA (see Figure 6.3), the peak of the distribution is represented by those communities constituted of 10–30 members, then it sharply slopes, depicting a first fluctuation around clusters of about a hundred members, and a second minor fluctuation around three hundred members. A similar behavior is shown by the LPA algorithm (see Figure 6.4).
The differences in behavior between the BFS and "Uniform" sample distributions are consistent with the adopted sampling techniques. In fact, (125, 170) put into evidence the influence of the adopted sampling methods on the characteristics of the obtained sets, focusing in particular on the possible bias towards high degree nodes introduced by the BFS algorithm when the BFS visit is incomplete (as in our case).
We can draw the conclusion that the adoption of the BFS sampling technique is not very appropriate when one wants to investigate the community structure of a large network whose complete sampling is not feasible, for example because of constraints imposed by the network itself or by its dimension (such as in the case of Facebook). On the other hand, BFS sampling has proved to be effective and efficient in the opposite cases.
Figure 6.3: FNCA power law distribution on the BFS sample.
Figure 6.4: LPA power law distribution on the BFS sample.
Overlapping Rate between Distributions
The idea that two different algorithms could produce different community structures is not
counterintuitive, but in our case we have some indications that the obtained results could share
a high degree of similarity. To this purpose, in the following we investigate the similarities
among the community structures obtained by using the two different algorithms, FNCA and
LPA. This is represented by the overlapping rate calculated considering the distributions of
the community dimensions from a quantitative perspective. To do so, we adopt a divergence
measure, called Kullback-Leibler divergence, that is defined as
DKL (P ||Q) =
X
i
P (i) log
P (i)
Q(i)
(6.3)
where P and Q represent, respectively, the probability distribution that characterizes the behavior of the LPA and the FNCA community sizes, calculated on a given sample. In detail, let
i be a given size such that P (i) and Q(i) represent the probability that a community of size i
exists in the distribution P and Q. The KL divergence is helpful if one would like to calculate
how much different is a distribution with respect to a given one.
In particular, being the KL divergence defined in the interval 0 ≤ DKL ≤ ∞, the smaller the
value of KL divergence between two distributions, the more similar they are. In the light of
this assumption, we calculated the pairwise KL divergences between the distributions discussed
above, finding the following results:
• On the “Uniform” sample:
– DKL (LP A||F N CA) = 0.007722
– DKL (F N CA||LP A) = 0.007542
• On the BFS sample:
– DKL (LP A||F N CA) = 0.003764
– DKL (F N CA||LP A) = 0.004292
The values found by adopting the KL divergence put into evidence a strong correlation between
the distributions calculated by using the two different algorithms on the two different samples.
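For illustration, a minimal sketch of the computation of the KL divergence between the two empirical community-size distributions follows; the smoothing constant used to handle sizes observed in only one of the two structures is our own assumption, not a detail specified in the text.

```python
import math
from collections import Counter

def kl_divergence(sizes_p, sizes_q, eps=1e-12):
    """D_KL(P||Q) of Equation 6.3 between the empirical community-size
    distributions of two community structures. The small epsilon, used for
    sizes observed in one structure but not in the other, is an arbitrary
    smoothing choice."""
    p_counts, q_counts = Counter(sizes_p), Counter(sizes_q)
    p_tot, q_tot = sum(p_counts.values()), sum(q_counts.values())
    d = 0.0
    for size in set(p_counts) | set(q_counts):
        p = p_counts[size] / p_tot
        q = max(q_counts[size] / q_tot, eps)
        if p > 0:
            d += p * math.log(p / q)
    return d

# Toy usage: two lists of community sizes (e.g. one from LPA, one from FNCA)
print(kl_divergence([3, 3, 5, 8, 13, 3], [3, 5, 5, 8, 13, 2]))
```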
From a graphical standpoint, we put into evidence the correlation found by means of the KL
divergence, as follows. In Figures 6.5 and 6.6 a semi-logarithmic scale has been adopted. In
Figure 6.5, we plotted together the distributions depicted in Figures 6.1 and 6.2 that represent
the community structure of the “Uniform” sample. Similarly, Figure 6.6 shows the distributions
presented in Figures 6.3 and 6.4 regarding the BFS sample.
By analyzing the distribution of the community sizes of the "Uniform" set, a perfectly linear behavior emerges, which characterizes both the FNCA and the LPA results. This agrees with the power law distributions previously emphasized, which well depict the behavior of the emergent community structure in that sample. Additionally, the two distributions are almost overlapping.
A similar consideration holds for the BFS sample. Even though the distributions suffer from the spikes previously discussed, a strong correlation between them has been put into evidence
both by the KL divergence and by the graphical representation. These indications gave us the
opportunity of investigating from a qualitative perspective the characteristics of the community
structure of Facebook.
In detail, a different consideration regarding the qualitative analysis of the similarity of the two different community structures is provided in the next Section. That kind of investigation aims at evaluating which members constitute the communities detected by adopting the algorithms previously introduced. Our findings prove that, regardless of the adopted community detection algorithm, the discovered communities are not only characterized by similar size distributions, but are also mainly constituted of the same members. This finding proves that the emergent community structure of Facebook is well characterized and defined, in agreement with the quantitative results we discussed above.
Figure 6.5: FNCA vs. LPA (UNI).
Community structure similarity
In this Section we introduce the methodology of investigation of the similarity among different
community structures. A community structure is represented by a list of vectors which are
identified by a “community-ID”; each vector contains the list of user-IDs (in anonymized format)
of the users belonging to that specific community; an example is depicted in Table 6.2.
Community-ID       List of Members
community-ID1      {user-IDa; user-IDb; . . . ; user-IDc}
community-ID2      {user-IDi; user-IDj; . . . ; user-IDk}
...                {. . . }
community-IDN      {user-IDx; user-IDy; . . . ; user-IDz}

Table 6.2: Representation of community structures.
Figure 6.6: FNCA vs. LPA (BFS).
In order to evaluate the similarity of the community structures obtained by using the two algorithms, FNCA and LPA, a coarse-grained way to compare the two sample sets would be to adopt a simple measure of similarity such as the Jaccard coefficient, defined as

J(A, B) = \frac{|A \cap B|}{|A \cup B|}     (6.4)
where A and B represent the two community structures. While calculating the intersection of the two sets, communities differing even by only one member would be considered different, even though a high degree of similarity among them might exist.
A more convenient way to compute the similarity among these sets is to evaluate the Jaccard
coefficient at the finest level, comparing each vector of the former set against all the vectors in
the latter set, in order to “match” the most similar ones. Under these assumptions, the Jaccard
coefficient could be rewritten in its vectorial formulation as
J(v, w) =
M11
M01 + M10 + M11
(6.5)
where M11 represents the total number of shared elements between vectors v and w, M01
represents the total number of elements belonging to w and not belonging to v, and, finally M10
the vice-versa. The result lies in [0, 1]. The more two compared communities are similar, or, in
other words, the more the constituting members of two compared communities are overlapping,
the higher the value of the Jaccard coefficient computed this way.
An almost equivalent way to compute the similarity with a high degree of accuracy would be to apply the Cosine similarity to each possible pair of vectors belonging to the two sets. The Cosine similarity is defined as

\cos(\Theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}}     (6.6)
where A_i and B_i represent the binary frequency vectors computed on the member lists over i. Once the most similar pairs of communities between the two compared sets have been matched, the mean degree of similarity is computed as
\frac{\sum_{i=1}^{N} \max(J(v, w)_i)}{N} \qquad \text{and} \qquad \frac{\sum_{i=1}^{N} \max(\cos(\Theta)_i)}{N}     (6.7)
where max(J(v, w)_i) and max(cos(Θ)_i) denote the highest value of similarity among those calculated by combining the vector i of the former set A with all the vectors of the latter set B, adopting, respectively, the Jaccard coefficient and the Cosine similarity. We obtained the results reported in Table 6.3.
Metric   Dataset    Degree of Similarity FNCA vs. LPA
                    In Common    Mean      Median    Std. D.
J        BFS        2.45%        73.28%    74.24%    18.76%
J        Uniform    35.57%       91.53%    98.63%    15.98%

Table 6.3: Similarity degree of community structures
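An illustrative sketch of the matching procedure of Equations 6.5 and 6.7 follows: each community of one structure is compared against all communities of the other and the best Jaccard score is retained, then the best scores are averaged. The brute-force quadratic matching and the toy data are assumptions made for clarity; the actual implementation used in our experiments is not reported here.

```python
def jaccard(a, b):
    """Vectorial Jaccard coefficient (Equation 6.5) between two communities,
    each represented as a set of user IDs."""
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 0.0

def mean_best_match_similarity(structure_a, structure_b):
    """Mean degree of similarity of Equation 6.7: each community of the former
    structure is matched with the most similar community of the latter and
    the best scores are averaged."""
    best = [max(jaccard(a, b) for b in structure_b) for a in structure_a]
    return sum(best) / len(best)

# Toy usage with two almost identical community structures
A = [{1, 2, 3}, {4, 5}, {6, 7, 8, 9}]
B = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]
print(round(mean_best_match_similarity(A, B), 2))
```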
As deducible by analyzing Figures 6.5 and 6.6, not only do the community structures calculated by using the two different algorithms, FNCA and LPA, follow similar distributions with respect to the dimensions, but the communities themselves are also constituted mostly by the same members or by a large number of common members.
From these results it emerges that both algorithms produce a faithful and reliable clustering representing the community structure of the Facebook network. Moreover, while the number of identical communities between the two sets obtained by using the BFS and "Uniform" sampling is not very high (i.e., respectively, ≈2% and ≈35%), the overall mean degree of similarity is very high (i.e., ≈73% and ≈91%).
Considering the way we compute the mean similarity degree, as in Equation 6.7, this is due to the high number of communities which differ only by a very small number of members. Finally, the fact that the median, which identifies the second quartile of the samples, is, respectively, ≈75% and ≈99%, demonstrates the strong similarity of the produced result sets.
All these considerations emerge graphically by analyzing Figures 6.7 and 6.8: the higher the degree of similarity, calculated by using the Jaccard coefficient, the denser the distribution, in particular in the first quartile, becoming evident for values near 1. The unbiased characteristics of the "Uniform" sample are also reflected in Figure 6.7, in which the similarity degree of the community structure is evident because most of the values lie in the boundary zone near 1. The degree of similarity of the community structure of the BFS sample, shown in Figure 6.8, appears more spread over the second half of the distribution, becoming denser in the first quartile.
Finally, Figures 6.9 and 6.10 summarize these findings. The interpretation of these heat-maps is as follows: the higher the degree of similarity between the compared community structures, the higher the heat-map scores. The similarity becomes graphically evident considering that the heat values shown in the figures are very high over most of the map.
Figure 6.7: Jaccard distribution: FNCA vs. LPA (UNI).
Figure 6.8: Jaccard distribution: FNCA vs. LPA (BFS).
Figure 6.9: Heat-map: FNCA vs. LPA (UNI).
Figure 6.10: Heat-map: FNCA vs. LPA (BFS).
Resolution limit and outliers
As previously discussed, community detection algorithms based on the network modularity maximization paradigm may suffer from a resolution limit. In (110), the authors proved that modularity optimization could fail in the detection of communities smaller than a given threshold. As an effect, large communities are created which incorporate smaller ones, compromising the final quality of the clustering.
We investigated the effect of the resolution limit put into evidence by (110) on our datasets, with the purpose of assessing the quality of our analysis.
The results of this investigation, respectively on the BFS and the "Uniform" samples, can be discussed separately. On the former dataset, a small number of communities whose dimensions exceed those obtained in the distributions previously discussed has been identified. Possibly, these large communities have been identified because of the resolution limit.
Table 6.4 reports the number of outliers, i.e., those communities that statistically exceed the average dimension and are suspected of suffering from the problem of the resolution limit. From this analysis it emerges that a smaller number of outliers has been found by using the LPA method, with respect to the adoption of the FNCA algorithm, in the context of the BFS sample. This could indicate that FNCA, which is a modularity maximization algorithm, may suffer from the resolution limit.
On the contrary, the "Uniform" sample apparently does not cause any resolution limit problem. By using FNCA on the "Uniform" sample, a large number of communities whose dimension is slightly greater than one thousand members appears, which coincides with the final part of the tail of the power law distribution depicted in Figure 6.1. The LPA method applied to the "Uniform" sample provides possibly the most reliable results, without incurring any outliers.
Set    Alg.     Amount with respect to Number of Members
                ≥ 1K    ≥ 5K    ≥ 10K    ≥ 50K    ≥ 100K
BFS    FNCA     4       1       2        1        1
BFS    LPA      1       0       2        0        1
UNI    FNCA     81      0       0        0        0
UNI    LPA      0       0       0        0        0

Table 6.4: The presence of outliers in our community structures.
6.3 Community Structure
6.3.1 Building the Community Meta-network
Once we verified the quality of the community detection which unveiled the community structure of Facebook, we proceed with its analysis. First of all, we build a meta-network of the community structure, as follows. We generate a new weighted undirected graph G′ = (V′, E′, ω), whose set of nodes is represented by the communities constituting the given community structure. In G′ there exists an edge e′_{uv} ∈ E′ connecting a pair of nodes u, v ∈ V′ if and only if there exists in the social network graph G = (V, E) at least one edge e_{ij} ∈ E which connects a pair of vertices i, j ∈ V such that i ∈ u and j ∈ v (i.e., user i belongs to community u and user j belongs to community v). The weight function is simply defined as ω_{u,v} = \sum_{i \in u, j \in v} e_{ij} (i.e., the total number of edges connecting the users belonging to u and those belonging to v).
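A minimal sketch of the construction of the meta-network G′ just defined could look as follows; the representation of the graph as an edge list and the choice of skipping intra-community edges (rather than keeping them as self-loops) are illustrative assumptions.

```python
from collections import defaultdict

def build_meta_network(edges, community_of):
    """Builds the weighted meta-network G' described above.
    edges: iterable of (i, j) friendship links of the original graph G;
    community_of: dict mapping each user to the ID of its community.
    Returns {(u, v): weight} with u <= v, where the weight counts the
    original edges between communities u and v."""
    weights = defaultdict(int)
    for i, j in edges:
        u, v = community_of[i], community_of[j]
        if u != v:                                   # inter-community edge
            key = (u, v) if u <= v else (v, u)
            weights[key] += 1
        # intra-community edges could be kept as self-loops; they are skipped here
    return dict(weights)

# Toy usage: three users in community "a", two in community "b"
community_of = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b"}
edges = [(1, 2), (2, 3), (3, 4), (1, 5), (4, 5)]
print(build_meta_network(edges, community_of))       # {("a", "b"): 2}
```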
Table 6.5 summarizes the results obtained for the uniform sample by using FNCA and LPA.
Something which immediately emerges is that the results obtained by using the two different
community detection methods are very similar. The number of nodes in the meta-networks
is smaller than the total number of communities discovered by the algorithms, because we
excluded all those communities containing only one member, which are reasonably believed to be constituted by inactive users. We discuss the features of the community structure meta-network in Section 6.3.2.
Feature                        FNCA               LPA
No. nodes/edges                36,248/836,130     35,276/785,751
Min./Max./Avg. weight          1/16,088/1.47      1/7,712/1.47
Size largest conn. comp.       99.76%             99.75%
Avg. degree                    46.13              44.54
2nd largest eigenvalue         171.54             23.63
Effective diameter             4.85               4.45
Avg. clustering coefficient    0.1236             0.1318
Density                        0.127%             0.126%

Table 6.5: Features of the meta-networks representing the community structure for the uniform sample.
Figure 6.11, depicted by using Cvis1 – a hierarchical-based circular visualization algorithm –,
represents the community structure unveiled by LPA from the uniform sample. It is possible
to appreciate that there exists a tight core of communities which occupy a central position
into the meta-network. Moreover, an in-depth analysis reveals that the positioning of the
communities is generally irrespective of their size. This means that there are several different
small communities which play a dominant role in the network. Similarly, the periphery of the
graph is constituted both by small and larger communities.
The visual analysis of large-scale networks is usually unfeasible when managing samples of such
a size, but by adopting the meta-network representation of the community structure we are
able to infer additional insights about the structure of the original network.
6.3.2 Meta-network Analysis
Node degree and clustering coefficient
Figure 6.12 depicts the node degree probability distribution and the average clustering coefficient plotted as a function of the node degree for the two community detection techniques. Analyzing the degree distribution of the community structure meta-network we find a very peculiar feature. In detail, the distribution is clearly identified by two different regimes, roughly 1 ≤ x < 10² and x ≥ 10². Both regimes of the probability distribution fit well to a power law P(x) ∝ x^−γ, with γ = 0.56 for the former and γ = 3.51 for the latter regime. Such a peculiar behavior has previously been found in the Facebook social graph.
1 https://sites.google.com/site/andrealancichinetti/cvis
Figure 6.11: Meta-network representing the community structure (UNI with LPA).
The clustering coefficient of a node is the ratio of the number of existing links over the number of possible links between its neighbors. Given a network G = (V, E), we recall the definition of the clustering coefficient C_i of a node i ∈ V as C_i = 2|{(v, w) | (i, v), (i, w), (v, w) ∈ E}| / (k_i(k_i − 1)), where k_i is the degree of node i. In our case, it can be interpreted as the probability that any two randomly chosen communities that share a common neighbor also have a link between them. The high values of the average clustering coefficient obtained for the community structure meta-network are an interesting indicator that the communities are well connected among each other. This is a peculiar feature, which reflects the small world effect, well known in social networks. Moreover, this means that, for two randomly chosen disconnected communities, it is very likely that a very short path connecting their members exists. Finally, the clear power law distribution which describes the average clustering coefficient for this network has an exponent γ = 0.48.
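For illustration, the clustering coefficients of the meta-network nodes can be computed, for example, with networkx, which implements the definition of C_i recalled above; the tiny edge list below is only a placeholder for the actual meta-network.

```python
import networkx as nx

# The edge list below is only a placeholder: in practice the graph would be
# the community meta-network built from the detected community structure.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 2)])

print(nx.clustering(G, 2))          # C_i of a single node (node 2 here)
print(nx.average_clustering(G))     # average clustering coefficient of the whole network
```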
Figure 6.12: Meta-network degree and clustering coefficient distribution (UNI).
Hops and shortest paths distribution
In the following we analyze the effective diameter and the shortest path distribution in the community structure. Figure 6.13 represents the probability distribution of the shortest paths against the path length and, concurrently, the number of pairs of nodes connected by paths of a given length. The interesting behavior which emerges from the analysis is that the shortest path probability distribution reaches a peak for paths of length 2 and 3. In correspondence with this peak, the number of connected pairs of communities quickly grows, reaching the effective diameter of the network, whose value is slightly above 4.
This finding has an important impact on the features of the overall social graph. In fact, if we suppose that all the nodes belonging to a given community are well connected by very short paths, or even directly connected, this would result in a very short diameter of the social graph itself. In fact, there would always exist a very short path connecting the communities of any pair of randomly chosen members of the social network. Moreover, this result has been recently assessed by using heuristic techniques on the whole Facebook network (269).
Figure 6.13: Meta-network hops and shortest paths distribution (UNI).
Weight and strength distribution
The analysis of the weight and strength probability distribution is depicted in Figure 6.14.
We recall that the strength s_ω(v) (or weighted degree) of a given node v is defined as the sum of the weights of all edges incident on v,

s_\omega(v) = \sum_{e \in I(v)} \omega(e)
where ω(e) is the weight of a given edge e and I(v) the set of edges incident on v.
In detail, both distributions resemble a power law behavior. The former is defined by a single
regime clearly described by a coefficient γ = 1.45. The latter is better described by two different
regimes, in intervals roughly similar to those of the node degree probability distribution, with the two coefficients γ = 1.50 and γ = 3.12.
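A minimal sketch of the computation of the node strength defined above follows; the edge-list representation of the weighted meta-network and the toy values are illustrative assumptions.

```python
from collections import defaultdict

def node_strengths(weighted_edges):
    """Strength (weighted degree) s_w(v) of every node, i.e. the sum of the
    weights of all edges incident on v, as defined above.
    weighted_edges: iterable of (u, v, w) tuples of the meta-network."""
    strength = defaultdict(float)
    for u, v, w in weighted_edges:
        strength[u] += w
        strength[v] += w
    return dict(strength)

# Toy usage on three weighted inter-community links
print(node_strengths([("c1", "c2", 3), ("c2", "c3", 1), ("c1", "c3", 2)]))
```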
Figure 6.15 shows a heat-map of the distribution of connections among the communities in the meta-network. It emerges that the links mainly connect communities of medium-large dimension. This aspect is important because it highlights the role of the links with high weight and strength. For example, they efficiently connect communities containing many members that would otherwise be far from each other. On the other hand, according to the strength of weak ties theory (137), weak links typically occur among communities that do not share a large number of neighbors, and are important to keep the network efficiently connected.
Figure 6.14: Meta-network weights vs. strengths distribution (UNI).
6.3.3 Discussion of Results
A summary of the results achieved with our analysis of the Facebook community structure
follows.
First of all, in this Section we put into evidence that the community structure of the Facebook
social network presents a clear power law distribution of the dimension of the communities,
similarly to other large social networks (176).
This result is independent of the algorithm adopted to discover the community structure, and even (although in a less evident way) of the sampling methodology adopted to collect
the datasets. On the other hand, this is the first experimental work that proves on a large scale
the hypothesis, theoretically advanced by (170), of the possible bias towards high degree nodes
introduced by the BFS sampling methodology for incomplete visits of large graphs.
Regarding the qualitative analysis of our results, it emerges that the community structure is well defined. In fact, regardless of the algorithm adopted for their discovery, the communities share a high degree of similarity among different datasets, which means that they emerge clearly from the topology of the network.
As for the community detection algorithm, we found that the LPA method represents a feasible choice among the heuristic methods based on local information to unveil the community structure of a large network. Compared against FNCA, its results appear slightly better, in particular if we consider the well-known problem of the resolution limit (110) that affects the process of community detection on a large scale for modularity optimization algorithms.
Figure 6.15: Meta-network heat-map of the distribution of connections (UNI).
The performance provided by the two algorithms is comparable and reasonable for large-scale analysis. Even if the computational cost of these two techniques is very similar, we experienced that the LPA method performs slightly better than FNCA on our datasets.
Finally, the analysis of the community structure meta-network puts into evidence different mesoscopic features. For example, we discovered that the community structure is characterized by a power law probability distribution of node degree and weight, and we found that it reflects the well-known small world effect.
6.4 The Strength of Weak Ties
This Section introduces some experiments in the direction of the quantitative assessment of
the theory known as the strength of weak ties, whose foundations lie in Sociology (137). In
particular, by means of the data previously acquired and exploiting the analysis carried out
regarding the community structure of Facebook, we have been able to assess some features of
this theory in the context of a real-world large scale social network like Facebook. We try to
capture the original intuition underlying the so-called weak ties and their role in complex social
networks.
In particular, we have been concerned with the experimental assessment of the importance, foreseen by the early works of Mark Granovetter (137), of weak ties, i.e., human relationships (acquaintance, loose friendship, etc.) that are less binding than family and close friendship but might, according to Granovetter, yield better access to information and opportunities. Facebook is organized around the recording of just one type of relationship, i.e., friendship. Of course, Facebook friendship captures several degrees and nuances of the human relationships that are hard to separate and characterize within data analysis. However, weak ties have a clear and valuable interpretation: friendship between individuals who otherwise belong to distant areas of the friendship graph or, in other words, who happen to have most of their other relationships in different national/linguistic/age/common experience groups. Such weak ties have strength precisely because they connect distant areas of the network, thus yielding several interesting properties, which will be discussed in the following.
6.4.1 Methodology
The classical definition of strength of a social tie has been provided by Granovetter (137):
The strength of a tie is a (probably linear) combination of the amount of time,
the emotional intensity, the intimacy (mutual confiding), and the reciprocal services
which characterize the tie.
This definition introduces some important features of a social tie that will be discussed later, in particular: (i) the intensity of the connection, and (ii) the mutuality of the relationship.
Granovetter’s paper gives a formal definition of strong and weak ties by introducing the concept
of bridge:
A bridge is a line in a network which provides the only path between two points.
Since, in general, each person has a great many contacts, a bridge between A and
B provides the only route along which information or influence can flow from any
contact of A to any contact of B.
From this definition it emerges that – at least in the context of social networks – no strong tie
is a bridge. However, that is not sufficient to affirm that each weak tie is a bridge, but what is
important is that all bridges are weak ties.
Granovetter’s definition of bridge is restrictive and unsuitable for the analysis of large-scale
social networks. In fact, because of the well-known features such as the small world effect and
the scale-free degree distribution, it is unlikely to find an edge whose deletion would lead to the
inability for two nodes to connect by means of alternative paths. On the other hand, without
loss of generality on a large scale, we can define a shortcut bridge as the link that connects any
pair of nodes whose deletion would cause an increase of the distance between them, being the
distance of two nodes the length of the shortest path linking them.
Unfortunately, this definition also leads to two relevant problems. The former is due to the introduction of the concept of shortest paths; the latter is due to the arbitrariness of the concept of distance between nodes. In detail, regarding the shortest paths, the computation of all pairs shortest paths has a high computational cost which makes it unfeasible even on networks of modest size – even worse when considering large social networks. Regarding the second aspect, in the context of shortest paths the distance could be considered as the number of hops required to connect two given nodes. Alternatively, it could be possible to assign a value of strength (i.e., a weight) to each edge of the network and to define the distance of two nodes as the cost of the cheapest path joining them1. In such a case, however, we do not know whether this definition of distance is better than the previous one, but its computation remains excessively expensive in real-life networks.
1 In this context, measuring the strength of the edges in online social networks has been recently advanced by (123, 231, 281).
In the light of the considerations above, we suspect that the problem of discriminating weak and strong ties in a social network is not trivial, at least on a large scale. To this purpose, in the following we give a definition of weak ties from a different perspective, trying not to distort Granovetter's original intuition.
In particular, recalling that weak ties are considered as loose connections between any given individual and her/his acquaintances with whom she/he seldom interacts and who belong to different areas of the social graph, we define weak ties as those ties that connect any pair of nodes belonging to different communities. Note that our definition is more relaxed than that provided by Granovetter. In detail, the fact that two nodes connected by a tie belong to different communities does not necessarily imply that the connection between them is a bridge, nor a shortcut bridge, since its deletion might not increase the length of the path connecting them (there could still exist another path of the same length). On the other hand, in our opinion, it is a reasonable assumption at least in the context of large social networks, since it has been proved that the edges connecting different communities are bottlenecks (217) and their iterative deletion causes the fragmentation of the network into disconnected components.
One of the most important characteristics of weak ties is that those which are bridges create more, and shorter, paths. The effect of the deletion of a weak tie would be more disruptive than the removal of a strong tie1, from a community structure perspective.
Experimental Set Up
In order to verify the strength of weak ties theory on a large scale we initially carefully analyzed
the features of existing online social networks, considering some requirements that come directly
from Granovetter’s seminal work (137):
Ties discussed in this paper are assumed to be positive and symmetric. Discussion
of operational measures of and weights attaching to each of the four elements is
postponed to future empirical studies.
Granovetter introduces two concepts that are crucial to understanding weak ties. The first is related to the symmetry of the relationship between two individuals of the network. This concept is strictly interconnected with the definition of the mutual friendship relation which characterizes several online social networks. In detail, a friendship connection is symmetric (i.e., mutual) if there is no directionality in the relation between two individuals – Facebook friendship is perhaps the best-known example – otherwise the relation is asymmetric.
While in real-world social networks the classification of a relation between individuals can be non-trivial, online social network platforms make it possible to clearly and uniquely define different types of connections among users. For example, in Twitter the concept of relation between two individuals intrinsically holds a directionality. In fact, each user can be a follower of others, can retweet their tweets and can mention them.
Recently, research has started on assessing the strength of weak ties in the context of a directed
network (136, 226). In directed networks, however, what is also important is the weight assigned
to connections. Even if the possibility of weighting connections among users of social networks has been recently envisaged by us (81) as well as by other authors (123, 231, 281), we consider a network represented by an unweighted graph the most appropriate setting for a quantitative validation of the theory.
1 For this reason weak ties have been recently proved to be very effective in the diffusion of information and in the rumor spreading through social networks (61, 292).
Facebook arguably represents an ideal setting for the validation of the strength of weak ties
theory. In fact, both of Granovetter’s requirements are satisfied in the Facebook friendship
network because:
• it is naturally represented as an undirected graph: friendship in Facebook is symmetric,
and
• it can be represented by adopting an unweighted graph1.
To sum up, our definition of the Facebook social graph is simply an unweighted, undirected
graph G = (V, E) where vertices v ∈ V represent Facebook users and edges e ∈ E represent
the friendship connections among them.
In this context, we define as weak ties those edges that, after dividing the network structure into communities (obtaining the so-called community structure), connect nodes belonging to different communities. Vice versa, we classify as strong ties the intra-community edges.
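A minimal sketch of this classification of ties, given a community assignment for each user, could look as follows; the data structures and the toy example are illustrative assumptions.

```python
def classify_ties(edges, community_of):
    """Splits the edges of the (undirected, unweighted) friendship graph into
    weak ties (inter-community edges) and strong ties (intra-community edges),
    following the definition given above."""
    weak, strong = [], []
    for i, j in edges:
        (weak if community_of[i] != community_of[j] else strong).append((i, j))
    return weak, strong

# Toy usage: users 1-3 in one community, users 4-5 in another
community_of = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b"}
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
weak, strong = classify_ties(edges, community_of)
print(len(weak), len(strong))   # 1 weak tie, 3 strong ties
```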
6.4.2 Experiments
Recently, several works focused on the Facebook social graph (58, 126, 269) and on its community structure (99, 101, 205), but none of them has been carried out to assess the validity of the strength of weak ties theory. In this section: (i) firstly, we investigate the presence and behavior of strong and weak ties in such a network; (ii) finally, we try to describe the density of weak ties among communities and the way in which they are distributed as a function of the size of the communities themselves.
Distribution and CCDF of strong and weak ties
The first experiment is devoted to understanding the presence and the distribution of strong and weak ties among communities. To this purpose, we consider the community structure discussed above, classifying as weak ties those edges connecting nodes belonging to different communities, and as strong ties the remaining ones.
Intuitively, given the power law distribution of the size of the communities (and, coincidentally, the power law distribution of node degrees), the number of weak ties will be much greater than the number of strong ties. Even though this effect could appear counter-intuitive (for example, we could suppose that weak ties are much rarer than strong ties on a large scale), we should recall that some sociological theories2 assume that individuals tend to aggregate in small communities3, i.e., most of the connections among individuals are weak ties in Granovetter's sense – a small number of contacts, a low frequency of interactions, etc.
1 Of course, this is not necessarily the only valid representation of the Facebook network since it should be
possible to adopt a weighted network where edge weights represent, for example, the intensity of the relations
among users.
2 For example, cognitive balance (147, 207), triadic closure (137) and homophily (198).
3 According to these theories, we can explain that the intensity of human relations is very tight in small
groups of individuals, and decreases towards individuals belonging to distant communities.
These intuitions are reflected in Figure 6.16. For each node v ∈ V of the graph G = (V, E), Figure 6.16 depicts the number of strong and weak ties incident on v. It is evident that the weak ties are much more numerous than the strong ties. The two distributions tend to behave quite similarly, but they maintain a roughly constant offset which represents the ratio between strong and weak ties in this network. This ratio is approximately 80%–20% and also carries an important social interpretation. In fact, it is closely related to the concept of rich club – deriving from the renowned Pareto principle (212) – whose validity has been recently proved for complex networks (70) (for example for the Internet (294) and scientific collaboration networks (224)).
In addition, since both distributions resemble a straight line (which, in a log–log plot, indicates a scale-free behavior), we can assume that the distribution of weak and strong ties is also well described by a power law, as in the case of node degree and community size.
From a different perspective on the same picture, Figure 6.17 represents the CCDF of the probability of finding a given number of strong and weak ties in the network. From its analysis, an important difference emerges between the behavior of weak and strong ties. In detail, the cumulative probability of finding a node with an increasing number of strong ties quickly decreases. Tentatively, it is possible to identify in k ≈ 5 the tipping point from which the presence of weak ties quickly overcomes that of strong ties, making the latter less numerous in nodes with degree higher than k.
Figure 6.16: Distribution of strong vs. weak ties in Facebook.
Density of weak ties among communities and link fraction
The last experiment discussed in this Chapter is devoted to understanding the density of weak ties connecting communities in Facebook. In particular, we are interested in determining to what extent a weak tie links communities of comparable size. To do so, we considered each weak tie in the network and computed the size of the community to which the source node of the weak tie belongs. Similarly, we computed the size of the target community.
Figure 6.17: CCDF of strong vs. weak ties in Facebook.
Figure 6.18 represents a density map of the distribution of weak ties among communities. First,
we highlight that the map is symmetric with respect to the diagonal, in accordance with the fact that
the graph is undirected and each weak tie is counted twice, once for each end-vertex. From the
analysis of this figure, it clearly emerges that the weak ties mainly connect nodes belonging to
small communities. To a certain extent, this could be intuitive since the number of communities
of small size, according to their power law distribution, is much greater than the number of
large communities. On the other hand, it is an important assessment since similar results have
been recently described for Twitter (136).
As a further analysis, we carried out another investigation oriented to evaluating the
amount of weak ties that fall in each given community with respect to its size. The results of
this assessment are reported in Figure 6.19. The interpretation of this plot is the following: the
y-axis represents the fraction of weak ties per community as a function of the size of
the community itself, reported on the x-axis. It emerges that the distribution of the link
fraction against the size of the communities also resembles a power law.
Indeed, this result differs from that recently proved for Twitter (136), in which a Gaussian-like distribution has been discovered. This is probably due to the intrinsic characteristics of the
networks, which are topologically dissimilar (i.e., Twitter is represented by a directed graph with
multiple types of edges), and also to the different interpretation of a social tie. In fact, Twitter
represents in a way hierarchical connections (in the form of follower and followed users), while
Facebook tries to reflect a friendship social structure which better represents the community
structure of real social networks.
Figure 6.18: Density of weak ties among communities.
Figure 6.19: Link fraction as a function of the community size.
Conclusion
In this Chapter we presented a large-scale community structure investigation of the Facebook
social network. We adopted two fast and efficient algorithms already presented in the literature,
specifically optimized to detect the community structure of large-scale networks, such as Facebook, consisting of millions of nodes and edges. A very strong community structure emerges
from our analysis, and several characteristics have been highlighted by our experimentation, such
as a typical power law distribution of the size of the clusters. We also investigated the
degree of similarity of the different community structures obtained by using the two algorithms
on the respective samples, putting into evidence strong similarities.
Once the presence of the community structure had been assessed, we studied the mesoscopic
characteristics of the community structure meta-network, verifying that most of the features
presented by the original social graph hold in the community structure. In particular, we
verified the presence of a power law distribution of the community sizes and degrees, and the
clustering effect in this network. Moreover, we encountered the presence of a small world effect
which contributes to the existence of a very small diameter and very short paths between any
pair of communities.
Finally, we investigated the validity of the strength of weak ties sociological theory on Facebook.
Since it is well known that this theory is strictly related to the community structure, our
findings support this aspect, providing several quantitative clues which testify to the presence and
the importance of weak ties in the network.
7 A Novel Centrality Measure for Social Networks
The Chapter is organized as follows: Section 7.1 presents the literature related to the problem
of computing centrality on graphs. In Section 7.2 we provide some background information on
those problems related to centrality measures. Section 7.3 presents our novel κ-path edge centrality, including the fast algorithm for its computation. An extensive experimental evaluation
of performance of this strategy is discussed in Section 7.4. In Section 7.5 we discuss different
fields of application of our approach; in Section 7.6 we describe its adoption to devise a new
efficient technique of community detection well suited for the investigation of the community
structure of large networks. The Chapter concludes in Section 7.7, in which we report the results of the experimentation of this algorithm applied to different social and biological network
problems.
7.1 Background and Related Literature
In the context of social knowledge management, not only from a scientific perspective but
also for commercial or strategic motivations, the identification of the principal actors inside a
network is very important. Such an identification requires defining an importance measure
(also referred to as centrality) to weight nodes and/or edges of a given network.
The simplest approaches to computing centrality consider only the local topological properties
of a node/edge in the social network graph: for instance, the most intuitive node centrality
measure is the degree of a node, i.e., the number of social contacts of a user.
Unfortunately, local measures of centrality, whose estimation is computationally feasible even on
large networks, do not produce very faithful results (56).
For this reason, many authors suggested considering the whole social network topology to
compute centrality values. A new family of centrality measures was thus born, called global measures.
Some examples of global centrality measures are closeness (248) and betweenness centrality (for
nodes (112), and for edges (11, 124)).
Betweenness centrality is one of the most popular measures and its computation is the core
component of a range of algorithms and applications. It relies on the idea that, in social
networks, information flows along shortest paths: as a consequence, a node/edge has a high
betweenness centrality if a large number of shortest paths crosses it.
Some authors, however, raised concerns about the effectiveness of the betweenness centrality. First of all, the problem of computing the exact value of betweenness centrality for each
node/edge of a given graph becomes computationally demanding – or even unfeasible – as the size of
the analyzed network grows. Therefore, the need for fast, even if approximate, techniques
to compute betweenness centrality arises, and this is currently a relevant research topic in Social
Network Analysis.
A further issue is that the assumption that information in social networks propagates only along
shortest paths may not hold (261). By contrast, information propagation models have been
proposed in which information, encoded as messages generated in a source node and directed
toward a target node in the network, may flow along arbitrary paths. In the spirit of such
models, some authors (211, 221) suggested performing random walks on the social network to
compute centrality values.
A prominent approach following this research line is the work proposed in (6). In that work,
the authors introduced a novel node centrality measure known as κ-path centrality. In detail,
the authors suggested to use self-avoiding random walks (192) of length κ (being κ a suitable
integer) to compute centrality values. They provided an approximate algorithm, running in
O(κ^3 n^(2−2α) log n), being n the number of nodes and α ∈ [−1/2, 1/2].
In this Chapter we extend that work (6) by introducing a novel measure of edge centrality.
This measure is called κ-path edge centrality. In our approach, the procedure of computing
edge centrality is viewed as an information propagation problem. In detail, if we assume that
multiple messages are generated and propagated within a social network, an edge is considered
as “central” if it is frequently exploited to diffuse information.
Relying on this idea, we simulate message propagations through random walks on the social
network graphs. In our simulation, in addition, we assume that random walks are simple and of
bounded length up to a constant and user-defined value κ. The former assumption is because a
random walk should be forced to pass no more than once through an edge; the latter, because,
as in (115), we assume that the more distant two nodes are, the less they influence each other.
The computation of edge centrality has many practical applications in a wide range of contexts
and, in particular, in the area of Knowledge-Based (KB) Systems. For instance, in KB systems
in which data can be conveniently managed through graphs, the procedure of weighting edges
plays a key role in identifying communities, i.e., groups of nodes densely connected to each other
and weakly coupled with nodes residing outside the community itself (259, 280). This is useful to
better organize available knowledge: think, for instance, of an e-commerce platform and observe
that we could partition customer communities into smaller groups and selectively
forward messages (like commercial advertisements) only to groups whose members are actually
interested in them. In addition, in the context of the Semantic Web, edge centralities are useful to
quantify the strength of the relationships linking two objects and, therefore, can help to
discover new knowledge (245). Finally, in the context of social networks, edge centralities are
helpful to model the intensity of the social tie between two individuals (88): in such a case, we
could extract patterns of interactions among users in virtual communities and analyze them to
understand how a user is able to influence another one. The main contributions of this Chapter
are the following:
• We propose an approach based on random walks consisting of up to κ edges to compute
edge centrality. In detail, we observe that many approaches in the literature have been
proposed to compute node centrality but, comparatively, there are few studies on edge
centrality computation (among them we cite the edge betweenness centrality introduced
in the Girvan-Newman algorithm). In addition, some authors (50, 211, 221) successfully
applied random walks to compute node centrality in networks. We suggest extending
these ideas in the direction of edge centrality and, therefore, this work is the first attempt
to compute edge centrality by means of random walks.
• We design an algorithm to efficiently compute edge centrality. The worst case time complexity of our algorithm is O(κm), being m the number of edges in the social network
graph and κ a (typically small) parameter. Therefore, the running time of our algorithm
scales linearly with the number of edges of a social network. This is an interesting improvement over the state of the art: in fact, exact algorithms for computing
centrality run in O(n^3) and, with some ingenious optimizations, they can run in O(nm)
(45). Unfortunately, real-life social networks consist of up to millions of nodes/edges (203)
and, therefore, these approaches may not scale well. By contrast, our algorithm works
fairly well also on large real-life social networks even in the presence of limited computing
resources.
• We provide results of the performed experimentation, showing that our approach is able
to generate reproducible results even if it relies on random walks. Several experiments
have been carried out in order to emphasize that the κ-path edge centrality computation
is feasible even on large social networks. The properties shown by this measure are
discussed, in order to characterize each of the studied networks.
• Finally, we design a novel, computationally efficient community detection algorithm
based on the κ-path edge centrality and we apply it to social and biological networks with
encouraging results.
7.2 Centrality Measures and Applications
In this Section we review the concept of centrality measure and illustrate some recent approaches
to compute it.
7.2.1 Centrality Measure in Social Networks
One of the first (and the most popular) node centrality measures is the betweenness centrality
(112). We recall its definition:
Definition 1. (Betweenness centrality) Given a graph G = ⟨V, E⟩, the betweenness centrality
for the node v ∈ V is defined as

C_B^n(v) = Σ_{s≠v≠t∈V} σ_st(v) / σ_st    (7.1)

where s and t are nodes in V, σ_st is the number of shortest paths connecting s to t, and σ_st(v)
is the number of shortest paths connecting s to t passing through the node v.
If there is no path joining s and t we conventionally set σ_st(v)/σ_st = 0.
The concept of centrality has been defined also for the edges in a graph and, from an historical
standpoint, the first approach to compute edge centrality was proposed in 1971 by J.M. Anthonisse (11, 166) and was implemented in the GRADAP software package. In this approach,
edge centrality is interpreted as a “flow centrality” measure. To define it, let us consider a
graph G = ⟨V, E⟩ and let s ∈ V, t ∈ V be a fixed pair of nodes. Assume that a “unit of flow”
is injected in the network by picking s as the source node and assume that this unit flows in G
along the shortest paths. The rush index associated with the pair ⟨s, t⟩ and the edge e ∈ E is
defined as

δ_st(e) = σ_st(e) / σ_st

being, as before, σ_st the number of shortest paths connecting s to t, and σ_st(e) the number
of shortest paths connecting s to t passing through the edge e. As in the previous case, we
conventionally set δ_st(e) = 0 if there is no path joining s and t.
The rush index of an edge e ranges from 0 (if e does not belong to any shortest path joining s
and t) to 1 (if e belongs to all the shortest paths joining s and t). Therefore, the higher δ_st(e), the
more relevant the contribution of e in the transfer of a unit of flow from s to t. The centrality
of e can be defined by considering all the pairs ⟨s, t⟩ of nodes and by computing, for each pair,
the rush index δ_st(e); the centrality C_R^e(e) of e is the sum of all these contributions

C_R^e(e) = Σ_{s∈V} Σ_{t∈V} δ_st(e)
More recently, in 2002 Girvan and Newman proposed a definition of edge betweenness centrality
which strongly resembles that provided by Anthonisse.
According to the notation introduced above, the edge betweenness centrality for the edge e ∈ E
is defined as

C_B^e(e) = Σ_{s≠t∈V} σ_st(e) / σ_st    (7.2)
and it differs from that of Anthonisse because the source node s and the target node t must be
different.
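For small graphs, Equation (7.2) can be evaluated directly by enumerating shortest paths. The sketch below is a naive, pure-Python illustration of the definition on a hypothetical toy graph; it sums over unordered pairs of distinct nodes (a reasonable convention for an undirected graph) and is not meant to replace the optimized algorithms discussed in Section 7.2.2.

# Naive evaluation of Equation (7.2) on a tiny undirected toy graph.
from collections import deque
from itertools import combinations

graph = {  # adjacency list of a hypothetical toy graph
    "a": ["b", "c"], "b": ["a", "c", "d"],
    "c": ["a", "b", "d"], "d": ["b", "c", "e"], "e": ["d"],
}

def all_shortest_paths(src, dst):
    """Enumerate every shortest path from src to dst (BFS levels + DFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if dst not in dist:
        return []
    paths, stack = [], [[src]]
    while stack:
        path = stack.pop()
        u = path[-1]
        if u == dst:
            paths.append(path)
            continue
        for v in graph[u]:
            # follow only edges that go one BFS level deeper, without
            # overshooting the level of the target node
            if dist.get(v) == dist[u] + 1 and dist[v] <= dist[dst]:
                stack.append(path + [v])
    return paths

edge_betweenness = {}
for s, t in combinations(graph, 2):            # unordered pairs of distinct nodes
    paths = all_shortest_paths(s, t)
    for path in paths:
        for u, v in zip(path, path[1:]):
            e = frozenset((u, v))              # undirected edge
            edge_betweenness[e] = edge_betweenness.get(e, 0.0) + 1.0 / len(paths)

for e, c in sorted(edge_betweenness.items(), key=lambda kv: -kv[1]):
    print(sorted(e), round(c, 2))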
Other, marginally different, definitions of betweenness centrality have been proposed by (46),
such as bounded-distance, distance-scaled, edge and group betweenness, and stress and load
centrality.
Although the appropriateness of the betweenness centrality in the representation of the “importance” of a node/edge inside the network is evident, its adoption is not always the unique
solution to a given problem. For example, as already put into evidence by (261), the first limit
of the concept of betweenness centrality is related to the fact that influence or information does
not propagate following only shortest paths. With regards to the influence propagation, it is
also evident that the more distant two nodes are, the less they influence each other, as stated
by (115). Additionally, in real applications (such as those described in Section 7.2.3) it is not
usually required to calculate the exact ranking with respect to the betweenness centrality of
each node/edge inside the network. In fact, it is often more useful to identify the top arbitrary
percentage of nodes/edges which are more relevant to the given specific problem (e.g., study of
propagation of information, identification of key actors, etc.).
7.2.2 Recent Approaches for Computing Betweenness Centrality
To date, several algorithms to compute the betweenness centrality (of nodes) in a graph
have been presented. The most efficient has been proposed by (45) and runs in O(nm) for
unweighted graphs, and in O(nm + n^2 log n) for weighted graphs, containing n nodes and m
edges.
The computational complexity of these approaches makes them unfeasible for large network
analysis. To this purpose, different approximate solutions have been proposed. Amongst others,
(51) developed a randomized algorithm (namely, “RA-Brandes”) and, similarly, by using adaptive techniques, (13) proposed another approximate version (called “AS-Bader”). In (211),
Newman devised a random-walk based algorithm to compute betweenness centrality which
shares similarities with our approach, starting from the concept of message propagation along
random paths. Building on the same concept, (6) proposed the κ-path centrality measure (for nodes)
and developed an O(κ^3 n^(2−2α) log n) algorithm (namely, “RA-κpath”) to compute it.
7.2.3 Application of Centrality Measures in Social Network Analysis
Applications of centrality information acquired from social networks have been investigated
by (260). The authors defined different methodologies to exploit discovered data, e.g., for
marketing purposes, recommendation and trust analysis.
Several marketing and commercial studies have been applied to Online Social Networks (OSNs),
in particular to discover efficient channels to distribute information (52, 267) and to study the
spread of influence (159). Potentially, our study could provide useful information to all these
applied research directions, identifying those interesting edges with high κ-path edge centrality,
which emphasizes their importance within the social network. Those nodes interconnected by
highly central edges are important because of the position they “topologically” occupy. Moreover,
they could efficiently carry information to their neighborhood.
7.3 Measuring Edge Centrality
7.3.1 Design Goals
Before providing a formal description of our algorithm, we illustrate the main ideas behind
it. We start from a real-life example and we use it to derive some “requirements” our algorithm
should satisfy.
Let us consider a network of devices. In this context, without loss of generality, we can assume
that the simplest “piece” of information is a message. In addition, each device has an address
book storing the devices with which it can exchange messages. A device can both receive and
transmit messages to other devices appearing in its address book.
The purpose of our algorithm is to rank links of the network on the basis of their aptitude of
favoring the diffusion of information. In detail, the higher the rank of a link, the higher its
ability of propagating a message. Henceforth, we refer to this problem as link ranking.
The link ranking problem in our scenario can be viewed as the problem of computing edge
centrality in social networks. We expect that some of the hypotheses/procedures adopted to
compute edge centrality can be applied to solve the link ranking problem. We suggest
extending these techniques in a number of ways. In detail, we argue that the algorithm to compute
the link ranking should satisfy the following requirements:
Requirement 1 - Simulation of Message Propagation by using Random Walks. As shown in
Section 7.2, some authors assume that information flows on a network along the shortest paths.
Such an intuition is formally captured by Equation (7.1). However, as observed in (114, 211),
centrality measures based on shortest paths can provide some counterintuitive results. In detail,
(114, 211) present some simple examples showing that the application of Equation (7.1) would
lead to assigning excessively low centrality scores to some nodes.
To this purpose, some authors (114) provided a more refined definition of centrality relying on
the concept of flow in a graph. To define this measure, assume that each edge in the network
can carry one or more messages; we are interested in finding those edges capable of transferring
the largest amount of messages between a source node s and a target node t. The centrality
of a vertex v can be computed by considering all the pairs hs, ti of nodes and, for each pair,
by computing the amount of flow passing through v. In the light of such a definition, in the
computation of node centrality also non-shortest paths are considered.
However, in (211), Newman shows that centrality measures based on the concept of flow are not
exempt from odd effects. To this purpose, the author suggested to consider a random walker
which is not forced to move along the shortest paths of a network to compute the centrality of
nodes.
Newman’s strategy has been designed to compute node centrality, whereas our approach
targets edge centrality. Despite this difference, we believe that the idea of using
random walks in place of shortest paths can be successful even when applied to the link ranking
problem.
In our scenario, if a device wants to propagate a message, it is generally not aware of the whole
network topology, and therefore it is not aware of the shortest paths to route the message. In
fact, each device is only aware of the devices appearing in its address book. As a consequence,
the device selects, according to its own criteria, one (or more) of its contacts and sends them
the message in the hope that they will further continue the propagation. In order to simulate
the message propagation, our first requirement is to exploit random walks.
Requirement 2 - Dynamic Update of Ranking. Ideally, if we simulated the propagation
of multiple messages on our network of devices, it could happen that an edge is selected more
frequently than others. Edges appearing more frequently than others show a better aptitude
to spread messages and, therefore, their rank should be higher. As a consequence,
our mechanism to rank edges should be dynamic: at the beginning, all the edges are equally
likely to propagate a message and, therefore, they have the same rank. At each step of the
simulation, if an edge is selected, it is awarded a “bonus score”.
Requirement 3 - Simple Paths. The procedure of simulating message propagation through random walks described above could imply that a message can pass through an edge more than
once. In such a case, the rank of edges which are traversed multiple times would be disproportionately inflated whereas the rank of edges rarely (or never) visited could be underestimated.
The global effect would be that the ranking produced by this approach would not be correct.
As a consequence, another requirement is that the paths exploited by our algorithm must be
simple.
Requirement 4 - Bounded Length Paths. As shown in (115), the more distant two nodes are,
the less they influence each other. The usage of paths of bounded length has been already
explored to compute node centrality (40, 96). A first relevant example is provided in (96); in
that paper the authors observe that methods to compute node centralities like those based on
eigenvectors can lead to counterintuitive results. In fact, those methods take the whole network
topology into account and, therefore, they compute the centrality of a node on a global scale.
It may happen that a node could have a big impact on a small scale (think of a well-respected
researcher working on a niche topic) but a limited visibility on a large scale. Therefore, the
approach of (96) suggested to compute node centralities in local networks and they considered
ego networks. An ego network is defined as a network consisting of a single node (ego) together
with the nodes it is connected to (the alters) and all the links among those alters. The diameter
of an ego network is 2 and, therefore, the computation of node centrality in a network requires
to compute paths up to a length 2. In (40) the authors extended these concepts by considering
paths up to a length k.
We agree with the observations above and assume that two nodes are considered to be distant
if the shortest path connecting them is longer than κ hops, being κ the established threshold.
Such a consideration identifies as effective only those paths whose length is at most κ. We
adopt this requirement and, in our simulation procedure, we consider paths of bounded length.
In the next sections we shall discuss how our algorithm is able to incorporate the requirements
illustrated above.
7.3.2 κ-Path Centrality
In this section we introduce the concepts of κ-path node centrality and κ-path edge centrality.
The notion of κ-path node centrality, introduced by (6), is defined as follows:
Definition 2. (κ-path node centrality) For each node v of a graph G = hV, Ei, the κ-path node
centrality C κ (v) of v is defined as the sum, over all possible source nodes s, of the frequency
with which a message originated from s goes through v, assuming that the message traversals
are only along random simple paths of at most κ edges.
It can be formalized, for an arbitrary node v ∈ V, as

C^κ(v) = Σ_{s∈V} σ_s^κ(v) / σ_s^κ    (7.3)

where s are all the possible source nodes, σ_s^κ(v) is the number of κ-paths originating from s
and passing through v, and σ_s^κ is the overall number of κ-paths originating from s.
Observe that Equation (7.3) resembles the definition of betweenness centrality provided in Equation (7.1). In fact, the structure of the two equations coincides if we replace the concept of
shortest paths (adopted in the betweenness centrality) with the concept of κ-paths which is the
core of our definition of κ-path centrality.
The possibility of extending the concept of “centrality” from nodes to edges has been already exploited by Girvan and Newman. In particular, they generalized the formulation of “betweenness
centrality” (referred to nodes), introducing the novel concept of “edge betweenness centrality”.
Similarly, we extend Definition 2 in order to define a novel edge centrality index, baptized κ-path
edge centrality.
Definition 3. (κ-path edge centrality) For each edge e of a graph G = hV, Ei, the κ-path edge
centrality Lκ (e) of e is defined as the sum, over all possible source nodes s, of the frequency
with which a message originated from s traverses e, assuming that the message traversals are
only along random simple paths of at most κ edges.
The κ-path edge centrality is formalized, for an arbitrary edge e, as follows

L^κ(e) = Σ_{s∈V} σ_s^κ(e) / σ_s^κ    (7.4)

where s are all the possible source nodes, σ_s^κ(e) is the number of κ-paths originating from s
and traversing the edge e and, finally, σ_s^κ is the number of κ-paths originating from s.
In practical cases, the application of Equation (7.4) may not be feasible because it requires
counting all the κ-paths originating from all the source nodes s, and such a number can be
exponential in the number of nodes of G. To this purpose, we need to design algorithms
capable of efficiently approximating the value of the κ-path edge centrality. These algorithms will
be introduced and discussed in the following.
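To make Definition 3 concrete, the following sketch computes L^κ(e) exactly on a tiny toy graph by brute-force enumeration of the simple paths of at most κ edges leaving each source node. The graph and the convention that every simple path of 1 to κ edges counts as a κ-path are our own illustrative assumptions; the exponential blow-up of this enumeration is precisely what motivates the approximate algorithms presented below.

# Brute-force evaluation of Equation (7.4) on a tiny toy graph.
from collections import defaultdict

graph = {  # adjacency list of a hypothetical toy graph
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"],
}
KAPPA = 3

def kpaths_from(s, kappa):
    """Yield every simple path (as a list of nodes) with 1..kappa edges from s."""
    stack = [[s]]
    while stack:
        path = stack.pop()
        if len(path) > 1:
            yield path
        if len(path) - 1 == kappa:       # already kappa edges long
            continue
        for v in graph[path[-1]]:
            if v not in path:            # keep the path simple
                stack.append(path + [v])

centrality = defaultdict(float)
for s in graph:
    paths = list(kpaths_from(s, KAPPA))
    per_edge = defaultdict(int)
    for path in paths:
        for u, v in zip(path, path[1:]):
            per_edge[frozenset((u, v))] += 1   # sigma_s^kappa(e)
    for e, count in per_edge.items():
        centrality[e] += count / len(paths)    # add sigma_s^kappa(e) / sigma_s^kappa

for e, value in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(sorted(e), round(value, 3))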
7.3.3 The Algorithm for Computing the κ-Path Edge Centrality
In this Section we discuss an algorithm, called Edge Random Walk κ-Path Centrality (or,
shortly, ERW-Kpath), to efficiently compute edge centrality values.
It consists of two main steps: (i) node and edge weight assignment and (ii) simulation of message propagations through random simple paths. In the ERW-KPath algorithm, the probabilities
of selecting a node or an edge are uniform; we also provide another version of the ERW-Kpath
algorithm (called WERW-Kpath - Weighted Edge Random Walk κ-Path Centrality) in which
the node/edge probabilities are not uniform.
It has been proved (81) that the ERW-KPath and the WERW-Kpath algorithms return, as
output, an approximate value of the edge centrality index as provided in Definition 3.
In the following we shall discuss the ERW-KPath algorithm by illustrating each of the two steps
composing it. After that, we will introduce the WERW-KPath algorithm as a generalization of
the ERW-KPath algorithm.
Step 1: node and edge weights assignment
In the first stage of our algorithm, we assign a weight to both nodes and edges of the graph
G = hV, Ei representing our social network. Weights on nodes are used to select the source
nodes from which each message propagation simulation starts. Weights on edges represent
initial values of edge centrality and, to comply with Requirement 2, they will be updated
during the execution of our algorithm.
122
7.3 Measuring Edge Centrality
To compute weight on nodes, we introduce the normalized degree δ(v_n) of a node v_n ∈ V as
follows:

Definition 4. (Normalized degree) Given an undirected graph G = ⟨V, E⟩ and a node v_n ∈ V,
its normalized degree δ(v_n) is

δ(v_n) = |I(v_n)| / |V|    (7.5)

where I(v_n) represents the set of edges incident on v_n.
The normalized degree δ(v_n) relates the degree of v_n to the total number of nodes in the
network. Intuitively, it represents how much a node contributes to the overall connectivity of
the graph. Its value belongs to the interval [0, 1] and the higher δ(v_n), the better v_n is connected
in the graph.
Regarding edge weights, we introduce the following definition:
Definition 5. (Initial edge weight) Given an undirected graph G = ⟨V, E⟩ and an edge e_m ∈ E,
its initial edge weight ω_0(e_m) is

ω_0(e_m) = 1 / |E|    (7.6)
Intuitively, the meaning of Equation (7.6) is as follows: we initially manage a “budget” consisting of |E| points; these points are equally divided among all the possible edges; the amount
of points received by an edge represents its initial rank.
In Figure 7.1 we report an example of graph G along with the distribution of weights on nodes
and edges.
Figure 7.1: Example of assignment of normalized degrees and initial edge weights.
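The two quantities just defined are straightforward to compute; the minimal sketch below does so on a small hypothetical graph (different from the one of Figure 7.1).

# Minimal sketch of Definitions 4 and 5 on a toy undirected graph.

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
nodes = sorted({u for e in edges for u in e})

incident = {v: [e for e in edges if v in e] for v in nodes}

normalized_degree = {v: len(incident[v]) / len(nodes) for v in nodes}  # delta(v) = |I(v)| / |V|
initial_weight = {e: 1.0 / len(edges) for e in edges}                  # omega_0(e) = 1 / |E|

print(normalized_degree)
print(initial_weight)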
Step 2: Simulation of message propagations through random simple κ-paths
In the second step we simulate multiple random walks on the graph G; this is consistent with
Requirement 1.
To this purpose, our algorithm iterates the following sub-steps a number of times equal to a
value ρ, being ρ a fixed value. We will later provide a practical rule for tuning ρ. At each
iteration, our algorithm performs the following operations:
123
7. A NOVEL CENTRALITY MEASURE FOR SOCIAL NETWORKS
1. A node v_n ∈ V is selected according to one of the following two possible strategies:

   a. uniformly at random, with a probability

      P(v_n) = 1 / |V|    (7.7)

   b. with a probability proportional to its normalized degree δ(v_n), given by

      P(v_n) = δ(v_n) / Σ_{v_k∈V} δ(v_k)    (7.8)
2. All the edges in G are marked as not traversed.
3. The procedure MessagePropagation is invoked. It generates a simple random walk whose
length is not greater than κ, satisfying Requirement 3.
Let us describe the procedure MessagePropagation. This procedure carries out a loop as long
as both the following conditions hold true:
• The length of the path currently generated is no greater than κ. This is managed through
a length counter N .
• Assuming that the walk has reached the node v_n, there must exist at least an incident
edge on v_n which has not been already traversed. To do so, we attach a flag T(e_m) to
each edge e_m ∈ E, such that T(e_m) = 1 if e_m has already been traversed, and T(e_m) = 0
otherwise. We observe that the following condition must be true

  |I(v_n)| > Σ_{e_k∈I(v_n)} T(e_k)    (7.9)

being I(v_n) the set of edges incident onto v_n.
The former condition complies with Requirement 4 (i.e., it allows us to consider only paths
up to length κ). The latter condition, instead, prevents the message from passing more than once
through an edge, thus satisfying Requirement 3.
If the conditions above are satisfied, the MessagePropagation procedure selects an edge e_m by
applying one of two strategies:

a. uniformly at random, with a probability

   P(e_m) = 1 / (|I(v_n)| − Σ_{e_k∈I(v_n)} T(e_k))    (7.10)

   among all the edges e_m ∈ {I(v_n) | T(e_m) = 0} incident on v_n (i.e., excluding already
   traversed edges);

b. with a probability proportional to the edge weight ω_l(e_m), given by

   P(e_m) = ω_l(e_m) / Σ_{e_m∈Î(v_n)} ω_l(e_m)    (7.11)

   being Î(v_n) = {e_k ∈ I(v_n) | T(e_k) = 0} and ω_l(e_m) = ω_{l−1}(e_m) + β · T(e_m) if 1 ≤ l ≤ κρ.
Let e_m be the selected edge and let v_{n+1} be the node reached from v_n by means of e_m. The
MessagePropagation procedure awards a bonus β to e_m, sets T(e_m) = 1 and increases the
counter N by 1. The message propagation activity continues from v_{n+1}.
At the end, each edge e ∈ E is assigned a centrality index L^κ(e) equal to its final weight ω_{κρ}(e).
The values of β and ρ, in principle, can be fixed in an arbitrary fashion, but we provide a simple
practical rule to tune them. From the experimentation, it emerges that in ERW-KPath it
is convenient to set ρ ≈ |E|. In particular, if we set ρ = |E| − 1 and β = 1/|E| we get a nice
result: the edge centrality indexes always range in [1/|E|, 1] and, ideally, the centrality index of
a given edge will be equal to 1 if (and only if) it is always selected in any message propagation
simulation. In fact, each edge initially receives a default score equal to 1/|E| and, if that edge is
selected in a subsequent trial, it increases its score by a factor β = 1/|E|. Intuitively, if an
edge is selected in all the trials, its final score will be equal to

1/|E| + ρ · 1/|E| = 1/|E| + (|E| − 1)/|E| = 1.
The time complexity of this algorithm is O(κρ). If we fix ρ = |E| − 1, we achieve a good
trade-off between accuracy and computational cost. In fact, in such a case, the worst case
time complexity of the ERW-KPath algorithm is O(κ|E|) and, since in real social networks |E|
is of the same order of magnitude as |V|, the time complexity of our approach is near linear
in the number of nodes. This makes our approach computationally feasible also for large
Online Social Networks.
The version of the algorithm shown in Algorithms 3 and 4 adopts uniform probability distribution functions in order to choose nodes and edges purely at random and, as said before, it is
called ERW-KPath.
A weighted version of the same algorithm, called WERW-KPath, would differ only in line 5
(Algorithm 3) and 2 (Algorithm 4), adopting weighted functions specified in Equations (7.8)
and (7.11). During our experimentation we always adopted the WERW-Kpath algorithm.
Algorithm 3 ERW-Kpath(Graph G = ⟨V, E⟩, int κ, int ρ, float β)
1: Assign each node v_n ∈ V its normalized degree
2: Assign each edge e_m ∈ E the uniform probability function as weight
3: for i = 1 to ρ do
4:    N ← 0, a counter to check the length of the κ-path
5:    v_n ← a node chosen uniformly at random in V
6:    MessagePropagation(v_n, N, κ, β)
7: end for
Algorithm 4 MessagePropagation(Node v_n, int N, int κ, float β)
1: while [N < κ and |I(v_n)| > Σ_{e∈I(v_n)} T(e)] do
2:    e_m ← an edge in {I(v_n) | T(e_m) = 0}, chosen uniformly at random
3:    Let v_{n+1} be the node reached by v_n through e_m
4:    ω(e_m) ← ω(e_m) + β
5:    T(e_m) ← 1
6:    v_n ← v_{n+1}
7:    N ← N + 1
8: end while
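As a complement to the pseudocode above, the following self-contained Python sketch implements the weighted variant (WERW-Kpath), i.e., it selects source nodes according to Equation (7.8) and edges according to Equation (7.11). Function and variable names are ours, and the toy graph at the bottom is purely illustrative; the released Java implementation mentioned in Section 7.4.2 remains the reference code.

# A compact Python sketch of the WERW-Kpath algorithm described above
# (Algorithms 3 and 4, with the weighted selection rules of Eqs. (7.8) and (7.11)).

import random
from collections import defaultdict

def werw_kpath(edges, kappa=20, rho=None, seed=None):
    """Return an approximate kappa-path edge centrality for each undirected edge."""
    rng = random.Random(seed)
    edges = [frozenset(e) for e in edges]
    incident = defaultdict(list)
    for e in edges:
        for v in e:
            incident[v].append(e)
    nodes = list(incident)

    rho = len(edges) - 1 if rho is None else rho     # practical rule: rho = |E| - 1
    beta = 1.0 / len(edges)                          # bonus awarded at each traversal
    weight = {e: 1.0 / len(edges) for e in edges}    # omega_0(e) = 1 / |E|
    node_weights = [len(incident[v]) for v in nodes] # proportional to delta(v)

    for _ in range(rho):
        v = rng.choices(nodes, weights=node_weights, k=1)[0]       # Eq. (7.8)
        traversed = set()
        for _ in range(kappa):                                     # at most kappa edges
            candidates = [e for e in incident[v] if e not in traversed]
            if not candidates:
                break
            e = rng.choices(candidates,                            # Eq. (7.11)
                            weights=[weight[c] for c in candidates], k=1)[0]
            weight[e] += beta                                      # award the bonus
            traversed.add(e)
            v, = e - {v}                                           # move to the other endpoint
    return weight

# Toy usage on a small hypothetical graph.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e"), ("c", "e")]
centrality = werw_kpath(toy_edges, kappa=5, seed=42)
for e, w in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(sorted(e), round(w, 3))

Running the sketch prints the edges of the toy graph ranked by their final weight, which approximates the κ-path edge centrality of Definition 3.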
7.3.4 Novelties Introduced by our Approach
In this Section we discuss the main novelties introduced by our ERW-Kpath and WERW-Kpath
algorithms.
First of all, we observe that our approach is flexible in the sense that it can be easily modified to
incorporate new models capable of describing the spread of a message in a network. For instance,
we can define multiple strategies to select the source node from which each message propagation
simulation starts. In particular, in this Chapter we considered two options, namely: (i) the
probability of selecting a node s as the source is uniform across all the nodes in the network (and
this is at the basis of the ERW-Kpath algorithm) or (ii) the probability of selecting a node s as
the source is proportional to the degree of s (and this is at the basis of the WERW-Kpath). It
would be easy to select a different probability distribution, if necessary. In an analogous fashion,
in the ERW-Kpath and WERW-Kpath algorithms we defined two strategies to select the node
receiving a message; of course, other, and more complex, strategies could be implemented in
order to replace those described in this Chapter.
In addition, observe that the ERW-Kpath and WERW-Kpath algorithms provide a unicast
propagation model in which any sender node is in charge of selecting exactly one receiving node.
We could easily modify our algorithms in such a way as to support a multicast propagation
model in which a node could issue a message to multiple receivers.
A further novelty is that we use multiple random walks to simulate the propagation of messages
and assume that the frequency of selecting an edge e in these walks is a measure of its centrality.
An approach similar to ours was presented in (111), but it assumes that messages propagate
along shortest paths. In detail, given a pair of nodes i and j, the approach of (111) introduces
a parameter, called network efficiency εij as the inverse of the length of the shortest path(s)
connecting i and j. After that, it provides a new parameter, called information centrality; the
information centrality ICe of an edge e is defined as the relative drop in the network efficiency
generated by the removal of e from the network. Our approach provides some novelties in
comparison with that of (111): in fact, in our approach a network is viewed as a decentralized
system in which there is no user having a complete knowledge of the network topology. Due to
this incomplete knowledge, users are not able to identify shortest paths and, therefore, they use
a probabilistic model to spread messages. This yields also relevant computational consequences:
the identification of all the pairs of shortest paths in a network is computationally expensive
and it could be unfeasible on networks containing millions of nodes. By contrast, our approach
scales almost linearly with the number of edges and, therefore, it can easily run also over large
networks.
126
7.3 Measuring Edge Centrality
Finally, although our approach relies on the concept of message propagation, which requires an
orientation on edges, it can work also on undirected networks. In fact, the ERW-Kpath (resp.,
WERW-Kpath) algorithm selects at the beginning a source node s that decides the node v to
which a message has to be forwarded. Therefore, at run-time, the ERW-Kpath (resp., WERW-Kpath) algorithm induces an orientation on the edge linking s and v which coincides with the
direction of the message sent by s; such a process does not require operating on directed
networks, even though the approach would intrinsically work well with this type of network.
7.3.5 Comparison of the ERW-Kpath and WERW-Kpath algorithms
In this Section we provide a comparison between ERW-Kpath and WERW-Kpath. First of all,
we would like to observe that both algorithms are capable of correctly approximating
the κ-path centrality values provided in Definition 3.
Although the two algorithms are formally correct, we observe that the WERW-Kpath
algorithm should be preferred to ERW-Kpath. In fact, in the ERW-Kpath algorithm, we assume
that each node can select, at random, any edge (among those that have not yet been selected)
to propagate a message. Such an assumption could be, however, too strong in real-life social
networks. To better clarify this concept, consider Online Social Networks like Facebook or
Twitter. In both of these networks a single user may have a large number of contacts with
whom she/he can exchange information (e.g., a wall post on Facebook or a tweet on Twitter).
However, sociological studies reveal that there is an upper limit to the number of people with
whom a user could maintain stable social relationships and this number is known as Dunbar
number (92). For instance, in Facebook, we found that the average number of friends of a user is
more than 300. On the other hand, it has been reported that male users actively communicate
with only 10 of them, whereas female users with 16¹. This implies that there are preferential
edges along which information flows in social networks.
The ERW-Kpath algorithm is simple and easy to implement but it could fail to identify preferential edges along which messages propagate. By contrast, in the WERW-Kpath algorithm,
the probability of selecting an edge is proportional to the weight already acquired by that edge.
This weight, therefore, has to be interpreted as the frequency with which two nodes exchanged
messages in the past.
Such a property has also a relevant implication and makes feasible some applications which
could not be implemented by the ERW-Kpath algorithm. In fact, our approach, to some extent
can be exploited to recommend/predict links in a social network. The problem of recommending/predicting links plays a key role in Computer Science and Sociology and it is often known in
the literature as the link prediction problem (189). In the link prediction problem, the network
topology is analyzed to find pairs of non-connected nodes which could get a profit by creating a
social link. Various measures can be exploited to assess whether a link should be recommended
between a pair of nodes u and v; for instance, the simplest measure is to compute the Jaccard
coefficient J(u, v) on the neighbors of u and v. The larger the number of neighboring nodes
shared by u and v, the larger J(u, v); in such a case it is convenient to add an edge in the network linking u and v. Further (and more complex) measures take the whole network topology
into account to recommend links. For instance, the Katz coefficient (189) considers the whole
ensemble of paths running between u and v to decide whether a link between them should be
recommended.
1 http://www.economist.com/node/13176775?story_id=13176775
N.  Network      No. Nodes   No. Edges   Directed  Type        Ref
1   Wiki-Vote        7,115     103,689   Yes       Elections   (183)
2   CA-HepPh        12,008     237,010   No        Co-authors  (183)
3   CA-CondMat      23,133     186,932   No        Co-authors  (183)
4   Cit-HepTh       27,770     352,807   Yes       Citations   (183)
5   Facebook        63,731   1,545,684   Yes       Online SN   (270)
6   Youtube      1,138,499   4,945,382   No        Online SN   (270)

Table 7.1: Datasets adopted in our experimentation.
The WERW-Kpath algorithm can be exploited to address the link prediction problem. In
detail, by means of WERW-Kpath, we can handle not only topological information but we can
also quantify the strength of the relationship joining two nodes. So, we know that two nodes u
and v are connected and, in addition, we know also how frequently they exchange information.
This allows us to extend the measure introduced above: for instance, if we would like to use the
Jaccard coefficient, we can consider only those edges (called strong edges) coming out from u
(resp., v) such that the weight of these edges is greater than a given threshold. This is equivalent
to filtering out all the edges which are rarely employed to spread information. As a consequence,
the Jaccard coefficient could be computed only on strong edges.
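The idea can be sketched as follows; the edge weights, the threshold and all identifiers are hypothetical and only illustrate how the neighborhoods of u and v could be restricted to strong edges before computing the Jaccard coefficient.

# Minimal sketch: Jaccard coefficient restricted to "strong" edges
# (toy weights and an arbitrary threshold, for illustration only).

edge_weight = {          # hypothetical kappa-path edge centralities
    frozenset({"u", "a"}): 0.30, frozenset({"u", "b"}): 0.02,
    frozenset({"v", "a"}): 0.25, frozenset({"v", "c"}): 0.28,
    frozenset({"u", "c"}): 0.01,
}
THRESHOLD = 0.05

def strong_neighbors(x):
    """Endpoints reachable from x through edges heavier than THRESHOLD."""
    return {next(iter(e - {x}))
            for e, w in edge_weight.items() if x in e and w > THRESHOLD}

def strong_jaccard(u, v):
    nu, nv = strong_neighbors(u), strong_neighbors(v)
    return len(nu & nv) / len(nu | nv) if nu | nv else 0.0

print(strong_jaccard("u", "v"))   # the only shared strong neighbour is a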
Due to these reasons, in the following experiments we focused only on the WERW-Kpath
algorithm.
7.4 Experimentation
Our experimentation has been conducted on different Online Social Networks whose datasets
are available. Adopted datasets have been summarized in Table 7.1.
Dataset 1 depicts the voting system of Wikipedia for the elections of January 2008. Datasets 2
and 3 represent the Arxiv1 archives of papers in the fields of, respectively, High Energy Physics
(Phenomenology) and Condensed Matter Physics, as of April 2003.
Dataset 4 represents a network of scientific citations among papers belonging to the Arxiv High
Energy Physics (Theory) field. Dataset 5 describes a small sample of the Facebook network,
representing its friendship graph. Finally, Dataset 6 depicts a fragment of the YouTube social
graph as of 2007.
7.4.1 Robustness
A quality required for a good random-walk based algorithm is the robustness of results. In fact,
it is important that obtained results are consistent among different iterations of the algorithm,
if initial conditions are the same. In order to verify that our WERW-Kpath produces reliable
results, we performed a quantitative and a qualitative analysis as follows.
In the quantitative analysis we are interested in checking whether the algorithm produces the
same results in different runs. In the qualitative analysis, instead, we study whether different
values of κ significantly impact the ranking of edges.
1 Arxiv (http://arxiv.org/) is an online archive for scientific preprints in the fields of Mathematics, Physics
and Computer Science, amongst others.
Quantitative analysis of results
Our first experiment aims to verify that, over different iterations with the same
configuration, results are consistent. It is possible to highlight this aspect by running the
WERW-Kpath algorithm several times on the same dataset, with the same configuration.
Regarding ρ, in the experimentation we adopt ρ = |E| − 1. According to this choice,
the bonus awarded is fixed to β = 1/|E|. As for the maximum length of the κ-paths, we chose a
value of κ = 20.
Our quantitative analysis highlights that the distributions of values are almost completely
overlapping, over different runs on each dataset among those considered in Table 7.1.
In Figure 7.2 we graphically report the distribution of edge centrality values for the “Wiki-Vote”
dataset. Results are from four different runs of the algorithm on the same dataset with the
same configuration. Data are plotted using a semi-logarithmic scale in order to highlight the
“high” part of the distribution, where edges with high κ-path edge centrality lie.
Similar results are confirmed performing the same test over each considered dataset but they are
not reported due to space limitations. The robustness property is necessary but not sufficient
to ensure the correctness of our algorithm.
In fact, the quantitative evaluation we performed ensures that centrality values produced by
WERW-Kpath are consistent over different runs of the algorithm, but it does not ensure that, for
example, a given edge e ∈ E has, after Run 1, a centrality value which is the same as (or, at
least, very similar to) its value after Run 2. In other words, the centrality values that overlap in
different distributions may not refer to the same edges.
To investigate this aspect, we analyze results from a qualitative perspective, as follows.
Qualitative analysis of results
Our random-walk-based approach ensures minimum fluctuations of centrality values assigned
to each edge along different runs, if the configuration of each run is the same.
To verify this aspect, we calculate the similarity of the distributions obtained by running
WERW-Kpath four times on each dataset, using the same configuration, comparing results
by adopting different measures. For this experiment, we considered different settings for the
length of the exploited κ-paths, i.e., κ = 5, 10, 20, in order to investigate also its impact.
The first measure considered is a variant of the Jaccard coefficient, classically defined as

J(X, Y) = |X ∩ Y| / |X ∪ Y|    (7.12)
where X and Y represent, in our case, a pair of compared distributions of κ-path edge centrality
values.
Figure 7.2: Robustness test on Wiki-Vote.

In order to define the Jaccard coefficient in our context we need to take into account the
following considerations. Let us consider two runs of our algorithm, say X and Y, and let us
first consider an edge e; let us denote with ω_X(e) (resp., ω_Y(e)) the centrality index of e in the
run X (resp., Y); intuitively, the performance of our algorithm is “good” if ω_X(e) is close to
ω_Y(e); however, a direct comparison of the two values could make no sense because, for instance,
the edge e could have the highest weight in both runs but ω_X(e) may significantly differ
from ω_Y(e). Therefore, we need to consider the normalized values ω_X(e)/max_{e∈X} ω(e) and
ω_Y(e)/max_{e∈Y} ω(e), and we assume that the algorithm yields good results if these values are
“close”. To make this definition more rigorous we can define Λ(e) = ω_X(e)/max_{e∈X} ω(e) −
ω_Y(e)/max_{e∈Y} ω(e) and we say that the algorithm produces good results if Λ(e) is smaller
than a threshold ε.
Now, in order to fix the value of ε, let us consider the values achieved by Λ(e) for each e ∈ E.
We can provide an upper bound Λ on Λ(e) by considering two extremal cases: (i) ω_X(e) =
max_{e∈X} ω(e) and ω_Y(e) = min_{e∈Y} ω(e) or, vice versa, (ii) ω_X(e) = min_{e∈X} ω(e) and
ω_Y(e) = max_{e∈Y} ω(e). For the sake of simplicity, assume that case (i) occurs; of course, the
following considerations hold true also in case (ii). In such a case we obtain
Λ = 1 − min_{e∈Y} ω(e) / max_{e∈Y} ω(e). As discussed in the following (see Figures 7.4–7.9 and
7.10–7.15), edge centralities are distributed according to a power law and, therefore, the value of
min_{e∈Y} ω(e) is some orders of magnitude smaller than max_{e∈Y} ω(e). Therefore, the ratio of
min_{e∈Y} ω(e) to max_{e∈Y} ω(e) tends to 0 and Λ tends to 1.
According to these considerations, we computed how many times the following condition holds
true: Λ(e) ≤ τΛ, being 0 < τ ≤ 1 a tolerance threshold. Since Λ ≈ 1, this amounts to counting
how many times Λ(e) ≤ τ. Therefore, we can define the modified Jaccard coefficient as follows

J^τ(X, Y) = |{e : |ω_X(e)/max_{e∈X} ω(e) − ω_Y(e)/max_{e∈Y} ω(e)| ≤ τ}| / |X ∪ Y|    (7.13)
In our tests we considered the following values of tolerance τ = 0.01, 0.05, 0.10 to identify 1%,
5% and 10% of maximum accepted variation of the edge centrality value assigned to a given
edge along different runs with same configurations.
A mean degree of similarity avg(J^τ) is taken by averaging over the (4 choose 2) = 6 possible combinations
of pairs of distributions obtained by analyzing the four runs over the datasets discussed above.
The second measure we consider is the Pearson correlation. It is adopted to evaluate the
correlation of the two distributions obtained. It is defined as
ρ_{X,Y} = cov(X, Y) / √(var(X) · var(Y))    (7.14)
whose results are normalized in the interval [−1, +1], with the following interpretations:
• ρX,Y > 0: distributions are directly correlated, such that:
– ρX,Y > 0.7: strongly correlated;
– 0.3 < ρX,Y < 0.7: moderately correlated;
– 0 < ρX,Y < 0.3: weakly correlated;
• ρX,Y = 0: not correlated;
• ρX,Y < 0: inversely correlated.
Clearly, the higher ρX,Y , the better the WERW-KPath algorithm works. Observe that the ρX,Y
coefficient tells us whether the two distributions X and Y are deterministically related or not.
Therefore, it could happen that the WERW-KPath algorithm, in two different runs generates
two edge centrality distributions X and Y such that Y = aX, being a a real coefficient. In such
a case, the ρX,Y coefficient would be 1 but we could not conclude that the algorithm works
properly. In fact, the coefficient a could be very low (or in the opposite case very large) and,
therefore, the two distributions would significantly differ even if they would preserve the same
edge rankings.
To this purpose, we consider a third measure in order to compute the distance between the two
distributions X and Y. To do so, we adopt the Euclidean distance L2(X, Y), defined as

L2(X, Y) = √( Σ_{i=1}^{n} (X_i − Y_i)² )    (7.15)
As it emerges from the distributions shown in Figure 7.2, almost all the terms in Equation (7.15)
are negligible and, therefore, the final value of L2(X, Y) is dominated by the difference of
the κ-path centrality values associated with the few top-ranked edges. To obtain the average
distance between two points in distributions X and Y on a given dataset, we simply divide
L2(X, Y) by the number of edges in that dataset.
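For completeness, the sketch below shows how the three comparison measures could be computed on two centrality distributions produced by different runs; the values of X and Y are invented toy numbers, and the edge universe is assumed to be the same in both runs (so |X ∪ Y| is simply the number of edges).

# Sketch of the three comparison measures used in the robustness analysis,
# applied to two hypothetical centrality distributions X and Y indexed by edge.
from math import sqrt

X = {"e1": 0.90, "e2": 0.40, "e3": 0.10, "e4": 0.05}
Y = {"e1": 0.85, "e2": 0.42, "e3": 0.12, "e4": 0.20}
edges = sorted(X)   # same edge set in both runs

def modified_jaccard(X, Y, tau):
    """Eq. (7.13): fraction of edges whose normalized scores differ by at most tau."""
    mx, my = max(X.values()), max(Y.values())
    close = sum(1 for e in edges if abs(X[e] / mx - Y[e] / my) <= tau)
    return close / len(edges)

def pearson(X, Y):
    """Eq. (7.14): Pearson correlation of the two distributions."""
    xs, ys = [X[e] for e in edges], [Y[e] for e in edges]
    n = len(edges)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / sqrt(vx * vy)

def euclidean(X, Y):
    """Eq. (7.15): Euclidean distance between the two distributions."""
    return sqrt(sum((X[e] - Y[e]) ** 2 for e in edges))

print(modified_jaccard(X, Y, 0.05), pearson(X, Y), euclidean(X, Y))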
Intrinsic characteristics of analyzed datasets do not influence the robustness of results. In fact,
even if considering datasets representing different social networks (e.g., collaboration networks,
citation networks and online communities), WERW-Kpath produces highly overlapping results
over different runs.
Dataset      κ    avg J^τ (τ=0.01)  avg J^τ (τ=0.05)  avg J^τ (τ=0.10)  ρX,Y   L2(X,Y)      avg(L2(X,Y))
Wiki-Vote    5    43.52%            98.49%            99.91%            0.67   1.61·10^−2   1.55·10^−7
Wiki-Vote    10   61.13%            98.86%            99.98%            0.69   2.37·10^−2   2.28·10^−7
Wiki-Vote    20   70.68%            99.96%            99.98%            0.70   3.48·10^−2   3.35·10^−7
CA-HepPh     5    52.63%            96.11%            99.53%            0.92   1.18·10^−2   4.97·10^−8
CA-HepPh     10   70.45%            99.02%            99.88%            0.95   1.23·10^−2   5.18·10^−8
CA-HepPh     20   75.65%            99.51%            99.87%            0.96   2.90·10^−2   1.22·10^−7
CA-CondMat   5    22.23%            80.51%            96.98%            0.73   1.39·10^−2   7.43·10^−8
CA-CondMat   10   35.16%            93.72%            99.40%            0.79   2.18·10^−2   1.16·10^−7
CA-CondMat   20   35.63%            95.80%            99.44%            0.83   3.40·10^−2   1.81·10^−7
Cit-HepTh    5    47.62%            97.76%            99.78%            0.78   0.92·10^−2   2.60·10^−8
Cit-HepTh    10   60.61%            99.45%            99.93%            0.83   1.36·10^−2   3.85·10^−8
Cit-HepTh    20   63.68%            99.62%            99.93%            0.85   2.04·10^−2   5.78·10^−8
Facebook     5    56.98%            97.34%            99.36%            0.79   1.01·10^−2   5.11·10^−9
Facebook     10   56.85%            98.49%            99.76%            0.84   1.87·10^−2   1.20·10^−8
Facebook     20   68.58%            99.39%            99.90%            0.84   2.67·10^−2   1.72·10^−8
Youtube      5    11.74%            44.28%            72.41%            0.49   1.31·10^−3   2.64·10^−10
Youtube      10   13.18%            59.40%            84.91%            0.75   1.87·10^−3   3.78·10^−10
Youtube      20   27.92%            82.29%            96.17%            0.89   2.83·10^−3   5.72·10^−10

Table 7.2: Analysis by using the similarity coefficient J^τ, the correlation ρX,Y and the Euclidean distance L2(X, Y).
Even adopting a low tolerance, such as τ = 0.01 or τ = 0.05, the values of κ-path edge centrality
are highly overlapping. Results improve with the length of the κ-paths adopted. By
increasing the tolerance and/or the length of the κ-paths, the overlap becomes nearly complete. The same
considerations hold true with respect to the Pearson correlation coefficient, which identifies
strong correlations among all the different distributions.
Finally, as for the Euclidean distance, we observe that the returned values are always small:
in every case the distance is on the order of 10^−3–10^−2 and the average distance is on the order
of 10^−10–10^−7.
7.4.2 Performance
All the experiments have been carried out by using a standard personal computer equipped with
an Intel i5 processor and 4 GB of RAM. The implementation of the WERW-Kpath algorithm
adopted in the following experiments, developed using Java 1.6, has been released1 and its
adoption is strongly encouraged.
As shown in Figure 7.3, the execution time of WERW-Kpath scales very well (i.e., almost linearly)
with the chosen length of the κ-paths and with the number of edges
in the given network.
This means that this approach is feasible also for the analysis of large networks, making it
possible to compute an efficient centrality measure for edges in all those cases in which it
would be very difficult, or even unfeasible because of the computational cost, to calculate the exact
edge betweenness.
The importance of this aspect is evident if we consider that there exist several Social Network
1 http://www.emilio.ferrara.name/werw-kpath/
Analysis tools that implement different algorithms to compute centrality indices on network
nodes/edges. Our novel measure could be integrated in such tools (e.g., NodeXL1, Pajek2,
NWB3, and so on), in order to allow social network analysts to manage (possibly even larger)
social networks and to study the centrality of edges.
Figure 7.3: Execution time with respect to network size.
7.4.3 Analysis of Edge Centrality Distributions
In this Section we study the distribution of edge centrality values computed by the WERW-Kpath algorithm. In detail, we present the results of two experiments.
In the first experiment we ran our algorithm four times. In addition, we varied the value of
κ = 5, 10, 20. We averaged the κ-path centrality values at each iteration and we plotted the
edge centrality distribution; on the horizontal axis we reported the identifier of each edge. The
results are reported in Figures 7.4–7.9 by exploiting a logarithmic scale. Each figure has the
following interpretation: on the x-axis it represents each edge of the given network, on the
y-axis its corresponding value of κ-path edge centrality.
The usage of a logarithmic scale highlights a power law distribution for the centrality values.
In fact, when the behavior on a log-log scale resembles a straight line, the distribution can
be well approximated by a power law function f(x) ∝ x^−α. As a result, for all the
considered datasets, there are few edges with high centrality values whereas a large fraction of
edges presents low (or very low) centrality values. Such a result can be explained by recalling
that, at the beginning, our algorithm considers all the edges on an equal footing and provides
them with an initial score which is the same for all the edges. However, during the algorithm
execution, it happens that few edges (which are actually the most central edges in a social
network) are frequently selected and, therefore, their centrality index is frequently updated.
By contrast, many edges are seldom selected and, therefore, their centrality index is rarely
increased. This process yields a power law distribution in edge centrality values.
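As a purely illustrative check of this behavior, the following sketch (in Python, not part of the released Java implementation; the lower cut-off x_min and the synthetic sample are assumptions made only for the example) estimates the exponent α of a power law f(x) ∝ x^−α from a list of centrality scores via the standard maximum-likelihood estimator.

import numpy as np

def powerlaw_alpha(scores, x_min):
    # Continuous maximum-likelihood estimate of the power-law exponent alpha,
    # computed over the scores not smaller than the chosen cut-off x_min.
    x = np.asarray([s for s in scores if s >= x_min], dtype=float)
    return 1.0 + len(x) / np.sum(np.log(x / x_min))

# Synthetic heavy-tailed sample standing in for real centrality scores:
rng = np.random.default_rng(42)
fake_scores = (1.0 - rng.random(10_000)) ** (-1.0 / 1.5)
print(powerlaw_alpha(fake_scores, x_min=1.0))  # expected to be close to 2.5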
1 http://nodexl.codeplex.com/
2 http://pajek.imfm.si/doku.php?id=pajek
3 http://nwb.cns.iu.edu/
In the second experiment, we studied how the value of κ impacts edge centrality. In detail, we considered the datasets separately and repeated the experiments described above. Also for this experiment we considered three different values of κ, namely κ = 5, 10, 20. The corresponding results are plotted in Figures 7.10–7.15, where the probability P of finding an edge in the network with a given value of centrality is plotted as a function of the κ-path centrality. Each plot adopts a log-log scale.
The analysis of each figure highlights three relevant facts:
• The probability of finding edges in the network with the lowest κ-path edge centrality values is smaller than that of finding edges with relatively higher centrality values. This means that most of the edges are exploited for the message propagation by the random walks a number of times greater than zero.
• The power law distribution in edge centrality emerges even more clearly for different values of κ and in the presence of different datasets. In other words, if we use different values of κ, the centrality indexes may change (see below); however, as emerges from Figures 7.4–7.9, for each considered dataset, the curves representing the κ-path centrality values are straight and parallel lines, with the exception of the final part. This implies that, for a fixed value of κ, say κ = 5, an edge e will have a particular centrality score. If κ passes from 5 to 10 and, then, from 10 to 20, the centrality of e will be increased by a constant factor. This implies that the ordering of the edges remains unchanged and, therefore, the edge having the highest centrality at κ = 5 will continue to be the most central edge also when κ = 10 and κ = 20. This highlights a nice feature of WERW-Kpath: potential uncertainties in the tuning of the parameter κ do not have a devastating impact on the process of identifying the highest ranked edges.
• The higher κ, the higher the value of the centrality indexes. This has an intuitive explanation: if κ increases, our algorithm manages longer paths to compute centrality values. Therefore, the chance that an edge is selected multiple times increases too. Each time an edge is selected, our algorithm awards it a bonus score (equal to β). As a consequence, the larger κ, the higher the number of times an edge with high centrality will be selected and, ultimately, the higher its final centrality index.
Such a consideration provides a practical criterion for tuning κ. In fact, if we select high values of κ, we are able to better discriminate edges with high centrality from edges with low centrality. By contrast, with low values of κ, edge centrality indexes tend to flatten within a small interval and it is harder to distinguish high centrality edges from low centrality ones.
On the one hand, therefore, it would be desirable to set κ as high as possible. On the other, since the complexity of our algorithm is O(κm), large values of κ negatively impact the performance of our algorithm. A good trade-off (supported by the experiments shown in this Section) is to fix κ = 20.
7.5 Applications of our approach
In this Section we detail some possible applications of our approach for ranking edges in social networks in the area of Knowledge-Based Systems (hereafter, KBS).
Figure 7.4: κ-paths centrality values distribution on Wiki-Vote.
Figure 7.5: κ-paths centrality values distribution on CA-HepPh.
Figure 7.6: κ-paths centrality values distribution on CA-CondMat.
Figure 7.7: κ-paths centrality values distribution on Cit-HepTh.
Figure 7.8: κ-paths centrality values distribution on Facebook.
Figure 7.9: κ-paths centrality values distribution on Youtube.
Figure 7.10: Effect of different κ = 5, 10, 20 on Wiki-Vote.
Figure 7.11: Effect of different κ = 5, 10, 20 on CA-HepPh.
Figure 7.12: Effect of different κ = 5, 10, 20 on CA-CondMat.
Figure 7.13: Effect of different κ = 5, 10, 20 on Cit-HepTh.
Figure 7.14: Effect of different κ = 5, 10, 20 on Facebook.
Figure 7.15: Effect of different κ = 5, 10, 20 on Youtube.
In detail, we shall focus on three possible applications. The first is data clustering and we
will show how our approach can be employed in conjunction with a clustering algorithm with
the aim of better organizing data available in a KBS. The second is related to the Semantic
Web and we will show how our approach can be used to assess the strength of the semantic
association between two objects and how this feature is useful to improve the task of discovering
new knowledge in a KBS. The third, finally, concerns a better understanding of the relationships and roles of users in virtual communities; in this case we show that our approach is useful to elucidate relationships such as trust.
7.5.1 Data Clustering
A central theme in KBS-related research is the design and implementation of effective data
clustering algorithms (259). In fact, if a KBS has to manage massive datasets (potentially split
across multiple data sources), clustering algorithms can be used to organize available data at
different levels of abstraction. The end user (both a human user or a software program) can
focus only on the portion of data which are the most relevant to her/him rather than exploring
the whole data space managed by a KBS (54, 193, 259). If we ideally assume that any data
managed by a KBS is mapped onto a point of a multidimensional space, the task of clustering
available data requires to compute the mutual distance existing between any pair of data points.
Such a task, however, in many cases is unfeasible. In fact, the computation of the distance can
be prohibitively time-consuming if the number of data points is very large. In addition, KBS
often manage data which are related each other but, for these kind of data, the computation of
a distance could make no-sense: think, for instance, of data on health status of a person and
her/his demographic data like age or gender.
Therefore, many authors suggested representing data as graphs, such that each node represents a data point and each edge specifies the type of relationship binding two nodes. The problem of clustering graphs has been extensively studied in the past and several algorithms have been proposed. In particular, the graph clustering problem in the social network literature is also known as the community detection problem (109).
One of the early algorithms to find communities in graphs/networks was proposed by Girvan and Newman. Unfortunately, due to its high computational complexity, the Girvan-Newman algorithm cannot be applied to very large and complex data repositories consisting of millions of information objects.
Our algorithm, instead, can be employed to rank edges in networks and to find communities.
This is an ongoing research effort and the first results are quite encouraging (79).
Once a community finding algorithm is available we can design complex applications to effectively manage data in a KBS. For instance, in (280) the authors focused on Online Social
Networks like Internet newsgroups and chat rooms. They used semantic tools to analyze the text comments posted by users, which allowed large Online Social Networks to be mapped onto weighted graphs. The authors showed that the discovery of the latent communities is a useful way to better understand patterns of interaction among users and how opinions spread in the network.
We then describe two use cases possibly benefiting from community detection algorithms. In the first case, consider a social network in which users fill in a profile specifying their interests. A graph can be constructed which records users (mapped onto nodes) and the relationships among
them (e.g., an edge between two nodes may indicate that two users share at least one interest).
Our algorithm, therefore, could identify groups of users showing the same interests. Then, given an arbitrary message (for instance a commercial advertisement), we could identify the groups of users interested in it and selectively send the message only to those groups.
As a dual application, we can consider the objects generated within a social media platform. These objects could be, for instance, photos in a platform like Flickr or musical tracks in a platform like Last.fm. We can map the space of user-generated content onto a graph and apply our community detection algorithm to it. In this way we could design advanced query tools: in fact, once a user issues a query, a KBS may retrieve not only the objects exactly labeled by the keywords composing the user query, but also objects falling in the same community as the retrieved objects. In this way, users can retrieve objects of their interest even if they are not aware of their existence.
7.5.2 Semantic Web
A further research scenario that can take advantage of our research work is represented by the Semantic Web. In detail, Semantic Web tools like RDF allow complex and real-life scenarios to be modeled by means of networks. In many cases these networks are called multi-relational networks (or semantic networks) because they consist of heterogeneous objects and many types of relationships can exist among them (244).
For instance, an RDF knowledge base in the e-learning domain (180) could consist of students,
instructors and learning materials in a University. In this case, the RDF knowledge base could
be converted to a semantic network in which nodes are the players described above. Of course,
an edge may link two students (for instance, if they are friends or if they are enrolled in the same
BSc programme), a student and a learning object (if the student is interested in that learning
object), an instructor and a learning material (if the instructor authored that learning material)
and so on (82).
A relevant theme in the Semantic Web is assessing the weight of the relationships binding two objects, because this is beneficial for discovering new knowledge. For instance, in the case of the
e-learning example described above, if a student has downloaded multiple learning objects on
the same topic, the weight of an edge linking the student and a learning material would reflect
the relevance of that learning material to the student. Therefore, learning materials can be
ranked on the basis of their relevance to the user and only the most relevant learning materials
can be suggested to the user.
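As a purely illustrative sketch of this ranking step (in Python; the function name, the edge-weight dictionary and the toy data are hypothetical and only stand in for a real semantic network), the learning materials linked to a student can be ordered by the weight of the connecting edge:

def rank_materials(edge_weight, student, materials):
    # edge_weight: dict mapping (student, material) pairs to the weight of the tie.
    linked = [(m, edge_weight.get((student, m), 0.0)) for m in materials]
    return sorted((item for item in linked if item[1] > 0), key=lambda item: -item[1])

# Hypothetical toy data:
w = {("alice", "linear_algebra_notes"): 0.8, ("alice", "calculus_slides"): 0.3}
print(rank_materials(w, "alice", ["linear_algebra_notes", "calculus_slides", "physics_lab"]))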
An approach like ours, therefore, could have a relevant impact in this application scenario
because we could find interesting associations among items by automatically computing the
weight of the ties connecting them. To the best of our knowledge there are few works on the computation of node centrality in semantic networks (244) but, recently, some authors have suggested extending parameters introduced in Social Network Analysis, such as the concept of shortest path, to multi-relational networks (245).
Therefore, we plan to extend our approach to the context of semantic networks. Our aim is to
use simple random walks in place of shortest paths to efficiently discover relevant associations
between nodes in a semantic network and to experimentally compare the quality of the results
produced by our approach against that achieved by approaches relying on shortest paths.
7.5.3 Understanding User Relationships in Virtual Communities
A central theme in KBS research is represented by the extraction of patterns of interactions
among humans in a virtual community and their analysis with the goal of understanding how
humans influence each other.
A relevant problem is the classification of human relationships on the basis of their intensity. For instance, in (88) the authors consider the criminal justice domain and focus on the identification of social ties playing a crucial role in the transmission of sensitive information. In (279), the author provides a belief propagation algorithm which exploits social ties among members of a criminal social network to identify criminals. Our approach resembles that of (88) because both are able to associate each edge in a network with a score indicating the strength of the association between the nodes linked by that edge.
A special case occurs when we assume that the edge connecting two nodes specifies a trust
relationship (83, 163). In (83), the authors suggested propagating trust values along paths in the social network graph. In an analogous fashion, the approach of (163) uses paths in the social network graph to propagate trust values and infer trust relationships between pairs of unknown users. Finally, Reinforcement Learning techniques are applied to estimate to what extent an inferred trust relationship has to be considered credible. Our approach is similar to those presented above because all of them rely on a diffusion model. In (83, 163), the main assumption is that trust is transitive, i.e., if a user x trusts a user y who, in her/his turn, trusts a user z, then we can assume that x trusts z too. In our approach, we exploit connections among nodes to propagate messages by using random walks of bounded length. There are, however, some relevant differences: in the approaches devoted to computing trust, all the paths of any arbitrary length are, in principle, useful to compute trust values, even if the contribution brought in by long paths is considered less relevant than that of short paths. Vice versa, in our approach, the length of a path is bounded by a fixed constant κ.
7.6 Fast Community Structure Detection
In the following, we present a novel algorithm to calculate the community structure of a network. It is baptized Fast κ-path Community Detection (or, shortly, FKCD). The strategy relies on three steps: i) ranking edges by using the WERW-Kpath algorithm; ii) calculating the proximity (the inverse of the distance) between each pair of connected nodes; iii) partitioning the network into communities so as to optimize the network modularity, according to the Louvain method (32). The algorithm is discussed in the following.
7.6.1 Background
The strategy exploited in the following adopts the paradigm of the maximization of the network modularity. It can be explained as follows: let us consider a network, represented by means of a graph G = (V, E), partitioned into m communities; let l_s be the number of edges between nodes belonging to the s-th community and d_s the sum of the degrees of the nodes in the
s-th community; we then recall the definition of the network modularity:

Q = \sum_{s=1}^{m} \left[ \frac{l_s}{|E|} - \left( \frac{d_s}{2|E|} \right)^2 \right]    (7.16)
Intuitively, high values of Q imply high values of l_s for each discovered community; thus, detected communities are dense within their structure and weakly coupled among each other. Equation 7.16 reveals a possible maximization strategy: in order to increase the value of the first term (namely, the coverage), the highest possible number of edges should fall within each given community, whereas the minimization of the second term is obtained by dividing the network into several communities with small total degrees.
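For concreteness, a minimal sketch of Equation 7.16 for an unweighted, undirected graph could read as follows (in Python, using networkx; this is an assumed helper written only for illustration, not the implementation used in our experiments).

import networkx as nx

def network_modularity(G, communities):
    # communities: list of node sets partitioning G; returns Q of Equation 7.16.
    m = G.number_of_edges()
    q = 0.0
    for c in communities:
        l_s = G.subgraph(c).number_of_edges()        # edges inside community s
        d_s = sum(deg for _, deg in G.degree(c))     # total degree of community s
        q += l_s / m - (d_s / (2 * m)) ** 2
    return q

# Toy example: two triangles joined by a single edge.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
print(network_modularity(G, [{0, 1, 2}, {3, 4, 5}]))  # about 0.357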
The problem of maximizing the network modularity has been proved to be NP-complete (47). The state-of-the-art approximate technique is the Louvain method (LM) (32). This strategy is based on local information and is well suited for analyzing large weighted networks. It is based on two simple steps: i) each node is assigned to a community chosen in order to maximize the network modularity Q; the gain derived from moving a node i into a community C can simply be calculated as (32)
\Delta Q = \left[ \frac{\Sigma_C + k_{i,C}}{2m} - \left( \frac{\Sigma_{\hat{C}} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_C}{2m} - \left( \frac{\Sigma_{\hat{C}}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right]    (7.17)
where \Sigma_C is the sum of the weights of the edges inside C, \Sigma_{\hat{C}} is the sum of the weights of the edges incident to nodes in C, k_i is the sum of the weights of the edges incident to node i, k_{i,C} is the sum of the weights of the edges from i to nodes in C, and m is the sum of the weights of all the edges in the network; ii) the second step simply builds a new network whose nodes are the communities previously found. The process then iterates until no further significant improvement of the network modularity can be obtained.
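A minimal sketch of the gain of Equation 7.17 (in Python; an assumed helper using the notation introduced above, not the implementation used in our experiments) is the following.

def modularity_gain(sigma_c, sigma_c_hat, k_i, k_i_c, m):
    # sigma_c: weight of the edges inside C; sigma_c_hat: weight of the edges
    # incident to nodes in C; k_i: weighted degree of node i; k_i_c: weight of
    # the edges from i to C; m: total weight of the edges in the network.
    after = (sigma_c + k_i_c) / (2 * m) - ((sigma_c_hat + k_i) / (2 * m)) ** 2
    before = sigma_c / (2 * m) - (sigma_c_hat / (2 * m)) ** 2 - (k_i / (2 * m)) ** 2
    return after - before

# Toy numbers (purely illustrative):
print(modularity_gain(sigma_c=4, sigma_c_hat=10, k_i=3, k_i_c=2, m=20))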
In the following we present an efficient community detection algorithm which represents a generalization of the LM. In fact, it can be applied even to unweighted networks and, most importantly, it exploits both global and local information. To make this possible, our strategy computes the pairwise distance between nodes of the network. To do so, edges are weighted by using a global feature which represents their aptitude to propagate information through the network. The edge weighting is based on the κ-path edge centrality. Thus, the partition of the network is obtained by improving the LM. Details of our strategy are explained in the following.
7.6.2 Design Goals
In this Section we briefly and informally discuss the ideas behind our strategy. First of all, we
explain the principal motivations that make our approach suitable, in particular but not only,
for the analysis of the community structure of social networks. To this purpose, we introduce
a real-life example from which we infer some features of our approach.
Let us consider a social network in which users are connected by friendship relations. In this context, we can assume that one of the principal activities is exchanging information. Thus, let us assume that a "message" (which could be, for example, a wall post on Facebook or a tweet on Twitter) represents the simplest "piece" of information and that users
of this network can exchange messages by means of their connections. This means that a user can directly send and receive information only to/from the people in her neighborhood. In fact, this assumption will be fundamental (see further) in order to define the concepts of community and community structure. Intuitively, a community is defined as a group of individuals in which the interconnections are denser than outside the group (in fact, this maximizes the benefit function Q).
The aim of our community detection algorithm is to identify the partitioning of the network into communities such that the network modularity is optimal. To do so, our strategy is to rank the links of the network on the basis of their aptitude to favor the diffusion of information. In detail, the higher the ability of an edge to propagate a message, the higher its centrality in the network. This is important because, as already proved by (124, 217), the higher the centrality of an edge, the higher the probability that it connects different communities.
Our algorithm adopts different optimizations in order to efficiently compute the link ranking. Once we have defined an optimized strategy for ranking links, we can compute the pairwise distances between nodes and, finally, the partitioning of the network, according to the LM. The goodness of the partitioning into communities is evaluated by adopting the network modularity Q.
In the next sections we shall discuss how our algorithm is able to incorporate these requirements.
In Section 7.3.2, we provided a definition of centrality of edges in social networks based on the
propagation of messages by using simple random walks of length at most κ (called, hereafter, κ-path edge centrality). Then, we provided a description of an efficient algorithm to approximate
it, running in O(κ|E|), where |E| is the number of edges in the network. In the following, we
discuss our novel community detection algorithm.
7.6.3 Fast κ-path Community Detection
First of all, our Fast κ-path Community Detection (henceforth, FKCD) needs a ranking criterion
to compute the aptitude of all the edges to propagate information through the network. To do
so, FKCD invokes the WERW-Kpath algorithm, previously described. Once all the edges have
been labeled with their κ-path edge centrality, a ranking in decreasing order of centrality could
be obtained. This is not fundamental, but could be useful in some applications. Similarly, before proceeding, a first estimate of the network modularity (hereafter, Q) could be calculated. This helps to put into evidence how Q increases during the next steps. With respect to Q, we recall that the higher Q, the more evident the community structure of the network. The computational cost of this first step is O(κ|E|), where κ is the length of the κ-paths and |E| the cardinality of E.
The second step consists in calculating the proximity between each pair of connected nodes. This is done by using an L2 (i.e., Euclidean) distance, calculated as
r_{ij} = \sqrt{ \sum_{k=1}^{n} \frac{\left( L^{\kappa}(e_{ik}) - L^{\kappa}(e_{kj}) \right)^2}{d(k)} }    (7.18)
where L^κ(e_ik) (resp., L^κ(e_kj)) is the κ-path edge centrality of the edge e_ik (resp., e_kj) and d(k) is the degree of node k. We put into evidence that, even though the L2 measure would return a distance, in our case the higher L^κ(e_ik) (resp., L^κ(e_kj)), the nearer,
rather than more distant, the nodes are. This important aspect leads us to consider the results of Equation 7.18 as the pairwise proximities of nodes. This step is theoretically computationally expensive, because it would require O(|V|^2) iterations; in practice, however, by adopting optimization techniques, its near-linear cost is O(d(v)|V|), where d(v) is the mean degree of the nodes of the network (and it is usually small in social networks).
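A minimal sketch of Equation 7.18 is the following (in Python, using networkx; the 'kpath' edge attribute name is hypothetical, and both the treatment of missing edges as having zero centrality and the restriction of the sum to the neighbors of i and j are assumptions made only for the example).

import math
import networkx as nx

def proximity(G, i, j):
    # Sum, over the nodes k adjacent to i or j, of the squared difference of the
    # kappa-path centralities of (i, k) and (k, j), scaled by the degree of k.
    total = 0.0
    for k in set(G.neighbors(i)) | set(G.neighbors(j)):
        l_ik = G[i][k]["kpath"] if G.has_edge(i, k) else 0.0
        l_kj = G[k][j]["kpath"] if G.has_edge(k, j) else 0.0
        total += (l_ik - l_kj) ** 2 / G.degree(k)
    return math.sqrt(total)

# Toy usage on a path graph with uniform centralities:
G = nx.path_graph(4)
nx.set_edge_attributes(G, 0.25, "kpath")
print(proximity(G, 1, 2))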
The last step is the network partitioning. The main idea is inspired by the LM (32) for detecting
the community structure of weighted networks in near linear time. The partitioning is an
iterative process. At each iteration, two simple steps occur: i) each node is assigned to a community chosen in order to maximize the network modularity Q; the possible increase of Q derived from moving a node i into a community C is calculated according to Equation 7.17; ii) the second step produces a meta-network whose nodes are the communities previously found. The partitioning ends when no further improvement of Q can be obtained. This results in splitting communities connected by edges with high proximity, which is a global feature, thus maximizing the network modularity. Its cost is O(γ|V|), where |V| is the cardinality of V and γ is the number of iterations required by the algorithm to converge (in our experience, usually, γ < 5). The FKCD is schematized in Algorithm 5.
We recall that CalculateDistance computes the pairwise node distances by using Equation 7.18, Partition extracts the communities according to the LM described above, and NetworkModularity calculates the value of the network modularity by using Equation 7.16.
The computational cost of our strategy is near linear. In fact, O(κ|E| + d(v)|V| + γ|V|) = O(Γ|E|), by adopting an efficient in-memory graph representation in order to minimize the execution time for the computation of Equations 7.16 and 7.18.
Algorithm 5 FKCD(Graph G = (V, E), int κ)
1: WERW-Kpath(G, κ)
2: CalculateDistance(G)
3: while Q increases by at least ε (an arbitrarily small threshold) do
4:    P = Partition(G)
5:    Q ← NetworkModularity(P)
6: end while
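For illustration, the final partitioning step can be sketched as follows (in Python; the graph is assumed to already carry the Equation 7.18 proximities in a 'proximity' edge attribute, and the built-in Louvain implementation of recent networkx releases stands in for our own code).

import networkx as nx

def fkcd_partition(G):
    # Louvain-style partitioning of the proximity-weighted graph, followed by
    # the evaluation of the network modularity of the obtained communities.
    communities = nx.community.louvain_communities(G, weight="proximity", seed=0)
    q = nx.community.modularity(G, communities, weight="proximity")
    return communities, q

# Toy usage: two tightly knit triangles connected by a weak edge.
G = nx.Graph()
G.add_weighted_edges_from(
    [(0, 1, 0.9), (1, 2, 0.9), (0, 2, 0.9),
     (3, 4, 0.9), (4, 5, 0.9), (3, 5, 0.9), (2, 3, 0.1)],
    weight="proximity")
print(fkcd_partition(G))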
7.7 Experimental Results
Our experimentation has been conducted both on synthetic and real-world Online Social Networks, whose datasets are available online. All the experiments have been carried out by using a standard personal computer equipped with an Intel i5 processor and 4 GB of RAM.
7.7.1 Synthetic Networks
The method proposed to evaluate the quality of the community structure detected by using the
FKCD exploits the technique presented by Lancichinetti et al. (177). We generated the same
synthetic networks reported in (177), adopting the following configuration: i) N = 1000 nodes;
ii) the four pairs of networks identified by (γ, β) = (2, 1), (2, 2), (3, 1), (3, 2), where γ represents
the exponent of the power law distribution of node degrees, β the exponent of the power law
144
7.7 Experimental Results
distribution of the community sizes; iii) for each pair of exponents, three values of average
degree ⟨k⟩ = 15, 20, 25; iv) for each of the combinations above, we generated six networks by varying the mixing parameter µ = 0.1, 0.2, . . . , 0.6. (Note: the threshold value µ = 0.5 is the border beyond which communities are no longer defined in the strong sense, i.e., each node has more neighbors in its own community than in the others (237).)
Figure 7.16 highlights the quality of the obtained results. The measure adopted is the normalized mutual information (75). The values obtained put into evidence that our strategy achieves fairly good results, avoiding the well-known resolution limit of modularity optimization (110). Moreover, a classification of results as in Table 7.3 (discussed later) is omitted because the values of Q obtained by using FKCD and the LM on these quite small synthetic networks are very similar.
Figure 7.16: Normalized mutual information test using the synthetic benchmarks.
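For reference, the normalized mutual information between a planted partition and a detected one can be computed, for instance, with scikit-learn (a hedged sketch with toy per-node community labels, not the evaluation script actually used in our experiments):

from sklearn.metrics import normalized_mutual_info_score

planted  = [0, 0, 0, 1, 1, 1, 2, 2]   # ground-truth community of each node
detected = [0, 0, 1, 1, 1, 1, 2, 2]   # communities found by an algorithm
print(normalized_mutual_info_score(planted, detected))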
7.7.2 Online Social Networks
Results obtained by analyzing several Online Social Networks datasets (184, 270) are summarized in Table 7.3. This experimentation has been carried out to qualitatively analyze the
performance of our strategy. Obtained results, measured by means of the network modularity
calculated by our algorithm (FKCD), are compared against those obtained by using the original
LM.
Our analysis puts into evidence the following observations: i) classic, non-optimized algorithms (for example Girvan-Newman) are unfeasible for large network analysis; ii) results obtained by using the LM are slightly higher than those obtained by using FKCD; on the other hand, the LM adopts
local information in order to optimize the network modularity, while our strategy exploits both local and global information; this results in community structures that are (possibly) more convenient for some applications; iii) the performance of FKCD slightly increases when using longer κ-paths; iv) both of the compared efficient strategies are feasible even when analyzing large networks with standard computing resources (i.e., a classic personal computer).
Network       No. nodes   No. edges    No. comm.   FKCD (κ=5)   FKCD (κ=20)   LM
CA-GrQc       5,242       28,980       883         0.734        0.786         0.816
CA-HepTh      9,877       51,971       1,501       0.585        0.648         0.768
CA-HepPh      12,008      237,010      1,243       0.565        0.598         0.659
CA-AstroPh    18,772      396,160      1,552       0.486        0.568         0.628
CA-CondMat    23,133      186,932      2,819       0.546        0.599         0.731
Facebook      63,731      1,545,684    6,484       0.414        0.444         0.634

Table 7.3: Results of the FKCD algorithm on the adopted datasets.
7.7.3 Extension to Biological Networks
The adoption of the community detection method presented has been extended also to different
fields of application. The algorithm we devised, in fact, can be considered a powerful technique
to obtain a meaningful clustering of any given network. The original limit of the Louvain
method has been overcome by extending its possible application both to weighted and unweighted networks, and its utilization for directed networks is straightforward.
In this Section we introduce an example application of our method to a slightly different field, namely the application of the network analysis approach to a bio-informatics problem: the analysis of gene-coexpression networks.
A gene-coexpression network can be informally defined as a network representing the interactions among genes within the cell. More in detail, a gene-coexpression network represents the behavior of a set of genes that cooperate in response to a given event (e.g., a stress condition) to perform a given task.
Data analyzed in this kind of task come from micro-array experiments, in which the expression level of each gene in the cell at a given time-point is sampled, during a control phase and a stress phase whose lengths are usually the same.
In the case of our experimentation, we consider the response of the model organism Arabidopsis thaliana to a stress condition simulating drought. A set of 1,217 genes (among the more than 29 thousand characterizing this plant) has been monitored with a time-sampling of 28 time-points, 14 of them under the control condition and 14 under the drought stress condition. For each gene, the expression level at each given time-point is sampled, obtaining a matrix of values E^{G×T}, in which G is the number of genes (i.e., 1,217) and T is the number of time-points considered (i.e., 28).
The matrix E is exploited to calculate the pairwise correlation between each pair of genes, in our case by adopting the Pearson correlation. The values obtained range in the interval [-1, 1], even if it is common to consider the absolute value, taking into account the fact that two genes are highly correlated even if the type of correlation is inverse (i.e., if, in response to a given stress, the amount of expression of a given gene grows in the same proportion in
which the amount of expression of another given gene decreases). At the end of this process we obtain a matrix C^{G×G} in which the correlation of each pair of genes is represented. The network obtained by using the matrix C as a weighted adjacency matrix is a fully connected network containing n(n−1)/2 edges, which is usually a large network. In order to reduce the size of the network, it is possible to apply a thresholding operation that cuts correlations between genes below a certain value (for example, t = 0.7). This results in a network in which each edge represents a strong weighted correlation between genes during the response process to the given event.
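A minimal sketch of this construction is the following (in Python; the synthetic expression matrix is an assumption made only for the example, while the threshold t = 0.7 follows the text).

import numpy as np
import networkx as nx

def coexpression_network(E, t=0.7):
    # E: expression matrix of shape (genes x time-points). Build the network
    # whose edges are the pairs of genes with |Pearson correlation| >= t.
    C = np.abs(np.corrcoef(E))
    G = nx.Graph()
    G.add_nodes_from(range(C.shape[0]))
    for i in range(C.shape[0]):
        for j in range(i + 1, C.shape[0]):
            if C[i, j] >= t:
                G.add_edge(i, j, weight=C[i, j])
    return G

# Synthetic expression matrix: 50 genes sampled over 28 time-points.
rng = np.random.default_rng(0)
net = coexpression_network(rng.normal(size=(50, 28)))
print(net.number_of_nodes(), net.number_of_edges())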
In our case, the gene co-expression network obtained for Arabidopsis thaliana in response to drought stress is represented by a graph G = (V, E) with |V| = 1,217 nodes (genes) and |E| = 278,374 edges representing the strong correlations existing among the genes.
In order to cluster this network, we adopted our method and compared it against two other state-of-the-art techniques, i.e., the already discussed Louvain method and the OSLOM algorithm (178), an overlapping community detection algorithm recently developed by Lancichinetti, Radicchi and Ramasco. The results obtained by using our method largely outperform those of these two renowned methods, both in terms of the quality of the discovered clusters and in the number of clusters obtained.
In Figures 7.17–7.19 we depict the three largest clusters of genes discovered inside the analyzed network. In detail, each plot represents the expression profiles of the genes, i.e., the graphical representation of the amount of expression of each gene at each given time-point. First of all, it is evident that all the genes classified in these clusters present a behavior that is very similar to each other and with respect to the average of the gene expression values. More importantly, by using techniques such as over-representation analysis, biologists verified that each of the discovered clusters meaningfully represents processes activated by the plant in response to the administered drought stress. Not only is the quality of the clusters produced by our method better than that of the clusters produced by the other methods, but the number of recognized clusters is also more appropriate.
In fact, our technique recognized 24 clusters, 11 of which are above the minimum size (in terms of contained genes) that biologists consider meaningful to define a cluster of gene co-expression (28, 86). By using the classic Louvain method we obtained just 6 clusters, 4 of which are of acceptable size. The OSLOM algorithm identified 11 clusters, 6 of which are representative. Moreover, the over-representation analysis revealed, in particular for the Louvain method, that it was possible to identify several different processes inside each of the 4 clusters defined by this technique, i.e., the results produced were not reliable in terms of biological meaning. This is possibly due to the well-known resolution limit problem that may affect the Louvain method.
Figure 7.17: Arabidopsis thaliana gene-coexpression network (cluster 1).
Figure 7.18: Arabidopsis thaliana gene-coexpression network (cluster 2).
Figure 7.19: Arabidopsis thaliana gene-coexpression network (cluster 3).

Conclusion

In this Chapter we introduced a novel edge centrality measure for social networks, called the κ-path edge centrality index. Its computation is feasible even on large-scale networks by using the algorithm we provided. The algorithm performs multiple random walks on the social network graph, which are simple and whose length is bounded by κ. We showed that the worst-case time complexity of our algorithm is O(κm), m being the number of edges in the social network graph. We discussed experimental results obtained by applying our method to different Online Social Networks.
Finally, we showed that our centrality measure can be used to detect communities in large networks. In the last part we presented a novel strategy that has two advantages: the former is that it exploits both local and global information; the latter is that, by using some optimizations, it efficiently provides good results.
This way, our approach is able to discover the community structure of (possibly large) networks. Our experimental evaluation, carried out over both synthetic and Online Social Network datasets, proves the efficiency and the robustness of the proposed strategy. Encouraging results also emerged when applying our algorithm to the problem of clustering gene co-expression biological networks, compared against the results provided by other state-of-the-art techniques.
8 Conclusions
The final Chapter of this Thesis concludes the dissertation by discussing and summarizing (i) the main findings and contributions to the state of the art in the disciplines covered by this work, and (ii), as for future work, the directions that our research will undertake and the topics, among those discussed in this Thesis, which deserve further investigation.
8.1 Findings
The main motivation underlying the research work conducted during this Thesis is the increasing popularity of social phenomena such as social media and, in particular, Online Social Networks (OSNs).
In detail, in this work we presented a comprehensive analysis of the principal tasks related to
the research in social networks, that are (i) the extraction (also called Web mining) of data
from Online Social Networks, (ii) the analysis of the network structure representing OSNs,
and, finally (iii) the development of efficient algorithms for the computation of Social Network
Analysis measures on massive OSNs.
The main findings of this work can be discussed separately, corresponding to those three main
parts already presented in the introduction of this work, namely:
(i) A first part, in which we discussed all those problems related to mining massive Web sources,
such as Online Social Networks. In particular, different techniques have been shown and we
focused on the so-called Web wrappers and the problems related to their automatic adaptation
in order to refine and make the process of automatic extraction of information from Web pages
more robust.
(ii) A second part, in which we discussed the analysis of a large sample acquired from Facebook, which is, to date, the largest and most representative OSN and a relevant artifact from a Computer Science perspective. From this analysis emerges the possibility of investigating topological features of the network, such as the well-known small world effect, scale-free distributions and community structure, and, in addition, of verifying the validity of different sociological theories such as the six degrees of separation or the strength of weak ties. The assessment of
different aspects of the analysis of the community structure of this network is also discussed in detail.
(iii) A final part, in which we contributed the development of a novel, computationally efficient measure of centrality for social networks. The functioning of this measure is rooted in random walk theory and its evaluation is computationally feasible even on a large scale. It is the core of different applications, for example a community detection algorithm whose performance is assessed in different fields of applicability, such as social and biological networks.
In conclusion, we recall the main contributions of this dissertation:
(a) We discussed the current panorama regarding the field of Web information mining platforms. Those procedures considered, namely Web wrappers, represent the core on which our
platform of Web mining for Online Social Networks has been built on. Among the main findings, a relevant contribution in this field is represented by the solution of automatic wrapper
adaptation. It relies on a tree-edit distance-based algorithm which is very extensible and powerful. Its performance is assessed on different fields of application regarding social media and
Web platforms, and this algorithm is able to provide high values of precision and recall, actually
making the process of automatic extraction of information more robust.
(b) We presented our platform of Web mining for Online Social Networks, which has been
able to extract millions of user profiles and friendship connections among them. The system, an ad hoc Facebook crawler developed to comply with the strict privacy terms of Facebook, supported us in the task of creating a large scale sample that has been adopted for scientific purposes and publicly released for the scientific community. In detail, after implementing two different sampling algorithms presented in the literature, we explored the social graph
of friendship relations among Facebook users and we assessed several topological characteristics
of the obtained samples. Important graph-theoretical and Social Network Analysis features,
such as degree distribution, diameter, centrality metrics, clustering coefficient and so on, have
been considered during the analysis of these datasets.
(c) We put into evidence those mathematical models which try to faithfully represent the topological features of Online Social Networks. In detail, once we presented models and methods
proper of the Social Network Analysis, we focused our attention on different Online Social
Networks – whose datasets were freely available online. We considered three main generative
models, i.e., i) Erdős-Rényi random graphs, ii) Watts-Strogatz and, iii) Barabási-Albert preferential attachment. These models have been compared against real data in order to assess
their reliability. Our results show that each model is only able to correctly describe a few characteristics, but they all fail in faithfully representing all the main features of OSNs we previously identified, namely i) the small world effect, ii) scale-free degree distributions and, finally, iii)
emergence of a community structure.
(d) A large number of original experimental results have been presented regarding the community structure of Facebook. In detail, we carried out a large-scale community structure investigation in order to study the problem of community detection on massive OSNs. Given the size of our datasets, we exploited different computationally efficient algorithms present in the literature, optimized to detect the community structure of large-scale networks, like Facebook, containing millions of nodes and edges. Both from a quantitative and a qualitative perspective, a very strong community structure emerges from our analysis. We have been able to highlight different features of the community structure of Facebook, for example a typical power law distribution of the size of the communities. In addition, the communities unveiled by
our analysis reveal a very high degree of similarity, regardless of the methodology adopted for the community detection. This aspect underlines the clear emergence of a community structure in massive OSNs, which eases the task of unveiling communities in large networks.
Once the presence of the community structure had been assessed, we investigated the mesoscopic features of this characteristic artifact, building a community structure meta-network, that is, a network whose nodes represent communities of the main graph and whose edges represent connections among individuals belonging to the given communities. The most important finding was the verification that most of the features shown by the original social graph are still clearly visible in the community structure meta-network. In detail, we assessed the presence of a power law degree distribution of the communities, a clustering effect, the small world effect that explains the existence of very short paths among all the communities, and a diameter smaller than 5.
These results pushed us to verify the validity of a renowned sociological theory, the strength of weak ties, which is well known to be related to the community structure of a network. Our findings provide quantitative support, offering several clues that testify to the presence and the importance of weak ties in the network.
(e) Related to the importance of individuals and connections in a given social network, our final contribution in this work is the definition of a novel measure of edge centrality particularly well suited for social networks. It has been baptized the κ-path edge centrality index. This measure is computationally feasible even on large scale networks, by means of an algorithm we provided, whose rationale is based on random walk theory. It carries out multiple random walks on the network, which are simple and whose length is bounded by κ. The worst-case time complexity of our algorithm is O(κm), m being the number of edges in the social network graph. The validity of this algorithm has been tested on different massive OSNs, since the algorithm provides an approximation of the measure we defined.
Finally, we provided a real-world application of our centrality measure, devising a novel community detection method able to work with large networks. It has several advantages with respect to other solutions: first of all, the algorithm exploits both local and global information. Its evaluation has been carried out over both synthetic and real-world networks, proving the efficiency and the robustness of the proposed strategy. In addition, by using some optimizations, it efficiently provides good results also in different contexts, such as biological applications. In fact, relevant results have been obtained by applying this method to the problem of clustering gene co-expression networks.
8.2 Future Work
In this Thesis we introduced a number of concepts and research lines which deserve further study, and this final Section is devoted to discussing some of the most relevant future work closely related to this dissertation.
A first research line which is promising is related to the simulation of diffusion of information on
Online Social Networks or, more in general, in social networks built combining knowledge from
different networks (for example, the social network representing the friendship relationships
among Facebook users and the geographical network representing the physical location of the
given set of users). In particular, one goal could be to verify the most efficient way to choose a small subset of nodes of the network from which to spread the information so as to
maximize a certain function, for example the coverage1. The process of diffusion of information over the network could be modeled, simulated and studied by exploiting different paradigms presented in the literature (for example, the Independent Cascade Model (131, 132, 159)).
Closely related to this topic, another relevant problem is maximizing the spread of influence on such kinds of networks (66, 159). This problem is particularly important since it has immediate economic applications, for example advertising new products in a more efficient way (89, 179, 190), e.g., by means of the well-known word-of-mouth phenomenon (53).
Yet on the analysis of information propagation on Online Social Networks, another trending
topic is leveraging the behavior of users on a large scale to study and model the spread of
information through the network. This is applied to different purposes, from marketing trends to
political consensus diffusion analysis. In the former category we include the sentiment analysis of
the public mood related to socio-economic phenomena (36, 37) and the study of the diffusion of viral marketing campaigns (39, 65). The latter includes the identification of deceptive attempts to diffuse defamatory or misleading political information, an illicit practice usually called astroturfing (241, 242).
Considering the research line related to the novel centrality measure we defined and to community detection algorithms, as for future work we plan a long-term evaluation of our method, in order to cover different domains of application and to face several scientific challenges. For example, in the context of bio-informatics – in which the proposed algorithm has already proved to work well in some contexts – our method will be applied to the study of human disease networks (16, 127). To this purpose, it could be applied to unveil the modular
structure of disease networks (225), for example to understand how disease gene modules are
preserved during the evolution of organisms (98).
Moreover, we are planning to extend this method in a number of ways. First, the novel centrality
measure we defined in this work can be used to detect overlapping communities in large social
networks. Such a task is currently unfeasible on a large scale, even when adopting the state-of-the-art overlapping community detection algorithms presented in the literature, such as COPRA (138), OSLOM (175) or SLPA (282, 283). In fact, to the best of our knowledge, efficient algorithms that estimate the community structure of a large network based on global topological information do not currently exist, and our strategy could fit this purpose well.
In addition, based on this measure, we plan to design an algorithm to estimate the strength
of ties between two social network actors: for instance, in social networks like Facebook this is equivalent to estimating the friendship degree between each pair of users. Finally, we point out
that some researchers studied how to design parallel algorithms to compute centrality measures.
For instance, (191) proposed a fast and parallel algorithm to compute betweenness centrality.
We believe that a new, interesting research opportunity could be to design a parallel algorithm
to compute the κ-path edge centrality.
1 For the sake of simplicity, the coverage can be considered as the number of nodes in the network which are informed about the given information at a certain time.
8.3 List of Publications
In the following, we list the publications of the author related to this dissertation:
E. Ferrara. Community structure discovery in Facebook. International Journal of
Social Network Mining. 1(1):67–90, (2012).
E. Ferrara and G. Fiumara. Topological features of Online Social Networks. Communications on Applied and Industrial Mathematics. 2(2):1–20, (2011).
P. De Meo, E. Ferrara, and G. Fiumara. A novel measure of edge centrality in social
networks. Knowledge-based Systems DOI: 10.1016/j.knosys.2012.01.007, (2012).
S. Catanese, E. Ferrara, and G. Fiumara. Forensic analysis of phone call networks.
Social Network Analysis and Mining. (Accepted)
S. Catanese, P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti. Extraction
and analysis of Facebook friendship relations. In Computational Social Networks:
Mining and Visualization. Springer Verlag (In press).
P. De Meo, E. Ferrara, and G. Fiumara. Finding similar users in Facebook. In
Social Networking and Community Behavior Modeling: Qualitative and Quantitative
Measurement, pages 304–323. IGI Global, 2011.
E. Ferrara and R. Baumgartner. Automatic wrapper adaptation by tree edit distance matching. In Combinations of Intelligent Methods and Applications, pages
41–54. Springer Verlag, 2011.
E. Ferrara and R. Baumgartner. Intelligent self-repairable web wrappers. In Lecture
Notes in Computer Science, volume 6934, pages 274–285. Springer Verlag, 2011.
P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti. Generalized Louvain method
for community detection in large networks. In ISDA ’11: Proceedings of the 11th
International Conference on Intelligent Systems Design and Applications, pages 88–
93, 2011.
S. Catanese, P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti. Crawling Facebook for social network analysis purposes. In WIMS ’11: Proceedings of the International Conference on Web Intelligence, Mining and Semantics, pages 52:1–52:8.
ACM, 2011.
E. Ferrara and R. Baumgartner. Design of automatically adaptable web wrappers.
In ICAART ’11: Proceedings of the 3rd International Conference on Agents and
Artificial Intelligence, pages 211–217, 2011.
S. Catanese, P. De Meo, E. Ferrara, and G. Fiumara. Analyzing the Facebook
friendship graph. In MIFI ’10: Proceedings of the 1st International Workshop on
Mining the Future Internet, volume 685, pages 14–19, 2010.
P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti. Strength of weak ties in
Online Social Networks. Physical Review E (Under review)
E. Ferrara, G. Fiumara, and R. Baumgartner. Web data extraction, applications
and techniques: a survey. ACM Computing Surveys (Under review).
P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti. Enhancing community detection using a network weighting method. Information Sciences (Under review).
P. De Meo, E. Ferrara, F. Abel, L. Aroyo, and G. J. Houben. Analyzing user
behavior across Social Web environments. ACM Transactions on Intelligent Systems
and Technology (Under review).
E. Ferrara. A large-scale community structure analysis in Facebook. In PLoS ONE
(Under review).
E. Ferrara. CONCLUDE: complex network cluster detection for social and biological
applications. In WWW ’12 PhD Symposium (Under review).
The following references are related to publications of the author regarding with subjects not
included in this dissertation:
G. Quattrone, L. Capra, P. De Meo, E. Ferrara, and D. Ursino. Effective retrieval
of resources in folksonomies using a new tag similarity measure. In CIKM ’11: Proceedings of the 20th ACM Conference on Information and Knowledge Management,
pages 545–550, 2011.
G. Quattrone, E. Ferrara, P. De Meo, and L. Capra. Measuring similarity in large-scale folksonomies. In SEKE ’11: Proceedings of the 23rd International Conference
on Software Engineering and Knowledge Engineering, pages 385–391, 2011.
P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti. Improving recommendation
quality by merging collaborative filtering and social relationships. In ISDA ’11:
Proceedings of the 11th International Conference on Intelligent Systems Design and
Applications, pages 587–592, 2011.
S. Catanese, E. Ferrara, G. Fiumara, and F. Pagano. A framework for designing 3D
virtual environments. In Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering. Springer Verlag, (In press).
S. Catanese, E. Ferrara, G. Fiumara, and F. Pagano. Rendering of 3D dynamic
virtual environments. In Simutools ’11: Proceedings of the 4th International ICST
Conference on Simulation Tools and Techniques, 2011.
E. Ferrara, G. Fiumara, and F. Pagano. Living city, a collaborative browser-based massively multiplayer online game. In Simutools ’10: Proceedings of the
3rd International ICST Conference on Simulation Tools and Techniques, pages 1–8,
ICST, 2010.
Bibliography
[1] Adamic, L., Adar, E.: Friends and neighbors on the web. Social networks 25(3), 211–230
(2003) 47
[2] Adamic, L., et al.: Power-law distribution of the world wide web. Science 287(5461),
2115 (2000) 56
[3] Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: A
survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge
and Data Engineering pp. 734–749 (2005) 46
[4] Ahn, Y., Han, S., Kwak, H., Moon, S., Jeong, H.: Analysis of topological characteristics
of huge online social networking services. In: Proc. of the 16th international conference
on World Wide Web, pp. 835–844. ACM (2007) 1, 66
[5] Aiello, L.M., Barrat, A., Cattuto, C., Ruffo, G., Schifanella, R.: Link creation and profile
alignment in the aNobii social network. In: Proc. of the 2nd International Conference on
Social Computing, pp. 249–256 (2010) 49
[6] Alahakoon, T., Tripathi, R., Kourtellis, N., Simha, R., Iamnitchi, A.: K-path centrality:
A new centrality measure in social networks. In: Proc. of the 4th Workshop on Social
Network Systems, pp. 1–6 (2011) 116, 119, 121
[7] Albert, R.: Diameter of the World Wide Web. Nature 401(6749), 130 (1999) 12, 56, 57,
66, 67
[8] Albert, R., Barabási, A.: Statistical mechanics of complex networks. Reviews of Modern
Physics 74(1), 47–97 (2002) 45, 60, 66
[9] Amalfitano, D., Fasolino, A.R., Tramontana, P.: Reverse engineering finite state machines
from rich internet applications. In: Proc. of the 15th Working Conference on Reverse
Engineering, pp. 69–73. IEEE (2008) 23
[10] Amaral, L., Scala, A., Barthélémy, M., Stanley, H.: Classes of small-world networks.
Proceedings of the National Academy of Sciences 97(21), 11,149 (2000) 56
[11] Anthonisse, J.: The rush in a directed graph. Tech. Rep. BN/9/71, Stichting Mathematisch Centrum, Amsterdam, The Netherlands (1971) 115, 118
[12] Anton, T.: XPath-Wrapper Induction by generalizing tree traversal patterns. MIT Press
(2004) 20
[13] Bader, D., Kintali, S., Madduri, K., Mihail, M.: Approximating betweenness centrality.
Algorithms and Models for the Web-Graph pp. 124–137 (2007) 119
[14] Barabási, A., Albert, R.: Emergence of scaling in random networks. Science 286(5439),
509 (1999) ix, 57, 68, 70, 72, 77
[15] Barabási, A., Crandall, R.: Linked: The new science of networks. American Journal of
Physics 71, 409 (2003) 53, 66
[16] Barabási, A., Gulbahce, N., Loscalzo, J.: Network medicine: a network-based approach
to human disease. Nature Reviews Genetics 12(1), 56–68 (2011) 154
[17] Barabási, A., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution of
the social network of scientific collaborations. Physica A: Statistical Mechanics and its
Applications 311(3-4), 590–614 (2002) 56
[18] Barrat, A., Weigt, M.: On the properties of small-world network models. The European
Physical Journal B - Condensed Matter and Complex Systems 13(3), 547–560 (2000) 80
[19] Barthelemy, M.: Betweenness Centrality in Large Complex Networks. European Physical
Journal B 38, 163–168 (2004) 62
[20] Batagelj, V., Doreian, P., Ferligoj, A.: An optimizational approach to regular equivalence.
Social Networks 14(1-2), 121–135 (1992) 47
[21] Baumgartner, R., Ceresna, M., Ledermuller, G.: Deepweb navigation in web data extraction. In: Proc. of the International Conference on Computational Intelligence for
Modelling, Control and Automation and International Conference on Intelligent Agents,
Web Technologies and Internet Commerce, pp. 698–703. IEEE (2005) 19
[22] Baumgartner, R., Flesca, S., Gottlob, G.: The elog web extraction language. In: Proc. of
the Artificial Intelligence on Logic for Programming, pp. 548–560. Springer Verlag (2001)
20, 28
[23] Baumgartner, R., Flesca, S., Gottlob, G.: Visual web information extraction with lixto.
In: Proc. of the 27th International Conference on Very Large Data Bases, pp. 119–128.
Morgan Kaufmann Publishers Inc. (2001) 20, 28
[24] Baumgartner, R., Frölich, O., Gottlob, G., Harz, P., Herzog, M., Lehmann, P.: Web data
extraction for business intelligence: the lixto approach. Datenbanksysteme in Business,
Technologie und Web 11, 30–47 (2005) 23
[25] Baumgartner, R., Fröschl, K., Hronsky, M., Pöttler, M., Walchhofer, N.: Semantic online
tourism market monitoring. Proc. of the 17th eTourism International Conference (2010)
23
[26] Baumgartner, R., Gatterbauer, W., Gottlob, G.: Web data extraction system. Encyclopedia of Database Systems pp. 3465–3471 (2009) 18, 37
[27] Baumgartner, R., Gottlob, G., Herzog, M.: Scalable web data extraction for online market
intelligence. Proceedings of the VLDB Endowment 2(2), 1512–1523 (2009) 23
[28] Ben-Dor, A., Shamir, R., Yakhini, Z.: Clustering gene expression patterns. Journal of
computational biology 6(3-4), 281–297 (1999) 147
[29] Berger, A., Pietra, V., Pietra, S.: A maximum entropy approach to natural language
processing. Computational linguistics 22(1), 39–71 (1996) 20
[30] Berthold, M., Hand, D.J.: Intelligent Data Analysis: An Introduction. Springer Verlag
(1999) 20
[31] Blondel, V., Gajardo, A., Heymans, M., Senellart, P., Van Dooren, P.: A measure of
similarity between graph vertices: Applications to synonym extraction and web searching.
SIAM Review pp. 647–666 (2004) 47
[32] Blondel, V., Guillaume, J., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in
large networks. Journal of Statistical Mechanics: Theory and Experiment 2008, P10,008
(2008) 80, 141, 142, 144
[33] Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., Hwang, D.: Complex networks:
Structure and dynamics. Physics Reports 424(4-5), 175–308 (2006) 88
[34] Boldi, P., Rosa, M., Santini, M., Vigna, S.: Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In: Proc. of the 20th
international conference on World wide web, pp. 587–596. ACM (2011) 2
[35] Boldi, P., Vigna, S.: The webgraph framework i: compression techniques. In: Proc. of
the 13th international conference on World Wide Web, pp. 595–602. ACM (2004) 2
[36] Bollen, J., Goncalves, B., Ruan, G., Mao, H.: Happiness is assortative in online social
networks. Artificial Life 17(3), 237–251 (2011) 154
[37] Bollen, J., Pepe, A., Mao, H.: Modeling public mood and emotion: Twitter sentiment
and socio-economic phenomena. In: Proc. 5th International AAAI Conference on Weblogs
and Social Media, pp. 17–21 (2011) 154
[38] Bollobás, B., Riordan, O.: The diameter of a scale-free random graph. Combinatorica
24(1), 5–34 (2004) 12
[39] Bonchi, F., Castillo, C., Ienco, D.: Meme ranking to maximize posts virality in microblogging platforms. Journal of Intelligent Information Systems pp. 1–29 (2011) 154
[40] Borgatti, S., Everett, M.: A graph-theoretic perspective on centrality. Social Networks
28(4), 466–484 (2006) 121
[41] Borgatti, S., Everett, M.: Models of core/periphery structures. Social networks 21(4),
375–395 (2000) 74, 86
[42] Boronat, X.: A comparison of html-aware tools for web data extraction. Master’s thesis,
Universität Leipzig, Fakultät für Mathematik und Informatik (2008). Abteilung Datenbanken 20
[43] Bossa, S., Fiumara, G., Provetti, A.: A lightweight architecture for rss polling of arbitrary
web sources. In: Proc. of WOA conference (2006) 30
[44] Bouttier, J., Di Francesco, P., Guitter, E.: Geodesic distance in planar graphs. Nuclear
Physics B 663(3), 535–567 (2003) 12
[45] Brandes, U.: A faster algorithm for betweenness centrality. Journal of Mathematical
Sociology 25(2), 163–177 (2001) 117, 119
[46] Brandes, U.: On variants of shortest-path betweenness centrality and their generic computation. Social Networks 30(2), 136–145 (2008) 118
[47] Brandes, U., Delling, D., Gaertler, M., Görke, R., Hoefer, M., Nikoloski, Z., Wagner, D.:
On finding graph clusterings with maximum modularity. In: Graph-Theoretic Concepts
in Computer Science, pp. 121–132 (2007) 142
[48] Brandes, U., Eiglsperger, M., Herman, I., Himsolt, M., Marshall, M.: GraphML progress
report structural layer proposal. In: Graph Drawing, pp. 109–112. Springer (2002) 50, 55
[49] Brandes, U., Erlebach, T.: Fundamentals. In: Network Analysis: Methodological Foundations, Lecture Notes in Computer Science, chap. 2, pp. 8–15. Springer (2005) 10
[50] Brandes, U., Fleischer, D.: Centrality measures based on current flow. In: Proc. of the
22nd Symposium Theoretical Aspects of Computer Science, pp. 533–544. Springer (2005)
117
[51] Brandes, U., Pich, C.: Centrality estimation in large networks. International Journal of
Bifurcation and Chaos 17(7), 2303–2318 (2007) 119
[52] Brown, J., Broderick, A., Lee, N.: Word of mouth communication within online communities: Conceptualizing the online social network. Journal of interactive marketing 21(3),
2–20 (2007) 119
[53] Brown, J., Reingen, P.: Social ties and word-of-mouth referral behavior. Journal of
Consumer Research pp. 350–362 (1987) 154
[54] Cafarella, M., Halevy, A., Khoussainova, N.: Data integration for the relational web.
Proceedings of the VLDB Endowment 2(1), 1090–1101 (2009) 139
[55] de Campos, L., Fernández-Luna, J., Huete, J., Romero, A.: Probabilistic methods for
structured document classification at inex07. Focused Access to XML Documents pp.
195–206 (2008) 20
[56] Carrington, P., Scott, J., Wasserman, S.: Models and methods in social network analysis.
Cambridge University Press (2005) 2, 78, 115
[57] Catanese, S., De Meo, P., Ferrara, E., Fiumara, G.: Analyzing the facebook friendship
graph. In: Proc. of the 1st International Workshop on Mining the Future Internet, vol.
685, pp. 14–19 (2010) 4, 52
[58] Catanese, S., De Meo, P., Ferrara, E., Fiumara, G., Provetti, A.: Crawling facebook
for social network analysis purposes. In: Proc. of the International Conference on Web
Intelligence, Mining and Semantics, pp. 52:1–52:8. ACM (2011) 4, 52, 109
[59] Catanese, S., De Meo, P., Ferrara, E., Fiumara, G., Provetti, A.: Extraction and analysis
of facebook friendship relations. Computational Social Networks: Mining and Visualization (2012) 4
[60] Catanese, S., Ferrara, E., Fiumara, G.: Forensic analysis of phone call networks. Social
Network Analysis and Mining (In press) 4
[61] Centola, D.: The spread of behavior in an online social network experiment. Science
329(5996), 1194 (2010) 108
[62] Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information
extraction systems. IEEE Transactions on Knowledge and Data Engineering 18(10),
1411–1428 (2006) 18
[63] Chau, D., Pandit, S., Wang, S., Faloutsos, C.: Parallel crawling for online social networks.
In: Proc. of the 16th International Conference on the World Wide Web, pp. 1283–1284
(2007) 49, 52
[64] Chen, H., Chau, M., Zeng, D.: Ci spider: a tool for competitive intelligence on the web.
Decision Support Systems 34(1), 1–17 (2002) 23
[65] Chen, W., Wang, C., Wang, Y.: Scalable influence maximization for prevalent viral
marketing in large-scale social networks. In: Proc. of the 16th SIGKDD international
conference on Knowledge discovery and data mining, pp. 1029–1038. ACM (2010) 154
[66] Chen, W., Wang, Y., Yang, S.: Efficient influence maximization in social networks. In:
Proc. of the 15th SIGKDD international conference on Knowledge discovery and data
mining, pp. 199–208. ACM (2009) 154
[67] Chung, F., Lu, L.: The diameter of sparse random graphs. Advances in Applied Mathematics 26(4), 257–279 (2001) 12
[68] Clauset, A.: Finding local community structure in networks. Physical Review E 72(2),
026,132 (2005) 74
[69] Clauset, A., Newman, M., Moore, C.: Finding community structure in very large networks. Physical review E 70(6), 066,111 (2004) 74, 86
[70] Colizza, V., Flammini, A., Serrano, M., Vespignani, A.: Detecting rich-club ordering in
complex networks. Nature Physics 2(2), 110–115 (2006) 110
[71] Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetic progressions. Journal of symbolic computation 9(3), 251–280 (1990) 12
[72] Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms. MIT Press,
2nd edition (2001) 11, 12
[73] Crescenzi, V., Mecca, G.: Automatic information extraction from large websites. Journal
of the ACM 51(5), 731–779 (2004) 29
[74] Crescenzi, V., Mecca, G., Merialdo, P.: Roadrunner: Towards automatic data extraction
from large web sites. In: Proc. of the 27th International Conference on Very Large Data
Bases, pp. 109–118. Morgan Kaufmann Publishers Inc. (2001) 20, 29
[75] Danon, L., Dı́az-Guilera, A., Duch, J., Arenas, A.: Comparing community structure
identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09,008
(2005) 145
[76] Dave, K., Lawrence, S., Pennock, D.M.: Mining the peanut gallery: opinion extraction
and semantic classification of product reviews. In: Proc. of the 12th international conference on World Wide Web, pp. 519–528. ACM (2003) 25
[77] De Meo, P., Ferrara, E., Abel, F., Aroyo, L., Houben, G.J.: Analyzing user behavior
across social web environments. ACM Transactions on Intelligent Systems and Technology
(Under Review) 4
[78] De Meo, P., Ferrara, E., Fiumara, G.: Finding similar users in facebook. Social Networking and Community Behavior Modeling: Qualitative and Quantitative Measurement pp.
304–323 (2011) 4
[79] De Meo, P., Ferrara, E., Fiumara, G., Provetti, A.: Generalized louvain method for
community detection in large networks. In: Proc. of the 11th International Conference
on Intelligent Systems Design and Applications, pp. 88–93 (2011) 5, 139
[80] De Meo, P., Ferrara, E., Fiumara, G., Provetti, A.: Improving recommendation quality by
merging collaborative filtering and social relationships. In: Proc. of the 11th International
Conference on Intelligent Systems Design and Applications, pp. 587–592 (2011) 5
[81] De Meo, P., Ferrara, E., Fiumara, G., Ricciardello, A.: A novel measure of edge centrality
in social networks. Knowledge-based Systems (2012). DOI 10.1016/j.knosys.2012.01.007
5, 109, 122
[82] De Meo, P., Garro, A., Terracina, G., Ursino, D.: X-learn: an xml-based, multi-agent system for supporting user-device adaptive e-learning. On The Move to Meaningful Internet
Systems 2003: CoopIS, DOA, and ODBASE pp. 739–756 (2003) 140
[83] De Meo, P., Nocera, A., Quattrone, G., Rosaci, D., Ursino, D.: Finding reliable users
and social networks in a social internetworking system. In: Proc. of the International
Database Engineering & Applications Symposium, pp. 173–181. ACM (2009) 141
[84] De Meo, P., Nocera, A., Terracina, G., Ursino, D.: Recommendation of similar users,
resources and social networks in a social internetworking scenario. Information Sciences
181(7), 1285–1305 (2011) 67
[85] Descher, M., Feilhauer, T., Ludescher, T., Masser, P., Wenzel, B., Brezany, P., Elsayed,
I., Wöhrer, A., Tjoa, A., Huemer, D.: Position paper: Secure infrastructure for scientific
data life cycle management. In: International Conference on Availability, Reliability and
Security, pp. 606–611. IEEE (2009) 26
[86] D’haeseleer, P.: How does gene expression clustering work? Nature Biotechnology 23(12),
1499–1502 (2005) 147
[87] Dijkstra, E.: A note on two problems in connexion with graphs. Numerische mathematik
1(1), 269–271 (1959) 12
[88] Ding, L., Steil, D., Dixon, B., Parrish, A., Brown, D.: A relation context oriented approach to identify strong ties in social networks. Knowledge-Based Systems 24(8), 1187–
1195 (2011) 116, 141
[89] Dodson Jr, J., Muller, E.: Models of new product diffusion through advertising and
word-of-mouth. Management Science pp. 1568–1578 (1978) 154
[90] Du, N., Wu, B., Pei, X., Wang, B., Xu, L.: Community detection in large-scale social
networks. In: Proc. of the 9th WebKDD and 1st SNA-KDD workshop on Web mining
and social network analysis, pp. 16–25. ACM (2007) 88
[91] Duch, J., Arenas, A.: Community detection in complex networks using extremal optimization. Physical Review E 72(2), 027,104 (2005) 74
[92] Dunbar, R.: Grooming, gossip, and the evolution of language. Harvard University Press
(1998) 127
[93] Embley, D., Campbell, D., Jiang, Y., Liddle, S., Lonsdale, D., Ng, Y., Smith, R.:
Conceptual-model-based data extraction from multiple-record web pages. Data & Knowledge Engineering 31(3), 227–251 (1999) 20
[94] Erdős, P., Rényi, A.: On random graphs. Publicationes Mathematicae 6(26), 290–297
(1959) ix, 57, 67, 68, 69, 73, 75
[95] Everett, M., Borgatti, S.: Peripheries of cohesive subsets. Social Networks 21, 397–407
(1999) 74
[96] Everett, M., Borgatti, S.: Ego network betweenness. Social Networks 27(1), 31–38 (2005)
121
[97] Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the internet
topology. In: ACM SIGCOMM Computer Communication Review, vol. 29, pp. 251–262.
ACM (1999) 56, 57
[98] Feldman, I., Rzhetsky, A., Vitkup, D.: Network properties of genes harboring inherited
disease mutations. Proceedings of the National Academy of Sciences 105(11), 4323 (2008)
154
[99] Ferrara, E.: Community structure discovery in facebook. International Journal of Social
Network Mining 1(1), 67–90 (2012) 5, 81, 109
[100] Ferrara, E.: Conclude: Complex network cluster detection for social and biological applications. In: WWW ’12 PhD Symposium (Under Review) 5
[101] Ferrara, E.: A large-scale community structure analysis in facebook. PLoS ONE (Under
Review) 5, 109
[102] Ferrara, E., Baumgartner, R.: Automatic wrapper adaptation by tree edit distance matching. Combinations of Intelligent Methods and Applications pp. 41–54 (2011) 3, 44
[103] Ferrara, E., Baumgartner, R.: Design of automatically adaptable web wrappers. In:
Proc. of the 3rd International Conference on Agents and Artificial Intelligence, pp. 211–
217 (2011) 3, 44
[104] Ferrara, E., Baumgartner, R.: Intelligent self-repairable web wrappers. In: Lecture Notes
in Computer Science (Proc. of the 12th International Conference on Advances in Artificial
Intelligence), vol. 6934, pp. 274–285 (2011) 3, 44
[105] Ferrara, E., Fiumara, G.: Topological features of online social networks. Communications
on Applied and Industrial Mathematics 2(2), 1–20 (2011) 4
[106] Ferrara, E., Fiumara, G., Baumgartner, R.: Web data extraction, applications and techniques: A survey. ACM Computing Surveys (Under Review) 3
[107] Fiumara, G.: Automated information extraction from web sources: a survey. Proc.
of Between Ontologies and Folksonomies Workshop in 3rd International Conference on
Communities and Technology pp. 1–9 (2007) 18
[108] Flesca, S., Manco, G., Masciari, E., Rende, E., Tagarelli, A.: Web wrapper induction: a
brief survey. AI Communications 17(2), 57–61 (2004) 18, 30
[109] Fortunato, S.: Community detection in graphs. Physics Reports 486, 75–174 (2010) 73,
80, 86, 87, 88, 139
[110] Fortunato, S., Barthélemy, M.: Resolution limit in community detection. Proceedings of
the National Academy of Sciences 104(1), 36 (2007) 80, 100, 106, 145
[111] Fortunato, S., Latora, V., Marchiori, M.: Method to find community structures based on
information centrality. Physical review E 70(5), 056,104 (2004) 126
[112] Freeman, L.: A set of measures of centrality based on betweenness. Sociometry 40(1),
35–41 (1977) 3, 14, 115, 117
[113] Freeman, L.: Centrality in social networks conceptual clarification. Social networks 1(3),
215–239 (1979) 14
[114] Freeman, L., Borgatti, S., White, D.: Centrality in valued graphs: A measure of betweenness based on network flow. Social Networks 13(2), 141–154 (1991) 120
[115] Friedkin, N.: Horizons of observability and limits of informal control in organizations.
Social Forces 62(1), 55–77 (1983) 3, 116, 118, 121
[116] Fruchterman, T., Reingold, E.: Graph drawing by force-directed placement. Software:
Practice and Experience 21(11), 1129–1164 (1991) 72
[117] Garrett, J.J.: Ajax: A new approach to web applications. Tech. rep., Adaptive Path
(2005). URL http://www.adaptivepath.com/ideas/essays/archives/000385.php 19
[118] Garton, L., Haythornthwaite, C., Wellman, B.: Studying online social networks. Journal
of Computer-Mediated Communication 3(1) (1997) 1, 45
[119] Gatterbauer, W.: Web harvesting. Encyclopedia of Database Systems pp. 3472–3473
(2009) 26
[120] Gatterbauer, W., Bohunsky, P.: Table extraction using spatial reasoning on the css2 visual
box model. In: AAAI ’06 Proc. of the 21st national conference on Artificial intelligence,
pp. 1313–1318. AAAI Press (2006) 28
[121] Gatterbauer, W., Bohunsky, P., Herzog, M., Krüpl, B., Pollak, B.: Towards domainindependent information extraction from web tables. In: Proc. of the 16th international
conference on World Wide Web, pp. 71–80. ACM (2007) 20
[122] Ghosh, R., Lerman, K.: Predicting influential users in online social networks. In: Proc.
of KDD workshop on Social Network Analysis (2010) 48
[123] Gilbert, E., Karahalios, K.: Predicting tie strength with social media. In: Proc. of the
27th international conference on Human factors in computing systems, pp. 211–220. ACM
(2009) 107, 109
[124] Girvan, M., Newman, M.: Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99(12), 7821 (2002) 3, 67, 115, 143
[125] Gjoka, M., Kurant, M., Butts, C., Markopoulou, A.: Walking in facebook: a case study
of unbiased sampling of OSNs. In: Proc. of the 29th conference on Information communications, pp. 2498–2506. IEEE (2010) 1, 46, 53, 55, 66, 92
[126] Gjoka, M., Kurant, M., Butts, C., Markopoulou, A.: Practical recommendations on
crawling online social networks. Selected Areas in Communications, IEEE Journal on
29(9), 1872–1892 (2011) 109
[127] Goh, K., Cusick, M., Valle, D., Childs, B., Vidal, M., Barabási, A.: The human disease
network. Proceedings of the National Academy of Sciences 104(21), 8685 (2007) 154
[128] Goh, K., Kahng, B., Kim, D.: Universal behavior of load distribution in scale-free networks. Physical Review Letters 87(27), 278,701 (2001) 62
[129] Golbeck, J., Hendler, J.: Inferring binary trust relationships in web-based social networks.
Transactions on Internet Technology 6(4), 497–529 (2006) 66
[130] Goldenberg, A., Zheng, A., Fienberg, S., Airoldi, E.: A survey of statistical network
models. Foundations and Trends in Machine Learning 2(2), 129–233 (2010) 49
[131] Goldenberg, J., Libai, B., Muller, E.: Talk of the network: A complex systems look at
the underlying process of word-of-mouth. Marketing letters 12(3), 211–223 (2001) 154
[132] Goldenberg, J., Libai, B., Muller, E.: Using complex systems analysis to advance marketing theory development: Modeling heterogeneity effects on new product growth through
stochastic cellular automata. Academy of Marketing Science Review 9(3), 1–18 (2001)
154
[133] Golub, G., Van Loan, C.: Matrix computations, vol. 3. Johns Hopkins University Press
(1996) 47
[134] Gottlob, G., Koch, C.: Logic-based web information extraction. ACM SIGMOD Record
33(2), 87–94 (2004) 28
[135] Gottlob, G., Koch, C.: Monadic datalog and the expressive power of languages for web
information extraction. Journal of the ACM 51(1), 74–113 (2004) 28
[136] Grabowicz, P., Ramasco, J., Moro, E., Pujol, J., Eguiluz, V.: Social features of online
networks: The strength of intermediary ties in online social media. PLoS ONE 7(1),
e29,358 (2012) 4, 108, 111
[137] Granovetter, M.: The strength of weak ties. American journal of sociology pp. 1360–1380
(1973) 85, 105, 106, 107, 108, 109
[138] Gregory, S.: An algorithm to find overlapping community structure in networks. Knowledge Discovery in Databases: PKDD 2007 pp. 91–102 (2007) 75, 88, 154
[139] Gross, R., Acquisti, A.: Information revelation and privacy in online social networks. In:
Proc. of the Workshop on Privacy in the Electronic Society, pp. 71–80. ACM (2005) 50
[140] Guimera, R., Amaral, L.: Functional cartography of complex metabolic networks. Nature
433(7028), 895 (2005) 87
[141] Hagen, L., Kahng, A.: New spectral methods for ratio cut partitioning and clustering.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 11(9),
1074–1085 (1992) 73, 87
[142] Hammersley, B.: Developing feeds with rss and atom. O’Reilly (2005) 18
[143] Hampton, K., Sessions, L., Her, E., Rainie, L.: Social isolation and new technology. PEW
Research Center 4 (2007) 86
[144] Han, H.: Conceptual modeling and ontology extraction for web information. Ph.D. thesis
(2002). Supervisor-Elmasri, Ramez 20
[145] Han, J., Kamber, M.: Data mining: concepts and techniques. Morgan Kaufmann Publishers Inc. (2000) 23
[146] Han, J., Kamber, M., Pei, J.: Data mining: concepts and techniques. Morgan Kaufmann
Pub (2011) 47
[147] Heider, F.: The psychology of interpersonal relations. Lawrence Erlbaum (1982) 109
[148] Holland, P., Leinhardt, S.: Transitivity in structural models of small groups. Comparative
Group Studies (1971) 71
[149] Holme, P., Kim, B.: Growing scale-free networks with tunable clustering. Physical Review
E 65(2), 026,107 (2002) ix, 71, 72, 77
[150] Hsu, C.N., Dung, M.T.: Generating finite-state transducers for semi-structured data
extraction from the web. Information systems 23(9), 521–538 (1998) 30
[151] Hu, X., Lin, T.Y., Song, I.Y., Lin, X., Yoo, I., Lechner, M., Song, M.: Ontology-based
scalable and portable information extraction system to extract biological knowledge from
huge collection of biomedical web documents. In: Proc. of the International Conference
on Web Intelligence, pp. 77–83. IEEE (2004) 20
[152] Hunter, D., Goodreau, S., Handcock, M.: Goodness of fit of social network models.
Journal of American Statistics Association 103(481), 248–258 (2008) 86
[153] Irmak, U., Suel, T.: Interactive wrapper generation with minimal user effort. In: Proc.
of the 15th international conference on World Wide Web, pp. 553–563. ACM (2006) 19
[154] Jeh, G., Widom, J.: Simrank: a measure of structural-context similarity. In: Proc. of
the 8th SIGKDD international conference on Knowledge discovery and data mining, pp.
538–543. ACM (2002) 47
[155] Jeong, H., Tombor, B., Albert, R., Oltvai, Z., Barabási, A.: The large-scale organization
of metabolic networks. Nature 407(6804), 651–654 (2000) 56, 57
[156] Jin, D., Liu, D., Yang, B., Liu, J.: Fast Complex Network Clustering Algorithm Using
Agents. In: Proc. of the 8th International Conference on Dependable, Autonomic and
Secure Computing, pp. 615–619 (2009) 87, 88, 89
[157] Kaiser, K., Miksch, S.: Information extraction. a survey. Tech. rep., E188 - Institut für
Softwaretechnik und Interaktive Systeme; Technische Universität Wien (2005) 18, 27
[158] Karrer, B., Levina, E., Newman, M.: Robustness of community structure in networks.
Physical Review E 77(4), 46,119 (2008) 86
[159] Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a
social network. In: Proc. of 9th international conference on Knowledge discovery and
data mining, pp. 137–146 (2003) 67, 119, 154
[160] Khare, R., Çelik, T.: Microformats: a pragmatic path to the semantic web. In: Proc. of
the 15th international conference on World Wide Web, pp. 865–866. ACM (2006) 18
[161] Kim, M., Han, J.: CHRONICLE: A Two-Stage Density-Based Clustering Algorithm for
Dynamic Networks. In: Proc. of the International Conference on Discovery Science,
Lecture Notes in Computer Science, pp. 152–167. Springer (2009) 87
[162] Kim, Y., Park, J., Kim, T., Choi, J.: Web information extraction by html tree edit
distance matching. In: International Conference on Convergence Information Technology,
pp. 2455–2460. IEEE (2007) 35
[163] Kim, Y.A., Song, H.S.: Strategies for predicting local trust based on trust propagation
in social networks. Knowledge-Based Systems 24(8), 1360–1371 (2011) 141
[164] Kleinberg, J.: Authoritative sources in a hyperlinked environment. Journal of the ACM
46(5), 604–632 (1999) 48, 71
[165] Kleinberg, J.: The small-world phenomenon: an algorithm perspective. In: Proc. of the
32nd annual symposium on Theory of computing, pp. 163–170. ACM (2000) 45, 66, 71
[166] Koschützki, D., Lehmann, K.A., Peeters, L., Richter, S., Tenfelde-Podehl, D., Zlotowski, O.: Centrality Indices. In: Network Analysis: Methodological Foundations, Lecture Notes in Computer Science, chap. 3, pp. 16–61. Springer (2005) 118
[167] Krüpl, B., Herzog, M., Gatterbauer, W.: Using visual cues for extraction of tabular
data from arbitrary html documents. In: Special interest tracks and posters of the 14th
international conference on World Wide Web, pp. 1000–1001. ACM (2005) 28
[168] Kuhlins, S., Tredwell, R.: Toolkits for generating wrappers. In: Revised Papers from
the International Conference NetObjectDays on Objects, Components, Architectures,
Services, and Applications for a Networked World, pp. 184–198. Springer Verlag (2003)
17
[169] Kumar, R., Novak, J., Tomkins, A.: Structure and evolution of online social networks.
Link Mining: Models, Algorithms, and Applications pp. 337–357 (2010) 1, 66
[170] Kurant, M., Markopoulou, A., Thiran, P.: On the bias of breadth first search (bfs) and of
other graph sampling techniques. In: Proc. of the 22nd International Teletraffic Congress,
pp. 1–8 (2010) 2, 4, 49, 52, 53, 92, 105
[171] Kushmerick, N.: Wrapper induction for information extraction. Ph.D. thesis, University
of Washington (1997). Chairperson-Weld, Daniel S. 27
[172] Kushmerick, N.: Wrapper induction: efficiency and expressiveness. Artificial Intelligence
118(1-2), 15–68 (2000) 30
[173] Kushmerick, N.: Finite-state approaches to web information extraction. Proc. of 3rd
Summer Convention on Information Extraction pp. 77–91 (2002) 17, 31
[174] Laender, A.H.F., Ribeiro-Neto, B.A., da Silva, A.S., Teixeira, J.S.: A brief survey of web
data extraction tools. ACM Sigmod Record 31(2), 84–93 (2002) 17, 18, 20, 21
[175] Lancichinetti, A., Fortunato, S., Kertész, J.: Detecting the overlapping and hierarchical
community structure in complex networks. New Journal of Physics 11, 033,015 (2009)
75, 154
[176] Lancichinetti, A., Kivelä, M., Saramäki, J.: Characterizing the community structure of
complex networks. PloS one 5(8), e11,976 (2010) 105
[177] Lancichinetti, A., Radicchi, F.: Benchmark graphs for testing community detection algorithms. Physical Review E 78(4), 046,110 (2008) 144
[178] Lancichinetti, A., Radicchi, F., Ramasco, J.: Finding statistically significant communities
in networks. PloS one 6(4), e18,961 (2011) 147
[179] Lappas, T., Terzi, E., Gunopulos, D., Mannila, H.: Finding effectors in social networks.
In: Proc. of the 16th SIGKDD international conference on Knowledge discovery and data
mining, pp. 1059–1068. ACM (2010) 154
[180] Lau, A., Tsui, E.: Knowledge management perspective on e-learning effectiveness.
Knowledge-Based Systems 22(4), 324–325 (2009) 140
[181] Lee, C., Reid, F., McDaid, A., Hurley, N.: Detecting highly overlapping community
structure by greedy clique expansion. In: Proc. of the 4th Workshop on Social Network
Mining and Analysis. ACM (2010) 75, 86
[182] Leicht, E., Holme, P., Newman, M.: Vertex similarity in networks. Physical Review E
73(2), 026,120 (2006) 47
[183] Leskovec, J.: Stanford Network Analysis Package (SNAP). URL http://snap.stanford.edu/ 58, 128
[184] Leskovec, J., Faloutsos, C.: Sampling from large graphs. In: Proc. of the 12th SIGKDD
international conference on Knowledge discovery and data mining, pp. 631–636. ACM
(2006) 46, 66, 78, 145
[185] Leskovec, J., Kleinberg, J., Faloutsos, C.: Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proc. of the 11th SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 177–187 (2005) 60
[186] Leskovec, J., Lang, K., Dasgupta, A., Mahoney, M.: Statistical properties of community
structure in large social and information networks. In: Proc. of the 17th international
conference on World Wide Web, pp. 695–704 (2008) 79
[187] Leskovec, J., Lang, K., Dasgupta, A., Mahoney, M.: Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics 6(1), 29–123 (2009) 89
[188] Leskovec, J., Lang, K., Mahoney, M.: Empirical comparison of algorithms for network
community detection. In: Proc. of the 19th international conference on World Wide Web,
pp. 631–640. ACM (2010) 87
[189] Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. Journal
of the American Society for Information Science and Technology 58(7), 1019–1031 (2007)
66, 127
[190] Ma, H., Yang, H., Lyu, M., King, I.: Mining social networks using heat diffusion processes
for marketing candidates selection. In: Proceeding of the 17th conference on Information
and knowledge management, pp. 233–242. ACM (2008) 154
[191] Madduri, K., Ediger, D., Jiang, K., Bader, D., Chavarria-Miranda, D.: A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness
centrality on massive datasets. In: Proc. of the International Symposium on Parallel &
Distributed Processing, pp. 1–8. IEEE (2009) 154
[192] Madras, N., Slade, G.: The self-avoiding walk. Birkhauser (1996) 116
[193] Mahmoud, H., Aboulnaga, A.: Schema clustering and retrieval for multi-domain pay-asyou-go data integration systems. In: Proc. of the international conference on Management
of data, pp. 411–422. ACM (2010) 139
[194] Manning, C.D., Schütze, H.: Foundations of statistical natural language processing. MIT
Press (1999) 20
[195] Mathioudakis, M., Koudas, N.: Efficient identification of starters and followers in social
media. In: Proc. of the International Conference on Extending Database Technology, pp.
708–719. ACM (2009) 48
[196] McCown, F., Nelson, M.: What happens when Facebook is gone? In: Proc. of the 9th
Joint Conference on Digital Libraries, pp. 251–254. ACM (2009) 50
[197] McDaid, A., Hurley, N.: Detecting highly overlapping communities with model-based
overlapping seed expansion. In: 2010 International Conference on Advances in Social
Networks Analysis and Mining, pp. 112–119. IEEE (2010) 75, 86
[198] McPherson, M., Smith-Lovin, L., Cook, J.: Birds of a feather: Homophily in social
networks. Annual review of sociology pp. 415–444 (2001) 109
[199] Melomed, E., Gorbach, I., Berger, A., Bateman, P.: Microsoft SQL Server 2005 Analysis
Services (SQL Server Series). Sams (2006) 23
[200] Meng, X., Hu, D., Li, C.: Schema-guided wrapper maintenance for web-data extraction.
In: Proc. of the 5th international workshop on Web information and data management,
pp. 1–8. ACM (2003) 31
[201] Mika, P.: Ontologies are us: A unified model of social networks and semantics. Web
Semantics: Science, Services and Agents on the World Wide Web 5(1), 5–15 (2007) 20
[202] Milgram, S.: The small world problem. Psychology Today 2(1), 60–67 (1967) 53, 56, 57,
60, 65, 67
[203] Mislove, A., Marcon, M., Gummadi, K., Druschel, P., Bhattacharjee, B.: Measurement
and analysis of online social networks. In: Proc. of the 7th SIGCOMM conference on
Internet measurement, pp. 29–42. ACM (2007) 1, 46, 49, 52, 66, 78, 117
[204] Monge, A.E.: Matching algorithm within a duplicate detection system. IEEE Data Engineering Bulletin 23(4) (2000) 19
[205] Mucha, P., Richardson, T., Macon, K., Porter, M., Onnela, J.: Community structure in
time-dependent, multiscale, and multiplex networks. Science 328(5980), 876 (2010) 109
[206] Muslea, I., Minton, S., Knoblock, C.: A hierarchical approach to wrapper induction. In:
Proc. of the 3rd annual conference on Autonomous Agents, pp. 190–197. ACM (1999) 30
[207] Newcomb, T.: The acquaintance process. (1961) 109
[208] Newman, M.: Scientific collaboration networks. I. Network Construction and Fundamental Results. Physical Review E 64(1), 16,131 (2001) 56
[209] Newman, M.: The structure of scientific collaboration networks. Proceedings of the
National Academy of Sciences 98(2), 404 (2001) 79
[210] Newman, M.: The Structure and Function of Complex Networks. SIAM Review 45(2),
167 (2003) 56, 86
[211] Newman, M.: A measure of betweenness centrality based on random walks. Social Networks 26(2), 175–188 (2004) 116, 117, 119, 120
[212] Newman, M.: Power laws, Pareto distributions and Zipf's law. Contemporary Physics
46(5), 323–351 (2005) 110
[213] Newman, M.: Finding community structure in networks using the eigenvectors of matrices. Physical Review E 74(3), 036,104 (2006) 74
[214] Newman, M.: Modularity and community structure in networks. Proceedings of the
National Academy of Sciences 103(23), 8577 (2006) 74
[215] Newman, M.: The first-mover advantage in scientific publication. Europhysics Letters
86, 68,001 (2009) 79
[216] Newman, M., Barabasi, A., Watts, D.: The structure and dynamics of networks. Princeton University Press (2006) 53, 86
[217] Newman, M., Girvan, M.: Finding and evaluating community structure in networks.
Physical Review E 69(2), 26,113 (2004) 74, 88, 108, 143
[218] Newman, M., Leicht, E.: Mixture models and exploratory analysis in networks. Proceedings of the National Academy of Sciences 104(23), 9564 (2007) 74, 87
[219] Newman, M., Watts, D.: Renormalization group analysis of the small-world network model.
Physics Letters A 263(4-6), 341–346 (1999) ix, 69, 72, 76
[220] Ng, A., Jordan, M., Weiss, Y.: On Spectral Clustering: Analysis and an algorithm. In:
Advances in Neural Information Processing Systems 14 (2001) 73
[221] Noh, J.D., Rieger, H.: Random walks on complex networks. Physical Review Letters 92,
118,701 (2004) 116, 117
[222] Onnela, J., Reed-Tsochas, F.: The spontaneous emergence of social influence in online
systems. Proceedings of the National Academy of Sciences 107, 18,375 (2010) 48
[223] Opsahl, T., Agneessens, F., Skvoretz, J.: Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks 32(3), 245–251 (2010) 14
[224] Opsahl, T., Colizza, V., Panzarasa, P., Ramasco, J.: Prominence and control: The
weighted rich-club effect. Physical review letters 101(16), 168,702 (2008) 110
[225] Oti, M., Brunner, H.: The modular nature of genetic diseases. Clinical genetics 71(1),
1–11 (2007) 154
[226] Pajevic, S., Plenz, D.: The organization of strong links in complex networks. Arxiv
preprint arXiv:1109.2577 (2011) 108
[227] Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community
structure of complex networks in nature and society. Nature 435, 9 (2005) 74, 86, 87
[228] Palmer, C., Steffan, J.: Generating network topologies that obey power laws. In: Global
Telecommunications Conference, vol. 1, pp. 434–438. IEEE (2002) 56, 60
[229] Partow, A.: General Purpose Hash Function Algorithms. URL http://www.partow.net/programming/hashfunctions/ 55
[230] Pastor-Satorras, R., Vázquez, A., Vespignani, A.: Dynamical and correlation properties of the
Internet. Physical Review Letters 87(25), 258,701 (2001) 67
[231] Petróczi, A., Nepusz, T., Bazsó, F.: Measuring tie-strength in virtual social networks.
Connections 27(2), 39–52 (2006) 107, 109
[232] Phan, X., Horiguchi, S., Ho, T.: Automated data extraction from the web with conditional
models. International Journal of Business Intelligence and Data Mining 1(2), 194–209
(2005) 19, 30
[233] Plake, C., Schiemann, T., Pankalla, M., Hakenberg, J., Leser, U.: Alibaba: Pubmed as a
graph. Bioinformatics 22(19), 2444–2445 (2006) 26
[234] Porter, M., Onnela, J., Mucha, P.: Communities in networks. Notices of the American
Mathematical Society 56(9), 1082–1097 (2009) 81, 86
[235] Quattrone, G., Capra, L., De Meo, P., Ferrara, E., Ursino, D.: Effective retrieval of
resources in folksonomies using a new tag similarity measure. In: Proc. of the 20th
Conference on Information and Knowledge Management, pp. 545–550. ACM (2011) 24
[236] Quattrone, G., Ferrara, E., De Meo, P., Capra, L.: Measuring similarity in large-scale
folksonomies. In: Proc. of the 23rd International Conference on Software Engineering
and Knowledge Engineering, pp. 385–391 (2011) 24
[237] Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., Parisi, D.: Defining and identifying
communities in networks. Proceedings of the National Academy of Sciences 101(9), 2658
(2004) 86, 145
[238] Raghavan, U., Albert, R., Kumara, S.: Near linear time algorithm to detect community
structures in large-scale networks. Physical Review E 76(3), 036,106 (2007) 87, 88
[239] Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. The
VLDB Journal 10(4), 334–350 (2001) 19
[240] Rahm, E., Do, H.H.: Data cleaning: Problems and current approaches. IEEE Data
Engineering Bulletin 23(4) (2000) 19
[241] Ratkiewicz, J., Conover, M., Meiss, M., Goncalves, B., Flammini, A., Menczer, F.: Detecting and tracking political abuse in social media. In: Proc. 5th International AAAI
Conference on Weblogs and Social Media (2011) 154
[242] Ratkiewicz, J., Conover, M., Meiss, M., Gonçalves, B., Patil, S., Flammini, A., Menczer,
F.: Truthy: Mapping the spread of astroturf in microblog streams. In: Proc. of the 20th
international conference companion on World wide web, pp. 249–252. ACM (2011) 154
[243] Redner, S.: How popular is your paper? An empirical study of the citation distribution.
The European Physical Journal B 4(2), 131–134 (1998) 56
[244] Rodriguez, M.: Grammar-based random walkers in semantic networks. Knowledge-Based
Systems 21(7), 727–739 (2008) 140
[245] Rodriguez, M., Watkins, J.: Grammar-based geodesics in semantic networks. KnowledgeBased Systems 23(8), 844–855 (2010) 116, 140
[246] Romero, D., Galuba, W., Asur, S., Huberman, B.: Influence and passivity in social
media. In: Proc. of the 20th International Conference Companion on World Wide Web,
pp. 113–114. ACM (2011) 48
[247] Romero, D., Kleinberg, J.: The Directed Closure Process in Hybrid Social-Information
Networks, with an Analysis of Link Formation on Twitter. In: Proc. of the 4th International Conference on Weblogs and Social Media (2010) 49
[248] Sabidussi, G.: The centrality index of a graph. Psychometrika 31(4), 581–603 (1966) 3,
14, 115
[249] Sahuguet, A., Azavant, F.: Building light-weight wrappers for legacy web data-sources
using w4f. In: Proc. of the 25th International Conference on Very Large Data Bases, pp.
738–741. Morgan Kaufmann Publishers Inc. (1999) 20, 28
[250] Sarawagi, S.: Information extraction. Foundations and trends in databases 1(3), 261–377
(2008) 18, 27
[251] Sebastiani, F.: Machine learning in automated text categorization. ACM computing
surveys 34(1), 1–47 (2002) 27
[252] Seidel, R.: On the all-pairs-shortest-path problem. In: Proc. of the 24th Symposium on
Theory of Computing, pp. 745–749. ACM (1992) 12
[253] Selkow, S.: The tree-to-tree editing problem. Information processing letters 6(6), 184–186
(1977) 33, 40
[254] Shah, D., Zaman, T.: Community detection in networks: The leader-follower algorithm.
In: Proc. of the Workshop on Networks Across Disciplines: Theory and Applications, pp.
1–8 (2010) 74, 86
[255] Snasel, V., Horak, Z., Abraham, A.: Understanding social networks using formal concept
analysis. In: Proc. of the International Conference on Web Intelligence and Intelligent
Agent Technology, vol. 3, pp. 390–393. IEEE (2008) 47
[256] Snasel, V., Horak, Z., Kocibova, J., Abraham, A.: Reducing social network dimensions
using matrix factorization methods. In: International Conference on Advances in Social
Network Analysis and Mining, pp. 348–351. IEEE (2009) 13, 47
[257] Soderland, S.: Learning information extraction rules for semi-structured and free text.
Machine learning 34(1), 233–272 (1999) 30
[258] Song, X., Chi, Y., Hino, K., Tseng, B.: Identifying opinion leaders in the blogosphere. In:
Proc. of the 16th Conference on Information and Knowledge Management, pp. 971–974.
ACM (2007) 48
[259] Sridhar, V., Narasimha Murty, M.: Knowledge-based clustering approach for data abstraction. Knowledge-Based Systems 7(2), 103–113 (1994) 116, 139
[260] Staab, S., Domingos, P., Mike, P., Golbeck, J., Ding, L., Finin, T., Joshi, A., Nowak, A.,
Vallacher, R.: Social networks applied. IEEE Intelligent Systems 20(1), 80–93 (2005) 66,
119
[261] Stephenson, K., Zelen, M.: Rethinking centrality: Methods and examples. Social Networks 11(1), 1–37 (1989) 116, 118
[262] Sun, J., Xie, Y., Zhang, H., Faloutsos, C.: Less is more: Compact matrix decomposition
for large sparse graphs. Statistical Analysis and Data Mining 1, 6–22 (2008) 13
[263] Tanaka, M., Ishida, T.: Ontology extraction from tables on the web. In: Proc. of the
International Symposium on Applications and the Internet, pp. 284–290. IEEE (2006) 20
[264] Tarjan, R.: Depth-first search and linear graph algorithms. In: Conference Record 12th
Annual Symposium on Switching and Automata Theory, pp. 114–121. IEEE (1971) 11
[265] Traud, A., Kelsic, E., Mucha, P., Porter, M.: Comparing Community Structure to Characteristics in Online Collegiate Social Networks. SIAM Review pp. 1–17 (2011) 86, 91
[266] Travers, J., Milgram, S.: An experimental study of the small world problem. Sociometry
32(4), 425–443 (1969) 53, 56, 57, 65, 67
[267] Trusov, M., Bucklin, R., Pauwels, K.: Effects of word-of-mouth versus traditional marketing: Findings from an internet social networking site. Journal of Marketing 73(5),
90–102 (2009) 119
[268] Turmo, J., Ageno, A., Català, N.: Adaptive information extraction. ACM Computing
Surveys 38(2), 4 (2006) 30
[269] Ugander, J., Karrer, B., Backstrom, L., Marlow, C.: The anatomy of the facebook social
graph. Arxiv preprint arXiv:1111.4503 (2011) 4, 48, 104, 109
[270] Viswanath, B., Mislove, A., Cha, M., Gummadi, K.P.: On the evolution of user interaction
in facebook. In: Proc. of the 2nd SIGCOMM Workshop on Social Networks (2009) 128,
145
[271] Wang, P., Hawk, W., Tenopir, C.: Users’ interaction with world wide web resources:
an exploratory study using a holistic approach. Information processing & management
36(2), 229–251 (2000) 18
[272] Wasserman, S., Faust, K.: Social network analysis: Methods and applications. Cambridge
University Press (1994) 1
[273] Watts, D.: Small worlds: the dynamics of networks between order and randomness.
Princeton University Press (2004) 66
[274] Watts, D., Strogatz, S.: Collective dynamics of small-world networks. Nature 393(6684),
440–442 (1998) ix, 57, 66, 67, 70, 71, 72, 76, 80
[275] Wei, Y., Cheng, C.: Towards efficient hierarchical designs by ratio cut partitioning. In:
Proc. of the International Conference on Computer-Aided Design, pp. 298–301 (1989) 73
[276] Weikum, G.: Harvesting, searching, and ranking knowledge on the web: invited talk.
In: Proc. of the 2nd International Conference on Web Search and Data Mining, pp. 3–4.
ACM (2009) 26
[277] Wilson, C., Boe, B., Sala, A., Puttaswamy, K., Zhao, B.: User interactions in social
networks and their implications. In: Proc. of the 4th European Conference on Computer
Systems, pp. 205–218. ACM (2009) 49, 52, 53, 55
[278] Winograd, T.: Understanding natural language. Cognitive Psychology 3(1), 1–191 (1972)
20
[279] Xia, Z.: Fighting criminals: Adaptive inferring and choosing the next investigative objects
in the criminal network. Knowledge-Based Systems 21(5), 434–442 (2008) 141
[280] Xia, Z., Bu, Z.: Community detection based on a semantic network. Knowledge-Based
Systems p. In Press (2011) 116, 139
[281] Xiang, R., Neville, J., Rogati, M.: Modeling relationship strength in online social networks. In: Proc. of the 19th international conference on World wide web, pp. 981–990.
ACM (2010) 107, 109
[282] Xie, J., Kelley, S., Szymanski, B.: Overlapping community detection in networks: the
state of the art and comparative study. Arxiv preprint arXiv:1110.5813 (2011) 154
[283] Xie, J., Szymanski, B., Liu, X.: Slpa: Uncovering overlapping communities in social
networks via a speaker-listener interaction dynamic process. In: Proc. of the Workshop
on Data Mining Technologies for Computational Collective Intelligence (2011) 154
[284] Xu, Y., Weng, J., Sharma, A., Yussupov, D.: Web content acquisition in web content
aggregation service based on digital earth geospatial framework. In: Proc. of the 19th International Conference on Geoinformatics, pp. 1–5. IEEE (2011) 3
[285] Yang, W.: Identifying syntactic differences between two programs. Software - Practice
and Experience 21(7), 739–755 (1991) 36
[286] Ye, S., Lang, J., Wu, F.: Crawling Online Social Graphs. In: Proc. of the 12th International Asia-Pacific Web Conference, pp. 236–242. IEEE (2010) 45, 46, 52
[287] Zachary, W.: An information flow model for conflict and fission in small groups. Journal
of Anthropological Research 33(4), 452–473 (1977) 66
[288] Zanasi, A.: Competitive intelligence through data mining public sources. Competitive
Intelligence Review 9(1), 44–54 (1998) 23
[289] Zhai, Y., Liu, B.: Web data extraction based on partial tree alignment. In: Proc. of the
14th international conference on World Wide Web, pp. 76–85. ACM (2005) 29, 35, 36
[290] Zhai, Y., Liu, B.: Structured data extraction from the web based on partial tree alignment.
IEEE Transactions on Knowledge and Data Engineering 18(12), 1614–1628 (2006) 29
[291] Zhao, H.: Automatic wrapper generation for the extraction of search result records
from search engines. Ph.D. thesis, State University of New York at Binghamton (2007).
Adviser-Meng, Weiyi 19
[292] Zhao, J., Wu, J., Xu, K.: Weak ties: Subtle role of information diffusion in online social
networks. Physical Review E 82(1), 016,105 (2010) 108
[293] Zhao, Y., Levina, E., Zhu, J.: Community extraction for social networks. In: Proc. of
the Joint Statistical Meetings (2011) 86
[294] Zhou, S., Mondragón, R.: The rich-club phenomenon in the internet topology. IEEE
Communications Letters 8(3), 180–182 (2004) 110
Declaration
I herewith declare that I have produced this Thesis without the prohibited assistance
of third parties and without making use of aids other than those specified; notions
taken over directly or indirectly from other sources have been identified as such.
This Thesis has not previously been presented in identical or similar form to any
other Italian or foreign examination board.
The thesis work was conducted from January 2009 to December 2011 under the
supervision of Prof. Giacomo Fiumara at the University of Messina.