The Pennsylvania State University
The Graduate School

ANALYSIS OF EVOLVING GRAPHS
A Dissertation in
Industrial Engineering and Operations Research
by
An-Yi Chen
© 2009 An-Yi Chen
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
May 2009
The thesis of An-Yi Chen was reviewed and approved∗ by the following:
Dennis K. J. Lin
Distinguished Professor of Supply Chain and Information Systems
Thesis Advisor, Chair of Committee
M. Jeya Chandra
Professor of Industrial Engineering
Graduate Program Officer for Industrial Engineering
Soundar R. T. Kumara
Allen E. Pearce/Allen M. Pearce Professor of Industrial Engineering
Zan Huang
Assistant Professor of Supply Chain and Information Systems
∗Signatures are on file in the Graduate School.
Abstract
Despite the tremendous variety of complex networks in the natural, physical, and
social worlds, little is known scientifically about the common rules that underlie
all networks. What do real-world networks look like? How do they evolve over time? How can we tell an abnormally evolving network from a normal one? To answer these questions, we propose a novel analytical approach, lying at the intersection of graph theory and time series analysis, to study the network evolution process. Utilizing the notion of evolution formalizes a time domain in the problem. Specifically, we model the evolution process through the creation of a sequence of sample graphs. Slicing the entire network evolution process by time stamps provides a sequence of sampled graphs within the study period and opens the opportunity to take a time-series-like approach to tackle the problem.
One of the major contributions of this thesis is a novel approach that incorporates univariate time series models into a sequence of time graphs to model the network evolution process. Specifically, two algorithms, UN-TSN and DN-TSN, are proposed to study network evolution. UN-TSN is proposed for the simple network evolution process; it deals with networks with undirected edges and without multiple edges or loops. DN-TSN is proposed for the general network evolution process, which involves networks with directed edges and with multiple edges and loops. To the best of our knowledge, this is the first work lying at the intersection of graph theory and time series analysis to study network evolution.
In addition, the proposed approaches provide the network research community an effective and efficient framework to study the evolution process of real-world networked systems. Given general graph statistics, such as the number of nodes, number of edges, and node degree, the proposed model is capable of producing a reliable predictive graph preserving the respective key principles that govern the graph structures. A predictive graph of this kind is very useful in the context of extrapolation as well as in the detection of abnormal patterns. Specifically, our model not only helps form a basis for validating scenarios of graph evolution but is also capable of continuously monitoring the evolving patterns and detecting anomalies.
We also present a valuable case study using a distinct data set, the Enron corpus, to validate the proposed methodology. This data set is a touchstone, providing a substantial collection of real email traffic that is publicly available. We transform the temporal email communication relationships into a graph representation, where each distinct email address corresponds to a node, and the presence of emails between two distinct email addresses corresponds to an edge. Given such a sequence of time graphs, the objectives are to observe graph evolutionary patterns and generate graph predictions as a null model. The capability of generating reliable predictive graphs enables us to anticipate changes in communication patterns that emerge gradually over time, as well as to discover indirect senders and recipients within the structure. Note that although the motivation of our work is an email communication network, the proposed method is fairly general and could be applied to other domains as well.
Table of Contents
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Chapter 1
Introduction 1
1.1 Research objectives . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Problem definition: Time series network (TSN) problem . . . . . . . . . 5
1.3 Uniqueness, contribution and potential applications . . . . . . . . . . 6
1.4 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . 7

Chapter 2
Literature Reviews 10
2.1 Networks Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Existing network models . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Random graph models . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1.1 Properties of random graph models . . . . . . . . . . . . . . . . . 13
2.2.1.2 Limitation of classical random graph models . . . . . . . . . . . . 14
2.2.1.3 Random graph model with prescribed degree sequence . . . . . . . . 15
2.2.2 Small world networks . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.3 Scale-free network . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.3.1 Scale-free metric . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Statistical properties of complex networks . . . . . . . . . . . . . . 21
2.3.1 Degree distribution . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.2 Clustering coefficient . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.3 Average shortest-path length . . . . . . . . . . . . . . . . . . . . 22
2.3.3.1 Small-world phenomenon . . . . . . . . . . . . . . . . . . . . . . 23
2.3.4 Mixing patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.5 Centrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.6 Temporal network evolution . . . . . . . . . . . . . . . . . . . . . 26
2.4 Real networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Chapter 3
Simple Evolving Networks 32
3.1 Proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.1 Phase-I: Node degree prediction . . . . . . . . . . . . . . . . 33
3.1.2 Phase-II: Link association . . . . . . . . . . . . . . . . . . . 34
3.2 General properties of the predicted graph and simulation results . . 37
3.2.1 Predicted graph property 1: Assortativity . . . . . . . . . . 38
3.2.2 Predicted graph property 2: Clustering coefficient . . . . . . 38
3.2.3 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.3.1 Description of test sets . . . . . . . . . . . . . . . . 39
3.2.3.2 Graphical visualization of test sets vs. predicted
graphs . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.3.3 Measurements and evaluation of resultant graph . . 41
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Chapter 4
Directed Evolving Networks 49
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.1 Phase-I: Bipartite graph formation . . . . . . . . . . . . . . . . . 51
4.2.2 Phase-II: Node degree prediction . . . . . . . . . . . . . . . . . . 52
4.2.3 Phase-III: Link association . . . . . . . . . . . . . . . . . . . . . 52
4.3 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Simulated Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Partition-based algorithm . . . . . . . . . . . . . . . . . . . . . . . 58
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Chapter 5
Data and Experiment Results 64
5.1 Email communication dataset: Enron . . . . . . . . . . . . . . . . . 64
5.1.1 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.1.2 Data exploration . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Undirected graph evolution: Experiments and Results . . . . . . . . . . 70
5.2.1 Experiment I and performance evaluation . . . . . . . . . . . . . . . 70
5.2.2 Experiment II and performance evaluation . . . . . . . . . . . . . . 72
5.2.3 Experiment III and performance evaluation . . . . . . . . . . . . . . 74
5.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Directed graph evolution - Enron dataset: Experiments and results . . . 77
5.3.1 Generation of time-indexed sample graphs . . . . . . . . . . . . . . 77
5.3.2 Partition-based graph prediction . . . . . . . . . . . . . . . . . . 81
5.3.3 Performance evaluation . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

Chapter 6
Conclusion and Future Research 84
6.1 TSN Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.2 Hierarchical solution frameworks: UN-TSN and DN-TSN . . . . . . . . . . 85
6.3 Uniqueness and contribution of the thesis . . . . . . . . . . . . . . . 86
6.4 Future works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

Bibliography 89
List of Figures
1.1  Graphical illustration of the time series network (TSN) problem. Gt represents a graph sampled in the time interval [t − 1, t), where t = 1, 2, . . . , T. The objective here is to use a sequential series of time-indexed graphs to predict realistic and reliable future graph(s) ĜT+1 at time T + 1. . . . 5
1.2  Potential real-world applications of TSN. For illustrating purposes, we only select a few networked systems pervasive in the real world. . . . 8
2.1  An illustration of networks with different kinds of edges. . . . 11
2.2  Evolution of ER random graph model, G(n, p), with n = 20 . . . 14
2.3  Watts-Strogatz small world model with N = 20 and K = 6 . . . 17
3.1  Scalability of phase-II link association algorithm (HL). . . . 35
3.2  Example resultant graphs G_D^HL and G_D^SMAX with given node degree sequence D = [6,4,3,2,2,2,1,1,1,1,1] . . . 37
3.3  Graph visualization of goal graphs and their respective predicted outputs from HL and SMAX algorithms. . . . 42
3.4  Relationship between $ and graph size. . . . 46
3.5  Test D3, D5, and D7 and the respective assortative values of R(G_D^SMAX) and R(G_D^HL) . . . 46
4.1  Illustrating example of bipartite graph formation . . . 51
4.2  The bipartite graph shown in Figure 4.1 serves as the illustrating example to demonstrate the recovery results from different algorithms in phase-III: link association. Alg. 1 accurately recovers all 6 edges: e1,0, e2,0, e3,0, e4,0, e3,1, and e4,2. Alg. 2 accurately recovers 4 edges: e1,0, e2,0, e3,0, and e4,2. Alg. 3 accurately recovers 4 edges: e1,0, e2,0, e3,0, and e4,0. . . . 54
4.3  The in-degree distributions of the simulated data sets G100 to G500. IN(N=100) is the in-degree distribution for G100; IN(N=200) is the in-degree distribution for G200; IN(N=300) is the in-degree distribution for G300; IN(N=400) is the in-degree distribution for G400; and IN(N=500) is the in-degree distribution for G500. . . . 55
4.4  The out-degree distributions of the simulated data sets G100 to G500. OUT(N=100) is the out-degree distribution for G100; OUT(N=200) is the out-degree distribution for G200; OUT(N=300) is the out-degree distribution for G300; OUT(N=400) is the out-degree distribution for G400; and OUT(N=500) is the out-degree distribution for G500. . . . 56
4.5  Histograms of predicted edge accuracy rate from the configuration model with given bipartite node degree sequence. Each histogram contains 300 samples of the predicted edge accuracy rate. The X-axis in each histogram corresponds to the predicted edge accuracy rate and the Y-axis is the frequency of each accuracy rate. Results of the five test sets G100-G500 are shown in sequential labels from (a) to (e): (a) represents the results for test set G100; (b) for G200; (c) for G300; (d) for G400; and (e) for G500. . . . 58
4.6  Edge prediction accuracy versus graph density. . . . 60
4.7  An illustrating example of weighted graph construction. Figure 4(a) contains three sequential time-indexed graphs D1, D2, and D3. Figure 4(b) illustrates the frequency of link appearances as weights in the observed time period and forms the corresponding weighted graph. . . . 60
5.1  Monthly average node degree of the 38 monthly graphs extracted from the Enron dataset. . . . 67
5.2  Number of edges versus number of nodes, in log-log scales, for the increasing period May 1999 to October 2001, with slope = 1.2583. . . . 68
5.3  Density, average clustering coefficient and transitivity of the 38 time graphs . . . 68
5.4  Node degree versus counts, in log-log scales, of the October 2001 time graph, with slope = -1.2145 . . . 69
5.5  Graph visualization of the time graphs from January 2001 to June 2001 and their respective predicted outputs from the SMAX and HL algorithms. . . . 71
5.6  S(G) comparison between resultant projection graphs from the HL and SMAX algorithms. The X-axis represents six different monthly projection results. 1: January 2001, 2: February 2001, 3: March 2001, 4: April 2001, 5: May 2001, and 6: June 2001. . . . 72
5.7  Average clustering coefficient comparison between resultant projection graphs from the HL and SMAX algorithms. The X-axis represents six different monthly projection results. 1: January 2001, 2: February 2001, 3: March 2001, 4: April 2001, 5: May 2001, and 6: June 2001. . . . 72
5.8  Transitivity comparison between resultant projection graphs from the HL and SMAX algorithms. The X-axis represents six different monthly projection results. 1: January 2001, 2: February 2001, 3: March 2001, 4: April 2001, 5: May 2001, and 6: June 2001. . . . 73
5.9  Similarity comparison of the four-month projected graphs (G200111-G200202) with various lengths of historical data T = 10, 20, 30. . . . 75
5.10 Cosine measure comparison of 6-step-ahead predicted graphs for 200111 to 200106, time scale T = 20. . . . 75
5.11 Monthly time graphs from December 1999 to November 2000, labeled sequentially from (a) to (l) . . . 79
5.12 Monthly time graphs from December 2000 to November 2001, labeled sequentially from (a) to (l) . . . 80
List of Tables
2.1  Comparison of properties by varying β in the Watts-Strogatz model. l(β) is average path length; C(β) is clustering coefficient; and P(k) is degree distribution. . . . 17
2.2  Complexity of calculating degree, between-ness, and closeness centrality. . . . 26
3.1  Basic statistics of test sets . . . 40
3.2  Mean and variance of node degree sequence with respect to each test instance. . . . 41
3.3  Simulation results for D30 to D39, D50 to D59, and D70 to D79 . . . 44
3.4  Simulation results for N11 to N15, N21 to N25, and N31 to N35 . . . 45
4.1  Descriptive statistics for generated graphs G100-G500. The network size varies from 100 to 500 nodes and from 460 to 2382 edges. The average in- and out-degree for each data set are also reported. . . . 55
4.2  Edge accuracy results from three link association methods in phase-III. . . . 56
4.3  Edge accuracy statistics of 300 samples of the configuration model . . . 57
5.1  Number of nodes and edges of the monthly time graphs . . . 66
5.2  Experiment II setup: three different lengths of historical data T = 10, 20, 30 (months) were used to project 4-month-ahead graph evolution. . . . 74
5.3  Four measurements: density, s(G), C̄G, and transitivity for four projected graphs from November 2001 (G200111) to February 2002 (G200202). . . . 74
5.4  Employee status and corresponding number of people . . . 77
5.5  Predicted edge accuracy for monthly graphs from April to August 2001 (T = 20 and S = 5). . . . 82
5.6  Node IDs in partitioned cluster 2 . . . 82
5.7  Predicted edge accuracy comparison of the April 2001 monthly graph between partition and non-partition results. . . . 82
Acknowledgments
I am greatly indebted to many people who have supported me through my Ph.D. study. I would like to express my sincere gratitude to all the people who have helped me during this journey.
First of all, I thank my advisor, Dr. Dennis Lin, for his support. He has been a great mentor, and I have enjoyed the discussions with him exploring novel research ideas. I am thankful to my committee member Dr. Jeya Chandra for his guidance and support throughout my study. I would like to express my gratitude to my committee member Dr. Soundar Kumara, who has been an exemplary researcher; I have gained much from his thoughtful suggestions on my research work. I am also grateful to my committee member Dr. Zan Huang for his insights and valuable suggestions on research.
During my stay at Penn State, I had the great pleasure of working with Dr. William Harkness. I thank him for his constant encouragement and understanding that carried me through graduate school.
I am extremely fortunate to have a great bunch of friends, from my childhood to Penn State. I would like to thank Mook, Kelly, Craig, and Liya and many friends at Penn State for putting so much thought, care, and imagination into our friendship as well as making so many special memories that enriched my stay at Penn State. And thank you Jane, Chjong, Julianne, and Jslee for always being there to pull me through any puddle.
My deepest and greatest acknowledgement goes to my parents: T. Z. Chen
and Y. C. Hu. I thank them for their endless patience, encouragement to explore
possibilities of life, and freedom to fulfill my dreams. I also thank my brother
C. T. Chen and cousin Patrick Chen for always having faith in me and for the constant cheerfulness that lightened up the tough graduate moments. Last but not least, I
would like to thank my husband, Julian Pan, whose support and patience during
the final stage of the reviewing process helped me see the dissertation through
calmly to the end.
Thank you all....
Chapter 1
Introduction
The world we reside in can be characterized by a tremendous variety of networked systems at many layers of abstraction, from micro to macro, as well as by diverse man-made/engineered systems. What do these networked systems look like? How do they evolve over time? How can we tell when abnormalities occur during the evolving process? Where do network structures come from, and what are the driving forces?
One of the earliest theoretical models used by mathematicians and scientists to study networked systems dates back to the 1960s [1]. Most commonly studied is the Erdös-Rényi random graph, where each pair of nodes has an identical, independent probability p of being joined by an edge. The organization of edges in large-scale natural networks was originally considered to be random, that is, with no obvious pattern or structure associated with these networks [1] [2] [3]. However, recent studies, mostly empirical and conducted in the past decade, have shown that the topological properties of a wide range of social, biological and technology networks deviate from randomness [1] [2] [4] [5]. Features of these networks include, but are not limited to, a heavy tail in the degree distribution, a high clustering coefficient, assortativity or disassortativity among vertices, community structure, and hierarchical structure. These significant findings, which are elaborated in Chapter 2, have since inspired interest in the scientific study of networks and network modeling. A new field of interdisciplinary investigation, Network Science, has emerged, dedicated to understanding and characterizing large-scale complex systems [6]. An increasing number of researchers describe their work as related to this field, and various techniques and models have been developed to help us understand the behavior of networked systems.
Most of the works described above rely on graph (network) representations. The main reason graphs are chosen is that they are better representation and analytical tools than vectors and attribute-value lists. The ability to capture higher-order relations and dependencies between discrete entities and objects makes graphs a prevailing choice for modeling a complex system. A graph is a collection of nodes, or vertices, with edges, or lines, connecting pairs of them. A few examples taken from real-world applications give a flavor of how graphs characterize entities and the interactions between them. For instance, the World Wide Web [7] [8] [9] [10] is a network consisting of web pages as nodes, where the hyperlinks between the web pages are the links. A scientific collaboration network (Watts and Strogatz [11]; Albert et al. [5]; Barabasi et al. [12]; Newman [13] [14]) contains nodes representing scientists or researchers, and an edge exists between scientists if they have co-authored a paper. In social networks, nodes represent people and edges are created between two acquaintances. In the cellular network setting [15] [16] [17] [18], the substrates or molecules that constitute a cell represent the nodes, and the existence of bio-chemical interactions or regulatory relationships between molecules is captured by edges.
The above networks are only a few examples of the networked systems pervasive in the real world [3] [4] [19] [2]. One distinction that sets these studies apart from traditional graph theory is the network size: the size of these networks is substantially larger than the ones considered in traditional graph theory. In addition, a pre-specified structure/order or any design principle is seldom found in these networks. To differentiate these networks from regular graphs, a new term, complex networks, is used. Complex networks are often characterized by diverse behaviors that emerge as a result of non-linear spatial-temporal interactions among a large number of components [20]. Research problems [7] [8] [9] [10] [11] [5] [15] [13] [12] posed in such network settings are often novel. Although these studies are promising and have led us to learn a great deal about the effects of networks, the majority of them take a static approach to explore the statistical mechanics of topology as well as the respective key principles governing the graph structures. That is, the system is analyzed as a snapshot taken from the period of interest.
In fact, almost all real networked systems evolve over time. Evolution is found everywhere in the world we reside in. It is a series of ongoing processes of development, formation, and growth in both natural and human-created systems. For example, evolution is the cornerstone of modern biology: a complex natural system is not created all at once but must instead evolve over time. In the field of social science, research has shown that social networks should be properly understood not as static structures, but as temporal processes. In essence, social networks evolve in time as a function of individual decisions, group attributes, and organizational structure. That is, events in a social context take place sequentially; for instance, friendship must be established with a person before you meet their friends. Although the idea of studying network formation over time is not new, one unique aspect of this thesis is an effective and efficient algorithm to study the network evolutionary process on a large scale over prolonged periods. To the best of our knowledge, little work has been done to address the ingredient of time in real networked systems and investigate the temporal evolving patterns.
In this thesis we contribute to both the scientific and engineering aspects of studying network evolution processes.
1.1 Research objectives
In this research, we have two primary interests in studying network evolution processes. The first focuses on understanding the structure, evolution, and properties of networks. The second is to develop two effective and efficient algorithms and find optimal parameters to project realistic and reliable future graphs.
Many of these questions boil down to the following: How can we generate synthetic but realistic graphs? To answer this, we must first understand what patterns are common in real-world networks and can thus be considered a mark of normality/realism. This study is motivated to answer these questions.
We present a new perspective to explore network evolution processes from a time series point of view. Utilizing the notion of evolution formalizes a time domain in the problem. Specifically, we model the evolution process through the creation of a sequence of sample graphs S. Slicing the entire network evolution process by time stamps provides a sequence of sampled graphs within the study period and opens the opportunity to take a time-series-like approach to tackle the problem. Let the set of nodes and edges during a time period (t_0, t_b] be captured as a graph at time instance t_b, denoted by G_b(V_b, E_b, ζ_b). S is ordered sequentially over time as T graphs in T time instances: S = (G_1, G_2, . . . , G_T), where the G_x are simple graphs. Figure 1.1 shows a graphical representation of a sequence of sample graphs S.
In addition, developing effective and efficient analytical and simulation tools that allow reasoning about large-scale networks is a pressing need. Therefore, our major focus in this thesis is to propose a novel analytical approach to study the evolution process. One immediate difficulty is how to capture the dynamic addition and/or deletion of nodes and edges. In this research, we adopt an intuitive approach by taking samples of the network at various points in time, and we use these captured graphs to make inferences about the evolutionary process. Specifically, given a sequence of time-indexed graphs S = (G_1, G_2, . . . , G_t), we generate realistic and reliable predictive (future) graphs Ĝ_{t+k} for k = 1, 2, 3, . . .. One of the major contributions of this thesis is to propose a novel approach, incorporating univariate time series models into a sequence of time graphs S, to model the network evolution process. To the best of our knowledge, this is the first work lying at the intersection of traditional graph theory and time series analysis to study network evolution.
Figure 1.1. Graphical illustration of the time series network (TSN) problem. Gt represents a graph sampled in the time interval [t − 1, t), where t = 1, 2, . . . , T. The objective here is to use a sequential series of time-indexed graphs to predict realistic and reliable future graph(s) ĜT+1 at time T + 1.
1.2 Problem definition: Time series network (TSN) problem
We propose the time series network (TSN) problem, a new type of problem that considers the temporal element in a network setting to study network evolution processes. The TSN problem is defined as a problem where the update operations include unrestricted insertions and deletions of edges. The uniqueness of the TSN problem is as follows:
1. The TSN problem offers a new perspective to explore network evolutionary processes. For example, we may anticipate the network to evolve to a certain state Gt in the future, but no effective measures are available to know how long it will take the graph to evolve to this state.
2. TSN addresses networks with directed edges. In certain types of networks the directions of edges are essential, such as disease-spreading networks and the World Wide Web used to retrieve information.
For the application domain in this thesis, we assume the existence of a universal set of nodes, ψ, from which all nodes that occur in a sequence S are drawn. Let N be the number of nodes in graph G. That is, L_i ⊆ ψ for i = 1, 2, . . . , N and ψ = \bigcup_{i=1}^{N} L_i. Next, we introduce and formulate the Time Series Network (TSN) problem as follows. Given a sequence of graphs S = (G_1, G_2, . . . , G_T), the TSN problem attempts to predict a graph at time T + 1, Ĝ_{T+1}.
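To make the data structures concrete, the following is a minimal sketch (assuming Python with the NetworkX library, which the thesis already uses for its illustrations) of one way a sequence of time-indexed graphs S could be represented, together with a placeholder prediction step. The function names and the persistence forecast are hypothetical illustrations, not the UN-TSN/DN-TSN algorithms proposed later.

import networkx as nx

def build_time_graph(edge_events, t_start, t_end):
    """Collect all edges observed in [t_start, t_end) into one sample graph G_t."""
    G = nx.Graph()
    for t, u, v in edge_events:
        if t_start <= t < t_end:
            G.add_edge(u, v)
    return G

def predict_next_graph(S):
    """Hypothetical placeholder: a naive persistence forecast that returns the last
    observed graph as G_hat_{T+1}. The thesis replaces this step with time-series
    models on node-degree sequences plus a link-association phase."""
    return S[-1].copy()

# Toy usage: edge events given as (time, sender, receiver) tuples.
events = [(0.2, "a", "b"), (0.7, "b", "c"), (1.1, "a", "c"), (2.5, "c", "d")]
S = [build_time_graph(events, t, t + 1) for t in range(3)]   # G_1, G_2, G_3
G_hat = predict_next_graph(S)
print(G_hat.number_of_nodes(), G_hat.number_of_edges())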
1.3 Uniqueness, contribution and potential applications
In this thesis, we consider a fundamentally new approach to analyzing the evolution process in complex engineering systems that can be realized as networks. The uniqueness and contribution of this research are threefold.
First, we extend the prediction capability of time series analysis from a set of singly indexed values to a set of collected graphs. A predictive graph of this kind is very useful in the context of extrapolation as well as the detection of abnormal patterns. In particular, our model helps form a basis for validating scenarios of the evolution process. By continuously monitoring the evolving patterns, one potential application is to detect abnormality: activities producing structures that deviate significantly from the null model can be identified.
Second, the proposed novel approach provides the network research community an effective and efficient framework to study the evolution process of real-world networked systems. Given general graph statistics such as the number of nodes, number of edges and node degree, the proposed model is capable of producing realistic and reliable predictive graph(s) preserving the respective key principles that govern the graph structures.
Last but not least, we present a valuable case study using a distinct data set, the Enron corpus, to validate the proposed methodology. This data set is a touchstone, providing a substantial collection of real email traffic that is publicly available. We transform the temporal email communication relationships into a graph representation, where each distinct email address corresponds to a node, and the presence of emails between two distinct email addresses corresponds to an edge. Given such a sequence of time graphs, the objectives are to observe graph evolutionary patterns and generate realistic and reliable predicted graph(s). The capability of generating reliable predictive graphs enables us to anticipate changes in communication patterns that emerge gradually over time, as well as to discover indirect senders and recipients within the structure. Note that although the motivation of our work is an email communication network, the proposed method is fairly general and could be applied to other domains as well.
Because real-world networks naturally evolve over time, numerous complex engineered networks in the natural, physical, and social worlds are potential applications for the proposed TSN problem.
Figure 1.2 outlines a few examples from the technology, biological, social, communication, and interacting network perspectives. Take technology networks for instance: the Internet and the WWW are representative examples. In the category of biological networks, food webs and cell-level interactions such as the genome, proteome, and metabolism are main study subjects. Social networks have drawn quite a lot of attention recently, especially online social networks. With the advent of technology, human interactions have gone beyond regional restrictions; blogs, Facebook, and instant messaging (IM) create a new arena of online social networks across borders, which exhibit greater influence than traditional social networks. In addition, the phone call graph is a typical example of a communication network. In this electronic era, the email graph is another major type of communication network that people heavily rely on. One last example in the figure is the terrorist network, which has drawn extensive research since the 9/11 attack.
1.4 Thesis organization
Figure 1.2. Potential real-world applications of TSN. For illustrating purposes, we only select a few networked systems pervasive in the real world.

This thesis documents the study approach, findings, conclusions, and recommendations. A major portion of the thesis is devoted to presenting the proposed solution framework for the time series network (TSN) problem. In addition, to discover the evolving patterns of real-world networks, we examine the Enron email communication network and discuss the evolving patterns that we found. Some of the details that support these findings are presented in the appendixes; others may be found in the references cited in the text. The thesis is organized as follows.
Chapter 1 discusses the current status of network science and identifies associated research challenges related to the evolution process in network science research. A new type of problem devoted to studying the evolution process of networks is proposed.
Chapter 2 broadly reviews the literature related to the concept of networks, network properties, and their components. We discuss three prominent network models, the random network, the small world network by Watts and Strogatz [11], and the scale-free network by Barabási and Albert [21], in section 2.2. Next, a survey of the literature on commonly identified statistical properties of complex networks is summarized in section 2.3. Section 2.4 outlines some selected examples taken from real-world applications and their network representations.
Chapter 3 is devoted to an in-depth discussion and analysis of the proposed framework for the simple network evolution process. In this chapter, we study the simple network with only undirected edges and no multiple edges or loops. Both analytical and simulation results for the predictive graph properties obtained from the proposed solution framework are reported in detail.
Chapter 4 is dedicated to an in-depth discussion and analysis of the proposed framework for the general network evolution process. The general network evolution process involves graphs with directed edges and with multiple edges and loops. Analytical and simulation results for the predictive graph properties obtained from the proposed solution framework are reported in detail.
Chapter 5 presents a comprehensive case study using a large-scale dataset, the Enron corpus, a substantial real email benchmark, to validate our proposed approaches from Chapters 3 and 4. Both simple and general network evolution processes are analyzed and studied. It documents the experimental results of applying the UN-TSN and DN-TSN algorithms to the Enron corpus. Findings concerning how the proposed solution methodology can create value from investments in network science are provided in detail as well.
Finally, Chapter 6 summarizes our findings and implications, discusses the uniqueness and contribution of this thesis, and outlines future research directions.
Chapter 2
Literature Reviews
In this chapter we present an overview of previous studies related to the research topics in this thesis. First, we discuss existing network models in section 2.2. Three major network models, the random network, the small world network, and the scale-free network, are reviewed in sections 2.2.1, 2.2.2, and 2.2.3, respectively.
Next, a survey of the literature related to graph properties, presented in section 2.3, is categorized into two parts: statistical properties and temporal evolution patterns. Commonly identified statistical properties of complex networks are summarized in sections 2.3.1 - 2.3.5. We discuss statistical properties such as degree distribution, clustering coefficient, average shortest-path length, mixing patterns, and node centrality. Section 2.3.6 depicts two recently discovered temporal evolution patterns.
As explained in Chapter 1, complex systems are modeled as networks (graphs) to understand and optimize processes such as the formation of opinions, resource sharing, information retrieval, and robustness to perturbations. Section 2.4 presents another major portion of the literature review, covering various complex networks taken from the real world. A brief description as well as the methodology of each type of real-world network is discussed.
Figure 2.1. An illustration of networks with different kinds of edges. Figure (a) represents a simple network that contains undirected edges only. Figure (b) depicts a general network that contains undirected, directed, and multiple edges as well as loops.

2.1 Networks Definition
Let a network consist of a set of nodes V and a set of edges E. An edge e ∈ E connects exactly two nodes u, v ∈ V. This implies that u is a neighbor of v and vice versa. In the case that u = v, we refer to the edge e_uv as a loop. A loop occurs on node 2 in Figure 2.1 (b). A multiple-edge case occurs if two (or more) edges connect the same pair of nodes in V; an illustration is given between nodes 2 and 4 in Figure 2.1 (b). Edges can also have directions associated with them, such as the one between nodes 2 and 5 in Figure 2.1 (b). In general, a network can have any combination of edge types: undirected and directed edges as well as multiple edges and loops. In this thesis, we consider the two types of networks illustrated in Figure 2.1. We refer to Figure 2.1 (a) as a simple network, where only undirected edges exist. The second type is referred to as a general network, containing undirected, directed, and multiple edges and/or loops (Figure 2.1 (b)).
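As a concrete illustration of the two network types, the following is a small sketch in Python/NetworkX (the library already used for the figures in this chapter); the tiny example graphs are hypothetical and are not the graphs drawn in Figure 2.1.

import networkx as nx

# Simple network: undirected edges only, no multiple edges, no loops.
simple = nx.Graph()
simple.add_edges_from([(1, 2), (2, 3), (3, 1), (2, 4)])

# General network: directed edges, multiple edges, and loops are all allowed.
general = nx.MultiDiGraph()
general.add_edge(2, 2)      # a loop on node 2
general.add_edge(2, 4)      # first edge between 2 and 4
general.add_edge(2, 4)      # a second (multiple) edge between 2 and 4
general.add_edge(2, 5)      # a directed edge from 2 to 5

print(simple.degree(2))                # node degree in the simple network
print(general.number_of_edges(2, 4))   # 2: multiple edges are preserved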
2.2 Existing network models

In this section, we discuss three statistical ensembles: the random network, the small-world network, and the scale-free network. A statistical ensemble of graphs is defined as follows:
(a) a set G of graphs, and
(b) a rule that associates some statistical weight P(g) > 0 with each graph g from this set: g ∈ G.
The history of network generative models dates back to the 1960s. The most commonly studied model is the Erdös-Rényi random graph, G(n, p), where each pair of nodes has an identical, independent probability p of being joined by an edge. Although it is well studied, this model fails to generate the power-law degree distributions commonly observed in real-world network applications [11] [21]. The Watts-Strogatz small-world model [11] is another prominent generative model; it produces graphs with small diameter and large clustering coefficient.
Another main stream of network models focuses on generating heavy-tailed degree distributions (e.g., power-law or lognormal). The most well-known and influential idea is the preferential attachment model proposed by Barabási and Albert [21]. The preferential attachment mechanism means that new nodes prefer to attach to older, high-degree nodes; it results in power-law degree distribution tails and low graph diameters. Many variations of the original preferential attachment model have been proposed, such as the copying model and the forest fire model. Chakrabarti and Faloutsos [22] document a detailed survey and comparison of these methods. One drawback of the above generative models is that they usually aim at modeling only a single property of the network (e.g., the degree exponent), which does not truly reflect real-world graphs.
One recent model that generates more realistic graphs obeying multiple properties of real-world graphs is the Kronecker graph model proposed by Leskovec et al. [23]. In this model, Kronecker matrix multiplication is used to generate graphs that mimic the graph evolutionary process. However, the model assumes self-similarity in the graph evolution process, and the resulting graph is sensitive to the choice of the initial graph. One more challenge is the O(N^2) computational complexity of the likelihood estimation, which evaluates the probability of each edge in the graph adjacency matrix.
2.2.1 Random graph models
Erdös and Rényi first introduced random graphs [1] in 1959. This simple yet powerful model has many real-world applications. The crux of the generating mechanism is that increasing the probability of any two nodes being connected beyond a critical threshold produces a connected network with small-world characteristics, in which the average path length between any two nodes is short. Two main types of random graph models with a fixed number of nodes, N, are described as follows:
The first type is the Erdös-Rényi model, called G(n, p), where each pair of nodes in the network is connected by an edge with identical, independent probability p. That is, any particular edge is absent with probability 1 − p. Figure 2.2 shows the evolution of G(n, p) with N = 20 and three different connection probabilities p. As p increases from 0, the number of edges in the network also increases.
The other type of random graph model, called G(n, M), assigns equal probability to all graphs with exactly M edges. The construction of G(n, M) can be realized by adding new edges one by one, repeatedly connecting randomly chosen pairs of nodes.
There are limitations to the above models. One potential problem with G(n, M) is that it produces graphs with self-loops and multiple edges between the same pair of nodes if no special restrictions are imposed on the construction procedure stated above. However, Bollobas [1] pointed out that in the limit of large graphs (n → ∞), the averages of physical quantities over these graphs, with and without the restriction, are the same.
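For reference, both ensembles are available directly in NetworkX; the following is a sketch with arbitrary parameter values.

import networkx as nx

n = 20          # number of nodes
p = 0.1         # connection probability for G(n, p)
M = 19          # fixed number of edges for G(n, M)

# G(n, p): every pair of nodes is joined independently with probability p.
G_np = nx.gnp_random_graph(n, p, seed=42)

# G(n, M): a graph chosen uniformly among all graphs with n nodes and M edges.
G_nm = nx.gnm_random_graph(n, M, seed=42)

print(G_np.number_of_edges())   # random, on average p * n * (n - 1) / 2
print(G_nm.number_of_edges())   # always exactly M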
2.2.1.1 Properties of random graph models
• The G(n, p) model creates 2^{N(N−1)/2} graphs in total, and each of these graphs has any number of edges less than or equal to N(N−1)/2. The statistical weight of a graph G with L(G) edges taken from this set equals Pr(G) = p^{L(G)} (1 − p)^{N(N−1)/2 − L(G)}, where L(G) varies from graph to graph.
Figure 2.2. Evolution of the ER random graph model, G(n, p), with N = 20 and three different connection probabilities p: (a) p = 0, (b) p = 0.005, (c) p = 0.01. As p increases from 0, the number of edges in the network also increases. Graphs are generated using Python NetworkX-0.35.1.
• The set of graphs in G(n, M) consists of all possible graphs with a given number M of edges and N vertices. Statistical weights of graphs from this set are equal.
• The degree distribution for the G(n, p) model can be obtained as follows. Under the assumption that every node in a graph G with N nodes is equivalent, each node can have any number of edges attached, from zero to N − 1. The degree distribution of the G(n, p) model is defined as:

Pr(x) = \binom{N-1}{x} p^{x} (1-p)^{N-1-x}.    (2.1)

Pr(x) is the binomial distribution. Thus, the average degree is x̄ = p(N − 1) and the graph G contains an average of pN(N − 1)/2 edges. For large N and fixed x̄, the degree distribution of G(n, p) takes the Poisson form:

Pr(x) = \frac{e^{-\bar{x}}\,\bar{x}^{x}}{x!}    (2.2)

2.2.1.2 Limitation of classical random graph models
Despite their simplicity and power, the E-R random graph models fail to explain two important properties observed in real-world networks:
• By assuming a constant and independent probability of two nodes being connected, the Erdös-Rényi graphs do not account for local clustering and triadic closures. As a result, Erdös-Rényi graphs have a low clustering coefficient.
• In addition, the Erdös-Rényi graphs do not account for the formation of hubs. Their degree distribution converges to a Poisson distribution, rather than the power-law distribution commonly observed in real-world applications. The term "real-world" here refers to any of the observable phenomena that exhibit network-theoretic characteristics (e.g., social networks, computer networks, neural networks, epidemiology, etc.).
The Watts and Strogatz model (1998) [11], described in detail in section 2.2.2, is designed as the simplest possible model that addresses the first limitation. It accounts for clustering while retaining the short average path lengths of the Erdös-Rényi model. It does so by interpolating between an ER random graph and a regular ring lattice. On the other hand, the BA model, described in detail in section 2.2.3, is designed to address the second issue.
2.2.1.3 Random graph model with prescribed degree sequence
The configuration model [24] [25] [26] [27] is a natural extension of the classical random graph models to uncorrelated random networks with a prescribed degree distribution, P(x). The algorithm is summarized as follows:
• Step 1: Label N nodes.
• Step 2: To each node j of the graph ascribe a degree k_j taken from the distribution P(x). Now the graph looks like a family of hedgehogs: each node has k_j quills sticking out.
• Step 3: Connect at random the ends of pairs of distinct quills belonging to distinct vertices.
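A minimal sketch of this procedure, assuming Python/NetworkX: nx.configuration_model implements exactly this stub-matching idea, and the degree sequence below is an arbitrary example.

import networkx as nx

# Steps 1-2: label nodes and ascribe a prescribed degree sequence (sum must be even).
degree_sequence = [3, 3, 2, 2, 1, 1]

# Step 3: randomly pair up the "quills" (stubs); the result may contain
# self-loops and parallel edges, exactly as noted for G(n, M) above.
G = nx.configuration_model(degree_sequence, seed=1)

# If a simple graph is required, collapse parallel edges and drop self-loops.
G_simple = nx.Graph(G)
G_simple.remove_edges_from(nx.selfloop_edges(G_simple))

print(dict(G.degree()))            # matches the prescribed sequence
print(G_simple.number_of_edges())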
2.2.2 Small world networks
"A small-world network is a type of mathematical graph in which most nodes are not neighbors of one another, but most nodes can be reached from every other by a small number of hops or steps."
Watts and Strogatz [11] proposed a small-world network model that simultaneously exhibits two characteristics observed in many real networks, especially social networks: (1) high clustering and (2) small average path length. They argued that most real networks are neither completely regular nor completely random, but lie somewhere between these two extremes. The steps for constructing a Watts-Strogatz model are as follows. Given the desired number of nodes N, the mean degree K (assumed to be an even integer), and a special parameter p, where 0 ≤ p ≤ 1 and N ≫ K ≫ ln(N) ≫ 1, the model constructs an undirected graph with N nodes and NK/2 edges:

Step 1: The Watts-Strogatz model starts with a regular ring lattice of N nodes, each connected to K neighbors, K/2 on each side.

Step 2: For every node v_i = v_0, ..., v_{N−1}, take every edge (v_i, v_j) with i < j, and rewire it with probability β. Rewiring is done by replacing (v_i, v_j) with (v_i, v_k), where k is chosen with uniform probability from all possible values that avoid loops (k ≠ i) and link duplication (there is no edge (v_i, v_k') with k' = k at this point in the algorithm).
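The construction above is available directly in NetworkX; the following sketch mirrors the parameters of Figure 2.3 (connected_watts_strogatz_graph retries until the rewired graph is connected, so the path-length statistic is well defined).

import networkx as nx

N, K = 20, 6          # number of nodes and mean degree, as in Figure 2.3

# beta = 0 gives the regular ring lattice; beta = 1 approaches a random graph.
for beta in (0.0, 0.005, 1.0):
    G = nx.connected_watts_strogatz_graph(N, K, beta, seed=7)
    print(beta,
          nx.average_clustering(G),
          nx.average_shortest_path_length(G))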
Figure 2.3 illustrates that varying p allows one to interpolate between a regular lattice (p = 0) and a random graph (p = 1) approaching the ER random graph model G(n, p) with n = N and p = \frac{NK}{2\binom{N}{2}}.
Table 2.1 compares three properties, the average path length l(p), the clustering coefficient C(p), and the degree distribution P(k), as p varies. A number of studies follow the Watts-Strogatz model, including exact results. Barrat and Weigt [28] derived the exact solution for the degree distribution when 0 < p < 1. It is defined as:
P(k) = \sum_{n=0}^{f(k,K)} C^{n}_{K/2}\, (1-p)^{n} p^{K/2-n}\, \frac{(pK/2)^{\,k-K/2-n}}{(k-K/2-n)!}\, e^{-pK/2}    (2.3)
The Watts and Strogatz small-world network displays a high clustering coefficient for small values of β, since we start with a regular lattice. Second, most pairs of nodes will be connected by at least one short path. This follows from
Figure 2.3. Pictorial illustration of the random rewiring process for the Watts-Strogatz small world model [11] with N = 20 and K = 6. (a) When p = 0 the graph is a regular ring lattice (each node has 6 edges: K = 6). As p increases, the graph becomes increasingly disordered until p = 1, at which point all the edges are rewired randomly, as shown in (c). Panels: (a) p = 0, (b) p = 0.005, (c) p = 1. Graphs are generated using Python NetworkX-0.35.1.
Table 2.1. Comparison of properties by varying β in the Watts-Strogatz model. l(β) is average path length; C(β) is clustering coefficient; and P(k) is degree distribution.

          Regular Lattice (β = 0)          0 < β < 1              Random Network (β = 1)
l(β)      N/(2K) ≫ 1                       falls very rapidly     ln N / ln K
C(β)      3/4                              ∼ C(0)(1 − β)^3        K/N
P(k)      Dirac delta func. centered at K  Eq. (2.3)              Poisson
the requirement that the mean shortest path length be small. The co-existence of a high clustering coefficient and a small average path length is in excellent agreement with the characteristics of many real networks [11]. In addition, the shape of the degree distribution in the Watts and Strogatz model is similar to that of a random graph: it has a pronounced peak at k = K and decays exponentially for large |k − K|. Therefore, the topology of the network is relatively homogeneous, and all nodes have more or less the same degree.
2.2.3 Scale-free network
Barabási and Albert (1999) first found that the World-Wide-Web does not have random connectivity [21]. Instead, they discovered that certain nodes in the network have many more connections than the average. After reviewing a few other networks, they found that this heavy-tailed degree distribution characteristic also exists in some social and biological networks. They pointed out that complex networks like the World-Wide-Web and collaboration networks are continuously growing through the addition of new vertices (e.g., the creation of new web pages, new researchers joining). They came to the conclusion that large-scale networks have two important features: (1) constant growth and (2) inherent selectivity in edge creation.
In addition, in contrast to the random network setting, where each node has the same probability p of creating a new edge, new nodes entering the network do not connect uniformly to existing nodes, but attach preferentially to nodes of higher degree. They argued that both the static random graph and the Watts-Strogatz small world model fail to capture these two features [21]. This reasoning leads to another important class of complex networks: scale-free networks.
A considerable number of real-world networks fall into this category, for example, phone call graphs [29], the Internet [30], the World-Wide-Web [21] [31], click-stream data [32], on-line social networks [33], citation graphs [34], and biological networks [35]. In scale-free networks, some nodes are recognized as "highly connected hubs" (high degree), while most nodes are of low degree. The structure and dynamics of a scale-free network are independent of the system's size N (e.g., the number of nodes in the network). The defining characteristic is that their degree distribution follows a power-law relationship: P(k) ∼ k^{−γ}. Albert et al. [3] pointed out that γ may vary approximately between 2 and 3 for most real networks; however, in some cases it can also take a value between 1 and 2.
The Barabási-Albert (BA) model is the most widely known generative model for a subset of scale-free networks [21]. It incorporates two important general concepts: growth and preferential attachment. Growth refers to the number of nodes in the network increasing over time; preferential attachment means that the more connected a node is, the more likely it is to receive new links. The preferential attachment mechanism [3] is quantified by:

p_i = \frac{k_i}{\sum_j k_j},    (2.4)
where k_i is the degree of node i in the network. The BA model algorithm is as follows:
Growth The network begins with an initial network of m0 nodes, where m0 ≥ 2.
The degree of each node in the initial network should be at least 1.
Preferential Attachment New nodes are added to the network one at a time. Each new node is connected to m of the existing nodes with a probability p_i that is biased so that it is proportional to the number of links the existing node already has.
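In NetworkX this growth-plus-preferential-attachment process is available as a ready-made generator; the following is a sketch with arbitrary parameters.

import collections
import networkx as nx

# Grow a network to n nodes, attaching each new node to m existing nodes
# with probability proportional to their current degree.
n, m = 1000, 2
G = nx.barabasi_albert_graph(n, m, seed=0)

# The resulting degree distribution has a heavy tail: a few hubs, many low-degree nodes.
degree_counts = collections.Counter(d for _, d in G.degree())
print(max(degree_counts))    # largest observed degree (a hub)
print(degree_counts[m])      # many nodes keep the minimum degree m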
Next, we summarize the statistical properties of the BA model. The degree distribution resulting from the BA model is scale-free, with a power law of the form P(k) ∼ k^{−3}. The average path length of the BA model increases approximately logarithmically with the size of the network, l ∼ ln N / ln ln N, and therefore displays the small-world property. Furthermore, the BA model is more clustered than a random graph, and the clustering decreases more slowly with network size than for the random graph. Although there is no analytical result for the clustering coefficient of the BA model, the empirically determined clustering coefficients are generally significantly higher for the BA model than for random networks. The clustering coefficient also scales with network size following approximately a power law: C ∼ N^{−0.75}. It is worth noting that the BA model has a systematically shorter average path length than a random graph. In addition, Albert [3] pointed out that there are two limiting cases of the BA model. Case (1) retains growth but does not include preferential attachment: the probability of a new node connecting to any pre-existing node is equal. The resulting degree distribution in this limit is exponential, indicating that growth alone is not sufficient to produce a scale-free structure. Case (2) retains the preferential attachment mechanism but does not have growth: this model starts with a fixed number of disconnected nodes and adds links, preferentially choosing high-degree nodes as link destinations. Although the degree distribution looks scale-free early in the simulation, the distribution is not stable; eventually, it becomes nearly Gaussian as the network nears saturation. Therefore, Albert [3] concludes that preferential attachment alone is not sufficient to produce a scale-free structure.
Many more refined models [3] [36] [37] [19] [38] [39] follow the BA model. Depending on the topological details, the clustering coefficient of scale-free networks can vary significantly, and some other types of generative mechanisms allow one to create such networks with a high density of triangles. Cohen and Havlin [40] proved that uncorrelated power-law graphs with 2 < γ < 3 also have an ultra-small diameter d ∼ ln ln N. Therefore, from a practical point of view, the diameter of a growing scale-free network can be considered almost constant. More comprehensive reviews of general models and network characteristics have been proposed and studied; Dorogovtsev and Mendes [19] document detailed findings.
2.2.3.1 Scale-free metric
Li et al. [41] proposed a potentially more precise scale-free metric, s(G). Let G(N, E) be a graph with a set of nodes N and a set of edges E, and let d_i be the degree (number of edges) at a vertex i. s(G) is defined as:

s(G) = \sum_{(i,j) \in E} d_i d_j.    (2.5)

This is maximized when high-degree nodes are connected to other high-degree nodes. Now define

S(G) = \frac{s(G)}{s_{max}},    (2.6)

where s_{max} is the maximum value of s(H) over all graphs H with a degree distribution identical to that of G. This gives a metric between 0 and 1, such that graphs with low S(G) are "scale-rich", and graphs with S(G) close to 1 are "scale-free". This definition captures the notion of self-similarity implied in the name "scale-free".
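Computing s(G) itself is straightforward; the sketch below (Python/NetworkX) evaluates Eq. (2.5) on an arbitrary BA graph. Note that s_max, and hence S(G), requires a separate maximization over all graphs with the same degree sequence, which is not shown here.

import networkx as nx

def s_metric(G):
    """s(G) = sum over edges (i, j) of d_i * d_j, as in Eq. (2.5)."""
    deg = dict(G.degree())
    return sum(deg[u] * deg[v] for u, v in G.edges())

G = nx.barabasi_albert_graph(200, 2, seed=0)
print(s_metric(G))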
2.3 Statistical properties of complex networks
In this section, we discuss the statistical properties that are used to classify different types of complex networks. Some of these properties, prominently used in the literature, are selected. We discuss their definitions and present empirical findings for many real-world networks.
2.3.1 Degree distribution
The node degree is the number of edges incident to a particular node. The degree distribution is the distribution of a "one-vertex" quantity; in other words, it characterizes only the local properties of a network. In undirected networks a node has a single node degree associated with it, whereas in directed networks a node has both an in-degree and an out-degree. The degree distribution is the probability distribution P_k for a randomly selected node to have degree k in the network. Using Figure 2.1 (a) as an example, we can empirically determine its degree distribution. Among the 5 nodes, v1 and v3 have degree 2, v2 has degree 4, and v4 and v5 have degree 1. Hence, the probability that any given node has degree 1 is 2/5; degree 2 is 2/5; degree 4 is 1/5; and 0 otherwise.
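The empirical calculation above can be reproduced directly; the sketch below uses a hypothetical five-node graph assumed to match the degrees described for Figure 2.1 (a).

from collections import Counter
import networkx as nx

# A five-node graph matching the description: v1 and v3 have degree 2,
# v2 has degree 4, and v4 and v5 have degree 1.
G = nx.Graph([("v1", "v2"), ("v1", "v3"), ("v2", "v3"), ("v2", "v4"), ("v2", "v5")])

counts = Counter(d for _, d in G.degree())
n = G.number_of_nodes()
distribution = {k: c / n for k, c in sorted(counts.items())}
print(distribution)   # {1: 0.4, 2: 0.4, 4: 0.2}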
2.3.2 Clustering coefficient
The clustering coefficient has long received attention in both theoretical and empirical research. It assesses the degree to which nodes tend to cluster together. Research has shown that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterized by a relatively high density of ties [11]. The clustering coefficient measure was introduced by Watts and Strogatz [11] in 1998 to determine whether a network is a small-world network. It is measured in terms of the number of triangles (3-cliques) present in the network. The formal definition of the clustering coefficient is as follows. Let G = (V, E) be an undirected graph, where V is a set of vertices and E is a set of edges between them. An edge e_{ij} connects vertex i with vertex j.
The degree k_i of a vertex is defined as the number of vertices, |N_i|, in its neighborhood N_i. The clustering coefficient C_i for a vertex v_i is then given by the proportion of links between the vertices within its neighborhood divided by the number of links that could possibly exist between them. Therefore, if a vertex v_i has k_i neighbors, there are at most k_i(k_i − 1)/2 possible edges among the vertices within the neighborhood. Hence, the clustering coefficient for undirected graphs is defined as:

C_i = \frac{2\,|\{e_{jk} : v_j, v_k \in N_i,\ e_{jk} \in E\}|}{k_i (k_i - 1)}.    (2.7)
The clustering coefficient for the whole system is then obtained by averaging the clustering coefficients of the individual vertices:

\bar{C} = \frac{1}{n} \sum_{i=1}^{n} C_i .   (2.8)
We use Figure 2.1 (a) as an illustrating example to show the calculation of C_i as well as \bar{C}. For this network, both C_1 (C_1 = (2 × 1)/(2 × (2 − 1))) and C_3 equal 1, whereas C_4 and C_5 do not receive a value since the number of possible ties among their neighbors is 0 (if a node has fewer than 2 neighbours, the coefficient is undefined). For node 2, one out of 6 possible ties is present, so the coefficient is 1/6, or 0.1667. Treating the undefined coefficients as zero, \bar{C} = 13/30.
A graph is considered small-world if its average clustering coefficient \bar{C} is significantly higher than that of a random graph constructed on the same vertex set, and if the graph has a short mean shortest-path length.
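A minimal sketch of the clustering-coefficient calculation, assuming networkx and the same assumed edge set as in the degree-distribution example above; networkx assigns a coefficient of 0 to nodes with fewer than two neighbours, which reproduces \bar{C} = 13/30 when averaging over all five nodes.

import networkx as nx

G = nx.Graph([("v1", "v2"), ("v1", "v3"), ("v2", "v3"), ("v2", "v4"), ("v2", "v5")])

print(nx.clustering(G))           # C_1 = C_3 = 1.0, C_2 = 1/6, C_4 = C_5 = 0
print(nx.average_clustering(G))   # (1 + 1/6 + 1 + 0 + 0) / 5 = 13/30 ~ 0.4333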
2.3.3 Average shortest-path length
The average path length is one of the most robust properties for measuring network topology. An application example is the average number of clicks made while surfing between websites until the desired information is found. It is not the same as the network diameter, which is defined as the maximal distance between any two nodes in the network.
The formal definition of average path length is as follows. Consider an unweighted network G(V, E), where V is a set of vertices (or nodes) and E is a set of links (or edges) connecting the vertices. A path between any two vertices u and v in the network G is a sequence of vertices u = u_1, u_2, . . . , u_i = v, where the u_i are nodes in G and there exists an edge from u_{i−1} to u_i ∀i. Since the network is unweighted, the path length is equal to the number of edges along the path. Let d(u_1, u_2), where u_1, u_2 ∈ V, denote the shortest distance between u_1 and u_2. Assume that d(u_1, u_2) = 0 if u_1 = u_2 or u_2 cannot be reached from u_1. The average path length l_G is the average of the shortest paths from each node to every other node in the network. Let n be the number of vertices in G; l_G is defined as:
l_G = \frac{1}{n (n - 1)} \sum_{i \neq j} d(v_i, v_j) .   (2.9)
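A minimal sketch (assuming networkx and the same assumed example graph): for a connected unweighted graph, the library routine reproduces equation (2.9); unreachable pairs would need the d(u_1, u_2) = 0 convention stated above and are not handled here.

import networkx as nx

G = nx.Graph([("v1", "v2"), ("v1", "v3"), ("v2", "v3"), ("v2", "v4"), ("v2", "v5")])
# sum of d(v_i, v_j) over ordered pairs, divided by n(n - 1)
print(nx.average_shortest_path_length(G))   # 1.5 for this example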
2.3.3.1 Small-world phenomenon
Many empirical studies have indicated that real-world networks, despite their large size (w.r.t. the number of nodes), have surprisingly small average path lengths. Examples include social networks [42] [11] [43], the WWW [44], the connectivity of the Internet [45], gene networks [46], transcriptional networks [47], the neural network of Caenorhabditis elegans, and the power grid of the western U.S., all of which exhibit small-world characteristics. This implies that any node is reachable from any other node in the network in a relatively small number of steps. This characteristic phenomenon is called the small-world phenomenon [42] [11] [48].
The small-world phenomenon, often associated with the phrase "six degrees of separation", was first demonstrated in the experiments conducted by psychologist Stanley Milgram in the 1960s [42]. He examined the average path length for social networks of people in the United States. Randomly selected individuals from Wichita, Kansas and Omaha, Nebraska were asked to pass on a letter to one of their acquaintances by mail. The final destination of these letters was a specific person in Boston, Massachusetts; the name and profession of the target were given to the participants. Upon receiving the invitation to participate, the recipient was asked whether he or she personally knew the contact person described in the letter. If not, the participant was asked to send the letter to one of their acquaintances whom they judged to be closer (than themselves) to the target. The process repeated until the letter reached the target person. In total, 64 letters successfully reached the target contact, and the average path length fell around 6. That is, people in the United States are separated by about six people on average. A similar, more recent experiment by Dodds et al. [49] examines global social search among e-mail users.
2.3.4 Mixing patterns
An indicator of the generic tendency of nodes to connect to similar or dissimilar peers is the assortativity coefficient R(G). This measurement was introduced by Newman [50], who defines assortative mixing (R(G) > 0) as the preference for high-degree nodes to attach to other high-degree nodes, and disassortative mixing (R(G) < 0) as the contrary, where high-degree nodes attach to low-degree ones. Newman [50] studied these mixing patterns in various social networks based on characteristics of nodes/people such as race, sex, age, and their degrees. When considering the mixing pattern in networks with respect to node degrees, he showed that in natural networks the possibility of two nodes being connected also depends on the degrees of these nodes. Specifically, social networks tend to have assortative mixing patterns, where people of the same degree (or, perhaps, popularity) are more likely to relate to each other. In technological and biological networks, on the other hand, a node of high degree shows a preference for attaching to nodes of low degree and vice versa, forming a disassortative mixing pattern.
For the practical purpose of calculating R(G) in a given network, Li et al. [41] proposed the following sample-based definition of assortativity.
R(G) = \frac{\left[\sum_{(i,j) \in E} d_i d_j\right] - \left[\sum_{i \in N} \frac{1}{2} d_i^2\right]^2 / |E|}{\left[\sum_{i \in N} \frac{1}{2} d_i^3\right] - \left[\sum_{i \in N} \frac{1}{2} d_i^2\right]^2 / |E|}   (2.10)
2.3.5 Centrality
Various measures of the centrality of a vertex in a graph, which determine the relative importance of a vertex within the graph, have been proposed in the context of graph theory and network analysis. For instance, how important is a person within a social network? In this subsection, we discuss three measures of centrality that are widely used in network analysis: degree centrality, between-ness centrality, and closeness centrality.
• Degree centrality: Degree centrality is defined as the number of links incident to a node. In undirected networks a node has a single node degree associated with it, whereas in directed networks a node has both an in-degree and an out-degree. In-degree is a count of the number of ties directed to the node, and out-degree is the number of ties that the node directs to others. Take the directed network in Figure 2.1 (b) for example: node 3 has in-degree 1 and out-degree 1. In the context of social network analysis of positive relations such as friendship or advice, we normally interpret in-degree as a form of popularity and out-degree as gregariousness.
• Between-ness centrality: Vertices that occur on many shortest paths between other vertices have higher between-ness than those that do not. For a graph G := (V, E) with n vertices, the between-ness C_B(v) of vertex v is:

C_B(v) = \sum_{s \neq v \neq t \in V, \; s \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}   (2.11)

where σ_{st} is the number of shortest geodesic paths from s to t, and σ_{st}(v) is the number of shortest geodesic paths from s to t that pass through the vertex v. This may be normalized by dividing by the number of pairs of vertices not including v, which is (n − 1)(n − 2).
• Closeness centrality: Newman [37] pointed out that closeness can be regarded as a measure of how long it will take information to spread from a given vertex to the other reachable vertices in the network. Closeness is defined as the mean geodesic distance:
C(v) = \frac{\sum_{t \in V \setminus v} d_G(v, t)}{n - 1}   (2.12)

where n ≥ 2 is the size of the network's connectivity component V reachable from v.

Table 2.2. Complexity of calculating degree, between-ness, and closeness centrality.

                          Dense Graph           Sparse Graph
Degree centrality         O(V^2)                O(E)
Between-ness centrality   WFI Alg.: Θ(V^3)      Johnson's Alg.: O(V^2 log V + V E)
Closeness centrality      WFI Alg.: Θ(V^3)      Johnson's Alg.: O(V^2 log V + V E)
Table 2.2 shows the complexity of calculating degree centrality, between-ness centrality, and closeness centrality. Computing degree centrality takes O(V^2) for a dense graph and O(E) in a sparse-matrix representation. Obtaining the between-ness and closeness centralities of all the nodes in a graph involves calculating the shortest paths between all pairs of nodes, which takes Θ(V^3) time with the Floyd-Warshall algorithm. On a sparse graph, Johnson's algorithm may be more efficient, taking O(V^2 log V + V E) time.
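A minimal sketch (assuming networkx) that computes the three centrality measures discussed above on an illustrative random graph; note that networkx's closeness_centrality returns the reciprocal of the mean geodesic distance of equation (2.12).

import networkx as nx

G = nx.erdos_renyi_graph(200, 0.05, seed=1)     # illustrative sparse random graph
deg_c = nx.degree_centrality(G)                 # based on the number of incident links
btw_c = nx.betweenness_centrality(G)            # fraction of shortest paths through a node
cls_c = nx.closeness_centrality(G)              # reciprocal of the mean geodesic distance
v = max(btw_c, key=btw_c.get)                   # the most "between" vertex
print(v, deg_c[v], btw_c[v], cls_c[v])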
2.3.6 Temporal network evolution
Ntoulas et al. [51] and Leskovec et al. [52] pointed out that the great bulk of prior work on real graph datasets has focused on static properties, identifying patterns in a single snapshot or a small number of network snapshots. The lack of studies of temporal patterns has restrained the exploration of real graph evolution, a topic that has started to attract increasing attention from researchers.
Two recent discoveries, both related to time-evolving patterns, are stated in Leskovec et al. [53]. One is the existence of the Densification Power Law (DPL) as a graph grows over time, and the other is the shrinking effective diameter. The DPL [53] states that the number of edges E(t) and the number of nodes N(t) are related by E(t) ∝ N(t)^a, where a is the densification exponent and is typically greater than 1; a > 1 implies that the average node degree in the graph increases over time. The DPL describes the phenomenon that real graphs tend to sprout many more edges than nodes over time. In work on citation networks, Redner [54] and Katz [55] first discovered the DPL, while Leskovec et al. [53] confirmed the finding over a wide range of real graphs. In contrast to the conventional understanding that the average distance in a graph should increase as a function of the number of nodes, Leskovec et al. [53] also discovered a surprising temporal evolution pattern in their empirical study: the diameter of a graph tends to shrink or stabilize over time.
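As a simple sketch with made-up snapshot counts (for illustration only), the densification exponent a of E(t) ∝ N(t)^a can be estimated as the slope of a log-log regression:

import numpy as np

N = np.array([1000, 2000, 4000, 8000, 16000])    # nodes per snapshot (assumed values)
E = np.array([3000, 7000, 16500, 38000, 90000])  # edges per snapshot (assumed values)
a, log_c = np.polyfit(np.log(N), np.log(E), 1)   # slope of the log-log fit
print("densification exponent a ~", a)           # a > 1 indicates densification over time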
Leskovec et al. [52] propose a graph generator that is mathematically tractable and matches the above two temporal evolving properties. The generated graph, referred to as the "Kronecker graph", uses a non-standard matrix operation, the Kronecker product, to generate graphs that mimic several real graphs well.
2.4 Real networks
As explained in Chapter 1, complex systems are modeled as networks to understand and optimize processes such as the formation of opinions, resource sharing, information retrieval, robustness to perturbations, etc. The following are some selected examples taken from real-world applications, together with their network representations.
• Email Network: With the advent of technology, electronic mail plays an important role in daily life. In fact, email is no less significant a component of this information era than the Internet or the World-Wide-Web. Past studies [56] have shown that e-mail communication is closely correlated with face-to-face and telephone interactions. Thus, e-mail exchanges are considered reasonable proxies for underlying social ties. Email graphs are evolving networks; the basic chain in all such networks is sender --message--> recipient.
Ebel et al. [57] studied the statistics of a large email network built from the email activities of Kiel University students. The source and destination addresses from the log-files of the server were analyzed for all email messages over a period. Using this information, a directed network with 59,912 vertices was constructed. Only a small part of these, approximately 9%, were student accounts (5,165 vertices); the rest are "external" addresses. The average degree of the "complex network" accounting for "external" addresses and their connections with "internal" vertices is 2.88. The average degree of the "internal" vertices is much greater, 25.45. Another empirical study, by Kossinets and Watts [56], was also conducted in a university environment. They analyzed a year of e-mail contacts among over 43,000 students, faculty, and staff at a large, private university. Anonymized data were collected on the timestamp, sender, and recipients, but not the content. There were 14.5 million e-mail messages collected in total. They then cross-referenced these data with information about personal attributes (status, gender, age, etc.) of the individuals, and who attended and taught each class at the university. The clustering coefficient and the average shortest-path length were obtained for the network that includes external addresses; the directedness of edges in this network was ignored. The resulting clustering coefficient (0.156) is more than
net was ignored. The resulting clustering coefficient (0.156) is more than
3000 times greater than that for the corresponding random graph. On the
other hand, this empirical clustering coefficient is not as large as it seems at
first sight. It is only eight times greater than the clustering of the equilibrium uncorrelated network with the same degree distribution. The average
shortest-path length of this undirected graph was found to be 4.95, which
is even less than that for the corresponding classical random graph. On the
other hand, it is 1.4 times greater than the average shortest-path length of
the equilibrium uncorrelated network with the same degree distribution.
• Telephone call network: The size of the telephone network is also very impressive. The literature shows that the total number of telephones in the world is only slightly less than 10^9 nowadays and is rapidly growing. Potentially, there are various ways to construct a graph that reflects the structure of telephone connections. Suppose all telephone calls in a telephone network are registered over some period. Here we illustrate a simple example. A telephone call network could be defined as follows: (a) its vertices are the telephone numbers that have been registered; (b) its directed edges are telephone calls between these numbers, where the direction of an edge is determined by the number that made the call. In this construction, multiple edges between a pair of vertices are possible. Aiello, Chung, and Lu (2000) [38] constructed a directed network to study the long-distance telephone calls recorded in one day. The resulting directed network has 47 million vertices and about 8 × 10^7 connections. Both the distributions of incoming and outgoing connections of this telephone call graph are fat-tailed. However, only the in-degree distribution can be fitted by a power law; its exponent was roughly estimated as γ_in ≈ 2.1.
• Co-authorship network: The co-authorship network of scientists represents a prototype of complex evolving networks. Back in the early 1990s, a number of authors pointed out the potential utility of co-authorship data, and small-scale statistical analyses of the frequency of coauthored articles were performed [58] [59]. That research focused only on small groups of authors or authors at particular institutions. With the advent of comprehensive on-line bibliographies such as MEDLINE, the Physics E-print archive, SPIRES, and NCSTRL, researchers started to construct large-scale networks representing research in mathematics [12] [14], biology [14], physics [14], and computer science [13]. Although various results in this domain have been reported, most of the works focus on a static, cumulative view of the entire network. One of these networks, formed from the MEDLINE database for the period from 1961 to 2001, had 1,520,251 nodes and 2,163,923 edges. Barabási et al. [12] studied the co-authorship databases of mathematics and neuroscience, containing data over an 8-year period (1991-98). Their results indicate that the network is scale-free and that the network evolution is governed by preferential attachment, affecting both internal and external links. However, in contrast with most model predictions, the average degree increases in time and the node separation decreases.
• Biological network: Watts and Strogatz [11] and Albert et al. [16] studied
topological properties of the neural network of C.elegans consisting of 282
neurons with pairs of neurons connected by the presence of either a synapse
or a gap junction. Albert et al. [17] studied the protein-protein interaction network. Applying a number of machine learning algorithms to the protein connectivity information, they achieve surprisingly good overall performance in predicting interacting proteins. Using a 'leave-one-out' approach, they find average success rates between 20% and 40% for predicting the correct interaction partner of a protein. Kuhn [15] and Ciliberti et al. [18] examine the principles of evolutionary innovation in gene expression patterns. They model the transcriptional regulation networks that are at the heart of embryonic development.
• World Wide Web (WWW): Currently, the World Wide Web is the largest
network for which topological information is available. It had approximately
one billion nodes at the end of 1999 [20] and is continuously growing at an
exponential rate.
• Language Network: The Word Web is an example of a very large network, containing 470,000 words and 17,000,000 connections. The empirical degree distribution of this network has a complex form with two power-law regions.
2.5 Summary
During the last few years there has been a tremendous amount of research activity dedicated to the study of these complex networks. This activity was mainly triggered by significant findings in real-world networks. In this chapter, we thoroughly discussed previous studies related to the research topics of this thesis. In the first half, we presented three prominent network models: the random network, the small-world network of Watts and Strogatz [11], and the scale-free network of Barabási and Albert [21]. Next, a survey of the literature related to graph properties was organized into two subsections: statistical properties and temporal evolution patterns. Commonly identified statistical properties of complex networks were summarized; we discussed properties such as the degree distribution, clustering coefficient, average shortest-path length, mixing patterns, and node centrality. Section 2.3.6 described the two recently discovered temporal evolution patterns. Next, a comprehensive review covering various complex networks taken from the real world was presented.
Although previous related research is promising, our study presents a new perspective: studying the graph evolution process from a time series point of view, with the benefit of providing a computationally efficient and mathematically tractable model. To the best of our knowledge, this is the first work taking a hybrid approach that combines time series analysis and graph theory to study the network evolution process.
Chapter 3
Simple Evolving Networks
In this section, we introduce the first type of evolving graph problem of interest and propose a solution approach to tackle it. For ease of description, we first describe the basic concepts and notation used in the graph time series problem. Here we focus only on simple graphs, that is, undirected graphs with no self-loops and no multiple edges between any two different vertices. Each simple undirected graph G is labeled as an ordered 3-tuple, G = (V, E, ζ), where V is the finite set of nodes; E ⊆ V × V is the set of unordered pairs of vertices, called edges or links; and ζ is the finite set of the corresponding node degrees. With the belief that past graph evolving behavior is a good predictor of short-term future evolving behavior, extrapolation is an appealing approach. Our goal here is to propose a novel approach to solve the TSN problem described in Chapter 2, that is, to propagate the observed patterns from a series of time-indexed graphs S = (G_1, G_2, . . . , G_T) to predict the future graph at time T + 1, Ĝ_{T+1}.
3.1 Proposed method
Given a sequence of temporally indexed graphs as described in the previous section, a hierarchical solution framework is presented to extrapolate observed patterns from these graphs, incorporating uni-variate time series models into the graphical sequence problem. The solution procedure is decomposed into two phases. Phase-1 focuses on fitting uni-variate time series models to extrapolate the identified patterns for predicting future values of the 3-tuple elements (V_{t+1}, E_{t+1}, ζ_{t+1}), and Phase-2 utilizes the predicted values (V̂_{t+1}, Ê_{t+1}, ζ̂_{t+1}) to recover the resultant graph prediction Ĝ_{t+1}. Details of each phase and the pseudo code of the proposed algorithm are presented in sections 3.1.1 and 3.1.2. Next, analytical properties of the predicted graphs obtained from the proposed algorithm, along with computer-generated test results, are discussed in section 3.2. Finally, complexity issues are addressed in the last section, which shows that one of the proposed link association algorithms scales linearly with the size of the graph, O(N + E).
3.1.1 Phase-I: Node degree prediction
For each node i ∈ V, (ζ_1^i, ζ_2^i, . . . , ζ_T^i) gives the time series of its occurrence frequencies. We fit an autoregressive integrated moving average (ARIMA) model, proposed by Box and Jenkins [60], for every distinct node to predict the node degree ζ̂_{T+1}^i ∀i ∈ V. Viewing the node degree ζ_t^i, where t = (1, 2, . . . , T), as a process, we assume independence between the ζ_t^i ∀i ∈ V. For ease of illustration, we omit the superscript i of ζ_t^i. Introducing the backshift operator B · ζ_t = ζ_{t−1}, ζ_t leads to the following ARIMA(p,d,q) process:

φ(B)(1 − B)^d ζ_t = θ_0 + θ(B)ω_t    ∀i ∈ V   (3.1)

where ω_t ∼ (0, σ_w^2) is white noise with zero mean and variance σ_w^2. φ(B) = 1 − φ_1 B − φ_2 B^2 − · · · − φ_p B^p and θ(B) = 1 + θ_1 B + θ_2 B^2 + · · · + θ_q B^q are polynomial functions of the backshift operator, where φ_x ∀x = 1, 2, . . . , p and θ_y ∀y = 0, 1, 2, . . . , q are constants.
For model identification, we vary the order of the ARIMA model: p = 0, 1, 2, 3, q = 0, 1, 2, 3, and d = 0, 1. The model that minimizes Akaike's information criterion (AIC) is chosen as the final model to predict the k-step-ahead node degree, ζ̂_{T+k}^i ∀i ∈ V. The choice of model selection criterion is made on pragmatic grounds. Since overall algorithm complexity is our primary concern, the AIC calculation offers simplicity and efficiency and performs at least as well as other possibilities, e.g. the Bayesian information criterion (BIC).
Input: S_G = (G_1, G_2, . . . , G_T)
Output: ζ̂_{T+k}^i
begin
    repeat
        for every i ∈ V do
            Initialize: ζ̂_{T+k}^i = −1000; initAIC^i = 1000
            Construct time series D = (ζ_1^i, ζ_2^i, . . . , ζ_T^i)
            for p ← 0 to 3 do
                AIC^i ← 0; ARIMA^i ← 0
                for q ← 0 to 3 do
                    for d ← 0 to 1 do
                        ARIMA^i = Fit(ARIMA(p,d,q), D)
                        AIC^i = FindBestAIC(ARIMA(p,d,q), D)
                        if AIC^i < initAIC^i then
                            initAIC^i = AIC^i
                            ζ̂_{T+k}^i = Prediction(ARIMA^i, k)
    until ζ̂_{T+k}^i ≥ 0
    Return ζ̂_{T+k}^i
end
Algorithm 1: Pseudo code of the TSN phase-I algorithm.
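A minimal Python sketch of the phase-I order search, assuming the statsmodels library (the function and variable names are ours); the AIC-minimizing ARIMA fit is used to forecast the node degree k steps ahead, and error handling for non-convergent fits is kept to a bare minimum.

import itertools
import warnings
from statsmodels.tsa.arima.model import ARIMA

def forecast_degree(series, k=1):
    best_aic, best_fit = float("inf"), None
    for p, d, q in itertools.product(range(4), range(2), range(4)):   # p, q in 0..3, d in 0..1
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            try:
                fit = ARIMA(series, order=(p, d, q)).fit()
            except Exception:
                continue                                              # skip orders that fail to fit
        if fit.aic < best_aic:
            best_aic, best_fit = fit.aic, fit
    return best_fit.forecast(steps=k)                                 # degree forecast for T+1..T+k

print(forecast_degree([2, 3, 3, 4, 5, 5, 6, 7, 7, 8], k=1))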
3.1.2 Phase-II: Link association
In the second phase, we construct a simple graph that meets the predicted node degrees. Constructing a simple graph that meets a prescribed degree sequence is one of the classical problems rooted in graph theory and theoretical computer science (Thulasiraman and Swamy [61]; Lovász and Plummer 1986). Given the predicted node degree sequence from phase-I, two link association algorithms, the Havel-Hakimi algorithm (HL) and the SMAX algorithm proposed by Li et al. [41], are adapted to solve the graphical sequence problem. A necessary condition (the Erdös-Gallai theorem) for the predicted degree sequence {ζ̂_{T+k}^1, ζ̂_{T+k}^2, ..., ζ̂_{T+k}^N}, for k = 1, 2, 3 . . ., to be realizable is that, for every subset of the p highest-degree nodes, their total degree can be absorbed amongst themselves and the other vertices in the graph.
Figure 3.1. Scalability of phase-II link association algorithm (HL).
Formally, for 1 ≤ p ≤ n − 1:

\sum_{i=1}^{p} \hat{\zeta}_{T+k}^{i} \le p(p - 1) + \sum_{i=p+1}^{n} \min\{p, \hat{\zeta}_{T+k}^{i}\},    k = 1, 2, 3 . . .   (3.2)
The HL algorithm is deterministic and always produces the same resultant graph G for a given node degree sequence (ζ_t^1, ζ_t^2, . . . , ζ_t^N). It constructs a simple graph by successively connecting the node of highest degree to the other nodes of highest degree, re-sorting the remaining nodes by degree, and repeating the process until all nodes are selected. Pseudo code of the HL algorithm is presented in Algorithm 2.
Additional tests are performed for the scalability analysis of the HL algorithm. We generate graphs of different sizes with a fixed graph density of 0.02, ranging from 500 nodes and 25,000 edges to 32,000 nodes and 1,600,000 edges, and measure the computation time of the second phase. Figure 3.1 shows that, using the HL algorithm as the phase-II link association procedure, the computation scales linearly with the size of the graph, O(N + E). The processor time for executing the HL algorithm (Y-axis) is plotted against the increase in the total number of nodes in the graph (X-axis); the solid line presents a linear fit to the simulated data.
The SMAX algorithm is also a deterministic approach, which aims to generate a simple, connected graph with a given degree sequence (ζ_t^1, ζ_t^2, . . . , ζ_t^N) that maximizes the objective function s(G). For any graph G with a fixed degree sequence, letting deg(x) denote the node degree of node x, Li et al. [41] defined s(G) as follows.
s(G) = \sum_{(i,j) \in E} deg(i) \times deg(j)   (3.3)
The basic idea of the SMAX algorithm is to sort all potential edges e = {(i, j) : i < j; i, j = 1, 2, . . . , N} by the value of deg(i) × deg(j) in descending order and then incrementally build the graph in a greedy fashion via a heuristic procedure until the resultant graph is simple, connected, and meets the given degree sequence (ζ_t^1, ζ_t^2, . . . , ζ_t^N).
Input: ζ̂_{T+k}^i
Output: Ĝ_{T+k}
begin
    Matrix M ← 0
    if ζ̂_{T+1}^i ≥ 0 ∀ i ∈ V, append ζ̂_{T+1}^i ∀ i ∈ V as a row to M
    for u ← rows[M] to 2 do
        v ← FirstNonZero(M[u])
        for w ← length[ζ̂_{T+1}^i ∀ i ∈ V] to v do
            if M[u − 1][w] > M[u][w] then
                connect v − 1 to w
end
Algorithm 2: Pseudo code of the TSN phase-II algorithm (HL).
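A minimal sketch assuming networkx, whose havel_hakimi_graph routine performs the same greedy highest-degree-to-highest-degree construction as the HL procedure; a non-graphical degree sequence raises an error, and is_graphical provides the Erdős-Gallai test of equation (3.2).

import networkx as nx

D = [6, 4, 3, 2, 2, 2, 1, 1, 1, 1, 1]        # the example degree sequence used below
print(nx.is_graphical(D))                     # True: D satisfies the Erdos-Gallai condition
G_hl = nx.havel_hakimi_graph(D)               # greedy highest-to-highest construction
print(sorted((d for _, d in G_hl.degree()), reverse=True))   # matches D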
Example: Given the node degree sequence D = [6, 4, 3, 2, 2, 2, 1, 1, 1, 1, 1], we use the HL and SMAX algorithms to generate two resultant graphs, G_D^HL and G_D^SMAX. Figure 3.2 illustrates the graphical results for G_D^HL and G_D^SMAX. The two graphs exhibit quite different structures: G_D^HL produces two subgraphs from the given node degree sequence, whereas in G_D^SMAX node 0 plays the role of a hub through which all nodes can reach each other.
Figure 3.2. Example resultant graphs G_D^HL and G_D^SMAX with given node degree sequence D = [6, 4, 3, 2, 2, 2, 1, 1, 1, 1, 1] (left panel: G_D^HL; right panel: G_D^SMAX).

3.2 General properties of the predicted graph and simulation results
Benefits of employing the HL and SMAX algorithms include preserving properties predicted from a sequence of evolving graphs, specifically the degree sequence, while creating graphs that meet these extrapolated patterns. In addition, the inclination to create edges between high-degree nodes accommodates previous research results showing that the mechanism of preferential attachment underpins most recent social network formation models [37]. In this section, we examine two generic properties of the resultant graphs from the HL and SMAX algorithms described in the previous section. The first question is: what is the tendency of nodes connecting to their peers, and do they incline to connect to similar or dissimilar peers? The second question of interest is: how interconnected is the neighboring environment around the nodes? Thus, we analyze the assortativity and clustering coefficient of the predicted graphs reconstructed by the HL and SMAX algorithms in sections 3.2.1 and 3.2.2. Simulation results from computer-generated problems are reported and discussed in section 3.2.3.
3.2.1 Predicted graph property 1: Assortativity
The assortativity coefficient R(G) is an indicator of the generic tendency of nodes to connect to similar or dissimilar peers. This measurement was introduced by Newman [50]. He defines assortative mixing (R(G) > 0) as the preference for high-degree nodes to attach to other high-degree nodes, and disassortative mixing (R(G) < 0) as the contrary, where high-degree nodes attach to low-degree ones.
Li et al. [41] proposed the following sample-based definition of assortativity.
R(G) = \frac{\left[\sum_{(i,j) \in E} d_i d_j\right] - \left[\sum_{i \in N} \frac{1}{2} d_i^2\right]^2 / |E|}{\left[\sum_{i \in N} \frac{1}{2} d_i^3\right] - \left[\sum_{i \in N} \frac{1}{2} d_i^2\right]^2 / |E|}   (3.4)
Property 1. Given a fixed node degree sequence D = {D_1, D_2, . . . , D_N}, R(G_D^SMAX) ≥ R(G_D^HL).

Proof. Property 1 states that the assortativity of the SMAX algorithm, R(G_D^SMAX), is greater than or equal to the assortativity of the HL algorithm, R(G_D^HL). Since the degree sequence is the same for G_D^SMAX and G_D^HL, the denominator and the second component of the numerator in equation 3.4 are constants, and the first component of the numerator is s(G). Since the resultant graph G_D^SMAX maximizes s(G_D^SMAX),

s(G_D^SMAX) ≥ s(G_D^HL)  =⇒  R(G_D^SMAX) ≥ R(G_D^HL).
3.2.2 Predicted graph property 2: Clustering coefficient
The clustering coefficient C_G(i) characterizes the density of connections in the environment of a node i. C_G(i) is defined as the ratio of the number of edges y connecting the d_i neighbors of i over the maximum possible number of such edges, d_i(d_i − 1)/2:

C_G(i) = \frac{2y}{d_i (d_i - 1)}   (3.5)

\bar{C}_G = \frac{1}{N} \sum_{i \in N} C_G(i)   (3.6)

The average clustering coefficient for the whole graph, \bar{C}_G, is given by Watts and Strogatz [11]. \bar{C}_G expresses local robustness in the graph and thus has practical implications: the higher the local clustering of a node, the more interconnected are its neighbors. Existing studies have shown that small-world graphs exhibit high clustering coefficients in comparison to random networks [11].
Property 2. Given the number of nodes N and the number of edges E, \bar{C}_{G(N,E)}^{HL} and \bar{C}_{G(N,E)}^{SMAX} have the following property:

\lim_{N,E \to \infty} \frac{\bar{C}_{G(N,E)}^{HL}}{\bar{C}_{G(N,E)}^{SMAX}} \to 0.

Proof. Property 2 states that the value of the average clustering coefficient of SMAX, \bar{C}_{G(N,E)}^{SMAX}, increases much faster than the value of the average clustering coefficient of HL, \bar{C}_{G(N,E)}^{HL}, as the size of the graph G grows (e.g. the number of edges E and/or the number of nodes N increases).
3.2.3 Simulation results
To evaluate the robustness of the resulting graphs from the HL and SMAX algorithms, experiments on computer-generated problems are used to observe the basic graph characteristics and to verify the analytical results. Three factors are taken into account when designing the experiments: node size, degree sequence, and graph density.
3.2.3.1 Description of test sets
First, six test sets generated from random graphs are used as goal graphs for comparison. Table 3.1 shows the basic statistics of the generated test sets. Test sets
Table 3.1. Basic statistics of test sets

Test set   Density   # of nodes   # of edges
D3         0.11      30           50
D5         0.18      30           80
D7         0.28      30           120
N1         0.16      20           30
N2         0.16      50           196
N3         0.16      100          792
D3, D5, and D7 are designed to evaluate the impact of graph density on the reconstruction quality. Because of the focus on sparse graphs, the densities of test sets D3, D5, and D7 range from 0.1 to 0.3 with a fixed total number of nodes in each graph (N = 30). Test sets N1, N2, and N3 are designed to examine the effect of node size on the graph reconstruction quality; in these three test sets, the node size ranges from 20 to 100. Furthermore, 10 replications of each test instance in D3, D5, and D7 are obtained using the same graph statistics (density, # of nodes, and # of edges), resulting in different node degree sequences. Similarly, 5 replications of each test instance in N1, N2, and N3 are generated. The mean (x̄) and variance (σ_x̄^2) of the node degree sequences from all replications are listed in Table 3.2.
3.2.3.2 Graphical visualization of test sets vs. predicted graphs
Figure 3.3 presents the graphical visualization of the goal graph, the HL algorithm output, and the SMAX algorithm output from left to right. Visually, the predicted graphs from the HL algorithm are closer to the goal in all 45 test instances compared with those from the SMAX algorithm. This could be due to the fact that the test sets are generated from random graphs, where no dense clusters are expected. On the contrary, predicted graphs from SMAX exhibit observable clusters as well as highly interconnected nodes within each cluster. In particular, as the graph size (N and/or |E|) grows, the tendency to create observable clusters becomes stronger. Subgraphs (k) and (m) in Figure 3.3 depict this behavior of graphs from the SMAX algorithm. Besides, SMAX generates graphs with the objective of maximizing S(G), which implies that hubs play the central role in the overall connectivity of the network.
Table 3.2. Mean and variance of node degree sequence with respect to each test instance.

Instance   x̄      σ_x̄^2  |  Instance   x̄      σ_x̄^2  |  Instance   x̄       σ_x̄^2
D30        3.33   2.09   |  D50        5.33   3.40   |  D70        8.00    5.31
D31        3.33   2.37   |  D51        5.33   5.13   |  D71        8.00    7.72
D32        3.33   2.51   |  D52        5.33   4.99   |  D72        8.00    3.45
D33        3.33   1.54   |  D53        5.33   3.95   |  D73        8.00    6.62
D34        3.33   1.82   |  D54        5.33   2.37   |  D74        8.00    4.83
D35        3.33   3.20   |  D55        5.33   4.71   |  D75        8.00    3.66
D36        3.33   2.99   |  D56        5.33   3.68   |  D76        8.00    7.03
D37        3.33   2.02   |  D57        5.33   3.61   |  D77        8.00    7.79
D38        3.33   3.20   |  D58        5.33   3.68   |  D78        8.00    6.83
D39        3.33   1.54   |  D59        5.33   3.68   |  D79        8.00    6.28
N11        3.00   1.37   |  N21        7.84   7.36   |  N31        15.84   11.97
N12        3.00   1.79   |  N22        7.84   4.18   |  N32        15.84   13.77
N13        3.00   2.00   |  N23        7.84   6.63   |  N33        15.84   15.11
N14        3.00   1.68   |  N24        7.84   5.40   |  N34        15.84   11.11
N15        3.00   3.89   |  N25        7.84   5.44   |  N35        15.84   15.39
3.2.3.3 Measurements and evaluation of resultant graph
To measure the similarity between the predicted graphs and the goal graph, we first look at five measurements covering basic graph properties and clustering properties. The three basic graph properties considered are radius, diameter, and S(G). Two clustering properties, the average clustering coefficient and the transitivity (the fraction of transitive triangles in a graph), are also evaluated.
Tables 3.3 and 3.4 give the details of the five measurements for the 45 test instances. HL outperforms SMAX in all 45 cases in the measurement of basic graph properties such as graph diameter, radius, and S(G). Because the goal graphs are generated from random graphs, they tend to have smaller graph diameters, low clustering coefficients, and a lack of power-law node degree characteristics. This leads to the result that HL performs better than SMAX in all 45 test instances.
Figure 3.3. Graph visualization of goal graphs and their respective predicted outputs from the HL and SMAX algorithms. For each of the test instances D30, D53, D75, N13, and N22, the panels show (from left to right) the goal graph, the output from HL, and the output from SMAX.
SMAX produces a smaller average clustering coefficient (\bar{C}_{G(N,E)}^{SMAX}) when the graph density and graph size (# of nodes N and # of edges |E|) are small. However, the experiment results show that \bar{C}_{G(N,E)}^{SMAX} increases significantly in comparison to \bar{C}_{G(N,E)}^{HL} as the graph size increases. Figure 3.4 plots the relationship between \bar{C}_{G(N,E)}^{SMAX} and \bar{C}_{G(N,E)}^{HL} for various graph sizes: G(20, 30) in Figure 3.4 contains 20 nodes and 30 edges; G(50, 196) contains 50 nodes and 196 edges; and G(100, 792) contains 100 nodes and 792 edges. Let

ϖ = \frac{\bar{C}_{G(N,E)}^{HL}}{\bar{C}_{G(N,E)}^{SMAX}} ;   (3.7)

ϖ decreases to zero as the graph size increases, so the simulation results for ϖ behave consistently with Property 2. Furthermore, Figure 3.5 exhibits the simulation results of assortativity for R(G_D^HL) and R(G_D^SMAX). The relationship R(G_D^SMAX) ≥ R(G_D^HL) stated in Property 1 holds true in all 45 simulation results.
Table 3.3. Simulation results for D30 to D39, D50 to D59, and D70 to D79. For each test instance, the table reports the radius, diameter, S(G), average clustering coefficient, and transitivity of the goal graph (GOAL) and of the graphs predicted by the HL and SMAX algorithms.
Table 3.4. Simulation results for N11 to N15, N21 to N25, and N31 to N35. For each test instance, the table reports the radius, diameter, S(G), average clustering coefficient, and transitivity of the goal graph (GOAL) and of the graphs predicted by the HL and SMAX algorithms.
Figure 3.4. Relationship between ϖ, the ratio of average clustering coefficients, and graph size, plotted for G(20, 30), G(50, 196), and G(100, 792).
In conclusion, the predicted graphs from the HL and SMAX algorithms exhibit quite different characteristics. For instance, HL tends to produce graphs with smaller diameters than SMAX. As the graph size increases, SMAX is inclined to generate much more highly clustered groups than HL because its objective is to maximize S(G). One guideline for choosing between the two algorithms depends on the graph properties: if the reconstructed graph tends to have low clustering and a small diameter, the HL algorithm produces better results than SMAX.
Figure 3.5. Test sets D3, D5, and D7 and the respective assortativity values R(G_D^SMAX) and R(G_D^HL), plotted for test instances D30 through D79.
On the other hand, if the reconstructed graph exhibits high clustering and a larger diameter, SMAX produces superior results.
3.3 Summary
In this section, we presented a framework to explore simple evolving graphs from a time series point of view. A novel hierarchical algorithm incorporating a uni-variate ARIMA model with two graphical sequence procedures is proposed to solve the TSN problem. Constructing and fitting time series models from a node degree sequence perspective offers simplicity in problem decomposition and keen insights into the respective graph properties that are critical to the evolution of the graph as a whole.
The proposed algorithm comprises two phases. Phase-I focuses on fitting ARIMA models to extrapolate the identified patterns from the node degree sequences; the best-fitting model with the smallest AIC is chosen to predict the node degree sequence. Phase-II utilizes the predicted node degree sequence to recover the predictive graph via the Havel-Hakimi and SMAX procedures.
It is shown that the proposed framework is capable of propagating time-evolving node degree sequence patterns, represented by a sequence of time-indexed graphs, into a reliable predicted graph. The proposed algorithm produces mathematically tractable and computationally efficient predictive graphs. The graph properties assortativity and clustering coefficient of the resultant graphs from the HL and SMAX algorithms are analyzed. Simulation results for five different graph measurements are provided in detail to quantify the general behavior of the predicted graphs generated by the HL and SMAX algorithms. The scalability issue is also addressed; in particular, choosing the Havel-Hakimi algorithm as the phase-II link association procedure, the TSN problem can be solved in linear time and space.
Chapter 4
Directed Evolving Networks
4.1 Introduction
With the belief that past graph evolving behavior is a good predictor of future evolving behavior, extrapolation is an appealing approach. Our goal in this section is to propose a novel approach to propagate the observed patterns of a series of time-indexed graphs for predicting future graphs. Utilizing the notion of evolution formalizes a time domain in the problem. Specifically, we model the evolution process through the creation of a sequence of sample graphs S_G. Let the node and edge interactions during a time period (t_0, t_b] be captured as a graph at time instance t_b, denoted by G_b(V_b, E_b). S_G is ordered sequentially over time as P graphs in P time instances: S_G = < G_1, G_2, . . . , G_P >, where G_x is a graph with directional and/or multiple edges.
Because existing methodologies such as random graph models and link analysis techniques apply only to simple, undirected graphs, developing an effective and efficient methodology that extends to generalized networks is essential. Indeed, to the best of our knowledge, combining a graph theoretical approach with time series analysis to study time-varying networks has not been discussed in the literature. We tackle the problem by transforming G_x into an equivalent bipartite graph with two disjoint sets of nodes, N_OUT and N_IN, and introduce the capability of dealing with generalized graphs, i.e. graphs with directional and multiple edges. In contrast to previous approaches that detect incremental changes in graphs [62], the proposed approach considers the evolution of changes over an extended period of time. We investigate in detail a real-world communication data set, the Enron email corpus, to validate the proposed method. In the email communication application, senders and recipients linked by subsequent emails form a vast dynamic network of information exchange, which represents an important new arena for knowledge discovery.
In this section, several key issues will be addressed. (1) A new solution methodology is proposed to study the evolution processes of graphs with directional and multiple edges; it allows easy implementation of simulations while preserving analytical results about the system behavior. (2) The computational challenges of analyzing the evolution of large-scale graphs are addressed by forming proper clusters. (3) We also demonstrate the utility of the proposed method by analyzing an email communication data set, the Enron corpus. The remainder of the section is organized as follows. Section 2 introduces the proposed method. Section 3 quantifies the performance of the proposed method. Section 4 presents the case study of the Enron email corpus and interprets the observations and experimental results. Section 5 concludes and discusses our findings and implications.
4.2 Proposed Method
In this section, we describe the main steps of the proposed method. The proposed procedure is decomposed into three phases. Phase-I transforms each time-indexed graph G_t into an equivalent bipartite graph B_t. Phase-II treats each node degree sequence (ζ_1^i, ζ_2^i, . . . , ζ_t^i) for each node i up to time t as a stochastic process and fits uni-variate time series models to extrapolate the future node degree ζ_{t+1}^i at time t + 1. Phase-III utilizes the predicted values ζ_{t+1}^i for link association to recover the graph prediction B̂_{t+1}. Details of each phase are described in sections 4.2.1 to 4.2.3.
Figure 4.1. Illustrating example of bipartite graph formation: the original directed graph D is transformed into the bipartite graph B with the two disjoint node sets N_OUT and N_IN.

4.2.1 Phase-I: Bipartite graph formation
Each directed graph G_t for t = 1, 2, . . . , T is transformed to an equivalent bipartite graph B_t with two disjoint sets of nodes, N_OUT and N_IN, where N_IN, N_OUT ⊆ N. N_IN represents the set of nodes with in-degree edges (i.e. directed arcs coming into the designated nodes), and N_OUT is the set of nodes with out-degree edges. For the application domain in this section, we assume the existence of a universal set of nodes, ψ, from which all nodes that occur in a sequence ϕ are drawn. Let V_IN and V_OUT be the numbers of nodes in graph B. That is, L_i ⊆ ψ for i = 1, 2, . . . , V_IN and ψ = ∪_{i=1}^{V_IN} L_i; L_j ⊆ ψ for j = 1, 2, . . . , V_OUT and ψ = ∪_{j=1}^{V_OUT} L_j. An edge Ẽ(i, j), for i ≠ j, i ⊆ N_OUT, j ⊆ N_IN, exists in the bipartite graph with respect to the corresponding original graph G; that is, Ẽ(i, j) = 1 if the edge i → j ⊆ E exists. A simple example is given in Figure 4.1 for illustration. The original graph G contains five nodes and six directed edges. Two disjoint sets of nodes N_OUT ⊆ N and N_IN ⊆ N are created for the corresponding bipartite graph B. Two node degree sequences are obtained from B: the out-degree sequence ζ_OUT = [0, 1, 1, 2, 2] and the in-degree sequence ζ_IN = [4, 1, 1, 0, 0].
Each bipartite graph B is labeled as a 5-tuple, B = (N_IN, N_OUT, Ẽ, ζ_IN, ζ_OUT), where N_IN and N_OUT are the finite sets of nodes, Ẽ ⊆ {(u, v) | u ⊆ N_IN, v ⊆ N_OUT} is the set of edges, and ζ_IN and ζ_OUT are the finite sets of the corresponding in- and out-node degrees. Adding the time subscript, B_t = (N_IN^t, N_OUT^t, Ẽ^t, ζ_IN^t, ζ_OUT^t) represents an individual bipartite time graph sampled at time (t − 1, t], where t = 1, 2, . . . , T.
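A minimal sketch (assuming networkx) of the phase-I transformation for the example of Figure 4.1; the edge list is taken from the edges listed in the Figure 4.2 caption and reproduces the stated degree sequences.

import networkx as nx

# original directed graph of Figure 4.1: edges 1->0, 2->0, 3->0, 3->1, 4->0, 4->2
D = nx.MultiDiGraph([(1, 0), (2, 0), (3, 0), (3, 1), (4, 0), (4, 2)])
D.add_nodes_from(range(5))

B = nx.Graph()                                  # bipartite graph with OUT- and IN-copies of each node
for u, v in D.edges():
    B.add_edge(("OUT", u), ("IN", v))           # directed edge u -> v becomes OUT_u -- IN_v

zeta_out = [D.out_degree(i) for i in range(5)]  # [0, 1, 1, 2, 2]
zeta_in = [D.in_degree(i) for i in range(5)]    # [4, 1, 1, 0, 0]
print(zeta_out, zeta_in)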
4.2.2 Phase-II: Node degree prediction
For each node i ∈ N_IN and j ∈ N_OUT, (ζ_IN^{i,1}, ζ_IN^{i,2}, . . . , ζ_IN^{i,T}) and (ζ_OUT^{j,1}, ζ_OUT^{j,2}, . . . , ζ_OUT^{j,T}) give the corresponding time series of occurrence frequencies. Viewing the node degrees (ζ_IN^{i,1}, ζ_IN^{i,2}, . . . , ζ_IN^{i,T}) ∀i ∈ N_IN and (ζ_OUT^{j,1}, ζ_OUT^{j,2}, . . . , ζ_OUT^{j,T}) ∀j ∈ N_OUT, where t = (1, 2, . . . , T), as stochastic processes, we assume independence between ζ_IN^{i,t} ∀i ∈ N_IN and ζ_OUT^{j,t} ∀j ∈ N_OUT. We then fit an autoregressive integrated moving average (ARIMA) model [60] for every distinct node to predict the in-degree ζ̂_IN^{i,T+1} ∀i ∈ N_IN and the out-degree ζ̂_OUT^{j,T+1} ∀j ∈ N_OUT. For ease of illustration, we omit the superscript i and the subscript IN or OUT of ζ_IN^{i,t} and ζ_OUT^{j,t} respectively. Introducing the backshift operator B · ζ_t = ζ_{t−1}, ζ_t leads to the following ARIMA(p,d,q) process:
φ(B)(1 − B)^d ζ_t = θ_0 + θ(B)ω_t    ∀i ∈ V   (4.1)
where ω_t ∼ (0, σ_w^2) is white noise with mean zero and variance σ_w^2, and φ(B) = 1 − φ_1 B − φ_2 B^2 − · · · − φ_p B^p and θ(B) = 1 + θ_1 B + θ_2 B^2 + · · · + θ_q B^q are polynomial functions of the backshift operator, where φ_x ∀x = 1, 2, . . . , p and θ_y ∀y = 0, 1, 2, . . . , q are constants. For systematic model identification, we vary the order of the model: p = 0, 1, 2, 3, q = 0, 1, 2, 3, and d = 0, 1. The model with the minimum AIC (Akaike's information criterion) value is chosen as the final model to predict the node degree at T + 1, ζ̂_{T+1}. Other model identification criteria are available; however, we chose the AIC for computational reasons.
4.2.3 Phase-III: Link association
To make the predicted in-degree and out-degree sequences ζ̂_{IN,t+1}^i ∀i ∈ N_IN and ζ̂_{OUT,t+1}^j ∀j ∈ N_OUT graphical, Σ_i ζ_IN^i = Σ_j ζ_OUT^j must hold. The link association algorithms are inspired by the Havel-Hakimi (HL) algorithm. We assume that node degree is the simple mechanism governing the existence of links between two nodes in the graph. The original HL algorithm constructs a simple graph by successively connecting the node of highest degree to other nodes of highest degree, re-sorting the remaining nodes by degree, and repeating the process until all nodes are selected. Two other variants of the HL algorithm are tested in this section and described as follows; a small illustrative sketch of Alg. 1 is given after the list.
• Alg. 1: i ∈ NOU T ∀i are connected to j ∈ NIN ∀j by connecting the highest
degree to the highest degree until all available in-degree and out-degree are
paired.
• Alg. 2: i ∈ NOU T ∀i are connected to j ∈ NIN ∀j by connecting the highest
degree to the lowest degree until all available in-degree and out-degree are
paired.
• Alg. 3: i ∈ NOU T ∀i are connected to j ∈ NIN ∀j by connecting the highest
degree to alternatively the highest and the lowest degree until all available
in-degree and out-degree are paired.
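A minimal sketch of Alg. 1 (the helper name is ours): the largest remaining out-degree is repeatedly paired with the largest remaining in-degree; ties are broken arbitrarily, multiple edges are allowed, and the i ≠ j restriction is omitted for brevity. With this tie-breaking, the Figure 4.1 example recovers all six original edges.

def link_highest_to_highest(zeta_out, zeta_in):
    out_rem = {i: d for i, d in enumerate(zeta_out) if d > 0}   # remaining out-degree per node
    in_rem = {j: d for j, d in enumerate(zeta_in) if d > 0}     # remaining in-degree per node
    edges = []
    while out_rem and in_rem:
        u = max(out_rem, key=out_rem.get)                       # largest remaining out-degree
        v = max(in_rem, key=in_rem.get)                         # largest remaining in-degree
        edges.append((u, v))
        out_rem[u] -= 1
        in_rem[v] -= 1
        if out_rem[u] == 0:
            del out_rem[u]
        if in_rem[v] == 0:
            del in_rem[v]
    return edges

print(link_highest_to_highest([0, 1, 1, 2, 2], [4, 1, 1, 0, 0]))
# [(3, 0), (4, 0), (1, 0), (2, 0), (3, 1), (4, 2)] -- the six edges of Figure 4.1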
The bipartite graph shown in Figure 4.1 serves as the illustrating example to demonstrate the three link association algorithms. The corresponding degree sequences are given as input: ζ_OUT = [0, 1, 1, 2, 2] and ζ_IN = [4, 1, 1, 0, 0]. Figure 4.2 shows the three resulting bipartite graphs corresponding to Alg. 1 to Alg. 3. Alg. 1 accurately recovers all 6 edges, while algorithms 2 and 3 each succeed in recovering 4 edges, a recovery rate of 66.7%.
Figure 4.2. The bipartite graph of Figure 4.1 serves as the illustrating example to demonstrate the recovery results from the different algorithms in phase-III (link association). Alg. 1 accurately recovers all 6 edges: e_{1,0}, e_{2,0}, e_{3,0}, e_{4,0}, e_{3,1}, and e_{4,2}. Alg. 2 accurately recovers 4 edges: e_{1,0}, e_{2,0}, e_{3,0}, and e_{4,2}. Alg. 3 accurately recovers 4 edges: e_{1,0}, e_{2,0}, e_{3,0}, and e_{4,0}.

4.3 Simulation
To evaluate the proposed methodology, we use the directed growing network (DGN) model proposed by [63] to generate directed graphs for a simulation study. A DGN is built by adding nodes according to the following simple rule: the network grows by adding nodes one at a time, and a newly introduced node randomly selects a target node and links to it, as well as to all ancestor nodes of the target node.
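A minimal sketch (assuming networkx; the function name is ours) of the DGN growth rule described above: each new node picks a uniformly random existing target and links to it and to all of the target's ancestors, i.e. every node reachable from the target.

import random
import networkx as nx

def dgn(n, seed=None):
    rng = random.Random(seed)
    G = nx.DiGraph()
    G.add_node(0)
    for new in range(1, n):
        target = rng.randrange(new)             # uniformly random existing node
        ancestors = nx.descendants(G, target)   # all nodes reachable from the target
        for t in {target} | ancestors:
            G.add_edge(new, t)                  # link the new node to the target and its ancestors
    return G

G = dgn(500, seed=42)
print(G.number_of_nodes(), G.number_of_edges())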
4.3.1 Simulated Data Sets
Five data sets are generated using the DGN model. The network size varies from 100 to 500 nodes and from 460 to 2382 edges. The corresponding statistics for the generated graphs are listed in Table 4.1. Figures 4.3 and 4.4 illustrate the in-degree and out-degree distributions of the simulated data sets. The out-degree distribution of the simulated data sets exhibits a Poisson-like form, while the in-degree distribution reveals a near power-law structure. Both characteristics coincide with the results reported by Krapivsky and Redner [63].
Figure 4.3. The in-degree distributions of the simulated data sets G100 to G500. IN(N=100) through IN(N=500) denote the in-degree distributions of G100 through G500, respectively, plotted against the degree k.
Table 4.1. Descriptive statistics for generated graphs G100-G500. The network size varies from 100 to 500 nodes and from 460 to 2382 edges. The average in- and out-degrees for each data set are also reported.

Data set   # of nodes   # of edges   <N_OUT>   <N_IN>
G100       100          460          4.61      4.61
G200       200          934          4.67      4.67
G300       300          1357         4.52      4.52
G400       400          2094         5.24      5.24
G500       500          2382         4.78      4.78
It is worth mentioning that Broder et al. [45] suggested that the out-degree distribution of the web has a power-law tail with exponent close to 2.7; however, it is not a good fit to the data. They further suggested that the out-degree distribution may follow a Poisson distribution. The Stanford WebBase project [64], based on more recent data on the structure of the web, confirms that a power law does not fit its out-degree distribution properly.
Figure 4.4. The out-degree distributions of the simulated data sets G100 to G500. OUT(N=100) through OUT(N=500) denote the out-degree distributions of G100 through G500, respectively, plotted against the degree k.
Table 4.2. Edge accuracy results from the three link association methods in phase-III.

Data set   Alg. 1    Alg. 2    Alg. 3
G100       55.87%    41.52%    51.74%
G200       43.79%    28.16%    39.40%
G300       34.64%    22.70%    33.24%
G400       36.15%    22.45%    32.52%
G500       29.61%    17.48%    28.77%

4.3.2 Performance Evaluation
The total number of correctly matched edges is chosen as the performance measure. Table 4.2 shows the percentage of edges correctly predicted by the three algorithms in phase-III (link association). Alg. 1 performs best among the three algorithms, Alg. 3 performs slightly worse than Alg. 1, and Alg. 2 is the worst. The accuracy rates for Alg. 1 on data sets G100-G500 are about 56%, 44%, 35%, 36%, and 30%, respectively.
Table 4.3. Edge accuracy statistics of 300 samples of the configuration model.

Data set   x̄        s.e.(x̄)   99% CI
G100       29.92%   0.0282    [29.5%, 30.3%]
G200       11.86%   0.0123    [11.7%, 12.0%]
G300       7.44%    0.0086    [7.3%, 7.5%]
G400       12.21%   0.0093    [12.1%, 12.3%]
G500       12.12%   0.0068    [12.0%, 12.2%]
To further quantify the quality of the three link association algorithms in phase-III, the edge accuracy results obtained from the configuration model are used as a baseline against which to compare the proposed DN-TSN algorithm. Given the same in-degree and out-degree sequences, 300 samples were generated from the configuration model for each test instance. Figure 4.5 depicts the near bell-shaped histograms of the 300 samples from the configuration model for the various node sizes (100 to 500). Table 4.3 displays the sample mean (x̄), standard error (s.e.(x̄)), and 99% confidence interval (99% C.I.) for the five test sets. Detailed results for the five histograms are shown in Figure 4.5.
Compared with the baseline results obtained from the configuration model, all three link association algorithms are superior: the edge accuracy percentage lies above the 99% C.I. of the respective 300 samples for each test set.
Figure 4.5. Histograms of the predicted edge accuracy rate from the configuration model
with the given bipartite node degree sequences. Each histogram contains 300 samples; the
X-axis corresponds to the predicted edge accuracy rate and the Y-axis to the frequency of
each accuracy rate. Results for the five test sets G100-G500 are shown in panels (a)
through (e), representing G100, G200, G300, G400, and G500, respectively.
4.4 Partition-based algorithm
This section develops an improved algorithm motivated by observations from the
computer-generated problems described in the previous section. As shown in Figure 4.6,
the higher the graph density, the better the edge prediction accuracy of the proposed
algorithm. Thus, we first form a weighted graph from the graphs within the observed time
window and partition the nodes of interest into disjoint sets. Let < G1, G2, . . . , GP >
be a series of P sampled graphs, where Gd denotes the dth sampled graph. A graph G is a
pair (V, E), where V is a set of vertices and E is a set of edges between the vertices,
E ⊆ {(u, v) | u, v ∈ V}. Let Qw = < Gk, Gk+1, . . . , Gk+Q > represent a window of sampled
graphs, 1 ≤ k ≤ P − Q, where Q is the window length. Let G<w> be the union graph of Qw,
and form an undirected weighted graph Ω = (V, E, W), made up of a set of vertices V and a
set of edges E, such that an edge between two vertices represents their similarity. The
adjacency matrix Z is of size |V| × |V|; its non-zero entries equal the edge weights, and
an entry of Z is 0 if there is no edge between the corresponding vertices. Figure 4.7
serves as an illustrative example of constructing the weighted graph Ω from three
successive sampled graphs. The weight Z_{i,j} associated with edge E_{i,j} is the number
of occurrences of that edge across the graphs Gk in the window; that is,

Z_{i,j} = Σ_k E^k_{i,j}.    (4.2)
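The construction of Ω in Eq. (4.2) is straightforward to sketch in code; the snippet
below, assuming NetworkX graphs for the snapshots, is illustrative rather than the thesis
implementation.

    # Sketch: build the weighted union graph of Eq. (4.2) from a window of
    # sampled graphs; an edge's weight is the number of snapshots containing it.
    import networkx as nx

    def union_weighted_graph(window):
        omega = nx.Graph()
        for g in window:                 # window = [G_k, ..., G_{k+Q}]
            omega.add_nodes_from(g.nodes())
            for u, v in g.edges():
                w = omega.get_edge_data(u, v, default={"weight": 0})["weight"]
                omega.add_edge(u, v, weight=w + 1)
        return omega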
Our focus here is to partition the weighted graph Ω into k disjoint partitions
{V1, V2, . . . , Vk} with V1 ∪ V2 ∪ . . . ∪ Vk = V, such that all edges in each partition
have correlated temporal evolution and each partition has high spatial proximity.
Next, we discuss the analytical results for the partitioned subgraphs. Suppose m edges
exist in the graph Ω. Let ζ* = max_{i∈N} ζi be the largest node degree in Ω, and let X be
a random variable representing the number of correctly predicted edges in a sequence of
n = (Σ_{i=1}^{N} ζ̂i)/2 predicted edges drawn without replacement from a finite population
of size Λ = N × ζ*. Then X follows the hypergeometric distribution with parameters Λ, m,
and n, and the probability of getting exactly k correct edges is given by

Pr(k; Λ, m, n) = C(m, k) · C(Λ − m, n − k) / C(Λ, n),    (4.3)

where C(a, b) denotes the binomial coefficient. The average number of correctly predicted
edges is E(X) = mn/Λ.
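For concreteness, Eq. (4.3) and the expectation mn/Λ can be evaluated directly with
SciPy's hypergeometric distribution; the numbers below are illustrative placeholders,
not values from the thesis.

    # Sketch: hypergeometric probabilities for the edge-prediction model of
    # Eq. (4.3); Lambda, m, n follow the notation above (values illustrative).
    from scipy.stats import hypergeom

    Lambda, m, n = 6000, 2382, 1191        # population, true edges, predicted edges
    dist = hypergeom(M=Lambda, n=m, N=n)   # SciPy: M=population, n=successes, N=draws
    print(dist.pmf(300))                   # Pr(exactly 300 correct edges)
    print(dist.mean(), m * n / Lambda)     # both equal mn/Lambda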
Let k be the number of clusters dividing graph G into equal-sized subgraphs
Ω1, Ω2, . . . , Ωk, and let Xj be a random variable representing the number of correctly
predicted edges within subgraph Ωj. Let ζ*_j = max_{i∈N_j} ζi be the largest node degree
in subgraph Ωj. Each subgraph then has an average number of correctly predicted edges
E(Xj) = m_j n_j / (|N_j| × ζ*_j).
Figure 4.6. Predicted edge accuracy (%) versus graph density for Alg. 1, Alg. 2, and
Alg. 3; the graph densities of the five test sets range from 0.093 down to 0.019.
Figure 4.7. An illustrative example of weighted graph construction. Panel (a) contains
three sequential time-indexed graphs D1, D2, and D3. Panel (b) uses the frequency of link
appearances in the observed time period as edge weights and forms the corresponding
weighted graph.
Lemma 1. Given a graph G with m edges and a predicted node degree sequence
(ζ̂1, ζ̂2, . . . , ζ̂N), so that n = (Σ_{i=1}^{N} ζ̂i)/2, let K be the number of partitions
dividing graph G into equal-sized subgraphs G1, G2, . . . , GK, and let m_i be the number
of edges retained in cluster i. If a cluster solution exists such that

Σ_{i=1}^{K} m_i² ≥ m²/K,

then partitioning graph G into K equal-sized clusters yields a higher average number of
correctly predicted edges than the solution obtained from the original G without
partitioning.
Proof of Lemma 1. It suffices to show that

Σ_{j=1}^{K} m_j² / (|N_j| × ζ*_j) − m² / (|N| × ζ*) ≥ 0,

that is,

{ m_1²/(|N_1| × ζ*_1) + m_2²/(|N_2| × ζ*_2) + . . . + m_K²/(|N_K| × ζ*_K) } − m²/(|N| × ζ*) ≥ 0.

Since |N_1| = |N_2| = . . . = |N_K| = |N|/K and ζ*_i ≤ ζ* for all i = 1, 2, . . . , K,

{ m_1²/(|N_1| × ζ*_1) + . . . + m_K²/(|N_K| × ζ*_K) } − m²/(|N| × ζ*)
≥ { m_1²/(|N_1| × ζ*) + . . . + m_K²/(|N_K| × ζ*) } − m²/(|N| × ζ*)
= (1/(|N| × ζ*)) { K × Σ_{i=1}^{K} m_i² − m² }.

By the assumed condition Σ_{i=1}^{K} m_i² ≥ m²/K, we have K × Σ_{i=1}^{K} m_i² − m² ≥ 0,
so the right-hand side is non-negative and the claim follows.
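The Lemma 1 condition is easy to check numerically for any candidate clustering; the edge
counts in the sketch below are made up for illustration.

    # Sketch: check the Lemma 1 condition  sum_i m_i^2 >= m^2 / K  for a
    # candidate equal-sized partition (edge counts here are illustrative).
    def partition_helps(cluster_edge_counts, total_edges):
        K = len(cluster_edge_counts)
        return sum(m_i ** 2 for m_i in cluster_edge_counts) >= total_edges ** 2 / K

    print(partition_helps([90, 20, 10], total_edges=150))  # True:  8600 >= 7500
    print(partition_helps([40, 40, 40], total_edges=150))  # False: 4800 <  7500

A partition that concentrates edges inside a few dense clusters satisfies the bound, which
is exactly the motivation for the partition-based algorithm.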
Suppose graph G is partitioned into K unequal-sized and K equal-sized clusters,
respectively. Let n_i be the size of unequal-sized cluster i and n'_i be the size of
equal-sized cluster i, such that Σ_{i=1}^{K} n_i = Σ_{i=1}^{K} n'_i = |N|. Let
ρ_i = m_i/|N_i| be the density of cluster i under the unequal-sized partition, and ø_i the
density of cluster i under the equal-sized partition.

Lemma 2. If an optimal partition solution ρ*_i exists such that
Σ_j ρ*_j n_j ≥ Σ_j ø_j n'_j, then

Σ_{j=1}^{K} ρ*_j n_j / ζ*_j ≥ Σ_{j=1}^{K} ø_j n'_j / ζ*_j.
Proof of Lemma 2.

Σ_{j=1}^{K} ρ*_j n_j / ζ*_j − Σ_{j=1}^{K} ø_j n'_j / ζ*_j
= { ρ*_1 n_1/ζ*_1 + ρ*_2 n_2/ζ*_2 + . . . + ρ*_K n_K/ζ*_K } − { ø_1 n'_1/ζ*_1 + ø_2 n'_2/ζ*_2 + . . . + ø_K n'_K/ζ*_K }
≥ (1/ζ*) { [ρ*_1 n_1 + ρ*_2 n_2 + . . . + ρ*_K n_K] − [ø_1 n'_1 + ø_2 n'_2 + . . . + ø_K n'_K] }
≥ 0,

where the last inequality follows from the assumption Σ_j ρ*_j n_j ≥ Σ_j ø_j n'_j. Hence

Σ_{j=1}^{K} ρ*_j n_j / ζ*_j ≥ Σ_{j=1}^{K} ø_j n'_j / ζ*_j.
We conclude from Lemmas 1 and 2 that if an optimal partition solution ρ*_i exists such
that Σ_j ρ*_j n_j ≥ Σ_j ø_j n'_j, then partitioning into unequal-sized subgraphs results
in higher edge prediction accuracy than not partitioning. A lower bound on the number of
clusters can be stated in a corresponding lemma.
4.5 Conclusion

In this section, we propose a novel analytical approach to study the evolution process of
general networks, that is, networks with directional edges, multiple edges, and loops.
This type of network is essential in the study of disease-spreading networks and World
Wide Web networks, and has drawn increasing attention in recent years. The proposed
methodology handles directional and multiple edges by transforming the network into an
equivalent bipartite graph. We then construct two node degree sequences, one for the
in-degree and one for the out-degree of the corresponding bipartite graph. Treating each
node degree sequence as a stochastic process enables us to model and predict future in-
and out-degrees, which we then use to recover the associated links in the graph. It is
shown that the three link association methods produce superior results to the existing
configuration model. In addition, we also showed that forming proper clusters, that is,
identifying denser subgraphs, significantly improves the performance of the proposed
algorithm.
Chapter 5
Data and Experiment Results
A distinct data set, the Enron corpus, taken from the email communication network domain,
is used as an example to validate the proposed methodology in this research. This data
set is chosen for the social contexts it represents, the varying counts of participants
and events it covers, and its established familiarity and validity. The data used in this
thesis are compiled from, and have not been altered in any way from, their original
source.
5.1 Email communication dataset: Enron
Hodak [65] remarked that Enron's collapse is generally viewed as a morality tale - the
natural result of managerial greed, a clueless board, and feckless gatekeepers. Before
its bankruptcy in late 2001, Enron Corporation was one of the world's leading energy,
commodities, and services companies. The company marketed electricity and natural gas
and delivered energy, pulp and paper, and other physical commodities. In one of the most
notorious corporate frauds, Enron employees in 2001 artificially introduced an energy
shortage and subsequently overcharged California energy users, contributing to the
California energy crisis.
The Enron corpus has been considered a useful source for research in fields such as link
analysis, social network analysis [66], and textual analysis [67] [36]. It is a touchstone
and valuable dataset, providing a substantial, publicly available collection of real
email. The Enron email log was originally made public by the Federal Energy Regulatory
Commission during the legal investigation concerning the Enron Corporation. It was then
collected and prepared by the CALO Project (A Cognitive Assistant that Learns and
Organizes), containing data from about 150 users, mostly senior management of Enron,
organized into folders. The email dataset was later purchased by Leslie Kaelbling at MIT,
but it contained a number of integrity problems, such as duplicate and corrupted messages.
Contributors such as Melinda Gervasio at SRI and William Cohen at CMU worked hard to
correct these problems. In addition, Jitesh Shetty and Jafar Adibi [68] reorganized the
dataset and created a MySQL database for link analysis.
In this research, the Enron email dataset serves as a real-life benchmark to evaluate the
proposed methodology, in addition to the simulation results. The entire corpus represents
a large-scale email communication collection over a 3.5-year period. We used the
pre-processed version provided by Shetty and Adibi [68]. This dataset originally contains
252,759 emails from 151 Enron employees, mainly senior managers. The time horizon selected
for our analysis is May 1999 to June 2002.
5.1.1 Pre-processing
The evolutionary process is analyzed on a monthly basis. We first generate a sequence of
time-indexed graphs S over the selected study period. An email time graph of month t, Gt,
is defined as an undirected graph with nodes representing senders and recipients, and
edges connecting senders and recipients of emails exchanged during month [t − 1, t].
Define e^{ij}_t to be a binary variable representing an edge in graph Gt; e^{ij}_t = 1
implies that there is at least one email communication between user i and j in month t
(i.e., either i is the sender and j the recipient, or vice versa):

e^{ij}_t = 1 if the number of emails between user i and j in month t > 0,
           0 otherwise.
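A minimal sketch of this pre-processing step is given below; the pandas DataFrame columns
(sender, recipient, date) are assumed for illustration and are not the actual schema of
the Shetty-Adibi database.

    # Sketch: build the monthly time graphs G_t from a raw email log.
    import pandas as pd
    import networkx as nx

    def monthly_time_graphs(log: pd.DataFrame):
        graphs = {}
        log = log[log["sender"] != log["recipient"]]   # ignore self-addressed emails
        for month, rows in log.groupby(log["date"].dt.to_period("M")):
            g = nx.Graph()
            # e_t^{ij} = 1 iff at least one email between i and j in this month
            g.add_edges_from(zip(rows["sender"], rows["recipient"]))
            graphs[str(month)] = g
        return graphs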
Table 5.1. Number of nodes and edges of the monthly time graphs.

Year-Month  # of nodes  # of edges    Year-Month  # of nodes  # of edges
199905      11          13            200012      113         255
199906      12          15            200101      115         255
199907      11          17            200102      110         240
199908      19          23            200103      112         260
199909      17          23            200104      123         317
199910      17          24            200105      126         317
199911      17          24            200106      113         221
199912      44          60            200107      100         205
200001      55          71            200108      123         291
200002      47          73            200109      124         311
200003      60          80            200110      132         481
200004      56          73            200111      122         405
200005      64          91            200112      104         259
200006      75          104           200201      107         270
200007      79          139           200202      99          243
200008      100         196           200203      24          63
200009      94          168           200204      3           2
200010      115         222           200205      7           6
200011      124         264           200206      5           5
5.1.2 Data exploration
We analyzed the Enron data on a monthly basis; 38 monthly graphs were extracted for the
study. Table 5.1 shows the number of links and nodes for each of the time-indexed graphs.
We observed increasing volumes of email exchange (increasing numbers of nodes and links)
between the 151 selected managers from May 1999 to October 2001. These patterns changed
after October 2001, when communication started to decrease. This observation is also
consistent with Figure 5.1, which presents the monthly average node degree of the 38
monthly graphs: the average node degree shows an increasing trend until October 2001 and
a sharp decrease in the last eight months (November 2001 to June 2002).
Figure 5.1. Monthly average node degree of the 38 monthly graphs extracted from the Enron
dataset (May 1999 to June 2002).
The 38 time graphs were split into two segments: the increasing period (May 1999 to
October 2001) and the decreasing period (November 2001 to June 2002). A Densification
Power Law (DPL) plot was constructed for the increasing-period data in Figure 5.2. It
shows that the number of links grew faster than the number of nodes from May 1999 to
October 2001, with a densification exponent of a = 1.2583 over this period.
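The densification exponent is simply the slope of a log-log regression of edge counts on
node counts; a rough sketch using three of the monthly counts from Table 5.1 (illustrative
only, not the full fit reported above) is:

    # Sketch: estimate the densification exponent a in E ~ N^a from monthly
    # node/edge counts; three sample months from Table 5.1 are used here.
    import numpy as np

    nodes = np.array([11, 64, 132], dtype=float)   # 1999-05, 2000-05, 2001-10
    edges = np.array([13, 91, 481], dtype=float)
    a, _ = np.polyfit(np.log10(nodes), np.log10(edges), 1)
    print(a)   # rough estimate; the thesis reports 1.2583 over all months May 1999-Oct 2001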
We also studied whether the 38 graphs were scale-free networks, a property many
empirically observed networks appear to have, including the World Wide Web, protein
networks, citation networks, and some social networks. That is, the fraction P(k) of
nodes in the network having k connections to other nodes goes, for large values of k, as
P(k) ∼ k^{−γ}, where γ is a constant and 2 < γ < 3. Most of the graphs extracted from the
Enron data set did not have this property. Take the graph of October 2001 as an example:
the log-log plot of node degree versus counts in Figure 5.4 shows that the power-law
relationship between nodes and links does not hold, and the fitted exponent for this
particular graph is γ = 1.2145.
Figure 5.2. Number of edges versus number of nodes, in log-log scale, for the increasing
period May 1999 to October 2001, with slope = 1.2583.
Figure 5.3. Density, average clustering coefficient, and transitivity of the 38 time
graphs.
Figure 5.4. Node degree versus counts, in log-log scale, for the October 2001 time graph,
with slope = -1.2145.
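A simple least-squares fit on log-log scale, of the kind behind Figure 5.4, can be
sketched as follows; this ignores the usual caveats about binning and maximum-likelihood
exponent estimators, and the exact fitting procedure of the thesis is not reproduced here.

    # Sketch: estimate a power-law exponent from a graph's degree distribution
    # by linear regression of log(count) on log(degree).
    import numpy as np
    from collections import Counter

    def loglog_slope(graph):
        counts = Counter(dict(graph.degree()).values())
        k = np.array(sorted(d for d in counts if d > 0), dtype=float)
        p = np.array([counts[int(d)] for d in k], dtype=float)
        slope, _ = np.polyfit(np.log10(k), np.log10(p), 1)
        return -slope   # should come out close to the reported 1.2145 for October 2001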
5.2 Undirected graph evolution: Experiments and results
The Enron email corpus described in Section 5.1 is used to evaluate the proposed algorithm
for undirected graph evolution, UN-TSN. In this experiment, we use the time-indexed graphs
of the T previous months (Gt−T, . . . , Gt−1, Gt) to predict K future graphs
(Gt+1, Gt+2, . . . , Gt+K).
5.2.1 Experiment I and performance evaluation
In the first experiment, we use the monthly graphs from May 1999 to December 2000 (T = 20)
to predict the graphs of January to June 2001 (K = 6). The UN-TSN algorithm, designed
specifically for simple evolving graphs, was applied. Figure 5.5 visualizes the projected
graphs of January to June 2001, arranged in three columns: the leftmost column contains
the actual monthly graphs extracted from the Enron dataset, used as goal graphs to
evaluate the projections; the middle column presents the projected graphs from the SMAX
algorithm; and the rightmost column presents the projected graphs from the HL
(Havel-Hakimi) algorithm. The HL algorithm mimics the respective actual graphs better; in
its projected graphs, the majority of nodes are identified as a connected component.
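For reference, the Havel-Hakimi realization step of UN-TSN can be sketched with NetworkX's
built-in construction; the degree sequence below is an illustrative forecast output, not
one taken from the thesis.

    # Sketch: realize a predicted (integer) degree sequence as a simple graph
    # with the Havel-Hakimi procedure, as in the HL branch of UN-TSN.
    import networkx as nx

    predicted_degrees = [4, 3, 3, 2, 2, 1, 1]       # illustrative forecast output
    if nx.is_graphical(predicted_degrees):          # Erdos-Gallai check
        g_hat = nx.havel_hakimi_graph(predicted_degrees)
        print(g_hat.number_of_nodes(), g_hat.number_of_edges())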
Three quantitative measures, S(G) proposed by Li et al. [41], the average clustering
coefficient C̄G, and transitivity, are used to compare the projected graphs with the goal
graphs; Figures 5.6 to 5.8 present the respective comparison results. On transitivity, HL
outperforms SMAX in all six projected graphs from January to June 2001. On the other two
measures, C̄G and S(G), HL and SMAX perform similarly, with HL slightly better for the
projected graphs of January to March 2001. We conjecture that the better projection
results of the HL algorithm are due to most of the extracted graphs not being scale-free
and lacking a power-law relationship between nodes and links.
Figure 5.5. Graph visualization of the time graphs from January 2001 to June 2001 and
their respective predicted outputs from the SMAX and HL algorithms. Panels are arranged
in rows of three per month, showing the goal graph, the SMAX output, and the HL output
for G200101 through G200106.
Figure 5.6. S(G) comparison between the resultant projection graphs from the HL and SMAX
algorithms. The X-axis represents six monthly projection results: 1: January 2001,
2: February 2001, 3: March 2001, 4: April 2001, 5: May 2001, and 6: June 2001.
Figure 5.7. Average clustering coefficient comparison between the resultant projection
graphs from the HL and SMAX algorithms. The X-axis represents six monthly projection
results: 1: January 2001, 2: February 2001, 3: March 2001, 4: April 2001, 5: May 2001,
and 6: June 2001.

5.2.2 Experiment II and performance evaluation
Experiment II is set up to study the impact of the length of historical data on the
projection accuracy of the UN-TSN algorithm. We chose the best-fitted model from a
sequence of graphs over the T previous months (Gt−T, . . . , Gt−1, Gt) to project the
K-step-ahead graphs (Gt+1, Gt+2, . . . , Gt+K). Three lengths of historical data,
T = 10, 20, 30 (months), and K = 4 (months) were selected for this experiment. Table 5.2
lists the six test settings of Experiment II. The two projection periods are November 2001
to February 2002 and April to July 2002, using 10, 20, and 30 months of historical data
for projection, respectively.
Figure 5.8. Transitivity comparison between the resultant projection graphs from the HL
and SMAX algorithms. The X-axis represents six monthly projection results: 1: January
2001, 2: February 2001, 3: March 2001, 4: April 2001, 5: May 2001, and 6: June 2001.
To evaluate the reliability of the projected graphs, four macroscopic properties are used:
graph density, S(G) proposed by Li et al. [41], the average clustering coefficient C̄G,
and transitivity. Both C̄G and transitivity are clustering measurements.
The performance criterion chosen to compare the four associated properties of a projected
graph and its goal graph is the cosine measure cos(Ya, Yb), where Ya and Yb are two
vectors containing the four associated properties. The cosine measure is given by

cos(Ya, Yb) = (Ya · Yb) / (||Ya||2 · ||Yb||2)    (5.1)

and captures a scale-invariant notion of similarity. Due to this property, vectors can be
normalized to the unit sphere for more efficient processing. Other vector comparison
criteria could be chosen in a similar manner.
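As a sketch, Eq. (5.1) applied to the G200111 column of Table 5.3 (goal graph versus the
T = 10 projection) looks as follows; note that this naive computation is dominated by the
S(G) term, and the thesis may normalize or weight the four properties differently.

    # Sketch: cosine measure of Eq. (5.1) over the four macroscopic properties
    # (density, S(G), average clustering coefficient, transitivity).
    import numpy as np

    def cosine(ya, yb):
        ya, yb = np.asarray(ya, float), np.asarray(yb, float)
        return float(ya @ yb / (np.linalg.norm(ya) * np.linalg.norm(yb)))

    goal      = [0.0549, 43151, 0.4351, 0.3408]    # G200111 goal graph (Table 5.3)
    projected = [0.0424, 100848, 0.3474, 0.3377]   # G200111 projection, T = 10
    print(cosine(goal, projected))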
Figure 5.9 illustrates the cosine-measure similarity comparison of the four-month
projected graphs from November 2001 (G200111) to February 2002 (G200202) with various
lengths of historical data, T = 10, 20, 30.
Table 5.2. Experiment II setup: three different lengths of historical data, T = 10, 20, 30
(months), were used to project 4-month-ahead graph evolution.

Projection period    T=10                 T=20                 T=30
G200111 - G200202    G200101 - G200110    G200003 - G200110    G199905 - G200110
G200204 - G200207    G200106 - G200203    G200008 - G200203    G199910 - G200203
Table 5.3. Four measurements: density, S(G), C̄G, and transitivity for the four projected
graphs from November 2001 (G200111) to February 2002 (G200202).

Density
           Goal graph   T=10     T=20     T=30
G200111    0.0549       0.0424   0.0394   0.0401
G200112    0.0484       0.0455   0.0402   0.0396
G200201    0.0476       0.0472   0.0390   0.0404
G200202    0.0501       0.0531   0.0405   0.0426

S(G)
           Goal graph   T=10     T=20     T=30
G200111    43151        100848   67640    75839
G200112    15294        130156   71847    72790
G200201    20993        167784   70148    80155
G200202    24864        246205   72532    89138

C̄G
           Goal graph   T=10     T=20     T=30
G200111    0.4351       0.3474   0.3811   0.3370
G200112    0.3832       0.4058   0.3582   0.3382
G200201    0.4446       0.3828   0.2887   0.2982
G200202    0.4357       0.4871   0.3225   0.3213

Transitivity
           Goal graph   T=10     T=20     T=30
G200111    0.3408       0.3377   0.3498   0.3459
G200112    0.3059       0.3515   0.3420   0.3423
G200201    0.3502       0.3518   0.3569   0.3389
G200202    0.1945       0.3297   0.3485   0.3525
The results show that both the length of historical data used (T) and the projection
length (K-step-ahead projection) affect projection accuracy. In this empirical study, a
historical data length of 20 months (T = 20) gave the best graph projections compared with
T = 10 and T = 30 months, and is therefore considered the better choice of historical data
length. In addition, as the projection length K increases, the variability of the results
also increases. We conjecture that if the time series extracted from the node degree
sequences were linear, the projection results of UN-TSN would be even more accurate.
5.2.3 Experiment III and performance evaluation

Experiment III is set up to study mainly the increasing period (May 1999 to October 2001),
where the variances of the node degree time series are relatively stable and the trend is
linear. We choose a historical data length of T = 20 to project graph evolution six months
ahead (January to June 2001).
Figure 5.9. Similarity comparison of the four-month projected graphs (G200111-G200202)
with various lengths of historical data, T = 10, 20, 30.
Figure 5.10. Cosine measure comparison of the 1- to 6-step-ahead predicted graphs for
G200101 to G200106, time scale T = 20.
Figure 5.10 shows that all six projected graphs are more than 90% similar to their
respective goal graphs under the cosine measure. Specifically, the projections up to five
months ahead (January to May 2001) are above 96% similarity, and the 1-month-ahead
projection (January 2001) is the highest, at 99.52% similarity to the goal graph.
5.2.4 Summary
The Enron corpus, an empirical email communication dataset, is used to validate the
proposed algorithm for simple evolving graphs, UN-TSN. Three experiments were designed to
study the reliability and accuracy of UN-TSN.
In the first experiment, we used the monthly graphs from May 1999 to December 2000
(T = 20) to predict the graphs of January to June 2001 (K = 6). Three quantitative
measures, S(G), the average clustering coefficient C̄G, and transitivity, were used to
compare the projected and goal graphs. We conjecture that the better projection results
of the HL algorithm are due to most of the extracted graphs not being scale-free and
lacking a power-law relationship between nodes and links.
Experiment II was set up to study the impact of the length of historical data on the
projection accuracy of the UN-TSN algorithm. It is shown that both the time scale and the
prediction scale (the prediction length K) affect prediction accuracy. In fact, when the
time series are linear, the predictions of the proposed algorithm are even more accurate.
Experiment III was set up to study mainly the increasing period (May 1999 to October
2001), where the variances of the node degree time series are relatively stable and the
trend is linear. We generated six projected graphs with a historical data length of 20
months (T = 20) in this period. All six projected graphs are more than 90% similar to
their respective goal graphs under the cosine measure; the projections up to five months
ahead (January to May 2001) are above 96% similarity, and the 1-month-ahead projection
(January 2001) is the highest, at 99.52% similarity to the goal graph.
Table 5.4. Employee status and corresponding number of people.

Employee Level   Employee Status     # of people
0                CEO                 4
1                Director            22
2                Employee            36
3                In House Lawyer     4
4                Manager             13
5                Managing Director   5
6                N/A                 26
7                President           3
8                Trader              12
9                Vice President      24
5.3 Directed graph evolution - Enron dataset: Experiments and results

The Enron email corpus described in Section 5.1 is again used to validate the directed
graph evolution experiments. We select 149 employees spanning the 10 employee statuses
shown in Table 5.4 and extract the email records sent from and to these 149 individuals.
The time span of the final email collection analyzed runs from May 1999 to June 2002, and
the evolution analysis is performed on a monthly basis.
5.3.1 Generation of time-indexed sample graphs
We next identify the corresponding email records within the selected study period and
generate the monthly time graphs for each of the 38 snapshots. A time graph of month t,
Gt, is defined as a directed graph with nodes representing senders and recipients, and
edges connecting senders to recipients of emails exchanged during month [t − 1, t]. Define
e^{ij}_t to be a binary variable representing an edge in graph Gt; e^{ij}_t = 1 implies
that there has been at least one email communication from sender i to recipient j in
month t:

e^{ij}_t = 1 if the number of emails from sender i to recipient j in month t > 0, where i ≠ j,
           0 otherwise.
Table 5.1 shows the monthly communication frequencies for the 38 monthly time graphs,
which exhibit a peak level of email communication around October 2001. Based on the
average node degree, the overall trend increases until October 2001 and decreases
afterwards, which is consistent with the peak communication behavior in the frequency
counts of Table 5.1. We then split the 38 time graphs into two segments: the increasing
period (May 1999 to October 2001) and the decreasing period (November 2001 to June 2002).
Figures 5.11 and 5.12 give graphical representations of the monthly time-indexed directed
graphs extracted from the Enron dataset. Each subgraph is sampled on a monthly basis from
the original Enron dataset, focusing only on the selected Enron employees. Each node
corresponds to a distinct email user, and each link between two nodes represents an email
exchange between those two users. The direction associated with each link indicates
whether an email user is the sender or the recipient.
Figure 5.11. Monthly time graphs from December 1999 to November 2000, labeled
sequentially from (a) G199912 to (l) G200011.
Figure 5.12. Monthly time graphs from December 2000 to November 2001, labeled
sequentially from (a) G200012 to (l) G200111.
5.3.2 Partition-based graph prediction
We start by constructing the weighted graph Ω within the time window of interest and
partitioning it into k subgraphs. Graph partitioning is a well-studied problem concerned
with dividing a graph into k disjoint partitions by cutting a minimal number of edges
while maintaining a similar number of vertices in each partition (Bondy and Murty 1976);
partitions with a similar number of vertices are referred to in the literature as balanced
partitions. A number of graph clustering objectives [69] [70] [71] [72] have been proposed
and studied. In the context of this study, we aim at minimizing the cut between each
cluster and the remaining vertices. The objective chosen [70] is the ratio cut,

RCut(G) = min_{V1,...,Vk} Σ_{c=1}^{k} links(Vc, V \ Vc) / |Vc|.    (5.2)

Based on these partitions, we devise the proposed algorithm so that it encourages the
prediction of edges that lie within strongly connected graph partitions.
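A standard way to approximate the ratio-cut objective (5.2) is spectral clustering on the
unnormalized graph Laplacian; the sketch below follows that common relaxation and is not
necessarily the exact procedure used in the thesis.

    # Sketch: spectral relaxation of the ratio-cut objective (5.2) on the
    # weighted graph Omega, followed by k-means on the embedding.
    import numpy as np
    import networkx as nx
    from scipy.linalg import eigh
    from sklearn.cluster import KMeans

    def ratio_cut_partition(omega, k):
        nodes = list(omega.nodes())
        W = nx.to_numpy_array(omega, nodelist=nodes, weight="weight")
        L = np.diag(W.sum(axis=1)) - W            # unnormalized Laplacian
        _, vecs = eigh(L)                         # eigenvectors, ascending eigenvalues
        embedding = vecs[:, :k]                   # k smallest eigenvectors
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
        return {v: int(c) for v, c in zip(nodes, labels)}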
5.3.3 Performance evaluation
Various experiments were set up to test the proposed algorithm. We select different time
scales T and choose the best-fitted model under the AIC criterion for each node degree
sequence extracted from the series of time-indexed graphs within the observation period.
For each node i, the observed degree sequence (ζ^i_{t−T}, . . . , ζ^i_{t−1}, ζ^i_t) is
then used to produce k-step-ahead predicted node degrees, which serve as inputs to recover
the k-step-ahead predicted graphs (Ĝt+1, Ĝt+2, . . . , Ĝt+k). Table 5.5 reports the
non-partition experiment results for April to August 2001 using T = 20 and S = 5. The
performance measure chosen to evaluate the predictions is the number of correctly
predicted edges. Table 5.5 shows that Alg. 1 performs better than Alg. 3 in four out of
five cases, while Alg. 2 performs the worst, indicating that the more frequently two email
users have communicated, the higher the probability that they will exchange emails in the
future.
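A minimal sketch of the per-node degree forecasting step, using statsmodels' ARIMA and
AIC-based order selection over a small illustrative (p, d, q) grid (the exact model space
searched in the thesis is not reproduced here):

    # Sketch: fit ARIMA models to one node-degree series, pick the best by AIC,
    # and forecast the next `steps` degrees.
    import itertools
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    def forecast_degree(series, steps, max_p=2, max_d=1, max_q=2):
        best_aic, best_fit = np.inf, None
        for order in itertools.product(range(max_p + 1), range(max_d + 1), range(max_q + 1)):
            try:
                fit = ARIMA(series, order=order).fit()
            except Exception:
                continue                      # skip orders that fail to converge
            if fit.aic < best_aic:
                best_aic, best_fit = fit.aic, fit
        if best_fit is None:
            raise ValueError("no ARIMA model could be fitted")
        forecast = best_fit.forecast(steps=steps)
        return np.clip(np.rint(forecast), 0, None)   # degrees are non-negative integers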
In addition, we extend the experiments to study the impact of partitioning all email users
(nodes) on the predicted edge accuracy. Table 5.6 lists the node IDs contained in
partitioned cluster #2.
Table 5.5. Predicted edge accuracy for the monthly graphs from April to August 2001
(T=20 and S=5).

       # of links   Alg1   Alg2   Alg3
G24    300          22     2      18
G25    242          23     0      13
G26    249          18     1      12
G27    177          10     0      12
G28    25           1      0      1
Table 5.6. Node IDs in partitioned cluster #2.

Node ID: 3, 4, 20, 21, 24, 27, 34, 35, 39, 41, 51, 54, 67, 82, 83,
         100, 111, 114, 115, 123, 124, 130, 137, 139, 149
In this particular cluster, there are 26 nodes and 50 edges. Predicted edge accuracy
results for April 2001, focusing only on these 26 email users, are compared and reported
in Table 5.7. Proper clustering of all email users improves the edge prediction accuracy
by 45.45% compared with the non-clustered results.
5.4 Summary
Chapter 5 documents the experimental results of applying the UN-TSN and DN-TSN algorithms
to the Enron corpus, a substantial real email benchmark dataset. The remarks are
summarized as follows.
The Enron data were analyzed on a monthly basis, and 38 monthly graphs were extracted for
the study. Email communication has one peak over the course of the selected study period,
appearing around October 2001.
Table 5.7. Predicted edge accuracy comparison for the April 2001 monthly graph between
partition and non-partition results.

                   Edge prediction accuracy
Cluster            0.32
Without cluster    0.22
Improvement        45.45%
This peak separates the graph evolution into two segments: the increasing period (May 1999
to October 2001) and the decreasing period (November 2001 to June 2002). In addition, we
studied whether the 38 graphs were scale-free networks; the results showed that most of
the graphs extracted from the Enron data set did not have this property.
For UN-TSN, three experiments were designed to study its reliability and accuracy. The
first experiment used the monthly graphs from May 1999 to December 2000 (T = 20) to
predict the graphs of January to June 2001 (K = 6). Three quantitative measures, S(G), the
average clustering coefficient C̄G, and transitivity, were used to compare the projected
and goal graphs. We conjecture that the better projection results of the HL algorithm are
due to most of the extracted graphs not being scale-free and lacking a power-law
relationship between nodes and links.
Experiment II was set up to study the impact of the length of historical data on the
projection accuracy of the UN-TSN algorithm. It is shown that both the time scale and the
prediction scale (the prediction length K) affect prediction accuracy; in fact, when the
time series are linear, the predictions of the proposed algorithm are even more accurate.
Experiment III was set up to study mainly the increasing period (May 1999 to October
2001), where the variances of the node degree time series are relatively stable and the
trend is linear. We generated six projected graphs with a historical data length of 20
months (T = 20) in this period. All six projected graphs are more than 90% similar to
their respective goal graphs under the cosine measure; the projections up to five months
ahead (January to May 2001) are above 96% similarity, and the 1-month-ahead projection
(January 2001) is the highest, at 99.52% similarity to the goal graph.
For DN-TSN, proper clustering of all email users improves the edge prediction accuracy by
45.45% compared with the non-clustered results.
Chapter 6
Conclusion and Future Research
In this thesis, we present a new perspective for exploring network evolution processes
from a time series point of view. A new type of network evolution problem, the time series
network (TSN), which takes the temporal element of the network evolution process into
account, is proposed.
In addition, we propose two novel hierarchical frameworks, UN-TSN and DN-TSN,
incorporating univariate ARIMA models with a graphical sequence procedure to predict
reliable and realistic future graph(s). Specifically, UN-TSN is proposed for the simple
network evolution process: it deals with networks with unidirectional edges and without
multiple loops between nodes. DN-TSN is proposed for the general network evolution
process, which involves networks with directional edges and with multiple loops between
nodes. Constructing and fitting time series models from the node degree sequence
perspective offers simplicity in problem decomposition and keen insight into the
respective graph properties that are critical to the evolution of the graph as a whole.
In this chapter we highlight some of the key results and contributions from our
research and also discuss directions for future work.
6.1 TSN Problem
A new type of network evolution problem, the time series network (TSN), which takes the
temporal element of the network evolution process into account, is proposed. The TSN
problem is defined as a problem in which the update operations include unrestricted
insertions and deletions of edges. Specifically, given a sequence of time-indexed graphs
S = (G1, G2, . . . , Gt), the objective is to generate realistic and reliable future
graph(s) Ĝt+1 at time t + 1. To the best of our knowledge, combining a graph-theoretic
approach with time series analysis to study network evolution processes has not been
discussed in the literature.
6.2 Hierarchical solution frameworks: UN-TSN and DN-TSN
UN-TSN comprises two phases. Phase-1 focuses on fitting ARIMA models to extrapolate the
patterns identified in each node degree sequence; the best-fitted model, with the smallest
AIC, is then chosen to predict the node degree sequence. Phase-2 uses the predicted node
degree sequence to recover the predictive graph via the Havel-Hakimi procedure. The
proposed algorithm produces mathematically tractable and computationally efficient
predictive graphs, which can be constructed in linear time and space.
Because existing methodologies, such as random graph models and link analysis techniques,
apply only to simple, undirected graphs, developing an effective and efficient methodology
that extends to generalized networks is essential. DN-TSN is proposed to study graph
evolution with directed and multiple edges, a problem that occurs frequently in many
applications and has drawn increasing attention in recent years. DN-TSN transforms a
general network Gx into an equivalent bipartite graph with two disjoint sets of nodes,
NOUT and NIN, and thereby introduces the capability of dealing with generalized graphs,
i.e. graphs with directional and multiple edges.
We investigate in detail a real-world communication data set, the Enron email corpus, to
validate the proposed method. In the email communication application, senders and
recipients linked by successive emails form a vast dynamic network of information
exchange, which represents an important new arena for knowledge discovery. By transforming
the directed graph into an equivalent bipartite graph, we constructed in- and out-degree
sequences for every node of the corresponding bipartite graph. Treating each node degree
sequence as a stochastic process enabled us to model and predict future node degrees and
to recover the associated links between them. It was shown that the three link association
methods produce superior results to the existing configuration model. In addition, we also
showed that forming proper clusters, that is, identifying denser subgraphs, significantly
improves the performance of the proposed algorithm.
6.3 Uniqueness and contribution of the thesis
In this thesis, we consider a fundamentally new approach to analyzing the evolution
process of complex engineering systems that can be realized as networks. The uniqueness
and contributions of this research can be summarized as follows:
• We extend the prediction capability of time series analysis from a set of singly indexed
values to a set of collected graphs.
• To the best of our knowledge, this is the first work lying at the intersection of graph
theory and time series analysis to study network evolution.
• In contrast to previous approaches that detect incremental changes in network evolution,
TSN considers the evolution of changes over an extended period of time.
• The two proposed novel hierarchical frameworks, UN-TSN and DN-TSN, provide the network
research community with an effective and efficient framework to study the evolution
process of real-world networked systems. Given general graph statistics such as the number
of nodes, the number of edges, and node degrees, the proposed frameworks are capable of
producing reliable and realistic predictive graph(s) preserving the respective key
principles that govern the graph structures.
• The proposed algorithm produces mathematically tractable and computationally efficient
predictive graph(s).
• Because existing methodologies such as random graph models and link analysis techniques
apply only to simple, undirected graphs, DN-TSN offers a new opportunity to study
generalized networks, i.e. networks with loops, directional edges, and multiple edges. For
certain types of networks, such as disease-spreading networks and the World Wide Web for
information retrieval, the directions of edges are essential.
• We present a valuable case study using a distinct data set, the Enron corpus, to
validate the proposed methodology. This data set is a touchstone, providing a substantial,
publicly available collection of real email.
• The experimental results show that both proposed algorithms are capable of propagating
time-evolving node degree sequence patterns, represented by a sequence of time-indexed
graphs, into a reliable predicted graph that mimics the real graph evolution process.
Analytical and simulation results on the properties of the predictive graphs obtained from
the proposed solution framework are reported in detail as well. A real-world data set, the
Enron corpus, a substantial real email benchmark, is used to validate the proposed
approaches for both the simple and the general network evolution process.
• Because the monthly graphs extracted from Enron lack a power-law relationship between
nodes and links, recovery results are better with the HL algorithm.
• The results show that both the time scale and the prediction scale (the prediction
length K) affect the prediction accuracy.
• Proper clustering of all email users before applying DN-TSN significantly improves the
prediction accuracy, by 45.45% compared with non-clustered results.
6.4 Future work
• Network growth: In this research, the time series network problem focuses on edge growth
in the network evolution process. However, the number of nodes changes as well; we assume
that the number of nodes in each sampled graph is fixed. Relaxing this constraint in
studying the network evolution process is a plausible direction for further investigation
and is expected to increase the applicability of the approach to real-world problems.
• Clustering problem: The analytical results in Chapter 4 and the empirical study in
Chapter 5 show that proper clustering (i.e., identifying dense clusters) before applying
the proposed DN-TSN algorithm significantly improves prediction results. We proposed the
idea of constructing a weighted graph from all previously sampled graphs; however, solving
the clustering problem itself is not in the scope of this study. In traditional graph
theory, finding clusters in a weighted graph can be translated into a max-flow min-cut
problem, which is a classical optimization problem with significant analytical results and
algorithms. One critical issue in applying these algorithms is their complexity. Designing
an effective and efficient clustering algorithm applicable to complex networks is a
critical component of network science studies.
• Temporal node correlation: In our proposed methodology, we assume that the degrees of
vertices are temporally independent of each other; that is, we predict each node's future
degree sequence independently, node by node. However, such correlations may exist and may
be important in real networks; in fact, correlations are inevitable in the network
evolution process. Examining and quantifying this effect would be an important next step
and is expected to have a significant impact on understanding the overall network
evolution structure.
• Network evolution in online social networks: Although the proposed approach is validated
using an email dataset, it is fairly general and could be applied to other domains as
well. Online social networks are good candidates: sites such as YouTube, Flickr, and
Facebook are among the most active on the Internet and represent a repository of some of
the largest data sets on social networks. To date, there is no study that reports findings
on the evolution processes of these networks.
Bibliography
[1] Bollobas, B. (2001) Random Graphs 2nd edition, Cambridge University
Press.
[2] Newman, M. E. J. (2003) “The structure and function of complex networks,” SIAM Review, 45, pp. 167–256.
[3] Albert, R. and A. L. Barábasi (2002) “Statistical mechanics of complex
networks,” Rev. Mod. Phys., 74, pp. 47–97.
URL http://dx.doi.org/10.1103/RevModPhys.74.47
[4] Boccaletti, S., V. Latora, Y. Moreno, M. Chavez, and D.-U.
Hwang (2006) “Complex Networks : Structure and Dynamics,” Physics Reports, 424(4-5).
[5] Albert, R., H. Jeong, and A. L. Barabási (1999) “Diameter of the
world-wide web,” Nature, 401, pp. 130–131.
[6] (2005) Committee on network science for future army applications. Network
Science., The National Academies Press.
[7] Lawrence, S. and C. L. Giles (1999) “Accessibility of information on the
web,” Nature, 400, pp. 107–109.
[8] Gulli, A. and A. Signorini (2005) “The indexable web is more than 11.5
billion pages,” WWW ’05: Special interest tracks and posters of the 14th
international conference on World Wide Web, ACM Press, New York, USA,
pp. 902–903.
[9] Page, L., S. Brin, R. Motwani, and T. Winograd (1998) “The PageRank Citation Ranking: Bringing Order to the Web,” Tech. rep., Stanford Digital
Library Technologies Project.
URL citeseer.ist.psu.edu/page98pagerank.html
[10] Kleinberg, J. M. (1999) “Authoritative sources in a hyper-linked environment,” Journal of the ACM, 46(5).
[11] Watts, D. J. and S. H. Strogatz (1998) “Collective dynamics of ‘smallworld’ networks,” Nature, 393, pp. 440–442.
[12] Barabasi, A., H. Jeong, Z. Neda, E. Ravasz, A. Schubert, and
T. Viscek (2002) “Evolution of the social network of scientific collaborations,” Physica A: Statistical Mechanics and its Applications, 311(3-4), pp.
590–614.
[13] Newman, M. E. J. (2001) “The structure of scientific collaboration networks,” Proceedings of National Academy of Sciences, 98, pp. 404–409.
[14] ——— (2004) “Coauthorship networks and patterns of scientific collaboration,” Proceedings of the National Academy of Sciences, pp. 5200–5205.
[15] Kuhn, K. W. (1999) “Molecular Interaction Map of the Mammalian Cell
Cycle Control and DNA Repair Systems,” Mol. Biol., 10, pp. 2703–2734.
[16] Albert, I. and R. Albert (2004) “Conserved network motifs allow for
protein-protein interaction prediction,” Bioinformatics, 20(18), pp. 3346–
3352.
[17] Albert, R. (2007) “Network Inference, Analysis, and Modeling in Systems
Biology,” The Plant Cell, 19, pp. 3327–3328.
[18] Ciliberti, S., O. C. C. Martin, and A. Wagner (2007) “An efficient
heuristic procedure for partitioning graphs,” Proc Natl Acad Sci USA.
[19] Dorogovtsev, S. N., A. V. Goltsev, and J. F. F. Mendes (2007)
“Critical phenomena in complex networks,” e-print cond-mat/0112143.
[20] Thadakamalla, H. P., R. Albert, and S. R. T. Kumara (2007) “Search
in spatial scale-free networks,” New Journal of Physics, 9, p. 190.
[21] Barabási, A. L. and R. Albert (1999) “Emergence of Scaling in Random
Networks,” Science, 286(5439), pp. 509–512.
URL http://www.sciencemag.org/cgi/content/abstract/286/5439/509
[22] Chakrabarti, D. and C. Faloutsos (2006) “Graph mining: Laws, generators, and algorithms,” ACM Computing Surveys, 38.
[23] Leskovec, J., A. Singh, and J. Kleinberg (2003) “Patterns of influence
in a recommendation network.” Pacific-Asia Conference on Knowledge Discovery and Data Mining (PKDD), pp. 380–389.
[24] Bekessy, A., P. Bekessy, and J. Komlos (1972) Stud. Sci. Math. Hungar., 7, pp. 343–353.
[25] Bender, E. A. and E. R. Canfield (1978) J. Comb. Theory A, 24, pp. 296–307.
[26] Molloy, M. and B. Reed (1995) Comb., Prob. and Comput., 6, pp. 161–179.
[27] ——— (1998) Comb., Prob. and Comput., 7, pp. 295–305.
[28] Barrat, A. and M. Weigt (2000) “On the properties of small-world network models,” The European Physical Journal B - Condensed Matter, 13, pp. 547–560.
[29] Abello, J., A. L. Buchsbaum, and J. Westbrook (1998) “A functional
approach to external graph algorithms,” Proc. of the 6th Annual European
Symposium on Algorithms, pp. 332–343.
[30] Faloutsos, M., P. Faloutsos, and C. Faloutsos (1999) “On Power-law
Relationships of the Internet Topology,” SIGCOMM, pp. 251–262.
URL citeseer.ist.psu.edu/michalis99powerlaw.html
[31] Kumar, R., P. Raphavan, S. Rajagopalan, and A. Tomkins (1999)
“Trawling the Web for emerging cyber-communities,” Proc. of 8th International World Wide Web(WWW) Conference.
[32] Bi, Z., C. Faloutsos, and F. Korn (2001) “The ”DGX” distribution for
mining massive, skewed data,” KDD ’01: Proceedings of the seventh ACM
SIGKDD international conference on Knowledge discovery and data mining,
pp. 17–26.
[33] Chakrabarti, D., Y. Zhan, and C. Faloutsos (2004) “A recursive model
for graph mining,” SIAM Int. Conf. on Data Mining.
[34] Redner, S. (2005) “Citation Statistics from 110 Years of Physical Review,”
Physics Today, 58, p. 49.
URL http://arxiv.org/abs/physics/0506056
[35] Barabási, A. L. and Z. N. Oltvai (2004) “Network Biology,” Nature Reviews Genetics, 5, pp. 101–113.
[36] Minkov, E., R. Wang, and W. Cohen (2006) “Contextual Search and Name Disambiguation in Email using Graphs,” SIGIR.
[37] Newman, M. (2003) “Mixing patterns in networks,” Phys Rev E Stat Nonlin
Soft Matter Phys, 67.
[38] Aiello, W., F. Chung, and L. Lu (2000) “A random graph model for
massive graphs,” Proceedings of the thirty-second annual ACM symposium on
Theory of computing, pp. 171–180.
[39] ——— (2000) “A random graph model for power law graphs,” Experimental
Mathematics, 10(1).
[40] Cohen, R. and S. Havlin (2003) “Scale-Free Networks Are Ultrasmall,”
Phys. Rev. Lett., 90(5), p. xxx.
[41] Li, L., D. Alderson, R. Tanaka, J. C. Doyle, and W. Willinger
(2005) “Towards a Theory of Scale-Free Graphs: Definition Properties, and
Implications (Extended Version),” The primary version is to appear in Internet Mathematics.
[42] Milgram, S. (1967) “The small world problem,” Psychology Today, 2, pp.
60–67.
[43] Watts, D. J., P. S. Dodds, and M. E. J. Newman (2002) “Identity and
search in social networks,” Science, 296, pp. 1302–1305.
[44] Kleinberg, J. (2001) “Small-world phenomena and the dynamics of information,” Advances in Neural Information Processing Systems (NIPS), (14).
[45] Broder, A., R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan,
R. Stata, A. Tomkins, and J. Wiener (2000) “Graph structure in the
web,” Proc. of the 9th WWW Conference, 33, pp. 17–26.
[46] Bork, P., L. J. Jensen, C. von Mering, A. K. Ramani, I. Lee, and
E. M. Marcotte (2004) “Protein interaction networks from yeast to human,” Curr Opin Struct Biol, 14(3), pp. 292–299.
[47] Potapov, A. P., N. Voss, N. Sasse, and E. Wingender (2005) “Topology of Mammalian Transcription Networks,” Genome Informatics, 16(2), pp.
270–278.
[48] Barabási, A. (2003) “Linked: How Everything is Connected to Everything
Else and What It Means for Business, Science, and Everyday Life,” New York:
Plume.
[49] Dodds, P., R. Muhamad, and D. J. Watts. (2003) “An experimental
study of search in global social networks,” Science, 301(5634), pp. 827–829.
[50] Newman, M. E. J. (2002) “Assortative mixing in networks,” Physical Review Letters, 89, p. 208701.
[51] Ntoulas, A., J. Cho, and C. Olston (2004) “What’s new on the web?:
the evolution of the web from a search engine perspective,” WWW, pp. 1–12.
URL http://doi.acm.org/10.1145/988672.988674
[52] Leskovec, J., D. Chakrabarti, J. Kleinberg, and C. Faloutsos
(2005) “Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication,” PKDD.
[53] Leskovec, J., J. Kleinberg, and C. Faloutsos (2007) “Graph Evolution:
Densification and Shrinking Diameters,” ACM Transactions on Knowledge
Discovery from Data (ACM TKDD), 1.
[54] Redner, S. (2005) “Citation Statistics from 110 Years of Physical Review,”
Physics Today, 58, p. 49.
URL http://arxiv.org/abs/physics/0506056
[55] Katz, J. S. (2005) “Scale independent bibliometric indicators.” Measurement, 3, pp. 24–28.
URL citeseer.ist.psu.edu/leskovec06laws.html
[56] Kossinets, G. and D. J. Watts. (2006) “Empirical Analysis of an Evolving
Social Network,” Science, 311(5757), p. 88.
[57] Ebel, H., L. I. Mielsch, and S. Bornholdt (2002) “Scale-free topology
of e-mail networks,” Phys. Rev. E, 66.
[58] Kretschmer, H. (1994) “xxx,” SCIENTOMETRICS, 30, pp. 363–369.
[59] ——— (1996) “xxx,” SCIENTOMETRICS, 36, pp. 363–377.
[60] Box, G. E. P., G. M. Jenkins, and G. C. Reinsel, Time Series Analysis: Forecasting and Control, 3rd edition, Holden-Day.
[61] Thulasiraman, K. and M. N. S. Swamy (1992) Graphs: Theory and Algorithms, Wiley, New York.
[62] Desikan, P., N. Pathak, J. Srivastava, and V. Kumar (2005) “Incremental PageRank computation on evolving graphs,” WWW.
[63] Krapivsky, P. L. and S. Redner (2005) “Network Growth by Copying,”
Phys. Rev. E, 71, p. 036118.
[64] “The Stanford WebBase Project,”
URL http://www-diglib.stanford.edu/tested/doc2/WebBase/
[65] Hodak, M. (June 4, 2007) “The Enron Scandal,” Organizational Behavior
Research Center Papers (SSRN).
[66] Priebe, C., J. Conroy, D. Marchette, and Y. Park (2005) “Scan
Statistics on Enron Graphs,” Computational and Mathematical Organization
Theory, 11(3), pp. 229–247.
[67] Minkov, E., R. Wang, and W. Cohen (2005) “Extracting Personal
Names from Emails: Applying Named Entity Recognition to Informal Text,”
HLT/EMNLP 2005.
[68] Shetty, J. and J. Adibi (2005), “ENRON email dataset,”
URL http://www.isi.edu/~adibi/Enron/Enron.htm
[69] Shi, J. and J. Malik (2000) “Normalized cuts and image segmentation,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), pp.
888–905.
[70] Chan, P. K., M. D. F. Schlag, and J. Y. Zien (1994) “Spectral k-way
ratio-cut partitioning and clustering,” Computer-Aided Design of Integrated
Circuits and Systems, IEEE Transactions on, 13(9), pp. 1088–1096.
[71] Kernighan, B. W. and S. Lin (1970) “An efficient heuristic procedure for
partitioning graphs,” The Bell system technical journal, 49(2), pp. 291–307.
[72] Yu, S. and J. Shi (2003) “Multiclass spectral clustering,” International Conference on Computer Vision.
Vita
An-Yi Chen
EDUCATION:
The Pennsylvania State University, University Park, PA, USA.
Ph.D., Industrial Engineering and Operations Research, May 2009. (expected).
National Chiao Tung University, Hsinchu, Taiwan.
M.S., Industrial Engineering and Management, June 2001.
National Chiao Tung University, Hsinchu, Taiwan.
B.S., Industrial Engineering and Management, June 2001.
SELECTED PUBLICATION:
A. Y. Chen and Dennis K. J. Lin ”Analysis of evolving graphs with directed
and multiple edges.” (under review)
A. Y. Chen and Dennis K. J. Lin ”Analysis of evolving graphs: A time series
approach”. (under review)
W. L. Pearn, S. H. Chung, A. Y. Chen, and M. H. Yang. ”A Case Study
on the Multistage IC Final Testing Scheduling Problem with Reentry.”
International Journal of Production Economics, 2004, Vol. 88, pp. 257–267.
S. H. Chung, Y. C. Su, J.S. Liao, and A.Y. Chen. ”The Construction of
Production Planning and Scheduling System for an IC Foundry in Ramp-Up Period.” 2000 Summer Computer Simulation Conference, Vancouver,
British Columbia.
AFFILIATION:
The Healthcare Information and Management Systems Society (HIMSS)
The Institute for Operations Research and the Management Sciences (INFORMS)
Society for Industrial and Applied Mathematics (SIAM)