Twitter’s Big Hitters

Des Higham
Department of Mathematics and Statistics
University of Strathclyde
Motivation
Consider large, complex networks with
a fixed set of nodes
edges that come and go over time
Application areas include
neuroscience
communication by email, phone, Twitter
on-line retail
on-line social media
PART 1: Models
PART 2: Algorithms for Computing Centrality
Bristol
Des Higham
Dynamic Networks
2 / 30
Example: Voice call contact at MIT
Eagle, Pentland & Lazer, Proc. Nat. Acad. Sci., 2009
106 individuals, 365 days from July 20th, 2004
Summarized into 28 day periods, treated as undirected
[Figure: one network snapshot for each of the 13 consecutive 28-day periods]
Part 1: Models
Challenges
understand mechanisms
calibrate parameters and compare models
forecast future behaviour
simulate ‘what-if’ scenarios
Grindrod & Higham, Proc. Royal Society A, 2010
Discrete time, stochastic models
Given $A^{[k]}$ at time $t_k$, how do we specify $A^{[k+1]}$?
Think in terms of birth and death of edges
Triadic Closure Model
Grindrod, Higham & Parsons, Internet Mathematics, 2012
Friends of friends become friends
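Edge death probability is a constant $\omega \in (0, 1)$
Edge birth probability between nodes i and j given by
$\delta + \epsilon \bigl((A^{[k]})^2\bigr)_{ij}$
where $0 < \delta \ll 1$ and $0 < \epsilon (N - 2) < 1 - \delta$
Consider $N = 100$, $\omega = 0.01$, $\epsilon = 5 \times 10^{-4}$, $\delta = 4 \times 10^{-4}$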
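The birth and death rules above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names (`triadic_closure_step`, `simulate`, `edge_density`), the seed, and the choice to update all node pairs simultaneously from the current network are assumptions made for the example.

```python
import numpy as np

def triadic_closure_step(A, omega, delta, eps, rng):
    """One time step of the edge birth/death model on an undirected network."""
    N = A.shape[0]
    common = A @ A                        # (A^2)_ij = number of common neighbours of i and j
    birth_prob = delta + eps * common     # birth probability for currently absent edges
    iu = np.triu_indices(N, k=1)          # each undirected pair is handled once
    u = rng.random(iu[0].size)
    present = A[iu] == 1
    # present edges survive with probability 1 - omega,
    # absent edges are born with probability delta + eps * (A^2)_ij
    new_upper = np.where(present, u >= omega, u < birth_prob[iu])
    A_next = np.zeros_like(A)
    A_next[iu] = new_upper
    return A_next + A_next.T

def simulate(N=100, p0=0.3, steps=750, omega=0.01, delta=4e-4, eps=5e-4, seed=0):
    """Start from an Erdos-Renyi graph ER(p0) and iterate the model."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices(N, k=1)
    A = np.zeros((N, N), dtype=np.int64)
    A[iu] = rng.random(iu[0].size) < p0
    A = A + A.T
    for _ in range(steps):
        A = triadic_closure_step(A, omega, delta, eps, rng)
    return A

def edge_density(A):
    N = A.shape[0]
    return A.sum() / (N * (N - 1))

if __name__ == "__main__":
    print("edge density from ER(0.3): ", edge_density(simulate(p0=0.3)))
    print("edge density from ER(0.15):", edge_density(simulate(p0=0.15)))
```

The constraint $\epsilon (N - 2) < 1 - \delta$ keeps `birth_prob` below 1, since two nodes can share at most $N - 2$ neighbours.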
Triangulation model: start with ER(0.3)
[Figure: network snapshots at times 0, 50, 100, ..., 750]
Edge density at t = 750 is 0.712
Triangulation Model: start with ER(0.15)
[Figure: network snapshots at times 0, 50, 100, ..., 750]
Edge density at t = 750 is 0.051
Mean field analysis for $\delta + \epsilon \bigl((A^{[k]})^2\bigr)_{ij}$
Ergodicity and symmetry ⇒ Erdős-Rényi limit: every edge
present with probability $p^\star$
Heuristic mean field approach: insert the ansatz
“$A^{[k]} = \mathrm{ER}(p_k)$” into the model to obtain
$p_{k+1} = (1 - \omega) p_k + (1 - p_k)\bigl(\delta + \epsilon (N - 2) p_k^2\bigr)$
Generically: three real roots of the fixed-point equation
Two are stable, one is unstable
$N = 100$, $\omega = 0.01$, $\epsilon = 5 \times 10^{-4}$, $\delta = 4 \times 10^{-4}$
Stable fixed points 0.049 & 0.721
Unstable 0.229
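The quoted fixed points can be checked numerically. Setting $p_{k+1} = p_k = p$ in the mean-field map and writing $c = \epsilon (N - 2)$ gives the cubic $c p^3 - c p^2 + (\omega + \delta) p - \delta = 0$. The sketch below solves it with `numpy.roots` and classifies each root by the magnitude of the map's derivative. The variable names are illustrative assumptions.

```python
import numpy as np

N, omega, eps, delta = 100, 0.01, 5e-4, 4e-4
c = eps * (N - 2)

def f(p):
    """Mean-field map p_k -> p_{k+1}."""
    return (1 - omega) * p + (1 - p) * (delta + c * p**2)

def fprime(p):
    """Derivative of the map; a fixed point is stable when |f'(p)| < 1."""
    return (1 - omega) - (delta + c * p**2) + (1 - p) * 2 * c * p

# Fixed points solve c p^3 - c p^2 + (omega + delta) p - delta = 0
roots = np.sort(np.roots([c, -c, omega + delta, -delta]).real)
stable = [bool(abs(fprime(p)) < 1) for p in roots]

for p, s in zip(roots, stable):
    print(f"fixed point {p:.4f}: {'stable' if s else 'unstable'}")
```

An initial density above the middle (unstable) root is drawn to the upper stable root and one below it to the lower, matching the ER(0.3) and ER(0.15) runs.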
Mean-field vs. simulation from ER(0.4)
[Figure: edge density (0.35 to 0.75) against time (0 to 1500), mean-field prediction and simulation]
Four simulations from ER(0.23)
[Figure: edge density (0 to 0.8) against time (0 to 1500) for four simulations]
Stable fixed points 0.049 & 0.721
Unstable 0.229
Calibration/Inference
Mantzaris & Higham, in Temporal Networks, Springer, 2013, edited by
P. Holme and J. Saramäki
Given model parameters, we can compute the probability of
observing the data: likelihood
Tests on synthetic data show that we can correctly infer the
triadic closure effect
Wealink data from Hu and Wang, Phys. Lett. A, 2009.
26 Million time stamps, over 841 days with 0.25 Million
nodes (no edge death):
we found statistical support for triadic closure
Part 2: Algorithms for Node Centrality
[Figure: three snapshots of a network on nodes A to G, at Time 1, Time 2 and Time 3]
Note the lack of symmetry caused by time’s arrow
Dynamic Walks
Time points $t_0 < t_1 < t_2 < \cdots < t_M$
Adjacency matrices $A^{[0]}, A^{[1]}, A^{[2]}, \ldots, A^{[M]}$
Dynamic walk of length $w$ from node $i_1$ to node $i_{w+1}$:
sequence of times $t_{r_1} \le t_{r_2} \le \cdots \le t_{r_w}$ and a
sequence of edges $i_1 \leftrightarrow i_2$, $i_2 \leftrightarrow i_3$, ..., $i_w \leftrightarrow i_{w+1}$,
such that $i_m \leftrightarrow i_{m+1}$ exists at time $t_{r_m}$
(Several variations are possible)
Use this to define centrality of a node, following Katz
(Leo Katz, A new status index derived from sociometric analysis, Psychometrika, 1953)
New Algorithm
Grindrod, Higham, Parsons & Estrada, Phys. Rev. E, 2011
Key observation: the matrix product
$A^{[r_1]} A^{[r_2]} \cdots A^{[r_w]}$
has i, j element that counts the number of dynamic walks of
length $w$ from node i to node j, where the mth step takes
place at time $t_{r_m}$
Keep track of all such walks and discount by $\alpha^w$
E.g. $\alpha^2 A^{[0]} A^{[1]}$, $\alpha^4 A^{[0]} A^{[2]} A^{[3]} A^{[7]}$, $\alpha^3 A^{[3]} A^{[3]} A^{[9]}$
This is achieved by
$Q := \bigl(I - \alpha A^{[0]}\bigr)^{-1} \bigl(I - \alpha A^{[1]}\bigr)^{-1} \cdots \bigl(I - \alpha A^{[M]}\bigr)^{-1}$
Then $Q_{ij}$ is our overall summary of how well information can
be passed from node i to node j
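A small sketch of the resolvent product may help make the time ordering concrete. The three-node example is an assumption chosen for illustration: edge 0↔1 exists only at the first time point and edge 1↔2 only at the second, so information can travel from node 0 to node 2 but not back. The function name `dynamic_communicability` is hypothetical, and $\alpha$ is assumed to be smaller than the reciprocal of every spectral radius so that each resolvent exists.

```python
import numpy as np

def dynamic_communicability(As, alpha):
    """Q = (I - alpha A[0])^{-1} (I - alpha A[1])^{-1} ... (I - alpha A[M])^{-1}."""
    N = As[0].shape[0]
    I = np.eye(N)
    Q = np.eye(N)
    for A in As:
        # Q <- Q (I - alpha A)^{-1}, done as a linear solve on the transposed system
        Q = np.linalg.solve((I - alpha * A).T, Q.T).T
    return Q

# Time 0: edge 0 <-> 1 only.  Time 1: edge 1 <-> 2 only.
A0 = np.array([[0.0, 1.0, 0.0],
               [1.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]])
A1 = np.array([[0.0, 0.0, 0.0],
               [0.0, 0.0, 1.0],
               [0.0, 1.0, 0.0]])
alpha = 0.5

Q = dynamic_communicability([A0, A1], alpha)
print("Q[0, 2] =", Q[0, 2])   # walks 0 -> 1 (time 0) then 1 -> 2 (time 1)
print("Q[2, 0] =", Q[2, 0])   # no time-respecting walk from 2 to 0
```

Each snapshot here is symmetric, yet `Q` is not, because the order of the factors encodes the order of the time points.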
Dynamic Centrality
We will call the row and column sums
$\sum_{k=1}^{N} Q_{nk}$ & $\sum_{k=1}^{N} Q_{kn}$
the broadcast and receive communicabilities
generalizes Katz centrality in social networks
involves sparse linear solves
captures the asymmetry through non-commutativity of
matrix multiplication
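Because only row and column sums of $Q$ are needed, the full matrix never has to be formed: the ones vector can be pushed through one linear solve per time point. The sketch below uses dense `numpy.linalg.solve` for brevity; a sparse solver would take its place for large networks. The function name `broadcast_receive` and the three-node data are illustrative assumptions.

```python
import numpy as np

def broadcast_receive(As, alpha):
    """Row sums (broadcast) and column sums (receive) of Q without forming Q."""
    N = As[0].shape[0]
    I = np.eye(N)
    # broadcast = Q 1: apply the resolvents to the ones vector, last time point first
    broadcast = np.ones(N)
    for A in reversed(As):
        broadcast = np.linalg.solve(I - alpha * A, broadcast)
    # receive = Q^T 1: apply the transposed resolvents, first time point first
    receive = np.ones(N)
    for A in As:
        receive = np.linalg.solve((I - alpha * A).T, receive)
    return broadcast, receive

# Edge 0 <-> 1 at the first time point, edge 1 <-> 2 at the second
A0 = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
A1 = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])

broadcast, receive = broadcast_receive([A0, A1], alpha=0.5)
print("broadcast:", broadcast)
print("receive:  ", receive)
```

In this example node 0 scores higher than node 2 as a broadcaster and lower as a receiver, which is the asymmetry induced by the time ordering.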
Enron email: 151 nodes over two weeks
Explanatory model for dynamic communicators in:
Mantzaris & Higham, Eur. J. Appl. Math., 2012
fMRI in Neuroscience: functional connectivity
Mantzaris, Bassett, Wymbs, Estrada, Porter, Mucha, Grafton, Higham,
J. Complex Networks, 2013
Data: Bassett, Wymbs, Porter, Mucha, Carlson, Grafton, PNAS, 2011
Each experiment: 112 × 112 × 25 tensor (region × region × time)
60 experiments: 20 subjects repeat a task three times
Unsupervised k-means clustering of the full data set shows
significant evidence for “learning”:
subjects typically move from cluster 1 to cluster 2
Summarizing each experiment in terms of the 112
broadcast (or receive) centrality measures, we recover the
same level of significance.
And we can now interpret the data more easily...
Change in broadcast centrality: Right Medial
Twitter’s Big Hitters
Laflin, Mantzaris, Grindrod, Ainley, Otley, Higham,
Proc. of Social Informatics, 2012
Listen to tweets containing the phrases
city break, cheap holiday, travel, insurance,
cheap flight plus two brand names
From 17 June 2012 at 14:41 to 18 June at 12:41
0.5 Million Tweeters/Followers
Active Node Subnetwork Sequence
Record all relevant edges: tweeter ↦ followers
Remove all nodes with zero out degree
Binarize over time windows of length ∆t
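The three steps above can be sketched as follows. The record format `(time, tweeter, follower)`, the function name `active_subnetwork_sequence`, and the decision to compute out degree over the whole observation period are assumptions for the example; the slides do not specify them.

```python
import numpy as np

def active_subnetwork_sequence(records, t_start, t_end, dt):
    """Build one binary adjacency matrix per time window of length dt.

    records: iterable of (time, tweeter, follower) tuples (hypothetical format).
    Returns the sorted list of active nodes and the list of adjacency matrices.
    """
    records = [r for r in records if t_start <= r[0] < t_end]
    # Active nodes are those with nonzero out degree, i.e. accounts that tweeted
    active = sorted({src for _, src, _ in records})
    index = {node: i for i, node in enumerate(active)}
    n_windows = int(np.ceil((t_end - t_start) / dt))
    As = [np.zeros((len(active), len(active)), dtype=np.int8) for _ in range(n_windows)]
    for t, src, dst in records:
        if dst in index:                        # drop edges into removed nodes
            k = int((t - t_start) // dt)        # window containing time t
            As[k][index[src], index[dst]] = 1   # binarize: repeated edges count once
    return active, As

records = [
    (0, "a", "b"), (1, "a", "b"), (1, "b", "c"),   # first window
    (5, "b", "a"), (7, "c", "d"),                  # second window; "d" never tweets
]
nodes, As = active_subnetwork_sequence(records, t_start=0, t_end=8, dt=4)
print(nodes)
for k, A in enumerate(As):
    print(f"window {k}:\n{A}")
```

The resulting matrices are directed, so the centrality computation applies unchanged with non-symmetric $A^{[k]}$.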
Dynamic Broadcast Centralities
[Figure: dynamic broadcast centrality (0 to 3) against out degree (0 to 45)]
Twitter account with fourth highest bandwidth (out
degree) is a very poor dynamic broadcaster
Closer inspection ⇒ an automated process.
Five social media experts were given the Twitter data and
asked to rank the accounts according to importance
We found that the rankings from dynamic centrality measures are hard to
distinguish from those of the human experts
Arsenal 5 - 2 Spurs, November 2012
Adebayor: volume of tweets
[Figure: North London Derby, 17 November 2012, Adebayor tweets: tweets per minute (0 to 12000) against minutes after 1200 GMT (1 to 221)]
Adebayor: sentiment across time
Adebayor: sentiment weighted by influence
Downweighting over time
Grindrod & Higham, SIAM Review (Research Spotlights) 2013
Motivation: News goes stale, messages become irrelevant,
viruses mutate, . . . old information is less important
The algorithm can be generalized naturally to
$S^{[k]} = \bigl(I + e^{-b \Delta t_k} S^{[k-1]}\bigr) \bigl(I - \alpha A^{[k]}\bigr)^{-1} - I$
Here, $(S^{[k]})_{ij}$ counts the number of dynamic walks from i to j
up to time $t_k$, scaled by
a factor $\alpha^w$ for dynamic walks of length $w$
a factor $e^{-bt}$ for walks that begin $t$ time units ago
We have a new parameter, b:
$b = 0$ is the previous algorithm
$b = \infty$ is Katz on the current network
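The two limiting cases can be checked with a short sketch of the iteration. Starting from a zero matrix before the first time point is an assumption, as are the function name `running_summary` and the small example data. With $b = 0$ the result should equal the resolvent product minus the identity, and with large $b$ it should approach the Katz matrix of the final snapshot minus the identity.

```python
import numpy as np

def running_summary(As, dts, alpha, b):
    """Iterate S[k] = (I + exp(-b dt_k) S[k-1]) (I - alpha A[k])^{-1} - I."""
    N = As[0].shape[0]
    I = np.eye(N)
    S = np.zeros((N, N))          # assumed starting value: no walks recorded yet
    for A, dt in zip(As, dts):
        lhs = I + np.exp(-b * dt) * S
        # right-multiply by the resolvent via a solve on the transposed system
        S = np.linalg.solve((I - alpha * A).T, lhs.T).T - I
    return S

A0 = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
A1 = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
alpha, dts = 0.5, [1.0, 1.0]

S_no_decay = running_summary([A0, A1], dts, alpha, b=0.0)
S_fast_decay = running_summary([A0, A1], dts, alpha, b=50.0)
print(S_no_decay)
print(S_fast_decay)
```

Intermediate values of `b` interpolate between remembering every walk and remembering only the current snapshot.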
Enron daily emails: Centrality prediction
[Figure: two panels, Broadcast and Receive, each plotting prediction quality (0.5 to 1.1) against b (0 to 3)]
Continuous Time?
Discretizing in time and binarizing is convenient, but
∆t too large can overlook or smear events
∆t too small may give a false impression of accuracy
So create A(t) over continuous time
⇒ ODE for the evolution of pairwise communicability:
$U'(t) = -b\bigl(U(t) - I\bigr) - U(t) \log\bigl(I - aA(t)\bigr)$
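As a sanity check on the ODE, consider the special case of a constant network $A(t) = A$ with no downweighting, $b = 0$. The initial condition $U(0) = I$ and the requirement that $a$ times the spectral radius of $A$ is below 1 (so the matrix logarithm is defined) are assumptions made for this sketch. Since $U(t)$ is then a function of $A$ alone and commutes with $\log(I - aA)$, the solution is a matrix power:

```latex
\begin{align*}
U'(t) &= -U(t)\,\log\bigl(I - aA\bigr), \qquad U(0) = I, \\
U(t)  &= \exp\bigl(-t \log(I - aA)\bigr) = (I - aA)^{-t}, \\
U(1)  &= (I - aA)^{-1}.
\end{align*}
```

So over one unit of time on a fixed network the continuous-time model reproduces the Katz resolvent used in the discrete algorithm.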
What’s New?
Edge birth model based on triadic closure
Mean field analysis that accurately summarizes the
macro scale behaviour and predicts bistability
Concept of dynamic walks
Resolvent-based algorithms generalizing Katz
Continuous-time analogues
What’s Next?
Large scale calibration/inference/model comparison
Dynamic network monitoring/prediction/disruption
Algorithms for dynamic clustering/classification,
e.g., fake account detection
Thanks to
EPSRC/RCUK Digital Economy programme,
Leverhulme Trust,
EPSRC/Strathclyde Impact Acceleration Account,
Bloom Agency.