Twitter`s Big Hitters
Transcription
Twitter`s Big Hitters
Twitter’s Big Hitters Des Higham Department of Mathematics and Statistics University of Strathclyde Motivation Consider large, complex networks with a fixed set of nodes edges that come and go over time Application areas include neuroscience communication by email, phone, Twitter on-line retail on-line social media PART 1: Models PART 2: Algorithms for Computing Centrality Bristol Des Higham Dynamic Networks 2 / 30 Example: Voice call contact at MIT Eagle, Pentland & Lazer, Proc. Nat. Acad. Sci., 2009 106 individuals, 365 days from July 20th, 2004 Summarized into 28 day periods, treated as undirected 1 2 3 4 5 6 7 8 9 10 11 12 13 Bristol Des Higham Dynamic Networks 3 / 30 Part 1: Models Challenges understand mechanisms calibrate parameters and compare models forecast future behaviour simulate ‘what-if’ scenarios Grindrod & Higham, Proc. Royal Society A, 2010 Discrete time, stochastic models Given A[k ] at time tk , how do we specify A[k +1] ? Think in terms of birth and death of edges Bristol Des Higham Dynamic Networks 4 / 30 Triadic Closure Model Grindrod, Higham & Parsons, Internet Mathematics, 2012 Friends of friends become friends Edge death probability is a constant ω ∈ (0, 1) Edge birth probability between nodes i and j given by [k ] 2 δ+ A ij where 0 < δ 1 and 0 < (N − 2) < 1 − δ Consider N = 100, ω = 0.01, = 5 × 10−4 , δ = 4 × 10−4 Bristol Des Higham Dynamic Networks 5 / 30 Triangulation model: start with ER(0.3) time=0 time=50 time=100 time=150 time=200 time=250 time=300 time=350 time=400 time=450 time=500 time=550 time=600 time=650 time=700 time=750 Edge density at t = 750 is 0.712 Bristol Des Higham Dynamic Networks 6 / 30 Triangulation Model: start with ER(0.15) time=0 time=50 time=100 time=150 time=200 time=250 time=300 time=350 time=400 time=450 time=500 time=550 time=600 time=650 time=700 time=750 Edge density at t = 750 is 0.051 Bristol Des Higham Dynamic Networks 7 / 30 Mean field analysis for δ + A[k ] 2 ij Ergodicity and symmetry ⇒ Erdös-Rényi limit: every edge present with probablity p? Heuristic mean field approach: insert the ansatz “A[k ] = ER(pk )” into the model to obtain pk +1 = (1 − ω)pk + (1 − pk )(δ + (N − 2)pk2 ) Generically: three real roots Two are stable, one is unstable N = 100, ω = 0.01, = 5 × 10−4 , δ = 4 × 10−4 Stable fixed points 0.049 & 0.721 Unstable 0.229 Bristol Des Higham Dynamic Networks 8 / 30 Mean-field vs. simulation from ER(0.4) 0.75 0.7 Edge Density 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0 500 1000 1500 Time Bristol Des Higham Dynamic Networks 9 / 30 Four simulations from ER(0.23) 0.8 0.7 Edge Density 0.6 0.5 0.4 0.3 0.2 0.1 0 0 500 1000 1500 Time Stable fixed points 0.049 & 0.721 Bristol Des Higham Unstable 0.229 Dynamic Networks 10 / 30 Calibration/Inference Mantzaris & Higham, in Temporal Networks, Springer, 2013, edited by P. Holme and J. Saramäki Given model parameters, we can compute the probability of observing the data: likelihood Tests on synthetic data show that we can correctly infer the triadic closure effect Wealink data from Hu and Wang, Phys. Lett. A, 2009. 26 Million time stamps, over 841 days with 0.25 Million nodes (no edge death): we found statistical support for triadic closure Bristol Des Higham Dynamic Networks 11 / 30 Part 2: Algorithms for Node Centrality Time 1 Time 2 Time 3 A C B E D F G Bristol Des Higham Dynamic Networks 12 / 30 Part 2: Algorithms for Node Centrality Time 1 Time 2 A Time 3 A C C B B E E D D F F G Bristol G Des Higham Dynamic Networks 12 / 30 Part 2: Algorithms for Node Centrality Time 1 Time 2 A Time 3 A C C B B E B E E D D F D F G Bristol A C G Des Higham F G Dynamic Networks 12 / 30 Part 2: Algorithms for Node Centrality Time 1 Time 2 A Time 3 A C A C C B B E B E E D D F D F G G F G Note the lack of symmetry caused by time’s arrow Bristol Des Higham Dynamic Networks 12 / 30 Dynamic Walks Time points t0 < t1 < t2 < · · · < tM Adjacency matrices A[0] , A[1] , A[2] , . . . , A[M] Dynamic walk of length w from node i1 to node iw+1 : sequence of times tr1 ≤ tr2 ≤ · · · ≤ trw and a sequence of edges i1 ↔ i2 , i2 ↔ i3 , . . . , iw ↔ iw+1 , such that im ↔ im+1 exists at time trm (Several variations are possible) Use this to define centrality of a node, following Katz1 1 A new status index derived from sociometric analysis, Leo Katz, Psychometrika, 1953 Bristol Des Higham Dynamic Networks 13 / 30 New Algorithm Grindrod, Higham, Parsons & Estrada, Phys. Rev. E, 2011 Key observation: the matrix product A[r1 ] A[r2 ] · · · A[rw ] has i, j element that counts the number of dynamic walks of length w from node i to node j, where the mth step takes place at time trm Keep track of all such walks and discount by αw E.g. α2 A[0] A[1] , α4 A[0] A[2] A[3] A[7] , α3 A[3] A[3] A[9] This is achieved by −1 −1 −1 Q := I − αA[0] I − αA[1] · · · I − αA[M] Then Q is our overall summary of how well information can be passed from node i to node j Bristol Des Higham Dynamic Networks 14 / 30 Dynamic Centrality We will call the row and column sums N X k =1 Qnk & N X Qkn k =1 the broadcast and receive communicabilities generalizes Katz centrality in social networks involves sparse linear solves captures the asymmetry through non-commutativity of matrix multiplication Bristol Des Higham Dynamic Networks 15 / 30 Enron email: 151 nodes over two weeks Explanatory model for dynamic communicators in: Mantzaris & Higham, Eur. J. Appl. Math., 2012 Bristol Des Higham Dynamic Networks 16 / 30 fMRI in Neuroscience: functional connectivity Mantzaris, Bassett, Wymbs, Estrada, Porter, Mucha, Grafton, Higham, J. Complex Networks, 2013 Data: Bassett, Wymbs, Porter, Mucha, Carlson, Grafton, PNAS, 2011 Each experiment: 112 × 112 × 25 tensor region region time 60 experiments: 20 subjects repeat a task three times Unsupervised k-means custering of the full data set shows significant evidence for “learning”: subjects typically move from cluster 1 to cluster 2 Summarizing each experiment in terms of the 112 broadcast (or receive) centrality measures, we recover the same level of significance. And we can now interpret the data more easily. . . . . . Bristol Des Higham Dynamic Networks 17 / 30 Change in broadcast centrality: Right Medial Bristol Des Higham Dynamic Networks 18 / 30 Twitter’s Big Hitters Laflin, Mantzaris, Grindrod, Ainley, Otley, Higham, Proc. of Social Informatics, 2012 Listen to tweets containing the phrases city break, cheap holiday, travel, insurance, cheap flight plus two brand names From 17 June 2012 at 14:41 to 18 June at 12:41 0.5 Million Tweeters/Followers Bristol Des Higham Dynamic Networks 19 / 30 Active Node Subnetwork Sequence Record all relevant edges: tweeter 7→ followers Remove all nodes with zero out degree Binarize over time windows of length ∆t Bristol Des Higham Dynamic Networks 20 / 30 Dynamic Broadcast Centralities 3 Dynamic Broadcast 2.5 2 1.5 1 0.5 0 0 5 10 15 20 25 Out Degree 30 35 40 45 Twitter account with fourth highest bandwidth (out degree) is a very poor dynamic broadcaster Closer inspection ⇒ an automated process. Five social media experts were given the Twitter data and asked to rank the accounts according to importance We found that dynamic centrality measures are hard to distinguish from human experts Bristol Des Higham Dynamic Networks 21 / 30 Arsenal 5 - 2 Spurs, November 2012 Bristol Des Higham Dynamic Networks 22 / 30 Adebayor: volume of tweets North London Derby: 17 November 2012 - Adebayor Tweets 12000 10000 Tweets per minute 8000 6000 4000 2000 1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 101 105 109 113 117 121 125 129 133 137 141 145 149 153 157 161 165 169 173 177 181 185 189 193 197 201 205 209 213 217 221 0 Minutes after 1200 GMT Bristol Des Higham Dynamic Networks 23 / 30 Adebayor: sentiment across time Bristol Des Higham Dynamic Networks 24 / 30 Adebayor: sentiment weighted by influence Bristol Des Higham Dynamic Networks 25 / 30 Downweighting over time Grindrod & Higham, SIAM Review (Research Spotlights) 2013 Motivation: News goes stale, messages become irrelevant, viruses mutate, . . . old information is less important The algorithm can be generalized naturally to −1 S [k ] = I + e−b∆tk S [k −1] I − α A[k ] −I Here, S [k ] ij counts the number of dynamic walks from i to j up to time tk , scaled by a factor αw for dynamic walks of length w a factor e−bt for walks that begin t time units ago We have a new parameter, b: b = 0 is the previous algorithm b = ∞ is Katz on the current network Bristol Des Higham Dynamic Networks 26 / 30 Enron daily emails: Centrality prediction Broadcast Prediction Quality 1.1 1 0.9 0.8 0.7 0.6 0.5 0 0.5 1 1.5 2 2.5 3 2 2.5 3 b Receive Prediction Quality 1.1 1 0.9 0.8 0.7 0.6 0.5 0 0.5 1 1.5 b Bristol Des Higham Dynamic Networks 27 / 30 Continuous Time? Discretizing in time and binarizing is convenient, but ∆t too large can overlook or smear events ∆t too small may give a false impression of accuracy So create A(t) over continuous time ⇒ ODE for the evolution of pairwise communicability: U 0 (t) = −b(U(t) − I) − U(t) log I − aA(t) Bristol Des Higham Dynamic Networks 28 / 30 What’s New? Edge birth model based on triadic closure Mean field analysis that accurately summarizes the macro scale behavoiur and predicts bistability Concept of dynamic walks Resolvent-based algorithms generalizing Katz Continuous-time analogues Bristol Des Higham Dynamic Networks 29 / 30 What’s Next? Large scale calibration/inference/model comparison Dynamic network monitoring/prediction/disruption Algorithms for dynamic clustering/classification, e.g., fake account detection Thanks to EPSRC/RCUK Digital Economy programme, Leverhulme Trust, EPSRC/Strathclyde Impact Acceleration Account, Bloom Agency. Bristol Des Higham Dynamic Networks 30 / 30