A Bayesian Framework for Estimating Properties of Network Diffusions
Varun Embar (1), Rama Kumar Pasumarthi (2), Indrajit Bhattacharya (1)
(1) IBM Research India, (2) American Express (work done when at IBM)
IISc MLSIG Lunch Talk 2015 (based on SIGKDD 2014 paper)
Embar, Pasumarthi, Bhattacharya. Network Diffusion Properties. IISc MLSIG 2015.

Quick Summary: Network Diffusion
- Example 1: Spread of Ebola in West Africa
- Example 2: Spread of hashtags among Twitter users
Entities of interest
- Network: captures connections between people / Twitter users
- Diffusion process: stochastic mechanism of 'infection' spread
- Diffusion cascades: time-stamped infection paths over the network
Long studied in epidemiology, sociology, econometrics, marketing
Recent interest in computer science

Quick Summary: Problems in Network Diffusion Analysis
Evaluating Properties of Observed Network and Cascades
- Centrality and reach of individual nodes
- Viral marketing seeds (Kempe KDD03, Goyal CIKM08, PVLDB11)
- Community structures (Mehmood ECML13, Barbieri ICDM13)
- Likelier diffusion mechanism (Milling SIGMETRICS12)
Inferring Network from Partially-observed Cascades
- Estimate network connections and strengths given cascades
- Maximum likelihood estimation
- Saito (AML09), Gomez-Rodriguez (KDD10, ICML11,13, WSDM13), Du (NIPS12), Netrapalli (SIGMETRICS12), Wang (ECML12), Kutzkov (KDD13), Daneshmand (ICML14)

Quick Summary: Example Property: Leaders of Tribes (LoT)
Leader of Tribes (Goyal CIKM08)
- Network property: high-weight paths to a large 'tribe' of nodes
- Cascade property: frequent transmissions over these paths
Not tractable even given complete observations
Weak LoT
- Network property: high-weight edges to tribe nodes
- Cascade property: frequent transmissions over these edges
Easy to compute given complete observations
Still interesting for marketing, epidemiology

Quick Summary: Weak LoT Distribution in Random Graphs
Varies with network structure

Quick Summary: Evaluating Joint Properties: Challenges
Network and Diffusion Cascade
- $\alpha_{uv} \in \mathbb{R}^+$: connection strength between nodes $u$, $v$
- Cascade of infections: infected node $u_i$, parent node $z_i$, time $t_i$
Diffusion Process: Independent Cascade Model
- Infected node proposes infection time for uninfected neighbors
- Uninfected node catches infection with earliest proposed time
- Multiple infections of same node (Splitting model) (Wang ECML12)
Hidden variables
- Network: connection strengths and sometimes edges unobserved
- Cascades: infection sources $z_i$ unobserved

Quick Summary: Network Diffusion Properties and a Possible Approach
Network Diffusion Property
- Function $f(\alpha, z)$ defined on network $\alpha$ and cascade $z$
Frequentist Plug-in
- Point estimate of the network, e.g. MLE
- Point estimate of cascade given network estimate
- Evaluate property using point estimates
Disadvantages
- MLE overfits for infrequent edges
- Property not always one-to-one: the most likely value of the property does not correspond to the most likely network and cascade

Quick Summary: Example: Weak LoT Reconstruction
Does not work very well!
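
To illustrate why weak LoT is easy to compute once the network and cascades are fully observed, the sketch below counts, for each node u, the neighbours v reached over a strong edge ($\alpha_{uv} > a$) that also carried frequent transmissions ($\rho_{uv} \ge r$), which is the weak LoT score defined later in the talk. The dictionary layout, the function name `weak_lot_scores`, the toy edge values and the thresholds are all assumptions made for this example, not taken from the talk.

```python
from collections import defaultdict


def weak_lot_scores(alpha, rho, a, r):
    """Count, per source node u, the edges (u, v) that are both strong and frequent.

    alpha -- dict mapping (u, v) to the (known) connection strength
    rho   -- dict mapping (u, v) to the (known) number of u -> v transmissions
    a, r  -- strength and frequency thresholds
    """
    scores = defaultdict(int)
    for (u, v), strength in alpha.items():
        if strength > a and rho.get((u, v), 0) >= r:
            scores[u] += 1
    return dict(scores)


# Hypothetical fully observed network and transmission counts.
alpha = {("A", "B"): 2.0, ("A", "C"): 0.1, ("A", "D"): 3.5, ("B", "C"): 4.0}
rho = {("A", "B"): 5, ("A", "C"): 9, ("A", "D"): 2, ("B", "C"): 1}

scores = weak_lot_scores(alpha, rho, a=1.0, r=2)
print(scores)
```

With partially observed cascades neither `alpha` nor `rho` is available directly, which is where the plug-in and Bayesian approaches differ.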

Quick Summary: A Bayesian Solution
Evaluate the expectation of the property under the posterior distribution $p(z, \alpha \mid \{c^o\})$ given partially observed cascades:
$\bar{f}(z, \alpha) = \mathbb{E}_{p(z, \alpha \mid \{c^o\})}[f(z, \alpha)]$

Quick Summary: Example: Weak LoT Reconstruction
Many-to-one function
Bayesian approach recovers the signature shape of the distribution, by considering less likely networks

Quick Summary: Example: Network Inference
One-to-one function
Error            Bayesian   Frequentist
CorePeriphery    0.116      2.553
Hierarchical     0.884      3.210
Random           0.147      17.483
ForestFire       0.329      736.821
Bayesian approach significantly reduces error by avoiding over-fitting

Quick Summary: Computational Tractability
Two marginalization operations
- Integration over network strengths
- Summation over cascade paths
Characterization of properties (for Independent Cascade Model)
- Not nice at all: neither can be done efficiently
- Partially-nice: one of the marginalizations can be done efficiently
- Totally-nice: joint marginalization can be done efficiently
Interesting properties in all of these classes
Monte Carlo approximation framework when not totally nice
Map-reduce framework for large scale network diffusion analysis

Quick Summary
"the rest are details" ...
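
A toy sketch may help show why a plug-in estimate and a posterior expectation can disagree for a many-to-one property. It assumes a made-up discrete posterior over a single edge strength and a made-up indicator property ("is the edge strong?"); none of the numbers or names come from the talk.

```python
def plug_in(posterior, f):
    """Evaluate f at the single most probable value (the point estimate)."""
    best_value, _ = max(posterior, key=lambda pair: pair[1])
    return f(best_value)


def posterior_expectation(posterior, f):
    """Average f over every value, weighted by its posterior probability."""
    return sum(prob * f(value) for value, prob in posterior)


def is_strong(alpha):
    """Hypothetical many-to-one property: 1 if the edge strength exceeds 1."""
    return 1.0 if alpha > 1.0 else 0.0


# Hypothetical posterior over one edge strength: (value, probability) pairs.
posterior = [(0.5, 0.4), (2.0, 0.3), (3.0, 0.3)]

plug = plug_in(posterior, is_strong)
bayes = posterior_expectation(posterior, is_strong)
print(plug, bayes)
```

Here the single most likely strength is weak, so the plug-in answer says "not strong", even though most of the posterior mass sits on strong values. The expectation keeps that information because it also weighs the less likely networks.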

Problem Definition: Network and Diffusion Cascades
Network
- $G = (V, E)$ with nodes $V$ and edges $E$
- $\alpha_{uv} \in \mathbb{R}^+$: connection strength between nodes $u$, $v$
Diffusion Cascades
- Collection of cascades: $C = \{c\}$
- Cascade $c$: set of infections $(u_i, z_i, t_i)$
- Infected node $u_i$, parent node $z_i$, time $t_i$
- All cascades observed until time $T$

Problem Definition: Diffusion Process: Continuous-time Independent Cascade Model
Generative Process
- Seed nodes infected initially
- Infected node proposes infection time for uninfected neighbors
- Uninfected node catches infection with earliest proposed time
- Multiple node infections allowed (Splitting model) (Wang ECML12)
Likelihood
- Delay pdf $f(t_i \mid u_i, u_j, t_j; \alpha_{u_j u_i})$
- $p(c \mid \alpha) = \prod_i H(t_i \mid t_{z_i}; \alpha_{u_{z_i} u_i}) \prod_{j \in \pi_i} S(t_i \mid t_j; \alpha_{u_j u_i}) \prod_{v : l_v < t_i} S(T \mid t_i; \alpha_{u_i v})$
- $F(t)$: CDF, $H(t)$: Hazard, $S(t)$: Survival function for $f(t)$

Problem Definition: Modeling the Delay Distribution
Exponential: chances of infection die off very quickly
- $f(t_i \mid t_j; \alpha) = \alpha e^{-\alpha (t_i - t_j)}$
- $H(t_i \mid t_j) = \alpha$; $S(t_i \mid t_j) = e^{-\alpha (t_i - t_j)}$
Rayleigh: infection chances rise to a peak and then die off quickly
- $f(t_i \mid t_j; \alpha) = \alpha (t_i - t_j) e^{-\frac{1}{2} \alpha (t_i - t_j)^2}$
- $H(t_i \mid t_j) = \alpha (t_i - t_j)$; $S(t_i \mid t_j) = e^{-\frac{1}{2} \alpha (t_i - t_j)^2}$
Power law: heavier tails

Problem Definition: Inference Problems for Network Diffusion
Cascade Path Inference / Parent Inference
- Observed variables in infections $c^o = \{(u_i, t_i)\}$
- Infer unobserved parent $z_i$
- $p(z \mid \{c^o\}, \alpha) = \prod_i \frac{H(t_i \mid t_{z_i}; \alpha_{u_{z_i} u_i})}{\sum_{j \in \pi_i} H(t_i \mid t_j; \alpha_{u_j u_i})}$
- Decouples into terms involving individual infection parents $z_i$
Network Inference
- Network connection strength matrix $\alpha$ is unobserved
- Maximum likelihood estimate given $\{c^o\}$:
  $\hat{\alpha} = \arg\max_{\alpha} \log p(\{c^o\} \mid \alpha) = \arg\max_{\alpha} \log \sum_z p(C \mid \alpha)$

Problem Definition: Network Diffusion Properties and Expectations
- Property: function $f(\alpha, z)$ defined on network $\alpha$ and cascade $z$
- Model $\alpha$ and $z$ as random variables
Expected property
- Expectation $\bar{f}$ under posterior distribution $p(z, \alpha \mid \{c^o\})$:
  $\bar{f}(C, \alpha) = \mathbb{E}_{p(z, \alpha \mid \{c^o\})}[f(C, \alpha)]$
- Expectation under $\alpha$-marginal $p(\alpha \mid \{c^o\})$ for properties only of $\alpha$
- Expectation under $z$-marginal $p(z \mid \{c^o\})$ for properties only of $z$

Bayesian Framework: Posterior Distribution of Network
- Network strengths $\alpha$ are random variables with prior $p(\alpha)$
- IID assumption: $p(\alpha) = \prod_{uv} p(\alpha_{uv})$
Posterior Distribution
- $p(\alpha \mid \{c^o\}, z) = \prod_{uv} \frac{\bar{H}_{uv} \bar{S}_{uv} \, p(\alpha_{uv})}{\int_{\alpha_{uv}} \bar{H}_{uv} \bar{S}_{uv} \, p(\alpha_{uv}) \, d\alpha_{uv}}$
- Decouples into terms involving individual edge strengths $\alpha_{uv}$

Bayesian Framework: Conjugate Prior for Network
Require analytical integration with respect to $\alpha_{uv}$
Conjugate prior
- Rayleigh and Exponential are special cases of the Weibull distribution
- Gamma is conjugate for Weibull (with given shape parameter)
- $p(\alpha_{uv}) = \mathrm{Gamma}(\alpha_{uv}; a, b) = \frac{b^a}{\Gamma(a)} \alpha_{uv}^{a-1} \exp\{-b \alpha_{uv}\}$
Posterior
- $p(\alpha \mid \{c^o\}, z) = \prod_{uv} \mathrm{Gamma}(a + \rho_{uv}(z), \, b + \Delta_{uv})$
- $\rho_{uv}(z)$: number of $u$-$v$ infections; $\Delta_{uv}$: cumulative $u$-$v$ infection delay

Bayesian Framework: Suitability for Network Inference
Modeling sparse networks: given no transmission evidence, very little belief in edge existence
- For edges with no transmission, $\rho_{uv} = 0$
- For $a < 1$ and suitable $b$, posterior peaked sharply at 0
For large data volumes, mean approaches MLE
- For $\rho_{uv} \ge 1$, unimodal posterior peaked at $(a + \rho_{uv}) / (b + \Delta_{uv})$
Avoiding bias
- Gamma prior non-informative for $a, b \ll 1$
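
The conjugate update above amounts to two additions per edge. The sketch below applies it to two hypothetical edges and compares the posterior mean with the maximum likelihood estimate. It assumes exponential delays, for which the per-edge likelihood is proportional to $\alpha^{\rho} e^{-\alpha \Delta}$ and the MLE is $\rho / \Delta$; the edge counts and delays are invented, and the default prior values are the ones quoted for the talk's experiments.

```python
def gamma_posterior(rho, delta, a=1e-5, b=0.1):
    """Return (shape, rate) of the Gamma posterior for one edge strength.

    rho   -- number of u -> v transmissions under the current parents z
    delta -- cumulative u -> v infection delay
    a, b  -- Gamma prior shape and rate
    """
    return a + rho, b + delta


def posterior_mean(shape, rate):
    """Mean of a Gamma distribution in the shape/rate parameterization."""
    return shape / rate


# Hypothetical edge observed once, with a very short delay.
rho, delta = 1, 0.01
mle = rho / delta  # assumption: exponential-delay MLE for a single edge
shape, rate = gamma_posterior(rho, delta)
mean = posterior_mean(shape, rate)

# Hypothetical edge that was exposed for a long time but never transmitted.
silent_shape, silent_rate = gamma_posterior(0, 12.0)
silent_mean = posterior_mean(silent_shape, silent_rate)

print(f"MLE: {mle:.1f}, posterior mean: {mean:.2f}")
print(f"posterior mean for the silent edge: {silent_mean:.2e}")
```

The single short delay drives the MLE very high, while the prior rate b pulls the posterior mean back. The silent edge ends up with almost no mass away from zero, which matches the sparsity argument on the slide.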

Niceness of Properties: Network-nice Properties
Definition: nice-$\alpha$
A property $f(\alpha, z)$ is nice-$\alpha$ if it can be written as $g(z) \prod_{u,v} h_{uv}(\alpha_{uv}, z)$ or as $g(z) \sum_{u,v} h_{uv}(\alpha_{uv}, z)$, where $\int h_{uv}(\alpha_{uv}, z) \, p(\alpha_{uv} \mid z, \{c^o\}) \, d\alpha_{uv}$ can be performed analytically $\forall u, v$.
- Decomposes over parents and individual connection strengths
- Amenable to analytical integration with $p(\alpha_{uv} \mid z, \{c^o\})$
Theorem
Let $f(\alpha, z)$ be nice-$\alpha$. Then computing the $z$-marginal $\bar{f}_z(z) = \int_{\alpha} f(\alpha, z) \, p(\alpha \mid z, \{c^o\}) \, d\alpha$ is $O(|E|)$.

Niceness of Properties: Cascade-nice Properties
Definition: nice-$z$
A property $f(\alpha, z)$ is nice-$z$ if it can be written either as $g(\alpha) \prod_i h_i(z_i, \alpha)$ or as $g(\alpha) \sum_i h_i(z_i, \alpha)$.
- Property decomposes over individual infection parents $z_i$
Theorem
Let $f(\alpha, z)$ be nice-$z$. Then the $\alpha$-marginal $\bar{f}_{\alpha}(\alpha) = \sum_z f(\alpha, z) \, p(z \mid \alpha, \{c^o\})$ can be computed in $O(\pi |C|)$ time, where $\pi$ is the maximum number of potential parents over all infections.

Niceness of Properties: Totally-nice Properties
Definition: nice-$z, \alpha$
A property $f(\alpha, z)$ is nice-$z, \alpha$ if it can be written as $\prod_{u,v} g_{uv}(\alpha_{uv}) \prod_{i=1}^{|D|} \alpha_{u_{z_i} u_i}^{h_i(z_i)}$, where $\int g_{uv}(\alpha_{uv}) \, p(\alpha_{uv} \mid z, \{c^o\}) \, d\alpha_{uv}$ can be performed analytically $\forall u, v$.
- $\alpha_{uv}$ and $z_i$ terms are decoupled in the expectation
- $\alpha_{uv}$ terms are Gamma integrable
Theorem
Let $f(\alpha, z)$ be nice-$z, \alpha$. Then the expectation $\bar{f}(\alpha, z)$ can be computed in $O(\pi |C|) + O(|E|)$ time, up to a multiplicative constant.

Network Diffusion Properties: Interesting Network Diffusion Properties
Network-centric Properties: scores for nodes, edges, etc.
- Nodes with large path-reach (approx. LoT): not nice at all
- Nodes with large edge-reach (weak LoT): network-nice only
- Strong frequent edges: network-nice only
- Network Inference: network-nice, cascade-nice, not totally nice
- Edges that are strong or frequent but not both: totally nice
Cascade-centric Properties: scores for individual infections, infection paths, etc.
- Infections by strongest neighbor: cascade-nice only
- Infection parent inference: cascade-nice only
- Complete likelihood, Likelihood: cascade-nice only

Network-centric Properties: Details
Scores for entities in the network, e.g., nodes, edges, etc.
Building Blocks
- $\alpha^*_{uv}$: indirect network connection strength between $u$ and $v$
  $\alpha^{(r)}_{uv} = \sum_w \alpha^{(r-1)}_{uw} \alpha_{wv}$; $\alpha^*_{uv} = \sum_{r=1}^{R} \alpha^{(r)}_{uv}$
- $\rho^*_{uv}$: indirect cascade transmission frequency between $u$ and $v$

Network-centric Properties: Node-centric properties
Node influence score (approx. LoT)
- $f_u(\alpha, z; a, r) = \sum_v I(\alpha^*_{uv} > a) \, I(\rho^*_{uv}(z) \ge r)$
- Not nice at all, even for $R = 2$
Node influence score for direct infections (weak LoT)
- $f_u(\alpha, z; a, r) = \sum_v I(\alpha_{uv} > a) \, I(\rho_{uv}(z) \ge r)$
- Network-nice, but not totally-nice
- Cascade-only version: network-nice, but not totally-nice
- Network-only version: network-nice, cascade-nice, but not totally-nice

Network-centric Properties: Edge-centric properties
Edge strength-frequency distribution
- $f(\alpha, z) = \sum_{u,v} I(a_1 < \alpha_{uv} < a_2) \, I(r_1 \le \rho_{uv}(z) < r_2)$
- Network-nice, but not totally nice
- Edge strength distribution: network-nice, not totally nice
- Edge frequency distribution: network-nice, cascade-nice, not totally nice
Network Inference
- $f(\alpha, z) = \alpha$
- Network-nice, cascade-nice, not totally nice

Network-centric Properties: One Nice Edge-centric Property
Identifying strange edges
- Edges that are strong but infrequent, or weak but frequent
- $f_{uv}(\alpha, z) = \alpha_{uv}^{-\rho_{uv}(z)}$
- Totally nice: expectation can be computed exactly and efficiently

Cascade-centric Properties: Details
Scores for entities in the cascade, e.g., individual infections, infection paths
Infection property
- Infections by strongest neighbor of $u$ ($\arg\max_v \alpha_{uv}$):
  $f(\alpha, z) = \sum_i I(u_{z_i} = \arg\max_v \alpha_{v u_i})$
- Infection parent identification: $f(\alpha, z)_{iu} = 1$ if $z_i = u$; $= 0$ otherwise
- Both cascade-nice, but not totally nice
Cascade property
- Complete likelihood, Likelihood: both only cascade-nice

Approximation via MCMC: Approximate Evaluation of Not-nice Properties
Use Monte Carlo marginalization for properties that are not totally-nice.
Only Network-nice
- Marginalize over network efficiently; Monte Carlo sum over paths
- $\bar{f}(\alpha, z) \approx \frac{1}{S} \sum_s \bar{f}_z(z^{(s)})$, where $z^{(s)} \sim p(z \mid \{c^o\})$, $s = 1 \ldots S$
Only Cascade-nice
- Marginalize over paths efficiently; Monte Carlo integration over network
Not nice
- Monte Carlo marginalization over both network and paths

Approximation via MCMC: Gibbs Sampling for Network Diffusion
Full / Uncollapsed Gibbs Sampling
- Iterate over all $\alpha$ and $z$ variables, sampling a new value for each from its conditional distribution, given the current values of all other variables
- $p(z_i = j \mid z_{-i}, \alpha, \{c^o\}) \propto \alpha_{ji}$
- $\alpha_{uv} \mid z, \alpha_{-uv}, \{c^o\} \sim \mathrm{Gamma}(\rho_{uv} + a, \, \Delta_{uv} + b)$
- For sparse networks, sample $\alpha_{uv}$ only when $\rho_{uv} > 0$
Collapsed Gibbs sampling for network-nice properties
- Integrate out $\alpha$ analytically, sample only $z$ variables
- $p(z_i = j \mid z_{-i}, \{c^o\}) \propto \frac{\rho^{-i}_{u_j u_i}(z) + a}{\Delta_{u_j u_i} + b}$

Experiments: Algorithms
Bayesian Expectation
- Gamma prior parameters: $a = 0.00001$, $b = 0.1$
Frequentist Plug-in
- Take point estimate $\hat{\alpha}$ of network
- Most likely infection parents given $\hat{\alpha}$: $\hat{z} = \arg\max_z p(z \mid \hat{\alpha}, \{c^o\})$
- Evaluate $f(\hat{\alpha}, \hat{z})$
- For $\hat{\alpha}$, use MONET
Exponential distribution for all experiments

Experiments: Synthetic Data
- Evaluate accuracy against a gold-standard
- Analysis for various random graph models
Data Generation
- Forest Fire, Random, Hierarchical, Core-Periphery
- 1000 nodes, ∼2000 edges
- $\alpha_{uv} \sim U(0.01, 10)$
- 20 splitting cascades with 2 random seeds, ∼50,000 infections
Evaluation
- Parent inference: accuracy against true parent $z^*$
- Strength inference: best achievable given true parents
- Property: error with respect to $f(\alpha^*, z^*)$

Experiments: Synthetic Data Experiments: Weak LoT distribution
Bayesian recovery significantly better!
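
The collapsed Gibbs update from the sampling slide, $p(z_i = j \mid z_{-i}, \{c^o\}) \propto (\rho^{-i}_{u_j u_i}(z) + a) / (\Delta_{u_j u_i} + b)$, can be sketched as a single sweep over the infections. This is a minimal illustration under stated assumptions, not the authors' implementation: the cumulative delays `delta` are taken as precomputed and independent of the parents, the data layout and the name `collapsed_gibbs_sweep` are hypothetical, and the three-infection cascade is made up.

```python
import random
from collections import Counter


def collapsed_gibbs_sweep(nodes, candidates, parents, delta, a, b, rng):
    """Resample every non-seed infection's parent with alpha integrated out.

    nodes[i]       -- node infected in infection i
    candidates[i]  -- indices of earlier infections that could be i's parent
    parents[i]     -- current parent index of infection i (None for seeds)
    delta[(u, v)]  -- cumulative u -> v delay (assumed precomputed)
    a, b           -- Gamma prior shape and rate
    """
    # rho[(u, v)]: number of infections of v currently attributed to u.
    rho = Counter(
        (nodes[p], nodes[i]) for i, p in enumerate(parents) if p is not None
    )
    for i, cands in enumerate(candidates):
        if not cands:
            continue  # seed infection: nothing to resample
        child = nodes[i]
        rho[(nodes[parents[i]], child)] -= 1  # counts without infection i
        weights = [
            (rho[(nodes[j], child)] + a) / (delta[(nodes[j], child)] + b)
            for j in cands
        ]
        parents[i] = rng.choices(cands, weights=weights)[0]
        rho[(nodes[parents[i]], child)] += 1  # add infection i back
    return parents


# Hypothetical cascade: A is the seed, then B, then C.
nodes = ["A", "B", "C"]
candidates = [[], [0], [0, 1]]
parents = [None, 0, 0]
delta = {("A", "B"): 1.0, ("A", "C"): 2.5, ("B", "C"): 1.5}

rng = random.Random(0)
for _ in range(10):
    parents = collapsed_gibbs_sweep(
        nodes, candidates, parents, delta, a=1e-5, b=0.1, rng=rng
    )
print(parents)
```

Each sampled parent vector $z^{(s)}$ would then feed the analytically computed $\bar{f}_z(z^{(s)})$ in the Monte Carlo average used for network-nice properties.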

Experiments: Synthetic Data Experiments: Approx. LoT distribution
Bayesian recovery significantly better!

Experiments: Synthetic Data Experiments: One-to-one Properties
(BE: Bayesian Expectation, FP: Frequentist Plug-in)
Loglikelihood    Test BE   Test FP   Train BE   Train FP
CorePeriphery    1.0e4     0.6e4     2.8e4      3.6e4
Hierarchical     6.5e3     2.4e3     2.0e4      2.2e4
Random           1.1e4     -1.5e4    2.3e4      2.9e4
ForestFire       1.2e4     926       2.8e4      3.3e4

Network and Parent Inference
                 Network Inf BE   Network Inf FP   Parent Inf BE   Parent Inf FP
CorePeriphery    0.116            2.553            0.533           0.406
Hierarchical     0.884            3.210            0.861           0.783
Random           0.147            17.483           0.757           0.646
ForestFire       0.329            736.821          0.770           0.674
Bayesian approach avoids overfitting for one-to-one properties

Experiments: Real Data Experiments
Meme-Tracker
- Meme diffusion between 5000 blogs, news sites (Mar 11 - Feb 12)
- 5 topics: Basketball, Alcohol, Technology, NBA, Occupy
- Long cascades: length > 30
- 80-20 train-test split; infections of new users pruned in test
Test Loglikelihood   BE        FP
Bball                -1.5e6    -3.5e6
Alcohol              -5.8e5    -8.9e5
Tech                 -6.6e5    -2.6e6
NBA                  -8.9e5    -1.1e7
Occupy               -5.1e5    -1.2e6
Bayesian approach generalizes much better

Experiments: Scaling Experiments
- Map reduce implementation; 12 core server
- Randomly sampled cascades from Meme-Tracker
Scaling (roughly) linear in number of cores

Experiments: LIVE at Wimbledon 2014
Identify most influential Twitter handles on Wimbledon-related topics
Challenges
- Scaling to millions of tweets and users
- Online updates to influence scores
- Considering textual content of tweets for analyzing influence
- Interpretability of the scores

Experiments: Digression: IBM Debating Technologies
- Develop technologies that assist humans to debate and reason using current world knowledge when there are no clear yes or no answers.
- E.g. Should smoking be banned altogether?
- Use cases: Government policy making, health-care, legal, finance
- Visit our webpage and see (unfortunately not the latest) demo online (google 'IBM Debating Technologies')