A Random Graph Model embedding Vertices Information

Transcription
Title slide

A Random Graph Model embedding Vertices Information
Stevenn Volant, Hugo Zanghi and Christophe Ambroise
Exalead; Laboratoire Statistique et Genome, La Genopole, Universite d'Evry
Journées MAS 2008, August 28th
(Slide footer throughout: Volant, Zanghi, Ambroise, Matias)

Introduction: The Web as a Graph, at Exalead

Modelisation:
- nodes are pages or websites,
- edges are hyperlinks.

Use cases:
- ranking,
- spam detection,
- website bounds,
- web subgraph extraction.

Data size:
- billions of pages,
- ∼ 10 hyperlinks per page,
- ∼ 100 million websites.

Introduction: Real networks, Embedding Vertices Information

Structure analysis with vertices information:
- clustering,
- function relationship.

Families of networks:
- world wide web,
- social networks,
- biological networks.

(Figure: insurance websites clustered into car insurance (blue), life insurance (orange), financial insurance (green) and health insurance (purple).)

Let us define a statistical model.

Introduction: Random graphs, notations and basic model

Notations:
- E, a set of edges in {1, . . . , n}²,
- X = (X_ij), the adjacency matrix, such that X_ij = I{i ↔ j}.

(Diagram: an edge, X_ij = 1, between nodes i and j; no edge, X_jk = 0, between j and k.)

Possible graphs:
- oriented: X_ij ≠ X_ji,
- valued: X_ij ∈ R.

Random graph definition: the distribution of X describes the topology of the network.

Erdős-Rényi (ER) model (1959): the (X_ij) are independent, with Bernoulli distribution B(p).
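As a quick aside, the ER definition above is easy to simulate. The following Python sketch (my own illustration, not code from the talk; function and variable names are assumptions) draws the adjacency matrix of an undirected G(n, p) graph:

```python
import numpy as np

def sample_er(n, p, seed=None):
    """Draw an undirected Erdos-Renyi graph: each pair {i, j} gets an
    edge independently with probability p, i.e. X_ij ~ Bernoulli(p)."""
    rng = np.random.default_rng(seed)
    # i.i.d. Bernoulli draws, kept only on the strict upper triangle
    upper = np.triu(rng.random((n, n)) < p, k=1)
    X = (upper | upper.T).astype(int)   # symmetrize so X_ij = X_ji
    return X

X = sample_er(100, 0.1, seed=0)
```

For an oriented graph one would keep both triangles as independent draws instead of symmetrizing.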
Erdős-Rényi Mixtures for Graphs. MixNet: an alternative probabilistic model

Origin:
- model developed by J.-J. Daudin et al. (2008),
- generalization of the ER model,
- application fields: biology, internet, social networks, etc.

Modelling connection heterogeneity:
- hypothesis: there exists a hidden structure into Q classes of connectivity,
- Z = (Z_i)_i, with Z_iq = I{i ∈ q}, are independent hidden variables,
- α = {α_q}, the prior proportions of groups, and (Z_i) ∼ M(1, α).

Distribution of X:
- conditional distribution: X_ij | {Z_iq Z_jl = 1} ∼ B(π_ql),
- the X_ij | Z are independent,
- π = (π_ql) is the connectivity matrix.

MixNet: Erdős-Rényi Mixture for Networks.

CohsMix: Covariates in Hidden Structures using Mixture models

Embedding information: extending data relationships with exogenous variables, n nodes with p covariates. Examples:
- Web: text, geolocation, language, ranking, etc.
- Social: age, geolocation, gender, etc.
- Biological: gene expression data.

Handling information, two cases:
- adding a covariate vector Y_i (p dimensions) for each node,
- building a similarity matrix Y (n by n).

Mixture models and dependencies

Different dependencies:
1. P(X, Y, Z) = P(Z) P(X, Y | Z) = P(Z) P(Y | Z) P(X | Z)
2. P(X, Y, Z) = P(Z) P(X | Z) P(Y | X, Z)
3. P(X, Y, Z) = P(Z) P(Y | Z) P(X | Y, Z), not considered in this talk.

Common distributions:
- Z_i ∼ M(1, α) with α = {α_q}, the prior proportions of groups,
- conditional distribution: X_ij | {Z_iq Z_jl = 1} ∼ B(π_ql).

Model 1: independence of X and Y conditionally on Z

Considering independence between edges and covariates:
- normal distribution: Y_ij | {Z_iq Z_jl = 1} ∼ N(μ_ql, σ²).

Affiliation model:
  μ_qq = μ_1, ∀q ∈ [1, Q],
  μ_ql = μ_2, ∀(q, l) ∈ [1, Q]², q ≠ l.

Remarks:
- strong assumption: independence between covariates and edges,
- not a realistic model.

Model 2: dependence between X and Y

Considering dependence between edges and covariates:
  Y_ij | {Z_iq Z_jl = 1} ∼ N(μ_ql, σ²)   if X_ij = 1,
  Y_ij | {Z_iq Z_jl = 1} ∼ N(μ̃_ql, σ²)  if X_ij = 0.

Affiliation model:
  μ_qq = μ_1, ∀q ∈ [1, Q], and μ_ql = μ_2, ∀(q, l) ∈ [1, Q]², q ≠ l,
  μ̃_qq = μ̃_1, ∀q ∈ [1, Q], and μ̃_ql = μ̃_2, ∀(q, l) ∈ [1, Q]², q ≠ l.

Model 2: log-likelihood of Y given X and Z

P(Y | X, Z) = ∏_{i,j} ∏_{q,l} P(Y_ij | X_ij = 1)^(x_ij z_iq z_jl) P(Y_ij | X_ij = 0)^((1 − x_ij) z_iq z_jl).

Hence, for i ≠ j:

log P(Y | X, Z)
  = Σ_{i≠j} Σ_{q,l} z_iq z_jl x_ij [ −(y_ij − μ_ql)² / (2σ²) + (y_ij − μ̃_ql)² / (2σ²) ]
  − Σ_{i≠j} Σ_{q,l} z_iq z_jl (y_ij − μ̃_ql)² / (2σ²)
  + Σ_{i≠j} Σ_{q,l} z_iq z_jl log( 1 / √(2πσ²) ).
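To make model 2 concrete, here is a minimal Python sketch (my own illustration; function names and the use of NumPy are assumptions, not the authors' R code) that generates (Z, X, Y) under the factorization P(Z)P(X|Z)P(Y|X,Z) and evaluates the log-likelihood log P(Y | X, Z) written above:

```python
import numpy as np

def sample_cohsmix2(n, alpha, pi, mu, mu_t, sigma2, seed=None):
    """Generate (z, X, Y) under model 2: Z_i ~ M(1, alpha),
    X_ij | Z ~ B(pi[q, l]), then Y_ij ~ N(mu[q, l], sigma2) if X_ij = 1
    and N(mu_t[q, l], sigma2) if X_ij = 0."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(alpha), size=n, p=alpha)   # hidden class labels
    P = pi[np.ix_(z, z)]                          # pi[z_i, z_j] for every pair
    X = (rng.random((n, n)) < P).astype(int)
    np.fill_diagonal(X, 0)                        # no self-loops
    M = np.where(X == 1, mu[np.ix_(z, z)], mu_t[np.ix_(z, z)])
    Y = rng.normal(M, np.sqrt(sigma2))
    return z, X, Y

def loglik_y(Y, X, z, mu, mu_t, sigma2):
    """log P(Y | X, Z): Gaussian log-density of each off-diagonal Y_ij,
    with the mean chosen by the edge indicator X_ij."""
    n = len(z)
    M = np.where(X == 1, mu[np.ix_(z, z)], mu_t[np.ix_(z, z)])
    ll = -(Y - M) ** 2 / (2 * sigma2) - 0.5 * np.log(2 * np.pi * sigma2)
    off = ~np.eye(n, dtype=bool)                  # sum over i != j only
    return ll[off].sum()
```

With the simulation values from the later slides (Q = 4, α = (0.2, 0.3, 0.1, 0.4), and so on), this reproduces the kind of synthetic network used in the experiments.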
Generative Models

Example simulation: n = 150, Q = 4, α = (0.2, 0.3, 0.1, 0.4), π_ql = 0.2, π_qq = 0.05, μ_qq = 2, μ_ql = 4, μ̃_ql = 5, μ̃_qq = 7 and σ² = 1.

(Figures: a network drawn from model 1, P(X, Y, Z) = P(Z) P(Y | Z) P(X | Z), and one drawn from model 2, P(X, Y, Z) = P(Z) P(X | Z) P(Y | X, Z).)

How to estimate the model parameters?

Log-likelihoods of the model:
- observed data: L(X, Y) = log( Σ_Z exp L(X, Y, Z) ),
- complete data: L(X, Y, Z),
- summing over the Q^n partitions is not tractable.

EM-based strategies:
- expectation of the complete data: Q(θ) = E[L(X, Y, Z) | X, Y],
- EM-like strategies require the knowledge of Pr(Z | X, Y),
- in our case, this distribution is not tractable (no conditional independence).

Objective: restrict the set of distributions of Z, a variational approach.
Approximation: R(Z) = ∏_{i=1}^n P(Z_i | X, Y, θ^(m)).

Variational method (Daudin et al., 2008)

Principle:
- optimize J(θ) defined by J(θ) = L(X, Y) − KL( R(Z), Pr(Z | X, Y, θ^(m)) ),
- R(Z) is chosen such that KL( R(Z), Pr(Z | X, Y) ) is minimal,
- if R(Z) = Pr(Z | X, Y) then J(θ) = L(X, Y).

After simplification:
  E_{R(Z)}[J(θ)] = E_{R(Z)}[ log P_θ(X, Y, Z) | X, Y, θ ] − Σ_Z R(Z) log R(Z),
which is tractable.

Estimation: variational approach of the EM algorithm

Quantity to maximize:

E_{R(Z)}[J(θ)] = Σ_{i=1}^n Σ_{q=1}^Q R(z_iq) log α_q
  + Σ_{i≠j} Σ_{q,l} R(z_iq) R(z_jl) [ x_ij log π_ql + (1 − x_ij) log(1 − π_ql) ]
  + Σ_{i≠j} Σ_{q,l} R(z_iq) R(z_jl) [ x_ij ( −(y_ij − μ_ql)² / (2σ²) + (y_ij − μ̃_ql)² / (2σ²) ) − (y_ij − μ̃_ql)² / (2σ²) ]
  + Σ_{i≠j} Σ_{q,l} R(z_iq) R(z_jl) log( 1 / √(2πσ²) )
  − Σ_Z R(Z) log R(Z).

Estimation: optimal parameters

α̂_q = ( Σ_{i=1}^n R(Z_iq) ) / n,

π̂_ql = Σ_{i≠j} R(Z_iq) R(Z_jl) x_ij / Σ_{i≠j} R(Z_iq) R(Z_jl),

μ̂_ql = Σ_{i≠j} R(Z_iq) R(Z_jl) x_ij y_ij / Σ_{i≠j} R(Z_iq) R(Z_jl) x_ij,

μ̃̂_ql = Σ_{i≠j} R(Z_iq) R(Z_jl) (1 − x_ij) y_ij / Σ_{i≠j} R(Z_iq) R(Z_jl) (1 − x_ij),

σ̂² = Σ_{i≠j} Σ_{q,l} R(Z_iq) R(Z_jl) [ x_ij ( (y_ij − μ̂_ql)² − (y_ij − μ̃̂_ql)² ) + (y_ij − μ̃̂_ql)² ] / Σ_{i≠j} Σ_{q,l} R(Z_iq) R(Z_jl),

τ̂_iq^(m+1) ∝ α_q ∏_{j≠i} ∏_l [ π̂_ql^(x_ij) (1 − π̂_ql)^(1 − x_ij) · (1/√(2πσ²)) exp( ( x_ij ( −(y_ij − μ_ql)² + (y_ij − μ̃_ql)² ) − (y_ij − μ̃_ql)² ) / (2σ²) ) ]^(τ_jl^(m)).

Model selection: ICL algorithm

The Integrated Classification Likelihood:

ICL(Q) = max_θ log P(X, Y, Z | θ, Q)
  − (1/2) Q² log( n(n − 1)/2 )                  (penalization term related to π_ql)
  − Q² log( n(n − 1)/2 ) − log( n(n − 1)/2 )    (penalization term related to μ_ql, μ̃_ql and σ²)
  − ((Q − 1)/2) log(n)                          (penalization term related to α_q).

Application: simulated data

Synthetic network:
  n = 150, Q = 4, α = (0.2, 0.3, 0.1, 0.4),
  π_qq = 0.05, ∀q ∈ [1, Q]; π_ql = 0.2, ∀(q, l) ∈ [1, Q]², q ≠ l,
  μ_qq = 2; μ_ql = 4 (q ≠ l); μ̃_qq = 5; μ̃_ql = 7 (q ≠ l); σ² = 1.

Results match the simulated number of groups.

Parameter estimation (means of the parameter estimates of the two models, computed over 10 runs):

Parameter | Simulated | Model 1 | Model 2
π_qq      | 0.2       | 0.18    | 0.18
π_ql      | 0.05      | 0.044   | 0.047
μ_qq      | 2         | 2.04    | 2.03
μ_ql      | 4         | 4.02    | 3.99
μ̃_qq     | 5         | /       | 5.01
μ̃_ql     | 7         | /       | 7.01
σ²        | 1         | 0.99    | 1.003
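The closed-form M-step estimators on the slides above can be transcribed almost directly. A hedged Python sketch follows (my own illustration; the talk's software is R, and here R(Z_iq) is stored as an n×Q matrix tau):

```python
import numpy as np

def m_step(X, Y, tau):
    """One M-step of the variational EM: closed-form updates for
    alpha, pi, mu, mu_tilde and sigma^2, with tau[i, q] ~ R(Z_iq)."""
    n, Q = tau.shape
    off = (~np.eye(n, dtype=bool)).astype(float)   # mask out i == j

    def S(W):
        # S[q, l] = sum over i != j of tau_iq * tau_jl * W_ij
        return np.einsum('iq,jl,ij->ql', tau, tau, W * off)

    N = S(np.ones((n, n)))                  # total pair mass per (q, l)
    alpha = tau.mean(axis=0)                # alpha_q = (1/n) sum_i tau_iq
    pi = S(X) / N                           # edge frequency per class pair
    Sx, Sx1 = S(X * 1.0), S(1.0 - X)
    # means of Y over present / absent edges; empty cells default to 0
    mu = np.divide(S(X * Y), Sx, out=np.zeros_like(Sx), where=Sx > 0)
    mu_t = np.divide(S((1 - X) * Y), Sx1, out=np.zeros_like(Sx1), where=Sx1 > 0)
    # sigma^2 pools x*(d1 - d0) + d0, i.e. x*d1 + (1 - x)*d0, as on the slide
    num = 0.0
    for q in range(Q):
        for l in range(Q):
            d1 = (Y - mu[q, l]) ** 2
            d0 = (Y - mu_t[q, l]) ** 2
            w = np.outer(tau[:, q], tau[:, l]) * off
            num += (w * (X * (d1 - d0) + d0)).sum()
    sigma2 = num / N.sum()
    return alpha, pi, mu, mu_t, sigma2
```

Alternating this M-step with the fixed-point update of τ given above yields the variational EM loop.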
Algorithm comparison

- Reference: HMRF with covariates [Besag (1986)].
- X is known: favorable Markov-field settings:
  - simulate X: classical Erdős-Rényi (parameter λ),
  - simulate Z | X: Gibbs sampling and a Potts model,
  - simulate Y_i | Z_i: Y_i | Z_i ∼ N(ζ_q, σ²) (p-dimensional),
  - build Y_ij = < Y_i, Y_j >.

(Figures: comparison results under Markov-favorable settings and under CohsMix-favorable settings.)

Application: web search results clustering using hypertextuality

Aims:
- reduce ambiguous queries ("avocat", "orange", "jaguar", etc.),
- organize search results into groups, one for each meaning of the query.

Innovation:
- use the hypertextuality,
- competitive communities ("abortion", "political", "jo", etc.).

Document similarities matrix

Vector space model. Weight vector for document i:
  v_i = [w_{1,i}, w_{2,i}, ..., w_{p,i}]^T, where w_{t,i} = TF_t · log( |D| / |t ∈ D| ),
- TF_t: term frequency of term t in document i,
- log( |D| / |t ∈ D| ): inverse document frequency, with |D| the total number of documents and |t ∈ D| the number of documents containing the term t.

Document similarities: compare the deviations of angles between each pair of document vectors:
  cos θ_{i,j} = Y_{i,j} = v_i · v_j / ( ||v_i|| ||v_j|| ).

Application: main process

A query-time process:
1. retrieve web pages matching the query;
2. compute the text similarities;
3. fetch the websites graph (because web pages are under-connected);
4. run the CohsMix algorithm (multithreaded) to find the optimal models;
5. show the clusters.
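The text similarities of step 2 use the TF-IDF cosine construction from the document-similarities slide. A minimal plain-Python sketch (my own illustration; Exalead's actual pipeline is not shown in the talk, and the whitespace tokenization here is an assumption):

```python
import math
from collections import Counter

def tfidf_cosine_matrix(docs):
    """Build the similarity matrix Y with Y_ij = cos(v_i, v_j), where
    v_i has TF-IDF weights w_t,i = TF_t * log(|D| / df_t)."""
    D = len(docs)
    tfs = [Counter(doc.split()) for doc in docs]  # term frequencies per doc
    df = Counter()                                # document frequency per term
    for tf in tfs:
        df.update(tf.keys())
    vecs = [{t: tf[t] * math.log(D / df[t]) for t in tf} for tf in tfs]

    def cos(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    return [[cos(vecs[i], vecs[j]) for j in range(D)] for i in range(D)]
```

Note that a term occurring in every document gets IDF log(1) = 0 and so drops out of the similarities, which is the intended effect of the IDF factor.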
(Figure: search-result clusters for the query "orange".)

Application: example

Query "orange", with Q = 6 and n = 269:

Q | α    | Analysis    | URL examples
2 | 0.25 | Phone       | pressphone.com, comparatel.fr
3 | 0.11 | Orange      | orange.fr, www.orange-wifi.com
5 | 0.3  | French town | ville-orange.fr, immobilier.orange.fr
6 | 0.16 | ADSL        | dedibox-news.com, infosadsl.com

Junk clusters: Q = 1 and Q = 4.

- Proof of concept.
- Needs tuning: document similarities, websites graph.

Conclusion and perspectives

CohsMix:
- uses MixNet, a probabilistic model which captures features of real networks,
- embeds vertices information to detect hidden structure.

References:
- Daudin J.-J., Picard F., Robin S. (2008), A mixture model for random graphs, Statistics and Computing.
- Zanghi H., Ambroise C., Miele V. (2008), Fast online graph clustering via Erdős-Rényi mixture, Pattern Recognition.

Software:
- CohsMix, R code for this talk (available on demand?).

Perspectives:
- compare with other methods (bi-clustering),
- online demo: http://labs.exalead.com.