A Random Graph Model embedding Vertices Information

Stevenn Volant, Hugo Zanghi and Christophe Ambroise
Exalead,
Laboratoire Statistique et Génome,
La Génopole - Université d'Évry
Journées MAS 2008, August 28th
Volant, Zanghi, Ambroise, Matias
Introduction

The Web as a Graph

At Exalead:
- Modelling:
  - nodes are pages or websites,
  - edges are hyperlinks.
- Use cases:
  - ranking,
  - spam detection,
  - website bounds,
  - web subgraph extraction.
- Data size:
  - billions of pages,
  - ~ 10 hyperlinks per page,
  - ~ 100 million websites.
Introduction: Real Networks

Embedding Vertices Information

Structure analysis with vertices information:
- clustering,
- function relationship.

Families of networks:
- world wide web,
- social networks,
- biological networks.

[Figure: insurance sites clustered into car insurance (blue), life insurance (orange), financial insurance (green), and health insurance (purple).]

Let us define a statistical model.
Introduction

Random Graphs: Notations and Basic Model

Notations:
- E a set of edges ⊂ {1, . . . , n}²,
- X = (Xij) the adjacency matrix such that {Xij = 1} = I{i ↔ j}.

Possible graphs:
- oriented: Xij ≠ Xji,
- valued: Xij ∈ R.

Random graph definition:
- the distribution of X describes the topology of the network.

Erdős-Rényi (ER) model (1959):
- (Xij) independent, with Bernoulli distribution B(p).

[Figure: a small graph on vertices i, j, k with Xij = 1 and Xjk = 0.]
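As a quick illustration (ours, not part of the talk), the ER model can be sampled in a few lines of numpy; the function name `sample_er` is an assumption for this sketch:

```python
import numpy as np

def sample_er(n, p, rng=None):
    """Sample a symmetric ER adjacency matrix: Xij ~ B(p), no self-loops."""
    rng = np.random.default_rng(rng)
    upper = rng.random((n, n)) < p      # independent Bernoulli draws
    X = np.triu(upper, k=1)             # keep the strict upper triangle
    return (X | X.T).astype(int)        # symmetrise, zero diagonal

X = sample_er(100, 0.1, rng=0)
```

The same Bernoulli draw on the full matrix, without the symmetrisation step, would give the oriented variant where Xij and Xji may differ.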
Introduction

Erdős-Rényi Mixtures for Graphs
MixNet: an alternative probabilistic model

Origin:
- model developed by J.-J. Daudin et al. (2008),
- ER model generalization,
- application fields: biology, internet, social networks, ...

Modelling connection heterogeneity:
- hypothesis: there exists a hidden structure into Q classes of connectivity,
- Z = (Zi)i, Ziq = I{i ∈ q} are independent hidden variables,
- α = {αq}, the prior proportions of groups,
- (Zi) ~ M(1, α).

Distribution of X:
- conditional distribution: Xij | {Ziq Zjl = 1} ~ B(πql),
- Xij | Z are independent,
- π = (πql) is the connectivity matrix.

MixNet: Erdős-Rényi Mixture for Networks.
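A MixNet draw only adds one step to the ER sketch above: sample the hidden classes first, then connect each pair with the class-pair probability. This is our illustrative code, with the name `sample_mixnet` assumed:

```python
import numpy as np

def sample_mixnet(n, alpha, pi, rng=None):
    """Sample Zi ~ M(1, alpha), then Xij | Z ~ B(pi[q, l]) (undirected)."""
    rng = np.random.default_rng(rng)
    Q = len(alpha)
    z = rng.choice(Q, size=n, p=alpha)      # hidden class of each vertex
    probs = pi[np.ix_(z, z)]                # pi_{z_i, z_j} for every pair
    X = (rng.random((n, n)) < probs).astype(int)
    X = np.triu(X, 1)
    X = X + X.T                             # symmetric, no self-loops
    return X, z

alpha = [0.2, 0.3, 0.1, 0.4]
pi = np.full((4, 4), 0.2)
np.fill_diagonal(pi, 0.05)                  # affiliation-style connectivity
X, z = sample_mixnet(150, alpha, pi, rng=0)
```

With a constant matrix `pi`, this degenerates to the ER model, which is the generalization the slide refers to.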
Covariates in Hidden Structures using Mixture Models

CohsMix: Covariates in Hidden Structures using Mixture Models

Embedding information:
- Extending data relationships with exogenous variables: n nodes with p covariates. Examples:
  - Web: text, geolocation, language, ranking, etc.
  - Social: age, geolocation, gender, etc.
  - Biological: gene expression data.

Handling information: two cases
- adding a covariate vector ⇒ Yi (p dimensions) for each node,
- building a similarity matrix ⇒ Y (n by n).
Covariates in Hidden Structures using Mixture Models

Mixture Models and Dependencies

Different dependencies:
1. P(X, Y, Z) = P(Z) P(X, Y | Z) = P(Z) P(Y | Z) P(X | Z)
2. P(X, Y, Z) = P(Z) P(X | Z) P(Y | X, Z)
3. P(X, Y, Z) = P(Z) P(Y | Z) P(X | Y, Z) ⇒ not considered in this talk.

Common distributions:
- Zi ~ M(1, α) with α = {αq}, the prior proportions of groups,
- conditional distribution: Xij | {Ziq Zjl = 1} ~ B(πql).
Covariates in Hidden Structures using Mixture Models

Model 1: Independence of X and Y conditionally on Z

Considering independence between edges and covariates:
- normal distribution: Yij | {Ziq Zjl = 1} ~ N(µql, σ²).

Affiliation model:
  µqq = µ1, ∀q ∈ [1, Q],
  µql = µ2, ∀(q, l) ∈ [1, Q]², q ≠ l.

Remarks:
- strong assumption: independence between covariates and edges,
- not a realistic model.
Covariates in Hidden Structures using Mixture Models

Model 2: Dependence between X and Y

Considering dependence between edges and covariates:
  Yij | {Ziq Zjl = 1} ~ N(µql, σ²)    if Xij = 1,
  Yij | {Ziq Zjl = 1} ~ N(µ̃ql, σ²)   if Xij = 0.

Affiliation model:
  µqq = µ1, ∀q ∈ [1, Q],
  µql = µ2, ∀(q, l) ∈ [1, Q]², q ≠ l,
and
  µ̃qq = µ̃1, ∀q ∈ [1, Q],
  µ̃ql = µ̃2, ∀(q, l) ∈ [1, Q]², q ≠ l.
Covariates in Hidden Structures using Mixture Models

Model 2: Log-Likelihood of Y given X, Z

P(Y | X, Z) = ∏_{i,j} ∏_{q,l} P(Yij | Xij = 1)^(xij ziq zjl) P(Yij | Xij = 0)^((1−xij) ziq zjl).

Hence

log P(Y | X, Z)
  = Σ_{i,j} Σ_{q,l} ziq zjl xij log P(Yij | Xij = 1)
  + Σ_{i,j} Σ_{q,l} ziq zjl (1 − xij) log P(Yij | Xij = 0)
  = Σ_{i≠j} Σ_{q,l} ziq zjl xij [ −(yij − µql)²/(2σ²) + (yij − µ̃ql)²/(2σ²) ]
  − Σ_{i≠j} Σ_{q,l} ziq zjl (yij − µ̃ql)²/(2σ²)
  + Σ_{i≠j} Σ_{q,l} ziq zjl log(1/√(2πσ²)).
Covariates in Hidden Structures using Mixture Models

Generative Models

Simulation settings: n = 150, Q = 4, α = (0.2, 0.3, 0.1, 0.4), πql = 0.2, πqq = 0.05, µqq = 2, µql = 4, µ̃qq = 5, µ̃ql = 7 and σ² = 1.

[Figure, left: model 1, P(X, Y, Z) = P(Z) P(Y | Z) P(X | Z). Right: model 2, P(X, Y, Z) = P(Z) P(X | Z) P(Y | X, Z).]
Covariates in Hidden Structures using Mixture Models

How to estimate the model parameters?

Log-likelihood(s) of the model:
→ observed data: L(X, Y) = log(Σ_Z exp L(X, Y, Z)),
→ complete data: L(X, Y, Z),
⇒ Q^n partitions: not tractable.

EM-based strategies:
→ expectation of the complete data: Q(θ) = E[L(X, Y, Z) | X, Y],
→ EM-like strategies require the knowledge of Pr(Z | X, Y),
⇒ in our case, this distribution is not tractable (no conditional independence).

Objective
Restrict the set of distributions of Z: variational approach.
Approximation: R(Z) = ∏_{i=1}^{n} P(Zi | X, Y, θ^(m)).
Covariates in Hidden Structures using Mixture Models

Variational Method (Daudin et al., 2008)

Principle:
→ optimize J(θ) defined by:
  J(θ) = L(X, Y) − KL(R(Z), Pr(Z | X, Y, θ^(m))),
→ R(Z) is chosen such that KL(R(Z), Pr(Z | X, Y)) is minimal,
→ if R(Z) = Pr(Z | X, Y), then J(θ) = L(X, Y).

After simplification:
  E_R(Z)(J(θ)) = E_R(Z)(log Pθ(X, Y, Z) | X, Y, θ) − Σ_Z R(Z) log R(Z)
⇒ tractable.
Covariates in Hidden Structures using Mixture Models

Estimation: Variational Approach of the EM Algorithm

Quantity to maximize:

E_R(Z)(J(θ))
  = Σ_{i=1}^{n} Σ_{q=1}^{Q} R(ziq) log(αq)
  + Σ_{i≠j} Σ_{q,l} R(ziq) R(zjl) [ xij log(πql) + (1 − xij) log(1 − πql) ]
  + Σ_{i≠j} Σ_{q,l} R(ziq) R(zjl) [ xij ( −(yij − µql)²/(2σ²) + (yij − µ̃ql)²/(2σ²) ) − (yij − µ̃ql)²/(2σ²) ]
  + Σ_{i≠j} Σ_{q,l} R(ziq) R(zjl) log(1/√(2πσ²))
  − Σ_Z R(Z) log R(Z).
Covariates in Hidden Structures using Mixture Models

Estimation: Optimal Parameters

- π̂ql = Σ_{i≠j} R(Ziq) R(Zjl) xij / Σ_{i≠j} R(Ziq) R(Zjl),

- α̂q = (1/n) Σ_{i=1}^{n} R(Ziq),

- µ̂ql = Σ_{i≠j} R(Ziq) R(Zjl) xij yij / Σ_{i≠j} R(Ziq) R(Zjl) xij,

- µ̃̂ql = Σ_{i≠j} R(Ziq) R(Zjl) (1 − xij) yij / Σ_{i≠j} R(Ziq) R(Zjl) (1 − xij),

- σ̂² = Σ_{i≠j} Σ_{q,l} R(Ziq) R(Zjl) [ xij ( (yij − µ̂ql)² − (yij − µ̃̂ql)² ) + (yij − µ̃̂ql)² ] / Σ_{i≠j} Σ_{q,l} R(Ziq) R(Zjl),

- τ̂iq^(m+1) ∝ αq ∏_{j≠i} ∏_{l} [ π̂ql^(xij) (1 − π̂ql)^(1−xij) · (1/√(2πσ²)) exp( ( xij ( −(yij − µql)² + (yij − µ̃ql)² ) − (yij − µ̃ql)² ) / (2σ²) ) ]^(τjl^(m)).
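The closed-form M-step updates above are a handful of weighted averages. This sketch (our code, name `m_step` assumed) computes them from the variational posteriors tau[i, q] = R(Ziq); it materializes the full (n, n, Q, Q) weight tensor for clarity, so it is meant for small n:

```python
import numpy as np

def m_step(X, Y, tau):
    """Closed-form M-step given tau[i, q] = R(Z_iq)."""
    n, Q = tau.shape
    W = np.einsum('iq,jl->ijql', tau, tau)   # R(Z_iq) R(Z_jl) per pair
    W[np.eye(n, dtype=bool)] = 0.0           # drop the i = j terms
    Xw = X[:, :, None, None].astype(float)
    Yw = Y[:, :, None, None]
    alpha = tau.mean(axis=0)
    pi = (W * Xw).sum((0, 1)) / W.sum((0, 1))
    mu = (W * Xw * Yw).sum((0, 1)) / (W * Xw).sum((0, 1))
    mu_t = (W * (1 - Xw) * Yw).sum((0, 1)) / (W * (1 - Xw)).sum((0, 1))
    resid = Xw * ((Yw - mu) ** 2 - (Yw - mu_t) ** 2) + (Yw - mu_t) ** 2
    sigma2 = (W * resid).sum() / W.sum()
    return alpha, pi, mu, mu_t, sigma2
```

Each line mirrors one formula on the slide; the `resid` expression is exactly the bracketed term in the σ̂² update.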
Covariates in Hidden Structures using Mixture Models

Model Selection: ICL Criterion

The Integrated Classification Likelihood:

ICL(Q) = max_θ log P(X, Y, Z | θ, Q)
  − (Q²/2) log(n(n−1)/2)                      ← penalization term related to πql
  − Q² log(n(n−1)/2) − (1/2) log(n(n−1)/2)    ← penalization term related to µql, µ̃ql and σ²
  − ((Q−1)/2) log(n)                          ← penalization term related to αq
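For completeness, the penalization part of the criterion as we read it off the slide (Q² parameters each for π, µ and µ̃, one for σ², Q − 1 free proportions; the function name `icl_penalty` is ours):

```python
import numpy as np

def icl_penalty(n, Q):
    """Sum of the ICL(Q) penalization terms from the slide."""
    m = n * (n - 1) / 2                            # number of unordered pairs
    pen_pi = 0.5 * Q**2 * np.log(m)                # Q^2 connectivity params
    pen_mu = Q**2 * np.log(m) + 0.5 * np.log(m)    # mu, mu~ and sigma^2
    pen_alpha = 0.5 * (Q - 1) * np.log(n)          # Q - 1 free proportions
    return pen_pi + pen_mu + pen_alpha
```

Model selection then picks the Q maximizing the maximized complete-data log-likelihood minus this penalty.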
Application: Simulated Data

Synthetic network:
  n = 150, Q = 4,
  α = (0.2, 0.3, 0.1, 0.4),
  πqq = 0.05, ∀q ∈ [1, Q],
  πql = 0.2, ∀(q, l) ∈ [1, Q]², q ≠ l,
  µqq = 2, ∀q ∈ [1, Q],
  µql = 4, ∀(q, l) ∈ [1, Q]², q ≠ l,
  µ̃qq = 5, ∀q ∈ [1, Q],
  µ̃ql = 7, ∀(q, l) ∈ [1, Q]², q ≠ l,
  σ² = 1.

Results match the simulated number of groups.
Application: Simulated Data

Parameter estimation:

Parameter | Simulated | Model 1 | Model 2
πqq       | 0.05      | 0.044   | 0.047
πql       | 0.2       | 0.18    | 0.18
µqq       | 2         | 2.04    | 2.03
µql       | 4         | 4.02    | 3.99
µ̃qq       | 5         | /       | 5.01
µ̃ql       | 7         | /       | 7.01
σ²        | 1         | 0.99    | 1.003

Table: Means of the parameter estimates of the two models computed over 10 runs ("/" : not estimated by model 1).
Application: Simulated Data

Algorithm Comparison

- Reference: HMRF with covariates [Besag (1986)].
- X is known.
- Favorable Markov field settings:
  - simulate X: classical Erdős-Rényi (parameter λ),
  - simulate Z | X: Gibbs sampling and a Potts model,
  - simulate Yi^(p) | Zi: Yi^(p) | Zi ~ N(ζq, σ²),
  - build Yij: Yij = <Yi, Yj>.
Algorithm Comparison (results)

[Figure, left: Markov favorable settings. Right: CohsMix favorable settings.]
Application

Web Search Results Clustering using Hypertextuality

Aims:
- reduce ambiguous queries ("avocat", "orange", "jaguar", etc.),
- organize search results into groups, one for each meaning of the query.

Innovation:
- use the hypertextuality,
- competitive communities ("abortion", "political", "jo", etc.).
Application

Document Similarities Matrix

- Vector space model: weight vector for document i:
  vi = [w1,i, w2,i, ..., wp,i]^T, where wt,i = TFt,i · log(|D| / |t ∈ D|),
  - TFt,i: term frequency of term t in document i,
  - log(|D| / |t ∈ D|): inverse document frequency, with |D| the total number of documents and |t ∈ D| the number of documents containing the term t.
- Document similarities: comparing deviations of angles between each pair of document vectors:
  cos θi,j = Yi,j = (vi · vj) / (||vi|| ||vj||).
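The TF-IDF cosine similarity matrix can be sketched with the standard library alone (our code, name `tfidf_cosine` assumed; documents are pre-tokenised lists of terms):

```python
import math
from collections import Counter

def tfidf_cosine(docs):
    """Y_ij = cos(theta_ij) between the TF-IDF vectors of the documents."""
    N = len(docs)
    tf = [Counter(d) for d in docs]                   # TF_t,i per document
    df = Counter(t for d in docs for t in set(d))     # |{d : t in d}|
    vecs = [{t: c * math.log(N / df[t]) for t, c in d.items()} for d in tf]
    def cos(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    return [[cos(vecs[i], vecs[j]) for j in range(N)] for i in range(N)]

docs = [["orange", "phone", "adsl"], ["orange", "town", "france"]]
Y = tfidf_cosine(docs)
```

Note that a term appearing in every document gets IDF zero and so contributes nothing to the similarity, which is the intended behaviour of the weighting.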
Application

Main Process

A query-time process:
1. retrieve the web pages matching the query;
2. compute the text similarities;
3. fetch the websites graph (because webpages are under-connected);
4. run multithreaded CohsMix to find the optimal models;
5. show the clusters.

[Figure: results for the query "orange".]
Application

Example

Query "orange", Q = 6, n = 269:

Q | α    | Analysis    | URL examples
2 | 0.25 | Phone       | pressphone.com, comparatel.fr
3 | 0.11 | Orange      | orange.fr, www.orange-wifi.com
5 | 0.3  | French town | ville-orange.fr, immobilier.orange.fr
6 | 0.16 | ADSL        | dedibox-news.com, infosadsl.com

Junk clusters: Q = 1 and Q = 4.

- Proof of concept.
- Needs tuning:
  - document similarities,
  - websites graph.
Conclusion and Perspectives

- CohsMix:
  - uses MixNet, a probabilistic model which captures features of real networks,
  - embeds vertex information to detect hidden structure.
- References:
  - Daudin J.-J., Picard F., Robin S. (2008), A mixture model for random graphs, Statistics and Computing.
  - Zanghi H., Ambroise C., Miele V. (2008), Fast online graph clustering via Erdős-Rényi mixture, Pattern Recognition.
- Software:
  - CohsMix, an R implementation of this talk (available on demand?).
- Perspectives:
  - compare with other methods (bi-clustering),
  - demo online: http://labs.exalead.com.