Learning to Rank for Information Retrieval
Tie-Yan Liu, Lead Researcher, Microsoft Research Asia

Speaker
• Tie-Yan Liu
  – Lead Researcher, Microsoft Research Asia
  – Co-author of 70+ papers in SIGIR, WWW, NIPS, ICML, KDD, etc.
  – Co-author of the SIGIR Best Student Paper (2008) and the JVCIR Most Cited Paper Award (2004–2006).
  – Author of two books on learning to rank.
  – Associate editor, ACM Transactions on Information Systems; editorial board member, Information Retrieval Journal.
  – PC Chair of RIAO 2010, Area Chair of SIGIR (2008–2011), Track Chair of WWW (2011).

Outline
• Introduction
• Major Approaches to Learning to Rank
• Analysis of the Approaches
• Benchmark Evaluation of the Approaches
• Statistical Learning Theory for Ranking
• Summary and Outlook

Ranking in Information Retrieval

Overwhelmed by the Flood of Information

Facts about the Web
• According to www.worldwidewebsize.com, major search engines index at least tens of billions of web pages.
• CUIL.com claimed to index 127 billion pages.
• Google claimed to know of 1 trillion pages.

Ranking is Essential
• Given a query, a ranking model orders the documents d_j in the indexed document repository D.

Conventional Ranking Models
• Query-dependent
  – Boolean model, extended Boolean model, etc.
  – Vector space model, latent semantic indexing (LSI), etc.
  – BM25 model, statistical language model, etc.
  – Span-based model, distance aggregation model, etc.
• Query-independent
  – PageRank, TrustRank, BrowseRank, Toolbar Clicks, etc.

Evaluation of Ranking Models
• Construct a test set containing a large number of (randomly sampled) queries, their associated documents, and a relevance judgment for each query–document pair.
• Evaluate the ranking result produced by a model for each query in the test set with an evaluation measure.
• Use the average measure over the entire test set to represent the overall ranking performance.

Collecting Queries and Documents
• Queries are usually randomly sampled from query logs.
• TREC's pooling strategy for sampling documents
  – A pool of possibly relevant documents is created by taking a sample of documents selected by the various participating systems.
  – The top 100 documents retrieved in each submitted run for a given query are selected and merged into the pool for human assessment.
  – On average, an assessor judges the relevance of approximately 1500 documents per query.

Relevance Judgment
• Degree of relevance l_k
  – Binary: relevant vs. irrelevant
  – Multiple ordered categories: Perfect > Excellent > Good > Fair > Bad
• Pairwise preference l_{u,v}
  – Document A is more relevant than document B
• Total order
  – Documents are ranked as {A, B, C, ...} according to their relevance

Evaluation Measure - MAP
• Precision at position k for query q:
  P@k = #{relevant documents in top k results} / k
• Average precision for query q:
  AP = (1 / #{relevant documents}) Σ_k P@k · l_k,
  where l_k = 1 if the document at position k is relevant and 0 otherwise.
  Example: if the relevant documents are ranked at positions 1, 3, and 5, then AP = (1/3)(1/1 + 2/3 + 3/5) ≈ 0.76.
• MAP: AP averaged over all queries.

Evaluation Measure - NDCG
• NDCG at position k for query q:
  NDCG@k = Z_k Σ_{j=1}^{k} G(π^{-1}(j)) η(j)
  – Z_k: normalization (the inverse of the cumulated gain of the ideal ranking)
  – Gain: G(j) = 2^{l_j} − 1
  – Position discount: η(j) = 1 / log(j + 1)
  – π^{-1}(j): the document ranked at the j-th position of π; π(j): the rank position of document j in π.
• Averaged over all queries.
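To make these definitions concrete, here is a small sketch (mine, not part of the tutorial) that computes P@k, AP, and NDCG@k for a single ranked list of labels; a base-2 logarithm is assumed for the position discount, since the slide does not specify the base.

```python
import math

def precision_at_k(labels, k):
    """labels: relevance labels of the ranked list (binary for P@k and AP)."""
    return sum(labels[:k]) / k

def average_precision(labels):
    """AP over one ranked list of binary labels."""
    num_rel = sum(labels)
    if num_rel == 0:
        return 0.0
    return sum(precision_at_k(labels, k + 1)
               for k, l in enumerate(labels) if l == 1) / num_rel

def ndcg_at_k(labels, k):
    """NDCG@k over one ranked list of graded labels (0, 1, 2, ...)."""
    def dcg(ls):
        # gain 2^l - 1, discount 1/log2(position + 1), positions starting at 1
        return sum((2 ** l - 1) / math.log2(j + 2) for j, l in enumerate(ls[:k]))
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0

# The AP example from the slide: relevant documents at positions 1, 3, and 5.
print(average_precision([1, 0, 1, 0, 1]))   # ~0.76
print(ndcg_at_k([2, 0, 1, 0, 1], k=5))
```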
Evaluation Measure - Summary
• Query-level: every query contributes equally to the measure.
  – Computed on the documents associated with the same query.
  – Bounded for each query.
  – Averaged over all test queries.
• Position-based: rank positions are explicitly used.
  – Top-ranked objects are more important.
  – What matters is the relative order, rather than the relevance score, of each document.
  – Non-continuous and non-differentiable with respect to the scores.

Learning to Rank

Challenges of Conventional Models
• Manual parameter tuning is usually difficult, especially when there are many parameters and the evaluation measures are non-smooth.
• Manual parameter tuning sometimes leads to over-fitting.
• It is non-trivial to combine the large number of models proposed in the literature to obtain an even more effective model.

Machine Learning Can Help
• Machine learning is an effective tool
  – to automatically tune parameters,
  – to combine multiple evidences,
  – to avoid over-fitting (by means of regularization, etc.).
• "Learning to Rank"
  – In general, methods that use machine learning technologies to solve the problem of ranking can be called "learning to rank" methods.

Learning to Rank
• In most recent work, learning to rank is defined as having the following two properties:
  – feature based;
  – discriminative training.

Feature Based
• Documents are represented by feature vectors.
  – Features are extracted for each query–document pair.
  – Even if a feature is the output of an existing retrieval model, one assumes that the parameters of that model are fixed, and only learns the optimal way of combining such features.
• The capability of combining a large number of features is very promising.
  – Any new progress on retrieval models can easily be incorporated by including the output of the new model as a feature. (A minimal sketch of such a feature-based scoring function follows.)
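As a purely illustrative sketch of the feature-based view (not from the tutorial), the snippet below scores query–document pairs by a linear combination of features; the particular features and weights are made up for the example.

```python
import numpy as np

# Each row is the feature vector of one query-document pair, e.g.
# [BM25 score, language-model score, PageRank, #query terms in title].
# These particular features and weights are only illustrative.
X = np.array([
    [21.3, -4.2, 0.52, 2.0],
    [17.9, -5.1, 0.88, 1.0],
    [ 9.4, -7.3, 0.10, 0.0],
])

w = np.array([0.6, 0.2, 1.5, 0.3])   # learned combination weights

scores = X @ w                        # f(x) = w^T x for every document
ranking = np.argsort(-scores)         # sort documents by descending score
print(scores, ranking)
```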
Discriminative Training
• An automatic learning process based on the training data, built on the four pillars of discriminative learning:
  – Input space: contains the objects under investigation. Usually the objects are represented by feature vectors, extracted according to the application.
  – Output space: there are two related but different definitions — the output space of the task, and the output space used to facilitate learning. For example, when regression is used to solve classification, the output space of the task is {+1, −1} while the output space used to facilitate learning is the real numbers.
  – Hypothesis space: defines the class of functions mapping the input space to the output space. The functions operate on the feature vectors of the input objects and make predictions according to the format of the output space.
  – Loss function: measures to what degree the prediction generated by the hypothesis is in accordance with the ground-truth label. For example, widely used loss functions for classification include the exponential loss, the hinge loss, and the logistic loss. With the loss function, an empirical risk can be defined on the training set, and the optimal hypothesis is usually (but not always) learned by means of empirical risk minimization.
• It is not even necessary to have a probabilistic explanation of the learning process.
• Such automatic learning is in high demand for real search engines:
  – every day these search engines receive a large amount of user feedback and usage logs;
  – it is very important to automatically learn from this feedback and constantly improve the ranking mechanism.

Learning to Rank Framework

Wide Applications of Learning to Rank
• Document retrieval
• Question answering
• Multimedia retrieval
• Text summarization
• Online advertising
• Collaborative filtering
• Machine translation

Recent Trends in Learning to Rank
• Successfully applied to web search
• A hot topic in IR and ML
  – Hundreds of publications at SIGIR, ICML, NIPS, etc.
  – Multiple sessions at SIGIR every year
  – Several workshops at SIGIR, ICML, NIPS, etc.
  – Several tutorials at SIGIR, WWW, ACL, etc.
  – Special issues of IRJ and JMLR
  – The Yahoo! Learning to Rank Challenge
  – ...

Outline
• Introduction
• Major Approaches to Learning to Rank
• Analysis of the Approaches
• Benchmark Evaluation of the Approaches
• Statistical Learning Theory for Ranking
• Summary and Outlook

Learning to Rank Algorithms
Least Square Retrieval Function (TOIS 1989), Query refinement (WWW 2008), Nested Ranker (SIGIR 2006), SVM-MAP (SIGIR 2007), ListNet (ICML 2007), Pranking (NIPS 2002), MPRank (ICML 2007), LambdaRank (NIPS 2006), FRank (SIGIR 2007), Learning to retrieve information (SCC 1995), MHR (SIGIR 2007), RankBoost (JMLR 2003), LDM (SIGIR 2005), Large margin ranker (NIPS 2002), IRSVM (SIGIR 2006), RankNet (ICML 2005), Ranking SVM (ICANN 1999), Discriminative model for IR (SIGIR 2004), SVM Structure (JMLR 2005), OAP-BPM (ICML 2003), Subset Ranking (COLT 2006), GPRank (LR4IR 2007), QBRank (NIPS 2007), GBRank (SIGIR 2007), Constraint Ordinal Regression (ICML 2005), McRank (NIPS 2007), AdaRank (SIGIR 2007), RankCosine (IP&M 2007), Relational ranking (WWW 2008), CCA (SIGIR 2007), SoftRank (LR4IR 2007), ListMLE (ICML 2008), Supervised Rank Aggregation (WWW 2007), Learning to order things (NIPS 1998), Round robin ranking (ECML 2003), ...

Key Questions to Answer
• In what respects are these learning to rank algorithms similar, and in which aspects do they differ? What are the strengths and weaknesses of each algorithm?
• Empirically speaking, which of these many learning to rank algorithms performs the best?
• What are the unique theoretical issues for ranking that should be investigated?
• Are there remaining issues regarding learning to rank to study in the future?

Categorization of the Algorithms
• Pointwise approach
  – Regression: Least Square Retrieval Function (TOIS 1989), Regression Tree for Ordinal Class Prediction (Fundamenta Informaticae 2000), Subset Ranking using Regression (COLT 2006), ...
  – Classification: Discriminative model for IR (SIGIR 2004), McRank (NIPS 2007), ...
  – Ordinal regression: Pranking (NIPS 2002), OAP-BPM (ECML 2003), Ranking with Large Margin Principles (NIPS 2002), Constraint Ordinal Regression (ICML 2005), ...
• Pairwise approach
  – Learning to Retrieve Information (SCC 1995), Learning to Order Things (NIPS 1998), Ranking SVM (ICANN 1999), RankBoost (JMLR 2003), LDM (SIGIR 2005), RankNet (ICML 2005), FRank (SIGIR 2007), MHR (SIGIR 2007), GBRank (SIGIR 2007), QBRank (NIPS 2007), MPRank (ICML 2007), IRSVM (SIGIR 2006), LambdaRank (NIPS 2006), ...
• Listwise approach
  – Non-measure-specific: ListNet (ICML 2007), ListMLE (ICML 2008), BoltzRank (ICML 2009), ...
  – Measure-specific: AdaRank (SIGIR 2007), SVM-MAP (SIGIR 2007), SoftRank (LR4IR 2007), RankGP (LR4IR 2007), ...

The Pointwise Approach
• Reduce ranking to
  – regression (e.g., Subset Ranking),
  – classification (e.g., Discriminative model for IR, McRank),
  – ordinal regression (e.g., PRanking, Ranking with Large Margin Principles).
• For a query q with documents x_1, ..., x_m, the training instances are the individual documents with their labels: (x_1, y_1), (x_2, y_2), ..., (x_m, y_m).

Subset Ranking using Regression (D. Cossock and T. Zhang, COLT 2006)
• Regard the relevance degree as a real number, and use regression to learn the ranking function.
• Regression-based loss: L(f; x_j, y_j) = (f(x_j) − y_j)²

McRank (P. Li, et al. NIPS 2007)
• Multi-class classification is used to learn the ranking function.
  – For document x_j, the output of the classifier is ŷ_j.
  – Loss function: a surrogate of the classification error I{ŷ_j ≠ y_j}.
• The ranking is produced by combining the outputs of the classifiers:
  p̂_{j,k} = P(ŷ_j = k),  f(x_j) = Σ_{k=1}^{K} k · p̂_{j,k}

Pranking with Ranking (K. Crammer and Y. Singer, NIPS 2002)
• Ranking is solved by ordinal regression: documents are scored along a direction w (score w·x) and mapped to ordered categories by learned thresholds.

Problems with the Pointwise Approach
• The properties of IR evaluation measures are not well considered.
  – The fact that some documents are associated with the same query and some are not is ignored.
    • When the number of associated documents varies largely across queries, the overall loss function will be dominated by queries with a large number of documents.
  – The positions of documents in the ranked list are invisible to the loss function.
    • The pointwise loss may unconsciously over-emphasize unimportant documents (those ranked low in the final results).

The Pairwise Approach
• Reduce ranking to pairwise classification
  – RankNet and FRank
  – RankBoost
  – Ranking SVM
  – MHR
  – IR-SVM
• For a query q with documents x_1, ..., x_m, the training instances are document pairs with preference labels, e.g., (x_1, x_2, +1), (x_2, x_1, −1), ..., (x_2, x_m, +1), (x_m, x_2, −1). (A minimal sketch of this pair construction follows.)
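A minimal sketch (mine) of the pair construction above: graded judgments for one query are turned into preference-labeled document pairs, and pairs with equal labels are dropped because they carry no preference.

```python
from itertools import combinations

def make_pairs(docs, labels):
    """Turn one query's (document, graded label) list into pairwise instances.

    Returns tuples (doc_u, doc_v, +1/-1): +1 means doc_u should rank above doc_v.
    Pairs with equal labels carry no preference and are skipped.
    """
    pairs = []
    for (u, lu), (v, lv) in combinations(zip(docs, labels), 2):
        if lu == lv:
            continue
        pairs.append((u, v, +1) if lu > lv else (u, v, -1))
    return pairs

# Toy example: three documents with graded relevance 2 > 1 > 0.
print(make_pairs(["d1", "d2", "d3"], [2, 0, 1]))
# -> [('d1', 'd2', 1), ('d1', 'd3', 1), ('d2', 'd3', -1)]
```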
RankNet (C. Burges, et al. ICML 2005)
• Target probability: P̄_{u,v} = 1 if y_{u,v} = 1; P̄_{u,v} = 0 otherwise.
• Modeled probability:
  P_{u,v}(f) = exp(f(x_u) − f(x_v)) / (1 + exp(f(x_u) − f(x_v)))
• Cross entropy as the loss function:
  L(f; x_u, x_v, y_{u,v}) = −P̄_{u,v} log P_{u,v}(f) − (1 − P̄_{u,v}) log(1 − P_{u,v}(f))
• A neural network is used as the model, and gradient descent as the algorithm, to optimize the cross-entropy loss.

FRank (M. Tsai, et al. SIGIR 2007)
• Similar setting to that of RankNet.
• New loss function: the fidelity loss
  L(f; x_u, x_v, y_{u,v}) = 1 − sqrt(P̄_{u,v} · P_{u,v}(f)) − sqrt((1 − P̄_{u,v})(1 − P_{u,v}(f)))

RankBoost (Y. Freund, et al. JMLR 2003)
• Exponential loss:
  L(f; x_u, x_v, y_{u,v}) = exp(−y_{u,v}(f(x_u) − f(x_v)))
• Boosting procedure: a weak learner that classifies important pairs more correctly gets a larger weight α; the hypothesis is the combination of the weak learners; hard pairs are emphasized by updating their distribution.
• Due to the analogy to AdaBoost, RankBoost inherits many nice properties, such as the ability to perform feature selection, convergence in training, and certain generalization abilities.

Ranking SVM (R. Herbrich, et al., Advances in Large Margin Classifiers, 2000; T. Joachims, KDD 2002)
• Hinge loss:
  L(f; x_u, x_v, y_{u,v}) = [1 − y_{u,v}(f(x_u) − f(x_v))]_+
• Primal problem:
  min (1/2)||w||² + C Σ_{i=1}^{n} Σ_{u,v: y^{(i)}_{u,v}=1} ξ^{(i)}_{u,v}
  s.t. w^T(x_u^{(i)} − x_v^{(i)}) ≥ 1 − ξ^{(i)}_{u,v} if y^{(i)}_{u,v} = 1;  ξ^{(i)}_{u,v} ≥ 0,  i = 1, ..., n.
  – x_u − x_v is used as a positive instance of learning; SVM performs binary classification on these instances to learn the model parameter w.
• Since Ranking SVM is well rooted in the framework of SVM, it inherits nice properties of SVM. For example, with the help of margin maximization, Ranking SVM can have a good generalization ability. Kernel tricks can also be applied to Ranking SVM, so as to handle complex non-linear problems.

Problems with the Pairwise Approach
• When the relevance judgment is given in terms of multiple ordered categories, converting it to pairwise preferences loses the information about the finer granularity of the judgment.
• The distribution of the number of document pairs across queries is more skewed than the distribution of the number of documents.
• The pairwise approach is more sensitive to noisy labels than the pointwise approach.
• Most pairwise ranking algorithms do not consider the positions of documents in the final ranking results.

Improved Algorithms
• Multi-Hyperplane Ranker (MHR)
• IR-SVM
• LambdaRank

Multi-Hyperplane Ranker (T. Qin, et al. SIGIR 2007)
• If there are K categories, there are K(K−1)/2 pairwise preferences between two labels.
  – Learn a model f_{k,l} for each pair of categories k and l, using any previous method (for example, Ranking SVM).
• Testing: rank aggregation
  f(x) = Σ_{k,l} α_{k,l} f_{k,l}(x)

IR-SVM (J. Xu, et al. WWW 2007)
• Introduce query-level normalization into the pairwise loss function.
  – Use the number of document pairs of each query to normalize its pairwise loss.
• IRSVM (Y. Cao, J. Xu, et al. SIGIR 2006; T. Qin, T.-Y. Liu, et al. IPM 2007):
  min (1/2)||w||² + C Σ_{i=1}^{N} (1/m̃^{(i)}) Σ_{u,v} ξ^{(i)}_{u,v}
  where 1/m̃^{(i)} is the query-level normalizer.
• Significant improvement has been observed by using the query-level normalizer.
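For concreteness, here is a small sketch (mine) of the three pairwise surrogate losses above, assuming the scores f(x_u) and f(x_v) have already been produced by some scoring model.

```python
import math

def ranknet_loss(su, sv, y):
    """Cross-entropy loss of RankNet for one pair; y = +1 means u is preferred over v."""
    p_bar = 1.0 if y == 1 else 0.0          # target probability
    diff = su - sv
    p = 1.0 / (1.0 + math.exp(-diff))       # modeled probability P_{u,v}(f)
    eps = 1e-12                              # avoid log(0)
    return -(p_bar * math.log(p + eps) + (1 - p_bar) * math.log(1 - p + eps))

def rankboost_loss(su, sv, y):
    """Exponential loss of RankBoost for one pair."""
    return math.exp(-y * (su - sv))

def hinge_loss(su, sv, y):
    """Hinge loss of Ranking SVM for one pair."""
    return max(0.0, 1.0 - y * (su - sv))

# A correctly ordered pair (margin +2) vs. a mis-ordered one (margin -2).
for su, sv in [(3.0, 1.0), (1.0, 3.0)]:
    print(ranknet_loss(su, sv, +1), rankboost_loss(su, sv, +1), hinge_loss(su, sv, +1))
```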
LambdaRank (C. Burges, NIPS 2006)
• Basic idea
  – The evaluation measures (which are position based) are directly used to define the gradient with respect to each document pair in the training process.
  – Why is it feasible to directly define the gradient?
• Lambda function
  – The gradient is directly defined as a lambda function, without caring what the corresponding loss function really looks like.
  – For each pair of documents x_u and x_v, in each round of optimization, their scores are updated by +λ and −λ respectively.

The Listwise Approach
• Measure-specific loss
  – Optimize the IR evaluation measures directly.
• Non-measure-specific loss
  – Optimize a loss function defined on all documents associated with a query, according to the unique properties of ranking.

Measure-specific Listwise Ranking
• Also known as "direct optimization of IR measures".
• It is natural to directly optimize what is used to evaluate the ranking results.
• However, it is non-trivial.
  – Evaluation measures such as NDCG are non-continuous and non-differentiable since they depend on the rank positions (S. Robertson and H. Zaragoza, Information Retrieval 2007).
  – It is challenging to optimize such objective functions, since most optimization techniques in the literature were developed to handle continuous and differentiable cases.

Tackle the Challenges
• Approximate the objective
  – Soften (approximate) the evaluation measure so as to make it smooth and differentiable (SoftRank).
• Bound the objective
  – Optimize a smooth and differentiable upper bound of the evaluation measure (SVM-MAP).
• Optimize the non-smooth objective directly
  – Use the IR measure to update the distribution in boosting (AdaRank).
  – Use genetic programming (RankGP).

SoftRank (M. Taylor, et al. LR4IR 2007 / WSDM 2008)
• Key idea: avoid sorting by treating scores as random variables.
• Steps
  – Construct the score distribution.
  – Map the score distribution to a rank distribution.
  – Compute the expected NDCG (SoftNDCG), which is smooth and differentiable.
  – Optimize SoftNDCG by gradient descent.

Construct the Score Distribution
• Consider the score of each document s_j as a Gaussian random variable:
  p(s_j) = N(s_j | f(x_j), σ_s²)

From Scores to a Rank Distribution
• Probability of d_u being ranked before d_v:
  π_{uv} = ∫_0^∞ N(s | f(x_u) − f(x_v), 2σ_s²) ds
• Rank distribution (the probability of a document being ranked at a particular position), defined recursively by adding documents one at a time:
  p_j^{(u)}(r) = p_j^{(u−1)}(r − 1) π_{uj} + p_j^{(u−1)}(r)(1 − π_{uj})

SoftNDCG
• The expected NDCG is used as the objective; in the language of a loss function,
  L(f; x, y) = 1 − E[NDCG] = 1 − (1/Z_m) Σ_{j=1}^{m} (2^{y_j} − 1) Σ_{r=0}^{m−1} η(r) p_j(r)
  – Only p_j(r) contains the model parameter w, and it is differentiable with respect to w.
• A neural network is used as the model, and gradient descent is used as the algorithm to optimize the above loss function.
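A toy numerical sketch (mine, not SoftRank's actual implementation) of the pipeline above: Gaussian score distributions are turned into per-document rank distributions via the recursion, and then into an expected (soft) NDCG; the shared variance σ is an arbitrary illustrative value.

```python
import math

def pi_before(fu, fv, sigma):
    """P(document u scores higher than v) when both scores are N(f, sigma^2)."""
    return 0.5 * (1.0 + math.erf((fu - fv) / (2.0 * sigma)))

def rank_distribution(scores, j, sigma):
    """Probability of document j landing at each rank (0 = top), via the
    SoftRank-style recursion: add the other documents one by one."""
    p = [1.0]                                   # alone, j is at rank 0
    for u, fu in enumerate(scores):
        if u == j:
            continue
        pi = pi_before(fu, scores[j], sigma)    # chance that u is ranked before j
        p = [(p[r - 1] * pi if r > 0 else 0.0) +
             (p[r] * (1 - pi) if r < len(p) else 0.0)
             for r in range(len(p) + 1)]
    return p

def soft_ndcg(scores, labels, sigma=1.0):
    gains = [2 ** y - 1 for y in labels]
    ideal = sum(g / math.log2(r + 2) for r, g in enumerate(sorted(gains, reverse=True)))
    expected = sum(g * sum(pr / math.log2(r + 2)
                           for r, pr in enumerate(rank_distribution(scores, j, sigma)))
                   for j, g in enumerate(gains))
    return expected / ideal if ideal > 0 else 0.0

print(soft_ndcg(scores=[2.5, 0.4, 1.1], labels=[2, 0, 1], sigma=1.0))
```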
SVM-MAP (Y. Yue, et al. SIGIR 2007)
• Uses the framework of structured SVM for optimization
  – I. Tsochantaridis, et al. ICML 2004; T. Joachims, ICML 2005.
• The sum of slacks Σξ upper bounds Σ(1 − AP):
  min (1/2)||w||² + (C/N) Σ_{i=1}^{N} ξ^{(i)}
  s.t. ∀ y' ≠ y^{(i)}: w^T Ψ(y^{(i)}, x^{(i)}) ≥ w^T Ψ(y', x^{(i)}) + Δ(y^{(i)}, y') − ξ^{(i)},
  where Ψ(y', x) = Σ_{u,v: y_u=1, y_v=0} y'_{u,v}(x_u − x_v) and Δ(y^{(i)}, y') = 1 − AP(y').

Challenges in SVM-MAP
• For AP, the true ranking is a ranking where the relevant documents are all ranked in the front.
• An incorrect ranking would be any other ranking.
• There is an exponential number of possible rankings, and thus an exponential number of constraints!
• Solution: use a working set containing only the most violated constraints.

Finding the Most Violated Constraint
• Most violated y': argmax_{y'} [Δ(y, y') + w^T Ψ(y', x)]
• If the relevance at each position is fixed, Δ(y, y') will be the same.
• Strategy (with a complexity of O(m log m))
  – Sort the relevant and the irrelevant documents separately and form a perfect ranking.
  – Start with the perfect ranking and swap adjacent relevant and irrelevant documents, so as to find the optimal interleaving of the two sorted lists.

AdaRank (J. Xu and H. Li. SIGIR 2007)
• Embed the IR evaluation measure in the distribution update of boosting.
  – A weak learner that ranks important queries more correctly will have a larger weight α.
  – Hard queries are emphasized by updating their distribution.

RankGP (J. Yeh, et al. LR4IR 2007)
• Define the ranking function as a tree.
• Learning
  – Single-population GP, which has been widely used to optimize non-smooth, non-differentiable objectives.
• Evolution mechanism
  – Crossover, mutation, reproduction, tournament selection.
• Fitness function
  – An IR evaluation measure (MAP).

Non-Measure-specific Listwise Ranking
• Define listwise loss functions based on an understanding of the unique properties of ranking for IR.
• Representative algorithms
  – ListNet
  – ListMLE
  – BoltzRank

Ranking Loss is Non-trivial!
• An example:
  – function f: f(A)=3, f(B)=0, f(C)=1  → ranking ACB
  – function h: h(A)=4, h(B)=6, h(C)=3  → ranking BAC
  – ground truth g: g(A)=6, g(B)=4, g(C)=3  → ranking ABC
• Question: which function is closer to the ground truth?
  – Based on pointwise similarity: sim(f,g) < sim(g,h).
  – Based on pairwise similarity: sim(f,g) = sim(g,h).
  – Based on cosine similarity between the score vectors f = <3,0,1>, g = <6,4,3>, h = <4,6,3>: sim(f,g) = 0.85, sim(g,h) = 0.93, so h looks closer.
• However, according to NDCG, f should be closer to g!

ListNet (Z. Cao, et al. ICML 2007)
• Ranked list ↔ permutation probability distribution.
• A more informative representation of a ranked list: permutations and ranked lists have a one-to-one correspondence.

Defining the Permutation Probability
• The probability of a permutation is defined with the Luce model:
  P(π | f) = Π_{j=1}^{m} φ(f(x_{π(j)})) / Σ_{k=j}^{m} φ(f(x_{π(k)}))
• Example:
  P(ABC | f) = [φ(f(A)) / (φ(f(A)) + φ(f(B)) + φ(f(C)))] · [φ(f(B)) / (φ(f(B)) + φ(f(C)))] · [φ(f(C)) / φ(f(C))]
  = P(A ranked No.1) · P(B ranked No.2 | A ranked No.1) · P(C ranked No.3 | A ranked No.1, B ranked No.2),
  where P(B ranked No.2 | A ranked No.1) = P(B ranked No.1) / (1 − P(A ranked No.1)).
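A minimal sketch (mine) of the Luce-model permutation probability with φ = exp, reproducing the style of the P(ABC | f) computation above.

```python
import math

def permutation_probability(scores, perm, phi=math.exp):
    """Luce-model probability of a permutation.

    scores: dict mapping document id -> score f(x)
    perm:   tuple of document ids, best-ranked first
    """
    prob = 1.0
    for j in range(len(perm)):
        remaining = perm[j:]                                  # documents not yet placed
        denom = sum(phi(scores[d]) for d in remaining)
        prob *= phi(scores[perm[j]]) / denom                  # P(perm[j] wins among remaining)
    return prob

f = {"A": 3.0, "B": 0.0, "C": 1.0}
print(permutation_probability(f, ("A", "C", "B")))   # largest: matches f's own ordering
print(permutation_probability(f, ("A", "B", "C")))
print(permutation_probability(f, ("B", "C", "A")))   # smallest
```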
Properties of the Luce Model
• Continuous, differentiable, and concave with respect to the scores s.
• Suppose f(A) = 3, f(B) = 0, f(C) = 1. Then P(ACB) is the largest and P(BCA) is the smallest; starting from the ranking given by f (ACB), swapping any two documents decreases the probability, e.g., P(ACB) > P(ABC).
• Translation invariant when φ(s) = exp(s).
• Scale invariant when φ(s) = a·s.

Distance between Ranked Lists
• With φ = exp, use the KL divergence to measure the difference between the permutation distributions:
  dis(f, g) = 0.46, dis(g, h) = 2.56
  – consistent with the intuition that f is closer to the ground truth g.

K-L Divergence Loss
• Loss function = the KL divergence between two permutation probability distributions (φ = exp):
  L(f; x, y) = D( P_y || P(· | f(x)) )
  – P_y: the probability distribution defined by the ground truth.
  – P(· | f(x)): the probability distribution defined by the model output.
• Model = neural network; algorithm = gradient descent.

Limitations of ListNet
• Too complex to use in practice: O(m·m!).
• By using the top-k Luce model, the complexity of ListNet can be reduced to polynomial order of m.
• However,
  – when k is large, the complexity is still high;
  – when k is small, the information loss in the top-k Luce model affects the effectiveness of ListNet.

Solution
• Given the scores output by the ranking model, compute the likelihood of a single ground-truth permutation, and maximize it to learn the model parameters.
• The complexity is reduced from O(m·m!) to O(m).

ListMLE (F. Xia, et al. ICML 2008)
• Likelihood loss:
  L(f; x, y) = −log P(π_y | f(x))
• Model = neural network; algorithm = stochastic gradient descent.

Extension of ListMLE
• Other types of ground-truth labels are handled with the concept of an "equivalent permutation set" Ω_y:
  – for relevance degrees, Ω_y contains all permutations that sort the documents by non-increasing relevance degree;
  – for pairwise preferences, Ω_y contains all permutations consistent with the given preferences.
• The loss becomes the minimum over the equivalent set:
  L(f; x, y) = min_{π_y ∈ Ω_y} −log P(π_y | f(w, x))
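Continuing with the same Luce model, a short sketch (mine) of the ListMLE likelihood loss; the suffix-sum trick keeps the computation linear in the list length, and no numerical-stability tricks (such as subtracting the maximum score) are included.

```python
import math

def listmle_loss(scores, ground_truth_perm):
    """Negative log-likelihood of the ground-truth permutation under the
    Luce model with phi = exp; scores maps document id -> f(x)."""
    s = [scores[d] for d in ground_truth_perm]
    loss, denom = 0.0, 0.0
    # Walk the permutation backwards so each denominator is a running suffix sum.
    for score in reversed(s):
        denom += math.exp(score)
        loss += math.log(denom) - score
    return loss

f = {"A": 3.0, "B": 0.0, "C": 1.0}
print(listmle_loss(f, ("A", "B", "C")))   # ground truth g ranks A > B > C
print(listmle_loss(f, ("A", "C", "B")))   # f's own ordering gets a lower loss
```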
BoltzRank (M. Volkovs and R. Zemel, ICML 2009)
• BoltzRank is also based on permutation probability models.
• BoltzRank uses a scoring function composed of individual and pairwise potentials to define a conditional probability distribution, in the form of a Boltzmann distribution, over all permutations of the documents retrieved for a given query.
• The energy E(π | s) represents the lack of compatibility between the relative document orderings given by π and those given by s, with more positive energy indicating less compatibility.
• Using the energy function, the conditional Boltzmann distribution over document permutations is defined by exponentiating and normalizing.
• After defining the permutation probability in this way, one can follow the idea of ListMLE or ListNet to define the loss function.
• It is also possible to extend this idea to optimize evaluation measures: for example, if the evaluation measure is NDCG, its expected value with respect to all possible permutations can be computed.

Discussion of the Listwise Approach
• Advantages
  – Takes all the documents associated with the same query as the learning instance.
  – The rank position is visible to the loss function.
• Problems
  – Complexity issues.
  – The use of the position information is insufficient.

Outline
• Introduction
• Major Approaches to Learning to Rank
• Analysis of the Approaches
• Benchmark Evaluation of the Approaches
• Statistical Learning Theory for Ranking
• Summary and Outlook

Question
• Different loss functions are used by the different approaches, while the models learned with all the approaches are evaluated by IR measures.
• What is the relationship between the loss functions and the IR evaluation measures?

The Pointwise Approach
• Many pointwise loss functions can upper-bound (1 − NDCG).
• Minimization of these pointwise loss functions can therefore lead to the maximization of NDCG.
• For regression-based algorithms (D. Cossock and T. Zhang, COLT 2006), up to a constant factor:
  1 − NDCG(f) ≤ (1/Z_m) (Σ_{j=1}^{m} η(j)²)^{1/2} (Σ_{j=1}^{m} (f(x_j) − y_j)²)^{1/2}
• For multi-class classification based algorithms (P. Li, et al. NIPS 2007), a similar bound holds in which (1 − NDCG) is upper-bounded by (15/Z_m) times a discount-dependent factor and the square root of the number of classification errors Σ_{j=1}^{m} I{y_j ≠ ŷ_j}.
• The minimization of these losses is only a sufficient condition for minimizing (1 − NDCG).
  – Example: with ground-truth labels (x_1, 4), (x_2, 3), (x_3, 2), (x_4, 1) and predictions f(x_1)=3, f(x_2)=2, f(x_3)=1, f(x_4)=0 (Z_m ≈ 21.4), the induced ranking is unchanged, so 1 − NDCG(f) = 0; yet every label is predicted incorrectly (Σ_j I{y_j ≠ f(x_j)} = 4), and the classification-based bound evaluates to about 1.15, i.e., it is vacuous here.

Pairwise and Non-Measure-Specific Listwise Approaches (W. Chen, et al. NIPS 2009)
• Many pairwise and non-measure-specific listwise loss functions are surrogate functions, and are also upper bounds of a quantity named the "essential loss".
• The essential loss is an upper bound of (1 − NDCG) and (1 − MAP).
• The minimization of these surrogate loss functions can therefore lead to the maximization of NDCG and MAP.

Essential Loss for Ranking
• Model ranking as a sequence of classifications: for example, with ground-truth permutation A ▷ B ▷ C ▷ D and predicted ranking B ▷ A ▷ D ▷ C, at each step one checks whether the document that should be ranked next according to the ground truth is correctly selected among the remaining documents.

Essential Loss vs. Ranking Measures
1) Both (1 − NDCG) and (1 − MAP) are upper-bounded by the essential loss.
2) A zero value of the essential loss is a necessary and sufficient condition for zero values of (1 − NDCG) and (1 − MAP).

Essential Loss vs. Surrogate Losses
1) Many pairwise and listwise loss functions are upper bounds of the essential loss.
2) Therefore, the pairwise and listwise loss functions are also upper bounds of (1 − NDCG) and (1 − MAP).

Discussion
• Reveals that the different loss functions are upper bounds of the same quantity.
• Explains why many different pairwise and listwise algorithms perform similarly well.
• Suggests better algorithms:
  – the essential loss is a tighter upper bound of measure-based ranking errors than the existing surrogate loss functions;
  – experiments showed that higher ranking performance can be achieved by optimizing this tighter bound.

Measure-specific Listwise Approach
• To what degree is a surrogate measure correlated with the ranking measure?
• Can the maximization of a surrogate measure lead to the same ranking model that maximizes the ranking measure?
Tendency Correlation as a Tool (Y. He and T.-Y. Liu, IR Journal 2010)
• The "tendency" of a function with respect to two inputs is defined as the difference between the values of the function at the two inputs.
• Tendency correlation: find the best linear transformation which, when applied to one of the two functions, minimizes the maximum difference between the tendencies of the two functions over the entire input space.

Theoretical Justification of Tendency Correlation
• If the tendency correlation of a surrogate measure is arbitrarily strong, its optimization will lead to the ranking model that optimizes the corresponding ranking measure, on both the training and the test sets.

Major Results - SoftRank
1) When a parameter in SoftRank is set to be small, the tendency correlation of its surrogate measure becomes strong, regardless of the data distribution.
2) However, in this case the surrogate measure contains many local optima, so robust global optimization techniques are needed.

Major Results - SVM-MAP, etc.
1) The surrogate measures in SVM-MAP, SVM-NDCG, DORM, and PermuRank are concave and easy to optimize.
2) No matter how the parameters of these algorithms are set, for certain data distributions the tendency correlation of their surrogate measures can never be as strong as expected.

Summary
• With the tool of tendency correlation, the relationship between surrogate measures and the original ranking measures is revealed.
• The theorems explain previous empirical observations: some methods perform well across different datasets, while others fail on some datasets.
• The theorems provide a guideline for the trade-off between effectiveness and ease of optimization.

Outline
• Introduction
• Major Approaches to Learning to Rank
• Analysis of the Approaches
• Benchmark Evaluation of the Approaches
• Statistical Learning Theory for Ranking
• Summary and Outlook

Benchmark Datasets for Learning to Rank

Role of a Benchmark Dataset
• Enables researchers to focus on technologies instead of experimental preparation.
• Enables fair comparison among algorithms.
• Examples
  – Reuters and RCV-1 for text classification
  – UCI for general machine learning

The LETOR Collection (T.-Y. Liu, et al. LR4IR 2007)
• The first benchmark dataset for learning to rank
  – Document corpora
  – Document sampling
  – Feature extraction
  – Meta information

Document Corpora
• The ".gov" corpus
• The OHSUMED corpus (106 queries)

Document Sampling
• The ".gov" corpus: the top 1000 documents per query returned by BM25.
• The OHSUMED corpus: all judged documents.

Feature Extraction
• The ".gov" corpus: 64 features, including TF, IDF, BM25, LMIR, PageRank, HITS, relevance propagation, URL length, etc.
• The OHSUMED corpus: 40 features, including TF, IDF, BM25, LMIR, etc.

Meta Information
• Statistical information about the corpus
  – Number of documents, number of streams, and number of (unique) terms in each stream.
• Raw information about the documents associated with each query
  – Term frequency and document length.
• Relational information
  – Hyperlink graph, sitemap information, and similarity relationship matrices of the corpora.
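The LETOR-family datasets (and the later Yahoo! and Microsoft sets) are commonly distributed in an SVMlight-like text format, one query–document pair per line: `label qid:<id> 1:<v1> 2:<v2> ... # docid`. That format is an assumption here rather than something stated on the slides; the sketch below is a minimal parser for it.

```python
def parse_letor_line(line):
    """Parse one 'label qid:<id> f:v ... # docid' line into (label, qid, features)."""
    body = line.split("#", 1)[0].split()          # drop the trailing comment, if any
    label = int(body[0])
    qid = body[1].split(":", 1)[1]
    features = {int(k): float(v) for k, v in (tok.split(":", 1) for tok in body[2:])}
    return label, qid, features

example = "2 qid:10 1:0.37 2:0.00 3:1.00 # GX000-00-0000000"
print(parse_letor_line(example))
# -> (2, '10', {1: 0.37, 2: 0.0, 3: 1.0})
```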
The LETOR Website

New Benchmark Datasets
• The Yahoo! Challenge datasets
• The Microsoft Learning to Rank datasets
• Both are large-scale datasets retired from commercial search engines.

Yahoo! Challenge Datasets
• Released in March 2010.
• Accessible to participants of the challenge.
• About 30K queries, 700K documents, and 700 features per document.
• http://learningtorankchallenge.yahoo.com/

Microsoft Learning to Rank Datasets
• Released on June 16, 2010.
• Publicly available.
• More than 30K queries, 3.7M documents, and 136 features per document.
• http://research.microsoft.com/en-us/projects/mslr/

Evaluation Results on LETOR Datasets

Algorithms under Investigation
• The pointwise approach: linear regression.
• The pairwise approach: Ranking SVM, RankBoost, FRank.
• The listwise approach: ListNet; SVM-MAP, AdaRank.
• All use a linear ranking model (except RankBoost).

Experimental Results
[Per-dataset result charts on TD2003, TD2004, NP2003, NP2004, HP2003, HP2004, and OHSUMED, followed by a "winning numbers" chart comparing Regression, RankSVM, RankBoost, FRank, ListNet, AdaRank, and SVM-MAP on NDCG@1, NDCG@3, NDCG@10, P@1, P@3, P@10, and MAP.]

Discussion
• NDCG@1: {ListNet, AdaRank} > {SVM-MAP, RankSVM} > {RankBoost, FRank} > {Regression}
• NDCG@3, NDCG@10: {ListNet} > {AdaRank, SVM-MAP, RankSVM, RankBoost} > {FRank} > {Regression}
• MAP: {ListNet} > {AdaRank, SVM-MAP, RankSVM} > {RankBoost, FRank} > {Regression}
• The listwise approach has certain advantages, especially in correctly predicting the top-ranked documents.

Notes
• The experimental results can be found at the LETOR website: http://research.microsoft.com/~letor/
• The results are still preliminary:
  – the implementations of these algorithms are not perfect;
  – almost every algorithm can be further improved;
  – the datasets are relatively small.

Outline
• Introduction
• Major Approaches to Learning to Rank
• Analysis of the Approaches
• Benchmark Evaluation of the Approaches
• Statistical Learning Theory for Ranking
• Summary and Outlook

Why Theory?
• In practice, one can only observe experimental results on relatively small datasets. To some extent, such empirical results are not fully trustworthy, because
  – a small training set cannot fully realize the potential of a learning algorithm;
  – a small test set cannot reflect the true performance of an algorithm, since the query space is far too large to be represented by a small number of samples.
• Statistical learning theory for ranking analyzes the performance of an algorithm when the training data are infinite and the test data are truly randomly sampled.

Generalization Analysis
• In the training phase, one minimizes the empirical risk on the training data.
• In the test phase, one evaluates the expected risk of the ranking model on any sample.
• Generalization analysis is concerned with bounding the difference between the expected and empirical risks as the amount of training data approaches infinity.
• An algorithm is regarded as better than another if its empirical risk converges to the expected risk while the other's does not, or if its convergence rate is faster than that of the other.
• With different assumptions on the data, the empirical and expected risks are defined differently, and consequently different generalization bounds are obtained.

Statistical Ranking Frameworks
• Document ranking
  – Assume that all documents are i.i.d., no matter whether they are associated with the same query or not.
• Subset ranking
  – Regard the documents as deterministically given, and only regard the queries as i.i.d. random variables.
• Two-layer ranking
  – Assume that only the documents associated with the same query are i.i.d. (with a distribution that depends on the query), and that the queries are also i.i.d. in the query space.

Document Ranking Framework
• Covers the pointwise and pairwise approaches; the listwise approach cannot be explained within this framework.

Subset Ranking Framework
• Covers the pointwise, pairwise, and listwise approaches.

Two-Layer Ranking Framework
• Covers the pointwise and pairwise approaches; the listwise approach cannot be explained within this framework.

Two Kinds of Generalization Bounds
• Uniform generalization analysis
  – Reveals a bound on the generalization ability of any function in the function class under investigation.
• Algorithm-dependent generalization analysis
  – Only applicable to the model learned from the training data by a specific algorithm.

Uniform Generalization Bound for Document Ranking (Y. Freund, et al. JMLR 2003)
• W.r.t. the pairwise 0-1 loss, based on the VC dimension.
• Convergence rate: of order log(m)/m.

Uniform Generalization Bound for Document Ranking (S. Agarwal, et al. JMLR 2005)
• W.r.t. the pairwise 0-1 loss, based on the rank-shatter coefficient.
• Convergence rate: of order 1/m for linear scoring functions in a 1-D space, and of order log(m)/m for linear scoring functions in a d-D space.

Uniform Generalization Bound for Document Ranking (P. Bartlett and S. Mendelson, JMLR 2002)
• W.r.t. the pairwise 0-1 loss, for k-partite ranking, based on properties of U-statistics.
• Convergence rate: of order 1/m.

Uniform Generalization Bound for Document Ranking (S. Clemencon, et al. Annals of Statistics 2008)
• W.r.t. pairwise surrogate losses, based on the VC dimension.
• Convergence rate: of order 1/m.

Uniform Generalization Bound for Subset Ranking (Y. Lan, et al. ICML 2009)
• W.r.t. listwise surrogate losses, based on the Rademacher average.
• The generalization bound for ListMLE is much tighter than that for ListNet, especially when the length of the list is large.
• The generalization bound for ListMLE decreases monotonically, while that of ListNet increases monotonically, with respect to the length of the list.
• Convergence rate: of order 1/n in the number of training queries n.
• The linear transformation function is the best choice in terms of the generalization bound in most cases.

Uniform Generalization Bound for Two-Layer Ranking (W. Chen, et al. NIPS 2010)
• W.r.t. pairwise 0-1 and surrogate losses, based on the Rademacher average.
• Increasing the number of either queries or documents per query in the training data enhances the two-layer generalization ability.
• Only if the number of queries and the number of documents per query go to infinity simultaneously does the two-layer generalization bound uniformly converge.
• If there is only a budget to label C documents in total, there is an optimal trade-off between the number of training queries and the number of training documents per query.

Algorithm-dependent Generalization Bound for Document Ranking (S. Agarwal, et al. ALT 2008 and COLT 2005)
• W.r.t. pairwise surrogate losses, based on stability.
• For example, Ranking SVM is a stable algorithm, and its stability coefficient is β(m) = 16κ²/(λm), where K(x, x) ≤ κ² < ∞.

Algorithm-dependent Generalization Bound for Subset Ranking (Y. Lan, et al. ICML 2008)
• W.r.t. pairwise surrogate losses, based on query-level stability.
• For Ranking SVM, τ(n) = 4κ²/(λn); for IR-SVM, the coefficient takes a similar form with an additional factor determined by the per-query numbers of document pairs (through the query-level normalizer).
• The convergence rates of the two algorithms are both O(1/n).
• With a finite number of training queries, the bound for IR-SVM is much tighter than that for Ranking SVM.
Summary

Outline
• Introduction
• Major Approaches to Learning to Rank
• Analysis of the Approaches
• Benchmark Evaluation of the Approaches
• Statistical Learning Theory for Ranking
• Summary and Outlook

A Timeline of Learning to Rank (approximately by year)
• Activity: NIPS workshop on learning to rank; ICML workshop on structured output; SIGIR workshops on learning to rank for IR (LR4IR, 1st–3rd) and on diversity (1st–2nd); WWW, SIGIR, and ACL tutorials; IRJ special issue.
• Data: LETOR 1.0 and 2.0; LETOR 3.0; LETOR 4.0.
• Algorithms: Pranking, DMIR, Ranking SVM, RankBoost (before 2005); Gaussian processes for ordinal regression, RankNet (2005); Subset Ranking, IR-SVM, LambdaRank, P-norm push, Nested Ranker, RankCosine (2006); McRank, GBRank, MHR, FRank, ListNet, Supervised Rank Aggregation, RankGP, AdaRank, SVM-MAP (2007); IsoRank, QBRank, CDN Ranker, ListMLE, Relational Ranking, GlobalRank, Query-dependent Ranker, AppRank, PermuRank, SoftRank, G-SoftRank, SVM-NDCG (2008); OWA Rank, BoltzmanRank, Bayes-Luce, SmoothRank, robust sparse ranker, and more (2009).
• Theory: generalization of pairwise methods; consistency of pointwise methods; consistency of listwise methods; generalization of listwise methods.

Answer to Question 1
• In what respects are these algorithms similar and different, and what are their strengths and weaknesses?
  – Pointwise approach: modeled as regression / classification / ordinal regression. Pro: easy to leverage existing theories and algorithms. Con: an accurate score or category does not imply an accurate ranking; position information is invisible to the loss.
  – Pairwise approach: modeled as pairwise classification. Pro: easy to leverage existing theories and algorithms. Con: position information is still invisible to the loss.
  – Listwise approach: modeled as ranking. Pro: query-level and position based. Con: more complex; new theory needed.
  – Connection: the loss functions of the approaches can effectively minimize (1 − NDCG).

Answer to Question 2
• Empirically speaking, which of these many learning to rank algorithms performs the best?
  – LETOR makes it possible to perform fair comparisons among different learning to rank methods.
  – Empirical studies on LETOR have shown that the listwise ranking algorithms seem to have certain advantages over the other algorithms, especially in predicting the top-ranked documents, and that the pairwise ranking algorithms seem to outperform the pointwise algorithms.

Answer to Question 3
• What are the unique theoretical issues for ranking that should be investigated?
  – The unique properties of ranking for IR: evaluation is performed at the query level and is position based.
  – The risks should also be defined at the query level, and a new theoretical framework is needed for analyzing learning to rank methods.
  – The "true loss" for ranking should consider the position information in the ranking result, and is not as simple as the 0-1 loss in classification.

Answer to Question 4
• Are there remaining issues regarding learning to rank to study in the future?
  – Learning from logs
  – Feature engineering
  – Advanced ranking models
  – Ranking theory
  – Stability of ranking
  – Online learning
  – Large-scale learning to rank

Learning from Logs
• Data scale matters a lot!
  – Ground-truth mining from logs is one possible way to construct large-scale training data.
• Ground-truth mining may lose information,
  – e.g., user sessions, the frequency of clicking a certain document, the frequency of a certain click pattern, and the diversity of intentions across users.
• Directly learn from logs,
  – e.g., regard the click-through logs as the "ground truth", and define their likelihood as the loss function.

Feature Engineering
• Ranking is deeply rooted in IR
  – not just new algorithms and theories.
• Feature engineering is very important.
  – How can the knowledge of IR accumulated over the past half century be encoded in the features?
  – Currently, this kind of work has not been given enough attention.

Advanced Ranking Models
• Listwise ranking functions
• Generative ranking models
• Semi-supervised ranking
• Relational ranking
• Query-dependent ranking
• ...

Ranking Theory
• A more unified view on loss functions
• The true loss for ranking
• Statistical consistency for ranking
• Complexity of the ranking function class
• ...

References

Acknowledgement
• I would like to thank
  – my colleagues Tao Qin, Jun Xu, Hang Li, and Wei-Ying Ma;
  – our interns Zhe Cao, Ming-Feng Tsai, Xiubo Geng, Yuting Liu, Yanyan Lan, Fen Xia, Wei Chen, Jiang Bian, Yin He, Di He, and Wenkui Ding;
  – our internal collaborators Chris Burges, Michael Taylor, Stephen Robertson, Emine Yilmaz, and Ralf Herbrich;
  – our external collaborators Hsin-Hsi Chen, Hongyuan Zha, Zhiming Ma, Liwei Wang, Cheng Xiang Zhai, Thorsten Joachims, Olivier Chapelle, and Yi Chang.

[email protected]
http://research.microsoft.com/people/tyliu/
http://weibo.com/tieyanliu/