Distributed Data Management - Databases and Information Systems

Transcription

Information Retrieval and Data Mining
Summer Semester 2015
TU Kaiserslautern
Prof. Dr.-Ing. Sebastian Michel
Databases and Information Systems
Group (AG DBIS)
http://dbis.informatik.uni-kl.de/
Information Retrieval and Data Mining, SoSe 2015, S. Michel
Outlook and Announcement
• Outlook on this lecture and the next one:
– Continuation of probabilistic models
– Language models (= generative models)
• Then: new chapter on link analysis (PageRank, etc.)
• Then: indexing and querying, …, data mining, …
• Announcement for the exercises:
– Solutions to a sheet will be uploaded after the exercise session.
Access is restricted to the university network or available via login:
• Username: irdm
• Password:
Ranking Proportional to Relevance
Odds (cont’d)
…
• pt is the probability that term t
appears in a relevant document
• qt is the probability that term t
appears in an irrelevant document
Estimating pt and qt with
a Training Sample
• We can estimate pt and qt based on a training sample obtained by
evaluating the query q on a small sample of the corpus and
asking the user for relevance feedback about the results
• Let N be the # documents in our sample
R be the # relevant documents in our sample
nt be the # documents in our sample that contain t
rt be the # relevant documents in our sample that contain t
we estimate
pt = rt / R and qt = (nt − rt) / (N − R)
or, with Lidstone smoothing (λ = 0.5),
pt = (rt + 0.5) / (R + 1) and qt = (nt − rt + 0.5) / (N − R + 1)
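The estimates above can be sketched in a few lines of Python; the function name and the generalization to an arbitrary λ are illustrative, while the variable names follow the slide (N, R, nt, rt).

```python
# Sketch of the (smoothed) BIM estimates; lam=0.5 gives the Lidstone
# smoothing from the slide, lam=0 the plain maximum-likelihood estimates.
def estimate_pt_qt(N, R, nt, rt, lam=0.5):
    """Estimate p_t and q_t from relevance-feedback counts."""
    pt = (rt + lam) / (R + 2 * lam)           # P[t in doc | relevant]
    qt = (nt - rt + lam) / (N - R + 2 * lam)  # P[t in doc | irrelevant]
    return pt, qt
```

For example, with N = 10, R = 4, nt = 5, rt = 3 this yields pt = 3.5/5 = 0.7 and qt = 2.5/7 ≈ 0.36.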
More info on MLE for Binomial:
https://onlinecourses.science.psu.edu/stat504/node/28
Smoothing
(with Uniform Prior)
• Probabilities pt and qt for term t are estimated (by the
Maximum-Likelihood Estimate (MLE) for the Binomial
distribution), modeling
– repeated coin tosses for term t in relevant documents (pt)
– repeated coin tosses for term t in irrelevant documents (qt)
• Avoid overfitting to the training sample by smoothing
estimates
– Laplace smoothing (based on Laplace’s law of succession)
– Lidstone smoothing (heuristic generalization with λ > 0)
Estimating pt and qt without a
Training Sample
• When no training sample is available, we estimate pt and qt as
– pt = 0.5, reflecting that we have no information about relevant documents
– qt = dft / |D|, under the assumption that
# relevant documents <<< # documents
• When we plug in these estimates of pt and qt, we obtain the term weight
wt = log ( pt (1 − qt) / (qt (1 − pt)) ) = log ( (|D| − dft) / dft )
which can be seen as TF*IDF with binary term frequencies
and logarithmically dampened inverse document frequencies
3. Okapi BM25
• Generalizes the term weight
wt = log ( pt (1 − qt) / (qt (1 − pt)) )
into a weight where pi and qi denote the probability that the term
occurs i times in a relevant or irrelevant document, respectively
• Postulates Poisson (or 2-Poisson-mixture)
distributions for term frequencies
Okapi BM25 (cont’d)
• Reduces the number of parameters that have to be learned and
approximates the Poisson model by a similarly-shaped function
• Finally leads to Okapi BM25 as a state-of-the-art retrieval model
(with top-ranked results in TREC)
w_{t,d} = ( (k1 + 1) · tf_{t,d} ) / ( k1 · ( (1 − b) + b · |d| / avdl ) + tf_{t,d} ) · log ( (|D| − df_t + 0.5) / (df_t + 0.5) )
– k1 controls impact of term frequency (common choice k1 = 1.2)
– b controls impact of document length (common choice b = 0.75)
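The scoring function above can be written directly as a small Python function; parameter names mirror the formula (a minimal sketch, not tuned retrieval code).

```python
import math

def bm25_weight(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """BM25 term weight w_{t,d} as on the slide."""
    # RSJ-style IDF; note it goes negative once df > num_docs / 2
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    # saturating, length-normalized term-frequency component
    tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf)
    return tf_part * idf
```

A term that appears in more than half of the collection (df > |D|/2) gets a negative weight, which is exactly the effect discussed on the next slide.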
Okapi BM25 (Example)
• 3D plot of a simplified BM25 scoring function
using k1 = 1.2 as parameter
(DF axis mirrored for better readability)
• Scores for dft > |D|/2 are negative (is this a problem?)
Models Considering Term
Dependences
• So far, the models assume that terms appear
independently of one another.
• Instead, the upcoming models
– Tree Dependence Model
– Bayesian Network Model
do not assume independence of term appearances.
• Need to obtain the probability of term, term pair,
or term triplet, … appearances in relevant and
irrelevant documents, respectively.
• Then, the relevance odds can be computed as before.
4. Tree Dependence Model
• Considering term correlations in documents
(with binary random variables Xi)
requires estimating the m-dimensional probability
distribution P[X1, …, Xm]
• Tree dependence model [van Rijsbergen 1979]
– considers only 2-dimensional probabilities for term pairs (i, j)
– estimates for each (i, j) the error made by the independence assumption
– constructs a tree with terms as nodes and m−1 weighted
edges connecting the highest-error term pairs
– Assumption: each term depends on at most one other term
Two-Dimensional Term Correlations
• The Kullback-Leibler divergence estimates the error of
approximating f by g assuming pairwise term independence:
ε(f, g) = Σx f(x) log ( f(x) / g(x) )
• Alternatives: the correlation coefficient for term pairs,
or p-values of the Χ² test of independence
Kullback-Leibler Divergence (Example)
• Given are documents d1=(1,1), d2=(0,0), d3=(1,1), d4=(0,1)
• 2-dimensional probability distribution f:
f(1,1) = P[X1 = 1, X2 = 1] = 2/4
f(0,0) = 1/4, f(0,1) = 1/4, f(1,0) = 0
• 1-dimensional marginal distributions g1 and g2:
g1(1) = P[X1 = 1] = 2/4, g1(0) = 2/4
g2(1) = P[X2 = 1] = 3/4, g2(0) = 1/4
• 2-dimensional probability distribution g assuming independence:
g(1,1) = g1(1) g2(1) = 3/8
g(0,0) = 1/8, g(0,1) = 3/8, g(1,0) = 1/8
• approximation error ε (Kullback-Leibler divergence, base-2 logarithm):
ε = 2/4 log 4/3 + 1/4 log 2 + 1/4 log 2/3 + 0 ≈ 0.311
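The arithmetic in this example is easy to check in code (base-2 logarithm, with the convention that terms with f(x) = 0 contribute nothing):

```python
import math

# (f(x), g(x)) pairs for x in {(1,1), (0,0), (0,1), (1,0)} from the example
pairs = [(2/4, 3/8), (1/4, 1/8), (1/4, 3/8), (0, 1/8)]

def kl_divergence(pairs):
    """Kullback-Leibler divergence: sum_x f(x) * log2(f(x) / g(x))."""
    return sum(f * math.log2(f / g) for f, g in pairs if f > 0)
```

The result, ≈ 0.311 bits, matches the slide.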
Constructing the
Term Dependence Tree
• Input: complete graph (V, E) with m nodes Xi ∈ V
and m(m−1)/2 undirected edges (i, j) ∈ E with weights ε(i, j)
• Output: spanning tree (V, E’) with maximum total edge weight
• Algorithm:
– Sort the edges in descending order of weight
– E’ = ∅
– Repeat until |E’| = m−1:
• E’ = E’ ∪ {(i, j) ∈ E \ E’ | (i, j) has maximal weight
and E’ remains acyclic}
• Example: (figure) a complete weighted graph over the terms
web, net, surf, swim (edge weights 0.9, 0.7, 0.5, 0.3, 0.1, 0.1)
and the resulting term dependence tree
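The construction can be sketched as a Kruskal-style greedy algorithm over a union-find structure; the concrete assignment of weights to edges below is an assumption for illustration, since the slide's figure does not survive transcription.

```python
def max_spanning_tree(nodes, edges):
    """Kruskal-style: scan edges by descending weight, keep acyclic ones."""
    parent = {v: v for v in nodes}  # union-find forest

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    tree = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:              # edge does not close a cycle
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Assumed edge weights, for illustration only:
edges = [("web", "surf", 0.9), ("web", "net", 0.7), ("net", "surf", 0.5),
         ("web", "swim", 0.3), ("net", "swim", 0.1), ("surf", "swim", 0.1)]
tree = max_spanning_tree(["web", "net", "surf", "swim"], edges)
```

With these weights, the edge of weight 0.5 is skipped because it would close the cycle web–net–surf, and the tree keeps the edges of weight 0.9, 0.7, and 0.3.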
Estimation with
Term Dependence Tree
• Given a term dependence tree (V = {X1, …, Xm}, E’)
with preorder-labeled nodes (i.e., X1 is the root),
and assuming that Xi and Xj are independent for (i, j) ∉ E’,
the joint distribution factorizes as
P[X1, …, Xm] = P[X1] · Πj=2..m P[Xj | Xparent(j)]
• Example: (figure) a dependence tree over the terms web, net, surf, swim
Summary: Tree Dependence Model
• Uses (m−1) term dependences for m terms
• Which ones to take? Those with the highest “error” made by
the independence assumption
• Efficient to evaluate
• How to apply? Learned for relevant and irrelevant documents.
• Then, as in the BIM, the relevance odds are computed, with the
difference that each conditional probability is computed
given the “parent” term in the tree.
5. Bayesian Networks
• A Bayesian network (BN) is a directed, acyclic
graph (V, E) with
– vertices V representing random variables
– edges E representing dependencies
– for a root R ∈ V, the BN captures the prior probability P[R = …]
– for a vertex X ∈ V with parents(X) = {P1, …, Pk},
the BN captures the conditional probability P[X | P1, …, Pk]
– the vertex X is conditionally independent of a non-parent node Y
given its parents parents(X) = {P1, …, Pk}, i.e.:
P[X | P1, …, Pk, Y] = P[X | P1, …, Pk]
Bayesian Networks (Example)
• (figure) The classic example network: Cloudy has edges to
Sprinklers and Rain, which both have edges to Wet
Bayesian Networks (cont’d)
• We can determine any joint probability using the BN:
P[X1, …, Xn] = Πi P[Xi | parents(Xi)]
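For the Cloudy/Sprinklers/Rain/Wet network from the previous slide, this factorization can be evaluated directly. The conditional-probability-table numbers below are illustrative assumptions, not values from the lecture.

```python
# Joint probability in the Cloudy (C) / Sprinklers (S) / Rain (R) / Wet (W)
# network: P[C,S,R,W] = P[C] * P[S|C] * P[R|C] * P[W|S,R].
# All CPT entries are made-up illustration values.
p_c = {True: 0.5, False: 0.5}
p_s_given_c = {True: 0.1, False: 0.5}   # P[S=1 | C]
p_r_given_c = {True: 0.8, False: 0.2}   # P[R=1 | C]
p_w_given_sr = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}

def joint(c, s, r, w):
    """P[C=c, S=s, R=r, W=w] via the BN factorization."""
    pc = p_c[c]
    ps = p_s_given_c[c] if s else 1 - p_s_given_c[c]
    pr = p_r_given_c[c] if r else 1 - p_r_given_c[c]
    pw = p_w_given_sr[(s, r)] if w else 1 - p_w_given_sr[(s, r)]
    return pc * ps * pr * pw
```

Summing `joint` over all 16 assignments yields 1, as it must for a probability distribution.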
Some Math Recap
• Chain rule:
P[A, B] = P[A | B] · P[B]
we can rewrite
P[X1, …, Xn] = P[Xn | Xn−1, …, X1] · P[Xn−1, …, X1]
repeating this leads to
P[X1, …, Xn] = P[Xn | Xn−1, …, X1] · … · P[X2 | X1] · P[X1]
Example with 4 variables:
P[X4, X3, X2, X1] = P[X4 | X3, X2, X1] · P[X3 | X2, X1] · P[X2 | X1] · P[X1]
(example copied from Wikipedia)
Bayesian Networks for IR
• (figure) A network with document nodes d1, …, dj, …, dN,
term nodes t1, …, ti, tk, …, tM, and a query node q;
documents are connected to the terms they contain,
and the query is connected to its terms
Advanced Bayesian Networks for IR
• BNs are not widely adopted in IR due to challenges in
parameter estimation, representation, efficiency,
and practical effectiveness
• (figure) As before, but with an additional layer of
concept/topic nodes c1, …, cK between the
term nodes t1, …, tM and the query q
Summary of I.3
• Probabilistic IR as a family of (more) principled approaches
relying on generative models of documents as bags of words
• Probabilistic ranking principle as the foundation,
establishing that ranking documents by P[R | d, q] is optimal
• Binary independence model puts that principle into practice,
based on a multivariate Bernoulli model
• Smoothing to avoid overfitting to the training sample
• Okapi BM25 as a state-of-the-art retrieval model,
based on an approximation of a 2-Poisson mixture model
Additional Literature for I.3
• G. Salton, C. Buckley, and C. T. Yu: An Evaluation of Term Dependence Models
in Information Retrieval, SIGIR 1982
• F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell: “Is This
Document Relevant? ... Probably”: A Survey of Probabilistic Models in
Information Retrieval, ACM Computing Surveys 30(4):528-552, 1998
• S. E. Robertson, K. Spärck Jones: Relevance Weighting of Search Terms,
JASIS 27(3), 1976
• S. E. Robertson, S. Walker: Some Simple Effective Approximations to the
2-Poisson Model for Probabilistic Weighted Retrieval, SIGIR 1994
• T. Roelleke: Information Retrieval Models: Foundations and Relationships,
Morgan & Claypool Publishers, 2013
• K. Spärck Jones, S. Walker, S. E. Robertson: A probabilistic model of
information retrieval: development and comparative experiments, IP&M
36:779-840, 2000
• C. J. van Rijsbergen: Information Retrieval, University of Glasgow, 1979,
http://www.dcs.gla.ac.uk/Keith/Preface.html
I.4 Statistical Language Models
1. Basics of Statistical Language Models
2. Query-Likelihood Approaches
3. Smoothing Methods
4. Novelty and Diversity
Based on MRS Chapter 12 and [Zhai 2008]
1. Basics of Statistical Language
Models
• Statistical language models (LMs) are
generative models of word sequences
(or bags of words, sets of words, etc.)
Illustration: Finite Automata
• Consider the following finite automaton that
outputs individual terms with certain probabilities:
(figure) The automaton generates one term per step, with probabilities
dog : 0.5
cat : 0.4
hog : 0.1
for generating a term; after each term it continues
with probability 0.9 or stops with probability 0.1
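The generation probability of a term sequence under this automaton can be computed as follows; the exact convention (continue, emit one term, finally stop) is an assumption about how the figure is meant to be read.

```python
# One-state LM: per step, continue with 0.9, emit a term, and stop with 0.1.
TERM_PROBS = {"dog": 0.5, "cat": 0.4, "hog": 0.1}

def generation_prob(sequence, p_continue=0.9):
    """Probability of generating `sequence` and then stopping."""
    p = 1.0
    for term in sequence:
        p *= p_continue * TERM_PROBS[term]
    return p * (1 - p_continue)  # stop after the last term
```

For example, generating "cat dog" has probability 0.9·0.4 · 0.9·0.5 · 0.1 = 0.0162 under this convention.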
Language Models: Intuition
• How can you come up with a good query?
• Think of terms that would likely appear in a
relevant document!
• Intuition behind language models: a document d
is a good match for a query q if d’s model Md is
likely to generate the query (i.e., P[q | Md] is high)
• Essentially, an LM is about finding a probability
measure for the strings generated from a model.
Example (from MRS Chapter 12)

Model M1            Model M2
the      0.2        the      0.15
a        0.1        a        0.12
frog     0.01       frog     0.0002
toad     0.01       toad     0.0001
said     0.03       said     0.03
likes    0.02       likes    0.04
that     0.04       that     0.04
dog      0.005      dog      0.01
cat      0.003      cat      0.015
monkey   0.001      monkey   0.002
…                   …

For the string s = “frog said that toad likes that dog”:

term     M1         M2
frog     0.01       0.0002
said     0.03       0.03
that     0.04       0.04
toad     0.01       0.0001
likes    0.02       0.04
that     0.04       0.04
dog      0.005      0.01

P(s|M1) = 4.8E-13 and P(s|M2) = 3.84E-16,
i.e., P(s|M1) > P(s|M2)
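The two likelihoods can be reproduced by multiplying the per-term probabilities from the tables above:

```python
# Unigram models from the example (MRS Chapter 12)
M1 = {"the": 0.2, "a": 0.1, "frog": 0.01, "toad": 0.01, "said": 0.03,
      "likes": 0.02, "that": 0.04, "dog": 0.005, "cat": 0.003, "monkey": 0.001}
M2 = {"the": 0.15, "a": 0.12, "frog": 0.0002, "toad": 0.0001, "said": 0.03,
      "likes": 0.04, "that": 0.04, "dog": 0.01, "cat": 0.015, "monkey": 0.002}

def sequence_likelihood(model, terms):
    """Unigram LM: terms are generated independently, so P(s|M)
    is the product of the per-term probabilities."""
    p = 1.0
    for t in terms:
        p *= model[t]
    return p

s = "frog said that toad likes that dog".split()
```

Evaluating both models on s confirms P(s|M1) = 4.8E-13 > P(s|M2) = 3.84E-16.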
Statistical Language Models:
Application Examples
• Speech recognition, to select among multiple
phonetically similar sentences (“get up at 8 o’clock”
vs. “get a potato clock”)
• Statistical machine translation, to select among
multiple candidate translations
(“logical closing” vs. “logical reasoning”)
• Information retrieval, to rank documents in
response to a query
Types of Language Models
• Unigram LM: based only on single words (unigrams),
considers no context, and assumes independent
generation of words: P[t1 … tn] = Πi P[ti]
• Bigram LM: conditions on the preceding term:
P[t1 … tn] = P[t1] · Πi>1 P[ti | ti−1]
• n-Gram LM: conditions on the preceding (n−1) terms
Parameter Estimation
• Parameters (e.g., P(ti), P(ti | ti-1)) of language
model θ are estimated based on a sample of
documents (or for each document), which are
assumed to have been generated by θ
Example
• Unigram language models θSports and θPolitics are estimated
from samples of documents about sports and politics and can,
in turn, generate further documents:

θSports              θPolitics
soccer    0.20       party     0.20
goal      0.15       debate    0.20
tennis    0.10       scandal   0.15
player    0.05       election  0.05
:         :          :         :
Probabilistic IR vs. Statistical Language Models
• “User finds document d relevant to query q”
• Probabilistic IR ranks according to the relevance odds
• Statistical LMs rank according to the query likelihood
2. Query-Likelihood Approaches
• (figure) Document language models θd1 (apple 0.20, pie 0.15, …)
and θd2 (cake 0.20, apple 0.15, …), from which documents d1 and d2
are sampled; the query q is matched against these models
• P(q|d) is the likelihood that the query q was generated by
the language model θd estimated from document d
• Intuition:
– The user formulates the query q by selecting words from a
prototype document
– Which document is “closest” to that prototype document?
Multi-Bernoulli LM
• Query q is seen as a set of terms and is generated
from document d by tossing a coin for every word
from the vocabulary V:
P[q | d] = Πt ∈ q P[t | d] · Πt ∈ V \ q (1 − P[t | d])
with estimates for P[t | d] derived from d
[Ponte and Croft ’98] pioneered the use of LMs in IR
Multinomial LM
• Query q is seen as a bag of terms and is generated from
document d by drawing terms from the bag of terms
corresponding to d:
P[q | d] ∝ Πt ∈ q P[t | d]^tf(t,q)
• The multinomial LM is more expressive than the
multi-Bernoulli LM and is therefore usually preferred
Multinomial LM (cont’d)
• The maximum-likelihood estimate for the parameters P(ti|d),
P[t | d] = tf(t,d) / |d|,
is prone to overfitting and leads to
– a bias in favor of short documents / against long documents
– conjunctive query semantics, i.e., the query cannot be
generated from the language model of a document that
misses one of the query terms
3. Smoothing
• Smoothing methods avoid overfitting to the sample
(often: one document) and are essential for LMs to work
in practice
– Laplace smoothing (cf. Chapter I.3)
– Absolute discounting
– Jelinek-Mercer smoothing
– Dirichlet smoothing
– Good-Turing smoothing
– Katz’s back-off model
– …
• The choice of smoothing method and parameter settings is still
mostly “black art” (or empirical, i.e., based on training data)
Jelinek-Mercer Smoothing
• Uses a linear combination (mixture) of the document language
model θd and the document-collection language model θD:
P[t | d] = λ · P[t | θd] + (1 − λ) · P[t | θD]
with document D as the concatenation of the entire document collection
• Parameter λ can be tuned by cross-validation with held-out data:
– divide the set of relevant (q, d) pairs into n partitions
– build the LM on the pairs from n−1 partitions
– choose λ to maximize precision (or recall or F1) on the held-out partition
– iterate with a different choice of the n-th partition and average
• Parameter λ can be made document- or term-dependent
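A sketch of query scoring with Jelinek-Mercer smoothing (multinomial query log-likelihood; the function and argument names are illustrative):

```python
import math
from collections import Counter

def jm_score(query_terms, doc_terms, collection_terms, lam=0.5):
    """Query log-likelihood under Jelinek-Mercer smoothing:
    P[t|d] = lam * P[t|theta_d] + (1 - lam) * P[t|theta_D]."""
    d = Counter(doc_terms)
    D = Counter(collection_terms)
    score = 0.0
    for t in query_terms:
        p_d = d[t] / len(doc_terms)              # document LM
        p_D = D[t] / len(collection_terms)       # collection LM
        score += math.log(lam * p_d + (1 - lam) * p_D)
    return score
```

Note that a query term missing from the document no longer zeroes out the score, as long as the term occurs somewhere in the collection; this is exactly the cure for the conjunctive query semantics of the unsmoothed MLE.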
Jelinek-Mercer Smoothing vs. TF*IDF
• In the smoothed query likelihood, one factor behaves like TF
and the other like IDF
• (Jelinek-Mercer) smoothing has an effect similar to IDF weighting
• Jelinek-Mercer smoothing leads to a TF*IDF-style model
Dirichlet-Prior Smoothing
• Uses Bayesian estimation with a conjugate Dirichlet prior
instead of the maximum-likelihood estimation:
P[t | d] = ( tf(t,d) + α · P[t | θD] ) / ( |d| + α )
• Intuition: document d is extended by α terms generated
by the document-collection language model
• Parameter α is usually set as a multiple of the average
document length
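A corresponding sketch for the Dirichlet-smoothed estimate (function name and toy counts are illustrative):

```python
from collections import Counter

def dirichlet_p(term, doc_terms, collection_terms, alpha=1000):
    """Dirichlet-smoothed P[t|d] = (tf(t,d) + alpha * P[t|theta_D]) / (|d| + alpha)."""
    tf = Counter(doc_terms)[term]
    p_coll = Counter(collection_terms)[term] / len(collection_terms)
    return (tf + alpha * p_coll) / (len(doc_terms) + alpha)
```

With α = 0 this reduces to the maximum-likelihood estimate, and for any α the smoothed probabilities still sum to 1 over the vocabulary.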
Dirichlet Smoothing vs. Jelinek-Mercer Smoothing
• Jelinek-Mercer smoothing with a document-dependent
λ(d) = |d| / (|d| + α) becomes a special case of Dirichlet smoothing
4. Novelty & Diversity
• Retrieval models seen so far (e.g., TF*IDF, LMs) assume that the
relevance of documents is independent of each other
• Problem: not a very realistic assumption in practice, due to
(near-)duplicate documents (e.g., articles about the same event)
• Objective: make sure that the user sees novel (i.e., non-redundant)
information with every additional result inspected
• Queries are often ambiguous (e.g., jaguar), with multiple
different information needs behind them (e.g., car, cat, OS)
• Objective: make sure that the user sees diverse results that cover
many of the information needs possibly behind the query
Maximum Marginal Relevance (MMR)
• Intuition: the next returned result di should be relevant to the
query but also different from the already returned results
d1, …, di−1:
mmr(q, di) = λ · sim(q, di) − (1 − λ) · max j<i sim(di, dj)
with tunable parameter λ and similarity measure sim(·, ·)
• Usually implemented as re-ranking of top-k query results
• Example: initial result, ranked by similarity to the query:
sim(q,d1) = 0.9, sim(q,d2) = 0.8, sim(q,d3) = 0.7,
sim(q,d4) = 0.6, sim(q,d5) = 0.5
Final result after MMR re-ranking:
mmr(q,d1) = 0.45, mmr(q,d3) = 0.35, mmr(q,d5) = 0.25,
mmr(q,d2) = −0.10, mmr(q,d4) = −0.20
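Greedy MMR re-ranking can be sketched as follows; the similarity values in the test of this sketch are made up to show the intended effect (a near-duplicate of an already-selected result gets pushed down).

```python
def mmr_rerank(candidates, sim_q, sim_dd, lam=0.5):
    """Greedily pick the next document maximizing
    lam * sim(q, d) - (1 - lam) * max over selected d' of sim(d, d').
    sim_q maps doc -> sim(q, d); sim_dd maps frozenset({d, d'}) -> sim(d, d')."""
    selected, remaining = [], list(candidates)
    while remaining:
        def mmr(d):
            # redundancy penalty w.r.t. already selected results
            redundancy = max((sim_dd[frozenset((d, s))] for s in selected),
                             default=0.0)
            return lam * sim_q[d] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With λ = 1 the penalty vanishes and MMR degenerates to plain similarity ranking; smaller λ trades relevance for novelty.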
• Full details: [Carbonell and Goldstein ’98]
Summary of I.4
• Statistical language models are widely used in
natural-language applications other than IR
• Query-likelihood approaches see the query as a
sample from the document LM
• Smoothing methods are absolutely essential to
make LMs work in practice
• Brief discussion of one approach for novelty & diversity
Additional Literature for I.4
• D. Hiemstra: Using Language Models for Information Retrieval, Ph.D. Thesis,
University of Twente, 2001
• M. Federico and N. Bertoldi: Statistical Cross-Language Information Retrieval
using N-Best Query Translations, SIGIR 2001
• Z. Nie, Y. Ma, S. Shi, J.-R. Wen and W.-Y. Ma: Web Object Retrieval,
WWW 2007
• H. M. Peetz and M. de Rijke: Cognitive Temporal Document Priors,
ECIR 2013
• J. M. Ponte and B. Croft: A Language Modeling Approach to Information
Retrieval, SIGIR 1998
• C. Zhai and J. Lafferty: Model-based Feedback in the Language Modeling
Approach for Information Retrieval, CIKM 2001
• C. Zhai: Statistical Language Models for Information Retrieval A Critical
Review, Foundations and Trends in Information Retrieval 2(3):137-213, 2008
• R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong: Diversifying Search
Results, WSDM 2009
• J. G. Carbonell and J. Goldstein: The Use of MMR, Diversity-Based Reranking
for Reordering Documents and Producing Summaries, SIGIR 1998