Exploiting Temporal References in Information Retrieval

Exploiting Temporal References in Text Retrieval
Irem Arikan
advised by: Srikanta Bedathur, Klaus Berberich
Motivation
• Users’ information needs often have a temporal dimension, but traditional information retrieval systems do not exploit the temporal content in documents.
• Query: PM United Kingdom 2000
• The search engine is not aware that 2000 is actually mentioned implicitly by the document!
• Proposed: an approach which recognizes and exploits temporal references in documents to yield better search results
Example Temporal Queries
Broad Queries
• British colony 17th century
• Economic situation Germany 1920s
• President assassination 1950 – 2000
Specific Queries
• US president October 1962
• Pope 1940s
• Academy awards best actor 1975
Ambiguous Queries
• George Bush 1990 vs. George Bush 2007
• Gulf war 1991 vs. Gulf war 2005
Outline
• Language Modeling for Information Retrieval
• Time Modeling for Temporal Information Retrieval
• Combining Text Relevance with Temporal Relevance
• Experimental Results
Language Modeling for Information Retrieval
Language Model: a statistical model to generate text
Language Modeling: the task of estimating the statistical parameters of a language model
Language Modeling for IR: the problem of estimating the likelihood that a query and a document could have been generated by the same language model
• In practical IR approaches: Unigram Language Model
  • words occur independently
Language Modeling for IR
1) document: a sample from a language model
  • assume an underlying multinomial probability distribution over words for each document
  • estimate statistics of this distribution, i.e. infer the document model $M_d$: $P[\text{word} \mid M_d]$
2) estimate the likelihood that the query is generated by this distribution:
  $P(q \mid d) = \prod_{t \in q} P(t \mid d)$
3) rank the documents by $P(q \mid d)$
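The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the toy documents are hypothetical, and it uses an unsmoothed maximum-likelihood estimate, so any query term missing from a document drives the likelihood to zero. Practical systems usually add smoothing, which the slides do not discuss.

```python
import math
from collections import Counter

def unigram_model(document_tokens):
    """Maximum-likelihood estimate of P(word | M_d) from one document."""
    counts = Counter(document_tokens)
    total = len(document_tokens)
    return {word: count / total for word, count in counts.items()}

def query_log_likelihood(query_tokens, model):
    """log P(q | d) = sum over query terms t of log P(t | d).

    Returns -inf when a query term does not occur in the document
    (assumption: no smoothing is applied in this sketch).
    """
    log_p = 0.0
    for term in query_tokens:
        p = model.get(term, 0.0)
        if p == 0.0:
            return float("-inf")
        log_p += math.log(p)
    return log_p

def rank(query_tokens, documents):
    """Return document ids sorted by decreasing query likelihood."""
    scores = {
        doc_id: query_log_likelihood(query_tokens, unigram_model(tokens))
        for doc_id, tokens in documents.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy collection, for illustration only.
docs = {
    "d1": ["prime", "minister", "united", "kingdom"],
    "d2": ["prime", "number", "theory", "prime"],
}
ranking = rank(["prime", "minister"], docs)
```

Working in log space avoids numerical underflow when the product runs over many query terms.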
Temporal Modeling for Temporal Retrieval
General approach
• similar to LM approach
• based on a generative model which generates temporal references (temporal model)
• splits query into 2 parts: text query and temporal query
Probabilistic mechanism for producing temporal content of the document
• each time reference generated by a different generative temporal model $M_i^t$, $i = 1, \dots, n$
• for generating a time reference:
  1) first choose a temporal model
  2) then generate a time reference using this temporal model
Temporal Modeling
Estimating temporal query likelihood
• Infer a temporal model $M_i^t$, $i = 1, \dots, n$, from each temporal reference in the document
• Estimate the likelihood $P(q^t \mid d^t)$ that the temporal query is generated by one of the models which generated the temporal content of the document
Temporal query generation probability
$P(q^t \mid d^t) = \frac{1}{n} \sum_{i=1}^{n} P(q^t \mid M_i^t)$
$P(q \mid M_i^t) = ?$
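The temporal query generation probability is a uniform mixture over the per-reference models, which can be sketched as follows. The function name is hypothetical, each temporal model is represented simply as a callable returning $P(q \mid M_i^t)$, and returning 0.0 for a document without temporal references is an assumption of this sketch, not something the slides state.

```python
def temporal_query_likelihood(temporal_query, temporal_models):
    """P(q^t | d^t) = (1/n) * sum_i P(q^t | M_i^t).

    `temporal_models` is a list of callables, one per temporal reference
    found in the document; each maps a temporal query to a likelihood.
    Assumption: a document with no temporal references scores 0.0.
    """
    n = len(temporal_models)
    if n == 0:
        return 0.0
    return sum(model(temporal_query) for model in temporal_models) / n

# Hypothetical stand-in models with fixed likelihoods, for illustration only.
toy_models = [lambda q: 0.2, lambda q: 0.4]
toy_score = temporal_query_likelihood((1980, 1990), toy_models)
```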
Temporal Modeling
What is a temporal model?
$P(q \mid M_i^t) = ?$
• A probabilistic model to generate temporal references
• What kind of distribution?
• How can we estimate its parameters?
Formalize the problem in a goal-oriented way:
• We should infer a temporal model from each time interval (sample time interval)
• This temporal model should be able to generate all time intervals which are relevant to the sample interval
1. Approach
Assumptions:
• two time intervals are only relevant to each other if they intersect
• the generative model inferred should be able to produce subintervals, superintervals, and overlapping intervals of the interval in the document
• probability of generating an intersecting time interval should be proportional to the length of intersection
Example:
• query: 1980 – 1990
• 1980 – 1989 is more relevant than 23 March 1984
[Figure: document interval [s, e] on the time axis t, with a left-overlapping interval (lOverlap), a right-overlapping interval (rOverlap), superintervals (sup1, sup2) and subintervals (sub1, sub2)]
Appropriate probabilistic model:
• 2 underlying triangular distributions
  • one for start, $P_s(x)$
  • one for end, $P_e(x)$
$M_i^t = \{ P_s(x), P_e(x) \}$
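The intuition behind the third assumption, that relevance grows with the length of the intersection, can be checked on the slide's own example. The sketch below is illustrative only: the function name is hypothetical, and it assumes that a year expression covers 1 January to 31 December and that intervals are closed and measured in days.

```python
from datetime import date

def intersection_days(interval_a, interval_b):
    """Number of days shared by two closed date intervals (0 if disjoint)."""
    start = max(interval_a[0], interval_b[0])
    end = min(interval_a[1], interval_b[1])
    if start > end:
        return 0
    return (end - start).days + 1

# Assumption: "1980 - 1990" is read as 1 Jan 1980 to 31 Dec 1990.
query = (date(1980, 1, 1), date(1990, 12, 31))
decade_ref = (date(1980, 1, 1), date(1989, 12, 31))  # "1980 - 1989"
day_ref = (date(1984, 3, 23), date(1984, 3, 23))     # "23 March 1984"

decade_overlap = intersection_days(query, decade_ref)
day_overlap = intersection_days(query, day_ref)
```

Under these assumptions the reference 1980 – 1989 shares far more days with the query than the single day does, which matches the ordering stated on the slide.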
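Triangular Distribution
$f(x \mid a, b, c) = \begin{cases} \dfrac{2(x - a)}{(b - a)(c - a)} & \text{for } a \le x \le c \\ \dfrac{2(b - x)}{(b - a)(b - c)} & \text{for } c \le x \le b \\ 0 & \text{for any other case} \end{cases}$
Parameters
• $a$: $a \in (-\infty, \infty)$
• $b$: $b > a$
• $c$: $a \le c \le b$
Support
• $a \le x \le b$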
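The triangular density can be evaluated directly from its definition. This is a minimal sketch with a hypothetical function name; the explicit branch for x equal to the mode c is added here to avoid division by zero in the degenerate cases c = a and c = b.

```python
def triangular_pdf(x, a, b, c):
    """Density of the triangular distribution with lower limit a,
    upper limit b and mode c (requires a < b and a <= c <= b)."""
    if not (a < b and a <= c <= b):
        raise ValueError("parameters must satisfy a < b and a <= c <= b")
    if x < a or x > b:
        return 0.0
    if x == c:
        # Peak value; handled separately so c == a or c == b is safe.
        return 2.0 / (b - a)
    if x < c:
        return 2.0 * (x - a) / ((b - a) * (c - a))
    return 2.0 * (b - x) / ((b - a) * (b - c))

peak = triangular_pdf(1.0, 0.0, 4.0, 1.0)
rising = triangular_pdf(0.5, 0.0, 4.0, 1.0)
falling = triangular_pdf(2.5, 0.0, 4.0, 1.0)
```

The peak height is always 2 / (b − a), so the area under the triangle is 1 regardless of where the mode lies.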
1. Approach
[Figure: triangular densities $P_s(x)$ and $P_e(y)$ over the time axis, with regions r1 to r4 and axis marks l, s, $q_s - 1$, e, e + 1, u]
• nonzero probability for intersecting intervals
  • r1 – r3: left overlaps
  • r1 – r4: superintervals
  • r2 – r3: subintervals
  • r2 – r4: right overlaps
• interval [s, e] has the highest probability
• probability decreases to the left and right, resulting in lower probability for intervals which have smaller intersection lengths
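1. Approach
[Figure: the same densities $P_s(x)$ and $P_e(x)$ with regions r1 to r4 and axis marks l, s, $q_s - 1$, e, e + 1, u]
$M_i^t = \{ P_s(x), P_e(x) \}$
$q = \{ q_s, q_e \}$
$P(q \mid M_i^t) = ?$
$P(q \mid M_i^t) = P(q_s, q_e \mid M_i^t) = P(q_s) \cdot P(q_e \mid q_s)$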
2. Approach
Assumptions:
• Two time intervals are only relevant to each other if they are positioned closely to each other on the time axis and have similar lengths
  • | start1 – start2 | < a
  • | length1 – length2 | < b
• The generative model inferred should be able to produce temporal intervals in some neighbourhood on the time axis
[Figure: interval with start s and length l on the time axis t, with tolerances ∆s and ∆l]
2. Approach
[Figure: triangular density $P_s(x)$ over [s − a, s + a] peaking at s, and triangular density $P_l(y)$ over [l − b, l + b] peaking at l]
• Temporal interval x = s, y = l has the highest probability
• Probability decreases as start point moves away from s and as length moves away from l
2. Approach
[Figure: the same densities $P_s(x)$ over [s − a, s + a] and $P_l(x)$ over [l − b, l + b]]
$M_i^t = \{ P_s(x), P_l(x) \}$
$q = \{ q_s, q_l \}$
$P(q \mid M_i^t) = ?$
$P(q \mid M_i^t) = P(q_s, q_l \mid M_i^t) = P(q_s) \cdot P(q_l)$
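A minimal sketch of the second approach follows. It assumes, based on the axis labels of the figure, that both densities are symmetric triangles: one over [s − a, s + a] with its mode at s, the other over [l − b, l + b] with its mode at l. The function names, the (start, length) tuple representation and the example values are hypothetical.

```python
def symmetric_triangular(x, center, half_width):
    """Triangular density peaking at `center` and falling linearly to zero
    at center - half_width and center + half_width."""
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    distance = abs(x - center)
    if distance >= half_width:
        return 0.0
    return (1.0 - distance / half_width) / half_width

def approach2_likelihood(query, reference, a, b):
    """P(q | M_i^t) = P(q_s) * P(q_l) for intervals given as (start, length).

    `reference` is the interval found in the document; `a` and `b` are the
    tolerances on start point and length (assumed symmetric in this sketch).
    """
    q_start, q_length = query
    s, l = reference
    return symmetric_triangular(q_start, s, a) * symmetric_triangular(q_length, l, b)

# Hypothetical example: start points in years, lengths in years.
exact = approach2_likelihood((1980, 10), (1980, 10), a=5, b=4)
shifted = approach2_likelihood((1982.5, 10), (1980, 10), a=5, b=4)
too_far = approach2_likelihood((1990, 10), (1980, 10), a=5, b=4)
```

An identical interval gets the highest likelihood, a shifted start lowers it, and a start point at least a away from s yields zero, in line with the two bullet points on the previous slide.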
Combining Text Relevance with Temporal Relevance
$score(q, d) = score_w(q, d) \cdot score_t(q, d)$
Text relevance: $P(q \mid M_d^w)$
Temporal relevance: $P(q \mid M_d^t)$
• Filter and re-rank search results by weighting text relevance score by temporal relevance
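The filter-and-re-rank step can be sketched as below. The function name, document ids and scores are hypothetical; treating a document without a temporal score as having temporal relevance 0.0, and dropping documents whose combined score is zero, is one possible reading of "filter" rather than a detail given on the slide.

```python
def rerank(text_scores, temporal_scores):
    """Combine scores as score_w * score_t, drop documents whose combined
    score is zero, and return (doc_id, score) pairs in descending order."""
    combined = {}
    for doc_id, text_score in text_scores.items():
        # Assumption: documents with no temporal score get 0.0.
        score = text_score * temporal_scores.get(doc_id, 0.0)
        if score > 0.0:
            combined[doc_id] = score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores, for illustration only.
text_scores = {"d1": 0.5, "d2": 0.4, "d3": 0.9}
temporal_scores = {"d1": 0.2, "d2": 0.5}
reranked_ids = [doc_id for doc_id, _ in rerank(text_scores, temporal_scores)]
```

In this toy example d3 has the best text score but no temporal match, so it is filtered out, and d2 overtakes d1 because of its higher temporal relevance.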
System Architecture
Information Retrieval (IR) with Temporal Extension
[Diagram: Query → IR System (backed by Index) → Result Set; Temporal Query and Result Set → Temporal Retrieval (backed by Index for temporal references) → Result Set]
Experimental Results-1
Query: Spanish painter 18th century
Terrier | Boolean | Our Method
Art_in_Puerto_Rico | Agustín_Esteve | José_del_Castillo
Spanish_art | Acislo_Antonio_Palomino_de_Castro_y_Velasco | Agustín_Esteve
Palazzo_Bianco_(Genoa) | Alvarez | Roybal
Caprichos | Agostino_Scilla_00e6 | Maldonado
List_of_people_from_Antwerp | Bassano | Luis_Egidio_Meléndez
Experimental Results-2
Query: Chancellor Germany 1955
Terrier | Boolean | Our Method
Federal_Minister_for_Special_Affairs_of_Germany | Basic_Law_for_the_Federal_Republic_of_Germany | Occupation_statute
Otto_Gessler | Bonn-Paris_conventions | Second_German_Bundestag
Bonn-Paris_conventions | Bavaria_Party | West_Germany
Occupation_statute | All-German_Bloc_League_of_Expellees_and_Deprived_of_Rights | Bonn-Paris_conventions
Petersberg_Agreement | Anschluss | Konrad_Adenauer
Experimental Results-3
Query: George Bush 1990
Terrier | Boolean | Our Method
George_W._Bush_insider_trading_allegations | Bush_family | President_Bush
Bush_family | Bush_administration | Bush_administration
Early_life_of_George_W._Bush | Andrew_Card | President's Council of Advisors on Science and Technology
George_H._W._Bush | Approval_rating | George_H._W._Bush
C_Boyden_Gray | Brent_Scowcroft | Arbusto_Energy
Thanks!