SPARK: Top-k Keyword Query in Relational Database
Wei Wang
University of New South Wales, Australia
20/03/2007

Outline
• Demo & Introduction
• Ranking
• Query Evaluation
• Conclusions
Demo

Demo …
SPARK I
Searching, Probing & Ranking Top-k Results
• Thesis project (2004 – 2005)
• Taste of Research Summer Scholarship (2005)
• Finally, CISRA prize winner
• http://www.computing.unsw.edu.au/softwareengineering.php

SPARK II
Continued as a research project with PhD student Yi Luo
• 2005 – 2006
• SIGMOD 2007 paper
• trying VLDB 2007 Demo now!
A Motivating Example

A Motivating Example …
Top-3 results in our system:
1. Movies: "Primetime Glick" (2001) Tom Hanks/Ben Stiller (#2.1)
2. Movies: "Primetime Glick" (2001) Tom Hanks/Ben Stiller (#2.1)
   ⋈ ActorPlay: Character = Himself
   ⋈ Actors: Hanks, Tom
3. Actors: John Hanks
   ⋈ ActorPlay: Character = Alexander Kerst
   ⋈ Movies: Rosamunde Pilcher - Wind über dem Fluss (2001)
Preliminaries
• Data Model: relation-based
• Query Model: joined tuple trees (JTTs)
• Sophisticated ranking
  • addresses one flaw in previous approaches
  • unifies AND and OR semantics
  • alternative size normalization

Improving the Effectiveness
Three factors are considered to contribute to the final score of a search result (joined tuple tree):
• the (modified) IR ranking score
• the completeness factor
• the size normalization factor
Problems with DISCOVER2
Each tuple is scored with a standard IR formula, and a result's score is the sum of its tuples' scores:

  score(T) = Σ_{t ∈ Q∩T} [ (1 + ln(1 + ln tf)) / ((1 − s) + s · dl/avdl) ] · qtf · ln((N+1)/df)

Example (query "netvista maxtor"):

  result    score(ci)  score(pj)  score  signature  SPARK
  c1 ⋈ p1   1.0        1.0        2.0    (1, 1)     0.98
  c2 ⋈ p2   1.0        1.0        2.0    (0, 2)     0.44

Both results receive the same DISCOVER2 score, although c1 ⋈ p1 matches both keywords while c2 ⋈ p2 matches only "netvista" (twice).

Virtual Document
Combine tf contributions before tf normalization / attenuation, i.e., score the whole joined tuple tree as one virtual document D:

  score_a = Σ_{t ∈ Q∩D} (1 + ln(1 + ln tf)) · ln((N+1)/df)

  result    score(maxtor)  score(netvista)  score_a
  c1 ⋈ p1   1.00           1.00             2.00
  c2 ⋈ p2   0.00           1.53             1.53
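The contrast can be sketched in a few lines of Python (a minimal sketch with hypothetical tf and idf values; the full SPARK score additionally applies length normalization and the other two factors, which is why the slide reports 0.98 and 0.44 rather than the raw score_a values below):

import math

def attenuate(tf):
    # tf attenuation 1 + ln(1 + ln(tf)); 0 when the term is absent
    if tf <= 0:
        return 0.0
    return 1.0 + math.log(1.0 + math.log(tf))

def discover2_score(tuples, idf):
    # DISCOVER2: attenuate each tuple's tf separately, then sum the tuple scores
    return sum(sum(attenuate(tf) * idf[w] for w, tf in t.items()) for t in tuples)

def spark_score_a(tuples, idf):
    # SPARK: combine tf contributions into one virtual document first,
    # then attenuate the combined term frequencies
    combined = {}
    for t in tuples:
        for w, tf in t.items():
            combined[w] = combined.get(w, 0) + tf
    return sum(attenuate(tf) * idf[w] for w, tf in combined.items())

idf = {"netvista": 1.0, "maxtor": 1.0}      # hypothetical idf values
jtt1 = [{"netvista": 1}, {"maxtor": 1}]     # signature (1, 1)
jtt2 = [{"netvista": 1}, {"netvista": 1}]   # signature (0, 2)
print(discover2_score(jtt1, idf), discover2_score(jtt2, idf))  # 2.0 2.0
print(spark_score_a(jtt1, idf), spark_score_a(jtt2, idf))      # 2.0 ~1.53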
Virtual Document Collection
The virtual documents of a CN form a virtual document collection (3 results in the running example), which gives, e.g.,
• idf_netvista = ln(4/3)
• idf_maxtor = ln(4/2)
Materializing the collection is too expensive, so its statistics are estimated instead:
• Estimate avdl = avdl_C + avdl_P
• Estimate idf, e.g.,
  idf_maxtor = ln( 1 / (1 − (1 − 1/3)(1 − 1/3)) ) = ln(9/5)
The estimated statistics plug into the full IR formula for score_a:

  score_a = Σ_{t ∈ Q∩D} [ (1 + ln(1 + ln tf)) / ((1 − s) + s · dl/avdl) ] · qtf · ln((N+1)/df)

Completeness Factor
• For "short queries", users prefer results matching more keywords.
• Derive the completeness factor from the extended Boolean model: map each result to a point whose coordinates are its normalized per-keyword scores, and measure its Lp distance to the ideal position (1, 1).

Example (L2 distance over the axes netvista and maxtor; the maximal distance is 1.41):

  result                   d     score_b
  c1 ⋈ p1 (score_a 0.98)   0.5   (1.41 − 0.5)/1.41 = 0.65
  c2 ⋈ p2 (score_a 0.44)   1     (1.41 − 1)/1.41 = 0.29
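A minimal sketch of score_b in Python (the coordinates below are hypothetical points chosen only to reproduce the slide's distances of 0.5 and 1):

def completeness(coords, p=2.0):
    # coords: one normalized per-keyword score in [0, 1];
    # the ideal position is (1, 1, ..., 1)
    m = len(coords)
    d = sum((1.0 - x) ** p for x in coords) ** (1.0 / p)  # Lp distance to ideal
    d_max = float(m) ** (1.0 / p)                         # distance of the origin
    return (d_max - d) / d_max                            # score_b

print(round(completeness([1.0, 0.5]), 2))  # d = 0.5 -> 0.65 (c1 x p1)
print(round(completeness([1.0, 0.0]), 2))  # d = 1.0 -> 0.29 (c2 x p2)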
Size Normalization
• Results in large CNs tend to have more matches to the keywords.
• score_c = (1 + s1 − s1·|CN|) · (1 + s2 − s2·|CNnf|)
• Empirically, s1 = 0.15 and s2 = 1/(|Q| + 1) work well.

Putting 'em Together
score(JTT) = score_a · score_b · score_c
• a: IR score of the virtual document
• b: completeness factor
• c: size normalization factor

Running example (score_a · score_b):
  c1 ⋈ p1: 0.98 · 0.65 = 0.64
  c2 ⋈ p2: 0.44 · 0.29 = 0.13
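Putting the three factors into code (a sketch; the score_a and score_b values are the ones from the running example):

def size_norm(cn_size, cn_nonfree, num_keywords, s1=0.15):
    # score_c = (1 + s1 - s1*|CN|) * (1 + s2 - s2*|CNnf|), s2 = 1/(|Q| + 1)
    s2 = 1.0 / (num_keywords + 1)
    return (1.0 + s1 - s1 * cn_size) * (1.0 + s2 - s2 * cn_nonfree)

def score_jtt(score_a, score_b, score_c):
    # score(JTT) = score_a * score_b * score_c
    return score_a * score_b * score_c

sc = size_norm(cn_size=2, cn_nonfree=2, num_keywords=2)
print(round(score_jtt(0.98, 0.65, sc), 2))  # -> 0.36
print(round(score_jtt(0.44, 0.29, sc), 2))  # -> 0.07
# score_c is the same constant for every result of a CN, so the relative
# order matches the slide's score_a * score_b values (0.64 vs. 0.13).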
Comparing Top-1 Results
DBLP; Query = "nikos clique"

#Rel and R-Rank Results
DBLP; 18 queries; union of top-20 results:

            DISCOVER2   [Liu et al, SIGMOD06]   SPARK, p = 1.0   p = 1.4   p = 2.0
  #Rel      2           2                       16               16        18
  R-Rank    ≤ 0.243     ≤ 0.333                 0.926            0.935     1.000

Mondial; 35 queries; union of top-20 results:

            DISCOVER2   [Liu et al, SIGMOD06]   SPARK, p = 1.0   p = 1.4   p = 2.0
  #Rel      2           10                      27               29        34
  R-Rank    ≤ 0.276     ≤ 0.491                 0.881            0.909     0.986
Query Processing
3 Steps:
1. Generate candidate tuples in every relation in the schema (using full-text indexes)
2. Enumerate all possible Candidate Networks (CNs)
3. Execute the CNs
(A toy illustration of step 1 follows.)
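A sketch of step 1 in Python, using SQLite's FTS5 as the full-text index; the schema and data are made up for illustration, and the actual system relies on the host DBMS's own full-text indexes:

import sqlite3

# Step 1: probe a full-text index in each relation to collect its
# non-free tuples, i.e., tuples containing at least one query keyword.
# Assumes an SQLite build with the FTS5 extension.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE products USING fts5(name, description)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("netvista x41", "IBM desktop"),
     ("d540x", "maxtor 40GB disk"),
     ("thinkpad", "no match here")],
)

keywords = "netvista OR maxtor"   # disjunctive full-text query
candidates = conn.execute(
    "SELECT rowid, name FROM products WHERE products MATCH ?", (keywords,)
).fetchall()
print(candidates)                 # candidate tuples of relation 'products'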
Monotonic Scoring Function
Execute a CN, e.g., CN: P ⋈ C (DISCOVER2). With the candidate tuples ordered c1 ≤ c2 and p1 < p2:

  result    score(ci)  score(pj)  score
  c1 ⋈ p1   1.06       0.97       2.03
  c2 ⋈ p2   1.06       1.06       2.12

The score is the sum of the tuples' scores, so it is monotonic: better tuples can only give a better result.

Non-Monotonic Scoring Function
SPARK's score is not monotonic. With the same ordering c1 ≤ c2 and p1 < p2:

  result    score(ci)  score(pj)  score_a
  c1 ⋈ p1   1.06       0.97       0.98
  c2 ⋈ p2   1.06       1.06       0.44

score_a(c2 ⋈ p2) < score_a(c1 ⋈ p1), so two issues have to be addressed:
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order
Execute a CN
• Most algorithms differ here.
• The key is how to optimize for top-k retrieval.

Upper Bounding Function
Idea: use a monotonic & tight upper bounding function to SPARK's non-monotonic scoring function:
• sumidf = Σ_w idf_w
• watf(t) = (1/sumidf) * Σ_w (tf_w(t) * idf_w)
• A = sumidf * (1 + ln(1 + ln( Σ_t watf(t) )))
• B = sumidf * Σ_t watf(t)
• then, score_a ≤ uscore_a = (1/(1 − s)) * min(A, B)
• min(A, B) is monotonic wrt. watf(t), and score_b and score_c are constants given the CN
• hence score ≤ uscore
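A direct transcription of the bound into Python (a sketch; tuples are given as {keyword: tf} maps and s is the length-normalization slope, both hypothetical here):

import math

def uscore_a(tuples, idf, s=0.2):
    sumidf = sum(idf.values())                       # sumidf = sum_w idf_w
    watf = [sum(tf * idf.get(w, 0.0) for w, tf in t.items()) / sumidf
            for t in tuples]                         # watf(t) per tuple
    total = sum(watf)                                # monotonic in the watf(t)
    b = sumidf * total                               # B
    if total >= 1.0:
        a = sumidf * (1.0 + math.log(1.0 + math.log(total)))  # A
        bound = min(a, b)
    else:
        bound = b   # the double log is undefined for small totals;
                    # B alone is still a valid upper bound
    # score_a <= uscore_a = (1/(1-s)) * min(A, B); score_b and score_c
    # are constants for the CN, so score <= uscore overall
    return bound / (1.0 - s)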
Execute a CN [VLDB 03]
CN: P ⋈ C; {P1, P2, …} and {C1, C2, …} have been sorted based on their IR relevance scores.
• Score(Pi ⋈ Cj) = Score(Pi) + Score(Cj)
Operations:
  [P1, P1] ⋈ [C1, C1]
  C.get_next() → [P1, P1] ⋈ C2
  P.get_next() → P2 ⋈ [C1, C2]
  P.get_next() → P3 ⋈ [C1, C2]
  …
  // a parametric SQL query is sent to the DBMS for each probe

Early Stopping Criterion
Execute a CN; assume idf_netvista > idf_maxtor and k = 1.
• Stop as soon as score(current top-k results) ≥ uscore(every unchecked candidate).
• In the example, candidates with uscore 1.76 and 1.13 turn out to have actual score_a 0.44 and 0.98; once the best actual score (0.98) is no less than the largest remaining uscore → stop!
This re-establishes the early stopping criterion; checking candidates in an optimal order is addressed next.
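The per-candidate check is a parametric SQL query along these lines (a sketch; the table and key names are made up for illustration):

# One parametric query per CN, executed for each candidate pair <Pi, Cj>;
# a non-empty result means the pair really joins into a result tree.
CHECK_PAIR_SQL = """
    SELECT 1
    FROM   P JOIN C ON C.p_id = P.tid   -- join edge of the CN (illustrative)
    WHERE  P.tid = ? AND C.tid = ?
"""

def is_valid_jtt(conn, p_tid, c_tid):
    return conn.execute(CHECK_PAIR_SQL, (p_tid, c_tid)).fetchone() is not None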
Skyline Sweeping Algorithm
• Dominance: uscore(<Pi, Cj>) > uscore(<Pi+1, Cj>) and uscore(<Pi, Cj>) > uscore(<Pi, Cj+1>)
• Keep a priority queue of candidates ordered by uscore; check a candidate only when it is popped (via a parametric SQL query), then push the candidates it immediately dominates.
Operations (CN: P ⋈ C):
  Priority queue: <P1, C1>
  → <P2, C1>, <P1, C2>
  → <P3, C1>, <P1, C2>, <P2, C2>
  → <P1, C2>, <P2, C2>, <P4, C1>, <P3, C2>
  …
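A compact sketch of the sweep for a two-relation CN (Python; uscore(i, j) is assumed monotone, and real_score(i, j) returns None when the pair does not join):

import heapq

def skyline_sweep(n_p, n_c, uscore, real_score, k):
    # Start from <P1, C1>; pop the queued candidate with the largest
    # uscore, check it, and release the two candidates it dominates:
    # <Pi+1, Cj> and <Pi, Cj+1>.
    heap = [(-uscore(0, 0), 0, 0)]
    seen = {(0, 0)}
    topk = []                            # validated real scores, descending
    while heap:
        neg_u, i, j = heapq.heappop(heap)
        if len(topk) >= k and topk[k - 1] >= -neg_u:
            break                        # early stopping: uscores only shrink
        s = real_score(i, j)             # the parametric SQL probe
        if s is not None:
            topk = sorted(topk + [s], reverse=True)[:k]
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < n_p and nj < n_c and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-uscore(ni, nj), ni, nj))
    return topk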
Block Pipeline Algorithm
There is an inherent deficiency in bounding a non-monotonic function with (a few) monotonic upper bounding functions:
• lots of candidates with high uscores return much lower (real) scores
• this causes unnecessary (expensive) checking, and the search cannot stop earlier
Idea:
• partition the space (into blocks) and derive tighter upper bounds (bscore) for each partition
• be "unwilling" to check a candidate until we are quite sure about its "prospect" (bscore)

Block Pipeline Algorithm …
Execute a CN (assume idf_n > idf_m and k = 1): candidates are grouped into blocks by keyword signature, e.g., (n:1, m:0) and (n:0, m:1). A block's bscore can be far below its uscore (e.g., 1.05 vs. 2.74, or 0.95 vs. 2.50, while other blocks keep bscore = uscore = 2.63), so low-prospect candidates are not checked until no other block can do better → stop!
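A condensed sketch of that strategy (Python; in SPARK proper each candidate carries both bounds and queue entries are upgraded lazily from bscore to uscore to real score, which this simplification collapses into whole-block checks):

import heapq

def block_pipeline(blocks, k):
    # blocks: [(bscore, candidates)], one entry per signature block, where
    # candidates is a list of zero-argument callables returning the real
    # score of a candidate (or None if it does not join).
    heap = [(-bscore, idx) for idx, (bscore, _) in enumerate(blocks)]
    heapq.heapify(heap)
    topk = []
    while heap:
        neg_b, idx = heapq.heappop(heap)
        if len(topk) >= k and topk[k - 1] >= -neg_b:
            break                        # no remaining block can improve top-k
        for real_score in blocks[idx][1]:
            s = real_score()             # expensive check, paid only now
            if s is not None:
                topk = sorted(topk + [s], reverse=True)[:k]
    return topk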
Efficiency
DBLP; ~0.9M tuples in total; k = 10; PC 1.8G, 512M.
[Chart: elapsed time in ms (log scale, 1–100000) on queries DQ1–DQ18, comparing Sparse, GP, SS, and BP]
Efficiency …
DBLP, DQ13.
[Chart: elapsed time in ms (log scale, 1–100000) as k varies from 1 to 19, comparing Sparse, GP, SS, and BP]

Conclusions
A system that can perform effective & efficient keyword search on relational databases:
• meaningful query results with appropriate rankings
• second-level response time for a ~10M tuple DB (IMDB data) on a commodity PC

Q&A
Thank you.

Backup Slides
BANKS demo:
• http://www.cse.iitb.ac.in/banks/tejasdemo/dev-shashank//servlet/SearchForm
