Intelligent Search Engines of the Future:
Trends and Challenges
Gerhard Weikum
What Google Can't Do
• professors from Saarbruecken who teach DB or IR and have projects on XML
• drama with three women making a prophecy to a British nobleman that he will become king
• the woman from Paris whom I met at the PC meeting chaired by Renee Miller
• best & latest insights on percolation theory for networks
• pros and cons of dark energy hypothesis
• evolving opinions on EU constitution in different countries
• market impact of XML standards in 2002 vs. 2004
• experienced NLP experts who may be recruited for IT staff
apps in customer support, business analytics, health care, law, etc.
+ multilingual/multicultural, personalized/contextual, multimedia, etc.
Gerhard Weikum May 10, 2006
What is Beyond Google?
for Advanced Information Requests by „Power Users“
(librarians, market analysts, scientists, students, etc.)
background knowledge
→ ontologies & thesauri, statistics, continuous learning
(semi-)structured and „semantic“ data
→ XML, info extraction, annotation & classification
humans in the loop, wisdom of crowds
→ collaboration, recommendation, social networks, P2P
context awareness
→ personalization, geo & time, user behavior, reality mining
A Broader View of Search Engine
Technology (Information Retrieval)
• Intranet and Enterprise Search
• Scholarly Work on Digital Libraries, Web Archives, etc.
• „Vertical“ Search: Products, Entertainment, Health, etc.
• Desktop Search / Personal Information Management
• Deep Web Search / Information Integration
• Continuous Queries (PubSub) on News, Blogs, etc.
• Personalized and „Social“ Search
• Multimedia Search (Images, Video, Speech, Music, etc.)
• Multilingual and Multicultural Search
• Embedded (Mobile) and Integrated (DB&IR) Applications
Outline
✓ Motivation and Strategic Direction
• Semantic Search (Ontologies, XML, Info Extraction)
• Personalized Search (User-Behavior History)
• Social Search (Communities, P2P)
• Conclusion
Ontologies & Thesauri: Example WordNet
IR&NLP Approach
e.g. WordNet Thesaurus (Princeton)
(> 100 000 concepts with lexical & linguistic relations)
woman, adult female -- (an adult female person)
=> amazon, virago -- (a large strong and aggressive woman)
=> donna -- (an Italian woman of rank)
=> geisha, geisha girl -- (...)
=> lady -- (a polite name for any woman)
...
=> wife -- (a married woman, a man's partner in marriage)
=> witch -- (a being, usually female, imagined to have special powers derived from the devil)
„Semantic“ Query Expansion and Execution
User query: ~c = ~t1 ... ~tm
Example:
~professor and ( ~course = „~IR“ )
Thesaurus/Ontology:
concepts, relationships, glosses from WordNet, Gazetteers, Web forms & tables, Wikipedia;
relationships quantified by statistical correlation measures
Term2Concept with WSD
Query expansion
exp(ti)={w | sim(ti,w)≥ θ}
Weighted expanded query
Example:
(professor lecturer (0.749) scholar (0.71) ...)
and ( (course class (1.0) seminar (0.84) ... )
= („IR“ „Web search“ (0.653) ... ) )
Efficient top-k search with dynamic expansion
→ better recall, better mean precision for hard queries
[Figure: concept graph around „professor“ with edges labeled HYPONYM (0.749) and RELATED (0.48); nodes: alchemist, primadonna, magician, artist, director, wizard, investigator, intellectual, researcher, scientist, scholar, academic / academician / faculty member, mentor, teacher]
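The expansion rule exp(ti) = {w | sim(ti,w) ≥ θ} can be illustrated with a minimal Python sketch. The thesaurus `SIM` is a hypothetical in-memory dictionary; the weights for lecturer, scholar, class, and seminar are the ones from the slide's example, while the entry for mentor is made up to show a term falling below the threshold.

```python
# Hypothetical thesaurus: term -> {related term: similarity}.
SIM = {
    "professor": {"lecturer": 0.749, "scholar": 0.71, "mentor": 0.30},  # 0.30 is made up
    "course": {"class": 1.0, "seminar": 0.84},
}


def expand(term, theta):
    """Return {w: sim(term, w)} for all w with sim(term, w) >= theta.

    The original term is always kept with weight 1.0.
    """
    expansion = {term: 1.0}
    for w, s in SIM.get(term, {}).items():
        if s >= theta:
            expansion[w] = s
    return expansion


def expand_query(terms, theta=0.7):
    """Expand every query term independently with the same threshold."""
    return {t: expand(t, theta) for t in terms}


if __name__ == "__main__":
    print(expand_query(["professor", "course"]))
```

The threshold θ has to be tuned by hand in this static scheme, which is the drawback that dynamic expansion during top-k processing is meant to avoid.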
Query Expansion Example
From TREC 2004 Robust Track Benchmark:
Title: International Organized Crime
Description: Identify organizations that participate in international criminal activity,
the activity, and, if possible, collaborating organizations and the countries involved.
Query = {international[0.145|1.00],
~META[1.00|1.00][{gangdom[1.00|1.00],
gangland[0.742|1.00],
"organ[0.213|1.00] & crime[0.312|1.00]",
camorra[0.254|1.00],
maffia[0.318|1.00],
mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]",
"black[0.066|1.00] & hand[0.053|1.00]",
mob[0.123|1.00],
syndicate[0.093|1.00]}],
organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20],
columbian[0.686|0.20], cartel[0.466|0.20],
...}}
135530 sorted accesses in 11.073s.
Results:
1. Interpol Chief on Fight Against Narcotics
2. Economic Counterintelligence Tasks Viewed
3. Dresden Conference Views Growth of Organized Crime in Europe
4. Report on Drug, Weapons Seizures in Southwest Border Region
5. SWITZERLAND CALLED SOFT ON CRIME
...
Result snippets:
"Let us take, for example, the case of Medellin cartel's boss Pablo Escobar. Will the fact that he was eliminated change anything at all? No, it may perhaps have a psychological effect on other drug dealers but, ..."
"... for organizing the illicit export of metals and import of arms. It is extremely difficult for the law-enforcement organs to investigate and stamp out corruption among leading officials. ..."
"A parliamentary commission accused Swiss prosecutors today of doing little to stop drug and money-laundering international networks from pumping billions of dollars through Swiss companies."
What If The Semantic Web Existed And
All Information Were in XML?
Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?
<?xml version = '1.0' encoding = 'UTF-8'?>
<homepage>
...
<professor>
<name> Gerhard Weikum </name>
<teaching xlink:href="http://www.uni-saarland.de/..." />
<address>
<city> Saarbrücken </city>
<country> Germany </country>
</address>
<research xlink:href="http://www.mpi-inf.mpg.de/…" />
...
[Figure: document tree]
Professor
  Name: Gerhard Weikum
  Address: ...
    City: SB
    Country: Germany
  Teaching:
    Course
      Title: IR
      Description: Information retrieval ...
      Syllabus ...
        Book ...
        Article ...
  Research:
    Project
      Title: Intelligent Search of XML Data
      Sponsor: German Science Foundation
XML-IR Example (1)
Which professors are teaching IR and have research projects on XML?
[Figure: document tree]
Professor
  Name: Gerhard Weikum
  Address: ...
    City: SB
    Country: Germany
  Teaching:
    Course
      Title: IR
      Description: Information retrieval ...
      Syllabus ...
        Book ...
        Article ...
  Research:
    Project
      Title: Intelligent Search of XML Data
      Sponsor: German Science Foundation
// Professor [//* = „Saarbruecken“]
[// Course [//* = „IR“] ]
[// Research [//* = „XML“] ]
XML-IR Example (2)
Which professors are teaching IR and have research projects on XML?
[Figure: two document trees]
Professor
  Name: Gerhard Weikum
  Address: ...
    City: SB
    Country: Germany
  Teaching:
    Course
      Title: IR
      Description: Information retrieval ...
      Syllabus ...
        Book ...
        Article ...
  Research:
    Project
      Title: Intelligent Search of XML Data
      Sponsor: German Science Foundation
Lecturer
  Name: Ralf Schenkel
  Address: Max-Planck Institute for CS, Germany
  Interests: Semistructured Data, IR
  Teaching:
    Seminar
      Title: Statistical Language Models
      Contents: Ranked Search ...
      Literature: Book ...
Exact-match query:
// Professor [//* = „Saarbruecken“]
[// Course [//* = „IR“] ]
[// Research [//* = „XML“] ]
Relaxed query with similarity conditions:
// ~Professor [//* = „~Saarbruecken“]
[// ~Course [//* = „~IR“] ]
[// ~Research [//* = „~XML“] ]
Combine DB and IR techniques
with logics, statistics, AI, ML, NLP
for ranked retrieval of semistructured data (e.g. TopX)
TopX Engine at MPII (1)
Efficient Top-k Search [Buckley85, Güntzer et al. 00, Fagin01]
TA: efficient & principled
top-k query processing
with monotonic score aggr.
Data items: d1, …, dn
each with scores s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query: q = (t1, t2, t3)
TA with sorted access only (NRA):
scan index lists; consider d at posi in Li;
E(d) := E(d) ∪ {i}; highi := s(ti,d);
worstscore(d) := aggr{s(tν,d) | ν ∈E(d)};
bestscore(d) := aggr{worstscore(d),
aggr{highν | ν ∉ E(d)}};
if worstscore(d) > min-k then add d to top-k
min-k := min{worstscore(d’) | d’ ∈ top-k};
else if bestscore(d) > min-k then
cand := cand ∪ {d};
threshold := max {bestscore(d’) | d’∈ cand};
if threshold ≤ min-k then exit;
Index lists (doc, score), sorted by descending score:
pos   t1         t2         t3
1     d78 0.9    d64 0.8    d10 0.7
2     d23 0.8    d23 0.6    d78 0.5
3     d10 0.8    d10 0.6    d64 0.4
4     d1  0.7    d10 0.2    d99 0.2
5     d88 0.2    d78 0.1    d34 0.1
…
Ex. Google: > 10 million terms, > 8 billion docs, > 4 TB index
k=1
Scan depth 1:
Rank  Doc  Worst-score  Best-score
1     d78  0.9          2.4
2     d64  0.8          2.4
3     d10  0.7          2.4
Scan depth 2:
Rank  Doc  Worst-score  Best-score
1     d78  1.4          2.0
2     d23  1.4          1.9
3     d64  0.8          2.1
4     d10  0.7          2.1
Scan depth 3:
Rank  Doc  Worst-score  Best-score
1     d10  2.1          2.1
2     d78  1.4          2.0
3     d23  1.4          1.8
4     d64  1.2          2.0
STOP!
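The NRA pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the TopX implementation: it assumes sum as the aggregation function, scans all lists round-robin one position per step, and also bounds the score of documents not yet seen anywhere by the sum of the current high values (a check the slide's pseudocode leaves implicit). The function name `nra_top_k` and the data layout are assumptions; the index lists are copied from the slide.

```python
def nra_top_k(index_lists, k):
    """Top-k with sorted access only (NRA), using sum as score aggregation.

    index_lists: one non-empty list per query term of (doc, score) pairs,
    sorted by descending score. Returns (top-k as [(doc, worstscore)], depth).
    """
    m = len(index_lists)
    seen = {}          # doc -> {list index i: s(t_i, doc)}; the keys form E(d)
    high = [0.0] * m   # score at the current scan position of each list
    top, worst, depth = [], {}, 0
    for depth in range(1, max(len(lst) for lst in index_lists) + 1):
        for i, lst in enumerate(index_lists):
            if depth <= len(lst):
                doc, score = lst[depth - 1]
                # setdefault keeps the first (highest) score if a doc repeats.
                seen.setdefault(doc, {}).setdefault(i, score)
                high[i] = score
            else:
                high[i] = 0.0  # list exhausted
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}
        ranked = sorted(seen, key=lambda d: worst[d], reverse=True)
        top, cand = ranked[:k], ranked[k:]
        if len(top) == k:
            min_k = worst[top[-1]]
            # Best possible score of any candidate or any still-unseen document.
            threshold = max([best[d] for d in cand] + [sum(high)])
            if threshold <= min_k:
                break
    return [(d, worst[d]) for d in top], depth


# Index lists as printed on the slide.
T1 = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)]
T2 = [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d10", 0.2), ("d78", 0.1)]
T3 = [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)]

if __name__ == "__main__":
    print(nra_top_k([T1, T2, T3], k=1))
```

On these lists the sketch stops at scan depth 3 with d10 as the top-1 result, matching the worked example on the slide.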
Probabilistic Pruning of Top-k Candidates [VLDB 04]
TA family of algorithms based on invariant (with sum as aggr):
$worstscore(d) = \sum_{i \in E(d)} s_i(d) \;\le\; s(d) \;\le\; \sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} high_i = bestscore(d)$
• Add d to top-k result, if worstscore(d) > min-k
• Drop d only if bestscore(d) < min-k, otherwise keep in PQ
→ Often overly conservative (deep scans, high memory for PQ)
→ Approximate top-k with probabilistic guarantees:
$p(d) := P\left[ \sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} S_i > \delta \right]$
discard candidates d from queue if p(d) ≤ ε
⇒ E[rel. precision@k] = 1−ε
score predictor can use LSTs & Chernoff bounds, Poisson approximations, or histogram convolution
[Figure: worstscore(d) and bestscore(d) of a candidate d plotted as score over scan depth, together with min-k; d is dropped from the priority queue]
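One of the score predictors named above, histogram convolution, can be sketched as follows. This is a simplified illustration under stated assumptions, not the VLDB 04 method itself: each unseen score S_i is modeled by a hypothetical equi-width histogram (bucket b stands for score b * width), the S_i are treated as independent, and the function names are made up for this example.

```python
def convolve(p, q):
    """Distribution of the sum of two independent discrete variables."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def prob_exceeds(worstscore, histograms, delta, width):
    """Estimate p(d) = P[worstscore(d) + sum of unseen S_i > delta].

    histograms: one probability list per unseen list i (i not in E(d));
    entry b is the probability that S_i equals b * width.
    """
    dist = [1.0]  # distribution of an empty sum: all mass on 0
    for h in histograms:
        dist = convolve(dist, h)
    return sum(p for b, p in enumerate(dist) if worstscore + b * width > delta)


def should_discard(worstscore, histograms, delta, width, epsilon):
    """Pruning rule from the slide: discard d if p(d) <= epsilon."""
    return prob_exceeds(worstscore, histograms, delta, width) <= epsilon


if __name__ == "__main__":
    # Two unseen lists, each contributing 0 or 0.5 with equal probability.
    hists = [[0.5, 0.5], [0.5, 0.5]]
    print(prob_exceeds(1.0, hists, delta=1.6, width=0.5))
```

With ε = 0 the rule only discards candidates that cannot reach δ within the histogram model; larger ε trades result quality for shallower scans and a smaller queue.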
Top-k Queries with Query Expansion [SIGIR 05]
consider expandable query „~professor and research = XML“
with score Σi∈q {max j∈exp(i) { sim(i,j)*sj(d) }}
dynamic query expansion with
incremental on-demand merging of additional index lists
thesaurus / meta-index:
professor → lecturer: 0.7, scholar: 0.6, academic: 0.53, scientist: 0.5, ...
B+ tree index on tag-term pairs and terms
[Figure: four index lists (doc: score) for the terms research:XML, professor, lecturer (0.7), and scholar (0.6); list-to-term assignment not shown]
92: 0.9, 67: 0.9, 52: 0.9, 44: 0.8, 55: 0.8, ...
37: 0.9, 44: 0.8, 22: 0.7, 23: 0.6, 51: 0.6, 52: 0.6, ...
12: 0.9, 14: 0.8, 28: 0.6, 17: 0.55, 61: 0.5, 44: 0.5, ...
57: 0.6, 44: 0.4, 52: 0.4, 33: 0.3, 75: 0.3
+ much more efficient than threshold-based expansion
+ no threshold tuning
+ no topic drift
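The scoring function Σi∈q {max j∈exp(i) { sim(i,j)*sj(d) }} can be sketched in Python as below. The sketch only evaluates the score for one document and does not reproduce the incremental merging of index lists. The similarity weights for lecturer and scholar come from the slide; the per-document scores in `INDEX` are hypothetical, because the transcript does not show which index list belongs to which term.

```python
# Expansions exp(i) with sim(i, j); the query term itself has similarity 1.0.
EXPANSIONS = {
    "professor": {"professor": 1.0, "lecturer": 0.7, "scholar": 0.6},
    "research:XML": {"research:XML": 1.0},
}

# Hypothetical per-term scores s_j(d): term -> {doc id: score}.
INDEX = {
    "professor": {44: 0.4},
    "lecturer": {44: 0.8},
    "scholar": {44: 0.5},
    "research:XML": {44: 0.8},
}


def expanded_score(doc, query_terms, expansions, index):
    """Sum over query terms i of max over j in exp(i) of sim(i, j) * s_j(doc)."""
    total = 0.0
    for i in query_terms:
        total += max(
            (sim * index.get(j, {}).get(doc, 0.0) for j, sim in expansions[i].items()),
            default=0.0,
        )
    return total


if __name__ == "__main__":
    print(expanded_score(44, ["professor", "research:XML"], EXPANSIONS, INDEX))
```

Taking the maximum rather than the sum over the expansion terms means a document is credited once per query condition, which is what keeps many weak matches from causing topic drift.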
Performance Results for .Gov Queries
on .GOV corpus from TREC-12 Web track:
1.25 million docs (html, pdf, etc.)
50 keyword queries, e.g.:
• „Lewis Clark expedition“,
• „juvenile delinquency“,
• „legalization Marihuana“,
• „air bag safety reducing injuries death facts“
speedup by factor 10 at high precision/recall (relative to TA-sorted);
aggressive queue mgt. even yields factor 100 at 30-50 % prec./recall
                    TA-sorted    Prob-sorted (smart)
#sorted accesses    2,263,652    527,980
elapsed time [s]    148.7        15.9
max queue size      10849        400
relative recall     1            0.69
rank distance       0            39.5
score error         0            0.031
Experimental Results: INEX Benchmark
on IEEE-CS journal and conference articles:
12,000 XML docs with 12 million elements, 7.9 GB for all indexes
20 CO queries, e.g.: „XML editors or parsers“
20 CAS queries, e.g.: //article[ .//bibl[about(.//„QBIC“)] and
.//p[about(.//„image retrieval“)] ]
                       Join&Sort    StructIndex    TopX (ε=0.0)    TopX (ε=0.1)
#sorted accesses @10   9,122,318    761,970        635,507         426,986
#random accesses @10   0            3,245,068      64,807          59,414
relative recall @10    1            1              1               0.8
precision@10           0.34         0.34           0.34            0.32
MAP@1000               0.17         0.17           0.17            0.17
TopX outperforms Join&Sort by factor > 10
and beats StructIndex by factor > 20 on INEX, factor 2-3 on IMDB
Towards a Statistically Semantic Web
Information extraction yields:
Person | TimePeriod | ...
Sir Isaac Newton | 4 Jan 1643 - ... | ...
... Leibniz
... Kneller
Publication | Author | Topic
Philosophiae Naturalis ... | ... Newton | ... gravitation
Scientist
Sir Isaac Newton
... Leibniz
[Figure: source text annotated with the tags <Person>, <TimePeriod>, <Scientist>, <Publication>, <Painter>]
but with confidence < 1
→ Semantic-Web database with uncertainty !
→ ranked retrieval !
Information Extraction from Web Pages
Leading open-source tool: GATE/ANNIE
http://www.gate.ac.uk/annie/
Outline
✓ Semantic Search (Ontologies, XML, Info Extraction)
• Personalized Search (User-Behavior History)
• Conclusion
Personalized Search & Info Management
Personalized Result Ranking:
• query interpretation depends on
personal interests and bias
• need to learn user-specific weights for
multi-criteria ranking (relevance, authority, freshness, etc.)
• can exploit user behavior
(feedback, bookmarks, query logs, click streams, etc.)
Personal Information Management (PIM):
• manage, annotate, organize, and search all your personal data
• on desktop (mail, files, calendar, etc.)
• at home (photos, videos, music, parties, invoices, tax filing, etc.)
and in smart home with ambient intelligence
Google's PageRank [Brin & Page 1998]
Idea: incoming links are endorsements & increase page authority,
authority is higher if links come from high-authority pages
$PR(q) = \varepsilon \cdot j(q) + (1 - \varepsilon) \cdot \sum_{p \in IN(q)} PR(p) \cdot t(p,q)$
with $t(p,q) = 1 / \mathrm{outdegree}(p)$ and $j(q) = 1 / N$
Authority (page q) = stationary prob. of visiting q
random walk: uniformly random choice of links + random jumps
Personalized PageRank [Haveliwala et al. 2003]
Idea: random jumps favor designated high-quality pages
such as personal bookmarks, frequently visited pages, etc.
$PR(q) = \varepsilon \cdot j(q) + (1 - \varepsilon) \cdot \sum_{p \in IN(q)} PR(p) \cdot t(p,q)$
with $j(q) = 1 / |B|$ for $q \in B$, and $j(q) = 0$ otherwise
Authority (page q) = stationary prob. of visiting q
random walk: uniformly random choice of links
+ biased jumps to personal favorites
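Both variants of the recurrence differ only in the jump vector j, which a short power-iteration sketch makes visible. This is a minimal illustration, assuming a tiny hypothetical link graph in which every page has at least one out-link (dangling pages would need extra handling); the function and variable names are not from the slides.

```python
def pagerank(out_links, jump, eps=0.15, iters=100):
    """Iterate PR(q) = eps * j(q) + (1 - eps) * sum over p in IN(q) of PR(p) / outdegree(p).

    out_links: page -> non-empty list of link targets (all targets are keys).
    jump: page -> jump probability j(q); missing pages count as 0.
    """
    pages = list(out_links)
    pr = {q: 1.0 / len(pages) for q in pages}
    for _ in range(iters):
        nxt = {q: eps * jump.get(q, 0.0) for q in pages}
        for p, targets in out_links.items():
            share = (1 - eps) * pr[p] / len(targets)
            for q in targets:
                nxt[q] += share
        pr = nxt
    return pr


# Hypothetical graph: a <-> b, c -> a.
GRAPH = {"a": ["b"], "b": ["a"], "c": ["a"]}

UNIFORM = {q: 1.0 / len(GRAPH) for q in GRAPH}   # j(q) = 1/N
BOOKMARKS = {"c"}                                # personal favorites B
BIASED = {q: 1.0 / len(BOOKMARKS) for q in BOOKMARKS}  # j(q) = 1/|B| on B, else 0

if __name__ == "__main__":
    print(pagerank(GRAPH, UNIFORM))
    print(pagerank(GRAPH, BIASED))
```

Page c has no incoming links, so its authority comes entirely from random jumps; biasing the jumps toward the bookmark set raises its score and, through its out-link, the scores of the pages it endorses.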
Exploiting Query Logs and Click Streams
from PageRank: uniformly random choice of links + random jumps
to QRank: + query-doc transitions + query-query transitions
+ doc-doc transitions on implicit links (w/ thesaurus)
with probabilities estimated from log statistics
$PR(q) = \varepsilon \cdot j(q) + (1 - \varepsilon) \cdot \sum_{p \in IN(q)} PR(p) \cdot t(p,q)$
$QR(q) = \varepsilon \cdot j(q) + (1 - \varepsilon) \cdot \left( \alpha \sum_{p \in explicitIN(q)} PR(p) \cdot t(p,q) + (1 - \alpha) \sum_{p \in implicitIN(q)} PR(p) \cdot sim(p,q) \right)$
[Figure: example graph of queries and documents with node labels „max planck“, „mpg“, „budget“, „wissenschaft“, MPII, A.M.]
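The mixing of explicit and implicit transitions can be sketched as a random walk in Python. This is an illustration under stated assumptions, not the QRank implementation: the scores on the right-hand side are taken from the same walk and iterated to a fixpoint, the jump vector is uniform, every node has at least one explicit out-link, and the implicit similarities of each node are normalized to sum to 1. The graph and all names are hypothetical.

```python
def qrank(explicit, implicit, eps=0.15, alpha=0.5, iters=100):
    """Random walk over explicit links (weight alpha) and implicit links (1 - alpha).

    explicit: node -> non-empty list of targets (links, logged transitions).
    implicit: node -> {target: sim}, with the sims of each node summing to 1.
    """
    nodes = list(explicit)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: eps / n for v in nodes}  # uniform random jumps
        for p in nodes:
            for q in explicit[p]:
                nxt[q] += (1 - eps) * alpha * score[p] / len(explicit[p])
            for q, sim in implicit[p].items():
                nxt[q] += (1 - eps) * (1 - alpha) * score[p] * sim
        score = nxt
    return score


# Hypothetical example: node c has no explicit in-links but is similar to a and b.
EXPLICIT = {"a": ["b"], "b": ["a"], "c": ["a"]}
IMPLICIT = {"a": {"c": 1.0}, "b": {"c": 1.0}, "c": {"a": 1.0}}

if __name__ == "__main__":
    print(qrank(EXPLICIT, IMPLICIT, alpha=1.0))  # explicit links only
    print(qrank(EXPLICIT, IMPLICIT, alpha=0.5))  # half of the walk follows implicit links
```

With α = 1 the walk reduces to plain PageRank; lowering α lets similarity-based implicit links pass authority to nodes that few pages link to explicitly.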
Small-Scale Experiments
Setup:
70 000 Wikipedia docs, 18 volunteers posing Trivial-Pursuit queries
ca. 500 queries, ca. 300 refinements, ca. 1000 positive clicks
ca. 15 000 implicit links based on doc-doc similarity
Results (assessment by blind-test users):
• QRank top-10 result preferred over PageRank in 81% of all cases
• QRank has 50.3% precision@10, PageRank has 33.9%
Untrained example query „philosophy“:
     PageRank                    QRank
1.   Philosophy                  Philosophy
2.   GNU free doc. license       GNU free doc. license
3.   Free software foundation    Early modern philosophy
4.   Richard Stallman            Mysticism
5.   Debian                      Aristotle
Outline
✓ Personalized Search (User-Behavior History)
• Conclusion
Social Search: Vision & Trends
„Enable people to find, use, share, and expand
all human knowledge“ (Yahoo!: knowledge fusion)
Collect & harvest the wisdom of crowds:
• bookmarks of users, with content tags
• query logs, click streams, news readings, etc.
• interactions in communities (blogs, e-groups, etc.)
• opinions on products, movies, music, pharmaceuticals, etc.
• photos, annotations, ratings, etc.
all managed by one „super provider“ (Yahoo!, MSN, or Google)
→ decentralized & self-organizing peer-to-peer (P2P) networks !
Affects search result ranking:
prefer results liked by similar users
Social Search: Yahoo! MyWeb
search engine
highly susceptible to
spam & manipulation !
Social Search: Yahoo! Flickr
Peer-to-Peer (P2P) Web Search
Vision: Self-organizing P2P Web Search Engine
with Google-or-better functionality
• Scalable & Self-Organizing Data Structures and Algorithms
(DHTs, Semantic Overlay Networks, Epidemic Spreading, Distr. Link Analysis, etc.)
• Better Search Result Quality (Precision, Recall, etc.)
• Powerful Search Methods for Each Peer
(Concept-based Search, Query Expansion, Personalization, etc.)
• Leverage Intellectual Input at Each Peer
(Bookmarks, Feedback, Query Logs, Click Streams, Evolving Web, etc.)
• Collaboration among Peers
(Query Routing, Incentives, Fairness, Anonymity, etc.)
• Benefits of Large-Scale Social Networks:
Small-World Phenomenon, Breaking Information Monopolies
Foundations pursued in EU
Integrated Project DELIS
Minerva System Architecture
peer lists (directory)
term a: 17, 11, 92, ...
term f: 43, 65, 92, ...
url z: 54, 128, 7, ...
url x: 37, 44, 12, ...
term c: 13, 92, 45, ...
term g: 13, 11, 45, ...
url y: 75, 43, 12, ...
bookmarks
query peer P0
B0
local index X0
term g: 13, 11, 45, ...
Query routing aims to optimize benefit/cost
driven by distributed statistics on
peers‘ content similarity, content overlap,
freshness, authority, trust, performability etc.
Dynamically precompute „good peers“
to maintain a Semantic Overlay Network
Exploit community input (bookmarks, etc.)
Spam: Not Just for E-mail Anymore
Distortion of search results by „spam farms“
(aka. search engine optimization)
[Figure: boosting pages (spam farm) linking to the page to be „promoted“]
Susceptibility to manipulation and lack of trust model is a major problem:
• 2004 DarkBlue SEO Challenge: „nigritude ultramarine“ extremely „successful“
• Pessimists estimate 75 million out of 150 million Web hosts are spam
• Recent example: Ληστές
http://www.google.gr/search?hl=el&q=%CE%BB%CE%B7%CF%83%CF%84%CE%AD%CF
unclear borderline between spam and community opinions
Research Challenge:
• Robustness to egoistic and malicious behavior
• Trust/Distrust models and mechanisms
Web Spam Generation
Content spam:
• repeat words (boost tf scores)
• weave words/phrases into copied text
Example:
Remember not only online learning to say the right doctoral degree thing in the right place, but far cheap tuition more difficult still, to leave career unsaid the wrong thing at university the tempting moment.
• manipulate anchor texts
Link spam:
• copy links from Web dir. and distort
• create honeypot page and sneak in links
• infiltrate Web directory
• purchase expired domains
• generate posts to Blogs, message boards, etc.
• build & run spam farm (collusion) + form alliances
Hide/cloak the manipulation:
• masquerade href anchors
Example:
read about my <a href="myonlinecasino.com"> trip to Las Vegas </a>.
• use tiny anchor images with background color
• generate different dynamic pages to browsers and crawlers
Countermeasures: BadRank and TrustRank
BadRank:
start with explicit set B of blacklisted pages
define random-jump vector r by setting $r_i = 1/|B|$ if $i \in B$ and 0 else
propagate BadRank mass to predecessors
$BR(p) = \beta r_p + (1 - \beta) \sum_{q \in OUT(p)} BR(q) / \mathrm{indegree}(q)$
TrustRank:
start with explicit set T of trusted pages with trust values ti
define random-jump vector r by setting $r_i = t_i / \sum_{j \in T} t_j$ if $i \in T$ and 0 else
propagate TrustRank mass to successors
$TR(q) = \tau r_q + (1 - \tau) \sum_{p \in IN(q)} TR(p) / \mathrm{outdegree}(p)$
Problems:
maintenance of explicit lists is difficult
difficult to understand (& guarantee) effects
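The two propagation directions can be compared in a small Python sketch. It is a minimal illustration on a hypothetical four-page graph, using fixed-count iteration of the two recurrences; normalizing the trust values by their sum is an assumption, and the function names are not from the slides.

```python
def badrank(out_links, blacklist, beta=0.15, iters=50):
    """BR(p) = beta * r_p + (1 - beta) * sum over q in OUT(p) of BR(q) / indegree(q)."""
    pages = list(out_links)
    indegree = {p: 0 for p in pages}
    for targets in out_links.values():
        for q in targets:
            indegree[q] += 1
    r = {p: (1.0 / len(blacklist) if p in blacklist else 0.0) for p in pages}
    br = dict(r)
    for _ in range(iters):
        # Badness flows backwards: a page inherits it from the pages it links to.
        br = {p: beta * r[p] + (1 - beta) * sum(br[q] / indegree[q] for q in out_links[p])
              for p in pages}
    return br


def trustrank(out_links, trust, tau=0.15, iters=50):
    """TR(q) = tau * r_q + (1 - tau) * sum over p in IN(q) of TR(p) / outdegree(p)."""
    pages = list(out_links)
    total = sum(trust.values())  # assumed normalization of the trust values
    r = {p: trust.get(p, 0.0) / total for p in pages}
    tr = dict(r)
    for _ in range(iters):
        nxt = {p: tau * r[p] for p in pages}
        for p in pages:
            for q in out_links[p]:
                # Trust flows forwards along the links of trusted pages.
                nxt[q] += (1 - tau) * tr[p] / len(out_links[p])
        tr = nxt
    return tr


# Hypothetical graph: a trusted seed links to a good page, a farm links to spam.
LINKS = {"seed": ["good"], "good": [], "farm": ["spam"], "spam": []}

if __name__ == "__main__":
    print(badrank(LINKS, blacklist={"spam"}))
    print(trustrank(LINKS, trust={"seed": 1.0}))
```

In this example the farm page picks up BadRank because it links to a blacklisted page, while the good page picks up TrustRank because a trusted page links to it.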
Learning Spam Features [Drost/Scheffer 2005]
Use classifier (e.g. Bayesian predictor, SVM) to predict
„spam vs. ham“ based on page and page-context features
Most discriminative features are:
•tfidf weights of words in p0 and IN(p0)
•avg. #inlinks of pages in IN(p0)
•avg. #words in title of pages in OUT(p0)
•#pages in IN(p0) that have same length as some other page in IN(p0)
•avg. #inlinks and outlinks of pages in IN(p0)
•avg. #outlinks of pages in IN(p0)
•avg. #words in title of p0
•total #outlinks of pages in OUT(p0)
•total #inlinks of pages in IN(p0)
•clustering coefficient of pages in IN(p0) (#linked pairs / m(m-1) possible pairs)
•total #words in titles of pages in OUT(p0)
•total #outlinks of pages in OUT(p0)
•avg. #characters of URLs in IN(p0)
•#pages in IN(p0) and OUT(p0) with same MD5 hash signature as p0
•#characters in domain name of p0
•#pages in IN(p0) with same IP number as p0
But spammers may learn to adjust to the anti-spam measures.
It's an arms race!
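A few of the link-based features listed above can be computed from a plain link graph, as in the following Python sketch. It covers only three features and assumes a hypothetical adjacency-list representation; content features (tfidf, titles, URLs, hashes) and the classifier itself are left out, and the feature names are made up for this example.

```python
def link_features(out_links, p0):
    """Compute three page-context features of p0 from out_links (page -> targets)."""
    in_p0 = [p for p, targets in out_links.items() if p0 in targets]  # IN(p0)
    indegree = {p: 0 for p in out_links}
    for targets in out_links.values():
        for q in targets:
            indegree[q] += 1
    m = len(in_p0)
    # Ordered pairs (a, b) of distinct pages in IN(p0) where a links to b.
    linked_pairs = sum(1 for a in in_p0 for b in in_p0
                       if a != b and b in out_links[a])
    return {
        "avg_inlinks_IN": sum(indegree[p] for p in in_p0) / m if m else 0.0,
        "avg_outlinks_IN": sum(len(out_links[p]) for p in in_p0) / m if m else 0.0,
        # #linked pairs / m(m-1) possible pairs
        "clustering_IN": linked_pairs / (m * (m - 1)) if m > 1 else 0.0,
    }


# Hypothetical graph: a, b, c all link to p0; a and b also link to each other.
GRAPH = {"p0": [], "a": ["p0", "b"], "b": ["p0", "a"], "c": ["p0"]}

if __name__ == "__main__":
    print(link_features(GRAPH, "p0"))
```

Feature vectors of this kind would then be fed to the classifier; a densely interlinked IN(p0) with similar degrees is the kind of signal a spam farm tends to leave.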
Outline
✓ Personalized Search (User-Behavior History)
✓ Social Search (Communities, P2P)
• Conclusion
Strategic Research Avenues
Exploit the Web‘s potential for being a knowledge base
• Build large-scale & interesting „Semantic“ Web corpora
(Wikipedia++, all homepages of CS researchers, etc.)
• Enhance & interconnect Deep-Web databases
(digital libraries, scientific data, judicial expertise, etc.)
Semantic search: ontologies, richly structured &
annotated (XML) data, info extraction & enrichment
Personalized search: history of user behavior
(queries, clicks, etc.) and current context
Social search: wisdom of crowds (recommendations,
community behavior, etc.) embedded in P2P network
Data curation, quality control & trust
are crucial for effective information search:
authenticity, freshness, accuracy, authority, etc.
