From federated to aggregated search
Fernando Diaz, Mounia Lalmas and Milad Shokouhi
[email protected]
[email protected]
[email protected]
Outline
 Introduction and Terminology
 Architecture
 Resource Representation
 Resource Selection
 Result Presentation
 Evaluation
 Open Problems
 Bibliography
Introduction
 What is federated search?
 What is aggregated search?
 Motivations
 Challenges
 Relationships
A classical example of federated search (www.theeuropeanlibrary.org)
One query; several collections to be searched

A classical example of federated search (www.theeuropeanlibrary.org)
One merged list of results
Motivation for federated search
 Search a number of independent collections, with a focus on hidden-web collections
 Collections that are not easily crawlable (and often should not be)
 Access to up-to-date information and data
 Parallel search over several collections
 An effective tool for enterprise and digital library environments
Challenges for federated search
 How to represent collections, so that we know what documents each contains?
 How to select the collection(s) to be searched for relevant documents?
 How to merge results retrieved from several collections, to return one list of results to the users?
 Cooperative environment
 Uncooperative environment
From federated search to aggregated search
 “Federated search on the web”
 Peer-to-peer networks connect distributed peers (usually for file sharing), where each peer can be both server and client
 A metasearch engine combines the results of different search engines into a single result list
 Vertical search – also known as aggregated search – adds the top-ranked results from relevant verticals (e.g. images, videos, maps) to typical web search results
A classical example of aggregated search
(annotated result page: structured data, news, homepage, Wikipedia, real-time results (Twitter), video)
Motivation for aggregated search
 Increasingly, different types of information are available, sought and relevant
 e.g. news, image, wiki, video, audio, blog, map, tweet
 Search engines allow accessing these through so-called verticals
 Two “ways” to search
 Users can directly search the verticals
 Or rely on so-called aggregated search
Google universal search 2007: [ … ] search across all its content sources,
compare and rank all the information in real time, and deliver a single, integrated set
of search results [ … ] will incorporate information from a variety of previously
separate sources – including videos, images, news, maps, books, and websites –
into a single set of results.
http://www.google.com/intl/en/press/pressrel/universalsearch_20070516.html
Motivation for aggregated search
25K editorially classified queries
(Arguello et al, 09)
Challenges in aggregated search
 Extremely heterogeneous collections
 What is/are the vertical intent(s)?
 And
  Handling ambiguous (query | vertical) intent
  Handling non-stationary intent (e.g. news, local)
 How many results from each vertical to return, and where to position them on the result page?
  Slotting results
  Users mostly look at the 1st result page
 Page optimization and its evaluation
Ambiguous non-stationary intent
Query: Travel, Molusk, Paul
Vertical: Wikipedia, News, Image
Recap – Introduction

                          federated search    aggregated search
heterogeneity             low                 high
scale (documents, users)  small               large
user feedback             little              a lot
Terminology
1.  federated search, distributed information
retrieval, data fusion, aggregated search,
universal search, peer-to-peer network
2.  resource, vertical, database, collection,
source, server, domain, genre
3.  merging, blending, fusion, aggregation,
slotted, tiled
Problem definition
Present the “querier” with a summary of
search results from one or more resources.
General architecture
The user issues a raw query to a search interface/portal/broker, which forwards a query to each source/server/vertical.
Peer-to-peer network
(figure: peers connected to each other and to a directory server)
Peer to Peer (P2P) networks
 Broker-based
 Single centralized broker with document lists shared from peers (e.g. Napster, original version)
 Decentralized
 Each peer acts as both client and server (e.g. Gnutella v0.4)
 Structure-based
 Use distributed hash tables (DHT) (e.g. Chord (Stoica et al, 03))
 Hierarchical
 Use local directory services for routing and merging (e.g. Swapper.NET)
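To make the structure-based option concrete, below is a minimal sketch of DHT-style lookup on a hash ring. It is illustrative only: real Chord routes in O(log n) hops via finger tables, whereas this toy version scans a sorted ring, and the peer names are made up.

```python
# Toy DHT lookup (Chord-style): peers and items hash onto the same ring;
# an item is stored at its successor, the first peer clockwise from it.
import hashlib
from bisect import bisect_left

def ring_id(key: str, bits: int = 32) -> int:
    """Hash a peer address or item name onto a 2^bits identifier ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** bits)

peers = ["peer-a:9001", "peer-b:9002", "peer-c:9003", "peer-d:9004"]
ring = sorted((ring_id(p), p) for p in peers)

def lookup(item: str) -> str:
    """Return the peer responsible for the item (its successor on the ring)."""
    ids = [pid for pid, _ in ring]
    idx = bisect_left(ids, ring_id(item)) % len(ring)  # wrap past the top
    return ring[idx][1]

print(lookup("some-shared-file.mp3"))
```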
Federated search
(figure: the broker keeps a summary (Sum) of each collection A–E, forwards the query to the collections, and merges the returned results into one list)
Federated search
 Also known as distributed information retrieval (DIR)
 Provides one portal for searching information from multiple sources
 corporate intranets, fee-based databases, library catalogues, internet resources, user-specific digital storage
 Funnelback, Westlaw, FedStats, Cheshire, etc (see also http://federatedsearchblog.com/)
http://funnelback.com/pdfs/brochures/enterprise.pdf
Metasearch
(figure: the user's raw query goes to the metasearch engine, which forwards it to several web search engines)
Metasearch
 A search engine that queries several different search engines and combines their results (blended), or displays the results separately (non-blended)
 Does not crawl the web but relies on data gathered by other search engines
 Dogpile, Metacrawler, Search.com, etc (see http://www.cryer.co.uk/resources/searchengines/meta.htm)
Aggregated search
(figure: for the query "Angelina Jolie", the query goes to the web (text) index and to several verticals, and the results are assembled into one page)
Aggregated search
 Specific to a web search engine
 “Increasingly”, more than one type of information is relevant to an information need
 mostly web page + image, map, blog, etc
 These types of information are indexed and ranked using dedicated approaches (verticals)
 Presenting the results from verticals in an aggregated way is believed to be more useful
 All major search engines are doing some level of aggregated search
Data fusion
(figure: one query over one document collection (GOV2); different document representations (anchor only, title only) and different retrieval models (BM25, KL, Inquery) produce rankings that are merged into one ranked list of results (e.g. Voorhees et al, 95))
Data fusion
 Search one collection
 Documents can be indexed in different ways
 Title index, abstract index, etc (poly-representation)
 Weighting schemes
 Different retrieval models
 Rankings generated by different retrieval models (or different document representations) are merged to produce the final ranking
 Has often been shown to improve retrieval performance (TREC)
Terminology - Resource
 Source
 Server
 Database
 Collection (federated search)
 Vertical (aggregated search)
 Domain
 Genre
Terminology - Aggregation
 Merging
 Blending
 Fusion
 Slotted
 Tiled
Aggregated search (tiled)
http://au.alpha.yahoo.com/
Aggregated
search (tiled)
Naver.com
Aggregated search (slotted)
Others
 Clustering
 Faceted search
 Multi-document summarization
 Document generation
 Entity search
(see special issue – in press – on “Current research in focused retrieval and result aggregation”, Journal of Information Retrieval (Trotman et al, 10))
Yippy – Clustering search engine from Vivisimo
clusty.com
Faceted search
Multi-document summarization
http://newsblaster.cs.columbia.edu/
“Fictitious” document generation
(Paris et al, 10)
Entity search
http://sandbox.yahoo.com/Correlator
Recap
 Showed the relations between federated search, aggregated search, and others
 Exposed the various terminologies used
 In the rest of the tutorial, we concentrate on federated search and aggregated search
 Focus is on “effective search”
Outline
 Introduction and Terminology
 Architecture
 Resource Representation
 Resource Selection
 Result Presentation
 Evaluation
 Open Problems
 Bibliography
Architecture: the general components of federated and aggregated search systems.
Federated search architecture
Aggregated search architecture
 Pre-retrieval aggregation: decide verticals
before seeing results
 Post-retrieval aggregation: decide verticals
after seeing results
 Pre-web aggregation: decide verticals
before seeing web results
 Post-web aggregation: decide verticals
after seeing web results
Post-retrieval, pre-web
Pre and post-retrieval,
pre-web
Outline
 Introduction and Terminology
 Architecture
 Resource Representation
 Resource Selection
 Result Presentation
 Evaluation
 Open Problems
 Bibliography
Resource representation: how to represent resources,
so that we know what documents each contain.
Resource representation in federated
search
(Also known as resource summary/description)
Resource representation
 Cooperative environments
 Comprehensive term statistics
 Collection size information
 Uncooperative environments
 Query-based sampling
 Collection size estimation
Resource representation
(cooperative environments)
 STARTS Protocol (Gravano et al, 97)
  Source metadata
  Rich query language
Resource representation
(cooperative environments)
 Different types of term statistics
(Callan et al, 95; Gravano et al, 94a,b,99; Meng et al, 01;
Yuwono and Lee, 97; Xu and Callan, 98; Zobel, 97)
 Anchor-text
 HARP (Hawking and Thomas, 05)
Resource representation (uncooperative environments)
 Query-based sampling (Callan and Connell, 01)
 Select a query, probe the collection
 Download the top n documents
 Select the next query, repeat
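A minimal sketch of this loop, assuming a black-box search(query, n) API for the uncooperative collection and the simple "random term" query selector (other selectors are listed on the next slide):

```python
# Sketch of query-based sampling (Callan and Connell, 01).
# `search(query, n)` stands in for the collection's search API and is an
# assumption here; it should return (doc_id, text) pairs.
import random

def query_based_sampling(search, seed_query, n=4, max_docs=300, max_probes=200):
    """Probe the collection with queries and grow a document sample."""
    sampled = {}
    query = seed_query
    for _ in range(max_probes):
        if len(sampled) >= max_docs:
            break
        for doc_id, text in search(query, n):  # download the top-n results
            sampled.setdefault(doc_id, text)
        vocabulary = [t for text in sampled.values() for t in text.split()]
        if not vocabulary:
            break                              # empty collection or bad seed
        query = random.choice(vocabulary)      # 'random' query selector
    return sampled
```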
Resource representation
(uncooperative environments)
 Query selector
 (Callan and Connell, 01)
 Other resource description (ord)
 Learned resource description (lrd)
•  Average tf, random, df, ctf
 Query logs
 (Craswell, 00; Shokouhi et al, 07d)
 Focused probing
 (Ipeirotis and Gravano, 02)
Resource representation
(uncooperative environments)
 Adaptive sampling
 (Shokouhi et al, 06a)
 Rate of visiting new vocabulary
 (Baillie et al, 06a)
 Rate of sample quality improvement (reference query
log)
 (Caverlee et al, 06)
 Proportional document ratio (PD)
 Proportional vocabulary ratio (PV)
 Vocabulary growth (VG)
Resource representation
(uncooperative environments)
 Improving incomplete samples
 Shrinkage (Ipeirotis, 04; Ipeirotis and Gravano, 04):
topically related collections should share similar
terms
 Q-pilot (Sugiura and Etzioni, 00):
sampled documents + backlinks + front page
Resource representation (Collection size estimation)
 Capture-recapture (Liu et al, 01)
(figure: Sample A (capture), Sample B (recapture))
Resource representation (Collection size estimation)
 Capture-recapture estimate: if sample A captures |A| documents, sample B captures |B| documents, and |A ∩ B| of them are recaptured, the collection size is estimated as N ≈ |A| × |B| / |A ∩ B|
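A worked example of the estimator with made-up samples: two 100-document samples that share 5 documents imply a collection of about 2,000 documents.

```python
# Capture-recapture (Lincoln-Petersen style) size estimate:
# N_hat = |A| * |B| / |A intersect B|.
def capture_recapture(sample_a, sample_b):
    overlap = len(set(sample_a) & set(sample_b))
    if overlap == 0:
        raise ValueError("no documents recaptured; take larger samples")
    return len(sample_a) * len(sample_b) / overlap

print(capture_recapture(range(100), range(95, 195)))  # 100*100/5 = 2000.0
```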
Resource representation
(Collection size estimation)
 Multiple queries sampler
(Thomas and Hawking, 07)
 Random-walk sampler, and pool-based
sampler
(Bar-Yossef and Gurevich, 06)
 Collection overlap estimation
(Shokouhi and Zobel, 07)
Resource representation
(Updating summaries)
(Ipeirotis et al, 05)
(Shokouhi et al, 07a)
Resource representation in
aggregated search
 Vertical content
 samples or access to vertical API
 represents content supply
 Vertical query logs
 samples or access to historic vertical searches
 represents content demand
Vertical content includes text
NEWS
Vertical content includes structure
SPORTS
Vertical content includes images
IMAGES
Issues with vertical content
 Dynamics
 some verticals become stale fast
 Heterogeneous content
 heterogeneous ranking algorithms
 Non-free text APIs
 affects query-based sampling
Addressing content dynamics
  sample most recently
indexed documents
(Diaz 09)
  assumes users more
likely to be interested in
recent content
(Konig et al, 09)
  in practice, only need a
fraction of the corpus to
perform well
Addressing heterogeneous content
1.  use text available with
documents (e.g. captions)
2.  manually map to
surrogates (e.g. wikipedia
pages)
performance of two different methods of
dealing with heterogeneous content
(Arguello et al, 09)
Vertical query logs
  Queries issued directly
to a vertical represent
explicit vertical intent
  Is similar to having a
large body of labeled
queries
Issues with vertical query logs
 Dynamics
 some verticals require temporally-sensitive
sampling
 for example, we do not want to sample news
query logs for a whole year
 Non-free text APIs
 affects query modeling
Hybrid approaches
 Should only sample documents likely to be
useful for vertical selection/merging
 e.g. a document which is never requested is not
useful for representing a vertical
 Suggests log-biased sampling
(Shokouhi et al, 06; Arguello et al, 09)
Recap – Resource representation

                             federated search              aggregated search
Representation completeness  low                           low-high
Representation generation    sampling/shared dictionaries  sampling, API
Freshness                    important                     critical
Outline
 Introduction and Terminology
 Architecture
 Resource Representation
 Resource Selection
 Result Presentation
 Evaluation
 Open Problems
 Bibliography
Resource selection: how to select the resource(s) to be
searched for relevant documents.
Resource selection for federated search
(figure: the broker scores each collection A–E against the query using its summary (Sum) and forwards the query only to the selected collections)
Resource selection (Lexicon-based methods)
 “Big-document” bag-of-words summaries, built by sampling each collection and held at the broker
 CORI (Callan et al, 95), GlOSS (Gravano et al, 94b), CVV (Yuwono and Lee, 97)
Resource selection
(Lexicon-based methods)
 CORI
 GlOSS
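The CORI belief formula (Callan et al, 95) can be sketched as below; df[c][t] is the document frequency of term t in collection c, cw[c] the collection's word count, cf[t] the number of collections containing t, and the constants (50, 150, b = 0.4) are the usual defaults.

```python
# Sketch of CORI collection scoring (Callan et al, 95).
import math

def cori_score(query_terms, c, df, cw, cf, num_collections, b=0.4):
    avg_cw = sum(cw.values()) / len(cw)
    beliefs = []
    for t in query_terms:
        dft = df[c].get(t, 0)
        T = dft / (dft + 50 + 150 * cw[c] / avg_cw)            # frequency part
        I = math.log((num_collections + 0.5) / max(cf.get(t, 1), 1)) \
            / math.log(num_collections + 1.0)                  # rarity part (idf-like)
        beliefs.append(b + (1 - b) * T * I)
    return sum(beliefs) / len(beliefs)   # rank collections by this belief
```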
Resource selection (Document-surrogate methods)
 Sampled documents with retained boundaries: the broker indexes the samples, remembering which collection each document came from
 ReDDE (Si and Callan, 03a), CRCS (Shokouhi, 07a), SUSHI (Thomas and Shokouhi, 09)
Resource selection (Document-surrogate methods)
 ReDDE
  ReDDE assumes that the top-ranked sampled documents are relevant
  ReDDE estimates the size of collections by sample-resample
  Assuming that all collections have the same size we have: yellow > blue > red (colours refer to the collections in the original figure)
  CRCS is inspired by ReDDE but assigns different probabilities of relevance based on document position: red > yellow, blue
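A minimal sketch of the ReDDE scoring loop under these assumptions; central_ranking (the source collections of the sampled documents retrieved from the broker's central sample index, best first), est_size and sample_size are illustrative inputs:

```python
# Sketch of ReDDE (Si and Callan, 03a): each top-ranked sampled document
# votes for its source collection, scaled by how many real documents one
# sampled document represents (estimated size / sample size).
from collections import defaultdict

def redde_scores(central_ranking, est_size, sample_size, top_n=100):
    scores = defaultdict(float)
    for source in central_ranking[:top_n]:        # assumed relevant
        scores[source] += est_size[source] / sample_size[source]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(redde_scores(["A", "A", "B", "C"],
                   est_size={"A": 10000, "B": 50000, "C": 2000},
                   sample_size={"A": 300, "B": 300, "C": 300}))
```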
Resource selection (Document-surrogate methods)
 SUSHI
 Fits different regression functions for each collection and query
 Scores are comparable (estimated over the same index)
Resource selection
(Supervised methods)
 Utility maximization techniques
 Model the search effectiveness
 DTF (Nottelmann and Fuhr, 03), UUM (Si and
Callan, 04a), RUM (Si and Callan, 05b)
 Classification-based methods
 Classify collections/queries for better selection
 Classification-aware server selection (Ipeirotis
and Gravano, 08), classification-based resource
selection (Arguello et al, 09a), learning from past
queries (Cetintas et al, 09)
Resource selection in aggregated search
 Content-based predictors
 derived from (sampled) vertical content
 Query string-based predictors
 derived from query text, independent of any resource associated with a vertical
 Query log-based predictors
 derived from previous requests issued by users to the vertical portal
Content-based predictors
 Distributed information retrieval (DIR) predictors
 Simple result set predictors
 numresults, score distributions, etc (Diaz, 09; Konig et al, 09)
 Complex result set predictors
 Clarity (Cronen-Townsend et al, 02)
 Autocorrelation (Diaz, 07)
 Many, many more (Hauff, 10)
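As one concrete example from this family, a minimal sketch of the Clarity predictor: the KL divergence between a query language model (estimated from the top-retrieved documents) and the collection language model; both models are assumed to be precomputed term-probability dictionaries.

```python
# Sketch of Clarity (Cronen-Townsend et al, 02): high divergence from the
# collection model suggests a focused, likely easier query.
import math

def clarity(query_model, collection_model):
    """Both arguments: dicts mapping term -> probability."""
    return sum(p * math.log2(p / collection_model[t])
               for t, p in query_model.items()
               if p > 0 and collection_model.get(t, 0) > 0)
```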
Issues with content-based predictors
 DIR (usually) assumes homogeneous
content types
 performance predictors (usually) assume
text corpora
 assumes ranking function consistency
 between verticals
 between vertical selector machine and vertical
ranker machine
 verticals have different dynamics (e.g.
news vs. image)
String-based predictors
 Dictionary lookups
 terms correlated with a vertical (e.g., movie
titles)
 Regular expressions
 patterns correlated with explicit vertical
requests (e.g., obama news)
 Named entities
 automatically-detected entity types (e.g.,
geographic entities)
String-based predictors
 Issues
 curating lists and expressions (manual or
automatic)
 terms included in dictionary manually vetted for
relevance
 high precision/low recall
Log-based predictors
 Classification approaches (Beitzel et al, 07; Li et al, 08)
 Language model approaches (Arguello et al, 09)
 Issues
 verticals with structured queries (e.g. local)
 query logs with dynamics (e.g. news) (Diaz, 09)
Comparing predictor performance
(Arguello et al, 09)
Predictor cost
 Pre-retrieval predictors
 computed without sending the query to the
vertical
 no network cost
 Post-retrieval predictors
 computed on the results from the vertical
 requires vertical support of web scale query
traffic
 incurs network latency
 can be mitigated with vertical content caches
Combining predictors
 Use predictors as features for a machine-learned model
 Training data
1.  editorial data
2.  behavioral data (e.g. clicks)
3.  other vertical data
(Diaz, 09; Arguello et al, 09; Konig et al, 09)
Editorial data
 Data: <query, vertical, {+,-}>
 Features: predictors based on f(query, vertical)
 Models:
 log-linear (Arguello et al, 09)
 boosted decision trees (Arguello et al, 10)
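A toy sketch of such a model, here a logistic regression (a log-linear model) over made-up <query, vertical> feature vectors; the cited papers use far richer predictor sets.

```python
# Sketch of a vertical selector trained on editorial <query,vertical,{+,-}>
# labels.  Feature values below are invented for illustration.
from sklearn.linear_model import LogisticRegression

# f(query, vertical): e.g. [log-likelihood under the vertical's query-log
#                           model, content-based score, query matches a rule]
X = [[0.9, 0.7, 1], [0.1, 0.2, 0], [0.8, 0.4, 1], [0.2, 0.1, 0]]
y = [1, 0, 1, 0]                                   # +/- editorial labels

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[0.7, 0.5, 1]])[0, 1])  # P(show the vertical)
```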
Combining predictors
(Arguello et al, 09)
Click data
 Data: <query, vertical, {click, skip}>, <query, vertical, click-through rate>
 Features: predictors based on f(query, vertical)
 Models:
 log-linear (Diaz, 09)
 boosted decision trees (Konig et al, 09)
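A small sketch of how raw clicks and skips can be turned into binary training labels via a thresholded click-through rate; the threshold value is illustrative.

```python
def ctr_label(clicks, skips, threshold=0.3):
    """Return a 1/0 label for a <query, vertical> pair, or None if unseen."""
    impressions = clicks + skips
    if impressions == 0:
        return None              # no impressions: no label for this pair
    return 1 if clicks / impressions >= threshold else 0
```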
Gathering click data
 Exploration bucket:
 show suboptimal presentations in order to
gather positive (and negative) click/skip data
 Cold start problem:
 without a basic model, the best exploration is
random
 Random exploration results in poor user
experience
Gathering click data
 Solutions
 reduce impact to small fraction of traffic/users
 train a basic high-precision non-click model
(perhaps with editorial data)
 Other issues
 Presentation bias: different verticals have
different click-through rates a priori
 Position bias: different presentation positions
have different click-through rates a priori
Click precision and recall
ability to predict queries using thresholded click-through rate to infer relevance
(Konig et al, 09)
Non-target data
(figure: some verticals have training data; the target vertical has none)
Non-target data
 Data: <query, source vertical, {+,-}>
 Features: predictors based on f(query, target vertical)
 Models:
 generic model + adaptation (Arguello et al, 10)
Non-target data
(Arguello et al, 10)
Generic model
 Objective
 train a single model that performs well for all
source verticals
 Assumption
 if it performs well across all source verticals, it
will perform well on the target vertical
(Arguello et al, 10)
Non-target data
adapted model
(Arguello et al, 10)
Adapted model
 Objective
 learn non-generic relationship between features
and the target vertical
 Assumption
 can bootstrap from labels generated by the
generic model
(Arguello et al, 10)
Non-target query classification
average precision on target query classification; red (blue) indicates
statistically significant improvements (degradations) compared to
the single predictor
(Arguello et al, 10)
Training set characteristics
 What is the cost of generating training data?
 how much money?
 how much time?
 how many negative impressions as a result of exploration?
 Are targets normalized?
 can we compare classifier output?
Training set cost summary
Online adaptation
 Production vertical selection systems
receive a variety of feedback signals
 clicks, skips
 reformulations
 A machine-learned system can adjust
predictions based on real time user
feedback
 very important for dynamic verticals
(Diaz, 09; Diaz and Arguello, 09)
Online adaptation
 Passive feedback: adjust prediction/
parameters in response to feedback
 allows recovery from false positives
 difficult to recover from false negatives
 Active feedback/explore-exploit:
opportunistically present suboptimal
verticals for feedback
 allows recovery from both errors
 incurs exploration cost
(Diaz, 09; Diaz and Arguello, 09)
Online adaptation
 Issues
 setting learning rate for dynamic intent verticals
 normalizing feedback signal across verticals
 resolving feedback and training signal
(click≠relevance)
(Diaz, 09; Diaz and Arguello, 09)
Recap – Resource selection

                           federated search             aggregated search
Features and content type  often textual                diverse
Collection size            unavailable (uncooperative)
Training data              none                         some-much
Outline
 Introduction and Terminology
 Architecture
 Resource Representation
 Resource Selection
 Result Presentation
 Evaluation
 Open Problems
 Bibliography
Result presentation: how to return results retrieved from several resources to users.
Result merging (Metasearch engines)
 Same source (web), different overlapping indexes
 Document scores may not be available
 Title, snippet, position and timestamps
 D-WISE (Yuwono and Lee, 96)
 Inquirus (Glover et al, 99)
 SavvySearch (Dreilinger and Howe, 1997)
Result merging
(Data fusion)
 Same corpus
 Different retrieval models
 Document scores/positions available
 Unsupervised techniques
 CombSUM, CombMNZ (Fox and Shaw, 93, 94)
 Borda fuse (Aslam and Montague, 01)
 Supervised techniques
 Bayes-fuse, weighted Borda fuse (Aslam and Montague, 01)
 Segment-based fusion (Lillis et al, 06, 08; Shokouhi, 07b)
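A minimal sketch of the unsupervised combiners, assuming each run is a dict mapping document id to score:

```python
# CombSUM/CombMNZ (Fox and Shaw, 93, 94): min-max normalise each run,
# then sum the scores; CombMNZ also multiplies by the number of runs
# that returned the document, rewarding agreement between runs.
def normalise(run):
    lo, hi = min(run.values()), max(run.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in run.items()}

def comb_sum(runs):
    docs = set().union(*runs)
    return {d: sum(r.get(d, 0.0) for r in runs) for d in docs}

def comb_mnz(runs):
    return {d: s * sum(1 for r in runs if d in r)
            for d, s in comb_sum(runs).items()}

runs = [normalise({"d1": 3.0, "d2": 1.0}), normalise({"d1": 0.9, "d3": 0.4})]
print(sorted(comb_mnz(runs).items(), key=lambda kv: -kv[1]))  # d1 first
```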
Result merging in federated search
(figure: the broker gathers the result lists from the selected collections A–E and merges them into one list for the user)
Result merging
 CORI (Callan et al, 95)
 Normalized collection score + Normalized
document score.
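The combination can be sketched as below, assuming the collection score C and document score D have already been min-max normalised; the 0.4/1.4 constants are the ones usually reported for CORI-style merging.

```python
def cori_merge_score(d_norm, c_norm):
    """Boost documents that come from highly scored collections."""
    return (d_norm + 0.4 * d_norm * c_norm) / 1.4
```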
Result merging
 SSL (Si and Callan, 2003b)
(figure: the broker ranks the query against its central sample index; documents that also appear in a selected resource's result list link the two rankings)
Result merging
 The broker score is estimated from the source-specific score by linear regression
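A minimal sketch of the per-collection mapping, assuming the (source score, broker score) pairs for the overlapping documents have already been collected:

```python
# SSL merging (Si and Callan, 2003b): fit broker ~ a * source + b on the
# documents that appear both in a collection's result list and in the
# broker's central sample index, then map all of that collection's scores.
import numpy as np

def ssl_mapping(source_scores, broker_scores):
    a, b = np.polyfit(source_scores, broker_scores, deg=1)
    return lambda s: a * s + b

to_broker = ssl_mapping([2.1, 1.5, 0.8], [0.9, 0.6, 0.2])
print(to_broker(1.9))   # now comparable with other collections' scores
```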
Result merging - Miscellaneous scenarios
 Multi-lingual result merging
 SSL with logistic regression (Si and Callan, 05a; Si et al, 08)
 Merging overlapped collections
 COSCO (Hernandez and Kambhampati, 05): exact duplicates
 GHV (Bernstein et al, 06; Shokouhi et al, 07b): exact/near duplicates
 Personalized metasearch
 (Thomas, 08)
Slotted vs tiled result presentation
(study design (Sushmita et al, 10): 3 verticals, 3 positions, 3 degrees of vertical intent; image results placed on top, at the top-right, in the middle, at the bottom-right, at the bottom, or on the left)
Slotted vs tiled
Designers of aggregated search interfaces should account for the aggregation style
 for both, vertical intent is key for deciding on the position and type of “vertical” results
 slotted: requires accurate estimation of the best position of the “vertical” result
 tiled: requires accurate selection of the type of “vertical” result
Recap – Result presentation

                 federated search              aggregated search
Content type     homogenous (text documents)   heterogeneous
Document scores  depends on environment        heterogeneous
Oracle           centralized index             none
Outline
 Introduction and Terminology
 Architecture
 Resource Representation
 Resource Selection
 Result Presentation
 Evaluation
 Open Problems
 Bibliography
Evaluation
Evaluation: how to measure the effectiveness of
federated and aggregated search systems.
Resource representation (summaries) evaluation – Federated search
 CTF ratio (Callan and Connell, 01)
 Spearman rank correlation coefficient (SRCC) (Callan and Connell, 01)
 Kullback-Leibler divergence (KL) (Baillie et al, 06b; Ipeirotis et al, 05), topical KL (Baillie et al, 09)
 Predictive likelihood (Baillie et al, 06a)
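As an example, the CTF ratio fits in a few lines: the fraction of term occurrences in the full collection that is covered by the sampled vocabulary (full-collection statistics are, of course, only available in evaluation settings).

```python
def ctf_ratio(sampled_vocab, collection_ctf):
    """sampled_vocab: set of terms seen in the sample;
    collection_ctf: dict term -> term frequency in the full collection."""
    covered = sum(f for t, f in collection_ctf.items() if t in sampled_vocab)
    return covered / sum(collection_ctf.values())
```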
Resource selection evaluation – Federated search
 Recall R_n (French and Powell, 00): the number of relevant documents held by the first n selected collections, relative to the number held by the best possible selection of n collections
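A small sketch of R_n under that definition, with illustrative inputs:

```python
def r_at_n(selected, rel_counts, n):
    """selected: collection ids in the order chosen by the algorithm;
    rel_counts: dict collection id -> number of relevant documents."""
    best = sum(sorted(rel_counts.values(), reverse=True)[:n])
    got = sum(rel_counts.get(c, 0) for c in selected[:n])
    return got / best if best else 0.0

print(r_at_n(["B", "A"], {"A": 10, "B": 40, "C": 25}, n=2))  # 50/65 ~ 0.77
```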
Result merging evaluation – Federated
search
 Oracle
 Correct merging (centralized index ranking)
(Hawking and Thistlewaite, 99)
 Perfect merging (ordered by relevance labels)
(Hawking and Thistlewaite, 99)
 Metrics
 Precision
 Correct matches (Chakravarthy and Haase, 95)
Vertical Selection Evaluation – Aggregated search
  Majority of publications focus on single vertical selection
 vertical accuracy, precision, recall
  Evaluation data
 editorial data
 behavioral data
Editorial data
 Guidelines
 judge relevance based on vertical results
(implicit judging of retrieval/content quality)
 judge relevance based on vertical description
(assumes idealized retrieval/content quality)
 Evaluation metric derived from binary or
graded relevance judgments
(Arguello et al, 09; Arguello et al, 10)
Behavioral data
 Inferring relevance from behavioral data (e.g. click data)
 Evaluation metric
 regression error on predicted CTR
 infer binary or graded relevance
(Diaz, 09; Konig et al, 09)
Test collections (a la TREC)

quantity/media       text        image    video   total
size (G)             2125        41.1     445.5   2611.6
number of documents  86,186,315  670,439  1,253*  86,858,007

Statistics on Topics
number of topics                                     150
average rel docs per topic                           110.3
average rel verticals per topic                      1.75
ratio of “General Web” topics                        29.3%
ratio of topics with two vertical intents            66.7%
ratio of topics with more than two vertical intents  4.0%

* There are on average more than 100 events/shots contained in each video clip (document)
(Zhou & Lalmas, 10)
Test collections (a la TREC)
(simulated) verticals built from existing test collections: Image Vertical from the ImageCLEF photo retrieval track; Blog Vertical from the TREC blog track; Reference (Encyclopedia) Vertical from the INEX ad-hoc track; General Web Vertical from the TREC web track; Shopping Vertical; etc. Per-topic document judgments (R/N) from the source collections are re-mapped to per-vertical judgments.
Recap – Evaluation

                 federated search              aggregated search
Editorial data   document relevance judgments  query labels
Behavioral data  none                          critical
Outline
 Introduction and Terminology
 Architecture
 Resource Representation
 Resource Selection
 Result Presentation
 Evaluation
 Open Problems
 Bibliography
Open problems in federated search
  Beyond big document
 Classification-based server selection (Arguello et al, 09a)
 Topic modeling
  Query expansion
 Previous techniques had little success (Ogilvie and Callan, 01; Shokouhi et al, 09)
  Evaluating federated search
  Confounding factors
  Federated search in other contexts
 Blog search (Elsas et al, 08; Seo and Croft, 08)
  Effective merging
 Supervised techniques
Open problems in aggregated search
 Evaluation metrics
 slotted presentation
 tiled presentation
 metrics based on behavioral signals
 Models for multiple verticals
 Minimizing the cost for new verticals,
markets
Outline
 Introduction and Terminology
 Architecture
 Resource Representation
 Resource Selection
 Result Presentation
 Evaluation
 Open Problems
 Bibliography
Bibliography
  J. Arguello, F. Diaz, J. Callan, and J.-F. Crespo, Sources of evidence for vertical
selection. In SIGIR 2009 (2009).
  J. Arguello, J. Callan, and F. Diaz. Classification-based resource selection. In
Proceedings of the ACM CIKM, Pages 1277--1286, Hong Kong, China, 2009a.
  J. Arguello, F. Diaz, J.-F. Paiement, Vertical Selection in the Presence of Unlabeled
Verticals. In SIGIR 2010 (2010).
  J. Aslam and M. Montague. Models for metasearch, In Proceedings of ACM SIGIR, pages 276--284, New Orleans, LA, 2001.
  M. Baillie, L. Azzopardi, and F. Crestani. Adaptive query-based sampling of
distributed collections, In Proceedings of SPIRE, Pages 316--328, Glasgow, UK,
2006a.
  M. Baillie, L. Azzopardi, and F. Crestani. Towards better measures: evaluation of
estimated resource description quality for distributed IR. In X. Jia, editor, Proceedings
of the First International Conference on Scalable Information systems, page 41, Hong
Kong, 2006b.
  M. Baillie, M. Carman, and F. Crestani. A topic-based measure of resource
description quality for distributed information retrieval. In Proceedings of ECIR, pages
485--496, Toulouse, France, 2009.
Bibliography
  Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index.
Proceedings of WWW, pages 367--376, Edinburgh, UK, 2006.
  S. M. Beitzel, E. C. Jensen, D. D. Lewis, A. Chowdhury, O. and Frieder, Automatic
classification of web queries using very large unlabeled query logs. ACM Trans. Inf.
Syst. 25, 2 (2007), 9.
  Y. Bernstein, M. Shokouhi, and J. Zobel. Compact features for detection of near-duplicates in distributed retrieval. Proceedings of SPIRE, pages 110--121, Glasgow, UK, 2006.
  J. Callan and M. Connell. Query-based sampling of text databases. ACM
Transactions on Information Systems, 19(2):97--130, 2001.
  J. Callan, Z. Lu, and B. Croft. Searching distributed collections with inference
networks. In Proceedings of ACM SIGIR, pages 21--28. Seattle, WA, 1995
  J. Caverlee, L. Liu, and J. Bae. Distributed query sampling: a quality-conscious
approach. In Proceedings of ACM SIGIR, pages 340--347. Seattle, WA, 2006.
  S. Cetintas, L. Si, and H. Yuan. Learning from past queries for resource selection, In Proceedings of ACM CIKM, pages 1867--1870, Hong Kong, China, 2009.
Bibliography
  B.T. Bartell, G.W. Cottrell, and R.K. Belew. Automatic Combination of Multiple Ranked
Retrieval Systems, ACM SIGIR, pp 173-181, 1994.
  C. Baumgarten. A Probabilistic Solution to the Selection and Fusion Problem in Distributed Information Retrieval, ACM SIGIR, pp 246-253, 1999.
  N. Craswell. Methods for Distributed Information Retrieval. PhD thesis, Australian
National University, 2000.
  S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. ACM
SIGIR, pp 299–306, 2002.
  A. Chakravarthy and K. Haase. NetSerf: using semantic knowledge to find internet
information archives, ACM SIGIR, pp 4-11, Seattle, WA, 1995.
  F. Diaz. Performance prediction using spatial autocorrelation. ACM SIGIR, pp. 583–590,
2007.
  F. Diaz. Integration of news content into web results. ACM International Conference on
Web Search and Data Mining, 2009.
  F. Diaz, J. and Arguello. Adaptation of offline vertical selection predictions in the
presence of user feedback, ACM SIGIR, 2009.
  D. Dreilinger and A. Howe. Experiences with selecting search engines using
metasearch. ACM Transaction on Information Systems, 15(3):195-222, 1997.
  J. Elsas, J. Arguello, J. Callan, and J. Carbonell. Retrieval and feedback models for blog feed search, ACM SIGIR, pp 347-354, Singapore, 2008.
Bibliography
  E. Glover, S. Lawrence, W. Birmingham, and C. Giles. Architecture of a metasearch
engine that supports user information needs, ACM CIKM, pp 210—216,1999.
  L. Gravano, H. García-Molina, and A. Tomasic. Precision and recall of GlOSS
estimators for database discovery. Third International conference on Parallel and
Distributed Information Systems, pp 103--106, Austin, TX, 1994a.
  L. Gravano, H. García-Molina, and A. Tomasic. The effectiveness of GlOSS for the
text database discovery problem. ACM SIGMOD, pp 126--137, Minneapolis, MN,
1994b.
  L. Gravano, C. Chang, H. García-Molina, and A. Paepcke. STARTS: Stanford proposal for internet metasearching, ACM SIGMOD, pp 207--218, Tucson, AZ, 1997.
  L. Gravano, H. García-Molina, and A. Tomasic. GlOSS: text-source discovery over
the internet, ACM Transactions on Database Systems, 24(2):229--264, 1999.
  E. Fox and J. Shaw. Combination of multiple searches. Second Text REtrieval
Conference, pp 243-252, Gaithersburg, MD, 1993.
  E. Fox and J. Shaw. Combination of multiple searches, Third Text REtrieval
Conference, pp 105-108, Gaithersburg, MD, 1994.
  J. French, and A. Powell. Metrics for evaluating database selection techniques, World
Wide Web, 3(3):153--163, 2000.
  C. Hauff. Predicting the Effectiveness of Queries and Retrieval Systems, PhD thesis,
University of Twente, 2010.
Bibliography
  D. Hawking and P. Thomas. Server selection methods in hybrid portal search, ACM
SIGIR, pp 75-82, Salvador, Brazil, 2005.
  D. Hawking and P. Thistlewaite. Methods for information server selection, ACM
Transactions on Information Systems, 17(1):40-76, 1999.
  T. Hernandez and S. Kambhampati. Improving text collection selection with coverage
and overlap statistics. WWW, pp 1128-1129, Chiba, Japan, 2005.
  P. Ipeirotis and L. Gravano. When one sample is not enough: improving text
database selection using shrinkage. ACM SIGMOD, pp 767-778, Paris, France, 2004.
  P. Ipeirotis and L. Gravano. Distributed search over the hidden web: Hierarchical
database sampling and selection. VLDB, pages 394-405, Hong Kong, China, 2002.
  P. Ipeirotis and L. Gravano. Classification-aware hidden-web text database selection.
ACM Transactions on Information Systems, 26(2):1-66, 2008.
  P. Ipeirotis, A. Ntoulas, J. Cho, and L. Gravano. Modeling and managing content
changes in text databases, 21st International Conference on Data Engineering, pp
606-617, Tokyo, Japan, 2005.
  A. C. König, M. Gamon, and Q. Wu. Click-through prediction for news queries, ACM
SIGIR, 2009.
Bibliography
  X. Li, Y.-Y. Wang, and A. Acero. Learning query intent from regularized click graphs, ACM SIGIR, pp 339--346, 2008.
  D. Lillis, F. Toolan, R. Collier, and J. Dunnion. ProbFuse: a probabilistic approach to
data fusion, ACM SIGIR, pp 139-146, Seattle, WA, 2006.
  K. Liu, C. Yu, and W. Meng. Discovering the representative of a search engine. ACM
CIKM, pp 652-654, McLean, VA, 2002.
  N. Liu, J. Yan, W. Fan, Q. Yang, and Z. Chen. Identifying Vertical Search Intention of
Query through Social Tagging Propagation, WWW, Madrid, 2009.
  W. Meng, Z. Wu, C. Yu, and Z. Li. A highly scalable and effective method for
metasearch, ACM Transactions on Information Systems, 19(3):310-335, 2001.
  W. Meng, C. Yu, and K. Liu. Building efficient and effective metasearch engines.
ACM Computing Surveys, 34(1):48-89, 2002.
  V. Murdock, and M. Lalmas. Workshop on aggregated search, SIGIR Forum 42(2):
80-83, 2008.
  H. Nottelmann and N. Fuhr. Combining CORI and the decision-theoretic approach for
advanced resource selection, ECIR, pp 138--153, Sunderland, UK, 2004.
  P. Ogilvie and J. Callan. The effectiveness of query expansion for distributed information retrieval, ACM CIKM, pp 183--190, Atlanta, GA, 2001.
  C. Paris, S. Wan and P. Thomas. Focused and aggregated search: a perspective
from natural language generation, Journal of Information Retrieval, Special Issue,
2010.
Bibliography
  S. Park. Analysis of characteristics and trends of Web queries submitted to NAVER, a
major Korean search engine, Library & Information Science Research 31(2): 126-133,
2009.
  F. Schumacher and R. Eschmeyer. The estimation of fish populations in lakes and
ponds, Journal of the Tennessee Academy of Science, 18:228-249, 1943.
  M. Shokouhi. Central-rank-based collection selection in uncooperative distributed
information retrieval, ECIR, pp 160-172, Rome, Italy, 2007a.
  J. Seo and B. Croft. Blog site search using resource selection, ACM CIKM, pp
1053-1062, Napa Valley, CA, 2008.
  M. Shokouhi. Segmentation of search engine results for effective data-fusion, ECIR,
pp 185-197, Rome, Italy, 2007b.
  M. Shokouhi and J. Zobel. Robust result merging using sample-based score
estimates, ACM Transactions on Information Systems, 27(3):1-29, 2009.
  M. Shokouhi and J. Zobel. Federated text retrieval from uncooperative overlapped
collections, ACM SIGIR, pp 495-502. Amsterdam, Netherlands, 2007.
  M. Shokouhi, F. Scholer, and J. Zobel. Sample sizes for query probing in
uncooperative distributed information retrieval, Eighth Asia Pacific Web Conference,
pp 63--75, Harbin, China, 2006a.
Bibliography
  M. Shokouhi, J. Zobel, F. Scholer, and S. Tahaghoghi. Capturing collection size for
distributed non-cooperative retrieval, ACM SIGIR, pp 316-323, Seattle, WA, 2006b.
  M. Shokouhi, J. Zobel, S. Tahaghoghi, and F. Scholer. Using query logs to establish
vocabularies in distributed information retrieval, Information Processing and
Management, 43(1):169-180, 2007d.
  M. Shokouhi, P. Thomas, and L. Azzopardi. Effective query expansion for federated
search, ACM SIGIR, pp 427-434, Singapore, 2009.
  L. Si and J. Callan. Unified utility maximization framework for resource selection, ACM
CIKM, pages 32-41, Washington, DC, 2004a.
  L. Si and J. Callan. CLEF2005: multilingual retrieval by combining multiple multilingual
ranked lists. Sixth Workshop of the Cross-Language Evaluation Forum, Vienna, Austria,
2005a. http://www.cs.purdue.edu/homes/lsi/publications.htm
  L. Si, J. Callan, S. Cetintas, and H. Yuan. An effective and efficient results merging
strategy for multilingual information retrieval in federated search environments,
Information Retrieval, 11(1):1--24, 2008.
  L. Si and J. Callan. Relevant document distribution estimation method for resource
selection, ACM SIGIR, pp 298-305, Toronto, Canada, 2003a.
  L. Si and J. Callan. Modeling search engine effectiveness for federated search, ACM
SIGIR, pp 83-90, Salvador, Brazil, 2005b.
  L. Si and J. Callan. A semisupervised learning method to merge search engine results,
ACM Transactions on Information Systems, 21(4):457-491, 2003b.
Bibliography
  A. Sugiura and O. Etzioni. Query routing for web search engines: architectures and
experiments, WWW, Pages 417-429, Amsterdam, Netherlands, 2000.
  S. Sushmita, H. Joho and M. Lalmas. A Task-Based Evaluation of an Aggregated
Search Interface, SPIRE, Saariselkä, Finland, 2009.
  S. Sushmita, H. Joho, M. Lalmas, and R. Villa. Factors Affecting Click-Through
Behavior in Aggregated Search Interfaces, ACM CIKM, Toronto, Canada, 2010.
  S. Sushmita, B. Piwowarski, and M. Lalmas. Dynamics of Genre and Domain Intents,
Technical Report, University of Glasgow 2010.
  S. Sushmita, H. Joho, M. Lalmas and J.M. Jose. Understanding domain "relevance"
in web search, WWW 2009 Workshop on Web Search Result Summarization and
Presentation, Madrid, Spain, 2009.
  P. Thomas and D. Hawking. Evaluating sampling methods for uncooperative
collections, ACM SIGIR, pp 503-510, Amsterdam, Netherlands, 2007.
  P. Thomas. Server characterisation and selection for personal metasearch, PhD
thesis, Australian National University, 2008.
  P. Thomas and M. Shokouhi. SUSHI: scoring scaled samples for server selection,
ACM SIGIR, pp 419-426, Singapore, Singapore, 2009.
  A. Trotman, S. Geva, J. Kamps, M. Lalmas and V. Murdock (eds). Current research
in focused retrieval and result aggregation, Special Issue in the Journal of Information
Retrieval, Springer, 2010.
Bibliography
  T. Tsikrika and M. Lalmas. Merging Techniques for Performing Data Fusion on the
Web, ACM CIKM, pp 181-189, Atlanta, Georgia, 2001.
  Ellen M. Voorhees, Narendra Kumar Gupta, Ben Johnson-Laird. Learning Collection
Fusion Strategies, ACM SIGIR, pp 172-179, 1995.
  B. Yuwono and D. Lee. WISE: A world wide web resource database system. IEEE
Transactions on Knowledge and Data Engineering, 8(4):548--554, 1996.
  B. Yuwono and D. Lee. Server ranking for distributed text retrieval systems on the
internet. Fifth International Conference on Database Systems for Advanced
Applications, 6, pp 41-50, Melbourne, Australia, 1997.
  J. Xu and J. Callan. Effective retrieval with distributed collections, ACM SIGIR, pp
112-120, Melbourne, Australia, 1998.
  A. Zhou and M. Lalmas. Building a Test Collection for Aggregated Search, Technical
Report, University of Glasgow 2010.
  J. Zobel. Collection selection via lexicon inspection, Australian Document Computing
Symposium, pp 74--80, Melbourne, Australia, 1997.