Transcription
Internet-Technologien (CS262)
Web Information Retrieval
22. April 2015
Christian Tschudin (based on slides by Roger Weber, edited by Thomas Meyer)
Departement Mathematik und Informatik, Universität Basel

Problem
❖ The IP, transport, and application layers (HTTP) establish connectivity (linked content)
❖ How can content be located efficiently in the global graph of web documents?
❖ "Expert sites"
  • Content (links) sorted by topic
  • Human classification ➝ "Web Directory"
❖ Complete index
  • Extract words from the text, title, and meta-information of web pages
  • Automated process ➝ "Search Engine"

CS262 — FS15 — Info Retrieval 2

Content
❖ History
❖ Search Engines
  • Web Crawlers
  • Term Extraction/Indexing
  • Text Retrieval Models
  • Web Retrieval
  • Optimization
❖ Web Directories
  • DMOZ, Google
❖ Semantic Web / Web 3.0 (metadata)
  • Ontologies
  • Folksonomies

History
❖ 1980s: files distributed via anonymous FTP
❖ 1990: "Archie" assembles lists of files on FTP servers
❖ 1990: Tim Berners-Lee @ CERN develops the WWW
❖ 1992-1994: early browsers (Erwise, ViolaWWW, Mosaic)
❖ 1993: early web robots collect URLs (Wanderer, ALIWEB, WWW Worm)
❖ 1994: Stanford students manually collect popular web sites and arrange them in a hierarchy, called Yahoo
❖ 1995: DEC develops AltaVista (crawler + text search)
❖ 1998: Stanford students Larry Page and Sergey Brin start Google (better link analysis & page ranking)

Size of the Web
❖ The size of the Web is estimated from the size of search engine indices; this reflects only the surface/visible/indexable web
❖ But there is also an invisible/dark/deep web:
  ❖ Search engines require linked web pages to build an index
  ❖ But a huge amount of content is not linked, i.e. not indexable:
    • Mostly dynamically created web pages, e.g. links generated by scripts, content from databases (e.g. libraries!)
    • The access to other web pages is restricted, e.g.
requires a password or captcha
    • Also sites with "no-crawl" directives, e.g. robots.txt:
        User-agent: *
        Disallow: /cgi-bin/
        Disallow: /images
        Disallow: /tmp/

Size of the Web (cont.)
❖ Current estimates based on search engine indices (April 2015):
  ❖ from Google: 50∙10⁹
  ❖ from Bing: 10∙10⁹
  ❖ from Yahoo: ? (in 2012, it was 3∙10⁹)
❖ For current figures, see http://www.worldwidewebsize.com
❖ Number of web sites: 925∙10⁶ (http://www.internetlivestats.com/)
  … which gives a low 50 pages per web site …
❖ A huge amount of information; how can search in the (surface) web be organized efficiently?

Search Engines
❖ The killer functionality of today's Web
❖ Goal: find relevant information from all web pages quickly
❖ Search engines (global market shares):
  ❖ May 2011: Google (82%), Yahoo (6%), Baidu (5%), Bing (4%), …
  ❖ Apr 2015: Google (62%), Baidu (20%), Bing (8%), Yahoo (6%), …
❖ Operation principle:
  1. Web crawling: find most (ideally: all) web pages automatically using spiders
  2. Indexing: analyze web pages and linked documents, build an index
  3. Retrieval: provide efficient text search over the whole content, rank results by relevance and popularity

Search Engine Operation / Overview

a.
Web Crawlers (aka Spiders, Robots)
❖ Goal: find all web pages in the World Wide Web
❖ Start with a set of URLs
❖ Iteratively follow the links on the pages analyzed
  [figure: the multithreaded Downloader fetches web pages from the World Wide Web; extracted URLs go to the Scheduler, which eliminates duplicate URLs and feeds the URL queue back to the Downloader]
❖ Ignore pages listed in the site's robots.txt
❖ Build a storage of pages with their text and metadata (ready for indexing)

b. Indexing
❖ Databases maintain information in a structured way (e.g. relational DBs)
  ❖ Simple queries make use of this structure (e.g. SQL)
  ❖ Index: a data structure to speed up queries (e.g. a phone directory ordered by city/name, number, or profession)
  ❖ Complex queries are more difficult and require a ranking mechanism
❖ The WWW is not a relational database; its information is unstructured
❖ How to build an index nevertheless?

b.1 Term Extraction - Overview

b.1 Term Extraction
i) Elimination of Structure
❖ HTML contains structure and content
❖ Remove the structure (markup tags), but remember meta-information, e.g.
  ❖ URI of the page: http://cn.cs.unibas.ch/index.html
  ❖ Title of the document: <title>Computer Networks Home</title>
  ❖ Meta tags: <meta name="keywords" content="network,basel">
❖ Body tags give information about the importance of content:
  ❖ Headlines: <h1>1. Information Retrieval</h1>
  ❖ Emphasized: <b>This is important</b>, <i>this, too</i>
  ❖ Links describe the linked content: <a href="...gif">Logo</a>

b.1 Term Extraction
ii) Elimination of Frequent/Infrequent Terms
❖ Remove terms with little or no semantics (e.g. "the", "a")
❖ Remove terms that appear seldom (e.g.
"sausage" in a computer science article)
❖ Theoretic solution: restrict indexing to terms that have proven useful in the past (needs user feedback)
❖ Pragmatic solution:
  ❖ Compute the Zipfian distribution, i.e. the occurrence frequency of each term
  ❖ Rank the terms by their occurrence frequency
  ❖ Strip off words that are too infrequent or too frequent

b.1 Term Extraction
iii) Mapping Text to Terms
❖ Most search engines use words or phrases as features (some use stemming, some distinguish upper-/lower-case)
❖ An option is to use fragments, aka n-grams. Example (trigrams):
  • street:  str, tre, ree, eet
  • streets: str, tre, ree, eet, ets
  • strets:  str, tre, ret, ets
❖ Simple misspellings often result in bad retrievals; fragments significantly improve quality
❖ Also extract term location and frequency:
  ❖ Frequency is later used for page ranking
  ❖ Location is used in conditional search (e.g. Q="white NEAR house")
  ❖ Location is used for page ranking (e.g. Q="white house")

b.1 Term Extraction
iv) Reduction of Terms to Their Stems
❖ Stemming: in most languages, words have various inflected forms that carry the same or similar meaning
❖ Deriving linguistic stems is not easy:
  ❖ English: good algorithms exist (e.g. the Porter algorithm)
  ❖ German: too complex, needs the help of a dictionary (e.g. EuroWordNet, GermaNet, WordNet)
    • Strong conjugations and declensions
      – gehen: gehe, gehst, geht, gehen, ging, gingst, gingen, gegangen
      – Haus: Haus, Hauses, Häuser
    • Composite words may or may not be split into parts
      – Gartenhaus ➝ Garten, Haus (good or bad??)
❖ Not implemented in all search engines (Google since 2003)

b.1 Term Extraction
v) Mapping to Index Terms
❖ Term extraction has to deal with homonyms and synonyms:
  ❖ Homonyms: equal terms but different semantics (e.g. bank = shore | financial institution)
  ❖ Synonyms: different terms but equal/similar semantics (e.g.
walk, go, pace, run, sprint)
  ❖ Hypernyms: umbrella terms (e.g. animal ← dog, cat, bird, …)
  ❖ Holonyms (have parts) / meronyms (are part of) (e.g. door ← lock)
❖ These terms define a network (called an ontology): terms = nodes, relations = edges
❖ The occurrence of a term in a document may then also be interpreted as an occurrence of nearby terms in the network (e.g. "dog" may be interpreted as "animal" with a smaller weight)

Core Technology: Text Retrieval
❖ Goal: when a user enters a query, examine the index and provide a list of the best-matching web pages
❖ Input: terms, often combinable with boolean operators (AND, OR, NOT)
❖ Ranking: algorithms that return the best results first
  ❖ The ranking algorithm is the core business of a search engine and often kept secret
  ❖ Ranking may also be influenced by paying advertisers
❖ Different retrieval approaches: Boolean, fuzzy, vector space, and probabilistic retrieval (here: only a brief overview without going into details)

A) Boolean Retrieval
❖ Query = boolean operations (and, or, not) on terms
❖ Iterate over the documents:
  ❖ Retrieval Status Value (RSV) = 1 if the document matches the query
  ❖ Retrieval Status Value (RSV) = 0 if not
❖ Historically already used with punch cards; requires only sequential access
❖ Today not state-of-the-art anymore:
  ❖ No ranking of documents; returns all matching documents
  ❖ The size of the result becomes unreasonably large
  ❖ Complex query language

B) Fuzzy Retrieval
❖ Same model as Boolean retrieval, enriched with a ranking mechanism (based on term frequencies)
  ❖ The Retrieval Status Value (RSV) evaluates to a value between 0 and 1 (fuzzy logic)
  ❖ Documents are ordered by descending RSV
❖ Advantage:
  ❖ Ranking of retrieved documents
❖ Disadvantages (not much better than Boolean retrieval, and worse than all other models):
  ❖ No weighting of terms, i.e.
frequent terms dominate the result
  ❖ Complex query language

C) Vector Space Retrieval
❖ Documents D and queries Q are represented by M-dimensional vectors d, q ∈ ℝ^M (M: number of terms in the collection)
❖ Definitions:
  ❖ Term frequency tf(Ti,Dj): number of occurrences of term Ti in document Dj
  ❖ Document frequency df(Ti): number of documents that contain term Ti
  ❖ Inverse document frequency idf(Ti) = log( N / df(Ti) ): the discrimination value of term Ti, describing how well the term distinguishes documents in the collection (e.g. "the" cannot segregate documents, while "computer" segregates them sharply)

C) Vector Space Retrieval (cont.)
❖ The document-term matrix combines all document vectors into one huge M×N matrix (N: number of documents, M: number of terms):

  d_ij = tf(Ti, Dj) ∙ idf(Ti)        A = [ a_ij = d_ij ]
  q_i  = tf(Ti, Q) ∙ idf(Ti)

❖ RSV(q,dj) ranks the documents for a given query
❖ A query q is answered with the k documents having the highest RSVs.
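As a small illustration of the idf definition above, here is a minimal sketch (assuming log base 10, which reproduces the numbers of the worked example below):

```python
import math

def idf(term, docs):
    """Inverse document frequency idf(T) = log10(N / df(T))."""
    df = sum(1 for d in docs if term in d)        # document frequency
    return math.log10(len(docs) / df) if df else 0.0

# the three documents of the worked example, as term sets
docs = [
    {"shipment", "of", "gold", "damaged", "in", "a", "fire"},
    {"delivery", "of", "silver", "arrived", "in", "a", "truck"},
    {"shipment", "of", "gold", "arrived", "in", "a", "truck"},
]
print(round(idf("gold", docs), 3))     # in 2 of 3 docs -> 0.176
print(round(idf("damaged", docs), 3))  # in 1 of 3 docs -> 0.477
print(idf("of", docs))                 # in every doc -> 0.0, no discrimination value
```

Terms occurring in every document get idf 0 and thus contribute nothing to the ranking.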
Typical RSV functions:
  ❖ Inner vector product: RSV(q,dj) = qᵀdj
  ❖ Cosine measure: RSV(q,dj) = qᵀdj / (‖q‖ ‖dj‖)

C) Vector Space Retrieval (Example)
❖ Given 3 documents (D1, D2, D3) and a query Q:
  ❖ D1: "Shipment of gold damaged in a fire"
  ❖ D2: "Delivery of silver arrived in a silver truck"
  ❖ D3: "Shipment of gold arrived in a truck"
  ❖ Q: "gold silver truck"

  ID  Term Ti   df(Ti)  idf(Ti)
   1  a         3       0
   2  arrived   2       0.176
   3  damaged   1       0.477
   4  delivery  1       0.477
   5  fire      1       0.477
   6  gold      2       0.176
   7  in        3       0
   8  of        3       0
   9  silver    1       0.477
  10  shipment  2       0.176
  11  truck     2       0.176

  (M = 11 terms, N = 3 documents; idf taken as log10)

C) Vector Space Retrieval (Example, cont.)
❖ Document-term matrix Aᵀ (only non-zero columns shown):

  doc  T2     T3     T4     T5     T6     T9     T10    T11
  D1          0.477         0.477  0.176         0.176
  D2   0.176         0.477                0.954         0.176
  D3   0.176                       0.176         0.176  0.176
  Q                                0.176  0.477         0.176

❖ Use the inner vector product to rank the documents:
  RSV = Aᵀq = ( 0.031, 0.486, 0.062 )ᵀ
❖ This yields the ranking D2 (0.486) before D3 (0.062) before D1 (0.031)

C) Vector Space Retrieval (cont.)
❖ There are many more methods to determine the vector representation and to compute Retrieval Status Values (RSV)
❖ Main assumption: terms occur independently of each other in documents
  ❖ This is actually not true, e.g. if one writes about "Mercedes", the term "car" is likely to co-occur in the document.
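The worked example above can be reproduced with a short, self-contained sketch (whitespace tokenization is a simplifying assumption, and idf is taken as log10, matching the example's numbers):

```python
import math

# documents and query from the example above
docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

terms = sorted({t for text in docs.values() for t in text.split()})
N = len(docs)

def df(term):
    """Number of documents containing the term."""
    return sum(1 for text in docs.values() if term in text.split())

def tfidf(text):
    """Vector of tf(T, text) * idf(T) over all collection terms."""
    words = text.split()
    return [words.count(t) * math.log10(N / df(t)) for t in terms]

q = tfidf(query)
for name, text in docs.items():
    d = tfidf(text)
    rsv = sum(qi * di for qi, di in zip(q, d))   # inner vector product
    print(name, round(rsv, 3))                   # D1 0.031, D2 0.486, D3 0.062
```

Note that "silver" occurs twice in D2 (tf = 2), which is why D2 dominates the ranking for this query.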
❖ Advantages:
  ❖ Simple model with efficient algorithms
  ❖ Partial-match queries possible
  ❖ Very good retrieval quality, but not state-of-the-art
❖ Disadvantages:
  ❖ Many heuristics and simplifications, no proof of "correctness"
  ❖ HTML/Web: the occurrence of terms is not the most important criterion for ranking documents (spamming)

D) Probabilistic Retrieval
❖ Basic idea: given a query and a document, estimate the probability that the user considers the document relevant
❖ Requires user interaction: the user's choices are fed back into the probabilistic reasoning
❖ Advantages:
  ❖ Documents are ordered by decreasing probability of being relevant
  ❖ Efficient evaluation possible
❖ Disadvantages:
  ❖ Partially relies on rough estimates (heuristics)
  ❖ Frequency and position of terms not considered
  ❖ The assumption of term independence does not hold

E) Latent Semantic Indexing (LSI)
❖ Vector space retrieval maps documents to points in the M-dimensional term space; this is not sufficient:
  ❖ there are correlations between terms (synonyms!)
  ❖ the M-dimensional space may be too high-dimensional
❖ Basic idea: transform the document vectors into a low-dimensional space
  ❖ The new dimensions are no longer bound to individual terms
  ❖ The new dimensions should denote concepts encompassing several terms
❖ This transformation is called Latent Semantic Indexing

E) Latent Semantic Indexing (cont.)
❖ Advantages:
  ❖ Synonyms are automatically detected
  ❖ Simplifies term extraction:
    • no dictionary or ontology required
    • different languages and cross-language retrieval for free
    • stemming not necessary
  ❖ Good retrieval quality
❖ Disadvantages:
  ❖ Extremely expensive; fast algorithms for parallel computation would be necessary (but are not available)
  ❖ Retrieval quality not much better than with other methods

Web Retrieval / the ordering problem
❖ What to show first? Most result sets contain more than 100'000 documents with an RSV > 0
❖ But not all documents are relevant
  • e.g. the query "Ford" returns 1'510'000'000 results (Google, April 2012)
  • 1st rank: the car manufacturer Ford
  • How is this possible? Search engines do not sort based on RSVs alone
❖ Classical text retrieval also lacks a defense mechanism against spamming

Ordering of Documents
❖ Today's search engines use methods similar to (but more advanced than) those discussed before; the details are secret!
❖ The ranking considers:
  a. Proximity of terms (i.e. the distance between occurrences of distinct query terms)
  b. Position in the document (URL, text of references, title, meta tag, body)
  c. "PageRank"
  d. Further criteria (advertisements, pushed content, formatting)

a.
Proximity of Terms
❖ Query: "White House"
  ❖ Document 1: "the white car stands in front of the house" (not relevant)
  ❖ Document 2: "the president entered the White House" (relevant)
  ❖ The closer the query terms are, the more relevant the text is
❖ Implementation in the Google prototype:
  ❖ each pair of query-term positions is assigned a proximity value
  ❖ the frequencies of these values form a proximity vector
  ❖ multiplying this vector with a weighting vector yields the overall proximity value of the document

b. Position in the Document
❖ Queries typically aim at the title (heading)
  ❖ e.g. "White House" rather than "Central Executive Place"
❖ Users often look for brands, persons, and firms
❖ External links to a page contain good descriptions of it
  ❖ e.g. the query "eth lausanne" is answered with the home page of EPFL, although that page does not contain the term "ETH"
❖ Pages are more relevant if the query terms appear in the title, with special visual attributes, or in external references
❖ Google:
  ❖ counts the occurrences of terms along these dimensions
  ❖ multiplies the frequencies with well-chosen weights
  ❖ sums these values into a second relevance value for the document
  ❖ contains mechanisms to cut off spamming

c. PageRank
❖ Idea: more inbound links = more relevant, since a surfer is more likely to land on that page
❖ Problems: not every page is equally important + spamming
❖ Improved algorithm:
  ❖ A random surfer clicks with probability p on an outgoing link
  ❖ With probability 1-p the surfer jumps to another, arbitrary page (bookmark, typed URL)
  ❖ The PageRank of a page is the probability that the random surfer lands on that page (after a number of steps)

c. PageRank (cont.)
❖ Notation:
  ❖ A      an arbitrary page
  ❖ L(A)   set of pages that have a link to A
  ❖ N(A)   number of outgoing links of page A
  ❖ PR(A)  PageRank of page A
  ❖ p      probability that the surfer follows an outgoing link
❖ Definition of PageRank:

  PR(A) = (1-p) + p ∙ Σ_{B ∈ L(A)} PR(B) / N(B)

c. PageRank (cont.)
❖ Definition of PageRank:

  PR(A) = (1-p) + p ∙ Σ_{B ∈ L(A)} PR(B) / N(B)

❖ The first part denotes the freedom of the surfer to jump to an arbitrary page (with probability 1-p) rather than follow a link (with probability p).
❖ The value of a link is given by the PageRank of the source page divided by the number of outgoing links on that page; this simulates the random surfer's freedom to follow any of those links.

c. PageRank (cont.)
❖ The formula is recursive! The PageRank can be computed by a fixed-point iteration:
  1. Assign arbitrary initial values PR(A) to all pages A
  2. Compute PR'(A) according to the formula
  3. If |PR'(A)-PR(A)| is sufficiently small, PR(A) = PR'(A) is the solution; otherwise repeat with PR ← PR'
❖ Experimental evidence: solving the fixed point takes only a few iterations (<100)
❖ The PageRank computation is minimal compared to the crawling effort
❖ Google uses PageRank in combination with other criteria

d.
Further Criteria
❖ Bought ranking positions
  ❖ Search engines get money for placing pages at the top (advertisements; Google: AdWords)
❖ Length of the URL
  ❖ A query for "ford" may be answered by the following pages:
    • http://www.ford.com/
    • http://www.ford.com/HelpDesk/
    • http://www.careers.ford.com/main.asp
    • http://www.ford.com/servlet/ecmcs/ford/index.jsp?SECTION=ourServices&LEVEL2=rentalsFromDeale
  ❖ Shorter URLs (home pages) are ranked at higher positions
❖ User feedback
  ❖ Count result clicks, increase the relevance in subsequent queries
❖ Formatting
  ❖ 2015: Google "honors" sites formatted for mobile devices

Search Engine Optimization (SEO)
❖ How does Google learn about your new page, fast?
  ❖ Announce your page to Google (Webmaster tools)
❖ How to influence the ranking of your page?
  ❖ Companies and their products want (must) appear in the first 10 results
  ❖ Two strategies:
    1. Paid entries (AdWords)
    2. Improve "organic search" ➡ SEO (Search Engine Optimization)
  ❖ SEO is an established marketing strategy; consultants, spam …

Search Engine Optimization (cont.)
❖ How to improve "organic search"?
  ❖ optimize HTML code and information structure
  ❖ increase relevance to specific keywords
  ❖ increase the number of inbound links
❖ It is also in Google's interest that your web page content is accurately entered into their index!
  ❖ Helps to better serve advertisements
  ❖ Google provides guidelines (search-engine-optimization-starter-guide.pdf)
    • only the first 100k of a page matters
    • use header tags, meta tags, site maps, etc.
❖ But: one could also create fake pages (social media) that point to the main page to be boosted
  ❖ There is a market for this, too!
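The fixed-point iteration for PageRank described in section c can be sketched in a few lines. This is a toy sketch on a hypothetical three-page web; dangling pages without outgoing links are not handled:

```python
def pagerank(links, p=0.85, iters=100):
    """Fixed-point iteration of PR(A) = (1-p) + p * sum_{B in L(A)} PR(B)/N(B)."""
    pages = list(links)
    pr = dict.fromkeys(pages, 1.0)            # arbitrary initial values
    for _ in range(iters):
        pr = {a: (1 - p)
                 + p * sum(pr[b] / len(links[b]) for b in pages if a in links[b])
              for a in pages}
    return pr

# hypothetical three-page web: page -> set of outgoing links
links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
pr = pagerank(links)
print({a: round(v, 3) for a, v in pr.items()})   # C and A accumulate the most rank
```

With this (non-normalized) variant of the formula the ranks sum to the number of pages rather than to 1; B scores lowest because only half of A's rank flows to it.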
Web Directories
❖ Automatic indexing is not always optimal (despite ranking)
  ❖ Manually gathered and edited links are often more relevant
  ❖ Examples: yellow pages, classified advertisements
❖ Web directories: links are organized in a hierarchy
  ❖ URLs often submitted by site owners,
  ❖ edited by humans (professional editors, volunteers)
  ❖ requires a classification of terms into categories and sub-categories

Web Directories - Pros/Cons
❖ Advantage over search engines:
  ❖ human classification is better than automatic "spiders"
❖ Disadvantages:
  ❖ lists sometimes outdated (robots help)
  ❖ new pages listed late (search engines are faster)
❖ Yahoo was the king of web directories (before Google), but it stopped its famous directory in 2014!

Web Directories - Examples
❖ Yahoo (http://www.yahoo.com)
  ❖ human editors collect lists of essential web pages
  ❖ organize web pages in a hierarchy
  ❖ shut down Dec 2014
❖ Google Directory
  ❖ content mainly from DMOZ
  ❖ shut down in 2011
❖ DMOZ (= directory.mozilla): Open Directory Project (http://www.dmoz.org)
  ❖ labor distributed to volunteer editors ("net-citizens")
  ❖ multilingual
  ❖ used by other search engines
  ❖ 2010: 4.7∙10⁶ entries

Web 3.0
❖ What pages shall a search engine display for the query "jaguar"?
❖ A car? An animal? An operating system?
❖ Approach: important terms in web pages carry meta-information that helps to eliminate ambiguities:
  ❖ e.g. <item rdf:about="http://dbpedia.org/resource/Cat">Cat</item>
❖ Advantage: programs (crawlers, indexers, and in the future automatic search agents for end users) "understand" the web
❖ Vision by Tim Berners-Lee: "...the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines."

Ontologies
❖ An ontology (Greek: onto = "being: that which is") is the study of what exists
❖ Concepts about the world must be standardized: cat = Katze = chat = gatto = kissa = ...
❖ Ontologies in information science: a "formal, explicit specification of a shared conceptualization"
  ❖ Usually implemented as a domain-specific vocabulary with attributes, relations, etc.
❖ Frameworks and description languages (no content):
  ❖ RDF = Resource Description Framework
  ❖ OWL = Web Ontology Language

Ontologies vs. Folksonomies
❖ The creation of ontologies cannot be automated; ontologies are human artifacts
❖ There is not a single ontology, but a growing set of competing and complementing ontologies (e.g. Cyc, WordNet, ...)
❖ Ontologies are published under different license models
❖ Ontologies are either
  ❖ created by experts, or
  ❖ created by the public = folksonomies
    • Folksonomy = an ontology derived collaboratively
    • aka collaborative tagging, social classification, social indexing, social tagging

Linked Open Data (LOD)
❖ Public data, accessible via URIs, classified with RDF and OWL (Web Ontology Language)
❖ A world-wide network, also called the "Linked Data Cloud"
❖ Web Consortium long-term plan: unification of all databases

Linked Open Data (LOD)
❖ Browsers: see http://en.wikipedia.org/wiki/Linked_data#Browsers
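The earlier term-extraction idea, namely that an occurrence of "dog" may also be indexed as "animal" with a smaller weight, can be sketched with a toy hypernym table (a hand-made stand-in for a real ontology such as WordNet; terms and decay factor are illustrative):

```python
# toy ontology: term -> broader term (hypernym), None at the root
hypernym = {"dog": "animal", "cat": "animal", "animal": "organism", "organism": None}

def index_weights(term, decay=0.5):
    """Index a term together with its ontology ancestors at decaying weight,
    so a query for "animal" also matches documents mentioning "dog"."""
    weights, w = {}, 1.0
    while term is not None:
        weights[term] = w
        term, w = hypernym.get(term), w * decay
    return weights

print(index_weights("dog"))   # {'dog': 1.0, 'animal': 0.5, 'organism': 0.25}
```

A retrieval system could add these weighted ancestor terms to the document vector at indexing time, making ontology relations usable with the ordinary ranking machinery.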