Transcription
Internet-Technologien (CS262)
Web Information Retrieval
22. April 2015
Christian Tschudin (based on slides by Roger Weber, edited by Thomas Meyer)
Departement Mathematik und Informatik, Universität Basel

Problem
❖ The IP, transport, and application layers (HTTP) establish connectivity (linked content)
❖ How can content be located efficiently in the global graph of web documents?
❖ "Expert sites"
  • Content (links) sorted by topic
  • Human classification ➝ "Web Directory"
❖ Complete index
  • Extract words from the text, title, and meta-information of web pages
  • Automated process ➝ "Search Engine"

CS262 — FS15 — Info Retrieval 2

Content
❖ History
❖ Search Engines
  • Web Crawlers
  • Term Extraction/Indexing
  • Text Retrieval Models
  • Web Retrieval
  • Optimization
❖ Web Directories
  • DMOZ, Google
❖ Semantic Web / Web 3.0 (metadata)
  • Ontologies
  • Folksonomies

History
❖ 1980s: files distributed via anonymous FTP
❖ 1990: "Archie" assembles lists of files on FTP servers
❖ 1990: Tim Berners-Lee @ CERN develops the WWW
❖ 1992-1994: early browsers (Erwise, ViolaWWW, Mosaic)
❖ 1993: early web robots collect URLs (Wanderer, ALIWEB, WWW Worm)
❖ 1994: Stanford students manually collect popular web sites and arrange them in a hierarchy, called Yahoo
❖ 1995: DEC develops AltaVista (crawler + text search)
❖ 1998: Stanford students Larry Page and Sergey Brin start Google (better link analysis & page ranking)

Size of the Web
❖ The size of the Web is estimated from the size of search engine indices; this reflects only the surface/visible/indexable web
❖ But there is also an invisible/dark/deep web:
  ❖ Search engines require linked web pages to build an index
  ❖ But a huge amount of content is not linked, i.e. not indexable:
    • Mostly dynamically created web pages, e.g. links generated by scripts, content from databases (e.g. libraries!)
    • The access to other web pages is restricted, e.g.
requires a password or captcha
    • Also sites with "no-crawl" directives, e.g. robots.txt:
        User-agent: *
        Disallow: /cgi-bin/
        Disallow: /images
        Disallow: /tmp/

Size of the Web (cont.)
❖ Current estimates based on search engine indices (April 2015):
  ❖ from Google: 50∙10⁹
  ❖ from Bing: 10∙10⁹
  ❖ from Yahoo: ? (in 2012, it was 3∙10⁹)
❖ For current figures, see http://www.worldwidewebsize.com
❖ Number of web sites: 925∙10⁶ (http://www.internetlivestats.com/)
  … which gives a low 50 pages per web site …
❖ A huge amount of information; how can search in the (surface) web be organized efficiently?

Search Engines
❖ The killer functionality of today's Web
❖ Goal: find relevant information from all web pages quickly
❖ Search engines (global market shares):
  ❖ May 2011: Google (82%), Yahoo (6%), Baidu (5%), Bing (4%), …
  ❖ Apr 2015: Google (62%), Baidu (20%), Bing (8%), Yahoo (6%), …
❖ Operation principle:
  1. Web crawling: find most (ideally: all) web pages automatically using spiders
  2. Indexing: analyze web pages and linked documents, build an index
  3. Retrieval: provide efficient text search over the whole content, rank results by relevance and popularity

Search Engine Operation / Overview

a.
Web Crawlers (aka Spiders, Robots)
❖ Goal: find all web pages in the World Wide Web
❖ Start with a set of URLs
❖ Iteratively follow the links on the pages analyzed
  [figure: the multithreaded Downloader fetches web pages from the World Wide Web; extracted URLs go to the Scheduler, which eliminates duplicate URLs and feeds the URL queue back to the Downloader]
❖ Ignore pages listed in the site's robots.txt
❖ Build a storage of pages with their text and metadata (ready for indexing)

b. Indexing
❖ Databases maintain information in a structured way (e.g. relational DBs)
  ❖ Simple queries make use of this structure (e.g. SQL)
  ❖ Index: a data structure to speed up queries (e.g. a phone directory ordered by city/name, number, or profession)
  ❖ Complex queries are more difficult and require a ranking mechanism
❖ The WWW is not a relational database; its information is unstructured
❖ How to build an index nevertheless?

b.1 Term Extraction - Overview

b.1 Term Extraction
i) Elimination of Structure
❖ HTML contains structure and content
❖ Remove the structure (markup tags), but remember meta-information, e.g.
  ❖ URI of the page: http://cn.cs.unibas.ch/index.html
  ❖ Title of the document: <title>Computer Networks Home</title>
  ❖ Meta tags: <meta name="keywords" content="network,basel">
❖ Body tags give information about the importance of content:
  ❖ Headlines: <h1>1. Information Retrieval</h1>
  ❖ Emphasized: <b>This is important</b>, <i>this, too</i>
  ❖ Links describe the linked content: <a href="...gif">Logo</a>

b.1 Term Extraction
ii) Elimination of Frequent/Infrequent Terms
❖ Remove terms with little or no semantics (e.g. "the", "a")
❖ Remove terms that appear seldom (e.g.
"sausage" in a computer science article)
❖ Theoretic solution: restrict indexing to terms that have proven useful in the past (needs user feedback)
❖ Pragmatic solution:
  ❖ Compute the Zipfian distribution, i.e. the occurrence frequency of each term
  ❖ Rank the terms by their occurrence frequency
  ❖ Strip off words that are too infrequent or too frequent

b.1 Term Extraction
iii) Mapping Text to Terms
❖ Most search engines use words or phrases as features (some use stemming, some distinguish upper-/lower-case)
❖ An option is to use fragments, aka n-grams. Example (trigrams):
  • street:  str, tre, ree, eet
  • streets: str, tre, ree, eet, ets
  • strets:  str, tre, ret, ets
❖ Simple misspellings often result in bad retrievals; fragments significantly improve quality
❖ Also extract term location and frequency:
  ❖ Frequency is later used for page ranking
  ❖ Location is used in conditional search (e.g. Q="white NEAR house")
  ❖ Location is used for page ranking (e.g. Q="white house")

b.1 Term Extraction
iv) Reduction of Terms to Their Stems
❖ Stemming: in most languages, words have various inflected forms that carry the same or similar meaning
❖ Deriving linguistic stems is not easy:
  ❖ English: good algorithms exist (e.g. the Porter algorithm)
  ❖ German: too complex, needs the help of a dictionary (e.g. EuroWordNet, GermaNet, WordNet)
    • Strong conjugations and declensions
      – gehen: gehe, gehst, geht, gehen, ging, gingst, gingen, gegangen
      – Haus: Haus, Hauses, Häuser
    • Composite words may or may not be split into parts
      – Gartenhaus ➝ Garten, Haus (good or bad??)
❖ Not implemented in all search engines (Google since 2003)

b.1 Term Extraction
v) Mapping to Index Terms
❖ Term extraction has to deal with homonyms and synonyms:
  ❖ Homonyms: equal terms but different semantics (e.g. bank = shore | financial institution)
  ❖ Synonyms: different terms but equal/similar semantics (e.g.
walk, go, pace, run, sprint)
  ❖ Hypernyms: umbrella terms (e.g. animal ← dog, cat, bird, …)
  ❖ Holonyms (have parts) / meronyms (are part of) (e.g. door ← lock)
❖ These terms define a network (called an ontology): terms = nodes, relations = edges
❖ The occurrence of a term in a document may then also be interpreted as an occurrence of nearby terms in the network (e.g. "dog" may be interpreted as "animal" with a smaller weight)

Core Technology: Text Retrieval
❖ Goal: when a user enters a query, examine the index and provide a list of the best-matching web pages
❖ Input: terms, often combinable with boolean operators (AND, OR, NOT)
❖ Ranking: algorithms that return the best results first
  ❖ The ranking algorithm is the core business of a search engine and often kept secret
  ❖ Ranking may also be influenced by paying advertisers
❖ Different retrieval approaches: Boolean, fuzzy, vector space, and probabilistic retrieval (here: only a brief overview without going into details)

A) Boolean Retrieval
❖ Query = boolean operations (and, or, not) on terms
❖ Iterate over the documents:
  ❖ Retrieval Status Value (RSV) = 1 if the document matches the query
  ❖ Retrieval Status Value (RSV) = 0 if not
❖ Historically already used with punch cards; requires only sequential access
❖ Today not state-of-the-art anymore:
  ❖ No ranking of documents; returns all matching documents
  ❖ The size of the result becomes unreasonably large
  ❖ Complex query language

B) Fuzzy Retrieval
❖ Same model as Boolean retrieval, enriched with a ranking mechanism (based on term frequencies)
  ❖ The Retrieval Status Value (RSV) evaluates to a value between 0 and 1 (fuzzy logic)
  ❖ Documents are ordered by descending RSV
❖ Advantage:
  ❖ Ranking of retrieved documents
❖ Disadvantages (not much better than Boolean retrieval, and worse than all other models):
  ❖ No weighting of terms, i.e.
frequent terms dominate the result
  ❖ Complex query language

C) Vector Space Retrieval
❖ Documents D and queries Q are represented by M-dimensional vectors d, q ∈ ℝ^M (M: number of terms in the collection)
❖ Definitions:
  ❖ Term frequency tf(Ti,Dj): number of occurrences of term Ti in document Dj
  ❖ Document frequency df(Ti): number of documents that contain term Ti
  ❖ Inverse document frequency idf(Ti) = log( N / df(Ti) ): the discrimination value of term Ti, describing how well the term distinguishes documents in the collection (e.g. "the" cannot segregate documents, while "computer" segregates them sharply)

C) Vector Space Retrieval (cont.)
❖ The document-term matrix combines all document vectors into one huge M×N matrix (N: number of documents, M: number of terms):

  d_ij = tf(Ti, Dj) ∙ idf(Ti)        A = [ a_ij = d_ij ]
  q_i  = tf(Ti, Q) ∙ idf(Ti)

❖ RSV(q,dj) ranks the documents for a given query
❖ A query q is answered with the k documents having the highest RSVs.
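As a small illustration of the idf definition above, here is a minimal sketch (assuming log base 10, which reproduces the numbers of the worked example below):

```python
import math

def idf(term, docs):
    """Inverse document frequency idf(T) = log10(N / df(T))."""
    df = sum(1 for d in docs if term in d)        # document frequency
    return math.log10(len(docs) / df) if df else 0.0

# the three documents of the worked example, as term sets
docs = [
    {"shipment", "of", "gold", "damaged", "in", "a", "fire"},
    {"delivery", "of", "silver", "arrived", "in", "a", "truck"},
    {"shipment", "of", "gold", "arrived", "in", "a", "truck"},
]
print(round(idf("gold", docs), 3))     # in 2 of 3 docs -> 0.176
print(round(idf("damaged", docs), 3))  # in 1 of 3 docs -> 0.477
print(idf("of", docs))                 # in every doc -> 0.0, no discrimination value
```

Terms occurring in every document get idf 0 and thus contribute nothing to the ranking.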
Typical RSV functions:
  ❖ Inner vector product: RSV(q,dj) = qᵀdj
  ❖ Cosine measure: RSV(q,dj) = qᵀdj / (‖q‖ ‖dj‖)

C) Vector Space Retrieval (Example)
❖ Given 3 documents (D1, D2, D3) and a query Q:
  ❖ D1: "Shipment of gold damaged in a fire"
  ❖ D2: "Delivery of silver arrived in a silver truck"
  ❖ D3: "Shipment of gold arrived in a truck"
  ❖ Q: "gold silver truck"

  ID  Term Ti   df(Ti)  idf(Ti)
   1  a         3       0
   2  arrived   2       0.176
   3  damaged   1       0.477
   4  delivery  1       0.477
   5  fire      1       0.477
   6  gold      2       0.176
   7  in        3       0
   8  of        3       0
   9  silver    1       0.477
  10  shipment  2       0.176
  11  truck     2       0.176

  (M = 11 terms, N = 3 documents; idf taken as log10)

C) Vector Space Retrieval (Example, cont.)
❖ Document-term matrix Aᵀ (only non-zero columns shown):

  doc  T2     T3     T4     T5     T6     T9     T10    T11
  D1          0.477         0.477  0.176         0.176
  D2   0.176         0.477                0.954         0.176
  D3   0.176                       0.176         0.176  0.176
  Q                                0.176  0.477         0.176

❖ Use the inner vector product to rank the documents:
  RSV = Aᵀq = ( 0.031, 0.486, 0.062 )ᵀ
❖ This yields the ranking D2 (0.486) before D3 (0.062) before D1 (0.031)

C) Vector Space Retrieval (cont.)
❖ There are many more methods to determine the vector representation and to compute Retrieval Status Values (RSV)
❖ Main assumption: terms occur independently of each other in documents
  ❖ This is actually not true, e.g. if one writes about "Mercedes", the term "car" is likely to co-occur in the document.
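The worked example above can be reproduced with a short, self-contained sketch (whitespace tokenization is a simplifying assumption, and idf is taken as log10, matching the example's numbers):

```python
import math

# documents and query from the example above
docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

terms = sorted({t for text in docs.values() for t in text.split()})
N = len(docs)

def df(term):
    """Number of documents containing the term."""
    return sum(1 for text in docs.values() if term in text.split())

def tfidf(text):
    """Vector of tf(T, text) * idf(T) over all collection terms."""
    words = text.split()
    return [words.count(t) * math.log10(N / df(t)) for t in terms]

q = tfidf(query)
for name, text in docs.items():
    d = tfidf(text)
    rsv = sum(qi * di for qi, di in zip(q, d))   # inner vector product
    print(name, round(rsv, 3))                   # D1 0.031, D2 0.486, D3 0.062
```

Note that "silver" occurs twice in D2 (tf = 2), which is why D2 dominates the ranking for this query.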
❖ Advantages:
  ❖ Simple model with efficient algorithms
  ❖ Partial-match queries possible
  ❖ Very good retrieval quality, but not state-of-the-art
❖ Disadvantages:
  ❖ Many heuristics and simplifications, no proof of "correctness"
  ❖ HTML/Web: the occurrence of terms is not the most important criterion for ranking documents (spamming)

D) Probabilistic Retrieval
❖ Basic idea: given a query and a document, estimate the probability that the user considers the document relevant
❖ Requires user interaction: the user's choices are fed back into the probabilistic reasoning
❖ Advantages:
  ❖ Documents are ordered by decreasing probability of being relevant
  ❖ Efficient evaluation possible
❖ Disadvantages:
  ❖ Partially relies on rough estimates (heuristics)
  ❖ Frequency and position of terms not considered
  ❖ The assumption of term independence does not hold

E) Latent Semantic Indexing (LSI)
❖ Vector space retrieval maps documents to points in the M-dimensional term space; this is not sufficient:
  ❖ there are correlations between terms (synonyms!)
  ❖ the M-dimensional space may be too high-dimensional
❖ Basic idea: transform the document vectors into a low-dimensional space
  ❖ The new dimensions are no longer bound to individual terms
  ❖ The new dimensions should denote concepts encompassing several terms
❖ This transformation is called Latent Semantic Indexing

E) Latent Semantic Indexing (cont.)
❖ Advantages:
  ❖ Synonyms are automatically detected
  ❖ Simplifies term extraction:
    • no dictionary or ontology required
    • different languages and cross-language retrieval for free
    • stemming not necessary
  ❖ Good retrieval quality
❖ Disadvantages:
  ❖ Extremely expensive; fast algorithms for parallel computation would be necessary (but are not available)
  ❖ Retrieval quality not much better than with other methods

Web Retrieval / the ordering problem
❖ What to show first? Most result sets contain more than 100'000 documents with an RSV > 0
❖ But not all documents are relevant
  • e.g. the query "Ford" returns 1'510'000'000 results (Google, April 2012)
  • 1st rank: the car manufacturer Ford
  • How is this possible? Search engines do not sort based on RSVs alone
❖ Classical text retrieval also lacks a defense mechanism against spamming

Ordering of Documents
❖ Today's search engines use methods similar to (but more advanced than) those discussed before; the details are secret!
❖ The ranking considers:
  a. Proximity of terms (i.e. the distance between occurrences of distinct query terms)
  b. Position in the document (URL, text of references, title, meta tag, body)
  c. "PageRank"
  d. Further criteria (advertisements, pushed content, formatting)

a.
Proximity of Terms
❖ Query: "White House"
  ❖ Document 1: "the white car stands in front of the house" (not relevant)
  ❖ Document 2: "the president entered the White House" (relevant)
  ❖ The closer the query terms are, the more relevant the text is
❖ Implementation in the Google prototype:
  ❖ each pair of query-term positions is assigned a proximity value
  ❖ the frequencies of these values form a proximity vector
  ❖ multiplying this vector with a weighting vector yields the overall proximity value of the document

b. Position in the Document
❖ Queries typically aim at the title (heading)
  ❖ e.g. "White House" rather than "Central Executive Place"
❖ Users often look for brands, persons, and firms
❖ External links to a page contain good descriptions of it
  ❖ e.g. the query "eth lausanne" is answered with the home page of EPFL, although that page does not contain the term "ETH"
❖ Pages are more relevant if the query terms appear in the title, with special visual attributes, or in external references
❖ Google:
  ❖ counts the occurrences of terms along these dimensions
  ❖ multiplies the frequencies with well-chosen weights
  ❖ sums these values into a second relevance value for the document
  ❖ contains mechanisms to cut off spamming

c. PageRank
❖ Idea: more inbound links = more relevant, since a surfer is more likely to land on that page
❖ Problems: not every page is equally important + spamming
❖ Improved algorithm:
  ❖ A random surfer clicks with probability p on an outgoing link
  ❖ With probability 1-p the surfer jumps to another, arbitrary page (bookmark, typed URL)
  ❖ The PageRank of a page is the probability that the random surfer lands on that page (after a number of steps)

c. PageRank (cont.)
❖ Notation:
  ❖ A      an arbitrary page
  ❖ L(A)   set of pages that have a link to A
  ❖ N(A)   number of outgoing links of page A
  ❖ PR(A)  PageRank of page A
  ❖ p      probability that the surfer follows an outgoing link
❖ Definition of PageRank:

  PR(A) = (1-p) + p ∙ Σ_{B ∈ L(A)} PR(B) / N(B)

c. PageRank (cont.)
❖ Definition of PageRank:

  PR(A) = (1-p) + p ∙ Σ_{B ∈ L(A)} PR(B) / N(B)

❖ The first part denotes the freedom of the surfer to jump to an arbitrary page (with probability 1-p) rather than follow a link (with probability p).
❖ The value of a link is given by the PageRank of the source page divided by the number of outgoing links on that page; this simulates the random surfer's freedom to follow any of those links.

c. PageRank (cont.)
❖ The formula is recursive! The PageRank can be computed by a fixed-point iteration:
  1. Assign arbitrary initial values PR(A) to all pages A
  2. Compute PR'(A) according to the formula
  3. If |PR'(A)-PR(A)| is sufficiently small, PR(A) = PR'(A) is the solution; otherwise repeat with PR ← PR'
❖ Experimental evidence: solving the fixed point takes only a few iterations (<100)
❖ The PageRank computation is minimal compared to the crawling effort
❖ Google uses PageRank in combination with other criteria

d.
Further Criteria
❖ Bought ranking positions
  ❖ Search engines get money for placing pages at the top (advertisements; Google: AdWords)
❖ Length of the URL
  ❖ A query for "ford" may be answered by the following pages:
    • http://www.ford.com/
    • http://www.ford.com/HelpDesk/
    • http://www.careers.ford.com/main.asp
    • http://www.ford.com/servlet/ecmcs/ford/index.jsp?SECTION=ourServices&LEVEL2=rentalsFromDeale
  ❖ Shorter URLs (home pages) are ranked at higher positions
❖ User feedback
  ❖ Count result clicks, increase the relevance in subsequent queries
❖ Formatting
  ❖ 2015: Google "honors" sites formatted for mobile devices

Search Engine Optimization (SEO)
❖ How does Google learn about your new page, fast?
  ❖ Announce your page to Google (Webmaster tools)
❖ How to influence the ranking of your page?
  ❖ Companies and their products want (must) appear in the first 10 results
  ❖ Two strategies:
    1. Paid entries (AdWords)
    2. Improve "organic search" ➡ SEO (Search Engine Optimization)
  ❖ SEO is an established marketing strategy; consultants, spam …

Search Engine Optimization (cont.)
❖ How to improve "organic search"?
  ❖ optimize HTML code and information structure
  ❖ increase relevance to specific keywords
  ❖ increase the number of inbound links
❖ It is also in Google's interest that your web page content is accurately entered into their index!
  ❖ Helps to better serve advertisements
  ❖ Google provides guidelines (search-engine-optimization-starter-guide.pdf)
    • only the first 100k of a page matters
    • use header tags, meta tags, site maps, etc.
❖ But: one could also create fake pages (social media) that point to the main page to be boosted
  ❖ There is a market for this, too!
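The fixed-point iteration for PageRank described in section c can be sketched in a few lines. This is a toy sketch on a hypothetical three-page web; dangling pages without outgoing links are not handled:

```python
def pagerank(links, p=0.85, iters=100):
    """Fixed-point iteration of PR(A) = (1-p) + p * sum_{B in L(A)} PR(B)/N(B)."""
    pages = list(links)
    pr = dict.fromkeys(pages, 1.0)            # arbitrary initial values
    for _ in range(iters):
        pr = {a: (1 - p)
                 + p * sum(pr[b] / len(links[b]) for b in pages if a in links[b])
              for a in pages}
    return pr

# hypothetical three-page web: page -> set of outgoing links
links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
pr = pagerank(links)
print({a: round(v, 3) for a, v in pr.items()})   # C and A accumulate the most rank
```

With this (non-normalized) variant of the formula the ranks sum to the number of pages rather than to 1; B scores lowest because only half of A's rank flows to it.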
Web Directories
❖ Automatic indexing is not always optimal (despite ranking)
  ❖ Manually gathered and edited links are often more relevant
  ❖ Examples: yellow pages, classified advertisements
❖ Web directories: links are organized in a hierarchy
  ❖ URLs often submitted by site owners,
  ❖ edited by humans (professional editors, volunteers)
  ❖ requires a classification of terms into categories and sub-categories

Web Directories - Pros/Cons
❖ Advantage over search engines:
  ❖ human classification is better than automatic "spiders"
❖ Disadvantages:
  ❖ lists sometimes outdated (robots help)
  ❖ new pages listed late (search engines are faster)
❖ Yahoo was the king of web directories (before Google), but it stopped its famous directory in 2014!

Web Directories - Examples
❖ Yahoo (http://www.yahoo.com)
  ❖ human editors collect lists of essential web pages
  ❖ organize web pages in a hierarchy
  ❖ shut down Dec 2014
❖ Google Directory
  ❖ content mainly from DMOZ
  ❖ shut down in 2011
❖ DMOZ (= directory.mozilla): Open Directory Project (http://www.dmoz.org)
  ❖ labor distributed to volunteer editors ("net-citizens")
  ❖ multilingual
  ❖ used by other search engines
  ❖ 2010: 4.7∙10⁶ entries

Web 3.0
❖ What pages shall a search engine display for the query "jaguar"?
❖ A car? An animal? An operating system?
❖ Approach: important terms in web pages carry meta-information that helps to eliminate ambiguities:
  ❖ e.g. <item rdf:about="http://dbpedia.org/resource/Cat">Cat</item>
❖ Advantage: programs (crawlers, indexers, and in the future automatic search agents for end users) "understand" the web
❖ Vision by Tim Berners-Lee: "...the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines."

Ontologies
❖ An ontology (Greek: onto = "being: that which is") is the study of what exists
❖ Concepts about the world must be standardized: cat = Katze = chat = gatto = kissa = ...
❖ Ontologies in information science: a "formal, explicit specification of a shared conceptualization"
  ❖ Usually implemented as a domain-specific vocabulary with attributes, relations, etc.
❖ Frameworks and description languages (no content):
  ❖ RDF = Resource Description Framework
  ❖ OWL = Web Ontology Language

Ontologies vs. Folksonomies
❖ The creation of ontologies cannot be automated; ontologies are human artifacts
❖ There is not a single ontology, but a growing set of competing and complementing ontologies (e.g. Cyc, WordNet, ...)
❖ Ontologies are published under different license models
❖ Ontologies are either
  ❖ created by experts, or
  ❖ created by the public = folksonomies
    • Folksonomy = an ontology derived collaboratively
    • aka collaborative tagging, social classification, social indexing, social tagging

Linked Open Data (LOD)
❖ Public data, accessible via URIs, classified with RDF and OWL (Web Ontology Language)
❖ A world-wide network, also called the "Linked Data Cloud"
❖ Web Consortium long-term plan: unification of all databases

Linked Open Data (LOD)
❖ Browsers: see http://en.wikipedia.org/wiki/Linked_data#Browsers
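The earlier term-extraction idea, namely that an occurrence of "dog" may also be indexed as "animal" with a smaller weight, can be sketched with a toy hypernym table (a hand-made stand-in for a real ontology such as WordNet; terms and decay factor are illustrative):

```python
# toy ontology: term -> broader term (hypernym), None at the root
hypernym = {"dog": "animal", "cat": "animal", "animal": "organism", "organism": None}

def index_weights(term, decay=0.5):
    """Index a term together with its ontology ancestors at decaying weight,
    so a query for "animal" also matches documents mentioning "dog"."""
    weights, w = {}, 1.0
    while term is not None:
        weights[term] = w
        term, w = hypernym.get(term), w * decay
    return weights

print(index_weights("dog"))   # {'dog': 1.0, 'animal': 0.5, 'organism': 0.25}
```

A retrieval system could add these weighted ancestor terms to the document vector at indexing time, making ontology relations usable with the ordinary ranking machinery.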