Semantic annotations: the search units of the future?
Hugo Zaragoza, Yahoo! Research, Barcelona
Jornadas MAVIR, Madrid, Nov. 2007

Hugo Zaragoza, MAVIR, 2007.

Document Understanding
[Cartoon: a scale of complexity of document understanding with the labels grep, search engines, Q&A, semantic web?, domain expert, and “our work!”.]

Beyond strings, beyond bag of words
“In the room the women come and go
Talking of Michelangelo.”
• { ARTIFACT, PEOPLE, VERB MOTION, VERB COMMUNICATION, PERSON }
• { room, women, come, go, talking, Michelangelo }
• { { ARTIFACT room --MODIFIER, PEOPLE women --SUBJECT, MOTION come and go --VERB }, { COMMUNICATION talking --VERB, PERSON Michelangelo --OBJECT } }
• { PEOPLE–MOTION, COMMUNICATION–PERSON, Michelangelo }
Applications in Search, Algorithmic Advertisement, Answers, …

Structure & Domain Knowledge
• Domain Independent / “Verticals” (Specialised Search)
 – Examples: WWW, RSS Feeds, Blogs, News, Mail, Y! Answers, Health Search
 – Tasks: find documents relevant to a string.
 – Methods: string matching (tf-idf), hyperlink popularity.
• Domain Dependent
 – Examples: Craig’s List, Ryanair, InfoZoom, FaceBook
 – Tasks: flight booking (find flights), housing (find apartment), hiring people (find CVs), …
 – Methods: match + domain-based ranking.

The NLRA Pipeline
• Segmentation, Tokenisation
• POS
• Word Semantics
• Named Entities
• Dependency Parser
• SuperSense Tagger
• Open Source
• Kryptonise, Pigcise, …
• Corpus Adaptation
• Sentiment Analysis (example)
• Gazetteers
• Code Documentation & Support
• Anaphora Resolution
• Pipeline Server
• Multilingual Support

Statistical Methods for Semantic Tagging
[Figure: example label sequence with the tags NULL, I-PERSON, B-PERSON, CARDINAL.]
• Sequence bracketing task:
 – Model: Collins Parser (Avg. Perceptron-HMM tagger)
 – Features: tokens, POS, word shape, most frequent, previous label, combinations.
(Massimiliano Ciaramita and Yasemin Altun, EMNLP 2006)

Example: WordNet Supersense labels

Extracting Entities

Statistical Methods for Semantic Tagging

Method              Recall   Precision   F-score
Rand                42.99    38.17       40.44
Baseline            69.25    63.90       66.47
Supersense-Tagger   77.71    76.65       77.18

(Massimiliano Ciaramita and Yasemin Altun, EMNLP 2006)

Semantic Web & User annotations
• Microformat
• Categories, lists…

NLRA Search Engine
• Fast inverted index (LUCENE) (IXE): normalized text, unit of retrieval.
• TAG-aware forward index (UIMA): surface and normalised text, POS, overlapping semantic tags, resolved anaphoras, …
[Diagram: query, Inverted Index, docID / unit Id, Forward Index, Feature extractor, (Id, score) lists.] (demo)

NLRA “Server”
[Diagram: Corpus, Pipeline, Index, Forward Index, Tag Graph, “Your Killer Application!”.]
• Search Engine (C++, IXR)
• Graph Engine (Java, WebGraph)
• NLR Search Engine
• RMI & REST APIs

Written Language

Applications II: Better Operators

Entity Extractor + Dependency Extractor
• Type: LOCATION, PEOPLE, ORGANIZATION, DATE, …
• Anchor:
• Role: SUBJECT, OBJECT, MODIFIER, …
• Target: VERB, SUBJECT, OBJECT, MODIFIER, …

Sentiment Analysis
• Informational
• Generic/Specific
• Objective/Subjective
• Positive/Negative
• …

The problem: ranking (very many) (typed) entities
• Millions of unique entities, dozens of types: one model per entity is infeasible.
• We explore on-line (ad-hoc) models [Zaragoza et al., CIKM07].

Task examples

SW0: a publicly available Semantic Snapshot of Wikipedia
• 6.2K English Wikipedia entries.
 – Sentence and token splitting
 – POS, NEs, Semantic tagging (WNSS).
 – Dependency Parsing.
• 28M unique entities, 5.5M occurrences.
• link

Entity Containment Graph
[Diagram: query, Wikipedia search, Sentences.]

Written Language

Problems
• Descriptive entities are not as interesting:
 – “person” vs. “Picasso”
 – “city” vs. “Paris”
• Easy fix: discount by global entity frequency.
• But it would be better to introduce types:
 – “Bush” > “food” !

Refining Entity Graphs
• Different type semantics

Detailed Results

Web-RankDiscounted method
• QUERY = “Life of Pablo Picasso”: a WWW search for the query gives the result set QUERY_SET.
• ENTITY = “Gertrude Stein”: a WWW search for the entity gives the ranked list ENTITY_RANK (1., 2., 3., …).
• Score(ENTITY, QUERY) = AvgPrec [ rank=ENTITY_RANK, rel=QUERY_SET ]

Detailed Results

Thank you.
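
Sketch of the Web-RankDiscounted score: the slide defines Score(ENTITY, QUERY) as the average precision of ENTITY_RANK when the documents in QUERY_SET are treated as relevant. The Python below is a minimal illustration of that computation, not the implementation used in the talk. The function name, the document IDs, and the choice to normalise by the size of QUERY_SET (the usual definition of average precision) are assumptions, and the ranked list is assumed to contain no duplicates.

```python
from typing import Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of a ranked list against a set of relevant IDs.

    Assumption: the sum of precision-at-hit values is divided by the
    total number of relevant IDs, so relevant documents that never
    appear in the ranking lower the score.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision measured at the rank of this relevant document.
            precision_sum += hits / position
    return precision_sum / len(relevant)


# Hypothetical document IDs standing in for web search results.
entity_rank = ["d3", "d7", "d1", "d9", "d4"]  # results for the entity
query_set = {"d7", "d4", "d8"}                # results for the query

score = average_precision(entity_rank, query_set)
print(f"Score(ENTITY, QUERY) = {score:.3f}")
```

An entity whose search results overlap with the query's results near the top of its ranking gets a high score; an entity with no overlap scores zero.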