Semantic Annotations: Search Units of the Future?

Hugo Zaragoza,
Yahoo! Research,
Barcelona
Jornadas MAVIR
Madrid, Nov.07
Hugo Zaragoza, MAVIR, 2007.
Document Understanding Cartoon
[Figure: Complexity of Document Understanding scale: grep, search engines, Q&A, semantic web?, domain expert; marked “our work!”]
Beyond strings, beyond bag of words
“In the room the women come and go
Talking of Michelangelo.”
{ ARTIFACT, PEOPLE, VERB MOTION, VERB COMMUNICATION, PERSON }
{ room, women, come, go, talking, Michelangelo }
{ ARTIFACT room --MODIFIER, PEOPLE women --SUBJECT, MOTION come and go --VERB }
{ COMMUNICATION talking --VERB, PERSON Michelangelo --OBJECT }
{ PEOPLE–MOTION, COMMUNICATION–PERSON, Michelangelo }
Applications in Search, Algorithmic Advertisement, Answers, …
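The annotated views of the verse above can be held in a small data structure. The following is a minimal sketch, not the representation used in the talk: the `Annotation` class, the `clauses` list and the `find` helper are hypothetical names introduced for illustration; only the tags, words and roles come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    tag: str   # semantic tag from the slide, e.g. PEOPLE
    text: str  # surface text the tag covers
    role: str  # dependency role from the slide, e.g. SUBJECT


# The two annotated groups shown on the slide (grouping is taken as given).
clauses = [
    [
        Annotation("ARTIFACT", "room", "MODIFIER"),
        Annotation("PEOPLE", "women", "SUBJECT"),
        Annotation("MOTION", "come and go", "VERB"),
    ],
    [
        Annotation("COMMUNICATION", "talking", "VERB"),
        Annotation("PERSON", "Michelangelo", "OBJECT"),
    ],
]


def find(clauses, tag=None, role=None):
    """Return the surface text of annotations matching a tag and/or a role."""
    return [
        a.text
        for clause in clauses
        for a in clause
        if (tag is None or a.tag == tag) and (role is None or a.role == role)
    ]
```

With this structure a query can ask for "a PERSON that is an OBJECT" instead of matching the string "Michelangelo", which is the step beyond bag of words that the slide illustrates.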
Structure & Domain Knowledge
Domain Independent, “Verticals”, Specialised Search:
• WWW, RSS Feeds, Blogs, News, Mail, Y! Answers, Health Search
Tasks:
• Find content relevant to a string.
Methods:
• String matching (tf-idf)
• Hyperlink Popularity
Domain Dependent:
• Craig’s List, Ryanair, InfoZoom, FaceBook
Tasks:
• AirFlight Booking (find flights)
• Housing (find apartment)
• Hire People (find CVs)
• …
Methods:
• Match + Domain-Based Ranking
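The domain-independent method named above, string matching with tf-idf, can be sketched in a few lines. This is a textbook variant written for illustration, not the scoring used by any system in the talk; the function name `tfidf_rank`, the whitespace tokenisation and the example documents are assumptions.

```python
import math
from collections import Counter


def tfidf_rank(query, docs):
    """Rank document indices by summed tf * idf over the query terms."""
    tokenised = [d.lower().split() for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenised for t in set(toks))
    scored = []
    for i, toks in enumerate(tokenised):
        tf = Counter(toks)
        score = sum(
            tf[t] * math.log(n_docs / df[t])
            for t in query.lower().split()
            if t in tf
        )
        scored.append((score, i))
    # Highest score first; ties broken by document order.
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


docs = ["cheap flights to madrid", "apartment in madrid", "flights flights deals"]
ranking = tfidf_rank("flights", docs)
```

A domain-dependent system would start from a match like this and then re-rank with domain knowledge (price, dates, location), which is what "Match + Domain-Based Ranking" refers to.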
The NLRA Pipeline
• Segmentation, Tokenisation
• POS
• Word Semantics
• Named Entities
• Dependency Parser
• SuperSense Tagger
• Open Source
• Kryptonise, Pigcise, …
• Corpus Adaptation
• Sentiment Analysis (example)
• Gazetteers
• Code Documentation & Support
• Anaphora Resolution
• Pipeline Server
• Multilingual Support
Statistical Methods for Semantic Tagging
[Figure: example label sequence: NULL, I-PERSON, B-PERSON, CARDINAL]
• Sequence bracketing task:
– Model: Collins Parser (Avg. Perceptron-HMM tagger)
– Features: tokens, POS, word shape, most frequent, previous label, combinations.
(Massimiliano Ciaramita and Yasemin Altun, EMNLP 2006)
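The feature types listed above can be illustrated with a small extractor. This is a sketch under assumptions, not the feature set of the cited tagger: `word_shape` and `token_features` are hypothetical helpers, the shape encoding (X, x, d with repeats collapsed) is one common convention, and the "most frequent" feature is omitted because the transcript does not define it.

```python
import re


def word_shape(token):
    """Map a token to a coarse shape, e.g. 'Michelangelo' -> 'Xx'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Collapse runs of the same symbol so long words share a shape.
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(tokens, pos_tags, i, prev_label):
    """Features for position i, given the label predicted for position i-1."""
    feats = {
        "token": tokens[i].lower(),
        "pos": pos_tags[i],
        "shape": word_shape(tokens[i]),
        "prev_label": prev_label,
    }
    # One example of a combination feature.
    feats["pos+prev_label"] = f"{pos_tags[i]}|{prev_label}"
    return feats


feats = token_features(["Talking", "of", "Michelangelo"], ["VBG", "IN", "NNP"], 2, "NULL")
```

A sequence tagger would score each candidate label (for example B-PERSON) from a weighted sum over such features, with the previous label linking neighbouring decisions.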
Example: WordNet Supersense labels
Extracting Entities
Statistical Methods for Semantic Tagging
Method              Recall   Precision   F-score
Rand                42.99    38.17       40.44
Baseline            69.25    63.90       66.47
Supersense-Tagger   77.71    76.65       77.18
(Massimiliano Ciaramita and Yasemin Altun, EMNLP 2006)
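The F-score figures reported above are consistent with the harmonic mean of precision and recall. The short check below recomputes them from the reported values; the function name `f_score` and the `results` dictionary are illustrative, and the numbers are copied from the slide.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F-score) as given on the slide.
results = {
    "Rand": (42.99, 38.17, 40.44),
    "Baseline": (69.25, 63.90, 66.47),
    "Supersense-Tagger": (77.71, 76.65, 77.18),
}

recomputed = {name: f_score(p, r) for name, (r, p, _) in results.items()}
```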
Semantic Web & User annotations
• Microformat
• Categories, lists…
NLRA Search Engine
• Fast inverted index (LUCENE), TAG-aware (IXE): normalized text, unit of retrieval.
• Forward index (UIMA): surface and normalised text, POS, overlapping semantic tags, resolved anaphoras, …
[Diagram: query, Inverted Index, Forward Index and Feature extractor, exchanging docIDs, unit Ids and (Id, score) pairs]
(demo)
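The division of labour between the two indexes can be sketched as follows. This is a toy in-memory illustration, not the Lucene, IXE or UIMA APIs: the class `TwoStageIndex`, its methods and the example data are all hypothetical. The inverted index answers "which units contain these terms", and the forward index is then read to extract annotations from the matching units.

```python
from collections import defaultdict


class TwoStageIndex:
    def __init__(self):
        self.inverted = defaultdict(set)  # normalised term -> unit ids
        self.forward = {}                 # unit id -> [(surface text, tag)]

    def add(self, unit_id, annotated_tokens):
        """Index one retrieval unit (e.g. a sentence) with its tags."""
        self.forward[unit_id] = annotated_tokens
        for text, _tag in annotated_tokens:
            self.inverted[text.lower()].add(unit_id)

    def search(self, query, tag):
        """Find units matching all query terms, then extract entities of one tag."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self.inverted.get(t, set()) for t in terms))
        return [
            (uid, [text for text, t in self.forward[uid] if t == tag])
            for uid in sorted(hits)
        ]


index = TwoStageIndex()
index.add(1, [("Picasso", "PERSON"), ("met", "VERB"), ("Stein", "PERSON"), ("Paris", "LOCATION")])
index.add(2, [("Picasso", "PERSON"), ("painted", "VERB"), ("Guernica", "ARTIFACT")])
found = index.search("picasso", "LOCATION")
```

Keeping the annotations out of the inverted index keeps term lookup fast, at the cost of a second read per matching unit.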
NLRA “Server”
• Corpus
• Pipeline
• Index, Forward Index, Tag Graph
• Search Engine (C++, IXR)
• Graph Engine (Java, WebGraph)
• NLR Search Engine
• RMI & REST APIs
• Your Killer Application!
Written Language
Applications II: Better Operators
Entity Extractor +
Dependency Extractor
Type:
• LOCATION
• PEOPLE
• ORGANIZATION
• DATE
•…
Anchor:
Role:
• SUBJECT
• OBJECT
• MODIFIER
•…
Target:
• VERB
• SUBJECT
• OBJECT
• MODIFIER
•…
Sentiment Analysis
• Informational
• Generic/Specific
• Objective/Subjective
• Positive/Negative
•…
The problem:
ranking (very many) (typed) entities
Millions of unique entities, dozens of types:
→ 1 model per entity is infeasible
→ we explore on-line (ad-hoc) models
[Zaragoza et al., CIKM07]
Task examples
SW0: a publicly available Semantic Snapshot of Wikipedia
• 6.2K English Wikipedia entries.
– Sentence and token splitting
– POS, NEs, Semantic tagging (WNSS).
• 28M unique entities, 5.5M occurrences.
– Dependency Parsing.
Entity Containment Graph
[Diagram: query, Wikipedia search, Sentences]
Problems
• Descriptive entities are not as interesting:
– “person” vs. “Picasso”
– “city” vs. “Paris”
• Easy fix: discount by global entity frequency.
• But it’d be better to introduce types:
– “Bush” > “food” !
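The "easy fix" above can be sketched as follows. The exact discount used in the talk is not in the transcript, so the ratio below (sentences retrieved for the query that contain the entity, divided by the entity's global frequency) is an assumption, as are the name `rank_entities` and the example counts.

```python
from collections import Counter


def rank_entities(retrieved_sentences, global_freq):
    """Rank entities found in retrieved sentences, discounting frequent ones."""
    # Number of retrieved sentences each entity appears in.
    local = Counter(e for sent in retrieved_sentences for e in set(sent))
    scores = {e: count / global_freq[e] for e, count in local.items()}
    return sorted(scores, key=lambda e: (-scores[e], e))


# Entities tagged in sentences retrieved for a query (hypothetical data).
retrieved = [
    ["Picasso", "person"],
    ["Picasso", "Paris", "person"],
    ["person"],
]
# Collection-wide entity frequencies (hypothetical data).
global_freq = {"Picasso": 10, "Paris": 50, "person": 1000}

ranking = rank_entities(retrieved, global_freq)
```

By raw count "person" would rank first; after the discount the specific entities "Picasso" and "Paris" rise above it. The discount knows nothing about entity types, which is the limitation the last bullet points at.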
Refining Entity Graphs
• Different type semantics
Detailed Results
Web-RankDiscounted method
QUERY = “Life of Pablo Picasso”
ENTITY = “Gertrude Stein”
[Diagram: WWW search for the ENTITY gives the ranked list ENTITY_RANK (1., 2., 3., …); WWW search for the QUERY gives QUERY_SET]
Score(ENTITY, QUERY) = AvgPrec [ rank=ENTITY_RANK, rel=QUERY_SET ]
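The score above is an average precision in which the entity's ranked list plays the role of the ranking and the query's result set plays the role of the relevant set. A minimal sketch follows, assuming results are identified by URL and that the sum is normalised by the size of the relevant set (one common convention; the talk does not say which was used). The function name and the example URLs are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


# ENTITY_RANK: ranked web results for the entity (hypothetical URLs).
entity_rank = ["url_a", "url_b", "url_c", "url_d"]
# QUERY_SET: web results for the query, treated as the relevant set.
query_set = {"url_a", "url_c"}

score = average_precision(entity_rank, query_set)
```

An entity scores highly when pages that answer the query also appear near the top of the entity's own result list.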
Detailed Results
Thank you,