Multilingual Word Sense Disambiguation
and Entity Linking
COLING Tutorial – 24th August 2014
Roberto Navigli
Andrea Moro
http://lcl.uniroma1.it
ERC Starting Grant MultiJEDI No. 259234
BabelNet goes to the Multilingual Semantic Web – ESWC 2014 tutorial
Roberto Navigli and David Jurgens
The instructors
• Roberto Navigli, associate professor, Department of
Computer Science, Sapienza University of Rome
• Andrea Moro, PhD student, Department of Computer
Science, Sapienza University of Rome
Multilingual Word Sense Disambiguation and Entity Linking – COLING 2014 Tutorial
Roberto Navigli and Andrea Moro
But… just to make sure you chose the right room!
Tutorial Outline
• Foundations in Semantic Processing
• Basic concepts, terminology, and examples
• Motivations for incorporating multilinguality
• Constructing Multilingual Semantic Resources
• Methods for building new resources by combining
heterogeneous resources in many languages
• How multilingual representations solve current
problems
• Multilingual Word Sense Disambiguation and Entity
Linking
• How Entities and Concepts Differ
• Methods for identifying each in any language
And, if you resist until the coffee break…
you will…
…receive a prize!!!
A BabelNet t-shirt!!!
[model is not included]
Projects thanks to which this tutorial exists
MultiJEDI (€1.3M): ERC Starting Grant
LIDER (€1.5M): EU CSA
Google Focused Research Award ($200k)
Part 1:
Foundations
Understanding a simple phrase
Barack Obama peruses the internet.
Natural language is ambiguous
Listen to some rock!
Multilingual Semantic Processing with BabelNet – LREC 2014 Tutorial
Roberto Navigli and David Jurgens
I cannot hear anything…
Natural language is ambiguous
Yesterday, I saw an underground rock concert
Natural language is highly ambiguous
Underground rock concert
• a music event
Underground rock formation
• a stone structure
Formation of an underground rock concert
• setup and planning for a music event
(?) A concert of underground rock formations
• (metaphoric) harmoniously arranged stone structures
We need knowledge of a phrase’s semantics
State-of-the-art Machine Translation
EN: These are movies in which the music genre, e.g. rock,
is an important element but not necessarily central to
the plot. Examples are Easy Rider (1969), The
Graduate (1969), and Saturday Night Fever (1978).
IT: Questi sono i film in cui il genere musicale, ad es
roccia, è un elemento importante, ma non
necessariamente al centro della trama.
State-of-the-art Machine Translation
EN: Knowledge of the distribution of underground rock
densities can assist in interpreting subsurface geologic
structure and rock type.
Danger here!
IT: La conoscenza della distribuzione di densità di rock
underground può aiutare a interpretare in sottosuolo
struttura geologica e tipo di roccia.
The Multilingual, Big-Picture Goal
“Underground rock concert” / “언더그라운드 락 콘서트” → Black Box → [semantic representation] → NLP Applications
“Underground rock formation” / “지하 암석” → Black Box → [semantic representation] → NLP Applications
The General Problem
POLYSEMY
• The most frequent words have several
meanings!
• Our job: model meaning from a
computational perspective
Monosemous vs. Polysemous words
• Monosemous words have only one meaning
– Examples:
• plant life
• internet
• Polysemous words have more than one
meaning
– Example: bar
– “a room or establishment where alcoholic drinks are
served”
– “a counter where you can obtain food or drink”
– “a rigid piece of metal or wood”
– “musical notation for a repeating pattern of musical
beats”
The Triangle of Meaning (Semiotic Triangle)
Writer:
- object evokes thought
- refers to object with symbol
Reader:
- symbol evokes thought
- refers symbol to the object
Concept
(thought)
Symbol (sign)
“dog”
“cane”
“犬”
Object
(referent)
What is a word sense?
• A word sense is a commonly-accepted
meaning of a word:
– We are fond of fruit such as kiwi/fruit and banana.
– The kiwi/bird is the national bird of New Zealand.
• How to represent word senses?
– Can we enumerate the senses of a word?
– “Kiwi is my mother tongue, but I also speak all other
English languages”
What is a word sense?
sense
(thought)
Symbol (sign)
“kiwi”
“киви”
Object
(referent)
Word Senses
• The bank1 holds the mortgage on my home.
• The river overflowed its banks2 this year.
• He walked to the bank3 on the street corner.
• The treasures were buried in banks4 of dirt.
Word Senses: Homonymy
• The bank1 holds the mortgage on my home.
• The river overflowed its banks2 this year.
• He walked to the bank3 on the street corner.
• The treasures were buried in banks4 of dirt.
Homonymy: two senses share an orthographic form
(e.g., bank), but are semantically and etymologically
unrelated (different lemmas!)
Word Senses: Polysemy
• The bank1 holds the mortgage on my home.
• The river overflowed its banks2 this year.
• He walked to the bank3 on the street corner.
• The treasures were buried in banks4 of dirt.
Polysemy: two senses are very close to each
other semantically
How do we represent and
encode semantics?
What comes out of the black box?
How do we represent and encode semantics?
• Thesauri
• Groups words according to similar meaning
• Relations between groups (e.g., narrower meanings)
• Roget’s Thesaurus (1911)
• Machine Readable Dictionaries
• Enumerates all meanings of a word
• Includes definitions, morphology, example usages, etc.
• Oxford Dictionary of English, LDOCE, Collins, etc.
• Computational Lexicons
• Repositories of structured knowledge about a word's semantics and
syntax
• Include relations like hypernymy, meronymy, or entailment
• WordNet
Senses and Relations in WordNet
• Each meaning is encoded as a synset (synonym set), which is a
collection of synonymous senses
• Semantic relations between synsets
– Hypernymy (car#n#1 is-a motor vehicle#n#1)
– Meronymy (car#n#1 has-a car door#n#1)
– Entailment, similarity, attribute, etc.
• Lexical relations between word senses
– Antonymy (good#a#1 antonym of bad#a#1)
– Pertainymy (dental#a#1 pertains to tooth#n#1)
– Nominalization (service#n#2 nominalizes serve#v#4)
WordNet [Miller et al., 1990; Fellbaum, 1998]
[Figure: excerpt of the WordNet graph. Nodes are concepts (synsets): {wheeled vehicle}, {self-propelled vehicle}, {motor vehicle}, {car, auto, automobile, machine, motorcar}, {convertible}, {golf cart, golfcart}, {tractor}, {locomotive, engine, locomotive engine, railway locomotive}, {wagon, waggon}, {wheel}, {brake}, {splasher}, {air bag}, {car window}, {accelerator, accelerator pedal, gas pedal, throttle}. Edges are semantic relations labeled is-a and has-part.]
Wordnets in other Languages
• EuroWordNet (Vossen, 1998)
• BalkaNet (Tufis et al., 2004)
• Multilingual Central Repository (Atserias et al., 2003)
• GermaNet (Hamp and Feldweg, 1997)
• SloWNet (Fišer and Sagot, 2008)
• WOLF (Sagot and Fišer, 2008)
• Hungarian WN (Miháltz et al., 2008)
• Japanese WN (Isahara et al., 2008)
• …
Currently 73 unique wordnets: http://globalwordnet.org/wordnets-in-the-world/
[Figure labels: MultiWordNet, WOLF, BalkaNet, MCR, WordNet, GermaNet]
An ideal resource for Multilingual Semantic
Processing
• Capable of representing the meaning of a piece of text as
word senses in any language
• broad coverage of different senses, including
language-specific senses
• currently problematic for many language-specific
wordnets
• Encodes semantic and syntactic relationships between
the synsets
• Highly beneficial for NLP applications
• Encodes definitions and usages for synsets
Part 2:
Building resources for
multilingual semantic
processing
Objective and motivation
Goal:
• A large repository of knowledge in a multilingual setting
Motivations:
• A common ground for language technologies that brings
together:
• Multilinguality
• Encyclopedic knowledge
• Lexicographic knowledge
• Semantic relations
• Textual definitions
• Domain information
• …
The Richer, The Better
Highly-interconnected semantic networks have a great
impact on knowledge-based WSD even in a fine-grained
setting [Navigli & Lapata, IEEE TPAMI 2010]
[Figure (source: Navigli and Lapata, 2010); annotations: "divergence point", "state-of-the-art WSD", "nirvana point!!!"]
How many meanings for «balloon»?
[Figure: entries for "balloon" in WordNet and in Wikipedia]
Core Challenges
1. Integrating and unifying heterogeneous resources
2. Managing many different languages
3. Having a wide range of semantic relations between
concepts and named entities
4. Maintaining high accuracy
This is where the ERC (and our project) comes
into play
A 5-year ERC Starting Grant (2011-2016)
on Multilingual Word Sense Disambiguation
Multilingual Joint Word Sense Disambiguation
(MultiJEDI)
Key Objective 1: create knowledge for all languages
[Figure labels: MultiWordNet, WOLF, BalkaNet, MCR, WordNet, GermaNet]
Key Objective 2: use all languages to disambiguate one
The Vision
[Diagram: MultiJEDI]
Input text in *any* language → Multilingual Joint WSD (central research objective) → Disambiguated text
WordNet + Wikipedia → ? → Multilingual Semantic Network
Automatic Acquisition of a Wide-Coverage Multilingual Semantic Network: BabelNet
Goal: Creating a Multilingual Semantic Network
Start from two large complementary resources:
WordNet: full-fledged taxonomy
Wikipedia: multilingual and continuously updated
[Figure: excerpt of the WordNet graph, with synsets such as {wheeled vehicle}, {motor vehicle}, and {car, auto, automobile, machine, motorcar} connected by is-a and has-part edges]
Get the best from both worlds
Wikipedia [The Web Community, 2001-today]
[Figure: Wikipedia pages (mock page shown: "Playing with senses") are the concepts; hyperlinks between pages are (unspecified) semantic relations]
BabelNet: concepts and semantic relations (1)
Concepts and relations in BabelNet are harvested from
WordNet and Wikipedia:
WordNet → BabelNet:
  synsets → concepts
  lexico-semantic relations → semantic relations
Wikipedia → BabelNet:
  pages → concepts
  hyperlinks → semantic relations
An example of mapping
Creation of the Wikipedia disambiguation
contexts
ctx(Balloon (aircraft)) = { }
sense label
ctx(Balloon (aircraft)) = { aircraft }
hyperlinks
ctx(Balloon (aircraft)) = { aircraft, aerostat, buoyancy,
airship, …, gondola }
categories
ctx(Balloon (aircraft)) = { aircraft, aerostat, buoyancy,
airship, …, gondola, ballooning, hydrogen, aeronautics }
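The three steps above (sense label, hyperlinks, categories) can be sketched as follows. This is only a minimal illustration: the function name `disambiguation_context` and its arguments are hypothetical, and the inputs are limited to the words actually named on the slides (the elided "…" items are not filled in).

```python
import re

def disambiguation_context(title, hyperlinks, categories):
    """Collect the context words of a Wikipedia page from three sources."""
    context = []
    # 1. Sense label: the parenthesised part of the title, e.g. "Balloon (aircraft)".
    match = re.search(r"\(([^)]+)\)\s*$", title)
    if match:
        context.append(match.group(1).strip().lower())
    # 2. Hyperlinks: titles of the pages linked from this page.
    context.extend(link.lower() for link in hyperlinks)
    # 3. Categories: labels of the categories the page belongs to.
    context.extend(cat.lower() for cat in categories)
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(context))

ctx = disambiguation_context(
    "Balloon (aircraft)",
    hyperlinks=["Aerostat", "Buoyancy", "Airship", "Gondola"],
    categories=["Ballooning", "Hydrogen", "Aeronautics"],
)
print(ctx)
```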
Building BabelNet: Mapping Wikipedia to
WordNet
Given a Wikipage w and its disambiguation context ctx(w):
For each WordNet sense s of w, calculate score(s, w) as follows:
The Wikipedia page context in the WordNet
graph
ctx(Balloon (aircraft)) = { aircraft, aerostat, buoyancy,
airship, …, gondola }
balloon#n#1
aircraft#n#1
gondola#n#1 buoyancy#n#1
airship#n#1
balloon#n#1
aerostat#n#1
balloon#n#1 -> aircraft#n#1
0.35
balloon#n#1 -> aircraft#n#1 -> airship#n#1
balloon#n#1 -> gondola#n#1
balloon#n#1 -> gondola#n#1 -> flight#n#1 -> buoyancy#n#1
balloon#n#1 -> aerostat#n#1
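The scoring formula itself did not survive in this transcript, so the sketch below shows only one plausible way to turn the listed paths into a score, under a stated assumption: each path from the candidate sense to a context sense contributes exp(-(length - 1)), so shorter paths count more. The toy graph contains only the edges visible in the paths above, and the result is unnormalised, so it is not expected to reproduce the 0.35 shown on the slide.

```python
import math

# Toy undirected graph built only from the paths listed above.
EDGES = [
    ("balloon#n#1", "aircraft#n#1"),
    ("aircraft#n#1", "airship#n#1"),
    ("balloon#n#1", "gondola#n#1"),
    ("gondola#n#1", "flight#n#1"),
    ("flight#n#1", "buoyancy#n#1"),
    ("balloon#n#1", "aerostat#n#1"),
]

def build_graph(edges):
    """Return an adjacency dict for an undirected graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def simple_paths(graph, source, target, max_edges):
    """Yield every cycle-free path from source to target with at most max_edges edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) - 1 == max_edges:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))

def path_score(graph, sense, context_senses, max_edges=3):
    """Sum exp(-(edges - 1)) over all short paths to the context senses (assumed weighting)."""
    total = 0.0
    for other in context_senses:
        if other == sense:
            continue
        for path in simple_paths(graph, sense, other, max_edges):
            edges = len(path) - 1
            total += math.exp(-(edges - 1))
    return total

graph = build_graph(EDGES)
context = ["aircraft#n#1", "airship#n#1", "gondola#n#1", "buoyancy#n#1", "aerostat#n#1"]
print(path_score(graph, "balloon#n#1", context))
```

In practice the score of each candidate sense would presumably be normalised over all candidate senses of the word before choosing a mapping.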
BabelNet: concepts and semantic relations (2)
We encode knowledge as a labeled directed graph:
Each vertex is a Babel synset
{balloon_EN, Ballon_DE, aerostato_ES, aerostato_IT, pallone aerostatico_IT, montgolfière_FR}
Each edge is a semantic relation between synsets:
is-a (balloon is-a aircraft)
part-of (gasbag part-of balloon)
instance-of (Einstein instance-of physicist)
…
unspecified/relatedness (balloon related-to flight)
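A labeled directed graph of this kind can be sketched with two small classes. The class names, the `bn:` identifiers, and the method names below are hypothetical and chosen for illustration only; they are not the BabelNet API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BabelSynset:
    synset_id: str            # hypothetical identifier
    lexicalizations: tuple    # (language, lemma) pairs

@dataclass
class SemanticNetwork:
    synsets: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (source_id, relation, target_id)

    def add_synset(self, synset):
        self.synsets[synset.synset_id] = synset

    def add_relation(self, source_id, relation, target_id):
        # Both endpoints must already be vertices of the graph.
        for synset_id in (source_id, target_id):
            if synset_id not in self.synsets:
                raise KeyError(synset_id)
        self.edges.append((source_id, relation, target_id))

    def related(self, source_id, relation=None):
        """Outgoing (relation, target) pairs, optionally filtered by relation label."""
        return [(r, t) for s, r, t in self.edges
                if s == source_id and (relation is None or r == relation)]

net = SemanticNetwork()
net.add_synset(BabelSynset("bn:balloon", (("EN", "balloon"), ("DE", "Ballon"),
                                          ("IT", "pallone aerostatico"))))
net.add_synset(BabelSynset("bn:aircraft", (("EN", "aircraft"),)))
net.add_synset(BabelSynset("bn:gasbag", (("EN", "gasbag"),)))
net.add_synset(BabelSynset("bn:flight", (("EN", "flight"),)))
net.add_relation("bn:balloon", "is-a", "bn:aircraft")
net.add_relation("bn:gasbag", "part-of", "bn:balloon")
net.add_relation("bn:balloon", "related-to", "bn:flight")
print(net.related("bn:balloon"))
```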
Building BabelNet: Translating Babel synsets
1. Exploiting Wikipedia interlanguage links
[Figure: interlanguage links yield the translations Ballon, globo aerostático, pallone aerostatico]
Building BabelNet: Translating Babel synsets
2. Filling the lexical translation gaps using a Machine
Translation system to translate the English lexicalizations of
a concept
On August 27, 1783 in Paris, Franklin witnessed the
world's first hydrogen [[Balloon (aircraft)|balloon]]
flight.
Google Translate
Le 27 Août, 1783 à Paris, Franklin vu le premier vol en
ballon d'hydrogène.
For each word sense s, we translate:
sentences from SemCor (a corpus annotated with WordNet
senses) which contain s
sentences from Wikipedia linked to the Wikipage of s
The most frequent translation of s is selected for each target
language
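The selection step can be sketched with a simple frequency count. The function name and the input list are hypothetical: the list stands for the translations of one sense observed across the machine-translated sentences for one target language.

```python
from collections import Counter

def most_frequent_translation(translations):
    """Return the most common non-empty translation, or None if there is none."""
    counts = Counter(t.lower() for t in translations if t)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical Italian renderings of one English term across six translated sentences.
observed = ["wikificazione", "wikification", "wikificazione",
            "wikificazione", "wikification", "wikificazione"]
print(most_frequent_translation(observed))
```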
The most frequent translation of a word in a given
meaning
left context | term | right context
 | wikification | may refer to: the…
geoinformatics services' and ' | wikification | of GIS by the masses'
the process may be called | wikification | (as in ...
which is then called " | wikification | and to the related problem
reason needs copyediting, | wikification | , reduction of POV, work on references
huge amount of cleanup, | wikification | , etc. Version of 12 Nov
After machine translation into Italian:
left context | term | right context
 | wikificazione | potrebbe riferirsi a: il…
servizi geoinformatici' e ' | wikification | di GIS dalle masse'
il processo chiamato | wikificazione | (come in ...
che è quindi chiamato | wikificazione | e al problema correlato…
ragione richiede copyediting, | wikification | , riduzione di POV, lavoro su reference
grandi quantità di pulizia, | wikificazione | , ecc. Versione del 12 Novembre
BabelNet [Navigli and Ponzetto, AIJ 2012]
A wide-coverage multilingual semantic network
including both encyclopedic (from Wikipedia) and
lexicographic (from WordNet) entries
NEs and specialized
concepts from Wikipedia
Concepts from WordNet
Concepts integrated from
both resources
Integrating WordNet with Wikipedia…
WordNet
Is that all?!?
Open Multilingual WordNet
[Bond and Foster, 2013]
• http://compling.hss.ntu.edu.sg/omw/
• 22 languages
• Mappings to the Princeton WordNet synsets
• More than 600,000 lexicalizations
Francis Bond and Kyonghee Paik. 2012. A survey of wordnets and their
licenses. In Proc. of GWC 2012
Francis Bond and Ryan Foster. 2013. Linking and extending an open
multilingual wordnet. In Proc. of ACL
OmegaWiki (http://www.omegawiki.org)
• Hundreds of languages
• About 50,000 entries («synsets»)
Some statistics for OmegaWiki
Wiktionary (http://www.wiktionary.org)
• A collaborative dictionary!
• Hundreds of languages
• About 3.7M entries
Some statistics for Wiktionary
Wikidata (http://www.wikidata.org)
• A collaborative knowledge base!
• Hundreds of languages
• About 15M entries
But how to integrate all these resources?
Alignment Approaches
Usually measure the similarity of two concepts
[Figure labels: WordNet; plant#n#1; plant#n#1]
And align two concepts if their similarity exceeds
a threshold
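The threshold rule can be sketched as follows, assuming the pairwise similarities have already been computed. The concept identifiers, the scores, and the choice to keep only the best-scoring target per source concept are illustrative assumptions, not details given on the slide.

```python
def align(similarities, threshold):
    """Map each source concept to its most similar target, if the score exceeds threshold.

    similarities: dict mapping (source, target) pairs to a similarity score.
    """
    best = {}
    for (source, target), sim in similarities.items():
        current = best.get(source, (None, threshold))[1]
        if sim > current:
            best[source] = (target, sim)
    return {source: target for source, (target, _) in best.items()}

# Hypothetical scores between WordNet senses and pages of another resource.
sims = {
    ("wn:plant#n#1", "wiki:Plant"): 0.2,
    ("wn:plant#n#1", "wiki:Factory"): 0.8,
    ("wn:plant#n#2", "wiki:Plant"): 0.9,
    ("wn:room#n#1", "wiki:Room"): 0.1,
}
print(align(sims, threshold=0.5))
```

Concepts whose best score stays at or below the threshold are left unaligned, as with `wn:room#n#1` in this toy input.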
SemAlign: Cross-resource Concept Alignment
[Pilehvar and Navigli, ACL 2014]
We combine two different similarity measures:
SemAlign: Definition similarity
Definition similarity (gloss similarity)
• Strong baseline
• Falls short when:
– totally different wordings are used for the same concept
– quality glosses are lacking
Example (two glosses for the same concept):
– An area within a building enclosed by walls and floor and ceiling.
– A room is any distinguishable space within a structure.
SemAlign: structural similarity
Structural similarity
[Figure: Wikipedia semantic network vs. WordNet semantic network; node labels shown: sheet, cellulose, fiber, material]
WordNet gloss: 1. paper -- a material made of cellulose pulp derived mainly from wood or rags or certain grasses.
[Figure: Wiktionary semantic network vs. WordNet semantic network, same WordNet gloss for "paper"; node labels shown: printing, cellulose, fiber, material]
[Figure: OmegaWiki semantic network vs. WordNet semantic network, same WordNet gloss for "paper"; node labels shown: sheet, cellulose, fiber, material]
SemAlign: Core
PPR-based Similarity Measure
Personalized PageRank
Semantic Signature of a concept
Distributional representation
over all concepts in the semantic network
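As a rough illustration, a semantic signature can be computed by running Personalized PageRank with all restart mass on the concept of interest. The sketch below uses plain power iteration on a toy graph; the damping factor of 0.85, the iteration count, and the graph itself are assumptions, since the slides do not give these details.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with restarts concentrated on the seed nodes.

    graph: dict mapping every node to the list of its neighbours.
    Returns a dict node -> probability; this is the semantic signature.
    """
    nodes = sorted(graph)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            neighbours = graph[node]
            if not neighbours:
                # Dangling node: return its mass to the seeds.
                for n in nodes:
                    new_rank[n] += damping * rank[node] * teleport[n]
                continue
            share = damping * rank[node] / len(neighbours)
            for nb in neighbours:
                new_rank[nb] += share
        rank = new_rank
    return rank

# Toy undirected graph (assumed for illustration).
toy = {
    "balloon#n#1": ["aircraft#n#1", "gondola#n#1", "aerostat#n#1"],
    "aircraft#n#1": ["balloon#n#1", "airship#n#1"],
    "airship#n#1": ["aircraft#n#1"],
    "gondola#n#1": ["balloon#n#1", "flight#n#1"],
    "flight#n#1": ["gondola#n#1", "buoyancy#n#1"],
    "buoyancy#n#1": ["flight#n#1"],
    "aerostat#n#1": ["balloon#n#1"],
}
signature = personalized_pagerank(toy, seeds={"balloon#n#1"})
for concept, weight in sorted(signature.items(), key=lambda kv: -kv[1]):
    print(f"{concept}\t{weight:.3f}")
```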
Semantic Signature of a concept: an example
WordNet concept for-----
SemAlign: signature unification
1. Find concepts associated with monosemous words
2. Truncate vectors to the overlap of such concepts
Good news: the vectors get reduced, but not too much!
# WordNet synsets containing
at least one monosemous word = 117,659
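One way to read the unification step: signatures from two resources live in different concept spaces, so each is re-expressed over the monosemous words that lexicalise its concepts, and both are then truncated to the words they share. The sketch below follows that reading; the function name, the identifiers, and the toy weights are all hypothetical.

```python
def unify_signatures(sig_a, mono_a, sig_b, mono_b):
    """Project two signatures onto their shared monosemous words.

    sig_x:  dict concept -> weight (signature in resource x)
    mono_x: dict concept -> the monosemous word that lexicalises it in resource x
    """
    words_a = {mono_a[c]: w for c, w in sig_a.items() if c in mono_a}
    words_b = {mono_b[c]: w for c, w in sig_b.items() if c in mono_b}
    shared = sorted(words_a.keys() & words_b.keys())
    return ({w: words_a[w] for w in shared}, {w: words_b[w] for w in shared})

# Toy data; which words count as monosemous here is assumed for illustration.
sig_wn = {"wn:cellulose": 0.4, "wn:fiber": 0.3, "wn:material": 0.3}
mono_wn = {"wn:cellulose": "cellulose", "wn:fiber": "fiber"}
sig_wiki = {"wiki:Cellulose": 0.5, "wiki:Sheet": 0.5}
mono_wiki = {"wiki:Cellulose": "cellulose", "wiki:Sheet": "sheet"}

unified_wn, unified_wiki = unify_signatures(sig_wn, mono_wn, sig_wiki, mono_wiki)
print(unified_wn, unified_wiki)
```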
SemAlign: signature comparison
Structural similarity
Semantic Signature Comparison
Comparing Semantic Signatures
Weighted Overlap [Pilehvar et al., ACL 2013]
• We calculate the following formula:
• where r_i^k is the rank of the i-th element in vector k
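The formula image is missing from this transcript. As we recall it from the cited paper, Weighted Overlap over the set H of shared dimensions is the sum of (r_i^1 + r_i^2)^(-1) over i in H, divided by the sum of (2i)^(-1) for i = 1..|H|; treat this as a recollection rather than a quotation. The sketch below implements that form, with ranks computed within the shared dimensions (an assumption) and ties broken by dimension name.

```python
def weighted_overlap(sig_a, sig_b):
    """Rank-based similarity of two signatures (dict dimension -> weight)."""
    shared = set(sig_a) & set(sig_b)
    if not shared:
        return 0.0

    def ranks(sig):
        # Rank 1 is the heaviest shared dimension; ties are broken by name.
        ordered = sorted(shared, key=lambda d: (-sig[d], d))
        return {d: position for position, d in enumerate(ordered, start=1)}

    rank_a, rank_b = ranks(sig_a), ranks(sig_b)
    numerator = sum(1.0 / (rank_a[d] + rank_b[d]) for d in shared)
    # Best case: both signatures rank the shared dimensions identically.
    denominator = sum(1.0 / (2 * i) for i in range(1, len(shared) + 1))
    return numerator / denominator

same = weighted_overlap({"x": 0.7, "y": 0.3}, {"x": 0.6, "y": 0.4})
swapped = weighted_overlap({"x": 0.7, "y": 0.3}, {"x": 0.2, "y": 0.8})
print(same, swapped)
```

Because only ranks matter, two signatures with different weights but the same ordering of shared dimensions receive the maximum score.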
SemAlign: score combination
BabelNet 2.5 is online: http://babelnet.org
Key fact! BabelNet evolves at a faster pace than I can cope with
Anatomy of BabelNet 2.5
50 languages covered (including Latin!)
List at http://babelnet.org/stats.jsp
9.3M Babel synsets (concepts and named entities)
67M word senses
262M semantic relations (28 edges per synset on avg.)
7.7M synset-associated images
21M textual definitions
New 2.5 version out!
• Seamless integration of:
  • WordNet 3.0
  • Wikipedia
  • Wikidata
  • Wiktionary
  • OmegaWiki
  • Open Multilingual WordNet [Bond and Foster, 2013]
• Translations for all open-class parts of speech
• 1.1B RDF triples available via SPARQL endpoint
WordNet+OpenMultilingualWordNet+Wikipedia+…
+OmegaWiki+automatic translations…
+textual definitions
More definitions+Wikipedia categories+…
+images
We are not alone in the (resource) universe!
• DBpedia [Bizer et al., 2009]: a resource obtained from structured information in Wikipedia
  • «Describes 3.77M things»
  • Core of the Linked Open Data Cloud
• YAGO [Suchanek et al., 2007]
  • «Contains 10M entities and 120M facts about these entities»
  • Links Wikipedia categories to WordNet synsets
• MENTA [de Melo and Weikum, 2010]
  • A «multilingual taxonomy with 5.4M entities»
• WikiNet [Nastase and Strube, 2013]
  • Semantic network connecting Wikipedia entities
  • «3M concepts and 38+M relations»
• Freebase (http://freebase.com): collaborative effort
  • Structured data; started from Wikipedia, MusicBrainz, ChefMoz, etc.
Evaluations: I (might) have to go fast here!
WordNet-Wikipedia mapping accuracy
Overall quality of the mapping: ~84%
On a random sample of 1k Wikipages
Note: this concerns only those 50k synsets in the intersection
Evaluation of BabelNet against gold standard resources
Extra-coverage: up to +2300% new senses!
Hands-on Session: the BabelNet Java API
Part 3: Identifying multilingual concepts and entities in text
Motivation
• Web content is available in many languages
• Information should be extracted and processed independently of the source/target language
• This could be done automatically by means of high-performance multilingual text understanding
One of the key challenges of multilingual text understanding is the effective treatment of one of the fundamental aspects of language: ambiguity!
Word Sense Disambiguation and Entity Linking
Thomas and Mario are strikers playing in Munich
Entity Linking: the task of discovering mentions of entities within a text and linking them to a knowledge base.
WSD: the task of assigning meanings to word occurrences within text.
Word Sense Disambiguation in a Nutshell
Input: the target word “strikers” and its context “Thomas and Mario are strikers playing in Munich”
WSD system (drawing on knowledge)
Output: the sense of the target word
Main references
A complete survey of the field:
• Navigli R. Word Sense Disambiguation: a Survey. ACM Computing Surveys, 41(2), ACM Press, 2009, pp. 1-69.
WSD book:
• Agirre E. and Edmonds P. Word Sense Disambiguation: Algorithms and Applications, New York, USA, Springer, 2006.
An earlier survey:
• Ide N. and Véronis J. Word Sense Disambiguation: The State of The Art. Computational Linguistics, 24(1), 1998, pp. 1-40.
WSD: main approaches
• Supervised WSD
  • Frames the problem as a classification task
  • Relies on hand-labeled training sets
• Knowledge-based WSD
  • Uses knowledge resources to identify the best senses for words in context
  • Typically does not need a training phase and relies on an existing inventory of senses
• Word Sense Discrimination / Induction
  • Unsupervised WSD: clustering
  • Does not need manually tagged datasets
  • Can make the task more difficult to evaluate
Supervision: labeled data vs. knowledge
State-of-the-art WSD systems
• Supervised:
  • It Makes Sense (Zhong et al., 2010): an SVM trained on manually annotated corpora
• Structural:
  • UKB (Agirre et al., 2009): Personalized PageRank applied to semantic networks containing word senses
  • (Navigli and Ponzetto, ACL 2010; EMNLP 2012): graph-based, with contextual degree
IMS: It Makes Sense (Zhong et al., 2010)
• A Support Vector Machine based approach using the
following features:
• POS tags of surrounding words;
• Surrounding words;
• Local collocations
• Trained on:
• SemCor (Miller et al., 1994)
• DSO corpus (Ng and Lee, 1996)
• Six English-Chinese parallel texts
Pros:
• High quality annotations
• Fast
Cons:
• Performance and coverage depend heavily on the availability of annotated text for the target language
Knowledge-based WSD: structural approaches
Structural approaches analyze and exploit the structure of a knowledge resource.
Given a knowledge resource:
• View the resource as a graph
• Apply a method that makes use of the structure of the graph
UKB: Random Walks for Knowledge-Based Word Sense Disambiguation (Agirre et al., 2009)
• View WordNet as a graph;
• Take a set of context and target words;
• For each target word, compute Personalized PageRank over WordNet, starting from the context and target words;
• Select, for the target word under consideration, the sense with the maximum score.
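The Personalized PageRank step can be illustrated with a small power-iteration sketch in Java. It is a simplification under stated assumptions, not UKB itself: the graph is a toy adjacency map, every node (including every seed and every neighbour) appears as a key, the teleport mass is spread uniformly over the seed nodes, and the class name, damping value and iteration count are hypothetical choices.

```java
import java.util.*;

/** Personalized PageRank by power iteration on a small graph (illustrative sketch). */
class PersonalizedPageRank {

    /**
     * @param graph   adjacency lists; every node must appear as a key
     * @param seeds   nodes the random surfer teleports to (e.g. senses of context words)
     * @param damping probability of following an edge instead of teleporting
     */
    static Map<String, Double> rank(Map<String, List<String>> graph, Set<String> seeds,
                                    double damping, int iterations) {
        Map<String, Double> teleport = new HashMap<>();
        for (String n : graph.keySet())
            teleport.put(n, seeds.contains(n) ? 1.0 / seeds.size() : 0.0);

        Map<String, Double> pr = new HashMap<>(teleport);
        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            for (String n : graph.keySet()) next.put(n, (1 - damping) * teleport.get(n));
            for (String n : graph.keySet()) {
                List<String> out = graph.get(n);
                double mass = damping * pr.get(n);
                if (out.isEmpty()) {
                    // dangling node: hand its mass back to the seeds
                    for (String s : seeds) next.merge(s, mass / seeds.size(), Double::sum);
                } else {
                    for (String m : out) next.merge(m, mass / out.size(), Double::sum);
                }
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = Map.of(
            "striker#sport", List.of("soccer_player"),
            "soccer_player", List.of("striker#sport", "fc_bayern"),
            "fc_bayern", List.of("soccer_player"),
            "striker#game", List.of("video_game"),
            "video_game", List.of("striker#game"));
        // Seeding the walk at a context node favours the sense connected to it.
        System.out.println(rank(g, Set.of("fc_bayern"), 0.85, 50));
    }
}
```

Choosing the sense of a target word then amounts to comparing the scores of its candidate sense nodes and keeping the largest.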
Pros:
• Wide coverage
• Good quality annotations
• No need for annotated text
Cons:
• Slow when considering huge graphs (e.g., BabelNet)
• No local features
• Lower performance
Entity Linking in a Nutshell
Input: the target mention “Thomas” and its context “Thomas and Mario are strikers playing in Munich”
EL system (drawing on knowledge)
Output: the Named Entity the mention refers to
Entity Linking
EL encompasses a set of similar tasks:
• Named Entity Disambiguation: the task of linking entity mentions in a text to a knowledge base
• Wikification: the automatic annotation of text by linking its relevant fragments to the appropriate Wikipedia articles
State-of-the-art approaches are based on the following concepts:
• Collective disambiguation of mentions vs. independent disambiguation of mentions;
• Enforcing semantic coherence among the chosen named entities;
• Efficiency: the number of named entities and the number of word senses differ by orders of magnitude!
State-of-the-art EL systems
• AIDA (Hoffart et al., 2011): a graph-based framework for the exploitation of similarity measures between candidate entities;
• KORE (Hoffart et al., 2012): a graph-based similarity measure integrated with key phrases contained within the context to disambiguate entities;
• TAGME (Ferragina and Scaiella, 2012): a combination of the Milne-Witten measure (hyperlink similarity on Wikipedia) with the commonness of an entity;
• Wikifier (Cheng and Roth, 2013): a global and local approach based on the TF-IDF score combined with hyperlinks in Wikipedia;
• DBpedia Spotlight (Mendes et al., 2011): a generative model based on counts obtained from manually disambiguated Wikipedia hyperlinks (high precision, low recall).
AIDA (Hoffart et al., 2011)
• Precompute a popularity measure for entities;
  • Based on Wikipedia link anchors. For instance, ‘Kashmir’ refers to the geographical region 91% of the time in Wikipedia.
• Run a named entity recognizer to find mentions;
  • Stanford NER tagger
• Find candidates within the considered knowledge base;
  • YAGO ‘means’ relations (they exploit a heuristic for extracting first names!)
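The popularity measure in the first bullet can be read as the fraction of links with a given anchor text that point to a given entity. The Java sketch below illustrates that idea only; the class name, the method names and the counts in `main` are made up for the example and do not come from AIDA or from Wikipedia.

```java
import java.util.*;

/** Popularity prior of an entity given an anchor text, from link-anchor counts (sketch). */
class AnchorPrior {
    // anchor text -> (entity -> number of links with that anchor pointing to the entity)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    void addLinks(String anchor, String entity, int n) {
        counts.computeIfAbsent(anchor, a -> new HashMap<>()).merge(entity, n, Integer::sum);
    }

    /** Share of the anchor's links that point to the entity; 0 for unseen anchors. */
    double prior(String anchor, String entity) {
        Map<String, Integer> byEntity = counts.getOrDefault(anchor, Collections.emptyMap());
        int total = 0;
        for (int c : byEntity.values()) total += c;
        return total == 0 ? 0.0 : byEntity.getOrDefault(entity, 0) / (double) total;
    }

    public static void main(String[] args) {
        AnchorPrior p = new AnchorPrior();
        p.addLinks("Kashmir", "Kashmir (region)", 91);   // toy counts, not real statistics
        p.addLinks("Kashmir", "Kashmir (song)", 9);
        System.out.println(p.prior("Kashmir", "Kashmir (region)"));
    }
}
```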
• Measure the textual similarity between the context and the candidates;
  • Keyphrase-based similarity, with words weighted by Mutual Information
  • Thater et al. (2010) syntactically enriched distributional representation
• Use semantic relations to enforce coherence among entities;
  • The Milne-Witten (2008) measure, i.e., the normalized Google distance on in-links in Wikipedia
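A common statement of the Milne-Witten measure, given the link sets A and B of two pages and the set W of all pages, is relatedness(a, b) = 1 - (log max(|A|, |B|) - log |A ∩ B|) / (log |W| - log min(|A|, |B|)). The Java sketch below follows that formulation. It assumes link sets are given as sets of page titles and that the total number of pages is larger than either set; clipping negative values to 0 is an assumption of this sketch, and the names are hypothetical.

```java
import java.util.*;

/** Milne-Witten relatedness from the link sets of two pages (illustrative sketch). */
class MilneWitten {

    static double relatedness(Set<String> linksA, Set<String> linksB, long totalPages) {
        Set<String> common = new HashSet<>(linksA);
        common.retainAll(linksB);
        if (common.isEmpty()) return 0.0;            // no shared links: treated as unrelated
        double max = Math.max(linksA.size(), linksB.size());
        double min = Math.min(linksA.size(), linksB.size());
        // normalized Google distance computed on link sets instead of search hits
        double distance = (Math.log(max) - Math.log(common.size()))
                        / (Math.log(totalPages) - Math.log(min));
        return Math.max(0.0, 1.0 - distance);        // assumed clipping to [0, 1]
    }

    public static void main(String[] args) {
        Set<String> a = Set.of("Bundesliga", "Germany", "Football", "Bavaria");
        Set<String> b = Set.of("Bundesliga", "Football");
        System.out.println(relatedness(a, b, 1000));
    }
}
```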
• Compute the final scores using the popularity, context
similarity and coherence measures
TAGME (Ferragina and Scaiella, 2012)
• Extract anchor dictionary from Wikipedia;
  • Titles, redirections and manually annotated anchors;
• Compute popularity score based on Wikipedia counts;
  • Based on Wikipedia link anchors. For instance, ‘Kashmir’ refers to the geographical region 91% of the time in Wikipedia.
• Prune candidates using the popularity and coherence measures;
  • It exploits the link probability of a mention (calculated in Wikipedia) and its semantic relatedness with the candidates for the other mentions
• Exploit in-link edges within Wikipedia to compute a semantic relatedness/coherence measure;
  • The Milne-Witten (2008) measure, i.e., the normalized Google distance on in-links in Wikipedia
• Select the best candidate by means of relatedness.
Wikifier (Cheng and Roth, 2013)
• Uses Wikipedia titles to build the mapping: mention -> candidates
• It solves an optimization problem based on two classes of features: local and global
• Local features ϕ (given a Wikipedia page t and a mention m):
  • cos(text(t), text(m)), cos(text(t), context(m)), cos(context(t), text(m)), cos(context(t), context(m)), where:
    • text(t): TF-IDF scores of the Wikipedia page under consideration
    • context(t): TF-IDF scores of the extended Wikipedia page (i.e., the page itself plus the ones linked in it)
    • text(m): the mention itself
    • context(m): the top 100 words within a window, ranked by TF-IDF
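Each of the four local features is a cosine between two weighted bags of words. A minimal Java sketch of that building block follows, assuming the TF-IDF vectors have already been computed and are stored as sparse maps from word to weight; the class name and the toy weights are hypothetical.

```java
import java.util.*;

/** Cosine similarity between two sparse weighted bags of words (sketch). */
class SparseCosine {

    static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) sumOfSquares += x * x;
        return Math.sqrt(sumOfSquares);
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        double na = norm(a), nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);  // empty vector: similarity 0
    }

    public static void main(String[] args) {
        Map<String, Double> page = Map.of("striker", 2.0, "bayern", 1.5, "goal", 1.0);
        Map<String, Double> mentionContext = Map.of("striker", 1.0, "munich", 1.0);
        System.out.println(cosine(page, mentionContext));
    }
}
```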
• Global features φ (given two Wikipedia pages t and u):
  • NGD(inlinks(t), inlinks(u))
  • PMI(inlinks(t), inlinks(u))
  • NGD(outlinks(t), outlinks(u))
  • PMI(outlinks(t), outlinks(u))
Wikifier (Ratinov et al., 2011)
• Optimization Problem:
• After the disambiguation, link only those disambiguated mentions m for which the whole optimization score does not increase when removing m from the count
• In recent work (Cheng and Roth, 2013), a postprocessing phase based on relational inference has been added.
DBpedia Spotlight (Mendes et al., 2011)
• DBpedia entries define the mapping: mentions -> candidates
• To find candidates for each mention, Spotlight uses the LingPipe dictionary-based chunker (http://alias-i.com/lingpipe)
• Obtain a vector space model description of DBpedia by using the TF-ICF score:
  $ICF(w_j) = \log \frac{|R_s|}{n(w_j)}$
  where $R_s$ is the set of candidate resources in DBpedia for the mention s and $n(w_j)$ is the number of resources in $R_s$ associated with the word $w_j$
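The inverse candidate frequency rewards words that single out one candidate among the resources competing for the same mention: ICF(w) = log(|R_s| / n(w)), multiplied by the term frequency of w in the resource's context. The Java sketch below illustrates the weighting under stated assumptions: each candidate resource comes with a plain list of context words, and the class name, method name and toy data are hypothetical rather than taken from Spotlight.

```java
import java.util.*;

/** TF-ICF weights for the candidate resources of one mention (illustrative sketch). */
class TfIcf {

    /**
     * @param contexts for each candidate resource of a mention s, the words observed
     *                 in its contexts (repetitions allowed)
     * @return for each resource, a map word -> TF * ICF
     */
    static Map<String, Map<String, Double>> weights(Map<String, List<String>> contexts) {
        // n(w): number of candidate resources associated with word w
        Map<String, Integer> n = new HashMap<>();
        for (List<String> words : contexts.values())
            for (String w : new HashSet<>(words)) n.merge(w, 1, Integer::sum);

        int candidates = contexts.size();              // |R_s|
        Map<String, Map<String, Double>> result = new HashMap<>();
        for (Map.Entry<String, List<String>> e : contexts.entrySet()) {
            Map<String, Double> vector = new HashMap<>();
            for (String w : e.getValue()) {
                double icf = Math.log((double) candidates / n.get(w));
                vector.merge(w, icf, Double::sum);     // one ICF per occurrence = TF * ICF
            }
            result.put(e.getKey(), vector);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> contexts = Map.of(
            "Washington_(state)", List.of("seattle", "state", "usa"),
            "George_Washington", List.of("president", "usa"));
        System.out.println(weights(contexts));
    }
}
```

A word shared by every candidate (here "usa") gets weight 0, so it cannot influence the cosine ranking that follows.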
• Finally, rank candidates by cosine similarity and link by threshold
• This approach obtains almost 100% precision on many datasets, but with very low recall
• Recent work (Daiber et al., 2013) aimed at even better precision by focusing on the mention spotting phase, to which a noun chunker and a NER system have been added
The multilingual aspect of disambiguation
• In both tasks, WSD and EL, knowledge-based approaches have been shown to perform well
• What about multilinguality?
• Which kinds of resources are available out there?
  • Open Multilingual WordNet
BabelNet (http://babelnet.org)
• Named Entities and specialized concepts from Wikipedia
• Concepts from WordNet
• Concepts integrated from both resources
BabelNet as a Multilingual Inventory for:
Concepts
• Calcio in Italian can denote different concepts (e.g., kick, calcium, soccer)
Named Entities
• The text Mario can be used to represent different things, such as the video game character, a soccer player (Gómez) or even a music album
Calcio / Kick in BabelNet 2.5
Calcio / Calcium in BabelNet 2.5
Calcio / Soccer in BabelNet 2.5
So what?
Babelfy: A Joint approach to WSD and EL
[Moro et al., TACL 2014]
• Based on Personalized PageRank, the state-of-the-art method for graph-based WSD.
  • However, it cannot be run for each new input on huge graphs.
• Idea: precompute semantic signatures for the nodes!
• Semantic signatures are the most relevant nodes for a given node in the graph, computed by using random walk with restart
Andrea Moro, Alessandro Raganato and Roberto Navigli. 2014. Entity Linking meets Word Sense Disambiguation: a Unified Approach. Transactions of the Association for Computational Linguistics (TACL), 2.
http://babelfy.org
1. Precompute semantic signatures;
2. Given an input text, select all the possible candidate meanings from BabelNet by matching mentions with BabelNet lexicalizations;
3. Connect all the candidate meanings by using semantic signatures;
4. Extract a dense subgraph containing semantically coherent candidates;
5. Select the most connected candidate for each fragment of text.
Step 1: Semantic Signatures
a. Start from one target vertex of the semantic network;
b. Randomly select a neighbor of the current vertex or
restart from the target vertex;
c. Keep the counts of hitting frequencies;
d. Take the most visited vertices.
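Steps a to d can be sketched as a simulated random walk with restart. The Java code below is a simplified illustration, not the Babelfy implementation: neighbours are chosen uniformly, every vertex is assumed to appear as a key of the adjacency map, the signature is cut at the k most visited vertices, and the class name, restart probability, step count and seed are hypothetical choices.

```java
import java.util.*;

/** Semantic signature of a vertex via random walk with restart (illustrative sketch). */
class SemanticSignature {

    static List<String> signature(Map<String, List<String>> graph, String target,
                                  double restartProb, int steps, int k, long seed) {
        Random rnd = new Random(seed);
        Map<String, Integer> hits = new HashMap<>();
        String current = target;                                   // (a) start from the target
        for (int i = 0; i < steps; i++) {
            List<String> neighbours = graph.get(current);
            if (neighbours.isEmpty() || rnd.nextDouble() < restartProb) {
                current = target;                                  // (b) restart
            } else {
                current = neighbours.get(rnd.nextInt(neighbours.size())); // (b) random neighbour
                if (!current.equals(target))
                    hits.merge(current, 1, Integer::sum);          // (c) count the visit
            }
        }
        List<String> visited = new ArrayList<>(hits.keySet());
        visited.sort((a, b) -> {
            int c = Integer.compare(hits.get(b), hits.get(a));     // most visited first
            return c != 0 ? c : a.compareTo(b);
        });
        return new ArrayList<>(visited.subList(0, Math.min(k, visited.size()))); // (d)
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = Map.of(
            "striker", List.of("soccer player", "athlete"),
            "soccer player", List.of("striker", "sport"),
            "athlete", List.of("sport"),
            "sport", List.of("athlete"),
            "calcium", List.of("chemistry"),
            "chemistry", List.of("calcium"));
        System.out.println(signature(g, "striker", 0.15, 10000, 3, 42L));
    }
}
```

Since the signatures depend only on the semantic network, they can be computed once and reused for every input text.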
[Figure: semantic signature example with the nodes offside, striker, athlete, soccer player and sport]
Step 2: Find all possible meanings of words
1. Exact matching (good for WSD, bad for EL)
Thomas and Mario are strikers playing in Munich
Candidates for “Thomas”: Thomas, Norman; Thomas, Seth
They both have Thomas as one of their lexicalizations
2. Partial matching (good for EL)
Thomas and Mario are strikers playing in Munich
Candidates for “Thomas”: Thomas, Norman; Thomas, Seth; Thomas Müller
Thomas Müller has Thomas as a subsequence of one of its lexicalizations
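The difference between the two matching modes can be made concrete with a small Java sketch. It assumes a toy inventory mapping each entry to its lexicalizations, lower-cases everything, and treats "subsequence" as a contiguous run of whole tokens; the class, the enum and the inventory are hypothetical and are not the BabelNet or Babelfy API.

```java
import java.util.*;

/** Candidate retrieval by exact or partial matching of lexicalizations (sketch). */
class CandidateMatcher {
    enum Matching { EXACT, PARTIAL }

    static Set<String> candidates(String mention, Map<String, List<String>> lexicalizations,
                                  Matching mode) {
        // Pad with spaces so that only whole tokens can match.
        String m = " " + mention.toLowerCase(Locale.ROOT).trim() + " ";
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : lexicalizations.entrySet()) {
            for (String lex : e.getValue()) {
                String l = " " + lex.toLowerCase(Locale.ROOT).trim() + " ";
                boolean hit = (mode == Matching.EXACT) ? l.equals(m) : l.contains(m);
                if (hit) {
                    result.add(e.getKey());
                    break;                       // one matching lexicalization is enough
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> inventory = Map.of(
            "Seth Thomas", List.of("Seth Thomas", "Thomas"),
            "Thomas Muller", List.of("Thomas Muller"),
            "Munich (City)", List.of("Munich"));
        System.out.println(candidates("Thomas", inventory, Matching.EXACT));
        System.out.println(candidates("Thomas", inventory, Matching.PARTIAL));
    }
}
```

Partial matching returns a superset of the exact matches, which raises recall for entity mentions at the cost of more candidates to disambiguate.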
“Thomas and Mario are strikers playing in Munich”
• Thomas: Seth Thomas, Thomas Müller, Thomas (novel)
• Mario: Mario (Character), Mario (Album), Mario Gómez
• strikers: striker (Sport), Striker (Video Game), Striker (Movie)
• Munich: Munich (City), FC Bayern Munich, Munich (Song)
Ambiguity!
Step 3: Connect all the candidate meanings
Thomas and Mario are strikers playing in Munich
Step 4: Extract a dense subgraph
Thomas and Mario are strikers playing in Munich
Step 5: Select the most reliable meanings
• We take into account both the lexical coherence, in terms of the number of fragments a candidate relates to, and the semantic coherence, using a graph centrality measure among the candidate meanings.
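One way to combine the two notions is to multiply a candidate's degree by the share of other fragments it is linked to, and then normalize among the candidates of the same fragment. The Java sketch below follows that reading. It is an assumption-laden illustration rather than the exact Babelfy scoring: degree is used as the centrality measure, every neighbour is assumed to be a known candidate, and all names and the toy graph are hypothetical.

```java
import java.util.*;

/** Scores candidate meanings by lexical and semantic coherence (illustrative sketch). */
class CandidateScorer {

    /**
     * @param fragmentOf candidate meaning -> text fragment it is a candidate for
     * @param edges      undirected neighbours of each candidate in the dense subgraph
     * @return score of each candidate, normalized within its own fragment
     */
    static Map<String, Double> scores(Map<String, String> fragmentOf,
                                      Map<String, Set<String>> edges) {
        int fragments = new HashSet<>(fragmentOf.values()).size();
        Map<String, Double> raw = new HashMap<>();
        Map<String, Double> totalPerFragment = new HashMap<>();
        for (String v : fragmentOf.keySet()) {
            Set<String> neighbours = edges.getOrDefault(v, Collections.emptySet());
            Set<String> linkedFragments = new HashSet<>();
            for (String u : neighbours) linkedFragments.add(fragmentOf.get(u));
            // lexical coherence: share of the other fragments this candidate is linked to
            double w = fragments > 1 ? linkedFragments.size() / (double) (fragments - 1) : 0.0;
            double s = w * neighbours.size();      // semantic coherence: degree centrality
            raw.put(v, s);
            totalPerFragment.merge(fragmentOf.get(v), s, Double::sum);
        }
        Map<String, Double> result = new HashMap<>();
        for (String v : raw.keySet()) {
            double total = totalPerFragment.get(fragmentOf.get(v));
            result.put(v, total == 0.0 ? 0.0 : raw.get(v) / total);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> fragmentOf = Map.of(
            "Thomas Muller", "Thomas", "Seth Thomas", "Thomas",
            "Mario Gomez", "Mario", "FC Bayern Munich", "Munich");
        Map<String, Set<String>> edges = Map.of(
            "Thomas Muller", Set.of("Mario Gomez", "FC Bayern Munich"),
            "Mario Gomez", Set.of("Thomas Muller", "FC Bayern Munich"),
            "FC Bayern Munich", Set.of("Thomas Muller", "Mario Gomez"));
        System.out.println(scores(fragmentOf, edges));
    }
}
```

For each fragment, the candidate with the highest score is then selected; a threshold on the score could be used to leave a fragment unlinked.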
Experimental Setup
Word Sense Disambiguation datasets:
• Senseval-3 (Snyder and Palmer, 2004);
• SemEval-2007 task 7 (Navigli et al., 2007);
• SemEval-2007 task 17 (Pradhan et al., 2007);
• SemEval-2013 task 12 (Navigli et al., 2013);
Entity Linking datasets:
• AIDA-CoNLL (Hoffart et al., 2011);
• KORE50 (Hoffart et al., 2012);
Experimental Results: Fine-grained (Multilingual) Disambiguation
[Results tables for Senseval-3, SemEval-2007 task 17 and SemEval-2013 task 12]
Experimental Results: Coarse-grained Word Sense Disambiguation
[Results table for the SemEval-2007 task 7 dataset]
Experimental Results: Entity Linking
http://babelfy.org
Babelfy: RESTful API
// Obtain a client for the online Babelfy service
Babelfy bfy = Babelfy.getInstance(AccessType.ONLINE);
String inputText = "hello world, I'm a computer scientist";
// "key" stands for your API key; partial matching, English input
Annotation annotations =
    bfy.babelfy("key", inputText, Matching.PARTIAL, Language.EN);
System.out.println("inputText: " + inputText);
System.out.println("annotations:");
// Print each annotated fragment followed by the id of its Babel synset
for (BabelSynsetAnchor annotation : annotations.getAnnotations())
{
    System.out.println(annotation.getAnchorText());
    System.out.println("\t" + annotation.getBabelSynset().getId() + "\t" +
        annotation.getBabelSynset());
}
Hands-on Session: Babelfy
Key fact! Annotating with BabelNet: all in one!
Annotating with BabelNet implies annotating with WordNet and Wikipedia (now also OmegaWiki, Open Multilingual WordNet, Wiktionary and Wikidata!)
MASC: Manually Annotated Sub-Corpus (Ide et al., 2008)
• 500k words of text from many different genres
• It is freely available and comes with many annotations!
• This makes it an invaluable resource for both the industrial and academic communities in order to produce and improve cutting-edge language technologies.
• The corpus is available in different formats such as GrAF, in-line XML, token/part of speech sequences, RDF encoding and CoNLL format
• It already contains many different linguistic annotations:
  • Sentence boundary,
  • part of speech,
  • syntactic dependency
  • …
• We augmented this resource with word senses and named entities using Babelfy [Moro et al., LREC 2014]
Babelfying MASC
[Moro et al., LREC 2014]
Statistics:
Our semantic annotation, together with the others, is available at:
http://lcl.uniroma1.it/MASC-NEWS/
Open Problems: grammar-agnostic
• All current approaches exploit:
  • POS tagging
  • Lemmatization
  Noisy (>90% for English, but much less on morphologically rich languages).
• How to improve?
  • Waiting for better POS taggers
  • Character-based analysis of text
Open Problems: language-agnostic
• All current approaches exploit:
  • Knowledge of the input language
  • Automatic language recognition
  Noisy (>90% for English, but much less on resource-poor languages). Moreover, text written in multiple languages will be wrongly analyzed for sure!
• How to improve?
  • Waiting for better language recognition systems
  • Unify the lexicalizations of different languages
Open Problems: fragment recognition
• Most of the current approaches exploit:
  • Named Entity Recognition
  • Non-overlapping text assumption
  Noisy (>80% for English, but much less on resource-poor languages). Moreover, when assuming that entities and word senses should not overlap, you lose information!
• How to improve?
  • Waiting for better NER systems
  • Overlap and match everything
Conclusion
To summarize
• We have taken you through a tour of:
  • A very large multilingual semantic network: BabelNet
  • A state-of-the-art WSD and EL system: Babelfy
Acknowledgements
• European Research Council and the EU Commission for funding our research
• Tiziano Flati, Maud Ehrmann, Andrea Moro, Mohammad Taher Pilehvar and Daniele Vannella for their help with slides
Thanks or…
(grazie)
http://lcl.uniroma1.it
http://babelnet.org
http://babelfy.org
Google group: babelnet-group