Multilingual Word Sense Disambiguation and Entity Linking
COLING Tutorial – 24th August 2014
Roberto Navigli and Andrea Moro
http://lcl.uniroma1.it
ERC Starting Grant MultiJEDI No. 259234

The instructors
• Roberto Navigli, associate professor, Department of Computer Science, Sapienza University of Rome
• Andrea Moro, PhD student, Department of Computer Science, Sapienza University of Rome

Multilingual Word Sense Disambiguation and Entity Linking – COLING 2014 Tutorial, Roberto Navigli and Andrea Moro

But… just to make sure you chose the right room!

Tutorial Outline
• Foundations in Semantic Processing
  • Basic concepts, terminology, and examples
  • Motivations for incorporating multilinguality
• Constructing Multilingual Semantic Resources
  • Methods for building new resources by combining heterogeneous resources in many languages
  • How multilingual representations solve current problems
• Multilingual Word Sense Disambiguation and Entity Linking
  • How entities and concepts differ
  • Methods for identifying each in any language

And, if you resist until the coffee break… you will… receive a prize!!! A BabelNet t-shirt!!!
[model is not included]

Projects thanks to which this tutorial exists
• MultiJEDI (1.3M euros): ERC Starting Grant
• LIDER (1.5M euros): EU CSA
• Google Focused Research Award (200k$)

Part 1: Foundations

Understanding a simple phrase
Barack Obama peruses the internet.

Natural language is ambiguous
Listen to some rock!
I cannot hear anything…
Natural language is ambiguous
Yesterday, I saw an underground rock concert

Natural language is highly ambiguous
Underground rock concert
• a music event
Underground rock formation
• a stone structure
Formation of an underground rock concert
• setup and planning for a music event
(?) A concert of underground rock formations
• (metaphoric) harmoniously arranged stone structures
We need knowledge of a phrase’s semantics

State-of-the-art Machine Translation
EN: These are movies in which the music genre, e.g. rock, is an important element but not necessarily central to the plot. Examples are Easy Rider (1969), The Graduate (1969), and Saturday Night Fever (1978).
IT: Questi sono i film in cui il genere musicale, ad es roccia, è un elemento importante, ma non necessariamente al centro della trama.
(“rock”, the music genre, is mistranslated as “roccia”, i.e. stone.)

State-of-the-art Machine Translation
EN: Knowledge of the distribution of underground rock densities can assist in interpreting subsurface geologic structure and rock type.
Danger here!
IT: La conoscenza della distribuzione di densità di rock underground può aiutare a interpretare in sottosuolo struttura geologica e tipo di roccia.
(Here the geological “underground rock” is left as “rock underground”, the Italian term for the music genre.)

The Multilingual, Big-Picture Goal
[Diagram: “Underground rock concert” / “언더그라운드 락 콘서트” and “Underground rock formation” / “지하 암석” each enter a Black Box, which outputs a [semantic representation] used by NLP Applications]

The General Problem
POLYSEMY
• The most frequent words have several meanings!
• Our job: model meaning from a computational perspective

Monosemous vs. Polysemous words
• Monosemous words have only one meaning
  – Examples: plant life, internet
• Polysemous words have more than one meaning
  – Example: bar
    – “a room or establishment where alcoholic drinks are served”
    – “a counter where you can obtain food or drink”
    – “a rigid piece of metal or wood”
    – “musical notation for a repeating pattern of musical beats”

The Triangle of Meaning (Semiotic Triangle)
Writer:
- object evokes thought
- refers to object with symbol
Reader:
- symbol evokes thought
- refers symbol to the object
[Diagram: triangle with vertices Concept (thought), Symbol (sign: “dog”, “cane”, “犬”) and Object (referent)]

What is a word sense?
• A word sense is a commonly-accepted meaning of a word:
  – We are fond of fruit such as kiwi/fruit and banana.
  – The kiwi/bird is the national bird of New Zealand.
• How to represent word senses?
  – Can we enumerate the senses of a word?
  – “Kiwi is my mother tongue, but I also speak all other English languages”
[Diagram: triangle with vertices sense (thought), Symbol (sign: “kiwi”, “киви”) and Object (referent)]

Word Senses
• The bank_1 holds the mortgage on my home.
• The river overflowed its banks_2 this year.
• He walked to the bank_3 on the street corner.
• The treasures were buried in banks_4 of dirt.

Word Senses: Homonymy
Homonymy: two senses share an orthographic form (e.g., bank), but are semantically and etymologically unrelated (different lemmas!)

Word Senses: Polysemy
Polysemy: two senses are very close to each other semantically

How do we represent and encode semantics?
[Diagram: text in any language enters a Black Box, which outputs a [semantic representation] used by NLP Applications]
What comes out of the black box?

How do we represent and encode semantics?
• Thesauri
  • Group words according to similar meaning
  • Relations between groups (e.g., narrower meanings)
  • Roget’s Thesaurus (1911)
• Machine Readable Dictionaries
  • Enumerate all meanings of a word
  • Include definitions, morphology, example usages, etc.
  • Oxford Dictionary of English, LDOCE, Collins, etc.
• Computational Lexicons
  • Repositories of structured knowledge about a word's semantics and syntax
  • Include relations like hypernymy, meronymy, or entailment
  • WordNet

Senses and Relations in WordNet
• Each meaning is encoded as a synset (synonym set), which is a collection of synonymous senses
• Semantic relations between synsets
  – Hypernymy (car#n#1 is-a motor vehicle#n#1)
  – Meronymy (car#n#1 has-a car door#n#1)
  – Entailment, similarity, attribute, etc.
• Lexical relations between word senses
  – Antonymy (good#a#1 antonym of bad#a#1)
  – Pertainymy (dental#a#1 pertains to tooth#n#1)
  – Nominalization (service#n#2 nominalizes serve#v#4)

WordNet [Miller et al., 1990; Fellbaum, 1998]
[Figure: excerpt of the WordNet graph. Synsets (concepts) such as {wheeled vehicle}, {self-propelled vehicle}, {motor vehicle}, {car, auto, automobile, machine, motorcar}, {convertible}, {golf cart, golfcart}, {tractor}, {locomotive, engine, locomotive engine, railway locomotive}, {wagon, waggon}, {brake}, {wheel}, {splasher}, {air bag}, {car window} and {accelerator, accelerator pedal, gas pedal, throttle} are connected by is-a and has-part semantic relations]

Wordnets in other Languages
• EuroWordNet (Vossen, 1998)
• BalkaNet (Tufis et al., 2004)
• Multilingual Central Repository (Atserias et al., 2003)
• GermaNet (Hamp and Feldweg, 1997)
• SloWNet (Fišer and Sagot, 2008)
• WOLF (Sagot and Fišer, 2008)
• Hungarian WN (Miháltz et al., 2008)
• Japanese WN (Isahara et al., 2008)
• …
Currently 73 unique wordnets: http://globalwordnet.org/wordnets-in-the-world/
[Figure labels: MultiWordNet, WOLF, BalkaNet, MCR, WordNet, GermaNet]

An ideal resource for Multilingual Semantic Processing
• Capable of representing the meaning of a piece of text as word senses in any language
  • broad coverage of different senses, including language-specific senses
  • currently problematic for many language-specific wordnets
• Encodes semantic and syntactic relationships between the synsets
  • Highly beneficial for NLP applications
• Encodes definitions and usages for synsets

Part 2: Building resources for multilingual semantic processing

Objective and motivation
Goal:
• A large repository of knowledge in a multilingual setting
Motivations:
• A common ground for language technologies that brings together:
  • Multilinguality
  • Encyclopedic knowledge
  • Lexicographic knowledge
  • Semantic relations
  • Textual definitions
  • Domain information
  • …

The Richer, The Better
Highly-interconnected semantic networks have a great impact on knowledge-based WSD even in a fine-grained setting [Navigli & Lapata, IEEE TPAMI 2010]
[Figure, source: Navigli and Lapata, 2010. Plot annotations: "divergence point", "State-of-the-art WSD", "nirvana point!!!"]

How many meanings for «balloon»?
[Figure: the meanings of "balloon" in WordNet and in Wikipedia]

Core Challenges
1. Integrating and unifying heterogeneous resources
2. Managing many different languages
3. Having a wide range of semantic relations between concepts and named entities
4. Maintaining high accuracy

This is where the ERC (and our project) comes into play
A 5-year ERC Starting Grant (2011-2016) on Multilingual Word Sense Disambiguation

Multilingual Joint Word Sense Disambiguation (MultiJEDI)
Key Objective 1: create knowledge for all languages
[Figure labels: MultiWordNet, WOLF, BalkaNet, MCR, WordNet, GermaNet]

Multilingual Joint Word Sense Disambiguation (MultiJEDI)
Key Objective 2: use all languages to disambiguate one

The Vision
[Diagram: MultiJEDI takes input text in *any* language and returns disambiguated text. Multilingual Joint WSD is the central research objective; it relies on a Multilingual Semantic Network built from WordNet and Wikipedia]
Automatic Acquisition of a Wide-Coverage Multilingual Semantic Network: BabelNet

Goal: Creating a Multilingual Semantic Network
Start from two large complementary resources:
• WordNet: full-fledged taxonomy
• Wikipedia: multilingual and continuously updated
Get the best from both worlds
[Figure: WordNet graph excerpt around {car, auto, automobile, machine, motorcar}]

WordNet [Miller et al., 1990; Fellbaum, 1998]
[Figure: synsets (concepts) such as {wheeled vehicle}, {motor vehicle}, {car, auto, automobile, machine, motorcar}, {convertible}, {wheel} and {car window}, connected by is-a and has-part semantic relations]

Wikipedia [The Web Community, 2001-today]
[Figure: Wikipedia pages act as concepts; the hyperlinks between them act as (unspecified) semantic relations]
BabelNet: concepts and semantic relations (1)
Concepts and relations in BabelNet are harvested from WordNet and Wikipedia:
• WordNet synsets become BabelNet concepts
• WordNet lexico-semantic relations become BabelNet semantic relations
• Wikipedia pages become BabelNet concepts
• Wikipedia hyperlinks become BabelNet semantic relations

An example of mapping

Creation of the Wikipedia disambiguation contexts
The context of a page is built up from three sources:
• sense label: aircraft
• hyperlinks: aerostat, buoyancy, airship, …, gondola
• categories: ballooning, hydrogen, aeronautics
ctx(Balloon (aircraft)) = { aircraft, aerostat, buoyancy, airship, …, gondola, ballooning, hydrogen, aeronautics }

Building BabelNet: Mapping Wikipedia to WordNet
Given a Wikipage w and its disambiguation context ctx(w):
For each WordNet sense s of w, calculate score(s, w) as follows: [formula shown on the slide]
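A minimal sketch of this mapping step, assuming an overlap-based score (the number of context words shared by the WordNet sense and the Wikipage, plus one for smoothing) that is then normalized over the candidate senses. The function names and the WordNet-side contexts below are invented for illustration and are not taken from the tutorial.

```python
def score(sense_ctx, page_ctx):
    """Assumed overlap score: shared context words, plus 1 for smoothing."""
    return len(set(sense_ctx) & set(page_ctx)) + 1


def map_page_to_sense(page_ctx, candidate_senses):
    """Return the best-scoring sense and the normalized scores of all candidates.

    candidate_senses maps a sense label to its (hypothetical) context words.
    """
    scores = {s: score(ctx, page_ctx) for s, ctx in candidate_senses.items()}
    total = sum(scores.values())
    probs = {s: v / total for s, v in scores.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Context of the Wikipedia page "Balloon (aircraft)", as listed on the slides.
ctx_page = {"aircraft", "aerostat", "buoyancy", "airship", "gondola",
            "ballooning", "hydrogen", "aeronautics"}

# Illustrative WordNet-side contexts (assumed, not from the tutorial).
wordnet_senses = {
    "balloon#n#1": {"aircraft", "lighter-than-air", "airship", "gondola"},
    "balloon#n#2": {"toy", "plaything", "rubber", "inflate"},
}

best_sense, probabilities = map_page_to_sense(ctx_page, wordnet_senses)
print(best_sense, probabilities)
```

With these toy contexts the aircraft sense shares three words with the page context and the toy sense shares none, so the page is mapped to balloon#n#1.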
The Wikipedia page context in the WordNet graph
ctx(Balloon (aircraft)) = { aircraft, aerostat, buoyancy, airship, …, gondola }
Candidate sense: balloon#n#1
Context senses found in the graph: aircraft#n#1, gondola#n#1, buoyancy#n#1, airship#n#1, aerostat#n#1
Paths from the candidate sense to the context senses:
• balloon#n#1 -> aircraft#n#1
• balloon#n#1 -> aircraft#n#1 -> airship#n#1
• balloon#n#1 -> gondola#n#1
• balloon#n#1 -> gondola#n#1 -> flight#n#1 -> buoyancy#n#1
• balloon#n#1 -> aerostat#n#1
[score shown on the slide: 0.35]

BabelNet: concepts and semantic relations (2)
We encode knowledge as a labeled directed graph:
• Each vertex is a Babel synset: balloon (EN), Ballon (DE), aerostato (ES), aerostato (IT), pallone aerostatico (IT), montgolfière (FR)
• Each edge is a semantic relation between synsets:
  • is-a (balloon is-a aircraft)
  • part-of (gasbag part-of balloon)
  • instance-of (Einstein instance-of physicist)
  • …
  • unspecified/relatedness (balloon related-to flight)

Building BabelNet: Translating Babel synsets
1. Exploiting Wikipedia interlanguage links
   Ballon, globo aerostático, pallone aerostatico
2. Filling the lexical translation gaps using a Machine Translation system to translate the English lexicalizations of a concept
   EN: On August 27, 1783 in Paris, Franklin witnessed the world's first hydrogen [[Balloon (aircraft)|balloon]] flight.
   Google Translate: Le 27 Août, 1783 à Paris, Franklin vu le premier vol en ballon d'hydrogène.
For each word sense s, we translate:
• sentences from SemCor (a corpus annotated with WordNet senses) which contain s
• sentences from Wikipedia linked to the Wikipage of s
The most frequent translation of s is selected for each target language

The most frequent translation of a word in a given meaning
English contexts:
left context | term | right context
(none) | wikification | may refer to: the…
geoinformatics services' and ' | wikification | of GIS by the masses'
the process may be called | wikification | (as in ...
which is then called " | wikification | and to the related problem
reason needs copyediting, | wikification | , reduction of POV, work on references
huge amount of cleanup, | wikification | , etc. Version of 12 Nov

Italian translations:
left context | term | right context
(none) | wikificazione | potrebbe riferirsi a: il…
servizi geoinformatici' e ' | wikification | di GIS dalle masse'
il processo chiamato | wikificazione | (come in ...
che è quindi chiamato | wikificazione | e al problema correlato…
ragione richiede copyediting, | wikification | , riduzione di POV, lavoro su reference
grandi quantità di pulizia, | wikificazione | , ecc. Versione del 12 Novembre

"wikificazione" occurs in 4 of the 6 translated contexts, so it is selected as the Italian translation.

BabelNet [Navigli and Ponzetto, AIJ 2012]
A wide-coverage multilingual semantic network including both encyclopedic (from Wikipedia) and lexicographic (from WordNet) entries
• NEs and specialized concepts from Wikipedia
• Concepts from WordNet
• Concepts integrated from both resources

Integrating WordNet with Wikipedia…
Is that all?!?
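Going back to the translation step: choosing the most frequent translation of a sense can be sketched as a simple frequency count over the translated contexts. This is only an illustration; the function name is hypothetical, and the input list is the sequence of Italian terms from the concordance above.

```python
from collections import Counter


def most_frequent_translation(translated_terms):
    """Pick the translation that occurs most often across the translated contexts."""
    counts = Counter(translated_terms)
    translation, frequency = counts.most_common(1)[0]
    return translation, frequency


# Italian renderings of "wikification" in the six contexts listed above.
italian_terms = [
    "wikificazione", "wikification", "wikificazione",
    "wikificazione", "wikification", "wikificazione",
]

best, freq = most_frequent_translation(italian_terms)
print(best, freq)
```

In a real pipeline the counts would be gathered per sense and per target language; ties would need an explicit policy, which this sketch does not define.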
Open Multilingual WordNet [Bond and Foster, 2013]
• http://compling.hss.ntu.edu.sg/omw/
• 22 languages
• Mappings to the Princeton WordNet synsets
• More than 600,000 lexicalizations
Francis Bond and Kyonghee Paik. 2012. A survey of wordnets and their licenses. In Proc. of GWC 2012
Francis Bond and Ryan Foster. 2013. Linking and extending an open multilingual wordnet. In Proc. of ACL

OmegaWiki (http://www.omegawiki.org)
• Hundreds of languages
• About 50,000 entries («synsets»)

Some statistics for OmegaWiki

Wiktionary (http://www.wiktionary.org)
• A collaborative dictionary!
• Hundreds of languages
• About 3.7M entries

Some statistics for Wiktionary

Wikidata (http://www.wikidata.org)
• A collaborative knowledge base!
• Hundreds of languages
• About 15M entries

But how to integrate all these resources?
Alignment Approaches
• Usually measure the similarity of two concepts
• And align two concepts if their similarity exceeds a threshold
[Figure labels: WordNet, plant#n#1, plant#n#1]

SemAlign: Cross-resource Concept Alignment [Pilehvar and Navigli, ACL 2014]
We combine two different similarity measures: definition similarity and structural similarity

SemAlign: Definition similarity
Definition similarity (gloss similarity)
• Strong baseline
• Falls short when
  • totally different wordings are used for the same concept
  • we lack quality glosses
Example: “An area within a building enclosed by walls and floor and ceiling.” vs. “A room is any distinguishable space within a structure.”

SemAlign: structural similarity
[Figures: the Wikipedia, Wiktionary and OmegaWiki semantic networks, each shown beside the WordNet semantic network, with neighbouring nodes such as sheet, printing, cellulose, fiber and material]
1. paper -- a material made of cellulose pulp derived mainly from wood or rags or certain grasses.

SemAlign: Core
PPR-based Similarity Measure

Personalized PageRank

Semantic Signature of a concept
Distributional representation over all concepts in the semantic network

Semantic Signature of a concept: an example
[Figures: semantic signatures of example WordNet concepts]
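To make the idea of a semantic signature concrete, here is a minimal sketch of Personalized PageRank by power iteration, with all restart probability placed on the concept of interest. The damping factor (0.85), the number of iterations and the toy graph around "paper" are assumptions chosen for illustration, not values from the tutorial.

```python
def personalized_pagerank(graph, seed, damping=0.85, iterations=50):
    """Power iteration for PageRank with all restart mass on `seed`.

    graph maps each node to the list of its neighbours;
    every neighbour must also appear as a key.
    """
    nodes = list(graph)
    rank = {n: (1.0 if n == seed else 0.0) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: 0.0 for n in nodes}
        for node, neighbours in graph.items():
            if not neighbours:
                # Dangling node: return its mass to the seed.
                new_rank[seed] += damping * rank[node]
                continue
            share = damping * rank[node] / len(neighbours)
            for m in neighbours:
                new_rank[m] += share
        # Restart step: jump back to the seed with probability 1 - damping.
        new_rank[seed] += 1.0 - damping
        rank = new_rank
    return rank


# Toy undirected network (hypothetical edges), stored as adjacency lists.
toy_graph = {
    "paper": ["material", "cellulose", "sheet"],
    "material": ["paper", "fiber"],
    "cellulose": ["paper", "fiber"],
    "sheet": ["paper"],
    "fiber": ["material", "cellulose"],
}

signature = personalized_pagerank(toy_graph, "paper")
for concept, weight in sorted(signature.items(), key=lambda kv: -kv[1]):
    print(f"{concept}: {weight:.3f}")
```

The resulting dictionary is a probability distribution over all nodes of the network, which is what the slides call the semantic signature of the seed concept.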
SemAlign: signature unification
• Find concepts associated with monosemous words
• Truncate vectors to the overlap of such concepts
• Good news: vector gets reduced, but not too much!
  # WordNet synsets containing at least one monosemous word = 117,659

SemAlign: signature comparison
Structural similarity

Semantic Signature Comparison
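One rank-based way to compare two signatures is Weighted Overlap [Pilehvar et al., ACL 2013]. The sketch below first restricts both signatures to their shared dimensions (a simplified stand-in for the unification step) and then assumes that Weighted Overlap is the sum over shared dimensions of 1 / (r_i^1 + r_i^2), normalized by the sum of 1 / (2i) for i = 1..|shared|, with ranks taken among the shared dimensions. The function name and the toy signatures are invented for illustration.

```python
def weighted_overlap(sig_a, sig_b):
    """Rank-based similarity of two signatures (dicts: dimension -> weight)."""
    # Unification (simplified): keep only the dimensions present in both signatures.
    shared = set(sig_a) & set(sig_b)
    if not shared:
        return 0.0

    def ranks(sig):
        # Rank 1 is the highest-weighted shared dimension; ties broken by name.
        ordered = sorted(shared, key=lambda d: (-sig[d], d))
        return {d: i + 1 for i, d in enumerate(ordered)}

    ranks_a, ranks_b = ranks(sig_a), ranks(sig_b)
    numerator = sum(1.0 / (ranks_a[d] + ranks_b[d]) for d in shared)
    # Best case: both signatures rank every shared dimension identically.
    denominator = sum(1.0 / (2 * i) for i in range(1, len(shared) + 1))
    return numerator / denominator


# Toy signatures (assumed weights) for two candidate "paper" concepts.
sig_wordnet = {"cellulose": 0.4, "fiber": 0.3, "material": 0.2, "sheet": 0.1}
sig_other = {"cellulose": 0.35, "material": 0.3, "fiber": 0.2, "printing": 0.15}

similarity = weighted_overlap(sig_wordnet, sig_other)
print(round(similarity, 4))
```

The score is 1.0 when the two signatures order their shared dimensions identically and decreases as the rankings diverge; agreement on top-ranked dimensions counts more than agreement further down.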
Comparing Semantic Signatures
Weighted Overlap [Pilehvar et al., ACL 2013]
• We calculate the following formula: [formula shown on the slide]
• where r_i^k is the ranking of the i-th element in vector k

SemAlign: score combination

BabelNet 2.5 is online: http://babelnet.org

BabelNet goes at a faster pace than I can cope with
Key fact!

Anatomy of BabelNet 2.5
• 50 languages covered (including Latin!); list at http://babelnet.org/stats.jsp
• 9.3M Babel synsets (concepts and named entities)
• 67M word senses
• 262M semantic relations (28 edges per synset on avg.)
• 7.7M synset-associated images
• 21M textual definitions

New 2.5 version out!
• Seamless integration of:
  • WordNet 3.0
  • Wikipedia
  • Wikidata
  • Wiktionary
  • OmegaWiki
  • Open Multilingual WordNet [Bond and Foster, 2013]
• Translations for all open-class parts of speech
• 1.1B RDF triples available via SPARQL endpoint

WordNet + Open Multilingual WordNet + Wikipedia + …
+ OmegaWiki + automatic translations …
+ textual definitions
More definitions + Wikipedia categories + …
+ images

We are not alone in the (resource) universe!
• DBpedia [Bizer et al. 2009]: a resource obtained from structured information in Wikipedia; «Describes 3.77M things»; core of the Linked Open Data Cloud
• YAGO [Suchanek et al. 2007]: «Contains 10M entities and 120M facts about these entities»; links Wikipedia categories to WordNet synsets
• MENTA [de Melo and Weikum, 2010]: a «multilingual taxonomy with 5.4M entities»
• WikiNet [Nastase and Strube, 2013]: semantic network connecting Wikipedia entities; «3M concepts and 38+M relations»
• Freebase (http://freebase.com): collaborative effort; structured data; started from Wikipedia, MusicBrainz, ChefMoz, etc.

Evaluations: I (might) have to go fast here!

WordNet-Wikipedia mapping accuracy
• Overall quality of the mapping: ~84%, on a random sample of 1k Wikipages
• Note: this concerns only those 50k synsets in the intersection

Evaluation of BabelNet against gold standard resources
Up to +2300% new senses!
Extra-coverage

Hands-on Session: the BabelNet Java API

Part 3: Identifying multilingual concepts and entities in text

Motivation
• Web content is available in many languages
• Information should be extracted and processed independently of the source/target language
• This could be done automatically by means of high-performance multilingual text understanding

Motivation
One of the key challenges of multilingual text understanding is the effective treatment of one of the fundamental aspects of language: Ambiguity!

Word Sense Disambiguation and Entity Linking
Example: “Thomas and Mario are strikers playing in Munich”
• Entity Linking: the task of discovering mentions of entities within a text and linking them to the corresponding entries in a knowledge base.
• WSD: the task of assigning meanings to word occurrences within a text.

Word Sense Disambiguation in a Nutshell
• Input: strikers (target word), “Thomas and Mario are strikers playing in Munich” (context), knowledge
• WSD system
• Output: sense of target word

Main references
• A complete survey of the field: Navigli R. Word Sense Disambiguation: a Survey. ACM Computing Surveys, 41(2), ACM Press, 2009, pp. 1-69.
• WSD book: Agirre E. and Edmonds P. Word Sense Disambiguation: Algorithms and Applications, New York, USA, Springer, 2006.
• An earlier survey: Ide N. and Véronis J. Word Sense Disambiguation: The State of The Art. Computational Linguistics, 24(1), 1998, pp. 1-40.

WSD: main approaches
• Supervised WSD
  • Frames the problem as a classification task
  • Relies on hand-labeled training sets
• Knowledge-based WSD
  • Uses knowledge resources to identify the best senses for words in context
  • Typically, it does not need a training phase and relies on an existing inventory of senses
• Word Sense Discrimination / Induction
  • Unsupervised WSD: clustering
  • Does not need manually-tagged datasets
  • Can make the task more difficult to evaluate

Supervision: labeled data vs. knowledge

State-of-the-art WSD systems
• Supervised:
  • It Makes Sense (Zhong et al., 2010): an SVM trained on manually annotated corpora;
• Structural:
  • UKB (Agirre et al., 2009): an application of Personalized PageRank to semantic networks containing word senses
  • (Navigli and Ponzetto, ACL 2010; EMNLP 2012): graph-based, with contextual degree

IMS: It Makes Sense (Zhong et al., 2010)
• A Support Vector Machine based approach using the following features:
  • POS tags of surrounding words;
  • Surrounding words;
  • Local collocations
• Trained on:
  • SemCor (Miller et al., 1994)
  • DSO corpus (Ng and Lee, 1996)
  • Six English-Chinese parallel texts

IMS: It Makes Sense (Zhong et al., 2010)
Pros:
• High quality annotations
• Fast
Cons:
• Performance and coverage highly dependent on the availability of annotated text for that language

Knowledge-based WSD: structural approaches
Structural approaches analyze and exploit the structure of a knowledge resource.
Given a knowledge resource:
• View the resource as a graph
• Apply a method that makes use of the structure of the graph

UKB: Random Walks for Knowledge-Based Word Sense Disambiguation (Agirre et al., 2009)
• View WordNet as a graph;
• Take a set of context and target words;
• For each target word, compute Personalized PageRank over WordNet starting from the context and target words;
• Then select, for the target word under consideration, the sense with the maximum score

UKB: Random Walks for Knowledge-Based Word Sense Disambiguation (Agirre et al., 2009)
Pros:
• Wide coverage
• Good quality annotations
• No need for annotated text
Cons:
• Slow when considering huge graphs (e.g., BabelNet)
• No local features
• Lower performance

Entity Linking in a Nutshell
• Input: Thomas (target mention), “Thomas and Mario are strikers playing in Munich” (context), knowledge
• EL system
• Output: Named Entity

Entity Linking
EL encompasses a set of similar tasks:
• Named Entity Disambiguation, that is, the task of linking entity mentions in a text to a knowledge base
• Wikification, that is, the automatic annotation of text by linking its relevant fragments to the appropriate Wikipedia articles.

Entity Linking
State-of-the-art approaches are based on the following concepts:
• Collective disambiguation of mentions vs. independent disambiguation of mentions;
• Enforcing semantic coherence among the chosen named entities;
• Efficiency: there are orders of magnitude between the number of word senses and named entities!

State-of-the-art EL systems
• AIDA (Hoffart et al., 2011): a graph-based framework for the exploitation of similarity measures between candidate entities;
• KORE (Hoffart et al., 2012): a graph-based similarity measure integrated with key phrases contained within the context to disambiguate entities;
• Tagme (Ferragina and Scaiella, 2012): a combination of the Milne-Witten measure (hyperlink similarity on Wikipedia) with the commonness of an entity;
• Wikifier (Cheng and Roth, 2013): a global and local approach based on the TF-IDF score combined with hyperlinks in Wikipedia;
• DBpedia Spotlight (Mendes et al., 2011): a generative model based on counts obtained from manually disambiguated Wikipedia hyperlinks (high precision, low recall).
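Two ingredients named in the list above, the Milne-Witten measure and the commonness of an entity, are simple enough to sketch. The code below is an illustration under stated assumptions, not the implementation of any of the listed systems: the class `MilneWitten`, its method names and the toy page identifiers are hypothetical. Relatedness is taken as one minus the normalized Google distance between the sets of Wikipedia pages that link to two entities (the form in which the measure is usually stated), clipped at zero, and commonness as the fraction of occurrences of an anchor text that point to a given entity.

```java
import java.util.HashSet;
import java.util.Set;

final class MilneWitten {

    /**
     * Relatedness of two entities from the sets of pages linking to them.
     * totalPages is the number of pages in the collection (|W|).
     */
    static double relatedness(Set<String> inLinksA, Set<String> inLinksB, long totalPages) {
        Set<String> common = new HashSet<>(inLinksA);
        common.retainAll(inLinksB);
        if (common.isEmpty()) return 0.0; // no shared in-links: treated as unrelated

        double larger = Math.max(inLinksA.size(), inLinksB.size());
        double smaller = Math.min(inLinksA.size(), inLinksB.size());
        if (totalPages <= smaller) {
            throw new IllegalArgumentException("totalPages must exceed the in-link set sizes");
        }
        // Normalized Google distance over the two in-link sets.
        double distance = (Math.log(larger) - Math.log(common.size()))
                / (Math.log(totalPages) - Math.log(smaller));
        return Math.max(0.0, 1.0 - distance);
    }

    /** Commonness: share of the anchor's occurrences that link to the given entity. */
    static double commonness(long anchorLinksToEntity, long anchorLinksTotal) {
        return anchorLinksTotal == 0 ? 0.0 : (double) anchorLinksToEntity / anchorLinksTotal;
    }

    public static void main(String[] args) {
        // Toy in-link sets with made-up page identifiers.
        Set<String> a = Set.of("p1", "p2", "p3", "p4");
        Set<String> b = Set.of("p3", "p4");
        System.out.println(relatedness(a, b, 16));
        System.out.println(commonness(91, 100));
    }
}
```

A system that combines the two would typically prefer candidates that are both common for their anchor and related to the candidates of the other mentions; how the two values are weighted is system-specific and is not modelled here.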
AIDA (Hoffart et al., 2011)
• Precompute a popularity measure for entities;
  • Based on Wikipedia link anchors. For instance, ‘Kashmir’ refers to the geographical region 91% of the time in Wikipedia.
• Run a named entity recognizer to find mentions;
  • Stanford NER tagger
• Find candidates within the considered Knowledge Base;
  • YAGO ‘means’ relations (they exploit a heuristic for extracting first names!)

AIDA (Hoffart et al., 2011)
• Measure the textual similarity between the context and the candidates;
  • Keyphrase-based similarity based on the Mutual Information of words
  • Thater et al. (2010) syntactically enriched distributional representation
• Use semantic relations to enforce coherence among entities;
  • The Milne-Witten (2008) measure, i.e., the normalized Google distance computed on in-links in Wikipedia

AIDA (Hoffart et al., 2011)
• Compute the final scores using the popularity, context similarity and coherence measures

TAGME (Ferragina and Scaiella, 2012)
• Extract an anchor dictionary from Wikipedia;
  • Titles, redirections and manually annotated anchors;
• Compute a popularity score based on Wikipedia counts;
  • Based on Wikipedia link anchors. For instance, ‘Kashmir’ refers to the geographical region 91% of the time in Wikipedia.
• Prune candidates using the popularity and coherence measures;
  • It exploits the link probability of a mention (computed on Wikipedia) and its semantic relatedness with the candidates for the other mentions

TAGME (Ferragina and Scaiella, 2012)
• Exploit in-link edges within Wikipedia to compute a semantic relatedness/coherence measure;
  • The Milne-Witten (2008) measure, i.e., the normalized Google distance computed on in-links in Wikipedia
• Select the best candidate by means of relatedness;

Wikifier (Cheng and Roth, 2013)
• Uses Wikipedia titles to build the mapping: mention -> candidates
• It solves an optimization problem based on two classes of features: local and global
• Local features ϕ (given a Wikipedia page t and a mention m):
  • cos(text(t), text(m)), cos(text(t), context(m)), cos(context(t), text(m)), cos(context(t), context(m)), where:
  • text(t): the TF-IDF scores of the Wikipedia page t
  • context(t): the TF-IDF scores of the extended Wikipedia page (i.e., the page itself plus the pages linked from it)
  • text(m): the mention itself
  • context(m): the top 100 words within a window, ranked by TF-IDF

Wikifier (Cheng and Roth, 2013)
• Global features φ (given two Wikipedia pages t and u):
  • NGD(inlinks(t), inlinks(u))
  • PMI(inlinks(t), inlinks(u))
  • NGD(outlinks(t), outlinks(u))
  • PMI(outlinks(t), outlinks(u))

Wikifier (Ratinov et al., 2011)
• Optimization problem: find the assignment of pages to mentions that maximizes the sum of the local scores ϕ(m, t) plus the global scores φ(t, u) among the chosen pages
• After the disambiguation, link only those disambiguated mentions m whose removal does not increase the overall optimization score
• In recent work (Cheng and Roth, 2013), a postprocessing phase based on relational inference has been added.

DBpedia Spotlight (Mendes et al., 2011)
• DBpedia entries define the mapping: mentions -> candidates
• To find candidates for each mention, Spotlight uses the LingPipe dictionary-based chunker (http://alias-i.com/lingpipe)
• Obtain a vector space model description of DBpedia by using the TF-ICF score, with the Inverse Candidate Frequency defined as:
  \[ ICF(w_j) = \log \frac{|R_s|}{n(w_j)} = \log |R_s| - \log n(w_j) \]
• where R_s is the set of candidate resources in DBpedia for the mention s and n(w_j) is the number of resources in R_s associated with the word w_j

DBpedia Spotlight (Mendes et al., 2011)
• Finally, rank candidates by cosine similarity and link by threshold
• This approach obtains almost 100% precision on many datasets, but with really low recall
• Recent work (Daiber et al., 2013) aims at even better precision by improving the mention spotting phase, for which a noun chunker and a NER system have been added to the pipeline

The multilingual aspect of disambiguation
• In both tasks, WSD and EL, knowledge-based approaches have been shown to perform well
• What about multilinguality?
• Which kinds of resources are available out there?
Open Multilingual WordNet

BabelNet (http://babelnet.org)
• Named Entities and specialized concepts from Wikipedia
• Concepts from WordNet
• Concepts integrated from both resources

BabelNet as a Multilingual Inventory for:
• Concepts: Calcio in Italian can denote different concepts
• Named Entities: the string Mario can refer to different things, such as the video game character, a soccer player (Gomez) or even a music album

Calcio / Kick in BabelNet 2.5

Calcio / Calcium in BabelNet 2.5

Calcio / Soccer in BabelNet 2.5

Word Sense Disambiguation in a Nutshell
• Input: striker (target word), “Thomas and Mario are strikers playing in Munich” (context), knowledge
• WSD system
• Output: sense of target word

Entity Linking in a Nutshell
• Input: Thomas (target mention), “Thomas and Mario are strikers playing in Munich” (context), knowledge
• Entity Linking system
• Output: Named Entity

So what?
Babelfy: A Joint approach to WSD and EL [Moro et al., TACL 2014]
• Based on Personalized PageRank, the state-of-the-art method for graph-based WSD. However, it cannot be run for each new input on huge graphs.
• Idea: Precompute semantic signatures for the nodes!
• Semantic signatures are the most relevant nodes for a given node in the graph, computed by using random walk with restart
Andrea Moro, Alessandro Raganato and Roberto Navigli. 2014. Entity Linking meets Word Sense Disambiguation: a Unified Approach. Transactions of the Association for Computational Linguistics (TACL), 2.
http://babelfy.org

Babelfy: A Joint approach to WSD and EL [Moro et al., TACL 2014]
1. Precompute semantic signatures;
2. Given an input text select all the possible candidate meanings from BabelNet by matching mentions with BabelNet lexicalizations;
3. Connect all the candidate meanings by using semantic signatures;
4. Extract a dense subgraph containing semantically coherent candidates;
5. Select the most connected candidate for each fragment of text.

Step 1: Semantic Signatures
a. Start from one target vertex of the semantic network;
b. Randomly select a neighbor of the current vertex or restart from the target vertex;
c. Keep the counts of hitting frequencies;
d. Take the most visited vertices.
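Steps a to d can be sketched as a small simulation. This is a minimal illustration, assuming an unweighted graph stored as adjacency lists and a fixed restart probability; the class `SemanticSignature`, its parameters (restart probability, number of steps, signature size, random seed) and the toy graph are assumptions made for the example, not values or names taken from Babelfy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class SemanticSignature {

    /** Returns the `size` vertices most often visited by a random walk with restart from `target`. */
    static List<String> compute(Map<String, List<String>> graph, String target,
                                double restartProbability, int steps, int size, long seed) {
        Random random = new Random(seed);
        Map<String, Integer> hits = new HashMap<>();
        String current = target;                                   // (a) start from the target vertex
        for (int i = 0; i < steps; i++) {
            List<String> neighbors = graph.getOrDefault(current, List.of());
            if (neighbors.isEmpty() || random.nextDouble() < restartProbability) {
                current = target;                                  // (b) restart from the target
            } else {
                current = neighbors.get(random.nextInt(neighbors.size())); // (b) pick a neighbor
                hits.merge(current, 1, Integer::sum);              // (c) count the visit
            }
        }
        hits.remove(target); // assumption: the target is not listed in its own signature

        // (d) take the most visited vertices; ties are broken alphabetically.
        List<String> ranked = new ArrayList<>(hits.keySet());
        ranked.sort((x, y) -> {
            int byHits = Integer.compare(hits.get(y), hits.get(x));
            return byHits != 0 ? byHits : x.compareTo(y);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(size, ranked.size())));
    }

    public static void main(String[] args) {
        // Toy semantic network; the edges are invented for the example.
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("striker", List.of("soccer player", "offside"));
        graph.put("soccer player", List.of("striker", "athlete", "sport"));
        graph.put("offside", List.of("striker", "sport"));
        graph.put("athlete", List.of("soccer player"));
        graph.put("sport", List.of("soccer player", "offside"));
        System.out.println(compute(graph, "striker", 0.15, 20000, 3, 42L));
    }
}
```

Since the walk depends only on the graph and not on any input text, the signatures can be computed once and stored, which is what makes the approach usable on a graph as large as BabelNet.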
Step 1: Semantic Signatures
Example vertices: offside, striker, athlete, soccer player, sport

Step 2: Find all possible meanings of words
1. Exact Matching (good for WSD, bad for EL)
   “Thomas and Mario are strikers playing in Munich”
   Candidates for Thomas: Thomas, Norman; Thomas, Seth
   They both have Thomas as one of their lexicalizations
2. Partial Matching (good for EL)
   “Thomas and Mario are strikers playing in Munich”
   Candidates for Thomas: Thomas, Norman; Thomas, Seth; Thomas Müller
   Thomas Müller has Thomas as a subsequence of one of its lexicalizations

Step 2: Find all possible meanings of words
“Thomas and Mario are strikers playing in Munich”
• Thomas: Seth Thomas, Thomas Müller, Thomas (novel)
• Mario: Mario (Character), Mario (Album), Mario Gómez
• strikers: striker (Sport), Striker (Video Game), Striker (Movie)
• Munich: Munich (City), FC Bayern Munich, Munich (Song)
Ambiguity!

Step 3: Connect all the candidate meanings
“Thomas and Mario are strikers playing in Munich”

Step 4: Extract a dense subgraph
“Thomas and Mario are strikers playing in Munich”

Step 5: Select the most reliable meanings
• We take into account both the lexical coherence, in terms of the number of fragments a candidate relates to, and the semantic coherence, using a graph centrality measure among the candidate meanings.
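One way to combine the two kinds of coherence is sketched below. It is a hedged illustration rather than the scoring function of the Babelfy paper: it assumes that lexical coherence is the fraction of other fragments a candidate is connected to, that the centrality measure is plain vertex degree, and that the product is normalized over the candidates of the same fragment. The class `CandidateScoring`, its methods and the toy edges are hypothetical, and candidate names are written in ASCII.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class CandidateScoring {

    /** Adds an undirected edge between two candidate meanings (assumed to belong to different fragments). */
    static void link(Map<String, Set<String>> graph, String a, String b) {
        graph.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        graph.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** Scores each candidate; fragmentOf maps a candidate meaning to its text fragment. */
    static Map<String, Double> scores(Map<String, String> fragmentOf,
                                      Map<String, Set<String>> graph) {
        int fragments = new HashSet<>(fragmentOf.values()).size();
        Map<String, Double> raw = new HashMap<>();
        Map<String, Double> totalPerFragment = new HashMap<>();
        for (Map.Entry<String, String> entry : fragmentOf.entrySet()) {
            Set<String> neighbors = graph.getOrDefault(entry.getKey(), Set.of());
            Set<String> linkedFragments = new HashSet<>();
            for (String neighbor : neighbors) {
                String fragment = fragmentOf.get(neighbor);
                if (fragment != null) linkedFragments.add(fragment);
            }
            linkedFragments.remove(entry.getValue());
            // Lexical coherence: share of the other fragments this candidate relates to.
            double lexical = fragments > 1
                    ? (double) linkedFragments.size() / (fragments - 1) : 0.0;
            // Semantic coherence: degree centrality in the candidate graph.
            double value = lexical * neighbors.size();
            raw.put(entry.getKey(), value);
            totalPerFragment.merge(entry.getValue(), value, Double::sum);
        }
        Map<String, Double> normalized = new HashMap<>();
        for (Map.Entry<String, Double> entry : raw.entrySet()) {
            double total = totalPerFragment.get(fragmentOf.get(entry.getKey()));
            normalized.put(entry.getKey(), total > 0 ? entry.getValue() / total : 0.0);
        }
        return normalized;
    }

    /** Picks the highest-scoring candidate for every fragment. */
    static Map<String, String> best(Map<String, String> fragmentOf, Map<String, Double> scores) {
        Map<String, String> choice = new HashMap<>();
        for (Map.Entry<String, String> entry : fragmentOf.entrySet()) {
            String current = choice.get(entry.getValue());
            if (current == null || scores.get(entry.getKey()) > scores.get(current)) {
                choice.put(entry.getValue(), entry.getKey());
            }
        }
        return choice;
    }

    public static void main(String[] args) {
        Map<String, String> fragmentOf = new HashMap<>();
        fragmentOf.put("Thomas Muller", "Thomas");
        fragmentOf.put("Seth Thomas", "Thomas");
        fragmentOf.put("FC Bayern Munich", "Munich");
        fragmentOf.put("Munich (City)", "Munich");
        Map<String, Set<String>> graph = new HashMap<>();
        link(graph, "Thomas Muller", "FC Bayern Munich"); // invented edge for the example
        System.out.println(best(fragmentOf, scores(fragmentOf, graph)));
    }
}
```

Normalizing within a fragment turns the values into a distribution over the competing meanings of that fragment, so a threshold on the score can be used to leave a fragment unlinked when no candidate stands out.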
Experimental Setup
Word Sense Disambiguation datasets:
• Senseval-3 (Snyder and Palmer, 2004);
• SemEval-2007 task 7 (Navigli et al., 2007);
• SemEval-2007 task 17 (Pradhan et al., 2007);
• SemEval-2013 task 12 (Navigli et al., 2013);
Entity Linking datasets:
• AIDA-CoNLL (Hoffart et al., 2011);
• KORE50 (Hoffart et al., 2012);

Experimental Results: Fine-grained (Multilingual) Disambiguation
Datasets: SemEval-2007 task 17, Senseval-3, SemEval-2013 task 12

Experimental Results: Coarse-grained Word Sense Disambiguation
Dataset: SemEval-2007 task 7

Experimental Results: Entity Linking

http://babelfy.org
Babelfy: RESTful API

    // Obtain a client for the online Babelfy service.
    Babelfy bfy = Babelfy.getInstance(AccessType.ONLINE);
    String inputText = "hello world, I'm a computer scientist";
    // "key" is the first argument as given on the slide; partial matching, English input.
    Annotation annotations = bfy.babelfy("key", inputText, Matching.PARTIAL, Language.EN);
    System.out.println("inputText: " + inputText);
    System.out.println("annotations:");
    // Print each annotated fragment, followed by the ID and content of its Babel synset.
    for (BabelSynsetAnchor annotation : annotations.getAnnotations()) {
        System.out.println(annotation.getAnchorText());
        System.out.println("\t" + annotation.getBabelSynset().getId() + "\t"
                + annotation.getBabelSynset());
    }

Hands-on Session: Babelfy

Key fact! Annotating with BabelNet: all in one!
Annotating with BabelNet implies annotating with WordNet and Wikipedia (now also OmegaWiki, Open Multilingual WordNet, Wiktionary and Wikidata!)

MASC: Manually Annotated Sub-Corpus (Ide et al., 2008)
• 500k words of text from many different genres
• It is freely available and comes with many annotations!
• This makes it an invaluable resource for both industry and academia for producing and improving cutting-edge language technologies.

MASC: Manually Annotated Sub-Corpus (Ide et al., 2008)
• The corpus is available in different formats such as GrAF, in-line XML, token/part of speech sequences, RDF encoding and CoNLL format
• It already contains many different linguistic annotations:
  • Sentence boundary,
  • part of speech,
  • syntactic dependency
  • …
• We augmented this resource with word senses and named entities using Babelfy [Moro et al., LREC 2014]

Babelfying MASC [Moro et al., LREC 2014]
Statistics:

Babelfying MASC [Moro et al., LREC 2014]
Our semantic annotation, together with the others, is available at: http://lcl.uniroma1.it/MASC-NEWS/

Open Problems: grammar-agnostic
• All current approaches exploit:
  • POS tagging
  • Lemmatization
• Noisy: accuracy is above 90% for English, but much lower for morphologically rich languages.
• How to improve?
  • Waiting for better POS taggers
  • Character-based analysis of text

Open Problems: language-agnostic
• All current approaches exploit:
  • Knowledge of the input language
  • Automatic language recognition
• Noisy: accuracy is above 90% for English, but much lower for resource-poor languages. Moreover, text that mixes multiple languages will certainly be analyzed wrongly!
• How to improve?
  • Waiting for better language recognition systems
  • Unify the lexicalizations of different languages

Open Problems: fragment recognition
• Most of the current approaches exploit:
  • Named Entity Recognition
  • The assumption that annotated fragments of text do not overlap
• Noisy: accuracy is above 80% for English, but much lower for resource-poor languages. Moreover, when assuming that entities and word senses should not overlap, you lose information!
• How to improve?
  • Waiting for better NER systems
  • Overlap and match everything

Conclusion

To summarize
• We have taken you through a tour of:
  • A very large multilingual semantic network: BabelNet
  • A state-of-the-art WSD and EL system: Babelfy

Acknowledgements
• European Research Council and the EU Commission for funding our research
• Tiziano Flati, Maud Ehrmann, Andrea Moro, Mohammad Taher Pilehvar and Daniele Vannella for their help with slides

Thanks or… (grazie)

http://lcl.uniroma1.it
http://babelnet.org
http://babelfy.org
Google group: babelnet-group