Projective Methods for Mining Missing Translations in DBpedia
Laurent Jakubina (1), Philippe Langlais (2)
(1) RALI - DIRO, Université de Montréal, [email protected]
(2) RALI - DIRO, Université de Montréal, [email protected]
BUCC Workshop 2015
Introduction
Linked (Open) Data in Semantic Web
Fig.: "Classical" Web vs. Semantic Web
Introduction
DBpedia in/and The Semantic Web
Fig.: Concepts and Labels
=⇒ A truly multilingual World Wide Web? ...Most labels are currently only in English. [Gómez-Pérez et al., 2013]
Introduction
Zoom
Fig.: Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak: http://lod-cloud.net/
=⇒ In DBpedia? Same problem: only one label in five is in French.¹
¹ See the Data Set Statistics of January 2015: http://wiki.dbpedia.org/Datasets/DatasetStatistics
Introduction
Wikipedia and Goals
Where does that come from?
An rdfs:label property in a given language in DBpedia = the title of the Wikipedia article that is inter-language linked to the (English) Wikipedia article associated with the DBpedia concept.
=⇒ The root problem comes from Wikipedia.
20% → 100%?
Identifying translations for (English) Wikipedia article titles (in French).
=⇒ Investigating two projective approaches and their parameters, using Wikipedia and its structure as a comparable corpus.
Approaches
Standard Approach (Stand) - Presentation
Assumption:
If two words co-occur more often than expected by chance in a source language, then their translations must co-occur more often than expected by chance in the target language. [Rapp, 1995]
Fig.: Steps of the Standard Approach in a nutshell.
Approaches
Standard Approach (Stand) - Presentation
Parameters:
Contextual window size: 2, 6, 14, 30.
Association measure:
  Discontinuous Odds-Ratio (ord) [Evert, 2005, p. 86]
  Log-Likelihood Ratio (llr) [Dunning, 1993]
Bilingual seed lexicon: one large lexicon comprising 116 354 word pairs populated from several available resources (in-house, Ergane, Freelang).
Similarity measure: Cosine similarity (as in [Laroche and Langlais, 2010]).
Note:
The co-occurring words are extracted from all the source documents of the comparable corpus in which the term to translate appears.
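For concreteness, here is a minimal sketch of this standard projective approach under the parameters above. The data structures (token lists, a seed-lexicon dictionary) and all function names are illustrative, not the authors' actual implementation; the discounted odds-ratio follows the usual formulation from [Evert, 2005].

```python
import math
from collections import Counter

def context_vector(term, documents, window=3, stopwords=frozenset()):
    """Count co-occurrences of `term` within +/- `window` tokens, pooled over
    every document of the comparable corpus in which the term appears."""
    vec = Counter()
    for doc in documents:                       # doc: list of tokens
        for i, tok in enumerate(doc):
            if tok != term:
                continue
            for ctx in doc[max(0, i - window):i] + doc[i + 1:i + 1 + window]:
                if ctx not in stopwords:
                    vec[ctx] += 1
    return vec

def log_odds_ratio(o11, o12, o21, o22):
    """Discounted (discontinuous) log odds-ratio over a 2x2 contingency table;
    in the full pipeline each raw co-occurrence count would be replaced by such
    an association score (ord) or by llr."""
    return math.log(((o11 + 0.5) * (o22 + 0.5)) / ((o12 + 0.5) * (o21 + 0.5)))

def project(vec, seed_lexicon):
    """Project a source context vector into the target language through the
    bilingual seed lexicon; words without an entry are simply dropped."""
    return Counter({seed_lexicon[w]: s for w, s in vec.items() if w in seed_lexicon})

def cosine(u, v):
    num = sum(u[w] * v[w] for w in u)
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def rank_candidates(src_vec, seed_lexicon, tgt_vectors, k=20):
    """Rank candidate target terms by the cosine between their own context
    vectors and the projected source vector; return the top-k candidates."""
    proj = project(src_vec, seed_lexicon)
    scores = {t: cosine(proj, v) for t, v in tgt_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```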
Approaches
Neighbourhood variants (lki, lko, cmp and rnd) - Presentation
Idea: translating Wikipedia titles?
Only consider the occurrences of the term in the article whose title we seek to translate, thereby avoiding populating the context vector with different senses of the word to translate.
Idea: too few occurrences?
Consider neighbourhood functions: each returns a set of Wikipedia articles related to the one under consideration for translation.
4 functions (and many combinations of them), with a toy sketch after the list:
lki(a) returns the set of articles that have a link pointing to the article a under consideration.
lko(a) returns the set of articles to which a points.
cmp(a) returns the set of articles that are the most similar to a (using the MoreLikeThis method of the Lucene search engine).
rnd() randomly returns articles (for sanity checking).
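A toy sketch of these neighbourhood functions, assuming the Wikipedia link structure is available as a small in-memory graph; the LINKS dictionary is made-up illustrative data, and cmp is only described in a comment because it requires a real full-text index.

```python
import random

# Hypothetical, made-up link graph: article title -> titles it links to.
LINKS = {
    "Alternating series": {"Series (mathematics)", "Convergent series"},
    "Series (mathematics)": {"Alternating series", "Convergent series"},
    "Convergent series": {"Series (mathematics)"},
}

def lko(a, limit=1000):
    """Articles to which article `a` points (outgoing links)."""
    return sorted(LINKS.get(a, set()))[:limit]

def lki(a, limit=1000):
    """Articles that contain a link pointing to `a` (incoming links)."""
    return sorted(b for b, out in LINKS.items() if a in out)[:limit]

def rnd(limit=1000, seed=0):
    """Randomly chosen articles, used only as a sanity-check baseline."""
    pool = sorted(LINKS)
    return random.Random(seed).sample(pool, min(limit, len(pool)))

# cmp(a) would query a full-text index (e.g. Lucene's MoreLikeThis) for the
# `limit` articles most similar to `a`; it is omitted here because it needs
# an actual index of Wikipedia.

print(lki("Alternating series"))   # ['Series (mathematics)']
```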
Approaches
Neighbourhood variants (lki, lko, cmp and rnd) - Presentation
One new parameter:
Size of the returned set of articles: 10, 100 or more?
Fig.: Neighbourhood functions with the article "Alternating series"
Approaches
Explicit Semantic Analysis (Esa-B) - Presentation
Approach described in [Bouamor, 2014].
An adaptation of the Explicit Semantic Analysis approach described in [Gabrilovich and Markovitch, 2007].
Word vectors → document vectors.
Parameter? Maximum size of the document vectors (semantic drift).
Bilingual lexicon → Wikipedia interlanguage links.
Fig.: Esa-B approach in a nutshell.
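A minimal sketch of what an Esa-B style comparison could look like, assuming a toy inverted index (word → weighted Wikipedia articles) and an interlanguage-link mapping. Everything here (names, data, the truncation to max_size concepts) is illustrative of the description above, not Bouamor's actual implementation.

```python
import math
from collections import Counter

def esa_vector(word, inverted_index, max_size=30):
    """ESA representation: the Wikipedia articles (concepts) in which `word`
    appears, weighted e.g. by TF-IDF, truncated to `max_size` concepts to
    limit semantic drift."""
    weights = inverted_index.get(word, {})          # {article title: weight}
    return Counter(dict(Counter(weights).most_common(max_size)))

def project_via_interlanguage(vec, interlanguage):
    """Map English article titles to their French counterparts through
    Wikipedia interlanguage links (the 'bilingual lexicon' of Esa-B)."""
    return Counter({interlanguage[a]: w for a, w in vec.items() if a in interlanguage})

def cosine(u, v):
    num = sum(u[a] * v[a] for a in u)
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def rank_esa(src_term, en_index, fr_index, interlanguage, candidates, k=20):
    """Rank French candidate terms against the projected ESA vector of the
    English source term."""
    src = project_via_interlanguage(esa_vector(src_term, en_index), interlanguage)
    scores = {c: cosine(src, esa_vector(c, fr_index)) for c in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```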
Experimental Protocol
Reference List
= a list of English source terms and their reference (French) translations.
Randomly sample pairs of Wikipedia articles that are inter-language linked (good translations, [Hovy et al., 2013]).
Named-entity filtering with the bilingual lexicon (see Stand) on the English side.
Unigram and special-character filters on both sides.
Random entries across 4 frequency classes.
Class        Entries        Example (en → fr)
[1-25]       74 (8.5%)      myringotomy → paracentèse
[26-100]     267 (30.7%)    syllabification → césure
[101-1000]   259 (29.8%)    numerology → numérologie
[1001+]      269 (30.9%)    entertainment → divertissement
Total        869 (100%)
Experimental Protocol
Evaluation
Each approach returns a ranked list of (at most) 20 candidates for each source English term.
P@1: % of term lists for which the best-ranked candidate is the reference.
MAP@20: Mean Average Precision at rank 20 [Manning et al., 2008].
Example:
A → [A', B', C', D', E'], reference A' at rank 1 → 1
C → [A', B', C', D', E'], reference C' at rank 3 → 1/3
E → [A', B', C', D', E'], reference E' at rank 5 → 1/5
MAP@5 = (1 + 1/3 + 1/5) / 3 ≈ 0.511
P@1 = 1/3 ≈ 0.333
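A small sketch of these two metrics, assuming a single reference translation per source term (in which case average precision reduces to the reciprocal rank of the reference); the toy data at the end reproduces the example above.

```python
def precision_at_1(ranked_lists, references):
    """Fraction of terms for which the top-ranked candidate is the reference."""
    hits = sum(1 for term, cands in ranked_lists.items()
               if cands and cands[0] == references[term])
    return hits / len(ranked_lists)

def map_at_k(ranked_lists, references, k=20):
    """Mean Average Precision at rank k. With a single reference translation
    per term, average precision is 1/rank of the reference (0 if absent)."""
    total = 0.0
    for term, cands in ranked_lists.items():
        ref = references[term]
        if ref in cands[:k]:
            total += 1.0 / (cands[:k].index(ref) + 1)
    return total / len(ranked_lists)

# Reproduces the slide's example: references found at ranks 1, 3 and 5.
runs = {"A": ["A'", "B'", "C'", "D'", "E'"],
        "C": ["A'", "B'", "C'", "D'", "E'"],
        "E": ["A'", "B'", "C'", "D'", "E'"]}
refs = {"A": "A'", "C": "C'", "E": "E'"}
print(precision_at_1(runs, refs))   # 0.333...
print(map_at_k(runs, refs, 5))      # (1 + 1/3 + 1/5) / 3 ≈ 0.511
```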
Results
Stand
              [1-25]          [26-100]        [101-1000]      [1001+]         [Total]
              P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP
Stand (llr)   0.000  0.003    0.011  0.019    0.019  0.023    0.134  0.154    0.051  0.061
Stand (ord)   0.027  0.057    0.217  0.281    0.425  0.474    0.461  0.506    0.338  0.389

Observations:
From previous experiments: optimal window size = 6 (3 words on each side, no function words).
ord for the win: about six times higher performance than llr, on average ([Laroche and Langlais, 2010]).
Strong correlation between frequency and performance (a well-known fact, [Prochasson and Fung, 2011]).
Results
Stand - 2
Observations:
Rare words are better ranked in the ord context vector.
Better discriminative power.
Deserves further investigation.
ord                          llr
myringoplasty (16.32)        tube (147.6)
myringa (16.14)              laser (44.90)
laryngotracheal (15.13)      procedure (40.83)
tympanostomy (14.60)         usually (31.86)
laryngomalacia (14.19)       knife (30.13)
patency (13.43)              myringoplasty (29.85)
equalized (11.75)            ear (28.19)
grommet (11.58)              laryngotracheal (27.45)
obstructive (11.09)          tympanostomy (26.39)
incision (10.37)             cold (24.09)
Fig.: Top words in the context vectors computed with ord and llr for the source term Myringotomy. Words in bold in the original slide (myringoplasty, laryngotracheal, tympanostomy) appear in both context vectors.
Results
Neighbourhood variants
            [1-25]          [26-100]        [101-1000]      [1001+]         [Total]
            P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP
lki-1000    0.000  0.002    0.064  0.080    0.124  0.156    0.126  0.155    0.096  0.119
lko-1000    0.000  0.000    0.016  0.022    0.089  0.119    0.033  0.046    0.044  0.058
cmp-1000    0.016  0.022    0.072  0.099    0.131  0.170    0.093  0.120    0.092  0.121
rnd-1000    0.000  0.000    0.000  0.000    0.000  0.000    0.000  0.000    0.000  0.000
Observations:
Meta-parameter: more = better (1000 everywhere).
From a practical point of view: very disappointing, a significant drop in performance.
Dissymmetry between source and target context vectors =⇒ context-vector computation online for each term.
Left as future work.
At least, they outperform random sampling.
Results
Esa-B
              [1-25]          [26-100]        [101-1000]      [1001+]         [Total]
              P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP
Stand (ord)   0.027  0.057    0.217  0.281    0.425  0.474    0.461  0.506    0.338  0.389
Esa-B         0.014  0.080    0.056  0.122    0.205  0.300    0.424  0.513    0.211  0.293
Observations:
Document-vector maximum size = 30 in our case (default: 100).
Contrary to [Bouamor et al., 2013], it under-performs Stand with ord.
The authors filter nouns, verbs and adjectives, which might interfere with the previous observations on rare words.
(URLs, spelling mistakes, etc. = more discriminative; anecdotal.)
Results
All Results
              [1-25]          [26-100]        [101-1000]      [1001+]         [Total]
              P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP
Stand (ord)   0.027  0.057    0.217  0.281    0.425  0.474    0.461  0.506    0.338  0.389
Esa-B         0.014  0.080    0.056  0.122    0.205  0.300    0.424  0.513    0.211  0.293
cmp-1000      0.016  0.022    0.072  0.099    0.131  0.170    0.093  0.120    0.092  0.121
lki-1000      0.000  0.002    0.064  0.080    0.124  0.156    0.126  0.155    0.096  0.119
Stand (llr)   0.000  0.003    0.011  0.019    0.019  0.023    0.134  0.154    0.051  0.061
lko-1000      0.000  0.000    0.016  0.022    0.089  0.119    0.033  0.046    0.044  0.058
rnd-1000      0.000  0.000    0.000  0.000    0.000  0.000    0.000  0.000    0.000  0.000
Tab.: Precision (at rank 1) and MAP-20 of some variants we tested. Each neighbourhood function was asked to return (at most) 1000 English articles. The Esa-B variant makes use of context vectors of (at most) 30 titles.
Results
Analysis
Combination? Considering the 528 terms that appear over a hundred times...
Stand - ord = 362 successes (top-20)
Esa-B = 351 successes (top-20)
With an oracle telling us which variant to trust?
Potentially 431 terms (81.6%) translated correctly.
Failures for the remaining 97 terms?
English terms appear in the French Wikipedia and are proposed by the Stand approach (e.g. barber; reference translation: coiffeur).
Stand proposes morphological variants of the reference (e.g. the verbal form coudre instead of the noun couture for sewing).
Wrong or overly specific reference translations (e.g. the reference translation of veneration is dulie, while the first translation produced by Stand is vénération).
Most frequent case: the thesaurus effect of both approaches, where terms related to the source term are proposed.
Finally, sometimes it is just... wrong (e.g. a noun translated as spora).
Conclusion
Discussion
What have we learned?
Stand performs as well as or better than Esa-B, depending on parameters.
And combining both might improve results for high-frequency terms.
Well-known bias on infrequent terms = need for methods =⇒ direct future work.
Lots of meta-parameters across the approaches, requiring costly calibration experiments =⇒ the code and resources used in this work will be available at this URL:
http://rali.iro.umontreal.ca/rali/?q=fr/Ressources (WIP)
Bibliography
Bibliography I
Bouamor, D. (2014).
Constitution de ressources linguistiques multilingues à partir de corpus de textes
parallèles et comparables.
PhD thesis, Université Paris Sud - Paris XI.
Bouamor, D., Popescu, A., Semmar, N., and Zweigenbaum, P. (2013).
Building specialized bilingual lexicons using large scale background knowledge.
In EMNLP, pages 479–489.
Dunning, T. (1993).
Accurate Methods for the Statistics of Surprise and Coincidence.
Computational Linguistics, 19(1):61–74.
Evert, S. (2005).
The statistics of word cooccurrences.
PhD thesis, Stuttgart University.
Gabrilovich, E. and Markovitch, S. (2007).
Computing semantic relatedness using wikipedia-based explicit semantic analysis.
In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 1606–1611, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Bibliography
Bibliography II
Gómez-Pérez, A., Vila-Suero, D., Montiel-Ponsoda, E., Gracia, J., and
Aguado-de Cea, G. (2013).
Guidelines for multilingual linked data.
In Proceedings of the 3rd International Conference on Web Intelligence, Mining
and Semantics, WIMS '13, pages 3:1–3:12, New York, NY, USA. ACM.
Hovy, E., Navigli, R., and Ponzetto, S. P. (2013).
Collaboratively built semi-structured content and artificial intelligence: The story so far.
Artificial Intelligence, 194:2–27.
Laroche, A. and Langlais, P. (2010).
Revisiting context-based projection methods for term-translation spotting in
comparable corpora.
In Proceedings of the 23rd International Conference on Computational
Linguistics, COLING ’10, pages 617–625, Stroudsburg, PA, USA. Association for
Computational Linguistics.
Manning, C. D., Raghavan, P., and Schütze, H. (2008).
Introduction to Information Retrieval.
Cambridge University Press, New York, NY, USA.
Bibliography
Bibliography III
Prochasson, E. and Fung, P. (2011).
Rare word translation extraction from aligned comparable documents.
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 1327–1335, Stroudsburg, PA, USA. Association for Computational Linguistics.
Rapp, R. (1995).
Identifying word translations in non-parallel texts.
In Proceedings of the 33rd Annual Meeting on Association for Computational
Linguistics, ACL ’95, pages 320–322, Stroudsburg, PA, USA. Association for
Computational Linguistics.
Questions?