AIDArabic+ Named Entity Disambiguation for Arabic Text

Universität des Saarlandes
Max-Planck-Institut für Informatik
AIDArabic+
Named Entity Disambiguation for
Arabic Text
Masterarbeit im Fach Informatik
Master’s Thesis in Computer Science
von / by
Mohamed Gad-Elrab
angefertigt unter der Leitung von / supervised by
Prof. Dr. Gerhard Weikum
betreut von / advised by
Mohamed Amir Yosef
begutachtet von / reviewers
Prof. Dr. Gerhard Weikum
Dr. Klaus Berberich
Saarbrücken, July 2015
Eidesstattliche Erklärung
Ich erkläre hiermit an Eides Statt, dass ich die vorliegende Arbeit selbstständig verfasst
und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet habe.
Statement in Lieu of an Oath
I hereby confirm that I have written this thesis on my own and that I have not used any
other media or materials than the ones referred to in this thesis.
Einverständniserklärung
Ich bin damit einverstanden, dass meine (bestandene) Arbeit in beiden Versionen in die
Bibliothek der Informatik aufgenommen und damit veröffentlicht wird.
Declaration of Consent
I agree to make both versions of my thesis (with a passing grade) accessible to the public
by having them added to the library of the Computer Science Department.
Saarbrücken, July 2015
Mohamed Gad-Elrab
Abstract
Named Entity Disambiguation (NED) is the problem of mapping mentions of ambiguous
names in a natural language text onto canonical entities such as people or places, registered
in a knowledge base. Recent advances in this field enable semantic understanding of
content in different types of text. While the problem has been extensively studied for
English text, the support for other languages, and in particular Arabic, is still in its
infancy. In addition, Arabic web content (e.g., in social media) has been growing
exponentially over the last few years. Therefore, we see great potential for endeavors that
support entity-level analytics of these data. AIDArabic is the first work in the direction
of using evidence from both the English and Arabic Wikipedias to allow disambiguation of
Arabic content against a knowledge base automatically generated from Wikipedia.
The contributions of this thesis are threefold: 1) We introduce the EDRAK resource
as an automatic augmentation of AIDArabic's entity catalog and disambiguation data
components, using information beyond the manually crafted data in the Arabic Wikipedia.
We build EDRAK by fusing external web resources with the output of machine translation
and transliteration applied to data extracted from the English Wikipedia. 2) We
incorporate an Arabic-specific input pre-processing module into the disambiguation
process to handle the complex features of Arabic text. 3) We automatically build a test
corpus from a parallel English-Arabic corpus to overcome the absence of standard
benchmarks for Arabic NED systems. We evaluated the data resource as well as the full
pipeline using a mix of manual and automatic assessment. Our enrichment approaches
in EDRAK are capable of expanding the disambiguation space from 143K entities, in the
original AIDArabic, to 2.4M entities. Moreover, the full disambiguation process is able
to map 94.7% of the mentions to non-null entities with a precision of 73%, compared to
87.2% non-null mapping with only 69% precision in the original AIDArabic.
Acknowledgements
الحمد لله
During this thesis, I have learned several essential research skills that, I believe, will
shape my research career. Therefore, I would like to express my sincere gratitude to
Prof. Gerhard Weikum for giving me the opportunity to work under his supervision in such
a pioneering group, for facilitating the research and for his valuable advice throughout
the thesis.
I would also like to show my sincere gratitude and appreciation to my advisor
Mohamed Amir for his continuous guidance and support on both the professional and personal
levels. I really appreciate his patience in teaching me many essential research,
communication and planning skills. I am extremely thankful for his generosity in sharing his valuable
expertise and time. Working with him was one of the richest experiences in my life. I
wish to have the opportunity to work with him again in the future.
I would like to thank Akram El-korashy, Mayank Goyal and Uzair Mahmoud for
their useful feedback regarding the thesis writing. Also, our experiments would not
be finalized without the help of the proactive volunteers who agreed to participate in
our manual assessment. I would also like to thank the reviewers of this thesis for their
precious time and effort.
At the end of this master's program, I am grateful to the International Max Planck
Research School for Computer Science (IMPRS-CS) family for their
support throughout the program. I believe I was fortunate enough to be part of
this big family.
On the personal level, words will never be enough to express how thankful
and indebted I am to my family (my parents, sister and brother) for their sincere support,
encouragement and prayers throughout my life and my long education journey. I
appreciate their patience with my continuous absence.
I would like to thank all my friends in Saarbrücken. I believe I am blessed to be
surrounded by all those intelligent, caring and enthusiastic personalities.
Finally, I would like to extend my gratitude to everyone who expressed their support
and/or made Dua for me.
Beijing, China.
Mohamed Gad-Elrab
July, 2015
Contents

Abstract v
Acknowledgements vii
Contents ix
List of Figures xi
List of Tables xiii

1 Introduction 1
1.1 Motivation 1
1.2 Problem 2
1.3 Proposed Solution 3
1.4 Outline 3

2 Background 5
2.1 Named Entity Disambiguation 5
2.2 Named Entity Disambiguation for Arabic 6
2.3 AIDArabic: Under The Hood 7
2.3.1 Data Sources 7
2.3.2 Processing 10

3 AIDArabic+ 11
3.1 AIDArabic Challenges 11
3.2 AIDArabic+ in A Nutshell 11
3.3 Enriching Data Components 12
3.4 Language Specific Processing 16

4 EDRAK: Entity-Centric Resource For Arabic Knowledge 19
4.1 External Name Dictionaries 19
4.1.1 Entity-Aware Resources 20
4.1.2 Lexical Name Dictionaries 21
4.2 Named-Entities Translation 22
4.2.1 General Statistical Machine Translation 23
4.2.2 Named-Entities SMT 24
4.2.3 Named-Entities Light-SMT 25
4.2.4 Named-Entities Full-SMT 28
4.3 Transliteration 29
4.3.1 Transliteration Approaches 30
4.3.2 Character-Level Statistical Machine Translation 31
4.4 Arabic Names Splitting 31
4.5 Edrak as A Standalone Resource 33
4.5.1 Use-cases 33
4.5.2 Technical details 33

5 Evaluation and Statistics 35
5.1 Evaluation EDRAK 35
5.1.1 Statistics 35
5.1.2 Data Example 37
5.1.3 Manual Assessment 37
5.2 AIDArabic+ Evaluation 41
5.2.1 Arabic Corpus Creation 41
5.2.2 Experiment Setup 42
5.2.3 Results and Discussion 43

6 Conclusion and Outlook 45
6.1 Conclusion 45
6.2 Outlook 46

Bibliography 47
A Manual Assessment Interface 55
List of Figures

1.1 Internet Users Population 2
2.1 Building Name Dictionary in AIDArabic 8
2.2 Building Keyphrases Dictionary in AIDArabic 9
3.1 Building Name Dictionary in AIDArabic+ 14
3.2 Building Keyphrases Dictionary in AIDArabic+ 15
4.1 General Statistical Machine Translation Pipeline 23
4.2 Single token translation using popularity voting 27
4.3 Type-Aware Entity-Name Translation using full SMT system 29
A.1 Manual Assessment 1 55
A.2 Manual Assessment 2 56
List of Tables

4.1 Sample of Google-Word-to-Concept raw data 20
4.2 Sample of JRC-Names raw data 21
4.3 Sample of CMUQ-Arabic-NET raw data 22
4.4 Entity Names SMT Training Data Size 25
4.5 Sample of character-level training data 31
4.6 Arabic names splitting 32
4.7 Main SQL Tables in EDRAK 34
5.1 AIDArabic vs EDRAK 36
5.2 Enrichment Techniques Contribution 36
5.3 Number of Entities per Type in AIDArabic vs EDRAK 36
5.4 Contextual keyphrases dictionary AIDArabic vs EDRAK 37
5.5 Example from EDRAK resource 38
5.6 Manual Assessment Results 40
5.7 LDC2014T05 Annotated Benchmark 42
5.8 Disambiguation Results 43
Chapter 1
Introduction

1.1 Motivation
Entities such as persons, organizations and locations can be referred to by many different
name aliases, and similarly, the same name can be used to refer to different entities. For
example, Barack Obama can be referred to as “Barack Hussein Obama”, “Obama”, or
“USA president” in different pieces of text. This type of ambiguity makes it challenging
for Information Extraction (IE) and Information Retrieval (IR) systems to retrieve information
about these entities.
Named Entity Disambiguation (NED) is the process of resolving the different mentions
of people, organizations and places that appear in text onto canonical entities in a
knowledge base [27, 67] such as DBpedia [6] and YAGO [26]. NED is essential for several
IR and Semantic Analysis tasks. It can help create accurate analytics over canonical
entities instead of ambiguous mention strings [20]. Furthermore, NED can help advance
applications such as Entity-based search, Text Summarization and News Analysis [24].
Arabic is one of the most widely spoken languages around the globe. As shown in
Figure 1.1a, in December 2013, Arabic was estimated to have the 4th largest online
user population (135M) after English, Chinese and Spanish, followed by widely used
languages such as Japanese, German, French and Russian. Moreover, Arabic had the
fastest-growing user population on the Internet in the period between 2000 and 2013, with
around 5000% growth. Consequently, Arabic online unstructured content such as news
articles, forums, blogs and social media is rapidly growing. For instance, in March 2014,
Arabic-speaking users contributed to Twitter alone with an average of 17.5M tweets per day².
² http://www.arabsocialmediareport.com/
Figure 1.1: Internet Users Population. (a) Number of users; (b) Users Growth
On the other hand, the amount of structured or semi-structured Arabic content is
lagging behind. For example, Wikipedia is one of the main resources from which many
modern Knowledge Bases (KB) are extracted. It is heavily used in the literature for IR
and NLP tasks. However, the size of the Arabic Wikipedia is an order of magnitude
smaller than the English one³. Furthermore, the structured data in the Arabic Wikipedia,
such as infoboxes, are on average of lower quality in terms of coverage and accuracy.
Therefore, Arabic is still considered a resource-poor language.
1.2 Problem
While NED is a well-studied problem for English input, few systems have considered
extending NED to other languages such as Arabic. Adapting NED to Arabic text exhibits
three main challenges:
• Limited structured resources: NED systems usually require dictionaries that
link the candidate entities to one or more name aliases. Moreover, they require a
textual representation of entities (entity description or entity context), usually in
the form of a set of keyphrases. Keyphrases are essential for estimating the context
similarity between candidate entities and the retrieved mention [67]. Dictionaries
built from Arabic structured resources are limited in size and quality. This restricts
their ability to offer a robust NED process.
• Arabic language characteristics: Arabic is a morphologically-rich language
with a different character set and writing rules from Latin-alphabet languages.
Standard English tokenization and normalization (i.e. lemmatization and stemming)
techniques are not suitable for Arabic text. Incorrectly tokenized input heavily
reduces the quality of name-dictionary look-up and similarity measurement [67].
For example, pronouns written attached to the preceding word should be separated
out for better string matching.
³ In July 2015, the Arabic Wikipedia had 374,291 articles, while the English Wikipedia had 4,910,360.
• Annotated NED corpus: There is no Arabic corpus with semantic/entity annotation. Annotated corpora are essential for tuning NED parameters as well as
measuring the overall performance of the NED process in the development phase.
1.3 Proposed Solution
In this thesis, we introduce AIDArabic+, an NED system geared for Arabic text input.
AIDArabic+ overcomes the mentioned challenges by utilizing:
• Enriched Arabic data schema (EDRAK): EDRAK is an Arabic entity-centric
resource that offers comprehensive name and contextual keyphrase dictionaries for
entities from both the Arabic and English Wikipedias. The dictionaries are not limited to
the manually crafted data in the Arabic Wikipedia; instead, several external name
dictionaries are harnessed. In addition, type-aware named-entity translation and transliteration techniques are developed to automatically compile
EDRAK's dictionaries.
• Arabic input pre-processing components: We integrated an Arabic morphology-based pre-processing component to perform deep tokenization and normalization,
consequently enhancing name matching and context similarity estimation.
• Annotated NED corpus for Arabic: We automatically created an Arabic annotated corpus using a manually translated and aligned parallel English-Arabic corpus.
The produced corpus was used to evaluate the effect of the proposed components.
1.4 Outline
The following chapter discusses Named Entity Disambiguation concepts and essential
components as well as existing systems supporting Arabic. Chapter 3 describes our
general approach to create AIDarabic+. Then, the creation of EDRAK, our enriched
data schema, is explained in Chapter 4. We describe the statistics of the generated
resource EDRAK and the manual assessment performed to verify the resource quality in
Chapter 5. Finally, we discuss the creation of the annotated corpus and the effect of the
proposed approaches on the quality of the full NED pipeline.
Chapter 2
Background
2.1 Named Entity Disambiguation
Named-Entity Disambiguation (NED) (or, in some literature, Entity Linking [20]) is
the problem of mapping ambiguous mentions of named entities such as places, organizations, and persons appearing in natural language input text onto canonical entities
registered in a Knowledge Base (KB) [27] such as DBpedia [6], YAGO [26], and BabelNet [45].
NED is different from Named Entity Recognition (NER), which is only concerned with
extracting named entities and classifying them into the coarse-grained categories LOCATION,
ORGANIZATION, MISC and PERSON. NER is usually performed to recognize the named entities
fed into the disambiguation process.
It is worth illustrating that the NED task is different from the Word Sense
Disambiguation (WSD) task. While WSD is concerned with resolving the correct
meaning of words and concepts such as “bank” or “plant” in the provided context,
NED focuses on mapping ambiguous names to the correct entities. For instance,
the sentence “Müller plays for the German National Team” has two ambiguous names,
“Müller” and “German National Team”. “Müller” can refer to any person with this
name, and “German National Team” can be either the German football team or the
German basketball team. Nevertheless, both tasks require rich back-end name and word
context dictionaries to map the word correctly [62].
The NED problem is well studied for English input. Several NED systems have been
developed for the English language such as DBpedia Spotlight [40], Illinois Wikifier [50],
Tagme2 [17], AIDA [27, 66], NERD-ML [61], AGDISTIS [59] and Babelfy [43]. Besides,
several annotated corpora have been developed to evaluate the performance of these
systems on English input [60] such as TAC Entity-Linking task [29], KORE-50 [25],
#Microposts2015 [51] and AIDA-CoNLL [27].
On the other hand, only a few of these systems are capable of processing input in
other languages. Moreover, only a few corpora exist for other languages, such as the TAC
Entity-Linking Spanish and Chinese tracks.
2.2 Named Entity Disambiguation for Arabic
Resource-poor languages such as Arabic have limited support. Some research attempted
to cast Arabic disambiguation as a Cross-Lingual Information Retrieval (CLIR) problem.
McNamee et al. (2011) [38] developed a cross-language entity linking approach to map
names in text of any language to the entities of the English Wikipedia registered in the
TAC-KBP [63, 39]. The input names and context are translated/transliterated to English
before processing; then, the linking is performed as a monolingual English problem. In
order to evaluate the performance of their approach, they developed a persons-only cross-language ground truth for their experiments, using parallel corpora and crowd-sourcing
for creating annotations [37]. However, this approach overlooks the language- and culture-specific entities and names that may not exist in the English-only KB.
To the best of our knowledge, Babelfy and AIDArabic [67] are the only systems built to
disambiguate Arabic mentions against a KB containing Arabic entities together with their
potential names.
Babelfy¹ is a multilingual system that combines both the WSD and the NED tasks.
It uses BabelNet [46] as its back-end KB. BabelNet is a multilingual resource built
using Wikipedia entities and WordNet senses. It uses the sense labels, Wikipedia
titles from incoming links, outgoing anchor texts, redirects and categories as sources
for the disambiguation context. In addition, an off-the-shelf translation service was used
to translate Wikipedia concepts to other languages. Nevertheless, translation was not
applied to named entities [46]. Babelfy was evaluated on English, Spanish, Italian,
German, and French corpora, but not on an Arabic one.
AIDArabic [67] is an NED system that has been built specifically for Arabic on
top of the YAGO3 [36] KB. While AIDArabic's entity catalog spans a sufficiently large
number of entities from the English and Arabic Wikipedias, it exhibits low coverage in its
entity-name and entity-description dictionaries. Hence, the recall of the disambiguation
was heavily harmed.
¹ http://babelfy.org/
2.3 AIDArabic: Under The Hood
In this section, we describe in detail the main data components of AIDArabic and how
they are used in the disambiguation process.
2.3.1 Data Sources

AIDArabic, similar to most NED frameworks, has three main data components:
Entity Catalog, Name-Entity Dictionary and Entity Descriptions. In addition to these
three components, AIDArabic uses an Entity-Entity relatedness model as a supporting
component.
Entity Catalog
The entity catalog, or repository, is the source of the canonical entities known to the NED
system. During the disambiguation process, all names in the text are mapped to one of
the entities in the catalog. Names without proper mapping to any entity in the catalog
are mapped to null.
AIDArabic populates its entity catalog from the YAGO3 KB [36], built from both the English
and Arabic Wikipedias. This allows capturing prominent English entities as well as
culture-specific Arabic entities. For the sake of data integrity, English entity
identifiers are used to represent entities existing in both the English and the Arabic
Wikipedias.
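The catalog construction described above can be sketched as follows. This is an illustrative sketch, not the actual YAGO3 build code; the entity IDs and the interwiki mapping below are invented for the example.

```python
# Hypothetical sketch of unifying the entity catalog from both Wikipedias.
def build_entity_catalog(en_entities, ar_entities, ar_to_en):
    """Union of English and Arabic entities; an entity present in both
    Wikipedias is represented once, by its English identifier."""
    catalog = set(en_entities)
    for ar_id in ar_entities:
        # Map to the English ID when an interwiki link exists;
        # otherwise keep the Arabic-only entity under its own ID.
        catalog.add(ar_to_en.get(ar_id, ar_id))
    return catalog

en = {"Barack_Obama", "Goethe_Prize"}
ar = {"ar/باراك_أوباما", "ar/لا_تحزن"}
links = {"ar/باراك_أوباما": "Barack_Obama"}
catalog = build_entity_catalog(en, ar, links)
# Barack_Obama appears once; the Arabic-only book keeps its Arabic ID.
```

This mirrors the design choice above: an entity known in both Wikipedias gets a single canonical (English) identifier, while Arabic-only entities survive under their own IDs.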
Entity-Name Dictionary
Entity-Name Dictionary contains the possible names for each entity in the catalog. Names
in the dictionary are connected to all potential entities. The dictionary is then used to
extract all possible candidates for mentions appearing in the text. Entities that do not
have any potential names cannot appear in the disambiguation candidate list.
In AIDArabic, the name dictionary is populated from the Arabic Wikipedia data only
(Figure 2.1), and names belong to one of the following four sources:
• Titles of the Wikipedia pages. Titles differ from the page id that appears in the
URL.
Figure 2.1: Building Name Dictionary in AIDArabic
• Disambiguation Pages, in Arabic “صفحات توضيح”. These pages contain all
possible entities/meanings referred to by a specific name. The title of the
disambiguation page is added as a potential name for all entities referenced in this
page.
• Redirects, in Arabic “تحويلات”. Redirects are pages with no actual content but
refer the reader to another page. Redirects are used when searching Wikipedia to
route the user to the most prominent entity referred to by this name. For example,
searching “أم القرى” (Um Alkora) redirects to “مكة” (Mekka).
• Anchor Text of links pointing to the entity page. Anchor texts can differ from
the original title, and hence, they are harvested as potential names.
As shown, only manually crafted content is used in building the name dictionary.
This limits the size of the dictionary to the existing Arabic content.
Technically, this information is collected from YAGO3 RDF tuples, where redirects are
represented with the predicate <redirectedFrom> and the remaining names are represented under the
predicate rdfs:label. In addition, separated person names under <hasGivenName>
and <hasFamilyName> are added to the dictionary.
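Collecting names from such tuples can be sketched as below. The predicate names come from the text above, while the sample tuples are invented for illustration; the real extraction runs over the full YAGO3 dump.

```python
from collections import defaultdict

# Predicates that carry name evidence, as listed in the text above.
NAME_PREDICATES = {"rdfs:label", "<redirectedFrom>",
                   "<hasGivenName>", "<hasFamilyName>"}

def build_name_dictionary(rdf_tuples):
    """Map each surface name to the set of candidate entities."""
    name_dict = defaultdict(set)
    for subj, pred, obj in rdf_tuples:
        if pred in NAME_PREDICATES:
            name_dict[obj].add(subj)
    return name_dict

# Invented sample tuples for illustration only.
tuples = [
    ("<Barack_Obama>", "rdfs:label", "باراك أوباما"),
    ("<Barack_Obama>", "<hasFamilyName>", "أوباما"),
    ("<Germany>", "rdfs:label", "ألمانيا"),
]
d = build_name_dictionary(tuples)
# Looking up a mention string now yields its candidate entities.
```

A name shared by several entities would simply map to a larger candidate set, which is exactly what the disambiguation stage later has to resolve.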
Entity Description
Entity descriptions or contextual keyphrases are the set of keyphrases that describe an
entity and are expected to appear in the text surrounding the entity mention. For
example, when “Thomas Müller”, the German footballer, appears in some text, usually
words related to football also appear in the text, such as football, match, national team,
Germany, goal, etc. Contextual keyphrases are used to compute the similarity between
the mention context and the candidate entity context.

Figure 2.2: Building Keyphrases Dictionary in AIDArabic

AIDArabic utilizes an entity-description dictionary of Arabic keyphrases. Keyphrases
are further split up into keywords, each with a specific weight. Keyphrases are harvested
from three sources (Figure 2.2):
• Anchor Text inside the Arabic entity pages that point to other pages are assigned
as keyphrases to this entity.
• Inlink Titles are the titles of the pages that link to the current entity. Inlink
titles of Arabic pages pointing to an entity are added directly to its keyphrase
set. However, English inlink titles are translated to Arabic via the cross-language
inter-Wikipedia links dictionary. For example, to include the inlink title of the page
“<Egypt>”, the dictionary pair “<Egypt> → <ar/مصر>” is used to get the Arabic
title “مصر”.
• Categories are manual classes added to each entity. Similar to entities, YAGO3
contains a union of the English and Arabic categories. English Wikipedia ids
are used to represent the categories, unless the category only exists in Arabic.
Similar to the inlink titles, Arabic categories are added directly to the keyphrases,
but English categories are translated via the cross-language inter-Wikipedia links
between categories.

As mentioned above, only inlink titles and categories are translated using the manual
inter-Wikipedia links. Moreover, no other external dictionaries are used; hence, the
context is still limited to the size of the Arabic Wikipedia.
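The splitting of keyphrases into weighted keywords, and the resulting context similarity, can be sketched as below. The IDF-style weighting and all numbers are illustrative placeholders, not AIDArabic's actual scoring.

```python
import math
from collections import Counter

def keyword_weights(keyphrases, doc_freq, num_entities):
    """Split keyphrases into keywords, each with an IDF-style weight:
    keywords occurring with fewer entities get higher weights."""
    weights = {}
    for phrase in keyphrases:
        for word in phrase.split():
            weights[word] = math.log(num_entities / doc_freq.get(word, 1))
    return weights

def context_similarity(mention_context, entity_keywords):
    """Sum the weights of entity keywords appearing near the mention."""
    tokens = Counter(mention_context.split())
    return sum(w for kw, w in entity_keywords.items() if kw in tokens)

# Invented frequencies: "national" is rarer than "team" or "football".
weights = keyword_weights(["national team", "football"],
                          {"national": 50, "team": 200, "football": 500},
                          10_000)
sim = context_similarity("the german national squad", weights)
# Rarer keywords contribute more to the similarity than common ones.
```

The point of the weighting is that a distinctive keyword shared between the mention context and an entity description is stronger evidence than a generic one.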
Entity-Entity relatedness model
It is common that a single text snippet or document contains a number of related
entities. Therefore, AIDArabic exploits an entity-entity relatedness model to improve
the quality of the disambiguation results. The relatedness is estimated based on the
overlap in the incoming links [41], fused from both the Arabic and the English Wikipedias.
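A common instantiation of inlink-overlap relatedness is the Wikipedia link-based measure of Milne and Witten, which we assume [41] refers to; the sketch below follows that formula, though the exact variant used in AIDArabic may differ.

```python
import math

def relatedness(inlinks_a, inlinks_b, num_entities):
    """Entity-entity relatedness from shared incoming Wikipedia links,
    in the style of the Milne-Witten measure."""
    common = len(inlinks_a & inlinks_b)
    if common == 0:
        return 0.0
    big = max(len(inlinks_a), len(inlinks_b))
    small = min(len(inlinks_a), len(inlinks_b))
    score = 1 - (math.log(big) - math.log(common)) / \
                (math.log(num_entities) - math.log(small))
    return max(0.0, score)

# Toy inlink sets: the two entities share two incoming pages.
a = {"page1", "page2", "page3"}
b = {"page2", "page3", "page4", "page5"}
score = relatedness(a, b, 1_000_000)
```

Fusing the inlink sets from both Wikipedias, as the text describes, simply enlarges the sets fed into this measure.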
2.3.2 Processing
AIDArabic, like most NED systems, starts with retrieving the possible name
mentions from the input text. Name mentions are usually recognized via a NER system.
Retrieved mentions are normalized (e.g. converting text to lowercase or uppercase in
English). Then, the possible candidate entities for the mentions in the text are retrieved
from the Entity Catalog using the Name Dictionary.
In order to resolve the mention-to-entity mapping, a weighted graph of the mentions
and the candidate entities is constructed. Weights on edges between mentions and their
candidates are estimated from the entity keyphrases and the mention context similarity as
well as the candidate entity popularity (i.e. prior). Weights on edges between candidate
entities are assigned according to the entity-entity relatedness scores. The disambiguation
problem is solved by iteratively reducing the graph to a dense sub-graph until each mention
is connected to exactly one candidate.
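The iterative graph reduction can be sketched as follows. The score function here stands in for the combined similarity, prior and coherence weights; the real AIDA algorithm works on weighted degrees and keeps the best solution seen, so this is only a simplified illustration.

```python
def disambiguate(candidates, score):
    """candidates: mention -> set of candidate entities.
    score(entity, other_candidates) -> float standing in for the
    combined similarity, prior and coherence weight."""
    cand = {m: set(es) for m, es in candidates.items()}
    while any(len(es) > 1 for es in cand.values()):
        worst = None  # (mention, entity, score) of the weakest candidate
        for m, es in cand.items():
            if len(es) <= 1:
                continue  # this mention is already resolved
            others = set().union(*(v for k, v in cand.items() if k != m))
            for e in es:
                s = score(e, others)
                if worst is None or s < worst[2]:
                    worst = (m, e, s)
        cand[worst[0]].discard(worst[1])  # drop the weakest candidate
    return {m: next(iter(es)) for m, es in cand.items() if es}

# Toy coherence: only Thomas Müller is related to the national team.
coherent = {("Thomas_Müller", "Germany_national_football_team")}

def toy_score(entity, others):
    return 1.0 if any((entity, o) in coherent or (o, entity) in coherent
                      for o in others) else 0.1

mapping = disambiguate(
    {"Müller": {"Thomas_Müller", "Gerd_Müller"},
     "German National Team": {"Germany_national_football_team"}},
    toy_score)
# mapping["Müller"] == "Thomas_Müller"
```

The coherence term is what lets the unambiguous "German National Team" mention pull "Müller" toward the footballer rather than some other person with the same name.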
Chapter 3
AIDArabic+
3.1 AIDArabic Challenges
The original AIDArabic introduced NED for Arabic text. Nevertheless, it exhibited a
low recall compared to the English AIDA [67]. This problem has two main roots:
First, while AIDArabic utilized a comprehensive entity catalog, the generated name
and contextual keyphrases dictionaries are still limited to the manually crafted information
in the Arabic Wikipedia and the cross-language inter-Wikipedia links. In turn, the
Arabic Wikipedia does not have sufficient coverage and quality. Not only does it miss
many entities, but existing entities also have short, non-comprehensive pages. The
Arabic Wikipedia, as the most used structured resource, is not capable of covering the
fast-growing Arabic content.
Secondly, AIDArabic follows the same tokenization and normalization applied to
Latin input without any Arabic-specific pre-processing. Improperly tokenized names
cannot be matched against the name dictionary using the strict matching mechanism
adopted in AIDArabic; consequently, no candidate entities will be retrieved for such a
mention. Similarly, entity-mention similarity computation using keyphrases is negatively
affected.
3.2 AIDArabic+ in A Nutshell
AIDArabic+ aims at achieving robust NED on Arabic text. Hence, we need to target
the weak points in both the data schema and the processing components to enhance the
overall recall and precision.
In this work, we introduce the EDRAK resource as an automatic augmentation of the
AIDArabic resources. We propose two approaches to overcome the limited data of the
Arabic Wikipedia. The first approach is to collect possible names from other resources
on the web using semantic and syntactic equivalence. The second is to
incorporate translation and transliteration techniques to automatically generate Arabic
content based on evidence from the English and Arabic Wikipedias, beyond the direct
cross-language inter-Wikipedia mapping. In order to guarantee building an accurate
data schema, different rules are enforced on the techniques according to the type of the
entity and the source of the data, as discussed in Section 3.3.
In addition, we introduce the integration of a pre-processing component into the NED
pipeline to handle Arabic-specific features. Section 3.4 illustrates the set of procedures
proposed for proper Arabic normalization and tokenization to achieve better name
matching and context similarity estimation.
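As a rough preview of such pre-processing, the sketch below applies common Arabic normalization steps (diacritic and tatweel removal, alef and teh-marbuta unification); the actual procedures of Section 3.4 may differ.

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # tashkeel marks
TATWEEL = "\u0640"                           # elongation character

def normalize_arabic(text):
    """Common (illustrative) Arabic normalization steps."""
    text = DIACRITICS.sub("", text)       # strip short-vowel diacritics
    text = text.replace(TATWEEL, "")      # strip elongation
    text = re.sub("[إأآا]", "ا", text)     # unify alef variants
    text = text.replace("ى", "ي")         # unify alef maqsura / ya
    text = text.replace("ة", "ه")         # unify teh marbuta / ha
    return text

print(normalize_arabic("أحمَد"))  # → "احمد"
```

Normalizing both the dictionary entries and the input text with the same procedure is what makes strict string matching viable despite Arabic's orthographic variation.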
3.3 Enriching Data Components
As illustrated in Section 2.3, the three main data components necessary for our NED
system are the entity catalog, the name dictionary and the entity descriptions (i.e. contextual
keyphrases). The fourth component of AIDArabic, the entity-entity relatedness model,
depends on the topology of the KB, but not on the language used. Hence, it does not
require language-specific enhancements. This section discusses the idea behind applying
enrichment techniques to each of the three main components to improve the Arabic
NED process. The design decisions and the implementation of the proposed enrichment
approaches (as closed modules) are discussed in Chapter 4.
Let us consider this hypothetical Arabic sentence:

قد يترشح عائض القرني لجائزة غوته عن كتابه لا تحزن

written in English for clarity as:

Aaidh Al-Qarni might get nominated for the Goethe Prize for his book La Tahzan

This sentence has three named entities: a writer (Aaidh Al-Qarni / عائض القرني), a
prize (Goethe Prize / جائزة غوته) and a book (La Tahzan / لا تحزن). We will illustrate
how we can adapt each data component of the NED framework to be capable of correctly
disambiguating them.
Entity Catalog
By considering our example, the writer is prominent enough to exist in both the English
and the Arabic Wikipedias. However, despite the fact that the book is translated, it has
only an Arabic Wikipedia page. Moreover, the prize is not well known enough in the Arab
world to exist in the Arabic Wikipedia¹.
In order to disambiguate such a sentence, we need to make sure that the entity
repository contains all of those entities. Therefore, we followed the same approach as
in AIDArabic [67]: we used YAGO3, compiled from both the English and the Arabic
Wikipedias, as our back-end KB. This allows capturing prominent English entities as
well as local entities that are only known in the Arabic culture.
Name Dictionary
Generally, the entity-name dictionary is an influential component of any NED system.
Having an incomplete dictionary dramatically harms the disambiguation quality. If
the dictionary misses a name-entity entry, either no candidates will be nominated for
disambiguation or, even worse, a wrong entity might be picked for one or more mentions.
Since modern NED systems consider coherence measures when collectively resolving
all mentions, one or more wrong mappings might propagate and mislead the mapping of
other mentions onto wrong canonical entities.
We started with the same sources collected in the original AIDArabic: we harvest
Wikipedia page titles, anchor texts and disambiguation-page titles (under the predicate
rdfs:label in YAGO3 [36]), as well as redirects. We also include separate given names
and family names of persons. Nevertheless, the contribution of these sources to the
Arabic dictionary is limited, as shown in the statistics in Section 5.1.
In order to correctly disambiguate the names in our example, we need a dictionary
that is aware of the Arabic names of all three entities. Since the writer (Aaidh
Al-Qarni / عائض القرني) and the book (La Tahzan / لا تحزن) exist in the Arabic Wikipedia,
our name dictionary has at least one Arabic name for both of them (their page titles).
On the other hand, the Goethe Prize exists only in the English Wikipedia, without any
potential Arabic name. Therefore, the correct entity will not be nominated as a candidate
for its Arabic mention.
In this work, we propose to go beyond Wikipedia content via automatic data
generation. However, it is a challenging task to automatically build an entity-name
¹ As of 1 July 2015.
[Figure: the entity-name dictionary maps entities (e.g. <Barack_Obama>, <Germany>,
<Egypt>) to Arabic names. Arabic titles, Arabic anchors, Arabic disambiguation pages
and Arabic redirects from YAGO3 (EN, AR) feed it directly; English titles, anchors,
redirects and disambiguation pages are routed through external dictionaries (external
dict. names), translation (translated names) and transliteration (transliterated
person names).]

Figure 3.1: Building Name Dictionary in AIDArabic+
dictionary that captures name variations for all entities in the entity catalog while
preserving data precision. For example, the Arabic name of “Goethe Prize” is
obtained by (1) transliterating “Goethe” into the Arabic script, (2) translating
“Prize” into Arabic, and finally (3) reordering the tokens to follow Arabic writing rules.
Therefore, we introduce three approaches to enrich the entity-name dictionary of
AIDArabic+:
1. External Name Dictionaries: We harness existing English-Arabic name
dictionaries via semantic and syntactic equivalence. For example, if two strings
from one or more dictionaries link to the same canonical entity, we consider
them potential name aliases. In Section 4.1, we discuss the harnessed resources as
well as the procedure designed for the integration process.
2. Entity-Name Translation: While external dictionaries (e.g. gazetteers) and
hyperlinks extracted from the web provide Arabic names for some English entities,
many entities still lack potential Arabic names. Such names should be generated
rather than merely extracted. Moreover, general-purpose translation systems exhibit
problems translating/transliterating named entities [4, 21, 7] even when they appear
within a context. Therefore, we introduce entity-name translation to populate our
dictionary with accurate, automatically generated Arabic names, as discussed in
detail in Section 4.2.
3. Persons Names Transliteration: A fair amount of entities obtained one or more
Arabic names using external resources and/or translation. However, English names
have different variants when written in the Arabic script. In addition, not all
[Figure: the entity-keyphrases dictionary maps entities (e.g. <Barack_Obama>,
<Germany>, <Egypt>) to Arabic keyphrases. Arabic anchor texts, Arabic inlink titles
and Arabic categories from YAGO3 (EN, AR) feed it directly; English inlinks and
English categories pass through the interwiki dictionary and translation.]

Figure 3.2: Building Keyphrases Dictionary in AIDArabic+
name variants can be generated by translation. Therefore, we incorporate a
transliteration module geared toward PERSON names. While transliteration is
applicable to many NON-PERSON entities, applying it to them would create many
inaccurate entries that should instead be fully or partially translated. Thus,
we decided to exclude them from the transliteration process.
Technically, we applied these approaches to the English names from all sources,
namely given names, family names, redirects and rdfs:label (which includes titles,
anchor texts and disambiguation pages), as in Figure 3.1.
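Conceptually, the dictionary assembly sketched above is a union of (entity, name) pairs from several generators, keeping each pair's provenance. The following is a minimal illustration (the source labels and the schema are assumptions for this sketch, not the actual AIDArabic+ data model):

```python
from collections import defaultdict

def build_name_dictionary(sources):
    """Union (entity, name) pairs from several generators into one dictionary.

    `sources` maps a source label (e.g. 'rdfs:label', 'redirect', 'translation',
    'transliteration') to an iterable of (entity, name) pairs; provenance is
    kept so that low-precision sources can later be weighted or filtered.
    """
    dictionary = defaultdict(set)  # entity -> {(name, source), ...}
    for label, pairs in sources.items():
        for entity, name in pairs:
            dictionary[entity].add((name, label))
    return dictionary
```

Keeping the source label per name makes it possible to evaluate each enrichment approach separately, as done in Chapter 5.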
Entity Contextual Keyphrases
Contextual keyphrases are used as a set of descriptions for the entity. Entity keyphrases
are matched against the input text to compute a similarity score between the expected
context of the candidate entity and the context in which the mention exists.
As shown in Figure 3.2, we used the standard approaches in the original AIDArabic
to extract entity contexts from Wikipedia (Wikipedia Anchor texts, Wikipedia categories,
and Wikipedia inlink titles). Similar to AIDArabic, the English inlink titles and categories
are looked up in the English-Arabic inter-Wikipedia links dictionary. However, English
names without entries in the dictionary were neglected; consequently, many English
entities were not supported by any context description. Entities that lack an adequate
description cannot be promoted as the winning entity in the mapping process.
In AIDArabic+, we overcome the low coverage and quality of the Arabic Wikipedia
by applying the entity-name translation and person-name transliteration techniques to
the Wikipedia inlink titles left unresolved by the inter-Wikipedia dictionary look-up.
Furthermore, we trained our category translation module on a parallel corpus extracted
from the English-Arabic inter-Wikipedia links of categories. Finally, while it seems
possible to translate anchor texts as well, we did not do so, because anchor texts are
noisy and sometimes long, which would lead to inaccurate translations.
3.4 Language Specific Processing
Name matching and context similarity computation are necessary for a successful
disambiguation process, and achieving robust matching requires clean input.
However, due to the specific morphological characteristics of Arabic text, standard
English text normalization and tokenization techniques (e.g. converting to lowercase) are
not directly applicable.
For example, Arabic text has several specific features:

• The definite article (AL) "ال" is attached to the beginning of the word (e.g. the
library: المكتبة). Nevertheless, not every "ال" at the beginning of a word is a
definite article.

• Some prepositions, such as ب, ك and ل, and the connectors ف and و appear attached
to the beginning of the word (e.g. ب at the beginning of "in France": بفرنسا).

• Several pronouns are attached to the end of the word. For example, the last two
characters ها at the end of حديقتها (meaning "its park").

• Sometimes, Arabic text is written with diacritics to express vowels and facilitate
pronunciation. Arabic diacritics appear as decorations above and under the base
character. They differ according to the meaning of the word and its position in the
sentence (i.e. subject or object).
In AIDArabic+, we incorporate an Arabic-specific pre-processing component for the
input text and for data schema building.
There are two state-of-the-art systems that perform morphology-based analysis:
MADAMIRA [49] and the Stanford Arabic Word Segmenter [42]. The Stanford Word
Segmenter provides a handy Java API that is easy to integrate; hence, we have used it for
pre-processing. Our pre-processing component performs two main steps:
Tokenization
Besides normal word tokenization based on punctuation and spaces, we perform
word segmentation to split clitics and attached connectors. Split suffixes and prefixes
are marked with the special character '+', indicating that they were originally connected
to the next or previous word. This allows reconstructing the original input text. For
example, بفرنسا (pronounced "bi-Faransa") is segmented into ب+ فرنسا, and حديقتها
is segmented into حديقة +ها. It is worth noting that the Stanford Word Segmenter does
not split definite articles.
Splitting prefixes and suffixes, i.e. clitics, should increase name matching accuracy
and hence enhance the coverage and quality of the candidate retrieval process. Furthermore,
it allows better keyphrase matching, which is essential for achieving accurate
disambiguation results.
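The '+'-marking convention can be illustrated with a toy splitter. This is only a sketch: the actual clitic analysis in AIDArabic+ comes from the Stanford Arabic Word Segmenter, and the clitic lists below are deliberately partial:

```python
# Partial clitic inventories, for illustration only.
PROCLITICS = ("\u0628", "\u0643", "\u0644", "\u0641", "\u0648")  # b-, k-, l-, f-, w-
ENCLITICS = ("\u0647\u0627", "\u0647")                           # -ha ("its/her"), -h ("its/his")

def mark_clitics(token):
    """Split a known proclitic/enclitic off `token`, marking attachment with '+'."""
    parts = []
    if len(token) > 2 and token.startswith(PROCLITICS):
        parts.append(token[0] + "+")   # '+' says: was attached to the NEXT word
        token = token[1:]
    suffix = next((s for s in ENCLITICS
                   if token.endswith(s) and len(token) > len(s) + 1), None)
    if suffix:
        parts.append(token[: -len(suffix)])
        parts.append("+" + suffix)     # '+' says: was attached to the PREVIOUS word
    else:
        parts.append(token)
    return parts
```

Note that a real segmenter also restores the underlying form of the stem (e.g. the ta-marbutah in حديقة), which this character-level sketch does not attempt.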
Normalization
Unlike English, Arabic text normalization includes several steps, and the applied
normalization should be customized to the application. The most common normalizations are:
• Removing Diacritics: Despite the fact that removing diacritics increases the
ambiguity of some words, it is important for obtaining a uniform representation of the
same word. For example, the sentence وُلِدَ ألبرت أينشتاين في مدينة أولم ("Albert
Einstein was born in the city of Ulm") becomes ولد ألبرت أينشتاين في مدينة أولم
after removing the diacritics.

• Normalizing Hamza: This includes replacing the different forms of the letter
Hamza (e.g. أ, إ and آ) with the normalized form ا. This helps avoid common typing
mistakes and confusing different states of the same word.
• Normalizing Ya: In order to avoid a common writing mistake in informal
text, the character Ya ي (with dots) is replaced with ى (without dots).
• Normalizing Ta-marbutah: Similarly, to avoid the different writing forms of
Ta-marbutah, the character ة (with dots) is replaced with ه (without dots).
• Removing Tatweel: Some informal text uses series of 'ـ' (U+0640) to extend
words. All Tatweel characters should be removed to recover the pure word.
• Normalizing Punctuation: In some cases, it is useful to replace the Arabic
punctuation with equivalent ASCII symbols.
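The character-level normalizations above (all except the punctuation mapping) can be sketched with Unicode code points. This is a minimal illustration; the exact character set handled by the thesis pipeline may differ:

```python
import re

# Arabic diacritics occupy U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = re.compile(r"[\u064B-\u0652]")
TATWEEL = "\u0640"  # the elongation character 'ـ'

def normalize_arabic(text):
    """Apply diacritics, Hamza, Ya, Ta-marbutah and Tatweel normalization."""
    text = DIACRITICS.sub("", text)                          # remove diacritics
    text = re.sub("[\u0623\u0625\u0622]", "\u0627", text)    # Hamza forms -> bare alif
    text = text.replace("\u064A", "\u0649")                  # Ya (dots) -> dotless form
    text = text.replace("\u0629", "\u0647")                  # Ta-marbutah -> Ha
    text = text.replace(TATWEEL, "")                         # strip elongation
    return text
```

The same function can serve both schema building and query-time input cleaning, which is what guarantees that dictionary keys and input mentions meet in the same normalized space.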
In AIDArabic+, we apply diacritics removal, Hamza normalization, Ya normalization,
Ta-marbutah normalization and Tatweel removal to achieve decent input quality
that guarantees higher matching recall without sacrificing precision. Recalling our
example in Section 3.3, the input sentence

عائض القرني قد يرشح لجائزة جوته عن كتابه لا تحزن

after normalization and tokenization becomes

عائض القرني قد يرشح ل+ جائزه جوته عن كتاب +ه لا تحزن

As the example shows, names and contextual keyphrases in the normalized sentence
become easier to match against the AIDArabic+ dictionaries. The preposition ل+
attached to the beginning of the word جائزه is detached and can be treated as a stop
word. Similarly, the pronoun +ه has been detached from the end of كتابه.
Chapter 4
EDRAK
Entity-Centric Resource For Arabic Knowledge
EDRAK is an entity-centric resource developed as the back-end schema for AIDArabic+
NED, as shown in Section 3.3. This chapter focuses on the automatic generation
techniques beyond Wikipedia used in EDRAK, together with the decisions taken
within each technique. Section 4.1 describes the integration of external dictionaries.
Then, named-entity translation methods are discussed in Section 4.2, person-name
transliteration in Section 4.3, and Arabic name splitting in Section 4.4. Finally, technical
details about EDRAK as a general-purpose standalone resource are explained in Section 4.5.
4.1 External Name Dictionaries
Wikipedia, as the largest comprehensive online encyclopedia, is the most used corpus for
creating knowledge bases such as YAGO [26], DBpedia [6] and Freebase [10]. Due to the
limited size of the Arabic Wikipedia, building a strong semantic resource becomes a
challenge. One approach to go beyond Wikipedia's limits is to capture possible Arabic
names mentioned in other resources, such as websites and online news, and attach them
to the corresponding entities. These resources are usually harvested through automatic
or semi-automatic processes. Among the generated resources, some are entity-aware,
while others are purely textual name dictionaries.
4.1.1 Entity-Aware Resources
An entity-aware resource registers canonical entities along with their names and, in
some cases, their context descriptions. Google Word-to-Concept (GW2C) [54] is a
multilingual entity-aware resource that harvests Wikipedia concepts, including named
entities (NEs), and their possible names from both Wikipedia and non-Wikipedia web pages.
Resource Description: Concepts' strings (i.e. names) are harvested from:
• English Wikipedia page titles.
• English anchor texts of inter-Wikipedia links into the concept.
• Anchor texts from non-Wikipedia pages to Wikipedia concepts that have an English page.
Name-to-concept mappings are stored together with a strength score, measured as the
conditional probability P(Concept|name), i.e. the ratio of links carrying this name that
point to the Wikipedia concept. Nevertheless, names in GW2C are stored without any
kind of post-processing or cleaning.
Name (Arabic)    Score     Concept                 Flags
(Arabic name)    0.0013    Chuck (engineering)     W08 W09 WDB Wx:1/500
(Arabic name)    1         Spyware                 W08 W09 WDB Wx:8/8
(Arabic name)    0.5       École Militaire         W08 W09 WDB Wx:3/3
(Arabic name)    0.0005    World War I             KB W08 W09 WDB Wx:4/5357
(Arabic name)    1         Treaties of Rome        KB W:4/4 W08 W09 WDB Wx:3/3
(Arabic name)    1         Mărginimea Sibiului     W09 W08 WDB Wx:1/1

Table 4.1: Sample of Google-Word-to-Concept raw data
GW2C contains 297M multilingual name-to-concept mappings. As shown in Table 4.1,
the first and third columns contain the retrieved names and their Wikipedia concept
URLs, respectively. The second column contains the conditional probability computed
from the witness counts presented in the flags column (fourth column).
Integrating with the AIDArabic+ Resource: GW2C is created automatically without
any manual verification or post-processing. Therefore, it contains noise that should be
filtered out. In order to include GW2C names in our name dictionary, we perform the
following steps:
1. Detecting Arabic names using the off-the-shelf language detection tool developed
by Shuyo (2010) [53] to filter out non-Arabic records. This left only 736K Arabic
entries out of the 297M records.

2. Filtering out ambiguous names based on the provided conditional probability
scores. Excluding records with low scores filters out anchor texts such as "Read
more" (اقرأ المزيد), "Wikipedia page" (صفحة الويكيبيديا) or "more on Wikipedia"
(المزيد على ويكيبيديا). We used 0.01 as a lower threshold on the provided scores.
3. Name-level post-processing to remove URLs, punctuation, and common prefixes
and suffixes.

4. Mapping names to AIDArabic+ entities using Wikipedia page URLs.
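The four cleaning steps above can be sketched as one filtering pass. The language detector, the threshold and the stripped character set are stand-ins here (the thesis uses Shuyo's detector and a 0.01 threshold; the punctuation set is an assumption):

```python
import re

def filter_gw2c_records(records, is_arabic, min_score=0.01):
    """Clean GW2C (name, score, concept_url) records following steps 1-4."""
    kept = []
    for name, score, concept_url in records:
        if not is_arabic(name):
            continue                               # step 1: keep Arabic names only
        if score < min_score:
            continue                               # step 2: drop low-probability names
        name = re.sub(r"https?://\S+", "", name)   # step 3: strip URLs...
        name = name.strip(" \t\"'()[]{}.,;:!?")    # ...and surrounding punctuation
        if name:
            kept.append((name, concept_url))       # step 4: map entity via the URL
    return kept
```

In practice the concept URL is then resolved against the YAGO3 entity catalog, so names whose concept is missing from the catalog are dropped as well.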
4.1.2 Lexical Name Dictionaries
Lexical name dictionaries are another type of resource; they contain just name variants
in different languages without any notion of canonical entities. Since these dictionaries
do not consider semantic differences, the name variants can map to different entities.
Therefore, we use them as look-up dictionaries to translate English entity names into
Arabic. We have utilized two dictionaries that have undergone manual verification.
ID    Type    Lang    Name
62    P       u       Javier+Solana
62    P       u       (Arabic variant)
62    P       u       (Arabic variant)
62    P       u       (Arabic variant)
62    P       ar      (Arabic variant)
62    P       sl      Javierjem+Solano

Table 4.2: Sample of JRC-Names raw data
JRC-Names [55] is a multilingual resource of organisation and person names
extracted from news articles and Wikipedia. In the creation of JRC-Names, manually
compiled lists of language-specific rules and triggers, such as person titles,
ethnic groups or modifiers, were used to extract person names. In addition, a list of
frequent words (e.g. club, organization, bank, etc.) was used to extract organization names.
The similarity between the names extracted from news and those from Wikipedia
page titles was computed to recognize name variants. Names in non-Roman scripts were
romanized; hence, monolingual edit distance could be used as a unified similarity function.
Names below the specified similarity threshold were manually matched to the corresponding name
cluster. Finally, names that appeared in five different news clusters, were manually
validated, or were found in Wikipedia were included in the published dictionary.
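The monolingual similarity used for variant matching is plain edit distance over the romanized forms; for illustration, a standard Levenshtein implementation:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions
    turning string `a` into string `b` (computed row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion from `a`
                           cur[j - 1] + 1,       # insertion into `a`
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

Romanizing all variants first is what makes a single such function usable across scripts, instead of requiring script-pair-specific similarity measures.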
The dictionary has 617k multilingual name variants, of which only 17k are Arabic name
variants. As shown in Table 4.2, variants of the same name share a unique identifier. In
addition, type and partial language tagging are provided with the names.
English                    Arabic           Type            Annotations
National Investors Bank    (Arabic name)    ORGANIZATION    000  0.3 0.38 0.5
BALTIC COUNTRIES           (Arabic name)    ORGANIZATION    00   0.2 0.076
Yoli Adlestein             (Arabic name)    PERSON          11   0.75 0.875
Nathan Byron               (Arabic name)    PERSON          11   0.71 0.66

Table 4.3: Sample of CMUQ-Arabic-NET raw data
CMUQ Arabic-NET is an English-Arabic name dictionary compiled from Wikipedia
and parallel English-Arabic news corpora [7]. An off-the-shelf NER system was run on
the English side of the corpora, and the NER results were projected onto the Arabic side
according to the word-alignment information. Additionally, Wikipedia cross-language
link titles were included in the dictionary. The dictionary was manually annotated
to fit its targeted use.
The full dictionary has 62k English-Arabic name pairs. Table 4.3 shows a sample
of the dictionary. The first two columns are the English-Arabic pairs, the third column
contains the type of the entity name (i.e. person or organisation), and the remaining
columns are annotations used for their target application.
Including the dictionaries in EDRAK is performed as follows:
1. Pre-processing and language detection are applied to JRC-Names.
2. The English names of our entities are normalized and matched strictly against these
dictionaries to obtain accurate Arabic names.
3. Only new name variants are added to the resource.
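Steps 2 and 3 amount to a strict look-up of normalized English names with de-duplication against the existing dictionary. A minimal sketch (the data shapes and `str.lower` normalizer are assumptions for illustration):

```python
def add_external_names(entity_english_names, external_dict, existing,
                       normalize=str.lower):
    """Strictly match normalized English names against an external
    English->Arabic dictionary; return only Arabic variants not yet known.

    `entity_english_names`: entity -> list of English names.
    `external_dict`: normalized English name -> Arabic name.
    `existing`: entity -> set of Arabic names already in the resource.
    """
    added = {}
    for entity, en_names in entity_english_names.items():
        for en in en_names:
            ar = external_dict.get(normalize(en))        # strict match only
            if ar and ar not in existing.get(entity, set()):
                added.setdefault(entity, set()).add(ar)  # keep new variants only
    return added
```

Strict (rather than fuzzy) matching is the conservative choice here: since lexical dictionaries carry no entity identifiers, a loose match could attach an Arabic name to the wrong entity.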
4.2 Named-Entities Translation
Up to this point, English entities that do not have any potential names in the dictionary
and/or context keyphrases still form a big part of our catalog. In addition, not all entity
[Figure: in a general SMT pipeline, a bilingual corpus is used to train the translation
model and a target-language corpus to train the language model; the decoder combines
both statistical models to turn source-language input into target-language output.]

Figure 4.1: General Statistical Machine Translation Pipeline
names have already appeared in Arabic text corpora; some, however, are prominent enough
to appear in the near future. Accordingly, in this section we discuss applying machine
translation to English names and keyphrases.
4.2.1 General Statistical Machine Translation
Statistical Machine Translation (SMT) is the process of generating possible translations
for text based on statistical models trained on bilingual parallel corpora [31]. Recently,
several translation systems have been developed based on SMT, such as Moses [32],
Cdec [15], Phrasal [18], and Thot [48]. The main advantage distinguishing SMT from
other translation approaches, such as rule-based and example-based translation,
is that it is generic enough to be used with any language pair.
Implementations of SMT mostly follow similar steps to train the models required
for the decoding process (Figure 4.1). Word-alignment information is extracted from
the parallel corpora using an automatic statistical alignment tool such as GIZA++ [47]
or Fast Aligner [14]. The generated word-alignment information is used to produce the
translation tables/grammar. In addition, a statistical language model is generated from
the target-language part of the parallel corpora; in some cases, other monolingual corpora
are used to generate richer models. Later, both the translation table and the language
model are used in the decoding phase to translate the input text. Usually, several
translations are generated for each input sentence, ranked by the accumulated
probability derived from the language model and the translation table.
Statistical machine translation is a viable option for translating English names into
Arabic. Off-the-shelf trained SMT systems such as the Google¹ or Microsoft Bing² translation
services are trained on large parallel corpora. However, they are not geared toward translating
NEs. While they achieve suitable translation quality on natural language input text, their NE
translation quality suffers because:

• Most existing SMTs do not handle named entities explicitly; only the language
model is responsible for weighting name translations [22]. They do not utilize any
part-of-speech tagging or NER information.

• Entity names are domain-specific. Hence, they can easily be missing from the
parallel training corpus, which cannot cover all domains. The SMT will then split the
name and translate each token separately, resulting in wrong translations.

• Entity names tend to appear less often than other nouns and verbs in the parallel
training corpora. Consequently, their translations have lower weights in the language
models than normal words [33, 5, 7]. For example, “North” as a geographical direction
and “Green” as a color have higher weights than their name counterparts.

• SMT systems do not take the NE type into consideration, yet different entity types
should not be translated the same way. For example, consider “Nolan North” (PERSON) and
“North Atlantic Treaty Organization” (ORGANISATION): both Norths are parts of proper
entity names, yet the first should be transliterated while the latter should be
translated.
4.2.2 Named-Entities SMT
Several research attempts have focused on enhancing the quality of NE translation.
Huang et al. (2004) [28] introduced the use of phonetic and semantic similarity to
improve NE translation. Lee (2014) [34] proposed including part-of-speech tagging
information in the translation process to enhance the translation of person names
in text. Furthermore, Azab et al. (2013) [7] developed a classification technique to decide
whether to translate or transliterate named entities appearing in full English-to-Arabic
text. They used a combination of token-based, semantic and contextual features,
including the coarse-grained type tags (PERSON, ORGANIZATION), in the classification.
Nevertheless, since our problem focuses on NEs solely, we propose creating an
NE-customized translation module.
¹ https://translate.google.com
² https://www.bing.com/translator/
             PER      NON-PER    ALL
Azab         28493    34116      62609
Wikipedia    33962    79699      128790
Both         62455    113815     191399

Table 4.4: Entity Names SMT Training Data Size
Named-Entities Training Corpus
A parallel training corpus is a key factor in achieving the desired translation quality.
Our proposed approach is to use training data purely designed for translating NEs.
Therefore, we compile our training corpus from the NEs found in English-Arabic
cross-language inter-Wikipedia links. The intuition is that if our knowledge base knows
the names of “William Hook Morley” (وليم هوك مورلي) and “Edward Said” (إدوارد سعيد)
in Arabic script (which is the case), this should be sufficient for our SMT to learn the
Arabic script of “Edward Morley”. By training on names only, we guarantee a suitable
language model that gives higher weight to name translations.
Type-Aware Translation
Similar to the recent technique for translating named entities within natural language
text proposed by Azab et al. (2013) [7], we propose utilizing type information in the name
translation process. We used the well-structured type information provided in YAGO3 [26].
The training data have been split into two sets, PERSON and NON-PERSON. We do not split
the data on a more fine-grained type basis (e.g. ORGANIZATION vs. LOCATION) in order to
maintain an adequate amount of training data [57] for each type. Table 4.4 shows the
size of the training data for each system.

We trained three SMT systems: PERSON, NON-PERSON and ALL. The third system is
used as a fallback in case an entity of type T could not be translated using the
corresponding system. The main advantage of the type-aware architecture is that it allows
translating person names differently. In addition, for non-person entities like “Goethe
Prize”, we are able to translate the name using the fallback system, which learned to translate
“Goethe” from the PERSON part of the data and “Prize” from the NON-PERSON part.
4.2.3 Named-Entities Light-SMT
Since our target is to translate entity names only, language models are not highly beneficial
in our case. Even worse, language models may reduce the quality of translation if the
name is not well represented in the target-language training data. Therefore, we propose
a light-weight SMT (Light-SMT). The intuition behind our approach is that if we collect
the Arabic names of all entities that have the token-to-translate in one of their English
names, the corresponding Arabic token should have the highest and most distinguished
occurrence count among the Arabic tokens. Therefore, applying popularity voting among
the Arabic tokens of the entities whose English names contain the token-to-translate
results in a proper translation. Consider a simplified example: given a set of parallel
names that all contain “Müller” on the English side, such as “Thomas Müller”/“توماس مولر”,
“Gerd Müller”/“جيرد مولر”, etc., the Arabic translation of “Müller”, “مولر”, is going to be
the most frequent token. The Light-SMT module is built as follows:
1. The English side of the parallel English-Arabic training data is tokenized using
the Stanford tokenizer and converted to lowercase.

2. The Arabic side is normalized and tokenized as discussed in Section 3.4. Since a
single English name can be written in several Arabic forms differing only in
vowels, diacritics and/or Hamza, stemming is important for counting all such
representations as a single candidate.

3. Ambiguous tokens such as “(Disambiguation)” in English and its corresponding
Arabic word “توضيح” are considered noise and removed. Furthermore, tokens with
no mapping in Arabic, such as “a”, “an” and “the”, as well as punctuation, are
eliminated.

4. In order to follow the type-aware approach, the training data is split into PERSON
and NON-PERSON. We created three indexes, one for each type and a third for type ALL
(which includes the whole parallel data combined). Each index is composed of:
(a) An inverted index from English tokens to their entities.
(b) An index from the entities to all tokens of their Arabic names, together with
their normalized versions.
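The two indexes of one Light-SMT model can be sketched as follows; `str.split` and `str.lower` stand in for the actual tokenization and normalization of Section 3.4:

```python
from collections import defaultdict

def build_light_smt_index(parallel_names):
    """Build the two indexes of a Light-SMT model from
    (entity, english_name, arabic_name) triples."""
    en_token_to_entities = defaultdict(set)   # index (a): English token -> entities
    entity_to_ar_tokens = defaultdict(list)   # index (b): entity -> Arabic name tokens
    for entity, en_name, ar_name in parallel_names:
        for tok in en_name.lower().split():
            en_token_to_entities[tok].add(entity)
        entity_to_ar_tokens[entity].extend(ar_name.split())
    return en_token_to_entities, entity_to_ar_tokens
```

One such index pair is built per type (PERSON, NON-PERSON) plus a third over the combined data for the ALL fallback.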
In the decoding (i.e. translation) phase, the entity name passes through the following
steps to generate all possible translations:
1. The entity type is resolved via YAGO types.
2. The English name is tokenized and normalized.
3. The translation of each token is generated according to the entity type (Figure 4.2):
(a) The list of entities with this English token is retrieved from the inverted index.
(b) The Arabic tokens of all entities in the list are retrieved from the Arabic index.
[Figure: for the input token “Edward”, the entities carrying the token (e.g.
<Edward_Morely>, <Edward_Said>, <Edward_England>, <Edward_Barker>) are retrieved
from the offline-built source-tokens-to-entities inverted index; their Arabic tokens are
fetched from the entities-to-target-tokens index; a vote between the normalized tokens
fails if the winner's share is below the threshold T, otherwise a second vote between
the different surface representations returns the top-k translations.]

Figure 4.2: Single token translation using popularity voting
(c) A popularity vote is performed over the distinct Arabic stems. The Arabic
stem with the highest count is considered the proper translation of the
English token. In order to achieve suitable translation accuracy, we impose
two conditions: (i) the number of participating entities should be greater than
five, and (ii) the winning stem should be supported by at least 0.3 of the
participating entities.
(d) In order to avoid rare or incorrect Arabic representations of English names
written in Arabic script, a second popularity vote is performed among the
original words contributing to the winning stem. The top two representations
are chosen as possible translations of the input token.
(e) If no translation is found for the token using its type's data, these steps are
repeated with the ALL model.
4. Successfully translated tokens are then joined together to generate all possible
translations of the name.
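The two-level vote of steps 3(a)-(e) for a single token can be sketched as below. The identity `stem` stands in for the Arabic normalizer, and the indexes are the ones described in the build phase; thresholds follow the text (more than five entities, winner covering at least 0.3 of them):

```python
from collections import Counter

def translate_token(token, en_index, ar_index, stem=lambda t: t,
                    min_entities=5, min_share=0.3, top_k=2):
    """Popularity voting for one English token; returns up to `top_k`
    Arabic surface forms, or None when the thresholds are not met."""
    entities = en_index.get(token.lower(), set())
    if len(entities) <= min_entities:          # condition (i): more than five entities
        return None
    stem_votes = Counter()                     # stem -> number of supporting entities
    surface_votes = Counter()                  # (stem, surface form) -> occurrences
    for entity in entities:
        seen = set()
        for word in ar_index.get(entity, []):
            s = stem(word)
            surface_votes[(s, word)] += 1
            seen.add(s)
        stem_votes.update(seen)                # one vote per entity per stem
    if not stem_votes:
        return None
    winner, count = stem_votes.most_common(1)[0]
    if count < min_share * len(entities):      # condition (ii): winner's share
        return None
    # second vote: rank the surface forms behind the winning stem
    forms = Counter({w: c for (s, w), c in surface_votes.items() if s == winner})
    return [w for w, _ in forms.most_common(top_k)]
```

Failing tokens are retried against the ALL model's indexes, mirroring step 3(e).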
Up to this point, Light-SMT is capable of generating acceptable translation quality
for person names, as they usually follow the same token order across languages. In
contrast, multi-token NON-PERSON names suffer from an ordering problem. For example,
when translating “Max Planck Institute” to Arabic using Light-SMT, the result is
“ماكس بلانك معهد”. Each token is correctly translated, but the word “معهد”, the Arabic
translation of “Institute”, should appear at the beginning, followed by the rest of the name,
as in “معهد ماكس بلانك”.
Since NEs are short and follow common patterns, we implemented a rule-based
reordering approach similar to Badr et al. (2009) [8]. This approach is effective in the
Arabic NE case, as such names are usually written in the format

( <Category> <list of genitives or proper nouns> )

For English NEs, the category (e.g. “University”) either appears at the beginning
(e.g. “University of Saarland”) or at the end (e.g. “Saarland University”). The first
case matches the Arabic order, and hence no changes are required; the latter, however,
should be flipped. We learned the list of categories that require reordering from all
English non-person names by considering the top thousand common tokens appearing
at the end of a name. Before translating, the English name is reordered so that
category names are put at the beginning, following the Arabic naming order. For example,
“Goethe Prize” becomes “Prize Goethe”.
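The reordering rule itself is a one-line token move. In this sketch the category list is a tiny hypothetical stand-in for the learned top-thousand trailing tokens:

```python
# Hypothetical category list; the thesis learns it from the top ~1000 tokens
# appearing at the end of English non-person names.
TRAILING_CATEGORIES = {"university", "prize", "institute", "organization"}

def reorder_for_arabic(name):
    """Move a trailing category token to the front, matching Arabic name order."""
    tokens = name.split()
    if len(tokens) > 1 and tokens[-1].lower() in TRAILING_CATEGORIES:
        tokens = [tokens[-1]] + tokens[:-1]
    return " ".join(tokens)
```

Names already in category-first form (“University of Saarland”) pass through unchanged, since only a trailing category triggers the flip.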
The main advantage offered by Light-SMT is that it allows controlling the translation
quality at the token level. In addition, it reinforces the correct translation by combining
all similar representations. On the other hand, Light-SMT does not capture dependencies
between words. Also, it assumes a one-to-one word mapping, which does not always hold,
especially for non-person names. We tried to enhance it by applying it to n-grams and
choosing the translation of the longest n-gram in the top-k as the correct one.
Nevertheless, result samples did not show considerable improvement. In Section 5.2, we
examine the effect of Light-SMT on the full disambiguation pipeline.
4.2.4 Named-Entities Full-SMT
Due to the limitations of Light-SMT, we decided to train an off-the-shelf
SMT framework. We adopted Cdec [15], a full-fledged SMT framework that
includes a decoder, an aligner, and a learning framework, and used it in combination
with the type-aware paradigm. We used the same training data, in addition to the CMUQ
Arabic-NET dictionary provided by Azab et al. (2013) [7]. We train the three systems
(PERSON, NON-PERSON and ALL) as follows:

1. The parallel data is split into development data (5%) and training data.

2. The English part is tokenized normally, and the Arabic side is normalized and tokenized
as described in Section 3.4.
Chapter 4 EDRAK: Entity-Centric Resource For Arabic Knowledge
Figure 4.3: Type-Aware Entity-Name Translation using the full SMT system. (The figure
shows the flow: the English entity name is routed, according to its resolved type, to
the PERSON or NON-PERSON translation system, followed by OOV filtering and selection
of the top-3 translations; if no translation survives, the FALLBACK translation system
is applied with the same OOV filtering to produce the Arabic entity names.)
3. Symmetric word-alignment information is extracted from the English-Arabic
training pairs using fast_align [14].
4. The word-alignment information is used to generate the translation grammar using
the Cdec grammar extractor. Unique grammar rules are kept in synchronous
context-free grammar (SCFG) format.
5. The normalized Arabic part of the corpus is used to train the language model using
the KenLM Language Model Toolkit [23] provided in the Cdec framework.
6. The translation parameters are tuned using the MIRA tuning tool [11] adopted by Cdec.
In the translation phase, the translation grammar, tuned weights, and the language
model are loaded into the Cdec decoder. As we follow the type-aware framework
(Figure 4.3), we start by resolving the type of the entity to be translated. Then, the
English name is translated according to the entity type. We configure the Cdec decoder
to retrieve the top-5 translations. Translations with one or more out-of-vocabulary
words are excluded. Finally, only the top-3 translations are taken into consideration.
If no full translation is obtained, the fallback translation component is used to
translate the name with the same protocol.
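The decoding flow above can be sketched as follows; the translate callables are
placeholders for the trained Cdec systems, and the OOV test is a simplification of
Cdec's behavior of passing untranslated source tokens through.

```python
# Illustrative sketch of the type-aware translation flow of Figure 4.3.
# The three translate callables stand in for the trained Cdec systems
# (PERSONS, NON-PERSONS, ALL); they are assumptions, not real APIs.

def has_oov(translation: str) -> bool:
    # Untranslated (out-of-vocabulary) tokens pass through in Latin
    # script; here we flag any remaining ASCII letters as OOV.
    return any("a" <= c.lower() <= "z" for c in translation)

def translate_entity_name(name, is_person, person_sys, nonperson_sys, fallback_sys):
    primary = person_sys if is_person else nonperson_sys
    # Take the decoder's top-5, drop OOV results, keep at most 3.
    kept = [t for t in primary(name, k=5) if not has_oov(t)][:3]
    if not kept:  # no full translation -> same protocol on the ALL system
        kept = [t for t in fallback_sys(name, k=5) if not has_oov(t)][:3]
    return kept
```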
Using a full-fledged SMT system solves the challenge of handling word dependencies
and reordering. Since SMT is language-independent, our enrichment approach inherits
this feature, allowing it to be easily adopted for other languages. On the other hand,
the language model may decrease the weights of some correct translations, as explained
previously. Finally, it is hard to control single-token translations.
4.3 Transliteration

Transliteration is the process of converting a name from the script of a language L1
to the script of another language L2 while preserving the pronunciation as much as
possible. Most SMT systems use transliteration as a fallback protocol when name
translation fails.
Named-entity translation allowed translating a huge number of English names as well as
contextual keyphrases. However, there are still names for which named-entity
translation fails to generate Arabic names, because they are not represented in the
parallel training data. Moreover, translation usually generates only the most
prominent Arabic representation of a name. This section discusses how transliteration
is adopted to enrich our resource.
4.3.1 Transliteration Approaches

English-to-Arabic transliteration can be performed using several approaches [30]. The
simplest is the rule-based or grapheme-based approach, where each set of characters in
the source language is mapped to one or more sets of characters in the target
language. Another approach is phoneme-based mapping, which considers similarity on the
sound and phonetic level: words are represented as phoneme sequences, which are
transformed into the closest phonemes of the target language and finally converted
into target-language characters [3, 56]. A widely used approach is statistical machine
translation on the character level, where normal SMT systems are trained on parallel
corpora of character-segmented words [1, 44, 13, 2].
There are several transliteration services from English to Arabic. The most notable
are:

Google Transliterate (https://developers.google.com/transliterate/) is a service for
multilingual transliteration. However, since 2011 it has been integrated into the
Google Translate service, which does not guarantee producing a transliteration rather
than a translation. For example, “Green”, as a name, will be translated if given
without any context.

Yamli (http://www.yamli.com/; the name is Arabic for “[he] dictates”) is a service
that aims at transforming romanized Arabic text into Arabic characters (i.e. backward
transliteration). Its engine is well trained on general romanized Arabic text as well
as person names. However, it is designed as an interactive service that suggests
several transliterations and expects the user to choose the correct one (i.e. it is
recall-oriented).

3arrib [2] is a transliteration service targeting romanized dialectal Arabic (e.g.
chat messages in Egyptian dialect). Informal Arabic text often uses digits as
extensions of the Latin alphabet to cover Arabic phonemes that have no English
counterpart. Thus, 3arrib is trained on SMS/chat data [9], which is not suitable for
transliterating
formally written English names. Furthermore, it detects English names and concepts and
excludes them from the transliteration process [16].

a l - s a m m a n ||| (character-segmented Arabic name)
a l b e r t S SPACE e i n s t e i n ||| (character-segmented Arabic name)
j o h n ||| (character-segmented Arabic name)

Table 4.5: Sample of character-level training data (the Arabic side is not reproduced
here)
These solutions are not designed to deal exclusively with names, and their output
accuracy is not satisfactory for our purposes.
4.3.2 Character-Level Statistical Machine Translation

In order to create a transliteration system geared towards names, we train our SMT
system on the character level. For training, we consider the PERSON data in Table 4.4.
The PERSON parallel data were used to train the Cdec SMT system as follows:
1. Spaces are replaced with the special symbols S SPACE and T SPACE.
2. The characters of each word are separated by spaces, as shown in Table 4.5. Thus,
each name record is treated as a phrase and its characters as tokens.
3. We follow the same training steps with the Cdec SMT system as in Subsection 4.2.4.
In the decoding phase, the same segmentation as in training is applied to prepare the
input data. Results containing out-of-vocabulary words (i.e. English characters) are
excluded. Then, the Arabic transliterations are reconstructed by reversing the
tokenization steps. Finally, all names with at least one remaining English character
are excluded as failures. We applied transliteration to given names and family names,
since such names are always transliterated. We did not apply transliteration to any
other entity type, for the sake of achieving high-quality results.
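The pre- and post-processing around the character-level system can be sketched as
follows; the underscore in the space symbols is an assumption about their exact
spelling.

```python
# Sketch of the character-level pre/post-processing (Subsection 4.3.2):
# spaces become S_SPACE / T_SPACE tokens and characters become "words".

def segment(name: str, space_token: str = "S_SPACE") -> str:
    """'albert einstein' -> 'a l b e r t S_SPACE e i n s t e i n'"""
    return " ".join(space_token if c == " " else c for c in name.lower())

def reassemble(decoded: str, space_token: str = "T_SPACE") -> str:
    """Reverse the segmentation on the decoder output."""
    chars = decoded.split()
    return "".join(" " if c == space_token else c for c in chars)

print(segment("albert einstein"))
```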
4.4 Arabic Names Splitting

AIDArabic uses full matching to retrieve the candidate entities, for the sake of an
accurate disambiguation process. Nevertheless, person entities are not always
mentioned with their full name; only parts of the name may appear (e.g. given name and family
Type         In English script       Meaning
Prefixes     Abd                     Worshiper of
             Abo                     Father of
             Umm                     Mother of
             Al                      Family of
             Gad                     -
Connectors   bin/ben                 Son of
             bent                    Daughter of
Suffixes     Allah, Ellah, Lellah    The God
             Al-Dawla                The State
             Al-Deen                 The Religion

Table 4.6: Arabic names' common prefixes, suffixes, and name connectors with their
meanings (the Arabic-script column is not reproduced here)
name). Therefore, splitting person names allows covering partial mentions. Unlike most
Latin names, Arabic names are not composed of just a given name and a last name.
Hence, they require different splitting rules.
Person names are extracted from the rdfs:label relation for entities with type PERSON
in YAGO. After normalizing the names by removing Tatweel, normalizing Alif, and
removing diacritics, the following splitting rules are applied:
1. The Arabic name prefixes shown in Table 4.6 (Abd, Abo, Umm, Al, Gad) are combined
with the following token as one part (e.g. Umm Kulthum, Abd-Alkareem).
2. Name connectors such as bin, ibn, and bent, which are common in names originating
in the Gulf countries as well as in old Arabic names, are considered splitters and
are attached to the part that follows them.
3. The common Arabic name suffixes (Allah, Ellah, Lellah, Al-Dawla, Al-Deen) are
combined with the previous token as one part (e.g. Noor Al-Deen).
4. Full names composed of two parts are split into <Given Name> <Last Name>. For
example, Mohamed Salah is divided into (Mohamed) as the given name and (Salah) as
the last name.
5. Names of three or more parts are split into <Given Name> <Middle Name>
<Last Name>. For example, Salman bin Abdulaziz Al Saud is split into (Salman) as the
given name, (bin Abdulaziz) as the middle name, and (Al Saud) as the family name.
Finally, the resulting name partitions (given name, middle name, and last name) are
added to the enriched resource after applying the required normalization and word
segmentation.
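A romanized sketch of these rules follows; the actual implementation operates on
normalized Arabic script, and the token lists are only transcriptions of Table 4.6.

```python
# Romanized sketch of the Arabic name-splitting rules above.
PREFIXES = {"abd", "abo", "umm", "al", "gad"}       # glue to the NEXT token
CONNECTORS = {"bin", "ben", "bent", "ibn"}           # split; glue to next part
SUFFIXES = {"allah", "ellah", "lellah", "al-dawla", "al-deen"}  # glue to previous

def split_person_name(name: str) -> list[str]:
    parts, tokens = [], name.lower().split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in PREFIXES or tok in CONNECTORS:
            if i + 1 < len(tokens):
                parts.append(tok + " " + tokens[i + 1])
                i += 2
                continue
        if tok in SUFFIXES and parts:
            parts[-1] += " " + tok
            i += 1
            continue
        parts.append(tok)
        i += 1
    return parts

print(split_person_name("salman bin abdulaziz al saud"))
# -> ['salman', 'bin abdulaziz', 'al saud']  (given / middle / family)
```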
4.5 EDRAK as a Standalone Resource

In order to help advance Arabic NLP research, we publicly released EDRAK to the
research community as an entity-centric standalone resource.
4.5.1 Use-cases

EDRAK is not only useful as a data schema for NED; it is also a valuable asset for
many Natural Language Processing (NLP) and Information Retrieval tasks. For example,
EDRAK contains a comprehensive dictionary of potential Arabic names for entities,
gathered from both the English and Arabic Wikipedias. The EDRAK dictionary can be used
for building an Arabic dictionary-based NER system [12, 52].
In addition to the name dictionary, the resource contains a large catalog of Arabic
textual context for entities in the form of keyphrases. These can be used to estimate
Entity-Entity Semantic Relatedness scores as in [25].
Entities in EDRAK are classified under the type hierarchy of YAGO [26]. Together
with the keyphrases, EDRAK can be used to build an Entity Summarization system
as in [58], or to build a Fine-grained Semantic Type Classifier for named entities as in
[64, 65].
4.5.2 Technical details

EDRAK is available in the form of an SQL dump and can be downloaded from the Downloads
section of the AIDA project page http://www.mpi-inf.mpg.de/yago-naga/aida/. We
followed the same schema used in the original AIDA framework [27] for data storage.
Highlights of the SQL dump are shown in Table 4.7. EDRAK's comprehensive entity
catalog is stored in the SQL table entity_ids. The potential Arabic names of each
entity are stored in the SQL table dictionary. In addition, each entity is assigned a
set of Arabic contextual keyphrases, stored in the SQL table entity_keyphrases.
It is worth noting that the sources of the dictionary entries as well as of the entity
keyphrases are kept in the schema (YAGO3 LABEL, REDIRECT, GIVEN NAME, or FAMILY NAME).
Furthermore, generated data (by translation or transliteration) is differentiated from
the original Arabic data extracted directly from the Arabic Wikipedia. Different
generation techniques
Table Name          Major Columns              Description
entity_ids          id, entity                 Lists all entities together with their
                                               numerical IDs.
dictionary          mention, entity, source    Contains information about the
                                               candidate entities for a name. It keeps
                                               track of the source of each entry to
                                               allow application-specific filtering.
entity_keyphrases   entity, keyphrase,         Holds the characteristic description of
                    source, weight             entities in the form of keyphrases. The
                                               source of each keyphrase is kept for
                                               application-specific filtering.
entity_types        entity, types[]            Stores the YAGO semantic types to which
                                               each entity belongs.
entity_rank         entity, rank               Ranks all entities based on the number
                                               of incoming links in both the English
                                               and Arabic Wikipedias. This can be used
                                               as a measure of entity prominence.

Table 4.7: Main SQL Tables in EDRAK
and data sources entail different data quality. Therefore, keeping the data sources
enables downstream applications to filter the data for a precision/recall trade-off.
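As an illustration of such source-aware filtering over this schema, the following uses
an in-memory SQLite database with the table and column names of Table 4.7; the entity
and source values are made up, and the real resource ships as a dump for a full RDBMS.

```python
# Toy illustration of source-aware filtering on the EDRAK schema
# (Table 4.7). Entity names and source labels below are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entity_ids (id INTEGER, entity TEXT);
CREATE TABLE dictionary (mention TEXT, entity TEXT, source TEXT);
""")
con.executemany("INSERT INTO dictionary VALUES (?, ?, ?)", [
    ("name-a", "Edward_W._Morley", "YAGO3_LABEL"),      # original data
    ("name-b", "Edward_W._Morley", "TRANSLITERATION"),  # generated data
])

# High-precision application: keep only original (non-generated) entries.
rows = con.execute(
    "SELECT mention FROM dictionary "
    "WHERE entity = ? AND source = 'YAGO3_LABEL'",
    ("Edward_W._Morley",),
).fetchall()
print(rows)  # [('name-a',)]
```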
Chapter 5
Evaluation and Statistics
This chapter discusses the evaluation of the size and quality of EDRAK (Section 5.1)
and of its effect on the quality of AIDArabic+ results (Section 5.2).
5.1 EDRAK Evaluation

It is important to evaluate the effect of the enrichment approaches on the quality and
size of the generated resource, which directly affect the overall NED system quality
and performance [62]. Statistics about the EDRAK resource are given in
Subsection 5.1.1. In addition, the manual assessment performed by native Arabic
speakers is discussed in Subsection 5.1.3.
5.1.1 Statistics

EDRAK contains around 2.4M entities (each with at least one name) classified under the
YAGO type hierarchy. With this size, EDRAK is an order of magnitude bigger than the
original AIDArabic resource, which contains 143K entities, since the latter is
constrained by the amount of Arabic names and contextual keyphrases available in the
Arabic Wikipedia.
Table 5.1 shows a comparison between AIDArabic and EDRAK in terms of the name and
contextual keyphrase dictionaries. The name dictionary grew from less than 0.5M
entity-name pairs to 21M pairs. The number of unique names is now 20 times that of
AIDArabic. In addition, the average number of names per entity increased from 2.45 to
7.75.
                           AIDArabic         EDRAK
Unique Names                 333,017     9,354,875
Entities with Names          143,394     2,400,340
Entity-Name Pairs            495,245    21,669,568
Unique Keyphrases            885,970     7,918,219
Entity-Keyphrase Pairs     5,574,375   211,681,910

Table 5.1: AIDArabic vs EDRAK: Sizes of the Name and Contextual Keyphrase Dictionaries
Technique          # New Entities   # Dictionary Entries
Google W2C                 47,406              3,549,248
CMUQ-Arabic-NET            19,706              3,340,921
JRC                         1,664                      0
Translation               241,104             11,222,876
Transliteration            23,338              9,578,658
Name Splitting              4,148                 94,782

Table 5.2: Number of New Entities and Entity-Name Pairs per Generation Technique
Semantic Type     AIDArabic       EDRAK
PERSON               47,483   1,220,032
EVENT                11,065     199,846
LOCATION             34,451     360,108
ORGANIZATION         10,212     196,305
ARTIFACT             15,650     359,071

Table 5.3: Number of Entities per Type in AIDArabic vs EDRAK
The contributions of each generation technique are summarized in Table 5.2. The
numbers indicate that automatic generation (i.e. translation and transliteration)
contributes far more entries than the external name dictionaries. In addition,
translation delivers more entries than transliteration, since it is applied to all
types of entities, in contrast to transliteration, which is applied to person names
only. Furthermore, GW2C did not introduce many new entities, because it is not common
to manually link a mention in an Arabic article to an English Wikipedia page. CMUQ and
JRC are both collected from news-wire; hence, they only added names for prominent
entities.
Table 5.3 lists the number of entities per high-level semantic type for both the
AIDArabic entity catalog and EDRAK. The highest increase is observed for type PERSON,
as a result of applying both translation and transliteration.
Similarly, the contextual keyphrase dictionary grew by a factor of 42, as shown in
Table 5.1. Although we applied the generation techniques to the categories and the
inlink titles only, the expansion in the contextual keyphrases was expected to be higher
Source                       AIDArabic         EDRAK
citationTitle                   67,031        67,031
linkAnchor                   2,469,923     2,469,923
inlinkTitle                  2,734,530     5,216,657
wikipediaCategory              302,891     4,029,483
wikipediaCategory_TRANS              -    13,842,770
inlinkTitle_TRANS                    -   186,056,046

Table 5.4: Contextual keyphrase dictionary: AIDArabic vs EDRAK
than the name dictionary. The new contextual keyphrases originate from: (i) new
entities that could be translated using the manual English-Arabic inter-Wikipedia
links [67], and (ii) Arabic keyphrases automatically generated using translation and
transliteration. This explains the expansion in the original sources, inlinkTitle and
wikipediaCategory, as shown in Table 5.4. It is also worth noting that
wikipediaCategories were only translated, while inlinkTitles were both translated and
transliterated, according to their entity type.
5.1.2 Data Example

Many prominent entities do not exist in the Arabic Wikipedia, and hence do not appear
in any Wikipedia-based resource. For example, Christian Schmidt, the current German
Federal Minister of Food and Agriculture, and Edward W. Morley, a famous American
scientist, are both missing from the Arabic Wikipedia (as of June 2015). EDRAK's data
enrichment techniques managed to automatically generate reasonable potential names as
well as contextual keyphrases for both. Table 5.5 lists a snippet of what EDRAK knows
about these two entities.
5.1.3 Manual Assessment

The target of the manual assessment is to quantify the quality of the names and
contextual keyphrases generated using the different methods.

Setup

We evaluated all aspects of data generation in EDRAK. We included entity names
belonging to FirstName, LastName, Wikipedia redirects, and the rdfs:label relation,
which
Entity: Christian Schmidt - generated Arabic name variants and contextual keyphrases
(Arabic script not reproduced here)
Entity: Edward W. Morley - generated Arabic name variants and contextual keyphrases
(Arabic script not reproduced here)

Table 5.5: Examples for Entities in EDRAK with their Generated Arabic Names and
Keyphrases
carries names extracted from Wikipedia page titles, disambiguation pages and anchor
texts.
The data for this evaluation was generated using the full SMT system trained on named
entities only. In order to examine the effect of considering semantic types in
translation, we implemented two approaches: the first is the Type-Aware SMT described
in Subsection 4.2.4, and the second uses one universal SMT system for translating all
names (referred to as Combined). For each name, the top-3 successful translations or
transliterations were generated, where they exist.
The data assessment experiment covered all types of data for both translation
approaches. Additionally, we conducted experiments to assess the quality of
translating Wikipedia categories using a system trained on parallel English-Arabic
categories. Finally, we evaluated the performance of transliteration applied to
English person names. We randomly sampled the generated data and conducted an online
experiment to manually assess its quality.
Task

We asked a group of native Arabic speakers to manually judge the correctness of the
generated data through our web-based tool (shown in Appendix A). Each participant was
presented with around 150 English names together with the top-3 potential Arabic
translations or transliterations proposed by cdec (or fewer, if cdec proposed fewer
than three translations). Participants were asked to pick all possible correct Arabic
names, or None if all translations were incorrect. Participants had the option to skip
a name (by choosing the Don't know option) if they needed to. The experiment was
designed such that each English name is evaluated by three different persons.
Results and Discussion

In total, we had 55 participants who evaluated 1646 English surface forms, which were
assigned 4463 potential Arabic translations. Each English name was annotated by at
least three participants with either one of the proposed translations or None.
Participants were native Arabic speakers based in the USA, Canada, Europe, KSA, and
Egypt; their homelands span Egypt, Jordan, and Palestine. The manual assessment
results are shown in Table 5.6, given per entity type, translation approach, and name
source. Since cdec did not return three potential translations for every name, we
computed the total number of translations obtained when considering up to the top one or
                                       Count@Top-K           Prec@Top-K
             Approach     Source       1     2     3       1       2       3
Persons      Type-Aware   First Name   8    10    12   87.50   80.00   66.67
             Type-Aware   Last Name   14    17    19   92.86   88.24   78.95
             Type-Aware   rdfs:label 156   288   383   79.49   63.19   57.44
             Type-Aware   redirects  113   210   285   69.91   57.62   50.18
             Combined     First Name   7    10    12  100.00   90.00   75.00
             Combined     Last Name   16    22    25   87.50   81.82   76.00
             Combined     rdfs:label 160   307   421   81.25   64.82   57.24
             Combined     redirects  108   210   288   67.59   60.00   54.51
             Translit.    First Name  26    52    76   80.77   61.54   56.58
             Translit.    Last Name   94   188   279   70.21   63.83   55.91
Non-Persons  Type-Aware   rdfs:label 269   519   742   53.16   43.16   36.66
             Type-Aware   redirects  191   370   526   45.55   34.86   30.99
             Combined     rdfs:label 273   533   770   49.82   41.84   36.75
             Combined     redirects  195   378   539   46.67   39.42   34.69
Categories   Categories   Categories 118   234   340   67.80   52.99   46.18

Table 5.6: Assessment Results of Applying SMT for Translating Entity and Wikipedia
Category Names
two or three results. For each case, we computed the corresponding precision based on
the participants' annotations.
The data was randomly sampled from all generated data such that the size of each test
set reflects the distribution of the sources in the original data. For example, names
originating from the rdfs:label relation are an order of magnitude more frequent than
those coming from the FirstName and LastName relations.
The quality of the generated data varies according to the entity type, name source,
and generation technique. The quality of translated Wikipedia redirects is
consistently lower than that of the other sources. This is due to the nature of
redirects: they are not necessarily another variation of the entity name. In addition,
redirects tend to be longer strings, and hence are more error-prone than rdfs:labels.
For example, "European Union common passport design", which redirects to the entity
Passports of the European Union, could not be translated correctly: each token was
translated correctly, but the final token order was wrong. Evaluators were asked to
annotate such examples as wrong. However, such ordering problems are less critical for
applications that incorporate partial matching techniques. Similarly, categories tend
to be relatively longer than entity names, and hence exhibit the same problems as
redirects.
Although the size of the evaluated FirstName and LastName data is small, the
assessment results are as expected: translating a one-token name is a relatively easy
task. In addition, cdec returned only one or two translations for the majority of such
names, as shown in Table 5.6.
Results also show that the type-aware translation system does not necessarily improve
the results; one universal system can deliver comparable results in most cases.
Person-name transliteration unexpectedly achieved lower quality than translation. This
is a result of names being pronounced differently across countries. For example, a
USA-based annotator expects the German name "Friedrich" to be spelled in Arabic
following its English pronunciation, while a Germany-based annotator expects a
spelling that follows the German pronunciation. Similarly, only Germany-based
participants know that the person name "Johannes" should be rendered with an initial
"y" sound rather than a "j" sound. We attempted to mitigate this problem by inviting
Arabic speakers located in different areas around the globe.
Finally, the inter-annotator agreement, measured using Fleiss' kappa, was 0.484,
indicating moderate agreement.
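The agreement statistic above can be computed from an N x k matrix of category counts
(N items, k answer options, each item rated by the same number of annotators), as in
the following sketch.

```python
# Fleiss' kappa from per-item category counts: each row holds, for one
# English name, how many annotators chose each answer option.
def fleiss_kappa(counts: list[list[int]]) -> float:
    n_items = len(counts)
    n_raters = sum(counts[0])
    total = n_items * n_raters
    # Mean per-item observed agreement.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the category marginals.
    p_e = sum((sum(row[j] for row in counts) / total) ** 2
              for j in range(len(counts[0])))
    return (p_bar - p_e) / (1 - p_e)

# Perfect within-item agreement yields kappa = 1.0.
print(fleiss_kappa([[3, 0], [0, 3], [3, 0]]))  # -> 1.0
```

With three annotators per name, as in the setup above, each row simply tallies the
three votes over the proposed translations and the None option.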
5.2 AIDArabic+ Evaluation

In this section, we discuss the experiments conducted to evaluate the effect of the
enriched data resource (EDRAK) and of the Arabic-specific tokenization and
normalization on AIDArabic+ results.
5.2.1 Arabic Corpus Creation

The first problem we faced is the lack of an annotated Arabic benchmark. Creating a
well-annotated corpus manually is a time-consuming task. Therefore, we needed to
create our benchmark automatically.
The main idea is to use a parallel corpus and annotate the Arabic part using
automatically generated evidence from the English counterpart. This approach was also
followed by [7] to collect named entities from parallel news and by [37] to create a
persons-only multilingual annotated corpus.
We used the LDC2014T05 [35] news and web English-Arabic parallel corpus, which is
manually translated and word-aligned. LDC2014T05 was developed mainly for SMT
development. The Arabic documents of the corpus are tokenized using MADA+TOKAN
[19, 49] and aligned with the tokenized English translation on the word level
(many-to-many mapping). We favored a manually word-aligned corpus over an automatic
alignment tool such as GIZA++ [47] or fast_align [14] to guarantee a better projection
quality.
Type        #Docs   #Uniq. Entities   #Mentions   #Non-null Mentions
News-wire     702             2,009      18,240               14,413
Web            74               338       2,055                1,385

Table 5.7: LDC2014T05 Annotated Benchmark Statistics
We started by applying AIDA [27, 66], a state-of-the-art NED system, together with
named entity recognition, on the tokenized English side. The English AIDA
disambiguation results were projected onto the tokenized Arabic side as follows:
1. All English mentions (with their entity mapping) were projected onto the Arabic
tokens using the word-alignment information.
2. Tokens marked as GLUE at the boundaries of the Arabic mentions, such as a
connected preposition (e.g. the conjunction prefix in "and Egypt") or a connected
pronoun at the end, are removed from the mention string.
3. Overlapping Arabic mentions, resulting from the nature of the translation, were
combined.
4. Mentions were filtered such that: (a) Arabic mentions mapped to two different
entities are excluded; and (b) overly long Arabic mentions are also excluded.
The Arabic documents and the produced ground truth were exported in the CoNLL dataset
format. After excluding all documents with alignment problems, our Arabic corpus
contains a total of 776 documents with 15,798 non-null mentions. Table 5.7 shows the
details of the annotated data.
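The projection of a single mention (steps 1 and 2) can be sketched as follows; the
alignment representation and the glue-token set are simplifying assumptions about the
LDC2014T05 annotation format.

```python
# Simplified sketch of annotation projection: map an English mention span
# onto Arabic tokens via the manual word alignment, then strip GLUE
# tokens (connected prepositions/pronouns) at the mention boundaries.

def project_mention(en_span, entity, alignment, glue_tokens):
    en_lo, en_hi = en_span  # inclusive English token range of the mention
    ar_idx = sorted({a for e, a in alignment if en_lo <= e <= en_hi})
    if not ar_idx:
        return None
    while ar_idx and ar_idx[0] in glue_tokens:   # strip leading glue
        ar_idx.pop(0)
    while ar_idx and ar_idx[-1] in glue_tokens:  # strip trailing glue
        ar_idx.pop()
    return (ar_idx, entity) if ar_idx else None

# English tokens 2..3 align to Arabic tokens 5,6,7, where Arabic token 5
# is a connected "and" prefix marked as GLUE:
print(project_mention((2, 3), "Egypt", [(2, 5), (2, 6), (3, 7)], {5}))
# -> ([6, 7], 'Egypt')
```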
5.2.2 Experiment Setup

Systems Setup

For testing, we built AIDArabic+ including the new resource (EDRAK) and the
Arabic-specific pre-processing component. We evaluated two data generation approaches:
(i) using Yamli for transliteration and the type-aware Light-SMT proposed in
Subsection 4.2.3, and (ii) using type-aware translation and transliteration with the
full SMT framework. In both, we used the external dictionaries introduced in
Section 4.1.
We tested both AIDArabic+ configurations against the AIDArabic [67] and Babelfy [43]
NED systems. To the best of our knowledge, no other available system supports NED on
Arabic input.
Dataset    System                         Mention    Document    Mapped
                                            Prec.       Prec.    to Entity
LDC news   AIDArabic+ (Full-SMT)            73.23       71.34       94.69
           AIDArabic+ (Yamli & L-SMT)       70.83       68.94       92.73
           AIDArabic                        69.07       67.26       87.19
           Babelfy (Full Matching)          30.32       31.16       39.75
           Babelfy (Partial Matching)       25.24       25.84       39.48
LDC web    AIDArabic+ (Full-SMT)            68.16       60.10       93.86
           AIDArabic+ (Yamli & L-SMT)       66.06       56.86       92.13
           AIDArabic                        62.02       52.48       85.56
           Babelfy (Full Matching)          22.33       21.13       38.62
           Babelfy (Partial Matching)       20.66       19.52       35.52

Table 5.8: Disambiguation Results for AIDArabic+ vs AIDArabic vs Babelfy
For all AIDA-based systems, we used YAGO3 as our back-end knowledge base, built from
the English Wikipedia dump of 12 Jan 2015 combined with the Arabic Wikipedia dump of
18 Dec 2014. The same configuration was used as in the original AIDA local similarity
technique [27].
For Babelfy, we used their web service (http://babelfy.org/guide), version 1.0. It
offers two modes: named-entity full matching and partial matching. We ran both using a
predefined set of mentions. For a fair comparison, we limited their candidate space to
Wikipedia. We resolved the corpus ground truth from YAGO3 to BabelNet [46] through the
BabelNet web service getSynsetIdsFromWikipediaTitle.
Evaluation

We evaluated against mentions with non-null ground-truth annotations. For a fair
comparison, Null annotations returned by any system were considered wrong annotations.
We computed both the mention precision and the document-average precision. Precision
is computed as the number of correct annotations divided by the number of all
annotations returned by the system.
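The two scores can be sketched as follows over per-document lists of (predicted, gold)
entity pairs, with Null predictions counted as wrong, as described above; the data
layout is an assumption for illustration.

```python
# Mention precision (micro) and document-average precision (macro) over
# per-document lists of (predicted_entity, gold_entity) pairs. A Null
# (None) prediction counts as a wrong annotation.
def mention_precision(docs):
    pairs = [p for doc in docs for p in doc]
    correct = sum(1 for pred, gold in pairs if pred is not None and pred == gold)
    return correct / len(pairs)

def document_avg_precision(docs):
    per_doc = [
        sum(1 for pred, gold in doc if pred is not None and pred == gold) / len(doc)
        for doc in docs
    ]
    return sum(per_doc) / len(per_doc)

docs = [[("A", "A"), ("B", "C")], [(None, "D"), ("E", "E")]]
print(mention_precision(docs))       # 2 correct of 4 -> 0.5
print(document_avg_precision(docs))  # (0.5 + 0.5) / 2 -> 0.5
```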
5.2.3 Results and Discussion

The results of our experiments are shown in Table 5.8. AIDArabic+ consistently
delivered better results than the competitors under test. Both versions of AIDArabic+
mapped over 92% of the mentions to non-null entities. AIDArabic+ built with the full
SMT system achieved better precision than the Yamli & L-SMT build, due to the better
quality of the generated
dictionaries. While the improvement in precision of the latter over AIDArabic is less
than 1%, the full-SMT version achieved a 4% increase. Nevertheless, on the news corpus
our comprehensive KB could not shine, since most of the entities are prominent enough
to appear in the Arabic Wikipedia.
On the other hand, since the entities in the web corpus are less prominent than those
in the news, the enriched KB showed better performance on the web documents:
AIDArabic+ achieved an 8% increase in document precision and 6% in mention precision.
Babelfy with full and partial matching achieved less than 35% on both the news-wire
and web corpora. Babelfy's back-end resource does not apply entity-name translation
[46], which explains its poor performance.
Result sampling shows that the enhancements in AIDArabic+ resulted from the following:
• New Entities: EDRAK covers new entities that were not covered in the AIDArabic
schema, by introducing at least one potential name for each. For example, Arabic names
were linked to the entities "Cotonou Agreement" and the "Sun-Sentinel" newspaper,
although no Arabic Wikipedia page exists for either. Thus, such mentions were
correctly disambiguated.
• Name Variants: Some entities already existed in the Arabic Wikipedia together with
their names; however, some English names have several potential forms in Arabic.
Transliteration was able to produce these potential Arabic name variants. For
instance, the Arabic Wikipedia page of the Nobel prize winner "José Saramago" lists
one Arabic spelling of his name, while our news corpus used a different spelling. Our
system learned both forms and correctly disambiguated the mention.
• New Names: Similarly, some entities lack several prominent Arabic name aliases.
Translation and the external dictionaries were able to expand the name dictionary with
such names. For example, the entity United States Department of State may be referred
to by the common Arabic short form of its name, which did not exist in the Arabic
Wikipedia but was obtained by translating the English redirect "Department of State".
Chapter 6
Conclusion and Outlook
6.1 Conclusion

In this thesis, we discussed adapting Named Entity Disambiguation effectively to
Arabic text. AIDArabic was the first attempt to enable NED on Arabic. Nevertheless, it
exhibited low recall due to the sparsity of structured Arabic resources and the
complexity of Arabic-specific features. We introduced AIDArabic+ to enhance NED on
Arabic input by utilizing a rich data schema and a customized Arabic pre-processing
component.
In order to overcome the data sparsity of structured Arabic resources, we introduced
EDRAK as the back-end schema for AIDArabic+. EDRAK is an entity-centric Arabic
resource that contains around 2.4M entities, together with their potential Arabic names, contextual
keyphrases, and semantic types. The data in EDRAK has been extracted from the Arabic
Wikipedia and other available resources such as GW2C and name dictionaries. In
addition, we enriched EDRAK with automatically generated Arabic data based on the
English Wikipedia.
To achieve accurate data generation, we developed type-aware
named-entity translation, utilizing the fully fledged SMT framework cdec and
a parallel corpus of entity names. Furthermore, we developed a person-name
transliteration module to generate possible variants of person names. The generated
data has been manually assessed by a group of native Arabic speakers. We made EDRAK
publicly available as a standalone resource to help advance research for the Arabic
language.
Due to the morphologically rich nature of Arabic, we integrated an Arabic pre-processing
module into the AIDArabic+ architecture to correctly tokenize and normalize Arabic input.
The Arabic-customized pre-processing allowed better recall and precision for name and context
matching. We used the Stanford Arabic Segmenter to perform the required morphological
analysis and tokenization.
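As an illustration of what such pre-processing involves, the sketch below applies a few standard Arabic orthographic normalization rules (stripping diacritics and tatweel, unifying alef variants, and mapping final yaa and taa marbuta). This is a generic example of the technique; the exact rules applied by AIDArabic+ or the Stanford Arabic Segmenter may differ.

```python
import re

# Common Arabic normalization rules (illustrative, not the exact
# AIDArabic+ rule set).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanween, short vowels, shadda, sukun, dagger alef
TATWEEL = "\u0640"  # elongation character

def normalize(text):
    text = DIACRITICS.sub("", text)                        # drop diacritics
    text = text.replace(TATWEEL, "")                       # drop elongation
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # آ / أ / إ -> ا
    text = text.replace("\u0649", "\u064A")                # ى -> ي
    text = text.replace("\u0629", "\u0647")                # ة -> ه
    return text

print(normalize("أُستاذ"))  # استاذ
```

Applying the same normalization to both the name dictionary and the input text is what makes name matching robust to spelling variation.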
Finally, in order to evaluate the effect of the proposed enhancements over AIDArabic,
we utilized a parallel word-aligned English-Arabic corpus (LDC2014T05) to create an
automatically annotated Arabic corpus. AIDA, a state-of-the-art NED system, was used
to generate annotations on the English side. Then, the annotations were projected onto the
Arabic side.
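Conceptually, the projection step works as sketched below; the function, the indices, and the toy alignment are hypothetical illustrations of the technique, not the actual pipeline or data from LDC2014T05:

```python
def project_annotations(alignments, english_annotations):
    """Project entity annotations from English tokens onto the
    Arabic tokens they are word-aligned with.

    alignments: iterable of (en_idx, ar_idx) word-alignment pairs.
    english_annotations: dict en_idx -> entity id (from the English NED run).
    Returns: dict ar_idx -> entity id for the Arabic side.
    """
    arabic_annotations = {}
    for en_idx, ar_idx in alignments:
        if en_idx in english_annotations:
            arabic_annotations[ar_idx] = english_annotations[en_idx]
    return arabic_annotations

# English token 0 ("Berlin") is aligned to Arabic token 2.
alignment = [(0, 2), (1, 0), (2, 1)]
english_ann = {0: "Berlin"}
print(project_annotations(alignment, english_ann))  # {2: 'Berlin'}
```

Because the English annotations come from a high-precision NED system, the projected Arabic annotations can serve as silver-standard ground truth without manual labeling.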
AIDArabic+ was able to resolve 94% of the mentions in the news-wire corpus to
non-null entities, compared to 87% for the original AIDArabic. This expansion in coverage
was achieved with 73% mention precision, which is 4% higher than the precision of
AIDArabic and substantially better than Babelfy. Also, for the web articles, non-null mapping
increased from 85.6% to 93.9%, while keeping a mention precision of 68% (an 8% increase over
AIDArabic). This indicates that our approach captures more information about
non-prominent entities.
6.2 Outlook
There is still room for enhancing Named Entity Disambiguation on Arabic. The data
schema can be further enriched. Anchor texts have not been translated, for the sake
of the accuracy of the contextual keyphrases. Developing a translation module with proper
training data for anchor texts could enrich the keyphrase dictionary and hence achieve
better precision.
AIDArabic+ used entities extracted from the English and Arabic YAGO3; no other
language was included. Evidence from languages other than English can
be harnessed to enrich the EDRAK entity repository and dictionaries. For example, more
entities can be captured from the German Wikipedia. Also, languages written in the
Arabic script, such as Persian and Urdu, can be further processed to provide a large number
of name entries, especially for entities of type PERSON.
NED can be used in different applications. One application built on top of the
AIDA NED system is STICS [24]. STICS offers a web interface to search and explore news articles
using canonical entities instead of plain text search. AIDArabic+ can be adapted as the
NED engine for STICS to support Arabic articles, which will introduce many new use cases
and challenges.
Bibliography
[1] Nasreen AbdulJaleel and Leah S. Larkey. Statistical transliteration for english-arabic cross
language information retrieval. In Proceedings of the Twelfth International Conference on
Information and Knowledge Management, CIKM ’03, pages 139–146, New York, NY, USA,
2003. ACM.
[2] Mohamed Al-Badrashiny, Ramy Eskander, Nizar Habash, and Owen Rambow. Automatic
transliteration of romanized dialectal arabic. In Proceedings of the Eighteenth Conference on
Computational Natural Language Learning, CoNLL 2014, Baltimore, Maryland, USA, June
26-27, 2014, pages 30–38, 2014.
[3] Yaser Al-Onaizan and Kevin Knight. Machine transliteration of names in arabic text. In
Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages,
SEMITIC ’02, pages 1–13, Stroudsburg, PA, USA, 2002. Association for Computational
Linguistics.
[4] Yaser Al-Onaizan and Kevin Knight. Translating named entities using monolingual and
bilingual resources. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 400–408, Stroudsburg, PA, USA, 2002. Association for
Computational Linguistics.
[5] Yaser Al-Onaizan and Kevin Knight. Translating named entities using monolingual and
bilingual resources. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 400–408, Stroudsburg, PA, USA, 2002. Association for
Computational Linguistics.
[6] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, and Zachary Ives. Dbpedia:
A nucleus for a web of open data. In 6th Intl Semantic Web Conference, Busan, Korea,
pages 11–15. Springer, 2007.
[7] Mahmoud Azab, Houda Bouamor, Behrang Mohit, and Kemal Oflazer. Dudley north
visits north london: Learning when to transliterate to arabic. In Proceedings of the 2013
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, pages 439–444, Atlanta, Georgia, June 2013. Association for
Computational Linguistics.
[8] Ibrahim Badr, Rabih Zbib, and James Glass. Syntactic phrase reordering for english-to-arabic
statistical machine translation. In Proceedings of the 12th Conference of the European Chapter
of the Association for Computational Linguistics, EACL ’09, pages 86–93, Stroudsburg, PA,
USA, 2009. Association for Computational Linguistics.
[9] Ann Bies, Zhiyi Song, Mohamed Maamouri, Stephen Grimes, Haejoong Lee, Jonathan Wright,
Stephanie Strassel, Nizar Habash, Ramy Eskander, and Owen Rambow. Transliteration
of arabizi into arabic orthography: Developing a parallel annotated arabizi-arabic script
sms/chat corpus. ANLP 2014, page 93, 2014.
[10] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: A
collaboratively created graph database for structuring human knowledge. In Proceedings of
the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08,
pages 1247–1250, New York, NY, USA, 2008. ACM.
[11] David Chiang. Hope and fear for discriminative training of statistical translation models. J.
Mach. Learn. Res., 13(1):1159–1187, April 2012.
[12] Kareem Darwish. Named entity recognition using cross-lingual resources: Arabic as an
example. In ACL (1), pages 1558–1567. The Association for Computer Linguistics, 2013.
[13] Chris Irwin Davis. Tajik-farsi persian transliteration using statistical machine translation. In
Proceedings of the Eighth International Conference on Language Resources and Evaluation
(LREC-2012), Istanbul, Turkey, May 23-25, 2012, pages 3988–3995, 2012.
[14] Chris Dyer, Victor Chahuneau, and Noah A Smith. A simple, fast, and effective reparameterization of ibm model 2. In NAACL/HLT 2013, pages 644–648, 2013.
[15] Chris Dyer, Adam Lopez, Juri Ganitkevitch, Jonathan Weese, Ferhan Ture, Phil Blunsom,
Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. cdec: A decoder, alignment, and
learning framework for finite-state and context-free translation models. In Proceedings of
the Association for Computational Linguistics (ACL), 2010.
[16] Ramy Eskander, Mohamed Al-Badrashiny, Nizar Habash, and Owen Rambow. Foreign
words and the automatic processing of arabic social media text written in roman script.
EMNLP 2014, page 1, 2014.
[17] Paolo Ferragina and Ugo Scaiella. Tagme: On-the-fly annotation of short text fragments (by
wikipedia entities). In Proceedings of the 19th ACM International Conference on Information
and Knowledge Management, CIKM ’10, pages 1625–1628, New York, NY, USA, 2010. ACM.
[18] Spence Green, Daniel Cer, and Christopher D. Manning. Phrasal: A toolkit for new directions
in statistical machine translation. In Proceedings of the Ninth Workshop on Statistical
Machine Translation, 2014.
[19] Nizar Habash, Owen Rambow, and Ryan Roth. Mada+ tokan: A toolkit for arabic
tokenization, diacritization, morphological disambiguation, pos tagging, stemming and
lemmatization. In Proceedings of the 2nd International Conference on Arabic Language
Resources and Tools (MEDAR), Cairo, Egypt, pages 102–109, 2009.
[20] Ben Hachey, Will Radford, Joel Nothman, Matthew Honnibal, and James R. Curran.
Evaluating entity linking with wikipedia. Artif. Intell., 194:130–150, January 2013.
[21] Ondrej Hálek, Rudolf Rosa, Ales Tamchyna, and Ondrej Bojar. Named entities from
wikipedia for machine translation. In ITAT, pages 23–30. Citeseer, 2011.
[22] Ondrej Hálek, Rudolf Rosa, Ales Tamchyna, and Ondrej Bojar. Named entities from
wikipedia for machine translation. In ITAT, pages 23–30. Citeseer, 2011.
[23] Kenneth Heafield. KenLM: faster and smaller language model queries. In Proceedings of the
EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh,
Scotland, United Kingdom, July 2011.
[24] Johannes Hoffart, Dragan Milchevski, and Gerhard Weikum. Stics: Searching with strings,
things, and cats. In Proceedings of the 37th International ACM SIGIR Conference on
Research & Development in Information Retrieval, SIGIR ’14, pages 1247–1248, New
York, NY, USA, 2014. ACM.
[25] Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum.
KORE: Keyphrase Overlap Relatedness for Entity Disambiguation. In Proceedings of the
21st ACM International Conference on Information and Knowledge Management, CIKM
2012, Hawaii, USA, pages 545–554, 2012.
[26] Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum. Yago2: A
spatially and temporally enhanced knowledge base from wikipedia. Artif. Intell., 194:28–61,
January 2013.
[27] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal,
Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. Robust disambiguation
of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing, EMNLP ’11, pages 782–792, Stroudsburg, PA, USA, 2011. Association
for Computational Linguistics.
[28] Fei Huang, Stephan Vogel, and Alex Waibel. Improving named entity translation combining
phonetic and semantic similarities. In HLT-NAACL, volume 2004, pages 281–288, 2004.
[29] Heng Ji, HT Dang, J Nothman, and B Hachey. Overview of tac-kbp2014 entity discovery
and linking tasks. In Proc. Text Analysis Conference (TAC2014), 2014.
[30] Sarvnaz Karimi, Falk Scholer, and Andrew Turpin. Machine transliteration survey. ACM
Comput. Surv., 43(3):17:1–17:46, April 2011.
[31] Philipp Koehn. Statistical machine translation. Cambridge University Press, 2010.
[32] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico,
Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer,
Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for
statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on
Interactive Poster and Demonstration Sessions, ACL ’07, pages 177–180, Stroudsburg, PA,
USA, 2007. Association for Computational Linguistics.
[33] Young-Suk Lee. Confusion network for arabic name disambiguation and transliteration
in statistical machine translation. In COLING 2014, 25th International Conference on
Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23-29,
2014, Dublin, Ireland, pages 433–443, 2014.
[34] Young-Suk Lee. Confusion network for arabic name disambiguation and transliteration
in statistical machine translation. In COLING 2014, 25th International Conference on
Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23-29,
2014, Dublin, Ireland, pages 433–443, 2014.
[35] Xuansong Li et al. GALE arabic-english word alignment training part 1 – newswire and web,
LDC2014T05, 2014.
[36] Farzaneh Mahdisoltani, Joanna Biega, and Fabian M Suchanek. Yago3: A knowledge base
from multilingual wikipedias. In CIDR, 2015.
[37] James Mayfield, Dawn Lawrie, Paul McNamee, and Douglas W. Oard. Building a cross-language entity linking collection in twenty-one languages. In Pamela Forner, Julio Gonzalo,
Jaana Kekäläinen, Mounia Lalmas, and Maarten de Rijke, editors, Multilingual and Multimodal
Information Access Evaluation: Second International Conference of the Cross-Language
Evaluation Forum, volume 6941 of Lecture Notes in Computer Science, pages 3–13. Springer,
2011.
[38] Paul McNamee, James Mayfield, Dawn Lawrie, Douglas W Oard, and David S Doermann.
Cross-language entity linking. In IJCNLP, pages 255–263, 2011.
[39] Paul McNamee, James Mayfield, Dawn Lawrie, Douglas W. Oard, and David S. Doermann.
Cross-language entity linking. In Fifth International Joint Conference on Natural Language
Processing, IJCNLP 2011, Chiang Mai, Thailand, November 8-13, 2011, pages 255–263,
2011.
[40] Pablo N. Mendes, Max Jakob, Andres Garcia-Silva, and Christian Bizer. Dbpedia spotlight:
Shedding light on the web of documents. In Proceedings of the 7th International Conference
on Semantic Systems (I-Semantics), 2011.
[41] David Milne and Ian H. Witten. Learning to link with wikipedia. In Proceedings of the 17th
ACM Conference on Information and Knowledge Management, CIKM ’08, pages 509–518,
New York, NY, USA, 2008. ACM.
[42] Will Monroe, Spence Green, and Christopher D. Manning. Word segmentation of informal
arabic with domain adaptation. In Proceedings of the 52nd Annual Meeting of the Association
for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume
2: Short Papers, pages 206–211, 2014.
[43] Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity Linking meets Word Sense
Disambiguation: a Unified Approach. Transactions of the Association for Computational
Linguistics (TACL), 2:231–244, 2014.
[44] Preslav Nakov and Jörg Tiedemann. Combining word-level and character-level models for
machine translation between closely-related languages. In Proceedings of the 50th Annual
Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL
’12, pages 301–305, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[45] Roberto Navigli and Simone Paolo Ponzetto. Babelnet: The automatic construction,
evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell.,
193:217–250, December 2012.
[46] Roberto Navigli and Simone Paolo Ponzetto. Babelnet: The automatic construction,
evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell.,
193:217–250, December 2012.
[47] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment
models. Computational Linguistics, 29(1):19–51, 2003.
[48] Daniel Ortiz-Martı́nez and Francisco Casacuberta. The new thot toolkit for fully automatic
and interactive statistical machine translation. In Proc. of the European Association for
Computational Linguistics (EACL): System Demonstrations, pages 45–48, Gothenburg,
Sweden, April 2014.
[49] Arfath Pasha, Mohamed Al-Badrashiny, Mona Diab, Ahmed El Kholy, Ramy Eskander,
Nizar Habash, Manoj Pooleery, Owen Rambow, and Ryan M Roth. Madamira: A fast,
comprehensive tool for morphological analysis and disambiguation of arabic. Proceedings of
the Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland, 2014.
[50] Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. Local and global algorithms for
disambiguation to wikipedia. In Proceedings of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages
1375–1384, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[51] Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie. #microposts2015 – 5th workshop on
’making sense of microposts’: Big things come in small packages. In Proceedings of the 24th
International Conference on World Wide Web Companion, WWW ’15 Companion, pages
1551–1552, Republic and Canton of Geneva, Switzerland, 2015. International World Wide
Web Conferences Steering Committee.
[52] Khaled Shaalan. A survey of arabic named entity recognition and classification. Computational Linguistics, 40(2):469–510, 2014.
[53] Nakatani Shuyo. Language detection library for java, 2010.
[54] Valentin I. Spitkovsky and Angel X. Chang. A cross-lingual dictionary for english wikipedia
concepts. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck,
Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and
Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language
Resources and Evaluation (LREC’12), Istanbul, Turkey, May 2012. European Language
Resources Association (ELRA).
[55] Ralf Steinberger, Bruno Pouliquen, Mijail Kabadjov, Jenya Belyaeva, and Erik van der Goot.
Jrc-names: A freely available, highly multilingual named entity resource. In Proceedings of
the International Conference Recent Advances in Natural Language Processing 2011, pages
104–110, Hissar, Bulgaria, September 2011. RANLP 2011 Organising Committee.
[56] Tao Tao, Su-Youn Yoon, Andrew Fister, Richard Sproat, and ChengXiang Zhai. Unsupervised
named entity transliteration using temporal and phonetic correlation. In Proceedings of the
2006 Conference on Empirical Methods in Natural Language Processing, EMNLP ’06, pages
250–257, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.
[57] Jean Tavernier, Rosa Cowan, and Michelle Vanni. Holy moses! leveraging existing tools and
resources for entity translation. In Proceedings of the International Conference on Language
Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco, 2008.
[58] Tomasz Tylenda, Mauro Sozio, and Gerhard Weikum. Einstein: physicist or vegetarian?
summarizing semantic type graphs for knowledge discovery. In Proceedings of the 20th
International Conference on World Wide Web, WWW 2011, Hyderabad, India, March 28 April 1, 2011 (Companion Volume), pages 273–276, 2011.
[59] Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Michael Röder, Daniel Gerber, Sandro Athaide
Coelho, Sören Auer, and Andreas Both. AGDISTIS - agnostic disambiguation of named
entities using linked open data. In ECAI 2014 - 21st European Conference on Artificial
Intelligence, 18-22 August 2014, Prague, Czech Republic - Including Prestigious Applications
of Intelligent Systems (PAIS 2014), pages 1113–1114, 2014.
[60] Ricardo Usbeck, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both,
Martin Brümmer, Diego Ceccarelli, Marco Cornolti, Didier Cherix, Bernd Eickmann, Paolo
Ferragina, Christiane Lemke, Andrea Moro, Roberto Navigli, Francesco Piccinno, Giuseppe
Rizzo, Harald Sack, René Speck, Raphaël Troncy, Jörg Waitelonis, and Lars Wesemann.
GERBIL - General entity annotator benchmarking framework. In WWW 2015, 24th
International World Wide Web Conference, May 18-22, 2015, Florence, Italy.
[61] Marieke van Erp, Giuseppe Rizzo, and Raphaël Troncy. Learning with the web: Spotting
named entities on the intersection of NERD and machine learning. In Proceedings of the
Concept Extraction Challenge at the Workshop on ’Making Sense of Microposts’, Rio de
Janeiro, Brazil, May 13, 2013, pages 27–30, 2013.
[62] Gerhard Weikum, Johannes Hoffart, Ndapandula Nakashole, Marc Spaniol, Fabian M
Suchanek, and Mohamed Amir Yosef. Big data methods for computational linguistics. IEEE
Data Eng. Bull., 35(3):46–64, 2012.
[63] Jonathan Wright, Kira Griffitt, Joe Ellis, Stephanie Strassel, and Brendan Callahan. Annotation trees: Ldc’s customizable, extensible, scalable, annotation infrastructure. In Proceedings
of the Eighth International Conference on Language Resources and Evaluation (LREC-2012),
Istanbul, Turkey, May 23-25, 2012, pages 479–485, 2012.
[64] Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, and Gerhard Weikum.
HYENA: Hierarchical Type Classification for Entity Names. In Proc. of the 24th Intl.
Conference on Computational Linguistics (Coling 2012), December 8-15, Mumbai, India,
pages 1361–1370, 2012.
[65] Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, and Gerhard Weikum.
HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text.
In Proc. of the 51st Annual Meeting of the Association for Computational Linguistics (ACL
2013), Sofia, Bulgaria, August 4-9, 2013, pages 133–138, 2013.
[66] Mohamed Amir Yosef, Johannes Hoffart, Ilaria Bordino, Marc Spaniol, and Gerhard Weikum.
AIDA: an online tool for accurate disambiguation of named entities in text and tables.
PVLDB, 4(12):1450–1453, 2011.
[67] Mohamed Amir Yosef, Marc Spaniol, and Gerhard Weikum. AIDArabic: A named-entity
disambiguation framework for Arabic text. In The EMNLP 2014 Workshop on Arabic
Natural Language Processing (ANLP 2014), pages 187–195, Doha, Qatar, 2014. ACL.
Appendix A
Manual Assessment Interface
The following figures show the manual assessment web interface.
Figure A.1: Manual Assessment: Welcome page with instructions and steps video
Figure A.2: Manual Assessment: Data Evaluation Page: each English name has at
most three candidate translations, plus “None” and “Don’t know” choices