AIDArabic+ Named Entity Disambiguation for Arabic Text
Universität des Saarlandes, Max-Planck-Institut für Informatik
Master's Thesis in Computer Science
by Mohamed Gad-Elrab
Supervised by: Prof. Dr. Gerhard Weikum
Advised by: Mohamed Amir Yosef
Reviewers: Prof. Dr. Gerhard Weikum, Dr. Klaus Berberich
Saarbrücken, July 2015

Statement in Lieu of an Oath

I hereby confirm that I have written this thesis on my own and that I have not used any other media or materials than the ones referred to in this thesis.

Declaration of Consent

I agree to make both versions of my thesis (with a passing grade) accessible to the public by having them added to the library of the Computer Science Department.

Saarbrücken, July 2015. Mohamed Gad-Elrab

Abstract

Named Entity Disambiguation (NED) is the problem of mapping mentions of ambiguous names in natural language text onto canonical entities, such as people or places, registered in a knowledge base. Recent advances in this field enable semantic understanding of content in different types of text. While the problem has been extensively studied for English text, support for other languages, and in particular Arabic, is still in its infancy. In addition, Arabic web content (e.g. in social media) has been growing exponentially over the last few years. Therefore, we see great potential for endeavors that support entity-level analytics of these data.
AIDArabic is the first work in the direction of using evidence from both the English and Arabic Wikipedias to allow disambiguation of Arabic content against a knowledge base automatically generated from Wikipedia. The contributions of this thesis are threefold: 1) We introduce the EDRAK resource, an automatic augmentation of AIDArabic's entity catalog and disambiguation data components with information beyond the manually crafted data in the Arabic Wikipedia. We build EDRAK by fusing external web resources with the output of machine translation and transliteration applied to data extracted from the English Wikipedia. 2) We incorporate an Arabic-specific input pre-processing module into the disambiguation process to handle the complex features of Arabic text. 3) We automatically build a test corpus from a parallel English-Arabic corpus to overcome the absence of standard benchmarks for Arabic NED systems. We evaluated the data resource as well as the full pipeline using a mix of manual and automatic assessment. Our enrichment approaches in EDRAK expand the disambiguation space from 143K entities, in the original AIDArabic, to 2.4M entities. Moreover, the full disambiguation process is able to map 94.7% of the mentions to non-null entities with a precision of 73%, compared to 87.2% non-null mapping with only 69% precision in the original AIDArabic.

Acknowledgements

During this thesis, I have learned several essential research skills that, I believe, will shape my research career. Therefore, I would like to express my sincere gratitude to Prof. Gerhard Weikum for giving me the opportunity to work under his supervision in such a pioneering group, for facilitating the research, and for his valuable advice throughout the thesis. I would also like to show my sincere gratitude and appreciation to my advisor Mohamed Amir for his continuous guidance and support on the professional and personal levels.
I really appreciate his patience in teaching me loads of essential research, communication and planning skills. I am extremely thankful for his generosity in sharing his valuable expertise and time. Working with him was one of the richest experiences of my life, and I hope to have the opportunity to work with him again in the future. I would like to thank Akram El-korashy, Mayank Goyal and Uzair Mahmoud for their useful feedback regarding the thesis writing. Also, our experiments could not have been finalized without the help of the proactive volunteers who agreed to participate in our manual assessment. I would also like to thank the reviewers of this thesis for their precious time and effort. At the end of this master's program, I am grateful to the International Max Planck Research School for Computer Science (IMPRS-CS) family for their support throughout the program. I believe I was fortunate to be part of this big family. On the personal level, words will never be enough to express how thankful and indebted I am to my family (my parents, sister and brother) for their sincere support, encouragement and prayers throughout my life and my long education journey. I appreciate their patience towards my continuous absence. I would like to thank all my friends in Saarbrücken; I am blessed to be surrounded by such intelligent, caring and enthusiastic personalities. Finally, I would like to extend my gratitude to everyone who expressed their support and/or made Dua for me.

Beijing, China. Mohamed Gad-Elrab, July 2015

Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Problem
  1.3 Proposed Solution
  1.4 Outline
2 Background
  2.1 Named Entity Disambiguation
  2.2 Named Entity Disambiguation for Arabic
  2.3 AIDArabic: Under The Hood
    2.3.1 Data Sources
    2.3.2 Processing
3 AIDArabic+
  3.1 AIDArabic Challenges
  3.2 AIDArabic+ in A Nutshell
  3.3 Enriching Data Components
  3.4 Language Specific Processing
4 EDRAK: Entity-Centric Resource For Arabic Knowledge
  4.1 External Name Dictionaries
    4.1.1 Entity-Aware Resources
    4.1.2 Lexical Name Dictionaries
  4.2 Named-Entities Translation
    4.2.1 General Statistical Machine Translation
    4.2.2 Named-Entities SMT
    4.2.3 Named-Entities Light-SMT
    4.2.4 Named-Entities Full-SMT
  4.3 Transliteration
    4.3.1 Transliteration Approaches
    4.3.2 Character-Level Statistical Machine Translation
  4.4 Arabic Names Splitting
  4.5 EDRAK as a Standalone Resource
    4.5.1 Use-cases
    4.5.2 Technical details
5 Evaluation and Statistics
  5.1 EDRAK Evaluation
    5.1.1 Statistics
    5.1.2 Data Example
    5.1.3 Manual Assessment
  5.2 AIDArabic+ Evaluation
    5.2.1 Arabic Corpus Creation
    5.2.2 Experiment Setup
    5.2.3 Results and Discussion
6 Conclusion and Outlook
  6.1 Conclusion
  6.2 Outlook
Bibliography
A Manual Assessment Interface

List of Figures

1.1 Internet Users Population
2.1 Building Name Dictionary in AIDArabic
2.2 Building Keyphrases Dictionary in AIDArabic
3.1 Building Name Dictionary in AIDArabic+
3.2 Building Keyphrases Dictionary in AIDArabic+
4.1 General Statistical Machine Translation Pipeline
4.2 Single token translation using popularity voting
4.3 Type-Aware Entity-Name Translation using full SMT system
A.1 Manual Assessment 1
A.2 Manual Assessment 2

List of Tables

4.1 Sample of Google-Word-to-Concept raw data
4.2 Sample of JRC-Names raw data
4.3 Sample of CMUQ-Arabic-NET raw data
4.4 Entity Names SMT Training Data Size
4.5 Sample of character-level training data
4.6 Arabic names splitting
4.7 Main SQL Tables in EDRAK
5.1 AIDArabic vs EDRAK
5.2 Enrichment Techniques Contribution
5.3 Number of Entities per Type in AIDArabic vs EDRAK
5.4 Contextual keyphrases dictionary: AIDArabic vs EDRAK
5.5 Example from EDRAK resource
5.6 Manual Assessment Results
5.7 LDC2014T05 Annotated Benchmark
5.8 Disambiguation Results

Chapter 1
Introduction

1.1 Motivation

Entities such as persons, organizations and locations can be referred to by many different name aliases, and similarly, the same name can be used to refer to different entities. For example, Barack Obama can be referred to as "Barack Hussein Obama", "Obama", or "USA president" in different pieces of text. This type of ambiguity makes it challenging for Information Extraction (IE) and Information Retrieval (IR) systems to retrieve information about these entities. Named Entity Disambiguation (NED) is the process of resolving the different mentions of people, organizations and places that appear in text onto canonical entities in a knowledge base [27, 67] such as DBpedia [6] and YAGO [26]. NED is essential for several IR and Semantic Analysis tasks.
It can help create accurate analytics over canonical entities instead of ambiguous mention strings [20]. Furthermore, NED can help advance applications such as entity-based search, text summarization and news analysis [24].

Arabic is one of the most widely spoken languages around the globe. As shown in Figure 1.1a, in December 2013, Arabic was estimated to have the 4th-largest online user population (135M) after English, Chinese and Spanish, followed by languages such as Japanese, German, French and Russian. Moreover, Arabic had the fastest-growing population on the Internet in the period between 2000 and 2013, with around 5000% growth. Consequently, Arabic online unstructured content such as news articles, forums, blogs and social media is rapidly growing. For instance, in March 2014, Arabic-speaking users contributed an average of 17.5M tweets/day to Twitter alone (http://www.arabsocialmediareport.com/).

[Figure 1.1: Internet Users Population. (a) Number of users; (b) Users growth]

On the other hand, the amount of structured or semi-structured Arabic content is lagging behind. For example, Wikipedia is one of the main resources from which many modern Knowledge Bases (KB) are extracted, and it is heavily used in the literature for IR and NLP tasks. However, the Arabic Wikipedia is an order of magnitude smaller than the English one. Furthermore, structured data in the Arabic Wikipedia, such as infoboxes, is on average of lower quality in terms of coverage and accuracy. Therefore, Arabic is still considered a resource-poor language.

1.2 Problem

While NED is a well-studied problem for English input, few systems have considered extending NED to other languages such as Arabic. Adapting NED to Arabic text exhibits three main challenges:

• Limited structured resources: NED systems usually require dictionaries that link the candidate entities to one or more name aliases.
Moreover, they require a textual representation of entities (entity description or entity context), usually in the form of a set of keyphrases. Keyphrases are essential for estimating the context similarity between candidate entities and the retrieved mention [67]. Dictionaries built from Arabic structured resources are limited in size and quality, which restricts their ability to offer a robust NED process.

• Arabic language characteristics: Arabic is a morphologically rich language with a character set and writing rules different from those of Latin-alphabet languages. Standard English tokenization and normalization (i.e. lemmatization and stemming) techniques are not suitable for Arabic text. Incorrectly tokenized input heavily reduces the quality of name-dictionary look-up and similarity measurement [67]. For example, pronouns written connected to the preceding word should be separated out for better string matching.

(In July 2015, the Arabic Wikipedia had 374,291 articles, while the English Wikipedia had 4,910,360.)

• Annotated NED corpus: There is no Arabic corpus with semantic/entity annotation. Annotated corpora are essential for tuning NED parameters as well as for measuring the overall performance of the NED process during development.

1.3 Proposed Solution

In this thesis, we introduce AIDArabic+, an NED system geared for Arabic text input. AIDArabic+ overcomes the mentioned challenges by utilizing:

• Enriched Arabic data schema (EDRAK): EDRAK is an Arabic entity-centric resource that offers comprehensive name and contextual-keyphrase dictionaries for entities from both the Arabic and English Wikipedias. The dictionaries are not limited to manually crafted data in the Arabic Wikipedia; instead, several external name dictionaries are harnessed. In addition, type-aware named-entity translation and transliteration techniques are developed to automatically compile EDRAK's dictionaries.
• Arabic input pre-processing components: We integrated an Arabic morphology-based pre-processing component to perform deep tokenization and normalization, consequently enhancing name matching and context-similarity estimation.

• Annotated NED corpus for Arabic: We automatically created an Arabic annotated corpus using a manually translated and aligned parallel English-Arabic corpus. The produced corpus was used to evaluate the effect of the proposed components.

1.4 Outline

The following chapter discusses Named Entity Disambiguation concepts and essential components, as well as existing systems supporting Arabic. Chapter 3 describes our general approach to creating AIDArabic+. Then, the creation of EDRAK, our enriched data schema, is explained in Chapter 4. We describe the statistics of the generated EDRAK resource and the manual assessment performed to verify its quality in Chapter 5. Finally, we discuss the creation of the annotated corpus and the effect of the proposed approaches on the quality of the full NED pipeline.

Chapter 2
Background

2.1 Named Entity Disambiguation

Named-Entity Disambiguation (NED), or Entity Linking in some literature [20], is the problem of mapping ambiguous mentions of named entities such as places, organizations, and persons appearing in natural language input text onto canonical entities registered in a Knowledge Base (KB) [27] such as DBpedia [6], YAGO [26] or BabelNet [45]. NED is different from Named Entity Recognition (NER), which is only concerned with extracting named entities and classifying them into the coarse-grained categories LOCATION, ORGANIZATION, MISC and PERSON. NER is usually performed to recognize named entities for the disambiguation process. It is also worth noting that NED tasks are different from Word Sense Disambiguation (WSD) tasks.
While WSD is concerned with resolving the correct meaning of words and concepts such as "bank" or "plant" in the provided context, NED focuses on mapping ambiguous names to the correct entities. For instance, the sentence "Müller plays for the German National Team" has two ambiguous names, "Müller" and "German National Team". "Müller" can refer to any person with this name, and "German National Team" can be either the German football team or the German basketball team. Nevertheless, both tasks require rich back-end name and word-context dictionaries to map the word correctly [62].

The NED problem is well studied for English input. Several NED systems have been developed for the English language, such as DBpedia Spotlight [40], Illinois Wikifier [50], Tagme2 [17], AIDA [27, 66], NERD-ML [61], AGDISTIS [59] and Babelfy [43]. Besides, several annotated corpora have been developed to evaluate the performance of these systems on English input [60], such as the TAC Entity-Linking task [29], KORE-50 [25], #Microposts2015 [51] and AIDA-CoNLL [27].

On the other hand, only a few of these systems are capable of processing input in other languages. Moreover, only a few corpora exist for other languages, such as the TAC Entity-Linking Spanish and Chinese tracks.

2.2 Named Entity Disambiguation for Arabic

Resource-poor languages such as Arabic have limited support. Some research attempted to support Arabic disambiguation as a Cross-Lingual Information Retrieval (CLIR) problem. McNamee et al. (2011) [38] developed a cross-language entity-linking approach to map names in text of any language to the English Wikipedia entities registered in the TAC-KBP [63, 39]. The input names and context are translated/transliterated to English before processing; then, the linking is performed as a monolingual English problem.
To evaluate the performance of their approach, they developed a persons-only cross-language ground truth for their experiments, using parallel corpora and crowd-sourcing to create the annotations [37]. However, this approach overlooks the language- and culture-specific entities and names that may not exist in an English-only KB.

To the best of our knowledge, Babelfy and AIDArabic [67] are the only systems built to disambiguate Arabic mentions against a KB containing Arabic entities together with their potential names. Babelfy (http://babelfy.org/) is a multilingual system that combines both the WSD and the NED tasks. It uses BabelNet [46] as its back-end KB. BabelNet is a multilingual resource built from Wikipedia entities and WordNet senses. Babelfy uses sense labels, Wikipedia titles from incoming links, outgoing anchor texts, redirects and categories as sources for the disambiguation context. In addition, an off-the-shelf translation service was used to translate Wikipedia concepts into other languages. Nevertheless, translation was not applied to named entities [46]. Babelfy was evaluated on English, Spanish, Italian, German, and French corpora, but not on an Arabic one.

AIDArabic [67] is an NED system that has been built specifically for Arabic on top of the YAGO3 [36] KB. While AIDArabic's entity catalog spans a sufficiently large number of entities from the English and Arabic Wikipedias, its entity-name and entity-description dictionaries exhibit low coverage. Hence, the recall of the disambiguation is heavily harmed.

2.3 AIDArabic: Under The Hood

In this section, we describe in detail the main data components of AIDArabic and how they are used in the disambiguation process.

2.3.1 Data Sources

AIDArabic, similar to most NED frameworks, has three main data components: the Entity Catalog, the Name-Entity Dictionary and the Entity Descriptions.
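As a rough illustration, the three components could be modeled as follows; the class and field names are our own hypothetical sketch, not AIDArabic's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the three data components; class and field
# names are illustrative, not AIDArabic's actual schema.

@dataclass
class EntityCatalog:
    # canonical entity ids, e.g. "<Barack_Obama>"
    entities: set = field(default_factory=set)

@dataclass
class NameEntityDictionary:
    # name alias -> ids of all entities that may carry this name
    names: dict = field(default_factory=dict)

    def add(self, name: str, entity: str) -> None:
        self.names.setdefault(name, set()).add(entity)

@dataclass
class EntityDescriptions:
    # entity id -> {keyphrase: weight}
    keyphrases: dict = field(default_factory=dict)

catalog = EntityCatalog({"<Barack_Obama>"})
names = NameEntityDictionary()
names.add("Obama", "<Barack_Obama>")
```

Note that a single name can map to several entities, which is exactly the ambiguity the disambiguation step must resolve.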
In addition to these three components, AIDArabic uses an entity-entity relatedness model as a supporting component.

Entity Catalog

The Entity Catalog, or repository, is the source of the canonical entities known to the NED system. During the disambiguation process, all names in the text are mapped to one of the entities in the catalog. Names without a proper mapping to any entity in the catalog are mapped to null. AIDArabic populates its entity catalog from the YAGO3 KB [36], built from both the English and the Arabic Wikipedias. This allows capturing prominent English entities as well as culture-specific Arabic entities. For the sake of data integrity, English entity identifiers are used to represent entities existing in both the English and the Arabic Wikipedias.

Entity-Name Dictionary

The Entity-Name Dictionary contains the possible names for each entity in the catalog. Names in the dictionary are connected to all potential entities. The dictionary is then used to extract all possible candidates for mentions appearing in the text. Entities that do not have any potential names cannot appear in the disambiguation candidate list. In AIDArabic, the name dictionary is populated from the Arabic Wikipedia data only (Figure 2.1), and names come from one of the following four sources:

• Titles of the Wikipedia pages. Titles are different from the page id that appears in the URL.

[Figure 2.1: Building Name Dictionary in AIDArabic]

• Disambiguation Pages. These pages list all possible entities/meanings referred to by a specific name. The title of the disambiguation page is added as a potential name for all entities referenced in this page.

• Redirects. Redirects are pages with no actual content that refer the reader to another page.
Redirects are used when searching Wikipedia to route the user to the most prominent entity referred to by a given name. For example, searching for "أم القرى" (Um Alkora) redirects to "مكة" (Mecca).

• Anchor Texts of links pointing to the entity page. Anchor texts can differ from the original title, and hence they are harvested as potential names.

As shown, only manually crafted content is used in building the name dictionary. This limits the size of the dictionary to the existing Arabic content. Technically, this information is collected from YAGO3 RDF tuples, where redirects are represented with the predicate <redirectedFrom> and the remaining sources are represented under the predicate rdfs:label. In addition, separated person names under <hasGivenName> and <hasFamilyName> are added to the dictionary.

Entity Description

Entity descriptions, or contextual keyphrases, are the set of keywords that describe an entity and are expected to appear in the text surrounding the entity mention. For example, when "Thomas Müller", the German footballer, appears in some text, words related to football usually appear as well, such as football, match, national team, Germany, goal, etc. Contextual keyphrases are used to compute the similarity between the mention context and the candidate entity context. AIDArabic utilizes an entity-description dictionary of Arabic keyphrases. Keyphrases are further split up into keywords, with a specific weight for each. Keyphrases are harvested from three sources (Figure 2.2):

[Figure 2.2: Building Keyphrases Dictionary in AIDArabic]

• Anchor Texts inside the Arabic entity pages that point to other pages are assigned as keyphrases of the source entity.
• Inlink Titles are the titles of the pages that link to the current entity. Inlink titles of Arabic pages pointing to an entity are added directly to its keyphrase set. English inlink titles, however, are translated to Arabic via the cross-language inter-Wikipedia links dictionary. For example, to include the inlink title of the page "<Egypt>", the dictionary pair "<Egypt> → <ar/مصر>" is used to get the Arabic title "مصر".

• Categories are manual classes added to each entity. Similar to entities, YAGO3 contains the union of the English and Arabic categories. English Wikipedia ids are used to represent the categories, unless a category exists only in Arabic. Similar to inlink titles, Arabic categories are added directly to the keyphrases, while English categories are translated via the cross-language inter-Wikipedia links between categories.

As mentioned above, only inlink titles and categories are translated, using the manual inter-Wikipedia links. Moreover, no other external dictionaries are used; hence, the context is still limited to the size of the Arabic Wikipedia.

Entity-Entity Relatedness Model

It is common that the entities mentioned within a single text snippet or document are related to each other. Therefore, AIDArabic exploits an entity-entity relatedness model to improve the quality of the disambiguation results. The relatedness is estimated based on the overlap in the incoming links [41], fused from both the Arabic and the English Wikipedias.

2.3.2 Processing

AIDArabic, like most NED systems, starts by retrieving the possible name mentions from the input text. Name mentions are usually recognized via a NER system. Retrieved mentions are normalized (e.g. converting text to lowercase or uppercase in English). Then, the possible candidate entities for the mentions in the text are retrieved from the Entity Catalog using the Name Dictionary.
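The normalization-and-lookup step just described can be sketched as follows; the dictionary layout and the toy normalizer are our assumptions (AIDArabic's real matching is a strict look-up against the name dictionary):

```python
def normalize(mention: str) -> str:
    """Toy normalization: case-folding and whitespace cleanup.

    Stands in for the language-dependent normalization mentioned
    above (e.g. lowercasing for Latin-script input).
    """
    return " ".join(mention.lower().split())

def retrieve_candidates(mention: str, name_dictionary: dict) -> set:
    """Return all entities the name dictionary links to this mention.

    name_dictionary maps a normalized name to a set of entity ids.
    Mentions absent from the dictionary get no candidates and end up
    mapped to null by the disambiguation step.
    """
    return name_dictionary.get(normalize(mention), set())

names = {
    "obama": {"<Barack_Obama>"},
    "germany": {"<Germany>", "<Germany_national_football_team>"},
}
print(retrieve_candidates("  Obama ", names))  # {'<Barack_Obama>'}
```

This also illustrates why strict matching is fragile: any token the normalizer fails to clean up (a very common situation for unsegmented Arabic clitics) simply misses the dictionary key.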
To resolve the mention-to-entity mapping, a weighted graph of the mentions and the candidate entities is constructed. Weights on edges between mentions and their candidates are estimated from the similarity between the entity keyphrases and the mention context, as well as from the candidate entity's popularity (i.e. prior). Weights on edges between candidate entities are assigned according to the entity-entity relatedness scores. The disambiguation problem is solved by iteratively reducing the graph to a dense subgraph until each mention is connected to exactly one candidate.

Chapter 3
AIDArabic+

3.1 AIDArabic Challenges

The original AIDArabic introduced NED for Arabic text. Nevertheless, it exhibited a low recall compared to the English AIDA [67]. This problem has two main roots. First, while AIDArabic utilized a comprehensive entity catalog, the generated name and contextual-keyphrase dictionaries are still limited to the manually crafted information in the Arabic Wikipedia and the cross-language inter-Wikipedia links. In turn, the Arabic Wikipedia does not have enough coverage and quality: it not only misses many entities, but the existing entities also have short, non-comprehensive pages. The Arabic Wikipedia, as the most used structured resource, is not capable of covering the fast-growing Arabic content. Secondly, AIDArabic follows the same tokenization and normalization applied to Latin input, without any Arabic-specific pre-processing. Improperly tokenized names cannot be matched against the name dictionary using the strict matching mechanism adopted in AIDArabic; consequently, no candidate entities will be retrieved for such mentions. Similarly, the entity-mention similarity computation using keyphrases is negatively affected.

3.2 AIDArabic+ in A Nutshell

AIDArabic+ aims at achieving robust NED on Arabic text. Hence, we need to target the weak points in both the data schema and the processing components to enhance the overall recall and precision.
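To make the Arabic-specific pre-processing weak point concrete, here is a hedged sketch of typical Arabic orthographic normalization and clitic splitting. The rules below (alef/ya/taa-marbuta unification, diacritic removal, splitting a leading waw conjunction) are common Arabic IR normalizations and are our illustrative assumptions, not necessarily the exact procedures of the module described in Section 3.4.

```python
import re

# Hedged sketch: common Arabic orthographic normalizations used in IR,
# not necessarily the exact rules of the thesis's pre-processing module.
ALEF_VARIANTS = "أإآٱ"
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # harakat + tatweel

def normalize_arabic(token: str) -> str:
    token = DIACRITICS.sub("", token)           # drop short vowels etc.
    for ch in ALEF_VARIANTS:
        token = token.replace(ch, "ا")          # unify alef forms
    return token.replace("ى", "ي").replace("ة", "ه")

def split_conjunction(token: str) -> list:
    """Toy clitic splitting: detach a leading waw conjunction 'و'."""
    if len(token) > 3 and token.startswith("و"):
        return ["و", token[1:]]
    return [token]
```

With such splitting, a mention written as "والكتاب" ("and the book") can match a dictionary entry stored as "الكتاب", which strict matching over the raw token would miss.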
In this work, we introduce the EDRAK resource as an automatic augmentation of the AIDArabic resource. We propose two approaches to overcome the limited data of the Arabic Wikipedia. The first approach is to collect names from other resources on the web containing possible names, using semantic and syntactic equivalence. The second is to incorporate translation and transliteration techniques to automatically generate Arabic content based on evidence from the English and Arabic Wikipedias, beyond the direct cross-language inter-Wikipedia mapping. To guarantee building an accurate data schema, different rules are enforced on these techniques according to the type of the entity and the source of the data, as discussed in Section 3.3. In addition, we introduce the integration of a pre-processing component into the NED pipeline to handle Arabic-specific features. Section 3.4 illustrates the set of procedures proposed for proper Arabic normalization and tokenization to achieve better name matching and context-similarity estimation.

3.3 Enriching Data Components

As illustrated in Section 2.3, the three main data components necessary for our NED system are an entity catalog, a name dictionary and entity descriptions (i.e. contextual keyphrases). The fourth component of AIDArabic, the entity-entity relatedness model, depends on the topology of the KB but not on the language used; hence, it does not require language-specific enhancements. This section discusses the idea behind applying enrichment techniques to each of the three main components to improve the Arabic NED process. The design decisions and the implementation of the proposed enrichment approaches (as closed modules) are discussed in Chapter 4.

Let us consider a hypothetical Arabic sentence, written here in English for clarity: "Aaidh Al-Qarni might get nominated for the Goethe Prize for his book La Tahzan." This sentence has three named entities: a writer (Aaidh Al-Qarni), a prize (Goethe Prize) and a book (La Tahzan). We will illustrate how each data component of the NED framework can be adapted to correctly disambiguate them.

Entity Catalog

Considering our example, the writer is well known enough to exist in both the English and the Arabic Wikipedias. However, despite the fact that the book is translated, it has only an Arabic Wikipedia page. Moreover, the prize is not well known enough in the Arab world to exist in the Arabic Wikipedia (as of 1 July 2015). To disambiguate such a sentence, we need to make sure that the entity repository contains all of these entities. Therefore, we followed the same approach as in AIDArabic [67]: we used YAGO3, compiled from both the English and the Arabic Wikipedias, as our back-end KB. This allows capturing prominent English entities as well as local entities that are only known in the Arabic culture.

Name Dictionary

Generally, the entity-name dictionary is an influential component of any NED system. Having an incomplete dictionary dramatically harms the disambiguation quality. If the dictionary misses a name-entity entry, either no candidates will be nominated for disambiguation or, even worse, a wrong entity might be picked for one or more mentions. Since modern NED systems consider coherence measures when collectively resolving all mentions, one or more wrong mappings might propagate and mislead the mapping of other mentions onto wrong canonical entities.

We started with the same sources collected in the original AIDArabic: we harvest Wikipedia page titles, anchor texts, disambiguation pages' titles (under the predicate rdfs:label in YAGO3 [36]) and redirects. Also, we include separated given names and family names of persons.
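Collecting these sources from YAGO3-style RDF tuples might look like the sketch below. The triple layout and sample data are invented for illustration; the predicate names are the ones cited above.

```python
# Toy triples: (subject, predicate, object). Predicate names follow
# the ones cited in the text; the sample data is invented.
triples = [
    ("<Barack_Obama>", "rdfs:label", "Barack Obama"),        # title/anchor/disamb. title
    ("<Barack_Obama>", "<hasGivenName>", "Barack"),
    ("<Barack_Obama>", "<hasFamilyName>", "Obama"),
    ("<Barack_Obama>", "<redirectedFrom>", "Barack Hussein Obama"),
]

NAME_PREDICATES = {"rdfs:label", "<redirectedFrom>",
                   "<hasGivenName>", "<hasFamilyName>"}

def build_name_dictionary(triples):
    """name -> set of candidate entities, in the spirit of Figure 2.1."""
    dictionary = {}
    for subj, pred, obj in triples:
        if pred in NAME_PREDICATES:
            dictionary.setdefault(obj, set()).add(subj)
    return dictionary

print(sorted(build_name_dictionary(triples)))
# ['Barack', 'Barack Hussein Obama', 'Barack Obama', 'Obama']
```

The inverted orientation (name as key, entities as values) is what candidate retrieval needs at disambiguation time.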
Nevertheless, the contribution of these sources to the Arabic dictionary is limited, as shown in the statistics in Section 5.1. In order to correctly disambiguate the names in our example, we need a dictionary that is aware of the Arabic names of all three entities. Since the writer (Aaidh Al-Qarni / عائض القرني) and the book (La Tahzan / لا تحزن) exist in the Arabic Wikipedia, our name dictionary has at least one Arabic name for both of them (their page titles). On the other hand, the Goethe Prize exists only in the English Wikipedia, without any potential Arabic name. Therefore, the correct entity will not be nominated as a candidate for its Arabic mention. In this work, we propose to go beyond Wikipedia content via automatic data generation. However, it is a challenging task to automatically build an entity-name dictionary that captures name variations for all entities in the entity catalog while preserving data precision.

[Figure 3.1: Building the Name Dictionary in AIDArabic+ (names are collected from Arabic and English titles, anchors, disambiguation pages and redirects in YAGO3 (EN, AR), plus external dictionaries, translated names and transliterated person names)]

For example, the name of "Goethe Prize" in Arabic is obtained by (1) transliterating "Goethe" into the Arabic script, (2) translating "Prize" into Arabic, and finally (3) reordering the tokens to follow Arabic writing rules. Therefore, we introduce three approaches to enrich the entity-name dictionary of AIDArabic+:

1. External Name Dictionaries: We harness existing English-Arabic name dictionaries via semantic and syntactic equivalence. For example, if two strings from one or more dictionaries link to the same canonical entity, we consider them potential name aliases.
In Section 4.1, we discuss the harnessed resources as well as the procedure designed for the integration process.

2. Entity-Name Translation: While external dictionaries (e.g. gazetteers) and hyperlinks extracted from the web provide Arabic names for some English entities, many entities still lack potential names in the Arabic world. Arabic names should be generated instead of only extracted. Moreover, general-purpose translators exhibit problems when translating/transliterating named entities [4, 21, 7], even if the names appear within a context. Therefore, we introduce entity-name translation to populate our dictionary with accurate, automatically generated Arabic names, as discussed in detail in Section 4.2.

3. Persons Names Transliteration: A fair amount of entities obtained one or more Arabic names using external resources and/or translation. However, English names have different variants when written in the Arabic script. In addition, not all name variants can be generated by translation. Therefore, we incorporate a transliteration module geared towards PERSON names. While transliteration is applicable to many NON-PERSON entities, applying it to such entities would create many inaccurate entries that should instead be fully or partially translated. Thus, we decided to exclude them from the transliteration process.

Technically, we applied these approaches to the English names from all sources, namely given names, family names, redirects and rdfs:label (which includes titles, anchor texts and disambiguation pages), as in Figure 3.1.

[Figure 3.2: Building the Keyphrases Dictionary in AIDArabic+ (entity keyphrases are collected from Arabic anchor texts, inlink titles and categories in YAGO3 (EN, AR); English inlinks and categories are mapped through the inter-wiki dictionary and translation)]
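The semantic-equivalence idea behind approach 1 can be sketched as follows: names from different dictionaries that link to the same canonical entity are grouped as potential aliases. The dictionaries and entity identifiers below are hypothetical toy data, not the actual harvested resources.

```python
from collections import defaultdict

def group_aliases(*dictionaries):
    """Merge name->entity dictionaries; names sharing an entity become aliases."""
    aliases = defaultdict(set)  # entity -> set of potential name aliases
    for dictionary in dictionaries:
        for name, entity in dictionary.items():
            aliases[entity].add(name)
    return aliases

# Hypothetical entries from two external name dictionaries
dict_a = {"Goethe Prize": "<Goethe_Prize>", "Goethe-Preis": "<Goethe_Prize>"}
dict_b = {"جائزة جوته": "<Goethe_Prize>", "Aaidh Al-Qarni": "<Aaidh_al-Qarni>"}
aliases = group_aliases(dict_a, dict_b)
```

In the real pipeline the merged alias sets are what gets added to the entity-name dictionary after the filtering described in Section 4.1.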
Entity Contextual Keyphrases

Contextual keyphrases are used as a set of descriptions for an entity. Entity keyphrases are matched against the input text to compute a similarity score between the expected context of the candidate entity and the context in which the mention appears. As shown in Figure 3.2, we used the standard approaches of the original AIDArabic to extract entity contexts from Wikipedia (anchor texts, categories, and inlink titles). Similar to AIDArabic, the English inlink titles and categories are looked up in the English-Arabic inter-Wikipedia links dictionary. However, English names without entries in the dictionary were neglected; consequently, many English entities were not supported by any context description. Such entities that lack adequate description cannot be promoted as the winning entity in the mapping process.

In AIDArabic+, we overcame the low coverage and quality of the Arabic Wikipedia by applying the entity-name translation and person-name transliteration techniques to the Wikipedia inlink titles remaining after the inter-Wikipedia dictionary look-up. Furthermore, we trained our category translation module on a parallel corpus extracted from the English-Arabic inter-Wikipedia links of categories. Finally, while it seems possible to translate anchor texts as well, we did not do so, to avoid the inaccuracy resulting from their being noisy and sometimes long.

3.4 Language Specific Processing

Name matching and context similarity computation are necessary for a successful disambiguation process. Achieving robust matching requires clean input. However, due to the specific morphological characteristics of Arabic text, standard English text normalization and tokenization techniques (e.g. converting to lowercase) are not directly suitable. For example, Arabic text has several specific features:

• The definite article (AL) "ال" is attached to the beginning of the word (e.g. "the library": المكتبة). Nevertheless, not every "ال" at the beginning of a word is a definite article.

• Some prepositions, such as ب، ك، ل, and the connectors ف، و appear attached to the beginning of the word (e.g. ب at the beginning of "in France": بفرنسا).

• Several pronouns are attached to the end of the word, for example, the last two characters ها at the end of حديقتها (meaning "its park").

• Sometimes, Arabic text is written with diacritics to express vowels and facilitate pronunciation. Arabic diacritics appear as decorations above and under the base character. They differ according to the meaning of the word and its position in the sentence (i.e. subject or object).

In AIDArabic+, we incorporate an Arabic-specific pre-processing component for the input text and for data schema building. There are two state-of-the-art systems that perform morphology-based analysis: MADAMIRA [49] and the Stanford Arabic Word Segmenter [42]. The Stanford Word Segmenter provides a handy, easily integrable Java API. Hence, we have used it for pre-processing. Our pre-processing component performs two main steps:

Tokenization

Besides the normal word tokenization based on punctuation and spaces, we perform word segmentation to split clitics and attached connectors. Split suffixes and prefixes are marked with the special character '+', indicating that they were originally connected to the next or previous word. This allows reconstructing the original input text. For example, بفرنسا (pronounced as "BFranca") is segmented into "ب+ فرنسا", and حديقتها is segmented into "حديقت +ها". It is worth noting that the Stanford Word Segmenter does not split the definite article. Splitting prefixes and suffixes, i.e. clitics, should increase name matching accuracy, and hence enhance the coverage and quality of the candidate retrieval process.
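The '+' marking scheme above can be illustrated with a toy segmenter. This is a sketch only: the actual system relies on the Stanford Arabic Word Segmenter, and the clitic lists here are simplified assumptions.

```python
PREFIXES = ("ب", "ل", "ك", "ف", "و")  # simplified clitic prepositions/connectors
SUFFIXES = ("ها", "ه")                 # simplified attached pronouns

def mark_clitics(token):
    """Split one known prefix/suffix and mark it with '+' so that the
    original token remains reconstructible."""
    parts = []
    for p in PREFIXES:
        if token.startswith(p) and len(token) > len(p) + 1:
            parts.append(p + "+")      # prefix attaches to the following part
            token = token[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if token.endswith(s) and len(token) > len(s) + 1:
            suffix = "+" + s           # suffix attaches to the preceding part
            token = token[:-len(s)]
            break
    parts.append(token)
    if suffix:
        parts.append(suffix)
    return parts
```

Note that a real segmenter must also decide, from context, whether an apparent clitic is genuinely a clitic, which is exactly the ambiguity (e.g. a word-initial "ال" that is not a definite article) that motivates using a trained morphological segmenter instead of string rules.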
Furthermore, it allows better keyphrase matching, which is essential for achieving accurate disambiguation results.

Normalization

Unlike English, Arabic text normalization includes several steps, and the applied normalization should be customized according to the application. The most common normalizations are:

• Removing Diacritics: Despite the fact that removing diacritics increases the ambiguity of some words, it is important to obtain a uniform representation of the same word. For example, a fully diacritized version of the sentence ولد ألبرت أينشتاين في مدينة أولم ("Albert Einstein was born in the city of Ulm") reduces to this undiacritized form after removing the diacritics.

• Normalizing Hamza: This includes replacing the different forms of the letter Hamza (e.g. أ، إ، آ) with the normalized form ا. This helps avoid common typing mistakes and confusion between different states of the same word.

• Normalizing Ya: In order to avoid a common writing mistake in informal text, the character Ya ي (with dots) is replaced with ى (without dots).

• Normalizing Ta-marbutah: Also, to avoid different writing forms, the Ta-marbutah ة (with dots) is replaced with ه (without dots).

• Removing Tatweel: Some informal text uses series of 'ـ' (U+0640) to stretch the word. All Tatweel characters should be removed to obtain the pure word.

• Normalizing Punctuation: In some cases, it is useful to replace the Arabic punctuation with the equivalent ASCII symbols.

In AIDArabic+, we apply diacritics removal, Hamza normalization, Ya normalization, Ta-marbutah normalization and Tatweel removal to achieve decent input quality that guarantees higher matching recall without sacrificing precision. Recalling our example from Section 3.3, the input sentence

قد يترشح عائض القرني لجائزة جوته عن كتابه لا تحزن

after normalization and tokenization becomes

قد يترشح عائض القرني ل+ جائزة جوته عن كتاب +ه لا تحزن
As in the example, names and contextual keyphrases in the normalized sentence become clearer for matching against the AIDArabic+ dictionaries. The preposition ل+ attached to the beginning of the word جائزة is detached and can be treated as a stop word. Similarly, the pronoun +ه has been detached from the end of كتابه.

Chapter 4
EDRAK: Entity-Centric Resource For Arabic Knowledge

EDRAK is an entity-centric resource developed as the back-end schema for AIDArabic+ NED, as shown in Section 3.3. This chapter focuses on the automatic generation techniques beyond Wikipedia used in EDRAK, together with the decisions taken within each technique. Section 4.1 describes the integration of external dictionaries. Then, named-entity translation methods are discussed in Section 4.2, person names transliteration in Section 4.3, and Arabic names splitting in Section 4.4. Finally, technical details about EDRAK as a general-purpose standalone resource are explained in Section 4.5.

4.1 External Name Dictionaries

Wikipedia, as the largest comprehensive online encyclopedia, is the most used corpus for creating knowledge bases such as YAGO [26], DBpedia [6] and Freebase [10]. Due to the limited size of the Arabic Wikipedia, building strong semantic resources becomes a challenge. One approach to go beyond Wikipedia's limits is to capture possible Arabic names mentioned in other resources, such as websites and online news, and then attach them to the corresponding entities. These resources are usually harvested through automatic or semi-automatic processes. Among the generated resources, some are entity-aware, while others are purely textual name dictionaries.

4.1.1 Entity-Aware Resources

An entity-aware resource is a type of resource that has canonical entities registered along with their names and, in some cases, their context descriptions.
Google-Word-To-Concept (GW2C) [54] is a multilingual entity-aware resource that harnesses Wikipedia concepts, including named entities (NEs), and their possible names from both Wikipedia and non-Wikipedia web pages.

Resource Description: Concepts' strings (i.e. names) are harvested from:
• English Wikipedia page titles.
• English anchor texts of inter-Wikipedia links into the concept.
• Anchor texts from non-Wikipedia pages to Wikipedia concepts with an English page.

Name-to-concept mappings are stored together with a strength score, measured as the conditional probability P(Concept|name), i.e. the ratio of links into the Wikipedia concept carrying this name. Nevertheless, names in GW2C are stored without any kind of post-processing or cleaning.

Table 4.1: Sample of Google-Word-to-Concept raw data (columns: name, score, concept, flags; the Arabic name strings are unrecoverable from the source and shown as placeholders)

«Arabic name» | 0.0013 | Chuck (engineering) | W08 W09 WDB Wx:1/500
«Arabic name» | 1      | Spyware             | W08 W09 WDB Wx:8/8
«Arabic name» | 0.5    | École Militaire     | W08 W09 WDB Wx:3/3
«Arabic name» | 0.0005 | World War I         | KB W08 W09 WDB Wx:4/5357
«Arabic name» | 1      | Treaties of Rome    | KB W:4/4 W08 W09 WDB Wx:3/3
«Arabic name» | 1      | Mărginimea Sibiului | W09 W08 WDB Wx:1/1

GW2C contains 297M multilingual name-to-concept mappings. As shown in Table 4.1, the first and third columns contain the retrieved names and their Wikipedia concept URLs, respectively. The second column contains the conditional probability computed from the witness counts presented in the flags column (fourth column).

Integrating with the AIDArabic+ Resource: GW2C is created automatically, without any manual verification or post-processing. Therefore, it contains noise that should be filtered out. In order to include GW2C names in our name dictionary, we perform the following steps:

1. Detecting Arabic names using the off-the-shelf language detection tool developed by Shuyo (2010) [53] to filter out non-Arabic records. This resulted in only 736K Arabic entries out of 297M records.
2.
Filtering out ambiguous names based on the provided conditional probability scores. Excluding records with low scores filters out anchor texts such as "(Read more) اقرأ المزيد", "(Wikipedia page) صفحة الويكيبيديا" or "(more on Wikipedia) المزيد على ويكيبيديا". We used 0.01 as a lower threshold on the provided scores.
3. Name-level post-processing to remove URLs, punctuation, and common prefixes and suffixes.
4. Mapping names to AIDArabic+ entities using the Wikipedia page URLs.

4.1.2 Lexical Name Dictionaries

Lexical name dictionaries are another type of resource, containing just name variants in different languages without any notion of canonical entities. Since these dictionaries do not consider the semantic differences, the name variants can be mapped to different entities. Therefore, we use them as look-up dictionaries to translate English entity names into Arabic. We have utilized two dictionaries that have been exposed to manual verification.

Table 4.2: Sample of JRC-Names raw data (columns: id, type, language, name variant; the Arabic-script variants are unrecoverable from the source and shown as placeholders)

62 | P | u  | Javier+Solana
62 | P | u  | «Arabic variant»
62 | P | u  | «Arabic variant»
62 | P | u  | «Arabic variant»
62 | P | ar | «Arabic variant»
62 | P | sl | Javierjem+Solano

JRC-Names [55] is a multilingual resource of organisation and person names extracted from news articles and Wikipedia. In the creation of JRC-Names, manually compiled lists of language-specific rules and triggers, such as person titles, ethnic groups or modifiers, were used to extract the names of persons. In addition, a list of frequent words (e.g. club, organization, bank, etc.) was used to extract organization names. The similarity between the names extracted from news and those from Wikipedia page titles was computed to recognize name variants. Names in non-Roman script were romanized; hence, monolingual edit distance could be used as a unified similarity function. Names below the specified threshold were manually matched to the corresponding name cluster.
Finally, names that either appeared in five different news clusters, were manually validated, or were found in Wikipedia were included in the published dictionary. The dictionary has 617k multilingual name variants, with only 17k Arabic name variants. As shown in Table 4.2, variants of the same name share a unique identifier. In addition, type and partial language tags are provided with the names.

Table 4.3: Sample of CMUQ-Arabic-NET raw data (columns: English name, Arabic name, type, annotations; the Arabic names are unrecoverable from the source and shown as placeholders)

National Investors Bank | «Arabic name» | ORGANIZATION | 000 0.3 0.38 0.5
BALTIC COUNTRIES        | «Arabic name» | ORGANIZATION | 00 0.2 0.076
Yoli Adlestein          | «Arabic name» | PERSON       | 11 0.75 0.875
Nathan Byron            | «Arabic name» | PERSON       | 11 0.71 0.66

CMUQ Arabic-NET is an English-Arabic name dictionary compiled from Wikipedia and parallel English-Arabic news corpora [7]. The authors used an off-the-shelf NER system on the English side of the corpora. The NER results were projected onto the Arabic side according to the word-alignment information. Additionally, they included the titles of Wikipedia cross-language links in their dictionary. The dictionary was manually annotated to fit its targeted use. The full dictionary has 62k English-Arabic name pairs. Table 4.3 shows a sample of the dictionary. The first two columns are the English-Arabic pairs. The third column contains the type of the entity name (i.e. person or organisation). The remaining columns are annotations used for the authors' own purposes.

Including these dictionaries in EDRAK is performed as follows:
1. Pre-processing and language detection are applied to JRC-Names.
2. The English names of our entities are normalized and strictly matched against these dictionaries to obtain accurate Arabic names.
3. Only new name variants are added to the resource.

4.2 Named-Entities Translation

Up to this point, English entities that do not have any possible names in the dictionary and/or context keyphrases still form a big part of our catalog.
In addition, not all entity names have already appeared in Arabic text corpora; some, however, are prominent enough to appear in the near future. Accordingly, in this section we discuss applying machine translation to English names and keyphrases.

[Figure 4.1: General Statistical Machine Translation pipeline (a bilingual corpus is used to train the translation model and a target-language corpus to train the language model; both models feed the decoder, which maps source-language input to target-language output)]

4.2.1 General Statistical Machine Translation

Statistical Machine Translation (SMT) is the process of generating possible translations for text based on statistical models trained on bilingual parallel corpora [31]. Recently, several translation systems have been developed based on SMT, such as Moses [32], Cdec [15], Phrasal [18], and Thot [48]. The main advantage distinguishing SMT from other translation approaches, such as rule-based and example-based translation, is that it is generic enough to be used with any language pair.

Implementations of SMT mostly follow similar steps to train the models required for the decoding process (Figure 4.1). The word-alignment information of the parallel corpora is extracted using an automatic statistical alignment tool such as GIZA++ [47] or the Fast Aligner [14]. The generated word-alignment information is used to produce the translation tables/grammar. In addition, a statistical language model is generated from the target-language part of the parallel corpora. In some cases, other monolingual corpora are used to generate richer models. Later, both the translation table and the language model are used in the decoding phase to translate the input text. Usually, several translations are generated for each input sentence. The resulting translations are ranked by the accumulated probability derived from the language model and the translation table.
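The final ranking step can be illustrated with a toy decoder that scores each candidate by the product of its translation-model and language-model probabilities. The probabilities below are hypothetical; a real system like Cdec combines many more features.

```python
def rank_translations(source, translation_model, language_model):
    """Rank candidate translations of `source` by TM probability x LM probability."""
    candidates = translation_model.get(source, {})
    scored = {t: p * language_model.get(t, 1e-9) for t, p in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical model entries for the ambiguous token "north":
# translation as a common noun vs. transliteration as a name part
tm = {"north": {"شمال": 0.7, "نورث": 0.3}}
lm = {"شمال": 0.05, "نورث": 0.001}
ranked = rank_translations("north", tm, lm)
```

Note how the language model strongly favours the common-noun reading "شمال"; this is exactly the bias against entity-name translations that motivates the dedicated systems in this chapter.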
Statistical machine translation is a viable option for translating English names into Arabic. Off-the-shelf trained SMT systems such as the Google¹ or Microsoft Bing² translation services are trained on large parallel corpora. However, they are not geared towards translating NEs. While they achieve suitable translation quality on natural-language input text, their NE translation quality struggles because:

• Most existing SMTs do not handle named entities explicitly; only the language model is responsible for generating the weights of name translations [22]. They do not utilize any part-of-speech tagging or NER information.

• Entity names are domain specific. Hence, they can easily be missing from the parallel training corpus, which cannot cover all domains. SMTs will then split the name and translate each token separately, resulting in wrong translations.

• Entity names tend to appear less often than other nouns and verbs in the parallel training corpora. Consequently, their translations have lower weights in the language models than those of normal words [33, 5, 7]. For example, "North" as a geographical direction and "Green" as a color have higher weights than their name counterparts.

• SMT systems do not take the NE type into consideration, yet different entity types should not be translated in the same way. For example, in "Nolan North" (PERSON) and "North Atlantic Treaty Organization" (ORGANISATION), both Norths are proper entity names, yet the first should be transliterated while the latter should be translated.

4.2.2 Named-Entities SMT

Several research attempts focus on enhancing the quality of NE translation. Huang et al. (2004) [28] introduced the usage of phonetic and semantic similarity in order to improve NE translation. Lee (2014) [34] proposed including part-of-speech tagging information in the translation process in order to improve the translation of person names in text. Furthermore, Azab et al.
(2013) [7] developed a classification technique to decide whether to translate or transliterate named entities appearing in full English-to-Arabic text. They used a combination of token-based, semantic and contextual features, including the coarse-grained type tags (PERSON, ORGANIZATION), in the classification. Nevertheless, since our problem is focused on NEs solely, we propose creating an NE-customized translation module.

¹ https://translate.google.com
² https://www.bing.com/translator/

Table 4.4: Entity-Name SMT Training Data Size

           PER    NON-PER   ALL
Azab       28493  34116     62609
Wikipedia  33962  79699     128790
Both       62455  113815    191399

Named-Entities Training Corpus: A parallel training corpus is a key player in achieving the desired translation quality. Our proposed approach is to use training data purely designed for translating NEs. Therefore, we compile our training corpus from the NEs found in the English-Arabic cross-language inter-Wikipedia links. The intuition is that if our knowledge base knows the names of "William Hook Morley" (وليم هوك مورلي) and "Edward Said" (إدوارد سعيد) in Arabic script (which is the case), this should be sufficient for our SMT to learn the Arabic script of "Edward Morley". By using names only, we guarantee a suitable language model that gives higher weights to name translations.

Type-Aware Translation: Similar to the recent technique for translating named entities within natural-language text proposed by Azab et al. (2013) [7], we propose utilizing type information in the name translation process. We used the well-structured type information provided in YAGO3 [26]. The training data has been split into two sets, PERSON and NON-PERSON. We do not split the data on a more fine-grained type basis (e.g. ORGANIZATION vs. LOCATION) in order to maintain an adequate amount of training data [57] for each type. Table 4.4 shows the size of the training data for each system.
We trained three SMT systems: PERSON, NON-PERSON, and ALL. The third system is used as a fallback in case an entity of type T could not be translated using the corresponding system. The main advantage of the type-aware architecture is that it allows translating person names differently. In addition, for non-person entities like "Goethe Prize", we are able to translate the name using the fallback system, which learned to translate "Goethe" from the PERSON part of the data and "Prize" from the NON-PERSON part.

4.2.3 Named-Entities Light-SMT

Since our target is to translate entity names only, language models are not highly beneficial in our case. Even worse, language models may reduce the quality of translation if the name is not well represented in the target-language training data. Therefore, we propose a light-weight SMT. The intuition of our approach is that if we collect all Arabic names of the entities that have the token-to-translate in one of their English names, the corresponding Arabic token should have the highest and most distinguished occurrence count among the Arabic tokens. Therefore, applying popularity voting among the Arabic tokens of the entities containing the token-to-translate (in their English names) yields a proper translation. Consider a simplified example: given a set of parallel names that all contain "Müller" on the English side, such as ("Thomas Müller"/"توماس مولر"), ("Gerd Müller"/"غيرد مولر") etc., the Arabic translation of "Müller", namely "مولر", is going to be the most frequent token.

The Light-SMT module is built as follows:
1. The English side of the parallel English-Arabic training data is tokenized using the Stanford tokenizer and converted to lowercase.
2. The Arabic side is normalized and tokenized as discussed in Section 3.4.
Since a single English name can be written in several Arabic forms differing only in vowels, diacritics and/or Hamza, stemming is important for counting all such representations as a single candidate.
3. Ambiguous tokens such as "(Disambiguation)" in English and its corresponding Arabic word "توضيح" are considered noise and removed. Furthermore, tokens with no mapping in Arabic, such as "a", "an" and "the", as well as punctuation, are eliminated.
4. In order to follow the type-aware approach, the training data is split into PERSON and NON-PERSON. We created three indexes, one for each type and a third for the type ALL (which covers the whole parallel data combined). Each index is composed of:
(a) An inverted index from English tokens to their entities.
(b) An index from the entities to all tokens of their Arabic names, together with their normalized versions.

In the decoding (i.e. translation) phase, the entity name passes through the following steps to generate all possible translations:
1. The entity type is resolved via YAGO types.
2. The English name is tokenized and normalized.
3. The translation of each token is generated according to the entity type (Figure 4.2):
(a) The list of entities with this English token is retrieved from the inverted index.
(b) The Arabic tokens of all entities in the list are retrieved from the Arabic index.
[Figure 4.2: Single-token translation using popularity voting (for the input token "edward", the entities containing the token are retrieved from the inverted index, their Arabic tokens are collected, a vote among the normalized tokens selects the winning stem, and a second vote among its surface forms yields the top-k representations)]

(c) A popularity vote is performed on the distinct Arabic stems. The Arabic stem with the highest count is considered the proper translation of the English token. In order to achieve suitable translation accuracy, we impose two conditions: (i) the number of participating entities should be greater than five, and (ii) the winning stem should account for at least 0.3 of the number of participating entities.
(d) In order to avoid rare Arabic representations of English names written in Arabic script, as well as incorrect representations, a second popularity vote is performed among the original words contributing to the winning stem. The top two representations are chosen as possible translations of the input token.
(e) If no translation is found for the token using its type's data, these steps are repeated with the ALL model.
4. The tokens translated successfully are then joined together to generate all possible translations.

Up to this point, Light-SMT is capable of generating acceptable translation quality for person names, as they usually follow the same token order across languages. In contrast, multi-token NON-PERSON names suffer from an ordering problem.
For example, when translating "Max Planck Institute" to Arabic using Light-SMT, the result will be "ماكس بلانك معهد". Each token is correctly translated, but the word "معهد", the Arabic translation of "Institute", should appear at the beginning, followed by the rest of the name, as in "معهد ماكس بلانك". Since NEs are short and follow common patterns, we implemented a rule-based reordering approach similar to Badr et al. (2009) [8]. This approach is effective for Arabic NEs, as they are usually written in the format

(<Category> <list of genitives or proper nouns>)

For English NEs, the category (e.g. "University") either appears at the beginning (e.g. "University of Saarland") or at the end (e.g. "Saarland University"). The first case matches the Arabic order, and hence no changes are required. The latter, however, should be flipped. We learned the list of categories that require reordering from all English non-person names by considering the top thousand common tokens appearing at the end of a name. Before translating, the English name is reordered such that the category name is put at the beginning, to follow the Arabic naming convention. For example, "Goethe Prize" becomes "Prize Goethe".

The main advantage offered by Light-SMT is that it allows controlling the translation quality at the token level. In addition, it leverages the correct translation by combining all similar representations. On the other hand, Light-SMT does not capture dependencies between words. Also, it assumes a one-to-one word mapping, which is not always the case, especially for non-person names. We tried to enhance it by applying it to n-grams and then choosing the translation of the longest n-gram in the top-k as the correct one. Nevertheless, result samples did not show considerable improvement. In Section 5.2, we examine the effect of Light-SMT on the full disambiguation pipeline.
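The token-level voting at the core of Light-SMT can be sketched as follows. This is a minimal illustration with toy data and the two thresholds from step (c); the index layout, the toy normalizer and the entity identifiers are simplifying assumptions.

```python
from collections import Counter, defaultdict

def translate_token(token, en_index, ar_index, normalize,
                    min_entities=5, min_ratio=0.3, top_k=2):
    """Popularity voting: pick the Arabic stem most entities agree on,
    then return its top-k surface forms."""
    entities = en_index.get(token, set())
    if len(entities) <= min_entities:          # condition (i)
        return []
    stems = Counter()
    surface = defaultdict(Counter)             # stem -> surface-form counts
    for e in entities:
        for ar in ar_index[e]:
            stem = normalize(ar)
            stems[stem] += 1
            surface[stem][ar] += 1
    stem, count = stems.most_common(1)[0]
    if count < min_ratio * len(entities):      # condition (ii)
        return []
    # second vote among the surface forms behind the winning stem
    return [w for w, _ in surface[stem].most_common(top_k)]

norm = lambda s: s.replace("إ", "ا").replace("أ", "ا")  # toy stemmer
en_index = {"edward": {f"<E{i}>" for i in range(6)}}
ar_index = {f"<E{i}>": ["إدوارد"] for i in range(4)}
ar_index.update({"<E4>": ["ادورد"], "<E5>": ["إدوارد"]})
result = translate_token("edward", en_index, ar_index, norm)
```

With six participating entities, five of which agree on the stem "ادوارد", both thresholds are met and the dominant surface form "إدوارد" is returned.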
4.2.4 Named-Entities Full-SMT

Due to the limitations of Light-SMT, the decision was made to train an off-the-shelf SMT framework. Therefore, we adopted Cdec [15], a full-fledged SMT framework that includes a decoder, an aligner, and a learning framework. We used it in combination with the type-aware paradigm. We used the same training data, in addition to the CMUQ Arabic-NET dictionary provided by Azab et al. (2013) [7]. We train the three systems (PERSON, NON-PERSON and ALL) as follows:

1. The parallel data is split into development data (5%) and training data.
2. The English part is tokenized normally, and the Arabic side is normalized and tokenized as described in Section 3.4.

[Figure 4.3: Type-aware entity-name translation using a full SMT system (the entity type is resolved, the English name is translated by the PERSON or NON-PERSON system with OOV filtering, and the FALLBACK system handles failures, producing the top-3 Arabic entity names)]

3. Symmetric word-alignment information is extracted from the English-Arabic training pairs using the Fast Aligner [14].
4. The word-alignment information is used to generate the translation grammar using the Cdec grammar extractor. Unique grammars are kept in the synchronous context-free grammar (SCFG) format.
5. The normalized Arabic part of the corpus is used to train the language model using the KenLM Language Model Toolkit [23] provided with the Cdec framework.
6. The translation parameters are tuned using the MIRA tuning tool [11] adopted by Cdec.

In the translation phase, the translation grammar, the tuned weights, and the language model are loaded into the Cdec decoder. As we follow the type-aware framework (Figure 4.3), we start by resolving the type of the translated entity. Then, the English name is translated according to the entity type. We configure the Cdec decoder to retrieve the top-5 translations. Translations with one or more out-of-vocabulary words are excluded. Finally, only the top-3 translations are taken into consideration.
In the case of failing to get at least one full translation, the fallback translation component is used to translate the name following the same protocol.

Using a full-fledged SMT system solves the challenges of handling word dependencies and reordering. Since SMT is language-independent, our enrichment approach inherits this feature, allowing it to be easily adopted for other languages. On the other hand, the language model may decrease the weights of some correct translations, as explained previously. Finally, it is hard to control translations on the single-token level.

4.3 Transliteration

Transliteration is the process of converting a name from the script of a language L1 to the script of another language L2 while preserving the pronunciation as much as possible. Most SMT systems use transliteration as a fallback protocol for name translation failures.

Named-entity translation allowed translating a huge amount of English names as well as contextual keyphrases. However, there are still names for which named-entity translation fails to generate Arabic names, because they are not represented in the parallel training data. Moreover, translation usually generates only the prominent Arabic representation of a name. Thus, this section discusses how transliteration is adopted to enrich our resource.

4.3.1 Transliteration Approaches

English-to-Arabic transliteration can be performed using several approaches [30]. The simplest is the rule-based or grapheme-based approach, where each set of characters in the source language is mapped to one or more sets of characters in the target language. Another approach is phoneme-based mapping, which considers similarity on the sound and phonetic level. Words are represented as phonetic sequences and transformed to the closest or most similar phonetics in the target language; finally, the target-language phonetics are transformed to characters [3, 56].
A widely used approach is statistical machine translation on the character level, where normal SMT systems are used but are trained on parallel corpora of segmented words [1, 44, 13, 2].

There are several transliteration services from English to Arabic. The most notable are:

Google Transliterate (https://developers.google.com/transliterate/) is a service for multilingual transliteration. However, since 2011 it has been integrated into the Google Translate service, and the translation service does not guarantee generating a transliteration rather than a translation. For example, “Green”, as a name without any context, will be translated.

Yamli (http://www.yamli.com/; the name is the Arabic for “[he] dictates”) is a service that aims at transforming romanized Arabic text into Arabic characters (i.e. backward transliteration). Its engine is well trained on general romanized Arabic text as well as person names. However, it is designed as an interactive service that suggests several transliterations and expects the user to choose the correct one (i.e. it is recall-oriented).

3Arrib [2] is a transliteration service targeting romanized dialectal Arabic (e.g. chat in the Egyptian dialect). Non-formal Arabic text usually uses digits as extensions of the English alphabet to cover Arabic phonemes that have no English counterpart. Accordingly, 3Arrib is trained on SMS/chat data [9], which is not suitable for transliterating formally written English names. Furthermore, it detects English names and concepts and excludes them from the transliteration process [16].

These solutions are not designed to deal with names only, and their output accuracy is not satisfactory for our target.

Table 4.5: Sample of character-level training data. English names such as “al-samman”, “albert einstein”, and “john” are segmented into space-separated characters (with word boundaries replaced by S_SPACE) and paired with their similarly segmented Arabic counterparts (word boundaries replaced by T_SPACE).
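To illustrate the grapheme-based approach from the list above, a toy character-mapping transliterator might look like the following. The rule table is a small illustrative sample, not the rule set of any of the systems discussed; real systems use much larger, context-sensitive mappings.

```python
# Toy grapheme-based English->Arabic transliterator (illustrative rules only).
RULES = [  # digraphs first, so longest match wins
    ("sh", "ش"), ("th", "ث"), ("kh", "خ"),
    ("a", "ا"), ("b", "ب"), ("d", "د"), ("m", "م"), ("n", "ن"),
    ("r", "ر"), ("s", "س"), ("t", "ت"), ("l", "ل"), ("i", "ي"), ("u", "و"),
]

def grapheme_transliterate(word: str) -> str:
    """Map source character groups to target characters, longest match first."""
    out, i = [], 0
    word = word.lower()
    while i < len(word):
        for src, tgt in RULES:
            if word.startswith(src, i):
                out.append(tgt)
                i += len(src)
                break
        else:
            i += 1  # no rule for this character: skip it (vowels are often dropped)
    return "".join(out)
```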
4.3.2 Character-Level Statistical Machine Translation

In order to create a transliteration system geared towards names, we train our SMT system on the character level. For training, we consider the PERSON data in Table 4.4. The PERSON parallel data were used to train the Cdec SMT system as follows:

1. Spaces are replaced with the special symbols S_SPACE (source side) and T_SPACE (target side).
2. The characters of each word are separated with spaces, as shown in Table 4.5. Thus, each name record is treated as a phrase and its characters as its tokens.
3. We follow the same training steps with Cdec SMT as in Subsection 4.2.4.

In the decoding phase, the input data is prepared with the same segmentation as in training. Results with out-of-vocabulary words (i.e. English characters) are excluded. Then, the Arabic transliterations are reconstructed by reversing the tokenization steps. Finally, all names with at least one remaining English character are excluded as failures.

We applied transliteration to given names and family names, since such names are always transliterated rather than translated. For the sake of achieving high-quality results, we did not apply transliteration to any other type.

4.4 Arabic Names Splitting

AIDArabic uses full matching to retrieve the candidate entities, for the sake of an accurate disambiguation process. Nevertheless, person entities are not always mentioned with their full names; only parts of their names may appear (e.g. the given name and the family name).

Type         In English script       In Arabic    Meaning
Prefixes     Abd                     عبد          Worshiper of
             Abo                     أبو          Father of
             Umm                     أم           Mother of
             Al                      آل           Family of
             Gad                     جاد
Connectors   bin/ben                 بن / ابن     Son of
             bent                    بنت          Daughter of
Suffixes     Allah, Ellah, Lellah    الله         The God
             Al-Dawla                الدولة       (the) State
             Al-Deen                 الدين        (the) Religion

Table 4.6: Common prefixes, suffixes, and name connectors of Arabic names, with their meanings

Therefore, splitting person names allows covering partial mentions.
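Using transliterated forms of the affixes in Table 4.6, this splitting can be roughly sketched as follows. The affix sets are illustrative samples; the actual system operates on normalized Arabic script, not romanized names.

```python
# Simplified person-name splitter (transliterated affixes from Table 4.6;
# the real system works on normalized Arabic script).
PREFIXES = {"abd", "abo", "umm", "al", "gad"}      # glued to the following token
CONNECTORS = {"bin", "ben", "ibn", "bent"}         # glued to the following part
SUFFIXES = {"allah", "ellah", "lellah", "al-dawla", "al-deen"}  # glued to previous

def split_person_name(name: str) -> dict:
    tokens = name.lower().split()
    parts, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in (PREFIXES | CONNECTORS) and i + 1 < len(tokens):
            parts.append(tok + " " + tokens[i + 1])  # e.g. "abd alkareem"
            i += 2
        elif tok in SUFFIXES and parts:
            parts[-1] += " " + tok                   # e.g. "noor al-deen"
            i += 1
        else:
            parts.append(tok)
            i += 1
    if len(parts) == 2:
        return {"given": parts[0], "last": parts[1]}
    if len(parts) >= 3:
        return {"given": parts[0], "middle": " ".join(parts[1:-1]), "last": parts[-1]}
    return {"given": parts[0]} if parts else {}
```

On the thesis examples, this yields "Mohamed" / "Salah" for a two-part name and "Salman" / "bin Abdulaziz" / "Al Saud" for a longer one.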
Unlike most Latin names, Arabic names are not just composed of given and last names. Hence, they require different splitting rules. Person names are extracted from the rdfs:label relation for entities of type PERSON in YAGO. After normalizing the names (removing Tatweel, normalizing Alif, and removing diacritics), the following splitting rules are applied:

1. Arabic name prefixes (Abd, Abo, Umm, Al, Gad; see Table 4.6) are combined with the following token as one part (e.g. Umm Kulthum / أم كلثوم, Abd-Alkareem / عبد الكريم).
2. Name connectors such as bin, ibn, and bent, which are common in names originating in the Gulf countries as well as in old Arabic names, are considered splitters and are attached to the part that follows them.
3. Common Arabic name suffixes (Allah, Ellah, Lellah, Al-Dawla, Al-Deen; see Table 4.6) are combined with the previous token as one part (e.g. Noor Al-Deen / نور الدين).
4. Full names composed of two parts are split into <Given Name> <Last Name>. For example, Mohamed Salah / محمد صلاح is divided into Mohamed as the given name and Salah as the last name.
5. Names of three or more parts are split into <Given Name> <Middle Name> <Last Name>. For example, Salman bin Abdulaziz Al Saud / سلمان بن عبد العزيز آل سعود is split into Salman as the given name, bin Abdulaziz as the middle name, and Al Saud as the family name.

Finally, the resulting name partitions (given name, middle name, and last name) are added to the enriched resource after applying the required normalization and word segmentation.

4.5 EDRAK as a Standalone Resource

In order to help advance Arabic NLP research, we publicly released EDRAK to the research community as an entity-centric standalone resource.

4.5.1 Use-cases

EDRAK is not only useful as a data schema for NED; it is also a valuable asset for many Natural Language Processing (NLP) and Information Retrieval (IR) tasks.
For example, EDRAK contains a comprehensive dictionary of different potential Arabic names for entities, gathered from both the English and the Arabic Wikipedias. The EDRAK dictionary can be used for building an Arabic dictionary-based NER system [12, 52]. In addition to the name dictionary, the resource contains a large catalog of Arabic textual context for entities in the form of keyphrases. These can be used to estimate entity-entity semantic relatedness scores as in [25]. Entities in EDRAK are classified under the type hierarchy of YAGO [26]. Together with the keyphrases, EDRAK can be used to build an entity summarization system as in [58], or to build a fine-grained semantic type classifier for named entities as in [64, 65].

4.5.2 Technical details

EDRAK is available in the form of an SQL dump and can be downloaded from the Downloads section of the AIDA project page: http://www.mpi-inf.mpg.de/yago-naga/aida/. We followed the same schema used in the original AIDA framework [27] for data storage. Highlights of the SQL dump are shown in Table 4.7. EDRAK's comprehensive entity catalog is stored in the SQL table entity_ids. The potential Arabic names of each entity are stored in the SQL table dictionary. In addition, each entity is assigned a set of Arabic contextual keyphrases, stored in the SQL table entity_keyphrases. It is worth noting that the sources of dictionary entries as well as of entity keyphrases are kept in the schema (YAGO3_LABEL, REDIRECT, GIVEN_NAME, or FAMILY_NAME). Furthermore, generated data (by translation or transliteration) is differentiated from the original Arabic data extracted directly from the Arabic Wikipedia. Different generation techniques

entity_ids (major columns: id, entity): Lists all entities together with their numerical IDs.
dictionary (major columns: mention, entity, source): Contains information about the candidate entities for a name.
It keeps track of the source of each entry to allow application-specific filtering.
entity_keyphrases (major columns: entity, keyphrase, source, weight): Holds the characteristic description of entities in the form of keyphrases. The source of each keyphrase is kept for application-specific filtering.
entity_types (major columns: entity, types[]): Stores the YAGO semantic types to which each entity belongs.
entity_rank (major columns: entity, rank): Ranks all entities based on the number of incoming links in both the English and the Arabic Wikipedia. This can be used as a measure of entity prominence.

Table 4.7: Main SQL tables in EDRAK

and data sources entail different data quality. Therefore, keeping the data sources enables downstream applications to filter the data for a precision-recall trade-off.

Chapter 5

Evaluation and Statistics

This chapter discusses the evaluation performed on the size and quality of EDRAK (Section 5.1) and on its effect on the quality of AIDArabic+ results (Section 5.2).

5.1 EDRAK Evaluation

It is important to evaluate the effect of the enrichment approaches on the quality and size of the generated resource, as both directly affect the overall quality and performance of the NED system [62]. Statistics about the EDRAK resource are given in Subsection 5.1.1. In addition, the manual assessment performed by native Arabic speakers is discussed in Subsection 5.1.3.

5.1.1 Statistics

EDRAK contains around 2.4M entities (each with at least one name) classified under the YAGO type hierarchy. With this size, EDRAK is an order of magnitude bigger than the original AIDArabic resource, which contains 143K entities, since the latter is constrained by the amount of Arabic names and contextual keyphrases available in the Arabic Wikipedia. Table 5.1 shows a comparison between AIDArabic and EDRAK in terms of the name and contextual-keyphrase dictionaries. The name dictionary grew from less than 0.5M entity-name pairs to 21M pairs. The number of unique names is now 20 times that of AIDArabic.
In addition, the average number of names per entity increased from 2.45 to 7.75.

                          AIDArabic    EDRAK
Unique Names              333,017      9,354,875
Entities with Names       143,394      2,400,340
Entity-Name Pairs         495,245      21,669,568
Unique Keyphrases         885,970      7,918,219
Entity-Keyphrase Pairs    5,574,375    211,681,910

Table 5.1: AIDArabic vs. EDRAK: sizes of the name and contextual-keyphrase dictionaries

Technique          # New Entities   # Dictionary Entries
Google W2C         47,406           241,104
CMUQ-Arabic-NET    19,706           23,338
JRC                1,664            4,148
Translation        3,549,248        11,222,876
Transliteration    3,340,921        9,578,658
Name Splitting     0                94,782

Table 5.2: Number of new entities and entity-name pairs per generation technique

Semantic Type    AIDArabic    EDRAK
PERSON           47,483       1,220,032
EVENT            11,065       199,846
LOCATION         34,451       360,108
ORGANIZATION     10,212       196,305
ARTIFACT         15,650       359,071

Table 5.3: Number of entities per type in AIDArabic vs. EDRAK

The contributions of each generation technique are summarized in Table 5.2. The numbers indicate that automatic generation (i.e. translation and transliteration) contributes far more entries than the external name dictionaries. In addition, translation delivers more entries than transliteration, since it is applied to all types of entities, whereas transliteration is applied to person names only. Furthermore, GW2C did not introduce many new entities, because it is not common for a mention in an Arabic article to be manually linked to an English Wikipedia page. As for CMUQ and JRC, both are collected from news-wire text; hence, they only added names for prominent entities.

Table 5.3 lists the number of entities per high-level semantic type for both the AIDArabic entity catalog and EDRAK. The highest increase is observed for type PERSON, as a result of applying both translation and transliteration. Similarly, the contextual-keyphrase dictionary grew 42-fold, as shown in Table 5.1.
Although we applied the generation techniques to the categories and the inlink titles only, the expansion of the contextual-keyphrase dictionary was expected to be higher than that of the name dictionary.

Source                     AIDArabic    EDRAK
citationTitle              67,031       67,031
linkAnchor                 2,469,923    2,469,923
inlinkTitle                2,734,530    5,216,657
wikipediaCategory          302,891      4,029,483
wikipediaCategory_TRANS                 13,842,770
inlinkTitle_TRANS                       186,056,046

Table 5.4: Contextual-keyphrase dictionary, AIDArabic vs. EDRAK

The new contextual keyphrases originate from: (i) new entities that could be covered using the manual English-Arabic inter-Wikipedia links [67], and (ii) Arabic keyphrases automatically generated using translation and transliteration. This explains the expansion of the original sources, inlinkTitle and wikipediaCategory, shown in Table 5.4. It is also worth noting that wikipediaCategory entries were only translated, while inlinkTitle entries were both translated and transliterated, according to their entity type.

5.1.2 Data Example

Many prominent entities do not exist in the Arabic Wikipedia, and hence do not appear in any Wikipedia-based resource. For example, Christian Schmidt, the current German Federal Minister of Food and Agriculture, and Edward W. Morley, a famous American scientist, are both missing from the Arabic Wikipedia (as of June 2015). EDRAK's data enrichment techniques managed to automatically generate reasonable potential names as well as contextual keyphrases for both. Table 5.5 lists a snippet of what EDRAK knows about these two entities.

5.1.3 Manual Assessment

The target of the manual assessment is to quantify the quality of the names and contextual keyphrases generated using the different methods.

Setup

We evaluated all aspects of data generation in EDRAK.
We included entity names belonging to the FirstName and LastName relations, Wikipedia redirects, and the rdfs:label relation, which carries names extracted from Wikipedia page titles, disambiguation pages, and anchor texts.

Table 5.5: Examples of entities in EDRAK (Christian Schmidt and Edward W. Morley) with their generated Arabic names and keyphrases

The data for evaluation was generated using the full SMT system trained on named entities only.
In order to examine the effect of considering semantic types in translation, we implemented two approaches: the first is the type-aware SMT described in Subsection 4.2.4; the second uses one universal SMT system for translating all names (referred to as Combined). For each name, the top-3 successful translations or transliterations were generated, if they exist. The data assessment experiment covered all types of data against both translation approaches. Additionally, we conducted experiments to assess the quality of translating Wikipedia categories, using a system trained on parallel English-Arabic categories. Finally, we evaluated the performance of transliteration when applied to English person names. We randomly sampled the generated data and conducted an online experiment to manually assess its quality.

Task

We asked a group of native Arabic speakers to manually judge the correctness of the generated data through our web-based tool (shown in Appendix A). Each participant was presented with around 150 English names, together with the top-3 potential Arabic translations or transliterations proposed by cdec (or fewer, if cdec proposed fewer than three). Participants were asked to pick all correct Arabic names, or None if all translations were incorrect. Participants also had the option to skip a name (by choosing the "Don't know" option) if they needed to. The experiment was designed such that each English name was evaluated by three different persons.

Results and Discussion

In total, we had 55 participants, who evaluated 1646 English surface forms that were assigned 4463 potential Arabic translations. Each English name was annotated by at least three participants with either one of the proposed translations or None. Participants were native Arabic speakers based in the USA, Canada, Europe, KSA, and Egypt. Their homelands span Egypt, Jordan, and Palestine. The manual assessment results are shown in Table 5.6.
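One plausible reading of the Count@Top-K and Prec@Top-K figures in Table 5.6 (counts and precision micro-averaged over all candidate translations retained up to rank k, judged against the annotators' choices) can be sketched as follows; this is an illustrative formulation, not the exact evaluation script.

```python
# Sketch of Count@Top-K / Prec@Top-K: candidates[name] is the ranked list of
# proposed translations, judged_correct[name] the set accepted by annotators.
def count_and_prec_at_k(candidates, judged_correct, k):
    total = hits = 0
    for name, ranked in candidates.items():
        top_k = ranked[:k]  # cdec may have proposed fewer than k candidates
        total += len(top_k)
        hits += sum(1 for t in top_k if t in judged_correct.get(name, set()))
    prec = (100.0 * hits / total) if total else 0.0
    return total, prec
```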
Evaluation results are given per entity type, translation approach, and name source. Since cdec did not return three candidate translations for every name, we computed the total number of translations obtained when considering up to the top one, two, or three results; for each case, we computed the corresponding precision based on the participants' annotations.

                                           Count@Top-K         Prec@Top-K
              Approach         Source        1     2     3       1       2       3
Persons       Type-Aware       First Name    8    10    12    87.50   80.00   66.67
                               Last Name    14    17    19    92.86   88.24   78.95
                               rdfs:label  156   288   383    79.49   63.19   57.44
                               redirects   113   210   285    69.91   57.62   50.18
              Combined         First Name    7    10    12   100.00   90.00   75.00
                               Last Name    16    22    25    87.50   81.82   76.00
                               rdfs:label  160   307   421    81.25   64.82   57.24
                               redirects   108   210   288    67.59   60.00   54.51
              Transliteration  First Name   26    52    76    80.77   61.54   56.58
                               Last Name    94   188   279    70.21   63.83   55.91
Non-Persons   Type-Aware       rdfs:label  269   519   742    53.16   43.16   36.66
                               redirects   191   370   526    45.55   34.86   30.99
              Combined         rdfs:label  273   533   770    49.82   41.84   36.75
                               redirects   195   378   539    46.67   39.42   34.69
Categories                     Categories  118   234   340    67.80   52.99   46.18

Table 5.6: Assessment results of applying SMT for translating entity names and Wikipedia category names

The data was randomly sampled from all of the generated data, such that the size of each test set reflects the distribution of the sources in the original data. For example, names originating from the rdfs:label relation are an order of magnitude more numerous than those coming from the FirstName and LastName relations.

The quality of the generated data varies according to the entity type, the name source, and the generation technique. The quality of translated Wikipedia redirects is consistently lower than that of the other sources. This is due to the nature of redirects: they are not necessarily just another variation of the entity name. In addition, redirects tend to be longer strings, and hence are more error-prone than rdfs:labels.
For example, “European Union common passport design”, which redirects to the entity Passports of the European Union, could not be correctly translated: each token was translated correctly, but the final token order was wrong. Evaluators were asked to annotate such examples as wrong. However, such ordering problems are less critical for applications that incorporate partial-matching techniques. Similarly, categories tend to be relatively longer than entity names, and hence exhibit the same problems as redirects.

Although the number of evaluated FirstName and LastName data points is small, the assessment results are as expected: translating a one-token name is a relatively easy task. In addition, cdec returned only one or two translations for the majority of these names, as shown in Table 5.6. The results also show that the type-aware translation system does not necessarily improve results; one universal system can deliver comparable results in most cases.

Person-name transliteration unexpectedly achieved lower quality than translation. This is a result of the fact that names are pronounced differently across countries. For example, a USA-based annotator expects “Friedrich” to be written “فريدريك”, while a Germany-based one expects it to be written “فريدريش”. Similarly, only Germany-based participants know that the person name “Johannes” should be written “يوهانس”, not “جوهانس”. We attempted to mitigate this problem by inviting Arabic speakers located in different regions around the globe. Finally, inter-annotator agreement, measured using Fleiss' kappa, was 0.484, indicating moderate agreement.

5.2 AIDArabic+ Evaluation

In this section, we discuss the experiments conducted to evaluate the effect of the enriched data resource (EDRAK) and of the Arabic-specific tokenization and normalization on AIDArabic+ results.
5.2.1 Arabic Corpus Creation

The first problem we faced is the lack of an annotated Arabic benchmark. Creating a well-annotated corpus manually is a time-consuming task. Therefore, we needed to create our benchmark automatically. The main idea is to use a parallel corpus and annotate the Arabic part using automatically generated evidence from the English counterpart. This approach was also followed by [7] to collect named entities from parallel news, and by [37] to create a persons-only multilingual annotated corpus.

We used LDC2014T05 [35], a manually translated and aligned English-Arabic parallel corpus of news and web text, developed mainly for SMT development. The corpus's Arabic documents are tokenized using MADA+TOKAN [19, 49] and aligned with the tokenized English translation on the word level (many-to-many mapping). We favored a manually word-aligned corpus over automatic alignment tools such as GIZA++ [47] or Fast Aligner [14] to guarantee better projection quality.

Type        #Docs   #Uniq. Entities   #Mentions   #Non-null Mentions
News-wire   702     2009              18,240      14,413
Web         74      338               2055        1385

Table 5.7: LDC2014T05 annotated benchmark statistics

We started by applying AIDA [27, 66], a state-of-the-art NED system, together with named entity recognition, on the tokenized English side. The English AIDA disambiguation results were projected onto the tokenized Arabic side as follows:

1. All English mentions (with their entity mappings) were projected onto the Arabic tokens using the word-alignment information.
2. Tokens marked as GLUE at the boundaries of Arabic mentions, such as attached prepositions (e.g. the conjunction “و” in “ومصر”/“and Egypt”) and attached pronouns at the end, are removed from the mention string.
3. Overlapping Arabic mentions, which result from the nature of the translation, were combined.
4. Mentions were filtered such that:
(a) Arabic mentions mapped to two different entities are excluded.
(b) Long Arabic mentions are also excluded.

The Arabic documents and the produced ground truth were exported in the CoNLL dataset format. After excluding all documents with alignment problems, our Arabic corpus contains a total of 776 documents with 15,798 non-null mentions. Table 5.7 shows the details of the annotated data.

5.2.2 Experiment Setup

Systems Setup

For testing, we built AIDArabic+ including the new resource (EDRAK) and the Arabic-specific pre-processing component. We evaluated two data generation approaches: (i) using Yamli for transliteration and the type-aware Light-SMT proposed in Subsection 4.2.3; and (ii) using type-aware translation and transliteration with the full SMT framework. In both, we used the external dictionaries introduced in Section 4.1. We tested both AIDArabic+ configurations against the AIDArabic [67] and Babelfy [43] NED systems. To the best of our knowledge, there are no other available systems supporting NED on Arabic input.

Dataset    System                          Mention Prec.   Document Prec.   Mapped to Entity
LDC news   AIDArabic+ (Full-SMT)           73.23           71.34            94.69
           AIDArabic+ (Yamli & L-SMT)      70.83           68.94            92.73
           AIDArabic                       69.07           67.26            87.19
           Babelfy (Full Matching)         30.32           31.16            39.75
           Babelfy (Partial Matching)      25.24           25.84            39.48
LDC web    AIDArabic+ (Full-SMT)           68.16           60.10            93.86
           AIDArabic+ (Yamli & L-SMT)      66.06           56.86            92.13
           AIDArabic                       62.02           52.48            85.56
           Babelfy (Full Matching)         22.33           21.13            38.62
           Babelfy (Partial Matching)      20.66           19.52            35.52

Table 5.8: Disambiguation results for AIDArabic+ vs. AIDArabic vs. Babelfy

For all AIDA-based systems, we used YAGO3 as our back-end knowledge base, built from the English Wikipedia dump of 12 Jan 2015 combined with the Arabic Wikipedia dump of 18 Dec 2014. The same configuration as in the original AIDA local-similarity technique [27] was used. For Babelfy, we used their web service (http://babelfy.org/guide), version 1.0. It offers two modes: named-entity full matching and partial matching. We ran both using a predefined set of mentions.
For a fair comparison, we limited their candidate space to Wikipedia. We resolved the corpus ground truth from YAGO3 to BabelNet [46] through the BabelNet web service getSynsetIdsFromWikipediaTitle.

Evaluation

We evaluated against mentions with non-null ground-truth annotations. For a fair comparison, null annotations returned by any system were considered wrong annotations. We computed both the mention precision and the document-averaged precision. Precision is computed as the number of correct annotations over the number of all annotations returned by the system.

5.2.3 Results and Discussion

The results of our experiments are shown in Table 5.8. AIDArabic+ consistently delivered better results than the competitors under test. Both versions of AIDArabic+ mapped above 92% of the mentions to non-null entities. AIDArabic+ built with the full SMT achieved better precision than the Yamli & Light-SMT build, due to the better quality of the generated dictionaries. While the enhancement in precision of the latter over AIDArabic is less than 1%, the full-SMT version achieved a 4% increase. Nevertheless, on the news corpus our comprehensive KB could not shine, since most of the entities there are prominent enough to appear in the Arabic Wikipedia. On the other hand, since entities in the web corpus are less prominent than those in the news, the enriched KB showed better performance on the web documents: AIDArabic+ achieved an 8% increase in document precision and a 6% increase in mention precision. Babelfy, with both full and partial matching, achieved less than 35% on both the news-wire and the web corpora. Babelfy's back-end resource does not apply entity-name translation [46], which explains its poor performance.

Sampling the results shows that the improvements in AIDArabic+ result from the following:

• New Entities: EDRAK covered new entities that were not covered in the AIDArabic schema, by introducing at least one potential name for each.
For example, generated Arabic names were linked to the entities “Cotonou Agreement” and the “Sun-Sentinel” newspaper, although neither has an Arabic Wikipedia page; thus, mentions of both were correctly disambiguated.

• Name Variants: Some entities already existed in the Arabic Wikipedia together with their names; however, some English names have several potential written forms in Arabic. Transliteration was able to produce such potential Arabic name variants. For instance, the Arabic Wikipedia page of the Nobel prize winner José Saramago lists one Arabic spelling of his name, while our news corpus used a different transliteration of “José”. Our system learned both forms and correctly disambiguated that mention.

• New Names: Similarly, some entities lack several prominent Arabic name aliases. Translation and the external dictionaries were able to expand the name dictionary with such names. For example, the entity United States Department of State may be referred to as “الخارجية”, a name that did not exist in the Arabic Wikipedia but was translated from the English redirect “Department of State”.

Chapter 6

Conclusion and Outlook

6.1 Conclusion

In this thesis, we discussed effectively adapting Named Entity Disambiguation to Arabic text. AIDArabic was the first attempt to enable NED on Arabic. Nevertheless, it exhibited low recall, due to the sparsity of structured Arabic resources and to complex Arabic-specific language features. We introduced AIDArabic+ to enhance NED on Arabic input by utilizing a rich data schema and a customized Arabic pre-processing component.

In order to overcome the data sparsity of structured Arabic resources, we introduced EDRAK as the back-end schema for AIDArabic+. EDRAK is an entity-centric Arabic resource that contains around 2.4M entities, with their potential Arabic names, contextual keyphrases, and semantic types.
Data in EDRAK has been extracted from the Arabic Wikipedia and other available resources, such as GW2C and name dictionaries. In addition, we enriched EDRAK with automatically generated Arabic data based on the English Wikipedia. For the sake of accurate data generation, we developed type-aware named-entity translation, utilizing the fully fledged SMT framework Cdec and a parallel corpus of entity names. Furthermore, we developed a person-name transliteration module to generate possible variants of person names. The generated data has been manually assessed by a group of native Arabic speakers. We made EDRAK publicly available as a standalone resource to help advance research on the Arabic language.

Due to the morphological nature of Arabic, we integrated an Arabic pre-processing module into the AIDArabic+ architecture to correctly tokenize and normalize the Arabic input. Arabic-customized pre-processing allowed better recall and precision for name and context matching. We used the Stanford Arabic Segmenter to perform the required morphological analysis and tokenization.

Finally, in order to evaluate the effect of the proposed enhancements to AIDArabic, we utilized a parallel word-aligned English-Arabic corpus (LDC2014T05) to create an automatically annotated Arabic corpus. AIDA, a state-of-the-art NED system, was used to generate annotations on the English side; the annotations were then projected onto the Arabic side. AIDArabic+ was able to resolve 94% of the mentions in the news-wire corpus to non-null entities, compared to 87% for the original AIDArabic. This expansion in coverage was achieved with 73% mention precision, which is 4% higher than the precision of AIDArabic and far better than that of Babelfy. Also, for the web articles, the non-null mapping increased from 85.6% to 93.9%, with a mention precision of 68% (a 6% increase over AIDArabic).
This indicates that our approach allows capturing more information about non-prominent entities.

6.2 Outlook

There is still room for enhancing Named Entity Disambiguation on Arabic. The data schema can be further enriched. Anchor texts have not been translated, in order to preserve the accuracy of the contextual keyphrases. Developing a translation module with proper training data for anchor texts could enrich the keyphrase dictionary and hence achieve better precision. AIDArabic+ used entities extracted from the English and Arabic YAGO3; no other language was included. Evidence from Wikipedia editions in languages other than English can be harnessed to enrich the EDRAK entity repository and dictionaries. For example, more entities can be captured from the German Wikipedia. Also, languages written in the Arabic script, such as Persian and Urdu, can be further processed to provide a large number of name entries, especially for entities of type PERSON.

NED can be used in different applications. One application built on top of the AIDA NED system is STICS [24]. STICS offers a web interface to search and explore news articles using canonical entities instead of plain text search. AIDArabic+ can be adapted as the NED engine of STICS to support Arabic articles, which will introduce many new use cases and challenges.

Bibliography

[1] Nasreen AbdulJaleel and Leah S. Larkey. Statistical transliteration for English-Arabic cross language information retrieval. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, CIKM '03, pages 139–146, New York, NY, USA, 2003. ACM.
[2] Mohamed Al-Badrashiny, Ramy Eskander, Nizar Habash, and Owen Rambow. Automatic transliteration of romanized dialectal Arabic. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, CoNLL 2014, Baltimore, Maryland, USA, June 26-27, 2014, pages 30–38, 2014.
[3] Yaser Al-Onaizan and Kevin Knight. Machine transliteration of names in Arabic text.
In Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages, SEMITIC '02, pages 1–13, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.
[4] Yaser Al-Onaizan and Kevin Knight. Translating named entities using monolingual and bilingual resources. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL '02, pages 400–408, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.
[5] Yaser Al-Onaizan and Kevin Knight. Translating named entities using monolingual and bilingual resources. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL '02, pages 400–408, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.
[6] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, and Zachary Ives. DBpedia: A nucleus for a web of open data. In 6th Intl Semantic Web Conference, Busan, Korea, pages 11–15. Springer, 2007.
[7] Mahmoud Azab, Houda Bouamor, Behrang Mohit, and Kemal Oflazer. Dudley North visits North London: Learning when to transliterate to Arabic. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 439–444, Atlanta, Georgia, June 2013. Association for Computational Linguistics.
[8] Ibrahim Badr, Rabih Zbib, and James Glass. Syntactic phrase reordering for English-to-Arabic statistical machine translation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL '09, pages 86–93, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
[9] Ann Bies, Zhiyi Song, Mohamed Maamouri, Stephen Grimes, Haejoong Lee, Jonathan Wright, Stephanie Strassel, Nizar Habash, Ramy Eskander, and Owen Rambow. Transliteration of Arabizi into Arabic orthography: Developing a parallel annotated Arabizi-Arabic script SMS/chat corpus.
ANLP 2014, page 93, 2014.
[10] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1247–1250, New York, NY, USA, 2008. ACM.
[11] David Chiang. Hope and fear for discriminative training of statistical translation models. J. Mach. Learn. Res., 13(1):1159–1187, April 2012.
[12] Kareem Darwish. Named entity recognition using cross-lingual resources: Arabic as an example. In ACL (1), pages 1558–1567. The Association for Computer Linguistics, 2013.
[13] Chris Irwin Davis. Tajik-Farsi Persian transliteration using statistical machine translation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, May 23-25, 2012, pages 3988–3995, 2012.
[14] Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM Model 2. In NAACL/HLT 2013, pages 644–648, 2013.
[15] Chris Dyer, Adam Lopez, Juri Ganitkevitch, Jonathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proceedings of the Association for Computational Linguistics (ACL), 2010.
[16] Ramy Eskander, Mohamed Al-Badrashiny, Nizar Habash, and Owen Rambow. Foreign words and the automatic processing of Arabic social media text written in Roman script. EMNLP 2014, page 1, 2014.
[17] Paolo Ferragina and Ugo Scaiella. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 1625–1628, New York, NY, USA, 2010. ACM.
[18] Spence Green, Daniel Cer, and Christopher D. Manning.
Phrasal: A toolkit for new directions in statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014.
[19] Nizar Habash, Owen Rambow, and Ryan Roth. MADA+TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), Cairo, Egypt, pages 102–109, 2009.
[20] Ben Hachey, Will Radford, Joel Nothman, Matthew Honnibal, and James R. Curran. Evaluating entity linking with Wikipedia. Artif. Intell., 194:130–150, January 2013.
[21] Ondrej Hálek, Rudolf Rosa, Ales Tamchyna, and Ondrej Bojar. Named entities from Wikipedia for machine translation. In ITAT, pages 23–30. Citeseer, 2011.
[22] Ondrej Hálek, Rudolf Rosa, Ales Tamchyna, and Ondrej Bojar. Named entities from Wikipedia for machine translation. In ITAT, pages 23–30. Citeseer, 2011.
[23] Kenneth Heafield. KenLM: Faster and smaller language model queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, United Kingdom, July 2011.
[24] Johannes Hoffart, Dragan Milchevski, and Gerhard Weikum. STICS: Searching with strings, things, and cats. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '14, pages 1247–1248, New York, NY, USA, 2014. ACM.
[25] Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. KORE: Keyphrase Overlap Relatedness for Entity Disambiguation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM 2012, Hawaii, USA, pages 545–554, 2012.
[26] Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, January 2013.
[27] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 782–792, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[28] Fei Huang, Stephan Vogel, and Alex Waibel. Improving named entity translation combining phonetic and semantic similarities. In HLT-NAACL, volume 2004, pages 281–288, 2004.
[29] Heng Ji, H. T. Dang, J. Nothman, and B. Hachey. Overview of TAC-KBP2014 entity discovery and linking tasks. In Proc. Text Analysis Conference (TAC 2014), 2014.
[30] Sarvnaz Karimi, Falk Scholer, and Andrew Turpin. Machine transliteration survey. ACM Comput. Surv., 43(3):17:1–17:46, April 2011.
[31] Philipp Koehn. Statistical Machine Translation. Cambridge University Press, 2010.
[32] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180, Stroudsburg, PA, USA, 2007. Association for Computational Linguistics.
[33] Young-Suk Lee. Confusion network for Arabic name disambiguation and transliteration in statistical machine translation. In COLING 2014, 25th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23-29, 2014, Dublin, Ireland, pages 433–443, 2014.
[34] Young-Suk Lee. Confusion network for Arabic name disambiguation and transliteration in statistical machine translation.
In COLING 2014, 25th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23-29, 2014, Dublin, Ireland, pages 433–443, 2014.
[35] Xuansong Li et al. GALE Arabic-English word alignment training part 1 – newswire and web, LDC2014T05, 2014.
[36] Farzaneh Mahdisoltani, Joanna Biega, and Fabian M. Suchanek. YAGO3: A knowledge base from multilingual Wikipedias. 2015.
[37] James Mayfield, Dawn Lawrie, Paul McNamee, and Douglas W. Oard. Building a cross-language entity linking collection in twenty-one languages. In Pamela Forner, Julio Gonzalo, Jaana Kekäläinen, Mounia Lalmas, and Maarten de Rijke, editors, Multilingual and Multimodal Information Access Evaluation: Second International Conference of the Cross-Language Evaluation Forum, volume 6941 of Lecture Notes in Computer Science, pages 3–13. Springer, 2011.
[38] Paul McNamee, James Mayfield, Dawn Lawrie, Douglas W. Oard, and David S. Doermann. Cross-language entity linking. In IJCNLP, pages 255–263, 2011.
[39] Paul McNamee, James Mayfield, Dawn Lawrie, Douglas W. Oard, and David S. Doermann. Cross-language entity linking. In Fifth International Joint Conference on Natural Language Processing, IJCNLP 2011, Chiang Mai, Thailand, November 8-13, 2011, pages 255–263, 2011.
[40] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), 2011.
[41] David Milne and Ian H. Witten. Learning to link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, pages 509–518, New York, NY, USA, 2008. ACM.
[42] Will Monroe, Spence Green, and Christopher D. Manning. Word segmentation of informal Arabic with domain adaptation.
In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, pages 206–211, 2014.
[43] Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity Linking meets Word Sense Disambiguation: A Unified Approach. Transactions of the Association for Computational Linguistics (TACL), 2:231–244, 2014.
[44] Preslav Nakov and Jörg Tiedemann. Combining word-level and character-level models for machine translation between closely-related languages. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL '12, pages 301–305, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[45] Roberto Navigli and Simone Paolo Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell., 193:217–250, December 2012.
[46] Roberto Navigli and Simone Paolo Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell., 193:217–250, December 2012.
[47] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003.
[48] Daniel Ortiz-Martínez and Francisco Casacuberta. The new Thot toolkit for fully automatic and interactive statistical machine translation. In Proc. of the European Association for Computational Linguistics (EACL): System Demonstrations, pages 45–48, Gothenburg, Sweden, April 2014.
[49] Arfath Pasha, Mohamed Al-Badrashiny, Mona Diab, Ahmed El Kholy, Ramy Eskander, Nizar Habash, Manoj Pooleery, Owen Rambow, and Ryan M. Roth. MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland, 2014.
[50] Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 1375–1384, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[51] Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie. #Microposts2015 – 5th Workshop on 'Making Sense of Microposts': Big things come in small packages. In Proceedings of the 24th International Conference on World Wide Web Companion, WWW '15 Companion, pages 1551–1552, Republic and Canton of Geneva, Switzerland, 2015. International World Wide Web Conferences Steering Committee.
[52] Khaled Shaalan. A survey of Arabic named entity recognition and classification. Computational Linguistics, 40(2):469–510, 2014.
[53] Nakatani Shuyo. Language detection library for Java, 2010.
[54] Valentin I. Spitkovsky and Angel X. Chang. A cross-lingual dictionary for English Wikipedia concepts. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asunción Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May 2012. European Language Resources Association (ELRA).
[55] Ralf Steinberger, Bruno Pouliquen, Mijail Kabadjov, Jenya Belyaeva, and Erik van der Goot. JRC-Names: A freely available, highly multilingual named entity resource. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pages 104–110, Hissar, Bulgaria, September 2011. RANLP 2011 Organising Committee.
[56] Tao Tao, Su-Youn Yoon, Andrew Fister, Richard Sproat, and ChengXiang Zhai. Unsupervised named entity transliteration using temporal and phonetic correlation.
In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pages 250–257, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.
[57] Jean Tavernier, Rosa Cowan, and Michelle Vanni. Holy Moses! Leveraging existing tools and resources for entity translation. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco, 2008.
[58] Tomasz Tylenda, Mauro Sozio, and Gerhard Weikum. Einstein: Physicist or vegetarian? Summarizing semantic type graphs for knowledge discovery. In Proceedings of the 20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, March 28 - April 1, 2011 (Companion Volume), pages 273–276, 2011.
[59] Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Michael Röder, Daniel Gerber, Sandro Athaide Coelho, Sören Auer, and Andreas Both. AGDISTIS - Agnostic disambiguation of named entities using linked open data. In ECAI 2014 - 21st European Conference on Artificial Intelligence, 18-22 August 2014, Prague, Czech Republic - Including Prestigious Applications of Intelligent Systems (PAIS 2014), pages 1113–1114, 2014.
[60] Ricardo Usbeck, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin Brümmer, Diego Ceccarelli, Marco Cornolti, Didier Cherix, Bernd Eickmann, Paolo Ferragina, Christiane Lemke, Andrea Moro, Roberto Navigli, Francesco Piccinno, Giuseppe Rizzo, Harald Sack, René Speck, Raphaël Troncy, Jörg Waitelonis, and Lars Wesemann. GERBIL - General entity annotator benchmarking framework. In WWW 2015, 24th International World Wide Web Conference, Florence, Italy, May 18-22, 2015.
[61] Marieke van Erp, Giuseppe Rizzo, and Raphaël Troncy. Learning with the web: Spotting named entities on the intersection of NERD and machine learning.
In Proceedings of the Concept Extraction Challenge at the Workshop on 'Making Sense of Microposts', Rio de Janeiro, Brazil, May 13, 2013, pages 27–30, 2013.
[62] Gerhard Weikum, Johannes Hoffart, Ndapandula Nakashole, Marc Spaniol, Fabian M. Suchanek, and Mohamed Amir Yosef. Big data methods for computational linguistics. IEEE Data Eng. Bull., 35(3):46–64, 2012.
[63] Jonathan Wright, Kira Griffitt, Joe Ellis, Stephanie Strassel, and Brendan Callahan. Annotation trees: LDC's customizable, extensible, scalable annotation infrastructure. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, May 23-25, 2012, pages 479–485, 2012.
[64] Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, and Gerhard Weikum. HYENA: Hierarchical Type Classification for Entity Names. In Proc. of the 24th Intl. Conference on Computational Linguistics (Coling 2012), December 8-15, Mumbai, India, pages 1361–1370, 2012.
[65] Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, and Gerhard Weikum. HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text. In Proc. of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 4-9, 2013, pages 133–138, 2013.
[66] Mohamed Amir Yosef, Johannes Hoffart, Ilaria Bordino, Marc Spaniol, and Gerhard Weikum. AIDA: An online tool for accurate disambiguation of named entities in text and tables. PVLDB, 4(12):1450–1453, 2011.
[67] Mohamed Amir Yosef, Marc Spaniol, and Gerhard Weikum. AIDArabic: A named-entity disambiguation framework for Arabic text. In The EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP 2014), pages 187–195, Doha, Qatar, 2014. ACL.

Appendix A

Manual Assessment Interface

The following figures show the manual assessment web interface.
Figure A.1: Manual Assessment: Welcome page with instructions and a steps video.

Figure A.2: Manual Assessment: Data evaluation page. Each English name has at most three candidate translations, plus "None" and "Don't know" choices.