Indonesian-Japanese CLIR and CLQA
Developing Cross Language Systems for a Language Pair with Limited Resources
-Indonesian-Japanese CLIR and CLQA-
December 2007
DOCTOR OF ENGINEERING
Ayu Purwarianti
Toyohashi University of Technology

Abstract

Research on cross language text processing systems, including CLIR (Cross Lingual Information Retrieval) and CLQA (Cross Language Question Answering), has become an interesting research area. For major languages, various resources are available, such as parallel corpora, rich bilingual dictionaries, high performance machine translation software, etc. To translate an English sentence into Japanese, one can use free machine translation tools such as Babelfish or Excite. This is not always the case, especially for minor languages such as Indonesian. For Indonesian, obtaining rich translation resources is still practically impossible; building resources comparable to those of the major languages would require a great deal of work. In this thesis, we deal with cross language systems for Indonesian, a language with limited resources. We developed several systems for Indonesian: Indonesian-Japanese CLIR, Indonesian monolingual QA, Indonesian-English CLQA and Indonesian-Japanese CLQA. The main aim of this research is to propose methods that handle the limited resource problem.

For the Indonesian-Japanese CLIR, we propose a query transitive translation system for a language pair with limited data resources. The method performs the transitive translation with a minimum data resource of the source language (Indonesian) and exploits the data resources of the target language (Japanese). There are two kinds of translation: a pure transitive translation and a combination of direct and transitive translations. In the transitive translation, English is used as the pivot language. The translation consists of two main steps. The first is a keyword translation process, which translates based on the available resources; it involves several target language resources, such as a Japanese proper name dictionary and an English-Japanese (pivot-target) bilingual dictionary. The second step selects the best of the available translations: the mutual information score (computed from a target language corpus) is combined with the TF × IDF score in order to select the best translation. The results on the NTCIR 3 (NII-NACSIS Test Collection for IR Systems) Web Retrieval Task showed that this translation method achieved a higher IR score than machine translation (using the Kataku (Indonesian-English) and Babelfish/Excite (English-Japanese) engines). The performance of the transitive translation was about 38% of the monolingual retrieval, and the combination of direct and transitive translation achieved about 49% of the monolingual retrieval, which is comparable to the English-Japanese IR task.

For the monolingual Question Answering (QA) system, we developed a QA system for a limited resource language (Indonesian) that employs a machine learning approach. The QA system consists of two key components, a question classifier and an answer finder, both based on Support Vector Machines (SVM). We also developed some supporting tools, such as an easily built POS tagger and a shallow parser for the question; these supporting tools require little human effort to build.
In the development, 3000 questions covering 6 answer types were collected from 18 Indonesian college students. For the evaluation data, 71,109 Indonesian news articles available on the Web were used. In the experiments, several feature combinations for the SVM were compared. All features used are extracted from the available language resources. One important feature is the bi-gram frequency between the intended word and certain defined words; this feature is introduced to cope with the resource poorness. For the question classification task, the system achieved about 96% accuracy. The answer finder achieved an MRR of 0.52 with the first answer as the exact correct answer. Given this machine learning approach, we argue that the monolingual QA system can be adapted easily to other limited resource languages.

For the CLQA research, we adopted the approach used in the Indonesian monolingual QA for the Indonesian-English CLQA system. The Indonesian-English CLQA system was built from an Indonesian question analyzer, Indonesian-English translation using a bilingual dictionary, an English passage retriever and an English answer finder. Unlike other Indonesian-English CLQA systems, we used bilingual dictionary translation in order to enlarge the keyword coverage. The translation module is equipped with a transformation module for Indonesian borrowed words such as "prefektur" (from "prefecture"), "Rusia" (from "Russian"), etc. The translation results are combined into a boolean query to retrieve relevant English passages. Features of the translated question keywords and passages are used to locate the answer in the English passages. The bi-gram frequency feature used in the Indonesian answer finder for each word in the passage is replaced by a WordNet distance feature; this replacement is done easily, without adding any mapping tables. In the experiments, 2553 questions were used as the training data and 284 questions as the test data; these questions were collected from 18 Indonesian college students. Using this in-house data, the question answering achieved an accuracy of 25% for the first correct answer. Experiments were also conducted using translated test questions from the NTCIR 2005 CLQA data, on which our Indonesian-English CLQA system is superior to all others except one with rich translation resources. We also experimented with various sizes of training data, which shows that the size of the training data does influence the accuracy of a CLQA system.

For the Indonesian-Japanese CLQA, we used a transitive approach in the translation and passage retrieval phases. As in the Indonesian-Japanese CLIR, English is used as the pivot language in the transitive translation with bilingual dictionaries. The experiments show that a passage retriever for transitive translation using the mutual information score and the TF × IDF score as translation filters can outperform direct translation. Furthermore, using the English passage retrieval results as input for the Japanese passage retriever gives much higher passage retrieval performance than using only the query as input. The answer finder employs easily obtained features, including the POS information yielded by Chasen (an available Japanese morphological analyzer).
Although the Indonesian-Japanese question answering performance is lower than that of the Indonesian-English CLQA, it is higher than that of other research using a similar technique, which employs a text chunking process in an English-Japanese CLQA.

Acknowledgements

This research work would not have been possible without the support of many people. I would like to express my appreciation to my supervisor, Prof. Seiichi Nakagawa, who has given much direction in my research and has been supportive throughout these years. I also thank Assistant Prof. Masatoshi Tsuchiya for all his guidance and help in this research; Prof. Norihide Kitaoka for his support to me and my family; and all members of the doctoral meeting for their advice on this research, especially Prof. Takehito Utsuro and Assistant Prof. Kazumasa Yamamoto for all their help and suggestions. My thanks also to Prof. Masaki Aono and Prof. Tomoyosi Akiba for their comments and suggestions on this research; to Prof. Atsushi Fujii for providing the NTCIR data and a Japanese IR tool; and to Dr. Hammam Riza for the Indonesian-English KEBI dictionary. Many thanks to all my friends in the Nakagawa laboratory and to my Indonesian friends in Toyohashi, who have helped me and my family a great deal. Thanks to the Soroptimist Foundation Japan chapter, the Hori Foundation and the Japanese Monbukagakusho for awarding me the financial means to complete this project. Finally, great thanks to my husband, my parents, my daughter and my relatives in Indonesia, who endured this long process with me, always offering support and love.

Contents

1 Introduction
  1.1 Background
  1.2 Research Focus
  1.3 Thesis Contributions
  1.4 Thesis Outline
2 Characteristics of Indonesian Language
  2.1 Development of Indonesian Language
  2.2 Indonesian Grammar
    2.2.1 Part-of-Speech (POS) in Indonesian Language
    2.2.2 Affixes in Indonesian Language
    2.2.3 Sentence Structure in Indonesian Language
    2.2.4 Influences to Indonesian Language
  2.3 Conclusions
3 Indonesian-Japanese CLIR with Transitive Translation
  3.1 Introduction
  3.2 Related Works
  3.3 Overview of Indonesian Query
  3.4 Indonesian-Japanese Query Translation System
    3.4.1 Indonesian-Japanese Keyword Translation Process
    3.4.2 Japanese Translation Candidate Filtering Process
  3.5 Experiments
    3.5.1 Experimental Data
    3.5.2 Compared Methods
    3.5.3 Experimental Result
    3.5.4 Keyword Comparison
  3.6 Conclusions
4 Indonesian Monolingual QA using Machine Learning Approach
  4.1 Introduction on Monolingual QA
  4.2 Related Works
  4.3 Language Resources
    4.3.1 Article Collection
    4.3.2 Building Question Collection
    4.3.3 Other Data Resources (Indonesian-English Dictionary)
  4.4 QA System with Machine Learning Approach
    4.4.1 Supporting Tool (POS Tagger)
    4.4.2 Question Analyzer
    4.4.3 Passage Retriever
    4.4.4 Answer Finder
  4.5 Experimental Result
    4.5.1 Question Classifier
    4.5.2 Passage Retriever
    4.5.3 Answer Finder
    4.5.4 Using NTCIR 2005 (QAC and CLQA Data Set)
  4.6 Adopting QA System for Other Language
  4.7 Conclusions
5 Indonesian-English CLQA
  5.1 Introduction
  5.2 Related Works
  5.3 Data Collection for Indonesian-English CLQA and its Problems
  5.4 Indonesian-English CLQA Schema
  5.5 Modules in Indonesian-English CLQA
    5.5.1 Question Analyzer
    5.5.2 Section Translation
    5.5.3 Passage Retriever
    5.5.4 Answer Finder
  5.6 Experimental Result
    5.6.1 Question Classifier
    5.6.2 Keyword Translation
    5.6.3 Passage Retriever
    5.6.4 Answer Finder
    5.6.5 Experiments with NTCIR 2005 CLQA Task Test Data
  5.7 Conclusions
6 Transitive Approach in Indonesian-Japanese CLQA
  6.1 Introduction
  6.2 Related Works
  6.3 Indonesian-Japanese CLQA with Transitive Translation (Schema 1)
  6.4 Indonesian-Japanese CLQA with Pivot Passage Retrieval (Schema 2)
  6.5 Modules of Indonesian-Japanese CLQA
    6.5.1 Japanese Passage Retrieval
    6.5.2 Japanese Answer Finder
  6.6 Experiments
    6.6.1 Experimental Data
    6.6.2 Evaluation on OOV of Indonesian-Japanese Translation
    6.6.3 Passage Retriever's Experimental Results
    6.6.4 Japanese Answer Finder
    6.6.5 Experimental Result for the Transitive Passage Retriever
    6.6.6 Experimental Results of English-Japanese CLQA
  6.7 Conclusions
7 Conclusions and Future Research
  7.1 Conclusions
  7.2 Future Research
Bibliography
List of Publications

Chapter 1
Introduction

1.1 Background

There are thousands of languages used by the people of the world, and the information provided on the Internet is likewise available in various languages. Much of that information cannot be used by many people because of the limitations of a person's language ability. Researchers in the natural language processing (NLP) area have tried to build technologies to solve this problem and to bridge the understanding among nations with different languages. Many cross language research areas have been investigated, such as machine translation (MT), cross lingual information retrieval (CLIR), cross language question answering (CLQA), etc.

In order to build a cross language system, one needs adequate language resources. For example, to build a machine translation system, the resources could be a good bilingual dictionary, syntactic rules, or a parallel corpus. These rich language resources are usually available for major languages, such as English, Japanese, Chinese, German, etc. The availability of these language resources encourages research in the cross language area; we can see that there are now online machine translation systems available on the Internet, such as BabelFish, Excite, etc.

For a minor language (a language with limited data resources and limited language processing tools), the language resources become one big problem in developing a cross language system, especially if one wants to develop a cross language system between a minor language and a language other than English. This difficulty is addressed in this research: we want to build two cross language systems (CLIR and CLQA) between a minor language (Indonesian) and a major language (Japanese).

Indonesian is the national language of Indonesia, a country with a population of about 250 million. It is also understood by people of neighbouring countries such as Malaysia and Brunei Darussalam. In Indonesia, knowledge of the Japanese language is very limited, even though the two countries have had a good relationship for a long time. Indonesia itself is a developing country which needs much information on culture and science from other countries such as Japan. Therefore, it is a good opportunity to build a cross language system with Indonesian as the source language and Japanese as the target language, in order to strengthen the understanding of Indonesian people of Japan's culture and science.

1.2 Research Focus

In order to achieve the final goal, the research is divided into several steps, each realized as a system. There are four systems developed in this research:

1. Indonesian-Japanese CLIR
A CLIR system is an information retrieval system which receives query sentences in a certain source language and retrieves documents in a target language different from the source language.
In the Indonesian-Japanese CLIR system, the focus is on the translation of Indonesian query sentences. Several strategies are used to improve the IR (Information Retrieval) score using only the available data resources. The proposed strategy is based on a transitive translation with English as the pivot language. The translation results are then processed by a filtering system that uses the mutual information score and the TF × IDF score. The experiments show that the proposed translation method achieved a higher IR score than the comparison methods.

2. Indonesian QA
A QA system tries to give answers to a user's natural language questions. In this research, the monolingual QA is built from scratch: the question-answer data were collected from Indonesian people, and then the monolingual QA system was built. The aim is to build a good monolingual QA for a language with limited resources without the labour of programming the language processing tools by hand. The focus is on the application of a machine learning approach, which is adopted in two modules of this QA system. The results show that without employing any rich data resource, the monolingual QA system can still achieve a promising accuracy.

3. Indonesian-English CLQA
A CLQA system is an answering system where the answers are located in resources (documents) written in a language different from the question language. In the Indonesian-English CLQA, the question language is Indonesian and the documents are in English. The focus is on how to adopt the machine learning approach used in the monolingual QA for the CLQA system with good performance. The Indonesian-English CLQA system was tested on our in-house test data and on the test set of the NTCIR 2005 CLQA task (translated into Indonesian). For the training data, Indonesian college students were asked to write Indonesian questions based on English articles. The experimental results showed that on the NTCIR 2005 CLQA task our result was surpassed only by the top result, which employed some high quality dictionaries.

4. Indonesian-Japanese CLQA
The Indonesian-Japanese CLQA is the final goal of this research. Although it is a CLQA similar to the Indonesian-English CLQA, the two systems face different problems. First, Indonesian-Japanese translation is more difficult than Indonesian-English translation, because of the poorer translation resources and the different writing systems of Indonesian and Japanese. The second problem is the large size of the Japanese corpus used as the source for document retrieval: it is about 30 times larger than the Indonesian corpus. This makes the translation and passage retrieval used in the Indonesian-English CLQA inadequate for the Indonesian-Japanese CLQA. Both problems are addressed in our Indonesian-Japanese CLQA system.

1.3 Thesis Contributions

The research contributions are as follows:

1. Indonesian-Japanese CLIR
• It is the first work on an Indonesian-Japanese CLIR.
• It is the first work for a source language with limited resources that employs only a bilingual dictionary from the source language into a pivot language (transitive translation using bilingual dictionaries).
It should be noted that the system exploits the language resources of the target language.
• In the keyword filtering process, two scores are used in the proposed method: the mutual information score, representing the relationship between word pairs in a sequence, and the TF × IDF score, representing the relationship among all terms in a sequence at the same time.
• The experiments compare the CLIR results (the queries are translated from English queries in the Third NTCIR Web Retrieval Task data) of a transitive machine translation, a direct translation with a dictionary half the size of the transitive translation's, and a transitive bilingual dictionary translation with several keyword filtering schemas.

2. Indonesian monolingual QA
• It is the first work on an Indonesian monolingual QA.
• It is the first data collection for an Indonesian monolingual QA, with 3000 question-answer pairs. The passages with the answers tagged in them are also provided.
• The system has a question classifier module with a machine learning approach for Indonesian (a limited resource language) that achieves good performance. The features are extracted from available resources without using any knowledge or ontology resource. The question classifier can be adopted quite easily for other limited resource languages.
• The answer is located using a text chunking approach with features yielded by the question analyzer and the simple POS tagger on the target document. Even in the answer finder, the system does not employ any knowledge resource or richer quality resources or tools, which are usually unavailable for a language with limited data resources.
• The experiments compare a passage retrieval method using the IDF score with an available search engine that uses the TF × IDF score.

3. Indonesian-English CLQA
• It is the first Indonesian-English CLQA data collection of medium size (2857 question-answer pairs with answer-tagged passages). The only Indonesian-English CLQA data collection available before is a translation set of English questions for the CLEF 2005-2006 CLQA task, with 200 pairs per year.
• To translate the OOVs, the system employs an Indonesian-English transliteration module to cope with the characteristics of Indonesian words.
• Based on certain words used in the Indonesian language, the system also employs another kind of stop word elimination in addition to the common stop word elimination.
• In the answer finder module, the text chunking approach is used as in the monolingual Indonesian QA, but with different features: the bi-gram frequency (statistical information) features were replaced by the WordNet distance features.
• Besides the experiments with the in-house data, experiments using translated test questions from the NTCIR 2005 CLQA1 task are also conducted.
• In the experiments, the machine learning approach is compared across different amounts of training data.

4. Indonesian-Japanese CLQA
• It is the first work on Indonesian-Japanese CLQA.
• Two transitive approaches are compared in order to handle the limited resource problem. The first uses a method similar to the Indonesian-English CLQA, with transitive Indonesian-Japanese translation. The second transitive approach retrieves Japanese passages using the retrieved English passages.
1.4 Thesis’s Outline Chapter 2 consists of the description on Indonesian language including the grammars and some influences of other language into Indonesian. The following chapters describes the research which is divided into four systems, IndonesianJapanese CLIR, Indonesian monolingual QA, Indonesian- English CLQA and Indonesian-Japanese CLQA. Each system is explained in each chapter. Chapter 3 describes the Indonesian-Japanese CLIR, Chapter 4 describes the Indonesian monolingual QA, Chapter 5 explains the Indonesian-English CLQA and chapter 6 is about Indonesian-Japanese CLQA. The explanation of each chapter is divided into several sections such as the introduction, related works, the methods, the experiment and conclusions. The overall conclusions of the research are written in Chapter 7. Chapter 2 Characteristics of Indonesian Language 2.1 Development of Indonesian Language Malay language is the root of Indonesian language. It was named Indonesian in 1928, October 28th because of the political reason, to have a free country with its own language. Even though it comes from Malay language, Indonesian language has changed since it was declared as a national language. As a conversation language, it has been used by many people across islands and regions in Indonesia. In Indonesia, people in a certain region usually have their own original language which is called regional language such as Javanese, Sundanese, Batak, etc. Prof. Dr. Slametmulyana[5] noted that Indonesian language and these regional languages have influenced each other. The influences are not only in the vocabulary, but also in the sentence structure. The influences on sentence structure are usually found in an informal situation. Some influences also came from foreign countries. Indonesia is located between two continents (Asia - Australia) and two oceans (Indian ocean - Pacific ocean). Since some centuries ago, Indonesia has been a transition place for people that across between these two continents or between these two oceans. There are many countries which gave influences on Indonesian language, whether because of the trade affair, religion spreading or collonialism. We can find now that there are many Indonesian words come from other country such as Sanskerta, Arab, English, Dutch, etc. 7 8 CHAPTER 2. CHARACTERISTICS OF INDONESIAN LANGUAGE 2.2 2.2.1 Indonesian’s Grammar Part-of-Speech (POS) in Indonesian Language Noun Plural noun is represented in 2 ways: • repeat singular noun, example: rumah-rumah (houses) • combined with numeral word, example: 8 rumah (8 houses), banyak rumah (a lot of houses) Verb[5] In Indonesian language, there is no special characteristic to differentiate verb from other word type in a sentence. Definition of verb in Indonesian language is given by Gorys Keraf[5]: ”verb is any kind of word that can be expanded with dengan(with) + adjective”. Verb can be word with affix (me, ber, etc) or a root word. Here are some examples: • Saya mandi (I take a bath) • Saya memandikan adik (I bathe my younger sister/brother) There is no auxiliary word in Indonesian language, but auxiliary words in English can be translated into adverbs, for example: • I have read a book (saya telah membaca sebuah buku) • You must go home (kamu harus pulang) Unlike English, passive sentence in Indonesian language is not characterized by its verb, but by its structure, even though verbs with di prefix must be a passive word. 
For example:
• Lagu itu dinyanyikan Erni (that song is being sung by Erni)
• Pintu itu kubuka (that door is opened by me)

Adverb
Words that can be categorized as adverbs in the Indonesian language:
1. To express temporal information, such as akan/hendak (will), belum (haven't), masih (still), telah/sudah (has/have).
2. To express manner. The structure is dengan + adjective, for example: "Dia bekerja dengan baik" (He works well).
3. As a modality word, such as harus/mesti (must/have to), boleh/dapat (may/can), mestinya/semestinya (should).

Adjective
An adjective can be preceded by an adverb such as amat, paling, sangat, lebih ... dari, kurang ... dari, terlalu. These words correspond to more, most, very, -est, -er, too. The functions of an adjective are:
1. to describe another word's condition, for example: manis (sweet), kuat (strong);
2. to complement a number, for example: buah, kuntum, batang, as in "Saya membeli beberapa kuntum bunga" (I bought some flowers);
3. as a comparison, for example: "Buah mangga itu semanis gula" (That mango is as sweet as sugar).

Preposition [5]
Words categorized as prepositions are di/pada (in/on/at), ke (to), dari (from), di dalam (inside), di luar (outside), di atas (above, on), di bawah (under), di depan (in front of), di belakang (behind), di samping (beside), etc. The Indonesian language has no postpositions.

Pronoun [5]
There are three kinds of pronouns, as shown in Table 2.1.

Table 2.1: Pronouns in Indonesian Language
  Pronoun I   - singular: aku, saya, -ku (I, me, my); plural: kami, kita (we, our, us)
  Pronoun II  - singular: engkau, kamu, -mu (you, your); plural: kamu, kalian (you, your)
  Pronoun III - singular: ia, dia, -nya (he/she, his/her/him); plural: mereka (they, them, their)

Besides these pronouns, there are also possessive pronouns, declared by the pattern kepunyaan + pronoun or by the suffixes -mu, -ku, or -nya. Here is an example: "Apakah ini bukumu?" (Is this your book?). In conversation, there are also some indirect pronouns used to show respect to the person spoken to, for example "Apakah bapak hendak menyampaikan sesuatu?" (do you wish to say something?). Here bapak is translated as you, even though its literal translation is Mr.

Numeral
Several kinds of numerals are used:
1. Cardinal numerals, for example: 1 (satu), 2 (dua), 10 (sepuluh), 100 (seratus)
2. Distributive numerals, for example: lusin (dozen), pasang (pair)
3. Multiplicative numerals, for example: sekali/satu kali (once), dua kali (twice)
4. Ordinal numerals, for example: pertama/kesatu (first), kedua (second)
5. Partitive numerals, for example: setengah/seperdua (half), sepertiga (third)

For cardinal and distributive numerals, the noun that follows the numeral does not change, for example: "Saya membeli 3 buah buku" (I bought 3 books). It should be noted that there is usually a special word (which can be classified as an adjective) to describe a noun, such as buah for buku (book), kuntum/tangkai for bunga (flower), etc.

Conjunction [5]
Conjunctions in the Indonesian language express:
1. parallel relationships, such as dan (and), lalu/kemudian (and then), setelah itu/sesudah itu (after that), bahkan/malahan (even), apalagi/tambahan pula (moreover), etc.
2. adverse relationships, such as tetapi/akan tetapi (but), sebaliknya (on the other hand), atau (or), etc.
3. causal relationships, such as oleh sebab itu/oleh karena itu (that's why), sebab itu/karena itu (for that reason)
4. collective sentences, such as ketika/tatkala/waktu/selagi/semasa/manakala (when), sedari/sejak/semenjak (since), sesudah/setelah (after), sebelum (before), etc.

Later, in the keyword translation of the cross language system, all these conjunction words are included in the stop word list, because we consider them unimportant for retrieving documents or passages.

2.2.2 Affixes in Indonesian Language

Indonesian is an agglutinative language, which means that affixes hold an important role. Indonesian has no conjugation or declension (the verb form is independent of tense, number and person). Words in Indonesian, usually verbs, can take many prefixes or suffixes. The affixes are [5]:
1. prefixes, such as me-, di-, ber-, per-, ter-, ke-, and se-. The variations of each prefix are shown in Table 2.2 below.
2. infixes, such as -el-, -em-, and -er-. Here are some word examples:
   • getar (vibrate) + -em- = gemetar (shaking)
   • gigi (teeth) + -er- = gerigi (jagged)
3. suffixes, such as -an, -kan, -i, -nya. These suffixes can be combined with the prefixes above or stand independently of other affixes, for example: tendang-an (kick), laut-an (sea), ruang-an (room), kubur-an (graveyard), mula-i (begin), aku-i (admit), buka-kan (open), etc. Besides being a suffix, -nya, together with -ku and -mu, is defined as a pronoun, as mentioned in Table 2.1. These pronouns can be combined with the other affixes described above, e.g. ruang-an-mu (your room), me(ng)ata-kan-nya (say it), or with a root word, e.g. buku-ku (my book), kakak-mu (your brother).

Words with different affixes might have different translations, but they might also share a translation. Examples of different translations include "membaca" and "pembaca," which are translated "read" and "reader," respectively. An example of a shared translation is "baca" and "bacakan," which are both translated into "read" in English. Another example is "membaca" and "dibaca," which are translated into "read" and "being read," respectively; after stop word elimination, the translation results of "membaca" and "dibaca" give the same English translation, "read." An Indonesian dictionary usually contains the base words and those affixed words that have different translations. For example, the "se-nya" affix declares a "most possible" pattern, as in "sebanyak-banyaknya" (as much as possible), "sesedikit-sedikitnya" (as little as possible), and "sehitam-sehitamnya" (as black as possible); this affix can be attached to many adjectives with the same meaning pattern, so words with the "se-nya" affix are usually not included in an Indonesian dictionary.

Table 2.2: Prefix Variations in Indonesian Language

Variations of the me- prefix:
  me-: for words beginning with the phonemes l, r, w, m, n, ng, ny. Examples: me-lawat (trip), me-rawat (take care), me-wangi (smell), me-masak (cook), me-nilai (evaluate), me-nganga (gape), me-nyanyi (sing)
  mem-: for words beginning with the phonemes b, p. Examples: mem-buat (make), me(m)otong (cut), mem-protes (protest)
  men-: for words beginning with the phonemes d, t. Examples: men-datang (upcoming), me(n)olak (refuse)
  meng-: for words beginning with the phonemes g, k, h and vowels. Examples: meng-gulung (roll), meng-khayal (imagine), me(ng)ait (hook), meng-hadir-i (attend), meng-ambil (take), meng-ekor (follow), meng-elus (stroke), meng-ikat (tie), meng-ulur (elongate)
  meny-: for words beginning with j, c, s. Examples: men-jawab (answer), men-cari (find), me(ny)aring (screen)
  memper-: combines the prefixes mem- and per-. Examples: memper-kaya (make ... rich), memper-kecil (reduce)
  me-kan: similar to the prefix me-. Example: me-naik-kan (raise)
  me-i: similar to the prefix me-. Example: mem-bau-i (smell)
  memper-kan: similar to the prefix memper-. Examples: memper-bunga-kan (accrue), memper-dagang-kan (trade)
  memper-i: similar to the prefix memper-. Example: memper-baik-i (fix)
  member-kan: combines the prefixes mem- and ber-. Example: member-henti-kan (fire)
  menter-kan: combines the prefixes mem- and ter-. Example: menter-tawa-kan (laugh at)
  menge-: similar to the prefix meng-, for one-syllable root words. Examples: menge-tik (type), menge-cap (stamp)
  menye-: similar to the prefix meng-, for one-syllable root words. Examples: menye-tir (drive), menye-top (stop)

Variations of the ber- prefix:
  be-: for words beginning with the phoneme r or whose first syllable ends with er. Examples: be-rasa (taste), be-rambut (have hair), be-kerja (work), be-serta (with)
  bel-: only for bel-ajar (study)
  ber-: for all other words. Examples: ber-warna (colour), ber-kata (say)

Variations of the ke- prefix:
  Used without a suffix, this prefix is non-productive; there are only 3 words: ke-tua (leader), ke-kasih (lover), ke-hendak (will). In informal language, however, there are some influences from Javanese, such as ke-tawa (laugh), ke-temu (meet), ke-pergok (caught), ke-tabrak (hit by), etc. This prefix can be combined with the -an suffix to become the ke-an circumfix (a productive circumfix), as in ke-kuat-an (strength), ke-lemah-an (weakness), ke-lapar-an (starvation).

Variations of the pe- prefix:
  pe-: follows the same pattern as the prefix me-. Examples: pe-lawat, pe-rawat (nurse), pe-waris, pem-bina (counselor), pen-daki (climber), pen-jual (sales people), pen-curi (thief), peng-ganti (stand-in/substitute), etc.
  pemer-: Examples: pemer-satu (unifier), pemer-hati (observer)
  pel-: an exception, only for pel-ajar (student)

Variations of the ter- prefix:
  te-: for words beginning with the phoneme /r/ or whose first syllable ends with /r/ (except when the word means a superlative). Examples: te-rasa (feel), te-rencana (plan), te-perdaya (tricked)
  tel-: only for certain words, as a dissimilation effect. Examples: telanjur (too late), telantar (abandoned)
  ter-: for all other words. Examples: ter-ambil (taken), ter-buat (made of), ter-daftar (registered)
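Before moving on to sentence structure, here is a minimal Python sketch of the kind of dictionary lookup with affix stripping that a morphological analyzer for these meaning-preserving affixes performs. The prefix and suffix lists follow Table 2.2 and the suffix list above; the toy base-word dictionary and the single-strip search order are illustrative assumptions, not the actual tool built in this research.

```python
# Minimal sketch: look up a word, stripping one affix if the exact form fails.
# Longer prefixes are listed first so the longest match wins.
PREFIXES = ["memper", "member", "menter", "menge", "menye", "meng", "meny",
            "mem", "men", "me", "ber", "bel", "be", "ter", "te", "di",
            "pe", "ke", "se"]
SUFFIXES = ["kan", "nya", "an", "ku", "mu", "i"]

BASE_WORDS = {"muncul": "come out", "baca": "read", "ajar": "teach"}  # toy dictionary

def lookup(word):
    """Return an English gloss for `word`, stripping one affix if needed."""
    if word in BASE_WORDS:
        return BASE_WORDS[word]
    for suffix in SUFFIXES:                 # try stripping a suffix first
        if word.endswith(suffix) and word[:-len(suffix)] in BASE_WORDS:
            return BASE_WORDS[word[:-len(suffix)]]
    for prefix in PREFIXES:                 # then try stripping a prefix
        if word.startswith(prefix) and word[len(prefix):] in BASE_WORDS:
            return BASE_WORDS[word[len(prefix):]]
    return None

print(lookup("munculnya"))  # -> "come out" (suffix -nya stripped)
print(lookup("belajar"))    # -> "teach"    (prefix bel- stripped, base ajar)
```

A real analyzer would also handle circumfixes and the sound changes shown in Table 2.2 (e.g. me(m)otong for potong); this sketch only shows the basic strip-and-look-up idea.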
2.2.3 Sentence Structure in Indonesian Language

Indonesian is quite a simple language. For example, it has no declensions or conjugations. The basic word order is SVO (Subject-Verb-Object). Verbs are not inflected for person or number, and there are no tenses; tense is denoted by a time adverb or some other tense indicator, which can be placed at the front or the end of the sentence. There are four grammatical relations in the Indonesian language:

1. subject (S): usually a noun phrase; it can also be expanded into a sub-sentence with its own subject and predicate.
2. predicate (P): can be a noun phrase, verb phrase, adjective phrase, prepositional phrase, or numeral.
3. object (O): can be any phrase and can also be expanded into a sub-sentence with its own subject and predicate.
4. adverb (A): explains place, time, cause, manner, goal, or condition. Just like the object, it can be any phrase and can be expanded into a sub-sentence.
Below are some grammatical patterns in the Indonesian language:
1. S-P. Example: Adik (S) menangis (P) (younger sister/brother is crying)
2. S-P-O. Example: Saya (S) membaca (P) buku (O) (I am reading a book). This S-P-O pattern can be changed into O-P-S to form a passive, for example: "Ibu membaca buku" is changed into "Buku dibaca ibu" (Mother read a book).
3. S-P-A. Example: Kakak (S) menyanyi (P) di kamarnya (A) (older sister/brother is singing in her/his room)
4. S-P-O-A. Example: Ibuku (S) membeli (P) sayuran (O) di toko (A) (my mother bought vegetables at the store)
5. A-S-P-O-A. Example: Tadi pagi (A), ayahku (S) membaca (P) koran (O) di ruang keluarga (A) (This morning, my father read the newspaper in the living room)
6. S-A. In spoken language, the S-A pattern is used frequently. Example: Ibu (S) ke pasar (A) (Mother went to the market)

2.2.4 Influences to Indonesian Language

As mentioned in Section 2.1, sentence structures in the Indonesian language have been influenced by some regional languages. Some examples are shown in Table 2.3.

Table 2.3: Examples of Regional Language's Influences on Indonesian Sentence [5]
  Influenced by Javanese:
    Rumahnya ayah saya sudah dijual. (My father's house has been sold)
    Sementara orang menganggap itu benar. (Some people think it is true)
  Influenced by Sundanese:
    Buku-buku itu sudah saya kekantorkan. (I have delivered the books to the office)
    Uangmu ada di saya. (I have your money)

The Indonesian language also absorbs new words from regional or foreign languages. These borrowed words are divided into two types:
1. The lexical form and its pronunciation are not changed; the word is used in the Indonesian sentence as-is, such as "Academy Awards" in "Siapakah yang telah memenangkan Academy Awards berkali-kali?" (Who has won Academy Awards many times?)
2. The lexical form and its pronunciation are changed into Indonesian, such as aerodinamika (aerodynamics), klasifikasi (classification), etc. The transformation rules, along with their examples, are shown in Table 2.4.

Table 2.4: Transformation Rules for the Borrowed Words in Indonesian Language
  aa (Dutch) becomes a: pal (paal), bal (baal), oktaf (octaaf)
  ae can be changed into e, but can also remain ae: aerob (aerobe), aerodinamika (aerodynamics), hemoglobin (haemoglobin), hematit (haematite)
  ai and au are not changed: kaison (caisson), hidraulik (hydraulic)
  c is changed into k before a, u, o or a consonant: kalomel (calomel), konstruksi (construction), kubik (cubic), kristal (crystal)
  c is changed into s before e, i, oe, and y: sentral (central), sen (cent), sirkulasi (circulation), selom (coelom), silinder (cylinder)
  cc is changed into ks before e and i: aksen (accent), vaksin (vaccine)
  cch and ch are changed into k before a, o and a consonant: sakarin (saccharin), karisma (charisma), kolera (cholera), kromosom (chromosome)
  ch is changed into s when pronounced as s or sy: eselon (echelon), mesin (machine)
  ch is changed into c when pronounced as c: cek (check), Cina (China)
  c (Sanskrit) is changed into s: sabda (cabda), sastra (castra)
  e remains e: efek (effect), deskripsi (description), sintesis (synthesis), sistem (system)
  ea remains ea: idealis (idealist), habeas (habeas)
  ee (Dutch) is changed into e: stratosfer (stratosfeer), sistem (systeem)
  ei remains ei: eikosan (eicosane), einsteinium (einsteinium)
  eo remains eo: stereo (stereo), geometri (geometry)
  eu remains eu: neutron (neutron), eugenol (eugenol)
  f remains f: fanatik (fanatic, fanatiek), faktor (factor), fosil (fossil)
  gh is changed into g: sorgum (sorghum)
  gue is changed into ge: ige (igue), gige (gigue)
  i remains i at the beginning of a word and before a vowel: iambe (iamb), ion (ion)
  ie (Dutch) is changed into i when pronounced as i: politik (politiek), rim (riem)
  ie remains ie when not pronounced as i: varietas (variety), pasien (patient), efisien (efficient)
  kh (Arabic) remains kh: khusus (khusus), akhir (akhir)
  ng remains ng: kontingen (contingent), kongres (congress), linguistik (linguistics)
  oe (oi Greek) is changed into e: estrogen (oestrogen), enologi (oenology), fetus (foetus)
  oo (Dutch) is changed into o: kompor (komfoor), provos (provoost)
  oo (English) is changed into u: kartun (cartoon), pruf (proof), pul (pool)
  oo (double vowel) remains oo: zoologi (zoology), koordinasi (coordination)
  ou is changed into au when pronounced as au: baut (bout), kaunter (counter)
  ou is changed into u when pronounced as u: gubernur (gouverneur), kupon (coupon), kontur (contour)
  ph is changed into f: fase (phase), fisiologi (physiology), spektograf (spectrograph)
  ps remains ps: pseudo (pseudo), psikiatri (psychiatry), psikosomatik (psychosomatic)
  pt remains pt: pterosaur (pterosaur), pteridologi (pteridology), ptialin (ptyalin)
  q is changed into k: akuarium (aquarium), frekuensi (frequency), ekuator (equator)
  rh is changed into r: rapsodi (rhapsody), ritme (rhythm), retorik (rhetoric)
  sc is changed into sk before a, o, u and a consonant: skandium (scandium), skotopia (scotopia), skripsi (scriptie)
INDONESIAN’S GRAMMAR Rules sc is changed into s if it is placed before e, i and y sk is changed into sch if it is placed before a vocal t is changed into s when it is placed before i and pronounced as s th is changed into t u remains u ua remains ua ue remains ue ui remains ui uo remains uo uu is changed into u v remains v x remains x if it is placed at the beginning of a word x is changed into ks if it is placed not at the beginning of a word xc is changed into ks if it is placed before e and i xc is changed into ksk if it is placed before a, o, u and a consonant y remains y when it is pronounced as y y is changed into i when it is pronounced as i z remains z -aat is changed into -at -age is changed into -ase -ary, -air are changed into -er Examples senografi(scenography), sintilasi(scintillation), sifistoma(scyphistoma) skema(schema), skizofrenia(schizophrenia), skolastisisme(scholasticism) rasio(ratio), aksi(actie, action), pasien(patient) teokrasi(theocracy), ortografi(orthography), trombosis(thrombosis), metode(method, methode) unit(unit), nucleolus(nucleolus), struktur(structure, structuur) dualisme(dualism), akuarium(aquarium) sued(suede), duet(duet) ekuinoks(equinox), konduite(conduite), duit(duit) kuorum(quorum), kuota(quota) prematur(prematuur), vakum(vacuum), kultur(cultuur) vitamin(vitamin), televisi(television), kavaleri(cavalry) xantat(xanthate), xenon(xenon), xilofon(xylophone) eksekutif(executive), taksi(taxi), ekstra(extra), kompleks(complex), lateks(latex) eksepsi(exceptie), ekses(excess), tasi(excitation) ekskomunikasi(excommunication), sif(excursive), ekslusif(exclusive) eksiekskur- yangonin(yangonin), yen(yen), yukaganin(yuccaganin) dinamo(dynamo), propil(propyl), psikologi(psychologie) zenit(zenith), zodiak(zodiac), zaman(zaman) advokat(advokaat), traktat(traktaat) persentase(percentage), etalase(etalage) komplementer(complementary, complementair), primer(primary, primair), sekunder(secondary, secundair) 18 CHAPTER 2. 
  -ant is changed into -an: akuntan (accountant), informan (informant)
  -archy, -archie are changed into -arki: anarki (anarchy, anarchie), oligarki (oligarchy, oligarchie)
  -al, -eel, -aal are changed into -al: struktural (structural, structureel), formal (formal, formeel), ideal (ideal, ideaal), normal (normal, normaal)
  -ein remains -ein: sistein (cystein), protein (protein)
  -or, -eur are changed into -ur: direktur (director, directeur), inspektur (inspector, inspekteur)
  -or remains -or: diktator (dictator), korektor (corrector)
  -ive, -ief are changed into -if: deskriptif (descriptive, descriptief), demonstratif (demonstrative, demonstratief)
  -ic, -ics, -ique, -iek, -ica (nominal) are changed into -ik, -ika: fonetik (phonetics, phonetiek), fisika (physics, physica), logika (logic, logika), dialektika (dialectics, dialectica), teknik (technique, techniek)
  -ile, -iel are changed into -il: persentil (percentile, percentiel), mobil (mobile, mobiel)
  -ic (adjective), -isch are changed into -ik: elektronik (electronic, electronisch), mekanik (mechanic, mechanisch), balistik (ballistic, balistisch)
  -ical, -isch are changed into -is: ekonomis (economical, economisch), praktis (practical, practisch), logis (logical, logisch)
  -ism, -isme are changed into -isme: modernisme (modernism, modernisme), komunisme (communism, communisme), imperialisme (imperialism, imperialisme)
  -ist is changed into -is: publisis (publicist), egois (egoist), teroris (terrorist)
  -logy, -logie are changed into -logi: teknologi (technology, technologie), fisiologi (physiology, physiologie), analogi (analogy, analogie)
  -logue is changed into -log: katalog (catalogue), dialog (dialogue)
  -loog (Dutch) is changed into -log: analog (analoog), epilog (epiloog)
  -oid, -oide are changed into -oid: hominoid (hominoid, hominoide), antropoid (anthropoid, anthropoide)
  -oir(e) is changed into -oar: trotoar (trottoir), repertoar (repertoire)
  -ty, -teit are changed into -tas: universitas (university, universiteit), kualitas (quality, kwaliteit)
  -ure, -uur are changed into -ur: struktur (structure, structuur), prematur (premature, prematuur)

2.3 Conclusions

In this chapter, we introduced the properties of the Indonesian language. Basically, there are two types of Indonesian words: native words and borrowed words. Native words originate from Indonesian; borrowed words come from other languages. A borrowed word may keep its original form (unmodified), for example Salsa, Manhattan, etc., or it may be a modified form of the original term, as described in Table 2.4. We make use of this characteristic (modified borrowed words) by building transformation rules to translate the OOVs. Another important characteristic is the affix: a single base term with different affixes can have different meanings and different English translations, but there are also affixes that do not change a word's meaning. The morphological analyzer built in this research aims to handle the affixes that do not change a word's meaning.
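To illustrate how such transformation rules can serve OOV translation in the later chapters, the following is a minimal Python sketch that rewrites an Indonesian borrowed word into candidate English spellings by applying a few rules from Table 2.4 in reverse before a pivot-dictionary lookup. The four-rule subset and the brute-force candidate expansion are illustrative assumptions; a real module would use the full rule table and keep only the candidates that are found in the bilingual dictionary.

```python
# Sketch: reverse a few Table 2.4 borrowing rules to recover English spellings.
RULES = [("ks", "x"),   # taksi  -> taxi
         ("k", "c"),    # kartun -> cartun (then u -> oo below)
         ("u", "oo"),   # cartun -> cartoon
         ("f", "ph")]   # fase   -> phase

def english_candidates(word):
    """For each rule, either apply it everywhere or skip it, collecting
    all resulting spelling variants (junk variants are later filtered
    out by a dictionary lookup)."""
    candidates = {word}
    for indo, eng in RULES:
        candidates |= {c.replace(indo, eng) for c in candidates}
    return candidates

print("taxi" in english_candidates("taksi"))      # True
print("cartoon" in english_candidates("kartun"))  # True
```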
Chapter 3
Indonesian-Japanese CLIR with Transitive Translation

3.1 Introduction

Nowadays, there are many web resources on the Internet written in languages other than English, and CLIR (Cross Language Information Retrieval) serves as a bridge between Internet sources and users with different languages. Indonesia, with a population of 241 million people, also has an interest in utilizing CLIR, but, unfortunately, Indonesian is a language with minimal data resources. There is therefore a need to build a CLIR for a language with minimal data resources such as Indonesian. For this kind of translation, transitive translation with bilingual dictionaries is an alternative method: even though other data resources such as machine translation or parallel corpora exist, an electronic bilingual dictionary is the most widely available. We propose a query transitive translation system for a CLIR whose source language has poor data resources. Our research aim is to perform the transitive translation with a minimum data resource of the source language (Indonesian) and to exploit the data resources of the target language (Japanese). In the transitive translation, English is used as the pivot language.

3.2 Related Works

Some studies have been done in the field of transitive translation with bilingual dictionaries in CLIR systems, such as [7, 15]. The work in [7] translated Spanish queries into French with English as the interlingua; Ballesteros used the Collins Spanish-English and English-French dictionaries. The work in [15] translated German queries into English using two pivot languages (Spanish and Dutch); Gollins used the EuroWordNet as a data resource. To our knowledge, no CLIR with transitive translation exists for a source language with limited data resources such as Indonesian. In [42], a language with limited data resources was used as the target language in an English-Hindi CLIR: English queries were translated into Hindi using two English-Hindi bilingual dictionaries, a direct rather than a transitive translation. They provided a stop-word list, a simple normalizer, a stemming module and a transliteration module for Hindi; these tools were used to support the Hindi information retrieval. It should be noted that although their target language is a limited resource language, the other language is English, which differs from our research focus: a cross language system for a language pair where the target language is a major language other than English.

Translation with a bilingual dictionary usually provides many translation alternatives, only a few of which are appropriate, and a transitive translation gives even more translation alternatives than a direct translation. In order to select the most appropriate translation, a monolingual corpus can be used. The work in [6] used an English corpus to select English translations based on the Spanish-English translation and analyzed co-occurrence frequencies to disambiguate phrase translations; the occurrence score is called the em score, each set is ranked by its em score, and the highest ranking set is taken as the final translation. The work in [13] used a Chinese corpus to select the best English-Chinese translation set, modifying the EMMI weighting measure to calculate a term coherence score. A Chinese corpus was also used by [14] in English-Chinese CLIR, with three kinds of translation: COTM (a co-occurrence translation model with a modified mutual information score), NPTM (an NP translation model that identifies NPs and translates them statistically) and DPTM (a dependency translation model).
A Chinese-English CLIR [23] proposed a statistical framework called the maximum coherence principle, with term similarity based on the mutual information score, and another technique using a graph partitioning approach for query translation disambiguation. The work in [34] selected the best Spanish-English and Chinese-English translations using an English corpus; the coherence score was based on 1) web page counts, 2) retrieval scores, and 3) mutual information scores. The work in [2] translated Indonesian into English and used an English monolingual corpus to select the best translation, employing a term similarity score based on the Dice similarity coefficient. The work in [10] combined the N-best translations, based on an HMM model of query translation pairs and the relevant-document probability of the input word, to rank Italian documents retrieved by an English query. The work in [19] used all terms to retrieve documents in order to obtain the best term combination, and chose the most frequent term in each term translation set appearing in the top-ranked documents.

Here, we translate Indonesian queries into a Japanese keyword list in order to retrieve Japanese documents. Because of the resource limitations between Indonesian and Japanese, we conduct a transitive translation with English as the pivot language. Even though machine translation or a parallel corpus could be used in the transitive translation, we prefer to employ a bilingual dictionary as the most available resource. To filter the translation results, we combine the TF × IDF engine score and the mutual information score (taken from a monolingual target language corpus) to select the most appropriate translation. Another problem in translation with bilingual dictionaries is out-of-vocabulary (OOV) words. This problem becomes critical when the OOV words are proper nouns, which are usually important keywords in the query; if the proper noun keywords are not translated, the IR system will return almost no relevant documents. We found that some OOV words that are not available in the Indonesian dictionary are borrowed words; in this Indonesian-Japanese CLIR, the borrowed words come from English and Japanese. Therefore, in order to translate these OOVs, we use the English-Japanese dictionary and a Japanese proper name dictionary.
3.3 Overview of Indonesian Query

Indonesian is the official language of Indonesia and is understood by people in Indonesia, Malaysia, and Brunei. It belongs to the Malayo-Polynesian (Austronesian) language family, which extends across the islands of Southeast Asia and the Pacific, and it is related to neither English nor Japanese. Unlike other languages used in Indonesia, such as Javanese, Sundanese and Balinese, which use their own scripts, Indonesian uses the familiar Roman script, with only the 26 letters of the English alphabet; no transliteration module is needed to translate an Indonesian sentence.

Indonesian sentences usually consist of native (Indonesian) words and borrowed words. The first three query examples in Table 3.1 contain borrowed words: "Academy Awards" in the first query, "novel" in the second and "salsa" in the third are borrowed from English, and "Miyabe Miyuki" in the second query is transliterated from Japanese. Other than these exact borrowed words, there are also borrowed words that were changed into Indonesian, such as "generasi" from "generation" in the first query, "metode" from "method" in the second query and "ozon" from "ozone" in the last query. To obtain a good translation, the query translation in our system must be able to translate both kinds of words, the native (Indonesian) words and the borrowed words.

Table 3.1: Indonesian Query Examples
  Query 1: Saya ingin mengetahui siapa yang telah menjadi peraih Academy Awards beberapa generasi secara berturut-turut (I want to know who have been the recipients of successive generations of Academy Awards)
  Query 2: Temukan buku-buku yang mengulas tentang novel yang ditulis oleh Miyabe Miyuki (Find book reviews of novels written by Miyabe Miyuki)
  Query 3: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (I want to know the method of studying how to dance the salsa)
  Query 4: Saya ingin belajar tentang akibat perusakan lapisan ozon dan pelebaran lubang ozon terhadap tubuh manusia (I want to learn about the effects that destruction of the ozone layer and expansion of the ozone hole have on the human body)

3.4 Indonesian-Japanese Query Translation System

Indonesian-Japanese query translation is a component of the Indonesian-Japanese CLIR. The query translation system aims to translate an Indonesian query sentence into a Japanese keyword list, which is then used in the Japanese IR system to retrieve the relevant documents. The schema of the Indonesian-Japanese query translation system is shown in Figure 3.1. The query translation system consists of two subsystems: keyword translation and translation candidate filtering. The keyword translation system obtains Japanese translation candidates for an Indonesian query sentence; the translation candidate filtering selects the most appropriate among all the Japanese translation alternatives. The Japanese translation resulting from the translation filtering is used as the input for the Japanese IR system. The keyword translation and translation filtering processes are described in the next section.

[Figure 3.1: Indonesian-Japanese Query Translation Schema]

3.4.1 Indonesian-Japanese Keyword Translation Process

The keyword translation system is the process by which Indonesian keywords are translated into Japanese keywords. We chose transitive translation with bilingual dictionaries for the keyword translation; other approaches, such as direct translation or machine translation, are employed as comparison methods. The schema of the keyword transitive translation using bilingual dictionaries is shown in Figure 3.2. Even though an Indonesian-Japanese dictionary is available, we do not propose direct translation using a bilingual dictionary, because bilingual dictionaries between a given language and English are more available than a bilingual dictionary between two languages other than English. By using transitive translation, this method can be applied to other language pairs more easily than direct translation with a bilingual dictionary. The keyword translation process consists of native (Indonesian) word translation and borrowed word translation. The native words are translated using the Indonesian-English and English-Japanese dictionaries.
Because an Indonesian POS tagger or parser is not available, we translate single words and consecutive word pairs that exist as a single term in the Indonesian-English dictionary. As mentioned in the previous section on affix combinations in Indonesian, not all affixed word forms are recorded in an Indonesian dictionary. Therefore, if a search does not find the exact word, it searches for words that are the basic word of the query word or that share the same basic word. For example, the Indonesian word "munculnya" (come out) has the basic word "muncul" with the postfix "-nya". The term "munculnya" is not in the dictionary, so the search takes "muncul" as the matching word for "munculnya" and returns the English translation of "muncul", such as "come out", as its translation result. The English translation results are then translated into Japanese using an English-Japanese dictionary. The English translation results also include inflected words, not only basic words. For example, the English translation in the Indonesian-English dictionary for "obat-obatan" is "medicines," while the term in the English-Japanese dictionary is "medicine." Therefore, in the English matching, we searched for either the same English word or the basic word of the English translation.

Figure 3.2: Indonesian-Japanese Keyword Translation Schema (an Indonesian sentence query is split into Indonesian words; native words pass through Indonesian-English and English-Japanese bilingual dictionary translation, while borrowed words pass through English-Japanese bilingual dictionary translation, Japanese proper name dictionary translation, or hiragana/katakana transliteration; the resulting Japanese keywords are analyzed by the Japanese morphological analyzer Chasen and filtered by Japanese stop word elimination to yield the candidates for the filtering process)

In Indonesian, a noun phrase has the reverse word order of English. For example, "ozone hole" is translated as "lubang ozon" (ozone = ozon, hole = lubang) in Indonesian. Therefore, besides word-by-word translation, we also searched for the reversed English word pair as a single term in the English-Japanese dictionary. This strategy reduced the number of translation candidates. An example of the keyword translation process in the transitive translation with bilingual dictionaries is shown in Table 3.2. In the query example, three word pairs are treated as single terms in the English-Japanese dictionary: ozone layer (オゾン層), ozone hole (オゾンホール) and human body (人体). Other translations such as coating or stratum (synonyms for layer) are eliminated as translation candidates. Borrowed words are translated using an English-Japanese dictionary, because most of the borrowed words in the query translation system come from English. Examples of borrowed words in the queries are "Academy Awards," "Aurora," "Tang," "baseball," "Plum," "taping," and "Kubrick." Even though the English-Japanese dictionary may translate such words accurately, some proper nouns cannot be translated by this dictionary, such as "Miyabe Miyuki," "Miyazaki Hayao," "Honjo Manami," etc. These proper names come from Japanese words that have been romanized; a hedged sketch of this OOV fallback handling follows.
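The sketch below treats the dictionaries as plain lookup tables; the affix rules cover only the postfix "-nya" and word reduplication, and all entries are illustrative stand-ins, not actual dictionary content.

```python
# Hedged sketch of the fallback chain for words missing from the Indonesian-English
# dictionary: first strip simple affixes, then treat the word as borrowed and try
# the English-Japanese and proper-name dictionaries.

id_en = {"muncul": ["come out"]}
en_ja = {"salsa": ["サルサ"]}
proper_names = {"miyabe": ["宮部"], "miyuki": ["みゆき"]}

def strip_affixes(word):
    """Very small stemmer: only the postfix '-nya' and reduplication are handled."""
    if word.endswith("nya"):
        return word[:-3]
    if "-" in word:                      # e.g. "obat-obatan" -> "obat"
        return word.split("-")[0]
    return word

def translate_oov(word):
    stem = strip_affixes(word)
    if stem in id_en:                    # native word found after stemming
        return id_en[stem]
    if word in en_ja:                    # borrowed English word
        return en_ja[word]
    if word in proper_names:             # romanized Japanese proper noun
        return proper_names[word]
    return []                            # still OOV; katakana transliteration would go here

print(translate_oov("munculnya"))        # ['come out']
print(translate_oov("salsa"))            # ['サルサ']
```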
Table 3.2: Illustration of the Native (Indonesian) Keyword Translation Process
Indonesian query: Saya ingin belajar tentang akibat perusakan lapisan ozon dan pelebaran lubang ozon terhadap tubuh manusia (= I want to learn about the effects the destruction of the ozone layer and the expansion of the ozone hole have on the human body)
Indonesian keywords: belajar, perusakan, lapisan, ozon, pelebaran, lubang, ozon, tubuh, manusia
English keywords (Indonesian-English dictionary matching result): belajar → to study, to learn, to take up; perusakan → damaging; lapisan → coating, layer, stratum; ozon → ozone; pelebaran → widening, broadening; lubang → cavity, hole, hollow, perforation; tubuh → body; manusia → human being, man, human
Japanese keywords (English-Japanese dictionary matching result): belajar → ∼を調べる, 勉強する, 研究する, 学ぶ, ...; perusakan → 損害を与える, 不利な, 有害な; lapisan ozon → オゾン層 (ozone layer); pelebaran → 拡大主義者, 広がり; lubang ozon → オゾンホール (ozone hole); tubuh manusia → 人体 (human body)

In Japanese, these proper names might be written in one of the following scripts: kanji (Chinese characters), hiragana (cursive form), katakana (squared form) and romaji (Roman alphabet). One alphabetical word can be transliterated into more than one Japanese word. For hiragana and katakana, a borrowed word is transliterated using a pairing list between hiragana or katakana and its Roman-alphabet reading; these scripts have a one-to-one correspondence with pronunciation (syllables or phonemes), something that is not possible for kanji. Therefore, in order to obtain kanji corresponding to borrowed words, we use a Japanese proper name dictionary. Each term in the original proper name dictionary usually consists of two words, the family name and the first name. For a wider selection of translation candidates, we split each two-word term into two separate terms (a small sketch of this expansion is given after Table 3.3), so that even when the input word cannot be found as a full entry (family name plus first name), a match may still be possible with the expanded proper name dictionary.

Each of the above translation steps also involves stop word elimination, which deletes stop words, i.e., words that carry no significant meaning for document retrieval. Stop word elimination is done at every language step. First, Indonesian stop word elimination is applied to the Indonesian query sentence to obtain the Indonesian keywords. Second, English stop word elimination is applied before the English keywords are translated into Japanese keywords. Finally, Japanese stop word elimination is done after the Japanese keywords are morphologically analyzed by Chasen [26].

Table 3.3: Examples of Indonesian-Japanese Keyword Translation
Indonesian query: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (= I want to know the method of studying how to dance the salsa)
Native word translation:
- Indonesian keywords: metode | belajar | menari
- English keywords (from the Indonesian-English dictionary): method | to learn, to study, to take up | dance
- Japanese keywords (from the English-Japanese dictionary): 規則正しさ, 筋道, 秩序, 方法 | ∼を調べる, 勉強する, 研究する, 学ぶ, 勉強, 研究, 調査, 検討, 書斎, ∼を学ぶ, 知る, わかる, 暗記する, 覚える, 確認する, 習う, 突きとめる | 舞踊, ダンス, ダンスパーティー, バレエ, ダンスする, 舞う, 踊る, 踊らされる, いいようにされる
- Japanese keywords (after analysis by Chasen): 規則正し, 筋道, 秩序, 方法 | 調べる, 勉強, 研究, 学ぶ, 調査, 検討, 書斎, 知る, わかる, 暗記, 覚える, 確認, 習う, 突きとめる | 舞踊, ダンス, ダンスパーティー, バレエ, 舞う, 踊る, 踊ら
Borrowed word translation:
- Indonesian keyword: salsa
- English keyword: salsa
- Japanese keywords (from the English-Japanese dictionary): サルサ, サルサのダンス
- Japanese keywords (after analysis by Chasen): サルサ

Examples of the Indonesian-Japanese keyword translation are shown in Table 3.3.
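Returning to the proper name dictionary described above, each two-word entry is split into two single-word entries so that a partial match is still possible. A minimal sketch of that expansion, with an illustrative entry rather than actual dictionary content:

```python
# Each original entry ("family-name first-name" -> kanji pair) is split into two
# single-word entries, so a query containing only one of the names still matches.

original = {("miyabe", "miyuki"): ("宮部", "みゆき")}

expanded = {}
for (family, first), (kanji_family, kanji_first) in original.items():
    expanded.setdefault(family, set()).add(kanji_family)
    expanded.setdefault(first, set()).add(kanji_first)

print(expanded["miyabe"])  # {'宮部'}
```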
Each word in the input query is matched against the terms in the Indonesian-English bilingual dictionary and the stop word list. If the word is not in the stop word list and exists in the Indonesian-English bilingual dictionary, it is assumed to be a native word and translated using the Indonesian-English and English-Japanese dictionaries. Words of this type in Table 3.3 are "metode" (method), "belajar" (to learn) and "menari" (dance). If the word is not in the stop word list and does not exist in the Indonesian-English bilingual dictionary, it is assumed to be a borrowed word and translated using the English-Japanese dictionary, the Japanese proper name dictionary and/or transliteration. For example, "salsa" is translated into サルサ using the English-Japanese dictionary. The final Japanese keywords are then analyzed by Chasen and input into the translation candidate filtering process, which is described in the following section. The keyword transitive translation is used in two configurations: 1) transitive translation of all words in the query, and 2) direct translation using the Indonesian-Japanese dictionary, with transitive translation via the English-Japanese dictionary applied only to the Indonesian OOV words. We call the first method transitive translation using bilingual dictionaries and the second method combined translation (direct-transitive).

3.4.2 Japanese Translation Candidate Filtering Process

The Japanese translation candidate filtering process selects the most appropriate of the Japanese translation candidates. In order to select the best Japanese translation, rather than choosing only the highest TF × IDF score or only the highest mutual information score among all keyword lists, we combine both scores, selecting the keyword list with the highest TF × IDF score among the sequences with the top mutual information scores. To avoid computing scores for all possible sequences, we keep 100 term sequences ranked by their mutual information scores. The mutual information score is calculated per word pair. First, we select the 100 (or fewer) sequences with the best mutual information scores among the translations of the first two Indonesian keywords. These 100 best sequences, joined with the translation set of the third keyword, are rescored, and the 100 best sequences over the three translation sets are kept. This step is repeated until all translation sets are covered (a sketch of this beam procedure follows below). For a word sequence, the mutual information score is

I(t_1 \cdots t_n) = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} I(t_i, t_j)   (3.1)

where I(t_1 \cdots t_n) is the mutual information for a sequence of words t_1, t_2, \cdots, t_n and I(t_i, t_j) is the mutual information between the two words (t_i, t_j). A zero-frequency word pair therefore has no impact on the mutual information score of a word sequence. For the mutual information score between two words, the standard formula [47] is used:

I(t_i, t_j) = P(w_i, w_j) \times \log \frac{P(w_i, w_j)}{P(w_i) \, P(w_j)}   (3.2)

where

P(w_i, w_j) = \frac{C(w_i, w_j)}{\sum_{w'_i, w'_j} C(w'_i, w'_j)}   (3.3)

and

P(w) = \frac{C(w)}{\sum_{w'} C(w')}   (3.4)

Here C(w_i, w_j) is the co-occurrence frequency of terms w_i and w_j within a predefined window, and C(w) is the occurrence count of term w. This mutual information score represents the relationship between word pairs in a sequence, but not the relationship among all terms in a sequence at the same time.
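The beam-style filtering described above can be sketched as follows; the mi function stands in for the pairwise score of Equation 3.2 estimated from the target-language corpus, and the toy values are purely illustrative.

```python
import itertools

def sequence_mi(seq, mi):
    """Equation 3.1: sum the pairwise MI over every word pair in the sequence."""
    return sum(mi(a, b) for a, b in itertools.combinations(seq, 2))

def beam_filter(translation_sets, mi, beam=100):
    """Keep the `beam` best partial sequences while joining one translation set at a time."""
    beams = [(t,) for t in translation_sets[0]]
    for tset in translation_sets[1:]:
        extended = [seq + (t,) for seq in beams for t in tset]
        extended.sort(key=lambda s: sequence_mi(s, mi), reverse=True)
        beams = extended[:beam]
    return beams

# Toy MI function standing in for corpus co-occurrence statistics (Equation 3.2).
toy_mi = {("方法", "ダンス"): 0.8, ("方法", "わかる"): 0.5}
mi = lambda a, b: toy_mi.get((a, b), toy_mi.get((b, a), 0.0))

sets = [["方法", "秩序"], ["わかる", "知る"], ["ダンス", "舞踊"]]
for seq in beam_filter(sets, mi, beam=5):
    print(seq, sequence_mi(seq, mi))
```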
Therefore, in the translation candidate filtering, we also used the TF × IDF score to represent such a relationship. The next step is to select the keyword list with the highest TF × IDF score among the sequences with the top mutual information scores. The TF × IDF score used here is the relevance score between a document and the query (Equation 3.5, from [12]):

\sum_t \frac{TF_{t,i}}{\frac{DL_i}{avglen} + TF_{t,i}} \cdot \log \frac{N}{DF_t}   (3.5)

TF_{t,i} denotes the frequency of term t in document i. DF_t denotes the number of documents containing term t. N is the total number of documents in the collection. DL_i denotes the length of document i (i.e., the number of characters in i), and avglen the average length of the documents in the collection.

Table 3.4 shows an example of the keyword selection process after the keyword translation process is completed (as in Table 3.3; the Japanese keyword translation result is repeated in the second row of Table 3.4). The translation combinations (third row) and sequence rankings (fourth row) cover all words in the query, i.e., the translation sets of "metode", "belajar", "menari" and "salsa". All resulting sequences, ranked by their mutual information scores, are then run through the IR system [12] to obtain their TF × IDF scores. Accepting the Japanese keywords as input, the IR system [12] ranks the documents using Equation 3.5, producing a list of relevant documents for each keyword list. The final query chosen is the one with the highest TF × IDF score (last row) over the 300 result documents.

3.5 Experiments

3.5.1 Experimental Data

The query translation performance was measured by the IR score achieved by the CLIR system, because CLIR is a real application and the score also reflects the effect of keyword expansion. We did not use word-to-word translation accuracy, since a one-to-one translation rate is not suitable when there are many semantically equivalent words (a keyword-level evaluation is given in Section 3.5.4). The CLIR experiments were conducted on the NTCIR-3 Web Retrieval Task data (100 GB of Japanese documents), for which Japanese queries and translated English queries were provided.

Table 3.4: Example Result of the Translation Filtering Method
Indonesian query: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (I want to know the method of studying how to dance the salsa)
Japanese keywords (after analysis by Chasen): 規則正し, 筋道, 秩序, 方法 | 調べる, 勉強, 研究, 学ぶ, 調査, 検討, 書斎, 知る, わかる, 暗記, 覚える, 確認, 習う, 突きとめる | 舞踊, ダンス, ダンスパーティー, バレエ, 舞う, 踊る, 踊ら | サルサ
Translation combinations: (規則正し, 調べる, 舞踊, サルサ), (筋道, 調べる, 舞踊, サルサ), (秩序, 調べる, 舞踊, サルサ), etc.
Sequences ranked by mutual information score: 1. (秩序, 知る, 踊る, サルサ); 2. (秩序, 研究, 踊る, サルサ); 3. (方法, わかる, ダンス, サルサ); 4. (方法, 覚える, ダンス, サルサ); 5. (秩序, 分かる, 踊る, サルサ)
Result (query with the best TF × IDF score): (方法, わかる, ダンス, サルサ)

The Indonesian queries (47 queries) were manually translated from the English queries. The 47 queries contain 528 Indonesian words (225 of which are not stop words), 35 English borrowed words, and 16 transliterated Japanese words (proper nouns). The IR system [12] was borrowed from Atsushi Fujii (Tsukuba University). Using Equation 3.5, the IR system retrieves the 1000 documents with the highest TF × IDF scores for a non-Boolean Japanese query. The Indonesian queries are translated into Japanese and then input into the IR system.
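To make Equation 3.5, as used by the IR system, concrete, here is a minimal sketch with a toy inverted index standing in for the NTCIR-3 collection statistics; the variable names mirror the definitions above.

```python
import math

# Toy collection statistics standing in for the NTCIR-3 data.
index = {"オゾン": {"d1": 3, "d2": 1}, "人体": {"d1": 2}}   # term -> {doc: TF}
doc_len = {"d1": 1200, "d2": 900}                            # characters per document
N = 2                                                        # documents in the collection
avglen = sum(doc_len.values()) / N

def relevance(query_terms, doc):
    """Relevance score of Equation 3.5 for one document."""
    score = 0.0
    for t in query_terms:
        tf = index.get(t, {}).get(doc, 0)
        if tf == 0:
            continue
        df = len(index[t])               # number of documents containing t
        score += tf / (doc_len[doc] / avglen + tf) * math.log(N / df)
    return score

print(relevance(["オゾン", "人体"], "d1"))
```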
The Indonesian-Japanese translation resources are as follows:
• Indonesian-English dictionary KEBI [17], 29,054 Indonesian words
• English-Japanese dictionary Eijirou [27], 556,237 English words
• Indonesian-Japanese dictionary [36], 14,823 Indonesian words
• English stop word list, combined from [11] and [53]; the Indonesian stop word list is translated from the English stop word list, and the Japanese stop word list consists of Japanese function words
• English morphology rules, implementing the WordNet [28] descriptions
• Indonesian morphology rules, restricted to word repetition and the postfixes -nya and -i
• Japanese morphological analyzer Chasen [26]
• Japanese proper name dictionary (containing 61,629 Japanese person names)
• Mainichi Shinbun newspaper corpus [25]
• Daily Yomiuri Online (in English) newspaper corpus [51]
• Indonesian newspaper corpus (articles downloaded from http://ilps.science.uva.nl/Resources/BI/)

The Mainichi Shinbun newspaper corpus is used as the data resource for the mutual information scores between Japanese keywords. The Daily Yomiuri Online newspaper corpus is used as the data resource for the mutual information scores between English keywords. The Indonesian newspaper corpus is used to reduce the vocabulary size of the original Indonesian-Japanese dictionary.

3.5.2 Compared Methods

In the experiments, we compared the proposed method with other translation methods. Table 3.5 lists all the compared methods with their corresponding query labels (as used in the figures of Section 3.5.3).

Table 3.5: List of Compared Methods and Their Query Labels (Section 3.5.3)
• iej-mx: Indonesian-English-Japanese transitive translation with machine translation (Kataku and Excite)
• iej-mb: Indonesian-English-Japanese transitive translation with machine translation (Kataku and Babelfish)
• ij: Indonesian-Japanese direct translation with the existing Indonesian-Japanese dictionary
• ijn: Indonesian-Japanese direct translation with the built-in Indonesian-Japanese dictionary
• postfixes -In and -I-n: Japanese keyword filtering using only the mutual information score; -In means the Japanese keyword list is the nth-ranked keyword list by mutual information score, and -I-n means the union of the 1st- through nth-ranked keyword lists
• infix -En: English keyword filtering by mutual information score; -En means the English keyword list is the nth-ranked list by mutual information score
• ej-xxx-yyy (Figure 3.8): English-Japanese CLIR, where
xxx denotes the keyword translation method: "man" (English keywords selected manually from the English query sentence), "mb" (Babelfish machine translation) or "mx" (Excite machine translation); and yyy denotes the keyword filtering method: In, I-n or IR-n (with IR-n, the chosen keyword list is the one with the best TF × IDF score over its document collection among the n highest-ranked mutual information keyword lists).

Transitive Translation using Machine Translation

The first compared method is transitive translation using MT (machine translation). The Indonesian-Japanese transitive translation using MT has a schema similar to the Indonesian-Japanese transitive translation using bilingual dictionaries; instead of the available Indonesian-English and English-Japanese dictionaries, the Indonesian queries are translated using the online Indonesian-English MT Kataku [45] and two online English-Japanese MTs, Babelfish [30] and Excite [9]. Examples of Indonesian-Japanese translation results using the machine translation method are shown in Table 3.6.

Table 3.6: Examples of Indonesian-Japanese Translation using the Machine Translation Method
Indonesian query: Saya ingin belajar tentang akibat perusakan lapisan ozon dan pelebaran lubang ozon terhadap tubuh manusia (= I want to learn about the effects the destruction of the ozone layer and the expansion of the ozone hole have on the human body)
Translated English (Kataku engine): I wanted to study about resulting from destruction and the widening of the ozone hole of the layer of ozone of the human body
Translated Japanese sentence, Excite engine (iej-mx): 私は人体のオゾンの層のオゾンホールの破壊から生じて、広くなるのに関して研究したかったです
Translated Japanese sentence, Babelfish engine (iej-mb): 私は人体のオゾンの層のオゾン穴の破壊そして広がることに起因について調査したいと思った
Japanese keywords, Excite engine: 破壊, 人体, オゾン, 層, ホール, 広く, 起因, 勉強
Japanese keywords, Babelfish engine: 人体, オゾン, 層, 穴, 破壊, 広がる, 起因, 調査, 思う
Indonesian query: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (= I want to know the method of studying how to dance the salsa)
Translated English (Kataku engine): I wanted to know the method of studying how danced salsa
Translated Japanese sentence, Excite engine (iej-mx): 私は、どのようにを研究するか方法がサルサを踊ったのをしりたかったです
Translated Japanese sentence, Babelfish engine (iej-mb): 私はいかに踊られたサルサ調査する方法を知りたいと思った
Japanese keywords, Excite engine: 勉強, 方法, サルサ, ダンス
Japanese keywords, Babelfish engine: いかに, 踊ら, サルサ, 調査, 方法, 思う

Direct Translation using the Existing Indonesian-Japanese Dictionary

The second comparison method is direct translation with an Indonesian-Japanese dictionary. This direct translation also has a schema similar to the transitive translation using bilingual dictionaries (Figure 3.2). The difference is that, in the translation of an Indonesian keyword, only one dictionary is used rather than two. In this case, we use an Indonesian-Japanese bilingual dictionary (14,823 words), which has fewer words than the Indonesian-English (29,054 words) and English-Japanese (556,237 words) dictionaries. We also ran some direct translation experiments with the Indonesian-Japanese dictionary reduced to various sizes (3000, 5000 and 8857 words). Table 3.7 shows examples of Indonesian-Japanese translation results with the direct translation method using the Indonesian-Japanese dictionary.
Table 3.7: Examples of Indonesian-Japanese Translation with Direct Translation using the Existing Indonesian-Japanese Dictionary
Indonesian query: Saya ingin belajar tentang akibat perusakan lapisan ozon dan pelebaran lubang ozon terhadap tubuh manusia (= I want to learn about the effects the destruction of the ozone layer and the expansion of the ozone hole have on the human body)
Japanese keywords, 3000-word dictionary: 学ぶ, 習う, 学習, 破壊, 行為, 層, 階層, オゾン, pelebaran, lubang, 体, 肉体, 人間
Japanese keywords, 5000-word dictionary: 学ぶ, 習う, 学習, 破壊, 行為, 層, 階層, オゾン, pelebaran, ホール, 穴, 体, 肉体, 人間
Japanese keywords, 8857-word dictionary: 学ぶ, 習う, 学習, 破壊, 行為, 層, 階層, オゾン, 拡張, ホール, 穴, 体, 肉体, 人間
Japanese keywords, 14,823-word dictionary: 学ぶ, 習う, 学習, 破壊, 行為, 層, 階層, オゾン, 拡張, ホール, 穴, 体, 肉体, 人間
Indonesian query: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (= I want to know the method of studying how to dance the salsa)
Japanese keywords, all four dictionary sizes (3000, 5000, 8857 and 14,823 words): 方法, メソッド, 学ぶ, 習う, 学習, サルサ, ダンス

Direct Translation using the Built-in Indonesian-Japanese Dictionary

We also compared the transitive translation results with those of direct translation using our built-in Indonesian-Japanese dictionary. We still call this direct translation because, although the Indonesian-Japanese dictionary was built in advance from the Indonesian-English and Japanese-English dictionaries, the query translation process itself uses only this one dictionary, which yields different Japanese translations than the transitive translation. When building the Indonesian-Japanese dictionary from the Indonesian-English and Japanese-English dictionaries, the number of possible translation pairs explodes. To select the correct pairs, we used the "one-time inverse consultation" score as in [44]: for each Indonesian word, we look up all its English translations and count how many of them match the English translations of a Japanese translation candidate. We also used WordNet to obtain more English translation candidates. The complete procedure is as follows (a sketch is given below):

1. Match each English translation (from the Indonesian-English dictionary) against the English words in the Japanese-English dictionary. If the English term is a phrase and no matching word can be found, the English term is normalized by eliminating certain words ("to", "a", "an", "the", "to be", "kind of"). For example, the word "belajar" has three English translations in the Indonesian-English KEBI dictionary: "to study", "to learn" and "to take up"; after normalization, these become "study", "learn" and "take up".

2. For every Japanese translation candidate, a "one-time inverse consultation" score is calculated: the English translations of the Japanese candidate are matched against the English translations of the Indonesian word. If more than one word matches, the candidate is accepted as an Indonesian-Japanese pair. If not, the English translations are extended with their synonyms taken from WordNet, and the "one-time inverse consultation" score of the extended English word set is recalculated. For example, the word "わかる" can be translated into "learn" or "take up" according to the English-Japanese dictionary, so "わかる" is taken as a translation of the word "belajar".

Examples of Indonesian-Japanese translation results using the built-in Indonesian-Japanese dictionary are shown in Table 3.8.
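A minimal sketch of the acceptance test in step 2, omitting the WordNet synonym expansion; the gloss sets are illustrative stand-ins for the dictionary entries.

```python
# "One-time inverse consultation": accept a Japanese candidate when at least two
# of its English glosses overlap with the English glosses of the Indonesian word.

id_en = {"belajar": {"study", "learn", "take up"}}
ja_en = {"わかる": {"learn", "take up", "understand"}, "穴": {"hole"}}

def inverse_consultation(id_word):
    accepted = []
    glosses = id_en[id_word]
    for ja_word, ja_glosses in ja_en.items():
        if len(glosses & ja_glosses) >= 2:     # "more than one" matched gloss
            accepted.append(ja_word)
    return accepted

print(inverse_consultation("belajar"))         # ['わかる']
```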
Table 3.8: Examples of Indonesian-Japanese Translation with Direct Translation using the Built-in Indonesian-Japanese Dictionary
Indonesian query: Saya ingin belajar tentang akibat perusakan lapisan ozon dan pelebaran lubang ozon terhadap tubuh manusia (= I want to learn about the effects the destruction of the ozone layer and the expansion of the ozone hole have on the human body)
Japanese keywords (ijn): わかる, 研究, 与える, 層, オゾン, 広げる, 穴, 空洞, 団体, 集団, 死体, 人物, 人, 人間, 人類
Indonesian query: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (= I want to know the method of studying how to dance the salsa)
Japanese keywords (ijn): 秩序, 方法, 研究, わかる, 踊る, ダンス, 曲, パーティー, 舞う, バレエ, 舞踊, サルサ

Japanese Keyword Selection using only the Mutual Information Score in Transitive Translation using Bilingual Dictionaries

We also compared the proposed keyword selection method with Japanese keyword selection based on the mutual information score only. There are two keyword selection schemas. In the first schema, only the single best keyword list among the ranked keyword lists is selected. In the second, all keywords from the first-ranked through the nth-ranked keyword lists are merged into one keyword list. For the baseline (iej), we used the Indonesian-Japanese transitive translation with bilingual dictionaries without keyword selection. Table 3.9 shows some translation examples for the Indonesian-Japanese transitive translation (bilingual dictionaries) with Japanese keyword selection using the mutual information score only. Queries with the postfix "In" (first schema) and the postfix "I-n" (second schema) in Section 3.5.3 show the experimental results.

Table 3.9: Examples of Indonesian-Japanese Translation with Japanese Keyword Selection using the Mutual Information Score only
Indonesian query: Saya ingin belajar tentang akibat perusakan lapisan ozon dan pelebaran lubang ozon terhadap tubuh manusia (= I want to learn about the effects the destruction of the ozone layer and the expansion of the ozone hole have on the human body)
Japanese keywords (iejI1): 確認, 与える, オゾン層, 広がり, オゾンホール, 人体
Japanese keywords (iejI-3): する, 確認, 覚える, 与える, オゾン層, 広がり, オゾンホール, 人体
Indonesian query: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (= I want to know the method of studying how to dance the salsa)
Japanese keywords (iejI1): 方法, する, ダンス, サルサ
Japanese keywords (iejI-3): 秩序, 方法, する, 知る, 踊る, サルサ

English Keyword Selection using the Mutual Information Score in Transitive Translation using Bilingual Dictionaries

Another comparison method is transitive translation with English keyword selection based on mutual information taken from a monolingual English corpus. The English keywords are selected based on their mutual information scores, and the selected English keywords are used as the input for the English-Japanese translation. Table 3.10 shows examples of Indonesian-Japanese translation with English keyword filtering using the mutual information score.

Table 3.10: Examples of Indonesian-Japanese Translation (Bilingual Dictionaries) with English Keyword Selection
Indonesian query: Saya ingin belajar tentang akibat perusakan lapisan ozon dan pelebaran lubang ozon terhadap tubuh manusia (= I want to learn about the effects the destruction of the ozone layer and the expansion of the ozone hole have on the human body)
Japanese keywords (iejE): 検討, 学ぶ, 調査, 調べる, 研究, 勉強, 与える, 有害, 不利, 損害, オゾン, 層, 拡大, 主義者, ホール, 人体
Indonesian query: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (= I want to know the method of studying how to dance the salsa)
Japanese keywords (iejE): 秩序, 方法, 知る, わかる, 覚える, 確認, 学ぶ, 習う, 踊る, ダンス, パーティー, バレエ, 舞踊, 舞う, サルサ

English-Japanese Translation

The English-Japanese translation is done to measure the performance reduction caused by the Indonesian-English translation step.
Methods used in the English-Japanese query translation are machine translation and word-by-word translation using an English-Japanese bilingual dictionary. The schemas for the machine translation and the dictionary translation are similar to those described earlier in Section 3.5.2. The machine translation systems used here are the Babelfish and Excite engines. Table 3.11 shows examples of English-Japanese translation results in the English-Japanese CLIR.

Table 3.11: Examples of English-Japanese Translation in the English-Japanese CLIR
English query: I want to learn about the effects the destruction of the ozone layer and the expansion of the ozone hole have on the human body
• English-Japanese dictionary (ej): 学ぶ, 知る, わかる, 暗記, 覚える, 確認, 習う, 突きとめる, 個人, 資産, 駆除, 絶滅, 倒壊, 破壊, 破滅, 原因, 撲滅, 滅亡, オゾン, 層, ホール, 膨張, 発展, 拡張, 人体
• Excite engine (ej-mx): オゾン, ホール, 層, 破壊, 拡大, 人体, 上, 効果, 学び
• Babelfish engine (ej-mb): オゾン, 層, 効果, 学び, 思い, 穴, 拡張, 人体
English query: I want to know the method of studying how to dance the salsa
• English-Japanese dictionary (ej): 規則正しさ, 筋道, 秩序, 方法, 学習, 学問, 舞踊, ダンス, パーティー, バレエ, 舞う, 踊る, 踊らされる, サルサ
• Excite engine (ej-mx): サルサ, 踊る, 方法, 学ぶ
• Babelfish engine (ej-mb): サルサ, 踊る, 方法, 学ぶ, 為, 思う

3.5.3 Experimental Results

Baseline Experiments

In these experiments, we compared the IR score of each translation method. The IR scores are reported as Mean Average Precision (MAP), the mean over queries of the non-interpolated average precision, i.e., the average of the precision values obtained after each relevant document is retrieved. Each query group has four MAP scores: RL (highly relevant documents as correct answers, with hyperlink information used), RC (highly relevant documents as correct answers), PL (partially relevant documents as correct answers, with hyperlink information used), and PC (partially relevant documents as correct answers).

Figure 3.3 shows the IR scores of queries translated with the basic translation methods, such as the bilingual dictionaries or machine translation, without any enhancement; all translation candidates are grouped together and used as the query input for the IR system. With only bilingual dictionaries (Indonesian-Japanese and English-Japanese), the proposed methods (iej and ij-iej) gave IR scores lower than the transitive translation using machine translation (iej-mx and iej-mb). The combination of direct and transitive translation achieved a higher IR result than the direct translation (ij), but the improvement was not significant. The direct translation with the built-in dictionary (ijn) achieved the lowest IR score, which suggests that the new Indonesian-Japanese dictionary has lower coverage than the two source dictionaries (Indonesian-English and English-Japanese). The main baseline here is "iej", the transitive translation using bilingual dictionaries without any borrowed word translation and without keyword selection. The transitive translation with machine translation (iej-mx and iej-mb) scored higher than the other translation methods. The highest CLIR score among the baseline translations reached only 31% (iej-mx, MAP score on RL = 0.0306) of the monolingual IR (jp, MAP score on RL = 0.0985). The dictionary-based transitive
translation (iej, MAP score on RL = 0.0138) and the direct-transitive translation (ij-iej, MAP score on RL = 0.0218) achieved 14% and 22% of the monolingual IR, respectively.

Figure 3.3: IR Score with Indonesian-Japanese Baseline Translation

Experiments with Translation of Borrowed Words

In Figure 3.3, the translation is done only for original Indonesian words, which leaves many OOV words that are borrowed words. In order to enhance the IR score, these borrowed words are translated using the supporting resources (the English-Japanese dictionary, the Japanese proper name dictionary and the Japanese common noun dictionary). Figure 3.4 shows the IR scores with borrowed word translation: by translating the borrowed words in the query, each translation method improved on the IR score obtained by the corresponding baseline method in Figure 3.3.

Figure 3.4: IR Score of Indonesian-Japanese CLIR (with Borrowed Word Translation)

The most significant improvement is for the direct Indonesian-Japanese translation. The combined translation (ij-iej) showed a lower IR score than the direct translation (ij). We assume that this is because the combined translation yields too many translation results and leads to the retrieval of irrelevant documents. The same reasoning applies to the transitive translation (iej), which scored lowest among all translation results.

Experiments with Keyword Filtering

To reduce the number of translation candidates yielded by the translation methods, we performed keyword selection on the translation results (see the keyword selection details in Section 3.4.2). We experimented with two kinds of keyword selection: 1) Japanese keyword selection, and 2) English keyword selection. Figure 3.5 shows the impact of Japanese keyword selection on the IR score, and Figure 3.7 shows the IR scores achieved with English and/or Japanese keyword selection. The query label notation used in Figure 3.5 is xxx-yyy: the prefix "xxx" denotes the keyword translation method as in Figure 3.3 (for example iej, iej-mb, etc.), and the postfix "yyy" denotes the keyword filtering method. Figure 3.5 shows that keyword selection based on the combination of mutual information and TF × IDF scores (iej-IR-n) yielded a significant IR score improvement for the transitive translation. The proposed transitive translation (iej-IR-10) improved the IR (RL) score of the baseline transitive translation (iej) from 0.0138 to 0.0371. A t-test showed that iej-IR-10 significantly improved on the baseline method (iej) at a 97.5% confidence level, T(68) = 1.92, p < 0.03.
The t-test also showed that, compared to the other baseline systems, the proposed transitive translation (iej-IR-10) increased the IR score at 85% (T(84) = 1.04, p < 0.15), 69% (T(86) = 0.49, p < 0.31), 91% (T(83) = 1.35, p < 0.09), and 93% (T(70) = 1.49, p < 0.07) confidence levels for iej-mb, iej-mx, ij and ij-iej, respectively. The IR scores achieved by the transitive translation using bilingual dictionaries were better than those of the transitive machine translation (iej-mb-IR and iej-mx-IR) or the direct translation (ijn-IR and ij-IR-5).

Figure 3.5: IR Score of Indonesian-Japanese CLIR (with Borrowed Word Translation and Japanese Keyword Selection) for All Queries (47 Queries)

Experiments with Combination of Translation

The other proposed method, the combination of direct and transitive translation (ij-iej), achieved the best IR score among all the translation methods (transitive machine translation, direct translation and transitive translation using bilingual dictionaries). The proposed combination translation method (ij-iej-IR-30) improved the IR (RL) score of the baseline combination translation (ij-iej) from 0.0218 to 0.0486. A t-test showed that the proposed combination translation significantly improved the IR score of the baseline ij-iej at a 98% confidence level, T(69) = 2.09, p < 0.02. Compared to the other baseline systems, the t-test showed that the proposed combination translation method (ij-iej-IR-30) improved the IR score at 95% (T(83) = 1.66, p < 0.05), 86% (T(85) = 1.087, p < 0.14), 97% (T(82) = 1.91, p < 0.03) and 99% (T(67) = 2.38, p < 0.005) confidence levels for iej-mb, iej-mx, ij and iej, respectively. Figure 3.6 shows the IR score of the Indonesian-Japanese CLIR for queries with in-vocabulary words (42 queries); the pattern matches that for all queries (Figure 3.5). In Figure 3.6, the ij-iej translation achieved the highest IR score of 0.0555, which is 56% of the monolingual IR.

Figure 3.6: IR Score of Indonesian-Japanese CLIR (with Borrowed Word Translation and Japanese Keyword Selection) for Queries with In-Vocabulary Words (42 Queries)

Figure 3.7 shows the impact of English (pivot language) keyword selection on the transitive translation; the method is described in Section 3.5.2. The experimental results show that keyword selection on the English keywords failed to yield a significant improvement in translation.

Figure 3.7: IR Score of Indonesian-Japanese CLIR with English and/or Japanese Keyword Selection

Experiments on English-Japanese CLIR

Figure 3.8 shows the IR score of the English-Japanese CLIR, which has four translation groups: ej-man (English-Japanese translation using the bilingual dictionary, with the English keywords selected manually from the English query sentence), ej-mb (translation using the Babelfish engine), ej-mx (translation using the Excite engine) and ej (translation using the bilingual dictionary). Compared to the Japanese monolingual IR in Figure 3.3, the English-Japanese CLIR with bilingual-dictionary-based translation (ej-IR-30, MAP score on RL = 0.0467) achieved 47% performance.
The Indonesian-Japanese CLIR with transitive translation (iej-IR-30 in Figure 3.5, MAP score on RL = 0.0371) achieved 38% of the performance of the Japanese monolingual IR (MAP score on RL = 0.0985), and the Indonesian-Japanese CLIR with the combined direct-transitive translation (ij-iej-IR-30 in Figure 3.5, MAP score on RL = 0.0486) achieved 49% of the Japanese monolingual IR, comparable with the English-Japanese CLIR using the bilingual dictionary.

Figure 3.8: IR Score of English-Japanese CLIR

Experiments on Dictionary Size

Figure 3.9 shows the highest IR score of the CLIR using bilingual-dictionary-based translation (ijn, ij, iej and ij-iej) together with the vocabulary size of each translation. Even though the built-in Indonesian-Japanese dictionary has a larger vocabulary than the existing Indonesian-Japanese dictionary, the IR score of the translation using the built-in dictionary (ijn) is lower than that of the translation using the existing dictionary (ij). We assume that the Japanese keyword selection in the dictionary building process is not able to select appropriate Japanese translations. The ij-iej translation uses a merged dictionary built from the existing Indonesian-Japanese and Indonesian-English dictionaries. Its dictionary size is larger than those of the existing Indonesian-Japanese (ij) and Indonesian-English (iej) dictionaries, and its performance exceeds that of either bilingual dictionary alone, because some Indonesian words that are OOV in the Indonesian-Japanese dictionary exist in the Indonesian-English dictionary and vice versa.

Figure 3.9: IR Score of CLIR using Bilingual-Dictionary-based Translation and its Dictionary Size

Experiments on the Number of OOV Words

Figure 3.10 shows the CLIR score and the number of OOV words for the CLIR with direct translation using the existing Indonesian-Japanese bilingual dictionary (ij). The dictionary was reduced to 3000, 5000 and 8857 words, so there are four dictionaries of differing sizes: 3000, 5000, 8857 and the complete Indonesian-Japanese dictionary of 14,823 words. The reduction was done by selecting the most frequent Indonesian words in an Indonesian newspaper corpus. Figure 3.10 shows that the more OOV words a translation yielded, the lower the IR score it achieved, which indicates that dictionary quality plays a significant role in CLIR.

Figure 3.10: IR Score of CLIR using Indonesian-Japanese Direct Translation (ij) with its Word Number and the OOV Word Number (best IR score (RL) and OOV word count plotted against dictionary size)

3.5.4 Keyword Comparison

All figures in Section 3.5.3 show the IR score achieved by each translation method. When comparing IR scores, each translation result is in effect compared with all words of the same semantic meaning (a one-to-all comparison). We also did a one-to-one comparison by comparing each translation result with the keyword list of the monolingual query (jp). The comparison can be seen in Table 3.12.
Table 3.12: Keyword Comparison between Translation Results and the Original Japanese Keyword List
Query label | Precision | Recall | IR score (RL)
Baseline methods:
iej-mx | 22.82% | 42.18% | 0.0306
iej-mb | 21.66% | 34.6% | 0.0197
ij | 17.49% | 33.65% | 0.0074
iej | 3.63% | 40.75% | 0.0138
ij-iej | 10.78% | 39.81% | 0.0218
Compared methods:
iej-mx-IR | 23.71% | 43.6% | 0.0336
iej-mb-IR | 23.72% | 37.44% | 0.0238
ij-IR | 30.12% | 35.55% | 0.0366
jp (monolingual) | 100% | 100% | 0.0985
Proposed methods:
iej-IR-10 | 25.48% | 31.28% | 0.0371
ij-iej-IR-30 | 37.05% | 44.08% | 0.0486

Table 3.12 lists the keyword comparison between the translation results and the original Japanese keywords, as indicated by precision and recall scores. There is obviously no direct correspondence between the precision and recall scores and the IR score. Even though the combined translation (ij-iej-IR-30), which has the highest IR score, also showed the highest recall and precision, the other results show a different picture. For example, iej-IR-10 (transitive translation using bilingual dictionaries) had lower recall and precision than ij-IR-5 (direct translation), yet the IR score achieved by iej-IR-10 was higher than that achieved by ij-IR-5. We assume this is because the keyword comparison treats the main keyword and the complement keywords equally, while the main and complement keywords have different effects on the information retrieval score. For example, the query "Find documents describing how to make chiffon cake" has "chiffon cake" as the main keyword and "how to make" as the complement of the main keyword. If the translation system produced a correct translation of only the complement keywords (here, "how to make"), the precision and recall of the keyword comparison would increase, whereas the IR score would not.

3.6 Conclusions

We presented a translation method that is suitable for queries in a language with limited data resources, such as Indonesian. Compared to other types of translation, such as transitive translation using machine translation and direct translation using a poor source-target bilingual dictionary, our transitive translation and the combined translation (direct translation plus transitive translation) achieved higher IR scores. In the Indonesian-Japanese CLIR, the transitive translation achieved 38% of the monolingual IR performance and the combined translation achieved 49%, which is comparable with the English-Japanese CLIR. The two important methods in our transitive translation are the borrowed word translation and the keyword selection method. The borrowed word translation reduced the number of OOV words from 50 to 5 using a pivot-target (English-Japanese) bilingual dictionary and a target-language (Japanese) proper name dictionary. The keyword selection using the combination of mutual information and TF × IDF scores gives a significant improvement over the baseline transitive translation. The other important method, combining transitive and direct translation using bilingual dictionaries, also improved the CLIR performance; the t-test showed that it significantly improved on the baseline transitive translation at a 99% confidence level. We believe that the system can easily be adapted to accept input queries written in other minor languages.
The tools needed for such an adaptation are a bilingual dictionary between the query language and English, morphological rules for stemming, and a stop word list in the query language, which can easily be translated from the English stop words.

Chapter 4

Indonesian Monolingual QA using a Machine Learning Approach

4.1 Introduction to Monolingual QA

Question Answering (QA) has been an interesting research field in the Natural Language Processing (NLP) area. A QA system gives an answer, taken from some available source, when a question in human language is posed. To answer a question, a human has to possess knowledge about the question domain. For a computer system, however, collecting and building a knowledge resource comparable to a human's is very expensive, and until now no system has been able to acquire a complete knowledge resource. On the other hand, information in text format is widely available on the World Wide Web. This has triggered research on question answering systems that exploit document texts as the resource from which possible answers are extracted. Researchers have developed several approaches for QA systems that use document texts as the answer resource. Most QA systems are composed of three subsystems [48, 16]: question analysis, passage retrieval and answer finding. In traditional QA systems, these components depend on hand-crafted rules, which are expensive to build. To avoid this expensive work, an alternative is to employ a machine learning method in the question analysis and/or in the answer finder. For many languages, such as English or the European languages, formal question-answer-document resources are available for QA research. The U.S. TREC (Text REtrieval Conference) has run a QA task for English since 1999 (TREC-8). Europe's CLEF (Cross Language Evaluation Forum) has provided a QA task since 2003, starting with Dutch, Italian and Spanish monolingual QA tasks; the number of target languages in the CLEF 2005 QA task increased to nine, adding French, Portuguese, Bulgarian, English, Finnish, and German. For Asian languages, other than Japanese (NTCIR, NII-NACSIS Test Collection for IR Systems), no formal data resource is available. To our knowledge, this research is the first QA system for a language with such limited resources. "Limited resources" means that there are no question answering data and no language processing tools available (rich dictionary, parsing tool, etc.). Here, we use Indonesian as the language for our QA system. Indonesian is used by the population of Indonesia, about 260 million people, in a country located in Southeast Asia, and it is understood by people in Malaysia and Brunei. Because the language is used by so many people, there is an increasing need for Indonesian natural language technology, including IR/QA systems. Most Indonesians speak two languages: Indonesian and their own regional language. Although Indonesian shares some sentence structures with English, such as the subject-predicate-object order, it has many syntactic differences, such as the word order within a noun phrase. The Indonesian question sentence structure is also different from English.
For example, in English one cannot say "Country which has the biggest population?" (the correct sentence is "Which country has the biggest population?"), while in Indonesian the structure "Negara apa" (country which) has the same meaning as "Apa negara" (which country). Observing these differences between Indonesian and English, we concluded that English language processing tools cannot be applied directly to Indonesian. For Indonesian, some computational linguistics research has already been carried out, such as Indonesian information retrieval [43], Indonesian-English CLQA [3] and Indonesian-Japanese CLIR [33]. In the Indonesian-English CLQA system [3], Indonesian is used as the question language and English as the document language; this Indonesian-English CLQA task is available at CLEF. As for monolingual Indonesian question answering, we believe this is the first QA research in which the question and document language are both Indonesian.

4.2 Related Work

For English question classification (part of question analysis), the literature [52] compared various machine learning methods such as Nearest Neighbor, Naive Bayes, Decision Tree, Sparse Network of Winnows (SNoW) and Support Vector Machines (SVM). They reported that the SVM algorithm gave the highest accuracy using bag-of-words and bag-of-n-grams features, achieving 90% for coarse classes (6 classes) using a tree kernel, and 80.2% for fine classes (50 classes). The literature [21] used the SNoW learning architecture to classify English questions with the following features: words, POS tags, chunks, named entities, head chunk (the first noun chunk in the sentence), and semantically related words (words that often occur with a specific question class). Using the same data as [52], they reported a classification accuracy of 91% for coarse classes and 84.2% for fine classes. With the same data as [52], the literature [41] utilized the SVM algorithm with features combining the subordinate word category (using WordNet), the question focus (extracted with manually listed regular expressions) and syntactic-semantic structure; they achieved 85.6% classification accuracy for fine classes. For our Indonesian question classification, we used the SVM algorithm with features extracted from the available resources, which differ in some respects from the features mentioned in the related work. For a full QA system, the literature [31] employed a perceptron model to answer TREC-9 questions for English. They used dependency relations (learned by a probabilistic parser for the question) and WordNet information as machine learning features. Their experimental results achieved the highest MRR score among participants in the TREC-9 evaluation: 0.58 for short answers and 0.76 for long answers. Another QA system with a machine learning approach [35] used maximum entropy with features such as the expected answer class (or question type, predicted by collected surface text patterns), the answer frequency, the absence of question words, and word matching. They used a pattern file supplied by NIST to tag the answer chunks in the testing phase, while for the training phase they used the TREC-9 and TREC-10 data sets. For Japanese QA, the literature [38] tried to eliminate the restriction imposed by question types by joining question features, document features and combination features in a maximum entropy algorithm.
Their feature sets include four kinds of part-of-speech information (obtained from the Japanese morphological analyzer Chasen) for each word in the question and document, n-gram lexical terms and some matching scores (lexical and POS). Their main experimental result achieved an MRR of 0.36 and a Top-5 score of 0.47. In our Indonesian monolingual QA system, we adopted an approach similar to the literature [38]: the answer finder employs an SVM algorithm with features obtained from an Indonesian corpus, joined with the results of a purpose-built POS tagger and question parser. We compared several feature combinations of the question class, question features and document features. The experimental results show that using both the question class and the question features in the answer finder gives better performance than using only one of them.

4.3 Language Resources

4.3.1 Article Collection

First, we surveyed existing Indonesian news articles available on the web and found three candidate article collections, but we were only able to download one of them, located at http://www.tempointeraktif.com. We downloaded about 56,471 articles, which were noisy, with many incorrect characters, and some of which were in English. We cleaned the articles semi-automatically by deleting articles with certain words in the subtitle. We then joined our downloaded articles with the available corpus (http://ilps.science.uva.nl/Resources/BI/, 28,393 articles), resulting in 71,109 articles. Finally, we selected 221 articles, based on the number of possible answers they contain, as the seeds for the question collection.

4.3.2 Building the Question Collection

To build our question collection, we asked 18 native Indonesian speakers to write factoid questions, along with their answers and question types, based on the selected articles. For this task, we made a web-based question editor (the interface is shown in Figure 4.1).

Figure 4.1: Interface of User Input for the Indonesian QA Data Collection Task

Each user was required to input about 200-250 Indonesian questions across 6 question types (person, organization, location, name, date and quantity). After manually eliminating exact duplicates, we gathered 3000 questions: 500 questions for each question type. Question examples are shown in Table 4.1.

Table 4.1: Examples of Collected Indonesian Questions
1. (Date) Mulai tanggal berapakah, PT Pertamina menurunkan harga Pertamax dari Rp 5.400 menjadi Rp 5.000? (When did PT Pertamina lower the Pertamax price from Rp 5,400 to Rp 5,000?)
2. (Location) Apa nama bandara di Yogyakarta? (What is the name of the airport in Yogyakarta?)
3. (Name) Apa nama film pertama Indonesia yang terpilih sebagai film terbaik internasional Festival Film Asia Pasifik 1970? (What is the first Indonesian movie chosen as the best international movie at the 1970 Asia Pacific Film Festival?)
4. (Organization) Badan apakah yang memiliki wahana antariksa Rosetta? (Which agency owns the Rosetta spacecraft?)
5. (Person) Siapakah pendiri Freedom Institute? (Who is the founder of the Freedom Institute?)
6. (Quantity) Berapakah temperatur rata-rata permukaan Mars? (What is the average temperature of the surface of Mars?)

Based on our observation of the Indonesian question collection, we divided Indonesian factoid questions into two general patterns, related to the question main word:
1. Questions with an explicitly written question focus. This is mostly the case for "what" and "which" questions, for example "Tanggal berapakah, negara TNRC diproklamasikan?" (On what date was TNRC proclaimed?). In this kind of question, the question focus can be located before or after the interrogative word, and it may or may not be preceded by a preposition, as in "Tanggal berapakah" (what date) and "Pada (preposition, on) tanggal (noun, date) berapakah (interrogative, what)" (on what date). The question focus is selected as the question main word. There is a special case for stop words such as nama (name), judul (title), induk (mother), etc., as in the question "Dengan nama apakah, Bandara Selaparang akan direlokasi ke Lombok Barat?" (Under what name will Selaparang airport be relocated to West Lombok?), where "Bandara" (airport) is selected as the question focus.

2. Questions with no explicit question focus. This usually applies to questions in which the interrogative word is preceded by a preposition, and to "who" questions. For example, in "Di manakah konser untuk mendiang Teguh Karya akan dilaksanakan?" (Where will the concert for the late Teguh Karya be held?), "where" is rendered as "di mana" ("di" is a preposition, translated as "in"; "mana" is an interrogative, translated as "which"), which can be expanded into "di kota mana" (in which city) or "di negara mana" (in which country). Another example is "Siapakah yang menerbitkan kartu asuransi kesehatan untuk korban diare?" (Who issued the health insurance card for diarrhea victims?). Such questions have no question focus, and the clue to the question category can lie in the verb, or in the nearest noun if the question has no verb. In the above examples, "dilaksanakan" (be held) and "menerbitkan" (issued) are selected as the question main word.

4.3.3 Other Data Resources (Indonesian-English Dictionary)

Besides our own collection, we also used an Indonesian-English dictionary, located at http://nlp.aia.bppt.go.id/kebi/, made by the Indonesian Agency for the Assessment and Application of Technology. It contains 29,054 Indonesian words together with POS information. In our observation, some words have incorrect POS information; therefore, we only used the dictionary to obtain POS information for nouns, verbs and adjectives, while for the other POSs (conjunction, preposition, pronoun, adverb) we made our own word lists (248 words in total). The POS tagger is described in Section 4.4.1.

4.4 QA System with a Machine Learning Approach

Our QA system structure is similar to that of other monolingual QA systems [48, 16], consisting of three main components (see Figure 4.2): a question analyzer, a passage retriever and an answer finder. The question analyzer extracts the question focus, the interrogative word (who, where, etc.), the question keywords and the question type. The passage retriever collects passages containing the question keywords. The answer finder locates the answer based on the question analyzer output and the passages given by the passage retriever. Each subsystem is described in the following sections, and a minimal sketch of how they compose is given below.
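The sketch below shows only the composition of the three components; the function bodies are placeholders for the SVM-based modules described in the following sections, not the actual implementations.

```python
# Minimal sketch of the three-component QA pipeline.

def analyze_question(question):
    # Would return the EAT (question class), main word, interrogative word,
    # phrase label and keywords; here only a trivial keyword split is done.
    return {"keywords": question.split(), "eat": "location"}

def retrieve_passages(keywords, corpus):
    # Would rank corpus passages; here any passage containing a keyword is kept.
    return [p for p in corpus if any(k in p for k in keywords)]

def find_answer(analysis, passages):
    # Would score answer candidates with the SVM answer finder.
    return passages[0] if passages else None

def answer(question, corpus):
    analysis = analyze_question(question)
    passages = retrieve_passages(analysis["keywords"], corpus)
    return find_answer(analysis, passages)

print(answer("Apa nama bandara di Yogyakarta?",
             ["Bandara Adisucipto terletak di Yogyakarta."]))
```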
4.4.2 Question Analyzer

Question Shallow Parser

The question analyzer consists of two components: a question shallow parser and a question classifier. The question shallow parser extracts the question keywords, the question main word, the interrogative word (who, where, etc.) and its phrase label (and also a preposition, if the phrase is a PP) from a question. The procedure of the question shallow parser is as follows:

1. Assign a POS tag to each word in the question.

2. Select the question main word based on the following rules (ordered by priority):
   • a noun preceding the interrogative word,
   • a verb occurring between the interrogative word and a noun,
   • a noun following the interrogative word.
   Note that words listed in a stop word list are not considered as the question main word.

3. Define a phrase label describing the position of the question main word and the interrogative word: NP/Noun Phrase (the question main word is a noun located after the interrogative word), PP/Preposition Phrase (the noun precedes the interrogative word and a preposition precedes the noun), VP/Verb Phrase (the question main word is a verb located after the interrogative word), NP-PREV (the question main word is a noun located before the interrogative word), or VP-PREV (the question main word is a verb located before the interrogative word).

4. Take all nouns and verbs as the question keywords.

An example of the shallow parser result is shown in Figure 4.4.
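The ordered main-word selection rules in step 2 can be expressed as a small function. This is a hedged sketch under the assumption that the tagger output is available as parallel token/tag lists; the names and representation are illustrative, not the thesis implementation.

```python
def select_main_word(tokens, tags, wh_index, stop_words):
    """Pick the question main word by the ordered rules of step 2.

    `tokens` and `tags` are parallel lists; `wh_index` is the position of the
    interrogative word; `stop_words` holds nouns like nama, judul, induk.
    """
    def usable_noun(i):
        return tags[i] == "noun" and tokens[i].lower() not in stop_words

    # Rule 1: a noun preceding the interrogative word (nearest first).
    for i in range(wh_index - 1, -1, -1):
        if usable_noun(i):
            return tokens[i]
    # Rule 2: a verb occurring between the interrogative word and a noun.
    for i in range(wh_index + 1, len(tokens)):
        if tags[i] == "noun":
            break
        if tags[i] == "verb":
            return tokens[i]
    # Rule 3: a noun following the interrogative word.
    for i in range(wh_index + 1, len(tokens)):
        if usable_noun(i):
            return tokens[i]
    return None
```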
Question Classifier

As mentioned in Section 4.1, we applied a machine learning approach to question classification. We used an SVM, as it has proven effective in the question classification domain. The first feature for question classification is the output of the shallow parser; essentially, it represents the words that are important for deciding the question class of a question sentence. It includes the interrogative word, the question main word (which could be a question focus, or a verb or noun following the interrogative word when there is no question focus), the phrase information and the preposition (if the phrase is a PP). Besides this feature, we also compared two features derived from the question focus: the simple rule-based class candidates (denoted C in the experimental results) and the bi-gram frequency (denoted P) between the nouns related to the main word and some defined preceding words. We also tried calculating the WordNet distance (denoted W) as a comparison method.

Simple Rule Based Class Candidates

We observed that, by applying some simple rules to the shallow parser result, it is possible to obtain the most likely question classes. For example, a question with "siapa" (who) as the interrogative word is mostly categorized as a person question, and a question with "kapan" (when) as the interrogative word is always a "date" question. Therefore, as the first additional attribute, we defined the most likely categories using simple rules; for example, if the interrogative word is "siapa" (who), then the candidates are person, organization, location and name. The complete rules are shown in Table 4.2.

Table 4.2: Rules for Defining Class Candidates

Interrogative Word   Preposition    Phrase   Class Candidates
kapan                -              -        date
berapa               -              -        date, quan, name
siapa                -              -        person, org, loc, name
mana                 di, ke, dari   -        loc, org
apa                  not blank      PN       loc, org, name, date
apa/mana             blank          NP       org, loc, name, person
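The rule table above translates naturally into a lookup function. The sketch below mirrors Table 4.2 as reconstructed here; the function name and the fallback of returning all six classes are illustrative assumptions.

```python
def class_candidates(wh_word, preposition, phrase):
    """Map shallow-parser output to the most likely EAT candidates (Table 4.2)."""
    if wh_word == "kapan":                                  # when
        return ["date"]
    if wh_word == "berapa":                                 # how many / how much
        return ["date", "quantity", "name"]
    if wh_word == "siapa":                                  # who
        return ["person", "organization", "location", "name"]
    if wh_word == "mana" and preposition in {"di", "ke", "dari"}:
        return ["location", "organization"]
    if wh_word == "apa" and preposition and phrase == "PN":
        return ["location", "organization", "name", "date"]
    if wh_word in {"apa", "mana"} and not preposition and phrase == "NP":
        return ["organization", "location", "name", "person"]
    # Fallback: no rule fired, so keep every class as a candidate.
    return ["person", "organization", "location", "name", "date", "quantity"]
```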
WordNet Distance

WordNet (http://wordnet.princeton.edu/) describes the semantic relations among words, and many studies have used it for question classification, though usually for English questions. Here, we tried to use WordNet for Indonesian questions, which requires a translation phase. The complete procedure is as follows:

1. Translate the selected nouns (related to the question main word) into English using the Indonesian-English KEBI dictionary. For a question with a question focus (the 1st pattern described in Section 4.3.2), the selected noun is the question focus. For a question without a question focus (the 2nd pattern described in Section 4.3.2), the system searches for possible nouns corresponding to the question main word. These are the nouns that most frequently co-occur with the question main word within a sentence window of the Indonesian corpus.

2. Calculate the WordNet depth (distance) between the translated nouns and some specified WordNet synsets (taken from the 25 noun lexicographer files in WordNet). The specified synsets are act, animal, artifact, attribute, body, cognition, communication, event, feeling, food, group, location, motive, object, person, phenomenon, plant, possession, process, quantity, relation, shape, state, substance, and time.

3. Include all these WordNet distances as additional attribute values.

For example, in the question "Di kota manakah, lokasi Bandara Supadio?" (In what city is Supadio airport located?), the main word "kota" is translated using the Indonesian-English dictionary into "city". In WordNet, the word "city" has 3 senses.

Figure 4.3: WordNet Information for the Word "City"

Figure 4.3 shows that "city" is separated by 5 synsets from "location" for the first sense, by 4 synsets from "location" for the second sense, and by 4 synsets from "group" for the third sense. After normalization, the WordNet distance scores of the word "city" are 0.64 and 0.36 for the "location" and "group" synsets, respectively. The 0.64 score for "location" results from (1/5 + 1/4) divided by the overall reciprocal distance (1/5 + 1/4 + 1/4).

For Indonesian questions, the WordNet strategy has problems with translation ambiguity and with OOV words (words not available in the Indonesian-English dictionary). Translation ambiguity is handled by using all distances in the additional feature, as mentioned in step 3 above; by using all calculated distances together with the shallow parser result, we assume that these attributes are adequate to describe the question intention. For borrowed OOV words, leaving the word untranslated lets the WordNet distance still work for English loanwords such as "distributor" in the sentence "Apa nama distributor rekaman CD acara festival Raum & Schatten di Berlin, untuk Indonesia?" (What is the name of the CD record distributor for the Raum & Schatten festival in Berlin?). However, this does not work for Indonesian common nouns and Indonesian proper names. For these other OOV words, we used a monolingual corpus to search for words similar to the main word and then calculated the WordNet distance for each similar word. The similar words are defined as nouns that share preceding words (among those listed in Appendix A) with the main word. For example, if the question focus "ibukota" (capital city) is an OOV word (a common Indonesian noun), then, to get its WordNet distance, we search the Indonesian corpus for its most frequent preceding words, such as "arah" (direction), "daerah" (region), "kawasan" (region), "ke" (to), etc. We then search for other nouns with similar preceding words. These nouns (such as "rumah" (house), "medan" (field), "kompleks" (complex, site), etc.) are taken as the words similar to the main word "ibukota". Finally, the WordNet distances between these nouns and the specified WordNet synsets are calculated, and the result is treated as the WordNet distance for the main word.
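The normalization used in the "city" example (reciprocal distances summed per target synset and divided by the overall sum) can be sketched as follows; the input representation of per-sense distances is an assumption for illustration.

```python
from collections import defaultdict

def normalized_wordnet_scores(sense_distances):
    """Turn per-sense (top_synset, distance) pairs into a normalized score
    per top synset, as in the "city" example above.
    """
    totals = defaultdict(float)
    for synset, distance in sense_distances:
        totals[synset] += 1.0 / distance       # reciprocal distance per sense
    overall = sum(totals.values())
    return {synset: score / overall for synset, score in totals.items()}

# The three senses of "city": two reach "location" (distances 5 and 4),
# one reaches "group" (distance 4).
print(normalized_wordnet_scores([("location", 5), ("location", 4), ("group", 4)]))
# -> {'location': 0.642..., 'group': 0.357...}, i.e. the 0.64 / 0.36 scores above
```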
Bi-gram Frequency

The last additional attribute is the frequency of the main word with some defined preceding words (bi-grams). The method is similar to the one used to tag the POS of OOV words with the Indonesian corpus (Section 4.4.1); the differences lie in the word list and the last processing step. The first step is to collect the defined preceding words by the following procedure:

1. List some words for each question category (person, name, organization, location, date, quantity). For example, words in the "person" category include presiden (president), guru (teacher), musisi (musician), dokter (doctor), penulis (writer), etc.

2. Search the corpus for the most frequent preceding words of the words from step 1. For example, some of the most frequent preceding words for "presiden" (president) are kandidat (candidate), sebagai (as), oleh (by), etc.

3. Manually select, from the words in step 2, those that can differentiate question categories. This step resulted in 6 word lists, one for each category.

The next step is to calculate the (normalized) bi-gram frequency of each question main word with the defined word lists. The word lists are shown in Table 4.3.

Table 4.3: Preceding Word Lists for Calculating the Bi-gram Frequency

Person: almarhum (deceased), arahan (guidance from someone), asosiasi (association of), atasan (higher person at work), bawahan (lower person at work), ditandatangani (signed by), era, foto (photo), kata/ungkap/ujar/ucap (say), kandidat (candidate), kediaman (home), kepergian (departure of a person), lanjut (continue speaking), mantan (retired person), mendiang (deceased), menjabat (hold position as), pembantu (like an assistant), pemberhentian (unemployment), pesan (message said by a person), pribadi (personally), profesi (profession), seorang (a, for a person)

Organization: anggota (member), antar (between), aturan (rule of), bawah (under), bersifat (characterized), kantor (office), kelompok (group), kinerja (achievement), keputusan (decision), koalisi (coalition), jaringan (network), manajemen (management), mekanisme (mechanism), pelatihan (training), pemimpin/pimpinan/ketua (leader), pengembangan (development), pengurus (committee), restrukturisasi (restructure), sanksi (sanction), secara (as), wadah (place to), validasi (validation)

Location: arah (direction), asal (come from), barat (west of), batas (limit of), daerah (area), dekat (near), geografis (geographic), ibukota (capital city), kawasan (region), ke (to), masuk (enter), menuju (heading to), pariwisata (tourism), pembangunan (development), perbatasan (border area), posisi (position), regional, sekitar (around), selatan (south of), seluas (as wide as), tanah (land of), teritorial (territorial), timur (east of), utara (north of)

Name: berdasarkan (based on), hasil (result), judul (title), karangan (paper), korban (victim), kotak (box), meluncurkan (launch), memaparkan (describe), membacakan (read), membuat (make), menerbitkan (publish), menyampaikan (deliver), meraih (get), pelaksanaan (implementation), peluncuran (launch), penerbitan (publisher), penjualan (sell), sebatang (a, for trees), seekor (a, for animals), sehelai (a, for something thin and light), sekeping (a, for something thin, small and not light), sekuntum (a, for flowers), seluruh (whole), seputar (around), sosialisasi (socialization), terjadi (happen), usai (after)

Date: akhir (end), awal (beginning), tengah (middle of), hingga (until), ketika (when), saat (when), waktu (when), sejak (since), selama (during), setiap (each), tiap (each)

Quantity: beberapa (a few), puluhan (tens), ratusan (hundreds), ribuan (thousands), jutaan (millions), belasan (teens), milyaran (billions), puluh, ratus, ribu, juta, belas, milyar

The procedure is then as follows:

1. For each question, calculate the word-pair frequency between the main word of the question and each listed word in Table 4.3. If the phrase of the question focus is not an NP, the system searches for possible nouns corresponding to the question focus; this strategy is the same as the first step of the WordNet distance calculation procedure.

2. Include all the frequencies as additional attributes.

We did not take only the single highest frequency; instead, we included all the frequencies, because a noun might have its most frequent preceding words spread across the 6 word lists.
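The per-category (number of words, normalized frequency) pairs that appear in Figure 4.4 could be computed along the following lines. The bi-gram table and word-list representations are illustrative assumptions, not the thesis code.

```python
def bigram_frequency_features(main_word, bigram_counts, category_words):
    """Compute a (count, normalized frequency) pair per question category.

    `bigram_counts` maps (prev_word, word) pairs to corpus frequencies;
    `category_words` maps each EAT to its preceding-word list (Table 4.3).
    """
    # Total frequency of anything appearing directly before the main word.
    total = sum(f for (prev, w), f in bigram_counts.items() if w == main_word)
    features = {}
    for category, words in category_words.items():
        hits = [w for w in words if bigram_counts.get((w, main_word), 0) > 0]
        freq = sum(bigram_counts[(w, main_word)] for w in hits)
        features[category] = (len(hits), freq / total if total else 0.0)
    return features
```

For the question in Figure 4.4, this would return location: (3, 0.0913) and (0, 0) for the other categories.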
For example, the question "Dimanakah ibukota Republik Turki Siprus Utara?" (Where is the capital city of the Turkish Republic of Northern Cyprus?) has "ibukota" as the main word. In the Indonesian corpus, among the listed words, 3 words ("kawasan", "ke" and "menuju") appear as bi-grams with "ibukota". The total frequency of these words (normalized by the frequency of all words appearing with "ibukota" as a bi-gram) is 0.0913. The complete bi-gram frequency scores are shown in Figure 4.4.

Question: Dimana ibukota Republik Turki Siprus Utara? (Where is the capital city of the Turkish Republic of Northern Cyprus?)
Question main word: ibukota (capital city)
Corpus search result: "kawasan ibukota" (location), "ke ibukota" (location), "menuju ibukota" (location)
SVM attributes:
1. Shallow parser result (SP)
   • interrogative word: dimana (where)
   • main word: ibukota (capital city)
   • phrase: NP
   • preposition: -
2. WordNet distance (WN), with OOV handling: act: 0.0432, animal: 0, artifact: 0.1724, attribute: 0.0253, body: 0.0052, cognition: 0.0306, communication: 0.017, event: 0.0007, feeling: 0, food: 0.0067, group: 0.0907, location: 0.2478, motive: 0, object: 0.0435, person: 0.011, phenomenon: 0, plant: 0.0019, possession: 0.0216, process: 0.0042, quantity: 0.043, relation: 0.009, shape: 0.0063, state: 0.055, substance: 0.016, time: 0.0013
3. Bi-gram frequency (PF), as (number of words, frequency) per category
   • person: 0, 0
   • location: 3, 0.0913
   • organization: 0, 0
   • date: 0, 0
   • quantity: 0, 0
   • name: 0, 0

Figure 4.4: Question Example with its Question Features

This method is able to handle OOV words, such as common nouns or proper names that have no translation in the Indonesian-English dictionary, which were a problem in the WordNet distance calculation; for example, the common noun "bandara" in "Apakah nama bandara di Pekanbaru?" (What is the name of the airport located in Pekanbaru?), or the proper name "Biodiesel" in "Apakah nama kimia untuk Biodiesel?" (What is the chemical name for Biodiesel?). For both nouns ("bandara" and "Biodiesel"), the WordNet distance approach without OOV handling could not give any additional information, because these words are listed neither in the Indonesian-English dictionary nor in WordNet itself. With the bi-gram frequency approach, we were still able to gain additional information that distinguishes the semantic information of "bandara" (a location) and "Biodiesel" (a name). Another advantage concerns the data resources needed: the WordNet distance approach needs a bilingual dictionary and a thesaurus, which demand expensive effort if either is not available, whereas the bi-gram frequency method only needs a monolingual corpus, which can be collected from the WWW.

4.4.3 Passage Retriever

Our passage retriever works in two steps. First, we collect the most relevant documents by executing a boolean query against a document retriever. Second, we select some passages (paragraphs) with the 3 highest IDF scores among the retrieved documents. For the document retriever module, we tried several methods:

1. Select documents with the 3 highest IDF scores.

2. Select documents with an IDF score larger than half of the highest IDF score.

3. Select documents with an IDF score larger than the lowest IDF score.

4. Use Estraier (http://estraier.sourceforge.net/) as the document retriever for Indonesian documents. Estraier is designed for English or Japanese; here, we used it for Indonesian. One difficulty we found is that some Indonesian words are treated as English stop words.
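The first three selection strategies can be sketched as one small function over (document, IDF score) pairs; this is illustrative, not the thesis implementation.

```python
def select_documents(scored_docs, method):
    """Select candidate documents from (doc_id, idf_score) pairs."""
    if not scored_docs:
        return []
    scores = [s for _, s in scored_docs]
    if method == "top3":
        # Keep documents holding one of the 3 highest distinct IDF scores.
        cutoffs = sorted(set(scores), reverse=True)[:3]
        return [d for d, s in scored_docs if s in cutoffs]
    if method == "half_of_top":
        # IDF score larger than half of the highest IDF score.
        return [d for d, s in scored_docs if s > max(scores) / 2]
    if method == "above_lowest":
        # IDF score larger than the lowest IDF score.
        return [d for d, s in scored_docs if s > min(scores)]
    raise ValueError(method)
```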
We found that some queries contain words that do not exist in the correct passage. These words are usually synonyms of words in the passage. For example, the term "PT Bursa Efek Surabaya" in the query "Siapakah direktur PT Bursa Efek Surabaya saat ini?" (Who is the current director of PT Bursa Efek Surabaya?) does not exist in the correct passage "Direktur BES Guntur Pasaribu mengatakan . . . " (Director of BES Guntur Pasaribu said . . . ); here "BES" is a synonym of "PT Bursa Efek Surabaya". To handle this, we extracted synonyms from the Indonesian corpus using a "word (synonym)" pattern. For each word in the synonym list, if the word occurs in the question, its synonym is added to the question with the "or2" operator[32] (for example, the synonyms "BES" and "PT Bursa Efek Surabaya" are composed into the boolean query "BES" or2 ("PT" and "Bursa" and "Efek" and "Surabaya")), and the new keyword list is executed against the document and passage retriever.

4.4.4 Answer Finder

To locate an answer, an SVM algorithm is employed to classify each word in the corpus as "I" if the word is part of the answer but not its beginning, as "B" if the word is the beginning of the answer, and as "O" if the word is not part of the answer. Yamcha[20] is used as the text chunking software for the answer finder. The complete features for the SVM are as follows:

1. Question features. These are almost the same as the question classification features: the interrogative word, the question focus, the phrase of the question focus, the preposition, the bi-gram frequency of the question focus and the 4 words around the interrogative word.

2. Expected question class. The expected question class is produced by the question classification system with the following features: bag of words, shallow parser result and bi-gram frequency scores.

3. Document features. Each word in the document is complemented with the features (lexical form, POS information, orthographic information and lexical similarity with the question keywords) of the n preceding and n following words (where n is the width of the word window), as well as the features of the current word itself (lexical form, POS information, orthographic information, lexical similarity with the question keywords, and bi-gram frequency). The features of the n preceding words also include their I/B/O information.

For the training data, the document features are built automatically from the correct passage (where the answer is tagged with <A> and </A>). Each word is annotated with its POS, its orthographic information, its lexical similarity score (1 if the word exists in the question, 0 otherwise), its bi-gram frequency (as in the question classification task), and its I/B/O information (every word has the "O" value except words within the answer tags: "B" for the word right after the <A> tag and "I" for the remaining answer words). The testing data are built in the same way except for the I/B/O information: in the testing data, all words initially have the "O" value. Figure 4.5 shows an example of a question along with its question features, its expected question class and its document features.

Question: Dimana ibukota Republik Turki Siprus Utara? (Where is the capital city of the Turkish Republic of Northern Cyprus?)
Question features (QF):
   • Interrogative word: dimana
   • Main word: ibukota
   • Phrase: NP
   • Preposition: -
   • Bi-gram frequency: see Figure 4.4
   • 4 surrounding words: -, -, ibukota, Republik
Expected Answer Type (EAT): location
Correct passage (for training): Di tengah lembah itu pula terdapat ibukota Nicosia yang juga terbagi dua antara Turki dan Yunani. Warga Siprus Turki menyebut Nicosia dengan nama Lefkosa. . .
Document features (DF) for the word "ibukota":
   Features for the 2 preceding words in the passage ". . . pula terdapat ibukota . . . ":
   • "pula": lexical: pula; POS: adverb; orthographic: alphabetic; lexical similarity: 0; I/B/O information: O
   • "terdapat": lexical: terdapat; POS: verb; orthographic: alphabetic; lexical similarity: 0; I/B/O information: O
   Features for the 2 following words in the passage ". . . ibukota Nicosia yang . . . ":
   • "Nicosia": lexical: Nicosia; POS: noun; orthographic: capitalized alphabetic; lexical similarity: 0
   • "yang": lexical: yang; POS: conjunction; orthographic: alphabetic; lexical similarity: 0
   Features for the current word "ibukota": lexical: ibukota; POS: noun; orthographic: alphabetic; lexical similarity: 1; bi-gram frequency: see Figure 4.4

Figure 4.5: Features for the SVM-based Answer Finder
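Deriving the I/B/O labels from an <A>-tagged passage, as described for the training data, could look like the following sketch; the tokenization is simplified to whitespace splitting plus the answer tags, which is an assumption for illustration.

```python
import re

def ibo_labels(tagged_passage):
    """Create (word, label) training pairs from a passage in which the
    answer is wrapped in <A>...</A>.
    """
    pairs = []
    inside, first = False, False
    # Match an answer tag, or a run of characters that is neither space nor '<'.
    for token in re.findall(r"</?A>|[^\s<]+", tagged_passage):
        if token == "<A>":
            inside, first = True, True
        elif token == "</A>":
            inside = False
        elif inside:
            pairs.append((token, "B" if first else "I"))
            first = False
        else:
            pairs.append((token, "O"))
    return pairs

print(ibo_labels("terdapat ibukota <A>Nicosia</A> yang"))
# -> [('terdapat', 'O'), ('ibukota', 'O'), ('Nicosia', 'B'), ('yang', 'O')]
```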
4.5 Experimental Result

4.5.1 Question Classifier

In the experiments, an SVM algorithm is employed with a linear kernel and the "string to word vector" function to process string values, both available in the WEKA software [50]. 10-fold cross validation is used to calculate accuracy (i.e., 2,700 questions for training and 300 questions for testing in each fold). The baseline is the bag-of-words attribute. For a machine learning comparison, three algorithms (C4.5, K Nearest Neighbor (kNN) and SVM) were run with the bag-of-words feature. The question classification results for the 6 classes (date, location, name, organization, person and quantity) are shown in Table 4.4. The highest score is achieved by the SVM algorithm.

Table 4.4: Accuracy Scores of Several Machine Learning Algorithms with the Bag-of-Words Attribute in Indonesian Question Classification

Method              Accuracy Score
C4.5                87.23%
K Nearest Neighbor  69.30%
SVM                 91.97%

Table 4.5 shows the question classification results without handling the OOV problem. Here, we compared the best baseline result with the proposed attributes: the shallow parser result (S), the simple rule-based class candidates (C), the WordNet distance without OOV handling (W), the bi-gram frequency (P), combinations of the additional attributes (C+W, C+P, W+P and C+W+P), and their combinations with the bag of words (B). From Table 4.5, we can see that using only several important words (S) gives a higher score than using all the words in the question (B; baseline), improving the accuracy by about 2.45%. A t-test for all proposed attributes (the combinations of S with the other additional attributes C, P and W) against the baseline (B) showed that the improvement is significant; all p-values are lower than 0.025. Among all combinations with the additional attributes, using P gave a higher accuracy than using C or even the WordNet distance (W). We assume that the bi-gram frequency (P) outperformed the WordNet distance (W) because of the OOV words in the WordNet distance attribute (question main words that were not available in the Indonesian-English dictionary).
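The significance test reported in Table 4.5 can be reproduced in spirit as a paired one-tailed t-test over per-fold accuracies. The sketch below uses SciPy and hypothetical fold scores; the thesis's exact test procedure may differ in detail.

```python
from scipy import stats

def one_tailed_improvement_p(baseline_folds, proposed_folds):
    """Paired one-tailed t-test over per-fold accuracies.

    Assumes both lists come from the same 10 cross-validation folds.
    """
    t, p_two_tailed = stats.ttest_rel(proposed_folds, baseline_folds)
    # One-tailed p-value: halve the two-tailed value when the mean improves.
    return p_two_tailed / 2 if t > 0 else 1 - p_two_tailed / 2

# Hypothetical per-fold accuracies for the baseline (B) and for S+P.
baseline = [0.91, 0.92, 0.93, 0.92, 0.91, 0.93, 0.92, 0.92, 0.91, 0.93]
proposed = [0.95, 0.96, 0.95, 0.95, 0.96, 0.95, 0.96, 0.95, 0.95, 0.96]
print(one_tailed_improvement_p(baseline, proposed))
```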
Table 4.5 also shows the accuracy when the proposed attributes are joined with the bag of words (B). Joining the proposed attributes with B mostly gives a lower score than without B, except for combinations with the S and P attributes (such as B+S+P, B+S+C+P, etc.) and for combinations of four or more attributes. This shows that B cannot serve as a proposed attribute; we assume this is because the large amount of data in the B attribute can wash out the correct patterns yielded by the combinations without B.

Table 4.5: Accuracy Scores of Indonesian Question Classification using the SVM Algorithm with Various Attributes (p-values are from a one-tailed t-test at the 95% confidence level, compared with the baseline)

Method                   Accuracy Score   p-value
Baseline (bag of words)  91.97%           -
S                        94.43%           7.34E-05
B+S                      92.60%           0.179000
B+C                      92.20%           0.368950
B+W                      94.17%           0.000397
B+P                      95.37%           3.12E-08
S+C                      94.57%           2.91E-05
S+W                      94.83%           3.83E-06
S+P                      95.47%           1.12E-08
B+S+C                    92.50%           0.220170
B+S+W                    94.73%           8.42E-06
B+S+P                    95.80%           2.75E-10
S+C+W                    94.77%           6.50E-06
S+C+P                    95.37%           3.12E-08
S+W+P                    95.53%           5.54E-09
B+S+C+W                  94.83%           3.83E-06
B+S+C+P                  96.03%           1.54E-11
B+S+W+P                  96.07%           9.99E-12
S+C+W+P                  95.50%           7.90E-09
B+S+C+W+P                96.20%           2.64E-12

Table 4.6 shows the accuracy of question classification when the OOV problem is handled in the WordNet distance calculation. As mentioned in Section 4.4.2, OOV words are treated by finding words of a similar category to the OOV word. Compared to the version without OOV handling (Table 4.5), the result improved by about 0.27%.

Table 4.6: Accuracy Scores of Indonesian Question Classification using the WordNet Distance Attribute with OOV Handling (W')

Method               Accuracy Score   p-value
B + W'               94.53%           3.69E-05
S + W'               95.10%           3.94E-07
B + S + W'           94.83%           3.83E-06
S + C + W'           95.13%           2.91E-07
S + W' + P           95.53%           5.54E-09
B + S + C + W'       94.97%           1.27E-06
B + S + W' + P       96.17%           2.64E-12
S + C + W' + P       95.50%           7.90E-09
B + S + C + W' + P   96.20%           1.68E-12

Even though the simple rule-based class candidate attribute shows the lowest accuracy score overall (Table 4.5), for some categories, such as "name" and "quan", it achieved the highest accuracy among all attributes. We therefore ran another experiment combining the bi-gram frequency with the rule-based class candidates: we check whether the category label of each bi-gram frequency is listed in the class candidates, and if it is not listed, its frequency score is set to 0. This feature is labeled P'. The accuracy scores are shown in Table 4.7. All combinations with the P' attribute give equal or higher scores, except B+S+W'+P' (lower than B+S+W'+P and also lower than B+S+C+W'+P), where W' denotes the WordNet distance feature with OOV handling. We assume this is again caused by the B attribute, which may contain data that eliminates the correct patterns produced by the combinations without B.

Table 4.7: Accuracy Scores of Indonesian Question Classification using the Bi-gram Frequency Limited by the Simple Rule Class Candidates (P')

Method             Accuracy Score   p-value
B + P'             95.63%           1.86E-09
S + P'             95.60%           2.69E-09
B + S + P'         95.80%           2.75E-10
S + W + P'         95.50%           7.90E-09
S + W' + P'        95.53%           5.54E-09
B + S + W + P'     96.10%           6.45E-12
B + S + W' + P'    96.10%           6.45E-12

To keep the question classifier easy to adapt to other languages, the answer finder uses the question class produced by the S (shallow parser) + P' (bi-gram frequency limited by the simple rule class candidates) features.
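The P' combination reduces to masking the bi-gram frequency scores with the rule-based candidate set. A minimal sketch, reusing the (count, frequency) pairs from the earlier bi-gram example; names are illustrative:

```python
def limited_bigram_features(bigram_features, candidates):
    """The P' feature: zero out any bi-gram frequency whose category is
    not among the rule-based class candidates (C).
    """
    return {
        category: value if category in candidates else (0, 0.0)
        for category, value in bigram_features.items()
    }

# E.g. for "siapa" (who) the candidates exclude "date" and "quantity",
# so those two bi-gram scores are suppressed.
```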
4.5.2 Passage Retriever

Table 4.8 shows the experimental results of the passage retriever (over 71,109 documents) with the methods described in Section 4.4.3. The evaluation scores (precision, recall and F-measure) were calculated by comparing the retrieval results with the correct passages. This is actually not fully accurate, because there are also passages containing the correct answer that are not marked as correct passages: in the question data collection process, each user only wrote the question, its answer and the passage being read, and did not consider that other passages in other documents could also contain the correct answer.

Table 4.8: Passage Retriever Accuracy

Document Retriever    #Correct Passage  #Retrieved Passage  Prec    Recall  F-measure
3 highest IDF scores  2,551             13,556              0.1882  0.8503  0.3082
>highest IDF / 2      2,672             29,160              0.0916  0.8907  0.1662
>lowest IDF           2,860             466,141             0.0061  0.9533  0.0122
Estraier              2,409             65,660              0.0367  0.8030  0.0702

The results show that the IDF score is more suitable for the passage retrieval task than the TFxIDF score (employed by Estraier). The 6% recall reduction from 0.95 (>lowest IDF) to 0.89 (>highest IDF/2) indicates that the passage retrieval method still has to be improved to reduce this loss.

4.5.3 Answer Finder

In the answer finder, 2,700 questions were used as training data (139,851 instances) and 300 questions as testing data (50 questions for each question type). Yamcha (Kudo and Matsumoto, 2000) was used as the SVM-based text chunking software. To evaluate the answer finder, we calculated the accuracy of exact and partial answers along with the MRR score. The evaluation scores are Top-Exact1 (the first answer found is exactly the same as the correct answer), Top-Exact-n (the correct answer exists among the top n answers found), Top-Partial1 (the first answer found partially contains the correct answer, or the other way around; for example, the answer finder returns "August 2005" while the correct answer is "12 August 2005"), Top-Partial-n (one of the top n answers found is a partial answer) and MRR (Mean Reciprocal Rank: the average of the reciprocal rank 1/n, where n is the rank of the correct answer in the answer finder result).
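These scores can be computed from ranked answer lists as in the following sketch; the data representation is an illustrative assumption, and the partial-match variants would use a containment test instead of equality.

```python
def evaluate(answers_per_question, n=5):
    """Compute Top-Exact1, Top-Exact-n and the exact MRR.

    Each item of `answers_per_question` is (ranked_answers, correct_answer).
    """
    top1 = topn = mrr = 0.0
    for ranked, gold in answers_per_question:
        ranks = [i + 1 for i, a in enumerate(ranked[:n]) if a == gold]
        if ranks:
            topn += 1
            mrr += 1.0 / ranks[0]       # reciprocal rank of the first hit
            if ranks[0] == 1:
                top1 += 1
    total = len(answers_per_question)
    return top1 / total, topn / total, mrr / total
```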
Some examples of the answer finder results are shown in Table 4.9. The main reason for the incorrect answers is that the correct passage was not retrieved: even when the correct documents were retrieved, not all correct passages were.

Table 4.9: Examples of Answer Finder Results

Correct result 1:
   Question: Mulai tanggal berapa, PT Pertamina menurunkan harga Pertamax dari Rp 5.400 menjadi Rp 5.000? (When did PT Pertamina lower the Pertamax price from Rp 5,400 to Rp 5,000?)
   Passage: . . . Mulai 1 Januari 2006, PT Pertamina kembali menurunkan . . . Harga Pertamax dari Rp 5.400 turun menjadi . . .
   Answer: 1 Januari 2006 (January 1, 2006)

Correct result 2:
   Question: Dimanakah percobaan pembunuhan Hosni Mubarak yang paling terkenal dan berbahaya? (Where was the most famous and most dangerous assassination attempt on Hosni Mubarak?)
   Passage: . . . yang paling terkenal dan berbahaya adalah percobaan pembunuhan di Addis Ababa, Etiopia, Juni 1995, . . .
   Answer: Addis Ababa Etiopia (Addis Ababa, Ethiopia)

Incorrect result 3:
   Question: Di benua manakah, Negara Zambia berada? (On what continent does Zambia lie?)
   Correct passage: . . . Afrika kehilangan 272 . . . Zambia juga . . .
   Retrieved passage: . . . John Howard yang menyatakan akan menyerang negara-negara yang menjadi sarang teroris di Asia . . .
   Correct answer: Afrika (Africa)
   Answer results: Asia Tenggara (Southeast Asia), DPR (Indonesian legislative council), etc.

Incorrect result 4:
   Question: Apa nama mata uang Thailand? (What is Thailand's currency?)
   Correct passage: Di Thailand, harga minyak naik 26 persen selama 2005 ini menjadi 26,5 baht (Rp 6.500) per liter.
   Retrieved passage: . . . nilai tukar rupiah terhadap dollar AS . . .
   Correct answer: Baht
   Answer results: Dollar AS (US Dollar), Euro, Malaysia, etc.

The accuracy scores of the answer finder are shown in Table 4.10 and Table 4.11. QC denotes the predicted question class (the question classification result), DF the document features and QF the question features; multi-IBO means that, instead of only 3 classes (I, B, O) for each word, 18 classes are used (I-date, B-date, O-date, I-location, B-location, O-location, etc.). A word window size of 21 means that 10 preceding and 10 following words are used for each word in the passage; a window size of 11 means 5 preceding and 5 following words.

Table 4.10 shows the question answering accuracy using the QC and DF features for various passage retrieval methods and various word window sizes. All results in Table 4.10 show that the "3 Top IDF" and ">highest IDF/2" document retrieval methods achieve higher accuracy than Estraier (TFxIDF). The highest accuracy is achieved with the 11-word window size (5 preceding and 5 following words) combined with the "3 Top IDF" or ">highest IDF/2" document retrieval. Based on this result, we used the 11-word window size and these two document retrieval methods for the other feature combinations in the question answering module, whose results are shown in Table 4.11. The best result in Table 4.11 is 0.59 for Top-n and 0.52 for MRR. This score is better than that of [38], which conducted Japanese QA with a similar answer finder module; the result in [38] was 0.47 for Top-5 and 0.36 for MRR.
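The word windows behind Tables 4.10 and 4.11 correspond to a simple slice around each passage word. A minimal sketch, showing lexical features only, whereas the real system also attaches POS, orthographic and similarity information per window word:

```python
def window_features(words, index, half_width):
    """Collect the surrounding-word features for one passage word.

    An 11-word window corresponds to half_width=5 (5 preceding and 5
    following words around the current one).
    """
    start = max(0, index - half_width)
    end = min(len(words), index + half_width + 1)
    return {
        "current": words[index],
        "before": words[start:index],
        "after": words[index + 1:end],
    }
```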
Table 4.10: Question Answering Accuracy with QC (Question Class Features) and DF (Document Features) for Various Passage Retrieval Methods and Various Word Window Sizes

Word window size: 7 = 3+1+3
Method           Exact1  Exact-n  Partial1  Partial-n  MRR Exact  MRR Partial
3 Top IDF        0.38    0.49     0.44      0.55       0.43       0.49
>highest IDF/2   0.37    0.50     0.44      0.56       0.43       0.50
>lowest IDF      0.22    0.44     0.27      0.52       0.31       0.38
Estraier         0.33    0.48     0.40      0.55       0.40       0.47

Word window size: 11 = 5+1+5
3 Top IDF        0.44    0.55     0.50      0.64       0.49       0.56
>highest IDF/2   0.43    0.57     0.50      0.66       0.49       0.57
>lowest IDF      0.31    0.49     0.35      0.59       0.39       0.45
Estraier         0.37    0.52     0.44      0.61       0.43       0.51

Word window size: 21 = 10+1+10
3 Top IDF        0.40    0.52     0.49      0.61       0.46       0.54
>highest IDF/2   0.38    0.52     0.47      0.63       0.44       0.54
>lowest IDF      0.25    0.48     0.32      0.59       0.35       0.44
Estraier         0.35    0.49     0.43      0.59       0.42       0.51

Table 4.11: Question Answering Accuracy for Several Feature Variations (Multiple IBO Class, QF+DF, QC+QF+DF) with 2 Passage Retrieval Methods ((1) 3 Top IDF and (2) IDF score >highest IDF/2) and an 11-Word Window Size

Method                  Exact1  Exact-n  Partial1  Partial-n  MRR Exact  MRR Partial
Multiple IBO class (1)  0.38    0.54     0.50      0.66       0.45       0.57
Multiple IBO class (2)  0.37    0.53     0.46      0.65       0.44       0.55
QF+DF (1)               0.44    0.55     0.50      0.60       0.49       0.55
QF+DF (2)               0.43    0.56     0.50      0.61       0.49       0.55
QC+QF+DF (1)            0.46    0.58     0.54      0.66       0.51       0.59
QC+QF+DF (2)            0.46    0.59     0.54      0.67       0.52       0.60
Oracle (QC+QF+DF)       0.57    0.59     0.61      0.64       0.58       0.63

4.5.4 Using NTCIR 2005 (QAC and CLQA Data Set)

To compare the QA performance with a formal data set, we selected from NTCIR 2005 (the QAC and CLQA tasks) the questions whose answers are available in our Indonesian corpus. Twelve questions could be used for this purpose. These 12 questions were translated into Indonesian and then used as input questions for the Indonesian QA. The experimental results are shown in Table 4.12.

Table 4.12: Question Answering Accuracy for 12 Questions taken from QAC and CLQA1 at NTCIR 2005

Description                                      Accuracy
Question Classification                          100%
Passage Retrieval (>maximum IDF/2)               Recall = 67%
Answer Finder (QC+QF+DF, 11-word window size)    Top1 = Top5 = MRR = 0.33

The main reason for the low passage retrieval result is that some question keywords are not present in the correct article. In some cases there are unimportant nouns or verbs in the questions; in other cases the question keyword is not lexically equal to the word in the correct article (the question keyword is a synonym of it). Neither problem is handled by our Indonesian passage retriever; in future work, we plan to improve the passage retrieval module to cope with them. In the answer finder, the performance is half of the passage retrieval result. This achievement is comparable with the results shown in Table 4.11, such as 0.57 for the oracle experiment (100% passage retrieval recall) or 0.46 for the 0.89 recall score of the passage retrieval.

4.6 Adopting the QA System for Other Languages

Two main things must be prepared to build a similar QA system for another resource-limited language: the language resources and the components of the QA system.
The minimum required language resources include a text corpus, a question-answer set (along with the answer-tagged passages) and a POS resource (or POS tagger). The text corpus can be downloaded from the internet in less than about two weeks. For the question-answer set, as mentioned before, we collected about 3,000 question-answer pairs input by 18 Indonesian native speakers; this process took about 2 months. We believe this time can be shortened with more users or a better managed process. In this first QA system of ours, we made the mistake of not asking the users to type and tag the answer in the relevant passages, which cost additional time for collecting (semi-automatically) the correct relevant passages for each question-answer pair. For the POS resource, we used an Indonesian-English dictionary (29,054 Indonesian words), where each line contains the Indonesian word, its POS information (noun, verb, adjective, adverb, etc.) and its English translations. Some words were input manually for certain POSs only (conjunction, preposition, pronoun, adverb). POS ambiguity is not handled: if more than one POS is available for a word, the first candidate written in the Indonesian-English dictionary is chosen. Even though this POS tagger is simple, the QA gave a higher accuracy result than another QA system [38] that used a much better POS resource.

Like other QA systems, this QA system consists of 3 components: the question analyzer (Section 4.4.2), the passage retriever (Section 4.4.3) and the answer finder (Section 4.4.4). Almost all of the components are machine learning based, except for the question shallow parser. In the question shallow parser, the question is not transformed into a tree-structured sentence; we simply extract some information, namely the interrogative word, the keywords (nouns, verbs and adjectives) and the question main word. The procedure that decides the question main word depends on the grammar of the source language; however, the rules to extract it (based on the word's POS and position) are relatively simple and are listed in Section 4.4.2. Even if one cannot develop a shallow parser, the QA system will still work by using the B+C features for question classification (results are shown in Table 4.5).

For the question classification task, the possible features are B (bag of words), S (shallow parser result), C (simple class candidates), W (WordNet distance) and P (bi-gram frequency); the C, W and P features are explained in Section 4.4.2. We believe that preparing these features does not take much time. For the C feature, the class candidates are decided by simple rules, as shown in Section 4.4.2. For the W feature, the WordNet distance scores are easily calculated between the 25 nouns (lexicographer files) and the (bilingual-dictionary-based) translations of the nouns related to the question main word. For the P feature, the scores are taken from the bi-gram frequency between the nouns related to the question main word and the word list of each category; defining the word list for each category is done semi-automatically and takes less than about a week.
In the proposed method (as in the final experiment), we did not use the WordNet distance (W) feature in the question classification; we only used the B, S and P' (bi-gram frequency scores limited by the C simple class candidates) features. This means that neither WordNet nor an Indonesian-to-English translation step is needed for question classification. For the passage retriever (Section 4.4.3), the programming and execution phase takes about 2 weeks. For the answer finder, one needs to prepare the data required by the machine learner (e.g., Yamcha), which includes QC (the question class, i.e., the result of the question classification task), QF (the question features, i.e., the result of the question shallow parser) and DF (the document features). As mentioned in Section 4.4.4, the training data are prepared automatically from the tagged correct passages, where all words carry the "O" flag for the I/B/O information except for the answer ("B" for the first word of the answer, "I" for the rest). In the testing data, all words carry the "O" flag. The other features (orthographic information, lexical similarity score, bi-gram frequency) can be produced easily.

4.7 Conclusions

We have built an Indonesian question answering system, including the data collection (questions and documents) and a full QA system. The Indonesian QA data collection consists of 3,000 factoid questions, collected from 18 Indonesian native speakers, in 6 question classes: person, organization, location, name, date and quantity. The document collection was downloaded from an Indonesian newspaper website and joined with an existing document collection, yielding around 71,109 articles. The QA system consists of 3 components: question analyzer (question shallow parser and question classifier), passage retriever and answer finder. For the question classifier, this system shows that using features extracted from a monolingual corpus improves the classification result compared to a bag-of-words approach. As for the question answering result, we also obtained a good MRR score, considering that no high-quality language tools were involved in the process. Compared to [38], our system achieved a higher score for exact answers even though we used only the bi-gram frequency feature, while the other system used rich POS information produced by a morphological analyzer. This system can be easily adapted to other resource-limited languages; the procedure to apply the QA system to another resource-limited language (with minimum programming effort) is described in Section 4.6.

Chapter 5

Indonesian-English CLQA

5.1 Introduction

Recently, CLQA (Cross Language Question Answering) systems have become an area of much research interest. The CLEF (Cross Language Evaluation Forum) has conducted a CLQA task since 2003[24], using English target documents and Italian, Dutch, French, Spanish and German source questions. Indonesian-English CLQA has been one of the more recent goals of CLEF since 2005[46]. NTCIR (NII Test Collection for IR Systems) has also been active in its own CLQA efforts since 2005[39], providing Japanese-English, English-Japanese, Chinese-English and English-Chinese data. In CLQA, the answer to a given question in a source language is searched for in documents of a target language, and accuracy is measured by the retrieved correct answers. The translation phase of CLQA makes answer searching more difficult than in monolingual QA, for the following reasons[39]:
1. Translated questions are expressed differently from the expressions used in the news articles in which the answers appear;

2. Since the keywords for retrieving documents are translated from the original question, document retrieval in CLQA becomes much more difficult than in monolingual QA.

Most common approaches to CLQA use a four-module system: question analyzer, translation module, passage retriever and answer finder. With these four modules, several basic schemas can be proposed for a CLQA system:

1. The question sentence in the source language is translated into the target language as a complete question sentence. This translated question sentence is then processed by a monolingual QA system for the target language; the question analyzer, passage retriever and answer finder all operate in the target language.

2. The EAT is first determined by a source language question analyzer. The question sentence is then translated into the target language; the translation output can be a list of keywords or a complete question sentence. The translation results are used to retrieve passages, and the answers are located by matching the passages, the keywords and the EAT.

3. The documents are translated into source language documents, so that the answer can be located by a monolingual QA system for the source language.

4. The steps from question analysis to passage retrieval are the same as in the second schema. The passage retrieval results (in the target language) are then translated into the source language, and the answer finding is conducted by a source language QA system.

Researchers usually select one of the above schemas based on the monolingual QA system they have: with a monolingual QA in the target language, the first or second schema can be chosen; with a monolingual QA in the source language, the third or fourth is suitable. In the above schemas, the translation can be applied to questions, passages or documents. In CLQA, translation quality is an important factor in achieving an accurate QA system, so the quality of the machine translation system or dictionary used is very important. Even though we have a monolingual Indonesian QA system, due to the limited support resources for the translation module we selected the second schema. The second schema minimizes the drawbacks of the translation process, since the translation is done on the extracted keywords rather than on a full sentence, as in question translation or passage translation. As for the passage retriever and answer finder modules for the target language, by following the same approach as the monolingual QA, these modules can be built easily, with little programming effort.

In the Indonesian language, a number of translation resources are available[33], such as an online machine translation system (Kataku) and an Indonesian-English dictionary (KEBI). Previous work on Indonesian-Japanese CLIR[33] shows that using a bilingual dictionary in a transitive approach can achieve a higher retrieval score than using a machine translation tool. Other work on the Indonesian language[4], however, showed that using an online dictionary available on the Internet gave a lower IR (Information Retrieval) performance score than the available machine translation, because of the limited vocabulary size of the online dictionary.
As for CLQA research, the best performance in Japanese-English CLQA at NTCIR 2005 was obtained by a study[18] using three dictionaries (EDICT, 110,428 entries; ENAMDICT, 483,691 entries; and an in-house translation dictionary, 660,778 entries) that qualify as high data resources. The accuracy was 31.5% on the first exact answer, outperforming the other submitted runs, such as the second-ranked run with 9% accuracy. For Indonesian-English CLQA, there have been two trials[3, 49] conducted on the CLEF data set. Both systems used a machine translation tool to translate the Indonesian questions into English. One system [3] used commercially available Indonesian-English machine translation software (Transtool) and obtained only 2 correct answers out of 150 factoid questions. The other system [49] used online Indonesian-English machine translation software (Kataku) and obtained only 14 correct answers out of 150 factoid questions.

Unlike both Indonesian-English CLQA systems mentioned above, in this study we utilize the Indonesian-English KEBI[17] dictionary (29,047 word entries) and combine the translation results in a boolean query to retrieve the relevant passages. Also unlike our attempt at Indonesian-Japanese CLIR described in Chapter 3, we do not select the best translated keyword; we use all translated keywords in a boolean query to retrieve the English passages.

As mentioned in Chapter 4, there are several approaches to locating an answer in retrieved passages. In one commonly used approach, the documents are tagged with named entities, and matching is then conducted between each named entity and the EAT. This method is used in many CLQA systems, such as [18], which achieved the best submitted run for the Japanese-English CLQA task at NTCIR 2005; that work matched the EAT against the named entities produced by a handcrafted rule-based NE recognizer for English documents (669 rules for 143 answer types). Apart from the named entity approach, there is another approach, proposed by [37], in which a statistical method is used to locate the answer. This approach is called the experimental Extended QBTE (Question-Biased Term Extraction) model, an extension of the QBTE model used in Japanese monolingual QA[38]; the method is explained in Section 5.2. Unfortunately, this approach achieved only 2 correct answers out of 200 factoid questions in the NTCIR 2005 CLQA task. In our Indonesian-English CLQA system, we adopt the text chunking method to locate an answer. The English documents are only POS-tagged by an available POS tagger[40]. The features for the text chunking method are similar to those of the Indonesian monolingual QA, with several differences (see Section 5.5). There are some differences between our approach and the one proposed by [37]; one important difference is the use of the EAT. We choose to use the EAT yielded by the question analyzer for several reasons described in Section 5.5.1.
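Composing a boolean query that keeps all dictionary translations of each keyword, rather than selecting a single best translation, can be sketched as follows; the query syntax and the pass-through behavior for untranslatable words are illustrative assumptions.

```python
def boolean_query(keywords, dictionary):
    """Build an English boolean query from Indonesian keywords using a
    bilingual dictionary, joining all translation alternatives with OR.
    """
    clauses = []
    for kw in keywords:
        # Untranslatable words (e.g. proper names) pass through unchanged.
        translations = dictionary.get(kw, [kw])
        clauses.append("(" + " OR ".join(translations) + ")")
    return " AND ".join(clauses)

# boolean_query(["kota", "universitas"],
#               {"kota": ["city", "town"], "universitas": ["university"]})
# -> '(city OR town) AND (university)'
```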
5.2 Related Works

There are three prior works on Indonesian-English CLQA. The first system [3] was built for CLEF 2005, the second [49] for CLEF 2006 and the third [1] for CLEF 2007. All three adopted the second schema mentioned in Section 5.1. First, all systems classified the EAT of the Indonesian question using several rules. The Indonesian question was then translated into English using an available machine translation system (Transtool∗ in [3]; Kataku† in [49] and [1]). The translated questions were then executed against an existing search engine (Lemur‡). The retrieved English passages were tagged by existing tools (MontyTagger§ in [3]; Gate¶ in [49] and [1]). The answers were identified by rules matching the named entities in the passage against the EAT and using the distance between the answer candidate and the words matching the question keywords. The system in [1] additionally scored each answer candidate by executing the query on Google and using the word frequency as the candidate's weight.

In the CLEF 2005 Indonesian-English CLQA task, [3] yielded 2 correct answers out of 200 questions (150 factoid questions and 50 definition questions). Their analysis mentioned that one reason for the poor result was that the named entity tagger did not provide specific enough tag information for the passages; for example, an NN tag could correspond to either the location or the organization EAT. In the CLEF 2006 task, [49] got 14 correct answers out of 200 questions (150 factoid, 40 definition and 10 list questions); the authors attributed the improvement over the previous year to query expansion and different passage scoring techniques. In the CLEF 2007 task, [1] got 20 correct answers out of 200 questions; in their analysis, they claimed that internet sources could help answer identification.

Another related work is the experimental Extended QBTE[37]. Basically, the method is a text chunking process in which the answer is located by classifying each word of the target passages into 3 classes (B, I and O): B means that the word is the first word of the answer, I that it is part of the answer, and O that it is not part of the answer. This text chunking process was done with a maximum entropy algorithm. Using this approach, one needs neither EAT classification nor a named entity tagger. The features are taken from a rich POS resource, which also includes named entity information, for both the source and target languages. Although the text chunking approach of [37] achieved a low accuracy score in the NTCIR 2005 CLQA task (only 2 correct answers in the Japanese-English CLQA task), we adopt the approach in our answer finder module. Unlike [37], our approach employs a question analyzer module: we use the EAT and other question shallow parser results as part of the features. For the source question, we do not use rich POS information; we only use common POS categories and bi-gram frequency scores for the question main word (the question focus or the question clue word). For the target documents, we use a WordNet distance score for each document word.

∗ See http://www.geocities.com/cdpenerjemah/
† See http://www.toggletext.com/
‡ See http://www.lemurproject.org/
§ See http://www.media.mit.edu/~hugo/montytagger/
¶ See http://www.gate.shef.ac.uk/

5.3 Data Collection for Indonesian-English CLQA and its Problems

In order to obtain an adequate amount of data, we collected our own Indonesian-English CLQA data. We asked 18 Indonesian college students (different people from those who built the Indonesian QA collection described in Section 4.3.2) to read English articles from the Daily Yomiuri Shimbun, year 2000.
Each student was asked to write about 200 Indonesian questions related to the English articles. Question examples, along with the answers and their source articles, are shown in Table 5.2. The questions were factoid questions with certain EATs. There were 6 EATs: date, quantity, location, person, organization, and name (nouns other than the location, person and organization categories). After deleting duplicate and incorrectly formed questions, we obtained 2,837 questions in total. The number of questions for each EAT is shown in Table 5.1. Because of the development cost, the answer for each question is limited to the one answer given by the respondent; alternative answers are not included in our in-house question-answer pair data. In the experiments against the NTCIR 2005 CLQA task data set, however, we matched the resulting answers against the alternative answers provided with the NTCIR 2005 CLQA data.

Based on our observations of the written Indonesian questions, an Indonesian question always has an interrogative word, which can be located at the beginning, the end or the middle of a question. In the question "Dimana Giuseppe Sinopoli lahir?" (Where was Giuseppe Sinopoli born?), the interrogative word dimana (where) is the first word of the question. In the question "Sebelum menjabat sebagai presiden Amerika, George W. Bush adalah gubernur dimana?" (Before serving as the President of the United States, where did George W. Bush serve as governor?), on the other hand, the interrogative word dimana (where) is the last word of the question.

In addition to the interrogative word, another important word is the question main word. Here, we define the question main word as the question focus, or as a question clue word if the question focus does not exist in the question. Further, related to the question focus, the order of the question focus and the interrogative word in a sentence is reversible: an interrogative word can occur either before or after the question focus. For example, in the question "Lonceng yang ada di Kodo Hall dibuat di negara mana?" (In which country was Kodo Hall's bell made?), the interrogative word mana (which) is located after the question focus negara (country). However, in the question "Apa nama buku yang ditulis oleh Koichi Hamazaki beberapa tahun yang lalu?" (What is the name of the book written by Koichi Hamazaki several years ago?), the interrogative word apa (what) is placed before the question focus buku (book).

Even though the question focus is an important clue word for determining the EAT, as mentioned before, not all question sentences have one. Examples include "Apa yang digunakan untuk menghilangkan lignin?" (What is used to dispose of lignin?) and "Siapa yang menyusun panduan untuk pengobatan dan pencegahan viral hepatitis pada tahun 2000?" (Who composed a reference book for the medication and prevention of viral hepatitis in 2000?). For such questions without a question focus, we select another word, such as "digunakan" (used) or "menyusun" (composed), as the question clue word.

Table 5.2 shows question examples for each EAT and their respective patterns. The third question, for example, "Di prefektur manakah, Tokaimura terletak?" (In which prefecture is Tokaimura located?), has "prefektur" (prefecture) as the question main word. The QA system should be able to locate "prefecture" in the text, since the correct answer to the question is "Ibaraki Prefecture".
Here, the question main word is found to also be part of the answer. The fifth question, "Kapankah insiden penyerangan Alondra Rainbow di Perairan Indonesia?" (When did the Alondra Rainbow attack incident happen off the Indonesian waters?), has no question focus (the question clue word is "insiden" (incident)), but the answer can be located by searching for the closest "date" word. The answer is "October".

Table 5.1: Number of Questions per EAT

EAT            Number of Questions
date           459
location       476
name           482
organization   475
person         447
quantity       498

Table 5.2: Question Examples for each EAT (Question + EAT + Question Main Word + Interrogative Word)

Siapakah ketua Komite Penyalahgunaan Zat di Akademi Pediatri di Amerika (Who is the head of the Committee on Substance Abuse at the American Academy of Pediatrics)
EAT: Person; Question Main Word: ketua (head of); Interrogative Word: siapa (who)

Apa nama institut penelitian yang meneliti Aqua-Explorer? (What is the name of the research institute that does the Aqua-Explorer experiment?)
EAT: Organization; Question Main Word: institut (institute); Interrogative Word: apa (what)

Di kota manakah Universitas Mahidol berada? (In which city is Mahidol University located?)
EAT: Location; Question Main Word: kota (city); Interrogative Word: manakah (which)

Ada berapa lonceng di Kodo Hall? (How many bells are there in Kodo Hall?)
EAT: Quantity; Question Main Word: lonceng (bells); Interrogative Word: berapa (how many)

Kapankah insiden penyerangan Alondra Rainbow di Perairan Indonesia? (When did the Alondra Rainbow attack incident happen off the Indonesian waters?)
EAT: Date; Question Main Word: insiden (incident); Interrogative Word: kapankah (when)

Apa nama buku yang ditulis oleh Koichi Hamazaki beberapa tahun yang lalu? (What was the name of the book written by Koichi Hamazaki several years ago?)
EAT: Name; Question Main Word: buku (book); Interrogative Word: apa (what)

Even though there are some similarities between Indonesian and English sentences, such as the subject-predicate-object order, there are still some syntactical differences, such as the word order in a noun phrase. For example, in English, "Country which has the biggest population?" is a grammatically incorrect sentence. The sentence should be "Which country has the biggest population?". In Indonesian, however, "Negara apa" (country which) has the same meaning as "Apa negara" (which country). Observing such differences between Indonesian and English sentences, we concluded that existing English sentence processing tools could not be used for the Indonesian language.

5.4 Indonesian-English CLQA Schema

Using a common approach, we divide the CLQA system into four components: question analyzer, keyword translator, passage retriever, and answer finder. As shown in Figure 5.1, an Indonesian question is first analyzed by the question analyzer into keywords, the question main word (question focus or question clue word) with its type information, the EAT and phrase information. Our question analyzer consists of a question shallow parser and a question classifier module. The question classifier defines the EAT of a question using an SVM algorithm (provided by WEKA). Then, the Indonesian keywords and the question main word are translated by the Indonesian-English keyword translator. The translation results are used to retrieve the relevant passages. In the final phase, the English answers are located using a text chunking program (Yamcha) with the input features explained in Section 5.5.4.

Figure 5.1: Schema of Indonesian-English CLQA
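To make the data flow of Figure 5.1 concrete, the following minimal sketch shows how the four components might chain together. The four callables are placeholders for the modules described in Section 5.5; the function and key names here are ours, not the actual implementation:

    def answer_question(question, analyze, translate, retrieve, find_answers):
        # 1. Question analyzer: keywords, question main word, EAT, phrase info.
        analysis = analyze(question)
        # 2. Keyword translator: Indonesian keywords -> English keywords.
        keywords = translate(analysis["keywords"])
        # 3. Passage retriever: boolean query over the English corpus.
        passages = retrieve(keywords)
        # 4. Answer finder: B/I/O text chunking over the retrieved passages.
        return find_answers(passages, keywords, analysis["eat"])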
Each module involved in the schema is described in the next section.

5.5 Modules in Indonesian-English CLQA

5.5.1 Question Analyzer

The Indonesian-English CLQA system shown in Figure 5.1 was built by adopting certain modules from the monolingual Indonesian QA system. The only unmodified module is the question analyzer, because this module has the same function in both systems. It receives an Indonesian natural language question and yields information such as the shallow parser's result and the EAT. The detailed approach of the Indonesian question analyzer module was described in Section 4.4.2.

A question example for the Indonesian-English CLQA, along with its features for the question classification, is shown in Figure 5.2. We used two examples whose main difference is in the bi-gram frequency information. For the first question, the question main word "kota" (city) exists in the Indonesian corpus, while for the second question, the question main word "prefektur" (prefecture) does not exist in the Indonesian corpus, which gives a zero score for the bi-gram frequency feature. The features are written in the 4th-8th rows of each question. The features in the 4th-6th rows are yielded by our question shallow parser. The next two rows are the additional features (bi-gram frequency score and WordNet distance).

For the first question, the question main word "kota" (city) is statistically related to the "location" and "organization" entities (7th row). The highest relation is with the "location" entity, with 0.86 as the first bi-gram frequency score. There are 23 preceding words for the "city" word related to the "location" class and 10 words related to the "organization" class. As for the WordNet distance, the question main word "kota" (city) is a hyponym of the word "location". This precise information in the bi-gram frequency score and the WordNet distance leads to a correct prediction of the EAT. For the second question, even though the bi-gram frequency score is zero and the WordNet distance score does not unambiguously indicate that the correct EAT is a location, the question classification module still gives a correct prediction.

As has been proven in the Indonesian QA (Chapter 4), using a question class (EAT) in the answer finder gives higher performance than not using a question class (i.e. depending solely on the question shallow parser result and mainly on the question main word). Compared to a monolingual system, using the question class in a cross language QA system has more benefits. The first benefit occurs when there is more than one translation for a question main word, with different meanings.

Question 1: Di kota manakah Universitas Mahidol berada?
English: In which city is Mahidol University located?
Correct EAT: Location
Interrogative: apa (what)
Question Main Word: kota (city)
Phrase: NP-POST
Bi-gram: date(0,0), loc(0.86,23), name(0,0), organization(0.14,10), person(0,0), quantity(0,0)
WordNet-dist: act(0), animal(0), artifact(0), attribute(0), body(0), cognition(0), communication(0), event(0), feeling(0), food(0), group(0), location(1), motive(0), object(0), person(0), phenomenon(0), plant(0), possession(0), process(0), quantity(0), relation(0), shape(0), state(0), substance(0), time(0)
Resulted EAT: Location

Question 2: Di prefektur mana letak pulau Zamami?
English: In which prefecture is Zamami island located?
Correct EAT: Location
Interrogative: mana (which)
Question Main Word (question focus): prefektur (prefecture)
Phrase: NP-POST
Bi-gram: date(0,0), loc(0,0), name(0,0), org(0,0), person(0,0), quantity(0,0)
WordNet-dist: act(0.5), animal(0), artifact(0), attribute(0), body(0), cognition(0), communication(0), event(0), feeling(0), food(0), group(0), location(0.5), motive(0), object(0), person(0), phenomenon(0), plant(0), possession(0), process(0), quantity(0), relation(0), shape(0), state(0), substance(0), time(0)
Resulted EAT: Location

Figure 5.2: Example of Features for Question Classifier

For example, in the question "Posisi apakah yang dijabat George W. Bush sebelum menjadi presiden Amerika?" (What was George W. Bush's position before he became the President of the United States?), the question main word is posisi (position), which can be interpreted as a place (location) or an occupation (name). By classifying the question into "name", the answer extractor will automatically avoid a "location" answer. The second benefit relates to the problem of an out-of-vocabulary question main word. By providing the question class, even when the question main word cannot be translated, the answer can still be predicted using the question class.

Even though our question shallow parser is a rule-based module, it was built with simple rules (described in Section 4.4.2). Even as a simple rule-based question shallow parser, it could improve the question classification accuracy, as shown in Section 5.6.1.

5.5.2 Keyword Translator

Based on our observations of the collected Indonesian questions, we concluded that there are three types of words used in the Indonesian question sentences:

1. Native Indonesian words, such as "siapakah" (who), "bandara" (airport), "bekerja" (work), etc.
2. English words, such as "barrel", "cherry", etc.
3. Transformed English words, such as "presiden" (president), "agensi" (agency), "prefektur" (prefecture), etc.

We use an Indonesian-English bilingual dictionary [17] (29,047 entries) to translate the non-stopword Indonesian words into English. To handle the second type of keyword, we simply search for the keyword in the English corpus. For the third type of keyword, we apply some transformation rules, such as "k" into "c", or "si" into "cy", etc. The complete transformation rules were shown in Table 2.4. Using this strategy, among the 3706 unique keywords in our 2837 questions, we obtained only 153 OOV words (4%).
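To illustrate the idea, the following toy sketch applies only the two rules quoted above ("k" into "c", "si" into "cy"); the full rule set lives in Table 2.4, and checking the generated variants against the English corpus vocabulary is our assumption about how invalid candidates are discarded:

    RULES = [("si", "cy"), ("k", "c")]  # illustrative subset of Table 2.4

    def transform_candidates(word):
        # Generate spelling variants by applying each rule wherever it matches.
        candidates = {word}
        for src, dst in RULES:
            candidates |= {c.replace(src, dst) for c in candidates}
        return candidates

    def transliterate(word, english_vocabulary):
        # Keep only variants that occur in the English corpus,
        # e.g. "agensi" -> "agency".
        return [c for c in transform_candidates(word) if c in english_vocabulary]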
In addition, we also augmented the English translations by adding synonyms from WordNet.

An example of the keyword translation process is shown in Figure 5.3. The third row shows the keywords extracted from the question. The fourth to sixth rows are the keyword translation results. Two words ("letak", "pulau") can be translated with the Indonesian-English dictionary; "Zamami" is an English term which is not translated because it exists in the English corpus; "prefektur" is transformed into "prefecture" using the transformation rules mentioned above. The next row shows the attempt to add more translations from WordNet. The WordNet addition process is employed for the word "letak" (location, site, position). An example of the keyword addition process is as follows: the word "location" has 4 synsets in WordNet, of which only 2 have synonyms. The first synset has the synonyms "placement", "locating", "position", "positioning" and "emplacement". Because one of these synonyms is also included in the translation result of the bilingual dictionary, all synonyms of the first synset are included as additional keywords.

Question: Di prefektur mana letak pulau Zamami?
English: In which prefecture is Zamami island located?
Indonesian keywords: prefektur, letak, pulau, Zamami
Translated by Indonesian-English dictionary: letak = location, site, position; pulau = island
Exists in English corpus: Zamami
Transliterated: prefektur = prefecture
Augmented by WordNet: letak = location, site, position, placement, locating, situation, emplacement, positioning, place

Figure 5.3: Example of Keyword Translation Result

5.5.3 Passage Retriever

The English translations are then combined into a boolean query. By joining all the translations into a boolean query, the keywords are not filtered down to only one translation candidate, as is done in a machine translation method. The operators used in the boolean query are the "or", "or2", and "and" operators. The "or" operator is used to join the translation sets of each Indonesian keyword. As shown in Figure 5.4, the "or" operator joins the translation sets of "prefektur", "letak", "pulau", and "Zamami". The "or2" operator [32] is used for synonyms. Figure 5.4 shows that the boolean query for the translations of the word "letak" is "location 'or2' site 'or2' position". The "and" operator is used if a translation result contains more than one term. For example, the boolean query for "territorial water" (the translation of "perairan") is "territorial 'and' water".

The IDF score for each translation depends on the number of words in the translation result. For example, if an Indonesian word is translated into only one English word, then the IDF score of this translation is equal to the IDF score of the English word (the number of documents in the corpus divided by the number of documents containing the English word). If the translation consists of more than one English word joined by the "or2" operator, then the IDF score is calculated from the documents containing at least one of the translations. For the "and" operator, the IDF score is calculated from the documents containing all translations in the "and" operator.
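As a minimal sketch of this computation (our own formulation, assuming a postings map from each English word to the set of documents containing it, and a non-empty word list), the IDF of an "or2" or "and" group can be written as follows; following the description above, IDF is taken as a plain ratio rather than a logarithm:

    def idf_or2(words, postings, n_docs):
        # "or2" group (synonyms): documents containing at least one translation.
        docs = set().union(*(postings.get(w, set()) for w in words))
        return n_docs / len(docs) if docs else 0.0

    def idf_and(words, postings, n_docs):
        # "and" group (multi-word translation): documents containing all words.
        docs = set.intersection(*(postings.get(w, set()) for w in words))
        return n_docs / len(docs) if docs else 0.0

For example, idf_or2(["location", "site", "position"], postings, n_docs) scores the "or2" group for "letak" in Figure 5.4.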
The relevant passages are retrieved in two steps: document retrieval and passage retrieval. For the document retrieval, documents with an IDF score higher than half of the highest IDF score are selected. For the passage retrieval, the passages in the retrieved documents within the three highest IDF scores are selected. One passage consists of three sentences. Some examples of such retrieved passages are shown in Figure 5.4. The three passages with the highest IDF scores were retrieved. The first and third passages are correct passages, but the second passage is not. This occurred because the distances among keywords are not yet considered in the passage retrieval.

Question: Di prefektur mana letak pulau Zamami?
English: In which prefecture is Zamami island located?
Answer: Okinawa prefecture
Keywords: prefektur (prefecture), letak (location, site, position), pulau (island), Zamami
Boolean query: (prefecture) or (location or2 site or2 position) or (island) or (Zamami)
IDF scores: prefecture: 0.503; location, site, position: 0.705; island: 1.282; Zamami: 3.705
Passages with the highest IDF score:
... and record their cry off the coast of Zamami Island in Okinawa Prefecture. ...
... Kerama Island of Okinawa Prefecture ... Marilyn, on adjacent Zamami Island. ...
... humpback whale off Zamami Island, Okinawa Prefecture, by using a robot submarine. ...

Figure 5.4: Example of Passage Retriever Result

5.5.4 Answer Finder

As mentioned before, named entity tagging is not used in the answer finder phase; instead, an answer tagging process is employed. In the common approach, researchers tend to use two processes in a CLQA system. The first process is named entity tagging, which tags all named entities in the document. The second process is answer selection, which matches the named entities against the question features. If both processes employ machine learning approaches, then two sets of training data have to be prepared. In this common approach, the error of the answer finder results from the error of the named entity tagger compounded by the error of the answer selection process.

Our approach simplifies the answer finder process: we directly match the document features with the question features. This shortens the development time. It also means that for a new language with no named entity tagger available, one does not have to prepare two sets of training data; instead, one only has to prepare one set of training data for the answer tagging process, which can be built from the question-answer pairs (already available as the QA data) and the answer-tagged passages (which can easily be prepared by automatically searching for and tagging the answers in the relevant passages).

In our answer tagging process, we treat the answer finder as a text chunking process. Each word in the corpus is given a status, either B, I or O, based on features of both the document word and the question. We use the text chunking software Yamcha[20], which works with an SVM algorithm.
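As a hand-made illustration of the B/I/O scheme, take the Zamami example of Figure 5.4, where the correct answer is "Okinawa Prefecture"; the chunker should label the passage tokens as below (the real classification of course uses the full feature set shown later in Figure 5.5):

    # Toy B/I/O labelling for "... off Zamami Island, Okinawa Prefecture, by ..."
    labelled_tokens = [
        ("off",        "O"),
        ("Zamami",     "O"),
        ("Island",     "O"),
        (",",          "O"),
        ("Okinawa",    "B"),  # first word of the answer
        ("Prefecture", "I"),  # inside the answer
        (",",          "O"),
        ("by",         "O"),
    ]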
The answer finder with a text chunking process was also used in QBTE[37]. The QBTE approach employed maximum entropy as the machine learning algorithm and used rich POS information for words. For example, the word "Tokyo" is analyzed as POS1=noun, POS2=propernoun, POS3=location, and POS4=general. The POS information of each word in the question is matched with the POS information of each word in the corpus by using true/false scores as features for the machine learning. In our Indonesian-English CLQA, we do not use such information. The POS information in our Indonesian-English CLQA is similar to the POS1 mentioned in [37]. Even though the Indonesian-English dictionary is larger than the Japanese-English dictionary (24,805 entries) used by Sasaki, the Indonesian-English dictionary does not possess the POS2, POS3 and POS4 information. Another difference in our study is that we use the question class as one of the question features, for the reasons mentioned in Section 5.5.1. We also use the result of the question shallow parser along with the bi-gram frequency score.

For the document features, each word is morphologically analyzed into its root word using TreeTagger[40]. The root word, its orthographic information and its POS (noun, verb, etc.) information are used as features. Unlike in the Indonesian QA (Chapter 4), we do not calculate a bi-gram frequency score for the document word; instead, we calculate its WordNet distance to the 25 synsets listed in the noun lexicographer files of WordNet. Each document word is also complemented by its similarity scores with the question main word and the question keywords. If a question keyword consists of two successive words, such as "territorial water" as the translation of "perairan", then when a document word matches one of the words of the question keyword, the score is divided by the number of words in that question keyword. For example, for the document word "territorial", the similarity score against "territorial water" is 0.5. An example of the features used in the Yamcha text chunking software is shown in Figure 5.5.

Question: Di prefektur mana letak pulau Zamami?
English: In which prefecture is Zamami island located?
Question Features (QF): see Figure 5.2
EAT: Location
Retrieved passage: ... humpback whale off Zamami Island, Okinawa Prefecture, by using a robot submarine.
Document Features (DF) for the word "Prefecture":
lexical: prefecture
POS: NP
orthographic: Upcase alphabet (2)
question-main-word similarity: 1
keyword similarity: 1
bi-gram frequency: see Figure 5.2
Preceding words: "," (classified as "O"), Okinawa (classified as "B")
Classification result for the word "Prefecture": "I"

Figure 5.5: Example of Features (for Word "Prefecture") in Answer Finder Module

5.6 Experimental Results

5.6.1 Question Classifier

In the question classification experiment, we applied an SVM algorithm from the WEKA software[50] with a linear kernel and the "string to word vector" function to process the string values. We used 10-fold cross validation for the accuracy calculation on the collected data described in Section 5.3. We tried several feature combinations for the question classifier. The results are shown in Table 5.3.

Table 5.3: Question Classifier Accuracy

Features    Accuracy Score
B           91.93%
S           95.49%
B+S         93.83%
B+W         94.01%
B+P         94.18%
S+W         95.70%
S+P         95.88%
B+S+W       94.96%
B+S+P       95.07%
S+W+P       96.02%
B+W+P       94.61%
B+S+W+P     95.41%

"B" designates the bag-of-words feature; "S" designates the shallow parser's result features, including the interrogative word, question main word, etc.; "W" designates the WordNet distance feature for the question main word; "P" designates the bi-gram frequency feature for the question main word.

As Table 5.3 shows, feature combinations using the bag-of-words gave lower performance than those without it. Using the same technique as in the Indonesian QA but with different input sentences thus gives different results: the results on Indonesian question classification in the monolingual Indonesian QA (Section 4.5.1) showed that using the bag-of-words feature improves the classification accuracy.
We believe this is because the keywords used in the CLQA are more varied than those used in the monolingual QA. For the queries used in our Indonesian-English CLQA, users generated questions based on their translation knowledge; that is, they were free to use any Indonesian terms as translations of any of the English terms they found in the English article. In the monolingual QA system, however, users tended to use keywords as written in the monolingual (Indonesian) article.

The highest accuracy result is 96.02%, achieved by the combination of the shallow parser's result (S), bi-gram frequency (P) and WordNet distance (W). This also differs from the monolingual QA (Section 4.5.1), where the result using the above combination is lower than that obtained using only the S+P features. This is because there are many English keywords used in the Indonesian questions for the CLQA system, thereby making the WordNet distance a useful feature.

The detailed accuracy for each EAT with the S+P+W features is shown in Table 5.4.

Table 5.4: Confusion Matrix of S+P+W Features for Question Classification

in / out   date   loc   name   org   person   quan
date        459     0      0     0        0      2
loc           0   460      4    10        0      0
name          0    10    467    33        0      0
org           0     4     10   403        8      0
person        0     2      1    29      439      0
quan          0     0      0     0        0    496

The lowest performance is for the "organization" class, which is a difficult task. For example, for the question "Siapa yang mengatakan bahwa 10% warga negara Jepang telah mendaftarkan diri untuk mengikuti Pemilu pada tahun 2000?" (Who said that 10% of Japanese citizens had registered for the general election in 2000?), "person" was obtained as the classification result. Even for a human, it would be quite difficult to define the question class of the above example without knowing the correct answer ("Foreign Ministry").

5.6.2 Keyword Translation

As mentioned before, we translated the Indonesian queries into English using an Indonesian-English dictionary. Apart from that, we also transformed some Indonesian words into English. Using the transformation module, we were able to translate 38 OOV words with only 1 incorrect result. Examples of the transformation results are shown in Table 5.5.

Table 5.5: Examples of Transformation Results

Indonesian    English Translation
Correct Translation
prefektur     prefecture
agensi        agency
jerman        german, germany
Incorrect Translation
mula          mole, mule

5.6.3 Passage Retriever

For the passage retriever, we used two evaluation measures: precision and recall. Precision shows the average ratio of relevant documents. A relevant document is a document that contains a correct answer, without considering any available supporting evidence. Recall refers to the number of questions that might have a correct answer in the retrieved passages. We tested three schemas for the passage retriever; the results are shown in Table 5.6. In Table 5.6, "In-house-sw" means that the keywords are filtered using an additional stop word elimination process, and "In-house-sw-wn" means that WordNet was additionally used to augment the Indonesian-English translations. For the English target corpus, we used the Daily Yomiuri Shimbun from the years 2000 and 2001 (17,741 articles and a vocabulary of 98,922 words, excluding function words).
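The F-Measure column in Table 5.6 is consistent with the standard balanced F-measure, the harmonic mean of precision and recall:

    F-measure = 2 × Precision × Recall / (Precision + Recall)

For example, for "In-house-sw-wn": 2 × 19.7 × 70.4 / (19.7 + 70.4) ≈ 30.8%.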
Table 5.6: Passage Retriever Accuracy

Run-ID            Precision %   Precision #    Recall %   Recall #   F-Measure %
In-house          12.6%         1044 of 8311   71.1%      202        21.4%
In-house-sw       12.1%         1092 of 9026   73.2%      208        20.8%
In-house-sw-wn    19.7%          845 of 4295   70.4%      200        30.8%

We found that using WordNet for the expansion gave a lower recall result, which means a lower number of candidate documents that might contain an answer. With WordNet, some irrelevant passages got higher IDF scores than the relevant ones.

5.6.4 Answer Finder

To locate the answer, we used an SVM-based text chunking software (Yamcha) with the default SVM configuration. We ranked the resulting answers with the following five schemas:

1. A: Using only the Text Chunking score
2. B: (0.3 × Passage Retrieval score) + (0.7 × Text Chunking score)
3. C: (0.5 × Passage Retrieval score) + (0.5 × Text Chunking score)
4. D: (0.7 × Passage Retrieval score) + (0.3 × Text Chunking score)
5. E: Using only the Passage Retrieval score

All results are shown in Table 5.7. To measure the CLQA performance, we used the Top1, Top5 and MRR scores for the exact answers. The "sw" label in the Run-ID column means that the keywords were filtered using two kinds of stopwords: a common stopword elimination (the words listed in the stopword list are deleted from the keyword list) and a special stopword elimination (the words are deleted only if they meet certain criteria, such as "tahun" (year) in "Pada bulan apa akan diadakan pemilihan umum di Jepang pada tahun 2000?" (In what month of the year 2000 will the general election be held in Japan?)). The "wn" label in the Run-ID column means that the English translation keywords are augmented with synonyms from WordNet, as described in Section 5.5.2.

Table 5.7 shows that also using the passage retrieval score ("B", "C" and "D") improves the overall accuracy. The number of retrieved passages influences the machine learning accuracy: the highest question answering accuracy (Top1 and MRR) is achieved by "In-house-sw-wn", which had the lowest recall but the highest precision in the passage retriever evaluation.

Table 5.7: Answer Finder Accuracy

Run-ID            Top1          Top5          MRR
In-house
  A               20.1% (57)    34.2% (97)    25.6%
  B               23.9% (68)    34.2% (97)    28.1%
  C               23.2% (66)    33.8% (96)    27.7%
  D               23.2% (66)    33.8% (96)    27.6%
  E               17.3% (49)    32.8% (93)    23.6%
In-house-sw
  A               20.4% (58)    35.6% (101)   26.9%
  B               24.3% (69)    36.6% (104)   29.2%
  C               23.9% (68)    36.6% (104)   29.0%
  D               23.9% (68)    36.3% (103)   28.8%
  E               19.0% (54)    34.5% (98)    25.2%
In-house-sw-wn
  A               20.4% (58)    34.9% (99)    26.5%
  B               25.0% (71)    35.6% (101)   29.5%
  C               24.7% (70)    35.6% (101)   29.2%
  D               24.7% (70)    35.2% (100)   29.1%
  E               18.0% (51)    33.5% (95)    24.3%

Other than the experiments shown in Table 5.7, we also conducted experiments with question translations without the transformation process. The experimental results in Table 5.8 show that the transformation process benefits the answer finder accuracy, similar to the improvement given by the stop word elimination process shown in Table 5.7.

Table 5.8: Answer Finder Accuracy for Translation without the Transformation Process

Run-ID   Top1          Top5         MRR
A        19.0% (54)    33.5% (95)   25.2%
B        23.2% (66)    33.8% (96)   27.5%
C        22.5% (64)    33.8% (96)   27.1%
D        22.5% (64)    33.5% (95)   27.0%
E        17.3% (49)    31.3% (89)   23.2%
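To make the five ranking schemas of Section 5.6.4 concrete, the following minimal sketch (our own formulation; the two scores are assumed to be comparably normalised) ranks the answer candidates of one question:

    SCHEMAS = {
        "A": (0.0, 1.0),  # text chunking score only
        "B": (0.3, 0.7),
        "C": (0.5, 0.5),
        "D": (0.7, 0.3),
        "E": (1.0, 0.0),  # passage retrieval score only
    }

    def rank_answers(candidates, schema="B"):
        # candidates: list of (answer, passage_retrieval_score, chunking_score).
        w_pr, w_tc = SCHEMAS[schema]
        return sorted(candidates,
                      key=lambda c: w_pr * c[1] + w_tc * c[2],
                      reverse=True)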
5.6.5 Experiments with NTCIR 2005 CLQA Task Test Data

We also conducted a CLQA experiment using the NTCIR 2005 CLQA Task data [39]. Because the English-Japanese and Japanese-English data sets are parallel, we translated the English question sentences into Indonesian and used the translated Indonesian questions to find the answers in the English documents (Daily Yomiuri Shimbun, years 2000-2001). To prepare the Indonesian questions, we asked two Indonesians to translate the 200 English questions into Indonesian. The translated Indonesian questions were then labeled as belonging to one of two categories: Trans1 and Trans2.

The question test set was categorized by NTCIR 2005 into 9 EATs: money, numex, percent, date, time, organization, location, person, and artifact. This question categorization is quite similar to the EATs used in our system: "date", "organization" (except for the "country" question focus), "location" and "person" correspond directly, "artifact" is equal to our "name" EAT, and "money", "numex" and "percent" all correspond to our "quantity" EAT. Some question examples are shown in Table 5.9.

The only EAT that we do not have in the training data is the "time" EAT. It is interesting to observe whether our system can handle such out-of-EAT questions. There are 14 questions in NTCIR 2005 with the "time" EAT. It should be noted, however, that in our data there is no question whose answer includes special terms such as "p.m." or "a.m.". Further, there is also no question similar to "What time did the inaugural flight depart from Haneda Airport's new runway B?". As a result of these differences, in the question classification, our system classified all "time" questions as the "date" EAT. Even though there is no similar question-answer pair in our training data, among these 14 questions our system was able to locate 2 correct answers for the first translation set. We believe that if "time" questions were added to the training data, the system would achieve better performance.

Table 5.9: Question Examples of the NTCIR 2005 CLQA Task

Japanese: 台湾新幹線に対して日本が優先交渉権を獲得したのはいつ
English: When did Japan win prior negotiation rights in the Taiwan Shinkansen project?
Indonesian-1: Kapan Jepang memenangi hak penawaran dalam proyek Shinkansen Taiwan?
Indonesian-2: Kapankah Jepang telah memenangkan negosiasi perjanjian proyek Shinkansen Taiwan?
Correct Answer: Dec. 28
NTCIR EAT: Date; In-house EAT: Date

Japanese: 高橋さんのトラックが北保さんのトラックに追突した時刻は
English: At what hour did a truck driven by Takahashi rear-end a truck driven by Hokubo?
Indonesian-1: Pada jam berapa truk yang dikemudikan Takahashi menabrak bagian belakang truk yang dikemudikan Hokubo?
Indonesian-2: Pada jam berapakah truk yang dikendarai oleh Takahashi mendekati belakang truk yang dikendarai oleh Hokubo?
Correct Answer: about 6 a.m.
NTCIR EAT: Time; In-house EAT: Date

Japanese: 寄生虫病に苦しむ人々は世界中で何人くらいいますか
English: How many people suffer from parasitic diseases all over the world?
Indonesian-1: Berapa banyak jumlah orang yang menderita penyakit yang disebabkan oleh parasit di seluruh dunia?
Indonesian-2: Berapa banyakkah orang yang menderita penyakit parasit di seluruh dunia?
Correct Answer: More than 4 billion people
NTCIR EAT: Numex; In-house EAT: Quantity

Japanese: 21日に慶応大で行われるソテールさんの講演会の司会を務めるのは誰だ
English: Who will host the lecture by Mr. Sautter at Keio University on the 21st?
Indonesian-1: Siapakah yang akan menjadi tuan rumah kuliah tamu Mr. Sautter di Keio University pada tanggal 21?
Indonesian-2: Siapakah yang akan memandu kuliah dari Mr. Sautter di Keio University pada tanggal 21?
Correct Answer: Eisuke Sakakibara
NTCIR EAT: Person; In-house EAT: Person

The experimental results are shown in the following two tables. The passage retriever accuracy is shown in Table 5.10. The result is similar to that described in the previous section: the highest recall score is achieved by the passage retrieval module with additional stop word elimination only, that is, without the WordNet expansion. This holds for both translation data sets.

Table 5.10: Passage Retriever Accuracy for NTCIR Translated Test Data Set

Run-ID          Precision %   Precision #    Recall %   Recall #   F-Measure %
Trans1          18.5%         792 of 4281    76.0%      152        29.8%
Trans1-sw       17.3%         839 of 4842    76.5%      153        28.2%
Trans1-sw-wn    19.3%         753 of 3903    74.5%      149        30.7%
Trans2          10.1%         883 of 8733    77.0%      154        17.9%
Trans2-sw       12.2%         976 of 8028    78.5%      157        21.1%
Trans2-sw-wn    12.4%         912 of 7346    77.5%      155        21.4%

Table 5.11 shows the question answering accuracy results. We used only the "B" ranking score calculation ((0.3 × Passage Retrieval score) + (0.7 × Text Chunking score)).

Table 5.11: Question Answering Accuracy for NTCIR Translated Questions

Run-ID          Top1          Top5         MRR
Trans1          22.5% (45)    34.0% (68)   27.2%
Trans1-sw       22.0% (44)    35.5% (71)   27.6%
Trans1-sw-wn    22.0% (44)    34.5% (69)   27.4%
Trans2          21.0% (42)    31.5% (63)   25.7%
Trans2-sw       22.0% (44)    32.5% (65)   26.7%
Trans2-sw-wn    22.5% (45)    32.5% (65)   26.8%

As a rough comparison, we also include the question answering accuracy of the best run-ids in the Japanese-English task and the Chinese-English task of the NTCIR 2005 CLQA task, shown in Table 5.12[39]. In the NTCIR 2005 CLQA task, an answer is labeled as R or U. R indicates that the answer is correct and extracted from relevant passages, whereas U indicates that the answer is correct but taken from irrelevant passages. In our experiments, we do not evaluate the relevancy of a passage; therefore the results shown in Table 5.12 are the R+U answers for the Top1 and Top5 answers.

Table 5.12: Question Answering Accuracy of Run-ids in NTCIR 2005 CLQA Task[39]

Run-ID            Top1          Top5           MRR
Japanese-English
  NCQAL-J-E-01    31.5% (63)    58.5% (117)    42.0%
  Forst-J-E-02     9.0% (18)    19.5% (39)     12.8%
  Forst-J-E-03     8.5% (17)    19.0% (38)     12.3%
Chinese-English
  UNTIR-C-E-01     6.5% (13)    N/A            N/A
  NTOUA-C-E-03     3.5% (7)     N/A            N/A
  NTOUA-C-E-02     3.0% (6)     N/A            N/A

The best result (NCQAL-J-E-01)[18] used some high quality data resources. To translate Japanese keywords into English, they used three dictionaries: EDICT, ENAMDICT and an in-house translation dictionary. In the question analyzer, they used the ALTJAWS morphological analyzer based on the Japanese lexicon "Nihongo Taikei". The second and third best ranked runs in the Japanese-English task (Forst-J-E)[29] used the EDR Japanese-English bilingual dictionary to translate Japanese keywords into English and then retrieved the English passages. The English passages were translated into Japanese using a machine translation system. The translated Japanese passages, along with the Japanese question sentence, were used as input for a Japanese monolingual QA system. It can be noted here that the number of entries in the EDR Japanese-English dictionary is higher than that of the KEBI Indonesian-English dictionary. In the Chinese-English task, the best result (UNTIR-C-E-01)[8] used currently available resources such as the BabelFish machine translation‖, the Lemur IR toolkit∗∗, the Chinese Encoding Converter††, LingPipe‡‡ and Minipar§§ to annotate English documents.
By using these resources, even though the result was lower than the submitted runs of the Japanese-English task, this team achieved the highest accuracy in the Chinese-English task, with 13 correct answers at Top1. The second and third best team (NTOUA)[22] merged several English-Chinese dictionaries to get better translation coverage and used web based searching to translate the unknown words.

As mentioned before, our answer finder module is similar to the QATRO[37] submitted run that used a maximum entropy machine learning algorithm. That team used only 300 questions as training data, and this is likely the main reason for its low accuracy score (2 correct answers at Top1). Here, we tried various sizes of the training data to show the effect of the training data size on the CLQA accuracy scores. The results shown in Figure 5.6 are the accuracy scores (using the "B" ranking score calculation) for the first translation set of the NTCIR 2005 CLQA task test set.

Figure 5.6: QA Accuracy for Various Sizes of Training Data (Top1, Top5 and MRR accuracy (%) against the number of training questions, from 0 to 3,000)

Figure 5.6 indicates that the size of the training data does influence the accuracy of a question answering system: larger training data yields higher question answering accuracy. Figure 5.6 also shows that with 250 question-answer pairs of training data (fewer than used in QATRO), our system was still able to achieve 12.3% accuracy for the Top1 answer (24.6 correct answers on average), outperformed only by the NCQAL[18] system.

‖ http://babelfish.altavista.com
∗∗ http://www.lemurproject.org/
†† http://www.mandarintools.com/zhcode.html
‡‡ http://www.alias-i.com/lingpipe
§§ http://www.cs.ualberta.ca/~lindek/minipar.htm

5.7 Conclusions

The experiments showed that for a language with poor resources, such as Indonesian, it is possible to build a cross language question answering system with promising results. The machine learning approach and the boolean query make the system easy to build; one does not need to construct a rich rule-based system in order to build a monolingual QA. Compared to other similar approaches, we showed that the EAT information gives a greater benefit than in a monolingual QA. The experiments on the NTCIR 2005 CLQA Task show that our system could give correct answers for out-of-EAT questions.

We believe that this system is suitable for languages with poor resources. For the source language, one needs a POS resource mainly for nouns, verbs and adjectives; other modules such as the question shallow parser can be built with a small programming effort. For the target language, two main resources are needed: a POS resource (or POS tagger) and an ontology resource such as WordNet. Even though data resources similar to WordNet are not readily available for many languages, our experiments in the Indonesian Question Answering showed that this kind of information can be replaced by statistical information derived from a monolingual corpus.

Chapter 6

Transitive Approach in Indonesian-Japanese CLQA

6.1 Introduction

Building an Indonesian-Japanese CLQA is the final goal of this study. To our knowledge, this Indonesian-Japanese CLQA is the first attempt to develop a CLQA system for a language pair with limited resources. It is also the first attempt at building a CLQA system with Japanese as the target document language where the source language is not English.
So far, the only CLQA task with Japanese as the target document language is the English-Japanese subtask of NTCIR 2005, a language pair that usually enjoys rich translation resources, as in NCQAL[18]. The other language pairs available in the NTCIR 2005 CLQA Task[39] are Japanese-English, Chinese-English, and English-Chinese.

The main problem arising in this Indonesian-Japanese CLQA is the limited data resources and tools for the translation module. For this kind of language pair, the commonly available resource is a bilingual dictionary. One would require extra labour to build machine translation software or to provide a parallel corpus for the translation module. Thus, the suitable schema is keyword translation, not sentence translation. Considering the resources that we have built, we concluded that there are several schemas that could be adopted for the Indonesian-Japanese CLQA system:

1. Adopt an approach similar to the Indonesian-English CLQA (as described in Chapter 5). The Indonesian keywords are translated into Japanese (by either a direct or a transitive translation). The passage retriever and answer finder both operate on Japanese terms. For this, we have to prepare training data of tagged Japanese passages for the answer finder module.

2. Use English as the pivot language in the Indonesian-Japanese CLQA. Here, the Indonesian keywords and the Japanese documents are translated into English; the passage retriever and answer finder operate on English terms. The main drawback is the document translation. One could argue for translating only the verbs or nouns of the Japanese documents into English to decrease the labour of document translation. By doing that, however, the system probably could not take advantage of sentence patterns, which usually involve words other than verbs or nouns, such as adverbs, function words, etc.

3. Similar to the second schema, but the passage retriever operates on Japanese. Here, the Japanese-English document translation is employed only for the relevant passages.

4. Similar to the second schema in using English as the pivot language, but the answer finder operates on the Japanese documents. The first passage retrieval is done in English; the nouns from the retrieved English passages are then filtered and used as the input for the Japanese passage retrieval. The answer finder is conducted in Japanese. The drawback of this approach is that the pivot and target languages should have comparable corpora.

We predict that the first and fourth schemas will give higher accuracy scores than the others, owing to the advantage of sentence patterns gained in the first strategy; we also believe it is a valuable contribution to try the fourth schema to cope with the weak Japanese passage retrieval. By this consideration, in this Indonesian-Japanese CLQA we compare these two schemas (the first and the fourth) to evaluate the advantages and drawbacks of each.

As mentioned for the Indonesian-Japanese CLIR system (Chapter 3), other than direct translation, there is the alternative of employing a transitive translation, either a transitive machine translation or a transitive translation with bilingual dictionaries. The experimental results in the Indonesian-Japanese CLIR system showed that the transitive translation with bilingual dictionaries could achieve performance comparable to the direct bilingual dictionary or the transitive machine translation.
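As a minimal sketch (our own formulation) of dictionary-based transitive translation, each Indonesian keyword is mapped to English with the Indonesian-English dictionary and each English candidate is mapped onward with the English-Japanese dictionary; the resulting candidate set is later filtered, e.g. with mutual information and TF × IDF scores (see Section 6.5.1):

    def transitive_translate(indonesian_word, id_en, en_ja):
        # id_en and en_ja are assumed to map a word to a list of translations.
        japanese = []
        for english in id_en.get(indonesian_word, []):
            japanese.extend(en_ja.get(english, []))
        # Remove duplicates while keeping the original candidate order.
        return list(dict.fromkeys(japanese))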
6.2 Related Works

Because no Indonesian-Japanese CLQA exists to date, the related works described in this chapter are the Indonesian-English CLQA (CLEF), where Indonesian is the source language, and the English-Japanese CLQA (NTCIR), where Japanese is the target language.

Since CLEF 2005, there have been 3 works on the Indonesian-English CLQA. All of them adopted the common approach to a CLQA system, where the source question is translated into the target language and the remaining execution is a monolingual QA in the target language. The details of each work are described in Section 5.2.

As for the English-Japanese CLQA, in the NTCIR 2005 CLQA task, 5 groups submitted official runs for the English-Japanese subtask. The performance of each system is shown in Table 6.1[39]. "R" and "U" in the table are the assessment criteria of answers and retrieved documents. "R" (Right) means the answer is correct and the document it comes from supports it. "U" (Unsupported) means that the answer is correct, but the document it comes from cannot support it as a correct answer; that is, there is insufficient information in the document for users to confirm by themselves that the answer is a correct one.

Table 6.1: Performance of English-Japanese CLQA Systems in NTCIR 2005 CLQA Task

Run ID   Accuracy (Top1, R+U)   Top5, R+U
Forst    15.5%                  24.0%
LTI      12.5%                  N/A
NICT     12.0%                  21.0%
QATRO     0.5%                   1.0%
TTN       6.5%                  11.5%

The best result on the English-Japanese CLQA 2005 was achieved by Forst[29], which used machine translation software and web search to translate proper nouns. The accuracy was 15.5% (31 correct Top1 answers among 200 questions). In the answer finder they used a matching score of each morpheme in the retrieved passages against the question keywords and the EAT.

The work most related to ours among the submitted runs at NTCIR 2005 is QATRO[37]. The method is called Extended QBTE (Question-Biased Term Extraction). In the common approach, an answer is identified by using the similarity between an EAT (Expected Answer Type) and the named entity of the answer candidate. This approach uses a different method: it eliminates the question classification and named entity tagging processes, and extracts the answer by classifying each word in the document into one of 3 classes (B, I or O). The "B" class means that the document word is the first word of the answer, "I" means that the word is part of the answer and "O" means that the word is not part of the answer. The answer classification was done using a maximum entropy algorithm with features taken from Chasen, which include four kinds of POS information. The accuracy score obtained in the NTCIR 2005 CLQA was low, as there was only 1 correct answer among the 200 test questions of the English-Japanese CLQA. We modify this method in our answer finder by using the question analyzer result, including the EAT.

6.3 Indonesian-Japanese CLQA with Transitive Translation (Schema 1)

The first schema of the Indonesian-Japanese CLQA is similar to the Indonesian-English CLQA in using a pivot language in the translation phase. The overall architecture is shown in Figure 6.1.
Figure 6.1: Schema of Indonesian-Japanese CLQA using Japanese Answer Finder (Indonesian question → Indonesian question analyzer → Indonesian-English translation → English-Japanese translation → Japanese passage retriever over the Japanese newspaper corpus → Japanese answer finder → Japanese answer)

First, the Indonesian question is processed by a question analyzer for the EAT, question keywords and the shallow parser's results (interrogative word, question main word, and phrase information). The question analyzer used here is the same as the one employed in the Indonesian monolingual QA and the Indonesian-English CLQA.

The question keywords (along with the question main word) are then translated into Japanese. In the proposed method, we use a transitive translation with bilingual dictionaries. For a language pair with limited resources, the data resources for transitive translation with bilingual dictionaries are more readily available than the data resources for direct translation or transitive machine translation. In the experiments, we compare the proposed translation with the direct translation and the transitive machine translation. In order to handle the OOV words, we use a Japanese proper name dictionary and a rule-based transliteration module.

The translated question keywords are joined into a query as input for the Japanese passage retriever module. In the document preparation phase, the Japanese sentences are processed using a morphological analyzer (Chasen[26]) into a list of base words. The morphological analyzer is also applied to the translated question keywords. Japanese non-stopwords are used as the index.

For the answer finder module, the method is similar to the answer finder in the Indonesian monolingual QA (described in Chapter 4) and the Indonesian-English CLQA (described in Chapter 5). It uses a text chunking process on the Japanese passages. The features are similar to those of the Indonesian-English CLQA; the WordNet distance features used in the Indonesian-English CLQA are replaced with the POS information yielded by Chasen as the morphological analyzer for Japanese. An example of the POS information yielded by Chasen for the word 東京 (Tokyo) includes 名詞 (Noun), 固有名詞 (Proper Noun), 地域 (Region) and 一般 (General).

6.4 Indonesian-Japanese CLQA with Pivot Passage Retrieval (Schema 2)

As mentioned in Section 6.1, the second schema of the Indonesian-Japanese CLQA uses the English passage retriever result as the input for the Japanese passage retriever. The architecture is shown in Figure 6.2. The main difference from the first approach is the passage retriever: in the first approach, only a Japanese passage retrieval is conducted, while in the second approach we conduct both an English and a Japanese passage retrieval. In this way, the pivot language is not only used in the translation phase. The nouns (proper and common nouns) of the retrieved English passages are translated into Japanese and used as the input for the Japanese passage retrieval. The rest of the process is identical to the first approach.

We argue that by using the pivot language passage retriever, our system can achieve better performance. This schema can be seen as a query expansion in the Japanese passage retriever, which can compensate for the low quality of the Indonesian-Japanese translation.
Figure 6.2: Schema of Indonesian-Japanese CLQA using English Passage Retriever (Indonesian question → Indonesian question analyzer → Indonesian-English translation → English passage retriever over the English newspaper corpus → English-Japanese translation → Japanese passage retriever over the Japanese newspaper corpus → Japanese answer finder → Japanese answer)

6.5 Modules of Indonesian-Japanese CLQA

There are several modules involved in the Indonesian-Japanese CLQA:

1. Indonesian Question Analyzer
2. Indonesian-Japanese Translation
3. English Passage Retriever
4. Japanese Passage Retriever
5. Japanese Answer Finder

We reused several modules from the previous systems: the Indonesian question analyzer employed in the Indonesian QA and the Indonesian-English CLQA, the Indonesian-Japanese translation employed in the Indonesian-Japanese CLIR, and the English passage retriever of the Indonesian-English CLQA. The newly developed modules are the Japanese passage retriever and the Japanese answer finder, which are described in the following sections.

6.5.1 Japanese Passage Retrieval

We use GETA (http://geta.ex.nii.ac.jp/) as our generic information retrieval engine. It retrieves Japanese documents for a keyword set by using the IDF, TF or TF × IDF score. The Japanese translation results are joined into one query and input into GETA to get the relevant Japanese documents. All passages (two sentences) that contain a word matching the question keywords are used as the retrieved passages.

Because the translation results of the bilingual dictionaries contain many Japanese translation candidates, the translations are filtered using mutual information and the TF × IDF score. First, all combinations of the keyword translation sets are ranked by their mutual information score. Each of the top 5 combinations by mutual information score is used to retrieve relevant documents. In the final phase, we select the documents within the 100 highest TF × IDF scores from all relevant documents resulting from the queries of the top 5 mutual information scores.
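A rough sketch of this filtering retrieval is given below (our own formulation; GETA is approximated by a generic retrieve callable, and mi_score and tfidf_score are assumed helpers). The exhaustive enumeration of combinations is only illustrative:

    from itertools import product

    def retrieve_with_filtering(translation_sets, retrieve, mi_score, tfidf_score):
        # translation_sets: one list of Japanese candidates per source keyword.
        combos = sorted(product(*translation_sets), key=mi_score, reverse=True)
        # Query with each of the top 5 combinations by mutual information ...
        documents = set()
        for combo in combos[:5]:
            documents |= set(retrieve(combo))
        # ... and keep the 100 documents with the highest TF x IDF scores.
        return sorted(documents, key=tfidf_score, reverse=True)[:100]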
6.5.2 Japanese Answer Finder

Our Japanese answer finder locates the answer candidates using a text chunking approach with a machine learning algorithm. Here, the document features are directly matched with the question features. Each word in the retrieved passages is classified into "B" (first word of the answer candidate), "I" (part of the answer candidate) or "O" (not part of the answer candidate). The features used for the classification include the document features, the question features, the EAT information and a similarity score. The document features include the POS information (the four kinds of POS information yielded by Chasen) and the lexical form. The question features include the question shallow parser result and the translated question main word. The similarity score is the score between the document word and the question keywords. For one document word, there are also n preceding words and n succeeding words; in our experiments, we found that n=5 is the best option. An example of the document features for the document word "自民党" with the question "Prime Minister Obuchi's coalition government consists of LDP, Komei, and what other party?" is shown in Figure 6.3.

Document word: 自民党
POS information (Chasen): 名詞, 固有名詞, 組織
Similarity score: 0
5 preceding words: に, よる, 連立 (similarity score is 1), 政権, は
5 succeeding words: 執行, 部, が, 小渕 (similarity score is 1), 前

Figure 6.3: Example of Document Features for the Answer Finder Module

6.6 Experiments

6.6.1 Experimental Data

In order to gain an adequate amount of training data for the question classifier and the answer finder modules, we collected our own Indonesian-Japanese CLQA data. So far, we have 2,837 Indonesian questions and 1,903 answer-tagged Japanese passages. About 1,200 of the Japanese passages (more than half of the 1,903 passages) were obtained from Japanese native speakers who read the English question, the English answer and the corresponding Japanese article (Yomiuri Shimbun, years 2000-2001). The web interface of the Japanese passage annotation task is shown in Figure 6.4. Training data examples are shown in Table 6.2.

Figure 6.4: Web Interface of Japanese Passage Annotation Task

Table 6.2: Training Data Examples for Indonesian-Japanese CLQA (Question + Correct Passage)

Question: Perusahaan apakah yang menerbitkan Weekly Gendai? (What company publishes Weekly Gendai?)
Correct Passage: 読売新聞社は、<A>講談社</A>発行の「週刊現代」と徳間書店発行の「週刊アサヒ芸能」の新聞広告について、「毎号の広告内容に、新聞に載せるのにふさわしくない極めて過激な性表現が多数含まれ、改善が見られない」と判断、読売新聞紙上への広告掲載を当分の間見合わせることを決め、三日までに両社に通知した。

Question: Kapan konsorsium Jepang mengalahkan konsorsium Taiwan untuk Taiwan Shinkansen project? (When did the Japanese consortium defeat the Taiwanese consortium for the Shinkansen project in Taiwan?)
Correct Passage: 台湾版新幹線プロジェクトをめぐっては、国際入札で日本企業の連合と独仏企業連合が受注を競っていたが、<A>十二月二十八日</A>に日本側が優先交渉権を獲得した。日本政府は台湾側の資金調達を全力で支援する姿勢を示すことで正式受注にこぎつける考えだ。

Question: Siapakah konduktor Italia yang lahir di Venice pada tahun 1946? (Who is the Italian conductor who was born in Venice in 1946?)
Correct Passage: <A>ジュゼッペ・シノーポリ氏</A>(ドレスデン国立歌劇場管弦楽団首席指揮者)現代の代表的指揮者。フィルハーモニア管弦楽団音楽監督を経て92年より現職。53歳。イタリア・ベネチア生まれ。

The 200 questions from the NTCIR CLQA task are used as the test data. The literature [39] noted that in the NTCIR 2005 CLQA task, the Japanese questions were created as translations of the English questions by referring to the corresponding Japanese articles, which made the question/answer pairs of the J-E and E-J subtasks parallel. Thus, the Indonesian questions used in the Indonesian-English CLQA (Section 5.6.5) can be employed as the test questions in the Indonesian-Japanese CLQA. The 200 Indonesian test questions contain 625 common nouns and 294 proper nouns. The language resources for the translation phase are the same as those employed in the Indonesian-Japanese CLIR, described in Section 3.5.1.

6.6.2 Evaluation on OOV of Indonesian-Japanese Translation

The OOV rates in the query sentences (test data) are shown in Table 6.3. OOV words are the words that could not be translated by the translation module.
Using the translation resources mentioned in the previous section (including the proper name dictionary, romanized corpus words and transliteration), the overall OOV rates over proper nouns and common nouns were about 15.2%, 11.5%, and 10.4% for the direct translation (vocabulary size of 14,823 entries), the transitive translation (dictionaries of Indonesian-English, 29,054 entries, and English-Japanese, 556,237 entries) and the transitive machine translation (as in the Indonesian CLIR, using the Indonesian-English Kataku engine and the English-Japanese Excite engine), respectively.

Table 6.3: OOV Rates of Proper Noun and Common Noun Translation

Description                      Proper Noun   Common Noun
Direct Translation               12.9%         17.2%
Transitive Translation           13.9%          9.4%
Transitive Machine Translation   13.2%          9.1%

6.6.3 Passage Retriever's Experimental Results

The performance of the Japanese passage retriever is shown in Table 6.4 under two evaluation measures: precision and recall. Precision shows the average ratio of relevant passages to retrieved passages. A relevant passage is defined as a passage that contains a correct answer, without considering any available supporting evidence. Recall refers to the rate of questions that might have correct answers in the retrieved passages. "n-th MI score" means the input query is the keyword set with the n-th ranked MI score. "MI-TF × IDF" is the combination of the mutual information score and the TF × IDF score, as explained in Section 6.5.1.

In the keyword translation using a bilingual dictionary, even though the number of OOV common nouns produced by the direct translation is much larger than with the transitive translation, in general the direct translation has better retrieval performance than the transitive translation (a higher precision score for all methods and a higher recall score for almost all methods). It shows that the important keywords in the document retrieval are mostly the proper nouns (the number of OOV proper nouns of the direct translation is lower than that of the transitive translation).

Table 6.4 also shows that without the combination of TF × IDF and mutual information filtering, the transitive translation result has a lower recall score than the direct translation. This indicates that the combined filtering is effective for the transitive translation result, because it is able to reduce the number of incorrect Japanese translations. For the direct translation, the combined filtering is not effective, because the number of Japanese translations is much lower than in the transitive translation.

Table 6.4: Indonesian-Japanese Passage Retriever's Experimental Results

Description        Recall    Precision
Direct Translation with Bilingual Dictionary
  No filtering     35.5%     2.24%
  1st MI score     37.5%     2.41%
  2nd MI score     37.5%     2.42%
  3rd MI score     36.0%     2.21%
  4th MI score     38.0%     2.21%
  5th MI score     39.0%     2.34%
  MI-TF × IDF      37.0%     2.47%
Transitive Translation with Bilingual Dictionaries
  No filtering     28.5%     1.50%
  1st MI score     34.5%     1.62%
  2nd MI score     36.0%     1.58%
  3rd MI score     34.0%     1.50%
  4th MI score     35.0%     1.72%
  5th MI score     34.5%     1.87%
  MI-TF × IDF      39.0%     1.82%
Transitive Machine Translation
  No filtering     35.0%     2.61%
  1st MI score     37.5%     3.09%
  2nd MI score     35.5%     2.85%
  3rd MI score     35.0%     2.73%
  4th MI score     35.0%     2.75%
  5th MI score     34.5%     2.73%
  MI-TF × IDF      36.0%     2.87%
Keywords of Japanese monolingual queries (oracle)
  No filtering     70.0%     5.46%
Table 6.4 also shows that our translation achieved only about 50% of the recall of the oracle experiment (last row: document retrieval using keywords extracted from the Japanese monolingual queries). The transitive machine translation achieved the worst performance, mainly because of its number of OOV words. This result differs from the one obtained for the Indonesian-Japanese CLIR. One of the reasons is that in the QA passage retriever, the answer, which is one of the most important keywords, does not appear in the input query, unlike in the IR task.

We also grouped the experimental results into queries with OOV words and queries without them. The results, shown in Table 6.5, point out that work remains to handle the OOV words and obtain a better passage retriever performance. They also indicate that the in-vocabulary translation gives imprecise translations: all the recall scores are lower than the oracle experiment.

Table 6.5: Indonesian-Japanese Passage Retriever's Experimental Results (with and without OOV words)

                   Queries without OOV words   Queries with OOV words
Description        Recall     Precision        Recall     Precision
Direct Translation with Bilingual Dictionary
No filtering       48.84%     3.70%            21.9%      0.90%
MI-TF × IDF        53.49%     3.66%            28.1%      1.57%
Transitive Translation with Bilingual Dictionaries
No filtering       43.02%     2.86%            17.5%      0.49%
MI-TF × IDF        52.33%     2.71%            28.95%     1.16%
Transitive Machine Translation
No filtering       42.10%     3.61%            25.58%     1.31%
MI-TF × IDF        42.98%     3.68%            26.74%     1.82%

In order to see the effect of dictionary quality on the passage retrieval, we conducted the passage retriever experiment (without filtering) for the direct translation with four dictionaries. These are the Indonesian-Japanese dictionary reduced to various sizes: 3,000, 5,000, 8,857 and 14,823 entries (the original Indonesian-Japanese dictionary). The reduction was done by keeping the Indonesian words occurring most frequently in the Indonesian newspaper corpus. Figure 6.5 shows the experimental results.

Figure 6.5: Experimental Results of Indonesian-Japanese Passage Retriever with Direct Translation using Dictionaries of Various Sizes (recall and OOV rate, in %, plotted against dictionary sizes of 3,000, 5,000, 8,857 and 14,823 entries)

The conclusion matches the one drawn for the Indonesian-Japanese CLIR: the larger the number of OOV words, the lower the achieved performance. This indicates that dictionary quality plays an important role in a cross language system. It also shows that at the 3,000-word vocabulary size, without the filtering schema, the direct translation using a bilingual dictionary has a lower recall score than the transitive translation using bilingual dictionaries.

6.6.4 Japanese Answer Finder

In the answer finder module, we used Yamcha with its default configuration as the SVM based text chunking software. To evaluate the performance of the answer finder module, we conducted the answer finder experiments on the correct passages. The result is shown in Table 6.6. The evaluation scores are Top1 (rate of correct top-1 answers), Top5 (rate of at least one correct answer among the top 5 answers), TopN (rate of at least one correct answer among all found answers) and MRR (Mean Reciprocal Rank: the average over all questions of the reciprocal rank 1/n, where n is the highest rank of a correct answer). "Baseline" means that we use the features mentioned in Section 6.5.2.
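For reference, the following minimal sketch computes these four scores from ranked answer lists; the exact-match comparison against a gold answer set and all names here are illustrative assumptions.

def evaluate(ranked_answers, gold):
    """Top1/Top5/TopN/MRR over ranked answer lists and gold answer sets."""
    n = len(ranked_answers)
    top1 = top5 = topn = 0
    rr_sum = 0.0
    for answers, correct in zip(ranked_answers, gold):
        ranks = [i + 1 for i, a in enumerate(answers) if a in correct]
        if ranks:
            topn += 1                 # a correct answer was found at all
            rr_sum += 1.0 / ranks[0]  # reciprocal of the highest rank
            if ranks[0] == 1:
                top1 += 1
            if ranks[0] <= 5:
                top5 += 1
    return {"Top1": top1 / n, "Top5": top5 / n,
            "TopN": topn / n, "MRR": rr_sum / n}

ranked = [["小渕", "自民党"], ["講談社"], ["東京", "大阪"]]
gold = [{"自民党"}, {"講談社"}, {"京都"}]
print(evaluate(ranked, gold))
# Top1 = 1/3, Top5 = 2/3, TopN = 2/3, MRR = (1/2 + 1 + 0) / 3 = 0.5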
We tried to add two more features: a word distance feature and a character type feature. The word distance feature gives the distance between the current word and another document word that matches a question keyword. The character type feature labels a document word as number, kanji, katakana, hiragana or alphabet type (a small sketch of this feature appears at the end of this subsection).

To see the effect of the transitive translation in the answer finder features, we conducted two kinds of experiments on the oracle correct documents. The first used the transitive translation to measure the document word similarity score, shown in the first four rows of Table 6.6. The second used the correct translation, i.e. the Japanese keywords contained in the Japanese queries, shown in the last four rows of Table 6.6.

Table 6.6: Question Answering Accuracy for Correct Documents

Description           Top1     Top5     TopN     MRR
Use transitive translation to calculate word similarity
baseline              24.0%    36.0%    40.0%    29.3%
+word distance        20.0%    31.0%    33.0%    24.2%
+character type       22.0%    37.0%    38.0%    27.6%
+distance and type    21.0%    33.5%    34.0%    25.5%
Use keywords of Japanese queries to calculate word similarity
baseline              29.0%    44.0%    48.5%    35.3%
+word distance        27.5%    40.5%    43.5%    33.2%
+character type       30.5%    44.5%    46.5%    36.2%
+distance and type    27.5%    41.5%    44.0%    33.6%

This comparison shows that, for the answer finder method, the translation errors introduced by the transitive translation decrease the answer finder result. The influence, however, is not as significant as for the passage retrieval, where the recall score of the transitive translation is about half of that obtained with the correct Japanese keywords.

As the final experiment, we applied the same answer finder module to the passage retriever results, using the passage retriever with the combined MI and TF × IDF filtering method. The answers were ranked using the recall score (R) of the document retrieval and the text chunking score (T) produced by Yamcha. The question accuracy scores are shown in Table 6.7.

Table 6.7: Question Answering Accuracy for Retrieved Documents

Description           Top1    Top5    TopN    MRR
Direct Translation
Baseline              3       7.5     22      5.5
+word distance        3.5     8.5     16.5    5.7
+character type       3       7.5     14.5    5.4
+distance and type    3.5     8.5     16.5    5.7
Transitive Translation
Baseline              2       6       19.5    3.9
+word distance        2.5     4.5     16.5    4.0
+character type       2       5.5     19.5    4.0
+distance and type    3       4       16.5    4.2
Transitive Machine Translation
Baseline              2       5.5     20      4.2
+word distance        2       4.5     16      3.4
+character type       2       5       18      4.0
+distance and type    2       4       18.5    3.4

All accuracy scores shown in Table 6.7 are quite comparable, similar to the passage retrieval results in Table 6.4. The best Top1 score is achieved by the direct translation method, and the transitive translation achieves an almost comparable Top1 performance. In the Top5 answers, however, even though the passage retriever result of the direct translation (with TF × IDF filtering) is lower than the transitive translation, the QA performance of the direct translation is better. This shows that the ranking schema could not give a good rank position to the correct answer.
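As promised above, a minimal sketch of the character type feature follows. It labels a token by the Unicode block of its first character; labeling mixed tokens by their first character is a simplifying assumption of the sketch, not necessarily the exact rule used in the experiments.

def char_type(token):
    """Label a token as number, kanji, katakana, hiragana or alphabet."""
    c = token[0]
    if c.isdigit():                      # covers full-width digits too
        return "number"
    if "\u3040" <= c <= "\u309f":        # Hiragana block
        return "hiragana"
    if "\u30a0" <= c <= "\u30ff":        # Katakana block
        return "katakana"
    if "\u4e00" <= c <= "\u9fff":        # CJK Unified Ideographs (kanji)
        return "kanji"
    if c.isascii() and c.isalpha():
        return "alphabet"
    return "other"

for tok in ["自民党", "カタカナ", "ひらがな", "2001", "NTCIR"]:
    print(tok, "->", char_type(tok))
# 自民党 -> kanji, カタカナ -> katakana, ひらがな -> hiragana,
# 2001 -> number, NTCIR -> alphabet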
6.6.5 Experimental Results for the Transitive Passage Retriever

As mentioned before, the transitive passage retriever consists of two passage retrievers: an English passage retriever whose keywords are extracted from the original questions (translated into English), and a Japanese passage retriever whose keywords are extracted from the retrieved English passages. We experimented with several numbers of retained Japanese passages, from the 5 highest to the 100 highest by TF × IDF score. The experimental results of the passage retriever (PR) and the question answering (QA) are shown in Figure 6.6.

Figure 6.6: Experimental Results of Indonesian-Japanese CLQA using the Transitive Passage Retriever (Recall (PR), Top1 (QA), Top5 (QA) and MRR (QA), in %, plotted against the number of retained passages, from 100 TF × IDF down to 5 TF × IDF)

Figure 6.6 shows that for the Top1 answer, even though the passage retriever performance of the transitive passage retriever is lower than that of the direct one (without the transitive passage retriever), the CLQA performance using the transitive passage retriever (10 correct answers) is higher than using the direct passage retriever (7 correct answers). This shows that the transitive passage retriever yields better quality passages than the direct one. The weakness of the transitive passage retriever lies in its process, which requires two translations (of the question keywords and of the English passages) and two passage retrieval runs; it is more complex than the direct passage retriever, which needs only one translation (the question keyword translation) and one passage retrieval run.

Even though this Indonesian-Japanese CLQA result is lower than the Indonesian-English CLQA experiments, it is higher than the result of a similar research effort, QATRO [37], for English-Japanese CLQA, which obtained 1 correct answer for the Top1 answer.
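To summarize the mechanics of this subsection, the following minimal sketch outlines the two-stage retrieval. Here retrieve(), extract_keywords() and translate_en_ja() are toy stand-ins for the TF × IDF ranked retriever, the keyword extraction and the dictionary lookup used in the experiments; all data and names are illustrative assumptions.

def retrieve(keywords, collection, top_k):
    """Toy stand-in for a TF x IDF ranked passage retriever."""
    scored = sorted(collection,
                    key=lambda p: sum(p.count(k) for k in keywords),
                    reverse=True)
    return scored[:top_k]

def extract_keywords(passages):
    """Toy stand-in for content-word extraction from retrieved passages."""
    return sorted({w for p in passages for w in p.split() if len(w) > 3})

def translate_en_ja(words):
    """Toy stand-in for the English-Japanese dictionary lookup."""
    en_ja = {"publisher": "出版社", "magazine": "雑誌", "Gendai": "現代"}
    return [en_ja[w] for w in words if w in en_ja]

def transitive_retrieve(question_keywords_en, english_docs, japanese_docs,
                        top_k=5):
    # Stage 1: retrieve English passages with the pivot question keywords.
    english_passages = retrieve(question_keywords_en, english_docs, top_k)
    # Expansion: extract keywords from the retrieved English passages and
    # translate them into Japanese (the second translation step).
    bridge = translate_en_ja(extract_keywords(english_passages))
    # Stage 2: retrieve Japanese passages with the expanded keyword set.
    return retrieve(bridge, japanese_docs, top_k)

docs_en = ["the publisher of the magazine Gendai", "unrelated text"]
docs_ja = ["講談社 は 雑誌 現代 の 出版社", "別 の 記事"]
print(transitive_retrieve(["publisher", "magazine"], docs_en, docs_ja, 1))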
6.6.6 Experimental Results of English-Japanese CLQA

As a comparison, we also conducted an experiment on English-Japanese CLQA. Using the same test data (the CLQA task of NTCIR 2005) and the Eijirou English-Japanese dictionary, we adopted the same method as in the Indonesian-English CLQA. For this, we built our own English question shallow parser and used TreeTagger as the POS tagger software. The experimental results are shown in Table 6.8 and Table 6.10.

Table 6.8 presents the passage retriever results for the English-Japanese CLQA as recall and precision scores. The experimental results point out that the filtering using the mutual information and TF × IDF scores is not effective for the direct translation; using only the mutual information to filter the translation is adequate to enhance the passage retriever's score. These scores are comparable with the Indonesian-Japanese passage retriever (see Table 6.4). This strengthens the conclusion that in order to enhance the quality of the translation, the most important keywords to be handled are the proper nouns.

Table 6.8: Performance of English-Japanese Passage Retriever

Description        Recall   Precision
No filtering       32.0%    1.86%
1st MI score       40.5%    1.91%
2nd MI score       41.5%    1.57%
3rd MI score       37.5%    1.56%
4th MI score       37.5%    1.56%
5th MI score       40.0%    1.41%
MI-TF × IDF        37.5%    2.38%

We also calculated the passage retriever performance for queries with OOV words and queries without them. The results are shown in Table 6.9. The OOV rate is 6.1% (14.6% for proper nouns and 2.1% for common nouns). This OOV rate is higher than in the Indonesian-Japanese translation system because some English words are OOV in the English-Japanese dictionary even though their Indonesian equivalents are in-vocabulary; an example is "WW II" or "World War II" ("Perang Dunia 2" in Indonesian).

Table 6.9: Performance of English-Japanese Passage Retriever (with and without OOV words)

                   Queries without OOV words   Queries with OOV words
Description        Recall     Precision        Recall     Precision
No filtering       35.5%      2.29%            23.7%      0.83%
1st MI score       44.0%      2.35%            32.2%      0.82%
2nd MI score       43.97%     1.84%            35.6%      0.89%
3rd MI score       39.0%      1.87%            33.9%      0.81%
4th MI score       40.4%      1.96%            30.5%      0.61%
5th MI score       44.7%      1.63%            28.8%      0.87%
MI-TF × IDF        41.8%      3.07%            27.1%      0.68%

The passage retriever results with filtering are used as input for the Japanese answer finder, whose performance is shown in Table 6.10. The answer finder performance is worse than the Indonesian-Japanese one. We assume this is because of the OOV words (proper nouns), which hold important information in the question.

Table 6.10: English-Japanese Question Answering Accuracy

Description           Top1    Top5    TopN    MRR
No Filtering
Baseline              2       3.5     5.5     2.7
+word distance        2.5     4.5     9.5     3.5
+character type       1.5     4       11.5    2.7
+distance & type      2       4       8.5     3.0
Top 1 Mutual Information Filtering
Baseline              3       6.5     22      5.2
+word distance        1       4       15.5    2.9
+character type       3       5       17.5    4.6
+distance & type      0.5     4       15.5    2.6
MI-TF × IDF Filtering
Baseline              2       5       13      3.7
+word distance        0.5     3       12      2.2
+character type       2.5     4.5     13.5    3.7
+distance & type      0       3       12.5    1.8

6.7 Conclusions

We have conducted Indonesian-Japanese CLQA using easily adapted modules and a transitive approach. There are two transitive approaches: the transitive translation and the transitive passage retriever. By filtering with the mutual information score and the TF × IDF score, the transitive translation achieves a passage retriever performance comparable with the direct translation. In the answer finder experiment, the direct translation performs better than the transitive translation, and the filtering could not make them comparable. As for the transitive passage retriever, its answer finder performance achieved the best result among all the conducted experiments. This shows that the query expansion effect of the transitive passage retriever yields more relevant passages than the direct passage retriever alone.

Even though the accuracy score is lower than the Indonesian-English CLQA using a similar approach, we believe that this result can be enhanced by improving the proper name translation. As the next research step, we will try to improve the proper name translation method, for example by using the Internet as a translation resource. The experimental results also show that the answer ranking schema should be modified in order to eliminate incorrect answers.

Chapter 7

Conclusions and Future Research

7.1 Conclusions

The experiments on the Indonesian-Japanese CLIR showed that the transitive translation with bilingual dictionaries, using a combined translation filtering schema, can achieve an IR result comparable to those obtained by the direct translation and the transitive machine translation. This phenomenon gives hope for research development in cross language systems for a limited resource language such as Indonesian.
By using available resources such as a bilingual dictionary (between the language in question and English as the major language) and a monolingual corpus, one can develop a cross language system with promising results. Our systems consist of rule-based modules, statistical methods and machine-learning based modules. The rule based modules are easy to build: for example, in the question answering system we made a question shallow parser with the few rules described in Section 4.4.2, and the additional feature of class candidates in the question classification (Section 4.4.2). In the machine learning modules, we used existing machine learning software such as Yamcha [20], Weka [50] and so on. The features for the machine learning are also easily extracted, using simple rule-based or statistical methods.

As for knowledge resources such as WordNet (used in the English answer finder of the Indonesian-English CLQA), our analysis is that such a resource can be replaced by a statistical method, as done in the Indonesian answer finder. This means that if a language has a knowledge resource such as WordNet or ChaSen, that resource can be installed easily in the question answering system without building any mapping table or rules; and if the language does not have this kind of knowledge resource, as is the case for Indonesian, one can still use the statistical information gained from a monolingual corpus. The machine learning method itself proved to be effective, as in our monolingual QA and CLQA, and the effort to build such a system is much lower than building a rich rule-based system. For the above reasons, we argue that these systems (CLIR, monolingual QA, CLQA) can be applied to other languages with a minimum of programming effort.

The detailed conclusions of our study are as follows. In Chapter 3, we investigated the effectiveness of transitive translation with bilingual dictionaries in a CLIR system. Transitive translation with a bilingual dictionary yields many translations, some of which are incorrect. Using all translation results gives a lower IR score than other translation methods, so a translation filtering step is needed to improve the IR score. The conclusions of this chapter are:

• The IR result using transitive translation with a bilingual dictionary can be comparable to the direct translation and the transitive machine translation if the keyword filtering schema is good enough to select the best translation. Here, we proposed a filtering schema using the mutual information score and the TF × IDF score.

• The OOV words can be reduced from 50 to 5 words using borrowed word translation, which employs a Japanese proper name dictionary and an English-Japanese dictionary.

• The IR using a combination of direct and transitive translation with bilingual dictionaries outperformed the other translation methods.

The monolingual QA for Indonesian is described in Chapter 4. It is a study on the machine learning approach to a monolingual QA for a limited resource language. The conclusions are:

• Machine learning methods are suitable for the question classification and the answer finder, two important modules in CLQA.

• In the question classification, using the features produced by a rule-based question shallow parser together with statistical features achieves a classification accuracy of about 95%.
• In the answer finder, using a machine learning method eliminates the role of the NE (Named Entity) tagger, which is a common component in building a QA system. Using the available features, without explicit semantic information, the machine learning can still give a good accuracy score. Combining the EAT (Expected Answer Type), the question features and the document features improves the QA accuracy compared to the uncombined features.

Chapter 5 describes our work on Indonesian-English CLQA. The CLQA is built with a similar approach to the monolingual QA, with a machine learning method. The translation part makes the CLQA task more difficult than a monolingual QA. The conclusions of our work on the CLQA system using the machine learning method are as follows:

• Translation using a bilingual dictionary and some transformation rules gives larger translation coverage than the machine translation method used by other Indonesian-English CLQA systems.

• Using the EAT information in the machine learning based answer finder module improves the system accuracy. It also benefits test questions that do not have patterns similar to the training data.

• CLQA with a machine learning approach can achieve better performance than rule-based systems such as those used in two other Indonesian-English CLQA systems. It should be noted that the amount of training data does influence the system performance: the more valid data are provided, the more accurate the system.

The work on Indonesian-Japanese CLQA is described in Chapter 6. It used a transitive approach in the translation and passage retrieval phases. The conclusions are as follows:

• The filtering method using the mutual information score and the TF × IDF score on the transitive translation with bilingual dictionaries is quite effective in the passage retrieval module. It can give better performance than the direct translation.

• The translation results affect the passage retriever accuracy more than the answer finder accuracy, as shown in Table 6.6.

• A pivot language can also be used in the passage retriever. The experimental results show that using an English (pivot) passage retriever improved the performance of the Japanese (target) passage retriever.

To give the overall picture, Table 7.1 shows the experimental results of the passage retriever and answer finder modules for the Indonesian monolingual QA, the Indonesian-English CLQA and the Indonesian-Japanese CLQA. The good performance of the monolingual system shows that this method is efficient for a monolingual system: one does not have to provide many language processing tools to adopt it. To obtain a higher passage retriever performance, one can use a more advanced information retrieval method that handles synonyms better. For the answer finder, one can also add further features such as the word distance feature.

In the CLQA systems, the Indonesian-English CLQA achieved a good QA score. One of the reasons is the similarity between the Indonesian and English languages, where some borrowed English words need only a lexical transformation as the translation method. The experimental results show that this method is appropriate for a cross language system even though the translation resource is only a medium-sized bilingual dictionary. The Indonesian-Japanese CLQA did not achieve as good a performance as the Indonesian-English one.
Although the English corpus used in the Indonesian-English CLQA is comparable in content with the Japanese corpus used in the Indonesian-Japanese CLQA, the corpus sizes differ: the English corpus contains 17,741 articles and the Japanese corpus 658,719 articles. Besides the different characteristics of English and Japanese sentences (the latter without word segmentation), this corpus size makes the Japanese passage retrieval more difficult than the English one. The main cause of the low passage retrieval score is the translation errors between Indonesian and Japanese. With the available resources, the translation could not resolve the OOV proper noun problem: many proper nouns that are important question keywords could not be translated by the translation resources.

Table 7.1: Comparisons on the Question Answering Performance

                                Indonesian QA    Indonesian-English CLQA         Indonesian-Japanese CLQA
Passage Retriever Performance
Recall                          89%              76%                             34%
Precision                       9%               18.5%                           1.2%
Corpus Size                     71,109           17,741                          658,719
Answer Finder Performance
Top1                            46%              22.5%                           5%
Top5                            50%              34.0%                           9%
MRR                             51%              27.2%                           6.5%
Test Data                       200 (built in)   200 questions NTCIR CLQA 2005   200 questions NTCIR CLQA 2005

7.2 Future Research

As mentioned above, the developed systems have given quite promising results. We believe that many other methods can be combined with the existing ones in order to achieve higher performance. In the translation module, more translations can be added using a statistical method on the source and target corpora. One can also try to use the Internet to supply translations for the OOV words. As for the passage retrieval in the CLQA, a word window can be used to limit the number of retrieved passages; this approach could improve the precision score of the passage retriever module. Another idea is to add to or improve the feature quality for the machine-learning based modules (the question classifier and the answer finder), for example the similarity score between a corpus word and a question word.

To prove the easy adaptation of these systems, it is worth trying to adapt the systems to another language and measuring the development effort. Analysis is usually sharper once the proposed plan is actually carried out, and by adapting the systems in real development, they could also be improved with new ideas arising from the problems encountered.

These text based systems can be enhanced with a speech interface. The system can be supplied with spoken questions or spoken documents. Using spoken questions and answers makes the system available through devices such as microphones, telephones, etc. Involving spoken documents means that it can handle recorded material such as presentations, discussions, etc. Another idea is to involve the question answering system in a dialogue system. In a dialogue system, the answer should be extracted in real time, so some modules, such as the searching module, should be improved in order to achieve a faster response time. A dialogue question answering system could also mean an interactive question answering system, where the question clues are spread among many questions in one dialogue.

This pivot language approach could also be employed for other cross language text processing systems, for example cross language information extraction, cross language text summarization, etc.

Bibliography

[1] Septian Adiwibowo and Mirna Adriani. Finding Answers Using Resources in the Internet. In Working Notes of CLEF 2007 Workshop, Hungary, September 2007. [cited at p. 78]

[2] Mirna Adriani. Using statistical term similarity for sense disambiguation in cross-language information retrieval. Information Retrieval, 2(1):71–82, 2000. [cited at p. 22]
[3] Mirna Adriani and Rinawati. University of Indonesia Participation at Query Answering-CLEF 2005. In Working Notes of CLEF 2005 Workshop, Vienna, Austria, September 2005. [cited at p. 50, 77, 78]

[4] Mirna Adriani and C.J. van Rijsbergen. Term Similarity Based Query Expansion for Cross Language Information Retrieval. In Proc. of Research and Advanced Technology for Digital Libraries (ECDL’99), pages 311–322, Paris, 1999. Springer Verlag. [cited at p. 76]

[5] J.S. Badudu, editor. Pelik-Pelik Bahasa Indonesia. CV NawaPutra, Bandung, October 2001. [cited at p. 7, 8, 9, 10, 14]

[6] Lisa Ballesteros and W. Bruce Croft. Resolving ambiguity for cross-language retrieval. In SIGIR ’98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 64–71, 1998. [cited at p. 22]

[7] Lisa A. Ballesteros. Cross-language retrieval via transitive translation. In W. Bruce Croft, editor, Advances in Information Retrieval, pages 203–230. Kluwer Academic Publishers, 2000. [cited at p. 21]

[8] Jiangping Chen, Rowena Li, Ping Yu, He Ge, Pok Chin, Fei Li, and Cong Xuan. Chinese QA and CLQA: NTCIR-5 QA Experiments at UNT. In Proc. of NTCIR-5 Workshop Meeting, pages 242–249, Tokyo, Japan, December 2005. [cited at p. 97]

[9] Excite Japan. Excite machine translation. http://www.excite.co.jp/world/. [cited at p. 32]

[10] Marcello Federico and Nicola Bertoldi. Statistical cross-language information retrieval using n-best query translations. In SIGIR ’02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, 2002. [cited at p. 22]

[11] Christopher Fox. A stop list for general text. SIGIR Forum, 24(4), 1990. [cited at p. 32]

[12] Atsushi Fujii and Tetsuya Ishikawa. NTCIR-3 cross-language IR experiments at ULIS. In Proceedings of the Third NTCIR Workshop, March 2003. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings3/NTCIR3CLIR-FujiiA.pdf. [cited at p. 30, 31]

[13] Jianfeng Gao, Jian-Yun Nie, Endong Xun, Jian Zhang, Ming Zhou, and Changning Huang. Improving query translation for cross-language information retrieval using statistical models. In SIGIR ’01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 96–104, New York, NY, USA, 2001. ACM Press. [cited at p. 22]

[14] Jianfeng Gao, Jian-Yun Nie, and Ming Zhou. Statistical Query Translation Models for Cross-Language Information Retrieval. ACM Transactions on Asian Language Information Processing, 5(4):323–359, 2006. [cited at p. 22]

[15] Tim Gollins and Mark Sanderson. Improving cross language retrieval with triangulated translation. In SIGIR ’01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, 2001. [cited at p. 21]

[16] Sanda M. Harabagiu, Marisu A. Pasca, and Steven Maiorano. Experiments with open-domain textual question answering. In Proc. of the 18th International Conference on Computational Linguistics (COLING 2000), pages 292–298, Saarbrücken, Germany, 2000. [cited at p. 49, 54]

[17] Indonesian Agency for The Assessment and Application of Technology. KEBI, Kamus Elektronik Bahasa Indonesia (Indonesian electronic dictionary). http://nlp.aia.bppt.go.id/kebi/. [cited at p. 31, 77, 85]

[18] Hideki Isozaki, Katsuhito Sudoh, and Hajime Tsukada. NTT’s Japanese-English Cross Language Question Answering System. In Proc. of NTCIR-5 Workshop Meeting, pages 186–193, Tokyo, Japan, December 2005. [cited at p. 77, 97, 98, 101]
[19] Kazuaki Kishida and Noriko Kando. Two-stage refinement of query translation in a pivot language approach to cross-lingual information retrieval: An experiment at CLEF 2003. In Bridging Languages for Question Answering: DIOGENE at CLEF 2003, Lecture Notes in Computer Science, pages 253–262. Springer, Berlin / Heidelberg. [cited at p. 22]

[20] Taku Kudoh and Yuji Matsumoto. Use of Support Vector Learning for Chunk Identification. In Proc. of the Fourth Conference on Natural Language Learning (CoNLL-2000), pages 142–144, Lisbon, Portugal, 2000. [cited at p. 63, 88, 119]

[21] Xin Li and Dan Roth. Learning Question Classifiers. In Proc. of the 19th International Conference on Computational Linguistics (COLING 2002), pages 556–562, Taipei, Taiwan, 2002. [cited at p. 50]

[22] Chuan-Lie Lin, Yu-Chun Tzeng, and Hsin-Hsi Chen. System Description of NTOUA Group in CLQA1. In Proc. of NTCIR-5 Workshop Meeting, pages 250–255, Tokyo, Japan, December 2005. [cited at p. 97]

[23] Yi Liu, Rong Jin, and Joyce Y. Chai. A Statistical Framework for Query Translation Disambiguation. ACM Transactions on Asian Language Information Processing, 5(4):360–387, 2006. [cited at p. 22]

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo, and Maarten de Rijke. The Multiple Language Question Answering Track at CLEF 2003. In Proc. of CLEF 2003 Workshop, Norway, August 2003. [cited at p. 75]

[25] Mainichi Shimbun Co. CD-ROM Data Sets 1993–1995. Nichigai Associates Co., 1994–1996. [cited at p. 32]

[26] Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Hiroshi Matsuda, Kazuma Takaoka, and Masayuki Asahara. Morphological Analysis System ChaSen version 2.2.1 Manual. http://chasen.aist-nara.ac.jp/chasen/doc/chasen2.2.1.pdf, 2000. [cited at p. 27, 32, 105]

[27] Hideki Michibata, editor. Eijiro. ALC, March 2002. (in Japanese). [cited at p. 32]

[28] George A. Miller. WordNet: a lexical database for English. Commun. ACM, 38(11), 1995. [cited at p. 32]

[29] Tatsunori Mori and Masami Kawagishi. A Method of Cross Language Question-Answering Based on Machine Translation and Transliteration. In Proc. of NTCIR-5 Workshop Meeting, pages 215–222, Tokyo, Japan, December 2005. [cited at p. 97, 103]

[30] Overture Services, Inc. AltaVista Babelfish machine translation. http://www.altavista.com/babelfish/. [cited at p. 32]

[31] Marisu A. Pasca and Sanda M. Harabagiu. High Performance Question Answering. In Proc. of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 366–374, New Orleans, 2001. [cited at p. 51]

[32] Ari Pirkola. The Effects of Query Structure and Dictionary Setups in Dictionary-based Cross Language Information Retrieval. In Proc. of the 21st Annual International ACM SIGIR, pages 55–63, 1998. [cited at p. 63, 86]

[33] Ayu Purwarianti, Masatoshi Tsuchiya, and Seiichi Nakagawa. Indonesian-Japanese Transitive Translation using English for CLIR. Journal of Natural Language Processing, Information Processing Society of Japan, 14(2), 2007. [cited at p. 50, 76]

[34] Yan Qu, Gregory Grefenstette, and David A. Evans. Resolving translation ambiguity using monolingual corpora. In Carol Peters, Martin Braschler, Julio Gonzalo, and Michael Kluck, editors, Advances in Cross-Language Information Retrieval (CLEF 2002), pages 223–241. Springer, Berlin / Heidelberg, 2002. [cited at p. 22]
[35] Deepak Ravichandran, Eduard Hovy, and Franz Josef Och. Statistical QA - Classifier vs Re-ranker: What’s the difference? In Proc. of the ACL Workshop on Multilingual Summarization and Question Answering - Machine Learning and Beyond, pages 69–75, Sapporo, Japan, 2003. [cited at p. 51]

[36] Sanggar Bahasa Indonesia Proyek. Kmsmini2000. http://m1.ryu.titech.ac.jp/indonesia/todai/dokumen/kamusjpina.pdf, 2000. [cited at p. 32]

[37] Yutaka Sasaki. Baseline Systems for NTCIR-5 CLQA1: An Experimentally Extended QBTE Approach. In Proc. of NTCIR-5 Workshop Meeting, pages 230–235, Tokyo, Japan, December 2005. [cited at p. 77, 78, 79, 88, 97, 103, 116]

[38] Yutaka Sasaki. Question Answering as Question-Biased Term Extraction: A New Approach toward Multilingual QA. In Proc. of the 43rd Annual Meeting of the ACL, pages 215–222, Ann Arbor, 2005. [cited at p. 51, 70, 73, 74, 77]

[39] Yutaka Sasaki, Hsin-Hsi Chen, Kuang-hua Chen, and Chuan-Jie Lin. Overview of the NTCIR-5 Cross-Lingual Question Answering Task (CLQA1). In Proc. of NTCIR-5 Workshop Meeting, pages 230–235, Tokyo, Japan, December 2005. [cited at p. 75, 94, 96, 97, 101, 103, 109]

[40] Helmut Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proc. of the International Conference on New Methods in Language Processing, Manchester, UK, 1994. http://www.ims.uni-stuttgart.de/ftp/pub/corpora/treetagger1.pdf. [cited at p. 77, 88]

[41] Marcin Skowron and Kenji Araki. Effectiveness of Combined Features for Machine Learning Based Question Classification. Journal of Natural Language Processing, Information Processing Society of Japan, 6:63–83, 2005. [cited at p. 51]

[42] Leah S. Larkey, Margaret E. Connell, and Nasreen Abduljaleel. Hindi CLIR in Thirty Days. ACM Transactions on Asian Language Information Processing, 2(2):130–142, 2003. [cited at p. 22]

[43] Fadilla Z. Tala. A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia, 2003. [cited at p. 50]

[44] Kumiko Tanaka and Kyoji Umemura. Construction of a bilingual dictionary intermediated by a third language. In Proceedings of the 15th Conference on Computational Linguistics, volume 1, pages 297–303, 1994. [cited at p. 36]

[45] Toggle Text. Kataku Automatic Translation System. http://www.toggletext.com/kataku-trial.php. [cited at p. 32]

[46] Alessandro Vallin, Bernardo Magnini, Danilo Giampiccolo, Lili Aunimo, Christelle Ayache, Petya Osenova, Anselmo Peñas, Maarten de Rijke, Bogdan Sacaleanu, Diana Santos, and Richard Sutcliffe. Overview of the CLEF 2005 Multilingual Question Answering Track. In Proc. of CLEF 2005 Workshop, Vienna, Austria, September 2005. [cited at p. 75]

[47] C.J. van Rijsbergen. Information Retrieval, 2nd edition. Butterworths, London, UK, 1979. [cited at p. 29]

[48] Ellen M. Voorhees and Dawn M. Tice. Building a question answering test collection. In SIGIR 2000: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 200–207, 2000. [cited at p. 49, 54]

[49] Sri Hartati Wijono, Indra Budi, Lily Fitria, and Mirna Adriani. Finding Answers to Indonesian Questions from English Documents. In Working Notes of CLEF 2006 Workshop, Spain, September 2006. [cited at p. 77, 78]

[50] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition. Elsevier Inc., 2005. [cited at p. 64, 89, 119]
[51] Yomiuri Shimbun Co. Article Data of Daily Yomiuri 2001. Nihon Database Kaihatsu Co., 2001. http://www.ndk.co.jp/yomiuri/e-yomiuri/e-index.html. [cited at p. 32]

[52] Dell Zhang and Wee Sun Lee. Question Classification using Support Vector Machines. In Proc. of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 26–32, Toronto, Canada, 2003. [cited at p. 50, 51]

[53] Guowei Zu, Wataru Ohyama, Tetsushi Wakabayashi, and Fumitaka Kimura. Automatic text classification of English newswire articles based on statistical classification techniques. IEEJ Transactions on Electronics, Information and Systems, 124-C(3):852–860, March 2004. [cited at p. 32]

List of Publications

Journal Papers

1. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. 2007. “Indonesian-Japanese Transitive Translation using English for CLIR”. Journal of Natural Language Processing, pp. 95–123, Volume 14, No. 2, April 2007.

2. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. 2007. “A Machine Learning Approach for an Indonesian-English Cross Language Question Answering System”. IEICE Transactions on Information and Systems, pp. 1841–1852, Volume E90-D, No. 11, November 2007.

International Conferences

1. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “Query Transitive Translation Using IR Score for Indonesian-Japanese CLIR”. Proceedings of the Second Asia Information Retrieval Symposium, pp. 565–570, October 13–15, 2005. Jeju Island, Korea. Lecture Notes in Computer Science (LNCS) 3689, Information Retrieval Technology.

2. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “Indonesian-Japanese CLIR Using Only Limited Resource”. Proceedings of the Workshop on How Can Computational Linguistics Improve Information Retrieval, Workshop at COLING/ACL 2006, pp. 1–8, July 23, 2006. Sydney, Australia.

3. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “A Machine Learning Approach for Indonesian Question Answering System”. Proceedings of the IASTED International Conference on Artificial Intelligence and Applications (AIA 2007), pp. 537–542, February 12–14, 2007. Innsbruck, Austria.

Language Processing Society of Japan, reports

1. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “Query Translation from Indonesian to Japanese using English as Pivot Language”. The 11th Annual Conference, Language Processing Society of Japan, B3-9, pp. 580–583, March 2005. Takamatsu, Japan.

2. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “Estimation of Question Types for Indonesian Question Sentence”. The 12th Annual Conference, Language Processing Society of Japan, B2-8, pp. 344–347, March 2006. Keio University, Japan.

3. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “Indonesian-English Cross Language Question Answering”. The 13th Annual Conference, Language Processing Society of Japan, E5-4, March 2007. Ryukoku University, Japan.

4. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “A Transitive Translation for Indonesian-Japanese CLQA”. The 182nd SIG Meeting on Natural Language Processing (第182回自然言語処理研究会), IPSJ SIG Technical Report, pp. 93–100, November 2007. Shizuoka University, Japan.

Indonesian Student Association in Japan, reports

1. A. Purwarianti and S. Nakagawa. “Query Translation in Indonesian-Japanese Cross Language Information Retrieval”. Proc. of 13th Indonesian Scientific Conference (ISC) in Japan, pp. 428–437, September 2004. Tokyo, Japan.

2. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “Proper Name Translation in Indonesian-Japanese CLIR”.
Proc. of 14th Indonesian Scientific Conference in Japan, pp. 433-440, September, 2005. Nagoya, Japan. 3. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “Research in Indonesian Question Answering Systems”. Chubu Chapter Indonesian Scientific Meeting, March, 2006. Toyohashi, Japan. 4. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “SVM based Indonesian Question Classification Using Indonesian Monolingual Corpus and WordNet”. Proc. of 15th ISC in Japan, August, 2006. Hiroshima, Japan.