Indonesian-Japanese CLIR and CLQA
Developing Cross Language Systems for a Language Pair with Limited Resources
-Indonesian-Japanese CLIR and CLQA-
December 2007
DOCTOR OF ENGINEERING
Ayu Purwarianti
Toyohashi University of Technology

Abstract

Research on cross language text processing systems, including CLIR (Cross Lingual Information Retrieval) and CLQA (Cross Language Question Answering), has become an interesting research area. For major languages, various resources are available, such as parallel corpora, rich bilingual dictionaries, high performance machine translation software, etc. To translate an English sentence into Japanese, one can use free machine translation tools such as Babelfish or Excite. This is not always the case, especially for minor languages such as Indonesian. For Indonesian, obtaining rich translation resources is still practically impossible; building resources comparable to those of the major languages would require a great deal of work. In this thesis, we deal with cross language systems for Indonesian, a language with limited resources. We developed several systems for Indonesian: Indonesian-Japanese CLIR, Indonesian monolingual QA, Indonesian-English CLQA and Indonesian-Japanese CLQA. The main aim of this research is to propose methods that handle the limited resource problem.

For the Indonesian-Japanese CLIR, we propose a query transitive translation system for a language pair with limited data resources. The method performs the transitive translation with a minimum data resource of the source language (Indonesian) and exploits the data resources of the target language (Japanese). There are two kinds of translation: a pure transitive translation and a combination of direct and transitive translations. In the transitive translation, English is used as the pivot language. The translation consists of two main steps. The first is a keyword translation process, which translates based on the available resources; it involves several target language resources, such as a Japanese proper name dictionary and an English-Japanese (pivot-target) bilingual dictionary. The second step selects the best of the available translations: the mutual information score (computed from a target language corpus) is combined with the TF × IDF score in order to select the best translation. The results on the NTCIR 3 (NII-NACSIS Test Collection for IR Systems) Web Retrieval Task showed that this translation method achieved a higher IR score than machine translation (using the Kataku (Indonesian-English) and Babelfish/Excite (English-Japanese) engines). The performance of the transitive translation was about 38% of the monolingual retrieval, and the combination of direct and transitive translation achieved about 49% of the monolingual retrieval, which is comparable to the English-Japanese IR task.

For the monolingual Question Answering (QA) system, we developed a QA system for a limited resource language (Indonesian) that employs a machine learning approach. The QA system consists of two key components, a question classifier and an answer finder, both based on Support Vector Machines (SVM). We also developed some supporting tools, such as an easily built POS tagger and a shallow parser for the question; these supporting tools require little human effort to build.
In the development, 3000 questions covering 6 answer types were collected from 18 Indonesian college students. For the evaluation data, 71,109 Indonesian news articles available on the Web were used. In the experiments, several feature combinations for the SVM were compared. All features used are extracted from the available language resources. One important feature is the bi-gram frequency between the intended word and certain defined words; this feature is introduced to cope with the resource poorness. For the question classification task, the system achieved about 96% accuracy. The answer finder achieved an MRR of 0.52 with the first answer as the exact correct answer. Given this machine learning approach, we argue that the monolingual QA system can be adapted easily to other limited resource languages.

For the CLQA research, we adopted the approach used in the Indonesian monolingual QA for the Indonesian-English CLQA system. The Indonesian-English CLQA system was built from an Indonesian question analyzer, Indonesian-English translation using a bilingual dictionary, an English passage retriever and an English answer finder. Unlike other Indonesian-English CLQA systems, we used bilingual dictionary translation in order to enlarge the keyword coverage. The translation module is equipped with a transformation module for Indonesian borrowed words such as "prefektur" (from "prefecture"), "Rusia" (from "Russian"), etc. The translation results are combined into a boolean query to retrieve relevant English passages. Features of the translated question keywords and passages are used to locate the answer in the English passages. The bi-gram frequency feature used in the Indonesian answer finder for each word in the passage is replaced by a WordNet distance feature; this replacement is done easily, without adding any mapping tables. In the experiments, 2553 questions were used as the training data and 284 questions as the test data; these questions were collected from 18 Indonesian college students. Using this in-house data, the question answering achieved an accuracy of 25% for the first correct answer. Experiments were also conducted using translated test questions from the NTCIR 2005 CLQA data, on which our Indonesian-English CLQA system is superior to all others except one with rich translation resources. We also experimented with various sizes of training data, which shows that the size of the training data does influence the accuracy of a CLQA system.

For the Indonesian-Japanese CLQA, we used a transitive approach in the translation and passage retrieval phases. As in the Indonesian-Japanese CLIR, English is used as the pivot language in the transitive translation with bilingual dictionaries. The experiments show that a passage retriever for transitive translation using the mutual information score and the TF × IDF score as translation filters can outperform direct translation. Furthermore, using the English passage retrieval results as input for the Japanese passage retriever gives much higher passage retrieval performance than using only the query as input. The answer finder employs easily obtained features, including the POS information yielded by Chasen (an available Japanese morphological analyzer).
Although the Indonesian-Japanese question answering performance is lower than that of the Indonesian-English CLQA, it is higher than that of other research using a similar technique, which employs a text chunking process in an English-Japanese CLQA.

Acknowledgements

This research work would not have been possible without the support of many people. I would like to express my appreciation to my supervisor, Prof. Seiichi Nakagawa, who has given much direction in my research and has been supportive throughout these years. I also thank Assistant Prof. Masatoshi Tsuchiya for all his guidance and help in this research; Prof. Norihide Kitaoka for his support to me and my family; and all members of the doctoral meeting for their advice on this research, especially Prof. Takehito Utsuro and Assistant Prof. Kazumasa Yamamoto for all their help and suggestions. My thanks also to Prof. Masaki Aono and Prof. Tomoyosi Akiba for their comments and suggestions on this research; to Prof. Atsushi Fujii for providing the NTCIR data and a Japanese IR tool; and to Dr. Hammam Riza for the Indonesian-English KEBI dictionary. Many thanks to all my friends in the Nakagawa laboratory and to my Indonesian friends in Toyohashi, who have helped me and my family a great deal. Thanks to the Soroptimist Foundation Japan chapter, the Hori Foundation and the Japanese Monbukagakusho for awarding me the financial means to complete this project. Finally, great thanks to my husband, my parents, my daughter and my relatives in Indonesia, who endured this long process with me, always offering support and love.

Contents

1 Introduction
  1.1 Background
  1.2 Research Focus
  1.3 Thesis Contributions
  1.4 Thesis Outline
2 Characteristics of Indonesian Language
  2.1 Development of Indonesian Language
  2.2 Indonesian Grammar
    2.2.1 Part-of-Speech (POS) in Indonesian Language
    2.2.2 Affixes in Indonesian Language
    2.2.3 Sentence Structure in Indonesian Language
    2.2.4 Influences to Indonesian Language
  2.3 Conclusions
3 Indonesian-Japanese CLIR with Transitive Translation
  3.1 Introduction
  3.2 Related Works
  3.3 Overview of Indonesian Query
  3.4 Indonesian-Japanese Query Translation System
    3.4.1 Indonesian-Japanese Keyword Translation Process
    3.4.2 Japanese Translation Candidate Filtering Process
  3.5 Experiments
    3.5.1 Experimental Data
    3.5.2 Compared Methods
    3.5.3 Experimental Result
    3.5.4 Keyword Comparison
  3.6 Conclusions
4 Indonesian Monolingual QA using Machine Learning Approach
  4.1 Introduction on Monolingual QA
  4.2 Related Works
  4.3 Language Resources
    4.3.1 Article Collection
    4.3.2 Building Question Collection
    4.3.3 Other Data Resources (Indonesian-English Dictionary)
  4.4 QA System with Machine Learning Approach
    4.4.1 Supporting Tool (POS Tagger)
    4.4.2 Question Analyzer
    4.4.3 Passage Retriever
    4.4.4 Answer Finder
  4.5 Experimental Result
    4.5.1 Question Classifier
    4.5.2 Passage Retriever
    4.5.3 Answer Finder
    4.5.4 Using NTCIR 2005 (QAC and CLQA Data Set)
  4.6 Adopting QA System for Other Language
  4.7 Conclusions
5 Indonesian-English CLQA
  5.1 Introduction
  5.2 Related Works
  5.3 Data Collection for Indonesian-English CLQA and its Problems
  5.4 Indonesian-English CLQA Schema
  5.5 Modules in Indonesian-English CLQA
    5.5.1 Question Analyzer
    5.5.2 Section Translation
    5.5.3 Passage Retriever
    5.5.4 Answer Finder
  5.6 Experimental Result
    5.6.1 Question Classifier
    5.6.2 Keyword Translation
    5.6.3 Passage Retriever
    5.6.4 Answer Finder
    5.6.5 Experiments with NTCIR 2005 CLQA Task Test Data
  5.7 Conclusions
6 Transitive Approach in Indonesian-Japanese CLQA
  6.1 Introduction
  6.2 Related Works
  6.3 Indonesian-Japanese CLQA with Transitive Translation (Schema 1)
  6.4 Indonesian-Japanese CLQA with Pivot Passage Retrieval (Schema 2)
  6.5 Modules of Indonesian-Japanese CLQA
    6.5.1 Japanese Passage Retrieval
    6.5.2 Japanese Answer Finder
  6.6 Experiments
    6.6.1 Experimental Data
    6.6.2 Evaluation on OOV of Indonesian-Japanese Translation
    6.6.3 Passage Retriever's Experimental Results
    6.6.4 Japanese Answer Finder
    6.6.5 Experimental Result for the Transitive Passage Retriever
    6.6.6 Experimental Results of English-Japanese CLQA
  6.7 Conclusions
7 Conclusions and Future Research
  7.1 Conclusions
  7.2 Future Research
Bibliography
List of Publications

Chapter 1
Introduction

1.1 Background

There are thousands of languages used by the people of the world, and the information provided on the Internet is likewise available in various languages. Much of that information cannot be used by many people because of the limitations of a person's language ability. Researchers in the natural language processing (NLP) area have tried to build technologies to solve this problem and to bridge the understanding among nations with different languages. Many cross language research areas have been investigated, such as machine translation (MT), cross lingual information retrieval (CLIR), cross language question answering (CLQA), etc.

In order to build a cross language system, one needs adequate language resources. For example, to build a machine translation system, the resources could be a good bilingual dictionary, syntactic rules, or a parallel corpus. These rich language resources are usually available for major languages, such as English, Japanese, Chinese, German, etc. The availability of these language resources encourages research in the cross language area; we can see that there are now online machine translation systems available on the Internet, such as BabelFish, Excite, etc.

For a minor language (a language with limited data resources and limited language processing tools), the language resources become one big problem in developing a cross language system, especially if one wants to develop a cross language system between a minor language and a language other than English. This difficulty is addressed in this research: we want to build two cross language systems (CLIR and CLQA) between a minor language (Indonesian) and a major language (Japanese).

Indonesian is the national language of Indonesia, a country with a population of about 250 million. It is also understood by people of neighbouring countries such as Malaysia and Brunei Darussalam. In Indonesia, knowledge of the Japanese language is very limited, even though the two countries have had a good relationship for a long time. Indonesia itself is a developing country which needs much information on culture and science from other countries such as Japan. Therefore, it is a good opportunity to build a cross language system with Indonesian as the source language and Japanese as the target language, in order to strengthen the understanding of Indonesian people of Japan's culture and science.

1.2 Research Focus

In order to achieve the final goal, the research is divided into several steps, each realized as a system. There are four systems developed in this research:

1. Indonesian-Japanese CLIR
A CLIR system is an information retrieval system which receives query sentences in a certain source language and retrieves documents in a target language different from the source language.
In the Indonesian-Japanese CLIR system, the focus is on the translation of Indonesian query sentences. Several strategies are used to improve the IR (Information Retrieval) score using only the available data resources. The proposed strategy is based on a transitive translation with English as the pivot language. The translation results are then processed by a filtering system that uses the mutual information score and the TF × IDF score. The experiments show that the proposed translation method achieved a higher IR score than the comparison methods.

2. Indonesian QA
A QA system tries to give answers to a user's natural language questions. In this research, the monolingual QA is built from scratch: the question-answer data were collected from Indonesian people, and then the monolingual QA system was built. The aim is to build a good monolingual QA for a language with limited resources without the labour of programming the language processing tools by hand. The focus is on the application of a machine learning approach, which is adopted in two modules of this QA system. The results show that without employing any rich data resource, the monolingual QA system can still achieve a promising accuracy.

3. Indonesian-English CLQA
A CLQA system is an answering system where the answers are located in resources (documents) written in a language different from the question language. In the Indonesian-English CLQA, the question language is Indonesian and the documents are in English. The focus is on how to adopt the machine learning approach used in the monolingual QA for the CLQA system with good performance. The Indonesian-English CLQA system was tested on our in-house test data and on the test set of the NTCIR 2005 CLQA task (translated into Indonesian). For the training data, Indonesian college students were asked to write Indonesian questions based on English articles. The experimental results showed that on the NTCIR 2005 CLQA task our result was surpassed only by the top result, which employed some high quality dictionaries.

4. Indonesian-Japanese CLQA
The Indonesian-Japanese CLQA is the final goal of this research. Although it is a CLQA similar to the Indonesian-English CLQA, the two systems face different problems. First, Indonesian-Japanese translation is more difficult than Indonesian-English translation, because of the poorer translation resources and the different writing systems of Indonesian and Japanese. The second problem is the large size of the Japanese corpus used as the source for document retrieval: it is about 30 times larger than the Indonesian corpus. This makes the translation and passage retrieval used in the Indonesian-English CLQA inadequate for the Indonesian-Japanese CLQA. Both problems are addressed in our Indonesian-Japanese CLQA system.

1.3 Thesis Contributions

The research contributions are as follows:

1. Indonesian-Japanese CLIR
• It is the first work on an Indonesian-Japanese CLIR.
• It is the first work for a source language with limited resources that employs only a bilingual dictionary from the source language into a pivot language (transitive translation using bilingual dictionaries).
It should be noted that the system exploits the language resources of the target language.
• In the keyword filtering process, two scores are used in the proposed method: the mutual information score, representing the relationship between word pairs in a sequence, and the TF × IDF score, representing the relationship among all terms in a sequence at the same time.
• The experiments compare the CLIR results (the queries are translated from English queries in the Third NTCIR Web Retrieval Task data) of a transitive machine translation, a direct translation with a dictionary half the size of the transitive translation's, and a transitive bilingual dictionary translation with several keyword filtering schemas.

2. Indonesian monolingual QA
• It is the first work on an Indonesian monolingual QA.
• It is the first data collection for an Indonesian monolingual QA, with 3000 question-answer pairs. The passages with the answers tagged in them are also provided.
• The system has a question classifier module with a machine learning approach for Indonesian (a limited resource language) that achieves good performance. The features are extracted from available resources without using any knowledge or ontology resource. The question classifier can be adopted quite easily for other limited resource languages.
• The answer is located using a text chunking approach with features yielded by the question analyzer and the simple POS tagger on the target document. Even in the answer finder, the system does not employ any knowledge resource or richer quality resources or tools, which are usually unavailable for a language with limited data resources.
• The experiments compare a passage retrieval method using the IDF score with an available search engine that uses the TF × IDF score.

3. Indonesian-English CLQA
• It is the first Indonesian-English CLQA data collection of medium size (2857 question-answer pairs with answer-tagged passages). The only Indonesian-English CLQA data collection available before is a translation set of English questions for the CLEF 2005-2006 CLQA task, with 200 pairs per year.
• To translate the OOVs, the system employs an Indonesian-English transliteration module to cope with the characteristics of Indonesian words.
• Based on certain words used in the Indonesian language, the system also employs another kind of stop word elimination in addition to the common stop word elimination.
• In the answer finder module, the text chunking approach is used as in the monolingual Indonesian QA, but with different features: the bi-gram frequency (statistical information) features were replaced by the WordNet distance features.
• Besides the experiments with the in-house data, experiments using translated test questions from the NTCIR 2005 CLQA1 task are also conducted.
• In the experiments, the machine learning approach is compared across different amounts of training data.

4. Indonesian-Japanese CLQA
• It is the first work on Indonesian-Japanese CLQA.
• Two transitive approaches are compared in order to handle the limited resource problem. The first uses a method similar to the Indonesian-English CLQA, with transitive Indonesian-Japanese translation. The second transitive approach retrieves Japanese passages using the retrieved English passages.
1.4 Thesis’s Outline Chapter 2 consists of the description on Indonesian language including the grammars and some influences of other language into Indonesian. The following chapters describes the research which is divided into four systems, IndonesianJapanese CLIR, Indonesian monolingual QA, Indonesian- English CLQA and Indonesian-Japanese CLQA. Each system is explained in each chapter. Chapter 3 describes the Indonesian-Japanese CLIR, Chapter 4 describes the Indonesian monolingual QA, Chapter 5 explains the Indonesian-English CLQA and chapter 6 is about Indonesian-Japanese CLQA. The explanation of each chapter is divided into several sections such as the introduction, related works, the methods, the experiment and conclusions. The overall conclusions of the research are written in Chapter 7. Chapter 2 Characteristics of Indonesian Language 2.1 Development of Indonesian Language Malay language is the root of Indonesian language. It was named Indonesian in 1928, October 28th because of the political reason, to have a free country with its own language. Even though it comes from Malay language, Indonesian language has changed since it was declared as a national language. As a conversation language, it has been used by many people across islands and regions in Indonesia. In Indonesia, people in a certain region usually have their own original language which is called regional language such as Javanese, Sundanese, Batak, etc. Prof. Dr. Slametmulyana[5] noted that Indonesian language and these regional languages have influenced each other. The influences are not only in the vocabulary, but also in the sentence structure. The influences on sentence structure are usually found in an informal situation. Some influences also came from foreign countries. Indonesia is located between two continents (Asia - Australia) and two oceans (Indian ocean - Pacific ocean). Since some centuries ago, Indonesia has been a transition place for people that across between these two continents or between these two oceans. There are many countries which gave influences on Indonesian language, whether because of the trade affair, religion spreading or collonialism. We can find now that there are many Indonesian words come from other country such as Sanskerta, Arab, English, Dutch, etc. 7 8 CHAPTER 2. CHARACTERISTICS OF INDONESIAN LANGUAGE 2.2 2.2.1 Indonesian’s Grammar Part-of-Speech (POS) in Indonesian Language Noun Plural noun is represented in 2 ways: • repeat singular noun, example: rumah-rumah (houses) • combined with numeral word, example: 8 rumah (8 houses), banyak rumah (a lot of houses) Verb[5] In Indonesian language, there is no special characteristic to differentiate verb from other word type in a sentence. Definition of verb in Indonesian language is given by Gorys Keraf[5]: ”verb is any kind of word that can be expanded with dengan(with) + adjective”. Verb can be word with affix (me, ber, etc) or a root word. Here are some examples: • Saya mandi (I take a bath) • Saya memandikan adik (I bathe my younger sister/brother) There is no auxiliary word in Indonesian language, but auxiliary words in English can be translated into adverbs, for example: • I have read a book (saya telah membaca sebuah buku) • You must go home (kamu harus pulang) Unlike English, passive sentence in Indonesian language is not characterized by its verb, but by its structure, even though verbs with di prefix must be a passive word. 
For example:
• Lagu itu dinyanyikan Erni (that song is being sung by Erni)
• Pintu itu kubuka (that door is opened by me)

Adverb
Words that can be categorized as adverbs in the Indonesian language:
1. To express temporal information, such as akan/hendak (will), belum (haven't), masih (still), telah/sudah (has/have).
2. To express manner. The structure is dengan + adjective, for example: "Dia bekerja dengan baik" (He works well).
3. As a modality word, such as harus/mesti (must/have to), boleh/dapat (may/can), mestinya/semestinya (should).

Adjective
An adjective can be preceded by an adverb such as amat, paling, sangat, lebih ... dari, kurang ... dari, terlalu. These words correspond to more, most, very, -est, -er, too. The functions of an adjective are:
1. to describe another word's condition, for example: manis (sweet), kuat (strong);
2. to complement a number, for example: buah, kuntum, batang, as in "Saya membeli beberapa kuntum bunga" (I bought some flowers);
3. as a comparison, for example: "Buah mangga itu semanis gula" (That mango is as sweet as sugar).

Preposition [5]
Words categorized as prepositions are di/pada (in/on/at), ke (to), dari (from), di dalam (inside), di luar (outside), di atas (above, on), di bawah (under), di depan (in front of), di belakang (behind), di samping (beside), etc. The Indonesian language has no postpositions.

Pronoun [5]
There are three kinds of pronouns, as shown in Table 2.1.

Table 2.1: Pronouns in Indonesian Language
  Pronoun I   - singular: aku, saya, -ku (I, me, my); plural: kami, kita (we, our, us)
  Pronoun II  - singular: engkau, kamu, -mu (you, your); plural: kamu, kalian (you, your)
  Pronoun III - singular: ia, dia, -nya (he/she, his/her/him); plural: mereka (they, them, their)

Besides these pronouns, there are also possessive pronouns, declared by the pattern kepunyaan + pronoun or by the suffixes -mu, -ku, or -nya. Here is an example: "Apakah ini bukumu?" (Is this your book?). In conversation, there are also some indirect pronouns used to show respect to the person spoken to, for example "Apakah bapak hendak menyampaikan sesuatu?" (do you wish to say something?). Here bapak is translated as you, even though its literal translation is Mr.

Numeral
Several kinds of numerals are used:
1. Cardinal numerals, for example: 1 (satu), 2 (dua), 10 (sepuluh), 100 (seratus)
2. Distributive numerals, for example: lusin (dozen), pasang (pair)
3. Multiplicative numerals, for example: sekali/satu kali (once), dua kali (twice)
4. Ordinal numerals, for example: pertama/kesatu (first), kedua (second)
5. Partitive numerals, for example: setengah/seperdua (half), sepertiga (third)

For cardinal and distributive numerals, the noun that follows the numeral does not change, for example: "Saya membeli 3 buah buku" (I bought 3 books). It should be noted that there is usually a special word (which can be classified as an adjective) to describe a noun, such as buah for buku (book), kuntum/tangkai for bunga (flower), etc.

Conjunction [5]
Conjunctions in the Indonesian language express:
1. parallel relationships, such as dan (and), lalu/kemudian (and then), setelah itu/sesudah itu (after that), bahkan/malahan (even), apalagi/tambahan pula (moreover), etc.
2. adverse relationships, such as tetapi/akan tetapi (but), sebaliknya (on the other hand), atau (or), etc.
3. causal relationships, such as oleh sebab itu/oleh karena itu (that's why), sebab itu/karena itu (for that reason)
4. collective sentences, such as ketika/tatkala/waktu/selagi/semasa/manakala (when), sedari/sejak/semenjak (since), sesudah/setelah (after), sebelum (before), etc.

Later, in the keyword translation of the cross language system, all these conjunction words are included in the stop word list, because we consider them unimportant for retrieving documents or passages.

2.2.2 Affixes in Indonesian Language

Indonesian is an agglutinative language, which means that affixes hold an important role. Indonesian has no conjugation or declension (the verb form is independent of tense, number and person). Words in Indonesian, usually verbs, can take many prefixes or suffixes. The affixes are [5]:
1. prefixes, such as me-, di-, ber-, per-, ter-, ke-, and se-. The variations of each prefix are shown in Table 2.2 below.
2. infixes, such as -el-, -em-, and -er-. Here are some word examples:
   • getar (vibrate) + -em- = gemetar (shaking)
   • gigi (teeth) + -er- = gerigi (jagged)
3. suffixes, such as -an, -kan, -i, -nya. These suffixes can be combined with the prefixes above or stand independently of other affixes, for example: tendang-an (kick), laut-an (sea), ruang-an (room), kubur-an (graveyard), mula-i (begin), aku-i (admit), buka-kan (open), etc. Besides being a suffix, -nya, together with -ku and -mu, is defined as a pronoun, as mentioned in Table 2.1. These pronouns can be combined with the other affixes described above, e.g. ruang-an-mu (your room), me(ng)ata-kan-nya (say it), or with a root word, e.g. buku-ku (my book), kakak-mu (your brother).

Words with different affixes might have different translations, but they might also share a translation. Examples of different translations include "membaca" and "pembaca," which are translated "read" and "reader," respectively. An example of a shared translation is "baca" and "bacakan," which are both translated into "read" in English. Another example is "membaca" and "dibaca," which are translated into "read" and "being read," respectively; after stop word elimination, the translation results of "membaca" and "dibaca" give the same English translation, "read." An Indonesian dictionary usually contains the base words and those affixed words that have different translations. For example, the "se-nya" affix declares a "most possible" pattern, as in "sebanyak-banyaknya" (as much as possible), "sesedikit-sedikitnya" (as little as possible), and "sehitam-sehitamnya" (as black as possible); this affix can be attached to many adjectives with the same meaning pattern, so words with the "se-nya" affix are usually not included in an Indonesian dictionary.

Table 2.2: Prefix Variations in Indonesian Language

Variations of the me- prefix:
  me-: for words beginning with the phonemes l, r, w, m, n, ng, ny. Examples: me-lawat (trip), me-rawat (take care), me-wangi (smell), me-masak (cook), me-nilai (evaluate), me-nganga (gape), me-nyanyi (sing)
  mem-: for words beginning with the phonemes b, p. Examples: mem-buat (make), me(m)otong (cut), mem-protes (protest)
  men-: for words beginning with the phonemes d, t. Examples: men-datang (upcoming), me(n)olak (refuse)
  meng-: for words beginning with the phonemes g, k, h and vowels. Examples: meng-gulung (roll), meng-khayal (imagine), me(ng)ait (hook), meng-hadir-i (attend), meng-ambil (take), meng-ekor (follow), meng-elus (stroke), meng-ikat (tie), meng-ulur (elongate)
  meny-: for words beginning with j, c, s. Examples: men-jawab (answer), men-cari (find), me(ny)aring (screen)
  memper-: combines the prefixes mem- and per-. Examples: memper-kaya (make ... rich), memper-kecil (reduce)
  me-kan: similar to the prefix me-. Example: me-naik-kan (raise)
  me-i: similar to the prefix me-. Example: mem-bau-i (smell)
  memper-kan: similar to the prefix memper-. Examples: memper-bunga-kan (accrue), memper-dagang-kan (trade)
  memper-i: similar to the prefix memper-. Example: memper-baik-i (fix)
  member-kan: combines the prefixes mem- and ber-. Example: member-henti-kan (fire)
  menter-kan: combines the prefixes mem- and ter-. Example: menter-tawa-kan (laugh at)
  menge-: similar to the prefix meng-, for one-syllable root words. Examples: menge-tik (type), menge-cap (stamp)
  menye-: similar to the prefix meng-, for one-syllable root words. Examples: menye-tir (drive), menye-top (stop)

Variations of the ber- prefix:
  be-: for words beginning with the phoneme r or whose first syllable ends with er. Examples: be-rasa (taste), be-rambut (have hair), be-kerja (work), be-serta (with)
  bel-: only for bel-ajar (study)
  ber-: for all other words. Examples: ber-warna (colour), ber-kata (say)

Variations of the ke- prefix:
  Used without a suffix, this prefix is non-productive; there are only 3 words: ke-tua (leader), ke-kasih (lover), ke-hendak (will). In informal language, however, there are some influences from Javanese, such as ke-tawa (laugh), ke-temu (meet), ke-pergok (caught), ke-tabrak (hit by), etc. This prefix can be combined with the -an suffix to become the ke-an circumfix (a productive circumfix), as in ke-kuat-an (strength), ke-lemah-an (weakness), ke-lapar-an (starvation).

Variations of the pe- prefix:
  pe-: follows the same pattern as the prefix me-. Examples: pe-lawat, pe-rawat (nurse), pe-waris, pem-bina (counselor), pen-daki (climber), pen-jual (sales people), pen-curi (thief), peng-ganti (stand-in/substitute), etc.
  pemer-: Examples: pemer-satu (unifier), pemer-hati (observer)
  pel-: an exception, only for pel-ajar (student)

Variations of the ter- prefix:
  te-: for words beginning with the phoneme /r/ or whose first syllable ends with /r/ (except when the word means a superlative). Examples: te-rasa (feel), te-rencana (plan), te-perdaya (tricked)
  tel-: only for certain words, as a dissimilation effect. Examples: telanjur (too late), telantar (abandoned)
  ter-: for all other words. Examples: ter-ambil (taken), ter-buat (made of), ter-daftar (registered)
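Before moving on to sentence structure, here is a minimal Python sketch of the kind of dictionary lookup with affix stripping that a morphological analyzer for these meaning-preserving affixes performs. The prefix and suffix lists follow Table 2.2 and the suffix list above; the toy base-word dictionary and the single-strip search order are illustrative assumptions, not the actual tool built in this research.

```python
# Minimal sketch: look up a word, stripping one affix if the exact form fails.
# Longer prefixes are listed first so the longest match wins.
PREFIXES = ["memper", "member", "menter", "menge", "menye", "meng", "meny",
            "mem", "men", "me", "ber", "bel", "be", "ter", "te", "di",
            "pe", "ke", "se"]
SUFFIXES = ["kan", "nya", "an", "ku", "mu", "i"]

BASE_WORDS = {"muncul": "come out", "baca": "read", "ajar": "teach"}  # toy dictionary

def lookup(word):
    """Return an English gloss for `word`, stripping one affix if needed."""
    if word in BASE_WORDS:
        return BASE_WORDS[word]
    for suffix in SUFFIXES:                 # try stripping a suffix first
        if word.endswith(suffix) and word[:-len(suffix)] in BASE_WORDS:
            return BASE_WORDS[word[:-len(suffix)]]
    for prefix in PREFIXES:                 # then try stripping a prefix
        if word.startswith(prefix) and word[len(prefix):] in BASE_WORDS:
            return BASE_WORDS[word[len(prefix):]]
    return None

print(lookup("munculnya"))  # -> "come out" (suffix -nya stripped)
print(lookup("belajar"))    # -> "teach"    (prefix bel- stripped, base ajar)
```

A real analyzer would also handle circumfixes and the sound changes shown in Table 2.2 (e.g. me(m)otong for potong); this sketch only shows the basic strip-and-look-up idea.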
2.2.3 Sentence Structure in Indonesian Language

Indonesian is quite a simple language. For example, it has no declensions or conjugations. The basic word order is SVO (Subject-Verb-Object). Verbs are not inflected for person or number, and there are no tenses; tense is denoted by a time adverb or some other tense indicator, which can be placed at the front or the end of the sentence. There are four grammatical relations in the Indonesian language:

1. subject (S): usually a noun phrase; it can also be expanded into a sub-sentence with its own subject and predicate.
2. predicate (P): can be a noun phrase, verb phrase, adjective phrase, prepositional phrase, or numeral.
3. object (O): can be any phrase and can also be expanded into a sub-sentence with its own subject and predicate.
4. adverb (A): explains place, time, cause, manner, goal, or condition. Just like the object, it can be any phrase and can be expanded into a sub-sentence.
Below are some grammatical patterns in the Indonesian language:
1. S-P. Example: Adik (S) menangis (P) (younger sister/brother is crying)
2. S-P-O. Example: Saya (S) membaca (P) buku (O) (I am reading a book). This S-P-O pattern can be changed into O-P-S to form a passive, for example: "Ibu membaca buku" is changed into "Buku dibaca ibu" (Mother read a book).
3. S-P-A. Example: Kakak (S) menyanyi (P) di kamarnya (A) (older sister/brother is singing in her/his room)
4. S-P-O-A. Example: Ibuku (S) membeli (P) sayuran (O) di toko (A) (my mother bought vegetables at the store)
5. A-S-P-O-A. Example: Tadi pagi (A), ayahku (S) membaca (P) koran (O) di ruang keluarga (A) (This morning, my father read the newspaper in the living room)
6. S-A. In spoken language, the S-A pattern is used frequently. Example: Ibu (S) ke pasar (A) (Mother went to the market)

2.2.4 Influences to Indonesian Language

As mentioned in Section 2.1, sentence structures in the Indonesian language have been influenced by some regional languages. Some examples are shown in Table 2.3.

Table 2.3: Examples of Regional Language's Influences on Indonesian Sentence [5]
  Influenced by Javanese:
    Rumahnya ayah saya sudah dijual. (My father's house has been sold)
    Sementara orang menganggap itu benar. (Some people think it is true)
  Influenced by Sundanese:
    Buku-buku itu sudah saya kekantorkan. (I have delivered the books to the office)
    Uangmu ada di saya. (I have your money)

The Indonesian language also absorbs new words from regional or foreign languages. These borrowed words are divided into two types:
1. The lexical form and its pronunciation are not changed; the word is used in the Indonesian sentence as-is, such as "Academy Awards" in "Siapakah yang telah memenangkan Academy Awards berkali-kali?" (Who has won Academy Awards many times?)
2. The lexical form and its pronunciation are changed into Indonesian, such as aerodinamika (aerodynamics), klasifikasi (classification), etc. The transformation rules, along with their examples, are shown in Table 2.4.

Table 2.4: Transformation Rules for the Borrowed Words in Indonesian Language
  aa (Dutch) becomes a: pal (paal), bal (baal), oktaf (octaaf)
  ae can be changed into e, but can also remain ae: aerob (aerobe), aerodinamika (aerodynamics), hemoglobin (haemoglobin), hematit (haematite)
  ai and au are not changed: kaison (caisson), hidraulik (hydraulic)
  c is changed into k before a, u, o or a consonant: kalomel (calomel), konstruksi (construction), kubik (cubic), kristal (crystal)
  c is changed into s before e, i, oe, and y: sentral (central), sen (cent), sirkulasi (circulation), selom (coelom), silinder (cylinder)
  cc is changed into ks before e and i: aksen (accent), vaksin (vaccine)
  cch and ch are changed into k before a, o and a consonant: sakarin (saccharin), karisma (charisma), kolera (cholera), kromosom (chromosome)
  ch is changed into s when pronounced as s or sy: eselon (echelon), mesin (machine)
  ch is changed into c when pronounced as c: cek (check), Cina (China)
  c (Sanskrit) is changed into s: sabda (cabda), sastra (castra)
  e remains e: efek (effect), deskripsi (description), sintesis (synthesis), sistem (system)
  ea remains ea: idealis (idealist), habeas (habeas)
  ee (Dutch) is changed into e: stratosfer (stratosfeer), sistem (systeem)
  ei remains ei: eikosan (eicosane), einsteinium (einsteinium)
  eo remains eo: stereo (stereo), geometri (geometry)
  eu remains eu: neutron (neutron), eugenol (eugenol)
  f remains f: fanatik (fanatic, fanatiek), faktor (factor), fosil (fossil)
  gh is changed into g: sorgum (sorghum)
  gue is changed into ge: ige (igue), gige (gigue)
  i remains i at the beginning of a word and before a vowel: iambe (iamb), ion (ion)
  ie (Dutch) is changed into i when pronounced as i: politik (politiek), rim (riem)
  ie remains ie when not pronounced as i: varietas (variety), pasien (patient), efisien (efficient)
  kh (Arabic) remains kh: khusus (khusus), akhir (akhir)
  ng remains ng: kontingen (contingent), kongres (congress), linguistik (linguistics)
  oe (oi Greek) is changed into e: estrogen (oestrogen), enologi (oenology), fetus (foetus)
  oo (Dutch) is changed into o: kompor (komfoor), provos (provoost)
  oo (English) is changed into u: kartun (cartoon), pruf (proof), pul (pool)
  oo (double vowel) remains oo: zoologi (zoology), koordinasi (coordination)
  ou is changed into au when pronounced as au: baut (bout), kaunter (counter)
  ou is changed into u when pronounced as u: gubernur (gouverneur), kupon (coupon), kontur (contour)
  ph is changed into f: fase (phase), fisiologi (physiology), spektograf (spectrograph)
  ps remains ps: pseudo (pseudo), psikiatri (psychiatry), psikosomatik (psychosomatic)
  pt remains pt: pterosaur (pterosaur), pteridologi (pteridology), ptialin (ptyalin)
  q is changed into k: akuarium (aquarium), frekuensi (frequency), ekuator (equator)
  rh is changed into r: rapsodi (rhapsody), ritme (rhythm), retorik (rhetoric)
  sc is changed into sk before a, o, u and a consonant: skandium (scandium), skotopia (scotopia), skripsi (scriptie)
INDONESIAN’S GRAMMAR Rules sc is changed into s if it is placed before e, i and y sk is changed into sch if it is placed before a vocal t is changed into s when it is placed before i and pronounced as s th is changed into t u remains u ua remains ua ue remains ue ui remains ui uo remains uo uu is changed into u v remains v x remains x if it is placed at the beginning of a word x is changed into ks if it is placed not at the beginning of a word xc is changed into ks if it is placed before e and i xc is changed into ksk if it is placed before a, o, u and a consonant y remains y when it is pronounced as y y is changed into i when it is pronounced as i z remains z -aat is changed into -at -age is changed into -ase -ary, -air are changed into -er Examples senografi(scenography), sintilasi(scintillation), sifistoma(scyphistoma) skema(schema), skizofrenia(schizophrenia), skolastisisme(scholasticism) rasio(ratio), aksi(actie, action), pasien(patient) teokrasi(theocracy), ortografi(orthography), trombosis(thrombosis), metode(method, methode) unit(unit), nucleolus(nucleolus), struktur(structure, structuur) dualisme(dualism), akuarium(aquarium) sued(suede), duet(duet) ekuinoks(equinox), konduite(conduite), duit(duit) kuorum(quorum), kuota(quota) prematur(prematuur), vakum(vacuum), kultur(cultuur) vitamin(vitamin), televisi(television), kavaleri(cavalry) xantat(xanthate), xenon(xenon), xilofon(xylophone) eksekutif(executive), taksi(taxi), ekstra(extra), kompleks(complex), lateks(latex) eksepsi(exceptie), ekses(excess), tasi(excitation) ekskomunikasi(excommunication), sif(excursive), ekslusif(exclusive) eksiekskur- yangonin(yangonin), yen(yen), yukaganin(yuccaganin) dinamo(dynamo), propil(propyl), psikologi(psychologie) zenit(zenith), zodiak(zodiac), zaman(zaman) advokat(advokaat), traktat(traktaat) persentase(percentage), etalase(etalage) komplementer(complementary, complementair), primer(primary, primair), sekunder(secondary, secundair) 18 CHAPTER 2. 
  -ant is changed into -an: akuntan (accountant), informan (informant)
  -archy, -archie are changed into -arki: anarki (anarchy, anarchie), oligarki (oligarchy, oligarchie)
  -al, -eel, -aal are changed into -al: struktural (structural, structureel), formal (formal, formeel), ideal (ideal, ideaal), normal (normal, normaal)
  -ein remains -ein: sistein (cystein), protein (protein)
  -or, -eur are changed into -ur: direktur (director, directeur), inspektur (inspector, inspekteur)
  -or remains -or: diktator (dictator), korektor (corrector)
  -ive, -ief are changed into -if: deskriptif (descriptive, descriptief), demonstratif (demonstrative, demonstratief)
  -ic, -ics, -ique, -iek, -ica (nominal) are changed into -ik, -ika: fonetik (phonetics, phonetiek), fisika (physics, physica), logika (logic, logika), dialektika (dialectics, dialectica), teknik (technique, techniek)
  -ile, -iel are changed into -il: persentil (percentile, percentiel), mobil (mobile, mobiel)
  -ic (adjective), -isch are changed into -ik: elektronik (electronic, electronisch), mekanik (mechanic, mechanisch), balistik (ballistic, balistisch)
  -ical, -isch are changed into -is: ekonomis (economical, economisch), praktis (practical, practisch), logis (logical, logisch)
  -ism, -isme are changed into -isme: modernisme (modernism, modernisme), komunisme (communism, communisme), imperialisme (imperialism, imperialisme)
  -ist is changed into -is: publisis (publicist), egois (egoist), teroris (terrorist)
  -logy, -logie are changed into -logi: teknologi (technology, technologie), fisiologi (physiology, physiologie), analogi (analogy, analogie)
  -logue is changed into -log: katalog (catalogue), dialog (dialogue)
  -loog (Dutch) is changed into -log: analog (analoog), epilog (epiloog)
  -oid, -oide are changed into -oid: hominoid (hominoid, hominoide), antropoid (anthropoid, anthropoide)
  -oir(e) is changed into -oar: trotoar (trottoir), repertoar (repertoire)
  -ty, -teit are changed into -tas: universitas (university, universiteit), kualitas (quality, kwaliteit)
  -ure, -uur are changed into -ur: struktur (structure, structuur), prematur (premature, prematuur)

2.3 Conclusions

In this chapter, we introduced the properties of the Indonesian language. Basically, there are two types of Indonesian words: native words and borrowed words. Native words originate from Indonesian; borrowed words come from other languages. A borrowed word may keep its original form (unmodified), for example Salsa, Manhattan, etc., or it may be a modified form of the original term, as described in Table 2.4. We make use of this characteristic (modified borrowed words) by building transformation rules to translate the OOVs. Another important characteristic is the affix: a single base term with different affixes can have different meanings and different English translations, but there are also affixes that do not change a word's meaning. The morphological analyzer built in this research aims to handle the affixes that do not change a word's meaning.
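To illustrate how such transformation rules can serve OOV translation in the later chapters, the following is a minimal Python sketch that rewrites an Indonesian borrowed word into candidate English spellings by applying a few rules from Table 2.4 in reverse before a pivot-dictionary lookup. The four-rule subset and the brute-force candidate expansion are illustrative assumptions; a real module would use the full rule table and keep only the candidates that are found in the bilingual dictionary.

```python
# Sketch: reverse a few Table 2.4 borrowing rules to recover English spellings.
RULES = [("ks", "x"),   # taksi  -> taxi
         ("k", "c"),    # kartun -> cartun (then u -> oo below)
         ("u", "oo"),   # cartun -> cartoon
         ("f", "ph")]   # fase   -> phase

def english_candidates(word):
    """For each rule, either apply it everywhere or skip it, collecting
    all resulting spelling variants (junk variants are later filtered
    out by a dictionary lookup)."""
    candidates = {word}
    for indo, eng in RULES:
        candidates |= {c.replace(indo, eng) for c in candidates}
    return candidates

print("taxi" in english_candidates("taksi"))      # True
print("cartoon" in english_candidates("kartun"))  # True
```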
Chapter 3
Indonesian-Japanese CLIR with Transitive Translation

3.1 Introduction

Nowadays, there are many web resources on the Internet written in languages other than English, and CLIR (Cross Language Information Retrieval) serves as a bridge between Internet sources and users with different languages. Indonesia, with a population of 241 million people, also has an interest in utilizing CLIR, but, unfortunately, Indonesian is a language with minimal data resources. There is therefore a need to build a CLIR for a language with minimal data resources such as Indonesian. For this kind of translation, transitive translation with bilingual dictionaries is an alternative method: even though other data resources such as machine translation or parallel corpora exist, an electronic bilingual dictionary is the most widely available. We propose a query transitive translation system for a CLIR whose source language has poor data resources. Our research aim is to perform the transitive translation with a minimum data resource of the source language (Indonesian) and to exploit the data resources of the target language (Japanese). In the transitive translation, English is used as the pivot language.

3.2 Related Works

Some studies have been done in the field of transitive translation with bilingual dictionaries in CLIR systems, such as [7, 15]. The work in [7] translated Spanish queries into French with English as the interlingua; Ballesteros used the Collins Spanish-English and English-French dictionaries. The work in [15] translated German queries into English using two pivot languages (Spanish and Dutch); Gollins used the EuroWordNet as a data resource. To our knowledge, no CLIR with transitive translation exists for a source language with limited data resources such as Indonesian. In [42], a language with limited data resources was used as the target language in an English-Hindi CLIR: English queries were translated into Hindi using two English-Hindi bilingual dictionaries, a direct rather than a transitive translation. They provided a stop-word list, a simple normalizer, a stemming module and a transliteration module for Hindi; these tools were used to support the Hindi information retrieval. It should be noted that although their target language is a limited resource language, the other language is English, which differs from our research focus: a cross language system for a language pair where the target language is a major language other than English.

Translation with a bilingual dictionary usually provides many translation alternatives, only a few of which are appropriate, and a transitive translation gives even more translation alternatives than a direct translation. In order to select the most appropriate translation, a monolingual corpus can be used. The work in [6] used an English corpus to select English translations based on the Spanish-English translation and analyzed co-occurrence frequencies to disambiguate phrase translations; the occurrence score is called the em score, each set is ranked by its em score, and the highest ranking set is taken as the final translation. The work in [13] used a Chinese corpus to select the best English-Chinese translation set, modifying the EMMI weighting measure to calculate a term coherence score. A Chinese corpus was also used by [14] in English-Chinese CLIR, with three kinds of translation: COTM (a co-occurrence translation model with a modified mutual information score), NPTM (an NP translation model that identifies NPs and translates them statistically) and DPTM (a dependency translation model).
A Chinese-English CLIR [23] proposed a statistical framework called the maximum coherence principle, with term similarity based on the mutual information score, and another technique using a graph partitioning approach for query translation disambiguation. The work in [34] selected the best Spanish-English and Chinese-English translations using an English corpus; the coherence score was based on 1) web page counts, 2) retrieval scores, and 3) mutual information scores. The work in [2] translated Indonesian into English and used an English monolingual corpus to select the best translation, employing a term similarity score based on the Dice similarity coefficient. The work in [10] combined the N-best translations, based on an HMM model of query translation pairs and the relevant-document probability of the input word, to rank Italian documents retrieved by an English query. The work in [19] used all terms to retrieve documents in order to obtain the best term combination, and chose the most frequent term in each term translation set appearing in the top-ranked documents.

Here, we translate Indonesian queries into a Japanese keyword list in order to retrieve Japanese documents. Because of the resource limitations between Indonesian and Japanese, we conduct a transitive translation with English as the pivot language. Even though machine translation or a parallel corpus could be used in the transitive translation, we prefer to employ a bilingual dictionary as the most available resource. To filter the translation results, we combine the TF × IDF engine score and the mutual information score (taken from a monolingual target language corpus) to select the most appropriate translation. Another problem in translation with bilingual dictionaries is out-of-vocabulary (OOV) words. This problem becomes critical when the OOV words are proper nouns, which are usually important keywords in the query; if the proper noun keywords are not translated, the IR system will return almost no relevant documents. We found that some OOV words that are not available in the Indonesian dictionary are borrowed words; in this Indonesian-Japanese CLIR, the borrowed words come from English and Japanese. Therefore, in order to translate these OOVs, we use the English-Japanese dictionary and a Japanese proper name dictionary.
3.3 Overview of Indonesian Query

Indonesian is the official language of Indonesia and is understood by people in Indonesia, Malaysia, and Brunei. It belongs to the Malayo-Polynesian (Austronesian) language family, which extends across the islands of Southeast Asia and the Pacific, and it is related to neither English nor Japanese. Unlike other languages used in Indonesia, such as Javanese, Sundanese and Balinese, which use their own scripts, Indonesian uses the familiar Roman script, with only the 26 letters of the English alphabet; no transliteration module is needed to translate an Indonesian sentence.

Indonesian sentences usually consist of native (Indonesian) words and borrowed words. The first three query examples in Table 3.1 contain borrowed words: "Academy Awards" in the first query, "novel" in the second and "salsa" in the third are borrowed from English, and "Miyabe Miyuki" in the second query is transliterated from Japanese. Other than these exact borrowed words, there are also borrowed words that were changed into Indonesian, such as "generasi" from "generation" in the first query, "metode" from "method" in the second query and "ozon" from "ozone" in the last query. To obtain a good translation, the query translation in our system must be able to translate both kinds of words, the native (Indonesian) words and the borrowed words.

Table 3.1: Indonesian Query Examples
  Query 1: Saya ingin mengetahui siapa yang telah menjadi peraih Academy Awards beberapa generasi secara berturut-turut (I want to know who have been the recipients of successive generations of Academy Awards)
  Query 2: Temukan buku-buku yang mengulas tentang novel yang ditulis oleh Miyabe Miyuki (Find book reviews of novels written by Miyabe Miyuki)
  Query 3: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (I want to know the method of studying how to dance the salsa)
  Query 4: Saya ingin belajar tentang akibat perusakan lapisan ozon dan pelebaran lubang ozon terhadap tubuh manusia (I want to learn about the effects that destruction of the ozone layer and expansion of the ozone hole have on the human body)

3.4 Indonesian-Japanese Query Translation System

Indonesian-Japanese query translation is a component of the Indonesian-Japanese CLIR. The query translation system aims to translate an Indonesian query sentence into a Japanese keyword list, which is then used in the Japanese IR system to retrieve the relevant documents. The schema of the Indonesian-Japanese query translation system is shown in Figure 3.1. The query translation system consists of two subsystems: keyword translation and translation candidate filtering. The keyword translation system obtains Japanese translation candidates for an Indonesian query sentence; the translation candidate filtering selects the most appropriate among all the Japanese translation alternatives. The Japanese translation resulting from the translation filtering is used as the input for the Japanese IR system. The keyword translation and translation filtering processes are described in the next section.

[Figure 3.1: Indonesian-Japanese Query Translation Schema]

3.4.1 Indonesian-Japanese Keyword Translation Process

The keyword translation system is the process by which Indonesian keywords are translated into Japanese keywords. We chose transitive translation with bilingual dictionaries for the keyword translation; other approaches, such as direct translation or machine translation, are employed as comparison methods. The schema of the keyword transitive translation using bilingual dictionaries is shown in Figure 3.2. Even though an Indonesian-Japanese dictionary is available, we do not propose direct translation using a bilingual dictionary, because bilingual dictionaries between a given language and English are more available than a bilingual dictionary between two languages other than English. By using transitive translation, this method can be applied to other language pairs more easily than direct translation with a bilingual dictionary. The keyword translation process consists of native (Indonesian) word translation and borrowed word translation. The native words are translated using the Indonesian-English and English-Japanese dictionaries.
Because an Indonesian POS tagger or parser is not available, we translate single words and consecutive word pairs that exist as a single term in the Indonesian-English dictionary. As mentioned in the previous section on affix combinations in Indonesian, not all affixed word forms are recorded in an Indonesian dictionary. Therefore, if a search does not find the exact word, it searches for words that are the basic word of the query word or that share the same basic word. For example, the Indonesian word "munculnya" (come out) has the basic word "muncul" with the postfix "-nya". The term "munculnya" is not in the dictionary, so the search takes "muncul" as the matching word for "munculnya" and returns the English translation of "muncul", such as "come out", as its translation result. The English translation results are then translated into Japanese using an English-Japanese dictionary. The English translation results also include inflected words, not only basic words. For example, the English translation in the Indonesian-English dictionary for "obat-obatan" is "medicines," while the term in the English-Japanese dictionary is "medicine." Therefore, in the English matching, we searched for either the same English word or the basic word of the English translation.

Figure 3.2: Indonesian-Japanese Keyword Translation Schema (an Indonesian sentence query is split into Indonesian words; native words pass through Indonesian-English and English-Japanese bilingual dictionary translation, while borrowed words pass through English-Japanese bilingual dictionary translation, Japanese proper name dictionary translation, or hiragana/katakana transliteration; the resulting Japanese keywords are analyzed by the Japanese morphological analyzer Chasen and filtered by Japanese stop word elimination to yield the candidates for the filtering process)

In Indonesian, a noun phrase has the reverse word order of English. For example, "ozone hole" is translated as "lubang ozon" (ozone = ozon, hole = lubang) in Indonesian. Therefore, besides word-by-word translation, we also searched for the reversed English word pair as a single term in the English-Japanese dictionary. This strategy reduced the number of translation candidates. An example of the keyword translation process in the transitive translation with bilingual dictionaries is shown in Table 3.2. In the query example, three word pairs are treated as single terms in the English-Japanese dictionary: ozone layer (オゾン層), ozone hole (オゾンホール) and human body (人体). Other translations such as coating or stratum (synonyms for layer) are eliminated as translation candidates. Borrowed words are translated using an English-Japanese dictionary, because most of the borrowed words in the query translation system come from English. Examples of borrowed words in the queries are "Academy Awards," "Aurora," "Tang," "baseball," "Plum," "taping," and "Kubrick." Even though the English-Japanese dictionary may translate such words accurately, some proper nouns cannot be translated by this dictionary, such as "Miyabe Miyuki," "Miyazaki Hayao," "Honjo Manami," etc. These proper names come from Japanese words that have been romanized; a hedged sketch of this OOV fallback handling follows.
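The sketch below treats the dictionaries as plain lookup tables; the affix rules cover only the postfix "-nya" and word reduplication, and all entries are illustrative stand-ins, not actual dictionary content.

```python
# Hedged sketch of the fallback chain for words missing from the Indonesian-English
# dictionary: first strip simple affixes, then treat the word as borrowed and try
# the English-Japanese and proper-name dictionaries.

id_en = {"muncul": ["come out"]}
en_ja = {"salsa": ["サルサ"]}
proper_names = {"miyabe": ["宮部"], "miyuki": ["みゆき"]}

def strip_affixes(word):
    """Very small stemmer: only the postfix '-nya' and reduplication are handled."""
    if word.endswith("nya"):
        return word[:-3]
    if "-" in word:                      # e.g. "obat-obatan" -> "obat"
        return word.split("-")[0]
    return word

def translate_oov(word):
    stem = strip_affixes(word)
    if stem in id_en:                    # native word found after stemming
        return id_en[stem]
    if word in en_ja:                    # borrowed English word
        return en_ja[word]
    if word in proper_names:             # romanized Japanese proper noun
        return proper_names[word]
    return []                            # still OOV; katakana transliteration would go here

print(translate_oov("munculnya"))        # ['come out']
print(translate_oov("salsa"))            # ['サルサ']
```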
Table 3.2: Illustration of the Native (Indonesian) Keyword Translation Process
Indonesian query: Saya ingin belajar tentang akibat perusakan lapisan ozon dan pelebaran lubang ozon terhadap tubuh manusia (= I want to learn about the effects the destruction of the ozone layer and the expansion of the ozone hole have on the human body)
Indonesian keywords: belajar, perusakan, lapisan, ozon, pelebaran, lubang, ozon, tubuh, manusia
English keywords (Indonesian-English dictionary matching result): belajar → to study, to learn, to take up; perusakan → damaging; lapisan → coating, layer, stratum; ozon → ozone; pelebaran → widening, broadening; lubang → cavity, hole, hollow, perforation; tubuh → body; manusia → human being, man, human
Japanese keywords (English-Japanese dictionary matching result): belajar → ∼を調べる, 勉強する, 研究する, 学ぶ, ...; perusakan → 損害を与える, 不利な, 有害な; lapisan ozon → オゾン層 (ozone layer); pelebaran → 拡大主義者, 広がり; lubang ozon → オゾンホール (ozone hole); tubuh manusia → 人体 (human body)

In Japanese, these proper names might be written in one of the following scripts: kanji (Chinese characters), hiragana (cursive form), katakana (squared form) and romaji (Roman alphabet). One alphabetical word can be transliterated into more than one Japanese word. For hiragana and katakana, a borrowed word is transliterated using a pairing list between hiragana or katakana and its Roman-alphabet reading; these scripts have a one-to-one correspondence with pronunciation (syllables or phonemes), something that is not possible for kanji. Therefore, in order to obtain kanji corresponding to borrowed words, we use a Japanese proper name dictionary. Each term in the original proper name dictionary usually consists of two words, the family name and the first name. For a wider selection of translation candidates, we split each two-word term into two separate terms (a small sketch of this expansion is given after Table 3.3), so that even when the input word cannot be found as a full entry (family name plus first name), a match may still be possible with the expanded proper name dictionary.

Each of the above translation steps also involves stop word elimination, which deletes stop words, i.e., words that carry no significant meaning for document retrieval. Stop word elimination is done at every language step. First, Indonesian stop word elimination is applied to the Indonesian query sentence to obtain the Indonesian keywords. Second, English stop word elimination is applied before the English keywords are translated into Japanese keywords. Finally, Japanese stop word elimination is done after the Japanese keywords are morphologically analyzed by Chasen [26].

Table 3.3: Examples of Indonesian-Japanese Keyword Translation
Indonesian query: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (= I want to know the method of studying how to dance the salsa)
Native word translation:
- Indonesian keywords: metode | belajar | menari
- English keywords (from the Indonesian-English dictionary): method | to learn, to study, to take up | dance
- Japanese keywords (from the English-Japanese dictionary): 規則正しさ, 筋道, 秩序, 方法 | ∼を調べる, 勉強する, 研究する, 学ぶ, 勉強, 研究, 調査, 検討, 書斎, ∼を学ぶ, 知る, わかる, 暗記する, 覚える, 確認する, 習う, 突きとめる | 舞踊, ダンス, ダンスパーティー, バレエ, ダンスする, 舞う, 踊る, 踊らされる, いいようにされる
- Japanese keywords (after analysis by Chasen): 規則正し, 筋道, 秩序, 方法 | 調べる, 勉強, 研究, 学ぶ, 調査, 検討, 書斎, 知る, わかる, 暗記, 覚える, 確認, 習う, 突きとめる | 舞踊, ダンス, ダンスパーティー, バレエ, 舞う, 踊る, 踊ら
Borrowed word translation:
- Indonesian keyword: salsa
- English keyword: salsa
- Japanese keywords (from the English-Japanese dictionary): サルサ, サルサのダンス
- Japanese keywords (after analysis by Chasen): サルサ

Examples of the Indonesian-Japanese keyword translation are shown in Table 3.3.
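Returning to the proper name dictionary described above, each two-word entry is split into two single-word entries so that a partial match is still possible. A minimal sketch of that expansion, with an illustrative entry rather than actual dictionary content:

```python
# Each original entry ("family-name first-name" -> kanji pair) is split into two
# single-word entries, so a query containing only one of the names still matches.

original = {("miyabe", "miyuki"): ("宮部", "みゆき")}

expanded = {}
for (family, first), (kanji_family, kanji_first) in original.items():
    expanded.setdefault(family, set()).add(kanji_family)
    expanded.setdefault(first, set()).add(kanji_first)

print(expanded["miyabe"])  # {'宮部'}
```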
Each word in the input query is matched against the terms in the Indonesian-English bilingual dictionary and the stop word list. If the word is not in the stop word list and exists in the Indonesian-English bilingual dictionary, it is assumed to be a native word and translated using the Indonesian-English and English-Japanese dictionaries. Words of this type in Table 3.3 are "metode" (method), "belajar" (to learn) and "menari" (dance). If the word is not in the stop word list and does not exist in the Indonesian-English bilingual dictionary, it is assumed to be a borrowed word and translated using the English-Japanese dictionary, the Japanese proper name dictionary and/or transliteration. For example, "salsa" is translated into サルサ using the English-Japanese dictionary. The final Japanese keywords are then analyzed by Chasen and input into the translation candidate filtering process, which is described in the following section. The keyword transitive translation is used in two configurations: 1) transitive translation of all words in the query, and 2) direct translation using the Indonesian-Japanese dictionary, with transitive translation via the English-Japanese dictionary applied only to the Indonesian OOV words. We call the first method transitive translation using bilingual dictionaries and the second method combined translation (direct-transitive).

3.4.2 Japanese Translation Candidate Filtering Process

The Japanese translation candidate filtering process selects the most appropriate of the Japanese translation candidates. In order to select the best Japanese translation, rather than choosing only the highest TF × IDF score or only the highest mutual information score among all keyword lists, we combine both scores, selecting the keyword list with the highest TF × IDF score among the sequences with the top mutual information scores. To avoid computing scores for all possible sequences, we keep 100 term sequences ranked by their mutual information scores. The mutual information score is calculated per word pair. First, we select the 100 (or fewer) sequences with the best mutual information scores among the translations of the first two Indonesian keywords. These 100 best sequences, joined with the translation set of the third keyword, are rescored, and the 100 best sequences over the three translation sets are kept. This step is repeated until all translation sets are covered (a sketch of this beam procedure follows below). For a word sequence, the mutual information score is

I(t_1 \cdots t_n) = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} I(t_i, t_j)   (3.1)

where I(t_1 \cdots t_n) is the mutual information for a sequence of words t_1, t_2, \cdots, t_n and I(t_i, t_j) is the mutual information between the two words (t_i, t_j). A zero-frequency word pair therefore has no impact on the mutual information score of a word sequence. For the mutual information score between two words, the standard formula [47] is used:

I(t_i, t_j) = P(w_i, w_j) \times \log \frac{P(w_i, w_j)}{P(w_i) \, P(w_j)}   (3.2)

where

P(w_i, w_j) = \frac{C(w_i, w_j)}{\sum_{w'_i, w'_j} C(w'_i, w'_j)}   (3.3)

and

P(w) = \frac{C(w)}{\sum_{w'} C(w')}   (3.4)

Here C(w_i, w_j) is the co-occurrence frequency of terms w_i and w_j within a predefined window, and C(w) is the occurrence count of term w. This mutual information score represents the relationship between word pairs in a sequence, but not the relationship among all terms in a sequence at the same time.
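The beam-style filtering described above can be sketched as follows; the mi function stands in for the pairwise score of Equation 3.2 estimated from the target-language corpus, and the toy values are purely illustrative.

```python
import itertools

def sequence_mi(seq, mi):
    """Equation 3.1: sum the pairwise MI over every word pair in the sequence."""
    return sum(mi(a, b) for a, b in itertools.combinations(seq, 2))

def beam_filter(translation_sets, mi, beam=100):
    """Keep the `beam` best partial sequences while joining one translation set at a time."""
    beams = [(t,) for t in translation_sets[0]]
    for tset in translation_sets[1:]:
        extended = [seq + (t,) for seq in beams for t in tset]
        extended.sort(key=lambda s: sequence_mi(s, mi), reverse=True)
        beams = extended[:beam]
    return beams

# Toy MI function standing in for corpus co-occurrence statistics (Equation 3.2).
toy_mi = {("方法", "ダンス"): 0.8, ("方法", "わかる"): 0.5}
mi = lambda a, b: toy_mi.get((a, b), toy_mi.get((b, a), 0.0))

sets = [["方法", "秩序"], ["わかる", "知る"], ["ダンス", "舞踊"]]
for seq in beam_filter(sets, mi, beam=5):
    print(seq, sequence_mi(seq, mi))
```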
Therefore, in the translation candidate filtering, we also used the TF × IDF score to represent such a relationship. The next step is to select the keyword list with the highest TF × IDF score among the sequences with the top mutual information scores. The TF × IDF score used here is the relevance score between a document and the query (Equation 3.5, from [12]):

\sum_t \frac{TF_{t,i}}{\frac{DL_i}{avglen} + TF_{t,i}} \cdot \log \frac{N}{DF_t}   (3.5)

TF_{t,i} denotes the frequency of term t in document i. DF_t denotes the number of documents containing term t. N is the total number of documents in the collection. DL_i denotes the length of document i (i.e., the number of characters in i), and avglen the average length of the documents in the collection.

Table 3.4 shows an example of the keyword selection process after the keyword translation process is completed (as in Table 3.3; the Japanese keyword translation result is repeated in the second row of Table 3.4). The translation combinations (third row) and sequence rankings (fourth row) cover all words in the query, i.e., the translation sets of "metode", "belajar", "menari" and "salsa". All resulting sequences, ranked by their mutual information scores, are then run through the IR system [12] to obtain their TF × IDF scores. Accepting the Japanese keywords as input, the IR system [12] ranks the documents using Equation 3.5, producing a list of relevant documents for each keyword list. The final query chosen is the one with the highest TF × IDF score (last row) over the 300 result documents.

3.5 Experiments

3.5.1 Experimental Data

The query translation performance was measured by the IR score achieved by the CLIR system, because CLIR is a real application and the score also reflects the effect of keyword expansion. We did not use word-to-word translation accuracy, since a one-to-one translation rate is not suitable when there are many semantically equivalent words (a keyword-level evaluation is given in Section 3.5.4). The CLIR experiments were conducted on the NTCIR-3 Web Retrieval Task data (100 GB of Japanese documents), for which Japanese queries and translated English queries were provided.

Table 3.4: Example Result of the Translation Filtering Method
Indonesian query: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (I want to know the method of studying how to dance the salsa)
Japanese keywords (after analysis by Chasen): 規則正し, 筋道, 秩序, 方法 | 調べる, 勉強, 研究, 学ぶ, 調査, 検討, 書斎, 知る, わかる, 暗記, 覚える, 確認, 習う, 突きとめる | 舞踊, ダンス, ダンスパーティー, バレエ, 舞う, 踊る, 踊ら | サルサ
Translation combinations: (規則正し, 調べる, 舞踊, サルサ), (筋道, 調べる, 舞踊, サルサ), (秩序, 調べる, 舞踊, サルサ), etc.
Sequences ranked by mutual information score: 1. (秩序, 知る, 踊る, サルサ); 2. (秩序, 研究, 踊る, サルサ); 3. (方法, わかる, ダンス, サルサ); 4. (方法, 覚える, ダンス, サルサ); 5. (秩序, 分かる, 踊る, サルサ)
Result (query with the best TF × IDF score): (方法, わかる, ダンス, サルサ)

The Indonesian queries (47 queries) were manually translated from the English queries. The 47 queries contain 528 Indonesian words (225 of which are not stop words), 35 English borrowed words, and 16 transliterated Japanese words (proper nouns). The IR system [12] was borrowed from Atsushi Fujii (Tsukuba University). Using Equation 3.5, the IR system retrieves the 1000 documents with the highest TF × IDF scores for a non-Boolean Japanese query. The Indonesian queries are translated into Japanese and then input into the IR system.
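To make Equation 3.5, as used by the IR system, concrete, here is a minimal sketch with a toy inverted index standing in for the NTCIR-3 collection statistics; the variable names mirror the definitions above.

```python
import math

# Toy collection statistics standing in for the NTCIR-3 data.
index = {"オゾン": {"d1": 3, "d2": 1}, "人体": {"d1": 2}}   # term -> {doc: TF}
doc_len = {"d1": 1200, "d2": 900}                            # characters per document
N = 2                                                        # documents in the collection
avglen = sum(doc_len.values()) / N

def relevance(query_terms, doc):
    """Relevance score of Equation 3.5 for one document."""
    score = 0.0
    for t in query_terms:
        tf = index.get(t, {}).get(doc, 0)
        if tf == 0:
            continue
        df = len(index[t])               # number of documents containing t
        score += tf / (doc_len[doc] / avglen + tf) * math.log(N / df)
    return score

print(relevance(["オゾン", "人体"], "d1"))
```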
The Indonesian-Japanese translation resources are as follows:
• Indonesian-English dictionary KEBI [17], 29,054 Indonesian words
• English-Japanese dictionary Eijirou [27], 556,237 English words
• Indonesian-Japanese dictionary [36], 14,823 Indonesian words
• English stop word list, combined from [11] and [53]; the Indonesian stop word list is translated from the English stop word list, and the Japanese stop word list consists of Japanese function words
• English morphology rules, implementing the WordNet [28] descriptions
• Indonesian morphology rules, restricted to word repetition and the postfixes -nya and -i
• Japanese morphological analyzer Chasen [26]
• Japanese proper name dictionary (containing 61,629 Japanese person names)
• Mainichi Shinbun newspaper corpus [25]
• Daily Yomiuri Online (in English) newspaper corpus [51]
• Indonesian newspaper corpus (articles downloaded from http://ilps.science.uva.nl/Resources/BI/)

The Mainichi Shinbun newspaper corpus is used as the data resource for the mutual information scores between Japanese keywords. The Daily Yomiuri Online newspaper corpus is used as the data resource for the mutual information scores between English keywords. The Indonesian newspaper corpus is used to reduce the vocabulary size of the original Indonesian-Japanese dictionary.

3.5.2 Compared Methods

In the experiments, we compared the proposed method with other translation methods. Table 3.5 lists all the compared methods with their corresponding query labels (as used in the figures of Section 3.5.3).

Table 3.5: List of Compared Methods and Their Query Labels (Section 3.5.3)
• iej-mx: Indonesian-English-Japanese transitive translation with machine translation (Kataku and Excite)
• iej-mb: Indonesian-English-Japanese transitive translation with machine translation (Kataku and Babelfish)
• ij: Indonesian-Japanese direct translation with the existing Indonesian-Japanese dictionary
• ijn: Indonesian-Japanese direct translation with the built-in Indonesian-Japanese dictionary
• postfixes -In and -I-n: Japanese keyword filtering using only the mutual information score; -In means the Japanese keyword list is the nth-ranked keyword list by mutual information score, and -I-n means the union of the 1st- through nth-ranked keyword lists
• infix -En: English keyword filtering by mutual information score; -En means the English keyword list is the nth-ranked list by mutual information score
• ej-xxx-yyy (Figure 3.8): English-Japanese CLIR, where
xxx denotes the keyword translation method: "man" (English keywords selected manually from the English query sentence), "mb" (Babelfish machine translation) or "mx" (Excite machine translation); and yyy denotes the keyword filtering method: In, I-n or IR-n (with IR-n, the chosen keyword list is the one with the best TF × IDF score over its document collection among the n highest-ranked mutual information keyword lists).

Transitive Translation using Machine Translation

The first compared method is transitive translation using MT (machine translation). The Indonesian-Japanese transitive translation using MT has a schema similar to the Indonesian-Japanese transitive translation using bilingual dictionaries; instead of the available Indonesian-English and English-Japanese dictionaries, the Indonesian queries are translated using the online Indonesian-English MT Kataku [45] and two online English-Japanese MTs, Babelfish [30] and Excite [9]. Examples of Indonesian-Japanese translation results using the machine translation method are shown in Table 3.6.

Table 3.6: Examples of Indonesian-Japanese Translation using the Machine Translation Method
Indonesian query: Saya ingin belajar tentang akibat perusakan lapisan ozon dan pelebaran lubang ozon terhadap tubuh manusia (= I want to learn about the effects the destruction of the ozone layer and the expansion of the ozone hole have on the human body)
Translated English (Kataku engine): I wanted to study about resulting from destruction and the widening of the ozone hole of the layer of ozone of the human body
Translated Japanese sentence, Excite engine (iej-mx): 私は人体のオゾンの層のオゾンホールの破壊から生じて、広くなるのに関して研究したかったです
Translated Japanese sentence, Babelfish engine (iej-mb): 私は人体のオゾンの層のオゾン穴の破壊そして広がることに起因について調査したいと思った
Japanese keywords, Excite engine: 破壊, 人体, オゾン, 層, ホール, 広く, 起因, 勉強
Japanese keywords, Babelfish engine: 人体, オゾン, 層, 穴, 破壊, 広がる, 起因, 調査, 思う
Indonesian query: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (= I want to know the method of studying how to dance the salsa)
Translated English (Kataku engine): I wanted to know the method of studying how danced salsa
Translated Japanese sentence, Excite engine (iej-mx): 私は、どのようにを研究するか方法がサルサを踊ったのをしりたかったです
Translated Japanese sentence, Babelfish engine (iej-mb): 私はいかに踊られたサルサ調査する方法を知りたいと思った
Japanese keywords, Excite engine: 勉強, 方法, サルサ, ダンス
Japanese keywords, Babelfish engine: いかに, 踊ら, サルサ, 調査, 方法, 思う

Direct Translation using the Existing Indonesian-Japanese Dictionary

The second comparison method is direct translation with an Indonesian-Japanese dictionary. This direct translation also has a schema similar to the transitive translation using bilingual dictionaries (Figure 3.2). The difference is that, in the translation of an Indonesian keyword, only one dictionary is used rather than two. In this case, we use an Indonesian-Japanese bilingual dictionary (14,823 words), which has fewer words than the Indonesian-English (29,054 words) and English-Japanese (556,237 words) dictionaries. We also ran some direct translation experiments with the Indonesian-Japanese dictionary reduced to various sizes (3000, 5000 and 8857 words). Table 3.7 shows examples of Indonesian-Japanese translation results with the direct translation method using the Indonesian-Japanese dictionary.
Table 3.7: Examples of Indonesian-Japanese Translation with Direct Translation using the Existing Indonesian-Japanese Dictionary
Indonesian query: Saya ingin belajar tentang akibat perusakan lapisan ozon dan pelebaran lubang ozon terhadap tubuh manusia (= I want to learn about the effects the destruction of the ozone layer and the expansion of the ozone hole have on the human body)
Japanese keywords, 3000-word dictionary: 学ぶ, 習う, 学習, 破壊, 行為, 層, 階層, オゾン, pelebaran, lubang, 体, 肉体, 人間
Japanese keywords, 5000-word dictionary: 学ぶ, 習う, 学習, 破壊, 行為, 層, 階層, オゾン, pelebaran, ホール, 穴, 体, 肉体, 人間
Japanese keywords, 8857-word dictionary: 学ぶ, 習う, 学習, 破壊, 行為, 層, 階層, オゾン, 拡張, ホール, 穴, 体, 肉体, 人間
Japanese keywords, 14,823-word dictionary: 学ぶ, 習う, 学習, 破壊, 行為, 層, 階層, オゾン, 拡張, ホール, 穴, 体, 肉体, 人間
Indonesian query: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (= I want to know the method of studying how to dance the salsa)
Japanese keywords, all four dictionary sizes (3000, 5000, 8857 and 14,823 words): 方法, メソッド, 学ぶ, 習う, 学習, サルサ, ダンス

Direct Translation using the Built-in Indonesian-Japanese Dictionary

We also compared the transitive translation results with those of direct translation using our built-in Indonesian-Japanese dictionary. We still call this direct translation because, although the Indonesian-Japanese dictionary was built in advance from the Indonesian-English and Japanese-English dictionaries, the query translation process itself uses only this one dictionary, which yields different Japanese translations than the transitive translation. When building the Indonesian-Japanese dictionary from the Indonesian-English and Japanese-English dictionaries, the number of possible translation pairs explodes. To select the correct pairs, we used the "one-time inverse consultation" score as in [44]: for each Indonesian word, we look up all its English translations and count how many of them match the English translations of a Japanese translation candidate. We also used WordNet to obtain more English translation candidates. The complete procedure is as follows (a sketch is given below):

1. Match each English translation (from the Indonesian-English dictionary) against the English words in the Japanese-English dictionary. If the English term is a phrase and no matching word can be found, the English term is normalized by eliminating certain words ("to", "a", "an", "the", "to be", "kind of"). For example, the word "belajar" has three English translations in the Indonesian-English KEBI dictionary: "to study", "to learn" and "to take up"; after normalization, these become "study", "learn" and "take up".

2. For every Japanese translation candidate, a "one-time inverse consultation" score is calculated: the English translations of the Japanese candidate are matched against the English translations of the Indonesian word. If more than one word matches, the candidate is accepted as an Indonesian-Japanese pair. If not, the English translations are extended with their synonyms taken from WordNet, and the "one-time inverse consultation" score of the extended English word set is recalculated. For example, the word "わかる" can be translated into "learn" or "take up" according to the English-Japanese dictionary, so "わかる" is taken as a translation of the word "belajar".

Examples of Indonesian-Japanese translation results using the built-in Indonesian-Japanese dictionary are shown in Table 3.8.
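A minimal sketch of the acceptance test in step 2, omitting the WordNet synonym expansion; the gloss sets are illustrative stand-ins for the dictionary entries.

```python
# "One-time inverse consultation": accept a Japanese candidate when at least two
# of its English glosses overlap with the English glosses of the Indonesian word.

id_en = {"belajar": {"study", "learn", "take up"}}
ja_en = {"わかる": {"learn", "take up", "understand"}, "穴": {"hole"}}

def inverse_consultation(id_word):
    accepted = []
    glosses = id_en[id_word]
    for ja_word, ja_glosses in ja_en.items():
        if len(glosses & ja_glosses) >= 2:     # "more than one" matched gloss
            accepted.append(ja_word)
    return accepted

print(inverse_consultation("belajar"))         # ['わかる']
```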
Table 3.8: Examples of Indonesian-Japanese Translation with Direct Translation using the Built-in Indonesian-Japanese Dictionary
Indonesian query: Saya ingin belajar tentang akibat perusakan lapisan ozon dan pelebaran lubang ozon terhadap tubuh manusia (= I want to learn about the effects the destruction of the ozone layer and the expansion of the ozone hole have on the human body)
Japanese keywords (ijn): わかる, 研究, 与える, 層, オゾン, 広げる, 穴, 空洞, 団体, 集団, 死体, 人物, 人, 人間, 人類
Indonesian query: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (= I want to know the method of studying how to dance the salsa)
Japanese keywords (ijn): 秩序, 方法, 研究, わかる, 踊る, ダンス, 曲, パーティー, 舞う, バレエ, 舞踊, サルサ

Japanese Keyword Selection using only the Mutual Information Score in Transitive Translation using Bilingual Dictionaries

We also compared the proposed keyword selection method with Japanese keyword selection based on the mutual information score only. There are two keyword selection schemas. In the first schema, only the single best keyword list among the ranked keyword lists is selected. In the second, all keywords from the first-ranked through the nth-ranked keyword lists are merged into one keyword list. For the baseline (iej), we used the Indonesian-Japanese transitive translation with bilingual dictionaries without keyword selection. Table 3.9 shows some translation examples for the Indonesian-Japanese transitive translation (bilingual dictionaries) with Japanese keyword selection using the mutual information score only. Queries with the postfix "In" (first schema) and the postfix "I-n" (second schema) in Section 3.5.3 show the experimental results.

Table 3.9: Examples of Indonesian-Japanese Translation with Japanese Keyword Selection using the Mutual Information Score only
Indonesian query: Saya ingin belajar tentang akibat perusakan lapisan ozon dan pelebaran lubang ozon terhadap tubuh manusia (= I want to learn about the effects the destruction of the ozone layer and the expansion of the ozone hole have on the human body)
Japanese keywords (iejI1): 確認, 与える, オゾン層, 広がり, オゾンホール, 人体
Japanese keywords (iejI-3): する, 確認, 覚える, 与える, オゾン層, 広がり, オゾンホール, 人体
Indonesian query: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (= I want to know the method of studying how to dance the salsa)
Japanese keywords (iejI1): 方法, する, ダンス, サルサ
Japanese keywords (iejI-3): 秩序, 方法, する, 知る, 踊る, サルサ

English Keyword Selection using the Mutual Information Score in Transitive Translation using Bilingual Dictionaries

Another comparison method is transitive translation with English keyword selection based on mutual information taken from a monolingual English corpus. The English keywords are selected based on their mutual information scores, and the selected English keywords are used as the input for the English-Japanese translation. Table 3.10 shows examples of Indonesian-Japanese translation with English keyword filtering using the mutual information score.

Table 3.10: Examples of Indonesian-Japanese Translation (Bilingual Dictionaries) with English Keyword Selection
Indonesian query: Saya ingin belajar tentang akibat perusakan lapisan ozon dan pelebaran lubang ozon terhadap tubuh manusia (= I want to learn about the effects the destruction of the ozone layer and the expansion of the ozone hole have on the human body)
Japanese keywords (iejE): 検討, 学ぶ, 調査, 調べる, 研究, 勉強, 与える, 有害, 不利, 損害, オゾン, 層, 拡大, 主義者, ホール, 人体
Indonesian query: Saya ingin mengetahui metode untuk belajar bagaimana menari salsa (= I want to know the method of studying how to dance the salsa)
Japanese keywords (iejE): 秩序, 方法, 知る, わかる, 覚える, 確認, 学ぶ, 習う, 踊る, ダンス, パーティー, バレエ, 舞踊, 舞う, サルサ

English-Japanese Translation

The English-Japanese translation is done to measure the performance reduction caused by the Indonesian-English translation step.
Methods used in the English-Japanese query translation are machine translation and word-by-word translation using an English-Japanese bilingual dictionary. The schemas for the machine translation and the dictionary translation are similar to those described earlier in Section 3.5.2. The machine translation systems used here are the Babelfish and Excite engines. Table 3.11 shows examples of English-Japanese translation results in the English-Japanese CLIR.

Table 3.11: Examples of English-Japanese Translation in the English-Japanese CLIR
English query: I want to learn about the effects the destruction of the ozone layer and the expansion of the ozone hole have on the human body
• English-Japanese dictionary (ej): 学ぶ, 知る, わかる, 暗記, 覚える, 確認, 習う, 突きとめる, 個人, 資産, 駆除, 絶滅, 倒壊, 破壊, 破滅, 原因, 撲滅, 滅亡, オゾン, 層, ホール, 膨張, 発展, 拡張, 人体
• Excite engine (ej-mx): オゾン, ホール, 層, 破壊, 拡大, 人体, 上, 効果, 学び
• Babelfish engine (ej-mb): オゾン, 層, 効果, 学び, 思い, 穴, 拡張, 人体
English query: I want to know the method of studying how to dance the salsa
• English-Japanese dictionary (ej): 規則正しさ, 筋道, 秩序, 方法, 学習, 学問, 舞踊, ダンス, パーティー, バレエ, 舞う, 踊る, 踊らされる, サルサ
• Excite engine (ej-mx): サルサ, 踊る, 方法, 学ぶ
• Babelfish engine (ej-mb): サルサ, 踊る, 方法, 学ぶ, 為, 思う

3.5.3 Experimental Results

Baseline Experiments

In these experiments, we compared the IR score of each translation method. The IR scores are reported as Mean Average Precision (MAP), the mean over queries of the non-interpolated average precision, i.e., the average of the precision values obtained after each relevant document is retrieved. Each query group has four MAP scores: RL (highly relevant documents as correct answers, with hyperlink information used), RC (highly relevant documents as correct answers), PL (partially relevant documents as correct answers, with hyperlink information used), and PC (partially relevant documents as correct answers).

Figure 3.3 shows the IR scores of queries translated with the basic translation methods, such as the bilingual dictionaries or machine translation, without any enhancement; all translation candidates are grouped together and used as the query input for the IR system. With only bilingual dictionaries (Indonesian-Japanese and English-Japanese), the proposed methods (iej and ij-iej) gave IR scores lower than the transitive translation using machine translation (iej-mx and iej-mb). The combination of direct and transitive translation achieved a higher IR result than the direct translation (ij), but the improvement was not significant. The direct translation with the built-in dictionary (ijn) achieved the lowest IR score, which suggests that the new Indonesian-Japanese dictionary has lower coverage than the two source dictionaries (Indonesian-English and English-Japanese). The main baseline here is "iej", the transitive translation using bilingual dictionaries without any borrowed word translation and without keyword selection. The transitive translation with machine translation (iej-mx and iej-mb) scored higher than the other translation methods. The highest CLIR score among the baseline translations reached only 31% (iej-mx, MAP score on RL = 0.0306) of the monolingual IR (jp, MAP score on RL = 0.0985). The dictionary-based transitive
translation (iej, MAP score on RL = 0.0138) and the direct-transitive translation (ij-iej, MAP score on RL = 0.0218) achieved 14% and 22% of the monolingual IR, respectively.

Figure 3.3: IR Score with Indonesian-Japanese Baseline Translation

Experiments with Translation of Borrowed Words

In Figure 3.3, the translation is done only for original Indonesian words, which leaves many OOV words that are borrowed words. In order to enhance the IR score, these borrowed words are translated using the supporting resources (the English-Japanese dictionary, the Japanese proper name dictionary and the Japanese common noun dictionary). Figure 3.4 shows the IR scores with borrowed word translation: by translating the borrowed words in the query, each translation method improved on the IR score obtained by the corresponding baseline method in Figure 3.3.

Figure 3.4: IR Score of Indonesian-Japanese CLIR (with Borrowed Word Translation)

The most significant improvement is for the direct Indonesian-Japanese translation. The combined translation (ij-iej) showed a lower IR score than the direct translation (ij). We assume that this is because the combined translation yields too many translation results and leads to the retrieval of irrelevant documents. The same reasoning applies to the transitive translation (iej), which scored lowest among all translation results.

Experiments with Keyword Filtering

To reduce the number of translation candidates yielded by the translation methods, we performed keyword selection on the translation results (see the keyword selection details in Section 3.4.2). We experimented with two kinds of keyword selection: 1) Japanese keyword selection, and 2) English keyword selection. Figure 3.5 shows the impact of Japanese keyword selection on the IR score, and Figure 3.7 shows the IR scores achieved with English and/or Japanese keyword selection. The query label notation used in Figure 3.5 is xxx-yyy: the prefix "xxx" denotes the keyword translation method as in Figure 3.3 (for example iej, iej-mb, etc.), and the postfix "yyy" denotes the keyword filtering method. Figure 3.5 shows that keyword selection based on the combination of mutual information and TF × IDF scores (iej-IR-n) yielded a significant IR score improvement for the transitive translation. The proposed transitive translation (iej-IR-10) improved the IR (RL) score of the baseline transitive translation (iej) from 0.0138 to 0.0371. A t-test showed that iej-IR-10 significantly improved on the baseline method (iej) at a 97.5% confidence level, T(68) = 1.92, p < 0.03.
The t-test also showed that, compared to the other baseline systems, the proposed transitive translation (iej-IR-10) increased the IR score at 85% (T(84) = 1.04, p < 0.15), 69% (T(86) = 0.49, p < 0.31), 91% (T(83) = 1.35, p < 0.09), and 93% (T(70) = 1.49, p < 0.07) confidence levels for iej-mb, iej-mx, ij and ij-iej, respectively. The IR scores achieved by the transitive translation using bilingual dictionaries were better than those of the transitive machine translation (iej-mb-IR and iej-mx-IR) or the direct translation (ijn-IR and ij-IR-5).

Figure 3.5: IR Score of Indonesian-Japanese CLIR (with Borrowed Word Translation and Japanese Keyword Selection) for All Queries (47 Queries)

Experiments with Combination of Translation

The other proposed method, the combination of direct and transitive translation (ij-iej), achieved the best IR score among all the translation methods (transitive machine translation, direct translation and transitive translation using bilingual dictionaries). The proposed combination translation method (ij-iej-IR-30) improved the IR (RL) score of the baseline combination translation (ij-iej) from 0.0218 to 0.0486. A t-test showed that the proposed combination translation significantly improved the IR score of the baseline ij-iej at a 98% confidence level, T(69) = 2.09, p < 0.02. Compared to the other baseline systems, the t-test showed that the proposed combination translation method (ij-iej-IR-30) improved the IR score at 95% (T(83) = 1.66, p < 0.05), 86% (T(85) = 1.087, p < 0.14), 97% (T(82) = 1.91, p < 0.03) and 99% (T(67) = 2.38, p < 0.005) confidence levels for iej-mb, iej-mx, ij and iej, respectively. Figure 3.6 shows the IR score of the Indonesian-Japanese CLIR for queries with in-vocabulary words (42 queries); the pattern matches that for all queries (Figure 3.5). In Figure 3.6, the ij-iej translation achieved the highest IR score of 0.0555, which is 56% of the monolingual IR.

Figure 3.6: IR Score of Indonesian-Japanese CLIR (with Borrowed Word Translation and Japanese Keyword Selection) for Queries with In-Vocabulary Words (42 Queries)

Figure 3.7 shows the impact of English (pivot language) keyword selection on the transitive translation; the method is described in Section 3.5.2. The experimental results show that keyword selection on the English keywords failed to yield a significant improvement in translation.

Figure 3.7: IR Score of Indonesian-Japanese CLIR with English and/or Japanese Keyword Selection

Experiments on English-Japanese CLIR

Figure 3.8 shows the IR score of the English-Japanese CLIR, which has four translation groups: ej-man (English-Japanese translation using the bilingual dictionary, with the English keywords selected manually from the English query sentence), ej-mb (translation using the Babelfish engine), ej-mx (translation using the Excite engine) and ej (translation using the bilingual dictionary). Compared to the Japanese monolingual IR in Figure 3.3, the English-Japanese CLIR with bilingual-dictionary-based translation (ej-IR-30, MAP score on RL = 0.0467) achieved 47% performance.
The Indonesian-Japanese CLIR with transitive translation (iej-IR-30 in Figure 3.5, MAP score on RL = 0.0371) achieved 38% of the performance of the Japanese monolingual IR (MAP score on RL = 0.0985), and the Indonesian-Japanese CLIR with the combined direct-transitive translation (ij-iej-IR-30 in Figure 3.5, MAP score on RL = 0.0486) achieved 49% of the Japanese monolingual IR, comparable with the English-Japanese CLIR using the bilingual dictionary.

Figure 3.8: IR Score of English-Japanese CLIR

Experiments on Dictionary Size

Figure 3.9 shows the highest IR score of the CLIR using bilingual-dictionary-based translation (ijn, ij, iej and ij-iej) together with the vocabulary size of each translation. Even though the built-in Indonesian-Japanese dictionary has a larger vocabulary than the existing Indonesian-Japanese dictionary, the IR score of the translation using the built-in dictionary (ijn) is lower than that of the translation using the existing dictionary (ij). We assume that the Japanese keyword selection in the dictionary building process is not able to select appropriate Japanese translations. The ij-iej translation uses a merged dictionary built from the existing Indonesian-Japanese and Indonesian-English dictionaries. Its dictionary size is larger than those of the existing Indonesian-Japanese (ij) and Indonesian-English (iej) dictionaries, and its performance exceeds that of either bilingual dictionary alone, because some Indonesian words that are OOV in the Indonesian-Japanese dictionary exist in the Indonesian-English dictionary and vice versa.

Figure 3.9: IR Score of CLIR using Bilingual-Dictionary-based Translation and its Dictionary Size

Experiments on the Number of OOV Words

Figure 3.10 shows the CLIR score and the number of OOV words for the CLIR with direct translation using the existing Indonesian-Japanese bilingual dictionary (ij). The dictionary was reduced to 3000, 5000 and 8857 words, so there are four dictionaries of differing sizes: 3000, 5000, 8857 and the complete Indonesian-Japanese dictionary of 14,823 words. The reduction was done by selecting the most frequent Indonesian words in an Indonesian newspaper corpus. Figure 3.10 shows that the more OOV words a translation yielded, the lower the IR score it achieved, which indicates that dictionary quality plays a significant role in CLIR.

Figure 3.10: IR Score of CLIR using Indonesian-Japanese Direct Translation (ij) with its Word Number and the OOV Word Number (best IR score (RL) and OOV word count plotted against dictionary size)

3.5.4 Keyword Comparison

All figures in Section 3.5.3 show the IR score achieved by each translation method. When comparing IR scores, each translation result is in effect compared with all words of the same semantic meaning (a one-to-all comparison). We also did a one-to-one comparison by comparing each translation result with the keyword list of the monolingual query (jp). The comparison can be seen in Table 3.12.
Table 3.12: Keyword Comparison between Translation Results and the Original Japanese Keyword List
Query label | Precision | Recall | IR score (RL)
Baseline methods:
iej-mx | 22.82% | 42.18% | 0.0306
iej-mb | 21.66% | 34.6% | 0.0197
ij | 17.49% | 33.65% | 0.0074
iej | 3.63% | 40.75% | 0.0138
ij-iej | 10.78% | 39.81% | 0.0218
Compared methods:
iej-mx-IR | 23.71% | 43.6% | 0.0336
iej-mb-IR | 23.72% | 37.44% | 0.0238
ij-IR | 30.12% | 35.55% | 0.0366
jp (monolingual) | 100% | 100% | 0.0985
Proposed methods:
iej-IR-10 | 25.48% | 31.28% | 0.0371
ij-iej-IR-30 | 37.05% | 44.08% | 0.0486

Table 3.12 lists the keyword comparison between the translation results and the original Japanese keywords, as indicated by precision and recall scores. There is obviously no direct correspondence between the precision and recall scores and the IR score. Even though the combined translation (ij-iej-IR-30), which has the highest IR score, also showed the highest recall and precision, the other results show a different picture. For example, iej-IR-10 (transitive translation using bilingual dictionaries) had lower recall and precision than ij-IR-5 (direct translation), yet the IR score achieved by iej-IR-10 was higher than that achieved by ij-IR-5. We assume this is because the keyword comparison treats the main keyword and the complement keywords equally, while the main and complement keywords have different effects on the information retrieval score. For example, the query "Find documents describing how to make chiffon cake" has "chiffon cake" as the main keyword and "how to make" as the complement of the main keyword. If the translation system produced a correct translation of only the complement keywords (here, "how to make"), the precision and recall of the keyword comparison would increase, whereas the IR score would not.

3.6 Conclusions

We presented a translation method that is suitable for queries in a language with limited data resources, such as Indonesian. Compared to other types of translation, such as transitive translation using machine translation and direct translation using a poor source-target bilingual dictionary, our transitive translation and the combined translation (direct translation plus transitive translation) achieved higher IR scores. In the Indonesian-Japanese CLIR, the transitive translation achieved 38% of the monolingual IR performance and the combined translation achieved 49%, which is comparable with the English-Japanese CLIR. The two important methods in our transitive translation are the borrowed word translation and the keyword selection method. The borrowed word translation reduced the number of OOV words from 50 to 5 using a pivot-target (English-Japanese) bilingual dictionary and a target-language (Japanese) proper name dictionary. The keyword selection using the combination of mutual information and TF × IDF scores gives a significant improvement over the baseline transitive translation. The other important method, combining transitive and direct translation using bilingual dictionaries, also improved the CLIR performance; the t-test showed that it significantly improved on the baseline transitive translation at a 99% confidence level. We believe that the system can easily be adapted to accept input queries written in other minor languages.
The tools needed for such an adaptation are a bilingual dictionary between the query language and English, morphological rules for stemming, and a stop word list in the query language, which can easily be translated from the English stop words.

Chapter 4

Indonesian Monolingual QA using a Machine Learning Approach

4.1 Introduction to Monolingual QA

Question Answering (QA) has been an interesting research field in the Natural Language Processing (NLP) area. A QA system gives an answer, taken from some available source, when a question in human language is posed. To answer a question, a human has to possess knowledge about the question domain. For a computer system, however, collecting and building a knowledge resource comparable to a human's is very expensive, and until now no system has been able to acquire a complete knowledge resource. On the other hand, information in text format is widely available on the World Wide Web. This has triggered research on question answering systems that exploit document texts as the resource from which possible answers are extracted. Researchers have developed several approaches for QA systems that use document texts as the answer resource. Most QA systems are composed of three subsystems [48, 16]: question analysis, passage retrieval and answer finding. In traditional QA systems, these components depend on hand-crafted rules, which are expensive to build. To avoid this expensive work, an alternative is to employ a machine learning method in the question analysis and/or in the answer finder. For many languages, such as English or the European languages, formal question-answer-document resources are available for QA research. The U.S. TREC (Text REtrieval Conference) has run a QA task for English since 1999 (TREC-8). Europe's CLEF (Cross Language Evaluation Forum) has provided a QA task since 2003, starting with Dutch, Italian and Spanish monolingual QA tasks; the number of target languages in the CLEF 2005 QA task increased to nine, adding French, Portuguese, Bulgarian, English, Finnish, and German. For Asian languages, other than Japanese (NTCIR, NII-NACSIS Test Collection for IR Systems), no formal data resource is available. To our knowledge, this research is the first QA system for a language with such limited resources. "Limited resources" means that there are no question answering data and no language processing tools available (rich dictionary, parsing tool, etc.). Here, we use Indonesian as the language for our QA system. Indonesian is used by the population of Indonesia, about 260 million people, in a country located in Southeast Asia, and it is understood by people in Malaysia and Brunei. Because the language is used by so many people, there is an increasing need for Indonesian natural language technology, including IR/QA systems. Most Indonesians speak two languages: Indonesian and their own regional language. Although Indonesian shares some sentence structures with English, such as the subject-predicate-object order, it has many syntactic differences, such as the word order within a noun phrase. The Indonesian question sentence structure is also different from English.
For example, in English one cannot say "Country which has the biggest population?" (the correct sentence is "Which country has the biggest population?"), while in Indonesian the structure "Negara apa" (country which) has the same meaning as "Apa negara" (which country). Observing these differences between Indonesian and English, we concluded that English language processing tools cannot be applied directly to Indonesian. For Indonesian, some computational linguistics research has already been carried out, such as Indonesian information retrieval [43], Indonesian-English CLQA [3] and Indonesian-Japanese CLIR [33]. In the Indonesian-English CLQA system [3], Indonesian is used as the question language and English as the document language; this Indonesian-English CLQA task is available at CLEF. As for monolingual Indonesian question answering, we believe this is the first QA research in which the question and document language are both Indonesian.

4.2 Related Work

For English question classification (part of question analysis), the literature [52] compared various machine learning methods such as Nearest Neighbor, Naive Bayes, Decision Tree, Sparse Network of Winnows (SNoW) and Support Vector Machines (SVM). They reported that the SVM algorithm gave the highest accuracy using bag-of-words and bag-of-n-grams features, achieving 90% for coarse classes (6 classes) using a tree kernel, and 80.2% for fine classes (50 classes). The literature [21] used the SNoW learning architecture to classify English questions with the following features: words, POS tags, chunks, named entities, head chunk (the first noun chunk in the sentence), and semantically related words (words that often occur with a specific question class). Using the same data as [52], they reported a classification accuracy of 91% for coarse classes and 84.2% for fine classes. With the same data as [52], the literature [41] utilized the SVM algorithm with features combining the subordinate word category (using WordNet), the question focus (extracted with manually listed regular expressions) and syntactic-semantic structure; they achieved 85.6% classification accuracy for fine classes. For our Indonesian question classification, we used the SVM algorithm with features extracted from the available resources, which differ in some respects from the features mentioned in the related work. For a full QA system, the literature [31] employed a perceptron model to answer TREC-9 questions for English. They used dependency relations (learned by a probabilistic parser for the question) and WordNet information as machine learning features. Their experimental results achieved the highest MRR score among participants in the TREC-9 evaluation: 0.58 for short answers and 0.76 for long answers. Another QA system with a machine learning approach [35] used maximum entropy with features such as the expected answer class (or question type, predicted by collected surface text patterns), the answer frequency, the absence of question words, and word matching. They used a pattern file supplied by NIST to tag the answer chunks in the testing phase, while for the training phase they used the TREC-9 and TREC-10 data sets. For Japanese QA, the literature [38] tried to eliminate the restriction imposed by question types by joining question features, document features and combination features in a maximum entropy algorithm.
Their feature sets include four kinds of part-of-speech information (obtained from the Japanese morphological analyzer Chasen) for each word in the question and document, n-gram lexical terms and some matching scores (lexical and POS). Their main experimental result achieved an MRR of 0.36 and a Top-5 score of 0.47. In our Indonesian monolingual QA system, we adopted an approach similar to the literature [38]: the answer finder employs an SVM algorithm with features obtained from an Indonesian corpus, joined with the results of a purpose-built POS tagger and question parser. We compared several feature combinations of the question class, question features and document features. The experimental results show that using both the question class and the question features in the answer finder gives better performance than using only one of them.

4.3 Language Resources

4.3.1 Article Collection

First, we surveyed existing Indonesian news articles available on the web and found three candidate article collections, but we were only able to download one of them, located at http://www.tempointeraktif.com. We downloaded about 56,471 articles, which were noisy, with many incorrect characters, and some of which were in English. We cleaned the articles semi-automatically by deleting articles with certain words in the subtitle. We then joined our downloaded articles with the available corpus (http://ilps.science.uva.nl/Resources/BI/, 28,393 articles), resulting in 71,109 articles. Finally, we selected 221 articles, based on the number of possible answers they contain, as the seeds for the question collection.

4.3.2 Building the Question Collection

To build our question collection, we asked 18 native Indonesian speakers to write factoid questions, along with their answers and question types, based on the selected articles. For this task, we made a web-based question editor (the interface is shown in Figure 4.1).

Figure 4.1: Interface of User Input for the Indonesian QA Data Collection Task

Each user was required to input about 200-250 Indonesian questions across 6 question types (person, organization, location, name, date and quantity). After manually eliminating exact duplicates, we gathered 3000 questions: 500 questions for each question type. Question examples are shown in Table 4.1.

Table 4.1: Examples of Collected Indonesian Questions
1. (Date) Mulai tanggal berapakah, PT Pertamina menurunkan harga Pertamax dari Rp 5.400 menjadi Rp 5.000? (When did PT Pertamina lower the Pertamax price from Rp 5,400 to Rp 5,000?)
2. (Location) Apa nama bandara di Yogyakarta? (What is the name of the airport in Yogyakarta?)
3. (Name) Apa nama film pertama Indonesia yang terpilih sebagai film terbaik internasional Festival Film Asia Pasifik 1970? (What is the first Indonesian movie chosen as the best international movie at the 1970 Asia Pacific Film Festival?)
4. (Organization) Badan apakah yang memiliki wahana antariksa Rosetta? (Which agency owns the Rosetta spacecraft?)
5. (Person) Siapakah pendiri Freedom Institute? (Who is the founder of the Freedom Institute?)
6. (Quantity) Berapakah temperatur rata-rata permukaan Mars? (What is the average temperature of the surface of Mars?)

Based on our observation of the Indonesian question collection, we divided Indonesian factoid questions into two general patterns, related to the question main word:
1. Questions with an explicitly written question focus. This is mostly the case for "what" and "which" questions, for example "Tanggal berapakah, negara TNRC diproklamasikan?" (On what date was TNRC proclaimed?). In this kind of question, the question focus can be located before or after the interrogative word, and it may or may not be preceded by a preposition, as in "Tanggal berapakah" (what date) and "Pada (preposition, on) tanggal (noun, date) berapakah (interrogative, what)" (on what date). The question focus is selected as the question main word. There is a special case for stop words such as nama (name), judul (title), induk (mother), etc., as in the question "Dengan nama apakah, Bandara Selaparang akan direlokasi ke Lombok Barat?" (Under what name will Selaparang airport be relocated to West Lombok?), where "Bandara" (airport) is selected as the question focus.

2. Questions with no explicit question focus. This usually applies to questions in which the interrogative word is preceded by a preposition, and to "who" questions. For example, in "Di manakah konser untuk mendiang Teguh Karya akan dilaksanakan?" (Where will the concert for the late Teguh Karya be held?), "where" is rendered as "di mana" ("di" is a preposition, translated as "in"; "mana" is an interrogative, translated as "which"), which can be expanded into "di kota mana" (in which city) or "di negara mana" (in which country). Another example is "Siapakah yang menerbitkan kartu asuransi kesehatan untuk korban diare?" (Who issued the health insurance card for diarrhea victims?). Such questions have no question focus, and the clue to the question category can lie in the verb, or in the nearest noun if the question has no verb. In the above examples, "dilaksanakan" (be held) and "menerbitkan" (issued) are selected as the question main word.

4.3.3 Other Data Resources (Indonesian-English Dictionary)

Besides our own collection, we also used an Indonesian-English dictionary, located at http://nlp.aia.bppt.go.id/kebi/, made by the Indonesian Agency for the Assessment and Application of Technology. It contains 29,054 Indonesian words together with POS information. In our observation, some words have incorrect POS information; therefore, we only used the dictionary to obtain POS information for nouns, verbs and adjectives, while for the other POSs (conjunction, preposition, pronoun, adverb) we made our own word lists (248 words in total). The POS tagger is described in Section 4.4.1.

4.4 QA System with a Machine Learning Approach

Our QA system structure is similar to that of other monolingual QA systems [48, 16], consisting of three main components (see Figure 4.2): a question analyzer, a passage retriever and an answer finder. The question analyzer extracts the question focus, the interrogative word (who, where, etc.), the question keywords and the question type. The passage retriever collects passages containing the question keywords. The answer finder locates the answer based on the question analyzer output and the passages given by the passage retriever. Each subsystem is described in the following sections, and a minimal sketch of how they compose is given below.
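The sketch below shows only the composition of the three components; the function bodies are placeholders for the SVM-based modules described in the following sections, not the actual implementations.

```python
# Minimal sketch of the three-component QA pipeline.

def analyze_question(question):
    # Would return the EAT (question class), main word, interrogative word,
    # phrase label and keywords; here only a trivial keyword split is done.
    return {"keywords": question.split(), "eat": "location"}

def retrieve_passages(keywords, corpus):
    # Would rank corpus passages; here any passage containing a keyword is kept.
    return [p for p in corpus if any(k in p for k in keywords)]

def find_answer(analysis, passages):
    # Would score answer candidates with the SVM answer finder.
    return passages[0] if passages else None

def answer(question, corpus):
    analysis = analyze_question(question)
    passages = retrieve_passages(analysis["keywords"], corpus)
    return find_answer(analysis, passages)

print(answer("Apa nama bandara di Yogyakarta?",
             ["Bandara Adisucipto terletak di Yogyakarta."]))
```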
4.4.2 Question Analyzer

Question Shallow Parser

The question analyzer consists of two components: a question shallow parser and a question classifier. The question shallow parser extracts the question keywords, the question main word, the interrogative word (who, where, etc.) and its phrase label (and also a preposition, if the phrase is a PP) from a question. The procedure of the question shallow parser is as follows:

1. Assign a POS tag to each word in the question.

2. Select the question main word based on the following rules (ordered by priority):
   • a noun preceding the interrogative word,
   • a verb occurring between the interrogative word and a noun,
   • a noun following the interrogative word.
   Note that words listed in a stop word list are not considered as the question main word.

3. Define a phrase label describing the position of the question main word and the interrogative word: NP/Noun Phrase (the question main word is a noun located after the interrogative word), PP/Preposition Phrase (the noun precedes the interrogative word and a preposition precedes the noun), VP/Verb Phrase (the question main word is a verb located after the interrogative word), NP-PREV (the question main word is a noun located before the interrogative word), or VP-PREV (the question main word is a verb located before the interrogative word).

4. Take all nouns and verbs as the question keywords.

An example of the shallow parser result is shown in Figure 4.4.
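The ordered main-word selection rules in step 2 can be expressed as a small function. This is a hedged sketch under the assumption that the tagger output is available as parallel token/tag lists; the names and representation are illustrative, not the thesis implementation.

```python
def select_main_word(tokens, tags, wh_index, stop_words):
    """Pick the question main word by the ordered rules of step 2.

    `tokens` and `tags` are parallel lists; `wh_index` is the position of the
    interrogative word; `stop_words` holds nouns like nama, judul, induk.
    """
    def usable_noun(i):
        return tags[i] == "noun" and tokens[i].lower() not in stop_words

    # Rule 1: a noun preceding the interrogative word (nearest first).
    for i in range(wh_index - 1, -1, -1):
        if usable_noun(i):
            return tokens[i]
    # Rule 2: a verb occurring between the interrogative word and a noun.
    for i in range(wh_index + 1, len(tokens)):
        if tags[i] == "noun":
            break
        if tags[i] == "verb":
            return tokens[i]
    # Rule 3: a noun following the interrogative word.
    for i in range(wh_index + 1, len(tokens)):
        if usable_noun(i):
            return tokens[i]
    return None
```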
Question Classifier

As mentioned in Section 4.1, we applied a machine learning approach to question classification. We used an SVM, as it has proven effective in the question classification domain. The first feature for question classification is the output of the shallow parser; essentially, it represents the words that are important for deciding the question class of a question sentence. It includes the interrogative word, the question main word (which could be a question focus, or a verb or noun following the interrogative word when there is no question focus), the phrase information and the preposition (if the phrase is a PP). Besides this feature, we also compared two features derived from the question focus: the simple rule-based class candidates (denoted C in the experimental results) and the bi-gram frequency (denoted P) between the nouns related to the main word and some defined preceding words. We also tried calculating the WordNet distance (denoted W) as a comparison method.

Simple Rule Based Class Candidates

We observed that, by applying some simple rules to the shallow parser result, it is possible to obtain the most likely question classes. For example, a question with "siapa" (who) as the interrogative word is mostly categorized as a person question, and a question with "kapan" (when) as the interrogative word is always a "date" question. Therefore, as the first additional attribute, we defined the most likely categories using simple rules; for example, if the interrogative word is "siapa" (who), then the candidates are person, organization, location and name. The complete rules are shown in Table 4.2.

Table 4.2: Rules for Defining Class Candidates

Interrogative Word   Preposition    Phrase   Class Candidates
kapan                -              -        date
berapa               -              -        date, quan, name
siapa                -              -        person, org, loc, name
mana                 di, ke, dari   -        loc, org
apa                  not blank      PN       loc, org, name, date
apa/mana             blank          NP       org, loc, name, person
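The rule table above translates naturally into a lookup function. The sketch below mirrors Table 4.2 as reconstructed here; the function name and the fallback of returning all six classes are illustrative assumptions.

```python
def class_candidates(wh_word, preposition, phrase):
    """Map shallow-parser output to the most likely EAT candidates (Table 4.2)."""
    if wh_word == "kapan":                                  # when
        return ["date"]
    if wh_word == "berapa":                                 # how many / how much
        return ["date", "quantity", "name"]
    if wh_word == "siapa":                                  # who
        return ["person", "organization", "location", "name"]
    if wh_word == "mana" and preposition in {"di", "ke", "dari"}:
        return ["location", "organization"]
    if wh_word == "apa" and preposition and phrase == "PN":
        return ["location", "organization", "name", "date"]
    if wh_word in {"apa", "mana"} and not preposition and phrase == "NP":
        return ["organization", "location", "name", "person"]
    # Fallback: no rule fired, so keep every class as a candidate.
    return ["person", "organization", "location", "name", "date", "quantity"]
```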
WordNet Distance

WordNet (http://wordnet.princeton.edu/) describes the semantic relations among words, and many studies have used it for question classification, though usually for English questions. Here, we tried to use WordNet for Indonesian questions, which requires a translation phase. The complete procedure is as follows:

1. Translate the selected nouns (related to the question main word) into English using the Indonesian-English KEBI dictionary. For a question with a question focus (the 1st pattern described in Section 4.3.2), the selected noun is the question focus. For a question without a question focus (the 2nd pattern described in Section 4.3.2), the system searches for possible nouns corresponding to the question main word. These are the nouns that most frequently co-occur with the question main word within a sentence window of the Indonesian corpus.

2. Calculate the WordNet depth (distance) between the translated nouns and some specified WordNet synsets (taken from the 25 noun lexicographer files in WordNet). The specified synsets are act, animal, artifact, attribute, body, cognition, communication, event, feeling, food, group, location, motive, object, person, phenomenon, plant, possession, process, quantity, relation, shape, state, substance, and time.

3. Include all these WordNet distances as additional attribute values.

For example, in the question "Di kota manakah, lokasi Bandara Supadio?" (In what city is Supadio airport located?), the main word "kota" is translated using the Indonesian-English dictionary into "city". In WordNet, the word "city" has 3 senses.

Figure 4.3: WordNet Information for the Word "City"

Figure 4.3 shows that "city" is separated by 5 synsets from "location" for the first sense, by 4 synsets from "location" for the second sense, and by 4 synsets from "group" for the third sense. After normalization, the WordNet distance scores of the word "city" are 0.64 and 0.36 for the "location" and "group" synsets, respectively. The 0.64 score for "location" results from (1/5 + 1/4) divided by the overall reciprocal distance (1/5 + 1/4 + 1/4).

For Indonesian questions, the WordNet strategy has problems with translation ambiguity and with OOV words (words not available in the Indonesian-English dictionary). Translation ambiguity is handled by using all distances in the additional feature, as mentioned in step 3 above; by using all calculated distances together with the shallow parser result, we assume that these attributes are adequate to describe the question intention. For borrowed OOV words, leaving the word untranslated lets the WordNet distance still work for English loanwords such as "distributor" in the sentence "Apa nama distributor rekaman CD acara festival Raum & Schatten di Berlin, untuk Indonesia?" (What is the name of the CD record distributor for the Raum & Schatten festival in Berlin?). However, this does not work for Indonesian common nouns and Indonesian proper names. For these other OOV words, we used a monolingual corpus to search for words similar to the main word and then calculated the WordNet distance for each similar word. The similar words are defined as nouns that share preceding words (among those listed in Appendix A) with the main word. For example, if the question focus "ibukota" (capital city) is an OOV word (a common Indonesian noun), then, to get its WordNet distance, we search the Indonesian corpus for its most frequent preceding words, such as "arah" (direction), "daerah" (region), "kawasan" (region), "ke" (to), etc. We then search for other nouns with similar preceding words. These nouns (such as "rumah" (house), "medan" (field), "kompleks" (complex, site), etc.) are taken as the words similar to the main word "ibukota". Finally, the WordNet distances between these nouns and the specified WordNet synsets are calculated, and the result is treated as the WordNet distance for the main word.
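The normalization used in the "city" example (reciprocal distances summed per target synset and divided by the overall sum) can be sketched as follows; the input representation of per-sense distances is an assumption for illustration.

```python
from collections import defaultdict

def normalized_wordnet_scores(sense_distances):
    """Turn per-sense (top_synset, distance) pairs into a normalized score
    per top synset, as in the "city" example above.
    """
    totals = defaultdict(float)
    for synset, distance in sense_distances:
        totals[synset] += 1.0 / distance       # reciprocal distance per sense
    overall = sum(totals.values())
    return {synset: score / overall for synset, score in totals.items()}

# The three senses of "city": two reach "location" (distances 5 and 4),
# one reaches "group" (distance 4).
print(normalized_wordnet_scores([("location", 5), ("location", 4), ("group", 4)]))
# -> {'location': 0.642..., 'group': 0.357...}, i.e. the 0.64 / 0.36 scores above
```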
Bi-gram Frequency

The last additional attribute is the frequency of the main word with some defined preceding words (bi-grams). The method is similar to the one used to tag the POS of OOV words with the Indonesian corpus (Section 4.4.1); the differences lie in the word list and the last processing step. The first step is to collect the defined preceding words by the following procedure:

1. List some words for each question category (person, name, organization, location, date, quantity). For example, words in the "person" category include presiden (president), guru (teacher), musisi (musician), dokter (doctor), penulis (writer), etc.

2. Search the corpus for the most frequent preceding words of the words from step 1. For example, some of the most frequent preceding words for "presiden" (president) are kandidat (candidate), sebagai (as), oleh (by), etc.

3. Manually select, from the words in step 2, those that can differentiate question categories. This step resulted in 6 word lists, one for each category.

The next step is to calculate the (normalized) bi-gram frequency of each question main word with the defined word lists. The word lists are shown in Table 4.3.

Table 4.3: Preceding Word Lists for Calculating the Bi-gram Frequency

Person: almarhum (deceased), arahan (guidance from someone), asosiasi (association of), atasan (higher person at work), bawahan (lower person at work), ditandatangani (signed by), era, foto (photo), kata/ungkap/ujar/ucap (say), kandidat (candidate), kediaman (home), kepergian (departure of a person), lanjut (continue speaking), mantan (retired person), mendiang (deceased), menjabat (hold position as), pembantu (like an assistant), pemberhentian (unemployment), pesan (message said by a person), pribadi (personally), profesi (profession), seorang (a, for a person)

Organization: anggota (member), antar (between), aturan (rule of), bawah (under), bersifat (characterized), kantor (office), kelompok (group), kinerja (achievement), keputusan (decision), koalisi (coalition), jaringan (network), manajemen (management), mekanisme (mechanism), pelatihan (training), pemimpin/pimpinan/ketua (leader), pengembangan (development), pengurus (committee), restrukturisasi (restructure), sanksi (sanction), secara (as), wadah (place to), validasi (validation)

Location: arah (direction), asal (come from), barat (west of), batas (limit of), daerah (area), dekat (near), geografis (geographic), ibukota (capital city), kawasan (region), ke (to), masuk (enter), menuju (heading to), pariwisata (tourism), pembangunan (development), perbatasan (border area), posisi (position), regional, sekitar (around), selatan (south of), seluas (as wide as), tanah (land of), teritorial (territorial), timur (east of), utara (north of)

Name: berdasarkan (based on), hasil (result), judul (title), karangan (paper), korban (victim), kotak (box), meluncurkan (launch), memaparkan (describe), membacakan (read), membuat (make), menerbitkan (publish), menyampaikan (deliver), meraih (get), pelaksanaan (implementation), peluncuran (launch), penerbitan (publisher), penjualan (sell), sebatang (a, for trees), seekor (a, for animals), sehelai (a, for something thin and light), sekeping (a, for something thin, small and not light), sekuntum (a, for flowers), seluruh (whole), seputar (around), sosialisasi (socialization), terjadi (happen), usai (after)

Date: akhir (end), awal (beginning), tengah (middle of), hingga (until), ketika (when), saat (when), waktu (when), sejak (since), selama (during), setiap (each), tiap (each)

Quantity: beberapa (a few), puluhan (tens), ratusan (hundreds), ribuan (thousands), jutaan (millions), belasan (teens), milyaran (billions), puluh, ratus, ribu, juta, belas, milyar

The procedure is then as follows:

1. For each question, calculate the word-pair frequency between the main word of the question and each listed word in Table 4.3. If the phrase of the question focus is not an NP, the system searches for possible nouns corresponding to the question focus; this strategy is the same as the first step of the WordNet distance calculation procedure.

2. Include all the frequencies as additional attributes.

We did not take only the single highest frequency; instead, we included all the frequencies, because a noun might have its most frequent preceding words spread across the 6 word lists.
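The per-category (number of words, normalized frequency) pairs that appear in Figure 4.4 could be computed along the following lines. The bi-gram table and word-list representations are illustrative assumptions, not the thesis code.

```python
def bigram_frequency_features(main_word, bigram_counts, category_words):
    """Compute a (count, normalized frequency) pair per question category.

    `bigram_counts` maps (prev_word, word) pairs to corpus frequencies;
    `category_words` maps each EAT to its preceding-word list (Table 4.3).
    """
    # Total frequency of anything appearing directly before the main word.
    total = sum(f for (prev, w), f in bigram_counts.items() if w == main_word)
    features = {}
    for category, words in category_words.items():
        hits = [w for w in words if bigram_counts.get((w, main_word), 0) > 0]
        freq = sum(bigram_counts[(w, main_word)] for w in hits)
        features[category] = (len(hits), freq / total if total else 0.0)
    return features
```

For the question in Figure 4.4, this would return location: (3, 0.0913) and (0, 0) for the other categories.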
For example, the question "Dimanakah ibukota Republik Turki Siprus Utara?" (Where is the capital city of the Turkish Republic of Northern Cyprus?) has "ibukota" as the main word. In the Indonesian corpus, among the listed words, 3 words ("kawasan", "ke" and "menuju") appear as bi-grams with "ibukota". The total frequency of these words (normalized by the frequency of all words appearing with "ibukota" as a bi-gram) is 0.0913. The complete bi-gram frequency scores are shown in Figure 4.4.

Question: Dimana ibukota Republik Turki Siprus Utara? (Where is the capital city of the Turkish Republic of Northern Cyprus?)
Question main word: ibukota (capital city)
Corpus search result: "kawasan ibukota" (location), "ke ibukota" (location), "menuju ibukota" (location)
SVM attributes:
1. Shallow parser result (SP)
   • interrogative word: dimana (where)
   • main word: ibukota (capital city)
   • phrase: NP
   • preposition: -
2. WordNet distance (WN), with OOV handling: act: 0.0432, animal: 0, artifact: 0.1724, attribute: 0.0253, body: 0.0052, cognition: 0.0306, communication: 0.017, event: 0.0007, feeling: 0, food: 0.0067, group: 0.0907, location: 0.2478, motive: 0, object: 0.0435, person: 0.011, phenomenon: 0, plant: 0.0019, possession: 0.0216, process: 0.0042, quantity: 0.043, relation: 0.009, shape: 0.0063, state: 0.055, substance: 0.016, time: 0.0013
3. Bi-gram frequency (PF), as (number of words, frequency) per category
   • person: 0, 0
   • location: 3, 0.0913
   • organization: 0, 0
   • date: 0, 0
   • quantity: 0, 0
   • name: 0, 0

Figure 4.4: Question Example with its Question Features

This method is able to handle OOV words, such as common nouns or proper names that have no translation in the Indonesian-English dictionary, which were a problem in the WordNet distance calculation; for example, the common noun "bandara" in "Apakah nama bandara di Pekanbaru?" (What is the name of the airport located in Pekanbaru?), or the proper name "Biodiesel" in "Apakah nama kimia untuk Biodiesel?" (What is the chemical name for Biodiesel?). For both nouns ("bandara" and "Biodiesel"), the WordNet distance approach without OOV handling could not give any additional information, because these words are listed neither in the Indonesian-English dictionary nor in WordNet itself. With the bi-gram frequency approach, we were still able to gain additional information that distinguishes the semantic information of "bandara" (a location) and "Biodiesel" (a name). Another advantage concerns the data resources needed: the WordNet distance approach needs a bilingual dictionary and a thesaurus, which demand expensive effort if either is not available, whereas the bi-gram frequency method only needs a monolingual corpus, which can be collected from the WWW.

4.4.3 Passage Retriever

Our passage retriever works in two steps. First, we collect the most relevant documents by executing a boolean query against a document retriever. Second, we select some passages (paragraphs) with the 3 highest IDF scores among the retrieved documents. For the document retriever module, we tried several methods:

1. Select documents with the 3 highest IDF scores.

2. Select documents with an IDF score larger than half of the highest IDF score.

3. Select documents with an IDF score larger than the lowest IDF score.

4. Use Estraier (http://estraier.sourceforge.net/) as the document retriever for Indonesian documents. Estraier is designed for English or Japanese; here, we used it for Indonesian. One difficulty we found is that some Indonesian words are treated as English stop words.
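The first three selection strategies can be sketched as one small function over (document, IDF score) pairs; this is illustrative, not the thesis implementation.

```python
def select_documents(scored_docs, method):
    """Select candidate documents from (doc_id, idf_score) pairs."""
    if not scored_docs:
        return []
    scores = [s for _, s in scored_docs]
    if method == "top3":
        # Keep documents holding one of the 3 highest distinct IDF scores.
        cutoffs = sorted(set(scores), reverse=True)[:3]
        return [d for d, s in scored_docs if s in cutoffs]
    if method == "half_of_top":
        # IDF score larger than half of the highest IDF score.
        return [d for d, s in scored_docs if s > max(scores) / 2]
    if method == "above_lowest":
        # IDF score larger than the lowest IDF score.
        return [d for d, s in scored_docs if s > min(scores)]
    raise ValueError(method)
```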
We found that some queries contain words that do not exist in the correct passage. These words are usually synonyms of words in the passage. For example, the term "PT Bursa Efek Surabaya" in the query "Siapakah direktur PT Bursa Efek Surabaya saat ini?" (Who is the current director of PT Bursa Efek Surabaya?) does not exist in the correct passage "Direktur BES Guntur Pasaribu mengatakan . . . " (Director of BES Guntur Pasaribu said . . . ); here "BES" is a synonym of "PT Bursa Efek Surabaya". To handle this, we extracted synonyms from the Indonesian corpus using a "word (synonym)" pattern. For each word in the synonym list, if the word occurs in the question, its synonym is added to the question with the "or2" operator[32] (for example, the synonyms "BES" and "PT Bursa Efek Surabaya" are composed into the boolean query "BES" or2 ("PT" and "Bursa" and "Efek" and "Surabaya")), and the new keyword list is executed against the document and passage retriever.

4.4.4 Answer Finder

To locate an answer, an SVM algorithm is employed to classify each word in the corpus as "I" if the word is part of the answer but not its beginning, as "B" if the word is the beginning of the answer, and as "O" if the word is not part of the answer. Yamcha[20] is used as the text chunking software for the answer finder. The complete features for the SVM are as follows:

1. Question features. These are almost the same as the question classification features: the interrogative word, the question focus, the phrase of the question focus, the preposition, the bi-gram frequency of the question focus and the 4 words around the interrogative word.

2. Expected question class. The expected question class is produced by the question classification system with the following features: bag of words, shallow parser result and bi-gram frequency scores.

3. Document features. Each word in the document is complemented with the features (lexical form, POS information, orthographic information and lexical similarity with the question keywords) of the n preceding and n following words (where n is the width of the word window), as well as the features of the current word itself (lexical form, POS information, orthographic information, lexical similarity with the question keywords, and bi-gram frequency). The features of the n preceding words also include their I/B/O information.

For the training data, the document features are built automatically from the correct passage (where the answer is tagged with <A> and </A>). Each word is annotated with its POS, its orthographic information, its lexical similarity score (1 if the word exists in the question, 0 otherwise), its bi-gram frequency (as in the question classification task), and its I/B/O information (every word has the "O" value except words within the answer tags: "B" for the word right after the <A> tag and "I" for the remaining answer words). The testing data are built in the same way except for the I/B/O information: in the testing data, all words initially have the "O" value. Figure 4.5 shows an example of a question along with its question features, its expected question class and its document features.

Question: Dimana ibukota Republik Turki Siprus Utara? (Where is the capital city of the Turkish Republic of Northern Cyprus?)
Question features (QF):
   • Interrogative word: dimana
   • Main word: ibukota
   • Phrase: NP
   • Preposition: -
   • Bi-gram frequency: see Figure 4.4
   • 4 surrounding words: -, -, ibukota, Republik
Expected Answer Type (EAT): location
Correct passage (for training): Di tengah lembah itu pula terdapat ibukota Nicosia yang juga terbagi dua antara Turki dan Yunani. Warga Siprus Turki menyebut Nicosia dengan nama Lefkosa. . .
Document features (DF) for the word "ibukota":
   Features for the 2 preceding words in the passage ". . . pula terdapat ibukota . . . ":
   • "pula": lexical: pula; POS: adverb; orthographic: alphabetic; lexical similarity: 0; I/B/O information: O
   • "terdapat": lexical: terdapat; POS: verb; orthographic: alphabetic; lexical similarity: 0; I/B/O information: O
   Features for the 2 following words in the passage ". . . ibukota Nicosia yang . . . ":
   • "Nicosia": lexical: Nicosia; POS: noun; orthographic: capitalized alphabetic; lexical similarity: 0
   • "yang": lexical: yang; POS: conjunction; orthographic: alphabetic; lexical similarity: 0
   Features for the current word "ibukota": lexical: ibukota; POS: noun; orthographic: alphabetic; lexical similarity: 1; bi-gram frequency: see Figure 4.4

Figure 4.5: Features for the SVM-based Answer Finder
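Deriving the I/B/O labels from an <A>-tagged passage, as described for the training data, could look like the following sketch; the tokenization is simplified to whitespace splitting plus the answer tags, which is an assumption for illustration.

```python
import re

def ibo_labels(tagged_passage):
    """Create (word, label) training pairs from a passage in which the
    answer is wrapped in <A>...</A>.
    """
    pairs = []
    inside, first = False, False
    # Match an answer tag, or a run of characters that is neither space nor '<'.
    for token in re.findall(r"</?A>|[^\s<]+", tagged_passage):
        if token == "<A>":
            inside, first = True, True
        elif token == "</A>":
            inside = False
        elif inside:
            pairs.append((token, "B" if first else "I"))
            first = False
        else:
            pairs.append((token, "O"))
    return pairs

print(ibo_labels("terdapat ibukota <A>Nicosia</A> yang"))
# -> [('terdapat', 'O'), ('ibukota', 'O'), ('Nicosia', 'B'), ('yang', 'O')]
```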
4.5 Experimental Result

4.5.1 Question Classifier

In the experiments, an SVM algorithm is employed with a linear kernel and the "string to word vector" function to process string values, both available in the WEKA software [50]. 10-fold cross validation is used to calculate accuracy (i.e., 2,700 questions for training and 300 questions for testing in each fold). The baseline is the bag-of-words attribute. For a machine learning comparison, three algorithms (C4.5, K Nearest Neighbor (kNN) and SVM) were run with the bag-of-words feature. The question classification results for the 6 classes (date, location, name, organization, person and quantity) are shown in Table 4.4. The highest score is achieved by the SVM algorithm.

Table 4.4: Accuracy Scores of Several Machine Learning Algorithms with the Bag-of-Words Attribute in Indonesian Question Classification

Method              Accuracy Score
C4.5                87.23%
K Nearest Neighbor  69.30%
SVM                 91.97%

Table 4.5 shows the question classification results without handling the OOV problem. Here, we compared the best baseline result with the proposed attributes: the shallow parser result (S), the simple rule-based class candidates (C), the WordNet distance without OOV handling (W), the bi-gram frequency (P), combinations of the additional attributes (C+W, C+P, W+P and C+W+P), and their combinations with the bag of words (B). From Table 4.5, we can see that using only several important words (S) gives a higher score than using all the words in the question (B; baseline), improving the accuracy by about 2.45%. A t-test for all proposed attributes (the combinations of S with the other additional attributes C, P and W) against the baseline (B) showed that the improvement is significant; all p-values are lower than 0.025. Among all combinations with the additional attributes, using P gave a higher accuracy than using C or even the WordNet distance (W). We assume that the bi-gram frequency (P) outperformed the WordNet distance (W) because of the OOV words in the WordNet distance attribute (question main words that were not available in the Indonesian-English dictionary).
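The significance test reported in Table 4.5 can be reproduced in spirit as a paired one-tailed t-test over per-fold accuracies. The sketch below uses SciPy and hypothetical fold scores; the thesis's exact test procedure may differ in detail.

```python
from scipy import stats

def one_tailed_improvement_p(baseline_folds, proposed_folds):
    """Paired one-tailed t-test over per-fold accuracies.

    Assumes both lists come from the same 10 cross-validation folds.
    """
    t, p_two_tailed = stats.ttest_rel(proposed_folds, baseline_folds)
    # One-tailed p-value: halve the two-tailed value when the mean improves.
    return p_two_tailed / 2 if t > 0 else 1 - p_two_tailed / 2

# Hypothetical per-fold accuracies for the baseline (B) and for S+P.
baseline = [0.91, 0.92, 0.93, 0.92, 0.91, 0.93, 0.92, 0.92, 0.91, 0.93]
proposed = [0.95, 0.96, 0.95, 0.95, 0.96, 0.95, 0.96, 0.95, 0.95, 0.96]
print(one_tailed_improvement_p(baseline, proposed))
```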
Table 4.5 also shows the accuracy when the proposed attributes are joined with the bag of words (B). Joining the proposed attributes with B mostly gives a lower score than without B, except for combinations with the S and P attributes (such as B+S+P, B+S+C+P, etc.) and for combinations of four or more attributes. This shows that B cannot serve as a proposed attribute; we assume this is because the large amount of data in the B attribute can wash out the correct patterns yielded by the combinations without B.

Table 4.5: Accuracy Scores of Indonesian Question Classification using the SVM Algorithm with Various Attributes (p-values are from a one-tailed t-test at the 95% confidence level, compared with the baseline)

Method                   Accuracy Score   p-value
Baseline (bag of words)  91.97%           -
S                        94.43%           7.34E-05
B+S                      92.60%           0.179000
B+C                      92.20%           0.368950
B+W                      94.17%           0.000397
B+P                      95.37%           3.12E-08
S+C                      94.57%           2.91E-05
S+W                      94.83%           3.83E-06
S+P                      95.47%           1.12E-08
B+S+C                    92.50%           0.220170
B+S+W                    94.73%           8.42E-06
B+S+P                    95.80%           2.75E-10
S+C+W                    94.77%           6.50E-06
S+C+P                    95.37%           3.12E-08
S+W+P                    95.53%           5.54E-09
B+S+C+W                  94.83%           3.83E-06
B+S+C+P                  96.03%           1.54E-11
B+S+W+P                  96.07%           9.99E-12
S+C+W+P                  95.50%           7.90E-09
B+S+C+W+P                96.20%           2.64E-12

Table 4.6 shows the accuracy of question classification when the OOV problem is handled in the WordNet distance calculation. As mentioned in Section 4.4.2, OOV words are treated by finding words of a similar category to the OOV word. Compared to the version without OOV handling (Table 4.5), the result improved by about 0.27%.

Table 4.6: Accuracy Scores of Indonesian Question Classification using the WordNet Distance Attribute with OOV Handling (W')

Method               Accuracy Score   p-value
B + W'               94.53%           3.69E-05
S + W'               95.10%           3.94E-07
B + S + W'           94.83%           3.83E-06
S + C + W'           95.13%           2.91E-07
S + W' + P           95.53%           5.54E-09
B + S + C + W'       94.97%           1.27E-06
B + S + W' + P       96.17%           2.64E-12
S + C + W' + P       95.50%           7.90E-09
B + S + C + W' + P   96.20%           1.68E-12

Even though the simple rule-based class candidate attribute shows the lowest accuracy score overall (Table 4.5), for some categories, such as "name" and "quan", it achieved the highest accuracy among all attributes. We therefore ran another experiment combining the bi-gram frequency with the rule-based class candidates: we check whether the category label of each bi-gram frequency is listed in the class candidates, and if it is not listed, its frequency score is set to 0. This feature is labeled P'. The accuracy scores are shown in Table 4.7. All combinations with the P' attribute give equal or higher scores, except B+S+W'+P' (lower than B+S+W'+P and also lower than B+S+C+W'+P), where W' denotes the WordNet distance feature with OOV handling. We assume this is again caused by the B attribute, which may contain data that eliminates the correct patterns produced by the combinations without B.

Table 4.7: Accuracy Scores of Indonesian Question Classification using the Bi-gram Frequency Limited by the Simple Rule Class Candidates (P')

Method             Accuracy Score   p-value
B + P'             95.63%           1.86E-09
S + P'             95.60%           2.69E-09
B + S + P'         95.80%           2.75E-10
S + W + P'         95.50%           7.90E-09
S + W' + P'        95.53%           5.54E-09
B + S + W + P'     96.10%           6.45E-12
B + S + W' + P'    96.10%           6.45E-12

To keep the question classifier easy to adapt to other languages, the answer finder uses the question class produced by the S (shallow parser) + P' (bi-gram frequency limited by the simple rule class candidates) features.
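The P' combination reduces to masking the bi-gram frequency scores with the rule-based candidate set. A minimal sketch, reusing the (count, frequency) pairs from the earlier bi-gram example; names are illustrative:

```python
def limited_bigram_features(bigram_features, candidates):
    """The P' feature: zero out any bi-gram frequency whose category is
    not among the rule-based class candidates (C).
    """
    return {
        category: value if category in candidates else (0, 0.0)
        for category, value in bigram_features.items()
    }

# E.g. for "siapa" (who) the candidates exclude "date" and "quantity",
# so those two bi-gram scores are suppressed.
```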
4.5.2 Passage Retriever

Table 4.8 shows the experimental results of the passage retriever (over 71,109 documents) with the methods described in Section 4.4.3. The evaluation scores (precision, recall and F-measure) were calculated by comparing the retrieval results with the correct passages. This is actually not fully accurate, because there are also passages containing the correct answer that are not marked as correct passages: in the question data collection process, each user only wrote the question, its answer and the passage being read, and did not consider that other passages in other documents could also contain the correct answer.

Table 4.8: Passage Retriever Accuracy

Document Retriever    #Correct Passage  #Retrieved Passage  Prec    Recall  F-measure
3 highest IDF scores  2,551             13,556              0.1882  0.8503  0.3082
>highest IDF / 2      2,672             29,160              0.0916  0.8907  0.1662
>lowest IDF           2,860             466,141             0.0061  0.9533  0.0122
Estraier              2,409             65,660              0.0367  0.8030  0.0702

The results show that the IDF score is more suitable for the passage retrieval task than the TFxIDF score (employed by Estraier). The 6% recall reduction from 0.95 (>lowest IDF) to 0.89 (>highest IDF/2) indicates that the passage retrieval method still has to be improved to reduce this loss.

4.5.3 Answer Finder

In the answer finder, 2,700 questions were used as training data (139,851 instances) and 300 questions as testing data (50 questions for each question type). Yamcha (Kudo and Matsumoto, 2000) was used as the SVM-based text chunking software. To evaluate the answer finder, we calculated the accuracy of exact and partial answers along with the MRR score. The evaluation scores are Top-Exact1 (the first answer found is exactly the same as the correct answer), Top-Exact-n (the correct answer exists among the top n answers found), Top-Partial1 (the first answer found partially contains the correct answer, or the other way around; for example, the answer finder returns "August 2005" while the correct answer is "12 August 2005"), Top-Partial-n (one of the top n answers found is a partial answer) and MRR (Mean Reciprocal Rank: the average of the reciprocal rank 1/n, where n is the rank of the correct answer in the answer finder result).
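These scores can be computed from ranked answer lists as in the following sketch; the data representation is an illustrative assumption, and the partial-match variants would use a containment test instead of equality.

```python
def evaluate(answers_per_question, n=5):
    """Compute Top-Exact1, Top-Exact-n and the exact MRR.

    Each item of `answers_per_question` is (ranked_answers, correct_answer).
    """
    top1 = topn = mrr = 0.0
    for ranked, gold in answers_per_question:
        ranks = [i + 1 for i, a in enumerate(ranked[:n]) if a == gold]
        if ranks:
            topn += 1
            mrr += 1.0 / ranks[0]       # reciprocal rank of the first hit
            if ranks[0] == 1:
                top1 += 1
    total = len(answers_per_question)
    return top1 / total, topn / total, mrr / total
```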
Some examples of the answer finder results are shown in Table 4.9. The main reason for the incorrect answers is that the correct passage was not retrieved: even when the correct documents were retrieved, not all correct passages were.

Table 4.9: Examples of Answer Finder Results

Correct result 1:
   Question: Mulai tanggal berapa, PT Pertamina menurunkan harga Pertamax dari Rp 5.400 menjadi Rp 5.000? (When did PT Pertamina lower the Pertamax price from Rp 5,400 to Rp 5,000?)
   Passage: . . . Mulai 1 Januari 2006, PT Pertamina kembali menurunkan . . . Harga Pertamax dari Rp 5.400 turun menjadi . . .
   Answer: 1 Januari 2006 (January 1, 2006)

Correct result 2:
   Question: Dimanakah percobaan pembunuhan Hosni Mubarak yang paling terkenal dan berbahaya? (Where was the most famous and most dangerous assassination attempt on Hosni Mubarak?)
   Passage: . . . yang paling terkenal dan berbahaya adalah percobaan pembunuhan di Addis Ababa, Etiopia, Juni 1995, . . .
   Answer: Addis Ababa Etiopia (Addis Ababa, Ethiopia)

Incorrect result 3:
   Question: Di benua manakah, Negara Zambia berada? (On what continent does Zambia lie?)
   Correct passage: . . . Afrika kehilangan 272 . . . Zambia juga . . .
   Retrieved passage: . . . John Howard yang menyatakan akan menyerang negara-negara yang menjadi sarang teroris di Asia . . .
   Correct answer: Afrika (Africa)
   Answer results: Asia Tenggara (Southeast Asia), DPR (Indonesian legislative council), etc.

Incorrect result 4:
   Question: Apa nama mata uang Thailand? (What is Thailand's currency?)
   Correct passage: Di Thailand, harga minyak naik 26 persen selama 2005 ini menjadi 26,5 baht (Rp 6.500) per liter.
   Retrieved passage: . . . nilai tukar rupiah terhadap dollar AS . . .
   Correct answer: Baht
   Answer results: Dollar AS (US Dollar), Euro, Malaysia, etc.

The accuracy scores of the answer finder are shown in Table 4.10 and Table 4.11. QC denotes the predicted question class (the question classification result), DF the document features and QF the question features; multi-IBO means that, instead of only 3 classes (I, B, O) for each word, 18 classes are used (I-date, B-date, O-date, I-location, B-location, O-location, etc.). A word window size of 21 means that 10 preceding and 10 following words are used for each word in the passage; a window size of 11 means 5 preceding and 5 following words.

Table 4.10 shows the question answering accuracy using the QC and DF features for various passage retrieval methods and various word window sizes. All results in Table 4.10 show that the "3 Top IDF" and ">highest IDF/2" document retrieval methods achieve higher accuracy than Estraier (TFxIDF). The highest accuracy is achieved with the 11-word window size (5 preceding and 5 following words) combined with the "3 Top IDF" or ">highest IDF/2" document retrieval. Based on this result, we used the 11-word window size and these two document retrieval methods for the other feature combinations in the question answering module, whose results are shown in Table 4.11. The best result in Table 4.11 is 0.59 for Top-n and 0.52 for MRR. This score is better than that of [38], which conducted Japanese QA with a similar answer finder module; the result in [38] was 0.47 for Top-5 and 0.36 for MRR.
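The word windows behind Tables 4.10 and 4.11 correspond to a simple slice around each passage word. A minimal sketch, showing lexical features only, whereas the real system also attaches POS, orthographic and similarity information per window word:

```python
def window_features(words, index, half_width):
    """Collect the surrounding-word features for one passage word.

    An 11-word window corresponds to half_width=5 (5 preceding and 5
    following words around the current one).
    """
    start = max(0, index - half_width)
    end = min(len(words), index + half_width + 1)
    return {
        "current": words[index],
        "before": words[start:index],
        "after": words[index + 1:end],
    }
```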
Table 4.10: Question Answering Accuracy with QC (Question Class Features) and DF (Document Features) for Various Passage Retrieval Methods and Various Word Window Sizes

Word window size: 7 = 3+1+3
Method           Exact1  Exact-n  Partial1  Partial-n  MRR Exact  MRR Partial
3 Top IDF        0.38    0.49     0.44      0.55       0.43       0.49
>highest IDF/2   0.37    0.50     0.44      0.56       0.43       0.50
>lowest IDF      0.22    0.44     0.27      0.52       0.31       0.38
Estraier         0.33    0.48     0.40      0.55       0.40       0.47

Word window size: 11 = 5+1+5
3 Top IDF        0.44    0.55     0.50      0.64       0.49       0.56
>highest IDF/2   0.43    0.57     0.50      0.66       0.49       0.57
>lowest IDF      0.31    0.49     0.35      0.59       0.39       0.45
Estraier         0.37    0.52     0.44      0.61       0.43       0.51

Word window size: 21 = 10+1+10
3 Top IDF        0.40    0.52     0.49      0.61       0.46       0.54
>highest IDF/2   0.38    0.52     0.47      0.63       0.44       0.54
>lowest IDF      0.25    0.48     0.32      0.59       0.35       0.44
Estraier         0.35    0.49     0.43      0.59       0.42       0.51

Table 4.11: Question Answering Accuracy for Several Feature Variations (Multiple IBO Class, QF+DF, QC+QF+DF) with 2 Passage Retrieval Methods ((1) 3 Top IDF and (2) IDF score >highest IDF/2) and an 11-Word Window Size

Method                  Exact1  Exact-n  Partial1  Partial-n  MRR Exact  MRR Partial
Multiple IBO class (1)  0.38    0.54     0.50      0.66       0.45       0.57
Multiple IBO class (2)  0.37    0.53     0.46      0.65       0.44       0.55
QF+DF (1)               0.44    0.55     0.50      0.60       0.49       0.55
QF+DF (2)               0.43    0.56     0.50      0.61       0.49       0.55
QC+QF+DF (1)            0.46    0.58     0.54      0.66       0.51       0.59
QC+QF+DF (2)            0.46    0.59     0.54      0.67       0.52       0.60
Oracle (QC+QF+DF)       0.57    0.59     0.61      0.64       0.58       0.63

4.5.4 Using NTCIR 2005 (QAC and CLQA Data Set)

To compare the QA performance with a formal data set, we selected from NTCIR 2005 (the QAC and CLQA tasks) the questions whose answers are available in our Indonesian corpus. Twelve questions could be used for this purpose. These 12 questions were translated into Indonesian and then used as input questions for the Indonesian QA. The experimental results are shown in Table 4.12.

Table 4.12: Question Answering Accuracy for 12 Questions taken from QAC and CLQA1 at NTCIR 2005

Description                                      Accuracy
Question Classification                          100%
Passage Retrieval (>maximum IDF/2)               Recall = 67%
Answer Finder (QC+QF+DF, 11-word window size)    Top1 = Top5 = MRR = 0.33

The main reason for the low passage retrieval result is that some question keywords are not present in the correct article. In some cases there are unimportant nouns or verbs in the questions; in other cases the question keyword is not lexically equal to the word in the correct article (the question keyword is a synonym of it). Neither problem is handled by our Indonesian passage retriever; in future work, we plan to improve the passage retrieval module to cope with them. In the answer finder, the performance is half of the passage retrieval result. This achievement is comparable with the results shown in Table 4.11, such as 0.57 for the oracle experiment (100% passage retrieval recall) or 0.46 for the 0.89 recall score of the passage retrieval.

4.6 Adopting the QA System for Other Languages

Two main things must be prepared to build a similar QA system for another resource-limited language: the language resources and the components of the QA system.
The minimum required language resources include a text corpus, a question-answer set (along with the answer-tagged passages) and a POS resource (or POS tagger). The text corpus can be downloaded from the internet in less than about two weeks. For the question-answer set, as mentioned before, we collected about 3,000 question-answer pairs input by 18 Indonesian native speakers; this process took about 2 months. We believe this time can be shortened with more users or a better managed process. In this first QA system of ours, we made the mistake of not asking the users to type and tag the answer in the relevant passages, which cost additional time for collecting (semi-automatically) the correct relevant passages for each question-answer pair. For the POS resource, we used an Indonesian-English dictionary (29,054 Indonesian words), where each line contains the Indonesian word, its POS information (noun, verb, adjective, adverb, etc.) and its English translations. Some words were input manually for certain POSs only (conjunction, preposition, pronoun, adverb). POS ambiguity is not handled: if more than one POS is available for a word, the first candidate written in the Indonesian-English dictionary is chosen. Even though this POS tagger is simple, the QA gave a higher accuracy result than another QA system [38] that used a much better POS resource.

Like other QA systems, this QA system consists of 3 components: the question analyzer (Section 4.4.2), the passage retriever (Section 4.4.3) and the answer finder (Section 4.4.4). Almost all of the components are machine learning based, except for the question shallow parser. In the question shallow parser, the question is not transformed into a tree-structured sentence; we simply extract some information, namely the interrogative word, the keywords (nouns, verbs and adjectives) and the question main word. The procedure that decides the question main word depends on the grammar of the source language; however, the rules to extract it (based on the word's POS and position) are relatively simple and are listed in Section 4.4.2. Even if one cannot develop a shallow parser, the QA system will still work by using the B+C features for question classification (results are shown in Table 4.5).

For the question classification task, the possible features are B (bag of words), S (shallow parser result), C (simple class candidates), W (WordNet distance) and P (bi-gram frequency); the C, W and P features are explained in Section 4.4.2. We believe that preparing these features does not take much time. For the C feature, the class candidates are decided by simple rules, as shown in Section 4.4.2. For the W feature, the WordNet distance scores are easily calculated between the 25 nouns (lexicographer files) and the (bilingual-dictionary-based) translations of the nouns related to the question main word. For the P feature, the scores are taken from the bi-gram frequency between the nouns related to the question main word and the word list of each category; defining the word list for each category is done semi-automatically and takes less than about a week.
In the proposed method (as in the final experiment), we did not use the WordNet distance (W) feature in the question classification; we only used the B, S and P' (bi-gram frequency scores limited by the C simple class candidates) features. This means that neither WordNet nor an Indonesian-to-English translation step is needed for question classification. For the passage retriever (Section 4.4.3), the programming and execution phase takes about 2 weeks. For the answer finder, one needs to prepare the data required by the machine learner (e.g., Yamcha), which includes QC (the question class, i.e., the result of the question classification task), QF (the question features, i.e., the result of the question shallow parser) and DF (the document features). As mentioned in Section 4.4.4, the training data are prepared automatically from the tagged correct passages, where all words carry the "O" flag for the I/B/O information except for the answer ("B" for the first word of the answer, "I" for the rest). In the testing data, all words carry the "O" flag. The other features (orthographic information, lexical similarity score, bi-gram frequency) can be produced easily.

4.7 Conclusions

We have built an Indonesian question answering system, including the data collection (questions and documents) and a full QA system. The Indonesian QA data collection consists of 3,000 factoid questions, collected from 18 Indonesian native speakers, in 6 question classes: person, organization, location, name, date and quantity. The document collection was downloaded from an Indonesian newspaper website and joined with an existing document collection, yielding around 71,109 articles. The QA system consists of 3 components: question analyzer (question shallow parser and question classifier), passage retriever and answer finder. For the question classifier, this system shows that using features extracted from a monolingual corpus improves the classification result compared to a bag-of-words approach. As for the question answering result, we also obtained a good MRR score, considering that no high-quality language tools were involved in the process. Compared to [38], our system achieved a higher score for exact answers even though we used only the bi-gram frequency feature, while the other system used rich POS information produced by a morphological analyzer. This system can be easily adapted to other resource-limited languages; the procedure to apply the QA system to another resource-limited language (with minimum programming effort) is described in Section 4.6.

Chapter 5

Indonesian-English CLQA

5.1 Introduction

Recently, CLQA (Cross Language Question Answering) systems have become an area of much research interest. The CLEF (Cross Language Evaluation Forum) has conducted a CLQA task since 2003[24], using English target documents and Italian, Dutch, French, Spanish and German source questions. Indonesian-English CLQA has been one of the more recent goals of CLEF since 2005[46]. NTCIR (NII Test Collection for IR Systems) has also been active in its own CLQA efforts since 2005[39], providing Japanese-English, English-Japanese, Chinese-English and English-Chinese data. In CLQA, the answer to a given question in a source language is searched for in documents of a target language, and accuracy is measured by the retrieved correct answers. The translation phase of CLQA makes answer searching more difficult than in monolingual QA, for the following reasons[39]:
1. Translated questions are expressed differently from the expressions used in the news articles in which the answers appear;

2. Since the keywords for retrieving documents are translated from the original question, document retrieval in CLQA becomes much more difficult than in monolingual QA.

Most common approaches to CLQA use a four-module system: question analyzer, translation module, passage retriever and answer finder. With these four modules, several basic schemas can be proposed for a CLQA system:

1. The question sentence in the source language is translated into the target language as a complete question sentence. This translated question sentence is then processed by a monolingual QA system for the target language; the question analyzer, passage retriever and answer finder all operate in the target language.

2. The EAT is first determined by a source language question analyzer. The question sentence is then translated into the target language; the translation output can be a list of keywords or a complete question sentence. The translation results are used to retrieve passages, and the answers are located by matching the passages, the keywords and the EAT.

3. The documents are translated into source language documents, so that the answer can be located by a monolingual QA system for the source language.

4. The steps from question analysis to passage retrieval are the same as in the second schema. The passage retrieval results (in the target language) are then translated into the source language, and the answer finding is conducted by a source language QA system.

Researchers usually select one of the above schemas based on the monolingual QA system they have: with a monolingual QA in the target language, the first or second schema can be chosen; with a monolingual QA in the source language, the third or fourth is suitable. In the above schemas, the translation can be applied to questions, passages or documents. In CLQA, translation quality is an important factor in achieving an accurate QA system, so the quality of the machine translation system or dictionary used is very important. Even though we have a monolingual Indonesian QA system, due to the limited support resources for the translation module we selected the second schema. The second schema minimizes the drawbacks of the translation process, since the translation is done on the extracted keywords rather than on a full sentence, as in question translation or passage translation. As for the passage retriever and answer finder modules for the target language, by following the same approach as the monolingual QA, these modules can be built easily, with little programming effort.

In the Indonesian language, a number of translation resources are available[33], such as an online machine translation system (Kataku) and an Indonesian-English dictionary (KEBI). Previous work on Indonesian-Japanese CLIR[33] shows that using a bilingual dictionary in a transitive approach can achieve a higher retrieval score than using a machine translation tool. Other work on the Indonesian language[4], however, showed that using an online dictionary available on the Internet gave a lower IR (Information Retrieval) performance score than the available machine translation, because of the limited vocabulary size of the online dictionary.
As for CLQA research, the best performance in Japanese-English CLQA at NTCIR 2005 was obtained by a study[18] using three dictionaries (EDICT, 110,428 entries; ENAMDICT, 483,691 entries; and an in-house translation dictionary, 660,778 entries) that qualify as high data resources. The accuracy was 31.5% on the first exact answer, outperforming the other submitted runs, such as the second-ranked run with 9% accuracy. For Indonesian-English CLQA, there have been two trials[3, 49] conducted on the CLEF data set. Both systems used a machine translation tool to translate the Indonesian questions into English. One system [3] used commercially available Indonesian-English machine translation software (Transtool) and obtained only 2 correct answers out of 150 factoid questions. The other system [49] used online Indonesian-English machine translation software (Kataku) and obtained only 14 correct answers out of 150 factoid questions.

Unlike both Indonesian-English CLQA systems mentioned above, in this study we utilize the Indonesian-English KEBI[17] dictionary (29,047 word entries) and combine the translation results in a boolean query to retrieve the relevant passages. Also unlike our attempt at Indonesian-Japanese CLIR described in Chapter 3, we do not select the best translated keyword; we use all translated keywords in a boolean query to retrieve the English passages.

As mentioned in Chapter 4, there are several approaches to locating an answer in retrieved passages. In one commonly used approach, the documents are tagged with named entities, and matching is then conducted between each named entity and the EAT. This method is used in many CLQA systems, such as [18], which achieved the best submitted run for the Japanese-English CLQA task at NTCIR 2005; that work matched the EAT against the named entities produced by a handcrafted rule-based NE recognizer for English documents (669 rules for 143 answer types). Apart from the named entity approach, there is another approach, proposed by [37], in which a statistical method is used to locate the answer. This approach is called the experimental Extended QBTE (Question-Biased Term Extraction) model, an extension of the QBTE model used in Japanese monolingual QA[38]; the method is explained in Section 5.2. Unfortunately, this approach achieved only 2 correct answers out of 200 factoid questions in the NTCIR 2005 CLQA task. In our Indonesian-English CLQA system, we adopt the text chunking method to locate an answer. The English documents are only POS-tagged by an available POS tagger[40]. The features for the text chunking method are similar to those of the Indonesian monolingual QA, with several differences (see Section 5.5). There are some differences between our approach and the one proposed by [37]; one important difference is the use of the EAT. We choose to use the EAT yielded by the question analyzer for several reasons described in Section 5.5.1.
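Composing a boolean query that keeps all dictionary translations of each keyword, rather than selecting a single best translation, can be sketched as follows; the query syntax and the pass-through behavior for untranslatable words are illustrative assumptions.

```python
def boolean_query(keywords, dictionary):
    """Build an English boolean query from Indonesian keywords using a
    bilingual dictionary, joining all translation alternatives with OR.
    """
    clauses = []
    for kw in keywords:
        # Untranslatable words (e.g. proper names) pass through unchanged.
        translations = dictionary.get(kw, [kw])
        clauses.append("(" + " OR ".join(translations) + ")")
    return " AND ".join(clauses)

# boolean_query(["kota", "universitas"],
#               {"kota": ["city", "town"], "universitas": ["university"]})
# -> '(city OR town) AND (university)'
```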
5.2 Related Works

There are three prior works on Indonesian-English CLQA. The first system [3] was built for CLEF 2005, the second [49] for CLEF 2006 and the third [1] for CLEF 2007. All three adopted the second schema mentioned in Section 5.1. First, all systems classified the EAT of the Indonesian question using several rules. The Indonesian question was then translated into English using an available machine translation system (Transtool∗ in [3]; Kataku† in [49] and [1]). The translated questions were then executed against an existing search engine (Lemur‡). The retrieved English passages were tagged by existing tools (MontyTagger§ in [3]; Gate¶ in [49] and [1]). The answers were identified by rules matching the named entities in the passage against the EAT and using the distance between the answer candidate and the words matching the question keywords. The system in [1] additionally scored each answer candidate by executing the query on Google and using the word frequency as the candidate's weight.

In the CLEF 2005 Indonesian-English CLQA task, [3] yielded 2 correct answers out of 200 questions (150 factoid questions and 50 definition questions). Their analysis mentioned that one reason for the poor result was that the named entity tagger did not provide specific enough tag information for the passages; for example, an NN tag could correspond to either the location or the organization EAT. In the CLEF 2006 task, [49] got 14 correct answers out of 200 questions (150 factoid, 40 definition and 10 list questions); the authors attributed the improvement over the previous year to query expansion and different passage scoring techniques. In the CLEF 2007 task, [1] got 20 correct answers out of 200 questions; in their analysis, they claimed that internet sources could help answer identification.

Another related work is the experimental Extended QBTE[37]. Basically, the method is a text chunking process in which the answer is located by classifying each word of the target passages into 3 classes (B, I and O): B means that the word is the first word of the answer, I that it is part of the answer, and O that it is not part of the answer. This text chunking process was done with a maximum entropy algorithm. Using this approach, one needs neither EAT classification nor a named entity tagger. The features are taken from a rich POS resource, which also includes named entity information, for both the source and target languages. Although the text chunking approach of [37] achieved a low accuracy score in the NTCIR 2005 CLQA task (only 2 correct answers in the Japanese-English CLQA task), we adopt the approach in our answer finder module. Unlike [37], our approach employs a question analyzer module: we use the EAT and other question shallow parser results as part of the features. For the source question, we do not use rich POS information; we only use common POS categories and bi-gram frequency scores for the question main word (the question focus or the question clue word). For the target documents, we use a WordNet distance score for each document word.

∗ See http://www.geocities.com/cdpenerjemah/
† See http://www.toggletext.com/
‡ See http://www.lemurproject.org/
§ See http://www.media.mit.edu/~hugo/montytagger/
¶ See http://www.gate.shef.ac.uk/

5.3 Data Collection for Indonesian-English CLQA and its Problems

In order to obtain an adequate amount of data, we collected our own Indonesian-English CLQA data. We asked 18 Indonesian college students (different people from those who built the Indonesian QA collection described in Section 4.3.2) to read English articles from the Daily Yomiuri Shimbun, year 2000.
Each student was asked to write about 200 Indonesian questions related to the English articles. Question examples, along with the answers and their source articles, are shown in Table 5.2. The questions were factoid questions with certain EATs. There were 6 EATs: date, quantity, location, person, organization, and name (nouns other than the location, person and organization categories). After deleting duplicate and incorrectly formed questions, we obtained 2,837 questions in total. The number of questions for each EAT is shown in Table 5.1. Because of the development cost, the answer for each question is limited to the one answer given by the respondent; alternative answers are not included in our in-house question-answer pair data. In the experiments against the NTCIR 2005 CLQA task data set, however, we matched the resulting answers against the alternative answers provided with the NTCIR 2005 CLQA data.

Based on our observations of the written Indonesian questions, an Indonesian question always has an interrogative word, which can be located at the beginning, the end or the middle of a question. In the question "Dimana Giuseppe Sinopoli lahir?" (Where was Giuseppe Sinopoli born?), the interrogative word dimana (where) is the first word of the question. In the question "Sebelum menjabat sebagai presiden Amerika, George W. Bush adalah gubernur dimana?" (Before serving as the President of the United States, where did George W. Bush serve as governor?), on the other hand, the interrogative word dimana (where) is the last word of the question.

In addition to the interrogative word, another important word is the question main word. Here, we define the question main word as the question focus, or as a question clue word if the question focus does not exist in the question. Further, related to the question focus, the order of the question focus and the interrogative word in a sentence is reversible: an interrogative word can occur either before or after the question focus. For example, in the question "Lonceng yang ada di Kodo Hall dibuat di negara mana?" (In which country was Kodo Hall's bell made?), the interrogative word mana (which) is located after the question focus negara (country). However, in the question "Apa nama buku yang ditulis oleh Koichi Hamazaki beberapa tahun yang lalu?" (What is the name of the book written by Koichi Hamazaki several years ago?), the interrogative word apa (what) is placed before the question focus buku (book).

Even though the question focus is an important clue word for determining the EAT, as mentioned before, not all question sentences have one. Examples include "Apa yang digunakan untuk menghilangkan lignin?" (What is used to dispose of lignin?) and "Siapa yang menyusun panduan untuk pengobatan dan pencegahan viral hepatitis pada tahun 2000?" (Who composed a reference book for the medication and prevention of viral hepatitis in 2000?). For such questions without a question focus, we select another word, such as "digunakan" (used) or "menyusun" (composed), as the question clue word.

Table 5.2 shows question examples for each EAT and their respective patterns. The third question, for example, "Di prefektur manakah, Tokaimura terletak?" (In which prefecture is Tokaimura located?), has "prefektur" (prefecture) as the question main word. The QA system should be able to locate "prefecture" in the text, since the correct answer to the question is "Ibaraki Prefecture".
Here, the question main word is found to also be part of the answer. The fifth question, "Kapankah insiden penyerangan Alondra Rainbow di Perairan Indonesia?" (When did the Alondra Rainbow attack incident happen off the Indonesian waters?), has no question focus (the question clue word is "insiden" (incident)), but the answer can be located by searching for the closest "date" word. The answer is "October".

Table 5.1: Number of Questions per EAT

EAT            Number of Questions
date           459
location       476
name           482
organization   475
person         447
quantity       498

Table 5.2: Question Examples for each EAT (Question + EAT + Question Main Word + Interrogative Word)

Siapakah ketua Komite Penyalahgunaan Zat di Akademi Pediatri di Amerika (Who is the head of the Committee on Substance Abuse at the American Academy of Pediatrics)
EAT: Person; Question Main Word: ketua (head of); Interrogative Word: siapa (who)

Apa nama institut penelitian yang meneliti Aqua-Explorer? (What is the name of the research institute that does the Aqua-Explorer experiment?)
EAT: Organization; Question Main Word: institut (institute); Interrogative Word: apa (what)

Di kota manakah Universitas Mahidol berada? (In which city is Mahidol University located?)
EAT: Location; Question Main Word: kota (city); Interrogative Word: manakah (which)

Ada berapa lonceng di Kodo Hall? (How many bells are there in Kodo Hall?)
EAT: Quantity; Question Main Word: lonceng (bells); Interrogative Word: berapa (how many)

Kapankah insiden penyerangan Alondra Rainbow di Perairan Indonesia? (When did the Alondra Rainbow attack incident happen off the Indonesian waters?)
EAT: Date; Question Main Word: insiden (incident); Interrogative Word: kapankah (when)

Apa nama buku yang ditulis oleh Koichi Hamazaki beberapa tahun yang lalu? (What was the name of the book written by Koichi Hamazaki several years ago?)
EAT: Name; Question Main Word: buku (book); Interrogative Word: apa (what)

Even though there are some similarities between Indonesian and English sentences, such as the subject-predicate-object order, there are still some syntactical differences, such as the word order in a noun phrase. For example, in English, "Country which has the biggest population?" is a grammatically incorrect sentence. The sentence should be "Which country has the biggest population?". In Indonesian, however, "Negara apa" (country which) has the same meaning as "Apa negara" (which country). Observing such differences between Indonesian and English sentences, we concluded that existing English sentence processing tools could not be used for the Indonesian language.

5.4 Indonesian-English CLQA Schema

Using a common approach, we divide the CLQA system into four components: question analyzer, keyword translator, passage retriever, and answer finder. As shown in Figure 5.1, an Indonesian question is first analyzed by the question analyzer into keywords, the question main word (question focus or question clue word) with its type information, the EAT and phrase information. Our question analyzer consists of a question shallow parser and a question classifier module. The question classifier defines the EAT of a question using an SVM algorithm (provided by WEKA). Then, the Indonesian keywords and the question main word are translated by the Indonesian-English keyword translator. The translation results are used to retrieve the relevant passages. In the final phase, the English answers are located using a text chunking program (Yamcha) with the input features explained in Section 5.5.4.

Figure 5.1: Schema of Indonesian-English CLQA
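To make the data flow of Figure 5.1 concrete, the following minimal sketch shows how the four components might chain together. The four callables are placeholders for the modules described in Section 5.5; the function and key names here are ours, not the actual implementation:

    def answer_question(question, analyze, translate, retrieve, find_answers):
        # 1. Question analyzer: keywords, question main word, EAT, phrase info.
        analysis = analyze(question)
        # 2. Keyword translator: Indonesian keywords -> English keywords.
        keywords = translate(analysis["keywords"])
        # 3. Passage retriever: boolean query over the English corpus.
        passages = retrieve(keywords)
        # 4. Answer finder: B/I/O text chunking over the retrieved passages.
        return find_answers(passages, keywords, analysis["eat"])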
Each module involved in the schema is described in the next section.

5.5 Modules in Indonesian-English CLQA

5.5.1 Question Analyzer

The Indonesian-English CLQA system shown in Figure 5.1 was built by adopting certain modules from the monolingual Indonesian QA system. The only unmodified module is the question analyzer, because this module has the same function in both systems. It receives an Indonesian natural language question and yields information such as the shallow parser's result and the EAT. The detailed approach of the Indonesian question analyzer module was described in Section 4.4.2.

A question example for the Indonesian-English CLQA, along with its features for the question classification, is shown in Figure 5.2. We used two examples whose main difference is in the bi-gram frequency information. For the first question, the question main word "kota" (city) exists in the Indonesian corpus, while for the second question, the question main word "prefektur" (prefecture) does not exist in the Indonesian corpus, which gives a zero score for the bi-gram frequency feature. The features are written in the 4th-8th rows of each question. The features in the 4th-6th rows are yielded by our question shallow parser. The next two rows are the additional features (bi-gram frequency score and WordNet distance).

For the first question, the question main word "kota" (city) is statistically related to the "location" and "organization" entities (7th row). The highest relation is with the "location" entity, with 0.86 as the first bi-gram frequency score. There are 23 preceding words for the "city" word related to the "location" class and 10 words related to the "organization" class. As for the WordNet distance, the question main word "kota" (city) is a hyponym of the word "location". This precise information in the bi-gram frequency score and the WordNet distance leads to a correct prediction of the EAT. For the second question, even though the bi-gram frequency score is zero and the WordNet distance score does not unambiguously indicate that the correct EAT is a location, the question classification module still gives a correct prediction.

As has been proven in the Indonesian QA (Chapter 4), using a question class (EAT) in the answer finder gives higher performance than not using a question class (i.e. depending solely on the question shallow parser result and mainly on the question main word). Compared to a monolingual system, using the question class in a cross language QA system has more benefits. The first benefit occurs when there is more than one translation for a question main word, with different meanings.

Question 1: Di kota manakah Universitas Mahidol berada?
English: In which city is Mahidol University located?
Correct EAT: Location
Interrogative: apa (what)
Question Main Word: kota (city)
Phrase: NP-POST
Bi-gram: date(0,0), loc(0.86,23), name(0,0), organization(0.14,10), person(0,0), quantity(0,0)
WordNet-dist: act(0), animal(0), artifact(0), attribute(0), body(0), cognition(0), communication(0), event(0), feeling(0), food(0), group(0), location(1), motive(0), object(0), person(0), phenomenon(0), plant(0), possession(0), process(0), quantity(0), relation(0), shape(0), state(0), substance(0), time(0)
Resulted EAT: Location

Question 2: Di prefektur mana letak pulau Zamami?
English: In which prefecture is Zamami island located?
Correct EAT: Location
Interrogative: mana (which)
Question Main Word (question focus): prefektur (prefecture)
Phrase: NP-POST
Bi-gram: date(0,0), loc(0,0), name(0,0), org(0,0), person(0,0), quantity(0,0)
WordNet-dist: act(0.5), animal(0), artifact(0), attribute(0), body(0), cognition(0), communication(0), event(0), feeling(0), food(0), group(0), location(0.5), motive(0), object(0), person(0), phenomenon(0), plant(0), possession(0), process(0), quantity(0), relation(0), shape(0), state(0), substance(0), time(0)
Resulted EAT: Location

Figure 5.2: Example of Features for Question Classifier

For example, in the question "Posisi apakah yang dijabat George W. Bush sebelum menjadi presiden Amerika?" (What was George W. Bush's position before he became the President of the United States?), the question main word is posisi (position), which can be interpreted as a place (location) or an occupation (name). By classifying the question into "name", the answer extractor will automatically avoid a "location" answer. The second benefit relates to the problem of an out-of-vocabulary question main word. By providing the question class, even when the question main word cannot be translated, the answer can still be predicted using the question class.

Even though our question shallow parser is a rule-based module, it was built with simple rules (described in Section 4.4.2). Even as a simple rule-based question shallow parser, it could improve the question classification accuracy, as shown in Section 5.6.1.

5.5.2 Keyword Translator

Based on our observations of the collected Indonesian questions, we concluded that there are three types of words used in the Indonesian question sentences:

1. Native Indonesian words, such as "siapakah" (who), "bandara" (airport), "bekerja" (work), etc.
2. English words, such as "barrel", "cherry", etc.
3. Transformed English words, such as "presiden" (president), "agensi" (agency), "prefektur" (prefecture), etc.

We use an Indonesian-English bilingual dictionary [17] (29,047 entries) to translate the non-stopword Indonesian words into English. To handle the second type of keyword, we simply search for the keyword in the English corpus. For the third type of keyword, we apply some transformation rules, such as "k" into "c", or "si" into "cy", etc. The complete transformation rules were shown in Table 2.4. Using this strategy, among the 3706 unique keywords in our 2837 questions, we obtained only 153 OOV words (4%).
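To illustrate the idea, the following toy sketch applies only the two rules quoted above ("k" into "c", "si" into "cy"); the full rule set lives in Table 2.4, and checking the generated variants against the English corpus vocabulary is our assumption about how invalid candidates are discarded:

    RULES = [("si", "cy"), ("k", "c")]  # illustrative subset of Table 2.4

    def transform_candidates(word):
        # Generate spelling variants by applying each rule wherever it matches.
        candidates = {word}
        for src, dst in RULES:
            candidates |= {c.replace(src, dst) for c in candidates}
        return candidates

    def transliterate(word, english_vocabulary):
        # Keep only variants that occur in the English corpus,
        # e.g. "agensi" -> "agency".
        return [c for c in transform_candidates(word) if c in english_vocabulary]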
In addition, we also augmented the English translations by adding synonyms from WordNet.

An example of the keyword translation process is shown in Figure 5.3. The third row shows the keywords extracted from the question. The fourth to sixth rows are the keyword translation results. Two words ("letak", "pulau") can be translated with the Indonesian-English dictionary; "Zamami" is an English term which is not translated because it exists in the English corpus; "prefektur" is transformed into "prefecture" using the transformation rules mentioned above. The next row shows the attempt to add more translations from WordNet. The WordNet addition process is employed for the word "letak" (location, site, position). An example of the keyword addition process is as follows: the word "location" has 4 synsets in WordNet, of which only 2 have synonyms. The first synset has the synonyms "placement", "locating", "position", "positioning" and "emplacement". Because one of these synonyms is also included in the translation result of the bilingual dictionary, all synonyms of the first synset are included as additional keywords.

Question: Di prefektur mana letak pulau Zamami?
English: In which prefecture is Zamami island located?
Indonesian keywords: prefektur, letak, pulau, Zamami
Translated by Indonesian-English dictionary: letak = location, site, position; pulau = island
Exists in English corpus: Zamami
Transliterated: prefektur = prefecture
Augmented by WordNet: letak = location, site, position, placement, locating, situation, emplacement, positioning, place

Figure 5.3: Example of Keyword Translation Result

5.5.3 Passage Retriever

The English translations are then combined into a boolean query. By joining all the translations into a boolean query, the keywords are not filtered down to only one translation candidate, as is done in a machine translation method. The operators used in the boolean query are the "or", "or2", and "and" operators. The "or" operator is used to join the translation sets of each Indonesian keyword. As shown in Figure 5.4, the "or" operator joins the translation sets of "prefektur", "letak", "pulau", and "Zamami". The "or2" operator [32] is used for synonyms. Figure 5.4 shows that the boolean query for the translations of the word "letak" is "location 'or2' site 'or2' position". The "and" operator is used if a translation result contains more than one term. For example, the boolean query for "territorial water" (the translation of "perairan") is "territorial 'and' water".

The IDF score for each translation depends on the number of words in the translation result. For example, if an Indonesian word is translated into only one English word, then the IDF score of this translation is equal to the IDF score of the English word (the number of documents in the corpus divided by the number of documents containing the English word). If the translation consists of more than one English word joined by the "or2" operator, then the IDF score is calculated from the documents containing at least one of the translations. For the "and" operator, the IDF score is calculated from the documents containing all translations in the "and" operator.
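As a minimal sketch of this computation (our own formulation, assuming a postings map from each English word to the set of documents containing it, and a non-empty word list), the IDF of an "or2" or "and" group can be written as follows; following the description above, IDF is taken as a plain ratio rather than a logarithm:

    def idf_or2(words, postings, n_docs):
        # "or2" group (synonyms): documents containing at least one translation.
        docs = set().union(*(postings.get(w, set()) for w in words))
        return n_docs / len(docs) if docs else 0.0

    def idf_and(words, postings, n_docs):
        # "and" group (multi-word translation): documents containing all words.
        docs = set.intersection(*(postings.get(w, set()) for w in words))
        return n_docs / len(docs) if docs else 0.0

For example, idf_or2(["location", "site", "position"], postings, n_docs) scores the "or2" group for "letak" in Figure 5.4.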
The relevant passages are retrieved in two steps: document retrieval and passage retrieval. For the document retrieval, documents with an IDF score higher than half of the highest IDF score are selected. For the passage retrieval, the passages in the retrieved documents within the three highest IDF scores are selected. One passage consists of three sentences. Some examples of such retrieved passages are shown in Figure 5.4. The three passages with the highest IDF scores were retrieved. The first and third passages are correct passages, but the second passage is not. This occurred because the distances among keywords are not yet considered in the passage retrieval.

Question: Di prefektur mana letak pulau Zamami?
English: In which prefecture is Zamami island located?
Answer: Okinawa prefecture
Keywords: prefektur (prefecture), letak (location, site, position), pulau (island), Zamami
Boolean query: (prefecture) or (location or2 site or2 position) or (island) or (Zamami)
IDF scores: prefecture: 0.503; location, site, position: 0.705; island: 1.282; Zamami: 3.705
Passages with the highest IDF score:
... and record their cry off the coast of Zamami Island in Okinawa Prefecture. ...
... Kerama Island of Okinawa Prefecture ... Marilyn, on adjacent Zamami Island. ...
... humpback whale off Zamami Island, Okinawa Prefecture, by using a robot submarine. ...

Figure 5.4: Example of Passage Retriever Result

5.5.4 Answer Finder

As mentioned before, named entity tagging is not used in the answer finder phase; instead, an answer tagging process is employed. In the common approach, researchers tend to use two processes in a CLQA system. The first process is named entity tagging, which tags all named entities in the document. The second process is answer selection, which matches the named entities against the question features. If both processes employ machine learning approaches, then two sets of training data have to be prepared. In this common approach, the error of the answer finder results from the error of the named entity tagger compounded by the error of the answer selection process.

Our approach simplifies the answer finder process: we directly match the document features with the question features. This shortens the development time. It also means that for a new language with no named entity tagger available, one does not have to prepare two sets of training data; instead, one only has to prepare one set of training data for the answer tagging process, which can be built from the question-answer pairs (already available as the QA data) and the answer-tagged passages (which can easily be prepared by automatically searching for and tagging the answers in the relevant passages).

In our answer tagging process, we treat the answer finder as a text chunking process. Each word in the corpus is given a status, either B, I or O, based on features of both the document word and the question. We use the text chunking software Yamcha[20], which works with an SVM algorithm.
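As a hand-made illustration of the B/I/O scheme, take the Zamami example of Figure 5.4, where the correct answer is "Okinawa Prefecture"; the chunker should label the passage tokens as below (the real classification of course uses the full feature set shown later in Figure 5.5):

    # Toy B/I/O labelling for "... off Zamami Island, Okinawa Prefecture, by ..."
    labelled_tokens = [
        ("off",        "O"),
        ("Zamami",     "O"),
        ("Island",     "O"),
        (",",          "O"),
        ("Okinawa",    "B"),  # first word of the answer
        ("Prefecture", "I"),  # inside the answer
        (",",          "O"),
        ("by",         "O"),
    ]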
The answer finder with a text chunking process was also used in QBTE[37]. The QBTE approach employed maximum entropy as the machine learning algorithm and used rich POS information for words. For example, the word "Tokyo" is analyzed as POS1=noun, POS2=propernoun, POS3=location, and POS4=general. The POS information of each word in the question is matched with the POS information of each word in the corpus by using true/false scores as features for the machine learning. In our Indonesian-English CLQA, we do not use such information. The POS information in our Indonesian-English CLQA is similar to the POS1 mentioned in [37]. Even though the Indonesian-English dictionary is larger than the Japanese-English dictionary (24,805 entries) used by Sasaki, the Indonesian-English dictionary does not possess the POS2, POS3 and POS4 information. Another difference in our study is that we use the question class as one of the question features, for the reasons mentioned in Section 5.5.1. We also use the result of the question shallow parser along with the bi-gram frequency score.

For the document features, each word is morphologically analyzed into its root word using TreeTagger[40]. The root word, its orthographic information and its POS (noun, verb, etc.) information are used as features. Unlike in the Indonesian QA (Chapter 4), we do not calculate a bi-gram frequency score for the document word; instead, we calculate its WordNet distance to the 25 synsets listed in the noun lexicographer files of WordNet. Each document word is also complemented by its similarity scores with the question main word and the question keywords. If a question keyword consists of two successive words, such as "territorial water" as the translation of "perairan", then when a document word matches one of the words of the question keyword, the score is divided by the number of words in that question keyword. For example, for the document word "territorial", the similarity score against "territorial water" is 0.5. An example of the features used in the Yamcha text chunking software is shown in Figure 5.5.

Question: Di prefektur mana letak pulau Zamami?
English: In which prefecture is Zamami island located?
Question Features (QF): see Figure 5.2
EAT: Location
Retrieved passage: ... humpback whale off Zamami Island, Okinawa Prefecture, by using a robot submarine.
Document Features (DF) for the word "Prefecture":
lexical: prefecture
POS: NP
orthographic: Upcase alphabet (2)
question-main-word similarity: 1
keyword similarity: 1
bi-gram frequency: see Figure 5.2
Preceding words: "," (classified as "O"), Okinawa (classified as "B")
Classification result for the word "Prefecture": "I"

Figure 5.5: Example of Features (for Word "Prefecture") in Answer Finder Module

5.6 Experimental Results

5.6.1 Question Classifier

In the question classification experiment, we applied an SVM algorithm from the WEKA software[50] with a linear kernel and the "string to word vector" function to process the string values. We used 10-fold cross validation for the accuracy calculation on the collected data described in Section 5.3. We tried several feature combinations for the question classifier. The results are shown in Table 5.3.

Table 5.3: Question Classifier Accuracy

Features    Accuracy Score
B           91.93%
S           95.49%
B+S         93.83%
B+W         94.01%
B+P         94.18%
S+W         95.70%
S+P         95.88%
B+S+W       94.96%
B+S+P       95.07%
S+W+P       96.02%
B+W+P       94.61%
B+S+W+P     95.41%

"B" designates the bag-of-words feature; "S" designates the shallow parser's result features, including the interrogative word, question main word, etc.; "W" designates the WordNet distance feature for the question main word; "P" designates the bi-gram frequency feature for the question main word.

As Table 5.3 shows, feature combinations using the bag-of-words gave lower performance than those without it. Using the same technique as in the Indonesian QA but with different input sentences thus gives different results: the results on Indonesian question classification in the monolingual Indonesian QA (Section 4.5.1) showed that using the bag-of-words feature improves the classification accuracy.
We believe this is because the keywords used in the CLQA are more varied than those used in the monolingual QA. For the queries used in our Indonesian-English CLQA, users generated questions based on their translation knowledge; that is, they were free to use any Indonesian terms as translations of any of the English terms they found in the English article. In the monolingual QA system, however, users tended to use keywords as written in the monolingual (Indonesian) article.

The highest accuracy result is 96.02%, achieved by the combination of the shallow parser's result (S), bi-gram frequency (P) and WordNet distance (W). This also differs from the monolingual QA (Section 4.5.1), where the result using the above combination is lower than that obtained using only the S+P features. This is because there are many English keywords used in the Indonesian questions for the CLQA system, thereby making the WordNet distance a useful feature.

The detailed accuracy for each EAT with the S+P+W features is shown in Table 5.4.

Table 5.4: Confusion Matrix of S+P+W Features for Question Classification

in / out   date   loc   name   org   person   quan
date        459     0      0     0        0      2
loc           0   460      4    10        0      0
name          0    10    467    33        0      0
org           0     4     10   403        8      0
person        0     2      1    29      439      0
quan          0     0      0     0        0    496

The lowest performance is for the "organization" class, which is a difficult task. For example, for the question "Siapa yang mengatakan bahwa 10% warga negara Jepang telah mendaftarkan diri untuk mengikuti Pemilu pada tahun 2000?" (Who said that 10% of Japanese citizens had registered for the general election in 2000?), "person" was obtained as the classification result. Even for a human, it would be quite difficult to define the question class of the above example without knowing the correct answer ("Foreign Ministry").

5.6.2 Keyword Translation

As mentioned before, we translated the Indonesian queries into English using an Indonesian-English dictionary. Apart from that, we also transformed some Indonesian words into English. Using the transformation module, we were able to translate 38 OOV words with only 1 incorrect result. Examples of the transformation results are shown in Table 5.5.

Table 5.5: Examples of Transformation Results

Indonesian    English Translation
Correct Translation
prefektur     prefecture
agensi        agency
jerman        german, germany
Incorrect Translation
mula          mole, mule

5.6.3 Passage Retriever

For the passage retriever, we used two evaluation measures: precision and recall. Precision shows the average ratio of relevant documents. A relevant document is a document that contains a correct answer, without considering any available supporting evidence. Recall refers to the number of questions that might have a correct answer in the retrieved passages. We tested three schemas for the passage retriever; the results are shown in Table 5.6. In Table 5.6, "In-house-sw" means that the keywords are filtered using an additional stop word elimination process, and "In-house-sw-wn" means that WordNet was additionally used to augment the Indonesian-English translations. For the English target corpus, we used the Daily Yomiuri Shimbun from the years 2000 and 2001 (17,741 articles and a vocabulary of 98,922 words, excluding function words).
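The F-Measure column in Table 5.6 is consistent with the standard balanced F-measure, the harmonic mean of precision and recall:

    F-measure = 2 × Precision × Recall / (Precision + Recall)

For example, for "In-house-sw-wn": 2 × 19.7 × 70.4 / (19.7 + 70.4) ≈ 30.8%.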
Table 5.6: Passage Retriever Accuracy

Run-ID            Precision %   Precision #    Recall %   Recall #   F-Measure %
In-house          12.6%         1044 of 8311   71.1%      202        21.4%
In-house-sw       12.1%         1092 of 9026   73.2%      208        20.8%
In-house-sw-wn    19.7%          845 of 4295   70.4%      200        30.8%

We found that using WordNet for the expansion gave a lower recall result, which means a lower number of candidate documents that might contain an answer. With WordNet, some irrelevant passages got higher IDF scores than the relevant ones.

5.6.4 Answer Finder

To locate the answer, we used an SVM-based text chunking software (Yamcha) with the default SVM configuration. We ranked the resulting answers with the following five schemas:

1. A: Using only the Text Chunking score
2. B: (0.3 × Passage Retrieval score) + (0.7 × Text Chunking score)
3. C: (0.5 × Passage Retrieval score) + (0.5 × Text Chunking score)
4. D: (0.7 × Passage Retrieval score) + (0.3 × Text Chunking score)
5. E: Using only the Passage Retrieval score

All results are shown in Table 5.7. To measure the CLQA performance, we used the Top1, Top5 and MRR scores for the exact answers. The "sw" label in the Run-ID column means that the keywords were filtered using two kinds of stopwords: a common stopword elimination (the words listed in the stopword list are deleted from the keyword list) and a special stopword elimination (the words are deleted only if they meet certain criteria, such as "tahun" (year) in "Pada bulan apa akan diadakan pemilihan umum di Jepang pada tahun 2000?" (In what month of the year 2000 will the general election be held in Japan?)). The "wn" label in the Run-ID column means that the English translation keywords are augmented with synonyms from WordNet, as described in Section 5.5.2.

Table 5.7 shows that also using the passage retrieval score ("B", "C" and "D") improves the overall accuracy. The number of retrieved passages influences the machine learning accuracy: the highest question answering accuracy (Top1 and MRR) is achieved by "In-house-sw-wn", which had the lowest recall but the highest precision in the passage retriever evaluation.

Table 5.7: Answer Finder Accuracy

Run-ID            Top1          Top5          MRR
In-house
  A               20.1% (57)    34.2% (97)    25.6%
  B               23.9% (68)    34.2% (97)    28.1%
  C               23.2% (66)    33.8% (96)    27.7%
  D               23.2% (66)    33.8% (96)    27.6%
  E               17.3% (49)    32.8% (93)    23.6%
In-house-sw
  A               20.4% (58)    35.6% (101)   26.9%
  B               24.3% (69)    36.6% (104)   29.2%
  C               23.9% (68)    36.6% (104)   29.0%
  D               23.9% (68)    36.3% (103)   28.8%
  E               19.0% (54)    34.5% (98)    25.2%
In-house-sw-wn
  A               20.4% (58)    34.9% (99)    26.5%
  B               25.0% (71)    35.6% (101)   29.5%
  C               24.7% (70)    35.6% (101)   29.2%
  D               24.7% (70)    35.2% (100)   29.1%
  E               18.0% (51)    33.5% (95)    24.3%

Other than the experiments shown in Table 5.7, we also conducted experiments with question translations without the transformation process. The experimental results in Table 5.8 show that the transformation process benefits the answer finder accuracy, similar to the improvement given by the stop word elimination process shown in Table 5.7.

Table 5.8: Answer Finder Accuracy for Translation without the Transformation Process

Run-ID   Top1          Top5         MRR
A        19.0% (54)    33.5% (95)   25.2%
B        23.2% (66)    33.8% (96)   27.5%
C        22.5% (64)    33.8% (96)   27.1%
D        22.5% (64)    33.5% (95)   27.0%
E        17.3% (49)    31.3% (89)   23.2%
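To make the five ranking schemas of Section 5.6.4 concrete, the following minimal sketch (our own formulation; the two scores are assumed to be comparably normalised) ranks the answer candidates of one question:

    SCHEMAS = {
        "A": (0.0, 1.0),  # text chunking score only
        "B": (0.3, 0.7),
        "C": (0.5, 0.5),
        "D": (0.7, 0.3),
        "E": (1.0, 0.0),  # passage retrieval score only
    }

    def rank_answers(candidates, schema="B"):
        # candidates: list of (answer, passage_retrieval_score, chunking_score).
        w_pr, w_tc = SCHEMAS[schema]
        return sorted(candidates,
                      key=lambda c: w_pr * c[1] + w_tc * c[2],
                      reverse=True)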
5.6.5 Experiments with NTCIR 2005 CLQA Task Test Data

We also conducted a CLQA experiment using the NTCIR 2005 CLQA Task data [39]. Because the English-Japanese and Japanese-English data sets are parallel, we translated the English question sentences into Indonesian and used the translated Indonesian questions to find the answers in the English documents (Daily Yomiuri Shimbun, years 2000-2001). To prepare the Indonesian questions, we asked two Indonesians to translate the 200 English questions into Indonesian. The translated Indonesian questions were then labeled as belonging to one of two categories: Trans1 and Trans2.

The question test set was categorized by NTCIR 2005 into 9 EATs: money, numex, percent, date, time, organization, location, person, and artifact. This question categorization is quite similar to the EATs used in our system: "date", "organization" (except for the "country" question focus), "location" and "person" correspond directly, "artifact" is equal to our "name" EAT, and "money", "numex" and "percent" all correspond to our "quantity" EAT. Some question examples are shown in Table 5.9.

The only EAT that we do not have in the training data is the "time" EAT. It is interesting to observe whether our system can handle such out-of-EAT questions. There are 14 questions in NTCIR 2005 with the "time" EAT. It should be noted, however, that in our data there is no question whose answer includes special terms such as "p.m." or "a.m.". Further, there is also no question similar to "What time did the inaugural flight depart from Haneda Airport's new runway B?". As a result of these differences, in the question classification, our system classified all "time" questions as the "date" EAT. Even though there is no similar question-answer pair in our training data, among these 14 questions our system was able to locate 2 correct answers for the first translation set. We believe that if "time" questions were added to the training data, the system would achieve better performance.

Table 5.9: Question Examples of the NTCIR 2005 CLQA Task

Japanese: 台湾新幹線に対して日本が優先交渉権を獲得したのはいつ
English: When did Japan win prior negotiation rights in the Taiwan Shinkansen project?
Indonesian-1: Kapan Jepang memenangi hak penawaran dalam proyek Shinkansen Taiwan?
Indonesian-2: Kapankah Jepang telah memenangkan negosiasi perjanjian proyek Shinkansen Taiwan?
Correct Answer: Dec. 28
NTCIR EAT: Date; In-house EAT: Date

Japanese: 高橋さんのトラックが北保さんのトラックに追突した時刻は
English: At what hour did a truck driven by Takahashi rear-end a truck driven by Hokubo?
Indonesian-1: Pada jam berapa truk yang dikemudikan Takahashi menabrak bagian belakang truk yang dikemudikan Hokubo?
Indonesian-2: Pada jam berapakah truk yang dikendarai oleh Takahashi mendekati belakang truk yang dikendarai oleh Hokubo?
Correct Answer: about 6 a.m.
NTCIR EAT: Time; In-house EAT: Date

Japanese: 寄生虫病に苦しむ人々は世界中で何人くらいいますか
English: How many people suffer from parasitic diseases all over the world?
Indonesian-1: Berapa banyak jumlah orang yang menderita penyakit yang disebabkan oleh parasit di seluruh dunia?
Indonesian-2: Berapa banyakkah orang yang menderita penyakit parasit di seluruh dunia?
Correct Answer: More than 4 billion people
NTCIR EAT: Numex; In-house EAT: Quantity

Japanese: 21日に慶応大で行われるソテールさんの講演会の司会を務めるのは誰だ
English: Who will host the lecture by Mr. Sautter at Keio University on the 21st?
Indonesian-1: Siapakah yang akan menjadi tuan rumah kuliah tamu Mr. Sautter di Keio University pada tanggal 21?
Indonesian-2: Siapakah yang akan memandu kuliah dari Mr. Sautter di Keio University pada tanggal 21?
Correct Answer: Eisuke Sakakibara
NTCIR EAT: Person; In-house EAT: Person

The experimental results are shown in the following two tables. The passage retriever accuracy is shown in Table 5.10. The result is similar to that described in the previous section: the highest recall score is achieved by the passage retrieval module with additional stop word elimination only, that is, without the WordNet expansion. This holds for both translation data sets.

Table 5.10: Passage Retriever Accuracy for NTCIR Translated Test Data Set

Run-ID          Precision %   Precision #    Recall %   Recall #   F-Measure %
Trans1          18.5%         792 of 4281    76.0%      152        29.8%
Trans1-sw       17.3%         839 of 4842    76.5%      153        28.2%
Trans1-sw-wn    19.3%         753 of 3903    74.5%      149        30.7%
Trans2          10.1%         883 of 8733    77.0%      154        17.9%
Trans2-sw       12.2%         976 of 8028    78.5%      157        21.1%
Trans2-sw-wn    12.4%         912 of 7346    77.5%      155        21.4%

Table 5.11 shows the question answering accuracy results. We used only the "B" ranking score calculation ((0.3 × Passage Retrieval score) + (0.7 × Text Chunking score)).

Table 5.11: Question Answering Accuracy for NTCIR Translated Questions

Run-ID          Top1          Top5         MRR
Trans1          22.5% (45)    34.0% (68)   27.2%
Trans1-sw       22.0% (44)    35.5% (71)   27.6%
Trans1-sw-wn    22.0% (44)    34.5% (69)   27.4%
Trans2          21.0% (42)    31.5% (63)   25.7%
Trans2-sw       22.0% (44)    32.5% (65)   26.7%
Trans2-sw-wn    22.5% (45)    32.5% (65)   26.8%

As a rough comparison, we also include the question answering accuracy of the best run-ids in the Japanese-English task and the Chinese-English task of the NTCIR 2005 CLQA task, shown in Table 5.12[39]. In the NTCIR 2005 CLQA task, an answer is labeled as R or U. R indicates that the answer is correct and extracted from relevant passages, whereas U indicates that the answer is correct but taken from irrelevant passages. In our experiments, we do not evaluate the relevancy of a passage; therefore the results shown in Table 5.12 are the R+U answers for the Top1 and Top5 answers.

Table 5.12: Question Answering Accuracy of Run-ids in NTCIR 2005 CLQA Task[39]

Run-ID            Top1          Top5           MRR
Japanese-English
  NCQAL-J-E-01    31.5% (63)    58.5% (117)    42.0%
  Forst-J-E-02     9.0% (18)    19.5% (39)     12.8%
  Forst-J-E-03     8.5% (17)    19.0% (38)     12.3%
Chinese-English
  UNTIR-C-E-01     6.5% (13)    N/A            N/A
  NTOUA-C-E-03     3.5% (7)     N/A            N/A
  NTOUA-C-E-02     3.0% (6)     N/A            N/A

The best result (NCQAL-J-E-01)[18] used some high quality data resources. To translate Japanese keywords into English, they used three dictionaries: EDICT, ENAMDICT and an in-house translation dictionary. In the question analyzer, they used the ALTJAWS morphological analyzer based on the Japanese lexicon "Nihongo Taikei". The second and third best ranked runs in the Japanese-English task (Forst-J-E)[29] used the EDR Japanese-English bilingual dictionary to translate Japanese keywords into English and then retrieved the English passages. The English passages were translated into Japanese using a machine translation system. The translated Japanese passages, along with the Japanese question sentence, were used as input for a Japanese monolingual QA system. It can be noted here that the number of entries in the EDR Japanese-English dictionary is higher than that of the KEBI Indonesian-English dictionary. In the Chinese-English task, the best result (UNTIR-C-E-01)[8] used currently available resources such as the BabelFish machine translation‖, the Lemur IR toolkit∗∗, the Chinese Encoding Converter††, LingPipe‡‡ and Minipar§§ to annotate English documents.
By using these resources, even though the result was lower than the submitted runs of the Japanese-English task, this team achieved the highest accuracy in the Chinese-English task, with 13 correct answers at Top1. The second and third best team (NTOUA)[22] merged several English-Chinese dictionaries to get better translation coverage and used web based searching to translate the unknown words.

As mentioned before, our answer finder module is similar to the QATRO[37] submitted run that used a maximum entropy machine learning algorithm. That team used only 300 questions as training data, and this is likely the main reason for its low accuracy score (2 correct answers at Top1). Here, we tried various sizes of the training data to show the effect of the training data size on the CLQA accuracy scores. The results shown in Figure 5.6 are the accuracy scores (using the "B" ranking score calculation) for the first translation set of the NTCIR 2005 CLQA task test set.

Figure 5.6: QA Accuracy for Various Sizes of Training Data (Top1, Top5 and MRR accuracy (%) against the number of training questions, from 0 to 3,000)

Figure 5.6 indicates that the size of the training data does influence the accuracy of a question answering system: larger training data yields higher question answering accuracy. Figure 5.6 also shows that with 250 question-answer pairs of training data (fewer than used in QATRO), our system was still able to achieve 12.3% accuracy for the Top1 answer (24.6 correct answers on average), outperformed only by the NCQAL[18] system.

‖ http://babelfish.altavista.com
∗∗ http://www.lemurproject.org/
†† http://www.mandarintools.com/zhcode.html
‡‡ http://www.alias-i.com/lingpipe
§§ http://www.cs.ualberta.ca/~lindek/minipar.htm

5.7 Conclusions

The experiments showed that for a language with poor resources, such as Indonesian, it is possible to build a cross language question answering system with promising results. The machine learning approach and the boolean query make the system easy to build; one does not need to construct a rich rule-based system in order to build a monolingual QA. Compared to other similar approaches, we showed that the EAT information gives a greater benefit than in a monolingual QA. The experiments on the NTCIR 2005 CLQA Task show that our system could give correct answers for out-of-EAT questions.

We believe that this system is suitable for languages with poor resources. For the source language, one needs a POS resource mainly for nouns, verbs and adjectives; other modules such as the question shallow parser can be built with a small programming effort. For the target language, two main resources are needed: a POS resource (or POS tagger) and an ontology resource such as WordNet. Even though data resources similar to WordNet are not readily available for many languages, our experiments in the Indonesian Question Answering showed that this kind of information can be replaced by statistical information derived from a monolingual corpus.

Chapter 6

Transitive Approach in Indonesian-Japanese CLQA

6.1 Introduction

Building an Indonesian-Japanese CLQA is the final goal of this study. To our knowledge, this Indonesian-Japanese CLQA is the first attempt to develop a CLQA system for a language pair with limited resources. It is also the first attempt at building a CLQA system with Japanese as the target document language where the source language is not English.
So far, the only CLQA task with Japanese as the target document language is the English-Japanese subtask of NTCIR 2005, a language pair that usually enjoys rich translation resources, as in NCQAL[18]. The other language pairs available in the NTCIR 2005 CLQA Task[39] are Japanese-English, Chinese-English, and English-Chinese.

The main problem arising in this Indonesian-Japanese CLQA is the limited data resources and tools for the translation module. For this kind of language pair, the commonly available resource is a bilingual dictionary. One would require extra labour to build machine translation software or to provide a parallel corpus for the translation module. Thus, the suitable schema is keyword translation, not sentence translation. Considering the resources that we have built, we concluded that there are several schemas that could be adopted for the Indonesian-Japanese CLQA system:

1. Adopt an approach similar to the Indonesian-English CLQA (as described in Chapter 5). The Indonesian keywords are translated into Japanese (by either a direct or a transitive translation). The passage retriever and answer finder both operate on Japanese terms. For this, we have to prepare training data of tagged Japanese passages for the answer finder module.

2. Use English as the pivot language in the Indonesian-Japanese CLQA. Here, the Indonesian keywords and the Japanese documents are translated into English; the passage retriever and answer finder operate on English terms. The main drawback is the document translation. One could argue for translating only the verbs or nouns of the Japanese documents into English to decrease the labour of document translation. By doing that, however, the system probably could not take advantage of sentence patterns, which usually involve words other than verbs or nouns, such as adverbs, function words, etc.

3. Similar to the second schema, but the passage retriever operates on Japanese. Here, the Japanese-English document translation is employed only for the relevant passages.

4. Similar to the second schema in using English as the pivot language, but the answer finder operates on the Japanese documents. The first passage retrieval is done in English; the nouns from the retrieved English passages are then filtered and used as the input for the Japanese passage retrieval. The answer finder is conducted in Japanese. The drawback of this approach is that the pivot and target languages should have comparable corpora.

We predict that the first and fourth schemas will give higher accuracy scores than the others, owing to the advantage of sentence patterns gained in the first strategy; we also believe it is a valuable contribution to try the fourth schema to cope with the weak Japanese passage retrieval. By this consideration, in this Indonesian-Japanese CLQA we compare these two schemas (the first and the fourth) to evaluate the advantages and drawbacks of each.

As mentioned for the Indonesian-Japanese CLIR system (Chapter 3), other than direct translation, there is the alternative of employing a transitive translation, either a transitive machine translation or a transitive translation with bilingual dictionaries. The experimental results in the Indonesian-Japanese CLIR system showed that the transitive translation with bilingual dictionaries could achieve performance comparable to the direct bilingual dictionary or the transitive machine translation.
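As a minimal sketch (our own formulation) of dictionary-based transitive translation, each Indonesian keyword is mapped to English with the Indonesian-English dictionary and each English candidate is mapped onward with the English-Japanese dictionary; the resulting candidate set is later filtered, e.g. with mutual information and TF × IDF scores (see Section 6.5.1):

    def transitive_translate(indonesian_word, id_en, en_ja):
        # id_en and en_ja are assumed to map a word to a list of translations.
        japanese = []
        for english in id_en.get(indonesian_word, []):
            japanese.extend(en_ja.get(english, []))
        # Remove duplicates while keeping the original candidate order.
        return list(dict.fromkeys(japanese))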
6.2 Related Works

Because no Indonesian-Japanese CLQA exists to date, the related works described in this chapter are the Indonesian-English CLQA (CLEF), where Indonesian is the source language, and the English-Japanese CLQA (NTCIR), where Japanese is the target language.

Since CLEF 2005, there have been 3 works on the Indonesian-English CLQA. All of them adopted the common approach to a CLQA system, where the source question is translated into the target language and the remaining execution is a monolingual QA in the target language. The details of each work are described in Section 5.2.

As for the English-Japanese CLQA, in the NTCIR 2005 CLQA task, 5 groups submitted official runs for the English-Japanese subtask. The performance of each system is shown in Table 6.1[39]. "R" and "U" in the table are the assessment criteria of answers and retrieved documents. "R" (Right) means the answer is correct and the document it comes from supports it. "U" (Unsupported) means that the answer is correct, but the document it comes from cannot support it as a correct answer; that is, there is insufficient information in the document for users to confirm by themselves that the answer is a correct one.

Table 6.1: Performance of English-Japanese CLQA Systems in NTCIR 2005 CLQA Task

Run ID   Accuracy (Top1, R+U)   Top5, R+U
Forst    15.5%                  24.0%
LTI      12.5%                  N/A
NICT     12.0%                  21.0%
QATRO     0.5%                   1.0%
TTN       6.5%                  11.5%

The best result on the English-Japanese CLQA 2005 was achieved by Forst[29], which used machine translation software and web search to translate proper nouns. The accuracy was 15.5% (31 correct Top1 answers among 200 questions). In the answer finder they used a matching score of each morpheme in the retrieved passages against the question keywords and the EAT.

The work most related to ours among the submitted runs at NTCIR 2005 is QATRO[37]. The method is called Extended QBTE (Question-Biased Term Extraction). In the common approach, an answer is identified by using the similarity between an EAT (Expected Answer Type) and the named entity of the answer candidate. This approach uses a different method: it eliminates the question classification and named entity tagging processes, and extracts the answer by classifying each word in the document into one of 3 classes (B, I or O). The "B" class means that the document word is the first word of the answer, "I" means that the word is part of the answer and "O" means that the word is not part of the answer. The answer classification was done using a maximum entropy algorithm with features taken from Chasen, which include four kinds of POS information. The accuracy score obtained in the NTCIR 2005 CLQA was low, as there was only 1 correct answer among the 200 test questions of the English-Japanese CLQA. We modify this method in our answer finder by using the question analyzer result, including the EAT.

6.3 Indonesian-Japanese CLQA with Transitive Translation (Schema 1)

The first schema of the Indonesian-Japanese CLQA is similar to the Indonesian-English CLQA in using a pivot language in the translation phase. The overall architecture is shown in Figure 6.1.
Figure 6.1: Schema of Indonesian-Japanese CLQA using Japanese Answer Finder (Indonesian question → Indonesian question analyzer → Indonesian-English translation → English-Japanese translation → Japanese passage retriever over the Japanese newspaper corpus → Japanese answer finder → Japanese answer)

First, the Indonesian question is processed by a question analyzer for the EAT, question keywords and the shallow parser's results (interrogative word, question main word, and phrase information). The question analyzer used here is the same as the one employed in the Indonesian monolingual QA and the Indonesian-English CLQA.

The question keywords (along with the question main word) are then translated into Japanese. In the proposed method, we use a transitive translation with bilingual dictionaries. For a language pair with limited resources, the data resources for transitive translation with bilingual dictionaries are more readily available than the data resources for direct translation or transitive machine translation. In the experiments, we compare the proposed translation with the direct translation and the transitive machine translation. In order to handle the OOV words, we use a Japanese proper name dictionary and a rule-based transliteration module.

The translated question keywords are joined into a query as input for the Japanese passage retriever module. In the document preparation phase, the Japanese sentences are processed using a morphological analyzer (Chasen[26]) into a list of base words. The morphological analyzer is also applied to the translated question keywords. Japanese non-stopwords are used as the index.

For the answer finder module, the method is similar to the answer finder in the Indonesian monolingual QA (described in Chapter 4) and the Indonesian-English CLQA (described in Chapter 5). It uses a text chunking process on the Japanese passages. The features are similar to those of the Indonesian-English CLQA; the WordNet distance features used in the Indonesian-English CLQA are replaced with the POS information yielded by Chasen as the morphological analyzer for Japanese. An example of the POS information yielded by Chasen for the word 東京 (Tokyo) includes 名詞 (Noun), 固有名詞 (Proper Noun), 地域 (Region) and 一般 (General).

6.4 Indonesian-Japanese CLQA with Pivot Passage Retrieval (Schema 2)

As mentioned in Section 6.1, the second schema of the Indonesian-Japanese CLQA uses the English passage retriever result as the input for the Japanese passage retriever. The architecture is shown in Figure 6.2. The main difference from the first approach is the passage retriever: in the first approach, only a Japanese passage retrieval is conducted, while in the second approach we conduct both an English and a Japanese passage retrieval. In this way, the pivot language is not only used in the translation phase. The nouns (proper and common nouns) of the retrieved English passages are translated into Japanese and used as the input for the Japanese passage retrieval. The rest of the process is identical to the first approach.

We argue that by using the pivot language passage retriever, our system can achieve better performance. This schema can be seen as a query expansion in the Japanese passage retriever, which can compensate for the low quality of the Indonesian-Japanese translation.
Figure 6.2: Schema of Indonesian-Japanese CLQA using English Passage Retriever (Indonesian question → Indonesian question analyzer → Indonesian-English translation → English passage retriever over the English newspaper corpus → English-Japanese translation → Japanese passage retriever over the Japanese newspaper corpus → Japanese answer finder → Japanese answer)

6.5 Modules of Indonesian-Japanese CLQA

There are several modules involved in the Indonesian-Japanese CLQA:

1. Indonesian Question Analyzer
2. Indonesian-Japanese Translation
3. English Passage Retriever
4. Japanese Passage Retriever
5. Japanese Answer Finder

We reused several modules from the previous systems: the Indonesian question analyzer employed in the Indonesian QA and the Indonesian-English CLQA, the Indonesian-Japanese translation employed in the Indonesian-Japanese CLIR, and the English passage retriever of the Indonesian-English CLQA. The newly developed modules are the Japanese passage retriever and the Japanese answer finder, which are described in the following sections.

6.5.1 Japanese Passage Retrieval

We use GETA (http://geta.ex.nii.ac.jp/) as our generic information retrieval engine. It retrieves Japanese documents for a keyword set by using the IDF, TF or TF × IDF score. The Japanese translation results are joined into one query and input into GETA to get the relevant Japanese documents. All passages (two sentences) that contain a word matching the question keywords are used as the retrieved passages.

Because the translation results of the bilingual dictionaries contain many Japanese translation candidates, the translations are filtered using mutual information and the TF × IDF score. First, all combinations of the keyword translation sets are ranked by their mutual information score. Each of the top 5 combinations by mutual information score is used to retrieve relevant documents. In the final phase, we select the documents within the 100 highest TF × IDF scores from all relevant documents resulting from the queries of the top 5 mutual information scores.
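A rough sketch of this filtering retrieval is given below (our own formulation; GETA is approximated by a generic retrieve callable, and mi_score and tfidf_score are assumed helpers). The exhaustive enumeration of combinations is only illustrative:

    from itertools import product

    def retrieve_with_filtering(translation_sets, retrieve, mi_score, tfidf_score):
        # translation_sets: one list of Japanese candidates per source keyword.
        combos = sorted(product(*translation_sets), key=mi_score, reverse=True)
        # Query with each of the top 5 combinations by mutual information ...
        documents = set()
        for combo in combos[:5]:
            documents |= set(retrieve(combo))
        # ... and keep the 100 documents with the highest TF x IDF scores.
        return sorted(documents, key=tfidf_score, reverse=True)[:100]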
6.5.2 Japanese Answer Finder

Our Japanese answer finder locates the answer candidates using a text chunking approach with a machine learning algorithm. Here, the document features are directly matched with the question features. Each word in the retrieved passages is classified into "B" (first word of the answer candidate), "I" (part of the answer candidate) or "O" (not part of the answer candidate). The features used for the classification include the document features, the question features, the EAT information and a similarity score. The document features include the POS information (the four kinds of POS information yielded by Chasen) and the lexical form. The question features include the question shallow parser result and the translated question main word. The similarity score is the score between the document word and the question keywords. For one document word, there are also n preceding words and n succeeding words; in our experiments, we found that n=5 is the best option. An example of the document features for the document word "自民党" with the question "Prime Minister Obuchi's coalition government consists of LDP, Komei, and what other party?" is shown in Figure 6.3.

Document word: 自民党
POS information (Chasen): 名詞, 固有名詞, 組織
Similarity score: 0
5 preceding words: に, よる, 連立 (similarity score is 1), 政権, は
5 succeeding words: 執行, 部, が, 小渕 (similarity score is 1), 前

Figure 6.3: Example of Document Features for the Answer Finder Module

6.6 Experiments

6.6.1 Experimental Data

In order to gain an adequate amount of training data for the question classifier and the answer finder modules, we collected our own Indonesian-Japanese CLQA data. So far, we have 2,837 Indonesian questions and 1,903 answer-tagged Japanese passages. About 1,200 of the Japanese passages (more than half of the 1,903 passages) were obtained from Japanese native speakers who read the English question, the English answer and the corresponding Japanese article (Yomiuri Shimbun, years 2000-2001). The web interface of the Japanese passage annotation task is shown in Figure 6.4. Training data examples are shown in Table 6.2.

Figure 6.4: Web Interface of Japanese Passage Annotation Task

Table 6.2: Training Data Examples for Indonesian-Japanese CLQA (Question + Correct Passage)

Question: Perusahaan apakah yang menerbitkan Weekly Gendai? (What company publishes Weekly Gendai?)
Correct Passage: 読売新聞社は、<A>講談社</A>発行の「週刊現代」と徳間書店発行の「週刊アサヒ芸能」の新聞広告について、「毎号の広告内容に、新聞に載せるのにふさわしくない極めて過激な性表現が多数含まれ、改善が見られない」と判断、読売新聞紙上への広告掲載を当分の間見合わせることを決め、三日までに両社に通知した。

Question: Kapan konsorsium Jepang mengalahkan konsorsium Taiwan untuk Taiwan Shinkansen project? (When did the Japanese consortium defeat the Taiwanese consortium for the Shinkansen project in Taiwan?)
Correct Passage: 台湾版新幹線プロジェクトをめぐっては、国際入札で日本企業の連合と独仏企業連合が受注を競っていたが、<A>十二月二十八日</A>に日本側が優先交渉権を獲得した。日本政府は台湾側の資金調達を全力で支援する姿勢を示すことで正式受注にこぎつける考えだ。

Question: Siapakah konduktor Italia yang lahir di Venice pada tahun 1946? (Who is the Italian conductor who was born in Venice in 1946?)
Correct Passage: <A>ジュゼッペ・シノーポリ氏</A>(ドレスデン国立歌劇場管弦楽団首席指揮者)現代の代表的指揮者。フィルハーモニア管弦楽団音楽監督を経て92年より現職。53歳。イタリア・ベネチア生まれ。

The 200 questions from the NTCIR CLQA task are used as the test data. The literature [39] noted that in the NTCIR 2005 CLQA task, the Japanese questions were created as translations of the English questions by referring to the corresponding Japanese articles, which made the question/answer pairs of the J-E and E-J subtasks parallel. Thus, the Indonesian questions used in the Indonesian-English CLQA (Section 5.6.5) can be employed as the test questions in the Indonesian-Japanese CLQA. The 200 Indonesian test questions contain 625 common nouns and 294 proper nouns. The language resources for the translation phase are the same as those employed in the Indonesian-Japanese CLIR, described in Section 3.5.1.

6.6.2 Evaluation on OOV of Indonesian-Japanese Translation

The OOV rates in the query sentences (test data) are shown in Table 6.3. OOV words are the words that could not be translated by the translation module.
Using the translation resources mentioned in the previous section (including the proper name dictionary, romanized corpus words and transliteration), the overall OOV rates over proper nouns and common nouns were about 15.2%, 11.5%, and 10.4% for the direct translation (vocabulary size of 14,823 entries), the transitive translation (dictionaries of Indonesian-English, 29,054 entries, and English-Japanese, 556,237 entries) and the transitive machine translation (as in the Indonesian CLIR, using the Indonesian-English Kataku engine and the English-Japanese Excite engine), respectively.

Table 6.3: OOV Rates of Proper Noun and Common Noun Translation

Description                      Proper Noun   Common Noun
Direct Translation               12.9%         17.2%
Transitive Translation           13.9%          9.4%
Transitive Machine Translation   13.2%          9.1%

6.6.3 Passage Retriever's Experimental Results

The performance of the Japanese passage retriever is shown in Table 6.4 under two evaluation measures: precision and recall. Precision shows the average ratio of relevant passages to retrieved passages. A relevant passage is defined as a passage that contains a correct answer, without considering any available supporting evidence. Recall refers to the rate of questions that might have correct answers in the retrieved passages. "n-th MI score" means the input query is the keyword set with the n-th ranked MI score. "MI-TF × IDF" is the combination of the mutual information score and the TF × IDF score, as explained in Section 6.5.1.

In the keyword translation using a bilingual dictionary, even though the number of OOV common nouns produced by the direct translation is much larger than with the transitive translation, in general the direct translation has better retrieval performance than the transitive translation (a higher precision score for all methods and a higher recall score for almost all methods). It shows that the important keywords in the document retrieval are mostly the proper nouns (the number of OOV proper nouns of the direct translation is lower than that of the transitive translation).

Table 6.4 also shows that without the combination of TF × IDF and mutual information filtering, the transitive translation result has a lower recall score than the direct translation. This indicates that the combined filtering is effective for the transitive translation result, because it is able to reduce the number of incorrect Japanese translations. For the direct translation, the combined filtering is not effective, because the number of Japanese translations is much lower than in the transitive translation.

Table 6.4: Indonesian-Japanese Passage Retriever's Experimental Results

Description        Recall    Precision
Direct Translation with Bilingual Dictionary
  No filtering     35.5%     2.24%
  1st MI score     37.5%     2.41%
  2nd MI score     37.5%     2.42%
  3rd MI score     36.0%     2.21%
  4th MI score     38.0%     2.21%
  5th MI score     39.0%     2.34%
  MI-TF × IDF      37.0%     2.47%
Transitive Translation with Bilingual Dictionaries
  No filtering     28.5%     1.50%
  1st MI score     34.5%     1.62%
  2nd MI score     36.0%     1.58%
  3rd MI score     34.0%     1.50%
  4th MI score     35.0%     1.72%
  5th MI score     34.5%     1.87%
  MI-TF × IDF      39.0%     1.82%
Transitive Machine Translation
  No filtering     35.0%     2.61%
  1st MI score     37.5%     3.09%
  2nd MI score     35.5%     2.85%
  3rd MI score     35.0%     2.73%
  4th MI score     35.0%     2.75%
  5th MI score     34.5%     2.73%
  MI-TF × IDF      36.0%     2.87%
Keywords of Japanese monolingual queries (oracle)
  No filtering     70.0%     5.46%
Table 6.4 also shows that our translation achieved only about 50% of the recall of the oracle experiment (last row: document retrieval using keywords extracted from the Japanese monolingual queries). The transitive machine translation achieved the worst performance, mainly because of its number of OOV words. This result differs from the one obtained for the Indonesian-Japanese CLIR. One of the reasons is that in the QA passage retriever, the answer, which is one of the most important keywords, does not appear in the input query, unlike in the IR task.

We also grouped the experimental results into queries with OOV words and queries without them. The results, shown in Table 6.5, point out that work remains to handle the OOV words and obtain a better passage retriever performance. They also indicate that the in-vocabulary translation gives imprecise translations: all the recall scores are lower than the oracle experiment.

Table 6.5: Indonesian-Japanese Passage Retriever's Experimental Results (with and without OOV words)

                   Queries without OOV words   Queries with OOV words
Description        Recall     Precision        Recall     Precision
Direct Translation with Bilingual Dictionary
No filtering       48.84%     3.70%            21.9%      0.90%
MI-TF × IDF        53.49%     3.66%            28.1%      1.57%
Transitive Translation with Bilingual Dictionaries
No filtering       43.02%     2.86%            17.5%      0.49%
MI-TF × IDF        52.33%     2.71%            28.95%     1.16%
Transitive Machine Translation
No filtering       42.10%     3.61%            25.58%     1.31%
MI-TF × IDF        42.98%     3.68%            26.74%     1.82%

In order to see the effect of dictionary quality on the passage retrieval, we conducted the passage retriever experiment (without filtering) for the direct translation with four dictionaries. These are the Indonesian-Japanese dictionary reduced to various sizes: 3,000, 5,000, 8,857 and 14,823 entries (the original Indonesian-Japanese dictionary). The reduction was done by keeping the Indonesian words occurring most frequently in the Indonesian newspaper corpus. Figure 6.5 shows the experimental results.

Figure 6.5: Experimental Results of Indonesian-Japanese Passage Retriever with Direct Translation using Dictionaries of Various Sizes (recall and OOV rate, in %, plotted against dictionary sizes of 3,000, 5,000, 8,857 and 14,823 entries)

The conclusion matches the one drawn for the Indonesian-Japanese CLIR: the larger the number of OOV words, the lower the achieved performance. This indicates that dictionary quality plays an important role in a cross language system. It also shows that at the 3,000-word vocabulary size, without the filtering schema, the direct translation using a bilingual dictionary has a lower recall score than the transitive translation using bilingual dictionaries.

6.6.4 Japanese Answer Finder

In the answer finder module, we used Yamcha with its default configuration as the SVM based text chunking software. To evaluate the performance of the answer finder module, we conducted the answer finder experiments on the correct passages. The result is shown in Table 6.6. The evaluation scores are Top1 (rate of correct top-1 answers), Top5 (rate of at least one correct answer among the top 5 answers), TopN (rate of at least one correct answer among all found answers) and MRR (Mean Reciprocal Rank: the average over all questions of the reciprocal rank 1/n, where n is the highest rank of a correct answer). "Baseline" means that we use the features mentioned in Section 6.5.2.
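For reference, the following minimal sketch computes these four scores from ranked answer lists; the exact-match comparison against a gold answer set and all names here are illustrative assumptions.

def evaluate(ranked_answers, gold):
    """Top1/Top5/TopN/MRR over ranked answer lists and gold answer sets."""
    n = len(ranked_answers)
    top1 = top5 = topn = 0
    rr_sum = 0.0
    for answers, correct in zip(ranked_answers, gold):
        ranks = [i + 1 for i, a in enumerate(answers) if a in correct]
        if ranks:
            topn += 1                 # a correct answer was found at all
            rr_sum += 1.0 / ranks[0]  # reciprocal of the highest rank
            if ranks[0] == 1:
                top1 += 1
            if ranks[0] <= 5:
                top5 += 1
    return {"Top1": top1 / n, "Top5": top5 / n,
            "TopN": topn / n, "MRR": rr_sum / n}

ranked = [["小渕", "自民党"], ["講談社"], ["東京", "大阪"]]
gold = [{"自民党"}, {"講談社"}, {"京都"}]
print(evaluate(ranked, gold))
# Top1 = 1/3, Top5 = 2/3, TopN = 2/3, MRR = (1/2 + 1 + 0) / 3 = 0.5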
We tried to add two more features: a word distance feature and a character type feature. The word distance feature gives the distance between the current word and another document word that matches a question keyword. The character type feature labels a document word as number, kanji, katakana, hiragana or alphabet type (a small sketch of this feature appears at the end of this subsection).

To see the effect of the transitive translation in the answer finder features, we conducted two kinds of experiments on the oracle correct documents. The first used the transitive translation to measure the document word similarity score, shown in the first four rows of Table 6.6. The second used the correct translation, i.e. the Japanese keywords contained in the Japanese queries, shown in the last four rows of Table 6.6.

Table 6.6: Question Answering Accuracy for Correct Documents

Description           Top1     Top5     TopN     MRR
Use transitive translation to calculate word similarity
baseline              24.0%    36.0%    40.0%    29.3%
+word distance        20.0%    31.0%    33.0%    24.2%
+character type       22.0%    37.0%    38.0%    27.6%
+distance and type    21.0%    33.5%    34.0%    25.5%
Use keywords of Japanese queries to calculate word similarity
baseline              29.0%    44.0%    48.5%    35.3%
+word distance        27.5%    40.5%    43.5%    33.2%
+character type       30.5%    44.5%    46.5%    36.2%
+distance and type    27.5%    41.5%    44.0%    33.6%

This comparison shows that, for the answer finder method, the translation errors introduced by the transitive translation decrease the answer finder result. The influence, however, is not as significant as for the passage retrieval, where the recall score of the transitive translation is about half of that obtained with the correct Japanese keywords.

As the final experiment, we applied the same answer finder module to the passage retriever results, using the passage retriever with the combined MI and TF × IDF filtering method. The answers were ranked using the recall score (R) of the document retrieval and the text chunking score (T) produced by Yamcha. The question accuracy scores are shown in Table 6.7.

Table 6.7: Question Answering Accuracy for Retrieved Documents

Description           Top1    Top5    TopN    MRR
Direct Translation
Baseline              3       7.5     22      5.5
+word distance        3.5     8.5     16.5    5.7
+character type       3       7.5     14.5    5.4
+distance and type    3.5     8.5     16.5    5.7
Transitive Translation
Baseline              2       6       19.5    3.9
+word distance        2.5     4.5     16.5    4.0
+character type       2       5.5     19.5    4.0
+distance and type    3       4       16.5    4.2
Transitive Machine Translation
Baseline              2       5.5     20      4.2
+word distance        2       4.5     16      3.4
+character type       2       5       18      4.0
+distance and type    2       4       18.5    3.4

All accuracy scores shown in Table 6.7 are quite comparable, similar to the passage retrieval results in Table 6.4. The best Top1 score is achieved by the direct translation method, and the transitive translation achieves an almost comparable Top1 performance. In the Top5 answers, however, even though the passage retriever result of the direct translation (with TF × IDF filtering) is lower than the transitive translation, the QA performance of the direct translation is better. This shows that the ranking schema could not give a good rank position to the correct answer.
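As promised above, a minimal sketch of the character type feature follows. It labels a token by the Unicode block of its first character; labeling mixed tokens by their first character is a simplifying assumption of the sketch, not necessarily the exact rule used in the experiments.

def char_type(token):
    """Label a token as number, kanji, katakana, hiragana or alphabet."""
    c = token[0]
    if c.isdigit():                      # covers full-width digits too
        return "number"
    if "\u3040" <= c <= "\u309f":        # Hiragana block
        return "hiragana"
    if "\u30a0" <= c <= "\u30ff":        # Katakana block
        return "katakana"
    if "\u4e00" <= c <= "\u9fff":        # CJK Unified Ideographs (kanji)
        return "kanji"
    if c.isascii() and c.isalpha():
        return "alphabet"
    return "other"

for tok in ["自民党", "カタカナ", "ひらがな", "2001", "NTCIR"]:
    print(tok, "->", char_type(tok))
# 自民党 -> kanji, カタカナ -> katakana, ひらがな -> hiragana,
# 2001 -> number, NTCIR -> alphabet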
6.6.5 Experimental Results for the Transitive Passage Retriever

As mentioned before, the transitive passage retriever consists of two passage retrievers: an English passage retriever whose keywords are extracted from the original questions (translated into English), and a Japanese passage retriever whose keywords are extracted from the retrieved English passages. We experimented with several numbers of retained Japanese passages, from the 5 highest to the 100 highest by TF × IDF score. The experimental results of the passage retriever (PR) and the question answering (QA) are shown in Figure 6.6.

Figure 6.6: Experimental Results of Indonesian-Japanese CLQA using the Transitive Passage Retriever (Recall (PR), Top1 (QA), Top5 (QA) and MRR (QA), in %, plotted against the number of retained passages, from 100 TF × IDF down to 5 TF × IDF)

Figure 6.6 shows that for the Top1 answer, even though the passage retriever performance of the transitive passage retriever is lower than that of the direct one (without the transitive passage retriever), the CLQA performance using the transitive passage retriever (10 correct answers) is higher than using the direct passage retriever (7 correct answers). This shows that the transitive passage retriever yields better quality passages than the direct one. The weakness of the transitive passage retriever lies in its process, which requires two translations (of the question keywords and of the English passages) and two passage retrieval runs; it is more complex than the direct passage retriever, which needs only one translation (the question keyword translation) and one passage retrieval run.

Even though this Indonesian-Japanese CLQA result is lower than the Indonesian-English CLQA experiments, it is higher than the result of a similar research effort, QATRO [37], for English-Japanese CLQA, which obtained 1 correct answer for the Top1 answer.
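To summarize the mechanics of this subsection, the following minimal sketch outlines the two-stage retrieval. Here retrieve(), extract_keywords() and translate_en_ja() are toy stand-ins for the TF × IDF ranked retriever, the keyword extraction and the dictionary lookup used in the experiments; all data and names are illustrative assumptions.

def retrieve(keywords, collection, top_k):
    """Toy stand-in for a TF x IDF ranked passage retriever."""
    scored = sorted(collection,
                    key=lambda p: sum(p.count(k) for k in keywords),
                    reverse=True)
    return scored[:top_k]

def extract_keywords(passages):
    """Toy stand-in for content-word extraction from retrieved passages."""
    return sorted({w for p in passages for w in p.split() if len(w) > 3})

def translate_en_ja(words):
    """Toy stand-in for the English-Japanese dictionary lookup."""
    en_ja = {"publisher": "出版社", "magazine": "雑誌", "Gendai": "現代"}
    return [en_ja[w] for w in words if w in en_ja]

def transitive_retrieve(question_keywords_en, english_docs, japanese_docs,
                        top_k=5):
    # Stage 1: retrieve English passages with the pivot question keywords.
    english_passages = retrieve(question_keywords_en, english_docs, top_k)
    # Expansion: extract keywords from the retrieved English passages and
    # translate them into Japanese (the second translation step).
    bridge = translate_en_ja(extract_keywords(english_passages))
    # Stage 2: retrieve Japanese passages with the expanded keyword set.
    return retrieve(bridge, japanese_docs, top_k)

docs_en = ["the publisher of the magazine Gendai", "unrelated text"]
docs_ja = ["講談社 は 雑誌 現代 の 出版社", "別 の 記事"]
print(transitive_retrieve(["publisher", "magazine"], docs_en, docs_ja, 1))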
6.6.6 Experimental Results of English-Japanese CLQA

As a comparison, we also conducted an experiment on English-Japanese CLQA. Using the same test data (the CLQA task of NTCIR 2005) and the Eijirou English-Japanese dictionary, we adopted the same method as in the Indonesian-English CLQA. For this, we built our own English question shallow parser and used TreeTagger as the POS tagger software. The experimental results are shown in Table 6.8 and Table 6.10.

Table 6.8 presents the passage retriever results for the English-Japanese CLQA as recall and precision scores. The experimental results point out that the filtering using the mutual information and TF × IDF scores is not effective for the direct translation; using only the mutual information to filter the translation is adequate to enhance the passage retriever's score. These scores are comparable with the Indonesian-Japanese passage retriever (see Table 6.4). This strengthens the conclusion that in order to enhance the quality of the translation, the most important keywords to be handled are the proper nouns.

Table 6.8: Performance of English-Japanese Passage Retriever

Description        Recall   Precision
No filtering       32.0%    1.86%
1st MI score       40.5%    1.91%
2nd MI score       41.5%    1.57%
3rd MI score       37.5%    1.56%
4th MI score       37.5%    1.56%
5th MI score       40.0%    1.41%
MI-TF × IDF        37.5%    2.38%

We also calculated the passage retriever performance for queries with OOV words and queries without them. The results are shown in Table 6.9. The OOV rate is 6.1% (14.6% for proper nouns and 2.1% for common nouns). This OOV rate is higher than in the Indonesian-Japanese translation system because some English words are OOV in the English-Japanese dictionary even though their Indonesian equivalents are in-vocabulary; an example is "WW II" or "World War II" ("Perang Dunia 2" in Indonesian).

Table 6.9: Performance of English-Japanese Passage Retriever (with and without OOV words)

                   Queries without OOV words   Queries with OOV words
Description        Recall     Precision        Recall     Precision
No filtering       35.5%      2.29%            23.7%      0.83%
1st MI score       44.0%      2.35%            32.2%      0.82%
2nd MI score       43.97%     1.84%            35.6%      0.89%
3rd MI score       39.0%      1.87%            33.9%      0.81%
4th MI score       40.4%      1.96%            30.5%      0.61%
5th MI score       44.7%      1.63%            28.8%      0.87%
MI-TF × IDF        41.8%      3.07%            27.1%      0.68%

The passage retriever results with filtering are used as input for the Japanese answer finder, whose performance is shown in Table 6.10. The answer finder performance is worse than the Indonesian-Japanese one. We assume this is because of the OOV words (proper nouns), which hold important information in the question.

Table 6.10: English-Japanese Question Answering Accuracy

Description           Top1    Top5    TopN    MRR
No Filtering
Baseline              2       3.5     5.5     2.7
+word distance        2.5     4.5     9.5     3.5
+character type       1.5     4       11.5    2.7
+distance & type      2       4       8.5     3.0
Top 1 Mutual Information Filtering
Baseline              3       6.5     22      5.2
+word distance        1       4       15.5    2.9
+character type       3       5       17.5    4.6
+distance & type      0.5     4       15.5    2.6
MI-TF × IDF Filtering
Baseline              2       5       13      3.7
+word distance        0.5     3       12      2.2
+character type       2.5     4.5     13.5    3.7
+distance & type      0       3       12.5    1.8

6.7 Conclusions

We have conducted Indonesian-Japanese CLQA using easily adapted modules and a transitive approach. There are two transitive approaches: the transitive translation and the transitive passage retriever. By filtering with the mutual information score and the TF × IDF score, the transitive translation achieves a passage retriever performance comparable with the direct translation. In the answer finder experiment, the direct translation performs better than the transitive translation, and the filtering could not make them comparable. As for the transitive passage retriever, its answer finder performance achieved the best result among all the conducted experiments. This shows that the query expansion effect of the transitive passage retriever yields more relevant passages than the direct passage retriever alone.

Even though the accuracy score is lower than the Indonesian-English CLQA using a similar approach, we believe that this result can be enhanced by improving the proper name translation. As the next research step, we will try to improve the proper name translation method, for example by using the Internet as a translation resource. The experimental results also show that the answer ranking schema should be modified in order to eliminate incorrect answers.

Chapter 7

Conclusions and Future Research

7.1 Conclusions

The experiments on the Indonesian-Japanese CLIR showed that the transitive translation with bilingual dictionaries, using a combined translation filtering schema, can achieve an IR result comparable to those obtained by the direct translation and the transitive machine translation. This phenomenon gives hope for research development in cross language systems for a limited resource language such as Indonesian.
By using available resources such as a bilingual dictionary (between the language in question and English as the major language) and a monolingual corpus, one can develop a cross language system with promising results. Our systems consist of rule-based modules, statistical methods and machine-learning based modules. The rule based modules are easy to build: for example, in the question answering system we made a question shallow parser with the few rules described in Section 4.4.2, and the additional feature of class candidates in the question classification (Section 4.4.2). In the machine learning modules, we used existing machine learning software such as Yamcha [20], Weka [50] and so on. The features for the machine learning are also easily extracted, using simple rule-based or statistical methods.

As for knowledge resources such as WordNet (used in the English answer finder of the Indonesian-English CLQA), our analysis is that such a resource can be replaced by a statistical method, as done in the Indonesian answer finder. This means that if a language has a knowledge resource such as WordNet or ChaSen, that resource can be installed easily in the question answering system without building any mapping table or rules; and if the language does not have this kind of knowledge resource, as is the case for Indonesian, one can still use the statistical information gained from a monolingual corpus. The machine learning method itself proved to be effective, as in our monolingual QA and CLQA, and the effort to build such a system is much lower than building a rich rule-based system. For the above reasons, we argue that these systems (CLIR, monolingual QA, CLQA) can be applied to other languages with a minimum of programming effort.

The detailed conclusions of our study are as follows. In Chapter 3, we investigated the effectiveness of transitive translation with bilingual dictionaries in a CLIR system. Transitive translation with a bilingual dictionary yields many translations, some of which are incorrect. Using all translation results gives a lower IR score than other translation methods, so a translation filtering step is needed to improve the IR score. The conclusions of this chapter are:

• The IR result using transitive translation with a bilingual dictionary can be comparable to the direct translation and the transitive machine translation if the keyword filtering schema is good enough to select the best translation. Here, we proposed a filtering schema using the mutual information score and the TF × IDF score.

• The OOV words can be reduced from 50 to 5 words using borrowed word translation, which employs a Japanese proper name dictionary and an English-Japanese dictionary.

• The IR using a combination of direct and transitive translation with bilingual dictionaries outperformed the other translation methods.

The monolingual QA for Indonesian is described in Chapter 4. It is a study on the machine learning approach to a monolingual QA for a limited resource language. The conclusions are:

• Machine learning methods are suitable for the question classification and the answer finder, two important modules in CLQA.

• In the question classification, using the features produced by a rule-based question shallow parser together with statistical features achieves a classification accuracy of about 95%.
• In the answer finder, using a machine learning method eliminates the role of the NE (Named Entity) tagger, which is a common component in building a QA system. Using the available features, without explicit semantic information, the machine learning can still give a good accuracy score. Combining the EAT (Expected Answer Type), the question features and the document features improves the QA accuracy compared to the uncombined features.

Chapter 5 describes our work on Indonesian-English CLQA. The CLQA is built with a similar approach to the monolingual QA, with a machine learning method. The translation part makes the CLQA task more difficult than a monolingual QA. The conclusions of our work on the CLQA system using the machine learning method are as follows:

• Translation using a bilingual dictionary and some transformation rules gives larger translation coverage than the machine translation method used by other Indonesian-English CLQA systems.

• Using the EAT information in the machine learning based answer finder module improves the system accuracy. It also benefits test questions that do not have patterns similar to the training data.

• CLQA with a machine learning approach can achieve better performance than rule-based systems such as those used in two other Indonesian-English CLQA systems. It should be noted that the amount of training data does influence the system performance: the more valid data are provided, the more accurate the system.

The work on Indonesian-Japanese CLQA is described in Chapter 6. It used a transitive approach in the translation and passage retrieval phases. The conclusions are as follows:

• The filtering method using the mutual information score and the TF × IDF score on the transitive translation with bilingual dictionaries is quite effective in the passage retrieval module. It can give better performance than the direct translation.

• The translation results affect the passage retriever accuracy more than the answer finder accuracy, as shown in Table 6.6.

• A pivot language can also be used in the passage retriever. The experimental results show that using an English (pivot) passage retriever improved the performance of the Japanese (target) passage retriever.

To give the overall picture, Table 7.1 shows the experimental results of the passage retriever and answer finder modules for the Indonesian monolingual QA, the Indonesian-English CLQA and the Indonesian-Japanese CLQA. The good performance of the monolingual system shows that this method is efficient for a monolingual system: one does not have to provide many language processing tools to adopt it. To obtain a higher passage retriever performance, one can use a more advanced information retrieval method that handles synonyms better. For the answer finder, one can also add further features such as the word distance feature.

In the CLQA systems, the Indonesian-English CLQA achieved a good QA score. One of the reasons is the similarity between the Indonesian and English languages, where some borrowed English words need only a lexical transformation as the translation method. The experimental results show that this method is appropriate for a cross language system even though the translation resource is only a medium-sized bilingual dictionary. The Indonesian-Japanese CLQA did not achieve as good a performance as the Indonesian-English one.
Although the English corpus used in the Indonesian-English CLQA is comparable in content with the Japanese corpus used in the Indonesian-Japanese CLQA, the corpus sizes differ: the English corpus contains 17,741 articles and the Japanese corpus 658,719 articles. Besides the different characteristics of English and Japanese sentences (the latter without word segmentation), this corpus size makes the Japanese passage retrieval more difficult than the English one. The main cause of the low passage retrieval score is the translation errors between Indonesian and Japanese. With the available resources, the translation could not resolve the OOV proper noun problem: many proper nouns that are important question keywords could not be translated by the translation resources.

Table 7.1: Comparisons on the Question Answering Performance

                                Indonesian QA    Indonesian-English CLQA         Indonesian-Japanese CLQA
Passage Retriever Performance
Recall                          89%              76%                             34%
Precision                       9%               18.5%                           1.2%
Corpus Size                     71,109           17,741                          658,719
Answer Finder Performance
Top1                            46%              22.5%                           5%
Top5                            50%              34.0%                           9%
MRR                             51%              27.2%                           6.5%
Test Data                       200 (built in)   200 questions NTCIR CLQA 2005   200 questions NTCIR CLQA 2005

7.2 Future Research

As mentioned above, the developed systems have given quite promising results. We believe that many other methods can be combined with the existing ones in order to achieve higher performance. In the translation module, more translations can be added using a statistical method on the source and target corpora. One can also try to use the Internet to supply translations for the OOV words. As for the passage retrieval in the CLQA, a word window can be used to limit the number of retrieved passages; this approach could improve the precision score of the passage retriever module. Another idea is to add to or improve the feature quality for the machine-learning based modules (the question classifier and the answer finder), for example the similarity score between a corpus word and a question word.

To prove the easy adaptation of these systems, it is worth trying to adapt the systems to another language and measuring the development effort. Analysis is usually sharper once the proposed plan is actually carried out, and by adapting the systems in real development, they could also be improved with new ideas arising from the problems encountered.

These text based systems can be enhanced with a speech interface. The system can be supplied with spoken questions or spoken documents. Using spoken questions and answers makes the system available through devices such as microphones, telephones, etc. Involving spoken documents means that it can handle recorded material such as presentations, discussions, etc. Another idea is to involve the question answering system in a dialogue system. In a dialogue system, the answer should be extracted in real time, so some modules, such as the searching module, should be improved in order to achieve a faster response time. A dialogue question answering system could also mean an interactive question answering system, where the question clues are spread among many questions in one dialogue.

This pivot language approach could also be employed for other cross language text processing systems, for example cross language information extraction, cross language text summarization, etc.

Bibliography

[1] Septian Adiwibowo and Mirna Adriani. Finding Answers Using Resources in the Internet. In Working Notes of CLEF 2007 Workshop, Hungary, September 2007. [cited at p. 78]

[2] Mirna Adriani. Using statistical term similarity for sense disambiguation in cross-language information retrieval. Information Retrieval, 2(1):71–82, 2000. [cited at p. 22]
[3] Mirna Adriani and Rinawati. University of Indonesia Participation at Query Answering-CLEF 2005. In Working Notes of CLEF 2005 Workshop, Vienna, Austria, September 2005. [cited at p. 50, 77, 78]

[4] Mirna Adriani and C.J. van Rijsbergen. Term Similarity Based Query Expansion for Cross Language Information Retrieval. In Proc. of Research and Advanced Technology for Digital Libraries (ECDL’99), pages 311–322, Paris, 1999. Springer Verlag. [cited at p. 76]

[5] J.S. Badudu, editor. Pelik-Pelik Bahasa Indonesia. CV NawaPutra, Bandung, October 2001. [cited at p. 7, 8, 9, 10, 14]

[6] Lisa Ballesteros and W. Bruce Croft. Resolving ambiguity for cross-language retrieval. In SIGIR ’98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 64–71, 1998. [cited at p. 22]

[7] Lisa A. Ballesteros. Cross-language retrieval via transitive translation. In W. Bruce Croft, editor, Advances in Information Retrieval, pages 203–230. Kluwer Academic Publishers, 2000. [cited at p. 21]

[8] Jiangping Chen, Rowena Li, Ping Yu, He Ge, Pok Chin, Fei Li, and Cong Xuan. Chinese QA and CLQA: NTCIR-5 QA Experiments at UNT. In Proc. of NTCIR-5 Workshop Meeting, pages 242–249, Tokyo, Japan, December 2005. [cited at p. 97]

[9] Excite Japan. Excite machine translation. http://www.excite.co.jp/world/. [cited at p. 32]

[10] Marcello Federico and Nicola Bertoldi. Statistical cross-language information retrieval using n-best query translations. In SIGIR ’02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, 2002. [cited at p. 22]

[11] Christopher Fox. A stop list for general text. SIGIR Forum, 24(4), 1990. [cited at p. 32]

[12] Atsushi Fujii and Tetsuya Ishikawa. NTCIR-3 cross-language IR experiments at ULIS. In Proceedings of the Third NTCIR Workshop, March 2003. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings3/NTCIR3CLIR-FujiiA.pdf. [cited at p. 30, 31]

[13] Jianfeng Gao, Jian-Yun Nie, Endong Xun, Jian Zhang, Ming Zhou, and Changning Huang. Improving query translation for cross-language information retrieval using statistical models. In SIGIR ’01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 96–104, New York, NY, USA, 2001. ACM Press. [cited at p. 22]

[14] Jianfeng Gao, Jian-Yun Nie, and Ming Zhou. Statistical Query Translation Models for Cross-Language Information Retrieval. ACM Transactions on Asian Language Information Processing, 5(4):323–359, 2006. [cited at p. 22]

[15] Tim Gollins and Mark Sanderson. Improving cross language retrieval with triangulated translation. In SIGIR ’01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, 2001. [cited at p. 21]

[16] Sanda M. Harabagiu, Marisu A. Pasca, and Steven Maiorano. Experiments with open-domain textual question answering. In Proc. of the 18th International Conference on Computational Linguistics (COLING 2000), pages 292–298, Saarbrücken, Germany, 2000. [cited at p. 49, 54]

[17] Indonesian Agency for The Assessment and Application of Technology. KEBI, Kamus Elektronik Bahasa Indonesia (Indonesian electronic dictionary). http://nlp.aia.bppt.go.id/kebi/. [cited at p. 31, 77, 85]

[18] Hideki Isozaki, Katsuhito Sudoh, and Hajime Tsukada. NTT’s Japanese-English Cross Language Question Answering System. In Proc. of NTCIR-5 Workshop Meeting, pages 186–193, Tokyo, Japan, December 2005. [cited at p. 77, 97, 98, 101]
[19] Kazuaki Kishida and Noriko Kando. Two-stage refinement of query translation in a pivot language approach to cross-lingual information retrieval: An experiment at CLEF 2003. In Bridging Languages for Question Answering: DIOGENE at CLEF 2003, Lecture Notes in Computer Science, pages 253–262. Springer, Berlin / Heidelberg. [cited at p. 22]

[20] Taku Kudoh and Yuji Matsumoto. Use of Support Vector Learning for Chunk Identification. In Proc. of the Fourth Conference on Natural Language Learning (CoNLL-2000), pages 142–144, Lisbon, Portugal, 2000. [cited at p. 63, 88, 119]

[21] Xin Li and Dan Roth. Learning Question Classifiers. In Proc. of the 19th International Conference on Computational Linguistics (COLING 2002), pages 556–562, Taipei, Taiwan, 2002. [cited at p. 50]

[22] Chuan-Lie Lin, Yu-Chun Tzeng, and Hsin-Hsi Chen. System Description of NTOUA Group in CLQA1. In Proc. of NTCIR-5 Workshop Meeting, pages 250–255, Tokyo, Japan, December 2005. [cited at p. 97]

[23] Yi Liu, Rong Jin, and Joyce Y. Chai. A Statistical Framework for Query Translation Disambiguation. ACM Transactions on Asian Language Information Processing, 5(4):360–387, 2006. [cited at p. 22]

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo, and Maarten de Rijke. The Multiple Language Question Answering Track at CLEF 2003. In Proc. of CLEF 2003 Workshop, Norway, August 2003. [cited at p. 75]

[25] Mainichi Shimbun Co. CD-ROM Data Sets 1993–1995. Nichigai Associates Co., 1994–1996. [cited at p. 32]

[26] Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Hiroshi Matsuda, Kazuma Takaoka, and Masayuki Asahara. Morphological Analysis System ChaSen version 2.2.1 Manual. http://chasen.aist-nara.ac.jp/chasen/doc/chasen2.2.1.pdf, 2000. [cited at p. 27, 32, 105]

[27] Hideki Michibata, editor. Eijiro. ALC, March 2002. (in Japanese). [cited at p. 32]

[28] George A. Miller. WordNet: a lexical database for English. Commun. ACM, 38(11), 1995. [cited at p. 32]

[29] Tatsunori Mori and Masami Kawagishi. A Method of Cross Language Question-Answering Based on Machine Translation and Transliteration. In Proc. of NTCIR-5 Workshop Meeting, pages 215–222, Tokyo, Japan, December 2005. [cited at p. 97, 103]

[30] Overture Services, Inc. AltaVista Babelfish machine translation. http://www.altavista.com/babelfish/. [cited at p. 32]

[31] Marisu A. Pasca and Sanda M. Harabagiu. High Performance Question Answering. In Proc. of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 366–374, New Orleans, 2001. [cited at p. 51]

[32] Ari Pirkola. The Effects of Query Structure and Dictionary Setups in Dictionary-based Cross Language Information Retrieval. In Proc. of the 21st Annual International ACM SIGIR, pages 55–63, 1998. [cited at p. 63, 86]

[33] Ayu Purwarianti, Masatoshi Tsuchiya, and Seiichi Nakagawa. Indonesian-Japanese Transitive Translation using English for CLIR. Journal of Natural Language Processing, Information Processing Society of Japan, 14(2), 2007. [cited at p. 50, 76]

[34] Yan Qu, Gregory Grefenstette, and David A. Evans. Resolving translation ambiguity using monolingual corpora. In Carol Peters, Martin Braschler, Julio Gonzalo, and Michael Kluck, editors, Advances in Cross-Language Information Retrieval (CLEF 2002), pages 223–241. Springer, Berlin / Heidelberg, 2002. [cited at p. 22]
[35] Deepak Ravichandran, Eduard Hovy, and Franz Josef Och. Statistical QA - Classifier vs Re-ranker: What’s the difference? In Proc. of the ACL Workshop on Multilingual Summarization and Question Answering - Machine Learning and Beyond, pages 69–75, Sapporo, Japan, 2003. [cited at p. 51]

[36] Sanggar Bahasa Indonesia Proyek. Kmsmini2000. http://m1.ryu.titech.ac.jp/indonesia/todai/dokumen/kamusjpina.pdf, 2000. [cited at p. 32]

[37] Yutaka Sasaki. Baseline Systems for NTCIR-5 CLQA1: An Experimentally Extended QBTE Approach. In Proc. of NTCIR-5 Workshop Meeting, pages 230–235, Tokyo, Japan, December 2005. [cited at p. 77, 78, 79, 88, 97, 103, 116]

[38] Yutaka Sasaki. Question Answering as Question-Biased Term Extraction: A New Approach toward Multilingual QA. In Proc. of the 43rd Annual Meeting of the ACL, pages 215–222, Ann Arbor, 2005. [cited at p. 51, 70, 73, 74, 77]

[39] Yutaka Sasaki, Hsin-Hsi Chen, Kuang-hua Chen, and Chuan-Jie Lin. Overview of the NTCIR-5 Cross-Lingual Question Answering Task (CLQA1). In Proc. of NTCIR-5 Workshop Meeting, pages 230–235, Tokyo, Japan, December 2005. [cited at p. 75, 94, 96, 97, 101, 103, 109]

[40] Helmut Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proc. of the International Conference on New Methods in Language Processing, Manchester, UK, 1994. http://www.ims.uni-stuttgart.de/ftp/pub/corpora/treetagger1.pdf. [cited at p. 77, 88]

[41] Marcin Skowron and Kenji Araki. Effectiveness of Combined Features for Machine Learning Based Question Classification. Journal of Natural Language Processing, Information Processing Society of Japan, 6:63–83, 2005. [cited at p. 51]

[42] Leah S. Larkey, Margaret E. Connell, and Nasreen Abduljaleel. Hindi CLIR in Thirty Days. ACM Transactions on Asian Language Information Processing, 2(2):130–142, 2003. [cited at p. 22]

[43] Fadilla Z. Tala. A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia, 2003. [cited at p. 50]

[44] Kumiko Tanaka and Kyoji Umemura. Construction of a bilingual dictionary intermediated by a third language. In Proceedings of the 15th Conference on Computational Linguistics, volume 1, pages 297–303, 1994. [cited at p. 36]

[45] Toggle Text. Kataku Automatic Translation System. http://www.toggletext.com/kataku-trial.php. [cited at p. 32]

[46] Alessandro Vallin, Bernardo Magnini, Danilo Giampiccolo, Lili Aunimo, Christelle Ayache, Petya Osenova, Anselmo Peñas, Maarten de Rijke, Bogdan Sacaleanu, Diana Santos, and Richard Sutcliffe. Overview of the CLEF 2005 Multilingual Question Answering Track. In Proc. of CLEF 2005 Workshop, Vienna, Austria, September 2005. [cited at p. 75]

[47] C.J. van Rijsbergen. Information Retrieval, 2nd edition. Butterworths, London, UK, 1979. [cited at p. 29]

[48] Ellen M. Voorhees and Dawn M. Tice. Building a question answering test collection. In SIGIR 2000: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 200–207, 2000. [cited at p. 49, 54]

[49] Sri Hartati Wijono, Indra Budi, Lily Fitria, and Mirna Adriani. Finding Answers to Indonesian Questions from English Documents. In Working Notes of CLEF 2006 Workshop, Spain, September 2006. [cited at p. 77, 78]

[50] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition. Elsevier Inc., 2005. [cited at p. 64, 89, 119]
[51] Yomiuri Shimbun Co. Article Data of Daily Yomiuri 2001. Nihon Database Kaihatsu Co., 2001. http://www.ndk.co.jp/yomiuri/e-yomiuri/e-index.html. [cited at p. 32]

[52] Dell Zhang and Wee Sun Lee. Question Classification using Support Vector Machines. In Proc. of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 26–32, Toronto, Canada, 2003. [cited at p. 50, 51]

[53] Guowei Zu, Wataru Ohyama, Tetsushi Wakabayashi, and Fumitaka Kimura. Automatic text classification of English newswire articles based on statistical classification techniques. IEEJ Transactions on Electronics, Information and Systems, 124-C(3):852–860, March 2004. [cited at p. 32]

List of Publications

Journal Papers

1. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. 2007. “Indonesian-Japanese Transitive Translation using English for CLIR”. Journal of Natural Language Processing, pp. 95–123, Volume 14, No. 2, April 2007.

2. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. 2007. “A Machine Learning Approach for an Indonesian-English Cross Language Question Answering System”. IEICE Transactions on Information and Systems, pp. 1841–1852, Volume E90-D, No. 11, November 2007.

International Conferences

1. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “Query Transitive Translation Using IR Score for Indonesian-Japanese CLIR”. Proceedings of the Second Asia Information Retrieval Symposium, pp. 565–570, October 13–15, 2005. Jeju Island, Korea. Lecture Notes in Computer Science (LNCS) 3689, Information Retrieval Technology.

2. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “Indonesian-Japanese CLIR Using Only Limited Resource”. Proceedings of the Workshop on How Can Computational Linguistics Improve Information Retrieval, Workshop at COLING/ACL 2006, pp. 1–8, July 23, 2006. Sydney, Australia.

3. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “A Machine Learning Approach for Indonesian Question Answering System”. Proceedings of the IASTED International Conference on Artificial Intelligence and Applications (AIA 2007), pp. 537–542, February 12–14, 2007. Innsbruck, Austria.

Language Processing Society of Japan, reports

1. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “Query Translation from Indonesian to Japanese using English as Pivot Language”. The 11th Annual Conference, Language Processing Society of Japan, B3-9, pp. 580–583, March 2005. Takamatsu, Japan.

2. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “Estimation of Question Types for Indonesian Question Sentence”. The 12th Annual Conference, Language Processing Society of Japan, B2-8, pp. 344–347, March 2006. Keio University, Japan.

3. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “Indonesian-English Cross Language Question Answering”. The 13th Annual Conference, Language Processing Society of Japan, E5-4, March 2007. Ryukoku University, Japan.

4. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “A Transitive Translation for Indonesian-Japanese CLQA”. The 182nd SIG Meeting on Natural Language Processing (第182回自然言語処理研究会), IPSJ SIG Technical Report, pp. 93–100, November 2007. Shizuoka University, Japan.

Indonesian Student Association in Japan, reports

1. A. Purwarianti and S. Nakagawa. “Query Translation in Indonesian-Japanese Cross Language Information Retrieval”. Proc. of 13th Indonesian Scientific Conference (ISC) in Japan, pp. 428–437, September 2004. Tokyo, Japan.

2. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “Proper Name Translation in Indonesian-Japanese CLIR”.
Proc. of 14th Indonesian Scientific Conference in Japan, pp. 433-440, September, 2005. Nagoya, Japan. 3. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “Research in Indonesian Question Answering Systems”. Chubu Chapter Indonesian Scientific Meeting, March, 2006. Toyohashi, Japan. 4. A. Purwarianti, M. Tsuchiya, and S. Nakagawa. “SVM based Indonesian Question Classification Using Indonesian Monolingual Corpus and WordNet”. Proc. of 15th ISC in Japan, August, 2006. Hiroshima, Japan.