Slides (from Kiril Simov)

Linguistically-Augmented
Bulgarian-to-English
Statistical Machine Translation Model
Rui Wang
DFKI GmbH
Germany
Petya Osenova and Kiril Simov
IICT-BAS
Bulgaria
EACL 2012
Main Idea
•  ‘Hybrid’ System
–  SMT as backbone
–  Deep linguistic features as factors
•  Bulgarian to English
–  Limited previous work
•  Linguistically-Augmented Model
–  Preprocessing
–  Deep grammar knowledge
Hybrid Systems
•  Using an SMT to post-edit RBMT
•  Selecting the best translations from SMT/RBMT
•  Selecting the best phrases/words from SMT/RBMT
•  Ours: Feed SMT with ‘RBMT’ features
Data Preparation
•  We use the SETIMES dataset from the OPUS
corpus; its automatic sentence alignment has
an error rate above 25%
•  Improving the tokenization of the Bulgarian
part
•  Correcting and removing the suspicious
sentence alignments
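The slides do not say how suspicious alignments were detected. As a minimal sketch, assuming a simple token-count ratio heuristic (the function names and the `max_ratio` threshold are hypothetical, not taken from the talk), filtering could look like this:

```python
def is_suspicious(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """Flag a sentence pair whose token counts differ by more than max_ratio.

    Assumption: whitespace tokenization and a fixed ratio threshold; the
    authors' actual criteria are not given in the slides.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        # An empty side can never be a valid alignment.
        return True
    return max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio


def filter_pairs(pairs, max_ratio: float = 2.0):
    """Keep only the sentence pairs that pass the length-ratio check."""
    return [(s, t) for s, t in pairs if not is_suspicious(s, t, max_ratio)]
```

A ratio filter only catches gross misalignments; pairs of similar length but unrelated content would still need manual correction or a lexical check.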
Linguistic Preprocessing
•  POS Tagging – 97.98 % accuracy
(Georgiev, Zhikov, Osenova, Simov, and Nakov. Feature-Rich Part-of-speech
Tagging for Morphologically Complex Languages: Application to Bulgarian.
Poster at the main conference)
•  Lemmatization – 95.23 % accuracy
•  Dependency Parsing – 85.6 % labeled parsing
accuracy
Factored Model
•  Koehn and Hoang, 2007
–  Easily incorporate linguistic features at the token
level
–  Similar to ‘supertags’
•  WF, Lemma, POS, Ling
•  DepRel, HLemma, HPOS
•  MRS features
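In a factored model each token carries several annotations, which Moses reads as factors joined by `|`. The sketch below shows one way to serialize such tokens. The `Token` class, the helper `to_factored`, and the lemma, tag and relation values are illustrative assumptions rather than the authors' actual tagsets; the Ling and MRS factors are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Token:
    wf: str      # word form
    lemma: str   # lemma
    pos: str     # part-of-speech tag
    deprel: str  # dependency relation to the head
    hlemma: str  # lemma of the head word
    hpos: str    # POS tag of the head word


def to_factored(tokens, factors=("wf", "lemma", "pos")):
    """Serialize tokens as space-separated, '|'-joined factor strings."""
    return " ".join(
        "|".join(getattr(tok, name) for name in factors) for tok in tokens
    )


# Two words from the example sentence; all annotation values are made up.
tokens = [
    Token("politicite", "politik", "N", "subj", "zloupotrebyavam", "V"),
    Token("zloupotrebyavat", "zloupotrebyavam", "V", "root", "ROOT", "ROOT"),
]
print(to_factored(tokens, ("wf", "lemma", "pos", "deprel")))
# politicite|politik|N|subj zloupotrebyavat|zloupotrebyavam|V|root
```

Choosing which factors to pass in `factors` corresponds to choosing a model configuration, so different factor combinations can be compared without re-annotating the corpus.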
Example
Spored odita v elektricheskite kompanii politicite zloupotrebyavat s
dyrzhavnite predpriyatiya.
Electricity audits prove politicians abusing public companies.
MRS Supertagging
•  Minimal Recursion Semantics (MRS) is an
underspecified semantic formalism designed for HPSG
•  Adapted to other grammar formalisms
“Every dog chases some white cat.”
<h0, {h1: every(x,h2,h3), h2: dog(x), h4: chase(x,y),
h5: some(y,h6,h7), h6: white(y), h6: cat(y)}, {}>
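To make the structure of the example concrete, the sketch below encodes the same MRS as plain Python data: a top handle, a bag of elementary predications (EPs) and an empty list of handle constraints. The class and helper names are hypothetical and do not belong to any particular MRS library.

```python
from collections import defaultdict
from typing import NamedTuple


class EP(NamedTuple):
    label: str   # handle that labels this elementary predication
    pred: str    # predicate name
    args: tuple  # variables and handles taken as arguments


# <h0, {EPs}, {}> for "Every dog chases some white cat."
TOP = "h0"
EPS = [
    EP("h1", "every", ("x", "h2", "h3")),
    EP("h2", "dog", ("x",)),
    EP("h4", "chase", ("x", "y")),
    EP("h5", "some", ("y", "h6", "h7")),
    EP("h6", "white", ("y",)),
    EP("h6", "cat", ("y",)),
]
HCONS = []  # the handle-constraint set is empty in this example


def preds_by_label(eps):
    """Group predicate names by the handle that labels them."""
    groups = defaultdict(list)
    for ep in eps:
        groups[ep.label].append(ep.pred)
    return dict(groups)


def preds_over(eps, var):
    """List the predicates that take the given variable as an argument."""
    return [ep.pred for ep in eps if var in ep.args]


print(preds_by_label(EPS)["h6"])  # ['white', 'cat']
print(preds_over(EPS, "y"))       # ['chase', 'some', 'white', 'cat']
```

The shared label h6 is what ties "white" and "cat" together as one conjoined restriction, while the scope of the two quantifiers is left open.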
Rules for RMRS
Two types:
– <Lemma, MSTag> -> EP-RMRS
The rules of this type produce an RMRS
including an elementary predicate
– <DRMRS, Rel, HRMRS> -> HRMRS'
The rules of this type unite the RMRS
constructed for a dependent node (DRMRS)
into the current RMRS for a head node
(HRMRS)
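The two rule types can be sketched as two functions: one builds an RMRS with a single elementary predicate from a lemma and its morphosyntactic tag, the other merges a dependent's RMRS into its head's. Everything concrete here is an assumption made for illustration: the `RMRS` class, the predicate naming scheme, the relation-to-role table `ROLE_FOR_REL`, and the convention that the first EP is the main one.

```python
from dataclasses import dataclass, field


@dataclass
class RMRS:
    eps: list = field(default_factory=list)   # (label, predicate, variable)
    args: list = field(default_factory=list)  # (label, role, variable)


def lexical_rule(lemma: str, mstag: str, idx: int) -> RMRS:
    """<Lemma, MSTag> -> EP-RMRS: one elementary predicate per token."""
    pred = f"_{lemma}_{mstag[0].lower()}_rel"
    return RMRS(eps=[(f"h{idx}", pred, f"x{idx}")])


# Hypothetical mapping from dependency relations to argument roles.
ROLE_FOR_REL = {"subj": "ARG1", "obj": "ARG2"}


def combine(drmrs: RMRS, rel: str, hrmrs: RMRS) -> RMRS:
    """<DRMRS, Rel, HRMRS> -> HRMRS': unite the dependent with its head."""
    head_label = hrmrs.eps[0][0]  # label of the head's main EP
    dep_var = drmrs.eps[0][2]     # variable of the dependent's main EP
    args = list(hrmrs.args) + list(drmrs.args)
    role = ROLE_FOR_REL.get(rel)
    if role is not None:
        # Link the dependent's variable to an argument slot of the head.
        args.append((head_label, role, dep_var))
    return RMRS(eps=hrmrs.eps + drmrs.eps, args=args)


head = lexical_rule("zloupotrebyavam", "V", 2)
dep = lexical_rule("politik", "N", 1)
print(combine(dep, "subj", head).args)  # [('h2', 'ARG1', 'x1')]
```

Applying `combine` bottom-up over a dependency tree yields one RMRS for the whole sentence, with the head's EPs always listed first.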
Experiments
•  We run GIZA++ (Och and Ney, 2003) for bidirectional word alignment, and then obtain the
lexical translation table and phrase table
•  A tri-gram language model is estimated using the
SRILM toolkit (Stolcke, 2002)
•  Minimum error rate training (MERT) (Och, 2003) is
applied to find the set of feature weights that
maximizes the BLEU score on the development set
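The tuning step can be written down as follows. This is the standard log-linear formulation rather than anything specific to the slides, and the symbols are assumptions introduced here: $h_m$ are the model's feature functions, $\lambda_m$ their weights, $f_s$ and $r_s$ the source sentences and reference translations of the development set.

```latex
% Decoding with a fixed weight vector \lambda:
\hat{e}(f; \lambda) = \arg\max_{e} \sum_{m=1}^{M} \lambda_m \, h_m(e, f)

% MERT picks the weights whose decoder output scores best on the development set:
\hat{\lambda} = \arg\max_{\lambda} \;
  \mathrm{BLEU}\!\left( \{ \hat{e}(f_s; \lambda) \}_{s=1}^{S},\; \{ r_s \}_{s=1}^{S} \right)
```

In a factored setup the phrase table, the lexical translation table and the tri-gram language model each contribute one or more of the $h_m$.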
Corpora
•  Train/Dev/Test
•  SETIMES
–  150,000/500/1,000
•  EMEA
–  700,000/500/1,000
•  JRC-Acquis
–  0/0/4,107
Results (Non-Factored Model)
Results (Factored Model)
Manual Evaluation
•  Motivation
–  BLEU scores in the high range do not discriminate well between systems
–  Assess the impact of the different kinds of linguistic knowledge
•  Evaluation metrics
–  Grammaticality
–  Content
Manual Evaluation – Grammaticality
1.  The translation is not understandable.
2.  The evaluator can somehow guess the meaning, but
cannot fully understand the whole text.
3.  The translation is understandable, but with some
effort.
4.  The translation is quite fluent with some minor
mistakes or re-ordering of the words.
5.  The translation is perfectly readable and
grammatical.
Manual Evaluation – Content
1.  The translation is totally different from the
reference.
2.  About 20% of the content is translated, missing the
major content/topic.
3.  About 50% of the content is translated, with some
missing parts.
4.  About 80% of the content is translated, missing only
minor things.
5.  All the content is translated.
Grammaticality
Content
Related Work
•  Birch et al. (2007) and Hassan et al. (2007)
–  Supertags on English side
•  Singh and Bandyopadhyay (2010)
–  Manipuri-English bidirectional translation
•  Bond et al. (2005), Oepen et al. (2007), Graham
and van Genabith (2008), and Graham et al.
(2009)
–  Transfer-based MT
Conclusion
•  We present a factored model for
Bulgarian-to-English MT using linguistic factors
•  The manual evaluation shows improvement of
the translation using linguistic factors
Future Work
•  The MRSes are not fully explored yet, since we have only
considered the EP and EOV features
•  We would like to add factors on the target language side
(English) as well
•  The guidelines for the manual evaluation need further
refinement (Farrús et al., 2011)
•  We also need more experiments to evaluate the robustness of
our approach in terms of different datasets
Acknowledgements
•  EuroMatrixPlus (IST-231720)
•  Tania Avgustinova for fruitful discussions and
her helpful linguistic analysis
•  Laska Laskova, Stanislava Kancheva and Ivaylo
Radev for doing the human evaluation of the
data
Thank You!