Slides(from Kiril Simov)
Transcription
Slides(from Kiril Simov)
Linguistically-Augmented Bulgarian-to-English Statistical Machine Translation Model Rui Wang DFKI GmbH Germany Petya Osenova and Kiril Simov IICT-BAS Bulgaria EACL 2012 1 Main Idea • ‘Hybrid’ System – SMT as backbone – Deep linguistic features as factors • Bulgarian to English – Limited previous work • Linguistically-Augmented Model – Preprocessing – Deep grammar knowledge EACL 2012 2 Hybrid Systems • Using an SMT to post-edit RBMT • Selecting the best translations from SMT /RBMT • Selecting the best phrases/words from SMT /RBMT • Ours: Feed SMT with ‘RBMT’ features EACL 2012 3 Data Preparation • We are using SETIMES Dataset of OPUS corpus – automatic sentence alignment gives more than 25 % errors • Improving the tokenization of the Bulgarian part • Correcting and removing the suspicious sentence alignments EACL 2012 4 Linguistic Preprocessing • POS Tagging – 97.98 % accuracy (Georgiev, Zhikov, Osenova, Simov, and Nakov. Feature-Rich Part-of-speech Tagging for Morphologically Complex Languages: Application to Bulgarian. Poster at the main conference) • Lemmatization – 95.23 % accuracy • Dependency Parsing – 85.6 % labeled parsing accuracy EACL 2012 5 Factored Model • Koehn and Hoang, 2007 – Easily incorporate linguistic features at the token level – Similar to ‘supertags’ • WF, Lemma, POS, Ling • DepRel, HLemma, HPOS • MRS features EACL 2012 6 Example Spored odita v elektricheskite kompanii politicite zloupotrebyavat s dyrzhavnite predpriyatiya. Electricity audits prove politicians abusing public companies. EACL 2012 7 MRS Supertagging • Minimal Recursion Semantics is a underspecified semantics designed for HPSG • Adapted to other grammar formalisms “Every dog chases some white cat.” <h0, {h1: every(x,h2,h3), h2: dog(x), h4: chase(x, y), h5: some(y,h6,h7), h6: white(y), h6: cat(y)}, {}> EACL 2012 8 Rules for RMRS Two types: – <Lemma, MSTag> -> EP-RMRS The rules of this type produce an RMRS including an elementary predicate – <DRMRS, Rel, HRMRS> -> HRMRS' The rules of this type unite the RMRS constructed for a dependent node (DRMRS) into the current RMRS for a head node (HRMRS) EACL 2012 9 Experiments • We run GIZA++ (Och and Ney, 2003) for bidirectional word alignment, and then obtain the lexical translation table and phrase table • A tri-gram language model is estimated using the SRILM toolkit (Stolcke, 2002) • Minimum error rate training (MERT) (Och, 2003) is applied to tune the weights for the set of feature weights that maximizes the BLEU score on the development set EACL 2012 10 Corpora • Train/Dev/Test • SETIMES – 150,000/500/1,000 • EMEA – 700,000/500/1,000 • JRC-Acquis – 0/0/4,107 EACL 2012 11 Results (Non-Factored Model) EACL 2012 12 Factored Model EACL 2012 13 Manual Evaluation • Motivation – BLEU score in high range is not differentiable – Impacts from various linguistic knowledge • Evaluation metrics – Grammaticality – Content EACL 2012 14 Manual Evaluation – Grammaticality 1. The translation is not understandable. 2. The evaluator can somehow guess the meaning, but cannot fully understand the whole text. 3. The translation is understandable, but with some efforts. 4. The translation is quite fluent with some mi- nor mistakes or re-ordering of the words. 5. The translation is perfectly readable and grammatical. EACL 2012 15 Manual Evaluation – Content 1. The translation is totally different from the reference. 2. About 20% of the content is translated, missing the major content/topic. 3. About 50% of the content is translated, with some missing parts. 4. About 80% of the content is translated, missing only minor things. 5. All the content is translated. EACL 2012 16 Grammaticality EACL 2012 17 Content EACL 2012 18 Related Work • Birch et al. (2007) and Hassan et al. (2007) – Supertags on English side • Singh and Bandyopadhyay (2010) – Manipuri-English bidirectional translation • Bond et al. (2005), Oepen et al. (2007), Graham and van Genabith (2008), and Graham et al. (2009) – Transfer-based MT EACL 2012 19 Conclusion • We present a factored model for Bulgarian English MT using linguistic factors • The manual evaluation shows improvement of the translation using linguistic factors EACL 2012 20 Future Work • The MRSes are not fully explored yet, since we have only considered the EP and EOV features • We would like to add factors on the target language side (English) as well • The guideline of the manual evaluation needs further refinement (Farreu ́s et al., 2011) • We also need more experiments to evaluate the robustness of our approach in terms of different datasets EACL 2012 21 Acknowledgements • EuroMatrixPlus (IST-231720) • Tania Avgustinova for fruitful discussions and her helpful linguistic analysis • Laska Laskova, Stanislava Kancheva and Ivaylo Radev for doing the human evaluation of the data EACL 2012 22 Thank You! EACL 2012 23