Language Generation in the DAESO project
Language Generation in the DAESO project
Erwin Marsi & Emiel Krahmer
Communicatie- en Informatiewetenschappen
13 oct 2008

Overview
• Part I: The DAESO project
• Part II: Language generation

Part I: DAESO
Detecting And Exploiting Semantic Overlap

Erwin Marsi and Emiel Krahmer, Annotating a parallel monolingual treebank with semantic similarity relations. In: The Sixth International Workshop on Treebanks and Linguistic Theories (TLT'07), Bergen, Norway, December 7-8, 2007.
Erwin Marsi and Emiel Krahmer, Annotating a parallel monolingual treebank with semantic similarity relations. In: Proceedings of CLIN 2008, to appear.

Outline
• Background
  – Semantic overlap
  – Motivation/Applications
  – The DAESO project
• Corpus
  – Text material
  – Processing
  – Annotation tools
• Current status & future work

Semantic overlap by example
• Steve Irwin, the TV host known as the "Crocodile Hunter," has died after being stung by a stingray off Australia's north coast. [www.cnn.com]
• Steve Irwin, the daredevil wildlife documentarian, is killed in a stingray attack while filming on the Great Barrier Reef. [www.time.com]

Semantic overlap: X equals Y / X restates Y / X generalizes Y (Y specifies X) / X intersects Y
(Each of these relations was illustrated on the same pair of Steve Irwin sentences above, highlighting different fragments.)

Lexical and phrasal similarity
Lexical similarity relations:
1. Equal
2. Synonym
3. Hyponym
4. Hyperonym
5. Overlap (cf. shared information, Lowest Common Ancestor)
Phrasal similarity relations:
1. Equals
2. Restates
3. Specifies
4. Generalizes
5. Intersects

Other assumptions
Similarity relations are
• mutually exclusive
• based on common sense (cf. WordNet, RTE)
• defined between strings (theory-neutral)
• transferable to typed alignment relations between syntactic nodes...
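These inventories, together with the last assumption above, can be captured as typed alignment relations between strings (or the strings dominated by syntactic nodes). A minimal sketch in Python; the class and attribute names are illustrative, not DAESO's actual data model, and the relation label in the example is likewise only illustrative:

```python
from dataclasses import dataclass
from enum import Enum


class LexicalRelation(Enum):
    EQUAL = "equal"
    SYNONYM = "synonym"
    HYPONYM = "hyponym"
    HYPERONYM = "hyperonym"
    OVERLAP = "overlap"


class PhrasalRelation(Enum):
    EQUALS = "equals"
    RESTATES = "restates"
    SPECIFIES = "specifies"
    GENERALIZES = "generalizes"
    INTERSECTS = "intersects"


@dataclass
class Alignment:
    """A typed, mutually exclusive similarity relation between two strings
    (here: the strings dominated by two aligned syntactic nodes)."""
    source: str
    target: str
    relation: PhrasalRelation


# Illustrative label only; in the corpus the relation is chosen per aligned pair.
a = Alignment(
    source='the TV host known as the "Crocodile Hunter"',
    target="the daredevil wildlife documentarian",
    relation=PhrasalRelation.INTERSECTS,
)
print(a.relation.value)
```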
Semantic similarity relations over syntactic nodes
• Similarity relations can be defined over the strings corresponding to syntactic nodes
[Schematic: two syntactic trees whose nodes dominate the words w1 w2 w3 and w1' w2' w3', with an alignment labelled "restates" between node N3 and node N2']
STR(N3) = "w2 w3" restates "w1' w2' w3'" = STR(N2')

Motivation for detecting semantic overlap
• Automatically detecting semantic overlap
  – is semantics from a practical and theory-neutral perspective
  – is a basic semantic NLP task like WSD and RTE
  – has many potential applications like
    • Multi-document summarization
    • Combining answers in Question-Answering systems
    • Intelligent document merging
    • Detection of plagiarism
    • Information Extraction (IE)
    • Recognizing Textual Entailment (RTE)
    • Anaphora resolution
    • Ranking parse trees

The DAESO project
• DAESO: Detecting And Exploiting Semantic Overlap
• Funded by the Dutch Stevin programme
• Duration: October 2006 until October 2009
• Participants:
  – Emiel Krahmer, Erwin Marsi - Tilburg University
  – Walter Daelemans, Iris Hendrickx - Antwerp University
  – Maarten de Rijke, Erik TKS - University of Amsterdam
  – Jakub Zavrel - Textkernel
• http://daeso.uvt.nl/

Goals of the DAESO project
1. Development of a 1M-word corpus of parallel/comparable Dutch text aligned at the level of sentences, phrases and words, plus tools and guidelines for manual annotation
2. Based on this, development of tools for automatic alignment, detection of semantic overlap, paraphrase extraction and sentence fusion
3. Application of these tools in a number of NLP applications, including multi-document summarization, QA and IE

Corpus: text material
• Text material ranges from parallel to comparable Dutch text
• The composition results from a compromise between:
  – covering different text styles;
  – availability in electronic format;
  – targeted applications;
  – copyright constraints

Corpus: text material for manual annotation
(listed from high to low degree of semantic overlap)

  Material                       #words
  Book translations              125k
  Autocue-subtitle pairs         125k
  News headlines                 24k
  Press releases                 225k
  Answers from QA                1k
  Subtotal manually annotated    500k

Corpus: automatic annotation

  Material                          #words
  Subtotal manually annotated       500k
  Subtotal automatically annotated  500k
  Total corpus size                 1M

Corpus material: book translations
• Ideally: alternative translations (in Dutch) of modern books
• In practice: very hard to find (impossible?)
• We used modern translations of
  – "Le petit prince", by Antoine de Saint-Exupéry (integral)
  – "Les Essais" by Michel de Montaigne (partial)
  – "On the origin of species" by Charles Darwin (partial)
Example (Darwin, ch. 1), two alternative translations of the same sentence:
  Ik denk ook dat de hypothese van Andrew Knight , dat deze verschillen gedeeltelijk te danken zijn aan een overvloed van voedsel , best waarschijnlijk is .
  'I also think that Andrew Knight's hypothesis, that these differences are partly due to an abundance of food, is quite probable.'
  Ook is er iets te zeggen voor het standpunt van Andrew Knight dat deze variabiliteit deels toe te schrijven is aan een overdaad aan voedsel .
  'There is also something to be said for Andrew Knight's view that this variability is partly attributable to an excess of food.'
Corpus material: autocue-subtitles
• Autocue-subtitle material from Dutch broadcast news
• Aligned at sentence level in the ATRANOS project
• Typically the subtitle is a compressed version of the autocue
Example (NOS Journaal 01-04-1999):
  Autocue: Sterker nog , op de geldmarkt liet de euro zich meteen gelden . De debutant won terrein ten opzichte van de routinier , de dollar .
  'What is more, on the money market the euro immediately made itself felt. The newcomer gained ground on the veteran, the dollar.'
  Subtitle: Sterker nog : de euro liet zich meteen gelden en won terrein van de routinier , de dollar .
  'What is more: the euro immediately made itself felt and gained ground on the veteran, the dollar.'

Corpus material: news headlines
• Mined from the Google News Dutch RSS feed (>900k words)
• News articles are clustered on their content, so headlines can be quite different
• Hence, we performed manual subclustering
• Skipping very long clusters
Example cluster:
  Contact met manager slecht  'Contact with manager poor'
  Manager weet niets van werkvloer  'Manager knows nothing about the shop floor'
  Managers hebben geen benul van de werkvloer  'Managers have no clue about the shop floor'
  Chefs hebben geen benul van werkvloer  'Bosses have no clue about the shop floor'
  Meeste werknemers balen van hun bazen  'Most employees are fed up with their bosses'
  Manager te weinig op werkvloer  'Manager too rarely on the shop floor'

Corpus material: press releases
• Press releases from news feeds of two Dutch press agencies
• Automatic alignment of articles, aiming for high recall, at the expense of precision
• Subsequent manual correction
Example of comparable sentences:
  Slachtoffers van de legionella-uitbraak veroorzaakte door een koeltoren bij het Post CS-gebouw in Amsterdam krijgen een schadevergoeding van opdrachtgever Oosterdokseiland Ontwikkeling Amsterdam ( OOACV ) .
  'Victims of the legionella outbreak caused by a cooling tower at the Post CS building in Amsterdam will receive compensation from developer Oosterdokseiland Ontwikkeling Amsterdam (OOACV).'
  Enkele slachtoffers van de legionella-uitbraak bij het Post CS-gebouw in Amsterdam vorige zomer kunnen een schadevergoeding tegemoet zien .
  'Some victims of last summer's legionella outbreak at the Post CS building in Amsterdam can expect compensation.'

Corpus material: answers from QA
• Target: alternative/similar candidate answers in the output of Question Answering (QA) engines
• Not answers to "closed" questions
  – e.g. "Who invented the telephone?"
• But answers to "open" questions
  – e.g. "What are the advantages of using open source software?"
• However, hard to find this type of material for Dutch
• We use answers from the QA reference corpus (IMIX project)
  – medical domain
  – manually extracted from all available text material
  – for 100 questions
  – many have only a single answer, or no answer at all

Answers from QA (cont'd)
Answers to the question "Waardoor worden aften veroorzaakt?" ('What causes aphthae (mouth ulcers)?'):
  De oorzaak is niet bekend , maar stress lijkt een rol te spelen ; zo kan iemand aften krijgen in de eindexamenperiode .
  'The cause is not known, but stress seems to play a role; for instance, someone may get aphthae during the final-exam period.'
  De oorzaak is nog niet opgehelderd en er is ook nog geen afdoende behandeling van aften bekend .
  'The cause has not yet been clarified, and no adequate treatment for aphthae is known either.'
  Vooral als de weerstand afneemt door ziekte , stress of vermoeidheid kunnen er aften ontstaan . Ook ontstaan er bij vrouwen meer aften rondom de menstruatie .
  'Aphthae mainly develop when resistance is lowered by illness, stress or fatigue. Women also develop more aphthae around menstruation.'
  Levensmiddelen die aften kunnen veroorzaken , zijn bijvoorbeeld grapefruit en kiwi's .
  'Foods that can cause aphthae include, for example, grapefruit and kiwis.'

Conversion
• Conversion from original format (PDF, MS Word, HTML, OCR'ed text) to XML
  – TEI Lite for books
  – Custom XML formats in other cases
• (Don't ask me any questions about character encodings...)

Tokenization & parsing
• Tokenization
  – includes sentence splitting
  – with the D-COI tokenizer (Reynaert '07)
  – errors manually corrected for book translations
• Parsing
  – all sentences parsed with the Alpino parser (van Noord et al.)
  – CGN XML format
  – parsing errors were not manually corrected
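To give an idea of how the parsed material can be consumed downstream, here is a small self-contained sketch that extracts dependency triples and node string yields from a simplified, Alpino-like dependency tree. The inline XML and its attribute names are a stand-in for illustration only; the actual Alpino/CGN XML format differs in detail.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for an Alpino-style dependency tree: nested <node>
# elements carrying a dependency relation (rel) and, on leaves, a word.
SAMPLE = """
<node rel="top" id="0">
  <node rel="su" id="1" word="euro"/>
  <node rel="hd" id="2" word="won"/>
  <node rel="obj1" id="3" word="terrein"/>
</node>
"""

def triples(node, head_word=None):
    """Yield (head, relation, dependent) triples from the toy tree."""
    children = list(node)
    # crude head choice: the child labelled 'hd', if any, else inherit
    head = next((c.get("word") for c in children if c.get("rel") == "hd"),
                head_word)
    for child in children:
        word = child.get("word")
        if word is not None and child.get("rel") != "hd":
            yield (head, child.get("rel"), word)
        yield from triples(child, head)

def yield_string(node):
    """Concatenate the words dominated by a node (its string yield)."""
    words = [n.get("word") for n in node.iter("node") if n.get("word")]
    return " ".join(words)

tree = ET.fromstring(SAMPLE)
print(list(triples(tree)))   # [('won', 'su', 'euro'), ('won', 'obj1', 'terrein')]
print(yield_string(tree))    # 'euro won terrein'
```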
Sentence alignment
• No sentence alignment required for
  – autocue-subtitle pairs
  – answers from QA
  – news headlines
• Alignment was required for
  – book translations (= parallel text)
  – press releases (= comparable text)
• Procedure:
  1. automatic alignment aiming at high recall (low precision)
  2. manual correction with a special annotation tool

Sentence alignment: book translations
• Task is clear:
  – in the case of parallel text, correct alignment is undisputed
• Normally one-to-one, without crossing alignments
• Many algorithms and implementations [Gale & Church '93]
• Automatic alignment
  – doing multiple passes helps
  – simple word overlap measures obtain high accuracy (>95%)

Sentence alignment: press releases
• Task is not so clear-cut:
  – Correct alignment is not always evident for comparable text (cf. TLT and CLIN papers)
• Crossing alignments are the rule; many-to-many alignments no exception
• Manual alignment
  – Small pilot experiments on inter-annotator agreement show reasonable precision but low recall
  – Finding all similar sentences in two texts is a difficult task for humans

Sentence alignment: press releases (cont'd)
• Automatic alignment
  – Algorithms for parallel text alignment do not work for comparable text [Nelken & Shieber '06]
  – Better algorithms for comparable text alignment must be developed [Barzilay & Elhadad '03, Nelken & Shieber '06]
  – So far, we have only tried simple methods relying on shallow textual features
  – Best results with tf*idf weighted cosine similarity on normalized words: F-scores around 0.6
  – However, not good enough to ease manual annotation
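A minimal sketch of the tf*idf weighted cosine approach mentioned above for proposing candidate sentence pairs across two comparable texts. Normalization is reduced here to lowercasing and whitespace tokenization, and the threshold value is an arbitrary placeholder rather than a tuned setting:

```python
import math
from collections import Counter

def tfidf_vectors(tokenized):
    """Turn tokenized sentences into tf*idf weighted bag-of-words vectors."""
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    dot = sum(wt * v[w] for w, wt in u.items() if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def align(sents_a, sents_b, threshold=0.3):
    """Propose (i, j, similarity) for every cross-document sentence pair whose
    tf*idf weighted cosine similarity reaches the threshold (favouring recall)."""
    toks_a = [s.lower().split() for s in sents_a]
    toks_b = [s.lower().split() for s in sents_b]
    vectors = tfidf_vectors(toks_a + toks_b)
    vecs_a, vecs_b = vectors[:len(toks_a)], vectors[len(toks_a):]
    pairs = []
    for i, u in enumerate(vecs_a):
        for j, v in enumerate(vecs_b):
            sim = cosine(u, v)
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs
```

Proposed pairs err on the side of recall, so annotators mainly have to remove false positives during manual correction.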
Sentence alignment tool: Hitaext
• Graphical tool for manually aligning two marked-up texts
• Input: two arbitrary (well-formed) XML documents
• Output: alignment between XML elements in a simple format
• Supports
  – alignment at different text levels
  – many-to-many alignments
• Implementation
  – GUI in wxPython (Python + wxWidgets)
  – cross-platform (Mac OS X, GNU/Linux, MS Windows)
  – released as open source software

Hitaext (cont'd)
[screenshot of the tool]

Tree alignment tool: Algraeph
• Graphical tool for aligning nodes from a pair of graphs
• Generalized reimplementation of Gadget
  – XML input/output
  – from Alpino trees to general graphs in GraphML
  – arbitrary sets of alignment relations
  – options to tweak rendering of large trees (e.g. collapse/expand parts of the tree)
• Implementation
  – relies on Graphviz' Dot to render graphs
  – GUI in wxPython (Python + wxWidgets)
  – cross-platform (Mac OS X, GNU/Linux, MS Windows)
  – released as open source software

Algraeph (cont'd)
[screenshot of the tool]

Current state of affairs
• Corpus
  – first 500k words in electronic format, tokenized, parsed and aligned at the sentence level, in (TEI) XML format
  – GUI tools for (correcting) sentence and tree alignment
  – tree alignment nearly finished
  – final evaluation of inter-annotator agreement nearly finished
• Evaluation experiments show that users prefer fused answers in a QA context (ACL08)

Current state of affairs (cont'd)
Working on
– automatic sentence alignment for comparable text
  • using only shallow textual features
  • using parse trees and lexical resources as well
– automatic tree alignment [Marsi & Krahmer '05]
  • using Cornetto (pycornetto interface)
– multi-document summarization for news texts on the basis of MEAD (Iris Hendrickx)
– complementary multi-document summarization corpus of texts, extracts and summaries (Iris Hendrickx)

Future work
• Development of
  – sentence fusion module (alignment + merging + generation) [Marsi & Krahmer '05]
  – paraphrase extraction techniques
• Application in
  – Multi-document summarization beyond extracts
  – QA
  – IE

Part II: Language Generation

Erwin Marsi and Emiel Krahmer, Explorations in Sentence Fusion. In: Proceedings of the 10th European Workshop on Natural Language Generation, Aberdeen, GB, 8-10 August 2005.

Outline
• Context
  – Multi-document summarization
  – Sentence fusion
• Experiment
  – Text material
  – Generation procedure
  – Evaluation
• Discussion

Context: multi-document summarization
Given a set of similar documents
1. Rank sentences within each document on importance
2. Align similar sentences in different documents
3. Create an:
  – Extract: keep the most important sentences from all documents, using alignments to avoid redundancy
  – Abstract: keep the most important sentences from all documents, using the alignments to fuse sentences into new sentences which combine their information [Barzilay & McKeown]
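A skeletal rendering of the extract variant just described; the importance scorer and the alignment predicate are assumed to be given (both are placeholders, not an actual MEAD-style implementation):

```python
def build_extract(documents, score, aligned, size=5):
    """documents: list of documents, each a list of sentences (strings).
    score(sentence) -> importance value.
    aligned(s1, s2) -> True when the two sentences were aligned as
    conveying largely the same content."""
    ranked = sorted((s for doc in documents for s in doc), key=score, reverse=True)
    extract = []
    for sentence in ranked:
        # use the alignments to avoid redundancy across documents
        if not any(aligned(sentence, kept) for kept in extract):
            extract.append(sentence)
        if len(extract) == size:
            break
    return extract
```

The abstract variant replaces the redundancy check by fusing the aligned sentences instead, which is the topic of the rest of this part.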
Context: sentence fusion
• Sentence fusion (Barzilay 2003): given two sentences which share a certain amount of information, generate a new sentence which conveys information from both
• Other application: fusing similar answers in QA
• We are interested in particular types of fusion:
  – generalization (intersection fusion)
  – specification (union fusion)

Intersection fusion
Input sentences:
1. Steve Irwin, the TV host known as the "Crocodile Hunter", has died after being stung by a stingray off Australia's north coast.
2. Steve Irwin, the daredevil wildlife documentarian, is killed in a stingray attack while filming on the Great Barrier Reef.
Intersection fusion := combine the most general information shared by both sentences (without introducing redundancy) [Marsi & Krahmer '05]
Example: Steve Irwin is killed in a stingray attack off Australia's north coast.

Union fusion
Input sentences:
1. Steve Irwin, the TV host known as the "Crocodile Hunter", has died after being stung by a stingray off Australia's north coast.
2. Steve Irwin, the daredevil wildlife documentarian, is killed in a stingray attack while filming on the Great Barrier Reef.
Union fusion := combine the most specific information from either sentence (without introducing redundancy) [Marsi & Krahmer '05]
Example: Steve Irwin, the TV host known as the "Crocodile Hunter" and daredevil wildlife documentarian, has died after being stung by a stingray while filming on the Great Barrier Reef.

Sentence fusion: decomposition
1st sentence + 2nd sentence
  → Analysis (analyzing & aligning sentences)
  → Merge (merging sentences)
  → Generation (generating a new sentence)
  → Fused sentence

Merge & Generation
• Abstract from the analysis task
  – Assume we have a labeled alignment between two dependency trees for similar sentences
• Goal: given two aligned trees, generate new sentences
• Focus on the subtasks of:
  1. Merging: deciding which information from either sentence should be included
  2. Generation: producing a grammatically correct surface form
• We explore a simple string-based approach

Text material
• At the time, there was no monolingual treebank of parallel/comparable Dutch text available
• We created a parallel treebank out of two Dutch translations of "Le petit prince" (The Little Prince)

                        #Chapters  #Sentences  #Words
  Altena translation    5          290         3600
  Beaufort translation  5          277         3358

Preprocessing
• Tokenization and sentence splitting (manual)
• Parsing with Alpino (dependency triples)
• Sentence alignment
• Tree alignment
  – Consensus alignment by two annotators
  – High agreement on node alignment (F-score = .99)
  – High agreement on alignment relations (F-score = .97, Kappa = .96)

Merging
• Simplistic idea
  – Every node n in the tree corresponds to a substring str(n) of the sentence
  – If two nodes n and m are aligned, we can substitute str(m) as an alternative realization for str(n)
  – The semantic similarity relation between n and m determines whether the result is a restatement, a specification or a generalization

Merge: restatement procedure
  for each aligned node pair (n, m):
    if relation(n, m) = restates, then str(n) ← str(n) ∨ str(m)

Merge: specification procedure
  for each aligned node pair (n, m):
    if relation(n, m) = generalizes, then str(n) ← str(n) ∨ str(m)

Merge: generalization procedure
  for each aligned node pair (n, m):
    if relation(n, m) = specifies, then str(n) ← str(n) ∨ str(m)
  for each node n:
    if dep-rel(n) ∈ {head/mod, head/predm, ...}, then str(n) ← str(n) ∨ NULL
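A minimal executable rendering of the three procedures above. Nodes are reduced to their string yields, and every node accumulates a set of alternative realizations; the class and function names are illustrative, not the actual DAESO code:

```python
from dataclasses import dataclass, field

NULL = ""  # an empty realization: the node's material may be dropped

@dataclass
class Node:
    yield_str: str                 # str(n): the substring the node covers
    dep_rel: str = ""              # dependency relation, e.g. "head/mod"
    alternatives: set = field(default_factory=set)

    def __post_init__(self):
        self.alternatives.add(self.yield_str)

def merge(alignments, mode):
    """alignments: (n, m, relation) triples between nodes of tree 1 and tree 2.
    mode: 'restatement', 'specification' or 'generalization'."""
    trigger = {"restatement": "restates",
               "specification": "generalizes",
               "generalization": "specifies"}[mode]
    for n, m, relation in alignments:
        if relation == trigger:
            n.alternatives.add(m.yield_str)      # str(n) <- str(n) v str(m)

def allow_deletions(nodes, deletable=("head/mod", "head/predm")):
    """Extra rule for generalization: optional modifiers may be left out."""
    for n in nodes:
        if n.dep_rel in deletable:
            n.alternatives.add(NULL)             # str(n) <- str(n) v NULL

# Toy usage: one aligned pair from the Little Prince example sentences.
n = Node("zo")                                   # node in tree 1
m = Node("op die manier")                        # aligned node in tree 2
merge([(n, m, "restates")], "restatement")
print(n.alternatives)                            # contains 'zo' and 'op die manier'
```

Choosing one alternative per node then yields the candidate realizations that the generation step ranks.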
Generation
Following work on statistical NLG (Langkilde & Knight), the generation process involves two steps:
1. Generate: overgenerate candidates for surface realization
2. Rank: rank candidates by means of a (statistical) language model

Candidate generation
• Traverse the dependency tree, expanding the list of alternative surface realizations for each node that has multiple realizations
• Avoid:
  – Realizations identical to one of the input sentences
  – Multiple instances of the same realization

Examples of "good" output
Input 1: Zo heb ik in de loop van mijn leven heel veel contacten gehad met heel veel serieuze personen
  'Thus have I in the course of my life very many contacts had with very many serious persons'
Input 2: Op die manier kwam ik in het leven met massa's gewichtige mensen in aanraking
  'In that way came I in the life with masses-of weighty/important people in touch'
Restatement: op die manier heb ik in de loop van mijn leven heel veel contacten gehad met heel veel serieuze personen
  'in that way have I in the course of my life very many contacts had with very many serious persons'
Specification: op die manier kwam ik in de loop van mijn leven met massa's gewichtige mensen in aanraking
  'in that way came I in the course of my life with masses-of weighty/important people in touch'
Generalization: zo heb ik in het leven veel contacten gehad met veel serieuze personen
  'thus have I in the life many contacts had with many serious persons'

Examples of "bad" output
Input 1: En zo heb ik op mijn zesde jaar een prachtige loopbaan als kunstschilder laten varen .
  'And so have I at my sixth year a wonderful career as art-painter let sail'
Input 2: Zo kwam het , dat ik op zesjarige leeftijd een schitterende schildersloopbaan liet varen .
  'Thus came it , that I at six-year age a bright painter-career let sail'
Specification: en zo heb ik op mijn zesde jaar als kunstschilder laten een schitterende schildersloopbaan varen
  'and so have I at my sixth year as art-painter let a bright painter-career sail'
Generalization: zo kwam het dat ik op leeftijd een prachtige loopbaan liet varen
  'so came it that I at age a wonderful career let sail'

Ranking with n-gram language model
• CMU-Cambridge Statistical Language Modeling Toolkit v2
  – >250M words from the Twente News Corpus
  – using back-off and Good-Turing smoothing
• Ranked sentences in order of increasing entropy
• However: looking at the ranking, it was clear that it made no sense...
• Discarded (for the time being)

Evaluation
• Goal: get a rough indication of the output quality
• Material: all 433 variants generated for the sentences from Chapter 1
• Quality judged by the authors (J1, J2)
• Categories:
  – Perfect: no problems in either semantics or syntax
  – Acceptable: understandable, but with some minor flaws in semantics or grammar
  – Nonsense: serious problems in semantics or grammar

Evaluation: results
[Table of judgment counts per category for J1 and J2]
Kappa = .72
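For reference, an agreement figure like the Kappa = .72 just reported can be computed from the two judges' category labels as sketched below; the judgment lists in the example call are toy values for illustration only, not the actual data.

```python
from collections import Counter

def cohens_kappa(labels1, labels2):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the agreement expected by chance."""
    assert len(labels1) == len(labels2)
    n = len(labels1)
    p_o = sum(a == b for a, b in zip(labels1, labels2)) / n
    c1, c2 = Counter(labels1), Counter(labels2)
    p_e = sum((c1[c] / n) * (c2[c] / n) for c in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e)

# Toy example only (not the real judgments): two judges, five variants.
j1 = ["Perfect", "Acceptable", "Nonsense", "Perfect", "Acceptable"]
j2 = ["Perfect", "Acceptable", "Nonsense", "Acceptable", "Acceptable"]
print(round(cohens_kappa(j1, j2), 2))   # -> 0.69
```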
Discussion
• Output is not too bad for restatements
  – Will this work for comparable text like news texts or answers in QA?
• Generation method is too simplistic
  – Frequent violations of word order and agreement constraints
  – Rich linguistic info (POS, categories, dependencies) is ignored

Discussion (cont'd)
• Ranking with n-gram language model failed
  – Are the grammatical errors typically long-distance (>3 words)?
  – As syntax trees are available, tree-based stochastic models (Bangalore & Rambow) might work better
• Alternatively
  – Merge at a more abstract level of semantic representation (e.g. case frames)
  – Surface realization from a semantic representation
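To round off, a compact sketch of the overgenerate-and-rank procedure described in this part: candidates are produced by choosing one alternative realization per node, copies of the input sentences and duplicates are filtered out, and candidates are ranked by increasing entropy under a word n-gram model. Everything here (the bigram model, the add-one smoothing, the toy slots and training sentences) is an illustrative stand-in, not the CMU-Cambridge toolkit setup used in the experiments.

```python
import itertools
import math
from collections import Counter

def candidates(slots, inputs):
    """Expand one alternative per slot (node yield), in sentence order,
    skipping realizations identical to an input sentence and duplicates."""
    seen, out = set(inputs), []
    for choice in itertools.product(*slots):
        sentence = " ".join(w for w in choice if w)   # drop NULL (empty) slots
        if sentence not in seen:
            seen.add(sentence)
            out.append(sentence)
    return out

class Bigram:
    """Tiny add-one smoothed bigram model; candidates are ranked by
    increasing entropy (average negative log-probability per word)."""
    def __init__(self, corpus):
        self.uni, self.bi = Counter(), Counter()
        for sentence in corpus:
            tokens = ["<s>", *sentence.split(), "</s>"]
            self.uni.update(tokens)
            self.bi.update(zip(tokens, tokens[1:]))
        self.vocab = len(self.uni)

    def entropy(self, sentence):
        tokens = ["<s>", *sentence.split(), "</s>"]
        logprob = sum(math.log((self.bi[(a, b)] + 1) / (self.uni[a] + self.vocab))
                      for a, b in zip(tokens, tokens[1:]))
        return -logprob / (len(tokens) - 1)

# Toy slots loosely based on the restatement example above; toy training corpus.
slots = [["zo", "op die manier"], ["heb ik"],
         ["in de loop van mijn leven", "in het leven"],
         ["heel veel contacten gehad"]]
inputs = {"zo heb ik in de loop van mijn leven heel veel contacten gehad"}
lm = Bigram(["op die manier heb ik contacten gehad",
             "zo heb ik heel veel contacten gehad"])
for sentence in sorted(candidates(slots, inputs), key=lm.entropy):
    print(sentence)
```

In the experiments the ranking came from a back-off trigram model trained on more than 250M words of the Twente News Corpus and, as discussed above, it did not produce a useful ordering; the sketch only shows the mechanics.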