Language Generation in the DAESO project

Transcription

Language Generation in the DAESO project
Erwin Marsi & Emiel Krahmer
Communicatie- en Informatiewetenschappen
Overview
• Part I: The DAESO project
• Part II: Language generation
Part I: DAESO
Detecting And Exploiting Semantic Overlap
Erwin Marsi and Emiel Krahmer, Annotating a parallel monolingual
treebank with semantic similarity relations. In: The Sixth International
Workshop on Treebanks and Linguistic Theories (TLT'07), Bergen,
Norway, December 7-8, 2007.
Erwin Marsi and Emiel Krahmer, Annotating a parallel monolingual
treebank with semantic similarity relations. In: Proceedings of CLIN
2008, to appear.
Outline
• Background
– Semantic overlap
– Motivation/Applications
– The DAESO project
• Corpus
– Text material
– Processing
– Annotation tools
• Current status & future work
Semantic overlap by example
• Steve Irwin, the TV host known as the "Crocodile Hunter,"
has died after being stung by a stingray off Australia's
north coast. [www.cnn.com]
• Steve Irwin, the daredevil wildlife documentarian, is killed
in a stingray attack while filming on the Great Barrier
Reef.
[www.time.com]
Semantic overlap: X equals Y
• Steve Irwin, the TV host known as the "Crocodile
Hunter," has died after being stung by a stingray off
Australia's north coast.
• Steve Irwin, the daredevil wildlife documentarian, is
killed in a stingray attack while filming on the Great
Barrier Reef.
Semantic overlap: X restates Y
• Steve Irwin, the TV host known as the "Crocodile Hunter,"
has died after being stung by a stingray off Australia's
north coast.
• Steve Irwin, the daredevil wildlife documentarian, is
killed in a stingray attack while filming on the Great
Barrier Reef.
Semantic overlap: X generalizes Y (Y specifies X)
• Steve Irwin, the TV host known as the "Crocodile Hunter,"
has died after being stung by a stingray off Australia's
north coast.
• Steve Irwin, the daredevil wildlife documentarian, is killed
in a stingray attack while filming on the Great Barrier
Reef.
Semantic overlap: X intersects Y
• Steve Irwin, the TV host known as the "Crocodile
Hunter," has died after being stung by a stingray off
Australia's north coast.
• Steve Irwin, the daredevil wildlife documentarian, is
killed in a stingray attack while filming on the Great
Barrier Reef.
Lexical and phrasal similarity
Lexical similarity relations:
1. Equal
2. Synonym
3. Hyponym
4. Hyperonym
5. Overlap (cf. shared information, Lowest Common Ancestor)

Phrasal similarity relations:
1. Equals
2. Restates
3. Specifies
4. Generalizes
5. Intersects
Other assumptions
Similarity relations are
• mutually exclusive
• based on common sense (cf. WordNet, RTE)
• defined between strings (theory-neutral)
• transferable to typed alignment relations between
syntactic nodes...
Semantic similarity relations over syntactic nodes
• Similarity relations can be defined over the strings
corresponding to syntactic nodes
[Figure: two aligned dependency trees with nodes N1-N5 and N1'-N2' over words w1-w3 and w1'-w3'; node N3 is aligned to node N2' with the relation "restates"]
STR(N3) = "w2 w3" restates "w1' w2' w3'" = STR(N2')
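To make this concrete, here is a minimal sketch (illustrative only, not the DAESO data format) of how a typed alignment between syntactic nodes could be represented; the node names and relation label are taken from the example above.

```python
from dataclasses import dataclass

# The five phrasal similarity relations used in DAESO.
RELATIONS = {"equals", "restates", "specifies", "generalizes", "intersects"}

@dataclass
class NodeAlignment:
    """A typed alignment between one node in each of two parse trees."""
    source_node: str   # e.g. "N3" in the first tree
    target_node: str   # e.g. "N2'" in the second tree
    relation: str      # one of RELATIONS

    def __post_init__(self):
        if self.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {self.relation}")

# The example above: STR(N3) restates STR(N2').
alignment = NodeAlignment(source_node="N3", target_node="N2'", relation="restates")
print(alignment)
```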
Motivation for detecting semantic overlap
• Automatically detecting semantic overlap
– approaches semantics from a practical and theory-neutral perspective
– is a basic semantic NLP task like WSD and RTE
– has many potential applications like
• Multidocument summarization
• Combining answers in Question-Answering systems
• Intelligent document merging
• Detection of plagiarism
• Information Extraction (IE)
• Recognizing Textual Entailment (RTE)
• Anaphora resolution
• Ranking parse trees
The DAESO project
• DAESO: Detecting And Exploiting Semantic Overlap
• Funded by the Dutch Stevin programme
• Duration: October 2006 until October 2009
• Participants:
– Emiel Krahmer, Erwin Marsi - Tilburg University
– Walter Daelemans, Iris Hendrickx - Antwerp University
– Maarten de Rijke, Erik Tjong Kim Sang - University of Amsterdam
– Jakub Zavrel - Textkernel
• http://daeso.uvt.nl/
Goals of the DAESO project
1. Development of a 1M-word corpus of
parallel/comparable Dutch text aligned at the level of
sentences, phrases and words - plus tools and
guidelines for manual annotation
2. Based on this, development of tools for automatic
alignment, detection of semantic overlap, paraphrase
extraction and sentence fusion
3. Application of these tools in a number of NLP
applications, including multi-document summarization, QA and IE.
Corpus: text material
• Text material ranges from parallel to comparable Dutch
text
• The composition results from a compromise between:
– covering different text styles;
– availability in electronic format;
– targeted applications;
– copyright constraints
Corpus: text material for manual annotation
Material (ordered from high to low degree of semantic overlap) and #words:

  Book translations             125k
  Autocue-subtitle pairs        125k
  News headlines                 24k
  Press releases                225k
  Answers from QA                 1k
  ---------------------------------
  Subtotal manually annotated   500k
Corpus: automatic annotation
Material and #words:

  Book translations                  125k
  Autocue-subtitle pairs             125k
  News headlines                      24k
  Press releases                     225k
  Answers from QA                      1k
  --------------------------------------
  Subtotal manually annotated        500k
  Subtotal automatically annotated   500k
  ======================================
  Total corpus size                    1M
Corpus material: book translations
• Ideally: alternative translations (in Dutch) of modern
books
• In practice: very hard to find (impossible?)
• We used modern translations of
– "Le petit prince", by Antoine de Saint-Exupéry (integral)
– "Les Essais" by Michel de Montaigne (partial)
– "On the origin of species" by Charles Darwin (partial)
Ik denk ook dat de hypothese van Andrew Knight , dat deze verschillen gedeeltelijk te danken zijn aan een overvloed van voedsel , best waarschijnlijk is .
'I also think that Andrew Knight's hypothesis, that these differences are partly due to an excess of food, is quite plausible.'
Ook is er iets te zeggen voor het standpunt van Andrew Knight dat deze variabiliteit deels toe te schrijven is aan een overdaad aan voedsel .
'There is also something to be said for Andrew Knight's view that this variability is partly attributable to an excess of food.'
(Darwin, ch. 1)
Corpus material: autocue-subtitles
• Autocue-subtitle material from Dutch broadcast news
• Aligned at sentence level in the ATRANOS project
• Typically the subtitle is a compressed version of the autocue
Sterker nog , op de geldmarkt liet de euro zich meteen gelden . De debutant won terrein ten opzichte van de routinier , de dollar .
'What's more, on the money market the euro immediately made its presence felt. The newcomer gained ground on the veteran, the dollar.'
Sterker nog : de euro liet zich meteen gelden en won terrein van de routinier , de dollar .
'What's more: the euro immediately made its presence felt and gained ground on the veteran, the dollar.'
(NOS Journaal, 01-04-1999)
Corpus material: news headlines
• Mined from Google News Dutch RSS feed (>900k words)
• News articles are clustered on their content, so headlines
can be quite different
• Hence, we performed manual subclustering
• Skipping very long clusters
Contact met manager slecht ('Contact with manager poor')
Manager weet niets van werkvloer ('Manager knows nothing about the shop floor')
Managers hebben geen benul van de werkvloer ('Managers have no clue about the shop floor')
Chefs hebben geen benul van werkvloer ('Bosses have no clue about the shop floor')
Meeste werknemers balen van hun bazen ('Most employees are fed up with their bosses')
Manager te weinig op werkvloer ('Manager spends too little time on the shop floor')
Corpus material: press releases
• Press releases from news feeds of two Dutch press
agencies
• Automatic alignment of articles, aiming for high recall, at
the expense of precision
• Subsequent manual correction
Slachtoffers van de legionella-uitbraak veroorzaakt door een koeltoren bij het Post CS-gebouw in Amsterdam krijgen een schadevergoeding van opdrachtgever Oosterdokseiland Ontwikkeling Amsterdam ( OOACV ) .
'Victims of the legionella outbreak caused by a cooling tower at the Post CS building in Amsterdam will receive compensation from commissioning party Oosterdokseiland Ontwikkeling Amsterdam (OOACV).'
Enkele slachtoffers van de legionella-uitbraak bij het Post CS-gebouw in Amsterdam vorige zomer kunnen een schadevergoeding tegemoet zien .
'Some victims of last summer's legionella outbreak at the Post CS building in Amsterdam can expect compensation.'
Corpus material: answers from QA
• Target: alternative/similar candidate answers in output of
Question Answering (QA) engines
• Not the answers to "closed" questions
– e.g. “Who invented the telephone?”
• But answers to “open” questions
– e.g., “What are the advantages of using open source
software?”
• However, hard to find this type of material for Dutch
• We use answers from the QA reference corpus (IMIX
project)
– medical domain
– manually extracted from all available text material
– for 100 Q’s
– many have only a single answer, or no answer at all
Answers from QA (cont'd)
Answers to the question "Waardoor worden aften veroorzaakt?" ('What causes aphthae (canker sores)?'):
De oorzaak is niet bekend , maar stress lijkt een rol te spelen ; zo kan iemand aften krijgen in de eindexamenperiode .
'The cause is not known, but stress seems to play a role; for example, someone may get aphthae during the final-exam period.'
De oorzaak is nog niet opgehelderd en er is ook nog geen afdoende behandeling van aften bekend .
'The cause has not yet been established, and no adequate treatment for aphthae is known either.'
Vooral als de weerstand afneemt door ziekte , stress of vermoeidheid kunnen er aften ontstaan .
'Aphthae arise especially when resistance is lowered by illness, stress or fatigue.'
Ook ontstaan er bij vrouwen meer aften rondom de menstruatie .
'Women also develop more aphthae around menstruation.'
Levensmiddelen die aften kunnen veroorzaken , zijn bijvoorbeeld grapefruit en kiwi's .
'Foods that can cause aphthae include, for example, grapefruit and kiwis.'
Conversion
• Conversion from original format (PDF, MS Word, HTML,
OCR'ed text) to XML
– TEI Lite for books
– Custom XML formats in other cases
• (Don't ask me any questions about character encodings...)
Tokenization & parsing
• Tokenization
– includes sentence splitting
– with D-COI tokenizer
(Reynaert '07)
– errors manually corrected for book translations
• Parsing
– all sentences parsed with the Alpino parser
(Van Noord et al)
– CGN XML format
– parsing errors were not manually corrected
Sentence alignment
• No sentence alignment required for
– autocue-subtitle pairs
– answers from QA
– news headlines
• Alignment was required for
– book translations (= parallel text)
– press releases (= comparable text)
• Procedure:
1. automatic alignment aiming at high recall (low precision)
2. manual correction with a special annotation tool
Sentence alignment: book translation
• Task is clear:
– in the case of parallel text, correct alignment is
undisputed
• Normally one-to-one, without crossing alignments
• Many algorithms and implementations [Gale & Church '93]
• Automatic alignment
– doing multiple passes helps
– simple word overlap measures obtain high accuracy
(>95%)
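As an illustration of the kind of word-overlap measure meant here, a greedy sketch for largely one-to-one parallel text; the function names and threshold are assumptions, not the actual DAESO aligner.

```python
# Illustrative sketch only: a greedy sentence aligner for parallel text based
# on simple word overlap (Jaccard), not the project's actual implementation.

def word_overlap(s1, s2):
    """Jaccard overlap between the word sets of two tokenized sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / len(w1 | w2)

def align_parallel(sents_a, sents_b, threshold=0.3):
    """Align each sentence in text A to its best match in text B.

    Assumes largely one-to-one alignments, as in book translations.
    """
    if not sents_b:
        return []
    alignments = []
    for i, sa in enumerate(sents_a):
        best_score, best_j = max((word_overlap(sa, sb), j)
                                 for j, sb in enumerate(sents_b))
        if best_score >= threshold:
            alignments.append((i, best_j, best_score))
    return alignments
```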
Sentence alignment: press releases
• Task is not so clear-cut:
– Correct alignment not always evident for comparable
text (cf. TLT and CLIN papers)
• Crossing alignments are the rule; many-to-many
alignments no exception
• Manual alignment
– Small pilot experiments on inter-annotator agreement show reasonable precision but low recall
– Finding all similar sentences in two texts is a difficult
task for humans
Sentence alignment: press releases
• Automatic alignment
– Algorithms for parallel text alignment do not work for
comparable text [Nelken & Shieber '06]
– Better algorithms for comparable text alignment must
be developed [Barzilay & Elhadad '03, Nelken &
Shieber '06]
– So far, we only tried simple methods relying on
shallow textual features
– Best results with tf*idf weighted cosine similarity on
normalized words: F-scores around 0.6
– However, not good enough to ease manual annotation
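A minimal sketch of tf*idf weighted cosine similarity for comparable-text alignment, assuming scikit-learn is available; the function name and threshold are illustrative, not the project's code.

```python
# Illustrative sketch (not the DAESO code): align sentences from two
# comparable texts with tf*idf weighted cosine similarity over normalized words.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def align_comparable(sents_a, sents_b, threshold=0.5):
    """Return (i, j, score) for sentence pairs whose cosine similarity
    exceeds the threshold. Sentences are assumed to be lower-cased,
    tokenized strings ("normalized words")."""
    vectorizer = TfidfVectorizer()
    # Fit idf weights on both texts so the vector spaces are comparable.
    matrix = vectorizer.fit_transform(sents_a + sents_b)
    vecs_a, vecs_b = matrix[:len(sents_a)], matrix[len(sents_a):]
    sims = cosine_similarity(vecs_a, vecs_b)
    return [(i, j, sims[i, j])
            for i in range(sims.shape[0])
            for j in range(sims.shape[1])
            if sims[i, j] >= threshold]
```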
Sentence alignment tool: Hitaext
• Graphical tool for manually aligning two marked-up texts
• Input: two arbitrary (well-formed) XML documents
• Output: alignment between XML elements in simple
format
• Supports
– alignment at different text levels
– many-to-many alignments
• Implementation
– GUI in wxPython (Python + wxWidgets)
– cross-platform (Mac OS X, GNU/Linux, MS Windows)
– released as open source software
Hitaext (cont'd)
Tree alignment tool: Algraeph
• Graphical tool for aligning nodes from a pair of graphs
• Generalized reimplementation of Gadget
– XML input/output
– from Alpino trees to general graphs in GraphML
– arbitrary sets of alignment relations
– options to tweak rendering of large trees (e.g.
collapse/expand parts of the tree)
• Implementation
– relies on Graphviz’ Dot to render graphs
– GUI in wxPython (Python + wxWidgets)
– cross-platform (Mac OS X, GNU/Linux, MS Windows)
– released as open source software
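To illustrate the rendering choice only (this is not Algraeph's actual code), a tiny sketch that draws a tree with Graphviz' Dot via the Python graphviz bindings; it assumes the Graphviz binaries are installed.

```python
# Illustration of rendering a small tree with Graphviz' Dot, using the Python
# 'graphviz' package (an assumed choice; Algraeph's own code may differ).
import graphviz

def render_tree(edges, filename="tree"):
    """edges: list of (parent_label, child_label) pairs."""
    dot = graphviz.Digraph(comment="dependency tree")
    dot.format = "png"
    for parent, child in edges:
        dot.node(parent, parent)
        dot.node(child, child)
        dot.edge(parent, child)
    dot.render(filename, cleanup=True)  # writes tree.png

render_tree([("won", "euro"), ("won", "terrein"), ("won", "dollar")])
```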
Algraeph (cont'd)
Current state of affairs
• Corpus
– first 500k words in electronic format, tokenized,
parsed and aligned at the sentence level, in (TEI) XML
format
– GUI tools for (correcting) sentence and tree alignment
– tree alignment nearly finished
– final evaluation of inter-annotator agreement nearly
finished
• Evaluation experiments show that users prefer fused
answers in a QA context (ACL08)
Current state of affairs (cont'd)
Working on
– automatic sentence alignment for comparable text
• using only shallow textual features
• using parse trees and lexical resources as well
– automatic tree alignment [Marsi & Krahmer '05]
• using Cornetto (pycornetto interface)
– multi-document summarization for news texts on the
basis of MEAD (Iris Hendrickx)
– complementary multi-doc summarization corpus of
texts, extracts and summaries (Iris Hendrickx)
Future work
• Development of
– sentence fusion module (alignment + merging +
generation)
[Marsi & Krahmer '05]
– Paraphrase extraction techniques
• Application in
– Multi-document summarization beyond extracts
– QA
– IE
Part II: Language Generation
Erwin Marsi and Emiel Krahmer, Explorations in Sentence Fusion.
In Proceedings of the 10th European Workshop on Natural
Language Generation, Aberdeen, GB, 8-10 August 2005.
Outline
• Context
– Multi-doc summarization
– Sentence fusion
• Experiment
– Text material
– Generation procedure
– Evaluation
• Discussion
Context: multi-document summarization
Given a set of similar documents
1. Rank sentences within each document on importance
2. Align similar sentences in different documents
3. Create an:
– Extract: keep the most important sentences from
all documents, using alignments to avoid
redundancy
– Abstract: keep the most important sentences
from all documents, using the alignments to fuse
sentences into new sentences which combine
their information
[Barzilay & McKeown]
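A toy sketch of the extract step, assuming sentence ranks and pairwise alignments are already given; the names and data layout are illustrative, not Barzilay & McKeown's algorithm.

```python
# Toy sketch of extract building: pick the highest-ranked sentences across
# documents, skipping any sentence aligned to one already selected
# (to avoid redundancy).

def build_extract(ranked_sentences, alignments, size=5):
    """ranked_sentences: list of (doc_id, sent_id, text), best first.
    alignments: set of frozensets {(doc_id, sent_id), (doc_id, sent_id)}
    linking similar sentences in different documents."""
    selected, extract = set(), []
    for doc_id, sent_id, text in ranked_sentences:
        key = (doc_id, sent_id)
        # Skip this sentence if it is aligned to an already selected one.
        redundant = any(frozenset({key, other}) in alignments for other in selected)
        if not redundant:
            selected.add(key)
            extract.append(text)
        if len(extract) == size:
            break
    return extract
```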
Context: sentence fusion
• Sentence fusion (Barzilay 2003):
given two sentences which share a certain amount of
information, generate a new sentence which conveys
information from both
• Other application: fusing similar answers in QA
• We are interested in particular types of fusion:
– generalization (intersection fusion)
– specification (union fusion)
Intersection fusion
Input sentences:
1. Steve Irwin, the TV host known as the "Crocodile Hunter",
has died after being stung by a stingray off Australia's
north coast.
2. Steve Irwin, the daredevil wildlife documentarian, is killed
in a stingray attack while filming on the Great Barrier Reef.
Intersection fusion := combine the most general information
shared by both sentences (without introducing redundancy)
[Marsi & Krahmer '05]
Example:
Steve Irwin is killed in a stingray attack off Australia's north
coast.
Union fusion
Input sentences:
1. Steve Irwin, the TV host known as the "Crocodile Hunter",
has died after being stung by a stingray off Australia's
north coast.
2. Steve Irwin, the daredevil wildlife documentarian, is killed
in a stingray attack while filming on the Great Barrier Reef.
Union fusion := combine the most specific information from either
sentence (without introducing redundancy)
[Marsi & Krahmer '05]
Example:
Steve Irwin, the TV host known as the "Crocodile Hunter" and
daredevil wildlife documentarian, has died after being stung by
a stingray while filming on the Great Barrier Reef.
Sentence fusion: decomposition
[Diagram: 1st sentence + 2nd sentence → Analysis (analyzing & aligning sentences) → Merge (merging sentences) → Generation (generating the new sentence) → Fused sentence]
Merge & Generation
• Abstract from the analysis task
– Assume we have a labeled alignment between two
dependency trees for similar sentences
• Goal: given two aligned trees, generate new sentences
• Focus on the subtasks of:
1. Merging: deciding which information from either
sentence should be included
2. Generation: producing a grammatically correct
surface form
• We explore a simple string-based approach
Text material
• At the time, there was no monolingual treebank of
parallel/comparable Dutch text available
• We created a parallel treebank out of two Dutch
translations of “Le Petit prince” (The Little Prince)
                        #Chapters  #Sentences  #Words
  Altena translation        5         290       3600
  Beaufort translation      5         277       3358
Preprocessing
• Tokenization and sentence splitting (manual)
• Parsing with Alpino (dependency triples)
• Sentence alignment
• Tree alignment
– Consensus alignment by two annotators
– High agreement on node alignment (F-score=.99)
– High agreement on alignment relations (F-score=.97,
Kappa=.96)
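For clarity, inter-annotator F-score on node alignment can be computed as sketched below; this is an illustrative sketch, not the project's evaluation code.

```python
# Illustrative sketch: inter-annotator F-score on tree alignment, treating one
# annotator's aligned node pairs as the reference and the other's as the
# prediction.

def alignment_f_score(pairs_annotator1, pairs_annotator2):
    """Both arguments are sets of aligned node-pair identifiers,
    e.g. {("tree1.n3", "tree2.n2"), ...}."""
    if not pairs_annotator1 or not pairs_annotator2:
        return 0.0
    common = pairs_annotator1 & pairs_annotator2
    precision = len(common) / len(pairs_annotator2)
    recall = len(common) / len(pairs_annotator1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```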
Merging
• Simplistic idea
– Every node n in the tree corresponds to a substring
str(n) of the sentence
– If two nodes n and m are aligned, we can substitute
str(m) as an alternative realization for str(n)
– The semantic similarity relation between n and m
determines if the result is a restatement, a
specification or a generalization.
Merge: restatement procedure
for each aligned node pair n,m:
if relation(n,m) = restates,
then str(n) ← str(n) V str(m)
Merge: specification procedure
for each aligned node pair n,m:
if relation(n,m) = generalizes,
then str(n) ← str(n) V str(m)
Merge: generalization procedure
for each aligned node pair n,m:
if relation(n,m) = specifies,
then str(n) ← str(n) V str(m)
for each node n:
if dep-rel(n) ∈ {head/mod, head/predm, …},
then str(n) ← str(n) V NULL
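Read together, the three procedures collect alternative realizations per node. A minimal sketch, assuming nodes carry their substring, their aligned partner, the relation label and the dependency relation; the attribute and mode names are illustrative, not the DAESO implementation.

```python
# Minimal sketch of the three merge procedures; node attributes and names are
# illustrative, not the actual DAESO data structures.

def merge(tree, mode):
    """Attach alternative realizations to the nodes of `tree`.

    Each node is assumed to have: .string (its substring), .aligned (the
    aligned node in the other tree, or None), .relation (the similarity label
    from this node to its aligned partner), .dep_rel (dependency relation),
    and .alternatives (a list filled in here).
    """
    # Which relation licenses substituting the aligned node's string:
    licensed = {
        "restatement": "restates",       # the aligned string is interchangeable
        "specification": "generalizes",  # this node generalizes a more specific one
        "generalization": "specifies",   # this node specifies a more general one
    }[mode]
    for node in tree.nodes:
        node.alternatives = [node.string]
        if node.aligned is not None and node.relation == licensed:
            node.alternatives.append(node.aligned.string)
        # For generalization, optional modifiers may also be dropped (NULL).
        if mode == "generalization" and node.dep_rel in {"head/mod", "head/predm"}:
            node.alternatives.append("")
    return tree
```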
Generation
Following work on statistical NLG (Langkilde & Knight), the
generation process involves two steps:
1. Generate
overgenerate candidates for surface realization
2. Rank
rank candidates by means of a (statistical) language
model
Candidate generation
• Traverse the dependency tree, expanding the list of
alternative surface realizations for each node that has
multiple realizations
• Avoid:
– Realizations identical to one of the input sentences
– Multiple instances of the same realization
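A simplified, string-based sketch of this overgeneration step, assuming each position ("slot") already has its list of alternative realizations from the merge step; the names are illustrative.

```python
# Simplified sketch of candidate overgeneration (illustrative only): every slot
# has a list of alternative realizations; we generate all combinations,
# dropping duplicates and outputs identical to an input sentence.
from itertools import product

def generate_candidates(slots, input_sentences):
    """slots: list of lists of alternative strings, in sentence order."""
    seen, candidates = set(), []
    for combination in product(*slots):
        sentence = " ".join(s for s in combination if s)  # "" = dropped material
        if sentence in input_sentences or sentence in seen:
            continue
        seen.add(sentence)
        candidates.append(sentence)
    return candidates
```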
Examples of "good" output
Input1: Zo heb ik in de loop van mijn leven heel veel contacten gehad met heel veel serieuze
personen
'Thus have I in the course of my life very many contacts had with very many serious
persons'
Input2: Op die manier kwam ik in het leven met massa's gewichtige mensen in aanraking
'In that way came I in the life with masses-of weighty/important people in touch'
Restatement: op die manier heb ik in de loop van mijn leven heel veel contacten gehad met
heel veel serieuze personen
'in that way have I in the course of my life very many contacts had with very many serious
persons'
Specification: op die manier kwam ik in de loop van mijn leven met massa's gewichtige mensen in aanraking
'in that way came I in the course of my life with masses-of weighty/important people in touch'
Generalization: zo heb ik in het leven veel contacten gehad met veel serieuze personen
'thus have I in the life many contacts had with many serious persons'
Examples of "bad" output
Input1: En zo heb ik op mijn zesde jaar een prachtige loopbaan als kunstschilder laten varen .
'And so have I at my sixth year a wonderful career as art-painter let sail'
Input2: Zo kwam het ,dat ik op zesjarige leeftijd een schitterende schildersloopbaan liet varen .
'Thus came it , that I at six-year age a bright painter-career let sail'
Specification: en zo heb ik op mijn zesde jaar als kunstschilder laten een schitterende
schildersloopbaan varen
'and so have I at my sixth year as art-painter let a bright painter-career sail'
Generalization: zo kwam het dat ik op leeftijd een prachtige loopbaan liet varen
'so came it that I at age a wonderful career let sail'
Ranking with n-gram language model
• CMU-Cambridge Statistical Language Modeling Toolkit v2
– >250M words from the Twente Newscorpus
– using back-off and Good-Turing smoothing
• Ranked sentences in order of increasing entropy
• However: looking at the ranking, it was clear that it made
no sense…
• Discarded (for the time being)
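For readers unfamiliar with this kind of ranking, a toy sketch of entropy-based ranking with a smoothed bigram model; this is illustrative only (the actual experiments used the toolkit above with back-off and Good-Turing smoothing).

```python
# Toy sketch of ranking by n-gram entropy (not the CMU-Cambridge toolkit):
# train a bigram model with add-one smoothing and rank candidates by
# per-word cross-entropy, lowest (most fluent) first.
import math
from collections import Counter

def train_bigram(corpus_sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def cross_entropy(sentence, unigrams, bigrams):
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    logprob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothed bigram probability.
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        logprob += math.log2(p)
    return -logprob / (len(tokens) - 1)

def rank(candidates, unigrams, bigrams):
    return sorted(candidates, key=lambda s: cross_entropy(s, unigrams, bigrams))
```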
Evaluation
• Goal: get a rough indication of the output quality
• Material: all 433 variants generated for the sentences
from Ch1
• Quality judged by the authors (J1, J2)
• Categories:
– Perfect: no problems in either semantics or syntax
– Acceptable: understandable, but with some minor
flaws in semantics or grammar
– Nonsense: serious problems in semantics or grammar
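The agreement figure reported on the results slide below is Cohen's kappa; a minimal sketch of its computation from the two judges' labels (illustrative only):

```python
# Illustrative sketch: Cohen's kappa for the two judges' category labels
# (Perfect / Acceptable / Nonsense), computed from paired judgements.
from collections import Counter

def cohens_kappa(labels_j1, labels_j2):
    n = len(labels_j1)
    observed = sum(a == b for a, b in zip(labels_j1, labels_j2)) / n
    counts1, counts2 = Counter(labels_j1), Counter(labels_j2)
    expected = sum(counts1[c] * counts2[c] for c in counts1) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)
```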
Evaluation: results
Kappa = .72
Discussion
• Output is not too bad for restatements
– Will this work for comparable text like news texts or
answers in QA?
• Generation method is too simplistic
– Frequent violations of word order and agreement
constraints
– Rich linguistic info (POS, categories, dependencies) is
ignored
Discussion (cont'd)
• Ranking with n-gram language model failed
– Are the grammatical errors typically long-distance?
(>3 words)
– As syntax trees are available, tree-based stochastic
models (Bangalore & Rambow) might work better
• Alternatively
– Merge at a more abstract level of semantic
representation (e.g. case frames)
– Surface realization from a semantic representation