Automatic Recognition of Structural Relations in Dutch Text
– A Study in the Medical Domain –

by Sander E.J. Timmerman
[email protected]
March 1, 2007

Faculty of Human Media Interaction
Department of EEMCS
University of Twente
P.O. Box 217, 7500 AE Enschede
The Netherlands

Committee:
Ir. W.E. Bosma
Dr. M. Theune
Dr. D.K.J. Heylen
Preface

Right now, you are reading the thesis that resulted from my graduation assignment. During my time at the University of Twente, I found the interaction between humans and computers the most interesting field within Computer Science. I was therefore happy to be able to carry out my graduation assignment at the faculty of Human Media Interaction.

Before starting the final assignment, I followed several courses from this faculty. My traineeship originated there as well: it involved the automatic subtitling of meetings using speech recognition, at Noterik in Amsterdam.

I started this assignment in May 2006, and it lasted until March 2007. Working on the project, spending time with fellow graduates, and student life in general were very enjoyable, but I am glad to finish it and to start a new life, no longer as a student.

I would like to thank my graduation committee, Wauter Bosma, Mariet Theune and Dirk Heylen, for their time, work and comments. Furthermore I would like to thank Rieks op den Akker for his contributions to the annotations and his comments, and Tjerk Bijlsma for reviewing my thesis. Last but not least, I thank my parents for giving me the opportunity to do this study.

I hope you enjoy reading this thesis,

Sander Timmerman, March 1, 2007
Samenvatting

Interaction between humans and computers is a hot topic in current computer science. Research is being done on user interfaces, communication, artificial intelligence and much more. One of the subjects is natural language processing by computers. Researchers work, for example, on the automatic generation of (sensible) text, speech recognition and automatic summarization.

One way to gain insight into texts is to describe their structure. Rhetorical Structure Theory, one of the ways to describe structure in text, can be used for this. Recognizing structure is a difficult process, even for humans, because language is highly ambiguous.

This report describes a way to automatically recognize structure in Dutch texts, specifically in the medical domain. From the medical domain, the Merck Manual, a Dutch medical encyclopedia, was selected as the source of the texts.

To recognize structure automatically, certain word types are used that can signal relations between segments of the text. In this report, sentences are used as segments, the basis of the structure. A recognizer has been developed that uses these word types to recognize the structure automatically. The recognizer uses word types that have been used in other studies, as well as word types that are specifically suited to the medical domain. The latter types were selected especially for this assignment.

The recognizer generates the structure as a Rhetorical Structure Theory tree. To evaluate the quality of the recognized structure, the automatically generated trees are compared with manually created trees, and manually created trees are also compared with each other.
Abstract

Interaction between humans and computers is a hot topic in current computer science. Research is being done on user interfaces, communication, artificial intelligence and much more. One of the subjects is natural language processing by computers. Researchers work, for example, on the automatic generation of (sensible) text, speech recognition and automatic summarization.

One way to gain insight into texts is to describe their structure. For this, Rhetorical Structure Theory, one of the ways to describe structure in texts, can be used. Recognizing structure in texts is a difficult process, even for humans.

This thesis describes a way to recognize structure in Dutch texts automatically, specifically in texts from the medical domain. The Merck Manual, a Dutch medical encyclopedia, is chosen as the source of the texts.

Certain word types that can signal relations between text segments are used for the automatic recognition of structure. In this thesis, sentences are used as the segments that form the basis of the structure. An automatic recognizer is developed which uses these word types to recognize the structure. It uses word types that are discussed in other studies, as well as word types that are specific to the medical domain. The latter were gathered during this assignment.

The recognizer generates the structure of a text as a Rhetorical Structure Theory tree. To evaluate the quality of the recognized structure, automatically generated trees are compared with manually generated trees of the same text. Furthermore, manually generated trees are compared with each other.
Contents

1 Introduction
  1.1 About IMIX
  1.2 Research Questions
  1.3 Organization of the Thesis

2 Structure in Text
  2.1 Discourse
  2.2 Rhetorical Structure Theory
    2.2.1 Theory
    2.2.2 Applications

3 Annotation
  3.1 Manual Annotation
  3.2 Elementary Discourse Units
  3.3 Required Resources for Annotation
  3.4 Discourse Markers
  3.5 Assigning Relations
  3.6 Tree Structure

4 Structure in Dutch
  4.1 Dutch Text
  4.2 Dutch Relations
  4.3 Dutch Discourse Markers
    4.3.1 Conjunctions
    4.3.2 Other Discourse Markers
  4.4 Genres
    4.4.1 Comparison

5 Medical Texts
  5.1 Merck Manual
    5.1.1 Composition
    5.1.2 Features
  5.2 Medical Texts
    5.2.1 Merck Texts
    5.2.2 Other Medical Texts
  5.3 Relations in Medical Texts
  5.4 Medical Discourse Markers
    5.4.1 Noun Constructions
    5.4.2 Verb Constructions
    5.4.3 Signaled Relations
    5.4.4 Time Markers

6 Automatic Recognition
  6.1 State of the Art
  6.2 Automatic Annotator
    6.2.1 The Segmenter
    6.2.2 The Recognizer
    6.2.3 The Tree-Builder
  6.3 Segmentation
  6.4 Defining the Relation Set
  6.5 Used Discourse Markers
    6.5.1 Conjunctions
    6.5.2 Adverbs
    6.5.3 Pronouns
    6.5.4 Domain Specific Discourse Markers
    6.5.5 Relation Markers
    6.5.6 Adjectives
    6.5.7 Implicit Markers
  6.6 Recognizing Relations
  6.7 Recognition Algorithm
  6.8 Scoring Hierarchy
  6.9 Example

7 Evaluation
  7.1 RST-Trees
  7.2 Relation Based Evaluation
  7.3 Full Tree Evaluation
  7.4 Testing Results
    7.4.1 First Text
    7.4.2 Second Text
    7.4.3 Third Text
    7.4.4 Fourth Text
    7.4.5 Fifth Text
    7.4.6 More Texts
    7.4.7 Annotator Evaluation
  7.5 Discussion

8 Conclusions and Future Work
  8.1 Conclusions
  8.2 Future Work

Bibliography

A Rhetorical Relations
B Conjunctions
C Merck Manual
D Medical Annotations
E Dutch Texts
Chapter 1: Introduction
In their daily life, people read a lot of text, whether it is a book, the subtitles of a foreign movie on TV, or a restaurant menu.
All texts are created with an intention: the goal of the author is to convey something to the reader, ranging, for example, from telling a story to amuse the reader to defending political statements. Since a reader must be able to understand the text, there are some constraints on the way texts are written.
First of all, the language used must be familiar to the reader. It is of no use to write a letter to someone in French if the receiver only understands English.
Secondly, the text must be syntactically correct; it must follow the grammatical rules. A sentence like "Any go bird fox table" has no meaning to the reader and is therefore useless, although it consists only of correct English words. In some cases, improper syntax can cause the reader to interpret a piece of text in a way the author certainly did not intend.
In the third place, a text must have coherence. This means that the sentences of a text must be related to the other sentences in the text, and parts of a sentence to each other as well.
In most cases a raw collection of unrelated sentences is not only boring to read, it is also very hard, or even impossible, to understand. A written text therefore has to have a certain structure which enables the reader to spot this coherence and build a cognitive representation of the text. This cognitive representation enables the reader to extract the meaning intended by the author.
In this thesis, existing methods to describe structure in text are explained, and the possibilities of using these methods for Dutch are discussed. The focus lies on the Rhetorical Structure Theory (RST) [MT88], which is used within IMIX¹. The main goal of this thesis is to research whether it is possible to recognize structure in text automatically. Since IMIX focuses on medical text, medical texts in Dutch are used for this research.

¹ http://www.nwo.nl/imix
1.1 About IMIX
This work was done as part of the IMIX project. IMIX is short for Interactieve Multimodale Informatie eXtractie (interactive multimodal information extraction). It is a project in the field of speech and language technology for Dutch. The goal of the project is to develop the knowledge and technology needed to retrieve specific answers to specific questions from Dutch documents. IMIX is funded by NWO², the Netherlands Organization for Scientific Research.
To be able to extract meaningful answers from a text, it is useful to have knowledge about its structure. A properly written text is as coherent as possible, for that is the best way for the author to avoid unintended interpretations of the text by a reader. The structure of a text can be made explicit by annotating the text with rhetorical relations. The annotated texts can then be queried for the relations that exist within the text, rather than analyzing the raw text itself. Since annotation is a very time-consuming process if done by hand, an automatic annotation process is required.
1.2 Research Questions
This research concerns the automatic recognition of structural relations in text. The focus lies on the Rhetorical Structure Theory, a popular discourse theory for describing the structure of texts. The following questions are to be answered:
1. How can RST annotations be automatically generated?
2. Is RST suitable for Dutch and specifically for medical texts?
3. Which properties of Dutch and which relations and discourse markers are best for
automatic recognition?
4. Is it possible to use other properties of medical texts as discourse markers to find relations?
5. How can automatically generated annotations be evaluated?
1.3 Organization of the Thesis
Chapter 2 describes the need for structure in text and ways to generate representations of texts. Discourse is discussed, and a theory for defining the structure of a text, the Rhetorical Structure Theory, is presented, together with some of its applications.
The next chapter elaborates on discourse and RST by describing the manual annotation of text with RST. Knowledge about elementary discourse units and discourse markers is presented, and the need for world knowledge is explained. The assignment of relations is briefly introduced, and the use of O'Donnell's annotation tool is explained. The chapter provides the information needed to create an RST tree from a text.
² http://www.nwo.nl/
The fourth chapter focuses on Dutch texts. It shows that RST is indeed suited for Dutch texts, and describes Dutch discourse markers and relations which can be used for manual and automatic annotation. The chapter contains a comparison between several Dutch genres of text: although differences occur between the genres, texts from each genre can be annotated with RST. Lists of possible discourse markers are presented, and research regarding the use and functionality of these markers is described.
The application of markers is extended in chapter 5, which describes the general properties of medical texts as well as functional properties that can be useful for finding relations automatically. Lists of special discourse markers, useful for the (automatic) annotation of medical texts, are gathered.
Chapter 6 describes the process of automatically annotating texts with rhetorical relations. The three parts of the program, the segmenter, the recognizer and the tree-builder, are discussed. Lists of word types and the relations they may signal are used by the recognizer to generate the RST-tree. Furthermore, the algorithm used by the recognizer is explained.
The trees generated by the automatic recognizer for example texts are presented in chapter 7. These trees are compared to manually annotated trees and the differences are discussed. This chapter also contains comparisons between manually created trees.
Finally, the conclusions are presented and future work is described.
Chapter 2: Structure in Text
There are many different kinds of texts, which differ from each other in, for example, layout, purpose, subject and coherence. A clear difference in layout exists between dialogue and prose. A dialogue is usually an alternation of segments produced by different persons, while prose can contain practically any kind of layout: large spans of text, lists, and even dialogue itself. Differences in purpose can be illustrated by fairy tales versus encyclopedias. While a fairy tale is intended to tell the reader a story, an encyclopedia is created to inform the reader about a certain subject in an objective way.
Coherence in a text is needed to be able to understand the meaning intended by the writer. Furthermore, structured texts tend to be memorized more easily [Bir85]. Practically all texts are coherent, although there are some familiar examples of text without clear coherence, such as some poems. In this chapter, discourse is discussed, along with a way to represent the structure of discourse: the Rhetorical Structure Theory [MT88]. While there are other theories, such as those of Grosz and Sidner [GS86] and of Hobbs [Hob85], RST is used for this research because RST is used within IMIX. Some applications of RST are discussed afterwards.
2.1 Discourse
A text consists of multiple sentences which are related to each other. Such a combination is called a discourse. A discourse itself consists of multiple discourse segments: non-overlapping spans of text which can consist of a part of a sentence, a complete sentence, or even a group of sentences. The coherence between these segments is provided by a relation. A discourse segment can, for example, provide additional information about a preceding segment, contrast with it, or provide a framework for new information.
An example:

1. Harry broke his wrist.
2. He fell out of a tree.
In the preceding example it is clear that the first sentence is coherent with the second.
The second sentence provides additional information about Harry. Harry fell out of the tree
which caused him to break his wrist. There is a clear relationship between both sentences.
In the next example the coherence is not clear:
1. Linda drank a glass of lemonade.
2. The chicken laid an egg.
The fact that Linda drinks a glass of lemonade has nothing to do with the chicken laying an egg. These sentences do not have any relation, because the actions of drinking a glass of lemonade and laying an egg are unrelated. But how do we know that the sentences in the first example are related, while those in the second are not? This is because the reader of the examples has knowledge of the subject. All knowledge about the subject which is not embedded in the text is called world knowledge; it will be further explained in chapter 3. For a reader unfamiliar with trees, the relation in the first example is not clear: he does not know what a tree is, and is therefore unable to know that a person could fall out of one and break a wrist. Even for readers familiar with the event of falling, there is no necessary relation with breaking a wrist, since falling does not always cause a wrist to break. Although the word "falling" might indicate that a wrist can get broken, if it were used in the context of falling in love, it is rather unlikely that this would cause a wrist to break. Even when "falling" refers to the actual, gravity-caused event, breaking is not always the result. If, for example, the second sentence were "He fell on the pillows", a relation would be less obvious, since it would be rather strange if that had caused his wrist to break.
Every reader builds a mental representation of the text he is reading. In that way the text makes sense, i.e. the message it contains is revealed to the reader. There are different ways to model these representations of texts. The one used in this research, the Rhetorical Structure Theory of Mann and Thompson, is described next.
2.2 Rhetorical Structure Theory
Rhetorical Structure Theory (RST) was developed by William Mann and Sandra Thompson [MT88]. It was originally developed as part of studies of computer-based text generation, using written monologues. It describes a set of rhetorical relations between discourse segments, which are used for the analysis of text. This section discusses RST and some of its applications.
2.2.1 Theory
At first the theory described 24 different relations, but the number has since increased to 30. The original 24 relations are shown in table 2.1.
For an overview of the set of relations and their descriptions, see appendix A. The appendix covers only 23 of the above relations: there is no actual schema for the Joint relation, since it is in fact not a rhetorical relation. Joint is used in cases where there is no actual relation between two parts of text, but they are related somehow because both pieces appear in the same discourse.
Relation               Type
---------------------  ----------
Antithesis             Hypotactic
Background             Hypotactic
Circumstance           Hypotactic
Concession             Hypotactic
Condition              Hypotactic
Contrast               Paratactic
Elaboration            Hypotactic
Enablement             Hypotactic
Evaluation             Hypotactic
Evidence               Hypotactic
Interpretation         Hypotactic
Joint                  Paratactic
Justify                Hypotactic
Motivation             Hypotactic
Otherwise              Hypotactic
Purpose                Hypotactic
Restatement            Hypotactic
Sequence               Paratactic
Solutionhood           Hypotactic
Summary                Hypotactic
Volitional Cause       Hypotactic
Non-Volitional Cause   Hypotactic
Volitional Result      Hypotactic
Non-Volitional Result  Hypotactic

Table 2.1: The original RST relations
The coherence in texts can be graphically represented in tree form, as so-called RST-trees. Practically every text can be represented as a tree, although in [WG05] it is argued that trees are not expressive enough and cross dependencies are needed.
The theory provides a way of expressing the relative importance of the text parts a relation connects. For this, the terms nucleus and satellite are introduced. The more important of the two text parts the relation connects is called the nucleus; the other part is called the satellite.
Most relations exist between a nucleus and a satellite, although there are some multinuclear relations. In a multinuclear relation, the text parts are considered to be of equal importance. A relation between a nucleus and a satellite is called a hypotactic relation; a paratactic relation exists between multiple nuclei. Table 2.1 shows that there are 21 hypotactic and 3 paratactic relations.
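The nuclearity column of table 2.1 can be captured as a small lookup table. The sketch below is illustrative Python, not part of the thesis software; it reproduces the table and derives the paratactic relations from it:

```python
# Relation-to-type mapping taken from table 2.1 (the original 24 relations).
RELATION_TYPES = {
    "Antithesis": "hypotactic", "Background": "hypotactic",
    "Circumstance": "hypotactic", "Concession": "hypotactic",
    "Condition": "hypotactic", "Contrast": "paratactic",
    "Elaboration": "hypotactic", "Enablement": "hypotactic",
    "Evaluation": "hypotactic", "Evidence": "hypotactic",
    "Interpretation": "hypotactic", "Joint": "paratactic",
    "Justify": "hypotactic", "Motivation": "hypotactic",
    "Otherwise": "hypotactic", "Purpose": "hypotactic",
    "Restatement": "hypotactic", "Sequence": "paratactic",
    "Solutionhood": "hypotactic", "Summary": "hypotactic",
    "Volitional Cause": "hypotactic", "Non-Volitional Cause": "hypotactic",
    "Volitional Result": "hypotactic", "Non-Volitional Result": "hypotactic",
}

# The paratactic (multinuclear) relations follow directly from the table.
paratactic = sorted(r for r, t in RELATION_TYPES.items() if t == "paratactic")
print(len(RELATION_TYPES), paratactic)  # 24 ['Contrast', 'Joint', 'Sequence']
```

Such a table is the kind of resource an automatic recognizer can consult when it has to decide whether a detected relation connects a nucleus with a satellite or several nuclei.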
The difference between a nucleus and a satellite can be shown with the example:
1. Harry broke his wrist.
2. He fell out of a tree.
It is clear that in this case the first sentence is more important than the second. The first statement could appear on its own, while the second depends on the first. The sentence "He fell out of a tree" can also appear on its own, but it has little or no meaning to a reader, since there is no knowledge about who "he" is. In this example it is clear that "he" refers to "Harry", and it can be concluded that Harry is the person who fell out of the tree. The second sentence gives additional information about the first: in this case it explains to the reader the cause of the fact that Harry broke his wrist. So the first sentence is the nucleus and the second sentence is the satellite. This example is represented as an RST-tree in figure 2.1.
Figure 2.1: An RST-tree example (a Non-Volitional Cause relation; nucleus: "Harry broke his wrist."; satellite: "He fell out of a tree.")
The order of the sentences is not the only factor that determines the importance. If the two sentences of the preceding example are switched:
1. Harry fell out of a tree.
2. He broke his wrist.

then the second sentence would be the more important, and the same relation can be used to represent the connection between the two.
An RST-tree consists of the following parts: discourse segments, which are represented as a horizontal line with the corresponding text beneath it, and relation arcs labeled with the name of the relation. The discourse segments in the preceding example are "Harry broke his wrist" and "He fell out of a tree"; discourse segments are discussed in more detail in chapter 3. The relation is called Non-Volitional Cause and is represented by the pointed arc, which starts in the satellite and ends in the nucleus.
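The tree of figure 2.1 can be sketched as a simple recursive data structure. The class and field names below are illustrative, not taken from the thesis software:

```python
from dataclasses import dataclass, field
from typing import List, Union

# A leaf of the tree: one discourse segment (here, a sentence).
@dataclass
class Segment:
    text: str

# An internal node: a relation connecting a nucleus with a satellite
# (hypotactic) or several nuclei and no satellites (paratactic).
# Children may themselves be relations, which makes the structure a tree.
@dataclass
class Relation:
    name: str
    nuclei: List[Union["Segment", "Relation"]]
    satellites: List[Union["Segment", "Relation"]] = field(default_factory=list)

# The tree of figure 2.1: the satellite gives the cause of the nucleus.
tree = Relation(
    name="Non-Volitional Cause",
    nuclei=[Segment("Harry broke his wrist.")],
    satellites=[Segment("He fell out of a tree.")],
)

print(tree.name)            # Non-Volitional Cause
print(tree.nuclei[0].text)  # Harry broke his wrist.
```

A paratactic relation such as Contrast would simply carry two entries in `nuclei` and leave `satellites` empty.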
In the next example a paratactic relation is embedded:
1. A bird flies,
2. but a fish swims.
This example consists of a single sentence, split into two separate non-overlapping spans. A nucleus or satellite can consist of a single sentence, but can just as well be a (small) part of one. These two spans are related to each other in a contrastive way, and both parts of the relation are equal in importance: the fact that a bird flies is not more important than the fact that a fish swims. This can be represented using the multi-nuclear relation Contrast, as shown in figure 2.2.
Two discourse segments can together form a new segment which is related to other segments. This can be shown with an extension of the example of Harry and the tree:
Figure 2.2: A paratactic relation (Contrast between "A bird flies," and "but a fish swims")
1. Harry broke his wrist.
2. He fell out of a tree,
3. which stood in the garden.
While the third segment does not give extra information about the breaking of Harry's wrist, it does give additional information about the tree. The combination of segments 2 and 3 is related to the first segment, since that combination specifies the reason Harry broke his wrist. The resulting RST-tree is shown in figure 2.3.
Figure 2.3: An extended RST-tree (Non-Volitional Cause with nucleus "Harry broke his wrist." and a satellite span in which "which stood in the garden." is an Elaboration of "He fell out of a tree,")
The relations between segments are bound to certain constraints. There are constraints on the nucleus and the satellite separately, and on their combination. Finally, the effect of the relation is specified. This is shown in the description of the Purpose relation in figure 2.4; the text is taken from the original paper by Mann and Thompson. The following abbreviations are used:

- N = Nucleus
- S = Satellite
- R = Reader
An example of this relation can be shown with the text below:
1. To obtain a pack of cigarettes,
2. put four euros in the machine.
=== Purpose ===
Constraints on N : Presents an activity
Constraints on S : Presents a situation that is unrealized
Constraints on the N + S Combination :
S presents a situation to be realized through the activity in N
The Effect : R recognizes that the activity in N is initiated in order to realize S
Figure 2.4: The Purpose relation as defined in the Rhetorical Structure Theory
Figure 2.5: An example of the Purpose relation (satellite: "To obtain a pack of cigarettes,"; nucleus: "put four euros in the machine.")
Its corresponding RST-tree is shown in figure 2.5.
In the example above, the first sentence is the satellite: it fulfills the constraint of being an unrealized situation. The second sentence is the nucleus: it provides the way to obtain the pack of cigarettes, and thus fulfills the constraint of being an activity. Together, the sentences present the activity of putting four euros in the machine as the way to realize the situation of obtaining a pack of cigarettes.
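Definitions like the one in figure 2.4 are declarative, so a recognizer could keep them as plain data rather than code. The sketch below encodes the Purpose definition with illustrative field names, not taken from the thesis software:

```python
# The Purpose relation of figure 2.4, stored as a record.
# N = nucleus, S = satellite, R = reader, as in the text.
PURPOSE = {
    "name": "Purpose",
    "type": "hypotactic",
    "constraint_on_N": "presents an activity",
    "constraint_on_S": "presents a situation that is unrealized",
    "constraint_on_N_plus_S": (
        "S presents a situation to be realized through the activity in N"
    ),
    "effect": (
        "R recognizes that the activity in N is initiated "
        "in order to realize S"
    ),
}

print(PURPOSE["name"], "is", PURPOSE["type"])  # Purpose is hypotactic
```

Keeping all 24 definitions in such a uniform shape would let an annotation tool display them, and a recognizer check candidate segment pairs against them, without hard-coding each relation separately.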
As shown above, text spans can be connected in multiple ways. Mann and Thompson define five schema types, which specify how text spans can be connected to each other. The five types are shown in figure 2.6.
The Circumstance example shows the use of a hypotactic relation. The arc in the example points from left to right, meaning that the satellite is the left part and the nucleus the right part. It is also possible that the nucleus comes first; in that case the arc is reversed, as in figure 2.1 for example. All hypotactic relations not mentioned in figure 2.6 can be used like the Circumstance relation.
The Contrast relation is defined with a different schema. The arc between the discourse segments has two points, because the segments connected by a Contrast relation are equal in importance: both are nuclei. This schema looks like the one used for the Sequence relation, in which all discourse segments are considered nuclei as well. The difference is that the Contrast relation always connects exactly two segments, while a Sequence can contain more.
The Joint relation has yet another schema: no arc is added. The number of nuclei used within the Joint relation is two or more, just like the Sequence relation.
The last schema, with Motivation and Enablement, is a special case. It allows two satellites, one before the nucleus in the text and one after it, to be connected to the nucleus in the same nuclear span. A nuclear span is a span of two or more connected discourse segments, and is indicated by the vertical bar.
Figure 2.6: The five different schemas
With the other relations such a tree is not possible. If, for example, a text containing three discourse segments were represented as a tree in which the second segment is the only nucleus, there are two possible trees when the relations are other than Motivation and Enablement. These two trees are shown in figure 2.7.
and Enablement. These two trees are shown in figure 2.7.
Concession
Circumstance
R
A
Circumstance
Concession
C
R
B
C
A
(a)
B
(b)
Figure 2.7: Two different trees
2.2.2 Applications
The Rhetorical Structure Theory can be used for different applications. To show the use of RST, some applications are mentioned below, based on a survey by Taboada and Mann [TM06], who divide them into four genres:
1. Computational Linguistics
2. Cross Linguistic Studies
3. Dialogue and Multimedia
4. Discourse Analysis, Argumentation and Writing
The field of computational linguistics is broad; examples of applications in this field are summarization, essay scoring, translation and natural language generation. A study regarding essay scoring is, for example, [BMK03]. In his PhD thesis [Mar97b], Daniel Marcu describes summarization and natural language generation using RST. Authors and automatic language generators can use RST to prevent readers from interpreting a text in an unintended way, as shown in [Hov93].
RST is also used for languages other than English, for example Brazilian Portuguese [PdGVNM04] and German [Ste04].
Chapter 3: Annotation
In this chapter the annotation of structure in texts is discussed. First, the process of manually annotating texts and the difficulties that occur during it are described, together with the use of the annotation tool of O'Donnell [O'D00]. After that, the segmentation of discourses into elementary discourse units is treated. Furthermore, the use of world knowledge and discourse markers to annotate a discourse is described, as well as the process of assigning relations to discourse units. The texts are selected from the Merck Manual1, a (digital) Dutch medical encyclopedia. This encyclopedia is used because the IMIX project focuses on Dutch medical texts. For this research several pieces of text from different sources are annotated; a selection of annotations from samples of the Merck Manual is shown in appendix D.
3.1 Manual Annotation
Before automatic annotation was possible, a selection of texts was manually annotated. These texts are taken from the Merck Manual. Each describes a medical topic, such as tiredness, shortness of breath or the blood supply of the heart itself. A typical text consists of several paragraphs. Ideally, an RST-tree of the complete text of a topic is created.
For the manual annotation the RST-tool of O’Donnell is used. With this tool the text file
is read and adapted. First of all the text has to be divided into non-overlapping spans of text
(segments) between which the relations can be defined. For these annotations the standard
list of 24 rhetorical relations as defined by Mann and Thompson is used. It is possible to
use other, more extensive lists or smaller ones. The annotator can also define new rhetorical
relations. In figure 3.1 the text segmenting part of the tool is shown. The vertical bars define
the segment boundaries. In this example, the text is segmented into sentences, but a further
segmentation is possible, as will be explained in section 3.2.
After the segmentations are made, the relations can be added in a graphical environment.
See figure 3.2. In this picture there are four sentences between which the relations are to be
added. The first two sentences are already connected with an Elaboration relation.
1 http://www.merckmanual.nl
Figure 3.1: The RST-Tool of O’Donnell
3.2 Elementary Discourse Units
For the recognition of the structure in a discourse, it is necessary to split the text into units. The first constraint is that the text is split into non-overlapping spans. These spans of text (segments) are called elementary discourse units (EDUs). There are several ways to split a text into segments. The easiest way is to take a complete sentence as a unit. A drawback of this approach is that sentences can be very long; if a long sentence forms a single unit, information about the text is easily lost. Within the sentence, further segmentation could be possible and relations added, thus providing more information.
Take for example the following sample from the Merck Manual:
Hoewel de kleur, ernst en plaats van de uitslag uiteenlopen, jeukt deze altijd. De jeuk leidt vaak
tot een onbedwingbare neiging tot krabben, wat een cyclus van jeuk-krabben-uitslag-jeuk tot
gevolg heeft, waardoor het probleem alleen maar erger wordt.
ENG: Although the color, severity, and location of the rash differ, it always itches. The itch often leads to an uncontrollable tendency to scratch, which results in a cycle of itch-scratch-rash-itch, which only makes the problem worse.
A possible segmentation is the following. This small text is segmented at the sentence
boundaries.
[Hoewel de kleur, ernst en plaats van de uitslag uiteenlopen, jeukt deze altijd.]2A [De jeuk leidt
vaak tot een onbedwingbare neiging tot krabben, wat een cyclus van jeuk-krabben-uitslag-jeuk
tot gevolg heeft, waardoor het probleem alleen maar erger wordt.]2B
Figure 3.2: The graphical relation part

The second sentence can be connected to the first sentence with an Elaboration, a Non-Volitional Result or a Background relation. But the second sentence contains more information on its own, which one would probably like to make visible in the structure. To add this
information, the text has to be segmented into smaller pieces. Perhaps a better segmentation
is the following:
[Hoewel de kleur, ernst en plaats van de uitslag uiteenlopen, jeukt deze altijd.]3A [De jeuk leidt
vaak tot een onbedwingbare neiging tot krabben,]3B [wat een cyclus van jeuk-krabben-uitslag-jeuk
tot gevolg heeft,]3C [waardoor het probleem alleen maar erger wordt.]3D
Now the following relations can be added to this text, resulting in the tree shown in figure 3.3.
Figure 3.3: The RST-tree of the Merck example, combining a Background relation with an embedded Non-Volitional Result and Elaboration over segments 3A to 3D
This segmentation provides more information: the fact that the itch-scratch-rash-itch cycle is caused by the scratching can now be extracted from the rhetorical structure.
There is no single best approach to dividing a discourse into segments. In the example above it can be argued that the second segmentation is better than the first, since more information is added, but it cannot be concluded that it is the best possible one. For example, the first sentence can also be segmented into multiple parts:
[Hoewel de kleur, ernst en plaats van de uitslag uiteenlopen,]4A [jeukt deze altijd.]4B [De jeuk leidt
vaak tot een onbedwingbare neiging tot krabben,]4C [wat een cyclus van jeuk-krabben-uitslag-jeuk
tot gevolg heeft,]4D [waardoor het probleem alleen maar erger wordt.]4E
These two parts can be connected with a Concession relation, which results in the tree shown in figure 3.4.
Figure 3.4: The final RST-tree of the Merck example, with a Concession relation connecting 4A and 4B added under the Background relation
Further segmentation of an elementary discourse unit is not necessarily the best option. In fact, no optimal segmentation is defined. Human annotators segment a text at the points they consider useful. Therefore differences in segmentation can occur if a text is annotated by different annotators.
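The simplest segmentation strategy discussed above, taking complete sentences as units, can be sketched as follows. This is a minimal illustration and not the thesis's actual tooling; the splitting pattern is a simplifying assumption (it ignores abbreviations):

```python
import re

def segment_into_sentence_edus(text):
    """Split a text into elementary discourse units at sentence boundaries.

    Simplifying assumption: a sentence ends with '.', '!' or '?'
    followed by whitespace; abbreviations are not handled.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

edus = segment_into_sentence_edus(
    "Hoewel de kleur, ernst en plaats van de uitslag uiteenlopen, jeukt deze altijd. "
    "De jeuk leidt vaak tot een onbedwingbare neiging tot krabben."
)
# yields two sentence-level EDUs for the Merck sample above
```

A finer, sub-sentential segmentation, as argued for above, would need additional cues such as commas and discourse markers.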
3.3 Required Resources for Annotation
To be able to connect two spans of text with a rhetorical relation, it has to be decided between which segments a relation is to be added, and exactly which relation fits best. There are multiple ways to connect the spans. These differences can occur for several reasons [MT88], for example ambiguity of the text, differences between annotators and analytical error. The outcome also depends on the way the text is segmented.
Humans can read texts because they have knowledge about the world; they understand what the text is about. People know what the word "tree" means, and know (a portion of) the properties of the entity tree. They also know what the effect of
the event "fell" is. World knowledge provides the possibility to segment the example from
chapter 2 into elementary discourse units and to decide which relation fits between them.
Without knowledge about the domain of the subject, it is very hard to create a meaningful
structure representation. This can be shown with the following sample from the Merck
Manual, a paragraph about immune malfunctions:
Experimentele methoden, zoals de transplantatie van foetale thymuscellen en levercellen, zijn in
enkele gevallen effectief gebleken, vooral bij patiënten met het syndroom van DiGeorge. Bij
ernstige gecombineerde immuundeficiëntie met adenosinedeaminasedeficiëntie kunnen de
ontbrekende enzymen soms worden aangevuld. Gentherapie is veelbelovend bij deze en enkele
andere aangeboren immuundeficiënties waarbij de erfelijke afwijking is geïdentificeerd.
ENG: Experimental methods, like the transplantation of fetal thymus cells and liver cells, have appeared to be effective in some cases, especially in patients suffering from the DiGeorge syndrome. In some cases of severe combined immune deficiency with adenosine deaminase deficiency, the missing enzymes can be replenished. Gene therapy is promising for this and some other congenital immune deficiencies in which the hereditary abnormality has been identified.
This text is harder to segment since it is about more specialized issues. One can decide
to take the sentences as units, since such a segmentation is easy to perform, resulting in the
following text:
[Experimentele methoden, zoals de transplantatie van foetale thymuscellen en levercellen, zijn in
enkele gevallen effectief gebleken, vooral bij patiënten met het syndroom van DiGeorge.]6A [Bij
ernstige gecombineerde immuundeficiëntie met adenosinedeaminasedeficiëntie kunnen de ontbrekende enzymen soms worden aangevuld.]6B [ Gentherapie is veelbelovend bij deze en enkele
andere aangeboren immuundeficiënties waarbij de erfelijke afwijking is geïdentificeerd.]6C
The selection of the relations between them is harder. The relation between 6A and
6B can be an Elaboration, but could also be judged as an Evidence relation. Probably few
people know what "adenosinedeaminasedeficiëntie" is, making it hard to decide which
relation to choose.
The RST-relations are defined with certain constraints. These constraints indicate what
(kind of) world knowledge is actually needed. For example the Evidence relation is defined
with the following constraints:
1. Constraints on N(ucleus): R(eader) might not believe N to a degree satisfactory to W(riter).
2. Constraints on S(atellite): The reader believes S or will find it credible.
3. Constraints on the N+S combination: R's comprehending S increases R's belief of N.
The need for world knowledge to assign this relation is shown with the following example:
1. We ran out of cookies.
2. The red box is empty.
To be able to assign the Evidence relation, it is necessary to have world knowledge about the cookies and the box: one needs to know that the cookies are stored in the red box. Since that box is empty, the conclusion is that there are no cookies present. So the second sentence provides the evidence for the first statement; the combination of the two sentences gives reason to believe the statement that there are no cookies left.
3.4 Discourse Markers
In general a discourse contains discourse markers. Discourse markers are words or expressions which signal that certain discourse relations hold between certain discourse segments.
Examples of such markers are conjunctions, prepositions and adverbs, but more word types
also fit in. No exact definition of discourse markers exists and there are several alternative
names, for instance cue phrases, discourse connectives, coherence markers and more [Fra99].
Discourse markers can also be useful for the segmentation of text: if a discourse marker is found, the text can be segmented at that point and the relation added afterwards. In the following example, the discourse marker "although" is found, which can signal a Concession relation.
The cat is hungry, although she just finished her meal.
The text can be segmented in two elementary discourse units where the second unit starts
with the discourse marker. This results in the following:
[The cat is hungry,]8A [although she just finished her meal.]8B
The Concession relation can be added in which the first part is the nucleus and the
second part the satellite. See figure 3.5.
Figure 3.5: The RST tree of the hungry cat, with a Concession relation between the nucleus "The cat is hungry," and the satellite "although she just finished her meal"
If the example above was written in a different order like:
Although the cat just finished her meal, she is still hungry.
the Concession relation would still hold, but the segmentation differs. In this example the segmentation could be done at the comma, but there are other possibilities, which make the segmentation task much harder.
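The marker-based segmentation illustrated with the hungry-cat example can be sketched as follows. The marker table below is a hypothetical two-entry stand-in, not the actual marker inventory developed in chapter 4:

```python
# Hypothetical marker-to-relation table; illustrative only.
MARKER_RELATIONS = {"although": "Concession", "because": "Cause"}

def segment_at_marker(sentence):
    """Split a sentence at the first discourse marker found and return
    (nucleus, satellite, relation); the satellite starts at the marker."""
    lowered = sentence.lower()
    for marker, relation in MARKER_RELATIONS.items():
        pos = lowered.find(" " + marker + " ")
        if pos != -1:
            nucleus = sentence[:pos + 1].strip()
            satellite = sentence[pos + 1:].strip()
            return nucleus, satellite, relation
    # no marker found: the sentence stays a single unit
    return sentence, None, None

n, s, r = segment_at_marker("The cat is hungry, although she just finished her meal.")
```

Note that this sketch only handles markers in sentence-medial position; the reordered example above, with "Although" sentence-initially, already escapes it.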
Since text can be ambiguous, words which can act as a discourse marker do not always signal a relation. And when a discourse marker does signal a relation, multiple
relations can be possible. A discourse marker like "because" can signal a causal relation, but which exact rhetorical relation fits best depends on the way the text is written.
This can be illustrated with the following example:
1. Because John acted badly, his mother sent him to bed.
And the next:
2. Snakes are very dangerous because they bite lots of people.
The first item could be marked with a Volitional Result relation while the second is better
described with a Justify or an Evidence relation.
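Because a marker such as "because" admits several relations, an automatic approach can at best return a set of candidates and leave the final choice to context. A sketch, with candidate lists taken from the examples above:

```python
# Candidate relations per marker; the lists are illustrative, taken from
# the "because" and "but" examples in this section, not an exhaustive inventory.
CANDIDATES = {
    "because": ["Volitional Result", "Justify", "Evidence"],
    "but": ["Antithesis", "Contrast"],
}

def candidate_relations(marker):
    """Return all relations a marker may signal; the final choice
    depends on the surrounding text and is left to a later stage."""
    return CANDIDATES.get(marker.lower(), [])

options = candidate_relations("because")
```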
Unfortunately, discourse markers do not signal all relations in a discourse, and they can be ambiguous. In [Tabng] it is shown that approximately 60-70% of the relations are not signaled by discourse markers. That study also suggests that genre-specific factors may affect which relations are signaled, so certain types of text may have a higher percentage of relations signaled by discourse markers.
Instances of words and phrases which can act as a discourse marker do not necessarily do
so. As mentioned above, prepositions can be discourse markers, but not all prepositions are
discourse markers. Whether or not a preposition acts as a discourse marker depends on the
sentence. Take for example the preposition "behind". In the following sentence, the word
acts as a marker, signaling an Elaboration relation:
The dog sits in his kennel, behind the brick wall.
While in the next sentence it does not:
I am always behind you.
Another difficulty with the use of discourse markers is the fact that a certain discourse marker can signal different relations [HL93], especially relations which are quite similar. The discourse marker "But", for example, can signal an Antithesis relation but could just as well signal a Contrast relation.
3.5 Assigning Relations
The assignment of a relation between two text spans can be a hard decision. The use of discourse markers helps with the segmentation and the assignment of the relations. In some cases it is very hard or even impossible to assign a relation between segments without world knowledge. This can be illustrated with the following example:
Elke cel bevat mitochondriën, kleine structuren die de cel van energie voorzien.
ENG: Each cell contains mitochondrions, small structures which provide the cell with energy.
A (correct) segmentation is:
[Elke cel bevat mitochondriën,]13A [ kleine structuren die de cel van energie voorzien.]13B
These parts can be connected with the Elaboration relation. Since no discourse markers which signal the Elaboration relation are present in this sentence, the use of world knowledge is necessary. If this sentence must be annotated automatically, a problem arises, since a computer has no knowledge about the world unless it is specifically added. This problem will be discussed in more detail in chapter 6.
3.6 Tree Structure
If a text were manually annotated by different annotators, differences would occur. People not only choose different relations between elementary discourse units; the actual EDUs they relate also show considerable differences.
Take for example the following text, segmented into three elementary discourse units,
annotated into two trees in figure 3.6.
[Harry fell out of a tree.]14A [He broke his wrist.]14B [His wrist hurt badly.]14C
Figure 3.6: Two annotation trees, (a) and (b), each connecting 14A, 14B and 14C with Non-Volitional Result relations
Both elementary discourse units 14B and 14C are connected to another EDU with a Non-Volitional Result relation. But where in the first tree both EDUs are connected to 14A, in the second tree the third is connected to the second.
Such differences in connecting EDUs with relations happen very often during annotation. For large(r) texts this can cause two trees of the same text to be quite unlike each other. Of all the texts annotated during this research, only a few RST trees were similar to each other, and none of those texts consisted of more than 4 EDUs.
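The two competing trees of figure 3.6 can be made explicit with a minimal tree datatype. This is an illustrative sketch for comparing annotations, not the representation used by the RST-tool, and the nucleus/satellite assignment is simplified:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Relation:
    name: str
    nucleus: "Node"       # the more essential span
    satellite: "Node"     # the supporting span

# A node is either an EDU label or a nested relation.
Node = Union[str, Relation]

# Tree (a): 14B and 14C both attach to the span headed by 14A.
tree_a = Relation("Non-Volitional Result",
                  Relation("Non-Volitional Result", "14A", "14B"),
                  "14C")

# Tree (b): 14C attaches to 14B, which attaches to 14A.
tree_b = Relation("Non-Volitional Result", "14A",
                  Relation("Non-Volitional Result", "14B", "14C"))
```

Even though both trees use the same relation label everywhere, they are structurally different objects, which is exactly the kind of annotator disagreement described above.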
Chapter 4: Structure in Dutch
In this chapter structure in Dutch texts will be discussed. First, properties of Dutch in
general and the possibility to use Rhetorical Structure Theory to annotate texts in this
language are described. After that discourse markers for Dutch are discussed. In the last
section the differences between several genres of Dutch text are shown.
4.1 Dutch Text
Dutch is a West-European language natively spoken by over 21 million people1 . It belongs to
the family of West Germanic languages just like English and German, although differences
exist between these languages. The writing system of Dutch uses the Latin alphabet.
Although RST was developed using English texts, it can be used for Dutch as well. In [KS98], Knott and Sanders performed a cross-linguistic study of English and Dutch regarding the classification of coherence relations. They found similarities in the use of discourse markers signaling coherence relations. More studies [San, MS01, ART93] support the use of RST for Dutch. This will be illustrated with some small examples comparing the use of RST in Dutch and English.
The composition of Dutch sentences is relatively comparable to that of their English translations, as this small sentence illustrates:

De man is een broodje aan het eten.
ENG: The man is eating a sandwich.

The translation is relatively similar and the elementary units are conserved: the nouns (man, sandwich) and the verbs (is, eating) are present in the translation. If the two sentences used in an example in chapter 2 are translated, it shows that the relation is also conserved:
1. Harry broke his wrist.
2. He fell out of a tree.
1 http://taalunieversum.org/ as of 08/2006
Dutch translation:
1. Harry heeft zijn pols gebroken.
2. Hij viel uit een boom.
Although these simple examples do not provide evidence, they suggest that Rhetorical Structure Theory can be used to describe structure in Dutch. During this research several pieces of Dutch text from different genres were successfully annotated with RST; no texts were found which could not be annotated.
4.2 Dutch Relations
For the manual annotation of Dutch texts during this assignment, the standard set of rhetorical relations as defined by Mann and Thompson is used. This set of relations provides the possibility of structuring Dutch texts, but it is questionable whether this set is optimal in terms of coverage of the text [HM97]. This problem exists for English (and other languages) as well, despite the fact that RST was originally developed for English. Some texts could perhaps be better described with a relation type that is not defined (for example, a kind of relation used by other theories), so additional relations may be needed. In [HM97], Hovy discusses a set of over 350 relations from different sources. Conversely, it is also possible to merge relations which differ only slightly in their definition.
For the automatic recognition of relations, a shorter list of possible relations is important since it decreases the likelihood of errors. A drawback of the use of a shorter list is an increase in information loss. In chapter 5, a classification is made of relations in Dutch medical texts, which is used for the automatic recognition described in chapter 6.
4.3 Dutch Discourse Markers
Dutch contains discourse markers, just like other languages such as English. There are different kinds of words which can act as discourse markers. First, conjunctions are described, and it is shown that they are most useful within sentences rather than between sentences. After that, other kinds of words are discussed, such as the adverbs and demonstrative pronouns which are used for the recognition of relations between sentences.
4.3.1 Conjunctions

The first word type which can act as a discourse marker is the conjunction. Conjunctions are used as discourse markers in several studies [PSBA03, Mar00, CFM+02, Sch87] and many more. They usually signal relations within sentences, but a small subset can also signal relations between sentences.
Lists of conjunctions are collected from different sites2 on the Dutch language and afterwards extended with data from the CGN lexicon [HMSvdW01]. The CGN lexicon is developed for the project Corpus Gesproken Nederlands (Spoken Dutch Corpus) and consists of about a million syntactically annotated words. The conjunctions are extracted from this lexicon, and the conjunctions which would not be used in written text are removed, because the CGN lexicon also contains some speech corruptions, for example the word "enne" (and-eh) as a corruption of the word "en" (and).
2 http://www.muiswerk.nl/WRDNBOEK/VOEGWRDN.HTM
http://nl.wikipedia.org/wiki/Voegwoord
The total list of conjunctions, with their corresponding English translations, is given in appendix B. The list contains conjunctions which are used today and some 'old' words not used often anymore. All the data used for this project, the whole Merck Manual, is checked for the appearance of the conjunctions, and some of these words did not appear once. The words "schoon" and "naar" did in fact appear, but not as conjunctions, since they also mean "clean" and "to" in Dutch, respectively. The other conjunctions are all used in the available data. The list of 'old' conjunctions is shown in table 4.1.
aleer, alsook, annex, dewijl, doordien, eer, eerdat, ende, gelijk, hoezeer, ingeval, naar, naardien, nademaal, niettegenstaande, ofdat, oftewel, overmits, schoon, uitgenomen, vermits, vooraleer, wijl

Table 4.1: 'Old' Dutch Conjunctions
In table 4.2 the remaining conjunctions of the collected list are shown.
aangezien, al, alhoewel, als, alsmede, alsof, alvorens, behalve, daardoor, daar, dan, dat, doch, doordat, dus, evenals, en, hetzij, hoewel, indien, maar, mits, na, naargelang, naarmate, nadat, noch, nu, of, ofschoon, ofwel, om, omdat, opdat, sedert, sinds, tenzij, terwijl, toen, tot, totdat, uitgezonderd, zoals, voor, voordat, voorzover, wanneer, want, zo, zodat, zodra, zolang, zover

Table 4.2: Most Common Dutch Conjunctions
An example of the use of conjunctions in Dutch:
Het paard staat in de wei, hoewel het erg koud is.
ENG: The horse stands in the meadow, although it is very cold.
http://triangulu.co.wikivx.org/nl/grammatica
http://oase.uci.kun.nl/ ans/e-ans/10/body.html
The sentence could be segmented at the discourse marker "hoewel" (although) resulting
in:
[Het paard staat in de wei,]16A [hoewel het erg koud is.]16B
Here the discourse marker "hoewel" signals a Concession relation. The corresponding RST-tree is shown in figure 4.1.
Figure 4.1: The Concession relation between "Het paard staat in de wei," and "hoewel het erg koud is."
To get an overview of the number of possible discourse markers, 10 texts from the Merck Manual were randomly selected, ranging from about 150 to 2000 words. The occurrences in these texts of the conjunctions listed in appendix B were counted. The results are presented in table 4.3.
The table shows only the number of occurrences; not all occurrences of a certain conjunction do in fact signal a relation. Words which can act as a conjunction can also be used for purposes other than connecting two sentence parts. This can be shown with the following fragment from text2, where the word "of" (or) is not used as a conjunction.
Tevens kan deze bacterie longontsteking (pneumonie) veroorzaken, bronchitis,
middenoorontsteking (otitis media), oogontsteking (conjunctivitis), ontsteking van een of meer
neusbijholten (sinusitis) en acute infectie van het gebied net boven het strottenhoofd
(epiglottitis).
ENG: Also, this bacterium can cause pneumonia, bronchitis, inflammation of the middle ear (otitis media), inflammation of the eye (conjunctivitis), inflammation of one or more accessory sinuses of the nose (sinusitis) and acute infection of the area just above the larynx (epiglottitis).
This may explain the significant differences in the numbers of occurrences: while in text6 conjunctions make up 9.88% of the total number of words, in text7 this is only 3.77%. Another special property of text7 is that it does not contain any instance of the words "en" or "of", while all other texts from table 4.3 do.
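The counts behind table 4.3 amount to a simple token count. A sketch, assuming a naive word tokenizer and a small sample of the conjunction list from appendix B:

```python
import re

# Small illustrative sample; appendix B holds the full conjunction list.
CONJUNCTIONS = {"en", "of", "maar", "omdat", "hoewel"}

def conjunction_percentage(text):
    """Count tokens that appear in the conjunction list and return
    (count, total_tokens, percentage). Note: this counts surface forms,
    so a token like "of" is counted even when not used as a conjunction."""
    tokens = re.findall(r"\w+", text.lower())
    count = sum(1 for t in tokens if t in CONJUNCTIONS)
    return count, len(tokens), 100.0 * count / len(tokens)

count, total, pct = conjunction_percentage(
    "Het paard staat in de wei, hoewel het erg koud is."
)
```

As the comment notes, such counts overestimate discourse-marker use, which is exactly why the texts were also analyzed by hand.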
To be able to assign relations to sentences, it is necessary to know which relations are signaled by which discourse markers. A discourse marker can signal more than one relation, and a relation can be signaled by multiple discourse markers [GS98]. Therefore a classification of the relations signaled by the possible discourse markers is needed. First, an overview is gathered of which conjunctions are most frequent in medical discourse. Table 4.4 shows the distribution of the conjunctions over the texts. The numbers represent the occurrences of each conjunction, and the percentages show their relative frequency; for example, 5.71% of all conjunctions found were "Tot". The fact that the word "Voor"
Name | Conjunctions | Total Words | Percentage
Text0 | 48 | 548 | 8.76 %
Text1 | 14 | 243 | 5.76 %
Text2 | 12 | 153 | 7.84 %
Text3 | 26 | 330 | 7.88 %
Text4 | 66 | 964 | 6.85 %
Text5 | 162 | 2048 | 7.91 %
Text6 | 85 | 860 | 9.88 %
Text7 | 8 | 212 | 3.77 %
Text8 | 13 | 191 | 6.81 %
Text9 | 74 | 1060 | 6.98 %

Table 4.3: Percentages of the number of conjunctions
occurs 38 times in the selection does not mean it also acts 38 times as a discourse marker.
To check whether a conjunction acts as a discourse marker or not, all texts are analyzed by
hand.
It is necessary to know what relations are signaled by which discourse marker and in
which context. A difference is made between the signaling of relations between sentences
and inside sentences. In [PSBA03], the conjunctions are divided into two types, the subordinating conjunctions and the coordinating conjunctions. While subordinating conjunctions
can signal relations within the sentence they appear in, the coordinating conjunctions could
also signal relations between sentences. The list of coordinating conjunctions contains only
five elements: En, Want, Maar, Dus, Of [KS04]. The rest are considered subordinating
conjunctions.
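This split can be expressed directly; the five coordinating conjunctions are those from [KS04], and everything else is treated as subordinating:

```python
# The five Dutch coordinating conjunctions [KS04];
# all other conjunctions are treated as subordinating.
COORDINATING = {"en", "want", "maar", "dus", "of"}

def conjunction_type(word):
    """Coordinating conjunctions may also signal relations between
    sentences; subordinating ones only within a sentence."""
    return "coordinating" if word.lower() in COORDINATING else "subordinating"
```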
To find the relations signaled by the conjunctions, the texts are manually checked for each conjunction. If a conjunction is found in the text, it is checked whether it signals a relation or not. If it does, the actual relation it signals is noted. Some conjunctions signal different relations in different contexts. The conjunctions that signaled a relation most of the times they were found in the text are shown in table 4.5.
The table shows that all these conjunctions were found to signal relations within a sentence. Two of the conjunctions did indeed signal relations between sentences; both are considered coordinating conjunctions. Although the conjunction Want is added to the list, no occurrence of it was found signaling a relation between sentences. In her master's thesis [vL05], van Langen also did not discover the conjunction Want signaling a relation between sentences. In her research the coordinating conjunction Dus was not found to signal a relation between sentences either, although it was found in the data used for this assignment. The coordinating conjunctions Of and En are missing from the table, because their signaling of a relation was found to be too ambiguous; for example, the word "en" is used to combine multiple subjects into a single subject in about 75% of the cases, as will be illustrated in section 4.4.1. Note that the word "Daardoor" can also signal a relation between sentences, because it can act both as a conjunction and as a conjunctive adverb. Conjunctive adverbs are discussed in section 4.3.2. The words "Daardoor", "Dus" and "Daarom" are
Conjunction | Occurrences | Percentage
En | 164 | 32.28 %
Of | 79 | 15.56 %
Voor | 38 | 7.48 %
Om | 32 | 6.30 %
Tot | 29 | 5.71 %
Als | 26 | 5.19 %
Dan | 25 | 4.92 %
Dat | 23 | 4.53 %
Maar | 17 | 3.30 %
Omdat | 17 | 3.30 %
Wanneer | 13 | 2.56 %
Na | 8 | 1.57 %
Terwijl | 8 | 1.57 %
Zoals | 5 | 0.98 %
Totdat | 4 | 0.79 %
Al | 3 | 0.59 %
Doordat | 3 | 0.59 %
Zo | 3 | 0.59 %
Behalve | 2 | 0.39 %
Daardoor | 2 | 0.39 %
Zodat | 2 | 0.39 %
Daar | 1 | 0.20 %
Dus | 1 | 0.20 %
Hoewel | 1 | 0.20 %
Ofwel | 1 | 0.20 %
Voordat | 1 | 0.20 %
Total | 508 | 100 %

Table 4.4: Overview of Occurrences
Table 4.5: Relations signaled by conjunctions. The conjunctions examined are: Aangezien, Behalve, Daardoor, Doordat, Dus, Evenals, Hoewel, Indien, Maar, Mits, Nadat, Ofschoon, Ofwel, Omdat, Opdat, Sinds, Tenzij, Uitgezonderd, Wanneer, Want, Zoals, Zodat, Zodra and Zolang. The relations they signal (some conjunctions signal more than one) are: Evidence, Concession, (Non) Volitional Cause, (Non) Volitional Result, Elaboration, Sequence, Condition, Antithesis, Contrast, Background and Circumstance. All of these conjunctions signal relations inside sentences; only Daardoor, Dus and Maar also signal relations between sentences.
considered causal connectives, where the cause precedes the result [MS01].
Some examples of the use of these conjunctions are shown below. The first example shows
the use of the conjunction "Indien" (If). The corresponding tree is presented in figure 4.2.
[Indien de histamine met het voedsel wordt ingenomen,]18A [kan het gezicht onmiddellijk rood
worden.]18B
[ENG: If the histamine is consumed together with the food,]18C [the face can become red
immediately.]18D
Figure 4.2: The Condition relation between segments 18C and 18D
The second example shows the use of the conjunction "Aangezien" (For), the tree is
shown in figure 4.3.
[Het is vaak niet eenvoudig om onderscheid te maken tussen hyperhydratie en te hoog
bloedvolume,]19A [aangezien hyperhydratie zowel op zichzelf als in combinatie met te hoog
bloedvolume kan voorkomen.]19B
[ENG: It is often complicated to distinguish between hyper hydration and too high blood volume,]19C [for hyper hydration can appear both on its own and in combination with too high blood volume.]19D
Figure 4.3: The Evidence relation between segments 19C and 19D
The remaining conjunctions signaled relations at a lower frequency. For some words this is caused by the fact that they can also act as a different type of word instead of a conjunction. This can be shown with the following example regarding the word "al": in the first case it acts as a conjunction, while in the second sentence it does not.
Robert pakte een koekje, al had zijn moeder dat verboden.
ENG: Robert took a cookie, although his mother had forbidden it.

Die tijd is al voorbij.
ENG: That time is already gone.
The most important conclusion from table 4.5 is that only a few conjunctions signal relations between sentences. For relations inside sentences, conjunctions can be very useful, but for the automatic recognition of relations between sentences, other discourse markers or sources of information have to be used as well.
Background, Elaboration, Enablement, Evaluation, Evidence, Interpretation, Justify, Motivation, Preparation, Restatement, Solutionhood

Table 4.6: Relations seldom signaled
In discussions between members of the linguistic listserver3, relations which are seldom signaled have been identified. These relations are shown in table 4.6. If this list is compared with the list of relations which were signaled by conjunctions, certain similarities show. Just two of the seldom-signaled relations were indeed found to be signaled by conjunctions: the Elaboration and the Evidence relation. This supports the use of this list for relations which are not signaled and thus hard to find if a text is annotated using conjunctions as discourse markers.
4.3.2 Other Discourse Markers
As shown in table 4.5, conjunctions usually signal relations within sentences rather than between sentences. Other kinds of words which can signal relations are adverbs, more specifically the conjunctive adverbs [PSBA03]. A list of conjunctive adverbs was collected from different sources [vL05, Per]; the combined list is shown in table 4.7 with the corresponding English translations.
Conjunctive adverbs are considered able to signal relations between sentences if they appear as the first word of the second sentence [vL05]. The next thing to know is which relation is signaled by a given conjunctive adverb. To gather this information, for each conjunctive adverb a selection of extracts from the available data was taken and checked manually. Only the adverbs which signaled relations most clearly are noted. Furthermore, conjunctive adverbs with very few occurrences were omitted from the list; for example, the word "hierom" occurs only once in the whole dataset. The results are shown in table 4.8.
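The detection heuristic just described (a conjunctive adverb opening the second sentence signals a relation between the two sentences) can be sketched as follows. The adverb-to-relation pairs below are a small illustrative subset chosen for this sketch, not the full mapping of table 4.8.

```python
# Sketch: a conjunctive adverb as the first word of the second sentence is
# taken to signal a relation between the sentences. The mapping is a small
# illustrative subset assumed for this sketch.
ADVERB_RELATION = {
    "bijvoorbeeld": "Elaboration",
    "bovendien": "Elaboration",
    "daardoor": "Non-Volitional Result",
    "toch": "Concession",
}

def signaled_relation(first_sentence: str, second_sentence: str):
    """Return the relation signaled by the first word of the second sentence, if any."""
    words = second_sentence.split()
    if not words:
        return None
    first_word = words[0].strip(",.;:").lower()
    return ADVERB_RELATION.get(first_word)

print(signaled_relation(
    "De patient kreeg een hoge dosis toegediend.",
    "Daardoor namen de symptomen snel af."))  # Non-Volitional Result
```

A marker in the middle of the sentence is deliberately ignored here, since the text notes that the signal is strongest in sentence-initial position.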
Furthermore, it is useful to know whether there are other adverbs which signal a relation. Therefore a list of adverbs is collected, gathered from a website4 and extracted from the CGN-lexicon. The total number of different adverbs collected in this way exceeds a thousand.
The adverbs which might signal a relation are extracted from the list manually. Since this is done by hand (the adverbs cannot be checked automatically due to the lack of large annotated texts), only a portion of the adverbs is checked. The adverbs which occur very frequently are discarded, because analysis of these words showed they were too ambiguous to use. The adverbs which occur only rarely are discarded as well, because a few occurrences are not enough to determine which relation they represent and whether the adverb signals that relation strongly. Furthermore, an adverb which occurs only a few times is less useful for an automatic recognizer than an adverb which occurs more often.
The adverbs are checked with extracts from the Merck Manual. For each adverb, different extracts which embedded the adverb were gathered. These texts were analyzed, and each time the adverb was found to signal a relation, this was counted. Adverbs which signaled a relation most of the time are kept; the others are discarded.

3 http://listserv.linguistlist.org
4 http://www.muiswerk.nl/WRDNBOEK/BIJWOORD.HTM

Bijgevolg        As a consequence
Bijvoorbeeld     For example
Bovendien        Besides
Daardoor         Therefore
Daarnaast        Moreover
Daarom           Therefore
Daartoe          For that
Dan ook          Accordingly
Derhalve         So
Dientengevolge   Consequently
Dus              So / Therefore
Echter           However
Evenzeer         Also
Hierdoor         Because of this
Hierom           For this reason
Hiertoe          For this purpose
Immers           After all
Namelijk         Namely
Ook              As well
Tenslotte        Finally
Tevens           Also
Toch             Nevertheless
Verder           Apart from that
Vervolgens       Next
Wel              Well
Zo               So

Table 4.7: Conjunctive Adverbs
This resulted in a list of adverbs with their corresponding relations. An overview of these words is shown in table 4.9. Again a distinction is made between the signaling of relations inside sentences and between sentences.
Table 4.9 shows that the relations signaled by adverbs are mainly of the Elaboration type: 13 of the 16 adverbs in the list can signal an Elaboration. This can be explained by the fact that two sentences regarding the same topic usually provide extra information about earlier statements. Adverbs signal a relation most clearly when they appear at the start of the second sentence, but they are also found to signal relations from the middle of the second sentence, as in:
Table 4.8: Conjunctive Adverbs and the relations they might signal. The listed adverbs are Bijvoorbeeld, Bovendien, Daardoor, Daarnaast, Daarom, Derhalve, Dus, Dientengevolge, Hierdoor, Ook, Tevens, Toch, Verder, Vervolgens and Zo; the relations they signal are mostly Elaboration and Non-Volitional Result, besides Sequence, Justify / Evidence, Background, Non-Volitional Cause and Concession. All of these adverbs can signal a relation between sentences, while only a few of them also signal a relation inside sentences.
[Bij volwassenen leidt een tekort aan groeihormoon meestal tot aspecifieke symptomen.]20A [Bij
kinderen leidt het daarentegen tot sterk vertraagde groei en soms tot dwerggroei.]20B
[ENG: With adults, a shortage of growth hormone usually causes non-specific symptoms.]20C
[With children, on the other hand, it causes strongly delayed growth and sometimes dwarfism.]20D
The word "Daarentegen" (on the other hand) signals a Contrast relation between both sentences.
This way of signaling is also found with the conjunctive adverbs: a conjunctive adverb residing in the middle of a sentence also signals a relation, and the relation it signals is the same as when the marker appears at the start of the sentence. For both conjunctive adverbs and other adverbs, however, the signal is stronger when the word appears at the start of the sentence.
The third type of textual unit which can signal a relation is the pronoun, more specifically the demonstrative pronoun. This type of word can be used to link sentences together when the writer wants to elaborate on the subject of a previous sentence. This extra information can for example be a Non-Volitional Result or an Elaboration. Examples of the use of demonstrative pronouns are the following:
Table 4.9: Adverbs and the relations they might signal. The listed adverbs are Daarentegen, Daarna, Daarop, Door, Doorgaans, Echter, Gewoonlijk, Hierbij, Meestal, Om, Ongeveer, Soms, Tegelijkertijd, Vooral, Voorts and Zelfs; 13 of these 16 adverbs can signal an Elaboration, while the remaining entries are Contrast, Concession and Non-Volitional Result. All of them signal relations between sentences only, not inside sentences.
Katten zijn vriendelijke beesten.
Ze worden door veel mensen als huisdier gehouden.
ENG: Cats are friendly creatures.
They are kept as pets by many people.
Sommige auto’s raken total-loss na een ongeluk.
Deze worden dan gesloopt.
ENG: Some cars become a total loss after an accident.
These are then scrapped.
While the first example is best annotated with an Elaboration, a Non-Volitional Result fits better with the second example, resulting in the tree in figure 4.4. The pronoun itself, however, cannot be used to determine which relation exactly is signaled; in the car example, the Non-Volitional Result relation is chosen because of the relation between "total-loss" and "gesloopt" (scrapped). In nearly all cases in the analyzed data where a (demonstrative) pronoun was present, the relation which would be added was the Elaboration relation. Therefore, all pronouns are considered to signal an Elaboration relation.
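The default rule just derived can be sketched as follows. The set of demonstrative pronouns below is an illustrative assumption; the text only discusses demonstratives such as "deze" explicitly.

```python
# A minimal sketch of the default rule: when the second sentence opens with
# a demonstrative pronoun, an Elaboration relation is assumed. The pronoun
# set is an illustrative assumption for this sketch.
DEMONSTRATIVES = {"deze", "dit", "die", "dat"}

def pronoun_relation(second_sentence: str):
    """Return the default relation signaled by a sentence-initial demonstrative."""
    words = second_sentence.split()
    if words and words[0].lower() in DEMONSTRATIVES:
        return "Elaboration"
    return None

print(pronoun_relation("Deze worden dan gesloopt."))  # Elaboration
```

Note that, as the car example shows, this default can be wrong for individual cases (there the relation is a Non-Volitional Result); the rule merely encodes the most frequent outcome in the analyzed data.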
Figure 4.4: Non-Volitional Result signaled by the demonstrative pronoun "Deze"

4.4 Genres

As in all languages, there exist many different genres of Dutch texts. In this section the similarities and the differences between a selection of genres, and the consequences for the annotation of these texts, are described. The comparison makes use of medical texts, fairy tales, recipes, newspaper articles and weather forecasts, although of course there are plenty of other genres. A snapshot of these texts is included in appendix E.
4.4.1 Comparison
First of all, selections of newspaper articles5, fairy tales6, recipes7 and weather forecasts8 are gathered. These texts are collected from the internet, each from multiple sites, to prevent a single site's house style from being mistaken for the style of the genre.
The differences between texts of these genres can be described in several properties. The first property is the composition of the texts. While a fairy tale is mostly a text with practically no lay-out features (apart from chapters and paragraphs), a recipe makes extensive use of bulleted lists. A newspaper lay-out is more like that of a fairy tale, while a medical encyclopedia entry and a weather forecast contain both styles.
The use of characters in the text is the second source of difference. Fairy tales often contain dialogue, since they are mostly about persons or animals who go through adventures, while in medical encyclopedias and recipes dialogue never occurs. Recipes often consist only of a list of actions to be performed by the reader. A newspaper article can again contain both plain text and dialogue; some articles are enriched with interviews with the person the article is about, or with a professional in the subject the article describes.
While fairy tales, and in a way newspaper articles as well, tell stories, weather forecasts, recipes and encyclopedias only present objective information: the reader sees a text containing actions and facts.
There is also a difference in coherence between these genres, as illustrated with the examples below. The first example is a weather forecast:
[De wind is meest matig en draait van ZW naar west tot NW.]21A [De maxima liggen rond 19
graden.]21B [Vanavond wordt het geleidelijk droog en er volgt een droge nacht met opklaringen
en plaatselijk mist.]21C [Minima 10 tot 14 graden.]21D
5 http://www.volkskrant.nl , http://www.trouw.nl , http://www.nrc.nl
6 http://www.sprookjes.nu , http://www.sprookjes.org
7 http://recepten.net , http://smikkelenensmullen.blogspot.com
8 http://www.knmi.nl , http://www.weer.nl
[ENG: The wind is mostly moderate, and turns from SW to west to NW.]21E [The maxima are around 19 degrees.]21F [Tonight it gradually becomes dry, followed by a dry night with bright periods and local fog.]21G [Minima 10 to 14 degrees.]21H
This text contains four sentences which each handle a different aspect of the weather; these aspects relate to each other only in that they are about the weather. The only suitable tree shape for this kind of text is a Joint over all sentences, since each one presents a valid statement about the weather and no statement is more important than another.
The second example is the following recipe:
[Grill voorverwarmen op hoogste stand.]22A [Tomaten wassen en in partjes snijden.]22B [Dressing
door tomaten scheppen.]22C [Schnitzels plat slaan en bestrooien met zout en peper.]22D [In
koekenpan olie verhitten.]22E [Schnitzels in ca. 6 minuten bruin en rosé bakken, halverwege
keren.]22F [Schnitzels op 4 sneetjes brood leggen.]22G
[ENG: Preheat the grill at its highest setting.]22H [Wash the tomatoes and cut them into wedges.]22I [Mix the dressing with the tomatoes.]22J [Flatten the schnitzels and sprinkle them with salt and pepper.]22K [Heat the oil in a frying pan.]22L [Fry the schnitzels brown and pink in about 6 minutes, turning them halfway.]22M [Serve the schnitzels on 4 slices of bread.]22N
In fact the whole recipe is one sequence of actions to be performed. No action is more important than another, so only one tree can be built for such a text: the best approach is to cover all sentences with a single Sequence relation. It is still possible, though, to split the text into parts smaller than a sentence; for example, segment 22D could be split into two pieces: "Schnitzels plat slaan" and "en bestrooien met zout en peper".
Texts from the other genres are usually better suited for annotation with RST, allowing a less trivial RST-tree. Consider, for example, the following part of a fairy tale:
[Ver weg in een mooi land was eens een beeldschoon prinsesje.]23A [Ze woonde in een prachtig
kasteel aan de rand van het bos.]23B [Het prinsesje zat graag bij de vijver in de kasteeltuin.]23C
[ENG: Far away in a beautiful country, once lived a gorgeous little princess.]23D [She lived in a
magnificent castle at the edge of the forest.]23E [The little princess liked sitting at the pond in
the castle garden.]23F
This could be annotated like figure 4.5.
Figure 4.5: The Princess example, in which an Elaboration joins 23A and 23B, which together form the Background for 23C
In the following tables, the numbers of conjunctions for a selection of texts from these genres are shown. Table 4.10 lists the numbers of conjunctions for the newspaper articles. These numbers indicate that the proportion of conjunctions is comparable to that of the medical articles in table 4.3.
Name         Conjunctions   Total Words   Percentage
Newspaper0   37             486           7.61 %
Newspaper1   10             111           9.01 %
Newspaper2   26             284           9.15 %
Newspaper3   72             901           7.99 %
Newspaper4   20             236           8.47 %
Newspaper5   28             355           7.89 %
Newspaper6   29             354           8.19 %
Newspaper7   27             323           8.36 %
Newspaper8   29             398           7.29 %
Newspaper9   27             378           7.14 %

Table 4.10: Conjunction numbers of the Newspapers
Comparing these with the conjunction numbers of the fairy tales, recipes and weather forecasts, shown in table 4.11, reveals higher percentages than for the newspaper texts and the medical texts of table 4.3.
Name         Conjunctions   Total Words   Percentage
FairyTale0   40             398           10.05 %
FairyTale1   12             201           5.97 %
FairyTale2   14             264           5.30 %
FairyTale3   54             471           11.46 %
FairyTale4   38             431           8.82 %
Recipe0      13             148           8.78 %
Recipe1      5              76            6.58 %
Recipe2      10             94            10.64 %
Recipe3      7              53            13.21 %
Recipe4      29             253           11.46 %
Recipe5      15             201           7.46 %
Weather0     7              83            8.43 %
Weather1     7              87            8.05 %
Weather2     7              79            8.86 %
Weather3     8              106           7.55 %
Weather4     14             136           10.29 %
Weather5     7              102           6.86 %

Table 4.11: Conjunction numbers of the Fairy Tales, Recipes and Weather Forecasts
To explain these peaks, the individual conjunctions are inspected. It appears that the conjunction "en" (and) is very frequent within these genres, in contrast to the other conjunctions, which are spread more evenly. The proportion of occurrences of the word "en" is shown in table 4.12.
Text                Percentage
Medical Texts       32.87 %
Newspapers          30.54 %
Recipes             64.94 %
Fairy Tales         39.35 %
Weather Forecasts   40.82 %

Table 4.12: Percentages of the conjunctions which are "en"
The high frequency of the word "en" in some genres points out that some conjunctions are less useful than others. Most occurrences of "en" do not signal a relation at all (about 75% are used in constructions like the one discussed below), while other conjunctions, like "want" (because), always do: no occurrence of "want" was found where it did not signal a relation. An example in which the word "en" does not signal a relation is:
Jan en Piet liepen naar school
ENG: Jan and Piet walked to school
The reason that "want" always signals a relation is that it implies a reason, which the writer adds to the text alongside the statement the reason is about. But "en" can combine subjects which act as a single unit, as in the example above: Jan and Piet are two separate boys, but they walked together, and the fact that they were walking to school is stated about the group rather than about the boys individually.
Table 4.13 shows the percentages of conjunctions in the texts with the occurrences of the word "en" discarded. The main observation is that the differences between the percentages of texts from a single genre decrease. Besides that, the percentages of texts of different genres become more equal: nearly all percentages lie between about 4% and 7%. The only exception are the recipes, where the numbers of conjunctions are very low, except for Recipe4. The difference between Recipe4 and the rest is the style of writing: while the other recipes consist of short statements about cooking, Recipe4 consists of longer sentences which are connected with conjunctions.
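The statistic underlying tables 4.10 to 4.13 can be sketched as follows. The conjunction list below is a small illustrative subset, and the whitespace tokenization is a simplification.

```python
# Sketch of the genre statistic: the percentage of words in a text that are
# conjunctions, optionally discarding "en" as in table 4.13. The conjunction
# list is a small illustrative subset assumed for this sketch.
CONJUNCTIONS = {"en", "maar", "want", "of", "omdat", "aangezien", "al", "dus"}

def conjunction_percentage(text: str, ignore_en: bool = False) -> float:
    words = [w.strip(',.;:?!"').lower() for w in text.split()]
    words = [w for w in words if w]
    conjunctions = [w for w in words if w in CONJUNCTIONS]
    if ignore_en:
        conjunctions = [w for w in conjunctions if w != "en"]
    return 100.0 * len(conjunctions) / len(words)

text = "Jan en Piet liepen naar school want de bus reed niet"
print(round(conjunction_percentage(text), 2))                  # 18.18 ("en" and "want" among 11 words)
print(round(conjunction_percentage(text, ignore_en=True), 2))  # 9.09 (only "want" counts)
```

As the example shows, discarding "en" roughly halves the percentage here, mirroring the drop between tables 4.11 and 4.13 for the "en"-heavy genres.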
The genres are also compared on other discourse markers, like conjunctive adverbs, but these numbers were much lower: only about 1% of the words could be counted as conjunctive adverbs.
Name         Conjunctions   Total Words   Percentage
Text0        28             548           5.11 %
Text1        9              243           3.70 %
Text2        7              153           4.56 %
Text3        22             330           6.67 %
Text4        43             964           4.46 %
FairyTale0   19             398           4.77 %
FairyTale1   8              201           3.98 %
FairyTale2   12             264           4.55 %
FairyTale3   33             471           7.01 %
FairyTale4   25             431           5.80 %
Recipe0      1              148           0.68 %
Recipe1      1              76            1.32 %
Recipe2      2              94            2.13 %
Recipe3      1              53            1.89 %
Recipe4      18             253           7.11 %
Weather0     3              83            3.61 %
Weather1     5              87            5.75 %
Weather2     4              79            5.06 %
Weather3     6              106           5.66 %
Weather4     7              136           5.15 %
Newspaper0   26             486           5.35 %
Newspaper1   9              111           8.11 %
Newspaper2   18             284           6.34 %
Newspaper3   59             901           6.55 %
Newspaper4   15             236           6.36 %

Table 4.13: Conjunction numbers without the occurrences of "en"
Chapter 5
Medical Texts
This chapter describes medical texts, focusing on the Merck Manual. This encyclopedia was chosen because it provides the most data among the possible Dutch sources, such as the Winkler Prins medical encyclopedia [Spe03], and explains the different diseases more thoroughly. It is also preferred for this assignment over medical texts from Wikipedia1, since it is a professional source, whereas Wikipedia is not; it is therefore assumed that the texts of the Merck Manual are better structured. The Merck Manual will be discussed regarding its text and features, after which general properties of medical texts are covered. Furthermore, relations and discourse markers in medical texts are described and a search for special medical discourse markers is performed.
5.1 Merck Manual
For this research the Merck Manual is used. This is an online encyclopedia regarding medical issues which differs from standard encyclopedias in its organization: while many encyclopedias are organized alphabetically, the Merck Manual is organized into classes. Examples of such classes are the heart and lung diseases, the hormone system and cancer.
5.1.1 Composition
The Merck Manual is divided into 24 sections, each describing a separate class, plus appendices. Each section is divided into chapters; a chapter contains the description of a subsection of the class. The section about the eye, for example, contains 12 chapters such as Disorders of the eye sockets and Disorders of the cornea.
These chapters consist of multiple subsections, and each chapter contains at least a subsection Introduction. The subsections deal with specific aspects of the subject, ranging from the causes of the disorders to the symptoms and possible cures. Sometimes a list of diseases about the subject is provided and symptoms and cures are dealt with per specific disease, or an even further subsectioning is used. An example is the description of apheresis2, a treatment found through the following path:

(Section: Blood) (Chapter: Blood Transfusion) (Subject: Special Transfusion Procedures) (Part: Apheresis)

1 http://www.wikipedia.org/
The complete Merck Manual can be represented as a tree. The root of the tree is the manual itself, with its children being the different sections. In figure 5.1 a graphical representation of this tree is shown. The leaves of the tree are either a subject or a part of a subject, and they contain the actual text. In figure 5.2 a screenshot of the Merck Manual is shown. It displays the subject Functie van de voorkwab van de hypofyse of the chapter Aandoeningen van de hypofyse; the section to which this chapter belongs is called Hormonale stelsel. On the left of the figure, the other subjects of this chapter and their parts are shown. The subject Acromegalie consists of three parts: Symptomen, Diagnose and Behandeling. Appendix C provides additional background information about the Merck Manual.
Figure 5.1: The structure of the Merck Manual: the manual is the root, with sections and chapters as internal nodes and subjects with their texts as leaves
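The tree of figure 5.1 can be sketched as a simple data structure. The Node class and the lookup helper below are illustrative; only the apheresis path itself comes from the text above, and the leaf text is left as a placeholder.

```python
# A sketch of the tree structure of figure 5.1: the manual is the root,
# sections, chapters and subjects are internal nodes, and the leaves carry
# the actual article text. Node and find are illustrative assumptions.
class Node:
    def __init__(self, name, children=None, text=None):
        self.name = name
        self.children = children or []
        self.text = text  # only leaves carry text

manual = Node("Merck Manual", [
    Node("Blood", [
        Node("Blood Transfusion", [
            Node("Special Transfusion Procedures", [
                Node("Apheresis", text="..."),  # placeholder for the article text
            ]),
        ]),
    ]),
])

def find(node, path):
    """Follow a (section, chapter, subject, part) path down the tree."""
    if not path:
        return node
    for child in node.children:
        if child.name == path[0]:
            return find(child, path[1:])
    return None

leaf = find(manual, ["Blood", "Blood Transfusion",
                     "Special Transfusion Procedures", "Apheresis"])
print(leaf.name)  # Apheresis
```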
5.1.2 Features
The manual provides extra features which are used to present the data in a format that is easier to read, for example lists, tables and headings. Figure 5.3 shows a snapshot of the Merck Manual containing a table.
2 Apheresis (Greek: "to take away") is a medical technology in which the blood of a donor or patient is passed through an apparatus that separates out one particular constituent and returns the remainder to the circulation, from: http://en.wikipedia.org/ as of 07/2006.
Figure 5.2: A screenshot of the Merck Manual
While such a feature is hard to parse, it presents the data in a consistent form. The use of lists in the text could serve as a marker for annotating the information in the list as a Sequence. However, only a small portion of the information presented in the manual is in such a specific format; the main part is plain text. Since these plain texts are very large, the compositional information alone does not suffice, and recognition based on this information is not pursued further.
5.2 Medical texts
In this section different medical texts are discussed. First the Merck texts are described, followed by some other Dutch medical texts.
5.2.1 Merck Texts
Medical texts are a subset of discourses which share common properties. First of all, the subjects of these texts are all related to the medical field; the Merck Manual, for example, gives descriptions of illnesses, but also describes their treatment. A typical Merck text is composed of a selection from the following possibilities. Usually the text starts with a short description of the subject. In many cases the symptoms are described, as well as the diagnosis, prognosis, treatments, causes of the disease and prevention. A subject can also contain several examples of and variations on the topic of the chapter, in which each example is explained with its own selection of diagnosis, treatments etcetera. This composition can add certain properties to the text which can be used for annotation.

Figure 5.3: Tables in the Merck Manual
The text itself also has some special features. A medical encyclopedia is written with the purpose of informing the reader about medical subjects, and a property of these subjects is that one usually wants to prevent them or to know how to treat them. Exceptions and other special cases are also of interest to the reader. This causes the text to contain many cause-and-effect constructions, which suggests that annotated medical text contains a significant number of relations based on this property.
An example:
De snelheid waarmee parodontitis zich ontwikkelt, varieert enorm, zelfs bij mensen die ongeveer
evenveel tandsteen hebben. Dat komt waarschijnlijk omdat hun tandplak andere soorten en
verschillende hoeveelheden bacteriën bevat en omdat mensen verschillend reageren op de
bacteriën.
ENG: The speed at which periodontitis develops varies enormously, even among people who have about the same amount of tartar. This is probably because their plaque contains different kinds and amounts of bacteria, and because people react to the bacteria in different ways.
In the above example, one can spot the following relations: Concession, Non-Volitional Cause and Sequence. The second sentence gives a cause for the fact that the speed at which periodontitis develops varies. There is more than one cause, so a Sequence can be used. Furthermore, a concession is made with the part "even among people who have about the same amount of tartar". A possible annotation would thus be:
[De snelheid waarmee parodontitis zich ontwikkelt, varieert enorm,]25A [zelfs bij mensen die
ongeveer evenveel tandsteen hebben.]25B [Dat komt waarschijnlijk omdat hun tandplak andere
soorten en verschillende hoeveelheden bacteriën bevat]25C [en omdat mensen verschillend
reageren op de bacteriën.]25D
Figure 5.4: The RST-tree of the periodontitis example: a Concession between 25A and 25B, a Sequence joining 25C and 25D, and a Non-Volitional Cause connecting the two
5.2.2 Other Medical Texts
This research is performed using the Merck Manual, but more Dutch medical sources exist. Below, the differences between a professional medical encyclopedia, the Winkler Prins, and a public one, Wikipedia, are described.
The first difference is the way the texts are organized. While the Merck Manual and Wikipedia articles are narrative, the Winkler Prins texts consist of keywords with (short) explanations. A typical text from the Winkler Prins encyclopedia is the following:
Gipsverband
Genor
Omhulling met een in gips gedrenkt verband, toegepast o.a. voor het onbeweeglijk maken van gebroken ledematen, zieke gewrichten en operatief behandelde misvormingen. Daarnaast worden gipsverbanden ook wel toegepast als rustgevend verband bij uitgebreide verwondingen (closed plaster technique). Er wordt onderscheid gemaakt tussen gewatteerde gipsverbanden en ongewatteerde gipsverbanden. Ze kunnen worden gebruikt in de vorm van spalken die 2/3 deel van de omtrek van het te behandelen lichaamsdeel omvatten, of in de vorm van circulair gips, waarbij het lichaamsdeel volledig wordt omhuld.
ENG: Plaster cast. Encasement with a bandage soaked in plaster, applied among other things to immobilize broken limbs, diseased joints and surgically treated deformities. Plaster casts are also applied as a resting bandage for extensive injuries (closed plaster technique). A distinction is made between padded and unpadded plaster casts. They can be used in the form of splints which cover 2/3 of the circumference of the body part to be treated, or in the form of a circular cast, in which the body part is fully encased.
The texts of Wikipedia are similar to those of the Merck Manual. This can also be shown by the degree of coherence, expressed by the number of conjunctions each source has. In tables 5.1 and 5.2, the conjunction numbers of the Winkler Prins and the Wikipedia articles are shown, respectively.

Name       Conjunctions   Total Words   Percentage
Winkler0   6              72            8.33 %
Winkler1   4              74            5.41 %
Winkler2   9              114           7.89 %
Winkler3   5              83            6.02 %
Winkler4   9              154           5.84 %

Table 5.1: Conjunction numbers of the Winkler Prins Encyclopedia articles

Name    Conjunctions   Total Words   Percentage
Wiki0   21             189           11.11 %
Wiki1   8              123           6.50 %
Wiki2   12             131           9.16 %
Wiki3   11             255           4.31 %
Wiki4   5              142           3.52 %
Wiki5   18             305           5.90 %
Wiki6   17             204           8.33 %
Wiki7   34             453           7.51 %
Wiki8   13             121           10.74 %
Wiki9   14             151           9.27 %

Table 5.2: Conjunction numbers of the Wikipedia articles

While the number of conjunctions in the Winkler Prins texts is relatively low, the Wikipedia articles equal the Merck articles. Furthermore, the Wikipedia articles are written less professionally than the other encyclopedias: while a professional encyclopedia like the Merck Manual is reviewed by doctors and medical specialists, anyone can add text to a Wikipedia article. As a result the texts vary in coherence. This can also be seen in the conjunction numbers of the Wikipedia articles, which range from 3.52% to 11.11%, while those of the Merck Manual vary from 3.77% to 9.88%. Although these upper and lower values do not differ very much, most of the Merck Manual values lie between 6% and 9%, while the percentages of the Wikipedia articles are spread evenly over the whole range.
5.3 Relations in Medical Texts
Since medical texts, like all texts, are written with a certain purpose, it is likely that some relations occur more often in medical texts than in other texts. Because medical texts tell us about diseases and (therefore) about prevention and cures, cause and result relations such as Non-Volitional Result and Non-Volitional Cause are expected, while a relation like Interpretation might be expected less frequently. Secondly, it is useful to know whether some relations are relatively similar; this holds for all text domains. If relations are relatively similar, they can be grouped so that a smaller set of relations suffices for the (automatic) annotation of these medical texts.
To measure whether grouping is indeed possible, manually annotated texts of different annotators are compared. A randomly chosen selection of 10 texts from the Merck Manual was gathered and annotated by four different annotators. These texts are annotated at sentence level, so each elementary discourse unit (EDU) consists of a single sentence. For the annotations, the original 24 relations as defined by Mann and Thompson are used. Although there are differences between the annotations, similarities are found as well. Differences occur in the structure of the resulting trees, and even where (a part of) the structure was similar, the annotators still added different relations. The first similarity concerns the relations used: the annotators actually used only a few of the original relations. Table 5.3 gives an overview of the relations used; the numbers are the summed numbers of occurrences for each relation over all annotators.
Relation                Occurrences   Percentage
Elaboration             113           51.36 %
Non-Volitional Result   22            10.00 %
Non-Volitional Cause    12            5.45 %
Antithesis              11            5.00 %
Background              10            4.54 %
Concession              9             4.09 %
Contrast                7             3.18 %
Sequence                6             2.73 %
Circumstance            5             2.27 %
Joint                   5             2.27 %
Interpretation          4             1.82 %
Justify                 4             1.82 %
Restatement             2             0.90 %
Volitional Cause        2             0.90 %
Volitional Result       2             0.90 %
Condition               1             0.45 %
Enablement              1             0.45 %
Evaluation              1             0.45 %
Motivation              1             0.45 %
Otherwise               1             0.45 %
Summary                 1             0.45 %
Total                   220           100 %

Table 5.3: Total number of occurrences of the relations
The relations Evidence, Purpose and Solutionhood were never used. The relations Condition, Enablement, Evaluation, Motivation, Otherwise and Summary were used only once, each by a single annotator with whom the others did not agree; this suggests an analytical error [MT88]. If these relations were omitted from the list, there would still be a coverage of 97.27% while using only 15 of the original 24 relations. In table 5.4, the numbers of occurrences per annotator are shown. These numbers differ per annotator, for which two reasons can be identified. First, in some annotated texts not all EDUs were actually connected. Secondly, the Sequence relation and the Joint relation are counted as single relations, while they can cover more than two EDUs. If a Sequence with n nuclei were counted as (n-1) relations, the number of Sequence relations would increase by 7. The number of Joint relations would not change, since Joint was never used to connect more than two nuclei.
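The counting convention just described can be sketched as follows. The encoding of an annotation as a (relation, number of connected units) pair is an assumption for this sketch; under it, a relation spanning n units contributes n-1 counted relations, so an ordinary nucleus-satellite relation (2 units) counts as one.

```python
# Sketch: a multinuclear relation such as Sequence with n nuclei counts as
# n-1 relations; an ordinary nucleus-satellite relation (2 units) counts as
# one. The (relation, units) tuple encoding is an assumption for this sketch.
from collections import Counter

def count_relations(annotations):
    """annotations: list of (relation_name, number_of_connected_units) pairs."""
    counts = Counter()
    for relation, units in annotations:
        counts[relation] += units - 1  # n connected units -> n-1 relations
    return counts

# A Sequence spanning 4 nuclei contributes 3 relations; a plain Elaboration one.
print(count_relations([("Sequence", 4), ("Elaboration", 2)]))
```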
Relation                P1   P2   P3   P4   Average
Elaboration             21   35   27   30   28.25
Non-Volitional Result   9    3    2    8    5.5
Non-Volitional Cause    4    1    2    5    3
Antithesis              0    0    3    8    2.75
Background              3    2    2    3    2.5
Concession              5    3    1    1    2.5
Contrast                2    5    1    0    2
Sequence                1    1    3    1    1.5
Joint                   0    4    0    1    1.25
Circumstance            3    0    0    1    1
Interpretation          1    1    2    0    1
Justify                 1    2    1    0    1
Restatement             0    1    1    0    0.5
Volitional Cause        2    0    0    0    0.5
Volitional Result       2    0    0    0    0.5
Condition               1    0    0    0    0.25
Enablement              0    0    1    0    0.25
Evaluation              0    0    1    0    0.25
Motivation              1    0    0    0    0.25
Otherwise               0    0    1    0    0.25
Summary                 0    1    0    0    0.25
Total                   56   59   48   58   55.25

Table 5.4: Number of occurrences of the relations per annotator
It is clear that Elaboration is the most used relation: if every relation were simply tagged as Elaboration, a coverage of more than half would already be achieved.
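This majority-class baseline follows directly from the totals in table 5.3:

```python
# The Elaboration baseline: tagging every relation as Elaboration covers
# 113 of the 220 annotated relations from table 5.3.
elaboration = 113
total = 220
coverage = 100.0 * elaboration / total
print(round(coverage, 2))  # 51.36
```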
After annotation, the results are compared and it is noted which relations are often confused with each other. This comparison is based on relations between the same elementary discourse units: if two annotators added different relations between the same EDUs, those relations are considered to be confused. The relations which are confused multiple times are grouped together. Table 5.5 contains an overview of these groups of confused relations.
The table shows that different kinds of relations can be confused: hypotactic relations with each other, but also hypotactic relations with paratactic ones.
The + symbol after the Elaboration relation in table 5.5 means multiple instances. The
Sequence relation can thus be confused with multiple Elaboration relations. This is shown
in figure 5.5.
46
5.4. MEDICAL DISCOURSE MARKERS
Contrast / Antithesis / Concession
Non-Volitional Result / Non-Volitional Cause / Volitional Result / Volitional Cause
Background / Elaboration
Evidence / Elaboration
Sequence / Elaboration+

Table 5.5: Confused relations (each line lists relations that were confused with each other)
[Figure omitted: two alternative analyses of EDUs A, B and C, (a) connected by Elaboration relations only, and (b) connected by a Sequence relation combined with an Elaboration.]

Figure 5.5: Sequence vs. Elaborations
Table 5.5 motivates the use of a smaller subset of relations. Since relations are confused, they can be grouped under a covering relation. When annotating, the group name of the relations can then be used: confused relations get the same group name. If two annotated texts are compared with each other, the use of these covering groups makes the task easier.
One of the reasons for confusion is that texts can be ambiguous. If two parts of a sentence are related by a cause-result relation, the choice between a result and a cause relation depends on which part is the nucleus and which the satellite. Other reasons are, for example, analytical errors or differences between annotators.
5.4 Medical Discourse Markers
The content of medical texts suggests that there might be special discourse markers in these
texts that signal certain relations. If a text contains a sentence like:
De oorzaak van het tekort aan bloedplaatjes is een autoantistof tegen
bloedplaatjes.

ENG: The cause of the shortage of blood platelets is an autoantibody
against blood platelets.
the use of a Non-Volitional Cause relation is obvious, although there is no conjunction, adverb or preposition present. While the marker "De oorzaak van" (the cause of) might be general, there are other markers which tend to be used more in medical texts, for example the word "symptomen" (symptoms), which in the example below might signal an Elaboration relation:
Andere symptomen zijn verwijde pupillen, kippenvel, tremoren,
spierkrampen ... [text omitted]

ENG: Other symptoms are widened pupils, goosebumps, tremors,
muscle cramps ... [text omitted]
Using these markers for (automatic) annotation can be hard, since they are ambiguous. In the following example it would be better to segment at the discourse marker "maar" (but) and assign a Concession relation instead of a Non-Volitional Cause:
De oorzaak van alcoholisme is onbekend, maar alcoholgebruik is
niet de enige factor.

ENG: The cause of alcoholism is unknown, but the use of alcohol
is not the only factor.
To test the use of these special discourse markers, lists of possible phrases were gathered, containing verb and noun constructions. The selection is discussed below.
Selection     1    2    3    4
1            27   21   19   18
2             *   24   16   15
3             *    *   21   16
4             *    *    *   20

Table 5.6: Shared noun constructions between the four selections (diagonal cells give the number of constructions in each selection)
5.4.1 Noun Constructions
To gather these phrases, four random selections of a hundred sentences each were extracted from the Merck Manual. These selections were checked manually and the possible discourse markers were noted. The selections contained between 20 and 27 noun constructions each, where no difference is made between the singular and plural forms.
The lists of possible noun constructions were then intersected with each other. The results are shown in table 5.6; the numbers represent the number of shared noun constructions. For example, selections two and three contain 24 and 21 noun constructions respectively, and share 16 of them. For the final list
Noun construction  Translation
Aandoening         Disorder
Behandeling        Treatment
Afwijking          Deviance
Complicatie        Complication
Diagnose           Diagnosis
Gevolg             Result
Manier             Way
Methode            Method
(Genees)middel     Cure
Onderzoek          Investigation
Oorzaak            Cause
Operatie           Operation
Probleem           Problem
Reactie            Reaction
Risico             Risk
Stadium            Stage
Symptoom           Symptom
Tekort             Shortage
Vorm (van)         Form (of)
Verandering        Change
Vergroting         Increase
Verschijnsel       Phenomenon

Table 5.7: Possible Noun Constructions as discourse markers
of noun constructions, all nouns that appear in at least three of the four selections are taken. The results are shown in table 5.7.
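The intersection procedure behind table 5.6 and the "at least three of four selections" criterion can be sketched as follows, here in Python rather than the Perl used for the actual tools; the noun lists are illustrative placeholders, not the real selections:

```python
from collections import Counter
from itertools import combinations

# Hypothetical stand-ins for the four 100-sentence selections;
# each set holds the noun constructions found in that selection.
selections = {
    1: {"oorzaak", "symptoom", "behandeling", "gevolg"},
    2: {"oorzaak", "symptoom", "behandeling", "diagnose"},
    3: {"oorzaak", "symptoom", "risico"},
    4: {"oorzaak", "complicatie", "behandeling"},
}

# Pairwise intersection counts, as tabulated in table 5.6.
shared = {(a, b): len(selections[a] & selections[b])
          for a, b in combinations(selections, 2)}

# Final list: nouns appearing in at least three of the four selections.
counts = Counter(noun for s in selections.values() for noun in s)
final_list = sorted(n for n, c in counts.items() if c >= 3)

print(shared[(2, 3)])  # nouns shared by selections two and three
print(final_list)
```

With the real selections, `shared` would reproduce the off-diagonal cells of table 5.6 and `final_list` the contents of table 5.7.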
After selecting the list of nouns, a statistical analysis was performed to see how often these nouns actually occur in the Merck Manual. The nouns were counted in both their singular and plural forms. The entire Merck Manual contains about 40k sentences and 666k words. Table 5.8 shows the number of occurrences.
To check whether these noun constructions actually signal a relation, and if so which relation exactly, a classification had to be made. This is explained in section 5.4.3.
5.4.2 Verb Constructions
The same four selections were checked for verbs, in a similar way as for the nouns. The number of verb constructions varied between 30 and 43. For the verb constructions, all inflections of the root of the verb were counted. The same intersection procedure as described for the nouns was applied to the verbs as well. Table 5.9 shows the intersection counts of the possible verb constructions. For example, selection three contains 37 verb constructions and selection four 30; they share 23 verb constructions.
Again, all verbs which appear in at least three of the four selections were selected. These
Word          Singular  Plural
Aandoening        1032     497
Afwijking          147     359
Behandeling       1593     138
Complicatie         77     237
Diagnose           736       9
Geneesmiddel       445    1120
Gevolg             882     126
Manier             162      66
Methode            118      76
Middel             574     502
Onderzoek          781     253
Oorzaak            772     207
Operatie           394      39
Probleem           221     391
Reactie            259     129
Risico             431      36
Stadium            205      48
Symptoom           112    1689
Tekort             161      17
Verandering         48     239
Vergroting          45       0
Verschijnsel        19      46
Vorm (van)         622     496

Table 5.8: Numbers of the Nouns
Selection     1    2    3    4
1            43   22   28   25
2             *   35   22   18
3             *    *   37   23
4             *    *    *   30

Table 5.9: Shared verb constructions between the four selections (diagonal cells give the number of constructions in each selection)
are shown in table 5.10. The Dutch verb Voorkomen has two different meanings: it can have the same meaning as Optreden (To Occur), but can also be translated as "To Prevent". The intended meaning depends on the context the verb appears in.
The classification of the verbs, to check which relations they signal, is explained in section 5.4.3.
5.4.3 Signaled Relations
To be able to use the noun and verb constructions, it must be clear which relation they signal. Furthermore, it should be known in which situations they do or do not signal
Verb construction  Translation
Aantasten          To Affect
Afnemen            To Decrease
Beginnen           To Start
Behandelen         To Treat
Dalen              To Decrease
Gebruiken          To Use
Helpen             To Help
Herstellen         To Recover
Kenmerken (door)   To Characterize (by)
Krijgen            To Get
Leiden tot         To Lead to
Onderzoeken        To Investigate
Ontstaan           To Originate
Optreden           To Occur
Produceren         To Produce
Toenemen           To Increase
Toepassen          To Apply
Uitvoeren          To Carry Out
Vaststellen        To Find
Verbeteren         To Improve
Vermijden          To Avoid
Verminderen        To Decrease
Veroorzaken        To Cause
Verwijderen        To Remove
Voordoen           To Display
Voorkomen          To Prevent
Voorkomen          To Occur

Table 5.10: Possible Verb Constructions as discourse markers
a relation. Therefore a classification was made for these markers. Table 5.11 shows the relations which can be signaled by noun constructions. This table was generated by manually checking pieces of text containing an occurrence of each noun construction.
Noun          Relation
Behandeling   Elaboration
Gevolg        Non-Volitional Result
Oorzaak       Non-Volitional Cause
Stadium       Elaboration
Symptoom      Elaboration
Verschijnsel  Elaboration
Vorm          Elaboration

Table 5.11: Noun Constructions and the relations they signal
Verb              Relation
Afnemen           Elaboration
Beginnen          Elaboration
Kenmerken (door)  Elaboration
Leiden tot        Non-Volitional Cause
Ontstaan          Elaboration
Optreden          Elaboration
Toenemen          Elaboration
Vermijden         Elaboration
Verminderen       Elaboration
Veroorzaken       Non-Volitional Cause
Voordoen          Elaboration

Table 5.12: Verb Constructions and the relations they signal
It appears that a great deal of the noun constructions which were hypothesized to signal relations in fact did not. Although the appearance of the word suggested the presence of a relation (mostly an Elaboration), the actual signaling was done by a different word type. This is shown in the next example:
[Neurogene pijn is het gevolg van een afwijking ergens in een zenuwbaan.]27A [Een dergelijke afwijking verstoort de zenuwprikkels, die vervolgens in de hersenen verkeerd worden geïnterpreteerd.]27B

[ENG: Neurogenic pain is the result of an abnormality somewhere in a nerve tract.]27C [Such an abnormality disturbs the nerve signals, which are subsequently misinterpreted in the brain.]27D
Here the Elaboration relation, which indeed holds, is signaled by the word "Dergelijke" (such), not by the word "Afwijking". If, for example, the second sentence had been "Een andere afwijking verstoort ..." (a different abnormality disturbs ...), the relation between the first and second sentence would be better described as a Concession. The indication that some relation holds between the sentences, however, still remains. The procedure applied to the nouns has also been carried out for the verb constructions, resulting in table 5.12.
The table with verbs contains only a small part of the words originally selected, most of them signaling Elaboration relations. This has the same reason as with the noun constructions: the verbs do hint that a certain relation is present, but if a different word type, like for example a conjunctive adverb, is present, the relation is more likely to be signaled by that conjunctive adverb.
5.4.4 Time Markers
Since the Merck Manual is a medical encyclopedia, it describes diseases, and these are often discussed in a chronological way. For example, the effects of a treatment are described as follows:
[Bij bloedverlies zal de hartfrequentie toenemen.]28A [Nadat de bloeding is gestopt, wordt er vocht vanuit de weefsels in de bloedsomloop opgenomen.]28B

[ENG: Loss of blood will increase the heart rate.]28C [After the bleeding has stopped, fluid from the tissues is absorbed into the bloodstream.]28D
Relations between sentences like these can be signaled by words which indicate a time flow. Examples are shown in table 5.13; these markers include combinations of words as well.
Eerst                  First
Nadat                  After
(Ten) Eerste / Tweede  First / Second
Tenslotte              Finally
Tijdens                While
Uiteindelijk           Finally
Verder                 Further
Voordat                Before

Table 5.13: Words which indicate a time flow
Time markers may indicate a relation with sentences which occurred before, although the exact relation is not determined by the marker itself. These markers of course appear in other texts as well, but since this chronological style of description appears quite often in medical texts, they might be useful.
Some of the time markers in table 5.13 signal strongly. Others only indicate that a relation is present, while the actual relation is most likely signaled by a different word type, like an adverb. The time markers were found to signal a Non-Volitional Result or an Elaboration. Table 5.14 shows the words and the relation they signal.
Eerst                Elaboration
Ten eerste / tweede  Elaboration
Tijdens              Elaboration
Verder               Elaboration
Tenslotte            Non-Volitional Result
Uiteindelijk         Non-Volitional Result

Table 5.14: Time markers and their relations
Chapter 6

Automatic Recognition
In this chapter the automatic recognition of relations in Dutch medical texts is discussed. First the current state of automatic recognition is described, followed by a discussion of the use of rule-based annotation versus machine learning techniques. After that, the separate parts of the automatic recognizer developed during this research are described. Subsequently, the approach to segmentation and the creation of the sets of relations and discourse markers used by the recognizer are described and motivated. Next, the recognition of relations is described, together with the assumptions and notions needed for it. The last three sections describe the algorithm used for the recognition, the hierarchy applied for scoring the recognized relations, and a worked example of the process.
6.1 State of the Art
Several studies on the automatic annotation of texts with Rhetorical Structure Theory have been performed. In [Mar97a] and [Mar97b], Marcu describes a method for the automatic rhetorical parsing of natural language texts. He presents four different paradigms, among which are a constraint satisfaction problem and a theorem proving problem. These are used to produce algorithms that solve the problem of text structure derivation, and those algorithms in turn are used to create a rhetorical parser. The parser produces all combinations found, and removes the trees which do not conform to the specification. Marcu takes discourse markers as the basis for indicating the presence of relations. This research was performed for the English language only.
In [HHS03], a chunk parser [Abn91] is used with a feature-based grammar. This approach is based on the use of discourse markers as well, and is intended for English texts.
Wattanamethanont et al. used a Naive Bayes classifier for the automatic recognition of rhetorical relations in Thai [WSK05]. To train the classifier they used a corpus of 2850 Elementary Discourse Unit pairs, split in a 90%/10% training/testing ratio. The features used by the classifier are discourse markers, key phrases and word co-occurrence. This approach is similar to [ME02], where Marcu trains a Naive Bayes classifier to recognize four types of relations between arbitrary sentences, even if no discourse marker is present. The classifier used by Marcu is trained with the use of two corpora, with more than
40 million sentences, from which different training sets, each containing millions of sentences, were extracted.
The PhD thesis of Corston-Oliver [CO98] describes the Rhetorical Structure Theory Analyzer (RASTA). For a given relation, the program first tests conditions, such as the ordering of the clauses the relation will relate, and afterwards checks whether a marker is present at all. Such conditions are similar to the conditions for the relations shown in appendix A. Possible relations receive a heuristic score. RASTA generates no invalid trees, and thus does not need to validate the generated trees afterwards.
The ConAno tool developed by Stede and Heintze [SH04] can be used for assistance when manually annotating text with rhetorical relations. It outputs files in the .rs3 format used by RSTTool [O'D00]. ConAno uses a discourse marker lexicon, with the relations the markers can signal. The tool does not make decisions about which relation to use, but gives hints to the annotator. It was originally developed for German, but with a different lexicon it can be applied to other languages as well. The creation of ConAno was inspired by the work performed for the Potsdam Corpus [Ste04], a corpus of German newspaper articles annotated with multiple information types, such as rhetorical structure, part-of-speech and co-reference.
Different corpora of rhetorically annotated texts have been developed, such as the RST corpus [CMO03] by Carlson et al. This corpus consists of 385 Wall Street Journal articles from the Penn Treebank [MMS93]. Another corpus that originated from the Penn Treebank is the Penn Discourse Treebank (PDTB) [MRAW04b]. The PDTB contains large texts in which the discourse connectives are annotated along with their arguments; it contains no RST annotations.
6.2 Automatic Annotator
For the automatic annotation of texts, a program has been written which consists of three main elements: the segmenter, the recognizer and the tree-builder. Each element is written in the Perl programming language (http://www.perl.com). Together the parts build a tree from the input text and store it in a .rs3 file. These .rs3 files can be imported into RSTTool and edited afterwards. The main parts are discussed separately below.
6.2.1 The Segmenter
The first element is called the segmenter. This part takes text as input and segments it
into non-overlapping spans of text, the elementary discourse units, which will be used to
recognize the relations. This is a fairly simple process, which just breaks the text at certain
points. The segmentation process is explained in section 6.3. If a text with multiple sections
is used as input, these sections are treated as if they were one.
6.2.2 The Recognizer
In this assignment, rule-based recognition is used to recognize relations. Due to the lack of large quantities of annotated data, machine learning as applied in the research presented in section 6.1 could not be used. The recognizer is equipped with lists of discourse markers and the relations that they can signal. For each rule type a likelihood estimate is used. These rules are static: the recognizer can change neither the markers nor the parameters; these values can only be changed by a user.
The recognizer takes the Elementary Discourse Units produced by the segmenter and outputs two lists of relations between the segments. The first list is an overview of all possible relations, giving for each relation the segments it connects, its type and its likelihood. The second list is a subset of the first and contains the nodes of the most likely tree that is automatically recognized. The list contains relations in the following format:
[3 4 N 100 elaboration]
This example shows an Elaboration relation between EDU number 3 and EDU number 4, where the first one is the nucleus, as denoted by the letter N. The number 100 expresses the likelihood that this relation actually holds; this is explained in section 6.6. A different relation is:

[2 3 S 30 concession]

The entry above expresses a Concession relation between EDU number 2, the satellite (denoted by the letter S), and EDU number 3, the nucleus, with a likelihood of 30.
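These list entries can be parsed as in the sketch below, written in Python while the recognizer itself is in Perl; the field names are my own, since the thesis does not specify the internal representation:

```python
from typing import NamedTuple

class Relation(NamedTuple):
    edu_a: int        # first EDU of the pair
    edu_b: int        # second EDU of the pair
    role: str         # "N": first EDU is the nucleus, "S": it is the satellite
    likelihood: int   # likelihood score that the relation holds (section 6.6)
    name: str         # relation label, e.g. "elaboration"

def parse_entry(entry: str) -> Relation:
    """Parse one list entry such as '[3 4 N 100 elaboration]'."""
    a, b, role, score, name = entry.strip("[]").split()
    return Relation(int(a), int(b), role, int(score), name)

r = parse_entry("[3 4 N 100 elaboration]")
print(r.role, r.likelihood, r.name)
```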
6.2.3 The Tree-Builder
The tree-builder takes the second list of the recognizer, containing the final relation data. It builds the actual tree from this data and stores it as an .rs3 file. All segments are added to the result; if no relation is found for a certain segment, it is added without any relation information. The .rs3 file can be read with RSTTool and edited afterwards, so missing or falsely recognized relations can be corrected.
6.3 Segmentation
Although the actual segmentation is just a small portion of the recognition work, it is an important part. The segmenter provides the data the recognizer works with: it creates the building blocks of the tree that is to be produced. Therefore one must be sure which rules to use for segmentation. There are two different segmentation approaches.
The first is to use complete sentences as Elementary Discourse Units. This segmentation requires no knowledge about the text; it is sufficient to search for sentence boundaries and segment at these points. Every full stop is considered a segment boundary. Since this segmentation is performed merely on sentence boundaries, errors might occur when, for example, an abbreviation with a dot is followed by a capital, as in "I went to see dr. Jones". However, such cases did not occur in the tests.
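A minimal sketch of this first approach, in Python (the actual segmenter is written in Perl and is not shown in the thesis), which also reproduces the abbreviation pitfall described above:

```python
import re

def segment(text: str) -> list[str]:
    """Naively treat every full stop followed by whitespace as a sentence
    boundary and return the resulting EDUs."""
    parts = re.split(r"(?<=\.)\s+", text.strip())
    return [p for p in parts if p]

edus = segment("De oorzaak van alcoholisme is onbekend. "
               "Alcoholgebruik is niet de enige factor.")
print(len(edus))  # 2 EDUs

# The known failure case: the dot after the abbreviation also splits.
print(len(segment("I went to see dr. Jones. He was out.")))  # 3 EDUs, not 2
```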
The second approach is to segment within sentences as well. This is a much harder, error-sensitive task: knowledge about the sentence is needed to segment correctly. In section 3.2 an example from the Merck Manual is described, in which sentences were manually split at certain points, chosen because the part of the sentence after such a point could be related to the previous part. Whereas spotting sentence boundaries suffices for segmentation between sentences, no such lexical cue is available for within-sentence segmentation. Although the comma can indeed be used to segment in a reasonable number of cases, not all commas are suited.
The automatic recognizer uses the first approach. This was chosen because the recognition of relations between sentences is harder than within sentences. Furthermore, it prevents differences between a manual and an automatic segmentation of a text, which ensures that a comparison such as the one in chapter 7 can be made.
6.4 Defining the Relation Set
As described in chapters 4 and 5, there is a certain subset of the original relations by Mann & Thompson whose members do not occur often in the used data. Furthermore, another (partially overlapping) subset of relations is seldom signaled. Third, certain relations are confused with each other during annotation. These observations suggest using only a (small) subset of relations in the automatic recognition process; a smaller list of relations also simplifies the recognition process. To gather this relation set, the reasons given are taken into account. Some relations are hardly used: in the annotated data there are 6 relations which were found only once, and 3 relations which were never used, as shown in table 5.3. Furthermore, a seldom used relation can be confused with a relation which does occur often, so the seldom used relations are omitted from the relation set. A drawback of this smaller list is a decrease in coverage. A reason for confusion can, for example, be that the relations are signaled by the same discourse marker. This approach for defining a set of relations is also used in [Kno93]. Since relations such as Concession and Antithesis can be signaled by the same marker, these can be grouped; for example, the conjunction Maar (but) can signal a Contrast, Antithesis or Concession.
The definitions of the relations as shown in appendix A can be used to group some relations. In [MAR99] it is argued that in the cases where an Evidence relation was used, an Elaboration relation would also fit, since the latter relation is less specific than the former. The definitions of the three relations grouped as Concession differ only slightly. The Elaboration relation definition covers most other relations, partly because of the definition of the Elaboration relation itself, and partly because human annotators tend to confuse other relations with it.
The Non-Volitional Cause and Non-Volitional Result cover their volitional counterparts. The only difference between a Non-Volitional Cause and a Volitional Cause is that the cause must be volitional for the second relation; however, a discourse marker like "Daardoor" gives no clue whether a cause is volitional or not. Since the (Non-)Volitional Result and the (Non-)Volitional Cause differ more from each other, they are added as different groups, although they did get confused by human annotators, as shown in section 5.3.
Elaboration
(Non-)Volitional Cause
(Non-)Volitional Result
Concession

Table 6.1: List of Relations to use
These notes result in the use of the relations shown in table 6.1. The Elaboration relation covers most of the occurring relations; together with the other three relations, the coverage is 88.2%.
6.5 Used Discourse Markers
For the automatic recognition, use is made of different kinds of discourse markers: conjunctions, adverbs, pronouns, medical discourse markers and implicit markers. For each kind of discourse marker, restrictions and the way these markers are used to recognize relations are defined. The discourse markers are discussed below. In some cases discourse markers cannot be used to recognize relations; in those cases keyword repetition is applied, with the use of Alpino [BvNM01].
6.5.1 Conjunctions
The first type of discourse marker used is the conjunction. In chapter 4, a list of conjunctions and their relations is presented. This list is used for the recognition of relations between elementary discourse units. For the actual recognition, i.e. assigning the precise relation and its likelihood, it must be known how and when the relation can be added.
The first thing to note about conjunctions is that they usually signal a relation between elementary discourse units which belong to the same sentence. Take for example the conjunction Zodra (once): if it occurs inside a sentence, it is likely to signal a relation, as it does in the next example:
Ik bel je, zodra de uitslagen bekend zijn.

ENG: I will call you once the results are known.
However, it is possible that a conjunction signals a relation between sentences as well. This is because it is possible to split a sentence at a conjunction (with some rewriting) into two sentences which still belong together. This can be done to prevent the creation of long sentences while writing, although the result can look rather childish, so such rewrites are not expected to be very common in an encyclopedia. Take for example the short sentence:
Henk krijgt een cadeau, hoewel hij niet jarig is.

ENG: Henk receives a present, although it is not his birthday.
It can be split into:
Henk krijgt een cadeau.
Hoewel hij niet jarig is.
The relation between both sentences is still the same, so it is possible to use conjunctions for the recognition of relations between sentences as well. The constraint, however, is the placement of the conjunction: it must appear as the first word of the second sentence, because if it occurs at a different spot, it is more likely to signal a relation within the sentence than between sentences. In the checked data no occurrence of such a rewrite was found, so it was decided that the recognizer will not use conjunctions for the recognition of relations between sentences.
Conjunctions can also appear at the beginning of a sentence while signaling a relation within that sentence, as the following example shows:
Zodra ik de lotto win, gaan we op vakantie.

ENG: Once I win the lottery, we will go on vacation.
This shows that conjunctions are also ambiguous in whether they signal relations between sentences or within them. Therefore their use in the recognizer is omitted, since the recognizer is intended to find relations between sentences rather than within them.
6.5.2 Adverbs
The second type of discourse marker is the (conjunctive) adverb. A selection of discourse markers of this type is also presented in chapter 4. Adverbs, like conjunctions, signal relations both inside sentences and between sentences. Unlike conjunctions, however, adverbs also signal relations between sentences when they do not appear as a rewrite result, i.e. not at the start of the second sentence.
Take for example the word Vervolgens (next). In both of the following texts, the word signals an Elaboration:
De receptie begint om acht uur 's avonds.
Vervolgens is er een borrel in zaal drie.

ENG: The reception starts at 8 PM.
Next there is a drink in room three.

De receptie begint om acht uur 's avonds.
Er is vervolgens een borrel in zaal drie.

ENG: The reception starts at 8 PM.
There is a drink next in room three.
Adverbs which start a sentence are most likely to signal a relation with another sentence instead of a relation inside the sentence.
Alle, Bepaalde, Dat, Dergelijke, Deze, Dit, Elk(e), Sommige, Verschillende, Ze, Zulke

Table 6.2: (Demonstrative) Pronouns
6.5.3 Pronouns
Pronouns are the third type of discourse marker. A small list of the pronouns used is shown in table 6.2.
Pronouns can signal an Elaboration relation between sentences if used at the start of the relating sentence. They can also signal a relation when used in the middle of a sentence, but with a lower likelihood, since another relation could be more suited; such a relation is probably signaled by a different marker in that case.
To show the difference in relation assignment due to pronouns, consider the next two examples:
Criminelen horen in de cel.
Het kan zijn dat dergelijke mensen er reeds zitten.

ENG: Criminals belong in jail.
It is possible that people like those are already there.

Criminelen horen in de cel.
Toch kunnen sommige criminelen beter naar een TBS-inrichting.

ENG: Criminals belong in jail.
Still, some criminals would be better off in a TBS facility.
While the first example can be related with an Elaboration, the second example needs a Concession relation to be connected properly. This indicates that pronouns are overruled in importance by other discourse markers: the pronoun signals the existence of a relation, but to determine which relation is to be added, other discourse markers are necessary.
For the automatic recognition, pronouns are only used as an explicit marker for an Elaboration relation if they are present at the start of a sentence. If they are present in the middle of a sentence, they are at best used as an indication that a relation is present. This is because these pronouns can also signal a relation within the sentence. This is illustrated by the following extracts from the Merck Manual, where the word "Dit" (this) signals the relations.
[Voor elke vitamine is de aanbevolen dagelijkse hoeveelheid (ADH) vastgesteld.]29A [Dit is de hoeveelheid die een gemiddelde persoon dagelijks nodig heeft om gezond te blijven.]29B

[ENG: For every vitamin, the recommended daily intake (RDI) is determined.]29C [This is the amount an average person needs daily to stay healthy.]29D
[Wanneer het lichaam zowel overtollig vocht als natrium kwijtraakt of opneemt,]30A [kan dit zowel het bloedvolume als de natriumspiegel beïnvloeden.]30B

[ENG: When the body loses or absorbs both superfluous fluid and sodium,]30C [this can influence both the blood volume and the sodium level.]30D
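The pronoun rule described above can be sketched as follows, in Python rather than the recognizer's Perl; the likelihood values are illustrative stand-ins, not the scores actually used:

```python
# Pronouns from table 6.2, lower-cased for matching.
PRONOUNS = {"alle", "bepaalde", "dat", "dergelijke", "deze", "dit",
            "elk", "elke", "sommige", "verschillende", "ze", "zulke"}

def pronoun_signal(sentence: str):
    """Return (relation, likelihood) if a pronoun signals one, else None.
    A sentence-initial pronoun is an explicit Elaboration marker; a
    mid-sentence pronoun only weakly indicates that some relation exists."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    if not words:
        return None
    if words[0] in PRONOUNS:
        return ("elaboration", 100)      # explicit marker (assumed score)
    if any(w in PRONOUNS for w in words[1:]):
        return ("unknown", 30)           # indication only (assumed score)
    return None

print(pronoun_signal("Dit is de hoeveelheid die een persoon nodig heeft."))
print(pronoun_signal("Het kan zijn dat dergelijke mensen er reeds zitten."))
```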
6.5.4 Domain Specific Discourse Markers
The medical discourse markers are the fourth type of discourse marker used. Two kinds are used: the noun constructions and the verb constructions. As shown in chapter 5, there exist some medical words which might indicate the presence of a relation, while the actual relation in those cases is more likely to be signaled by a different discourse marker. For example:

John's body reacts strangely to the medicines.
These medicines increase the symptoms.

The second sentence of the example contains the words "symptoms" and "increase", which were found to be able to signal an Elaboration relation. That relation can also be signaled by the word "These", which is a stronger signal since it is a pronoun; this is explained in section 6.8. For a subset of the list, the relation itself is signaled more clearly: these markers all signal a (Non-)Volitional Cause or Result.
6.5.5 Relation Markers
During this research, some words and constructions were found which indicate a specific relation but do not belong to any of the types discussed so far. An example of such a word is "niet" (not), which can sometimes be used to detect a Concession relation: it negates a sentence, signaling a different view on facts presented before. Other words clearly elaborate on a subject, like the word "Voorbeelden" (examples). Table 6.3 presents a small list of relation markers and the relations they might signal.
Word            Relation
Andere          Concession
Geen            Concession
Niet            Concession
Nooit           Concession
Tegenover       Concession
Tegenstelling   Concession
Aanwijzingen    Elaboration
Mogelijkheden   Elaboration
Voorbeelden     Elaboration

Table 6.3: Relation Markers
These relation markers should be used with care, since there are examples in which these words do not signal a relation. When these words occur at the start of a sentence, their signaling was found to be stronger than when they reside within a sentence. When a relation marker is found within a sentence, the relation is usually signaled by a different marker.
6.5.6 Adjectives
A different word type that might signal an elaboration about a certain topic is the adjective.
Adjectives can be used to elaborate on a specific aspect of the subject mentioned before. For
example the following text:

Een afwijking aan het afweersyteem is gevaarlijk.
Een grote afwijking kan dodelijk zijn.
ENG: An anomaly of the immune system is dangerous.
A large anomaly can be fatal.
There are, however, many adjectives, and a large part of them are not used to elaborate on
a previous statement. Furthermore, relations between such sentences can also be found with
the use of keyword repetition, which will be discussed below. The recognizer therefore makes
no use of adjectives.
6.5.7 Implicit Markers
Since not all relations are signaled by discourse markers, a way of recognizing relations
without them is necessary. An attempt at annotating this sort of relation is carried out
in [MRAW04a]. Implicit markers can be identified in texts which do not embed an explicit
marker. In this assignment, all word types used by the recognizer are considered explicit
markers. An example of two sentences which embed an implicit marker is:

Een eik laat zijn bladeren vallen in de winter.
Bomen hebben in de winter geen voedsel over voor bladeren.
ENG: An oak drops its leaves in winter.
Trees cannot spare their food in winter for leaves.
This example embeds a relation between the two sentences, which could for example be a
Non-Volitional Result or an Elaboration. The implicit marker in this example is the word
"want" (because).
While the exact recognition of the relation is very hard, recognizing the mere presence of
a relation is interesting as well. In the preceding example the existence of a relation
between the two sentences can be derived from different sources.
First of all, since an oak is an element of the set of trees, the presence of a relation is
indicated. For an automatic recognizer, however, this world knowledge must be made explicit.
This can be done with the help of a thesaurus. A thesaurus, sometimes referred to as an
ontology, is a set of (textual) data; it does not define words, but stores relations between
words, such as synonyms, homonyms, etcetera. A well-known example of a thesaurus is
WordNet [Mil95].
Secondly, the first sentence states something about leaves and winter, and these subjects
appear in the second sentence as well. This repetition of keywords might indicate the
presence of a relation, although it does not give any clue about which relation connects
the sentences. To use this method, the keywords must be selected. It may be clear that
words like "the" and "in" are not suited to be marked as keywords. The word type most
likely to yield useful keywords is the noun. To find the nouns in a Dutch text, Alpino
[BvNM01] is used; a Part of Speech (POS) tagger could be used for this as well.
This approach of keyword repetition is used for the recognition of distance relations.
Sentences which embed the same nouns are attached to each other. The added relation is
always an Elaboration, connected in a left-to-right ordering: the first sentence is selected
as the nucleus and the second as the satellite. This principle is illustrated with the
following extract:
[Ongeveer 15% van alle schildkliercarcinomen bestaat uit folliculair carcinoom en komt vooral
voor bij ouderen.]31A [Folliculair carcinoom komt ook meer voor bij vrouwen dan bij mannen,
maar net als bij papillair carcinoom is bij mannen de waarschijnlijkheid groter dat de knobbel
kwaadaardig is.]31B [Folliculair carcinoom is veel agressiever dan papillair carcinoom en
verspreidt zich vaak door de bloedbaan. ]31C [Hierdoor ontstaan uitzaaiingen in verschillende
delen van het lichaam. ]31D [De behandeling van folliculair carcinoom bestaat uit zo volledig
mogelijk operatieve verwijdering van de schildklier en vernietiging van eventueel overblijvend
schildklierweefsel en van de uitzaaiingen met behulp van radioactief jodium.]31E
[ENG: About 15% of all thyroid carcinomas consists of follicular carcinoma, which mainly occurs
in elderly people.]31F [Follicular carcinoma occurs more often in women than in men, but just
like papillary carcinoma, in men the probability is higher that the lump is malignant.]31G [Follicular
carcinoma is much more aggressive than papillary carcinoma and often spreads through the
bloodstream.]31H [Therefore secondary tumors arise in different parts of the body.]31I [The treatment
of follicular carcinoma consists of the surgical removal, as complete as possible, of the thyroid
and the destruction of possibly remaining thyroid tissue and secondary tumors with the help of
radioactive iodine.]31J
This piece of text consists of five sentences. There are just a few words which might
indicate a relation. The most obvious one is "Hierdoor" in sentence four, which indicates
a relation between sentences three and four. The remaining sentences only contain some
weaker markers, so the recognizer cannot link them on that basis. However, since each
sentence is about some aspect of Folliculair Carcinoom (Follicular Carcinoma), this can be
used for linking the sentences. The recognizer uses Alpino to find the nouns in the
sentences; the nouns are then compared, and the sentences which embed similar nouns are
connected. The comparison works as follows: the first sentence is compared to the second;
if a match is found, a relation is added. After this, the first sentence is compared to the
third, and so on. The process then starts again from the second sentence, which is compared
to the third and fourth. EDU 31C is omitted from this process since it is already connected
to EDU 31D.
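The pairwise linking just described can be sketched as follows. This is a minimal illustration, not the recognizer's actual code: the noun lists are assumed to be given (the recognizer obtains them from Alpino), and the function name is invented for this sketch.

```python
def link_by_keyword_repetition(sentence_nouns):
    """Link sentences whose noun sets overlap, scanning first-to-last and
    always adding an Elaboration with the earlier sentence as nucleus."""
    relations = []
    connected = set()  # indices already attached as satellites
    for i, nouns_i in enumerate(sentence_nouns):
        for j in range(i + 1, len(sentence_nouns)):
            if j in connected:
                continue  # a satellite attaches to only one nucleus
            if set(nouns_i) & set(sentence_nouns[j]):
                relations.append((i, j, "elaboration"))
                connected.add(j)
    return relations

# Toy version of the Folliculair Carcinoom extract: most sentences share a
# noun with the first one, so they all link back to sentence 0.
nouns = [["carcinoom"], ["carcinoom", "man"], ["carcinoom"],
         ["uitzaaiing"], ["carcinoom"]]
print(link_by_keyword_repetition(nouns))
# [(0, 1, 'elaboration'), (0, 2, 'elaboration'), (0, 4, 'elaboration')]
```

Sentences without any shared noun (index 3 here) stay unconnected in this pass, just as in the thesis the remaining EDUs are handled in later passes.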
Because each sentence is about the Folliculair Carcinoom, each sentence will be linked
with the first sentence. The result will thus look like figure 6.1.
[Figure 6.1: The Folliculair Carcinoom example. 31B, 31E and the span 31C-31D are attached
to 31A by Elaboration relations; 31D is attached to 31C by a Non-Volitional Cause relation.]
A remark regarding this approach is that Alpino makes errors, which in turn causes the
recognizer to make mistakes. Furthermore, the plural and singular forms of the same noun do
not count as a match. This problem could be solved with the use of a stemmer, for example
the Porter stemmer [Por80]. Another remark is that word types other than nouns could also
be used to link sentences.
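The effect a stemmer would have on the matching can be illustrated with a deliberately naive sketch; the suffix list below is a toy assumption, not a real Dutch stemmer, and a real system would use a proper algorithm such as the Porter stemmer cited above.

```python
def toy_stem(word):
    """Very naive stemmer: strip a common Dutch plural suffix so that
    singular and plural forms of a noun compare equal.
    (Toy illustration only; use a real Porter-style stemmer in practice.)"""
    for suffix in ("en", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Without stemming, "knobbel" and "knobbels" would not count as a match.
print(toy_stem("knobbels") == toy_stem("knobbel"))  # True
```

With such normalization applied before the noun comparison, the plural/singular mismatch mentioned above disappears for regular plurals.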
6.6 Recognizing Relations
The recognition consists of the attribution of scores to relations between elementary
discourse units. Three prior assumptions are made:

1. The text to annotate is actually coherent.
2. The most important piece of text is usually presented first.
3. Texts are from the Dutch medical domain.
The assumption that the text is indeed coherent is an important one. It is possible to
write a nonsense text containing discourse markers which would indicate a certain relation
if they were used in a normal text fragment. For example:

Jake is een aparte man.
Daardoor is de kat groen.
ENG: Jake is a strange guy.
Therefore the cat is green.

In this example, the recognizer would assign a relation between these sentences, since the
word "Daardoor" (therefore) indicates a relation.
The fact that most texts indeed state the most important information first is useful for
assigning scores to a relation and for deciding which part of a hypotactic relation is most
likely to be the nucleus. This assumption is supported by the numbers in table 6.4,
extracted from the texts annotated during this assignment: in 81.3% of the cases the
nucleus precedes the satellite. Only relations between sentences are counted.
Ordering  Percentage
N-S       81.3 %
S-N       5.3 %
Equal     13.5 %

Table 6.4: Relation Ordering Percentages
The first approach for recognition works between full sentences. Recognition between
sentences is harder than between EDUs within a single sentence, because pieces of
information presented in a single sentence usually relate to each other. Furthermore, it is
rare for an EDU to have a relation with a part outside its sentence, unless it is the
nucleus of that sentence.
Recognizing relations between sentences is also hard because a relation may hold between
sentences at a great distance from each other. The majority, however, consists of relations
between adjacent sentences. Results of the analysis of the annotated data are presented in
table 6.5, which contains the percentages of relations between adjacent sentences for four
annotators. About 60% of the relations are defined between adjacent sentences.
Annotator  Percentage
P1         66.67 %
P2         61.29 %
P3         64.71 %
P4         55.42 %

Table 6.5: Relation numbers between adjacent sentences
The last assumption is that the texts to be annotated are from the Dutch medical domain.
Although Dutch texts from other genres can be annotated as well, English texts, for
example, cannot. The recognizer uses word types which were specifically gathered for the
medical domain, so texts from other domains benefit less from these types.
To find the relations in a text, the following approach is used. The segmented text is
checked for relations. For each candidate relation between EDUs, the recognizer assigns a
score, based on the discourse markers found in the sentences, which represents the
likelihood that the relation exists; the type of the relation is recorded as well. If the
text contained three sentences, a possible outcome is the following:
[1 2 N 80 nonvolitional-cause]
[1 2 N 60 elaboration]
[2 3 N 30 nonvolitional-cause]
[2 3 N 90 elaboration]
[2 3 S 10 concession]
[1 3 N 40 elaboration]
In the preceding example, two relations were found between EDUs one and two, three
relations between two and three, and one between the first and the third. The most probable
relations are kept and the rest discarded. Since a satellite can only be attached to one
nucleus, conflicts are solved by keeping the highest ranked relations. In the example
above, sentence 3 can be attached either to the first or to the second sentence. It is
attached to sentence 2, since that relation is scored 90, while the relation between 1 and
3 is scored 40. This results in:
[1 2 N 80 nonvolitional-cause]
[2 3 N 90 elaboration]
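This conflict-resolution step can be sketched as follows. The tuple layout mirrors the bracketed lists above, but the function itself is an illustrative reconstruction; for simplicity it treats the second EDU of each tuple as the satellite.

```python
def resolve_conflicts(relations):
    """Keep, for each satellite EDU, only the highest-scored relation.
    A relation is (nucleus, satellite, order, score, label); simplifying
    assumption: the second EDU of each tuple is the satellite."""
    best = {}
    for rel in sorted(relations, key=lambda r: r[3], reverse=True):
        satellite = rel[1]
        if satellite not in best:  # first hit is the highest score
            best[satellite] = rel
    return sorted(best.values())

candidates = [
    (1, 2, "N", 80, "nonvolitional-cause"),
    (1, 2, "N", 60, "elaboration"),
    (2, 3, "N", 30, "nonvolitional-cause"),
    (2, 3, "N", 90, "elaboration"),
    (2, 3, "S", 10, "concession"),
    (1, 3, "N", 40, "elaboration"),
]
print(resolve_conflicts(candidates))
# [(1, 2, 'N', 80, 'nonvolitional-cause'), (2, 3, 'N', 90, 'elaboration')]
```

Applied to the candidate list above, this reproduces exactly the two surviving relations shown in the text.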
This result can be graphically represented as in figure 6.2.

[Figure 6.2: Tree view of the example. EDU 2 is attached to EDU 1 by a Non-Volitional Cause
relation, and EDU 3 to EDU 2 by an Elaboration relation.]
6.7 Recognition Algorithm
This section describes the recognition algorithm used by the recognizer. The recognizer
works according to the assumptions described in section 6.6, extended with the requirement
of segmented input: the recognizer receives a list of EDUs, which are to be connected.
The recognizer uses the following algorithm. The list of EDUs is checked for relations per
pair. The first pair to check consists of the first and the second EDU in the list; a score
is assigned to each possible relation between them. After that, the pair consisting of the
second and third EDU is checked, and so on. This results in a list of relations between
adjacent EDUs. The most probable relation between two EDUs is kept, and all relations
between the same EDUs with a lower probability are discarded. After this step, the
recognizer still has to find the long-distance relations, which this first pass cannot
find. This is shown in figure 6.3.
[Figure 6.3: A relation between non-adjacent EDUs. A Non-Volitional Cause relation holds
between A and B; a Background relation connects C to the A-B span.]
The example in figure 6.3 contains three EDUs. The first cycle checks only A with B and B
with C; it does not compare A with C. The comparison of A with C is done afterwards: to
find the long-distance relations, a new list is formed, containing the nuclei of the
relations found so far and the EDUs which are not yet connected to another EDU.
This long-distance relation scoring is repeated until all EDUs are connected. This is
because the text is assumed to be coherent, so no EDU should be left without a relation to
another EDU.
Some remarks regarding this approach are the following. Once an EDU is connected as a
satellite to a nucleus, it is not possible to reconnect it to another one, for example
through a distance relation. This is illustrated with the following example.
[Figure 6.4: Two EDUs connected to the same EDU. B and C are both attached to A by
Elaboration relations.]
[Figure 6.5: EDU C is connected to A through B. C is attached to B, and B to A, by
Elaboration relations.]
In figure 6.4, the EDUs B and C are both connected with an Elaboration relation to EDU A.
In figure 6.5, C is connected to B, which in turn is connected to A. If a relation between
B and C is made first, the long-distance relation which holds between A and C will never be
found, since C is then considered a satellite. Two possible solutions for this problem
exist. The
first solution is to use a minimum likelihood necessary to confirm a relation. This
prevents the adding of relations which are not signaled strongly and could therefore be
incorrect. If a connection between B and C is not made because its likelihood is below the
threshold, the recognizer will afterwards check between A and C. The second solution is to
check all possible relation tuples. However, for large texts the time complexity of
checking all possible relations could be a problem, since for n EDUs the number of checks
is (n² − n)/2, while with growing distance between EDUs the likelihood of a relation
decreases. Therefore the first solution is used. The scoring of relations is explained in
the next section, while the effect is evaluated in chapter 7.
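The quadratic growth of the exhaustive second solution can be made concrete with a small sketch (the helper name is invented for this illustration):

```python
def pair_checks(n):
    """Number of EDU pairs the exhaustive approach would have to compare:
    (n^2 - n) / 2, i.e. every unordered pair of distinct EDUs."""
    return (n * n - n) // 2

for n in (3, 10, 100):
    print(n, pair_checks(n))  # 3 checks, 45 checks, 4950 checks
```

For a 100-EDU text the exhaustive approach already requires 4950 comparisons, most of them between distant EDU pairs that are unlikely to be related, which is why the threshold-based first solution is preferred.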
Figure 6.6 displays the used algorithm in pseudocode.
Start of code

for each tuple of adjacent EDUs (a,b)
    compare (a,b)
for each EDU pair
    select relation with MAX likelihood
    where likelihood > threshold
create new list with nuclei and non-assigned nodes
for each tuple of list (x,y)
    compare_distant (x,y)
for each EDU pair
    select relation with MAX likelihood
    where likelihood > threshold
join all unconnected subtrees

End of code

Figure 6.6: Algorithm in Pseudocode
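The pseudocode above can be turned into a runnable sketch. This is a simplification under several assumptions: `compare` is a stand-in for the marker-based scoring, only one distance pass is shown (the thesis repeats the pass and finally joins leftover subtrees), and each relation tuple is keyed by its satellite, assumed to be the second EDU.

```python
def recognize(edus, compare, threshold=20):
    """Two-pass sketch of the algorithm of figure 6.6. `edus` is a list of
    EDU ids; `compare(a, b)` returns candidate tuples (a, b, score, label)."""
    def best_links(pairs):
        links = {}
        for a, b in pairs:
            candidates = [c for c in compare(a, b) if c[2] > threshold]
            if candidates:  # keep only the highest-scored relation per pair
                links[b] = max(candidates, key=lambda c: c[2])
        return links

    # Pass 1: adjacent EDUs only.
    links = best_links(zip(edus, edus[1:]))
    # Pass 2: the list of nuclei and still-unconnected EDUs.
    remaining = [e for e in edus if e not in links]
    links.update(best_links(zip(remaining, remaining[1:])))
    return sorted(links.values())

# Toy scorer: EDU 3 relates strongly to 2 and weakly to 1; 1 and 2 unrelated.
def compare(a, b):
    table = {(2, 3): [(2, 3, 90, "elaboration"),
                      (2, 3, 30, "nonvolitional-cause")],
             (1, 3): [(1, 3, 40, "elaboration")]}
    return table.get((a, b), [])

print(recognize([1, 2, 3], compare))
# [(2, 3, 90, 'elaboration')]
```

In the full algorithm the distance pass would be repeated on the shrinking list until everything is connected, with Joint relations tying together any subtrees that remain.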
6.8 Scoring Hierarchy
The scoring of the relations is based on different aspects. A distinction is made between
the importance of relations and the strength of the used discourse markers. To create the
hierarchy, use is made of texts annotated by human annotators. The EDUs which were
connected with a relation by a human annotator are checked for the presence of a discourse
marker. If a discourse marker is present, it is analyzed whether or not the marker signaled
the relation. To measure whether the relation was indeed signaled by the discourse marker,
an approach quite similar to the process described in [KS98] is used: if substituting the
discourse marker with a different one would change the relation, the discourse marker was
the reason the relation existed. This analysis of the strength of discourse markers is
performed for all the word types, so that a hierarchy could be made of the word types and
of the placement of the discourse markers in the sentences.
The first observation in developing the hierarchy is that if a conjunctive adverb starts a
sentence, it acts strongly as a discourse marker. Consider the extract below:

[De kans op een HIV-infectie door moedermelk is relatief laag.]32A [Toch dienen HIV-geïnfecteerde
moeders borstvoeding te vermijden.]32B
[ENG: The chance of getting infected by HIV through breast milk is relatively low.]32C
[Nevertheless, mothers infected with HIV ought to avoid giving breast milk.]32D
These two sentences can be related to each other with a Concession relation. It is clearly
signaled by the word "Toch" (Nevertheless).
As mentioned before, (conjunctive) adverbs can also signal relations when they do not
appear at the start of the second sentence. However, the signaling is then less strong.
This can be illustrated with the following example:
[Bij mensen die aan asbest blootgesteld zijn geweest, kan de diagnose asbestose soms al worden
gesteld op basis van kenmerkende afwijkingen op een thoraxfoto.]33A [De patiënt heeft meestal
ook een afwijkende longfunctie en met de stethoscoop zijn krakende geluiden (’crepitaties’) in
de longen te horen.]33B
[ENG: In people who have been exposed to asbestos, the diagnosis asbestosis can sometimes be
made based on characteristic abnormalities on a thorax photo.]33C [The patient usually also has
an abnormal lung function, and with a stethoscope creaking sounds ('crepitations') can be heard
in the lungs.]33D
In this example the second sentence can be connected to the first with an Elaboration
relation. This relation can be signaled by two words: the first one is "Meestal", the
second is "Ook". If both words are removed from the second sentence, no relation can be
added if the recognizer has no world knowledge:
[Bij mensen die aan asbest blootgesteld zijn geweest, kan de diagnose asbestose soms al worden
gesteld op basis van kenmerkende afwijkingen op een thoraxfoto.]34A [De patiënt heeft een afwijkende longfunctie en met de stethoscoop zijn krakende geluiden (’crepitaties’) in de longen te
horen.]34B
To assign the relation for this text, the recognizer must at least know that both sentences
share a subject. It would have to know that a patiënt is a human, or what Asbestose is.
Since the recognizer does not have this kind of information, a relation cannot be added.
If only one signaling word is present, the relation can still be made:
[Bij mensen die aan asbest blootgesteld zijn geweest, kan de diagnose asbestose soms al worden gesteld op basis van kenmerkende afwijkingen op een thoraxfoto.]35A [De patiënt heeft
(meestal/ook) een afwijkende longfunctie en met de stethoscoop zijn krakende geluiden (’crepitaties’) in de longen te horen.]35B
When both words are present, the signaling of a relation is even stronger; each occurrence
of a (conjunctive) adverb therefore increases the likelihood of the presence of a relation.
A combination of words which signal different relations is also possible. In such a case,
all possibilities are considered and the likelihoods added per relation. This is shown with
the following extract:
Vocht rond de longen kan met een naald worden verwijderd en onderzocht; deze ingreep heet
thoracentese. Een thoracentese is meestal echter niet zo nauwkeurig als een biopsie.
ENG: Fluid around the lungs can be removed with a needle and examined; this procedure is called
thoracentesis. A thoracentesis, however, is usually not as accurate as a biopsy.
In the second sentence, there are three words which could signal a relation: "Meestal",
"Echter" (usually, however) and "Niet" (not). While "Meestal" can signal an Elaboration
relation, "Echter" and "Niet" can signal a Concession. Because two markers signal the
Concession relation, it is preferred.
So, in the hierarchy, (conjunctive) adverbs which appear at the start of the second
sentence signal a relation more strongly than when the word appears at a different spot.
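This position-sensitive weighting can be sketched as follows; the function and the two score values are illustrative assumptions, not the recognizer's actual implementation, and the sketch ignores punctuation other than a trailing period.

```python
def marker_strength(sentence, marker, start_score, mid_score):
    """Score a discourse marker higher when it opens the sentence than
    when it occurs elsewhere (scores are illustrative placeholders)."""
    words = sentence.lower().rstrip(".").split()
    if not words or marker.lower() not in words:
        return 0
    return start_score if words[0] == marker.lower() else mid_score

# Sentence-initial "Toch" scores high; mid-sentence "meestal" scores lower.
print(marker_strength("Toch dienen moeders borstvoeding te vermijden.",
                      "toch", 60, 30))      # 60
print(marker_strength("De patiënt heeft meestal een afwijkende longfunctie.",
                      "meestal", 60, 30))   # 30
```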
Pronouns are weaker than (conjunctive) adverbs, in the sense that they do signal relations,
usually Elaborations, but the actual relation can be signaled more strongly by a different
marker. Consider the next examples. In the first example, the word "Deze" signals an
Elaboration, while in the second it could also signal a Concession:
[De ziekte wordt waarschijnlijk veroorzaakt door letsel dat is ontstaan doordat de pees van de
knieschijf te hard trekt aan de aanhechting aan de kop van het scheenbeen (tibia).]37A [Deze
aanhechting wordt tuberculum tibiae genoemd.]37B
[ENG: The disease is probably caused by an injury which originates because the tendon of the
kneecap pulls too hard at its attachment to the head of the shinbone (tibia).]37C [This attachment
is called tuberculum tibiae.]37D
[ Als bijvoorbeeld de bloedvaten zich verwijden, waardoor de bloeddruk daalt, zenden de
sensoren onmiddellijk signalen via de hersenen naar het hart, waardoor de hartfrequentie
toeneemt en het hart dus meer bloed rondpompt.]38A [Daardoor verandert de bloeddruk
uiteindelijk niet of nauwelijks.]38B [Deze compensatiemechanismen hebben echter ook hun
beperkingen.]38C
[ENG: If for example the blood vessels widen, causing the blood pressure to drop, the sensors
immediately send signals via the brain to the heart, as a result of which the heart frequency
increases and the heart pumps more blood through the system.]38D [Therefore the blood pressure
does not or only slightly differs in the end.]38E [These compensation mechanisms have their limits
however.]38F
The Concession relation can be signaled by "Echter" or "Beperkingen". Therefore, if no
other markers are found, the pronouns are used to connect the sentences. Pronouns can also
be used for signaling if they appear in the middle of a sentence; however, they then signal
even less strongly.
A further remark on the use of pronouns is that they can link sentences which are
non-adjacent: the text refers to a statement presented in an earlier sentence. These
references can be found with the use of keyword repetition.
The fourth type of connection marker used is the medical discourse marker, as explained in
chapter 5. These are used only if no other markers are found; they are lowest in the
hierarchy.
The word types highest in the ranking are those used when a marker of that type is found at
the start of a sentence. There are only four types which can be used in this way: the
conjunctive adverbs, the adverbs, the pronouns and the relation markers. The remaining
entries apply when a marker of that type resides in the middle of a sentence. The last
entry of the ranking is for the implicit markers; they are checked last, so they do not
conflict with other word types for the signaling of relations.
The final ranking is shown in table 6.6. The top of the table represents the strongest
marker.
Ranking  Type                 Start of Sentence
1        Conjunctive Adverbs  Yes
2        Adverbs              Yes
3        Pronouns             Yes
4        Relation Markers     Yes
5        Conjunctive Adverbs  No
6        Adverbs              No
7        Relation Markers     No
8        Pronouns             No
9        Medical Markers      No
10       Implicit Markers     No

Table 6.6: Discourse Marker Hierarchy
To measure the actual strength of the discourse markers, the data is checked for
occurrences of discourse markers of each word type. For each occurrence it is checked
whether or not a relation is signaled by that discourse marker. The results for all
discourse markers of a certain word type are combined into a number which expresses the
likelihood that an instance of the word type signals a relation. Furthermore, a threshold
value is determined from the data. The threshold is based on the likelihood of the weakest
signaling word types: a single instance of such a word type is not strong enough to signal
a relation. A small likelihood bonus is added to the likelihood of a relation when more
relations are possible between the same EDUs in the same ordering. For example, if the
markers signal a Non-Volitional Cause relation in a Nucleus - Satellite ordering and an
Elaboration relation in a Nucleus - Satellite ordering, the likelihood bonus is applied.
This is done because if more relations could be added in a certain ordering, the assumption
that the ordering is correct is supported.
The likelihood values are shown in table 6.7.
Type                      Likelihood
Conjunctive Adverbs       60
Adverbs                   50
Pronouns                  30
Relation Marker           20
Second Sentence Standard  20
Second Sentence Medical   15
Threshold                 20
Likelihood Bonus          5

Table 6.7: Discourse Marker Likelihoods
The likelihood values in table 6.7 are based on the mutual proportions of the word types
and the percentages of cases in which they signal a relation. It would be possible to
derive a signaling percentage per word type, or even per word, but that would require a
large amount of data, which was not available.
The likelihood values are used cumulatively. If a conjunctive adverb at the start of a
sentence and a pronoun in the middle of a sentence both signal an Elaboration relation,
the likelihood is 60 + 20 = 80.
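Using the values of table 6.7, the cumulative scoring can be sketched as follows. The dictionary keys are invented identifiers, and scoring mid-sentence markers with the "Second Sentence Standard" value is an assumption based on the 60 + 20 = 80 example above.

```python
# Likelihood values from table 6.7; the key names are illustrative.
LIKELIHOOD = {
    "conjunctive_adverb_start": 60,
    "adverb_start": 50,
    "pronoun_start": 30,
    "relation_marker_start": 20,
    "standard_mid": 20,   # assumed: any non-medical mid-sentence marker
    "medical_mid": 15,
}
THRESHOLD = 20
BONUS = 5

def relation_likelihood(marker_types, bonus=False):
    """Sum the likelihoods of all markers signaling the same relation, plus
    the small bonus when several relations share the same ordering."""
    return sum(LIKELIHOOD[t] for t in marker_types) + (BONUS if bonus else 0)

# Sentence-initial conjunctive adverb plus a mid-sentence marker: 60 + 20 = 80.
print(relation_likelihood(["conjunctive_adverb_start", "standard_mid"]))  # 80
```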
The scoring of far-distance relations is missing from this list, because the likelihood of
these relations depends on the number of similar keywords: the more similar keywords, the
higher the likelihood. It is expressed as a number between 0 and 100, based on the
percentage of similar keywords. Nouns are used as keywords.
6.9 Example
The automatic annotation process is illustrated by the example below. The intention of this
example is to show the steps the recognizer takes and their results. The text used for the
example is the following:
Wanneer de huid een lage temperatuur bereikt, verwijden de bloedvaten in het gebied zich bij
wijze van reactie. De huid wordt rood, voelt heet aan, jeukt en kan pijn doen. Deze effecten
treden meestal op negen tot 16 minuten nadat het ijs is aangebracht en verdwijnen ongeveer
vier tot acht minuten nadat het ijs is verwijderd. Daarom moet het ijs worden verwijderd na
tien minuten, of eerder wanneer deze effecten optreden, maar tien minuten nadat het is
verwijderd, mag het weer worden aangebracht.
ENG: When the skin reaches a low temperature, the blood vessels widen in the area as a reaction.
The skin becomes red, feels hot, itches and can hurt. These effects usually occur nine to 16
minutes after the ice is placed and disappear about four to eight minutes after the ice is removed.
Therefore the ice must be removed after ten minutes, or earlier when these effects occur, but
ten minutes after it is removed, it may be applied again.
The recognizer starts with the segmentation process. The text is split into elementary
discourse units. This results in the following:
[Wanneer de huid een lage temperatuur bereikt, verwijden de bloedvaten in het gebied zich bij
wijze van reactie.]40A [De huid wordt rood, voelt heet aan, jeukt en kan pijn doen.]40B [Deze
effecten treden meestal op negen tot 16 minuten nadat het ijs is aangebracht en verdwijnen
ongeveer vier tot acht minuten nadat het ijs is verwijderd. ]40C [Daarom moet het ijs worden
verwijderd na tien minuten, of eerder wanneer deze effecten optreden, maar tien minuten nadat
het is verwijderd, mag het weer worden aangebracht.]40D
After the segmentation, the actual recognition process starts. First the relations between
adjacent EDUs are found, and the resulting list is cleaned: relations with a likelihood
below the threshold are discarded. The relations shown below are found.
[2 3 N 145 elaboration]
[2 3 N 25 concession]
[2 3 N 25 nonvolitional-cause]
[3 4 N 45 elaboration]
[3 4 N 65 non-volitional-result]
Between EDU 2 and EDU 3, three relations with a likelihood above the threshold are found.
The likelihood of the Elaboration relation is calculated to be 145; it is signaled by the
discourse markers "Deze", "treden op", "meestal", "nadat" and "nadat". Furthermore it
received the 5 point bonus, because multiple relations were found in a Nucleus - Satellite
ordering between EDU 2 and EDU 3.
The recognizer proceeds by discarding the conflicting relations. There are three possible
relations between EDU 2 and EDU 3, and two possible relations between EDU 3 and EDU 4; no
relations are found between EDU 1 and EDU 2. After cleaning, the result is the following:
[2 3 N 145 elaboration]
[3 4 N 65 non-volitional-result]
Next, since there is still an unconnected EDU, the far-distance relations are to be found.
The recognizer runs Alpino to extract the nouns of each sentence and compares each possible
combination. It finds a connection between EDU 1 and EDU 2 through the noun "huid". This
relation receives a likelihood based on the number of similar nouns: the number of similar
nouns divided by the total number of nouns in the second sentence. Since the second
sentence contains only two nouns ("huid", "pijn"), of which only "huid" matches, the
likelihood should be 50; the computed value of 25 is incorrect. This is caused by Alpino,
which incorrectly identified some words as nouns.
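The far-distance likelihood computation can be sketched as follows; the function is a hypothetical reconstruction of the percentage formula described above, applied here to the correct noun lists rather than to Alpino's erroneous output.

```python
def distance_likelihood(nouns_first, nouns_second):
    """Likelihood of a far-distance relation: the percentage of the second
    sentence's distinct nouns that also occur in the first sentence."""
    if not nouns_second:
        return 0  # no keywords, no evidence for a relation
    shared = set(nouns_first) & set(nouns_second)
    return round(100 * len(shared) / len(set(nouns_second)))

# With the correct noun lists, one of the two nouns matches: likelihood 50.
print(distance_likelihood(
    ["huid", "temperatuur", "bloedvaten", "gebied", "reactie"],
    ["huid", "pijn"]))  # 50
```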
The found relation is added to the list generated before, and possibly conflicting
relations are removed in the same way as before. The final relation list is the following:
[2 3 N 145 elaboration]
[3 4 N 65 non-volitional-result]
[1 2 N 25 elaboration]
This list is parsed by the tree-builder and an rs3 file is generated. The final tree is
presented in figure 6.7.
Figure 6.7: Generated Tree
Chapter 7
Evaluation
In this chapter the evaluation of automatically generated discourse trees is discussed.
This evaluation can be done in multiple ways. First of all, it is explained when a tree is
valid according to the standards defined by RST; this is covered in the first section.
Furthermore, it is useful to check whether a valid tree correctly represents the
information in the text. Two approaches to measure this correctness are described. The
first is based on manually scoring the automatically generated result, evaluating each
relation separately. The second is to check whether the result as a whole is similar to a
human-annotated version of the same text. The latter approach is used for this research,
because it is much harder to tell whether a relation is wrong or correct than to check
whether it is similar to the result of a human annotator.
After describing the different evaluation methods, the results of the recognizer are
discussed, and the increase in accuracy as the program uses more information is shown. Five
example texts are presented: two of them are annotated with an increasing use of discourse
marker types; the other three are smaller texts, with an analysis of the differences with
respect to annotations performed by a human annotator. These evaluations are followed by
the results of more texts annotated by the recognizer with the use of all discourse marker
types. These results are compared to evaluations between human annotators. The last section
contains a discussion of the performance of the automatic recognizer and adjustments which
could be made.
7.1 RST-Trees
First of all, the correctness of an RST-tree must be formalized. Mann and Thompson define
the following four rules [MT88]:
1. Completeness: The set contains one schema application that contains a set of text
spans that constitute the entire text.
2. Connectedness: Except for the entire text as a text span, each text span in the analysis
is either a minimal unit or a constituent of another schema application of the analysis.
3. Uniqueness: Each schema application consists of a different set of text spans, and
within a multi-relation schema each relation applies to a different set of text spans.
4. Adjacency: The text spans of each schema application constitute one text span.
The first rule, completeness, states that the tree should cover the entire input text. This
can be accomplished by forcing all spans into a single tree: if no relations are found
between leaves and/or subtrees, they can be connected with the Joint relation. This is
allowed because the texts to annotate are considered coherent.
The three other rules state that all elementary discourse units should be added to the tree
as a node, and that no node can belong to different schemas.
The four rules are covered by the recognition algorithm. The algorithm cannot connect
satellites, which are connected to a nucleus again. Between nuclei, new relations are possible,
but once connected, no other connection is possible afterwards. This ensures that a nucleus
cannot be connected outside a schema. If no relations are found, all subtrees fit in the Joint
schema.
The trees generated by the recognizer always conform to these rules.
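These constraints can also be checked mechanically. The following Python sketch illustrates such a check on a minimal tree representation; the Node class and the span encoding are illustrative assumptions, not the recognizer's actual data structures:

```python
# Sketch of an RST well-formedness check. The representation is
# hypothetical: each node covers an inclusive span of EDU indices.
class Node:
    def __init__(self, start, end, children=None, relation=None):
        self.start, self.end = start, end   # span of EDU indices, inclusive
        self.children = children or []      # empty for a minimal unit (EDU)
        self.relation = relation            # e.g. "Elaboration", "Joint"

def well_formed(root, n_edus):
    # Completeness: the root span must cover the entire text.
    if (root.start, root.end) != (0, n_edus - 1):
        return False
    def check(node):
        if not node.children:               # a leaf must be a single EDU
            return node.start == node.end
        spans = [(c.start, c.end) for c in node.children]
        # Adjacency: child spans must be contiguous and in order ...
        for prev, cur in zip(spans, spans[1:]):
            if cur[0] != prev[1] + 1:
                return False
        # ... and together constitute the parent span (connectedness).
        if (spans[0][0], spans[-1][1]) != (node.start, node.end):
            return False
        return all(check(c) for c in node.children)
    return check(root)

# A Joint over three EDUs: always a valid fallback for a coherent text.
tree = Node(0, 2, [Node(0, 0), Node(1, 1), Node(2, 2)], relation="Joint")
print(well_formed(tree, 3))   # True
```

The Joint fallback in the example mirrors the algorithm's behavior when no relations are found between the remaining subtrees.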
7.2 Relation Based Evaluation
The relation based evaluation method is based on the judgment of an annotator. The
annotator examines the automatically generated result and judges for each relation whether
it is acceptable or not. The relations are not required to be the best option, but they are
not allowed to be wrong. For example:
[Peter won the running contest.]41A [He is very fast.]41B
In figure 7.1, this small piece of discourse is annotated in three different ways. While the
first is better than the second, since the second sentence gives an explanation of the first,
both are correct. The third annotation, however, is clearly incorrect.
The nuclearity of the EDUs between which the relation is added must also be correct.
Figure 7.1: Three annotations: (a) Non-Volitional Cause, (b) Elaboration and (c) Concession, each between 41A and 41B.

To score each relation, a scoring table must be developed.
7.3 Full Tree Evaluation
The full tree evaluation is done by comparing an RST tree generated by the recognizer
program with a manually annotated tree. This idea is also presented in Marcu's thesis
[Mar97b]. To be able to evaluate these full discourse trees, the ways in which they might
differ from each other should be identified.
First of all, the segmentation can differ. If a human annotator split the discourse at
other points than the recognizer did, differences occur. This can be prevented by using only
sentence boundaries for segmentation, which ensures that the human annotator and the
recognizer use the same elementary discourse units.
The second possible difference can be the structure of the tree. A set of three elementary
discourse units can be connected to each other using different schemas.
The third difference is the actual relation between elementary discourse units, and their
nuclearity aspect.
To compare an automatically generated tree, only the second and third differences need
to be checked; the first check can be omitted for trees generated by the developed automatic
recognizer, because both approaches are based on the same segmentation. Since a relation
cannot be compared with another relation if the elementary discourse units are different,
this is the first point of measurement: the total number of correct relations is counted,
where a correct relation is one which equals the relation in the manually generated tree.
The last part consists of counting the correct relation names. The final measurement
thus consists of two parts: a percentage of correct relation organization and a percentage of
correct labels. The measurement of the percentage of correct relation organizations is based
on the fact that a correct tree of n nodes contains exactly n-1 relations if sentences are used
as EDUs; a Joint or Sequence which contains m nodes counts as m-1 relations. The total
correctness of a tree is measured by multiplying the two percentages.
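The comparison can be sketched as set operations over pairs of connected spans. In the following Python sketch, encoding relations as a mapping from EDU pairs to labels is an assumption for illustration, not the recognizer's actual format, and the span names are made up:

```python
# Sketch of the full tree evaluation measure: organization percentage,
# label percentage, and their product as total correctness.
def evaluate(auto, manual, n_relations):
    """auto/manual map an EDU-span pair, e.g. ("A", "B"), to a label."""
    same_org = set(auto) & set(manual)                 # matching organization
    same_label = {p for p in same_org if auto[p] == manual[p]}
    org_pct = len(same_org) / n_relations
    label_pct = len(same_label) / len(same_org) if same_org else 0.0
    return org_pct, label_pct, org_pct * label_pct     # total correctness

manual = {("A", "B"): "Background", ("B", "C"): "Summary",
          ("C", "D"): "Elaboration", ("D", "E"): "Contrast",
          ("E", "F"): "Elaboration"}
auto = dict(manual)
del auto[("A", "B")]
auto[("A", "C")] = "Elaboration"   # one span attached elsewhere
print(evaluate(auto, manual, 5))   # (0.8, 1.0, 0.8)
```

With four of five organizations matching and every matching pair labeled identically, the sketch reproduces an 80%/100%/80% pattern like the one in table 7.1.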
The process of full tree evaluation is illustrated with the following extract:
[Geneesmiddelen ter bestrijding van infecties zijn onder andere gericht tegen bacteriën, virussen
en schimmels.]42A [Deze geneesmiddelen zijn zo gemaakt dat ze zo toxisch mogelijk zijn voor
het infecterende micro-organisme en zo veilig mogelijk voor menselijke cellen.]42B [Ze zijn dus zo
gemaakt dat ze selectief toxisch zijn.]42C [Productie van geneesmiddelen met selectieve toxiciteit
om bacteriën en schimmels te bestrijden, is relatief gemakkelijk, omdat bacteriële en schimmelcellen sterk van menselijke cellen verschillen.]42D [Het is echter zeer moeilijk om een geneesmiddel
te maken dat een virus bestrijdt zonder de geïnfecteerde cel en daardoor ook andere menselijke
cellen aan te tasten.]42E
This text can be annotated like the trees in figure 7.2.
These trees differ only in their relation between the first and the second sentence. The
comparison results are shown in table 7.1.
To show the difference, the next example is an annotation which differs highly. The tree
in figure 7.3 is compared to the first one in figure 7.2. The results are shown in table 7.2.
There are two relation organizations which are correct, between 42D and 42E and the arc
from the contrast schema to 42B. The tags with these connections are not always correct.
Figure 7.2: Two annotation trees (a) and (b) over 42A-42E, built from Background, Elaboration, Summary and Contrast relations; they differ only in the relation attached to the first sentence (42A).
                       Numbers   Percentage
Correct Organization   4         80 %
Correct Label          4         100 %
Total Correctness                80 %

Table 7.1: Comparison Results
Only the Contrast relation is labeled correctly. So the organization precision is
40% and the label precision is 50%.
                       Numbers   Percentage
Correct Organization   2         40 %
Correct Label          1         50 %
Total Correctness                20 %

Table 7.2: Comparison Results
7.4 Testing Results
To measure the accuracy of the recognizer, this section discusses its results as the amount
of input the recognizer uses is increased. A number of texts was randomly extracted from
the Merck Manual. These texts were annotated by a human annotator and by the automatic
recognizer, and the automatically generated trees are compared to the manually annotated
trees. It is shown that the accuracy of the recognizer grows as more word types which can
signal relations are added. The word types used are shown below:
1. Conjunctive Adverbs
2. Adverbs
3. Pronouns
4. Medical Discourse Markers
5. Keyword Repetition
Figure 7.3: Third tree over 42A-42E, built from Background, Justify, Contrast and Non-Volitional Cause relations.
The recognizer has processed each text multiple times. First, the recognizer uses only
conjunctive adverbs; next, it uses conjunctive adverbs and adverbs. This process is repeated
until all the word types listed above are used.
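This incremental procedure can be sketched as a simple loop over a growing set of marker types. The recognizer and the evaluator are stubbed out with placeholder functions, so every name in this sketch is hypothetical:

```python
# Sketch of the incremental evaluation runs: each pass enables one more
# marker type, cumulatively, and records the evaluation result.
MARKER_TYPES = ["conjunctive adverbs", "adverbs", "pronouns",
                "medical discourse markers", "keyword repetition"]

def run_incrementally(text, recognize, evaluate, gold_tree):
    results = []
    for i in range(1, len(MARKER_TYPES) + 1):
        enabled = MARKER_TYPES[:i]            # cumulative set of marker types
        tree = recognize(text, enabled)
        results.append((enabled[-1], evaluate(tree, gold_tree)))
    return results

# Dummy stand-ins show the control flow only: the "tree" is just the
# number of enabled marker types, and evaluation passes it through.
counts = run_incrementally("...",
                           recognize=lambda t, markers: len(markers),
                           evaluate=lambda tree, gold: tree,
                           gold_tree=None)
print(counts[0])    # ('conjunctive adverbs', 1)
print(counts[-1])   # ('keyword repetition', 5)
```

In the real setup the stubs would be replaced by the recognizer and the full tree evaluation; the loop structure itself is what the tables in this section reflect.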
The results are presented for two texts differing in size, which were annotated manually
by three human annotators. The results of the recognizer on these same texts are evaluated
against the human annotated trees.
The results show the separate score of the first marker type, and the total number of
recognized relations for each word type which is added. The results thus make clear which
word type increased the number of recognized relations.
After this, some of the results of the recognizer on three smaller texts are shown and
compared with the results of a single human annotator. These texts are automatically
recognized with the use of all the word types listed above. The differences between the
results of the automatic recognizer and a human annotator are discussed.
These tests are extended in the next section, where a list of evaluations of seven more
texts is presented. These results are compared to the evaluation results of texts which
were annotated by two annotators. The last section discusses the results of the automatic
recognizer.
7.4.1 First Text
This first text is manually annotated by three annotators. The text is shown below:
[Veel aandoeningen die op de huid verschijnen, blijven tot de huid beperkt.]43A [Andere aandoeningen komen naast de huid ook in de inwendige organen voor.]43B [Zo krijgen mensen met
systemische lupus erythematodes een ongewone roodachtige uitslag op de wangen, meestal nadat
ze aan zonlicht zijn blootgesteld.]43C [Artsen moeten dus rekening houden met vele mogelijke inwendige oorzaken wanneer ze huidproblemen onderzoeken.]43D [Door het gehele huidoppervlak op
bepaalde patronen van huiduitslag te onderzoeken kunnen ze een eventuele ziekte vaststellen.]43E
[Om te controleren in hoeverre het huidprobleem zich heeft verspreid, kan de arts een patiënt verzoeken zich helemaal uit te kleden, ook wanneer de patiënt slechts een afwijking op een klein deel
van de huid heeft opgemerkt. ]43F [Om bepaalde ziekten op te sporen of uit te sluiten, kunnen
artsen bovendien een bloedonderzoek of andere laboratoriumonderzoeken laten uitvoeren.]43G
Two of the annotations showed a similar tree; the third differed slightly. The relation
labels, however, differed to a greater degree. Since the manually annotated trees differ, they
are combined into an average tree, as follows. Since all EDUs are the same, the relations
can be compared. First the composition is checked: relations which were added between
the same EDUs by at least 2 out of the 3 annotators are kept. After this, the labels are
checked: the label which was added by at least 2 of the 3 annotators is used. The three
manually annotated trees are shown in figures 7.4, 7.5 and 7.6. The combined tree is shown
in figure 7.7.
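This combination procedure is a two-out-of-three majority vote, first on the organization and then on the labels. A Python sketch, with relations encoded, for illustration only, as a mapping from EDU pairs to labels:

```python
# Sketch of combining three annotations by 2-out-of-3 majority vote.
# The EDU-pair encoding and the example labels are hypothetical.
from collections import Counter

def combine(annotations, quorum=2):
    combined = {}
    # Organization: keep pairs annotated by at least 'quorum' annotators.
    pair_votes = Counter(p for ann in annotations for p in ann)
    for pair, votes in pair_votes.items():
        if votes < quorum:
            continue
        # Label: keep the label chosen by at least 'quorum' annotators.
        labels = Counter(ann[pair] for ann in annotations if pair in ann)
        label, n = labels.most_common(1)[0]
        combined[pair] = label if n >= quorum else None   # None: no agreement
    return combined

a1 = {("A", "B"): "Elaboration", ("B", "C"): "Contrast"}
a2 = {("A", "B"): "Elaboration", ("B", "C"): "Concession"}
a3 = {("A", "B"): "Background",  ("A", "C"): "Elaboration"}
print(combine([a1, a2, a3]))
# ("A","B") is kept with label "Elaboration" (2 of 3 agree);
# ("B","C") is kept but without an agreed label; ("A","C") is dropped.
```

The same code serves the second text, where the quorum had to be based on agreement between two annotators because no connection was shared by all three.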
Figure 7.4: First Manually Annotated Tree
The automatically generated tree, shown in figure 7.8, is compared to the combined tree.
These results are shown in table 7.3.
Table 7.3 shows an increase in recognized relations when the conjunctive adverbs, the
adverbs and the implicit markers are used. The implicit markers increase the number of
recognized relations because the recognizer uses them for the recognition of long-distance
relations. Although the other marker types, such as the markers non-start, did not increase
the number of recognized relations, they did increase the likelihood that the previously
recognized relations are correct.
The numbers between parentheses are the total numbers of added relations. In total
there are 6 connections possible, so the percentages are based on that number.
Figure 7.5: Second Manually Annotated Tree
Figure 7.6: Third Manually Annotated Tree
Figure 7.7: Combined Manually Annotated Tree
Furthermore, the relations which were annotated identically by all annotators separately
are compared with the result of the recognizer on the same specific part. This is done for
both the organization and the labeling.
There were four organizations on which all human annotators agreed. Three of these
connections were found by the recognizer as well, although one of them was a Joint relation.
Of these four connections, the human annotators agreed on the label only once; the
recognizer labeled that relation identically as well.
                         Numbers   Percentage
Conjunctive Adverbs
  Correct Organization   1 (1)     16.7 %
  Correct Label          1         100 %
Adverbs
  Correct Organization   3 (3)     50 %
  Correct Label          3         100 %
Pronouns
  Correct Organization   3 (3)     50 %
  Correct Label          3         100 %
Markers Non-Start
  Correct Organization   3 (3)     50 %
  Correct Label          3         100 %
Relation Based
  Correct Organization   3 (3)     50 %
  Correct Label          3         100 %
Medical Markers
  Correct Organization   3 (3)     50 %
  Correct Label          3         100 %
Implicit Markers
  Correct Organization   4 (6)     66.7 %
  Correct Label          3         75 %
Total Correctness                  50 %

Table 7.3: Results of First Text
Figure 7.8: Automatically Generated Tree
7.4.2 Second Text
The second text is the following:
[De productie van groeihormoon door de hypofyse kan worden vastgesteld met behulp van de
zogenaamde GHRH-argininetest of de insulinetolerantietest.]44A [Doordat het lichaam groeihormoon in de loop van het etmaal met pieken produceert (vooral ’s nachts), is de bloedspiegel op
één bepaald moment geen aanwijzing voor een al dan niet normale productie.]44B [De arts meet
daarom vaak het gehalte aan IGF-I in het bloed omdat dit gehalte doorgaans langzaam verandert
in verhouding tot de totale hoeveelheid door de hypofyse afgegeven groeihormoon.]44C [Een klein
tekort aan groeihormoon is bijzonder moeilijk vast te stellen.]44D [Bovendien zijn bij verminderde
werking van de schildklier of bijnier de groeihormoonspiegels meestal laag.]44E
The manually annotated trees are combined into a single tree, as was done for the trees
of the first text. The differences between the human annotations were quite large: no
connection was added in the same way by all three annotators. Therefore the combination
is based on agreement between two annotators. The combination process itself is the same
as for the first text.
The evaluation results of the automatically generated tree against the combined tree of
the human annotators are shown in table 7.4. They are quite similar to the results of the
first text: increases were seen with the same discourse marker types. Again the numbers
between parentheses are the total numbers of added relations. In total, 5 relations should
be added, on which the percentages are based.
7.4.3 Third Text
The first small text is shown below. It consists of three sentences, used as elementary
discourse units. This text is annotated by a single human annotator. For the automatically
generated tree, all word types discussed before are used.
                         Numbers   Percentage
Conjunctive Adverbs
  Correct Organization   1 (1)     20 %
  Correct Label          1         100 %
Adverbs
  Correct Organization   1 (2)     20 %
  Correct Label          1         100 %
Pronouns
  Correct Organization   1 (2)     20 %
  Correct Label          1         100 %
Markers Non-Start
  Correct Organization   1 (3)     20 %
  Correct Label          1         100 %
Relation Based
  Correct Organization   1 (3)     20 %
  Correct Label          1         100 %
Medical Markers
  Correct Organization   1 (3)     20 %
  Correct Label          1         100 %
Implicit Markers
  Correct Organization   2 (4)     50 %
  Correct Label          2         100 %
Total Correctness                  50 %

Table 7.4: Results of Second Text
[Behalve dat de aanmaak van rode bloedcellen afneemt, wordt ook het zenuwstelsel door een
vitamine-B12-tekort aangetast.]45A [Hierdoor ontstaan tintelingen in handen en voeten, gevoelsstoornissen in benen, handen en voeten en spastische bewegingen.]45B [Andere symptomen zijn onder
meer een bepaalde vorm van kleurenblindheid, gewichtsverlies, donkere verkleuring van de huid,
verwardheid, depressie en vermindering van de verstandelijke vermogens.]45C
In figure 7.9, the RST tree created by a human annotator is shown; in figure 7.10, the
tree generated by the recognizer. The relation between the second and third EDU is the
same. The relation between the first and the second EDU, however, differs, but in fact the
difference is minor. Both annotations show a cause-result relation between the EDUs, with
the same cause and result: the first EDU is the cause, the second the result. What differs
is the importance assigned to the EDUs: while the human annotator chose the first EDU as
the nucleus, the recognizer picked the second. Analysis of the decision tree showed that the
recognizer has a slight preference for adding Non-Volitional Cause over Non-Volitional
Result.
Figure 7.9: Manually Annotated Tree
The result of the evaluation is presented in table 7.5.
                       Numbers   Percentage
Correct Organization   1 (2)     50 %
Correct Label          1         100 %
Total Correctness                50 %

Table 7.5: Results of the Third Text
7.4.4 Fourth Text
A different small text is the following:
Figure 7.10: Automatically Annotated Tree
[Bij macroglobulinemie ontstaat ook vaak cryoglobulinemie, een aandoening die wordt gekenmerkt
door cryoglobulinen.]46A [Dit zijn abnormale antistoffen die in het bloed precipiteren (neerslaan)
wanneer de temperatuur ervan tot onder de lichaamstemperatuur daalt en die weer oplossen als
de temperatuur stijgt.]46B [Patiënten met cryoglobulinemie kunnen zeer gevoelig worden voor kou
of het Raynaud-fenomeen ontwikkelen.]46C [Hierbij worden de handen en voeten bij blootstelling
aan kou zeer pijnlijk en wit.]46D
Again, the text is annotated by a single human annotator and the automatic recognizer.
The annotations created by the human annotator and the recognizer were identical, so the
evaluation results in a total correctness of 100%. The annotation tree is shown in figure
7.11.
Analysis of the decision tree shows that the relation between the first and the second
EDU is clearly signaled by the word "Dit". The relation between the third and the fourth
EDU is signaled by the word "Hierbij". The remaining relation is added because both
sentences speak about "cryoglobulinemie".
Figure 7.11: Manually/Automatically Annotated Tree
7.4.5 Fifth Text
The last small text is the following:
[Voorlichting over elektriciteit en voorzichtigheid met betrekking tot het omgaan met elektriciteit
zijn van groot belang.]47A [Ongevallen met elektriciteit thuis en op het werk kunnen worden
voorkomen door ervoor te zorgen dat ontwerp, installatie en onderhoud van alle elektrische apparaten in orde zijn.]47B [Elk elektrisch apparaat waarmee het lichaam in aanraking kan komen, dient
goed geaard te zijn en te zijn aangesloten op een stroomkring met een stroomonderbreker.]47C
[Dergelijke veiligheidsmechanismen die de stroomkring onderbreken bij een lekstroom van 5 mA,
vormen een uitstekende bescherming en zijn overal verkrijgbaar.]47D
It is annotated by a single human annotator and by the automatic recognizer, and the
generated trees are compared.
Figure 7.12 shows the manually annotated tree. It consists of two Elaborations and
a Background relation. Since the recognizer cannot recognize a Background relation, the
automatically generated tree, shown in figure 7.13, differs. Indeed the first relation is not
present: instead of the Background relation from the first EDU to the second, another
Elaboration relation is added from the second towards the first EDU. Although the relations
are not the same, both representations indicate that the second sentence provides extra
information about the topic discussed in the first sentence.
Figure 7.12: Manually Annotated Tree
The result of the evaluation is presented in table 7.6.
Figure 7.13: Automatically Annotated Tree
                       Numbers   Percentage
Correct Organization   2 (3)     66.7 %
Correct Label          2         100 %
Total Correctness                66.7 %

Table 7.6: Results of the Fifth Text
7.4.6 More Texts
In this section some further results of the recognizer are presented. More texts were
annotated, and the total correctness of each annotation is calculated by comparing the
automatically generated trees to a manually annotated version of the same text. The results
are shown in table 7.7, sorted by the total number of relations the texts contain.
The results of the texts in this table, together with the results presented before, show that
the recognition process is very difficult: results vary from very poor (14.3%) to excellent
(100%). Furthermore, recognizing the organization is easier than the labeling; the
organization scores 50% or above in all cases.
Relations   Organization   Label        Total
3           3 (100 %)      1 (33.3 %)   33.3 %
4           4 (100 %)      3 (75 %)     75 %
4           3 (75 %)       3 (100 %)    75 %
4           2 (50 %)       1 (50 %)     25 %
6           3 (50 %)       1 (33.3 %)   16.7 %
6           3 (50 %)       2 (66.7 %)   33.3 %
7           4 (57.1 %)     1 (25 %)     14.3 %

Table 7.7: More Text Results
Relations   Organization   Label       Total
4           3 (75 %)       1 (33 %)    25 %
4           4 (100 %)      4 (100 %)   100 %
4           4 (100 %)      3 (75 %)    75 %
6           4 (67 %)       3 (75 %)    50 %
6           5 (83.3 %)     2 (40 %)    33.3 %
7           4 (57 %)       1 (25 %)    14.3 %

Table 7.8: More Text Results
The labeling scores range from 25% to 100%. One reason the organization differs is that
nucleus and satellite are sometimes swapped; even if the relation label were correct, this is
counted as a mistake.
To evaluate the recognizer more precisely, texts should be used on which multiple human
annotators fully agree.
7.4.7 Annotator Evaluation
To be able to conclude whether these results are acceptable, the results of human
annotators are compared with each other. These calculations are based on other texts than
those used for the evaluation of the recognizer. The results are shown in table 7.8.
The table shows the numbers for 7 texts, which vary in length between 4 and 7 EDUs.
For each text, two manually annotated trees are compared in the same way as a manually
generated tree is compared with an automatically generated tree.
The organization agreement is 57% or higher, and the label agreement is between 25%
and 100% as well. The degree of difference between annotations generated by two human
annotators is thus similar to the difference between annotations of a human annotator and
the recognizer.
7.5 Discussion
In this last section the results of the recognizer are discussed. Some adjustments are
presented, and it is discussed whether they would improve the result of the automatic
recognizer.
As seen in the previous sections, the trees generated by the automatic recognizer differ
from a manually generated one about as much as two manually generated trees differ from
each other. A first remark here is that the recognizer is more limited in the relations it
can use. Furthermore, human annotators use other sources of information than just
discourse markers.
The fact that two manually annotated versions can differ greatly from each other is an
important one. The automatic recognizer is built on the results of human annotators;
therefore its results are not expected to be better.
To check whether it is possible to optimize the results of the recognizer, different
likelihood values were tried. It turned out, however, that as long as the order in which the
discourse markers appear in the hierarchy does not change, the resulting trees do not differ,
or only slightly. The main differences would occur between two EDUs related by, for example,
a Non-Volitional Cause: with a mixed ranking they could be related inversely, that is, the
nucleus and the satellite could be swapped and a Non-Volitional Result added. If the ranking
were mixed up, the quality of the trees would decrease.
Another possibility is the adjustment of the threshold. If, for example, the threshold
were increased, fewer relations would be recognized, but the relations which remain were
already correct in the first place. If the threshold were too low, too many relations would be
recognized which actually are not worth recognizing. Therefore a large change of the
threshold is not necessary.
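The effect of the threshold can be illustrated with a small filter over scored candidate relations; the likelihood values below are made up for illustration and are not values from the recognizer:

```python
# Sketch of threshold filtering over candidate relations. Each candidate
# is (EDU pair, relation label, likelihood score); scores are hypothetical.
candidates = [(("A", "B"), "Elaboration", 0.9),
              (("B", "C"), "Contrast", 0.6),
              (("C", "D"), "Joint", 0.3)]

def accept(cands, threshold):
    return [(pair, rel) for pair, rel, score in cands if score >= threshold]

print(len(accept(candidates, 0.5)))   # 2: a moderate threshold keeps two
print(len(accept(candidates, 0.8)))   # 1: a high threshold keeps only the safest
```

Raising the threshold only shrinks the accepted set; it cannot turn an incorrect relation into a correct one, which is why a large change brings little benefit.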
The next adjustment which could be made is extending the short list of relations which
can be recognized; the complete list as defined by RST could be used. Although the
evaluation of the comparison between manually generated trees and automatically generated
trees could then be better, the relations which would be added are harder to find: different
discourse markers, or different ways of recognition, would be necessary.
For this assignment, lists of special discourse markers were gathered. Although these
markers do not often signal a relation, they increase the likelihood of found relations. It
could be useful to search for other domain specific discourse markers and features, although
none were found during this assignment.
Relation specific markers are used as well. They can signal relations, and they increase
the likelihood that a relation is correct. For the other relations, similar lists of words could
be developed.
For some genres, compositional information can be used for the recognition of structure.
Although some compositional information could be useful in this domain, the amount of
composition data was too low to implement these features. Furthermore, the larger the
texts become, the harder the structure is to find, and the compositional information which
was available applied to larger stretches of text.
It is possible to change the algorithm of the recognizer. For example, all possible pairs of
EDUs could be checked for a relation. The drawback of this approach lies in the use of
discourse markers for the recognition of relations between sentences: while sentences are not
necessarily connected to each other, the parts within a sentence usually are. If a discourse
marker is taken to signal a relation, it does so in every case it is considered. The sentence
which contains the discourse marker would therefore be connected to many other EDUs,
and the majority of those relations would be incorrect. Therefore, discourse markers are
not optimal for the recognition of relations between sentences.
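The imbalance can be made concrete with a quick count: a correct tree of n nodes contains exactly n-1 relations, while exhaustive pairing proposes n(n-1)/2 candidates, so most marker-triggered links between arbitrary sentence pairs would necessarily be wrong:

```python
# A correct RST tree over n EDUs has n - 1 relations; pairing every
# two EDUs yields n * (n - 1) / 2 candidate links.
def tree_relations(n):
    return n - 1

def all_pairs(n):
    return n * (n - 1) // 2

for n in (5, 10, 20):
    print(n, tree_relations(n), all_pairs(n))
# For 20 EDUs: 19 relations in the tree versus 190 candidate pairs,
# so at least 171 of the proposed links could not be part of the tree.
```

The gap widens quadratically with text length, which is why the recognizer restricts marker-based linking rather than testing all pairs.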
This drawback is partially covered by the recognition of implicit markers through keyword
repetition. However, there are plenty of cases in which human annotators added a
long-distance relation which the recognizer was not able to find. This suggests the use of more
linguistic tools, such as WordNet, to help find connections between sentences; the connected
sentences could then be checked for discourse markers. Such a marker would then signal the
relation more strongly, while also indicating which label should be added.
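A minimal sketch of such an implicit link finder: plain keyword repetition between two sentences, where a thesaurus like WordNet would generalize the identity test to synonyms or hyponyms. The stop word list here is illustrative only:

```python
# Sketch: hypothesize an implicit relation between two sentences when
# their content words overlap. A thesaurus (e.g. WordNet) would extend
# the overlap test to synonyms/hyponyms; here only literal repetition
# is checked. The Dutch stop word list is a small illustrative sample.
STOP = {"de", "het", "een", "en", "van", "in", "is", "zijn", "dat", "die"}

def content_words(sentence):
    return {w.strip(".,()").lower() for w in sentence.split()} - STOP

def implicitly_linked(s1, s2, min_overlap=1):
    return len(content_words(s1) & content_words(s2)) >= min_overlap

s1 = "Patiënten met cryoglobulinemie kunnen zeer gevoelig worden voor kou."
s2 = "Hierbij worden de handen en voeten bij blootstelling aan kou zeer pijnlijk."
print(implicitly_linked(s1, s2))   # True: e.g. "kou" repeats
```

A WordNet-backed variant would replace the set intersection with a lookup that treats two words as matching when they share a synset, which covers the long-distance relations that literal repetition misses.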
A last note on this automatic recognition process is that it could perform better if more
annotated data were available. In that case, the numbers and discourse markers used could
be more specific. Furthermore, it would then be possible to apply machine learning and
check what performance that would yield.
Chapter 8
Conclusions and Future Work
First of all, the original research questions are stated once more, and conclusions are
added per subquestion. After that, some general conclusions about this research are drawn.
Finally, some recommendations for future work are presented.
8.1 Conclusions
1. How can RST annotations be automatically generated?
RST annotations can be automatically generated. There are generators based on
machine learning, and programs based on static rules. For this assignment, an automatic
recognizer program was written which is based on static rules. The recognizer consists
of three parts: a segmenter, the actual recognition program and a tree builder. The
recognizer uses lists of discourse markers of several word types. Strength indication
variables, based on learning, are assigned to each word type; these are used to determine
which relation fits best between two elementary discourse units. The developed
automatic recognition program generates Rhetorical Structure Theory trees which differ
from manually annotated trees to a similar degree as two manually generated trees
differ from each other.
2. Is RST suitable for Dutch and specifically for medical texts?
Yes, RST is suitable for Dutch. It has been used in several other studies on various
topics. During this assignment, several Dutch texts from the Merck Manual have been
successfully annotated with RST, and texts from other Dutch genres have been
annotated as well. The recognizer was tested on medical texts, but it produces similar
results for texts from other Dutch genres. Some genres, however, are not suited to
being annotated with RST.
3. Which properties of Dutch and which relations and discourse markers are best for
automatic recognition?
In this assignment, different types of discourse markers are discussed. Markers which
are described in other research are used, and lists of domain specific markers are
gathered. These discourse marker types are fitted into a hierarchy, which provides an
overview of the importance and quality of each of the discourse marker types. The
clearest markers appear at the start of a sentence, but all of the discussed markers can
be used when they reside within a sentence as well. During this research, relations
from the original list as defined by RST were combined, resulting in a small set with
a coverage of over 88%. These relations are the most used, and are suitable for
automatic recognition. This list can be extended with relations from the original list,
and new relations can be added.
4. Is it possible to use other properties of medical texts as discourse markers to find relations?
While the Merck Manual provides features like lists and tables, these can be used only
in a few cases. The main part of the information is plain text, and therefore no
compositional information is used except for the division into paragraphs. The texts
which are automatically annotated are considered to be coherent. Furthermore, the
use of lists and tables is not typical of medical texts; other texts, like recipes, use them
as well, and since the latter are generally much smaller, this compositional information
is more useful in those texts. Other properties of medical texts are used by the
automatic recognizer. Lists of noun constructions and verb constructions, with the
relations they signal, were gathered and discussed. These markers are specific to the
medical domain. Although they are not strong with respect to the general discourse
markers for Dutch, they increase the likelihood of found relations. In cases where no
other discourse markers are present, they are used to signal relations on their own.
5. How can automatically generated annotations be evaluated?
Two ways of evaluation are discussed: the first is based on relation evaluation, the
second on the evaluation of full trees. The full tree evaluation method is used for the
evaluation of the recognizer, and also to compare manually annotated trees with each
other. The method yields a score for the presence of a relation and a separate score
for the labeling of the relation; the evaluation is thus useful in two ways.
One of the remarks on this research is the fact that annotations differ greatly, even between
human annotators. In part this is because it is hard to create a good tree from a text, which
is also affected by the fact that the people who annotated these texts are not professional
linguists, which might decrease the performance. It is also an aspect of language itself: for
some texts there may be an optimal tree, while for other (larger) texts this may be impossible.
This conclusion agrees with the results of Mann and Thompson, who state in their work
that most texts have multiple possible structures.
8.2 Future Work
This research can be extended in several ways. Some possible extensions are presented and
discussed separately.
1. More Discourse Marker Types
Since the performance of the recognizer increased by adding more discourse marker
types, it may be worthwhile to search for other types or constructions which can be
helpful. Some types which were found during analysis but were not implemented are
the signals for the recognition of lists/sequences and words which signal Elaborations,
although the latter is partly used. It may also be worthwhile to check for larger
structures, although they were not found in the data used.
2. Extend the Relation List
The recognizer is able to recognize only a small portion of the relations described
by RST. It would be useful to research the possibilities of extending this list with
the remaining relations. It is also possible to merge the list with relations defined by
other theories, to create a new list of relations.
3. Use Domain Specific Features
While the Merck Manual is presented in a clear form, this knowledge was hard to
use for recognition, and it would only be interesting for the automatic annotation of
very large texts. For smaller texts, features like bulleted lists, headings, etcetera could
be useful. For some texts, like recipes, the fact that they usually follow a certain
writing style (first the preparation of the ingredients, then the cooking, and finally
the serving) can be used as constraints for the recognizer.
4. Use a Thesaurus
A thesaurus makes it possible to compare hyponyms, homonyms, etcetera. The use
of such a thesaurus would probably increase the performance of the recognizer.
5. Create a New Tool
The RST tool of O'Donnell has many drawbacks; it is hard to work with. It would
therefore be useful to create a new program, or to fix bugs in and add features to
the existing one. Furthermore, an automatic recognition program could be integrated
into the new tool.
6. Create a Dutch Corpus
Creating a Dutch corpus of texts annotated with RST would offer a great source of
material. Moreover, the creation of such a corpus would give a better understanding
of the structure of texts. The corpus could also be used with a machine learning
algorithm.
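Several of the directions above, in particular the search for more discourse marker types, could be prototyped along the following lines. This is a minimal sketch only: the marker-to-relation table is an illustrative assumption, not the marker list used by the recognizer described in this thesis.

```python
# Sketch of cue-phrase based relation hypothesizing. The MARKERS table is an
# illustrative assumption: a real lexicon would be derived from corpus analysis.
MARKERS = {
    "omdat": ["cause"],        # "because"
    "hoewel": ["concession"],  # "although"
    "tenzij": ["otherwise"],   # "unless"
    "daarna": ["sequence"],    # "after that"
}

def hypothesize_relations(segment):
    """Return the candidate relations signalled by markers in a text segment."""
    candidates = []
    for token in segment.lower().split():
        # strip trailing punctuation before the lexicon lookup
        candidates.extend(MARKERS.get(token.strip(".,;:"), []))
    return candidates

print(hypothesize_relations("Hij bleef thuis omdat het regende."))  # ['cause']
```

A thesaurus lookup (extension 4) would slot in at the same point, adding candidates for segments in which no explicit marker is found.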
Bibliography

[Abn91] Steven Abney. Parsing by chunks. In Berwick, Abney, and Tenny, editors, Principle-Based Parsing. Kluwer Academic Publishers, 1991.

[ART93] E. Abelen, G. Redeker, and S.A. Thompson. The rhetorical structure of US-American and Dutch fund-raising letters. Text, 13(3):323–350, 1993.

[Bir85] D.P. Birkmire. Text processing: The influence of text structure, background knowledge and purpose. Reading Research Quarterly, 20(3):314–326, 1985.

[BMK03] Jill Burstein, Daniel Marcu, and Kevin Knight. Finding the WRITE stuff: Automatic identification of discourse structure in student essays. IEEE Intelligent Systems, 18(1):32–39, 2003.

[BvNM01] G. Bouma, G. van Noord, and R. Malouf. Alpino: Wide-coverage computational analysis of Dutch. Computational Linguistics in The Netherlands, 2001.

[CFM+02] Cassandre Creswell, Kate Forbes, Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. The discourse anaphoric properties of connectives. In Proceedings of the Discourse Anaphora and Anaphor Resolution Colloquium, Lisbon, Portugal, 2002.

[CMO03] Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. Building a discourse-tagged corpus in the framework of Rhetorical Structure Theory. In Current Directions in Discourse and Dialogue, 2003.

[CO98] Simon Corston-Oliver. Computing Representations of the Structure of Written Discourse. PhD thesis, University of California, Santa Barbara, 1998.

[Fra99] Bruce Fraser. What are discourse markers? Journal of Pragmatics, pages 931–952, 1999.

[GS86] Barbara J. Grosz and Candace L. Sidner. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175–203, 1986.
[GS98] Brigitte Grote and Manfred Stede. Discourse marker choice in sentence planning. In Proceedings of the Ninth International Workshop on Natural Language Generation, pages 128–137. Association for Computational Linguistics, New Brunswick, New Jersey, 1998.

[HHS03] T. Hanneforth, S. Heintze, and M. Stede. Rhetorical parsing with underspecification and forests. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 2003.

[HL93] J. Hirschberg and D. Litman. Empirical studies on the disambiguation of cue phrases. Computational Linguistics, 19(3):501–530, 1993.

[HM97] E. Hovy and E. Maier. Parsimonious or profligate: How many and which discourse structure relations? Discourse Processes, 1997.

[HMSvdW01] H. Hoekstra, M. Moortgat, I. Schuurman, and T. van der Wouden. Syntactic annotation for the Spoken Dutch Corpus project (CGN). In W. Daelemans, K. Sima'an, J. Veenstra, and J. Zavrel, editors, Proceedings of the 3rd ESCA/COCOSDA Workshop on Speech Synthesis, pages 73–87. Amsterdam: Rodopi, 2001.

[Hob85] J.R. Hobbs. On the coherence and structure of discourse. Technical Report CSLI-85-37, Stanford University, 1985.

[Hov93] E. Hovy. Automated discourse generation using discourse structure relations. Artificial Intelligence, 63 (Special Issue on NLP), 1993.

[Kno93] A. Knott. Using cue phrases to determine a set of rhetorical relations. In O. Rambow, editor, Intentionality and Structure in Discourse Relations: Proceedings of the ACL SIGGEN Workshop, 1993.

[KS98] A. Knott and T. Sanders. The classification of coherence relations and their linguistic markers: An exploration of two languages. Journal of Pragmatics, 30:135–175, 1998.

[KS04] L. Koenen and R. Smits. Handboek Nederlands. Utrecht: Bijleveld, 2004.

[Mar97a] Daniel Marcu. The rhetorical parsing of natural language texts. In Proceedings of ACL/EACL'97, pages 96–103, 1997.

[Mar97b] Daniel Marcu. The Rhetorical Parsing, Summarization and Generation of Natural Language Texts. PhD thesis, Department of Computer Science, University of Toronto, 1997.

[MAR99] Daniel Marcu, Estibaliz Amorrortu, and Magdalena Romera. Experiments in constructing a corpus of discourse trees. In Proceedings of the ACL Workshop on Standards and Tools for Discourse Tagging, College Park, MD, pages 48–57, 1999.
[Mar00] Daniel Marcu. The rhetorical parsing of unrestricted texts: A surface-based approach. Computational Linguistics, 26:395–448, 2000.

[ME02] Daniel Marcu and A. Echihabi. An unsupervised approach to recognizing discourse relations. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002), 2002.

[Mil95] G.A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[MMS93] M.P. Marcus, M.A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

[MRAW04a] Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. Annotating discourse connectives and their arguments. In Proceedings of the Workshop on Frontiers in Corpus Annotation, pages 48–57, 2004.

[MRAW04b] Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. The Penn Discourse TreeBank. In Proceedings of the Language Resources and Evaluation Conference, 2004.

[MS01] Henk Pander Maat and Ted Sanders. Subjectivity in causal connectives: An empirical study of language in use. Cognitive Linguistics, 12(3):247–273, 2001.

[MT88] William C. Mann and Sandra A. Thompson. Rhetorical Structure Theory: Toward a functional theory of text organization. Text, 8(3):243–281, 1988.

[O'D00] Mike O'Donnell. RSTTool 2.4: A markup tool for Rhetorical Structure Theory. In Proceedings of the 1st International Natural Language Generation Conference, 2000.

[PdGVNM04] Thiago Alexandre Salgueiro Pardo, Maria das Graças Volpe Nunes, and Lucia Helena Machado Rino. DiZer: An automatic discourse analyzer for Brazilian Portuguese. In Proceedings of the First International Workshop on Natural Language Understanding and Cognitive Science (NLUCS 2004), 2004.

[Per] Perrez. Connectieven, tekstbegrip en vreemdetaalverwerking. Een studie van de impact van causale en contrastieve connectieven op het begrijpen van teksten in het Nederlands als vreemde taal.

[Por80] M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[PSBA03] R. Power, D. Scott, and N. Bouayad-Agha. Document structure. Computational Linguistics, 29(3):211–260, 2003.

[San] T.J.M. Sanders. Coherence, causality and cognitive complexity in discourse.

[Sch87] Deborah Schiffrin. Discourse Markers. Cambridge University Press, 1987.
[SH04] Manfred Stede and S. Heintze. Machine-assisted rhetorical structure annotation. In Proceedings of the International Conference on Computational Linguistics (COLING-2004), 2004.

[Spe03] Spectrum. Winkler Prins Medische Encyclopedie. Spectrum, 2003.

[Ste04] Manfred Stede. The Potsdam Commentary Corpus. In Proceedings of the Workshop on Discourse Annotation, 42nd Meeting of the Association for Computational Linguistics, 2004.

[Tabng] Maite Taboada. Discourse markers as signals (or not) of rhetorical relations. Journal of Pragmatics, forthcoming.

[TM06] Maite Taboada and William C. Mann. Applications of Rhetorical Structure Theory. Discourse Studies, 8(4), 2006.

[vL05] M. van Langen. Question answering for general practitioners. Master's thesis, University of Twente, Enschede, 2005.

[WG05] F. Wolf and E. Gibson. Representing discourse coherence: A corpus-based analysis. Computational Linguistics, 31(2):249–287, 2005.

[WSK05] Methee Wattanamethanont, Thana Sukvaree, and Asanee Kawtrakul. Thai discourse relations recognition by using naive Bayes classifier. In The Sixth Symposium on Natural Language Processing (SNLP 2005), 2005.
Appendix A: Rhetorical Relations
Below is the set of rhetorical relations as defined in the original paper by Mann and
Thompson [MT88], sorted alphabetically. The following abbreviations are used:
1. N = Nucleus
2. S = Satellite
3. W = Writer
4. R = Reader
=== Antithesis ===
Constraints on N: W has positive regard for the situation presented in N.
Constraints on S: None.
Constraints on the N + S combination: The situations presented in N and S are in contrast (cf. Contrast), i.e. they are (a) comprehended as the same in many respects, (b) comprehended as differing in a few respects, and (c) compared with respect to one or more of these differences; because of an incompatibility that arises from the contrast, one cannot have positive regard for both the situations presented in N and S; comprehending S and the incompatibility between the situations presented in N and S increases R's positive regard for the situation presented in N.
The Effect: R's positive regard for N is increased.
Locus of the Effect: N.

=== Background ===
Constraints on N: R won't comprehend N sufficiently before reading the text of S.
Constraints on S: None.
Constraints on the N + S combination: S increases the ability of R to comprehend an element in N.
The Effect: R's ability to comprehend N increases.
Locus of the Effect: N.

=== Circumstance ===
Constraints on N: None.
Constraints on S: Presents a situation (not unrealized).
Constraints on the N + S combination: S sets a framework in the subject matter within which R is intended to interpret the situation presented in N.
The Effect: R recognizes that the situation presented in S provides the framework for interpreting N.
Locus of the Effect: N and S.

=== Concession ===
Constraints on N: W has positive regard for the situation presented in N.
Constraints on S: W is not claiming that the situation presented in S does not hold.
Constraints on the N + S combination: W acknowledges a potential or apparent incompatibility between the situations presented in N and S; W regards the situations presented in N and S as compatible; recognizing the compatibility between the situations presented in N and S increases R's positive regard for the situation presented in N.
The Effect: R's positive regard for the situation presented in N is increased.
Locus of the Effect: N and S.

=== Condition ===
Constraints on N: None.
Constraints on S: Presents a hypothetical, future, or otherwise unrealized situation (relative to the situational context of S).
Constraints on the N + S combination: Realization of the situation presented in N depends on realization of that presented in S.
The Effect: R recognizes how the realization of the situation presented in N depends on the realization of the situation presented in S.
Locus of the Effect: N and S.

=== Contrast ===
Constraints on N: Multi-nuclear; no more than two nuclei.
Constraints on the combination of nuclei: The situations presented in these two nuclei are (a) comprehended as the same in many respects, (b) comprehended as different in a few respects, and (c) compared with respect to one or more of these differences.
The Effect: R recognizes the comparability and the difference(s) yielded by the comparison being made.
Locus of the Effect: Multiple nuclei.

=== Elaboration ===
Constraints on N: None.
Constraints on S: None.
Constraints on the N + S combination: S presents additional detail about the situation or some element of subject matter which is presented in N or inferentially accessible in N in one or more of the ways listed below. In the list, if N presents the first member of any pair, then S includes the second:
1) set : member
2) abstract : instance
3) whole : part
4) process : step
5) object : attribute
6) generalization : specific
The Effect: R recognizes the situation presented in S as providing additional detail for N. R identifies the element of subject matter for which detail is provided.
Locus of the Effect: N and S.

=== Enablement ===
Constraints on N: Presents an R action (including accepting an offer), unrealized with respect to the context of N.
Constraints on S: None.
Constraints on the N + S combination: R comprehending S increases R's potential ability to perform the action presented in N.
The Effect: R's potential ability to perform the action presented in N increases.
Locus of the Effect: N.

=== Evaluation ===
Constraints on N: None.
Constraints on S: None.
Constraints on the N + S combination: S relates the situation presented in N to the degree of W's positive regard toward the situation presented in N.
The Effect: R recognizes that the situation presented in S assesses the situation presented in N and recognizes the value it assigns.
Locus of the Effect: N and S.

=== Evidence ===
Constraints on N: R might not believe N to a degree satisfactory to W.
Constraints on S: The reader believes S or will find it credible.
Constraints on the N + S combination: R's comprehending S increases R's belief of N.
The Effect: R's belief of N is increased.
Locus of the Effect: N.

=== Interpretation ===
Constraints on N: None.
Constraints on S: None.
Constraints on the N + S combination: S relates the situation presented in N to a framework of ideas not involved in N itself and not concerned with W's positive regard.
The Effect: R recognizes that S relates the situation presented in N to a framework of ideas not involved in the knowledge presented in N itself.
Locus of the Effect: N and S.

=== Justify ===
Constraints on N: None.
Constraints on S: None.
Constraints on the N + S combination: R's comprehending S increases R's readiness to accept W's right to present N.
The Effect: R's readiness to accept W's right to present N is increased.
Locus of the Effect: N.

=== Motivation ===
Constraints on N: Presents an action in which R is the actor (including accepting an offer), unrealized with respect to the context of N.
Constraints on S: None.
Constraints on the N + S combination: Comprehending S increases R's desire to perform the action presented in N.
The Effect: R's desire to perform the action presented in N is increased.
Locus of the Effect: N.

=== Otherwise ===
Constraints on N: Presents an unrealized situation.
Constraints on S: Presents an unrealized situation.
Constraints on the N + S combination: Realization of the situation presented in N prevents realization of the situation presented in S.
The Effect: R recognizes the dependency relation of prevention between the realization of the situation presented in N and the realization of the situation presented in S.
Locus of the Effect: N and S.

=== Purpose ===
Constraints on N: Presents an activity.
Constraints on S: Presents a situation that is unrealized.
Constraints on the N + S combination: S presents a situation to be realized through the activity in N.
The Effect: R recognizes that the activity in N is initiated in order to realize S.
Locus of the Effect: N and S.

=== Restatement ===
Constraints on N: None.
Constraints on S: None.
Constraints on the N + S combination: S restates N, where S and N are of comparable bulk.
The Effect: R recognizes S as a restatement of N.
Locus of the Effect: N and S.

=== Sequence ===
Constraints on N: Multi-nuclear.
Constraints on the combination of nuclei: A succession relationship between the situations presented in the nuclei.
The Effect: R recognizes the succession relationships among the nuclei.
Locus of the Effect: Multiple nuclei.

=== Solutionhood ===
Constraints on N: None.
Constraints on S: Presents a problem.
Constraints on the N + S combination: The situation presented in N is a solution to the problem stated in S.
The Effect: R recognizes the situation presented in N as a solution to the problem presented in S.
Locus of the Effect: N and S.

=== Summary ===
Constraints on N: N must be more than one unit.
Constraints on S: None.
Constraints on the N + S combination: S presents a restatement of the content of N that is shorter in bulk.
The Effect: R recognizes S as a shorter restatement of N.
Locus of the Effect: N and S.

=== Volitional Cause ===
Constraints on N: Presents a volitional action or else a situation that could have arisen from a volitional action.
Constraints on S: None.
Constraints on the N + S combination: S presents a situation that could have caused the agent of the volitional action in N to perform that action; without the presentation of S, R might not regard the action as motivated or know the particular motivation; N is more central to W's purposes in putting forth the N-S combination than S is.
The Effect: R recognizes the situation presented in S as a cause for the volitional action presented in N.
Locus of the Effect: N and S.

=== Non-Volitional Cause ===
Constraints on N: Presents a situation that is not a volitional action.
Constraints on S: None.
Constraints on the N + S combination: S presents a situation that, by means other than motivating a volitional action, caused the situation presented in N; without the presentation of S, R might not know the particular cause of the situation; a presentation of N is more central than S to W's purposes in putting forth the N-S combination.
The Effect: R recognizes the situation presented in S as a cause of the situation presented in N.
Locus of the Effect: N and S.

=== Volitional Result ===
Constraints on N: None.
Constraints on S: Presents a volitional action or a situation that could have arisen from a volitional action.
Constraints on the N + S combination: N presents a situation that could have caused the situation presented in S; the situation presented in N is more central to W's purposes than is that presented in S.
The Effect: R recognizes that the situation presented in N could be a cause for the action or situation presented in S.
Locus of the Effect: N and S.

=== Non-Volitional Result ===
Constraints on N: None.
Constraints on S: Presents a situation that is not a volitional action.
Constraints on the N + S combination: N presents a situation that caused the situation presented in S; presentation of N is more central to W's purposes in putting forth the N-S combination than is the presentation of S.
The Effect: R recognizes that the situation presented in N could have caused the situation presented in S.
Locus of the Effect: N and S.
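For use in a recognizer, the relation schema above can be stored as a small record per relation. This is a sketch only: the field values below are quoted from the definitions in this appendix, but the record type itself is an assumption, not part of the software described in this thesis.

```python
# Store each field of the MT88 relation schema in a record, so that definitions
# can be selected and inspected programmatically.
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationDefinition:
    name: str
    constraints_on_n: str
    constraints_on_s: str
    constraints_on_combination: str
    effect: str
    locus: str

# One entry, copied from the Evidence definition above.
EVIDENCE = RelationDefinition(
    name="Evidence",
    constraints_on_n="R might not believe N to a degree satisfactory to W",
    constraints_on_s="The reader believes S or will find it credible",
    constraints_on_combination="R's comprehending S increases R's belief of N",
    effect="R's belief of N is increased",
    locus="N",
)

print(EVIDENCE.locus)  # N
```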
Appendix B: Conjunctions
Dutch             English
aangezien         since
al                although
aleer             before
alhoewel          although
als               if
alsmede           just as
alsof             as if
alsook            just as
alvorens          before
annex             added to that
behalve           except
daar              because
daardoor          therefore
dan               than
dat               that
dewijl            because
doch              but
doordat           because
doordien          because
dus               so
eer               before
eerdat            before
en                and
ende              and
evenals           just as
gelijk            as
hetzij            either
hoewel            although
hoezeer           how much
indien            if
ingeval           in case of
maar              but
mits              if
na                after
naar              in accordance with
naargelang        as
naardien          if
naarmate          as
nadat             after
nademaal          now
niettegenstaande  although
noch              neither
nu                now
of                if
ofdat             whether
ofschoon          although
oftewel           like
ofwel             either
om                for
omdat             because
opdat             so
overmits          for
schoon            although
sedert            since
sinds             since
tenzij            unless
terwijl           while / although
toen              then
tot               until
totdat            until
uitgenomen        except
uitgezonderd      except
vermits           as
voor              before
vooraleer         before
voordat           before
voorzover         as far as
wanneer           when
want              for
wijl              while
zo                as
zoals             as
zodat             so
zodra             as soon as
zolang            as long as
zover             as far as
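The conjunction table above can serve as a small Dutch cue-word lexicon for a discourse analyzer. The following sketch shows one way to use it; only an excerpt of the list is included, and the lookup function is an illustration, not part of the software described in this thesis.

```python
# Excerpt of the conjunction table above, used as a lookup lexicon.
CONJUNCTIONS = {
    "aangezien": "since",
    "hoewel": "although",
    "omdat": "because",
    "tenzij": "unless",
    "zodra": "as soon as",
}

def is_conjunction(word):
    """Check whether a token (case-insensitive) appears in the lexicon."""
    return word.lower() in CONJUNCTIONS

print(is_conjunction("Omdat"))  # True
print(CONJUNCTIONS["tenzij"])   # unless
```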
Appendix C: Merck Manual
The text below is the original text about the Merck Manual, taken from their website. An
English translation follows.
Over Merck
De inhoud van deze website is volledig gebaseerd op het in 2000 verschenen Merck Manual
Medisch handboek. In deze online versie van het handboek kunt u gemakkelijk zoeken op
onderwerp en kunt u teksten en eventuele plaatjes en tabellen uitprinten. De nummering van
de secties, hoofdstukken en onderwerpen komen overeen met het boek, zodat u via internet
gemakkelijk kunt zoeken en eventueel het boek kunt raadplegen om de onderwerpen eens
rustig door te lezen.
The Merck Manual is een omvangrijk naslagwerk voor artsen dat in 1899 voor het eerst
uitkwam en inmiddels in een 17e editie is verschenen. The Merck Manual geniet al ruim een
eeuw groot gezag in de gezondheidszorg vanwege zijn betrouwbaarheid en volledigheid.
Dit naslagwerk is voor een breed publiek herschreven en is uitgebracht als The Merck
Manual Home Edition. Van deze publiekseditie zijn in de Verenigde Staten in de eerste
drie jaar meer dan 1,5 miljoen exemplaren verkocht. The Merck Manual Home Edition is
inmiddels al in 13 talen uitgebracht en is in Nederland bekend als het Merck Manual Medisch
handboek.
De Nederlandse editie is zorgvuldig beoordeeld, geactualiseerd en waar nodig aangepast
aan de Nederlandse gezondheidszorg door 34 artsen en medisch deskundigen verbonden aan
academische ziekenhuizen en overige gezondheidsinstellingen in Nederland.
Sinds december 2002 is het Merck Manual Medisch handboek ook online beschikbaar.
Het Merck Manual Medisch handboek is een naslagwerk dat niet te eenvoudig is voor
de arts en niet te moeilijk voor de patiënt. Het Merck Manual Medisch handboek biedt
inzicht in de oorzaken en behandeling van meer dan 3000 aandoeningen, in prettig leesbaar,
begrijpelijk Nederlands. Geen enkel ander medisch naslagwerk behandelt een dergelijke grote
verscheidenheid aan ziekten. De site maakt een gedegen voorbereiding van een bezoek aan de
arts mogelijk en laat patiënten en zorgverleners op niveau met elkaar van gedachten wisselen
over de vele onderzoeks- en behandelmogelijkheden in de moderne geneeskunde.
About Merck

The contents of this website are fully based on the Merck Manual Medical Handbook, which
appeared in 2000. This online version of the handbook lets you easily search by subject
and print texts, pictures and tables. The numbering of the sections, chapters and subjects
matches the book, so you can search easily on the internet and consult the book to reread
the subjects at leisure.

The Merck Manual is a sizeable reference book for doctors that first appeared in 1899
and has meanwhile reached its 17th edition. The Merck Manual has enjoyed great authority
in health care for over a century due to its reliability and completeness.

This reference book has been rewritten for a broad audience and released as The Merck
Manual Home Edition. In the United States, more than 1.5 million copies of this edition
were sold in the first three years. The Merck Manual Home Edition has meanwhile been
translated into 13 languages and is known in The Netherlands as the Merck Manual Medisch
Handboek.

The Dutch edition has been carefully reviewed, updated and, where necessary, adapted
to Dutch health care by 34 doctors and medical experts affiliated with academic hospitals
and other health institutions in The Netherlands.

Since December 2002, the Merck Manual Medisch Handboek has also been available online.

The Merck Manual Medisch Handboek is a reference book that is not too simple for the
doctor and at the same time not too difficult for the patient. It provides insight into the
causes and treatment of more than 3000 disorders, in pleasantly readable, understandable
Dutch. No other medical reference book covers such a great variety of illnesses. The site
makes thorough preparation for a visit to the doctor possible and lets patients and care
providers exchange views, on an equal footing, about the many diagnostic and treatment
possibilities in modern medicine.
Appendix D: Medical Annotations
Two annotated examples of texts from the Merck Manual.
First Example:
[Geneesmiddelen ter bestrijding van infecties zijn onder andere gericht tegen bacteriën, virussen
en schimmels.]48A [Deze geneesmiddelen zijn zo gemaakt dat ze zo toxisch mogelijk zijn voor
het infecterende micro-organisme en zo veilig mogelijk voor menselijke cellen.]48B [Ze zijn dus zo
gemaakt dat ze selectief toxisch zijn.]48C [Productie van geneesmiddelen met selectieve toxiciteit
om bacteriën en schimmels te bestrijden, is relatief gemakkelijk, omdat bacteriële en schimmelcellen sterk van menselijke cellen verschillen.]48D [Het is echter zeer moeilijk om een geneesmiddel
te maken dat een virus bestrijdt zonder de geïnfecteerde cel en daardoor ook andere menselijke
cellen aan te tasten.]48E
Second Example:
[De meeste therapieën voor overgewicht zijn gericht op wijziging van het eetgedrag.]49A [De
nadruk ligt gewoonlijk meer op permanente veranderingen met betrekking tot eetgewoonten
en op meer lichaamsbeweging dan op een dieet.]49B [Er wordt de mensen aangeleerd hoe ze
zich geleidelijk betere eetgewoonten kunnen aanwennen door meer complexe koolhydraten (fruit,
groenten, brood en pasta) te eten en het gebruik van vet te verminderen.]49C [Voor mensen met
een licht overgewicht wordt slechts een kleine beperking van de hoeveelheid calorieën en vetten
aanbevolen.]49D
Appendix E: Dutch Texts
A selection of Dutch texts is included per genre; for each genre, two examples are given.
Two examples of each type of other medical texts are also included.
Fairy Tales
First example:
Ver weg, in India, in een oerwoud dat nimmer door mensenvoeten is betreden, ligt een klein
stil meer, als geslepen uit blauw kristal. Het weelderige lommer langs de kanten wordt
rimpelloos weerspiegeld, omdat zelfs de rusteloze wind zijn adem inhoudt bij het zien van
zoveel schoonheid. Op het meer drijven zeven waterlelies. En die waterlelies zijn het, waaraan
dit sprookje zijn bestaan dankt. Maar laat ik bij het begin beginnen. Je moet weten dat in
dat oerwoud een heks woonde die zó lelijk was, dat ze alleen ’s nachts uit haar schuilplaats
tevoorschijn durfde te komen. Haar neus was zo groot en krom, dat hij haar puntige kin bijna
raakte en haar haren waren bleek en grauw als een bundeltje uitgedroogd gras. Als de maan
hoog aan de hemel stond en de krekels hun avondconcert hadden beëindigd, beklom de heks
een kale rots in de nabijheid van het meer. Daarop keerde zij haar schrikwekkende gezicht
naar de hemel, blikte een tijdlang star omhoog, verschool zich dan in een holle boomstam
en begon te zingen. In tegenstelling tot haar afschuwelijke verschijning klonk haar gezang zo
mooi, dat geen dichter het ooit zou kunnen beschrijven. Het was of alle nachtegalen van het
oerwoud zich verenigd hadden in die ene wonderlijke stem: een stem die de macht had om
ieder die het maar hoorde, te betoveren. Dat wist de oude heks. Ze wist ook, dat bij volle
maan de Maanfee met haar gevolg van sterren op aarde neerdaalde. Dan dansten zij op de
waterspiegel van dat oerwoudmeer waarvan de heks slechts een paar passen was verwijderd.
Er viel dan een heldere manestraal op het meer, omgeven door talloze lichtende sterren
die, zodra zij het water raakten, de vorm aannamen van kleine feeën. Ze droegen zilveren
jurkjes die oogverblindend schitterden en ze dansten sierlijk en speels op de tonen van de
heksenzang. In hun midden wervelde een ranke gestalte, gekleed in zilveren sluiers, met op
het hoofd een stralende kroon waarop een sikkeltje glansde. Dat was de Maanfee zelf. Heel
de lange nacht dansten de Maanfee en haar sterrenkinderen en waar hun voetjes het water
raakten ontstonden zilveren kringetjes, die zich vermenigvuldigden, wijder en wijder werden.
Om ten slotte tussen het riet te verdwijnen. Zo dansten zij, tot de morgenstond de maan
deed verbleken. Hoe meer de nacht vorderde, hoe luider en bezwerender de heks zong. Het
was haar echter nooit gelukt de Maanfee en haar sterrenkinderen in haar ban te krijgen en
ze wist maar al te goed dat de Maanfee, wanneer de dag in aantocht was, met haar gevolg
naar de hemel terugkeerde.
Second example:
Er was eens een heel lief en mooi meisje die na het overlijden van haar vader bij haar gemene
stiefmoeder en stiefzusters woonde. De naam van het meisje was Assepoester. Ze woonden
in een groot huis in een klein dorpje bij het paleis. Assepoester moest in haar eentje het
hele huis altijd schoonhouden. Voor haar lelijke en luie stiefmoeder en stiefzusters moest zij
alle vieze en vermoeiende karweitjes opknappen. Op een dag verschenen er overal in de stad
aanplakbiljetten waarin iedereen werd uitgenodigd om op het grote feest te komen die in het
paleis werd gegeven. Op dit feest zou de prins op zoek gaan naar een geschikte kandidaat
om te trouwen. Het hele huis was in rep en roer.
De gemene stiefmoeder dacht dat één van haar dochters wel geschikt zou zijn voor de prins.
”Jij mag niet mee Assepoester, zo een lelijk wicht als jij maakt toch geen kans” zei de gemene
stiefmoeder ”Ga jij maar snel de jurken van mij en je stiefzusters opknappen en wassen”.
Assepoester was de hele week in de weer om de jurken te verstellen en mooi te maken. De
gemene stiefzusters pestte haar de hele dag dat zij wel naar het bal gingen en Assepoester
niet. Op de dag van het bal was Assepoester heel verdrietig. De gemene stiefzusters hadden
dit door en lachte haar hard uit. Verdrietig keek Assepoester hoe de gemene stiefmoeder
en stiefzusters lachend in de koets stapte en vertrokken naar het paleis. ”Huil niet meisje”
hoorde Assepoester ineens achter haar. Ze draaide zich om en daar stond een vriendelijk fee
met een toverstokje ”Ik ben je petemoe en ik ga ervoor zorgen dat jij naar het feest kan”.
Assepoester kreeg een dikke glimlach op haar gezicht. Ze kon het eigenlijk niet geloven. Met
een tikje van haar toverstaf, toverde de vriendelijke fee Assepoester in een schitterende jurk.
Het haar van Assepoester was ook meteen mooi gekamd en gekapt. Aan haar voetjes had
zij ineens hele mooie glazen muiltjes. Assepoester danste in het rond van geluk. ”Ooohhh
petemoe wat ben ik mooi gekleed zo” ”Nu nog een mooie koets met koetsier en paarden”zei
de vriendelijk fee en met een zwaai toverde zij een pompoen en 5 muizen in een mooie koets
met koetsier en 4 paarden. ”Ga nu snel mijn kind maar vergeet niet dat de betovering maar
tot twaalf uur middernacht duurt en geen seconde later”
Weather Forecasts
First example:
Onder invloed van een hogedrukwig krijgen we woensdag rustig en vrij mooi weer al kan
de ochtendmist en de lage bewolking nog een groot deel van de voormiddag blijven hangen,
vooral dan in het oosten van het land. Donderdag begint nog mooi, opnieuw met plaatselijk
ochtendmist, maar in de loop van de dag is er kans op wat lichte regen. Vrijdag is het eerst
nog vrij mooi maar later op de dag neemt de kans toe op regen of buien. Ook zaterdag vallen
er een aantal buien. Tijdens heel de periode blijft het vrij zacht voor de tijd van het jaar
met maxima rond 20 graden.
Second example:
De maand is voorlopig 4 graden warmer dan het langjarig gemiddelde. De hoogste temperatuur (30,2 graden) werd op 12 september gemeten in Ell (L). De temperaturen van de 21
(Arcen 27,5) en 22 september (Twente 28,9, De Bilt 26,3 graden) horen bij de hoogste voor
die tijd. De hoogste temperatuur ooit in De Bilt tussen 21-30 september is 27,3 graden op
24 september 1949. Landelijk is voor eind september 30,4 graden het hoogtepunt voor op 26
september 1967 in Buchten. Ook in 2003 beleefde ons land tussen 17 en 22 september een
uitzonderlijk warme oudewijvenzomer met temperaturen tussen 25 en 31 graden.
Recipes
First example:
Smelt de boter in een braadpan en bak de uien al omscheppend lichtbruin. Leg de braadworsten ertussen en bak ze rondom bruin. Voeg de tijm, de bouillon en peper en zout naar
smaak toe en laat de uien en de worst in 20 min. gaar worden. Garneer met de peterselie.
Lekker met aardappelpuree.
Second example:
Ingrediënten: 500 gram bruine bonen, 3 liter water, 15 gram zout, 2 laurierbladeren, 2 kruidnagels, 1 Spaanse peper, 250 gram aardappelen, 1 grote ui, 40 gram boter, 1 selderijknol,
peper, zout, Worcestershire saus.
Bereidingswijze: Was de bonen onder koud water en laat ze 24 uur weken. Breng de
bonen in het weekwater aan de kook met het zout en de kruiden, (tijdens de bewerking
moeten de kruiden verwijderd worden, doe ze daarom in een thee-ei of een linnen zakje).
Laat de bonen niet te gaar worden. Schil de aardappelen, snijd ze in blokjes, doe ze bij
de bonen en laat het geheel nog een half uurtje doorkoken. Snij de ui en fruit deze in de
boter. Schil de selderijknol en snijd hem in stukjes. Haal de kruiden uit de soep en pureer in
het water de bonen en de aardappelen met een pureestamper. Voeg de stukken knolselderij
toe en ook de gefruite ui. Laat de massa nog een kwartiertje doorkoken totdat het goed
gebonden is. Breng de soep op smaak met zout, peper en, als u daarvan houdt, een scheutje
Worcestershire saus. Strooi vlak voor het serveren de fijngehakte peterselie erover. Geef er
geroosterd wittebrood of vers stokbrood bij.
Newspaper Articles
First example:
AMSTERDAM - Nog nooit iemand zo verweesd naar een bos bloemen zien kijken als Jaap
Stam zaterdag na afloop van de ontluisterende nederlaag tegen Ierland. De international keek
eens op naar het meisje dat hem de ruikers overhandigde en moffelde het bosje vervolgens
vakkundig weg onder de stoel waarin hij even daarvoor met een diepe zucht was neergezegen.
APPENDIX E. DUTCH TEXTS
Arme Stam. Het ergste moest nog komen. Staand in de middencirkel werden de voetballers van Oranje uitgezwaaid richting Portugal, maar een hels fluitconcert daalde neer op
de ploeg die zojuist door een veredeld jeugdteam met 0-1 was verslagen.
Ruud van Nistelrooij nam de strafexpeditie nuchter op: "Het is nooit leuk om uitgefloten
te worden door je eigen publiek, maar dat hebben we zelf in de hand. Als we de sterren van
de hemel spelen, hebben we er geen last van."
Tot overmaat van ramp betrad ook zanger Frans Bauer nog het veld om zijn hits ’Een
onsje geluk’ en ’Heb je even voor mij?’ ten gehore te brengen. De internationals, nog steeds
verzameld op het midden van het veld, keken beteuterd toe hoe het merendeel van de 48.000
toeschouwers zich wél vermaakte met de volkszanger. Stam kon het niet meer aanzien en
wandelde voor een slok water naar de zijlijn. "Van mij hoeft die poespas niet. Ook niet als
we met 3-0 gewonnen hadden. Maar ik beslis die dingen niet. Dat uitfluiten is niet leuk,
maar wel begrijpelijk. De laatste oefenwedstrijd voor het EK ’88 ging ook verloren. Toen
werd de selectie eveneens uitgefloten. Er is nog hoop."
Van Nistelrooij zei niet te hebben overwogen om de misplaatste toegift te ontlopen. "Als
al die mensen blijven zitten om ons uit te zwaaien, ga ik niet naar binnen."
Het laatste woord was tenslotte aan aanvoerder Phillip Cocu, die de fans bedankte voor
hun steun: "Hoe moeilijk dat ook was. Ik had daar liever gestaan na een 3-0 overwinning.
Maar wat ik zei, was wel gemeend. Het doet me nog steeds wat als het stadion helemaal vol
zit en oranje kleurt. We hebben de hulp van het legioen straks hard nodig in Portugal." Dat
was het understatement van de dag.
Second example:
Prijzenslag
Nieuwe ronde in prijzenslag (AD, 18-9). Volgens AH-topman Dick Boer waarderen de
klanten van Albert Heijn de aanpak om de prijzen te verlagen. Ik kan daar als klant voor
een deel in meegaan. Maar gezonde concurrentie kan alleen bestaan als alle bij het proces
betrokkenen er beter van worden. En niet alleen de consument. Stel dat naast Albert Heijn
er nog slechts wat kleine buurtwinkels in enclaves zouden overblijven. Moeten we daar dan
gelukkig mee zijn? Als delen van de keten moeten afhaken omdat er een eind is gekomen
aan hun inventiviteit en inleveringsvermogen. Oftewel: weg concurrentie! Weg bekende
merken? En, misschien ook weg de noodzaak van lage prijzen?
Winkler Prins
First example:
geslachtsdrift genps of libido, mate waarin iemand behoefte heeft aan seksueel contact (zie
*seksualiteit). De seksuele activiteit wordt in overwegende mate bepaald door in de loop van
het leven opgedane ervaringen, door de houding van anderen en hun normen, alsmede door
de hormonale activiteit van het hersenaanhangsel en de geslachtsklieren. De geslachtsdrift
kan afnemen onder invloed van emoties (angst, neerslachtigheid). Ook onder invloed van de
menstruele cyclus verandert de geslachtsdrift. Zie ook *frigiditeit; *impotentie.
Second example:
HARTMASSAGE: Uitwendige hartmassage wordt verricht door middel van een ritmisch
drukken (met twee handen) op het onderste gedeelte van het borstbeen. Men drukt hierbij
80 tot 100 maal per minuut (rechtsboven), waardoor de borstkas in hetzelfde tempo wordt
samengedrukt. Let op: niet leunen! Bij het loslaten (linksonder) veert de borstkas weer
omhoog, waardoor het hart meer ruimte in kan nemen en bloed aanzuigt. Bij de volgende
drukverhoging wordt het bloed de aorta en longslagader ingeperst. Bij hartstilstand of lage
hartfrequentie (minder dan 20 slagen per minuut) het ritmisch ca. 80-100 maal per minuut
samendrukken van de borstkas, waardoor het *hart leeggedrukt wordt en de pompfunctie gedeeltelijk blijft bestaan. Hierbij wordt het onderste gedeelte van het borstbeen met
een snelle stevige stoot van beide handpalmen ongeveer 5 cm ingedrukt. Gelijktijdig wordt
beademing toegepast, hetzij in de vorm van mond-op-mond-beademing (zie *beademing), ca.
20 maal per minuut, hetzij door toediening van zuurstof door een anesthesist na intubatie.
Wikipedia
First example:
Hersenen van gewervelden ontvangen signalen van de ’sensors’ (receptoren) van het organisme
via de zenuwen. Deze signalen worden door het centrale zenuwstelsel geïnterpreteerd waarna
reacties worden geformuleerd, gebaseerd op reflexen en aangeleerde kennis. Eenzelfde soort
systeem bezorgt aansturende boodschappen vanuit de hersenen bij de spieren in het hele
lichaam.
Sensorische input wordt verwerkt door de hersenen voor de herkenning van gevaar, het
vinden van voedsel, het identificeren van mogelijke partners en voor meer verfijnde functies.
Gezichts-, gevoels- en gehoorsinformatie gaat eerst naar specifieke kernen van de thalamus en
daarna naar gebieden van de cerebrale cortex die bij dat specifieke sensorische systeem horen.
Reukinformatie (fylogenetisch het oudste systeem) gaat eerst naar de bulbus olfactorius,
daarna naar andere delen van het olfactorisch systeem. Smaak wordt via de hersenstam
geleid naar andere delen van het betreffende systeem.
Om bewegingen te coördineren hebben de hersenen een aantal parallelle systemen die
spieren besturen. Het motorisch systeem bestuurt de willekeurige bewegingen van spieren,
geholpen door de motorische schors, de kleine hersenen (het cerebellum) en de basale ganglia.
Uiteindelijk projecteert het systeem via het ruggenmerg naar de zogenaamde spiereffectors.
Kernen in de hersenstam besturen veel onwillekeurige spierfuncties zoals de ademhaling.
Daarnaast kunnen veel automatische handelingen zoals reflexen gestuurd worden door het
ruggenmerg.
De hersenen produceren ook een deel van de hormonen die organen en klieren beïnvloeden. Aan de andere kant reageren de hersenen ook op hormonen die elders in het lichaam
geproduceerd zijn. Bij zoogdieren worden de meeste hormonen uitgescheiden in de bloedsomloop; de besturing van veel hormonen verloopt via de hypofyse.
Second example:
Een slagader of arterie is een bloedvat dat zorgt voor het transport van bloed van het hart
naar de rest van het lichaam. Het arteriestelsel voert bloed vanuit het lichaam naar de
gebruikers, nl. de organen en weefsels.
De naam ’slagader’ verwijst naar het feit dat men aan een arterie het hart kan voelen kloppen, omdat de daarmee gepaard gaande drukwisselingen zich in de arteriën voortplanten.
De diameter van een slagader is ongeveer vier millimeter. Hieromheen zijn wanden gelegen, welke ongeveer één millimeter dik zijn. Om de ruimte waar het bloed stroomt, zit eerst
een enkele laag endotheelcellen. Hieromheen is een laag glad spierweefsel. Deze laag is
dikker dan de gladde spiercellaag van de aders. Hieromheen, als buitenste laag, zit een laag
bindweefselcellen.