Automatic Recognition of Structural Relations in Dutch Text
Faculty of Human Media Interaction
Department of EEMCS
University of Twente
P.O. Box 217, 7500 AE Enschede
The Netherlands

Automatic Recognition of Structural Relations in Dutch Text
– A Study in the Medical Domain –

by Sander E.J. Timmerman
[email protected]
March 1, 2007

Committee:
Ir. W.E. Bosma
Dr. M. Theune
Dr. D.K.J. Heylen

Preface

Right now, you are reading the thesis resulting from my graduation assignment. During my time at the University of Twente, I found the field of interaction between humans and computers the most interesting field in computer science. Therefore I was happy to be able to perform my graduation assignment at the faculty of Human Media Interaction. Before starting with the final assignment, I followed several courses from this faculty. My traineeship originated in the same faculty as well: it involved the automatic subtitling of meetings using speech recognition, at Noterik in Amsterdam. I started with this assignment in May 2006, and it lasted until March 2007. Although working on the project, spending time with fellow graduates, and student life as such were very enjoyable, I am glad to be able to finish it and start a new life, no longer being a student.

I would like to thank my graduation committee, Wauter Bosma, Mariet Theune and Dirk Heylen, for their time, work and comments. Furthermore I would like to thank Rieks op den Akker for his contributions to the annotations and his comments, and Tjerk Bijlsma for his review of my thesis. And last but not least, I thank my parents for giving me the opportunity to do this study.

I hope you enjoy reading this thesis,

Sander Timmerman, March 1, 2007

Samenvatting

Interactie tussen mensen en computers is een hot topic binnen de huidige informatica. Er wordt onderzoek gedaan naar user interfaces, communicatie, kunstmatige intelligentie en veel meer. Een van de onderwerpen is natuurlijke taalverwerking door computers.
Onderzoekers werken bijvoorbeeld aan het automatisch genereren van (zinvolle) tekst, spraakherkenning en automatisch samenvatten. Een manier om inzicht te krijgen in teksten is het beschrijven van de structuur van een tekst. Hiervoor kan gebruik worden gemaakt van de Rhetorical Structure Theory, een van de manieren om structuur in tekst te beschrijven. Het herkennen van structuur is zelfs voor mensen een moeilijk proces omdat taal erg ambigu is.

Dit verslag beschrijft een manier om automatisch structuur te herkennen in Nederlandse teksten, meer specifiek in het medische domein. Uit het medische domein is de Merck Manual, een Nederlandse medische encyclopedie, geselecteerd als bron voor de teksten. Om automatisch structuur te kunnen herkennen wordt gebruik gemaakt van bepaalde woordtypes die relaties tussen bepaalde segmenten uit de tekst kunnen aangeven. In dit verslag worden zinnen als segmenten, de basis voor de structuur, gebruikt. Er is een herkenner ontwikkeld die gebruik maakt van deze woordtypes om automatisch de structuur te herkennen. De herkenner gebruikt woordtypes die in andere onderzoeken gebruikt zijn en woordtypes die specifiek geschikt zijn voor het medische domein. Deze laatste types zijn speciaal voor deze opdracht geselecteerd. De herkenner genereert de structuur als een Rhetorical Structure Theory-boom. Om de kwaliteit van de herkende structuur te evalueren worden de automatisch gegenereerde bomen vergeleken met bomen die handmatig zijn gemaakt; tevens worden handmatig gemaakte bomen met elkaar vergeleken.

Abstract

Interaction between humans and computers is a hot topic in current computer science. Research is done on user interfaces, communication, artificial intelligence and much more. One of the subjects is natural language processing by computers. Researchers work, for example, on the automatic generation of (sensible) text, speech recognition and automatic summarization.
A way to get insight into texts is to describe their structure. For this, Rhetorical Structure Theory can be used, one of the ways to describe structure in texts. The recognition of structure in texts is a difficult process, even for humans, because language is highly ambiguous. This thesis describes a way to recognize structure in Dutch texts automatically, more specifically in texts from the medical domain. The Merck Manual, a Dutch medical encyclopedia, is chosen as the source of the texts. Certain word types which can signal relations between text segments are used for the automatic recognition of structure. In this thesis, sentences are used as segments, the basis of the structure. An automatic recognizer is developed which uses these word types to recognize the structure. It uses word types which are discussed in other studies as well as word types which are specific to the medical domain. The latter were gathered during this assignment. The recognizer generates the structure of a text as a Rhetorical Structure Theory tree. To evaluate the quality of the recognized structure, automatically generated trees are compared with manually generated trees of the same text. Furthermore, manually generated trees are compared with each other.

Contents

1 Introduction
  1.1 About IMIX
  1.2 Research Questions
  1.3 Organization of the Thesis
2 Structure in Text
  2.1 Discourse
  2.2 Rhetorical Structure Theory
    2.2.1 Theory
    2.2.2 Applications
3 Annotation
  3.1 Manual Annotation
  3.2 Elementary Discourse Units
  3.3 Required Resources for Annotation
  3.4 Discourse Markers
  3.5 Assigning Relations
  3.6 Tree Structure
4 Structure in Dutch
  4.1 Dutch Text
  4.2 Dutch Relations
  4.3 Dutch Discourse Markers
    4.3.1 Conjunctions
    4.3.2 Other Discourse Markers
  4.4 Genres
    4.4.1 Comparison
5 Medical Texts
  5.1 Merck Manual
    5.1.1 Composition
    5.1.2 Features
  5.2 Medical Texts
    5.2.1 Merck Texts
    5.2.2 Other Medical Texts
  5.3 Relations in Medical Texts
  5.4 Medical Discourse Markers
    5.4.1 Noun Constructions
    5.4.2 Verb Constructions
    5.4.3 Signaled Relations
    5.4.4 Time Markers
6 Automatic Recognition
  6.1 State of the Art
  6.2 Automatic Annotator
    6.2.1 The Segmenter
    6.2.2 The Recognizer
    6.2.3 The Tree-Builder
  6.3 Segmentation
  6.4 Defining the Relation Set
  6.5 Used Discourse Markers
    6.5.1 Conjunctions
    6.5.2 Adverbs
    6.5.3 Pronouns
    6.5.4 Domain Specific Discourse Markers
    6.5.5 Relation Markers
    6.5.6 Adjectives
    6.5.7 Implicit Markers
  6.6 Recognizing Relations
  6.7 Recognition Algorithm
  6.8 Scoring Hierarchy
  6.9 Example
7 Evaluation
  7.1 RST-Trees
  7.2 Relation Based Evaluation
  7.3 Full Tree Evaluation
  7.4 Testing Results
    7.4.1 First Text
    7.4.2 Second Text
    7.4.3 Third Text
    7.4.4 Fourth Text
    7.4.5 Fifth Text
    7.4.6 More Texts
    7.4.7 Annotator Evaluation
  7.5 Discussion
8 Conclusions and Future Work
  8.1 Conclusions
  8.2 Future Work
Bibliography
A Rhetorical Relations
B Conjunctions
C Merck Manual
D Medical Annotations
E Dutch Texts

Chapter 1

Introduction

In their daily life, people read a lot of texts, whether they are reading a book, processing the subtitles of a foreign movie on TV, or picking a meal from a restaurant menu. All texts are created with an intention: the goal of the author is to inform the reader about a specific subject, ranging from telling a story to amuse the reader to defending political statements. Since a reader must be able to understand the text, there are constraints on the way texts are written. First of all, the language used must be familiar to the reader. It is of no use to write a letter to someone in French if the receiver only understands English.
Secondly, the text must be syntactically correct: it must follow the grammatical rules. A sentence like "Any go bird fox table" has no meaning to the reader and is therefore useless, although it consists only of correct English words. In some cases, improper use of syntax can cause the reader to interpret a piece of text in a way the author certainly did not intend. Thirdly, texts must have coherence. This means that the sentences of a text must have a relation with the other sentences in the text, and parts of a sentence with each other as well. In most cases a raw collection of unrelated sentences is not only boring to read, it is also very hard, or even impossible, to understand. Therefore a written text has to have a certain structure which enables the reader to spot this coherence and create a cognitive representation of the text. The cognitive representation enables the reader to extract the meaning intended by the author.

In this thesis, existing methods to describe structure in text are explained, and the possibilities to use these methods for Dutch are discussed. The focus lies on Rhetorical Structure Theory (RST) [MT88], which is used within IMIX (http://www.nwo.nl/imix). The main goal of this thesis is to research whether it is possible to recognize structure in text automatically. Since IMIX is focused on medical text, medical texts in Dutch are used for this research.

1.1 About IMIX

This work is done as part of the IMIX project. IMIX is the abbreviation of Interactieve Multimodale Informatie eXtractie. It is a project in the field of speech and language technology for Dutch. The goal of the project is to develop the knowledge and technology needed to retrieve specific answers to specific questions from Dutch documents. IMIX is funded by NWO (http://www.nwo.nl/), the Netherlands Organization for Scientific Research.
To be able to extract meaningful answers from a text, it is useful to have knowledge about its structure. A properly written text is as coherent as possible, for that is the best way for the author to avoid unintended interpretations of the text by a reader. The structure of texts can be visualized by annotating the text with rhetorical relations. These annotated texts can then be queried for the relations which exist within the text, rather than analyzing the raw text itself. Since annotation is a very time-consuming process if done by hand, an automatic annotation process is required.

1.2 Research Questions

This research concerns the automatic recognition of structural relations in text. The focus lies on Rhetorical Structure Theory, a popular discourse theory to describe the structure of texts. The following questions are to be answered:

1. How can RST annotations be automatically generated?
2. Is RST suitable for Dutch, and specifically for medical texts?
3. Which properties of Dutch, and which relations and discourse markers, are best suited for automatic recognition?
4. Is it possible to use other properties of medical texts as discourse markers to find relations?
5. How can automatically generated annotations be evaluated?

1.3 Organization of the Thesis

In chapter 2, the need for structure in text and ways to generate representations of texts are described. Discourse is discussed, and a theory to define the structure of a text, Rhetorical Structure Theory, is presented, together with some of its applications. The next chapter elaborates on discourse and RST by describing the manual annotation of text with RST. Knowledge about elementary discourse units and discourse markers is presented, and the need for world knowledge is explained. The assigning of relations is briefly introduced and the use of the annotation tool of O'Donnell is explained. The chapter provides the information needed to create an RST tree from a text.
The fourth chapter focuses on Dutch texts. In this chapter it is shown that RST is indeed suited for Dutch texts. Dutch discourse markers and relations which can be used for manual and automatic annotation are described. The chapter contains a comparison between several Dutch genres of text; it is shown that although differences occur between the genres, texts from each genre can be annotated with RST. Lists of possible discourse markers are presented, and research regarding the use and functionality of these markers is described. The application of markers is extended in chapter 5, where the general properties of medical texts are described, as well as functional properties which can be useful to find relations automatically. Lists of special discourse markers are gathered, useful for the (automatic) annotation of medical texts. In chapter 6 the process of automatically annotating texts with rhetorical relations is described. The three parts of the program, the segmenter, the recognizer and the tree-builder, are discussed. Lists of word types and the relations they might signal are used by the recognizer to generate the RST-tree. Furthermore, the algorithm used by the recognizer is explained. The trees generated from example texts by the automatic recognizer are presented in chapter 7. These trees are compared to manually annotated trees and the differences are discussed. This chapter also contains comparisons between manually created trees. Finally, the conclusions are presented and future work is described.

Chapter 2

Structure in Text

There are many different kinds of texts, differing from each other in, for example, layout, purpose, subject and coherence. A clear difference in layout exists between dialogue and prose.
A dialogue usually is an alternation of segments produced by different persons, while prose can contain practically all kinds of layout: large spans of text, lists, and even dialogues themselves. Differences in purpose can be illustrated by fairy tales versus encyclopedias. While a fairy tale is intended to tell the reader a story, an encyclopedia is created to inform the reader about a certain subject in an objective way. Coherence in a text is needed to be able to understand the meaning intended by the writer. Furthermore, structured texts tend to be memorized more easily [Bir85]. Practically all texts are coherent, although there are some familiar examples of text without clear coherence, such as some poems.

In this chapter, discourse is discussed, along with a way to represent the structure of discourse: Rhetorical Structure Theory [MT88]. While there are other theories, such as the theory of Grosz and Sidner [GS86] and the theory of Hobbs [Hob85], RST is used for this research because RST is used within IMIX. Some applications of RST are discussed afterwards.

2.1 Discourse

A text consists of multiple sentences which are related to each other. Such a combination is called a discourse. A discourse itself consists of multiple discourse segments: non-overlapping spans of text which can consist of a part of a sentence, a complete sentence or even a group of sentences. The coherence between these segments is provided by a relation. A discourse segment can, for example, provide additional information about a preceding segment. It can also contrast with it, or provide a framework for new information. An example:

1  Harry broke his wrist.
2  He fell out of a tree.

In the preceding example it is clear that the first sentence is coherent with the second. The second sentence provides additional information about Harry: Harry fell out of the tree, which caused him to break his wrist. There is a clear relationship between both sentences.
In the next example the coherence is not clear:

1  Linda drank a glass of lemonade.
2  The chicken laid an egg.

The fact that Linda is drinking a glass of lemonade has nothing to do with the chicken laying an egg. These sentences do not have any relation, because the actions of drinking a glass of lemonade and laying an egg are unrelated. But how can it be known that the sentences in the first example do relate, while those in the second example do not? This is caused by the fact that the reader of the examples has knowledge of the subject. All knowledge about the subject which is not embedded in the text is called world knowledge. This will be further explained in chapter 3. For a reader unfamiliar with trees, the relation in the first example is not clear, since he does not know what a tree is. Therefore he cannot know that a man could fall out of one and break a wrist. Even for readers familiar with the event of falling, there is no necessary relation with breaking a wrist, since falling itself does not always cause a wrist to break. Although the word "falling" might indicate that a wrist can become broken, if it were used in the context of falling in love, it is rather unlikely that this would cause a wrist to break. And even when "falling" refers to the actual, gravity-caused event, breaking is not always the result. If, for example, the second sentence were "He fell on the pillows", a relation is less obvious, since it would be rather strange if that had caused his wrist to break.

Every reader generates a mental representation of the text he is reading. In that way the text makes sense, i.e. the message it contains is revealed to the reader. There are different ways to model the representations of texts. The one used in this research, the Rhetorical Structure Theory of Mann and Thompson, will be described.

2.2 Rhetorical Structure Theory

Rhetorical Structure Theory (RST) was developed by William Mann and Sandra Thompson [MT88].
It was originally developed as part of studies of computer-based text generation using written monologue texts. It describes a set of rhetorical relations between discourse segments, which are used for the analysis of text. This section discusses RST and some of its applications.

2.2.1 Theory

At first the theory described 24 different relations, but the number has since increased to 30. The original 24 relations are shown in table 2.1. For an overview of the set of relations and their descriptions, see appendix A. The appendix covers only 23 of these relations, because there is no actual schema for the Joint relation: it is in fact not a rhetorical relation. Joint is used in cases where there is no actual relation between two parts of text, but they are related somehow since both pieces of text appear in the same discourse.

Relation               | Type
-----------------------|-----------
Antithesis             | Hypotactic
Background             | Hypotactic
Circumstance           | Hypotactic
Concession             | Hypotactic
Condition              | Hypotactic
Contrast               | Paratactic
Elaboration            | Hypotactic
Enablement             | Hypotactic
Evaluation             | Hypotactic
Evidence               | Hypotactic
Interpretation         | Hypotactic
Joint                  | Paratactic
Justify                | Hypotactic
Motivation             | Hypotactic
Otherwise              | Hypotactic
Purpose                | Hypotactic
Restatement            | Hypotactic
Sequence               | Paratactic
Solutionhood           | Hypotactic
Summary                | Hypotactic
Volitional Cause       | Hypotactic
Non-Volitional Cause   | Hypotactic
Volitional Result      | Hypotactic
Non-Volitional Result  | Hypotactic

Table 2.1: The original RST relations

The coherence in texts can be graphically presented in tree form, in so-called RST-trees. Practically all texts can be represented as a tree, although in [WG05] it is argued that trees are not expressive enough and crossed dependencies are needed. The theory provides a way of expressing the importance of the text parts a relation connects. For this, the terms nucleus and satellite are introduced.
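The division of table 2.1 into hypotactic and paratactic relations can be captured in a small lookup table. The Python sketch below is purely illustrative; it is not part of the recognizer software described in this thesis:

```python
# The 24 original RST relations (table 2.1) and their types. Only
# Contrast, Joint and Sequence are paratactic (multinuclear); all other
# relations hold between a nucleus and a satellite.
PARATACTIC = {"Contrast", "Joint", "Sequence"}

RELATIONS = [
    "Antithesis", "Background", "Circumstance", "Concession", "Condition",
    "Contrast", "Elaboration", "Enablement", "Evaluation", "Evidence",
    "Interpretation", "Joint", "Justify", "Motivation", "Otherwise",
    "Purpose", "Restatement", "Sequence", "Solutionhood", "Summary",
    "Volitional Cause", "Non-Volitional Cause",
    "Volitional Result", "Non-Volitional Result",
]

def relation_type(name):
    """Return 'paratactic' or 'hypotactic' for an RST relation name."""
    return "paratactic" if name in PARATACTIC else "hypotactic"
```

Such a table is all an automatic annotator needs to decide whether a recognized relation connects a nucleus to a satellite or several nuclei to each other.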
The most important of the text parts a relation connects is called the nucleus; the other part is called the satellite. Most relations exist between a nucleus and a satellite, although there are some multinuclear relations. In a multinuclear relation, the text parts are considered to be of equal importance. A relation between a nucleus and a satellite is called a hypotactic relation; a paratactic relation exists between multiple nuclei. Table 2.1 shows that there are 21 hypotactic and 3 paratactic relations. The difference between a nucleus and a satellite can be shown with the example:

1  Harry broke his wrist.
2  He fell out of a tree.

It is clear that in this case the first sentence is of more importance than the second. The first statement could appear on its own, while the second is dependent on the first. The sentence "He fell out of a tree" can also appear on its own, but has little or no meaning to a reader, since there is no knowledge about who "he" is. In this example it is clear that "he" refers to "Harry", and it can be concluded that Harry is the person who fell out of the tree. The second sentence gives additional information about the first sentence: in this case, it explains to the reader the cause of the fact that Harry broke his wrist. So the first sentence is the nucleus and the second sentence is the satellite. This example can be represented in an RST-tree, see figure 2.1.

Figure 2.1: An RST-tree example (a Non-Volitional Cause relation between "Harry broke his wrist." and "He fell out of a tree.")

The order of the sentences is not the only thing that determines the importance. If the two sentences of the preceding example are switched:

1  Harry fell out of a tree.
2  He broke his wrist.

the second sentence would be the most important, and the same relation can be used to represent the connection between both.
An RST-tree consists of the following parts: discourse segments, which are represented as horizontal lines with the corresponding text beneath them, and relation arcs with a label containing the name of the relation. The discourse segments in the preceding example are "Harry broke his wrist" and "He fell out of a tree". Discourse segments are discussed in more detail in chapter 3. The relation is called Non-Volitional Cause and is represented by the pointed arc. The arc starts in the satellite and ends in the nucleus. In the next example a paratactic relation is embedded:

1  A bird flies,
2  but a fish swims.

This example consists of a single sentence, split into two separate non-overlapping spans. A nucleus or satellite can consist of a single sentence, but can just as well be a (small) part of one. These two spans are related to each other in a contrastive way, and both parts of the relation are equal in importance: the fact that a bird flies does not have a higher importance than the fact that a fish swims. This can be represented using the multinuclear relation Contrast, as shown in figure 2.2.

Figure 2.2: A paratactic relation (Contrast between "A bird flies," and "but a fish swims")

Two discourse segments can together form a new segment, which in turn is related to other segments. This can be shown with an extension of the example of Harry and the tree:

1  Harry broke his wrist.
2  He fell out of a tree,
3  which stood in the garden.

While the third segment does not give extra information about the breaking of Harry's wrist, it does give additional information about the tree. The combination of segments 2 and 3 is related to the first segment, since the combination specifies the reason Harry broke his wrist. The RST-tree is shown in figure 2.3.

Figure 2.3: An extended RST-tree (segment 1 is the nucleus of a Non-Volitional Cause relation; its satellite is the combination of segments 2 and 3, connected by an Elaboration relation)

The relations between segments are bound to certain constraints.
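Before turning to these constraints, the nested span structure of figure 2.3 can be mirrored in a simple recursive data type. The class below is a hypothetical sketch, assuming a binary hypotactic tree; it is not the representation used by the recognizer described later in this thesis:

```python
class Node:
    """A span in an RST-tree: either a leaf segment (relation is None)
    or a relation combining a nucleus with a satellite."""
    def __init__(self, relation=None, nucleus=None, satellite=None, text=None):
        self.relation = relation    # e.g. "Elaboration"; None for a leaf
        self.nucleus = nucleus
        self.satellite = satellite
        self.text = text            # segment text, leaves only

    def nuclear_text(self):
        """Follow the nuclei down to the most important leaf segment."""
        return self.text if self.relation is None else self.nucleus.nuclear_text()

# The extended Harry example (figure 2.3): segments 2 and 3 combine
# first, and that combination is the satellite of the Non-Volitional
# Cause relation whose nucleus is segment 1.
tree = Node("Non-Volitional Cause",
            nucleus=Node(text="Harry broke his wrist."),
            satellite=Node("Elaboration",
                           nucleus=Node(text="He fell out of a tree,"),
                           satellite=Node(text="which stood in the garden.")))
```

Following the nuclei downwards (`tree.nuclear_text()`) yields "Harry broke his wrist.", reflecting the RST idea that the nucleus chain carries the most important content of the whole span.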
There are constraints on the nucleus and the satellite separately, as well as on their combination. Finally, the effect of the relation is specified. This can be shown with the definition of the Purpose relation in figure 2.4. The text is taken from the original paper by Mann and Thompson. The following abbreviations are used: N = Nucleus, S = Satellite, R = Reader.

=== Purpose ===
Constraints on N: presents an activity
Constraints on S: presents a situation that is unrealized
Constraints on the N + S combination: S presents a situation to be realized through the activity in N
The effect: R recognizes that the activity in N is initiated in order to realize S

Figure 2.4: The Purpose relation as defined in the Rhetorical Structure Theory

An example of this relation can be shown with the text below:

1  To obtain a pack of cigarettes,
2  put four euros in the machine.

Its corresponding RST-tree is shown in figure 2.5.

Figure 2.5: An example of the Purpose relation

In the example above, the first segment is the satellite: it fulfills the constraint of being an unrealized situation. The second segment is the nucleus: it provides the way to obtain the pack of cigarettes and fulfills the constraint of being an activity. The combination of both segments provides the activity of putting four euros in the machine to realize the situation of obtaining a pack of cigarettes.

As shown above, text spans can be connected in multiple ways. Mann and Thompson define five schema types, which define how text spans can be connected to each other. The five types are shown in figure 2.6. The Circumstance example shows the use of a hypotactic relation. The arc in the example points from left to right, which means that the satellite is the left part and the nucleus the right part. It is also possible that the nucleus comes first.
The arc is reversed in that case; see for example figure 2.1. All hypotactic relations not mentioned in figure 2.6 can be used like the Circumstance relation. The Contrast relation is defined with a different schema: the arc between the discourse segments has two points. The discourse segments connected by a Contrast relation are equal in importance; they are both nuclei. This schema looks like the one used for the Sequence relation, in which all discourse segments are considered to be nuclei as well. The difference is that the Contrast relation always connects exactly two segments, while a Sequence can contain more. The Joint relation has yet another schema: no arc is added. The number of nuclei within a Joint relation is 2 or more, just like the Sequence relation. The last schema, with the Motivation and the Enablement relations, is a special case. It allows two satellites, one occurring before the nucleus in the text and one after it, to be connected to the nucleus in the same nuclear span. A nuclear span is a span of two or more connected discourse segments; it is indicated by the vertical bar.

Figure 2.6: The five different schemas

With the other relations such a tree is not possible. If, for example, a text containing three discourse segments were represented as a tree in which the second segment is the only nucleus, there are two possible trees if the relations are different from Motivation and Enablement. These two trees are shown in figure 2.7.

Figure 2.7: Two different trees, (a) and (b), combining a Concession and a Circumstance relation around the same nucleus

2.2.2 Applications

The Rhetorical Structure Theory can be used for different applications. To show the use of RST, some of these applications are mentioned below. They are based on a survey by Taboada and Mann [TM06], who divide them into four genres:

1. Computational Linguistics
2.
Cross Linguistic Studies
3. Dialogue and Multimedia
4. Discourse Analysis, Argumentation and Writing

The field of computational linguistics is broad; examples of applications in this field are summarization, essay scoring, translation and natural language generation. A study regarding essay scoring is, for example, [BMK03]. In his PhD thesis [Mar97b], Daniel Marcu describes summarization and natural language generation using RST. Authors and automatic language generators can use RST to prevent readers from interpreting a text in a wrong way, as shown in [Hov93]. RST is used for languages other than English as well, for example Brazilian Portuguese [PdGVNM04], German [Ste04] and more.

Chapter 3

Annotation

In this chapter the annotation of structure in texts is discussed. First, the process of manually annotating texts is described, along with the difficulties occurring during this process, and the use of the annotation tool of O'Donnell [O'D00] is discussed. After that, the segmentation of discourses into elementary discourse units is handled. Furthermore, the use of world knowledge and discourse markers to annotate a discourse is described, as well as the process of assigning relations to discourse units. The texts are selected from the Merck Manual (http://www.merckmanual.nl), a (digital) Dutch medical encyclopedia. This encyclopedia is used because the research field of the IMIX project is set to medical texts in Dutch. For this research, several pieces of text from different sources were annotated. A selection of annotations from samples of the Merck Manual is shown in appendix D.

3.1 Manual Annotation

Before automatic annotation was possible, a selection of texts was manually annotated. These texts are taken from the Merck Manual. Each describes a medical topic, for example tiredness, shortness of breath or the blood supply of the heart itself. A typical text consists of several paragraphs. Ideally, an RST-tree of the complete text of a topic is created.
For the manual annotation the RST-tool of O'Donnell is used. With this tool the text file is read and annotated. First of all, the text has to be divided into non-overlapping spans of text (segments) between which the relations can be defined. For these annotations the standard list of 24 rhetorical relations as defined by Mann and Thompson is used. It is possible to use other, more extensive or smaller lists, and the annotator can also define new rhetorical relations. In figure 3.1 the text-segmenting part of the tool is shown. The vertical bars mark the segment boundaries. In this example the text is segmented into sentences, but a further segmentation is possible, as will be explained in section 3.2. After the segments are made, the relations can be added in a graphical environment; see figure 3.2. In this picture there are four sentences between which the relations are to be added. The first two sentences are already connected with an Elaboration relation.

1 http://www.merckmanual.nl

Figure 3.1: The RST-Tool of O'Donnell

3.2 Elementary Discourse Units

For the recognition of the structure in a discourse, it is necessary to split the text into units. The first constraint is that the text is split into non-overlapping spans. These spans of text (segments) are called elementary discourse units (EDUs). There are several ways to split a text up into segments. The easiest way is to take a complete sentence as a unit. A drawback of this approach is that sentences can be very long: if a long sentence is a single unit, information about the text is easily lost. Within the sentence, more segmentation could be possible and relations added, thus providing more information. Take for example the following sample from the Merck Manual:

Hoewel de kleur, ernst en plaats van de uitslag uiteenlopen, jeukt deze altijd.
De jeuk leidt vaak tot een onbedwingbare neiging tot krabben, wat een cyclus van jeuk-krabben-uitslag-jeuk tot gevolg heeft, waardoor het probleem alleen maar erger wordt.

ENG: Although the color, seriousness and location of the rash differ, it always itches. The itch often leads to an uncontrollable tendency to scratch, which results in an itch-scratch-rash-itch cycle, making the problem only worse.

A possible segmentation is the following, in which this small text is segmented at the sentence boundaries:

[Hoewel de kleur, ernst en plaats van de uitslag uiteenlopen, jeukt deze altijd.]2A [De jeuk leidt vaak tot een onbedwingbare neiging tot krabben, wat een cyclus van jeuk-krabben-uitslag-jeuk tot gevolg heeft, waardoor het probleem alleen maar erger wordt.]2B

The second sentence can be connected to the first sentence with an Elaboration, a Non-Volitional Result or a Background relation. But the second sentence contains more information on its own, which one would probably like to make visible in the structure. To add this information, the text has to be segmented into smaller pieces. Perhaps a better segmentation is the following:

Figure 3.2: The graphical relation part

[Hoewel de kleur, ernst en plaats van de uitslag uiteenlopen, jeukt deze altijd.]3A [De jeuk leidt vaak tot een onbedwingbare neiging tot krabben,]3B [wat een cyclus van jeuk-krabben-uitslag-jeuk tot gevolg heeft,]3C [waardoor het probleem alleen maar erger wordt.]3D

Now the following relations can be added to this text, resulting in the tree in figure 3.3.

Figure 3.3: The RST-tree of the Merck example (a Background relation with nucleus 3A, a Non-Volitional Result with nucleus 3B, and an Elaboration between 3C and 3D)

This segmentation provides more information: the fact that the itch-scratch-rash-itch cycle is caused by the scratching can be extracted from the rhetorical structure. There is no single best approach to divide a discourse into segments.
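The finer-grained, clause-level segmentation shown above can be sketched as a simple heuristic. The following is an illustrative toy only, not the segmentation procedure of this thesis; the list of clause introducers is a made-up excerpt.

```python
# Minimal sketch (not the thesis implementation) of splitting a sentence
# into clause-level EDU candidates: cut at commas that are followed by a
# word from a small, assumed list of Dutch clause introducers.
CLAUSE_INTRODUCERS = {"wat", "waardoor", "hoewel", "omdat", "terwijl"}

def segment_edus(sentence: str) -> list[str]:
    parts = sentence.split(",")
    edus, current = [], parts[0]
    for part in parts[1:]:
        stripped = part.strip()
        first_word = stripped.split()[0].lower() if stripped else ""
        if first_word in CLAUSE_INTRODUCERS:
            edus.append(current.strip() + ",")  # keep the comma with the left EDU
            current = part
        else:
            current += "," + part  # this comma does not start a new clause
    edus.append(current.strip())
    return edus

sentence = ("De jeuk leidt vaak tot een onbedwingbare neiging tot krabben, "
            "wat een cyclus van jeuk-krabben-uitslag-jeuk tot gevolg heeft, "
            "waardoor het probleem alleen maar erger wordt.")
for edu in segment_edus(sentence):
    print(edu)
```

On the example sentence this yields the three units labeled 3B, 3C and 3D above; a real segmenter would of course need syntactic information rather than a fixed word list.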
In the example above it can be argued that the second segmentation is better than the first, since more information is added, but it cannot be decided that it is the best. For example, it is possible to segment the first sentence into multiple parts as well, like:

[Hoewel de kleur, ernst en plaats van de uitslag uiteenlopen,]4A [jeukt deze altijd.]4B [De jeuk leidt vaak tot een onbedwingbare neiging tot krabben,]4C [wat een cyclus van jeuk-krabben-uitslag-jeuk tot gevolg heeft,]4D [waardoor het probleem alleen maar erger wordt.]4E

These two parts can be connected with a Concession relation, which results in the tree in figure 3.4.

Figure 3.4: The final RST-tree of the Merck example (a Concession between 4A and 4B, a Non-Volitional Result with nucleus 4C, an Elaboration between 4D and 4E, and a Background relation at the top)

Further segmentation of an elementary discourse unit is not necessarily the best option. In fact, there is no optimal segmentation defined. Human annotators segment a text at the points where they think it is useful; therefore, differences in segmentation can occur when a text is annotated by different annotators.

3.3 Required Resources for Annotation

To be able to connect two spans of text with a rhetorical relation, a decision has to be made between which segments a relation is to be added, and exactly which relation fits best. There are multiple ways to connect the spans. These differences can occur for several reasons [MT88], for example ambiguity of the texts, differences between the annotators, and analytical error. This also depends on the way the text is segmented. Humans can read texts because they have knowledge about the world: they know what the text is about, and it makes sense to them. People know what the meaning of the word "tree" is, and know (a portion of) the properties of the entity tree. They also know what the effect of the event "fell" is.
World knowledge provides the possibility to segment the example from chapter 2 into elementary discourse units and to decide which relation fits between them. Without knowledge about the domain of the subject, it is very hard to create a meaningful structure representation. This can be shown with the following sample from the Merck Manual, a paragraph about immune malfunctions:

Experimentele methoden, zoals de transplantatie van foetale thymuscellen en levercellen, zijn in enkele gevallen effectief gebleken, vooral bij patiënten met het syndroom van DiGeorge. Bij ernstige gecombineerde immuundeficiëntie met adenosinedeaminasedeficiëntie kunnen de ontbrekende enzymen soms worden aangevuld. Gentherapie is veelbelovend bij deze en enkele andere aangeboren immuundeficiënties waarbij de erfelijke afwijking is geïdentificeerd.

ENG: Experimental methods, like the transplantation of fetal thymus cells and liver cells, appeared to be effective in some cases, especially with patients suffering from the DiGeorge syndrome. In cases of severe combined immune deficiency with adenosine deaminase deficiency, the missing enzymes can sometimes be replenished. Gene therapy is promising for this and some other innate immune deficiencies where the hereditary abnormality has been identified.

This text is harder to segment since it is about more specialized issues. One can decide to take the sentences as units, since such a segmentation is easy to perform, resulting in the following text:

[Experimentele methoden, zoals de transplantatie van foetale thymuscellen en levercellen, zijn in enkele gevallen effectief gebleken, vooral bij patiënten met het syndroom van DiGeorge.]6A [Bij ernstige gecombineerde immuundeficiëntie met adenosinedeaminasedeficiëntie kunnen de ontbrekende enzymen soms worden aangevuld.]6B [Gentherapie is veelbelovend bij deze en enkele andere aangeboren immuundeficiënties waarbij de erfelijke afwijking is geïdentificeerd.]6C

The selection of the relations between these units is harder.
The relation between 6A and 6B can be an Elaboration, but could also be judged an Evidence relation. Probably few people know what "adenosinedeaminasedeficiëntie" is, making it hard to decide which relation to choose. The RST relations are defined with certain constraints, and these constraints indicate what (kind of) world knowledge is actually needed. For example, the Evidence relation is defined with the following constraints:

1. Constraints on N(ucleus): R(eader) might not believe N to a degree satisfactory to W(riter).
2. Constraints on S(atellite): The reader believes S or will find it credible.
3. Constraints on the N+S combination: R's comprehending S increases R's belief of N.

The need for world knowledge to assign this relation is shown with the following example:

We ran out of cookies. The red box is empty.

To be able to assign the Evidence relation, it is necessary to have world knowledge about the cookies and the box: one needs to know that the cookies are stored in the red box. Since that box is empty, the conclusion is that there are no cookies present. So the second sentence provides the evidence for the first statement: the combination of the two sentences gives reason to believe the statement that there are no cookies left.

3.4 Discourse Markers

In general a discourse contains discourse markers. Discourse markers are words or expressions which signal that certain discourse relations hold between certain discourse segments. Examples of such markers are conjunctions, prepositions and adverbs, but more word types fit in as well. No exact definition of discourse markers exists, and there are several alternative names, for instance cue phrases, discourse connectives, coherence markers and more [Fra99]. Discourse markers can be useful for the segmentation of text as well: if a discourse marker is found in a text, the text can be segmented at that point and the relation added afterwards.
In the following example the discourse marker "although" is found, which can signal a Concession relation.

The cat is hungry, although she just finished her meal.

The text can be segmented into two elementary discourse units, where the second unit starts with the discourse marker. This results in the following:

[The cat is hungry,]8A [although she just finished her meal.]8B

The Concession relation can be added, in which the first part is the nucleus and the second part the satellite; see figure 3.5.

Figure 3.5: The RST tree of the hungry cat

If the example above was written in a different order, like:

Although the cat just finished her meal, she is still hungry.

the Concession relation would still hold, but the segmentation differs. In this example the segmentation could be done at the comma position, but there are other possibilities, in which the segmentation task is much harder. Since text can be ambiguous, certain words which can act as a discourse marker do not always signal a relation. And in the case a discourse marker does signal a relation, multiple relations can be possible. A discourse marker like "because" can signal a causal relation, but it depends on the way the text is written which exact rhetorical relation is the most useful. This can be illustrated with the following examples:

Because John acted badly, his mother sent him to bed.

Snakes are very dangerous because they bite lots of people.

The first item could be marked with a Volitional Result relation, while the second is better described with a Justify or an Evidence relation. Unfortunately, discourse markers do not signal all relations in a discourse, and they can be ambiguous. In [Tabng] it is shown that approximately 60-70% of the relations are not signaled by discourse markers. That study also suggests that genre-specific factors may affect which relations are signaled.
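The marker ambiguity discussed above can be made concrete as a small lookup table. The sketch below is illustrative only; the marker-to-relation sets are assumed examples, not a resource used in this thesis.

```python
# Illustrative sketch: each discourse marker maps to the set of relations
# it *may* signal; picking the right one requires context or world
# knowledge. The marker-to-relation pairs here are assumed examples.
MARKER_RELATIONS = {
    "because":  {"Volitional Cause", "Non-Volitional Cause", "Justify", "Evidence"},
    "although": {"Concession", "Antithesis"},
    "but":      {"Antithesis", "Contrast"},
}

def candidate_relations(marker: str) -> set[str]:
    """Return every relation the marker might signal (empty if unknown)."""
    return MARKER_RELATIONS.get(marker.lower(), set())

# Every marker with more than one candidate relation needs disambiguation.
ambiguous = [m for m, rels in MARKER_RELATIONS.items() if len(rels) > 1]
print(candidate_relations("Because"))
print("ambiguous markers:", ambiguous)
```

Even this toy table shows why a purely marker-based recognizer cannot pick a unique relation: every entry is ambiguous.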
This suggests that special types of text may have a higher percentage of relations signaled by discourse markers. Instances of words and phrases which can act as a discourse marker do not necessarily do so. As mentioned above, prepositions can be discourse markers, but not all prepositions are; whether or not a preposition acts as a discourse marker depends on the sentence. Take for example the preposition "behind". In the following sentence the word acts as a marker, signaling an Elaboration relation:

The dog sits in his kennel, behind the brick wall.

While in the next sentence it does not:

I am always behind you.

Another difficulty with the use of discourse markers is the fact that a certain discourse marker can signal different relations [HL93], especially relations which are quite similar. The discourse marker "but", for example, can signal an Antithesis relation but could as well signal a Contrast relation.

3.5 Assigning Relations

The assignment of a relation between two text spans can be a hard decision. The use of discourse markers helps in making the segmentation and the assignment of the relations. In some cases it is very hard or even impossible to assign a relation between segments if there is no world knowledge. This can be illustrated with the following example:

Elke cel bevat mitochondriën, kleine structuren die de cel van energie voorzien.

ENG: Each cell contains mitochondria, small structures which provide the cell with energy.

A (correct) segmentation is:

[Elke cel bevat mitochondriën,]13A [kleine structuren die de cel van energie voorzien.]13B

These parts can be connected with the Elaboration relation. Since there are no discourse markers present which signal the Elaboration relation in this sentence, the use of world knowledge is necessary. If this sentence must be annotated automatically, a problem arises, since a computer has no knowledge about the world unless it is specifically added.
This problem will be discussed in more detail in chapter 6.

3.6 Tree Structure

If a text were manually annotated by different annotators, differences would occur. People do not only choose different relations between elementary discourse units; the actual EDUs they relate show considerable differences as well. Take for example the following text, segmented into three elementary discourse units and annotated into two trees in figure 3.6.

[Harry fell out of a tree.]14A [He broke his wrist.]14B [His wrist hurt badly.]14C

Figure 3.6: Two annotation trees, (a) and (b), both using Non-Volitional Result relations

Both elementary discourse units 14B and 14C are connected to another EDU with a Non-Volitional Result relation. But where in the first tree the EDUs are both connected to 14A, in the second tree the third is connected to the second. These kinds of differences in connecting EDUs with relations happen very often during annotation. For large(r) texts this can cause two trees of the same text to be quite unlike each other. Of all the texts annotated during this research, only a few RST trees were similar to each other, and none of those texts consisted of more than 4 EDUs.

Chapter 4 Structure in Dutch

In this chapter structure in Dutch texts will be discussed. First, properties of Dutch in general and the possibility to use Rhetorical Structure Theory to annotate texts in this language are described. After that, discourse markers for Dutch are discussed. In the last section the differences between several genres of Dutch text are shown.

4.1 Dutch Text

Dutch is a West European language natively spoken by over 21 million people1. It belongs to the family of West Germanic languages, just like English and German, although differences exist between these languages. The writing system of Dutch uses the Latin alphabet. Although RST was developed using English texts, it can be used for Dutch as well.
In [KS98] Knott and Sanders performed a cross-linguistic study of English and Dutch regarding the classification of coherence relations. They found similarities in the use of discourse markers signaling coherence relations. More studies [San, MS01, ART93] ground the use of RST in Dutch. This will be illustrated with some small examples regarding the use of RST in Dutch compared to English. The composition of Dutch sentences is relatively comparable to their translations in English, as this small sentence illustrates:

De man is een broodje aan het eten.
ENG: The man is eating a sandwich.

It shows that the translation is relatively similar: the elementary units are conserved. In the translation the nouns (man, sandwich) and the verbs (is, eating) are present. If the two sentences used in an example in chapter 2 are translated, it shows that the relation is also conserved:

Harry broke his wrist. He fell out of a tree.

Dutch translation:

Harry heeft zijn pols gebroken. Hij viel uit een boom.

1 http://taalunieversum.org/ as of 08/2006

Although these simple examples do not provide evidence, they suggest that Rhetorical Structure Theory can be used to describe structure in Dutch. During this research several pieces of Dutch text from different genres were successfully annotated with RST; no texts were found which could not be annotated.

4.2 Dutch Relations

For the manual annotation of Dutch texts during this assignment, the standard set of rhetorical relations as defined by Mann and Thompson is used. This set of relations provides the possibility of structuring Dutch texts, but it is questionable whether this set is optimal in terms of coverage of the text [HM97]. This problem exists for English (and other languages) as well, despite the fact that RST was originally developed for English.
Some text could perhaps be better related with an undefined relation type (for example, a kind of relation that is used by other theories), so it is possible there is a need for additional relations. In [HM97], Hovy discusses a set of over 350 relations from different sources. However, it is also possible to merge relations which differ only slightly (in their definition). For the automatic recognition of text, a shorter list of possible relations is important since it decreases the likelihood of an error. A drawback of the use of a shorter list is an increase in information loss. In chapter 5 a classification is made of relations in Dutch medical texts, which is used for the automatic recognition as described in chapter 6.

4.3 Dutch Discourse Markers

Dutch contains discourse markers, similar to other languages like English. There are different kinds of words which can act as discourse markers. First the conjunctions will be described, and it will be shown that they are most useful within sentences rather than between sentences. After that, other kinds of words are discussed, like the adverbs and demonstrative pronouns which are used for the recognition of relations between sentences.

4.3.1 Conjunctions

The first word type which can be used as a discourse marker to be discussed is the conjunction. Using this word type as a discourse marker is done in several studies [PSBA03, Mar00, CFM+02, Sch87] and many more. Conjunctions usually signal relations within sentences, but a small subset can also signal relations between sentences. Lists of conjunctions are collected from different sites2 regarding the Dutch language, and afterwards extended with data from the CGN lexicon [HMSvdW01]. The CGN lexicon is developed for the project Corpus Gesproken Nederlands (Spoken Dutch Corpus). It consists of about a million syntactically annotated words.

2 http://www.muiswerk.nl/WRDNBOEK/VOEGWRDN.HTM
http://nl.wikipedia.org/wiki/Voegwoord
The conjunctions are extracted from this lexicon, and the conjunctions which would not be used in written text are removed. This is done because the CGN lexicon also contains some speech corruptions, for example the word "enne" (and-eh) as a corruption of the word "en" (and). The total list of conjunctions is added in appendix B, with their corresponding English translations. The list contains conjunctions which are used today and some 'old' words not often used anymore. All the data used for this project, the whole Merck Manual, is checked for the appearance of the conjunctions, and a part of these words did not appear once. The words "schoon" and "naar" did in fact appear, but not as conjunctions, since they also mean "clean" and "to" in Dutch, respectively. The other conjunctions are all used in the available data. The list of 'old' conjunctions is shown in table 4.1.

aleer, alsook, annex, dewijl, doordien, eer, eerdat, ende, gelijk, hoezeer, ingeval, naar, naardien, nademaal, niettegenstaande, ofdat, oftewel, overmits, schoon, uitgenomen, vermits, vooraleer, wijl
Table 4.1: 'Old' Dutch Conjunctions

In table 4.2 the remaining conjunctions of the collected list are shown.

aangezien, al, alhoewel, als, alsmede, alsof, alvorens, behalve, daardoor, daar, dan, dat, doch, doordat, dus, evenals, en, hetzij, hoewel, indien, maar, mits, na, naargelang, naarmate, nadat, noch, nu, of, ofschoon, ofwel, om, omdat, opdat, sedert, sinds, tenzij, terwijl, toen, tot, totdat, uitgezonderd, zoals, voor, voordat, voorzover, wanneer, want, zo, zodat, zodra, zolang, zover
Table 4.2: Most Common Dutch Conjunctions

An example of the use of conjunctions in Dutch:

Het paard staat in de wei, hoewel het erg koud is.
ENG: The horse stands in the meadow, although it is very cold.

http://triangulu.co.wikivx.org/nl/grammatica
http://oase.uci.kun.nl/ ans/e-ans/10/body.html
The sentence could be segmented at the discourse marker "hoewel" (although), resulting in:

[Het paard staat in de wei,]16A [hoewel het erg koud is.]16B

The discourse marker "hoewel" signals a Concession relation here. The corresponding RST-tree is shown in figure 4.1.

Figure 4.1: The Concession relation

To get an overview of the number of possible discourse markers, 10 texts from the Merck Manual were randomly selected, ranging from about 150 to 2000 words. The occurrences in these texts of the conjunctions shown in appendix B are counted. The results are presented in table 4.3. The table shows only the numbers of occurrences; not all occurrences of a certain conjunction do in fact signal a relation. Words which can act as a conjunction can also be used for other purposes than connecting two sentence parts together. This can be shown with the following fragment from text2, where the word "of" (or) is not used as a conjunction.

Tevens kan deze bacterie longontsteking (pneumonie) veroorzaken, bronchitis, middenoorontsteking (otitis media), oogontsteking (conjunctivitis), ontsteking van een of meer neusbijholten (sinusitis) en acute infectie van het gebied net boven het strottenhoofd (epiglottitis).

ENG: Also, this bacterium can cause pneumonia, bronchitis, inflammation of the middle ear (otitis media), inflammation of the eye (conjunctivitis), inflammation of one or more accessory sinuses of the nose and acute infection of the area just above the larynx (epiglottitis).

This can be an explanation for the significant differences in the numbers of occurrences. While in text6 9.88% of the total number of words are conjunctions, in text7 it is only 3.77%. Another special case regarding text7 is the fact that it does not contain any instance of the word "en" or "of", while all other texts from table 4.3 do.
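The count behind table 4.3 can be sketched as a few lines of code. This is a minimal illustration, not the tooling used in this thesis; only a made-up excerpt of the conjunction list from appendix B is included.

```python
# Sketch of the statistic in table 4.3: the fraction of word tokens in a
# text that occur in the conjunction list (an assumed excerpt here).
import re

CONJUNCTIONS = {"en", "of", "maar", "want", "dus", "als", "dat",
                "hoewel", "omdat", "terwijl", "nadat", "zodat"}

def conjunction_stats(text: str) -> tuple[int, int, float]:
    """Count conjunction tokens, total tokens, and the percentage."""
    tokens = re.findall(r"[a-zà-ÿ]+", text.lower())
    hits = sum(1 for token in tokens if token in CONJUNCTIONS)
    return hits, len(tokens), 100.0 * hits / len(tokens)

sample = ("Het paard staat in de wei, hoewel het erg koud is. "
          "De deur is dicht, omdat het regent.")
hits, total, pct = conjunction_stats(sample)
print(f"{hits} conjunctions out of {total} words ({pct:.2f} %)")
```

Note that, as the text above stresses, such a count is an upper bound: a token like "of" may occur without acting as a conjunction at all.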
To be able to assign relations to sentences, it is necessary to know which relations are signaled by which discourse markers. It is possible that a discourse marker signals more than one relation; also, relations can be signaled by multiple discourse markers [GS98]. Therefore a classification of the relations signaled by the possible discourse markers is needed. First of all an overview is gathered of which conjunctions are most present in medical discourses. In table 4.4 the distribution of the conjunctions in the texts is shown. The numbers represent each occurrence of the conjunctions, and the percentages show the diversity of the conjunctions; for example, 5.71% of all the found conjunctions were "tot". The fact that the word "voor" occurs 38 times in the selection does not mean it also acts 38 times as a discourse marker.

Name   Conjunctions  Total Words  Percentage
Text0  48            548          8.76 %
Text1  14            243          5.76 %
Text2  12            153          7.84 %
Text3  26            330          7.88 %
Text4  66            964          6.85 %
Text5  162           2048         7.91 %
Text6  85            860          9.88 %
Text7  8             212          3.77 %
Text8  13            191          6.81 %
Text9  74            1060         6.98 %
Table 4.3: Percentages of the number of conjunctions

To check whether a conjunction acts as a discourse marker or not, all texts are analyzed by hand. It is necessary to know which relations are signaled by which discourse marker and in which context. A difference is made between the signaling of relations between sentences and inside sentences. In [PSBA03] the conjunctions are divided into two types: the subordinating conjunctions and the coordinating conjunctions. While subordinating conjunctions can signal relations within the sentence they appear in, the coordinating conjunctions can also signal relations between sentences. The list of coordinating conjunctions contains only five elements: en, want, maar, dus, of [KS04]. The rest are considered subordinating conjunctions.
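The coordinating/subordinating split described above is simple enough to express directly; the following sketch assumes the five coordinating conjunctions from [KS04] and treats every other conjunction as subordinating.

```python
# The five Dutch coordinating conjunctions from [KS04]; every other
# conjunction in the collected list is treated as subordinating.
COORDINATING = {"en", "want", "maar", "dus", "of"}

def conjunction_type(word: str) -> str:
    return "coordinating" if word.lower() in COORDINATING else "subordinating"

print(conjunction_type("Maar"))    # may also signal a relation between sentences
print(conjunction_type("hoewel"))  # signals relations only within a sentence
```

The distinction matters for an automatic recognizer: only the coordinating class is a candidate for signaling relations across sentence boundaries.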
To be able to find the relations signaled by the conjunctions, the texts are manually checked for each conjunction. If a conjunction is found in the text, it is checked whether it signaled a relation or not; if it did, the actual relation it signaled is noted. Some conjunctions signaled different relations in different contexts. The conjunctions that signaled a relation most of the times they were found in the text are shown in table 4.5. The table shows that all these conjunctions were found to signal relations within a sentence. Two of the conjunctions did indeed signal relations between sentences; both are considered to be coordinating conjunctions. Although the conjunction "want" is added to the list, no occurrence of it was found signaling a relation between sentences. In her master's thesis [vL05], van Langen also did not find the conjunction "want" signaling a relation between sentences. In her research the signaling of a relation between sentences by the coordinating conjunction "dus" was not found either, although it was found in the data used for this assignment. The coordinating conjunctions "of" and "en" are missing from the table, because their signaling of a relation was found to be too ambiguous. For example, the word "en" is used to combine multiple subjects into a single subject in about 75% of the cases, as will be illustrated in section 4.4.1. Note that the word "daardoor" can also signal a relation between sentences; this is because it can act both as a conjunction and as a conjunctive adverb. Conjunctive adverbs are discussed in section 4.3.2. The word "daardoor", along with "dus" and "daarom", are
Conjunction  Occurrences  Percentage
En           164          32.28 %
Of           79           15.56 %
Voor         38           7.48 %
Om           32           6.30 %
Tot          29           5.71 %
Als          26           5.19 %
Dan          25           4.92 %
Dat          23           4.53 %
Maar         17           3.30 %
Omdat        17           3.30 %
Wanneer      13           2.56 %
Na           8            1.57 %
Terwijl      8            1.57 %
Zoals        5            0.98 %
Totdat       4            0.79 %
Al           3            0.59 %
Doordat      3            0.59 %
Zo           3            0.59 %
Behalve      2            0.39 %
Daardoor     2            0.39 %
Zodat        2            0.39 %
Daar         1            0.20 %
Dus          1            0.20 %
Hoewel       1            0.20 %
Ofwel        1            0.20 %
Voordat      1            0.20 %
Total        508          100 %
Table 4.4: Overview of Occurrences

Table 4.5: Relations signaled by conjunctions. [The column layout of this table did not survive extraction; it lists 24 conjunctions (aangezien, behalve, daardoor, doordat, dus, evenals, hoewel, indien, maar, mits, nadat, ofschoon, ofwel, omdat, opdat, sinds, tenzij, uitgezonderd, wanneer, want, zoals, zodat, zodra and zolang) with, for each, the relation(s) it can signal (among them Evidence, Concession, (Non-)Volitional Cause, (Non-)Volitional Result, Condition, Elaboration, Sequence, Antithesis, Contrast, Background and Circumstance) and whether it signals them inside and/or between sentences. All listed conjunctions signal relations inside sentences; only "daardoor", "dus" and "maar" are marked as also signaling relations between sentences.]

considered causal connectives, where the cause precedes the result [MS01]. Some examples of the use of these conjunctions are shown below. The first example shows the use of the conjunction "indien" (if); the corresponding tree is presented in figure 4.2.
[Indien de histamine met het voedsel wordt ingenomen,]18A [kan het gezicht onmiddellijk rood worden.]18B

[ENG: If the histamine is consumed together with the food,]18C [the face can become red immediately.]18D

Figure 4.2: The Condition relation

The second example shows the use of the conjunction "aangezien" (for); the tree is shown in figure 4.3.

[Het is vaak niet eenvoudig om onderscheid te maken tussen hyperhydratie en te hoog bloedvolume,]19A [aangezien hyperhydratie zowel op zichzelf als in combinatie met te hoog bloedvolume kan voorkomen.]19B

[ENG: It is often not easy to distinguish between hyperhydration and a too high blood volume,]19C [for hyperhydration can occur both on its own and in combination with a too high blood volume.]19D

Figure 4.3: The Evidence relation

The remaining conjunctions signaled relations at a lower frequency. For some words this is caused by the fact that they can act as a different type of word instead of a conjunction. This can be shown with the following example regarding the word "al": in the first case it acts as a conjunction, while in the second sentence it does not.

Robert pakte een koekje, al had zijn moeder dat verboden.
ENG: Robert took a cookie, although his mother had forbidden it.

Die tijd is al voorbij.
ENG: That time is already gone.

The most important conclusion resulting from table 4.5 is the fact that only a few conjunctions signal relations between sentences. For relations inside sentences these conjunctions can be very useful, but for the automatic recognition of relations between sentences, other discourse markers or information providers have to be used as well.

Background, Elaboration, Enablement, Evaluation, Evidence, Interpretation, Justify, Motivation, Preparation, Restatement, Solutionhood
Table 4.6: Relations seldom signaled

In discussions between members of the linguistic listserver3, relations which are seldom signaled are discussed.
These relations, as mentioned in the discussion, are shown in table 4.6. If this list is compared with the list of relations which were signaled by conjunctions, certain similarities show. Just two of the seldom signaled relations were indeed found to be signaled by conjunctions: the Elaboration and the Evidence relation. This grounds the use of this list for relations which are not signaled, and which are thus hard to find if a text were to be annotated using conjunctions as discourse markers.

4.3.2 Other Discourse Markers

As shown in table 4.5, conjunctions usually signal relations within sentences rather than between sentences. Other kinds of words which can signal relations are adverbs; more specifically, the conjunctive adverbs [PSBA03]. A list of conjunctive adverbs is collected from different sources [vL05, Per]. The combined list is shown in table 4.7 with the corresponding English translations. Conjunctive adverbs are considered to be able to signal relations between sentences if they appear as the first word of the second sentence [vL05]. The next thing to know is which relation is signaled by a certain conjunctive adverb. To gather this information, for each conjunctive adverb a selection of extracts from the available data is taken and checked manually. Only the conjunctive adverbs which signaled relations most clearly are noted. Furthermore, conjunctive adverbs which appeared in very low numbers are omitted from the list; for example, the word "hierom" occurs only one time in the whole dataset. The results are shown in table 4.8. Furthermore, it is useful to know whether there are other adverbs which signal a relation. Therefore a list of adverbs is collected. These adverbs are gathered from a website4 and extracted from the CGN lexicon. The total number of different adverbs collected in this way exceeds a thousand. The adverbs which might signal a relation are extracted from the list manually.
Since this is done by hand (the adverbs cannot be checked automatically due to the lack of large annotated texts), only a portion of the adverbs is checked. The adverbs which occur very frequently are discarded, because analysis of these words showed they are too ambiguous to use. The adverbs which occur only rarely are discarded too, because a few occurrences are not enough to determine which relation they represent and whether the adverb signals a relation strongly. Furthermore, an adverb which occurs only a few times is less useful for an automatic recognizer than an adverb which occurs more often. The adverbs are checked against extracts from the Merck Manual. For each adverb, different extracts were gathered which embedded the adverb. These texts were analyzed, and every time the adverb was found to signal a relation it was counted. Adverbs which signaled a relation most of the time are kept; the others are discarded as well. This resulted in a list of adverbs with their corresponding relations. The overview of these words is shown in table 4.9. Again a difference is made between the signaling of relations inside sentences and between sentences. The relations signaled by adverbs are mainly of the Elaboration type.

3 http://listserv.linguistlist.org
4 http://www.muiswerk.nl/WRDNBOEK/BIJWOORD.HTM

Table 4.7: Conjunctive Adverbs
Bijgevolg        As a consequence
Bijvoorbeeld     For example
Bovendien        Besides
Daardoor         Therefore
Daarnaast        Moreover
Daarom           Therefore
Daartoe          For that
Dan              Than
Ook              As well
Derhalve         So
Dientengevolge   Consequently
Dus              So / Therefore
Echter           However
Evenzeer         Also
Hierdoor         Because of this
Hierom           For this reason
Hiertoe          For this purpose
Immers           After all
Namelijk         Namely
Ook              As well
Tenslotte        Finally
Tevens           Also
Toch             Nevertheless
Verder           Apart from that
Vervolgens       Next
Wel              Well
Zo               So
Table 4.9 shows that 13 of the 16 adverbs in the list can signal an Elaboration relation. This can be explained by the fact that two sentences regarding the same topic usually provide extra information about earlier statements. Adverbs signal a relation most clearly when they appear at the start of the second sentence; however, they are found to signal relations when residing in the middle of the second sentence as well, as in the following example:

Table 4.8: Conjunctive Adverbs and the relations they might signal
Conjunctive Adverb: Bijvoorbeeld, Bovendien, Daardoor, Daarnaast, Daarom, Derhalve, Dus, Dientengevolge, Hierdoor, Ook, Tevens, Toch, Verder, Vervolgens, Zo
Relation: Elaboration, Elaboration, Non-Volitional Result, Elaboration, Sequence, Justify / Evidence, Background, Non-Volitional Cause, Non-Volitional Result, Non-Volitional Result, Non-Volitional Result, Non-Volitional Result, Non-Volitional Result, Elaboration, Elaboration, Elaboration, Concession, Elaboration, Elaboration, Non-Volitional Result, Elaboration
Inside Sentences: Yes, Yes, Yes, No, No, No, No, No, No, No, Yes, No, Yes, Yes, No, No, No, No, No, No, No
Between Sentences: Yes (for all)

[Bij volwassenen leidt een tekort aan groeihormoon meestal tot aspecifieke symptomen.]20A [Bij kinderen leidt het daarentegen tot sterk vertraagde groei en soms tot dwerggroei.]20B [ENG: In adults, a shortage of growth hormones usually causes non-specific symptoms.]20C [In children, on the other hand, it causes strongly delayed growth and sometimes dwarfism.]20D

The word "Daarentegen" (on the other hand) signals a Contrast relation between both sentences. This way of signaling is also found with the conjunctive adverbs: if a conjunctive adverb resides in the middle of a sentence, it also signals a relation, and the relation it signals is the same as when the marker appears at the start of the sentence.
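The detection of such markers can be sketched as follows. The marker-to-relation mapping below is a small illustrative subset of tables 4.8 and 4.9, and the whitespace tokenizer is a stand-in, not the recognizer built in this thesis.

```python
# Sketch of conjunctive-adverb detection. MARKERS is an illustrative
# subset of the mappings in tables 4.8/4.9; the tokenizer is a plain
# whitespace split.
MARKERS = {
    "bijvoorbeeld": "Elaboration",
    "bovendien": "Elaboration",
    "daardoor": "Non-Volitional Result",
    "daarentegen": "Contrast",
}

def signaled_relation(second_sentence):
    """Relation hypothesized between two sentences, based on a marker
    in the second one. A sentence-initial marker is the strongest
    signal; a sentence-medial marker is accepted as a fallback."""
    words = second_sentence.lower().rstrip(".").split()
    if words and words[0] in MARKERS:
        return MARKERS[words[0]]
    for word in words:          # fall back to a medial marker
        if word in MARKERS:
            return MARKERS[word]
    return None

print(signaled_relation("Daardoor ontstaat een tekort."))
# -> Non-Volitional Result
print(signaled_relation("Bij kinderen leidt het daarentegen tot vertraagde groei."))
# -> Contrast
```

The two calls illustrate the sentence-initial and the sentence-medial case from the running example.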
For both conjunctive adverbs and other adverbs it is found that these words signal more strongly when they appear at the start of the sentence. The third type of textual unit which can signal a relation is the pronoun, more specifically the demonstrative pronoun. This type of word can be used to link sentences together when the writer wants to elaborate on the subject of a previous sentence. This extra information can for example be a Non-Volitional Result or an Elaboration. Examples of the use of demonstrative pronouns are the following:

Table 4.9: Adverbs and the relations they might signal
Adverb: Daarentegen, Daarna, Daarop, Door, Doorgaans, Echter, Gewoonlijk, Hierbij, Meestal, Om, Ongeveer, Soms, Tegelijkertijd, Vooral, Voorts, Zelfs
Relation: Contrast, Concession, Elaboration, Non-Volitional Result, Elaboration, Elaboration, Concession, Elaboration, Elaboration, Elaboration, Elaboration, Elaboration, Elaboration, Concession, Elaboration, Elaboration, Elaboration, Elaboration
Inside Sentences: No (for all)
Between Sentences: Yes (for all)

Katten zijn vriendelijke beesten. Ze worden door veel mensen als huisdier gehouden.
ENG: Cats are friendly creatures. They are kept as pets by many people.

Sommige auto's raken total-loss na een ongeluk. Deze worden dan gesloopt.
ENG: Some cars become a total loss after an accident. These are then scrapped.

While the first example is best annotated with an Elaboration, a Non-Volitional Result fits the second example better, resulting in the tree in figure 4.4. The pronouns themselves, however, cannot be used to determine exactly which relation is signaled. In the previous example about the cars, the Non-Volitional Result relation is added because of the relation between "total loss" and "scrapped".
In nearly all the cases in the analyzed data where a (demonstrative) pronoun was present, the relation which would be added was the Elaboration relation. Therefore, all pronouns are considered to signal an Elaboration relation.

Figure 4.4: Non-Volitional Result signaled by the Demonstrative Pronoun "Deze"

4.4 Genres

As in all languages, there exist many different genres of text in Dutch. In this section the similarities and the differences between a selection of genres, and the consequences for the annotation of these texts, are described. For the comparison, use is made of medical texts, fairy tales, recipes, newspaper articles and weather forecasts, but of course there are plenty of other genres. A snapshot of these texts is added in appendix E.

4.4.1 Comparison

First of all, selections of newspaper articles5, fairy tales6, recipes7 and weather forecasts8 were gathered. These texts were gathered from the internet, each genre from multiple sites, to prevent the use of a style which is specific to a single source. The differences between texts of these genres are described in terms of several properties. The first property is the composition of the texts. While a fairy tale is mostly a text with practically no lay-out features (apart from chapters and paragraphs), a recipe makes extensive use of bulleted lists. A newspaper lay-out is more similar to a fairy tale, while a medical encyclopedia entry and a weather forecast contain both styles. The use of persons in the text is the second cause of difference. In fairy tales dialogues are often used, since fairy tales are mostly about persons or animals which go through adventures, while in medical encyclopedias and recipes this never happens. Recipes often consist only of a list of actions to be performed by the reader. A newspaper article can again contain both plain text and dialogues.
Some articles are enriched with interviews with the person the article is about, or with a professional in the subject the article describes. While fairy tales, and in a sense newspaper articles as well, tell stories, weather forecasts, recipes and encyclopedias only present objective information: the reader sees a text containing actions and facts. There is also a difference in coherence between these genres. This is illustrated with the examples below. The first example is a weather forecast:

[De wind is meest matig en draait van ZW naar west tot NW.]21A [De maxima liggen rond 19 graden.]21B [Vanavond wordt het geleidelijk droog en er volgt een droge nacht met opklaringen en plaatselijk mist.]21C [Minima 10 tot 14 graden.]21D
[ENG: The wind is mostly moderate, and turns from SW to West to NW.]21E [The maxima are about 19 degrees.]21F [Tonight it gradually becomes dry, followed by a dry night with bright periods and local mist.]21G [Minima 10 to 14 degrees.]21H

5 http://www.volkskrant.nl , http://www.trouw.nl , http://www.nrc.nl
6 http://www.sprookjes.nu , http://www.sprookjes.org
7 http://recepten.net , http://smikkelenensmullen.blogspot.com
8 http://www.knmi.nl , http://www.weer.nl

This text contains four sentences which each handle a different aspect of the weather. These aspects only relate to each other in that they are about the weather. The only suitable tree shape for this kind of text is a Joint over all sentences, since each one presents a valid statement about the weather, where no statement is of more importance than another. The second example is the following recipe:

[Grill voorverwarmen op hoogste stand.]22A [Tomaten wassen en in partjes snijden.]22B [Dressing door tomaten scheppen.]22C [Schnitzels plat slaan en bestrooien met zout en peper.]22D [In koekenpan olie verhitten.]22E [Schnitzels in ca.
6 minuten bruin en rosé bakken, halverwege keren.]22F [Schnitzels op 4 sneetjes brood leggen.]22G
[ENG: Preheat the grill at the highest level.]22H [Wash tomatoes and slice them in parts.]22I [Mix dressing with tomatoes.]22J [Flatten schnitzels and sprinkle with salt and pepper.]22K [Heat the oil in a frying pan.]22L [Bake schnitzels brown and rosé in ca. 6 minutes, turning halfway.]22M [Serve schnitzels on 4 slices of bread.]22N

In fact the whole recipe is one sequence of actions to be performed. No action is of more importance than another, so only one tree can be built for such a text. The best approach would be to cover all sentences with a Sequence relation. It is still possible, though, to split the text into smaller parts than a sentence. For example, sentence 22D could be split into two pieces: "Schnitzels plat slaan" and "en bestrooien met zout en peper". Texts from the other genres are usually better suited for annotation with RST, and yield a more obvious RST-tree. Take for example the following part of a fairy tale:

[Ver weg in een mooi land was eens een beeldschoon prinsesje.]23A [Ze woonde in een prachtig kasteel aan de rand van het bos.]23B [Het prinsesje zat graag bij de vijver in de kasteeltuin.]23C
[ENG: Far away in a beautiful country, once lived a gorgeous little princess.]23D [She lived in a magnificent castle at the edge of the forest.]23E [The little princess liked sitting at the pond in the castle garden.]23F

This could be annotated as in figure 4.5.

Figure 4.5: The Princess example

In the following tables the number of conjunctions for a selection of texts from the genres is shown. Table 4.10 shows the number of conjunctions for the newspaper articles. These numbers indicate that the number of conjunctions is comparable to that of the medical articles in table 4.3.

Name        Conjunctions  Total Words  Percentage
Newspaper0  37            486          7.61 %
Newspaper1  10            111          9.01 %
Newspaper2  26            284          9.15 %
Newspaper3  72            901          7.99 %
Newspaper4  20            236          8.47 %
Newspaper5  28            355          7.89 %
Newspaper6  29            354          8.19 %
Newspaper7  27            323          8.36 %
Newspaper8  29            398          7.29 %
Newspaper9  27            378          7.14 %

Table 4.10: Conjunction numbers of the Newspapers

If the conjunction numbers of the fairy tales, the recipes and the weather forecasts, shown in table 4.11, are compared, they show higher percentages than the newspaper texts and the medical texts of table 4.3.

Name        Conjunctions  Total Words  Percentage
FairyTale0  40            398          10.05 %
FairyTale1  12            201          5.97 %
FairyTale2  14            264          5.30 %
FairyTale3  54            471          11.46 %
FairyTale4  38            431          8.82 %
Recipe0     13            148          8.78 %
Recipe1     5             76           6.58 %
Recipe2     10            94           10.64 %
Recipe3     7             53           13.21 %
Recipe4     29            253          11.46 %
Recipe5     15            201          7.46 %
Weather0    7             83           8.43 %
Weather1    7             87           8.05 %
Weather2    7             79           8.86 %
Weather3    8             106          7.55 %
Weather4    14            136          10.29 %
Weather5    7             102          6.86 %

Table 4.11: Conjunction numbers of the Fairy Tales, Recipes and Weather Forecasts

To explain these peaks, the exact conjunctions are checked. It appears that the conjunction "en" (and) is very frequently present within these genres, in contrast to the other conjunctions, which are spread more equally. The number of occurrences of the word "en" is shown in table 4.12.

Text               Percentage
Medical Texts      32.87 %
News Papers        30.54 %
Recipes            64.94 %
Fairy Tales        39.35 %
Weather Forecasts  40.82 %

Table 4.12: Percentages of the conjunctions which are "en"

The high number of occurrences of the word "en" in some genres points out that some conjunctions are in fact less useful than others.
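The percentage computation behind the genre tables can be sketched as follows. The conjunction list and tokenizer here are small illustrative stand-ins, not the lexicon used in this thesis; the exclude parameter reproduces the variant in which occurrences of "en" are discarded.

```python
# Sketch of the conjunction-percentage computation used for the
# genre comparison. CONJUNCTIONS is an illustrative stand-in list.
CONJUNCTIONS = {"en", "maar", "want", "of", "omdat", "aangezien", "al"}

def conjunction_percentage(text, exclude=frozenset()):
    """Percentage of tokens that are conjunctions, optionally
    discarding some conjunctions (e.g. "en")."""
    words = [w.strip(".,").lower() for w in text.split()]
    words = [w for w in words if w]
    hits = sum(1 for w in words if w in CONJUNCTIONS and w not in exclude)
    return 100.0 * hits / len(words)

sample = "Jan en Piet liepen naar school omdat de bus niet reed."
print(round(conjunction_percentage(sample), 2))           # "en" and "omdat" counted
print(round(conjunction_percentage(sample, {"en"}), 2))   # with "en" discarded
```

Running the counter once with and once without "en" mirrors the difference between tables 4.11 and 4.13.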
Most of the occurrences of "en" do not signal a relation at all (about 75% is used in a construction as discussed below), while other conjunctions, like "want" (because), always do: no occurrence of "want" is found where it did not signal a relation. An example in which the word "en" does not signal a relation is:

Jan en Piet liepen naar school
ENG: Jan and Piet walked to school

The reason that "want" always signals a relation is that it implies a reason, which the writer is likely to add to the text separately from the statement the reason is about. But "en" can combine subjects which act as a single unit, as in the example above: Jan and Piet are two separate boys, but they walked together, and the fact that they were walking to school is stated about the group rather than about the boys themselves. In table 4.13 the percentages of conjunctions in the texts are shown, with the occurrences of the word "en" discarded. The main observation is that the differences between the percentages of texts from a single genre have decreased. Besides the fact that the percentages of texts of different genres became more equal, nearly all of the percentages are between about 4% and 7%. The only exception is shown with the recipes, where the numbers of conjunctions are very low, except for recipe4. The difference between recipe4 and the rest is the way of writing: while the other recipes consist of small statements about cooking, recipe4 consists of larger sentences, which are connected with conjunctions. The different genres are also compared on other discourse markers, like conjunctive adverbs. However, these numbers were much lower; for example, only about 1% of the words could be counted as conjunctive adverbs.

Name        Conjunctions  Total Words  Percentage
Text0       28            548          5.11 %
Text1       9             243          3.70 %
Text2       7             153          4.56 %
Text3       22            330          6.67 %
Text4       43            964          4.46 %
FairyTale0  19            398          4.77 %
FairyTale1  8             201          3.98 %
FairyTale2  12            264          4.55 %
FairyTale3  33            471          7.01 %
FairyTale4  25            431          5.80 %
Recipe0     1             148          0.68 %
Recipe1     1             76           1.32 %
Recipe2     2             94           2.13 %
Recipe3     1             53           1.89 %
Recipe4     18            253          7.11 %
Weather0    3             83           3.61 %
Weather1    5             87           5.75 %
Weather2    4             79           5.06 %
Weather3    6             106          5.66 %
Weather4    7             136          5.15 %
Newspaper0  26            486          5.35 %
Newspaper1  9             111          8.11 %
Newspaper2  18            284          6.34 %
Newspaper3  59            901          6.55 %
Newspaper4  15            236          6.36 %

Table 4.13: Conjunction numbers without the occurrences of "en"

Chapter 5

Medical Texts

This chapter describes medical texts, focused on the Merck Manual. This encyclopedia was chosen because it provides the most data among the possible Dutch sources, such as the Winkler Prins medical encyclopedia [Spe03], and because it explains the different diseases more thoroughly. It is also preferred for this assignment over medical texts from the Wikipedia1, since it is a professional source, where the Wikipedia is not; it is therefore assumed that the texts of the Merck Manual are better structured. The Merck Manual will be discussed regarding its text and features. After that, general properties of medical texts are covered. Furthermore, relations and discourse markers of medical texts will be described, and a study of special medical discourse markers is performed.

5.1 Merck Manual

For this research the Merck Manual is used. This is an online encyclopedia regarding medical issues. It differs from standard encyclopedias in its organization: while many encyclopedias are organized alphabetically, the Merck Manual is organized in classes. Examples of such classes are the Heart and Lung-diseases, the Hormone-system and Cancer.

5.1.1 Composition

The Merck Manual is divided into 24 sections, each describing a (separate) class, and appendices. Each section is divided into chapters.
A chapter contains the description of a subsection of the class. The section about the eye, for example, contains 12 chapters, such as Disorders of the eye sockets and Disorders of the cornea. These chapters consist of multiple subsections, and each chapter contains at least a subsection Introduction. The subsections deal with a specific part of the subject, such as the causes of the disorders, the symptoms and possible cures. Sometimes a list of diseases is provided about the subject, and symptoms and cures are dealt with per specific disease, or even a further subsectioning is used. An example is the description of apheresis2. This treatment is found through the following path:

(Section : Blood)
(Chapter : Blood Transfusion)
(Subject : Special Transfusion Procedures)
(Part    : Apheresis)

The complete Merck Manual can be represented as a tree: the root of the tree is the manual itself, with the different sections as its children. In figure 5.1 a graphical representation of this tree is shown. The leaves of the tree can either be a subject or a part of a subject; the leaves contain the actual text. In figure 5.2 a screenshot of the Merck Manual is shown. It displays the subject Functie van de voorkwab van de hypofyse of the chapter Aandoeningen van de hypofyse. The section to which the chapter belongs is called Hormonale stelsel. At the left of the figure, the other subjects of this chapter and their parts are shown. The subject Acromegalie consists of three parts: Symptomen, Diagnose and Behandeling. In appendix C additional background information about the Merck Manual is added.

1 http://www.wikipedia.org/

Figure 5.1: The structure of the Merck Manual

5.1.2 Features

The manual provides extra features, which are used to present the data in a format that is easier to read, for example lists, tables and headings.
In figure 5.3 a snapshot of the Merck Manual regarding the use of a table is shown.

2 Apheresis (Greek: "to take away") is a medical technology in which the blood of a donor or patient is passed through an apparatus that separates out one particular constituent and returns the remainder to the circulation, from: http://en.wikipedia.org/ as of 07/2006.

Figure 5.2: A screenshot of the Merck Manual

While such a feature is hard to parse, the data is presented in a consistent form. The use of lists in the text could, for example, serve as a marker for annotating the information in the list as a Sequence. However, only a small portion of the information presented in the manual is in a specific format; the main part is plain text. Since these plain texts are very large, the compositional information alone does not suffice, and recognition based on this information is therefore not used further.

5.2 Medical texts

In this section different medical texts are discussed. First the Merck Manual is described, followed by some other Dutch medical texts.

5.2.1 Merck Texts

Medical texts are a subset of discourses, and they share common properties. First of all, the subjects of these medical texts are all related to the medical field. For example, the Merck Manual gives descriptions of illnesses, but also describes their treatments. A typical composition of a medical text in Merck is a selection from the following possibilities. Usually the text starts with a short description of the subject. In many cases symptoms are described, as well as diagnosis, prognosis, treatments, causes of the disease and prevention. A subject can also contain several examples of and variations on the topic of the chapter, in which each example is explained with a selection of diagnosis, treatments, etcetera.

Figure 5.3: Tables in the Merck Manual
This composition can give the text certain properties which can be used for annotation. The text itself also has some special features. A medical encyclopedia is written with the purpose of informing the reader about medical subjects. A property of these subjects is that one usually tries to prevent them, or wants to know how to treat them; exceptions and other special cases are also of interest to the reader. This causes the text to contain many cause-and-effect constructions, and it can be expected that annotated text contains a significant number of relations based on this property. An example:

De snelheid waarmee parodontitis zich ontwikkelt, varieert enorm, zelfs bij mensen die ongeveer evenveel tandsteen hebben. Dat komt waarschijnlijk omdat hun tandplak andere soorten en verschillende hoeveelheden bacteriën bevat en omdat mensen verschillend reageren op de bacteriën.
ENG: The speed at which paradontitis develops is highly variable, even in people who have roughly the same amount of tartar. This is probably because their plaque contains other kinds and different amounts of bacteria, and because people react to the bacteria in different ways.

In the above example, one can spot the following relations: Concession, Non-Volitional Cause and Sequence. The second sentence gives a cause for the fact that the speed of the development of paradontitis varies. There are multiple causes, so a Sequence can be used. Furthermore, a concession is made with the part "even in people who have roughly the same amount of tartar".
A possible annotation would thus be:

[De snelheid waarmee parodontitis zich ontwikkelt, varieert enorm,]25A [zelfs bij mensen die ongeveer evenveel tandsteen hebben.]25B [Dat komt waarschijnlijk omdat hun tandplak andere soorten en verschillende hoeveelheden bacteriën bevat]25C [en omdat mensen verschillend reageren op de bacteriën.]25D

Figure 5.4: The RST-tree of the Paradontitis example

5.2.2 Other Medical Texts

This research is performed with the Merck Manual, but more Dutch medical sources exist. Below, the differences between a professional medical encyclopedia, the Winkler Prins, and a public one, the Wikipedia, are described. The first difference is the way the texts are organized: while the Merck Manual and the Wikipedia articles are narrative, the Winkler Prins texts consist of keywords with (short) explanations. A typical text from the Winkler Prins encyclopedia is the following:

Gipsverband Genor Omhulling met een in gips gedrenkt verband, toegepast o.a. voor het onbeweeglijk maken van gebroken ledematen, zieke gewrichten en operatief behandelde misvormingen. Daarnaast worden gipsverbanden ook wel toegepast als rustgevend verband bij uitgebreide verwondingen (closed plaster technique). Er wordt onderscheid gemaakt tussen gewatteerde gipsverbanden en ongewatteerde gipsverbanden. Ze kunnen worden gebruikt in de vorm van spalken die 2/3 deel van de omtrek van het te behandelen lichaamsdeel omvatten, of in de vorm van circulair gips, waarbij het lichaamsdeel volledig wordt omhuld.

The texts of the Wikipedia are similar to those of the Merck Manual. This can also be shown with the degree of coherence, expressed in the number of conjunctions each source has. In tables 5.1 and 5.2, the conjunction numbers of the Winkler Prins and the Wikipedia articles are shown, respectively.

Name      Conjunctions  Total Words  Percentage
Winkler0  6             72           8.33 %
Winkler1  4             74           5.41 %
Winkler2  9             114          7.89 %
Winkler3  5             83           6.02 %
Winkler4  9             154          5.84 %

Table 5.1: Conjunction numbers of the Winkler Prins Encyclopedia articles

Name   Conjunctions  Total Words  Percentage
Wiki0  21            189          11.11 %
Wiki1  8             123          6.50 %
Wiki2  12            131          9.16 %
Wiki3  11            255          4.31 %
Wiki4  5             142          3.52 %
Wiki5  18            305          5.90 %
Wiki6  17            204          8.33 %
Wiki7  34            453          7.51 %
Wiki8  13            121          10.74 %
Wiki9  14            151          9.27 %

Table 5.2: Conjunction numbers of the Wikipedia articles

While the number of conjunctions in the Winkler Prins texts is relatively low, the Wikipedia articles equal the Merck articles. Furthermore, the Wikipedia articles are written less 'professionally' than the other encyclopedias: while a professional encyclopedia like the Merck Manual is reviewed by doctors and medical specialists, anyone can add text to a Wikipedia article. As a result the texts vary in coherence. This can also be seen in the conjunction numbers of the Wikipedia articles, which range from 3.52% to 11.11%, while the conjunction numbers of the Merck Manual vary from 3.77% to 9.88%. While these upper and lower values do not differ very much, most of the values of the Merck Manual range from 6% to 9%, while the percentages of the Wikipedia articles are spread more evenly.

5.3 Relations in Medical Texts

Since medical texts, like all texts, are written with a certain purpose, it is likely that some relations occur more often in medical texts than they do in other texts. Because medical texts tell us about diseases and (therefore) about prevention and cures, the occurrence of cause and result relations is expected, such as Non-Volitional Result and Non-Volitional Cause, while a relation like Interpretation might be expected less frequently. Secondly, it is useful to know whether some relations are relatively similar. The relative similarity of relations is useful for all text domains.
If relations are relatively similar, these relations can be grouped so that a smaller set of relations is required for the (automatic) annotation of these medical texts. To measure whether grouping is indeed possible, manually annotated texts of different annotators are compared. For this purpose a randomly chosen selection of 10 texts from the Merck Manual was gathered and annotated by four different annotators. These texts are annotated at sentence level, so each elementary discourse unit consists of a single sentence. For the annotations the original 24 relations as defined by Mann and Thompson are used. Although there are differences between the annotations, similarities are found as well. Differences occur in the structure of the resulting trees, and even where (a part of) the structure was similar, human annotators still added different relations. The first similarity concerns the relations used: the annotators actually used only a few of the original relations. In table 5.3, an overview of the used relations is shown. The numbers are the summed numbers of occurrences for each relation as found by the annotators.

Relation               Occurrences  Percentage
Elaboration            113          51.36 %
Non-Volitional Result  22           10.00 %
Non-Volitional Cause   12           5.45 %
Antithesis             11           5.00 %
Background             10           4.54 %
Concession             9            4.09 %
Contrast               7            3.18 %
Sequence               6            2.73 %
Circumstance           5            2.27 %
Joint                  5            2.27 %
Interpretation         4            1.82 %
Justify                4            1.82 %
Restatement            2            0.90 %
Volitional Cause       2            0.90 %
Volitional Result      2            0.90 %
Condition              1            0.45 %
Enablement             1            0.45 %
Evaluation             1            0.45 %
Motivation             1            0.45 %
Otherwise              1            0.45 %
Summary                1            0.45 %
Total                  220          100 %

Table 5.3: Total number of occurrences of the relations

The relations Evidence, Purpose and Solutionhood were never used. The relations Condition, Enablement, Evaluation, Motivation, Otherwise and Summary are used only once: only a single annotator used them, while the others did not agree.
This suggests an analytical error [MT88]. If these relations were omitted from the list, there would still be a coverage of 97.27%, while using only 15 of the original 24 relations. In table 5.4, the numbers of occurrences per annotator are shown. The numbers of relations differ per annotator, for which two reasons can be identified. First, in some annotated texts not all EDUs were actually connected. Secondly, the Sequence relation and the Joint relation are counted as single relations, while they can cover more than two EDUs. If a Sequence with n nuclei were counted as (n-1) relations, the number of Sequence relations would increase by 7. The number of Joint relations would not change, since it was not used to connect more than two nuclei.

Relation               P1  P2  P3  P4  Average
Elaboration            21  35  27  30  28.25
Non-Volitional Result  9   3   2   8   5.5
Non-Volitional Cause   4   1   2   5   3
Antithesis             0   0   3   8   2.75
Background             3   2   2   3   2.5
Concession             5   3   1   1   2.5
Contrast               2   5   1   0   2
Sequence               1   1   3   1   1.5
Joint                  0   4   0   1   1.25
Circumstance           3   0   0   1   1
Interpretation         1   1   2   0   1
Justify                1   2   1   0   1
Restatement            0   1   1   0   0.5
Volitional Cause       2   0   0   0   0.5
Volitional Result      2   0   0   0   0.5
Condition              1   0   0   0   0.25
Enablement             0   0   1   0   0.25
Evaluation             0   0   1   0   0.25
Motivation             1   0   0   0   0.25
Otherwise              0   0   1   0   0.25
Summary                0   1   0   0   0.25
Total                  56  59  48  58  55.25

Table 5.4: Number of occurrences of the relations per annotator

It is clear that Elaboration is the most-used relation: if every relation were tagged as Elaboration, a coverage of more than half would already be achieved. After annotation, the results are compared and it is noted which relations are often confused with each other. This comparison is based on relations between the same elementary discourse units: if two annotators added different relations between the same EDUs, the used relations are considered confused. The relations which are confused multiple times are grouped together.
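The confusion comparison described above can be sketched as follows. The toy annotations are invented for illustration; the real comparison was done on the ten annotated Merck texts.

```python
# Sketch of the confusion analysis: relations that two annotators
# assigned to the same EDU pair are compared, and differing labels
# count as a confusion of that relation pair.
from collections import Counter

def confusions(ann_a, ann_b):
    """Both arguments map an EDU pair, e.g. ('25A', '25B'),
    to the relation an annotator assigned to it."""
    counts = Counter()
    for edus in ann_a.keys() & ann_b.keys():   # same EDU pairs only
        if ann_a[edus] != ann_b[edus]:
            pair = tuple(sorted((ann_a[edus], ann_b[edus])))
            counts[pair] += 1
    return counts

p1 = {("25A", "25B"): "Concession", ("25C", "25D"): "Sequence"}
p2 = {("25A", "25B"): "Contrast",   ("25C", "25D"): "Sequence"}
print(confusions(p1, p2))  # Counter({('Concession', 'Contrast'): 1})
```

Relation pairs that come out of this counter multiple times are candidates for grouping under one covering relation.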
Table 5.5 contains an overview of the groups; each column lists relations that were confused with each other. The table shows that different kinds of relations can be confused: hypotactic relations with each other, but also hypotactic relations with paratactic ones. The + symbol after the Elaboration relation in table 5.5 denotes multiple instances; the Sequence relation can thus be confused with multiple Elaboration relations, as shown in figure 5.5.

Table 5.5: Confused relations
Contrast, Antithesis, Concession
Non-Volitional Result, Non-Volitional Cause, Volitional Result, Volitional Cause
Background, Elaboration
Evidence, Elaboration
Sequence, Elaboration+

Figure 5.5: Sequence vs. Elaborations

Table 5.5 supports the use of a smaller subset of relations. Since relations are confused, they can be grouped under a covering relation; when annotating, the group name of the relations can be used, and confused relations get the same group name. If two annotated texts are compared with each other, the use of the covering groups makes this task easier. One of the reasons for confusion is the fact that texts can be ambiguous: if two parts of a sentence are related by a cause-result relation, the choice between a result and a cause relation depends on which part is the nucleus and which the satellite. Other reasons are, for example, analytical errors or differences between annotators.

5.4 Medical Discourse Markers

The content of medical texts suggests that there might be special discourse markers in these texts that signal certain relations. If a text contains a sentence like:

De oorzaak van het tekort aan bloedplaatjes is een autoantistof tegen bloedplaatjes.
ENG: The cause of the shortage of blood platelets is an autoantibody against blood platelets.
the use of a Non-Volitional Cause relation is obvious, although there is no sign of a conjunction, adverb or preposition. While the marker "De oorzaak van" (the cause of) might be general, there are other markers which tend to be used more in medical texts, for example the word "symptomen", which in the example below might signal an Elaboration relation:

Andere symptomen zijn verwijde pupillen, kippenvel, tremoren, spierkrampen ... [text omitted]
ENG: Other symptoms are widened pupils, goosebumps, tremors, muscle cramps ... [text omitted]

The use of these markers for (automatic) annotation can be hard, since they are ambiguous. In the following example it would be better to segment at the discourse marker "maar" and assign a Concession relation instead of a Non-Volitional Cause:

De oorzaak van alcoholisme is onbekend, maar alcoholgebruik is niet de enige factor.
ENG: The cause of alcoholism is unknown, but the use of alcohol is not the only factor.

To test the use of these special discourse markers, a list of possible phrases is gathered. The lists contain verb and noun constructions; the selection is discussed below.

     1   2   3   4
1    27  *   *   *
2    21  24  *   *
3    19  16  21  *
4    18  15  16  20

Table 5.6: Conjunction numbers of the Nouns

5.4.1 Noun Constructions

To gather these phrases, four random selections of a hundred sentences each were extracted from the Merck Manual. These selections were checked manually and the possible discourse markers were noted. The selections contained between 20 and 27 noun constructions each, where no difference is made between the singular and plural form. The lists with possible noun constructions were then compared with each other. The results are shown in table 5.6; the numbers represent the number of shared noun constructions. For example, selections two and three contain 24 and 21 noun constructions respectively. They share 16 noun constructions.
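The selection comparison behind table 5.6 and the filtering step for the final list can be sketched as follows. The four candidate-noun sets are invented for illustration; the real ones came from the four Merck selections.

```python
# Sketch of the pairwise comparison of candidate-noun selections
# (cf. table 5.6) and of the "at least three out of four" filter.
# The four sets below are invented for illustration.
from collections import Counter

selections = [
    {"oorzaak", "symptoom", "behandeling", "diagnose"},
    {"oorzaak", "symptoom", "risico"},
    {"oorzaak", "behandeling", "risico", "complicatie"},
    {"symptoom", "behandeling", "diagnose"},
]

# Pairwise shared noun constructions (the lower triangle of the table).
for i in range(len(selections)):
    for j in range(i):
        print(i + 1, j + 1, len(selections[i] & selections[j]))

# Final list: nouns present in at least three of the four selections.
counts = Counter(noun for sel in selections for noun in sel)
final = sorted(noun for noun, c in counts.items() if c >= 3)
print(final)  # ['behandeling', 'oorzaak', 'symptoom']
```

The same intersection-and-threshold scheme carries over to the verb constructions of section 5.4.2.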
For the final list of noun constructions, all nouns appearing in at least three of the four selections are taken. These results are shown in table 5.7.

Aandoening      Disorder
Behandeling     Treatment
Afwijking       Deviance
Complicatie     Complication
Diagnose        Diagnosis
Gevolg          Result
Manier          Way
Methode         Method
(Genees)Middel  Cure
Onderzoek       Investigation
Oorzaak         Cause
Operatie        Operation
Probleem        Problem
Reactie         Reaction
Risico          Risk
Stadium         Stage
Symptoom        Symptom
Tekort          Shortage
Vorm (van)      Sort of
Verandering     Change
Vergroting      Increase
Verschijnsel    Phenomenon

Table 5.7: Possible Noun Constructions as discourse markers

After selecting the list of nouns, a statistical analysis was performed to see how often these nouns actually occur in the Merck Manual. The nouns were counted in their singular and plural forms. The entire Merck Manual contains about 40,000 sentences with 666,000 words. Table 5.8 shows the number of occurrences. To check whether these noun constructions actually signal a relation, and if so which relation exactly, a classification has to be made. This is explained in section 5.4.3.

5.4.2 Verb Constructions

The same four selections were checked for verbs, in a similar way as for the nouns. The number of verb constructions varied between 30 and 43. For the verb constructions, all inflections of the root of the verb were counted. The same pairwise comparison procedure as described for the nouns was applied to the verbs. Table 5.9 shows the numbers of shared verb constructions. For example, selection three contains 37 verb constructions and selection four contains 30; they share 23 verb constructions. Again, all verbs which appear in at least three of the four selections are selected. These
are shown in table 5.10.

Word          Singular  Plural
Aandoening        1032     497
Afwijking          147     359
Behandeling       1593     138
Complicatie         77     237
Diagnose           736       9
Geneesmiddel       445    1120
Gevolg             882     126
Manier             162      66
Methode            118      76
Middel             574     502
Onderzoek          781     253
Oorzaak            772     207
Operatie           394      39
Probleem           221     391
Reactie            259     129
Risico             431      36
Stadium            205      48
Symptoom           112    1689
Tekort             161      17
Verandering         48     239
Vergroting          45       0
Verschijnsel        19      46
Vorm (van)         622     496

Table 5.8: Numbers of the Nouns

       1    2    3    4
  1   43    *    *    *
  2   22   35    *    *
  3   28   22   37    *
  4   25   18   23   30

Table 5.9: Conjunction numbers of the Verbs

The Dutch verb Voorkomen has two different meanings: it can mean the same as Optreden (To Occur), but it can also be translated as "To Prevent". The meaning depends on the text in which the verb appears. The classification of the verbs, to check which relations they signal, is explained in section 5.4.3.

5.4.3 Signaled Relations

To be able to use the noun and verb constructions, it must be clear which relation they signal. Furthermore, it should be known in which situations they do or do not signal a relation. Therefore a classification has to be made for these markers.

Verb Constructions
Aantasten          To Affect
Afnemen            To Decrease
Beginnen           To Start
Behandelen         To Treat
Dalen              To Decrease
Gebruiken          To Use
Helpen             To Help
Herstellen         To Recover
Kenmerken (door)   To Characterize (by)
Krijgen            To Get
Leiden tot         To Lead to
Onderzoeken        To Investigate
Ontstaan           To Originate
Optreden           To Occur
Produceren         To Produce
Toenemen           To Increase
Toepassen          To Apply
Uitvoeren          To Carry Out
Vaststellen        To Find
Verbeteren         To Improve
Vermijden          To Avoid
Verminderen        To Decrease
Veroorzaken        To Cause
Verwijderen        To Remove
Voordoen           To Display
Voorkomen          To Prevent
Voorkomen          To Occur

Table 5.10: Possible Verb Constructions as discourse markers

In table 5.11 the relations which can be signaled by noun constructions are shown.
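The counting behind table 5.8 can be sketched as follows. This is a simplified illustration in Python (the thesis itself does not describe its counting script); the corpus snippet and the singular/plural pair are examples, while the real counts were taken over the whole Merck Manual:

```python
import re

# Sketch: count occurrences of a noun in singular and plural form,
# as done for table 5.8. Corpus and word pair below are toy examples.
def count_forms(text, singular, plural):
    words = re.findall(r"\w+", text.lower())
    return (words.count(singular.lower()), words.count(plural.lower()))

corpus = "De oorzaak is onbekend. Mogelijke oorzaken zijn stress en infecties."
counts = count_forms(corpus, "oorzaak", "oorzaken")  # (singular, plural)
```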
This table was generated by manually checking pieces of text containing an occurrence of the noun construction.

Noun          Relation
Behandeling   Elaboration
Gevolg        Non-Volitional Result
Oorzaak       Non-Volitional Cause
Stadium       Elaboration
Symptoom      Elaboration
Verschijnsel  Elaboration
Vorm          Elaboration

Table 5.11: Noun Constructions and the relations they signal

It appears that a great deal of the noun constructions which were hypothesized to signal relations in fact did not. Although the appearance of the word suggested the presence of a relation (mostly an Elaboration), the actual signaling was done by a word of a different type. This can be shown with the next example:

[Neurogene pijn is het gevolg van een afwijking ergens in een zenuwbaan.]27A [Een dergelijke afwijking verstoort de zenuwprikkels, die vervolgens in de hersenen verkeerd worden geïnterpreteerd.]27B
[Neurogenic pain is the result of an abnormality somewhere in a nerve tract.]27C [Such an abnormality disturbs the nerve signals, which are consequently misinterpreted in the brain.]27D

Here the Elaboration relation, which indeed holds, is signaled by the word "Dergelijke" instead of the word "Afwijking". If, for example, the second sentence had been "Een andere afwijking verstoort ...", the relation between the first and second sentence would be better described as a Concession. The indication that a relation holds between the sentences, however, still remains.

The procedure applied to the nouns was also applied to the verb constructions. This resulted in table 5.12.

Verb              Relation
Afnemen           Elaboration
Beginnen          Elaboration
Kenmerken (door)  Elaboration
Leiden tot        Non-Volitional Cause
Ontstaan          Elaboration
Optreden          Elaboration
Toenemen          Elaboration
Vermijden         Elaboration
Verminderen       Elaboration
Veroorzaken       Non-Volitional Cause
Voordoen          Elaboration

Table 5.12: Verb Constructions and the relations they signal

The table with verbs contains only a small part of the words originally selected.
Most of them signal Elaboration relations. This has the same reason as with the noun constructions: the verbs do hint that a certain relation is present, but if a word of a different type, for example a conjunctive adverb, is present, the relation is more likely to be signaled by that adverb.

5.4.4 Time Markers

Since the Merck Manual is a medical encyclopedia, it describes diseases, and these are often discussed in chronological order. For example, the effects of a treatment are described like this:

[Bij bloedverlies zal de hartfrequentie toenemen.]28A [Nadat de bloeding is gestopt, wordt er vocht vanuit de weefsels in de bloedsomloop opgenomen.]28B
[ENG: Loss of blood will increase the heart rate.]28C [After the bleeding has stopped, fluid from the tissues is absorbed into the bloodstream.]28D

Relations between sentences like these can be signaled by words which indicate a time flow. Examples are shown in table 5.13; these markers include combinations of words as well.

Eerst                    First
Nadat                    After
(Ten) Eerste / Tweede    First / Second
Tenslotte                Finally
Tijdens                  While
Uiteindelijk             Finally
Verder                   Further
Voordat                  Before

Table 5.13: Words which indicate a time flow

Time markers may indicate a relation with preceding sentences, although the exact relation is not determined by the marker itself. These markers of course appear in other texts as well, but since such descriptive patterns appear quite often in medical texts, they might be useful. Some of the time markers in table 5.13 signal strongly; others only indicate that a relation is present, and the actual relation is then most likely signaled by a word of a different type, such as an adverb. The time markers were found to signal a Non-Volitional Result or an Elaboration. In table 5.14 the words and the relation they signal are shown.
Eerst                 Elaboration
Ten eerste / tweede   Elaboration
Tijdens               Elaboration
Verder                Elaboration
Tenslotte             Non-Volitional Result
Uiteindelijk          Non-Volitional Result

Table 5.14: Time markers and their relations

Chapter 6

Automatic Recognition

In this chapter the automatic recognition of relations in Dutch medical texts is discussed. First the current state of automatic recognition is described, followed by a discussion of the use of rule-based annotation versus machine learning techniques. After that, the (separate) parts of the automatic recognizer developed during this research are described. Subsequently, the segmentation method and the creation of the sets of relations and discourse markers used by the recognizer are described and motivated. Next, the recognition of relations is described, and the assumptions and notions needed for it are discussed. The last three sections describe the algorithm used for recognition, the hierarchy applied for scoring the recognized relations, and a worked example of the process.

6.1 State of the Art

Currently, several studies on the automatic annotation of texts with Rhetorical Structure Theory have been performed. In [Mar97a] and [Mar97b], Marcu describes a method for the automatic rhetorical parsing of natural language texts. He presents four different paradigms, among which are a constraint satisfaction problem and a theorem proving problem. These are used to produce algorithms to solve the problem of text structure derivation, and to create a rhetorical parser. The recognizer produces all combinations found, and removes the trees which do not conform to the specification. Marcu takes discourse markers as the basis for indicating the presence of relations. This research was performed for the English language only. In [HHS03], a chunk parser [Abn91] is used with a feature-based grammar. This approach is based on the use of discourse markers as well, and is intended for English texts.
Methee Wattanamethanont et al. used a Naive Bayes classifier for the automatic recognition of rhetorical relations in Thai [WSK05]. To train the classifier they used a corpus of 2850 Elementary Discourse Unit pairs, split in a 90%/10% training/testing ratio. The features used by the classifier are discourse markers, key phrases and word co-occurrence. This approach is similar to the one in [ME02], where Marcu trains a Naive Bayes classifier to recognize four types of relations between arbitrary sentences, even if no discourse marker is present. The classifier used by Marcu is trained on two corpora with more than 40 million sentences, from which different training sets, each containing millions of sentences, were extracted. The PhD thesis of Corston-Oliver [CO98] describes the Rhetorical Structure Theory Analyzer (RASTA). For a certain relation, the program tests for conditions, such as the ordering of the clauses the relation will relate, and afterwards checks whether a marker is present at all. Such conditions are similar to the conditions for the relations shown in appendix A. Possible relations receive a heuristic score. RASTA generates no invalid trees, and thus does not need to validate the generated trees afterwards. The ConAno tool developed by Stede and Heintze [SH04] can be used for assistance when manually annotating text with rhetorical relations. It outputs files in the .rs3 format used by RSTTool [O'D00]. ConAno uses a discourse marker lexicon, with the relations the markers can signal. The tool does not make decisions about which relation to use, but gives hints to the annotator. The tool was originally developed for German, but with a different lexicon it can be applied to other languages as well. The creation of ConAno was inspired by the work performed for the creation of the Potsdam Corpus [Ste04].
This is a corpus of German newspaper articles, annotated with multiple information types, such as rhetorical structure, part-of-speech, and co-reference. Different corpora of rhetorically annotated texts have been developed, such as the RST corpus [CMO03] by Carlson et al. This corpus consists of 385 Wall Street Journal articles from the Penn Treebank [MMS93]. Another corpus that originated from the Penn Treebank is the Penn Discourse Treebank (PDTB) [MRAW04b]. The PDTB contains large texts in which the discourse connectives are annotated along with their arguments. This corpus contains no RST annotations.

6.2 Automatic Annotator

For the automatic annotation of texts, a program has been written which consists of three main elements: the segmenter, the recognizer and the tree-builder. Each of the elements is written in the Perl¹ programming language. Together the parts build a tree from the input text and store it in a .rs3 file. These .rs3 files can be imported into RSTTool and edited afterwards. The main parts are discussed separately below.

6.2.1 The Segmenter

The first element is called the segmenter. This part takes text as input and segments it into non-overlapping spans of text, the elementary discourse units, which will be used to recognize the relations. This is a fairly simple process, which just breaks the text at certain points. The segmentation process is explained in section 6.3. If a text with multiple sections is used as input, these sections are treated as if they were one.

¹ http://www.perl.com

6.2.2 The Recognizer

In this assignment, a combination of rule-based recognition and machine learning is used to recognize relations. Due to a lack of large quantities of annotated data, machine learning on the scale of the research presented in section 6.1 could not be applied. The recognizer is equipped with lists of discourse markers and the relations they can signal.
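The three-part pipeline described above can be sketched as follows. The thesis implementation is written in Perl; this Python version, and all function names in it, are illustrative stand-ins only (the toy recognizer here just links adjacent EDUs, which the real recognizer of course does not do):

```python
import re

# Sketch of the segmenter -> recognizer -> tree-builder pipeline of
# section 6.2. Hypothetical names; the real program is written in Perl.
def segment(text):
    """Split text into sentence-level EDUs at sentence boundaries."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def recognize(edus):
    """Toy recognizer: link adjacent EDUs with a default relation tuple
    in the (edu1, edu2, role, likelihood, relation) form used later."""
    return [(i, i + 1, "N", 10, "elaboration") for i in range(len(edus) - 1)]

def build_tree(edus, relations):
    """Stand-in for the .rs3 tree-builder: bundle segments and relations."""
    return {"segments": edus, "relations": relations}

edus = segment("Zin een. Zin twee.")
tree = build_tree(edus, recognize(edus))
```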
For each rule type a likelihood estimate is used. These rules are static: the recognizer can change neither the markers nor the parameters; these values can only be changed by a user. The recognizer takes the Elementary Discourse Units produced by the segmenter and produces two lists of relations between the segments. The first list is an overview of all possible relations, giving the segments each relation connects, the type of the relation, and its likelihood. The second list is a subset of the first list and contains the nodes of the most likely tree that is automatically recognized. The list contains relations in the following form:

[3 4 N 100 elaboration]

This example shows an Elaboration relation between EDU number 3 and EDU number 4, where the first one is the nucleus, as denoted by the letter N. The number 100 expresses the likelihood that this relation actually holds; this will be explained in section 6.6. A different relation is:

[2 3 S 30 concession]

The relation above connects EDU number 2, the satellite (denoted by S), to EDU number 3, the nucleus; it is labeled as a Concession relation with a likelihood of 30.

6.2.3 The Tree-Builder

The tree-builder takes the second list of the recognizer, with the final relation data. It builds the actual tree from the data and stores it as an .rs3 file. All the data is added to the result; if no relation is found for a certain segment, it is added without any information. This .rs3 file can be read with RSTTool and edited afterwards. Missing relations, or falsely recognized relations, can thus be corrected.

6.3 Segmentation

Although the actual segmentation is just a small portion of the recognition work, it is an important part. The segmenter provides the data the recognizer works with: it creates the building blocks of the tree that is to be produced. Therefore one must be sure which rules to use for segmentation. There are two different segmentation approaches.
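The relation record format shown above is simple enough to parse mechanically. A minimal sketch in Python (the parser and its field names are hypothetical; only the "[3 4 N 100 elaboration]" format itself comes from the thesis):

```python
# Sketch: parse relation records of the form "[3 4 N 100 elaboration]".
# Field names are illustrative; the format is the one shown above.
def parse_relation(record):
    a, b, role, score, name = record.strip("[]").split()
    return {
        "edu1": int(a),
        "edu2": int(b),
        "role_of_edu1": role,       # N = nucleus, S = satellite
        "likelihood": int(score),
        "relation": name,
    }

rel = parse_relation("[3 4 N 100 elaboration]")
```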
The first approach is to use complete sentences as Elementary Discourse Units. For this segmentation there is no need for knowledge about the text: it is sufficient to search for sentence boundaries and segment at these points. Every full stop is considered a segment boundary. Since this segmentation is performed merely on sentence boundaries, errors might occur if, for example, abbreviations with dots followed by a capital come up, as in "I went to see dr. Jones". However, such cases did not occur in the tests.

The second approach is to segment within sentences as well. This is a much harder task, since it is error-prone: knowledge about the sentence is needed to be able to segment correctly. In section 3.2 an example from the Merck Manual is described, in which the sentences were manually split at certain points. The points were chosen because at these points that part of the sentence could be related to the previous part. Whereas for segmentation between sentences it is sufficient to spot sentence boundaries, no such lexical information is available for segmentation within sentences. Although the comma in a reasonable number of cases can indeed be used to segment, not all commas are suited.

The automatic recognizer uses the first approach. This was chosen because the recognition of relations between sentences is harder than within sentences. Furthermore, this prevents differences between a manual and an automatic segmentation of a text, which ensures that a comparison such as the one in chapter 7 can be made.

6.4 Defining the Relation Set

As described in chapters 4 and 5, there is a certain subset of the original relations by Mann & Thompson whose members do not occur often (in the used data). Furthermore, another (partially overlapping) subset of relations is seldom signaled. Third, certain relations are confused with each other during annotation.
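The first segmentation approach, including a guard for the "dr. Jones" pitfall mentioned above, can be sketched as follows. This is an illustrative Python sketch, not the Perl segmenter itself; the abbreviation guard is an extension of my own, since the thesis notes that such cases did not occur in the tested data:

```python
import re

# Sketch of sentence-level segmentation: every full stop (or ! / ?)
# followed by whitespace ends an EDU, except after the abbreviation
# "dr." -- an illustrative guard, not part of the thesis segmenter.
def segment_sentences(text):
    parts = re.split(r"(?<!\bdr\.)(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

segment_sentences("Ik ging naar dr. Jones. Hij hielp mij.")
```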
These observations are a reason to suggest using only a (small) subset of relations in the automatic recognition process. Furthermore, a smaller list of relations simplifies the recognition process. To gather this relation set, the reasons given above are taken into account. Some relations are hardly used; for example, in the annotated data there are 6 relations which were found only once, and 3 relations which were never used, as shown in table 5.3. Furthermore, such relations can be confused with a relation which does occur often, so the seldom used relations are omitted from the relation set. A drawback of this smaller list is a decrease in coverage. A reason for confusion can, for example, be the signaling of two relations by the same discourse marker. This approach to defining a set of relations is also used in [Kno93]. Since relations such as Concession and Antithesis can be signaled by the same marker, they can be grouped; for example, the conjunction Maar can signal a Contrast, Antithesis or Concession. The definitions of the relations shown in appendix A can be used to group some relations. In [MAR99] it is argued that in the cases where an Evidence relation was used, an Elaboration relation would also fit, since the latter relation is less specific than the former. The definitions of the three relations grouped as a Concession differ only slightly. The Elaboration relation definition covers most other relations. This is partly because of the definition of the Elaboration relation, and partly because human annotators tend to confuse other relations with the Elaboration relation. The Non-Volitional Cause and Non-Volitional Result cover their volitional counterparts. The only difference between a Non-Volitional Cause and a Volitional Cause is that the cause must be volitional for the second relation. However, a discourse marker like "Daardoor" gives no clue whether a cause is volitional or not.
Since the (Non-)Volitional Result and the (Non-)Volitional Cause differ more from each other, they are added as different groups, although they did get confused by human annotators, as is shown in section 5.3.

Elaboration
(Non-)Volitional Cause
(Non-)Volitional Result
Concession

Table 6.1: List of Relations to use

These notes result in the use of the relations shown in table 6.1. The Elaboration relation covers most of the occurring relations; together with the other three relations, the coverage is 88.2%.

6.5 Used Discourse Markers

For the automatic recognition, use is made of different kinds of discourse markers: conjunctions, adverbs, pronouns, medical discourse markers and implicit markers. For each kind of discourse marker, restrictions and the way the markers are used to recognize relations are defined. The discourse markers are discussed below. In some cases discourse markers cannot be used to recognize relations; in those cases keyword repetition is applied with the use of Alpino [BvNM01].

6.5.1 Conjunctions

The first discourse marker type used is the conjunction. In chapter 4, a list of conjunctions and their relations is presented. This list is used for the recognition of relations between elementary discourse units. For the actual recognition, the assignment of the precise relation and its likelihood, it is necessary to know how and when the relation can be added. The first note about conjunctions is that they usually signal a relation between elementary discourse units which belong to the same sentence. Take for example the conjunction Zodra: if it occurs inside a sentence, it is likely to signal a relation, as it does in the next example:

Ik bel je, zodra de uitslagen bekend zijn.
ENG: I will call you, as soon as the results are known.

However, it is possible that a conjunction signals a relation between sentences as well.
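The grouping of relations into the four covering groups of table 6.1 can be sketched as a simple lookup. This dictionary is an illustrative reconstruction based on the confusions discussed in chapter 5, and the fallback to Elaboration reflects the remark that the Elaboration definition covers most other relations; it is not the thesis code:

```python
# Sketch: map fine-grained relations onto the covering groups of
# table 6.1. Illustrative reconstruction, following the confusion
# groups of chapter 5; falling back to Elaboration is an assumption.
GROUPS = {
    "contrast": "concession",
    "antithesis": "concession",
    "concession": "concession",
    "background": "elaboration",
    "evidence": "elaboration",
    "sequence": "elaboration",
    "elaboration": "elaboration",
    "volitional-cause": "non-volitional-cause",
    "non-volitional-cause": "non-volitional-cause",
    "volitional-result": "non-volitional-result",
    "non-volitional-result": "non-volitional-result",
}

def group(relation):
    """Return the covering group; default to Elaboration otherwise."""
    return GROUPS.get(relation, "elaboration")
```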
This is because it is possible to split a sentence at a conjunction (with some rewriting), creating two sentences which still belong together. This can be done to prevent the creation of long sentences while writing, although the result is rather childish. Take for example the short sentence:

Henk krijgt een cadeau, hoewel hij niet jarig is.
ENG: Henk receives a present, although it is not his birthday.

It can be split into:

Henk krijgt een cadeau. Hoewel hij niet jarig is.

The relation between both sentences is still the same. So it is possible to use conjunctions for the recognition of relations between sentences as well. The constraint, however, is the placement of the conjunction: it must appear as the first word in the second sentence, because if it occurs elsewhere, it is more likely to signal a relation within the sentence rather than between sentences. Even though rewriting is possible, it is not expected to be very common in an encyclopedia; in the checked data no occurrence of an actual rewrite was found, so it was decided that the recognizer will not use conjunctions for the recognition of relations between sentences. Conjunctions can also appear at the beginning of a sentence signaling a relation within that sentence, as the following example shows:

Zodra ik de lotto win, gaan we op vakantie.
ENG: Once I win the lottery, we will go on vacation.

This shows that conjunctions are also ambiguous as to whether they signal relations between sentences or within them. Therefore their use in the recognizer is omitted, since it is intended to find relations between sentences rather than within them.

6.5.2 Adverbs

The second type of discourse markers are the (conjunctive) adverbs. A selection of discourse markers of this type is also presented in chapter 4. Adverbs, like conjunctions, signal relations both within sentences and between sentences.
Unlike conjunctions, however, adverbs also signal relations between sentences when they are not the result of a rewrite, i.e. when they do not appear at the start of the second sentence. Take for example the word Vervolgens (Next). In both of the following texts, the word signals an Elaboration:

De receptie begint om acht uur 's avonds. Vervolgens is er een borrel in zaal drie.
ENG: The reception starts at 8 PM. Next there is a drink in room three.

De receptie begint om acht uur 's avonds. Er is vervolgens een borrel in zaal drie.
ENG: The reception starts at 8 PM. There is a drink next in room three.

Adverbs which start a sentence are most likely to signal a relation with another sentence instead of a relation within the sentence.

Alle   Bepaalde   Dat   Dergelijke   Deze   Dit   Elk(e)   Sommige   Verschillende   Ze   Zulke

Table 6.2: (Demonstrative) Pronouns

6.5.3 Pronouns

Pronouns are the third type of discourse markers. A small list of the pronouns used is shown in table 6.2. Pronouns can signal an Elaboration relation between sentences if used at the start of the relating sentence. They can also signal a relation if used in the middle of a sentence, but the likelihood is lower, since another relation could be more suited; such a relation is probably signaled by a different marker in that case. To show the difference in relation assignment due to pronouns, the next two examples are presented:

Criminelen horen in de cel. Het kan zijn dat dergelijke mensen er reeds zitten.
ENG: Criminals belong in jail. It is possible that people like those are already there.

Criminelen horen in de cel. Toch kunnen sommige criminelen beter naar een TBS-inrichting.
ENG: Criminals belong in jail. Still, some criminals would be better off in a TBS facility.

While the first example can be related with an Elaboration, the second example needs a Concession relation to be connected properly. This indicates that the pronouns are overruled in importance by other discourse markers.
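The position heuristic for adverbs and pronouns described in these sections can be sketched as follows. The marker lists and the concrete scores (100 for sentence-initial, 30 for mid-sentence) are illustrative examples, not the thesis parameters:

```python
# Sketch of the position heuristic: a marker at the start of a sentence
# signals a between-sentence relation strongly; a mid-sentence marker
# only hints that some relation is present. Lists and scores are examples.
ADVERBS = {"vervolgens", "daardoor", "toch"}
PRONOUNS = {"deze", "dit", "dergelijke", "sommige"}

def marker_scores(sentence):
    words = sentence.lower().rstrip(".").split()
    scores = []
    for pos, word in enumerate(words):
        if word in ADVERBS or word in PRONOUNS:
            scores.append((word, 100 if pos == 0 else 30))
    return scores

marker_scores("Vervolgens is er een borrel in zaal drie.")
```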
The pronouns signal the existence of a relation, but to determine which relation is to be added, other discourse markers are necessary. For the automatic recognition, pronouns are only used as an explicit marker for an Elaboration relation if they are present at the start of a sentence. If they are present in the middle of a sentence, they are at best used as an indication that a relation is present. This is because these pronouns can also signal a relation within the sentence, as is shown in the following extracts from the Merck Manual, where the word "Dit" signals the relations.

[Voor elke vitamine is de aanbevolen dagelijkse hoeveelheid (ADH) vastgesteld.]29A [Dit is de hoeveelheid die een gemiddelde persoon dagelijks nodig heeft om gezond te blijven.]29B
[ENG: For every vitamin, the reference daily intake (RDI) has been determined.]29C [This is the amount an average person needs daily to stay healthy.]29D

[Wanneer het lichaam zowel overtollig vocht als natrium kwijtraakt of opneemt,]30A [kan dit zowel het bloedvolume als de natriumspiegel beïnvloeden.]30B
[ENG: When the body loses or absorbs both superfluous fluid and sodium,]30C [this can influence both the blood volume and the sodium level.]30D

6.5.4 Domain Specific Discourse Markers

The medical discourse markers are the fourth type of discourse markers used. Two kinds of medical discourse markers are used: the noun constructions and the verb constructions. As shown in chapter 5, there are some medical words which might indicate the presence of a relation. The actual relation in those cases is more likely to be signaled by a different discourse marker. For example:

John's body reacts strangely to the medicines. These medicines increase the symptoms.

The second sentence of the example contains the words "Symptoms" and "Increase". These were found to be able to signal an Elaboration relation.
That relation can also be signaled by the word "These", which is a stronger signal since it is a pronoun. This is explained in section 6.8. For a subset of the list, the relation itself is signaled more clearly; these markers all signal a (Non-)Volitional Cause/Result.

6.5.5 Relation Markers

During this research, some words and constructions were found which indicate a specific relation, but do not belong to any of the types discussed so far. An example of such a word is "niet", which can sometimes be used to detect a Concession relation: this word negates a sentence, signaling a different view on facts presented before. Other words clearly elaborate on a subject, like the word "Voorbeelden". In table 6.3, a small list of relation markers and the relations they might signal is presented.

Word           Relation
Andere         Concession
Geen           Concession
Niet           Concession
Nooit          Concession
Tegenover      Concession
Tegenstelling  Concession
Aanwijzingen   Elaboration
Mogelijkheden  Elaboration
Voorbeelden    Elaboration

Table 6.3: Relation Markers

These relation markers should be used with care, since there are examples in which these words do not signal a relation. If these words are found at the start of a sentence, the signaling was found to be stronger than when the word resides within a sentence. When a relation marker is found within a sentence, the relation is usually signaled by a different marker.

6.5.6 Adjectives

A different word type that might signal an elaboration on a certain topic is the adjective. Adjectives can be used to elaborate on a specific variant of the subject mentioned before. For example the following text:

Een afwijking aan het afweersysteem is gevaarlijk. Een grote afwijking kan dodelijk zijn.
ENG: An anomaly of the immune system is dangerous. A large anomaly can be fatal.

There are, however, many adjectives, and a large part of them are not used to elaborate on a previous statement.
Furthermore, relations between such sentences can also be found with the use of keyword repetition, which is discussed below. The recognizer therefore makes no use of adjectives.

6.5.7 Implicit Markers

Since not all relations are signaled by discourse markers, a way of recognizing relations without them is necessary. An attempt at annotating this sort of relations is carried out in [MRAW04a]. Implicit markers can be identified in texts which do not embed an explicit relation. In this assignment, all word types used by the recognizer are considered explicit markers. An example of two sentences which embed an implicit marker is:

Een eik laat zijn bladeren vallen in de winter. Bomen hebben in de winter geen voedsel over voor bladeren.
ENG: An oak drops its leaves in winter. Trees cannot spare food for leaves in winter.

This example embeds a relation between both sentences; the relation could for example be a Non-Volitional Result or an Elaboration. The implicit marker in this example is the word "want" (because). While the exact recognition of the relation is very hard, recognizing the mere presence of a relation is interesting as well. In the preceding example, the existence of a relation between the two sentences can be derived from different sources. First of all, since an oak is an element of the set of trees, the presence of a relation is indicated. But for an automatic recognizer this world knowledge must be made explicit. This can be done with the help of a thesaurus. A thesaurus, sometimes referred to as an ontology, is a set of (textual) data; it does not define words, but stores relations between words, such as synonyms, homonyms, et cetera. A well-known example of a thesaurus is WordNet [Mil95]. Secondly, the first sentence states something about leaves and winter, and these subjects appear in the second sentence as well.
This repetition of keywords might be an indication of the presence of a relation, although it does not give any clue about which relation connects the sentences. To be able to use this method, the keywords must be selected. It may be clear that words like "The" and "In" are not suited to be marked as keywords; the word type with the highest probability of being a keyword is the noun. To find the nouns in a Dutch text, Alpino [BvNM01] is used. A Part-of-Speech (POS) tagger could be used for this as well. The approach with the repetition of keywords is used for the recognition of distance relations. Sentences which contain the same nouns are attached to each other. The added relation is always the Elaboration relation, and it is connected in a left-right ordering: the first sentence is selected to be the nucleus and the second the satellite. This principle is illustrated with the following extract:

[Ongeveer 15% van alle schildkliercarcinomen bestaat uit folliculair carcinoom en komt vooral voor bij ouderen.]31A [Folliculair carcinoom komt ook meer voor bij vrouwen dan bij mannen, maar net als bij papillair carcinoom is bij mannen de waarschijnlijkheid groter dat de knobbel kwaadaardig is.]31B [Folliculair carcinoom is veel agressiever dan papillair carcinoom en verspreidt zich vaak door de bloedbaan.]31C [Hierdoor ontstaan uitzaaiingen in verschillende delen van het lichaam.
]31D [De behandeling van folliculair carcinoom bestaat uit zo volledig mogelijk operatieve verwijdering van de schildklier en vernietiging van eventueel overblijvend schildklierweefsel en van de uitzaaiingen met behulp van radioactief jodium.]31E
[ENG: About 15% of all thyroid carcinomas consists of follicular carcinoma, which mainly occurs in the elderly.]31F [Follicular carcinoma also occurs more often in women than in men, but just as with papillary carcinoma, in men the probability is higher that the lump is malignant.]31G [Follicular carcinoma is much more aggressive than papillary carcinoma and often spreads through the bloodstream.]31H [Therefore secondary tumors arise in different parts of the body.]31I [The treatment of follicular carcinoma consists of the most complete possible surgical removal of the thyroid and the destruction of any remaining thyroid tissue and of the secondary tumors with the help of radioactive iodine.]31J

This piece of text consists of five sentences. There are just a few words which might indicate a relation. The most obvious word is "Hierdoor", in sentence four. It indicates a relation between sentences three and four. The rest of the sentences contain only some weaker markers, so the recognizer cannot link them. However, since each sentence is about some aspect of Folliculair Carcinoom (Follicular Carcinoma), this can be used for linking the sentences. The recognizer uses Alpino to find the nouns in the sentences. After this, the nouns are compared and the sentences which embed similar nouns are connected. The comparison is done as follows: the first sentence is compared to the second; if a match is found, a relation is added. After this, the first sentence is compared to the third. If no relations are found, the process starts again from the second sentence, which is compared to the third and fourth. EDU 31C is omitted from this process since it is already connected to EDU 31D.
Because each sentence is about the Folliculair Carcinoom, each sentence will be linked to the first sentence. The result will thus look like figure 6.1.

Figure 6.1: The Folliculair Carcinoom example

One remark about this approach is that Alpino makes errors, which in turn cause the recognizer to make mistakes. Furthermore, the plural and singular forms of the same noun do not count as a match. This problem could be solved with the use of a stemmer, for example the Porter stemmer [Por80]. Another remark is that word types other than nouns could be used to link sentences as well.

6.6 Recognizing Relations

The recognition consists of the attribution of scores to relations between elementary discourse units. For this, some prior assumptions are made:

1. The text to annotate is actually coherent.
2. The most important piece of text is usually presented first.
3. Texts are from the Dutch medical domain.

The assumption that the text indeed is coherent is an important one. It is possible to write a nonsense text which embeds discourse markers that would indicate a certain relation if they were used in a normal text fragment. For example:

Jake is een aparte man. Daardoor is de kat groen.
ENG: Jake is a strange guy. Therefore the cat is green.

In this example, the recognizer would assign a relation between these sentences, since the word "Daardoor" (Therefore) indicates a relation.

The fact that most texts indeed state the most important information first is useful both for assigning scores to a relation and for deciding which part of a hypotactic relation is most likely to be the nucleus. This fact can be grounded with the following numbers from table 6.4, extracted from the texts annotated during this assignment.
In 81.3 % of the cases the nucleus precedes the satellite. Only relations between sentences are counted.

Ordering   Percentage
N-S        81.3 %
S-N         5.3 %
Equal      13.5 %

Table 6.4: Relation Ordering Percentages

The first approach for recognition is between full sentences. Recognition between sentences is harder than between EDUs within a single sentence, because pieces of information presented in a single sentence usually relate to each other. Furthermore, it is rare that an EDU from a sentence has a relation with a part outside the sentence, unless it is the nucleus of that sentence. The recognition of relations between sentences is also hard because a relation may hold between sentences which are a great distance apart. The main part, however, consists of relations between adjacent sentences. Results of the analysis of the annotated data are presented in table 6.5. The table contains the percentages of relations between adjacent sentences for four annotators. About 60% of the relations are defined between adjacent sentences.

Annotator   Percentage
P1          66.67 %
P2          61.29 %
P3          64.71 %
P4          55.42 %

Table 6.5: Relation numbers between adjacent sentences

The last assumption is that the texts which are to be annotated are from the medical domain in Dutch. Although Dutch texts from other genres can be annotated as well, English texts for example cannot. The recognizer uses word types which were specifically gathered for the medical domain; texts from other domains therefore benefit less from these types.

To be able to find the relations in a text, the following approach is used. The segmented text is checked for relations. The recognizer assigns a score, based on the discourse markers found in the sentences, which represents the likelihood that the relation holds between the EDUs. The type of the relation is recorded as well. If the text contains three sentences, a possible outcome is the following:
[1 2 N 80 nonvolitional-cause]
[1 2 N 60 elaboration]
[2 3 N 30 nonvolitional-cause]
[2 3 N 90 elaboration]
[2 3 S 10 concession]
[1 3 N 40 elaboration]

In the preceding example, two relations were found between EDUs one and two, three relations between two and three, and one between the first and the third. The most probable relations are kept and the rest discarded. Since a satellite can only be added to one nucleus, conflicts are solved by keeping the highest-ranked relations. In the example above, sentence 3 can be added to the first and to the second sentence. It is added to sentence 2, since that relation is scored 90, while the relation between 1 and 3 is scored 40. This results in:

[1 2 N 80 nonvolitional-cause]
[2 3 N 90 elaboration]

This can be represented graphically as in figure 6.2.

Figure 6.2: Tree view of the example

6.7 Recognition Algorithm

This section describes the recognition algorithm used by the recognizer. The recognizer works according to the assumptions described in section 6.6, extended with the requirement of a segmented input. The recognizer receives a list of EDUs, which are to be connected. It uses the following algorithm. The list of EDUs is checked for relations per pair. The first pair to check consists of the first and the second EDU in the list. A score is added for each relation possible between these EDUs. After that, the pair consisting of the second and third EDU is checked for a relation. This results in a list of relations between adjacent EDUs. The most probable relation between two EDUs is kept, and all relations between the same EDUs with a lower probability are discarded. After this step, the recognizer has to find the long-distance relations, which cannot be found in this first pass. This is shown in figure 6.3.
Figure 6.3: A relation between non-adjacent EDUs

The example in figure 6.3 contains three EDUs; the first cycle checks only A with B and B with C. It does not compare A with C. The comparison of A with C is done afterwards. To find the long-distance relations, a new list is formed, containing the nuclei of the relations and the EDUs which are not connected to another EDU. This long-distance relation scoring is repeated until all EDUs are connected. Because the text is assumed to be coherent, no EDU should be left without a relation to another EDU.

Some remarks about this approach are the following. Once an EDU is connected as a satellite to a nucleus, it is not possible to reconnect it to another one, for example through a long-distance relation. This is illustrated with the following example.

Figure 6.4: Two EDUs connected to the same EDU

Figure 6.5: EDU C is connected to A through B.

In figure 6.4, the EDUs B and C are both connected with an Elaboration relation to EDU A. In figure 6.5, C is connected to B, which is connected to A. If a relation between B and C is made, the long-distance relation which holds between A and C will never be found, since C is then considered a satellite. Two possible solutions for this problem exist. The first solution is using a minimum likelihood necessary to confirm a relation. This prevents the adding of relations which are not signaled strongly and could therefore be incorrect. If a connection between B and C is not made because its likelihood is below the threshold, the recognizer will check between A and C afterwards. The second solution is to check all possible relation tuples.
However, for large texts, the time complexity of checking all possible relations could be a problem, since for n EDUs the number of checks is (n² − n)/2, while with growing distance between EDUs the likelihood of a relation decreases. Therefore the first solution is used. The scoring of relations is explained in the next section, while the effect is evaluated in chapter 7. Figure 6.6 displays the used algorithm in pseudocode.

for each tuple of adjacent EDUs (a,b)
    compare (a,b)
for each EDU pair
    select relation with MAX likelihood
    where likelihood > threshold
create new list with nuclei and non-assigned nodes
for each tuple of list (x,y)
    compare_distant (x,y)
for each EDU pair
    select relation with MAX likelihood
    where likelihood > threshold
join all unconnected subtrees

Figure 6.6: Algorithm in Pseudocode

6.8 Scoring Hierarchy

The scoring of the relations is based on different aspects. A difference is made between the importance of relations and the strength of the used discourse markers. To create the hierarchy, texts annotated by human annotators are used. The EDUs which are connected with a relation added by a human annotator are checked for the existence of a discourse marker. If a discourse marker is present, it is analyzed whether or not the marker signaled the relation. To measure whether the relation was indeed signaled by the discourse marker, an approach quite similar to the process described in [KS98] is used: if substituting the discourse marker with a different one would change the relation, the discourse marker was the reason the relation existed. The analysis of the texts for the strength of discourse markers is performed for all the word types, so that a hierarchy of the word types, and the corresponding placement of the discourse markers in the sentences, could be made.
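The pseudocode of figure 6.6 can be rendered as a runnable sketch. This is a minimal illustration, not the thesis implementation: the scoring functions `compare` and `compare_distant` and the threshold value are assumed placeholders for the marker-based scoring described in section 6.8, and relations are represented as (nucleus, satellite, score, label) tuples.

```python
# Minimal sketch of the recognition loop from figure 6.6.
# `compare`, `compare_distant` and `threshold` are hypothetical
# stand-ins for the marker-based scoring of section 6.8.

def best_per_pair(candidates, threshold):
    """Keep only the highest-scoring relation per EDU pair, dropping
    anything that does not exceed the threshold."""
    best = {}
    for nuc, sat, score, label in candidates:
        key = (nuc, sat)
        if score > threshold and (key not in best or score > best[key][2]):
            best[key] = (nuc, sat, score, label)
    return list(best.values())

def recognize(edus, compare, compare_distant, threshold):
    # Pass 1: score every pair of adjacent EDUs.
    candidates = []
    for a, b in zip(edus, edus[1:]):
        candidates.extend(compare(a, b))
    relations = best_per_pair(candidates, threshold)

    # Pass 2 (repeated): rescore the list of nuclei and still
    # unconnected EDUs to pick up long-distance relations.
    while True:
        satellites = {sat for _, sat, _, _ in relations}
        remaining = [e for e in edus if e not in satellites]
        if len(remaining) <= 1:
            break
        distant = []
        for x, y in zip(remaining, remaining[1:]):
            distant.extend(compare_distant(x, y))
        new = best_per_pair(distant, threshold)
        if not new:
            break  # leftover subtrees would be joined with a Joint relation
        relations.extend(new)
    return relations
```

Each iteration of the second loop attaches at least one more EDU as a satellite, so the loop terminates either when everything is connected or when no candidate exceeds the threshold (the Joint fallback case).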
The first observation in developing the hierarchy is that a conjunctive adverb which starts a sentence acts as a strong discourse marker. Consider the extract below:

[De kans op een HIV-infectie door moedermelk is relatief laag.]32A [Toch dienen HIV-geïnfecteerde moeders borstvoeding te vermijden.]32B
[ENG: The chance of getting infected by HIV through breast milk is relatively low.]32C [Nevertheless, mothers infected with HIV ought to avoid breastfeeding.]32D

These two sentences can be related to each other with a Concession relation. It is clearly signaled by the word "Toch" (Nevertheless). As mentioned before, (conjunctive) adverbs can also signal relations when they do not appear at the start of the second sentence. However, the signaling is less strong than when they appear at the start of the second sentence. This can be illustrated with the following example:

[Bij mensen die aan asbest blootgesteld zijn geweest, kan de diagnose asbestose soms al worden gesteld op basis van kenmerkende afwijkingen op een thoraxfoto.]33A [De patiënt heeft meestal ook een afwijkende longfunctie en met de stethoscoop zijn krakende geluiden (’crepitaties’) in de longen te horen.]33B
[ENG: In people who have been exposed to asbestos, the diagnosis asbestosis can sometimes be made based on characteristic abnormalities on a chest X-ray.]33C [The patient usually also has an abnormal lung function, and with a stethoscope creaking sounds (’crepitations’) can be heard in the lungs.]33D

In this example, the second sentence can be connected to the first sentence with an Elaboration relation. This relation can be signaled by two words: the first is "Meestal" (usually), the second "Ook" (also). If both words are removed from the second sentence, no relation can be added if the recognizer has no world knowledge.
For example:

[Bij mensen die aan asbest blootgesteld zijn geweest, kan de diagnose asbestose soms al worden gesteld op basis van kenmerkende afwijkingen op een thoraxfoto.]34A [De patiënt heeft een afwijkende longfunctie en met de stethoscoop zijn krakende geluiden (’crepitaties’) in de longen te horen.]34B

To be able to assign the relation for this text, the recognizer must at least know that both sentences share some subject. It must therefore know that a patiënt is a human, or know what Asbestose is. Since the recognizer does not have this kind of information, a relation cannot be added. If only one signaling word is present, the relation can still be made:

[Bij mensen die aan asbest blootgesteld zijn geweest, kan de diagnose asbestose soms al worden gesteld op basis van kenmerkende afwijkingen op een thoraxfoto.]35A [De patiënt heeft (meestal/ook) een afwijkende longfunctie en met de stethoscoop zijn krakende geluiden (’crepitaties’) in de longen te horen.]35B

When both words are present, the signaling of a relation is even stronger. Therefore each occurrence of a (conjunctive) adverb increases the likelihood of the presence of a relation. A combination of words signaling different relations is possible as well. In such a case, all possibilities should be considered and the likelihoods added. This is shown with the following extract:

Vocht rond de longen kan met een naald worden verwijderd en onderzocht; deze ingreep heet thoracentese. Een thoracentese is meestal echter niet zo nauwkeurig als een biopsie.
ENG: Fluid around the lungs can be removed with a needle and examined; this procedure is called thoracentesis. A thoracentesis, however, is usually not as accurate as a biopsy.

In the second sentence, there are three words which could signal a relation: "Meestal", "Echter" (Usually, However) and "Niet" (Not). While "Meestal" can signal an Elaboration relation, "Echter" and "Niet" can signal a Concession.
Because there are two markers signaling the Concession relation, it is preferred. So, in the hierarchy, (conjunctive) adverbs which appear at the start of the second sentence signal a relation more strongly than when the word appears at a different spot. Pronouns are weaker than (conjunctive) adverbs, in the sense that they do signal relations, usually Elaborations, but the actual relation can be signaled more strongly by a different marker. Consider the next examples. In the first example, the word "Deze" signals an Elaboration, while in the second it could also signal a Concession:

[De ziekte wordt waarschijnlijk veroorzaakt door letsel dat is ontstaan doordat de pees van de knieschijf te hard trekt aan de aanhechting aan de kop van het scheenbeen (tibia).]37A [Deze aanhechting wordt tuberculum tibiae genoemd.]37B
[ENG: The disease is probably caused by an injury which originates because the tendon of the kneecap pulls too hard at its attachment to the head of the shinbone (tibia).]37C [This attachment is called tuberculum tibiae.]37D

[Als bijvoorbeeld de bloedvaten zich verwijden, waardoor de bloeddruk daalt, zenden de sensoren onmiddellijk signalen via de hersenen naar het hart, waardoor de hartfrequentie toeneemt en het hart dus meer bloed rondpompt.]38A [Daardoor verandert de bloeddruk uiteindelijk niet of nauwelijks.]38B [Deze compensatiemechanismen hebben echter ook hun beperkingen.]38C
[ENG: If for example the blood vessels widen, causing the blood pressure to drop, the sensors immediately send signals via the brain to the heart, as a result of which the heart rate increases and the heart pumps more blood through the system.]38D [Therefore the blood pressure eventually changes little or not at all.]38E [These compensation mechanisms have their limits, however.]38F

The Concession relation can be signaled by "Echter" or "Beperkingen". Therefore, if no other markers are found, the pronouns are used to connect the sentences. Pronouns can
also be used for signaling if they appear in the middle of a sentence. However, these words signal even less strongly. A further remark about pronouns is that they can be used to link sentences which are non-adjacent: the text refers to a statement presented in an earlier sentence. These references can be found with the use of keyword repetition. The fourth type of connection marker used is the medical discourse marker, as explained in chapter 5. These are used if no other markers are found; they are lowest in the hierarchy.

The word types which are highest in the ranking are those used when a marker of that type is found at the start of a sentence. There are only four types which can be used in this way: the conjunctive adverbs, the adverbs, the pronouns and the relation markers. The rest of the word types are used when a marker of that type resides in the middle of a sentence. The last entry of the ranking is for the implicit markers. This is because they are checked last, so they do not conflict with other word types for the signaling of relations. The final ranking is shown in table 6.6. The top of the table represents the strongest marker.

Ranking   Type                  Start of Sentence
1         Conjunctive Adverbs   Yes
2         Adverbs               Yes
3         Pronouns              Yes
4         Relation Markers      Yes
5         Conjunctive Adverbs   No
6         Adverbs               No
7         Relation Markers      No
8         Pronouns              No
9         Medical Markers       No
10        Implicit Markers      No

Table 6.6: Discourse Marker Hierarchy

To measure the actual strength of the discourse markers, the data is checked for occurrences of discourse markers of each word type. For each occurrence it is checked whether or not a relation is signaled by that discourse marker. The results for each discourse marker of a certain word type are combined into a number which expresses the likelihood that an instance of the word type signals a relation. Furthermore, a threshold value is determined from the data.
The threshold is based on the likelihood of the weakest signaling word types. A single instance of such a word type is not strong enough to signal a relation. A small likelihood bonus is added to the likelihood of a relation when multiple relations are possible between the same EDUs in the same ordering. For example, if the markers signal a Non-Volitional Cause relation in a Nucleus - Satellite ordering and an Elaboration relation in a Nucleus - Satellite ordering, the likelihood bonus is applied. This is done because multiple possible relations in a certain ordering support the assumption that the ordering is correct.

The likelihood values are shown in table 6.7.

Type                       Likelihood
Conjunctive Adverbs        60
Adverbs                    50
Pronouns                   30
Relation Marker            20
Second Sentence Standard   20
Second Sentence Medical    15
Threshold                  20
Likelihood Bonus            5

Table 6.7: Discourse Marker Likelihoods

The likelihood values from table 6.7 are based on their mutual proportions and the percentages of cases in which they signal a relation. It would be possible to derive a percentage of signaling per word type or even per word, but that would require a large amount of data, which was not available. These likelihood values are used cumulatively to calculate the likelihood. If a conjunctive adverb at the start of a sentence and a pronoun in the middle of a sentence both signal an Elaboration relation, the likelihood would be 60 + 20 = 80. The scoring of far-distance relations is missing from this list, because the likelihood of these relations depends on the number of similar keywords: the more similar keywords, the higher the likelihood. It is expressed as a number between 0 and 100, based on the percentage of similar keywords. Nouns are used as keywords.

6.9 Example

The automatic annotation process is illustrated by the example below. The intention of this example is to show the steps taken by the recognizer and the results.
The text used for the example is the following:

Wanneer de huid een lage temperatuur bereikt, verwijden de bloedvaten in het gebied zich bij wijze van reactie. De huid wordt rood, voelt heet aan, jeukt en kan pijn doen. Deze effecten treden meestal op negen tot 16 minuten nadat het ijs is aangebracht en verdwijnen ongeveer vier tot acht minuten nadat het ijs is verwijderd. Daarom moet het ijs worden verwijderd na tien minuten, of eerder wanneer deze effecten optreden, maar tien minuten nadat het is verwijderd, mag het weer worden aangebracht.
ENG: When the skin reaches a low temperature, the blood vessels in the area widen as a reaction. The skin becomes red, feels hot, itches and can hurt. These effects usually occur nine to 16 minutes after the ice is placed and disappear about four to eight minutes after the ice is removed. Therefore the ice must be removed after ten minutes, or earlier when these effects occur, but it can be placed again ten minutes after it is removed.

The recognizer starts with the segmentation process. The text is split into elementary discourse units. This results in the following:

[Wanneer de huid een lage temperatuur bereikt, verwijden de bloedvaten in het gebied zich bij wijze van reactie.]40A [De huid wordt rood, voelt heet aan, jeukt en kan pijn doen.]40B [Deze effecten treden meestal op negen tot 16 minuten nadat het ijs is aangebracht en verdwijnen ongeveer vier tot acht minuten nadat het ijs is verwijderd.]40C [Daarom moet het ijs worden verwijderd na tien minuten, of eerder wanneer deze effecten optreden, maar tien minuten nadat het is verwijderd, mag het weer worden aangebracht.]40D

After the segmentation, the actual recognition process starts. First the relations between adjacent EDUs are found. The list of found relations is cleaned: relations with a likelihood below the threshold are discarded. The relations shown below are found.
[2 3 N 145 elaboration]
[2 3 N 25 concession]
[2 3 N 25 nonvolitional-cause]
[3 4 N 45 elaboration]
[3 4 N 65 non-volitional-result]

Between EDU number 2 and EDU number 3, three relations with a likelihood above the threshold are found. The likelihood of the Elaboration relation is calculated to be 145. It is found by the following discourse markers: "Deze", "treden op", "meestal", "nadat", "nadat". Furthermore, it received the 5-point bonus, because multiple relations were found in a Nucleus - Satellite ordering between EDU 2 and EDU 3. The recognizer proceeds with discarding the conflicting relations. There are three possible relations found between EDU number 2 and EDU number 3 and two possible relations between EDU number 3 and number 4. No relations are found between EDU 1 and EDU 2. After cleaning, the result is the following:

[2 3 N 145 elaboration]
[3 4 N 65 non-volitional-result]

Next, since there is still an EDU which is unconnected, the far-distance relations are to be found. The recognizer runs Alpino to extract the nouns for each sentence and compares each possible combination. It finds a connection between EDU 1 and EDU 2 for the noun "huid". This relation receives a likelihood based on the number of similar nouns: the number of similar nouns is divided by the total number of nouns in the second sentence. Since the second sentence contains only two nouns ("huid", "pijn") and only the noun "huid" is shared, the reported value of 25 is incorrect. This is caused by Alpino, which incorrectly identified additional words as nouns. The found relation is added to the list which was generated before. Then possible conflicting relations are removed in the same way as before. The final relation list is the following:

[2 3 N 145 elaboration]
[3 4 N 65 non-volitional-result]
[1 2 N 25 elaboration]

This list is parsed by the tree-builder and an rs3 file is generated. The final tree is presented in figure 6.7.
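The cleaning performed in this example can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation; `clean` is a hypothetical helper that keeps the best-scored relation per EDU pair and then ensures each satellite is attached to only one nucleus, mirroring the [nucleus satellite N likelihood label] notation used above.

```python
# Sketch of the conflict-resolution ("cleaning") step: best relation
# per EDU pair, then one nucleus per satellite. Not the actual
# implementation; tuples are (nucleus, satellite, likelihood, label).

def clean(relations):
    # Keep the best-scored relation per (nucleus, satellite) pair.
    best = {}
    for nuc, sat, score, label in relations:
        key = (nuc, sat)
        if key not in best or score > best[key][2]:
            best[key] = (nuc, sat, score, label)
    # A satellite may be attached to only one nucleus: keep the
    # highest-scored attachment (as with sentence 3 in section 6.6).
    per_sat = {}
    for rel in best.values():
        sat = rel[1]
        if sat not in per_sat or rel[2] > per_sat[sat][2]:
            per_sat[sat] = rel
    return sorted(per_sat.values())

found = [
    (2, 3, 145, "elaboration"),
    (2, 3, 25, "concession"),
    (2, 3, 25, "nonvolitional-cause"),
    (3, 4, 45, "elaboration"),
    (3, 4, 65, "non-volitional-result"),
]
print(clean(found))
# [(2, 3, 145, 'elaboration'), (3, 4, 65, 'non-volitional-result')]
```

The same helper also resolves the conflict from section 6.6, where EDU 3 could attach to either EDU 1 (score 40) or EDU 2 (score 90): only the higher-scored attachment survives.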
Figure 6.7: Generated Tree

Chapter 7 Evaluation

In this chapter, the evaluation of automatically generated discourse trees is discussed. This evaluation can be done in multiple ways. The first section explains when a tree is valid according to the standards defined by RST. Furthermore, it is useful to check whether a valid tree correctly represents the information in the text. Two approaches to measure this correctness are described. The first approach is based on manually scoring the automatically generated result, by evaluating each relation separately. The second approach is to check whether the resulting tree as a whole is similar to a human-annotated version of the same text. The latter approach is used for this research, because it is much harder to tell whether a relation is correct or wrong than to check whether it is similar to the result of a human annotator.

After describing the different evaluation methods, the results of the recognizer are discussed, and it is shown how accuracy increases as more information is used by the program. Five example texts are presented; two of them are annotated with an increasing use of discourse marker types. The other three are smaller texts, with an analysis of the differences with respect to annotations performed by a human annotator. These evaluations are followed by the results of more texts annotated by the recognizer with the use of all discourse marker types. These results are compared to evaluations between human annotators. The last section contains a discussion of the performance of the automatic recognizer and adjustments which could be made.

7.1 RST-Trees

First of all, the correctness of an RST-tree needs to be formalized. Mann and Thompson define the following four rules [MT88]:

1. Completeness: The set contains one schema application that contains a set of text spans that constitute the entire text.
2.
Connectedness: Except for the entire text as a text span, each text span in the analysis is either a minimal unit or a constituent of another schema application of the analysis.
3. Uniqueness: Each schema application consists of a different set of text spans, and within a multi-relation schema each relation applies to a different set of text spans.
4. Adjacency: The text spans of each schema application constitute one text span.

The first rule, completeness, states that the tree should cover the entire input text. This can be accomplished by forcing the parts of the text together: if no relations are found between leaves and/or subtrees, the connection can be made with the Joint relation. This is allowed because the texts to annotate are considered to be coherent. The three other rules state that all elementary discourse units should be added to the tree as a node, and that no node can belong to different schemas.

The four rules are covered by the recognition algorithm. The algorithm cannot reconnect satellites which are already connected to a nucleus. Between nuclei, new relations are possible, but once connected, no other connection is possible afterwards. This ensures that a nucleus cannot be connected outside a schema. If no relations are found, all subtrees fit in the Joint schema. The trees generated by the recognizer therefore always conform to these rules.

7.2 Relation Based Evaluation

The relation based evaluation method is based on the judgment of an annotator. The annotator examines the automatically generated result and judges for each relation whether it is acceptable or not. The relations are not necessarily required to be the best option, but they are not allowed to be wrong. For example:

[Peter won the running contest.]41A [He is very fast.]41B

In figure 7.1, this small piece of discourse is annotated in three different ways. While the first is better than the second one, since the second sentence states an explanation of the first, both are correct.
The third annotation, however, is clearly incorrect. The nuclearity of the EDUs between which the relation is added must also be correct.

Figure 7.1: Three annotations

To score each relation, a scoring table must be developed.

7.3 Full Tree Evaluation

The full tree evaluation is done by comparing an RST-tree generated by the recognizer program with a manually annotated tree. This idea is also presented in Marcu’s thesis [Mar97b]. To be able to evaluate these full discourse trees, the parts in which they might differ from each other should be noted. First of all, the segmentation can differ. If a human annotator decided to split the discourse at other points than the recognizer did, differences occur. This can be prevented by using only sentence boundaries for segmentation, which ensures that the human annotator and the recognizer use the same elementary discourse units. The second possible difference is the structure of the tree: a set of three elementary discourse units can be connected using different schemas. The third difference is the actual relation between elementary discourse units, and their nuclearity aspect.

So, to compare an automatically generated tree, the second and third differences are to be checked. The first check can be omitted for trees generated by the developed automatic recognizer, because both approaches are based on the same segmentation. Since a relation cannot be compared with another relation if the elementary discourse units are different, this is the first point of measurement. The total number of correct relations is counted. A correct relation is one which equals the relation in the manually generated tree. The last part consists of counting the correct relation names.
The final measurement thus consists of two parts: a percentage of correct relation organization, and a percentage of correct labels. The measurement of the percentage of correct relation organizations is based on the fact that a correct tree of n nodes contains exactly n-1 relations if sentences are used as EDUs. A Joint or Sequence which contains m nodes counts as m-1 relations. The total correctness of the trees is measured by multiplying the two percentages. The process of full tree evaluation is illustrated with the following extract:

[Geneesmiddelen ter bestrijding van infecties zijn onder andere gericht tegen bacteriën, virussen en schimmels.]42A [Deze geneesmiddelen zijn zo gemaakt dat ze zo toxisch mogelijk zijn voor het infecterende micro-organisme en zo veilig mogelijk voor menselijke cellen.]42B [Ze zijn dus zo gemaakt dat ze selectief toxisch zijn.]42C [Productie van geneesmiddelen met selectieve toxiciteit om bacteriën en schimmels te bestrijden, is relatief gemakkelijk, omdat bacteriële en schimmelcellen sterk van menselijke cellen verschillen.]42D [Het is echter zeer moeilijk om een geneesmiddel te maken dat een virus bestrijdt zonder de geïnfecteerde cel en daardoor ook andere menselijke cellen aan te tasten.]42E

This text can be annotated like the trees in figure 7.2. These trees differ only in their relation between the first and the second sentence. The comparison results are shown in table 7.1. To show the difference, the next example is an annotation which differs strongly. The tree in figure 7.3 is compared to the first one in figure 7.2. The results are shown in table 7.2. There are two relation organizations which are correct: the one between 42D and 42E, and the arc from the Contrast schema to 42B. The labels of these connections are not always correct.
Figure 7.2: Two Annotation Trees (the trees differ only in the relation, Elaboration vs. Background, between the first and second sentence)

Table 7.1: Comparison Results

             Correct Organization   Correct Label   Total Correctness
Numbers      4                      4
Percentage   80 %                   100 %           80 %

Only the Contrast relation is labeled correctly, so the organization precision is 40% and the label precision is 50%.

Table 7.2: Comparison Results

             Correct Organization   Correct Label   Total Correctness
Numbers      2                      1
Percentage   40 %                   50 %            20 %

7.4 Testing Results

To measure the accuracy of the recognizer, this section discusses its results as the input it uses is extended. A number of texts are randomly extracted from the Merck Manual. These texts are annotated by a human annotator and by the automatic recognizer, and the automatically generated trees are compared to the manually annotated trees. It is shown that the accuracy of the recognizer grows as more word types which can signal relations are added. The word types used are:

1. Conjunctive Adverbs
2. Adverbs
3. Pronouns
4. Medical Discourse Markers
5. Keyword Repetition

Figure 7.3: Third Tree (a strongly differing annotation of the same extract)

The recognizer has processed each text multiple times. First, the recognizer uses only conjunctive adverbs; next, it uses conjunctive adverbs and adverbs. This process is repeated until all the word types listed above are used. The results are presented for two texts of different sizes, each annotated manually by three human annotators. The results of the recognizer on these same texts are evaluated against the human-annotated trees. The results show the separate score of the first marker type, and the total number of recognized relations for each word type which is added; they thus make clear which word types increased the number of recognized relations.
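The scoring scheme just described can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual recognizer code; relations are simplified to (EDU pair, label) entries:

```python
def evaluate_tree(auto_relations, gold_relations, n_edus):
    """Compare an automatically generated tree against a gold tree.

    Relations are (span, label) pairs, where `span` identifies the EDUs
    the relation connects. A correct tree over n sentence-level EDUs
    contains exactly n - 1 relations; the percentages are based on this.
    """
    total = n_edus - 1
    auto = dict(auto_relations)
    gold = dict(gold_relations)
    # Organization: a relation connects the same EDUs in both trees.
    correct_org = [s for s in auto if s in gold]
    # Label: among correctly organized relations, the label also matches.
    correct_label = [s for s in correct_org if auto[s] == gold[s]]
    org_pct = len(correct_org) / total
    label_pct = len(correct_label) / len(correct_org) if correct_org else 0.0
    # Total correctness is the product of the two percentages.
    return org_pct, label_pct, org_pct * label_pct
```

For instance, with five relations of which two connect the correct EDUs and one of those carries the correct label, this yields 40%, 50%, and a total correctness of 20%, as in table 7.2.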
After this, the results of the recognizer on three smaller texts are shown and compared with the results of a single human annotator. These texts are processed using all the word types listed above, and the differences between the results of the automatic recognizer and the human annotator are discussed. These tests are extended in the next section, where evaluations of seven more texts are presented. These results are compared to the evaluation results of texts which are annotated by two annotators. The last section discusses the results of the automatic recognizer.

7.4.1 First Text

The first text is manually annotated by three annotators. The text is shown below:

[Veel aandoeningen die op de huid verschijnen, blijven tot de huid beperkt.]43A [Andere aandoeningen komen naast de huid ook in de inwendige organen voor.]43B [Zo krijgen mensen met systemische lupus erythematodes een ongewone roodachtige uitslag op de wangen, meestal nadat ze aan zonlicht zijn blootgesteld.]43C [Artsen moeten dus rekening houden met vele mogelijke inwendige oorzaken wanneer ze huidproblemen onderzoeken.]43D [Door het gehele huidoppervlak op bepaalde patronen van huiduitslag te onderzoeken kunnen ze een eventuele ziekte vaststellen.]43E [Om te controleren in hoeverre het huidprobleem zich heeft verspreid, kan de arts een patiënt verzoeken zich helemaal uit te kleden, ook wanneer de patiënt slechts een afwijking op een klein deel van de huid heeft opgemerkt.]43F [Om bepaalde ziekten op te sporen of uit te sluiten, kunnen artsen bovendien een bloedonderzoek of andere laboratoriumonderzoeken laten uitvoeren.]43G

Two of the annotations showed a similar tree; the third differed slightly. The added relation labels, however, differed to a greater degree. Since the manually annotated trees differ, they are combined into an average tree. The combining is done as follows. Since all EDUs are the same, the relations can be compared directly.
First the composition is checked: the relations which were added between the same EDUs by at least two of the three annotators are kept. After this, the labels are checked: the label which was added by at least two of the three annotators is used. The three manually annotated trees are shown in figures 7.4, 7.5 and 7.6; the combined tree is shown in figure 7.7.

Figure 7.4: First Manually Annotated Tree
Figure 7.5: Second Manually Annotated Tree
Figure 7.6: Third Manually Annotated Tree
Figure 7.7: Combined Manually Annotated Tree

The automatically generated tree, shown in figure 7.8, is compared to the combined tree. The results are shown in table 7.3, which shows an increase of recognized relations with the use of the Conjunctive Adverbs, the Adverbs and the Implicit Markers. The reason the implicit markers increase the number of recognized relations is that they are used by the recognizer for the recognition of long-distance relations. Although the other marker types, such as the Markers Non-Start, did not increase the number of recognized relations, the likelihood that the previously recognized relations are correct did increase. The numbers between parentheses are the total numbers of added relations. In total 6 connections are possible, so the percentages are based on that number.

Furthermore, the relations which were annotated identically by all annotators are compared with the result of the recognizer on the same specific part. This is done for the organization and for the labeling. Four organizations were identical across all human annotators; three of these connections were found by the recognizer as well, although one of them was a Joint relation. Of these four connections, the human annotators agreed on the labeling only once, and the recognizer labeled that relation identically as well.
Table 7.3: Results of First Text

                      Correct Organization      Correct Label
Conjunctive Adverbs   1 (1)   16.7 %            1   100 %
Adverbs               3 (3)   50 %              3   100 %
Pronouns              3 (3)   50 %              3   100 %
Markers Non-Start     3 (3)   50 %              3   100 %
Relation Based        3 (3)   50 %              3   100 %
Medical Markers       3 (3)   50 %              3   100 %
Implicit Markers      4 (6)   66.7 %            3   75 %
Total Correctness                                   50 %

Figure 7.8: Automatically Generated Tree

7.4.2 Second Text

The second text is the following:

[De productie van groeihormoon door de hypofyse kan worden vastgesteld met behulp van de zogenaamde GHRH-argininetest of de insulinetolerantietest.]44A [Doordat het lichaam groeihormoon in de loop van het etmaal met pieken produceert (vooral 's nachts), is de bloedspiegel op één bepaald moment geen aanwijzing voor een al dan niet normale productie.]44B [De arts meet daarom vaak het gehalte aan IGF-I in het bloed omdat dit gehalte doorgaans langzaam verandert in verhouding tot de totale hoeveelheid door de hypofyse afgegeven groeihormoon.]44C [Een klein tekort aan groeihormoon is bijzonder moeilijk vast te stellen.]44D [Bovendien zijn bij verminderde werking van de schildklier of bijnier de groeihormoonspiegels meestal laag.]44E

The human-annotated trees are combined into a single tree, as was done for the first text. The differences between the human annotations were considerable: no connection was added in the same way by all three annotators. Therefore the combination is based on the agreement of two annotators. The evaluation results of the automatically generated tree against the combined tree of the human annotators are shown in table 7.4; they are quite similar to the results of the first text.
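The 2-out-of-3 combination procedure used for both texts can be sketched as follows. This is a hypothetical helper, with annotations simplified to mappings from EDU pairs to relation labels:

```python
from collections import Counter

def combine_annotations(trees, min_agree=2):
    """Combine several manual annotations into one 'average' tree.

    trees: one dict per annotator, mapping an EDU pair to a relation label.
    First the composition is checked: a connection is kept only if at least
    `min_agree` annotators drew it. Then the labels are checked: the label
    used by at least `min_agree` annotators is kept, otherwise None.
    """
    span_counts = Counter(span for tree in trees for span in tree)
    combined = {}
    for span, count in span_counts.items():
        if count < min_agree:
            continue  # too few annotators connected these EDUs
        labels = Counter(tree[span] for tree in trees if span in tree)
        label, votes = labels.most_common(1)[0]
        combined[span] = label if votes >= min_agree else None
    return combined
```

This keeps exactly the connections drawn by at least two of the three annotators, with an empty label wherever no two annotators agreed on the relation name.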
Increases were seen with the same discourse marker types. Again, the numbers between parentheses are the total numbers of added relations. In total, 5 relations should be added, on which the percentages are based.

Table 7.4: Results of Second Text

                      Correct Organization      Correct Label
Conjunctive Adverbs   1 (1)   20 %              1   100 %
Adverbs               1 (2)   20 %              1   100 %
Pronouns              1 (2)   20 %              1   100 %
Markers Non-Start     1 (3)   20 %              1   100 %
Relation Based        1 (3)   20 %              1   100 %
Medical Markers       1 (3)   20 %              1   100 %
Implicit Markers      2 (4)   50 %              2   100 %
Total Correctness                                   50 %

7.4.3 Third Text

The first small text is shown below. It consists of three sentences, used as elementary discourse units. This text is annotated by a single human annotator; for the automatically generated tree, all word types discussed before are used.

[Behalve dat de aanmaak van rode bloedcellen afneemt, wordt ook het zenuwstelsel door een vitamine-B12-tekort aangetast.]45A [Hierdoor ontstaan tintelingen in handen en voeten, gevoelsstoornissen in benen, handen en voeten en spastische bewegingen.]45B [Andere symptomen zijn onder meer een bepaalde vorm van kleurenblindheid, gewichtsverlies, donkere verkleuring van de huid, verwardheid, depressie en vermindering van de verstandelijke vermogens.]45C

In figure 7.9, the RST tree created by the human annotator is shown; in figure 7.10, the tree generated by the recognizer. The relation between the second and third EDU is the same. The relation between the first and the second EDU, however, differs, but the difference is minor: both annotations show a cause-result relation between the EDUs, with the same cause and result, namely the first EDU as the cause and the second as the result.
What differs is the nuclearity of the EDUs: while the human annotator chose the first EDU as the nucleus, the recognizer picked the second EDU. Analysis of the decision tree showed that the recognizer has a slight preference for adding Non-Volitional Cause over Non-Volitional Result.

Figure 7.9: Manually Annotated Tree
Figure 7.10: Automatically Annotated Tree

The result of the evaluation is presented in table 7.5.

Table 7.5: Results of the Third Text

Correct Organization   Correct Label   Total Correctness
1 (2)   50 %           1   100 %       50 %

7.4.4 Fourth Text

A different small text is the following:

[Bij macroglobulinemie ontstaat ook vaak cryoglobulinemie, een aandoening die wordt gekenmerkt door cryoglobulinen.]46A [Dit zijn abnormale antistoffen die in het bloed precipiteren (neerslaan) wanneer de temperatuur ervan tot onder de lichaamstemperatuur daalt en die weer oplossen als de temperatuur stijgt.]46B [Patiënten met cryoglobulinemie kunnen zeer gevoelig worden voor kou of het Raynaud-fenomeen ontwikkelen.]46C [Hierbij worden de handen en voeten bij blootstelling aan kou zeer pijnlijk en wit.]46D

Again, the text is annotated by a single human annotator and by the automatic recognizer. The annotations created by the human annotator and the recognizer were identical, so the evaluation results in a total correctness of 100%. The annotation tree is shown in figure 7.11. Analysis of the decision tree shows that the relation between the first and the second EDU is clearly signaled by the word "Dit", and the relation between the third and the fourth EDU by the word "Hierbij". The remaining relation is added because both sentences speak about cryoglobulinemie.

Figure 7.11: Manually/Automatically Annotated Tree
7.4.5 Fifth Text

The last small text is the following:

[Voorlichting over elektriciteit en voorzichtigheid met betrekking tot het omgaan met elektriciteit zijn van groot belang.]47A [Ongevallen met elektriciteit thuis en op het werk kunnen worden voorkomen door ervoor te zorgen dat ontwerp, installatie en onderhoud van alle elektrische apparaten in orde zijn.]47B [Elk elektrisch apparaat waarmee het lichaam in aanraking kan komen, dient goed geaard te zijn en te zijn aangesloten op een stroomkring met een stroomonderbreker.]47C [Dergelijke veiligheidsmechanismen die de stroomkring onderbreken bij een lekstroom van 5 mA, vormen een uitstekende bescherming en zijn overal verkrijgbaar.]47D

It is annotated by a single human annotator and by the automatic recognizer, and the generated trees are compared. Figure 7.12 shows the manually annotated tree, which consists of two Elaborations and a Background relation. Since the recognizer cannot recognize a Background relation, the automatically generated tree, shown in figure 7.13, differs: the first relation is indeed not present. Instead of the Background relation from the first EDU to the second, another Elaboration relation is added from the second EDU towards the first. Although the relations are not the same, both representations indicate that the second sentence provides extra information about the topic discussed in the first sentence.

Figure 7.12: Manually Annotated Tree
Figure 7.13: Automatically Annotated Tree

The result of the evaluation is presented in table 7.6.

Table 7.6: Results of the Fifth Text

Correct Organization   Correct Label   Total Correctness
2 (3)   66.7 %         2   100 %       66.7 %

7.4.6 More Texts

In this section some further results of the recognizer are presented. More texts are annotated and the total correctness of each annotation is calculated by comparing the automatically generated trees to a manually annotated version of the same text.
The results are shown in table 7.7, sorted by the total number of relations the texts contain. The results of the texts in this table, and the results presented before, show that the recognition process is very difficult: results vary from very bad (14.3%) to excellent (100%). Furthermore, recognition of the organization is easier than the labeling. The organization scores 50% or above in all cases, while the labeling scores between 25% and 100%. One reason the organization differs is that nucleus and satellite are sometimes swapped; even when the relation label would be correct, this is counted as a mistake. To evaluate the recognizer more precisely, texts should be used on which multiple human annotators agree fully.

Table 7.7: More Text Results

Relations   Organization   Label        Total
3           3 (100 %)      1 (33.3 %)   33.3 %
4           4 (100 %)      3 (75 %)     75 %
4           3 (75 %)       3 (100 %)    75 %
4           2 (50 %)       1 (50 %)     25 %
6           3 (50 %)       1 (33.3 %)   16.7 %
6           3 (50 %)       2 (66.7 %)   33.3 %
7           4 (57.1 %)     1 (25 %)     14.3 %

7.4.7 Annotator Evaluation

To judge whether these results are acceptable, the results of human annotators are compared with each other. These calculations are based on texts other than those used for the evaluation of the recognizer. The results are shown in table 7.8. The table shows the numbers for six texts, which vary in length between 4 and 7 EDUs. For each text, two manually annotated trees are compared in the same way as a manually generated tree is compared with an automatically generated tree. The organization agreement is 57% or higher, and the label agreement is between 25% and 100% as well.

Table 7.8: More Text Results

Relations   Organization   Label       Total
4           3 (75 %)       1 (33 %)    25 %
4           4 (100 %)      4 (100 %)   100 %
4           4 (100 %)      3 (75 %)    75 %
6           4 (67 %)       3 (75 %)    50 %
6           5 (83.3 %)     2 (40 %)    33.3 %
7           4 (57 %)       1 (25 %)    14.3 %
The degree of difference between annotations generated by two human annotators is thus similar to the degree of difference between the annotations of a human annotator and the recognizer.

7.5 Discussion

In this last section the results of the recognizer are discussed. Some adjustments are presented, and it is discussed whether they would improve the results of the automatic recognizer. As seen in the previous sections, the trees generated by the automatic recognizer differ from a manually generated one as much as two manually generated trees differ from each other. A first remark here is that the recognizer is more limited in the relations it can use. Furthermore, human annotators use other sources of information than just discourse markers. The fact that two manually annotated versions can differ strongly from each other is an important one: the automatic recognizer is built on the results of human annotators, so its results are not expected to be better. To check whether it is possible to optimize the results of the recognizer, different likelihood values were tried. It turned out, however, that as long as the order in which the discourse marker types appear in the hierarchy does not change, the resulting trees do not differ, or only slightly. The main differences would occur between two EDUs related, for example, with a Non-Volitional Cause: with a mixed ranking they could be related inversely, that is, the nucleus and the satellite could be swapped and a Non-Volitional Result added. If the ranking were mixed up, the quality of the trees would decrease. Another possibility is the adjustment of the threshold. If, for example, the threshold were increased, fewer relations would be recognized, but the relations that remained were already correct in the first place. If the threshold were too low, too many relations would be recognized which are not actually worth recognizing.
A large change of the threshold is therefore not useful. The next possible adjustment is extending the reduced list of relations which can be recognized; the complete list as defined by RST could be used. Although the comparison scores between manually and automatically generated trees might then improve, the relations which would be added are harder to find: different discourse markers, or different ways of recognition, would be necessary. For this assignment, lists of special discourse markers were gathered. Although these markers do not often signal a relation on their own, they increase the likelihood of found relations. It can be useful to search for other domain-specific discourse markers and features, although none were found during this assignment. Relation-specific markers are used as well: they can signal relations, and increase the likelihood that a relation is correct. For the other relations, similar lists of words could be developed. For some genres, compositional information can be used for the recognition of structure. Although some compositional information could be useful in this domain, the amount of composition data was too low to implement these features. Furthermore, the larger the texts become, the harder the structure is to find, and the compositional information which was available covered larger amounts of text. It is also possible to change the algorithm of the recognizer: for example, all possible pairs of EDUs could be checked for a relation. The drawback of this approach is the use of discourse markers for the recognition of relations between sentences. While sentences are not necessarily connected to each other, the parts within a sentence usually are. If a discourse marker is assumed to signal a relation, it does so in every pair under consideration; the sentence which contains the discourse marker would therefore be connected to many other EDUs.
The majority of those relations would then be incorrect. Therefore, discourse markers are not optimal for the recognition of relations between sentences. This drawback is partially covered by the recognition of implicit markers through keyword repetition. However, there are plenty of cases in which human annotators added a long-distance relation while the recognizer was not able to find it. This suggests the use of more linguistic tools, such as WordNet. These can help find connections between sentences, after which the related pairs could be checked for discourse markers; the connection would signal the relation more strongly, while the discourse marker might indicate which label should be added. A final note on this automatic recognition process is that it could perform better if more annotated data were available. In that case, the numbers and discourse markers used could be more specific, and it would be possible to apply machine learning and check what performance could be reached that way.

Chapter 8 Conclusions and Future Work

First, the original research questions are stated once more, with conclusions per subquestion. After that, some general conclusions about this research are added. Finally, some recommendations for future work are presented.

8.1 Conclusions

1. How can RST annotations be automatically generated?
RST annotations can be automatically generated: there are generators based on machine learning, and programs based on static rules. For this assignment, an automatic recognizer program was written which is based on static rules. The recognizer consists of three parts: a segmenter, the actual recognition program, and a tree-builder. The recognizer uses lists of discourse markers of several word types. Strength indication variables, based on learning, are assigned to each word type; these are used to determine which relation best fits between two elementary discourse units.
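This marker-strength mechanism can be sketched as follows. It is a simplified illustration: the marker entries, strength values, threshold and relation names below are invented examples, not the thesis data:

```python
# Simplified sketch: each discourse marker carries a strength for the
# relation it signals; a marker found in an EDU votes for that relation.
# All names and values below are illustrative examples only.
MARKERS = {
    "echter":    ("Contrast", 0.9),               # conjunctive adverb
    "hierdoor":  ("Non-Volitional Result", 0.8),  # pronoun-based marker
    "bovendien": ("Elaboration", 0.7),
}

def best_relation(edu, threshold=0.5):
    """Pick the most likely relation signaled by the markers in an EDU."""
    scores = {}
    for word in edu.lower().split():
        if word in MARKERS:
            relation, strength = MARKERS[word]
            scores[relation] = max(scores.get(relation, 0.0), strength)
    if not scores:
        return None  # no marker found; no relation is proposed here
    relation = max(scores, key=scores.get)
    return relation if scores[relation] >= threshold else None
```

The actual recognizer additionally walks a hierarchy of word types and hands the selected relations to a tree-builder; this sketch only shows the strength-weighted selection step.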
The developed automatic recognition program generates Rhetorical Structure Theory trees which differ from manually annotated trees to a similar degree as two manually annotated trees differ from each other.

2. Is RST suitable for Dutch and specifically for medical texts?
Yes, RST is suitable for Dutch. It has been used in several other studies on various topics, and during this assignment several Dutch texts from the Merck Manual have been successfully annotated with RST. Furthermore, texts from other Dutch genres were annotated as well. The recognizer is tested on medical texts, but it produces similar results for texts from other Dutch genres. Some genres, however, are not suited to be annotated with RST.

3. Which properties of Dutch and which relations and discourse markers are best for automatic recognition?
In this assignment, different types of discourse markers are discussed. Markers described in other research are used, and lists of domain-specific markers were gathered. These discourse marker types are fitted into a hierarchy, which provides an overview of the importance and quality of each of the discourse marker types. The clearest markers appear at the start of a sentence, but all of the discussed markers can also be used when they reside within a sentence. During this research, relations from the original list as defined by RST were combined. This resulted in a small set with a coverage of over 88%. These relations are the most used, and are suitable for automatic recognition. The list can be extended with relations from the original list, and new relations can be added.

4. Is it possible to use other properties of medical texts as discourse markers to find relations?
While the Merck Manual provides features like lists and tables, these can only be used in a few cases.
The main part of the information is plain text, and therefore no compositional information is used except for the division into paragraphs. The texts which are automatically annotated are considered to be coherent. Furthermore, the use of lists and tables is not typical of medical texts; other texts, such as recipes, use them as well. Since the latter are generally much smaller, this compositional information is more useful in those texts. Other properties of medical texts are used by the automatic recognizer: lists of noun constructions and verb constructions, together with the relations they signal, were gathered and discussed. These markers are specific to the medical domain. Although they are weak compared to the general discourse markers for Dutch, they increase the likelihood of found relations, and in cases where no other discourse markers are present, they are used to signal relations on their own.

5. How can automatically generated annotations be evaluated?
Two ways of evaluation are discussed: the first is based on relation evaluation, the second on the evaluation of full trees. The full tree evaluation method is used for the evaluation of the recognizer, and also to compare manually annotated trees with each other. This method produces a score for the presence of a relation and a separate score for its labeling; the evaluation is thus useful in two ways.

One of the remarks on this research is that annotations differ strongly, even between human annotators. In part this is because it is hard to create a good tree from a text, which is aggravated by the fact that the people who annotated these texts are not professional linguists, which might decrease the performance. It is also an aspect of language itself: for some texts there may be an optimal tree, while for other (larger) texts this may be impossible.
This conclusion agrees with the results of Mann and Thompson, who state in their work that most texts have multiple possible structures.

8.2 Future Work

This research can be extended in several ways. Some possible extensions are presented and discussed separately.

1. More Discourse Marker Types
Since the performance of the recognizer increased by adding more discourse marker types, it may be worthwhile to search for other types or constructions which can be helpful. Some types which were found during analysis but were not implemented are the signals for the recognition of lists/sequences, and words which signal Elaborations, although the latter are partly used. It may also be worthwhile to check for larger structures, although they were not found in the data used.

2. Extend the Relation List
The recognizer is able to recognize only a small portion of the relations described by RST. It would be useful to research the possibilities of extending this list with the remaining relations. It is also possible to merge the list with relations defined by other theories, to create a new list of relations.

3. Use Domain Specific Features
While the Merck Manual is presented in a clear form, this knowledge was hard to use for recognition, and would only be interesting for the automatic annotation of very large texts. For smaller texts, features like bulleted lists, headings, etcetera could be useful. For some texts, such as recipes, the fact that they usually follow a certain writing style (first the preparation of the ingredients, then the cooking, and finally the serving) can be used as a constraint for the recognizer.

4. Use a Thesaurus
A thesaurus enables one to compare hyponyms, homonyms, etcetera. The use of such a thesaurus would probably increase the performance of a recognizer.

5. Create a New Tool
The RST tool of O'Donnell has many drawbacks; it is hard to work with.
It would therefore be useful to create a new program, or to fix bugs in and add features to the existing one. Furthermore, an automatic recognition program could be integrated in the new tool.

6. Create a Dutch Corpus
Creating a Dutch corpus of texts annotated with RST would offer a valuable source of material. Moreover, the creation of such a corpus gives a better understanding of the structure of texts. The corpus could be used with a machine learning algorithm.

Bibliography

[Abn91] Steven Abney. Parsing by chunks. In Berwick, Abney, and Tenny, editors, Principle-Based Parsing. Kluwer Academic Publishers, 1991.
[ART93] E. Abelen, G. Redeker, and S.A. Thompson. The rhetorical structure of US-American and Dutch fund-raising letters. Text, 13(3):323–350, 1993.
[Bir85] D.P. Birkmire. Text processing: The influence of text structure, background knowledge and purpose. Reading Research Quarterly, 20(3):314–326, 1985.
[BMK03] Jill Burstein, Daniel Marcu, and Kevin Knight. Finding the write stuff: Automatic identification of discourse structure in student essays. IEEE Intelligent Systems, 18(1):32–39, 2003.
[BvNM01] G. Bouma, G. van Noord, and R. Malouf. Alpino: Wide-coverage computational analysis of Dutch. Computational Linguistics in The Netherlands, 2001.
[CFM+02] Cassandre Creswell, Kate Forbes, Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. The discourse anaphoric properties of connectives. In Proceedings of the Discourse Anaphora and Anaphor Resolution Colloquium, Lisbon, Portugal, 2002.
[CMO03] Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. Building a discourse-tagged corpus in the framework of Rhetorical Structure Theory. In Current and New Directions in Discourse and Dialogue, 2003.
[CO98] Simon Corston-Oliver. Computing Representations of the Structure of Written Discourse. PhD thesis, University of California, Santa Barbara, 1998.
[Fra99] Bruce Fraser. What are discourse markers?
Journal of Pragmatics, pages 931–952, 1999.
[GS86] Barbara J. Grosz and Candace L. Sidner. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175–203, 1986.
[GS98] Brigitte Grote and Manfred Stede. Discourse marker choice in sentence planning. In Proceedings of the Ninth International Workshop on Natural Language Generation, pages 128–137. Association for Computational Linguistics, New Brunswick, New Jersey, 1998.
[HHS03] T. Hanneforth, S. Heintze, and M. Stede. Rhetorical parsing with underspecification and forests. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 2003.
[HL93] J. Hirschberg and D. Litman. Empirical studies on the disambiguation of cue phrases. Computational Linguistics, 19(3):501–530, 1993.
[HM97] E. Hovy and E. Maier. Parsimonious or profligate: how many and which discourse structure relations? Discourse Processes, 1997.
[HMSvdW01] H. Hoekstra, M. Moortgat, I. Schuurman, and T. van der Wouden. Syntactic annotation for the Spoken Dutch Corpus project (CGN). In W. Daelemans, K. Simaan, J. Veenstra, and J. Zavrel, editors, Computational Linguistics in the Netherlands 2000, pages 73–87. Amsterdam: Rodopi, 2001.
[Hob85] J.R. Hobbs. On the coherence and structure of discourse. Technical Report CSLI-85-37, Stanford University, 1985.
[Hov93] E. Hovy. Automated discourse generation using discourse structure relations. Artificial Intelligence, 63 (Special issue on NLP), 1993.
[Kno93] A. Knott. Using cue phrases to determine a set of rhetorical relations. In O. Rambow, editor, Intentionality and Structure in Discourse Relations: Proceedings of the ACL SIGGEN Workshop, 1993.
[KS98] A. Knott and T. Sanders. The classification of coherence relations and their linguistic markers: An exploration of two languages. Journal of Pragmatics, 30:135–175, 1998.
[KS04] L. Koenen and R. Smits. Handboek Nederlands.
Utrecht: Bijleveld, 2004.
[Mar97a] Daniel Marcu. The rhetorical parsing of natural language texts. In Proceedings of ACL/EACL 97, pages 96–103, 1997.
[Mar97b] Daniel Marcu. The Rhetorical Parsing, Summarization and Generation of Natural Language Texts. PhD thesis, Department of Computer Science, University of Toronto, 1997.
[MAR99] Daniel Marcu, Estibaliz Amorrortu, and Magdalena Romera. Experiments in constructing a corpus of discourse trees. In Proceedings of the ACL Workshop on Standards and Tools for Discourse Tagging, College Park, MD, pages 48–57, 1999.
[Mar00] Daniel Marcu. The rhetorical parsing of unrestricted texts: A surface-based approach. Computational Linguistics, 26:395–448, 2000.
[ME02] Daniel Marcu and A. Echihabi. An unsupervised approach to recognizing discourse relations. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002), 2002.
[Mil95] G.A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[MMS93] M.P. Marcus, M.A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[MRAW04a] Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. Annotating discourse connectives and their arguments. In Proceedings of the Workshop on Frontiers in Corpus Annotation, pages 48–57, 2004.
[MRAW04b] Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. The Penn Discourse Treebank. In Proceedings of the Language Resources and Evaluation Conference, 2004.
[MS01] Henk Pander Maat and Ted Sanders. Subjectivity in causal connectives: An empirical study of language in use. Cognitive Linguistics, 12(3):247–273, 2001.
[MT88] William C. Mann and Sandra A. Thompson. Rhetorical Structure Theory: Toward a functional theory of text organization. Text, 8(3):243–281, 1988.
[O'D00] Mike O'Donnell. RSTTool 2.4: a markup tool for Rhetorical Structure Theory.
In Proceedings of the 1st International Natural Language Generation Conference, 2000.
[PdGVNM04] Thiago Alexandre Salgueiro Pardo, Maria das Graças Volpe Nunes, and Lucia Helena Machado Rino. DiZer: An automatic discourse analyzer for Brazilian Portuguese. In Proceedings of the First International Workshop on Natural Language Understanding and Cognitive Science (NLUCS 2004), 2004.
[Per] Perrez. Connectieven, tekstbegrip en vreemdetaalverwerking. Een studie van de impact van causale en contrastieve connectieven op het begrijpen van teksten in het Nederlands als vreemde taal.
[Por80] M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[PSBA03] R. Power, D. Scott, and N. Bouayad-Agha. Document structure. Computational Linguistics, 29(3):211–260, 2003.
[San] T.J.M. Sanders. Coherence, causality and cognitive complexity in discourse.
[Sch87] Deborah Schiffrin. Discourse Markers. Cambridge University Press, 1987.
[SH04] Manfred Stede and S. Heintze. Machine-assisted rhetorical structure annotation. In Proceedings of the International Conference on Computational Linguistics (COLING-2004), 2004.
[Spe03] Spectrum. Winkler Prins Medische Encyclopedie. Spectrum, 2003.
[Ste04] Manfred Stede. The Potsdam Commentary Corpus. In Proceedings of the Workshop on Discourse Annotation, 42nd Meeting of the Association for Computational Linguistics, 2004.
[Tabng] Maite Taboada. Discourse markers as signals (or not) of rhetorical relations. Journal of Pragmatics, forthcoming.
[TM06] Maite Taboada and William C. Mann. Applications of rhetorical structure theory. Discourse Studies, 8(4), 2006.
[vL05] M. van Langen. Question answering for general practitioners. Master's thesis, University of Twente, Enschede, 2005.
[WG05] F. Wolf and E. Gibson. Representing discourse coherence: A corpus-based analysis. Computational Linguistics, 31(2):249–287, 2005.
[WSK05] Methee Wattanamethanont, Thana Sukvaree, and Asanee Kawtrakul.
Thai discourse relation recognition using a Naive Bayes classifier. In The Sixth Symposium on Natural Language Processing (SNLP 2005), 2005.

Appendix A Rhetorical Relations

Below is the set of rhetorical relations as defined in the original paper by Mann and Thompson [MT88], sorted alphabetically. The following abbreviations are used:
1. N = Nucleus
2. S = Satellite
3. W = Writer
4. R = Reader

=== Antithesis ===
Constraints on N: W has positive regard for the situation presented in N
Constraints on S: None
Constraints on the N + S Combination: The situations presented in N and S are in contrast (cf. CONTRAST, i.e. they are a) comprehended as the same in many respects, b) comprehended as differing in a few respects, and c) compared with respect to one or more of these differences); because of an incompatibility that arises from the contrast, one cannot have positive regard for both the situations presented in N and S; comprehending S and the incompatibility between the situations presented in N and S increases R's positive regard for the situation presented in N
The Effect: R's positive regard for N is increased
Locus of the Effect: N

=== Background ===
Constraints on N: R won't comprehend N sufficiently before reading the text of S
Constraints on S: None
Constraints on the N + S Combination: S increases the ability of R to comprehend an element in N
The Effect: R's ability to comprehend N increases
Locus of the Effect: N

=== Circumstance ===
Constraints on N: None
Constraints on S: Presents a situation (not unrealized)
Constraints on the N + S Combination: S sets a framework in the subject matter within which R is intended to interpret the situation presented in N
The Effect: R recognizes that the situation presented in S provides the framework for interpreting N
Locus of the Effect: N and S
=== Concession ===
Constraints on N: W has positive regard for the situation presented in N
Constraints on S: W is not claiming that the situation presented in S does not hold
Constraints on the N + S Combination: W acknowledges a potential or apparent incompatibility between the situations presented in N and S; W regards the situations presented in N and S as compatible; recognizing the compatibility between the situations presented in N and S increases R's positive regard for the situation presented in N
The Effect: R's positive regard for the situation presented in N is increased
Locus of the Effect: N and S

=== Condition ===
Constraints on N: None
Constraints on S: S presents a hypothetical, future, or otherwise unrealized situation (relative to the situational context of S)
Constraints on the N + S Combination: Realization of the situation presented in N depends on realization of that presented in S
The Effect: R recognizes how the realization of the situation presented in N depends on the realization of the situation presented in S
Locus of the Effect: N and S

=== Contrast ===
Constraints on N: Multi-nuclear; no more than two nuclei
Constraints on the combination of nuclei: The situations presented in these two nuclei are a) comprehended as the same in many respects, b) comprehended as different in a few respects, and c) compared with respect to one or more of these differences
The Effect: R recognizes the comparability and the difference(s) yielded by the comparison being made
Locus of the Effect: Multiple nuclei

=== Elaboration ===
Constraints on N: None
Constraints on S: None
Constraints on the N + S Combination: S presents additional detail about the
situation or some element of subject matter which is presented in N or inferentially accessible in N in one or more of the ways listed below. In the list, if N presents the first member of any pair, then S includes the second:
1) set : member
2) abstract : instance
3) whole : part
4) process : step
5) object : attribute
6) generalization : specific
The Effect: R recognizes the situation presented in S as providing additional detail for N; R identifies the element of subject matter for which detail is provided
Locus of the Effect: N and S

=== Enablement ===
Constraints on N: Presents an R action (including accepting an offer), unrealized with respect to the context of N
Constraints on S: None
Constraints on the N + S Combination: R comprehending S increases R's potential ability to perform the action presented in N
The Effect: R's potential ability to perform the action presented in N increases
Locus of the Effect: N

=== Evaluation ===
Constraints on N: None
Constraints on S: None
Constraints on the N + S Combination: S relates the situation in N to the degree of W's positive regard toward the situation presented in N
The Effect: R recognizes that the situation presented in S assesses the situation presented in N and recognizes the value it assigns
Locus of the Effect: N and S

=== Evidence ===
Constraints on N: R might not believe N to a degree satisfactory to W
Constraints on S: The reader believes S or will find it credible
Constraints on the N + S Combination: R's comprehending S increases R's belief of N
The Effect: R's belief of N is increased
Locus of the Effect: N

=== Interpretation ===
Constraints on N: None
Constraints on S: None
Constraints on the N + S Combination: S relates the situation presented in N to a framework of ideas not involved in N itself and not concerned with W's positive regard
The Effect: R recognizes that S relates the situation presented in N to a framework of ideas not involved in the knowledge presented in N itself
Locus of the Effect: N and S

=== Justify ===
Constraints on N: None
Constraints on S: None
Constraints on the N + S Combination: R's comprehending S increases R's readiness to accept W's right to present N
The Effect: R's readiness to accept W's right to present N is increased
Locus of the Effect: N

=== Motivation ===
Constraints on N: Presents an action in which R is the actor (including accepting an offer), unrealized with respect to the context of N
Constraints on S: None
Constraints on the N + S Combination: Comprehending S increases R's desire to perform the action presented in N
The Effect: R's desire to perform the action presented in N is increased
Locus of the Effect: N

=== Otherwise ===
Constraints on N: Presents an unrealized situation
Constraints on S: Presents an unrealized situation
Constraints on the N + S Combination: Realization of the situation presented in N prevents realization of the situation presented in S
The Effect: R recognizes the dependency relation of prevention between the realization of the situation presented in N and the realization of the situation presented in S
Locus of the Effect: N and S

=== Purpose ===
Constraints on N: Presents an activity
Constraints on S: Presents a situation that is unrealized
Constraints on the N + S Combination: S presents a situation to be realized through the activity in N
The Effect: R recognizes that the activity in N is initiated in order to realize S
Locus of the Effect: N and S

=== Restatement ===
Constraints on N: None
Constraints on S: None
Constraints on the N + S Combination: S restates N, where S and N are of comparable bulk
The Effect: R recognizes S as a restatement of N
Locus of the Effect: N and S

=== Sequence ===
Constraints on N: Multi-nuclear
Constraints on the combination of nuclei: A succession relationship between the situations presented in the nuclei
The Effect: R recognizes the succession relationships among the nuclei
Locus of the Effect: Multiple nuclei
=== Solutionhood ===
Constraints on N: None
Constraints on S: Presents a problem
Constraints on the N + S Combination: The situation presented in N is a solution to the problem stated in S
The Effect: R recognizes the situation presented in N as a solution to the problem presented in S
Locus of the Effect: N and S

=== Summary ===
Constraints on N: N must be more than one unit
Constraints on S: None
Constraints on the N + S Combination: S presents a restatement of the content of N that is shorter in bulk
The Effect: R recognizes S as a shorter restatement of N
Locus of the Effect: N and S

=== Volitional Cause ===
Constraints on N: Presents a volitional action or else a situation that could have arisen from a volitional action
Constraints on S: None
Constraints on the N + S Combination: S presents a situation that could have caused the agent of the volitional action in N to perform that action; without the presentation of S, R might not regard the action as motivated or know the particular motivation; N is more central to W's purposes in putting forth the N-S combination than S is
The Effect: R recognizes the situation presented in S as a cause for the volitional action presented in N
Locus of the Effect: N and S

=== Non-Volitional Cause ===
Constraints on N: Presents a situation that is not a volitional action
Constraints on S: None
Constraints on the N + S Combination: S presents a situation that, by means other than motivating a volitional action, caused the situation presented in N; without the presentation of S, R might not know the particular cause of the situation; the presentation of N is more central than S to W's purposes in putting forth the N-S combination.
The Effect: R recognizes the situation presented in S as a cause of the situation presented in N
Locus of the Effect: N and S

=== Volitional Result ===
Constraints on N: None
Constraints on S: Presents a volitional action or a situation that could have arisen from a volitional action
Constraints on the N + S Combination: N presents a situation that could have caused the situation presented in S; the situation presented in N is more central to W's purposes than that presented in S
The Effect: R recognizes the situation presented in N could be a cause for the action or situation presented in S
Locus of the Effect: N and S

=== Non-Volitional Result ===
Constraints on N: None
Constraints on S: Presents a situation that is not a volitional action
Constraints on the N + S Combination: N presents a situation that caused the situation presented in S; presentation of N is more central to W's purposes in putting forth the N-S combination than is the presentation of S
The Effect: R recognizes the situation presented in N could have caused the situation presented in S
Locus of the Effect: N and S

Appendix B Conjunctions

Dutch – English
aangezien – since
al – although
aleer – before
alhoewel – although
als – if
alsmede – just as
alsof – as if
alsook – just as
alvorens – before
annex – added to that
behalve – except
daar – because
daardoor – therefore
dan – than
dat – that
dewijl – because
doch – but
doordat – because
doordien – because
dus – so
eer – before
eerdat – before
en – and
ende – and
evenals – just as
gelijk – as
hetzij – either
hoewel – although
hoezeer – how much
indien – if
ingeval – in case of
maar – but
mits – if
na – after
naar – in accordance with
naargelang – as
naardien – if
naarmate – as
nadat – after
nademaal – now
niettegenstaande – although
noch – neither
nu – now
of – if
ofdat – whether
ofschoon – although
oftewel – like
ofwel – either
om – for
omdat – because
opdat – so
overmits – for
schoon – although
sedert – since
sinds – since
tenzij – unless
terwijl – while / although
toen – then
tot – until
totdat – until
uitgenomen – except
uitgezonderd – except
vermits – as
voor – before
vooraleer – before
voordat – before
voorzover – as far as
wanneer – when
want – for
wijl – while
zo – as
zoals – as
zodat – so
zodra – as soon as
zolang – as long as
zover – as far as

Appendix C Merck Manual

The text below is the original text about the Merck Manual, taken from its website; an English translation follows.

Over Merck
De inhoud van deze website is volledig gebaseerd op het in 2000 verschenen Merck Manual Medisch handboek. In deze online versie van het handboek kunt u gemakkelijk zoeken op onderwerp en kunt u teksten en eventuele plaatjes en tabellen uitprinten. De nummering van de secties, hoofdstukken en onderwerpen komen overeen met het boek, zodat u via internet gemakkelijk kunt zoeken en eventueel het boek kunt raadplegen om de onderwerpen eens rustig door te lezen. The Merck Manual is een omvangrijk naslagwerk voor artsen dat in 1899 voor het eerst uitkwam en inmiddels in een 17e editie is verschenen. The Merck Manual geniet al ruim een eeuw groot gezag in de gezondheidszorg vanwege zijn betrouwbaarheid en volledigheid. Dit naslagwerk is voor een breed publiek herschreven en is uitgebracht als The Merck Manual Home Edition. Van deze publiekseditie zijn in de Verenigde Staten in de eerste drie jaar meer dan 1,5 miljoen exemplaren verkocht. The Merck Manual Home Edition is inmiddels al in 13 talen uitgebracht en is in Nederland bekend als het Merck Manual Medisch handboek.
De Nederlandse editie is zorgvuldig beoordeeld, geactualiseerd en waar nodig aangepast aan de Nederlandse gezondheidszorg door 34 artsen en medisch deskundigen verbonden aan academische ziekenhuizen en overige gezondheidsinstellingen in Nederland. Sinds december 2002 is het Merck Manual Medisch handboek ook online beschikbaar. Het Merck Manual Medisch handboek is een naslagwerk dat niet te eenvoudig is voor de arts en niet te moeilijk voor de patiënt. Het Merck Manual Medisch handboek biedt inzicht in de oorzaken en behandeling van meer dan 3000 aandoeningen, in prettig leesbaar, begrijpelijk Nederlands. Geen enkel ander medisch naslagwerk behandelt een dergelijke grote verscheidenheid aan ziekten. De site maakt een gedegen voorbereiding van een bezoek aan de arts mogelijk en laat patiënten en zorgverleners op niveau met elkaar van gedachten wisselen over de vele onderzoeks- en behandelmogelijkheden in de moderne geneeskunde.

About Merck
The contents of this website are fully based on the Merck Manual Medisch handboek, which appeared in 2000. This online version of the handbook provides easy searching by subject and enables you to print texts and any pictures and tables. The numbers of the sections, chapters and subjects match those in the book, so the internet version provides easy searching, while you can use the book to read up on the subjects at leisure. The Merck Manual is a sizeable reference book for doctors that first appeared in 1899 and has meanwhile reached its 17th edition. The Merck Manual has enjoyed great authority in health care for over a century due to its reliability and completeness. This reference book was rewritten for a general audience and released as The Merck Manual Home Edition. More than 1.5 million copies of this public edition were sold in the United States in the first three years. The Merck Manual Home Edition has meanwhile been released in 13 languages and is known in The Netherlands as the Merck Manual Medisch handboek.
The Dutch edition was carefully reviewed, updated and, where necessary, adapted to Dutch health care by 34 doctors and medical experts affiliated with academic hospitals and other health institutions in The Netherlands. Since December 2002, the Merck Manual Medisch handboek has also been available online. The Merck Manual Medisch handboek is a reference book that is not too simple for the doctor and, at the same time, not too difficult for the patient. The Merck Manual Medisch handboek provides insight into the causes and treatment of more than 3000 disorders, in pleasantly readable, understandable Dutch. No other medical reference book covers such a wide variety of illnesses. The site makes thorough preparation for a visit to the doctor possible and lets patients and care providers exchange views, on an informed level, about the many research and treatment possibilities in modern medicine.

Appendix D Medical Annotations

Two annotated examples of texts from the Merck Manual.

First Example:
[Geneesmiddelen ter bestrijding van infecties zijn onder andere gericht tegen bacteriën, virussen en schimmels.]48A [Deze geneesmiddelen zijn zo gemaakt dat ze zo toxisch mogelijk zijn voor het infecterende micro-organisme en zo veilig mogelijk voor menselijke cellen.]48B [Ze zijn dus zo gemaakt dat ze selectief toxisch zijn.]48C [Productie van geneesmiddelen met selectieve toxiciteit om bacteriën en schimmels te bestrijden, is relatief gemakkelijk, omdat bacteriële en schimmelcellen sterk van menselijke cellen verschillen.]48D [Het is echter zeer moeilijk om een geneesmiddel te maken dat een virus bestrijdt zonder de geïnfecteerde cel en daardoor ook andere menselijke cellen aan te tasten.]48E
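The bracketed segments above, each closed by an identifier such as 48A (example number plus segment letter), can be extracted mechanically. The sketch below is illustrative only: the regular expression and the function name parse_segments are assumptions for this appendix, not tooling used in the thesis.

```python
import re

# Each annotated segment looks like "[segment text]48A": bracketed text
# followed by an identifier consisting of digits and a capital letter.
SEGMENT = re.compile(r"\[(?P<text>[^\]]+)\](?P<sid>\d+[A-Z])")

def parse_segments(annotated):
    """Return (segment id, segment text) pairs in document order."""
    return [(m.group("sid"), m.group("text").strip())
            for m in SEGMENT.finditer(annotated)]

if __name__ == "__main__":
    example = "[Ze zijn dus zo gemaakt dat ze selectief toxisch zijn.]48C"
    print(parse_segments(example))
```

Such a list of (id, text) pairs gives the elementary units between which structural relations can then be annotated or recognized.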
Second Example:
[De meeste therapieën voor overgewicht zijn gericht op wijziging van het eetgedrag.]49A [De nadruk ligt gewoonlijk meer op permanente veranderingen met betrekking tot eetgewoonten en op meer lichaamsbeweging dan op een dieet.]49B [Er wordt de mensen aangeleerd hoe ze zich geleidelijk betere eetgewoonten kunnen aanwennen door meer complexe koolhydraten (fruit, groenten, brood en pasta) te eten en het gebruik van vet te verminderen.]49C [Voor mensen met een licht overgewicht wordt slechts een kleine beperking van de hoeveelheid calorieën en vetten aanbevolen.]49D

Appendix E Dutch Texts

A selection of Dutch texts is included per genre; for each genre, two examples are given. Two examples of each type of other medical text are also included.

Fairy Tales
First example:
Ver weg, in India, in een oerwoud dat nimmer door mensenvoeten is betreden, ligt een klein stil meer, als geslepen uit blauw kristal. Het weelderige lommer langs de kanten wordt rimpelloos weerspiegeld, omdat zelfs de rusteloze wind zijn adem inhoudt bij het zien van zoveel schoonheid. Op het meer drijven zeven waterlelies. En die waterlelies zijn het, waaraan dit sprookje zijn bestaan dankt. Maar laat ik bij het begin beginnen. Je moet weten dat in dat oerwoud een heks woonde die zó lelijk was, dat ze alleen ’s nachts uit haar schuilplaats tevoorschijn durfde te komen. Haar neus was zo groot en krom, dat hij haar puntige kin bijna raakte en haar haren waren bleek en grauw als een bundeltje uitgedroogd gras. Als de maan hoog aan de hemel stond en de krekels hun avondconcert hadden beëindigd, beklom de heks een kale rots in de nabijheid van het meer. Daarop keerde zij haar schrikwekkende gezicht naar de hemel, blikte een tijdlang star omhoog, verschool zich dan in een holle boomstam en begon te zingen. In tegenstelling tot haar afschuwelijke verschijning klonk haar gezang zo mooi, dat geen dichter het ooit zou kunnen beschrijven.
Het was of alle nachtegalen van het oerwoud zich verenigd hadden in die ene wonderlijke stem: een stem die de macht had om ieder die het maar hoorde, te betoveren. Dat wist de oude heks. Ze wist ook, dat bij volle maan de Maanfee met haar gevolg van sterren op aarde neerdaalde. Dan dansten zij op de waterspiegel van dat oerwoudmeer waarvan de heks slechts een paar passen was verwijderd. Er viel dan een heldere manestraal op het meer, omgeven door talloze lichtende sterren die, zodra zij het water raakten, de vorm aannamen van kleine feeën. Ze droegen zilveren jurkjes die oogverblindend schitterden en ze dansten sierlijk en speels op de tonen van de heksenzang. In hun midden wervelde een ranke gestalte, gekleed in zilveren sluiers, met op het hoofd een stralende kroon waarop een sikkeltje glansde. Dat was de Maanfee zelf. Heel de lange nacht dansten de Maanfee en haar sterrenkinderen en waar hun voetjes het water raakten ontstonden zilveren kringetjes, die zich vermenigvuldigden, wijder en wijder werden. Om ten slotte tussen het riet te verdwijnen. Zo dansten zij, tot de morgenstond de maan deed verbleken. Hoe meer de nacht vorderde, hoe luider en bezwerender de heks zong. Het was haar echter nooit gelukt de Maanfee en haar sterrenkinderen in haar ban te krijgen en ze wist maar al te goed dat de Maanfee, wanneer de dag in aantocht was, met haar gevolg naar de hemel terugkeerde.

Second example:
Er was eens een heel lief en mooi meisje die na het overlijden van haar vader bij haar gemene stiefmoeder en stiefzusters woonde. De naam van het meisje was Assepoester. Ze woonden in een groot huis in een klein dorpje bij het paleis. Assepoester moest in haar eentje het hele huis altijd schoonhouden. Voor haar lelijke en luie stiefmoeder en stiefzusters moest zij alle vieze en vermoeiende karweitjes opknappen.
Op een dag verschenen er overal in de stad aanplakbiljetten waarin iedereen werd uitgenodigd om op het grote feest te komen die in het paleis werd gegeven. Op dit feest zou de prins op zoek gaan naar een geschikte kandidaat om te trouwen. Het hele huis was in rep en roer. De gemene stiefmoeder dacht dat één van haar dochters wel geschikt zou zijn voor de prins. ”Jij mag niet mee Assepoester, zo een lelijk wicht als jij maakt toch geen kans” zei de gemene stiefmoeder ”Ga jij maar snel de jurken van mij en je stiefzusters opknappen en wassen”. Assepoester was de hele week in de weer om de jurken te verstellen en mooi te maken. De gemene stiefzusters pestte haar de hele dag dat zij wel naar het bal gingen en Assepoester niet. Op de dag van het bal was Assepoester heel verdrietig. De gemene stiefzusters hadden dit door en lachte haar hard uit. Verdrietig keek Assepoester hoe de gemene stiefmoeder en stiefzusters lachend in de koets stapte en vertrokken naar het paleis. ”Huil niet meisje” hoorde Assepoester ineens achter haar. Ze draaide zich om en daar stond een vriendelijk fee met een toverstokje ”Ik ben je petemoe en ik ga ervoor zorgen dat jij naar het feest kan”. Assepoester kreeg een dikke glimlach op haar gezicht. Ze kon het eigenlijk niet geloven. Met een tikje van haar toverstaf, toverde de vriendelijke fee Assepoester in een schitterende jurk. Het haar van Assepoester was ook meteen mooi gekamd en gekapt. Aan haar voetjes had zij ineens hele mooie glazen muiltjes. Assepoester danste in het rond van geluk. ”Ooohhh petemoe wat ben ik mooi gekleed zo” ”Nu nog een mooie koets met koetsier en paarden” zei de vriendelijk fee en met een zwaai toverde zij een pompoen en 5 muizen in een mooie koets met koetsier en 4 paarden.
”Ga nu snel mijn kind maar vergeet niet dat de betovering maar tot twaalf uur middernacht duurt en geen seconde later”

Weather Forecasts
First example:
Onder invloed van een hogedrukwig krijgen we woensdag rustig en vrij mooi weer al kan de ochtendmist en de lage bewolking nog een groot deel van de voormiddag blijven hangen, vooral dan in het oosten van het land. Donderdag begint nog mooi, opnieuw met plaatselijk ochtendmist, maar in de loop van de dag is er kans op wat lichte regen. Vrijdag is het eerst nog vrij mooi maar later op de dag neemt de kans toe op regen of buien. Ook zaterdag vallen er een aantal buien. Tijdens heel de periode blijft het vrij zacht voor de tijd van het jaar met maxima rond 20 graden.

Second example:
De maand is voorlopig 4 graden warmer dan het langjarig gemiddelde. De hoogste temperatuur (30,2 graden) werd op 12 september gemeten in Ell (L). De temperaturen van de 21 (Arcen 27,5) en 22 september (Twente 28,9, De Bilt 26,3 graden) horen bij de hoogste voor die tijd. De hoogste temperatuur ooit in De Bilt tussen 21-30 september is 27,3 graden op 24 september 1949. Landelijk is voor eind september 30,4 graden het hoogtepunt voor op 26 september 1967 in Buchten. Ook in 2003 beleefde ons land tussen 17 en 22 september een uitzonderlijk warme oudewijvenzomer met temperaturen tussen 25 en 31 graden.

Recipes
First example:
Smelt de boter in een braadpan en bak de uien al omscheppend lichtbruin. Leg de braadworsten ertussen en bak ze rondom bruin. Voeg de tijm, de bouillon en peper en zout naar smaak toe en laat de uien en de worst in 20 min. gaar worden. Garneer met de peterselie. Lekker met aardappelpuree.

Second example:
Ingrediënten: 500 gram bruine bonen, 3 liter water, 15 gram zout, 2 laurierbladeren, 2 kruidnagels, 1 Spaanse peper, 250 gram aardappelen, 1 grote ui, 40 gram boter, 1 selderijknol, peper, zout, Worcesterschire saus. Bereidingswijze: Was de bonen onder koud water en laat ze 24 uur weken.
Breng de bonen in het weekwater aan de kook met het zout en de kruiden (tijdens de bewerking moeten de kruiden verwijderd worden, doe ze daarom in een thee-ei of een linnen zakje). Laat de bonen niet te gaar worden. Schil de aardappelen, snijd ze in blokjes, doe ze bij de bonen en laat het geheel nog een half uurtje doorkoken. Snij de ui en fruit deze in de boter. Schil de selderijknol en snijd hem in stukjes. Haal de kruiden uit de soep en pureer in het water de bonen en de aardappelen met een pureestamper. Voeg de stukken knolselderij toe en ook de gefruite ui. Laat de massa nog een kwartiertje doorkoken totdat het goed gebonden is. Breng de soep op smaak met zout, peper en als u daarvan houdt een scheutje Worcesterschire saus. Strooi vlak voor het serveren de fijngehakte peterselie erover. Geef er geroosterd wittebrood of vers stokbrood bij.

Newspaper Articles
First example:
AMSTERDAM - Nog nooit iemand zo verweesd naar een bos bloemen zien kijken als Jaap Stam zaterdag na afloop van de ontluisterende nederlaag tegen Ierland. De international keek eens op naar het meisje dat hem de ruikers overhandigde en moffelde het bosje vervolgens vakkundig weg onder de stoel waarin hij even daarvoor met een diepe zucht was neergezegen. Arme Stam. Het ergste moest nog komen. Staand in de middencirkel werden de voetballers van Oranje uitgezwaaid richting Portugal, maar een hels fluitconcert daalde neer op de ploeg die zojuist door een veredeld jeugdteam met 0-1 was verslagen. Ruud van Nistelrooij nam de strafexpeditie nuchter op: Het is nooit leuk om uitgefloten te worden door je eigen publiek, maar dat hebben we zelf in de hand. Als we de sterren van de hemel spelen, hebben we er geen last van. Tot overmaat van ramp betrad ook zanger Frans Bauer nog het veld om zijn hits ’Een onsje geluk’ en ’Heb je even voor mij?’ ten gehore te brengen.
De internationals, nog steeds verzameld op het midden van het veld, keken beteuterd toe hoe het merendeel van de 48.000 toeschouwers zich wél vermaakte met de volkszanger. Stam kon het niet meer aanzien en wandelde voor een slok water naar de zijlijn. Van mij hoeft die poespas niet. Ook niet als we met 3-0 gewonnen hadden. Maar ik beslis die dingen niet. Dat uitfluiten is niet leuk, maar wel begrijpelijk. De laatste oefenwedstrijd voor het EK ’88 ging ook verloren. Toen werd de selectie eveneens uitgefloten. Er is nog hoop. Van Nistelrooij zei niet te hebben overwogen om de misplaatste toegift te ontlopen. Als al die mensen blijven zitten om ons uit te zwaaien, ga ik niet naar binnen. Het laatste woord was tenslotte aan aanvoerder Phillip Cocu, die de fans bedankte voor hun steun: Hoe moeilijk dat ook was. Ik had daar liever gestaan na een 3-0 overwinning. Maar wat ik zei, was wel gemeend. Het doet me nog steeds wat als het stadion helemaal vol zit en oranje kleurt. We hebben de hulp van het legioen straks hard nodig in Portugal. Dat was het understatement van de dag.

Second example:
Prijzenslag
Nieuwe ronde in prijzenslag (AD, 18-9). Volgens AH-topman Dick Boer waarderen de klanten van Albert Heijn de aanpak om de prijzen te verlagen. Ik kan daar als klant voor een deel in meegaan. Maar gezonde concurrentie kan alleen bestaan als alle bij het proces betrokkenen er beter van worden. En niet alleen de consument. Stel dat naast Albert Heijn er nog slechts wat kleine buurtwinkels in enclaves zouden overblijven. Moeten we daar dan gelukkig mee zijn? Als delen van de keten moeten afhaken omdat er een eind is gekomen aan hun inventiviteit en inleveringsvermogen. Oftewel: weg concurrentie! Weg bekende merken? En, misschien ook weg de noodzaak van lage prijzen?

Winkler Prins
First example:
geslachtsdrift genps of libido, mate waarin iemand behoefte heeft aan seksueel contact (zie *seksualiteit).
De seksuele activiteit wordt in overwegende mate bepaald door in de loop van het leven opgedane ervaringen, door de houding van anderen en hun normen, alsmede door de hormonale activiteit van het hersenaanhangsel en de geslachtsklieren. De geslachtsdrift kan afnemen onder invloed van emoties (angst, neerslachtigheid). Ook onder invloed van de menstruele cyclus verandert de geslachtsdrift. Zie ook *frigiditeit; *impotentie.

Second example:
HARTMASSAGE: Uitwendige hartmassage wordt verricht door middel van een ritmisch drukken (met twee handen) op het onderste gedeelte van het borstbeen. Men drukt hierbij 80 tot 100 maal per minuut (rechtsboven), waardoor de borstkas in hetzelfde tempo wordt samengedrukt. Let op: niet leunen! Bij het loslaten (linksonder) veert de borstkas weer omhoog, waardoor het hart meer ruimte in kan nemen en bloed aanzuigt. Bij de volgende drukverhoging wordt het bloed de aorta en longslagader ingeperst. bij hartstilstand of lage hartfrequentie (minder dan 20 slagen per minuut) het ritmisch ca. 80–100 maal per minuut samendrukken van de borstkas, waardoor het *hart leeggedrukt wordt en de pompfunctie gedeeltelijk blijft bestaan. Hierbij wordt het onderste gedeelte van het borstbeen met een snelle stevige stoot van beide handpalmen ongeveer 5 cm ingedrukt. Gelijktijdig wordt beademing toegepast, hetzij in de vorm van mond-op-mond-beademing (zie *beademing), ca. 20 maal per minuut, hetzij door toediening van zuurstof door een anesthesist na intubatie.

Wikipedia
First example:
Hersenen van gewervelden ontvangen signalen van de ’sensors’ (receptoren) van het organisme via de zenuwen. Deze signalen worden door het centrale zenuwstelsel geïnterpreteerd waarna reacties worden geformuleerd, gebaseerd op reflexen en aangeleerde kennis. Eenzelfde soort systeem bezorgt aansturende boodschappen vanuit de hersenen bij de spieren in het hele lichaam.
Sensorische input wordt verwerkt door de hersenen voor de herkenning van gevaar, het vinden van voedsel, het identificeren van mogelijke partners en voor meer verfijnde functies. Gezichts-, gevoels- en gehoorsinformatie gaat eerst naar specifieke kernen van de thalamus en daarna naar gebieden van de cerebrale cortex die bij dat specifieke sensorische systeem horen. Reukinformatie (fylogenetisch het oudste systeem) gaat eerst naar de bulbus olfactorius, daarna naar andere delen van het olfactorisch systeem. Smaak wordt via de hersenstam geleid naar andere delen van het betreffende systeem. Om bewegingen te coördineren hebben de hersenen een aantal parallelle systemen die spieren besturen. Het motorisch systeem bestuurt de willekeurige bewegingen van spieren, geholpen door de motorische schors, de kleine hersenen (het cerebellum) en de basale ganglia. Uiteindelijk projecteert het systeem via het ruggenmerg naar de zogenaamde spiereffectors. Kernen in de hersenstam besturen veel onwillekeurige spierfuncties zoals de ademhaling. Daarnaast kunnen veel automatische handelingen zoals reflexen gestuurd worden door het ruggenmerg. De hersenen produceren ook een deel van de hormonen die organen en klieren beïnvloeden. Aan de andere kant reageren de hersenen ook op hormonen die elders in het lichaam geproduceerd zijn. Bij zoogdieren worden de meeste hormonen uitgescheiden in de bloedsomloop; de besturing van veel hormonen verloopt via de hypofyse.

Second example:
Een slagader of arterie is een bloedvat dat zorgt voor het transport van bloed van het hart naar de rest van het lichaam. Het arteriestelsel voert bloed vanuit het lichaam naar de gebruikers, nl. de organen en weefsels. De naam ’slagader’ verwijst naar het feit dat men aan een arterie het hart kan voelen kloppen, omdat de daarmee gepaard gaande drukwisselingen zich in de arterieën voortplanten. De diameter van een slagader is ongeveer vier millimeter.
Hieromheen zijn wanden gelegen, welke ongeveer één millimeter dik zijn. Om de ruimte waar het bloed stroomt, zit eerst een enkele laag endotheelcellen. Hieromheen is een laag glad spierweefsel. Deze laag is dikker dan de gladde spiercellaag van de aders. Hieromheen, als buitenste laag, zit een laag bindweefselcellen.