320 pages - Institutionen för filosofi, lingvistik och vetenskapsteori
GOTHENBURG MONOGRAPHS IN LINGUISTICS 24

Automatic Detection of Grammar Errors in Primary School Children's Texts: A Finite State Approach

Sylvana Sofkova Hashemi

Doctoral Dissertation
Publicly defended in Lilla Hörsalen, Humanisten, Göteborg University, on June 7, 2003, at 10.15, for the degree of Doctor of Philosophy
Department of Linguistics, Göteborg University, Sweden
ISBN 91-973895-5-2
© 2003 Sylvana Sofkova Hashemi
Typeset by the author using LaTeX
Printed by Intellecta Docusys, Göteborg, Sweden, 2003

Abstract

This thesis concerns the analysis of grammar errors in Swedish texts written by primary school children and the development of a finite state system for finding such errors. Grammar errors are more frequent for this group of writers than for adults, and the distribution of error types is different in children's texts. In addition, other writing errors above the word level are discussed here, including punctuation errors and spelling errors resulting in existing words. The method used in the implemented tool FiniteCheck involves subtraction of finite state automata that represent grammars with varying degrees of detail, creating a machine that classifies phrases in a text containing certain kinds of errors. The current version of the system handles errors concerning agreement in noun phrases and verb selection of finite and non-finite forms. At the lexical level, we attach all lexical tags to words and do not use a tagger, which could eliminate information in incorrect text that might be needed later to find the error. At higher levels, structural ambiguity is treated by parsing order, grammar extension and some other heuristics. The simple finite state technique of subtraction has the advantage that the grammars one needs to write to find errors are always positive, describing the valid rules of Swedish rather than the structure of errors. The rule sets remain quite small and practically no prediction of errors is necessary.
The linguistic performance of the system is promising: for the error types implemented, it shows results comparable to other Swedish grammar checking tools when tested on a small adult text not previously analyzed by the system. The performance of the other Swedish tools was also tested on the children's data collected for this study, revealing quite low recall rates. This fact motivates the need to adapt grammar checking techniques to children, whose errors differ from those of adult writers and pose more of a challenge to current grammar checkers, which are oriented towards texts written by adults. The robustness and modularity of FiniteCheck make it possible to perform both error detection and diagnostics. Moreover, the grammars can in principle be reused for other applications that do not necessarily have anything to do with error detection, such as extracting information from a given text or even parsing.

KEYWORDS: grammar errors, spelling errors, punctuation, children's writing, Swedish, language checking, light parsing, finite state technology

Acknowledgements

Work on this thesis would not have been possible without contributions, support and encouragement from many people. The idea of developing a writing tool to support children in their text production and grammar emerged from a study of how primary school children write by hand in comparison to when they use a computer. Special thanks to my colleague Torbjörn Lager, who inspired me to do this study and whose children attended the school where I gathered my data. My main supervisor Robin Cooper awakened the idea of using finite state methods for grammar checking and launched the collaboration with the Xerox research group.
I want to express my greatest gratitude to him for inspiring discussions during project meetings and supervision sessions, and for his patience with my writing, struggling to understand every bit of it, always raising questions and always full of new exciting ideas. I really enjoyed our discussions and look forward to more. I would also like to thank my assistant supervisor Elisabet Engdahl, who carefully read my writing and made sure that I expressed myself more clearly.

Many thanks to all my colleagues at the Department of Linguistics for creating an inspiring research environment with interesting projects, seminars and conferences. I especially want to mention Leif Grönqvist for being the helping hand next door whenever needed, Robert Andersson for being my project colleague, Stina Ericsson for the loan of a LaTeX manual and for always being helpful, Ulla Veres for help with the recruitment of new victims for writing experiments, Jens Allwood and Elisabeth Ahlsén for introducing me to the world of transcription and coding, Sally Boyd, Nataliya Berbyuk and Ulrika Ferm for support and encouragement, Shirley Nicholson for always being available with books and also milk for coffee, and Pia Cromberger for always being ready for a chat. A special thanks to Ylva Hård af Segerstad for fruitful discussions leading to future collaboration that I am looking forward to, and for being a friend. I also want to thank the children in my study and their teachers for providing me with their text creations, and Sven Strömqvist and Victoria Johansson for sharing their data collection. A special thanks to Genie Perdin, who carefully proofread this thesis and gave me some encouraging last-minute 'kicks'. I also want to thank all my friends, who reminded me now and then about life outside the university. My deepest gratitude to my family for being there for me and for always believing in me. My husband Ali - I know the way was long and there were times I could be distant, but I am back.
My daughter Sarah, for being the sunshine of my life, my inspiration, my everything. My mother, father, sister and my big little brother ...

Sylvana Sofkova Hashemi
Göteborg, May 2003

Table of Contents

1 Introduction
1.1 Written Language in a Computer Literate Society
1.2 Aim and Scope of the Study
1.3 Outline of the Thesis

Part I: Writing

2 Writing and Grammar
2.1 Introduction
2.2 Research on Writing in General
2.3 Written Language and Computers
2.3.1 Learning to Write
2.3.2 The Influence of Computers on Writing
2.4 Studies of Grammar Errors
2.4.1 Introduction
2.4.2 Primary and Secondary Level Writers
2.4.3 Adult Writers
2.5 Conclusion

3 Data Collection and Analysis
3.1 Introduction
3.2 Data Collection
3.2.1 Introduction
3.2.2 The Sub-Corpora
3.3 Error Categories
3.3.1 Introduction
3.3.2 Spelling Errors
3.3.3 Grammar Errors
3.3.4 Spelling or Grammar Error?
3.3.5 Punctuation
3.4 Types of Analysis
3.5 Error Coding and Tools
3.5.1 Corpus Formats
3.5.2 CHAT-format and CLAN-software

4 Error Profile of the Data
4.1 Introduction
4.2 General Overview
4.3 Grammar Errors
4.3.1 Agreement in Noun Phrases
4.3.2 Agreement in Predicative Complement
4.3.3 Definiteness in Single Nouns
4.3.4 Pronoun Case
4.3.5 Verb Form
4.3.6 Sentence Structure
4.3.7 Word Choice
4.3.8 Reference
4.3.9 Other Grammar Errors
4.3.10 Distribution of Grammar Errors
4.3.11 Summary
4.4 Child Data vs. Other Data
4.4.1 Primary and Secondary Level Writers
4.4.2 Evaluation Texts of Proof Reading Tools
4.4.3 Scarrie's Error Database
4.4.4 Summary
4.5 Real Word Spelling Errors
4.5.1 Introduction
4.5.2 Spelling in Swedish
4.5.3 Segmentation Errors
4.5.4 Misspelled Words
4.5.5 Distribution of Real Word Spelling Errors
4.5.6 Summary
4.6 Punctuation
4.6.1 Introduction
4.6.2 General Overview of Sentence Delimitation
4.6.3 The Orthographic Sentence
4.6.4 Punctuation Errors
4.6.5 Summary
4.7 Conclusions

Part II: Grammar Checking

5 Error Detection and Previous Systems
5.1 Introduction
5.2 What Is a Grammar Checker?
5.2.1 Spelling vs. Grammar Checking
5.2.2 Functionality
5.2.3 Performance Measures and Their Interpretation
5.3 Possibilities for Error Detection
5.3.1 Introduction
5.3.2 The Means for Detection
5.3.3 Summary and Conclusion
5.4 Grammar Checking Systems
5.4.1 Introduction
5.4.2 Methods and Techniques in Some Previous Systems
5.4.3 Current Swedish Systems
5.4.4 Overview of The Swedish Systems
5.4.5 Summary
5.5 Performance on Child Data
5.5.1 Introduction
5.5.2 Evaluation Procedure
5.5.3 The Systems' Detection Procedures
5.5.4 The Systems' Detection Results
5.5.5 Overall Detection Results
5.6 Summary and Conclusion

6 FiniteCheck: A Grammar Error Detector
6.1 Introduction
6.2 Finite State Methods and Tools
6.2.1 Finite State Methods in NLP
6.2.2 Regular Grammars and Automata
6.2.3 Xerox Finite State Tool
6.2.4 Finite State Parsing
6.3 System Architecture
6.3.1 Introduction
6.3.2 The System Flow
6.3.3 Types of Automata
6.4 The Lexicon
6.4.1 Composition of The Lexicon
6.4.2 The Tagset
6.4.3 Categories and Features
6.5 Broad Grammar
6.6 Parsing
6.6.1 Parsing Procedure
6.6.2 The Heuristics of Parsing Order
6.6.3 Further Ambiguity Resolution
6.6.4 Parsing Expansion and Adjustment
6.7 Narrow Grammar
6.7.1 Noun Phrase Grammar
6.7.2 Verb Grammar
6.8 Error Detection and Diagnosis
6.8.1 Introduction
6.8.2 Detection of Errors in Noun Phrases
6.8.3 Detection of Errors in the Verbal Head
6.9 Summary

7 Performance Results
7.1 Introduction
7.2 Initial Performance on Child Data
7.2.1 Performance Results: Phase I
7.2.2 Grammatical Coverage
7.2.3 Flagging Accuracy
7.3 Current Performance on Child Data
7.3.1 Introduction
7.3.2 Improving Flagging Accuracy
7.3.3 Performance Results: Phase II
7.4 Overview of Performance on Child Data
7.5 Performance on Other Text
7.5.1 Performance Results of FiniteCheck
7.5.2 Performance Results of Other Tools
7.5.3 Overview of Performance on Other Text
7.6 Summary and Conclusion

8 Summary and Conclusion
8.1 Introduction
8.2 Summary
8.2.1 Introduction
8.2.2 Children's Writing Errors
8.2.3 Diagnosis and Possibilities for Detection
8.2.4 Detection of Grammar Errors
8.3 Conclusion
8.4 Future Plans
8.4.1 Introduction
8.4.2 Improving the System
8.4.3 Expanding Detection
8.4.4 Generic Tool?
8.4.5 Learning to Write in the Information Society

Bibliography

Appendices
A Grammatical Feature Categories
B Error Corpora
B.1 Grammar Errors
B.2 Misspelled Words
B.3 Segmentation Errors
C SUC Tagset
D Implementation
D.1 Broad Grammar
D.2 Narrow Grammar: Noun Phrases
D.3 Narrow Grammar: Verb Phrases
D.4 Parser
D.5 Filtering
D.6 Error Finder

List of Tables

3.1 Child Data Overview
4.1 General Overview of Sub-Corpora
4.2 General Overview by Age
4.3 General Overview of Spelling Errors in Sub-Corpora
4.4 General Overview of Spelling Errors by Age
4.5 Number Agreement in Swedish
4.6 Gender Agreement in Swedish
4.7 Definiteness Agreement in Swedish
4.8 Noun Phrases with Proper Nouns as Head
4.9 Noun Phrases with Pronouns as Head
4.10 Noun Phrases without (Nominal) Head
4.11 Agreement in Partitive Noun Phrase in Swedish
4.12 Gender and Number Agreement in Predicative Complement
4.13 Personal Pronouns in Swedish
4.14 Finite and Non-finite Verb Forms
4.15 Tense Structure
4.16 Fa-sentence Word Order
4.17 Af-sentence Word Order
4.18 Distribution of Grammar Errors in Sub-Corpora
4.19 Distribution of Grammar Errors by Age
4.20 Examples of Grammar Errors in Teleman's Study
4.21 Examples of Grammar Errors from the Skrivsyntax Project
4.22 Grammar Errors in the Evaluation Texts of Grammatifix
4.23 Grammar Errors in Granska's Evaluation Corpus
4.24 General Error Ratio in Grammatifix, Granska and Child Data
4.25 Three Error Types in Grammatifix, Granska and Child Data
4.26 Grammar Errors in Scarrie's ECD and Child Data
4.27 Examples of Spelling Error Categories
4.28 Spelling Variants
4.29 Distribution of Real Word Segmentation Errors
4.30 Distribution of Real Word Spelling Errors in Sub-Corpora
4.31 Distribution of Real Word Spelling Errors by Age
4.32 Sentence Delimitation in the Sub-Corpora
4.33 Sentence Delimitation by Age
4.34 Major Delimiter Errors in Sub-Corpora
4.35 Major Delimiter Errors by Age
4.36 Comma Errors in Sub-Corpora
4.37 Comma Errors by Age
5.1 Summary of Detection Possibilities in Child Data
5.2 Overview of the Grammar Error Types in Grammatifix (GF), Granska (GR) and Scarrie (SC)
5.3 Overview of the Performance of Grammatifix, Granska and Scarrie
5.4 Performance Results of Grammatifix on Child Data
5.5 Performance Results of Granska on Child Data
5.6 Performance Results of Scarrie on Child Data
5.7 Performance Results of Targeted Errors
6.1 Some Expressions and Operators in XFST
6.2 Types of Directed Replacement
6.3 Noun Phrase Types
7.1 Performance Results on Child Data: Phase I
7.2 False Alarms in Noun Phrases: Phase I
7.3 False Alarms in Finite Verbs: Phase I
7.4 False Alarms in Verb Clusters: Phase I
7.5 False Alarms in Noun Phrases: Phase II
7.6 False Alarms in Finite Verbs: Phase II
7.7 False Alarms in Verb Clusters: Phase II
7.8 Performance Results on Child Data: Phase II
7.9 Performance Results of FiniteCheck on Other Text
7.10 Performance Results of Grammatifix on Other Text
7.11 Performance Results of Granska on Other Text
7.12 Performance Results of Scarrie on Other Text

List of Figures

3.1 Principles for Error Categorization
4.1 Grammar Error Distribution
4.2 Error Density in Sub-Corpora
4.3 Error Density in Age Groups
4.4 Three Error Types in Grammatifix (black line), Granska (gray line) and Child Data (white line)
4.5 Error Distribution of Selected Error Types in Scarrie
4.6 Error Distribution of Selected Error Types in Child Data
6.1 The System Architecture of FiniteCheck
7.1 False Alarms: Phase I vs. Phase II
7.2 Overview of Recall in Child Data
7.3 Overview of Precision in Child Data
7.4 Overview of Overall Performance in Child Data
7.5 Overview of Recall in Other Text
7.6 Overview of Precision in Other Text
7.7 Overview of Overall Performance in Other Text

Chapter 1
Introduction

1.1 Written Language in a Computer Literate Society

Written language plays an important role in our society. A great deal of our communication occurs by means of writing, which, besides the traditional paper and pen, is facilitated by the computer, the Internet and other applications such as, for instance, the mobile phone.[1] Word processing and sending messages via email are among the most common activities on computers. Other communication media that enable written communication are also becoming popular, such as webchat and instant messaging on the Internet or text messaging (Short Message Service, SMS) via the mobile phone. The present doctoral dissertation concerns word processing on computers, in particular the linguistic tools integrated in such authoring aids.
The use of word processors for writing in both educational and professional settings modifies the process, practice and acquisition of writing. With a word processor, it is not only easy to produce a text with a neat layout; the writer is supported throughout the whole writing process. Text may be restructured and revised at any time during text production without leaving any trace of the changes that have been made. Text may be reused and a new text composed by cutting and pasting passages. Iconic material such as pictures[2] (or even sounds) can be inserted, and linguistic aids can be used for proofreading a text.

Writing acquisition can also be enhanced by the use of a word processor. For instance, the focus shifts from the more technical aspects of writing, such as physically shaping letters with a pen, toward the more cognitive processes of text production, enabling the writer to apply the whole language register. Writing on a computer generally enhances the motivation to write, revise or completely change a text (cf. Wresch, 1984; Daiute, 1985; Severinson Eklundh, 1993; Pontecorvo, 1997).

The status of written language in our modern information society has changed. In contrast to ancient times, writing is no longer reserved for just a small minority of professional groups (e.g. priests and monks, bankers, important merchants). In particular, the emergence of computers in writing has led to the involvement of new user groups besides today's writing professionals such as journalists, novelists and scientists. We write more nowadays in general, and the freedom of and control over one's own writing has increased.

[1] Studies of computer-mediated communication are provided by e.g. Severinson Eklundh (1994), Crystal (2001) and Herring (2001). A recent dissertation by Hård af Segerstad (2002) explores in particular how written Swedish is used in email, webchat and SMS.
[2] Smileys or emoticons (e.g. :-) "happy face") are widely used in computer-mediated communication.
Texts are produced rapidly and are less often proofread by a careful secretary with knowledge of language. This is sometimes reflected in the quality and correctness of the resulting text (cf. Severinson Eklundh, 1995). Linguistic tools that check mechanics, grammar and style have taken over the secretarial function to some degree and are usually integrated in word processing software. Spelling checkers and hyphenators, which check writing mechanics and identify violations in individual words, have existed for some time now. Grammar checkers, which recognize syntactic errors and often also violations of punctuation, word capitalization conventions, number and date formatting and other style-related issues, and thus work above the word level, are a rather new technology, especially for smaller languages like Swedish.

Grammar checking tools for languages such as English, French, Dutch, Spanish and Greek were being developed in the 1980s, whereas research on Swedish writing aids aimed at grammatical deviance started quite recently. In addition to the present work, there are three research groups working in this area. The Department of Numerical Analysis and Computer Science (NADA) at the Royal Institute of Technology (KTH), with a long tradition of research in writing and authoring aids, is responsible for Granska. Development of this tool has occurred over a series of projects starting in 1994 (Domeij et al., 1996, 1998; Carlberger et al., 2002). The Department of Linguistics, Uppsala University was involved in an EU-sponsored project, Scarrie, between 1996 and 1999. The goal of this project was the development of language tools for Danish, Norwegian and Swedish (Sågvall Hein, 1998a; Sågvall Hein et al., 1999). Finally, the Finnish language engineering company Lingsoft Inc. developed Grammatifix.
Initiated in 1997 and completed in 1999, this tool was released on the market in November 1998, and has been part of the Swedish Microsoft Office package since 2000 (Arppe, 2000; Birn, 2000).

The three Swedish systems mainly use parsing techniques with some degree of feature relaxation and/or explicit error rules for the detection of errors. Grammatifix and Granska are developed as generic tools and are tested on adult (mostly professional) texts. Scarrie's end-users are professional writers from newspapers and publishing firms.

1.2 Aim and Scope of the Study

The primary purpose of the present work is to detect grammar errors by means of linguistic descriptions of correct language use rather than descriptions of the structure of errors. The ideal would be to develop a generic method for the detection of grammar errors in unrestricted text that could be applied to different writing populations displaying different error types, without the need to rewrite the grammars of the system. That is, instead of describing the errors made by different groups of writers, resulting in distinct sets of error rules, the same grammar set is used for detection. This approach of identifying errors in text without explicitly describing them contrasts with the other three Swedish grammar checkers. Using this method, we will hopefully cover many different cases of errors and minimize the possibility of overlooking some of them.

We chose primary school children as the target population, a new group of users not covered by the previous Swedish projects. Children, as beginning writers, are still in the process of acquiring written language, unlike adult writers, and will probably produce relatively more errors, and errors of a different kind, than adults. Their writing errors probably have more to do with competence than performance. Grammar checkers for this group have to have different coverage and concentrate on different kinds of errors.
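The core idea of the approach sketched above — describe only valid Swedish, and let the difference between a permissive and a strict grammar expose the errors — can be illustrated with a toy example over regular languages. The following sketch is illustrative only: it uses plain Python set difference over a tiny, hypothetical two-word lexicon, not the actual FiniteCheck automata or its grammars.

```python
# Toy illustration of error detection by "grammar subtraction".
# The lexicon and feature tags below are hypothetical examples,
# not part of the FiniteCheck implementation.
from itertools import product

# Determiners and nouns tagged with grammatical gender:
# utr = utrum (common gender), neu = neutrum (neuter).
DETS  = {"en": "utr", "ett": "neu"}
NOUNS = {"bil": "utr", "hus": "neu"}   # 'car' (utrum), 'house' (neutrum)

# Broad grammar: any determiner followed by any noun (structure only,
# no feature checking).
broad = {f"{d} {n}" for d, n in product(DETS, NOUNS)}

# Narrow grammar: determiner and noun must agree in gender.
narrow = {f"{d} {n}" for d, n in product(DETS, NOUNS)
          if DETS[d] == NOUNS[n]}

# "Subtraction": phrases accepted by the broad grammar but not by the
# narrow one are exactly the ill-formed noun phrases.
errors = broad - narrow
print(sorted(errors))   # → ['en hus', 'ett bil']
```

Both grammars are positive descriptions of structure; the agreement errors (*en hus*, *ett bil*) are never written down as error rules but fall out of the difference, which is the property that keeps the rule sets small. In the thesis this subtraction is performed on finite state automata rather than finite sets, so it generalizes to unbounded phrase patterns.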
Further, the positive impact of computers on children's writing opens new opportunities for the application of language technology. The role of proofreading tools for educational purposes is a rather new application area, and this work can be considered a first step in that direction.

Against this background, the main goal of the present thesis is handling children's errors and experimenting with positive grammatical descriptions using finite state techniques. The work is divided into three subtasks: first, an overall error analysis of the collected children's texts; then, exploring the nature of the errors and the possibilities for detecting them; and finally, implementing detection of (some) grammatical error types. Here is a brief characterization of these three tasks:

I. Investigation of children's writing errors: The targeted data for a grammar checker can be selected either by intuitions about errors that will probably occur, or by directly looking at errors that actually occur. In the present work, the second approach of empirical analysis is applied. Texts from pupils at three primary schools were collected and analyzed for errors, focusing on errors above the word level, including grammar errors, spelling errors resulting in existing words, and punctuation. The main focus lies on grammar errors as the basis for implementation. The questions that arise are: What grammar errors occur? How should the errors be categorized? What spelling errors result in lexicalized strings and are not captured by a spelling checker? What is the nature of these? How is punctuation used and what errors occur?

II. Investigation of the possibilities for detection of these writing errors: The nature of the errors will be explored along with the available technology that can be applied in order to detect them. An interesting point is how the errors that are found are handled by current systems. The questions that arise are: What is the nature of the error?
What is the diagnosis of the error? What is needed to be able to detect the error? How are the grammar errors handled by the current Swedish grammar checkers, Grammatifix, Granska and Scarrie?

III. Implementation of the detection of (some) grammar errors: A subset of errors will be chosen for implementation, covering grammar checking up to the level of detecting errors. Detected errors will be given a description of the type of error involved. The implementation will not include any additional diagnosis or any suggestion of how to correct the error. The analysis will be shallow, using finite state techniques. The grammars will describe real syntactic relations rather than the structure of erroneous patterns. The differences between grammars of varying accuracy will reveal the errors, since as finite state automata the grammars can be subtracted from each other. Karttunen et al. (1997a) use this technique to find instances of invalid dates, and this is an attempt to apply their approach to a larger language domain.

The work on this grammar error detector started at the Department of Linguistics at Göteborg University in 1998, in the project Finite State Grammar for Finding Grammatical Errors in Swedish Text, and was a collaboration with the NADA group at KTH in the project Integrated Language Tools for Writing and Document Handling.[3] The present thesis describes both the initial development within this project and its continuation. The main contributions of this thesis concern the understanding of incorrect language use in primary school children's writing and the computational analysis of such incorrect text by means of correct language use, in particular:

• Collection of texts written by primary school children, both by hand and on a computer.
3 This project was sponsored by HSFR/NUTEK Language Technology Programme and has its site at: http://www.nada.kth.se/iplab/langtools/

• Analysis of grammar errors, spelling errors and punctuation in the texts of primary school writers.
• Comparison of errors found in the present data with errors found in other studies on grammar errors.
• Comparison of error types covered by the three Swedish grammar checkers.
• Performance analysis of the three Swedish grammar checkers on the present data.
• Implementation of a grammar error detector that derives/compiles error patterns rather than writing the error grammar by hand.
• Performance analysis of the detector on the collected data and some portion of other data.

1.3 Outline of the Thesis

The remaining chapters of the thesis fall into two parts.

Part I: The first part is devoted to a discussion of writing and an analysis of the collected data and consists of three chapters. Chapter 2 provides a brief introduction to research on writing in general, writing acquisition, how computers influence writing and descriptions of previous findings on grammar errors, concluding with what grammar errors are to be expected in written Swedish. Chapter 3 gives an overview of the data collected and a discussion of error classification. Chapter 4 presents the error profile of the data. The chapter concludes with discussion of the requirements for a grammar error detector for the particular subjects of this study.

Part II: The second part of the thesis concerns grammar checking and includes three chapters. Chapter 5 starts with a general overview of the requirements and functionalities of a grammar checker and what is required for the errors in the present data. Swedish grammar checkers are described and their performance is checked on the present data. Chapter 6 presents the implementation of a grammar error detector that handles these errors, including description of finite state formalism.
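The grammar-subtraction technique underlying this detector can be previewed in miniature. The sketch below is illustrative only: the tiny Swedish vocabulary, its (gender, number) feature coding and the agreement test are invented for the example (definiteness is ignored), and because the toy phrase language is finite, plain set difference stands in for automaton subtraction. Note that both grammars are stated positively; the error recognizer is derived as loose minus strict, never written by hand.

```python
# Toy preview of grammar subtraction: the "error language" is the loose
# grammar minus the strict grammar. The vocabulary and features below are
# invented for illustration, not taken from the thesis lexicon.

# Each word carries (gender, number) features; None = unmarked for gender.
DETS = {"en": ("utr", "sg"), "ett": ("neu", "sg"), "den": ("utr", "sg"),
        "det": ("neu", "sg"), "de": (None, "pl")}
NOUNS = {"hund": ("utr", "sg"), "hus": ("neu", "sg"), "hundar": ("utr", "pl")}

def agrees(det_feats, noun_feats):
    d_gender, d_number = det_feats
    n_gender, n_number = noun_feats
    # Number must match; gender must match unless the determiner is unmarked.
    return d_number == n_number and (d_gender is None or d_gender == n_gender)

# Loose grammar: any determiner followed by any noun.
loose = {(d, n) for d in DETS for n in NOUNS}
# Strict grammar: only feature-agreeing determiner-noun combinations.
strict = {(d, n) for d in DETS for n in NOUNS if agrees(DETS[d], NOUNS[n])}
# Subtraction yields a recognizer for exactly the ill-formed phrases.
errors = loose - strict

def flag(text):
    """Return determiner-noun bigrams of `text` that fall in the error language."""
    words = text.lower().split()
    return [pair for pair in zip(words, words[1:]) if pair in errors]

print(flag("jag ser ett hund vid huset"))  # [('ett', 'hund')]
```

The same pattern scales to genuine finite state automata: a permissive phrase grammar and a more detailed one are compiled into machines and the detailed one is subtracted, so only positive rules of Swedish ever need to be written.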
The techniques of finite state parsing are explained. Chapter 7 presents the performance of this tool. The thesis ends with a concluding summary (Chapter 8). In addition, the thesis contains four appendices. Appendix A presents the grammatical feature categories used in the examples of errors or when explaining the grammar of Swedish. Appendix B presents the error corpora consisting of the grammar errors found in the present study (Appendix B.1), misspelled words (Appendix B.2) and segmentation errors (Appendix B.3). The tagset used is presented in Appendix C and some listings from the implementation are listed in Appendix D.

Part I
Writing

Chapter 2
Writing and Grammar

2.1 Introduction

Learning to write does not imply acquiring a completely new language (new grammar), since often at this stage (i.e. beginning school) a child already knows the majority of the (general) grammar rules. Rather, learning to write is a process of learning the difference between written language and the already acquired spoken language. Consequently, the errors found in the writing of primary school children are often due to their lack of knowledge of written language: they consist of attempts to reproduce spoken language norms as an alternative to the standard written norm, or of errors due to the parts of written language not yet acquired. Further, even when the writer knows the standard norm, errors can occur either as the result of disturbances such as tiredness, stress, etc. or because the writer cannot manage to keep together complex content and meaning constructions (cf. Teleman, 1991a). Another source of errors is the aids we use for writing, computers, which also influence our writing and may give rise to errors. The main purpose of the present chapter is to see if previous studies on writing can give some hint of what grammar errors are to be expected in the writing of Swedish children.
It provides a survey of previous studies of grammar errors, as well as some background research on writing in general and some insights into what it means to learn to write and how computers influence our writing. First, a short review of research on writing is presented (Section 2.2), followed by a short explanation of what acquisition of written language involves and how computers influence the way we write (Section 2.3). Previous findings on grammar errors in Swedish can be found in the following section, including studies of the writing of children and adolescents, adults and the disabled (Section 2.4).

2.2 Research on Writing in General

For a long period of time, many considered written language (beginning with e.g. de Saussure, 1922; Bloomfield, 1933) to be a transcription of spoken (oral) language and less important than, or even inferior to, spoken language. A similar view is also reflected in the research on literacy, where studies on writing were very few in comparison to research on reading. A turning point at the end of the 1970s is described by many as “the writing crisis” (Scardamalia and Bereiter, 1986), when research on the teaching of native language writing expanded. During this period, more naturalistic methods for writing were propagated, i.e. “learning to write by writing” (Moffett, 1968), the writing situation in English schools was examined (e.g. Britton, 1982; Emig, 1982) and the focus of study changed from judgments of products and more text-oriented research to the strategies involved in the process of writing (see Flower and Hayes, 1981). In Sweden, writing skills were studied by focusing on the written product, often related to the social background of the child. Research has been devoted to spelling (e.g. Haage, 1954; Wallin, 1962, 1967; Dahlquist and Henrysson, 1963; Ahlström, 1964, 1966; Lindell, 1964) and the writing of compositions in connection with standardized tests (e.g. Björnsson, 1957, 1977; Ljung, 1959).
There are also studies concerning writing development in primary and upper secondary schools (e.g. Grundin, 1975; Björnsson, 1977; Hultman and Westman, 1977; Lindell et al., 1978; Larsson, 1984). During the latter half of the 1980s, research in Sweden took a new direction towards studies of writing strategies concerning writing as a process (e.g. Björk and Björk, 1983; Strömquist, 1987, 1989) and the development of writing abilities, focusing on writing activities between children and parents (e.g. Liberg, 1990) and text analysis (e.g. Garme, 1988; Wikborg and Björk, 1989; Josephson et al., 1990). This turning point was reflected in education by the introduction of process-oriented writing as well. Some research concerned writing as a cognitive text-creating process, using video-recordings of persons engaged in writing (e.g. Matsuhasi, 1982) or clinical experiments (e.g. Bereiter and Scardamalia, 1985). The use of computers in writing prompted studies on the influence of computers on writing (e.g. Severinson Eklundh and Sjöholm, 1989; Severinson Eklundh, 1993; Wikborg, 1990), resulting in the development of computer programs that register and record writing activities (e.g. Flower and Hayes, 1981; Severinson Eklundh, 1990; Kollberg, 1996; Strömqvist, 1996).

2.3 Written Language and Computers

2.3.1 Learning to Write

Writing, like speaking, is primarily aimed at expressing meaning. The most evident difference between written and spoken language lies in the physical channel. Written language is a single-channelled monologue, using only the visual channel (eye), with the addressee not present at the same time. It is a more relaxed, rather slow process affording longer time for consideration and the possibility to edit/correct the end product. Speech as a dialogue is simultaneous and involves participants present at the same time, where all the senses can be used to receive information.
It is a fast process with little time for consideration and difficulty in correcting the end product. The rules and conventions of written language are more restrictive than the rules of spoken language in the sense that there are constructions in spoken language regarded as “incorrect” in written language. Writing is, in general, standardized with less (dialectal) variation, in contrast to spoken language, which is dialectal and varied. Further, acquisition of written and spoken language occurs under different conditions and in different ways. Writing is taught in school by teachers with certain training, whereas speaking is learned privately (in a family, from peers, etc.), without any planning of the process. When learning to speak, we learn the language. When learning to write, we already know the language (in the spoken form) (cf. Linell, 1982; Teleman, 1991b; Liberg, 1990).1

Learning a written language means not only acquiring its more or less explicit norms and rules, but also learning to handle the overall writing system, including the more technical aspects, such as how to shape the letters, the boundaries between words, how a sentence is formed, as well as acquiring the grammatical, discursive, and strategic competence to convey a thought or message to the reader. In other words, writing entails being able to handle the means of writing, i.e. letters and grammar rules, arranging them to form words and sentences and being able to use them in a variety of different contexts and for different purposes. During this development, children may compose texts of different genres, but not necessarily apply the conventions of the writing system correctly. Children are quite creative and they often use conventions in their own ways, for instance using periods between words to separate them instead of blank spaces (cf. Mattingly, 1972; Chall, 1979; Lundberg, 1989; Liberg, 1990; Pontecorvo, 1997; Håkansson, 1998).
1 For further, more extensive definitions of differences between written and spoken language see e.g. Chafe (1985); Halliday (1985); Biber (1988).

The above discussion leads to a view of learning to write as being the acquisition of a complex system of communication with several components. Following Hultman (1989, p. 73), we can identify three aspects of writing:

1. the motor aspect: the movement of the hand when forming the letters or typing on the keyboard
2. the grammar aspect: the rules for spelling and punctuation, morphology and syntax on clause, sentence and text level
3. the pragmatic aspect: use of writing for a purpose, to argue, tell, describe, discuss, inform, refer, etc. The text has to be readable, reflecting the meaning of words and the effect they have.

This thesis focuses on the grammar aspect, in particular on the syntactic relationships between words. Also some aspects of spelling and punctuation are covered. The text level is not analyzed here.

2.3.2 The Influence of Computers on Writing

The view on writing has changed: it is no longer interpreted as a linear activity consisting of independent and temporally sequenced phases, but rather considered to be a dynamic, problem-solving activity. According to Hayes and Flower (1980), as a cognitive process, writing is influenced by the task environment (the external social conditions) and the writer’s long term memory, and includes the cognitive processes of planning (generating and organizing ideas, setting goals, and deciding what to include and what to concentrate on), translation (the actual production) and revision (evaluation of what has been written, proof-reading, writing out and publishing). This process-based approach, with the phases also referred to as prewriting, writing and rewriting, has been adopted in writing instruction in school (e.g.
Graves, 1983; Calkins, 1986; Strömquist, 1993) and is also considered to be well-suited to computer-assisted composition (Wresch, 1984; Montague, 1990). Writing on a computer makes text easy to structure, rearrange and rewrite. Many studies report writers’ decreased resistance to writing. They experience that it is easier to start to write, and there is the possibility to revise throughout the whole process of writing, leave the text and then come back to it again, and update and reuse old texts (e.g. Wresch, 1984; Severinson Eklundh, 1993). Also, studies of children’s use of computers show that children who use a word-processor in school enjoy writing and editing activities more, considering writing on a computer to be much easier and more fun. They are more willing to revise and even completely change their texts, and they write more in general (e.g. Daiute, 1985; Pontecorvo, 1997). The word processor affects the way we write in general. We usually plan less in the beginning when writing on a computer and revise more during writing. Thus, editing occurs during the whole process of writing and is not left solely to the final phase. In an investigation by Severinson Eklundh (1995) of twenty adult writers with academic backgrounds, more than 80% of all editing was performed during writing and not after. The main disadvantage reported is that it is hard to get an overall perspective of a text on the screen, which then makes planning and revision more difficult and can in turn lead to texts being of worse quality (e.g. Hansen and Haas, 1988; Severinson Eklundh, 1993). Rewriting and rearranging a text is easy to do in a word processor, for instance with the copy and paste utilities, which may easily give rise to errors that are hard to discover afterwards, especially in a brief perusal. Words and phrases can be repeated, omitted or transposed. Sentences can be too long (Pontecorvo, 1997) and errors that normally are not found in native speakers’ writing occur.
The common claim is that writing in one’s mother tongue normally results in types of errors that deviate from the public language norm, since most of the mother tongue’s grammar is present before we begin school (Teleman, 1979). There are studies that clearly show that the use of word processors leads to completely new error types, including some errors that were considered characteristic of second language writers. For instance, morpho-syntactic (agreement) errors have been found to be quite usual among native speakers in the studies of Bustamente and León (1996) and Domeij et al. (1996). The errors are connected to how we use the functions in a word processor and to the fact that revision is more local due to the limited view on the screen (cf. Domeij et al., 1996; Domeij, 2003). Concerning text quality, there are studies that point out that the use of a word processor results in longer texts, both among children and adults. Some researchers claim that the quality of compositions improved when word processors were used (see e.g. Haas, 1989; Sofkova Hashemi, 1998). However, no reliable quality enhancement besides the length of a text is evident in any study. The effects of using a computer for revision are regarded by some as being positive both on the mechanics and the content of writing, while others feel it promotes only surface level revision, not enhancing content or meaning (see the surveys in Hawisher, 1986; Pontecorvo, 1997; Domeij, 2003).

2.4 Studies of Grammar Errors

2.4.1 Introduction

There are not many studies of grammar errors in written Swedish. Studies of adult writing are few, while research on children’s writing development mostly concerns the early age of three to six years and the development of spelling and the use of the period and/or other punctuation marks and conventions (e.g. Allard and Sundblad, 1991). The recent expansion in the development of grammar checking tools, however, contributes to this field.
Below, studies are presented of grammar errors found in the writing of primary and upper secondary school children and adults, the error types covered by current proof reading tools, and analyses of grammar errors in texts of adult writers used for the evaluation of these tools. Some of these studies are described further in detail and are compared to the analysis of the children’s texts gathered for the present thesis in Chapter 4 (Section 4.4).

2.4.2 Primary and Secondary Level Writers

During the 1980s, several projects investigated the language of Swedish school children as a contribution to the discussion of language development and language instruction (see e.g. the surveys in Östlund-Stjärnegårdh, 2002; Nyström, 2000). The writing of children in primary and upper secondary school was analyzed mostly with a focus on lexical measures of productivity and language use, in terms of analysis of vocabulary, parts-of-speech distribution, length of words, word variation and also content, relation to gender, social background and the grades assigned to the texts (e.g. Hersvall et al., 1974; Hultman and Westman, 1977; Lindell et al., 1978; Pettersson, 1980; Larsson, 1984). Then, when the traditional product-oriented view on writing switched to the new process-oriented paradigm, studies on writing concerned the text as a whole and as a communicative act (e.g. Chrystal and Ekvall, 1996, 1999; Liberg, 1999) and became more devoted to analysis of genre and referential issues (e.g. Öberg, 1997; Nyström, 2000), the relation to the grades assigned (e.g. Östlund-Stjärnegårdh, 2002) and modality (speech or writing) (e.g. Strömqvist et al., 2002). Quantitative analysis in this field still concerns lexical measures of variation, length, coherence, word order and sentence structure; very few studies note errors other than spelling or punctuation (e.g. Olevard, 1997; Hallencreutz, 2002).
A study by Teleman (1979) shows examples (no quantitative measures) of both lexical and syntactic errors observed in the writing of children from the seventh year of primary school (among others). He reports on errors in function words, inflection with dialectal endings in nouns, dropping of the tense-endings on verbs and on the use of the nominative form of pronouns in place of accusative forms, as is often the case in spoken Swedish. Also, errors in definiteness agreement, missing constituents, reference problems, word order and tense shift are exemplified, as well as observations of erroneous use of, or missing, prepositions in idiomatic expressions. Another study, by Hultman and Westman (1977), concerns the analysis of national tests from third year students at upper secondary school. The aim of the project Skrivsyntax “Written Syntax” was to study writing practice in school from a linguistic point of view. The material included 151 compositions (88 757 words in total) with the subject Familjen och äktenskapet än en gång ‘Family and marriage once more’. Vocabulary, distribution of word categories, syntax and spelling were studied and compared to adult texts, between the marks assigned to the texts and between boys and girls. The study also included error analysis of punctuation, orthography, grammar, lexicon, semantics, stylistics and functionality of the text. Among grammar errors, gender agreement errors were reported as being common, and relatively many errors in pronoun case after a preposition occurred. Errors in agreement between subject and predicative complement are also reported as rather frequent. Word order errors are also reported, mostly in the placement of adverbials. Other examples include verb form errors, subject related errors, reference, preposition use in idiomatic expressions and clauses with odd structure.

2.4.3 Adult Writers

There are few studies of adult writing in Swedish.
Those that exist are mostly devoted to the writing process as a whole or to social aspects of it, with very little attention being paid to the mechanics of writing. However, the recent expansion in the development of Swedish computational grammar checking tools, which requires an understanding of what error types should be treated by such tools, has made contributions to this field. The realization of what types of errors occur, and should thus be included in such an authoring aid, may be based on intuitive presuppositions of what rules could be violated, in addition to empirical analysis of text. More empirical evidence of grammar violations also comes from the evaluation of such tools, where the system is tested against a text corpus with hand-coded analysis of errors. There are three available grammar checkers for Swedish: Granska (Knutsson, 2001), Grammatifix (Birn, 2000) and Scarrie (Sågvall Hein et al., 1999).2 Scarrie is explicitly devoted to professional writers of newspaper articles. The other two systems are not explicitly aimed at any special user groups, although their performance tests were provided mainly on newspaper texts.

2 These tools are described in detail in Chapter 5.

Below, a survey of studies is presented of professional and non-professional writers, adult disabled writers, the grammar errors that are covered by the three Swedish grammar checkers, and the grammar errors that occurred in the evaluation texts the performance of these systems was tested upon.

Professional and Non-professional Writers

Studies focusing on adult non-professional writing concern analysis of crime reports (Leijonhielm, 1989), post-school writing development (Hammarbäck, 1989), a socio-linguistic study concerning writing attitudes, i.e.
what is written and who writes what at a local government office, regardless of writing conventions (Gunnarsson, 1992), and some “typical features in non-proof-read adult prose” at a government authority reported in Göransson (1998), the only investigation that addresses (to some extent) grammatical structure. Göransson (1998) describes her immediate impression when proof-reading texts written by her colleagues at a government authority, showing some typical features in this unedited adult prose. She examined reports, instructional texts, newspaper articles, formal letters, etc. The analysis distinguishes between high and low level errors. High level includes comprehensibility of the text, coherence and style, relevance for the context, ability to see one’s own text with the eyes of others, choice of words, etc. Low level errors cover grammar and spelling errors. Among the grammar errors she only reports reference problems, choice of preposition and agreement errors. Among studies of professional writers, the language consultant Gabriella Sandström (1996) analyzed editing at the Swedish newspaper Svenska Dagbladet, covering 29 articles written by 15 reporters. The original script, the edited version and the final version of the articles were analyzed. The analysis involved spelling, errors at the lexical and syntactic level, formation errors, punctuation and graphical errors. The results showed that the journalists made most of their errors in punctuation, graphical form and lexicon, and that most of these disappeared during the editing process. Among the lexical errors, Sandström mentions errors in idiomatic expressions and in the choice of prepositions. Syntax errors also seem to be quite common, but the article does not give an analysis of the different kinds of syntax errors.
Adults with Writing Disabilities

Studies on the writing strategies of disabled groups were conducted within the project Reading and Writing strategies of Disabled Groups,3 including analysis of grammar for the dyslexic and deaf (Wengelin, 2002). The analysis of the writing of deaf adults included no frequency data and is less important for the present study, since it tends to reflect strategies found in second language acquisition. Adult dyslexics mostly show problems with the formation of sentences and frequent omission of constituents. Especially frequent were missing or erroneous conjunctions. Other errors concern agreement in the noun phrase or the form of noun phrases, verb form, tense shift within sentences and incorrect choice of prepositions. Marking of sentence boundaries and punctuation is the main problem of these writers.

Error Types in Proof Reading Tools

The error types covered by a grammar checker should, in general, include the central constructions of the language and, in particular, those which give rise to errors. These constructions should allow precise descriptions so that false alarms can be avoided. The selection of what error types to include is then also dependent on the available technology and the possibility of detecting and correcting the error types (cf. Arppe, 2000; Birn, 2000). In the development of Grammatifix, the pre-analysis of existing error types in Swedish was based on linguistic intuition, personal observation and reference literature on Swedish grammar and writing conventions (Arppe, 2000). In the case of Granska, the pre-analysis involved analysis of empirical data such as newspaper texts and student compositions (Domeij et al., 1996; Domeij, 2003). In the Scarrie project, where journalists are the end-users, the stage of pre-analysis consisted of gathering corrections made by professional proof-readers at the newspapers involved.
These corrections were stored in a database (The Swedish Error Corpora Database, ECD), which contains nearly 9,000 error entries, including spelling, grammar, punctuation, graphic and style, meaning and reference errors. Arppe (2000) provides an overview of the types of errors covered by the Swedish tools and reports, in short, that all the tools treat errors in noun phrase agreement and verb forms in verb chains. Scarrie and Granska also treat errors in compounds, whereas Grammatifix has the widest coverage of punctuation and number formatting errors. He points out that the error classification in these tools is similar, but not exactly the same. The depth and breadth of the included error categories differ in the subsets of phrases, the level of syntactic complexity or the position of detection in the sentence. The tools may, for instance, detect errors in syntactically simple fragments, but fail with syntactically more complex structures. These factors are further explained and exemplified in Chapter 5, where I also compare the error types covered by the individual tools. Among the grammar errors presented in Scarrie’s ECD, errors in noun phrase agreement, predicative complement agreement, definiteness in single nouns, verb subcategorization and choice of preposition are the most frequent error types.

3 More information about this project may be found at: http://www.ling.gu.se/~wengelin/projects/r&r.

Evaluation Texts of Proof Reading Tools

Other empirical evidence of grammar errors can be observed in the evaluation of the three grammar checkers (Birn, 2000; Knutsson, 2001; Sågvall Hein et al., 1999). The performance of all the tools was tested on newspaper text, written by professional writers. Only the evaluation corpus of Granska included texts written by non-professionals as well, represented by student compositions. In general, the corpora analyzed are dominated by errors in verb form, agreement in noun phrases, prepositions and missing constituents.
2.5 Conclusion

The main purpose of the present chapter was to investigate if previous research reveals which grammar errors to expect in the writing of primary school children. Apparently, grammar in general has a very low priority in the research on writing in Swedish. Grammar errors in children’s writing have been analyzed at the upper level in primary school and in the upper secondary school, and exist only as reports with some examples, without any particular reference to frequency. Some analyses have been performed on the writing of professional adult writers and in the research on the writing of dyslexic and deaf adults, with quantitative data for the dyslexic group. The only area that directly approaches grammar errors concerns the development of proofreading tools aimed particularly at grammar. These studies report on grammar errors in the writing of adults. Previous research presents no general characterization of grammar errors in children’s writing. There are, however, a few indications that children as beginning writers make errors different from those of adult writers. Teleman’s observations indicate use of spoken forms that were not reported in the other studies. Some examples of errors in the Skrivsyntax project are evidently more related to the fact that the children have not yet mastered writing conventions (e.g. errors in the accusative case of plural pronouns) rather than to “slip of the pen” errors (e.g. due to lack of attention). In general, all the studies report errors in agreement (both in the noun phrase and the predicative complement), verb form and the choice of prepositions in idiomatic expressions. Are these the central constructions in Swedish that give rise to grammar errors? It may be true for adult writers, but it is unclear regarding beginning writers.
Analysis of grammar errors in the child data collected for the present study is presented in Chapter 4, together with a comparison with the findings of the previous studies of grammar errors presented above.

Chapter 3
Data Collection and Analysis

3.1 Introduction

In this chapter we report on the data that have been gathered for this study and the types of analysis performed on them. First, the data collection is presented and the different sub-corpora are described (Section 3.2). Then, a discussion follows of the kinds of errors analyzed and how they are classified (Section 3.3). The types of analyses in the present study are described in the subsequent section (Section 3.4), and a description of error coding and the tools that were used for that purpose ends this chapter (Section 3.5).

3.2 Data Collection

3.2.1 Introduction

The main goal of this thesis is to automatically detect grammar errors in texts written by children. In order to explore what errors actually occur, texts on different topics written by different subjects were collected to build an underlying corpus for analysis, hereafter referred to as the Child Data corpus. The material was collected on three separate occasions and has served as the basis for other (previous) studies. The first collection of the data consists of both hand written and computer written compositions on set topics by 18 children between 9 and 11 years old - The Hand versus Computer Collection. The second collection involves the same subjects; this time, the children participate in an experiment and tell a story about a series of pictures, both orally and in writing on a computer - The Frog Story Collection. The third collection comes from a project on the development of literacy and includes eighty computer written compositions by 10 and 13 year old children on set topics in two genres - The Spencer Collection.
Table 3.1 gives an overview of the whole Child Data corpus, including the three collections mentioned above, divided into five sub-corpora by the writing topics the subjects were given: Deserted Village, Climbing Fireman, Frog Story, Spencer Narrative, Spencer Expository. Further information concerns the age of the subjects involved, the number of compositions, the number of words, whether the children wrote by hand or on computer and what writing aid was then used.

Table 3.1: Child Data Overview

                      AGE      COMP  WORDS   TOPIC                           WRITING AID
HAND VS. COMPUTER COLLECTION:
  Deserted Village    9-11     18    7 586   ”They arrived in a              paper and pen
                                             deserted village”
  Climbing Fireman    9-11     18    4 505   Shown: a picture of a           Claris Works 3.0
                                             fireman climbing on a ladder
FROG STORY COLLECTION:
  Frog Story          9-11     18    4 907   Story-retelling: ”Frog,         ScriptLog
                                             where are you?”
SPENCER COLLECTION:
  Spencer Narrative   10 & 13  40    5 487   Narrative: Tell about a         ScriptLog
                                             predicament you had rescued
                                             somebody from, or you had
                                             been rescued from
  Spencer Expository  10 & 13  40    7 327   Expository: Discuss the         ScriptLog
                                             problems seen in the video
  TOTAL                        134   29 812

Altogether 58 children between 9 and 13 years old wrote 134 papers, comprising a corpus of 29,812 words. Most of the papers are written on the computer. Only the first sub-corpus (Deserted Village) consists of 18 hand written compositions. The editor Claris Works 3.0 was used for 18 computer written texts. ScriptLog, a tool for experimental research on the on-line process of writing, was used for the remaining (98) computer written compositions. ScriptLog looks just like an ordinary word processor to the user, but in addition to producing the written text, it also logs information about all events on the keyboard, the screen position of these events and their temporal distribution.2

1 Many thanks to Victoria Johansson and Sven Strömqvist, Department of Linguistics, Lund University, for sharing this collection of data.

This section proceeds with detailed descriptions of the three collections that form the corpus, with information about when and for what purpose the material was collected, the subjects involved, the tasks they were given and the experiments they took part in.

3.2.2 The Sub-Corpora

The Hand vs. Computer Collection

The first collection originates from a study on the computer’s influence on children’s writing, gathered in autumn 1996. The writing performance in hand written and computer written compositions on the same subjects was compared (see Sofkova, 1997). Results from this study showed both great individual variation among the subjects and similarities between the two modes, e.g. in the distribution of spelling and segmentation errors, as well as improved performance in the essays written on the computer, especially in the use of punctuation, capitals and the number of spelling errors. The subjects included a group of eighteen children, twelve girls and six boys, between the ages of 9 and 11, all pupils at the intermediate level at a primary school. This school was picked because the children had some experience with writing on computers. Computers had already been introduced in their instruction and pupils were free to choose to write on a computer or by hand. If they chose to write on a computer, they wrote directly on the computer, using the Claris Works 3.0 word processor. Other requirements were that the subjects should be monolingual and not have any reading or writing disabilities. The children wrote two essays - one by hand and one on the computer. At the beginning of this study, the children were already busy writing a composition, which now is part of the hand written material.
They were given a heading for the hand written task: De kom till en övergiven by 'They arrived in a deserted village'. For the computer written task, pupils were shown a picture of a fireman climbing on a ladder. They were also told not to use the spelling checker when writing, in order to make the two tasks as comparable as possible.

2 A first prototype was developed in the project Reading and writing in a Linguistic and a didactic perspective (Strömqvist and Hellstrand, 1994). An early version of ScriptLog developed for Macintosh computers was used for collecting the data in this thesis (Strömqvist and Malmsten, 1998). There is now also a Windows version (Strömqvist and Karlsson, 2002).

The Frog Story Collection

The second collection is a story-telling experiment and involves the same subjects as the Hand vs. Computer Collection. In April 1997, we invited the children to the Department of Linguistics at Göteborg University to take part in the experiment. They served as a control group in the research project Reading and Writing Strategies of Disabled Groups, which aims at developing a unified research environment for contrastive studies of reading and writing processes in language users with different types of functional disabilities.3 The experiment included a production task and the data were elicited both in written and spoken form (video-taped). A wordless picture story booklet, Frog, where are you? by Mercer Mayer (1969), was used: a cartoon-like series of 24 pictures about a boy, his dog and a frog that disappears. Each subject was asked to tell the story, picture by picture. At the beginning of the experiment the children were invited to look through the book to get an idea of the content. The instruction was literally Kan du berätta vad som händer på de här bilderna? 'Can you tell what is happening in these pictures?' Half of the children first wrote and then told the story, and the other half did the opposite.
For the written task, the on-line process editor ScriptLog was used, storing all the writing activities.

The Spencer Collection

The Spencer Project on Developing Literacy across Genres, Modalities and Languages4 lasted from July 1997 to June 2000. The aim was to investigate the development of literacy in both speech and writing. Four age groups (grade school students, junior high school students, high school students and university students) and seven languages (Dutch, English, French, Hebrew, Icelandic, Spanish and Swedish) were studied. Schools were picked from areas where one could expect few immigrants in the classes, and where the children had some experience with computers. The subjects came from middle-class, monolingual families and had no reading or writing disabilities. Another criterion was that at least one of the subject's parents had education beyond high school.

3 The project's directors are Sven Strömqvist and Elisabeth Ahlsén from the Department of Linguistics, Göteborg University. More information about this project may be found at: http://www.ling.gu.se/~wengelin/projects/r&r.

4 The project was funded by the Spencer Foundation Major Grant for the Study of Developing Literacy to Ruth Berman, Tel Aviv University, who was the coordinator of this project. Each language/country involved has had its own contact person; for Swedish it was Sven Strömqvist from the Department of Linguistics at Lund University.

All subjects had to create two spoken and two written texts in two genres, expository and narrative. Each subject saw a short video (3 minutes long) containing scenes from a school day. After the video, the procedure varied depending on the order of genre and modality.5 The topic for the narratives was to tell about an event when the subject had rescued somebody, or had been rescued by somebody, from a predicament. They were asked to tell how it started, how it went on and how it ended.
The topic for the expository text was to discuss the problems they had seen in the video, and possibly give some solutions. They were explicitly asked not to describe the video. Written material for two age groups from the Swedish part of the study is included in the present Child Data: the grade school students (10 year olds) and the junior high school students (13 year olds). In total, 20 subjects from each age group were recruited. The texts the subjects wrote were logged in the on-line process editor ScriptLog.

3.3 Error Categories

3.3.1 Introduction

The texts under analysis contain a wide variety of violations of written language norms, on all levels: lexical, syntactic, semantic and discourse. The main focus of this thesis is to analyze and detect grammar errors, but first we need to establish what a grammar error is and what distinguishes a grammar error from, for instance, a spelling error. Punctuation is another category of interest, important for deciding how a grammar error detector should handle a text syntactically. The following section discusses the categorization of the errors found in the data and explains which errors are classified as spelling errors, as well as where the boundary lies between spelling and grammar errors. The error examples provided are glossed literally and translated into English. Grammatical features are placed within brackets following the word in the English gloss (e.g. klockan 'watch [def]') (the different feature categories are listed in Appendix A). Occurrences of spelling violations are followed by the correct form within parentheses and preceded by '⇒', both in the Swedish example and in the English gloss (e.g. var (⇒ vad) 'was (⇒ what)').

5 There were four different orders in the experiment: Order A: Narrative spoken, Narrative written, Expository spoken, Expository written. Order B: Narrative written, Narrative spoken, Expository written, Expository spoken.
Order C: Expository spoken, Expository written, Narrative spoken, Narrative written. Order D: Expository written, Expository spoken, Narrative written, Narrative spoken.

3.3.2 Spelling Errors

Spelling errors are violations of the orthographic norms of a language, such as insertion (e.g. errour instead of error), omission (e.g. eror), substitution (e.g. errer) or transposition (e.g. erorr) of one or more letters within the boundaries of a word, or omission of space between words (i.e. when words are written together) or insertion of space within a word (i.e. splitting a word into parts). Spelling errors may occur due to the subject's lack of linguistic knowledge of a particular rule (competence errors) or as a typographical mistake, when the subject knows the spelling but makes a motor coordination slip (performance errors). The difference between a competence and a performance error is not always easy to see in a given text. For example, the (nonsense) string gube deviates from the intended correct word gubbe 'old man' by the missing doubling of 'b' and thus violates the consonant gemination rule for this particular word. The text the error comes from shows that this subject is (to some degree) familiar with this rule, applying consonant gemination to other words, indicating that the error is likely to be a typo (i.e. a performance error) and that it occurred by mistake. On the other hand, the subject may not be aware that this rule applies to this particular word.6 It is then more a question of insufficient knowledge and thus a competence error. Spelling errors often give rise to non-existent words (non-word errors), as in the example above, but they can also lead to an already lexicalized string (a real word error).7 For example, in the sentence in (3.1), the string damen also violates the consonant doubling rule and deviates from the intended correct word dammen 'dam [def]' by omission of 'm'.
However, in this case the resultant string coincides with an existent word, damen 'lady [def]'.8 The error still concerns a single word, but differs from non-word errors in that the realization now influences not only the erroneously spelled string but also the surrounding context. The newly-formed word completely changes the meaning of the sentence and gives rise to a sentence with a very peculiar meaning, where a particular lady is not deep.

(3.1) Men ∗damen (⇒ dammen) är inte så djup.
      but lady [def] (⇒ dam [def]) is not that deep
      – But the dam is not so deep.

Homophones, words that sound alike but are spelled differently, are another example of a spelling error realized as a real word. The classical examples are the words hjärna 'brain' and gärna 'with pleasure', which are often substituted for each other in written production and, as carriers of different meanings, completely change the semantics of the whole sentence. Another category of words that may result in non-words or real words in writing is the alternative morphological forms in different dialects. For instance, a spoken dialectal variation of the standard final plural suffix -or on nouns, as in flicker 'girls' (the standard form is flick-or 'girls'), is normally not accepted in written form and is thus realized as a non-word in the written language. Other spoken forms, such as jag 'I', normally reduced to ja in speech, coincide with other existent words and form real word errors in writing. In this case ja is homonymous with the interjection (or affirmative) ja 'yes'.

6 The word gubbe 'old man' was used only once in the text.
7 Usually around 40% of all misspellings result in lexicalized strings (e.g. Kukich, 1992). The notion of non-word vs. real word spelling errors is a terminology used in research on spelling (cf. Kukich, 1992; Ingels, 1996).
8 Consonant doubling is used for distinguishing short and long vowels in Swedish.
In neither case is it clear whether the spoken form is used intentionally as some kind of stylistic marker or is spelled in this way due to competence or performance insufficiency, meaning that the subject either had not acquired the written norm or made a typographical error. Spelling errors are then violations of characters (or spaces) in single isolated words that form (mostly) non-words or real words, the latter causing ungrammaticalities in text.

3.3.3 Grammar Errors

Grammar errors violate (mostly) the syntactic rules of a language, such as feature agreement or the order or choice of constituents in a phrase or sentence, and thus concern a wider context than a single word.9

9 Choice of words may also lead to semantic or pragmatic anomaly.

Like spelling errors, a grammar error may occur due to the subject's insufficient knowledge of such language rules. However, the difference is that when learning to write as a native speaker (as the subjects in this study), only the written language norms that deviate from the already acquired (spoken) grammatical knowledge have to be learned. As mentioned earlier, research reveals that native speakers make not only errors reflecting the norms of the group one belongs to, as one might expect, but also other grammar errors that have been ascribed to the influence of computers on writing. That is, even a native speaker can make grammar errors when writing on a computer due to rewriting or rearranging text. Again, the real cause of an error is not always clear from the text. For instance, in the noun phrase denna uppsatsen 'this [def] essay [def]' a violation of definiteness agreement occurs, since the demonstrative determiner denna 'this' normally requires the following noun to be in the indefinite form. In this case, the form denna uppsats 'this [def] essay [indef]' is the correct one (see Section 4.3.1). However, in certain regions of Sweden this construction is grammatical in speech. This
means that this error appears to be a competence error, since the subject is not familiar with the written norm and applies the acquired spoken norm. On the other hand, it could also be a typographical mistake, as would be the case if the subject first used a determiner like den 'the/that [def]', which requires the following noun to be in the definite form, and then changed the determiner to the demonstrative one but forgot to change the definite form of the subsequent noun to indefinite.

In earlier research, grammar errors have been divided along two lines. Some researchers characterize the errors by applying the same operations as for orthographic rules also at this level, with omissions, insertions, substitutions and transpositions of words. Feature mismatch is then treated as a special case of substitution (e.g. Vosse, 1994; Ingels, 1996). For instance, in the incorrect noun phrase denna uppsatsen 'this [def] essay [def]' the required indefinite noun is substituted by a definite noun. Word choice errors, such as incorrect verb particles or prepositions, are other examples of grammatical substitution. Word order errors occur as transpositions of words, i.e. all the correct words are present but their order is incorrect. Missing constituents in sentences concern omission of words, whereas redundant words concern insertion. Others separate feature mismatch from other error types and distinguish between structural errors, which include violations of the syntactic structure of a clause, and non-structural errors, which concern feature mismatch (e.g. Bustamente and León, 1996; Sågvall Hein, 1998a).

3.3.4 Spelling or Grammar Error?

As mentioned at the beginning of this section, writing errors occur at all levels, including lexicon, syntax, semantics and discourse. The nature of an error is sometimes obvious, but in many cases it is unclear how to classify errors.
The final versions of the texts give very little hint about what was going on in the writer's mind at the time of text production.10 Some kind of classification of writing errors is necessary, however, for their detection and diagnosis. Consider for instance the sentence in (3.2), where a (non-finite) supine verb form försökt 'tried [sup]' is used as the main verb of the second sentence. The word in isolation is an existent word in Swedish, but syntactically a verb in the supine form is ungrammatical as the predicate of a main clause (see Section 4.3.5). This non-finite verb form has to be preceded by a (finite) temporal auxiliary verb (har försökt 'have [pres] tried [sup]' or hade försökt 'had [pret] tried [sup]'), or the form has to be exchanged for a finite verb form, such as the present (försöker 'try [pres]') or the preterite (försökte 'tried [pret]'). With regard to the tense used in the preceding context, the last alternative, the preterite form, would be the best choice.

(3.2) Han tittade på hunden. Hunden ∗försökt att klättra ner.
      he looked [pret] at the-dog. the-dog tried [sup] to climb down
      – He looked at the dog. The dog tried to climb down.

The problem of classification lies in the fact that although a single letter distinguishes the word from the intended preterite form, which could then be considered an orthographic violation, the error is realized not as a new word; rather, another form of the intended word is formed.

10 Probably some information could be gained from the log-files in the ScriptLog versions, but since not all data in the corpus are stored in that format, such an analysis has not been included in this thesis.
This error could occur as a result of editing, if the writer first used a past perfect tense (hade försökt 'had tried') and later changed the tense to the preterite (försökte 'tried') by removing the temporal auxiliary verb, but forgot also to change the supine form (försökt 'tried [sup]') to the correct preterite form. On the other hand, the correct preterite tense could have been used by the subject from the start. Then it is rather a question of a (real word) spelling error: the subject intended from the beginning to write a preterite form, but intentionally or unintentionally omitted the final vowel -e, which happens to be a distinctive suffix for this verb.

In the next example (3.3), a gender agreement error occurs between the neuter determiner det 'the' and the common gender noun ända 'end', as a result of replacing enda 'only' with ända 'end'. The erroneous word is an existent word and differs from the intended word only in the single letter at the beginning (an orthographic violation). This is clearly a question of a spelling error, since the word does not form any other form of the intended word and is realized as a completely new word with a distinct meaning.

(3.3) Det ∗ända (⇒ enda) jag vet om
      the [neu] end [com] (⇒ only) I know about
      – The only thing I know about ...

In the grammar checking literature, the categorization of writing errors is primarily divided into word-level errors and errors requiring context larger than a word (cf. Sågvall Hein, 1998a; Arppe, 2000). Real word spelling errors were treated in Scarrie's Error Corpora Database as errors requiring wider context for recognition and were categorized in accordance with the means used for their detection. In other words, errors either belong to the category of grammar errors when violating syntactic rules, or are otherwise categorized as belonging to the style, meaning and reference category (Wedbjer Rambell, 1999a, p.5).
In this thesis, where grammar errors (syntactic violations) are the main focus, real word spelling errors will be classified as a separate category. This distinction is important for examination of the real nature of such errors, especially when presenting a diagnosis to the user. Such considerations are especially important when the user is a beginning writer. Obvious cases of spelling errors such as the one in (3.3) are treated as such, whereas the treatment of errors lying on the borderline between a spelling and a grammar error, as in (3.2), depends on:

• what type of new formation occurred (another form of the same lemma or a new lemma)
• what type of violation occurred (change in letter, morpheme or word)
• what level is influenced (lexical, syntactic or semantic)

These principles are primarily aimed at the unclear cases, but seem to be applicable to other real word violations as well. In fact, a majority of real word spelling errors form new words and violate semantics rather than syntax, and just a few of them "accidentally" cause syntactic errors (see further in Section 5.3.2). It is the ones that form other forms of the same lemma that are tricky. They are treated here as grammar errors, but for diagnosis it is important to bear in mind that they could also be spelling errors. Figure 3.1 shows a scheme for error categorization. All violations of the written norm are categorized starting with whether the error is realized as a non-word or a real word. Non-words are always classified as spelling errors. Real word errors are then further considered with regard to whether they form other forms of the same lemma or whether new lemmas are created. In the case of the same lemma (as in (3.2)), errors are classified as grammar errors. When new lemmas are formed, syntactic or semantic errors occur.
Here a distinction is made between whether just a single letter is influenced, categorizing the error as a spelling error, or a whole word was substituted, categorizing it as a grammar error. For errors realized as real words, the following principles for error categorization then apply:11

(3.4) (i). All real word errors that violate a syntactic rule and result in other forms of the same lemma are classified as grammar errors.
(ii). All real word errors resulting in new lemmas by a change of the whole word are classified as grammar errors.
(iii). All real word errors resulting in new lemmas by a change in (one or more) letter(s) are classified as spelling errors.

11 Homophones are excepted from principle (ii). They certainly form a new lemma by a change of the whole word, but are related to how the word is pronounced and are thus considered spelling errors.

Figure 3.1: Principles for Error Categorization

For the above example (3.2), this means the following. Considering the word in isolation, försökt 'tried [sup]' is an existent word in Swedish. Considering the deviation from the intended preterite form, no new lemma is created; rather, another form of the same lemma is formed, one that happens to lack the final suffix realized as a single vowel. Considering the context it appears in, a syntactic violation occurs, since the sentence has no finite verb. So, according to principle (i) for error categorization in (3.4), this error is classified as a grammar error, since no new lemma was created; the required preterite form was simply replaced by a supine form of the same verb. In the case of (3.3), the error also involves a real word, but here a new lemma was created by the substitution of a letter. The error is then, according to principle (iii) in (3.4), considered to be a spelling error, since no other form of the same lemma, nor a substitution of the whole word, occurred.
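The categorization principles in (3.4) amount to a small decision procedure, which can be sketched in code. The fragment below is an illustration only, not part of FiniteCheck: the toy lexicon, lemma table, homophone set and the letter-counting heuristic are invented stand-ins for a real Swedish lexicon and lemmatizer.

```python
# Sketch of the error categorization in (3.4)/Figure 3.1. The tiny LEXICON,
# LEMMA and HOMOPHONES tables are illustrative stand-ins for real resources.
LEXICON = {'försökt', 'försökte', 'ända', 'enda', 'damen', 'dammen',
           'hjärna', 'gärna'}
LEMMA = {'försökt': 'försöka', 'försökte': 'försöka', 'ända': 'ända',
         'enda': 'enda', 'damen': 'dam', 'dammen': 'damm',
         'hjärna': 'hjärna', 'gärna': 'gärna'}
HOMOPHONES = {frozenset({'hjärna', 'gärna'})}

def letters_changed(written, intended):
    # Crude proxy for "change in letters" vs. "change of the whole word":
    # count differing positions plus the length difference.
    return (sum(a != b for a, b in zip(written, intended))
            + abs(len(written) - len(intended)))

def classify(written, intended):
    """Categorize the deviation of `written` from `intended`."""
    if written not in LEXICON:
        return 'spelling'          # non-words are always spelling errors
    if LEMMA.get(written) and LEMMA.get(written) == LEMMA.get(intended):
        return 'grammar'           # (i): another form of the same lemma
    if frozenset({written, intended}) in HOMOPHONES:
        return 'spelling'          # footnote 11: homophones are spelling errors
    if letters_changed(written, intended) >= len(intended):
        return 'grammar'           # (ii): the whole word was exchanged
    return 'spelling'              # (iii): new lemma by a change in letters

print(classify('gube', 'gubbe'))        # spelling (non-word)
print(classify('försökt', 'försökte'))  # grammar, by principle (i)
print(classify('ända', 'enda'))         # spelling, by principle (iii)
```

In a real system the test for principle (ii) would of course consider word choice rather than a letter count; the heuristic here merely makes the sketch executable.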
3.3.5 Punctuation

Research on sentence development and the use of punctuation reveals that children mark out entities that are content-driven rather than syntactically driven (e.g. Kress, 1994; Ledin, 1998). They form larger textual units, for instance, by joining together sentences that are "topically closely connected", according to Kress (1994). In speech, such sequences would be joined by intonation due to topic. An example of such adjoined clauses is "The boy I am writing about is called Sam he lived in the fields of Biggs Flat." (Kress, 1994, p.84). Others use a strategy of linking together sentences with connectives like 'and', 'then', 'so' instead of punctuation marks, which can result in sentences of great length, here called long sentences (see Section 4.6 for examples).

As we will see later in Chapter 5, the Swedish grammar checking systems are based on texts written by adults and are able to rely on punctuation conventions for marking syntactic sentences in their detection rules or for scanning a text sentence by sentence. In light of the above discussion, this is not possible with the present data, which consist of texts written by children. Occurrences of adjoined and long sentences are quite probable. In other words, analysis of the use of punctuation is important to confirm that the subjects of the present study also mark larger units. Thus, omissions of sentence boundaries are expected and have to be taken into consideration.

3.4 Types of Analysis

The analysis of the Child Data starts with a general overview of the corpus, including frequency counts of words, word types, and all spelling errors. The main focus is on a descriptive error-oriented study of all errors above the lexical level, i.e. all that influence context. Only spelling errors resulting in non-words are not part of this analysis. The error types included are:

1. Real word spelling errors - misspelled words and segmentation errors resulting in existent words.
2. Grammar errors - syntactic and semantic violations in phrases and sentences.
3. Punctuation - sentence delimitation and the use of major delimiters and commas.

The main focus lies on the second group, the grammar errors. Real word spelling errors and grammar errors are listed as separate error corpora - see Appendix B.1 for grammar errors, Appendix B.2 for misspelled words and Appendix B.3 for segmentation errors. All errors are represented with the surrounding context of the clause they appear in (in some cases larger parts are included, e.g. in the case of referential errors). Errors are indexed and categorized by the type of error and annotated with information about the possible correction (intended word) and the error's origin in the core data.

The analysis also includes descriptions of the overall distribution of error types and error density. Comparison is made between errors found in the different sub-corpora and by age. Here it is important to bear in mind that the texts were gathered under different circumstances and that not all subjects took part in all the experiments (see Section 3.2). Error frequencies are related to different units depending on the error type. Spelling errors, which concern isolated words, are related to the total number of words. In the case of grammar errors, the best strategy would be to relate some error types to phrases, some to clauses or sentences and some to even bigger entities in order to get an appropriate comparison measure. However, counting such entities is problematic, especially in texts that contain many structural errors. The best solution is to compare frequencies of the attested error types, which will reflect the error profile of the texts. The main focus in the analysis of the use of punctuation in this thesis is not the syntactic complexity of sentences, but rather whether the children mark larger units than syntactic sentences and whether they use sentence markers in wrong ways.
The most intuitive procedure would be to compare the orthographic sentences, i.e. the actual markings made by the writers, with the ("correct") syntactic sentences. The main problem with such an analysis is that in the case of long sentences it will often be hard to decide where to draw the line, since they are for the most part syntactically correct. Several solutions for delimitation into syntactic sentences may be available.12 The subjects' own orthographic sentences will instead be analyzed by length in terms of the number of words and by the occurrence of adjoined clauses. Further, erroneous use of punctuation marks will be accounted for. Analysis of the use of connectives as sentence delimiters would certainly be appropriate here, but we leave this for future research.

All error examples represent errors found in the Child Data corpus. The example format includes the error index in the corresponding error corpus (G for grammar errors (Appendix B.1), M for misspelled words (Appendix B.2), and S for segmentation errors (Appendix B.3)) and, as already mentioned, the text is glossed and translated into English, with grammatical features (see Appendix A) attached to words and spelling violations followed by the correct form within parentheses preceded by a double right-arrow '⇒'.

12 The macro-syntagm (Loman and Jörgensen, 1971; Hultman and Westman, 1977) and the T-unit (Hunt, 1970) are other units of measure, more related to the investigation of sentence development and grammatical complexity in education-oriented research in Sweden and America, respectively.

3.5 Error Coding and Tools

3.5.1 Corpus Formats

In order to be able to carry out automatic analyses on the collected material, the hand written texts were converted to a machine-readable format and compiled with the computer written texts to form one corpus.
All the texts were transcribed in accordance with the CHAT-format (see (3.5) below) and coded for spelling, segmentation and punctuation errors and some grammar errors. Other grammar errors were identified and extracted either manually or by scripts specially written for the purpose. Non-word spelling errors were corrected in the original texts in order to be able to test the texts in the developing error detector, which includes no spelling checker. The spelling checker in Word 2001 was used for this purpose. The original Child Data corpus now exists in three versions: the original texts in machine-readable format, a coded version in CHAT-format and a spell-checked version. The spell-checked version, free from non-words, was used as the basis for the manual grammar error analysis and as input to the error detector in progress and the other grammar checking tools that were tested (see Chapter 5).

3.5.2 CHAT-format and CLAN-software

The CHAT (Codes for the Human Analysis of Transcripts) transcription and coding format and the CLAN (Computerized Language Analysis) program are tools developed within the CHILDES (Child Language Data Exchange System) project (first conceived in 1981), a computerized exchange system for language data (MacWhinney, 2000). This software is designed primarily for transcription and analysis of spoken data. It is, however, practical to apply this format to written material in order to take advantage of the quantitative analyses that this tool provides. For instance, the current material includes a lot of spelling errors that can easily be coded, and a corresponding correct word may be added following the transcription format. This means that not only the number of words, but also the correct number of word types may be included in the analysis. Analyses concerning, for instance, the spelling of words may also be easily extracted.
In practice, conversion of a written text to CHAT-format involves the addition of an information field and the division of the text into units corresponding to "speaker's lines", since the transcript format is adjusted to spoken material. The information field at the beginning of a transcript usually includes information on the subject(s) involved, the time and location of the experiment, the type of material coded, the type of analysis done, the name of the transcriber, etc. Speaker's lines in spoken material correspond naturally to utterances. For the written material, we chose to use a finite clause as the corresponding unit, which means that every line must include a finite verb, except for imperatives and titles, which form their own "speaker's lines". The whole transcript includes just one participant, as it is a monologue. The information field in the transcribed text example in (3.5) below, taken from the corpus, includes, in accordance with the CHAT-format, all the lines at the beginning of the text starting with the @-sign. Lines starting with *SBJ: correspond to the separate clauses in the text. Comments can be inserted in brackets in speaker's lines, e.g. [+ tit] indicating that the line corresponds to the title of the text. The intended word is given in brackets following a colon, e.g. & [: och] 'and'. Relations to more than one word are indicated by the '<' and '>' signs, where the whole segment is included, e.g. <över jivna> [: övergivna] 'abandoned'.
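Since the word-level codes just described follow a regular pattern, the coded corrections can be read off a transcript line mechanically. The following is a minimal sketch of such a script; it is not part of the CLAN software, and the regular expression and function name are illustrative assumptions.

```python
import re

# A coded correction is either a single token or a <multi word> segment,
# followed by "[: intended form]", as in the CHAT lines shown in (3.5).
CODE = re.compile(r'(?:<(?P<seg>[^>]+)>|(?P<tok>\S+))\s*\[: (?P<target>[^\]]+)\]')

def corrections(line):
    """Return (written, intended) pairs coded on one *SBJ: line."""
    return [(m.group('seg') or m.group('tok'), m.group('target'))
            for m in CODE.finditer(line)]

print(corrections("*SBJ: en ö som var helt <i jen täkt> [: igentäckt] av palmer"))
# → [('i jen täkt', 'igentäckt')]
print(corrections("*SBJ: vi gick runt & [: och] titade [: tittade] ."))
# → [('&', 'och'), ('titade', 'tittade')]
```

Collecting such pairs over a whole transcript would, for instance, give the counts of misspelled tokens and their intended forms mentioned above.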
Other signs and codes can be inserted in the transcription.13

(3.5)
@Begin
@Participants: SBJ Subject
@Filename: caan09mHW.cha
@Age of SBJ: 9
@Birth of SBJ: 1987
@Sex of SBJ: Male
@Language: Swedish
@Text Type: Hand written
@Date: 10-NOV-1996
@Location: Gbg
@Version: spelling, punctuation, grammar
@Transcriber: Sylvana Sofkova Hashemi
*SBJ: de kom till en överjiven [: övergiven] by [+ tit]
*SBJ: vi kom över molnen jag & [: och] per på en flygande gris
*SBJ: som hete [: hette] urban .
*SBJ: då såg jag nåt [: något]
*SBJ: som jag aldrig har set [: sett] .
*SBJ: en ö som var helt <i jen täkt> [: igentäckt] av palmer
*SBJ: & [: och] i miten [: mitten] var en by av äkta guld .
*SBJ: när vi kom ner .
*SBJ: så gick vi & [: och] titade [: tittade] .
*SBJ: vi såg ormar spindlar krokodiler ödler [: ödlor] & [: och] anat [: annat] .
*SBJ: när vi hade gåt [: gått] en lång bit så sa [: sade] per .
*SBJ: vi <vi lar> [: vilar] oss .
*SBJ: per luta [: lutade] sig mot en .
*SBJ: palmen vek sig
*SBJ: & [: och] så åkte vi ner i ett hål .
*SBJ: sen [: sedan] svimag [: svimmade jag] .
*SBJ: när jag vakna [: vaknade] .
*SBJ: satt jag per & [: och] urban mit [: mitt] i byn .
*SBJ: vi gick runt & [: och] titade [: tittade] .
*SBJ: alla hus var <över jivna> [: övergivna] .
*SBJ: då sa [: sade] per .
*SBJ: vi har hitat den <över jivna> [: övergivna] byn .
*SBJ: & [: och] när vi kom hem så vakna [: vaknade] jag
*SBJ: & [: och] alt [: allt] var en dröm .
*SBJ: slut
@End

13 Further information about this transcription format and coding, including manuals for download, may be found at: http://childes.psy.cmu.edu/.

Chapter 4

Error Profile of the Data

4.1 Introduction

This chapter describes the empirical analysis of the collected data, starting with a general overview (Section 4.2) followed by sections describing the actual error analysis and the distribution of errors in the data.
The error analysis starts with descriptions of grammar errors (Section 4.3), the main focus, and continues with analyses of real-word spelling errors (Section 4.5) and punctuation (Section 4.6). The section on grammar errors concludes with a comparison of the error distribution in the analyzed data with the grammar errors found in other data already discussed in Chapter 2 (Section 4.4).

4.2 General Overview

The Child Data corpus, in total 29,812 words, consists of 134 compositions written by 58 children.1 Further information is provided here on the corpus, along with a discussion of the size of the sub-corpora, the average length of individual texts and word variation. Also described here is the overall impression of the texts in terms of writing errors and the nature of the spelling errors (both non-words and real words).

Text Size and Word Variation

The different sub-corpora are divided by topic of the written tasks (see Table 4.1). The first three were written by 18 subjects. The last two, belonging to the Spencer-project, involved 40 children each. In terms of the total number of words, Deserted Village and the Spencer Expository texts are the largest sub-corpora (in bold face) and the Climbing Fireman corpus is the smallest one. In total, the average text size is 222.5 words. This corresponds to a rather short text, approximately 20 lines of typed text or nearly half a page. Only the texts of Deserted Village (in bold face) are on average twice as long as the other texts. The Spencer-project texts are the shortest ones.

1 The composition of Child Data is described in Chapter 3 (see Section 3.2).
Table 4.1: General Overview of Sub-Corpora

CORPUS               TEXTS   WORDS    WORDS/TEXT   WORD TYPES
Deserted Village        18   7 586       421.4        1 610
Climbing Fireman        18   4 505       250.3        1 040
Frog Story              18   4 907       272.6          763
Spencer Narrative       40   5 487       137.2        1 085
Spencer Expository      40   7 327       183.2        1 021
TOTAL                  134  29 812       222.5        3 373

The reason for this difference in text length probably lies in the degree of free writing and in the use of, and familiarity with, the writing aid. The texts of the Deserted Village corpus were produced in the subjects’ own everyday environment, in the classroom; time was not limited, and they wrote by hand. The texts of Climbing Fireman were also written in a familiar environment with relatively unrestricted time demands, but these were written on a computer. Although computers had been introduced and used previously by the subjects, they may still have felt unfamiliar with their use. The Frog Story texts are slightly longer than the Climbing Fireman texts, but the higher number of words was probably elicited by the experiment, in which the subjects were required to write text for 24 pictures. The Spencer-project texts are also of a more experimental nature, produced in an environment not familiar to the subjects, with more restrictions on time, and written by means of a previously unknown text editor (ScriptLog).

Next, let us consider word variation. 3,373 word types were found in the whole corpus. The Frog Story texts have the smallest number of word types, not surprisingly, since the scope of word variation is largely determined by the pictures of the story the children were supposed to tell. Among the other sub-corpora, the Deserted Village corpus has the highest word variation, whereas the other three each contain around 1,000 word types.

Table 4.2 shows the texts grouped by age. We find that the sub-corpus of 9 year olds is almost the same size as all the texts written by 10 year olds, although the sub-corpus consists of less than half as many compositions.
The 9 year old children produced on average texts that are three times longer (854 words per text) than those of the 10 year olds, who wrote the shortest texts in the whole corpus.

Table 4.2: General Overview by Age

AGE        SUBJECTS   TEXTS   WORDS    WORDS/TEXT   WORD TYPES
9-years           8      24   6 832        854.0        1 270
10-years         24      52   6 837        284.9        1 356
11-years          6      18   8 012      1 335.3        1 629
13-years         20      40   8 131        406.6        1 279
TOTAL            58     134  29 812        222.5        3 373

The sub-corpora of the 11 and 13 year olds are of similar size and are more than a thousand words larger than the texts of the younger children. The 11 year olds wrote the longest texts in the whole corpus (1,335.3 words per text), almost five times longer than the shortest texts of the 10 year olds. There is, in other words, much variation in the average length of text, and especially the 11 year olds distinguish themselves by their much longer texts.2 Word variation measured in the number of word types seems to be slightly higher for the 11 year olds. The other age groups each contain around 1,300 word types.

Overall Impression and Spelling Errors

The first thing to observe when reading the texts by the children involved in this study is the high number of spelling errors and split compounds, the rare use of capitals at the beginning of sentences and the unconventional use of punctuation delimiters to mark sentence boundaries. The children literally write as they speak. They use a great deal of direct speech and many spoken word forms. The different writing errors above the lexical level are presented and discussed in the subsequent sections. In this section the sub-corpora and age groups are discussed and compared with respect to the total number of spelling errors (both non-words and real words). Most of the errors concern misspelled words, i.e. words with one or more spelling errors, represented by 2,422 (8.1%) words in total (see the last two columns in Table 4.3 below).
Segmentation errors are four times less frequent, with 377 (1.3%) words written apart (splits) and 240 (0.8%) words written together (run-ons). Among the different sub-corpora (Table 4.3), the most misspelled words, splits and run-ons are found in the hand-written texts of the Deserted Village corpus.

2 For the time being no standard deviation was computed.

The Deserted Village corpus and the Frog Story corpus have the highest rates of spelling errors, 15.6% and 14.3% respectively, of the total number of words in the respective sub-corpora (last row in the table). The texts of the Spencer-project, which were much shorter, include around 5% spelling errors, two to three times fewer than in the other three sub-corpora. Considering the age differences (Table 4.4), as expected most of the errors occurred in the texts of the youngest 9 year olds, with 1,475 (21.6%) errors in total. Only the number of splits is higher in the texts of the 11 year olds. The oldest 13 year olds made five times fewer errors. The group of 11 year olds has a very high number of spelling errors, with 813 (10.1%) errors, in comparison to the texts by the 10 year olds, which include 459 (6.7%) spelling errors.
Table 4.3: General Overview of Spelling Errors in Sub-Corpora

ERROR TYPE         Deserted  Climbing   Frog   Spencer    Spencer      TOTAL     %
                   Village   Fireman    Story  Narrative  Expository
Misspelled Words     924       422       568     209        299        2 422   8.1
Splits               146        69        93      37         32          377   1.3
Run-ons              113        26        39      32         30          240   0.8
TOTAL              1 183       517       700     278        361        3 039
%                   15.6      11.5      14.3     5.1        4.9         10.2

Table 4.4: General Overview of Spelling Errors by Age

ERROR TYPE         9-years   10-years   11-years   13-years    TOTAL     %
Misspelled Words    1 242       356        602        222      2 422   8.1
Splits                129        69        148         31        377   1.3
Run-ons               104        34         63         39        240   0.8
TOTAL               1 475       459        813        292      3 039
%                    21.6       6.7       10.1        3.6       10.2

According to Pettersson (1989, p. 164), children in the second year at primary school (9 years old) make on average 13 spelling errors per 100 words, which is much less than our 9 year olds, who have almost 22 errors. By the eighth year (14 years old), the number decreases to four errors, which seems to hold true for our 13 year olds. Students in their last year at upper secondary school make on average one spelling error per 100 words.

Summary

The texts in Child Data are on average not longer than half a page, with the exception of the hand-written Deserted Village texts, which are on average twice that length. The differences in length vary more by age. The 10 year olds wrote the shortest texts on average, whereas the texts written by the 11 year olds are almost five times longer. Word variation is much lower in the Frog Story corpus than in the other corpora. In the whole corpus, 10% of all words are misspelled or wrongly segmented, and the highest concentrations of these errors are found in the texts of Deserted Village and Frog Story and in the texts of the 9 year olds. Splits are also quite common in the 11 year olds’ texts.
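The percentage rows above are simply error counts per 100 words of the corresponding sub-corpus. As a quick sketch (figures copied from Tables 4.1 and 4.3), the last rows of Table 4.3 can be reproduced as follows:

```python
# Reproduce the percentage rows of Table 4.3: error rate = errors per
# 100 words, rounded to one decimal. Figures are taken from the tables.
words = {"Deserted Village": 7586, "Climbing Fireman": 4505,
         "Frog Story": 4907, "Spencer Narrative": 5487,
         "Spencer Expository": 7327}
errors = {"Deserted Village": 1183, "Climbing Fireman": 517,
          "Frog Story": 700, "Spencer Narrative": 278,
          "Spencer Expository": 361}

def rate(errs, wrds):
    return round(100.0 * errs / wrds, 1)

for corpus in words:
    print(corpus, rate(errors[corpus], words[corpus]))
print("TOTAL", rate(sum(errors.values()), sum(words.values())))  # 10.2
```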
4.3 Grammar Errors

Previous research and analyses of grammar (reported in Section 2.4) suggest that Swedish writers in general make errors in agreement (both in the noun phrase and in the predicative complement), in verb form, and in the choice of prepositions in idiomatic expressions. The writing of children at primary school also includes dialectal inflections on words, dropped endings and substitution of nominative for accusative case in pronouns. This section presents the types of grammar errors in the present corpus of primary school writers and investigates whether the same types of errors occur and whether, or to what extent, spoken language plays a role in their writing. Each error type is discussed and exemplified, introduced by a description of the structure of the relevant phrase types in Swedish, so that a reader who does not know Swedish will be able to understand why something is classified as an error. The number of errors is summarized in Section 4.3.10, along with a discussion of the relative frequency of the different error types in total and in comparison across sub-corpora and age. All the errors are listed in Appendix B.1. The grammar error types of this analysis are further compared with the errors found in some of the previous studies of grammar errors in the subsequent section (Section 4.4).

4.3.1 Agreement in Noun Phrases

Noun Phrase Structure and Agreement in Swedish

A noun phrase in Swedish consists of a head, normally a noun, a proper noun or a (nominal) pronoun. In addition, prenominal and/or postnominal determiners and modifiers may occur. The attributes come in a certain order and must agree with the head in number, gender, definiteness and case.
Swedish distinguishes between singular (unmarked) and plural (normally a suffix) in the number system, and number agreement is governed by the noun’s grammatical number:

Table 4.5: Number Agreement in Swedish

SINGULAR                       PLURAL
min bok     my book            mina böcker   my [pl] book [pl]
ingen byxa  no trousers        inga byxor    no [pl] trousers [pl]

Gender is represented by two categories, common and neuter. Many animate nouns are further categorized according to sex, masculine or feminine (unmarked). Gender agreement is only found in the singular and is not visible in the plural.

Table 4.6: Gender Agreement in Swedish

          SINGULAR                             PLURAL
COMMON    en gammal bil     an old car        några gamla bilar   some old cars
NEUTER    ett gammal-t hus  an old house      några gamla hus     some old houses

Definiteness marking is quite complicated and is one of the factors in Swedish grammar that cause problems. The indefinite form is unmarked, whereas the definite form is (mostly) doubly marked, both by prenominal attributes and by a noun suffix. For adjectives (and participles) there are two different forms, normally called strong and weak forms. The strong form is used in indefinite noun phrases and in predicative use. The weak form of adjectives is used in definite noun phrases. The weak form is the same in all genders and numbers, except optionally when the noun denotes a male person.3 The plural forms of the strong and weak declensions coincide.

Table 4.7: Definiteness Agreement in Swedish

                   INDEFINITE                        DEFINITE
SINGULAR COMMON    en bok           a book           bok-en              book [def]
                   en gammal bok    an old book      den gaml-a bok-en   the old [wk] book [def]
                   en gammal man    an old man       den gaml-e mann-en  the old [masc] man [def]
SINGULAR NEUTER    ett gammalt hus  an old house     det gaml-a hus-et   the old [wk] house [def]
PLURAL             gaml-a hus       old [wk] houses  de gaml-a hus-en    the old [wk] houses [def]

3 Notice that the masculine gender is only optional, which means that a noun phrase of the form den gaml-a mann-en ‘the old [wk] man [def]’ is correct as well.
Finally, case in the nominal system is represented by the (unmarked) nominative and the genitive, which uses the suffix -s (personal pronouns are also declined for accusative case; see further under pronouns, Section 4.3.4). The basic constituent order in a noun phrase is determiner-adjective-noun, e.g. ett stort hus ‘a big house’, det stora huset ‘the big house’. The co-occurrence patterns of definiteness marking can be divided into three different types (Cooper, 1986, p. 34):4

1. Definite noun phrase, which reflects the double definiteness marking and requires definite prenominal attributes and a definite noun:

   DET [+DEF]   ADJ [+DEF]   N [+DEF]
   den          röd-a        bil-en       this/the red car
   de           röd-a        bilar-na     the red cars
   den här      röd-a        bil-en       this red car
   de här       röd-a        bilar-na     these red cars

2. Indefinite noun phrase, which requires indefinite prenominal attributes and an indefinite noun:

   DET [–DEF]   ADJ [–DEF]   N [–DEF]
   en           röd          bil          a red car
   någon        röd          bil          some red car
   inga         röda         bilar        no red cars

3. Mixed noun phrase, which requires definite prenominal attributes and an indefinite noun. This type applies to demonstrative pronouns, possessive attributes and some relative clauses.

   DET [+DEF]   ADJ [+DEF]   N [–DEF]
   Demonstrative pronouns:
   denna        röd-a        bil                        this red car
   dessa        röd-a        bilar                      these red cars
   Possessive attributes:
   firmans      röd-a        bil                        the firm’s red car
   deras        röd-a        bil                        their red car
   Relative clause:
   den          röd-a        bil (som) han köpte igår   the red car that he bought yesterday

4 Cooper defines these types in terms of existing determiner types that require either definite or indefinite adjectives and nouns.

The optional prenominal attributive adjectives can be recursively stacked, as in (4.1a). Numerals as quantifying attributes occur in both definite (4.1b) and indefinite (4.1c) noun phrases.

(4.1) a. en ny röd bil
         ‘a new red car’
      b. de två röda bilarna
         ‘the two red cars’
      c.
två röda bilar
         ‘two red cars’

A proper noun as the head of a noun phrase behaves (almost) like a noun in a definite noun phrase. Proper nouns are inherently definite and uncountable. The most common form is when the proper noun occurs on its own, without any modifiers, as in the first example in Table 4.8, but prenominal attributes may occur, as shown in the other examples (Teleman et al., 1999, Part 3:56):

Table 4.8: Noun Phrases with Proper Nouns as Head

DET      ADJ        N
—        —          Peter       Peter
—        lilla      Karin       little Karin
den      snälla     Anna        the good/kind Anna
den där  tråkiga    Karl        that boring Karl
min      söta       Maria       my sweet Maria
en       ångerfull  Karl-Erik   a regretful Karl-Erik

Pronouns as heads of a noun phrase normally occur without modifiers, although pronouns with relative clauses are quite common (see further in Teleman et al., 1999):

Table 4.9: Noun Phrases with Pronouns as Head

ADJ     PRO
—       jag    I
hela    jag    all of me
båda    ni     both of you
hela    den    all of it
själva  hon    she herself

A noun phrase need not have a noun (or pronoun) as head. In this case, an adjective normally occurs in that position. Noun phrases consisting of only a determiner also exist. The structure of the (in)definite noun phrase is the same as in a noun phrase with a noun as head. Table 4.10 gives an overview of noun phrases without (nominal) heads.

Table 4.10: Noun Phrases without (Nominal) Head

DEFINITE NOUN PHRASE
DET      ADJ
denne    —           this one
den      andra       the other one
den där  nye         that new one
—        väntande    waiting
många    andra       many other
det      bästa       the best

INDEFINITE NOUN PHRASE
DET      ADJ
någon    —           someone
en       annan       another
allt     roligt      all fun

One further type of noun phrase will be relevant in this thesis, namely the partitive phrase, which consists of a quantifier, the preposition av ‘of’ and a definite noun phrase.
The quantifier agrees in gender with the noun phrase (Teleman et al., 1999, Part 3:69):

Table 4.11: Agreement in Partitive Noun Phrase in Swedish

COMMON   en av cyklarna      one [com] of bicycles [com]
         ingen av filmerna   none [com] of movies [com]
NEUTER   ett av träden       one [neu] of trees [neu]
         inget av äpplena    none [neu] of apples [neu]

Agreement Errors in Definiteness

Definiteness agreement was violated in eight noun phrases, and violations occurred in all three noun phrase types. Errors in definite noun phrases comprised three errors, all located in the head. In all instances the head noun is in the indefinite form, lacking the definite suffix, as in (4.2a). In (4.2b) we see the correct form of the definite noun phrase with both the definite determiner/article and the definite suffix on the noun.

(4.2) (G1.1.2)
      a. *En gång blev den hemska pyroman utkastad ur stan.
         one time was the [def] awful [wk] pyromaniac [indef] thrown-out out-of the-city
         ‘Once the awful pyromaniac was thrown out of the city.’
      b. den hemska pyroman-en
         the [def] awful [wk] pyromaniac [def]

One of these three erroneous noun phrases is ambiguous in context (see (4.3)), providing yet another correction possibility. The intended noun phrase could be definite as in (4.3b) or indefinite as in (4.3c).

(4.3) (G1.1.3)
      a. Jag såg på ett TV program där en metod mot mobbing var att sätta mobbarn på *den stol och andra människor runt den personen och då fråga varför.
         I saw on a TV program where a method against bullying was to put the-bully on the [def] chair [indef] and other people around the person and then ask why
         ‘I saw on a TV program where a method against bullying was to put the bully on the chair and other people around the person and then ask why.’
      b. den stolen
         the [def] chair [def]
      c. en stol
         a [indef] chair [indef]

There were three errors in definite noun phrases with indefinite heads (type 3), which involved possessive and demonstrative attributes.
In all cases, the head noun is in the definite form with a (superfluous) definite suffix, as in (4.4a). The most obvious correction is to change the form of the noun to indefinite, as in (4.4b), but it could also be that the possessive determiner is superfluous, making the single definite noun as in (4.4c) more correct.

(4.4) (G1.1.4)
      a. *Pär tittar på sin klockan och det var tid för familjen att gå hem.
         Pär looks at his [gen] watch [def] and it was time for the-family to go home
         ‘Pär looks at his watch. It was time for the family to go home.’
      b. sin klocka
         his [gen] watch [indef]
      c. klockan
         watch [def]

A violation involving a demonstrative pronoun, presented in (4.5), occurred probably due to the subject’s regional origin. Nouns modified by denna ‘this’ occur in definite form in some regional dialects.

(4.5) (G1.1.6)
      a. *Nu när jag kommer att skriva denna uppsats-en så kommer jag ha en rubrik om några problem och ...
         now when I will to write this [def] essay [def] so will I have a title about some problems and
         ‘Now when I write this essay, I will have a heading about some problems and...’
      b. denna uppsats
         this [def] essay [indef]

Two errors occurred in indefinite noun phrases and once more concerned the head noun being in definite form, as in (4.6). Two corrections are possible here as well: changing the form of the head noun as in (4.6b), or removing the determiner as in (4.6c).

(4.6) (G1.1.7)
      a. Men senare ångrade dom sig, för det var *en räkningen på deras lägenhet.
         but later regretted they selves for it was a [indef] bill [def] on their apartment
         ‘But later they regretted it, because it was a bill for their apartment.’
      b. en räkning
         a [indef] bill [indef]
      c. räkningen
         bill [def]
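The definiteness errors above can all be viewed as violations of Cooper’s three co-occurrence patterns. As a toy illustration (this is not the FiniteCheck grammar; the determiner lexicon and feature labels below are invented for the example), such a check amounts to looking up which noun form a determiner selects:

```python
# Toy illustration of Cooper-style definiteness co-occurrence checking
# (not the actual FiniteCheck rules; the lexicon is invented here).
# Each determiner selects the definiteness required on its head noun.
DEF, INDEF = "def", "indef"

REQUIRES = {
    "den": DEF, "de": DEF,         # type 1: definite NP, double marking
    "en": INDEF, "någon": INDEF,   # type 2: indefinite NP
    "denna": INDEF, "sin": INDEF,  # type 3: mixed NP, indefinite noun
}

def definiteness_ok(det, noun_definiteness):
    """True if the determiner and the noun's definiteness co-occur legally."""
    if det not in REQUIRES:
        return True  # unknown determiner: do not flag anything
    return noun_definiteness == REQUIRES[det]

print(definiteness_ok("den", INDEF))  # den ... pyroman (4.2a): flagged
print(definiteness_ok("sin", DEF))    # sin klockan (4.4a): flagged
print(definiteness_ok("den", DEF))    # den ... pyromanen: well-formed
```

The design choice mirrors the thesis method in spirit: only valid co-occurrences are listed, and anything outside them falls out as an error, without enumerating error patterns.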
Gender Agreement Errors

Agreement errors in gender occurred in definite, indefinite and partitive noun phrases and show up as a mismatch between the gender of the article and the rest of the phrase, or as violations of the semantic gender of the adjective. One disagreement in the article occurred in an indefinite noun phrase, shown in (4.7a), and one in a partitive noun phrase (G1.2.2).

(4.7) (G1.2.1)
      a. Pojken fick *en grodbarn.
         the-boy got a [com] frog-child [neu]
         ‘The boy got a frog baby.’
      b. ett grodbarn
         a [neu] frog-child [neu]

Two errors were related to semantic gender, where the masculine gender was wrongly used in the adjectival attributes of definite noun phrases. In one case, the masculine gender is used together with a plural noun (see (4.8a)).

(4.8) (G1.2.4)
      a. nasse blev arg han gick och la sig med dom *andre syskonen.
         Nasse became angry he went and lay himself with the [pl] other [masc] siblings [pl]
         ‘Nasse got angry. He lay down with his brothers and sisters.’
      b. dom andra syskonen
         the [pl] other [pl] siblings [pl]

The second instance of semantic gender mismatch is more a question of asymmetry between the adjectives involved (see (4.9a)). The first adjective in the noun phrase is declined for masculine gender (hemsk-e ‘awful [masc]’) and the second uses the unmarked form (ful-a ‘ugly [def]’). Either both should be in the masculine form (as in (4.9b)) or both should have the unmarked form (as in (4.9c)).

(4.9) (G1.2.3)
      a. det va den *hemske *fula troll karlen (⇒ trollkarlen) tokig som ...
         it was the [def] awful [wk,masc] ugly [wk] troll man [def] (⇒ magician [def]) Tokig that
         ‘It was the awful ugly magician Tokig that ...’
      b. den hemske fule trollkarlen
         the [def] awful [wk,masc] ugly [wk,masc] magician [def]
      c. den hemska fula trollkarlen
         the [def] awful [wk] ugly [wk] magician [def]

Number Agreement Errors

Three noun phrases violated number agreement.
One concerned a definite attribute in a definite noun phrase (see (4.10a)). It seems that the required plural determiner de ‘the [pl]’ is replaced by the singular definite determiner det ‘the [sg]’. It could also be a question of an (un)intentional addition of the character -t, which would make it a spelling error rather than a grammar error. But since a syntactic violation occurred and no new lemma was formed, the error is classified as a grammar error and not as a real-word spelling error.

(4.10) (G1.3.1)
       a. Den där scenen med *det tre tjejerna tyckte jag att de var taskiga som går ifrån den tredje tjejen.
          the there scene with the [sg] three girls [pl] thought I that they were mean that go from the third girl
          ‘I thought that in the scene with the three girls they were mean to leave the third girl.’
       b. de tre tjejerna
          the [pl] three girls [pl]

The other two errors concern the head noun of a partitive attribute, as shown in (4.11a). In both instances, the noun is in the singular definite form instead of the required plural definite form. Both errors were made by the same subject. This realization points more clearly to a typographical error. The determiner and the partitive preposition were probably inserted into the text afterwards, since the singular definite form that this error brings about is not at all part of the correct non-elliptic noun phrase (see (4.11b)), but may function perfectly well as a noun phrase on its own.

(4.11) (G1.3.2)
       a. Alla männen och pappa gick in i ett av *huset.
          all the-men and daddy went into one of house [sg, def]
          ‘All the men and daddy went into one of the houses.’
       b. ett (hus) av husen
          one (house [indef]) of houses [pl, def]

4.3.2 Agreement in Predicative Complement

Introduction

A predicative complement is part of a verb phrase and specifies features of the subject or the object. An adjective phrase, a participle or a noun phrase are the typical representatives.
The predicative complement differs from other parts of the verb phrase in that it agrees in gender and number (in the case of a noun phrase only in number) with the corresponding subject or object it refers to, as shown in Table 4.12.

Table 4.12: Gender and Number Agreement in Predicative Complement

SINGULAR COMMON   boken är gammal     the-book [com] is old [com]
SINGULAR NEUTER   huset är gammal-t   the-house [neu] is old [neu]
PLURAL            husen är gaml-a     the-houses [pl] are old [pl]

The predicative normally combines with copula verbs (vara ‘be’, bli ‘be/become’, förbli ‘remain’), naming verbs (e.g. heta ‘be called’, kallas ‘be called’), raising verbs (e.g. verka ‘seem’, förefalla ‘seem’, tyckas ‘seem’), and other similar verb categories (Teleman et al., 1999, Part 3:340).

Gender Agreement Errors

Violations of gender agreement were rare. Altogether, two errors of this type occurred. One concerned an adjective in the complement position and the other a past participle form. The adjective error occurred with neuter gender, as shown below in (4.12a).

(4.12) (G2.1.1)
       a. Då börja Urban lipa och sa: Mitt hus är *blöt.
          then start Urban blubber and said my [neu] house [neu] is wet [com]
          ‘Then Urban started to blubber and said: My house is wet.’
       b. Mitt hus är blött.
          my [neu] house [neu] is wet [neu]

Here the neuter gender subject is connected to the adjective blöt ‘wet [com]’ in common gender. The error could also be classified as a spelling error with omission of the final double consonant, but since the result is also another form of the same adjective and a syntactic violation occurs, the error is classified as a grammar error.

Number Agreement Errors

In the case of number agreement, there was one error involving singular number and two errors involving plural number.
As in (4.13a), the sentence structures that include number violations in the predicative complement are in general rather complex, and the distance between the head and the modifier is not restricted to a single verb. In this case, it seems to be a question of a lack of linguistic competence, since all three adjectives lack the plural ending.

(4.13) (G2.2.3)
       a. Själv tycker jag att killarnas metoder är mer *öppen och *ärlig men också mer *elak än var (⇒ vad) tjejernas metoder är.
          self think I that the-boys’ methods [pl] are more open [sg] and honest [sg] but also more mean [sg] than was (⇒ what) the-girls’ methods are
          ‘I think myself that the boys’ methods are more open and honest but also more mean than the girls’ methods are.’
       b. killarnas metoder är mer öppna och ärliga men också mer elaka
          the-boys’ methods [pl] are more open [pl] and honest [pl] but also more mean [pl]

4.3.3 Definiteness in Single Nouns

Introduction

The grammatical violations in this section concern single nouns as the only constituents of a noun phrase. Bare singular nouns are (normally) ungrammatical without an article. The noun must be in definite form or preceded by an article, as in (4.14b) or (4.14d). The example sentences in (4.14a) and (4.14c) are (normally) ungrammatical in Swedish, although they may occur as newspaper headlines, for instance.

(4.14) a. *Polis arresterade studenten.
          ‘Policeman arrested the-student.’
       b. Polisen/En polis arresterade studenten.
          ‘The policeman/A policeman arrested the-student.’
       c. *Polisen arresterade student.
          ‘The policeman arrested student.’
       d. Polisen arresterade studenten/en student.
          ‘The policeman arrested the-student/a student.’

There are, however, grammatical sentences which include bare singular nouns. The acceptability of such sentences depends, according to Cooper (1984), on the lexical choice. Thus, changing the noun or the verb may influence the grammaticality of a sentence:

(4.15) a. *Det är jobbigt att inte se bil.
It is hard to not see car [indef].
       b. Det är jobbigt att inte ha bil.
          It is hard to not have car [indef].

Bare definite nouns are often used as an anaphoric device, referring to an entity that has already been introduced or is well known in the speech situation. The noun is then in definite form, as in (4.16) below.

(4.16) a. Ta (den) nya bilen.
          Take (the) new car [def].
       b. (den) gamle kungen
          (the) old king [def]
       c. (den) tredje gången
          (the) third time [def]

Errors in Definiteness in Single Nouns

There were six cases of definiteness errors in single nouns. All of them were realized as indefinite nouns. One instance from the corpus is shown in (4.17). Here the topic is introduced by an indefinite noun phrase (en ö ‘an [indef] island [indef]’) in the first sentence, but then in the following sentence, instead of the expected definite noun that would indicate a continuation of the discussion of this topic, we find a single indefinite noun (ö ‘island [indef]’). This noun lacks the definite suffix.

(4.17) (G3.1.3)
       a. Jag såg en ö. Vi gick till *ö.
          I saw an island we went to island [indef]
          ‘I saw an island. We went to island.’
       b. ön
          island [def]

4.3.4 Pronoun Case

Features of Pronouns

Personal pronouns in Swedish are declined for nominative, genitive and accusative case (see Table 4.13 below). Third person singular inanimate pronouns have the same form in both subject and object position. For the plural, the nominative-accusative distinction de-dem is only used in writing. It is not used in speech, where both forms are pronounced dom in the standard language. This spoken form is used (increasingly) in some types of informal writing.5

Errors in Pronoun Case

All five errors in pronoun case concern the nominative case being used in object position. Two cases involved errors in the accusative case of the pronoun han ‘he’, probably due to regional influence,6 e.g.:

(4.18) (G4.1.5)
       a.
bara för man inte vill vara med *han
          just for one not want be with he [nom]
          ‘just because one doesn’t want to be with him.’
       b. honom
          him [acc]

5 Purists recommend, however, keeping the distinction de-dem, and that dom should be used only for rendering spoken language (Teleman et al., 1999, Part 2:270).
6 In certain dialects han ‘he’ is also the object form.

Table 4.13: Personal Pronouns in Swedish

SINGULAR                       NOMINATIVE   ACCUSATIVE    GENITIVE
1st person                     jag I        mig me        min my
2nd person                     du you       dig you       din yours
3rd person animate, male       han he       honom him     hans his
3rd person animate, female     hon she      henne her     hennes hers
3rd person inanimate, common   den it       den it        dess, dens its
3rd person inanimate, neuter   det it       det it        dess its

PLURAL
1st person                     vi we        oss us        vår, vårt ours [com], [neu]
2nd person                     ni you       er you        er, ert yours [com], [neu]
3rd person, written            de they      dem them      deras theirs
3rd person, spoken             dom they     dom them      deras theirs

The rest concerned plural pronouns, like the one in (4.19). As mentioned above, the distinction between the nominative form de ‘they’ and the accusative form dem ‘them’ occurs only in writing. In speech dom is used in both cases. A scan of the writing profiles of all subjects showed that most of the subjects use only the spoken form. For that reason, these errors were included only if the subject used an incorrect written form and not just the spoken form.

(4.19) (G4.1.1)
       a. bilarna bromsade så att det blev svarta streck efter *de.
          the-cars braked so that it became black lines after they [nom]
          ‘The cars braked so there were black lines after them.’
       b. dem
          them [acc]

4.3.5 Verb Form

Verb Core Structure

A verb phrase consists of a verbal head that can form a verb phrase on its own or be combined with modifiers and appropriate complements. In this description no attention is paid to the complements, just to the actual core of the verb phrase. First, the types of verbs (finite and non-finite) are described, followed by a presentation of the simple vs.
compound tense structures, and finally the infinitive phrase is described.

Verbs are divided into finite and non-finite. A sentence must contain at least one verb in finite form to be considered grammatically correct. In Swedish, there are three forms of finite verbs (present, preterite and imperative) and four forms of non-finite verbs (infinitive, supine, present participle and past participle).

Table 4.14: Finite and Non-finite Verb Forms

 TENSE                  FINITE    NON-FINITE     GLOSS
 Infinitive:            att       jaga           to hunt
 Imperative:            jaga                     hunt
 Future:                ska       jaga           will hunt
 Present:               jagar                    hunt/hunts
 Preterite:             jagade                   hunted
 Perfect:               har       jagat [sup]    have hunted [sup]
 Present participle:    den       jagande        the hunting
 Past participle:       är        jagad          is hunted

Among the non-finite verbs, infinitive and supine occur as the main verb in combination with a modifying (finite) auxiliary verb (see Future and Perfect respectively in Table 4.14 above). The infinitive form also occurs in infinitive phrases preceded by the infinitive marker att 'to'. Present and past participle forms have more adjectival characteristics and function as attributes in a noun phrase or in predicative position after a copula verb.

A core verb phrase may consist of one single finite verb and form a simple tense construction, or of a sequence of two or more verbs, composed of one finite verb plus a number of non-finite verbs to form a kind of compound tense (see Table 4.15 below). Compound tense structures, i.e. sequences of two or more verbs, are usually referred to as verb chains or verb clusters and generally include some kind of auxiliary verb followed by the main (non-finite) verb. In Swedish we find the temporal and modal auxiliary verbs in verb cluster constructions.

Table 4.15: Tense Structure

 SIMPLE STRUCTURE:
  Present:                   Katten jagar möss.
                             The cat chases mice.
  Preterite:                 Katten jagade möss.
                             The cat chased mice.

 COMPOUND STRUCTURE:
  Future:                    Katten ska [pres] jaga [inf] möss.
                             The cat will chase mice.
  Perfect:                   Katten har [pres] jagat [sup] möss.
                             The cat has chased mice.
  Past perfect:              Katten hade [pret] jagat [sup] möss.
                             The cat had chased mice.
  Future perfect:            Katten ska [pres] ha [inf] jagat [sup] möss.
                             The cat shall have chased mice.
  Secondary future perfect:  Katten skulle [pret] ha [inf] jagat [sup] möss.
                             The cat would have chased mice.

Verb clusters with temporal auxiliary verbs in general follow two patterns, one expressing the past tense with the main verb in the supine (here only the verb ha 'have' is used), and one for future tense with the main verb in the infinitive.

In subordinate clauses, the temporal finite forms har 'has/have [pres]' or hade 'had [pret]' are often omitted in perfect and past perfect7 and the verb core then consists only of the supine verb form (examples from Ljung and Ohlander, 1993, p.99):

(4.20) a. Han säger att han redan (har) gjort det.
          he says that he already (has) done that
          – He says that he has done that already.
       b. Han sade att han ofta (hade) sett dem.
          he said that he often (had) seen them
          – He said that he had often seen them.

Also the temporal infinitive ha 'have' in the secondary future perfect can be omitted irrespective of sentence type. In these cases, a past tense modal auxiliary is followed directly by a supine form (Teleman et al., 1999, Part3:272):

7 The omission is most common in writing, up to 80% (Teleman et al., 1999, Part4:12), but occurs more and more in speech as well (Teleman et al., 1999, Part3:272).

(4.21) a. Nu blev det inte så illa som det kunde (ha) blivit.
          now became that not so bad as it could (have) become [sup]
          – Now it did not get as bad as it could have.
       b. ... fastän det borde (ha) skett för länge sedan.
          although it should (have) happened for long ago
          – ... although it should have happened a long time ago.
A verb in the infinitive form is treated as part of an infinitive phrase preceded by the infinitive marker att 'to', which is necessary in certain contexts and optional in others. Auxiliary verbs are combined with bare infinitives (as shown and discussed above), thus lacking the infinitive marker, as in (4.22a). An exception is the temporal komma 'will', which requires the infinitive marker, as in (4.22b) (Teleman et al., 1999, Part3:572):

(4.22) a. Hon kan spela schack.
          she can play chess
          – She can play chess.
       b. Hon kommer att spela schack.
          she will to play chess
          – She will play chess.

The bare infinitive is also used in nexus constructions, as in (Teleman et al., 1999, Part3:597):

(4.23) Han ansåg tiden vara mogen.
       he considered the-time be ripe
       – He found the time to be ripe.

Many main verbs take either a noun phrase or an infinitive phrase as complement (Teleman et al., 1999, Part3:570,596). With some main verbs, the infinitive marker is optional (Teleman et al., 1999, Part3:597). The tendency to omit the infinitive marker is higher if the infinitive phrase directly follows the verb (Teleman et al., 1999, Part3:598):

(4.24) a. Vi slutade spela.
          we stopped play
          – We stopped playing.
       b. Vi slutade avsiktligt att spela.
          we stopped deliberately to play
          – We deliberately stopped playing.

Infinitive phrases are found in subject position as well (4.25):

(4.25) Att få segla jorden runt hade alltid lockat honom.
       to get sail earth around had always tempted him
       – He had always wanted to get to sail around the world.

Finite Main Verb Errors

The use of non-finite verb forms as finite verbs, forming sentences that lack a finite main verb, is the most common error type in Child Data. Errors of this kind concern both present and past tense. Most of them (87) occurred in the past tense, as in (4.26a), and concern regular weak verbs ending in -a in the basic form, lacking the appropriate past tense ending.
Nine errors occurred in the present tense, as in (4.27a), and primarily concern regular weak verbs ending in -a, along with some strong verbs.

(4.26) (G5.2.45)
a. På natten ∗vakna jag av att brandlarmet tjöt.
   in the-night wake [untensed] I from that fire-alarm howled
   – In the night I woke up from the fire-alarm going off.
b. vaknade
   woke [pret]

(4.27) (G5.1.2)
a. När hon kommer ner undrar hon varför det ∗lukta så bränt och varför det låg en handduk över spisen.
   when she comes down wonders she why it smell [untensed] so burnt and why it lay a towel over the-stove
   – When she comes down, she wonders why it smells so burnt and why a towel was lying over the stove.
b. luktar
   smells [pres]

The most probable cause for this recurrent error is the fact that in spoken Swedish regular weak verbs ending in -a may lack the past tense suffix and sometimes also the present tense suffix. For example, the past form of the verb vaknade 'woke [pret]' is pronounced either as [va:knade] or reduced to [va:kna], which then coincides with the infinitive and imperative forms vakna 'to wake', as in the erroneous sentence (4.26a) above.

In addition to the above errors in the form of the finite main verb, two instances involved strong verbs, both realized as the (non-finite) infinitive form. One error occurred in the present tense, and one (exemplified in (4.28)) in the past.

(4.28) (G5.2.100)
a. Nästa dag så var en ryggsäck borta och mera grejer ∗försvinna
   next day so was a rucksack gone and more things disappear [inf]
   – The next day a rucksack had gone and more things disappeared.
b. försvann
   disappeared [pret]

Then, there were two occurrences of errors using a supine verb form as predicate of a main sentence. Recall that the supine may occur on its own as predicate in subordinate clauses (see above). These errors occurred in main clauses, both involved the same lemma, and both were committed by the same subject.
One of these error instances has already been discussed in Section 3.3 (example (3.2) on p.29). The other is exemplified and discussed below:

(4.29) (G5.2.88)
a. det låg massor av saker runtomkring jag ∗försökt att kom (⇒ komma) till fören
   it lay [pret] lots of things around I tried [sup] to came (⇒ come) to the-prow
   – There were a lot of things lying around. I tried to go to the prow.
b. försökte
   tried [pret]

The sentence jag försökt att kom till fören 'I tried [sup] to go to the prow' in isolation suggests that just an auxiliary verb is missing in front of the supine form, i.e. hade försökt 'had tried'. However, the past form predicate of the preceding sentence suggests that, in order to be consistent, the predicate of the subsequent sentence should also be in past form. It could be that the subject believes that this word is spelled without the final vowel -e. The reason why this case is considered a grammar error is that it forms another form of the intended lemma. Thus, according to principle (i) in (3.4), it is a grammar error (see Section 3.3).

Finally, ten error instances concerned past participle forms in the finite verb position, as in (4.30), all lacking the final -e of the preterite suffix.

(4.30) (G5.2.92)
a. dom ∗letad överallt
   they search [past part] everywhere
   – They searched everywhere.
b. letade
   searched [pret]

These past participle forms could occur due to the final letter's alphabetical pronunciation (the letter 'd' is pronounced [de] in Swedish). Following the classification principles in (3.4), these errors are considered grammar errors, since another form than the intended one is formed.8

Verb Cluster Errors

Grammar errors in verb clusters affect the form of the (non-finite) main verb and omission of auxiliary verbs. Main verb errors may involve a sequence of finite verbs and thus violate the rule of one finite verb in a clause.
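The "one finite verb per clause" rule that these cluster errors violate can be sketched as a toy check over hand-tagged input. This is only an illustration: FiniteCheck itself finds such errors by subtracting finite state automata, and the function name, tag names and modal list below are invented for the example.

```python
# Toy check for one type of verb cluster error: a modal auxiliary
# followed by a finite verb form, e.g. *"ska blir" where "ska bli"
# (infinitive) is required. Hand-tagged input; tag names invented.

# A few Swedish modal auxiliaries (illustrative subset)
MODAL_AUX = {"ska", "skulle", "kan", "kunde", "vill", "ville"}

def flag_cluster_errors(tagged):
    """Return (aux, verb) pairs where a modal is followed by a finite verb."""
    errors = []
    for (w1, _), (w2, t2) in zip(tagged, tagged[1:]):
        if w1 in MODAL_AUX and t2 == "fin":
            errors.append((w1, w2))
    return errors

# det ska *blir ... ("will becomes"): 'blir' is finite where an
# infinitive is required
sent = [("det", "pn"), ("ska", "fin"), ("blir", "fin"), ("brand", "nn")]
print(flag_cluster_errors(sent))  # [('ska', 'blir')]
```

A real system must of course also handle ambiguity in the lexical tags, which is why FiniteCheck keeps all lexical tags rather than committing to one.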
One error instance involved the secondary future perfect, which requires a supine form, as in (4.31a), where the main verb is realized as a past tense form of the intended verb. The cause of the error is not possible to determine, but an interesting observation is that the erroneous verb form is followed by a preposition beginning in the vowel 'i', which is part of the omitted supine ending, thus indicating a possible assimilation of these sounds.

(4.31) (G6.1.7)
a. Jag skrattade och undrade hur tromben skulle ha ∗kom igenom det lilla hålet.
   I laughed and wondered how the-tornado would [pret] have [inf] came [pret] through the small hole
   – I laughed and wondered how the tornado would have come through the small hole.
b. skulle ha kommit
   would [pret] have [inf] come [sup]

Other errors in the main verb of a verb cluster concerned structures requiring an infinitive verb form, as in (4.32a), where the modal auxiliary verb ska 'will' is followed by a verb in the present tense, blir 'becomes'.

8 Some of the participle forms, like pratad 'told [past part]', are not lexicalized in Swedish, but are quite possible to form in accordance with the grammar rules of Swedish. They are included in the present analysis since they were not detected as non-words by the spelling checker in Word.

(4.32) (G6.1.1)
a. Men kom ihåg att det inte ska ∗blir någon riktig brand
   but remember that it not will [pres] becomes [pres] some real fire
   – But remember that there will not be a real fire.
b. ska bli
   will [pres] become [inf]

There were two cases with an omitted auxiliary verb. Both concerned the temporal verb ha 'to have', and the predicate of the main sentence then consisted of only a supine verb form:

(4.33) (G6.2.2)
a. men pappa — frågat mig om jag ville följa med.
   but daddy — asked [sup] me if I wanted follow with
   – but daddy has asked me if I wanted to come along.
b.
hade frågat OR frågade
had [pret] asked [sup] / asked [pret]

Infinitive Phrase Errors

In this category, we find errors in the verb form following the infinitive marker and in the omission of the infinitive marker after the auxiliary verb komma 'will'. Constructions with main verbs that combine with an infinitive phrase as complement have not been included. As we will see later on (Section 5.5), there are constructions where there is uncertainty in the language as to whether the infinitive marker should be used or not. In general, the infinitive marker tends to disappear more and more. For this reason it is not quite clear which of these cases should be classified as errors.

Four verb form errors occurred where, instead of the (non-finite) infinitive verb that is required, we find the (finite) imperative, as in (4.34), or present form, as in (4.35), after an infinitive marker.

(4.34) (G7.1.2)
a. glöm inte att ∗stäng dörren
   forget not to close [imp] the-door
   – don't forget to close the door
b. att stänga
   to close [inf]

(4.35) (G7.1.1)
a. Men hunden klarar att inte ∗slår sig
   but the-dog manages to not hits [pres] himself
   – But the dog manages not to hit himself.
b. att inte slå
   to not hit [inf]

Three cases concerned an omitted infinitive marker in the context of the temporal auxiliary verb komma 'will', which (as explained above) differs from the other auxiliary verbs and requires the infinitive marker:

(4.36) (G7.2.3)
a. Nu när jag kommer att skriva denna uppsatsen så kommer jag — ha en rubrik om några problem och vad man kan göra för att förbättra dom.
   now when I will to write this essay so will I — have a title about some problems and what one can do to improve them
   – Now when I write this essay, I will have a heading about some problems and what one can do to improve them.
b.
kommer jag att ha
will I to have

The error example (4.36) is even more interesting in that att 'to' is used in the first construction with the verb kommer att skriva 'will write', whereas it is omitted in the subsequent one.

4.3.6 Sentence Structure

Introduction

The errors in this category concern word order, phrases or clauses lacking obligatory constituents, reduplications of the same word, and constructions with redundant constituents.

The finite verb is normally considered the core of a sentence and is surrounded by its complements (e.g. subject, direct and indirect object, adverbials). The distribution of such complements is defined both syntactically (i.e. it defines the verb's construction scheme) and semantically (i.e. it defines what role the different actants play in a sentence). Thus the verb governs the structure of the whole sentence: which constituents are to be included, in what place, and what role they will play. In addition, the position of sentence adverbials plays an important role.

Sentences in Swedish display two types of word order. Main clause order is characterized by the finite verb before the adverbial (dubbed fa-sentence in Teleman et al. (1999, Part4:7)), presented in Table 4.16.9 Subordinate clause word order is characterized by the adverbial before the finite verb (dubbed af-sentence in Teleman et al. (1999, Part4:7)), presented in Table 4.17. In addition to recognizing the distinct word orders in main and subordinate clauses, traditional grammar also makes a distinction between basic word order, where the subject precedes the predicate (example sentence 2 in Table 4.16 and both sentences in Table 4.17), and inverted word order, where the subject follows the predicate (example sentences 1 and 3 in Table 4.16).

Table 4.16: Fa-sentence Word Order

    INITIAL FIELD   MIDDLE FIELD                               FINAL FIELD
    Initiation      Finite Verb   Subject   Adverbial*         Rest of VP
 1. Nu              skulle        Per       nog inte           vilja träffa någon.
    now             would         Per       probably not       like to meet someone
 2. Per             skulle        –         nog inte           vilja träffa någon nu.
    Per             would                   probably not       like to meet someone now
 3. Vem             skulle        Per       nog inte           vilja träffa nu?
    who             would         Per       probably not       like to meet now?

Table 4.17: Af-sentence Word Order

    INITIAL FIELD   MIDDLE FIELD                               FINAL FIELD
    Initiation      Subject   Adverbial*     Finite Verb       Rest of Verb Phrase
 1. eftersom        Per       nog inte       skulle            vilja träffa någon nu
    because         Per       probably not   would             like to meet someone now
 2. vem             Per       nog inte       skulle            vilja träffa nu
    who             Per       probably not   would             like to meet now

9 Conjunctions that coordinate main or subordinate clauses are not included in the scheme. The asterisk in the tables indicates that more constituents of this kind are possible.

Word Order Errors

Word order errors concern transposition of sentence constituents, thus violating the fa-sentence or af-sentence word order constraints. Only five sentences with incorrect word order were found. The following error example (4.37a) violates the fa-sentence word order, since there are two constituents before the finite verb, a subject and a time adverbial. The finite verb is expected in the second position in the sentence. The correct form of the sentence can be produced in two ways: either introduced by the subject, placing the time adverbial last, as in (4.37b), or starting with the time adverbial, placing the subject directly after the finite verb, as in (4.37c).

(4.37) (G8.1.3)
a. ∗Jag den dan gjorde inget bättre.
   I the day did nothing better
   – I didn't do anything better that day.
b. Jag gjorde inget bättre den dan.
   I did nothing better the day
c. Den dan gjorde jag inget bättre.
   the day did I nothing better

Redundancy Errors

As mentioned above, the type and the number of constituents in a sentence are governed by the main verb. Any addition of other constituents influences the whole complement distribution, both syntactically and semantically.
Words were duplicated directly (five occurrences), as in (4.38a) below, with the reduplicated word in the same position as the intended one:

(4.38) (G9.1.3)
a. många som mobbar har ∗har det oftast dåligt hemma
   many that bully have have it most-often bad at-home
   – Many that bully have have it most often bad at home.
b. många som mobbar har det oftast dåligt hemma
   many that bully have it most-often bad at-home

Four occurrences included duplication with words in between, i.e. the same word occurring somewhere else in the sentence. In example (4.39a) the subject jag 'I' is repeated after the verb, as if indicating inverted word order:

(4.39) (G9.1.7)
a. jag fick ∗jag hjälp med det.
   I got I help with it
   – I got I help with it.
b. jag fick hjälp med det.
   I got help with it
   – I got help with it.

The example in (4.40a) involves a case where the writer has fronted not only the object det 'that' but also the verb particle åt 'for', which also occurs in its normal position after the verb. Either the fronted verb particle can be removed, as in (4.40b), or the one following the verb, as in (4.40c).

(4.40) (G9.1.8)
a. Åt det går det nog inte att gör (⇒ göra) så mycket ∗åt.
   about that goes it probably not to do [pres] (⇒ do [inf]) that much about
   – About this not so much can probably be done about.
b. Det går det nog inte att gör så mycket åt.
   that goes it probably not to do that much about
c. Åt det går det nog inte att gör så mycket.
   about that goes it probably not to do that much

In four cases, new words disturbed the sentence structure through their redundancy in the complement structure. In the following example, the pronoun det 'it' is redundant and plays no role in the sentence:10

10 There is also an error in word order between the constituents bara kan 'just can', which should be switched; see G8.1.2 in Appendix B.1.

(4.41) (G9.2.2)
a.
för då kan man inte något ting bara kan gå på stan det då fattar hjärna ingenting
cause then can one not some thing just can go to the-city it then understand brain nothing
– because then one cannot anything just can go to the city it then the brain doesn't understand anything.
b. för då kan man inte något ting, bara gå på stan. Då fattar hjärna ingenting.
   for then can one not some thing just go to the-city then understand brain nothing
   – because then one cannot anything, just go to the city. Then the brain doesn't understand anything.

Missing Constituents

Altogether 44 sentences were incomplete in the sense that one (or more) obligatory constituent(s) were missing. Omission of the noun in the subject position is the most frequent type of error in this category (10 occurrences), e.g.:

(4.42) (G10.1.8)
a. När man tror att man har kompisar blir — ledsen när man bara går där ifrån
   when one thinks that one has friends becomes — sad when one just goes there from
   – When someone thinks that he has friends, he is sad when people just leave from there.
b. blir man ledsen
   becomes one sad

Missing prepositions are quite common (11 occurrences):

(4.43) (G10.6.4)
a. Hunden hoppade ner — ett getingbo.
   the-dog jumped down — a wasp-nest
   – The dog jumped into a wasp's nest.
b. i
   into

Some occurrences of missing verbs were also found:

(4.44) (G10.4.3)
a. Jag tycker att det har med uppfostran — om man nu ger eller inte ger hon/han den saken som man tappade
   I believe that it has with upbringing — if one now gives or not gives she/he the thing that one lost
   – I believe that it has to do with your upbringing whether you give back the thing he/she lost or not.
b. att göra
   to do

Here is an example of a missing subjunction:

(4.45) (G10.7.4)
a. till exempel — den här killen gör så igen så ...
   for instance — the here the-boy does so again so
   – for instance if this boy does so again, then ...
b.
om
if

Other omissions involve pronouns, infinitive markers, adverbs and some fixed expressions, as in:

(4.46) (G10.8.4)
a. sen levde vi lyckliga — våra dagar
   then lived we happy — our days
   – Then we lived happily ever after.
b. i alla våra dagar
   in all our days

4.3.7 Word Choice

This error category concerns words being replaced by other words that semantically violate the sentence structure. The replacements occur mostly within the same word category, but changes of category also occur. Most of these substitutions involve prepositions and particles, but we also find some adverbs, infinitive markers, pronouns and other classes.

In (4.47a) we see an example of an erroneous verb particle. Here the verb att vara lika 'to be alike' requires the particle till 'to' in combination with the noun phrase sättet 'the-manner', and not på 'on' as the writer uses.

(4.47) (G11.1.7)
a. vi var väldigt lika ∗på sättet alltså vi tyckte om samma saker
   we were very like on the-manner in-other-words we fond of same things
   – We were very alike in our manner. In other words, we were fond of the same things.
b. lika till sättet
   like to the-manner

The choice of prepositions is also problematic. In (4.48a) the preposition ur 'from', which describes a completely different action than the required av 'off', was used.

(4.48) (G11.1.2)
a. Vi sprang allt vad vi orkade ner till sjön och slängde ∗ur oss kläderna.
   we ran all what we could down to the-lake and threw out-of us clothes
   – We ran as fast as we could down to the lake and threw off our clothes.
b. slängde av oss
   threw off us

Five errors concerned the conjunction och 'and' used in the position of an infinitive marker. This error is speech related. In Swedish the pronunciation of the infinitive marker att [at] 'to' is often reduced to [å], as is that of the conjunction och [ock] 'and', i.e. both att 'to' and och 'and' are often reduced and pronounced as [å].
As a consequence, these two forms and their syntactic roles can be mixed up in writing, as in the next example (4.49a).

(4.49) (G11.3.1)
a. det var onödigt ∗och skrika pappa
   it was unnecessary and scream daddy
   – It wasn't necessary to scream, daddy.
b. att
   to

The choice between the adverb vart 'whither' and var 'where' caused trouble for two subjects in three occurrences; an example is given in (4.50a). This may also be a dialectal matter, since in certain regions this form has the same distribution as var 'where'.

(4.50) (G11.2.2)
a. Men ∗vart ska jag bo?
   but whither will I live
   – But whither will I live?
b. var
   where

Blends of fixed expressions also occurred. In the following example, the writer mixes up the expressions så mycket jag kunde 'as much as I could' and allt vad jag var värd 'for all I was worth':

(4.51) (G11.5.3)
a. jag sprang så fort ∗så mycket jag var värd
   I ran so fast so much I was worth
   – I ran so fast so much I was worth.
b. allt vad jag var värd
   all what I was worth

Other word choice errors concerned pronouns, adjectives and nouns.

4.3.8 Reference

Reference in Swedish

Pronouns are used to refer to something already mentioned in the text (anaphoric reference) or something present in the utterance situation (deictic reference). The pronoun then correlates with the referring noun and has to agree with it in number and gender.

Reference Errors

Referential violations concern only anaphoric reference, referring to the previous text, both within the same clause and in a larger context. The errors were of two types: cases where the pronoun did not agree (six occurrences) and cases where the referent changed (two occurrences). In the case of agreement, four errors concerned wrong number, as in (4.52a), and two cases were related to gender, as in (4.53a).

(4.52) (G12.1.1)
a.
Nästa dag gick dem upp till en grotta där fick dem var sin korg med saker i. Lena fick en kattunge för manen hade många djur. Och Alexander fick ett spjut. sen gav ∗den sej iväg när de gått och gått så hände något ...
   next day went they up to a cave there got they each his/her basket with things in Lena got a kitten because the-man had many animals and Alexander got a spear then went it [sing] self away when they went and went so happened something
   – The next day they went to visit a cave. There they each got a basket with things in it. Lena got a kitten, because the man had many animals. And Alexander got a spear. Then it went away. When they went and went, something happened ...
b. de
   they

(4.53) (G12.1.5)
a. Vad heter din mamma? Det stod helt still i huvudet vad var det ∗han hette nu igen?
   what is-called your mother [fem] it stood completely still in the-head what was it he [masc] was-called now again
   – What is your mother's name? It was completely still in my head. What was he called now again?
b. hon
   she

In two cases, a shift between direct quotation and narrative occurred. In one such error, in (4.54a), the writer is first involved in the situation, referred to as vi 'we', and then suddenly in the subsequent sentence the pronoun is changed to ni 'you [pl]', switching the focus from the writer as part of a group to other people.

(4.54) (G12.2.1)
a. spring ut nu vi har besökare när ∗ni kom ut ...
   run out now we have visitors when you [pl] came out
   – Run out, we have visitors! When we came out ...
b. vi
   we

4.3.9 Other Grammar Errors

One error instance includes an adverb used as an adjective:

(4.55) (G13.1.2)
a. När jag var ∗liten mindre
   when I was small [adj] smaller
   – When I was a little smaller ...
b. lite mindre
   a little [adv] smaller

Finally, three cases could not be classified at all. The sentences had very strange structure: either single words were incomprehensible or the whole sentence did not make any sense.
In some cases, this could be a question of several sentences being put together, in which case the sentences are incomplete and/or lack any marking of sentence boundaries.

During the analysis, some errors involving sequence of tense were discovered. These are not targeted in the present analysis and will be left for future analysis.

4.3.10 Distribution of Grammar Errors

As discussed in the presentation of error types (Section 3.4), the units by which the frequency of grammar errors could be estimated differ from type to type and are also difficult to count in text containing errors. For that reason, error frequency will be compared between error types, and total numbers of errors will be related to the total number of words.

Overall Error Distribution

In the whole corpus of 29,812 words, 262 instances of grammar errors were found. That corresponds to 8.8 errors per 1,000 words. The different errors are summarized in Table 4.18, grouped by sub-corpora, and in Table 4.19, by age. The total error distribution is also illustrated in Figure 4.1 below.

The most recurrent grammar problem concerns the form of the finite main verb, lacking the tense ending on the main verb (42%). This problem seems to be characteristic of this particular age group, whose writing is close to spoken language. Most of these errors are found in the Deserted Village corpus (44) and among the 9 year olds (72). Frog Story texts also contain quite a high number of such errors. The rest of the corpora include around 10 such errors per corpus.

Missing constituents is the second largest error category (16.8%). These errors tend to appear mostly among the older children, maybe because their text structure is more developed and complex than that of the younger children. Among the different sub-corpora, the Spencer Expository texts include most of these errors (20).
Erroneous choice of words, dominated by errors in the choice of prepositions and verb particles, is the third most frequent category, representing 10.7% (28) of all grammar errors, and seems to be spread evenly both among sub-corpora and age groups. Agreement errors in noun phrases and extra words being inserted into sentences are also quite frequent (5.7% and 5.0% respectively). Agreement errors are quite equally spread in the corpora and occur most among the 9 year olds and 11 year olds. Redundancy errors display a distribution similar to that of the missing constituents: more errors were found among the older children, and the Spencer Expository texts contain most errors of this kind.

The other grammar error categories each represent less than 4% of all the grammar errors. Eight agreement errors in predicative complement occurred, mostly among the 13 year old subjects and in the Spencer Expository texts. The six definiteness errors were made only by 9 year olds and 11 year olds. Pronoun case errors occurred five times, found only in the texts of 10 year olds and 13 year olds, probably because they were the only ones that made the written distinction between nominative and accusative in plural pronouns (de-dem 'they-them'). Seven cases of erroneous verb form after an auxiliary verb occurred, mostly in the writing of 11 year olds and in the Deserted Village corpus. All errors but one in the verb form in infinitive phrase category were made by 11 year olds. Omission of the infinitive marker after the auxiliary verb komma 'will' was rare; only three cases occurred, among the 13 year olds in Spencer Expository texts. Eight referential errors occurred, mostly in the Deserted Village corpus and in the texts by 9 year olds. Five word order errors were found, and they were equally distributed among sub-corpora and ages.

Figure 4.1: Grammar Error Distribution
Table 4.18: Distribution of Grammar Errors in Sub-Corpora

Counts in each row are given in sub-corpus order: Deserted Village, Climbing Fireman, Frog Story, Spencer Narrative, Spencer Expository. Rows with fewer than five counts had empty cells in the original table.

 Agreement in NP                5  4  2  4       | Total  15 |  5.7%
 Agreement in PRED              2  6             | Total   8 |  3.1%
 Definiteness in single nouns   3  1  2          | Total   6 |  2.3%
 Pronoun Case                   1  1  3          | Total   5 |  1.9%
 Finite Verb                   44 13 34 10  9    | Total 110 | 42.0%
 Verb form after Vaux           3  1  1  2       | Total   7 |  2.7%
 Vaux Missing                   2                | Total   2 |  0.8%
 Verb form after inf. marker    2  1  1          | Total   4 |  1.5%
 Inf. marker Missing            3                | Total   3 |  1.1%
 Word Order                     1  1  3          | Total   5 |  1.9%
 Redundancy                     1  2  1  3  6    | Total  13 |  5.0%
 Missing Constituents           7  2  8  7 20    | Total  44 | 16.8%
 Word Choice                    9  5  2  3  9    | Total  28 | 10.7%
 Reference                      3  1  2  2       | Total   8 |  3.1%
 Other                          3  1             | Total   4 |  1.5%
 TOTAL                         82 32 54 26 68    | Total 262 |  100%
 Errors/1,000 Words           10.8 7.1 11.0 4.7 9.3 | 8.8

Table 4.19: Distribution of Grammar Errors by Age

Per-type totals and percentages are as in Table 4.18; the totals by age group are:

 AGE GROUP   TOTAL ERRORS   ERRORS/1,000 WORDS
 9-year          102              14.9
 10-year          43               6.3
 11-year          58               7.2
 13-year          59               7.3
 TOTAL           262               8.8

Distribution Among Sub-Corpora

In Table 4.18 we summarize the grammar errors found in the separate sub-corpora. Most of the grammar errors occurred in the Deserted Village corpus (82), followed by the texts from the Spencer Expository (68). However, if we consider the number of errors in relation to the size of the sub-corpora, i.e. how often they occur per 1,000 words, the Frog Story corpus and the Deserted Village corpus have the highest numbers, with 11 and 10.8 errors, respectively. The Spencer Narrative texts included only 26 grammar errors in total, which corresponds to only 4.7 errors per 1,000 words.
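The densities in Tables 4.18 and 4.19 follow from a simple normalization, errors divided by words, scaled to 1,000. The sketch below checks the overall figure reported above and then inverts the formula to estimate the word count per age group; these word counts are estimates implied by the rounded densities, not figures from the thesis.

```python
# Error density as errors per 1,000 words, checked against the
# overall figures reported above (262 errors in 29,812 words).
def errors_per_1000(errors, words):
    return round(errors / words * 1000, 1)

print(errors_per_1000(262, 29812))  # 8.8

# Inverting the formula estimates the word count per age group from
# the error counts and densities of Table 4.19. Estimates only,
# implied by the rounded densities; not reported figures.
reported = {"9-year": (102, 14.9), "10-year": (43, 6.3),
            "11-year": (58, 7.2), "13-year": (59, 7.3)}
estimated_words = {age: round(e / d * 1000) for age, (e, d) in reported.items()}
print(estimated_words)

# Sanity check: the four estimates should total close to 29,812.
print(sum(estimated_words.values()))
```

The four estimated group sizes sum to within a few words of the reported corpus size, which suggests the marginal figures in the tables are internally consistent.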
As regards the frequency of the various error types (see Figure 4.2), Frog Story and Deserted Village are distinguished from the other sub-corpora in that they have a much higher frequency of finite verb errors, with seven and six such errors per 1,000 words, respectively. The rates are half that or less in the other sub-corpora. Other error types occur at most 1.6 times per 1,000 words. All the sub-corpora are dominated by errors in the finite verb, except for the Spencer Expository texts, where missing constituents are the most frequent error type. Errors in finite verbs are, however, the second most frequent category in these texts. Agreement errors in predicative complement are only found in the Climbing Fireman texts and in the Spencer Expository corpus. Further, errors in the texts of Spencer Narrative are spread over a much smaller number of different error types.

Distribution Among Ages

Looking at grammar errors by age (Table 4.19), we find that most of the grammar errors occur in the texts of the youngest, the 9 year olds (102), and fewest in the texts of the 10 year olds (43). Error density varies from 14.9 errors per 1,000 words in the texts of 9 year olds to 6.3 errors for the 10 year olds. The 11 year olds and 13 year olds have very similar densities of 7.2 and 7.3 errors, respectively. The separate error types and their density are presented in Figure 4.3. Finite verb form errors are most characteristic of the 9 year olds, who made five times more such errors than the other age groups. In the other age groups, finite verb errors and missing constituents are together the most frequent errors. Word choice errors are also highly ranked in all age groups. Errors in agreement with predicative complement are concentrated in the texts of 13 year olds. Apart from the finite verb form errors of the 9 year olds, no error type occurs more than twice per 1,000 words in any age group.
Figure 4.2: Error Density in Sub-Corpora

Figure 4.3: Error Density in Age Groups

4.3.11 Summary

In total, 262 grammar errors were found in Child Data, corresponding to an average of 8.8 errors per 1,000 words. The most common errors concern the form of the finite verb, missing obligatory constituents, choice of words and agreement in noun phrases. Errors are most frequent in the Frog Story and the Deserted Village corpora and among the 9 year olds.

4.4 Child Data vs. Other Data

In this section, the grammar errors found in Child Data are compared with the studies of grammar errors discussed in Chapter 2 (Section 2.4). Only a comparison with the analyses of children's writing at school and the studies on adult writing from the grammar checking projects is included. It turned out to be very difficult to compare the error types in the other studies, since they either did not report much data or they classified errors differently, without giving enough information on exactly which errors were included. The object of this part of the analysis is to investigate the similarities and/or differences between the error types found in children and other writers, in order to see which grammar errors to concentrate on in the development of a grammar checker aimed at children.

4.4.1 Primary and Secondary Level Writers

Teleman's study and the analysis from the Skrivsyntax project are the two analyses of children's writing which report on grammar errors at the syntactic level. The reports do not provide any quantitative analyses concerning the frequency of error types. Instead, the types of errors are reported and, in some cases, exemplified.

Teleman's Examples

Teleman's study (Teleman, 1979) includes examples of writing errors in texts by children from the seventh year of primary school (14 years old). The examples are mostly listed as fragments taken out of context, though some are presented with the surrounding context.
Many of the examples concern word choice or are of a content-related nature. Among the grammar errors (Table 4.20), Teleman (1979) lists examples of errors in pronoun case, verb form, definiteness agreement, missing constituents (mostly a missing subject), reference errors, word order and tense shift.11 Other errors concerned incorrect use of idiomatic expressions, missing prepositions or the use of the conjunction och 'and' instead of the infinitive marker att 'to'. The influence of spoken language is evident in many of the examples. Tense-endings on verbs are dropped, and accusative forms of pronouns are not used; in particular, the pronunciation-like form dom ('they' or 'them') is used instead of the nominative (de) and accusative (dem) forms, which, as mentioned earlier, are only distinguished in writing. Also the use of the conjunction och 'and' instead of the infinitive marker att 'to' indicates influence of the spoken language. Dialect influence occurs in the example of definiteness agreement with the determiner denna 'this' followed by a definite noun. All the error types that Teleman found (except for one) occurred in our Child Data corpus as well. Only the case where two supine verbs follow each other (double supine) was not found in the present Child Data corpus. However, there were additional types of errors in Child Data, such as verb form errors other than dropped tense-endings on finite verbs, occurrences of erroneous word choice other than prepositions or conjunctions used in place of the infinitive marker, and occurrences of superfluous constituents.

11 The column representing the correct forms of the exemplified errors contains my own suggestions; Teleman (1979)'s examples are listed without any suggestions of possible correction.
Table 4.20: Examples of Grammar Errors in Teleman's Study

Pronoun form: *dom 'they [spoken form]' → de, dem 'they [nom], them [acc]'; *han, hon 'he, she' → honom, henne 'him [acc], her [acc]'
Verb form: *fråga 'ask [inf]' → frågade 'asked [pret]'
Double supine: *fått sålt 'got [sup] sold [sup]' → fått sälja 'got [sup] sell [inf]'
Agreement in NP: *denna bilen 'this car [def]' → denna bil 'this car [indef]'
Agreement in PRED: hennes förslag ... förefaller mig *orealistisk 'her suggestion [neu] ... appears to me unrealistic [com]' → orealistiskt 'unrealistic [neu]'
Missing constituents: Tog med honom till polisen. 'took along him to the-police' → subject missing
Reference: polisen ... *de 'the-policeman ... they' → han 'he'
Word order: *ett till fall 'a more case' → ett fall till 'a case more'
Tense shift: Då förstod Majsan varför han *har varit rädd. 'then understood [pret] Majsan why he has been [perf] afraid' → hade varit 'had been [past perf]'
Choice of or missing prepositions: bet *på repet 'bit on the-rope' → bet i repet 'bit in the-rope'; fråga vissa saker 'ask some things' → fråga om 'ask about'
'och' instead of 'att': få lov *och göra något 'get permission and do something' → att göra 'to do'

Skrivsyntax

Among the seven error types distinguished in the error analysis of the Skrivsyntax project on the writing of third year students of upper secondary school (Hultman and Westman, 1977, p. 230), grammar errors were the most frequent: of the whole corpus of 88,757 words, 1,157 errors were classified as grammar errors. According to Hultman and Westman (1977), gender agreement errors were common, and relatively many errors in pronoun case after a preposition occurred in these texts. Errors in agreement between subject and predicative complement occurred quite frequently. Word order errors were also reported, mostly in the placement of adverbials.
Other examples include verb form errors, errors in idiomatic phrases (the majority concerning prepositions), subject-related errors, and clauses with odd structure. Some examples of these grammar errors are displayed in Table 4.21.

Table 4.21: Examples of Grammar Errors from the Skrivsyntax Project

Gender agreement: bland *det mest intolerabla och kortsynta formen på samlevnad 'among the [neu] most intolerant and short-sighted form [com,def] of married life' → den ... formen 'the [com] ... form [com,def]'
Agreement in PRED: barnet är *van 'the-child [neu,def] is used-to [com]' → vant 'used-to [neu]'
Pronoun case: för alla *de som 'for all they [nom] that' → dem 'them [acc]'; hjälpa *de som 'help they [nom] that' → dem 'them [acc]'
Verb form: Naturligtvis måste båda typerna av äktenskap *finns 'of course must both types of marriage exists [pres]' → måste ... finnas 'must ... exist [inf]'; Hon har inte *kunna frigöra sig 'she has not be-able [inf] free herself' → har ... kunnat 'has ... been-able [sup]'
Word order: Ett äktenskap kräver att två personer bara skall älska varandra hela livet ut 'a marriage demands that two people only shall love each-other whole the-life out' → skall älska bara varandra 'shall love only each-other'
Idiomatic expressions: löftet *till trohet 'the-promise to fidelity' → om 'about'; grundtanken *till äktenskapet 'the-fundamental-idea to marriage' → i 'in'

Other errors mentioned concern the structure of sentences and include, for instance, the omission of the infinitive marker att 'to', main clause word order in subordinate clauses, and sub-categorization of verbs. Reference errors are also observed and are considered to be quite common in the material. Some tense problems occurred. The error types encountered in Skrivsyntax give a general indication of a decreasing influence of spoken language on writing compared to earlier ages.
The only errors that may contradict this statement are errors in the use of the subject form of the pronoun de 'they' in object position or in certain expressions after prepositions (where it should be dem 'them'). Verb form errors, on the other hand, include only erroneous use of existing written forms, with no dropped tense-endings being reported. These errors, and errors in the choice of preposition, gender agreement, verb forms and word order, were also found in Child Data. Omission of the infinitive marker with certain verbs was only analyzed in the context of the verb komma 'will' in the present study. Further, constituent structure seems to be more complex than in the texts of Child Data, resulting in errors where the agreeing elements are separated by more words and are thus harder for the writer to discover, e.g. the gender agreement error in bland det mest intolerabla och kortsynta formen på samlevnad.

Conclusion

Although the Teleman and Skrivsyntax studies cannot be considered completely representative of the two age groups, and despite a time span of more than twenty years between those studies and the present one, the error types that occur in children's writing are persistent. The writing of primary school children shows similarities to Child Data mostly in the use of spoken forms. Those types of errors seem to be (almost) non-existent in secondary level writers. Since no counts or other indications of error frequency beyond the examples themselves are given, the relative frequency and distribution of errors remain unclear.

4.4.2 Evaluation Texts of Proof Reading Tools

As already mentioned, the evaluation studies that have been carried out as part of the development of the three Swedish grammar checking tools report on grammar errors found primarily in the writing of professional adult writers. Here, we look at the errors reported in two such studies and compare them to the grammar errors found in Child Data.
Error Profiles of the Evaluation Texts

The performance test of Grammatifix reporting the ratio of detected errors (recall) was based on a newspaper corpus of 87,713 words (Birn, 2000).12 The material included in total 127 grammar errors, summarized in Table 4.22 below.13 Among the error types, Other agreement errors covered complements, postmodifiers and anaphoric pronouns (i.e. reference errors), and the category Missing or superfluous endings consisted of e.g. genitive, passive or adverb endings. Verb form errors included mostly errors in verb clusters. It is not clear which types of errors belong to the category of Sentence structure errors, or what is included under the Other category (see further Birn, 2000, p. 39).

Table 4.22: Grammar Errors in the Evaluation Texts of Grammatifix

ERROR TYPE                        NO.    %
Agreement in noun phrase          22     17.3%
Other agreement errors            9      7.1%
Verb form                         28     22.0%
Choice of preposition             26     20.5%
Missing or superfluous endings    21     16.5%
Sentence structure                8      6.3%
Word order                        3      2.4%
Other                             10     7.8%
TOTAL                             127    100%

Four error types clearly dominate: errors in verb form, choice of preposition, agreement in noun phrase and missing or superfluous endings. Other types occurred at most ten times. In Knutsson (2001), an evaluation of Granska's proof-reading tool is reported, based on a text corpus of 201,019 words. The collection included texts of different genres, mostly news articles of various kinds, some official texts, popular science articles and student papers. The analysis concerned grammar, punctuation and some spelling errors. Table 4.23 below is a summary of the grammar errors (see further Knutsson, 2001, p. 143). The relative frequency of the error types was recalculated.

12 Precision of the system, i.e. how good the system is at avoiding false alarms, was tested on a corpus of 1,000,504 words.
It is not clear whether this corpus includes different newspaper texts or whether there was an overlap with the texts tested for recall of the system. According to the author, only the recall corpus was pre-analyzed manually for grammar errors (see further Birn, 2000).

13 Birn (2000) also reports 8 instances of splits. They are not included here, since that type belongs to the spelling error category.

The error classification in Granska's corpus is more similar to the classification adopted in the present thesis. The category of Verb form errors, however, does not specify the different sub-categories. Altogether, 272 grammar errors occurred in this evaluation corpus. Both Granska's corpus, which is more than double the size, and the evaluation texts of Grammatifix display almost the same error rate, with 1.35 and 1.45 errors per 1,000 words, respectively. Most errors were erroneous verb forms, followed in frequency by agreement errors in noun phrases and missing constituents. Some errors occurred in predicative complement agreement and pronoun form. The rest of the error types occurred fewer than ten times each.

Table 4.23: Grammar Errors in Granska's Evaluation Corpus

ERROR TYPE                      NO.    %
Definiteness in single nouns    4      1.5%
Agreement in noun phrase        69     25.4%
Agreement in pred. compl.       16     5.9%
Verb form                       89     32.7%
Pronoun form                    14     5.1%
Reference                       1      0.4%
Choice of preposition           11     4%
Word order                      8      2.9%
Missing word                    56     20.6%
Redundant word                  4      1.5%
TOTAL                           272    100%

Comparison with Child Data

The most obvious difference between the grammar errors from the evaluation texts and Child Data is the error rate relative to the size of the corpora. Although the Child Data corpus is the smallest, the total number of errors is almost the same as that in Granska's evaluation texts. Errors in Child Data, at almost 9 errors per 1,000 words, are six times more frequent than in the evaluation texts, which have an error density of less than 1.5 errors per 1,000 words (see Table 4.24).
Table 4.24: General Error Ratio in Grammatifix, Granska and Child Data

                        GRAMMATIFIX    GRANSKA    CHILD DATA
Number of words         87,713         201,019    29,812
Number of errors        127            272        262
Errors/1,000 words      1.45           1.35       8.8

As we have seen, error classification varies between the projects, making a comparison of all error types impossible. Verb form errors, noun phrase agreement, missing constituents (in Granska) and erroneous choice of preposition (in Grammatifix) are the four most common error types, with frequencies in the range of 20% to 30% each. Recall that errors in Child Data are less evenly spread among the various types of errors. They are clearly dominated by errors in (finite) verb forms (42%), followed by missing constituents at half that frequency (16.8%). Erroneous choice of words is the third most common grammar error (10.7%). Agreement errors in noun phrase occurred in 15 cases (5.7%). Relating the errors of noun phrase agreement, verb form and choice of preposition reported by all groups to the size of the corpora, as presented in Table 4.25 below, we get a rough picture of the frequency of these three error types in comparison to Child Data. The corresponding error types selected from the Child Data corpus include all the errors in agreement in noun phrases and only the preposition-related errors in the word choice category. Three error categories were selected as representative of verb form errors: finite main verb, verb form after auxiliary verb and verb form after infinitive marker.

Table 4.25: Three Error Types in Grammatifix, Granska and Child Data

                            GRAMMATIFIX         GRANSKA             CHILD DATA
ERROR TYPE                  No.  Errors/1,000   No.  Errors/1,000   No.  Errors/1,000
Agreement in noun phrase    22   0.25           69   0.34           15   0.50
Verb form                   28   0.32           89   0.44           112  3.76
Choice of preposition       26   0.30           11   0.05           10   0.34

Table 4.25 is also rendered as a graph in Figure 4.4 below.
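The density figures in Tables 4.24 and 4.25 follow directly from the raw counts. As a minimal illustration (this sketch is mine, not part of the thesis's FiniteCheck implementation), the bottom row of Table 4.24 can be reproduced in Python:

```python
# Corpus sizes and error counts, as reported in Table 4.24.
corpora = {
    "Grammatifix": (87_713, 127),
    "Granska": (201_019, 272),
    "Child Data": (29_812, 262),
}

def errors_per_thousand(words: int, errors: int) -> float:
    """Error density: number of errors per 1,000 running words."""
    return errors / words * 1000

for name, (words, errors) in corpora.items():
    print(f"{name}: {errors_per_thousand(words, errors):.2f}")
```

Rounded to the precision used in the table, this yields 1.45 for Grammatifix, 1.35 for Granska and 8.8 for Child Data.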
These figures show that children made more errors than the adult writers in all three error types. The difference is marginal for errors in noun phrase agreement and choice of preposition. For verb form errors, the difference is roughly eightfold: children made almost four such errors per 1,000 words, compared to the adults' less than 0.5. The distribution of errors over the three error categories is the same for Child Data and Granska, with fewest errors in choice of preposition and most in verb form. In the Grammatifix corpus, erroneous choice of preposition is quite frequent, at almost the same rate as in Child Data, whereas errors in noun phrase agreement are few.

Figure 4.4: Three Error Types in Grammatifix (black line), Granska (gray line) and Child Data (white line)

Conclusion

The error classifications in the projects differ, making comparison on a more detailed level impossible. The overall error rate reveals similar values for the adult corpora, whereas errors are considerably more frequent in Child Data. A comparison of the three most common error types in the adult corpora with the same types in Child Data displays a considerable difference in the frequency of verb form errors, whereas the difference is not as substantial for the other two types. Although not all error types could be compared, this observation indicates that there is a difference not only in the overall error rate, but also in the types of errors.

4.4.3 Scarrie's Error Database

As mentioned in Section 2.4, corrections by professional proof-readers at two Swedish newspapers were gathered into a Swedish Error Corpora Database (ECD) in the Scarrie project. This database now contains nearly 9,000 error entries. In total, 1,374 of these errors were classified as grammar errors, corresponding to approximately 16% of all errors (Wedbjer Rambell et al., 1999).
Error Profile of the Error Database

The error classification in ECD is very fine-grained: the division of error types is based primarily on the type of phrase involved rather than on the violation type. As Wedbjer Rambell et al. (1999) state, noun phrase errors are the most frequent, followed by verb sub-categorization problems, errors in prepositional phrases and problems within verb clusters. Within the noun phrase category, agreement errors are the most common error type (27.8%), followed by definiteness in single nouns (22.3%) and case errors (14.2%). Verb valence, the second largest grammar problem category, includes problems with the infinitive phrase as the most frequent (24.7%); moreover, over 90% of all verb valence errors concern the infinitive marker att 'to' (one third occur after the verb komma 'will'). Choice of preposition and missing preposition are the most frequent error subtypes in the prepositional phrase category (36% and 26.6%, respectively). Finally, in verb clusters, errors involving an auxiliary verb followed by an infinitive (33.3%), main verbs in the finite form (30.6%) and a temporal auxiliary verb followed by a supine (18.0%) are the most common.

Comparison to Child Data

The fine division of error types and the on-line availability of Scarrie's ECD make a more extensive and precise comparison of the studies possible. In total, eleven error types are compared with the errors in Child Data, presented in Table 4.26. The missing auxiliary verb and missing infinitive marker errors, which were quite few, are not included, nor are all the word choice or Other category errors. The large size of the newspaper corpus in Scarrie (approximately 70,000,000 words) results in a ratio of 0.009 errors per 1,000 words. In the Child Data corpus, the ratio is 7.8 errors per 1,000 words for the listed error types.
The big gap in error density is obvious; the further analysis therefore compares how frequent the errors are across these selected categories and shows which types of errors characterize the corpora.

Table 4.26: Grammar Errors in Scarrie's ECD and Child Data

                                 SCARRIE           CHILD DATA
ERROR TYPE                       NO.    %          NO.    %
Agreement in noun phrase         176    25.7%      15     6.4%
Agreement in pred. compl.        48     7.0%       8      3.4%
Definiteness in single nouns     68     9.9%       6      2.6%
Pronoun form                     21     3.1%       5      2.1%
Finite verb form                 34     5.0%       110    46.8%
Verb form after Vaux             57     8.3%       7      3.0%
Verb form after inf. marker      4      0.6%       4      1.7%
Word order                       57     8.3%       5      2.1%
Missing or redundant word        132    19.2%      57     24.3%
Choice of preposition            76     11.1%      10     4.3%
Reference                        13     1.9%       8      3.4%
TOTAL                            686    100%       235    100%
Errors/1,000 Words               0.009             7.8

Figure 4.5 shows the relative error frequency of the selected error types in Scarrie's corpus, and Figure 4.6 shows the same for the Child Data corpus. The main difference is that the top error type for Child Data, errors in the finite verb form, is not a very common error in Scarrie's corpus. The other three top error types in Child Data and the three top error types in Scarrie are represented by the same categories, but in a slightly different order. In Scarrie's corpus, noun phrase agreement errors are the most frequent, followed by missing and redundant constituents and then choice of preposition. In Child Data, agreement errors in noun phrase are much less frequent than omission or addition of words in sentences, and erroneous choice of preposition is the least frequent of these three categories. Errors in verb forms overall have a much lower frequency in Scarrie's corpus. Errors in verb form after an auxiliary verb are the fifth most common error type in Scarrie's corpus and the most frequent among the verb errors. Errors in finite verb form are even less frequent, and errors in verb form after an infinitive marker are quite rare.
In Child Data, errors in verb form after an auxiliary verb are much less frequent than errors in the finite verb, the most common error. Errors in verb form after an infinitive marker are also rare. As already mentioned, agreement errors in noun phrases have a higher frequency in Scarrie's ECD than in Child Data. Agreement errors in predicative complement position seem to be slightly more common in Scarrie's texts, and likewise definiteness errors in bare nouns.

Figure 4.5: Error Distribution of Selected Error Types in Scarrie

Figure 4.6: Error Distribution of Selected Error Types in Child Data

There were few word order errors in Child Data. These seem more common in Scarrie's ECD, being as common as errors in verb form after an auxiliary verb. The opposite holds for reference errors, which were quite rare in Scarrie's texts and more common in Child Data. Pronoun form errors display a similar distribution in both corpora.

Conclusion

Comparison of error frequency over the selected error types in the two corpora shows both differences and similarities. The largest difference is in the verb form errors: in Scarrie's texts, verbs following an auxiliary verb are the main problem, whereas in Child Data it is the finite verb form, the most common error in the whole corpus. Other differences concern word order and definiteness in bare nouns, both more common in Scarrie's corpus, and reference errors, more common in Child Data. Agreement errors in predicative complements seem to be slightly more common in Scarrie's corpus. Some of the differences could be circumstantial, due to the difference in the size of the corpora, but surely not those in the most common error types. Child Data's profile is characterized by errors in finite verb form and omissions or additions of words; Scarrie's texts are dominated by errors in noun phrase agreement and omission or addition of words.
Agreement errors in noun phrases are the third most common error type in Child Data. Errors in choice of preposition and pronoun form have similar frequency distributions in the two corpora.

4.4.4 Summary

The nature of the grammar errors in Child Data is more similar to the errors found in Teleman's primary school children than to those of the secondary level writers of the Skrivsyntax project. The different error classifications in the grammar checking projects made deeper analysis difficult. Errors are, in general, more frequent in Child Data, but a closer look at three error types indicates that for some error types the difference is marginal, whereas for others children make many more errors. A fine-grained comparison with some selected error types from Scarrie's ECD confirms this difference, with different error frequency distributions in certain error sub-types. On the other hand, the most common error types in Scarrie's corpus are, finite verb form errors aside, also the most frequent in Child Data.

4.5 Real Word Spelling Errors

4.5.1 Introduction

This section is devoted to spelling errors which form existing words. These errors are particularly interesting from the computational point of view, because they normally require analysis of context larger than a single word and are most often not discovered by a traditional spelling checker developed for the detection of errors in isolated words. Since this error category is not the main focus of the present study, the analysis aims at providing an overall impression of what errors occur and what grammatical consequences the new word formations create, rather than an analysis of the spelling error types. First, the spelling violation types that are typical of Swedish are presented (Section 4.5.2), followed by an analysis of segmentation errors (Section 4.5.3) and misspelled words (Section 4.5.4). The total number of errors and their distribution is discussed at the end of this section (Section 4.5.5).
4.5.2 Spelling in Swedish

As mentioned in the classification of error categories in Chapter 3 (Section 3.4), spelling errors occur as violations of the orthographic norms of a language. In Swedish, these errors concern operations on letters and the segmentation of words. Compounds in Swedish are always written as one word. Since compounding is such a productive category, compounds are often a source of erroneous segmentation. They are most often spelled apart, forming more than one word, but the opposite occurs as well, when words are written together as if they were a compound. Other spelling violations occur when letters in words are missing, are replaced by other letters, are moved to other positions in the word, or when extra letters appear. Apart from these basic operations, Swedish has consonant gemination, often a cause of spelling errors (cf. Nauclér, 1980): words can differ simply in single or double consonants and have completely separate meanings, as in glas 'glass' and glass 'ice-cream'. The spelling errors in this study are divided first into segmentation errors and misspellings. Segmentation errors are then further divided into writing words apart, i.e. erroneous separation of compound elements (splits), and writing words together, i.e. erroneous combination of words into compounds (run-ons). The error taxonomy of misspellings is based on the four basic error types of omission, insertion, substitution and transposition, usually applied in research on spelling (e.g. Kukich, 1992; Vosse, 1994), extended with two additional categories related to consonant doubling. The spelling taxonomy thus consists of two main categories, with segmentation errors divided into two sub-categories and misspellings into six sub-categories:

1. Segmentation Errors:
(a) splits - a word written apart, with a space in between
(b) run-ons - words written together as one

2.
Misspellings:
(a) omission - a letter is missing
(b) double consonant omission - a single consonant instead of a double consonant
(c) insertion - an extra letter is added
(d) double consonant insertion - a double consonant instead of a single consonant
(e) substitution - a letter is replaced by another letter
(f) transposition - two or more letters have changed positions

A word can violate just one such spelling operation on letters or spaces, or several spelling violations may occur together. The categories are exemplified in Table 4.27 below. All the errors in the table are real word spelling errors found in the current corpus, some of them with multiple violations. First the error category is presented, followed by an example of it and its correct form. The last column in the table gives the error index in the corresponding appendix where the error instance(s) may be found (misspelled words from Appendix B.2, with indices starting in M, and segmentation errors from Appendix B.3, with indices starting in S).
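The six single-operation misspelling categories above can be operationalized as edit operations between the misspelled form and the intended word. The following Python sketch (an illustration of the taxonomy only, not the thesis's implementation) classifies a misspelling that involves exactly one operation:

```python
def classify(error: str, correct: str) -> str:
    """Classify a single-operation misspelling by comparing the
    erroneous form with the intended word: substitution, transposition,
    (double consonant) omission or (double consonant) insertion."""
    if len(error) == len(correct):
        diffs = [i for i, (a, b) in enumerate(zip(error, correct)) if a != b]
        # Two adjacent differing positions with swapped letters.
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and error[diffs[0]] == correct[diffs[1]]
                and error[diffs[1]] == correct[diffs[0]]):
            return "transposition"
        if len(diffs) == 1:
            return "substitution"
    elif len(error) == len(correct) - 1:
        # One letter dropped; find the first mismatching position.
        i = next((i for i in range(len(error)) if error[i] != correct[i]),
                 len(error))
        missing = correct[i]
        doubled = ((i + 1 < len(correct) and correct[i + 1] == missing)
                   or (i > 0 and correct[i - 1] == missing))
        return "double consonant omission" if doubled else "omission"
    elif len(error) == len(correct) + 1:
        # One letter added; find the first mismatching position.
        i = next((i for i in range(len(correct)) if error[i] != correct[i]),
                 len(correct))
        extra = error[i]
        doubled = ((i + 1 < len(error) and error[i + 1] == extra)
                   or (i > 0 and error[i - 1] == extra))
        return "double consonant insertion" if doubled else "insertion"
    return "multiple or unknown"
```

On the single-error examples of Table 4.27 this yields, e.g., "omission" for bror/beror, "double consonant omission" for koma/komma, "insertion" for örn/ön and "transposition" for förts/först.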
Table 4.27: Examples of Spelling Error Categories

SINGLE ERRORS:
Split: djur affär 'animal store' → djuraffär 'animal-store' (S1.1.28)
Run-on: tillslut 'close' → till slut 'eventually' (S8.1.3-12)
Omission: bror 'brother' → beror 'depends' (M4.2.1)
Double omission: koma 'coma' → komma 'to come' (M4.2.33-36)
Insertion: örn 'eagle' → ön 'the island' (M1.1.51)
Double insertion: matt 'faint' → mat 'food' (M1.2.3)
Substitution: bi 'bee' → by 'village' (M1.1.9-11)
Transposition: förts 'been taken' → först 'first' (M6.4.1-2)

MULTIPLE ERRORS:
Split and double omission: brand manen 'fire mane' → brandmannen 'fire-man' (S1.1.21-22)
Substitution and split: kran kvistar 'tap twigs' → grankvistar 'fir-twigs' (S1.1.59)
Double omission and substitution: fören 'the stem' → förrän 'until' (M8.1.1-4)
Omission and double insertion: tupp 'rooster' → stup 'precipice' (M1.1.46)

Some spoken forms in Swedish are accepted as spelling variants and will not be counted as errors in this analysis. These are listed in Table 4.28 below.

Table 4.28: Spelling Variants

SPOKEN FORM    WRITTEN EQUIVALENT
dom            de 'they'
sen            sedan 'then'
sa             sade 'said'
la             lade 'laid'
nån            någon 'someone'
nåt            något 'somewhat'
nåra           några 'some [pl]'
nånstans       någonstans 'somewhere'
sån            sådan 'such [com]'
sånt           sådant 'such [neu]'
såna           sådana 'such [pl]'
våran          vår 'ours [com]'
vårat          vårt 'ours [neu]'
mej            mig 'me [acc]'
dej            dig 'you [acc]'
sej            sig 'him/her/itself [acc]'
stan           staden 'city [def]'
dan            dagen 'day [def]'

4.5.3 Segmentation Errors

The different types of segmentation errors are listed in Table 4.29, together with the number of different word types involved and how many of them were misspelled. Splits are further divided according to the part of speech they concern. The distribution of segmentation errors across sub-corpora and participant ages is discussed in Section 4.5.5.
Table 4.29: Distribution of Real Word Segmentation Errors

CATEGORY             NUMBER    WORD TYPES    MISSPELLED
Run-ons              13        4             0
Splits:
  Nouns              126       90            6
  Adjectives         49        37            0
  Pronouns           5         2             1
  Verbs              8         8             0
  Adverbs            53        21            5
  Prepositions       2         2             0
  Conjunctions       3         1             0
TOTAL SPLITS         246       160           12

Very few real word spelling errors occurred as words written together (run-ons), since these most often result in non-words. The cases that formed an existing word included just four word types. The most recurrent real word run-on was the prepositional phrase till slut 'eventually', which, when written together, forms the verb tillslut 'close', see (4.56):

(4.56) (S8.1.12)
a. Vi åkte *tillslut på bio.
   (we went close to cinema)
   'We went eventually to the cinema.'
b. till slut 'eventually'

Splits, on the other hand, are usually realized as real words, since they are compounded of two (or more) lemmas. As seen in Table 4.29, most of the splits concern noun compounds. In six cases, these were also misspelled, resulting in real words as in (4.57). Here, the compound brandmännen 'the firemen' is split, while a vowel substitution occurs in the second part of the compound. Both parts are finally realized as lexicalized strings, which then slip through a spellchecker unnoticed:

(4.57) (S1.1.23)
a. *brand menen ryckte ut och släckte elden.
   (fire the-harms turned out and put-out the-fire)
   'The firemen turned out and put out the fire.'
b. brandmännen 'fire-men'

Two instances among the noun splits were not compounds, as for instance in (4.58) below, where the definite suffix is separated from the noun stem:

(4.58) (S1.1.118)
a. ni får gärna bo hos oss under *tid en ni inte har nåt att bo i.
   (you [pl] may gladly live at us during time [definite suffix] you [pl] not have something to live in)
   'You are welcome to live at our place during the time you don't have anywhere to live.'
b. tiden 'the-time'

Also, adjectives are quite often split, with the parts realized as existing words.
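Segmentation errors such as those in (4.56)-(4.58) suggest a simple lexicon-based heuristic: flag adjacent tokens whose concatenation is itself a lexicon word (a candidate split), and tokens that can be cut into two lexicon words (a candidate run-on). The following Python sketch uses a toy word list for illustration; neither the lexicon nor the functions are taken from the thesis's actual implementation:

```python
# Toy lexicon; a real system would use a full-form lexicon of Swedish.
LEXICON = {"djuraffär", "djur", "affär", "jätteglad", "jätte", "glad",
           "till", "slut", "tillslut", "vi", "åkte", "på", "bio"}

def split_candidates(tokens):
    """Adjacent token pairs whose concatenation is a lexicon word,
    e.g. 'djur affär' -> 'djuraffär' (a candidate split compound)."""
    return [(i, tokens[i] + tokens[i + 1])
            for i in range(len(tokens) - 1)
            if tokens[i] + tokens[i + 1] in LEXICON]

def runon_candidates(tokens):
    """Tokens that can be cut into two lexicon words,
    e.g. 'tillslut' -> ('till', 'slut') (a candidate run-on)."""
    out = []
    for i, tok in enumerate(tokens):
        for cut in range(1, len(tok)):
            if tok[:cut] in LEXICON and tok[cut:] in LEXICON:
                out.append((i, tok[:cut], tok[cut:]))
    return out
```

Both checks over-generate (tillslut is itself a valid verb, for instance), which is why real word errors of this kind, as noted in Section 4.5.1, normally require analysis of context larger than the word itself.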
A recurrent error (27 occurrences) is the segmentation of the modifying intensifier Error Profile of the Data 93 jätte ‘giant’ as in (4.59). This is supposed to be written together (see Teleman et al., 1999, Part2:185-188). (4.59) (S2.1.18) a. då blev jag ∗ jätte glad then become I giant happy – Then I was extremely happy. b. jätteglad extremely-happy Splits in adverbs are recurrent as well, often concerning certain words, as seen in the number of word types. Some of them were also misspelled, as for instance in (4.60), where ändå ‘anyway’ is split and the first part includes vowel substitution and realizes as the indefinite determiner en ‘a’: (4.60) (S5.1.46) a. men olof var glad ∗ en då but Olof was happy a then – But Olof was happy anyway. b. ändå anyway Eight cases concerned split verbs. One of these included a morphological split, where the past tense suffix was separated from the verb stem: (4.61) (S4.1.7) a. Han ∗ ring de till mig sen och sa samma sak. he call [pret] to me afterwards and said same thing – He called me afterwards and said the same thing. b. ringde called Also, some splits in pronouns, prepositions and conjunctions occurred. Among the conjunctions, three cases of the conjunction eftersom ‘because’ were segmented: (4.62) (7.1.1) a. ∗ Efter som han frös och ... after that he was-cold and – Because he was cold and ... b. eftersom because Chapter 4. 94 All these segmentation errors resulting in real words are presented in Appendix B.3. They are classified first by the type of violation that occurred and then by partof-speech. 4.5.4 Misspelled Words In general, multiple misspellings occurred in just a few cases, most of the words involved single violations. Substitution and double consonant omission are the most frequent spelling violations. Nouns, pronouns and verbs are the most frequent categories for violations. Certain types of words seem to be more problematic than others regarding spelling. 
For instance, there is real confusion concerning the spelling of the pronoun de ‘they’. Recall that this pronoun is pronounced as [dom], as is the accusative form dem ‘them’. Both forms can be spelled as dom, an accepted spelling variant, as well. In sixteen cases, four subjects used the accusative form dem ‘them’ as in (4.63a): (4.63) (M3.1.49) 16 occurrences, 4 subjects a. ∗ Dem hade ett privatplan them had a private-plane – They had a private-plane. b. De They Two children substituted the vowel in the pronoun, as a consequence, it was realized as the noun dam ‘lady’, as in (4.64a): (4.64) (M3.2.13) 14 occurrences, 2 subjects a. ∗ dam bodde i en by lady lived in a village – They lived in a village. b. dom/de they Another confusion exists between the pronouns det ‘it’ and de ‘they’. In speech, det is usually reduced to [de], thus coinciding with the plural pronoun de ‘they’ in writing. In 33 cases, 15 subjects used de instead of det ‘it’: Error Profile of the Data 95 (4.65) (M3.1.20) 33 occurrences, 15 subjects a. ja men nu är ∗ de läggdags sa mormor yes but now is they bed-time said grandmother – Yes, but now it is time to go to bed, grandmother said. b. det it The opposite occurred in nine cases, where six subjects wrote the singular det ‘it’ instead of the plural pronoun de ‘they’: (4.66) (M3.1.4) 9 occurrences, 6 subjects a. ∗ Det kom till en övergiven by it came to a abandoned village – They came to an abandoned village b. De They Other rather recurrent spelling errors concern the pronoun vad ‘what’, the adverb var ‘where’, the infinitive verb form vara ‘to be’ and the past form of the same verb var ‘was/were’, all of which can be pronounced [va]. First, the forms are often erroneously substituted for one another. In six cases, the form var is used instead of the correct pronoun vad ‘what’ as in: (4.67) (M3.6.22) 6 occurrences, 4 subjects a. Men ∗ var är det för ljud? but where is it for sound – But what is it for sound? b. 
vad
what

Then in eight cases the form vad is used instead of the past verb form var ‘was/were’:

(4.68) (M4.6.8) 8 occurrences, 3 subjects
a. Hans älsklingsfärg ∗vad grön.
   his favourite-colour what green
   – His favourite colour was green.
b. var
   was

Two children also used vad for the adverb form var ‘where’ in three cases:

(4.69) (M3.6.25) 3 occurrences, 2 subjects
a. Hjälp det brinner ∗vad nånstans.
   help it burns what somewhere
   – Help! Fire! Whereabouts?
b. var
   where

Further, these words are also realized as the corresponding (reduced) pronunciation form va, which in turn coincides with the interjection va ‘what’ in writing. Most of these cases concerned the past verb form var ‘was/were’ as in:

(4.70) (M4.5.4) 33 occurrences, 8 subjects
a. Klockan ∗va ungefär 12 när jag vaknade
   the-watch what approximately 12 when I woke
   – The time was about 12 when I woke up.
b. var
   was

Some cases included the infinitive verb form vara ‘to be’:

(4.71) (M4.5.39) 8 occurrences, 5 subjects
a. dom vill inte ∗va kompis med han/hon.
   they want [pres] not what friend with he/she
   – They don’t want to be friends with him/her.
b. vill inte vara
   want [pres] not be [inf]

Here is an example of the use of the adverb var ‘where’ reduced as va:

(4.72) (M6.5.3) 3 occurrences, 1 subject
a. sen undra han ∗va dom bodde
   then wonder he what they lived
   – Then he wondered where they lived.
b. var
   where

Two instances of va corresponded to the pronoun vad ‘what’ as in:

(4.73) (M3.5.4) 2 occurrences, 1 subject
a. Madde vaknade av mitt skrik, hon fråga ∗va det var för nåt.
   Madde woke from my shout, she ask what it was for something
   – Madde woke up from my shout. She asked what was wrong.
b. vad
   what

Other spelling also related to spoken reduction concerned the pronoun jag ‘I’, normally pronounced [ja], which, when written as pronounced, corresponds to ja ‘yes’.
Three instances of the use of jag as ja occurred: (4.74) (M3.5.3) 3 occurrences, 2 subjects a. Vilken fin klänning ∗ ja har what pretty dress yes have – What a pretty dress I have. b. jag I Also, five instances concern the conjunction och ‘and’, usually pronounced as [å], which in writing coincides with the noun å ‘river’. (4.75) (M8.1.11) a. Vi bor i samma hus jag och Kamilla ∗ å hennes hund. we live in same house I and Kamilla river her dog – We live in the same house me and Kamilla and her dog. b. och and All these misspelled words resulting in real words are listed in Appendix B.2. They are classified first by the part-of-speech of the intended word and then by the part-of-speech of the realized word. The type of spelling violations that occur are notified in the margin. Chapter 4. 98 4.5.5 Distribution of Real Word Spelling Errors From the examples above, it is clear that the children’s spelling is quite unstable. In general there is a high degree of confusion as to which form to write in which context and many spoken forms are used. The totals of misspelled words, splits and run-ons are summarized in Table 4.30 below, where the texts are divided into sub-corpora, and in Table 4.31, where the texts are grouped by age. The errors are divided further into non-words and real words and the relative frequency of errors compared to the total number of words is presented. As already discussed in the general overview in Section 4.2, all spelling errors (i.e. both non-word and real word) amount to 10.2% of all words. Most common are misspelled words, followed by splits, which are more recurrent than run-ons. The same distribution applies for real word spelling errors. In total, (the last column in the last row in the tables) these amount to 2.3% of all words, which is three times less than non-word spelling errors (7.9%). 
Put in other words, real word spelling errors amount to 29% of all spelling errors.14 Real word spelling errors are also dominated by misspelled words (1.5%). Splits are more common as real words (0.8%, in comparison to 0.4% for non-word splits), whereas run-ons are almost non-existent as real words (0.04%). Most of the misspelled words realized as real words occur in the Deserted Village corpus and among the 9-year-olds. Real word splits are also most frequent in the Deserted Village corpus, closely followed by the Frog Story corpus. With respect to age, the texts of the 11-year-olds contained most of the erroneous splits (non-word splits are most common among the 9-year-olds). Real word run-ons are very rare, so not much can be said about their distribution across sub-corpora or age groups.

14 Recall that the corresponding rate Kukich (1992) refers to is: 40% of all misspellings result in lexicalized strings.

Table 4.30: Distribution of Real Word Spelling Errors in Sub-Corpora

                      Deserted   Climbing   Frog    Spencer     Spencer
ERROR TYPE            Village    Fireman    Story   Narrative   Expository   TOTAL
MISSPELLED WORDS:
  non-word              743        351       484      173         239        1 990
  %                     9.8        7.8       9.9      3.2         3.3          6.7
  real word             181         71        84       36          60          432
  %                     2.4        1.6       1.7      0.7         0.8          1.4
SPLITS:
  non-word               48         28        32       14           9          131
  %                     0.6        0.6       0.7      0.3         0.1          0.4
  real word              98         41        61       23          23          246
  %                     1.3        0.9       1.2      0.4         0.3          0.8
RUN-ONS:
  non-word              108         25        37       28          29          227
  %                     1.4        0.6       0.8      0.5         0.4          0.8
  real word               5          1         2        4           1           13
  %                     0.07       0.02      0.04     0.07        0.01         0.04
TOTAL:
  non-word              899        404       553      215         277        2 348
  %                    11.9        9.0      11.3      3.9         3.8          7.9
  real word             284        113       147       63          84          691
  %                     3.7        2.5       3.0      1.1         1.1          2.3

Table 4.31: Distribution of Real Word Spelling Errors by Age

ERROR TYPE            9-year   10-year   11-year   13-year   TOTAL
MISSPELLED WORDS:
  non-word               994      292       524       180     1 990
  %                     14.5      4.3       6.5       2.2       6.7
  real word              248       64        78        42       432
  %                      3.6      0.9       1.0       0.5       1.4
SPLITS:
  non-word                71       18        35         7       131
  %                      1.0      0.3       0.4       0.1       0.4
  real word               58       51       113        24       246
  %                      0.8      0.7       1.4       0.3       0.8
RUN-ONS:
  non-word               102       32        58        35       227
  %                      1.5      0.5       0.7       0.4       0.8
  real word                2        2         5         4        13
  %                      0.03     0.03      0.06      0.05      0.04
TOTAL:
  non-word             1 167      342       617       222     2 348
  %                     17.1      5.0       7.7       2.7       7.9
  real word              308      117       196        70       691
  %                      4.5      1.7       2.4       0.9       2.3

4.5.6 Summary

Real word spelling errors are three times less frequent than non-word spelling errors in the Child Data corpus. Misspelled words are the most common type of error, reflecting a clear spelling confusion for some word types. Splits are, in general, more common as real word errors, the opposite being the case for run-ons. Most errors occurred in the Deserted Village corpus and among the 9-year-olds, but the 11-year-olds made most of the erroneous segmentation errors (splits).

4.6 Punctuation

4.6.1 Introduction

Beginning writers, as mentioned in Chapter 3 (Section 3.4), usually use punctuation marks to delimit larger textual units than syntactic sentences, joining for instance (main) clauses together without any conjunctions. The main purpose of the present analysis of punctuation is to investigate the erroneous use of punctuation, manifested both as omissions, which give rise to joined sentences, and as substitutions and insertions. The length of the orthographic sentences marked by the subjects, and especially the number of (main) clauses joined in them without conjunctions (adjoined clauses), will give us a picture of how often sentence boundaries are omitted and to what degree sentences correspond to syntactic sentences. Analysis of the erroneous use of end-of-sentence punctuation and commas will reveal in what other places one might expect them. Orthographic sentences are taken to be sequences of words that start with a capital letter and end in a major delimiter (cf. Teleman, 1974). Also included in that category are sequences that do not completely follow the writing conventions of a capital letter at the beginning and a major delimiter at the end, but indicate the writer’s intention of such marking.
These include sentences ending in a major delimiter followed by a small letter, or the opposite, where the major delimiter is missing but the beginning of the next sentence is indicated by a capital. Within the orthographic sentence, occurrences of main sentences attached to a main clause without a conjunction are counted as adjoined clauses (cf. Näslund, 1981; Ledin, 1998). These reveal whether or not the writer joins syntactic sentences into larger units, or in other words omits sentence boundaries.

The analysis of punctuation is important for decisions on how to handle texts written by children computationally. Do they delimit their text in syntactic sentences? Are there any other units they delimit instead? What is then the nature of such delimitation? How frequently are sentences joined together and sentence boundaries omitted?

4.6.2 General Overview of Sentence Delimitation

The content-related, rather than syntactic, marking of text is also evident in the texts in this study. In the following example (4.76), written by a nine-year-old, most of the sentence boundaries correspond to syntactic units and are delimited in accordance with the writing conventions, using capital letters at the beginning and major delimiters at the end. Two adjoined clauses can be observed in the third and the fifth sentences, joining main sentences together without conjunctions. Two vertical bars indicate where one would expect a major delimiter between the adjoined clauses (spelling or other errors are ignored in the English version).15

(4.76) Den brinnande makan
Det var en gång en pojke som hette Urban. En dag tänkte Urban göra varma makor. Då hände en grej som inte får hända || huset brann upp för att makan hade tat eld. Då kom Urban ut med brinnande kalsingar och sa: Det brinner!!!!!!!!!!!!!!!!!!!!!! Brandkåren kom och spola ner huset || då börja Urban lipa och sa: Mitt hus är blöt.
– The burning sandwich
There was once a boy who was called Urban.
One day Urban planned to make hot sandwiches. Then a thing happened that should not happen. The house burnt down because the sandwich started to burn. Then Urban came out with burning underwear and said: Fire! The fire-brigade came and hosed down the house. Then Urban started to blubber and said: My house is wet. In other texts, punctuation marks are used to delimit larger units as in the following text (4.77), written by a ten year old: (4.77) Den där scenen med dammen som tappade sedlarna tycker jag att den där flickan måste vara fattig så att hon tar sedlarna . Den där scenen med det tre tjejerna tyckte jag att de var taskiga som går ifrån den tredje tjejen || det tycker jag att tjejen tar upp det på mötet med fröken och sedan tar fröken upp det på de andra tjejernas möte med fröken || det kan hjälpa ibland . – That scene with the lady that lost the money, I think that that girl must be poor so it is her who takes the money. That scene with the three girls, I thought that they were mean when they left the third girl. I think that the girl will take that up at the meeting with the teacher and then the teacher will take it up at the other girls’ meeting with the teacher. That can help sometimes. In this text, only two full stops occur. The first delimitation concerns a single sentence, correctly initiated by a capital letter and terminated by a full stop. The 15 The exemplified text represents the spell-checked versions, where the non-word misspellings have been corrected (see further in Section 3.5). 102 Chapter 4. sentence is quite long, however, and commas could facilitate reading. The second full stop terminates a whole paragraph that consists of at least three sentences. Some texts did not include any delimiters or other indicators of sentence boundaries at all, as in (4.78) also written by a ten year old. Again, vertical bars indicate the missing punctuation marks. 
(4.78) så här börja det || jag var på mitt land och bada || då var jag liten || plötsligt kom en snok || i för sig så hugger inte snokar i vatten men jag blev alla fall jätte rädd för jag kunde inte simma då och snoken jagade mig längre och längre ut || då ko min bror med en gummi båt och tog upp mig || då blev jag jätte glad – It started like this. I was in the country and went for a swim. I was little then. Suddenly a grass snake came. Actually grass snakes do not bite in the water, but I was very scared, because I could not swim then and the grass snake chased me further and further out. Then my brother came with a rubber-boat and lifted me up. Then I was very happy. In the following text (4.79) written by an eleven year old we see examples of long sentences, where several clauses are put together either by inserting conjunctions or as adjoined clauses. Especially the first orthographic sentence is quite long, consisting of first three sentences joined by the conjunction och ‘and’ followed by three adjoined clauses. Conjunctions are marked in boldface and omitted sentence boundaries are indicated by two vertical bars: (4.79) Ljus Det var en gång en pojke som hette Karl och gillade att leka med elden och en dag började det brinna i en hö-skulle ute på landet och den stackars pojken var bakom elden som hade sträckt ut sig tio meter bakom hö-skullen || då kom det ett åskmoln och blixten slog ner i ladugården som tog eld || kale som blev jätte rädd och sprang till närmaste hus som låg 9 kilometer bort || det tog en timme att koma ditt och då ringde han fel numer av bara farten. När han kom fram skrek han i örat på brand männen att det brann på Macintosh vägen 738c och brand menen rykte ut och släkte elden. SLUT – Light There was once a boy who was called Karl and liked to play with fire and one day a fire started in a hayloft out in the country and the poor boy was behind the fire that had spread ten meters behind the hayloft. 
Then came a thundercloud and the lighting struck in the cowshed that caught fire. Kalle who became very scared and ran to the nearest house that was 9 kilometers away. It took an hour to get there and then he called the wrong number because he was in such a rush. When he got through he yelled in the ear to the fire-men that there was fire at Macintosh Road 738c and the fire-men turned out and put out the fire. END It is a typical pattern in the whole Child Data corpus, that sentences are put together to build larger units, either as adjoined clauses where sentences follow each other without any conjunctions or long sentences are built with conjunctions Error Profile of the Data 103 as in the above text (4.79) or in the example below (4.80), written by a nine year old: (4.80) på morgonen när vi vakna och jag skulle gå ut att hämta cyklarna märkte jag att vi inte va på toppen av berget utan i en by || jag väckte pappa och skrek att han Va för tung och att vi åkt ner från berget och åkt så långt att vi inte visste va vi va. – In the morning when we woke up and I was about to go out to get the bicycles, I noticed that we were not on the top of the mountain but in a village. I woke Daddy up and yelled that he was too heavy and that we had fallen down from the mountain and fallen so far that we didn’t know where we were. 4.6.3 The Orthographic Sentence In order to investigate more closely how sentence delimitation is used and to what extent it corresponds to syntactic sentences, we analyze the length of orthographic sentences and the number of adjoined clauses. In Tables 4.32 and 4.33 we present the number of orthographic sentences and their length in number of words, along with the number of adjoined clauses and their frequency per 1,000 words. 
Table 4.32: Sentence Delimitation in the Sub-Corpora

                      ORTHOGRAPHIC   ORTHOGRAPHIC      ADJOINED   ADJOINED CLAUSES /
CORPUS                SENTENCES      SENTENCE LENGTH   CLAUSES    1,000 WORDS
Deserted Village          422            18.0             298          39.3
Climbing Fireman          408            11.0              75          16.6
Frog Story                536             9.2              70          14.3
Spencer Narrative         313            17.5              98          17.9
Spencer Expository        392            18.7              73          10.0
TOTAL                   2 071            14.4             614          20.6

Table 4.33: Sentence Delimitation by Age

                      ORTHOGRAPHIC   ORTHOGRAPHIC      ADJOINED   ADJOINED CLAUSES /
AGE                   SENTENCES      SENTENCE LENGTH   CLAUSES    1,000 WORDS
9-years                   476            14.4             216          31.6
10-years                  487            14.0             122          17.8
11-years                  651            12.3             210          26.2
13-years                  457            17.8              66           8.1
TOTAL                   2 071            14.4             614          20.6

The average length of an orthographic sentence was 14.4 words. The shortest sentences are found in the Frog Story and Climbing Fireman corpora. Among the age groups, orthographic sentence length is very similar; only the 13-year-olds have a greater average length of orthographic sentences. Although this measure does not reveal anything about what units are actually delimited, there seems to be a tendency for mean sentence length to increase with age. Additional analysis is needed to reveal whether the increase in length of orthographic sentences with age is because children become worse at delimiting sentences or because their sentences have a more complex structure (presumably the latter). In comparison, the younger primary school children in the study by Ledin (1998, p.21) showed a similar orthographic sentence length, 12.9 words, although the older children averaged 10.0 words, which contradicts the hypothesis. The orthographic sentence length for adults in the Hultman and Westman (1977) study averaged 14.7 words,16 whereas secondary-level students had longer sentences, with an average of 16.8 words.

The frequency of adjoined clauses reflects how often (main) sentences are joined and sheds more light on the nature of text delimitation.
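Counts of the kind reported in Tables 4.32 and 4.33 presuppose a segmentation of each text into orthographic sentences. A minimal sketch of such a segmenter, under the simplifying assumption that only the major delimiters (., !, ?) signal a boundary — deliberately ignoring the capital-letter cue and the missing or misplaced delimiters discussed above, which the actual analysis had to handle manually:

```python
import re


def orthographic_sentences(text: str) -> list[str]:
    """Split a text at major delimiters (., !, ?). A following lower-case
    letter is tolerated, in line with the relaxed definition of the
    orthographic sentence used in the text. Heuristic sketch only."""
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def mean_sentence_length(text: str) -> float:
    """Mean orthographic sentence length in words, as in Tables 4.32-4.33."""
    sents = orthographic_sentences(text)
    return sum(len(s.split()) for s in sents) / len(sents)
```

For example, `mean_sentence_length("Hej på dig. Det var en gång en pojke som hette Urban.")` yields 6.0 (two sentences of 3 and 9 words). A text wholly without delimiters, like example (4.78), would come out as a single long “sentence”, which is exactly why such counts must be interpreted with care for this population.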
A common hypothesis is that adjoined clauses become less frequent with age, often being considered a phenomenon related to primary school writers (Ledin, 1998). This seems to hold for our data too. The 13-year-olds in the present study had four times fewer adjoined clauses per 1,000 words than the 9-year-olds. The other two age groups also put quite a large number of clauses together without conjunctions. In the sub-corpora, adjoined clauses are four times more frequent in the hand-written texts of Deserted Village than in the Spencer Expository corpus. The average value is 20 adjoined clauses per 1,000 words in the whole corpus. In comparison, the younger primary school children in the Ledin (1998, p.25) study had 10.2 adjoined sentences per 1,000 words overall, but 28.9 in narrative writing. The older children had on average 8.2 adjoined sentences per 1,000 words. In a study by Näslund (1981) (reported in Ledin (1998)), final-year primary school children had on average 9.0 adjoined sentences per 1,000 words and upper secondary students 5.1.

Not surprisingly, the analysis showed that sentence length increases with age, whereas the number of adjoined clauses decreases with age. Although the analysis did not identify what other units are marked, it indicates clearly that the younger children more often join sentences together into larger units.

16 The average value is based on the orthographic sentence length of adult texts in five genres (see Hultman and Westman, 1977, p.223).

4.6.4 Punctuation Errors

Errors related to the use of major delimiters, summarized in Tables 4.34 and 4.35, concern omission of sentence boundaries (Omission), extra delimiters (Insertion) in front of a subordinate clause or a conjunction, and periods placed in lists and adjective phrases or at other syntactically incorrect places in a sentence.
Table 4.34: Major Delimiter Errors in Sub-Corpora

                                    Deserted   Climbing   Frog    Spencer     Spencer
ERROR TYPE                          Village    Fireman    Story   Narrative   Expository   TOTAL     %
Omission                               310        75       116      109          82         692    92.6
Insertion in front of a subclause        9         9        12        1          16          47     6.3
Insertion, other                         4         2         1        -           1           8     1.1
TOTAL                                  323        86       129      110          99         747

Table 4.35: Major Delimiter Errors by Age

ERROR TYPE                          9-years   10-years   11-years   13-years   TOTAL     %
Omission                               264       134        220         74      692    92.4
Insertion in front of a subclause       16         6         12         15       49     6.5
Insertion, other                         2         1          4          1        8     1.1
TOTAL                                  282       141        236         90      749

The most common error is the omission of sentence end-markers, often in the case of adjoined clauses. In (4.81) we see an example of a period inserted between a subordinate clause and its main clause:

(4.81) Medan Oliver sprang ∗. Hade Erik vekt en uggla som nu jagade honom.
       while Oliver ran had Erik woken an owl that now chased him
       – While Oliver ran, Erik had woken up an owl that now chased him.

Some cases of a period being placed in enumerations occurred, as in (4.82):

(4.82) Där nere i det höga gräset låg Dalmatinen Tess ∗. Grisen kalle knorr Hammstern Hilde ∗. ödlan Graffitti katten fillipa och ...
       there down in the high grass lay the-dalmatian Tess the-pig Kalle Knorr the-hamster Hilde the-lizard Graffitti the-cat Fillipa and
       – Down there in the high grass lay the Dalmatian Tess, the pig Kalle Knorr, the hamster Hilde, the lizard Graffitti, the cat Fillipa and ...

Further, the erroneous use of commas was analyzed, but only when syntactic violations occurred or when commas were omitted in enumerations. Commas were, in general, very rare and, when used, were often misplaced.
Commas occurred in front of a conjunction in an enumeration in (4.83):

(4.83) De hade med sig: ett spritkök, ett tält ∗, och massa mat, några kulgevär ∗, och ammunition m.m
       they had with themselves a spirit-stove a tent and a-lot-of food some rifles and ammunition etc
       – They had with them a spirit-stove, a tent and lots of food, some rifles and ammunition, etc.

In some instances a comma was placed in front of a finite verb:

(4.84) Linda ∗, brukade ofta vara i stallet.
       Linda used-to often be in the-stable
       – Linda often used to be in the stable.

Often a comma was used where one would expect a full stop:

(4.85) Nasse kunde inte sova ∗, plötsligt hörde Nasse nån som öppnade dörren.
       Nasse could not sleep suddenly heard Nasse someone that opened the-door
       – Nasse could not sleep. Suddenly Nasse heard someone open the door.

Error frequencies are summarized in Tables 4.36 and 4.37 below. Error types include a missing comma in enumerations or adjective phrases (Omission), an extra comma in front of a conjunction, in an enumeration or in other cases (Insertion), and commas used instead of a major delimiter to mark a sentence boundary (Substitution).

Table 4.36: Comma Errors in Sub-Corpora

               Deserted   Climbing   Frog    Spencer     Spencer
ERROR TYPE     Village    Fireman    Story   Narrative   Expository   TOTAL     %
Omission          41          2       10        3            3          59    33.5
Insertion          5         13        1        4            7          30    17.0
Substitution       5         22        2       30           28          87    49.4
TOTAL             51         37       13       37           38         176

Table 4.37: Comma Errors by Age

ERROR TYPE     9-years   10-years   11-years   13-years   TOTAL     %
Omission          22          5         28          4       59    33.5
Insertion         12          8          5          5       30    17.0
Substitution      16         15         12         44       87    49.4
TOTAL             50         28         45         53      176

Overall, commas were mostly placed at sentence boundaries or were omitted. In the Deserted Village corpus commas were mostly omitted, whereas in the other texts they were often used to mark a sentence boundary. The 9-year-olds and 11-year-olds tend to omit commas, whereas the 13-year-olds use commas mostly to mark sentence boundaries.
4.6.5 Summary The delimitation of text varies both by age and corpora and indicates clearly that, especially younger children, often join clauses into larger units. Orthographically, the 13-year olds form the longest units with the smallest number of adjoined clauses. Most adjoined clauses occur among the youngest group, 9-year olds, and in the hand-written corpus of Deserted Village. The erroneous use of major delimiters is mostly represented by omission or insertion in front of subordinate clauses, lists, etc. Commas are mostly missing or used to mark sentence boundaries. 4.7 Conclusions All the grammar errors that were expected as “typical” for Swedish writers, including noun phrase agreement, predicative complement agreement, verb form and the choice of prepositions in idiomatic expressions, are represented in Child Data, but not all are very frequent. Especially frequent are errors in verb form, mostly in the finite main verb (other verb form errors were much less frequent). Errors in predicative complement agreement are not very common, whereas noun phrase agreement errors are more frequent. Erroneous choice of preposition is included in the category of word choice errors, represented by ten occurrences. More characteristic for this population are, besides the omission of tense-endings on finite verbs, errors in omission of obligatory constituents in sentences and word choice errors. Some impact of spoken language on writing is reflected (again) in finite verb forms, pronoun forms and also some cases of dialect forms within noun phrase. 108 Chapter 4. Comparison with grammar errors in other studies shows, not surprisingly, most similarities with the writing of primary school children. In comparison to adult writers, there are differences both in how frequent errors are and in error distribution. Grammar errors in Child Data are much more frequent than among adult writers, with approximately 5 to 8 errors for children and 1 error for adults per 1,000 words. 
Errors in verb form, noun phrase agreement, missing or redundant words and choice of preposition are the most common error types for all populations, including the Child Data population. The difference lies in the error frequency distribution. A closer look at the different sub-types of the verb form category shows that the discrepancy is due to the frequent dropping of tense-endings on finite verbs in the Child Data. Such errors are not very common in the newspaper articles of the Scarrie corpus, where errors in verbs after an auxiliary verb are the most common verb error. The grammar error profile of Child Data and its comparison with adult writers suggests then not only inclusion of the four central grammar error types in a grammar checker for primary school writers, but the treatment of errors in finite verb form in particular. Another observation more related to error correction is that in many cases more than one solution is possible, a fact exemplified in the analysis. Also, at the lexical level spoken forms are common. The spelling of many word forms indicate confusion as to what form should be used in which context. Among real words, misspelled words were most common, followed by splits that were more common in general as real words. Run-ons as real words were very rare. The overall spelling error frequency seems to be representative for the age group. Errors in punctuation are mostly represented by omission, there are cases where marking is put at syntactically incorrect places. There was quite a high frequency of adjoined clauses, especially among the younger children, indicating that subjects join syntactic units to larger units and do not delimit text in (only) syntactic sentences. The analysis does not reveal what other larger units are selected instead, if any. 
On the other hand, this observation clearly indicates that a grammar checker cannot rely on sentence marking conventions and consider capitals or sentence delimiters as real markings of the beginning or end of a syntactic sentence. We should be aware that marking of sentence boundaries might be omitted in texts written by children, or even misplaced. The following conclusions can then be drawn from the analysis of Child Data for further work on the development of a grammar error detector for primary school children:

• include at least detection of errors in verb form (especially the finite verb), agreement in the noun phrase, redundancy and missing constituents, and some word choice errors (such as the use of prepositions),
• be aware that there may be more than one solution for correcting an error,
• do not rely on the use of capitals or sentence delimiters as indicators of syntactic sentence boundaries; rather, be aware that sentence marking can be missing or misplaced and several (main) clauses can be joined together.

Part II
Grammar Checking

Chapter 5
Error Detection and Previous Systems

5.1 Introduction

Constructing a system that will provide the user with grammar checking requires not only analysis of what error types are to be expected, but also an understanding of what possibilities there are to detect and correct an error. In the previous chapter, an analysis was presented of the grammar errors found in texts written by children, and the central errors for this group of users were identified. The purpose of this chapter is to explore the second requirement and analyze the errors in terms of how they can be detected. The questions that arise are: What errors can be detected by means of syntactic analysis, and which require other levels of analysis? How much of the text needs to be examined in order to find a given error? Can it be traced within a sequence of two or three words, a clause, a sentence or a wider context?
I will also investigate available technologies and establish: What grammar errors are covered by the current Swedish grammar checkers? Where do they succeed and where do they fail on Child Data? The chapter starts with a description of the requirements and functionalities of a grammar checker and the performance it has to achieve (Section 5.2), followed by an analysis of the possibilities for detecting the errors in Child Data (Section 5.3). Then some grammar checking systems are described, paying special attention to Swedish tools (Section 5.4), followed by a performance test of the Swedish systems on Child Data (Section 5.5). Conclusions are presented in the last section (Section 5.6).

5.2 What Is a Grammar Checker?

5.2.1 Spelling vs. Grammar Checking

Writing aids for spelling, hyphenation, or grammar and style are part of today's authoring software. Spelling and hyphenation modules were the first proofing tools developed. They are traditionally built to handle errors in single isolated words. Grammar checkers are a fairly new technology, not only aiming at syntactic correction, as one would expect from their name, but often also including correction of graphical conventions and style, such as punctuation, word capitalization, number and date formatting, word choice and idiomatic expressions. Thus, whereas a spelling checker detects and handles errors at word level, all detection of errors that depends on the surrounding context has been moved up to the level of grammar checking (cf. Arppe, 2000; Sågvall Hein, 1998a).1 Proofing tools exist either as separate modules, developed by different companies, that can be attached to an editor (e.g. Microsoft proofing tools are delivered by different suppliers), or as spelling and grammar checkers integrated into a single system (see further in Section 5.4).
5.2.2 Functionality

Proofing tools, in general, give those involved in the process of writing support in the rather tedious, time-consuming stage of revision (or rewriting),2 and are helpful in finding the types of errors humans easily overlook (cf. Vosse, 1994). Their functionality can be defined in terms of detection, diagnosis and correction (or suggestion for correction) of errors. Identifying incorrect words and phrases is the most obvious task of a grammar checker. The position of an error in the text can be located either by marking exactly the area where the error is, or by marking the error together with its surrounding context (e.g. marking only the erroneous noun vs. marking the whole noun phrase). Detection of an error can be enough feedback to the user, if the user understands what went wrong. Diagnosis of the error is important when the user needs an explanation, especially if the tool handles several related error types. In the long run, diagnosis is of real use to every user in order to promote understanding of the error marked (see Domeij, 1996; Knutsson, 2001). Finally, presenting one (or more) suggestions for revision of the error can enhance a user's understanding of the problem in addition to providing an easy way to correct the error.

1 Proofing tools that correct style and graphical conventions but offer no syntactic correction also exist (cf. Domeij, 2003, p.14).
2 Recall that editing activities on a computer usually occur during the whole process of writing and not only at the end. The writer may switch several times between writing phases, see Section 2.3.2.

The functionalities of such a system must be achieved with high precision. Systems should not mark correct strings as incorrect. A system that detects many errors but also marks a large amount of correct text as erroneous can be regarded more negatively by a user than a system that detects fewer errors but makes fewer false predictions (cf.
Birn, 2000).

5.2.3 Performance Measures and Their Interpretation

Performance Measures

Within the field of information extraction and information retrieval, the measures of recall, precision and F-value have been developed for measuring the effectiveness of algorithms (van Rijsbergen, 1979). Recall measures the proportion of targeted items that are actually extracted by a system, also referred to as coverage. Precision measures the proportion of correctly extracted information, also referred to as accuracy. The overall performance of a system can be measured by the F-value, a balanced (harmonic) mean of recall and precision. When recall and precision have approximately the same value, the F-value coincides with their mean. The main attributes by which the performance of a grammar checker is evaluated are likewise related to its effectiveness and functionality. The attributes for the evaluation of writing tools have been discussed and developed within the TEMAA project (A Testbed Study of Evaluation Methodologies: Authoring Aids) (Manzi et al., 1996) and the EAGLES project (Expert Advisory Group on Language Engineering Standards) (EAGLES, 1996), with respect to a product's design specifications and user requirements. They consist of recall, which in this case estimates how many of the targeted errors are actually detected by the system (i.e. grammatical coverage), and precision, which measures the proportion of real errors among those detected and reveals how good a system is at avoiding false alarms (i.e. flagging accuracy). The higher the coverage and accuracy of the system, the better. A third attribute of proofing tools concerns suggestion adequacy, which relates to the system's suggestions for correction. These validation parameters usually vary depending on the system's own strategies (Paggio and Underwood, 1998; Paggio and Music, 1998).
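The relationship between these measures can be made concrete with a short computation; the counts below are hypothetical and serve only to illustrate the definitions:

```python
def precision_recall_f(flagged_correctly, flagged_total, errors_total):
    """Flagging accuracy (precision), coverage (recall) and their
    harmonic mean (F-value) for a grammar checker's output."""
    precision = flagged_correctly / flagged_total
    recall = flagged_correctly / errors_total
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value

# Hypothetical run: 40 real errors flagged, 50 flags in total,
# 80 real errors present in the text.
p, r, f = precision_recall_f(40, 50, 80)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
# precision=0.80 recall=0.50 F=0.62
```

Note that with equal precision and recall (say, both 0.5) the F-value equals that common value, matching the remark above.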
The exact definitions of the evaluation measures used in the present study are presented in Section 5.5.

Methods and Interpretation of Evaluation

Besides the above-mentioned measures, the whole method of evaluation and the interpretation of results are important. A system's performance can be evaluated against an error corpus consisting of a collection of (sentence) samples with the errors targeted by the system (e.g. Domeij and Knutsson, 1999; Paggio and Music, 1998). More recently, tests with text corpora have also been made (e.g. Birn, 2000; Knutsson, 2001; Sågvall Hein et al., 1999), which contain both erroneous (ungrammatical) and correct (grammatical) word sequences. The capability of a system to handle correct text is better tested with the latter method, where the proportion of grammatical text is higher. At least three factors may influence the outcome of an evaluation of a system's performance: the kinds of syntactic constructions present in the evaluation sample, the number of errors in them, and who the writer was (beginner, student, professional, second language learner, etc.). That is, different text genres and different degrees of writing skill may display different ranges of syntactic constructions, which in turn influence how likely each error type is to occur. The size of the corpus needed for evaluation can depend on the error frequency in a writing population or on the type of error evaluated. As discussed in Section 4.4, adults in the analyzed corpora made on average one grammatical error per 1,000 words. In order to cover a satisfactory quantity of syntactic constructions and errors in them, the evaluation corpus must be quite large. Grammar errors in the children's corpus are on average eight times more common than for adults, which means that a smaller corpus will probably be sufficient for evaluating this population.
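As a rough illustration of the corpus-size argument, using the rates just cited (about one error per 1,000 words for adults, roughly eight times that for the children), the corpus needed to observe a given number of errors shrinks proportionally. This is a back-of-the-envelope sketch; the assumption that errors are spread evenly through the text is mine:

```python
def words_needed(target_errors, errors_per_1000_words):
    """Approximate corpus size (in words) needed to observe a given
    number of errors, assuming errors are spread evenly through the text."""
    return int(target_errors * 1000 / errors_per_1000_words)

# To observe roughly 100 errors of some type:
adult = words_needed(100, 1.0)  # adult rate: ~1 error / 1,000 words
child = words_needed(100, 8.0)  # child rate: ~8 errors / 1,000 words
print(adult, child)
# 100000 12500
```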
Thus, different populations of writers can have different requirements on what is needed for evaluation. Similarly, the frequency of different error types varies in general, in that some error types are more common than others. For instance, a larger corpus is probably needed to cover errors in word order than errors in noun phrase agreement, which are in general more frequent. The method used and the factors that may influence the outcome of an evaluation have to be taken into consideration when interpreting results, especially in a comparison between systems. The evaluated text genre, the size of the corpus, the error types and the nature of the writers should be related.

5.3 Possibilities for Error Detection

5.3.1 Introduction

Current grammar checking systems are restricted to a small set of all possible writing errors. Not all possible syntactic structures are covered, and many errors above the single word level cannot be found without semantic or even discourse interpretation (cf. Arppe, 2000). In this section I discuss which errors in Child Data can be found by means of syntactic analysis and which require higher levels of analysis, such as semantics or discourse analysis. If syntactic analysis is sufficient, an examination follows of how much context is required for detection and, further, whether the error can be identified locally by analysis restricted to short word sequences (i.e. partial parsing) or whether analysis of complete clauses and/or sentences is necessary (i.e. full parsing). The different error types will be divided in accordance with both previous methods of classification (see Section 3.3.3) and the error taxonomy that was used to distinguish real word spelling errors from grammar errors (see Section 3.3.4).
That is, errors will be divided according to whether they are structural errors, violating the syntactic structure of a clause, or non-structural errors, concerning feature mismatch; whether they form new lemmas or other forms of the same lemma; and finally whether words are omitted, inserted, substituted or transposed. Further, the violation types will be considered in relation to the means required to detect them. A previous analysis of this kind was provided within the Scarrie project, with the assumption that, in general, partial parsing can be used to handle non-structural errors whereas other methods should be applied for structural errors (in the Scarrie project local error rules were used). They also identified error types that could not be handled by either of those two methods. The study further reports on a problem with this division, since many errors could be handled by both methods (see Wedbjer Rambell, 1999c). The discussion in the analysis below is brief, referring to previously discussed examples in the analysis of errors in Chapter 4 or directly to the index numbers in the error corpora presented in Appendix B. The section concludes with a summary of detection possibilities for the errors in Child Data. The summary will serve as a specification for the final part of the implementation described in Chapter 6.

5.3.2 The Means for Detection

Agreement in Noun Phrases

Detection of agreement errors in noun phrases requires a context of precisely the noun phrase, and such errors can thus in general be detected by noun phrase parsing. All noun phrase errors are non-structural, and in Child Data they are concentrated in one constituent realized as another form of the intended lemma. Syntactically, most of the noun phrases follow one of the three noun phrase types (see Section 4.3.1) and three cases are in the partitive form.
The feature sets have to include, besides definiteness, number and grammatical gender, also the semantic masculine gender of adjectives. In this case, not only agreement with the noun has to be fulfilled, but also requirements on consistent use. That is, in one case (G1.2.3; see (4.9) on p.49) a (masculine) noun is modified by two adjectives, where one of them has the masculine weak form and the other the common gender weak form. Both adjectives should follow one of the patterns, i.e. semantic or grammatical gender. Further, the feature mismatch in partitive noun phrases concerns not only the agreement between the quantifier and the noun, but also the number of the head noun (e.g. G1.3.2; see (4.11) on p.50). Another important thing to bear in mind is the correct interpretation of spelling variants. For instance, the errors in G1.2.2 and G1.2.4 (see (4.8) on p.48) include the determiner de 'the [pl]' spelled as the allowed variant dom, which in turn is homonymous with the noun dom 'judgment/verdict'. It is important that the lexicon of the system contains this information.

Agreement in Predicative Complement

In order to detect errors in agreement between the subject or object of a sentence and its complement, a context larger than a noun phrase is required. The errors are non-structural, realized as other forms of the same lemma, and can still be handled by partial parsing identifying the parts that have to agree, i.e. the noun phrase, the verb types used in such constructions and the modifying adjective phrase. In Child Data, these errors concern agreement mismatch between the subject and an adjective or participle as the predicative complement. Syntactically, many of the subject noun phrases include embedded clauses (often with other predicates) that increase the complexity and the distance between the subject and the predicative complement, and probably require more elaborate analysis.
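The feature-matching idea behind partial parsing of noun phrases can be sketched in a few lines. The toy lexicon, tags and feature values below are invented for illustration and are not the actual system's; keeping all lexical readings of each word rather than disambiguating first (as discussed for the system's lexical level) is mimicked here by trying every combination of readings:

```python
from itertools import product

# Toy lexicon: word form -> list of readings (pos, gender, number, definiteness).
# Feature values are illustrative, e.g. "com" = common gender, "neu" = neuter.
LEXICON = {
    "en":    [("det", "com", "sg", "indef")],
    "ett":   [("det", "neu", "sg", "indef")],
    "liten": [("adj", "com", "sg", "indef")],
    "litet": [("adj", "neu", "sg", "indef")],
    "hus":   [("noun", "neu", "sg", "indef"), ("noun", "neu", "pl", "indef")],
    "bil":   [("noun", "com", "sg", "indef")],
}

def np_agrees(words):
    """True if some combination of readings gives all words the same
    gender, number and definiteness (no prior tag disambiguation)."""
    for reading in product(*(LEXICON[w] for w in words)):
        features = {r[1:] for r in reading}  # drop the pos field
        if len(features) == 1:               # one consistent feature bundle
            return True
    return False

print(np_agrees(["ett", "litet", "hus"]))  # True: all readings share neu/sg/indef
print(np_agrees(["en", "litet", "hus"]))   # False: gender mismatch en/litet
```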
Further, in G2.2.3 (see (4.13) on p.51) several predicative complements are coordinated; detecting all of them requires analysis of coordination. Finally, we have the case of G2.2.6 (see (5.1) below), where the head noun syskon is ambiguous between the singular reading 'sibling [sg]' and the plural reading 'siblings [pl]', which complicates analysis.

(5.1) (G2.2.6)
nasse är en gris som har massor av syskon. nasse är skär. Men nasses syskon är ∗smutsig
Nasse is a pig that has lots of siblings [pl] Nasse is pink but Nasse's sibling [neu,sg/pl] is/are dirty [com,sg]
– Nasse is a pig that has a lot of brothers and sisters. Nasse is pink. But Nasse's brothers and sisters are dirty.

Identifying the subject, the copula verb and the adjective, syskon är smutsig, is enough to signal that an error in predicative complement agreement has occurred. However, the diagnosis can fail if the noun is interpreted only as singular. The tool would then signal that a mismatch in gender occurred, suggesting a form change in the adjective to smutsigt 'dirty [neu,sg]'. But if the author refers to massor av syskon 'lots of siblings', then the noun should be interpreted as plural, and the checker should instead indicate a number mismatch and suggest the plural form smutsiga 'dirty [pl]'. In any case, the soundest solution is to suggest both corrections, due to the ambiguous nature of the noun, and let the user decide.

Definiteness in Single Nouns

Definiteness errors in single nouns in Child Data are represented by bare singular nouns (e.g. ö 'island' in (5.2a)) that lack the definite suffix (i.e. ön 'island [def]') and form another form of the intended lemma (see also (4.17) on p.53). Considered then as non-structural errors, they could be detected by means of partial parsing. Marking bare singular nouns as ungrammatical can also be helpful for finding instances where, instead of a missing suffix, the indefinite article is missing.
That is, if the noun phrase in the first sentence in (5.2a) were represented only as in (5.2b) (such errors were not found in Child Data). However, cases where bare singular nouns are grammatical exist.3

(5.2) (G3.1.3)
a. Jag såg en ö. Vi gick till ∗ö.
   I saw an island we went to island [indef]
   – I saw an island. We went to island.
b. Jag såg ∗ö.
   I saw island [indef]
   – I saw island.

3 Bare singular nouns can be grammatical in one context (e.g. ha bil 'have car') and ungrammatical in another (e.g. se ∗bil 'see car'), see further in Section 4.3.3.

In order to decide whether a bare singular noun is ungrammatical due to omission of the article or the noun suffix, or whether it is grammatical, a context wider than a sentence is needed, in addition to some kind of lexical or semantic analysis, in order to see whether the noun was introduced or specified earlier, or whether the construction is grammatical (i.e. lexicalized).

Pronoun Case

Pronoun case errors in Child Data concern the accusative case of pronouns and are realized as other forms of the same lemma; that is, the nominative case form is used instead of the accusative. These errors concern feature mismatch and are classified as non-structural errors. However, exactly as in the case of agreement errors in predicative complement, a more complex syntactic analysis is required to identify the requirements on certain positions in a clause. One clue for identifying these could be a preposition preceding the pronoun, which would then require only partial parsing. Three such errors in Child Data consist of a nominative pronoun preceded by a preposition (e.g. G4.1.5; see (4.18) on p.53).

Verb Errors

Errors in verb form can be located directly at the verbal core, consisting of one single finite verb, a sequence of two or more verbs, or a verb preceded by an infinitive marker. They can be both structural (an auxiliary verb is missing) and non-structural (another form of the verb was used).
All verb errors should be detectable by means of partial parsing. Optional constituents such as adverbs, noun phrases, and coordination of verbs should be taken into consideration. The errors in finite verb form found in Child Data in many cases coincide with the imperative form of these verbs (see e.g. G5.2.45 in (4.26) on p.58). The imperative as a finite verb form should be distinguished from the infinitive verb form in order to be able to detect such errors in finite verbs. Errors in verbal chains are represented in Child Data by two finite verbs in a row (e.g. ska blir 'will [pres] become [pres]'; (4.32) on p.61), in one case with the embedded infinitive as secondary future perfect (i.e. skulle ha kom 'would [pret] have [inf] came [pret]'; (4.31) on p.60). They also occur as a bare supine in a main clause, lacking the auxiliary verb (e.g. G6.2.2; see (4.33) on p.61). All such errors can be detected by parsing just the verbal cluster. In the case of missing auxiliary verbs, the crucial point is to be sure that the omission occurs in a main clause, which requires identification of the type of clause. Errors in infinitive phrases concern infinitive markers followed by a verb in finite form (e.g. att stäng 'to close [imp]'; (4.34) on p.62), or a missing infinitive marker with the auxiliary verb komma 'will' (e.g. G7.2.3; see (4.36) on p.62). Both these error types can be located by partial parsing. In the case of an omitted infinitive marker in the context of the auxiliary verb komma 'will', it is important not to confuse it with the main verb komma 'come'.

Word order

All word order errors are structural errors, involving transposition of sentence constituents. In general, detection of word order errors requires identification of the main verb and analysis either of the preceding or following constituents, which in turn requires identification of the beginning and end of a sentence.
In theory, some errors in the placement of adverbials can be traced by partial parsing, for instance in certain subordinate clauses. In Child Data, punctuation and capitalization conventions are often not followed and sentences may be joined together (see Section 4.6). This means that word order analysis cannot completely rely on such conventions until we find some way to locate sentence boundaries. In addition, the word order errors found in Child Data are rather complex, involving for instance more than one initial constituent before the finite verb in a main clause (see Section 4.3.6). This means that the possibility of success in locating word order errors in Child Data by such simple techniques as partial parsing is minimal.

Redundancy

Redundancy errors also represent structural errors, manifested as insertions of superfluous constituents into sentences. Immediate repetition of words (e.g. G9.1.3; see (4.38) on p.64) should be possible to detect by means of partial parsing. Occurrences of repeated constituents at different places in a given sentence (e.g. G9.1.7; see (4.39) on p.65) would require analysis of the complement structure, often of the whole sentence. The same applies to new constituents being inserted (e.g. G9.2.2; see (4.41) on p.66).

Missing Constituents

Sentences lacking a constituent also represent structural errors. Some of them may be detected by partial parsing, but most require more complex analysis. Among the errors in Child Data, discovering a missing subject or object would require analysis of the complement structure of the main verb, which means that such information must be stored somewhere (e.g. in the lexicon of the system). Finding an omission of a finite verb requires a search for a finite verb in a sentence, assuming that it is not an exclamation, a title, or another construction without finite verbs. Finding omissions of particles or prepositions requires knowledge of the verbs' sub-categorization frames, or of the structure of fixed expressions. Other types require not only syntactic analysis but also semantics and/or world knowledge, as in (5.3), where the negation on the main verb is missing.

(5.3) (G10.5.1)
a. tuni hade jätte ont i knät men hon ville — sluta för det.
   Tuni had great pain in knee but she wanted — stop for that
   – Tuni had much pain in her knee, but she did not want to stop because of that.
b. men hon ville inte sluta för det.
   but she wanted not stop for that

Word Choice

Word choice errors, as substitutions of constituents, also represent structural errors. These errors are realized as completely new words with a meaning distinct from the intended one, i.e. new lemmas. Some of them can probably be solved by storing, for instance, information on the use of particles and prepositions with certain verbs (e.g. G11.1.2; see (4.48) on p.68), or on word usage in fixed expressions (e.g. G11.1.7; see (4.47) on p.68), in the dictionary. Others will probably require analysis of semantics or even world knowledge before they can be detected, as in (5.4).

(5.4) (G11.6.3)
a. Jag tittade på Virginia som torkade av sin näsa som var blodig på tröjarmen.
   I looked at Virginia that wiped off her nose that was bloody on jumper-arm
   – I looked at Virginia who wiped her bloody nose on the sleeve of her jumper.
b. tröjärmen
   jumper-sleeve

Reference

Reference errors concern structural violations as substitutions of constituents, realized as new lemmas. All the errors in Child Data concerned anaphoric reference. Reference errors in general are discourse oriented. Anaphoric reference requires identification of the antecedent that agrees with the subsequent pronoun. The antecedent may be in the preceding sentence, but it could also be farther away. Partial parsing techniques can probably be used for identifying antecedents. The crucial problem is how far back in the discourse to search for antecedents.
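Returning briefly to the verb errors discussed earlier: several of the verb-core patterns (two finite verbs in a row, a finite verb directly after an infinitive marker) reduce to checks over short tag sequences. The sketch below is schematic; the tag names and the pre-tagged input are assumptions for illustration, not the system's actual representation:

```python
# Schematic detection of two verb-core error patterns over a
# pre-tagged token list (word, tag). Tags are illustrative.

def verb_core_errors(tagged):
    """Flag (i) two finite verbs in a row, e.g. 'ska blir', and
    (ii) a finite verb directly after the infinitive marker 'att'."""
    errors = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        if t1 == "verb.fin" and t2 == "verb.fin":
            errors.append(("finite+finite", w1, w2))
        if (w1, t1) == ("att", "inf.marker") and t2 == "verb.fin":
            errors.append(("inf.marker+finite", w1, w2))
    return errors

# e.g. ska blir 'will [pres] become [pres]':
print(verb_core_errors([("ska", "verb.fin"), ("blir", "verb.fin")]))
# [('finite+finite', 'ska', 'blir')]
```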
Real Word Spelling Errors

Spelling errors resulting in existing words always form new lemmas; that is, they are realized as completely new words. They mostly violate structural requirements as substitutions of constituents, but can also accidentally cause non-structural violations, for instance agreement errors in noun phrases. The majority of such misspellings slip through any syntactic analysis, resulting in syntactically correct strings. For instance, an error resulting in a word of the same part of speech as the intended word, as in (5.5a), will be very hard to track down without any semantic information. In this example, the word as written coincides not only with the part of speech of the intended word but also with the intended inflection. The intended word is presented in (5.5b):

(5.5) (M1.1.33)
a. den här gamla ∗manen har tagit hand om oss.
   the [def] here old mane [def] has taken hand about us
   – This old man took care of us.
b. mannen
   man [def]

Moreover, words resulting in other parts of speech are also hard to trace syntactically. In (5.6a), a pronoun becoming a verb in the supine form will not be detected without an additional level of analysis, because the form of the verb that follows the preceding auxiliary verbs is syntactically correct:

(5.6) (M3.3.10)
a. den killen eller tjejen måste ha ∗nått problem
   the boy [def] or girl [def] must have reached [sup] problem
   – the boy or girl must have some problem
b. nåt
   some

Only a few real word spelling errors in Child Data cause syntactic violations and can to some extent be detected by means of syntactic analysis. Here is an example of a pronoun realized as a noun, subsequently forming a noun phrase with an agreement error in gender and definiteness:

(5.7) (M2.2.3)
a. det här brevet är det ∗ända jag kan ge dig idag.
   the here letter is the [neu,def] end [com,indef] I can give you today
   – This letter is the only one I can give you today.
b.
det enda
the only

Here is an example of a pronoun becoming a verb, where as a consequence three verbs in a row are found in a sentence: the two correctly spelled verbs form a grammatical verb cluster, and the misspelled pronoun then adds a passive past verb form (5.8a). In this case, the feature structure of the verb cluster is violated and the error can be detected by partial parsing.

(5.8) (M3.3.8)
a. jag fick låna ∗hanns mobiltelefon.
   I could borrow was-managed cell-phone
   – I could borrow his cell-phone.
b. hans
   his

In (5.9a), the predicate of the sentence forms a noun, and the error could be detected as a sentence lacking a finite verb:

(5.9) (M4.2.32)
a. då ∗ko min bror
   then cow my brother
   – then came my brother
b. kom
   came

Splits mostly violate complement conditions. For instance, in (5.10a) the split will be analyzed as two successive noun phrases:

(5.10) (S1.1.16)
a. En ∗brand man klättrade upp till oss.
   a fire man climbed up to us
   – A fire-man climbed up to us.
b. brandman
   fire-man

Splits can also violate agreement, as in (5.11a), where the first part of the split has a gender different from the second part, which results in the article (en 'a [com]') and the first part of the split (djur 'animal [neu]') not agreeing. The correct form is shown in (5.11b):

(5.11) (S1.1.28)
a. Desere jobbade i en ∗djur affär
   Desere worked in a [com] animal [neu] store [com]
   – Desere worked in an animal-store.
b. en djuraffär
   a [com] petshop [com]

Punctuation at Sentence Boundaries

Erroneous use of punctuation to mark sentence boundaries probably requires full parsing, or at least analysis of the complement structure following the main verb. For instance, in order to detect the missing boundary in (5.12a) (indicated by a dash), the system has to know that the verb gilla 'like' is transitive and thus combines with only one object and cannot also take the pronoun dom 'they' as a complement.
That is, just locating the arguments following the verb, marked in boldface in the example, with the diagnosis of too many complements, signals that something is wrong with the sentence. The correct form is presented in (5.12b).

(5.12)
a. Vissa i filmen gillade inte varann — dom bråkade och lämnade vissa utanför.
   some in the-movie liked not each-other — they quarrelled and left some outside
   – Some (people) in the movie did not like each other. They quarrelled and left some (people) out.
b. Vissa i filmen gillade inte varann. Dom bråkade och lämnade vissa utanför.
   some in the-movie liked not each-other they quarrelled and left some outside

5.3.3 Summary and Conclusion

In accordance with the above discussion, it is clear that only some errors in Child Data can be detected by partial syntactic analysis alone; most of the errors require a higher level of analysis, full parsing or even discourse analysis. The error types, their classification in accordance with the violations they cause, and comments on the possibility of detection are summarized in Table 5.1 below. Errors requiring only partial parsing for detection (in bold face in the table) concern (mostly) non-structural errors, including noun phrase agreement and verb form errors, and some structural errors such as omissions within a verb core. Further, some pronoun case errors, constrained for instance by preceding constituents (e.g. a preposition), could be traced by partial parsing. In addition, some word order errors would in general be possible to detect by means of partial parsing, but since those found in Child Data display rather high complexity, the possibility of detection is minimal without more elaborate analysis. Finally, repeated words could be detected by partial parsing (i.e. among the redundancy errors).
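The one redundancy pattern that plain partial parsing catches reliably, immediate word repetition, can be sketched in a few lines (illustrative only; the real system's treatment may differ):

```python
def repeated_words(tokens):
    """Return the positions where a word is immediately repeated,
    ignoring case (e.g. sentence-initial capitalization)."""
    return [i for i in range(len(tokens) - 1)
            if tokens[i].lower() == tokens[i + 1].lower()]

print(repeated_words("han gick gick hem".split()))  # [1]
```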
Table 5.1: Summary of Detection Possibilities in Child Data

ERROR TYPE                   | ERROR CLASS    | VIOLATION                | COMMENT
GRAMMAR ERRORS:              |                |                          |
Agreement in NP              | non-structural | substitution: other form | partial parsing
Agreement in PRED            | non-structural | substitution: other form | complex partial parsing
Definiteness in single nouns | non-structural | substitution: other form | partial parsing and discourse
                             | structural     | omission                 | partial parsing and discourse
Pronoun case                 | non-structural | substitution: other form | some by partial parsing OR complex partial parsing
Finite Verb Form             | non-structural | substitution: other form | partial parsing
Verb Form after Vaux         | non-structural | substitution: other form | partial parsing
Vaux Missing                 | structural     | omission                 | partial parsing
Verb Form after inf. marker  | non-structural | substitution: other form | partial parsing
Inf. marker Missing          | structural     | omission                 | partial parsing
Word order                   | structural     | transposition            | some by partial parsing
Redundancy                   | structural     | insertion                | some by partial parsing OR full parsing
Missing Constituents         | structural     | omission                 | at least complement structure
Word Choice                  | structural     | substitution: new lemma  | full parsing + semantics and world knowledge
Reference                    | structural     | substitution: new lemma  | discourse analysis
OTHER:                       |                |                          |
Real Word Spelling Errors    | structural     | substitution: new lemma  | full parsing + semantics and world knowledge
                             | non-structural | substitution: new lemma  | partial parsing
Missing Sentence Boundary    | structural     | omission                 | at least complement structure

Two of the non-structural error types require a more complex partial parsing (in italics in the table) and the specification of a larger context in order to be detected. These include agreement errors in predicative complement and pronoun case errors. Definiteness errors in single nouns could also in general be detected by partial parsing, but (probably) require discourse analysis in order to be diagnosed correctly.
The rest of the grammar errors are all structural and require at least analysis of complement structure, full parsing of sentences or even discourse analysis. In many cases, semantic and/or world-knowledge interpretation is also required. Among the real word spelling errors, very few can be traced by means of syntactic analysis alone. Most of them need semantics or even world knowledge in order to be identified. Missing sentence boundaries often cause syntactic violations in verb sub-categorization. In conclusion, this summary suggests that not only non-structural errors can be detected by means of partial parsing, but also some structural violations. The division depends rather on whether or not the error is confined to a certain delimited portion of text. For instance, some of the omission violations located in certain types of phrases can be detected by means of partial parsing (e.g. a missing auxiliary verb). The clearest candidates for detection by means of partial parsing are the agreement errors in noun phrases and the errors located in verbs (i.e. concerning verb form and omission of a verb or infinitive marker). These are also among the most frequent (central) errors in Child Data and invite implementation, as I will show in Chapter 6. Among the other most frequent error types in Child Data, redundant constituents in clauses can probably be detected only when words are repeated directly. Other types of extra constituents inserted into clauses, omissions of words or word choice errors are structural errors that require more complex analysis and cannot be detected by partial parsing alone.

5.4 Grammar Checking Systems

5.4.1 Introduction

After the analysis of the possibilities for detecting the errors in Child Data presented in the previous section, the question arises as to what error types are already covered by current technologies, and with what success.
As pointed out in Section 5.2, research and development of grammar checking techniques is rather recent, starting in the 1980s with products mainly for English4 but also for other languages, e.g. French (Chanod, 1993),5 Dutch (Vosse, 1994), Czech (Kirschner, 1994), Spanish and Greek (Bustamante and León, 1996). In the case of Swedish, the development of grammar checkers did not start until the latter half of the 1990s, with several independent projects. Grammatifix, developed by the Finnish company Lingsoft AB, was introduced on the Swedish market in November 1998, and since 2000 it has been part of the Swedish Microsoft Office package (Arppe, 2000; Birn, 2000). Granska is a grammar checking prototype being developed by the research group of the Department of Numerical Analysis and Computer Science (NADA) at the Royal Institute of Technology (KTH) in Stockholm (Carlberger and Kann, 1999). Another Swedish prototype was developed at the Department of Linguistics at Uppsala University, between 1996 and 1999, within the EU project Scarrie (Sågvall Hein, 1998a; Sågvall Hein et al., 1999).6

This section continues with a short review of methods and techniques used in some non-Swedish systems (Section 5.4.2). Then follows an overview of the Swedish approaches to grammar checking (Section 5.4.3) and a discussion of the techniques used in these systems, the error types covered and their reported performance (Section 5.4.4).

4 For instance, Perfect Grammar, integrated in Word for Windows 2.0 in late 1991, and Grammatik 5, part of WordPerfect for Windows 5.2 and Word for Mac 5.0 in 1992, were among the first on the market (see further in Vernon, 2000).
5 Vanneste (1994) compared the utilities of other French products: Grammatik (French), Hugo Plus and GramR.
6 Skribent (http://www.skribent.info/) and Plita (Domeij, 1996, 2003) are other proofing tools on the Swedish market that include detection of violations against graphical conventions and style, but no syntactic error detection.

5.4.2 Methods and Techniques in Some Previous Systems

Many of the grammar checking systems on the market are commercial products, and technical documentation is often minimal or even absent. One exception is the grammar checking system Critique (known until 1984 as Epistle) (Ravin, 1993; Richardson, 1993), developed within the Programming Language for Natural Language Processing (PLNLP) project (Jensen et al., 1993b).7 This project aimed at the development of a large-scale natural language processing system covering not only syntax, but also the various levels of semantics, discourse and pragmatics.8 During the project the PLNLP formalism was used in several domains of natural language applications. Besides the "text-critiquing" system, applications targeting for instance machine translation, sense disambiguation via on-line dictionaries, and analysis of conceptual structure in paragraphs as a "unit of thought" were developed. English was the main language, but languages such as Japanese, French, German, Italian and Portuguese were also involved (Jensen et al., 1993b). Critique is based on the English parser (PEG) of this system (Jensen, 1993), utilizing the PLNLP formalism of Augmented Phrase Structure Grammar (ACFG)9 (implemented in Lisp), and producing a complete analysis for all sentences (even ungrammatical ones) on the basis of the most likely parse (Heidorn, 1993). Thus, in order to be able to detect errors, the syntactic analysis in PEG was developed so that not only grammatical sentences but all sentences obtained an analysis. This was achieved by applying relaxation to rules when parsing failed on the first try, or by a parse fitting procedure identifying the head and its constituents (e.g. in fragments) (see further in Jensen, 1993; Jensen et al., 1993a; Ravin, 1993). The system targets about 25 grammar error types and 85 stylistic "weaknesses".
The grammar errors are divided into five error categories: number agreement, pronoun case, verb form, punctuation and confusion/contamination of expressions (Ravin, 1993, pp. 68-70). Critique was planned to be developed for other languages besides English, and a French version now exists (Chanod, 1993). The insight gained in the PLNLP project from providing an analysis of all sentences seems to have influenced other grammar formalisms such as Constraint Grammar (Karlsson et al., 1995) and Functional Dependency Grammar (Järvinen and Tapanainen, 1998). The methods of rule relaxation and parse fitting had an impact on the development of other (Swedish) grammar checking systems.

Another quite well documented and frequently cited project is the Dutch system CORRie (Vosse, 1994). It applies the same idea of analyzing ill-formed as well as well-formed sentences, using an augmented context-free grammar for that purpose. The system aimed primarily at correcting spelling errors resulting in other existing words, but also included analysis of misspellings, compounds, spelling of idiomatic expressions and hyphenation. CORRie's parser and its formalism inspired the development of the proofing tools developed in the Scarrie project (see below).

7 The development of Critique was done in collaboration with IBM and was later taken over by Microsoft. The tool is now used as a module for English grammar checking in Microsoft Word (cf. Jensen et al., 1993b; Domeij, 2003).
8 Mostly syntax and semantics are covered by the system, but approaches involving analysis of discourse and pragmatics have also been targeted.
9 The ACFG is considered more effective than a plain CFG, since features and restrictions on them can be associated directly with the corresponding categories/symbols, resulting in a considerably decreased number of rules.
5.4.3 Current Swedish Systems

There are at present three known proofing tools for Swedish aimed at syntactic error detection: Grammatifix, the grammar and style module that has been part of Swedish Microsoft Word since 2000; the Granska prototype under development at NADA, KTH; and the ScarCheck prototype developed at the Department of Linguistics at Uppsala University in the Scarrie project. For each system I describe below the architecture, the error types covered, the technique used for grammar checking (to the extent information is available) and the system's reported performance.

Grammatifix

Lingsoft's10 commercial product Grammatifix was introduced on the Swedish market in November 1998 and has since 2000 been part of Microsoft Word. Parts of this proofreading tool are based on research and technology from the 1980s, when work on a morphological surface parser started. The work on error detection rules began in 1997 (Arppe, 2000).

The lexical analysis in Grammatifix is based on the morphological analyzer SWETWOL, designed according to the principles of two-level morphology (Karlsson, 1992) and utilizing a lexicon of about 75,000 word types. At this non-disambiguated lexical-lookup stage, each word may obtain more than one reading. The part-of-speech assignment is to a large extent disambiguated at the next level of analysis, by application of the Swedish Constraint Grammar (SWECG) (Birn, 1998),11 a surface-syntactic parser applying context-sensitive disambiguation rules (Arppe et al., 1998). As Birn (2000) points out, full disambiguation is not a goal, since the targeted text contains grammar errors. Errors are detected by partial parsing: the tags @ERR and @OK are attached to all strings, and error detection rules, defined in the same manner as the constraint grammar rules used for syntactic disambiguation and with negative conditions often relating to just portions of a sentence, then select the tag @ERR when an error occurs. The error detection component consists of 659 error rules and a final rule that applies the tag @OK to the remaining words (Birn, 2000). Relaxation is included in the error detection rules and not in the phrase construction rules, so that certain word sequences are regarded as phrases despite the grammar errors in them (Arppe et al., 1998).

Grammatical errors are viewed by this system as "violations of formal constraints between morphosyntactic categories" (Arppe et al., 1998). Two types of constraints are distinguished: intra-phrasal, e.g. phrase-internal agreement, and inter-phrasal, e.g. constituent order in a clause. Grammatifix not only detects errors, but also provides a diagnosis with an explanation of the error and, when possible, a suggestion for correction. The tool addresses 43 error types, of which 26 concern grammar, 14 punctuation and formatting, and 3 stylistic issues. The grammar error types include agreement errors in noun phrases and subject complements, errors in pronoun form after prepositions, errors in verbs, in word order and others (Arppe et al., 1998; Arppe, 2000). The grammar error types are listed and compared to the types in the other Swedish systems in Section 5.4.4.

The linguistic performance of the system was tested separately for precision and recall, based on corpora of different sizes from the newspaper Göteborgs Posten (Birn, 2000, pp. 37-39). For precision, the newspaper corpus consisted of 1,000,504 words and resulted in a precision rate of 70% (374 correct alarms and 160 false alarms).

10 Lingsoft's homepage is http://www.lingsoft.fi/
11 Birn (1998) gives a short presentation of the formalism. The CG formalism was originally developed by Karlsson (1990). Karlsson et al. (1995) give a description of the basic principles and the CG formalism.
The analysis of recall was based on a text extract of 87,713 words and resulted in an overall recall rate of 35%, counting also error types not covered by the tool (135 errors in the text, 47 errors detected). Counting only the error types targeted by Grammatifix, the recall is 85% (55 errors in the text, 47 errors detected).12

The Granska Project

The proofreading tool Granska is being developed at the Department of Numerical Analysis and Computer Science, KTH (the Royal Institute of Technology) in Stockholm. The first prototype was developed in 1995, running under Unix. Then followed a more elaborate version with a graphical interface under the Windows operating system; this version included detection of agreement errors in noun phrases. The current version of Granska is a completely new program, written from scratch starting in 1998 in the project Integrated language tools for writing and document handling.13 Granska is an integrated system that provides spelling and grammar checking running at the same time, and it can be tested in a simple web interface.14 The system recognizes and diagnoses errors and suggests corrections when possible.

Granska combines probabilistic and rule-based methods, where specific error rules and locally applied rules detect ungrammaticalities in free text. The underlying lexicon includes 160,000 word forms, generated from the tagged Stockholm-Umeå Corpus (SUC) (Ejerhed et al., 1992) of 1 million words and completed with word forms from SAOL (Svenska Akademiens Ordlista, 1986). The lexical analyzer applies Hidden Markov Models based on the statistics of word and tag occurrences in SUC. Each word obtains one tag with part-of-speech and feature information.

12 The error profile of the corpus used for the analysis of Grammatifix's grammatical coverage (recall) is reported in Chapter 4, Section 4.4.
13 See more about the project at: http://www.nada.kth.se/iplab/langtools/
Unknown words are analyzed with probabilistic word-ending analysis (Carlberger and Kann, 1999). A rule matching system analyzes the tagged text, searching for grammatical violations defined in the detection rules, and produces an error description and a correction suggestion for the error. When needed, additional help rules are applied more locally, used as context conditions in the error rules. Other, accepting rules handle correct grammatical constructions in order to keep the error rules from applying to them, i.e. to avoid false alarms (Knutsson, 2001).

Granska's rule language is partly object-oriented, with a syntax resembling C++ or Java, and is meant to be applied not only to grammar checking but also to partial parsing tasks such as identification of phrase and sentence boundaries. Further, with Granska it is possible to search and edit directly in the text, e.g. changing the tense of verbs or moving constituents within a sentence. The tagging result may also be improved when the "guess" is wrong, so that a certain text area may be retagged (see further in Knutsson, 2001).

The rule collection of the system consists of approximately 600 rules (Domeij et al., 1998), divided into three main categories: orthographic, stylistic and grammatical rules. Half of the rules detect grammar errors, including noun phrase and complement agreement, errors in pronoun form after prepositions, errors in verbs, errors in prepositions in fixed expressions, word order and other errors (Domeij and Knutsson, 1999; Knutsson, 2001). The grammar error types are listed and compared to the types covered by the other Swedish systems in Section 5.4.4.

A validation test of Granska is reported in Knutsson (2001, pp. 141-150), based on a corpus of 201,019 words, and shows an overall performance of 52% recall and 53% precision (418 errors in the texts, 216 correct alarms and 197 false alarms).
In this text sample, including both published texts written by professional writers and student papers,15 Granska is best at detecting errors in verb form, with a recall of 97% and precision of 83%, and agreement errors in noun phrases, with a recall of 83% and precision of 44%.

The Scarrie Project

Within the framework of the EU-sponsored project Scarrie,16 prototypes of proofreading tools for the Scandinavian languages Danish, Norwegian and Swedish were developed. The project ran from December 1996 to February 1999. WordFinder Software AB17 was the coordinator of the project; the Department of Linguistics at Uppsala University and the newspaper Svenska Dagbladet were the other Swedish partners. Interface and packaging were outside the project and were planned to be taken care of by WordFinder after the project's completion. Professional writers at work at particular newspapers and publishing firms were the intended users.

The Swedish version of the prototype provides both spelling and grammar checking running at the same time, searching through the text sentence by sentence. The system recognizes and diagnoses errors, giving information about error type and error span. No suggestions for correction are given.18

The system lexicon is based on a corpus of 220,000 newspaper articles published in 1995 and 1996 in the Swedish newspapers Svenska Dagbladet (SvD) and Uppsala Nya Tidning (UNT). The SvD/UNT corpus consists of more than 70 million tokens and 1.5 million word types. The resulting lexical database, ScarrieLex, consists of a one-word lexicon of 257,136 single word forms and a multi-word lexicon of 4,899 phrases (Povlsen et al., 1999).

14 Granska's Internet demonstrator is located at: http://www.nada.kth.se/theory/projects/granska/demo.html
15 The error profile of the validated corpus of Granska was already reported in Chapter 4, Section 4.4.
The spelling module is based on the Dutch software CORRie (Vosse, 1994) (see Section 5.4.2), whereas the grammar checking module ScarCheck was developed as new software (Sågvall Hein, 1998b; Starbäck, 1999).19 The grammar checker is based on a previously developed parser, the Uppsala Chart Parser (UCP), a procedural, bottom-up parser applying a longest-path strategy (Sågvall Hein, 1981, 1983).20

The parsing strategy for erroneous input is based on constraint relaxation in the context-free phrase structure rules and the application of local error rules (Wedbjer Rambell, 1999b). The grammar is, in other words, underspecified to a certain level, allowing feature violations and parsing of ungrammatical word sequences. The local error rules are part of the same grammar and are applied to the result of the partial parse. Alternative parses are weighted, yielding the best parse. A chart-scanner collects and reports on errors (Sågvall Hein, 1999).

16 The Scarrie project homepage: http://fasting.hf.uib.no/˜desmedt/scarrie/
17 The homepage of WordFinder Software AB is http://www.wordfinder.com
18 A demonstrator of the Scarrie prototype is located at: http://stp.ling.uu.se/˜ljo/scarrie-pub/scarrie.html
19 The spelling and grammar checking in the Danish and Norwegian prototypes is solely based on the Dutch software CORRie (Vosse, 1994).
20 The original version of the chart parser was first implemented in Common Lisp (see Carlsson, 1981) and then converted to C. The resulting Uppsala Chart Parser Light (UCP Light) (see Weijnitz, 1999) is a smaller and faster version at the cost of less functionality, starting at the syntax level and requiring morphologically analyzed input. UCP Light is used in the web demonstrator (Starbäck, 1999). (Email correspondence with Leif-Jöran Olsson, Department of Linguistics, Uppsala University, 21/11/01)
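The general idea of constraint relaxation can be illustrated with a minimal sketch. This is not Scarrie's actual UCP grammar; the mini-lexicon and feature names below are invented for illustration. Instead of rejecting a determiner-noun pair that violates gender agreement, the parse is accepted anyway and the violated constraint is recorded, so that it can later be reported as an error:

```python
# Toy illustration of constraint relaxation: a determiner and a noun must
# agree in gender, but the "relaxed" parse accepts the pair regardless and
# records which feature constraint was violated.
# (Invented mini-lexicon for illustration only.)

LEXICON = {
    "en":  {"cat": "det",  "gender": "com"},   # 'a' (common gender)
    "ett": {"cat": "det",  "gender": "neu"},   # 'a' (neuter)
    "bil": {"cat": "noun", "gender": "com"},   # 'car'
    "hus": {"cat": "noun", "gender": "neu"},   # 'house'
}

def parse_np(det_word, noun_word):
    """Parse det+noun as an NP; relax agreement instead of failing."""
    det, noun = LEXICON[det_word], LEXICON[noun_word]
    violations = []
    if det["gender"] != noun["gender"]:        # the relaxable constraint
        violations.append("gender agreement")
    return {"np": (det_word, noun_word), "violations": violations}

print(parse_np("en", "bil"))   # well-formed: no violations
print(parse_np("en", "hus"))   # *en hus: gender agreement flagged
```

In a chart parser the recorded violations would additionally be weighted against competing analyses before the chart-scanner reports them; that step is omitted here.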
ScarCheck targets more than thirty grammar error types, including agreement errors in noun phrases and complements, errors in the verb phrase and verb valence errors, errors in conjunctions, pronoun case, word order and others (Sågvall Hein et al., 1999). Again, the different grammar error types are listed and compared to the errors of the other two Swedish systems in Section 5.4.4. The performance evaluation of the grammar checking system was based on a newspaper corpus of 14,810 words, with an overall recall of 83.3% and precision of 76.9% (first run). Six grammar errors occurred in the corpus, represented by errors in the nominal phrase, the verb phrase and word order (Sågvall Hein et al., 1999).21

5.4.4 Overview of the Swedish Systems

Detection Approaches

The approaches for detecting errors in unrestricted text differ among the Swedish systems, not only in the technology used, which varies from chart-based methods in Scarrie and the application of constraint grammars in Grammatifix to probabilistic and rule-based methods in Granska, but also in the way the strategies are applied. Grammatifix and Granska identify erroneous patterns by partial analysis, whereas Scarrie produces a full analysis for both grammatical and ungrammatical sentences. Grammatifix leaves ambiguity resolution to the syntactic level and applies relaxation in its error rules in order to be able to parse erroneous phrases. Granska disambiguates starting at the lexical level, assigning only one morphosyntactic tag to each word, and then applies explicit error rules in the search for errors, including locally applied rules and rules to avoid marking grammatically correct word sequences as ungrammatical. Scarrie parses ungrammatical input implicitly, by relaxation of the parsing rules (not in the error rules, as Grammatifix does), and explicitly, by additional error rules applied locally to the parsing result.
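The notion of an explicit error rule can be made concrete with a toy pattern over a tagged token sequence that directly describes the ungrammaticality, in the spirit of the rule-based strategies above. The tag format and the rule below are invented for illustration and drastically simplified compared to the real SWECG or Granska rule formalisms:

```python
# Toy "explicit error rule": flag a determiner immediately followed by a
# noun whose gender feature differs. Tags are simplified part-of-speech +
# gender labels, not the real SUC or SWETWOL tagsets.

def find_np_agreement_errors(tokens):
    """Scan (word, tag) pairs and return flagged det+noun sequences."""
    alarms = []
    for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
        if t1.startswith("DET.") and t2.startswith("NN."):
            if t1.split(".")[1] != t2.split(".")[1]:   # gender mismatch
                alarms.append((w1, w2))
    return alarms

# *en hus 'a [com] house [neu]' followed by a finite verb:
tagged = [("en", "DET.com"), ("hus", "NN.neu"), ("brinner", "VB.fin")]
print(find_np_agreement_errors(tagged))  # [('en', 'hus')]
```

The contrast with the relaxation strategy is that the rule here matches the error pattern itself, whereas a relaxed grammar only describes well-formed phrases and records where their constraints had to be loosened.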
The thing common to all the tools is that they define (wholly or to some extent) explicit error rules describing the nature of the errors they search for. Furthermore, the tools either proceed with error detection sentence by sentence, requiring recognition of sentence boundaries, or they rely in their rules on, for instance, capitalization conventions, searching for words beginning with capital letters (cf. Birn, 2000).

21 Also two errors in splits are reported.

The Coverage of Error Types

In this section I present the different grammar error types covered by Grammatifix, Granska and Scarrie, and the similarities and/or differences between the systems' selections of error types. Table 5.2 (p. 137) shows the results of this analysis, based on the available error specifications of the different projects22 and completed with personal observations from tests run with these tools. For every listed error type an example sentence from the projects' error specifications (if present) was chosen to exemplify the targeted error. The source of this example is listed in the last column of the table. A similar analysis is discussed in Arppe (2000),23 where he concludes that the selection of error types targeted by the Swedish grammar checking tools is quite similar in many respects; differences occur in subsets of errors or in some specializations.

The analysis in the present thesis shows that all the tools check for errors in noun phrase agreement concerning definiteness, number and gender in both the form of the noun and the adjective. They also detect errors in the agreement between the quantifier/pronoun and the noun in partitive noun phrases, and in the masculine form of the adjective.
Violations of number and gender agreement with the predicative complement are also included in all three tools, and so is pronoun case, which all tools check in the context after certain prepositions. Also, the same kinds of word order errors are covered by all the tools, except that Scarrie additionally checks for inversion in the main clause.

Errors in verbs were the group that was most difficult to compare, because the detection approaches differ in some respects. The tools all check for occurrences of finite verbs (too many, missing, or no predicate at all) and the form of non-finite verbs (after an auxiliary verb or an infinitive marker); only Grammatifix does not check for finite verbs after an infinitive marker. They check further for a missing or extra inserted infinitive marker in the context of main verbs. They also look for more style-oriented errors in the use of passive verbs (double, or after certain verbs) and the supine form (double, or without ha 'have'). Scarrie also checks whether a supine form is used in place of an imperative. All the tools check for the use of the superlative form möjligast 'most-possible' in combination with an adjective.

Some other differences concern errors in the use of prepositions: Grammatifix and Granska detect errors in the harmony of prepositions in certain contexts, but only Granska checks preposition use in idiomatic expressions. Further, Granska checks tense harmony within a sentence. Double negation is not targeted by Scarrie.

22 Grammatifix: Arppe et al. (1998); Arppe (2000) and the specification in Word 2001; Granska: Domeij and Knutsson (1998, 1999) and the Internet demo: http://www.nada.kth.se/theory/projects/granska/demo.html; Scarrie: Sågvall Hein et al. (1999) and examples listed in the Internet demo: http://stp.ling.uu.se/˜ljo/scarrie-pub/scarrie.html
23 The present comparison is independent of the analysis reported in Arppe (2000). He also compared the punctuation and stylistic error types.
Granska and Scarrie also detect missing subject errors. Granska also checks more stylistic issues, such as contamination of expressions and tautology, which are not included in the table.24

24 Splits and run-ons are also targeted by some of these tools, but since these are not syntactic errors they were not included in this comparison.

Table 5.2: Overview of the Grammar Error Types in Grammatifix (GF), Granska (GR) and Scarrie (SC). The comparison was done on 08/10/01 and revisited on 30/10/02. 'X' indicates observations from the error specifications, '(x)' indicates my own observations. Each error type is followed by its example sentence, the source of the example in brackets (where recoverable), and an English gloss.

NOUN PHRASE:
Definiteness agreement (GF: X, GR: X, SC: X)
  Det är i samhällets ∗utvecklingen bort från detta som Arbetsdomstolen inte hängt med. [GF]
  'It is in the society's [poss] development [def] away from this that the Labour court has not kept up.'
Number agreement (GF: X, GR: X, SC: X)
  Natten bär ∗sin skuggor. [SC]
  'The night carries its [sg] shadows [pl].'
Gender agreement (GF: X, GR: X, SC: X)
  En ∗eventuellt segerfest får vänta. [SC]
  'A [com] possible [neu] victory-party [com] has to wait.'
Gender agreement, quantifier and noun (GF: X, GR: X, SC: (x))
  ∗Ett av de gula blommorna hade slagit ut. [GF]
  'One [neu] of the yellow flowers [com] had come out.'
Gender agreement, masculine form of adjective (GF: X, GR: (x), SC: (x))
  Då frestade han ditt kött och sände dig den ∗rödhårige kvinnan. [GF]
  'Then he tempted your flesh and sent you the red-haired [masc] woman.'

PREDICATIVE COMPLEMENT:
Number agreement (GF: X, GR: X, SC: X)
  Tävlingen blev väldigt ∗besvärliga. [SC]
  'The competition [sg] became very difficult [pl].'
Gender agreement (GF: X, GR: X, SC: X)
  Då hade läget i byn redan blivit ∗outhärdlig för gruppen. [GF]
  'At that point the situation [neu] in the village had already become unbearable [com] for the group.'

PRONOUN:
Case after preposition (GF: X, GR: X, SC: X)
  Vi sjöng för ∗de. [GF]
  'We sang for they [nom].'

VERBS:
Verb form after auxiliary verb (GF: X, GR: (x), SC: X)
  Hur trygghet inte längre kan ∗var statisk utan ligga i förnyelsen, utvecklingen och förändringen. [SC]
  'How safety cannot any longer be [pres] static but lie in renewal, development and change.'
Verb form after inf. marker (GF: –, GR: (x), SC: X)
  Han har lovat att i alla fall ∗skall slå Turkiet. [SC]
  'He has promised that in any case will [pres] beat Turkey.'
Number of finite verbs (GF: X, GR: (x), SC: X)
  I Ryssland ∗är betalar nästan ingen någon skatt. [GF]
  'In Russia almost noone is [pres] pays [pres] any tax.'
Missing finite verb (GF: X, GR: X, SC: X)
  Det ∗bli viktigt. [GF]
  'That will-be [inf] important.'
Missing verb (GF: X, GR: X, SC: X)
  Ingen koll. [GR]
  'No control.'
Missing inf. marker (GF: X, GR: X, SC: X)
  Vi kommer – spela en låt av Ebba Grön. [GR]
  'We will – play a song by Ebba Grön.'
Extra inf. marker (GF: X, GR: (x), SC: X)
  Sverige började ∗att klassa kärnkraftsincidenter enligt den internationella standarden. [SC]
  'Sweden started to classify nuclear incidents in accordance with the international standard.'
Supine instead of imperative (GF: –, GR: –, SC: X)
  ∗Betänkt också de anläggningskostnader som tillkommer. [SC]
  'Consider [sup] also the construction costs that will be added.'
Supine without "ha" (GF: X, GR: X, SC: (x))
  De kunde – fått bilderna på begravningsgästerna från danska polisen.
  'They could – get pictures of the funeral guests from the Danish police.'
Double supine (GF: X, GR: X, SC: X)
  Vi hade velat ∗sett en större anslutningstakt, säger Dennis.
  'We had wanted [sup] seen [sup] a greater rate of joining, says Dennis.'
Double passive (GF: X, GR: X, SC: X)
  Saken har försökts att ∗tystas ner.
  'The thing has been tried [pass] to be quietened [pass] down.'
S-passive after certain verbs (GF: X, GR: (x), SC: X)
  Huset ämnar byggas.
  'The house intends to be built [pass].'
Tense harmony (GF: –, GR: X, SC: –)
  Jag höll mig inne tills stormen ∗har bedarrat. [GR]
  'I kept [pret] myself inside until the storm has abated [perf].'

PREPOSITIONS:
Wrong preposition in fixed expressions (GF: –, GR: X, SC: –)
  med utgångspunkt ∗från [GR]
  'with starting-point from'
Preposition harmony with two-part conjunctions (GF: X, GR: (x), SC: –)
  Det är utbildning som idag inte erbjuds vare sig i Lund eller – Malmö. [GF]
  'It is education that today is not offered either in Lund or Malmö.'

WORD ORDER:
Placement of verb/negation adverb (GF: X, GR: X, SC: X)
  Man kan tro inte sina öron. [SC]
  'One can believe not one's ears.'
Word order in subordinate interrogative clause (GF: X, GR: X, SC: X)
  Jag undrar vad gör de unga männen i Finland. [GF]
  'I wonder what do the young men in Finland do.'
Word order in main clause with inversion (GF: –, GR: –, SC: X)
  Nu man kan testa de kommande versionerna av programvaran. [SC]
  'Now one can try the future versions of the program.'

OTHER:
Missing subject (GF: –, GR: (x), SC: (x))
Missing inf. marker with preposition (GF: X, GR: X, SC: X)
  Jag klarar av – gå. [SC]
  'I can manage – walk.'
Repeated words (GF: X, GR: –, SC: –)
  (No example given in the specification.)
Double negation (GF: X, GR: (x), SC: –)
  Det kan bli svårt att få jobb om man inte har varken pengar eller familj att stöda en. [GF]
  'It can be hard to get work if one does not have neither money or family to support one.'
Construction "möjligast" + adjective (GF: X, GR: (x), SC: (x))
  Hon körde med möjligast stora snabbhet. [GF]
  'She drove with the most possible great speed.'

So far, the comparison has concerned the different types of errors covered, but detection also depends on the syntactic complexity allowed for within each error type. For instance, the detection of errors in the verb form after an infinitive marker can differ depending on whether other (optional) constituents are inserted between the infinitive marker and the verb. In (5.13), all the sentences violate the rule requiring an infinitive verb form after an infinitive marker.
In (5.13a) and (5.13b) the targeted verb is preceded by an adverbial realized as a prepositional phrase, which disturbed both Granska and Scarrie in the detection of this error.25 The alarms raised by each tool are shown to the right of each example.

(5.13)                                                   Granska  Scarrie
a. Han har lovat att i alla fall ∗skall slå Turkiet.       No       No
   he have promised to in any case will [pres] beat [inf] Turkey
   – 'He has promised to will beat Turkey in any case.'
b. Han har lovat att i alla fall ∗vill slå Turkiet.        No       No
   he have promised to in any case want [pres] beat [inf] Turkey
   – 'He has promised to wants beat Turkey in any case.'
c. Han har lovat att ∗skall slå Turkiet.                   No       Yes
   he have promised to will [pres] beat [inf] Turkey
   – 'He has promised to will beat Turkey.'
d. Han har lovat att ∗vill slå Turkiet.                    Yes      Yes
   he have promised to want [pres] beat [inf] Turkey
   – 'He has promised to wants beat Turkey.'
e. Han har lovat att ∗slår Turkiet.                        Yes      Yes
   he have promised to beats [pres] Turkey
   – 'He has promised to beat Turkey.'

The error is detected only when the verb follows directly after the infinitive marker, as in (5.13d) and (5.13e). In the sentence in (5.13c) the verb also follows directly after the infinitive marker, but Granska does not detect it as an error, although the verb is tagged as a verb in the present tense form.

25 The errors are not detected even if simple adverbials such as inte 'not', aldrig 'never' or sen 'later' are inserted.

Another example of how important syntactic coverage is for error detection is shown in (5.14), where Scarrie had problems detecting the agreement error between the subject and the adjective form in the predicative complement, due to a possessive modifier of the head noun of the subject in (5.14b). Granska detects both errors, but Grammatifix does not react at all.

(5.14)                                       Scarrie's diagnosis
a. Hus är ∗vacker.                           wrong number in the adjective in the predicative complement
   house [pl, neu] is beautiful [sg, com]
   – 'House is beautiful.'
b. Mitt hus är ∗vacker.                      no reaction
   my [sg, neu] house [sg, neu] is beautiful [sg, com]
   – 'My house is beautiful.'

In conclusion, the three Swedish systems cover both grammatical and more style-oriented errors, and their coverage is similar in many respects. In relation to the most common errors in Child Data, they all cover the non-structural errors that are, as discussed in the previous section, restricted to certain delimited text patterns. The structural errors that require more complex analysis are included only to a small extent. All three tools detect the same errors in noun phrase agreement and most of the errors in verb form. Exceptions are the verb form errors after an infinitive marker, which are not included in Grammatifix; errors concerning the use of the supine verb form instead of the imperative, which are only included in Scarrie; and tense harmony, which is only checked by Granska. Errors in finite verb form, which were the most frequent error type in Child Data, are (probably) covered by the 'Missing finite verb' category that all the tools include. Among the errors of redundant or missing constituents in clauses, only Grammatifix checks for repeated words. All the tools check for a missing infinitive marker in the context of a preceding preposition, and Granska and Scarrie also detect a missing subject. Other categories of redundant or missing constituents in clauses are not covered. Word choice errors are covered only by Granska, and only to the extent of prepositions in fixed expressions. As discussed in the previous section, structural errors of this kind in general require more complex analysis in order to be identified, except when they are limited to certain parts that can be delimited clearly (e.g. in a verb cluster).

The present overview of error types covered by these tools does not reveal the actual grammatical coverage and precision of detection.
As shown above, there is a question of the extent of error coverage, since for instance the insertion of some optional constituent, or the presence or absence of certain constituents, influenced whether or not an error was identified. I therefore tested these tools' performance directly on Child Data; the test is reported in Section 5.5 below.

5.4.4 Performance

All the systems were validated for the linguistic functionality they provide, as reported above in the descriptions of the separate projects (Section 5.4.3) and summarized in Table 5.3 below. The validation tests carried out by the developers are based on corpora of different size and composition, and different sets of errors were found. As discussed in Section 5.2.3, the size and genre of the evaluated texts and the writers' experience may influence the outcome of such an analysis, and the results should be interpreted carefully. The size and composition of the tested texts influence which syntactic constructions giving rise to errors actually occur, and should also be related to how frequent errors are in the tested population.

Table 5.3: Overview of the Performance of Grammatifix, Granska and Scarrie

  Tool          Corpus                             Size                        Recall   Precision
  Grammatifix   newspaper articles                 87 713 (recall test)        35%      70%
                newspaper articles                 1 000 504 (precision test)
  Granska       newspaper articles, official       201 019                     52%      53%
                texts, student papers
  Scarrie       newspaper articles                 14 810                      83%      77%

Grammatifix and Scarrie were tested solely on newspaper texts written by professional writers, which is probably sufficient in the case of Scarrie, since it was developed for professional writers. Grammatifix, on the other hand, as a module in a word processor not aimed at any special group, should be tested on texts of different genres written by different populations. Granska was evaluated on texts of different genres, consisting of published newspaper and popular science articles, official texts and student compositions.
This corpus is more balanced and perhaps better reflects the real performance of the system. In addition, the types of errors that dominate in the corpus, depending on the genre, are reported (Knutsson, 2001).

Further, a fairly large amount of data is needed in order to test a reasonable number of errors. The validation corpus used for Scarrie was small in this respect, covering only six of the defined errors and yielding quite high rates in both recall and precision. In the case of Granska, the corpus is much bigger and, as discussed, better balanced. Grammatifix was tested on the largest corpus for precision and a smaller corpus for recall, and obtained the lowest recall. For a commercial product with high expectations on precision, the error coverage of the system was probably cut down. This means that the system could probably detect more errors and obtain a better recall rate than the current 35%, but since that would lower precision by increasing the number of false flaggings, the detection of such "unsafe" errors is not included and they remain undetected.

The recall rates of the systems vary from 35% to 83% and the precision rates lie between 53% and 77%. Evaluation of individual error types is only reported for Granska, with the best results for verb form errors and agreement errors in noun phrases.

5.4.5 Summary

The Swedish approaches to grammar checking apply techniques for searching (more or less) explicitly for ungrammaticalities in text. Errors are found either by looking for specific patterns in certain contexts that match the defined error rules, or by using selections in a "relaxed" parse by a chart-scanner. The approaches thus depend on how narrowly or broadly a specific error type is defined, so that the same error is not overlooked in other contexts.
The choice of which types of errors are detected is based on a more or less ambitious analysis of errors in writing, often for a certain group of writers (e.g. professional writers, writers at work). However, the risk remains that some other type of error in the same pattern may be overlooked. The coverage of error types is very similar across the systems. Performance was evaluated separately on different text data, so the results are hard to compare.

5.5 Performance on Child Data

5.5.1 Introduction

Having examined which error types are covered by the current Swedish systems Grammatifix, Granska and Scarrie, their performance will now be tested on the Child Data corpus. Recall that the error frequency is higher in texts written by children than by the adult writers targeted by the Swedish grammar checkers, and that the error distribution in Child Data is (slightly) different. Testing the tools' performance on Child Data is crucial in view of their having to handle text with a higher error density, and with errors of a (slightly) different kind than they were designed for. The discussion in the previous section of the error types covered by these systems shows that many of the errors in Child Data are targeted. Among the most common error types in Child Data, all (or most) of the error types related to verb form and agreement in noun phrases are targeted by the tools, as are some (quite few) of the errors concerning redundant or missing constituents in clauses and word choice errors, a group of errors that needs more elaborate and complex analysis for detection (see the discussion in Section 5.3). The tools are not, however, designed in the first place to detect errors in children's texts and will most probably perform worse on these texts. The question is how low the performance will be, where exactly the tools will fail, and what consequences the results have for Child Data.
This section continues with a description of the evaluation procedure (Section 5.5.2) and the individual systems' detection procedures (Section 5.5.3). Then the detection results on Child Data are presented type by type (Section 5.5.4). Finally, a summary of the results and a discussion of overall performance is presented (Section 5.5.5).

5.5.2 Evaluation Procedure

As discussed in Section 5.2.3, evaluation of authoring tools normally concerns detection, diagnosis and correction functionality, either on single sentences or on whole text samples. For investigating how good a system is at detecting targeted errors, sentence samples will usually do, but a corpus is better for measuring how good a system is overall. In my analysis the whole Child Data corpus in the spell-checked version was used as input, free from non-word spelling errors (see Section 3.3 for a discussion of how this was achieved), since the main purpose of the evaluation is to assess the checkers' performance in detecting grammar errors. The Child Data corpus represents texts that are new to all three systems, and a writing population that is not explicitly covered by any of them. Since not all the systems give suggestions for correction, the present performance test will only analyze detection and diagnosis performance. Detection performance is investigated in terms of the number of correct and false alarms. Correct alarms include all detected errors, divided further according to whether a correct or an incorrect diagnosis was made. False alarms are divided further into detections of correct word sequences diagnosed as errors, and detections that happen to include error categories other than grammar errors, e.g. a spelling error, a split compound, or a sentence boundary.
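The alarm taxonomy above can be stated as a small decision procedure. The following Python sketch is my own illustration of the classification scheme; the function and argument names are hypothetical and not taken from any of the three systems:

```python
def classify_alarm(flags_real_error: bool,
                   diagnosis_correct: bool,
                   contains_other_error: bool) -> str:
    """Map one system alarm to the four evaluation classes of Section 5.5.2."""
    if flags_real_error:
        # The erroneous segment was found; the diagnosis may still be wrong.
        if diagnosis_correct:
            return "correct alarm with correct diagnosis"
        return "correct alarm with incorrect diagnosis"
    if contains_other_error:
        # Grammatical text was flagged, but the flagged fragment contains
        # e.g. a split compound, a spelling error or an unmarked sentence
        # boundary (an error of another kind than a grammar error).
        return "false alarm with other error"
    return "false alarm"
```

Both kinds of correct alarm count towards recall, while both kinds of false flagging lower precision.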
To exemplify, the agreement mismatch between the common gender determiner en 'a [com]' and the neuter gender compound noun stenhus 'stone-house [neu]' in (5.15a) concerns the gender form of the determiner, and flagging it as such is a correct alarm with a correct diagnosis. Identifying this noun phrase segment but classifying it as an error in number agreement, as in (5.15b), would instead be considered a correct alarm with an incorrect diagnosis. That is, the erroneous segment is correctly detected, but the analysis of what type of error it concerns is wrong. The example in (5.15c) represents a false alarm, where the correct (grammatical) form of the noun phrase was detected and diagnosed as an error in gender agreement. Finally, in (5.15d) we see an example of a false alarm that includes a segmentation error (not a grammar error): the noun in the noun phrase is split, and the determiner together with the first part of the split noun is identified as a grammar error with an agreement violation in gender. Such instances of grammatically correct text selected as ungrammatical due to a split, a spelling error, etc. are classified as false alarms with other error. I have chosen to separate these detections from the "real" false alarms, since they represent text fragments that are not entirely free from errors, although the errors are of a different nature than grammatical/syntactic ones. These findings can be interesting since, as Knutsson (2001) points out, such an alarm could be a hint to writers who can see that the actual error lies in the split noun. It could, however, also give rise to a new error if the user chooses to change the gender of the determiner and writes: en sten hus 'a [com] stone [com] house [neu]'.

(5.15)                                  Diagnosis                Class of alarm
a. *en stenhus                          gender agreement error   correct alarm with
   (a [com] stone-house [neu])                                   correct diagnosis
b.
*en stenhus                             number agreement error   correct alarm with
   (a [com] stone-house [neu])                                   incorrect diagnosis
c. ett stenhus                          gender agreement error   false alarm
   (a [neu] stone-house [neu])
d. ett sten hus                         gender agreement error   false alarm with
   (a [neu] stone [com] house [neu])                             other error

The set of all detected errors is thus represented by all correct alarms, with correct or incorrect diagnosis, and the set of false alarms consists of false flaggings without any error together with false flaggings containing errors other than grammatical ones. The systems' grammatical coverage (recall) and flagging accuracy (precision) have been calculated in accordance with the following definitions:

(5.16) a. recall = (correct alarms / all errors) * 100
       b. precision = (correct alarms / (correct alarms + false alarms)) * 100

I will also consider the overall performance of the systems expressed as F-value, a combined measure of recall and precision. The F-value is calculated as in (5.17), where the β parameter has the value 1, since both recall and precision are equally important in this analysis. (The parameter β takes different values depending on whether precision is more important (β > 1) or recall is of greater value (β < 1); when both are equally important, β = 1.)

(5.17) F-value = ((β² + 1) * recall * precision) / (β² * (recall + precision))

5.5.3 The Systems' Detection Procedures

Grammatifix

Grammatifix is included as a module in Microsoft Word, working alongside a spell checking module. The user may choose to disregard grammar checking and just check the text for spelling, or to include both checkers. The tool then checks the text sentence by sentence, first for spelling and then for grammar. Further adjustments of the grammar checking are possible, where the user may choose among the different
error types defined in Grammatifix (including style, punctuation and formatting errors as well as grammar errors) and also set the maximum length of a sentence in number of words. The tool also provides a report on the text's readability, including counts of tokens, words, sentences and paragraphs; the mean score of these is computed, providing a readability index. One diagnosis of each error is always given, and usually a suggestion for correction.

Granska

The web-based demonstrator of Granska includes no interactive mode; spelling and grammar are checked independently, based on the tagging information. The user may choose a presentation format for the result that includes either all sentences, with comments on spelling and grammar, or only the erroneous sentences. Further adjustments include the choice of whether to display error corrections and the result of tagging, and whether a newline is interpreted as the end of a sentence or not. The last setting is quite important for children's writing, where punctuation is often absent or not used properly and the use of newlines is also arbitrary, i.e. a newline in the middle of a sentence is not unusual. In some cases Granska also yields more than one suggestion for error correction, and there is a possibility of constructing new detection rules. Long stretches of text without any punctuation or newline (usual in children's writing) are apparently hard for the tool to handle; it simply rejects such text without any output.

Scarrie

The web demonstrator of Scarrie does not include an interactive mode either. Individual sentences (or a longer text) can be entered, with the requirement that sentences end with punctuation. Both spelling and grammar are checked and the results of detection are displayed at the same time. Errors are highlighted and a diagnosis is displayed in the status bar. The system gives no suggestions for correction.

5.5.4 The Systems' Detection Results

In this section I present the results of the systems' performance on Child Data.
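Throughout the following result sections, recall, precision and F-value are computed from the alarm counts according to definitions (5.16) and (5.17). As a minimal sketch (my own illustration in Python, not part of any of the evaluated tools):

```python
def recall(correct_alarms: int, all_errors: int) -> float:
    """Grammatical coverage in percent, definition (5.16a)."""
    return correct_alarms / all_errors * 100

def precision(correct_alarms: int, false_alarms: int) -> float:
    """Flagging accuracy in percent, definition (5.16b)."""
    return correct_alarms / (correct_alarms + false_alarms) * 100

def f_value(r: float, p: float, beta: float = 1.0) -> float:
    """Combined measure, definition (5.17); with beta = 1 this reduces
    to the harmonic mean of recall and precision."""
    return (beta ** 2 + 1) * r * p / (beta ** 2 * (r + p))
```

For instance, a hypothetical tool with 8 correct alarms on 16 targeted errors and 8 false flaggings would score recall(8, 16) = 50.0 and precision(8, 8) = 50.0, giving an F-value of 50.0.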
For every error type I first present to what extent the errors are explicitly covered according to the systems' specifications; I then proceed system by system, present the detection results for the particular error type, and discuss which errors were actually detected, which were incorrectly diagnosed, the characteristics of the errors that were not found, and the false alarms. A short conclusion ends each error type presentation. Errors cited from Child Data refer either to previously discussed examples or directly to the index numbers of the error corpus presented in Appendix B.1. A system's diagnosis is presented exactly as given by the particular system. All detection results are summarized and the overall performance is presented in Section 5.5.5.

Agreement in Noun Phrases

Most of the errors in Child Data concern definiteness in the noun, and gender or number in the determiner, of noun phrases; errors that, according to the error specifications, are explicitly covered by all three tools. All three also check for errors in the masculine gender form of adjectives, and for agreement between the quantifier and the noun in partitive constructions. The latter type as found in Child Data concerns the form of the noun rather than the form of the quantifier (see (4.11) on p. 50).

Grammatifix detected seven errors in definiteness and gender agreement. One of the errors in the masculine form of an adjective was only partly detected and was given a wrong diagnosis. The error concerns inconsistency in the use of adjectives (previously discussed in (4.9) on p. 49): either both adjectives should carry the masculine gender form, or both should have the unmarked form. Grammatifix's detection of this error is exemplified in (5.18), where we see that, due to the split noun, the error was diagnosed as a gender agreement error between the common gender determiner den 'the [com]' and the first part of the split noun, troll 'troll [neu]', which is neuter.
An interesting observation is that when the split noun is corrected, forming the correct word trollkarlen 'magician [com, def]', Grammatifix does not react and the error in the adjectives is not discovered. Grammatifix only checks whether the masculine form of an adjective occurs together with a non-masculine noun, not consistency of use, as in this error sample.

(5.18) det va *den hemske *fula troll karlen (⇒ trollkarlen) tokig som ...
       (it was the [com, def] awful [masc, wk] ugly [wk] troll [neu, indef] man [com, def] (⇒ magician [com, def]) Tokig that)
       – 'It was the awful ugly magician Tokig that ...'
       Grammatifix's diagnosis: Check the word form den 'the [com, def]'. If a determiner modifies a noun with neuter gender, e.g. troll 'troll', the determiner should also have neuter gender ⇒ det 'the [neu, def]'.

In general, simple constructions with a determiner and a noun are detected, whereas more complex noun phrases were missed. Three errors in the definiteness form of the noun were overlooked (G1.1.1; G1.1.2, see (4.2) p. 46; G1.1.3, see (4.3) p. 46). Concerning gender agreement, one error involving the masculine form of an adjective was missed (G1.2.4, see (4.8) p. 48). None of the errors in number agreement were detected: one with a determiner error (G1.3.1, see (4.10) p. 49) and two with partitive constructions (G1.3.2, see (4.11) p. 50; G1.3.3). Grammatifix made altogether 20 false flaggings, 16 of which involved other error categories, mostly splits (12 false alarms), such as the one in (5.19):

(5.19) det var ett stort sten hus
       (it was a [neu] big [neu] stone [com] house [neu])
       – 'It was a big stone-house.'
       Grammatifix's diagnosis: Check the word form ett 'a [neu]'. If a determiner modifies a noun with common gender, e.g. sten 'stone [com]', the determiner should also have common gender ⇒ en 'a [com]'.
The overall performance of Grammatifix in detecting errors in noun phrase agreement thus amounts to 53% recall and 29% precision.

Granska detected six errors in definiteness and two in gender agreement, one of the latter in a partitive noun phrase (G1.2.2). In three cases where the error concerned the definiteness form of the noun, Granska suggested changing the determiner (and adjective) instead: correcting G1.1.7 as den räkningen 'the [def] bill [def]' instead of en räkning 'a [indef] bill [indef]' (see (4.6) p. 47); likewise for error G1.1.8, where en kompisen 'a [indef] friend [def]' is corrected as den kompisen 'the [def] friend [def]'; and the opposite for G1.1.2, where the definite determiner and adjective in den hemska pyroman 'the [def] awful pyromaniac [indef]' are changed to indefinite forms instead of changing the form of the noun to definite (see (4.2) p. 46). Two errors in definiteness agreement (G1.1.1; G1.1.3, see (4.3) p. 46), none of the errors in the masculine form of adjectives (G1.2.3, see (4.9) p. 49; G1.2.4, see (4.8) p. 48), and all errors in number agreement were left undiscovered by Granska. Grammatical coverage for this error type thus results in 53% recall. Some false alarms occurred (25), of which 17 included other error categories, with splits the most represented (9 false alarms), resulting in a slightly lower precision rate of 24% in comparison to Grammatifix.

Scarrie detected six errors in definiteness agreement, one in gender agreement in a partitive noun phrase, two in the masculine form of adjectives and one in number agreement. In the case of number agreement, the error in det tre tjejerna 'the [sg] three girls [pl]' (G1.3.1, see (4.10) p. 49) is incorrectly diagnosed as an error in the noun instead of in the determiner. Exactly as Grammatifix, Scarrie detected the error in G1.2.3 due to the split noun and gave the same incorrect diagnosis (see (5.18) above).
The missed errors include two errors in the definiteness of the noun, one with a possessive determiner (G1.1.4, see (4.4) p. 47) and one with an indefinite determiner (G1.1.7, see (4.6) p. 47). One error concerned gender agreement, with an incorrect determiner with a compound noun (G1.2.1, see (4.7) p. 48). Finally, two errors in the number of the noun in partitive constructions were not detected (G1.3.2, see (4.11) p. 50; G1.3.3). Many false alarms occurred (133), and 50 of them concerned other error categories, mostly splits (33 false alarms), as in (5.20):

(5.20) han tittade i ett jord hål
       (he looked into a [neu] ground [com] hole [neu])
       – 'He looked into a hole in the ground.'
       Scarrie's diagnosis: wrong gender

Others involved spelling errors (10 false alarms), as in (5.21), where the pronoun vad 'what' is written as var and interpreted as the pronoun 'each', which does not agree in number with the following noun.

(5.21) Själv tycker jag att killarnas metoder är mer öppen och ärlig men också mer elak än var (⇒ vad) tjejernas metoder är.
       (self think I that the-boys' methods [pl] are more open and honest but also more mean than each [sg] (⇒ what) the-girls' [pl] methods are)
       – 'I think myself that the boys' methods are more open and honest but also more mean than the girls' methods are.'
       Scarrie's diagnosis: wrong number

Some false flaggings also concerned sentence boundaries (7 false alarms), as in (5.22):

(5.22) pojken gick till fönstret och ropade på grodan men vad dumt hunden har fastnat i burken där grodan var.
       (the-boy went to the-window and shouted at the-frog but what silly [neu] the-dog [com] had stuck in the-pot there the-frog was)
       – 'The boy went to the window and shouted at the frog, but how silly, the dog got stuck in the pot where the frog was.'
       Scarrie's diagnosis: wrong form in adjective

But mostly, ambiguity problems occurred (83 false alarms), as in (5.23a) and (5.23b):

(5.23) a.
dessutom luktade det saltgurka.
          (besides smelled it/the [neu] pickled-gherkin [com])
          – 'Besides, it smelled like pickled gherkin.'
          Scarrie's diagnosis: wrong gender
       b. Jag trampade rakt på den och skar upp hela min vänstra fot.
          (I walked right on it and cut up whole my left [pl, def] foot [sg, indef])
          – 'I stepped right on it and cut open my whole left foot.'
          Scarrie's diagnosis: wrong number

Scarrie's coverage for this error type is 67%, but the high number of false alarms results in a very low precision value of only 7%.

In conclusion, only Scarrie detected more than half of the errors in noun phrase agreement, but at the cost of many false alarms. Grammatifix and Granska displayed similarities in the detection of this error type, finding almost the same errors, and their false alarms are not that many. Scarrie's coverage differs from that of the other tools, and its high number of false alarms considerably decreased the precision score for this error type. All tools failed to find the erroneous forms in the head nouns of the partitive noun phrases (G1.3.2, see (4.11) p. 50; G1.3.3), which are most likely not defined in the grammars of these systems.

Agreement in Predicative Complement

All the tools cover errors in both number and gender agreement in the predicative complement. These types of errors in Child Data are, however, in most cases represented by rather complex phrase structures and will thus at most result in three detections. Grammatifix detected only one instance of all the agreement errors in the predicative complement (G2.2.6), and yielded an incomplete analysis of this particular error. It failed in that only the context of a single sentence is taken into consideration. Due to the ambiguity of the noun between a singular and a plural form, Grammatifix detected this error as one of gender agreement, but should have suggested the plural form instead, which is clear from the preceding context (see (5.1) and the discussion of detection possibilities in Section 5.3, p. 119).
Grammatifix thus obtained a very low recall (13%) for this error type. Three false alarms (one with a split) result in a precision value of 25%.

The three simple constructions with agreement errors in the predicative complement were all detected by Granska (G2.1.1, see (4.12) p. 51; G2.2.3, see (4.13) p. 51; G2.2.6, see (5.1) p. 119). In the case of G2.2.6, discussed above, the plural alternative is suggested. In error G2.2.3, the predicative complement includes a coordinated adjective phrase with errors in all three adjectives. Granska detected the first part:

(5.24) Själv tycker jag att killarnas metoder är mer *öppen och *ärlig men också mer *elak än var (⇒ vad) tjejernas metoder är.
       (self think I that the-boys' [pl] methods [pl] are more open [sg] and honest [sg] but also more mean [sg] than was (⇒ what) the-girls' methods are)
       – 'I think myself that the boys' methods are more open and honest but also more mean than the girls' methods are.'
       Granska's diagnosis: If öppen 'open [sg]' refers to metoder 'methods [pl]', that is an agreement error ⇒ killarnas metoder är mer öppna 'the boys' [pl] methods [pl] are more open [pl]'

Granska thus obtained a coverage value of 38% for this error type; with 5 false alarms (including one with a split and one with a spelling error), the precision rate is also 38%.

In the case of Scarrie, no errors in predicative complement agreement were detected; only 13 false flaggings occurred, which leaves this category with no results for recall or precision. The false alarms are due to incorrectly chosen segments, as in the following examples. In (5.25a) we have a coordinated noun phrase, where only the second part is considered and interpreted as a singular noun that does not agree with the plural adjective phrase as its predicative complement. In (5.25b) the verb pratade 'spoke [pret]' is interpreted as a plural past participle form and is considered as not agreeing with the preceding singular pronoun hon 'she [sg]'.

(5.25) a.
Han och hans hund var mycket stolta över den.
          (he and his dog [sg] were/was very proud [pl] over it)
          – 'He and his dog were very proud of it.'
          Scarrie's diagnosis: wrong number in adjective in predicative complement
       b. då sa jag till dom och våran lärare att hon blev mobbad och efter det så pratade läraren med dom som mobbade henne och då slutade dom med det.
          (then said I to them and our teacher that she [sg] was harassed and after that so spoke [pl] the-teacher with them that harassed her and then stopped they with that)
          – 'Then I told them and our teacher that she was harassed, and after that the teacher spoke to those that harassed her, and then they stopped with that.'
          Scarrie's diagnosis: wrong number in adjective in predicative complement

In conclusion, only Granska detected at least the simplest forms of agreement errors in the predicative complement. The other tools had problems with selecting the correct segments, especially Scarrie with its high number of false alarms.

Pronoun Form Errors

All three tools check explicitly for pronoun case errors after certain prepositions. Three of the four error instances in Child Data are preceded by a preposition. Grammatifix found two errors in pronoun form in the context of different prepositions (G4.1.1, see (4.19) p. 54; G4.1.3). No false flaggings occurred. Granska found three errors in the context of the prepositions efter 'after' and med 'with' (G4.1.1, see (4.19) p. 54; G4.1.4; G4.1.5, see (4.18) p. 53), which gives a recall rate of 60%. However, many false alarms (24) occurred, involving conjunctions interpreted as prepositions (17 flaggings) or prepositions at a sentence boundary where punctuation is missing (5 flaggings), resulting in a very low precision value of 11%. In (5.26a) we see an example of a false alarm with the conjunction för 'because', and in (5.26b) one with a preposition ending a sentence followed by a personal pronoun as the subject of the next sentence:

(5.26) a.
Vi skulle åka in i hamnen för hon skulle berätta något för sin mamma.
          (we would go in into the-port for she [nom] would tell something for her mother)
          – 'We would go into the port because she should tell something to her mother.'
          Granska's diagnosis: Erroneous pronoun form, use object form ⇒ för henne 'for her [acc]'
       b. ... och jag kom då tänka på den byn vi va (⇒ var) i jag berätta (⇒ berättade) om byn och dom sa att det va (⇒ var) deras by.
          (and I came then think at the the-village we what (⇒ were) in I [nom] tell (⇒ told) about the-village and they said that it what (⇒ was) their village)
          – '... and I then came to think of the village we were in. I told about the village and they said that it was their village.'
          Granska's diagnosis: Erroneous pronoun form, use object form ⇒ i mig 'in me [acc]'

Scarrie also found three error instances (G4.1.1, see (4.19) p. 54; G4.1.3; G4.1.4), all with different prepositions. False flaggings occurred here too, due to ambiguity problems, as for example in (5.27) and (5.28).

(5.27) Jag gick och gick tills jag hörde Pappa skrika kom kom
       (I walked and walked until I heard daddy scream come come)
       – 'I walked and walked until I heard daddy scream: Come! Come!'
       Scarrie's diagnosis: wrong form of pronoun

(5.28) a. Erik frågade om han kunde få ett barn.
          (Erik asked if/about he could get a child)
          – 'Erik asked if he could get a child.'
          Scarrie's diagnosis: wrong form of pronoun
       b. Tänk om jag bott hos pappa.
          (think if/about I lived with daddy)
          – 'Think if I lived with daddy.'
          Scarrie's diagnosis: wrong form of pronoun

Scarrie thus obtains a recall of 60%, but with 17 false alarms it attains a precision rate of only 15% for errors in pronoun case. In conclusion, as seen in the above examples, the tools search for errors in pronoun form after certain types of prepositions, but due to ambiguity in these words they fail more often than they succeed in detecting these errors.
Finite Verb Form Errors

Errors in finite verbs concern non-inflected verb forms, the most common error found in Child Data. All of the tools search for missing finite verbs in sentences and, judging from the examples in the error specifications, it seems that they detect exactly this type of error.

Grammatifix detected very few instances of sentences lacking a finite verb. Altogether four such errors were recognized, and in one of them Grammatifix suggested correcting another verb. In total, seven false alarms occurred, flagging verbs after an infinitive marker, as in (5.29), or after an auxiliary verb, as in (5.30).

(5.29) dom la sig ner för att ta skydd under natten
       (they lay themselves down for to take [inf] shelter during the-night)
       – 'They lay down to take shelter during the night.'
       Grammatifix's diagnosis: The sentence seems to lack a tense-inflected verb form. If such a construction is necessary, you can try to change ta 'take'.

(5.30) det kan ju bero på att föräldrarna inte bryr sig dom kanske inte ens vet att man har prov för dom lyssnar inte på sitt barn för en del kan ju behöva hjälp av sina föräldrar
       (it can of-course depend on that the-parents not care themselves they maybe not even know that one has test for they listen not to their children for some can of-course need help from their parents)
       – 'It can be because the parents do not care. They probably do not even know that you have a test, because they do not listen to their child, because some can need help from their parents.'
       Grammatifix's diagnosis: The sentence seems to lack a tense-inflected verb form. If such a construction is necessary, you can try to change behöva 'need'.

It seems that Grammatifix cannot cope with longer sentences. For instance, when the example in (5.30) is broken up after det kan ju bero på ..., the error marking is no longer highlighted.
Since many errors with non-finite verbs as the predicates of sentences occurred in Child Data, Grammatifix obtains a low recall value of 4%. False alarms were relatively few, which gives a precision rate of 36%.

Granska also checks for clauses where a finite verb form is missing. It detected altogether nine errors in verbs lacking tense endings, resulting in a recall of just 8%. Nine false flaggings occurred, mostly with imperatives, which gives a precision score of 44%. Some other alarms concerned exclamations, such as Grodan! 'Frog!' or Tyst! 'Silence!', or fragment clauses where no verb was used (29 alarms). These are excluded from the present analysis.

Scarrie explicitly checks verb forms in the predicate of a sentence and detected 17 errors in Child Data, with two diagnoses: 'wrong verb form in the predicate' or 'no inflected predicative verb'. Altogether 13 false flaggings occurred, due to the marking of correct finite verbs. One false alarm involved a split, as shown below in (5.31). Scarrie has the best result of the three systems for this error type, with 15% recall and 57% precision.

(5.31) Han ring de till mig sen och sa samma sak.
       (he call [pret] to me later and said same thing)
       – 'He phoned me later and said the same thing.'
       Scarrie's diagnosis: wrong verb form in predicate

In conclusion, the tools succeeded in detecting at most 17 cases of errors in finite verb form, a very low coverage rate for this frequent error type. The worst detection rate is Grammatifix's, the best Scarrie's.

Verb Form after Auxiliary Verb

All the tools include detection of errors in the verb form after auxiliary verbs. In Child Data, only one of these erroneous verb clusters included an inserted adverb, and one occurred in a coordinated verb. Grammatifix did not find any of these errors. Four instances of erroneous verb form after an auxiliary verb were detected by Granska.
The remaining three errors that were not detected are presented in (5.32): G6.1.2, a coordinated verb in (5.32a); G6.1.5, a verb with a preceding adverb in (5.32b); and G6.1.6, an auxiliary verb followed by a verb in imperative form in (5.32c).

(5.32) a. Ibland får [pres] man bjuda [inf] på sig själv och *låter [pres] henne/honom vara med!
– Sometimes one can make a sacrifice and let him/her take part.
b. han råkade [pret] bara *kom [pret] emot getingboet
– He just happened to come across the wasp's nest.
c. Det är något som vi alla nog skulle [pret] *gör [imp] om vi inte hade läst på ett prov.
– This is something that we all probably would do if we had not been studying for a test.

Five false alarms occurred at sentence boundaries. In (5.33a) we see an example where the end of a preceding direct-speech clause is not marked and the final verb is selected together with the main verb of the subsequent clause. Similarly, in (5.33b) the verb cluster ending a clause whose boundary is not marked is selected together with the (adverb and) initial main verb of the subsequent clause.

(5.33) a. ALARM: Jo, det kanske han kan [pres] sa [pret] pappa.
– No, maybe he can, said Daddy.
GRANSKA'S DIAGNOSIS: unusual with verb form sa 'said [pret]' after modal verb kan 'can [pres]'. ⇒ kan säga 'can [pres] say [inf]'
b. ALARM: precis när dom skulle [pret] börja [inf] så hörde [pret] dom en röst
– Just when they were about to begin, they heard a voice.
GRANSKA'S DIAGNOSIS: unusual with verb form hörde 'heard [pret]' after modal verb skulle 'would [pret]'. ⇒ skulle börja så ha hört 'would [pret] start [inf] so have [inf] heard [sup]' or skulle börja så höra 'would [pret] start [inf] so hear [inf]'
Granska's performance rates are 57% recall and 44% precision. Scarrie detected only one error in verb form after an auxiliary verb in Child Data (G6.1.6; see (5.32c) above) and made altogether nine false flaggings. Two false alarms occurred at sentence boundaries, one of them in the same instance as for Granska, see (5.33a) above. Scarrie ends up with a performance result of 14% recall and 10% precision.

In conclusion, Granska detects more than half of the verb errors after the auxiliary, but the performance of the other tools is very low, detecting either none or just one such error.

Missing Auxiliary Verb

All the tools check explicitly for supine verb forms without the infinitive form of the auxiliary ha 'have'. It is not clear if they also check for omission of the finite forms of the auxiliary verb in front of a bare supine. In Swedish, the bare supine is only used in subordinate clauses (see Section 4.3.5). Two errors with a bare supine form in main clauses were found in Child Data.

Grammatifix did not find these two errors. Instead, Grammatifix suggested insertion of the auxiliary verb ha 'have' in constructions between an auxiliary verb and a supine verb form. This is rather a stylistic correction and is not part of the present analysis. Altogether, nine such suggestions were made of the kind given below:

(5.34) ALARM: jag skulle [pret] ätit [sup] för en kvart sen
– I should have eaten a quarter of an hour ago.
GRAMMATIFIX'S DIAGNOSIS: Consider the word ätit 'eaten [sup]'. A verb such as skulle 'should [pret]' combines in polished style with ha 'have [inf]' + supine rather than with only a supine. ⇒ skulle ha ätit 'should [pret] have [inf] eaten [sup]'

The same happened with Granska: no errors were detected, and the suggestions made were for insertion of the auxiliary ha 'have' in front of supine forms preceded by auxiliary verbs.
Seven such flaggings occurred, as in (5.35); two further flaggings were false and occurred at sentence boundaries.

(5.35) ALARM: Jag måste [pret] svimmat [sup].
– I must have fainted.
GRANSKA'S DIAGNOSIS: unusual with verb form svimmat 'fainted [sup]' after the modal verb måste 'must [pret]'. ⇒ måste ha svimmat 'must [pret] have [inf] fainted [sup]'

Scarrie did find one of the error instances in Child Data with a missing auxiliary verb (G6.2.1). Eight other detections included the same stylistic issue as for the other tools, suggesting insertion of ha 'have' between an auxiliary verb and a supine verb form, as in:

(5.36) ALARM: de kunde [pret] berott [sup] på att dom gillade samma tjej
– It could have been because they liked the same girl.
SCARRIE'S DIAGNOSIS: wrong verb form after modal verb

In conclusion, just one of the two missing auxiliary verb errors in Child Data was found, by Scarrie. The systems pay more attention to the stylistic issue of omitted ha 'have' with supine forms, pointing out that the supine verb form should not stand alone in formal prose.

Verb Form in Infinitive Phrase

Granska and Scarrie search for erroneous verb forms following an infinitive marker and should not have problems finding these errors in Child Data, where only one instance included an adverb splitting the infinitive. Granska identified three errors in verb form after an infinitive marker, missing only the one with an adverb between the parts of the infinitive (G7.1.1; see (4.35) p. 62). This problem of syntactic coverage was already discussed in Section 5.4.4 in the examples in (5.13), which also showed that Granska does not take adverbs into consideration. Altogether, six false alarms occurred. Granska's overall performance rates are 75% recall and 33% precision.
Scarrie detected one of the errors in Child Data, where the infinitive marker is followed by a verb in imperative form instead of infinitive: att gör 'to do [imp]' (G7.1.4). Also, one false flagging occurred, shown in (5.37), where it seems that the system misinterpreted the conjunction för att 'because' as the infinitive marker att 'to':

(5.37) ALARM: så jag sa att hon skulle ta det lite lugnt för att annars så kan [pres] hon skada [inf] sig och det är ju inte så bra.
– So I said that she should take it easy a little because otherwise she might hurt herself and that is of course not so good.
SCARRIE'S DIAGNOSIS: inflected verb form after att 'to'

In conclusion, Granska finds all but one of the errors, missing that one due to insufficient syntactic coverage, and also makes quite a few false flaggings. Scarrie has difficulties with this error type and Grammatifix does not target it at all.

Missing Infinitive Marker with Verbs

All the tools check explicitly for both a missing and an extra inserted infinitive marker. Three errors with a missing infinitive marker occurred in Child Data, all in the context of the auxiliary verb komma 'will'. As presented in Section 4.3.5, certain main verbs also take an infinitive phrase as complement, and some drop the infinitive marker and start to behave like auxiliary verbs, which normally do not combine with an infinitive marker and take only bare infinitives as complement. This development is in progress in Swedish, which suggests that these constructions should rather be treated as stylistic issues.

Grammatifix did not find the three errors in Child Data with omitted infinitive markers with the auxiliary verb komma 'will' (see example (4.36) p. 62). In seven cases, the tool instead suggested removing the infinitive marker with the verbs börja 'begin' and tänka 'think', e.g. in (5.38):
a. ALARM: Jag och Virginia började [pret] att berätta [inf] om tromben och den övergivna byn
– Virginia and I started to tell about the tornado and the abandoned village.
GRAMMATIFIX'S DIAGNOSIS: Check the words att 'to' and berätta 'tell [inf]'. If an infinitive is governed by the verb började 'started [pret]', the infinitive should not be preceded by att 'to'. ⇒ började berätta 'started [pret] tell [inf]'
b. ALARM: 4 hus och 5 affärer var ordning gjorda av gumman som hade [pret] tänkt [sup] att göra [inf] museum av den gamla staden
– 4 houses and 5 shops were tidied up by the old lady who had planned to make a museum of the old city.
GRAMMATIFIX'S DIAGNOSIS: Check the words att 'to' and göra 'make [inf]'. If an infinitive is governed by the verb tänkt 'thought [sup]', the infinitive should not be preceded by att 'to'. ⇒ tänkt göra 'thought [sup] make [inf]'

Granska detected all three omitted infinitive markers in the context of the auxiliary verb komma 'will'. Six false flaggings also occurred in this case, concerning the same verb used as a main verb, e.g.:

(5.39) ALARM: han kommer [pres] och klappar alla på handen utan en kille undra (⇒ undrar) hur han känner sig då?
– He comes and pats everybody's hand except one boy. (I) wonder how he feels then?
GRANSKA'S DIAGNOSIS: kommer 'will' without att 'to' before verb in infinitive

In two cases, Granska also suggested insertion of the infinitive marker with the verbs fortsätta 'continue' and prova 'try'. In nine cases, it wanted to remove the infinitive marker with the verbs börja 'begin', försöka 'try', sluta 'stop' and tänka 'think'.
Scarrie detected two of the three missing infinitive marker errors with the verb komma 'will' found in Child Data. Quite a large number of false alarms (13) occurred with the verb used as a main verb, as in (5.40), where så is ambiguous between the conjunction 'so'/'and' and the verb reading 'sow'. The precision rate is then only 13%.

(5.40) ALARM: men kom nu så går vi hem
– But come now and we'll go home.
SCARRIE'S DIAGNOSIS: att 'to' missing

In five cases, Scarrie suggested removal of the infinitive marker in the context of the verbs börja 'begin', fortsätta 'continue' and sluta 'stop'.

In conclusion, whereas both Granska and Scarrie performed well, Grammatifix did not succeed in tracing any of the errors with omitted infinitive markers with the auxiliary verb komma 'will'. Overall, all the tools suggested both omission and insertion of infinitive markers with certain main verbs. In some cases they agree, but there are also cases where one system suggests removal of the infinitive marker and another suggests insertion. A clear indication of confusion in the use or omission of the infinitive marker showed up when Granska suggested inserting the infinitive marker in the verb sequence fortsätta leva 'continue live', as shown in (5.41a), whereas in (5.41b) Scarrie suggested removing it in the same verb sequence. This clearly indicates that the issue should be classified as a matter of style and not as a pure grammar error.

(5.41) a. ALARM: när jag dog 1978 i cancer återvände jag hit för att fortsätta [inf] leva [inf] mitt liv här
– When I died in 1978 of cancer, I returned here to continue living my life here.
DIAGNOSIS: Granska: ⇒ fortsätta att leva 'continue to live'
b. Vi fortsatte [pret] att leva [inf] som en hel familj i vårt nya hus här i Göteborg.
– We continued to live as a whole family in our new house here in Göteborg.
DIAGNOSIS: Scarrie: ⇒ fortsatte leva 'continued live'

Word Order Errors

All three tools check the position of adverbs (or negation) in subordinate clauses and constituent order in interrogative subordinate clauses. Scarrie also checks word order in main clauses with inversion. The word order errors found in Child Data are all quite complex, and none of the tools succeeded in detecting this type of error. However, false flaggings of correct sentences occurred.

Grammatifix made 15 false alarms when checking word order; one involved a split word and three occurred at clause boundaries. A false flagging involving a clause boundary is presented in (5.42a), where Grammatifix treated the adverb hem 'home' as wrongly placed between verbs. The problem is complicated not only by the second verb initiating a subsequent clause, but also by the fact that not all adverbs can precede verbs. Another false flagging is presented in (5.42b), where Grammatifix checked for adverbs placed after the main verb in the expected subordinate clause; here, however, main clause word order is found in the indirect speech construction.28

28 Main clause word order occurs when the clause expresses the speaker's or the subject's opinion or beliefs.

(5.42) a. ALARM: När vi kom hem undra (⇒ undrar) självklart mamma vart vi varit...
– When we came home, mother of course wondered where we had been.
GRAMMATIFIX'S DIAGNOSIS: Check the placement of hem 'home'. In a subclause an adverb is not usually placed between the verbs. Placement before the finite verb is often suitable.
b. ALARM: killen i luren sa att han kommer genast
– The guy on the phone said that he would come immediately.
GRAMMATIFIX'S DIAGNOSIS: Check the placement of genast 'immediately'. In a subclause a sentential adverb is by rule placed before the finite verb. ⇒ genast kommer 'immediately comes'

In (5.43) the sentence is erroneously marked as a word order error in the placement of negation. The problem, however, concerns the choice of the (explanative) conjunction för att 'since/due to', which combines with a main clause and is more typical of spoken Swedish (Teleman et al., 1999, Part 2:730). This conjunction corresponds to för 'due to/in order to' in writing and then coordinates only main clauses. It is often confused with the causal subjunction för att 'because/with the intention of', which is used only with subordinate clauses and then requires adverbs to be placed before the main verb (Teleman et al., 1999, Part 2:736).

(5.43) ALARM: ...då sa han ja för att han ville inte berätta för fröken att han var ensam
– ... then he said yes, because he did not want to tell the teacher that he was alone.
GRAMMATIFIX'S DIAGNOSIS: Check the placement of inte 'not'. In a subclause a sentential adverb is by rule placed before the finite verb. ⇒ inte ville 'not wanted'

All 15 of Granska's flaggings were false, interpreting conjunctions as subjunctions as in (5.44a), or not taking indirect speech into consideration as in (5.44b), where the subject's opinion is expressed by main clause word order and not the subordinate clause word order assumed by the tool.

(5.44) a. ALARM: ... men den gick av så jag hade bara lite gips kvar.
– ... but it broke off, so I only had a little plaster left.
GRANSKA'S DIAGNOSIS: Word order error, erroneous placement of adverb in subordinate clause. ⇒ bara hade 'just had'
b. ALARM: då tycker jag att det var inte hans fel utan deras.
– Then I think that it was not his fault but theirs.
GRANSKA'S DIAGNOSIS: Word order error, erroneous placement of adverb in subordinate clause. ⇒ inte var 'not was'

Scarrie's 11 diagnoses were also false, mostly of the type "subject taking the position of the verb" as in (5.45a), together with cases of interpreting conjunctions as subjunctions as in (5.45b):

(5.45) a. ALARM: Då vi kom till min by. Trillade jag av brand bilen för det var en guppig väg.
– When we arrived in my village, I fell off the fire engine because the road was bumpy.
SCARRIE'S DIAGNOSIS: the subject in the verb position
b. ALARM: dom kanske inte ens vet att man har prov för dom lyssnar inte på sitt barn ...
– They probably do not even know that you have a test, because they do not listen to their child ...
SCARRIE'S DIAGNOSIS: the inflected verb before sentence adverbial in subordinate clause

In conclusion, word order errors were hard to find due to their inner complexity. The tools seem to apply rather straightforward approaches, which resulted in many false flaggings.

Redundancy

According to the error specifications, only Grammatifix searches for repeated words and should thus be able to at least detect errors with doubled words. Grammatifix identified the five errors with duplicated words immediately following each other. The number of false alarms is quite high (18 occurrences). One example is given below:

(5.46) ALARM: Var var den där överraskningen.
– Where was that surprise?
GRAMMATIFIX'S DIAGNOSIS: doubled word

No other superfluous elements were detected, so the system ends up with a performance rate of 38% recall and 23% precision.
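The recall, precision and F-value figures quoted throughout this chapter are all derived from the raw alarm counts in the same way. As a reference, the following is a small sketch (in Python, not part of the original thesis) of that computation, using Grammatifix's finite verb form result from above (4 correct alarms, 7 false alarms, 110 errors in Child Data):

```python
def evaluate(correct_alarms, false_alarms, total_errors):
    """Recall, precision and balanced F-value from raw detection counts."""
    recall = correct_alarms / total_errors
    precision = correct_alarms / (correct_alarms + false_alarms)
    # F-value as the harmonic mean of precision and recall
    f_value = 2 * precision * recall / (precision + recall)
    return recall, precision, f_value

# Grammatifix on finite verb form errors: 4 correct alarms, 7 false alarms,
# 110 errors in Child Data (figures from the text above).
r, p, f = evaluate(4, 7, 110)
print(f"recall {r:.0%}, precision {p:.0%}, F-value {f:.0%}")
```

Running this reproduces the 4% recall, 36% precision and 7% F-value reported for Grammatifix on this error type.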
Missing Constituents

All three tools search for sentences with omitted verbs or infinitive markers, also in the context of a preceding preposition. Grammatifix did not find any missing verbs, but detected the only error with a missing infinitive marker in front of an infinitive verb after certain prepositions (G10.3.1), shown in (5.47).

(5.47) a. Efter — ha [inf] sprungit [sup] igenom häckarna två gånger så vilade vi lite ...
– After twice running through the hurdles, we rested a little.
b. Efter att ha sprungit 'after to have run'

Six false alarms occurred for this error type, mostly when the adverb tillbaka 'back' was erroneously split, as shown in (5.48). The problem is that the split word results in the preposition till 'to' and the verb baka 'bake'.

(5.48) ALARM: inget kvack kom till — baka
– no quack came back.
GRAMMATIFIX'S DIAGNOSIS: Check the word baka. If an infinitive is governed by a preposition it should be preceded by att 'to'.

In the case of omitted verbs, Granska only flagged occurrences of single words such as Slut. 'End.' or sentence fragments such as Tom grå och tyst. 'Empty, grey and silent.' or Inte ens pappa. 'Not even daddy.'. The program further suggested that the error might be a title: "Verb seems to be missing in the sentence. If this is a title it should not be ended with a period." Altogether, 25 sentences were judged to be missing a verb and 12 false alarms occurred. None of the errors listed in Child Data were detected by Granska. This particular error type is not included in the present performance analysis.

Granska also checks for missing subjects. Two cases concerned short sentence fragments and two were false flaggings, such as the one in (5.49) below.

(5.49) ALARM: Hade alla 7 vandrat förgäves?
– Had all seven walked in vain?
GRANSKA'S DIAGNOSIS: a subject seems to be missing in the sentence
Scarrie also checks for missing subjects and successfully detected the error G10.1.5, shown in (5.50). The other three flaggings were false. In the case of a missing infinitive marker in constructions where a preposition precedes an infinitive phrase, six false flaggings occurred. Like Grammatifix, Scarrie marks erroneous splits homonymous with prepositions (see (5.48) above).

(5.50) a. man försöker att lära barnen att om — fuskar med t ex ett prov då ...
– One tries to teach children that if they cheat on e.g. a test, then ...
b. om de fuskar med 'if they cheat with'

In conclusion, many of the omitted constituents are not covered by these tools and mostly result in false flaggings. Grammatifix successfully detected a missing infinitive marker preceded by a preposition, and Scarrie detected a missing subject.

Other Errors

Among other error types, all the tools also check whether a sentence has too many finite verbs. Grammatifix succeeded in finding three instances of unmarked sentence boundaries. In three cases, false flaggings occurred, listed in (5.51). Two of these flaggings concerned ambiguity between a verb and a pronoun, and the one in (5.51c) involved a spelling error that resulted in a verb. These alarms are not part of the system's performance test, since such errors were not the target of this analysis.

(5.51) a. ALARM: Han undrade var de var någonstans
– He wondered where they were.
GRAMMATIFIX'S DIAGNOSIS: Check the word forms undrade 'wondered' and var 'where/was'. It seems as if the sentence might have too many finite verbs.
b. ALARM: Var var den där överraskningen.
– Where was that surprise?
GRAMMATIFIX'S DIAGNOSIS: Check the word forms var 'where/was' and var 'where/was'. It seems as if the sentence might have too many finite verbs.
c. ALARM: Pojken blev red (⇒ rädd)
– The boy became afraid.
GRAMMATIFIX'S DIAGNOSIS: Check the word forms blev 'became' and red 'rode'. It seems as if the sentence might have too many finite verbs.

Granska checks for occurrences of other finite verbs after the copula verb vara 'be'. In Child Data, however, the only detections were false flaggings (8 occurrences), mostly due to homonymy between the verb and the adverb var 'where' (5 occurrences). Three false alarms occurred because of spelling errors, as in (5.52a), or at sentence boundaries, as in (5.52b):

(5.52) a. ALARM: Pojken blev [pret] red (⇒ rädd)
– The boy became afraid.
GRANSKA'S DIAGNOSIS: it is unusual to have a verb after the verb blev 'became [pret]'
b. ALARM: som tur var [pret] landade [pret] jag på skyddsnätet på brandbilen
– Luckily I landed on the safety net on the fire engine.
GRANSKA'S DIAGNOSIS: it is unusual to have a verb after the verb var 'was [pret]'

Scarrie also checks for occurrences of two finite verbs in a row, but provides a diagnosis of a possible sentence boundary as well. Eight sentence boundaries were found and eight false markings occurred, often due to lexical ambiguity, as in (5.53). In Scarrie's case as well, these alarms are not included in the analysis.

(5.53) ALARM: Men sen kom en tjej som visste vem jag var för hon ...
– But then came a girl who knew who I was, because she ...
SCARRIE'S DIAGNOSIS: two inflected verbs in predicate position or a sentence boundary

Finally, Scarrie checks noun case, suggesting the genitive form of proper nouns in constructions of a proper noun followed by a noun.
All these detections resulted in false flaggings, due to part-of-speech ambiguity, e.g.:

(5.54) ALARM: Men på morgonen när Erik [nom] såg att hans groda var försvunnen.
– But in the morning, when Erik saw that his frog had disappeared.
SCARRIE'S DIAGNOSIS: basic form instead of genitive

5.5.5 Overall Detection Results

In accordance with the error specifications of the systems, none of the Swedish tools detects errors in definiteness in single nouns or in reference, and only Grammatifix checks for repeated words among the redundancy errors. Missing constituents are checked only when a verb, subject or infinitive marker is missing. Word choice errors, represented by prepositions in idiomatic expressions, are checked by Granska. The detection results on Child Data, discussed in the previous section, are summarized in Tables 5.4, 5.5 and 5.6 below.

Among the most frequent error types in Child Data (errors in finite verbs, missing constituents, word choice errors, agreement in noun phrases and redundant words), Grammatifix succeeded in finding errors in four of these types, Scarrie in three and Granska in two. All the tools were best at finding errors in noun phrase agreement, with recall between 53% and 67% and precision between 7% and 37%. For the most common error, finite verb form, all obtained very low coverage, with recall between 4% and 15% and precision between 36% and 57%. Grammatifix succeeded in finding all the repeated words among the redundancy errors and one occurrence of a missing constituent; Scarrie also found one missing constituent. No word choice errors were found by Granska. Other error types in Child Data occurred fewer than ten times each, so no general conclusions can be drawn on how the tools performed on those.
Table 5.4: Performance Results of Grammatifix on Child Data
(Columns: Errors | Correct Alarm: Correct Diagnosis, Incorrect Diagnosis | False Alarm: No Error, Other Error | Recall, Precision, F-value. A dash marks an empty cell.)

Agreement in NP               15 |  7  1 |  4 16 | 53%  29%  37%
Agreement in PRED              8 |  1  – |  2  1 | 13%  25%  17%
Definiteness in single nouns   6 |  –  – |  –  – |  0%   –    –
Pronoun case                   5 |  2  – |  –  – | 40% 100%  57%
Finite Verb Form             110 |  3  1 |  5  2 |  4%  36%   7%
Verb Form after Vaux           7 |  –  – |  –  – |  0%   –    –
Vaux Missing                   2 |  –  – |  –  – |  0%   –    –
Verb Form after inf. marker    4 |  –  – |  –  – |  0%   –    –
Inf. marker Missing            3 |  –  – |  –  – |  0%   –    –
Word order                     5 |  –  – | 11  4 |  0%   0%   –
Redundancy                    13 |  5  – | 16  1 | 38%  23%  29%
Missing Constituents          44 |  1  1 |  6  – |  5%  25%   8%
Word Choice                   28 |  –  – |  –  – |  0%   –    –
Reference                      8 |  –  – |  –  – |  0%   –    –
Other                          4 |  –  – |  –  – |  0%   –    –
TOTAL                        262 | 18  4 | 38 30 |  8%  24%  12%

Table 5.5: Performance Results of Granska on Child Data

Agreement in NP               15 |  5  3 |  8 17 | 53%  24%  33%
Agreement in PRED              8 |  3  – |  3  2 | 38%  38%  38%
Definiteness in single nouns   6 |  –  – |  –  – |  0%   –    –
Pronoun case                   5 |  3  – | 24  – | 60%  11%  19%
Finite Verb Form             110 |  8  1 |  8  1 |  8%  50%  14%
Verb Form after Vaux           7 |  4  – |  5  – | 57%  44%  50%
Vaux Missing                   2 |  –  – |  2  – |  0%   0%   –
Verb Form after inf. marker    4 |  3  – |  6  – | 75%  33%  46%
Inf. marker Missing            3 |  3  – |  6  – | 100% 33%  50%
Word order                     5 |  –  – | 15  – |  0%   0%   –
Redundancy                    13 |  –  – |  –  – |  0%   –    –
Missing Constituents          44 |  –  – |  2  – |  0%   0%   –
Word Choice                   28 |  –  – |  –  – |  0%   –    –
Reference                      8 |  –  – |  –  – |  0%   –    –
Other                          4 |  –  – |  –  – |  0%   –    –
TOTAL                        262 | 29  4 | 79 20 | 13%  25%  17%

Table 5.6: Performance Results of Scarrie on Child Data

Agreement in NP               15 |  8  2 | 83 50 | 67%   7%  13%
Agreement in PRED              8 |  –  – | 12  1 |  0%   0%   –
Definiteness in single nouns   6 |  –  – |  –  – |  0%   –    –
Pronoun case                   5 |  3  – | 17  – | 60%  15%  24%
Finite Verb Form             110 | 16  1 | 13  – | 15%  57%  24%
Verb Form after Vaux           7 |  1  – |  7  2 | 14%  10%  12%
Vaux Missing                   2 |  1  – |  –  – | 50% 100%  67%
Verb Form after inf. marker    4 |  1  – |  1  – | 25%  50%  33%
Inf. marker Missing            3 |  2  – | 13  – | 67%  13%  22%
Word order                     5 |  –  – | 11  – |  0%   0%   –
Redundancy                    13 |  –  – |  –  – |  0%   –    –
Missing Constituents          44 |  1  – |  4  5 |  2%  10%   3%
Word Choice                   28 |  –  – |  –  – |  0%   –    –
Reference                      8 |  –  – |  –  – |  0%   –    –
Other                          4 |  –  – |  –  – |  0%   –    –
TOTAL                        262 | 33  3 | 161 58 | 14%  14%  14%

Overall performance figures for detecting the errors in Child Data show that Grammatifix did not detect many of the verb errors at all and has the lowest recall. Scarrie, on the other hand, detects the most errors of them all, but has a high number of false flaggings. Errors in agreement with a predicative complement were hard to find in general, even in cases where the subject and the predicate were adjacent; more complex structures obviously pose even more of a problem for the tools. Even when errors were found in these constructions, the tools often gave an incorrect diagnosis. Quite many of the false flaggings involved errors other than grammatical ones.

The overall performance of the tools across all error types on Child Data ends up at a recall rate of at most 14% and a precision rate between 14% and 25%. Grammatifix detected the fewest errors and had the fewest false alarms, but its quite low recall leads to the lowest F-value, 12%. Granska found slightly more errors and had more false flaggings, obtaining the best F-value of 17%. Scarrie performed best of the tools in grammatical coverage, but at the cost of many false alarms, giving an F-value of 14%.

In Table 5.7 the overall performance of the systems is presented for the errors they target specifically, excluding the zero-result categories. Observe that the F-values are slightly higher due to increased recall; precision rates remain the same.
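The derived columns of the targeted-error summary in Table 5.7 follow directly from its count columns. A quick sketch (Python, not part of the original thesis) recomputing recall, precision and F-value from the raw counts:

```python
# Raw counts from Table 5.7: (tool, targeted errors, correct alarms, false alarms).
table_5_7 = [
    ("Grammatifix", 166, 22, 68),
    ("Granska", 174, 33, 97),
    ("Scarrie", 170, 36, 214),
]

for tool, errors, correct, false_alarms in table_5_7:
    recall = correct / errors                      # share of targeted errors found
    precision = correct / (correct + false_alarms) # share of alarms that were correct
    f_value = 2 * precision * recall / (precision + recall)
    print(f"{tool}: recall {recall:.0%}, precision {precision:.0%}, F {f_value:.0%}")
```

The printed figures match the percentages in the table (e.g. Granska: recall 19%, precision 25%, F 22%).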
Table 5.7: Performance Results of Targeted Errors29

Tool         | Errors | Correct Alarm | False Alarm | Recall | Precision | F-value
Grammatifix  |  166   |      22       |     68      |  13%   |   24%     |  17%
Granska      |  174   |      33       |     97      |  19%   |   25%     |  22%
Scarrie      |  170   |      36       |    214      |  21%   |   14%     |  17%

The performance tests on published adult texts and some student papers provided by the developers of these tools (see Table 5.3 on p. 141) show on average much higher rates for those texts, with overall coverage between 35% and 85% and precision between 53% and 77%. Granska proves best at detecting errors in verb form in the adult text data evaluated by its developers, with a recall rate of 97%. Verb form errors are mostly represented by errors in finite verb form in Child Data, where Granska obtained a recall of 8%. Other types of verb errors occurred fewer than ten times, which makes the performance results uncertain. For agreement errors in noun phrases, the second-best category for Granska when tested on adult texts, Granska obtained much better results and detected at least half of the errors, with a recall of 53%.

Since the error frequency is much higher in texts written by children, the size of the Child Data corpus can be considered satisfactory for evaluation, at least for the most frequent error types. This performance test shows that the three Swedish tools, designed in the first place for adult writers, have in general difficulty detecting errors in texts such as Child Data. As indicated in some examples, this is not only due to insufficient coverage of the defined error types in the systems. The structure of the texts may also cause certain errors not to be detected or to be erroneously marked as errors. Different results were sometimes obtained when sentences were split into smaller units.

29 Grammatifix: redundancy includes 5 errors in doubled word; missing constituents are counted as infinitive marker (1) and verb (5). Granska: missing verb (5), choice of preposition (10).
Scarrie: missing subject (10), missing infinitive marker (1).

5.6 Summary and Conclusion

From the above analyses it is clear that among the grammar errors found in Child Data, all non-structural errors and some types of structural errors should be possible to detect by syntactic analysis and partial parsing, whereas other errors require more complex analysis or wider context. Among the central error types in Child Data, errors in finite verb form and agreement errors in noun phrases can be handled by partial parsing, as I will show in Chapter 6. The other frequent errors, such as missing constituents, word choice errors and redundant words forming new lemmas, require deeper analysis. Furthermore, some real-word spelling errors might be detected if they violate syntax. Missing punctuation at sentence boundaries requires analysis of at least the predicate's complement structure.

Judging from the error specifications, all the errors in Child Data except definiteness in single nouns and reference seem to be more or less covered by the Swedish tools. The performance results show that agreement errors in noun phrases are the best-covered error type, whereas errors in finite verb forms, relative to their frequency, obtained a very low recall in all three systems. Grammatifix had in general difficulty detecting any errors concerning verbs; Granska performed best in this respect. Overall, all the tools detect few errors in Child Data and the precision rate is quite low. It is not clear how many of the missed errors were due to insufficient syntactic coverage and how many to the complexity of the sentences in Child Data. That is, all three tools rely on sentences being the unit of analysis, but "sentences" in Child Data do not always correspond to syntactic sentences: they often consist of adjoined clauses or are quite long (see Section 4.6). These tools are not designed to handle such complex structures.
In conclusion, many errors that can be handled by partial parsing in Child Data are detected at a rate of not more than 60% by the Swedish grammar checkers. Errors in finite verb form obtained quite low results and are the type of error that needs the most improvement, especially since they are the most common error in Child Data.

Chapter 6

FiniteCheck: A Grammar Error Detector

6.1 Introduction

This chapter reports on automatic detection of some of the grammar errors discussed in Chapter 4. The challenge of this part of the work is to exploit correct descriptions of language, instead of describing the structure of errors, and to apply finite state techniques to the whole process of error detection. The implemented grammar error detector FiniteCheck identifies grammar errors using partial finite state methods, recognizing syntactic patterns through a set of regular grammar rules (see Section 6.2.4). Constraints are used to reduce alternative parses or adjust the parsing result. There are no explicit error rules in the grammars of the system, in the sense that no grammar rules state the syntax of erroneous (ungrammatical) patterns. The rules of the grammar are always positive and define the grammatical structure of Swedish. The only error-related constraints concern the context of the error type. The present grammar is highly corpus-oriented, based on the lexical and syntactic circumstances displayed in the Child Data corpus. Ungrammatical patterns are detected by adopting the same method that Karttunen et al. (1997a) use for the extraction of invalid date expressions, presented in Section 6.2.4. In short, potential candidates for grammatical violations are identified through a broad grammar that overgenerates and also accepts invalid (ungrammatical) constructions. Valid (grammatical) patterns are defined in a second, narrow grammar, and the ungrammaticalities among the selected candidates are identified as the difference between these two grammars.
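The difference-based idea can be sketched with ordinary Python sets standing in for regular languages. This is a minimal illustration only: the tag names and the two-verb clusters are invented stand-ins, not the system's actual tagset or grammars, and the real system operates on finite state networks rather than finite sets.

```python
# Sketch of error detection by grammar subtraction, with finite sets of
# tag sequences in place of regular languages. Tag names are illustrative.

# Broad grammar: any sequence of two verbs counts as a candidate verb cluster.
verbs = {"vb_prs", "vb_inf", "vb_sup"}
broad = {(v1, v2) for v1 in verbs for v2 in verbs}

# Narrow grammar: only the grammatical combination is listed here,
# e.g. a finite (modal) verb followed by an infinitive.
narrow = {("vb_prs", "vb_inf")}

# The ungrammatical clusters are simply the set difference, broad - narrow.
errors = broad - narrow

assert ("vb_prs", "vb_prs") in errors      # e.g. *"ska blir"
assert ("vb_prs", "vb_inf") not in errors  # e.g. "ska bli"
```

Because the narrow grammar only ever states what is grammatical, no error pattern needs to be predicted in advance; anything the broad grammar selects that the narrow grammar does not sanction falls out as an error.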
In other words, the strings selected by the rules of the broad grammar that are not accepted by the narrow grammar are the remaining ungrammatical patterns.

The current system looks for errors in noun phrase agreement and verb form, such as the selection of finite and non-finite verb forms in main and subordinate clauses and infinitival complements. Errors in the finite form of the main verb were the most natural choice for implementation, since these are the most frequent error type in the Child Data corpus, represented by 110 error instances (see Figure 4.1 on p.73). Moreover, verb form errors are possible to detect using partial parsing techniques (see Section 5.3.3). Inclusion of errors in the finite main verb motivated the expansion of this category to other errors related to verbs, with the addition of other types of finite verb errors and errors in non-finite verb forms. Errors in noun phrase agreement were among the five most frequent error types. In comparison to other writing populations, this type of error might be considered one of the central error types in Swedish (see Section 4.7). Furthermore, noun phrase errors are limited to within the noun phrase and can most likely be detected by partial parsing (see Section 5.3). The other errors among the five most common error types in Child Data, including word choice errors and errors with extra or missing constituents, are not locally restricted in this way and will certainly require a more complex analysis.

The development of the grammar error detector started with the project Finite State Grammar for Finding Grammatical Errors in Swedish Text (1998–1999). It was part of a larger project, Integrated Language Tools for Writing and Document Handling, in collaboration with the Numerical Analysis and Computer Science Department (NADA) at the Royal Institute of Technology (KTH) in Stockholm.1 The project group in Göteborg consisted of Robin Cooper, Robert Andersson and myself.
In the description of the system I will cover the whole system and its functionality, in particular my own contributions, which mainly concern a first version of the lexicon, the expansion of the grammar and its adjustment to the present corpus data of children's texts, disambiguation and other adjustments to parsing results, as well as evaluation and improvements made to the system's flagging accuracy. The work of the other two members concerns primarily the final version of the lexicon, optimization of the tagset, the basic grammar and the system interface. I will not discuss their contributions in detail but will refer to the project reports when relevant.

The chapter proceeds with a short introduction to finite state techniques and parsing (Section 6.2). The description of FiniteCheck starts with an overview of the system's architecture, including short presentations of the different modules (Section 6.3). Then follows a section on the composition of the lexicon, with a description of the tagset and the identification of grammatical categories and features (Section 6.4). Next, the overgenerating broad grammar set is presented (Section 6.5), followed by a section on parsing (Section 6.6). The chapter then proceeds with a presentation of the narrow grammar of noun phrases and the verbal core (Section 6.7) and the actual error detection (Section 6.8). The chapter concludes with a summary (Section 6.9). Performance results of FiniteCheck are presented in Chapter 7.

1 The project was sponsored by the HSFR/NUTEK Language Technology Programme. See http://www.ling.gu.se/~sylvana/FSG/ for the methods and goals of our part of the project.

6.2 Finite State Methods and Tools

6.2.1 Finite State Methods in NLP

Finite state technology as such has been used since the emergence of computer science, for instance for program compilation, hardware modeling or database management (Roche, 1997).
Finite state calculus is generally considered powerful and well-designed, providing flexible, space- and time-efficient engineering applications. However, in the domain of Natural Language Processing (NLP), finite state models were long considered efficient but somewhat inaccurate, often resulting in applications of limited size. Other formalisms such as context-free grammars were preferred and considered more accurate than finite state methods, despite difficulties in reaching reasonable efficiency. Thus, grammars approximated by finite state models were considered more efficient and simpler, but at the cost of a loss of accuracy. Improvements in the mathematical properties of finite state methods and a reexamination of their descriptive possibilities enabled the emergence of applications for a variety of NLP tasks, such as morphological analysis (e.g. Karttunen et al., 1992; Clemenceau and Roche, 1993; Beesley and Karttunen, 2003), phonetic and speech processing (e.g. Pereira and Riley, 1997; Laporte, 1997) and parsing (e.g. Koskenniemi et al., 1992; Appelt et al., 1993; Abney, 1996; Grefenstette, 1996; Roche, 1997; Schiller, 1996).

In this section the finite state formalism is described, along with possibilities for the compilation of such devices (Section 6.2.2). Next, the Xerox compiler used in the present implementation is presented (Section 6.2.3). The techniques of finite state parsing are then explained, along with a description of a method for extracting invalid input from unrestricted text that plays an important role in the present implementation (Section 6.2.4).

6.2.2 Regular Grammars and Automata

Adopting finite state techniques in parsing means modeling the syntactic relations between words using regular grammars2 and applying finite state automata to recognize (or generate) the corresponding patterns defined by such grammars.
A finite state automaton is a computational model representing the regular expressions defined in a regular grammar. It takes a string of symbols as input, executes some operations in a finite number of steps and halts, with the outcome interpreted, depending on the grammar, as the machine either accepting or rejecting the input. It is defined formally as a tuple consisting of a finite set of symbols (the alphabet), a finite set of states with a unique initial state, a number of intermediate states and final states, and finally a transition relation defining how to proceed between the different states.3

Regular expressions represent sets of simple strings (a language) or sets of pairs of strings (a relation) mapping between two regular languages, upper and lower. Regular languages are represented by simple automata and regular relations by transducers. Transducers are bi-directional finite state automata, which means, for example, that the same automaton can be used for both analysis and generation.

Several tools for the compilation of regular expressions exist. AT&T's FSM Library4 is a toolbox designed for building speech recognition systems and supports the development of phonetic, lexical and language-modeling components. The compiler runs under UNIX and includes about 30 commands to construct weighted finite-state machines (Mohri and Sproat, 1996; Pereira and Riley, 1997; Mohri et al., 1998). FSA Utilities5 is another compiler, developed in the first place for experimental purposes in applying finite-state techniques to NLP. The tool is implemented in SICStus Prolog and makes it possible to compile new regular expressions from the basic operations, thus extending the set of regular expressions handled by the system (van Noord and Gerdemann, 1999). The compiler used in the present implementation is the Xerox Finite-State Tool, one of Xerox's software tools for computing with finite state networks, described further in the subsequent section.
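The tuple definition above can be made concrete in a few lines of code. The following is a minimal sketch, with an invented alphabet and language (the strings matching a b*), not any automaton used in FiniteCheck:

```python
# A minimal deterministic finite state automaton, following the tuple
# definition: an alphabet, a set of states, a unique initial state,
# final states, and a transition relation.
ALPHABET = {"a", "b"}
STATES = {0, 1}
INITIAL = 0
FINAL = {1}
TRANSITIONS = {(0, "a"): 1, (1, "b"): 1}  # accepts the language a b*

def accepts(string):
    """Run the automaton on the input and accept or reject it."""
    state = INITIAL
    for symbol in string:
        if symbol not in ALPHABET or (state, symbol) not in TRANSITIONS:
            return False  # no transition defined: the input is rejected
        state = TRANSITIONS[(state, symbol)]
    return state in FINAL  # accept only if halted in a final state

assert accepts("abb")
assert not accepts("ba")
```

The machine halts after exactly one step per input symbol, which is what makes recognition with finite state automata run in time linear in the length of the input.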
2 Regular grammars are also called type-3 in the classification introduced by Noam Chomsky (Chomsky, 1956, 1959).
3 See e.g. Hopcroft and Ullman (1979); Boman and Karlgren (1996) for exact formal definitions of finite state automata. A 'gentle' introduction is presented in Beesley and Karttunen (2003).
4 The homepage of AT&T's FSM Library: http://www.research.att.com/sw/tools/fsm/
5 The homepage of FSA Utilities: http://www.let.rug.nl/~vannoord/Fsa/

6.2.3 Xerox Finite State Tool

Introduction

Xerox research developed a system for computing with and compiling finite-state networks, the Xerox Finite State Tool (XFST).6 The tool is a successor to two earlier interfaces: IFSM, created at PARC by Lauri Karttunen and Todd Yampol in 1990–92, and FSC, developed at RXRC by Pasi Tapanainen in 1994–95 (Karttunen et al., 1997b). The system runs under UNIX and is supplemented with an interactive interface and a compiler. Finite state networks of simple automata or transducers are compiled from regular expressions and can be saved into a binary file. The networks can also be converted to Prolog format.

The Regular Expression Formalism

The metalanguage of regular expressions in XFST includes a set of basic operators such as union (or), concatenation, optionality, ignoring, iteration, complement (negation), intersection (and), subtraction (minus), crossproduct and composition, and an extended set of operators such as containment, restriction and replacement. The notational conventions for the part of the regular expression formalism in XFST that is used in the present implementation, including the operators and atomic expressions, are presented in Table 6.1 (cf. Karttunen et al., 1997b; Beesley and Karttunen, 2003). Uppercase letters such as A here denote regular expressions. For a description of the syntax and semantics of these operators see Karttunen et al. (1997a).
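For finite languages, the Boolean operators among these behave exactly like ordinary set operations, which gives a quick intuition for what the compiler computes. A small Python illustration with invented two-letter "languages" (the real operands in XFST are infinite regular languages represented as networks):

```python
# For finite languages, union, intersection and subtraction of regular
# languages coincide with the ordinary set operations on their strings.
A = {"ab", "ba", "aa"}
B = {"ab", "bb"}

assert A | B == {"ab", "ba", "aa", "bb"}  # union        A|B
assert A & B == {"ab"}                    # intersection A&B
assert A - B == {"ba", "aa"}              # subtraction  A-B

# Concatenation A B: every string of A followed by every string of B.
concat = {x + y for x in A for y in B}
assert "abab" in concat and len(concat) == 6
```

It is the subtraction operator in particular that the error detection in this thesis builds on.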
The replacement operators play an important role in the present implementation and are further explained below.

6 Technical documentation and a demonstration of XFST can be found at: http://www.rxrc.xerox.com/research/mltt/fst/

Table 6.1: Some Expressions and Operators in XFST

Atomic expressions
  0          epsilon symbol (the empty string)
  ?, ?*      any (unknown) symbol; the universal language
Unary operations
  A*         iteration: zero or more (Kleene star)
  A+         iteration: one or more (Kleene plus)
  (A)        optionality
  $A         containment
  ~A         complement (not)
Binary operations
  A B        concatenation
  A | B      union (or)
  A & B      intersection (and)
  A/B        ignoring
  A .o. B    composition
  A - B      subtraction (minus)
  A → B      replacement (simple)

Replacement Operators

The original version of the replacement operator was developed by Ronald M. Kaplan and Martin Kay in the early 1980s and was applied to phonological rewrite rules implemented by finite state transducers. Replacement rules can be applied in an unconditional version or constrained by context or direction (Karttunen, 1995, 1996). Simple (unconditional) replacement has the format UPPER → LOWER, denoting the regular relation (Karttunen, 1995):7

(RE6.1) [ NO_UPPER [UPPER .x. LOWER] ]* NO_UPPER;

For example, the relation [a b c → d e]8 maps the string abcde to dede. Replacement may start at any point and allow alternative replacements, making these transducers non-deterministic and able to yield multiple results. For example, a transducer represented by the regular expression in (RE6.2) produces four different results (axa, ax, xa, x) for the input string aba, as shown in (6.1) (Karttunen, 1996).

7 NO_UPPER corresponds to ~$[UPPER - []].
8 Lower-case letters, such as a, represent symbols. Symbols can be unary (e.g. a, b, c) or symbol pairs (e.g. a:x, b:0) denoting relations (i.e. transducers). An identity relation, where a symbol maps to the same symbol as in a:a, is ignored and thus written as a.
(RE6.2) a b | b | b a | a b a → x

(6.1) a b a → a x a   (b replaced)
      a b a → a x     (ba replaced)
      a b a → x a     (ab replaced)
      a b a → x       (aba replaced)

The directionality and the length of the replacement can be constrained by the directed replacement operators. The replacement can start from the left or from the right, choosing the longest or the shortest replacement. Four types of directed replacement are defined (Karttunen, 1996):

Table 6.2: Types of Directed Replacement

                 longest match   shortest match
  left-to-right  @→              @>
  right-to-left  →@              >@

Now, applying the same regular expression as above with left-to-right, longest-match replacement, as in the regular expression in (RE6.3), yields just one result for the string aba, as shown in (6.2).

(RE6.3) a b | b | b a | a b a @→ x

(6.2) a b a → x   (aba, the longest match, replaced)

Directed replacement is defined as a composition of four relations that are composed in advance by the XFST compiler. The advantage is that the replacement takes place in one step, without any additional levels or symbols. For instance, the left-to-right longest-match replacement UPPER @→ LOWER is composed of the following relations (Karttunen, 1996):

(6.3) Input string
      .o. Initial match
      .o. Left-to-right constraint
      .o. Longest-match constraint
      .o. Replacement

With these operators, transducers that mark (or filter) patterns in text can be constructed easily. For instance, strings can be inserted before and after a string that matches a defined regular expression. For this purpose a special insertion symbol "..." is used on the right-hand side to represent the string that is found matching the left-hand side: UPPER @→ PREFIX ... SUFFIX. Following an example from Karttunen (1996), a noun phrase that consists of an optional determiner (d), any number of adjectives a* and one or more nouns n+ can be marked using the regular expression in (RE6.4), mapping dannvaan into [dann]v[aan] as shown in (6.4).
Thus, the expression compiles to a transducer that inserts brackets around maximal instances of the noun phrase pattern.

(RE6.4) (d) a* n+ @→ %[ ... %]

(6.4) dannvaan → [dann]v[aan]

The replacement can be further constrained by a specific context, both on the left and on the right of a particular pattern: UPPER @→ LOWER || LEFT _ RIGHT (see Karttunen, 1995, for further variations). Furthermore, the replacement can be parallel, meaning that multiple replacements are performed at the same time (see Kempe and Karttunen, 1996). For instance, the regular expression in (RE6.5) denotes a constrained parallel replacement, where the symbol a is replaced by the symbol b and at the same time the symbol b is replaced by c. Both replacements occur at the same time, and only if the symbols are preceded by the symbol x and followed by the symbol y. Applying this automaton to the string xaxayby then yields the string xaxbyby, and applying it to the string xbybyxa yields xcybyxa, as presented in (6.5).

(RE6.5) a → b , b → c || x _ y

(6.5) xaxayby → xaxbyby
      xbybyxa → xcybyxa

6.2.4 Finite State Parsing

Introduction

New approaches to parsing with the finite state formalism show that the calculus can be used to represent complex linguistic phenomena accurately, and that large-scale lexical grammars can be represented in a compact way (Roche, 1997). There are various techniques for creating careful representations while increasing efficiency. For instance, parts of rules that are similar are represented only once, reducing the whole set of rules; for each state only one unique outgoing transition is allowed (determinization); and an automaton can be reduced to a minimal number of states (minimization). Moreover, one can create bi-directional machines, where the same automaton can be used for both parsing and generation. Applications of finite state parsing are found mostly in the fields of terminology extraction, lexicography and information retrieval for large-scale text.
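Marking transducers of the kind in (RE6.4) are the workhorse of this style of parsing. Their left-to-right, longest-match behaviour can be approximated with Python's greedy regular expressions, which also scan left to right and take the longest match at each position. This is a rough analogue for illustration, not the XFST implementation:

```python
import re

# Approximate the marking transducer (d) a* n+ @-> %[ ... %] with a
# greedy regular expression: re.sub scans left to right and, at each
# starting position, the greedy quantifiers prefer the longest match.
noun_phrase = re.compile(r"d?a*n+")

def mark(text):
    # Wrap each maximal noun phrase match in brackets, like the
    # insertion symbol "..." does in XFST.
    return noun_phrase.sub(lambda m: "[" + m.group(0) + "]", text)

assert mark("dannvaan") == "[dann]v[aan]"
```

The analogy is not exact in general (backtracking regexes and finite state longest-match replacement can differ on some patterns), but for simple chunking patterns like this one the two coincide.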
The methods are more "partial" in the sense that the goal is not the production of complete syntactic descriptions of sentences, but rather the recognition of various syntactic patterns in a text (e.g. noun phrases, verbal groups).

Parsing Methods

Many finite-state parsers adopt the chunking techniques of Abney (1991) and collect sets of pattern rules into ordered sequences of a finite number of levels, so-called cascades, where the result of one level is the input to the next level (e.g. Appelt et al., 1993; Abney, 1996; Chanod and Tapanainen, 1996; Grefenstette, 1996; Roche, 1997). The parsing procedure over a text tagged for parts of speech usually proceeds by marking the boundaries of adjacent patterns, such as noun or verbal groups; then the nominal and verbal heads within these groups are identified. Finally, patterns between non-adjacent heads are extracted, identifying syntactic relations between words within and across group boundaries. For this purpose, finite state transducers are used. The automata are applied both as finite state markers, which introduce extra symbols such as surrounding brackets into the input (as exemplified in the previous section), and as finite state filters, which extract and label patterns. Usually a combination of non-finite state methods and finite state procedures is applied, but the whole parser can be built as a finite state system (see further Karttunen et al., 1997a).

The first application of finite state transducers to parsing was a parser developed at the University of Pennsylvania between 1958 and 1959 (Joshi and Hopely, 1996).9 The parser is essentially a cascade of finite state transducers, and its parsing style resembles Abney's "chunking" parser (Abney, 1991). Syntactic patterns were constructed using subcategorization frames and local grammars, recognizing simple NPs, PPs, AdvPs, simple verb clusters and clauses.
All of the modules of the parser, including dictionary look-up and part-of-speech disambiguation, are finite state computations, except for the module for the recognition of clauses.

9 The original version of the parser is presented in Joshi (1961). Up-to-date information about the reconstructed version of this parser, Uniparse, can be accessed from: http://www.cis.upenn.edu/~phopely/tdap-fe-post.html.

Besides Abney's chunking approach (Abney, 1991, 1996), a constructive finite state parsing over collections of syntactic patterns and local grammars, others use this technique to locate noun phrases (or other basic phrases) in unrestricted text (e.g. Appelt et al., 1993; Schiller, 1996; Senellart, 1998). Further, Grefenstette (1996) uses this technique to mark syntactic functions such as subject and object. Other approaches to finite-state parsing start from a large number of alternative analyses and, through the application of constraints in the form of elimination or restriction rules, reduce the alternative parses (e.g. Voutilainen and Tapanainen, 1993; Koskenniemi et al., 1992). These techniques have also been used for the extraction of noun phrases or other basic phrases (e.g. Voutilainen, 1995; Chanod and Tapanainen, 1996; Voutilainen and Padró, 1997).

Salah Ait-Mokhtar and Jean-Pierre Chanod constructed a parser that combines the constructive and reductionist approaches. The system defines segments by constraints rather than patterns. They mark potential beginnings and ends of phrases and use replacement transducers to insert phrase boundaries. Incremental decisions are made throughout the whole parsing process, but at each step linguistic constraints may eliminate or correct some of the previously added information (Ait-Mokhtar and Chanod, 1997).

In the case of Swedish, finite state methods have been applied on a small scale to lexicography and information extraction.
A Swedish regular expression grammar was implemented early at Umeå University, parsing a limited set of sentences (Ejerhed and Church, 1983; Ejerhed, 1985). More recently, a cascaded finite state parser, Cass-Swe, was developed for the syntactic analysis of Swedish (Kokkinakis and Johansson Kokkinakis, 1999), based on Abney's parser. Here the regular expression patterns are applied in cascades ordered by complexity and length to recognize phrases. The output of one level in the sequence is used as input to the subsequent level, starting from tagging and syntactic labeling and proceeding to the recognition of grammatical functions. The grammar of Cass-Swe has been semi-automatically extracted from written text by the application of probabilistic methods, such as mutual information statistics, which allow the exclusion of incorrect part-of-speech n-grams (Magerman and Marcus, 1990), and by looking at which function words signal boundaries between phrases and clauses.

Discrimination of Input

One parsing application using finite state methods, presented by Karttunen et al. (1997a), aims at the extraction not only of valid expressions, but also of invalid patterns occurring in free text due to errors and misprints. The method is applied to date expressions, and the idea is simply to define two language sets: one that overgenerates and accepts all date expressions, including dates that do not exist, and one that defines only correct date expressions. The language of invalid dates is then obtained by subtracting the more specific language from the more general one. Thus, by distinguishing the valid date expressions within the language of all date expressions, we obtain the set of expressions corresponding to invalid dates, i.e. those dates not accepted by the language of valid expressions.

To illustrate, the definitions in Karttunen et al.
(1997a) cover date expressions from January 1, 1 to December 31, 9999 and are represented by a small finite state automaton (13 states, 96 arcs) that accepts date expressions consisting of a day of the week, or a month and a date with or without a year, or a combination of the two, as defined in (RE6.6a) (SP is a separator consisting of a comma and a space, i.e. ', '). The parser for that language, presented in (RE6.6b), is constrained by the left-to-right, longest-match replacement operator, which means that only the maximal instances of such expressions are accepted. However, this automaton also accepts dates that do not exist, such as "April 31", which exceeds the maximum number of days for that month. Other problems concern leap days and the relationship between the day of the week and the date. A new language is defined by intersecting constraints on invalid types of dates with the language of date expressions, as presented in (RE6.6c).10 This much larger automaton (1346 states, 21006 arcs) accepts only valid date expressions, and again a transducer marks the maximal instances of such dates, see (RE6.6d).

(RE6.6) a. DateExpression = Day | (Day SP) Month " " Date (SP Year)
        b. DateExpression @→ %[ ... %]
        c. ValidDate = DateExpression & MaxDaysInMonth & LeapDays & WeekDayDates
        d. ValidDate @→ %[ ... %]

As the authors point out, it may be of use to distinguish valid dates from invalid ones, but in practice we also need to recognize the invalid dates that occur in real text corpora due to errors and misprints. For this purpose we do not need to define a new language that reveals the structure of invalid dates. Instead, we make use of the already defined languages of all date expressions, DateExpression, and of valid dates, ValidDate, and obtain the language of invalid dates by subtracting one language set from the other: [DateExpression - ValidDate].

10 For more detail on the separate definitions of the constraints see Karttunen et al. (1997a).
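Restricted to a finite sample, the subtraction [DateExpression - ValidDate] can be tried out directly. In the sketch below, a finite set of "Month Day" strings stands in for the overgenerating date automaton, and the real month lengths stand in for the MaxDaysInMonth constraint; both are simplified stand-ins for the automata discussed in the text:

```python
import calendar

# A toy version of [DateExpression - ValidDate]: the broad "language" is
# a finite set of candidate "Month Day" strings (days 1-31 for every
# month), and validity is checked against the real month lengths.
# 2001 is a non-leap year, so February has 28 days here.
MONTH_DAYS = {name: calendar.monthrange(2001, i + 1)[1]
              for i, name in enumerate(calendar.month_name[1:])}

date_expressions = {f"{m} {d}" for m in MONTH_DAYS for d in range(1, 32)}
valid_dates = {f"{m} {d}" for m, n in MONTH_DAYS.items()
               for d in range(1, n + 1)}

# Subtracting the narrow language from the broad one leaves the errors.
invalid_dates = date_expressions - valid_dates

assert "April 31" in invalid_dates      # the example from the text
assert "April 30" not in invalid_dates
assert "February 30" in invalid_dates
```

The same subtraction, performed on finite state networks instead of Python sets, yields the compact automaton for invalid dates without ever describing their structure explicitly.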
A parser that identifies maximal instances of date expressions is presented in (RE6.7); it tags both the valid (VD) and the invalid (ID) dates.

(RE6.7) [ [DateExpression - ValidDate] @→ "[ID " ... %] ,
          ValidDate @→ "[VD " ... %] ]

In the example in (6.6) below, given by the authors, the parser identified two date expressions: first a valid one (VD) and then an invalid one (ID), differing from the valid one only in the weekday. Notice that the effect of the application of longest match is reflected when, for instance, the invalid date Tuesday, September 16, 1996 is selected over Tuesday, September 16, 19, which is a valid date.11

(6.6) The correct date for today is [VD Monday, September 16, 1996]. There is an error in the program. Today is not [ID Tuesday, September 16, 1996].

6.3 System Architecture

6.3.1 Introduction

After this short introduction to finite state automata, parsing methods with finite state techniques and a description of the XFST compiler, I will now proceed with a description of the implemented grammar error detector FiniteCheck. In this section an overview is given of the system's architecture and of how the system proceeds through the individual modules to identify errors in text. The types of automata used in the implementation are also described. The implementation methods and detailed descriptions of the individual modules are discussed in subsequent sections.

The framework of FiniteCheck is built as a cascade of finite state transducers compiled from regular expressions, including operators defined in the Xerox Finite-State Tool (XFST; see Section 6.2.3). Each automaton in the network composes with the result of the previous application. The implemented tool applies a strategy of simple dictionary lookup, incremental partial parsing with minimal disambiguation by parsing order and filtering, and error detection using the subtraction of 'positive' grammars that differ in their level of detail.
Accordingly, the current system of sequenced finite state transducers is divided into four main modules: the dictionary lookup, the grammar, the parser and the error finder (see Figure 6.1 below). The system runs under UNIX in a simple emacs environment, implemented by Robert Andersson, with an XFST mode that allows menus to be used to recompile files in the system. The modules are further described in the following subsection on the flow of data in the error detector. The types of automata used are discussed at the end of this section.

11 This date is, however, only valid in theory, since the Gregorian calendar was not yet in use in the year 19 AD. The Gregorian calendar, which replaced the Julian calendar, was introduced in Catholic countries by Pope Gregory XIII on Friday, October 15, 1582 (in Sweden in 1753).

Figure 6.1: The System Architecture of FiniteCheck

6.3.2 The System Flow

The Dictionary Lookup

The input text to FiniteCheck is first manually tokenized so that spaces occur between all strings and tokens, including punctuation. This formatted text is then tagged with part-of-speech and feature annotations by the lookup module, which assigns to each string in the text all the lexical tags stored in the lexicon of the system. No disambiguation is involved, only a simple lookup. The underlying lexicon of around 160,000 word forms is built as a finite state transducer. The tagset is based on the tag format defined in the Stockholm Umeå Corpus (Ejerhed et al., 1992), combining part-of-speech information with feature information (see Section 6.4 and Appendix C). As an example, the sentence in (6.7a) is ungrammatical, containing a (finite) auxiliary verb followed by yet another finite verb (see (4.32) on p.61). It will be annotated by the dictionary lookup as shown in (6.7b):

(6.7) a. *
Men kom ihåg att det inte ska blir någon riktig brand
But remember that it not will [pres] becomes [pres] some real fire
– But remember that there will not be a real fire.

b. Men[kn] kom[vb prt akt] ihåg[ab][pl] att[sn][ie] det[pn neu sin def sub/obj][dt neu sin def] inte[ab] ska[vb prs akt] blir[vb prs akt] någon[dt utr sin ind][pn utr sin ind sub/obj] riktig[jj pos utr sin ind nom] brand[nn utr sin ind nom]

The Grammar

The grammar module includes two grammars with (positive) rules reflecting the grammatical structure of Swedish, differing in their level of detail. The broad grammar (Section 6.5) is especially designed to handle text with ungrammaticalities; its linguistic descriptions are less accurate, accepting both valid and invalid patterns. The narrow grammar (Section 6.7) is more refined and accepts only grammatical segments. For example, the regular expression in (RE6.8) belongs to the broad grammar and recognizes potential verb clusters (VC), both grammatical and ungrammatical, as a pattern consisting of a sequence of two or three verbs in combination with (zero or more) adverbs (Adv*).

(RE6.8) define VC [Verb Adv* Verb (Verb)];

This automaton accepts all the verb cluster examples in (6.8), including the ungrammatical instance (6.8c), extracted from the text in (6.7), where a finite verb
ska ∗ blir will be [pres] Corresponding rules in the narrow grammar, represented by the regular expressions in (RE6.9), take into account the internal structure of a verb cluster and define the grammar of modal auxiliary verbs (Mod) followed by (zero or more) adverb(s) (Adv∗), and either a verb in infinitive form (VerbInf) as in (RE6.9a), or a temporal verb in infinitive (PerfInf) and a verb in supine form (VerbSup), as in (RE6.9b). These rules thus accept only the grammatical segments in (6.8) and will not include example (6.8c). The actual grammar of grammatical verb clusters is a little bit more complex (see Section 6.7). (RE6.9) a. define VC1 b. define VC2 [Mod Adv* VerbInf]; [Mod Adv* PerfInf VerbSup]; The Parser The system proceeds and the tagged text in (6.7b) is now the input to the next phase, where various kinds of constituents are selected applying a lexical-prefixfirst strategy, i.e. parsing first from the left margin of a phrase to the head and then extending the phrase by adding on complements. The phrase rules are ordered in levels. The system proceeds in three steps by first recognizing the head phrases in a certain order (verbal head vpHead, prepositional head ppHead, adjective phrase ap) and then selecting and extending the phrases with complements in a certain order (noun phrase np, prepositional phrase pp, verb phrase vp). The heuristics of parsing order gives better flexibility to the system in that (some) false parses can be blocked. This approach is further explained in the section on parsing (Section 6.6). The system then yields the output in (6.9). 12 Simple ‘<’ and ‘>’ around a phrase-tag denote the beginning of a phrase and the same signs together with a slash ‘/’ indicate the end. 12 For better readability, the lexical tags are kept only in the erroneous segment and removed manually in the rest of the exemplified sentence. Chapter 6. 
(6.9) Men <vp> <vpHead> kom ihåg </vpHead> </vp> att <np> det </np> <vp> <vpHead> inte <vc> ska[vb prs akt] blir[vb prs akt] </vc> </vpHead> <np> någon <ap> riktig </ap> brand </np> </vp>

We apply the rules defined in the broad grammar set for this parsing purpose, like the one in (RE6.8) that identified the verb cluster in boldface in (6.9) above as a sequence of two verbs. The parsing output may be refined and/or revised by application of filtering transducers. Earlier parsing decisions depending on lexical ambiguity are resolved, and phrases are extended, e.g. with postnominal modifiers (see further in Section 6.6). Other structural ambiguities, such as verb coordinations or clausal modifiers on nouns, are also taken care of (see Section 6.7).

The Error Finder

Finally the error finder module is used to discriminate the grammatical patterns from the ungrammatical ones, by subtracting the narrow grammar from the broad grammar. These new transducers are used to mark the ungrammatical segments in a text. For example, the regular expression in (RE6.10a) identifies verb clusters that violate the narrow grammar of modal verb clusters (VC1 or VC2 in (RE6.9)) by subtracting (‘-’) these rules from the more general (overgenerating) rule in the broad grammar (VC in (RE6.8)) within the boundaries of a verb cluster (<vc>, </vc>), previously marked in the parsing stage in (6.9). That is, the output of the parsing stage in (6.9) is the input to this level. By application of the marking transducer in (RE6.10b), the erroneous verb cluster consisting of two verbs in present tense in a row is annotated directly in the text as shown in (6.10).

(RE6.10) a. define VCerror [ "<vc>" [VC - [VC1 | VC2]] "</vc>" ];
b. define markVCerror [ VCerror -> "<Error Verb after Vaux>" ...
"</Error>"];

(6.10) Men <vp> <vpHead> kom ihåg </vpHead> </vp> att <np> det </np> <vp> <vpHead> inte <Error Verb after Vaux> <vc> ska[vb prs akt] blir[vb prs akt] </vc> </Error> </vpHead> <np> någon <ap> riktig </ap> brand </np> </vp>

6.3.3 Types of Automata

In accordance with the techniques of finite-state parsing (see Section 6.2.4), there are in general two types of transducers in use: one that annotates text in order to select certain segments and one that redefines or refines earlier decisions. Annotations are handled by transducers called finite state markers that add reserved symbols into the text and mark out syntactic constituents, grammar errors, or other relevant patterns. For instance, the regular expression in (RE6.11) inserts noun phrase tags in text by application of the left-to-right-longest-match replacement operator (‘@→’) (see Section 6.2.3).

(RE6.11) define markNP [NP @-> "<np>" ... "</np>"];

The automaton finds the pattern that matches the maximal instance of a noun phrase (NP) and replaces it with a beginning marker (<np>), copies the whole pattern by application of the insertion operator (‘...’) and then assigns the end-marker (</np>). Three (maximal) instances of noun phrase segments are recognized in the example sentence (6.11a), discussed earlier in Chapter 4 (see (4.2) on p.46), as shown in (6.11b), where one violates definiteness agreement (in boldface).13

(6.11) a. ∗En gång blev den hemska pyroman utkastad ur stan.
one time was the [def] awful [def] pyromaniac [indef] thrown-out from the-city
– Once the awful pyromaniac was thrown out of the city.

b. <np> En gång </np> blev <np> den[dt utr sin def][pn utr sin def sub/obj] hemska[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] pyroman[nn utr sin ind nom] </np> utkastad ur <np> stan </np> .

The regular expression in (RE6.12) represents another example of an annotating automaton.
(RE6.12) define markNPDefError [ npDefError -> "<Error definiteness>" ... "</Error>"];

This finite state transducer marks out agreement violations of definiteness in noun phrases (npDefError; see Section 6.8). It detects for instance the erroneous noun phrase den hemska pyroman in the example sentence, where the determiner den ‘the’ is in definite form and the noun pyroman ‘pyromaniac’ is in indefinite form (6.12). By application of the left-to-right replacement operator (‘→’) the identified segment is replaced by first inserting an error-diagnosis-marker (<Error definiteness>) as the beginning of the identified pattern, then the pattern is copied and the error-end-marker (</Error>) is added.

13 Only the erroneous segment is marked by lexical tags.

(6.12) <np> En gång </np> blev <Error definiteness> <np> den[dt utr sin def][pn utr sin def sub/obj] hemska[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] pyroman[nn utr sin ind nom] </np> </Error> utkastad ur <np> stan </np> .

The marking transducers of the system have the form A @→ S ... E, marking the maximal instances of A from left to right by application of the left-to-right-longest-match replacement operator (‘@→’) and inserting a start-symbol S (e.g. <np>) and an end-symbol E (e.g. </np>). In cases where the maximal instances are already recognized and only the operation of replacement is necessary, the transducers use the form A → S ... E, applying only the left-to-right replacement operator (‘→’).

The other types of transducers are used for refinement and/or revision of earlier decisions. These finite state filters can for instance be used to remove the noun phrase tags from the example sentence, leaving just the error marking. The regular expression in (RE6.13) replaces all occurrences of noun phrase tags with an empty string (‘0’) by application of the left-to-right replacement operator (‘→’). The result is shown in (6.13).
(RE6.13) define removeNP ["<np>" -> 0, "</np>" -> 0];

(6.13) En gång blev <Error definiteness> den[dt utr sin def][pn utr sin def sub/obj] hemska[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] pyroman[nn utr sin ind nom] </Error> utkastad ur stan.

These filtering transducers have the form A → B and are used for simple replacement of instances of A by B by application of the left-to-right replacement operator (‘→’). In cases where the context plays a crucial role, the automata are extended by requirements on the left and/or the right context and have the form A → B || L _ R. Here, the patterns in A are replaced by B only if A is preceded by the left context L and followed by the right context R. In some cases only the left context is constrained, in others only the right, and in some cases both are needed.

6.4 The Lexicon

6.4.1 Composition of the Lexicon

The lexicon of the system is a full form lexicon based on two resources, Lexin (Skolverket, 1992), developed at the Department of Swedish Language, Section of Lexicology, Göteborg University, and a corpus-based lexicon from the SveLex project under the direction of Daniel Ridings, LexiLogik AB. At the initial stage of lexicon composition, only the Lexin dictionary of 58,326 word forms was available to us, and we chose it especially for the lexical information stored in it, in particular the information on valence. I converted the Lexin text records to one single regular expression by a two-step process using the programming language gawk (Robbins, 1996). From the Lexin records (exemplified in (6.14a) and (6.14b)) a new file was created with lemmas separated by rows as in (6.14c). The first line here represents the Lexin entry for the noun bil ‘car’ in (6.14a) and the second the verb bilar ‘travels by car [pres]’ in (6.14b).
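That record-to-row step, performed with gawk in the thesis, can be sketched in Python as follows. The field numbers (#01 lemma, #02 part-of-speech, #12 declined forms) follow the text; the abbreviated record strings are illustrative, not the actual Lexin file format.

```python
# Sketch of the record-to-row conversion (done with gawk in the thesis).
# Assumes one "#NN value" field per line; only #01, #02 and #12 are kept.

def record_to_row(record: str) -> str:
    fields = {}
    for line in record.strip().splitlines():
        key, _, value = line.partition(" ")
        fields.setdefault(key, []).append(value)
    pos = fields["#02"][0]                    # part-of-speech
    lemma = fields["#01"][0]                  # lemma
    forms = " ".join(fields.get("#12", []))   # declined forms
    return f"{pos} {lemma} {forms}".strip()

noun_record = "#01 bil\n#02 subst\n#12 bilen bilar"
verb_record = "#01 bilar\n#02 verb\n#12 bilade bilat bila"
print(record_to_row(noun_record))   # subst bil bilen bilar
print(record_to_row(verb_record))   # verb bilar bilade bilat bila
```

The output rows correspond to the two lines of (6.14c).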
Only a word’s part-of-speech (entry #02), lemma (entry #01), and declined forms (entry #12) are listed in the current implementation.14 The number and type of forms vary according to the part-of-speech, and sometimes even within a part-of-speech.

14 Future work will further extend the other kinds of information stored in the lexicon, such as valence and compounding.

(6.14) a. #01 bil
#02 subst
#04 ett slags motordrivet fordon
#07 åka bil
#09 bild 17:34, 18:36-37
#11 bil~trafik -en
#11 personbil
#11 bil~buren
#11 bil~fri
#11 bil~sjuk
#11 bil~sjuka
#11 bil~telefon
#11 lastbil
#12 bilen bilar
#14 bi:l

b. #01 bilar
#02 verb
#04 åka bil
#10 A & (+ RIKTNING)
#12 bilade bilat bila(!)
#14 2bI:lar

c. subst bil bilen bilar
verb bilar bilade bilat bila

In the next step I converted the data in (6.14c) directly to a single regular expression as shown in (RE6.14). Each word entry in the lexicon was represented as a single finite state transducer with the string on the LOWER side and the category and feature on the UPPER side, allowing both analysis and generation. The whole dictionary is formed as the union of these automata. At this stage I used only simple tagsets that were later converted to the SUC format (see below). Using this automatic generation of lexical entries to a regular expression, alternative versions of the lexicon are easy to create, for example with different tagsets or including other information from Lexin (e.g. valence, compounds).

(RE6.14) [ A % - i n k o m s t 0:%[%+NSI%]
| A % - k a s s a 0:%[%+NSI%]
| A % - s k a t t 0:%[%+NSI%]
. . .
| b i l 0:%[%+NSI%]
| b i l e n 0:%[%+NSD%]
| b i l a r 0:%[%+NPI%]
| b i l a 0:%[%+VImp%]
| b i l a r 0:%[%+VPres%]
| b i l a d e 0:%[%+VPret%]
| b i l a t 0:%[%+VSup%]
. . .
| ö v ä r l d 0:%[%+NSI%]
| ö v ä r l d a r 0:%[%+NPI%]
| ö v ä r l d e n 0:%[%+NSD%]
];

The Lexin dictionary was later extended with the 100,000 most frequent word forms selected from the corpus-based SveLex.
At this stage the format of the lexicon was revised. The new lexicon of 158,326 word forms was compiled to a new transducer using instead the Xerox tool Finite-State Lexicon Compiler (LEXC) (Karttunen, 1993), which made the lexicon more compact and effective. This software facilitates in particular the development of natural-language lexicons. Instead of regular expression declarations, a high-level declarative language is used to specify the morphotactics of a language. I was not part of the composition of the new version of the lexicon. The procedures and achievements of this work are described further in Andersson et al. (1998, 1999).

6.4.2 The Tagset

In the present version of the lexicon, the set of tags follows the Stockholm Umeå Corpus project conventions (Ejerhed et al., 1992), including 23 category classes and 29 feature classes (see Appendix C). Four additional categories were added to this set for recognition of copula verbs (cop), modal verbs (mvb), verbs with infinitival complement (qmvb) and unknown words, which obtain the tag [nil]. This morphosyntactic information is used for identification of strings by both their category and/or feature(s). For reasons of efficiency, the whole tag with category and feature definitions is read by the system as a single symbol and not as a separate list of atoms. An experiment conducted by Robert Andersson showed that an automaton recognizing a grammatical noun phrase had 90% fewer states and 60% fewer transitions compared to declaring a tag as consisting of a category and a set of features (see further in Andersson et al., 1999). As a consequence of this choice, the automata representing the tagset are divided both in accordance with the category they state and the features, always rendering the whole tag. The automata are constructed as a union of all the tags of the same category or feature.
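Treating each full tag as one opaque symbol can be mimicked in Python with sets of tag strings — a sketch with a partial, illustrative tag selection, not the 55 actual definitions:

```python
# Sketch: full SUC-style tags as opaque single symbols; category and
# feature "automata" become sets of tag strings (partial selections).
TagVB  = {"[vb prt akt]", "[vb prs akt]", "[vb sup akt]", "[vb inf akt]"}
TagPRS = {"[vb prs akt]", "[vb prs sfo]",
          "[pc prs utr/neu sin/plu ind/def nom]"}
TagAKT = {"[vb prt akt]", "[vb prs akt]", "[vb sup akt]", "[vb inf akt]"}

# The same tag recurs in one definition per characteristic it expresses:
tag = "[vb prs akt]"
print(tag in TagVB, tag in TagPRS, tag in TagAKT)   # True True True

# Category-feature combinations then fall out as set intersection:
VerbPrsTags = TagVB & TagPRS
print(VerbPrsTags)
```

The intersection at the end corresponds to the [Verb & Prs] style of combination defined in Section 6.4.3.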
In practice this means that the same tag occurs in different tag definitions as many times as the number of defined characteristics. For instance, the tag defining an active verb in present tense [vb prs akt] occurs in three definitions, first in the union of all tags defining the verb category (TagVB in (RE6.15)), then among all tags for present tense (TagPRS in (RE6.16)) and then among all tags for active voice (TagAKT in (RE6.17)).

(RE6.15) define TagVB [ "[vb an]"
| "[vb sms]"
| "[vb prt akt]"
| "[vb prt sfo]"
| "[vb prs akt]"
| "[vb prs sfo]"
| "[vb sup akt]"
| "[vb sup sfo]"
| "[vb imp akt]"
| "[vb imp sfo]"
| "[vb inf akt]"
| "[vb inf sfo]"
| "[vb kon prt akt]"
| "[vb kon prt sfo]"
| "[vb kon prs akt]"
];

(RE6.16) define TagPRS [ "[pc prs utr/neu sin/plu ind/def gen]"
| "[pc prs utr/neu sin/plu ind/def nom]"
| "[vb prs akt]"
| "[vb prs sfo]"
| "[vb kon prs akt]"
];

(RE6.17) define TagAKT [ "[vb prt akt]"
| "[vb prs akt]"
| "[vb sup akt]"
| "[vb imp akt]"
| "[vb inf akt]"
| "[vb kon prt akt]"
| "[vb kon prs akt]"
];

On the other hand, the tag for an interjection ([in]), which consists only of the category, occurs just once in the definitions of tags:

(RE6.18) define TagIN [ "[in]" ];

There are in total 55 different lexical-tag definitions of categories and features. One single automaton (Tag) represents all the different categories and features, composed as the union of these 55 lexical tags. The largest feature category, singular (TagSIN), includes 80 different tags.

6.4.3 Categories and Features

In the parsing and error detection processes, strings need to be recognized by their category and/or feature inclusion. The morphosyntactic information in the tags is used for this purpose and automata identifying different categories and feature sets are defined. For instance, the regular expression in (RE6.19a) recognizes the tagged string kan[vb prs akt] ‘can’ as a verb, i.e.
a sequence of one or more (the iteration sign ‘+’) letters followed by a sequence of tags, one of which is a tag containing ‘vb’ (TagVB). Features are defined in the same manner. The same string can be recognized as a carrier of the feature of present tense. The regular expression in (RE6.19b) defines the automaton for present tense as a sequence of (one or more) letters followed by a sequence of tags, where one of them fulfills the feature of present tense ‘prs’ (TagPRS).

(RE6.19) a. define Verb Letter+ Tag* TagVB Tag*;
         b. define Prs  Letter+ Tag* TagPRS Tag*;

By using intersection (‘&’) of category and feature sets, there is also the possibility of recognizing category-feature combinations. The same string can then be recognized directly as a verb in present tense by the regular expression VerbPrs given in (RE6.20), which lists all the verb tense features.

(RE6.20) define VerbImp [Verb & Imp];
define VerbPrs [Verb & Prs];
define VerbPrt [Verb & Prt];
define VerbSup [Verb & Sup];
define VerbInf [Verb & Inf];

Even higher level sets can be built. For instance, a category of tensed (finite) and untensed (non-finite) verbs may be defined as in (RE6.21), including the union of appropriate verb form definitions from the verb tense feature set in (RE6.20) above. Our example string, as a verb in present tense form, then falls among the finite verb forms (VerbTensed).

(RE6.21) define VerbTensed   [VerbPrs | VerbPrt];
define VerbUntensed [VerbSup | VerbInf];

6.5 Broad Grammar

The rules of the broad grammar are used to mark potential phrases in a text, both grammatical and ungrammatical. The grammar consists of valid (grammatical) rules that define the syntactic relations of constituents mostly in terms of categories and list the order of them. There are no constraints on the selections other than the types of part-of-speech that combine with each other to form phrases.
The grammar is in other words underspecified and does not distinguish between grammatical and ungrammatical patterns. The parsing is incremental, i.e. identifying first heads and then complements. This is also reflected in the broad grammar listed in (RE6.22), which includes rules divided into heads and complements. The whole broad grammar consists of six rules, including the head rules of adjective phrase (AP), verbal head (VPHead) and prepositional head (PPHead) and then rules for noun phrase (NP), prepositional phrase (PP) and verb phrase (VP).

(RE6.22) # Head rules
define AP     [(Adv) Adj+];
define PPhead [Prep];
define VPhead [[[Adv* Verb] | [Verb Adv*]] Verb* (PNDef & PNNeu)];

# Complement rules
define NP [[[(Det | Det2 | NGen) (Num) (APPhr) (Noun)] & ?+] | Pron];
define PP [PPheadPhr NPPhr];
define VP [VPheadPhr (NPPhr) (NPPhr) (NPPhr) PPPhr*];

An adjective phrase (AP) consists of an (optional) adverb and a sequence of (one or more) adjectives. This means that an adjective phrase consists of at least one adjective. The head of a prepositional phrase (PPHead) is a preposition. A verbal head (VPHead) includes a verb preceded or followed by (zero or more) adverbs, possibly followed by (zero or more) verbs and an optional pronoun. This means that a verbal head consists at least of a single verb, which in turn may be preceded or followed by adverb(s) and followed by verb(s). In order to prevent pronouns being analyzed as determiners in noun phrases, e.g. jag anser det bra ‘I think it is good’, single neuter definite pronouns are included in the verbal head.

The regular expression describing a noun phrase (NP) consists of two parts. The first states that a noun phrase includes a determiner (Det) or a determiner with the adverbial här ‘here’ or där ‘there’ (Det2) or a possessive noun (NGen), followed by a numeral (Num), an adjective phrase (APPhr) and a (proper) noun (Noun).
Since not only the noun can form the head of the noun phrase, all the constituents are optional. The intersection with ‘any-symbol’ (‘?’) followed by the iteration sign (‘+’) is needed to state that at least one of the listed constituents has to occur. The second part of the noun phrase rule states that a noun phrase may consist of a single pronoun (Pron). A prepositional phrase (PP) is recognized as a sequence of a prepositional head (PPheadPhr) followed by a noun phrase (NPPhr). A verb phrase consists of a verbal head (VPheadPhr) followed by at most three (optional) noun phrases and (zero or more) prepositional phrases.

6.6 Parsing

6.6.1 Parsing Procedure

The rules of the (underspecified) broad grammar are used to mark syntactic patterns in a text. A partial, lexical-prefix-first, longest-match, incremental strategy is used for parsing. The parsing procedure is partial in the sense that only portions of text are recognized and no full parse is provided for. Patterns not recognized by the rules of the (broad) grammar remain unchanged. The maximal instances of a particular phrase are selected by application of the left-to-right-longest-match replacement operator (‘@→’) (see Section 6.2.3). In (RE6.23) we see all the marking transducers recognizing the syntactic patterns defined in the broad grammar. The automata replace the corresponding phrase (e.g. noun phrase, NP) with a label indicating the beginning of such a pattern (<np>), the phrase itself and a label that marks the end of that pattern (</np>).

(RE6.23) define markPPhead [PPhead @-> "<ppHead>" ... "</ppHead>"];
define markVPhead [VPhead @-> "<vpHead>" ... "</vpHead>"];
define markAP     [AP @-> "<ap>" ... "</ap>"];
define markNP     [NP @-> "<np>" ... "</np>"];
define markPP     [PP @-> "<pp>" ... "</pp>"];
define markVP     [VP @-> "<vp>" ... "</vp>"];

The segments are built on in cascades in the sense that first the heads are recognized, starting from the left-most edge to the head (the so-called lexical prefix), and then the segments are expanded in the next level by addition of complement constituents. The regular expressions in (RE6.24) compose the marking transducers of separate segments into a three-step process.

(RE6.24) define parse1 [markVPhead .o. markPPhead .o. markAP];
define parse2 [markNP];
define parse3 [markPP .o. markVP];

First the verbal heads, prepositional heads and adjective phrases are recognized by composition in that order (parse1). The corresponding marking transducers presented in (RE6.23) insert syntactic tags around the found phrases as in (6.15a).15 This output serves then as input to the next level, where the adjective phrases are extended and noun phrases are recognized (parse2) and marked as exemplified in (6.15b). This output in turn serves as input to the last level, where the whole prepositional phrases and verb phrases are recognized in that order (parse3) and marked as in (6.15c).

15 The original sentence example is presented in (6.11) on p.189.
The ‘broadness’ of the grammar and the lexical ambiguity in words, necessary for parsing text containing errors, also yields ambiguous and/or alternative phrase annotations. We block some of the (erroneous) alternative parses by the order in which phrase segments are selected, which causes bleeding of some rules and more ‘correct’ parsing results are achieved. The order in which the labels are inserted into the string influences the segmentation of patterns into phrases (see Section 6.6.2). Further ambiguity resolution is provided for by filtering automata (see Section 6.6.3). 6.6.2 The Heuristics of Parsing Order The order in which phrases are labeled supports ambiguity resolution in the parse to some degree. The choice of marking verbal heads before noun phrases prevents merging constituents of verbal heads into noun phrases which would yield noun phrases with too wide a range. For instance, marking first the sentence in (6.16a) for noun phrases ((6.16b) ∗ NP:)16 would interpret the pronoun De ‘they’ as a determiner and the verb såg ‘saw’, that is exactly as in English homonymous with the noun ‘saw’, as a noun and merges these two constituents to a noun phrase. The output would then be composed with the selection of the verbal head ((6.16b) ∗ NP .o. VPHead) that ends up within the boundaries of the noun phrase. Composing the marking transducers in the opposite order instead yields the more correct parse in (6.16c). Although the alternative of the verb being parsed as verbal head or a noun remains (<vpHead> <np> såg </np> </vpHead> ), the pronoun is now marked correctly as a separate noun phrase and not merged together with the main verb into a noun phrase. 16 Asterix ‘*’ indicates erroneous parse. FiniteCheck: A Grammar Error Detector 199 (6.16) a. De såg ledsna ut They looked sad out – They looked sad. b. ∗ NP: <np> De såg </np> <np> ledsna </np> ut . ∗ NP .o. VPHead: <np> De <vpHead> såg </vpHead> </np> <np> ledsna </np> ut . c. 
VPHead: De <vpHead> såg </vpHead> ledsna ut .
VPHead .o. NP: <np> De </np> <vpHead> <np> såg </np> </vpHead> <np> ledsna </np> ut .

This ordering strategy is not absolute however, since the opposite scenario is possible, where parsing noun phrases before verbal heads is more suitable. Consider for instance example (6.17a) below, where the string öppna ‘open’ in the noun phrase det öppna fönstret ‘the open window’ will be split into three separate noun phrase segments when applying the order of parsing verbal heads before noun phrases (6.17c), due to the homonymy between an adjective and an infinitive or imperative verb form. The opposite scenario of parsing noun phrases before verbal heads yields a more correct parse (6.17b), where the whole noun phrase is recognized as one segment.

(6.17) a. han tittade genom det öppna fönstret
he looked through the open window
– he looked through the open window

b. NP: <np> han </np> tittade genom <np> det öppna fönstret </np>
NP .o. VPHead: <np> han </np> <vpHead> tittade </vpHead> genom <np> det <vpHead> öppna </vpHead> fönstret </np>

c. ∗VPHead: han <vpHead> tittade </vpHead> genom det <vpHead> öppna </vpHead> fönstret
∗VPHead .o. NP: <np> han </np> <vpHead> tittade </vpHead> genom <np> det </np> <vpHead> <np> öppna </np> </vpHead> <np> fönstret </np>

We analyzed the ambiguity frequency in the Child Data corpus and found that occurrences of nouns recognized as verbs are more frequent than the opposite. On this ground, we chose the strategy of marking verbal heads before marking noun phrases. In the case of the opposite scenario, the false parsing can be revised and corrected by an additional filter (see Section 6.6.3). A similar problem occurs with homonymous prepositions and adjectives. For instance, the string vid is ambiguous between an adjective (‘wide’) and a preposition (‘by’) and influences the order of marking prepositional heads and noun phrases.
Parsing prepositional heads before noun phrases is more suitable for preposition occurrences, as shown in (6.18c), in order to prevent the preposition from being merged as part of a noun phrase, as in (6.18b).

(6.18) a. Jag satte mig vid bordet
I sat me by the-table
– I sat down at the table.

b. ∗NP: <np> Jag </np> satte <np> mig </np> <np> vid bordet </np> .
∗NP .o. PP: <np> Jag </np> satte <np> mig </np> <np> <ppHead> vid </ppHead> bordet </np> .

c. PP: Jag satte mig <ppHead> vid </ppHead> bordet .
PP .o. NP: <np> Jag </np> satte <np> mig </np> <ppHead> <np> vid </np> </ppHead> <np> bordet </np> .

The opposite order is more suitable for adjective occurrences, as in (6.19), where the adjective is joined together with the head noun when selecting noun phrases first, as in (6.19b). But when recognizing the adjective as a prepositional head, that noun phrase is split into two noun phrases, as in (6.19c). Again, the choice of marking prepositional heads before noun phrases was based on the result of frequency analysis in the corpus, i.e. the string vid occurred more often as a preposition than as an adjective.

(6.19) a. Hon hade vid kjol på sig.
She had wide skirt on herself.
– She was wearing a wide skirt.

b. NP: <np> Hon </np> hade <np> vid kjol </np> på <np> sig </np> .
NP .o. PP: <np> Hon </np> hade <np> <ppHead> vid </ppHead> kjol </np> på <np> sig </np> .

c. ∗PP: Hon hade <ppHead> vid </ppHead> kjol på sig .
∗PP .o. NP: <np> Hon </np> hade <ppHead> <np> vid </np> </ppHead> <np> kjol </np> på <np> sig </np> .

6.6.3 Further Ambiguity Resolution

As discussed above, the parsing order does not give the correct result in every context. Nouns, adjectives and pronouns are homonymous with verbs and might then be interpreted by the parser as verbal heads, or adjectives homonymous with prepositions can be analyzed as prepositional heads.
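The order-sensitivity behind these effects can be reproduced in miniature with two greedy regex passes over a toy tagged string — a Python sketch under a simplified word[tag] token format, not the system's actual transducers:

```python
import re

# Toy tokens: a word followed by its (possibly ambiguous) tag readings.
text = "De[pn][dt] såg[vb][nn] ledsna[jj] ut[ab]"

def mark_vphead(s):
    # any token with a verb reading becomes a verbal head
    return re.sub(r"\S*\[vb\]\S*",
                  lambda m: f"<vpHead> {m.group(0)} </vpHead>", s)

def mark_np(s):
    # greedy: a determiner reading directly followed by a noun reading
    # merges into one noun phrase
    return re.sub(r"(\S*\[dt\]\S*) (\S*\[nn\]\S*)", r"<np> \1 \2 </np>", s)

np_first = mark_vphead(mark_np(text))   # NP swallows the verb (cf. 6.16b)
vp_first = mark_np(mark_vphead(text))   # verb marked first (cf. 6.16c)
print(np_first)
print(vp_first)
```

Running the NP pass first reproduces the erroneous merge of (6.16b); marking the verbal head first blocks it, as in (6.16c).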
These parsing decisions can be redefined at a later stage by application of filtering transducers (see Section 6.3.3). As exemplified in (6.17) above, the consequence of parsing verbal heads before noun phrases may yield noun phrases that are split into parts, due to the fact that adjectives are interpreted as verbs. The filtering transducer in (RE6.25) adjusts such segments and removes the erroneous (inner) syntactic tags (i.e. replaces them with the empty string ‘0’) so that only the outer noun phrase markers remain, converting the split phrase in (6.20a) to one noun phrase, yielding (6.20b). The regular expression consists of two replacement rules that apply in parallel. They are constrained by the surrounding context of a preceding determiner (Det) and a subsequent adjective phrase (APPhr) and a noun phrase (NPPhr) in the first rule, and a preceding determiner and an adjective phrase in the second rule.

(6.20) a. <np> han </np> <vpHead> tittade </vpHead> genom <np> det </np> <vpHead> <np> öppna </np> </vpHead> <np> fönstret </np>
b. <np> han </np> <vpHead> tittade </vpHead> genom <np> det öppna fönstret </np>

(RE6.25) define adjustNPAdj [
"</np><vpHead><np>" -> 0 || Det _ APPhr "</np></vpHead>" NPPhr ,,
"</np></vpHead><np>" -> 0 || Det "</np><vpHead><np>" APPhr _ ];

Noun phrases with a possessive noun as the modifier are split when the head noun is homonymous with a verb, as in (6.21).17 The parse is then adjusted by a filter that simply extracts the noun from the verbal head and moves the borders of the noun phrase, yielding (6.21c).

(6.21) a. barnens far hade dött
children’s father had died
– the father of the children had died

b. <np> barnens </np> <vpHead> <np> far </np> </vpHead> hade dött
c. <np> barnens far </np> <vpHead> hade dött </vpHead>

The filtering automaton in (RE6.26) inserts a start-marker for verbal head (i.e.
replaces the empty string ‘0’ with the syntactic tag vpHead) right after the end of the actual noun phrase and removes the redundant syntactic tags in the second replacement rule. The replacement procedure is (again) simultaneous, by application of parallel replacement.

(RE6.26) define adjustNPGen [
0 -> "<vpHead>" || NGen "</np><vpHead>" NPPhr _ ,,
"</np><vpHead><np>" -> 0 || NGen _ ~$"<np>" "</np>"];

Another ambiguity problem occurs with the interrogative pronoun var ‘where’, which in Swedish is ambiguous with the copula verb var ‘were’ or ‘was’. Since verbal heads are annotated first in the system, identifying segments of maximal length, the homonymous pronoun is recognized as a verb and combined with the subsequent verb, as in (6.22) and (6.23).

(6.22) a. Var var den där överraskningen.
where was the there surprise
– Where was that surprise?

b. <vp> <vpHead> <vc> <np> Var var </np> </vc> </vpHead> <np> den där överraskningen </np> </vp> ?

(6.23) a. Var såg du hästen Madde frågar jag.
where saw you the-horse Madde ask I
– Where did you see the horse, Madde? I asked.

b. <vp> <vpHead> <vc> <np> Var såg </np> </vc> </vpHead> <np> du </np> <np> hästen </np> </vp> Madde <vp> <vpHead> frågar <np> jag </np> </vpHead> </vp> .

17 Here the string far is ambiguous between the noun reading ‘father’ and the present tense verb form ‘goes’.

A similar problem occurs with adjectives or participles homonymous with verbs, as in (6.24), where the adjective rädda ‘scared [pl]’ is identical to the infinitive or imperative form of the verb ‘rescue’ and is joined with the preceding copula verb to form a verb cluster.

(6.24) a. Alla blev rädda ...
all became afraid
– All became afraid ...

b. <np> Alla </np> <vp> <vpHead> <vc> blev <np> <ap> rädda </ap> </np> </vc> </vpHead> </vp> ...
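Filters such as (RE6.25) are instances of the contextual replacement scheme A → B || L _ R introduced in Section 6.3.3. In Python the left context can be approximated with a fixed-width regex lookbehind — a sketch over simplified word[tag] tokens, not the xfst rules themselves:

```python
import re

# Sketch: contextual replacement A -> B || L _ R rendered with regex
# lookbehind for the left context L.  The rewrite re-joins the split
# noun phrase of example (6.20); tags are simplified illustrations.

s = ("<np> det[dt] </np> <vpHead> <np> öppna[jj][vb] </np> </vpHead> "
     "<np> fönstret[nn] </np>")

# rule 1: drop the inner tags between a determiner and an adjective
out = re.sub(r"(?<=\[dt\]) </np> <vpHead> <np>", "", s)
# rule 2: drop the remaining inner tags before the head noun
out = re.sub(r"(?<=\[vb\]) </np> </vpHead> <np>", "", out)
print(out)   # <np> det[dt] öppna[jj][vb] fönstret[nn] </np>
```

Note that Python lookbehinds must be fixed-width, which is far weaker than the arbitrary regular-language contexts xfst allows; the sketch only illustrates the shape of the operation.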
All verbal heads recognized as sequences of verbs with a copula verb in the beginning are selected by the replacement transducer in (RE6.27) that changes the verb cluster label (<vc> ) to a new marking (<vcCopula> ). This selection provides no changes in the parsing result in that no markings are (re)moved. Its purpose rather is to prevent false error detection and mark such verb clusters as being different. For instance, applying this transducer on the example in (6.22,) will yield the output presented in (6.25). (RE6.27) define SelectVCCopula [ "<vc>" -> "<vcCopula>" || _ [CopVerb / NPTags] ˜$"<vc>" "</vc>"]; (6.25) <vp> <vpHead> <vcCopula> <np> Var var </np> </vc> </vpHead> <np> den där överraskningen </np> </vp> ? 6.6.4 Parsing Expansion and Adjustment The text is now annotated with syntactic tags and some of the segments have to be further expanded with postnominal attributes and coordinations. In the current system, partitive prepositional phrases are the only postnominal attributes taken care of. The reason is that grammatical errors were found in these constructions. By application of the filtering transducer in (RE6.28) the example text in (6.26a) with the partitive noun phrase split into a noun phrase followed by a prepositional head that includes the partitive preposition av ‘of’ and yet another noun phrase from the parsing stage in (6.26b) is merged to form a single noun phrase as in (6.26c). This automaton removes the redundant inner syntactic markers by application of two replacement rules, constrained by the right or left context. The replacement occurs simultaneously by application of parallel replacement. (RE6.28) define adjustNPPart [ "</np><ppHead>" -> 0 || _ PPart "</ppHead><np>",, "</ppHead><np>" -> 0 || "</np><ppHead>" PPart _ ]; Chapter 6. 204 i en av (6.26) a. Mamma och Virginias mamma hade öppnat en tygaffär mum and Virginia’s mum had opened a fabric-store in one of Dom gamla husen. 
the old the-houses
– Mum and Virginia’s mum had opened a fabric-store in one of the old houses.
b. <np> Mamma </np> och <np> Virginias mamma </np> <vp> <vpHead> <vc> hade öppnat </vc> </vpHead> <np> en tygaffär </np> i <np> en </np> <ppHead> av </ppHead> <np> Dom <ap> gamla </ap> husen </np> .
c. <np> Mamma </np> och <np> Virginias mamma </np> <vp> <vpHead> <vc> hade öppnat </vc> </vpHead> <np> en tygaffär </np> i <NPPart> en av Dom <ap> gamla </ap> husen </np> .

Another type of phrase that needs to be expanded is the verbal group with a noun phrase in the middle, normally occurring when a sentence is initiated by a constituent other than the subject (i.e. with inverted word order; see Section 4.3.6), as in (6.27a). In the parsing phase the verbal group is split into two verbal heads, as in (6.27b), which should be joined into one as in (6.27c).

(6.27)
a. En dag tänkte Urban göra varma mackor
One day thought Urban do hot sandwiches
– One day Urban thought of making hot sandwiches.
b. <np> En dag </np> <vpHead> tänkte </vpHead> <np> Urban </np> <vpHead> göra </vpHead> <np> varma mackor </np> .
c. <np> En dag </np> <vpHead> tänkte <np> Urban </np> göra </vpHead> <np> varma mackor </np> .

The filtering automaton merging the parts of a verb cluster into a single segment is constrained so that two verbal heads are joined together only if there is a noun phrase in between them and the preceding verbal head includes an auxiliary verb or a verb that combines with an infinitive verb form (VBAux). The corresponding regular expression (RE6.29) removes the redundant verbal head markers in this constrained context. The replacement works in parallel, here removing both the redundant start-marker (<vpHead>) and the end-marker (</vpHead>) at the same time. There are two (alternative) replacement rules for every tag, since the noun phrase can either occur directly after the first verbal head, as in our example (6.27) above, or, as a pronoun, be part of the first verbal head.
Tags not relevant for this replacement (VCTags) are ignored (/).

(RE6.29)
define adjustVC [
  "</vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] _ NPPhr VPheadPhr ,,
  "</vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] NPPhr _ VPheadPhr ,,
  "<vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] "</vpHead>" NPPhr _ ˜$"<vpHead>" "</vpHead>" ,,
  "<vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] NPPhr "</vpHead>" _ ˜$"<vpHead>" "</vpHead>" ];

Other filtering transducers are used for refining the parsing result. Incomplete parsing decisions are eliminated at the end of parsing. For instance, incomplete prepositional phrases, i.e. a prepositional head without a following noun phrase, defined in the regular expression (RE6.30a), are removed. Also removed are empty verbal heads as in (RE6.30b) and other misplaced tags.

(RE6.30)
a. define errorPPhead [
     "<ppHead>" -> 0 || \["<pp>"] _ ,,
     "</ppHead>" -> 0 || _ \["<np>"]];
b. define errorVPHead [
     "<vp><vpHead></vpHead></vp>" -> 0];

6.7 Narrow Grammar

The narrow grammar is the grammar proper, whose purpose is to distinguish the grammatical segments from the ungrammatical ones. The automata of this grammar express the valid (grammatical) rules of Swedish, and constrain both the order of constituents and feature requirements. The current grammar is based on the Child Data corpus and includes rules for noun phrases and the verbal core.

6.7.1 Noun Phrase Grammar

Noun Phrases

The rules in the noun phrase grammar are divided, following Cooper’s approach (Cooper, 1984, 1986), according to what types of constituent they consist of and what feature conditions they have to fulfill (see Section 4.3.1). There are altogether ten noun phrase types implemented, listed in Table 6.3, including noun phrases with a (proper) noun, pronoun, determiner, adjective, numeral or partitive attribute as the head, reflecting the profile of the Child Data corpus.
Table 6.3: Noun Phrase Types

RULE SET  NOUN PHRASE TYPE           PATTERN               EXAMPLE
NP1       single noun                (Num) N               (två) grodor ‘(two) frogs’
                                     PNoun                 Kalle ‘Kalle’
NP2       determiner and noun        Det (DetAdv) (Num) N  de (här) (två) grodorna ‘the/these (two) frogs’
          poss. noun and noun        NGen (Num) N          flickans (två) grodor ‘girl’s (two) frogs’
NP3       determiner, adj. and noun  Det AP N              den lilla grodan ‘the little frog’
          poss. noun, adj. and noun  NGen AP N             flickans lilla groda ‘girl’s little frog’
NP4       adjective and noun         (Num) AP N            (två) små grodor ‘(two) little frogs’
NP5       single pronoun             PN                    han ‘he’
NP6       single determiner          Det                   den ‘that’
NP7       adjective                  Adj+                  obehörig ‘unauthorized’
NP8       determiner and adjective   Det Adj+              de gamla ‘the old’
NP9       numeral                    (Det) Num             den tredje, 8 ‘the third, 8’
NPPart    partitive                  Num PPart NP          två av husen ‘two of houses’
          partitive                  Det PPart NP          ett av de gamla husen ‘one of the old houses’

Every noun phrase type is divided into six subrules, expressing the different types of errors: two for definiteness (NPDef, NPInd), two for number (NPSg, NPPl) and two for gender agreement (NPUtr, NPNeu). 18 For instance, in (RE6.31) we have the set of rules representing noun phrases consisting of a single pronoun, which present the feature requirements on the pronoun as the only constituent, i.e. that a definite form of the pronoun is required (PNDef) in order to be considered a definite noun phrase (NPDef).

18 Utr denotes the common gender called utrum in Swedish.

(RE6.31)
define NPDef5 [PNDef];
define NPInd5 [PNInd];
define NPSg5  [PNSg];
define NPPl5  [PNPl];
define NPNeu5 [PNNeu];
define NPUtr5 [PNUtr];

The rule set NP2 presented in (RE6.32) is more complex and defines the grammar for definite, indefinite and mixed noun phrases (see Section 4.3.1) with a determiner (or a possessive noun) and a noun. For instance, the definite form of this noun phrase type (NPDef2) is defined as a sequence of a definite determiner (DetDef), an optional adverbial (DetAdv; e.g.
här ‘here’), an optional numeral (Num), and a definite noun; or as a sequence of a mixed determiner (DetMixed, i.e. one that takes an indefinite noun as complement; e.g. denna ‘this’) or a possessive noun (NGen), followed by an optional numeral and an indefinite noun.

(RE6.32)
define NPDef2 [DetDef (DetAdv) (Num) NDef] | [[DetMixed | NGen] (Num) NInd];
define NPInd2 [DetInd (Num) NInd];
define NPSg2  [[DetSg (DetAdv) | NGen] (NumO) NSg];
define NPPl2  [[DetPl (DetAdv) | NGen] (Num) NPl];
define NPNeu2 [[DetNeu (DetAdv) | NGen] (Num) NNeu];
define NPUtr2 [[DetUtr (DetAdv) | NGen] (Num) NUtr];

This particular automaton (NPDef2) accepts all the noun phrases in (6.28), except for the first one, which forms an indefinite noun phrase and will be handled by the subsequent automaton of indefinite noun phrases of this kind (NPInd2). It also accepts the ungrammatical noun phrase in (6.28c), since it only constrains the definiteness features. This erroneous noun phrase is then handled by the automaton representing singular noun phrases of this type (NPSg2), which states that only ordinal numbers (NumO) can be combined with singular determiners and nouns.

(6.28)
a. en (första) blomma
a [indef] (first) flower [indef]
b. den (här) (första) blomman
this [def] (here) (first) flower [def]
c. ∗ den (här) (två) blomman
this [def] (here) (two) flower [def]
d. denna (första) blomma
this [def] (first) flower [indef]
e. flickans (första) blomma
the girl’s [gen] (first) flower [indef]

The different noun phrase rules can be joined by union into larger sets, divided in accordance with what feature conditions they meet. For instance, the set of all definite noun phrases is defined as in (RE6.33a) and indefinite noun phrases as in (RE6.33b). All noun phrases that meet definiteness agreement are then represented by the regular expression in (RE6.33c), an automaton formed as the union of all definite and all indefinite noun phrase automata.

(RE6.33)
a.
### Definite NPs
define NPDef [NPDef1 | NPDef2 | NPDef3 | NPDef4 | NPDef5 | NPDef6 | NPDef7 | NPDef8 | NPDef9];

b. ### Indefinite NPs
define NPInd [NPInd1 | NPInd2 | NPInd3 | NPInd4 | NPInd5 | NPInd6 | NPInd7 | NPInd8 | NPInd9];

c. ###### NPs that meet definiteness agreement
define NPDefs [NPDef | NPInd];

Noun phrases with partitive attributes have a noun phrase as the head and are treated separately in the grammar. Although agreement occurs only between the quantifier and the noun phrase, and only in gender, the rules of definiteness and number state that the noun phrase has to be definite and plural, see (RE6.34).

(RE6.34)
define NPPartDef [[Det | Num] PPart NPDef];
define NPPartInd [[Det | Num] PPart NPDef];
define NPPartSg  [[DetSg | Num] PPart NPPl];
define NPPartPl  [[DetPl | Num] PPart NPPl];
define NPPartNeu [[DetNeu | Num] PPart NPNeu];
define NPPartUtr [[DetUtr | Num] PPart NPUtr];

Adjective Phrases

Adjective phrases occur as modifiers in two of the defined noun phrase types (NP3 and NP4) and form heads of their own in two others (NP7 and NP8). In the present implementation an adjective phrase consists of an optional adverb and a sequence of one or more adjectives, and it is likewise defined in accordance with the feature conditions that have to be fulfilled for definiteness, number and gender, as shown in (RE6.35). The gender feature set also includes an additional definition for masculine gender.
(RE6.35)
define APDef ["<ap>" (Adv) AdjDef+ "</ap>"];
define APInd ["<ap>" (Adv) AdjInd+ "</ap>"];
define APSg  ["<ap>" (Adv) AdjSg+  "</ap>"];
define APPl  ["<ap>" (Adv) AdjPl+  "</ap>"];
define APNeu ["<ap>" (Adv) AdjNeu+ "</ap>"];
define APUtr ["<ap>" (Adv) AdjUtr+ "</ap>"];
define APMas ["<ap>" (Adv) AdjMas+ "</ap>"];

One problem related to error detection concerns the ambiguity between weak and strong forms of adjectives, which coincide in the plural, while in the singular the weak form of adjectives is used only in definite singular noun phrases (see Section 4.3.1). Consequently, such adjectives obtain both singular and plural tags, and errors such as the one in (6.29) will be overlooked by the system. As we see in (6.29a), the adjective trasiga ‘broken’ is both singular (and definite) and plural (and indefinite), like the surrounding determiner and head noun, and the check for number and definiteness will succeed. Since the whole noun phrase is singular, the plural tag highlighted in bold face in (6.29b) is irrelevant and can be removed by the automaton defined in (RE6.36), allowing a definiteness error to be reported.

(6.29)
a. ∗ en trasiga speldosa
a [sg,indef] broken [sg,wk] or [pl] musical box [sg,indef]
b. <np> en[dt utr sin ind][pn utr sin ind sub/obj] <ap> trasiga[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] </ap> speldosa[nn utr sin ind nom] </np>

(RE6.36)
define removePluralTagsNPSg [
  TagPLU -> 0 || DetSg "<ap>" Adj _ ˜$"</np>" "</np>"];

Other Selections

In addition to these noun phrase rules, noun phrases with a determiner and a noun as the head that are followed by a relative subordinate clause are treated separately, for the reason that definiteness conditions are different in this context (see Section 4.3.1). As in (6.30), the head noun, which normally appears in definite form after a definite article, lacks the suffixed article and stands instead in indefinite form.
In the current system, these segments are selected as separate from other noun phrases by application of the filtering transducer in (RE6.37), which simply changes the opening noun phrase label (<np>) to the label <NPRel> in the context of a definite determiner with other constituents and the complementizer som ‘that’. The grammar is thereby prepared for an extension of the detection to these error types as well.

(6.30)
a. Jag tycker att det borde finnas en hjälpgrupp för de elever som har lite sociala problem.
I think that it should exist a help-group for the [pl,def] pupils [pl,indef] that have some social problems
– I think that there should be a help-group for the pupils that have some social problems.
b. <np> Jag </np> <vp> <vpHead> tycker </vpHead> </vp> att <np> det </np> <vp> <vpHead> <vc> borde finnas </vc> </vpHead> <np> en </np> </vp> hjälpgrupp <np> för </np> <NPRel> de elever </np> som <vp> <vpHead> har </vpHead> <np> <ap> sociala </ap> problem </np> </vp> .

(RE6.37)
define SelectNPRel ["<np>" -> "<NPRel>" || _ DetDef ˜$"<np>" "</np>" (" ") {som} Tag*];

6.7.2 Verb Grammar

The narrow grammar of verbs specifies the valid rules of finite and non-finite verbs (see Section 4.3.5). The rules consider the form of the main finite verb, verb clusters and verbs in infinitive phrases.

Finite Verb Forms

The finite verb form occurs in verbal heads either as a single main verb or as an auxiliary verb in a verb cluster. The grammar rule in (RE6.38) states that the first verb in the verbal head (possibly preceded by adverb(s)) has to be tensed. Any following verbs (or other constituents) in the verbal head are then ignored (the any-symbol ‘?∗’ indicates that).

(RE6.38)
define VPFinite [Adv* VerbTensed ?*];

Infinitive Verb Phrases

The rule defining the verb form in infinitive phrases concerns verbal heads preceded by an infinitive marker.
The marking transducer in (RE6.39a) selects these verbal heads and changes the label to infinitival verbal head (<vpHeadInf>). The grammar rule of the infinitive verbal core is defined in (RE6.39b), comprising just one verb in infinitive form (VerbInf), possibly preceded by (zero or more) adverbs and/or a modal verb also in infinitive form (ModInf).

(RE6.39)
a. define SelectInfVP [
     "<vpHead>" -> "<vpHeadInf>" || InfMark "<vp>" _ ];
b. define VPInf [Adv* (ModInf) VerbInf Adv* ?*];

Verb Clusters

The narrow grammar of verb clusters is more complex, including rules for both modal (Mod) and temporal auxiliary verbs (Perf) and for verbs combining with infinitive verbs (INFVerb), i.e. infinitive phrases without an infinitive marker (see Section 4.3.5). The grammar rules state the order of constituents and the form of the verbs following the auxiliary verb. The form of the auxiliary verb itself is defined in the VPFinite rule above (see (RE6.38)), i.e. the verb has to have finite form. The marking automaton (RE6.40b) selects as verb clusters all verbal heads that include more than one verb, via the VC-rule in (RE6.40a). The potential verb clusters have the form of a verb followed by (zero or more) adverbs, an (optional) noun phrase, (zero or more) adverbs and subsequently one or two verbs. Other syntactic tags (NPTags) are ignored (‘/’ is the ignore-operator).

(RE6.40)
a. define VC [ [[Verb Adv*] / NPTags] (NPPhr) [[Adv* Verb (Verb)] / NPTags] ];
b. define SelectVC [VC @-> "<vc>" ... "</vc>" ];

Five different rules describe the grammar of verb clusters. Three rules concern the modal verbs (VC1, VC2, VC3, presented in (RE6.41)) and two rules deal with temporal auxiliary verbs (VC4, VC5, presented in (RE6.42)). Verbs that take infinitival phrases (without the infinitival marker) (INFVerb) share two rules with the modal verbs (VC1, VC2). All the verb cluster rules have the form VBAux (NP) Adv* Verb (Verb), i.e.
an auxiliary verb followed by an optional noun phrase, (zero or more) adverb(s), a verb and an optional verb. By including the optional noun phrase, the grammar also handles inverted sentences. Again, irrelevant tags (NPTags) are ignored.

(RE6.41)
a. define VC1 [ [[Mod | INFVerb] / NPTags] (NPPhr) [[Adv* VerbInf] / NPTags] ];
b. define VC2 [ [Mod / NPTags] (NPPhr) [[Adv* ModInf VerbInf] / NPTags] ];
c. define VC3 [ [Mod / NPTags] (NPPhr) [[Adv* PerfInf VerbSup] / NPTags] ];

(RE6.42)
a. define VC4 [ [Perf / NPTags] (NPPhr) [[Adv* VerbSup] / NPTags] ];
b. define VC5 [ [Perf / NPTags] (NPPhr) [[Adv* ModSup VerbInf] / NPTags] ];

All five rules can be combined by union into one automaton that represents the grammar of all verb clusters, presented in (RE6.43).

(RE6.43)
define VCgram [VC1 | VC2 | VC3 | VC4 | VC5];

Other Selections

Coordinations of verbal heads in verb clusters or in infinitive verb phrases are selected as separate segments by the marking transducer in (RE6.44). The automaton replaces the verbal head marking with a new label that indicates coordination of verbs (<vpHeadCoord>), as exemplified in (6.31) and (6.32).

(RE6.44)
define SelectVPCoord ["<vpHead>" -> "<vpHeadCoord>" || ["<vpHeadInf>" | "</vc>"] ˜$"<vpHead>" ˜$"<vp>" [{eller} | {och}] Tag* (" ") "<vp>" _ ];

(6.31)
a. hon skulle springa ner och larma
she would run down and alarm
– She was about to run down and give the alarm.
b. <np> hon </np> <vp> <vpHead> <vc> skulle <np> springa </np> </vc> </vpHead> </vp> ner och <vp> <vpHeadCoord> larma </vpHead> </vp>

(6.32)
a. det är dags att gå och lägga sig.
it is time to go and lay oneself
– It is time to go to bed.
b. <np> det </np> <vp> <vpHead> är </vpHead> <np> dags </np> </vp> att <vp> <vpHead> gå </vpHead> </vp> och <vp> <vpHeadCoord> lägga <np> sig </np> </vpHead> </vp> .
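The five cluster rules and their union (RE6.41–RE6.43) can be emulated over simplified tag sequences. The following is an illustrative re-encoding in ordinary regular expressions, with single-letter stand-ins for the thesis’s symbols; it is not the actual XFST code and ignores the tag-interleaving handled by the ignore-operator:

```python
import re

# One letter per symbol: M=Mod, I=INFVerb, P=Perf, A=Adv, N=NPPhr,
# i=VerbInf, m=ModInf, p=PerfInf, s=VerbSup, q=ModSup.
SYM = {"Mod": "M", "INFVerb": "I", "Perf": "P", "Adv": "A", "NPPhr": "N",
       "VerbInf": "i", "ModInf": "m", "PerfInf": "p",
       "VerbSup": "s", "ModSup": "q"}

VC1 = "[MI]N?A*i"   # Mod|INFVerb (NP) Adv* VerbInf
VC2 = "MN?A*mi"     # Mod (NP) Adv* ModInf VerbInf
VC3 = "MN?A*ps"     # Mod (NP) Adv* PerfInf VerbSup
VC4 = "PN?A*s"      # Perf (NP) Adv* VerbSup
VC5 = "PN?A*qi"     # Perf (NP) Adv* ModSup VerbInf
VCGRAM = re.compile("(?:" + "|".join([VC1, VC2, VC3, VC4, VC5]) + r")\Z")

def valid_cluster(symbols):
    """True iff the symbol sequence is in the union VC1 | ... | VC5."""
    return bool(VCGRAM.match("".join(SYM[s] for s in symbols)))

print(valid_cluster(["Mod", "NPPhr", "VerbInf"]))  # skulle han ... springa -> True
print(valid_cluster(["Perf", "VerbInf"]))          # *hade springa -> False
```

The optional N in every rule is what lets the same five patterns cover inverted sentences, where the subject noun phrase sits inside the cluster.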
The infinitive marker att is in Swedish homonymous with the complementizer att ‘that’ or part of för att ‘because’, and is thus not necessarily followed by an infinitive, as in (6.33), (6.34) and (6.35). Such ambiguous constructions are selected as separate segments by the regular expression in (RE6.45), which changes the verbal head label to <vpHeadATTFinite>.

(6.33)
a. Tuni ringde mig sen och sa att allt hade gått bara bra.
Tuni called me later and said that everything had [pret] gone [sup] just good
– Tuni called me later and said that everything had gone just fine.
b. Tuni <vp> <vpHead> ringde </vpHead> <np> mig </np> </vp> sen och <vp> <vpHead> sa </vpHead> </vp> att <vp> <vpHeadATTFinite> <np> allt </np> <vc> hade gått </vc> </vpHead> <np> <ap> bara bra </ap> </np> </vp> .

(6.34)
a. Men det skulle han aldrig ha gjort för att då börjar grenen att röra på sig ...
but it should he never have done because then starts [pres] the-branch to move on itself
– But he should never have done that, because then the branch starts to move.
b. Men <np> det </np> <vp> <vpHead> <vc> skulle <np> han </np> aldrig ha <np> <ap> gjort </ap> </np> </vc> </vpHead> </vp> <np> för </np> att <vp> <vpHeadATTFinite> då börjar </vpHead> <np> grenen </np> </vp> att <vp> <vpHeadInf> <np> röra </np> <pp> <ppHead> på </ppHead> <np> sig </np> </pp> </vpHead> </vp> ...

(6.35)
a. så tänkte jag att nu hade jag chansen.
so thought I that now had [pret] I the-chance
– So I thought that now I had the chance.
b. <vp> <vpHead> så tänkte <np> jag </np> </vpHead> </vp> att <vp> <vpHeadATTFinite> nu hade <np> jag </np> </vpHead> <np> chansen </np> </vp> .
(RE6.45)
define SelectATTFinite [
  "<vpHead>" -> "<vpHeadATTFinite>" ||
  [ [ [[{sa} Tag+] | [[{för} Tag+] / NPTags]] ("</vpHead></vp>") ] |
    [ [{tänkte} Tag+] [[NPPhr "</vpHead></vp>"] | ["</vpHead>" NPPhr "</vp>"]]]] InfMark "<vp>" _ ];

Verbal heads with only a supine verb, as in (6.36) and (6.37), are also selected separately. They are considered grammatical in subordinate clauses, whereas main clauses with supine verbs without preceding auxiliary verbs are invalid in Swedish (see Section 4.3.5). The transducer created by the regular expression in (RE6.46) replaces such a verbal head marking with <vpHeadSup>.

(6.36)
a. Tänk om jag bott hos pappa.
think if I lived [sup] with daddy
– Think if I had lived at Daddy’s.
b. Tänk <pp> <ppHead> om </ppHead> <np> jag </np> </pp> <vp> <vpHeadSup> bott </vpHead> <pp> <ppHead> <np> hos </np> </ppHead> <np> pappa </np> </pp> </vp> .

(6.37)
a. det var en gång en pojke som fångat en groda.
it was a time a boy that caught [sup] a frog
– There was once a boy that had caught a frog.
b. <np> det </np> <vp> <vpHead> <np> var </np> </vpHead> <np> en gång </np> <np> en pojke </np> </vp> som <vp> <vpHeadSup> fångat </vpHead> <np> en groda </np> </vp> .

(RE6.46)
define SelectSupVP [
  "<vpHead>" -> "<vpHeadSup>" || _ VerbSup "</vpHead>"];

6.8 Error Detection and Diagnosis

6.8.1 Introduction

The broad grammar is applied to mark both the grammatical and ungrammatical phrases in a text. The narrow grammar expresses the nature of grammatical phrases in Swedish and is then used to distinguish the true grammatical patterns from the ungrammatical ones. The automata created in the error detection stage correspond to the patterns that do not meet the constraints of the narrow grammar, and thus compile into a grammar of errors. This is achieved by subtraction of the narrow grammar from the broad grammar.
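The subtraction step can be mimicked at the level of membership testing: a segment lies in the error language exactly when the broad grammar accepts it and the narrow grammar rejects it, i.e. it is in L(broad) − L(narrow). A minimal sketch follows; the two-symbol encoding is invented for illustration, whereas the real grammars are XFST automata over tagged text:

```python
import re

# d/D = indefinite/definite determiner, n/N = indefinite/definite noun.
BROAD_NP = re.compile(r"[dD][nN]\Z")    # overgenerating: any Det + any N
NARROW_NP = re.compile(r"(?:dn|DN)\Z")  # definiteness must agree

def definiteness_error(segment):
    """Membership in L(broad) - L(narrow): broad accepts, narrow rejects."""
    return bool(BROAD_NP.match(segment)) and not NARROW_NP.match(segment)

print(definiteness_error("dn"))  # en blomma       -> False
print(definiteness_error("dN"))  # *en blomman     -> True
print(definiteness_error("x"))   # not an NP shape -> False
```

Note that both grammars stay positive: neither regular expression describes an error pattern, and the error language falls out of the difference alone.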
The potential phrase segments recognized by the broad grammar are checked against the rules in the narrow grammar and, by looking at the difference, the constructions violating these rules are identified. The detection process is also partial in the sense that errors are located in an appropriately delimited context, i.e. a noun phrase when looking for agreement errors in noun phrases, a verbal head when looking for violations of finite verbs, etc. The replacement operator is used for the selection of errors in text.

6.8.2 Detection of Errors in Noun Phrases

In the current narrow grammar, there are three rules for agreement errors in noun phrases without postnominal attributes and three for partitive constructions, all reflecting the features of definiteness, number and gender and differing only in the context they are detected in. We present the detection rules for noun phrases without postnominal attributes in (RE6.47) and for partitive noun phrases in (RE6.48). These automata represent the result of subtracting the narrow grammar of, e.g., all noun phrases that meet the definiteness conditions (NPDefs) ((RE6.33) on p. 208) from the overgenerating grammar of all noun phrases (NP) ((RE6.22) on p. 196). By application of a marking transducer, the ungrammatical segments are selected and annotated with appropriate diagnosis-markers related to the types of rules that are violated, as presented in (RE6.47) and (RE6.48).

(RE6.47)
a. define npDefError ["<np>" [NP - NPDefs] "</np>"];
   define npNumError ["<np>" [NP - NPNum] "</np>"];
   define npGenError ["<np>" [NP - NPGen] "</np>"];
b. define markNPDefError [
     npDefError -> "<Error definiteness>" ... "</Error>"];
   define markNPNumError [
     npNumError -> "<Error number>" ... "</Error>"];
   define markNPGenError [
     npGenError -> "<Error gender>" ... "</Error>"];

(RE6.48)
a.
define NPPartDefError [
  "<NPPart>" [NPPart - NPPartDefs] "</np>"];
define NPPartNumError [
  "<NPPart>" [NPPart - NPPartNum] "</np>"];
define NPPartGenError [
  "<NPPart>" [NPPart - NPPartGen] "</np>"];

b. define markNPPartDefError [
     NPPartDefError -> "<Error definiteness NPPart>" ... "</Error>"];
   define markNPPartNumError [
     NPPartNumError -> "<Error number NPPart>" ... "</Error>"];
   define markNPPartGenError [
     NPPartGenError -> "<Error gender NPPart>" ... "</Error>"];

The narrow grammar of noun phrases is prepared for a further extension to noun phrases modified by relative clauses, which in the current version of the system are just selected as distinct from the other noun phrase types.

6.8.3 Detection of Errors in the Verbal Head

Three detection rules are defined for verb errors, identifying the three types of context they can appear in. Errors in finite verb form are checked directly in the verbal head (vpHead). Errors in infinitive phrases are detected in the context of a verbal head preceded by an infinitive marker (vpHeadInf). Errors in verb form following an auxiliary verb are detected in the context of previously selected (potential) verb clusters (vc). The nets of these detecting regular expressions, presented in (RE6.49a), correspond (as for noun phrases) to the difference between the grammatical rules (e.g. VPInf in (RE6.39) on p. 211) and the more general rules (e.g. VPhead in (RE6.22) on p. 196), yielding the ungrammatical verbal head patterns. The annotating automata in (RE6.49b) are used for error diagnosis.

(RE6.49)
a. define vpFiniteError [
     "<vpHead>" [VPhead - VPFinite] "</vpHead>"];
   define vpInfError [
     "<vpHeadInf>" [VPhead - VPInf] "</vpHead>"];
   define VCerror [
     "<vc>" [VC - VCgram] "</vc>"];
b. define markFiniteError [
     vpFiniteError -> "<Error finite verb>" ... "</Error>"];
   define markInfError [
     vpInfError -> "<Error infinitive verb>" ... "</Error>"];
   define markVCerror [
     VCerror -> "<Error verb after Vaux>" ...
"</Error>"];

Also, the narrow grammar of verbs can be extended with the grammar of coordinated verbs, the use of finite verb forms after att ‘that’, and the bare supine verb form as predicate, all selected as separate patterns.

6.9 Summary

This chapter presented the final step of this thesis: implementing detection of some of the grammar errors found in the Child Data corpus. The whole system is implemented as a network of finite state transducers, disambiguation is minimal, achieved essentially by parsing order and filtering techniques, and the grammars of the system are always positive. The system detects errors in noun phrase agreement and errors in finite and non-finite verb forms. The strength of the implemented system lies in the definition of grammars as positive rule sets, covering the valid rules of the language. The rule sets remain quite small and practically no description of errors by hand is necessary. There are altogether six rules defining the broad grammar set, and the narrow grammar set is also quite small. Other automata are used for selection and filtering. We do not have to elaborate on what errors may occur, only in what context, and certainly not spend time on stipulating their structure. The approach further aimed at minimal information loss in order to be able to handle texts containing errors. The degree of ambiguity is maximal at the lexical level, where we choose to attach all lexical tags to strings. At a higher level, structural ambiguity is treated by parsing order, grammar extension and filtering techniques. The parsing order resolves some structural ambiguities and is complemented by grammar extensions as an application of filtering transducers that refine and/or redefine the parsing decisions.
Other disambiguation heuristics are applied, for instance, to noun phrases, where pronouns that follow a verbal head are attached directly to the verbal head in order to prevent them from attaching to a subsequent noun.

Chapter 7 Performance Results

7.1 Introduction

The implementation of the grammar error detector is to a large extent based on the lexical and syntactic circumstances displayed in the Child Data corpus. The actual implementation proceeded in two steps. In the first phase we developed the grammar so that the system could run on sentences containing errors and correctly identify the errors. When the system was then run on complete texts, including correct material, the false alarms allowed by the system were revealed. The second phase involved adjustment of the grammar to improve the flagging accuracy of the system. FiniteCheck was tested for grammatical coverage (recall) and flagging accuracy (precision) on Child Data and on an arbitrary text not known to the system, in accordance with the performance test on the other three grammar checkers (see Section 5.2.3). In this chapter I present results from both the initial phase in the development of the system (Section 7.2) and the improved current version (Section 7.3). The results are further compared to the performance of the other three Swedish checkers on both Child Data (Section 7.4) and the unseen adult text (Section 7.5). The chapter ends with a short summary and conclusions (Section 7.6).

7.2 Initial Performance on Child Data

7.2.1 Performance Results: Phase I

The results of the implemented detection of errors in noun phrase agreement and in verb form in finite verbs, after auxiliary verbs and after infinitive markers in Child Data, from the initial Phase I in the development of FiniteCheck, are presented in Table 7.1.

Table 7.1: Performance Results on Child Data: Phase I

                                  CORRECT ALARM        FALSE ALARM      PERFORMANCE
ERROR TYPE               ERRORS   Correct  Incorrect   No     Other     Recall  Precision  F-value
                                  Diagn.   Diagn.      Error  Error
Agreement in NP          15       14       1           76     64        100%    10%        18%
Finite Verb Form         110      98       0           237    19        89%     28%        42%
Verb Form after Vaux     7        6        0           61     10        86%     8%         15%
Verb Form after inf. m.  4        4        0           5      0         100%    44%        62%
TOTAL                    136      122      1           379    93        90%     21%        34%

The grammatical coverage (recall) on this training corpus was maximal, except for one erroneous verb form after an auxiliary verb and a few instances of errors in finite verb form. The overall recall rate for these four error types was 90%. When tested on the whole Child Data corpus, many segments were wrongly marked as errors and the precision rate was quite low, only 21% in total, resulting in an overall F-value of 34%. Most of the false alarms occurred for errors in finite verb form, followed by errors in noun phrase agreement. Relative to the error frequency of the individual error types, errors in verb form after an auxiliary verb had the lowest precision (8%), closely followed by errors in noun phrase agreement (10%). The grammar of the system was at this initial stage based essentially on the syntactic constructions displayed in the erroneous patterns that we wanted to capture. Many of the false alarms were due to missing grammar rules when tested on the whole text corpus. Other false markings of correct text occurred due to ambiguity, incorrect segmentation of the text in the parsing stage, or occurrences of other error categories than grammatical ones. Below I discuss in more detail the grammatical coverage and flagging accuracy in this initial phase.

7.2.2 Grammatical Coverage

Errors in Noun Phrase Agreement

All errors in noun phrase agreement were detected, one of them with incorrect diagnosis, due to a split in the head noun. FiniteCheck is not prepared to handle segmentation errors and, exactly like the other Swedish grammar checkers, detects the noun phrase with inconsistent use of adjectives (G1.2.3; see (4.9) on p. 49) only in part.
The detector yields both the correct diagnosis of gender mismatch and an incorrect diagnosis of a definiteness mismatch, since the first part troll ‘troll’ of the head noun is indefinite and neuter and does not agree with the definite, common gender determiner den ‘the’, as seen in (7.1a). When the head noun has the correct form and is no longer split into two parts, the whole noun phrase is selected and a gender mismatch is reported, as seen in (7.1b).

(7.1) (G1.2.3)
a. det va <Error definiteness><Error gender> den ∗hemske ∗fula troll </Error> karlen tokig som ...
it was the [com,def] awful [masc] ugly [def] troll [neu,indef] man [com,def] Tokig that
– It was the awful ugly magician Tokig that ...
b. det va <Error gender> den ∗hemske ∗fula trollkarlen </Error> tokig som ...
it was the [com,def] awful [masc] ugly [def] magician [com,def] Tokig that
– It was the awful ugly magician Tokig that ...

Errors in Finite Verb Form

Among the errors in finite verb form, none of the errors concerning main verbs realized as participles were detected (G5.2.90–G5.2.99; see (4.30) on p. 60). They require other methods for detection, since, as seen in (7.2), they are interpreted as adjective phrases.

(7.2)
a. älgen sprang med olof till ett stup och ∗ kastad ner olof och hans hund
the-moose ran with Olof to a cliff and threw [past part] down Olof and his dog
– The moose ran with Olof to a cliff and threw Olof and his dog over it.
b. <np> älgen </np> <vp> <vpHead> sprang med </vpHead> <np> olof </np> <pp> <ppHead> till </ppHead> <np> ett stup </np> </pp> </vp> och <np> <ap> kastad </ap> </np> ner <np> olof </np> och <np> hans hund </np>

Two errors were missed due to preceding verbs joined into the same segment and then treated as verb clusters, as shown in (7.3) and (7.4).

(7.3) (G5.1.1)
a. Madde och jag bestämde oss för att sova i kojan och se om vi ∗ få se vind.
Madde and I decided ourselves for to sleep in the-hut and see if we
can [untensed] see Vind – Madde and I decided to sleep in the hut and see if we will see Vind. b. <np> Madde </np> och <np> jag </np> <vp> <vpHead> bestämde </vpHead> <np> oss </np> </vp> <np> för </np> att <vp> <vpHeadInf> sova i </vpHead> <np> kojan </np> </vp> och <vp> <vpHead> se om <np> vi </np> <np> få </np> se </vpHead> <np> vind </np> </vp> . (7.4) (G5.2.40) ∗ a. När vi kom fram börja vi packa upp våra grejer och when we came forward start [untensed] we pack up our stuff and rulla upp sovsäcken. role up the-sleepingbag – When we arrived, we started to unpack our things and role out the sleepingbag. b. När <np> vi </np> <vp> <vpHead> <vc> kom fram börja </vc> <np> vi </np> packa upp </vpHead> <np> våra grejer </np> </vp> och <vp> <vpHeadCoord> rulla upp </vpHead> <np> sovsäcken </np> </vp> . One of the errors in finite verb was wrongly selected as seen in (7.5b). Here, the noun bo ‘nest’ is homonymous with the verb bo ‘live’ and joined together with the main verb to a verb cluster the detector selects the verb cluster 1 and diagnoses it as an error in finite verb, which is actually true but only for the main verb, the second constituent of this segment. 1 The noun phrase tags surrounding bo are ignored in the selection as verb cluster, see (RE6.40) on p.211. Performance Results (7.5) (G5.2.70) a. Då gick then went 223 pojken the-boy vidare further och and såg saw inte not att that binas bees’s bo nest ∗ trilla ner. tumble [untensed] down – Then the boy went further on and did not see that the nest of the bees tumbled down. b. <vp> <vpHead> Då gick </vpHead> <np> pojken </np> <np> vidare </np> </vp> och <vp> <vpHead> <np> såg </np> inte </vpHead> </vp> att <np> binas </np> <vp> <vpHead> <Error finite verb> <vc> <np> bo </np> trilla </vc> </Error> </vpHead> </vp> ner. 
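The lexical side of cases like (7.5), where bo carries both a noun and a verb reading, follows from the system's design decision to attach all lexical tags to words rather than disambiguate with a tagger. A minimal illustrative sketch in Python (a toy lexicon with simplified, invented tags, not FiniteCheck's actual lexicon or tagset):

```python
# Illustrative sketch only: keep ALL lexical tags for a word instead of
# letting a tagger pick one, so information needed later to find an error
# is not discarded. Lexicon and tags are hypothetical simplifications.
LEXICON = {
    "binas":  [("noun", "gen,pl")],        # 'the bees''
    "bo":     [("noun", "neu,indef"),      # 'nest'
               ("verb", "infinitive")],    # 'live'
    "trilla": [("verb", "infinitive")],    # 'tumble'
}

def lookup(token):
    """Return every (pos, features) pair for a token, keeping ambiguity."""
    return LEXICON.get(token.lower(), [("unknown", "")])

tags = {w: lookup(w) for w in "binas bo trilla".split()}
# 'bo' retains both its noun and verb reading; a later parsing stage,
# not a tagger, decides which reading survives in context.
assert ("verb", "infinitive") in tags["bo"]
assert ("noun", "neu,indef") in tags["bo"]
```

The cost of keeping the ambiguity is visible in (7.5): the noun reading of bo is still available when the verb cluster is formed, which is exactly what produces the partially wrong diagnosis there.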
Rest of the Verb Form Errors

One error in verb form after an auxiliary verb was not detected (see (7.6)). It involved coordination of a verb cluster with a further verb that should follow the same pattern and thus be in the infinitive (i.e. låta 'let [inf]'). The system does not take coordination of verbs into consideration; the coordinated verb is identified as a separate verbal head with a finite verb, which is a valid form according to the grammar rules of the system, and the error is overlooked.

(7.6) (G6.1.2)
a. Ibland får man bjuda på sig själv och ∗låter henne/honom vara med!
   sometimes must [pres] one offer [inf] on oneself and let [pres] her/him be with
   – 'Sometimes one has to make a sacrifice and let him/her come along.'
b. <vp> <vpHead> Ibland <np> får </np> <np> man </np> bjuda <pp> <ppHead> på </ppHead> <np> sig </np> </pp> </vpHead> <np> själv </np> </vp> och <vp> <vpHead> låter </vpHead> </vp> henne/honom <vp> <vpHead> <np> vara </np> med </vpHead> </vp> !

Finally, all errors in verb form after an infinitive marker were detected.

7.2.3 Flagging Accuracy

This subsection presents the kinds of false flaggings that occurred in this first test of the system. The description proceeds error type by error type, specifying whether the false alarms were due to missing grammar rules, erroneous segmentation of the text at the parsing stage, or ambiguity. The false alarms involving other error categories are also specified.

False Alarms in Noun Phrase Agreement

The kinds and numbers of false alarms occurring in noun phrases are presented in Table 7.2.

Table 7.2: False Alarms in Noun Phrases: Phase I

  FALSE ALARM TYPE                         NO.
  Not in Grammar:   NPInd+som                5
                    Adv in NP               28
                    other                    8
  Segmentation:     too long parse          26
  Ambiguity:        PP                       7
                    V                        2
  Other Error:      misspelling             12
                    split                   48
                    sentence boundary        4

Most of these false alarms arose because the constructions involved were not covered by the grammar of the system.
For instance, adverbs in noun phrases as in (7.7a) were not covered, causing gender agreement alarms, since in Swedish the neuter form of an adjective often coincides with the adverb of the same lemma. Further, noun phrases with a subsequent relative clause, such as (7.7b), were selected as errors in definiteness, although they are correct, since the head noun takes the indefinite form when followed by such a clause (see Section 4.3.1).

(7.7)
a. Det var i skolan och jag kom lite för sent till en lektion med <Error gender> väldigt sträng lärare </Error>.
   it was in school and I came little too late to a class with very hard/strict teacher
   – 'It was in school and I came a little late to a class with a very strict teacher.'
b. Jag tycker att det borde finnas en hjälpgrupp för <Error definiteness> de elever </Error> som har lite sociala problem.
   I think that it should exist a help-group for the [pl,def] pupils [pl,indef] that have some social problems
   – 'I think that there should be a help-group for the pupils that have some social problems.'

Other false flaggings depended on the application of longest match, resulting in noun phrases of too wide a range, as in (7.8), where the modifying predicative complement and the subject are merged into one noun phrase: the inverted word order forces the verb to the end of the sentence instead of its usual place in between, so skolan 'school' should form a noun phrase of its own.

(7.8) dom tänker inte hur <Error definiteness> viktig skolan </Error> är
      they think not how important [str] school [def] is
      – 'They do not think how important school is.'

Furthermore, due to lexical ambiguity, some prepositional phrases, such as the one in (7.9), and some verbs were parsed as noun phrases and later marked as errors.

(7.9) Det är en ganska stor väg ungefär <Error definiteness> vid hamnen </Error>
      it is a rather big road somewhere wide [indef]/at harbor [def]
      – 'It is a rather big road somewhere at the harbor.'
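The over-wide parse in (7.8) is a direct consequence of longest match: a greedy noun phrase pattern applied to a tag sequence absorbs a preceding adjective whenever it can. A small illustration over a toy tag string (the tag names and the miniature NP rule are simplified inventions, not the system's actual grammar):

```python
import re

# Toy tag sequence for 'dom tänker inte hur viktig skolan är' (7.8):
# one simplified, invented tag per token.
tagged = "PRON VERB ADV ADV ADJ NOUN VERB"

# A broad NP rule: any number of adjectives followed by a noun.
NP = re.compile(r"(?:ADJ )*NOUN")

# Longest match absorbs the predicative adjective together with the
# subject noun: 'viktig skolan' is selected as one NP candidate, which
# a narrow (agreeing) grammar would then reject, producing a false alarm.
assert NP.findall(tagged) == ["ADJ NOUN"]
```

With no adjective adjacent to the noun, the same rule harmlessly selects the bare noun, which is why the problem only surfaces in inverted word order.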
False flaggings involving error categories other than grammar errors were also quite common. Mostly splits, as in (7.10a), were flagged: here the noun ögonblick 'moment' is split, and its first part ögon 'eyes' does not agree in number with the preceding singular determiner and adjective. Flaggings involving misspellings also occurred, as in (7.10b), where the newly formed word is a noun of different gender and definiteness than the preceding determiner and so causes agreement errors. Some cases of missing sentence boundaries were likewise flagged as errors in noun phrase agreement.

(7.10)
a. För <Error number> ett kort ögon </Error> blick trodde jag ...
   for a [sg] short eye [pl] blinking thought I
   – 'For a short moment I thought ...'
b. <Error definiteness> <Error gender> Det ända </Error> </Error> jag vet
   the [neu,def] end [com,indef] I know
   – 'The only thing I know ...'

Furthermore, erroneous tags assigned in the lexical lookup caused trouble, for instance when many words were erroneously tagged as proper names.

False Alarms in Finite Verb Form

The types and numbers of false alarms in finite verbs are summarized in Table 7.3. These occurred mostly because of the small size of the grammar, but also due to ambiguity problems.

Table 7.3: False Alarms in Finite Verbs: Phase I

  FALSE ALARM TYPE                              NO.
  Not in Grammar:   imperative                   56
                    coordinated infinitive       74
                    discontinuous verb cluster   43
  Ambiguity:        noun                         36
                    pronoun                       8
                    preposition/adjective        20
  Other Error:      misspelling                   9
                    split                        10

Imperative verb forms, which in this first phase were not part of the grammar, caused false alarms not only on actual verbs, as in (7.11a), but also on strings homonymous with such forms, as in (7.11b). Here the word sätt is ambiguous between the noun reading 'way' and the imperative verb form 'set'.

(7.11)
a. Men <Error finite verb> titta </Error> en stock.
   but look [imp] a log
   – 'But look, a log.'
b. Dom samlade in pengar <Error finite verb> på olika sätt </Error>
   they collected in money in different ways/set [imp]
   – 'They collected money in different ways.'

Further, coordinated infinitives as in (7.12) were diagnosed as errors in finite verb form, since, due to the partial parsing strategy, they were selected as separate verbal heads (see (6.31) and (6.32) on p. 212).

(7.12)
a. hon skulle springa ner och <Error finite verb> larma </Error>
   she would run [inf] down and alarm [inf]
   – 'She would run down and raise the alarm.'
b. det är dags att gå och <Error finite verb> lägga sig </Error>.
   it is time to go and lay [inf] oneself
   – 'It is time to go and lie down.'

Similar problems occurred with discontinuous verb clusters, where a noun followed the auxiliary verb and the subsequent verb forms were treated as separate verbal heads (see (6.27) on p. 204). Further, primarily nouns, but also pronouns, adjectives and prepositions were recognized as verbal heads, causing false error diagnoses. The other error categories selected as errors in finite verb form concerned both splits and misspellings, but these were considerably fewer than the corresponding false alarms in noun phrase agreement.

False Alarms in Verb Forms after an Auxiliary Verb

False alarms in verb forms after an auxiliary verb occurred either due to ambiguity, with nouns, pronouns, adjectives and prepositions interpreted as verbs, or due to occurrences of other error categories (Table 7.4). Pronouns were interpreted as verbs mostly in front of a copula verb and merged into a verb cluster segment. Similar problems occurred with adjectives and participles (see (6.22)-(6.24) starting on p. 202).

Table 7.4: False Alarms in Verb Clusters: Phase I

  FALSE ALARM TYPE                         NO.
  Ambiguity:        noun                     26
                    pronoun                  18
                    preposition/adjective    17
  Other Error:      misspelling               3
                    split                     7

Among the false flaggings concerning other error categories, both spelling errors and splits were flagged.
In (7.13) we see an example of a misspelling where the adjective rädd 'afraid' is written as red, coinciding with the verb red 'rode', and is marked as an error in verb form after an auxiliary verb.²

² The broad grammar rule for verb clusters joins any types of verbs, which is why the copula verb blev 'became' is included.

(7.13) pojken <Error verb after Vaux> blev red </Error>
       the-boy became rode
       – 'The boy became afraid.'

Furthermore, many instances of missing punctuation at a sentence boundary were flagged as errors in verb clusters, as in (7.14).³ As in the performance test of the other grammar checkers, these flaggings are not included in the test: they represent correct flaggings, although the diagnosis is not correct.

³ Two vertical bars indicate the missing clause or sentence boundary.

(7.14)
a. Jag <Error verb after Vaux> fortsatte vägen fram || då såg </Error> jag en brandbil || jag visste vad det var.
   I continued the-road forward then saw I a fire-car I knew what it was
   – 'I continued forward on the road, then I saw a firetruck. I knew what it was.'
b. I hålet pojken <Error verb after Vaux> hittat || fanns </Error> en mullvad.
   in the-hole the-boy found was a mole
   – 'In the hole the boy had found, there was a mole.'

False Alarms in Verb Forms in Infinitive Phrases

Finally, five false alarms on infinitival verbal heads occurred in constructions that do not require an infinitive verb form after att, which is both an infinitive marker 'to' and a subjunction 'that' (see (6.33)-(6.35) starting on p. 213).

7.3 Current Performance on Child Data

7.3.1 Introduction

As shown above, almost all of the errors in Child Data were detected by FiniteCheck. The segments erroneously classified as errors by the implemented detector were mostly due to the small number of grammatical structures covered by the grammar, to tagging problems, and to the high degree of ambiguity in the system.
Many alarms also involved other error categories, such as misspellings, splits and omitted punctuation. In accordance with these observations, the detection performance of the system was improved in three ways in order to avoid false alarms:

• extend and correct the lexicon
• extend the grammar
• improve parsing

The full form lexicon of the system is rather small (around 160,000 words) and not without errors, so the first and rather easy step was to correct erroneous tagging and add new words to the lexicon. The grammar rules were then extended, and filtering transducers were used to block false parses. Below follows a description of the grammar extension and the other improvements made to avoid false alarms for the individual error types. Then the current performance of the system is presented (Section 7.3.3).

7.3.2 Improving Flagging Accuracy

Improving Flagging Accuracy in Noun Phrase Agreement

The grammar of adjective phrases was expanded with the missing adverbs. Noun phrases followed by relative clauses display distinct agreement constraints and were selected separately by the already discussed regular expression (RE6.37) (see p. 210). This does not mean that the grammar now covers such noun phrases, but false alarms on these constructions are avoided. The false alarms in noun phrases caused by limitations in the grammar were all eliminated. This grammar update further improved parsing in the system and decreased the number of over-wide parses giving rise to false alarms. The types and numbers of false alarms that remain are presented in Table 7.5.

Table 7.5: False Alarms in Noun Phrases: Phase II

  FALSE ALARM TYPE                         NO.
  Segmentation:     too long parse           5
  Ambiguity:        PP                      10
  Other Error:      misspelling             10
                    split                   35
                    sentence boundary        2

Among these are (relative) clauses without complementizers, as in (7.15).

(7.15)
a. det var den godaste frukost jag någonsin ätit ...
   it was the best breakfast I ever eaten
   – 'It was the best breakfast I have ever eaten ...'
b. <np> det </np> <vp> <vpHead> <np> var </np> </vpHead> <Error definiteness> <np> den <ap> godaste </ap> frukost </np> </Error> <np> jag </np> </vp> <vp> <vpHead> någonsin ätit </vpHead> </vp> .

Improving Flagging Accuracy in Finite Verbs

In the case of finite verbs, the problem with imperative verbs is solved to the extent that forms that do not coincide with other verb forms are accepted as finite verb forms, e.g. tänk 'think'. The imperative forms that coincide with infinitives (e.g. titta 'look') remain a problem. The difficulty is that errors in verbs realized as a lack of tense endings often coincide with the imperative (and infinitive) form of the verb; allowing all imperative verb forms as grammatical finite verb forms would therefore mean that such errors would no longer be detected by the system. Normally, other cues, such as end-of-sentence marking or a noun phrase before the predicate, are used to identify imperative verb forms. These methods are, however, not suitable for texts written by children, since such texts often lack end-of-sentence punctuation or the capitals indicating the beginning of a sentence, so a noun phrase preceding the predicate could equally well be the end of a previous sentence. Still, simply accepting the imperative forms that do not coincide with other verb forms as grammatical finite verb forms cut the number of false alarms on imperatives in half, as shown in Table 7.6 below.

False alarms on finite verbs in coordinations with infinitive verbs decreased to just nine; they were blocked by selecting infinitive verbs preceded by a verbal group or infinitive phrase as a separate pattern category with the transducer in (RE6.44) (see p. 212).
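The blocking strategy just described rests, like the error detection itself, on operations over regular languages: a broad grammar minus a narrow grammar yields the error patterns, and exception patterns can likewise be subtracted before flagging. A minimal sketch with finite Python sets standing in for the automata (the tag names and the tiny grammars are invented for illustration, not FiniteCheck's):

```python
# Minimal sketch of the subtraction idea, with finite sets standing in
# for finite state automata. Tag sequences are hypothetical toys.

# Broad grammar: any determiner-noun tag combination parses as an NP.
broad = {(d, n) for d in ("det_com", "det_neu")
                for n in ("noun_com", "noun_neu")}

# Narrow grammar: only the gender-agreeing combinations are valid.
narrow = {("det_com", "noun_com"), ("det_neu", "noun_neu")}

# Subtraction yields exactly the ungrammatical patterns: no error rules
# were written, only positive grammars describing valid Swedish.
errors = broad - narrow

# An exception pattern (cf. blocking coordinated infinitives) can be
# removed the same way before flagging. None apply in this toy.
exceptions = set()
flagged = errors - exceptions

assert ("det_com", "noun_neu") in flagged      # e.g. a gender mismatch
assert ("det_com", "noun_com") not in flagged  # agreeing NPs pass
```

With real automata the sets are infinite, but the difference operation and the resulting "only positive grammars" property are the same.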
Discontinuous verbal groups with a noun phrase following the auxiliary verb were joined together by the automaton (RE6.29) (see p. 205), and the narrow grammar of verb clusters was expanded to include (optional) noun phrases. Almost half of these false alarms were thereby avoided.

False alarms on finite verbs caused by ambiguous interpretation also decreased. Some were avoided by the grammar update, which also improved parsing. Further adjustments targeted nouns interpreted as verbs in possessive noun phrases and adjectives in noun phrases interpreted as verbal heads; these were filtered out by applying the automata (RE6.26) and (RE6.25) (see p. 201). Furthermore, verbal heads with a single supine verb form were singled out, since they are grammatical in subordinate clauses (see (RE6.46) on p. 214). The remaining false alarms are summarized in Table 7.6.

Table 7.6: False Alarms in Finite Verbs: Phase II

  FALSE ALARM TYPE                               NO.
  Not in Grammar:   imperative                    27
                    coordinated infinitive         9
                    discontinuous verb clusters   28
  Ambiguity:        noun                           9
                    pronoun                        1
                    preposition/adjective         14
  Other:                                           6
  Other Error:      misspelling                   18
                    split                         14

Improving Flagging Accuracy in Verb Form after Auxiliary Verb

The ambiguity resolutions defined for finite verbs blocked false alarms not only on finite verbs but also in verb clusters. Furthermore, an annotation filter (RE6.27) (see p. 203) was defined for copula verbs to block false markings of copula verbs combined with other constituents, such as pronouns, adjectives and participles, as a sequence of verbs. The types and numbers of false alarms that remain are presented in Table 7.7.

Table 7.7: False Alarms in Verb Clusters: Phase II

  FALSE ALARM TYPE                         NO.
  Ambiguity:        noun                      4
                    pronoun                   4
                    preposition/adjective    24
  Other Error:      misspelling               6
                    split                     9

Improving Flagging Accuracy in Verb Form in Infinitive Phrases

The false alarms in infinitive verb phrases occurred in constructions that do not require an infinitive verb form after an infinitive marker. These were selected as separate patterns by the automaton (RE6.45) (see p. 213), and false markings of this type were blocked.

7.3.3 Performance Results: Phase II

The performance of the new, improved version of the system (Phase II) is presented in Table 7.8. The grammatical coverage is the same for all error types except finite verbs, where the recall rate decreased slightly, from 89% to 87%.

Table 7.8: Performance Results on Child Data: Phase II

                                   FINITECHECK: PHASE II
                                  CORRECT ALARM          FALSE ALARM      PERFORMANCE
  ERROR TYPE             ERRORS   Correct    Incorrect   No      Other    Recall  Precision  F-value
                                  Diagnosis  Diagnosis   Error   Error
  Agreement in NP            15       14         1         15      47      100%     19%       33%
  Finite Verb Form          110       96         0         94      32       87%     43%       58%
  Verb Form after Vaux        7        6         0         32      15       86%     11%       20%
  Verb Form after inf. m.     4        4         0          0       0      100%    100%      100%
  TOTAL                     136      120         1        145      94       89%     34%       49%

The decrease in recall is a consequence of the improvement in flagging accuracy. That is, in addition to the errors not detected by the system in the initial stage (see Section 7.2), two further errors in finite verb form realized as a bare supine were not detected (G5.2.88, G5.2.89), because all bare supine forms are now selected as separate segments, as shown in (7.16). This selection was necessary in order to avoid marking correct uses of bare supine forms as erroneous. Once the grammar covers the bare supine verb form, these errors can be detected as well.

(7.16) (G5.2.89)
a. Han tittade på hunden. Hunden ∗försökt att klättra ner.
   he looked [pret] at the-dog the-dog tried [sup] to climb down
   – 'He looked at the dog. The dog tried to climb down.'
b.
<np> Han </np> <vp> <vpHead> tittade på </vpHead> <np> hunden </np> </vp> , <np> hunden </np> <vp> <vpHeadSup> försökt </vpHead> </vp> att <vp> <vpHeadInf> klättra ner </vpHead> </vp>

We were able to avoid many of the false flaggings by improving the lexical assignment of tags and expanding the grammar. The parsing results of the system improved along with the flagging accuracy, and the total precision rate rose from 21% to 34%. The remaining false alarms most often have to do with ambiguity; only in the case of verb clusters is further expansion of the grammar needed. Figure 7.1 compares the number of false markings of correct text as erroneous in the initial Phase I and the current Phase II.

Figure 7.1: False Alarms: Phase I vs. Phase II

The types and numbers of alarms revealing other error categories are more or less constant and can be considered a side-effect of a system of this kind. Methods for recognizing those error types are of interest: most splits and misspellings were discovered through agreement problems, and omission of sentence boundaries is in many cases caught by the verb cluster analysis. The overall performance of the system in detecting the four targeted error types increased in F-value from 34% in the initial phase to 49% in the current, improved version.

7.4 Overview of Performance on Child Data

In Section 5.5 I presented the linguistic performance on the Child Data corpus of the other three Swedish tools: Grammatifix, Granska and Scarrie. Here I discuss the results of these tools for the four error types targeted by FiniteCheck and explore the similarities and differences in performance between our system and the other tools. The purpose is not to claim that FiniteCheck is in general superior to the other tools: FiniteCheck was developed on the Child Data corpus, whereas the other tools were not.
However, it is important to show that FiniteCheck represents some improvement over systems that were not designed to cover this particular data.

The grammatical coverage of these three tools and of our detector for the four error types is presented in Figure 7.2.⁴ The three other tools are designed to detect errors in adult texts, and not surprisingly their detection rates are low. Among these four error types, agreement errors in noun phrases are the best covered by these tools, whereas errors in verb form obtained much lower results in general. All three systems managed to detect at least half of the errors in noun phrase agreement, while errors in the finite verb form obtained the worst results. In the case of Grammatifix, the verb error types yielded no or very few detections. Granska targeted all four error types and detected more than half of the errors in three of them, but only 4% of the errors in finite verb form. Scarrie also had problems detecting errors in verbs, although it performed best of the three tools on finite verbs, detecting 15% of them.

Figure 7.2: Overview of Recall in Child Data

⁴ The number of errors per error type is presented within parentheses next to the error type name.

FiniteCheck, which was trained on this data, obtained maximal recall rates for errors in noun phrase agreement and in verb form after an infinitive marker. Errors in the other verb form types obtained a somewhat lower recall (around 86%). Although this is a good result, we should keep in mind that FiniteCheck is here tested on the data that was used for its development. That is, it is not clear whether the system would receive such high recall rates for all four error types on unseen child texts.⁵
⁵ We have not been able to test the system on new child data. Texts written by children are hard to get and require a lot of preprocessing.

However, the high performance in detecting errors, especially for the frequent finite verb form error type, is an obvious difference from the low performance of the other tools, and at least seems to motivate the tailoring of grammar checkers to children's texts.

Precision rates are presented in Figure 7.3. They are in most cases below 50%, for all systems. The results are, however, relative to the number of errors: the figures for errors in finite verb form, a quite frequent error type, are probably the most informative, whereas the errors in verb form after an infinitive marker are too few to draw any firm conclusions from.

Figure 7.3: Overview of Precision in Child Data

Evaluating the overall performance of the systems in detecting these four error types, presented in Figure 7.4 below, the three other systems obtained a recall of 16% on average. The recall rate of FiniteCheck is considerably higher, which may mean that the tool is good at finding erroneous patterns in texts written by children, but that remains to be confirmed by tests on unseen texts. Flagging accuracy is slightly above 30% for Grammatifix, Granska and FiniteCheck; Scarrie obtained slightly lower precision rates. Combining these rates into overall system performance in F-value, Grammatifix obtained the lowest rate, probably due to its low recall, closely followed by Scarrie. Granska had a slightly higher result of 23%. Our system obtained twice the value of Granska.

Figure 7.4: Overview of Overall Performance in Child Data

In conclusion, among these four error types the three other grammar checkers had difficulties detecting the verb form errors in Child Data and detected only around half of the errors in noun phrase agreement.
FiniteCheck had high recall rates for all four error types and a precision on the same level as the other tools. It is unclear how much this outcome is influenced by the fact that the system was developed on exactly this data, but FiniteCheck does not seem to share the difficulty in finding errors in verb form (especially in finite verbs) that the other tools clearly display. A further evaluation of FiniteCheck, on a small text not known to the system, is reported in the following section.

7.5 Performance on Other Text

In order to see how FiniteCheck performs on unseen text of the kind used to test the other Swedish grammar checkers, a small literary text of 1,070 words describing a trip was evaluated. This text is used as a demonstration text by Granska.⁶ It includes 17 errors in noun phrase agreement, five errors in finite verb form and one error in verb form after an auxiliary verb. The purpose of this test is to see whether the results are comparable to those of the other Swedish tools. Note that the aim is not to compare the performance of all the checkers, which would be unfair since the text is Granska's demonstration text, but rather to see how our detector performs on just the error types it targets, compared with tools designed for this kind of text. Below, I first present and discuss the results of FiniteCheck; then the performance of the three other checkers is presented, followed by a comparative discussion.

7.5.1 Performance Results of FiniteCheck

Introduction

The text was first prepared manually, with spaces inserted where needed between all strings, including punctuation. Further, the lexicon had to be updated, since the text uses a particular jargon.⁷ The detection results of FiniteCheck are presented in Table 7.9.
⁶ Demonstration page of Granska: http://www.nada.kth.se/theory/projects/granska/.
⁷ FiniteCheck's lexicon would need to be extended anyway to make a general grammar checking application.

Table 7.9: Performance Results of FiniteCheck on Other Text

                                   FINITECHECK: OTHER TEXT
                                  CORRECT ALARM          FALSE ALARM      PERFORMANCE
  ERROR TYPE             ERRORS   Correct    Incorrect   No      Other    Recall  Precision  F-value
                                  Diagnosis  Diagnosis   Error   Error
  Agreement in NP            17       13         1          2       4       82%     70%       76%
  Finite Verb Form            5        5         0          1       0      100%     83%       91%
  Verb Form after Vaux        1        1         0          1       0      100%     50%       67%
  TOTAL                      23       19         1          4       4       87%     71%       78%

FiniteCheck missed three errors in noun phrase agreement, which leaves it with a total recall of 87%. False alarms occurred for all three error types, mostly in noun phrase agreement, resulting in a total precision of 71%. Below I discuss the performance results in more detail.

Errors in Noun Phrase Agreement

Among the noun phrase agreement errors, three were not detected and one was incorrectly diagnosed. The latter concerned a proper noun preceded by an indefinite common gender determiner: the noun phrase was selected and marked for all three types of agreement errors, as shown in (7.17). The reason for this selection is that the phrase was recognized as a noun phrase by the broad grammar but rejected as ungrammatical by the narrow grammar. The rejection is in itself correct, since the proper noun should stand alone or be preceded by a neuter gender determiner, but the system should signal only an error in gender agreement. The noun phrase was rejected as a whole because there are no rules for noun phrases consisting of a determiner and a proper noun.

(7.17)
a. Detta är sannerligen ∗en Mekka för fjällälskaren ...
   this is certainly a [com,indef] Mekka for the-mountain-lover
   – 'This is certainly a Mekka for the mountain-lover ...'
b.
<np> Detta </np> <vp> <vpHead> är sannerligen </vpHead> <Error definiteness> <Error number> <Error gender> <np> en Mekka </np> </Error> </Error> </Error> </vp> <np> för </np> fjällälskaren ...

The undetected errors all concerned constructions not covered by our grammar. The first, in (7.18a),⁸ involves a possessive noun phrase modifying another noun; FiniteCheck only covers noun phrases with single possessive nouns as modifiers. The other two concern numerals with nouns in the definite form, and our current grammar does not say much about numerals and definiteness.

⁸ Correct forms are presented to the right of the arrow in the examples.

(7.18)
a. den stora ∗forsen brus ⇒ den stora forsens brus
   the big stream [nom] roar [nom] ⇒ the big stream [gen] roar [nom]
b. två ∗nackdelarna ⇒ två nackdelar
   two disadvantages [def] ⇒ two disadvantages [indef]
c. två ∗kåsorna kaffe ⇒ två kåsor kaffe
   two scoops [def] coffee ⇒ two scoops [indef] coffee

Altogether six false flaggings occurred in noun phrase agreement: four of them due to splits, thus involving another error category, and two due to ambiguity in the parsing. Both types are exemplified in the sentence in (7.19). In the first case the noun fjällutrustningen 'mountain equipment [sg,com,def]' is split, and the first part does not agree with the preceding modifiers. The second case involves the complex preposition framför allt 'above all', where allt is joined with the following noun into a noun phrase and a gender mismatch arises.

(7.19)
a. ... i tältet och den övriga fjäll utrustningen vilar tryggheten och framför allt friheten.
   in the-tent and the [sg,com,def] rest mountain [sg/pl,neu,indef] equipment [sg,com,def] rests the-safety and above all [neu,indef] freedom [com,def]
   – '... in the tent and the other mountain equipment lies the safety and above all freedom.'
b.
<pp> <ppHead> i </ppHead> <np> tältet </np> </pp> och <Error definiteness> <Error number> <Error gender> <np> den övriga fjäll </np> </Error> </Error> </Error> <np> utrustningen </np> <vp> <vpHead> vilar </vpHead> <np> tryggheten </np> </vp> och fram <pp> <ppHead> <np> för </np> </ppHead> <Error gender> <np> allt friheten </np> </Error> </pp> .

Errors in Verb Form

All the errors in verb form were detected, with one false alarm for each error type. In the case of finite verbs, the alarm was caused by homonymy: the noun styrka 'force' was interpreted as the verb styrka 'prove', as seen in (7.20).

(7.20)
a. Vinden mojnar inte under natten utan fortsätter med oför minskad styrka.
   the-wind subsides not during the-night but continues with undiminished force
   – 'The wind does not subside during the night, but continues with undiminished force.'
b. <np> Vinden </np> mojnar inte <pp> <ppHead> <np> under </np> </ppHead> <np> natten </np> </pp> <vp> <vpHead> <np> utan </np> fortsätter </vpHead> </vp> med oför <np> minskad </np> <vp> <Error finite verb> <vpHead> <np> styrka </np> </vpHead> </Error> </vp> .

The false alarm in verb form after an auxiliary verb concerned the split noun sovsäcken 'the sleeping bag', where the first part sov is homonymous with the verb 'sleep' and was joined with the preceding verb into a verb cluster, as shown in (7.21).

(7.21)
a. Det finns dock två nackdelarna med tältning, pjäxorna måste i sov säcken för att inte krympa ihop av kylan ...
   there exist however two disadvantages with camping the-skiing-boots must into sleeping bag because not shrink together from the-cold
   – 'There are two disadvantages with camping: the skiing boots must be inside the sleeping bag in order not to shrink from the cold ...'
b.
<np> Det </np> <vp> <vpHead> finns dock </vpHead> <np> två nackdelarna </np> </vp> med tältning, pjäxorna <vp> <vpHead> <Error verb after Vaux> <vc> måste i sov </vc> </Error> </vpHead> <np> säcken </np> </vp> <np> för </np> att <vp> <vpHeadATTFinite> inte krympa </vpHead> </vp> ihop <pp> <ppHead> av </ppHead> <np> kylan </np> </pp>

7.5.2 Performance Results of Other Tools

Grammatifix

The results for Grammatifix are presented in Table 7.10 below: 12 detected errors in noun phrase agreement, one detected error in finite verb form, and one false alarm for verb form after an auxiliary verb. This leaves the system with a total recall of 57% and a precision of 93% for these three error types.

Table 7.10: Performance Results of Grammatifix on Other Text

                                   GRAMMATIFIX: OTHER TEXT
                                  CORRECT   FALSE ALARM      PERFORMANCE
  ERROR TYPE             ERRORS   ALARM     No      Other    Recall  Precision  F-value
                                            Error   Error
  Agreement in NP            17      12       0       0       71%    100%       83%
  Finite Verb Form            5       1       0       0       20%    100%       33%
  Verb Form after Vaux        1       0       0       1        0%      0%        –
  TOTAL                      23      13       0       1       57%     93%       70%

The five missed errors in noun phrase agreement concerned the segment with a possessive noun modifying another noun (see (7.18a)) and the one with a numeral and a noun in the definite form (see (7.18b)). The other cases concerned a possessive proper noun with an erroneously definite noun (see (7.22a)), another definiteness error in a noun (see (7.22b)), and a strong form of the adjective used in a definite noun phrase (see (7.22c)). Correct forms are presented to the right, next to the erroneous phrases.

(7.22)
a. Lapplands ∗drottningen ⇒ Lapplands drottning
   Lappland's queen [def] ⇒ Lappland's queen [indef]
b. ∗en ny dagen ⇒ en ny dag
   a [indef] new [indef] day [def] ⇒ a [indef] new [indef] day [indef]
c. den ∗djup snön ⇒ den djupa snön
   the [def] deep [str] snow [def] ⇒ the [def] deep [wk] snow [def]
No false alarms occurred other than the one with a verb form after an auxiliary verb, concerning exactly the same segment and error suggestions as our detector, as exemplified in (7.21) above.

Granska

The results for Granska are presented in Table 7.11. This system detected 11 agreement errors in noun phrases and the one error in verb form after an auxiliary verb; one false alarm occurred in noun phrase agreement. No errors in finite verb form were identified. The total recall is 52% and precision 92% for these three error types.

Table 7.11: Performance Results of Granska on Other Text

GRANSKA: OTHER TEXT
                                       FALSE ALARM
ERROR TYPE            ERRORS  CORRECT  No     Other  Recall  Precision  F-value
                              ALARM    Error  Error
Agreement in NP         17      11      0      1      65%      92%       76%
Finite Verb Form         5       0      0      0       0%       –         –
Verb Form after Vaux     1       1      0      0     100%     100%      100%
TOTAL                   23      12      0      1      52%      92%       67%

The six errors in noun phrase agreement that were missed concerned the same segment with a possessive noun modifying another noun (see (7.18a)) and both cases with a numeral and a noun in definite form (see (7.18b-c)). Further errors concerned a possessive noun with an erroneous definite noun (see (7.23a)), a neuter gender possessive pronoun with a common gender noun (see (7.23b)) and an indefinite determiner with a definite noun (see (7.23c)).

(7.23) a. ∗ripornas kurren ⇒ ripornas kurr
grouse's hoot [def] ⇒ grouse's hoot [indef]
b. ∗mitt huva ⇒ min huva
my [neu] hood [com] ⇒ my [com] hood [com]
c. ∗en smulan ⇒ en smula
a [indef] bit [def] ⇒ a [indef] bit [indef]

One false alarm occurred in a noun phrase with a split adjective and a missing noun, as shown in (7.24). Here the adjective vinteröppna 'winter-open' (i.e. open for the winter) is split and the first part causes an agreement error in definiteness.

(7.24) ... i den andra vinter öppna — husera en arg gubbe ...
in the [def] other winter [indef] open — haunt [inf] an angry old man
– ... the other cottage open for the winter was haunted by an angry old man ...
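The recall, precision and F-value figures in the tables follow the standard definitions: recall is the share of actual errors detected, precision is the share of flaggings that are correct, and the F-value is their harmonic mean. As a quick sanity check, they can be recomputed from the raw counts; the sketch below uses the Grammatifix totals from Table 7.10 (23 errors, 13 correct alarms, 1 false alarm) and is illustrative only — `metrics` is a hypothetical helper, not part of any of the evaluated tools.

```python
def metrics(errors, detected, false_alarms):
    """Recall, precision and F-value as used in the evaluation tables."""
    recall = detected / errors
    flagged = detected + false_alarms
    precision = detected / flagged if flagged else None
    f_value = (2 * precision * recall / (precision + recall)
               if precision and (precision + recall) else None)
    return recall, precision, f_value

# Totals for Grammatifix on the adult text (Table 7.10):
# 23 errors, 13 detected, 1 false alarm.
r, p, f = metrics(23, 13, 1)
print(round(r * 100), round(p * 100), round(f * 100))  # 57 93 70
```

The same function reproduces the other tables; for instance the Scarrie totals (23 errors, 11 detected, 6 false alarms) yield 48% recall, 65% precision and an F-value of 55%, matching Table 7.12.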
Scarrie

The results for Scarrie are presented in Table 7.12. This system detected 10 agreement errors in noun phrases and one error in finite verb form. It had six false markings concerning noun phrase agreement. The total recall is 48% and precision 65%.

Table 7.12: Performance Results of Scarrie on Other Text

SCARRIE: OTHER TEXT
                                       FALSE ALARM
ERROR TYPE            ERRORS  CORRECT  No     Other  Recall  Precision  F-value
                              ALARM    Error  Error
Agreement in NP         17      10      2      4      59%      63%       61%
Finite Verb Form         5       1      0      0      20%     100%       33%
Verb Form after Vaux     1       0      0      0       0%       –         –
TOTAL                   23      11      2      4      48%      65%       55%

The seven errors in noun phrase agreement that were missed concerned the three that our system did not find (see (7.18)) and two that Granska did not find (see (7.23a) and (7.23c)). The others are presented below, where two concerned gender agreement between a determiner and a (proper) noun (see (7.25a) and (7.25b)), and one definiteness agreement with a weak form adjective together with an indefinite noun (see (7.25c)).

(7.25) a. ∗en Mekka ⇒ ett Mekka
a [com] Mekka ⇒ a [neu] Mekka
b. ∗en mantra ⇒ ett mantra
a [com] mantra [neu] ⇒ a [neu] mantra [neu]
c. ∗orörd fjällnatur ⇒ orörda fjällnatur
untouched [str] mountain-nature [indef] ⇒ untouched [wk] mountain-nature [indef]

All false alarms concerned noun phrase agreement, where four of them concerned other error categories, as for instance the ones presented in (7.19) or in (7.24).

7.5.3 Overview of Performance on Other Text

In Figure 7.5 I present the recall values for all three of the grammar checkers and our FiniteCheck for the three evaluated error types. All the tools detected 60% or more of the errors in noun phrase agreement, whereas verb form errors obtained differing results. The other tools detected at most one verb form error in total, of either the finite verb kind or after an auxiliary verb. FiniteCheck identified all six of the verb form errors.
The errors in verb form are in fact quite few (six instances in total), but even for such a small amount there are indications that the other tools have problems identifying errors in verb form.

Flagging accuracy for these error types is presented in Figure 7.6. Concerning errors in noun phrase agreement, Grammatifix had no false flaggings and obtains a precision of 100%. Granska's precision rate is also quite high, with only one false alarm. Scarrie and FiniteCheck obtained a lower precision, around 70%, due to six false alarms by each tool. Concerning verb errors, the three systems obtained full rates without any false flaggings when detection occurred. FiniteCheck had one false alarm in each error type, thus obtaining lower precision rates. The flagging accuracy of FiniteCheck on this text is a bit lower in comparison to Grammatifix and Granska, but comparable to the results of Scarrie.

Figure 7.5: Overview of Recall in Other Text

Figure 7.6: Overview of Precision in Other Text

Concerning the overall performance on the evaluated text, presented in Figure 7.7: with 23 grammar errors in total, the three grammar checkers obtained on average 52% in recall, whereas FiniteCheck scored 87%. The opposite scenario applies for precision, where FiniteCheck had a slightly worse rate (71%) than Grammatifix and Granska, which had a precision above 90%. Scarrie's precision rate was 65%. In the combined measure of recall and precision (F-value), our system obtained a rate of 78%, which is slightly better than the other tools, which had 70% or less in F-value.

Figure 7.7: Overview of Overall Performance in Other Text

In conclusion, this test compared only a few of the constructions covered by the other systems, namely the error types targeted by FiniteCheck. The result is promising for our detector, which obtained comparable or better performance rates for coverage in this text.
Flagging accuracy was slightly worse, especially in comparison to Grammatifix and Granska. Moreover, the text was small with few errors, and future tests on larger unseen texts are of interest for a better understanding of the system's performance.

7.6 Summary and Conclusion

The performance of FiniteCheck was tested during the developmental stage and on the current version. The system is in general good at finding errors, and the flagging accuracy of the system can be improved by relatively simple means. The initial performance was improved solely by extension of the grammar and some ambiguity resolution. The broad grammar was extended by filtering transducers that extended head phrases with complements, merged split constituents or otherwise adjusted the parsing output as a disambiguation step. The narrow grammar was improved either by extension of existing grammar rules or by additional selections of segments. These new selections provide a basis for definitions of new grammars, and thus the possibility of extending the detection to other types of errors. In the current version, noun phrases followed by relative clauses, coordinated infinitives and verbs in supine form were selected as separate segments and can be further extended with corresponding grammar rules.

Detection of the four implemented error types in FiniteCheck was tested on both Child Data and a short adult text, not only for our detector but also for the three other Swedish grammar checkers.9 In the case of Child Data, FiniteCheck achieved maximal or high grammatical coverage, being based on this corpus, and a total precision of around 30%. The other tools in general detected few errors in Child Data for the included error types, with an average recall of 16%. Flagging accuracy is also around 30% for two of these tools and is lower for one of them.
The outcome of FiniteCheck is hard to compare to the performance of the other tools, since our system is based on the Child Data corpus, which was also used for evaluation. But there are indications of differences in the detection of errors in verb form at least, especially in finite verbs, where the other tools obtained quite low recall, on average 9%. A similar effect occurred when the tools were tested on the adult text, where the other tools again had difficulties detecting errors in verb form (although these were few), whereas FiniteCheck identified all of them. Otherwise, FiniteCheck obtained recall on the adult text comparable to (or even better than) the three tools, and a slightly lower accuracy in comparison to two of the tools. The performance rates of all the tools are in general higher on this adult text than on Child Data, with a recall around 50% and a precision around 80%. Corresponding rates for Child Data are around 16% in recall10 and 30% in precision.

The validation tests on Child Data and the adult text indicate clearly that the children's texts and the errors in them really are different from the adult texts and errors, and that they are more challenging for current grammar checkers, which have been developed for texts and errors written by adult writers. The low performance of the Swedish tools on Child Data clearly demonstrates the need for adaptation of grammar checking techniques to other users, such as children. The performance of FiniteCheck is promising but at this point only preliminary.

9 Recall that these tools target many more error types. Evaluation of these grammar checkers on all errors found in Child Data is presented in Chapter 5 (Section 5.5).
10 Here, the recall rates of FiniteCheck were not included, since it is developed on this data.
More tests are needed in order to see the real performance of this tool, both on other unseen children's texts and on texts written by other users, such as adult writers or even second language learners.

Chapter 8

Summary and Conclusion

8.1 Introduction

This concluding chapter begins with a short summary of the thesis (Section 8.2), followed by a section on conclusions (Section 8.3); finally, some future plans are discussed (Section 8.4).

8.2 Summary

8.2.1 Introduction

This thesis concerns the analysis of grammar errors in Swedish texts written by primary school children and the development of a finite state system for finding such errors. Grammar errors are more frequent for this group of writers, and the distribution of the error types is different from texts written by adults. Other writing errors above word level are also discussed here, including punctuation and spelling errors resulting in existing words. The method used in the implemented tool FiniteCheck involves subtraction of finite state automata that represent two 'positive' grammars with varying degrees of detail. The difference between the automata corresponds to the search for writing problems that violate the grammars. The technique shows promising results on the implemented agreement and verb selection phenomena.

The work is divided into three subtasks: analysis of errors in the gathered data, investigation of the possibilities for detecting these errors automatically and, finally, implementation of detection of some errors. The summary of the thesis presented below follows these three subtasks.

8.2.2 Children's Writing Errors

Data, Error Categories and Error Classification

The analysis of children's writing errors is based on empirical data of in total 29,812 words gathered in a Child Data corpus consisting of three separate collections of handwritten and computer-written compositions by primary school children between 9 and 13 years of age (see Section 3.2).
The analysis concentrates primarily on grammar. Other categories under investigation concern punctuation and spelling errors which give rise to real word strings. Error classification of the error categories involved is discussed in Chapter 3 (Section 3.3), where I present a taxonomy (Figure 3.1, p. 31) and principles for classifying writing errors. Although this taxonomy was designed particularly for errors on the borderline between spelling and grammar errors, it can be used for classification of both. It takes into consideration the kind of new formation involved (new lemma or other forms of the same lemma), the type of violation (change in letter, morpheme or word) and what level is influenced (lexical, syntactic or semantic).

What Grammar Errors Occur?

In the survey of the relatively few existing studies on grammar errors in Chapter 2 (Section 2.4), I show that the most typical grammar errors in these studies are errors in noun phrase and predicative complement agreement, verb form and choice of prepositions in idiomatic expressions. Furthermore, some indications of errors influenced by spoken language are also evident in children's writing. However, grammar has in general low priority in research on writing in Swedish. In particular, there are no recent studies concerning grammar errors by children and certainly no studies whatsoever for the youngest primary school children (see Section 2.3).

In the present analysis of Child Data in Chapter 4 (Section 4.3), a total of 262 grammar errors occur, spread over more than ten error types. The expected "typical" errors occur, but they are not all particularly frequent. The most common errors occur in finite verb form, omission of obligatory constituents in sentences, choice of words, agreement in noun phrases and extra added constituents in sentences. In comparison to adult writers (Section 4.4), there are clear differences in error frequency and the distribution of error types.
Grammar errors occur on average as many as 9 times per 1,000 words in a child text, which is considerably more frequent compared to adult writers, who make on average one grammar error per 1,000 words. For some error types (e.g. noun phrase agreement) the frequency differs only marginally, whereas more significant differences arise, for instance, for errors in verb form, which are on average eight times more common in Child Data. The frequency distribution across all error types is also different, although the representation of the most common error types is similar, except for finite verb form errors. The most common error types for the adults in the studies presented were missing or redundant constituents in sentences, agreement in noun phrases and word choice errors. In contrast, the most common verb error among adult writers is in the verb form after an auxiliary verb and not in the finite verb form, as is the case for children.

What Real Word Spelling Errors Occur?

Spelling errors resulting in existing words are usually not captured by a spelling checker. For that reason they have been included in the present analysis, since they often require analysis of context larger than a word in order to be detected. The ones found in the Child Data corpus (presented and discussed in Section 4.5) are three times less frequent than the non-word spelling errors, which are the most common error type overall. These errors indicate a clear confusion as to what form to use in which context, as well as the influence of spoken language. Splits were in general more common among the real word errors.

How Is Punctuation Used?

The main purpose of the analysis of punctuation (Section 4.6) was to investigate how children delimit text and use major delimiters and commas to signal clauses and sentences. The analysis of Child Data reveals that mostly the younger children join sentences into larger units without using any major delimiters to signal sentence boundaries.
The oldest children formed the longest units with the fewest adjoined clauses. Erroneous use of punctuation is mostly represented by omission of delimiters, but also by markings occurring at syntactically incorrect places. The punctuation analysis concludes at this point with the recommendation not to rely on sentence marking conventions in children's texts when describing the grammar and rules of a system aiming at analyzing such texts.

8.2.3 Diagnosis and Possibilities for Detection

Possibilities and Means for Detection

The errors found in Child Data were analyzed according to what means and how much context is needed for their detection. Most of the non-structural errors (i.e. substitutions of words, concerning feature mismatch) and some structural errors (i.e. omission, insertion and transposition of words) can be detected successfully by means of partial parsing. These errors concern agreement in noun phrases, verb form or missing constituents in verb clusters, some pronoun case errors, repeated words that cause redundant constituents, some word order errors and, to some extent, agreement errors in predicative complements. Furthermore, real word spelling errors giving rise to syntactic violations can also be traced by partial parsing. Other error types require more elaborate analysis in the form of parsing larger portions of a clause or even full sentence parsing (e.g. missing or extra inserted constituents), analysis above sentence level requiring analysis of preceding discourse (e.g. definiteness in single nouns, reference), or even semantics and world knowledge (e.g. word choice errors). Among the most common errors in the Child Data corpus, errors in verb form and noun phrase agreement can be detected by partial parsing, whereas errors in the structure of sentences, such as insertions or omissions of constituents, and word choice errors require more elaborate analysis.
Coverage and Performance of Swedish Tools

The three existing Swedish grammar checkers Grammatifix, Granska and Scarrie are designed for and primarily tested on texts written by (mostly professional) adult writers. According to their error specifications, they cover many of the error types found in Child Data. The errors that none of these tools targets include definiteness errors in single nouns and reference errors. The tools were tested on Child Data in order to gauge their real performance. The result of this test indicates low coverage overall, and in particular for the most common error types. The systems are best at identifying errors in noun phrase agreement and obtain an average recall rate of 58%. However, the most common error in children's writing, finite verb form, is on average covered only to 9% (see Tables 5.4, 5.5 and 5.6 starting on p. 169, or Figure 7.2 on p. 234). The overall grammatical coverage (recall) of the adult grammar checkers across all errors in Child Data averages around 12%, a figure almost five times lower than in the tests on adult texts provided by the developers of these tools, where the average recall rate is 57% (see Table 5.3 on p. 141).

This test showed that although these three proofing tools target the grammar errors occurring in Child Data, they have problems detecting them. The reasons for this effect could in some cases be ascribed to the complexity of the error (e.g. insertion of optional constituents). More often, however, the low performance has to do with the high error frequency in some error types (e.g. errors in finite verb form are much less frequent in adult texts; see Figure 4.5 on p. 87) and the complexity in the sentence and discourse structure of the texts used in this study (e.g. violations of punctuation and capitalization conventions resulting in adjoined clauses).
8.2.4 Detection of Grammar Errors

Targeted Errors

Among the errors found in Child Data, errors in noun phrase agreement and in verb form in finite and non-finite verbs were chosen for implementation. There were two reasons for concentrating on these error types. First, they (almost all) occur among the five most common error types. Second, these error types are all limited to certain portions of text and can thus be detected by means of partial parsing. In the current implementation, agreement errors are detected in noun phrases with a noun, adjective, pronoun or numeral as the head, as well as in noun phrases with partitive attributes. The noun phrase rules are defined in accordance with the feature requirements they have to fulfill (i.e. definiteness, number and gender). The noun phrase grammar is prepared for further detection of errors in noun phrases with a relative subordinate clause as complement, which display different agreement conditions. In the present implementation these are selected as segments separate from the other noun phrases. The main purpose of this selection was to avoid marking correct noun phrase segments of this type as erroneous.

The verb grammar detects errors in finite form, both in bare main verbs and in auxiliary verbs in a verb cluster, as well as in non-finite forms in a verb cluster and in infinitive phrases following an infinitive marker. The grammar is designed to take into consideration insertion of optional constituents such as adverbs or noun phrases, and also handles inverted word order. The verb grammar, too, is prepared for expansion to cover detection of other errors in verbs. Coordinated verbs preceded by a verb cluster or infinitive phrase are selected as individual segments and invite further expansion of the system's grammar to detection of errors manifested as finite verbs instead of the expected non-finite verb form.
Similarly, verbal heads in bare supine form are selected as separate segments and lay a basis for the detection of omitted temporal auxiliary verbs in main clauses.

Detection Approach

The implemented grammar error detector FiniteCheck is built as a cascade of finite state transducers compiled from regular grammars using the expressions and operators defined in the Xerox Finite-State Tool. The detection of errors in a given text is based on the difference between two positive grammars differing in degree of accuracy. This is the same method that Karttunen et al. (1997a) use for distinguishing valid and invalid date expressions. The two grammars always describe valid rules of Swedish. The first, more relaxed (underspecified) grammar is needed in a text containing errors to identify all segments that could contain errors, and marks both the grammatical and the ungrammatical segments. The second grammar is a precise grammar of valid rules of Swedish and is used to distinguish the ungrammatical segments from the grammatical ones.

The parsing strategy of FiniteCheck is partial rather than full, annotating portions of text with syntactic tags. The procedure is incremental, recognizing first the heads (lexical prefix) and then expanding them with complements, always selecting maximal instances of segments. In order to prevent overlooking errors, the ambiguity in the system is maximal at the lexical level, assigning all the lexical tags presented in the lexicon. Structural ambiguity at a higher level is treated partially by parsing order and partially by filtering techniques, blocking or rearranging insertion of syntactic tags.

Performance Results

FiniteCheck was tested both on the (training) Child Data written by children and on an adult text not known to the system. In the case of Child Data, the system showed high coverage (recall) in the initial phase of development, whereas many correct segments were selected as erroneous.
Many of these false alarms were avoided by extending the grammar of the system, blocking on average half of all the false markings. The remaining false alarms are more related to ambiguity in parsing or to selection of other error categories (i.e. misspelled words, splits and missing sentence boundaries). Only in the case of verb clusters did the system mark constructions not yet covered by the grammar of the system. Being based on this corpus, the system achieves maximal or high grammatical coverage, with a total recall rate of 89% for the four implemented error types. Precision is 34%. The three other Swedish tools had on average lower results in recall, with a total rate of 16% on Child Data for the four error types targeted by FiniteCheck. The corresponding total precision value is on average 27%.

Further, the performance of FiniteCheck on a text not known to the system shows that the system is good at finding errors, whereas the precision is lower. The three undetected errors in noun phrase agreement occurred due to the small size of the grammar. False flaggings involved both ambiguity problems and selections due to the occurrence of other error categories. The total grammatical coverage (recall) of FiniteCheck on this text was 87% and precision was 71%. The three other Swedish tools are (again) good at finding errors in noun phrase agreement, whereas the verb errors obtain quite low results. The average total recall rate is 52% and precision 83% for the three evaluated error types.

The validation tests show that the performance of FiniteCheck on the four implemented error types is promising and comparable to current Swedish checkers. The low performance results of the Swedish systems on children's texts indicate that the nature of the errors found in texts written by primary school writers is different from that in adult texts, and that these errors are more challenging for current systems that are oriented towards texts written by adult writers.
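FiniteCheck itself is implemented with the Xerox Finite-State Tool, but the core idea of the detection approach — the difference between a broad and a narrow positive grammar yields an error detector — can be sketched compactly. The following minimal illustration, under stated assumptions, uses Python regular expressions over part-of-speech tag strings as a stand-in for the compiled automata; the mini-lexicon and the `flag_np` helper are hypothetical toys, not the thesis grammar.

```python
import re

# Hypothetical mini-lexicon: word -> tag with a gender feature.
LEX = {
    "en": "DET.com", "ett": "DET.neu",
    "bil": "N.com", "hus": "N.neu",
}

# Broad (relaxed) grammar: any determiner followed by any noun.
BROAD = re.compile(r"DET\.\w+ N\.\w+")
# Narrow (precise) grammar: determiner and noun must agree in gender.
NARROW = re.compile(r"DET\.com N\.com|DET\.neu N\.neu")

def flag_np(words):
    """True if the phrase lies in BROAD - NARROW, i.e. it is selected
    by the relaxed grammar but rejected by the precise one."""
    tags = " ".join(LEX[w] for w in words)
    return bool(BROAD.fullmatch(tags)) and not NARROW.fullmatch(tags)

print(flag_np(["en", "bil"]))  # False: 'en bil' agrees in gender
print(flag_np(["en", "hus"]))  # True: gender mismatch, flagged
```

Note that both expressions describe only valid patterns of Swedish; the error set is never written down explicitly but falls out of the set difference, mirroring the automaton subtraction used in the actual system.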
8.3 Conclusion

The present work contributes to research on children's writing by revealing the nature of grammar errors in their texts and fills a gap in this research field, since not many studies are devoted to grammar in writing. It shows further that it is important to develop aids for children, since there are differences in both frequency and error types in comparison to adult writers, and current tools have difficulties coping with such texts.

The findings here also show that it is plausible and promising to use positive rules for error detection. The advantage of applying positive grammars in detection of errors is, first, that only the valid grammar has to be described and I do not have to speculate on what errors may occur. The prediction of errors is limited exactly to the portions of text that can be delimited. For example, errors in number in noun phrases with a partitive complement were not identified by any of the three Swedish checkers, since adults probably do not make these types of errors. The grammar of FiniteCheck describes the overall structure of such phrases in Swedish, including agreement between the quantifying numeral or determiner and the modifying noun phrase. It also states that the noun phrase has to be in plural number in order to be considered correct. The Swedish tools take into consideration only the agreement between the constituents and not the whole structure of the phrase. Secondly, the rule sets remain quite small. Thirdly, the grammars can be used for other purposes. That is, since the grammars of the system describe the real grammar of Swedish, they can also be used for detection of valid noun phrases and verbs and be applied, for instance, to extracting information from text or even to parsing.

The performance of FiniteCheck is promising in that good results were obtained not only on the 'training' Child Data, but also when running FiniteCheck on adult texts, where the results were comparable to the other current tools.
This result perhaps also indicates that the approach could be used as a generic method for detection of errors. The ambiguity in the system is not fully resolved, but this does not disturb the error detection. However, false parses are hard to predict, and they may give rise to errors not being detected or to the occurrence of false alarms.

8.4 Future Plans

8.4.1 Introduction

The current version of the implemented grammar error detector is not intended to be considered a full-fledged grammar checker or a generic tool for detection of errors in any text written by any writer. The present version of FiniteCheck is based on a lexicon of limited size, ambiguity in the system is not fully resolved, and it detects a limited set of grammar errors, yielding simple diagnoses. The next challenges will be to expand the lexicon, experiment with disambiguation versus error detection, expand the coverage of the system to other error types, explore the diagnosis stage and test the detection of errors in new texts written by different users. Furthermore, application of the grammars of the system for other purposes is also interesting to explore.

8.4.2 Improving the System

The lexicon of the system has to be expanded with missing forms, new lemmas and other valuable information, such as valence or compound information. The latter has practically been accomplished, this information being stored in the original text version of part of the lexicon. There is a high level of ambiguity in the system, especially at the lexical level, since we do not use a tagger, which might eliminate information in incorrect text that is later needed to find the error. The fact is that unresolved ambiguity can sometimes lead to false parsing, which in turn could mean false alarms. The degree of lexical ambiguity and its impact on parsing, and by extension on detection of errors, can be studied by experiments with weighted lexical annotation, for instance, i.e. lexical tags ordered by probability measures (e.g. weighted automata).
Such taggers are, however, often based on texts written by adults and could give rise to deceptive results. Also, disambiguation is not fully resolved at the structural level, blocking some insertions by parsing order and further adjusting the output by filtering automata. Extension of the grammars in the system has shown a positive impact on parsing, and further evaluation is needed in order to decide the degree of ambiguity and the prospects for prediction of false parsing, both of which influence error detection. Another possibility is to explore the use of alternative parses, implemented for instance as charts.

The rules of the broad grammar overgenerate to a great extent. One thing to experiment with is the degree of broadness, in order to see how it influences the detection process. Will the parsing of text be better at the cost of worse error detection? How much could the grammar set be extended to improve the parsing without influencing the error detection? Since the grammars of the system are positive, experiments in using them for other purposes are in order. For instance, the more accurate narrow grammar could be applied to information extraction or even parsing.

8.4.3 Expanding Detection

The first step in expanding the detection of FiniteCheck would naturally involve the types that are already selected for such expansion, i.e. noun phrases with relative clauses, coordinated infinitives and bare supine verbs. Furthermore, the verb grammar can be expanded with other constructions, such as the auxiliary verb komma 'will', which requires an infinitive marker preceding the main verb, or main verbs that combine with infinitive phrases (see Section 4.3.5). Further expansion would naturally concern errors that require the least analysis. Beyond noun phrase and verb form errors, only some constructions can be detected by simple partial parsing; otherwise more complex analysis is required.
The system can be further expanded to include detection of errors in predicative agreement, some pronoun case errors, some word order errors and probably some definiteness errors in single nouns. With regard to children, the most crucial would be coverage of errors with missing or redundant constituents in clauses, and word choice errors, which represent two of the more frequent error types. These errors will, as my analysis reveals, most probably require quite complex investigation, with descriptions of complement structure. It would be worthwhile to do more analysis of children's writing in order to investigate whether some such errors are, for instance, limited to certain portions of text and could then be detected by means of partial parsing.

Considering children as the users of a grammar checker for educational purposes, the most important development will concern the error diagnosis and the error messages to the user. A tool that supports beginning writers in their acquisition has to place high demands on the diagnosis of and information on errors in order to be useful. The message to the user has to be clear and adjusted to the skills of the child. A child not familiar with a given writing error or the grammatical terminology associated with it will certainly not profit from detection of such an error or from information containing grammatical terms. Studies of children's interaction with authoring aids are needed in order to explore how alternatives for detection, diagnosis and error messages could best benefit this user group. For instance, such a tool could be used for training grammar, allowing customization and options for what error types to detect or train on. There could also be different levels of diagnosis and error messages depending on the individual child's level of writing acquisition. Other users could also find such a tool interesting, for instance in language acquisition as second language learners.
The diagnosis stage could also be informed by analysis of children's ongoing writing processes, which could be a step toward revealing the cause of an error. By logging all activity during on-screen writing, all revisions could, for instance, be stored and then analyzed for repeated patterns, and in particular for whether making a spelling error differs from making a grammar error. Could a grammar checker gain from such on-line information? This analysis would also be of interest for errors on the borderline between grammar and spelling, and could aid the detection of other categories of errors incorrectly flagged as grammar errors.

8.4.4 Generic Tool?

The detection and overall performance of the system have so far been tested on the 'training' Child Data corpus and on a small adult text not known to the system. The results for the four implemented error types are promising on both texts, which represent two different writing populations. This could also imply that the method is usable generically. FiniteCheck obtained performance comparable to other Swedish grammar checkers both on the adult text and on Child Data. Although FiniteCheck was developed on these texts, there were considerable differences in coverage for some error types that the other tools had difficulty finding. The system needs to be tested further on other children's texts not known to the system, and also on texts from other writers, primarily texts of different genres written by adults. Furthermore, it would be interesting to test FiniteCheck on texts written by second language learners, dyslexics or even the hearing impaired, in order to explore how generic the tool is.

8.4.5 Learning to Write in the Information Society

Some of the future work discussed above has already been initiated within the framework of a three-year project, Learning to Write in the Information Society, started in 2003 and sponsored by Vetenskapsrådet.
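Such further testing presupposes a fixed evaluation measure. The sketch below shows the standard recall and precision computation used when comparing a checker's flagged positions against a manually annotated error corpus; the counts in the example are hypothetical, not the figures reported for FiniteCheck.

```python
def precision_recall(flagged, true_errors):
    """flagged, true_errors: sets of error positions (or identifiers).
    Precision: share of flagged positions that are real errors.
    Recall: share of real errors that were flagged."""
    tp = len(flagged & true_errors)  # true positives
    precision = tp / len(flagged) if flagged else 1.0
    recall = tp / len(true_errors) if true_errors else 1.0
    return precision, recall

# Hypothetical run: the checker flags positions 1, 4, 9; the gold
# annotation marks 1, 4, 7 as errors.
p, r = precision_recall({1, 4, 9}, {1, 4, 7})
print(round(p, 2), round(r, 2))  # 0.67 0.67
```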
The project group, consisting of Robin Cooper, Ylva Hård af Segerstad and me, aims to investigate school children's written language in different modalities, and the effects of the use of computers and other communication media such as webchat and text messaging over mobile phones. The main aims are to see how writing is used today and how information technology can better be used for support. Texts written by primary school children will be gathered, in both handwritten and computer-written form. The study will also involve writing experiments with email, SMS (Short Message Service) and webchat, as well as further studies of interaction with different writing aids. The results of this study should reveal how writing aids influence children's writing, what needs and requirements this writing population places on such tools, and how writing aids can be improved to enhance writing development and instruction in school.
Appendices

Appendix A
Grammatical Feature Categories

GENDER:
  com        common gender
  neu        neuter gender
  masc       masculine gender
  fem        feminine gender

DEFINITENESS:
  def        definite form
  indef      indefinite form
  wk         weak form of adjective
  str        strong form of adjective

CASE:
  nom        nominative case
  acc        accusative case
  gen        genitive case

NUMBER:
  sg         singular
  pl         plural

TENSE:
  imp        imperative
  inf        infinitive
  pres       present
  pret       preterite
  perf       perfect
  past perf  past perfect
  sup        supine
  past part  past participle
  untensed / non-finite verb

VOICE:
  pass       passive

OTHER:
  adj        adjective
  adv        adverb

Appendix B
Error Corpora

This appendix presents the errors found in Child Data and consists of three corpora:

B.1 Grammar Errors
B.2 Misspelled Words
B.3 Segmentation Errors

Every listed instance of an error (ERROR) is indexed and followed by a suggestion for possible correction (CORRECTION) and information about which sub-corpus (CORP) it originates from, who the writer was (SUBJ), the writer's age (AGE) and sex (SEX; m for male and f for female). The different sub-corpora are abbreviated as DV Deserted Village, CF Climbing Fireman, FS Frog Story, SN Spencer Narrative, SE Spencer Expository.

B.1 Grammar Errors

Grammar errors are categorized by the type of error that occurred.

1 AGREEMENT IN NOUN PHRASE

1.1 Definiteness agreement

Indefinite head with definite modifier
1.1.1 Jag tar den närmsta handduk och slänger den i vasken och blöter den,
1.1.2 En gång blev den hemska pyroman utkastad ur stan.
1.1.3 Jag såg på ett TV program där en metod mot mobbing var att satta mobbarn på den stol och andra människor runt den personen och då fråga varför.

Definite head with possessive modifier
1.1.4 Pär tittar på sin klockan och det var tid för familjen att gå hem.
1.1.5 hunden sa på pojkens huvet.

Definite head with modifier 'denna'
1.1.6 Nu när jag kommer att skriva denna uppsatsen så kommer jag ha en rubrik om några problem och ...
Definite head with indefinite modifier
1.1.7 Men senare ångrade dom sig för det var en räkningen på deras lägenhet.
1.1.8 Man ska inte fråga en kompisen om något arbete, man ska fråga läraren.

1.2 Gender agreement

Wrong article
1.2.1 pojken fick en grodbarn

Wrong article in partitive
1.2.2 Virginias mamma hade öppnat en tyg affär i en av Dom gamla husen.

Masculine form of adjective
1.2.3 sen berätta den minsta att det va den hemske fula troll karlen tokig som ville göra mos av dom för han skulle bo i deras by.
1.2.4 nasse blev arg han gick och la sig med dom andre syskonen.

1.3 Number agreement

Singular modifier with plural head
1.3.1 Den dära scenen med det tre tjejerna tyckte jag att de var taskiga som går ifrån den tredje tjejen

CORRECTION (CORP, SUBJ, AGE, SEX):
1.1.1 handuken (CF, alhe, 9, f)
1.1.2 pyromanen (CF, frma, 9, m)
1.1.3 en stol/den stolen (SE, wj16, 13, f)
1.1.4 klocka (DV, frma, 9, m)
1.1.5 huve/huvud (FS, haic, 11, f)
1.1.6 uppsats (SE, wj03, 13, f)
1.1.7 räkning (DV, jowe, 9, f)
1.1.8 kompis (SE, wg05, 10, m)
1.2.1 ett (FS, haic, 11, f)
1.2.2 ett (DV, idja, 11, f)
1.2.3 fule (DV, alhe, 9, f)
1.2.4 andra (CF, haic, 11, f)
1.3.1 de (SE, wg09, 10, m)

Singular noun in partitive attribute
1.3.2 Alla männen och pappa gick in i ett av huset.
1.3.3 en av boven tog bensinen och gick bakåt.

CORRECTION (CORP, SUBJ, AGE, SEX):
1.3.2 husen (DV, haic, 11, f)
1.3.3 bovarna (CF, haic, 11, f)

2 AGREEMENT IN PREDICATIVE COMPLEMENT

2.1 Gender agreement
2.1.1 då börja Urban lipa och sa: Mitt hus är blöt.
2.1.2 den som hörde de där stygga orden vågade kanske inte spela på en konsert för att vara rädd att bli utskrattat av avundsjuka personer.

CORRECTION (CORP, SUBJ, AGE, SEX):
2.1.1 blött (CF, caan, 9, m)
2.1.2 utskrattad (SE, wg11, 10, f)

2.2 Number agreement

Singular
2.2.1 En som är mobbad gråter säkert varje dag känner sig menigslösa.

CORRECTION (CORP, SUBJ, AGE, SEX):
2.2.1 meningslös (SE, wj05, 13, m)
2.2.2 mobbade (SE, wj05, 13, m)
2.2.3 öppna, ärliga, elaka (SE, wj13, 13, m)
2.2.4 utsatta (SE, wj19, 13, m)
2.2.5 själva (SE, wj20, 13, m)
2.2.6 smutsiga (CF, haic, 11, f)
3.1.1 byn (DV, haic, 11, f)
3.1.2 skeppet (DV, haic, 11, f)
3.1.3 ön (DV, haic, 11, f)
3.1.4 borgmästaren (CF, frma, 9, m)
3.1.5 grenen (FS, frma, 9, m)
3.1.6 pojken (FS, frma, 9, m)
4.1.1 dem (SN, wg10, 10, m)
4.1.2 dem (CF, klma, 10, f)
4.1.3 dem (SE, wg16, 10, f)
4.1.4 honom (SE, wj14, 13, m)
4.1.5 honom (SE, wj14, 13, m)
5.1.1 får (CF, alhe, 9, f)
Plural
2.2.2 Om dom som mobbar någon gång blir mobbad själv skulle han ändras helt och hållet.
2.2.3 Sjävl tycker jag att killarnas metoder är mer öppen och ärlig men också mer elak än var tjejernas metoder är.
2.2.4 jag tror att dom som är s har själva varit ut satt någon gång och nu vill dom hämnas och...
2.2.5 ... för folk tänker mest på sig själv.
2.2.6 nasse är en gris som har massor av syskon. nasse är skär. Men nasses syskon är smutsig.

3 DEFINITENESS IN SINGLE NOUNS
3.1.1 dom gick till by
3.1.2 dom som bodde på ön kanske försökte komma på skepp
3.1.3 Jag såg en ö vi gick till ö
3.1.4 dom sa till borgmästare vad ska vi göra!
3.1.5 män han hade skrikit så börjar gren röra på sig
3.1.6 pojke hoppade ner till hunden

4 PRONOUN CASE

4.1 Case - Objective form
4.1.1 bilarna bromsade så att det blev svarta streck efter de.
4.1.2 Två av brandmännen sprang in i huset för att rädda de
4.1.3 jag tycker synd om de
4.1.4 då kan ju den eleven som blir utsatt gå fram och prata med han
4.1.5 bara för man inte vill vara med han

5 FINITE MAIN VERB

5.1 Present tense

Regular verbs
5.1.1 Madde och jag bestämde oss för att sova i kojan och se om vi få se vind.
5.1.2 När hon kommer ner undrar hon varför det lukta så bränt och varför det låg en handduk över spisen.
5.1.3 undra vad det brann nånstans jag måste i alla fall larma
5.1.4 Få se nu vilken väg är det, den här.
5.1.5 han kommer och klappar alla på handen utan en kille undra hur han känner sig då?
5.1.6 ... det kan även vara att nån kan sparka eller att man få vara enstöring...
5.1.7 ... där några tjejer/killar sitter och prata.
5.1.8 men det kanske bero på att det var en mindre skola ...
5.1.9 och inte bry sig om han man inte få vara med,

Strong verbs
5.1.10 Att stjäla är inte bra speciellt inte om man tar en sak av en person som gick för en i ett led och inte säga till att man hittade den utan att man behåller den.

CORRECTION (CORP, SUBJ, AGE, SEX):
5.1.2 luktar (CF, alhe, 9, f)
5.1.3 undrar (CF, erja, 9, m)
5.1.4 Får (FS, idja, 11, f)
5.1.5 undrar (SE, wj03, 13, f)
5.1.6 får (SE, wj08, 13, f)
5.1.7 pratar (SE, wj08, 13, f)
5.1.8 beror (SE, wj13, 13, m)
5.1.9 får (SE, wj14, 13, m)
5.1.10 säger (SE, wj03, 13, f)

5.2 Preterite

Regular verbs
5.2.1 vi berätta och ...
5.2.2 den äldsta som va 80 år berätta
5.2.3 jag berätta om byn
5.2.4 sen berätta den minsta
5.2.5 då börja alla i hela tunneln förutom pappa och ja gråta
5.2.6 sen cykla vi dit igen.
5.2.7 ...gick ner och hämta min och pappas cyklar ...
5.2.8 Pappa gick och Knacka på en dörr till medan jag hämta cyklarna
5.2.9 Pappa gick och knacka på en dörr för att vi var väldigt hungriga
5.2.10 Pappa gick och Knacka på en dörr till medan jag hämta cyklarna
5.2.11 jag knacka på dörren
5.2.12 men jag lugna mig och kände på marken
5.2.13 dom peka på väggen av tunneln
5.2.14 jag ramla i en rutschbana
5.2.15 långt åkte ja tills jag stanna vid en port ...
5.2.16 när vi kom hem undra självklart mamma vart vi varit
5.2.17 pappa och jag undra va nycklarna va
5.2.18 sen undra han va dom bodde
5.2.19 på morgonen när vi vakna...
5.2.20 men ingen öppna
5.2.21 någon eller något öppna dörren
5.2.22 vi till och med öppna pensionathem
5.2.23 Lena Ropa mamma Lena
5.2.24 Lena vakna Plötsligt vakna Hon av att någon sa Lena Lena.

CORRECTION (CORP, SUBJ, AGE, SEX):
5.2.1 berättade (DV, alhe, 9, f)
5.2.2 berättade (DV, alhe, 9, f)
5.2.3 berättade (DV, alhe, 9, f)
5.2.4 berättade (DV, alhe, 9, f)
5.2.5 började (DV, alhe, 9, f)
5.2.6 cyklade (DV, alhe, 9, f)
5.2.7 hämtade (DV, alhe, 9, f)
5.2.8 hämtade (DV, alhe, 9, f)
5.2.9 knackade (DV, alhe, 9, f)
5.2.10 knackade (DV, alhe, 9, f)
5.2.11 knackade (DV, alhe, 9, f)
5.2.12 lugnade (DV, alhe, 9, f)
5.2.13 pekade (DV, alhe, 9, f)
5.2.14 ramlade (DV, alhe, 9, f)
5.2.15 stannade (DV, alhe, 9, f)
5.2.16 undrade (DV, alhe, 9, f)
5.2.17 undrade (DV, alhe, 9, f)
5.2.18 undrade (DV, alhe, 9, f)
5.2.19 vaknade (DV, alhe, 9, f)
5.2.20 öppnade (DV, alhe, 9, f)
5.2.21 öppnade (DV, alhe, 9, f)
5.2.22 öppnade (DV, alhe, 9, f)
5.2.23 ropade (DV, angu, 9, f)
5.2.24 vaknade (DV, angu, 9, f)
5.2.25 lutade (DV, anhe, 11, m)
5.2.25 Per luta sig mot en
5.2.26 Sen Svimma jag
5.2.27 när jag vakna satt Jag Per och Urban mitt i byn.
5.2.28 och när vi kom hem så Vakna jag och allt var en dröm.
5.2.29 Plötsligt börja en lavin
5.2.30 när Gunnar öppna dörren till det stora huset rasa det ihop
5.2.31 och snart rasa hela byn ihop.
5.2.32 när Gunnar öppna dörren till det stora huset rasa det ihop
5.2.33 Niklas och Benny hoppa av kamelerna
5.2.34 och snabbt hoppa dom på kamelerna
5.2.35 och rusa iväg och red bort
5.2.36 snabbt samla han ihop alla sina jägare
5.2.37 men undra varför den är övergiven.
5.2.38 Ida gick och tänkte på vad dom skulle göra hon snubbla på nåt
5.2.39 Jag tog min väska och Madde tog sin, och vi börja gå mot vår koja, där vi skulle sova.
5.2.40 När vi kom fram börja vi packa upp våra grejer och rulla upp sovsäcken.
5.2.41 Madde vaknade av mitt skrik, hon fråga va det var för nåt.
5.2.42 På morgonen vaknade vi och klädde på oss sen packa vi ner våra grejer.
5.2.43 jag sa att det inte va nåt så somna vi om.
5.2.44 För ett ögon blick trodde jag att den hästen vakta våran koja.
5.2.45 på natten vakna jag av att brandlarmet tjöt
5.2.46 då börja Urban lipa och sa: Mitt hus är blöt.
5.2.47 Brandkåren kom och spola ner huset
5.2.48 Cristoffer stod och titta på ugglan i trädet
5.2.49 Erik gick till skogen och ropa allt han kunde.
5.2.50 Rådjuret sprang iväg med honom. Och kasta av pojken vid ett berg.
5.2.51 De klättra över en stock.
5.2.52 Pojken ropa groda groda var är du
5.2.53 De gick ut och ropa men de fick inget svar.
5.2.54 Ruff råka trilla ut ur fönstret.
5.2.55 Pojken satt varje kväll och titta på grodan
5.2.56 När pojken vakna nästa morgon och fann att grodan var försvunnen blev han orolig
5.2.57 Och utan att pojken visste om det hoppa grodan ur burken när han låg.
5.2.58 Nästa dag vakna pojken och såg att grodan hade rymt
5.2.59 hunden halka efter.
5.2.60 När han landa så svepte massa bin över honom.
5.2.61 Pojken leta och leta i sitt rum.
5.2.62 Pojken leta och leta i sitt rum.
5.2.63 Hunden leta också
5.2.64 Pojken gick då ut och leta efter grodan
5.2.65 Pojken leta i ett träd

CORRECTION (CORP, SUBJ, AGE, SEX):
5.2.26 svimmade (DV, anhe, 11, m)
5.2.27 vaknade (DV, anhe, 11, m)
5.2.28 vaknade (DV, anhe, 11, m)
5.2.29 började (DV, erha, 10, m)
5.2.30 rasade (DV, erha, 10, m)
5.2.31 rasade (DV, erha, 10, m)
5.2.32 öppnade (DV, erha, 10, m)
5.2.33 hoppade (DV, erja, 9, m)
5.2.34 hoppade (DV, erja, 9, m)
5.2.35 rusade (DV, erja, 9, m)
5.2.36 samlade (DV, erja, 9, m)
5.2.37 undrade (DV, idja, 11, f)
5.2.38 snubblade (DV, jowe, 9, f)
5.2.39 började (CF, alhe, 9, f)
5.2.40 började (CF, alhe, 9, f)
5.2.41 frågade (CF, alhe, 9, f)
5.2.42 packade (CF, alhe, 9, f)
5.2.43 somnade (CF, alhe, 9, f)
5.2.44 vaktade (CF, alhe, 9, f)
5.2.45 vaknade (CF, angu, 9, f)
5.2.46 började (CF, caan, 9, m)
5.2.47 spolade (CF, caan, 9, m)
5.2.48 tittade (FS, alca, 11, f)
5.2.49 ropade (FS, alhe, 9, f)
5.2.50 kastade (FS, angu, 9, f)
5.2.51 klättrade (FS, angu, 9, f)
5.2.52 ropade (FS, angu, 9, f)
5.2.53 ropade (FS, angu, 9, f)
5.2.54 råkade (FS, angu, 9, f)
5.2.55 tittade (FS, angu, 9, f)
5.2.56 vaknade (FS, angu, 9, f)
5.2.57 hoppade (FS, caan, 9, m)
5.2.58 vaknade (FS, caan, 9, m)
5.2.59 halkade (FS, erge, 9, f)
5.2.60 landade (FS, erge, 9, f)
5.2.61 letade (FS, erge, 9, f)
5.2.62 letade (FS, erge, 9, f)
5.2.63 letade (FS, erge, 9, f)
5.2.64 letade (FS, erge, 9, f)
5.2.65 letade (FS, erge, 9, f)

5.2.66 Då helt plötsligt ramla hunden ner från fönstret
5.2.67 där bodde bara en uggla som skrämde honom så han ramla ner på marken.
5.2.68 Där ställde pojken sig och ropa efter grodan
5.2.69 Hej då ropa han hej då.
5.2.70 Då gick pojken vidare och såg inte att binas bo trilla ner.
5.2.71 när dom båda trilla i.
5.2.72 Han ropa hallå var är du
5.2.73 han gick upp på stora stenen ropa hallå hallå
5.2.74 Då öppnade han fönstret & ropa på grodan.
5.2.75 I min förra skola hade man nåt som man kallade för kamratstödjare, Det funka väl ganska bra men...
5.2.76 man visade ingen hänsyn eller att man inte heja eller bara bråka
5.2.77 man visade ingen hänsyn eller att man inte heja eller bara bråka
5.2.78 Var var den där överraskningen. Ni svara jag men båda tittade på varandra ...
5.2.79 Ni svara jag
5.2.80 det gick inte så hon klättrade upp bredvid mig och medan jag för sökte lyfta upp mig skälv medan hon putta bort jackan från pelare.
5.2.81 medan hon putta jackan från pelaren
5.2.82 jag var på mitt land och bada
5.2.83 så här börja det
5.2.84 där sövde dom mig och gipsa handen.
5.2.85 Hon hade bara kladdskrivit den uppsats jag lämna in ...
5.2.86 ...så jag ångra verkligen att jag tog hennes uppsats...
5.2.87 När jag gick förbi den djupa avdelningen så kom en annan kille och putta i mig

Supine
5.2.88 det låg massor av saker runtomkring jag försökt att kom till fören
5.2.89 Han tittade på hunden, hunden försökt att klättra ner

Participle
5.2.90 Fönstrena ser lite blankare ut där uppe sa Virginia och börjad klättra upp för den ruttna stegen.
5.2.91 älgen sprang med olof till ett stup och kastad ner olof och hans hund
5.2.92 dom letad överallt
5.2.93 när han letad kollade en sork upp
5.2.94 han letad bakom stocken
5.2.95 alla pratad om borgmästaren
5.2.96 hunden råkade skakad ner ett getingbo
5.2.97 det var en liten pojke som satt och snyftad
5.2.98 svarad han

CORRECTION (CORP, SUBJ, AGE, SEX):
5.2.66 ramlade (FS, erge, 9, f)
5.2.67 ramlade (FS, erge, 9, f)
5.2.68 ropade (FS, erge, 9, f)
5.2.69 ropade (FS, erge, 9, f)
5.2.70 trillade (FS, erge, 9, f)
5.2.71 trillade (FS, erge, 9, f)
5.2.72 ropade (FS, haic, 11, f)
5.2.73 ropade (FS, haic, 11, f)
5.2.74 ropade (FS, jobe, 10, m)
5.2.75 funkade (SE, wj13, 13, m)
5.2.76 bråkade (SE, wj18, 13, m)
5.2.77 hejade (SE, wj18, 13, m)
5.2.78 svarade (SN, wg07, 10, f)
5.2.79 svarade (SN, wg07, 10, f)
5.2.80 puttade (SN, wg16, 10, f)
5.2.81 puttade (SN, wg16, 10, f)
5.2.82 badade (SN, wg18, 10, m)
5.2.83 började (SN, wg18, 10, m)
5.2.84 gipsade (SN, wj05, 13, m)
5.2.85 lämnade (SN, wj16, 13, f)
5.2.86 ångrade (SN, wj16, 13, f)
5.2.87 puttade (SN, wj20, 13, m)
5.2.88 försökte (DV, haic, 11, f)
5.2.89 försökte (FS, haic, 11, f)
5.2.90 började (DV, idja, 11, f)
5.2.91 kastade (FS, frma, 9, m)
5.2.92 letade (FS, frma, 9, m)
5.2.93 letade (FS, frma, 9, m)
5.2.94 letade (FS, frma, 9, m)
5.2.95 pratade (CF, frma, 9, m)
5.2.96 skaka (FS, frma, 9, m)
5.2.97 snyftade (DV, haic, 11, f)
5.2.98 svarade (DV, alco, 9, f)
5.2.99 torkade (DV, idja, 11, f)
5.2.100 försvann (DV, erge, 9, f)

6 VERB CLUSTER

6.1 Verb form after auxiliary verb

Present
6.1.1 Och i morgon är det brandövning men kom ihåg att det inte ska blir någon riktig brand.
6.1.2 Ibland får man bjuda på sig själv och låter henne/honom vara med !
5.2.99 Jag tittade på Virginia som torkad av sin näsa som var blodig på tröjarmen.

Strong verbs
5.2.100 Nästa dag så var en ryggsäck borta och mera grejer försvinna

CORRECTION (CORP, SUBJ, AGE, SEX):
6.1.1 bli (CF, klma, 10, f)
6.1.2 låta (SE, wj17, 13, f)

Preterite
6.1.3 hon ville inte att jag skulle följde med men med lite tjat fick jag.

Imperative
6.1.4 Men de var fult med buskar utan för som vi fick rid igenom.
6.1.5 han råkade bara kom i mot getingboet.
6.1.6 Det är något som vi alla nog skulle gör om vi inte hade läst på ett prov.
6.1.7 Jag skrattade och undrade hur Tromben skulle ha kom igenom det lilla hålet.

6.2 Missing auxiliary verb

Temporal 'ha'
6.2.1 ni måste hjälpa mig om ni ska få henne. och dom — lovat att bygga upp staden och de blev hotell
6.2.2 Men pappa — frågat mig om jag ville följa med

CORRECTION (CORP, SUBJ, AGE, SEX):
6.1.3 följa (DV, alhe, 9, f)
6.1.4 rida (DV, idja, 11, f)
6.1.5 komma (FS, haic, 11, f)
6.1.6 göra (SE, wj20, 13, m)
6.1.7 kommit (DV, idja, 11, f)
6.2.1 har/hade (DV, erge, 9, f)
6.2.2 har/hade (DV, haic, 11, f)

7 INFINITIVE PHRASE

7.1 Verb form after infinitive marker

Present
7.1.1 Men hunden klarar att inte slår sig.

Imperative
7.1.2 glöm inte att stäng dörren
7.1.3 jag försökt att kom till fören
7.1.4 Åt det går det nog inte att gör så mycket åt.

CORRECTION (CORP, SUBJ, AGE, SEX):
7.1.1 slå (FS, haic, 11, f)
7.1.2 stänga (DV, hais, 11, f)
7.1.3 komma (DV, haic, 11, f)
7.1.4 göra (SE, wj20, 13, m)

7.2 Missing infinitive marker
7.2.1 Men det vågar man kanske inte i första taget för då kan man ju bli rädd att man kommer — få ett kännetecken som skolans skvallerbytta eller något sånt!
7.2.2 ... tänkte jag att om man ska hålla på så kommer det — inte gå bra i skolan.
7.2.3 Nu när jag kommer att skriva denna uppsatsen så kommer jag — ha en rubrik om några problem och vad man kan göra för att förbättra dom.

CORRECTION (CORP, SUBJ, AGE, SEX):
7.2.1 kommer att få (E13, wj01, 13, f)
7.2.2 kommer det inte att gå (E13, wj06, 13, f)
7.2.3 kommer jag att ha (E13, wj03, 13, f)

8 WORD ORDER
8.1.1 När han kom hem så åt han middag gick och borstade tänderna och gick och sedan lade sig.
8.1.2 för då kan man inte något ting bara kan gå på stan det då fattar hjärnan ingenting
8.1.3 Jag den dan gjorde inget bättre.
8.1.4 att jag har ett problem att jag måste hela tiden fuska på proven annars med på matten nog alla lektioner måste jag fuska och alltid bråka för att få uppmärksamhet.
8.1.5 kompisarna gör det inte men om tvingar dom inte dig till att göra det

CORRECTION (CORP, SUBJ, AGE, SEX):
8.1.1 sedan och (FS, jowe, 9, f)
8.1.2 kan bara (SE, wg03, 10, f)
8.1.3 Jag gjorde inget bättre den dan. (SN, wg07, 10, f)
8.1.4 på matten med (SE, wg10, 10, m)
8.1.5 dom tvingar (SE, wj12, 13, f)

9 REDUNDANCY

9.1 Doubled word

Following directly
9.1.1 Han tittade på sin hund hund oliver
9.1.2 Kompisen ska få titta på en ibland också men, men det får inte bli regelbundet för då...
9.1.3 många som mobbar har har det oftast dåligt hemma
9.1.4 vi skall i alla fall träffas idag 20 mars 1999 måndagen kanske imorgon också också
9.1.5 Jag hade tur jag klarade klarade mig

Word between
9.1.6 jag tycker jag att alla måste få vara med
9.1.7 jag fick jag hjälp med det.
9.1.8 Åt det går det nog inte att gör så mycket åt.
9.1.9 Nasse sprang efter som en liten fnutknapp efter Bovarna.

CORRECTION (CORP, SUBJ, AGE, SEX):
9.1.1 hund (FS, alhe, 9, f)
9.1.2 , men (SE, wj17, 13, f)
9.1.3 har (SE, wj19, 13, m)
9.1.4 också (SN, wg04, 10, m)
9.1.5 klarade (SN, wg10, 10, m)
9.1.6 jag tycker att alla måste få vara med (SE, wg18, 10, m)
9.1.7 jag fick hjälp med det (SN, wj11, 13, f)
9.1.8 Åt det går det nog inte att gör så mycket. (SE, wj20, 13, m)
9.1.9 Nasse sprang som en liten fnutknapp efter bovarna. (CF, haic, 11, f)

9.2 Redundant word
9.2.1 Kalle som blev jätte rädd och sprang till närmaste hus som låg 9, kilometer bort
9.2.2 för då kan man inte något ting bara kan gå på stan det då fattar hjärnan ingenting
9.2.3 Hon och han borde pratat med en vuxen person (läraren). Eller pratat med föräldrarna.

CORRECTION (CORP, SUBJ, AGE, SEX):
9.2.1 (CF, anhe, 11, m)
9.2.2 inte (SE, wg03, 10, f)
9.2.3 (SE, wg12, 10, f)
9.2.4 (DV, haic, 11, f)
10.1.1 jag (CF, erja, 9, m)
10.1.2 jag (SN, wg04, 10, m)
10.1.3 något/det (SE, wg08, 10, f)
10.1.4 folk som (SE, wg14, 10, m)
10.1.5 de (SE, wg19, 10, m)
10.1.6 jag (SE, wj03, 13, f)
10.1.7 jag/vi (SN, wj09, 13, m)
10.1.8 man (SE, wj19, 13, m)
10.1.9 man (SE, wj19, 13, m)
10.1.10 han (FS, mawe, 11, f)
10.2.1 de? (SE, wg03, 10, f)
CORRECTION (CORP, SUBJ, AGE, SEX):
10.2.2 det (SN, wg06, 10, f)
10.2.3 varandra (SE, wg18, 10, m)
10.3.1 att (SN, wj03, 13, f)
10.4.1 hade (DV, alhe, 9, f)
10.4.2 var (FS, hais, 11, f)
10.4.3 att göra (SE, wj07, 13, f)
10.4.4 fick (SN, wj13, 13, m)
10.4.5 , blev (?) (DV, hais, 11, f)

9.2.4 när De kom till en övergiven by va Tor och jag var rädda

10 Missing Constituents

10.1 Subject
10.1.1 — undra vad det brann nånstans jag måste i alla fall larma
10.1.2 vidare hoppas — att vi kommer att vara kompisar rätt länge
10.1.3 Jag tror — skulle hjälpa dem är att ...
10.1.4 I början på filmen var det massa — kollade på den andras papper på uppgiften
10.1.5 man försöker att lära barnen att om — fuskar med t ex ett prov då...
10.1.6 han kommer och klappar alla på handen utan en kille — undra hur han känner sig då?
10.1.7 När jag var ungefär 5 år och gick på dagis så skulle — åka på ett barnkalas hos en tjej med dagiset.
10.1.8 När man tror att man har kompisar blir — ledsen när man bara går där ifrån om just kom dit
10.1.9 När man tror att man har kompisar blir ledsen när man bara går där ifrån om — just kom dit
10.1.10 Dom satte av efter Billy och Åke som suttit i ett träd men blivit nerputtad av en uggla — blev nästan nertrampad.

10.2 Object or other NPs
10.2.1 Om dom bråkar som — är det inte så mycket man kan göra åt saken
10.2.2 jag viste att han skulle bli lite ledsen då efter som vi hade bestämt —.
10.2.3 Om man sätter barn som är lika bra som — på samma ställe blir det bättre för...

10.3 Infinitive marker
10.3.1 Efter — ha sprungit igenom häckarna två gånger så vilade vi lite...

10.4 (att) Verb
10.4.1 en port som va helt glittrig och — 2 guldögon och silver mun.
10.4.2 sedan skuttade han fram vidare till den öppna burken där grodan — han. Nosade förundrat på grodan
10.4.3 Jag tycker att det har med ens uppfostran — om man nu ger eller inte ger hon/han den saken som man tappade.
10.4.4 ... så kom det några utlänningar och tog bollen och vi — inte tillbaka den.
10.4.5 då bar det av i 14 dagar och 14 äventyrsfyllda nätter jagade av älg — kompis med huggorm trampat på igelkott mycket hände verkligen.

10.5 Adverb
10.5.1 tuni hade jätte ont i knät men hon ville — sluta för det.

10.6 Preposition
10.6.1 Gunnar var på semester — Norge och åkte skidor.
10.6.2 dom bär massor av sken smycken massor — saker
10.6.3 det ena huset efter det andra gjordes — ordning
10.6.4 Hunden hoppade ner — ett getingbo.
10.6.5 Nej det var inte grodan som bodde — hålet.
10.6.6 Pojken som var på väg upp — ett träd fick slänga sig på marken...
10.6.7 att de som kollade på den andras papper skall träna mer — sin uppgift
10.6.8 ... så tänkte jag att det är — verklighet sånt händer
10.6.9 Mobbning handlar nog mycket — att man inte förstår olika människor.
10.6.10 men jag blev — alla fall jätte rädd för...
10.6.11 mobbing är det värsta som finns och — dom som gör det saknas det säkert någonting i huvudet.

10.7 Conjunction and subjunction
10.7.1 han gick upp på stora stenen — ropa hallå! hallå!
10.7.2 Simon klädde på sig — åt frukost.
10.7.3 Det som flickan gjorde när det var en vuxen — svarade i sin mobiltelefon som tappade en 100 lapp.
10.7.4 ...till exempel — den här killen gör så igen så...
10.7.5 om det är en tjej man inte alls är bra kompis med — kommer och sätter sig på bänken

10.8 Other
10.8.1 Alla blev rädda för hans skrik hans hämnd kunde vara — som helst ...
10.8.2 dom gick ut på kullek och letade. — och på marken och i luften.
10.8.3 De körde långt bort och till slut kom de fram till en gärdsgård och det var massor av hus —.
10.8.4 sen levde vi lyckliga — våra dagar
10.8.5 att jag har ett problem att jag måste hela tiden fuska på proven annars — med på matten nog alla lektioner måste jag fuska och alltid bråka för att få uppmärksamhet.
10.8.6 den som hörde de där stygga orden vågade kanske inte spela på en konsert för att — vara rädd att bli utskrattat av avundsjuka personer.
CORRECTION (CORP, SUBJ, AGE, SEX):
10.5.1 inte (SN, wj03, 13, f)
10.6.1 i (DV, erha, 10, m)
10.6.2 av (DV, haic, 11, f)
10.6.3 i (DV, hais, 11, f)
10.6.4 i (FS, anhe, 11, m)
10.6.5 i (FS, haic, 11, f)
10.6.6 i (FS, idja, 11, f)
10.6.7 på (SE, wg14, 10, m)
10.6.8 i verkligheten (SE, wj06, 13, f)
10.6.9 om (SE, wj20, 13, m)
10.6.10 i (SN, wg18, 10, m)
10.6.11 hos (SE, wj05, 13, m)
10.7.1 och (FS, haic, 11, f)
10.7.2 och (FS, hais, 11, f)
10.7.3 som (SE, wg14, 10, m)
10.7.4 om (SE, wj03, 13, f)
10.7.5 som (SE, wj17, 13, f)
10.8.1 hur hemsk/vad (CF, frma, 9, m)
10.8.2 de letade? (FS, hais, 11, f)
10.8.3 där (DV, alca, 11, f)
10.8.4 i alla (DV, hais, 11, f)
10.8.5 (?) (SE, wg10, 10, m)
10.8.6 han/hon var (SE, wg11, 10, f)

10.8.7 Om man inte kan det man ska göra och tittar på någon annan visar — någon annans resultat sen.
10.8.8 För att förbättra det är nog — att man ska prata med en lärare eller förälder så...

11 WORD CHOICE

11.1 Prepositions and particles
11.1.1 dom peka på väggen av tunneln
11.1.2 Vi sprang allt vad vi orkade ner till sjön och slängde ur oss kläderna.
11.1.3 Jag kom ihåg allt som hänt innan jag trillat ifrån grenen.
11.1.4 Han ropade ut igenom fönstret men inget kvack kom tillbaka.
11.1.5 sen var det problem på klass fotot
11.1.6 Jag tycker att om man har svårigheter för att skriva eller nåt annat skall man visa det...
11.1.7 vi var väldigt lika på sättet alltså vi tyckte om samma saker
11.1.8 Jag blev glad på Malin att hon hjälpte mig att säga det till honom för...
11.1.9 han kommer och klappar alla på handen utan en kille
11.1.10 När vi skulle gå av satt jag och dagdrömde och så gick alla av utan jag.

11.2 Adverb
11.2.1 Jag undrar ibland vart mamma är men det är ingen som vet.
11.2.2 Men vart ska jag bo?
11.2.3 Men vart dom en letade hittade dom ingen groda.

11.3 Infinitive marker
11.3.1 det var onödigt och skrika pappa
11.3.2 sen gick jag in och la mig och sova
11.3.3 men jag vet inte hur man ska få dom och göra det.
11.3.4 ... men om man vill försöka bli kompis med några tjejer/killar och kanske försöker och gå fram ...
11.3.5 ... det fick en och tänka till hur man kan hjälpa såna som är utsatta.
11.4 Pronoun
11.4.1 vad skulle dom göra dess pengar tog nästan slut
11.4.2 Det är vanligt att om man har problem hemma att man lätt blir arg och det går då ut över sina kompisar.

CORRECTION (CORP, SUBJ, AGE, SEX):
10.8.7 (?) (SE, wj05, 13, m)
10.8.8 det bästa? (SE, wj07, 13, f)
11.1.1 i (DV, alhe, 9, f)
11.1.2 av (DV, idja, 11, f)
11.1.3 från (CF, jowe, 9, f)
11.1.4 genom (FS, caan, 9, m)
11.1.5 med (SE, wg18, 10, m)
11.1.6 med (SE, wj11, 13, f)
11.1.7 till (SN, wg04, 10, m)
11.1.8 (?) (SN, wg06, 10, f)
11.1.9 utom (SE, wj03, 13, f)
11.1.10 utom (SN, wj09, 13, m)
11.2.1 var (CF, erge, 9, f)
11.2.2 var (CF, erge, 9, f)
11.2.3 var (FS, anhe, 11, m)
11.3.1 att (DV, alhe, 9, f)
11.3.2 att (DV, alhe, 9, f)
11.3.3 att (SE, wg18, 10, m)
11.3.4 att (SE, wj08, 13, f)
11.3.5 att (SE, wj16, 13, f)
11.4.1 deras (DV, jowe, 9, f)
11.4.2 ens (SE, wj12, 13, f)

11.5 Blend
11.5.1 när dom kommer hem så märker inte föräldrarna något även fast att man luktar rök och sprit
11.5.2 Det är nog inte ett ända barn som inte har något problem även fast att man inte har så stora
11.5.3 jag sprang så fort så mycket jag var värd

11.6 Other
11.6.1 Hon satte sig på det guldigaste och mjukaste gräset i hela världen.
11.6.2 men se där är ni ju det lilla följet bestående av snutna djur från djuraffären.
11.6.3 Jag tittade på Virginia som torkad av sin näsa som var blodig på tröjarmen.
11.6.4 jag förstår inte vad fröken menar med grammatik näringsväv och allt de andra.
11.6.5 Nasse sprang efter som en liten fnutknapp efter Bovarna.

12 REFERENCE

12.1 Erroneous referent

Number
12.1.1 Lena fick en kattunge...Och Alexander fick ett spjut. sen gav den sej iväg
12.1.2 när de gått och gått så hände något långt bort skymtade ett gult hus. vi närmade oss de sakta
12.1.3 Att Urban hade en fru. och en massa ungar hade det.
12.1.4 Oliver försökte få av sig burken så aggressivt så han ramlade över kanten. Erik tittade efter honom med en frågande min När Oliver hade dom i baken så hopade Erik ner.

Gender
12.1.5 ...vad heter din mamma? Det stod bara helt still i huvudet vad var det han hette nu igen?
12.1.6 Om nu någon tappar någon som pengar...

12.2 Change of referent
12.2.1 spring ut nu vi har besökare när ni kom ut ...
12.2.2 Om dom som mobbar någon gång blir mobbad själv skulle han ändras helt och hållet.

13 OTHER

13.1 Adverb
13.1.1 När jag var liten mindre ...

13.2 Strange construction
13.2.1 så Pär var läggdags
13.2.2 god natt på er Ses i morgon i går god natt
13.2.3 när vi rast skulle stänga affären så gömde jag mig.

CORRECTION (CORP, SUBJ, AGE, SEX):
11.5.1 även om/fastän (SE, wj12, 13, f)
11.5.2 även om/fastän (SE, wj12, 13, f)
11.5.3 allt vad (DV, haic, 11, f)
11.6.1 mest gulda (DV, angu, 9, f)
11.6.2 stulna (DV, hais, 11, f)
11.6.3 ärmen (DV, idja, 11, f)
11.6.4 näringslära? (CF, angu, 9, f)
11.6.5 ? (CF, haic, 11, f)
12.1.1 de (DV, angu, 9, f)
12.1.2 det (DV, hais, 11, f)
12.1.3 de (FS, alhe, 9, f)
12.1.4 den (FS, alhe, 9, f)
12.1.5 hon (CF, hais, 11, f)
12.1.6 något (SE, wj07, 13, f)
12.2.1 vi (DV, hais, 11, f)
12.2.2 dom/han (?) (SE, wj05, 13, m)
13.1.1 lite (SN, wj11, 13, f)
13.2.1 (DV, frma, 9, m)
13.2.2 (DV, hais, 11, f)
13.2.3 (DV, hais, 11, f)

B.2 Misspelled Words

Errors are categorized by part-of-speech and then by the part-of-speech they are realized in, indicated by an arrow (e.g. 'Noun → Noun': a noun becoming another noun).

1 NOUN

1.1 Noun → Noun
Medan Oliver hoppade efter bot.
Grävde sig Erik längre ner i bot
men upp ur bot kom ett djur upp.
Erik sprang i väg medan Oliver välte ner det surande bot.
Bina som bodde i bot rusade i mot Oliver
men hunden hade fastnat i buken
att dom bot i en jätte fin dy det va deras dy.
Det KaM Till EN övergiven Bi
dam bodde i en bi
pappa i har hittat än övergiven bi
de var en by en öde dy.
både pappa och jag kom då att tänka på den dyn vi va i
på vägen hem undrade pär hur dyn hade kommit till.
jag sprang till boten
sen vaknade vi i botten
Den där scenen med dammen som tappade sedlarna
Renen sprang tills dom kom till en dam
kastad ner olof och hans hund i en dam
En dag när han var vid damen drog han med håven i vattnet och fick upp en groda.
Men damen är inte så djup.
Vi kom Över Molnen Jag och Per på en flygande fris som hette Urban.
pojken och huden kom i vattnet.
de lät precis som Fjory hennes hast
August rosen gren har lämnat hjorden...
därför skulle dom andra i klasen visa hur duktiga dom var.
Den brinnande makan huset brann upp för att makan hade tagit eld.
En dag tänkte Urban göra varma makor.
Manen var tjock och rökte cigarr.
Ni har en son som ringt efter oss sa manen.
Den gamla manen Berättade om en by han Bot i för länge sedan
den här gamla manen har tagit hand om oss.
manen kom ut med tre skålar härlig soppa.
men så en dag kom en man som hette svarta manen

CORRECTION (CORP, SUBJ, AGE, SEX):
1.1.1 boet (FS, alhe, 9, f)
1.1.2 boet (FS, alhe, 9, f)
1.1.3 boet (FS, alhe, 9, f)
1.1.4 boet (FS, alhe, 9, f)
1.1.5 boet (FS, alhe, 9, f)
1.1.6 burken (FS, frma, 9, m)
1.1.7 by (DV, alhe, 9, f)
1.1.8 by (DV, alhe, 9, f)
1.1.9 by (DV, erja, 9, m)
1.1.10 by (DV, erja, 9, m)
1.1.11 by (DV, erja, 9, m)
1.1.12 by (DV, frma, 9, m)
1.1.13 byn (DV, alhe, 9, f)
1.1.14 byn (DV, frma, 9, m)
1.1.15 båten (DV, haic, 11, f)
1.1.16 båten (DV, haic, 11, f)
1.1.17 damen (SE, wg09, 10, m)
1.1.18 damm (FS, alhe, 9, f)
1.1.19 damm (FS, frma, 9, m)
1.1.20 dammen (FS, alhe, 9, f)
1.1.21 dammen (FS, jobe, 10, m)
1.1.22 gris??? (DV, caan, 9, m)
1.1.23 hunden (FS, haic, 11, f)
1.1.24 häst (DV, alco, 9, f)
1.1.25 jorden (DV, hais, 11, f)
1.1.26 klassen (SE, wg02, 10, f)
1.1.27 mackan (CF, caan, 9, m)
1.1.28 mackan (CF, caan, 9, m)
1.1.29 mackor (CF, caan, 9, m)
1.1.30 mannen (CF, alco, 9, f)
1.1.31 mannen (CF, idja, 11, f)
1.1.32 mannen (DV, angu, 9, f)
1.1.33 mannen (DV, angu, 9, f)
1.1.34 mannen (DV, angu, 9, f)
1.1.35 mannen (DV, angu, 9, f)

1.1.36 för manen hade många djur
1.1.37 Det var nog den här byn manen talade om
1.1.38 det var svarta manen.
1.1.39 Lena gick fram till svarta manen
1.1.40 svarta manen blev rädd
1.1.41 svarta manen sprang sin väg
1.1.42 det log maser av saker runtomkring
1.1.43 dom bär maser av sken smycken
1.1.44 ... men plötsligt tog matten slut.
1.1.45 alla menen och Pappa gick in i ett av huset
1.1.46 pojken skrek ett tupp!
1.1.47 ja tak
1.1.48 just då ringde telefånen och pappa svarade:
1.1.49 Fram ur vasen kom det något
1.1.50 Sen gick jag ut, och fram för mig stod värdens finaste häst.
1.1.51 dom som borde på örn kanske försökte koma på skepp

1.2 Noun → Adjective
1.2.1 man kunde rida fyra i bred
1.2.2 kale som blev jätte rädd...
1.2.3 ... och där fans ett tempel fult med matt.

1.3 Noun → Pronoun
1.3.1 Men det han höll i var ett par hon som i sin tur satt fast i en hjort.

1.4 Noun → Numeral
1.4.1 olof som klättrade i ett tre

1.5 Noun → Verb
1.5.1 pappa gick och knacka på en dör till
1.5.2 och knacka på en dör
1.5.3 Lena var en flika som var 8 år.
1.5.4 Han letade i ett hål medans hunden skällde på masa bin.
1.5.5 När han landa så svepte masa bin över honom.
1.5.6 hunden hade hittat masa getingar
1.5.7 där va en masa människor
1.5.8 Jag tycker att om man inte gillar en viss person ska man inte visa det på ett så taskigt sett.

1.6 Noun → Preposition
1.6.1 Då fick muffins syn på en massa in och började jaga dom.
1.6.2 dam flyttade naturligtvis till den övergivna b in

1.7 Noun → More than one category
1.7.1 Jag hade en jacka på mig som det var ett litet håll i...
1.7.2 Hur ska men kunna göra för att förbättra dessa problem?

CORRECTION (CORP, SUBJ, AGE, SEX):
1.1.51 ön (DV, haic, 11, f)
1.2.1 bredd (DV, idja, 11, f)
1.2.2 Kalle (CF, anhe, 11, m)
1.2.3 mat (DV, erge, 9, f)
1.3.1 horn (FS, anhe, 11, m)
1.4.1 träd (FS, frma, 9, m)
1.5.1 dörr (DV, alhe, 9, f)
1.5.2 dörr (DV, alhe, 9, f)
1.5.3 flicka (DV, angu, 9, f)
1.5.4 massa (FS, erge, 9, f)
1.5.5 massa (FS, erge, 9, f)
1.5.6 massa (FS, haic, 11, f)
1.5.7 massa (DV, alhe, 9, f)
1.5.8 sätt (SE, wg17, 10, f)
1.6.1 bin (FS, jowe, 9, f)
1.6.2 byn (DV, erja, 9, m)
1.7.1 hål (SN, wg16, 10, f)
1.7.2 man (SE, wj03, 13, f)

1.7.3 ...och vad men kan göra för att förbättra dom.
1.7.4 Att utfrysa en kompis eller någon annan kan vara det värsta men någonsin kan göra tycker jag.
1.7.5 Precis då kom pappa och hans men.
CORRECTION (CORP, SUBJ, AGE, SEX):
1.7.3 man (SE, wj03, 13, f)
1.7.4 man (SE, wj03, 13, f)
1.7.5 män (DV, haic, 11, f)
2.1.1 kallt (CF, erge, 9, f)
2.1.2 trygga (CF, frma, 9, m)
2.2.1 bäst (CF, hais, 11, f)
2.2.2 enda (CF, idja, 11, f)
2.2.3 enda (DV, jowe, 9, f)
2.2.4 enda (SE, wj12, 13, f)
2.2.5 enda (SE, wj13, 13, m)
2.2.6 enda (SN, wg19, 10, m)
2.2.7 rädd (CF, anhe, 11, m)
2.2.8 rädd (FS, frma, 9, m)
2.2.9 rädd (FS, frma, 9, m)
2.2.10 rädd (SN, wg18, 10, m)
2.2.11 rädda (CF, frma, 9, m)
2.2.12 tyken (SE, wj14, 13, m)
2.3.1 förra (FS, idja, 11, f)
2.3.2 kända (DV, erge, 9, f)
2.3.3 lätt (SE, wg03, 10, f)
2.3.4 rädd (FS, erja, 9, m)
3.1.1 alla (DV, idja, 11, f)
3.1.2 de (FS, alhe, 9, f)
3.1.3 de (FS, caan, 9, f)
3.1.4 de (DV, alco, 9, f)
3.1.5 de (DV, erja, 9, m)
3.1.6 de (DV, erja, 9, m)
3.1.7 de (DV, erja, 9, m)
3.1.8 de (DV, erja, 9, m)
3.1.9 de (DV, frma, 9, m)
3.1.10 de (DV, jobe, 10, m)
3.1.11 det (CF, angu, 9, m)
3.1.12 det (CF, angu, 9, f)

2 ADJECTIVE

2.1 Adjective → Adjective
2.1.1 Pappa du har glömt att tända brasan och det är kalt.
2.1.2 det är den plikt att få ås att bli dryga

2.2 Adjective → Noun
2.2.1 när hon var som best
2.2.2 ... men inte en ända människa syntes till.
2.2.3 det här brevet är det ända jag kan ge dig idag
2.2.4 Det är nog inte ett ända barn som inte har något problem även fast att man inte har så stora
2.2.5 Det ända jag vet om grov mobbing är det jag har sett på tv!
2.2.6 ... för det var det ända sättet att komma upp till en koja
2.2.7 kalle som blev jätte räd
2.2.8 han blev så räd
2.2.9 han var lite räd för kråkan
2.2.10 jag blev alla fall jätte räd
2.2.11 alla var reda
2.2.12 man behöver inte vara tycken bara för man inte vill vara med han.

2.3 Adjective → Verb
2.3.1 Och kanske var det ett barn till hans föra groda.
2.3.2 ... och spökena blev skända...
2.3.3 jag tror man ska ta ett lett prov först men...
2.3.4 pojken blev red

3 PRONOUN

3.1 Pronoun → Pronoun
3.1.1 fortsatte det att ringa i alle fall
3.1.2 och en massa ungar hade det.
3.1.3 Han sa till hunden att vara tyst för att det skull titta efter.
3.1.4 Det kom till en övergiven by
3.1.5 Det KaM Till EN övergiven Bi
3.1.6 när det kam hem sade pappa...
3.1.7 när det hade kommit en liten bit sa pappa...
3.1.8 då hörde det att det bubblade...
3.1.9 Det kom till en, plats som de aldrig hade varit, på.
3.1.10 Det kom till en övergiven by 3.1.11 jag förstår inte vad fröken menar med grammatik näringsväv och allt de andra. 3.1.12 Och sen den dagen de brann i Kamillas lägenhet leker vi alltid brandmän. Appendix B. 296 E RROR 3.1.13 de börjar att skymma 3.1.14 De var han och han hade hittat en partner. 3.1.15 ... men de kom ingen groda den här gången heller 3.1.16 De va en pojke som hette olof 3.1.17 de va en älg 3.1.18 mormor berättade att de fanns en by bortom solens rike 3.1.19 där de fanns små röda hus med vita knutar 3.1.20 ja men nu är de läggdags sa mormor. 3.1.21 Anna funderade halva natten över de där med morfar 3.1.22 de lät precis som Fjory hennes häst 3.1.23 de såg faktiskt ut som en övergiven by 3.1.24 de var bara ett fönster som lyste 3.1.25 De var en kväll som Lisa jag alltså ville höra en saga... 3.1.26 och dom lovat att bygga upp staden och de blev hotell 3.1.27 de var en by en öde by. 3.1.28 de var tid för familjen att gå hem. 3.1.29 Det var dåligt väder de blåste och regnade. 3.1.30 de blåste mer och mer 3.1.31 Men de var fullt med buskar utanför 3.1.32 Dom gick in genom dörren och blev förvånade av de dom såg. 3.1.33 de kunde berott på att dom gillade samma tjej. 3.1.34 När jag får se en son här film tänker jag på att de nog är så i de flesta skolorna 3.1.35 ... för de är nog något typiskt med de 3.1.36 ... för de är nog något typiskt med de 3.1.37 de får man nog för man får så mycket att göra när man blir större 3.1.38 Den är ju inte heller säkert att den kompisen man kollar på har rätt 3.1.39 De var bara ungdomar inga vuxna. 3.1.40 De hela började med att jag och min morfar skulle cykla ner till sjön för... 3.1.41 de verkade lugnt. 3.1.42 de va en vanlig måndag 3.1.43 ... efter som de fanns en hel del snälla kompisar i min klass så hjälpte dom mig... 3.1.44 När jag kom på fötter igen så hade de kommit cirka tolv stycken i min klass och hjälpte mig 3.1.45 det är den plikt att få ås att bli dryga 3.1.46 Dem kom med en stegbil och hämtade oss. 
3.1.47 Nästa dag gick dem upp till en grotta 3.1.48 där fick dem var sin korg med saker i 3.1.49 Dem hade ett privatplan 3.1.50 nu slår dem upp tältet för att vila... 3.1.51 nästa morgon går dem långt långt 3.1.52 men till slut kom dem till en övergiven by. 3.1.53 där stannade dem och bodde där resten av livet C ORP S UBJ AGE S EX det det det C ORRECTION CF FS FS frma caan frma 9 9 9 m m m det det det FS FS DV frma frma alco 9 9 9 m m f det det det DV DV DV alco alco alco 9 9 9 f f f det det det det DV DV DV DV alco alco alco erge 9 9 9 9 f f f f det DV erge 9 f det det det det det det DV DV DV DV DV DV frma frma hais idja idja mawe 9 9 11 11 11 11 m m f f f f det det SE SE wg07 wg20 10 10 f m det det det SE SE SE wg20 wg20 wg20 10 10 10 m m m det SE wj17 13 f det det SE SN wj18 wg10 13 10 m m det det det SN SN SN wg11 wg20 wg20 10 10 10 f m m det SN wj10 13 m din dom dom dom dom dom dom dom dom CF CF DV DV DV DV DV DV DV frma jobe angu angu jobe jobe jobe jobe jobe 9 10 9 9 10 10 10 10 10 m m f f m m m m m Error Corpora E RROR 3.1.54 dem kanske bodde i ett hus som dem fick hyra 3.1.55 dem kanske bodde i ett hus som dem fick hyra 3.1.56 ... dem måste få höga betyg annars får de skäll av sina föräldrar. 3.1.57 Dem andra människorna som kollade på sina kompisars provpapper, 3.1.58 ... när dem började bråka, 3.1.59 dem kunde väl hjälpa varandra. 3.1.60 Men dem fortsatte. 3.1.61 Men jag fortsatte kämpa för dem två skulle kunna se på varan utan att vända bort huvudet, 3.2 3.2.1 3.2.2 3.2.3 3.2.4 3.2.5 3.2.6 3.2.7 3.2.8 3.2.9 3.2.10 3.2.11 3.2.12 3.2.13 3.2.14 3.2.15 3.2.16 3.2.17 3.2.18 3.2.19 3.2.20 3.2.21 3.2.22 3.2.23 3.2.24 3.2.25 3.2.26 3.2.27 3.2.28 3.2.29 3.2.30 3.2.31 3.2.32 3.2.33 3.2.34 3.2.35 Pronoun → Noun ... för att du är ju alt jag har. ... och alt var en dröm. Någon anan la mig på en bår... och gick till en anan tunnel Det finns nog en anan väg... så jag fik åka med en anan som skulle också hänga med var är set... 
var är set här snabbt springer dam ut ur brand bilarna snabbt tar dam fram stegen dam ramlar rakt ner i en damm då är dam ännu närmare ljudet dam bodde i en by dam tåg och så med sig sina två tigrar när dam hade kommit än bit in i skogen å dam två tigrarna följde också med dam red bod när dam kam hem dam flyttade naturligtvis till den övergivna in där levde dam lyckliga tillslut blev dam två kamelerna så trötta... när dam kam hem var kl. 12 hon fråga va det var för not och efter som det inte fans not lock på burken han har fot syn på not om det skulle hända not om man såg en älg eller räv och not anat stort djur en poäng alltid not ni får gärna bo hos oss under tid en ni inte har not att bo i. det är den plikt att få ås att bli dryga och la os på varsin sida av den spikiga toppen och utrusta os sa Desere med en son skarp röst hon alltid använde. gick vi upp till utgången av tältet men upptäckte varan och vi blev så rädda Visa i filmen gillade inte varan 297 C ORP S UBJ AGE S EX dom dom dom C ORRECTION SE SE SE wg01 wg01 wg01 10 10 10 f f f dom SE wg01 10 f dom dom dom dom SN SN SN SN wg01 wg01 wg01 wg07 10 10 10 10 f f f f allt allt annan annan annan annan CF DV CF DV DV SN erge caan erge alhe idja wg20 9 9 9 9 11 10 f m f f f m det det dom dom dom dom dom dom dom dom dom dom dom dom dom dom nåt nåt nåt nåt nåt DV DV CF CF FS FS DV DV DV DV DV DV DV DV DV DV CF FS FS DV DV hais hais erja erja erja erja erja erja erja erja erja erja erja erja erja frma alhe alhe frma alhe alhe 11 11 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 f f m m m m m m m m m m m m m m f f m f f nåt nåt DV DV alhe idja 9 11 f f oss oss oss sån CF DV DV DV frma alhe alhe hais 9 9 9 11 m f f f varann DV alhe 9 f varann SE wg06 10 f Appendix B. 298 E RROR 3.2.36 det första problemet är att dom kollar på varan 3.2.37 för då tittar man inte på varan. 
3.2.38 Men jag fortsatte kämpa för dem två skulle kunna se på varan utan att vända bort huvudet, 3.3 3.3.1 3.3.2 3.3.3 3.3.4 3.3.5 3.3.6 3.3.7 3.3.8 3.3.9 3.3.10 3.3.11 3.3.12 3.3.13 3.3.14 3.3.15 Pronoun → Verb om man såg en älg eller räv och not anat stort djur Vi såg ormar spindlar krokodiler ödlor och anat. hanns groda var försvunnen. hanns mamma hade slängt ut den. som nu satt på hanns huvud. för att hanns kruka hade gått sönder kastad ner olof och hanns hund i en dam jag fick låna hanns mobiltelefon. han frågade honom nått ... den killen eller tjejen måste ha nått problem eller... om det kommer nån ny till klassen eller nått ...så hon hamnade inne i skogen på nått konstigt sätt... När det var två flickor som satt på en bänk så kom det en annan flicka som satte säg bredvid Det var också väldigt roligt för att man kände säg inte ensam om det. man får nog mer sona problem när man kommer högre upp i skolan 3.4 3.4.1 3.4.2 Pronoun → Preposition vi bar allt till mamma hos sa... sen när in kompis skulle hoppa så... 3.5 3.5.1 3.5.2 Pronoun → Interjection va fiffigt tänkte ja då börja alla i hela tunneln förutom pappa och ja gråta vilken fin klänning ja har Madde vaknade av mitt skrik, hon fråga va det var för nåt. 3.5.3 3.5.4 3.6 3.6.1 3.6.2 3.6.3 3.6.4 3.6.5 3.6.6 3.6.7 3.6.8 3.6.9 3.6.10 3.6.11 Pronoun → More than one category Det var än gång än man som hette Gustav Det var än gång än man som hette Gustav än dag när Gustav var på jobbet ringde det han trycker på än knapp Gustav sitter i än av brand bilarna där e än där uppe på än balkong står det ett barn han hade än groda män än natt klev grodan upp ur glas burken det var än gång två pojkar dam bodde i än bi. 
C ORP S UBJ AGE S EX varann varann varann C ORRECTION SE SE SN wg18 wg18 wg07 10 10 10 m m f annat DV alhe 9 f annat DV caan 9 m hans hans hans hans hans hans nåt nåt FS FS FS FS FS SN DV SE alhe alhe alhe alhe frma wg14 haic wj08 9 9 9 9 9 10 11 13 f f f f m m f f nåt nåt SE SN wj08 wj08 13 13 f f sig SE wg14 10 m sig SN wj11 13 f såna SE wg20 10 m hon min DV SN haic wj08 11 13 f f jag jag DV DV alhe alhe 9 9 f f jag vad DV CF angu alhe 9 9 f f en en en en en en en en en en en CF CF CF CF CF CF CF FS FS DV DV erja erja erja erja erja erja erja erja erja erja erja 9 9 9 9 9 9 9 9 9 9 9 m m m m m m m m m m m Error Corpora 3.6.12 3.6.13 3.6.14 3.6.15 3.6.16 3.6.17 3.6.18 3.6.19 3.6.20 3.6.21 3.6.22 3.6.23 3.6.24 3.6.25 3.6.26 3.6.27 3.6.28 3.6.29 3.6.30 4 4.1 4.1.1 4.1.2 4.1.3 4.1.4 4.1.5 4.1.6 4.1.7 4.1.8 4.1.9 4.1.10 4.1.11 4.1.12 4.1.13 4.2 4.2.1 4.2.2 299 E RROR C ORRECTION S UBJ AGE S EX pappa vi har hittat än övergiven bi. än dag sa Niklas ska vi rida ut när dam hade kommit än bit in i skogen än liten bit in i skogen såg dom än övergiven by än liten bit in i skogen såg dom än övergiven by Man ska vara en bra kompis, när någon vill vara än själv. jag satt ner men packning Men var nu då? dörren går inte upp. När simon kom ut och såg var som hade hänt... Hans hund Taxi var nyfiken på var det var för något i burken. Men var är det för ljud? var fan gör du Sjävl tycker jag att killarnas metoder är mer öppen och ärlig men också mer elak än var tjejernas metoder är. Hjälp det brinner vad nånstans undra vad det brann nånstans jag måste i alla fall larma Jag visste inte att brandbilen vad på väg förbi min egen by. Lena sa vad är vi hon såg sig omkring Visa i filmen gillade inte varan dom bråkade och lämnade visa utanför. 
en en en en DV DV DV DV erja erja erja erja 9 9 9 9 m m m m en DV erja 9 m en SE wg05 10 m min vad vad vad DV CF FS FS haic idja hais idja 11 11 11 11 f f f f vad vad vad FS SE SE idja wg07 wj13 11 10 13 f f m var var CF CF erja erja 9 9 m m var CF jowe 9 f var Vissa vissa DV SE SE angu wg06 wg06 9 10 10 f f f bet FS angu 9 f bodde DV haic 11 f hoppade FS alhe 9 f hoppade hoppade hålla lyfta låg låtsas FS DV SN SN DV SE anhe idja wj12 wg16 haic wj14 11 11 13 10 11 13 m f f f f m ryckte satt surrade sätt CF FS FS DV anhe haic erja alco 11 11 9 9 m f m f beror SE wg12 10 f bott DV angu 9 f VERB Verb → Verb Upp ur hålet kom en grävling och bett pojken i näsan dom som borde på örn kanske försökte koma på skepp När Oliver hade dom i baken så hopade Erik ner. Och pojken hopade efter hunden. Vi hopade upp på hästarna... ...för att hälla henne sällskap. först försökte hon att lufta mig... det log maser av saker runtomkring han behöver inte lossas om som ingenting har hänt, brand männen rykte ut och släkte elden hunden sa på pojkens huvet. då surade bina rakt över pojken sett dig hon gjorde som mannen sa Verb → Noun Och problemet kanske bror på att kompisarna inte tyckte om den personen Den gamla manen Berättade om en by han Bot i för länge sedan C ORP Appendix B. 300 4.2.3 4.2.4 4.2.5 4.2.6 4.2.7 4.2.8 4.2.9 4.2.10 4.2.11 4.2.12 4.2.13 4.2.14 4.2.15 4.2.16 4.2.17 4.2.18 4.2.19 4.2.20 4.2.21 4.2.22 4.2.23 4.2.24 4.2.25 4.2.26 4.2.27 4.2.28 4.2.29 4.2.30 4.2.31 4.2.32 4.2.33 4.2.34 4.2.35 4.2.36 4.2.37 4.2.38 4.2.39 4.2.40 4.2.41 E RROR C ORRECTION C ORP S UBJ AGE S EX Men konstigt nog ville jag se den hästen fastän den inte fans. Det fans en doktor som pratade vänligt med mig, och efter som det inte fans not lock på burken Men i hålet fans bara... mormor berättade att de fans en by bortom solens rike därde fans små röda hus med vita knutar där Annas morfar hade bott ... och där fans ett tempel fult med matt. 
Corrections belonging to the immediately preceding entries (corpus, subject, age, sex): fanns (CF, alhe, 9, f); fanns (CF, erge, 9, f); fanns (FS, alhe, 9, f); fanns (FS, erge, 9, f); fanns (DV, alco, 9, f) ×2; fanns (DV, erge, 9, f).

4.2.10 men efter som de fans en hel del snälla kompisar i min klass → fanns (SN, wg20, 10, m)
4.2.11 när jag kom ut ur huset sa Kamilla att jag fik hunden... → fick (CF, angu, 9, f)
4.2.12 Så fik pojken ett grodbarn → fick (FS, caan, 9, m)
4.2.13 Och vad fik dom se? → fick (FS, erge, 9, f)
4.2.14 men med lite tjat fik jag → fick (DV, alhe, 9, f)
4.2.15 och för varje djur fik man 1 eller 3 poäng → fick (DV, alhe, 9, f)
4.2.16 fik man tio poäng → fick (DV, alhe, 9, f)
4.2.17 först fik jag panik → fick (DV, alhe, 9, f)
4.2.18 hon hoppade till när hon fik syn på oss → fick (DV, hais, 11, f)
4.2.19 Men de var fult med buskar utan för som vi fik rid igenom. → fick (DV, idja, 11, f)
4.2.20 så jag fik åka med en anan som skulle också hänga med → fick (SN, wg20, 10, m)
4.2.21 han har fot syn på not → fått (FS, frma, 9, m)
4.2.22 ... som dom hade fot tillsammans. → fått (FS, haic, 11, f)
4.2.23 På morgonen vaknade vi och kläde på oss → klädde (CF, alhe, 9, f)
4.2.24 Madde sprang upp till sitt rum och kläde på sig → klädde (CF, alhe, 9, f)
4.2.25 Han kläde på sig → klädde (FS, haic, 11, f)
4.2.26 Det Kam Till EN övergiven Bi → kom (DV, erja, 9, m)
4.2.27 när det kam hem sade pappa... → kom (DV, erja, 9, m)
4.2.28 när Niklas och Bennys halva kam fram till en damm → kom (DV, erja, 9, m)
4.2.29 upp ur dammen kam två krokodiler → kom (DV, erja, 9, m)
4.2.30 när dam kam hem → kom (DV, erja, 9, m)
4.2.31 när dam kam hem var kl. 12 → kom (DV, frma, 9, m)
4.2.32 då ko min bror → kom (SN, wg18, 10, m)
4.2.33 När jag kom ut såg jag en liten eld låga koma ut genom fönstret, → komma (CF, alhe, 9, f)
4.2.34 det tog en timme att koma ditt → komma (CF, anhe, 11, m)
4.2.35 Pojken som var på väg upp ett träd fick slänga sig på marken för att inte koma i vägen för bin. → komma (FS, idja, 11, f)
4.2.36 dom som borde på örn kanske försökte koma på skepp → komma (DV, haic, 11, f)
4.2.37 hans hämnd kund vara som helst → kunde (CF, frma, 9, m)
4.2.38 på vägen till pappa möte jag en katt → mötte (DV, alhe, 9, f)
4.2.39 Jag gick in och sate mig vid bordet och åt. → satte (CF, alhe, 9, f)
4.2.40 Han sate sig upp och lyssnade → satte (FS, alhe, 9, f)
4.2.41 Hon sate sej på det guldigaste och mjukaste gräset i hela världen. → satte (DV, angu, 9, f)
4.2.42 Redan nästa dag sate vi igång med reparationen av byn.
4.2.43 Då såg jag nåt som jag aldrig har set
4.2.44 Jag tycker att hon skal prata med dom.
4.2.45 brandmännen släkte elden
4.2.46 där nere i det höga gräset låg dalmatinen tess, grisen kalle-knorr... och sav
4.2.47 Ring till Börje sej att vi låst oss ute.
4.2.48 dam tåg och så med sig sina två tigrar
4.2.49 ... att vi åkt ner från berget och åkt så långt att vi inte viste va vi va.
4.2.50 typ när man pratar om grejer som inte man villa att alla ska höra!
4.2.51 ... att Mia inte viste om att mamma var en strandskata.
4.2.52 Och utan att pojken viste om det hoppa grodan ur burken när han låg.
4.2.53 jag viste att han skulle bli lite ledsen då efter som vi hade bestämt.
4.2.54 då viste jag inte vad jag skulle göra
4.2.55 hon kan ju inte skylla på att hon inte märker nåt för det ärr alltid tydligt.

4.3 Verb → Pronoun
4.3.1 mer han jag inte tänka...

4.4 Verb → Adjective
4.4.1 å älgen bara gode
4.4.2 Niklas och Benny kunde inte hala emot
4.4.3 han höll sig i och road
4.4.4 Jag såg på ett TV program där en metod mot mobbing var att satta mobbarn på den stol och andra människor runt den personen och då fråga varför.
Corrections belonging to the immediately preceding entries (corpus, subject, age, sex): satte (DV, idja, 11, f); sett (DV, caan, 9, m); skall (SE, wg02, 10, f); släckte (CF, frma, 9, m); sov (DV, hais, 11, f); säg (CF, idja, 11, f); tog (DV, erja, 9, m); var (DV, alhe, 9, f); vill (SE, wj17, 13, f); visste (CF, hais, 11, f); visste (FS, caan, 9, m); visste (SN, wg06, 10, f); visste (SN, wg20, 10, m); är (SE, wj13, 13, m); hann (DV, idja, 11, f); glodde? (FS, frma, 9, m); hålla (DV, erja, 9, m); ropade? (FS, frma, 9, m); sätta (SE, wj16, 13, f).

4.4.5 Hade Erik vekt en uggla → väckt (FS, alhe, 9, f)

4.5 Verb → Interjection
4.5.1 jag blev jätte besviken för jag trodde att klockan va sådär 7. → var (CF, alhe, 9, f)
4.5.2 men jag va visst jätte ledsen så jag gick ut. → var (CF, alhe, 9, f)
4.5.3 Vi kom tillbaks vid 6 tiden, och då va vi jätte trötta och hungriga. → var (CF, alhe, 9, f)
4.5.4 Klockan va ungefär 12 när jag vaknade, och va får jag se om inte hästen. → var (CF, alhe, 9, f)
4.5.5 Klockan va ungefär 12 när jag vaknade, och va får jag se om inte hästen. → var (CF, alhe, 9, f)
4.5.6 jag sa att det inte va nåt så somna vi om. → var (CF, alhe, 9, f)
4.5.7 alla va överens → var (CF, frma, 9, m)
4.5.8 De va en pojke som hette olof → var (FS, frma, 9, m)
4.5.9 de va en älg → var (FS, frma, 9, m)
4.5.10 Nu va det bara att hoppa ut från fönstret. → var (FS, haic, 11, f)
4.5.11 ... att vi åkt ner från berget och åkt så långt att vi inte viste va vi va. → var (DV, alhe, 9, f)
4.5.12 pappa och jag undra va nycklarna va → var (DV, alhe, 9, f)
4.5.13 Det börjar med att pappa och jag va ute och cyklade på landet... → var (DV, alhe, 9, f)
4.5.14 ... att vi inte va på toppen av berget utan i en by → var (DV, alhe, 9, f)
4.5.15 han va för tung → var (DV, alhe, 9, f)
4.5.16 vi va i en jätte liten och fin by → var (DV, alhe, 9, f)
4.5.17 nej det va en blåmes → var (DV, alhe, 9, f)
4.5.18 Sen sa pappa att vi va tvungna att leta. → var (DV, alhe, 9, f)
4.5.19 om dom va öppna → var (DV, alhe, 9, f)
4.5.20 När jag kom dit va redan pappa där → var (DV, alhe, 9, f)
4.5.21 en port som va helt glittrig → var (DV, alhe, 9, f)
4.5.22 en katt som va svart och len → var (DV, alhe, 9, f)
4.5.23 en platta som nästan va omringad av lava → var (DV, alhe, 9, f)
4.5.24 där va en massa människor som va fastkedjade med tjocka kedjor → var (DV, alhe, 9, f)
4.5.25 där va en massa människor som va fastkedjade med tjocka kedjor → var (DV, alhe, 9, f)
4.5.26 den äldsta som va 80 år berätta att... → var (DV, alhe, 9, f)
4.5.27 den byn vi va i → var (DV, alhe, 9, f)
4.5.28 det va deras by → var (DV, alhe, 9, f)
4.5.29 det va den hemske fula trollkarlen tokig → var (DV, alhe, 9, f)
4.5.30 som tur va gick hästarna i hagen. → var (DV, idja, 11, f)
4.5.31 ... då vill ju han vara med den kompisen som han va med innan. → var (SE, wg12, 10, f)
4.5.32 ... men eftersom det inte va så mycket mobbing så... → var (SE, wj13, 13, m)
4.5.33 Det var i somras när jag, min syster och två andra kompisar va på vårat vanliga ställe... → var (SN, wj06, 13, f)
4.5.34 Vi va kanske inte så bra på det utan vi ramlade ganska ofta. → var (SN, wj07, 13, f)
4.5.35 det kunde ju va att en sjusovare bor där inne → vara (DV, alhe, 9, f)
4.5.36 ... utan det kan även vara att nån kan sparka eller att man få vara enstöring och sitta själv hela tiden eller kanske spotta eller bara kanske va taskiga mot den personen → vara (SE, wj08, 13, f)
4.5.37 ... att försöka va tuff hela tiden (eller?) → vara (SE, wj08, 13, f)
4.5.38 det kan ju va att den som blir mobbad inte uppför sig på rätt sätt, → vara (SE, wj13, 13, m)
4.5.39 dom vill inte va kompis med hon/han. → vara (SE, wj19, 13, m)
4.5.40 Då måste man fråga dom som inte vill va kompis med en vad man gör får fel... → vara (SE, wj19, 13, m)
4.5.41 Och om kompisarna tycker att man är ful och inte vill va med en som är ful så... → vara (SE, wj19, 13, m)
4.5.42 Marianne sa fort farande hur jag kunde va med henne → vara (SN, wg07, 10, f)

4.6 Verb → More than one category
4.6.1 så kommer det att vara svårare att skaffa jobb om dom inte har gott i skolan → gått (SE, wg03, 10, f)
4.6.2 han fick hetta Hubert. → heta (FS, haic, 11, f)
4.6.3 Men pojken är inte så glad för nu måste han hetta en ny glasburk. → hitta (FS, haic, 11, f)
4.6.4 Men sen så dom att det var små grodor. → såg (FS, idja, 11, f)
4.6.5 ...vi hade precis gått förbi skolan när vi så ett gäng på ca tio personer komma emot oss. → såg (SN, wj15, 13, m)
4.6.6 Hela majs fältet vad svart → var (CF, jowe, 9, f)
4.6.7 Oliver bodde i en liten stuga en liten bit i från skogen och vad väldigt intresserad av djur. → var (FS, jowe, 9, f)
4.6.8 Hans älsklings färg vad grön → var (FS, jowe, 9, f)
4.6.9 För han vad mycket trött. → var (FS, jowe, 9, f)
4.6.10 till slut vad han uppe på stocken med stort besvär. → var (FS, jowe, 9, f)
4.6.11 när jag senare vad klar kom grannen och skrek... → var (DV, jowe, 9, f)
4.6.12 För att komma till Strömstad vad de tvungna att åka från Göteborg... och sedan Strömstad. → var (DV, klma, 10, f)
4.6.13 Det var en ganska dålig lärare som inte märkte hans fusklapp han hade i pennfacket eller vad det vad. → var (SE, wj07, 13, f)

5 PARTICIPLE

5.1 Participle → Participle
5.1.1 Erik sprang i väg medan Oliver välte ner det surande bot. → surrande (FS, alhe, 9, f)

6 ADVERB

6.1 Adverb → Noun
6.1.1 snabbt hoppa dom på kamelerna och rusa iväg och red bod till pappa → bort (DV, erja, 9, m)
6.1.2 dam red bod → bort (DV, erja, 9, m)
6.1.3 ingen sov got den natten → gott (CF, frma, 9, m)
6.1.4 Oliver hjälpte till så got han kunde. → gott (FS, alhe, 9, f)
6.1.5 att säga ifrån och förklara ur den utsatta skall uppföra sig. → hur (SE, wj13, 13, m)
6.1.6 När de gick ifrån tjejen som kom så var det väll för att hon inte hjälpte dem med provet → väl (SE, wg08, 10, f)
6.1.7 ...men sen måste dom väll få skuld känslor. → väl (SE, wj04, 13, m)
6.1.8 så kan man väll fortfarande vara kompis med han hon. → väl (SE, wj07, 13, f)
6.1.9 det gick väll ganska bra. → väl (SN, wj08, 13, f)
6.1.10 jag får väll ta av min snowboard. → väl (SN, wj08, 13, f)

6.2 Adverb → Adjective
6.2.1 ... och där fans ett tempel fult med matt. → fullt (DV, erge, 9, f)
6.2.2 Men de var fult med buskar utan för som vi fik rid igenom. → fullt (DV, idja, 11, f)
6.2.3 men det är ju mycket coolare att säga nej tack jag röker inte en att säga ja jag är väl inre feg. → inte (SE, wj12, 13, f)
6.2.4 ny vänta nu kommer hon → nu (DV, hais, 11, f)
6.2.5 ny öppna inte garderoben → nu (DV, hais, 11, f)
6.2.6 Det var rät blåsigt. → rätt (CF, idja, 11, f)
6.2.7 ... men jag va vist jätte ledsen så jag gick ut. → visst (CF, alhe, 9, f)
6.2.8 det började vist brinna → visst (CF, jobe, 10, m)
6.2.9 dom hade vist ungar och där var hans groda. → visst (FS, erge, 9, f)
6.2.10 då får vi Natta över i byn vist. → visst (DV, haic, 11, f)
6.2.11 Och så landade du vist i en möglig ko skit också. → visst (DV, idja, 11, f)

6.3 Adverb → Pronoun
6.3.1 det tog en timme att koma ditt → dit (CF, anhe, 11, m)
6.3.2 Men vart dom en letade hittade dom ingen groda. → än (FS, anhe, 11, m)
6.3.3 men hur han en lockade så kom den inte. → än (FS, erge, 9, f)
6.3.4 Det beror på att den andra har jobbat bättre en den andra den som kollade på honom. → än (SE, wg03, 10, f)
6.3.5 men det kan ju vara andra saker en bara skolan? → än (SE, wg03, 10, f)
6.3.6 men det är ju mycket coolare att säga nej tack jag röker inte en att säga ja jag är väl inre feg. → än (SE, wj12, 13, f)

6.4 Adverb → Verb
6.4.1 förts att vi inte sögs med tromben → först (DV, idja, 11, f)
6.4.2 som jag förts trodde → först (SN, wj16, 13, f)
6.4.3 så har gick det till: → här (DV, hais, 11, f)
6.4.4 är ett sånt problem uppstår försöker man klart hjälpa till. → När (SE, wg07, 10, f)

6.5 Adverb → Interjection
6.5.1 ... att vi åkt ner från berget och åkt så långt att vi inte viste va vi va. → var (DV, alhe, 9, f)
6.5.2 pappa och jag undra va nycklarna va → var (DV, alhe, 9, f)
6.5.3 sen undra han va dom bodde → var (DV, alhe, 9, f)

6.6 Adverb → More than one category
6.6.1 Hunden hade skällt så mycket att geting boet hade ramlat när. → ner (FS, caan, 9, m)

7 PREPOSITION

7.1 Preposition → Verb
7.1.1 Min kompis tänkte hämta hjälp så han hängde sig i viadukten och hoppa ber sprang till närmaste huset och sa att det var en som hade trillat ner och att han skulle ringa ambulansen. → ner (SN, wj05, 13, m)

7.2 Preposition → More than one category
7.2.1 kan vi inte gå nu sa Filippa men darrig röst → med (DV, hais, 11, f)
7.2.2 Man beslöt att börja men marknaderna igen. → med (DV, mawe, 11, f)

8 CONJUNCTION

8.1 Conjunction → Noun
8.1.1 pojken fick nästan inte resa på sig fören en uggla kom. → förrän (FS, haic, 11, f)
8.1.2 Pojken hinner knappt resa sig upp fören en uggla kommer flygande mot honom. → förrän (FS, idja, 11, f)
8.1.3 fören pappa kom in rusande i mitt rum. → förrän (DV, idja, 11, f)
8.1.4 inte fören när jag skulle gå ner märkte jag att jag hade fastnat, → förrän (SN, wg16, 10, f)
8.1.5 män än natt klev grodan upp ur glas burken → men (FS, erja, 9, m)
8.1.6 män plötsligt hoppade hunden ut ur fönstret → men (FS, erja, 9, m)
8.1.7 män då hoppade pojken efter → men (FS, erja, 9, m)
8.1.8 gick vi upp till utgången av tältet mer upptäckte varan och vi blev så rädda → men (DV, alhe, 9, f)
8.1.9 män han hade skrikit så... → men/medan (FS, frma, 9, m)
8.1.10 ... å ställde cyklarna på den utskurna plattan. → och (CF, alhe, 9, f)
8.1.11 Vi bor i samma hus jag och Kamilla å hennes hund. → och (CF, angu, 9, f)
8.1.12 Så vi fick vänta tills pappa kom hem å då skulle jag visa pappa mamma → och (CF, hais, 11, f)
8.1.13 å älgen bara gode → och (FS, frma, 9, m)
8.1.14 å dam två tigrarna följde också med → och (DV, erja, 9, m)

8.2 Conjunction → More than one category
8.2.1 Då måste man fråga dom som inte vill va kompis med en vad man gör får fel... → för (SE, wj19, 13, m)
8.2.2 då skulle vi samlas 11.30 får bussen gick lite senare → för (SN, wg20, 10, m)
8.2.3 vi har så mycket saker så vi kan ha i byn → som(?) (DV, haic, 11, f)

9 INTERJECTION

9.1 Interjection → Adjective
9.1.1 när vi kom in till mig så stod mamma och pappa i dörren och sa gratis till mig när jag kom. → grattis (CF, alhe, 9, f)

10 OTHER
10.1.1 där e huset som brinner → är (CF, erja, 9, m)
10.1.2 nu e nog alla människor ute → är (CF, erja, 9, m)
10.1.3 där e än → är (CF, erja, 9, m)
10.1.4 då e dam ännu närmare ljudet → är (CF, erja, 9, m)
10.1.5 Att bli mobbad e nog det värsta som finns, → är (SE, wj08, 13, f)
10.1.6 Han slog då till mig över kinden så att jag fick ett R. → ärr (SN, wg15, 10, m)

B.3 Segmentation Errors

Errors are categorized by part-of-speech. Each entry is given as: error → correction (corpus, subject, age, sex).

1 NOUN
1.1.1 VI VAR PÅ BORÅS BAD HUS
1.1.2 ... har hunden fått syn på en bi kupa.
1.1.3 Han hoppar upp på bi kupan
1.1.4 ... så att bi kupan börjar att skaka
1.1.5 bi kupan ramlar ner till marken!
1.1.6 då kom det en bi svärm surrande förbi
1.1.7 tillslut välte han ner hela kupan och en hel bi svärm surrade ut.
1.1.8 Efter 5 minuter körde en brand bil in på gården.
1.1.9 Då vi kom till min by. Trillade jag av brand bilen
1.1.10 Men grannen intill ringde brand kåren.
1.1.11 när brand kåren kom hade hela vår ranch brunnit ner till grunden.
1.1.12 brand larmet går
1.1.13 Just när han hörde smällen gick brand larmet på riktigt!
1.1.14 Han rusade ut till brandmännen som inte hade hört smällen och brand larmet.
1.1.15 Han jobbade som brand man
1.1.16 En brand man klättrade upp till oss.
1.1.17 om det fanns någon ledig brand man
1.1.18 jag håller på och utbildar mig till brand man
1.1.19 Petter sa att han tänkte bli Brand man när han blir stor.
1.1.20 En brand man berättade att...
1.1.21 BRAND MANEN
1.1.22 det här var en bra träning för mig sa brand manen
1.1.23 brand menen ryckte ut och släckte elden.
jag ringde till brand stationen Och i morgon är det brand övning där brand övningen skulle hålla till. 1.1.27 vi skulle börja göra i ordning den lilla byn som bestod av 8 hus 6 affärer och ett by hus 1.1.28 Desere jobbade i en djur affär 1.1.29 men se där är ni ju det lilla följet bestående av snutna djur från djur affären. 1.1.30 när det lilla djur följet gått i fyra timmar 1.1.31 Efter några sekunder stod såfus med tungan halvvägs hängande ut i mun i dörr öppningen. 1.1.32 hon lurade i min pojkvän massa elak heter om Linnea. 1.1.33 han hade ett 4 mannatält I sin fik kniv. 1.1.34 Då sprang dom fort till tunneln och fort till skidbacken och Fort till flyg platsen C ORRECTION C ORP S UBJ AGE S EX badhus bikupa bikupan bikupan bikupan bisvärm bisvärm SN FS FS FS FS FS FS wg13 klma klma klma klma alca hais 10 10 10 10 10 11 11 m f f f f f f brandbil CF idja 11 f brandbilen CF jowe 9 f brandkåren brandkåren CF DV jobe idja 10 11 m f brandlarmet brandlarmet CF CF erja klma 9 10 m f brandlarmet CF klma 10 f brandman brandman brandman brandman brandman CF CF CF CF CF erja idja idja idja idja 9 11 11 11 11 m f f f f brandman brandmannen brandmannen CF CF CF jowe erja idja 9 9 11 f m f brandmännen brandstationen brandövning brandövningen byhus CF CF CF CF anhe idja klma klma 11 11 10 10 m f f f DV hais 11 f djuraffär djuraffären DV DV hais hais 11 11 f f djurföljet dörröppningen elakheter DV FS hais hais 11 11 f f SN wg07 10 f fickkniv flygplatsen DV DV alhe erha 9 10 f m Error Corpora E RROR 1.1.35 1.1.36 1.1.37 1.1.38 Jag hör fot steg från trappan frukost klockan ringde jag går ner och ringer i frukost klockan genom att han tappat en jord fläck på fönster karmen. 
1.1.39 Ronja hittade en förbands låda 1.1.40 Men lars fick försäkrings pengarna 1.1.41 1.1.42 1.1.43 1.1.44 1.1.45 1.1.46 1.1.47 1.1.48 1.1.49 1.1.50 1.1.51 1.1.52 1.1.53 1.1.54 1.1.55 1.1.56 1.1.57 1.1.58 1.1.59 1.1.60 1.1.61 1.1.62 1.1.63 1.1.64 1.1.65 1.1.66 1.1.67 1.1.68 1.1.69 1.1.70 1.1.71 1.1.72 1.1.73 1.1.74 Hunden hoppar vid ett geting bo. Geting boet trillar ner på marken. Geting boet går sönder. det var en gips skena som... Nu hade han den i en ganska stor glas burk, på sitt rum. så han tog med sig grodan hem i en glas burk. grodan klev upp ur glas burken. hunden stack in huvudet i glas burken Glas burken som hunden hade på huvudet gick i tusen bitar Oliver innerligt försökte få av sig den glas burken som... Hunden hade fastnat i glas burken och ramlade ner. Pojken och hunden sitter och kollar på grodan i glas burken. När pojken och hunden har somnat kryper grodan ut ur glas burken. Glas burken går sönder. såfus hade letat i glas burken han fick ha på sig glas burken över huvudet. såfus landade med huvudet före och hela glas burken sprack. ... så gick glas burken sönder. dom plockade många kran kvistar och la som täcke här är också en grav sten från 1989. jag satte upp grav stenar efter dom dan efter grävde vi upp deras grav stenar hit ut går det ju bara en grus väg Hästarna saktade av när dom kom ut på en grus väg. vi fortsatte på den lilla grus vägen. grus vägen ledde fram till en övergiven by. Vi följde grus vägen Vi red i genom det stora hålet och kom in på grus vägen vart tionde år måste han ha 5 guld klimpar en hund på 14 hund år trampat på igel kott En dag hade vi en informations dag om mobbing Då kom det upp en jord ekorre han tittade i ett jord hål. 
307 C ORRECTION C ORP S UBJ AGE S EX fotsteg frukostklockan frukostklockan fönsterkarmen förbandslåda försäkringspengarna getingbo getingboet getingboet gipsskena glasburk CF DV DV FS alhe hais hais hais 9 11 11 11 f f f f DV CF mawe erha 11 10 f m FS FS FS SN FS erha erha erha wj05 alca 10 10 10 13 11 m m m m f glasburk glasburken glasburken glasburken FS FS FS FS alhe alca alca alca 9 11 11 11 f f f f glasburken FS alhe 9 f glasburken FS caan 9 m glasburken FS erha 10 m glasburken FS erha 10 m glasburken glasburken glasburken glasburken FS FS FS FS erha hais hais hais 10 11 11 11 m f f f glasburken grankvistar FS DV klma hais 10 11 f f gravsten gravstenar gravstenar grusväg grusväg DV DV DV DV DV hais hais hais idja idja 11 11 11 11 11 f f f f f grusvägen grusvägen grusvägen grusvägen DV DV DV DV idja idja idja idja 11 11 11 11 f f f f guldklimpar hundår igelkott informationsdag DV DV DV SE angu hais hais wj16 9 11 11 13 f f f f jordekorre jordhål FS FS alca alhe 11 9 f f Appendix B. 308 E RROR 1.1.75 1.1.76 1.1.77 1.1.78 1.1.79 1.1.80 det är ju jul afton om 3 dagar Innan jul skulle våran klass ha jul fest. sen var det problem på klass fotot man vill ju vara fin på klass fotot På t ex klass fotot MIN KLASS KAMRAT VILLE INTE HOPPA FRÅN HOPPTORNET 1.1.81 snabbt tog han på sig klä där 1.1.82 Och så landade du visst i en möglig ko skit också 1.1.83 men det finns i alla fall ingen tur med en möglig ko skit. 1.1.84 De hade med sig : ett spritkök, ett tält, och Massa Mat, några kul gevär, och ammunition M.M. 1.1.85 När kvälls daggen kom var vi helt klara 1.1.86 Kvälls daggen hade fallit 1.1.87 det brann på Macintosh vägen 738c 1.1.88 Att få status är kanske det maffia ledarna håller på med. 1.1.89 Hela majs fältet var svart 1.1.90 Vid mat bordet var det en livlig stämma 1.1.91 dom kom in till oss med 2 stora mat kassar. 1.1.92 det var när jag gick i mellan stadiet 1.1.93 Jag satt vid middags bordet tillsammans med mamma och min lillebror Simon. 
1.1.94 där stannade dem och bodde där resten av livet för mobil telefonen räckte inte enda hem. 1.1.95 alla djur rusade ut ur affären upp på mölndals vägen 1.1.96 Han hade fångat en groda när han var i parken vid den stora näckros dammen. 1.1.97 skuggorna föll förundrat på det vita parkett golvet. 1.1.98 En vecka senare så var det en polis patrull som letade efter skol klassen 1.1.99 och precis när en av dem skulle slå till mig så hörde jag polis sirener 1.1.100 Man hämtar då en rast vakt. 1.1.101 följer du med på en rid tur 1.1.102 här står det August rosen gren har lämnat jorden 1.1.103 jag hade fått en sjuk dom 1.1.104 helt plötsligt var jag på sjuk huset. 1.1.105 ... förrän jag vaknade i en sjukhus säng. 1.1.106 jag tog mina saker ner i en sken påse 1.1.107 dom bär massor av sken smycken 1.1.108 Pappa det var du som la den i skrivbords lådan 1.1.109 ...men sen måste dom väl få skuld känslor. 1.1.110 därför är lärarens skyldig het att se till att eleven får hjälp. 1.1.111 Sedan var det ett sov rum med 4 bäddar. 1.1.112 Dem kom med en steg bil och hämtade oss. 
C ORP S UBJ AGE S EX julafton julfest klassfotot klassfotot klassfotot klasskamrat C ORRECTION CF SN SE SE SE SN erge wg02 wg18 wg18 wg19 wg13 9 10 10 10 10 10 f f m m m m kläder koskit FS DV erja idja 9 11 m f koskit DV idja 11 f kulgevär DV jobe 10 m kvällsdaggen kvällsdaggen Macintoshvägen maffialedarna DV DV CF SE hais mawe anhe wj20 11 11 11 13 f f m m majsfältet matbordet matkassar mellanstadiet middagsbordet CF DV CF SN CF jowe idja alhe wj14 mawe 9 11 9 13 11 f f f m f mobiltelefonen DV jobe 10 m Mölndalsvägen DV hais 11 f näckrosdammen parkettgolvet FS alca 11 f FS hais 11 f polispatrull DV alca 11 f polissirener SN wj15 13 m rastvakt ridtur Rosengren SE DV DV wg07 idja hais 10 11 11 f f f sjukdom CF sjukhuset CF sjukhussäng CF skenpåse DV skensmycken DV skrivbordslådan CF skuldkänslor SE skyldighet SE erge erge mawe haic haic erge wj04 wj19 9 9 11 11 11 9 13 13 f f f f f f m m sovrum stegbil mawe jobe 11 10 f m DV CF Error Corpora E RROR 1.1.113 det var ett stort sten hus 1.1.114 Kalle-knorr hade hittat ett stort sten kors 1.1.115 där står ett gult hus med stock rosor slingrande efter väggarna 1.1.116 allt från att förstå en telefon apparat till att förstå en människa. 1.1.117 när de var hemma så tittade de i telefon katalogen 1.1.118 ni får gärna bo hos oss under tid en ni inte har nåt att bo i. 1.1.119 så kom brandbilen och räddade mamma ut genom toalett fönstret. 1.1.120 där bakom några grenar låg någonting ett trä hus 1.1.121 Ett vardags rum med 2 soffor 1 bord och en stor öppenspis 1.1.122 Johan gick in i vardags rummet och satte upp elementet. 1.1.123 hela vardags rummet stod i brand 1.1.124 hans älsklings djur var groda. 1.1.125 Hans älsklings färg vad grön 1.1.126 Och det är nog en överlevnads instinkt. 2 2.1.1 2.1.2 2.1.3 2.1.4 2.1.5 2.1.6 2.1.7 2.1.8 2.1.9 2.1.10 2.1.11 2.1.12 2.1.13 2.1.14 2.1.15 2.1.16 2.1.17 2.1.18 2.1.19 2.1.20 ADJECTIVE/PARTICIPLE Fast pappa hade utrustat alla hus brand säkra. 
2.1.2 där va massa människor som va fast kedjade med tjocka kedjor
2.1.3 Människorna hade haft färg glada dräkter på sig
2.1.4 Tanja sydde glatt färgade kläder åt allihop
2.1.5 Fönstret stod halv öppet
2.1.6 där han låg hjälp lös på marken.
2.1.7 Cristoffer hoppade ner och var jätte arg för att burken gick sönder.
2.1.8 Cristoffer lyfte upp hunden och var fortfarande jätte arg men ...
2.1.9 Ett par horn på en hjort som blev jätte arg.
2.1.10 Bina som var inne i boet blev jätte arga och surrade upp ur boet.
2.1.11 så kanske de blir jätte bra kompisar.
2.1.12 och tänk om den som man skrev av hade skrivit en jätte bra dikt
2.1.13 Det var inte så jätte djupt på den delen av floden som Cristoffer och hunden föll i på.
2.1.14 dom bott i en jätte fin by
2.1.15 Sen hjälpte vi dom att göra om byn till en jätte fin by
2.1.16 Mamma och pappa tyckte det var en jätte fin by
2.1.17 Jag hade ett jätte fint rum.
2.1.18 då blev jag jätte glad
2.1.19 Då blev dom jätte glada.
2.1.20 där man kan äta jätte god picknick

CORRECTION            CORP  SUBJ  AGE  SEX
stenhus               DV    erha  10   m
stenkors              DV    hais  11   f
stockrosor            DV    hais  11   f
telefonapparat        SE    wj20  13   m
telefonkatalogen      CF    alca  11   f
tiden                 DV    idja  11   f
toalettfönstret       CF    hais  11   f
trähus                DV    hais  11   f
vardagsrum            DV    mawe  11   f
vardagsrummet         CF    alca  11   f
vardagsrummet         CF    alca  11   f
älsklingsdjur         FS    jowe   9   f
älsklingsfärg         FS    jowe   9   f
överlevnadsinstinkt   SE    wj20  13   m
brandsäkra            DV    idja  11   f
fastkedjade           DV    alhe   9   f
färgglada             DV    mawe  11   f
glattfärgade          DV    mawe  11   f
halvöppet             FS    hais  11   f
hjälplös              FS    hais  11   f
jättearg              FS    alca  11   f
jättearg              FS    alca  11   f
jättearg              FS    erge   9   f
jättearga             FS    alca  11   f
jättebra              SE    wg16  10   f
jättebra              SE    wg17  10   f
jättedjupt            FS    alca  11   f
jättefin              DV    alhe   9   f
jättefin              DV    alhe   9   f
jättefin              DV    idja  11   f
jättefint             DV    idja  11   f
jätteglad             SN    wg18  10   f
jätteglada            DV    alhe   9   f
jättegod              DV    alhe   9   f

Appendix B.

ERROR
2.1.21 det var helt lila och såg jätte hemskt ut,
2.1.22 pappa och jag tänkte att vi skulle cykla upp på det jätte höga berget för att titta på ut sikten.
2.1.23 pappa gick ut och såg att vi va I en jätte liten och fin by,
2.1.24 Den andra frågan är jätte lätt
2.1.25 vi mulade och kastade jätte många snöbollar på dom
2.1.26 tuni hade jätte ont i knät
2.1.27 Nästa dag när Oliver vaknade blev han jätte rädd för han såg inte grodan i glasburken.
2.1.28 Då blev Oliver jätte rädd.
2.1.29 jag blev jätte rädd
2.1.30 både muffins och Oliver blev jätte rädda.
2.1.31 Det blev jätte struligt med allt möjligt inblandat.
2.1.32 han sade till muffins att vara jätte tyst.
2.1.33 man ser att det är nåt jätte viktigt hon ville berätta.
2.1.34 Med en gång blev jag klar vaken
2.1.35 en platta som nästan va om ringad av lava.
2.1.36 vi slog upp tältet på den spik spetsiga toppen
2.1.37 det var en varm och stjärn klar natt.
2.1.38 En gång blev den hemska pyroman ut kastad ur stan.
2.1.39 Om man blir ut satt för något ...
2.1.40 i vart enda hus var alla saker kvar från 1600 talet
2.1.41 då bar det av i 14 dagar och 14 äventyrs fyllda nätter
2.1.42 då kom dom till en över given by
2.1.43 de kom till en över given by
2.1.44 de kom till en över given by
2.1.45 Det var en över given by.
2.1.46 då för stod vi att det var en över given by
2.1.47 till slut kom dem till en över given By.
2.1.48 vi passerade många över vuxna hus
2.1.49 Oliver fick se ett geting bo och blev hel galen.

3 PRONOUN
3.1.1 hon hade bara drömt allt ihop.
3.1.2 simon låg på sin kudde och hade inte märkt någon ting.
3.1.3 Nu ska jag visa er någon ting
3.1.4 Dom flesta var duktiga på någon ting
3.1.5 för då kan man inte något ting

4 VERB
4.1.1 när jag dog 1978 i cancer återvände jag hit för att fort sätta mitt liv här
4.1.2 Jag tror att killen inte kan för bättra sig själv...
4.1.3 då för stod vi att det var en över given by
4.1.4 medan jag för sökte lyfta upp mig skälv

CORRECTION        CORP  SUBJ  AGE  SEX
jättehemskt       SN    wj03  13   f
jättehöga         DV    alhe   9   f
jätteliten        DV    alhe   9   f
jättelätt         SE    wj03  13   f
jättemånga        SN    wj10  13   m
jätteont          SN    wj03  13   f
jätterädd         FS    jowe   9   f
jätterädd         FS    jowe   9   f
jätterädd         SN    wj03  13   f
jätterädda        FS    jowe   9   f
jättestruligt     SN    wg11  10   f
jättetyst         FS    jowe   9   f
jätteviktigt      CF    alhe   9   f
klarvaken         DV    idja  11   f
omringad          DV    alhe   9   f
spikspetsiga      DV    alhe   9   f
stjärnklar        DV    hais  11   f
utkastad          CF    frma   9   m
utsatt            SE    wj19  13   m
vartenda          DV    hais  11   f
äventyrsfyllda    DV    hais  11   f
övergiven         DV    erge   9   f
övergiven         DV    erha  10   m
övergiven         DV    hais  11   f
övergiven         DV    hais  11   f
övergiven         DV    hais  11   f
övergiven         DV    jobe  10   m
övervuxna         DV    hais  11   f
helgalen          FS    alhe   9   f
alltihop          DV    angu   9   f
någonting         FS    hais  11   f
någonting         DV    hais  11   f
någonting         DV    mawe  11   f
någonting         SE    wg03  10   f
fortsätta         DV    alco   9   f
förbättra         SE    wj03  13   f
förstod           DV    hais  11   f
försökte          SN    wg16  10   f

ERROR
4.1.5 ni för tjänar verkligen mina hem kokta kladdkakor
4.1.6 a Tess min fina gamla hund du på minner mig om någon jag har träffat förut
4.1.7 Han ring de till mig sen och sa samma sak.
4.1.8 Hon under sökte noga hans fot.
CORRECTION     CORP  SUBJ    AGE  SEX
förtjänar      DV    hais    11   f
påminner       DV    hais    11   f
ringde         SN    wg07    10   f
undersökte     DV    mawe    11   f
därefter       CF    hais    11   f
därifrån       FS    hais    11   f
därifrån       SE    wj19    13   m
därifrån       SN    wg13    10   m
därifrån       SN    wj01    13   f
därifrån       SN    wj10    13   m
emot           FS    alhe     9   f
emot           FS    haic    11   f
fortfarande    SN    wg07    10   f
framemot       SN    wj09    13   m
förbi          FS    caan     9   m
förbi          DV    hais    11   f
förbi          SE    wg07    10   f
förut          DV    hais    11   f
förut          DV    idja    11   f
härifrån       CF    idja    11   f
härifrån       DV    angu     9   f
ibland         SE    wj02    13   m
ibland         SE    wj09    13   m
igen           CF    hais    11   f
igen           CF    hais    11   f
igen           SN    wg03    10   f
igen           SN    wg03    10   f
igenom         FS    erha    10   m
igenom         DV    erge     9   f
igenom         DV    idja    11   f
igenom         DV    idja    11   f
ihop           DV    erha    10   m
ihop           DV    erha    10   m
ihop           DV    erja     9   m
iväg           FS    angu09   9   f
iväg           FS    anhe    11   m
också          DV    angu     9   f
också          DV    erja     9   m
omkring        DV    hais    11   f

5 ADVERB
5.1.1 Där efter dog mamma på sjukhuset.
5.1.2 men han tog sig snabbt där i från.
5.1.3 när man bara går där ifrån
5.1.4 SEN GICK VI DÄR IFRÅN
5.1.5 Jag ställde mig på en sten och efter ett tag så ville jag gå där ifrån,
5.1.6 så till slut så sprang dom där ifrån
5.1.7 Bina som bodde i bot rusade i mot Oliver
5.1.8 han råkade bara kom i mot getingboet.
5.1.9 Marianne sa fort farande hur jag kunde va med henne
5.1.10 Alla såg fram emot att åka
5.1.11 Då kom hunden för bi med getingar
5.1.12 människor som går för bi kan höra oss.
5.1.13 Eller när man går för bi varandra
5.1.14 vi hade aldrig fått smaka plättar sylt och kola för ut
5.1.15 Inte konstigt att vi inte har upptäckt den här ingången för ut
5.1.16 jag som alltid tyckt det var så högt här i från.
5.1.17 stick här i från annars är du dödens
5.1.18 I bland kan allt vara jobbigt och hemskt
5.1.19 Men i bland kan det vara så att dom tror att dom är coola
5.1.20 jag var tvungen att berätta hela historien om i gen.
5.1.21 vad var det han hete nu i gen?
5.1.22 jag vill bli kompis med henne i gen
5.1.23 och så ville Johanna bli kompis i gen.
5.1.24 Pojken och hunden söker i genom rummet.
5.1.25 morfar och dom andra letar och letar i genom staden
5.1.26 Vi red i genom det stora hålet
5.1.27 Vi red i genom byn
5.1.28 när Gunnar öppna dörren till det stora huset rasa det i hop
5.1.29 snart rasa hela byn i hop
5.1.30 snabbt samla han i hop alla sina jägare
5.1.31 Rådjuret sprang i väg med honom.
5.1.32 Han sprang i vägg och klättrade upp på en kulle.
5.1.33 Lena såg en gammal man sitta i ett tält av guld intill sov säckarna som och så var av guld.
5.1.34 dam tåg och så med sig sina två tigrar
5.1.35 undulater flög om kring
5.1.36 när de såg sig om kring
5.1.37 han trillar om kull.
5.1.38 Han ropade igenom fönstret men inget kvack kom till baka.
5.1.39 vi gick till baka igen
5.1.40 svarta manen sprang sin väg och kom aldrig mer till baka.
5.1.41 Efter det gick vi till baka
5.1.42 ... ska man lämna till baka den.
5.1.43 Sedan slumrade såfus, grodan och simon djupt till sammans.
5.1.44 Men de var fult med buskar utan för som vi fick rid igenom.
5.1.45 en kille blev utan för,
5.1.46 men olof var glad en då
5.1.47 men om man inte får vara med än då
5.1.48 Erik letade över allt
5.1.49 Han letade över allt i sitt rum
5.1.50 Han letade under sängen under pallen i tofflorna bland kläderna ja över allt
5.1.51 Han letade över allt
5.1.52 Desere letade över allt
5.1.53 jag har letat över allt

6 PREPOSITION
6.1.1 fram för mig stod världens finaste häst.
6.1.2 Vi gick längs vägen tills vi såg ett stort hus som låg en bit utan för själva stan

7 CONJUNCTION
7.1.1 Efter som han frös och inte såg sig för snubblade han på en sten.
7.1.2 ... och efter som det inte fanns nåt lock på burken...
7.1.3 men jag kunde inte säga det till honom för att jag visste att han skulle bli lite ledsen då efter som vi hade bestämt.

8 RUN-ONS
8.1.1 Nathalie berättade alltför mig
8.1.2 därbakom fanns 2 grodor.
8.1.3 och tillslut stod vi alla på marken
8.1.4 tillslut välte han ner hela kupan
8.1.5 tillslut kom de fram till en gärdsgård
8.1.6 men tillslut tyckte de också att ...
8.1.7 tillslut blev dam två kamelerna så trötta...
8.1.8 tillslut kom de fram till en vacker plats
8.1.9 tillslut sa pappa
8.1.10 Tillslut kom dom upp mot sidan av oss och sa,
8.1.11 Tillslut kom det en massa vuxna som...
8.1.12 Vi åkte tillslut på bio.
8.1.13 mobbing råkar väldigt många utför.

CORRECTION    CORP  SUBJ  AGE  SEX
omkring       DV    jowe   9   f
omkull        FS    klma  10   f
tillbaka      FS    caan   9   m
tillbaka      DV    alhe   9   f
tillbaka      DV    angu   9   f
tillbaka      DV    idja  11   f
tillbaka      SE    wg17  10   f
tillsammans   FS    hais  11   f
utanför       DV    idja  11   f
utanför       SE    wj11  13   f
ändå          FS    frma   9   m
ändå          SE    wj14  13   m
överallt      FS    alhe   9   f
överallt      FS    jobe  10   m
överallt      FS    jowe   9   f
överallt      FS    mawe  11   f
överallt      DV    hais  11   f
överallt      DV    hais  11   f
framför       CF    alhe   9   f
utanför       DV    idja  11   f
eftersom      DV    mawe  11   f
eftersom      FS    alhe   9   f
eftersom      SN    wg06  10   f
allt för      SN    wg11  10   f
där bakom     FS    jowe   9   f
till slut     CF    idja  11   f
till slut     FS    hais  11   f
till slut     DV    alca  11   f
till slut     DV    alca  11   f
till slut     DV    erja   9   m
till slut     DV    hila  10   f
till slut     DV    idja  11   f
till slut     SN    wj04  13   m
till slut     SN    wj04  13   m
till slut     SN    wj04  13   m
ut för        SE    wj05  13   m

Appendix C
SUC Tagset

The set of tags used was taken from the Stockholm Umeå Corpus (SUC):

Code  Category
AB    Adverb
DL    Delimiter (Punctuation)
DT    Determiner
HA    Interrogative/Relative Adverb
HD    Interrogative/Relative Determiner
HP    Interrogative/Relative Pronoun
HS    Interrogative/Relative Possessive
IE    Infinitive Marker
IN    Interjection
JJ    Adjective
KN    Conjunction
NN    Noun
PC    Participle
PL    Particle
PM    Proper Noun
PN    Pronoun
PP    Preposition
PS    Possessive
RG    Cardinal Number
RO    Ordinal Number
SN    Subjunction
UO    Foreign Word
VB    Verb

Code     Feature          Type
UTR      Common (Utrum)   Gender
NEU      Neuter           Gender
MAS      Masculine        Gender
UTR/NEU  Underspecified   Gender
-        Unspecified      Gender
SIN      Singular         Number
PLU      Plural           Number
SIN/PLU  Underspecified   Number
-        Unspecified      Number
IND      Indefinite       Definiteness
DEF      Definite         Definiteness
IND/DEF  Underspecified   Definiteness
-        Unspecified      Definiteness
NOM      Nominative       Case
GEN      Genitive         Case
SMS      Compound         Case
-        Unspecified      Case
POS      Positive         Degree
KOM      Comparative      Degree
SUV      Superlative      Degree
SUB      Subject          Pronoun Form
OBJ      Object           Pronoun Form
SUB/OBJ  Underspecified   Pronoun Form
PRS      Present          Verb
PRT      Preterite        Verb
INF      Infinitive       Verb
SUP      Supinum          Verb
IMP      Imperative       Verb
AKT      Active           Voice
SFO      S form           Voice
KON      Subjunctive      Mood
PRF      Perfect          Perfect
AN       Abbreviation     Form

Appendix D
Implementation

D.1 Broad Grammar

#### Declare categories
define PPheadPhr ["<ppHead>" ~$"<ppHead>" "</ppHead>"];
define VPheadPhr ["<vpHead>" ~$"<vpHead>" "</vpHead>"];

define APPhr ["<ap>" ~$"<ap>" "</ap>"];
define NPPhr ["<np>" ~$"<np>" "</np>"];
define PPPhr ["<pp>" ~$"<pp>" "</pp>"];
define VPPhr ["<vp>" ~$"<vp>" "</vp>"];

#### Head rules
define AP [(Adv) Adj+];
define PPhead [Prep];
define VPhead [[[Adv* Verb] | [Verb Adv*]] Verb* (PNDef & PNNeu)];

#### Complement rules
define NP [[[(Det | Det2 | NGen) (Num) (APPhr) (Noun)] & ?+] | Pron];
define PP [PPheadPhr NPPhr];
define VP [VPheadPhr (NPPhr) (NPPhr) (NPPhr) PPPhr*];

#### Verb clusters
define VC [[[Verb Adv*] / NPTags] (NPPhr) [[Adv* Verb (Verb)] / NPTags]];

D.2 Narrow Grammar: Noun Phrases

############### Narrow grammar for APs:
define APDef ["<ap>" (Adv) AdjDef+ "</ap>"];
define APInd ["<ap>" (Adv) AdjInd+ "</ap>"];
define APSg  ["<ap>" (Adv) AdjSg+  "</ap>"];
define APPl  ["<ap>" (Adv) AdjPl+  "</ap>"];
define APNeu ["<ap>" (Adv) AdjNeu+ "</ap>"];
define APUtr ["<ap>" (Adv) AdjUtr+ "</ap>"];
define APMas ["<ap>" (Adv) AdjMas+ "</ap>"];
############### Narrow grammar for NPs:

###### NPs consisting of a single noun
define NPDef1 [(Num) [NDef | PNoun]];
define NPInd1 [(Num) NInd];
define NPSg1  [(NumO) NSg | [NPl & NInd] | PNoun];
define NPPl1  [(NumC) [NPl | PNoun]];
define NPNeu1 [(Num) [NNeu | [NUtr & NInd] | PNoun]];
define NPUtr1 [(Num) [[NUtr & NPl] | [NUtr & NDef] | PNoun]];

###### NPs consisting of a determiner (or a noun in genitive) and a noun
define NPDef2 [DetDef (DetAdv) (Num) NDef] | [[DetMixed | NGen] (Num) NInd];
define NPInd2 [DetInd (Num) NInd];
define NPSg2  [[DetSg (DetAdv) | NGen] (NumO) NSg];
define NPPl2  [[DetPl (DetAdv) | NGen] (NumC) NPl];
define NPNeu2 [[DetNeu (DetAdv) | NGen] (Num) NNeu];
define NPUtr2 [[DetUtr (DetAdv) | NGen] (Num) NUtr];

###### NPs consisting of [Det (AP) N]
define NPDef3 [DetDef (DetAdv) (Num) (APDef) NDef] | [[DetMixed | NGen] (Num) (APDef) NInd];
define NPInd3 [DetInd (NumO) (APInd) NInd];
define NPSg3  [[DetSg (DetAdv) | NGen] (NumO) (APSg) NSg];
define NPPl3  [[DetPl (DetAdv) | NGen] (NumC) (APPl) NPl];
#define NPNeu3 [[DetNeu (DetAdv) | NGen] (Num) (APNeu) NNeu];
define NPNeu3 [[DetNeu (DetAdv) | NGen] (Num) [[(APNeu) NNeu] | [(APMas) NMas]]];
define NPUtr3 [[DetUtr (DetAdv) | NGen] (Num) (APUtr) NUtr];

###### NPs consisting of [Adj+ N]
# optional numbers only in NPInd and NPPl
define NPDef4 [APDef NDef];
define NPInd4 [(Num) APInd NInd];
define NPSg4  [APSg NSg];
define NPPl4  [(Num) APPl NPl];
define NPNeu4 [APNeu NNeu];
define NPUtr4 [APUtr NUtr];

###### NPs consisting of a single pronoun
define NPDef5 [PNDef];
define NPInd5 [PNInd];
define NPSg5  [PNSg];
define NPPl5  [PNPl];
define NPNeu5 [PNNeu];
define NPUtr5 [PNUtr];

###### NPs consisting of a single determiner
define NPDef6 [DetDef (DetAdv)];
define NPInd6 [DetInd];
define NPSg6  [DetSg (DetAdv)];
define NPPl6  [DetPl (DetAdv)];
define NPNeu6 [DetNeu (DetAdv)];
define NPUtr6 [DetUtr (DetAdv)];

###### NPs consisting of adjectives
define NPDef7 [APDef+];
define NPInd7 [APInd+];
define NPSg7  [APSg+];
define NPPl7  [APPl+];
define NPNeu7 [APNeu+];
define NPUtr7 [APUtr+];

###### NPs consisting of a single determiner and adjectives
define NPDef8 [DetDef APDef];
define NPInd8 [DetInd APInd];
define NPSg8  [DetSg APSg];
define NPPl8  [DetPl APPl];
define NPNeu8 [DetNeu APNeu];
define NPUtr8 [DetUtr APUtr];

###### NPs consisting of number as the main word
define NPDef9 [(DetDef) NumO];
define NPInd9 [Num];
define NPSg9  [Num];
define NPPl9  [Num];
define NPNeu9 [Num];
define NPUtr9 [Num];

###### NPs that meet definiteness agreement
### Definite NPs
define NPDef [NPDef1 | NPDef2 | NPDef3 | NPDef4 | NPDef5 | NPDef6 | NPDef7 | NPDef8 | NPDef9];
### Indefinite NPs
define NPInd [NPInd1 | NPInd2 | NPInd3 | NPInd4 | NPInd5 | NPInd6 | NPInd7 | NPInd8 | NPInd9];
define NPDefs [NPDef | NPInd];

###### NPs that meet number agreement
### Singular NPs
define NPSg [NPSg1 | NPSg2 | NPSg3 | NPSg4 | NPSg5 | NPSg6 | NPSg7 | NPSg8 | NPSg9];
### Plural NPs
define NPPl [NPPl1 | NPPl2 | NPPl3 | NPPl4 | NPPl5 | NPPl6 | NPPl7 | NPPl8 | NPPl9];
define NPNum [NPSg | NPPl];

###### NPs that meet gender agreement
### Utrum NPs
define NPUtr [NPUtr1 | NPUtr2 | NPUtr3 | NPUtr4 | NPUtr5 | NPUtr6 | NPUtr7 | NPUtr8 | NPUtr9];
### Neutrum NPs
define NPNeu [NPNeu1 | NPNeu2 | NPNeu3 | NPNeu4 | NPNeu5 | NPNeu6 | NPNeu7 | NPNeu8 | NPNeu9];
define NPGen [NPNeu | NPUtr];
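The narrow NP grammars above split agreement into three feature dimensions (definiteness, number, gender), each accepting only NPs whose words share a value for that feature. The same idea can be sketched in Python by giving each word a set of allowed values per feature and intersecting across the phrase; this is a minimal illustration, not part of FiniteCheck, and the toy lexicon entries below are assumptions (the feature codes follow the SUC tagset).

```python
# Minimal sketch (not part of FiniteCheck): agreement as feature intersection.
# Each word carries the set of values it allows for a feature; a phrase
# agrees on that feature if some value is shared by all words in it.

def agrees(words, feature):
    """words: list of dicts mapping a feature name to a set of allowed values."""
    shared = None
    for w in words:
        vals = w.get(feature)
        if vals is None:            # word underspecified for this feature
            continue
        shared = vals if shared is None else shared & vals
        if not shared:
            return False
    return True

# Illustrative entries for a gender-clashing NP like "ett liten bil"
# (neuter determiner, common-gender adjective and noun).
ett   = {"gender": {"NEU"}, "number": {"SIN"}}
liten = {"gender": {"UTR"}, "number": {"SIN"}}
bil   = {"gender": {"UTR"}, "number": {"SIN"}}

print(agrees([ett, liten, bil], "gender"))   # False: NEU vs UTR clash
print(agrees([ett, liten, bil], "number"))   # True: all singular
```

Skipping underspecified words mirrors the grammar's use of optional and mixed categories (e.g. `DetMixed`, `SIN/PLU` tags), which must not block agreement.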
########## Partitive NPs
define NPPart [[Det | Num] PPart NP];

define NPPartDef [[Det | Num] PPart NPDef];
define NPPartInd [[Det | Num] PPart NPDef];
define NPPartSg  [[DetSg | Num] PPart NPPl];
define NPPartPl  [[DetPl | Num] PPart NPPl];
define NPPartNeu [[DetNeu | Num] PPart NPNeu];
define NPPartUtr [[DetUtr | Num] PPart NPUtr];

define NPPartDefs [NPPartDef | NPPartInd];
define NPPartNum  [NPPartSg | NPPartPl];
define NPPartGen  [NPPartNeu | NPPartUtr];

########## NPs followed by relative subclause
define SelectNPRel [
  "<np>" -> "<NPRel>" || _ DetDef ~$"<np>" "</np>" (" ") {som} Tag*];

D.3 Narrow Grammar: Verb Phrases

#### Infinitive VPs
# select Infinitive VPs
define SelectInfVP ["<vpHead>" -> "<vpHeadInf>" || InfMark "<vp>" _ ];
# Infinitive VP
define VPInf [Adv* (ModInf) VerbInf Adv* (NPPhr)];

#### Tensed verb first
define VPFinite [Adv* VerbTensed ?*];

#### Verb Clusters:
# select VCs
define SelectVC [VC @-> "<vc>" ... "</vc>"];

define VC1 [[[Mod | INFVerb] / NPTags] (NPPhr) [[Adv* VerbInf] / NPTags]];
define VC2 [[Mod / NPTags] (NPPhr) [[Adv* ModInf VerbInf] / NPTags]];
define VC3 [[Mod / NPTags] (NPPhr) [[Adv* PerfInf VerbSup] / NPTags]];
define VC4 [[Perf / NPTags] (NPPhr) [[Adv* VerbSup] / NPTags]];
define VC5 [[Perf / NPTags] (NPPhr) [[Adv* ModSup VerbInf] / NPTags]];
define VCgram [VC1 | VC2 | VC3 | VC4 | VC5];

### Coordinated VPs:
define SelectVPCoord [
  "<vpHead>" -> "<vpHeadCoord>" || ["<vpHeadInf>" | "</vc>"] ~$"<vpHead>" ~$"<vp>"
  [{eller} | {och}] Tag* (" ") "<vp>" _ ];

#** ATT-VPs that do not require infinitive
define SelectATTFinite [
  "<vpHead>" -> "<vpHeadATTFinite>" ||
  [[[[{sa} Tag+] | [[{för} Tag+] / NPTags]] ("</vpHead></vp>")] |
   [[{tänkte} Tag+] [[NPPhr "</vpHead></vp>"] | ["</vpHead>" NPPhr "</vp>"]]]]
  InfMark "<vp>" _ ];

### Supine VPs
define SelectSupVP ["<vpHead>" -> "<vpHeadSup>" || _ VerbSup "</vpHead>"];

D.4 Parser

###### Mark head phrases (lexical prefix)
define markPPhead [PPhead @->
"<ppHead>" ... "</ppHead>"]; markVPhead [VPhead @-> "<vpHead>" ... "</vpHead>"]; markAP [AP @-> "<ap>" ... "</ap>" ]; ###### define define define Mark phrases with complements markNP [NP @-> "<np>" ... "</np>" ]; markPP [PP @-> "<pp>" ... "</pp>" ]; markVP [VP @-> "<vp>" ... "</vp>" ]; ###### define define define Composing parsers parse1 [markVPhead .o. markPPhead .o. markAP]; parse2 [markNP]; parse3 [markPP .o. markVP]; D.5 Filtering ################# Filtering Parsing Results ### Possessive NPs define adjustNPGen [ 0 -> "<vpHead>" || NGen "</np><vpHead>" NPPhr _,, "</np><vpHead><np>" -> 0 || NGen _ ˜$"<np>" </np>"]; ### Adjectives define adjustNPAdj [ "</np><vpHead><np>" -> 0 || Det _ APPhr "</np></vpHead>" NPPhr ,, "</np></vpHead><np>" -> 0 || Det "</np><vpHead><np>" APPhr _]; ### Adjective form, i.e. remove plural tags if singular NP define removePluralTagsNPSg [ TagPLU -> 0 || DetSg "<ap>" Adj _ ˜$"</np>" "</np>"]; ### Partitive NPs define adjustNPPart [ Appendix D. 320 "</np><ppHead>" -> 0 || _ PPart "</ppHead><np>",, "</ppHead><np>" -> 0 || "</np><ppHead>" PPart _]; ### Complex VCs stretched over two vpHeads: define adjustVC [ "</vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] _ NPPhr VPheadPhr,, "</vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] NPPhr _ VPheadPhr,, "<vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags]] "</vpHead>" NPPhr _ ˜$"<vpHead>" "</vpHead>",, "<vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags]] NPPhr "</vpHead>" _ ˜$"<vpHead>" "</vpHead>" ]; ### VCs with two copula or copula and an adjective: define SelectVCCopula [ "<vc>" -> "<vcCopula>" || _ [CopVerb / NPTags] ˜$"<vc>" "</vc>"]; ################# Removing Parsing Errors ### not complete PPs, i.e. 
# ppHeads without a following NP
define errorPPhead [
  "<ppHead>" -> 0 || \["<pp>"] _ ,,
  "</ppHead>" -> 0 || _ \["<np>"]];

### empty VPHead
define errorVPHead ["<vp><vpHead></vpHead></vp>" -> 0];

D.6 Error Finder

######### Finding grammatical errors (Error marking)

###### NPs
# Define NP-errors
define npDefError ["<np>" [NP - NPDefs] "</np>"];
define npNumError ["<np>" [NP - NPNum] "</np>"];
define npGenError ["<np>" [NP - NPGen] "</np>"];

# Mark NP-errors
define markNPDefError [npDefError -> "<Error definiteness>" ... "</Error>"];
define markNPNumError [npNumError -> "<Error number>" ... "</Error>"];
define markNPGenError [npGenError -> "<Error gender>" ... "</Error>"];

# Define NPPart-errors
define NPPartDefError ["<NPPart>" [NPPart - NPPartDefs] "</np>"];
define NPPartNumError ["<NPPart>" [NPPart - NPPartNum] "</np>"];
define NPPartGenError ["<NPPart>" [NPPart - NPPartGen] "</np>"];

# Mark NPPart-errors
define markNPPartDefError [NPPartDefError -> "<Error definiteness NPPart>" ... "</Error>"];
define markNPPartNumError [NPPartNumError -> "<Error number NPPart>" ... "</Error>"];
define markNPPartGenError [NPPartGenError -> "<Error gender NPPart>" ... "</Error>"];

###### VPs
# Define errors in VPs
define vpFiniteError ["<vpHead>" [VPhead - VPFinite] "</vpHead>"];
define vpInfError ["<vpHeadInf>" [VPhead - VPInf] "</vpHead>"];
define VCerror ["<vc>" [VC - VCgram] "</vc>"];

# Mark VP-errors
define markFiniteError [vpFiniteError -> "<Error finite verb>" ... "</Error>"];
define markInfError [vpInfError -> "<Error infinitive verb>" ... "</Error>"];
define markVCerror [VCerror -> "<Error verb after Vaux>" ... "</Error>"];
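Every error definition above follows the same pattern: subtract the narrow, agreement-checking grammar from the broad one (e.g. `NP - NPDefs`), and mark whatever phrases remain. The following toy Python illustration shrinks that subtraction to finite sets of strings, so plain set difference stands in for finite-state subtraction; the four-word vocabulary and the two "grammars" are assumptions for illustration, not the real FiniteCheck grammars.

```python
# Toy illustration of the subtraction idea behind the error finder:
# Broad - Narrow = the error language. Here regular languages are shrunk
# to finite string sets, so Python's set difference plays the role of
# finite-state subtraction.
from itertools import product

gender = {"en": "UTR", "ett": "NEU", "bil": "UTR", "hus": "NEU"}
dets, nouns = ["en", "ett"], ["bil", "hus"]

# Broad grammar: any determiner + noun parses as an NP.
broad = {f"{d} {n}" for d, n in product(dets, nouns)}

# Narrow grammar: determiner and noun must agree in gender.
narrow = {np for np in broad
          if gender[np.split()[0]] == gender[np.split()[1]]}

# Phrases in the difference get an error mark, as in markNPGenError.
for np in sorted(broad - narrow):
    print(f"<Error gender> {np} </Error>")
# prints:
# <Error gender> en hus </Error>
# <Error gender> ett bil </Error>
```

Note that both "grammars" here are positive descriptions of Swedish NPs; nothing in the subtraction requires predicting what errors children will make, which is exactly the advantage the thesis claims for the technique.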