320 pages - Institutionen för filosofi, lingvistik och vetenskapsteori
GOTHENBURG MONOGRAPHS IN LINGUISTICS 24

Automatic Detection of Grammar Errors in Primary School Children's Texts: A Finite State Approach

Sylvana Sofkova Hashemi

Doctoral Dissertation
Publicly defended in Lilla Hörsalen, Humanisten, Göteborg University, on June 7, 2003, at 10.15, for the degree of Doctor of Philosophy
Department of Linguistics, Göteborg University, Sweden
ISBN 91-973895-5-2
© 2003 Sylvana Sofkova Hashemi
Typeset by the author using LaTeX
Printed by Intellecta Docusys, Göteborg, Sweden, 2003

Abstract

This thesis concerns the analysis of grammar errors in Swedish texts written by primary school children and the development of a finite state system for finding such errors. Grammar errors are more frequent for this group of writers than for adults, and the distribution of error types is different in children's texts. In addition, other writing errors above the word level are discussed here, including punctuation errors and spelling errors resulting in existing words. The method used in the implemented tool FiniteCheck involves subtraction of finite state automata that represent grammars with varying degrees of detail, creating a machine that classifies phrases in a text containing certain kinds of errors. The current version of the system handles errors concerning agreement in noun phrases and verb selection of finite and non-finite forms. At the lexical level, we attach all lexical tags to words and do not use a tagger, which could eliminate information in incorrect text that might be needed later to find the error. At higher levels, structural ambiguity is treated by parsing order, grammar extension and some other heuristics. The simple finite state technique of subtraction has the advantage that the grammars one needs to write to find errors are always positive, describing the valid rules of Swedish rather than the structure of errors. The rule sets remain quite small and practically no prediction of errors is necessary.
The linguistic performance of the system is promising: for the error types implemented, it shows results comparable to other Swedish grammar checking tools when tested on a small adult text not previously analyzed by the system. The performance of the other Swedish tools was also tested on the children's data collected for this study, revealing quite low recall rates. This fact motivates the need to adapt grammar checking techniques to children, whose errors differ from those of adult writers and pose more of a challenge to current grammar checkers, which are oriented towards texts written by adults. The robustness and modularity of FiniteCheck make it possible to perform both error detection and diagnostics. Moreover, the grammars can in principle be reused for other applications that do not necessarily have anything to do with error detection, such as extracting information from a given text or even parsing.

KEYWORDS: grammar errors, spelling errors, punctuation, children's writing, Swedish, language checking, light parsing, finite state technology

Acknowledgements

Work on this thesis would not have been possible without contributions, support and encouragement from many people. The idea of developing a writing tool to support children in their text production and grammar emerged from a study of how primary school children write by hand in comparison to when they use a computer. Special thanks to my colleague Torbjörn Lager, who inspired me to do this study and whose children attended the school where I gathered my data. My main supervisor Robin Cooper awakened the idea of using finite state methods for grammar checking and launched the collaboration with the Xerox research group.
I want to express my greatest gratitude to him for inspiring discussions during project meetings and supervision sessions, and for his patience with my writing, struggling to understand every bit of it, always raising questions and always full of new exciting ideas. I really enjoyed our discussions and look forward to more. I would also like to thank my assistant supervisor Elisabet Engdahl, who carefully read my writing and made sure that I expressed myself more clearly.

Many thanks to all my colleagues at the Department of Linguistics for creating an inspiring research environment with interesting projects, seminars and conferences. I especially want to mention Leif Grönqvist for being the helping hand next door whenever needed, Robert Andersson for being my project colleague, Stina Ericsson for the loan of a LaTeX manual and for always being helpful, Ulla Veres for help with the recruitment of new victims for writing experiments, Jens Allwood and Elisabeth Ahlsén for introducing me to the world of transcription and coding, Sally Boyd, Nataliya Berbyuk and Ulrika Ferm for support and encouragement, Shirley Nicholson for always being available with books and also milk for coffee, and Pia Cromberger for always being ready for a chat. A special thanks to Ylva Hård af Segerstad for fruitful discussions leading to future collaboration that I am looking forward to, and for being a friend. I also want to thank the children in my study and their teachers for providing me with their text creations, and Sven Strömqvist and Victoria Johansson for sharing their data collection. A special thanks to Genie Perdin, who carefully proofread this thesis and gave me some encouraging last-minute 'kicks'. I also want to thank all my friends, who reminded me now and then about life outside the university. My deepest gratitude to my family for being there for me and for always believing in me. My husband Ali - I know the way was long and there were times I could be distant, but I am back.
My daughter Sarah, for being the sunshine of my life, my inspiration, my everything. My mother, father, sister and my big little brother ...

Sylvana Sofkova Hashemi
Göteborg, May 2003

Table of Contents

1 Introduction
1.1 Written Language in a Computer Literate Society
1.2 Aim and Scope of the Study
1.3 Outline of the Thesis

Part I: Writing

2 Writing and Grammar
2.1 Introduction
2.2 Research on Writing in General
2.3 Written Language and Computers
2.3.1 Learning to Write
2.3.2 The Influence of Computers on Writing
2.4 Studies of Grammar Errors
2.4.1 Introduction
2.4.2 Primary and Secondary Level Writers
2.4.3 Adult Writers
2.5 Conclusion

3 Data Collection and Analysis
3.1 Introduction
3.2 Data Collection
3.2.1 Introduction
3.2.2 The Sub-Corpora
3.3 Error Categories
3.3.1 Introduction
3.3.2 Spelling Errors
3.3.3 Grammar Errors
3.3.4 Spelling or Grammar Error?
3.3.5 Punctuation
3.4 Types of Analysis
3.5 Error Coding and Tools
3.5.1 Corpus Formats
3.5.2 CHAT-format and CLAN-software

4 Error Profile of the Data
4.1 Introduction
4.2 General Overview
4.3 Grammar Errors
4.3.1 Agreement in Noun Phrases
4.3.2 Agreement in Predicative Complement
4.3.3 Definiteness in Single Nouns
4.3.4 Pronoun Case
4.3.5 Verb Form
4.3.6 Sentence Structure
4.3.7 Word Choice
4.3.8 Reference
4.3.9 Other Grammar Errors
4.3.10 Distribution of Grammar Errors
4.3.11 Summary
4.4 Child Data vs. Other Data
4.4.1 Primary and Secondary Level Writers
4.4.2 Evaluation Texts of Proof Reading Tools
4.4.3 Scarrie's Error Database
4.4.4 Summary
4.5 Real Word Spelling Errors
4.5.1 Introduction
4.5.2 Spelling in Swedish
4.5.3 Segmentation Errors
4.5.4 Misspelled Words
4.5.5 Distribution of Real Word Spelling Errors
4.5.6 Summary
4.6 Punctuation
4.6.1 Introduction
4.6.2 General Overview of Sentence Delimitation
4.6.3 The Orthographic Sentence
4.6.4 Punctuation Errors
4.6.5 Summary
4.7 Conclusions

Part II: Grammar Checking

5 Error Detection and Previous Systems
5.1 Introduction
5.2 What Is a Grammar Checker?
5.2.1 Spelling vs. Grammar Checking
5.2.2 Functionality
5.2.3 Performance Measures and Their Interpretation
5.3 Possibilities for Error Detection
5.3.1 Introduction
5.3.2 The Means for Detection
5.3.3 Summary and Conclusion
5.4 Grammar Checking Systems
5.4.1 Introduction
5.4.2 Methods and Techniques in Some Previous Systems
5.4.3 Current Swedish Systems
5.4.4 Overview of The Swedish Systems
5.4.5 Summary
5.5 Performance on Child Data
5.5.1 Introduction
5.5.2 Evaluation Procedure
5.5.3 The Systems' Detection Procedures
5.5.4 The Systems' Detection Results
5.5.5 Overall Detection Results
5.6 Summary and Conclusion

6 FiniteCheck: A Grammar Error Detector
6.1 Introduction
6.2 Finite State Methods and Tools
6.2.1 Finite State Methods in NLP
6.2.2 Regular Grammars and Automata
6.2.3 Xerox Finite State Tool
6.2.4 Finite State Parsing
6.3 System Architecture
6.3.1 Introduction
6.3.2 The System Flow
6.3.3 Types of Automata
6.4 The Lexicon
6.4.1 Composition of The Lexicon
6.4.2 The Tagset
6.4.3 Categories and Features
6.5 Broad Grammar
6.6 Parsing
6.6.1 Parsing Procedure
6.6.2 The Heuristics of Parsing Order
6.6.3 Further Ambiguity Resolution
6.6.4 Parsing Expansion and Adjustment
6.7 Narrow Grammar
6.7.1 Noun Phrase Grammar
6.7.2 Verb Grammar
6.8 Error Detection and Diagnosis
6.8.1 Introduction
6.8.2 Detection of Errors in Noun Phrases
6.8.3 Detection of Errors in the Verbal Head
6.9 Summary

7 Performance Results
7.1 Introduction
7.2 Initial Performance on Child Data
7.2.1 Performance Results: Phase I
7.2.2 Grammatical Coverage
7.2.3 Flagging Accuracy
7.3 Current Performance on Child Data
7.3.1 Introduction
7.3.2 Improving Flagging Accuracy
7.3.3 Performance Results: Phase II
7.4 Overview of Performance on Child Data
7.5 Performance on Other Text
7.5.1 Performance Results of FiniteCheck
7.5.2 Performance Results of Other Tools
7.5.3 Overview of Performance on Other Text
7.6 Summary and Conclusion

8 Summary and Conclusion
8.1 Introduction
8.2 Summary
8.2.1 Introduction
8.2.2 Children's Writing Errors
8.2.3 Diagnosis and Possibilities for Detection
8.2.4 Detection of Grammar Errors
8.3 Conclusion
8.4 Future Plans
8.4.1 Introduction
8.4.2 Improving the System
8.4.3 Expanding Detection
8.4.4 Generic Tool?
8.4.5 Learning to Write in the Information Society

Bibliography

Appendices
A Grammatical Feature Categories
B Error Corpora
B.1 Grammar Errors
B.2 Misspelled Words
B.3 Segmentation Errors
C SUC Tagset
D Implementation
D.1 Broad Grammar
D.2 Narrow Grammar: Noun Phrases
D.3 Narrow Grammar: Verb Phrases
D.4 Parser
D.5 Filtering
D.6 Error Finder

List of Tables

3.1 Child Data Overview
4.1 General Overview of Sub-Corpora
4.2 General Overview by Age
4.3 General Overview of Spelling Errors in Sub-Corpora
4.4 General Overview of Spelling Errors by Age
4.5 Number Agreement in Swedish
4.6 Gender Agreement in Swedish
4.7 Definiteness Agreement in Swedish
4.8 Noun Phrases with Proper Nouns as Head
4.9 Noun Phrases with Pronouns as Head
4.10 Noun Phrases without (Nominal) Head
4.11 Agreement in Partitive Noun Phrase in Swedish
4.12 Gender and Number Agreement in Predicative Complement
4.13 Personal Pronouns in Swedish
4.14 Finite and Non-finite Verb Forms
4.15 Tense Structure
4.16 Fa-sentence Word Order
4.17 Af-sentence Word Order
4.18 Distribution of Grammar Errors in Sub-Corpora
4.19 Distribution of Grammar Errors by Age
4.20 Examples of Grammar Errors in Teleman's Study
4.21 Examples of Grammar Errors from the Skrivsyntax Project
4.22 Grammar Errors in the Evaluation Texts of Grammatifix
4.23 Grammar Errors in Granska's Evaluation Corpus
4.24 General Error Ratio in Grammatifix, Granska and Child Data
4.25 Three Error Types in Grammatifix, Granska and Child Data
4.26 Grammar Errors in Scarrie's ECD and Child Data
4.27 Examples of Spelling Error Categories
4.28 Spelling Variants
4.29 Distribution of Real Word Segmentation Errors
4.30 Distribution of Real Word Spelling Errors in Sub-Corpora
4.31 Distribution of Real Word Spelling Errors by Age
4.32 Sentence Delimitation in the Sub-Corpora
4.33 Sentence Delimitation by Age
4.34 Major Delimiter Errors in Sub-Corpora
4.35 Major Delimiter Errors by Age
4.36 Comma Errors in Sub-Corpora
4.37 Comma Errors by Age
5.1 Summary of Detection Possibilities in Child Data
5.2 Overview of the Grammar Error Types in Grammatifix (GF), Granska (GR) and Scarrie (SC)
5.3 Overview of the Performance of Grammatifix, Granska and Scarrie
5.4 Performance Results of Grammatifix on Child Data
5.5 Performance Results of Granska on Child Data
5.6 Performance Results of Scarrie on Child Data
5.7 Performance Results of Targeted Errors
6.1 Some Expressions and Operators in XFST
6.2 Types of Directed Replacement
6.3 Noun Phrase Types
7.1 Performance Results on Child Data: Phase I
7.2 False Alarms in Noun Phrases: Phase I
7.3 False Alarms in Finite Verbs: Phase I
7.4 False Alarms in Verb Clusters: Phase I
7.5 False Alarms in Noun Phrases: Phase II
7.6 False Alarms in Finite Verbs: Phase II
7.7 False Alarms in Verb Clusters: Phase II
7.8 Performance Results on Child Data: Phase II
7.9 Performance Results of FiniteCheck on Other Text
7.10 Performance Results of Grammatifix on Other Text
7.11 Performance Results of Granska on Other Text
7.12 Performance Results of Scarrie on Other Text

List of Figures

3.1 Principles for Error Categorization
4.1 Grammar Error Distribution
4.2 Error Density in Sub-Corpora
4.3 Error Density in Age Groups
4.4 Three Error Types in Grammatifix (black line), Granska (gray line) and Child Data (white line)
4.5 Error Distribution of Selected Error Types in Scarrie
4.6 Error Distribution of Selected Error Types in Child Data
6.1 The System Architecture of FiniteCheck
7.1 False Alarms: Phase I vs. Phase II
7.2 Overview of Recall in Child Data
7.3 Overview of Precision in Child Data
7.4 Overview of Overall Performance in Child Data
7.5 Overview of Recall in Other Text
7.6 Overview of Precision in Other Text
7.7 Overview of Overall Performance in Other Text

Chapter 1
Introduction

1.1 Written Language in a Computer Literate Society

Written language plays an important role in our society. A great deal of our communication occurs by means of writing, which, besides the traditional paper and pen, is facilitated by the computer, the Internet and other applications such as, for instance, the mobile phone.[1] Word processing and sending messages via email are among the most common activities on computers. Other communication media that enable written communication are also becoming popular, such as webchat and instant messaging on the Internet or text messaging (Short Message Service, SMS) via the mobile phone. The present doctoral dissertation concerns word processing on computers, in particular the linguistic tools integrated in such authoring aids.
The use of word processors for writing in both educational and professional settings modifies the process, practice and acquisition of writing. With a word processor, it is not only easy to produce a text with a neat layout; the writer is supported throughout the whole writing process. Text may be restructured and revised at any time during text production without leaving any trace of the changes that have been made. Text may be reused and a new text composed by cutting and pasting passages. Iconic material such as pictures[2] (or even sounds) can be inserted, and linguistic aids can be used for proofreading a text.

Writing acquisition can also be enhanced by the use of a word processor. For instance, the focus shifts from the more technical aspects of writing, such as physically shaping letters with a pen, toward the more cognitive processes of text production, enabling the writer to apply the whole language register. Writing on a computer generally enhances the motivation to write, revise or completely change a text (cf. Wresch, 1984; Daiute, 1985; Severinson Eklundh, 1993; Pontecorvo, 1997).

The status of written language in our modern information society has changed. In contrast to ancient times, writing is no longer reserved for just a small minority of professional groups (e.g. priests and monks, bankers, important merchants). In particular, the emergence of computers in writing has led to the involvement of new user groups besides today's writing professionals such as journalists, novelists and scientists. We write more nowadays in general, and the freedom of and control over one's own writing has increased.

[1] Studies of computer-mediated communication are provided by e.g. Severinson Eklundh (1994), Crystal (2001) and Herring (2001). A recent dissertation by Hård af Segerstad (2002) explores in particular how written Swedish is used in email, webchat and SMS.
[2] Smileys or emoticons (e.g. :-) "happy face") are widely used in computer-mediated communication.
Texts are produced rapidly and are less often proofread by a careful secretary with knowledge of language. This is sometimes reflected in the quality and correctness of the resulting text (cf. Severinson Eklundh, 1995). Linguistic tools that check mechanics, grammar and style have taken over the secretarial function to some degree and are usually integrated in word processing software. Spelling checkers and hyphenators, which check writing mechanics and identify violations in individual words, have existed for some time now. Grammar checkers, which recognize syntactic errors and often also violations of punctuation, word capitalization conventions, number and date formatting and other style-related issues, and thus work above the word level, are a rather new technology, especially for smaller languages like Swedish.

Grammar checking tools for languages such as English, French, Dutch, Spanish and Greek were being developed in the 1980s, whereas research on Swedish writing aids aimed at grammatical deviance started quite recently. In addition to the present work, there are three research groups working in this area. The Department of Numerical Analysis and Computer Science (NADA) at the Royal Institute of Technology (KTH), with a long tradition of research in writing and authoring aids, is responsible for Granska. Development of this tool has occurred over a series of projects starting in 1994 (Domeij et al., 1996, 1998; Carlberger et al., 2002). The Department of Linguistics, Uppsala University was involved in an EU-sponsored project, Scarrie, between 1996 and 1999. The goal of this project was the development of language tools for Danish, Norwegian and Swedish (Sågvall Hein, 1998a; Sågvall Hein et al., 1999). Finally, the Finnish language engineering company Lingsoft Inc. developed Grammatifix.
Initiated in 1997 and completed in 1999, this tool was released on the market in November 1998, and has been part of the Swedish Microsoft Office package since 2000 (Arppe, 2000; Birn, 2000).

The three Swedish systems mainly use parsing techniques with some degree of feature relaxation and/or explicit error rules for the detection of errors. Grammatifix and Granska are developed as generic tools and are tested on adult (mostly professional) texts. Scarrie's end-users are professional writers from newspapers and publishing firms.

1.2 Aim and Scope of the Study

The primary purpose of the present work is to detect grammar errors by means of linguistic descriptions of correct language use rather than descriptions of the structure of errors. The ideal would be to develop a generic method for the detection of grammar errors in unrestricted text that could be applied to different writing populations displaying different error types, without the need to rewrite the grammars of the system. That is, instead of describing the errors made by different groups of writers, resulting in distinct sets of error rules, the same grammar set is used for detection. This approach of identifying errors in text without explicitly describing them contrasts with the other three Swedish grammar checkers. Using this method, we will hopefully cover many different cases of errors and minimize the possibility of overlooking some of them.

We chose primary school children as the target population, a new group of users not covered by the previous Swedish projects. Children, as beginning writers, are still in the process of acquiring written language, unlike adult writers, and will probably produce relatively more errors, and errors of a different kind, than adults. Their writing errors probably have more to do with competence than performance. Grammar checkers for this group have to have different coverage and concentrate on different kinds of errors.
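The core idea of the approach sketched above — describe only valid Swedish, and let the difference between a permissive and a strict grammar expose the errors — can be illustrated with a toy example over regular languages. The following sketch is illustrative only: it uses plain Python set difference over a tiny, hypothetical two-word lexicon, not the actual FiniteCheck automata or its grammars.

```python
# Toy illustration of error detection by "grammar subtraction".
# The lexicon and feature tags below are hypothetical examples,
# not part of the FiniteCheck implementation.
from itertools import product

# Determiners and nouns tagged with grammatical gender:
# utr = utrum (common gender), neu = neutrum (neuter).
DETS  = {"en": "utr", "ett": "neu"}
NOUNS = {"bil": "utr", "hus": "neu"}   # 'car' (utrum), 'house' (neutrum)

# Broad grammar: any determiner followed by any noun (structure only,
# no feature checking).
broad = {f"{d} {n}" for d, n in product(DETS, NOUNS)}

# Narrow grammar: determiner and noun must agree in gender.
narrow = {f"{d} {n}" for d, n in product(DETS, NOUNS)
          if DETS[d] == NOUNS[n]}

# "Subtraction": phrases accepted by the broad grammar but not by the
# narrow one are exactly the ill-formed noun phrases.
errors = broad - narrow
print(sorted(errors))   # → ['en hus', 'ett bil']
```

Both grammars are positive descriptions of structure; the agreement errors (*en hus*, *ett bil*) are never written down as error rules but fall out of the difference, which is the property that keeps the rule sets small. In the thesis this subtraction is performed on finite state automata rather than finite sets, so it generalizes to unbounded phrase patterns.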
Further, the positive impact of computers on children's writing opens new opportunities for the application of language technology. The role of proofreading tools for educational purposes is a rather new application area, and this work can be considered a first step in that direction.

Against this background, the main goal of the present thesis is handling children's errors and experimenting with positive grammatical descriptions using finite state techniques. The work is divided into three subtasks: first, an overall error analysis of the collected children's texts; then, exploring the nature of the errors and the possibilities for detecting them; and finally, implementing detection of (some) grammatical error types. Here is a brief characterization of these three tasks:

I. Investigation of children's writing errors: The targeted data for a grammar checker can be selected either by intuitions about errors that will probably occur, or by directly looking at errors that actually occur. In the present work, the second approach of empirical analysis is applied. Texts from pupils at three primary schools were collected and analyzed for errors, focusing on errors above the word level, including grammar errors, spelling errors resulting in existing words, and punctuation. The main focus lies on grammar errors as the basis for implementation. The questions that arise are: What grammar errors occur? How should the errors be categorized? What spelling errors result in lexicalized strings and are not captured by a spelling checker? What is the nature of these? How is punctuation used and what errors occur?

II. Investigation of the possibilities for detection of these writing errors: The nature of the errors will be explored along with the available technology that can be applied in order to detect them. An interesting point is how the errors that are found are handled by current systems. The questions that arise are: What is the nature of the error?
What is the diagnosis of the error? What is needed to be able to detect the error? How are the grammar errors handled by the current Swedish grammar checkers, Grammatifix, Granska and Scarrie?

III. Implementation of the detection of (some) grammar errors: A subset of errors will be chosen for implementation, covering grammar checking up to the level of detecting errors. Detected errors will be given a description of the type of error involved. The implementation will not include any additional diagnosis or any suggestion of how to correct the error. The analysis will be shallow, using finite state techniques. The grammars will describe real syntactic relations rather than the structure of erroneous patterns. The differences between grammars of varying accuracy will reveal the errors, since as finite state automata the grammars can be subtracted from each other. Karttunen et al. (1997a) use this technique to find instances of invalid dates, and this is an attempt to apply their approach to a larger language domain.

The work on this grammar error detector started at the Department of Linguistics at Göteborg University in 1998, in the project Finite State Grammar for Finding Grammatical Errors in Swedish Text, and was a collaboration with the NADA group at KTH in the project Integrated Language Tools for Writing and Document Handling.[3] The present thesis describes both the initial development within this project and its continuation. The main contributions of this thesis concern the understanding of incorrect language use in primary school children's writing and the computational analysis of such incorrect text by means of correct language use, in particular:

• Collection of texts written by primary school children, both by hand and on a computer.
3 This project was sponsored by HSFR/NUTEK Language Technology Programme and has its site at: http://www.nada.kth.se/iplab/langtools/

• Analysis of grammar errors, spelling errors and punctuation in the texts of primary school writers.
• Comparison of errors found in the present data with errors found in other studies on grammar errors.
• Comparison of error types covered by the three Swedish grammar checkers.
• Performance analysis of the three Swedish grammar checkers on the present data.
• Implementation of a grammar error detector that derives/compiles error patterns rather than writing the error grammar by hand.
• Performance analysis of the detector on the collected data and some portion of other data.

1.3 Outline of the Thesis

The remaining chapters of the thesis fall into two parts.

Part I: The first part is devoted to a discussion of writing and an analysis of the collected data and consists of three chapters. Chapter 2 provides a brief introduction to research on writing in general, writing acquisition, how computers influence writing and descriptions of previous findings on grammar errors, concluding with what grammar errors are to be expected in written Swedish. Chapter 3 gives an overview of the data collected and a discussion of error classification. Chapter 4 presents the error profile of the data. The chapter concludes with discussion of the requirements for a grammar error detector for the particular subjects of this study.

Part II: The second part of the thesis concerns grammar checking and includes three chapters. Chapter 5 starts with a general overview of the requirements and functionalities of a grammar checker and what is required for the errors in the present data. Swedish grammar checkers are described and their performance is checked on the present data. Chapter 6 presents the implementation of a grammar error detector that handles these errors, including description of finite state formalism.
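The grammar-subtraction technique underlying this detector can be previewed in miniature. The sketch below is illustrative only: the tiny Swedish vocabulary, its (gender, number) feature coding and the agreement test are invented for the example (definiteness is ignored), and because the toy phrase language is finite, plain set difference stands in for automaton subtraction. Note that both grammars are stated positively; the error recognizer is derived as loose minus strict, never written by hand.

```python
# Toy preview of grammar subtraction: the "error language" is the loose
# grammar minus the strict grammar. The vocabulary and features below are
# invented for illustration, not taken from the thesis lexicon.

# Each word carries (gender, number) features; None = unmarked for gender.
DETS = {"en": ("utr", "sg"), "ett": ("neu", "sg"), "den": ("utr", "sg"),
        "det": ("neu", "sg"), "de": (None, "pl")}
NOUNS = {"hund": ("utr", "sg"), "hus": ("neu", "sg"), "hundar": ("utr", "pl")}

def agrees(det_feats, noun_feats):
    d_gender, d_number = det_feats
    n_gender, n_number = noun_feats
    # Number must match; gender must match unless the determiner is unmarked.
    return d_number == n_number and (d_gender is None or d_gender == n_gender)

# Loose grammar: any determiner followed by any noun.
loose = {(d, n) for d in DETS for n in NOUNS}
# Strict grammar: only feature-agreeing determiner-noun combinations.
strict = {(d, n) for d in DETS for n in NOUNS if agrees(DETS[d], NOUNS[n])}
# Subtraction yields a recognizer for exactly the ill-formed phrases.
errors = loose - strict

def flag(text):
    """Return determiner-noun bigrams of `text` that fall in the error language."""
    words = text.lower().split()
    return [pair for pair in zip(words, words[1:]) if pair in errors]

print(flag("jag ser ett hund vid huset"))  # [('ett', 'hund')]
```

The same pattern scales to genuine finite state automata: a permissive phrase grammar and a more detailed one are compiled into machines and the detailed one is subtracted, so only positive rules of Swedish ever need to be written.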
The techniques of finite state parsing are explained. Chapter 7 presents the performance of this tool. The thesis ends with a concluding summary (Chapter 8). In addition, the thesis contains four appendices. Appendix A presents the grammatical feature categories used in the examples of errors or when explaining the grammar of Swedish. Appendix B presents the error corpora consisting of the grammar errors found in the present study (Appendix B.1), misspelled words (Appendix B.2) and segmentation errors (Appendix B.3). The tagset used is presented in Appendix C and some listings from the implementation are listed in Appendix D.

Part I
Writing

Chapter 2
Writing and Grammar

2.1 Introduction

Learning to write does not imply acquiring a completely new language (new grammar), since often at this stage (i.e. beginning school) a child already knows the majority of the (general) grammar rules. Rather, learning to write is a process of learning the difference between written language and the already acquired spoken language. Consequently, the errors found in the writing of primary school children are often due to their lack of knowledge of written language: they consist of attempts to reproduce spoken language norms as an alternative to the standard written norm, or of errors due to the parts of written language not yet acquired. Further, even when the writer knows the standard norm, errors can occur either as the result of disturbances such as tiredness, stress, etc. or because the writer cannot manage to keep together complex content and meaning constructions (cf. Teleman, 1991a). Another source of errors is the aids we use for writing, computers, which also influence our writing and may give rise to errors. The main purpose of the present chapter is to see if previous studies on writing can give some hint of what grammar errors are to be expected in the writing of Swedish children.
It provides a survey of previous studies of grammar errors, as well as some background research on writing in general and some insights into what it means to learn to write and how computers influence our writing. First, a short review of research on writing is presented (Section 2.2), followed by a short explanation of what acquisition of written language involves and how computers influence the way we write (Section 2.3). Previous findings on grammar errors in Swedish can be found in the following section, including studies of the writing of children and adolescents, adults and the disabled (Section 2.4).

2.2 Research on Writing in General

For a long period of time, many considered written language (beginning with e.g. de Saussure, 1922; Bloomfield, 1933) to be a transcription of spoken (oral) language and less important than, or even inferior to, spoken language. A similar view is also reflected in the research on literacy, where studies on writing were very few in comparison to research on reading. A turning point at the end of the 1970s is described by many as “the writing crisis” (Scardamalia and Bereiter, 1986), when research on the teaching of native language writing expanded. During this period, more naturalistic methods for writing were propagated, i.e. “learning to write by writing” (Moffett, 1968), the writing situation in English schools was examined (e.g. Britton, 1982; Emig, 1982) and the focus of study changed from judgments of products and more text-oriented research to the strategies involved in the process of writing (see Flower and Hayes, 1981). In Sweden, writing skills were studied by focusing on the written product, often related to the social background of the child. Research has been devoted to spelling (e.g. Haage, 1954; Wallin, 1962, 1967; Dahlquist and Henrysson, 1963; Ahlström, 1964, 1966; Lindell, 1964) and the writing of compositions in connection with standardized tests (e.g. Björnsson, 1957, 1977; Ljung, 1959).
There are also studies concerning writing development in primary and upper secondary schools (e.g. Grundin, 1975; Björnsson, 1977; Hultman and Westman, 1977; Lindell et al., 1978; Larsson, 1984). During the latter half of the 1980s, research in Sweden took a new direction towards studies of writing strategies concerning writing as a process (e.g. Björk and Björk, 1983; Strömquist, 1987, 1989) and the development of writing abilities, focusing on writing activities between children and parents (e.g. Liberg, 1990) and text analysis (e.g. Garme, 1988; Wikborg and Björk, 1989; Josephson et al., 1990). This turning point was reflected in education by the introduction of process-oriented writing as well. Some research concerned writing as a cognitive text-creating process, using video-recordings of persons engaged in writing (e.g. Matsuhasi, 1982) or clinical experiments (e.g. Bereiter and Scardamalia, 1985). The use of computers in writing prompted studies on the influence of computers on writing (e.g. Severinson Eklundh and Sjöholm, 1989; Severinson Eklundh, 1993; Wikborg, 1990), resulting in the development of computer programs that register and record writing activities (e.g. Flower and Hayes, 1981; Severinson Eklundh, 1990; Kollberg, 1996; Strömqvist, 1996).

2.3 Written Language and Computers

2.3.1 Learning to Write

Writing, like speaking, is primarily aimed at expressing meaning. The most evident difference between written and spoken language lies in the physical channel. Written language is a single-channelled monologue, using only the visual channel (eye), with the addressee not present at the same time. It is a more relaxed, rather slow process affording longer time for consideration and the possibility to edit/correct the end product. Speech as a dialogue is simultaneous and involves participants present at the same time, where all the senses can be used to receive information.
It is a fast process with little time for consideration and difficulty in correcting the end product. The rules and conventions of written language are more restrictive than the rules of spoken language in the sense that there are constructions in spoken language regarded as “incorrect” in written language. Writing is, in general, standardized with less (dialectal) variation, in contrast to spoken language, which is dialectal and varied. Further, acquisition of written and spoken language occurs under different conditions and in different ways. Writing is taught in school by teachers with certain training, whereas speaking is learned privately (in a family, from peers, etc.), without any planning of the process. When learning to speak, we learn the language. When learning to write, we already know the language (in the spoken form) (cf. Linell, 1982; Teleman, 1991b; Liberg, 1990).1

Learning a written language means not only acquiring its more or less explicit norms and rules, but also learning to handle the overall writing system, including the more technical aspects, such as how to shape the letters, the boundaries between words, how a sentence is formed, as well as acquiring the grammatical, discursive, and strategic competence to convey a thought or message to the reader. In other words, writing entails being able to handle the means of writing, i.e. letters and grammar rules, arranging them to form words and sentences and being able to use them in a variety of different contexts and for different purposes. During this development, children may compose texts of different genres, but not necessarily apply the conventions of the writing system correctly. Children are quite creative and they often use conventions in their own ways, for instance using periods between words to separate them instead of blank spaces (cf. Mattingly, 1972; Chall, 1979; Lundberg, 1989; Liberg, 1990; Pontecorvo, 1997; Håkansson, 1998).
1 For further, more extensive definitions of differences between written and spoken language see e.g. Chafe (1985); Halliday (1985); Biber (1988).

The above discussion leads to a view of learning to write as being the acquisition of a complex system of communication with several components. Following Hultman (1989, p. 73), we can identify three aspects of writing:

1. the motor aspect: the movement of the hand when forming the letters or typing on the keyboard
2. the grammar aspect: the rules for spelling and punctuation, morphology and syntax on clause, sentence and text level
3. the pragmatic aspect: use of writing for a purpose, to argue, tell, describe, discuss, inform, refer, etc. The text has to be readable, reflecting the meaning of words and the effect they have.

This thesis focuses on the grammar aspect, in particular on the syntactic relationships between words. Also some aspects of spelling and punctuation are covered. The text level is not analyzed here.

2.3.2 The Influence of Computers on Writing

The view on writing has changed: it is no longer interpreted as a linear activity consisting of independent and temporally sequenced phases, but rather considered to be a dynamic, problem-solving activity. According to Hayes and Flower (1980), as a cognitive process, writing is influenced by the task environment (the external social conditions) and the writer’s long term memory, and includes the cognitive processes of planning (generating and organizing ideas, setting goals, and deciding what to include and what to concentrate on), translation (the actual production) and revision (evaluation of what has been written, proof-reading, writing out and publishing). This process-based approach, with the phases also referred to as prewriting, writing and rewriting, has been adopted in writing instruction in school (e.g.
Graves, 1983; Calkins, 1986; Strömquist, 1993) and is also considered to be well-suited to computer-assisted composition (Wresch, 1984; Montague, 1990). Writing on a computer makes text easy to structure, rearrange and rewrite. Many studies report writers’ decreased resistance to writing. They experience that it is easier to start to write, and there is the possibility to revise throughout the whole process of writing, leave the text and then come back to it again, and update and reuse old texts (e.g. Wresch, 1984; Severinson Eklundh, 1993). Also, studies of children’s use of computers show that children who use a word-processor in school enjoy writing and editing activities more, considering writing on a computer to be much easier and more fun. They are more willing to revise and even completely change their texts, and they write more in general (e.g. Daiute, 1985; Pontecorvo, 1997). The word processor affects the way we write in general. We usually plan less in the beginning when writing on a computer and revise more during writing. Thus, editing occurs during the whole process of writing and is not left solely to the final phase. In an investigation by Severinson Eklundh (1995) of twenty adult writers with academic backgrounds, more than 80% of all editing was performed during writing and not after. The main disadvantage reported is that it is hard to get an overall perspective of a text on the screen, which then makes planning and revision more difficult and can in turn lead to texts being of worse quality (e.g. Hansen and Haas, 1988; Severinson Eklundh, 1993). Rewriting and rearranging a text is easy to do in a word processor, for instance with the copy and paste utilities, which may easily give rise to errors that are hard to discover afterwards, especially in a brief perusal. Words and phrases can be repeated, omitted or transposed. Sentences can be too long (Pontecorvo, 1997) and errors that normally are not found in native speakers’ writing occur.
The common claim is that writing in one’s mother tongue normally results in types of errors that deviate from the public language norm, since most of the mother tongue’s grammar is present before we begin school (Teleman, 1979). There are studies that clearly show that the use of word processors leads to completely new error types, including some errors that were considered characteristic of second language writers. For instance, morpho-syntactic (agreement) errors have been found to be quite usual among native speakers in the studies of Bustamente and León (1996) and Domeij et al. (1996). The errors are connected to how we use the functions in a word processor and to the fact that revision is more local due to the limited view on the screen (cf. Domeij et al., 1996; Domeij, 2003). Concerning text quality, there are studies that point out that the use of a word processor results in longer texts, both among children and adults. Some researchers claim that the quality of compositions improved when word processors were used (see e.g. Haas, 1989; Sofkova Hashemi, 1998). However, no reliable quality enhancement besides the length of a text is evident in any study. The effects of using a computer for revision are regarded by some as being positive both on the mechanics and the content of writing, while others feel it promotes only surface level revision, not enhancing content or meaning (see the surveys in Hawisher, 1986; Pontecorvo, 1997; Domeij, 2003).

2.4 Studies of Grammar Errors

2.4.1 Introduction

There are not many studies of grammar errors in written Swedish. Studies of adult writing are few, while research on children’s writing development mostly concerns the early age of three to six years and the development of spelling and the use of the period and/or other punctuation marks and conventions (e.g. Allard and Sundblad, 1991). The recent expansion in the development of grammar checking tools, however, contributes to this field.
Below, studies are presented of grammar errors found in the writing of primary and upper secondary school children and adults, the error types covered by current proof reading tools, and analyses of grammar errors in texts of adult writers used for the evaluation of these tools. Some of these studies are described further in detail and are compared to the analysis of the children’s texts gathered for the present thesis in Chapter 4 (Section 4.4).

2.4.2 Primary and Secondary Level Writers

During the 1980s, several projects investigated the language of Swedish school children as a contribution to the discussion of language development and language instruction (see e.g. the surveys in Östlund-Stjärnegårdh, 2002; Nyström, 2000). The writing of children in primary and upper secondary school was analyzed mostly with a focus on lexical measures of productivity and language use, in terms of analysis of vocabulary, parts-of-speech distribution, length of words, word variation and also content, relation to gender, social background and the grades assigned to the texts (e.g. Hersvall et al., 1974; Hultman and Westman, 1977; Lindell et al., 1978; Pettersson, 1980; Larsson, 1984). Then, when the traditional product-oriented view on writing switched to the new process-oriented paradigm, studies on writing concerned the text as a whole and as a communicative act (e.g. Chrystal and Ekvall, 1996, 1999; Liberg, 1999) and became more devoted to analysis of genre and referential issues (e.g. Öberg, 1997; Nyström, 2000), the relation to the grades assigned (e.g. Östlund-Stjärnegårdh, 2002) and modality (speech or writing) (e.g. Strömqvist et al., 2002). Quantitative analysis in this field still concerns lexical measures of variation, length, coherence, word order and sentence structure; very few studies note errors other than spelling or punctuation (e.g. Olevard, 1997; Hallencreutz, 2002).
A study by Teleman (1979) shows examples (no quantitative measures) of both lexical and syntactic errors observed in the writing of children from the seventh year of primary school (among others). He reports on errors in function words, inflection with dialectal endings in nouns, dropping of the tense-endings on verbs and on the use of the nominative form of pronouns in place of accusative forms, as is often the case in spoken Swedish. Also, errors in definiteness agreement, missing constituents, reference problems, word order and tense shift are exemplified, as well as observations of erroneous use of, or missing, prepositions in idiomatic expressions. Another study, by Hultman and Westman (1977), concerns the analysis of national tests from third year students at upper secondary school. The aim of the project Skrivsyntax “Written Syntax” was to study writing practice in school from a linguistic point of view. The material included 151 compositions (88 757 words in total) with the subject Familjen och äktenskapet än en gång ‘Family and marriage once more’. Vocabulary, distribution of word categories, syntax and spelling were studied and compared to adult texts, between the marks assigned to the texts and between boys and girls. The study also included error analysis of punctuation, orthography, grammar, lexicon, semantics, stylistics and functionality of the text. Among grammar errors, gender agreement errors were reported as being common, and relatively many errors in pronoun case after a preposition occurred. Errors in agreement between subject and predicative complement are also reported as rather frequent. Word order errors are also reported, mostly in the placement of adverbials. Other examples include verb form errors, subject related errors, reference, preposition use in idiomatic expressions and clauses with odd structure.

2.4.3 Adult Writers

There are few studies of adult writing in Swedish.
Those that exist are mostly devoted to the writing process as a whole or to social aspects of it, with very little attention being paid to the mechanics of writing. However, the recent expansion in the development of Swedish computational grammar checking tools, which requires an understanding of what error types should be treated by such tools, has made contributions to this field. The realization of what types of errors occur, and should thus be included in such an authoring aid, may be based on intuitive presuppositions of what rules could be violated, in addition to empirical analysis of text. More empirical evidence of grammar violations also comes from the evaluation of such tools, where the system is tested against a text corpus with hand-coded analysis of errors. There are three available grammar checkers for Swedish: Granska (Knutsson, 2001), Grammatifix (Birn, 2000) and Scarrie (Sågvall Hein et al., 1999).2 Scarrie is explicitly devoted to professional writers of newspaper articles. The other two systems are not explicitly aimed at any special user groups, although their performance tests were provided mainly on newspaper texts.

2 These tools are described in detail in Chapter 5.

Below, a survey of studies is presented of professional and non-professional writers, adult disabled writers, the grammar errors that are covered by the three Swedish grammar checkers, and the grammar errors that occurred in the evaluation texts the performance of these systems was tested upon.

Professional and Non-professional Writers

Studies focusing on adult non-professional writing concern analysis of crime reports (Leijonhielm, 1989), post-school writing development (Hammarbäck, 1989), a socio-linguistic study concerning writing attitudes, i.e.
what is written and who writes what at a local government office, regardless of writing conventions (Gunnarsson, 1992), and some “typical features in non-proof-read adult prose” at a government authority reported in Göransson (1998), the only investigation that addresses (to some extent) grammatical structure. Göransson (1998) describes her immediate impression when proof-reading texts written by her colleagues at a government authority, showing some typical features in this unedited adult prose. She examined reports, instructional texts, newspaper articles, formal letters, etc. The analysis distinguishes between high and low level errors. High level includes comprehensibility of the text, coherence and style, relevance for the context, ability to see one’s own text with the eyes of others, choice of words, etc. Low level errors cover grammar and spelling errors. Among the grammar errors she only reports reference problems, choice of preposition and agreement errors. Among studies of professional writers, the language consultant Gabriella Sandström (1996) analyzed editing at the Swedish newspaper Svenska Dagbladet, covering 29 articles written by 15 reporters. The original script, the edited version and the final version of the articles were analyzed. The analysis involved spelling, errors at the lexical and syntactic level, formation errors, punctuation and graphical errors. The results showed that the journalists made most of their errors in punctuation, graphical form and lexicon, and that most of these disappeared during the editing process. Among the lexical errors, Sandström mentions errors in idiomatic expressions and in the choice of prepositions. Syntax errors also seem to be quite common, but the article does not give an analysis of the different kinds of syntax errors.
Adults with Writing Disabilities

Studies on the writing strategies of disabled groups were conducted within the project Reading and Writing strategies of Disabled Groups,3 including analysis of grammar for the dyslexic and deaf (Wengelin, 2002). The analysis of the writing of deaf adults included no frequency data and is less important for the present study, since it tends to reflect strategies found in second language acquisition. Adult dyslexics mostly show problems with the formation of sentences and frequent omission of constituents. Especially frequent were missing or erroneous conjunctions. Other errors concern agreement in the noun phrase or the form of noun phrases, verb form, tense shift within sentences and incorrect choice of prepositions. Marking of sentence boundaries and punctuation is the main problem of these writers.

Error Types in Proof Reading Tools

The error types covered by a grammar checker should, in general, include the central constructions of the language and, in particular, those which give rise to errors. These constructions should allow precise descriptions so that false alarms can be avoided. The selection of what error types to include is then also dependent on the available technology and the possibility of detecting and correcting the error types (cf. Arppe, 2000; Birn, 2000). In the development of Grammatifix, the pre-analysis of existing error types in Swedish was based on linguistic intuition, personal observation and reference literature on Swedish grammar and writing conventions (Arppe, 2000). In the case of Granska, the pre-analysis involved analysis of empirical data such as newspaper texts and student compositions (Domeij et al., 1996; Domeij, 2003). In the Scarrie project, where journalists are the end-users, the stage of pre-analysis consisted of gathering corrections made by professional proof-readers at the newspapers involved.
These corrections were stored in a database (The Swedish Error Corpora Database, ECD), which contains nearly 9,000 error entries, including spelling, grammar, punctuation, graphic and style, meaning and reference errors. Arppe (2000) provides an overview of the types of errors covered by the Swedish tools and reports, in short, that all the tools treat errors in noun phrase agreement and verb forms in verb chains. Scarrie and Granska also treat errors in compounds, whereas Grammatifix has the widest coverage of punctuation and number formatting errors. He points out that the error classification in these tools is similar, but not exactly the same. The depth and breadth of the included error categories differ in the subsets of phrases, the level of syntactic complexity or the position of detection in the sentence. The tools may, for instance, detect errors in syntactically simple fragments, but fail with syntactically more complex structures. These factors are further explained and exemplified in Chapter 5, where I also compare the error types covered by the individual tools. Among the grammar errors presented in Scarrie’s ECD, errors in noun phrase agreement, predicative complement agreement, definiteness in single nouns, verb subcategorization and choice of preposition are the most frequent error types.

3 More information about this project may be found at: http://www.ling.gu.se/~wengelin/projects/r&r.

Evaluation Texts of Proof Reading Tools

Other empirical evidence of grammar errors can be observed in the evaluation of the three grammar checkers (Birn, 2000; Knutsson, 2001; Sågvall Hein et al., 1999). The performance of all the tools was tested on newspaper text, written by professional writers. Only the evaluation corpus of Granska included texts written by non-professionals as well, represented by student compositions. In general, the corpora analyzed are dominated by errors in verb form, agreement in noun phrases, prepositions and missing constituents.
2.5 Conclusion

The main purpose of the present chapter was to investigate if previous research reveals which grammar errors to expect in the writing of primary school children. Apparently, grammar in general has a very low priority in the research on writing in Swedish. Grammar errors in children’s writing have been analyzed at the upper level in primary school and in the upper secondary school, and exist only as reports with some examples, without any particular reference to frequency. Some analyses have been performed on the writing of professional adult writers and in the research on the writing of dyslexic and deaf adults, with quantitative data for the dyslexic group. The only area that directly approaches grammar errors concerns the development of proofreading tools aimed particularly at grammar. These studies report on grammar errors in the writing of adults. Previous research presents no general characterization of grammar errors in children’s writing. There are, however, a few indications that children as beginning writers make errors different from those of adult writers. Teleman’s observations indicate use of spoken forms that were not reported in the other studies. Some examples of errors in the Skrivsyntax project are evidently more related to the fact that the children have not yet mastered writing conventions (e.g. errors in the accusative case of plural pronouns) rather than to “slip of the pen” errors (e.g. due to lack of attention). In general, all the studies report errors in agreement (both in the noun phrase and the predicative complement), verb form and the choice of prepositions in idiomatic expressions. Are these the central constructions in Swedish that give rise to grammar errors? It may be true for adult writers, but it is unclear regarding beginning writers.
Analysis of grammar errors in the child data collected for the present study is presented in Chapter 4, together with a comparison with the findings of the previous studies of grammar errors presented above.

Chapter 3
Data Collection and Analysis

3.1 Introduction

In this chapter we report on the data that have been gathered for this study and the types of analysis performed on them. First, the data collection is presented and the different sub-corpora are described (Section 3.2). Then, a discussion follows of the kinds of errors analyzed and how they are classified (Section 3.3). The types of analyses in the present study are described in the subsequent section (Section 3.4), and a description of error coding and the tools that were used for that purpose ends this chapter (Section 3.5).

3.2 Data Collection

3.2.1 Introduction

The main goal of this thesis is to automatically detect grammar errors in texts written by children. In order to explore what errors actually occur, texts on different topics written by different subjects were collected to build an underlying corpus for analysis, hereafter referred to as the Child Data corpus. The material was collected on three separate occasions and has served as the basis for other (previous) studies. The first collection of the data consists of both hand written and computer written compositions on set topics by 18 children between 9 and 11 years old - The Hand versus Computer Collection. The second collection involves the same subjects; this time, the children participate in an experiment and tell a story about a series of pictures, both orally and in writing on a computer - The Frog Story Collection. The third collection comes from a project on the development of literacy and includes eighty computer written compositions by 10 and 13 year old children on set topics in two genres - The Spencer Collection.
Table 3.1 gives an overview of the whole Child Data corpus, including the three collections mentioned above, divided into five sub-corpora by the writing topics the subjects were given: Deserted Village, Climbing Fireman, Frog Story, Spencer Narrative, Spencer Expository. Further information concerns the age of the subjects involved, the number of compositions, the number of words, whether the children wrote by hand or on computer and what writing aid was then used.

Table 3.1: Child Data Overview

                      AGE      COMP  WORDS   TOPIC                           WRITING AID
HAND VS. COMPUTER COLLECTION:
  Deserted Village    9-11     18    7 586   ”They arrived in a              paper and pen
                                             deserted village”
  Climbing Fireman    9-11     18    4 505   Shown: a picture of a           Claris Works 3.0
                                             fireman climbing on a ladder
FROG STORY COLLECTION:
  Frog Story          9-11     18    4 907   Story-retelling: ”Frog,         ScriptLog
                                             where are you?”
SPENCER COLLECTION:
  Spencer Narrative   10 & 13  40    5 487   Narrative: Tell about a         ScriptLog
                                             predicament you had rescued
                                             somebody from, or you had
                                             been rescued from
  Spencer Expository  10 & 13  40    7 327   Expository: Discuss the         ScriptLog
                                             problems seen in the video
  TOTAL                        134   29 812

Altogether 58 children between 9 and 13 years old wrote 134 papers, comprising a corpus of 29,812 words. Most of the papers are written on the computer. Only the first sub-corpus (Deserted Village) consists of 18 hand written compositions. The editor Claris Works 3.0 was used for 18 computer written texts. ScriptLog, a tool for experimental research on the on-line process of writing, was used for the remaining (98) computer written compositions. ScriptLog looks just like an ordinary word processor to the user, but in addition to producing the written text, it also logs information about all events on the keyboard, the screen position of these events and their temporal distribution.2

1 Many thanks to Victoria Johansson and Sven Strömqvist, Department of Linguistics, Lund University, for sharing this collection of data.

This section proceeds with detailed descriptions of the three collections that form the corpus, with information about when and for what purpose the material was collected, the subjects involved, the tasks they were given and the experiments they took part in.

3.2.2 The Sub-Corpora

The Hand vs. Computer Collection

The first collection originates from a study on the computer’s influence on children’s writing, gathered in autumn 1996. The writing performance in hand written and computer written compositions on the same subjects was compared (see Sofkova, 1997). Results from this study showed both great individual variation among the subjects and similarities between the two modes, e.g. in the distribution of spelling and segmentation errors, as well as improved performance in the essays written on the computer, especially in the use of punctuation, capitals and the number of spelling errors. The subjects included a group of eighteen children, twelve girls and six boys, between the ages of 9 and 11, all pupils at the intermediate level at a primary school. This school was picked because the children had some experience with writing on computers. Computers had already been introduced in their instruction and pupils were free to choose to write on a computer or by hand. If they chose to write on a computer, they wrote directly on the computer, using the Claris Works 3.0 word processor. Other requirements were that the subjects should be monolingual and not have any reading or writing disabilities. The children wrote two essays - one by hand and one on the computer. At the beginning of this study, the children were already busy writing a composition, which now is part of the hand written material.
They were given a heading for the hand written task: De kom till en övergiven by 'They arrived in a deserted village'. For the computer written task, pupils were shown a picture of a fireman climbing on a ladder. They were also told not to use the spelling checker when writing, in order to make the two tasks as comparable as possible.

2 A first prototype was developed in the project Reading and writing in a Linguistic and a didactic perspective (Strömqvist and Hellstrand, 1994). An early version of ScriptLog developed for Macintosh computers was used for collecting the data in this thesis (Strömqvist and Malmsten, 1998). There is now also a Windows version (Strömqvist and Karlsson, 2002).

The Frog Story Collection

The second collection is a story-telling experiment and involves the same subjects as the Hand vs. Computer Collection. In April 1997, we invited the children to the Department of Linguistics at Göteborg University to take part in the experiment. They served as a control group in the research project Reading and Writing Strategies of Disabled Groups, which aims at developing a unified research environment for contrastive studies of reading and writing processes in language users with different types of functional disabilities.3 The experiment included a production task and the data were elicited both in written and spoken form (video-taped). A wordless picture story booklet, Frog, where are you? by Mercer Mayer (1969), was used: a cartoon-like series of 24 pictures about a boy, his dog and a frog that disappears. Each subject was asked to tell the story, picture by picture. At the beginning of the experiment the children were invited to look through the book to get an idea of the content. The instruction was literally Kan du berätta vad som händer på de här bilderna? 'Can you tell what is happening in these pictures?' Half of the children first wrote and then told the story, and the other half did the opposite.
For the written task, the on-line process editor ScriptLog was used, storing all the writing activities.

The Spencer Collection

The Spencer Project on Developing Literacy across Genres, Modalities and Languages4 lasted from July 1997 to June 2000. The aim was to investigate the development of literacy in both speech and writing. Four age groups (grade school students, junior high school students, high school students and university students) and seven languages (Dutch, English, French, Hebrew, Icelandic, Spanish and Swedish) were studied. Schools were picked from areas where one could expect few immigrants in the classes, and where the children had some experience with computers. The subjects came from middle-class, monolingual families and had no reading or writing disabilities. Another criterion was that at least one of the subject's parents had education beyond high school.

3 The project's directors are Sven Strömqvist and Elisabeth Ahlsén from the Department of Linguistics, Göteborg University. More information about this project may be found at: http://www.ling.gu.se/~wengelin/projects/r&r.

4 The project was funded by the Spencer Foundation Major Grant for the Study of Developing Literacy to Ruth Berman, Tel Aviv University, who was the coordinator of this project. Each language/country involved has had its own contact person; for Swedish it was Sven Strömqvist from the Department of Linguistics at Lund University.

All subjects had to create two spoken and two written texts in two genres, expository and narrative. Each subject saw a short video (3 minutes long) containing scenes from a school day. After the video, the procedure varied depending on the order of genre and modality.5 The topic for the narratives was to tell about an event when the subject had rescued somebody, or had been rescued by somebody, from a predicament. They were asked to tell how it started, how it went on and how it ended.
The topic for the expository text was to discuss the problems they had seen in the video, and possibly give some solutions. They were explicitly asked not to describe the video. Written material for two age groups from the Swedish part of the study is included in the present Child Data: the grade school students (10 year olds) and the junior high school students (13 year olds). In total, 20 subjects from each age group were recruited. The texts the subjects wrote were logged in the on-line process editor ScriptLog.

3.3 Error Categories

3.3.1 Introduction

The texts under analysis contain a wide variety of violations of written language norms, on all levels: lexical, syntactic, semantic and discourse. The main focus of this thesis is to analyze and detect grammar errors, but first we need to establish what a grammar error is and what distinguishes a grammar error from, for instance, a spelling error. Punctuation is another category of interest, important for deciding how a grammar error detector should handle a text syntactically. The following section discusses the categorization of the errors found in the data and explains which errors are classified as spelling errors, as well as where the boundary lies between spelling and grammar errors. The error examples provided are glossed literally and translated into English. Grammatical features are placed within brackets following the word in the English gloss (e.g. klockan 'watch [def]') (the different feature categories are listed in Appendix A). Occurrences of spelling violations are followed by the correct form within parentheses and preceded by '⇒', both in the Swedish example and in the English gloss (e.g. var (⇒ vad) 'was (⇒ what)').

5 There were four different orders in the experiment: Order A: Narrative spoken, Narrative written, Expository spoken, Expository written. Order B: Narrative written, Narrative spoken, Expository written, Expository spoken.
Order C: Expository spoken, Expository written, Narrative spoken, Narrative written. Order D: Expository written, Expository spoken, Narrative written, Narrative spoken.

3.3.2 Spelling Errors

Spelling errors are violations of the orthographic norms of a language, such as insertion (e.g. errour instead of error), omission (e.g. eror), substitution (e.g. errer) or transposition (e.g. erorr) of one or more letters within the boundaries of a word, or omission of space between words (i.e. when words are written together) or insertion of space within a word (i.e. splitting a word into parts). Spelling errors may occur due to the subject's lack of linguistic knowledge of a particular rule (competence errors) or as a typographical mistake, when the subject knows the spelling but makes a motor coordination slip (performance errors). The difference between a competence and a performance error is not always easy to see in a given text. For example, the (nonsense) string gube deviates from the intended correct word gubbe 'old man' by the missing doubling of 'b' and thus violates the consonant gemination rule for this particular word. The text the error comes from shows that this subject is (to some degree) familiar with this rule, applying consonant gemination to other words, indicating that the error is likely to be a typo (i.e. a performance error) and that it occurred by mistake. On the other hand, the subject may not be aware that this rule applies to this particular word.6 It is then more a question of insufficient knowledge and thus a competence error. Spelling errors often give rise to non-existent words (non-word errors), as in the example above, but they can also lead to an already lexicalized string (a real word error).7 For example, in the sentence in (3.1), the string damen also violates the consonant doubling rule and deviates from the intended correct word dammen 'dam [def]' by omission of 'm'.
However, in this case the resultant string coincides with an existent word, damen 'lady [def]'.8 The error still concerns a single word, but differs from non-word errors in that the realization now influences not only the erroneously spelled string but also the surrounding context. The newly-formed word completely changes the meaning of the sentence and gives rise to a sentence with a very peculiar meaning, where a particular lady is not deep.

(3.1) Men ∗damen (⇒ dammen) är inte så djup.
      but lady [def] (⇒ dam [def]) is not that deep
      – But the dam is not so deep.

Homophones, words that sound alike but are spelled differently, are another example of a spelling error realized as a real word. The classical examples are the words hjärna 'brain' and gärna 'with pleasure', which are often substituted for each other in written production and, as carriers of different meanings, completely change the semantics of the whole sentence. Another category of words that may result in non-words or real words in writing is the alternative morphological forms in different dialects. For instance, a spoken dialectal variation of the standard final plural suffix -or on nouns, as in flicker 'girls' (the standard form is flick-or 'girls'), is normally not accepted in written form and is thus realized as a non-word in the written language. Other spoken forms, such as jag 'I', normally reduced to ja in speech, coincide with other existent words and form real word errors in writing. In this case ja is homonymous with the interjection (or affirmative) ja 'yes'.

6 The word gubbe 'old man' was used only once in the text.
7 Usually around 40% of all misspellings result in lexicalized strings (e.g. Kukich, 1992). The notion of non-word vs. real word spelling errors is a terminology used in research on spelling (cf. Kukich, 1992; Ingels, 1996).
8 Consonant doubling is used for distinguishing short and long vowels in Swedish.
In neither case is it clear whether the spoken form is used intentionally as some kind of stylistic marker or is spelled in this way due to competence or performance insufficiency, meaning that the subject either had not acquired the written norm or made a typographical error. Spelling errors are then violations of characters (or spaces) in single isolated words that form (mostly) non-words or real words, the latter causing ungrammaticalities in text.

3.3.3 Grammar Errors

Grammar errors violate (mostly) the syntactic rules of a language, such as feature agreement or the order or choice of constituents in a phrase or sentence, and thus concern a wider context than a single word.9

9 Choice of words may also lead to semantic or pragmatic anomaly.

Like spelling errors, a grammar error may occur due to the subject's insufficient knowledge of such language rules. However, the difference is that when learning to write as a native speaker (as the subjects in this study), only the written language norms that deviate from the already acquired (spoken) grammatical knowledge have to be learned. As mentioned earlier, research reveals that native speakers make not only errors reflecting the norms of the group one belongs to, as one might expect, but also other grammar errors that have been ascribed to the influence of computers on writing. That is, even a native speaker can make grammar errors when writing on a computer due to rewriting or rearranging text. Again, the real cause of an error is not always clear from the text. For instance, in the noun phrase denna uppsatsen 'this [def] essay [def]' a violation of definiteness agreement occurs, since the demonstrative determiner denna 'this' normally requires the following noun to be in the indefinite form. In this case, the form denna uppsats 'this [def] essay [indef]' is the correct one (see Section 4.3.1). However, in certain regions of Sweden this construction is grammatical in speech. This
means that this error appears to be a competence error, since the subject is not familiar with the written norm and applies the acquired spoken norm. On the other hand, it could also be a typographical mistake, as would be the case if the subject first used a determiner like den 'the/that [def]', which requires the following noun to be in the definite form, and then changed the determiner to the demonstrative one but forgot to change the definite form of the subsequent noun to indefinite.

In earlier research, grammar errors have been divided along two lines. Some researchers characterize the errors by applying the same operations as for orthographic rules also at this level, with omissions, insertions, substitutions and transpositions of words. Feature mismatch is then treated as a special case of substitution (e.g. Vosse, 1994; Ingels, 1996). For instance, in the incorrect noun phrase denna uppsatsen 'this [def] essay [def]' the required indefinite noun is substituted by a definite noun. Word choice errors, such as incorrect verb particles or prepositions, are other examples of grammatical substitution. Word order errors occur as transpositions of words, i.e. all the correct words are present but their order is incorrect. Missing constituents in sentences concern omission of words, whereas redundant words concern insertion. Others separate feature mismatch from other error types and distinguish between structural errors, which include violations of the syntactic structure of a clause, and non-structural errors, which concern feature mismatch (e.g. Bustamente and León, 1996; Sågvall Hein, 1998a).

3.3.4 Spelling or Grammar Error?

As mentioned at the beginning of this section, writing errors occur at all levels, including lexicon, syntax, semantics and discourse. The nature of an error is sometimes obvious, but in many cases it is unclear how to classify errors.
The final versions of the texts give very little hint about what was going on in the writer's mind at the time of text production.10 Some kind of classification of writing errors is necessary, however, for their detection and diagnosis. Consider for instance the sentence in (3.2), where a (non-finite) supine verb form försökt 'tried [sup]' is used as the main verb of the second sentence. The word in isolation is an existent word in Swedish, but syntactically a verb in the supine form is ungrammatical as the predicate of a main clause (see Section 4.3.5). This non-finite verb form has to be preceded by a (finite) temporal auxiliary verb (har försökt 'have [pres] tried [sup]' or hade försökt 'had [pret] tried [sup]'), or the form has to be exchanged for a finite verb form, such as the present (försöker 'try [pres]') or the preterite (försökte 'tried [pret]'). With regard to the tense used in the preceding context, the last alternative, the preterite form, would be the best choice.

(3.2) Han tittade på hunden. Hunden ∗försökt att klättra ner.
      he looked [pret] at the-dog. the-dog tried [sup] to climb down
      – He looked at the dog. The dog tried to climb down.

The problem of classification lies in the fact that although a single letter distinguishes the word from the intended preterite form, which could then be considered an orthographic violation, the error is realized not as a new word; rather, another form of the intended word is formed.

10 Probably some information could be gained from the log-files in the ScriptLog versions, but since not all data in the corpus are stored in that format, such an analysis has not been included in this thesis.
This error could occur as a result of editing, if the writer first used a past perfect tense (hade försökt 'had tried') and later changed the tense to the preterite (försökte 'tried') by removing the temporal auxiliary verb, but forgot also to change the supine form (försökt 'tried [sup]') to the correct preterite form. On the other hand, the correct preterite tense could have been used by the subject from the start. Then it is rather a question of a (real word) spelling error: the subject intended from the beginning to write a preterite form, but intentionally or unintentionally omitted the final vowel -e, which happens to be a distinctive suffix for this verb.

In the next example (3.3), a gender agreement error occurs between the neuter determiner det 'the' and the common gender noun ända 'end', as a result of replacing enda 'only' with ända 'end'. The erroneous word is an existent word and differs from the intended word only in the single letter at the beginning (an orthographic violation). This is clearly a question of a spelling error, since the word does not form any other form of the intended word and is realized as a completely new word with a distinct meaning.

(3.3) Det ∗ända (⇒ enda) jag vet om
      the [neu] end [com] (⇒ only) I know about
      – The only thing I know about ...

In the grammar checking literature, the categorization of writing errors is primarily divided into word-level errors and errors requiring context larger than a word (cf. Sågvall Hein, 1998a; Arppe, 2000). Real word spelling errors were treated in Scarrie's Error Corpora Database as errors requiring wider context for recognition and were categorized in accordance with the means used for their detection. In other words, errors either belong to the category of grammar errors when violating syntactic rules, or are otherwise categorized as belonging to the style, meaning and reference category (Wedbjer Rambell, 1999a, p.5).
In this thesis, where grammar errors (syntactic violations) are the main focus, real word spelling errors will be classified as a separate category. This distinction is important for examination of the real nature of such errors, especially when presenting a diagnosis to the user. Such considerations are especially important when the user is a beginning writer. Obvious cases of spelling errors such as the one in (3.3) are treated as such, whereas the treatment of errors lying on the borderline between a spelling and a grammar error, as in (3.2), depends on:

• what type of new formation occurred (another form of the same lemma or a new lemma)
• what type of violation occurred (change in letter, morpheme or word)
• what level is influenced (lexical, syntactic or semantic)

These principles are primarily aimed at the unclear cases, but seem to be applicable to other real word violations as well. In fact, a majority of real word spelling errors form new words and violate semantics rather than syntax, and just a few of them "accidentally" cause syntactic errors (see further in Section 5.3.2). It is the ones that form other forms of the same lemma that are tricky. They are treated here as grammar errors, but for diagnosis it is important to bear in mind that they could also be spelling errors. Figure 3.1 shows a scheme for error categorization. All violations of the written norm are categorized starting with whether the error is realized as a non-word or a real word. Non-words are always classified as spelling errors. Real word errors are then further considered with regard to whether they form other forms of the same lemma or whether new lemmas are created. In the case of the same lemma (as in (3.2)), errors are classified as grammar errors. When new lemmas are formed, syntactic or semantic errors occur.
Here a distinction is made between whether just a single letter is influenced, categorizing the error as a spelling error, or a whole word was substituted, categorizing it as a grammar error. For errors realized as real words, the following principles for error categorization then apply:11

(3.4) (i). All real word errors that violate a syntactic rule and result in other forms of the same lemma are classified as grammar errors.
(ii). All real word errors resulting in new lemmas by a change of the whole word are classified as grammar errors.
(iii). All real word errors resulting in new lemmas by a change in (one or more) letter(s) are classified as spelling errors.

11 Homophones are excepted from principle (ii). They certainly form a new lemma by a change of the whole word, but are related to how the word is pronounced and are thus considered spelling errors.

Figure 3.1: Principles for Error Categorization

For the above example (3.2), this means the following. Considering the word in isolation, försökt 'tried [sup]' is an existent word in Swedish. Considering the deviation from the intended preterite form, no new lemma is created; rather, another form of the same lemma is formed, one that happens to lack the final suffix realized as a single vowel. Considering the context it appears in, a syntactic violation occurs, since the sentence has no finite verb. So, according to principle (i) for error categorization in (3.4), this error is classified as a grammar error, since no new lemma was created; the required preterite form was simply replaced by a supine form of the same verb. In the case of (3.3), the error also involves a real word, but here a new lemma was created by the substitution of a letter. The error is then, according to principle (iii) in (3.4), considered to be a spelling error, since no other form of the same lemma, nor a substitution of the whole word, occurred.
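The categorization principles in (3.4) amount to a small decision procedure, which can be sketched in code. The fragment below is an illustration only, not part of FiniteCheck: the toy lexicon, lemma table, homophone set and the letter-counting heuristic are invented stand-ins for a real Swedish lexicon and lemmatizer.

```python
# Sketch of the error categorization in (3.4)/Figure 3.1. The tiny LEXICON,
# LEMMA and HOMOPHONES tables are illustrative stand-ins for real resources.
LEXICON = {'försökt', 'försökte', 'ända', 'enda', 'damen', 'dammen',
           'hjärna', 'gärna'}
LEMMA = {'försökt': 'försöka', 'försökte': 'försöka', 'ända': 'ända',
         'enda': 'enda', 'damen': 'dam', 'dammen': 'damm',
         'hjärna': 'hjärna', 'gärna': 'gärna'}
HOMOPHONES = {frozenset({'hjärna', 'gärna'})}

def letters_changed(written, intended):
    # Crude proxy for "change in letters" vs. "change of the whole word":
    # count differing positions plus the length difference.
    return (sum(a != b for a, b in zip(written, intended))
            + abs(len(written) - len(intended)))

def classify(written, intended):
    """Categorize the deviation of `written` from `intended`."""
    if written not in LEXICON:
        return 'spelling'          # non-words are always spelling errors
    if LEMMA.get(written) and LEMMA.get(written) == LEMMA.get(intended):
        return 'grammar'           # (i): another form of the same lemma
    if frozenset({written, intended}) in HOMOPHONES:
        return 'spelling'          # footnote 11: homophones are spelling errors
    if letters_changed(written, intended) >= len(intended):
        return 'grammar'           # (ii): the whole word was exchanged
    return 'spelling'              # (iii): new lemma by a change in letters

print(classify('gube', 'gubbe'))        # spelling (non-word)
print(classify('försökt', 'försökte'))  # grammar, by principle (i)
print(classify('ända', 'enda'))         # spelling, by principle (iii)
```

In a real system the test for principle (ii) would of course consider word choice rather than a letter count; the heuristic here merely makes the sketch executable.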
3.3.5 Punctuation

Research on sentence development and the use of punctuation reveals that children mark out entities that are content-driven rather than syntactically driven (e.g. Kress, 1994; Ledin, 1998). They form larger textual units, for instance, by joining together sentences that are "topically closely connected", according to Kress (1994). In speech, such sequences would be joined by intonation due to topic. An example of such adjoined clauses is "The boy I am writing about is called Sam he lived in the fields of Biggs Flat." (Kress, 1994, p.84). Others use a strategy of linking together sentences with connectives like 'and', 'then', 'so' instead of punctuation marks, which can result in sentences of great length, here called long sentences (see Section 4.6 for examples).

As we will see later in Chapter 5, the Swedish grammar checking systems are based on texts written by adults and are able to rely on punctuation conventions for marking syntactic sentences in their detection rules or for scanning a text sentence by sentence. In light of the above discussion, this is not possible with the present data, which consist of texts written by children. Occurrences of adjoined and long sentences are quite probable. In other words, analysis of the use of punctuation is important to confirm that the subjects of the present study also mark larger units. Thus, omissions of sentence boundaries are expected and have to be taken into consideration.

3.4 Types of Analysis

The analysis of the Child Data starts with a general overview of the corpus, including frequency counts of words, word types, and all spelling errors. The main focus is on a descriptive error-oriented study of all errors above the lexical level, i.e. all that influence context. Only spelling errors resulting in non-words are not part of this analysis. The error types included are:

1. Real word spelling errors - misspelled words and segmentation errors resulting in existent words.
2. Grammar errors - syntactic and semantic violations in phrases and sentences.
3. Punctuation - sentence delimitation and the use of major delimiters and commas.

The main focus lies on the second group, the grammar errors. Real word spelling errors and grammar errors are listed as separate error corpora - see Appendix B.1 for grammar errors, Appendix B.2 for misspelled words and Appendix B.3 for segmentation errors. All errors are represented with the surrounding context of the clause they appear in (in some cases larger parts are included, e.g. in the case of referential errors). Errors are indexed and categorized by the type of error and annotated with information about the possible correction (intended word) and the error's origin in the core data.

The analysis also includes descriptions of the overall distribution of error types and error density. Comparison is made between errors found in the different sub-corpora and by age. Here it is important to bear in mind that the texts were gathered under different circumstances and that not all subjects took part in all the experiments (see Section 3.2). Error frequencies are related to different units depending on the error type. Spelling errors, which concern isolated words, are related to the total number of words. In the case of grammar errors, the best strategy would be to relate some error types to phrases, some to clauses or sentences and some to even bigger entities in order to get an appropriate comparison measure. However, counting such entities is problematic, especially in texts that contain many structural errors. The best solution is to compare frequencies of the attested error types, which will reflect the error profile of the texts. The main focus in the analysis of the use of punctuation in this thesis is not the syntactic complexity of sentences, but rather whether the children mark larger units than syntactic sentences and whether they use sentence markers in wrong ways.
The most intuitive procedure would be to compare the orthographic sentences, i.e. the actual markings made by the writers, with the ("correct") syntactic sentences. The main problem with such an analysis is that in the case of long sentences it will often be hard to decide where to draw the line, since they are for the most part syntactically correct. Several solutions for delimitation into syntactic sentences may be available.12 The subjects' own orthographic sentences will instead be analyzed by length in terms of the number of words and by the occurrence of adjoined clauses. Further, erroneous use of punctuation marks will be accounted for. Analysis of the use of connectives as sentence delimiters would certainly be appropriate here, but we leave this for future research.

All error examples represent errors found in the Child Data corpus. The example format includes the error index in the corresponding error corpus (G for grammar errors (Appendix B.1), M for misspelled words (Appendix B.2), and S for segmentation errors (Appendix B.3)) and, as already mentioned, the text is glossed and translated into English, with grammatical features (see Appendix A) attached to words and spelling violations followed by the correct form within parentheses preceded by a double right-arrow '⇒'.

12 The macro-syntagm (Loman and Jörgensen, 1971; Hultman and Westman, 1977) and the T-unit (Hunt, 1970) are other units of measure, more related to the investigation of sentence development and grammatical complexity in education-oriented research in Sweden and America, respectively.

3.5 Error Coding and Tools

3.5.1 Corpus Formats

In order to be able to carry out automatic analyses on the collected material, the hand written texts were converted to a machine-readable format and compiled with the computer written texts to form one corpus.
All the texts were transcribed in accordance with the CHAT-format (see (3.5) below) and coded for spelling, segmentation and punctuation errors and some grammar errors. Other grammar errors were identified and extracted either manually or by scripts specially written for the purpose. Non-word spelling errors were corrected in the original texts in order to be able to test the texts in the developing error detector, which includes no spelling checker. The spelling checker in Word 2001 was used for this purpose. The original Child Data corpus now exists in three versions: the original texts in machine-readable format, a coded version in CHAT-format and a spell-checked version. The spell-checked version, free from non-words, was used as the basis for the manual grammar error analysis and as input to the error detector in progress and the other grammar checking tools that were tested (see Chapter 5).

3.5.2 CHAT-format and CLAN-software

The CHAT (Codes for the Human Analysis of Transcripts) transcription and coding format and the CLAN (Computerized Language Analysis) program are tools developed within the CHILDES (Child Language Data Exchange System) project (first conceived in 1981), a computerized exchange system for language data (MacWhinney, 2000). This software is designed primarily for transcription and analysis of spoken data. It is, however, practical to apply this format to written material in order to take advantage of the quantitative analyses that this tool provides. For instance, the current material includes a lot of spelling errors that can easily be coded, and a corresponding correct word may be added following the transcription format. This means that not only the number of words, but also the correct number of word types may be included in the analysis. Analyses concerning, for instance, the spelling of words may also be easily extracted.
In practice, conversion of a written text to CHAT-format involves the addition of an information field and the division of the text into units corresponding to "speaker's lines", since the transcript format is adjusted to spoken material. The information field at the beginning of a transcript usually includes information on the subject(s) involved, the time and location of the experiment, the type of material coded, the type of analysis done, the name of the transcriber, etc. Speaker's lines in spoken material correspond naturally to utterances. For the written material, we chose to use a finite clause as the corresponding unit, which means that every line must include a finite verb, except for imperatives and titles, which form their own "speaker's lines". The whole transcript includes just one participant, as it is a monologue. The information field in the transcribed text example in (3.5) below, taken from the corpus, includes, in accordance with the CHAT-format, all the lines at the beginning of the text starting with the @-sign. Lines starting with *SBJ: correspond to the separate clauses in the text. Comments can be inserted in brackets in speaker's lines, e.g. [+ tit] indicating that the line corresponds to the title of the text. The intended word is given in brackets following a colon, e.g. & [: och] 'and'. Relations to more than one word are indicated by the '<' and '>' signs, where the whole segment is included, e.g. <över jivna> [: övergivna] 'abandoned'.
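Since the word-level codes just described follow a regular pattern, the coded corrections can be read off a transcript line mechanically. The following is a minimal sketch of such a script; it is not part of the CLAN software, and the regular expression and function name are illustrative assumptions.

```python
import re

# A coded correction is either a single token or a <multi word> segment,
# followed by "[: intended form]", as in the CHAT lines shown in (3.5).
CODE = re.compile(r'(?:<(?P<seg>[^>]+)>|(?P<tok>\S+))\s*\[: (?P<target>[^\]]+)\]')

def corrections(line):
    """Return (written, intended) pairs coded on one *SBJ: line."""
    return [(m.group('seg') or m.group('tok'), m.group('target'))
            for m in CODE.finditer(line)]

print(corrections("*SBJ: en ö som var helt <i jen täkt> [: igentäckt] av palmer"))
# → [('i jen täkt', 'igentäckt')]
print(corrections("*SBJ: vi gick runt & [: och] titade [: tittade] ."))
# → [('&', 'och'), ('titade', 'tittade')]
```

Collecting such pairs over a whole transcript would, for instance, give the counts of misspelled tokens and their intended forms mentioned above.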
Other signs and codes can be inserted in the transcription.13

(3.5)
@Begin
@Participants: SBJ Subject
@Filename: caan09mHW.cha
@Age of SBJ: 9
@Birth of SBJ: 1987
@Sex of SBJ: Male
@Language: Swedish
@Text Type: Hand written
@Date: 10-NOV-1996
@Location: Gbg
@Version: spelling, punctuation, grammar
@Transcriber: Sylvana Sofkova Hashemi
*SBJ: de kom till en överjiven [: övergiven] by [+ tit]
*SBJ: vi kom över molnen jag & [: och] per på en flygande gris
*SBJ: som hete [: hette] urban .
*SBJ: då såg jag nåt [: något]
*SBJ: som jag aldrig har set [: sett] .
*SBJ: en ö som var helt <i jen täkt> [: igentäckt] av palmer
*SBJ: & [: och] i miten [: mitten] var en by av äkta guld .
*SBJ: när vi kom ner .
*SBJ: så gick vi & [: och] titade [: tittade] .
*SBJ: vi såg ormar spindlar krokodiler ödler [: ödlor] & [: och] anat [: annat] .
*SBJ: när vi hade gåt [: gått] en lång bit så sa [: sade] per .
*SBJ: vi <vi lar> [: vilar] oss .
*SBJ: per luta [: lutade] sig mot en .
*SBJ: palmen vek sig
*SBJ: & [: och] så åkte vi ner i ett hål .
*SBJ: sen [: sedan] svimag [: svimmade jag] .
*SBJ: när jag vakna [: vaknade] .
*SBJ: satt jag per & [: och] urban mit [: mitt] i byn .
*SBJ: vi gick runt & [: och] titade [: tittade] .
*SBJ: alla hus var <över jivna> [: övergivna] .
*SBJ: då sa [: sade] per .
*SBJ: vi har hitat den <över jivna> [: övergivna] byn .
*SBJ: & [: och] när vi kom hem så vakna [: vaknade] jag
*SBJ: & [: och] alt [: allt] var en dröm .
*SBJ: slut
@End

13 Further information about this transcription format and coding, including manuals for download, may be found at: http://childes.psy.cmu.edu/.

Chapter 4

Error Profile of the Data

4.1 Introduction

This chapter describes the empirical analysis of the collected data, starting with a general overview (Section 4.2) followed by sections describing the actual error analysis and the distribution of errors in the data.
The error analysis starts with descriptions of grammar errors (Section 4.3), the main focus, and continues with analyses of real-word spelling errors (Section 4.5) and punctuation (Section 4.6). The section on grammar errors concludes with a comparison of the error distribution in the analyzed data with the grammar errors found in other data already discussed in Chapter 2 (Section 4.4).

4.2 General Overview

The Child Data corpus, in total 29,812 words, consists of 134 compositions written by 58 children.1 Further information is provided here on the corpus, along with a discussion of the size of the sub-corpora, the average length of individual texts and word variation. Also described here is the overall impression of the texts in terms of writing errors and the nature of the spelling errors (both non-words and real words).

Text Size and Word Variation

The different sub-corpora are divided by topic of the written tasks (see Table 4.1). The first three were written by 18 subjects. The last two, belonging to the Spencer-project, involved 40 children each. In terms of the total number of words, Deserted Village and the Spencer Expository texts are the largest sub-corpora (in bold face) and the Climbing Fireman corpus is the smallest one. In total, the average text size is 222.5 words. This corresponds to a rather short text, approximately 20 lines of typed text or nearly half a page. Only the texts of Deserted Village (in bold face) are on average twice as long as the other texts. The Spencer-project texts are the shortest ones.

1 The composition of Child Data is described in Chapter 3 (see Section 3.2).
Table 4.1: General Overview of Sub-Corpora

CORPUS               TEXTS   WORDS    WORDS/TEXT   WORD TYPES
Deserted Village        18   7 586       421.4        1 610
Climbing Fireman        18   4 505       250.3        1 040
Frog Story              18   4 907       272.6          763
Spencer Narrative       40   5 487       137.2        1 085
Spencer Expository      40   7 327       183.2        1 021
TOTAL                  134  29 812       222.5        3 373

The reason for this difference in text length probably lies in the degree of free writing and in the use of, and familiarity with, the writing aid. The texts of the Deserted Village corpus were produced in the subjects’ own everyday environment, in the classroom; time was not limited, and they wrote by hand. The texts of Climbing Fireman were also written in a familiar environment with relatively unrestricted time demands, but these were written on a computer. Although computers had been introduced and used previously by the subjects, they may still have felt unfamiliar with their use. The Frog Story texts are slightly longer than the Climbing Fireman texts, but the higher number of words was probably elicited by the experiment, in which the subjects were required to write text for 24 pictures. The Spencer-project texts are also of a more experimental nature, produced in an environment not familiar to the subjects, with more restrictions on time, and written by means of a previously unknown text editor (ScriptLog).

Next, let us consider word variation. 3,373 word types were found in the whole corpus. The Frog Story texts have the smallest number of word types, not surprisingly, since the scope of word variation is largely determined by the pictures of the story the children were supposed to tell. Among the other sub-corpora, the Deserted Village corpus has the highest word variation, whereas the other three each contain around 1,000 word types.

Table 4.2 shows the texts grouped by age. We find that the sub-corpus of 9 year olds is almost the same size as all the texts written by 10 year olds, although the sub-corpus consists of less than half as many compositions.
The 9 year old children produced on average texts that are three times longer (854 words per text) than those of the 10 year olds, who wrote the shortest texts in the whole corpus.

Table 4.2: General Overview by Age

AGE        SUBJECTS   TEXTS   WORDS    WORDS/TEXT   WORD TYPES
9-years           8      24   6 832        854.0        1 270
10-years         24      52   6 837        284.9        1 356
11-years          6      18   8 012      1 335.3        1 629
13-years         20      40   8 131        406.6        1 279
TOTAL            58     134  29 812        222.5        3 373

The sub-corpora of the 11 and 13 year olds are of similar size and are more than a thousand words larger than the texts of the younger children. The 11 year olds wrote the longest texts in the whole corpus (1,335.3 words per text), almost five times longer than the shortest texts of the 10 year olds. There is, in other words, much variation in the average length of text, and especially the 11 year olds distinguish themselves by their much longer texts.2 Word variation measured in the number of word types seems to be slightly higher for the 11 year olds. The other age groups each contain around 1,300 word types.

Overall Impression and Spelling Errors

The first thing to observe when reading the texts by the children involved in this study is the high number of spelling errors and split compounds, the rare use of capitals at the beginning of sentences and the unconventional use of punctuation delimiters to mark sentence boundaries. The children literally write as they speak. They use a great deal of direct speech and many spoken word forms. The different writing errors above the lexical level are presented and discussed in the subsequent sections. In this section the sub-corpora and age groups are discussed and compared with respect to the total number of spelling errors (both non-words and real words). Most of the errors concern misspelled words, i.e. words with one or more spelling errors, represented by 2,422 (8.1%) words in total (see the last two columns in Table 4.3 below).
Segmentation errors are four times less frequent, with 377 (1.3%) words written apart (splits) and 240 (0.8%) words written together (run-ons). Among the different sub-corpora (Table 4.3), the most misspelled words, splits and run-ons are found in the hand-written texts of the Deserted Village corpus.

2 For the time being no standard deviation was computed.

The Deserted Village corpus and the Frog Story corpus have the highest rates of spelling errors, 15.6% and 14.3% respectively, of the total number of words in the respective sub-corpora (last row in the table). The texts of the Spencer-project, which were much shorter, include around 5% spelling errors, two to three times fewer than in the other three sub-corpora. Considering the age differences (Table 4.4), as expected most of the errors occurred in the texts of the youngest 9 year olds, with 1,475 (21.6%) errors in total. Only the number of splits is higher in the texts of the 11 year olds. The oldest 13 year olds made five times fewer errors. The group of 11 year olds has a very high number of spelling errors, with 813 (10.1%) errors, in comparison to the texts by the 10 year olds, which include 459 (6.7%) spelling errors.
Table 4.3: General Overview of Spelling Errors in Sub-Corpora

ERROR TYPE         Deserted  Climbing   Frog   Spencer    Spencer      TOTAL     %
                   Village   Fireman    Story  Narrative  Expository
Misspelled Words     924       422       568     209        299        2 422   8.1
Splits               146        69        93      37         32          377   1.3
Run-ons              113        26        39      32         30          240   0.8
TOTAL              1 183       517       700     278        361        3 039
%                   15.6      11.5      14.3     5.1        4.9         10.2

Table 4.4: General Overview of Spelling Errors by Age

ERROR TYPE         9-years   10-years   11-years   13-years    TOTAL     %
Misspelled Words    1 242       356        602        222      2 422   8.1
Splits                129        69        148         31        377   1.3
Run-ons               104        34         63         39        240   0.8
TOTAL               1 475       459        813        292      3 039
%                    21.6       6.7       10.1        3.6       10.2

According to Pettersson (1989, p. 164), children in the second year at primary school (9 years old) make on average 13 spelling errors per 100 words, which is much less than our 9 year olds, who have almost 22 errors. By the eighth year (14 years old), the number decreases to four errors, which seems to hold true for our 13 year olds. Students in their last year at upper secondary school make on average one spelling error per 100 words.

Summary

The texts in Child Data are on average not longer than half a page, with the exception of the hand-written Deserted Village texts, which are on average twice that length. The differences in length vary more by age. The 10 year olds wrote the shortest texts on average, whereas the texts written by the 11 year olds are almost five times longer. Word variation is much lower in the Frog Story corpus than in the other corpora. In the whole corpus, 10% of all words are misspelled or wrongly segmented, and the highest concentrations of these errors are found in the texts of Deserted Village and Frog Story and in the texts of the 9 year olds. Splits are also quite common in the 11 year olds’ texts.
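The percentage rows above are simply error counts per 100 words of the corresponding sub-corpus. As a quick sketch (figures copied from Tables 4.1 and 4.3), the last rows of Table 4.3 can be reproduced as follows:

```python
# Reproduce the percentage rows of Table 4.3: error rate = errors per
# 100 words, rounded to one decimal. Figures are taken from the tables.
words = {"Deserted Village": 7586, "Climbing Fireman": 4505,
         "Frog Story": 4907, "Spencer Narrative": 5487,
         "Spencer Expository": 7327}
errors = {"Deserted Village": 1183, "Climbing Fireman": 517,
          "Frog Story": 700, "Spencer Narrative": 278,
          "Spencer Expository": 361}

def rate(errs, wrds):
    return round(100.0 * errs / wrds, 1)

for corpus in words:
    print(corpus, rate(errors[corpus], words[corpus]))
print("TOTAL", rate(sum(errors.values()), sum(words.values())))  # 10.2
```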
4.3 Grammar Errors

Previous research and analyses of grammar (reported in Section 2.4) suggest that Swedish writers in general make errors in agreement (both in the noun phrase and in the predicative complement), in verb form, and in the choice of prepositions in idiomatic expressions. The writing of children at primary school also includes dialectal inflections on words, dropped endings and substitution of nominative for accusative case in pronouns. This section presents the types of grammar errors in the present corpus of primary school writers and investigates whether the same types of errors occur and whether, or to what extent, spoken language plays a role in their writing. Each error type is discussed and exemplified, introduced by a description of the structure of the relevant phrase types in Swedish, so that a reader who does not know Swedish will be able to understand why something is classified as an error. The number of errors is summarized in Section 4.3.10, along with a discussion of the relative frequency of the different error types in total and in comparison across sub-corpora and age. All the errors are listed in Appendix B.1. The grammar error types of this analysis are further compared with the errors found in some of the previous studies of grammar errors in the subsequent section (Section 4.4).

4.3.1 Agreement in Noun Phrases

Noun Phrase Structure and Agreement in Swedish

A noun phrase in Swedish consists of a head, normally a noun, a proper noun or a (nominal) pronoun. In addition, prenominal and/or postnominal determiners and modifiers may occur. The attributes come in a certain order and must agree with the head in number, gender, definiteness and case.
Swedish distinguishes between singular (unmarked) and plural (normally a suffix) in the number system, and number agreement is governed by the noun’s grammatical number:

Table 4.5: Number Agreement in Swedish

SINGULAR                       PLURAL
min bok     my book            mina böcker   my [pl] book [pl]
ingen byxa  no trousers        inga byxor    no [pl] trousers [pl]

Gender is represented by two categories, common and neuter. Many animate nouns are further categorized according to sex, masculine or feminine (unmarked). Gender agreement is only found in the singular and is not visible in the plural.

Table 4.6: Gender Agreement in Swedish

          SINGULAR                             PLURAL
COMMON    en gammal bil     an old car        några gamla bilar   some old cars
NEUTER    ett gammal-t hus  an old house      några gamla hus     some old houses

Definiteness marking is quite complicated and is one of the factors in Swedish grammar that cause problems. The indefinite form is unmarked, whereas the definite form is (mostly) doubly marked, both by prenominal attributes and by a noun suffix. For adjectives (and participles) there are two different forms, normally called strong and weak forms. The strong form is used in indefinite noun phrases and in predicative use. The weak form of adjectives is used in definite noun phrases. The weak form is the same in all genders and numbers, except optionally when the noun denotes a male person.3 The plural forms of the strong and weak declensions coincide.

Table 4.7: Definiteness Agreement in Swedish

                   INDEFINITE                        DEFINITE
SINGULAR COMMON    en bok           a book           bok-en              book [def]
                   en gammal bok    an old book      den gaml-a bok-en   the old [wk] book [def]
                   en gammal man    an old man       den gaml-e mann-en  the old [masc] man [def]
SINGULAR NEUTER    ett gammalt hus  an old house     det gaml-a hus-et   the old [wk] house [def]
PLURAL             gaml-a hus       old [wk] houses  de gaml-a hus-en    the old [wk] houses [def]

3 Notice that the masculine gender is only optional, which means that a noun phrase of the form den gaml-a mann-en ‘the old [wk] man [def]’ is correct as well.
Finally, case in the nominal system is represented by the (unmarked) nominative and the genitive, which uses the suffix -s (personal pronouns are also declined for accusative case; see further under pronouns, Section 4.3.4). The basic constituent order in a noun phrase is determiner-adjective-noun, e.g. ett stort hus ‘a big house’, det stora huset ‘the big house’. The co-occurrence patterns of definiteness marking can be divided into three different types (Cooper, 1986, p. 34):4

1. Definite noun phrase, which reflects the double definiteness marking and requires definite prenominal attributes and a definite noun:

   DET [+DEF]   ADJ [+DEF]   N [+DEF]
   den          röd-a        bil-en       this/the red car
   de           röd-a        bilar-na     the red cars
   den här      röd-a        bil-en       this red car
   de här       röd-a        bilar-na     these red cars

2. Indefinite noun phrase, which requires indefinite prenominal attributes and an indefinite noun:

   DET [–DEF]   ADJ [–DEF]   N [–DEF]
   en           röd          bil          a red car
   någon        röd          bil          some red car
   inga         röda         bilar        no red cars

3. Mixed noun phrase, which requires definite prenominal attributes and an indefinite noun. This type applies to demonstrative pronouns, possessive attributes and some relative clauses.

   DET [+DEF]   ADJ [+DEF]   N [–DEF]
   Demonstrative pronouns:
   denna        röd-a        bil                        this red car
   dessa        röd-a        bilar                      these red cars
   Possessive attributes:
   firmans      röd-a        bil                        the firm’s red car
   deras        röd-a        bil                        their red car
   Relative clause:
   den          röd-a        bil (som) han köpte igår   the red car that he bought yesterday

4 Cooper defines these types in terms of existing determiner types that require either definite or indefinite adjectives and nouns.

The optional prenominal attributive adjectives can be recursively stacked, as in (4.1a). Numerals as quantifying attributes occur in both definite (4.1b) and indefinite (4.1c) noun phrases.

(4.1) a. en ny röd bil
         ‘a new red car’
      b. de två röda bilarna
         ‘the two red cars’
      c.
två röda bilar
         ‘two red cars’

A proper noun as the head of a noun phrase behaves (almost) like a noun in a definite noun phrase. Proper nouns are inherently definite and uncountable. The most common form is when the proper noun occurs on its own, without any modifiers, as in the first example in Table 4.8, but prenominal attributes may occur, as shown in the other examples (Teleman et al., 1999, Part 3:56):

Table 4.8: Noun Phrases with Proper Nouns as Head

DET      ADJ        N
—        —          Peter       Peter
—        lilla      Karin       little Karin
den      snälla     Anna        the good/kind Anna
den där  tråkiga    Karl        that boring Karl
min      söta       Maria       my sweet Maria
en       ångerfull  Karl-Erik   a regretful Karl-Erik

Pronouns as heads of a noun phrase normally occur without modifiers, although pronouns with relative clauses are quite common (see further in Teleman et al., 1999):

Table 4.9: Noun Phrases with Pronouns as Head

ADJ     PRO
—       jag    I
hela    jag    all of me
båda    ni     both of you
hela    den    all of it
själva  hon    she herself

A noun phrase need not have a noun (or pronoun) as head. In this case, an adjective normally occurs in that position. Noun phrases consisting of only a determiner also exist. The structure of the (in)definite noun phrase is the same as in a noun phrase with a noun as head. Table 4.10 gives an overview of noun phrases without (nominal) heads.

Table 4.10: Noun Phrases without (Nominal) Head

DEFINITE NOUN PHRASE
DET      ADJ
denne    —           this one
den      andra       the other one
den där  nye         that new one
—        väntande    waiting
många    andra       many other
det      bästa       the best

INDEFINITE NOUN PHRASE
DET      ADJ
någon    —           someone
en       annan       another
allt     roligt      all fun

One further type of noun phrase will be relevant in this thesis, namely the partitive phrase, which consists of a quantifier, the preposition av ‘of’ and a definite noun phrase.
The quantifier agrees in gender with the noun phrase (Teleman et al., 1999, Part 3:69):

Table 4.11: Agreement in Partitive Noun Phrase in Swedish

COMMON   en av cyklarna      one [com] of bicycles [com]
         ingen av filmerna   none [com] of movies [com]
NEUTER   ett av träden       one [neu] of trees [neu]
         inget av äpplena    none [neu] of apples [neu]

Agreement Errors in Definiteness

Definiteness agreement was violated in eight noun phrases, and violations occurred in all three noun phrase types. Errors in definite noun phrases comprised three errors, all located in the head. In all instances the head noun is in the indefinite form, lacking the definite suffix, as in (4.2a). In (4.2b) we see the correct form of the definite noun phrase with both the definite determiner/article and the definite suffix on the noun.

(4.2) (G1.1.2)
      a. *En gång blev den hemska pyroman utkastad ur stan.
         one time was the [def] awful [wk] pyromaniac [indef] thrown-out out-of the-city
         ‘Once the awful pyromaniac was thrown out of the city.’
      b. den hemska pyroman-en
         the [def] awful [wk] pyromaniac [def]

One of these three erroneous noun phrases is ambiguous in context (see (4.3)), providing yet another correction possibility. The intended noun phrase could be definite as in (4.3b) or indefinite as in (4.3c).

(4.3) (G1.1.3)
      a. Jag såg på ett TV program där en metod mot mobbing var att sätta mobbarn på *den stol och andra människor runt den personen och då fråga varför.
         I saw on a TV program where a method against bullying was to put the-bully on the [def] chair [indef] and other people around the person and then ask why
         ‘I saw on a TV program where a method against bullying was to put the bully on the chair and other people around the person and then ask why.’
      b. den stolen
         the [def] chair [def]
      c. en stol
         a [indef] chair [indef]

There were three errors in definite noun phrases with indefinite heads (type 3), which involved possessive and demonstrative attributes.
In all cases, the head noun is in the definite form with a (superfluous) definite suffix, as in (4.4a). The most obvious correction is to change the form of the noun to indefinite, as in (4.4b), but it could also be that the possessive determiner is superfluous, making the single definite noun as in (4.4c) more correct.

(4.4) (G1.1.4)
      a. *Pär tittar på sin klockan och det var tid för familjen att gå hem.
         Pär looks at his [gen] watch [def] and it was time for the-family to go home
         ‘Pär looks at his watch. It was time for the family to go home.’
      b. sin klocka
         his [gen] watch [indef]
      c. klockan
         watch [def]

A violation involving a demonstrative pronoun, presented in (4.5), occurred probably due to the subject’s regional origin. Nouns modified by denna ‘this’ occur in definite form in some regional dialects.

(4.5) (G1.1.6)
      a. *Nu när jag kommer att skriva denna uppsats-en så kommer jag ha en rubrik om några problem och ...
         now when I will to write this [def] essay [def] so will I have a title about some problems and
         ‘Now when I write this essay, I will have a heading about some problems and...’
      b. denna uppsats
         this [def] essay [indef]

Two errors occurred in indefinite noun phrases and once more concerned the head noun being in definite form, as in (4.6). Two corrections are possible here as well: changing the form of the head noun as in (4.6b), or removing the determiner as in (4.6c).

(4.6) (G1.1.7)
      a. Men senare ångrade dom sig, för det var *en räkningen på deras lägenhet.
         but later regretted they selves for it was a [indef] bill [def] on their apartment
         ‘But later they regretted it, because it was a bill for their apartment.’
      b. en räkning
         a [indef] bill [indef]
      c. räkningen
         bill [def]
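The definiteness errors above can all be viewed as violations of Cooper’s three co-occurrence patterns. As a toy illustration (this is not the FiniteCheck grammar; the determiner lexicon and feature labels below are invented for the example), such a check amounts to looking up which noun form a determiner selects:

```python
# Toy illustration of Cooper-style definiteness co-occurrence checking
# (not the actual FiniteCheck rules; the lexicon is invented here).
# Each determiner selects the definiteness required on its head noun.
DEF, INDEF = "def", "indef"

REQUIRES = {
    "den": DEF, "de": DEF,         # type 1: definite NP, double marking
    "en": INDEF, "någon": INDEF,   # type 2: indefinite NP
    "denna": INDEF, "sin": INDEF,  # type 3: mixed NP, indefinite noun
}

def definiteness_ok(det, noun_definiteness):
    """True if the determiner and the noun's definiteness co-occur legally."""
    if det not in REQUIRES:
        return True  # unknown determiner: do not flag anything
    return noun_definiteness == REQUIRES[det]

print(definiteness_ok("den", INDEF))  # den ... pyroman (4.2a): flagged
print(definiteness_ok("sin", DEF))    # sin klockan (4.4a): flagged
print(definiteness_ok("den", DEF))    # den ... pyromanen: well-formed
```

The design choice mirrors the thesis method in spirit: only valid co-occurrences are listed, and anything outside them falls out as an error, without enumerating error patterns.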
Gender Agreement Errors

Agreement errors in gender occurred in definite, indefinite and partitive noun phrases and show up as a mismatch between the gender of the article and the rest of the phrase, or as violations of the semantic gender of the adjective. One disagreement in the article occurred in an indefinite noun phrase, shown in (4.7a), and one in a partitive noun phrase (G1.2.2).

(4.7) (G1.2.1)
      a. Pojken fick *en grodbarn.
         the-boy got a [com] frog-child [neu]
         ‘The boy got a frog baby.’
      b. ett grodbarn
         a [neu] frog-child [neu]

Two errors were related to semantic gender, where the masculine gender was wrongly used in the adjectival attributes of definite noun phrases. In one case, the masculine gender is used together with a plural noun (see (4.8a)).

(4.8) (G1.2.4)
      a. nasse blev arg han gick och la sig med dom *andre syskonen.
         Nasse became angry he went and lay himself with the [pl] other [masc] siblings [pl]
         ‘Nasse got angry. He lay down with his brothers and sisters.’
      b. dom andra syskonen
         the [pl] other [pl] siblings [pl]

The second instance of semantic gender mismatch is more a question of asymmetry between the adjectives involved (see (4.9a)). The first adjective in the noun phrase is declined for masculine gender (hemsk-e ‘awful [masc]’) and the second uses the unmarked form (ful-a ‘ugly [def]’). Either both should be in the masculine form (as in (4.9b)) or both should have the unmarked form (as in (4.9c)).

(4.9) (G1.2.3)
      a. det va den *hemske *fula troll karlen (⇒ trollkarlen) tokig som ...
         it was the [def] awful [wk,masc] ugly [wk] troll man [def] (⇒ magician [def]) Tokig that
         ‘It was the awful ugly magician Tokig that ...’
      b. den hemske fule trollkarlen
         the [def] awful [wk,masc] ugly [wk,masc] magician [def]
      c. den hemska fula trollkarlen
         the [def] awful [wk] ugly [wk] magician [def]

Number Agreement Errors

Three noun phrases violated number agreement.
One concerned a definite attribute in a definite noun phrase (see (4.10a)). It seems that the required plural determiner de ‘the [pl]’ is replaced by the singular definite determiner det ‘the [sg]’. It could also be a question of an (un)intentional addition of the character -t, which would make it a spelling error rather than a grammar error. But since a syntactic violation occurred and no new lemma was formed, the error is classified as a grammar error and not as a real-word spelling error.

(4.10) (G1.3.1)
       a. Den där scenen med *det tre tjejerna tyckte jag att de var taskiga som går ifrån den tredje tjejen.
          the there scene with the [sg] three girls [pl] thought I that they were mean that go from the third girl
          ‘I thought that in the scene with the three girls they were mean to leave the third girl.’
       b. de tre tjejerna
          the [pl] three girls [pl]

The other two errors concern the head noun of a partitive attribute, as shown in (4.11a). In both instances, the noun is in the singular definite form instead of the required plural definite form. Both errors were made by the same subject. This realization points more clearly to a typographical error. The determiner and the partitive preposition were probably inserted into the text afterwards, since the singular definite form that this error brings about is not at all part of the correct non-elliptic noun phrase (see (4.11b)), but may function perfectly well as a noun phrase on its own.

(4.11) (G1.3.2)
       a. Alla männen och pappa gick in i ett av *huset.
          all the-men and daddy went into one of house [sg, def]
          ‘All the men and daddy went into one of the houses.’
       b. ett (hus) av husen
          one (house [indef]) of houses [pl, def]

4.3.2 Agreement in Predicative Complement

Introduction

A predicative complement is part of a verb phrase and specifies features of the subject or the object. An adjective phrase, a participle or a noun phrase are the typical representatives.
The predicative complement differs from other parts of the verb phrase in that it agrees in gender and number (in the case of a noun phrase only in number) with the corresponding subject or object it refers to, as shown in Table 4.12.

Table 4.12: Gender and Number Agreement in Predicative Complement

SINGULAR COMMON   boken är gammal     the-book [com] is old [com]
SINGULAR NEUTER   huset är gammal-t   the-house [neu] is old [neu]
PLURAL            husen är gaml-a     the-houses [pl] are old [pl]

The predicative normally combines with copula verbs (vara ‘be’, bli ‘be/become’, förbli ‘remain’), naming verbs (e.g. heta ‘be called’, kallas ‘be called’), raising verbs (e.g. verka ‘seem’, förefalla ‘seem’, tyckas ‘seem’), and other similar verb categories (Teleman et al., 1999, Part 3:340).

Gender Agreement Errors

Violations of gender agreement were rare. Altogether, two errors of this type occurred. One concerned an adjective in the complement position and the other a past participle form. The adjective error occurred with neuter gender, as shown below in (4.12a).

(4.12) (G2.1.1)
       a. Då börja Urban lipa och sa: Mitt hus är *blöt.
          then start Urban blubber and said my [neu] house [neu] is wet [com]
          ‘Then Urban started to blubber and said: My house is wet.’
       b. Mitt hus är blött.
          my [neu] house [neu] is wet [neu]

Here the neuter gender subject is connected to the adjective blöt ‘wet [com]’ in common gender. The error could also be classified as a spelling error with omission of the final double consonant, but since the result is also another form of the same adjective and a syntactic violation occurs, the error is classified as a grammar error.

Number Agreement Errors

In the case of number agreement, there was one error involving singular number and two errors involving plural number.
As in (4.13a), the sentence structures that include number violations in the predicative complement are in general rather complex, and the distance between the head and the modifier is not restricted to a single verb. In this case, it seems to be a question of a lack of linguistic competence, since all three adjectives lack the plural ending.

(4.13) (G2.2.3)
       a. Själv tycker jag att killarnas metoder är mer *öppen och *ärlig men också mer *elak än var (⇒ vad) tjejernas metoder är.
          self think I that the-boys’ methods [pl] are more open [sg] and honest [sg] but also more mean [sg] than was (⇒ what) the-girls’ methods are
          ‘I think myself that the boys’ methods are more open and honest but also more mean than the girls’ methods are.’
       b. killarnas metoder är mer öppna och ärliga men också mer elaka
          the-boys’ methods [pl] are more open [pl] and honest [pl] but also more mean [pl]

4.3.3 Definiteness in Single Nouns

Introduction

The grammatical violations in this section concern single nouns as the only constituents of a noun phrase. Bare singular nouns are (normally) ungrammatical without an article. The noun must be in definite form or preceded by an article, as in (4.14b) or (4.14d). The example sentences in (4.14a) and (4.14c) are (normally) ungrammatical in Swedish, although they may occur as newspaper headlines, for instance.

(4.14) a. *Polis arresterade studenten.
          ‘Policeman arrested the-student.’
       b. Polisen/En polis arresterade studenten.
          ‘The policeman/A policeman arrested the-student.’
       c. *Polisen arresterade student.
          ‘The policeman arrested student.’
       d. Polisen arresterade studenten/en student.
          ‘The policeman arrested the-student/a student.’

There are, however, grammatical sentences which include bare singular nouns. The acceptability of such sentences depends, according to Cooper (1984), on the lexical choice. Thus, changing the noun or the verb may influence the grammaticality of a sentence:

(4.15) a. *Det är jobbigt att inte se bil.
It is hard to not see car [indef].
       b. Det är jobbigt att inte ha bil.
          It is hard to not have car [indef].

Bare definite nouns are often used as an anaphoric device, referring to an entity that has already been introduced or is well known in the speech situation. The noun is then in definite form, as in (4.16) below.

(4.16) a. Ta (den) nya bilen.
          Take (the) new car [def].
       b. (den) gamle kungen
          (the) old king [def]
       c. (den) tredje gången
          (the) third time [def]

Errors in Definiteness in Single Nouns

There were six cases of definiteness errors in single nouns. All of them were realized as indefinite nouns. One instance from the corpus is shown in (4.17). Here the topic is introduced by an indefinite noun phrase (en ö ‘an [indef] island [indef]’) in the first sentence, but then in the following sentence, instead of the expected definite noun that would indicate a continuation of the discussion of this topic, we find a single indefinite noun (ö ‘island [indef]’). This noun lacks the definite suffix.

(4.17) (G3.1.3)
       a. Jag såg en ö. Vi gick till *ö.
          I saw an island we went to island [indef]
          ‘I saw an island. We went to island.’
       b. ön
          island [def]

4.3.4 Pronoun Case

Features of Pronouns

Personal pronouns in Swedish are declined for nominative, genitive and accusative case (see Table 4.13 below). Third person singular inanimate pronouns have the same form in both subject and object position. For the plural, the nominative-accusative distinction de-dem is only used in writing. It is not used in speech, where both forms are pronounced dom in the standard language. This spoken form is used (increasingly) in some types of informal writing.5

Errors in Pronoun Case

All five errors in pronoun case concern the nominative case being used in object position. Two cases involved errors in the accusative case of the pronoun han ‘he’, probably due to regional influence,6 e.g.:

(4.18) (G4.1.5)
       a.
bara för man inte vill vara med *han
          just for one not want be with he [nom]
          ‘just because one doesn’t want to be with him.’
       b. honom
          him [acc]

5 Purists recommend, however, keeping the distinction de-dem, and that dom should be used only for rendering spoken language (Teleman et al., 1999, Part 2:270).
6 In certain dialects han ‘he’ is also the object form.

Table 4.13: Personal Pronouns in Swedish

SINGULAR                       NOMINATIVE   ACCUSATIVE    GENITIVE
1st person                     jag I        mig me        min my
2nd person                     du you       dig you       din yours
3rd person animate, male       han he       honom him     hans his
3rd person animate, female     hon she      henne her     hennes hers
3rd person inanimate, common   den it       den it        dess, dens its
3rd person inanimate, neuter   det it       det it        dess its

PLURAL
1st person                     vi we        oss us        vår, vårt ours [com], [neu]
2nd person                     ni you       er you        er, ert yours [com], [neu]
3rd person, written            de they      dem them      deras theirs
3rd person, spoken             dom they     dom them      deras theirs

The rest concerned plural pronouns, like the one in (4.19). As mentioned above, the distinction between the nominative form de ‘they’ and the accusative form dem ‘them’ occurs only in writing. In speech dom is used in both cases. A scan of the writing profiles of all subjects showed that most of the subjects use only the spoken form. For that reason, these errors were included only if the subject used an incorrect written form and not just the spoken form.

(4.19) (G4.1.1)
       a. bilarna bromsade så att det blev svarta streck efter *de.
          the-cars braked so that it became black lines after they [nom]
          ‘The cars braked so there were black lines after them.’
       b. dem
          them [acc]

4.3.5 Verb Form

Verb Core Structure

A verb phrase consists of a verbal head that can form a verb phrase on its own or be combined with modifiers and appropriate complements. In this description no attention is paid to the complements, just to the actual core of the verb phrase. First, the types of verbs (finite and non-finite) are described, followed by a presentation of the simple vs.
compound tense structures, and finally the infinitive phrase is described.

Verbs are divided into finite and non-finite. A sentence must contain at least one verb in finite form to be considered grammatically correct. In Swedish, there are three forms of finite verbs (present, preterite and imperative) and four forms of non-finite verbs (infinitive, supine, present participle and past participle).

Table 4.14: Finite and Non-finite Verb Forms

 TENSE                  FINITE    NON-FINITE     GLOSS
 Infinitive:            att       jaga           to hunt
 Imperative:            jaga                     hunt
 Future:                ska       jaga           will hunt
 Present:               jagar                    hunt/hunts
 Preterite:             jagade                   hunted
 Perfect:               har       jagat [sup]    have hunted [sup]
 Present participle:    den       jagande        the hunting
 Past participle:       är        jagad          is hunted

Among the non-finite verbs, infinitive and supine occur as the main verb in combination with a modifying (finite) auxiliary verb (see Future and Perfect respectively in Table 4.14 above). The infinitive form also occurs in infinitive phrases preceded by the infinitive marker att 'to'. Present and past participle forms have more adjectival characteristics and function as attributes in a noun phrase or in predicative position after a copula verb.

A core verb phrase may consist of one single finite verb and form a simple tense construction, or of a sequence of two or more verbs, composed of one finite verb plus a number of non-finite verbs to form a kind of compound tense (see Table 4.15 below). Compound tense structures, i.e. sequences of two or more verbs, are usually referred to as verb chains or verb clusters and generally include some kind of auxiliary verb followed by the main (non-finite) verb. In Swedish we find the temporal and modal auxiliary verbs in verb cluster constructions.

Table 4.15: Tense Structure

 SIMPLE STRUCTURE:
  Present:                   Katten jagar möss.
                             The cat chases mice.
  Preterite:                 Katten jagade möss.
                             The cat chased mice.

 COMPOUND STRUCTURE:
  Future:                    Katten ska [pres] jaga [inf] möss.
                             The cat will chase mice.
  Perfect:                   Katten har [pres] jagat [sup] möss.
                             The cat has chased mice.
  Past perfect:              Katten hade [pret] jagat [sup] möss.
                             The cat had chased mice.
  Future perfect:            Katten ska [pres] ha [inf] jagat [sup] möss.
                             The cat shall have chased mice.
  Secondary future perfect:  Katten skulle [pret] ha [inf] jagat [sup] möss.
                             The cat would have chased mice.

Verb clusters with temporal auxiliary verbs in general follow two patterns, one expressing the past tense with the main verb in the supine (here only the verb ha 'have' is used), and one for future tense with the main verb in the infinitive.

In subordinate clauses, the temporal finite forms har 'has/have [pres]' or hade 'had [pret]' are often omitted in perfect and past perfect7 and the verb core then consists only of the supine verb form (examples from Ljung and Ohlander, 1993, p.99):

(4.20) a. Han säger att han redan (har) gjort det.
          he says that he already (has) done that
          – He says that he has done that already.
       b. Han sade att han ofta (hade) sett dem.
          he said that he often (had) seen them
          – He said that he had often seen them.

Also the temporal infinitive ha 'have' in the secondary future perfect can be omitted irrespective of sentence type. In these cases, a past tense modal auxiliary is followed directly by a supine form (Teleman et al., 1999, Part3:272):

7 The omission is most common in writing, up to 80% (Teleman et al., 1999, Part4:12), but occurs more and more in speech as well (Teleman et al., 1999, Part3:272).

(4.21) a. Nu blev det inte så illa som det kunde (ha) blivit.
          now became that not so bad as it could (have) become [sup]
          – Now it did not get as bad as it could have.
       b. ... fastän det borde (ha) skett för länge sedan.
          although it should (have) happened for long ago
          – ... although it should have happened a long time ago.
A verb in the infinitive form is treated as part of an infinitive phrase preceded by the infinitive marker att 'to', which is necessary in certain contexts and optional in others. Auxiliary verbs are combined with bare infinitives (as shown and discussed above), thus lacking the infinitive marker, as in (4.22a). An exception is the temporal komma 'will', which requires the infinitive marker, as in (4.22b) (Teleman et al., 1999, Part3:572):

(4.22) a. Hon kan spela schack.
          she can play chess
          – She can play chess.
       b. Hon kommer att spela schack.
          she will to play chess
          – She will play chess.

The bare infinitive is also used in nexus constructions, as in (Teleman et al., 1999, Part3:597):

(4.23) Han ansåg tiden vara mogen.
       he considered the-time be ripe
       – He found the time to be ripe.

Many main verbs take either a noun phrase or an infinitive phrase as complement (Teleman et al., 1999, Part3:570,596). With some main verbs, the infinitive marker is optional (Teleman et al., 1999, Part3:597). The tendency to omit the infinitive marker is higher if the infinitive phrase directly follows the verb (Teleman et al., 1999, Part3:598):

(4.24) a. Vi slutade spela.
          we stopped play
          – We stopped playing.
       b. Vi slutade avsiktligt att spela.
          we stopped deliberately to play
          – We deliberately stopped playing.

Infinitive phrases are found in subject position as well (4.25):

(4.25) Att få segla jorden runt hade alltid lockat honom.
       to get sail earth around had always tempted him
       – He had always wanted to get to sail around the world.

Finite Main Verb Errors

The use of non-finite verb forms as finite verbs, forming sentences that lack a finite main verb, is the most common error type in Child Data. Errors of this kind concern both present and past tense. Most of them (87) occurred in the past tense, as in (4.26a), and concern regular weak verbs ending in -a in the basic form, lacking the appropriate past tense ending.
Nine errors occurred in the present tense, as in (4.27a), and primarily concern regular weak verbs ending in -a, along with some strong verbs.

(4.26) (G5.2.45)
a. På natten ∗vakna jag av att brandlarmet tjöt.
   in the-night wake [untensed] I from that fire-alarm howled
   – In the night I woke up from the fire-alarm going off.
b. vaknade
   woke [pret]

(4.27) (G5.1.2)
a. När hon kommer ner undrar hon varför det ∗lukta så bränt och varför det låg en handduk över spisen.
   when she comes down wonders she why it smell [untensed] so burnt and why it lay a towel over the-stove
   – When she comes down, she wonders why it smells so burnt and why a towel was lying over the stove.
b. luktar
   smells [pres]

The most probable cause for this recurrent error is the fact that in spoken Swedish regular weak verbs ending in -a may lack the past tense suffix and sometimes also the present tense suffix. For example, the past form of the verb vaknade 'woke [pret]' is pronounced either as [va:knade] or reduced to [va:kna], which then coincides with the infinitive and imperative forms vakna 'to wake', as in the erroneous sentence (4.26a) above.

In addition to the above errors in the form of the finite main verb, two instances involved strong verbs, both realized as the (non-finite) infinitive form. One error occurred in the present tense, and one (exemplified in (4.28)) in the past.

(4.28) (G5.2.100)
a. Nästa dag så var en ryggsäck borta och mera grejer ∗försvinna
   next day so was a rucksack gone and more things disappear [inf]
   – The next day a rucksack had gone and more things disappeared.
b. försvann
   disappeared [pret]

Then, there were two occurrences of errors using a supine verb form as predicate of a main sentence. Recall that the supine may occur on its own as predicate in subordinate clauses (see above). These errors occurred in main clauses, both involved the same lemma, and both were committed by the same subject.
One of these error instances has already been discussed in Section 3.3 (example (3.2) on p.29). The other is exemplified and discussed below:

(4.29) (G5.2.88)
a. det låg massor av saker runtomkring jag ∗försökt att kom (⇒ komma) till fören
   it lay [pret] lots of things around I tried [sup] to came (⇒ come) to the-prow
   – There were a lot of things lying around. I tried to go to the prow.
b. försökte
   tried [pret]

The sentence jag försökt att kom till fören 'I tried [sup] to go to the prow' in isolation suggests that just an auxiliary verb is missing in front of the supine form, i.e. hade försökt 'had tried'. However, the past form predicate of the preceding sentence suggests that, in order to be consistent, the predicate of the subsequent sentence should also be in past form. It could be that the subject believes that this word is spelled without the final vowel -e. The reason why this case is considered a grammar error is that it forms another form of the intended lemma. Thus, according to principle (i) in (3.4), it is a grammar error (see Section 3.3).

Finally, ten error instances concerned past participle forms in the finite verb position, as in (4.30), all lacking the final -e of the preterite suffix.

(4.30) (G5.2.92)
a. dom ∗letad överallt
   they search [past part] everywhere
   – They searched everywhere.
b. letade
   searched [pret]

These past participle forms could occur due to the final letter's alphabetical pronunciation (the letter 'd' is pronounced [de] in Swedish). Following the classification principles in (3.4), these errors are considered grammar errors, since another form than the intended one is formed.8

Verb Cluster Errors

Grammar errors in verb clusters affect the form of the (non-finite) main verb and omission of auxiliary verbs. Main verb errors may involve a sequence of finite verbs and thus violate the rule of one finite verb in a clause.
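The "one finite verb per clause" rule that these cluster errors violate can be sketched as a toy check over hand-tagged input. This is only an illustration: FiniteCheck itself finds such errors by subtracting finite state automata, and the function name, tag names and modal list below are invented for the example.

```python
# Toy check for one type of verb cluster error: a modal auxiliary
# followed by a finite verb form, e.g. *"ska blir" where "ska bli"
# (infinitive) is required. Hand-tagged input; tag names invented.

# A few Swedish modal auxiliaries (illustrative subset)
MODAL_AUX = {"ska", "skulle", "kan", "kunde", "vill", "ville"}

def flag_cluster_errors(tagged):
    """Return (aux, verb) pairs where a modal is followed by a finite verb."""
    errors = []
    for (w1, _), (w2, t2) in zip(tagged, tagged[1:]):
        if w1 in MODAL_AUX and t2 == "fin":
            errors.append((w1, w2))
    return errors

# det ska *blir ... ("will becomes"): 'blir' is finite where an
# infinitive is required
sent = [("det", "pn"), ("ska", "fin"), ("blir", "fin"), ("brand", "nn")]
print(flag_cluster_errors(sent))  # [('ska', 'blir')]
```

A real system must of course also handle ambiguity in the lexical tags, which is why FiniteCheck keeps all lexical tags rather than committing to one.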
One error instance involved the secondary future perfect, which requires a supine form, as in (4.31a), where the main verb is realized as a past tense form of the intended verb. The cause of the error is not possible to determine, but an interesting observation is that the erroneous verb form is followed by a preposition beginning in the vowel 'i', which is part of the omitted supine ending, thus indicating a possible assimilation of these sounds.

(4.31) (G6.1.7)
a. Jag skrattade och undrade hur tromben skulle ha ∗kom igenom det lilla hålet.
   I laughed and wondered how the-tornado would [pret] have [inf] came [pret] through the small hole
   – I laughed and wondered how the tornado would have come through the small hole.
b. skulle ha kommit
   would [pret] have [inf] come [sup]

Other errors in the main verb of a verb cluster concerned structures requiring an infinitive verb form, as in (4.32a), where the modal auxiliary verb ska 'will' is followed by a verb in the present tense, blir 'becomes'.

8 Some of the participle forms, like pratad 'told [past part]', are not lexicalized in Swedish, but are quite possible to form in accordance with the grammar rules of Swedish. They are included in the present analysis since they were not detected as non-words by the spelling checker in Word.

(4.32) (G6.1.1)
a. Men kom ihåg att det inte ska ∗blir någon riktig brand
   but remember that it not will [pres] becomes [pres] some real fire
   – But remember that there will not be a real fire.
b. ska bli
   will [pres] become [inf]

There were two cases with an omitted auxiliary verb. Both concerned the temporal verb ha 'to have', and the predicate of the main sentence then consisted of only a supine verb form:

(4.33) (G6.2.2)
a. men pappa — frågat mig om jag ville följa med.
   but daddy — asked [sup] me if I wanted follow with
   – but daddy has asked me if I wanted to come along.
b.
hade frågat OR frågade
had [pret] asked [sup] / asked [pret]

Infinitive Phrase Errors

In this category, we find errors in the verb form following the infinitive marker and in the omission of the infinitive marker after the auxiliary verb komma 'will'. Constructions with main verbs that combine with an infinitive phrase as complement have not been included. As we will see later on (Section 5.5), there are constructions where there is uncertainty in the language as to whether the infinitive marker should be used or not. In general, the infinitive marker tends to disappear more and more. For this reason it is not quite clear which of these cases should be classified as errors.

Four verb form errors occurred where, instead of the (non-finite) infinitive verb that is required, we find the (finite) imperative, as in (4.34), or present form, as in (4.35), after an infinitive marker.

(4.34) (G7.1.2)
a. glöm inte att ∗stäng dörren
   forget not to close [imp] the-door
   – don't forget to close the door
b. att stänga
   to close [inf]

(4.35) (G7.1.1)
a. Men hunden klarar att inte ∗slår sig
   but the-dog manages to not hits [pres] himself
   – But the dog manages not to hit himself.
b. att inte slå
   to not hit [inf]

Three cases concerned an omitted infinitive marker in the context of the temporal auxiliary verb komma 'will', which (as explained above) differs from the other auxiliary verbs and requires the infinitive marker:

(4.36) (G7.2.3)
a. Nu när jag kommer att skriva denna uppsatsen så kommer jag — ha en rubrik om några problem och vad man kan göra för att förbättra dom.
   now when I will to write this essay so will I — have a title about some problems and what one can do to improve them
   – Now when I write this essay, I will have a heading about some problems and what one can do to improve them.
b.
kommer jag att ha
will I to have

The error example (4.36) is even more interesting in that att 'to' is used in the first construction with the verb kommer att skriva 'will write', whereas it is omitted in the subsequent one.

4.3.6 Sentence Structure

Introduction

The errors in this category concern word order, phrases or clauses lacking obligatory constituents, reduplications of the same word, and constructions with redundant constituents.

The finite verb is normally considered the core of a sentence and is surrounded by its complements (e.g. subject, direct and indirect object, adverbials). The distribution of such complements is defined both syntactically (i.e. it defines the verb's construction scheme) and semantically (i.e. it defines what role the different actants play in a sentence). Thus the verb governs the structure of the whole sentence: which constituents are to be included, in what place, and what role they will play. In addition, the position of sentence adverbials plays an important role.

Sentences in Swedish display two types of word order. Main clause order is characterized by the finite verb before the adverbial (dubbed fa-sentence in Teleman et al. (1999, Part4:7)), presented in Table 4.16.9 Subordinate clause word order is characterized by the adverbial before the finite verb (dubbed af-sentence in Teleman et al. (1999, Part4:7)), presented in Table 4.17. In addition to recognizing the distinct word orders in main and subordinate clauses, traditional grammar also makes a distinction between basic word order, where the subject precedes the predicate (example sentence 2 in Table 4.16 and both sentences in Table 4.17), and inverted word order, where the subject follows the predicate (example sentences 1 and 3 in Table 4.16).

Table 4.16: Fa-sentence Word Order

    INITIAL FIELD   MIDDLE FIELD                               FINAL FIELD
    Initiation      Finite Verb   Subject   Adverbial*         Rest of VP
 1. Nu              skulle        Per       nog inte           vilja träffa någon.
    now             would         Per       probably not       like to meet someone
 2. Per             skulle        –         nog inte           vilja träffa någon nu.
    Per             would                   probably not       like to meet someone now
 3. Vem             skulle        Per       nog inte           vilja träffa nu?
    who             would         Per       probably not       like to meet now?

Table 4.17: Af-sentence Word Order

    INITIAL FIELD   MIDDLE FIELD                               FINAL FIELD
    Initiation      Subject   Adverbial*     Finite Verb       Rest of Verb Phrase
 1. eftersom        Per       nog inte       skulle            vilja träffa någon nu
    because         Per       probably not   would             like to meet someone now
 2. vem             Per       nog inte       skulle            vilja träffa nu
    who             Per       probably not   would             like to meet now

9 Conjunctions that coordinate main or subordinate clauses are not included in the scheme. The asterisk in the tables indicates that more constituents of this kind are possible.

Word Order Errors

Word order errors concern transposition of sentence constituents, thus violating the fa-sentence or af-sentence word order constraints. Only five sentences with incorrect word order were found. The following error example (4.37a) violates the fa-sentence word order, since there are two constituents before the finite verb, a subject and a time adverbial. The finite verb is expected in the second position in the sentence. The correct form of the sentence can be produced in two ways: either introduced by the subject, placing the time adverbial last, as in (4.37b), or starting with the time adverbial, placing the subject directly after the finite verb, as in (4.37c).

(4.37) (G8.1.3)
a. ∗Jag den dan gjorde inget bättre.
   I the day did nothing better
   – I didn't do anything better that day.
b. Jag gjorde inget bättre den dan.
   I did nothing better the day
c. Den dan gjorde jag inget bättre.
   the day did I nothing better

Redundancy Errors

As mentioned above, the type and the number of constituents in a sentence are governed by the main verb. Any addition of other constituents influences the whole complement distribution, both syntactically and semantically.
Words were duplicated directly (five occurrences), as in (4.38a) below, with the reduplicated word in the same position as the intended one:

(4.38) (G9.1.3)
a. många som mobbar har ∗har det oftast dåligt hemma
   many that bully have have it most-often bad at-home
   – Many that bully have have it most often bad at home.
b. många som mobbar har det oftast dåligt hemma
   many that bully have it most-often bad at-home

Four occurrences included duplication with words in between, i.e. the same word occurring somewhere else in the sentence. In example (4.39a) the subject jag 'I' is repeated after the verb, as if indicating inverted word order:

(4.39) (G9.1.7)
a. jag fick ∗jag hjälp med det.
   I got I help with it
   – I got I help with it.
b. jag fick hjälp med det.
   I got help with it
   – I got help with it.

The example in (4.40a) involves a case where the writer has fronted not only the object det 'that' but also the verb particle åt 'for', which also occurs in its normal position after the verb. Either the fronted verb particle can be removed, as in (4.40b), or the one following the verb, as in (4.40c).

(4.40) (G9.1.8)
a. Åt det går det nog inte att gör (⇒ göra) så mycket ∗åt.
   about that goes it probably not to do [pres] (⇒ do [inf]) that much about
   – About this not so much can probably be done about.
b. Det går det nog inte att gör så mycket åt.
   that goes it probably not to do that much about
c. Åt det går det nog inte att gör så mycket.
   about that goes it probably not to do that much

In four cases, new words disturbed the sentence structure through their redundancy in the complement structure. In the following example, the pronoun det 'it' is redundant and plays no role in the sentence:10

10 There is also an error in word order between the constituents bara kan 'just can', which should be switched; see G8.1.2 in Appendix B.1.

(4.41) (G9.2.2)
a.
för då kan man inte något ting bara kan gå på stan det då fattar hjärna ingenting
cause then can one not some thing just can go to the-city it then understand brain nothing
– because then one cannot anything just can go to the city it then the brain doesn't understand anything.
b. för då kan man inte något ting, bara gå på stan. Då fattar hjärna ingenting.
   for then can one not some thing just go to the-city then understand brain nothing
   – because then one cannot anything, just go to the city. Then the brain doesn't understand anything.

Missing Constituents

Altogether 44 sentences were incomplete in the sense that one (or more) obligatory constituent(s) were missing. Omission of the noun in the subject position is the most frequent type of error in this category (10 occurrences), e.g.:

(4.42) (G10.1.8)
a. När man tror att man har kompisar blir — ledsen när man bara går där ifrån
   when one thinks that one has friends becomes — sad when one just goes there from
   – When someone thinks that he has friends, he is sad when people just leave from there.
b. blir man ledsen
   becomes one sad

Missing prepositions are quite common (11 occurrences):

(4.43) (G10.6.4)
a. Hunden hoppade ner — ett getingbo.
   the-dog jumped down — a wasp-nest
   – The dog jumped into a wasp's nest.
b. i
   into

Some occurrences of missing verbs were also found:

(4.44) (G10.4.3)
a. Jag tycker att det har med uppfostran — om man nu ger eller inte ger hon/han den saken som man tappade
   I believe that it has with upbringing — if one now gives or not gives she/he the thing that one lost
   – I believe that it has to do with your upbringing whether you give back the thing he/she lost or not.
b. att göra
   to do

Here is an example of a missing subjunction:

(4.45) (G10.7.4)
a. till exempel — den här killen gör så igen så ...
   for instance — the here the-boy does so again so
   – for instance if this boy does so again, then ...
b.
om
if

Other omissions involve pronouns, infinitive markers, adverbs and some fixed expressions, as in:

(4.46) (G10.8.4)
a. sen levde vi lyckliga — våra dagar
   then lived we happy — our days
   – Then we lived happily ever after.
b. i alla våra dagar
   in all our days

4.3.7 Word Choice

This error category concerns words being replaced by other words that semantically violate the sentence structure. The replacements occur mostly within the same word category, but changes of category also occur. Most of these substitutions involve prepositions and particles, but we also find some adverbs, infinitive markers, pronouns and other classes.

In (4.47a) we see an example of an erroneous verb particle. Here the verb att vara lika 'to be alike' requires the particle till 'to' in combination with the noun phrase sättet 'the-manner', and not på 'on' as the writer uses.

(4.47) (G11.1.7)
a. vi var väldigt lika ∗på sättet alltså vi tyckte om samma saker
   we were very like on the-manner in-other-words we fond of same things
   – We were very alike in our manner. In other words, we were fond of the same things.
b. lika till sättet
   like to the-manner

The choice of prepositions is also problematic. In (4.48a) the preposition ur 'from', which describes a completely different action than the required av 'off', was used.

(4.48) (G11.1.2)
a. Vi sprang allt vad vi orkade ner till sjön och slängde ∗ur oss kläderna.
   we ran all what we could down to the-lake and threw out-of us clothes
   – We ran as fast as we could down to the lake and threw off our clothes.
b. slängde av oss
   threw off us

Five errors concerned the conjunction och 'and' used in the position of an infinitive marker. This error is speech related. In Swedish the pronunciation of the infinitive marker att [at] 'to' is often reduced to [å], as is that of the conjunction och [ock] 'and', i.e. both att 'to' and och 'and' are often reduced and pronounced as [å].
As a consequence, these two forms and their syntactic roles can be mixed up in writing, as in the next example (4.49a).

(4.49) (G11.3.1)
a. det var onödigt ∗och skrika pappa
   it was unnecessary and scream daddy
   – It wasn't necessary to scream, daddy.
b. att
   to

The choice between the adverb vart 'whither' and var 'where' caused trouble for two subjects in three occurrences; an example is given in (4.50a). This may also be a dialectal matter, since in certain regions this form has the same distribution as var 'where'.

(4.50) (G11.2.2)
a. Men ∗vart ska jag bo?
   but whither will I live
   – But whither will I live?
b. var
   where

Blends of fixed expressions also occurred. In the following example, the writer mixes up the expressions så mycket jag kunde 'as much as I could' and allt vad jag var värd 'for all I was worth':

(4.51) (G11.5.3)
a. jag sprang så fort ∗så mycket jag var värd
   I ran so fast so much I was worth
   – I ran so fast so much I was worth.
b. allt vad jag var värd
   all what I was worth

Other word choice errors concerned pronouns, adjectives and nouns.

4.3.8 Reference

Reference in Swedish

Pronouns are used to refer to something already mentioned in the text (anaphoric reference) or something present in the utterance situation (deictic reference). The pronoun then correlates with the referring noun and has to agree with it in number and gender.

Reference Errors

Referential violations concern only anaphoric reference, referring to the previous text, both within the same clause and in a larger context. The errors were of two types: cases where the pronoun did not agree (six occurrences) and cases where the referent changed (two occurrences). In the case of agreement, four errors concerned wrong number, as in (4.52a), and two cases were related to gender, as in (4.53a).

(4.52) (G12.1.1)
a.
Nästa dag gick dem upp till en grotta där fick dem var sin korg med saker i. Lena fick en kattunge för manen hade många djur. Och Alexander fick ett spjut. sen gav ∗den sej iväg när de gått och gått så hände något ...
   next day went they up to a cave there got they each his/her basket with things in Lena got a kitten because the-man had many animals and Alexander got a spear then went it [sing] self away when they went and went so happened something
   – The next day they went to visit a cave. There they each got a basket with things in it. Lena got a kitten, because the man had many animals. And Alexander got a spear. Then it went away. When they went and went, something happened ...
b. de
   they

(4.53) (G12.1.5)
a. Vad heter din mamma? Det stod helt still i huvudet vad var det ∗han hette nu igen?
   what is-called your mother [fem] it stood completely still in the-head what was it he [masc] was-called now again
   – What is your mother's name? It was completely still in my head. What was he called now again?
b. hon
   she

In two cases, a shift between direct quotation and narrative occurred. In one such error, in (4.54a), the writer is first involved in the situation, referred to as vi 'we', and then suddenly in the subsequent sentence the pronoun is changed to ni 'you [pl]', switching the focus from the writer as part of a group to other people.

(4.54) (G12.2.1)
a. spring ut nu vi har besökare när ∗ni kom ut ...
   run out now we have visitors when you [pl] came out
   – Run out, we have visitors! When we came out ...
b. vi
   we

4.3.9 Other Grammar Errors

One error instance includes an adverb used as an adjective:

(4.55) (G13.1.2)
a. När jag var ∗liten mindre
   when I was small [adj] smaller
   – When I was a little smaller ...
b. lite mindre
   a little [adv] smaller

Finally, three cases could not be classified at all. The sentences had very strange structure: either single words were incomprehensible or the whole sentence did not make any sense.
In some cases, this could be a question of several sentences being put together, in which case the sentences are incomplete and/or lack any marking of sentence boundaries.

During the analysis, some errors involving sequence of tense were discovered. These are not targeted in the present analysis and will be left for future analysis.

4.3.10 Distribution of Grammar Errors

As discussed in the presentation of error types (Section 3.4), the units by which the frequency of grammar errors could be estimated differ from type to type and are also difficult to count in text containing errors. For that reason, error frequency will be compared between error types, and total numbers of errors will be related to the total number of words.

Overall Error Distribution

In the whole corpus of 29,812 words, 262 instances of grammar errors were found. That corresponds to 8.8 errors per 1,000 words. The different errors are summarized in Table 4.18, grouped by sub-corpora, and in Table 4.19, by age. The total error distribution is also illustrated in Figure 4.1 below.

The most recurrent grammar problem concerns the form of the finite main verb, lacking the tense ending on the main verb (42%). This problem seems to be characteristic of this particular age group, whose writing is close to spoken language. Most of these errors are found in the Deserted Village corpus (44) and among the 9 year olds (72). Frog Story texts also contain quite a high number of such errors. The rest of the corpora include around 10 such errors per corpus.

Missing constituents is the second largest error category (16.8%). These errors tend to appear mostly among the older children, maybe because their text structure is more developed and complex than that of the younger children. Among the different sub-corpora, the Spencer Expository texts include most of these errors (20).
Erroneous choice of words, dominated by errors in the choice of prepositions and verb particles, is the third most frequent category, representing 10.7% (28) of all grammar errors, and seems to be spread evenly both among sub-corpora and age groups. Agreement errors in noun phrases and extra words being inserted into sentences are also quite frequent (5.7% and 5.0% respectively). Agreement errors are quite equally spread in the corpora and occur most among the 9 year olds and 11 year olds. Redundancy errors display a distribution similar to that of the missing constituents: more errors were found among the older children, and the Spencer Expository texts contain most errors of this kind.

The other grammar error categories each represent less than 4% of all the grammar errors. Eight agreement errors in predicative complement occurred, mostly among the 13 year old subjects and in the Spencer Expository texts. The six definiteness errors were made only by 9 year olds and 11 year olds. Pronoun case errors occurred five times, found only in the texts of 10 year olds and 13 year olds, probably because they were the only ones that made the written distinction between nominative and accusative in plural pronouns (de-dem 'they-them'). Seven cases of erroneous verb form after an auxiliary verb occurred, mostly in the writing of 11 year olds and in the Deserted Village corpus. All errors but one in the verb form in infinitive phrase category were made by 11 year olds. Omission of the infinitive marker after the auxiliary verb komma 'will' was rare; only three cases occurred, among the 13 year olds in Spencer Expository texts. Eight referential errors occurred, mostly in the Deserted Village corpus and in the texts by 9 year olds. Five word order errors were found, and they were equally distributed among sub-corpora and ages.

Figure 4.1: Grammar Error Distribution
Table 4.18: Distribution of Grammar Errors in Sub-Corpora

Counts in each row are given in sub-corpus order: Deserted Village, Climbing Fireman, Frog Story, Spencer Narrative, Spencer Expository. Rows with fewer than five counts had empty cells in the original table.

 Agreement in NP                5  4  2  4       | Total  15 |  5.7%
 Agreement in PRED              2  6             | Total   8 |  3.1%
 Definiteness in single nouns   3  1  2          | Total   6 |  2.3%
 Pronoun Case                   1  1  3          | Total   5 |  1.9%
 Finite Verb                   44 13 34 10  9    | Total 110 | 42.0%
 Verb form after Vaux           3  1  1  2       | Total   7 |  2.7%
 Vaux Missing                   2                | Total   2 |  0.8%
 Verb form after inf. marker    2  1  1          | Total   4 |  1.5%
 Inf. marker Missing            3                | Total   3 |  1.1%
 Word Order                     1  1  3          | Total   5 |  1.9%
 Redundancy                     1  2  1  3  6    | Total  13 |  5.0%
 Missing Constituents           7  2  8  7 20    | Total  44 | 16.8%
 Word Choice                    9  5  2  3  9    | Total  28 | 10.7%
 Reference                      3  1  2  2       | Total   8 |  3.1%
 Other                          3  1             | Total   4 |  1.5%
 TOTAL                         82 32 54 26 68    | Total 262 |  100%
 Errors/1,000 Words           10.8 7.1 11.0 4.7 9.3 | 8.8

Table 4.19: Distribution of Grammar Errors by Age

Per-type totals and percentages are as in Table 4.18; the totals by age group are:

 AGE GROUP   TOTAL ERRORS   ERRORS/1,000 WORDS
 9-year          102              14.9
 10-year          43               6.3
 11-year          58               7.2
 13-year          59               7.3
 TOTAL           262               8.8

Distribution Among Sub-Corpora

In Table 4.18 we summarize the grammar errors found in the separate sub-corpora. Most of the grammar errors occurred in the Deserted Village corpus (82), followed by the texts from the Spencer Expository (68). However, if we consider the number of errors in relation to the size of the sub-corpora, i.e. how often they occur per 1,000 words, the Frog Story corpus and the Deserted Village corpus have the highest numbers, with 11 and 10.8 errors, respectively. The Spencer Narrative texts included only 26 grammar errors in total, which corresponds to only 4.7 errors per 1,000 words.
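The densities in Tables 4.18 and 4.19 follow from a simple normalization, errors divided by words, scaled to 1,000. The sketch below checks the overall figure reported above and then inverts the formula to estimate the word count per age group; these word counts are estimates implied by the rounded densities, not figures from the thesis.

```python
# Error density as errors per 1,000 words, checked against the
# overall figures reported above (262 errors in 29,812 words).
def errors_per_1000(errors, words):
    return round(errors / words * 1000, 1)

print(errors_per_1000(262, 29812))  # 8.8

# Inverting the formula estimates the word count per age group from
# the error counts and densities of Table 4.19. Estimates only,
# implied by the rounded densities; not reported figures.
reported = {"9-year": (102, 14.9), "10-year": (43, 6.3),
            "11-year": (58, 7.2), "13-year": (59, 7.3)}
estimated_words = {age: round(e / d * 1000) for age, (e, d) in reported.items()}
print(estimated_words)

# Sanity check: the four estimates should total close to 29,812.
print(sum(estimated_words.values()))
```

The four estimated group sizes sum to within a few words of the reported corpus size, which suggests the marginal figures in the tables are internally consistent.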
As regards the frequency of the various error types (see Figure 4.2), Frog Story and Deserted Village are distinguished from the other sub-corpora in that they have a much higher frequency of finite verb errors, with seven and six such errors per 1,000 words, respectively. The rates are half that or less in the other sub-corpora. Other error types occur at most 1.6 times per 1,000 words. All the sub-corpora are dominated by errors in the finite verb, except for the Spencer Expository texts, where missing constituents are the most frequent error type. Errors in finite verbs are, however, the second most frequent category in these texts. Agreement errors in predicative complement are only found in the Climbing Fireman texts and in the Spencer Expository corpus. Further, errors in the texts of Spencer Narrative are spread over a much smaller number of different error types.

Distribution Among Ages

Looking at grammar errors by age (Table 4.19), we find that most of the grammar errors occur in the texts of the youngest, the 9 year olds (102), and fewest in the texts of the 10 year olds (43). Error density varies from 14.9 errors per 1,000 words in the texts of 9 year olds to 6.3 errors for the 10 year olds. The 11 year olds and 13 year olds have very similar densities of 7.2 and 7.3 errors, respectively. The separate error types and their density are presented in Figure 4.3. Finite verb form errors are most characteristic of the 9 year olds, who made five times more such errors than the other age groups. In the other age groups, finite verb errors and missing constituents are together the most frequent errors. Word choice errors are also highly ranked in all age groups. Errors in agreement with predicative complement are concentrated in the texts of 13 year olds. Apart from the finite verb form errors of the 9 year olds, no error type occurs more than twice per 1,000 words in any age group.
Figure 4.2: Error Density in Sub-Corpora

Figure 4.3: Error Density in Age Groups

4.3.11 Summary

In total, 262 grammar errors were found in Child Data, corresponding to an average of 8.8 errors per 1,000 words. The most common errors concern the form of the finite verb, missing obligatory constituents, choice of words and agreement in noun phrases. Errors are most frequent in the Frog Story and the Deserted Village corpora and among the 9 year olds.

4.4 Child Data vs. Other Data

In this section, the grammar errors found in Child Data are compared with the studies of grammar errors discussed in Chapter 2 (Section 2.4). Only a comparison with the analyses of children's writing at school and the studies on adult writing from the grammar checking projects is included. It turned out to be very difficult to compare the error types in the other studies, since they either did not report much data or they classified errors differently, without giving enough information on exactly which errors were included. The object of this part of the analysis is to investigate the similarities and/or differences between the error types found in children and other writers, in order to see which grammar errors to concentrate on in the development of a grammar checker aimed at children.

4.4.1 Primary and Secondary Level Writers

Teleman's study and the analysis from the Skrivsyntax project are the two analyses of children's writing which report on grammar errors at the syntactic level. The reports do not provide any quantitative analyses concerning the frequency of error types. Instead, the types of errors are reported and, in some cases, exemplified.

Teleman's Examples

Teleman's study (Teleman, 1979) includes examples of writing errors in texts by children from the seventh year of primary school (14 years old). The examples are mostly listed as fragments taken out of context, though some are presented with the surrounding context.
Many of the examples concern word choice or are of a content-related nature. Among the grammar errors (Table 4.20), Teleman (1979) lists examples of errors in pronoun case, verb form, definiteness agreement, missing constituents (mostly a missing subject), reference errors, word order and tense shift.11 Other errors concerned incorrect use of idiomatic expressions, missing prepositions or the use of the conjunction och 'and' instead of the infinitive marker att 'to'. The influence of spoken language is evident in many of the examples. Tense-endings on verbs are dropped, and accusative forms of pronouns are not used; in particular, the pronunciation-like form dom ('they' or 'them') is used instead of the nominative (de) and accusative (dem) forms, which, as mentioned earlier, are only distinguished in writing. Also the use of the conjunction och 'and' instead of the infinitive marker att 'to' indicates influence of the spoken language. Dialect influence occurs in the example of definiteness agreement with the determiner denna 'this' followed by a definite noun. All the error types that Teleman found (except for one) occurred in our Child Data corpus as well. Only the case where two supine verbs follow each other (double supine) was not found in the present Child Data corpus. However, there were additional types of errors in Child Data, such as verb form errors other than dropped tense-endings on finite verbs, occurrences of erroneous word choice other than prepositions or conjunctions used in place of the infinitive marker, and occurrences of superfluous constituents.

11 The column representing the correct forms of the exemplified errors contains my own suggestions; Teleman (1979)'s examples are listed without any suggestions of possible correction.
Table 4.20: Examples of Grammar Errors in Teleman's Study

Pronoun form: *dom 'they [spoken form]' → de, dem 'they [nom], them [acc]'; *han, hon 'he, she' → honom, henne 'him [acc], her [acc]'
Verb form: *fråga 'ask [inf]' → frågade 'asked [pret]'
Double supine: *fått sålt 'got [sup] sold [sup]' → fått sälja 'got [sup] sell [inf]'
Agreement in NP: *denna bilen 'this car [def]' → denna bil 'this car [indef]'
Agreement in PRED: hennes förslag ... förefaller mig *orealistisk 'her suggestion [neu] ... appears to me unrealistic [com]' → orealistiskt 'unrealistic [neu]'
Missing constituents: Tog med honom till polisen. 'took along him to the-police' → subject missing
Reference: polisen ... *de 'the-policeman ... they' → han 'he'
Word order: *ett till fall 'a more case' → ett fall till 'a case more'
Tense shift: Då förstod Majsan varför han *har varit rädd. 'then understood [pret] Majsan why he has been [perf] afraid' → hade varit 'had been [past perf]'
Choice of or missing prepositions: bet *på repet 'bit on the-rope' → bet i repet 'bit in the-rope'; fråga vissa saker 'ask some things' → fråga om 'ask about'
'och' instead of 'att': få lov *och göra något 'get permission and do something' → att göra 'to do'

Skrivsyntax

Among the seven error types distinguished in the error analysis of the Skrivsyntax project on the writing of third year students of upper secondary school (Hultman and Westman, 1977, p. 230), grammar errors were the most frequent: of the whole corpus of 88,757 words, 1,157 errors were classified as grammar errors. According to Hultman and Westman (1977), gender agreement errors were common, and relatively many errors in pronoun case after a preposition occurred in these texts. Errors in agreement between subject and predicative complement occurred quite frequently. Word order errors were also reported, mostly in the placement of adverbials.
Other examples include verb form errors, errors in idiomatic phrases (the majority concerning prepositions), subject-related errors, and clauses with odd structure. Some examples of these grammar errors are displayed in Table 4.21.

Table 4.21: Examples of Grammar Errors from the Skrivsyntax Project

Gender agreement: bland *det mest intolerabla och kortsynta formen på samlevnad 'among the [neu] most intolerant and short-sighted form [com,def] of married life' → den ... formen 'the [com] ... form [com,def]'
Agreement in PRED: barnet är *van 'the-child [neu,def] is used-to [com]' → vant 'used-to [neu]'
Pronoun case: för alla *de som 'for all they [nom] that' → dem 'them [acc]'; hjälpa *de som 'help they [nom] that' → dem 'them [acc]'
Verb form: Naturligtvis måste båda typerna av äktenskap *finns 'of course must both types of marriage exists [pres]' → måste ... finnas 'must ... exist [inf]'; Hon har inte *kunna frigöra sig 'she has not be-able [inf] free herself' → har ... kunnat 'has ... been-able [sup]'
Word order: Ett äktenskap kräver att två personer bara skall älska varandra hela livet ut 'a marriage demands that two people only shall love each-other whole the-life out' → skall älska bara varandra 'shall love only each-other'
Idiomatic expressions: löftet *till trohet 'the-promise to fidelity' → om 'about'; grundtanken *till äktenskapet 'the-fundamental-idea to marriage' → i 'in'

Other errors mentioned concern the structure of sentences and include, for instance, the omission of the infinitive marker att 'to', main clause word order in subordinate clauses, and sub-categorization of verbs. Reference errors are also observed and are considered to be quite common in the material. Some tense problems occurred. The error types encountered in Skrivsyntax give a general indication of a decreasing influence of spoken language on writing compared to earlier ages.
The only errors that may contradict this statement are errors in the use of the subject form of the pronoun de 'they' in object position or in certain expressions after prepositions (where it should be dem 'them'). Verb form errors, on the other hand, include only erroneous use of existing written forms, with no dropped tense-endings being reported. These errors, and errors in the choice of preposition, gender agreement, verb forms and word order, were also found in Child Data. Omission of the infinitive marker with certain verbs was only analyzed in the context of the verb komma 'will' in the present study. Further, constituent structure seems to be more complex than in the texts of Child Data, resulting in errors where the agreeing elements are separated by more words and are thus harder for the writer to discover, e.g. the gender agreement error in bland det mest intolerabla och kortsynta formen på samlevnad.

Conclusion

Although the Teleman and Skrivsyntax studies cannot be considered completely representative of the two age groups, and despite a time span of more than twenty years between those studies and the present one, the error types that occur in children's writing are persistent. The writing of primary school children shows similarities to Child Data mostly in the use of spoken forms. Those types of errors seem to be (almost) non-existent in secondary level writers. Since no counts or other indications of error frequency beyond the examples themselves are given, the relative frequency and distribution of errors remain unclear.

4.4.2 Evaluation Texts of Proof Reading Tools

As already mentioned, the evaluation studies that have been carried out as part of the development of the three Swedish grammar checking tools report on grammar errors found primarily in the writing of professional adult writers. Here, we look at the errors reported in two such studies and compare them to the grammar errors found in Child Data.
Error Profiles of the Evaluation Texts

The performance test of Grammatifix reporting the ratio of detected errors (recall) was based on a newspaper corpus of 87,713 words (Birn, 2000).12 The material included in total 127 grammar errors, summarized in Table 4.22 below.13 Among the error types, Other agreement errors covered complements, postmodifiers and anaphoric pronouns (i.e. reference errors), and the category Missing or superfluous endings consisted of e.g. genitive, passive or adverb endings. Verb form errors included mostly errors in verb clusters. It is not clear which types of errors belong to the category of Sentence structure errors, or what is included under the Other category (see further Birn, 2000, p. 39).

Table 4.22: Grammar Errors in the Evaluation Texts of Grammatifix

ERROR TYPE                        NO.    %
Agreement in noun phrase          22     17.3%
Other agreement errors            9      7.1%
Verb form                         28     22.0%
Choice of preposition             26     20.5%
Missing or superfluous endings    21     16.5%
Sentence structure                8      6.3%
Word order                        3      2.4%
Other                             10     7.8%
TOTAL                             127    100%

Four error types clearly dominate: errors in verb form, choice of preposition, agreement in noun phrase and missing or superfluous endings. Other types occurred at most ten times. In Knutsson (2001), an evaluation of Granska's proof-reading tool is reported, based on a text corpus of 201,019 words. The collection included texts of different genres, mostly news articles of various kinds, some official texts, popular science articles and student papers. The analysis concerned grammar, punctuation and some spelling errors. Table 4.23 below is a summary of the grammar errors (see further Knutsson, 2001, p. 143). The relative frequency of the error types was recalculated.

12 Precision of the system, i.e. how good the system is at avoiding false alarms, was tested on a corpus of 1,000,504 words.
It is not clear whether this corpus includes different newspaper texts or whether there was an overlap with the texts tested for recall of the system. According to the author, only the recall corpus was pre-analyzed manually for grammar errors (see further Birn, 2000).

13 Birn (2000) also reports 8 instances of splits. They are not included here, since that type belongs to the spelling error category.

The error classification in Granska's corpus is more similar to the classification adopted in the present thesis. The category of Verb form errors, however, does not specify the different sub-categories. Altogether, 272 grammar errors occurred in this evaluation corpus. Both Granska's corpus, which is more than double the size, and the evaluation texts of Grammatifix display almost the same error rate, with 1.35 and 1.45 errors per 1,000 words, respectively. Most errors were erroneous verb forms, followed in frequency by agreement errors in noun phrases and missing constituents. Some errors occurred in predicative complement agreement and pronoun form. The rest of the error types occurred fewer than ten times each.

Table 4.23: Grammar Errors in Granska's Evaluation Corpus

ERROR TYPE                      NO.    %
Definiteness in single nouns    4      1.5%
Agreement in noun phrase        69     25.4%
Agreement in pred. compl.       16     5.9%
Verb form                       89     32.7%
Pronoun form                    14     5.1%
Reference                       1      0.4%
Choice of preposition           11     4%
Word order                      8      2.9%
Missing word                    56     20.6%
Redundant word                  4      1.5%
TOTAL                           272    100%

Comparison with Child Data

The most obvious difference between the grammar errors from the evaluation texts and Child Data is the error rate relative to the size of the corpora. Although the Child Data corpus is the smallest, the total number of errors is almost the same as that in Granska's evaluation texts. Errors in Child Data, at almost 9 errors per 1,000 words, are six times more frequent than in the evaluation texts, which have an error density of less than 1.5 errors per 1,000 words (see Table 4.24).
Table 4.24: General Error Ratio in Grammatifix, Granska and Child Data

                        GRAMMATIFIX    GRANSKA    CHILD DATA
Number of words         87,713         201,019    29,812
Number of errors        127            272        262
Errors/1,000 words      1.45           1.35       8.8

As we have seen, error classification varies between the projects, making a comparison of all error types impossible. Verb form errors, noun phrase agreement, missing constituents (in Granska) and erroneous choice of preposition (in Grammatifix) are the four most common error types, with frequencies in the range of 20% to 30% each. Recall that errors in Child Data are less evenly spread among the various types of errors. They are clearly dominated by errors in (finite) verb forms (42%), followed by missing constituents at half that frequency (16.8%). Erroneous choice of words is the third most common grammar error (10.7%). Agreement errors in noun phrase occurred in 15 cases (5.7%). Relating the errors of noun phrase agreement, verb form and choice of preposition reported by all groups to the size of the corpora, as presented in Table 4.25 below, we get a rough picture of the frequency of these three error types in comparison to Child Data. The corresponding error types selected from the Child Data corpus include all the errors in agreement in noun phrases and only the preposition-related errors in the word choice category. Three error categories were selected as representative of verb form errors: finite main verb, verb form after auxiliary verb and verb form after infinitive marker.

Table 4.25: Three Error Types in Grammatifix, Granska and Child Data

                            GRAMMATIFIX         GRANSKA             CHILD DATA
ERROR TYPE                  No.  Errors/1,000   No.  Errors/1,000   No.  Errors/1,000
Agreement in noun phrase    22   0.25           69   0.34           15   0.50
Verb form                   28   0.32           89   0.44           112  3.76
Choice of preposition       26   0.30           11   0.05           10   0.34

Table 4.25 is also rendered as a graph in Figure 4.4 below.
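The density figures in Tables 4.24 and 4.25 follow directly from the raw counts. As a minimal illustration (this sketch is mine, not part of the thesis's FiniteCheck implementation), the bottom row of Table 4.24 can be reproduced in Python:

```python
# Corpus sizes and error counts, as reported in Table 4.24.
corpora = {
    "Grammatifix": (87_713, 127),
    "Granska": (201_019, 272),
    "Child Data": (29_812, 262),
}

def errors_per_thousand(words: int, errors: int) -> float:
    """Error density: number of errors per 1,000 running words."""
    return errors / words * 1000

for name, (words, errors) in corpora.items():
    print(f"{name}: {errors_per_thousand(words, errors):.2f}")
```

Rounded to the precision used in the table, this yields 1.45 for Grammatifix, 1.35 for Granska and 8.8 for Child Data.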
These figures show that children made more errors than the adult writers in all three error types. The difference is marginal for errors in noun phrase agreement and choice of preposition. For verb form errors, the difference is roughly eightfold: children made almost four such errors per 1,000 words, compared to the adults' less than 0.5. The distribution of errors over the three error categories is the same for Child Data and Granska, with fewest errors in choice of preposition and most in verb form. In the Grammatifix corpus, erroneous choice of preposition is quite frequent, at almost the same rate as in Child Data, whereas errors in noun phrase agreement are few.

Figure 4.4: Three Error Types in Grammatifix (black line), Granska (gray line) and Child Data (white line)

Conclusion

The error classifications in the projects differ, making comparison on a more detailed level impossible. The overall error rate reveals similar values for the adult corpora, whereas errors are considerably more frequent in Child Data. A comparison of the three most common error types in the adult corpora with the same types in Child Data displays a considerable difference in the frequency of verb form errors, whereas the difference is not as substantial for the other two types. Although not all error types could be compared, this observation indicates that there is a difference not only in the overall error rate, but also in the types of errors.

4.4.3 Scarrie's Error Database

As mentioned in Section 2.4, corrections by professional proof-readers at two Swedish newspapers were gathered into a Swedish Error Corpora Database (ECD) in the Scarrie project. This database now contains nearly 9,000 error entries. In total, 1,374 of these errors were classified as grammar errors, corresponding to approximately 16% of all errors (Wedbjer Rambell et al., 1999).
Error Profile of the Error Database

The error classification in ECD is very fine-grained: the division of error types is based primarily on the type of phrase involved rather than on the violation type. As Wedbjer Rambell et al. (1999) state, noun phrase errors are the most frequent, followed by verb sub-categorization problems, errors in prepositional phrases and problems within verb clusters. Within the noun phrase category, agreement errors are the most common error type (27.8%), followed by definiteness in single nouns (22.3%) and case errors (14.2%). Verb valence, the second largest grammar problem category, includes problems with the infinitive phrase as the most frequent (24.7%); moreover, over 90% of all verb valence errors concern the infinitive marker att 'to' (one third occur after the verb komma 'will'). Choice of preposition and missing preposition are the most frequent error subtypes in the prepositional phrase category (36% and 26.6%, respectively). Finally, in verb clusters, errors involving an auxiliary verb followed by an infinitive (33.3%), main verbs in the finite form (30.6%) and a temporal auxiliary verb followed by a supine (18.0%) are the most common.

Comparison to Child Data

The fine division of error types and the on-line availability of Scarrie's ECD make a more extensive and precise comparison of the studies possible. In total, eleven error types are compared with the errors in Child Data, presented in Table 4.26. The missing auxiliary verb and missing infinitive marker errors, which were quite few, are not included, nor are all the word choice or Other category errors. The large size of the newspaper corpus in Scarrie (approximately 70,000,000 words) results in a ratio of 0.009 errors per 1,000 words. In the Child Data corpus, the ratio is 7.8 errors per 1,000 words for the listed error types.
The big gap in error density is obvious; the further analysis therefore compares how frequent the errors are across these selected categories and shows which types of errors characterize the corpora.

Table 4.26: Grammar Errors in Scarrie's ECD and Child Data

                                 SCARRIE           CHILD DATA
ERROR TYPE                       NO.    %          NO.    %
Agreement in noun phrase         176    25.7%      15     6.4%
Agreement in pred. compl.        48     7.0%       8      3.4%
Definiteness in single nouns     68     9.9%       6      2.6%
Pronoun form                     21     3.1%       5      2.1%
Finite verb form                 34     5.0%       110    46.8%
Verb form after Vaux             57     8.3%       7      3.0%
Verb form after inf. marker      4      0.6%       4      1.7%
Word order                       57     8.3%       5      2.1%
Missing or redundant word        132    19.2%      57     24.3%
Choice of preposition            76     11.1%      10     4.3%
Reference                        13     1.9%       8      3.4%
TOTAL                            686    100%       235    100%
Errors/1,000 Words               0.009             7.8

Figure 4.5 shows the relative error frequency of the selected error types in Scarrie's corpus, and Figure 4.6 shows the same for the Child Data corpus. The main difference is that the top error type for Child Data, errors in the finite verb form, is not a very common error in Scarrie's corpus. The other three top error types in Child Data and the three top error types in Scarrie are represented by the same categories, but in a slightly different order. In Scarrie's corpus, noun phrase agreement errors are the most frequent, followed by missing and redundant constituents and then choice of preposition. In Child Data, agreement errors in noun phrase are much less frequent than omission or addition of words in sentences, and erroneous choice of preposition is the least frequent of these three categories. Errors in verb forms overall have a much lower frequency in Scarrie's corpus. Errors in verb form after an auxiliary verb are the fifth most common error type in Scarrie's corpus and the most frequent among the verb errors. Errors in finite verb form are even less frequent, and errors in verb form after an infinitive marker are quite rare.
In Child Data, errors in verb form after an auxiliary verb are much less frequent than errors in the finite verb, the most common error. Errors in verb form after an infinitive marker are also rare. As already mentioned, agreement errors in noun phrases have a higher frequency in Scarrie's ECD than in Child Data. Agreement errors in predicative complement position seem to be slightly more common in Scarrie's texts, and likewise definiteness errors in bare nouns.

Figure 4.5: Error Distribution of Selected Error Types in Scarrie

Figure 4.6: Error Distribution of Selected Error Types in Child Data

There were few word order errors in Child Data. These seem more common in Scarrie's ECD, being as common as errors in verb form after an auxiliary verb. The opposite holds for reference errors, which were quite rare in Scarrie's texts and more common in Child Data. Pronoun form errors display a similar distribution in both corpora.

Conclusion

Comparison of error frequency over the selected error types in the two corpora shows both differences and similarities. The largest difference is in the verb form errors: in Scarrie's texts, verbs following an auxiliary verb are the main problem, whereas in Child Data it is the finite verb form, the most common error in the whole corpus. Other differences concern word order and definiteness in bare nouns, both more common in Scarrie's corpus, and reference errors, more common in Child Data. Agreement errors in predicative complements seem to be slightly more common in Scarrie's corpus. Some of the differences could be circumstantial, due to the difference in the size of the corpora, but surely not those in the most common error types. Child Data's profile is characterized by errors in finite verb form and omissions or additions of words; Scarrie's texts are dominated by errors in noun phrase agreement and omission or addition of words.
Agreement errors in noun phrases are the third most common error type in Child Data. Errors in choice of preposition and pronoun form have similar frequency distributions in the two corpora.

4.4.4 Summary

The nature of the grammar errors in Child Data is more similar to the errors found in Teleman's primary school children than to those of the secondary level writers of the Skrivsyntax project. The different error classifications in the grammar checking projects made deeper analysis difficult. Errors are, in general, more frequent in Child Data, but a closer look at three error types indicates that for some error types the difference is marginal, whereas for others children make many more errors. A fine-grained comparison with some selected error types from Scarrie's ECD confirms this difference, with different error frequency distributions in certain error sub-types. On the other hand, the most common error types in Scarrie's corpus are, finite verb form errors aside, also the most frequent in Child Data.

4.5 Real Word Spelling Errors

4.5.1 Introduction

This section is devoted to spelling errors which form existing words. These errors are particularly interesting from the computational point of view, because they normally require analysis of context larger than a single word and are most often not discovered by a traditional spelling checker developed for the detection of errors in isolated words. Since this error category is not the main focus of the present study, the analysis aims at providing an overall impression of what errors occur and what grammatical consequences the new word formations create, rather than an analysis of the spelling error types. First, the spelling violation types that are typical of Swedish are presented (Section 4.5.2), followed by an analysis of segmentation errors (Section 4.5.3) and misspelled words (Section 4.5.4). The total number of errors and their distribution is discussed at the end of this section (Section 4.5.5).
4.5.2 Spelling in Swedish

As mentioned in the classification of error categories in Chapter 3 (Section 3.4), spelling errors occur as violations of the orthographic norms of a language. In Swedish, these errors concern operations on letters and the segmentation of words. Compounds in Swedish are always written as one word. Since compounding is such a productive category, compounds are often a source of erroneous segmentation. They are most often spelled apart, forming more than one word, but the opposite occurs as well, when words are written together as if they were a compound. Other spelling violations occur when letters in words are missing, are replaced by other letters, are moved to other positions in the word, or when extra letters appear. Apart from these basic operations, Swedish has consonant gemination, often a cause of spelling errors (cf. Nauclér, 1980): words can differ simply in single or double consonants and have completely separate meanings, as in glas 'glass' and glass 'ice-cream'. The spelling errors in this study are divided first into segmentation errors and misspellings. Segmentation errors are then further divided into writing words apart, i.e. erroneous separation of compound elements (splits), and writing words together, i.e. erroneous combination of words into compounds (run-ons). The error taxonomy of misspellings is based on the four basic error types of omission, insertion, substitution and transposition, usually applied in research on spelling (e.g. Kukich, 1992; Vosse, 1994), extended with two additional categories related to consonant doubling. The spelling taxonomy thus consists of two main categories, with segmentation errors divided into two sub-categories and misspellings into six sub-categories:

1. Segmentation Errors:
(a) splits - a word written apart, with a space in between
(b) run-ons - words written together as one

2.
Misspellings:
(a) omission - a letter is missing
(b) double consonant omission - a single consonant instead of a double consonant
(c) insertion - an extra letter is added
(d) double consonant insertion - a double consonant instead of a single consonant
(e) substitution - a letter is replaced by another letter
(f) transposition - two or more letters have changed positions

A word can violate just one such spelling operation on letters or spaces, or several spelling violations may occur together. The categories are exemplified in Table 4.27 below. All the errors in the table are real word spelling errors found in the current corpus, some of them with multiple violations. First the error category is presented, followed by an example of it and its correct form. The last column in the table gives the error index in the corresponding appendix where the error instance(s) may be found (misspelled words from Appendix B.2, with indices starting in M, and segmentation errors from Appendix B.3, with indices starting in S).
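The six single-operation misspelling categories above can be operationalized as edit operations between the misspelled form and the intended word. The following Python sketch (an illustration of the taxonomy only, not the thesis's implementation) classifies a misspelling that involves exactly one operation:

```python
def classify(error: str, correct: str) -> str:
    """Classify a single-operation misspelling by comparing the
    erroneous form with the intended word: substitution, transposition,
    (double consonant) omission or (double consonant) insertion."""
    if len(error) == len(correct):
        diffs = [i for i, (a, b) in enumerate(zip(error, correct)) if a != b]
        # Two adjacent differing positions with swapped letters.
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and error[diffs[0]] == correct[diffs[1]]
                and error[diffs[1]] == correct[diffs[0]]):
            return "transposition"
        if len(diffs) == 1:
            return "substitution"
    elif len(error) == len(correct) - 1:
        # One letter dropped; find the first mismatching position.
        i = next((i for i in range(len(error)) if error[i] != correct[i]),
                 len(error))
        missing = correct[i]
        doubled = ((i + 1 < len(correct) and correct[i + 1] == missing)
                   or (i > 0 and correct[i - 1] == missing))
        return "double consonant omission" if doubled else "omission"
    elif len(error) == len(correct) + 1:
        # One letter added; find the first mismatching position.
        i = next((i for i in range(len(correct)) if error[i] != correct[i]),
                 len(correct))
        extra = error[i]
        doubled = ((i + 1 < len(error) and error[i + 1] == extra)
                   or (i > 0 and error[i - 1] == extra))
        return "double consonant insertion" if doubled else "insertion"
    return "multiple or unknown"
```

On the single-error examples of Table 4.27 this yields, e.g., "omission" for bror/beror, "double consonant omission" for koma/komma, "insertion" for örn/ön and "transposition" for förts/först.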
Table 4.27: Examples of Spelling Error Categories

SINGLE ERRORS:
Split: djur affär 'animal store' → djuraffär 'animal-store' (S1.1.28)
Run-on: tillslut 'close' → till slut 'eventually' (S8.1.3-12)
Omission: bror 'brother' → beror 'depends' (M4.2.1)
Double omission: koma 'coma' → komma 'to come' (M4.2.33-36)
Insertion: örn 'eagle' → ön 'the island' (M1.1.51)
Double insertion: matt 'faint' → mat 'food' (M1.2.3)
Substitution: bi 'bee' → by 'village' (M1.1.9-11)
Transposition: förts 'been taken' → först 'first' (M6.4.1-2)

MULTIPLE ERRORS:
Split and double omission: brand manen 'fire mane' → brandmannen 'fire-man' (S1.1.21-22)
Substitution and split: kran kvistar 'tap twigs' → grankvistar 'fir-twigs' (S1.1.59)
Double omission and substitution: fören 'the stem' → förrän 'until' (M8.1.1-4)
Omission and double insertion: tupp 'rooster' → stup 'precipice' (M1.1.46)

Some spoken forms in Swedish are accepted as spelling variants and will not be counted as errors in this analysis. These are listed in Table 4.28 below.

Table 4.28: Spelling Variants

SPOKEN FORM    WRITTEN EQUIVALENT
dom            de 'they'
sen            sedan 'then'
sa             sade 'said'
la             lade 'laid'
nån            någon 'someone'
nåt            något 'somewhat'
nåra           några 'some [pl]'
nånstans       någonstans 'somewhere'
sån            sådan 'such [com]'
sånt           sådant 'such [neu]'
såna           sådana 'such [pl]'
våran          vår 'ours [com]'
vårat          vårt 'ours [neu]'
mej            mig 'me [acc]'
dej            dig 'you [acc]'
sej            sig 'him/her/itself [acc]'
stan           staden 'city [def]'
dan            dagen 'day [def]'

4.5.3 Segmentation Errors

The different types of segmentation errors are listed in Table 4.29, together with the number of different word types involved and how many of them were misspelled. Splits are further divided according to the part of speech they concern. The distribution of segmentation errors across sub-corpora and participant ages is discussed in Section 4.5.5.
Table 4.29: Distribution of Real Word Segmentation Errors

CATEGORY             NUMBER    WORD TYPES    MISSPELLED
Run-ons              13        4             0
Splits:
  Nouns              126       90            6
  Adjectives         49        37            0
  Pronouns           5         2             1
  Verbs              8         8             0
  Adverbs            53        21            5
  Prepositions       2         2             0
  Conjunctions       3         1             0
TOTAL SPLITS         246       160           12

Very few real word spelling errors occurred as words written together (run-ons), since these most often result in non-words. The cases that formed an existing word included just four word types. The most recurrent real word run-on was the prepositional phrase till slut 'eventually', which, when written together, forms the verb tillslut 'close', see (4.56):

(4.56) (S8.1.12)
a. Vi åkte *tillslut på bio.
   (we went close to cinema)
   'We went eventually to the cinema.'
b. till slut 'eventually'

Splits, on the other hand, are usually realized as real words, since they are compounded of two (or more) lemmas. As seen in Table 4.29, most of the splits concern noun compounds. In six cases, these were also misspelled, resulting in real words as in (4.57). Here, the compound brandmännen 'the firemen' is split, while a vowel substitution occurs in the second part of the compound. Both parts are finally realized as lexicalized strings, which then slip through a spellchecker unnoticed:

(4.57) (S1.1.23)
a. *brand menen ryckte ut och släckte elden.
   (fire the-harms turned out and put-out the-fire)
   'The firemen turned out and put out the fire.'
b. brandmännen 'fire-men'

Two instances among the noun splits were not compounds, as for instance in (4.58) below, where the definite suffix is separated from the noun stem:

(4.58) (S1.1.118)
a. ni får gärna bo hos oss under *tid en ni inte har nåt att bo i.
   (you [pl] may gladly live at us during time [definite suffix] you [pl] not have something to live in)
   'You are welcome to live at our place during the time you don't have anywhere to live.'
b. tiden 'the-time'

Also, adjectives are quite often split, with the parts realized as existing words.
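Segmentation errors such as those in (4.56)-(4.58) suggest a simple lexicon-based heuristic: flag adjacent tokens whose concatenation is itself a lexicon word (a candidate split), and tokens that can be cut into two lexicon words (a candidate run-on). The following Python sketch uses a toy word list for illustration; neither the lexicon nor the functions are taken from the thesis's actual implementation:

```python
# Toy lexicon; a real system would use a full-form lexicon of Swedish.
LEXICON = {"djuraffär", "djur", "affär", "jätteglad", "jätte", "glad",
           "till", "slut", "tillslut", "vi", "åkte", "på", "bio"}

def split_candidates(tokens):
    """Adjacent token pairs whose concatenation is a lexicon word,
    e.g. 'djur affär' -> 'djuraffär' (a candidate split compound)."""
    return [(i, tokens[i] + tokens[i + 1])
            for i in range(len(tokens) - 1)
            if tokens[i] + tokens[i + 1] in LEXICON]

def runon_candidates(tokens):
    """Tokens that can be cut into two lexicon words,
    e.g. 'tillslut' -> ('till', 'slut') (a candidate run-on)."""
    out = []
    for i, tok in enumerate(tokens):
        for cut in range(1, len(tok)):
            if tok[:cut] in LEXICON and tok[cut:] in LEXICON:
                out.append((i, tok[:cut], tok[cut:]))
    return out
```

Both checks over-generate (tillslut is itself a valid verb, for instance), which is why real word errors of this kind, as noted in Section 4.5.1, normally require analysis of context larger than the word itself.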
A recurrent error (27 occurrences) is the segmentation of the modifying intensifier Error Profile of the Data 93 jätte ‘giant’ as in (4.59). This is supposed to be written together (see Teleman et al., 1999, Part2:185-188). (4.59) (S2.1.18) a. då blev jag ∗ jätte glad then become I giant happy – Then I was extremely happy. b. jätteglad extremely-happy Splits in adverbs are recurrent as well, often concerning certain words, as seen in the number of word types. Some of them were also misspelled, as for instance in (4.60), where ändå ‘anyway’ is split and the first part includes vowel substitution and realizes as the indefinite determiner en ‘a’: (4.60) (S5.1.46) a. men olof var glad ∗ en då but Olof was happy a then – But Olof was happy anyway. b. ändå anyway Eight cases concerned split verbs. One of these included a morphological split, where the past tense suffix was separated from the verb stem: (4.61) (S4.1.7) a. Han ∗ ring de till mig sen och sa samma sak. he call [pret] to me afterwards and said same thing – He called me afterwards and said the same thing. b. ringde called Also, some splits in pronouns, prepositions and conjunctions occurred. Among the conjunctions, three cases of the conjunction eftersom ‘because’ were segmented: (4.62) (7.1.1) a. ∗ Efter som han frös och ... after that he was-cold and – Because he was cold and ... b. eftersom because Chapter 4. 94 All these segmentation errors resulting in real words are presented in Appendix B.3. They are classified first by the type of violation that occurred and then by partof-speech. 4.5.4 Misspelled Words In general, multiple misspellings occurred in just a few cases, most of the words involved single violations. Substitution and double consonant omission are the most frequent spelling violations. Nouns, pronouns and verbs are the most frequent categories for violations. Certain types of words seem to be more problematic than others regarding spelling. 
For instance, there is real confusion concerning the spelling of the pronoun de ‘they’. Recall that this pronoun is pronounced as [dom], as is the accusative form dem ‘them’. Both forms can be spelled as dom, an accepted spelling variant, as well. In sixteen cases, four subjects used the accusative form dem ‘them’ as in (4.63a): (4.63) (M3.1.49) 16 occurrences, 4 subjects a. ∗ Dem hade ett privatplan them had a private-plane – They had a private-plane. b. De They Two children substituted the vowel in the pronoun, as a consequence, it was realized as the noun dam ‘lady’, as in (4.64a): (4.64) (M3.2.13) 14 occurrences, 2 subjects a. ∗ dam bodde i en by lady lived in a village – They lived in a village. b. dom/de they Another confusion exists between the pronouns det ‘it’ and de ‘they’. In speech, det is usually reduced to [de], thus coinciding with the plural pronoun de ‘they’ in writing. In 33 cases, 15 subjects used de instead of det ‘it’: Error Profile of the Data 95 (4.65) (M3.1.20) 33 occurrences, 15 subjects a. ja men nu är ∗ de läggdags sa mormor yes but now is they bed-time said grandmother – Yes, but now it is time to go to bed, grandmother said. b. det it The opposite occurred in nine cases, where six subjects wrote the singular det ‘it’ instead of the plural pronoun de ‘they’: (4.66) (M3.1.4) 9 occurrences, 6 subjects a. ∗ Det kom till en övergiven by it came to a abandoned village – They came to an abandoned village b. De They Other rather recurrent spelling errors concern the pronoun vad ‘what’, the adverb var ‘where’, the infinitive verb form vara ‘to be’ and the past form of the same verb var ‘was/were’, all of which can be pronounced [va]. First, the forms are often erroneously substituted for one another. In six cases, the form var is used instead of the correct pronoun vad ‘what’ as in: (4.67) (M3.6.22) 6 occurrences, 4 subjects a. Men ∗ var är det för ljud? but where is it for sound – But what is it for sound? b. 
vad
what

Then in eight cases the form vad is used instead of the past verb form var ‘was/were’:

(4.68) (M4.6.8) 8 occurrences, 3 subjects
a. Hans älsklingsfärg ∗vad grön.
   his favourite-colour what green
   – His favourite colour was green.
b. var
   was

Two children also used vad for the adverb form var ‘where’ in three cases:

(4.69) (M3.6.25) 3 occurrences, 2 subjects
a. Hjälp det brinner ∗vad nånstans.
   help it burns what somewhere
   – Help! Fire! Whereabouts?
b. var
   where

Further, these words are also realized as the corresponding (reduced) pronunciation form va, which in turn coincides with the interjection va ‘what’ in writing. Most of these cases concerned the past verb form var ‘was/were’ as in:

(4.70) (M4.5.4) 33 occurrences, 8 subjects
a. Klockan ∗va ungefär 12 när jag vaknade
   the-watch what approximately 12 when I woke
   – The time was about 12 when I woke up.
b. var
   was

Some cases included the infinitive verb form vara ‘to be’:

(4.71) (M4.5.39) 8 occurrences, 5 subjects
a. dom vill inte ∗va kompis med han/hon.
   they want [pres] not what friend with he/she
   – They don’t want to be friends with him/her.
b. vill inte vara
   want [pres] not be [inf]

Here is an example of the use of the adverb var ‘where’ reduced as va:

(4.72) (M6.5.3) 3 occurrences, 1 subject
a. sen undra han ∗va dom bodde
   then wonder he what they lived
   – Then he wondered where they lived.
b. var
   where

Two instances of va corresponded to the pronoun vad ‘what’ as in:

(4.73) (M3.5.4) 2 occurrences, 1 subject
a. Madde vaknade av mitt skrik, hon fråga ∗va det var för nåt.
   Madde woke from my shout, she ask what it was for something
   – Madde woke up from my shout. She asked what was wrong.
b. vad
   what

Other spelling also related to spoken reduction concerned the pronoun jag ‘I’, normally pronounced [ja], which, when written as pronounced, corresponds to ja ‘yes’.
Three instances of the use of jag as ja occurred: (4.74) (M3.5.3) 3 occurrences, 2 subjects a. Vilken fin klänning ∗ ja har what pretty dress yes have – What a pretty dress I have. b. jag I Also, five instances concern the conjunction och ‘and’, usually pronounced as [å], which in writing coincides with the noun å ‘river’. (4.75) (M8.1.11) a. Vi bor i samma hus jag och Kamilla ∗ å hennes hund. we live in same house I and Kamilla river her dog – We live in the same house me and Kamilla and her dog. b. och and All these misspelled words resulting in real words are listed in Appendix B.2. They are classified first by the part-of-speech of the intended word and then by the part-of-speech of the realized word. The type of spelling violations that occur are notified in the margin. Chapter 4. 98 4.5.5 Distribution of Real Word Spelling Errors From the examples above, it is clear that the children’s spelling is quite unstable. In general there is a high degree of confusion as to which form to write in which context and many spoken forms are used. The totals of misspelled words, splits and run-ons are summarized in Table 4.30 below, where the texts are divided into sub-corpora, and in Table 4.31, where the texts are grouped by age. The errors are divided further into non-words and real words and the relative frequency of errors compared to the total number of words is presented. As already discussed in the general overview in Section 4.2, all spelling errors (i.e. both non-word and real word) amount to 10.2% of all words. Most common are misspelled words, followed by splits, which are more recurrent than run-ons. The same distribution applies for real word spelling errors. In total, (the last column in the last row in the tables) these amount to 2.3% of all words, which is three times less than non-word spelling errors (7.9%). 
Put in other words, real word spelling errors amount to 29% of all spelling errors.14 Real word spelling errors are also dominated by misspelled words (1.5%). Splits are more common as real words (0.8%, in comparison to 0.4% for non-word splits), whereas run-ons are almost non-existent as real words (0.04%). Most of the misspelled words realized as real words occur in the Deserted Village corpus and among the 9-year-olds. Real word splits are also most frequent in the Deserted Village corpus, closely followed by the Frog Story corpus. With respect to age, the texts of the 11-year-olds contained most of the erroneous splits (non-word splits are most common among the 9-year-olds). Real word run-ons are very rare, so not much can be said about their distribution across sub-corpora or age groups.

14 Recall that the corresponding rate Kukich (1992) refers to is: 40% of all misspellings result in lexicalized strings.

Table 4.30: Distribution of Real Word Spelling Errors in Sub-Corpora

                      Deserted   Climbing   Frog    Spencer     Spencer
ERROR TYPE            Village    Fireman    Story   Narrative   Expository   TOTAL
MISSPELLED WORDS:
  non-word              743        351       484      173         239        1 990
  %                     9.8        7.8       9.9      3.2         3.3          6.7
  real word             181         71        84       36          60          432
  %                     2.4        1.6       1.7      0.7         0.8          1.4
SPLITS:
  non-word               48         28        32       14           9          131
  %                     0.6        0.6       0.7      0.3         0.1          0.4
  real word              98         41        61       23          23          246
  %                     1.3        0.9       1.2      0.4         0.3          0.8
RUN-ONS:
  non-word              108         25        37       28          29          227
  %                     1.4        0.6       0.8      0.5         0.4          0.8
  real word               5          1         2        4           1           13
  %                     0.07       0.02      0.04     0.07        0.01         0.04
TOTAL:
  non-word              899        404       553      215         277        2 348
  %                    11.9        9.0      11.3      3.9         3.8          7.9
  real word             284        113       147       63          84          691
  %                     3.7        2.5       3.0      1.1         1.1          2.3

Table 4.31: Distribution of Real Word Spelling Errors by Age

ERROR TYPE            9-year   10-year   11-year   13-year   TOTAL
MISSPELLED WORDS:
  non-word               994      292       524       180     1 990
  %                     14.5      4.3       6.5       2.2       6.7
  real word              248       64        78        42       432
  %                      3.6      0.9       1.0       0.5       1.4
SPLITS:
  non-word                71       18        35         7       131
  %                      1.0      0.3       0.4       0.1       0.4
  real word               58       51       113        24       246
  %                      0.8      0.7       1.4       0.3       0.8
RUN-ONS:
  non-word               102       32        58        35       227
  %                      1.5      0.5       0.7       0.4       0.8
  real word                2        2         5         4        13
  %                      0.03     0.03      0.06      0.05      0.04
TOTAL:
  non-word             1 167      342       617       222     2 348
  %                     17.1      5.0       7.7       2.7       7.9
  real word              308      117       196        70       691
  %                      4.5      1.7       2.4       0.9       2.3

4.5.6 Summary

Real word spelling errors are three times less frequent than non-word spelling errors in the Child Data corpus. Misspelled words are the most common type of error, reflecting a clear spelling confusion for some word types. Splits are, in general, more common as real word errors, the opposite being the case for run-ons. Most errors occurred in the Deserted Village corpus and among the 9-year-olds, but the 11-year-olds made most of the erroneous segmentation errors (splits).

4.6 Punctuation

4.6.1 Introduction

Beginning writers, as mentioned in Chapter 3 (Section 3.4), usually use punctuation marks to delimit larger textual units than syntactic sentences, joining for instance (main) clauses together without any conjunctions. The main purpose of the present analysis of punctuation is to investigate the erroneous use of punctuation, manifested both as omissions, which give rise to joined sentences, and as substitutions and insertions. The length of the orthographic sentences marked by the subjects, and especially the number of (main) clauses joined in them without conjunctions (adjoined clauses), will give us a picture of how often sentence boundaries are omitted and to what degree sentences correspond to syntactic sentences. Analysis of the erroneous use of end-of-sentence punctuation and commas will reveal in what other places one might expect them. Orthographic sentences are taken to be sequences of words that start with a capital letter and end in a major delimiter (cf. Teleman, 1974). Also included in that category are sequences that do not completely follow the writing conventions of a capital letter at the beginning and a major delimiter at the end, but indicate the writer’s intention of such marking.
These include sentences ending in a major delimiter followed by a small letter, or the opposite, where the major delimiter is missing but the beginning of the next sentence is indicated by a capital. Within the orthographic sentence, occurrences of main sentences attached to a main clause without a conjunction are counted as adjoined clauses (cf. Näslund, 1981; Ledin, 1998). These reveal whether or not the writer joins syntactic sentences into larger units, or in other words omits sentence boundaries.

The analysis of punctuation is important for decisions on how to handle texts written by children computationally. Do they delimit their text in syntactic sentences? Are there any other units they delimit instead? What is then the nature of such delimitation? How frequently are sentences joined together and sentence boundaries omitted?

4.6.2 General Overview of Sentence Delimitation

The content-related, rather than syntactic, marking of text is also evident in the texts in this study. In the following example (4.76), written by a nine-year-old, most of the sentence boundaries correspond to syntactic units and are delimited in accordance with the writing conventions, using capital letters at the beginning and major delimiters at the end. Two adjoined clauses can be observed in the third and the fifth sentences, joining main sentences together without conjunctions. Two vertical bars indicate where one would expect a major delimiter between the adjoined clauses (spelling or other errors are ignored in the English version).15

(4.76) Den brinnande makan
Det var en gång en pojke som hette Urban. En dag tänkte Urban göra varma makor. Då hände en grej som inte får hända || huset brann upp för att makan hade tat eld. Då kom Urban ut med brinnande kalsingar och sa: Det brinner!!!!!!!!!!!!!!!!!!!!!! Brandkåren kom och spola ner huset || då börja Urban lipa och sa: Mitt hus är blöt.
– The burning sandwich
There was once a boy who was called Urban.
One day Urban planned to make hot sandwiches. Then a thing happened that should not happen. The house burnt down because the sandwich started to burn. Then Urban came out with burning underwear and said: Fire! The fire-brigade came and hosed down the house. Then Urban started to blubber and said: My house is wet. In other texts, punctuation marks are used to delimit larger units as in the following text (4.77), written by a ten year old: (4.77) Den där scenen med dammen som tappade sedlarna tycker jag att den där flickan måste vara fattig så att hon tar sedlarna . Den där scenen med det tre tjejerna tyckte jag att de var taskiga som går ifrån den tredje tjejen || det tycker jag att tjejen tar upp det på mötet med fröken och sedan tar fröken upp det på de andra tjejernas möte med fröken || det kan hjälpa ibland . – That scene with the lady that lost the money, I think that that girl must be poor so it is her who takes the money. That scene with the three girls, I thought that they were mean when they left the third girl. I think that the girl will take that up at the meeting with the teacher and then the teacher will take it up at the other girls’ meeting with the teacher. That can help sometimes. In this text, only two full stops occur. The first delimitation concerns a single sentence, correctly initiated by a capital letter and terminated by a full stop. The 15 The exemplified text represents the spell-checked versions, where the non-word misspellings have been corrected (see further in Section 3.5). 102 Chapter 4. sentence is quite long, however, and commas could facilitate reading. The second full stop terminates a whole paragraph that consists of at least three sentences. Some texts did not include any delimiters or other indicators of sentence boundaries at all, as in (4.78) also written by a ten year old. Again, vertical bars indicate the missing punctuation marks. 
(4.78) så här börja det || jag var på mitt land och bada || då var jag liten || plötsligt kom en snok || i för sig så hugger inte snokar i vatten men jag blev alla fall jätte rädd för jag kunde inte simma då och snoken jagade mig längre och längre ut || då ko min bror med en gummi båt och tog upp mig || då blev jag jätte glad – It started like this. I was in the country and went for a swim. I was little then. Suddenly a grass snake came. Actually grass snakes do not bite in the water, but I was very scared, because I could not swim then and the grass snake chased me further and further out. Then my brother came with a rubber-boat and lifted me up. Then I was very happy. In the following text (4.79) written by an eleven year old we see examples of long sentences, where several clauses are put together either by inserting conjunctions or as adjoined clauses. Especially the first orthographic sentence is quite long, consisting of first three sentences joined by the conjunction och ‘and’ followed by three adjoined clauses. Conjunctions are marked in boldface and omitted sentence boundaries are indicated by two vertical bars: (4.79) Ljus Det var en gång en pojke som hette Karl och gillade att leka med elden och en dag började det brinna i en hö-skulle ute på landet och den stackars pojken var bakom elden som hade sträckt ut sig tio meter bakom hö-skullen || då kom det ett åskmoln och blixten slog ner i ladugården som tog eld || kale som blev jätte rädd och sprang till närmaste hus som låg 9 kilometer bort || det tog en timme att koma ditt och då ringde han fel numer av bara farten. När han kom fram skrek han i örat på brand männen att det brann på Macintosh vägen 738c och brand menen rykte ut och släkte elden. SLUT – Light There was once a boy who was called Karl and liked to play with fire and one day a fire started in a hayloft out in the country and the poor boy was behind the fire that had spread ten meters behind the hayloft. 
Then came a thundercloud and the lighting struck in the cowshed that caught fire. Kalle who became very scared and ran to the nearest house that was 9 kilometers away. It took an hour to get there and then he called the wrong number because he was in such a rush. When he got through he yelled in the ear to the fire-men that there was fire at Macintosh Road 738c and the fire-men turned out and put out the fire. END It is a typical pattern in the whole Child Data corpus, that sentences are put together to build larger units, either as adjoined clauses where sentences follow each other without any conjunctions or long sentences are built with conjunctions Error Profile of the Data 103 as in the above text (4.79) or in the example below (4.80), written by a nine year old: (4.80) på morgonen när vi vakna och jag skulle gå ut att hämta cyklarna märkte jag att vi inte va på toppen av berget utan i en by || jag väckte pappa och skrek att han Va för tung och att vi åkt ner från berget och åkt så långt att vi inte visste va vi va. – In the morning when we woke up and I was about to go out to get the bicycles, I noticed that we were not on the top of the mountain but in a village. I woke Daddy up and yelled that he was too heavy and that we had fallen down from the mountain and fallen so far that we didn’t know where we were. 4.6.3 The Orthographic Sentence In order to investigate more closely how sentence delimitation is used and to what extent it corresponds to syntactic sentences, we analyze the length of orthographic sentences and the number of adjoined clauses. In Tables 4.32 and 4.33 we present the number of orthographic sentences and their length in number of words, along with the number of adjoined clauses and their frequency per 1,000 words. 
Table 4.32: Sentence Delimitation in the Sub-Corpora

                      ORTHOGRAPHIC   ORTHOGRAPHIC      ADJOINED   ADJOINED CLAUSES /
CORPUS                SENTENCES      SENTENCE LENGTH   CLAUSES    1,000 WORDS
Deserted Village          422            18.0             298          39.3
Climbing Fireman          408            11.0              75          16.6
Frog Story                536             9.2              70          14.3
Spencer Narrative         313            17.5              98          17.9
Spencer Expository        392            18.7              73          10.0
TOTAL                   2 071            14.4             614          20.6

Table 4.33: Sentence Delimitation by Age

                      ORTHOGRAPHIC   ORTHOGRAPHIC      ADJOINED   ADJOINED CLAUSES /
AGE                   SENTENCES      SENTENCE LENGTH   CLAUSES    1,000 WORDS
9-years                   476            14.4             216          31.6
10-years                  487            14.0             122          17.8
11-years                  651            12.3             210          26.2
13-years                  457            17.8              66           8.1
TOTAL                   2 071            14.4             614          20.6

The average length of an orthographic sentence was 14.4 words. The shortest sentences are found in the Frog Story and Climbing Fireman corpora. Among the age groups, orthographic sentence length is very similar; only the 13-year-olds have a greater average length of orthographic sentences. Although this measure does not reveal anything about what units are actually delimited, there seems to be a tendency for mean sentence length to increase with age. Additional analysis is needed to reveal whether the increase in length of orthographic sentences with age is because children become worse at delimiting sentences or because their sentences have a more complex structure (presumably the latter). In comparison, the younger primary school children in the study by Ledin (1998, p.21) showed a similar orthographic sentence length, 12.9 words, although the older children averaged 10.0 words, which contradicts the hypothesis. The orthographic sentence length for adults in the Hultman and Westman (1977) study averaged 14.7 words,16 whereas secondary-level students had longer sentences, with an average of 16.8 words.

The frequency of adjoined clauses reflects how often (main) sentences are joined and sheds more light on the nature of text delimitation.
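Counts of the kind reported in Tables 4.32 and 4.33 presuppose a segmentation of each text into orthographic sentences. A minimal sketch of such a segmenter, under the simplifying assumption that only the major delimiters (., !, ?) signal a boundary — deliberately ignoring the capital-letter cue and the missing or misplaced delimiters discussed above, which the actual analysis had to handle manually:

```python
import re


def orthographic_sentences(text: str) -> list[str]:
    """Split a text at major delimiters (., !, ?). A following lower-case
    letter is tolerated, in line with the relaxed definition of the
    orthographic sentence used in the text. Heuristic sketch only."""
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def mean_sentence_length(text: str) -> float:
    """Mean orthographic sentence length in words, as in Tables 4.32-4.33."""
    sents = orthographic_sentences(text)
    return sum(len(s.split()) for s in sents) / len(sents)
```

For example, `mean_sentence_length("Hej på dig. Det var en gång en pojke som hette Urban.")` yields 6.0 (two sentences of 3 and 9 words). A text wholly without delimiters, like example (4.78), would come out as a single long “sentence”, which is exactly why such counts must be interpreted with care for this population.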
A common hypothesis is that adjoined clauses become less frequent with age, often being considered a phenomenon related to primary school writers (Ledin, 1998). This seems to hold for our data too. The 13-year-olds in the present study had four times fewer adjoined clauses per 1,000 words than the 9-year-olds. The other two age groups also put quite a large number of clauses together without conjunctions. In the sub-corpora, adjoined clauses are four times more frequent in the hand-written texts of Deserted Village than in the Spencer Expository corpus. The average value is 20 adjoined clauses per 1,000 words in the whole corpus. In comparison, the younger primary school children in the Ledin (1998, p.25) study had 10.2 adjoined sentences per 1,000 words overall, but 28.9 in narrative writing. The older children had on average 8.2 adjoined sentences per 1,000 words. In a study by Näslund (1981) (reported in Ledin (1998)), final-year primary school children had on average 9.0 adjoined sentences per 1,000 words and upper secondary students 5.1.

Not surprisingly, the analysis showed that sentence length increases with age, whereas the number of adjoined clauses decreases with age. Although the analysis did not identify what other units are marked, it indicates clearly that the younger children more often join sentences together into larger units.

16 The average value is based on the orthographic sentence length of adult texts in five genres (see Hultman and Westman, 1977, p.223).

4.6.4 Punctuation Errors

Errors related to the use of major delimiters, summarized in Tables 4.34 and 4.35, concern omission of sentence boundaries (Omission), extra delimiters (Insertion) in front of a subordinate clause or a conjunction, and periods placed in lists and adjective phrases or at other syntactically incorrect places in a sentence.
Table 4.34: Major Delimiter Errors in Sub-Corpora

                                    Deserted   Climbing   Frog    Spencer     Spencer
ERROR TYPE                          Village    Fireman    Story   Narrative   Expository   TOTAL     %
Omission                               310        75       116      109          82         692    92.6
Insertion in front of a subclause        9         9        12        1          16          47     6.3
Insertion, other                         4         2         1        -           1           8     1.1
TOTAL                                  323        86       129      110          99         747

Table 4.35: Major Delimiter Errors by Age

ERROR TYPE                          9-years   10-years   11-years   13-years   TOTAL     %
Omission                               264       134        220         74      692    92.4
Insertion in front of a subclause       16         6         12         15       49     6.5
Insertion, other                         2         1          4          1        8     1.1
TOTAL                                  282       141        236         90      749

The most common error is the omission of sentence end-markers, often in the case of adjoined clauses. In (4.81) we see an example of a period inserted between a subordinate clause and its main clause:

(4.81) Medan Oliver sprang ∗. Hade Erik vekt en uggla som nu jagade honom.
       while Oliver ran had Erik woken an owl that now chased him
       – While Oliver ran, Erik had woken up an owl that now chased him.

Some cases of a period being placed in enumerations occurred, as in (4.82):

(4.82) Där nere i det höga gräset låg Dalmatinen Tess ∗. Grisen kalle knorr Hammstern Hilde ∗. ödlan Graffitti katten fillipa och ...
       there down in the high grass lay the-dalmatian Tess the-pig Kalle Knorr the-hamster Hilde the-lizard Graffitti the-cat Fillipa and
       – Down there in the high grass lay the Dalmatian Tess, the pig Kalle Knorr, the hamster Hilde, the lizard Graffitti, the cat Fillipa and ...

Further, the erroneous use of commas was analyzed, but only when syntactic violations occurred or when commas were omitted in enumerations. Commas were, in general, very rare and, when used, were often misplaced.
Commas occurred in front of a conjunction in an enumeration in (4.83):

(4.83) De hade med sig: ett spritkök, ett tält ∗, och massa mat, några kulgevär ∗, och ammunition m.m
       they had with themselves a spirit-stove a tent and a-lot-of food some rifles and ammunition etc
       – They had with them a spirit-stove, a tent and lots of food, some rifles and ammunition, etc.

In some instances a comma was placed in front of a finite verb:

(4.84) Linda ∗, brukade ofta vara i stallet.
       Linda used-to often be in the-stable
       – Linda often used to be in the stable.

Often a comma was used where one would expect a full stop:

(4.85) Nasse kunde inte sova ∗, plötsligt hörde Nasse nån som öppnade dörren.
       Nasse could not sleep suddenly heard Nasse someone that opened the-door
       – Nasse could not sleep. Suddenly Nasse heard someone open the door.

Error frequencies are summarized in Tables 4.36 and 4.37 below. Error types include a missing comma in enumerations or adjective phrases (Omission), an extra comma in front of a conjunction, in an enumeration or in other cases (Insertion), and commas used instead of a major delimiter to mark a sentence boundary (Substitution).

Table 4.36: Comma Errors in Sub-Corpora

               Deserted   Climbing   Frog    Spencer     Spencer
ERROR TYPE     Village    Fireman    Story   Narrative   Expository   TOTAL     %
Omission          41          2       10        3            3          59    33.5
Insertion          5         13        1        4            7          30    17.0
Substitution       5         22        2       30           28          87    49.4
TOTAL             51         37       13       37           38         176

Table 4.37: Comma Errors by Age

ERROR TYPE     9-years   10-years   11-years   13-years   TOTAL     %
Omission          22          5         28          4       59    33.5
Insertion         12          8          5          5       30    17.0
Substitution      16         15         12         44       87    49.4
TOTAL             50         28         45         53      176

Overall, commas were mostly placed at sentence boundaries or were omitted. In the Deserted Village corpus commas were mostly omitted, whereas in the other texts they were often used to mark a sentence boundary. The 9-year-olds and 11-year-olds tend to omit commas, whereas the 13-year-olds use commas mostly to mark sentence boundaries.
4.6.5 Summary The delimitation of text varies both by age and corpora and indicates clearly that, especially younger children, often join clauses into larger units. Orthographically, the 13-year olds form the longest units with the smallest number of adjoined clauses. Most adjoined clauses occur among the youngest group, 9-year olds, and in the hand-written corpus of Deserted Village. The erroneous use of major delimiters is mostly represented by omission or insertion in front of subordinate clauses, lists, etc. Commas are mostly missing or used to mark sentence boundaries. 4.7 Conclusions All the grammar errors that were expected as “typical” for Swedish writers, including noun phrase agreement, predicative complement agreement, verb form and the choice of prepositions in idiomatic expressions, are represented in Child Data, but not all are very frequent. Especially frequent are errors in verb form, mostly in the finite main verb (other verb form errors were much less frequent). Errors in predicative complement agreement are not very common, whereas noun phrase agreement errors are more frequent. Erroneous choice of preposition is included in the category of word choice errors, represented by ten occurrences. More characteristic for this population are, besides the omission of tense-endings on finite verbs, errors in omission of obligatory constituents in sentences and word choice errors. Some impact of spoken language on writing is reflected (again) in finite verb forms, pronoun forms and also some cases of dialect forms within noun phrase. 108 Chapter 4. Comparison with grammar errors in other studies shows, not surprisingly, most similarities with the writing of primary school children. In comparison to adult writers, there are differences both in how frequent errors are and in error distribution. Grammar errors in Child Data are much more frequent than among adult writers, with approximately 5 to 8 errors for children and 1 error for adults per 1,000 words. 
Errors in verb form, noun phrase agreement, missing or redundant words and choice of preposition are the most common error types for all populations, including the Child Data population. The difference lies in the error frequency distribution. A closer look at the different sub-types of the verb form category shows that the discrepancy is due to the frequent dropping of tense-endings on finite verbs in the Child Data. Such errors are not very common in the newspaper articles of the Scarrie corpus, where errors in verbs after an auxiliary verb are the most common verb error. The grammar error profile of Child Data and its comparison with adult writers suggests then not only inclusion of the four central grammar error types in a grammar checker for primary school writers, but the treatment of errors in finite verb form in particular. Another observation more related to error correction is that in many cases more than one solution is possible, a fact exemplified in the analysis. Also, at the lexical level spoken forms are common. The spelling of many word forms indicate confusion as to what form should be used in which context. Among real words, misspelled words were most common, followed by splits that were more common in general as real words. Run-ons as real words were very rare. The overall spelling error frequency seems to be representative for the age group. Errors in punctuation are mostly represented by omission, there are cases where marking is put at syntactically incorrect places. There was quite a high frequency of adjoined clauses, especially among the younger children, indicating that subjects join syntactic units to larger units and do not delimit text in (only) syntactic sentences. The analysis does not reveal what other larger units are selected instead, if any. 
On the other hand, this observation clearly indicates that a grammar checker cannot rely on sentence marking conventions and consider capitals or sentence delimiters as real markings of the beginning or end of a syntactic sentence. We should be aware that marking of sentence boundaries might be omitted in texts written by children, or even misplaced. The following conclusions can then be drawn from the analysis of Child Data for further work on the development of a grammar error detector for primary school children:

• include at least detection of errors in verb form (especially the finite verb), agreement in the noun phrase, redundancy and missing constituents, and some word choice errors (such as the use of prepositions),
• be aware that there may be more than one solution for correcting an error,
• do not rely on the use of capitals or sentence delimiters as indicators of syntactic sentence boundaries; rather, be aware that sentence marking can be missing or misplaced and several (main) clauses can be joined together.

Part II
Grammar Checking

Chapter 5
Error Detection and Previous Systems

5.1 Introduction

Constructing a system that will provide the user with grammar checking requires not only analysis of what error types are to be expected, but also an understanding of what possibilities there are to detect and correct an error. In the previous chapter, an analysis was presented of the grammar errors found in texts written by children, and the central errors for this group of users were identified. The purpose of this chapter is to explore the second requirement and analyze the errors in terms of how they can be detected. The questions that arise are: What errors can be detected by means of syntactic analysis, and which require other levels of analysis? How much of the text needs to be examined in order to find a given error? Can it be traced within a sequence of two or three words, a clause, a sentence or a wider context?
I will also investigate available technologies and establish: What grammar errors are covered by the current Swedish grammar checkers? Where do they succeed and where do they fail on Child Data? The chapter starts with a description of the requirements and functionalities of a grammar checker and the performance it has to achieve (Section 5.2), followed by an analysis of the possibilities for detecting the errors in Child Data (Section 5.3). Then some grammar checking systems are described, paying special attention to Swedish tools (Section 5.4), followed by a performance test of the Swedish systems on Child Data (Section 5.5). Conclusions are presented in the last section (Section 5.6).

5.2 What Is a Grammar Checker?

5.2.1 Spelling vs. Grammar Checking

Writing aids for spelling, hyphenation, or grammar and style are part of today's authoring software. Spelling and hyphenation modules were the first proofing tools developed. They are traditionally built to handle errors in single isolated words. Grammar checkers are a fairly new technology, not only aiming at syntactic correction, as one would expect from their name, but often also including correction of graphical conventions and style, such as punctuation, word capitalization, number and date formatting, word choice and idiomatic expressions. Thus, whereas a spelling checker detects and handles errors at word level, all detection of errors that depends on the surrounding context has been moved up to the level of grammar checking (cf. Arppe, 2000; Sågvall Hein, 1998a).1 Proofing tools exist either as separate modules, developed by different companies, that can be attached to an editor (e.g. Microsoft proofing tools are delivered by different suppliers), or as spelling and grammar checkers integrated into a single system (see further in Section 5.4).
5.2.2 Functionality

Proofing tools, in general, give those involved in the process of writing support in the rather tedious, time-consuming stage of revision (or rewriting),2 and are helpful in finding the types of errors humans easily overlook (cf. Vosse, 1994). Their functionality can be defined in terms of detection, diagnosis and correction (or suggestion for correction) of errors. Identifying incorrect words and phrases is the most obvious task of a grammar checker. The position of an error in the text can be located either by marking exactly the area where the error is, or by marking the error together with its surrounding context (e.g. marking only the erroneous noun vs. marking the whole noun phrase). Detection of an error can be enough feedback to the user, if the user understands what went wrong. Diagnosis of the error is important when the user needs an explanation, especially if the tool handles several related error types. In the long run, diagnosis is of real use to every user in order to promote understanding of the error marked (see Domeij, 1996; Knutsson, 2001). Finally, presenting one (or more) suggestions for revision of the error can enhance a user's understanding of the problem in addition to providing an easy way to correct the error.

1 Proofing tools that correct style and graphical conventions but offer no syntactic correction also exist (cf. Domeij, 2003, p.14).
2 Recall that editing activities on a computer usually occur during the whole process of writing and not only at the end. The writer may switch several times between writing phases, see Section 2.3.2.

The functionalities of such a system must be achieved with high precision. Systems should not mark correct strings as incorrect. A system that detects many errors but also marks a large amount of correct text as erroneous can be regarded more negatively by a user than a system that detects fewer errors but makes fewer false predictions (cf.
Birn, 2000).

5.2.3 Performance Measures and Their Interpretation

Performance Measures

Within the field of information extraction and information retrieval, the measures of recall, precision and F-value have been developed for measuring the effectiveness of algorithms (van Rijsbergen, 1979). Recall measures the proportion of targeted items that are actually extracted by a system, also referred to as coverage. Precision measures the proportion of correctly extracted information, also referred to as accuracy. The overall performance of a system can be measured by the F-value, a balanced (harmonic) mean of recall and precision. When recall and precision have approximately the same value, the F-value coincides with their mean. The main attributes by which the performance of a grammar checker is evaluated are likewise related to its effectiveness and functionality. The attributes for the evaluation of writing tools have been discussed and developed within the TEMAA project (A Testbed Study of Evaluation Methodologies: Authoring Aids) (Manzi et al., 1996) and the EAGLES project (Expert Advisory Group on Language Engineering Standards) (EAGLES, 1996), with respect to a product's design specifications and user requirements. They consist of recall, which in this case estimates how many of the targeted errors are actually detected by the system (i.e. grammatical coverage), and precision, which measures the proportion of real errors among those detected and reveals how good a system is at avoiding false alarms (i.e. flagging accuracy). The higher the coverage and accuracy of the system, the better. A third attribute of proofing tools concerns suggestion adequacy, which relates to the system's suggestions for correction. These validation parameters usually vary depending on the system's own strategies (Paggio and Underwood, 1998; Paggio and Music, 1998).
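The relationship between these measures can be made concrete with a short computation; the counts below are hypothetical and serve only to illustrate the definitions:

```python
def precision_recall_f(flagged_correctly, flagged_total, errors_total):
    """Flagging accuracy (precision), coverage (recall) and their
    harmonic mean (F-value) for a grammar checker's output."""
    precision = flagged_correctly / flagged_total
    recall = flagged_correctly / errors_total
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value

# Hypothetical run: 40 real errors flagged, 50 flags in total,
# 80 real errors present in the text.
p, r, f = precision_recall_f(40, 50, 80)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
# precision=0.80 recall=0.50 F=0.62
```

Note that with equal precision and recall (say, both 0.5) the F-value equals that common value, matching the remark above.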
The exact definitions of the evaluation measures used in the present study are presented in Section 5.5.

Methods and Interpretation of Evaluation

Besides the above-mentioned measures, the whole method of evaluation and the interpretation of results are important. A system's performance can be evaluated against an error corpus consisting of a collection of (sentence) samples with the errors targeted by the system (e.g. Domeij and Knutsson, 1999; Paggio and Music, 1998). More recently, tests with text corpora have also been made (e.g. Birn, 2000; Knutsson, 2001; Sågvall Hein et al., 1999), which contain both erroneous (ungrammatical) and correct (grammatical) word sequences. The capability of a system to handle correct text is better tested with the latter method, where the proportion of grammatical text is higher. At least three factors may influence the outcome of an evaluation of a system's performance: the kinds of syntactic constructions present in the evaluation sample, the number of errors in them, and who the writer was (beginner, student, professional, second language learner, etc.). That is, different text genres and different degrees of writing skill may display different ranges of syntactic constructions, which in turn influence how likely each error type is to occur. The size of the corpus needed for evaluation can depend on the error frequency in a writing population or on the type of error evaluated. As discussed in Section 4.4, adults in the analyzed corpora made on average one grammatical error per 1,000 words. In order to cover a satisfactory quantity of syntactic constructions and errors in them, the evaluation corpus must be quite large. Grammar errors in the children's corpus are on average eight times more common than for adults, which means that a smaller corpus will probably be sufficient for evaluating this population.
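As a rough illustration of the corpus-size argument, using the rates just cited (about one error per 1,000 words for adults, roughly eight times that for the children), the corpus needed to observe a given number of errors shrinks proportionally. This is a back-of-the-envelope sketch; the assumption that errors are spread evenly through the text is mine:

```python
def words_needed(target_errors, errors_per_1000_words):
    """Approximate corpus size (in words) needed to observe a given
    number of errors, assuming errors are spread evenly through the text."""
    return int(target_errors * 1000 / errors_per_1000_words)

# To observe roughly 100 errors of some type:
adult = words_needed(100, 1.0)  # adult rate: ~1 error / 1,000 words
child = words_needed(100, 8.0)  # child rate: ~8 errors / 1,000 words
print(adult, child)
# 100000 12500
```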
Thus, different populations of writers can have different requirements on what is needed for evaluation. Similarly, the frequency of different error types varies in general, in that some error types are more common than others. For instance, a larger corpus is probably needed to cover errors in word order than errors in noun phrase agreement, which are in general more frequent. The method used and the factors that may influence the outcome of an evaluation have to be taken into consideration when interpreting results, especially in a comparison between systems. The evaluated text genre, the size of the corpus, the error types and the nature of the writers should be related.

5.3 Possibilities for Error Detection

5.3.1 Introduction

Current grammar checking systems are restricted to a small set of all possible writing errors. Not all possible syntactic structures are covered, and many errors above the single word level cannot be found without semantic or even discourse interpretation (cf. Arppe, 2000). In this section I discuss which errors in Child Data can be found by means of syntactic analysis and which require higher levels of analysis, such as semantics or discourse analysis. If syntactic analysis is sufficient, an examination follows of how much context is required for detection and, further, whether the error can be identified locally by analysis restricted to short word sequences (i.e. partial parsing) or whether analysis of complete clauses and/or sentences is necessary (i.e. full parsing). The different error types will be divided in accordance with both previous methods of classification (see Section 3.3.3) and the error taxonomy that was used to distinguish real word spelling errors from grammar errors (see Section 3.3.4).
That is, errors will be divided according to whether they are structural errors, violating the syntactic structure of a clause, or non-structural errors, concerning feature mismatch; whether they form new lemmas or other forms of the same lemma; and finally whether words are omitted, inserted, substituted or transposed. Further, the violation types will be considered in relation to the means required to detect them. A previous analysis of this kind was provided within the Scarrie project, with the assumption that, in general, partial parsing can be used to handle non-structural errors whereas other methods should be applied for structural errors (in the Scarrie project local error rules were used). They also identified error types that could not be handled by either of those two methods. The study further reports on a problem with this division, since many errors could be handled by both methods (see Wedbjer Rambell, 1999c). The discussion in the analysis below is brief, referring to previously discussed examples in the analysis of errors in Chapter 4 or directly to the index numbers in the error corpora presented in Appendix B. The section concludes with a summary of detection possibilities for the errors in Child Data. The summary will serve as a specification for the final part of the implementation described in Chapter 6.

5.3.2 The Means for Detection

Agreement in Noun Phrases

Detection of agreement errors in noun phrases requires a context of precisely the noun phrase, and such errors can thus in general be detected by noun phrase parsing. All noun phrase errors are non-structural, and in Child Data they are concentrated in one constituent realized as another form of the intended lemma. Syntactically, most of the noun phrases follow one of the three noun phrase types (see Section 4.3.1) and three cases are in the partitive form.
The feature sets have to include, besides definiteness, number and grammatical gender, also the semantic masculine gender of adjectives. In this case, not only agreement with the noun has to be fulfilled, but also requirements on consistent use. That is, in one case (G1.2.3; see (4.9) on p.49) a (masculine) noun is modified by two adjectives, where one of them has the masculine weak form and the other the common gender weak form. Both adjectives should follow one of the patterns, i.e. semantic or grammatical gender. Further, the feature mismatch in partitive noun phrases concerns not only the agreement between the quantifier and the noun, but also the number of the head noun (e.g. G1.3.2; see (4.11) on p.50). Another important thing to bear in mind is the correct interpretation of spelling variants. For instance, the errors in G1.2.2 and G1.2.4 (see (4.8) on p.48) include the determiner de 'the [pl]' spelled as the allowed variant dom, which in turn is homonymous with the noun dom 'judgment/verdict'. It is important that the lexicon of the system contains this information.

Agreement in Predicative Complement

In order to detect errors in agreement between the subject or object of a sentence and its complement, a context larger than a noun phrase is required. The errors are non-structural, realized as other forms of the same lemma, and can still be handled by partial parsing identifying the parts that have to agree, i.e. the noun phrase, the verb types used in such constructions and the modifying adjective phrase. In Child Data, these errors concern agreement mismatch between the subject and an adjective or participle as the predicative complement. Syntactically, many of the subject noun phrases include embedded clauses (often with other predicates) that increase the complexity and the distance between the subject and the predicative complement, and probably require more elaborate analysis.
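The feature-matching idea behind partial parsing of noun phrases can be sketched in a few lines. The toy lexicon, tags and feature values below are invented for illustration and are not the actual system's; keeping all lexical readings of each word rather than disambiguating first (as discussed for the system's lexical level) is mimicked here by trying every combination of readings:

```python
from itertools import product

# Toy lexicon: word form -> list of readings (pos, gender, number, definiteness).
# Feature values are illustrative, e.g. "com" = common gender, "neu" = neuter.
LEXICON = {
    "en":    [("det", "com", "sg", "indef")],
    "ett":   [("det", "neu", "sg", "indef")],
    "liten": [("adj", "com", "sg", "indef")],
    "litet": [("adj", "neu", "sg", "indef")],
    "hus":   [("noun", "neu", "sg", "indef"), ("noun", "neu", "pl", "indef")],
    "bil":   [("noun", "com", "sg", "indef")],
}

def np_agrees(words):
    """True if some combination of readings gives all words the same
    gender, number and definiteness (no prior tag disambiguation)."""
    for reading in product(*(LEXICON[w] for w in words)):
        features = {r[1:] for r in reading}  # drop the pos field
        if len(features) == 1:               # one consistent feature bundle
            return True
    return False

print(np_agrees(["ett", "litet", "hus"]))  # True: all readings share neu/sg/indef
print(np_agrees(["en", "litet", "hus"]))   # False: gender mismatch en/litet
```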
Further, in G2.2.3 (see (4.13) on p.51) several predicative complements are coordinated; detecting all of them requires analysis of coordination. Finally, we have the case of G2.2.6 (see (5.1) below), where the head noun syskon is ambiguous between the singular reading 'sibling [sg]' and the plural reading 'siblings [pl]', which complicates analysis.

(5.1) (G2.2.6)
nasse är en gris som har massor av syskon. nasse är skär. Men nasses syskon är ∗smutsig
Nasse is a pig that has lots of siblings [pl] Nasse is pink but Nasse's sibling [neu,sg/pl] is/are dirty [com,sg]
– Nasse is a pig that has a lot of brothers and sisters. Nasse is pink. But Nasse's brothers and sisters are dirty.

Identifying the subject, the copula verb and the adjective, syskon är smutsig, is enough to signal that an error in predicative complement agreement has occurred. However, the diagnosis can fail if the noun is interpreted only as singular. The tool would then signal that a mismatch in gender occurred, suggesting a form change in the adjective to smutsigt 'dirty [neu,sg]'. But if the author refers to massor av syskon 'lots of siblings', then the noun should be interpreted as plural, and the checker should instead indicate a number mismatch and suggest the plural form smutsiga 'dirty [pl]'. In any case, the soundest solution is to suggest both corrections, due to the ambiguous nature of the noun, and let the user decide.

Definiteness in Single Nouns

Definiteness errors in single nouns in Child Data are represented by bare singular nouns (e.g. ö 'island' in (5.2a)) that lack the definite suffix (i.e. ön 'island [def]') and form another form of the intended lemma (see also (4.17) on p.53). Considered then as non-structural errors, they could be detected by means of partial parsing. Marking bare singular nouns as ungrammatical can also be helpful for finding instances where, instead of a missing suffix, the indefinite article is missing.
That is, if the noun phrase in the first sentence in (5.2a) were represented only as in (5.2b) (such errors were not found in Child Data). However, cases where bare singular nouns are grammatical exist.3

(5.2) (G3.1.3)
a. Jag såg en ö. Vi gick till ∗ö.
   I saw an island we went to island [indef]
   – I saw an island. We went to island.
b. Jag såg ∗ö.
   I saw island [indef]
   – I saw island.

3 Bare singular nouns can be grammatical in one context (e.g. ha bil 'have car') and ungrammatical in another (e.g. se ∗bil 'see car'), see further in Section 4.3.3.

In order to decide whether a bare singular noun is ungrammatical due to omission of the article or the noun suffix, or whether it is grammatical, a context wider than a sentence is needed, in addition to some kind of lexical or semantic analysis, in order to see whether the noun was introduced or specified earlier, or whether the construction is grammatical (i.e. lexicalized).

Pronoun Case

Pronoun case errors in Child Data concern the accusative case of pronouns and are realized as other forms of the same lemma; that is, the nominative case form is used instead of the accusative. These errors concern feature mismatch and are classified as non-structural errors. However, exactly as in the case of agreement errors in predicative complement, a more complex syntactic analysis is required to identify the requirements on certain positions in a clause. One clue for identifying these could be a preposition preceding the pronoun, which would then require only partial parsing. Three such errors in Child Data consist of a nominative pronoun preceded by a preposition (e.g. G4.1.5; see (4.18) on p.53).

Verb Errors

Errors in verb form can be located directly at the verbal core, consisting of one single finite verb, a sequence of two or more verbs, or a verb preceded by an infinitive marker. They can be both structural (an auxiliary verb is missing) and non-structural (another form of the verb was used).
All verb errors should be detectable by means of partial parsing. Optional constituents such as adverbs, noun phrases, and coordination of verbs should be taken into consideration. The errors in finite verb form found in Child Data in many cases coincide with the imperative form of these verbs (see e.g. G5.2.45 in (4.26) on p.58). The imperative as a finite verb form should be distinguished from the infinitive verb form in order to be able to detect such errors in finite verbs. Errors in verbal chains are represented in Child Data by two finite verbs in a row (e.g. ska blir 'will [pres] become [pres]'; (4.32) on p.61), in one case with the embedded infinitive as secondary future perfect (i.e. skulle ha kom 'would [pret] have [inf] came [pret]'; (4.31) on p.60). They also occur as a bare supine in a main clause, lacking the auxiliary verb (e.g. G6.2.2; see (4.33) on p.61). All such errors can be detected by parsing just the verbal cluster. In the case of missing auxiliary verbs, the crucial point is to be sure that the omission occurs in a main clause, which requires identification of the type of clause. Errors in infinitive phrases concern infinitive markers followed by a verb in finite form (e.g. att stäng 'to close [imp]'; (4.34) on p.62), or a missing infinitive marker with the auxiliary verb komma 'will' (e.g. G7.2.3; see (4.36) on p.62). Both these error types can be located by partial parsing. In the case of an omitted infinitive marker in the context of the auxiliary verb komma 'will', it is important not to confuse it with the main verb komma 'come'.

Word order

All word order errors are structural errors, involving transposition of sentence constituents. In general, detection of word order errors requires identification of the main verb and analysis either of the preceding or following constituents, which in turn requires identification of the beginning and end of a sentence.
In theory, some errors in the placement of adverbials can be traced by partial parsing, for instance in certain subordinate clauses. In Child Data, punctuation and capitalization conventions are often not followed and sentences may be joined together (see Section 4.6). This means that word order analysis cannot completely rely on such conventions until we find some way to locate sentence boundaries. In addition, the word order errors found in Child Data are rather complex, involving for instance more than one initial constituent before the finite verb in a main clause (see Section 4.3.6). This means that the possibility of success in locating word order errors in Child Data by such simple techniques as partial parsing is minimal.

Redundancy

Redundancy errors also represent structural errors, manifested as insertions of superfluous constituents into sentences. Immediate repetition of words (e.g. G9.1.3; see (4.38) on p.64) should be possible to detect by means of partial parsing. Occurrences of repeated constituents at different places in a given sentence (e.g. G9.1.7; see (4.39) on p.65) would require analysis of the complement structure, often of the whole sentence. The same applies to new constituents being inserted (e.g. G9.2.2; see (4.41) on p.66).

Missing Constituents

Sentences lacking a constituent also represent structural errors. Some of them may be detected by partial parsing, but most require more complex analysis. Among the errors in Child Data, discovering a missing subject or object would require analysis of the complement structure of the main verb, which means that such information must be stored somewhere (e.g. in the lexicon of the system). Finding an omission of a finite verb requires a search for a finite verb in a sentence, assuming that it is not an exclamation, a title, or another construction without finite verbs. Finding omissions of particles or prepositions requires knowledge of the verbs' sub-categorization frames, or of the structure of fixed expressions. Other types require not only syntactic analysis but also semantics and/or world knowledge, as in (5.3), where the negation on the main verb is missing.

(5.3) (G10.5.1)
a. tuni hade jätte ont i knät men hon ville — sluta för det.
   Tuni had great pain in knee but she wanted — stop for that
   – Tuni had much pain in her knee, but she did not want to stop because of that.
b. men hon ville inte sluta för det.
   but she wanted not stop for that

Word Choice

Word choice errors, as substitutions of constituents, also represent structural errors. These errors are realized as completely new words with a meaning distinct from the intended one, i.e. new lemmas. Some of them can probably be solved by storing, for instance, information on the use of particles and prepositions with certain verbs (e.g. G11.1.2; see (4.48) on p.68), or on word usage in fixed expressions (e.g. G11.1.7; see (4.47) on p.68), in the dictionary. Others will probably require analysis of semantics or even world knowledge before they can be detected, as in (5.4).

(5.4) (G11.6.3)
a. Jag tittade på Virginia som torkade av sin näsa som var blodig på tröjarmen.
   I looked at Virginia that wiped off her nose that was bloody on jumper-arm
   – I looked at Virginia who wiped her bloody nose on the sleeve of her jumper.
b. tröjärmen
   jumper-sleeve

Reference

Reference errors concern structural violations as substitutions of constituents, realized as new lemmas. All the errors in Child Data concerned anaphoric reference. Reference errors in general are discourse oriented. Anaphoric reference requires identification of the antecedent that agrees with the subsequent pronoun. The antecedent may be in the preceding sentence, but it could also be farther away. Partial parsing techniques can probably be used for identifying antecedents. The crucial problem is how far back in the discourse to search for antecedents.
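Returning briefly to the verb errors discussed earlier: several of the verb-core patterns (two finite verbs in a row, a finite verb directly after an infinitive marker) reduce to checks over short tag sequences. The sketch below is schematic; the tag names and the pre-tagged input are assumptions for illustration, not the system's actual representation:

```python
# Schematic detection of two verb-core error patterns over a
# pre-tagged token list (word, tag). Tags are illustrative.

def verb_core_errors(tagged):
    """Flag (i) two finite verbs in a row, e.g. 'ska blir', and
    (ii) a finite verb directly after the infinitive marker 'att'."""
    errors = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        if t1 == "verb.fin" and t2 == "verb.fin":
            errors.append(("finite+finite", w1, w2))
        if (w1, t1) == ("att", "inf.marker") and t2 == "verb.fin":
            errors.append(("inf.marker+finite", w1, w2))
    return errors

# e.g. ska blir 'will [pres] become [pres]':
print(verb_core_errors([("ska", "verb.fin"), ("blir", "verb.fin")]))
# [('finite+finite', 'ska', 'blir')]
```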
Real Word Spelling Errors

Spelling errors resulting in existing words always form new lemmas; that is, they are realized as completely new words. They mostly violate structural requirements as substitutions of constituents, but can also accidentally cause non-structural violations, for instance agreement errors in noun phrases. The majority of such misspellings slip through any syntactic analysis, resulting in syntactically correct strings. For instance, an error resulting in a word of the same part of speech as the intended word, as in (5.5a), will be very hard to track down without any semantic information. In this example, the word as written coincides not only with the part of speech of the intended word but also with the intended inflection. The intended word is presented in (5.5b):

(5.5) (M1.1.33)
a. den här gamla ∗manen har tagit hand om oss.
   the [def] here old mane [def] has taken hand about us
   – This old man took care of us.
b. mannen
   man [def]

Moreover, words resulting in other parts of speech are also hard to trace syntactically. In (5.6a), a pronoun becoming a verb in the supine form will not be detected without an additional level of analysis, because the form of the verb that follows the preceding auxiliary verbs is syntactically correct:

(5.6) (M3.3.10)
a. den killen eller tjejen måste ha ∗nått problem
   the boy [def] or girl [def] must have reached [sup] problem
   – the boy or girl must have some problem
b. nåt
   some

Only a few real word spelling errors in Child Data cause syntactic violations and can to some extent be detected by means of syntactic analysis. Here is an example of a pronoun realized as a noun, subsequently forming a noun phrase with an agreement error in gender and definiteness:

(5.7) (M2.2.3)
a. det här brevet är det ∗ända jag kan ge dig idag.
   the here letter is the [neu,def] end [com,indef] I can give you today
   – This letter is the only one I can give you today.
b.
det enda
the only

Here is an example of a pronoun becoming a verb, where as a consequence three verbs in a row are found in a sentence: the two correctly spelled verbs form a grammatical verb cluster, and the misspelled pronoun then adds a passive past verb form (5.8a). In this case, the feature structure of the verb cluster is violated and the error can be detected by partial parsing.

(5.8) (M3.3.8)
a. jag fick låna ∗hanns mobiltelefon.
   I could borrow was-managed cell-phone
   – I could borrow his cell-phone.
b. hans
   his

In (5.9a), the predicate of the sentence forms a noun, and the error could be detected as a sentence lacking a finite verb:

(5.9) (M4.2.32)
a. då ∗ko min bror
   then cow my brother
   – then came my brother
b. kom
   came

Splits mostly violate complement conditions. For instance, in (5.10a) the split will be analyzed as two successive noun phrases:

(5.10) (S1.1.16)
a. En ∗brand man klättrade upp till oss.
   a fire man climbed up to us
   – A fire-man climbed up to us.
b. brandman
   fire-man

Splits can also violate agreement, as in (5.11a), where the first part of the split has a gender different from the second part, which results in the article (en 'a [com]') and the first part of the split (djur 'animal [neu]') not agreeing. The correct form is shown in (5.11b):

(5.11) (S1.1.28)
a. Desere jobbade i en ∗djur affär
   Desere worked in a [com] animal [neu] store [com]
   – Desere worked in an animal-store.
b. en djuraffär
   a [com] petshop [com]

Punctuation at Sentence Boundaries

Erroneous use of punctuation to mark sentence boundaries probably requires full parsing, or at least analysis of the complement structure following the main verb. For instance, in order to detect the missing boundary in (5.12a) (indicated by a dash), the system has to know that the verb gilla 'like' is transitive and thus combines with only one object and cannot also take the pronoun dom 'they' as a complement.
That is, just locating the arguments following the verb, marked in boldface in the example, with the diagnosis of too many complements, signals that something is wrong with the sentence. The correct form is presented in (5.12b).

(5.12)
a. Vissa i filmen gillade inte varann — dom bråkade och lämnade vissa utanför.
   some in the-movie liked not each-other — they quarrelled and left some outside
   – Some (people) in the movie did not like each other. They quarrelled and left some (people) out.
b. Vissa i filmen gillade inte varann. Dom bråkade och lämnade vissa utanför.
   some in the-movie liked not each-other they quarrelled and left some outside

5.3.3 Summary and Conclusion

In accordance with the above discussion, it is clear that only some errors in Child Data can be detected by partial syntactic analysis alone; most of the errors require a higher level of analysis, full parsing or even discourse analysis. The error types, their classification in accordance with the violations they cause, and comments on the possibility of detection are summarized in Table 5.1 below. Errors requiring only partial parsing for detection (in bold face in the table) concern (mostly) non-structural errors, including noun phrase agreement and verb form errors, and some structural errors such as omissions within a verb core. Further, some pronoun case errors, constrained for instance by preceding constituents (e.g. a preposition), could be traced by partial parsing. In addition, some word order errors would in general be possible to detect by means of partial parsing, but since those found in Child Data display rather high complexity, the possibility of detection is minimal without more elaborate analysis. Finally, repeated words could be detected by partial parsing (i.e. among the redundancy errors).
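The one redundancy pattern that plain partial parsing catches reliably, immediate word repetition, can be sketched in a few lines (illustrative only; the real system's treatment may differ):

```python
def repeated_words(tokens):
    """Return the positions where a word is immediately repeated,
    ignoring case (e.g. sentence-initial capitalization)."""
    return [i for i in range(len(tokens) - 1)
            if tokens[i].lower() == tokens[i + 1].lower()]

print(repeated_words("han gick gick hem".split()))  # [1]
```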
Table 5.1: Summary of Detection Possibilities in Child Data

ERROR TYPE                   | ERROR CLASS    | VIOLATION                | COMMENT
GRAMMAR ERRORS:              |                |                          |
Agreement in NP              | non-structural | substitution: other form | partial parsing
Agreement in PRED            | non-structural | substitution: other form | complex partial parsing
Definiteness in single nouns | non-structural | substitution: other form | partial parsing and discourse
                             | structural     | omission                 | partial parsing and discourse
Pronoun case                 | non-structural | substitution: other form | some by partial parsing OR complex partial parsing
Finite Verb Form             | non-structural | substitution: other form | partial parsing
Verb Form after Vaux         | non-structural | substitution: other form | partial parsing
Vaux Missing                 | structural     | omission                 | partial parsing
Verb Form after inf. marker  | non-structural | substitution: other form | partial parsing
Inf. marker Missing          | structural     | omission                 | partial parsing
Word order                   | structural     | transposition            | some by partial parsing
Redundancy                   | structural     | insertion                | some by partial parsing OR full parsing
Missing Constituents         | structural     | omission                 | at least complement structure
Word Choice                  | structural     | substitution: new lemma  | full parsing + semantics and world knowledge
Reference                    | structural     | substitution: new lemma  | discourse analysis
OTHER:                       |                |                          |
Real Word Spelling Errors    | structural     | substitution: new lemma  | full parsing + semantics and world knowledge
                             | non-structural | substitution: new lemma  | partial parsing
Missing Sentence Boundary    | structural     | omission                 | at least complement structure

Two of the non-structural error types require a more complex partial parsing (in italics in the table) and the specification of a larger context in order to be detected. These include agreement errors in predicative complement and pronoun case errors. Definiteness errors in single nouns could also in general be detected by partial parsing, but (probably) require discourse analysis in order to be diagnosed correctly.
The rest of the grammar errors are all structural and require at least analysis of complement structure, full parsing of sentences or even discourse analysis. In many cases, semantic and/or world-knowledge interpretation is also required. Among the real word spelling errors, very few can be traced by means of syntactic analysis alone. Most of them need semantics or even world knowledge in order to be identified. Missing sentence boundaries often cause syntactic violations in verb sub-categorization. In conclusion, this summary suggests that not only non-structural errors can be detected by means of partial parsing, but also some structural violations. The division depends rather on whether or not the error is confined to a certain delimited portion of text. For instance, some of the omission violations located in certain types of phrases can be detected by means of partial parsing (e.g. a missing auxiliary verb). The clearest candidates for detection by means of partial parsing are the agreement errors in noun phrases and the errors located in verbs (i.e. concerning verb form and omission of a verb or infinitive marker). These are also among the most frequent (central) errors in Child Data and invite implementation, as I will show in Chapter 6. Among the other most frequent error types in Child Data, redundant constituents in clauses can probably be detected only when words are repeated directly. Other types of extra constituents inserted into clauses, omissions of words or word choice errors are structural errors that require more complex analysis and cannot be detected by partial parsing alone.

5.4 Grammar Checking Systems

5.4.1 Introduction

After the analysis of the possibilities for detecting the errors in Child Data presented in the previous section, the question arises as to what error types are already covered by current technologies, and with what success.
As pointed out in Section 5.2, research and development of grammar checking techniques is rather recent, starting in the 1980s with products mainly for English4 but also for other languages, e.g. French (Chanod, 1993),5 Dutch (Vosse, 1994), Czech (Kirschner, 1994), Spanish and Greek (Bustamante and León, 1996). In the case of Swedish, the development of grammar checkers did not start until the latter half of the 1990s, with several independent projects. Grammatifix, developed by the Finnish company Lingsoft AB, was introduced on the Swedish market in November 1998, and since 2000 it has been part of the Swedish Microsoft Office package (Arppe, 2000; Birn, 2000). Granska is a grammar checking prototype being developed by the research group of the Department of Numerical Analysis and Computer Science (NADA) at the Royal Institute of Technology (KTH) in Stockholm (Carlberger and Kann, 1999). Another Swedish prototype was developed at the Department of Linguistics at Uppsala University, between 1996 and 1999, within the EU project Scarrie (Sågvall Hein, 1998a; Sågvall Hein et al., 1999).6

This section continues with a short review of methods and techniques used in some non-Swedish systems (Section 5.4.2). Then follows an overview of the Swedish approaches to grammar checking (Section 5.4.3) and a discussion of the techniques used in these systems, the error types covered and their reported performance (Section 5.4.4).

4 For instance, Perfect Grammar, integrated in Word for Windows 2.0 in late 1991, and Grammatik 5, part of WordPerfect for Windows 5.2 and Word for Mac 5.0 in 1992, were among the first on the market (see further in Vernon, 2000).
5 Vanneste (1994) compared the utilities of other French products: Grammatik (French), Hugo Plus and GramR.
6 Skribent (http://www.skribent.info/) and Plita (Domeij, 1996, 2003) are other proofing tools on the Swedish market that include detection of violations against graphical conventions and style, but no syntactic error detection.

5.4.2 Methods and Techniques in Some Previous Systems

Many of the grammar checking systems on the market are commercial products, and technical documentation is often minimal or even absent. One exception is the grammar checking system Critique (known until 1984 as Epistle) (Ravin, 1993; Richardson, 1993), developed within the Programming Language for Natural Language Processing (PLNLP) project (Jensen et al., 1993b).7 This project aimed at the development of a large-scale natural language processing system covering not only syntax, but also the various levels of semantics, discourse and pragmatics.8 During the project the PLNLP formalism was used in several domains of natural language applications. Besides the "text-critiquing" system, applications targeting for instance machine translation, sense disambiguation via on-line dictionaries, and analysis of conceptual structure in paragraphs as a "unit of thought" were developed. English was the main language, but languages such as Japanese, French, German, Italian and Portuguese were also involved (Jensen et al., 1993b). Critique is based on the English parser (PEG) of this system (Jensen, 1993), utilizing the PLNLP formalism of Augmented Phrase Structure Grammar (ACFG)9 (implemented in Lisp), and producing a complete analysis for all sentences (even ungrammatical ones) on the basis of the most likely parse (Heidorn, 1993). Thus, in order to be able to detect errors, the syntactic analysis in PEG was developed so that not only grammatical sentences but all sentences obtained an analysis. This was achieved by applying relaxation to rules when parsing failed on the first try, or by a parse fitting procedure identifying the head and its constituents (e.g. in fragments) (see further in Jensen, 1993; Jensen et al., 1993a; Ravin, 1993). The system targets about 25 grammar error types and 85 stylistic "weaknesses".
The grammar errors are divided into five error categories: number agreement, pronoun case, verb form, punctuation and confusion/contamination of expressions (Ravin, 1993, pp. 68-70). Critique was planned to be developed for other languages besides English, and a French version now exists (Chanod, 1993). The insight gained in the PLNLP project from providing an analysis of all sentences seems to have influenced other grammar formalisms such as Constraint Grammar (Karlsson et al., 1995) and Functional Dependency Grammar (Järvinen and Tapanainen, 1998). The methods of rule relaxation and parse fitting had an impact on the development of other (Swedish) grammar checking systems.

Another quite well documented and frequently cited project is the Dutch system CORRie (Vosse, 1994). It applies the same idea of analyzing ill-formed as well as well-formed sentences, using an augmented context-free grammar for that purpose. The system aimed primarily at correcting spelling errors resulting in other existing words, but also included analysis of misspellings, compounds, spelling of idiomatic expressions and hyphenation. CORRie's parser and its formalism inspired the development of the proofing tools developed in the Scarrie project (see below).

7 The development of Critique was done in collaboration with IBM and was later taken over by Microsoft. The tool is now used as a module for English grammar checking in Microsoft Word (cf. Jensen et al., 1993b; Domeij, 2003).
8 Mostly syntax and semantics are covered by the system, but approaches involving analysis of discourse and pragmatics have also been targeted.
9 The ACFG is considered more effective than a plain CFG, since features and restrictions on them can be associated directly with the corresponding categories/symbols, resulting in a considerably decreased number of rules.
5.4.3 Current Swedish Systems

There are at present three known proofing tools for Swedish aimed at syntactic error detection: Grammatifix, the grammar and style module that has been part of Swedish Microsoft Word since 2000; the Granska prototype under development at NADA, KTH; and the ScarCheck prototype developed at the Department of Linguistics at Uppsala University in the Scarrie project. For each system I describe below the architecture, the error types covered, the technique used for grammar checking (to the extent information is available) and the system's reported performance.

Grammatifix

Lingsoft's10 commercial product Grammatifix was introduced on the Swedish market in November 1998 and has since 2000 been part of Microsoft Word. Parts of this proofreading tool are based on research and technology from the 1980s, when work on a morphological surface parser started. The work on error detection rules began in 1997 (Arppe, 2000).

The lexical analysis in Grammatifix is based on the morphological analyzer SWETWOL, designed according to the principles of two-level morphology (Karlsson, 1992) and utilizing a lexicon of about 75,000 word types. At this non-disambiguated lexical-lookup stage, each word may obtain more than one reading. The part-of-speech assignment is to a large extent disambiguated at the next level of analysis, by application of the Swedish Constraint Grammar (SWECG) (Birn, 1998),11 a surface-syntactic parser applying context-sensitive disambiguation rules (Arppe et al., 1998). As Birn (2000) points out, full disambiguation is not a goal, since the targeted text contains grammar errors. Errors are detected by partial parsing: the tags @ERR and @OK are attached to all strings, and error detection rules, defined in the same manner as the constraint grammar rules used for syntactic disambiguation and with negative conditions often relating to just portions of a sentence, then select the tag @ERR when an error occurs. The error detection component consists of 659 error rules and a final rule that applies the tag @OK to the remaining words (Birn, 2000). Relaxation is included in the error detection rules and not in the phrase construction rules, so that certain word sequences are regarded as phrases despite the grammar errors in them (Arppe et al., 1998).

Grammatical errors are viewed by this system as "violations of formal constraints between morphosyntactic categories" (Arppe et al., 1998). Two types of constraints are distinguished: intra-phrasal, e.g. phrase-internal agreement, and inter-phrasal, e.g. constituent order in a clause. Grammatifix not only detects errors, but also provides a diagnosis with an explanation of the error and, when possible, a suggestion for correction. The tool addresses 43 error types, of which 26 concern grammar, 14 punctuation and formatting, and 3 stylistic issues. The grammar error types include agreement errors in noun phrases and subject complements, errors in pronoun form after prepositions, errors in verbs, in word order and others (Arppe et al., 1998; Arppe, 2000). The grammar error types are listed and compared to the types in the other Swedish systems in Section 5.4.4.

The linguistic performance of the system was tested separately for precision and recall, based on corpora of different sizes from the newspaper Göteborgs Posten (Birn, 2000, pp. 37-39). For precision, the newspaper corpus consisted of 1,000,504 words and resulted in a precision rate of 70% (374 correct alarms and 160 false alarms).

10 Lingsoft's homepage is http://www.lingsoft.fi/
11 Birn (1998) gives a short presentation of the formalism. The CG formalism was originally developed by Karlsson (1990). Karlsson et al. (1995) give a description of the basic principles and the CG formalism.
The analysis of recall was based on a text extract of 87,713 words and resulted in an overall recall rate of 35%, counting also error types not covered by the tool (135 errors in the text, 47 errors detected). Counting only the error types targeted by Grammatifix, the recall is 85% (55 errors in the text, 47 errors detected).12

The Granska Project

The proofreading tool Granska is being developed at the Department of Numerical Analysis and Computer Science, KTH (the Royal Institute of Technology) in Stockholm. The first prototype was developed in 1995, running under Unix. Then followed a more elaborate version with a graphical interface under the Windows operating system; this version included detection of agreement errors in noun phrases. The current version of Granska is a completely new program, written from scratch starting in 1998 in the project Integrated language tools for writing and document handling.13 Granska is an integrated system that provides spelling and grammar checking running at the same time, and it can be tested in a simple web interface.14 The system recognizes and diagnoses errors and suggests corrections when possible.

Granska combines probabilistic and rule-based methods, where specific error rules and locally applied rules detect ungrammaticalities in free text. The underlying lexicon includes 160,000 word forms, generated from the tagged Stockholm-Umeå Corpus (SUC) (Ejerhed et al., 1992) of 1 million words and completed with word forms from SAOL (Svenska Akademiens Ordlista, 1986). The lexical analyzer applies Hidden Markov Models based on the statistics of word and tag occurrences in SUC. Each word obtains one tag with part-of-speech and feature information.

12 The error profile of the corpus used for the analysis of Grammatifix's grammatical coverage (recall) is reported in Chapter 4, Section 4.4.
13 See more about the project at: http://www.nada.kth.se/iplab/langtools/
Unknown words are analyzed with probabilistic word-ending analysis (Carlberger and Kann, 1999). A rule matching system analyzes the tagged text, searching for grammatical violations defined in the detection rules, and produces an error description and a correction suggestion for the error. When needed, additional help rules are applied more locally, used as context conditions in the error rules. Other, accepting rules handle correct grammatical constructions in order to keep the error rules from applying to them, i.e. to avoid false alarms (Knutsson, 2001).

Granska's rule language is partly object-oriented, with a syntax resembling C++ or Java, and is meant to be applied not only to grammar checking but also to partial parsing tasks such as identification of phrase and sentence boundaries. Further, with Granska it is possible to search and edit directly in the text, e.g. changing the tense of verbs or moving constituents within a sentence. The tagging result may also be improved when the "guess" is wrong, so that a certain text area may be retagged (see further in Knutsson, 2001).

The rule collection of the system consists of approximately 600 rules (Domeij et al., 1998), divided into three main categories: orthographic, stylistic and grammatical rules. Half of the rules detect grammar errors, including noun phrase and complement agreement, errors in pronoun form after prepositions, errors in verbs, errors in prepositions in fixed expressions, word order and other errors (Domeij and Knutsson, 1999; Knutsson, 2001). The grammar error types are listed and compared to the types covered by the other Swedish systems in Section 5.4.4.

A validation test of Granska is reported in Knutsson (2001, pp. 141-150), based on a corpus of 201,019 words, and shows an overall performance of 52% recall and 53% precision (418 errors in the texts, 216 correct alarms and 197 false alarms).
In this text sample, including both published texts written by professional writers and student papers,15 Granska is best at detecting errors in verb form, with a recall of 97% and precision of 83%, and agreement errors in noun phrases, with a recall of 83% and precision of 44%.

The Scarrie Project

Within the framework of the EU-sponsored project Scarrie,16 prototypes of proofreading tools for the Scandinavian languages Danish, Norwegian and Swedish were developed. The project ran from December 1996 to February 1999. WordFinder Software AB17 was the coordinator of the project; the Department of Linguistics at Uppsala University and the newspaper Svenska Dagbladet were the other Swedish partners. Interface and packaging were outside the project and were planned to be taken care of by WordFinder after the project's completion. Professional writers at work at particular newspapers and publishing firms were the intended users.

The Swedish version of the prototype provides both spelling and grammar checking running at the same time, searching through the text sentence by sentence. The system recognizes and diagnoses errors, giving information about error type and error span. No suggestions for correction are given.18

The system lexicon is based on a corpus of 220,000 newspaper articles published in 1995 and 1996 in the Swedish newspapers Svenska Dagbladet (SvD) and Uppsala Nya Tidning (UNT). The SvD/UNT corpus consists of more than 70 million tokens and 1.5 million word types. The resulting lexical database, ScarrieLex, consists of a one-word lexicon of 257,136 single word forms and a multi-word lexicon of 4,899 phrases (Povlsen et al., 1999).

14 Granska's Internet demonstrator is located at: http://www.nada.kth.se/theory/projects/granska/demo.html
15 The error profile of the validated corpus of Granska was already reported in Chapter 4, Section 4.4.
The spelling module is based on the Dutch software CORRie (Vosse, 1994) (see Section 5.4.2), whereas the grammar checking module ScarCheck was developed as new software (Sågvall Hein, 1998b; Starbäck, 1999).19 The grammar checker is based on a previously developed parser, the Uppsala Chart Parser (UCP), a procedural, bottom-up parser applying a longest-path strategy (Sågvall Hein, 1981, 1983).20

The parsing strategy for erroneous input is based on constraint relaxation in the context-free phrase structure rules and the application of local error rules (Wedbjer Rambell, 1999b). The grammar is, in other words, underspecified to a certain level, allowing feature violations and parsing of ungrammatical word sequences. The local error rules are part of the same grammar and are applied to the result of the partial parse. Alternative parses are weighted, yielding the best parse. A chart-scanner collects and reports on errors (Sågvall Hein, 1999).

16 The Scarrie project homepage: http://fasting.hf.uib.no/˜desmedt/scarrie/
17 The homepage of WordFinder Software AB is http://www.wordfinder.com
18 A demonstrator of the Scarrie prototype is located at: http://stp.ling.uu.se/˜ljo/scarrie-pub/scarrie.html
19 The spelling and grammar checking in the Danish and Norwegian prototypes is solely based on the Dutch software CORRie (Vosse, 1994).
20 The original version of the chart parser was first implemented in Common Lisp (see Carlsson, 1981) and then converted to C. The resulting Uppsala Chart Parser Light (UCP Light) (see Weijnitz, 1999) is a smaller and faster version at the cost of less functionality, starting at the syntax level and requiring morphologically analyzed input. UCP Light is used in the web demonstrator (Starbäck, 1999). (Email correspondence with Leif-Jöran Olsson, Department of Linguistics, Uppsala University, 21/11/01)
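The general idea of constraint relaxation can be illustrated with a minimal sketch. This is not Scarrie's actual UCP grammar; the mini-lexicon and feature names below are invented for illustration. Instead of rejecting a determiner-noun pair that violates gender agreement, the parse is accepted anyway and the violated constraint is recorded, so that it can later be reported as an error:

```python
# Toy illustration of constraint relaxation: a determiner and a noun must
# agree in gender, but the "relaxed" parse accepts the pair regardless and
# records which feature constraint was violated.
# (Invented mini-lexicon for illustration only.)

LEXICON = {
    "en":  {"cat": "det",  "gender": "com"},   # 'a' (common gender)
    "ett": {"cat": "det",  "gender": "neu"},   # 'a' (neuter)
    "bil": {"cat": "noun", "gender": "com"},   # 'car'
    "hus": {"cat": "noun", "gender": "neu"},   # 'house'
}

def parse_np(det_word, noun_word):
    """Parse det+noun as an NP; relax agreement instead of failing."""
    det, noun = LEXICON[det_word], LEXICON[noun_word]
    violations = []
    if det["gender"] != noun["gender"]:        # the relaxable constraint
        violations.append("gender agreement")
    return {"np": (det_word, noun_word), "violations": violations}

print(parse_np("en", "bil"))   # well-formed: no violations
print(parse_np("en", "hus"))   # *en hus: gender agreement flagged
```

In a chart parser the recorded violations would additionally be weighted against competing analyses before the chart-scanner reports them; that step is omitted here.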
ScarCheck targets more than thirty grammar error types, including agreement errors in noun phrases and complements, errors in the verb phrase and verb valence errors, errors in conjunctions, pronoun case, word order and others (Sågvall Hein et al., 1999). Again, the different grammar error types are listed and compared to the errors of the other two Swedish systems in Section 5.4.4. The performance evaluation of the grammar checking system was based on a newspaper corpus of 14,810 words, with an overall recall of 83.3% and precision of 76.9% (first run). Six grammar errors occurred in the corpus, represented by errors in the nominal phrase, the verb phrase and word order (Sågvall Hein et al., 1999).21

5.4.4 Overview of the Swedish Systems

Detection Approaches

The approaches for detecting errors in unrestricted text differ among the Swedish systems, not only in the technology used, which varies from chart-based methods in Scarrie and the application of constraint grammars in Grammatifix to probabilistic and rule-based methods in Granska, but also in the way the strategies are applied. Grammatifix and Granska identify erroneous patterns by partial analysis, whereas Scarrie produces a full analysis for both grammatical and ungrammatical sentences. Grammatifix leaves ambiguity resolution to the syntactic level and applies relaxation in its error rules in order to be able to parse erroneous phrases. Granska disambiguates starting at the lexical level, assigning only one morphosyntactic tag to each word, and then applies explicit error rules in the search for errors, including locally applied rules and rules to avoid marking grammatically correct word sequences as ungrammatical. Scarrie parses ungrammatical input implicitly, by relaxation of the parsing rules (not in the error rules, as Grammatifix does), and explicitly, by additional error rules applied locally to the parsing result.
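The notion of an explicit error rule can be made concrete with a toy pattern over a tagged token sequence that directly describes the ungrammaticality, in the spirit of the rule-based strategies above. The tag format and the rule below are invented for illustration and drastically simplified compared to the real SWECG or Granska rule formalisms:

```python
# Toy "explicit error rule": flag a determiner immediately followed by a
# noun whose gender feature differs. Tags are simplified part-of-speech +
# gender labels, not the real SUC or SWETWOL tagsets.

def find_np_agreement_errors(tokens):
    """Scan (word, tag) pairs and return flagged det+noun sequences."""
    alarms = []
    for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
        if t1.startswith("DET.") and t2.startswith("NN."):
            if t1.split(".")[1] != t2.split(".")[1]:   # gender mismatch
                alarms.append((w1, w2))
    return alarms

# *en hus 'a [com] house [neu]' followed by a finite verb:
tagged = [("en", "DET.com"), ("hus", "NN.neu"), ("brinner", "VB.fin")]
print(find_np_agreement_errors(tagged))  # [('en', 'hus')]
```

The contrast with the relaxation strategy is that the rule here matches the error pattern itself, whereas a relaxed grammar only describes well-formed phrases and records where their constraints had to be loosened.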
The thing common to all the tools is that they define (wholly or to some extent) explicit error rules describing the nature of the errors they search for. Furthermore, the tools either proceed with error detection sentence by sentence, requiring recognition of sentence boundaries, or they rely in their rules on, for instance, capitalization conventions, searching for words beginning with capital letters (cf. Birn, 2000).

21 Also two errors in splits are reported.

The Coverage of Error Types

In this section I present the different grammar error types covered by Grammatifix, Granska and Scarrie, and the similarities and/or differences between the systems' selections of error types. Table 5.2 (p. 137) shows the results of this analysis, based on the available error specifications of the different projects22 and completed with personal observations from tests run with these tools. For every listed error type an example sentence from the projects' error specifications (if present) was chosen to exemplify the targeted error. The source of this example is listed in the last column of the table. A similar analysis is discussed in Arppe (2000),23 where he concludes that the selection of error types targeted by the Swedish grammar checking tools is quite similar in many respects; differences occur in subsets of errors or in some specializations.

The analysis in the present thesis shows that all the tools check for errors in noun phrase agreement concerning definiteness, number and gender in both the form of the noun and the adjective. They also detect errors in the agreement between the quantifier/pronoun and the noun in partitive noun phrases, and in the masculine form of the adjective.
Violations of number and gender agreement with the predicative complement are also included in all three tools, and so is pronoun case, which all tools check in the context after certain prepositions. Also, the same kinds of word order errors are covered by all the tools, except that Scarrie additionally checks for inversion in the main clause.

Errors in verbs were the group that was most difficult to compare, because the detection approaches differ in some respects. The tools all check for occurrences of finite verbs (too many, missing, or no predicate at all) and the form of non-finite verbs (after an auxiliary verb or an infinitive marker); only Grammatifix does not check for finite verbs after an infinitive marker. They check further for a missing or extra inserted infinitive marker in the context of main verbs. They also look for more style-oriented errors in the use of passive verbs (double, or after certain verbs) and the supine form (double, or without ha 'have'). Scarrie also checks whether a supine form is used in place of an imperative. All the tools check for the use of the superlative form möjligast 'most-possible' in combination with an adjective.

Some other differences concern errors in the use of prepositions: Grammatifix and Granska detect errors in the harmony of prepositions in certain contexts, but only Granska checks preposition use in idiomatic expressions. Further, Granska checks tense harmony within a sentence. Double negation is not targeted by Scarrie.

22 Grammatifix: Arppe et al. (1998); Arppe (2000) and the specification in Word 2001; Granska: Domeij and Knutsson (1998, 1999) and the Internet demo: http://www.nada.kth.se/theory/projects/granska/demo.html; Scarrie: Sågvall Hein et al. (1999) and examples listed in the Internet demo: http://stp.ling.uu.se/˜ljo/scarrie-pub/scarrie.html
23 The present comparison is independent of the analysis reported in Arppe (2000). He also compared the punctuation and stylistic error types.
Granska and Scarrie also detect missing subject errors. Granska also checks more stylistic issues, such as contamination of expressions and tautology, which are not included in the table.24

24 Splits and run-ons are also targeted by some of these tools, but since these are not syntactic errors they were not included in this comparison.

Table 5.2: Overview of the Grammar Error Types in Grammatifix (GF), Granska (GR) and Scarrie (SC). The comparison was done on 08/10/01 and revisited on 30/10/02. 'X' indicates observations from the error specifications, '(x)' indicates my own observations. Each error type is followed by its example sentence, the source of the example in brackets (where recoverable), and an English gloss.

NOUN PHRASE:
Definiteness agreement (GF: X, GR: X, SC: X)
  Det är i samhällets ∗utvecklingen bort från detta som Arbetsdomstolen inte hängt med. [GF]
  'It is in the society's [poss] development [def] away from this that the Labour court has not kept up.'
Number agreement (GF: X, GR: X, SC: X)
  Natten bär ∗sin skuggor. [SC]
  'The night carries its [sg] shadows [pl].'
Gender agreement (GF: X, GR: X, SC: X)
  En ∗eventuellt segerfest får vänta. [SC]
  'A [com] possible [neu] victory-party [com] has to wait.'
Gender agreement, quantifier and noun (GF: X, GR: X, SC: (x))
  ∗Ett av de gula blommorna hade slagit ut. [GF]
  'One [neu] of the yellow flowers [com] had come out.'
Gender agreement, masculine form of adjective (GF: X, GR: (x), SC: (x))
  Då frestade han ditt kött och sände dig den ∗rödhårige kvinnan. [GF]
  'Then he tempted your flesh and sent you the red-haired [masc] woman.'

PREDICATIVE COMPLEMENT:
Number agreement (GF: X, GR: X, SC: X)
  Tävlingen blev väldigt ∗besvärliga. [SC]
  'The competition [sg] became very difficult [pl].'
Gender agreement (GF: X, GR: X, SC: X)
  Då hade läget i byn redan blivit ∗outhärdlig för gruppen. [GF]
  'At that point the situation [neu] in the village had already become unbearable [com] for the group.'

PRONOUN:
Case after preposition (GF: X, GR: X, SC: X)
  Vi sjöng för ∗de. [GF]
  'We sang for they [nom].'

VERBS:
Verb form after auxiliary verb (GF: X, GR: (x), SC: X)
  Hur trygghet inte längre kan ∗var statisk utan ligga i förnyelsen, utvecklingen och förändringen. [SC]
  'How safety cannot any longer be [pres] static but lie in renewal, development and change.'
Verb form after inf. marker (GF: –, GR: (x), SC: X)
  Han har lovat att i alla fall ∗skall slå Turkiet. [SC]
  'He has promised that in any case will [pres] beat Turkey.'
Number of finite verbs (GF: X, GR: (x), SC: X)
  I Ryssland ∗är betalar nästan ingen någon skatt. [GF]
  'In Russia almost noone is [pres] pays [pres] any tax.'
Missing finite verb (GF: X, GR: X, SC: X)
  Det ∗bli viktigt. [GF]
  'That will-be [inf] important.'
Missing verb (GF: X, GR: X, SC: X)
  Ingen koll. [GR]
  'No control.'
Missing inf. marker (GF: X, GR: X, SC: X)
  Vi kommer – spela en låt av Ebba Grön. [GR]
  'We will – play a song by Ebba Grön.'
Extra inf. marker (GF: X, GR: (x), SC: X)
  Sverige började ∗att klassa kärnkraftsincidenter enligt den internationella standarden. [SC]
  'Sweden started to classify nuclear incidents in accordance with the international standard.'
Supine instead of imperative (GF: –, GR: –, SC: X)
  ∗Betänkt också de anläggningskostnader som tillkommer. [SC]
  'Consider [sup] also the construction costs that will be added.'
Supine without "ha" (GF: X, GR: X, SC: (x))
  De kunde – fått bilderna på begravningsgästerna från danska polisen.
  'They could – get pictures of the funeral guests from the Danish police.'
Double supine (GF: X, GR: X, SC: X)
  Vi hade velat ∗sett en större anslutningstakt, säger Dennis.
  'We had wanted [sup] seen [sup] a greater rate of joining, says Dennis.'
Double passive (GF: X, GR: X, SC: X)
  Saken har försökts att ∗tystas ner.
  'The thing has been tried [pass] to be quietened [pass] down.'
S-passive after certain verbs (GF: X, GR: (x), SC: X)
  Huset ämnar byggas.
  'The house intends to be built [pass].'
Tense harmony (GF: –, GR: X, SC: –)
  Jag höll mig inne tills stormen ∗har bedarrat. [GR]
  'I kept [pret] myself inside until the storm has abated [perf].'

PREPOSITIONS:
Wrong preposition in fixed expressions (GF: –, GR: X, SC: –)
  med utgångspunkt ∗från [GR]
  'with starting-point from'
Preposition harmony with two-part conjunctions (GF: X, GR: (x), SC: –)
  Det är utbildning som idag inte erbjuds vare sig i Lund eller – Malmö. [GF]
  'It is education that today is not offered either in Lund or Malmö.'

WORD ORDER:
Placement of verb/negation adverb (GF: X, GR: X, SC: X)
  Man kan tro inte sina öron. [SC]
  'One can believe not one's ears.'
Word order in subordinate interrogative clause (GF: X, GR: X, SC: X)
  Jag undrar vad gör de unga männen i Finland. [GF]
  'I wonder what do the young men in Finland do.'
Word order in main clause with inversion (GF: –, GR: –, SC: X)
  Nu man kan testa de kommande versionerna av programvaran. [SC]
  'Now one can try the future versions of the program.'

OTHER:
Missing subject (GF: –, GR: (x), SC: (x))
Missing inf. marker with preposition (GF: X, GR: X, SC: X)
  Jag klarar av – gå. [SC]
  'I can manage – walk.'
Repeated words (GF: X, GR: –, SC: –)
  (No example given in the specification.)
Double negation (GF: X, GR: (x), SC: –)
  Det kan bli svårt att få jobb om man inte har varken pengar eller familj att stöda en. [GF]
  'It can be hard to get work if one does not have neither money or family to support one.'
Construction "möjligast" + adjective (GF: X, GR: (x), SC: (x))
  Hon körde med möjligast stora snabbhet. [GF]
  'She drove with the most possible great speed.'

So far, the comparison has concerned the different types of errors covered, but detection also depends on the syntactic complexity allowed for within each error type. For instance, the detection of errors in the verb form after an infinitive marker can differ depending on whether other (optional) constituents are inserted between the infinitive marker and the verb. In (5.13), all the sentences violate the rule requiring an infinitive verb form after an infinitive marker.
In (5.13a) and (5.13b) the targeted verb is preceded by an adverbial realized as a prepositional phrase, which disturbed both Granska and Scarrie in the detection of this error.25 The alarms raised by each tool are shown to the right of each example.

(5.13)                                                   Granska  Scarrie
a. Han har lovat att i alla fall ∗skall slå Turkiet.       No       No
   he have promised to in any case will [pres] beat [inf] Turkey
   – 'He has promised to will beat Turkey in any case.'
b. Han har lovat att i alla fall ∗vill slå Turkiet.        No       No
   he have promised to in any case want [pres] beat [inf] Turkey
   – 'He has promised to wants beat Turkey in any case.'
c. Han har lovat att ∗skall slå Turkiet.                   No       Yes
   he have promised to will [pres] beat [inf] Turkey
   – 'He has promised to will beat Turkey.'
d. Han har lovat att ∗vill slå Turkiet.                    Yes      Yes
   he have promised to want [pres] beat [inf] Turkey
   – 'He has promised to wants beat Turkey.'
e. Han har lovat att ∗slår Turkiet.                        Yes      Yes
   he have promised to beats [pres] Turkey
   – 'He has promised to beat Turkey.'

The error is detected only when the verb follows directly after the infinitive marker, as in (5.13d) and (5.13e). In the sentence in (5.13c) the verb also follows directly after the infinitive marker, but Granska does not detect it as an error, although the verb is tagged as a verb in the present tense form.

25 The errors are not detected even if simple adverbials such as inte 'not', aldrig 'never' or sen 'later' are inserted.

Another example of how important syntactic coverage is for error detection is shown in (5.14), where Scarrie had problems detecting the agreement error between the subject and the adjective form in the predicative complement, due to a possessive modifier of the head noun of the subject in (5.14b). Granska detects both errors, but Grammatifix does not react at all.

(5.14)                                       Scarrie's diagnosis
a. Hus är ∗vacker.                           wrong number in the adjective in the predicative complement
   house [pl, neu] is beautiful [sg, com]
   – 'House is beautiful.'
b. Mitt hus är ∗vacker.                      no reaction
   my [sg, neu] house [sg, neu] is beautiful [sg, com]
   – 'My house is beautiful.'

In conclusion, the three Swedish systems cover both grammatical and more style-oriented errors, and their coverage is similar in many respects. In relation to the most common errors in Child Data, they all cover the non-structural errors that are, as discussed in the previous section, restricted to certain delimited text patterns. The structural errors that require more complex analysis are included only to a small extent. All three tools detect the same errors in noun phrase agreement and most of the errors in verb form. Exceptions are the verb form errors after an infinitive marker, which are not included in Grammatifix; errors concerning the use of the supine verb form instead of the imperative, which are only included in Scarrie; and tense harmony, which is only checked by Granska. Errors in finite verb form, which were the most frequent error type in Child Data, are (probably) covered by the 'Missing finite verb' category that all the tools include. Among the errors of redundant or missing constituents in clauses, only Grammatifix checks for repeated words. All the tools check for a missing infinitive marker in the context of a preceding preposition, and Granska and Scarrie also detect a missing subject. Other categories of redundant or missing constituents in clauses are not covered. Word choice errors are covered only by Granska, and only to the extent of prepositions in fixed expressions. As discussed in the previous section, structural errors of this kind in general require more complex analysis in order to be identified, except when they are limited to certain parts that can be delimited clearly (e.g. in a verb cluster).

The present overview of error types covered by these tools does not reveal the actual grammatical coverage and precision of detection.
As shown above, there is a question of the extent of error coverage, since for instance the insertion of some optional constituent, or the presence or absence of certain constituents, influenced whether or not an error was identified. I therefore tested these tools' performance directly on Child Data; the test is reported in Section 5.5 below.

5.4.4 Performance

All the systems were validated for the linguistic functionality they provide, as reported above in the descriptions of the separate projects (Section 5.4.3) and summarized in Table 5.3 below. The validation tests carried out by the developers are based on corpora of different size and composition, and different sets of errors were found. As discussed in Section 5.2.3, the size and genre of the evaluated texts and the writers' experience may influence the outcome of such an analysis, and the results should be interpreted carefully. The size and composition of the tested texts influence which syntactic constructions giving rise to errors actually occur, and should also be related to how frequent errors are in the tested population.

Table 5.3: Overview of the Performance of Grammatifix, Granska and Scarrie

  Tool          Corpus                             Size                        Recall   Precision
  Grammatifix   newspaper articles                 87 713 (recall test)        35%      70%
                newspaper articles                 1 000 504 (precision test)
  Granska       newspaper articles, official       201 019                     52%      53%
                texts, student papers
  Scarrie       newspaper articles                 14 810                      83%      77%

Grammatifix and Scarrie were tested solely on newspaper texts written by professional writers, which is probably sufficient in the case of Scarrie, since it was developed for professional writers. Grammatifix, on the other hand, as a module in a word processor not aimed at any special group, should be tested on texts of different genres written by different populations. Granska was evaluated on texts of different genres, consisting of published newspaper and popular science articles, official texts and student compositions.
This corpus is more balanced and perhaps better reflects the real performance of the system. In addition, the types of errors that dominate in the corpus, depending on the genre, are reported (Knutsson, 2001).

Further, a fairly large amount of data is needed in order to test a reasonable number of errors. The validation corpus used for Scarrie was small in this respect, covering only six of the defined errors and yielding quite high rates in both recall and precision. In the case of Granska, the corpus is much bigger and, as discussed, better balanced. Grammatifix was tested on the largest corpus for precision and a smaller corpus for recall, and obtained the lowest recall. For a commercial product with high expectations on precision, the error coverage of the system was probably cut down. This means that the system could probably detect more errors and obtain a better recall rate than the current 35%, but since that would lower precision by increasing the number of false flaggings, the detection of such "unsafe" errors is not included and they remain undetected.

The recall rates of the systems vary from 35% to 83% and the precision rates lie between 53% and 77%. Evaluation of individual error types is only reported for Granska, with the best results for verb form errors and agreement errors in noun phrases.

5.4.5 Summary

The Swedish approaches to grammar checking apply techniques for searching (more or less) explicitly for ungrammaticalities in text. Errors are found either by looking for specific patterns in certain contexts that match the defined error rules, or by using selections in a "relaxed" parse by a chart-scanner. The approaches thus depend on how narrowly or broadly a specific error type is defined, so that the same error is not overlooked in other contexts.
The choice of which types of errors are detected is based on a more or less ambitious analysis of errors in writing, often for a certain group of writers (e.g. professional writers, writers at work). However, the risk remains that some other type of error in the same pattern may be overlooked. The coverage of error types is very similar across the systems. Performance was evaluated separately on different text data, so the results are hard to compare.

5.5 Performance on Child Data

5.5.1 Introduction

Having examined which error types are covered by the current Swedish systems Grammatifix, Granska and Scarrie, their performance will now be tested on the Child Data corpus. Recall that the error frequency is higher in texts written by children than by the adult writers targeted by the Swedish grammar checkers, and that the error distribution in Child Data is (slightly) different. Testing the tools' performance on Child Data is crucial in view of their having to handle text with a higher error density, and with errors of a (slightly) different kind than they were designed for. The discussion in the previous section of the error types covered by these systems shows that many of the errors in Child Data are targeted. Among the most common error types in Child Data, all (or most) of the error types related to verb form and agreement in noun phrases are targeted by the tools, as are some (quite few) of the errors concerning redundant or missing constituents in clauses and word choice errors, a group of errors that needs more elaborate and complex analysis for detection (see the discussion in Section 5.3). The tools are not, however, designed in the first place to detect errors in children's texts and will most probably perform worse on these texts. The question is how low the performance will be, where exactly the tools will fail, and what consequences the results have for Child Data.
This section continues with a description of the evaluation procedure (Section 5.5.2) and the individual systems' detection procedures (Section 5.5.3). Then the detection results on Child Data are presented type by type (Section 5.5.4). Finally, a summary of the results and a discussion of overall performance is presented (Section 5.5.5).

5.5.2 Evaluation Procedure

As discussed in Section 5.2.3, evaluation of authoring tools normally concerns detection, diagnosis and correction functionality, either on single sentences or on whole text samples. For investigating how good a system is at detecting targeted errors, sentence samples will usually do, but a corpus is better for measuring how good a system is overall. In my analysis the whole Child Data corpus in the spell-checked version was used as input, free from non-word spelling errors (see Section 3.3 for a discussion of how this was achieved), since the main purpose of the evaluation is to assess the checkers' performance in detecting grammar errors. The Child Data corpus represents texts that are new to all three systems, and a writing population that is not explicitly covered by any of them. Since not all the systems give suggestions for correction, the present performance test will only analyze detection and diagnosis performance. Detection performance is investigated in terms of the number of correct and false alarms. Correct alarms include all detected errors, divided further according to whether a correct or an incorrect diagnosis was made. False alarms are divided further into detections of correct word sequences diagnosed as errors, and detections that happen to include error categories other than grammar errors, e.g. a spelling error, a split compound, or a sentence boundary.
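The alarm taxonomy above can be stated as a small decision procedure. The following Python sketch is my own illustration of the classification scheme; the function and argument names are hypothetical and not taken from any of the three systems:

```python
def classify_alarm(flags_real_error: bool,
                   diagnosis_correct: bool,
                   contains_other_error: bool) -> str:
    """Map one system alarm to the four evaluation classes of Section 5.5.2."""
    if flags_real_error:
        # The erroneous segment was found; the diagnosis may still be wrong.
        if diagnosis_correct:
            return "correct alarm with correct diagnosis"
        return "correct alarm with incorrect diagnosis"
    if contains_other_error:
        # Grammatical text was flagged, but the flagged fragment contains
        # e.g. a split compound, a spelling error or an unmarked sentence
        # boundary (an error of another kind than a grammar error).
        return "false alarm with other error"
    return "false alarm"
```

Both kinds of correct alarm count towards recall, while both kinds of false flagging lower precision.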
To exemplify, the agreement mismatch between the common gender determiner en 'a [com]' and the neuter gender compound noun stenhus 'stone-house [neu]' in (5.15a) concerns the gender form of the determiner, and flagging it as such is a correct alarm with a correct diagnosis. Identifying this noun phrase segment but classifying it as an error in number agreement, as in (5.15b), would instead be considered a correct alarm with an incorrect diagnosis. That is, the erroneous segment is correctly detected, but the analysis of what type of error it concerns is wrong. The example in (5.15c) represents a false alarm, where the correct (grammatical) form of the noun phrase was detected and diagnosed as an error in gender agreement. Finally, in (5.15d) we see an example of a false alarm that includes a segmentation error (not a grammar error): the noun in the noun phrase is split, and the determiner together with the first part of the split noun is identified as a grammar error with an agreement violation in gender. Such instances of grammatically correct text selected as ungrammatical due to a split, a spelling error, etc. are classified as false alarms with other error. I have chosen to separate these detections from the "real" false alarms, since they represent text fragments that are not entirely free from errors, although the errors are of a different nature than grammatical/syntactic ones. These findings can be interesting since, as Knutsson (2001) points out, such an alarm could be a hint to writers who can see that the actual error lies in the split noun. It could, however, also give rise to a new error if the user chooses to change the gender of the determiner and writes: en sten hus 'a [com] stone [com] house [neu]'.

(5.15)                                  Diagnosis                Class of alarm
a. *en stenhus                          gender agreement error   correct alarm with
   (a [com] stone-house [neu])                                   correct diagnosis
b.
*en stenhus                             number agreement error   correct alarm with
   (a [com] stone-house [neu])                                   incorrect diagnosis
c. ett stenhus                          gender agreement error   false alarm
   (a [neu] stone-house [neu])
d. ett sten hus                         gender agreement error   false alarm with
   (a [neu] stone [com] house [neu])                             other error

The set of all detected errors is thus represented by all correct alarms, with correct or incorrect diagnosis, and the set of false alarms consists of false flaggings without any error together with false flaggings containing errors other than grammatical ones. The systems' grammatical coverage (recall) and flagging accuracy (precision) have been calculated in accordance with the following definitions:

(5.16) a. recall = (correct alarms / all errors) * 100
       b. precision = (correct alarms / (correct alarms + false alarms)) * 100

I will also consider the overall performance of the systems expressed as F-value, a combined measure of recall and precision. The F-value is calculated as in (5.17), where the β parameter has the value 1, since both recall and precision are equally important in this analysis. (The parameter β takes different values depending on whether precision is more important (β > 1) or recall is of greater value (β < 1); when both are equally important, β = 1.)

(5.17) F-value = ((β² + 1) * recall * precision) / (β² * (recall + precision))

5.5.3 The Systems' Detection Procedures

Grammatifix

Grammatifix is included as a module in Microsoft Word, working alongside a spell checking module. The user may choose to disregard grammar checking and just check the text for spelling, or to include both checkers. The tool then checks the text sentence by sentence, first for spelling and then for grammar. Further adjustments of the grammar checking are possible, where the user may choose among the different
error types defined in Grammatifix (including style, punctuation and formatting errors as well as grammar errors) and also set the maximum length of a sentence in number of words. The tool also provides a report on the text's readability, including counts of tokens, words, sentences and paragraphs; the mean score of these is computed, providing a readability index. One diagnosis of each error is always given, and usually a suggestion for correction.

Granska

The web-based demonstrator of Granska includes no interactive mode; spelling and grammar are checked independently, based on the tagging information. The user may choose a presentation format for the result that includes either all sentences, with comments on spelling and grammar, or only the erroneous sentences. Further adjustments include the choice of whether to display error corrections and the result of tagging, and whether a newline is interpreted as the end of a sentence or not. The last setting is quite important for children's writing, where punctuation is often absent or not used properly and the use of newlines is also arbitrary, i.e. a newline in the middle of a sentence is not unusual. In some cases Granska also yields more than one suggestion for error correction, and there is a possibility of constructing new detection rules. Long stretches of text without any punctuation or newline (usual in children's writing) are apparently hard for the tool to handle; it simply rejects such text without any output.

Scarrie

The web demonstrator of Scarrie does not include an interactive mode either. Individual sentences (or a longer text) can be entered, with the requirement that sentences end with punctuation. Both spelling and grammar are checked and the results of detection are displayed at the same time. Errors are highlighted and a diagnosis is displayed in the status bar. The system gives no suggestions for correction.

5.5.4 The Systems' Detection Results

In this section I present the results of the systems' performance on Child Data.
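Throughout the following result sections, recall, precision and F-value are computed from the alarm counts according to definitions (5.16) and (5.17). As a minimal sketch (my own illustration in Python, not part of any of the evaluated tools):

```python
def recall(correct_alarms: int, all_errors: int) -> float:
    """Grammatical coverage in percent, definition (5.16a)."""
    return correct_alarms / all_errors * 100

def precision(correct_alarms: int, false_alarms: int) -> float:
    """Flagging accuracy in percent, definition (5.16b)."""
    return correct_alarms / (correct_alarms + false_alarms) * 100

def f_value(r: float, p: float, beta: float = 1.0) -> float:
    """Combined measure, definition (5.17); with beta = 1 this reduces
    to the harmonic mean of recall and precision."""
    return (beta ** 2 + 1) * r * p / (beta ** 2 * (r + p))
```

For instance, a hypothetical tool with 8 correct alarms on 16 targeted errors and 8 false flaggings would score recall(8, 16) = 50.0 and precision(8, 8) = 50.0, giving an F-value of 50.0.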
For every error type I first present to what extent the errors are explicitly covered according to the systems' specifications; I then proceed system by system, present the detection results for the particular error type, and discuss which errors were actually detected, which were incorrectly diagnosed, the characteristics of the errors that were not found, and the false alarms. A short conclusion ends each error type presentation. Errors cited from Child Data refer either to previously discussed examples or directly to the index numbers of the error corpus presented in Appendix B.1. A system's diagnosis is presented exactly as given by the particular system. All detection results are summarized and the overall performance is presented in Section 5.5.5.

Agreement in Noun Phrases

Most of the errors in Child Data concern definiteness in the noun, and gender or number in the determiner, of noun phrases; errors that, according to the error specifications, are explicitly covered by all three tools. All three also check for errors in the masculine gender form of adjectives, and for agreement between the quantifier and the noun in partitive constructions. The latter type as found in Child Data concerns the form of the noun rather than the form of the quantifier (see (4.11) on p. 50).

Grammatifix detected seven errors in definiteness and gender agreement. One of the errors in the masculine form of an adjective was only partly detected and was given a wrong diagnosis. The error concerns inconsistency in the use of adjectives (previously discussed in (4.9) on p. 49): either both adjectives should carry the masculine gender form, or both should have the unmarked form. Grammatifix's detection of this error is exemplified in (5.18), where we see that, due to the split noun, the error was diagnosed as a gender agreement error between the common gender determiner den 'the [com]' and the first part of the split noun, troll 'troll [neu]', which is neuter.
An interesting observation is that when the split noun is corrected, forming the correct word trollkarlen 'magician [com, def]', Grammatifix does not react and the error in the adjectives is not discovered. Grammatifix only checks whether the masculine form of an adjective occurs together with a non-masculine noun, not consistency of use, as in this error sample.

(5.18) det va *den hemske *fula troll karlen (⇒ trollkarlen) tokig som ...
       (it was the [com, def] awful [masc, wk] ugly [wk] troll [neu, indef] man [com, def] (⇒ magician [com, def]) Tokig that)
       – 'It was the awful ugly magician Tokig that ...'
       Grammatifix's diagnosis: Check the word form den 'the [com, def]'. If a determiner modifies a noun with neuter gender, e.g. troll 'troll', the determiner should also have neuter gender ⇒ det 'the [neu, def]'.

In general, simple constructions with a determiner and a noun are detected, whereas more complex noun phrases were missed. Three errors in the definiteness form of the noun were overlooked (G1.1.1; G1.1.2, see (4.2) p. 46; G1.1.3, see (4.3) p. 46). Concerning gender agreement, one error involving the masculine form of an adjective was missed (G1.2.4, see (4.8) p. 48). None of the errors in number agreement were detected: one with a determiner error (G1.3.1, see (4.10) p. 49) and two with partitive constructions (G1.3.2, see (4.11) p. 50; G1.3.3). Grammatifix made altogether 20 false flaggings, 16 of which involved other error categories, mostly splits (12 false alarms), such as the one in (5.19):

(5.19) det var ett stort sten hus
       (it was a [neu] big [neu] stone [com] house [neu])
       – 'It was a big stone-house.'
       Grammatifix's diagnosis: Check the word form ett 'a [neu]'. If a determiner modifies a noun with common gender, e.g. sten 'stone [com]', the determiner should also have common gender ⇒ en 'a [com]'.
The overall performance of Grammatifix in detecting errors in noun phrase agreement thus amounts to 53% recall and 29% precision.

Granska detected six errors in definiteness and two in gender agreement, one of the latter in a partitive noun phrase (G1.2.2). In three cases where the error concerned the definiteness form of the noun, Granska suggested changing the determiner (and adjective) instead: correcting G1.1.7 as den räkningen 'the [def] bill [def]' instead of en räkning 'a [indef] bill [indef]' (see (4.6) p. 47); likewise for error G1.1.8, where en kompisen 'a [indef] friend [def]' is corrected as den kompisen 'the [def] friend [def]'; and the opposite for G1.1.2, where the definite determiner and adjective in den hemska pyroman 'the [def] awful pyromaniac [indef]' are changed to indefinite forms instead of changing the form of the noun to definite (see (4.2) p. 46). Two errors in definiteness agreement (G1.1.1; G1.1.3, see (4.3) p. 46), none of the errors in the masculine form of adjectives (G1.2.3, see (4.9) p. 49; G1.2.4, see (4.8) p. 48), and all errors in number agreement were left undiscovered by Granska. Grammatical coverage for this error type thus results in 53% recall. Some false alarms occurred (25), of which 17 included other error categories, with splits the most represented (9 false alarms), resulting in a slightly lower precision rate of 24% in comparison to Grammatifix.

Scarrie detected six errors in definiteness agreement, one in gender agreement in a partitive noun phrase, two in the masculine form of adjectives and one in number agreement. In the case of number agreement, the error in det tre tjejerna 'the [sg] three girls [pl]' (G1.3.1, see (4.10) p. 49) is incorrectly diagnosed as an error in the noun instead of in the determiner. Exactly as Grammatifix, Scarrie detected the error in G1.2.3 due to the split noun and gave the same incorrect diagnosis (see (5.18) above).
The missed errors include two errors in the definiteness of the noun, one with a possessive determiner (G1.1.4, see (4.4) p. 47) and one with an indefinite determiner (G1.1.7, see (4.6) p. 47). One error concerned gender agreement, with an incorrect determiner with a compound noun (G1.2.1, see (4.7) p. 48). Finally, two errors in the number of the noun in partitive constructions were not detected (G1.3.2, see (4.11) p. 50; G1.3.3). Many false alarms occurred (133), and 50 of them concerned other error categories, mostly splits (33 false alarms), as in (5.20):

(5.20) han tittade i ett jord hål
       (he looked into a [neu] ground [com] hole [neu])
       – 'He looked into a hole in the ground.'
       Scarrie's diagnosis: wrong gender

Others involved spelling errors (10 false alarms), as in (5.21), where the pronoun vad 'what' is written as var and interpreted as the pronoun 'each', which does not agree in number with the following noun.

(5.21) Själv tycker jag att killarnas metoder är mer öppen och ärlig men också mer elak än var (⇒ vad) tjejernas metoder är.
       (self think I that the-boys' methods [pl] are more open and honest but also more mean than each [sg] (⇒ what) the-girls' [pl] methods are)
       – 'I think myself that the boys' methods are more open and honest but also more mean than the girls' methods are.'
       Scarrie's diagnosis: wrong number

Some false flaggings also concerned sentence boundaries (7 false alarms), as in (5.22):

(5.22) pojken gick till fönstret och ropade på grodan men vad dumt hunden har fastnat i burken där grodan var.
       (the-boy went to the-window and shouted at the-frog but what silly [neu] the-dog [com] had stuck in the-pot there the-frog was)
       – 'The boy went to the window and shouted at the frog, but how silly, the dog got stuck in the pot where the frog was.'
       Scarrie's diagnosis: wrong form in adjective

But mostly, ambiguity problems occurred (83 false alarms), as in (5.23a) and (5.23b):

(5.23) a.
dessutom luktade det saltgurka.
          (besides smelled it/the [neu] pickled-gherkin [com])
          – 'Besides, it smelled like pickled gherkin.'
          Scarrie's diagnosis: wrong gender
       b. Jag trampade rakt på den och skar upp hela min vänstra fot.
          (I walked right on it and cut up whole my left [pl, def] foot [sg, indef])
          – 'I stepped right on it and cut open my whole left foot.'
          Scarrie's diagnosis: wrong number

Scarrie's coverage for this error type is 67%, but the high number of false alarms results in a very low precision value of only 7%.

In conclusion, only Scarrie detected more than half of the errors in noun phrase agreement, but at the cost of many false alarms. Grammatifix and Granska displayed similarities in the detection of this error type, finding almost the same errors, and their false alarms are not that many. Scarrie's coverage differs from that of the other tools, and its high number of false alarms considerably decreased the precision score for this error type. All tools failed to find the erroneous forms in the head nouns of the partitive noun phrases (G1.3.2, see (4.11) p. 50; G1.3.3), which are most likely not defined in the grammars of these systems.

Agreement in Predicative Complement

All the tools cover errors in both number and gender agreement in the predicative complement. These types of errors in Child Data are, however, in most cases represented by rather complex phrase structures and will thus at most result in three detections. Grammatifix detected only one instance of all the agreement errors in the predicative complement (G2.2.6), and yielded an incomplete analysis of this particular error. It failed in that only the context of a single sentence is taken into consideration. Due to the ambiguity of the noun between a singular and a plural form, Grammatifix detected this error as one of gender agreement, but should have suggested the plural form instead, which is clear from the preceding context (see (5.1) and the discussion of detection possibilities in Section 5.3, p. 119).
Grammatifix thus obtained a very low recall (13%) for this error type. Three false alarms (one with a split) result in a precision value of 25%.

The three simple constructions with agreement errors in the predicative complement were all detected by Granska (G2.1.1, see (4.12) p. 51; G2.2.3, see (4.13) p. 51; G2.2.6, see (5.1) p. 119). In the case of G2.2.6, discussed above, the plural alternative is suggested. In error G2.2.3, the predicative complement includes a coordinated adjective phrase with errors in all three adjectives. Granska detected the first part:

(5.24) Själv tycker jag att killarnas metoder är mer *öppen och *ärlig men också mer *elak än var (⇒ vad) tjejernas metoder är.
       (self think I that the-boys' [pl] methods [pl] are more open [sg] and honest [sg] but also more mean [sg] than was (⇒ what) the-girls' methods are)
       – 'I think myself that the boys' methods are more open and honest but also more mean than the girls' methods are.'
       Granska's diagnosis: If öppen 'open [sg]' refers to metoder 'methods [pl]', that is an agreement error ⇒ killarnas metoder är mer öppna 'the boys' [pl] methods [pl] are more open [pl]'

Granska thus obtained a coverage value of 38% for this error type; with 5 false alarms (including one with a split and one with a spelling error), the precision rate is also 38%.

In the case of Scarrie, no errors in predicative complement agreement were detected; only 13 false flaggings occurred, which leaves this category with no results for recall or precision. The false alarms are due to incorrectly chosen segments, as in the following examples. In (5.25a) we have a coordinated noun phrase, where only the second part is considered and interpreted as a singular noun that does not agree with the plural adjective phrase as its predicative complement. In (5.25b) the verb pratade 'spoke [pret]' is interpreted as a plural past participle form and is considered as not agreeing with the preceding singular pronoun hon 'she [sg]'.

(5.25) a.
Han och hans hund var mycket stolta över den.
          (he and his dog [sg] were/was very proud [pl] over it)
          – 'He and his dog were very proud of it.'
          Scarrie's diagnosis: wrong number in adjective in predicative complement
       b. då sa jag till dom och våran lärare att hon blev mobbad och efter det så pratade läraren med dom som mobbade henne och då slutade dom med det.
          (then said I to them and our teacher that she [sg] was harassed and after that so spoke [pl] the-teacher with them that harassed her and then stopped they with that)
          – 'Then I told them and our teacher that she was harassed, and after that the teacher spoke to those that harassed her, and then they stopped with that.'
          Scarrie's diagnosis: wrong number in adjective in predicative complement

In conclusion, only Granska detected at least the simplest forms of agreement errors in the predicative complement. The other tools had problems with selecting the correct segments, especially Scarrie with its high number of false alarms.

Pronoun Form Errors

All three tools check explicitly for pronoun case errors after certain prepositions. Three of the four error instances in Child Data are preceded by a preposition. Grammatifix found two errors in pronoun form in the context of different prepositions (G4.1.1, see (4.19) p. 54; G4.1.3). No false flaggings occurred. Granska found three errors in the context of the prepositions efter 'after' and med 'with' (G4.1.1, see (4.19) p. 54; G4.1.4; G4.1.5, see (4.18) p. 53), which gives a recall rate of 60%. However, many false alarms (24) occurred, involving conjunctions interpreted as prepositions (17 flaggings) or prepositions at a sentence boundary where punctuation is missing (5 flaggings), resulting in a very low precision value of 11%. In (5.26a) we see an example of a false alarm with the conjunction för 'because', and in (5.26b) one with a preposition ending a sentence followed by a personal pronoun as the subject of the next sentence:

(5.26) a.
Vi skulle åka in i hamnen för hon skulle berätta något för sin mamma.
          (we would go in into the-port for she [nom] would tell something for her mother)
          – 'We would go into the port because she should tell something to her mother.'
          Granska's diagnosis: Erroneous pronoun form, use object form ⇒ för henne 'for her [acc]'
       b. ... och jag kom då tänka på den byn vi va (⇒ var) i jag berätta (⇒ berättade) om byn och dom sa att det va (⇒ var) deras by.
          (and I came then think at the the-village we what (⇒ were) in I [nom] tell (⇒ told) about the-village and they said that it what (⇒ was) their village)
          – '... and I then came to think of the village we were in. I told about the village and they said that it was their village.'
          Granska's diagnosis: Erroneous pronoun form, use object form ⇒ i mig 'in me [acc]'

Scarrie also found three error instances (G4.1.1, see (4.19) p. 54; G4.1.3; G4.1.4), all with different prepositions. False flaggings occurred here too, due to ambiguity problems, as for example in (5.27) and (5.28).

(5.27) Jag gick och gick tills jag hörde Pappa skrika kom kom
       (I walked and walked until I heard daddy scream come come)
       – 'I walked and walked until I heard daddy scream: Come! Come!'
       Scarrie's diagnosis: wrong form of pronoun

(5.28) a. Erik frågade om han kunde få ett barn.
          (Erik asked if/about he could get a child)
          – 'Erik asked if he could get a child.'
          Scarrie's diagnosis: wrong form of pronoun
       b. Tänk om jag bott hos pappa.
          (think if/about I lived with daddy)
          – 'Think if I lived with daddy.'
          Scarrie's diagnosis: wrong form of pronoun

Scarrie thus obtains a recall of 60%, but with 17 false alarms it attains a precision rate of only 15% for errors in pronoun case. In conclusion, as seen in the above examples, the tools search for errors in pronoun form after certain types of prepositions, but due to ambiguity in these words they fail more often than they succeed in detecting these errors.
Finite Verb Form Errors

Errors in finite verbs concern non-inflected verb forms, the most common error found in Child Data. All of the tools search for missing finite verbs in sentences and, judging from the examples in the error specifications, it seems that they detect exactly this type of error.

Grammatifix detected very few instances of sentences lacking a finite verb. Altogether four such errors were recognized, and in one of them Grammatifix suggested correcting another verb. In total, seven false alarms occurred, flagging verbs after an infinitive marker, as in (5.29), or after an auxiliary verb, as in (5.30).

(5.29) dom la sig ner för att ta skydd under natten
       (they lay themselves down for to take [inf] shelter during the-night)
       – 'They lay down to take shelter during the night.'
       Grammatifix's diagnosis: The sentence seems to lack a tense-inflected verb form. If such a construction is necessary, you can try to change ta 'take'.

(5.30) det kan ju bero på att föräldrarna inte bryr sig dom kanske inte ens vet att man har prov för dom lyssnar inte på sitt barn för en del kan ju behöva hjälp av sina föräldrar
       (it can of-course depend on that the-parents not care themselves they maybe not even know that one has test for they listen not to their children for some can of-course need help from their parents)
       – 'It can be because the parents do not care. They probably do not even know that you have a test, because they do not listen to their child, because some can need help from their parents.'
       Grammatifix's diagnosis: The sentence seems to lack a tense-inflected verb form. If such a construction is necessary, you can try to change behöva 'need'.

It seems that Grammatifix cannot cope with longer sentences. For instance, when the example in (5.30) is broken up after det kan ju bero på ..., the error marking is no longer highlighted.
Since many errors with non-finite verbs as the predicates of sentences occurred in Child Data, Grammatifix obtains a low recall value of 4%. False alarms were relatively few, which gives a precision rate of 36%.

Granska also checks for clauses where a finite verb form is missing. It detected altogether nine errors in verbs lacking tense endings, resulting in a recall of just 8%. Nine false flaggings occurred, mostly with imperatives, which gives a precision score of 44%. Some other alarms concerned exclamations, such as Grodan! 'Frog!' or Tyst! 'Silence!', or fragment clauses where no verb was used (29 alarms). These are excluded from the present analysis.

Scarrie explicitly checks verb forms in the predicate of a sentence and detected 17 errors in Child Data, with two diagnoses: 'wrong verb form in the predicate' or 'no inflected predicative verb'. Altogether 13 false flaggings occurred, due to the marking of correct finite verbs. One false alarm involved a split, as shown below in (5.31). Scarrie has the best result of the three systems for this error type, with 15% recall and 57% precision.

(5.31) Han ring de till mig sen och sa samma sak.
       (he call [pret] to me later and said same thing)
       – 'He phoned me later and said the same thing.'
       Scarrie's diagnosis: wrong verb form in predicate

In conclusion, the tools succeeded in detecting at most 17 cases of errors in finite verb form, a very low coverage rate for this frequent error type. The worst detection rate is Grammatifix's, the best Scarrie's.

Verb Form after Auxiliary Verb

All the tools include detection of errors in the verb form after auxiliary verbs. In Child Data, only one of these erroneous verb clusters included an inserted adverb, and one occurred in a coordinated verb. Grammatifix did not find any of these errors. Four instances of erroneous verb form after an auxiliary verb were detected by Granska.
The remaining three errors that were not detected are presented in (5.32): G6.1.2, a coordinated verb in (5.32a); G6.1.5, a verb with a preceding adverb in (5.32b); and G6.1.6, an auxiliary verb followed by a verb in imperative form in (5.32c).

(5.32) a. Ibland får [pres] man bjuda [inf] på sig själv och *låter [pres] henne/honom vara med!
– Sometimes one can make a sacrifice and let him/her take part.
b. han råkade [pret] bara *kom [pret] emot getingboet
– He just happened to come across the wasp's nest.
c. Det är något som vi alla nog skulle [pret] *gör [imp] om vi inte hade läst på ett prov.
– This is something that we all probably would do if we had not been studying for a test.

Five false alarms occurred at sentence boundaries. In (5.33a) we see an example where the end of a preceding direct-speech clause is not marked and the final verb is selected together with the main verb of the subsequent clause. Similarly, in (5.33b) the verb cluster ending a clause whose boundary is not marked is selected together with the (adverb and) initial main verb of the subsequent clause.

(5.33) a. ALARM: Jo, det kanske han kan [pres] sa [pret] pappa.
– No, maybe he can, said Daddy.
GRANSKA'S DIAGNOSIS: unusual with verb form sa 'said [pret]' after modal verb kan 'can [pres]'. ⇒ kan säga 'can [pres] say [inf]'
b. ALARM: precis när dom skulle [pret] börja [inf] så hörde [pret] dom en röst
– Just when they were about to begin, they heard a voice.
GRANSKA'S DIAGNOSIS: unusual with verb form hörde 'heard [pret]' after modal verb skulle 'would [pret]'. ⇒ skulle börja så ha hört 'would [pret] start [inf] so have [inf] heard [sup]' or skulle börja så höra 'would [pret] start [inf] so hear [inf]'
Granska's performance rates are 57% recall and 44% precision. Scarrie detected only one error in verb form after an auxiliary verb in Child Data (G6.1.6; see (5.32c) above) and made altogether nine false flaggings. Two false alarms occurred at sentence boundaries, one of them in the same instance as for Granska, see (5.33a) above. Scarrie ends up with a performance result of 14% recall and 10% precision.

In conclusion, Granska detects more than half of the verb errors after the auxiliary, but the performance of the other tools is very low, detecting either none or just one such error.

Missing Auxiliary Verb

All the tools check explicitly for supine verb forms without the infinitive form of the auxiliary ha 'have'. It is not clear if they also check for omission of the finite forms of the auxiliary verb in front of a bare supine. In Swedish, the bare supine is only used in subordinate clauses (see Section 4.3.5). Two errors with a bare supine form in main clauses were found in Child Data.

Grammatifix did not find these two errors. Instead, Grammatifix suggested insertion of the auxiliary verb ha 'have' in constructions between an auxiliary verb and a supine verb form. This is rather a stylistic correction and is not part of the present analysis. Altogether, nine such suggestions were made of the kind given below:

(5.34) ALARM: jag skulle [pret] ätit [sup] för en kvart sen
– I should have eaten a quarter of an hour ago.
GRAMMATIFIX'S DIAGNOSIS: Consider the word ätit 'eaten [sup]'. A verb such as skulle 'should [pret]' combines in polished style with ha 'have [inf]' + supine rather than with only a supine. ⇒ skulle ha ätit 'should [pret] have [inf] eaten [sup]'

The same happened with Granska: no errors were detected, and the suggestions made were for insertion of the auxiliary ha 'have' in front of supine forms preceded by auxiliary verbs.
Seven such flaggings occurred, as in (5.35); two further flaggings were false and occurred at sentence boundaries.

(5.35) ALARM: Jag måste [pret] svimmat [sup].
– I must have fainted.
GRANSKA'S DIAGNOSIS: unusual with verb form svimmat 'fainted [sup]' after the modal verb måste 'must [pret]'. ⇒ måste ha svimmat 'must [pret] have [inf] fainted [sup]'

Scarrie did find one of the error instances in Child Data with a missing auxiliary verb (G6.2.1). Eight other detections included the same stylistic issue as for the other tools, suggesting insertion of ha 'have' between an auxiliary verb and a supine verb form, as in:

(5.36) ALARM: de kunde [pret] berott [sup] på att dom gillade samma tjej
– It could have been because they liked the same girl.
SCARRIE'S DIAGNOSIS: wrong verb form after modal verb

In conclusion, just one of the two missing auxiliary verb errors in Child Data was found, by Scarrie. The systems pay more attention to the stylistic issue of omitted ha 'have' with supine forms, pointing out that the supine verb form should not stand alone in formal prose.

Verb Form in Infinitive Phrase

Granska and Scarrie search for erroneous verb forms following an infinitive marker and should not have problems finding these errors in Child Data, where only one instance included an adverb splitting the infinitive. Granska identified three errors in verb form after an infinitive marker, missing only the one with an adverb between the parts of the infinitive (G7.1.1; see (4.35) p. 62). This problem of syntactic coverage was already discussed in Section 5.4.4 in the examples in (5.13), which also showed that Granska does not take adverbs into consideration. Altogether, six false alarms occurred. Granska's overall performance rates are 75% recall and 33% precision.
Scarrie detected one of the errors in Child Data, where the infinitive marker is followed by a verb in imperative form instead of infinitive: att gör 'to do [imp]' (G7.1.4). Also, one false flagging occurred, shown in (5.37), where it seems that the system misinterpreted the conjunction för att 'because' as the infinitive marker att 'to':

(5.37) ALARM: så jag sa att hon skulle ta det lite lugnt för att annars så kan [pres] hon skada [inf] sig och det är ju inte så bra.
– So I said that she should take it easy a little because otherwise she might hurt herself and that is of course not so good.
SCARRIE'S DIAGNOSIS: inflected verb form after att 'to'

In conclusion, Granska finds all but one of the errors, missing that one due to insufficient syntactic coverage, and also makes quite a few false flaggings. Scarrie has difficulties with this error type and Grammatifix does not target it at all.

Missing Infinitive Marker with Verbs

All the tools check explicitly for both a missing and an extra inserted infinitive marker. Three errors with a missing infinitive marker occurred in Child Data, all in the context of the auxiliary verb komma 'will'. As presented in Section 4.3.5, certain main verbs also take an infinitive phrase as complement, and some drop the infinitive marker and start to behave like auxiliary verbs, which normally do not combine with an infinitive marker and take only bare infinitives as complement. This development is in progress in Swedish, which suggests that these constructions should rather be treated as stylistic issues.

Grammatifix did not find the three errors in Child Data with omitted infinitive markers with the auxiliary verb komma 'will' (see example (4.36) p. 62). In seven cases, the tool instead suggested removing the infinitive marker with the verbs börja 'begin' and tänka 'think', e.g. in (5.38):
a. ALARM: Jag och Virginia började [pret] att berätta [inf] om tromben och den övergivna byn
– Virginia and I started to tell about the tornado and the abandoned village.
GRAMMATIFIX'S DIAGNOSIS: Check the words att 'to' and berätta 'tell [inf]'. If an infinitive is governed by the verb började 'started [pret]', the infinitive should not be preceded by att 'to'. ⇒ började berätta 'started [pret] tell [inf]'
b. ALARM: 4 hus och 5 affärer var ordning gjorda av gumman som hade [pret] tänkt [sup] att göra [inf] museum av den gamla staden
– 4 houses and 5 shops were tidied up by the old lady who had planned to make a museum of the old city.
GRAMMATIFIX'S DIAGNOSIS: Check the words att 'to' and göra 'make [inf]'. If an infinitive is governed by the verb tänkt 'thought [sup]', the infinitive should not be preceded by att 'to'. ⇒ tänkt göra 'thought [sup] make [inf]'

Granska detected all three omitted infinitive markers in the context of the auxiliary verb komma 'will'. Six false flaggings also occurred in this case, concerning the same verb used as a main verb, e.g.:

(5.39) ALARM: han kommer [pres] och klappar alla på handen utan en kille undra (⇒ undrar) hur han känner sig då?
– He comes and pats everybody's hand except one boy. (I) wonder how he feels then?
GRANSKA'S DIAGNOSIS: kommer 'will' without att 'to' before verb in infinitive

In two cases, Granska also suggested insertion of the infinitive marker with the verbs fortsätta 'continue' and prova 'try'. In nine cases, it wanted to remove the infinitive marker with the verbs börja 'begin', försöka 'try', sluta 'stop' and tänka 'think'.
Scarrie detected two of the three missing infinitive marker errors with the verb komma 'will' found in Child Data. Quite a large number of false alarms (13) occurred with the verb used as a main verb, as in (5.40), where så is ambiguous between the conjunction 'so'/'and' and the verb reading 'sow'. The precision rate is then only 13%.

(5.40) ALARM: men kom nu så går vi hem
– But come now and we'll go home.
SCARRIE'S DIAGNOSIS: att 'to' missing

In five cases, Scarrie suggested removal of the infinitive marker in the context of the verbs börja 'begin', fortsätta 'continue' and sluta 'stop'.

In conclusion, whereas both Granska and Scarrie performed well, Grammatifix did not succeed in tracing any of the errors with omitted infinitive markers with the auxiliary verb komma 'will'. Overall, all the tools suggested both omission and insertion of infinitive markers with certain main verbs. In some cases they agree, but there are also cases where one system suggests removal of the infinitive marker and another suggests insertion. A clear indication of confusion in the use or omission of the infinitive marker showed up when Granska suggested inserting the infinitive marker in the verb sequence fortsätta leva 'continue live', as shown in (5.41a), whereas in (5.41b) Scarrie suggested removing it in the same verb sequence. This clearly indicates that the issue should be classified as a matter of style and not as a pure grammar error.

(5.41) a. ALARM: när jag dog 1978 i cancer återvände jag hit för att fortsätta [inf] leva [inf] mitt liv här
– When I died in 1978 of cancer, I returned here to continue living my life here.
DIAGNOSIS: Granska: ⇒ fortsätta att leva 'continue to live'
b. Vi fortsatte [pret] att leva [inf] som en hel familj i vårt nya hus här i Göteborg.
– We continued to live as a whole family in our new house here in Göteborg.
DIAGNOSIS: Scarrie: ⇒ fortsatte leva 'continued live'

Word Order Errors

All three tools check the position of adverbs (or negation) in subordinate clauses and constituent order in interrogative subordinate clauses. Scarrie also checks word order in main clauses with inversion. The word order errors found in Child Data are all quite complex, and none of the tools succeeded in detecting this type of error. However, false flaggings of correct sentences occurred.

Grammatifix made 15 false alarms when checking word order; one involved a split word and three occurred at clause boundaries. A false flagging involving a clause boundary is presented in (5.42a), where Grammatifix treated the adverb hem 'home' as wrongly placed between verbs. The problem is complicated not only by the second verb initiating a subsequent clause, but also by the fact that not all adverbs can precede verbs. Another false flagging is presented in (5.42b), where Grammatifix checked for adverbs placed after the main verb in the expected subordinate clause; here, however, main clause word order is found in the indirect speech construction.28

28 Main clause word order occurs when the clause expresses the speaker's or the subject's opinion or beliefs.

(5.42) a. ALARM: När vi kom hem undra (⇒ undrar) självklart mamma vart vi varit...
– When we came home, mother of course wondered where we had been.
GRAMMATIFIX'S DIAGNOSIS: Check the placement of hem 'home'. In a subclause an adverb is not usually placed between the verbs. Placement before the finite verb is often suitable.
b. ALARM: killen i luren sa att han kommer genast
– The guy on the phone said that he would come immediately.
GRAMMATIFIX'S DIAGNOSIS: Check the placement of genast 'immediately'. In a subclause a sentential adverb is by rule placed before the finite verb. ⇒ genast kommer 'immediately comes'

In (5.43) the sentence is erroneously marked as a word order error in the placement of negation. The problem, however, concerns the choice of the (explanative) conjunction för att 'since/due to', which combines with a main clause and is more typical of spoken Swedish (Teleman et al., 1999, Part 2:730). This conjunction corresponds to för 'due to/in order to' in writing and then coordinates only main clauses. It is often confused with the causal subjunction för att 'because/with the intention of', which is used only with subordinate clauses and then requires adverbs to be placed before the main verb (Teleman et al., 1999, Part 2:736).

(5.43) ALARM: ...då sa han ja för att han ville inte berätta för fröken att han var ensam
– ... then he said yes, because he did not want to tell the teacher that he was alone.
GRAMMATIFIX'S DIAGNOSIS: Check the placement of inte 'not'. In a subclause a sentential adverb is by rule placed before the finite verb. ⇒ inte ville 'not wanted'

All 15 of Granska's flaggings were false, interpreting conjunctions as subjunctions as in (5.44a), or not taking indirect speech into consideration as in (5.44b), where the subject's opinion is expressed by main clause word order and not the subordinate clause word order assumed by the tool.

(5.44) a. ALARM: ... men den gick av så jag hade bara lite gips kvar.
– ... but it broke off, so I only had a little plaster left.
GRANSKA'S DIAGNOSIS: Word order error, erroneous placement of adverb in subordinate clause. ⇒ bara hade 'just had'
b. ALARM: då tycker jag att det var inte hans fel utan deras.
– Then I think that it was not his fault but theirs.
GRANSKA'S DIAGNOSIS: Word order error, erroneous placement of adverb in subordinate clause. ⇒ inte var 'not was'

Scarrie's 11 diagnoses were also false, mostly of the type "subject taking the position of the verb" as in (5.45a), together with cases of interpreting conjunctions as subjunctions as in (5.45b):

(5.45) a. ALARM: Då vi kom till min by. Trillade jag av brand bilen för det var en guppig väg.
– When we arrived in my village, I fell off the fire engine because the road was bumpy.
SCARRIE'S DIAGNOSIS: the subject in the verb position
b. ALARM: dom kanske inte ens vet att man har prov för dom lyssnar inte på sitt barn ...
– They probably do not even know that you have a test, because they do not listen to their child ...
SCARRIE'S DIAGNOSIS: the inflected verb before sentence adverbial in subordinate clause

In conclusion, word order errors were hard to find due to their inner complexity. The tools seem to apply rather straightforward approaches, which resulted in many false flaggings.

Redundancy

According to the error specifications, only Grammatifix searches for repeated words and should thus be able to at least detect errors with doubled words. Grammatifix identified the five errors with duplicated words immediately following each other. The number of false alarms is quite high (18 occurrences). One example is given below:

(5.46) ALARM: Var var den där överraskningen.
– Where was that surprise?
GRAMMATIFIX'S DIAGNOSIS: doubled word

No other superfluous elements were detected, so the system ends up with a performance rate of 38% recall and 23% precision.
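The recall, precision and F-value figures quoted throughout this chapter are all derived from the raw alarm counts in the same way. As a reference, the following is a small sketch (in Python, not part of the original thesis) of that computation, using Grammatifix's finite verb form result from above (4 correct alarms, 7 false alarms, 110 errors in Child Data):

```python
def evaluate(correct_alarms, false_alarms, total_errors):
    """Recall, precision and balanced F-value from raw detection counts."""
    recall = correct_alarms / total_errors
    precision = correct_alarms / (correct_alarms + false_alarms)
    # F-value as the harmonic mean of precision and recall
    f_value = 2 * precision * recall / (precision + recall)
    return recall, precision, f_value

# Grammatifix on finite verb form errors: 4 correct alarms, 7 false alarms,
# 110 errors in Child Data (figures from the text above).
r, p, f = evaluate(4, 7, 110)
print(f"recall {r:.0%}, precision {p:.0%}, F-value {f:.0%}")
```

Running this reproduces the 4% recall, 36% precision and 7% F-value reported for Grammatifix on this error type.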
Missing Constituents

All three tools search for sentences with omitted verbs or infinitive markers, also in the context of a preceding preposition. Grammatifix did not find any missing verbs, but detected the only error with a missing infinitive marker in front of an infinitive verb after certain prepositions (G10.3.1), shown in (5.47).

(5.47) a. Efter — ha [inf] sprungit [sup] igenom häckarna två gånger så vilade vi lite ...
– After twice running through the hurdles, we rested a little.
b. Efter att ha sprungit 'after to have run'

Six false alarms occurred for this error type, mostly when the adverb tillbaka 'back' was erroneously split, as shown in (5.48). The problem is that the split word results in the preposition till 'to' and the verb baka 'bake'.

(5.48) ALARM: inget kvack kom till — baka
– no quack came back.
GRAMMATIFIX'S DIAGNOSIS: Check the word baka. If an infinitive is governed by a preposition it should be preceded by att 'to'.

In the case of omitted verbs, Granska only flagged occurrences of single words such as Slut. 'End.' or sentence fragments such as Tom grå och tyst. 'Empty, grey and silent.' or Inte ens pappa. 'Not even daddy.'. The program further suggested that the error might be a title: "Verb seems to be missing in the sentence. If this is a title it should not be ended with a period." Altogether, 25 sentences were judged to be missing a verb and 12 false alarms occurred. None of the errors listed in Child Data were detected by Granska. This particular error type is not included in the present performance analysis.

Granska also checks for missing subjects. Two cases concerned short sentence fragments and two were false flaggings, such as the one in (5.49) below.

(5.49) ALARM: Hade alla 7 vandrat förgäves?
– Had all seven walked in vain?
GRANSKA'S DIAGNOSIS: a subject seems to be missing in the sentence
Scarrie also checks for missing subjects and successfully detected the error G10.1.5, shown in (5.50). The other three flaggings were false. In the case of a missing infinitive marker in constructions where a preposition precedes an infinitive phrase, six false flaggings occurred. Like Grammatifix, Scarrie marks erroneous splits homonymous with prepositions (see (5.48) above).

(5.50) a. man försöker att lära barnen att om — fuskar med t ex ett prov då ...
– One tries to teach children that if they cheat on e.g. a test, then ...
b. om de fuskar med 'if they cheat with'

In conclusion, many of the omitted constituents are not covered by these tools and mostly result in false flaggings. Grammatifix successfully detected a missing infinitive marker preceded by a preposition, and Scarrie detected a missing subject.

Other Errors

Among other error types, all the tools also check whether a sentence has too many finite verbs. Grammatifix succeeded in finding three instances of unmarked sentence boundaries. In three cases, false flaggings occurred, listed in (5.51). Two of these flaggings concerned ambiguity between a verb and a pronoun, and the one in (5.51c) involved a spelling error that resulted in a verb. These alarms are not part of the system's performance test, since such errors were not the target of this analysis.

(5.51) a. ALARM: Han undrade var de var någonstans
– He wondered where they were.
GRAMMATIFIX'S DIAGNOSIS: Check the word forms undrade 'wondered' and var 'where/was'. It seems as if the sentence might have too many finite verbs.
b. ALARM: Var var den där överraskningen.
– Where was that surprise?
GRAMMATIFIX'S DIAGNOSIS: Check the word forms var 'where/was' and var 'where/was'. It seems as if the sentence might have too many finite verbs.
c. ALARM: Pojken blev red (⇒ rädd)
– The boy became afraid.
GRAMMATIFIX'S DIAGNOSIS: Check the word forms blev 'became' and red 'rode'. It seems as if the sentence might have too many finite verbs.

Granska checks for occurrences of other finite verbs after the copula verb vara 'be'. In Child Data, however, the only detections were false flaggings (8 occurrences), mostly due to homonymy between the verb and the adverb var 'where' (5 occurrences). Three false alarms occurred because of spelling errors, as in (5.52a), or at sentence boundaries, as in (5.52b):

(5.52) a. ALARM: Pojken blev [pret] red (⇒ rädd)
– The boy became afraid.
GRANSKA'S DIAGNOSIS: it is unusual to have a verb after the verb blev 'became [pret]'
b. ALARM: som tur var [pret] landade [pret] jag på skyddsnätet på brandbilen
– Luckily I landed on the safety net on the fire engine.
GRANSKA'S DIAGNOSIS: it is unusual to have a verb after the verb var 'was [pret]'

Scarrie also checks for occurrences of two finite verbs in a row, but provides a diagnosis of a possible sentence boundary as well. Eight sentence boundaries were found and eight false markings occurred, often due to lexical ambiguity, as in (5.53). In Scarrie's case as well, these alarms are not included in the analysis.

(5.53) ALARM: Men sen kom en tjej som visste vem jag var för hon ...
– But then came a girl who knew who I was, because she ...
SCARRIE'S DIAGNOSIS: two inflected verbs in predicate position or a sentence boundary

Finally, Scarrie checks noun case, suggesting the genitive form of proper nouns in constructions of a proper noun followed by a noun.
All these detections resulted in false flaggings, due to part-of-speech ambiguity, e.g.:

(5.54) ALARM: Men på morgonen när Erik [nom] såg att hans groda var försvunnen.
– But in the morning, when Erik saw that his frog had disappeared.
SCARRIE'S DIAGNOSIS: basic form instead of genitive

5.5.5 Overall Detection Results

In accordance with the error specifications of the systems, none of the Swedish tools detects errors in definiteness in single nouns or in reference, and only Grammatifix checks for repeated words among the redundancy errors. Missing constituents are checked only when a verb, subject or infinitive marker is missing. Word choice errors, represented by prepositions in idiomatic expressions, are checked by Granska. The detection results on Child Data, discussed in the previous section, are summarized in Tables 5.4, 5.5 and 5.6 below.

Among the most frequent error types in Child Data (errors in finite verbs, missing constituents, word choice errors, agreement in noun phrases and redundant words), Grammatifix succeeded in finding errors in four of these types, Scarrie in three and Granska in two. All the tools were best at finding errors in noun phrase agreement, with recall between 53% and 67% and precision between 7% and 37%. For the most common error, finite verb form, all obtained very low coverage, with recall between 4% and 15% and precision between 36% and 57%. Grammatifix succeeded in finding all the repeated words among the redundancy errors and one occurrence of a missing constituent; Scarrie also found one missing constituent. No word choice errors were found by Granska. Other error types in Child Data occurred fewer than ten times each, so no general conclusions can be drawn on how the tools performed on those.
Table 5.4: Performance Results of Grammatifix on Child Data
(Columns: Errors | Correct Alarm: Correct Diagnosis, Incorrect Diagnosis | False Alarm: No Error, Other Error | Recall, Precision, F-value. A dash marks an empty cell.)

Agreement in NP               15 |  7  1 |  4 16 | 53%  29%  37%
Agreement in PRED              8 |  1  – |  2  1 | 13%  25%  17%
Definiteness in single nouns   6 |  –  – |  –  – |  0%   –    –
Pronoun case                   5 |  2  – |  –  – | 40% 100%  57%
Finite Verb Form             110 |  3  1 |  5  2 |  4%  36%   7%
Verb Form after Vaux           7 |  –  – |  –  – |  0%   –    –
Vaux Missing                   2 |  –  – |  –  – |  0%   –    –
Verb Form after inf. marker    4 |  –  – |  –  – |  0%   –    –
Inf. marker Missing            3 |  –  – |  –  – |  0%   –    –
Word order                     5 |  –  – | 11  4 |  0%   0%   –
Redundancy                    13 |  5  – | 16  1 | 38%  23%  29%
Missing Constituents          44 |  1  1 |  6  – |  5%  25%   8%
Word Choice                   28 |  –  – |  –  – |  0%   –    –
Reference                      8 |  –  – |  –  – |  0%   –    –
Other                          4 |  –  – |  –  – |  0%   –    –
TOTAL                        262 | 18  4 | 38 30 |  8%  24%  12%

Table 5.5: Performance Results of Granska on Child Data

Agreement in NP               15 |  5  3 |  8 17 | 53%  24%  33%
Agreement in PRED              8 |  3  – |  3  2 | 38%  38%  38%
Definiteness in single nouns   6 |  –  – |  –  – |  0%   –    –
Pronoun case                   5 |  3  – | 24  – | 60%  11%  19%
Finite Verb Form             110 |  8  1 |  8  1 |  8%  50%  14%
Verb Form after Vaux           7 |  4  – |  5  – | 57%  44%  50%
Vaux Missing                   2 |  –  – |  2  – |  0%   0%   –
Verb Form after inf. marker    4 |  3  – |  6  – | 75%  33%  46%
Inf. marker Missing            3 |  3  – |  6  – | 100% 33%  50%
Word order                     5 |  –  – | 15  – |  0%   0%   –
Redundancy                    13 |  –  – |  –  – |  0%   –    –
Missing Constituents          44 |  –  – |  2  – |  0%   0%   –
Word Choice                   28 |  –  – |  –  – |  0%   –    –
Reference                      8 |  –  – |  –  – |  0%   –    –
Other                          4 |  –  – |  –  – |  0%   –    –
TOTAL                        262 | 29  4 | 79 20 | 13%  25%  17%

Table 5.6: Performance Results of Scarrie on Child Data

Agreement in NP               15 |  8  2 | 83 50 | 67%   7%  13%
Agreement in PRED              8 |  –  – | 12  1 |  0%   0%   –
Definiteness in single nouns   6 |  –  – |  –  – |  0%   –    –
Pronoun case                   5 |  3  – | 17  – | 60%  15%  24%
Finite Verb Form             110 | 16  1 | 13  – | 15%  57%  24%
Verb Form after Vaux           7 |  1  – |  7  2 | 14%  10%  12%
Vaux Missing                   2 |  1  – |  –  – | 50% 100%  67%
Verb Form after inf. marker    4 |  1  – |  1  – | 25%  50%  33%
Inf. marker Missing            3 |  2  – | 13  – | 67%  13%  22%
Word order                     5 |  –  – | 11  – |  0%   0%   –
Redundancy                    13 |  –  – |  –  – |  0%   –    –
Missing Constituents          44 |  1  – |  4  5 |  2%  10%   3%
Word Choice                   28 |  –  – |  –  – |  0%   –    –
Reference                      8 |  –  – |  –  – |  0%   –    –
Other                          4 |  –  – |  –  – |  0%   –    –
TOTAL                        262 | 33  3 | 161 58 | 14%  14%  14%

Overall performance figures for detecting the errors in Child Data show that Grammatifix did not detect many of the verb errors at all and has the lowest recall. Scarrie, on the other hand, detects the most errors of them all, but has a high number of false flaggings. Errors in agreement with a predicative complement were hard to find in general, even in cases where the subject and the predicate were adjacent; more complex structures obviously pose even more of a problem for the tools. Even when errors were found in these constructions, the tools often gave an incorrect diagnosis. Quite many of the false flaggings involved errors other than grammatical ones.

The overall performance of the tools across all error types on Child Data ends up at a recall rate of at most 14% and a precision rate between 14% and 25%. Grammatifix detected the fewest errors and had the fewest false alarms, but its quite low recall leads to the lowest F-value, 12%. Granska found slightly more errors and had more false flaggings, obtaining the best F-value of 17%. Scarrie performed best of the tools in grammatical coverage, but at the cost of many false alarms, giving an F-value of 14%.

In Table 5.7 the overall performance of the systems is presented for the errors they target specifically, excluding the zero-result categories. Observe that the F-values are slightly higher due to increased recall; precision rates remain the same.
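The derived columns of the targeted-error summary in Table 5.7 follow directly from its count columns. A quick sketch (Python, not part of the original thesis) recomputing recall, precision and F-value from the raw counts:

```python
# Raw counts from Table 5.7: (tool, targeted errors, correct alarms, false alarms).
table_5_7 = [
    ("Grammatifix", 166, 22, 68),
    ("Granska", 174, 33, 97),
    ("Scarrie", 170, 36, 214),
]

for tool, errors, correct, false_alarms in table_5_7:
    recall = correct / errors                      # share of targeted errors found
    precision = correct / (correct + false_alarms) # share of alarms that were correct
    f_value = 2 * precision * recall / (precision + recall)
    print(f"{tool}: recall {recall:.0%}, precision {precision:.0%}, F {f_value:.0%}")
```

The printed figures match the percentages in the table (e.g. Granska: recall 19%, precision 25%, F 22%).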
Table 5.7: Performance Results of Targeted Errors29

Tool         | Errors | Correct Alarm | False Alarm | Recall | Precision | F-value
Grammatifix  |  166   |      22       |     68      |  13%   |   24%     |  17%
Granska      |  174   |      33       |     97      |  19%   |   25%     |  22%
Scarrie      |  170   |      36       |    214      |  21%   |   14%     |  17%

The performance tests on published adult texts and some student papers provided by the developers of these tools (see Table 5.3 on p. 141) show on average much higher rates for those texts, with overall coverage between 35% and 85% and precision between 53% and 77%. Granska proves best at detecting errors in verb form in the adult text data evaluated by its developers, with a recall rate of 97%. Verb form errors are mostly represented by errors in finite verb form in Child Data, where Granska obtained a recall of 8%. Other types of verb errors occurred fewer than ten times, which makes the performance results uncertain. For agreement errors in noun phrases, the second-best category for Granska when tested on adult texts, Granska obtained much better results and detected at least half of the errors, with a recall of 53%.

Since the error frequency is much higher in texts written by children, the size of the Child Data corpus can be considered satisfactory for evaluation, at least for the most frequent error types. This performance test shows that the three Swedish tools, designed in the first place for adult writers, have in general difficulty detecting errors in texts such as Child Data. As indicated in some examples, this is not only due to insufficient coverage of the defined error types in the systems. The structure of the texts may also cause certain errors not to be detected or to be erroneously marked as errors. Different results were sometimes obtained when sentences were split into smaller units.

29 Grammatifix: redundancy includes 5 errors in doubled word; missing constituents are counted as infinitive marker (1) and verb (5). Granska: missing verb (5), choice of preposition (10).
Scarrie: missing subject (10), missing infinitive marker (1).

5.6 Summary and Conclusion

From the above analyses it is clear that among the grammar errors found in Child Data, all non-structural errors and some types of structural errors should be possible to detect by syntactic analysis and partial parsing, whereas other errors require more complex analysis or wider context. Among the central error types in Child Data, errors in finite verb form and agreement errors in noun phrases can be handled by partial parsing, as I will show in Chapter 6. The other frequent errors, such as missing constituents, word choice errors and redundant words forming new lemmas, require deeper analysis. Furthermore, some real-word spelling errors might be detected if they violate syntax. Missing punctuation at sentence boundaries requires analysis of at least the predicate's complement structure.

Judging from the error specifications, all the errors in Child Data except definiteness in single nouns and reference seem to be more or less covered by the Swedish tools. The performance results show that agreement errors in noun phrases are the best-covered error type, whereas errors in finite verb forms, relative to their frequency, obtained a very low recall in all three systems. Grammatifix had in general difficulty detecting any errors concerning verbs; Granska performed best in this respect. Overall, all the tools detect few errors in Child Data and the precision rate is quite low. It is not clear how many of the missed errors were due to insufficient syntactic coverage and how many to the complexity of the sentences in Child Data. That is, all three tools rely on sentences being the unit of analysis, but "sentences" in Child Data do not always correspond to syntactic sentences: they often consist of adjoined clauses or are quite long (see Section 4.6). These tools are not designed to handle such complex structures.
In conclusion, many errors that can be handled by partial parsing in Child Data are detected at a rate of not more than 60% by the Swedish grammar checkers. Errors in finite verb form obtained quite low results and are the type of error that needs the most improvement, especially since they are the most common error in Child Data.

Chapter 6

FiniteCheck: A Grammar Error Detector

6.1 Introduction

This chapter reports on automatic detection of some of the grammar errors discussed in Chapter 4. The challenge of this part of the work is to exploit correct descriptions of language, instead of describing the structure of errors, and to apply finite state techniques to the whole process of error detection. The implemented grammar error detector FiniteCheck identifies grammar errors using partial finite state methods, recognizing syntactic patterns through a set of regular grammar rules (see Section 6.2.4). Constraints are used to reduce alternative parses or adjust the parsing result. There are no explicit error rules in the grammars of the system, in the sense that no grammar rules state the syntax of erroneous (ungrammatical) patterns. The rules of the grammar are always positive and define the grammatical structure of Swedish. The only error-related constraints concern the context of the error type. The present grammar is highly corpus-oriented, based on the lexical and syntactic circumstances displayed in the Child Data corpus. Ungrammatical patterns are detected by adopting the same method that Karttunen et al. (1997a) use for the extraction of invalid date expressions, presented in Section 6.2.4. In short, potential candidates for grammatical violations are identified through a broad grammar that overgenerates and also accepts invalid (ungrammatical) constructions. Valid (grammatical) patterns are defined in a second, narrow grammar, and the ungrammaticalities among the selected candidates are identified as the difference between these two grammars.
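The difference-based idea can be sketched with ordinary Python sets standing in for regular languages. This is a minimal illustration only: the tag names and the two-verb clusters are invented stand-ins, not the system's actual tagset or grammars, and the real system operates on finite state networks rather than finite sets.

```python
# Sketch of error detection by grammar subtraction, with finite sets of
# tag sequences in place of regular languages. Tag names are illustrative.

# Broad grammar: any sequence of two verbs counts as a candidate verb cluster.
verbs = {"vb_prs", "vb_inf", "vb_sup"}
broad = {(v1, v2) for v1 in verbs for v2 in verbs}

# Narrow grammar: only the grammatical combination is listed here,
# e.g. a finite (modal) verb followed by an infinitive.
narrow = {("vb_prs", "vb_inf")}

# The ungrammatical clusters are simply the set difference, broad - narrow.
errors = broad - narrow

assert ("vb_prs", "vb_prs") in errors      # e.g. *"ska blir"
assert ("vb_prs", "vb_inf") not in errors  # e.g. "ska bli"
```

Because the narrow grammar only ever states what is grammatical, no error pattern needs to be predicted in advance; anything the broad grammar selects that the narrow grammar does not sanction falls out as an error.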
In other words, the strings selected by the rules of the broad grammar that are not accepted by the narrow grammar are the remaining ungrammatical patterns.

The current system looks for errors in noun phrase agreement and verb form, such as the selection of finite and non-finite verb forms in main and subordinate clauses and infinitival complements. Errors in the finite form of the main verb were the most natural choice for implementation, since these are the most frequent error type in the Child Data corpus, represented by 110 error instances (see Figure 4.1 on p.73). Moreover, verb form errors are possible to detect using partial parsing techniques (see Section 5.3.3). Inclusion of errors in the finite main verb motivated the expansion of this category to other errors related to verbs, with the addition of other types of finite verb errors and errors in non-finite verb forms. Errors in noun phrase agreement were among the five most frequent error types. In comparison to other writing populations, this type of error might be considered one of the central error types in Swedish (see Section 4.7). Furthermore, noun phrase errors are limited to within the noun phrase and can most likely be detected by partial parsing (see Section 5.3). The other errors among the five most common error types in Child Data, including word choice errors and errors with extra or missing constituents, are not locally restricted in this way and will certainly require a more complex analysis.

The development of the grammar error detector started with the project Finite State Grammar for Finding Grammatical Errors in Swedish Text (1998–1999). It was part of a larger project, Integrated Language Tools for Writing and Document Handling, in collaboration with the Numerical Analysis and Computer Science Department (NADA) at the Royal Institute of Technology (KTH) in Stockholm.1 The project group in Göteborg consisted of Robin Cooper, Robert Andersson and myself.
In the description of the system I will cover the whole system and its functionality, in particular my own contributions, which mainly concern a first version of the lexicon, the expansion of the grammar and its adjustment to the present corpus data of children's texts, disambiguation and other adjustments to parsing results, as well as evaluation and improvements made to the system's flagging accuracy. The work of the other two members concerns primarily the final version of the lexicon, optimization of the tagset, the basic grammar and the system interface. I will not discuss their contributions in detail but will refer to the project reports when relevant.

The chapter proceeds with a short introduction to finite state techniques and parsing (Section 6.2). The description of FiniteCheck starts with an overview of the system's architecture, including short presentations of the different modules (Section 6.3). Then follows a section on the composition of the lexicon, with a description of the tagset and the identification of grammatical categories and features (Section 6.4). Next, the overgenerating broad grammar set is presented (Section 6.5), followed by a section on parsing (Section 6.6). The chapter then proceeds with a presentation of the narrow grammar of noun phrases and the verbal core (Section 6.7) and the actual error detection (Section 6.8). The chapter concludes with a summary (Section 6.9). Performance results of FiniteCheck are presented in Chapter 7.

1 The project was sponsored by the HSFR/NUTEK Language Technology Programme. See http://www.ling.gu.se/~sylvana/FSG/ for the methods and goals of our part of the project.

6.2 Finite State Methods and Tools

6.2.1 Finite State Methods in NLP

Finite state technology as such has been used since the emergence of computer science, for instance for program compilation, hardware modeling or database management (Roche, 1997).
Finite state calculus is generally considered powerful and well-designed, providing flexible, space- and time-efficient engineering applications. However, in the domain of Natural Language Processing (NLP), finite state models were long considered efficient but somewhat inaccurate, often resulting in applications of limited size. Other formalisms such as context-free grammars were preferred and considered more accurate than finite state methods, despite difficulties in reaching reasonable efficiency. Thus, grammars approximated by finite state models were considered more efficient and simpler, but at the cost of a loss of accuracy. Improvements in the mathematical properties of finite state methods and a reexamination of their descriptive possibilities enabled the emergence of applications for a variety of NLP tasks, such as morphological analysis (e.g. Karttunen et al., 1992; Clemenceau and Roche, 1993; Beesley and Karttunen, 2003), phonetic and speech processing (e.g. Pereira and Riley, 1997; Laporte, 1997) and parsing (e.g. Koskenniemi et al., 1992; Appelt et al., 1993; Abney, 1996; Grefenstette, 1996; Roche, 1997; Schiller, 1996).

In this section the finite state formalism is described, along with possibilities for the compilation of such devices (Section 6.2.2). Next, the Xerox compiler used in the present implementation is presented (Section 6.2.3). The techniques of finite state parsing are then explained, along with a description of a method for extracting invalid input from unrestricted text that plays an important role in the present implementation (Section 6.2.4).

6.2.2 Regular Grammars and Automata

Adopting finite state techniques in parsing means modeling the syntactic relations between words using regular grammars2 and applying finite state automata to recognize (or generate) the corresponding patterns defined by such grammars.
A finite state automaton is a computational model representing the regular expressions defined in a regular grammar. It takes a string of symbols as input, executes some operations in a finite number of steps and halts, with the outcome interpreted, depending on the grammar, as the machine either accepting or rejecting the input. It is defined formally as a tuple consisting of a finite set of symbols (the alphabet), a finite set of states with a unique initial state, a number of intermediate states and final states, and finally a transition relation defining how to proceed between the different states.3

Regular expressions represent sets of simple strings (a language) or sets of pairs of strings (a relation) mapping between two regular languages, upper and lower. Regular languages are represented by simple automata and regular relations by transducers. Transducers are bi-directional finite state automata, which means, for example, that the same automaton can be used for both analysis and generation.

Several tools for the compilation of regular expressions exist. AT&T's FSM Library4 is a toolbox designed for building speech recognition systems and supports the development of phonetic, lexical and language-modeling components. The compiler runs under UNIX and includes about 30 commands to construct weighted finite-state machines (Mohri and Sproat, 1996; Pereira and Riley, 1997; Mohri et al., 1998). FSA Utilities5 is another compiler, developed in the first place for experimental purposes in applying finite-state techniques to NLP. The tool is implemented in SICStus Prolog and makes it possible to compile new regular expressions from the basic operations, thus extending the set of regular expressions handled by the system (van Noord and Gerdemann, 1999). The compiler used in the present implementation is the Xerox Finite-State Tool, one of Xerox's software tools for computing with finite state networks, described further in the subsequent section.
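The tuple definition above can be made concrete in a few lines of code. The following is a minimal sketch, with an invented alphabet and language (the strings matching a b*), not any automaton used in FiniteCheck:

```python
# A minimal deterministic finite state automaton, following the tuple
# definition: an alphabet, a set of states, a unique initial state,
# final states, and a transition relation.
ALPHABET = {"a", "b"}
STATES = {0, 1}
INITIAL = 0
FINAL = {1}
TRANSITIONS = {(0, "a"): 1, (1, "b"): 1}  # accepts the language a b*

def accepts(string):
    """Run the automaton on the input and accept or reject it."""
    state = INITIAL
    for symbol in string:
        if symbol not in ALPHABET or (state, symbol) not in TRANSITIONS:
            return False  # no transition defined: the input is rejected
        state = TRANSITIONS[(state, symbol)]
    return state in FINAL  # accept only if halted in a final state

assert accepts("abb")
assert not accepts("ba")
```

The machine halts after exactly one step per input symbol, which is what makes recognition with finite state automata run in time linear in the length of the input.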
2 Regular grammars are also called type-3 in the classification introduced by Noam Chomsky (Chomsky, 1956, 1959).
3 See e.g. Hopcroft and Ullman (1979); Boman and Karlgren (1996) for exact formal definitions of finite state automata. A 'gentle' introduction is presented in Beesley and Karttunen (2003).
4 The homepage of AT&T's FSM Library: http://www.research.att.com/sw/tools/fsm/
5 The homepage of FSA Utilities: http://www.let.rug.nl/~vannoord/Fsa/

6.2.3 Xerox Finite State Tool

Introduction

Xerox research developed a system for computing with and compiling finite-state networks, the Xerox Finite State Tool (XFST).6 The tool is a successor to two earlier interfaces: IFSM, created at PARC by Lauri Karttunen and Todd Yampol in 1990–92, and FSC, developed at RXRC by Pasi Tapanainen in 1994–95 (Karttunen et al., 1997b). The system runs under UNIX and is supplemented with an interactive interface and a compiler. Finite state networks of simple automata or transducers are compiled from regular expressions and can be saved into a binary file. The networks can also be converted to Prolog format.

The Regular Expression Formalism

The metalanguage of regular expressions in XFST includes a set of basic operators such as union (or), concatenation, optionality, ignoring, iteration, complement (negation), intersection (and), subtraction (minus), crossproduct and composition, and an extended set of operators such as containment, restriction and replacement. The notational conventions for the part of the regular expression formalism in XFST that is used in the present implementation, including the operators and atomic expressions, are presented in Table 6.1 (cf. Karttunen et al., 1997b; Beesley and Karttunen, 2003). Uppercase letters such as A here denote regular expressions. For a description of the syntax and semantics of these operators see Karttunen et al. (1997a).
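For finite languages, the Boolean operators among these behave exactly like ordinary set operations, which gives a quick intuition for what the compiler computes. A small Python illustration with invented two-letter "languages" (the real operands in XFST are infinite regular languages represented as networks):

```python
# For finite languages, union, intersection and subtraction of regular
# languages coincide with the ordinary set operations on their strings.
A = {"ab", "ba", "aa"}
B = {"ab", "bb"}

assert A | B == {"ab", "ba", "aa", "bb"}  # union        A|B
assert A & B == {"ab"}                    # intersection A&B
assert A - B == {"ba", "aa"}              # subtraction  A-B

# Concatenation A B: every string of A followed by every string of B.
concat = {x + y for x in A for y in B}
assert "abab" in concat and len(concat) == 6
```

It is the subtraction operator in particular that the error detection in this thesis builds on.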
The replacement operators play an important role in the present implementation and are further explained below.

6 Technical documentation and a demonstration of XFST can be found at: http://www.rxrc.xerox.com/research/mltt/fst/

Table 6.1: Some Expressions and Operators in XFST

Atomic expressions
  0          epsilon symbol (the empty string)
  ?, ?*      any (unknown) symbol; the universal language
Unary operations
  A*         iteration: zero or more (Kleene star)
  A+         iteration: one or more (Kleene plus)
  (A)        optionality
  $A         containment
  ~A         complement (not)
Binary operations
  A B        concatenation
  A | B      union (or)
  A & B      intersection (and)
  A/B        ignoring
  A .o. B    composition
  A - B      subtraction (minus)
  A → B      replacement (simple)

Replacement Operators

The original version of the replacement operator was developed by Ronald M. Kaplan and Martin Kay in the early 1980s and was applied to phonological rewrite rules implemented by finite state transducers. Replacement rules can be applied in an unconditional version or constrained by context or direction (Karttunen, 1995, 1996). Simple (unconditional) replacement has the format UPPER → LOWER, denoting the regular relation (Karttunen, 1995):7

(RE6.1) [ NO_UPPER [UPPER .x. LOWER] ]* NO_UPPER;

For example, the relation [a b c → d e]8 maps the string abcde to dede. Replacement may start at any point and allow alternative replacements, making these transducers non-deterministic and able to yield multiple results. For example, a transducer represented by the regular expression in (RE6.2) produces four different results (axa, ax, xa, x) for the input string aba, as shown in (6.1) (Karttunen, 1996).

7 NO_UPPER corresponds to ~$[UPPER - []].
8 Lower-case letters, such as a, represent symbols. Symbols can be unary (e.g. a, b, c) or symbol pairs (e.g. a:x, b:0) denoting relations (i.e. transducers). An identity relation, where a symbol maps to the same symbol as in a:a, is ignored and thus written as a.
(RE6.2) a b | b | b a | a b a → x

(6.1) a b a → a x a   (b replaced)
      a b a → a x     (ba replaced)
      a b a → x a     (ab replaced)
      a b a → x       (aba replaced)

The directionality and the length of the replacement can be constrained by the directed replacement operators. The replacement can start from the left or from the right, choosing the longest or the shortest replacement. Four types of directed replacement are defined (Karttunen, 1996):

Table 6.2: Types of Directed Replacement

                 longest match   shortest match
  left-to-right  @→              @>
  right-to-left  →@              >@

Now, applying the same regular expression as above with left-to-right, longest-match replacement, as in the regular expression in (RE6.3), yields just one result for the string aba, as shown in (6.2).

(RE6.3) a b | b | b a | a b a @→ x

(6.2) a b a → x   (aba, the longest match, replaced)

Directed replacement is defined as a composition of four relations that are composed in advance by the XFST compiler. The advantage is that the replacement takes place in one step, without any additional levels or symbols. For instance, the left-to-right longest-match replacement UPPER @→ LOWER is composed of the following relations (Karttunen, 1996):

(6.3) Input string
      .o. Initial match
      .o. Left-to-right constraint
      .o. Longest-match constraint
      .o. Replacement

With these operators, transducers that mark (or filter) patterns in text can be constructed easily. For instance, strings can be inserted before and after a string that matches a defined regular expression. For this purpose a special insertion symbol "..." is used on the right-hand side to represent the string that is found matching the left-hand side: UPPER @→ PREFIX ... SUFFIX. Following an example from Karttunen (1996), a noun phrase that consists of an optional determiner (d), any number of adjectives a* and one or more nouns n+ can be marked using the regular expression in (RE6.4), mapping dannvaan into [dann]v[aan] as shown in (6.4).
Thus, the expression compiles to a transducer that inserts brackets around maximal instances of the noun phrase pattern.

(RE6.4) (d) a* n+ @→ %[ ... %]

(6.4) dannvaan → [dann]v[aan]

The replacement can be further constrained by a specific context, both on the left and on the right of a particular pattern: UPPER @→ LOWER || LEFT _ RIGHT (see Karttunen, 1995, for further variations). Furthermore, the replacement can be parallel, meaning that multiple replacements are performed at the same time (see Kempe and Karttunen, 1996). For instance, the regular expression in (RE6.5) denotes a constrained parallel replacement, where the symbol a is replaced by the symbol b and at the same time the symbol b is replaced by c. Both replacements occur at the same time, and only if the symbols are preceded by the symbol x and followed by the symbol y. Applying this automaton to the string xaxayby then yields the string xaxbyby, and applying it to the string xbybyxa yields xcybyxa, as presented in (6.5).

(RE6.5) a → b , b → c || x _ y

(6.5) xaxayby → xaxbyby
      xbybyxa → xcybyxa

6.2.4 Finite State Parsing

Introduction

New approaches to parsing with the finite state formalism show that the calculus can be used to represent complex linguistic phenomena accurately, and that large-scale lexical grammars can be represented in a compact way (Roche, 1997). There are various techniques for creating careful representations while increasing efficiency. For instance, parts of rules that are similar are represented only once, reducing the whole set of rules; for each state only one unique outgoing transition is allowed (determinization); and an automaton can be reduced to a minimal number of states (minimization). Moreover, one can create bi-directional machines, where the same automaton can be used for both parsing and generation. Applications of finite state parsing are found mostly in the fields of terminology extraction, lexicography and information retrieval for large-scale text.
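Marking transducers of the kind in (RE6.4) are the workhorse of this style of parsing. Their left-to-right, longest-match behaviour can be approximated with Python's greedy regular expressions, which also scan left to right and take the longest match at each position. This is a rough analogue for illustration, not the XFST implementation:

```python
import re

# Approximate the marking transducer (d) a* n+ @-> %[ ... %] with a
# greedy regular expression: re.sub scans left to right and, at each
# starting position, the greedy quantifiers prefer the longest match.
noun_phrase = re.compile(r"d?a*n+")

def mark(text):
    # Wrap each maximal noun phrase match in brackets, like the
    # insertion symbol "..." does in XFST.
    return noun_phrase.sub(lambda m: "[" + m.group(0) + "]", text)

assert mark("dannvaan") == "[dann]v[aan]"
```

The analogy is not exact in general (backtracking regexes and finite state longest-match replacement can differ on some patterns), but for simple chunking patterns like this one the two coincide.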
The methods are more "partial" in the sense that the goal is not the production of complete syntactic descriptions of sentences, but rather the recognition of various syntactic patterns in a text (e.g. noun phrases, verbal groups).

Parsing Methods

Many finite-state parsers adopt the chunking techniques of Abney (1991) and collect sets of pattern rules into ordered sequences of a finite number of levels, so-called cascades, where the result of one level is the input to the next level (e.g. Appelt et al., 1993; Abney, 1996; Chanod and Tapanainen, 1996; Grefenstette, 1996; Roche, 1997). The parsing procedure over a text tagged for parts of speech usually proceeds by marking the boundaries of adjacent patterns, such as noun or verbal groups; then the nominal and verbal heads within these groups are identified. Finally, patterns between non-adjacent heads are extracted, identifying syntactic relations between words within and across group boundaries. For this purpose, finite state transducers are used. The automata are applied both as finite state markers, which introduce extra symbols such as surrounding brackets into the input (as exemplified in the previous section), and as finite state filters, which extract and label patterns. Usually a combination of non-finite state methods and finite state procedures is applied, but the whole parser can be built as a finite state system (see further Karttunen et al., 1997a).

The first application of finite state transducers to parsing was a parser developed at the University of Pennsylvania between 1958 and 1959 (Joshi and Hopely, 1996).9 The parser is essentially a cascade of finite state transducers, and its parsing style resembles Abney's "chunking" parser (Abney, 1991). Syntactic patterns were constructed using subcategorization frames and local grammars, recognizing simple NPs, PPs, AdvPs, simple verb clusters and clauses.
All of the modules of the parser, including dictionary look-up and part-of-speech disambiguation, are finite state computations, except for the module for the recognition of clauses.

9 The original version of the parser is presented in Joshi (1961). Up-to-date information about the reconstructed version of this parser, Uniparse, can be accessed from: http://www.cis.upenn.edu/~phopely/tdap-fe-post.html.

Besides Abney's chunking approach (Abney, 1991, 1996), a constructive finite state parsing over collections of syntactic patterns and local grammars, others use this technique to locate noun phrases (or other basic phrases) in unrestricted text (e.g. Appelt et al., 1993; Schiller, 1996; Senellart, 1998). Further, Grefenstette (1996) uses this technique to mark syntactic functions such as subject and object. Other approaches to finite-state parsing start from a large number of alternative analyses and, through the application of constraints in the form of elimination or restriction rules, reduce the alternative parses (e.g. Voutilainen and Tapanainen, 1993; Koskenniemi et al., 1992). These techniques have also been used for the extraction of noun phrases or other basic phrases (e.g. Voutilainen, 1995; Chanod and Tapanainen, 1996; Voutilainen and Padró, 1997).

Salah Ait-Mokhtar and Jean-Pierre Chanod constructed a parser that combines the constructive and reductionist approaches. The system defines segments by constraints rather than patterns. They mark potential beginnings and ends of phrases and use replacement transducers to insert phrase boundaries. Incremental decisions are made throughout the whole parsing process, but at each step linguistic constraints may eliminate or correct some of the previously added information (Ait-Mokhtar and Chanod, 1997).

In the case of Swedish, finite state methods have been applied on a small scale to lexicography and information extraction.
A Swedish regular expression grammar was implemented early at Umeå University, parsing a limited set of sentences (Ejerhed and Church, 1983; Ejerhed, 1985). More recently, a cascaded finite state parser, Cass-Swe, was developed for the syntactic analysis of Swedish (Kokkinakis and Johansson Kokkinakis, 1999), based on Abney's parser. Here the regular expression patterns are applied in cascades ordered by complexity and length to recognize phrases. The output of one level in the sequence is used as input to the subsequent level, starting from tagging and syntactic labeling and proceeding to the recognition of grammatical functions. The grammar of Cass-Swe has been semi-automatically extracted from written text by the application of probabilistic methods, such as mutual information statistics, which allow the exclusion of incorrect part-of-speech n-grams (Magerman and Marcus, 1990), and by looking at which function words signal boundaries between phrases and clauses.

Discrimination of Input

One parsing application using finite state methods, presented by Karttunen et al. (1997a), aims at the extraction not only of valid expressions, but also of invalid patterns occurring in free text due to errors and misprints. The method is applied to date expressions, and the idea is simply to define two language sets: one that overgenerates and accepts all date expressions, including dates that do not exist, and one that defines only correct date expressions. The language of invalid dates is then obtained by subtracting the more specific language from the more general one. Thus, by distinguishing the valid date expressions within the language of all date expressions, we obtain the set of expressions corresponding to invalid dates, i.e. those dates not accepted by the language of valid expressions.

To illustrate, the definitions in Karttunen et al.
(1997a) cover date expressions from January 1, 1 to December 31, 9999 and are represented by a small finite state automaton (13 states, 96 arcs) that accepts date expressions consisting of a day of the week, or a month and a date with or without a year, or a combination of the two, as defined in (RE6.6a) (SP is a separator consisting of a comma and a space, i.e. ', '). The parser for that language, presented in (RE6.6b), is constrained by the left-to-right, longest-match replacement operator, which means that only the maximal instances of such expressions are accepted. However, this automaton also accepts dates that do not exist, such as "April 31", which exceeds the maximum number of days for that month. Other problems concern leap days and the relationship between the day of the week and the date. A new language is defined by intersecting constraints on invalid types of dates with the language of date expressions, as presented in (RE6.6c).10 This much larger automaton (1346 states, 21006 arcs) accepts only valid date expressions, and again a transducer marks the maximal instances of such dates, see (RE6.6d).

(RE6.6) a. DateExpression = Day | (Day SP) Month " " Date (SP Year)
        b. DateExpression @→ %[ ... %]
        c. ValidDate = DateExpression & MaxDaysInMonth & LeapDays & WeekDayDates
        d. ValidDate @→ %[ ... %]

As the authors point out, it may be of use to distinguish valid dates from invalid ones, but in practice we also need to recognize the invalid dates that occur in real text corpora due to errors and misprints. For this purpose we do not need to define a new language that reveals the structure of invalid dates. Instead, we make use of the already defined languages of all date expressions, DateExpression, and of valid dates, ValidDate, and obtain the language of invalid dates by subtracting one language set from the other: [DateExpression - ValidDate].

10 For more detail on the separate definitions of the constraints see Karttunen et al. (1997a).
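Restricted to a finite sample, the subtraction [DateExpression - ValidDate] can be tried out directly. In the sketch below, a finite set of "Month Day" strings stands in for the overgenerating date automaton, and the real month lengths stand in for the MaxDaysInMonth constraint; both are simplified stand-ins for the automata discussed in the text:

```python
import calendar

# A toy version of [DateExpression - ValidDate]: the broad "language" is
# a finite set of candidate "Month Day" strings (days 1-31 for every
# month), and validity is checked against the real month lengths.
# 2001 is a non-leap year, so February has 28 days here.
MONTH_DAYS = {name: calendar.monthrange(2001, i + 1)[1]
              for i, name in enumerate(calendar.month_name[1:])}

date_expressions = {f"{m} {d}" for m in MONTH_DAYS for d in range(1, 32)}
valid_dates = {f"{m} {d}" for m, n in MONTH_DAYS.items()
               for d in range(1, n + 1)}

# Subtracting the narrow language from the broad one leaves the errors.
invalid_dates = date_expressions - valid_dates

assert "April 31" in invalid_dates      # the example from the text
assert "April 30" not in invalid_dates
assert "February 30" in invalid_dates
```

The same subtraction, performed on finite state networks instead of Python sets, yields the compact automaton for invalid dates without ever describing their structure explicitly.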
A parser that identifies maximal instances of date expressions is presented in (RE6.7); it tags both the valid (VD) and the invalid (ID) dates.

(RE6.7) [ [DateExpression - ValidDate] @→ "[ID " ... %] ,
          ValidDate @→ "[VD " ... %] ]

In the example in (6.6) below, given by the authors, the parser identified two date expressions: first a valid one (VD) and then an invalid one (ID), differing from the valid one only in the weekday. Notice that the effect of the application of longest match is reflected when, for instance, the invalid date Tuesday, September 16, 1996 is selected over Tuesday, September 16, 19, which is a valid date.11

(6.6) The correct date for today is [VD Monday, September 16, 1996]. There is an error in the program. Today is not [ID Tuesday, September 16, 1996].

6.3 System Architecture

6.3.1 Introduction

After this short introduction to finite state automata, parsing methods with finite state techniques and a description of the XFST compiler, I will now proceed with a description of the implemented grammar error detector FiniteCheck. In this section an overview is given of the system's architecture and of how the system proceeds through the individual modules to identify errors in text. The types of automata used in the implementation are also described. The implementation methods and detailed descriptions of the individual modules are discussed in subsequent sections.

The framework of FiniteCheck is built as a cascade of finite state transducers compiled from regular expressions, including operators defined in the Xerox Finite-State Tool (XFST; see Section 6.2.3). Each automaton in the network composes with the result of the previous application. The implemented tool applies a strategy of simple dictionary lookup, incremental partial parsing with minimal disambiguation by parsing order and filtering, and error detection using the subtraction of 'positive' grammars that differ in their level of detail.
Accordingly, the current system of sequenced finite state transducers is divided into four main modules: the dictionary lookup, the grammar, the parser and the error finder (see Figure 6.1 below). The system runs under UNIX in a simple emacs environment, implemented by Robert Andersson, with an XFST mode that allows menus to be used to recompile files in the system. The modules are further described in the following subsection on the flow of data in the error detector. The types of automata used are discussed at the end of this section.

11 This date is, however, only valid in theory, since the Gregorian calendar was not yet in use in the year 19 AD. The Gregorian calendar, which replaced the Julian calendar, was introduced in Catholic countries by Pope Gregory XIII on Friday, October 15, 1582 (in Sweden in 1753).

Figure 6.1: The System Architecture of FiniteCheck

6.3.2 The System Flow

The Dictionary Lookup

The input text to FiniteCheck is first manually tokenized so that spaces occur between all strings and tokens, including punctuation. This formatted text is then tagged with part-of-speech and feature annotations by the lookup module, which assigns to each string in the text all the lexical tags stored in the lexicon of the system. No disambiguation is involved, only a simple lookup. The underlying lexicon of around 160,000 word forms is built as a finite state transducer. The tagset is based on the tag format defined in the Stockholm Umeå Corpus (Ejerhed et al., 1992), combining part-of-speech information with feature information (see Section 6.4 and Appendix C). As an example, the sentence in (6.7a) is ungrammatical, containing a (finite) auxiliary verb followed by yet another finite verb (see (4.32) on p.61). It will be annotated by the dictionary lookup as shown in (6.7b):

(6.7) a. *
Men kom ihåg att det inte ska blir någon riktig brand
But remember that it not will [pres] becomes [pres] some real fire
– But remember that there will not be a real fire.

b. Men[kn] kom[vb prt akt] ihåg[ab][pl] att[sn][ie] det[pn neu sin def sub/obj][dt neu sin def] inte[ab] ska[vb prs akt] blir[vb prs akt] någon[dt utr sin ind][pn utr sin ind sub/obj] riktig[jj pos utr sin ind nom] brand[nn utr sin ind nom]

The Grammar

The grammar module includes two grammars with (positive) rules reflecting the grammatical structure of Swedish, differing in their level of detail. The broad grammar (Section 6.5) is especially designed to handle text with ungrammaticalities; its linguistic descriptions are less accurate, accepting both valid and invalid patterns. The narrow grammar (Section 6.7) is more refined and accepts only grammatical segments. For example, the regular expression in (RE6.8) belongs to the broad grammar and recognizes potential verb clusters (VC), both grammatical and ungrammatical, as a pattern consisting of a sequence of two or three verbs in combination with (zero or more) adverbs (Adv*).

(RE6.8) define VC [Verb Adv* Verb (Verb)];

This automaton accepts all the verb cluster examples in (6.8), including the ungrammatical instance (6.8c), extracted from the text in (6.7), where a finite verb
ska ∗ blir will be [pres] Corresponding rules in the narrow grammar, represented by the regular expressions in (RE6.9), take into account the internal structure of a verb cluster and define the grammar of modal auxiliary verbs (Mod) followed by (zero or more) adverb(s) (Adv∗), and either a verb in infinitive form (VerbInf) as in (RE6.9a), or a temporal verb in infinitive (PerfInf) and a verb in supine form (VerbSup), as in (RE6.9b). These rules thus accept only the grammatical segments in (6.8) and will not include example (6.8c). The actual grammar of grammatical verb clusters is a little bit more complex (see Section 6.7). (RE6.9) a. define VC1 b. define VC2 [Mod Adv* VerbInf]; [Mod Adv* PerfInf VerbSup]; The Parser The system proceeds and the tagged text in (6.7b) is now the input to the next phase, where various kinds of constituents are selected applying a lexical-prefixfirst strategy, i.e. parsing first from the left margin of a phrase to the head and then extending the phrase by adding on complements. The phrase rules are ordered in levels. The system proceeds in three steps by first recognizing the head phrases in a certain order (verbal head vpHead, prepositional head ppHead, adjective phrase ap) and then selecting and extending the phrases with complements in a certain order (noun phrase np, prepositional phrase pp, verb phrase vp). The heuristics of parsing order gives better flexibility to the system in that (some) false parses can be blocked. This approach is further explained in the section on parsing (Section 6.6). The system then yields the output in (6.9). 12 Simple ‘<’ and ‘>’ around a phrase-tag denote the beginning of a phrase and the same signs together with a slash ‘/’ indicate the end. 12 For better readability, the lexical tags are kept only in the erroneous segment and removed manually in the rest of the exemplified sentence. Chapter 6. 
(6.9) Men <vp> <vpHead> kom ihåg </vpHead> </vp> att <np> det </np> <vp> <vpHead> inte <vc> ska[vb prs akt] blir[vb prs akt] </vc> </vpHead> <np> någon <ap> riktig </ap> brand </np> </vp>

We apply the rules defined in the broad grammar set for this parsing purpose, like the one in (RE6.8) that identified the verb cluster in boldface in (6.9) above as a sequence of two verbs. The parsing output may be refined and/or revised by application of filtering transducers. Earlier parsing decisions depending on lexical ambiguity are resolved, and phrases are extended, e.g. with postnominal modifiers (see further in Section 6.6). Other structural ambiguities, such as verb coordinations or clausal modifiers on nouns, are also taken care of (see Section 6.7).

The Error Finder

Finally the error finder module is used to discriminate the grammatical patterns from the ungrammatical ones, by subtracting the narrow grammar from the broad grammar. These new transducers are used to mark the ungrammatical segments in a text. For example, the regular expression in (RE6.10a) identifies verb clusters that violate the narrow grammar of modal verb clusters (VC1 or VC2 in (RE6.9)) by subtracting (‘-’) these rules from the more general (overgenerating) rule in the broad grammar (VC in (RE6.8)) within the boundaries of a verb cluster (<vc>, </vc>), previously marked in the parsing stage in (6.9). That is, the output of the parsing stage in (6.9) is the input to this level. By application of the marking transducer in (RE6.10b), the erroneous verb cluster consisting of two verbs in present tense in a row is annotated directly in the text as shown in (6.10).

(RE6.10) a. define VCerror [ "<vc>" [VC - [VC1 | VC2]] "</vc>" ];
b. define markVCerror [ VCerror -> "<Error Verb after Vaux>" ...
"</Error>"];

(6.10) Men <vp> <vpHead> kom ihåg </vpHead> </vp> att <np> det </np> <vp> <vpHead> inte <Error Verb after Vaux> <vc> ska[vb prs akt] blir[vb prs akt] </vc> </Error> </vpHead> <np> någon <ap> riktig </ap> brand </np> </vp>

6.3.3 Types of Automata

In accordance with the techniques of finite-state parsing (see Section 6.2.4), there are in general two types of transducers in use: one that annotates text in order to select certain segments and one that redefines or refines earlier decisions. Annotations are handled by transducers called finite state markers that add reserved symbols into the text and mark out syntactic constituents, grammar errors, or other relevant patterns. For instance, the regular expression in (RE6.11) inserts noun phrase tags in text by application of the left-to-right-longest-match replacement operator (‘@→’) (see Section 6.2.3).

(RE6.11) define markNP [NP @-> "<np>" ... "</np>"];

The automaton finds the pattern that matches the maximal instance of a noun phrase (NP) and replaces it with a beginning marker (<np>), copies the whole pattern by application of the insertion operator (‘...’) and then assigns the end-marker (</np>). Three (maximal) instances of noun phrase segments are recognized in the example sentence (6.11a), discussed earlier in Chapter 4 (see (4.2) on p.46), as shown in (6.11b), where one violates definiteness agreement (in boldface).13

(6.11) a. ∗En gång blev den hemska pyroman utkastad ur stan.
one time was the [def] awful [def] pyromaniac [indef] thrown-out from the-city
– Once the awful pyromaniac was thrown out of the city.

b. <np> En gång </np> blev <np> den[dt utr sin def][pn utr sin def sub/obj] hemska[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] pyroman[nn utr sin ind nom] </np> utkastad ur <np> stan </np> .

The regular expression in (RE6.12) represents another example of an annotating automaton.
(RE6.12) define markNPDefError [ npDefError -> "<Error definiteness>" ... "</Error>"];

This finite state transducer marks out agreement violations of definiteness in noun phrases (npDefError; see Section 6.8). It detects for instance the erroneous noun phrase den hemska pyroman in the example sentence, where the determiner den ‘the’ is in definite form and the noun pyroman ‘pyromaniac’ is in indefinite form (6.12). By application of the left-to-right replacement operator (‘→’) the identified segment is replaced by first inserting an error-diagnosis-marker (<Error definiteness>) as the beginning of the identified pattern, then the pattern is copied and the error-end-marker (</Error>) is added.

13 Only the erroneous segment is marked by lexical tags.

(6.12) <np> En gång </np> blev <Error definiteness> <np> den[dt utr sin def][pn utr sin def sub/obj] hemska[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] pyroman[nn utr sin ind nom] </np> </Error> utkastad ur <np> stan </np> .

The marking transducers of the system have the form A @→ S ... E, marking the maximal instances of A from left to right by application of the left-to-right-longest-match replacement operator (‘@→’) and inserting a start-symbol S (e.g. <np>) and an end-symbol E (e.g. </np>). In cases where the maximal instances are already recognized and only the operation of replacement is necessary, the transducers use the form A → S ... E, applying only the left-to-right replacement operator (‘→’).

The other types of transducers are used for refinement and/or revision of earlier decisions. These finite state filters can for instance be used to remove the noun phrase tags from the example sentence, leaving just the error marking. The regular expression in (RE6.13) replaces all occurrences of noun phrase tags with an empty string (‘0’) by application of the left-to-right replacement operator (‘→’). The result is shown in (6.13).
(RE6.13) define removeNP ["<np>" -> 0, "</np>" -> 0];

(6.13) En gång blev <Error definiteness> den[dt utr sin def][pn utr sin def sub/obj] hemska[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] pyroman[nn utr sin ind nom] </Error> utkastad ur stan.

These filtering transducers have the form A → B and are used for simple replacement of instances of A by B by application of the left-to-right replacement operator (‘→’). In cases where the context plays a crucial role, the automata are extended by requirements on the left and/or the right context and have the form A → B || L _ R. Here, the patterns in A are replaced by B only if A is preceded by the left context L and followed by the right context R. In some cases only the left context is constrained, in others only the right, and in some cases both are needed.

6.4 The Lexicon

6.4.1 Composition of the Lexicon

The lexicon of the system is a full form lexicon based on two resources, Lexin (Skolverket, 1992), developed at the Department of Swedish Language, Section of Lexicology, Göteborg University, and a corpus-based lexicon from the SveLex project under the direction of Daniel Ridings, LexiLogik AB. At the initial stage of lexicon composition, only the Lexin dictionary of 58,326 word forms was available to us, and we chose it especially for the lexical information stored in it, in particular the information on valence. I converted the Lexin text records to one single regular expression by a two-step process using the programming language gawk (Robbins, 1996). From the Lexin records (exemplified in (6.14a) and (6.14b)) a new file was created with lemmas separated by rows as in (6.14c). The first line here represents the Lexin entry for the noun bil ‘car’ in (6.14a) and the second the verb bilar ‘travels by car [pres]’ in (6.14b).
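That record-to-row step, performed with gawk in the thesis, can be sketched in Python as follows. The field numbers (#01 lemma, #02 part-of-speech, #12 declined forms) follow the text; the abbreviated record strings are illustrative, not the actual Lexin file format.

```python
# Sketch of the record-to-row conversion (done with gawk in the thesis).
# Assumes one "#NN value" field per line; only #01, #02 and #12 are kept.

def record_to_row(record: str) -> str:
    fields = {}
    for line in record.strip().splitlines():
        key, _, value = line.partition(" ")
        fields.setdefault(key, []).append(value)
    pos = fields["#02"][0]                    # part-of-speech
    lemma = fields["#01"][0]                  # lemma
    forms = " ".join(fields.get("#12", []))   # declined forms
    return f"{pos} {lemma} {forms}".strip()

noun_record = "#01 bil\n#02 subst\n#12 bilen bilar"
verb_record = "#01 bilar\n#02 verb\n#12 bilade bilat bila"
print(record_to_row(noun_record))   # subst bil bilen bilar
print(record_to_row(verb_record))   # verb bilar bilade bilat bila
```

The output rows correspond to the two lines of (6.14c).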
Only a word’s part-of-speech (entry #02), lemma (entry #01), and declined forms (entry #12) are listed in the current implementation.14 The number and type of forms vary according to the part-of-speech, and sometimes even within a part-of-speech.

14 Future work will further extend the other kinds of information stored in the lexicon, such as valence and compounding.

(6.14) a. #01 bil
#02 subst
#04 ett slags motordrivet fordon
#07 åka bil
#09 bild 17:34, 18:36-37
#11 bil~trafik -en
#11 personbil
#11 bil~buren
#11 bil~fri
#11 bil~sjuk
#11 bil~sjuka
#11 bil~telefon
#11 lastbil
#12 bilen bilar
#14 bi:l

b. #01 bilar
#02 verb
#04 åka bil
#10 A & (+ RIKTNING)
#12 bilade bilat bila(!)
#14 2bI:lar

c. subst bil bilen bilar
verb bilar bilade bilat bila

In the next step I converted the data in (6.14c) directly to a single regular expression as shown in (RE6.14). Each word entry in the lexicon was represented as a single finite state transducer with the string on the LOWER side and the category and feature on the UPPER side, allowing both analysis and generation. The whole dictionary is formed as the union of these automata. At this stage I used only simple tagsets that were later converted to the SUC format (see below). Using this automatic generation of lexical entries to a regular expression, alternative versions of the lexicon are easy to create, for example with different tagsets or including other information from Lexin (e.g. valence, compounds).

(RE6.14) [ A % - i n k o m s t 0:%[%+NSI%]
| A % - k a s s a 0:%[%+NSI%]
| A % - s k a t t 0:%[%+NSI%]
. . .
| b i l 0:%[%+NSI%]
| b i l e n 0:%[%+NSD%]
| b i l a r 0:%[%+NPI%]
| b i l a 0:%[%+VImp%]
| b i l a r 0:%[%+VPres%]
| b i l a d e 0:%[%+VPret%]
| b i l a t 0:%[%+VSup%]
. . .
| ö v ä r l d 0:%[%+NSI%]
| ö v ä r l d a r 0:%[%+NPI%]
| ö v ä r l d e n 0:%[%+NSD%]
];

The Lexin dictionary was later extended with the 100,000 most frequent word forms selected from the corpus-based SveLex.
At this stage the format of the lexicon was revised. The new lexicon of 158,326 word forms was compiled to a new transducer using instead the Xerox tool Finite-State Lexicon Compiler (LEXC) (Karttunen, 1993), which made the lexicon more compact and effective. This software facilitates in particular the development of natural-language lexicons. Instead of regular expression declarations, a high-level declarative language is used to specify the morphotactics of a language. I was not part of the composition of the new version of the lexicon. The procedures and achievements of this work are described further in Andersson et al. (1998, 1999).

6.4.2 The Tagset

In the present version of the lexicon, the set of tags follows the Stockholm Umeå Corpus project conventions (Ejerhed et al., 1992), including 23 category classes and 29 feature classes (see Appendix C). Four additional categories were added to this set for recognition of copula verbs (cop), modal verbs (mvb), verbs with infinitival complement (qmvb) and unknown words, which obtain the tag [nil]. This morphosyntactic information is used for identification of strings by both their category and/or feature(s). For reasons of efficiency, the whole tag with category and feature definitions is read by the system as a single symbol and not as a separate list of atoms. An experiment conducted by Robert Andersson showed that an automaton recognizing a grammatical noun phrase had 90% fewer states and 60% fewer transitions compared to declaring a tag as consisting of a category and a set of features (see further in Andersson et al., 1999). As a consequence of this choice, the automata representing the tagset are divided both in accordance with the category they state and the features, always rendering the whole tag. The automata are constructed as a union of all the tags of the same category or feature.
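Treating each full tag as one opaque symbol can be mimicked in Python with sets of tag strings — a sketch with a partial, illustrative tag selection, not the 55 actual definitions:

```python
# Sketch: full SUC-style tags as opaque single symbols; category and
# feature "automata" become sets of tag strings (partial selections).
TagVB  = {"[vb prt akt]", "[vb prs akt]", "[vb sup akt]", "[vb inf akt]"}
TagPRS = {"[vb prs akt]", "[vb prs sfo]",
          "[pc prs utr/neu sin/plu ind/def nom]"}
TagAKT = {"[vb prt akt]", "[vb prs akt]", "[vb sup akt]", "[vb inf akt]"}

# The same tag recurs in one definition per characteristic it expresses:
tag = "[vb prs akt]"
print(tag in TagVB, tag in TagPRS, tag in TagAKT)   # True True True

# Category-feature combinations then fall out as set intersection:
VerbPrsTags = TagVB & TagPRS
print(VerbPrsTags)
```

The intersection at the end corresponds to the [Verb & Prs] style of combination defined in Section 6.4.3.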
In practice this means that the same tag occurs in different tag definitions as many times as the number of defined characteristics. For instance, the tag defining an active verb in present tense [vb prs akt] occurs in three definitions, first in the union of all tags defining the verb category (TagVB in (RE6.15)), then among all tags for present tense (TagPRS in (RE6.16)) and then among all tags for active voice (TagAKT in (RE6.17)).

(RE6.15) define TagVB [ "[vb an]"
| "[vb sms]"
| "[vb prt akt]"
| "[vb prt sfo]"
| "[vb prs akt]"
| "[vb prs sfo]"
| "[vb sup akt]"
| "[vb sup sfo]"
| "[vb imp akt]"
| "[vb imp sfo]"
| "[vb inf akt]"
| "[vb inf sfo]"
| "[vb kon prt akt]"
| "[vb kon prt sfo]"
| "[vb kon prs akt]"
];

(RE6.16) define TagPRS [ "[pc prs utr/neu sin/plu ind/def gen]"
| "[pc prs utr/neu sin/plu ind/def nom]"
| "[vb prs akt]"
| "[vb prs sfo]"
| "[vb kon prs akt]"
];

(RE6.17) define TagAKT [ "[vb prt akt]"
| "[vb prs akt]"
| "[vb sup akt]"
| "[vb imp akt]"
| "[vb inf akt]"
| "[vb kon prt akt]"
| "[vb kon prs akt]"
];

On the other hand, the tag for an interjection ([in]), which consists only of the category, occurs just once in the definitions of tags:

(RE6.18) define TagIN [ "[in]" ];

There are in total 55 different lexical-tag definitions of categories and features. One single automaton (Tag) represents all the different categories and features, composed as the union of these 55 lexical tags. The largest feature category, singular (TagSIN), includes 80 different tags.

6.4.3 Categories and Features

In the parsing and error detection processes, strings need to be recognized by their category and/or feature inclusion. The morphosyntactic information in the tags is used for this purpose and automata identifying different categories and feature sets are defined. For instance, the regular expression in (RE6.19a) recognizes the tagged string kan[vb prs akt] ‘can’ as a verb, i.e.
a sequence of one or more (the iteration sign ‘+’) letters followed by a sequence of tags, one of which is a tag containing ‘vb’ (TagVB). Features are defined in the same manner. The same string can be recognized as a carrier of the feature of present tense. The regular expression in (RE6.19b) defines the automaton for present tense as a sequence of (one or more) letters followed by a sequence of tags, where one of them fulfills the feature of present tense ‘prs’ (TagPRS).

(RE6.19) a. define Verb Letter+ Tag* TagVB Tag*;
         b. define Prs  Letter+ Tag* TagPRS Tag*;

By using intersection (‘&’) of category and feature sets, there is also the possibility of recognizing category-feature combinations. The same string can then be recognized directly as a verb in present tense by the regular expression VerbPrs given in (RE6.20), which lists all the verb tense features.

(RE6.20) define VerbImp [Verb & Imp];
define VerbPrs [Verb & Prs];
define VerbPrt [Verb & Prt];
define VerbSup [Verb & Sup];
define VerbInf [Verb & Inf];

Even higher level sets can be built. For instance, a category of tensed (finite) and untensed (non-finite) verbs may be defined as in (RE6.21), including the union of appropriate verb form definitions from the verb tense feature set in (RE6.20) above. Our example string, as a verb in present tense form, then falls among the finite verb forms (VerbTensed).

(RE6.21) define VerbTensed   [VerbPrs | VerbPrt];
define VerbUntensed [VerbSup | VerbInf];

6.5 Broad Grammar

The rules of the broad grammar are used to mark potential phrases in a text, both grammatical and ungrammatical. The grammar consists of valid (grammatical) rules that define the syntactic relations of constituents mostly in terms of categories and list the order of them. There are no constraints on the selections other than the types of part-of-speech that combine with each other to form phrases.
The grammar is in other words underspecified and does not distinguish between grammatical and ungrammatical patterns. The parsing is incremental, i.e. identifying first heads and then complements. This is also reflected in the broad grammar listed in (RE6.22), which includes rules divided into heads and complements. The whole broad grammar consists of six rules, including the head rules of adjective phrase (AP), verbal head (VPHead) and prepositional head (PPHead) and then rules for noun phrase (NP), prepositional phrase (PP) and verb phrase (VP).

(RE6.22) # Head rules
define AP     [(Adv) Adj+];
define PPhead [Prep];
define VPhead [[[Adv* Verb] | [Verb Adv*]] Verb* (PNDef & PNNeu)];

# Complement rules
define NP [[[(Det | Det2 | NGen) (Num) (APPhr) (Noun)] & ?+] | Pron];
define PP [PPheadPhr NPPhr];
define VP [VPheadPhr (NPPhr) (NPPhr) (NPPhr) PPPhr*];

An adjective phrase (AP) consists of an (optional) adverb and a sequence of (one or more) adjectives. This means that an adjective phrase consists of at least one adjective. The head of a prepositional phrase (PPHead) is a preposition. A verbal head (VPHead) includes a verb preceded or followed by (zero or more) adverbs, possibly followed by (zero or more) verbs and an optional pronoun. This means that a verbal head consists at least of a single verb, which in turn may be preceded or followed by adverb(s) and followed by verb(s). In order to prevent pronouns being analyzed as determiners in noun phrases, e.g. jag anser det bra ‘I think it is good’, single neuter definite pronouns are included in the verbal head.

The regular expression describing a noun phrase (NP) consists of two parts. The first states that a noun phrase includes a determiner (Det) or a determiner with the adverbial här ‘here’ or där ‘there’ (Det2) or a possessive noun (NGen), followed by a numeral (Num), an adjective phrase (APPhr) and a (proper) noun (Noun).
Since not only the noun can form the head of the noun phrase, all the constituents are optional. The intersection with ‘any-symbol’ (‘?’) followed by the iteration sign (‘+’) is needed to state that at least one of the listed constituents has to occur. The second part of the noun phrase rule states that a noun phrase may consist of a single pronoun (Pron). A prepositional phrase (PP) is recognized as a sequence of a prepositional head (PPheadPhr) followed by a noun phrase (NPPhr). A verb phrase consists of a verbal head (VPheadPhr) followed by at most three (optional) noun phrases and (zero or more) prepositional phrases.

6.6 Parsing

6.6.1 Parsing Procedure

The rules of the (underspecified) broad grammar are used to mark syntactic patterns in a text. A partial, lexical-prefix-first, longest-match, incremental strategy is used for parsing. The parsing procedure is partial in the sense that only portions of text are recognized and no full parse is provided for. Patterns not recognized by the rules of the (broad) grammar remain unchanged. The maximal instances of a particular phrase are selected by application of the left-to-right-longest-match replacement operator (‘@→’) (see Section 6.2.3). In (RE6.23) we see all the marking transducers recognizing the syntactic patterns defined in the broad grammar. The automata replace the corresponding phrase (e.g. noun phrase, NP) with a label indicating the beginning of such a pattern (<np>), the phrase itself and a label that marks the end of that pattern (</np>).

(RE6.23) define markPPhead [PPhead @-> "<ppHead>" ... "</ppHead>"];
define markVPhead [VPhead @-> "<vpHead>" ... "</vpHead>"];
define markAP     [AP @-> "<ap>" ... "</ap>"];
define markNP     [NP @-> "<np>" ... "</np>"];
define markPP     [PP @-> "<pp>" ... "</pp>"];
define markVP     [VP @-> "<vp>" ... "</vp>"];

The segments are built on in cascades in the sense that first the heads are recognized, starting from the left-most edge to the head (the so-called lexical prefix), and then the segments are expanded in the next level by addition of complement constituents. The regular expressions in (RE6.24) compose the marking transducers of separate segments into a three-step process.

(RE6.24) define parse1 [markVPhead .o. markPPhead .o. markAP];
define parse2 [markNP];
define parse3 [markPP .o. markVP];

First the verbal heads, prepositional heads and adjective phrases are recognized by composition in that order (parse1). The corresponding marking transducers presented in (RE6.23) insert syntactic tags around the found phrases as in (6.15a).15 This output serves then as input to the next level, where the adjective phrases are extended and noun phrases are recognized (parse2) and marked as exemplified in (6.15b). This output in turn serves as input to the last level, where the whole prepositional phrases and verb phrases are recognized in that order (parse3) and marked as in (6.15c).

15 The original sentence example is presented in (6.11) on p.189.
The ‘broadness’ of the grammar and the lexical ambiguity in words, necessary for parsing text containing errors, also yields ambiguous and/or alternative phrase annotations. We block some of the (erroneous) alternative parses by the order in which phrase segments are selected, which causes bleeding of some rules and more ‘correct’ parsing results are achieved. The order in which the labels are inserted into the string influences the segmentation of patterns into phrases (see Section 6.6.2). Further ambiguity resolution is provided for by filtering automata (see Section 6.6.3). 6.6.2 The Heuristics of Parsing Order The order in which phrases are labeled supports ambiguity resolution in the parse to some degree. The choice of marking verbal heads before noun phrases prevents merging constituents of verbal heads into noun phrases which would yield noun phrases with too wide a range. For instance, marking first the sentence in (6.16a) for noun phrases ((6.16b) ∗ NP:)16 would interpret the pronoun De ‘they’ as a determiner and the verb såg ‘saw’, that is exactly as in English homonymous with the noun ‘saw’, as a noun and merges these two constituents to a noun phrase. The output would then be composed with the selection of the verbal head ((6.16b) ∗ NP .o. VPHead) that ends up within the boundaries of the noun phrase. Composing the marking transducers in the opposite order instead yields the more correct parse in (6.16c). Although the alternative of the verb being parsed as verbal head or a noun remains (<vpHead> <np> såg </np> </vpHead> ), the pronoun is now marked correctly as a separate noun phrase and not merged together with the main verb into a noun phrase. 16 Asterix ‘*’ indicates erroneous parse. FiniteCheck: A Grammar Error Detector 199 (6.16) a. De såg ledsna ut They looked sad out – They looked sad. b. ∗ NP: <np> De såg </np> <np> ledsna </np> ut . ∗ NP .o. VPHead: <np> De <vpHead> såg </vpHead> </np> <np> ledsna </np> ut . c. 
VPHead: De <vpHead> såg </vpHead> ledsna ut .
VPHead .o. NP: <np> De </np> <vpHead> <np> såg </np> </vpHead> <np> ledsna </np> ut .

This ordering strategy is not absolute however, since the opposite scenario is possible, where parsing noun phrases before verbal heads is more suitable. Consider for instance example (6.17a) below, where the string öppna ‘open’ in the noun phrase det öppna fönstret ‘the open window’ will be split into three separate noun phrase segments when applying the order of parsing verbal heads before noun phrases (6.17c), due to the homonymy between an adjective and an infinitive or imperative verb form. The opposite scenario of parsing noun phrases before verbal heads yields a more correct parse (6.17b), where the whole noun phrase is recognized as one segment.

(6.17) a. han tittade genom det öppna fönstret
he looked through the open window
– he looked through the open window

b. NP: <np> han </np> tittade genom <np> det öppna fönstret </np>
NP .o. VPHead: <np> han </np> <vpHead> tittade </vpHead> genom <np> det <vpHead> öppna </vpHead> fönstret </np>

c. ∗VPHead: han <vpHead> tittade </vpHead> genom det <vpHead> öppna </vpHead> fönstret
∗VPHead .o. NP: <np> han </np> <vpHead> tittade </vpHead> genom <np> det </np> <vpHead> <np> öppna </np> </vpHead> <np> fönstret </np>

We analyzed the ambiguity frequency in the Child Data corpus and found that occurrences of nouns recognized as verbs are more frequent than the opposite. On this ground, we chose the strategy of marking verbal heads before marking noun phrases. In the case of the opposite scenario, the false parsing can be revised and corrected by an additional filter (see Section 6.6.3). A similar problem occurs with homonymous prepositions and adjectives. For instance, the string vid is ambiguous between an adjective (‘wide’) and a preposition (‘by’) and influences the order of marking prepositional heads and noun phrases.
Parsing prepositional heads before noun phrases is more suitable for preposition occurrences, as shown in (6.18c), in order to prevent the preposition from being merged as part of a noun phrase, as in (6.18b).

(6.18) a. Jag satte mig vid bordet
I sat me by the-table
– I sat down at the table.

b. ∗NP: <np> Jag </np> satte <np> mig </np> <np> vid bordet </np> .
∗NP .o. PP: <np> Jag </np> satte <np> mig </np> <np> <ppHead> vid </ppHead> bordet </np> .

c. PP: Jag satte mig <ppHead> vid </ppHead> bordet .
PP .o. NP: <np> Jag </np> satte <np> mig </np> <ppHead> <np> vid </np> </ppHead> <np> bordet </np> .

The opposite order is more suitable for adjective occurrences, as in (6.19), where the adjective is joined together with the head noun when selecting noun phrases first, as in (6.19b). But when recognizing the adjective as a prepositional head, that noun phrase is split into two noun phrases, as in (6.19c). Again, the choice of marking prepositional heads before noun phrases was based on the result of frequency analysis in the corpus, i.e. the string vid occurred more often as a preposition than as an adjective.

(6.19) a. Hon hade vid kjol på sig.
She had wide skirt on herself.
– She was wearing a wide skirt.

b. NP: <np> Hon </np> hade <np> vid kjol </np> på <np> sig </np> .
NP .o. PP: <np> Hon </np> hade <np> <ppHead> vid </ppHead> kjol </np> på <np> sig </np> .

c. ∗PP: Hon hade <ppHead> vid </ppHead> kjol på sig .
∗PP .o. NP: <np> Hon </np> hade <ppHead> <np> vid </np> </ppHead> <np> kjol </np> på <np> sig </np> .

6.6.3 Further Ambiguity Resolution

As discussed above, the parsing order does not give the correct result in every context. Nouns, adjectives and pronouns are homonymous with verbs and might then be interpreted by the parser as verbal heads, or adjectives homonymous with prepositions can be analyzed as prepositional heads.
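The order-sensitivity behind these effects can be reproduced in miniature with two greedy regex passes over a toy tagged string — a Python sketch under a simplified word[tag] token format, not the system's actual transducers:

```python
import re

# Toy tokens: a word followed by its (possibly ambiguous) tag readings.
text = "De[pn][dt] såg[vb][nn] ledsna[jj] ut[ab]"

def mark_vphead(s):
    # any token with a verb reading becomes a verbal head
    return re.sub(r"\S*\[vb\]\S*",
                  lambda m: f"<vpHead> {m.group(0)} </vpHead>", s)

def mark_np(s):
    # greedy: a determiner reading directly followed by a noun reading
    # merges into one noun phrase
    return re.sub(r"(\S*\[dt\]\S*) (\S*\[nn\]\S*)", r"<np> \1 \2 </np>", s)

np_first = mark_vphead(mark_np(text))   # NP swallows the verb (cf. 6.16b)
vp_first = mark_np(mark_vphead(text))   # verb marked first (cf. 6.16c)
print(np_first)
print(vp_first)
```

Running the NP pass first reproduces the erroneous merge of (6.16b); marking the verbal head first blocks it, as in (6.16c).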
These parsing decisions can be redefined at a later stage by application of filtering transducers (see Section 6.3.3). As exemplified in (6.17) above, the consequence of parsing verbal heads before noun phrases may yield noun phrases that are split into parts, due to the fact that adjectives are interpreted as verbs. The filtering transducer in (RE6.25) adjusts such segments and removes the erroneous (inner) syntactic tags (i.e. replaces them with the empty string ‘0’) so that only the outer noun phrase markers remain, converting the split phrase in (6.20a) to one noun phrase, yielding (6.20b). The regular expression consists of two replacement rules that apply in parallel. They are constrained by the surrounding context of a preceding determiner (Det) and a subsequent adjective phrase (APPhr) and a noun phrase (NPPhr) in the first rule, and a preceding determiner and an adjective phrase in the second rule.

(6.20) a. <np> han </np> <vpHead> tittade </vpHead> genom <np> det </np> <vpHead> <np> öppna </np> </vpHead> <np> fönstret </np>
b. <np> han </np> <vpHead> tittade </vpHead> genom <np> det öppna fönstret </np>

(RE6.25) define adjustNPAdj [
"</np><vpHead><np>" -> 0 || Det _ APPhr "</np></vpHead>" NPPhr ,,
"</np></vpHead><np>" -> 0 || Det "</np><vpHead><np>" APPhr _ ];

Noun phrases with a possessive noun as the modifier are split when the head noun is homonymous with a verb, as in (6.21).17 The parse is then adjusted by a filter that simply extracts the noun from the verbal head and moves the borders of the noun phrase, yielding (6.21c).

(6.21) a. barnens far hade dött
children’s father had died
– the father of the children had died

b. <np> barnens </np> <vpHead> <np> far </np> </vpHead> hade dött
c. <np> barnens far </np> <vpHead> hade dött </vpHead>

The filtering automaton in (RE6.26) inserts a start-marker for verbal head (i.e.
replaces the empty string ‘0’ with the syntactic tag vpHead) right after the end of the actual noun phrase and removes the redundant syntactic tags in the second replacement rule. The replacement procedure is (again) simultaneous, by application of parallel replacement.

(RE6.26) define adjustNPGen [
0 -> "<vpHead>" || NGen "</np><vpHead>" NPPhr _ ,,
"</np><vpHead><np>" -> 0 || NGen _ ~$"<np>" "</np>"];

Another ambiguity problem occurs with the interrogative pronoun var ‘where’, which in Swedish is ambiguous with the copula verb var ‘were’ or ‘was’. Since verbal heads are annotated first in the system, identifying segments of maximal length, the homonymous pronoun is recognized as a verb and combined with the subsequent verb, as in (6.22) and (6.23).

(6.22) a. Var var den där överraskningen.
where was the there surprise
– Where was that surprise?

b. <vp> <vpHead> <vc> <np> Var var </np> </vc> </vpHead> <np> den där överraskningen </np> </vp> ?

(6.23) a. Var såg du hästen Madde frågar jag.
where saw you the-horse Madde ask I
– Where did you see the horse, Madde? I asked.

b. <vp> <vpHead> <vc> <np> Var såg </np> </vc> </vpHead> <np> du </np> <np> hästen </np> </vp> Madde <vp> <vpHead> frågar <np> jag </np> </vpHead> </vp> .

17 Here the string far is ambiguous between the noun reading ‘father’ and the present tense verb form ‘goes’.

A similar problem occurs with adjectives or participles homonymous with verbs, as in (6.24), where the adjective rädda ‘scared [pl]’ is identical to the infinitive or imperative form of the verb ‘rescue’ and is joined with the preceding copula verb to form a verb cluster.

(6.24) a. Alla blev rädda ...
all became afraid
– All became afraid ...

b. <np> Alla </np> <vp> <vpHead> <vc> blev <np> <ap> rädda </ap> </np> </vc> </vpHead> </vp> ...
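Filters such as (RE6.25) are instances of the contextual replacement scheme A → B || L _ R introduced in Section 6.3.3. In Python the left context can be approximated with a fixed-width regex lookbehind — a sketch over simplified word[tag] tokens, not the xfst rules themselves:

```python
import re

# Sketch: contextual replacement A -> B || L _ R rendered with regex
# lookbehind for the left context L.  The rewrite re-joins the split
# noun phrase of example (6.20); tags are simplified illustrations.

s = ("<np> det[dt] </np> <vpHead> <np> öppna[jj][vb] </np> </vpHead> "
     "<np> fönstret[nn] </np>")

# rule 1: drop the inner tags between a determiner and an adjective
out = re.sub(r"(?<=\[dt\]) </np> <vpHead> <np>", "", s)
# rule 2: drop the remaining inner tags before the head noun
out = re.sub(r"(?<=\[vb\]) </np> </vpHead> <np>", "", out)
print(out)   # <np> det[dt] öppna[jj][vb] fönstret[nn] </np>
```

Note that Python lookbehinds must be fixed-width, which is far weaker than the arbitrary regular-language contexts xfst allows; the sketch only illustrates the shape of the operation.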
All verbal heads recognized as sequences of verbs with a copula verb in the beginning are selected by the replacement transducer in (RE6.27) that changes the verb cluster label (<vc> ) to a new marking (<vcCopula> ). This selection provides no changes in the parsing result in that no markings are (re)moved. Its purpose rather is to prevent false error detection and mark such verb clusters as being different. For instance, applying this transducer on the example in (6.22,) will yield the output presented in (6.25). (RE6.27) define SelectVCCopula [ "<vc>" -> "<vcCopula>" || _ [CopVerb / NPTags] ˜$"<vc>" "</vc>"]; (6.25) <vp> <vpHead> <vcCopula> <np> Var var </np> </vc> </vpHead> <np> den där överraskningen </np> </vp> ? 6.6.4 Parsing Expansion and Adjustment The text is now annotated with syntactic tags and some of the segments have to be further expanded with postnominal attributes and coordinations. In the current system, partitive prepositional phrases are the only postnominal attributes taken care of. The reason is that grammatical errors were found in these constructions. By application of the filtering transducer in (RE6.28) the example text in (6.26a) with the partitive noun phrase split into a noun phrase followed by a prepositional head that includes the partitive preposition av ‘of’ and yet another noun phrase from the parsing stage in (6.26b) is merged to form a single noun phrase as in (6.26c). This automaton removes the redundant inner syntactic markers by application of two replacement rules, constrained by the right or left context. The replacement occurs simultaneously by application of parallel replacement. (RE6.28) define adjustNPPart [ "</np><ppHead>" -> 0 || _ PPart "</ppHead><np>",, "</ppHead><np>" -> 0 || "</np><ppHead>" PPart _ ]; Chapter 6. 204 i en av (6.26) a. Mamma och Virginias mamma hade öppnat en tygaffär mum and Virginia’s mum had opened a fabric-store in one of Dom gamla husen. 
the old the-houses
– Mum and Virginia’s mum had opened a fabric-store in one of the old houses.
b. <np> Mamma </np> och <np> Virginias mamma </np> <vp> <vpHead> <vc> hade öppnat </vc> </vpHead> <np> en tygaffär </np> i <np> en </np> <ppHead> av </ppHead> <np> Dom <ap> gamla </ap> husen </np> .
c. <np> Mamma </np> och <np> Virginias mamma </np> <vp> <vpHead> <vc> hade öppnat </vc> </vpHead> <np> en tygaffär </np> i <NPPart> en av Dom <ap> gamla </ap> husen </np> .

Another type of phrase that needs to be expanded is the verbal group with a noun phrase in the middle, normally occurring when a sentence is initiated by a constituent other than the subject (i.e. with inverted word order; see Section 4.3.6), as in (6.27a). In the parsing phase the verbal group is split into two verbal heads, as in (6.27b), which should be joined into one as in (6.27c).

(6.27)
a. En dag tänkte Urban göra varma mackor
One day thought Urban do hot sandwiches
– One day Urban thought of making hot sandwiches.
b. <np> En dag </np> <vpHead> tänkte </vpHead> <np> Urban </np> <vpHead> göra </vpHead> <np> varma mackor </np> .
c. <np> En dag </np> <vpHead> tänkte <np> Urban </np> göra </vpHead> <np> varma mackor </np> .

The filtering automaton merging the parts of a verb cluster into a single segment is constrained so that two verbal heads are joined together only if there is a noun phrase in between them and the preceding verbal head includes an auxiliary verb or a verb that combines with an infinitive verb form (VBAux). The corresponding regular expression (RE6.29) removes the redundant verbal head markers in this constrained context. The replacement works in parallel, here removing both the redundant start-marker (<vpHead>) and the end-marker (</vpHead>) at the same time. There are two (alternative) replacement rules for every tag, since the noun phrase can either occur directly after the first verbal head, as in our example (6.27) above, or, as a pronoun, be part of the first verbal head.
Tags not relevant for this replacement (VCTags) are ignored (/).

(RE6.29)
define adjustVC [
  "</vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] _ NPPhr VPheadPhr ,,
  "</vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] NPPhr _ VPheadPhr ,,
  "<vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] "</vpHead>" NPPhr _ ˜$"<vpHead>" "</vpHead>" ,,
  "<vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] NPPhr "</vpHead>" _ ˜$"<vpHead>" "</vpHead>" ];

Other filtering transducers are used for refining the parsing result. Incomplete parsing decisions are eliminated at the end of parsing. For instance, incomplete prepositional phrases, i.e. a prepositional head without a following noun phrase, defined in the regular expression (RE6.30a), are removed. Also removed are empty verbal heads as in (RE6.30b) and other misplaced tags.

(RE6.30)
a. define errorPPhead [
     "<ppHead>" -> 0 || \["<pp>"] _ ,,
     "</ppHead>" -> 0 || _ \["<np>"]];
b. define errorVPHead [
     "<vp><vpHead></vpHead></vp>" -> 0];

6.7 Narrow Grammar

The narrow grammar is the grammar proper, whose purpose is to distinguish the grammatical segments from the ungrammatical ones. The automata of this grammar express the valid (grammatical) rules of Swedish, and constrain both the order of constituents and feature requirements. The current grammar is based on the Child Data corpus and includes rules for noun phrases and the verbal core.

6.7.1 Noun Phrase Grammar

Noun Phrases

The rules in the noun phrase grammar are divided, following Cooper’s approach (Cooper, 1984, 1986), according to what types of constituent they consist of and what feature conditions they have to fulfill (see Section 4.3.1). There are altogether ten noun phrase types implemented, listed in Table 6.3, including noun phrases with a (proper) noun, pronoun, determiner, adjective, numeral or partitive attribute as the head, reflecting the profile of the Child Data corpus.
Table 6.3: Noun Phrase Types

RULE SET  NOUN PHRASE TYPE           PATTERN               EXAMPLE
NP1       single noun                (Num) N               (två) grodor ‘(two) frogs’
                                     PNoun                 Kalle ‘Kalle’
NP2       determiner and noun        Det (DetAdv) (Num) N  de (här) (två) grodorna ‘the/these (two) frogs’
          poss. noun and noun        NGen (Num) N          flickans (två) grodor ‘girl’s (two) frogs’
NP3       determiner, adj. and noun  Det AP N              den lilla grodan ‘the little frog’
          poss. noun, adj. and noun  NGen AP N             flickans lilla groda ‘girl’s little frog’
NP4       adjective and noun         (Num) AP N            (två) små grodor ‘(two) little frogs’
NP5       single pronoun             PN                    han ‘he’
NP6       single determiner          Det                   den ‘that’
NP7       adjective                  Adj+                  obehörig ‘unauthorized’
NP8       determiner and adjective   Det Adj+              de gamla ‘the old’
NP9       numeral                    (Det) Num             den tredje, 8 ‘the third, 8’
NPPart    partitive                  Num PPart NP          två av husen ‘two of houses’
          partitive                  Det PPart NP          ett av de gamla husen ‘one of the old houses’

Every noun phrase type is divided into six subrules, expressing the different types of errors: two for definiteness (NPDef, NPInd), two for number (NPSg, NPPl) and two for gender agreement (NPUtr, NPNeu). 18 For instance, in (RE6.31) we have the set of rules representing noun phrases consisting of a single pronoun, which present the feature requirements on the pronoun as the only constituent, i.e. that a definite form of the pronoun is required (PNDef) in order to be considered a definite noun phrase (NPDef).

18 Utr denotes the common gender called utrum in Swedish.

(RE6.31)
define NPDef5 [PNDef];
define NPInd5 [PNInd];
define NPSg5  [PNSg];
define NPPl5  [PNPl];
define NPNeu5 [PNNeu];
define NPUtr5 [PNUtr];

The rule set NP2 presented in (RE6.32) is more complex and defines the grammar for definite, indefinite and mixed noun phrases (see Section 4.3.1) with a determiner (or a possessive noun) and a noun. For instance, the definite form of this noun phrase type (NPDef2) is defined as a sequence of a definite determiner (DetDef), an optional adverbial (DetAdv; e.g.
här ‘here’), an optional numeral (Num), and a definite noun; or as a sequence of a mixed determiner (DetMixed, i.e. one that takes an indefinite noun as complement; e.g. denna ‘this’) or a possessive noun (NGen), followed by an optional numeral and an indefinite noun.

(RE6.32)
define NPDef2 [DetDef (DetAdv) (Num) NDef] | [[DetMixed | NGen] (Num) NInd];
define NPInd2 [DetInd (Num) NInd];
define NPSg2  [[DetSg (DetAdv) | NGen] (NumO) NSg];
define NPPl2  [[DetPl (DetAdv) | NGen] (Num) NPl];
define NPNeu2 [[DetNeu (DetAdv) | NGen] (Num) NNeu];
define NPUtr2 [[DetUtr (DetAdv) | NGen] (Num) NUtr];

This particular automaton (NPDef2) accepts all the noun phrases in (6.28), except for the first one, which forms an indefinite noun phrase and will be handled by the subsequent automaton of indefinite noun phrases of this kind (NPInd2). It also accepts the ungrammatical noun phrase in (6.28c), since it only constrains the definiteness features. This erroneous noun phrase is then handled by the automaton representing singular noun phrases of this type (NPSg2), which states that only ordinal numbers (NumO) can be combined with singular determiners and nouns.

(6.28)
a. en (första) blomma
a [indef] (first) flower [indef]
b. den (här) (första) blomman
this [def] (here) (first) flower [def]
c. ∗ den (här) (två) blomman
this [def] (here) (two) flower [def]
d. denna (första) blomma
this [def] (first) flower [indef]
e. flickans (första) blomma
the girl’s [gen] (first) flower [indef]

The different noun phrase rules can be joined by union into larger sets, divided in accordance with what feature conditions they meet. For instance, the set of all definite noun phrases is defined as in (RE6.33a) and indefinite noun phrases as in (RE6.33b). All noun phrases that meet definiteness agreement are then represented by the regular expression in (RE6.33c), an automaton formed as the union of all definite and all indefinite noun phrase automata.

(RE6.33)
a.
### Definite NPs
define NPDef [NPDef1 | NPDef2 | NPDef3 | NPDef4 | NPDef5 | NPDef6 | NPDef7 | NPDef8 | NPDef9];

b. ### Indefinite NPs
define NPInd [NPInd1 | NPInd2 | NPInd3 | NPInd4 | NPInd5 | NPInd6 | NPInd7 | NPInd8 | NPInd9];

c. ###### NPs that meet definiteness agreement
define NPDefs [NPDef | NPInd];

Noun phrases with partitive attributes have a noun phrase as the head and are treated separately in the grammar. Although agreement occurs only between the quantifier and the noun phrase, and only in gender, the rules of definiteness and number state that the noun phrase has to be definite and plural, see (RE6.34).

(RE6.34)
define NPPartDef [[Det | Num] PPart NPDef];
define NPPartInd [[Det | Num] PPart NPDef];
define NPPartSg  [[DetSg | Num] PPart NPPl];
define NPPartPl  [[DetPl | Num] PPart NPPl];
define NPPartNeu [[DetNeu | Num] PPart NPNeu];
define NPPartUtr [[DetUtr | Num] PPart NPUtr];

Adjective Phrases

Adjective phrases occur as modifiers in two of the defined noun phrase types (NP3 and NP4) and form heads of their own in two others (NP7 and NP8). In the present implementation an adjective phrase consists of an optional adverb and a sequence of one or more adjectives, and it is likewise defined in accordance with the feature conditions that have to be fulfilled for definiteness, number and gender, as shown in (RE6.35). The gender feature set also includes an additional definition for masculine gender.
(RE6.35)
define APDef ["<ap>" (Adv) AdjDef+ "</ap>"];
define APInd ["<ap>" (Adv) AdjInd+ "</ap>"];
define APSg  ["<ap>" (Adv) AdjSg+  "</ap>"];
define APPl  ["<ap>" (Adv) AdjPl+  "</ap>"];
define APNeu ["<ap>" (Adv) AdjNeu+ "</ap>"];
define APUtr ["<ap>" (Adv) AdjUtr+ "</ap>"];
define APMas ["<ap>" (Adv) AdjMas+ "</ap>"];

One problem related to error detection concerns the ambiguity between weak and strong forms of adjectives, which coincide in the plural, while in the singular the weak form of adjectives is used only in definite singular noun phrases (see Section 4.3.1). Consequently, such adjectives obtain both singular and plural tags, and errors such as the one in (6.29) will be overlooked by the system. As we see in (6.29a), the adjective trasiga ‘broken’ is both singular (and definite) and plural (and indefinite), like the surrounding determiner and head noun, and the check for number and definiteness will succeed. Since the whole noun phrase is singular, the plural tag highlighted in bold face in (6.29b) is irrelevant and can be removed by the automaton defined in (RE6.36), allowing a definiteness error to be reported.

(6.29)
a. ∗ en trasiga speldosa
a [sg,indef] broken [sg,wk] or [pl] musical box [sg,indef]
b. <np> en[dt utr sin ind][pn utr sin ind sub/obj] <ap> trasiga[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] </ap> speldosa[nn utr sin ind nom] </np>

(RE6.36)
define removePluralTagsNPSg [
  TagPLU -> 0 || DetSg "<ap>" Adj _ ˜$"</np>" "</np>"];

Other Selections

In addition to these noun phrase rules, noun phrases with a determiner and a noun as the head that are followed by a relative subordinate clause are treated separately, for the reason that definiteness conditions are different in this context (see Section 4.3.1). As in (6.30), the head noun, which normally appears in definite form after a definite article, lacks the suffixed article and stands instead in indefinite form.
In the current system, these segments are selected as separate from other noun phrases by application of the filtering transducer in (RE6.37), which simply changes the opening noun phrase label (<np>) to the label <NPRel> in the context of a definite determiner with other constituents and the complementizer som ‘that’. The grammar is thereby prepared for an extension of the detection to these error types as well.

(6.30)
a. Jag tycker att det borde finnas en hjälpgrupp för de elever som har lite sociala problem.
I think that it should exist a help-group for the [pl,def] pupils [pl,indef] that have some social problems
– I think that there should be a help-group for the pupils that have some social problems.
b. <np> Jag </np> <vp> <vpHead> tycker </vpHead> </vp> att <np> det </np> <vp> <vpHead> <vc> borde finnas </vc> </vpHead> <np> en </np> </vp> hjälpgrupp <np> för </np> <NPRel> de elever </np> som <vp> <vpHead> har </vpHead> <np> <ap> sociala </ap> problem </np> </vp> .

(RE6.37)
define SelectNPRel ["<np>" -> "<NPRel>" || _ DetDef ˜$"<np>" "</np>" (" ") {som} Tag*];

6.7.2 Verb Grammar

The narrow grammar of verbs specifies the valid rules of finite and non-finite verbs (see Section 4.3.5). The rules consider the form of the main finite verb, verb clusters and verbs in infinitive phrases.

Finite Verb Forms

The finite verb form occurs in verbal heads either as a single main verb or as an auxiliary verb in a verb cluster. The grammar rule in (RE6.38) states that the first verb in the verbal head (possibly preceded by adverb(s)) has to be tensed. Any following verbs (or other constituents) in the verbal head are then ignored (the any-symbol ‘?∗’ indicates that).

(RE6.38)
define VPFinite [Adv* VerbTensed ?*];

Infinitive Verb Phrases

The rule defining the verb form in infinitive phrases concerns verbal heads preceded by an infinitive marker.
The marking transducer in (RE6.39a) selects these verbal heads and changes the label to infinitival verbal head (<vpHeadInf>). The grammar rule of the infinitive verbal core is defined in (RE6.39b), comprising just one verb in infinitive form (VerbInf), possibly preceded by (zero or more) adverbs and/or a modal verb also in infinitive form (ModInf).

(RE6.39)
a. define SelectInfVP [
     "<vpHead>" -> "<vpHeadInf>" || InfMark "<vp>" _ ];
b. define VPInf [Adv* (ModInf) VerbInf Adv* ?*];

Verb Clusters

The narrow grammar of verb clusters is more complex, including rules for both modal (Mod) and temporal auxiliary verbs (Perf) and for verbs combining with infinitive verbs (INFVerb), i.e. infinitive phrases without an infinitive marker (see Section 4.3.5). The grammar rules state the order of constituents and the form of the verbs following the auxiliary verb. The form of the auxiliary verb itself is defined in the VPFinite rule above (see (RE6.38)), i.e. the verb has to have finite form. The marking automaton (RE6.40b) selects as verb clusters all verbal heads that include more than one verb, via the VC-rule in (RE6.40a). The potential verb clusters have the form of a verb followed by (zero or more) adverbs, an (optional) noun phrase, (zero or more) adverbs and subsequently one or two verbs. Other syntactic tags (NPTags) are ignored (‘/’ is the ignore-operator).

(RE6.40)
a. define VC [ [[Verb Adv*] / NPTags] (NPPhr) [[Adv* Verb (Verb)] / NPTags] ];
b. define SelectVC [VC @-> "<vc>" ... "</vc>" ];

Five different rules describe the grammar of verb clusters. Three rules concern the modal verbs (VC1, VC2, VC3, presented in (RE6.41)) and two rules deal with temporal auxiliary verbs (VC4, VC5, presented in (RE6.42)). Verbs that take infinitival phrases (without the infinitival marker) (INFVerb) share two rules with the modal verbs (VC1, VC2). All the verb cluster rules have the form VBAux (NP) Adv* Verb (Verb), i.e.
an auxiliary verb followed by an optional noun phrase, (zero or more) adverb(s), a verb and an optional verb. By including the optional noun phrase, the grammar also handles inverted sentences. Again, irrelevant tags (NPTags) are ignored.

(RE6.41)
a. define VC1 [ [[Mod | INFVerb] / NPTags] (NPPhr) [[Adv* VerbInf] / NPTags] ];
b. define VC2 [ [Mod / NPTags] (NPPhr) [[Adv* ModInf VerbInf] / NPTags] ];
c. define VC3 [ [Mod / NPTags] (NPPhr) [[Adv* PerfInf VerbSup] / NPTags] ];

(RE6.42)
a. define VC4 [ [Perf / NPTags] (NPPhr) [[Adv* VerbSup] / NPTags] ];
b. define VC5 [ [Perf / NPTags] (NPPhr) [[Adv* ModSup VerbInf] / NPTags] ];

All five rules can be combined by union into one automaton that represents the grammar of all verb clusters, presented in (RE6.43).

(RE6.43)
define VCgram [VC1 | VC2 | VC3 | VC4 | VC5];

Other Selections

Coordinations of verbal heads in verb clusters or in infinitive verb phrases are selected as separate segments by the marking transducer in (RE6.44). The automaton replaces the verbal head marking with a new label that indicates coordination of verbs (<vpHeadCoord>), as exemplified in (6.31) and (6.32).

(RE6.44)
define SelectVPCoord ["<vpHead>" -> "<vpHeadCoord>" || ["<vpHeadInf>" | "</vc>"] ˜$"<vpHead>" ˜$"<vp>" [{eller} | {och}] Tag* (" ") "<vp>" _ ];

(6.31)
a. hon skulle springa ner och larma
she would run down and alarm
– She was about to run down and give the alarm.
b. <np> hon </np> <vp> <vpHead> <vc> skulle <np> springa </np> </vc> </vpHead> </vp> ner och <vp> <vpHeadCoord> larma </vpHead> </vp>

(6.32)
a. det är dags att gå och lägga sig.
it is time to go and lay oneself
– It is time to go to bed.
b. <np> det </np> <vp> <vpHead> är </vpHead> <np> dags </np> </vp> att <vp> <vpHead> gå </vpHead> </vp> och <vp> <vpHeadCoord> lägga <np> sig </np> </vpHead> </vp> .
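The five cluster rules and their union (RE6.41–RE6.43) can be emulated over simplified tag sequences. The following is an illustrative re-encoding in ordinary regular expressions, with single-letter stand-ins for the thesis’s symbols; it is not the actual XFST code and ignores the tag-interleaving handled by the ignore-operator:

```python
import re

# One letter per symbol: M=Mod, I=INFVerb, P=Perf, A=Adv, N=NPPhr,
# i=VerbInf, m=ModInf, p=PerfInf, s=VerbSup, q=ModSup.
SYM = {"Mod": "M", "INFVerb": "I", "Perf": "P", "Adv": "A", "NPPhr": "N",
       "VerbInf": "i", "ModInf": "m", "PerfInf": "p",
       "VerbSup": "s", "ModSup": "q"}

VC1 = "[MI]N?A*i"   # Mod|INFVerb (NP) Adv* VerbInf
VC2 = "MN?A*mi"     # Mod (NP) Adv* ModInf VerbInf
VC3 = "MN?A*ps"     # Mod (NP) Adv* PerfInf VerbSup
VC4 = "PN?A*s"      # Perf (NP) Adv* VerbSup
VC5 = "PN?A*qi"     # Perf (NP) Adv* ModSup VerbInf
VCGRAM = re.compile("(?:" + "|".join([VC1, VC2, VC3, VC4, VC5]) + r")\Z")

def valid_cluster(symbols):
    """True iff the symbol sequence is in the union VC1 | ... | VC5."""
    return bool(VCGRAM.match("".join(SYM[s] for s in symbols)))

print(valid_cluster(["Mod", "NPPhr", "VerbInf"]))  # skulle han ... springa -> True
print(valid_cluster(["Perf", "VerbInf"]))          # *hade springa -> False
```

The optional N in every rule is what lets the same five patterns cover inverted sentences, where the subject noun phrase sits inside the cluster.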
The infinitive marker att is in Swedish homonymous with the complementizer att ‘that’ or part of för att ‘because’, and is thus not necessarily followed by an infinitive, as in (6.33), (6.34) and (6.35). Such ambiguous constructions are selected as separate segments by the regular expression in (RE6.45), which changes the verbal head label to <vpHeadATTFinite>.

(6.33)
a. Tuni ringde mig sen och sa att allt hade gått bara bra.
Tuni called me later and said that everything had [pret] gone [sup] just good
– Tuni called me later and said that everything had gone just fine.
b. Tuni <vp> <vpHead> ringde </vpHead> <np> mig </np> </vp> sen och <vp> <vpHead> sa </vpHead> </vp> att <vp> <vpHeadATTFinite> <np> allt </np> <vc> hade gått </vc> </vpHead> <np> <ap> bara bra </ap> </np> </vp> .

(6.34)
a. Men det skulle han aldrig ha gjort för att då börjar grenen att röra på sig ...
but it should he never have done because then starts [pres] the-branch to move on itself
– But he should never have done that, because then the branch starts to move.
b. Men <np> det </np> <vp> <vpHead> <vc> skulle <np> han </np> aldrig ha <np> <ap> gjort </ap> </np> </vc> </vpHead> </vp> <np> för </np> att <vp> <vpHeadATTFinite> då börjar </vpHead> <np> grenen </np> </vp> att <vp> <vpHeadInf> <np> röra </np> <pp> <ppHead> på </ppHead> <np> sig </np> </pp> </vpHead> </vp> ...

(6.35)
a. så tänkte jag att nu hade jag chansen.
so thought I that now had [pret] I the-chance
– So I thought that now I had the chance.
b. <vp> <vpHead> så tänkte <np> jag </np> </vpHead> </vp> att <vp> <vpHeadATTFinite> nu hade <np> jag </np> </vpHead> <np> chansen </np> </vp> .
(RE6.45)
define SelectATTFinite [
  "<vpHead>" -> "<vpHeadATTFinite>" ||
  [ [ [[{sa} Tag+] | [[{för} Tag+] / NPTags]] ("</vpHead></vp>") ] |
    [ [{tänkte} Tag+] [[NPPhr "</vpHead></vp>"] | ["</vpHead>" NPPhr "</vp>"]]]] InfMark "<vp>" _ ];

Verbal heads with only a supine verb, as in (6.36) and (6.37), are also selected separately. They are considered grammatical in subordinate clauses, whereas main clauses with supine verbs without preceding auxiliary verbs are invalid in Swedish (see Section 4.3.5). The transducer created by the regular expression in (RE6.46) replaces such a verbal head marking with <vpHeadSup>.

(6.36)
a. Tänk om jag bott hos pappa.
think if I lived [sup] with daddy
– Think if I had lived at Daddy’s.
b. Tänk <pp> <ppHead> om </ppHead> <np> jag </np> </pp> <vp> <vpHeadSup> bott </vpHead> <pp> <ppHead> <np> hos </np> </ppHead> <np> pappa </np> </pp> </vp> .

(6.37)
a. det var en gång en pojke som fångat en groda.
it was a time a boy that caught [sup] a frog
– There was once a boy that had caught a frog.
b. <np> det </np> <vp> <vpHead> <np> var </np> </vpHead> <np> en gång </np> <np> en pojke </np> </vp> som <vp> <vpHeadSup> fångat </vpHead> <np> en groda </np> </vp> .

(RE6.46)
define SelectSupVP [
  "<vpHead>" -> "<vpHeadSup>" || _ VerbSup "</vpHead>"];

6.8 Error Detection and Diagnosis

6.8.1 Introduction

The broad grammar is applied to mark both the grammatical and ungrammatical phrases in a text. The narrow grammar expresses the nature of grammatical phrases in Swedish and is then used to distinguish the true grammatical patterns from the ungrammatical ones. The automata created in the error detection stage correspond to the patterns that do not meet the constraints of the narrow grammar, and thus compile into a grammar of errors. This is achieved by subtraction of the narrow grammar from the broad grammar.
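The subtraction step can be mimicked at the level of membership testing: a segment lies in the error language exactly when the broad grammar accepts it and the narrow grammar rejects it, i.e. it is in L(broad) − L(narrow). A minimal sketch follows; the two-symbol encoding is invented for illustration, whereas the real grammars are XFST automata over tagged text:

```python
import re

# d/D = indefinite/definite determiner, n/N = indefinite/definite noun.
BROAD_NP = re.compile(r"[dD][nN]\Z")    # overgenerating: any Det + any N
NARROW_NP = re.compile(r"(?:dn|DN)\Z")  # definiteness must agree

def definiteness_error(segment):
    """Membership in L(broad) - L(narrow): broad accepts, narrow rejects."""
    return bool(BROAD_NP.match(segment)) and not NARROW_NP.match(segment)

print(definiteness_error("dn"))  # en blomma       -> False
print(definiteness_error("dN"))  # *en blomman     -> True
print(definiteness_error("x"))   # not an NP shape -> False
```

Note that both grammars stay positive: neither regular expression describes an error pattern, and the error language falls out of the difference alone.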
The potential phrase segments recognized by the broad grammar are checked against the rules in the narrow grammar and, by looking at the difference, the constructions violating these rules are identified. The detection process is also partial in the sense that errors are located in an appropriately delimited context, i.e. a noun phrase when looking for agreement errors in noun phrases, a verbal head when looking for violations of finite verbs, etc. The replacement operator is used for the selection of errors in text.

6.8.2 Detection of Errors in Noun Phrases

In the current narrow grammar, there are three rules for agreement errors in noun phrases without postnominal attributes and three for partitive constructions, all reflecting the features of definiteness, number and gender and differing only in the context they are detected in. We present the detection rules for noun phrases without postnominal attributes in (RE6.47) and for partitive noun phrases in (RE6.48). These automata represent the result of subtracting the narrow grammar of, e.g., all noun phrases that meet the definiteness conditions (NPDefs) ((RE6.33) on p. 208) from the overgenerating grammar of all noun phrases (NP) ((RE6.22) on p. 196). By application of a marking transducer, the ungrammatical segments are selected and annotated with appropriate diagnosis-markers related to the types of rules that are violated, as presented in (RE6.47) and (RE6.48).

(RE6.47)
a. define npDefError ["<np>" [NP - NPDefs] "</np>"];
   define npNumError ["<np>" [NP - NPNum] "</np>"];
   define npGenError ["<np>" [NP - NPGen] "</np>"];
b. define markNPDefError [
     npDefError -> "<Error definiteness>" ... "</Error>"];
   define markNPNumError [
     npNumError -> "<Error number>" ... "</Error>"];
   define markNPGenError [
     npGenError -> "<Error gender>" ... "</Error>"];

(RE6.48)
a.
define NPPartDefError [
  "<NPPart>" [NPPart - NPPartDefs] "</np>"];
define NPPartNumError [
  "<NPPart>" [NPPart - NPPartNum] "</np>"];
define NPPartGenError [
  "<NPPart>" [NPPart - NPPartGen] "</np>"];

b. define markNPPartDefError [
     NPPartDefError -> "<Error definiteness NPPart>" ... "</Error>"];
   define markNPPartNumError [
     NPPartNumError -> "<Error number NPPart>" ... "</Error>"];
   define markNPPartGenError [
     NPPartGenError -> "<Error gender NPPart>" ... "</Error>"];

The narrow grammar of noun phrases is prepared for a further extension to noun phrases modified by relative clauses, which in the current version of the system are just selected as distinct from the other noun phrase types.

6.8.3 Detection of Errors in the Verbal Head

Three detection rules are defined for verb errors, identifying the three types of context they can appear in. Errors in finite verb form are checked directly in the verbal head (vpHead). Errors in infinitive phrases are detected in the context of a verbal head preceded by an infinitive marker (vpHeadInf). Errors in verb form following an auxiliary verb are detected in the context of previously selected (potential) verb clusters (vc). The nets of these detecting regular expressions, presented in (RE6.49a), correspond (as for noun phrases) to the difference between the grammatical rules (e.g. VPInf in (RE6.39) on p. 211) and the more general rules (e.g. VPhead in (RE6.22) on p. 196), yielding the ungrammatical verbal head patterns. The annotating automata in (RE6.49b) are used for error diagnosis.

(RE6.49)
a. define vpFiniteError [
     "<vpHead>" [VPhead - VPFinite] "</vpHead>"];
   define vpInfError [
     "<vpHeadInf>" [VPhead - VPInf] "</vpHead>"];
   define VCerror [
     "<vc>" [VC - VCgram] "</vc>"];
b. define markFiniteError [
     vpFiniteError -> "<Error finite verb>" ... "</Error>"];
   define markInfError [
     vpInfError -> "<Error infinitive verb>" ... "</Error>"];
   define markVCerror [
     VCerror -> "<Error verb after Vaux>" ...
"</Error>"];

Also, the narrow grammar of verbs can be extended with the grammar of coordinated verbs, the use of finite verb forms after att ‘that’, and the bare supine verb form as predicate, all selected as separate patterns.

6.9 Summary

This chapter presented the final step of this thesis: implementing detection of some of the grammar errors found in the Child Data corpus. The whole system is implemented as a network of finite state transducers, disambiguation is minimal, achieved essentially by parsing order and filtering techniques, and the grammars of the system are always positive. The system detects errors in noun phrase agreement and errors in finite and non-finite verb forms. The strength of the implemented system lies in the definition of grammars as positive rule sets, covering the valid rules of the language. The rule sets remain quite small and practically no description of errors by hand is necessary. There are altogether six rules defining the broad grammar set, and the narrow grammar set is also quite small. Other automata are used for selection and filtering. We do not have to elaborate on what errors may occur, only in what context, and certainly not spend time on stipulating their structure. The approach further aimed at minimal information loss in order to be able to handle texts containing errors. The degree of ambiguity is maximal at the lexical level, where we choose to attach all lexical tags to strings. At a higher level, structural ambiguity is treated by parsing order, grammar extension and filtering techniques. The parsing order resolves some structural ambiguities and is complemented by grammar extensions as an application of filtering transducers that refine and/or redefine the parsing decisions.
Other disambiguation heuristics are applied, for instance, to noun phrases, where pronouns that follow a verbal head are attached directly to the verbal head in order to prevent them from attaching to a subsequent noun.

Chapter 7 Performance Results

7.1 Introduction

The implementation of the grammar error detector is to a large extent based on the lexical and syntactic circumstances displayed in the Child Data corpus. The actual implementation proceeded in two steps. In the first phase we developed the grammar so that the system could run on sentences containing errors and correctly identify the errors. When the system was then run on complete texts, including correct material, the false alarms allowed by the system were revealed. The second phase involved adjustment of the grammar to improve the flagging accuracy of the system. FiniteCheck was tested for grammatical coverage (recall) and flagging accuracy (precision) on Child Data and on an arbitrary text not known to the system, in accordance with the performance test on the other three grammar checkers (see Section 5.2.3). In this chapter I present results from both the initial phase in the development of the system (Section 7.2) and the improved current version (Section 7.3). The results are further compared to the performance of the other three Swedish checkers on both Child Data (Section 7.4) and the unseen adult text (Section 7.5). The chapter ends with a short summary and conclusions (Section 7.6).

7.2 Initial Performance on Child Data

7.2.1 Performance Results: Phase I

The results of the implemented detection of errors in noun phrase agreement and in verb form in finite verbs, after auxiliary verbs and after infinitive markers in Child Data, from the initial Phase I in the development of FiniteCheck, are presented in Table 7.1.

Table 7.1: Performance Results on Child Data: Phase I

                                  CORRECT ALARM        FALSE ALARM      PERFORMANCE
ERROR TYPE               ERRORS   Correct  Incorrect   No     Other     Recall  Precision  F-value
                                  Diagn.   Diagn.      Error  Error
Agreement in NP          15       14       1           76     64        100%    10%        18%
Finite Verb Form         110      98       0           237    19        89%     28%        42%
Verb Form after Vaux     7        6        0           61     10        86%     8%         15%
Verb Form after inf. m.  4        4        0           5      0         100%    44%        62%
TOTAL                    136      122      1           379    93        90%     21%        34%

The grammatical coverage (recall) on this training corpus was maximal, except for one erroneous verb form after an auxiliary verb and a few instances of errors in finite verb form. The overall recall rate for these four error types was 90%. When tested on the whole Child Data corpus, many segments were wrongly marked as errors and the precision rate was quite low, only 21% in total, resulting in an overall F-value of 34%. Most of the false alarms occurred for errors in finite verb form, followed by errors in noun phrase agreement. Relative to the error frequency of the individual error types, errors in verb form after an auxiliary verb had the lowest precision (8%), closely followed by errors in noun phrase agreement (10%). The grammar of the system was at this initial stage based essentially on the syntactic constructions displayed in the erroneous patterns that we wanted to capture. Many of the false alarms were due to missing grammar rules when tested on the whole text corpus. Other false markings of correct text occurred due to ambiguity, incorrect segmentation of the text in the parsing stage, or occurrences of other error categories than grammatical ones. Below I discuss in more detail the grammatical coverage and flagging accuracy in this initial phase.

7.2.2 Grammatical Coverage

Errors in Noun Phrase Agreement

All errors in noun phrase agreement were detected, one of them with incorrect diagnosis, due to a split in the head noun. FiniteCheck is not prepared to handle segmentation errors and, exactly like the other Swedish grammar checkers, detects the noun phrase with inconsistent use of adjectives (G1.2.3; see (4.9) on p. 49) only in part.
The detector yields both the correct diagnosis of gender mismatch and an incorrect diagnosis of a definiteness mismatch, since the first part troll ‘troll’ of the head noun is indefinite and neuter and does not agree with the definite, common gender determiner den ‘the’, as seen in (7.1a). When the head noun has the correct form and is no longer split into two parts, the whole noun phrase is selected and a gender mismatch is reported, as seen in (7.1b).

(7.1) (G1.2.3)
a. det va <Error definiteness><Error gender> den ∗hemske ∗fula troll </Error> karlen tokig som ...
it was the [com,def] awful [masc] ugly [def] troll [neu,indef] man [com,def] Tokig that
– It was the awful ugly magician Tokig that ...
b. det va <Error gender> den ∗hemske ∗fula trollkarlen </Error> tokig som ...
it was the [com,def] awful [masc] ugly [def] magician [com,def] Tokig that
– It was the awful ugly magician Tokig that ...

Errors in Finite Verb Form

Among the errors in finite verb form, none of the errors concerning main verbs realized as participles were detected (G5.2.90–G5.2.99; see (4.30) on p. 60). They require other methods for detection, since, as seen in (7.2), they are interpreted as adjective phrases.

(7.2)
a. älgen sprang med olof till ett stup och ∗ kastad ner olof och hans hund
the-moose ran with Olof to a cliff and threw [past part] down Olof and his dog
– The moose ran with Olof to a cliff and threw Olof and his dog over it.
b. <np> älgen </np> <vp> <vpHead> sprang med </vpHead> <np> olof </np> <pp> <ppHead> till </ppHead> <np> ett stup </np> </pp> </vp> och <np> <ap> kastad </ap> </np> ner <np> olof </np> och <np> hans hund </np>

Two errors were missed due to preceding verbs joined into the same segment and then treated as verb clusters, as shown in (7.3) and (7.4).

(7.3) (G5.1.1)
a. Madde och jag bestämde oss för att sova i kojan och se om vi ∗ få se vind.
Madde and I decided ourselves for to sleep in the-hut and see if we
can [untensed] see Vind – Madde and I decided to sleep in the hut and see if we will see Vind. b. <np> Madde </np> och <np> jag </np> <vp> <vpHead> bestämde </vpHead> <np> oss </np> </vp> <np> för </np> att <vp> <vpHeadInf> sova i </vpHead> <np> kojan </np> </vp> och <vp> <vpHead> se om <np> vi </np> <np> få </np> se </vpHead> <np> vind </np> </vp> . (7.4) (G5.2.40) ∗ a. När vi kom fram börja vi packa upp våra grejer och when we came forward start [untensed] we pack up our stuff and rulla upp sovsäcken. role up the-sleepingbag – When we arrived, we started to unpack our things and role out the sleepingbag. b. När <np> vi </np> <vp> <vpHead> <vc> kom fram börja </vc> <np> vi </np> packa upp </vpHead> <np> våra grejer </np> </vp> och <vp> <vpHeadCoord> rulla upp </vpHead> <np> sovsäcken </np> </vp> . One of the errors in finite verb was wrongly selected as seen in (7.5b). Here, the noun bo ‘nest’ is homonymous with the verb bo ‘live’ and joined together with the main verb to a verb cluster the detector selects the verb cluster 1 and diagnoses it as an error in finite verb, which is actually true but only for the main verb, the second constituent of this segment. 1 The noun phrase tags surrounding bo are ignored in the selection as verb cluster, see (RE6.40) on p.211. Performance Results (7.5) (G5.2.70) a. Då gick then went 223 pojken the-boy vidare further och and såg saw inte not att that binas bees’s bo nest ∗ trilla ner. tumble [untensed] down – Then the boy went further on and did not see that the nest of the bees tumbled down. b. <vp> <vpHead> Då gick </vpHead> <np> pojken </np> <np> vidare </np> </vp> och <vp> <vpHead> <np> såg </np> inte </vpHead> </vp> att <np> binas </np> <vp> <vpHead> <Error finite verb> <vc> <np> bo </np> trilla </vc> </Error> </vpHead> </vp> ner. 
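The lexical side of cases like (7.5), where bo carries both a noun and a verb reading, follows from the system's design decision to attach all lexical tags to words rather than disambiguate with a tagger. A minimal illustrative sketch in Python (a toy lexicon with simplified, invented tags, not FiniteCheck's actual lexicon or tagset):

```python
# Illustrative sketch only: keep ALL lexical tags for a word instead of
# letting a tagger pick one, so information needed later to find an error
# is not discarded. Lexicon and tags are hypothetical simplifications.
LEXICON = {
    "binas":  [("noun", "gen,pl")],        # 'the bees''
    "bo":     [("noun", "neu,indef"),      # 'nest'
               ("verb", "infinitive")],    # 'live'
    "trilla": [("verb", "infinitive")],    # 'tumble'
}

def lookup(token):
    """Return every (pos, features) pair for a token, keeping ambiguity."""
    return LEXICON.get(token.lower(), [("unknown", "")])

tags = {w: lookup(w) for w in "binas bo trilla".split()}
# 'bo' retains both its noun and verb reading; a later parsing stage,
# not a tagger, decides which reading survives in context.
assert ("verb", "infinitive") in tags["bo"]
assert ("noun", "neu,indef") in tags["bo"]
```

The cost of keeping the ambiguity is visible in (7.5): the noun reading of bo is still available when the verb cluster is formed, which is exactly what produces the partially wrong diagnosis there.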
Rest of the Verb Form Errors

One error in verb form after an auxiliary verb was not detected (see (7.6)). It involved coordination of a verb cluster with a further verb that should follow the same pattern and thus be in the infinitive (i.e. låta 'let [inf]'). The system does not take coordination of verbs into consideration; the coordinated verb is identified as a separate verbal head with a finite verb, which is a valid form according to the grammar rules of the system, and the error is overlooked.

(7.6) (G6.1.2)
a. Ibland får man bjuda på sig själv och ∗låter henne/honom vara med!
   sometimes must [pres] one offer [inf] on oneself and let [pres] her/him be with
   – 'Sometimes one has to make a sacrifice and let him/her come along.'
b. <vp> <vpHead> Ibland <np> får </np> <np> man </np> bjuda <pp> <ppHead> på </ppHead> <np> sig </np> </pp> </vpHead> <np> själv </np> </vp> och <vp> <vpHead> låter </vpHead> </vp> henne/honom <vp> <vpHead> <np> vara </np> med </vpHead> </vp> !

Finally, all errors in verb form after an infinitive marker were detected.

7.2.3 Flagging Accuracy

This subsection presents the kinds of false flaggings that occurred in this first test of the system. The description proceeds error type by error type, specifying whether the false alarms were due to missing grammar rules, erroneous segmentation of the text at the parsing stage, or ambiguity. The false alarms involving other error categories are also specified.

False Alarms in Noun Phrase Agreement

The kinds and numbers of false alarms occurring in noun phrases are presented in Table 7.2.

Table 7.2: False Alarms in Noun Phrases: Phase I

  FALSE ALARM TYPE                         NO.
  Not in Grammar:   NPInd+som                5
                    Adv in NP               28
                    other                    8
  Segmentation:     too long parse          26
  Ambiguity:        PP                       7
                    V                        2
  Other Error:      misspelling             12
                    split                   48
                    sentence boundary        4

Most of these false alarms arose because the constructions involved were not covered by the grammar of the system.
For instance, adverbs in noun phrases as in (7.7a) were not covered, causing gender agreement alarms, since in Swedish the neuter form of an adjective often coincides with the adverb of the same lemma. Further, noun phrases with a subsequent relative clause, such as (7.7b), were selected as errors in definiteness, although they are correct, since the head noun takes the indefinite form when followed by such a clause (see Section 4.3.1).

(7.7)
a. Det var i skolan och jag kom lite för sent till en lektion med <Error gender> väldigt sträng lärare </Error>.
   it was in school and I came little too late to a class with very hard/strict teacher
   – 'It was in school and I came a little late to a class with a very strict teacher.'
b. Jag tycker att det borde finnas en hjälpgrupp för <Error definiteness> de elever </Error> som har lite sociala problem.
   I think that it should exist a help-group for the [pl,def] pupils [pl,indef] that have some social problems
   – 'I think that there should be a help-group for the pupils that have some social problems.'

Other false flaggings depended on the application of longest match, resulting in noun phrases of too wide a range, as in (7.8), where the modifying predicative complement and the subject are merged into one noun phrase: the inverted word order forces the verb to the end of the sentence instead of its usual place in between, so skolan 'school' should form a noun phrase of its own.

(7.8) dom tänker inte hur <Error definiteness> viktig skolan </Error> är
      they think not how important [str] school [def] is
      – 'They do not think how important school is.'

Furthermore, due to lexical ambiguity, some prepositional phrases, such as the one in (7.9), and some verbs were parsed as noun phrases and later marked as errors.

(7.9) Det är en ganska stor väg ungefär <Error definiteness> vid hamnen </Error>
      it is a rather big road somewhere wide [indef]/at harbor [def]
      – 'It is a rather big road somewhere at the harbor.'
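The over-wide parse in (7.8) is a direct consequence of longest match: a greedy noun phrase pattern applied to a tag sequence absorbs a preceding adjective whenever it can. A small illustration over a toy tag string (the tag names and the miniature NP rule are simplified inventions, not the system's actual grammar):

```python
import re

# Toy tag sequence for 'dom tänker inte hur viktig skolan är' (7.8):
# one simplified, invented tag per token.
tagged = "PRON VERB ADV ADV ADJ NOUN VERB"

# A broad NP rule: any number of adjectives followed by a noun.
NP = re.compile(r"(?:ADJ )*NOUN")

# Longest match absorbs the predicative adjective together with the
# subject noun: 'viktig skolan' is selected as one NP candidate, which
# a narrow (agreeing) grammar would then reject, producing a false alarm.
assert NP.findall(tagged) == ["ADJ NOUN"]
```

With no adjective adjacent to the noun, the same rule harmlessly selects the bare noun, which is why the problem only surfaces in inverted word order.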
False flaggings involving error categories other than grammar errors were also quite common. Mostly splits, as in (7.10a), were flagged: here the noun ögonblick 'moment' is split, and its first part ögon 'eyes' does not agree in number with the preceding singular determiner and adjective. Flaggings involving misspellings also occurred, as in (7.10b), where the newly formed word is a noun of different gender and definiteness than the preceding determiner and so causes agreement errors. Some cases of missing sentence boundaries were likewise flagged as errors in noun phrase agreement.

(7.10)
a. För <Error number> ett kort ögon </Error> blick trodde jag ...
   for a [sg] short eye [pl] blinking thought I
   – 'For a short moment I thought ...'
b. <Error definiteness> <Error gender> Det ända </Error> </Error> jag vet
   the [neu,def] end [com,indef] I know
   – 'The only thing I know ...'

Furthermore, erroneous tags assigned in the lexical lookup caused trouble, for instance when many words were erroneously tagged as proper names.

False Alarms in Finite Verb Form

The types and numbers of false alarms in finite verbs are summarized in Table 7.3. These occurred mostly because of the small size of the grammar, but also due to ambiguity problems.

Table 7.3: False Alarms in Finite Verbs: Phase I

  FALSE ALARM TYPE                              NO.
  Not in Grammar:   imperative                   56
                    coordinated infinitive       74
                    discontinuous verb cluster   43
  Ambiguity:        noun                         36
                    pronoun                       8
                    preposition/adjective        20
  Other Error:      misspelling                   9
                    split                        10

Imperative verb forms, which in this first phase were not part of the grammar, caused false alarms not only on actual verbs, as in (7.11a), but also on strings homonymous with such forms, as in (7.11b). Here the word sätt is ambiguous between the noun reading 'way' and the imperative verb form 'set'.

(7.11)
a. Men <Error finite verb> titta </Error> en stock.
   but look [imp] a log
   – 'But look, a log.'
b. Dom samlade in pengar <Error finite verb> på olika sätt </Error>
   they collected in money in different ways/set [imp]
   – 'They collected money in different ways.'

Further, coordinated infinitives as in (7.12) were diagnosed as errors in finite verb form, since, due to the partial parsing strategy, they were selected as separate verbal heads (see (6.31) and (6.32) on p. 212).

(7.12)
a. hon skulle springa ner och <Error finite verb> larma </Error>
   she would run [inf] down and alarm [inf]
   – 'She would run down and raise the alarm.'
b. det är dags att gå och <Error finite verb> lägga sig </Error>.
   it is time to go and lay [inf] oneself
   – 'It is time to go and lie down.'

Similar problems occurred with discontinuous verb clusters, where a noun followed the auxiliary verb and the subsequent verb forms were treated as separate verbal heads (see (6.27) on p. 204). Further, primarily nouns, but also pronouns, adjectives and prepositions were recognized as verbal heads, causing false error diagnoses. The other error categories selected as errors in finite verb form concerned both splits and misspellings, but these were considerably fewer than the corresponding false alarms in noun phrase agreement.

False Alarms in Verb Forms after an Auxiliary Verb

False alarms in verb forms after an auxiliary verb occurred either due to ambiguity, with nouns, pronouns, adjectives and prepositions interpreted as verbs, or due to occurrences of other error categories (Table 7.4). Pronouns were interpreted as verbs mostly in front of a copula verb and merged into a verb cluster segment. Similar problems occurred with adjectives and participles (see (6.22)-(6.24) starting on p. 202).

Table 7.4: False Alarms in Verb Clusters: Phase I

  FALSE ALARM TYPE                         NO.
  Ambiguity:        noun                     26
                    pronoun                  18
                    preposition/adjective    17
  Other Error:      misspelling               3
                    split                     7

Among the false flaggings concerning other error categories, both spelling errors and splits were flagged.
In (7.13) we see an example of a misspelling where the adjective rädd 'afraid' is written as red, coinciding with the verb red 'rode', and is marked as an error in verb form after an auxiliary verb.²

² The broad grammar rule for verb clusters joins any types of verbs, which is why the copula verb blev 'became' is included.

(7.13) pojken <Error verb after Vaux> blev red </Error>
       the-boy became rode
       – 'The boy became afraid.'

Furthermore, many instances of missing punctuation at a sentence boundary were flagged as errors in verb clusters, as in (7.14).³ As in the performance test of the other grammar checkers, these flaggings are not included in the test: they represent correct flaggings, although the diagnosis is not correct.

³ Two vertical bars indicate the missing clause or sentence boundary.

(7.14)
a. Jag <Error verb after Vaux> fortsatte vägen fram || då såg </Error> jag en brandbil || jag visste vad det var.
   I continued the-road forward then saw I a fire-car I knew what it was
   – 'I continued forward on the road, then I saw a firetruck. I knew what it was.'
b. I hålet pojken <Error verb after Vaux> hittat || fanns </Error> en mullvad.
   in the-hole the-boy found was a mole
   – 'In the hole the boy had found, there was a mole.'

False Alarms in Verb Forms in Infinitive Phrases

Finally, five false alarms on infinitival verbal heads occurred in constructions that do not require an infinitive verb form after att, which is both an infinitive marker 'to' and a subjunction 'that' (see (6.33)-(6.35) starting on p. 213).

7.3 Current Performance on Child Data

7.3.1 Introduction

As shown above, almost all of the errors in Child Data were detected by FiniteCheck. The segments erroneously classified as errors by the implemented detector were mostly due to the small number of grammatical structures covered by the grammar, to tagging problems, and to the high degree of ambiguity in the system.
Many alarms also involved other error categories, such as misspellings, splits and omitted punctuation. In accordance with these observations, the detection performance of the system was improved in three ways in order to avoid false alarms:

• extend and correct the lexicon
• extend the grammar
• improve parsing

The full form lexicon of the system is rather small (around 160,000 words) and not without errors, so the first and rather easy step was to correct erroneous tagging and add new words to the lexicon. The grammar rules were then extended, and filtering transducers were used to block false parses. Below follows a description of the grammar extension and the other improvements made to avoid false alarms for the individual error types. Then the current performance of the system is presented (Section 7.3.3).

7.3.2 Improving Flagging Accuracy

Improving Flagging Accuracy in Noun Phrase Agreement

The grammar of adjective phrases was expanded with the missing adverbs. Noun phrases followed by relative clauses display distinct agreement constraints and were selected separately by the already discussed regular expression (RE6.37) (see p. 210). This does not mean that the grammar now covers such noun phrases, but false alarms on these constructions are avoided. The false alarms in noun phrases caused by limitations in the grammar were all eliminated. This grammar update further improved parsing in the system and decreased the number of over-wide parses giving rise to false alarms. The types and numbers of false alarms that remain are presented in Table 7.5.

Table 7.5: False Alarms in Noun Phrases: Phase II

  FALSE ALARM TYPE                         NO.
  Segmentation:     too long parse           5
  Ambiguity:        PP                      10
  Other Error:      misspelling             10
                    split                   35
                    sentence boundary        2

Among these are (relative) clauses without complementizers, as in (7.15).

(7.15)
a. det var den godaste frukost jag någonsin ätit ...
   it was the best breakfast I ever eaten
   – 'It was the best breakfast I have ever eaten ...'
b. <np> det </np> <vp> <vpHead> <np> var </np> </vpHead> <Error definiteness> <np> den <ap> godaste </ap> frukost </np> </Error> <np> jag </np> </vp> <vp> <vpHead> någonsin ätit </vpHead> </vp> .

Improving Flagging Accuracy in Finite Verbs

In the case of finite verbs, the problem with imperative verbs is solved to the extent that forms that do not coincide with other verb forms are accepted as finite verb forms, e.g. tänk 'think'. The imperative forms that coincide with infinitives (e.g. titta 'look') remain a problem. The difficulty is that errors in verbs realized as a lack of tense endings often coincide with the imperative (and infinitive) form of the verb; allowing all imperative verb forms as grammatical finite verb forms would therefore mean that such errors would no longer be detected by the system. Normally, other cues, such as end-of-sentence marking or a noun phrase before the predicate, are used to identify imperative verb forms. These methods are, however, not suitable for texts written by children, since such texts often lack end-of-sentence punctuation or the capitals indicating the beginning of a sentence, so a noun phrase preceding the predicate could equally well be the end of a previous sentence. Still, simply accepting the imperative forms that do not coincide with other verb forms as grammatical finite verb forms cut the number of false alarms on imperatives in half, as shown in Table 7.6 below.

False alarms on finite verbs in coordinations with infinitive verbs decreased to just nine; they were blocked by selecting infinitive verbs preceded by a verbal group or infinitive phrase as a separate pattern category with the transducer in (RE6.44) (see p. 212).
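The blocking strategy just described rests, like the error detection itself, on operations over regular languages: a broad grammar minus a narrow grammar yields the error patterns, and exception patterns can likewise be subtracted before flagging. A minimal sketch with finite Python sets standing in for the automata (the tag names and the tiny grammars are invented for illustration, not FiniteCheck's):

```python
# Minimal sketch of the subtraction idea, with finite sets standing in
# for finite state automata. Tag sequences are hypothetical toys.

# Broad grammar: any determiner-noun tag combination parses as an NP.
broad = {(d, n) for d in ("det_com", "det_neu")
                for n in ("noun_com", "noun_neu")}

# Narrow grammar: only the gender-agreeing combinations are valid.
narrow = {("det_com", "noun_com"), ("det_neu", "noun_neu")}

# Subtraction yields exactly the ungrammatical patterns: no error rules
# were written, only positive grammars describing valid Swedish.
errors = broad - narrow

# An exception pattern (cf. blocking coordinated infinitives) can be
# removed the same way before flagging. None apply in this toy.
exceptions = set()
flagged = errors - exceptions

assert ("det_com", "noun_neu") in flagged      # e.g. a gender mismatch
assert ("det_com", "noun_com") not in flagged  # agreeing NPs pass
```

With real automata the sets are infinite, but the difference operation and the resulting "only positive grammars" property are the same.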
Discontinuous verbal groups with a noun phrase following the auxiliary verb were joined together by the automaton (RE6.29) (see p. 205), and the narrow grammar of verb clusters was expanded to include (optional) noun phrases. Almost half of these false alarms were thereby avoided.

False alarms on finite verbs caused by ambiguous interpretation also decreased. Some were avoided by the grammar update, which also improved parsing. Further adjustments targeted nouns interpreted as verbs in possessive noun phrases and adjectives in noun phrases interpreted as verbal heads; these were filtered out by applying the automata (RE6.26) and (RE6.25) (see p. 201). Furthermore, verbal heads with a single supine verb form were singled out, since they are grammatical in subordinate clauses (see (RE6.46) on p. 214). The remaining false alarms are summarized in Table 7.6.

Table 7.6: False Alarms in Finite Verbs: Phase II

  FALSE ALARM TYPE                               NO.
  Not in Grammar:   imperative                    27
                    coordinated infinitive         9
                    discontinuous verb clusters   28
  Ambiguity:        noun                           9
                    pronoun                        1
                    preposition/adjective         14
  Other:                                           6
  Other Error:      misspelling                   18
                    split                         14

Improving Flagging Accuracy in Verb Form after Auxiliary Verb

The ambiguity resolutions defined for finite verbs blocked false alarms not only on finite verbs but also in verb clusters. Furthermore, an annotation filter (RE6.27) (see p. 203) was defined for copula verbs to block false markings of copula verbs combined with other constituents, such as pronouns, adjectives and participles, as a sequence of verbs. The types and numbers of false alarms that remain are presented in Table 7.7.

Table 7.7: False Alarms in Verb Clusters: Phase II

  FALSE ALARM TYPE                         NO.
  Ambiguity:        noun                      4
                    pronoun                   4
                    preposition/adjective    24
  Other Error:      misspelling               6
                    split                     9

Improving Flagging Accuracy in Verb Form in Infinitive Phrases

The false alarms in infinitive verb phrases occurred in constructions that do not require an infinitive verb form after an infinitive marker. These were selected as separate patterns by the automaton (RE6.45) (see p. 213), and false markings of this type were blocked.

7.3.3 Performance Results: Phase II

The performance of the new, improved version of the system (Phase II) is presented in Table 7.8. The grammatical coverage is the same for all error types except finite verbs, where the recall rate decreased slightly, from 89% to 87%.

Table 7.8: Performance Results on Child Data: Phase II

                                   FINITECHECK: PHASE II
                                  CORRECT ALARM          FALSE ALARM      PERFORMANCE
  ERROR TYPE             ERRORS   Correct    Incorrect   No      Other    Recall  Precision  F-value
                                  Diagnosis  Diagnosis   Error   Error
  Agreement in NP            15       14         1         15      47      100%     19%       33%
  Finite Verb Form          110       96         0         94      32       87%     43%       58%
  Verb Form after Vaux        7        6         0         32      15       86%     11%       20%
  Verb Form after inf. m.     4        4         0          0       0      100%    100%      100%
  TOTAL                     136      120         1        145      94       89%     34%       49%

The decrease in recall is a consequence of the improvement in flagging accuracy. That is, in addition to the errors not detected by the system in the initial stage (see Section 7.2), two further errors in finite verb form realized as a bare supine were not detected (G5.2.88, G5.2.89), because all bare supine forms are now selected as separate segments, as shown in (7.16). This selection was necessary in order to avoid marking correct uses of bare supine forms as erroneous. Once the grammar covers the bare supine verb form, these errors can be detected as well.

(7.16) (G5.2.89)
a. Han tittade på hunden. Hunden ∗försökt att klättra ner.
   he looked [pret] at the-dog the-dog tried [sup] to climb down
   – 'He looked at the dog. The dog tried to climb down.'
b.
<np> Han </np> <vp> <vpHead> tittade på </vpHead> <np> hunden </np> </vp> , <np> hunden </np> <vp> <vpHeadSup> försökt </vpHead> </vp> att <vp> <vpHeadInf> klättra ner </vpHead> </vp>

We were able to avoid many of the false flaggings by improving the lexical assignment of tags and expanding the grammar. The parsing results of the system improved along with the flagging accuracy, and the total precision rate rose from 21% to 34%. The remaining false alarms most often have to do with ambiguity; only in the case of verb clusters is further expansion of the grammar needed. Figure 7.1 compares the number of false markings of correct text as erroneous in the initial Phase I and the current Phase II.

Figure 7.1: False Alarms: Phase I vs. Phase II

The types and numbers of alarms revealing other error categories are more or less constant and can be considered a side-effect of a system of this kind. Methods for recognizing those error types are of interest: most splits and misspellings were discovered through agreement problems, and omission of sentence boundaries is in many cases caught by the verb cluster analysis. The overall performance of the system in detecting the four targeted error types increased in F-value from 34% in the initial phase to 49% in the current, improved version.

7.4 Overview of Performance on Child Data

In Section 5.5 I presented the linguistic performance on the Child Data corpus of the other three Swedish tools: Grammatifix, Granska and Scarrie. Here I discuss the results of these tools for the four error types targeted by FiniteCheck and explore the similarities and differences in performance between our system and the other tools. The purpose is not to claim that FiniteCheck is in general superior to the other tools: FiniteCheck was developed on the Child Data corpus, whereas the other tools were not.
However, it is important to show that FiniteCheck represents some improvement over systems that were not designed to cover this particular data.

The grammatical coverage of these three tools and of our detector for the four error types is presented in Figure 7.2.⁴ The three other tools are designed to detect errors in adult texts, and not surprisingly their detection rates are low. Among these four error types, agreement errors in noun phrases are the best covered by these tools, whereas errors in verb form obtained much lower results in general. All three systems managed to detect at least half of the errors in noun phrase agreement, while errors in the finite verb form obtained the worst results. In the case of Grammatifix, the verb error types yielded no or very few detections. Granska targeted all four error types and detected more than half of the errors in three of them, but only 4% of the errors in finite verb form. Scarrie also had problems detecting errors in verbs, although it performed best of the three tools on finite verbs, detecting 15% of them.

Figure 7.2: Overview of Recall in Child Data

⁴ The number of errors per error type is presented within parentheses next to the error type name.

FiniteCheck, which was trained on this data, obtained maximal recall rates for errors in noun phrase agreement and in verb form after an infinitive marker. Errors in the other verb form types obtained a somewhat lower recall (around 86%). Although this is a good result, we should keep in mind that FiniteCheck is here tested on the data that was used for its development. That is, it is not clear whether the system would receive such high recall rates for all four error types on unseen child texts.⁵
⁵ We have not been able to test the system on new child data. Texts written by children are hard to get and require a lot of preprocessing.

However, the high performance in detecting errors, especially for the frequent finite verb form error type, is an obvious difference from the low performance of the other tools, and at least seems to motivate the tailoring of grammar checkers to children's texts.

Precision rates are presented in Figure 7.3. They are in most cases below 50%, for all systems. The results are, however, relative to the number of errors: the figures for errors in finite verb form, a quite frequent error type, are probably the most informative, whereas the errors in verb form after an infinitive marker are too few to draw any firm conclusions from.

Figure 7.3: Overview of Precision in Child Data

Evaluating the overall performance of the systems in detecting these four error types, presented in Figure 7.4 below, the three other systems obtained a recall of 16% on average. The recall rate of FiniteCheck is considerably higher, which may mean that the tool is good at finding erroneous patterns in texts written by children, but that remains to be confirmed by tests on unseen texts. Flagging accuracy is slightly above 30% for Grammatifix, Granska and FiniteCheck; Scarrie obtained slightly lower precision rates. Combining these rates into overall system performance in F-value, Grammatifix obtained the lowest rate, probably due to its low recall, closely followed by Scarrie. Granska had a slightly higher result of 23%. Our system obtained twice the value of Granska.

Figure 7.4: Overview of Overall Performance in Child Data

In conclusion, among these four error types the three other grammar checkers had difficulties detecting the verb form errors in Child Data and detected only around half of the errors in noun phrase agreement.
FiniteCheck had high recall rates for all four error types and a precision on the same level as the other tools. It is unclear how much this outcome is influenced by the fact that the system was developed on exactly this data, but FiniteCheck does not seem to share the difficulty in finding errors in verb form (especially in finite verbs) that the other tools clearly display. A further evaluation of FiniteCheck, on a small text not known to the system, is reported in the following section.

7.5 Performance on Other Text

In order to see how FiniteCheck performs on unseen text of the kind used to test the other Swedish grammar checkers, a small literary text of 1,070 words describing a trip was evaluated. This text is used as a demonstration text by Granska.⁶ It includes 17 errors in noun phrase agreement, five errors in finite verb form and one error in verb form after an auxiliary verb. The purpose of this test is to see whether the results are comparable to those of the other Swedish tools. Note that the aim is not to compare the performance of all the checkers, which would be unfair since the text is Granska's demonstration text, but rather to see how our detector performs on just the error types it targets, compared with tools designed for this kind of text. Below, I first present and discuss the results of FiniteCheck; then the performance of the three other checkers is presented, followed by a comparative discussion.

7.5.1 Performance Results of FiniteCheck

Introduction

The text was first prepared manually, with spaces inserted where needed between all strings, including punctuation. Further, the lexicon had to be updated, since the text uses a particular jargon.⁷ The detection results of FiniteCheck are presented in Table 7.9.
⁶ Demonstration page of Granska: http://www.nada.kth.se/theory/projects/granska/.
⁷ FiniteCheck's lexicon would need to be extended anyway to make a general grammar checking application.

Table 7.9: Performance Results of FiniteCheck on Other Text

                                   FINITECHECK: OTHER TEXT
                                  CORRECT ALARM          FALSE ALARM      PERFORMANCE
  ERROR TYPE             ERRORS   Correct    Incorrect   No      Other    Recall  Precision  F-value
                                  Diagnosis  Diagnosis   Error   Error
  Agreement in NP            17       13         1          2       4       82%     70%       76%
  Finite Verb Form            5        5         0          1       0      100%     83%       91%
  Verb Form after Vaux        1        1         0          1       0      100%     50%       67%
  TOTAL                      23       19         1          4       4       87%     71%       78%

FiniteCheck missed three errors in noun phrase agreement, which leaves it with a total recall of 87%. False alarms occurred for all three error types, mostly in noun phrase agreement, resulting in a total precision of 71%. Below I discuss the performance results in more detail.

Errors in Noun Phrase Agreement

Among the noun phrase agreement errors, three were not detected and one was incorrectly diagnosed. The latter concerned a proper noun preceded by an indefinite common gender determiner: the noun phrase was selected and marked for all three types of agreement errors, as shown in (7.17). The reason for this selection is that the phrase was recognized as a noun phrase by the broad grammar but rejected as ungrammatical by the narrow grammar. The rejection is in itself correct, since the proper noun should stand alone or be preceded by a neuter gender determiner, but the system should signal only an error in gender agreement. The noun phrase was rejected as a whole because there are no rules for noun phrases consisting of a determiner and a proper noun.

(7.17)
a. Detta är sannerligen ∗en Mekka för fjällälskaren ...
   this is certainly a [com,indef] Mekka for the-mountain-lover
   – 'This is certainly a Mekka for the mountain-lover ...'
b.
<np> Detta </np> <vp> <vpHead> är sannerligen </vpHead> <Error definiteness> <Error number> <Error gender> <np> en Mekka </np> </Error> </Error> </Error> </vp> <np> för </np> fjällälskaren ...

The undetected errors all concerned constructions not covered by our grammar. The first, in (7.18a),⁸ involves a possessive noun phrase modifying another noun; FiniteCheck only covers noun phrases with single possessive nouns as modifiers. The other two concern numerals with nouns in the definite form, and our current grammar does not say much about numerals and definiteness.

⁸ Correct forms are presented to the right of the arrow in the examples.

(7.18)
a. den stora ∗forsen brus ⇒ den stora forsens brus
   the big stream [nom] roar [nom] ⇒ the big stream [gen] roar [nom]
b. två ∗nackdelarna ⇒ två nackdelar
   two disadvantages [def] ⇒ two disadvantages [indef]
c. två ∗kåsorna kaffe ⇒ två kåsor kaffe
   two scoops [def] coffee ⇒ two scoops [indef] coffee

Altogether six false flaggings occurred in noun phrase agreement: four of them due to splits, thus involving another error category, and two due to ambiguity in the parsing. Both types are exemplified in the sentence in (7.19). In the first case the noun fjällutrustningen 'mountain equipment [sg,com,def]' is split, and the first part does not agree with the preceding modifiers. The second case involves the complex preposition framför allt 'above all', where allt is joined with the following noun into a noun phrase and a gender mismatch arises.

(7.19)
a. ... i tältet och den övriga fjäll utrustningen vilar tryggheten och framför allt friheten.
   in the-tent and the [sg,com,def] rest mountain [sg/pl,neu,indef] equipment [sg,com,def] rests the-safety and above all [neu,indef] freedom [com,def]
   – '... in the tent and the other mountain equipment lies the safety and above all freedom.'
b.
<pp> <ppHead> i </ppHead> <np> tältet </np> </pp> och <Error definiteness> <Error number> <Error gender> <np> den övriga fjäll </np> </Error> </Error> </Error> <np> utrustningen </np> <vp> <vpHead> vilar </vpHead> <np> tryggheten </np> </vp> och fram <pp> <ppHead> <np> för </np> </ppHead> <Error gender> <np> allt friheten </np> </Error> </pp> .

Errors in Verb Form

All the errors in verb form were detected, with one false alarm for each error type. In the case of finite verbs, the alarm was caused by homonymy: the noun styrka 'force' was interpreted as the verb styrka 'prove', as seen in (7.20).

(7.20)
a. Vinden mojnar inte under natten utan fortsätter med oför minskad styrka.
   the-wind subsides not during the-night but continues with undiminished force
   – 'The wind does not subside during the night, but continues with undiminished force.'
b. <np> Vinden </np> mojnar inte <pp> <ppHead> <np> under </np> </ppHead> <np> natten </np> </pp> <vp> <vpHead> <np> utan </np> fortsätter </vpHead> </vp> med oför <np> minskad </np> <vp> <Error finite verb> <vpHead> <np> styrka </np> </vpHead> </Error> </vp> .

The false alarm in verb form after an auxiliary verb concerned the split noun sovsäcken 'the sleeping bag', where the first part sov is homonymous with the verb 'sleep' and was joined with the preceding verb into a verb cluster, as shown in (7.21).

(7.21)
a. Det finns dock två nackdelarna med tältning, pjäxorna måste i sov säcken för att inte krympa ihop av kylan ...
   there exist however two disadvantages with camping the-skiing-boots must into sleeping bag because not shrink together from the-cold
   – 'There are two disadvantages with camping: the skiing boots must be inside the sleeping bag in order not to shrink from the cold ...'
b.
<np> Det </np> <vp> <vpHead> finns dock </vpHead> <np> två nackdelarna </np> </vp> med tältning, pjäxorna <vp> <vpHead> <Error verb after Vaux> <vc> måste i sov </vc> </Error> </vpHead> <np> säcken </np> </vp> <np> för </np> att <vp> <vpHeadATTFinite> inte krympa </vpHead> </vp> ihop <pp> <ppHead> av </ppHead> <np> kylan </np> </pp>

7.5.2 Performance Results of Other Tools

Grammatifix

The results for Grammatifix are presented in Table 7.10 below: 12 detected errors in noun phrase agreement, one detected error in finite verb form, and one false alarm for verb form after an auxiliary verb. This leaves the system with a total recall of 57% and a precision of 93% for these three error types.

Table 7.10: Performance Results of Grammatifix on Other Text

                                   GRAMMATIFIX: OTHER TEXT
                                  CORRECT   FALSE ALARM      PERFORMANCE
  ERROR TYPE             ERRORS   ALARM     No      Other    Recall  Precision  F-value
                                            Error   Error
  Agreement in NP            17      12       0       0       71%    100%       83%
  Finite Verb Form            5       1       0       0       20%    100%       33%
  Verb Form after Vaux        1       0       0       1        0%      0%        –
  TOTAL                      23      13       0       1       57%     93%       70%

The five missed errors in noun phrase agreement concerned the segment with a possessive noun modifying another noun (see (7.18a)) and the one with a numeral and a noun in the definite form (see (7.18b)). The other cases concerned a possessive proper noun with an erroneously definite noun (see (7.22a)), another definiteness error in a noun (see (7.22b)), and a strong form of the adjective used in a definite noun phrase (see (7.22c)). Correct forms are presented to the right, next to the erroneous phrases.

(7.22)
a. Lapplands ∗drottningen ⇒ Lapplands drottning
   Lappland's queen [def] ⇒ Lappland's queen [indef]
b. ∗en ny dagen ⇒ en ny dag
   a [indef] new [indef] day [def] ⇒ a [indef] new [indef] day [indef]
c. den ∗djup snön ⇒ den djupa snön
   the [def] deep [str] snow [def] ⇒ the [def] deep [wk] snow [def]
No false alarms occurred other than the one with a verb form after an auxiliary verb, concerning exactly the same segment and error suggestions as our detector, as exemplified in (7.21) above.

Granska

The results for Granska are presented in Table 7.11. This system detected 11 agreement errors in noun phrases and the one error in verb form after an auxiliary verb; one false alarm occurred in noun phrase agreement. No errors in finite verb form were identified. The total recall is 52% and precision 92% for these three error types.

Table 7.11: Performance Results of Granska on Other Text

GRANSKA: OTHER TEXT
                                       FALSE ALARM
ERROR TYPE            ERRORS  CORRECT  No     Other  Recall  Precision  F-value
                              ALARM    Error  Error
Agreement in NP         17      11      0      1      65%      92%       76%
Finite Verb Form         5       0      0      0       0%       –         –
Verb Form after Vaux     1       1      0      0     100%     100%      100%
TOTAL                   23      12      0      1      52%      92%       67%

The six errors in noun phrase agreement that were missed concerned the same segment with a possessive noun modifying another noun (see (7.18a)) and both cases with a numeral and a noun in definite form (see (7.18b-c)). Further errors concerned a possessive noun with an erroneous definite noun (see (7.23a)), a neuter gender possessive pronoun with a common gender noun (see (7.23b)) and an indefinite determiner with a definite noun (see (7.23c)).

(7.23) a. ∗ripornas kurren ⇒ ripornas kurr
grouse's hoot [def] ⇒ grouse's hoot [indef]
b. ∗mitt huva ⇒ min huva
my [neu] hood [com] ⇒ my [com] hood [com]
c. ∗en smulan ⇒ en smula
a [indef] bit [def] ⇒ a [indef] bit [indef]

One false alarm occurred in a noun phrase with a split adjective and a missing noun, as shown in (7.24). Here the adjective vinteröppna 'winter-open' (i.e. open for the winter) is split and the first part causes an agreement error in definiteness.

(7.24) ... i den andra vinter öppna — husera en arg gubbe ...
in the [def] other winter [indef] open — haunt [inf] an angry old man
– ... the other cottage open for the winter was haunted by an angry old man ...
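The recall, precision and F-value figures in the tables follow the standard definitions: recall is the share of actual errors detected, precision is the share of flaggings that are correct, and the F-value is their harmonic mean. As a quick sanity check, they can be recomputed from the raw counts; the sketch below uses the Grammatifix totals from Table 7.10 (23 errors, 13 correct alarms, 1 false alarm) and is illustrative only — `metrics` is a hypothetical helper, not part of any of the evaluated tools.

```python
def metrics(errors, detected, false_alarms):
    """Recall, precision and F-value as used in the evaluation tables."""
    recall = detected / errors
    flagged = detected + false_alarms
    precision = detected / flagged if flagged else None
    f_value = (2 * precision * recall / (precision + recall)
               if precision and (precision + recall) else None)
    return recall, precision, f_value

# Totals for Grammatifix on the adult text (Table 7.10):
# 23 errors, 13 detected, 1 false alarm.
r, p, f = metrics(23, 13, 1)
print(round(r * 100), round(p * 100), round(f * 100))  # 57 93 70
```

The same function reproduces the other tables; for instance the Scarrie totals (23 errors, 11 detected, 6 false alarms) yield 48% recall, 65% precision and an F-value of 55%, matching Table 7.12.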
Scarrie

The results for Scarrie are presented in Table 7.12. This system detected 10 agreement errors in noun phrases and one error in finite verb form. It had six false markings concerning noun phrase agreement. The total recall is 48% and precision 65%.

Table 7.12: Performance Results of Scarrie on Other Text

SCARRIE: OTHER TEXT
                                       FALSE ALARM
ERROR TYPE            ERRORS  CORRECT  No     Other  Recall  Precision  F-value
                              ALARM    Error  Error
Agreement in NP         17      10      2      4      59%      63%       61%
Finite Verb Form         5       1      0      0      20%     100%       33%
Verb Form after Vaux     1       0      0      0       0%       –         –
TOTAL                   23      11      2      4      48%      65%       55%

The seven errors in noun phrase agreement that were missed concerned the three that our system did not find (see (7.18)) and two that Granska did not find (see (7.23a) and (7.23c)). The others are presented below, where two concerned gender agreement between a determiner and a (proper) noun (see (7.25a) and (7.25b)), and one definiteness agreement with a weak form adjective together with an indefinite noun (see (7.25c)).

(7.25) a. ∗en Mekka ⇒ ett Mekka
a [com] Mekka ⇒ a [neu] Mekka
b. ∗en mantra ⇒ ett mantra
a [com] mantra [neu] ⇒ a [neu] mantra [neu]
c. ∗orörd fjällnatur ⇒ orörda fjällnatur
untouched [str] mountain-nature [indef] ⇒ untouched [wk] mountain-nature [indef]

All false alarms concerned noun phrase agreement, where four of them concerned other error categories, as for instance the ones presented in (7.19) or in (7.24).

7.5.3 Overview of Performance on Other Text

In Figure 7.5 I present the recall values for all three of the grammar checkers and our FiniteCheck for the three evaluated error types. All the tools detected 60% or more of the errors in noun phrase agreement, whereas verb form errors obtained differing results. The other tools detected at most one verb form error in total, of either the finite verb kind or after an auxiliary verb. FiniteCheck identified all six of the verb form errors.
The errors in verb form are in fact quite few (six instances in total), but even for such a small amount there are indications that the other tools have problems identifying errors in verb form.

Flagging accuracy for these error types is presented in Figure 7.6. Concerning errors in noun phrase agreement, Grammatifix had no false flaggings and obtains a precision of 100%. Granska's precision rate is also quite high, with only one false alarm. Scarrie and FiniteCheck obtained a lower precision, around 70%, due to six false alarms by each tool. Concerning verb errors, the three systems obtained full rates without any false flaggings when detection occurred. FiniteCheck had one false alarm in each error type, thus obtaining lower precision rates. The flagging accuracy of FiniteCheck on this text is a bit lower in comparison to Grammatifix and Granska, but comparable to the results of Scarrie.

Figure 7.5: Overview of Recall in Other Text

Figure 7.6: Overview of Precision in Other Text

Concerning the overall performance on the evaluated text, presented in Figure 7.7: with 23 grammar errors in total, the three grammar checkers obtained on average 52% in recall, whereas FiniteCheck scored 87%. The opposite scenario applies for precision, where FiniteCheck had a slightly worse rate (71%) than Grammatifix and Granska, which had a precision above 90%. Scarrie's precision rate was 65%. In the combined measure of recall and precision (F-value), our system obtained a rate of 78%, which is slightly better than the other tools, which had 70% or less in F-value.

Figure 7.7: Overview of Overall Performance in Other Text

In conclusion, this test compared only a few of the constructions covered by the other systems, namely the error types targeted by FiniteCheck. The result is promising for our detector, which obtained comparable or better performance rates for coverage in this text.
Flagging accuracy was slightly worse, especially in comparison to Grammatifix and Granska. Moreover, the text was small with few errors, and future tests on larger unseen texts are of interest for a better understanding of the system's performance.

7.6 Summary and Conclusion

The performance of FiniteCheck was tested during the developmental stage and on the current version. The system is in general good at finding errors, and the flagging accuracy of the system can be improved by relatively simple means. The initial performance was improved solely by extension of the grammar and some ambiguity resolution. The broad grammar was extended by filtering transducers that extended head phrases with complements, merged split constituents or otherwise adjusted the parsing output as a disambiguation step. The narrow grammar was improved either by extension of existing grammar rules or by additional selections of segments. These new selections provide a basis for definitions of new grammars, and thus the possibility of extending the detection to other types of errors. In the current version, noun phrases followed by relative clauses, coordinated infinitives and verbs in supine form were selected as separate segments and can be further extended with corresponding grammar rules.

Detection of the four implemented error types in FiniteCheck was tested on both Child Data and a short adult text, not only for our detector but also for the three other Swedish grammar checkers.9 In the case of Child Data, FiniteCheck achieved maximal or high grammatical coverage, being based on this corpus, and a total precision of around 30%. The other tools in general detected few errors in Child Data for the included error types, with an average recall of 16%. Flagging accuracy is also around 30% for two of these tools and is lower for one of them.
The outcome of FiniteCheck is hard to compare to the performance of the other tools, since our system is based on the Child Data corpus, which was also used for evaluation. But there are indications of differences in the detection of errors in verb form at least, especially in finite verbs, where the other tools obtained quite low recall, on average 9%. A similar effect occurred when the tools were tested on the adult text, where the other tools again had difficulties detecting errors in verb form (although these were few), whereas FiniteCheck identified all of them. Otherwise, FiniteCheck obtained recall on the adult text comparable to (or even better than) the three tools, and a slightly lower accuracy in comparison to two of the tools. The performance rates of all the tools are in general higher on this adult text than on Child Data, with a recall around 50% and a precision around 80%. Corresponding rates for Child Data are around 16% in recall10 and 30% in precision.

The validation tests on Child Data and the adult text indicate clearly that the children's texts and the errors in them really are different from the adult texts and errors, and that they are more challenging for current grammar checkers, which have been developed for texts and errors written by adult writers. The low performance of the Swedish tools on Child Data clearly demonstrates the need for adaptation of grammar checking techniques to other users, such as children. The performance of FiniteCheck is promising but at this point only preliminary.

9 Recall that these tools target many more error types. Evaluation of these grammar checkers on all errors found in Child Data is presented in Chapter 5 (Section 5.5).
10 Here, the recall rates of FiniteCheck were not included, since it is developed on this data.
More tests are needed in order to see the real performance of this tool, both on other unseen children's texts and on texts written by other users, such as adult writers or even second language learners.

Chapter 8

Summary and Conclusion

8.1 Introduction

This concluding chapter begins with a short summary of the thesis (Section 8.2), followed by a section on conclusions (Section 8.3); finally, some future plans are discussed (Section 8.4).

8.2 Summary

8.2.1 Introduction

This thesis concerns the analysis of grammar errors in Swedish texts written by primary school children and the development of a finite state system for finding such errors. Grammar errors are more frequent for this group of writers, and the distribution of the error types is different from texts written by adults. Other writing errors above word level are also discussed here, including punctuation and spelling errors resulting in existing words. The method used in the implemented tool FiniteCheck involves subtraction of finite state automata that represent two 'positive' grammars with varying degrees of detail. The difference between the automata corresponds to the search for writing problems that violate the grammars. The technique shows promising results on the implemented agreement and verb selection phenomena.

The work is divided into three subtasks: analysis of errors in the gathered data, investigation of the possibilities for detecting these errors automatically and, finally, implementation of detection of some errors. The summary of the thesis presented below follows these three subtasks.

8.2.2 Children's Writing Errors

Data, Error Categories and Error Classification

The analysis of children's writing errors is based on empirical data of in total 29,812 words gathered in a Child Data corpus consisting of three separate collections of handwritten and computer-written compositions by primary school children between 9 and 13 years of age (see Section 3.2).
The analysis concentrates primarily on grammar. Other categories under investigation concern punctuation and spelling errors which give rise to real word strings. Error classification of the error categories involved is discussed in Chapter 3 (Section 3.3), where I present a taxonomy (Figure 3.1, p. 31) and principles for classifying writing errors. Although this taxonomy was designed particularly for errors on the borderline between spelling and grammar errors, it can be used for classification of both. It takes into consideration the kind of new formation involved (new lemma or other forms of the same lemma), the type of violation (change in letter, morpheme or word) and what level is influenced (lexical, syntactic or semantic).

What Grammar Errors Occur?

In the survey of the relatively few existing studies on grammar errors in Chapter 2 (Section 2.4), I show that the most typical grammar errors in these studies are errors in noun phrase and predicative complement agreement, verb form and choice of prepositions in idiomatic expressions. Furthermore, some indications of errors influenced by spoken language are also evident in children's writing. However, grammar has in general low priority in research on writing in Swedish. In particular, there are no recent studies concerning grammar errors by children and certainly no studies whatsoever for the youngest primary school children (see Section 2.3).

In the present analysis of Child Data in Chapter 4 (Section 4.3), a total of 262 grammar errors occur, spread over more than ten error types. The expected "typical" errors occur, but they are not all particularly frequent. The most common errors occur in finite verb form, omission of obligatory constituents in sentences, choice of words, agreement in noun phrases and extra added constituents in sentences. In comparison to adult writers (Section 4.4), there are clear differences in error frequency and the distribution of error types.
Grammar errors occur on average as many as 9 times per 1,000 words in a child text, which is considerably more frequent compared to adult writers, who make on average one grammar error per 1,000 words. For some error types (e.g. noun phrase agreement) the frequency differs only marginally, whereas more significant differences arise, for instance, for errors in verb form, which are on average eight times more common in Child Data. The frequency distribution across all error types is also different, although the representation of the most common error types is similar, except for finite verb form errors. The most common error types for the adults in the studies presented were missing or redundant constituents in sentences, agreement in noun phrases and word choice errors. In contrast, the most common verb error among adult writers is in the verb form after an auxiliary verb and not in the finite verb form, as is the case for children.

What Real Word Spelling Errors Occur?

Spelling errors resulting in existing words are usually not captured by a spelling checker. For that reason they have been included in the present analysis, since they often require analysis of context larger than a word in order to be detected. The ones found in the Child Data corpus (presented and discussed in Section 4.5) are three times less frequent than the non-word spelling errors, which are the most common error type overall. These errors indicate a clear confusion as to what form to use in which context, as well as the influence of spoken language. Splits were in general more common among the real word errors.

How Is Punctuation Used?

The main purpose of the analysis of punctuation (Section 4.6) was to investigate how children delimit text and use major delimiters and commas to signal clauses and sentences. The analysis of Child Data reveals that mostly the younger children join sentences into larger units without using any major delimiters to signal sentence boundaries.
The oldest children formed the longest units with the fewest adjoined clauses. Erroneous use of punctuation is mostly represented by omission of delimiters, but also by markings occurring at syntactically incorrect places. The punctuation analysis concludes at this point with the recommendation not to rely on sentence marking conventions in children's texts when describing the grammar and rules of a system aiming at analyzing such texts.

8.2.3 Diagnosis and Possibilities for Detection

Possibilities and Means for Detection

The errors found in Child Data were analyzed according to what means and how much context is needed for their detection. Most of the non-structural errors (i.e. substitutions of words, concerning feature mismatch) and some structural errors (i.e. omission, insertion and transposition of words) can be detected successfully by means of partial parsing. These errors concern agreement in noun phrases, verb form or missing constituents in verb clusters, some pronoun case errors, repeated words that cause redundant constituents, some word order errors and, to some extent, agreement errors in predicative complements. Furthermore, real word spelling errors giving rise to syntactic violations can also be traced by partial parsing. Other error types require more elaborate analysis in the form of parsing larger portions of a clause or even full sentence parsing (e.g. missing or extra inserted constituents), analysis above sentence level requiring analysis of preceding discourse (e.g. definiteness in single nouns, reference), or even semantics and world knowledge (e.g. word choice errors). Among the most common errors in the Child Data corpus, errors in verb form and noun phrase agreement can be detected by partial parsing, whereas errors in the structure of sentences, such as insertions or omissions of constituents, and word choice errors require more elaborate analysis.
Coverage and Performance of Swedish Tools

The three existing Swedish grammar checkers Grammatifix, Granska and Scarrie are designed for and primarily tested on texts written by (mostly professional) adult writers. According to their error specifications, they cover many of the error types found in Child Data. The errors that none of these tools targets include definiteness errors in single nouns and reference errors. The tools were tested on Child Data in order to gauge their real performance. The result of this test indicates low coverage overall, and in particular for the most common error types. The systems are best at identifying errors in noun phrase agreement and obtain an average recall rate of 58%. However, the most common error in children's writing, finite verb form, is on average covered only to 9% (see Tables 5.4, 5.5 and 5.6 starting on p. 169, or Figure 7.2 on p. 234). The overall grammatical coverage (recall) of the adult grammar checkers across all errors in Child Data averages around 12%, a figure almost five times lower than in the tests on adult texts provided by the developers of these tools, where the average recall rate is 57% (see Table 5.3 on p. 141).

This test showed that although these three proofing tools target the grammar errors occurring in Child Data, they have problems detecting them. The reasons for this effect could in some cases be ascribed to the complexity of the error (e.g. insertion of optional constituents). More often, however, the low performance has to do with the high error frequency in some error types (e.g. errors in finite verb form are much less frequent in adult texts; see Figure 4.5 on p. 87) and the complexity in the sentence and discourse structure of the texts used in this study (e.g. violations of punctuation and capitalization conventions resulting in adjoined clauses).
8.2.4 Detection of Grammar Errors

Targeted Errors

Among the errors found in Child Data, errors in noun phrase agreement and in verb form in finite and non-finite verbs were chosen for implementation. There were two reasons for concentrating on these error types. First, they (almost all) occur among the five most common error types. Second, these error types are all limited to certain portions of text and can thus be detected by means of partial parsing. In the current implementation, agreement errors are detected in noun phrases with a noun, adjective, pronoun or numeral as the head, as well as in noun phrases with partitive attributes. The noun phrase rules are defined in accordance with the feature requirements they have to fulfill (i.e. definiteness, number and gender). The noun phrase grammar is prepared for further detection of errors in noun phrases with a relative subordinate clause as complement, which display different agreement conditions. In the present implementation these are selected as segments separate from the other noun phrases. The main purpose of this selection was to avoid marking correct noun phrase segments of this type as erroneous.

The verb grammar detects errors in finite form, both in bare main verbs and in auxiliary verbs in a verb cluster, as well as in non-finite forms in a verb cluster and in infinitive phrases following an infinitive marker. The grammar is designed to take into consideration insertion of optional constituents such as adverbs or noun phrases, and also handles inverted word order. The verb grammar, too, is prepared for expansion to cover detection of other errors in verbs. Coordinated verbs preceded by a verb cluster or infinitive phrase are selected as individual segments and invite further expansion of the system's grammar to detection of errors manifested as finite verbs instead of the expected non-finite verb form.
Similarly, verbal heads in bare supine form are selected as separate segments and lay a basis for the detection of omitted temporal auxiliary verbs in main clauses.

Detection Approach

The implemented grammar error detector FiniteCheck is built as a cascade of finite state transducers compiled from regular grammars using the expressions and operators defined in the Xerox Finite-State Tool. The detection of errors in a given text is based on the difference between two positive grammars differing in degree of accuracy. This is the same method that Karttunen et al. (1997a) use for distinguishing valid and invalid date expressions. The two grammars always describe valid rules of Swedish. The first, more relaxed (underspecified) grammar is needed in a text containing errors to identify all segments that could contain errors, and marks both the grammatical and the ungrammatical segments. The second grammar is a precise grammar of valid rules of Swedish and is used to distinguish the ungrammatical segments from the grammatical ones.

The parsing strategy of FiniteCheck is partial rather than full, annotating portions of text with syntactic tags. The procedure is incremental, recognizing first the heads (lexical prefix) and then expanding them with complements, always selecting maximal instances of segments. In order to prevent overlooking errors, the ambiguity in the system is maximal at the lexical level, assigning all the lexical tags presented in the lexicon. Structural ambiguity at a higher level is treated partially by parsing order and partially by filtering techniques, blocking or rearranging insertion of syntactic tags.

Performance Results

FiniteCheck was tested both on the (training) Child Data written by children and on an adult text not known to the system. In the case of Child Data, the system showed high coverage (recall) in the initial phase of development, whereas many correct segments were selected as erroneous.
Many of these false alarms were avoided by extending the grammar of the system, blocking on average half of all the false markings. The remaining false alarms are more related to ambiguity in parsing or to selection of other error categories (i.e. misspelled words, splits and missing sentence boundaries). Only in the case of verb clusters did the system mark constructions not yet covered by the grammar of the system. Being based on this corpus, the system achieves maximal or high grammatical coverage, with a total recall rate of 89% for the four implemented error types. Precision is 34%. The three other Swedish tools had on average lower results in recall, with a total rate of 16% on Child Data for the four error types targeted by FiniteCheck. The corresponding total precision value is on average 27%.

Further, the performance of FiniteCheck on a text not known to the system shows that the system is good at finding errors, whereas the precision is lower. The three undetected errors in noun phrase agreement occurred due to the small size of the grammar. False flaggings involved both ambiguity problems and selections due to the occurrence of other error categories. The total grammatical coverage (recall) of FiniteCheck on this text was 87% and precision was 71%. The three other Swedish tools are (again) good at finding errors in noun phrase agreement, whereas the verb errors obtain quite low results. The average total recall rate is 52% and precision 83% for the three evaluated error types.

The validation tests show that the performance of FiniteCheck on the four implemented error types is promising and comparable to current Swedish checkers. The low performance results of the Swedish systems on children's texts indicate that the nature of the errors found in texts written by primary school writers is different from that in adult texts, and that these errors are more challenging for current systems that are oriented towards texts written by adult writers.
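FiniteCheck itself is implemented with the Xerox Finite-State Tool, but the core idea of the detection approach — the difference between a broad and a narrow positive grammar yields an error detector — can be sketched compactly. The following minimal illustration, under stated assumptions, uses Python regular expressions over part-of-speech tag strings as a stand-in for the compiled automata; the mini-lexicon and the `flag_np` helper are hypothetical toys, not the thesis grammar.

```python
import re

# Hypothetical mini-lexicon: word -> tag with a gender feature.
LEX = {
    "en": "DET.com", "ett": "DET.neu",
    "bil": "N.com", "hus": "N.neu",
}

# Broad (relaxed) grammar: any determiner followed by any noun.
BROAD = re.compile(r"DET\.\w+ N\.\w+")
# Narrow (precise) grammar: determiner and noun must agree in gender.
NARROW = re.compile(r"DET\.com N\.com|DET\.neu N\.neu")

def flag_np(words):
    """True if the phrase lies in BROAD - NARROW, i.e. it is selected
    by the relaxed grammar but rejected by the precise one."""
    tags = " ".join(LEX[w] for w in words)
    return bool(BROAD.fullmatch(tags)) and not NARROW.fullmatch(tags)

print(flag_np(["en", "bil"]))  # False: 'en bil' agrees in gender
print(flag_np(["en", "hus"]))  # True: gender mismatch, flagged
```

Note that both expressions describe only valid patterns of Swedish; the error set is never written down explicitly but falls out of the set difference, mirroring the automaton subtraction used in the actual system.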
8.3 Conclusion

The present work contributes to research on children's writing by revealing the nature of grammar errors in their texts and fills a gap in this research field, since not many studies are devoted to grammar in writing. It shows further that it is important to develop aids for children, since there are differences in both frequency and error types in comparison to adult writers, and current tools have difficulties coping with such texts.

The findings here also show that it is plausible and promising to use positive rules for error detection. The advantage of applying positive grammars in detection of errors is, first, that only the valid grammar has to be described and I do not have to speculate on what errors may occur. The prediction of errors is limited exactly to the portions of text that can be delimited. For example, errors in number in noun phrases with a partitive complement were not identified by any of the three Swedish checkers, since adults probably do not make these types of errors. The grammar of FiniteCheck describes the overall structure of such phrases in Swedish, including agreement between the quantifying numeral or determiner and the modifying noun phrase. It also states that the noun phrase has to be in plural number in order to be considered correct. The Swedish tools take into consideration only the agreement between the constituents and not the whole structure of the phrase. Secondly, the rule sets remain quite small. Thirdly, the grammars can be used for other purposes. That is, since the grammars of the system describe the real grammar of Swedish, they can also be used for detection of valid noun phrases and verbs and be applied, for instance, to extracting information from text or even to parsing.

The performance of FiniteCheck is promising in that good results were obtained not only on the 'training' Child Data, but also when running FiniteCheck on adult texts, where the results were comparable to the other current tools.
This result perhaps also indicates that the approach could be used as a generic method for detection of errors. The ambiguity in the system is not fully resolved, but this does not disturb the error detection. However, false parses are hard to predict, and they may give rise to errors not being detected or to the occurrence of false alarms.

8.4 Future Plans

8.4.1 Introduction

The current version of the implemented grammar error detector is not intended to be considered a full-fledged grammar checker or a generic tool for detection of errors in any text written by any writer. The present version of FiniteCheck is based on a lexicon of limited size, ambiguity in the system is not fully resolved, and it detects a limited set of grammar errors, yielding simple diagnoses. The next challenges will be to expand the lexicon, experiment with disambiguation versus error detection, expand the coverage of the system to other error types, explore the diagnosis stage and test the detection of errors in new texts written by different users. Furthermore, application of the grammars of the system for other purposes is also interesting to explore.

8.4.2 Improving the System

The lexicon of the system has to be expanded with missing forms, new lemmas and other valuable information, such as valence or compound information. The latter has practically been accomplished, this information being stored in the original text version of part of the lexicon. There is a high level of ambiguity in the system, especially at the lexical level, since we do not use a tagger, which might eliminate information in incorrect text that is later needed to find the error. The fact is that unresolved ambiguity can sometimes lead to false parsing, which in turn could mean false alarms. The degree of lexical ambiguity and its impact on parsing, and by extension on detection of errors, can be studied by experiments with weighted lexical annotation, for instance, i.e. lexical tags ordered by probability measures (e.g. weighted automata).
Such taggers are, however, often based on texts written by adults and could give rise to deceptive results. Also, disambiguation is not fully resolved at the structural level, blocking some insertions by parsing order and further adjusting the output by filtering automata. Extension of the grammars in the system has shown a positive impact on parsing, and further evaluation is needed in order to decide the degree of ambiguity and the prospects for prediction of false parsing, both of which influence error detection. Another possibility is to explore the use of alternative parses, implemented for instance as charts.

The rules of the broad grammar overgenerate to a great extent. One thing to experiment with is the degree of broadness, in order to see how it influences the detection process. Will the parsing of text be better at the cost of worse error detection? How much could the grammar set be extended to improve the parsing without influencing the error detection? Since the grammars of the system are positive, experiments in using them for other purposes are in order. For instance, the more accurate narrow grammar could be applied to information extraction or even parsing.

8.4.3 Expanding Detection

The first step in expanding the detection of FiniteCheck would naturally involve the types that are already selected for such expansion, i.e. noun phrases with relative clauses, coordinated infinitives and bare supine verbs. Furthermore, the verb grammar can be expanded with other constructions, such as the auxiliary verb komma 'will', which requires an infinitive marker preceding the main verb, or main verbs that combine with infinitive phrases (see Section 4.3.5). Further expansion would naturally concern errors that require the least analysis. Beyond noun phrase and verb form errors, only some constructions can be detected by simple partial parsing; otherwise more complex analysis is required.
The system can be further expanded to include detection of errors in predicative agreement, some pronoun case errors, some word order errors and probably some definiteness errors in single nouns. With regard to children, the most crucial would be coverage of errors with missing or redundant constituents in clauses, and word choice errors, which represent two of the more frequent error types. These errors will, as my analysis reveals, most probably require quite complex investigation, with descriptions of complement structure. It would be worthwhile to do more analysis of children's writing in order to investigate whether some such errors are, for instance, limited to certain portions of text and could then be detected by means of partial parsing.

Considering children as the users of a grammar checker for educational purposes, the most important development will concern the error diagnosis and the error messages to the user. A tool that supports beginning writers in their acquisition has to place high demands on the diagnosis of and information on errors in order to be useful. The message to the user has to be clear and adjusted to the skills of the child. A child not familiar with a given writing error or the grammatical terminology associated with it will certainly not profit from detection of such an error or from information containing grammatical terms. Studies of children's interaction with authoring aids are needed in order to explore how alternatives for detection, diagnosis and error messages could best benefit this user group. For instance, such a tool could be used for training grammar, allowing customization and options for what error types to detect or train on. There could also be different levels of diagnosis and error messages depending on the individual child's level of writing acquisition. Other users could also find such a tool interesting, for instance in language acquisition as second language learners.
The diagnosis stage could also be informed by analysis of children's ongoing writing processes, which could be a step toward revealing the cause of an error. By logging all activity during on-screen writing, all revisions could, for instance, be stored and then analyzed for repeated patterns, and in particular for whether making a spelling error differs from making a grammar error. Could a grammar checker gain from such on-line information? This analysis would also be of interest for errors on the borderline between grammar and spelling, and could aid the detection of other categories of errors incorrectly flagged as grammar errors.

8.4.4 Generic Tool?

The detection and overall performance of the system have so far been tested on the 'training' Child Data corpus and on a small adult text not known to the system. The results for the four implemented error types are promising on both texts, which represent two different writing populations. This could also imply that the method is usable generically. FiniteCheck obtained performance comparable to other Swedish grammar checkers both on the adult text and on Child Data. Although FiniteCheck was developed on these texts, there were considerable differences in coverage for some error types that the other tools had difficulty finding. The system needs to be tested further on other children's texts not known to the system, and also on texts from other writers, primarily texts of different genres written by adults. Furthermore, it would be interesting to test FiniteCheck on texts written by second language learners, dyslexics or even the hearing impaired, in order to explore how generic the tool is.

8.4.5 Learning to Write in the Information Society

Some of the future work discussed above has already been initiated within the framework of a three-year project, Learning to Write in the Information Society, started in 2003 and sponsored by Vetenskapsrådet.
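Such further testing presupposes a fixed evaluation measure. The sketch below shows the standard recall and precision computation used when comparing a checker's flagged positions against a manually annotated error corpus; the counts in the example are hypothetical, not the figures reported for FiniteCheck.

```python
def precision_recall(flagged, true_errors):
    """flagged, true_errors: sets of error positions (or identifiers).
    Precision: share of flagged positions that are real errors.
    Recall: share of real errors that were flagged."""
    tp = len(flagged & true_errors)  # true positives
    precision = tp / len(flagged) if flagged else 1.0
    recall = tp / len(true_errors) if true_errors else 1.0
    return precision, recall

# Hypothetical run: the checker flags positions 1, 4, 9; the gold
# annotation marks 1, 4, 7 as errors.
p, r = precision_recall({1, 4, 9}, {1, 4, 7})
print(round(p, 2), round(r, 2))  # 0.67 0.67
```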
The project group, consisting of Robin Cooper, Ylva Hård af Segerstad and me, aims to investigate school children's written language in different modalities, and the effects of the use of computers and other communication media such as webchat and text messaging over mobile phones. The main aims are to see how writing is used today and how information technology can better be used for support. Texts written by primary school children will be gathered, in both handwritten and computer-written form. The study will also involve writing experiments with email, SMS (Short Message Service) and webchat, as well as further studies of interaction with different writing aids. The results of this study should reveal how writing aids influence children's writing, what needs and requirements this writing population places on such tools, and how writing aids can be improved to enhance writing development and instruction in school.
Appendices

Appendix A
Grammatical Feature Categories

GENDER:
  com        common gender
  neu        neuter gender
  masc       masculine gender
  fem        feminine gender

DEFINITENESS:
  def        definite form
  indef      indefinite form
  wk         weak form of adjective
  str        strong form of adjective

CASE:
  nom        nominative case
  acc        accusative case
  gen        genitive case

NUMBER:
  sg         singular
  pl         plural

TENSE:
  imp        imperative
  inf        infinitive
  pres       present
  pret       preterite
  perf       perfect
  past perf  past perfect
  sup        supine
  past part  past participle
  untensed / non-finite verb

VOICE:
  pass       passive

OTHER:
  adj        adjective
  adv        adverb

Appendix B
Error Corpora

This appendix presents the errors found in Child Data and consists of three corpora:

B.1 Grammar Errors
B.2 Misspelled Words
B.3 Segmentation Errors

Every listed instance of an error (ERROR) is indexed and followed by a suggestion for possible correction (CORRECTION) and information about which sub-corpus (CORP) it originates from, who the writer was (SUBJ), the writer's age (AGE) and sex (SEX; m for male and f for female). The different sub-corpora are abbreviated as DV Deserted Village, CF Climbing Fireman, FS Frog Story, SN Spencer Narrative, SE Spencer Expository.

B.1 Grammar Errors

Grammar errors are categorized by the type of error that occurred.

1 AGREEMENT IN NOUN PHRASE

1.1 Definiteness agreement

Indefinite head with definite modifier
1.1.1 Jag tar den närmsta handduk och slänger den i vasken och blöter den,
1.1.2 En gång blev den hemska pyroman utkastad ur stan.
1.1.3 Jag såg på ett TV program där en metod mot mobbing var att satta mobbarn på den stol och andra människor runt den personen och då fråga varför.

Definite head with possessive modifier
1.1.4 Pär tittar på sin klockan och det var tid för familjen att gå hem.
1.1.5 hunden sa på pojkens huvet.

Definite head with modifier 'denna'
1.1.6 Nu när jag kommer att skriva denna uppsatsen så kommer jag ha en rubrik om några problem och ...
Definite head with indefinite modifier
1.1.7 Men senare ångrade dom sig för det var en räkningen på deras lägenhet.
1.1.8 Man ska inte fråga en kompisen om något arbete, man ska fråga läraren.

1.2 Gender agreement

Wrong article
1.2.1 pojken fick en grodbarn

Wrong article in partitive
1.2.2 Virginias mamma hade öppnat en tyg affär i en av Dom gamla husen.

Masculine form of adjective
1.2.3 sen berätta den minsta att det va den hemske fula troll karlen tokig som ville göra mos av dom för han skulle bo i deras by.
1.2.4 nasse blev arg han gick och la sig med dom andre syskonen.

1.3 Number agreement

Singular modifier with plural head
1.3.1 Den dära scenen med det tre tjejerna tyckte jag att de var taskiga som går ifrån den tredje tjejen

CORRECTION (CORP, SUBJ, AGE, SEX):
1.1.1 handuken (CF, alhe, 9, f)
1.1.2 pyromanen (CF, frma, 9, m)
1.1.3 en stol/den stolen (SE, wj16, 13, f)
1.1.4 klocka (DV, frma, 9, m)
1.1.5 huve/huvud (FS, haic, 11, f)
1.1.6 uppsats (SE, wj03, 13, f)
1.1.7 räkning (DV, jowe, 9, f)
1.1.8 kompis (SE, wg05, 10, m)
1.2.1 ett (FS, haic, 11, f)
1.2.2 ett (DV, idja, 11, f)
1.2.3 fule (DV, alhe, 9, f)
1.2.4 andra (CF, haic, 11, f)
1.3.1 de (SE, wg09, 10, m)

Singular noun in partitive attribute
1.3.2 Alla männen och pappa gick in i ett av huset.
1.3.3 en av boven tog bensinen och gick bakåt.

CORRECTION (CORP, SUBJ, AGE, SEX):
1.3.2 husen (DV, haic, 11, f)
1.3.3 bovarna (CF, haic, 11, f)

2 AGREEMENT IN PREDICATIVE COMPLEMENT

2.1 Gender agreement
2.1.1 då börja Urban lipa och sa: Mitt hus är blöt.
2.1.2 den som hörde de där stygga orden vågade kanske inte spela på en konsert för att vara rädd att bli utskrattat av avundsjuka personer.

CORRECTION (CORP, SUBJ, AGE, SEX):
2.1.1 blött (CF, caan, 9, m)
2.1.2 utskrattad (SE, wg11, 10, f)

2.2 Number agreement

Singular
2.2.1 En som är mobbad gråter säkert varje dag känner sig menigslösa.

CORRECTION (CORP, SUBJ, AGE, SEX):
2.2.1 meningslös (SE, wj05, 13, m)
2.2.2 mobbade (SE, wj05, 13, m)
2.2.3 öppna, ärliga, elaka (SE, wj13, 13, m)
2.2.4 utsatta (SE, wj19, 13, m)
2.2.5 själva (SE, wj20, 13, m)
2.2.6 smutsiga (CF, haic, 11, f)
3.1.1 byn (DV, haic, 11, f)
3.1.2 skeppet (DV, haic, 11, f)
3.1.3 ön (DV, haic, 11, f)
3.1.4 borgmästaren (CF, frma, 9, m)
3.1.5 grenen (FS, frma, 9, m)
3.1.6 pojken (FS, frma, 9, m)
4.1.1 dem (SN, wg10, 10, m)
4.1.2 dem (CF, klma, 10, f)
4.1.3 dem (SE, wg16, 10, f)
4.1.4 honom (SE, wj14, 13, m)
4.1.5 honom (SE, wj14, 13, m)
5.1.1 får (CF, alhe, 9, f)
Plural
2.2.2 Om dom som mobbar någon gång blir mobbad själv skulle han ändras helt och hållet.
2.2.3 Sjävl tycker jag att killarnas metoder är mer öppen och ärlig men också mer elak än var tjejernas metoder är.
2.2.4 jag tror att dom som är s har själva varit ut satt någon gång och nu vill dom hämnas och...
2.2.5 ... för folk tänker mest på sig själv.
2.2.6 nasse är en gris som har massor av syskon. nasse är skär. Men nasses syskon är smutsig.

3 DEFINITENESS IN SINGLE NOUNS
3.1.1 dom gick till by
3.1.2 dom som bodde på ön kanske försökte komma på skepp
3.1.3 Jag såg en ö vi gick till ö
3.1.4 dom sa till borgmästare vad ska vi göra!
3.1.5 män han hade skrikit så börjar gren röra på sig
3.1.6 pojke hoppade ner till hunden

4 PRONOUN CASE

4.1 Case - Objective form
4.1.1 bilarna bromsade så att det blev svarta streck efter de.
4.1.2 Två av brandmännen sprang in i huset för att rädda de
4.1.3 jag tycker synd om de
4.1.4 då kan ju den eleven som blir utsatt gå fram och prata med han
4.1.5 bara för man inte vill vara med han

5 FINITE MAIN VERB

5.1 Present tense

Regular verbs
5.1.1 Madde och jag bestämde oss för att sova i kojan och se om vi få se vind.
5.1.2 När hon kommer ner undrar hon varför det lukta så bränt och varför det låg en handduk över spisen.
5.1.3 undra vad det brann nånstans jag måste i alla fall larma
5.1.4 Få se nu vilken väg är det, den här.
5.1.5 han kommer och klappar alla på handen utan en kille undra hur han känner sig då?
5.1.6 ... det kan även vara att nån kan sparka eller att man få vara enstöring...
5.1.7 ... där några tjejer/killar sitter och prata.
5.1.8 men det kanske bero på att det var en mindre skola ...
5.1.9 och inte bry sig om han man inte få vara med,

Strong verbs
5.1.10 Att stjäla är inte bra speciellt inte om man tar en sak av en person som gick för en i ett led och inte säga till att man hittade den utan att man behåller den.

CORRECTION (CORP, SUBJ, AGE, SEX):
5.1.2 luktar (CF, alhe, 9, f)
5.1.3 undrar (CF, erja, 9, m)
5.1.4 Får (FS, idja, 11, f)
5.1.5 undrar (SE, wj03, 13, f)
5.1.6 får (SE, wj08, 13, f)
5.1.7 pratar (SE, wj08, 13, f)
5.1.8 beror (SE, wj13, 13, m)
5.1.9 får (SE, wj14, 13, m)
5.1.10 säger (SE, wj03, 13, f)

5.2 Preterite

Regular verbs
5.2.1 vi berätta och ...
5.2.2 den äldsta som va 80 år berätta
5.2.3 jag berätta om byn
5.2.4 sen berätta den minsta
5.2.5 då börja alla i hela tunneln förutom pappa och ja gråta
5.2.6 sen cykla vi dit igen.
5.2.7 ...gick ner och hämta min och pappas cyklar ...
5.2.8 Pappa gick och Knacka på en dörr till medan jag hämta cyklarna
5.2.9 Pappa gick och knacka på en dörr för att vi var väldigt hungriga
5.2.10 Pappa gick och Knacka på en dörr till medan jag hämta cyklarna
5.2.11 jag knacka på dörren
5.2.12 men jag lugna mig och kände på marken
5.2.13 dom peka på väggen av tunneln
5.2.14 jag ramla i en rutschbana
5.2.15 långt åkte ja tills jag stanna vid en port ...
5.2.16 när vi kom hem undra självklart mamma vart vi varit
5.2.17 pappa och jag undra va nycklarna va
5.2.18 sen undra han va dom bodde
5.2.19 på morgonen när vi vakna...
5.2.20 men ingen öppna
5.2.21 någon eller något öppna dörren
5.2.22 vi till och med öppna pensionathem
5.2.23 Lena Ropa mamma Lena
5.2.24 Lena vakna Plötsligt vakna Hon av att någon sa Lena Lena.

CORRECTION (CORP, SUBJ, AGE, SEX):
5.2.1 berättade (DV, alhe, 9, f)
5.2.2 berättade (DV, alhe, 9, f)
5.2.3 berättade (DV, alhe, 9, f)
5.2.4 berättade (DV, alhe, 9, f)
5.2.5 började (DV, alhe, 9, f)
5.2.6 cyklade (DV, alhe, 9, f)
5.2.7 hämtade (DV, alhe, 9, f)
5.2.8 hämtade (DV, alhe, 9, f)
5.2.9 knackade (DV, alhe, 9, f)
5.2.10 knackade (DV, alhe, 9, f)
5.2.11 knackade (DV, alhe, 9, f)
5.2.12 lugnade (DV, alhe, 9, f)
5.2.13 pekade (DV, alhe, 9, f)
5.2.14 ramlade (DV, alhe, 9, f)
5.2.15 stannade (DV, alhe, 9, f)
5.2.16 undrade (DV, alhe, 9, f)
5.2.17 undrade (DV, alhe, 9, f)
5.2.18 undrade (DV, alhe, 9, f)
5.2.19 vaknade (DV, alhe, 9, f)
5.2.20 öppnade (DV, alhe, 9, f)
5.2.21 öppnade (DV, alhe, 9, f)
5.2.22 öppnade (DV, alhe, 9, f)
5.2.23 ropade (DV, angu, 9, f)
5.2.24 vaknade (DV, angu, 9, f)
5.2.25 lutade (DV, anhe, 11, m)
5.2.25 Per luta sig mot en
5.2.26 Sen Svimma jag
5.2.27 när jag vakna satt Jag Per och Urban mitt i byn.
5.2.28 och när vi kom hem så Vakna jag och allt var en dröm.
5.2.29 Plötsligt börja en lavin
5.2.30 när Gunnar öppna dörren till det stora huset rasa det ihop
5.2.31 och snart rasa hela byn ihop.
5.2.32 när Gunnar öppna dörren till det stora huset rasa det ihop
5.2.33 Niklas och Benny hoppa av kamelerna
5.2.34 och snabbt hoppa dom på kamelerna
5.2.35 och rusa iväg och red bort
5.2.36 snabbt samla han ihop alla sina jägare
5.2.37 men undra varför den är övergiven.
5.2.38 Ida gick och tänkte på vad dom skulle göra hon snubbla på nåt
5.2.39 Jag tog min väska och Madde tog sin, och vi börja gå mot vår koja, där vi skulle sova.
5.2.40 När vi kom fram börja vi packa upp våra grejer och rulla upp sovsäcken.
5.2.41 Madde vaknade av mitt skrik, hon fråga va det var för nåt.
5.2.42 På morgonen vaknade vi och klädde på oss sen packa vi ner våra grejer.
5.2.43 jag sa att det inte va nåt så somna vi om.
5.2.44 För ett ögon blick trodde jag att den hästen vakta våran koja.
5.2.45 på natten vakna jag av att brandlarmet tjöt
5.2.46 då börja Urban lipa och sa: Mitt hus är blöt.
5.2.47 Brandkåren kom och spola ner huset
5.2.48 Cristoffer stod och titta på ugglan i trädet
5.2.49 Erik gick till skogen och ropa allt han kunde.
5.2.50 Rådjuret sprang iväg med honom. Och kasta av pojken vid ett berg.
5.2.51 De klättra över en stock.
5.2.52 Pojken ropa groda groda var är du
5.2.53 De gick ut och ropa men de fick inget svar.
5.2.54 Ruff råka trilla ut ur fönstret.
5.2.55 Pojken satt varje kväll och titta på grodan
5.2.56 När pojken vakna nästa morgon och fann att grodan var försvunnen blev han orolig
5.2.57 Och utan att pojken visste om det hoppa grodan ur burken när han låg.
5.2.58 Nästa dag vakna pojken och såg att grodan hade rymt
5.2.59 hunden halka efter.
5.2.60 När han landa så svepte massa bin över honom.
5.2.61 Pojken leta och leta i sitt rum.
5.2.62 Pojken leta och leta i sitt rum.
5.2.63 Hunden leta också
5.2.64 Pojken gick då ut och leta efter grodan
5.2.65 Pojken leta i ett träd

CORRECTION (CORP, SUBJ, AGE, SEX):
5.2.26 svimmade (DV, anhe, 11, m)
5.2.27 vaknade (DV, anhe, 11, m)
5.2.28 vaknade (DV, anhe, 11, m)
5.2.29 började (DV, erha, 10, m)
5.2.30 rasade (DV, erha, 10, m)
5.2.31 rasade (DV, erha, 10, m)
5.2.32 öppnade (DV, erha, 10, m)
5.2.33 hoppade (DV, erja, 9, m)
5.2.34 hoppade (DV, erja, 9, m)
5.2.35 rusade (DV, erja, 9, m)
5.2.36 samlade (DV, erja, 9, m)
5.2.37 undrade (DV, idja, 11, f)
5.2.38 snubblade (DV, jowe, 9, f)
5.2.39 började (CF, alhe, 9, f)
5.2.40 började (CF, alhe, 9, f)
5.2.41 frågade (CF, alhe, 9, f)
5.2.42 packade (CF, alhe, 9, f)
5.2.43 somnade (CF, alhe, 9, f)
5.2.44 vaktade (CF, alhe, 9, f)
5.2.45 vaknade (CF, angu, 9, f)
5.2.46 började (CF, caan, 9, m)
5.2.47 spolade (CF, caan, 9, m)
5.2.48 tittade (FS, alca, 11, f)
5.2.49 ropade (FS, alhe, 9, f)
5.2.50 kastade (FS, angu, 9, f)
5.2.51 klättrade (FS, angu, 9, f)
5.2.52 ropade (FS, angu, 9, f)
5.2.53 ropade (FS, angu, 9, f)
5.2.54 råkade (FS, angu, 9, f)
5.2.55 tittade (FS, angu, 9, f)
5.2.56 vaknade (FS, angu, 9, f)
5.2.57 hoppade (FS, caan, 9, m)
5.2.58 vaknade (FS, caan, 9, m)
5.2.59 halkade (FS, erge, 9, f)
5.2.60 landade (FS, erge, 9, f)
5.2.61 letade (FS, erge, 9, f)
5.2.62 letade (FS, erge, 9, f)
5.2.63 letade (FS, erge, 9, f)
5.2.64 letade (FS, erge, 9, f)
5.2.65 letade (FS, erge, 9, f)

5.2.66 Då helt plötsligt ramla hunden ner från fönstret
5.2.67 där bodde bara en uggla som skrämde honom så han ramla ner på marken.
5.2.68 Där ställde pojken sig och ropa efter grodan
5.2.69 Hej då ropa han hej då.
5.2.70 Då gick pojken vidare och såg inte att binas bo trilla ner.
5.2.71 när dom båda trilla i.
5.2.72 Han ropa hallå var är du
5.2.73 han gick upp på stora stenen ropa hallå hallå
5.2.74 Då öppnade han fönstret & ropa på grodan.
5.2.75 I min förra skola hade man nåt som man kallade för kamratstödjare, Det funka väl ganska bra men...
5.2.76 man visade ingen hänsyn eller att man inte heja eller bara bråka
5.2.77 man visade ingen hänsyn eller att man inte heja eller bara bråka
5.2.78 Var var den där överraskningen. Ni svara jag men båda tittade på varandra ...
5.2.79 Ni svara jag
5.2.80 det gick inte så hon klättrade upp bredvid mig och medan jag för sökte lyfta upp mig skälv medan hon putta bort jackan från pelare.
5.2.81 medan hon putta jackan från pelaren
5.2.82 jag var på mitt land och bada
5.2.83 så här börja det
5.2.84 där sövde dom mig och gipsa handen.
5.2.85 Hon hade bara kladdskrivit den uppsats jag lämna in ...
5.2.86 ...så jag ångra verkligen att jag tog hennes uppsats...
5.2.87 När jag gick förbi den djupa avdelningen så kom en annan kille och putta i mig

Supine
5.2.88 det låg massor av saker runtomkring jag försökt att kom till fören
5.2.89 Han tittade på hunden, hunden försökt att klättra ner

Participle
5.2.90 Fönstrena ser lite blankare ut där uppe sa Virginia och börjad klättra upp för den ruttna stegen.
5.2.91 älgen sprang med olof till ett stup och kastad ner olof och hans hund
5.2.92 dom letad överallt
5.2.93 när han letad kollade en sork upp
5.2.94 han letad bakom stocken
5.2.95 alla pratad om borgmästaren
5.2.96 hunden råkade skakad ner ett getingbo
5.2.97 det var en liten pojke som satt och snyftad
5.2.98 svarad han

CORRECTION (CORP, SUBJ, AGE, SEX):
5.2.66 ramlade (FS, erge, 9, f)
5.2.67 ramlade (FS, erge, 9, f)
5.2.68 ropade (FS, erge, 9, f)
5.2.69 ropade (FS, erge, 9, f)
5.2.70 trillade (FS, erge, 9, f)
5.2.71 trillade (FS, erge, 9, f)
5.2.72 ropade (FS, haic, 11, f)
5.2.73 ropade (FS, haic, 11, f)
5.2.74 ropade (FS, jobe, 10, m)
5.2.75 funkade (SE, wj13, 13, m)
5.2.76 bråkade (SE, wj18, 13, m)
5.2.77 hejade (SE, wj18, 13, m)
5.2.78 svarade (SN, wg07, 10, f)
5.2.79 svarade (SN, wg07, 10, f)
5.2.80 puttade (SN, wg16, 10, f)
5.2.81 puttade (SN, wg16, 10, f)
5.2.82 badade (SN, wg18, 10, m)
5.2.83 började (SN, wg18, 10, m)
5.2.84 gipsade (SN, wj05, 13, m)
5.2.85 lämnade (SN, wj16, 13, f)
5.2.86 ångrade (SN, wj16, 13, f)
5.2.87 puttade (SN, wj20, 13, m)
5.2.88 försökte (DV, haic, 11, f)
5.2.89 försökte (FS, haic, 11, f)
5.2.90 började (DV, idja, 11, f)
5.2.91 kastade (FS, frma, 9, m)
5.2.92 letade (FS, frma, 9, m)
5.2.93 letade (FS, frma, 9, m)
5.2.94 letade (FS, frma, 9, m)
5.2.95 pratade (CF, frma, 9, m)
5.2.96 skaka (FS, frma, 9, m)
5.2.97 snyftade (DV, haic, 11, f)
5.2.98 svarade (DV, alco, 9, f)
5.2.99 torkade (DV, idja, 11, f)
5.2.100 försvann (DV, erge, 9, f)

6 VERB CLUSTER

6.1 Verb form after auxiliary verb

Present
6.1.1 Och i morgon är det brandövning men kom ihåg att det inte ska blir någon riktig brand.
6.1.2 Ibland får man bjuda på sig själv och låter henne/honom vara med !
5.2.99 Jag tittade på Virginia som torkad av sin näsa som var blodig på tröjarmen.

Strong verbs
5.2.100 Nästa dag så var en ryggsäck borta och mera grejer försvinna

CORRECTION (CORP, SUBJ, AGE, SEX):
6.1.1 bli (CF, klma, 10, f)
6.1.2 låta (SE, wj17, 13, f)

Preterite
6.1.3 hon ville inte att jag skulle följde med men med lite tjat fick jag.

Imperative
6.1.4 Men de var fult med buskar utan för som vi fick rid igenom.
6.1.5 han råkade bara kom i mot getingboet.
6.1.6 Det är något som vi alla nog skulle gör om vi inte hade läst på ett prov.
6.1.7 Jag skrattade och undrade hur Tromben skulle ha kom igenom det lilla hålet.

6.2 Missing auxiliary verb

Temporal 'ha'
6.2.1 ni måste hjälpa mig om ni ska få henne. och dom — lovat att bygga upp staden och de blev hotell
6.2.2 Men pappa — frågat mig om jag ville följa med

CORRECTION (CORP, SUBJ, AGE, SEX):
6.1.3 följa (DV, alhe, 9, f)
6.1.4 rida (DV, idja, 11, f)
6.1.5 komma (FS, haic, 11, f)
6.1.6 göra (SE, wj20, 13, m)
6.1.7 kommit (DV, idja, 11, f)
6.2.1 har/hade (DV, erge, 9, f)
6.2.2 har/hade (DV, haic, 11, f)

7 INFINITIVE PHRASE

7.1 Verb form after infinitive marker

Present
7.1.1 Men hunden klarar att inte slår sig.

Imperative
7.1.2 glöm inte att stäng dörren
7.1.3 jag försökt att kom till fören
7.1.4 Åt det går det nog inte att gör så mycket åt.

CORRECTION (CORP, SUBJ, AGE, SEX):
7.1.1 slå (FS, haic, 11, f)
7.1.2 stänga (DV, hais, 11, f)
7.1.3 komma (DV, haic, 11, f)
7.1.4 göra (SE, wj20, 13, m)

7.2 Missing infinitive marker
7.2.1 Men det vågar man kanske inte i första taget för då kan man ju bli rädd att man kommer — få ett kännetecken som skolans skvallerbytta eller något sånt!
7.2.2 ... tänkte jag att om man ska hålla på så kommer det — inte gå bra i skolan.
7.2.3 Nu när jag kommer att skriva denna uppsatsen så kommer jag — ha en rubrik om några problem och vad man kan göra för att förbättra dom.

CORRECTION (CORP, SUBJ, AGE, SEX):
7.2.1 kommer att få (E13, wj01, 13, f)
7.2.2 kommer det inte att gå (E13, wj06, 13, f)
7.2.3 kommer jag att ha (E13, wj03, 13, f)

8 WORD ORDER
8.1.1 När han kom hem så åt han middag gick och borstade tänderna och gick och sedan lade sig.
8.1.2 för då kan man inte något ting bara kan gå på stan det då fattar hjärnan ingenting
8.1.3 Jag den dan gjorde inget bättre.
8.1.4 att jag har ett problem att jag måste hela tiden fuska på proven annars med på matten nog alla lektioner måste jag fuska och alltid bråka för att få uppmärksamhet.
8.1.5 kompisarna gör det inte men om tvingar dom inte dig till att göra det

CORRECTION (CORP, SUBJ, AGE, SEX):
8.1.1 sedan och (FS, jowe, 9, f)
8.1.2 kan bara (SE, wg03, 10, f)
8.1.3 Jag gjorde inget bättre den dan. (SN, wg07, 10, f)
8.1.4 på matten med (SE, wg10, 10, m)
8.1.5 dom tvingar (SE, wj12, 13, f)

9 REDUNDANCY

9.1 Doubled word

Following directly
9.1.1 Han tittade på sin hund hund oliver
9.1.2 Kompisen ska få titta på en ibland också men, men det får inte bli regelbundet för då...
9.1.3 många som mobbar har har det oftast dåligt hemma
9.1.4 vi skall i alla fall träffas idag 20 mars 1999 måndagen kanske imorgon också också
9.1.5 Jag hade tur jag klarade klarade mig

Word between
9.1.6 jag tycker jag att alla måste få vara med
9.1.7 jag fick jag hjälp med det.
9.1.8 Åt det går det nog inte att gör så mycket åt.
9.1.9 Nasse sprang efter som en liten fnutknapp efter Bovarna.

CORRECTION (CORP, SUBJ, AGE, SEX):
9.1.1 hund (FS, alhe, 9, f)
9.1.2 , men (SE, wj17, 13, f)
9.1.3 har (SE, wj19, 13, m)
9.1.4 också (SN, wg04, 10, m)
9.1.5 klarade (SN, wg10, 10, m)
9.1.6 jag tycker att alla måste få vara med (SE, wg18, 10, m)
9.1.7 jag fick hjälp med det (SN, wj11, 13, f)
9.1.8 Åt det går det nog inte att gör så mycket. (SE, wj20, 13, m)
9.1.9 Nasse sprang som en liten fnutknapp efter bovarna. (CF, haic, 11, f)

9.2 Redundant word
9.2.1 Kalle som blev jätte rädd och sprang till närmaste hus som låg 9, kilometer bort
9.2.2 för då kan man inte något ting bara kan gå på stan det då fattar hjärnan ingenting
9.2.3 Hon och han borde pratat med en vuxen person (läraren). Eller pratat med föräldrarna.

CORRECTION (CORP, SUBJ, AGE, SEX):
9.2.1 (CF, anhe, 11, m)
9.2.2 inte (SE, wg03, 10, f)
9.2.3 (SE, wg12, 10, f)
9.2.4 (DV, haic, 11, f)
10.1.1 jag (CF, erja, 9, m)
10.1.2 jag (SN, wg04, 10, m)
10.1.3 något/det (SE, wg08, 10, f)
10.1.4 folk som (SE, wg14, 10, m)
10.1.5 de (SE, wg19, 10, m)
10.1.6 jag (SE, wj03, 13, f)
10.1.7 jag/vi (SN, wj09, 13, m)
10.1.8 man (SE, wj19, 13, m)
10.1.9 man (SE, wj19, 13, m)
10.1.10 han (FS, mawe, 11, f)
10.2.1 de? (SE, wg03, 10, f)
CORRECTION (CORP, SUBJ, AGE, SEX):
10.2.2 det (SN, wg06, 10, f)
10.2.3 varandra (SE, wg18, 10, m)
10.3.1 att (SN, wj03, 13, f)
10.4.1 hade (DV, alhe, 9, f)
10.4.2 var (FS, hais, 11, f)
10.4.3 att göra (SE, wj07, 13, f)
10.4.4 fick (SN, wj13, 13, m)
10.4.5 , blev (?) (DV, hais, 11, f)

9.2.4 när De kom till en övergiven by va Tor och jag var rädda

10 Missing Constituents

10.1 Subject
10.1.1 — undra vad det brann nånstans jag måste i alla fall larma
10.1.2 vidare hoppas — att vi kommer att vara kompisar rätt länge
10.1.3 Jag tror — skulle hjälpa dem är att ...
10.1.4 I början på filmen var det massa — kollade på den andras papper på uppgiften
10.1.5 man försöker att lära barnen att om — fuskar med t ex ett prov då...
10.1.6 han kommer och klappar alla på handen utan en kille — undra hur han känner sig då?
10.1.7 När jag var ungefär 5 år och gick på dagis så skulle — åka på ett barnkalas hos en tjej med dagiset.
10.1.8 När man tror att man har kompisar blir — ledsen när man bara går där ifrån om just kom dit
10.1.9 När man tror att man har kompisar blir ledsen när man bara går där ifrån om — just kom dit
10.1.10 Dom satte av efter Billy och Åke som suttit i ett träd men blivit nerputtad av en uggla — blev nästan nertrampad.

10.2 Object or other NPs
10.2.1 Om dom bråkar som — är det inte så mycket man kan göra åt saken
10.2.2 jag viste att han skulle bli lite ledsen då efter som vi hade bestämt —.
10.2.3 Om man sätter barn som är lika bra som — på samma ställe blir det bättre för...

10.3 Infinitive marker
10.3.1 Efter — ha sprungit igenom häckarna två gånger så vilade vi lite...

10.4 (att) Verb
10.4.1 en port som va helt glittrig och — 2 guldögon och silver mun.
10.4.2 sedan skuttade han fram vidare till den öppna burken där grodan — han. Nosade förundrat på grodan
10.4.3 Jag tycker att det har med ens uppfostran — om man nu ger eller inte ger hon/han den saken som man tappade.
10.4.4 ... så kom det några utlänningar och tog bollen och vi — inte tillbaka den.
10.4.5 då bar det av i 14 dagar och 14 äventyrsfyllda nätter jagade av älg — kompis med huggorm trampat på igelkott mycket hände verkligen.

10.5 Adverb
10.5.1 tuni hade jätte ont i knät men hon ville — sluta för det.

10.6 Preposition
10.6.1 Gunnar var på semester — Norge och åkte skidor.
10.6.2 dom bär massor av sken smycken massor — saker
10.6.3 det ena huset efter det andra gjordes — ordning
10.6.4 Hunden hoppade ner — ett getingbo.
10.6.5 Nej det var inte grodan som bodde — hålet.
10.6.6 Pojken som var på väg upp — ett träd fick slänga sig på marken...
10.6.7 att de som kollade på den andras papper skall träna mer — sin uppgift
10.6.8 ... så tänkte jag att det är — verklighet sånt händer
10.6.9 Mobbning handlar nog mycket — att man inte förstår olika människor.
10.6.10 men jag blev — alla fall jätte rädd för...
10.6.11 mobbing är det värsta som finns och — dom som gör det saknas det säkert någonting i huvudet.

10.7 Conjunction and subjunction
10.7.1 han gick upp på stora stenen — ropa hallå! hallå!
10.7.2 Simon klädde på sig — åt frukost.
10.7.3 Det som flickan gjorde när det var en vuxen — svarade i sin mobiltelefon som tappade en 100 lapp.
10.7.4 ...till exempel — den här killen gör så igen så...
10.7.5 om det är en tjej man inte alls är bra kompis med — kommer och sätter sig på bänken

10.8 Other
10.8.1 Alla blev rädda för hans skrik hans hämnd kunde vara — som helst ...
10.8.2 dom gick ut på kullek och letade. — och på marken och i luften.
10.8.3 De körde långt bort och till slut kom de fram till en gärdsgård och det var massor av hus —.
10.8.4 sen levde vi lyckliga — våra dagar
10.8.5 att jag har ett problem att jag måste hela tiden fuska på proven annars — med på matten nog alla lektioner måste jag fuska och alltid bråka för att få uppmärksamhet.
10.8.6 den som hörde de där stygga orden vågade kanske inte spela på en konsert för att — vara rädd att bli utskrattat av avundsjuka personer.
CORRECTION (CORP, SUBJ, AGE, SEX):
10.5.1 inte (SN, wj03, 13, f)
10.6.1 i (DV, erha, 10, m)
10.6.2 av (DV, haic, 11, f)
10.6.3 i (DV, hais, 11, f)
10.6.4 i (FS, anhe, 11, m)
10.6.5 i (FS, haic, 11, f)
10.6.6 i (FS, idja, 11, f)
10.6.7 på (SE, wg14, 10, m)
10.6.8 i verkligheten (SE, wj06, 13, f)
10.6.9 om (SE, wj20, 13, m)
10.6.10 i (SN, wg18, 10, m)
10.6.11 hos (SE, wj05, 13, m)
10.7.1 och (FS, haic, 11, f)
10.7.2 och (FS, hais, 11, f)
10.7.3 som (SE, wg14, 10, m)
10.7.4 om (SE, wj03, 13, f)
10.7.5 som (SE, wj17, 13, f)
10.8.1 hur hemsk/vad (CF, frma, 9, m)
10.8.2 de letade? (FS, hais, 11, f)
10.8.3 där (DV, alca, 11, f)
10.8.4 i alla (DV, hais, 11, f)
10.8.5 (?) (SE, wg10, 10, m)
10.8.6 han/hon var (SE, wg11, 10, f)

10.8.7 Om man inte kan det man ska göra och tittar på någon annan visar — någon annans resultat sen.
10.8.8 För att förbättra det är nog — att man ska prata med en lärare eller förälder så...

11 WORD CHOICE

11.1 Prepositions and particles
11.1.1 dom peka på väggen av tunneln
11.1.2 Vi sprang allt vad vi orkade ner till sjön och slängde ur oss kläderna.
11.1.3 Jag kom ihåg allt som hänt innan jag trillat ifrån grenen.
11.1.4 Han ropade ut igenom fönstret men inget kvack kom tillbaka.
11.1.5 sen var det problem på klass fotot
11.1.6 Jag tycker att om man har svårigheter för att skriva eller nåt annat skall man visa det...
11.1.7 vi var väldigt lika på sättet alltså vi tyckte om samma saker
11.1.8 Jag blev glad på Malin att hon hjälpte mig att säga det till honom för...
11.1.9 han kommer och klappar alla på handen utan en kille
11.1.10 När vi skulle gå av satt jag och dagdrömde och så gick alla av utan jag.

11.2 Adverb
11.2.1 Jag undrar ibland vart mamma är men det är ingen som vet.
11.2.2 Men vart ska jag bo?
11.2.3 Men vart dom en letade hittade dom ingen groda.

11.3 Infinitive marker
11.3.1 det var onödigt och skrika pappa
11.3.2 sen gick jag in och la mig och sova
11.3.3 men jag vet inte hur man ska få dom och göra det.
11.3.4 ... men om man vill försöka bli kompis med några tjejer/killar och kanske försöker och gå fram ...
11.3.5 ... det fick en och tänka till hur man kan hjälpa såna som är utsatta.
11.4 Pronoun
11.4.1 vad skulle dom göra dess pengar tog nästan slut
11.4.2 Det är vanligt att om man har problem hemma att man lätt blir arg och det går då ut över sina kompisar.

CORRECTION (CORP, SUBJ, AGE, SEX):
10.8.7 (?) (SE, wj05, 13, m)
10.8.8 det bästa? (SE, wj07, 13, f)
11.1.1 i (DV, alhe, 9, f)
11.1.2 av (DV, idja, 11, f)
11.1.3 från (CF, jowe, 9, f)
11.1.4 genom (FS, caan, 9, m)
11.1.5 med (SE, wg18, 10, m)
11.1.6 med (SE, wj11, 13, f)
11.1.7 till (SN, wg04, 10, m)
11.1.8 (?) (SN, wg06, 10, f)
11.1.9 utom (SE, wj03, 13, f)
11.1.10 utom (SN, wj09, 13, m)
11.2.1 var (CF, erge, 9, f)
11.2.2 var (CF, erge, 9, f)
11.2.3 var (FS, anhe, 11, m)
11.3.1 att (DV, alhe, 9, f)
11.3.2 att (DV, alhe, 9, f)
11.3.3 att (SE, wg18, 10, m)
11.3.4 att (SE, wj08, 13, f)
11.3.5 att (SE, wj16, 13, f)
11.4.1 deras (DV, jowe, 9, f)
11.4.2 ens (SE, wj12, 13, f)

11.5 Blend
11.5.1 när dom kommer hem så märker inte föräldrarna något även fast att man luktar rök och sprit
11.5.2 Det är nog inte ett ända barn som inte har något problem även fast att man inte har så stora
11.5.3 jag sprang så fort så mycket jag var värd

11.6 Other
11.6.1 Hon satte sig på det guldigaste och mjukaste gräset i hela världen.
11.6.2 men se där är ni ju det lilla följet bestående av snutna djur från djuraffären.
11.6.3 Jag tittade på Virginia som torkad av sin näsa som var blodig på tröjarmen.
11.6.4 jag förstår inte vad fröken menar med grammatik näringsväv och allt de andra.
11.6.5 Nasse sprang efter som en liten fnutknapp efter Bovarna.

12 REFERENCE

12.1 Erroneous referent

Number
12.1.1 Lena fick en kattunge...Och Alexander fick ett spjut. sen gav den sej iväg
12.1.2 när de gått och gått så hände något långt bort skymtade ett gult hus. vi närmade oss de sakta
12.1.3 Att Urban hade en fru. och en massa ungar hade det.
12.1.4 Oliver försökte få av sig burken så aggressivt så han ramlade över kanten. Erik tittade efter honom med en frågande min När Oliver hade dom i baken så hopade Erik ner.

Gender
12.1.5 ...vad heter din mamma? Det stod bara helt still i huvudet vad var det han hette nu igen?
12.1.6 Om nu någon tappar någon som pengar...

12.2 Change of referent
12.2.1 spring ut nu vi har besökare när ni kom ut ...
12.2.2 Om dom som mobbar någon gång blir mobbad själv skulle han ändras helt och hållet.

13 OTHER

13.1 Adverb
13.1.1 När jag var liten mindre ...

13.2 Strange construction
13.2.1 så Pär var läggdags
13.2.2 god natt på er Ses i morgon i går god natt
13.2.3 när vi rast skulle stänga affären så gömde jag mig.

CORRECTION (CORP, SUBJ, AGE, SEX):
11.5.1 även om/fastän (SE, wj12, 13, f)
11.5.2 även om/fastän (SE, wj12, 13, f)
11.5.3 allt vad (DV, haic, 11, f)
11.6.1 mest gulda (DV, angu, 9, f)
11.6.2 stulna (DV, hais, 11, f)
11.6.3 ärmen (DV, idja, 11, f)
11.6.4 näringslära? (CF, angu, 9, f)
11.6.5 ? (CF, haic, 11, f)
12.1.1 de (DV, angu, 9, f)
12.1.2 det (DV, hais, 11, f)
12.1.3 de (FS, alhe, 9, f)
12.1.4 den (FS, alhe, 9, f)
12.1.5 hon (CF, hais, 11, f)
12.1.6 något (SE, wj07, 13, f)
12.2.1 vi (DV, hais, 11, f)
12.2.2 dom/han (?) (SE, wj05, 13, m)
13.1.1 lite (SN, wj11, 13, f)
13.2.1 (DV, frma, 9, m)
13.2.2 (DV, hais, 11, f)
13.2.3 (DV, hais, 11, f)

B.2 Misspelled Words

Errors are categorized by part-of-speech and then by the part-of-speech they are realized in, indicated by an arrow (e.g. 'Noun → Noun': a noun becoming another noun).

1 NOUN

1.1 Noun → Noun
Medan Oliver hoppade efter bot.
Grävde sig Erik längre ner i bot
men upp ur bot kom ett djur upp.
Erik sprang i väg medan Oliver välte ner det surande bot.
Bina som bodde i bot rusade i mot Oliver
men hunden hade fastnat i buken
att dom bot i en jätte fin dy det va deras dy.
Det KaM Till EN övergiven Bi
dam bodde i en bi
pappa i har hittat än övergiven bi
de var en by en öde dy.
både pappa och jag kom då att tänka på den dyn vi va i
på vägen hem undrade pär hur dyn hade kommit till.
jag sprang till boten
sen vaknade vi i botten
Den där scenen med dammen som tappade sedlarna
Renen sprang tills dom kom till en dam
kastad ner olof och hans hund i en dam
En dag när han var vid damen drog han med håven i vattnet och fick upp en groda.
Men damen är inte så djup.
Vi kom Över Molnen Jag och Per på en flygande fris som hette Urban.
pojken och huden kom i vattnet.
de lät precis som Fjory hennes hast
August rosen gren har lämnat hjorden...
därför skulle dom andra i klasen visa hur duktiga dom var.
Den brinnande makan huset brann upp för att makan hade tagit eld.
En dag tänkte Urban göra varma makor.
Manen var tjock och rökte cigarr.
Ni har en son som ringt efter oss sa manen.
Den gamla manen Berättade om en by han Bot i för länge sedan
den här gamla manen har tagit hand om oss.
manen kom ut med tre skålar härlig soppa.
men så en dag kom en man som hette svarta manen

CORRECTION (CORP, SUBJ, AGE, SEX):
1.1.1 boet (FS, alhe, 9, f)
1.1.2 boet (FS, alhe, 9, f)
1.1.3 boet (FS, alhe, 9, f)
1.1.4 boet (FS, alhe, 9, f)
1.1.5 boet (FS, alhe, 9, f)
1.1.6 burken (FS, frma, 9, m)
1.1.7 by (DV, alhe, 9, f)
1.1.8 by (DV, alhe, 9, f)
1.1.9 by (DV, erja, 9, m)
1.1.10 by (DV, erja, 9, m)
1.1.11 by (DV, erja, 9, m)
1.1.12 by (DV, frma, 9, m)
1.1.13 byn (DV, alhe, 9, f)
1.1.14 byn (DV, frma, 9, m)
1.1.15 båten (DV, haic, 11, f)
1.1.16 båten (DV, haic, 11, f)
1.1.17 damen (SE, wg09, 10, m)
1.1.18 damm (FS, alhe, 9, f)
1.1.19 damm (FS, frma, 9, m)
1.1.20 dammen (FS, alhe, 9, f)
1.1.21 dammen (FS, jobe, 10, m)
1.1.22 gris??? (DV, caan, 9, m)
1.1.23 hunden (FS, haic, 11, f)
1.1.24 häst (DV, alco, 9, f)
1.1.25 jorden (DV, hais, 11, f)
1.1.26 klassen (SE, wg02, 10, f)
1.1.27 mackan (CF, caan, 9, m)
1.1.28 mackan (CF, caan, 9, m)
1.1.29 mackor (CF, caan, 9, m)
1.1.30 mannen (CF, alco, 9, f)
1.1.31 mannen (CF, idja, 11, f)
1.1.32 mannen (DV, angu, 9, f)
1.1.33 mannen (DV, angu, 9, f)
1.1.34 mannen (DV, angu, 9, f)
1.1.35 mannen (DV, angu, 9, f)

1.1.36 för manen hade många djur
1.1.37 Det var nog den här byn manen talade om
1.1.38 det var svarta manen.
1.1.39 Lena gick fram till svarta manen
1.1.40 svarta manen blev rädd
1.1.41 svarta manen sprang sin väg
1.1.42 det log maser av saker runtomkring
1.1.43 dom bär maser av sken smycken
1.1.44 ... men plötsligt tog matten slut.
1.1.45 alla menen och Pappa gick in i ett av huset
1.1.46 pojken skrek ett tupp!
1.1.47 ja tak
1.1.48 just då ringde telefånen och pappa svarade:
1.1.49 Fram ur vasen kom det något
1.1.50 Sen gick jag ut, och fram för mig stod värdens finaste häst.
1.1.51 dom som borde på örn kanske försökte koma på skepp

1.2 Noun → Adjective
1.2.1 man kunde rida fyra i bred
1.2.2 kale som blev jätte rädd...
1.2.3 ... och där fans ett tempel fult med matt.

1.3 Noun → Pronoun
1.3.1 Men det han höll i var ett par hon som i sin tur satt fast i en hjort.

1.4 Noun → Numeral
1.4.1 olof som klättrade i ett tre

1.5 Noun → Verb
1.5.1 pappa gick och knacka på en dör till
1.5.2 och knacka på en dör
1.5.3 Lena var en flika som var 8 år.
1.5.4 Han letade i ett hål medans hunden skällde på masa bin.
1.5.5 När han landa så svepte masa bin över honom.
1.5.6 hunden hade hittat masa getingar
1.5.7 där va en masa människor
1.5.8 Jag tycker att om man inte gillar en viss person ska man inte visa det på ett så taskigt sett.

1.6 Noun → Preposition
1.6.1 Då fick muffins syn på en massa in och började jaga dom.
1.6.2 dam flyttade naturligtvis till den övergivna b in

1.7 Noun → More than one category
1.7.1 Jag hade en jacka på mig som det var ett litet håll i...
1.7.2 Hur ska men kunna göra för att förbättra dessa problem?

CORRECTION (CORP, SUBJ, AGE, SEX):
1.1.51 ön (DV, haic, 11, f)
1.2.1 bredd (DV, idja, 11, f)
1.2.2 Kalle (CF, anhe, 11, m)
1.2.3 mat (DV, erge, 9, f)
1.3.1 horn (FS, anhe, 11, m)
1.4.1 träd (FS, frma, 9, m)
1.5.1 dörr (DV, alhe, 9, f)
1.5.2 dörr (DV, alhe, 9, f)
1.5.3 flicka (DV, angu, 9, f)
1.5.4 massa (FS, erge, 9, f)
1.5.5 massa (FS, erge, 9, f)
1.5.6 massa (FS, haic, 11, f)
1.5.7 massa (DV, alhe, 9, f)
1.5.8 sätt (SE, wg17, 10, f)
1.6.1 bin (FS, jowe, 9, f)
1.6.2 byn (DV, erja, 9, m)
1.7.1 hål (SN, wg16, 10, f)
1.7.2 man (SE, wj03, 13, f)

1.7.3 ...och vad men kan göra för att förbättra dom.
1.7.4 Att utfrysa en kompis eller någon annan kan vara det värsta men någonsin kan göra tycker jag.
1.7.5 Precis då kom pappa och hans men.
CORRECTION (CORP, SUBJ, AGE, SEX):
1.7.3 man (SE, wj03, 13, f)
1.7.4 man (SE, wj03, 13, f)
1.7.5 män (DV, haic, 11, f)
2.1.1 kallt (CF, erge, 9, f)
2.1.2 trygga (CF, frma, 9, m)
2.2.1 bäst (CF, hais, 11, f)
2.2.2 enda (CF, idja, 11, f)
2.2.3 enda (DV, jowe, 9, f)
2.2.4 enda (SE, wj12, 13, f)
2.2.5 enda (SE, wj13, 13, m)
2.2.6 enda (SN, wg19, 10, m)
2.2.7 rädd (CF, anhe, 11, m)
2.2.8 rädd (FS, frma, 9, m)
2.2.9 rädd (FS, frma, 9, m)
2.2.10 rädd (SN, wg18, 10, m)
2.2.11 rädda (CF, frma, 9, m)
2.2.12 tyken (SE, wj14, 13, m)
2.3.1 förra (FS, idja, 11, f)
2.3.2 kända (DV, erge, 9, f)
2.3.3 lätt (SE, wg03, 10, f)
2.3.4 rädd (FS, erja, 9, m)
3.1.1 alla (DV, idja, 11, f)
3.1.2 de (FS, alhe, 9, f)
3.1.3 de (FS, caan, 9, f)
3.1.4 de (DV, alco, 9, f)
3.1.5 de (DV, erja, 9, m)
3.1.6 de (DV, erja, 9, m)
3.1.7 de (DV, erja, 9, m)
3.1.8 de (DV, erja, 9, m)
3.1.9 de (DV, frma, 9, m)
3.1.10 de (DV, jobe, 10, m)
3.1.11 det (CF, angu, 9, m)
3.1.12 det (CF, angu, 9, f)

2 ADJECTIVE

2.1 Adjective → Adjective
2.1.1 Pappa du har glömt att tända brasan och det är kalt.
2.1.2 det är den plikt att få ås att bli dryga

2.2 Adjective → Noun
2.2.1 när hon var som best
2.2.2 ... men inte en ända människa syntes till.
2.2.3 det här brevet är det ända jag kan ge dig idag
2.2.4 Det är nog inte ett ända barn som inte har något problem även fast att man inte har så stora
2.2.5 Det ända jag vet om grov mobbing är det jag har sett på tv!
2.2.6 ... för det var det ända sättet att komma upp till en koja
2.2.7 kalle som blev jätte räd
2.2.8 han blev så räd
2.2.9 han var lite räd för kråkan
2.2.10 jag blev alla fall jätte räd
2.2.11 alla var reda
2.2.12 man behöver inte vara tycken bara för man inte vill vara med han.

2.3 Adjective → Verb
2.3.1 Och kanske var det ett barn till hans föra groda.
2.3.2 ... och spökena blev skända...
2.3.3 jag tror man ska ta ett lett prov först men...
2.3.4 pojken blev red

3 PRONOUN

3.1 Pronoun → Pronoun
3.1.1 fortsatte det att ringa i alle fall
3.1.2 och en massa ungar hade det.
3.1.3 Han sa till hunden att vara tyst för att det skull titta efter.
3.1.4 Det kom till en övergiven by
3.1.5 Det KaM Till EN övergiven Bi
3.1.6 när det kam hem sade pappa...
3.1.7 när det hade kommit en liten bit sa pappa...
3.1.8 då hörde det att det bubblade...
3.1.9 Det kom till en, plats som de aldrig hade varit, på.
3.1.10 Det kom till en övergiven by 3.1.11 jag förstår inte vad fröken menar med grammatik näringsväv och allt de andra. 3.1.12 Och sen den dagen de brann i Kamillas lägenhet leker vi alltid brandmän. Appendix B. 296 E RROR 3.1.13 de börjar att skymma 3.1.14 De var han och han hade hittat en partner. 3.1.15 ... men de kom ingen groda den här gången heller 3.1.16 De va en pojke som hette olof 3.1.17 de va en älg 3.1.18 mormor berättade att de fanns en by bortom solens rike 3.1.19 där de fanns små röda hus med vita knutar 3.1.20 ja men nu är de läggdags sa mormor. 3.1.21 Anna funderade halva natten över de där med morfar 3.1.22 de lät precis som Fjory hennes häst 3.1.23 de såg faktiskt ut som en övergiven by 3.1.24 de var bara ett fönster som lyste 3.1.25 De var en kväll som Lisa jag alltså ville höra en saga... 3.1.26 och dom lovat att bygga upp staden och de blev hotell 3.1.27 de var en by en öde by. 3.1.28 de var tid för familjen att gå hem. 3.1.29 Det var dåligt väder de blåste och regnade. 3.1.30 de blåste mer och mer 3.1.31 Men de var fullt med buskar utanför 3.1.32 Dom gick in genom dörren och blev förvånade av de dom såg. 3.1.33 de kunde berott på att dom gillade samma tjej. 3.1.34 När jag får se en son här film tänker jag på att de nog är så i de flesta skolorna 3.1.35 ... för de är nog något typiskt med de 3.1.36 ... för de är nog något typiskt med de 3.1.37 de får man nog för man får så mycket att göra när man blir större 3.1.38 Den är ju inte heller säkert att den kompisen man kollar på har rätt 3.1.39 De var bara ungdomar inga vuxna. 3.1.40 De hela började med att jag och min morfar skulle cykla ner till sjön för... 3.1.41 de verkade lugnt. 3.1.42 de va en vanlig måndag 3.1.43 ... efter som de fanns en hel del snälla kompisar i min klass så hjälpte dom mig... 3.1.44 När jag kom på fötter igen så hade de kommit cirka tolv stycken i min klass och hjälpte mig 3.1.45 det är den plikt att få ås att bli dryga 3.1.46 Dem kom med en stegbil och hämtade oss. 
3.1.47 Nästa dag gick dem upp till en grotta 3.1.48 där fick dem var sin korg med saker i 3.1.49 Dem hade ett privatplan 3.1.50 nu slår dem upp tältet för att vila... 3.1.51 nästa morgon går dem långt långt 3.1.52 men till slut kom dem till en övergiven by. 3.1.53 där stannade dem och bodde där resten av livet C ORP S UBJ AGE S EX det det det C ORRECTION CF FS FS frma caan frma 9 9 9 m m m det det det FS FS DV frma frma alco 9 9 9 m m f det det det DV DV DV alco alco alco 9 9 9 f f f det det det det DV DV DV DV alco alco alco erge 9 9 9 9 f f f f det DV erge 9 f det det det det det det DV DV DV DV DV DV frma frma hais idja idja mawe 9 9 11 11 11 11 m m f f f f det det SE SE wg07 wg20 10 10 f m det det det SE SE SE wg20 wg20 wg20 10 10 10 m m m det SE wj17 13 f det det SE SN wj18 wg10 13 10 m m det det det SN SN SN wg11 wg20 wg20 10 10 10 f m m det SN wj10 13 m din dom dom dom dom dom dom dom dom CF CF DV DV DV DV DV DV DV frma jobe angu angu jobe jobe jobe jobe jobe 9 10 9 9 10 10 10 10 10 m m f f m m m m m Error Corpora E RROR 3.1.54 dem kanske bodde i ett hus som dem fick hyra 3.1.55 dem kanske bodde i ett hus som dem fick hyra 3.1.56 ... dem måste få höga betyg annars får de skäll av sina föräldrar. 3.1.57 Dem andra människorna som kollade på sina kompisars provpapper, 3.1.58 ... när dem började bråka, 3.1.59 dem kunde väl hjälpa varandra. 3.1.60 Men dem fortsatte. 3.1.61 Men jag fortsatte kämpa för dem två skulle kunna se på varan utan att vända bort huvudet, 3.2 3.2.1 3.2.2 3.2.3 3.2.4 3.2.5 3.2.6 3.2.7 3.2.8 3.2.9 3.2.10 3.2.11 3.2.12 3.2.13 3.2.14 3.2.15 3.2.16 3.2.17 3.2.18 3.2.19 3.2.20 3.2.21 3.2.22 3.2.23 3.2.24 3.2.25 3.2.26 3.2.27 3.2.28 3.2.29 3.2.30 3.2.31 3.2.32 3.2.33 3.2.34 3.2.35 Pronoun → Noun ... för att du är ju alt jag har. ... och alt var en dröm. Någon anan la mig på en bår... och gick till en anan tunnel Det finns nog en anan väg... så jag fik åka med en anan som skulle också hänga med var är set... 
var är set här snabbt springer dam ut ur brand bilarna snabbt tar dam fram stegen dam ramlar rakt ner i en damm då är dam ännu närmare ljudet dam bodde i en by dam tåg och så med sig sina två tigrar när dam hade kommit än bit in i skogen å dam två tigrarna följde också med dam red bod när dam kam hem dam flyttade naturligtvis till den övergivna in där levde dam lyckliga tillslut blev dam två kamelerna så trötta... när dam kam hem var kl. 12 hon fråga va det var för not och efter som det inte fans not lock på burken han har fot syn på not om det skulle hända not om man såg en älg eller räv och not anat stort djur en poäng alltid not ni får gärna bo hos oss under tid en ni inte har not att bo i. det är den plikt att få ås att bli dryga och la os på varsin sida av den spikiga toppen och utrusta os sa Desere med en son skarp röst hon alltid använde. gick vi upp till utgången av tältet men upptäckte varan och vi blev så rädda Visa i filmen gillade inte varan 297 C ORP S UBJ AGE S EX dom dom dom C ORRECTION SE SE SE wg01 wg01 wg01 10 10 10 f f f dom SE wg01 10 f dom dom dom dom SN SN SN SN wg01 wg01 wg01 wg07 10 10 10 10 f f f f allt allt annan annan annan annan CF DV CF DV DV SN erge caan erge alhe idja wg20 9 9 9 9 11 10 f m f f f m det det dom dom dom dom dom dom dom dom dom dom dom dom dom dom nåt nåt nåt nåt nåt DV DV CF CF FS FS DV DV DV DV DV DV DV DV DV DV CF FS FS DV DV hais hais erja erja erja erja erja erja erja erja erja erja erja erja erja frma alhe alhe frma alhe alhe 11 11 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 f f m m m m m m m m m m m m m m f f m f f nåt nåt DV DV alhe idja 9 11 f f oss oss oss sån CF DV DV DV frma alhe alhe hais 9 9 9 11 m f f f varann DV alhe 9 f varann SE wg06 10 f Appendix B. 298 E RROR 3.2.36 det första problemet är att dom kollar på varan 3.2.37 för då tittar man inte på varan. 
3.2.38 Men jag fortsatte kämpa för dem två skulle kunna se på varan utan att vända bort huvudet, 3.3 3.3.1 3.3.2 3.3.3 3.3.4 3.3.5 3.3.6 3.3.7 3.3.8 3.3.9 3.3.10 3.3.11 3.3.12 3.3.13 3.3.14 3.3.15 Pronoun → Verb om man såg en älg eller räv och not anat stort djur Vi såg ormar spindlar krokodiler ödlor och anat. hanns groda var försvunnen. hanns mamma hade slängt ut den. som nu satt på hanns huvud. för att hanns kruka hade gått sönder kastad ner olof och hanns hund i en dam jag fick låna hanns mobiltelefon. han frågade honom nått ... den killen eller tjejen måste ha nått problem eller... om det kommer nån ny till klassen eller nått ...så hon hamnade inne i skogen på nått konstigt sätt... När det var två flickor som satt på en bänk så kom det en annan flicka som satte säg bredvid Det var också väldigt roligt för att man kände säg inte ensam om det. man får nog mer sona problem när man kommer högre upp i skolan 3.4 3.4.1 3.4.2 Pronoun → Preposition vi bar allt till mamma hos sa... sen när in kompis skulle hoppa så... 3.5 3.5.1 3.5.2 Pronoun → Interjection va fiffigt tänkte ja då börja alla i hela tunneln förutom pappa och ja gråta vilken fin klänning ja har Madde vaknade av mitt skrik, hon fråga va det var för nåt. 3.5.3 3.5.4 3.6 3.6.1 3.6.2 3.6.3 3.6.4 3.6.5 3.6.6 3.6.7 3.6.8 3.6.9 3.6.10 3.6.11 Pronoun → More than one category Det var än gång än man som hette Gustav Det var än gång än man som hette Gustav än dag när Gustav var på jobbet ringde det han trycker på än knapp Gustav sitter i än av brand bilarna där e än där uppe på än balkong står det ett barn han hade än groda män än natt klev grodan upp ur glas burken det var än gång två pojkar dam bodde i än bi. 
C ORP S UBJ AGE S EX varann varann varann C ORRECTION SE SE SN wg18 wg18 wg07 10 10 10 m m f annat DV alhe 9 f annat DV caan 9 m hans hans hans hans hans hans nåt nåt FS FS FS FS FS SN DV SE alhe alhe alhe alhe frma wg14 haic wj08 9 9 9 9 9 10 11 13 f f f f m m f f nåt nåt SE SN wj08 wj08 13 13 f f sig SE wg14 10 m sig SN wj11 13 f såna SE wg20 10 m hon min DV SN haic wj08 11 13 f f jag jag DV DV alhe alhe 9 9 f f jag vad DV CF angu alhe 9 9 f f en en en en en en en en en en en CF CF CF CF CF CF CF FS FS DV DV erja erja erja erja erja erja erja erja erja erja erja 9 9 9 9 9 9 9 9 9 9 9 m m m m m m m m m m m Error Corpora 3.6.12 3.6.13 3.6.14 3.6.15 3.6.16 3.6.17 3.6.18 3.6.19 3.6.20 3.6.21 3.6.22 3.6.23 3.6.24 3.6.25 3.6.26 3.6.27 3.6.28 3.6.29 3.6.30 4 4.1 4.1.1 4.1.2 4.1.3 4.1.4 4.1.5 4.1.6 4.1.7 4.1.8 4.1.9 4.1.10 4.1.11 4.1.12 4.1.13 4.2 4.2.1 4.2.2 299 E RROR C ORRECTION S UBJ AGE S EX pappa vi har hittat än övergiven bi. än dag sa Niklas ska vi rida ut när dam hade kommit än bit in i skogen än liten bit in i skogen såg dom än övergiven by än liten bit in i skogen såg dom än övergiven by Man ska vara en bra kompis, när någon vill vara än själv. jag satt ner men packning Men var nu då? dörren går inte upp. När simon kom ut och såg var som hade hänt... Hans hund Taxi var nyfiken på var det var för något i burken. Men var är det för ljud? var fan gör du Sjävl tycker jag att killarnas metoder är mer öppen och ärlig men också mer elak än var tjejernas metoder är. Hjälp det brinner vad nånstans undra vad det brann nånstans jag måste i alla fall larma Jag visste inte att brandbilen vad på väg förbi min egen by. Lena sa vad är vi hon såg sig omkring Visa i filmen gillade inte varan dom bråkade och lämnade visa utanför. 
en en en en DV DV DV DV erja erja erja erja 9 9 9 9 m m m m en DV erja 9 m en SE wg05 10 m min vad vad vad DV CF FS FS haic idja hais idja 11 11 11 11 f f f f vad vad vad FS SE SE idja wg07 wj13 11 10 13 f f m var var CF CF erja erja 9 9 m m var CF jowe 9 f var Vissa vissa DV SE SE angu wg06 wg06 9 10 10 f f f bet FS angu 9 f bodde DV haic 11 f hoppade FS alhe 9 f hoppade hoppade hålla lyfta låg låtsas FS DV SN SN DV SE anhe idja wj12 wg16 haic wj14 11 11 13 10 11 13 m f f f f m ryckte satt surrade sätt CF FS FS DV anhe haic erja alco 11 11 9 9 m f m f beror SE wg12 10 f bott DV angu 9 f VERB Verb → Verb Upp ur hålet kom en grävling och bett pojken i näsan dom som borde på örn kanske försökte koma på skepp När Oliver hade dom i baken så hopade Erik ner. Och pojken hopade efter hunden. Vi hopade upp på hästarna... ...för att hälla henne sällskap. först försökte hon att lufta mig... det log maser av saker runtomkring han behöver inte lossas om som ingenting har hänt, brand männen rykte ut och släkte elden hunden sa på pojkens huvet. då surade bina rakt över pojken sett dig hon gjorde som mannen sa Verb → Noun Och problemet kanske bror på att kompisarna inte tyckte om den personen Den gamla manen Berättade om en by han Bot i för länge sedan C ORP Appendix B. 300 4.2.3 4.2.4 4.2.5 4.2.6 4.2.7 4.2.8 4.2.9 4.2.10 4.2.11 4.2.12 4.2.13 4.2.14 4.2.15 4.2.16 4.2.17 4.2.18 4.2.19 4.2.20 4.2.21 4.2.22 4.2.23 4.2.24 4.2.25 4.2.26 4.2.27 4.2.28 4.2.29 4.2.30 4.2.31 4.2.32 4.2.33 4.2.34 4.2.35 4.2.36 4.2.37 4.2.38 4.2.39 4.2.40 4.2.41 E RROR C ORRECTION C ORP S UBJ AGE S EX Men konstigt nog ville jag se den hästen fastän den inte fans. Det fans en doktor som pratade vänligt med mig, och efter som det inte fans not lock på burken Men i hålet fans bara... mormor berättade att de fans en by bortom solens rike därde fans små röda hus med vita knutar där Annas morfar hade bott ... och där fans ett tempel fult med matt. 
Corrections belonging to the immediately preceding entries (corpus, subject, age, sex): fanns (CF, alhe, 9, f); fanns (CF, erge, 9, f); fanns (FS, alhe, 9, f); fanns (FS, erge, 9, f); fanns (DV, alco, 9, f) ×2; fanns (DV, erge, 9, f).

4.2.10 men efter som de fans en hel del snälla kompisar i min klass → fanns (SN, wg20, 10, m)
4.2.11 när jag kom ut ur huset sa Kamilla att jag fik hunden... → fick (CF, angu, 9, f)
4.2.12 Så fik pojken ett grodbarn → fick (FS, caan, 9, m)
4.2.13 Och vad fik dom se? → fick (FS, erge, 9, f)
4.2.14 men med lite tjat fik jag → fick (DV, alhe, 9, f)
4.2.15 och för varje djur fik man 1 eller 3 poäng → fick (DV, alhe, 9, f)
4.2.16 fik man tio poäng → fick (DV, alhe, 9, f)
4.2.17 först fik jag panik → fick (DV, alhe, 9, f)
4.2.18 hon hoppade till när hon fik syn på oss → fick (DV, hais, 11, f)
4.2.19 Men de var fult med buskar utan för som vi fik rid igenom. → fick (DV, idja, 11, f)
4.2.20 så jag fik åka med en anan som skulle också hänga med → fick (SN, wg20, 10, m)
4.2.21 han har fot syn på not → fått (FS, frma, 9, m)
4.2.22 ... som dom hade fot tillsammans. → fått (FS, haic, 11, f)
4.2.23 På morgonen vaknade vi och kläde på oss → klädde (CF, alhe, 9, f)
4.2.24 Madde sprang upp till sitt rum och kläde på sig → klädde (CF, alhe, 9, f)
4.2.25 Han kläde på sig → klädde (FS, haic, 11, f)
4.2.26 Det Kam Till EN övergiven Bi → kom (DV, erja, 9, m)
4.2.27 när det kam hem sade pappa... → kom (DV, erja, 9, m)
4.2.28 när Niklas och Bennys halva kam fram till en damm → kom (DV, erja, 9, m)
4.2.29 upp ur dammen kam två krokodiler → kom (DV, erja, 9, m)
4.2.30 när dam kam hem → kom (DV, erja, 9, m)
4.2.31 när dam kam hem var kl. 12 → kom (DV, frma, 9, m)
4.2.32 då ko min bror → kom (SN, wg18, 10, m)
4.2.33 När jag kom ut såg jag en liten eld låga koma ut genom fönstret, → komma (CF, alhe, 9, f)
4.2.34 det tog en timme att koma ditt → komma (CF, anhe, 11, m)
4.2.35 Pojken som var på väg upp ett träd fick slänga sig på marken för att inte koma i vägen för bin. → komma (FS, idja, 11, f)
4.2.36 dom som borde på örn kanske försökte koma på skepp → komma (DV, haic, 11, f)
4.2.37 hans hämnd kund vara som helst → kunde (CF, frma, 9, m)
4.2.38 på vägen till pappa möte jag en katt → mötte (DV, alhe, 9, f)
4.2.39 Jag gick in och sate mig vid bordet och åt. → satte (CF, alhe, 9, f)
4.2.40 Han sate sig upp och lyssnade → satte (FS, alhe, 9, f)
4.2.41 Hon sate sej på det guldigaste och mjukaste gräset i hela världen. → satte (DV, angu, 9, f)
4.2.42 Redan nästa dag sate vi igång med reparationen av byn.
4.2.43 Då såg jag nåt som jag aldrig har set
4.2.44 Jag tycker att hon skal prata med dom.
4.2.45 brandmännen släkte elden
4.2.46 där nere i det höga gräset låg dalmatinen tess, grisen kalle-knorr... och sav
4.2.47 Ring till Börje sej att vi låst oss ute.
4.2.48 dam tåg och så med sig sina två tigrar
4.2.49 ... att vi åkt ner från berget och åkt så långt att vi inte viste va vi va.
4.2.50 typ när man pratar om grejer som inte man villa att alla ska höra!
4.2.51 ... att Mia inte viste om att mamma var en strandskata.
4.2.52 Och utan att pojken viste om det hoppa grodan ur burken när han låg.
4.2.53 jag viste att han skulle bli lite ledsen då efter som vi hade bestämt.
4.2.54 då viste jag inte vad jag skulle göra
4.2.55 hon kan ju inte skylla på att hon inte märker nåt för det ärr alltid tydligt.

4.3 Verb → Pronoun
4.3.1 mer han jag inte tänka...

4.4 Verb → Adjective
4.4.1 å älgen bara gode
4.4.2 Niklas och Benny kunde inte hala emot
4.4.3 han höll sig i och road
4.4.4 Jag såg på ett TV program där en metod mot mobbing var att satta mobbarn på den stol och andra människor runt den personen och då fråga varför.
Corrections belonging to the immediately preceding entries (corpus, subject, age, sex): satte (DV, idja, 11, f); sett (DV, caan, 9, m); skall (SE, wg02, 10, f); släckte (CF, frma, 9, m); sov (DV, hais, 11, f); säg (CF, idja, 11, f); tog (DV, erja, 9, m); var (DV, alhe, 9, f); vill (SE, wj17, 13, f); visste (CF, hais, 11, f); visste (FS, caan, 9, m); visste (SN, wg06, 10, f); visste (SN, wg20, 10, m); är (SE, wj13, 13, m); hann (DV, idja, 11, f); glodde? (FS, frma, 9, m); hålla (DV, erja, 9, m); ropade? (FS, frma, 9, m); sätta (SE, wj16, 13, f).

4.4.5 Hade Erik vekt en uggla → väckt (FS, alhe, 9, f)

4.5 Verb → Interjection
4.5.1 jag blev jätte besviken för jag trodde att klockan va sådär 7. → var (CF, alhe, 9, f)
4.5.2 men jag va visst jätte ledsen så jag gick ut. → var (CF, alhe, 9, f)
4.5.3 Vi kom tillbaks vid 6 tiden, och då va vi jätte trötta och hungriga. → var (CF, alhe, 9, f)
4.5.4 Klockan va ungefär 12 när jag vaknade, och va får jag se om inte hästen. → var (CF, alhe, 9, f)
4.5.5 Klockan va ungefär 12 när jag vaknade, och va får jag se om inte hästen. → var (CF, alhe, 9, f)
4.5.6 jag sa att det inte va nåt så somna vi om. → var (CF, alhe, 9, f)
4.5.7 alla va överens → var (CF, frma, 9, m)
4.5.8 De va en pojke som hette olof → var (FS, frma, 9, m)
4.5.9 de va en älg → var (FS, frma, 9, m)
4.5.10 Nu va det bara att hoppa ut från fönstret. → var (FS, haic, 11, f)
4.5.11 ... att vi åkt ner från berget och åkt så långt att vi inte viste va vi va. → var (DV, alhe, 9, f)
4.5.12 pappa och jag undra va nycklarna va → var (DV, alhe, 9, f)
4.5.13 Det börjar med att pappa och jag va ute och cyklade på landet... → var (DV, alhe, 9, f)
4.5.14 ... att vi inte va på toppen av berget utan i en by → var (DV, alhe, 9, f)
4.5.15 han va för tung → var (DV, alhe, 9, f)
4.5.16 vi va i en jätte liten och fin by → var (DV, alhe, 9, f)
4.5.17 nej det va en blåmes → var (DV, alhe, 9, f)
4.5.18 Sen sa pappa att vi va tvungna att leta. → var (DV, alhe, 9, f)
4.5.19 om dom va öppna → var (DV, alhe, 9, f)
4.5.20 När jag kom dit va redan pappa där → var (DV, alhe, 9, f)
4.5.21 en port som va helt glittrig → var (DV, alhe, 9, f)
4.5.22 en katt som va svart och len → var (DV, alhe, 9, f)
4.5.23 en platta som nästan va omringad av lava → var (DV, alhe, 9, f)
4.5.24 där va en massa människor som va fastkedjade med tjocka kedjor → var (DV, alhe, 9, f)
4.5.25 där va en massa människor som va fastkedjade med tjocka kedjor → var (DV, alhe, 9, f)
4.5.26 den äldsta som va 80 år berätta att... → var (DV, alhe, 9, f)
4.5.27 den byn vi va i → var (DV, alhe, 9, f)
4.5.28 det va deras by → var (DV, alhe, 9, f)
4.5.29 det va den hemske fula trollkarlen tokig → var (DV, alhe, 9, f)
4.5.30 som tur va gick hästarna i hagen. → var (DV, idja, 11, f)
4.5.31 ... då vill ju han vara med den kompisen som han va med innan. → var (SE, wg12, 10, f)
4.5.32 ... men eftersom det inte va så mycket mobbing så... → var (SE, wj13, 13, m)
4.5.33 Det var i somras när jag, min syster och två andra kompisar va på vårat vanliga ställe... → var (SN, wj06, 13, f)
4.5.34 Vi va kanske inte så bra på det utan vi ramlade ganska ofta. → var (SN, wj07, 13, f)
4.5.35 det kunde ju va att en sjusovare bor där inne → vara (DV, alhe, 9, f)
4.5.36 ... utan det kan även vara att nån kan sparka eller att man få vara enstöring och sitta själv hela tiden eller kanske spotta eller bara kanske va taskiga mot den personen → vara (SE, wj08, 13, f)
4.5.37 ... att försöka va tuff hela tiden (eller?) → vara (SE, wj08, 13, f)
4.5.38 det kan ju va att den som blir mobbad inte uppför sig på rätt sätt, → vara (SE, wj13, 13, m)
4.5.39 dom vill inte va kompis med hon/han. → vara (SE, wj19, 13, m)
4.5.40 Då måste man fråga dom som inte vill va kompis med en vad man gör får fel... → vara (SE, wj19, 13, m)
4.5.41 Och om kompisarna tycker att man är ful och inte vill va med en som är ful så... → vara (SE, wj19, 13, m)
4.5.42 Marianne sa fort farande hur jag kunde va med henne → vara (SN, wg07, 10, f)

4.6 Verb → More than one category
4.6.1 så kommer det att vara svårare att skaffa jobb om dom inte har gott i skolan → gått (SE, wg03, 10, f)
4.6.2 han fick hetta Hubert. → heta (FS, haic, 11, f)
4.6.3 Men pojken är inte så glad för nu måste han hetta en ny glasburk. → hitta (FS, haic, 11, f)
4.6.4 Men sen så dom att det var små grodor. → såg (FS, idja, 11, f)
4.6.5 ...vi hade precis gått förbi skolan när vi så ett gäng på ca tio personer komma emot oss. → såg (SN, wj15, 13, m)
4.6.6 Hela majs fältet vad svart → var (CF, jowe, 9, f)
4.6.7 Oliver bodde i en liten stuga en liten bit i från skogen och vad väldigt intresserad av djur. → var (FS, jowe, 9, f)
4.6.8 Hans älsklings färg vad grön → var (FS, jowe, 9, f)
4.6.9 För han vad mycket trött. → var (FS, jowe, 9, f)
4.6.10 till slut vad han uppe på stocken med stort besvär. → var (FS, jowe, 9, f)
4.6.11 när jag senare vad klar kom grannen och skrek... → var (DV, jowe, 9, f)
4.6.12 För att komma till Strömstad vad de tvungna att åka från Göteborg... och sedan Strömstad. → var (DV, klma, 10, f)
4.6.13 Det var en ganska dålig lärare som inte märkte hans fusklapp han hade i pennfacket eller vad det vad. → var (SE, wj07, 13, f)

5 PARTICIPLE

5.1 Participle → Participle
5.1.1 Erik sprang i väg medan Oliver välte ner det surande bot. → surrande (FS, alhe, 9, f)

6 ADVERB

6.1 Adverb → Noun
6.1.1 snabbt hoppa dom på kamelerna och rusa iväg och red bod till pappa → bort (DV, erja, 9, m)
6.1.2 dam red bod → bort (DV, erja, 9, m)
6.1.3 ingen sov got den natten → gott (CF, frma, 9, m)
6.1.4 Oliver hjälpte till så got han kunde. → gott (FS, alhe, 9, f)
6.1.5 att säga ifrån och förklara ur den utsatta skall uppföra sig. → hur (SE, wj13, 13, m)
6.1.6 När de gick ifrån tjejen som kom så var det väll för att hon inte hjälpte dem med provet → väl (SE, wg08, 10, f)
6.1.7 ...men sen måste dom väll få skuld känslor. → väl (SE, wj04, 13, m)
6.1.8 så kan man väll fortfarande vara kompis med han hon. → väl (SE, wj07, 13, f)
6.1.9 det gick väll ganska bra. → väl (SN, wj08, 13, f)
6.1.10 jag får väll ta av min snowboard. → väl (SN, wj08, 13, f)

6.2 Adverb → Adjective
6.2.1 ... och där fans ett tempel fult med matt. → fullt (DV, erge, 9, f)
6.2.2 Men de var fult med buskar utan för som vi fik rid igenom. → fullt (DV, idja, 11, f)
6.2.3 men det är ju mycket coolare att säga nej tack jag röker inte en att säga ja jag är väl inre feg. → inte (SE, wj12, 13, f)
6.2.4 ny vänta nu kommer hon → nu (DV, hais, 11, f)
6.2.5 ny öppna inte garderoben → nu (DV, hais, 11, f)
6.2.6 Det var rät blåsigt. → rätt (CF, idja, 11, f)
6.2.7 ... men jag va vist jätte ledsen så jag gick ut. → visst (CF, alhe, 9, f)
6.2.8 det började vist brinna → visst (CF, jobe, 10, m)
6.2.9 dom hade vist ungar och där var hans groda. → visst (FS, erge, 9, f)
6.2.10 då får vi Natta över i byn vist. → visst (DV, haic, 11, f)
6.2.11 Och så landade du vist i en möglig ko skit också. → visst (DV, idja, 11, f)

6.3 Adverb → Pronoun
6.3.1 det tog en timme att koma ditt → dit (CF, anhe, 11, m)
6.3.2 Men vart dom en letade hittade dom ingen groda. → än (FS, anhe, 11, m)
6.3.3 men hur han en lockade så kom den inte. → än (FS, erge, 9, f)
6.3.4 Det beror på att den andra har jobbat bättre en den andra den som kollade på honom. → än (SE, wg03, 10, f)
6.3.5 men det kan ju vara andra saker en bara skolan? → än (SE, wg03, 10, f)
6.3.6 men det är ju mycket coolare att säga nej tack jag röker inte en att säga ja jag är väl inre feg. → än (SE, wj12, 13, f)

6.4 Adverb → Verb
6.4.1 förts att vi inte sögs med tromben → först (DV, idja, 11, f)
6.4.2 som jag förts trodde → först (SN, wj16, 13, f)
6.4.3 så har gick det till: → här (DV, hais, 11, f)
6.4.4 är ett sånt problem uppstår försöker man klart hjälpa till. → När (SE, wg07, 10, f)

6.5 Adverb → Interjection
6.5.1 ... att vi åkt ner från berget och åkt så långt att vi inte viste va vi va. → var (DV, alhe, 9, f)
6.5.2 pappa och jag undra va nycklarna va → var (DV, alhe, 9, f)
6.5.3 sen undra han va dom bodde → var (DV, alhe, 9, f)

6.6 Adverb → More than one category
6.6.1 Hunden hade skällt så mycket att geting boet hade ramlat när. → ner (FS, caan, 9, m)

7 PREPOSITION

7.1 Preposition → Verb
7.1.1 Min kompis tänkte hämta hjälp så han hängde sig i viadukten och hoppa ber sprang till närmaste huset och sa att det var en som hade trillat ner och att han skulle ringa ambulansen. → ner (SN, wj05, 13, m)

7.2 Preposition → More than one category
7.2.1 kan vi inte gå nu sa Filippa men darrig röst → med (DV, hais, 11, f)
7.2.2 Man beslöt att börja men marknaderna igen. → med (DV, mawe, 11, f)

8 CONJUNCTION

8.1 Conjunction → Noun
8.1.1 pojken fick nästan inte resa på sig fören en uggla kom. → förrän (FS, haic, 11, f)
8.1.2 Pojken hinner knappt resa sig upp fören en uggla kommer flygande mot honom. → förrän (FS, idja, 11, f)
8.1.3 fören pappa kom in rusande i mitt rum. → förrän (DV, idja, 11, f)
8.1.4 inte fören när jag skulle gå ner märkte jag att jag hade fastnat, → förrän (SN, wg16, 10, f)
8.1.5 män än natt klev grodan upp ur glas burken → men (FS, erja, 9, m)
8.1.6 män plötsligt hoppade hunden ut ur fönstret → men (FS, erja, 9, m)
8.1.7 män då hoppade pojken efter → men (FS, erja, 9, m)
8.1.8 gick vi upp till utgången av tältet mer upptäckte varan och vi blev så rädda → men (DV, alhe, 9, f)
8.1.9 män han hade skrikit så... → men/medan (FS, frma, 9, m)
8.1.10 ... å ställde cyklarna på den utskurna plattan. → och (CF, alhe, 9, f)
8.1.11 Vi bor i samma hus jag och Kamilla å hennes hund. → och (CF, angu, 9, f)
8.1.12 Så vi fick vänta tills pappa kom hem å då skulle jag visa pappa mamma → och (CF, hais, 11, f)
8.1.13 å älgen bara gode → och (FS, frma, 9, m)
8.1.14 å dam två tigrarna följde också med → och (DV, erja, 9, m)

8.2 Conjunction → More than one category
8.2.1 Då måste man fråga dom som inte vill va kompis med en vad man gör får fel... → för (SE, wj19, 13, m)
8.2.2 då skulle vi samlas 11.30 får bussen gick lite senare → för (SN, wg20, 10, m)
8.2.3 vi har så mycket saker så vi kan ha i byn → som(?) (DV, haic, 11, f)

9 INTERJECTION

9.1 Interjection → Adjective
9.1.1 när vi kom in till mig så stod mamma och pappa i dörren och sa gratis till mig när jag kom. → grattis (CF, alhe, 9, f)

10 OTHER
10.1.1 där e huset som brinner → är (CF, erja, 9, m)
10.1.2 nu e nog alla människor ute → är (CF, erja, 9, m)
10.1.3 där e än → är (CF, erja, 9, m)
10.1.4 då e dam ännu närmare ljudet → är (CF, erja, 9, m)
10.1.5 Att bli mobbad e nog det värsta som finns, → är (SE, wj08, 13, f)
10.1.6 Han slog då till mig över kinden så att jag fick ett R. → ärr (SN, wg15, 10, m)

B.3 Segmentation Errors

Errors are categorized by part-of-speech. Each entry is given as: error → correction (corpus, subject, age, sex).

1 NOUN
1.1.1 VI VAR PÅ BORÅS BAD HUS
1.1.2 ... har hunden fått syn på en bi kupa.
1.1.3 Han hoppar upp på bi kupan
1.1.4 ... så att bi kupan börjar att skaka
1.1.5 bi kupan ramlar ner till marken!
1.1.6 då kom det en bi svärm surrande förbi
1.1.7 tillslut välte han ner hela kupan och en hel bi svärm surrade ut.
1.1.8 Efter 5 minuter körde en brand bil in på gården.
1.1.9 Då vi kom till min by. Trillade jag av brand bilen
1.1.10 Men grannen intill ringde brand kåren.
1.1.11 när brand kåren kom hade hela vår ranch brunnit ner till grunden.
1.1.12 brand larmet går
1.1.13 Just när han hörde smällen gick brand larmet på riktigt!
1.1.14 Han rusade ut till brandmännen som inte hade hört smällen och brand larmet.
1.1.15 Han jobbade som brand man
1.1.16 En brand man klättrade upp till oss.
1.1.17 om det fanns någon ledig brand man
1.1.18 jag håller på och utbildar mig till brand man
1.1.19 Petter sa att han tänkte bli Brand man när han blir stor.
1.1.20 En brand man berättade att...
1.1.21 BRAND MANEN
1.1.22 det här var en bra träning för mig sa brand manen
1.1.23 brand menen ryckte ut och släckte elden.
jag ringde till brand stationen Och i morgon är det brand övning där brand övningen skulle hålla till. 1.1.27 vi skulle börja göra i ordning den lilla byn som bestod av 8 hus 6 affärer och ett by hus 1.1.28 Desere jobbade i en djur affär 1.1.29 men se där är ni ju det lilla följet bestående av snutna djur från djur affären. 1.1.30 när det lilla djur följet gått i fyra timmar 1.1.31 Efter några sekunder stod såfus med tungan halvvägs hängande ut i mun i dörr öppningen. 1.1.32 hon lurade i min pojkvän massa elak heter om Linnea. 1.1.33 han hade ett 4 mannatält I sin fik kniv. 1.1.34 Då sprang dom fort till tunneln och fort till skidbacken och Fort till flyg platsen C ORRECTION C ORP S UBJ AGE S EX badhus bikupa bikupan bikupan bikupan bisvärm bisvärm SN FS FS FS FS FS FS wg13 klma klma klma klma alca hais 10 10 10 10 10 11 11 m f f f f f f brandbil CF idja 11 f brandbilen CF jowe 9 f brandkåren brandkåren CF DV jobe idja 10 11 m f brandlarmet brandlarmet CF CF erja klma 9 10 m f brandlarmet CF klma 10 f brandman brandman brandman brandman brandman CF CF CF CF CF erja idja idja idja idja 9 11 11 11 11 m f f f f brandman brandmannen brandmannen CF CF CF jowe erja idja 9 9 11 f m f brandmännen brandstationen brandövning brandövningen byhus CF CF CF CF anhe idja klma klma 11 11 10 10 m f f f DV hais 11 f djuraffär djuraffären DV DV hais hais 11 11 f f djurföljet dörröppningen elakheter DV FS hais hais 11 11 f f SN wg07 10 f fickkniv flygplatsen DV DV alhe erha 9 10 f m Error Corpora E RROR 1.1.35 1.1.36 1.1.37 1.1.38 Jag hör fot steg från trappan frukost klockan ringde jag går ner och ringer i frukost klockan genom att han tappat en jord fläck på fönster karmen. 
1.1.39 Ronja hittade en förbands låda 1.1.40 Men lars fick försäkrings pengarna 1.1.41 1.1.42 1.1.43 1.1.44 1.1.45 1.1.46 1.1.47 1.1.48 1.1.49 1.1.50 1.1.51 1.1.52 1.1.53 1.1.54 1.1.55 1.1.56 1.1.57 1.1.58 1.1.59 1.1.60 1.1.61 1.1.62 1.1.63 1.1.64 1.1.65 1.1.66 1.1.67 1.1.68 1.1.69 1.1.70 1.1.71 1.1.72 1.1.73 1.1.74 Hunden hoppar vid ett geting bo. Geting boet trillar ner på marken. Geting boet går sönder. det var en gips skena som... Nu hade han den i en ganska stor glas burk, på sitt rum. så han tog med sig grodan hem i en glas burk. grodan klev upp ur glas burken. hunden stack in huvudet i glas burken Glas burken som hunden hade på huvudet gick i tusen bitar Oliver innerligt försökte få av sig den glas burken som... Hunden hade fastnat i glas burken och ramlade ner. Pojken och hunden sitter och kollar på grodan i glas burken. När pojken och hunden har somnat kryper grodan ut ur glas burken. Glas burken går sönder. såfus hade letat i glas burken han fick ha på sig glas burken över huvudet. såfus landade med huvudet före och hela glas burken sprack. ... så gick glas burken sönder. dom plockade många kran kvistar och la som täcke här är också en grav sten från 1989. jag satte upp grav stenar efter dom dan efter grävde vi upp deras grav stenar hit ut går det ju bara en grus väg Hästarna saktade av när dom kom ut på en grus väg. vi fortsatte på den lilla grus vägen. grus vägen ledde fram till en övergiven by. Vi följde grus vägen Vi red i genom det stora hålet och kom in på grus vägen vart tionde år måste han ha 5 guld klimpar en hund på 14 hund år trampat på igel kott En dag hade vi en informations dag om mobbing Då kom det upp en jord ekorre han tittade i ett jord hål. 
307 C ORRECTION C ORP S UBJ AGE S EX fotsteg frukostklockan frukostklockan fönsterkarmen förbandslåda försäkringspengarna getingbo getingboet getingboet gipsskena glasburk CF DV DV FS alhe hais hais hais 9 11 11 11 f f f f DV CF mawe erha 11 10 f m FS FS FS SN FS erha erha erha wj05 alca 10 10 10 13 11 m m m m f glasburk glasburken glasburken glasburken FS FS FS FS alhe alca alca alca 9 11 11 11 f f f f glasburken FS alhe 9 f glasburken FS caan 9 m glasburken FS erha 10 m glasburken FS erha 10 m glasburken glasburken glasburken glasburken FS FS FS FS erha hais hais hais 10 11 11 11 m f f f glasburken grankvistar FS DV klma hais 10 11 f f gravsten gravstenar gravstenar grusväg grusväg DV DV DV DV DV hais hais hais idja idja 11 11 11 11 11 f f f f f grusvägen grusvägen grusvägen grusvägen DV DV DV DV idja idja idja idja 11 11 11 11 f f f f guldklimpar hundår igelkott informationsdag DV DV DV SE angu hais hais wj16 9 11 11 13 f f f f jordekorre jordhål FS FS alca alhe 11 9 f f Appendix B. 308 E RROR 1.1.75 1.1.76 1.1.77 1.1.78 1.1.79 1.1.80 det är ju jul afton om 3 dagar Innan jul skulle våran klass ha jul fest. sen var det problem på klass fotot man vill ju vara fin på klass fotot På t ex klass fotot MIN KLASS KAMRAT VILLE INTE HOPPA FRÅN HOPPTORNET 1.1.81 snabbt tog han på sig klä där 1.1.82 Och så landade du visst i en möglig ko skit också 1.1.83 men det finns i alla fall ingen tur med en möglig ko skit. 1.1.84 De hade med sig : ett spritkök, ett tält, och Massa Mat, några kul gevär, och ammunition M.M. 1.1.85 När kvälls daggen kom var vi helt klara 1.1.86 Kvälls daggen hade fallit 1.1.87 det brann på Macintosh vägen 738c 1.1.88 Att få status är kanske det maffia ledarna håller på med. 1.1.89 Hela majs fältet var svart 1.1.90 Vid mat bordet var det en livlig stämma 1.1.91 dom kom in till oss med 2 stora mat kassar. 1.1.92 det var när jag gick i mellan stadiet 1.1.93 Jag satt vid middags bordet tillsammans med mamma och min lillebror Simon. 
1.1.94 där stannade dem och bodde där resten av livet för mobil telefonen räckte inte enda hem. 1.1.95 alla djur rusade ut ur affären upp på mölndals vägen 1.1.96 Han hade fångat en groda när han var i parken vid den stora näckros dammen. 1.1.97 skuggorna föll förundrat på det vita parkett golvet. 1.1.98 En vecka senare så var det en polis patrull som letade efter skol klassen 1.1.99 och precis när en av dem skulle slå till mig så hörde jag polis sirener 1.1.100 Man hämtar då en rast vakt. 1.1.101 följer du med på en rid tur 1.1.102 här står det August rosen gren har lämnat jorden 1.1.103 jag hade fått en sjuk dom 1.1.104 helt plötsligt var jag på sjuk huset. 1.1.105 ... förrän jag vaknade i en sjukhus säng. 1.1.106 jag tog mina saker ner i en sken påse 1.1.107 dom bär massor av sken smycken 1.1.108 Pappa det var du som la den i skrivbords lådan 1.1.109 ...men sen måste dom väl få skuld känslor. 1.1.110 därför är lärarens skyldig het att se till att eleven får hjälp. 1.1.111 Sedan var det ett sov rum med 4 bäddar. 1.1.112 Dem kom med en steg bil och hämtade oss. 
C ORP S UBJ AGE S EX julafton julfest klassfotot klassfotot klassfotot klasskamrat C ORRECTION CF SN SE SE SE SN erge wg02 wg18 wg18 wg19 wg13 9 10 10 10 10 10 f f m m m m kläder koskit FS DV erja idja 9 11 m f koskit DV idja 11 f kulgevär DV jobe 10 m kvällsdaggen kvällsdaggen Macintoshvägen maffialedarna DV DV CF SE hais mawe anhe wj20 11 11 11 13 f f m m majsfältet matbordet matkassar mellanstadiet middagsbordet CF DV CF SN CF jowe idja alhe wj14 mawe 9 11 9 13 11 f f f m f mobiltelefonen DV jobe 10 m Mölndalsvägen DV hais 11 f näckrosdammen parkettgolvet FS alca 11 f FS hais 11 f polispatrull DV alca 11 f polissirener SN wj15 13 m rastvakt ridtur Rosengren SE DV DV wg07 idja hais 10 11 11 f f f sjukdom CF sjukhuset CF sjukhussäng CF skenpåse DV skensmycken DV skrivbordslådan CF skuldkänslor SE skyldighet SE erge erge mawe haic haic erge wj04 wj19 9 9 11 11 11 9 13 13 f f f f f f m m sovrum stegbil mawe jobe 11 10 f m DV CF Error Corpora E RROR 1.1.113 det var ett stort sten hus 1.1.114 Kalle-knorr hade hittat ett stort sten kors 1.1.115 där står ett gult hus med stock rosor slingrande efter väggarna 1.1.116 allt från att förstå en telefon apparat till att förstå en människa. 1.1.117 när de var hemma så tittade de i telefon katalogen 1.1.118 ni får gärna bo hos oss under tid en ni inte har nåt att bo i. 1.1.119 så kom brandbilen och räddade mamma ut genom toalett fönstret. 1.1.120 där bakom några grenar låg någonting ett trä hus 1.1.121 Ett vardags rum med 2 soffor 1 bord och en stor öppenspis 1.1.122 Johan gick in i vardags rummet och satte upp elementet. 1.1.123 hela vardags rummet stod i brand 1.1.124 hans älsklings djur var groda. 1.1.125 Hans älsklings färg vad grön 1.1.126 Och det är nog en överlevnads instinkt. 2 2.1.1 2.1.2 2.1.3 2.1.4 2.1.5 2.1.6 2.1.7 2.1.8 2.1.9 2.1.10 2.1.11 2.1.12 2.1.13 2.1.14 2.1.15 2.1.16 2.1.17 2.1.18 2.1.19 2.1.20 ADJECTIVE/PARTICIPLE Fast pappa hade utrustat alla hus brand säkra. 
2.1.2 där va massa människor som va fast kedjade med tjocka kedjor
2.1.3 Människorna hade haft färg glada dräkter på sig
2.1.4 Tanja sydde glatt färgade kläder åt allihop
2.1.5 Fönstret stod halv öppet
2.1.6 där han låg hjälp lös på marken.
2.1.7 Cristoffer hoppade ner och var jätte arg för att burken gick sönder.
2.1.8 Cristoffer lyfte upp hunden och var fortfarande jätte arg men ...
2.1.9 Ett par horn på en hjort som blev jätte arg.
2.1.10 Bina som var inne i boet blev jätte arga och surrade upp ur boet.
2.1.11 så kanske de blir jätte bra kompisar.
2.1.12 och tänk om den som man skrev av hade skrivit en jätte bra dikt
2.1.13 Det var inte så jätte djupt på den delen av floden som Cristoffer och hunden föll i på.
2.1.14 dom bott i en jätte fin by
2.1.15 Sen hjälpte vi dom att göra om byn till en jätte fin by
2.1.16 Mamma och pappa tyckte det var en jätte fin by
2.1.17 Jag hade ett jätte fint rum.
2.1.18 då blev jag jätte glad
2.1.19 Då blev dom jätte glada.
2.1.20 där man kan äta jätte god picknick

CORRECTION            CORP  SUBJ  AGE  SEX
stenhus               DV    erha  10   m
stenkors              DV    hais  11   f
stockrosor            DV    hais  11   f
telefonapparat        SE    wj20  13   m
telefonkatalogen      CF    alca  11   f
tiden                 DV    idja  11   f
toalettfönstret       CF    hais  11   f
trähus                DV    hais  11   f
vardagsrum            DV    mawe  11   f
vardagsrummet         CF    alca  11   f
vardagsrummet         CF    alca  11   f
älsklingsdjur         FS    jowe   9   f
älsklingsfärg         FS    jowe   9   f
överlevnadsinstinkt   SE    wj20  13   m
brandsäkra            DV    idja  11   f
fastkedjade           DV    alhe   9   f
färgglada             DV    mawe  11   f
glattfärgade          DV    mawe  11   f
halvöppet             FS    hais  11   f
hjälplös              FS    hais  11   f
jättearg              FS    alca  11   f
jättearg              FS    alca  11   f
jättearg              FS    erge   9   f
jättearga             FS    alca  11   f
jättebra              SE    wg16  10   f
jättebra              SE    wg17  10   f
jättedjupt            FS    alca  11   f
jättefin              DV    alhe   9   f
jättefin              DV    alhe   9   f
jättefin              DV    idja  11   f
jättefint             DV    idja  11   f
jätteglad             SN    wg18  10   f
jätteglada            DV    alhe   9   f
jättegod              DV    alhe   9   f

Appendix B.

ERROR
2.1.21 det var helt lila och såg jätte hemskt ut,
2.1.22 pappa och jag tänkte att vi skulle cykla upp på det jätte höga berget för att titta på ut sikten.
2.1.23 pappa gick ut och såg att vi va I en jätte liten och fin by,
2.1.24 Den andra frågan är jätte lätt
2.1.25 vi mulade och kastade jätte många snöbollar på dom
2.1.26 tuni hade jätte ont i knät
2.1.27 Nästa dag när Oliver vaknade blev han jätte rädd för han såg inte grodan i glasburken.
2.1.28 Då blev Oliver jätte rädd.
2.1.29 jag blev jätte rädd
2.1.30 både muffins och Oliver blev jätte rädda.
2.1.31 Det blev jätte struligt med allt möjligt inblandat.
2.1.32 han sade till muffins att vara jätte tyst.
2.1.33 man ser att det är nåt jätte viktigt hon ville berätta.
2.1.34 Med en gång blev jag klar vaken
2.1.35 en platta som nästan va om ringad av lava.
2.1.36 vi slog upp tältet på den spik spetsiga toppen
2.1.37 det var en varm och stjärn klar natt.
2.1.38 En gång blev den hemska pyroman ut kastad ur stan.
2.1.39 Om man blir ut satt för något ...
2.1.40 i vart enda hus var alla saker kvar från 1600 talet
2.1.41 då bar det av i 14 dagar och 14 äventyrs fyllda nätter
2.1.42 då kom dom till en över given by
2.1.43 de kom till en över given by
2.1.44 de kom till en över given by
2.1.45 Det var en över given by.
2.1.46 då för stod vi att det var en över given by
2.1.47 till slut kom dem till en över given By.
2.1.48 vi passerade många över vuxna hus
2.1.49 Oliver fick se ett geting bo och blev hel galen.

3 PRONOUN
3.1.1 hon hade bara drömt allt ihop.
3.1.2 simon låg på sin kudde och hade inte märkt någon ting.
3.1.3 Nu ska jag visa er någon ting
3.1.4 Dom flesta var duktiga på någon ting
3.1.5 för då kan man inte något ting

4 VERB
4.1.1 när jag dog 1978 i cancer återvände jag hit för att fort sätta mitt liv här
4.1.2 Jag tror att killen inte kan för bättra sig själv...
4.1.3 då för stod vi att det var en över given by
4.1.4 medan jag för sökte lyfta upp mig skälv

CORRECTION        CORP  SUBJ  AGE  SEX
jättehemskt       SN    wj03  13   f
jättehöga         DV    alhe   9   f
jätteliten        DV    alhe   9   f
jättelätt         SE    wj03  13   f
jättemånga        SN    wj10  13   m
jätteont          SN    wj03  13   f
jätterädd         FS    jowe   9   f
jätterädd         FS    jowe   9   f
jätterädd         SN    wj03  13   f
jätterädda        FS    jowe   9   f
jättestruligt     SN    wg11  10   f
jättetyst         FS    jowe   9   f
jätteviktigt      CF    alhe   9   f
klarvaken         DV    idja  11   f
omringad          DV    alhe   9   f
spikspetsiga      DV    alhe   9   f
stjärnklar        DV    hais  11   f
utkastad          CF    frma   9   m
utsatt            SE    wj19  13   m
vartenda          DV    hais  11   f
äventyrsfyllda    DV    hais  11   f
övergiven         DV    erge   9   f
övergiven         DV    erha  10   m
övergiven         DV    hais  11   f
övergiven         DV    hais  11   f
övergiven         DV    hais  11   f
övergiven         DV    jobe  10   m
övervuxna         DV    hais  11   f
helgalen          FS    alhe   9   f
alltihop          DV    angu   9   f
någonting         FS    hais  11   f
någonting         DV    hais  11   f
någonting         DV    mawe  11   f
någonting         SE    wg03  10   f
fortsätta         DV    alco   9   f
förbättra         SE    wj03  13   f
förstod           DV    hais  11   f
försökte          SN    wg16  10   f

ERROR
4.1.5 ni för tjänar verkligen mina hem kokta kladdkakor
4.1.6 a Tess min fina gamla hund du på minner mig om någon jag har träffat förut
4.1.7 Han ring de till mig sen och sa samma sak.
4.1.8 Hon under sökte noga hans fot.
CORRECTION     CORP  SUBJ    AGE  SEX
förtjänar      DV    hais    11   f
påminner       DV    hais    11   f
ringde         SN    wg07    10   f
undersökte     DV    mawe    11   f
därefter       CF    hais    11   f
därifrån       FS    hais    11   f
därifrån       SE    wj19    13   m
därifrån       SN    wg13    10   m
därifrån       SN    wj01    13   f
därifrån       SN    wj10    13   m
emot           FS    alhe     9   f
emot           FS    haic    11   f
fortfarande    SN    wg07    10   f
framemot       SN    wj09    13   m
förbi          FS    caan     9   m
förbi          DV    hais    11   f
förbi          SE    wg07    10   f
förut          DV    hais    11   f
förut          DV    idja    11   f
härifrån       CF    idja    11   f
härifrån       DV    angu     9   f
ibland         SE    wj02    13   m
ibland         SE    wj09    13   m
igen           CF    hais    11   f
igen           CF    hais    11   f
igen           SN    wg03    10   f
igen           SN    wg03    10   f
igenom         FS    erha    10   m
igenom         DV    erge     9   f
igenom         DV    idja    11   f
igenom         DV    idja    11   f
ihop           DV    erha    10   m
ihop           DV    erha    10   m
ihop           DV    erja     9   m
iväg           FS    angu09   9   f
iväg           FS    anhe    11   m
också          DV    angu     9   f
också          DV    erja     9   m
omkring        DV    hais    11   f

5 ADVERB
5.1.1 Där efter dog mamma på sjukhuset.
5.1.2 men han tog sig snabbt där i från.
5.1.3 när man bara går där ifrån
5.1.4 SEN GICK VI DÄR IFRÅN
5.1.5 Jag ställde mig på en sten och efter ett tag så ville jag gå där ifrån,
5.1.6 så till slut så sprang dom där ifrån
5.1.7 Bina som bodde i bot rusade i mot Oliver
5.1.8 han råkade bara kom i mot getingboet.
5.1.9 Marianne sa fort farande hur jag kunde va med henne
5.1.10 Alla såg fram emot att åka
5.1.11 Då kom hunden för bi med getingar
5.1.12 människor som går för bi kan höra oss.
5.1.13 Eller när man går för bi varandra
5.1.14 vi hade aldrig fått smaka plättar sylt och kola för ut
5.1.15 Inte konstigt att vi inte har upptäckt den här ingången för ut
5.1.16 jag som alltid tyckt det var så högt här i från.
5.1.17 stick här i från annars är du dödens
5.1.18 I bland kan allt vara jobbigt och hemskt
5.1.19 Men i bland kan det vara så att dom tror att dom är coola
5.1.20 jag var tvungen att berätta hela historien om i gen.
5.1.21 vad var det han hete nu i gen?
5.1.22 jag vill bli kompis med henne i gen
5.1.23 och så ville Johanna bli kompis i gen.
5.1.24 Pojken och hunden söker i genom rummet.
5.1.25 morfar och dom andra letar och letar i genom staden
5.1.26 Vi red i genom det stora hålet
5.1.27 Vi red i genom byn
5.1.28 när Gunnar öppna dörren till det stora huset rasa det i hop
5.1.29 snart rasa hela byn i hop
5.1.30 snabbt samla han i hop alla sina jägare
5.1.31 Rådjuret sprang i väg med honom.
5.1.32 Han sprang i vägg och klättrade upp på en kulle.
5.1.33 Lena såg en gammal man sitta i ett tält av guld intill sov säckarna som och så var av guld.
5.1.34 dam tåg och så med sig sina två tigrar
5.1.35 undulater flög om kring
5.1.36 när de såg sig om kring
5.1.37 han trillar om kull.
5.1.38 Han ropade igenom fönstret men inget kvack kom till baka.
5.1.39 vi gick till baka igen
5.1.40 svarta manen sprang sin väg och kom aldrig mer till baka.
5.1.41 Efter det gick vi till baka
5.1.42 ... ska man lämna till baka den.
5.1.43 Sedan slumrade såfus, grodan och simon djupt till sammans.
5.1.44 Men de var fult med buskar utan för som vi fick rid igenom.
5.1.45 en kille blev utan för,
5.1.46 men olof var glad en då
5.1.47 men om man inte får vara med än då
5.1.48 Erik letade över allt
5.1.49 Han letade över allt i sitt rum
5.1.50 Han letade under sängen under pallen i tofflorna bland kläderna ja över allt
5.1.51 Han letade över allt
5.1.52 Desere letade över allt
5.1.53 jag har letat över allt

6 PREPOSITION
6.1.1 fram för mig stod världens finaste häst.
6.1.2 Vi gick längs vägen tills vi såg ett stort hus som låg en bit utan för själva stan

7 CONJUNCTION
7.1.1 Efter som han frös och inte såg sig för snubblade han på en sten.
7.1.2 ... och efter som det inte fanns nåt lock på burken...
7.1.3 men jag kunde inte säga det till honom för att jag visste att han skulle bli lite ledsen då efter som vi hade bestämt.

8 RUN-ONS
8.1.1 Nathalie berättade alltför mig
8.1.2 därbakom fanns 2 grodor.
8.1.3 och tillslut stod vi alla på marken
8.1.4 tillslut välte han ner hela kupan
8.1.5 tillslut kom de fram till en gärdsgård
8.1.6 men tillslut tyckte de också att ...
8.1.7 tillslut blev dam två kamelerna så trötta...
8.1.8 tillslut kom de fram till en vacker plats
8.1.9 tillslut sa pappa
8.1.10 Tillslut kom dom upp mot sidan av oss och sa,
8.1.11 Tillslut kom det en massa vuxna som...
8.1.12 Vi åkte tillslut på bio.
8.1.13 mobbing råkar väldigt många utför.

CORRECTION    CORP  SUBJ  AGE  SEX
omkring       DV    jowe   9   f
omkull        FS    klma  10   f
tillbaka      FS    caan   9   m
tillbaka      DV    alhe   9   f
tillbaka      DV    angu   9   f
tillbaka      DV    idja  11   f
tillbaka      SE    wg17  10   f
tillsammans   FS    hais  11   f
utanför       DV    idja  11   f
utanför       SE    wj11  13   f
ändå          FS    frma   9   m
ändå          SE    wj14  13   m
överallt      FS    alhe   9   f
överallt      FS    jobe  10   m
överallt      FS    jowe   9   f
överallt      FS    mawe  11   f
överallt      DV    hais  11   f
överallt      DV    hais  11   f
framför       CF    alhe   9   f
utanför       DV    idja  11   f
eftersom      DV    mawe  11   f
eftersom      FS    alhe   9   f
eftersom      SN    wg06  10   f
allt för      SN    wg11  10   f
där bakom     FS    jowe   9   f
till slut     CF    idja  11   f
till slut     FS    hais  11   f
till slut     DV    alca  11   f
till slut     DV    alca  11   f
till slut     DV    erja   9   m
till slut     DV    hila  10   f
till slut     DV    idja  11   f
till slut     SN    wj04  13   m
till slut     SN    wj04  13   m
till slut     SN    wj04  13   m
ut för        SE    wj05  13   m

Appendix C
SUC Tagset

The set of tags used was taken from the Stockholm Umeå Corpus (SUC):

Code  Category
AB    Adverb
DL    Delimiter (Punctuation)
DT    Determiner
HA    Interrogative/Relative Adverb
HD    Interrogative/Relative Determiner
HP    Interrogative/Relative Pronoun
HS    Interrogative/Relative Possessive
IE    Infinitive Marker
IN    Interjection
JJ    Adjective
KN    Conjunction
NN    Noun
PC    Participle
PL    Particle
PM    Proper Noun
PN    Pronoun
PP    Preposition
PS    Possessive
RG    Cardinal Number
RO    Ordinal Number
SN    Subjunction
UO    Foreign Word
VB    Verb

Code     Feature          Type
UTR      Common (Utrum)   Gender
NEU      Neuter           Gender
MAS      Masculine        Gender
UTR/NEU  Underspecified   Gender
-        Unspecified      Gender
SIN      Singular         Number
PLU      Plural           Number
SIN/PLU  Underspecified   Number
-        Unspecified      Number
IND      Indefinite       Definiteness
DEF      Definite         Definiteness
IND/DEF  Underspecified   Definiteness
-        Unspecified      Definiteness
NOM      Nominative       Case
GEN      Genitive         Case
SMS      Compound         Case
-        Unspecified      Case
POS      Positive         Degree
KOM      Comparative      Degree
SUV      Superlative      Degree
SUB      Subject          Pronoun Form
OBJ      Object           Pronoun Form
SUB/OBJ  Underspecified   Pronoun Form
PRS      Present          Verb
PRT      Preterite        Verb
INF      Infinitive       Verb
SUP      Supinum          Verb
IMP      Imperative       Verb
AKT      Active           Voice
SFO      S form           Voice
KON      Subjunctive      Mood
PRF      Perfect          Perfect
AN       Abbreviation     Form

Appendix D
Implementation

D.1 Broad Grammar

#### Declare categories
define PPheadPhr ["<ppHead>" ~$"<ppHead>" "</ppHead>"];
define VPheadPhr ["<vpHead>" ~$"<vpHead>" "</vpHead>"];

define APPhr ["<ap>" ~$"<ap>" "</ap>"];
define NPPhr ["<np>" ~$"<np>" "</np>"];
define PPPhr ["<pp>" ~$"<pp>" "</pp>"];
define VPPhr ["<vp>" ~$"<vp>" "</vp>"];

#### Head rules
define AP [(Adv) Adj+];
define PPhead [Prep];
define VPhead [[[Adv* Verb] | [Verb Adv*]] Verb* (PNDef & PNNeu)];

#### Complement rules
define NP [[[(Det | Det2 | NGen) (Num) (APPhr) (Noun)] & ?+] | Pron];
define PP [PPheadPhr NPPhr];
define VP [VPheadPhr (NPPhr) (NPPhr) (NPPhr) PPPhr*];

#### Verb clusters
define VC [[[Verb Adv*] / NPTags] (NPPhr) [[Adv* Verb (Verb)] / NPTags]];

D.2 Narrow Grammar: Noun Phrases

############### Narrow grammar for APs:
define APDef ["<ap>" (Adv) AdjDef+ "</ap>"];
define APInd ["<ap>" (Adv) AdjInd+ "</ap>"];
define APSg  ["<ap>" (Adv) AdjSg+  "</ap>"];
define APPl  ["<ap>" (Adv) AdjPl+  "</ap>"];
define APNeu ["<ap>" (Adv) AdjNeu+ "</ap>"];
define APUtr ["<ap>" (Adv) AdjUtr+ "</ap>"];
define APMas ["<ap>" (Adv) AdjMas+ "</ap>"];
############### Narrow grammar for NPs:

###### NPs consisting of a single noun
define NPDef1 [(Num) [NDef | PNoun]];
define NPInd1 [(Num) NInd];
define NPSg1  [(NumO) NSg | [NPl & NInd] | PNoun];
define NPPl1  [(NumC) [NPl | PNoun]];
define NPNeu1 [(Num) [NNeu | [NUtr & NInd] | PNoun]];
define NPUtr1 [(Num) [[NUtr & NPl] | [NUtr & NDef] | PNoun]];

###### NPs consisting of a determiner (or a noun in genitive) and a noun
define NPDef2 [DetDef (DetAdv) (Num) NDef] | [[DetMixed | NGen] (Num) NInd];
define NPInd2 [DetInd (Num) NInd];
define NPSg2  [[DetSg (DetAdv) | NGen] (NumO) NSg];
define NPPl2  [[DetPl (DetAdv) | NGen] (NumC) NPl];
define NPNeu2 [[DetNeu (DetAdv) | NGen] (Num) NNeu];
define NPUtr2 [[DetUtr (DetAdv) | NGen] (Num) NUtr];

###### NPs consisting of [Det (AP) N]
define NPDef3 [DetDef (DetAdv) (Num) (APDef) NDef] | [[DetMixed | NGen] (Num) (APDef) NInd];
define NPInd3 [DetInd (NumO) (APInd) NInd];
define NPSg3  [[DetSg (DetAdv) | NGen] (NumO) (APSg) NSg];
define NPPl3  [[DetPl (DetAdv) | NGen] (NumC) (APPl) NPl];
#define NPNeu3 [[DetNeu (DetAdv) | NGen] (Num) (APNeu) NNeu];
define NPNeu3 [[DetNeu (DetAdv) | NGen] (Num) [[(APNeu) NNeu] | [(APMas) NMas]]];
define NPUtr3 [[DetUtr (DetAdv) | NGen] (Num) (APUtr) NUtr];

###### NPs consisting of [Adj+ N]
# optional numbers only in NPInd and NPPl
define NPDef4 [APDef NDef];
define NPInd4 [(Num) APInd NInd];
define NPSg4  [APSg NSg];
define NPPl4  [(Num) APPl NPl];
define NPNeu4 [APNeu NNeu];
define NPUtr4 [APUtr NUtr];

###### NPs consisting of a single pronoun
define NPDef5 [PNDef];
define NPInd5 [PNInd];
define NPSg5  [PNSg];
define NPPl5  [PNPl];
define NPNeu5 [PNNeu];
define NPUtr5 [PNUtr];

###### NPs consisting of a single determiner
define NPDef6 [DetDef (DetAdv)];
define NPInd6 [DetInd];
define NPSg6  [DetSg (DetAdv)];
define NPPl6  [DetPl (DetAdv)];
define NPNeu6 [DetNeu (DetAdv)];
define NPUtr6 [DetUtr (DetAdv)];

###### NPs consisting of adjectives
define NPDef7 [APDef+];
define NPInd7 [APInd+];
define NPSg7  [APSg+];
define NPPl7  [APPl+];
define NPNeu7 [APNeu+];
define NPUtr7 [APUtr+];

###### NPs consisting of a single determiner and adjectives
define NPDef8 [DetDef APDef];
define NPInd8 [DetInd APInd];
define NPSg8  [DetSg APSg];
define NPPl8  [DetPl APPl];
define NPNeu8 [DetNeu APNeu];
define NPUtr8 [DetUtr APUtr];

###### NPs consisting of number as the main word
define NPDef9 [(DetDef) NumO];
define NPInd9 [Num];
define NPSg9  [Num];
define NPPl9  [Num];
define NPNeu9 [Num];
define NPUtr9 [Num];

###### NPs that meet definiteness agreement
### Definite NPs
define NPDef [NPDef1 | NPDef2 | NPDef3 | NPDef4 | NPDef5 | NPDef6 | NPDef7 | NPDef8 | NPDef9];
### Indefinite NPs
define NPInd [NPInd1 | NPInd2 | NPInd3 | NPInd4 | NPInd5 | NPInd6 | NPInd7 | NPInd8 | NPInd9];
define NPDefs [NPDef | NPInd];

###### NPs that meet number agreement
### Singular NPs
define NPSg [NPSg1 | NPSg2 | NPSg3 | NPSg4 | NPSg5 | NPSg6 | NPSg7 | NPSg8 | NPSg9];
### Plural NPs
define NPPl [NPPl1 | NPPl2 | NPPl3 | NPPl4 | NPPl5 | NPPl6 | NPPl7 | NPPl8 | NPPl9];
define NPNum [NPSg | NPPl];

###### NPs that meet gender agreement
### Utrum NPs
define NPUtr [NPUtr1 | NPUtr2 | NPUtr3 | NPUtr4 | NPUtr5 | NPUtr6 | NPUtr7 | NPUtr8 | NPUtr9];
### Neutrum NPs
define NPNeu [NPNeu1 | NPNeu2 | NPNeu3 | NPNeu4 | NPNeu5 | NPNeu6 | NPNeu7 | NPNeu8 | NPNeu9];
define NPGen [NPNeu | NPUtr];
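The narrow NP grammars above split agreement into three feature dimensions (definiteness, number, gender), each accepting only NPs whose words share a value for that feature. The same idea can be sketched in Python by giving each word a set of allowed values per feature and intersecting across the phrase; this is a minimal illustration, not part of FiniteCheck, and the toy lexicon entries below are assumptions (the feature codes follow the SUC tagset).

```python
# Minimal sketch (not part of FiniteCheck): agreement as feature intersection.
# Each word carries the set of values it allows for a feature; a phrase
# agrees on that feature if some value is shared by all words in it.

def agrees(words, feature):
    """words: list of dicts mapping a feature name to a set of allowed values."""
    shared = None
    for w in words:
        vals = w.get(feature)
        if vals is None:            # word underspecified for this feature
            continue
        shared = vals if shared is None else shared & vals
        if not shared:
            return False
    return True

# Illustrative entries for a gender-clashing NP like "ett liten bil"
# (neuter determiner, common-gender adjective and noun).
ett   = {"gender": {"NEU"}, "number": {"SIN"}}
liten = {"gender": {"UTR"}, "number": {"SIN"}}
bil   = {"gender": {"UTR"}, "number": {"SIN"}}

print(agrees([ett, liten, bil], "gender"))   # False: NEU vs UTR clash
print(agrees([ett, liten, bil], "number"))   # True: all singular
```

Skipping underspecified words mirrors the grammar's use of optional and mixed categories (e.g. `DetMixed`, `SIN/PLU` tags), which must not block agreement.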
########## Partitive NPs
define NPPart [[Det | Num] PPart NP];

define NPPartDef [[Det | Num] PPart NPDef];
define NPPartInd [[Det | Num] PPart NPDef];
define NPPartSg  [[DetSg | Num] PPart NPPl];
define NPPartPl  [[DetPl | Num] PPart NPPl];
define NPPartNeu [[DetNeu | Num] PPart NPNeu];
define NPPartUtr [[DetUtr | Num] PPart NPUtr];

define NPPartDefs [NPPartDef | NPPartInd];
define NPPartNum  [NPPartSg | NPPartPl];
define NPPartGen  [NPPartNeu | NPPartUtr];

########## NPs followed by relative subclause
define SelectNPRel [
  "<np>" -> "<NPRel>" || _ DetDef ~$"<np>" "</np>" (" ") {som} Tag*];

D.3 Narrow Grammar: Verb Phrases

#### Infinitive VPs
# select Infinitive VPs
define SelectInfVP ["<vpHead>" -> "<vpHeadInf>" || InfMark "<vp>" _ ];
# Infinitive VP
define VPInf [Adv* (ModInf) VerbInf Adv* (NPPhr)];

#### Tensed verb first
define VPFinite [Adv* VerbTensed ?*];

#### Verb Clusters:
# select VCs
define SelectVC [VC @-> "<vc>" ... "</vc>"];

define VC1 [[[Mod | INFVerb] / NPTags] (NPPhr) [[Adv* VerbInf] / NPTags]];
define VC2 [[Mod / NPTags] (NPPhr) [[Adv* ModInf VerbInf] / NPTags]];
define VC3 [[Mod / NPTags] (NPPhr) [[Adv* PerfInf VerbSup] / NPTags]];
define VC4 [[Perf / NPTags] (NPPhr) [[Adv* VerbSup] / NPTags]];
define VC5 [[Perf / NPTags] (NPPhr) [[Adv* ModSup VerbInf] / NPTags]];
define VCgram [VC1 | VC2 | VC3 | VC4 | VC5];

### Coordinated VPs:
define SelectVPCoord [
  "<vpHead>" -> "<vpHeadCoord>" || ["<vpHeadInf>" | "</vc>"] ~$"<vpHead>" ~$"<vp>"
  [{eller} | {och}] Tag* (" ") "<vp>" _ ];

#** ATT-VPs that do not require infinitive
define SelectATTFinite [
  "<vpHead>" -> "<vpHeadATTFinite>" ||
  [[[[{sa} Tag+] | [[{för} Tag+] / NPTags]] ("</vpHead></vp>")] |
   [[{tänkte} Tag+] [[NPPhr "</vpHead></vp>"] | ["</vpHead>" NPPhr "</vp>"]]]]
  InfMark "<vp>" _ ];

### Supine VPs
define SelectSupVP ["<vpHead>" -> "<vpHeadSup>" || _ VerbSup "</vpHead>"];

D.4 Parser

###### Mark head phrases (lexical prefix)
define markPPhead [PPhead @->
"<ppHead>" ... "</ppHead>"]; markVPhead [VPhead @-> "<vpHead>" ... "</vpHead>"]; markAP [AP @-> "<ap>" ... "</ap>" ]; ###### define define define Mark phrases with complements markNP [NP @-> "<np>" ... "</np>" ]; markPP [PP @-> "<pp>" ... "</pp>" ]; markVP [VP @-> "<vp>" ... "</vp>" ]; ###### define define define Composing parsers parse1 [markVPhead .o. markPPhead .o. markAP]; parse2 [markNP]; parse3 [markPP .o. markVP]; D.5 Filtering ################# Filtering Parsing Results ### Possessive NPs define adjustNPGen [ 0 -> "<vpHead>" || NGen "</np><vpHead>" NPPhr _,, "</np><vpHead><np>" -> 0 || NGen _ ˜$"<np>" </np>"]; ### Adjectives define adjustNPAdj [ "</np><vpHead><np>" -> 0 || Det _ APPhr "</np></vpHead>" NPPhr ,, "</np></vpHead><np>" -> 0 || Det "</np><vpHead><np>" APPhr _]; ### Adjective form, i.e. remove plural tags if singular NP define removePluralTagsNPSg [ TagPLU -> 0 || DetSg "<ap>" Adj _ ˜$"</np>" "</np>"]; ### Partitive NPs define adjustNPPart [ Appendix D. 320 "</np><ppHead>" -> 0 || _ PPart "</ppHead><np>",, "</ppHead><np>" -> 0 || "</np><ppHead>" PPart _]; ### Complex VCs stretched over two vpHeads: define adjustVC [ "</vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] _ NPPhr VPheadPhr,, "</vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] NPPhr _ VPheadPhr,, "<vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags]] "</vpHead>" NPPhr _ ˜$"<vpHead>" "</vpHead>",, "<vpHead>" -> 0 || [[Adv* VBAux Adv*] / VCTags]] NPPhr "</vpHead>" _ ˜$"<vpHead>" "</vpHead>" ]; ### VCs with two copula or copula and an adjective: define SelectVCCopula [ "<vc>" -> "<vcCopula>" || _ [CopVerb / NPTags] ˜$"<vc>" "</vc>"]; ################# Removing Parsing Errors ### not complete PPs, i.e. 
# ppHeads without a following NP
define errorPPhead [
  "<ppHead>" -> 0 || \["<pp>"] _ ,,
  "</ppHead>" -> 0 || _ \["<np>"]];

### empty VPHead
define errorVPHead ["<vp><vpHead></vpHead></vp>" -> 0];

D.6 Error Finder

######### Finding grammatical errors (Error marking)

###### NPs
# Define NP-errors
define npDefError ["<np>" [NP - NPDefs] "</np>"];
define npNumError ["<np>" [NP - NPNum] "</np>"];
define npGenError ["<np>" [NP - NPGen] "</np>"];

# Mark NP-errors
define markNPDefError [npDefError -> "<Error definiteness>" ... "</Error>"];
define markNPNumError [npNumError -> "<Error number>" ... "</Error>"];
define markNPGenError [npGenError -> "<Error gender>" ... "</Error>"];

# Define NPPart-errors
define NPPartDefError ["<NPPart>" [NPPart - NPPartDefs] "</np>"];
define NPPartNumError ["<NPPart>" [NPPart - NPPartNum] "</np>"];
define NPPartGenError ["<NPPart>" [NPPart - NPPartGen] "</np>"];

# Mark NPPart-errors
define markNPPartDefError [NPPartDefError -> "<Error definiteness NPPart>" ... "</Error>"];
define markNPPartNumError [NPPartNumError -> "<Error number NPPart>" ... "</Error>"];
define markNPPartGenError [NPPartGenError -> "<Error gender NPPart>" ... "</Error>"];

###### VPs
# Define errors in VPs
define vpFiniteError ["<vpHead>" [VPhead - VPFinite] "</vpHead>"];
define vpInfError ["<vpHeadInf>" [VPhead - VPInf] "</vpHead>"];
define VCerror ["<vc>" [VC - VCgram] "</vc>"];

# Mark VP-errors
define markFiniteError [vpFiniteError -> "<Error finite verb>" ... "</Error>"];
define markInfError [vpInfError -> "<Error infinitive verb>" ... "</Error>"];
define markVCerror [VCerror -> "<Error verb after Vaux>" ... "</Error>"];
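Every error definition above follows the same pattern: subtract the narrow, agreement-checking grammar from the broad one (e.g. `NP - NPDefs`), and mark whatever phrases remain. The following toy Python illustration shrinks that subtraction to finite sets of strings, so plain set difference stands in for finite-state subtraction; the four-word vocabulary and the two "grammars" are assumptions for illustration, not the real FiniteCheck grammars.

```python
# Toy illustration of the subtraction idea behind the error finder:
# Broad - Narrow = the error language. Here regular languages are shrunk
# to finite string sets, so Python's set difference plays the role of
# finite-state subtraction.
from itertools import product

gender = {"en": "UTR", "ett": "NEU", "bil": "UTR", "hus": "NEU"}
dets, nouns = ["en", "ett"], ["bil", "hus"]

# Broad grammar: any determiner + noun parses as an NP.
broad = {f"{d} {n}" for d, n in product(dets, nouns)}

# Narrow grammar: determiner and noun must agree in gender.
narrow = {np for np in broad
          if gender[np.split()[0]] == gender[np.split()[1]]}

# Phrases in the difference get an error mark, as in markNPGenError.
for np in sorted(broad - narrow):
    print(f"<Error gender> {np} </Error>")
# prints:
# <Error gender> en hus </Error>
# <Error gender> ett bil </Error>
```

Note that both "grammars" here are positive descriptions of Swedish NPs; nothing in the subtraction requires predicting what errors children will make, which is exactly the advantage the thesis claims for the technique.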