NoSta-D Chat

Chat-Dependencies
Dependency relations in the non-standard corpus NoSta-D
Marc Reznicek†*, Stefanie Dipper†,
Anke Lüdeling*, Burkhard Dietterle*
† Ruhr-Universität Bochum, * Humboldt-Universität zu Berlin
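WS Verarbeitung und Annotation von Sprachdaten aus Genres internetbasierter Kommunikation (Workshop on the processing and annotation of language data from genres of internet-based communication)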
23.09.2013 - Darmstadt
Research questions (non-standard)
• Are syntactic structures that are produced online similar independent of modality?
  → Chat vs. spoken language
• Do learner texts show the same variability with respect to word order as historical texts?
  → Learner texts vs. diachronic texts
• Is literary prose more complex than newspaper text?
  → Literary prose vs. newspaper
WS IBK: Chat-Dependenzen – GSCL 23.09.2013 - Darmstadt
Overview
• Chat as a non-standard variety
• Dependency model
• Normalization
• Guidelines for dependency annotation
• Chat syntax
• Open issues
Clarin-D Curation Project II
Clarin F-AG 7 - Curation project (KP2):
Linguistic annotation of non-standard varieties: guidelines and "best practices"
• Annotation categories, guidelines, and automatic tools are based on newspaper texts
• Pilot project: Extension of existing resources for
  • 5 non-standard varieties
  • 3 types of annotations
    • dependency analysis
    • named entities
    • coreference
Data: NoSta-D
Non-standard variety corpus - Deutsch (Dipper et al. to appear)
Learner essays: Falko, 6,762 tokens
• word order deviation
• creative word formation
• non-standard filling of syntactic slots
• divergent morphological marking
Spoken map task: BeMaTac, 6,731 tokens
• repetition
• auto correction
• anacoluthon
• online argument development
Newspaper: TüBa-DZ, 7,294 tokens
• standard variety
Literary prose: Kafka – Der Prozeß, 5,000 tokens
• double argument selection
• complex parenthesis
Chat protocols: DCK – Plauderchat, 6,664 tokens
• misspelling
• inflective (Vend)
• asterisk expression
• interjection
• emoticon
• concatenation
Historical text: DDB & Anselm, 2,348 + 4,705 tokens
• no sentence boundary marking
• rather free word order
• no standardized spelling
https://www.linguistik.hu-berlin.de/institut/professuren/korpuslinguistik/forschung/clarin-d
Data: Chat in NoSta-D
Chat protocols: DCK – Plauderchat, 6,664 tokens
Excerpt:
160 TomcatMJ   eh!*rupf,zerr,reiss,mich losmach*
161 quaki      hmm der paris war ein ganz schön dämlcher
162 Emon       ah soo
163 TomcatMJ   *nicht festgebunden sein mag*
164 Emon       welches ezugnis? klärt mich mal jemand über den sachverhalt auf?
165 quaki      *nagut50cmlauflaufleine*
166 Erdbeere$  thor
167 Thor...    tach erdbeere
168 Erdbeere$  hello
169 marc30     wie großzügig quaki manchmal ist...
170 Emon       *wart*
Chat phenomena:
• inflective (Vend)
• asterisk expression
• @ addressing
• emoticon
• fragment
• misspelling
• interjection
• concatenation
• filled pause
http://www.chatkorpus.tu-dortmund.de/files/releasehtml/html-korpus/unicum_21-02-2003_1.html
Dependency model
Dependency parsing for German "reaches an
accuracy […] better than the best constituent
analysis including grammatical functions."
(Kübler & Prokic 2006)
Labels = grammatical functions
Edges = dependencies
Normalization
Normalization: Motivation
Grammatical functions in the sentence depend mostly on the main verb.
→ Fragments are difficult to model.
TiGer Guidelines:
Bei verblosen Sätzen, die v.a. in Überschriften und Titeln erscheinen, sollte man den Satz in Gedanken sinnvoll ergänzen und ihn dann ganz normal annotieren.
(Albert et al. 2003:72)
For verbless sentences, which appear mainly in headlines and titles, one should mentally complete the sentence in a sensible way and then annotate it as usual.
Normalization: Motivation
Normalization explicitly inserts a prototypical main verb with matching argument structure.
→ Motivation for fragment functions
[Figure: dependency trees of the normalization and of the original, each attached to ROOT]
Normalization: two perspectives
Two objectives, two representations
❶ Computational approach:
• Normalization = minimal preprocessing step to facilitate further processing
→ What performance can be achieved by automatic annotation tools?
❷ Variationist approach (Lüdeling 2008, Reznicek et al. 2013):
• Normalization = index to classify similar phenomena in the corpus
→ Which linguistic structures vary in different varieties?
Dependency annotation
Guidelines for annotation of non-standard dependencies
1) Take guidelines that fully describe structures in a large newspaper corpus of German:
→ TiGer (Albert et al. 2003)
Problem: Constituents
(Albert et al. 2003:9)
2) Give human annotators a translation of TiGer constituent trees into dependencies:
NN → HEAD unless the head is PIS
ADJA → HEAD unless the head is PIS or NN
ART → DET
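The three rules above can be read as a priority order for choosing the head of a flat TiGer phrase. The following is a minimal sketch of that reading, not the project's conversion procedure: it assumes that PIS outranks NN, that NN outranks ADJA, and that the first matching token wins. The names `HEAD_PRIORITY`, `pick_head`, and `label_phrase` are hypothetical.

```python
# Priority order implied by the three rules (an interpretation, not from the slides).
HEAD_PRIORITY = ["PIS", "NN", "ADJA"]


def pick_head(tags):
    """Return the index of the token chosen as phrase head, or None."""
    for tag in HEAD_PRIORITY:
        if tag in tags:
            return tags.index(tag)  # first token carrying that tag
    return None


def label_phrase(tags):
    """Map the POS tags of one flat phrase to dependency labels.

    Tokens not covered by the three rules get None, i.e. they are
    left to other parts of the guidelines.
    """
    head = pick_head(tags)
    labels = []
    for i, tag in enumerate(tags):
        if i == head:
            labels.append("HEAD")
        elif tag == "ART":
            labels.append("DET")
        else:
            labels.append(None)
    return labels


# ART ADJA NN: the noun is the head, the article becomes DET.
print(label_phrase(["ART", "ADJA", "NN"]))
# ART ADJA: with no NN or PIS in the phrase, the adjective is the head.
print(label_phrase(["ART", "ADJA"]))
```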
3) Introduce new labels for structures not described by TiGer:
Cross classes:
• X.     Indirect dependencies
• C.     Coordination
• COR.   Autocorrections
• DR.    Direct speech
Chat structures:
• SINFL  Inflectives
• @      Addressing
Annotation with WebAnno
https://clarin.ukp.informatik.tu-darmstadt.de
Chat syntax
• Ellipsis
• @ addressing
• Concatenation
Chat: ellipses
Normalization also makes explicit all missing governors!
Word forms are attached to:
1) Governor → gram. func.
2) Next higher governor → X + gram. func. of missing governor
3) Conjunction → C + gram. func.
4) Corrected branch → COR + gram. func.
5) Segment root → gram. func.
[Figure: dependency trees of the normalization (Norm.) and of the original (Orig.)]
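The five attachment cases can be sketched as a small lookup that builds the edge label from the attachment site and the grammatical function. This is only an illustration: the dictionary keys, the function name `edge_label`, and the "." separator (suggested by the label names "X.", "C.", "COR.") are assumptions.

```python
# Prefix per attachment site, following the five cases on the slide.
# Keys and the default separator are assumptions made for this sketch.
PREFIXES = {
    "governor": "",
    "next_higher_governor": "X",
    "conjunction": "C",
    "corrected_branch": "COR",
    "segment_root": "",
}


def edge_label(attachment, gram_func, sep="."):
    """Combine the cross-class prefix (if any) with a grammatical function.

    For "next_higher_governor", gram_func is the function of the
    missing governor, as stated on the slide.
    """
    prefix = PREFIXES[attachment]
    return f"{prefix}{sep}{gram_func}" if prefix else gram_func


print(edge_label("governor", "SUBJ"))
print(edge_label("next_higher_governor", "OBJA"))
```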
Chat: ellipses
Word forms are attached to:
2) Next higher governor → X + gram. func. of missing governor
[Figure: dependency trees of the normalization (Norm.) and of the original (Orig.)]
Chat: addressing
mit dem ast in der hand im teich am rumsitzen@stoeps
with the branch in the hand in-the lake sitting@stoeps
[Figure: dependency tree. The message "mit dem ast in der hand im teich am rumsitzen" carries the label XX; the addressee "@ stoeps" is shown with the labels @ and PN.]
@-attached arguments are of variable type:
Modification: MOD
Sentence root: S
Inflective root: SINFL
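Separating the addressee from the message presupposes that a token such as "rumsitzen@stoeps" is split at the "@" sign. A minimal sketch of that split follows; the function name `split_addressing` and the decision to split at the last "@" are assumptions, not part of the guidelines.

```python
def split_addressing(token):
    """Split a chat token at its last "@" into word, "@", and addressee.

    Tokens without "@", or with nothing after it, are returned unchanged.
    """
    word, sep, addressee = token.rpartition("@")
    if not sep or not addressee:
        return [token]
    # A bare "@name" has no word part in front of the "@".
    return [word, "@", addressee] if word else ["@", addressee]


print(split_addressing("rumsitzen@stoeps"))
print(split_addressing("@stoeps"))
```

The resulting "@" token can then be attached to the message with whichever label applies (MOD, S, or SINFL).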
Chat: concatenation
Two ways of handling concatenation
1) Conserve tokenization → integrate gram. functions
2) Retokenize data → regular annotation
[Figure: annotation of the retokenized (Retok.) and of the original (Orig.) data]
Chat: concatenation
Three types of concatenations:
1) governor & dependent are concatenated
→ only governor label is annotated
2) governor & indirect dependent are concatenated
→ governor & dependent labels are annotated
3) independent word forms are concatenated
→ all highest-governor labels are annotated
[Figures: annotation of the retokenized (Retok.) and of the original (Orig.) data for each type]
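One way to read the three cases together: when the original tokenization is kept, a concatenated token carries the labels of exactly those sub-tokens whose governor lies outside the concatenation. The sketch below implements this reading under that assumption; the function name, the index-based representation, and the "gehts" example (with an assumed analysis) are hypothetical and not taken from the corpus.

```python
def labels_for_concatenation(span, heads, labels):
    """Collect the labels a concatenated original token would carry.

    span   -- set of retokenized token indices that form one original token
    heads  -- dict: token index -> governor index (None for the root)
    labels -- dict: token index -> dependency label
    A sub-token keeps its label only if its governor is outside the span.
    """
    return [labels[i] for i in sorted(span) if heads[i] not in span]


# Type 1 (governor & dependent): hypothetical "gehts" = "geht" (0) + "s" (1),
# where "s" is assumed to be the subject of "geht".
print(labels_for_concatenation({0, 1}, {0: None, 1: 0}, {0: "S", 1: "SUBJ"}))

# Type 3 (independent word forms): both sub-tokens depend on token 1,
# which lies outside the concatenation.
print(labels_for_concatenation({3, 4}, {3: 1, 4: 1}, {3: "ADV", 4: "OBJA"}))
```

Under this reading, type 2 falls out as well: an indirect dependent has its governor outside the concatenation, so both labels are kept.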
Open issues
Open Questions
Secondary Edges
• The current data model does not allow multiple governors.
seine farbe ist und bleibt ein Krampf
his color is and stays a cramp
'his color gives me the creeps'
[Figure: dependency tree with the edge labels SUBJ, SUBJ, KON, PRED, PRED, CS]
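What secondary edges would require of a data model can be sketched as follows. This is an illustration only, not the model used in the project: the `Token` class and its fields are hypothetical, and the analysis of the example (that "farbe" is subject of both "ist" and "bleibt", and "Krampf" predicate of both) is an assumption read off the doubled SUBJ and PRED labels.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Token:
    form: str
    head: Optional[int] = None   # primary governor (token index); None for the root
    label: Optional[str] = None  # label of the primary edge
    # Additional (governor index, label) pairs beyond the single tree edge.
    secondary: List[Tuple[int, str]] = field(default_factory=list)


def governors(token):
    """Return all governor indices of a token, primary governor first."""
    primary = [] if token.head is None else [token.head]
    return primary + [head for head, _ in token.secondary]


# Token indices: seine(0) farbe(1) ist(2) und(3) bleibt(4) ein(5) Krampf(6)
farbe = Token("farbe", head=2, label="SUBJ", secondary=[(4, "SUBJ")])
krampf = Token("Krampf", head=2, label="PRED", secondary=[(4, "PRED")])
print(governors(farbe), governors(krampf))
```

A strict tree model corresponds to always leaving `secondary` empty.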
Open Questions
Spans
• The current data model does not allow
  • spans of subtoken annotation
  • parallel representation of dependency trees in original data and normalization
→ Integration into SALT & ANNIS3
[Figure: dependency tree over the normalization "* Na gut , es gibt 50 cm Lauflaufleine . *" of the original token "*nagut50cmlauflaufleine*". Edge labels shown: DM, DM, S, SUBJ, OBJA, ATTR, APP. POS tags shown: $(, ADV, ADJD, $,, CARD, NN, NN, $., $(]
Thanks!
Project site:
https://www.linguistik.hu-berlin.de/institut/professuren/korpuslinguistik/forschung/clarin-d
Contact:
[email protected]
Bibliography
Albert, Stefanie; Anderssen, Jan; Bader, Regine; Becker, Stefanie; Bracht, Tobias; Brants, Thorsten et al. (2003): TiGer Annotationsschema.
Dipper, Stefanie; Lüdeling, Anke; Reznicek, Marc (to appear): NoSta-D. A Corpus of German Non-Standard Varieties. In: Marcos Zampieri (ed): Non-Standard Data Sources in Corpus-Based Research: Shaker.
Kübler, Sandra; Prokic, Jelena (2006): Why is German Dependency Parsing more Reliable than Constituent Parsing? In: Proceedings of the Fifth International Workshop on Treebanks and Linguistic Theories. Prague, Czech Republic.
Lüdeling, Anke (2008): Mehrdeutigkeiten und Kategorisierung. Probleme bei der Annotation von Lernerkorpora. In: Walter, Maik; Grommes, Patrick (eds): Fortgeschrittene Lernervarietäten: Korpuslinguistik und Zweitspracherwerbsforschung. Tübingen: Max Niemeyer Verlag, 119–140.
Reznicek, Marc; Lüdeling, Anke; Hirschmann, Hagen (to appear): Competing target hypotheses in the Falko corpus. A flexible multi-layer corpus architecture. In: Díaz-Negrillo, Ana; Ballier, Nicolas; Thompson, Paul (eds): Automatic Treatment and Analysis of Learner Corpus Data. Amsterdam: John Benjamins (Series Studies in Corpus Linguistics).