NoSta-D Chat
Chat-Dependencies
Dependency relations in the non-standard corpus NoSta-D

Marc Reznicek†*, Stefanie Dipper†, Anke Lüdeling*, Burkhard Dietterle*
† Ruhr-Universität Bochum, * Humboldt-Universität zu Berlin
WS Verarbeitung und Annotation von Sprachdaten aus Genres internetbasierter Kommunikation
23.09.2013 - Darmstadt

Research questions (non-standard)
Are syntactic structures that are produced online similar, independent of modality?
   Chat vs. spoken language
Do learner texts show the same variability with respect to word order as historical texts?
   Learner texts vs. diachronic texts
Is literary prose more complex than newspaper text?
   Literary prose vs. newspaper

WS IBK: Chat-Dependenzen – GSCL 23.09.2013 - Darmstadt

Overview
Chat as a non-standard variety
Dependency model
Normalization
Guidelines for dependency annotation
Chat syntax
Open issues

Clarin-D Curation Project II
Clarin F-AG 7 - Curation project (KP2): Linguistic annotation of non-standard varieties: guidelines and "best practices"
Annotation categories, guidelines, and automatic tools are based on newspaper texts.
Pilot project: extension of existing resources for 5 non-standard varieties
3 types of annotation:
   dependency analysis
   named entities
   coreference

Data: NoSta-D (Non-standard variety corpus - Deutsch)
(Dipper et al. to appear)
Learner essays: Falko, 6,762 tokens
   word order deviation
   creative word formation
   non-standard filling of syntactic slots
   divergent morphological marking
Spoken map task: BeMaTac, 6,731 tokens
   repetition
   auto-correction
   anacoluthon
   online argument development
Literary prose: Kafka – Der Prozeß
   doubled argument selection
   complex parentheses
Newspaper: TüBa-DZ
   standard variety
[The slide gives 5,000 and 7,294 tokens for these two subcorpora; the assignment is not recoverable from the transcript.]
Chat protocols: DCK – Plauderchat, 6,664 tokens
   misspelling
   inflective (Vend)
   asterisk expression
   interjection
   emoticon
   concatenation
Historical text: DDB & Anselm, 2,348 + 4,705 tokens
   no sentence boundary marking
   rather free word order
   no standardized spelling
https://www.linguistik.hu-berlin.de/institut/professuren/korpuslinguistik/forschung/clarin-d

Data: Chat in NoSta-D
160 TomcatMJ   eh!*rupf,zerr,reiss,mich losmach*
161 quaki      hmm der paris war ein ganz schön dämlcher
162 Emon       ah soo
163 TomcatMJ   *nicht festgebunden sein mag*
164 Emon       welches ezugnis? klärt mich mal jemand über den sachverhalt auf?
165 quaki      *nagut50cmlauflaufleine*
166 Erdbeere$  thor
167 Thor...    tach erdbeere
168 Erdbeere$  hello
169 marc30     wie großzügig quaki manchmal ist...
170 Emon       *wart*
Phenomena marked in the excerpt: inflective (Vend), asterisk expression, @ addressing, emoticon, fragment, misspelling, interjection, concatenation, filled pause
http://www.chatkorpus.tu-dortmund.de/files/releasehtml/html-korpus/unicum_21-02-2003_1.html

Dependency model
Dependency parsing for German "reaches an accuracy […] better than the best constituent analysis including grammatical functions." (Kübler & Prokic 2006)
labels = grammatical functions
edges = dependencies

Normalization

Normalization: Motivation
Grammatical functions in the sentence depend mostly on the main verb.
Fragments are difficult to model.
TiGer Guidelines: "Bei verblosen Sätzen, die v.a. in Überschriften und Titeln erscheinen, sollte man den Satz in Gedanken sinnvoll ergänzen und ihn dann ganz normal annotieren." (Albert et al. 2003:72)
Translation: Verbless sentences, which occur mainly in headlines and titles, should be completed sensibly in one's mind and then annotated as usual.

Normalization: Motivation
Normalization explicitly inserts a prototypical main verb with a matching argument structure.
This motivates the grammatical functions assigned to fragments.
[Figure: dependency trees of the normalization and of the original, each attached to ROOT]

Normalization: two perspectives
Two objectives, two representations
❶ Computational approach:
   Normalization = minimal preprocessing step to facilitate further processing
   What performance can be achieved by automatic annotation tools?
❷ Variationist approach (Lüdeling 2008, Reznicek et al. 2013):
   Normalization = index to classify similar phenomena in the corpus
   Which linguistic structures vary in different varieties?

Dependency annotation

Dependency annotation
Guidelines for annotation of non-standard dependencies
1) Take guidelines that fully describe structures in a large newspaper corpus of German: TiGer (Albert et al. 2003)
   Problem: TiGer describes constituents, not dependencies (Albert et al. 2003:9)
2) Give human annotators a translation of TiGer constituent trees into dependencies:
   NN → HEAD, unless the head is a PIS
   ADJA → HEAD, unless the head is a PIS or an NN
   ART → DET
3) Introduce new labels for structures not described by TiGer:
   Cross classes:
      X.     Indirect dependencies
      C.     Coordination
      COR.   Autocorrections
      DR.    Direct speech
   Chat structures:
      SINFL  Inflectives
      @      Addressing

Dependency annotation
Annotation with WebAnno
https://clarin.ukp.informatik.tu-darmstadt.de

Chat syntax
Ellipsis
@ addressing
Concatenation

Chat: ellipses
Normalization also makes explicit all missing governors!
Word forms are attached to:
   1) Governor                  gram. func.
   2) Next higher governor      X + gram. func. of the missing governor
   3) Conjunction               C + gram. func.
   4) Corrected branch          COR + gram. func.
   5) Segment root              gram. func.
[Figure: dependency trees of the normalization (Norm.) and of the original (Orig.)]

Chat: ellipses
Word forms are connected to
   2) Next higher governor      X + gram. func. of the missing governor
[Figure: dependency trees of the normalization (Norm.) and of the original (Orig.)]
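
The label scheme for elliptical structures can be sketched in a few lines of Python. This is only an illustration, not the project's tooling: the attachment-type names and the `edge_label` function are hypothetical, and joining prefix and grammatical function with a dot (for example `X.SUBJ`) is an assumption based on the way the slides write the cross classes as "X.", "C." and "COR.".

```python
# Hypothetical names for the five attachment sites listed on the slide.
# The value is the cross-class prefix; an empty string means the plain
# grammatical function is used.
PREFIX_BY_ATTACHMENT = {
    "governor": "",              # 1) regular governor is present
    "next_higher_governor": "X", # 2) direct governor is missing
    "conjunction": "C",          # 3) attached to a conjunction
    "corrected_branch": "COR",   # 4) attached to a corrected branch
    "segment_root": "",          # 5) attached to the segment root
}


def edge_label(attachment: str, gram_func: str) -> str:
    """Compose an edge label from attachment site and grammatical function.

    For "next_higher_governor", gram_func is the function the word form
    would bear under the missing governor.
    The "." separator is an assumption, not confirmed by the source.
    """
    if attachment not in PREFIX_BY_ATTACHMENT:
        raise ValueError(f"unknown attachment type: {attachment}")
    prefix = PREFIX_BY_ATTACHMENT[attachment]
    return f"{prefix}.{gram_func}" if prefix else gram_func
```

With this sketch, a subject whose verb is missing in the original would receive `edge_label("next_higher_governor", "SUBJ")`, while the same subject in the normalization, where the verb has been inserted, keeps the plain label.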

Chat syntax
Ellipsis
@ addressing
Concatenation

Chat: addressing
mit dem ast in der hand im teich am rumsitzen@stoeps
with the branch in the hand in-the lake sitting@stoeps
[Figure: the message "mit dem ast in der hand im teich am rumsitzen" (XX) and the addressee "stoeps" (PN), connected by the @ relation]

Chat: addressing
@-attached arguments are of variable type:
   Modification: MOD
   Sentence root: S
   Inflective root: SINFL

Chat syntax
Ellipsis
@ addressing
Concatenation

Chat: concatenation
Two ways of handling concatenation:
   1) Conserve tokenization: integrate grammatical functions
   2) Retokenize data: regular annotation
[Figure: retokenized (Retok.) and original (Orig.) annotation]

Chat: concatenation
Three types of concatenations:
   1) Governor & dependent are concatenated: only the governor label is annotated
[Figure: retokenized (Retok.) and original (Orig.) annotation]

Chat: concatenation
Three types of concatenations:
   2) Governor & indirect dependent are concatenated: governor & dependent labels are annotated
[Figure: retokenized (Retok.) and original (Orig.) annotation]
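
The retokenization option can be illustrated with a small Python sketch that splits a concatenated chat token into segments while recording character offsets back into the original token. The function name `subtoken_spans` and the idea of passing in a manually determined segment list are assumptions made for illustration; the slides do not say how retokenization is carried out in the project.

```python
from typing import List, Tuple


def subtoken_spans(original: str, segments: List[str]) -> List[Tuple[int, int, str]]:
    """Locate each segment inside the original token, left to right.

    Returns (start, end, segment) triples with end-exclusive character
    offsets, so that annotations on the retokenized segments can still
    be traced back to the untouched original token.
    """
    spans = []
    position = 0
    for segment in segments:
        start = original.find(segment, position)
        if start == -1:
            raise ValueError(f"segment {segment!r} not found after offset {position}")
        end = start + len(segment)
        spans.append((start, end, segment))
        position = end
    return spans


# Chat token taken from the corpus excerpt; the segmentation is supplied by hand.
token = "*nagut50cmlauflaufleine*"
pieces = ["*", "na", "gut", "50", "cm", "lauflaufleine", "*"]
spans = subtoken_spans(token, pieces)
```

Keeping the offsets alongside the segments is one way to hold on to both views of the data: the original tokenization stays intact, and regular annotation can be applied to the retokenized pieces.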

Chat: concatenation
Three types of concatenations:
   3) Independent word forms are concatenated: all highest-governor labels are annotated
[Figure: retokenized (Retok.) and original (Orig.) annotation]

Open issues

Open Questions: Secondary Edges
The current data model does not allow multiple governors.
seine farbe ist und bleibt ein Krampf
his color is and stays a cramp
"his color gives me the creeps"
[Figure: dependency tree over the example with the edge labels SUBJ, SUBJ, KON, PRED, PRED, CS]

Open Questions: Spans
The current data model does not allow
   spans of subtoken annotation
   parallel representation of dependency trees in original data and normalization
Integration into SALT & ANNIS3
Original: *nagut50cmlauflaufleine*
Normalization: * Na gut , es gibt 50 cm Lauflaufleine . *
[Figure: dependency tree over the normalization with the edge labels DM, DM, S, SUBJ, OBJA, ATTR, APP, aligned to the original token]

Thanks!
Project site: https://www.linguistik.hu-berlin.de/institut/professuren/korpuslinguistik/forschung/clarin-d
Contact: [email protected]

Bibliography
Albert, Stefanie; Anderssen, Jan; Bader, Regine; Becker, Stefanie; Bracht, Tobias; Brants, Thorsten et al. (2003): TiGer Annotationsschema.
Dipper, Stefanie; Lüdeling, Anke; Reznicek, Marc (to appear): NoSta-D. A Corpus of German Non-Standard Varieties. In: Marcos Zampieri (ed.): Non-Standard Data Sources in Corpus-Based Research. Shaker.
Kübler, Sandra; Prokic, Jelena (2006): Why is German Dependency Parsing more Reliable than Constituent Parsing? In: Proceedings of the Fifth International Workshop on Treebanks and Linguistic Theories. Prague, Czech Republic.
Lüdeling, Anke (2008): Mehrdeutigkeiten und Kategorisierung. Probleme bei der Annotation von Lernerkorpora. In: Walter, Maik; Grommes, Patrick (eds.): Fortgeschrittene Lernervarietäten: Korpuslinguistik und Zweitspracherwerbsforschung. Tübingen: Max Niemeyer Verlag, 119–140.
Reznicek, Marc; Lüdeling, Anke; Hirschmann, Hagen (to appear): Competing target hypotheses in the Falko corpus. A flexible multi-layer corpus architecture. In: Díaz-Negrillo, Ana; Ballier, Nicolas; Thompson, Paul (eds.): Automatic Treatment and Analysis of Learner Corpus Data. Amsterdam: John Benjamins (Studies in Corpus Linguistics).