Building a corpus for learning how to produce atonal pronouns in the Romanian clitic sequence (Ciprian-Virgil Gerstenberger)

Atonal pronouns: Why a special corpus?
Language knowledge: How to build it?
Language production: What are the benefits?
Building a corpus for learning how to produce
atonal pronouns in the Romanian clitic
sequence
Ciprian-Virgil Gerstenberger
Universitetet i Tromsø, Norway
Learner Language, Learner Corpora Conference
LLLC 2012
06.10.2012, Oulu, Finland
Outline
Atonal pronouns: Why a special corpus?
Language knowledge: How to build it?
Language production: What are the benefits?
General question
How to deal with soft constraints in language production?
free word order (e.g., in Finnish)
− information structure, style?
in-situ vs. extraposed relative clauses (e.g., in German)
− clause weight, register?
optional sandhi phenomena (e.g., in Romanian)
− genre, register, dialect, sociolect, idiolect?
Specific question
What triggers optional realizations of Romanian atonal
pronouns?
(1) a. Te rog să îl faci!
       [Please, do it!]
    b. Te rog să-l faci!
(2) a. Știu că îi scrii emailuri.
       [I know that you write him/her emails.]
    b. Știu că-i scrii emailuri.
(3) a. Hai să ne apucăm de treabă!
       [Let’s start working!]
    b. Hai să ne-apucăm de treabă!
(External) Sandhi
Joining
Epenthesis in English: a car vs. an old car
Elision in French: la fille [the girl] vs. l’église [the church]
Elision in Romanian: Tu îl vezi. vs. Tu-l vezi. [You see him/it.]
⇒ Sandhi can be marked graphically, but it doesn’t have to be.
⇒ Elision in Romanian is always graphically marked!
Sandhi in Romanian
General Rule: avoid hiatus
CV VC
⇒ C-VC
   Mă apuc de treabă. [I start working.]
   M-apuc de treabă.
⇒ CV-C
   Tu îl vezi. [You see him/it.]
   Tu-l vezi.
⇒ CV̯-VC
   Te apuci de treabă. [You start working.]
   Te-apuci de treabă.
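The three patterns above can be sketched as simple string rewrites. This is only an illustration under stated assumptions: the helper `join_clitic`, the vowel set, and the three small word lists are hypothetical and cover just the example types on this slide, not Romanian sandhi in general.

```python
VOWELS = set("aăâeiîou")
# Pronouns that lose their initial î after a vowel-final word (CV VC => CV-C).
INITIAL_I_FORMS = {"îl", "îi", "îmi", "îți", "își"}
# Pronouns that lose their own vowel before a vowel-initial word (CV VC => C-VC).
VOWEL_DROPPING = {"mă", "vă"}
# Pronouns whose vowel becomes non-syllabic before a vowel (CV VC => CV-VC).
GLIDING = {"te", "ne", "le"}


def join_clitic(left: str, right: str) -> str:
    """Return the joined (sandhi) variant of two adjacent words, if one applies."""
    if left[-1].lower() in VOWELS and right.lower() in INITIAL_I_FORMS:
        return f"{left}-{right[1:]}"   # Tu îl -> Tu-l
    if left.lower() in VOWEL_DROPPING and right[0].lower() in VOWELS:
        return f"{left[0]}-{right}"    # Mă apuc -> M-apuc
    if left.lower() in GLIDING and right[0].lower() in VOWELS:
        return f"{left}-{right}"       # Te apuci -> Te-apuci
    return f"{left} {right}"           # no hiatus pattern: leave unjoined
```

Since the sandhi in these examples is optional, the function returns only the joined variant; deciding between the joined and the unjoined form is the question the rest of the talk addresses.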
Romanian atonal pronouns
Accusative
Number  Person  Type       Gender  Syllabic   Non-syllabic (onset)  Non-syllabic (coda)
Sg      1.      pers/refl  m/f     [mə] mă    [m] m-                –
Sg      2.      pers/refl  m/f     [te] te    [te̯] te               –
Sg      3.      pers       m       –          [l] l-                [l] -l/îl
Sg      3.      pers       f       [o] o      [o̯] o                 –
Sg      3.      refl       m/f     [se] se    [s] s-, [se̯] se       –
Pl      1.      pers/refl  m/f     [ne] ne    [ne̯] ne               –
Pl      2.      pers/refl  m/f     [və] vă    [v] v-                –
Pl      3.      pers       m       –          [i̯] i                 [j] -i/îi
Pl      3.      pers       f       [le] le    [le̯] le               –
Pl      3.      refl       m/f     [se] se    [s] s-, [se̯] se       –
Romanian atonal pronouns
Dative
Number  Person  Type       Syllabic           Non-syllabic (onset)  Non-syllabic (coda)
Sg      1.      pers/refl  [mi] mi            [mi̯] mi               [mʲ] -mi/îmi
Sg      2.      pers/refl  [tsi] ți           [tsi̯] ți              [tsʲ] -ți/îți
Sg      3.      pers       [i] i              [i̯] i                 [j] -i/îi
Sg      3.      refl       [ʃi] și            [ʃi̯] și               [ʃʲ] -și/își
Pl      1.      pers/refl  [ni] ni, [ne] ne   [ne̯] ne               –
Pl      2.      pers/refl  [vi] vi, [və] vă   [v] v-                –
Pl      3.      pers       [li] li, [le] le   [le̯] le               –
Pl      3.      refl       [ʃi] și            [ʃi̯] și               [ʃʲ] -și/își
Problems from a learner’s perspective
Obligatory sandhi
atonal pronouns
− M-am apucat de treabă.
  *Mă am apucat de treabă.
  [I’ve started to work.]
elsewhere
− într-un vis de vară
  *între un vis de vară
  [in a summer dream]
Problems from a learner’s perspective
Optional sandhi
atonal pronouns
− M-apuc de treabă.
  Mă apuc de treabă.
  [I start to work.]
elsewhere
− O s-aduc cartea.
  O să aduc cartea.
  [I’ll bring the book.]
Problems from a learner’s perspective
Hyphenated non-reduced (= syllabic) forms
as phonological hosts
  Ți–l cumperi. [You buy it (for yourself).]
  Să nu mi–ți pierzi timpul cu așa ceva! [Don’t waste your time on such things!]
in postverbal position
  Du–te acasă! [Go home!]
as phonological hosts in postverbal position
  Cumpără–ți–l! [Buy it!]
Problems from a learner’s perspective
Understanding: What kind of hyphen is it?
hyphen as unreliable indicator for reduced forms
Ți–ai cumpărat cartea. [You’ve bought the book!]
Ți–l cumperi. [You buy it.]
Ți–o cumperi. [You buy it.]
Să–ți cumperi cartea! [Buy the book!]
Du–te acasă! [Go home!]
Du–te–acasă! [Go home!]
Cumpără–ți–l! [Buy it (for yourself)!]
Cumpără--l! [Buy it!]
Cumpără--ți cartea! [Buy the book!]
gray = syllabic atonal pronoun
black = reduced atonal pronoun
– = non-syllabic
– = post-verbal
-- = non-syllabic AND post-verbal
Problems from a learner’s perspective
Understanding: Which phonological form is it?
grapheme-phoneme ambiguity
[tsi]   Cumpără-ți-l! [Buy it!] / Ți-l cumperi. [You buy it!]
[tsʲ]   Cumpără-ți cartea! [Buy the book!] / Îți cumperi cartea. [You buy the book.]
[tsi̯]   Ți-ai cumpărat cartea. [You’ve bought the book.]
Problems from a learner’s perspective
Production: To hyphenate or not to hyphenate?
⇒ obligatory or optional hyphenation?
⇒ if optional, reduced or non-reduced form?
The choice issue
Well-balanced mixture of joined vs. non-joined forms
defining well-balancedness?
domain of well-balancedness: clause, sentence, paragraph, text?
counting only optional or both obligatory and optional instances?
alignment, parallelism?
Trebuie s-o faci și s-o dregi!
Trebuie să o faci și să o dregi!
[You have to do it and to mend it!]
Trebuie să o faci și s-o dregi!
Trebuie s-o faci și să o dregi!
⇒ Different rhythm! A matter of style?
The choice issue
Speech rate
Alexandra Popescu (2003). Morphophonologische Phänomene des Rumänischen. PhD thesis, University of Düsseldorf.
Optimality-Theoretic model:
– reduced forms always win at a faster speech rate
– non-reduced forms always win at a normal speech rate
Popescu (2003), Ex. (21), p. 160
The choice issue
Speech rate (cont.)
Alexandra Popescu (2003). Morphophonologische Phänomene des Rumänischen. PhD thesis, University of Düsseldorf.
⇒ speech rate is relative: no experimental setup
⇒ speech rate vs. number of syllables per time unit?
⇒ what about rhythm?
“Emil Boc, du-te-acasă / Și apucă-te de coasă!”
“Emil Boc, go home / And start scything!”
The choice issue
Speech rate (cont.)
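Alexandra Popescu (2003). Morphophonologische Phänomene des Rumänischen. PhD thesis, University of Düsseldorf, p. 179
⇒ the OT model fails to account for all presented data
“Es ist allerdings unklar, warum der Kandidat mit dem Vollvokal [ɨ] neben dem
Kandidaten c. mit dem Vollvokal [i] beim Normalsprechen gewinnen kann, obwohl er
nach dem bisherigen Ranking schlechter ist als der Kandidat mit dem Vollvokal.”
“It is unclear, however, why the candidate with the full vowel [ɨ] can win in normal
speech alongside candidate c. with the full vowel [i], although under the ranking
so far it fares worse than the candidate with the full vowel.”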
The choice issue
Mode, register, style
Maria Iliescu (1975). Pentru o sistematizare a predării pronumelui personal
neaccentuat românesc (la studenții străini). In Limba Română 24.
“În limba literară îngrijită se preferă pronume nelegate”
“in well-groomed literary style, non-bound pronouns are preferred”
“În stilul beletristic formele enclitice apar mai des”
“in belletristic style, enclitic forms occur more often”
⇒ fuzzy formulations: "are preferred", "occur more often"
⇒ how to define well-groomedness?
⇒ how many styles to define?
Usage-based approach
Corpus-driven solution
Observation
Some optional reduced atonal pronouns occur far more often than their
non-reduced counterparts.
Jürgen Bredemeier (1976). Strukturbeschränkungen im Rumänischen. Studien zur
Syntax der prä- und postverbalen Pronomina. TBL Verlag Gunter Narr.
Why?
How often?
⇒ Look into relevant data!
Web as Corpus?
"Du-te-acasă!"
⇒ No fine-tuning possible!
Web as Corpus
offers a wide range of usage-based instances of everything
improvements (e.g., semantic web) are not (yet) useful for the
current research issue
even simple but relevant distinctions are not possible without a
massive data cleanup (diacritics, hyphens, misspellings, sloppy
formulations, etc.)
⇒ Far too expensive at the moment!
Use existing corpora
Odense Grammatically Annotated Corpus of Romanian Business
Revista pe care ați realizat-o mi-a atras atenția
[The magazine you produced caught my attention]
annotation and preprocessing changed the original string
missing atonal pronouns and auxiliaries, dangling hyphens
⇒ Not of much use!
What to do?
⇒ Build a special corpus!
General ideas
account for specific phenomena (encountered instances plus all
optional variants)
provide additional necessary linguistic annotation
(part-of-speech)
add accessible, relevant information (spoken, written, genre, etc.)
enable unification of specific annotated data with other layers
(syntax, semantics, information structure)
keep the original string in place
use copyright-free data as much as possible
Experimental data set
Europarl Corpus
Romanian part of the Europarl Corpus
parallel corpus extracted from the proceedings of the European
Parliament
original purpose: Statistical Machine Translation (SMT)
freely available
compared to Google data, much cleaner
yet, still a huge amount of cleanup work
Data evaluation
size after the first cleanup and sentence splitting with
the default tools
224417 inc_europarl_ro.sent.txt
size after removing foreign sentences and correcting diacritics
223622 europarl_ro.sent.xml
pseudo-sentences, formulaic sentences (parliament meetings)
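The slide does not say how the diacritics were corrected. One step that Romanian text cleanup often needs is mapping the legacy cedilla letters to the comma-below letters; the sketch below assumes that this is part of the correction, and the function name `normalize_diacritics` is hypothetical.

```python
import unicodedata

# Legacy cedilla letters -> standard Romanian comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})


def normalize_diacritics(sentence: str) -> str:
    """Compose combining marks (NFC), then replace cedilla letters."""
    composed = unicodedata.normalize("NFC", sentence)
    return composed.translate(CEDILLA_TO_COMMA)
```

Composing first means that a letter typed as a base character plus a combining mark is handled the same way as its precomposed counterpart.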
Usable data for the research question
search for lines with at least one hyphen
56155
unique instances
53897
⇒ Filter irrelevant hyphen occurrences!
⇒ Search for the non-reduced pronominal forms!
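The two searches above could look roughly like this. It is a sketch that assumes one sentence per line; the function names are hypothetical, the hyphen pattern is restricted to word-internal hyphens (an assumption that already drops spaced dashes), and the list of non-reduced forms contains only the î-initial pronouns from the tables.

```python
import re

# A hyphen directly between two word characters, e.g. "să-l".
WORD_INTERNAL_HYPHEN = re.compile(r"\w-\w")
# Non-reduced (î-initial) atonal pronouns as whole words.
NON_REDUCED = re.compile(r"\b(?:îl|îi|îmi|îți|își)\b", re.IGNORECASE)


def hyphen_lines(lines):
    """Lines that contain at least one word-internal hyphen."""
    return [line for line in lines if WORD_INTERNAL_HYPHEN.search(line)]


def non_reduced_lines(lines):
    """Lines that contain at least one non-reduced pronominal form."""
    return [line for line in lines if NON_REDUCED.search(line)]


sample = [
    "Te rog să-l faci!",
    "Te rog să îl faci!",
    "Te rog să-l faci!",
    "Bine - foarte bine.",
]
with_hyphen = hyphen_lines(sample)
print(len(with_hyphen), len(set(with_hyphen)))
```

Counting `len(set(...))` next to `len(...)` mirrors the distinction between all instances and unique instances on the slide.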
Language knowledge
The small universe of atonal pronouns in Romanian
local phenomenon
relatively small number of forms
modelling any possible combination (even non-grammatical
ones – aka mal-rules in error modelling)
⇒ exhaustive modelling
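Because the inventory is small, every ordered pair of forms can be enumerated and labelled. The sketch below is illustrative only: it uses the syllabic forms from the two tables, applies a single check (the constraint that the rightmost item cannot be an open syllable with nucleus [i]), and labels everything else as "unchecked" rather than claiming it is grammatical. The function name and the labels are hypothetical.

```python
from itertools import product

DATIVE = ["mi", "ți", "i", "și", "ni", "vi", "li"]
ACCUSATIVE = ["mă", "te", "o", "se", "ne", "vă", "le"]
# Syllabic dative forms are open syllables with nucleus [i].
OPEN_I = set(DATIVE)


def enumerate_pairs():
    """Label every ordered pair of forms, including non-grammatical ones."""
    forms = DATIVE + ACCUSATIVE
    table = {}
    for first, second in product(forms, repeat=2):
        # A pair whose rightmost item is an open [i] syllable is coded as a mal-rule.
        table[(first, second)] = "mal-rule" if second in OPEN_I else "unchecked"
    return table


pairs = enumerate_pairs()
print(len(pairs), pairs[("ni", "le")], pairs[("le", "ni")])
```

Keeping the non-licensed pairs in the table, instead of dropping them, is what later allows specific feedback on learner errors.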
Language knowledge
Example: 1pers, Sg, Acc
Annotation run
Current state
pattern + context-testing functions → current annotation state
⇒ add all other optional forms licensed by the given context
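A pattern combined with a context-testing function might be sketched as follows. The slide does not show the actual implementation, so everything here is an assumption: the names `annotate` and `host_ends_in_vowel`, the dictionary layout of an annotation, and the single context test (the host word must end in a vowel).

```python
import re

VOWELS = set("aăâeiîou")
# Non-reduced form -> reduced form, following the coda columns of the tables.
REDUCED = {"îl": "l", "îi": "i", "îmi": "mi", "îți": "ți", "își": "și"}
FULL = {reduced: full for full, reduced in REDUCED.items()}


def host_ends_in_vowel(word: str) -> bool:
    """Context test: only a vowel-final host licenses the reduced form."""
    return word[-1].lower() in VOWELS


def annotate(sentence: str) -> list[dict]:
    """Record each encountered form together with its optional variants."""
    tokens = re.findall(r"[\w-]+", sentence)
    found = []
    # Pattern 1: host + non-reduced pronoun, e.g. "să îl".
    for left, right in zip(tokens, tokens[1:]):
        if right.lower() in REDUCED and host_ends_in_vowel(left):
            joined = f"{left}-{REDUCED[right.lower()]}"
            found.append({"encountered": f"{left} {right}",
                          "variants": [f"{left} {right}", joined]})
    # Pattern 2: host-pronoun written with a hyphen, e.g. "să-l".
    for token in tokens:
        host, hyphen, clitic = token.partition("-")
        if hyphen and host and clitic in FULL and host_ends_in_vowel(host):
            found.append({"encountered": token,
                          "variants": [f"{host} {FULL[clitic]}", token]})
    return found
```

For example, `annotate("Te rog să îl faci!")` records the encountered string `să îl` and adds `să-l` as the other form licensed by the same context.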
Annotation run
Intended state
⇒ Part-of-speech information needed!
Part-of-Speech annotation
Current state
whole corpus pos-tagged using
http://www.racai.ro/webservices/TextProcessing.aspx
Towards the final format
Steps to do
transform the MULTEXT POS annotation into an XML format
unify the annotation of optional sandhi with the POS annotation
⇒ ... and then?
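The first step could be sketched with Python's standard `xml.etree.ElementTree`. The element names `s` and `w`, the attribute `msd`, and the placeholder tags `T1` and `T2` are assumptions made for illustration; neither the tagger's output format nor the target XML schema is given on the slide.

```python
import xml.etree.ElementTree as ET


def sentence_to_xml(sent_id, tagged_tokens):
    """Serialize (word form, tag) pairs as one <s> element with <w> children."""
    sentence = ET.Element("s", id=str(sent_id))
    for form, tag in tagged_tokens:
        word = ET.SubElement(sentence, "w", msd=tag)
        word.text = form  # the original string is kept as element text
    return ET.tostring(sentence, encoding="unicode")


xml_string = sentence_to_xml(1, [("să", "T1"), ("îl", "T2")])
print(xml_string)
```

Keeping each word form as element text leaves the original string in place, so a sandhi annotation layer can later be attached to the same `w` elements.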
... starts the real linguistic fun!
Using the whole potential of the linguistic annotation
Is there a significant difference between the occurrences of
să îmi vs. să-mi and, e.g., că îți vs. că-ți?
taking more context into account (item before the subjunction + item
after the atonal pronoun) and counting the syllables of the extended
context?
What about the rhythm changes in the context (cf. the huge
amount and variation of reduced forms in Romanian poetry)?
include stylometric measurements
⇒ What triggers the choice of a specific surface form?
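One way to approach the first question is a chi-square test on a 2×2 table (subjunction × non-reduced/reduced form). The talk does not name a test, so this is one possible choice, and the counts in the usage line are invented placeholders, not corpus results. The value 3.841 is the standard critical value for one degree of freedom at the 5% level.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


CRITICAL_VALUE_DF1_P05 = 3.841

# Placeholder counts: rows = subjunction (să, că), columns = (non-reduced, reduced).
statistic = chi_square_2x2(20, 80, 50, 50)
print(statistic > CRITICAL_VALUE_DF1_P05)
```

The function assumes that no row or column total is zero; with sparse counts an exact test would be the safer choice.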
Extending the linguistic playground
Annotating more (copyright-free) data
Romanian part of the JRC-Acquis Multilingual Parallel Corpus
DEX – Dicționarul Explicativ al Limbii Române
Romanian Wikipedia
– articles (elaborated, well-formulated text)
– comments (informal, more personal)
⇒ Copyright-free data is shareable data!
Natural Language Generation vs. Language Learning
sharing the need to produce well-formed, situationally adequate
natural language utterances
Why not share the knowledge as well?
Why not the resources, too?
⇒ Sharing data is not like sharing a slice of bread,
rather like Jesus’ bread and fish miracle!
Machine vs. human
Transferability of constraint formulation from NLG to LL
Is the constraint formalization from NLG transferable to the LL
domain?
Yes! Linearization and surface realization have to be applied to
perceivable entities.
⇒ no room to generate partially empty strings
⇒ no room to linearize traces or empty categories
Example of constraint formulation in NLG
Obligatory sandhi in the sequence of atonal pronouns
Rule: The rightmost item in the atonal pronoun sequence cannot be an open syllable
with nucleus [i].
Assuming the base form [ni] ni:
Is it the rightmost atonal pronoun in the sequence?
1. yes
   ⇒ change from [ni] ni to [ne] ne
   Is there an item on the left to obligatorily attach to?
   1.1 yes (e.g., [ne] ne [a] a dat)
       ⇒ attach [ne̯a] ne-a dat
   1.2 no (e.g., [ne] ne [dai̯] dai)
       ⇒ done [ne dai̯] ne dai
2. no (e.g., [ni] ni [le] le [dai̯] dai)
   ⇒ done [ni le dai̯] ni le dai
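The decision procedure above can be written down as a small function. This is a sketch for the base form `ni` only; the function name `realize` and the set of items that trigger obligatory attachment (here a few auxiliary forms) are assumptions for illustration.

```python
# Assumed set of items that the pronoun must attach to (hypothetical list).
OBLIGATORY_HOSTS = {"a", "am", "ai", "ați", "au"}


def realize(pronouns: list[str], rest: list[str]) -> str:
    """Linearize an atonal pronoun sequence followed by the remaining words."""
    sequence = list(pronouns)
    if sequence and sequence[-1] == "ni":
        # 1. "ni" is the rightmost atonal pronoun: change it to "ne".
        sequence[-1] = "ne"
        if rest and rest[0] in OBLIGATORY_HOSTS:
            # 1.1 an item requires attachment: join with a hyphen.
            attached = f"{sequence.pop()}-{rest[0]}"
            return " ".join(sequence + [attached] + rest[1:])
    # 1.2 and 2.: nothing to attach, or "ni" is not rightmost.
    return " ".join(sequence + rest)
```

With these assumptions, `realize(["ni"], ["a", "dat"])` gives `ne-a dat`, `realize(["ni"], ["dai"])` gives `ne dai`, and `realize(["ni", "le"], ["dai"])` leaves `ni le dai` unchanged.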
Optional sandhi phenomena
Exploiting the specific language model
analyse the context
consult the specific language model
give hints to students about the most appropriate form to choose
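The three steps could be sketched as a lookup in corpus counts. The language model is represented here as a plain `Counter` of surface forms, which is a simplifying assumption; the counts are invented placeholders and the function name `suggest` is hypothetical.

```python
from collections import Counter


def suggest(counts: Counter, variants: list[str]):
    """Return the most frequent variant and its share, or None without data."""
    total = sum(counts[v] for v in variants)
    if total == 0:
        return None  # the model has no evidence for this context
    best = max(variants, key=lambda v: counts[v])
    return best, counts[best] / total


# Placeholder counts standing in for the corpus-derived model.
model = Counter({"să-l": 3, "să îl": 1})
hint = suggest(model, ["să îl", "să-l"])
print(hint)
```

Returning the share together with the form lets an exercise phrase the hint as a tendency rather than as a rule, in line with the soft constraints discussed earlier.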
Further possible applications
Exploiting specific language resources
design and implementation of different types of language
learning exercises for training atonal pronouns
specific feedback on production error types, thanks to the mal-rule-like
coding of non-licensed forms
enriching existing analysis tools (parsers) with specific
information
Human vs. machine
NLG too much of a technique, too little of a science
Using NLG techniques for LL: rara avis
Karin Harbusch et al. (2009). Computing Accurate Grammatical Feedback in a Virtual
Writing Conference for German-Speaking Elementary-School Children: An Approach
Based on Natural Language Generation. CALICO Journal, 26(3).
Using LL research insights for NLG
NLG too much of a Fiat!-domain: from the very beginning
NLG paying very little attention to surface phenomena such as
language variation or even orthography
⇒ modelling human language production: a real plus for NLG
Conclusions
motivating the need for special corpora for learning how to make
decisions in case of optional surface realization
reporting on the cumbersome process of building resources for
special phenomena
stressing the need for sharing resources and insights between
fields with similar goals
underlining the benefits of sharing resources between NLG and
LL with respect to the realization of atonal pronouns in Romanian
⇒ Share resources!