Statistical Dependency Parsing in Korean

Transcription

Statistical Dependency Parsing in Korean:
From Corpus Generation To Automatic Parsing
Workshop on Statistical Parsing of Morphologically-Rich Languages
12th International Conference on Parsing Technologies
Jinho D. Choi & Martha Palmer
University of Colorado at Boulder
October 6th, 2011
[email protected]
Thursday, October 6, 2011
Dependency Parsing in Korean
• Why dependency parsing in Korean?
  - Korean is a flexible word order language.
[Figure: constituent trees (SOV construction) and dependency trees for "She still loved him" and the scrambled "Him she still loved" (with a trace *T*); the dependency relations SBJ, OBJ, and ADV stay the same under both word orders.]
Dependency Parsing in Korean
• Why dependency parsing in Korean?
  - Korean is a flexible word order language.
  - Rich morphology makes dependency parsing easier: particles mark grammatical relations directly.
[Figure: dependency tree for "She still loved him": 그녀 + 는 (she + aux. particle) is the SBJ, 그 + 를 (he + obj. case marker) is the OBJ, and "still" is the ADV of "loved".]
Dependency Parsing in Korean
• Statistical dependency parsing in Korean
  - Sufficiently large training data is required.
    • Not much training data is available for Korean dependency parsing.
• Constituent Treebanks in Korean
  - Penn Korean Treebank: 15K sentences.
  - KAIST Treebank: 30K sentences.
  - Sejong Treebank: 60K sentences.
    • The most recent and largest Treebank in Korean.
    • Contains Penn Treebank style constituent trees.
Sejong Treebank
• Phrase structure
  - Includes phrase tags, POS tags, and function tags.
  - Each token can be broken into several morphemes.
  - Tokens are mostly separated by white spaces.
[Figure: Sejong-style constituent tree for "She still loved him" (S → NP-SBJ, AP, VP → NP-OBJ, VP), with morpheme-level POS tags on each token: NP+JX, MAG, NP+JKO, NNG+XSV+EP+EF.]
Sejong Treebank
• Phrase-level tags and function tags:

  Phrase-level tags             Function tags
  S    Sentence                 SBJ  Subject
  Q    Quotative clause         OBJ  Object
  NP   Noun phrase              CMP  Complement
  VP   Verb phrase              MOD  Noun modifier
  VNP  Copula phrase            AJT  Predicate modifier
  AP   Adverb phrase            CNJ  Conjunctive
  DP   Adnoun phrase            INT  Vocative
  IP   Interjection phrase      PRN  Parenthetical

  Table 2: Phrase-level tags (left) and function tags (right) in the Sejong Treebank. X indicates phrases containing special types of particles and ending markers; L and R indicate phrases containing only left and right brackets, respectively. These tags are also used to determine dependency relations during the conversion.

• POS tags:

  NNG  General noun          MM   Adnoun               EP   Prefinal EM          JX   Auxiliary PR
  NNP  Proper noun           MAG  General adverb       EF   Final EM             JC   Conjunctive PR
  NNB  Bound noun            MAJ  Conjunctive adverb   EC   Conjunctive EM       IC   Interjection
  NP   Pronoun               JKS  Subjective CP        ETN  Nominalizing EM      SN   Number
  NR   Numeral               JKC  Complemental CP      ETM  Adnominalizing EM    SL   Foreign word
  VV   Verb                  JKG  Adnomial CP          XPN  Noun prefix          SH   Chinese word
  VA   Adjective             JKO  Objective CP         XSN  Noun DS              NF   Noun-like word
  VX   Auxiliary predicate   JKB  Adverbial CP         XSV  Verb DS              NV   Predicate-like word
  VCP  Copula                JKV  Vocative CP          XSA  Adjective DS         NA   Unknown word
  VCN  Negation adjective    JKQ  Quotative CP         XR   Base morpheme        SF, SP, SS, SE, SO, SW  Punctuation

  Table 1: POS tags in the Sejong Treebank (PM: predicate marker, CP: case particle, EM: ending marker, DS: derivational suffix, PR: particle, SF SP SS SE SO: different types of punctuation).
Dependency Conversion
• Conversion steps
  - Find the head of each phrase using head-percolation rules.
    • All other nodes in the phrase become dependents of the head.
  - Re-direct dependencies for empty categories.
    • Empty categories are not annotated in the Sejong Treebank.
    • Skipping this step generates only projective dependency trees.
  - Label (automatically generated) dependencies.
• Special cases
  - Coordination, nested function tags.
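The conversion steps above can be sketched in a few lines of Python. This is a hypothetical sketch, not the authors' code: a simple "rightmost child" rule stands in for the real head-percolation rules, labeling is reduced to reading the function tag off the dependent's phrase label, and the empty-category step is omitted (so, as the slide notes, only projective trees come out). The `Node` class is invented for the example.

```python
class Node:
    """A toy constituent-tree node: phrase tag, children, or a terminal word."""
    def __init__(self, tag, children=None, word=None):
        self.tag, self.children, self.word = tag, children or [], word

def convert(node, deps):
    """Return the lexical head of `node`; append (dependent, head, label) arcs."""
    if node.word is not None:
        return node                            # terminal: it is its own head
    heads = [convert(c, deps) for c in node.children]
    h = heads[-1]                              # head-final stand-in for headrules
    for child, d in zip(node.children, heads):
        if d is not h:                         # all other nodes become dependents
            label = child.tag.split("-")[-1] if "-" in child.tag else "DEP"
            deps.append((d.word, h.word, label))
    return h

# S( NP-SBJ(She)  VP( NP-OBJ(him)  VP(loved) ) )
tree = Node("S", [Node("NP-SBJ", word="She"),
                  Node("VP", [Node("NP-OBJ", word="him"),
                              Node("VP", word="loved")])])
deps = []
convert(tree, deps)
print(deps)  # -> [('him', 'loved', 'OBJ'), ('She', 'loved', 'SBJ')]
```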
Dependency Conversion
• Head-percolation rules
  - Korean is a head-final language: except for the quotative clause (Q), all phrase types try to find their heads among the rightmost children.
  - The rules were derived by analyzing each phrase in the Sejong Treebank.
  - Some approaches have instead treated each morpheme as an individual token to parse (Chung et al., 2010).

    S      r  VP;VNP;S;NP|AP;Q;*
    Q      l  S|VP|VNP|NP;Q;*
    NP     r  NP;S;VP;VNP;AP;*
    VP     r  VP;VNP;NP;S;IP;*
    VNP    r  VNP;NP;S;*
    AP     r  AP;VP;NP;S;*
    DP     r  DP;VP;*
    IP     r  IP;VNP;*
    X|L|R  r  *

    Table 3: Head-percolation rules for the Sejong Treebank. l/r implies looking for the leftmost/rightmost constituent. * implies any phrase-level tag. | implies a logical OR and ; is a delimiter between tags.
  - There are no rules to find the head morpheme of each token.
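Applying the rules in Table 3 amounts to scanning the children in the rule's direction, one priority group at a time. The following is an assumed Python implementation, not the authors' code; the rule strings follow Table 3, and function-tag suffixes such as -SBJ are stripped before matching.

```python
import re

# Head-percolation rules from Table 3: (direction, priority groups).
HEAD_RULES = {
    "S":   ("r", "VP;VNP;S;NP|AP;Q;*"),
    "Q":   ("l", "S|VP|VNP|NP;Q;*"),
    "NP":  ("r", "NP;S;VP;VNP;AP;*"),
    "VP":  ("r", "VP;VNP;NP;S;IP;*"),
    "VNP": ("r", "VNP;NP;S;*"),
    "AP":  ("r", "AP;VP;NP;S;*"),
    "DP":  ("r", "DP;VP;*"),
    "IP":  ("r", "IP;VNP;*"),
    "X":   ("r", "*"), "L": ("r", "*"), "R": ("r", "*"),
}

def find_head(phrase_tag, child_tags):
    """Return the index of the head child of a phrase."""
    direction, rule = HEAD_RULES[phrase_tag]
    # Strip function tags (e.g., NP-SBJ -> NP) before matching.
    stripped = [t.split("-")[0] for t in child_tags]
    order = list(range(len(child_tags)))
    if direction == "r":                # head-final: scan right-to-left
        order.reverse()
    for group in rule.split(";"):       # priority groups, highest first
        pattern = re.compile("^(%s)$" % group.replace("*", ".*"))
        for i in order:
            if pattern.match(stripped[i]):
                return i
    return order[0]

# The rightmost VP heads an S:
print(find_head("S", ["NP-SBJ", "AP", "VP"]))  # -> 2
```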
Dependency Conversion
• Dependency labels
  - Labels retained from the function tags.
    • When a node with a function tag is determined to be a dependent of some other node by our headrules, the function tag is taken as the dependency label to its head (e.g., SBJ and OBJ below).
  - Labels inferred from constituent relations.

    input : (c, p), where c is a dependent of p.
    output: a dependency label l for the arc c ← p.
    begin
        if   p = root           then ROOT → l
        elif c.pos = AP         then ADV  → l
        elif p.pos = AP         then AMOD → l
        elif p.pos = DP         then DMOD → l
        elif p.pos = NP         then NMOD → l
        elif p.pos = VP|VNP|IP  then VMOD → l
        else                         DEP  → l
    end
    Algorithm 1: Getting inferred labels.

[Figure: dependency tree for "She still loved him" converted from the constituent tree, using the function tags as dependency labels (SBJ and OBJ) plus the inferred label ADV.]
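Algorithm 1 is small enough to transcribe directly. This Python rendering is an assumed one (not the authors' code); POS tags are plain strings and the root case is a boolean flag.

```python
def inferred_label(c_pos, p_pos, p_is_root=False):
    """Algorithm 1: infer a label for dependent c attached to head p."""
    if p_is_root:
        return "ROOT"
    if c_pos == "AP":
        return "ADV"
    if p_pos == "AP":
        return "AMOD"
    if p_pos == "DP":
        return "DMOD"
    if p_pos == "NP":
        return "NMOD"
    if p_pos in ("VP", "VNP", "IP"):
        return "VMOD"
    return "DEP"

print(inferred_label("AP", "VP"))  # -> ADV ("still" under "loved")
print(inferred_label("NP", "VP"))  # -> VMOD
```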
Dependency Conversion
• Coordination
  - Previous conjuncts become dependents of the following conjuncts.
• Nested function tags
  - Nodes with nested f-tags become the heads of their phrases.
[Figure: "I_and he_and she left home": in the constituent tree, NP-CNJ nodes are nested under NP-SBJ; in the dependency tree, each conjunct depends on the following one with the CNJ label, "she" is the SBJ, and "home" is the OBJ of "left".]
Dependency Parsing
• Dependency parsing algorithm
  - Transition-based, non-projective parsing algorithm (Choi & Palmer, 2011).
    • Selectively performs transitions from both projective and non-projective dependency parsing algorithms.
    • Linear-time parsing speed in practice for non-projective trees.
• Machine learning algorithm
  - Liblinear L2-regularized L1-loss support vector machine: c = 0.1 (cost), e = 0.1 (termination criterion), B = 0 (bias).

Jinho D. Choi & Martha Palmer. 2011. Getting the Most out of Transition-based Dependency Parsing. In Proceedings of ACL:HLT'11.
Dependency Parsing
• Feature selection
  - Each token consists of multiple morphemes (up to 21).
  - POS tag feature of each token?
    • (NNG & XSV & EP & EF & SF) vs. (NNG | XSV | EP | EF | SF)
    • Sparse information vs. lack of information.
    • Happy medium?
[Figure: example tokens with morpheme-level POS tags: "Nakrang + Princess + JX" (/NNP+ /NNG+ /JX), "Hodong + Prince + JKO" (/NNP+ /NNG+ /JKO), and "Love + XSV + EP + EF + ." (/NNG+ /XSV+ /EP+ /EF+ ./SF).]
Dependency Parsing
• Morpheme selection
  - Each token is annotated with a sequence of POS tags, depending on how its morphemes are segmented, so feature extraction is not as trivial as in English, where each token carries a single POS tag.
  - Joining all POS tags within a token into a single tag (e.g., NNP+NNG+JX for the first token) usually causes very sparse feature vectors.
  - As a compromise, only certain types of morphemes are selected and used as features:

    FS  The first morpheme
    LS  The last morpheme before JK|DS|EM
    JK  Particles (J* in Table 1)
    DS  Derivational suffixes (XS* in Table 1)
    EM  Ending markers (E* in Table 1)
    PY  The last punctuation, only if no other morpheme follows the punctuation

    Table 6: Types of morphemes in each token used to extract features for our parsing models.

  - For unigrams, these morphemes can be used either individually (e.g., the POS tag of JK for the 1st token is JX) or jointly (e.g., a joined feature of POS tags between LS and JK for the 1st token is NNG+JX).
  - From our experiments, features extracted from the JK and EM morphemes are found to be the most useful.
[Figure 6: morphemes extracted from the example tokens with respect to the types in Table 6: /NNP /NNG /JX; /NNP /NNG /JKO; /NNG /XSV /EF /SF.]
Dependency Parsing
• Feature extraction
  - Extract features using only important morphemes.
  - Individual POS tag features of the 1st and 3rd tokens:
    NNP1, NNG1, JK1, NNG3, XSV3, EF3
  - Joined features of POS tags between the 1st and 3rd tokens:
    NNP1_NNG3, NNP1_XSV3, NNP1_EF3, JK1_NNG3, JK1_XSV3
  - Tokens used: wi, wj, wi±1, wj±1
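The selection in Table 6 can be sketched as a small classifier over a token's POS sequence. This is an assumed implementation, not the authors' code; tokens are represented as lists of POS tags only (the word forms are not needed for these features), and `morpheme_type` is an illustrative helper name.

```python
PUNCT = {"SF", "SP", "SS", "SE", "SO", "SW"}

def morpheme_type(pos):
    """Classify a POS tag into JK, DS, or EM per Table 6 (None otherwise)."""
    if pos.startswith("J"):
        return "JK"                      # particles (J* in Table 1)
    if pos.startswith("XS"):
        return "DS"                      # derivational suffixes (XS*)
    if pos.startswith("E"):
        return "EM"                      # ending markers (E*)
    return None

def select_morphemes(pos_seq):
    """Pick the FS, LS, JK, DS, EM, and PY morphemes of one token."""
    feats = {"FS": pos_seq[0]}           # FS: the first morpheme
    for i, pos in enumerate(pos_seq):
        t = morpheme_type(pos)
        if t is not None:
            feats[t] = pos               # the last morpheme of each type wins
            if "LS" not in feats and i > 0:
                feats["LS"] = pos_seq[i - 1]   # last morpheme before JK|DS|EM
    if pos_seq[-1] in PUNCT:             # PY: last punctuation, nothing after it
        feats["PY"] = pos_seq[-1]
    return feats

print(select_morphemes(["NNP", "NNG", "JX"]))
# The third example token from the slides: Love + XSV + EP + EF + .
print(select_morphemes(["NNG", "XSV", "EP", "EF", "SF"]))
```

On the second token, EM keeps EF rather than EP (the last ending marker wins), matching the /NNG /XSV /EF /SF extraction shown in Figure 6.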
Experiments
• Corpora
  - Dependency trees converted from the Sejong Treebank.
  - Consist of 20 sources in 6 genres:
    Newspaper (NP), Magazine (MZ), Fiction (FI), Memoir (ME), Informative Book (IB), and Educational Cartoon (EC).
  - For the development and evaluation sets, we pick one newspaper about art, one fiction text, and one informative book about trans-nationalism, and use the first half of each for development and the second half for evaluation.
  - Evaluation sets are very diverse compared to the training sets.
    • This ensures the robustness of our parsing models, which is important because we hope to use them to parse various texts on the web.

         NP      MZ     FI      ME     IB     EC
    T    8,060   6,713  15,646  5,053  7,983  1,548
    D    2,048   -      2,174   -      1,307  -
    E    2,048   -      2,175   -      1,308  -

    Table 10: Number of sentences in training (T), development (D), and evaluation (E) sets for each genre.
Experiments
• Morphological analysis
  - Two automatic morphological analyzers are used.
• Intelligent Morphological Analyzer
  - Developed by the Sejong project.
  - Provides the same morphological analysis as their Treebank.
    • Considered as fine-grained morphological analysis.
• Mach (Shim and Yang, 2002)
  - Analyzes 1.3M words per second.
  - Provides more coarse-grained morphological analysis.

Kwangseob Shim & Jaehyung Yang. 2002. A Supersonic Korean Morphological Analyzer. In Proceedings of COLING'02.
Experiments
• Evaluations
  - Gold-standard vs. automatic morphological analysis.
    • Relatively low performance from the automatic system.
  - Fine vs. coarse-grained morphological analysis.
    • Differences are not too significant.
  - Robustness across different genres.

          Gold, fine-grained     Auto, fine-grained     Auto, coarse-grained
          LAS    UAS    LS       LAS    UAS    LS       LAS    UAS    LS
    NP    82.58  84.32  94.05    79.61  82.35  91.49    79.00  81.68  91.50
    FI    84.78  87.04  93.70    81.54  85.04  90.95    80.11  83.96  90.24
    IB    84.21  85.50  95.82    80.45  82.14  92.73    81.43  83.38  93.89
    Avg.  83.74  85.47  94.57    80.43  83.01  91.77    80.14  82.89  91.99

    Table 11: Parsing accuracies achieved by the three models (in %). LAS: labeled attachment score, UAS: unlabeled attachment score, LS: label accuracy score.
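For reference, the three scores in Table 11 relate as follows: UAS counts correct heads, LS counts correct labels, and LAS requires both. A minimal sketch (an assumed representation, not the authors' evaluation code), with one (head, label) pair per token:

```python
def attachment_scores(gold, pred):
    """Return (LAS, UAS, LS) in %, given per-token (head, label) pairs."""
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred))  # correct heads
    ls  = sum(g[1] == p[1] for g, p in zip(gold, pred))  # correct labels
    las = sum(g == p for g, p in zip(gold, pred))        # both correct
    return 100.0 * las / n, 100.0 * uas / n, 100.0 * ls / n

# "She still loved him": gold heads all point to "loved" (token 3);
# the parser mislabels "still" as AJT instead of ADV.
gold = [(3, "SBJ"), (3, "ADV"), (0, "ROOT"), (3, "OBJ")]
pred = [(3, "SBJ"), (3, "AJT"), (0, "ROOT"), (3, "OBJ")]
las, uas, ls = attachment_scores(gold, pred)
print(round(las, 2), round(uas, 2), round(ls, 2))  # -> 75.0 100.0 75.0
```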
Conclusion
• Contributions
  - Generating a Korean Dependency Treebank.
  - Evaluating the robustness across different genres.
  - Selecting important morphemes for dependency parsing.
  - Evaluating the impact of fine vs. coarse-grained morphological analysis on dependency parsing.
• Future work
  - Increase the feature span beyond bigrams.
  - Find head morphemes of individual tokens.
  - Insert empty categories.
Acknowledgements
• Special thanks are due to
  - Professor Kong Joo Lee of Chungnam National University.
  - Professor Kwangseob Shim of Sungshin Women's University.
• We gratefully acknowledge the support of the National Science Foundation Grant CISE-IIS-RI-0910992, Richer Representations for Machine Translation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.