Ambiguity Resolution for Vt-N Structures in Chinese

Yu-Ming Hsieh1,2  Jason S. Chang2  Keh-Jiann Chen1
1 Institute of Information Science, Academia Sinica, Taiwan
2 Department of Computer Science, National Tsing-Hua University, Taiwan
[email protected], [email protected], [email protected]

Abstract

The syntactic ambiguity of a transitive verb (Vt) followed by a noun (N) has long been a problem in Chinese parsing. In this paper, we propose a classifier to resolve the ambiguity of Vt-N structures. The design of the classifier is based on three important guidelines, namely, adopting linguistically motivated features, using all available resources, and allowing easy integration into a parsing model. The linguistically motivated features include semantic relations, context, and morphological structures; the available resources are a treebank, a thesaurus, an affix database, and large corpora. We also propose two learning approaches that resolve the problem of data sparseness by auto-parsing and extracting relevant knowledge from large-scale unlabeled data. Our experiment results show that the Vt-N classifier outperforms the current PCFG parser. Furthermore, it can be easily and effectively integrated into the PCFG parser and general statistical parsing models. Evaluation of the learning approaches indicates that world knowledge facilitates Vt-N disambiguation through data selection and error correction.

1 Introduction

In Chinese, the structure of a transitive verb (Vt) followed by a noun (N) may be a verb phrase (VP), a noun phrase (NP), or the two words may not stand in any dependency relation at all, as shown in (1) below. In general, parsers may prefer the VP reading because a transitive verb followed by a noun object is normally a VP structure. However, Chinese verbs can also modify nouns without morphological inflection, e.g., 養殖/farming 池/pond. Consequently, parsing Vt-N structures is difficult because it is hard to resolve such ambiguities without prior knowledge. The following are some typical examples of various Vt-N structures:

1) 解決/solve 問題/problem  VP
   解決/solving 方案/method  NP
   解決/solve 人類/mankind (問題/problem)  None

To find the most effective disambiguation features, we need more information about the Vt-N NP construction and the semantic relations between Vt and N. Statistics from the Sinica Treebank (Chen et al., 2003) indicate that 58% of Vt-N structures are verb phrases, 16% are noun phrases, and 26% do not have any dependency relation. It is obvious that the semantic relations within a Vt-N structure and its context information are very important for differentiating between dependency relations. Although the verb-argument relation of VP structures is well understood, it is not clear what kinds of semantic relations result in NP structures. In the next sub-section, we consider three questions: What sets of nouns accept verbs as their modifiers? Is it possible to identify the semantic types of such pairs of verbs and nouns? What are their semantic relations?

1.1 Problem Analysis

Analysis of the instances of NP(Vt-N) structures in the Sinica Treebank reveals the following four types of semantic structures, which are used in the design of our classifier.

Type 1. Telic(Vt) + Host(N): Vt denotes the telic function (purpose) of the head noun N, e.g.,
研究/research 工具/tool; 探測/explore 機/machine; 賭/gamble 館/house; 搜尋/search 程式/program. The telic function must be a salient property of the head noun; such nouns denote tools, buildings, artifacts, organizations, and people. To identify such cases, we need to know which types of nouns take a telic function as their salient property. Furthermore, many of these head nouns are monosyllabic words, such as 員/people, 器/instruments, and 機/machines.

Type 2. Host-Event(Vt) + Attribute(N): The head noun is an attribute noun that denotes an attribute of the verb, e.g., 研究/research 方法/method (method of research); 攻擊/attack 策略/strategy (attacking strategy); 書寫/write 內容/content (content of writing); 賭/gamble 規/rule (gambling rules). An attribute noun is a special type of noun. Semantically, attribute nouns denote the attribute types of objects or events, such as weight, color, method, and rule. Syntactically, attribute nouns do not play adjectival roles (Liu, 2008); by contrast, object nouns may modify nouns. The number of attributes for events is limited. If we could discover all event-attribute relations, we could resolve this type of construction.

Type 3. Agentive + Host: Only a limited number of such constructions exist, and they are usually ambiguous, e.g., 炒飯/fried rice (NP) and 叫聲/shouting sound; the first example also has a VP reading.

Type 4. Apposition + Affair: The head noun is an event noun and the modifier is a verb denoting an apposition event, e.g., 追撞/collide 事故/accident, 破壞/destruct 運動/movement, 憤恨/hate 行為/behavior. There is a finite number of event nouns.

Furthermore, when we consider verbal modifiers, we find that verbs can play adjectival roles in Chinese without inflection, but not all verbs do so. According to Chang et al. (2000) and our observations, adjectival verbs are verbs that denote event types rather than event instances; that is, they denote a class of events that are concepts in an upper-level ontology. One important characteristic of adjectival verbs is that they have conjunctive morphological structures, i.e., the word is formed by conjoining two nearly synonymous verbs, e.g., 研/study 究/search (research), 探/explore 測/detect (explore), and 搜/search 尋/find (search). Therefore, we need a morphological classifier that can detect the conjunctive morphological structure of a verb by checking the semantic parity of the verb's two morphemes.

Based on our analysis, we designed a Vt-N classifier that incorporates the above features to solve the problem. However, the limited size of the current Treebank causes a data sparseness problem; in other words, the Treebank cannot provide enough training data to train the classifier properly. To resolve the problem, we mine useful information from all available resources.

The remainder of this paper is organized as follows. Section 2 provides a review of related work. In Section 3, we describe the disambiguation model with our selected features, and introduce a strategy for handling unknown words. We also propose a learning approach for a large-scale unlabeled corpus. In Section 4, we report the results of experiments conducted to evaluate the proposed Vt-N classifier on different feature combinations and learning approaches. Section 5 contains our concluding remarks.

2 Related Work

Most works on V-N structure identification focus on two types of relation classification: modifier-head relations and predicate-object relations (Wu, 2003; Qiu, 2005; Chen, 2008; Chen et al., 2008; Yu et al., 2008).
They exclude the independent structure and the conjunctive head-head relation, even though cross-bracket relations do exist between two adjacent words in real language. For example, if "遍佈/all over 世界/world" were taken from the short sentence "遍佈/all over 世界/world 各國/countries", it would be an independent structure. A conjunctive head-head relation between a verb and a noun is rare; however, in the sentence "服務 設備 都 甚 周到" (Both the service and the equipment are very thoughtful.), there is a conjunctive head-head relation between the verb 服務/service and the noun 設備/equipment. Therefore, we use four types of relations to describe the Vt-N structures in our experiments: the symbol 'H/X' denotes a predicate-object relation; 'X/H' denotes a modifier-head relation; 'H/H' denotes a conjunctive head-head relation; and 'X/X' denotes an independent relation.
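As a minimal illustration (ours, not code from the original system), the four relation labels and one labeled training instance might be represented in Python as follows:

    # The four Vt-N relation labels used in this paper.
    RELATION_LABELS = {
        "H/X": "predicate-object relation (VP reading)",
        "X/H": "modifier-head relation (NP reading)",
        "H/H": "conjunctive head-head relation",
        "X/X": "independent relation (no dependency)",
    }

    # One labeled instance: 解決/solve 問題/problem is a predicate-object pair.
    instance = {"vt": "解決", "noun": "問題", "relation": "H/X"}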
Feature selection is an important task in V-N disambiguation. Hence, a number of studies have suggested features that may help resolve the ambiguity of V-N structures (Zhao and Huang, 1999; Sun and Jurafsky, 2003; Chiu et al., 2004; Qiu, 2005; Chen, 2008). Zhao and Huang used lexicons, semantic knowledge, and word-length information to increase the accuracy of identification. Although they used the Chinese thesaurus CiLin (Mei et al., 1983) to derive lexical semantic knowledge, the word coverage of CiLin is insufficient. Moreover, none of the above papers tackle the problem of unknown words. Sun and Jurafsky exploited the probabilistic rhythm feature (i.e., the number of syllables in a word or the number of words in a phrase) in their shallow parser. Their results show that the feature improves parsing performance, which coincides with our analysis in Section 1.1. Chiu et al.'s study shows that the morphological structure of verbs influences their syntactic behavior; we follow this finding and utilize the morphological structure of verbs as a feature in the proposed Vt-N classifier. Qiu's approach uses an electronic syntactic dictionary and a semantic dictionary to analyze the relations of V-N phrases. However, the approach suffers from two problems: (1) the word coverage of the semantic dictionary is low; and (2) the semantic type classifier is inadequate. Finally, Chen proposed an automatic V-N combination method with features of verbs, nouns, context, and the syllabic lengths of words. The experiment results show that the method performs reasonably well without using any other resources.

Based on the above feature selection methods, we extract relevant knowledge from the Treebank to design a Vt-N classifier. However, we have to resolve the common problem of data sparseness. Learning knowledge by analyzing large-scale unlabeled data is necessary and has proved useful in previous works (Wu, 2003; Chen et al., 2008; Yu et al., 2008). Wu developed a machine learning method that acquires verb-object and modifier-head relations automatically; mutual information scores are then used to prune verb-noun pairs whose scores are below a certain threshold. The author found that accurate identification of the verb-noun relation improved parsing performance by 4%. Yu et al. learned head-modifier pairs from parsed data and proposed a head-modifier classifier to filter the data. The filtering model uses the following features: the PoS-tag pair of the head and the modifier; the distance between the head and the modifier; and the presence or absence of punctuation marks (e.g., commas, colons, and semi-colons) between the head and the modifier. Although the method improves parsing performance by 2%, the filtering model obtains limited data; the recall rate is only 46.35%. The authors also do not solve the problem of Vt-N ambiguity.

Our review of previous works and the observations in Section 1.1 show that lexical words, semantic information, the syllabic length of words, neighboring PoSs, and knowledge learned from large-scale data are important for Vt-N disambiguation. We consider more features for disambiguating Vt-N structures than previous studies. Specifically, we utilize (1) four-way relation classification in a real environment, covering the 'X/H', 'H/X', 'X/X', and 'H/H' relations; (2) unknown word processing for Vt-N words (including semantic type prediction and morph-structure prediction); (3) unsupervised data selection (a simple and effective way to extend knowledge); and (4) supervised knowledge correction, which makes the extracted knowledge more useful.

3 Design of the Disambiguation Model

The disambiguation model is a Vt-N relation classifier that classifies Vt-N relations into 'H/X' (predicate-object relations), 'X/H' (modifier-head relations), 'H/H' (conjunctive head-head relations), or 'X/X' (independent relations). We use the Maximum Entropy toolkit (Zhang, 2004) to construct the classifier. The advantage of using the Maximum Entropy model is twofold: (1) it has the flexibility to adjust features; and (2) it provides the probability values of the classification, which can be easily integrated into our PCFG parsing model. In the following sections, we discuss the design of our model for feature selection and extraction, unknown word processing, and world knowledge learning.

3.1 Feature Selection and Extraction

We divide the selected features into five groups: PoS tags of Vt and N, PoS tags of the context, words, semantics, and additional information. Table 1 shows the feature types and symbol notations. We use the symbols t1 and t2 to denote the PoS of Vt and N respectively. The context features are the neighboring PoSs of Vt and N: the symbols t-2 and t-1 represent the PoSs to the left, and the symbols t3 and t4 represent the PoSs to the right. The semantic feature is the word's semantic type extracted from its E-HowNet sense expression (Huang et al., 2008). For example, the E-HowNet expression of "車輛/vehicles" is {LandVehicle|車:quantity={mass|眾}}, so its semantic type is {LandVehicle|車}. We discuss the model's performance with different feature combinations in Section 4.

Feature                  Description                          Symbols
PoS                      PoS of Vt and N                      t1; t2
Context                  Neighboring PoSs                     t-2; t-1; t3; t4
Word                     Lexical word                         w1; w2
Semantic                 Semantic type of word                st1; st2
Additional Information   Morphological structure of the verb  Vmorph
                         Syllabic length of the noun          Nlen

Table 1. The features used in the Vt-N classifier

The example in Figure 1 illustrates the feature labeling of a Vt-N structure. First, an instance of a Vt-N structure is identified in the Treebank. Then, we assign the semantic type of each word without considering the problem of sense ambiguity for the moment. This is because sense ambiguities are partially resolved by PoS tagging, and the general problem of sense disambiguation is beyond the scope of this paper. Furthermore, Zhao and Huang (1999) demonstrated that the retained ambiguity does not have an adverse impact on identification. Therefore, we keep the ambiguous semantic type for future processing.
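The feature groups of Table 1 can be pictured as a single extraction step over a PoS-tagged sentence. The following Python sketch is our illustration of that step under stated assumptions: the helpers get_semantic_type and get_morph_structure stand in for the E-HowNet lookup and the morph-structure classifier of Section 3.2, and the syllabic length of the noun is approximated by its character count.

    def extract_features(words, tags, i, get_semantic_type, get_morph_structure):
        """Extract the Table 1 features for a Vt at position i followed by an N at i+1."""
        t = lambda k: tags[k] if 0 <= k < len(tags) else "NONE"  # padded PoS lookup
        return {
            "t1": tags[i], "t2": tags[i + 1],               # PoS of Vt and N
            "t-2": t(i - 2), "t-1": t(i - 1),               # left-context PoSs
            "t3": t(i + 2), "t4": t(i + 3),                 # right-context PoSs
            "w1": words[i], "w2": words[i + 1],             # lexical words
            "st1": get_semantic_type(words[i], tags[i]),    # semantic types (E-HowNet)
            "st2": get_semantic_type(words[i + 1], tags[i + 1]),
            "Vmorph": get_morph_structure(words[i]),        # VV / VR / AV / VO
            "Nlen": len(words[i + 1]),                      # syllables ~ characters
        }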
Figure 1. An example of a tree with a Vt-N structure. (The tree spans the sentence 這 zhe 'this' 造成 zaochen 'cause' 學習 xuexi 'learn' 中文 zhongwen 'Chinese' 的 DE 風潮 fongchao 'trend': "This causes the trend of learning Chinese.")

Table 2 shows the labeled features for "學習/learn 中文/Chinese" in Figure 1. The columns x and y describe the relevant features of "學習/learn" and "中文/Chinese" respectively. Some features are not explicitly annotated in the Treebank, e.g., the semantic types of words and the morphological structures of verbs. We propose labeling methods for them in the next sub-section.

Feature Type             x                 y
Word                     w1=學習           w2=中文
PoS                      t1=VC             t2=Na
Semantic                 st1=study|學習    st2=language|語言
Context                  t-2=Nep; t-1=VK; t3=DE; t4=Na
Additional Information   Vmorph=VV         Nlen=2
Relation Type            rt=H/X

Table 2. The feature labels of the Vt-N pair in Figure 1

3.2 Unknown Word Processing

In Chinese documents, 3% to 7% of the words are usually unknown (Sproat and Emerson, 2003). By 'unknown words', we mean words not listed in the dictionary. More specifically, in this paper, unknown words are words without semantic type information (i.e., E-HowNet expressions) and verbs without morphological structure information. Therefore, we propose a method for predicting the semantic types of unknown words, and use an affix database to train a morph-structure classifier that derives the morphological structures of verbs.

Morph-Structure Prediction of Verbs: We use data analyzed by Chiu et al. (2004) to develop a classifier for predicting the morphological structure of verbs. There are four types of morphological structures for verbs: the coordinating structure (VV), the modifier-head structure (AV), the verb-complement structure (VR), and the verb-object structure (VO). To classify verbs automatically, we incorporate three features in the proposed classifier, namely, the lexeme itself; the prefix and the suffix; and the semantic types of the prefix and the suffix. Then, we use training data from the affix database to train the classifier. Table 3 shows an example for the unknown verb "傳播到/disseminate", which the morph-structure classifier identifies as a 'VR' type.

Feature                             Value
Lexicon                             Word=傳播到
Prefix word                         PW=傳播
Semantic type of prefix word 傳播   PWST={disseminate|傳播}
Suffix word                         SW=到
Semantic type of suffix word 到     SWST={Vachieve|達成}

Table 3. An example of an unknown verb and the feature templates for morph-structure prediction

Semantic Type Provider: The system exploits WORD, PoS, affix, and E-HowNet information to obtain the semantic types of words (see Figure 2). If a word is known and its PoS is given, we can usually find its semantic type by searching the E-HowNet database. For an unknown word, the semantic type of its head morpheme serves as its semantic type, and the semantic type of the head morpheme is obtained from E-HowNet. For example, for the unknown word "傳播到/disseminate", the prefix word is "傳播/disseminate", whose semantic type according to E-HowNet is {disseminate|傳播}; therefore, we assign {disseminate|傳播} as the semantic type of "傳播到/disseminate". If the word or its head morpheme does not exist in the affix database, we assign a general semantic type based on its PoS, e.g., {thing|萬物} for nouns and {act|行動} for verbs. In this matching procedure, we may encounter multiple matches for words and affixes. Our strategy is to keep the ambiguous semantic types for future processing.

Input: WORD, PoS
Output: Semantic Type (ST)
procedure STP(WORD, PoS)
  (* Initial step *)
  ST := null;
  (* Step 1: Known word *)
  if WORD already in E-HowNet then
    ST := EHowNet(WORD, PoS)
  else if WORD in Affix database then
    ST := EHowNet(affix of WORD, PoS);
  (* Step 2: Unknown word *)
  if ST is null and PoS is 'Vt' then
    ST := EHowNet(prefix of WORD, PoS)
  else if ST is null and PoS is 'N' then
    ST := EHowNet(suffix of WORD, PoS);
  (* Step 3: Default *)
  if ST is null and PoS is 'Vt' then
    ST := 'act|行動'
  else if ST is null and PoS is 'N' then
    ST := 'thing|萬物';
  (* Finally *)
  STP := ST;
end;

Figure 2. Pseudo-code of the semantic type prediction algorithm. (The EHowNet function returns a null ST value when a word does not exist in E-HowNet or the affix database.)
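For readers who prefer executable code, the following Python transcription of Figure 2 is a minimal sketch, not the authors' implementation: ehownet_st and affix_head are assumed stand-ins for the E-HowNet query and the affix-database lookup, each returning None on a miss, and a single-character head morpheme is assumed when the affix database has no entry.

    def predict_semantic_type(word, pos, ehownet_st, affix_head):
        """Python sketch of procedure STP in Figure 2."""
        # Step 1: known word, or a word whose affix entry is known.
        st = ehownet_st(word, pos)
        if st is None and affix_head(word) is not None:
            st = ehownet_st(affix_head(word), pos)
        # Step 2: unknown word, so back off to the head morpheme
        # (prefix for a transitive verb, suffix for a noun).
        if st is None and pos == "Vt":
            st = ehownet_st(word[0], pos)    # simplification: 1-character prefix
        elif st is None and pos == "N":
            st = ehownet_st(word[-1], pos)   # simplification: 1-character suffix
        # Step 3: default semantic type by PoS.
        if st is None:
            st = "act|行動" if pos == "Vt" else "thing|萬物"
        return st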
3.3 Learning World Knowledge

Based on the features discussed in the previous sub-section, we extract prior knowledge from the Treebank to design the Vt-N classifier. However, training suffers from the data sparseness problem. Furthermore, most ambiguous Vt-N relations are resolved by common sense knowledge, which makes it even harder to construct a well-trained system. An alternative way to extend world knowledge is to learn from large-scale unlabeled data (Wu, 2003; Chen et al., 2008; Yu et al., 2008). However, an unsupervised approach accumulates the errors caused by automatic annotation processes, such as word segmentation, PoS tagging, syntactic parsing, and semantic role assignment. Therefore, how to extract useful knowledge accurately is an important issue. To resolve the error accumulation problem, we propose two methods: unsupervised NP selection and supervised error correction.

The NP selection method exploits the fact that an intransitive verb followed by a noun can only be interpreted as an NP structure, not a VP structure. It is easy to find such instances with high precision by parsing a large corpus. Based on the selection method, we can extend contextual knowledge about NP(V+N) and extract nouns that take adjectival verbs as modifiers. The error correction method involves a small amount of manual editing in order to make the data more useful and reduce the number of errors in the auto-extracted knowledge. The rationale is that, in general, a high-frequency Vt-N word bigram is either VP or NP without ambiguity. Therefore, to obtain more accurate training data, we simply classify each high-frequency Vt-N word bigram into a unique correct type without checking all of its instances. We provide more detailed information about the method in Section 4.3.

4 Experiments and Results

4.1 Experimental Setting

We classify Vt-N structures into four types of syntactic relations by using the bracketed information (tree structure) and the dependency relations (head-modifier) to extract the Vt-N relations from the treebank automatically. The resources used in the experiments are as follows.

Treebank: The Sinica Treebank contains 61,087 syntactic tree structures with 361,834 words. We extracted 9,017 instances of Vt-N structures from the corpus. Then, we randomly selected 1,000 of the instances as test data and used the remainder (8,017 instances) as training data. The labeled word segmentation and PoS-tagging information was retained and utilized in the experiments.

E-HowNet: E-HowNet contains 99,525 lexical semantic definitions that provide information about the semantic types of words. We also implement the semantic type prediction algorithm of Figure 2 to generate the semantic types of all Vt and N words, including unknown words.
Affix Data: The database includes 13,287 examples of verbs and 27,267 examples of nouns; each example relates to an affix. Detailed statistics of the verb morph-structure categorization are shown in Table 4. The data is used to train a classifier to predict the morph-structure of verbs. We found that verbs with a conjunctive structure (VV) are more likely to play adjectival roles than the other three types of verbs. The classifier achieved 87.88% accuracy in 10-fold cross validation on the above 13,287 verbs.

          VV    VR     AV   VO
Prefix    920   2,892  904  662
Suffix    439   7,388  51   31

Table 4. Statistics of the verb morph-structure categorization

Large Corpus: We used a Chinese parser to analyze sentence structures automatically. The auto-parsed tree structures are used in Experiment 2 (described in Sub-section 4.3). We obtained 1,262,420 parsed sentences and derived 237,843 instances of Vt-N structures as our dataset (called ASBC).

4.2 Experiment 1: Evaluation of the Vt-N Classifier

In this experiment, we used the Maximum Entropy toolkit (Zhang, 2004) to develop the Vt-N classifier. Based on the features discussed in Section 3.1, we designed five models to evaluate the classifier's performance on different feature combinations. The features used in each model are described below (a code sketch of the corresponding feature templates follows Table 5); the feature values shown in brackets refer to the example in Figure 1.

• M1 is the baseline model. It uses PoS-tag pairs as features, such as (t1=VC, t2=Na).
• M2 extends the M1 model by adding the context features (t-1=VK, t1=VC), (t2=Na, t3=DE), (t-2=Nep, t-1=VK, t1=VC), (t2=Na, t3=DE, t4=Na), and (t-1=VK, t3=DE).
• M3 extends the M2 model by adding the lexical features (w1=學習, t1=VC, w2=中文, t2=Na), (w1=學習, w2=中文), (w1=學習), and (w2=中文).
• M4 extends the M3 model by adding the semantic features (st1=study|學習, t1=VC, st2=language|語言, t2=Na), (st1=study|學習, t1=VC), and (st2=language|語言, t2=Na).
• M5 extends the M4 model by adding two features: the morph-structure of the verb (Vmorph=VV) and the syllabic length of the noun (Nlen=2).

Table 5 shows the results of using different feature combinations in the models. P1(%) is the 10-fold cross validation accuracy on the training data, and P2(%) is the accuracy on the test data. By adding contextual features, the accuracy rate of M2 increases from 59.10% to 72.30%. The result shows that contextual information is the most important feature for disambiguating VP, NP, and independent structures. The accuracy of M2 is approximately the same as the result of our PCFG parser because both systems use contextual information. By adding lexical features (M3), the accuracy rate increases from 72.30% to 80.20%. With the semantic type features (M4), the accuracy rate increases from 80.20% to 81.90%; the 1.7% gain indicates that semantic generalization is useful. Finally, in M5, the accuracy rate increases from 81.90% to 83.00%. The improvement demonstrates the benefit of the verb morph-structure and noun length features.

Model  Features for Vt-N                                          P1(%)   P2(%)
M1     (t1,t2)                                                    61.94   59.10
M2     + (t-1,t1), (t2,t3), (t-2,t-1,t1), (t2,t3,t4), (t-1,t3)    76.59   72.30
M3     + (w1,t1,w2,t2), (w1,w2), (w2), (w1)                       83.55   80.20
M4     + (st1,t1,st2,t2), (st1,t1), (st2,t2)                      84.63   81.90
M5     + (Vmorph), (Nlen)                                         85.01   83.00

Table 5. The results of using different feature combinations
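The incremental models M1-M5 can be read as successively larger sets of feature conjunctions. The sketch below is our illustration, not the authors' code: it materializes the conjunction templates as string-valued indicator features and trains a maximum entropy classifier, with scikit-learn's LogisticRegression standing in for Zhang's (2004) toolkit.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    TEMPLATES_M5 = [
        ("t1", "t2"),                                               # M1
        ("t-1", "t1"), ("t2", "t3"), ("t-2", "t-1", "t1"),
        ("t2", "t3", "t4"), ("t-1", "t3"),                          # + M2
        ("w1", "t1", "w2", "t2"), ("w1", "w2"), ("w1",), ("w2",),   # + M3
        ("st1", "t1", "st2", "t2"), ("st1", "t1"), ("st2", "t2"),   # + M4
        ("Vmorph",), ("Nlen",),                                     # + M5
    ]

    def conjoin(feats, template):
        # e.g. ("t-1", "t1") -> "t-1=VK|t1=VC"
        return "|".join(f"{name}={feats[name]}" for name in template)

    def to_indicator_features(feats):
        # feats: the dict produced by extract_features() above.
        return {conjoin(feats, tpl): 1 for tpl in TEMPLATES_M5}

    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    # model.fit([to_indicator_features(f) for f in train_feats], train_labels)
    # probs = model.predict_proba([to_indicator_features(f) for f in test_feats])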
Next, we consider the influence of unknown words on the Vt-N classifier. The statistics show that 17% of the words in the Treebank lack semantic type information, e.g., 留在/StayIn, 填飽/fill, 貼出/posted, and 綁好/tied. Without semantic type information for unknown words, the accuracy of the Vt-N classifier declines by 0.7%; in other words, lexical semantic information improves the accuracy of the Vt-N classifier. Regarding the problem of unknown morph-structures, we observe that over 85% of verbs with more than two characters are not found in the affix database. If we exclude unknown words, the accuracy of the Vt-N prediction decreases by 1%. Therefore, morph-structure information has a positive effect on the classifier.

4.3 Experiment 2: Using Knowledge Obtained from Large-scale Unlabeled Data by the Selection and Correction Methods

In this experiment, we evaluated the two methods discussed in Section 3, i.e., unsupervised NP selection and supervised error correction; a sketch of both steps follows Table 8 below. We applied the data selection method (i.e., distance=1, with an intransitive verb (Vi) followed by an object noun (Na)) to select 46,258 instances from the ASBC corpus and compile a dataset called Treebank+ASBC-Vi-N. Table 6 shows the performance of model 5 (M5) on the training data derived from Treebank and from Treebank+ASBC-Vi-N. The results demonstrate that learning more nouns that accept verbal modifiers improves the accuracy.

                             Treebank   Treebank+ASBC-Vi-N
Size of training instances   8,017      46,258
M5 - P2(%)                   83.00      83.90

Table 6. Experiment results on the test data for various knowledge sources

We also tried to use the auto-parsed Vt-N structures from the ASBC corpus as supplementary training data for M5, but the errors they contain degrade the model's performance too much. To resolve the problem, we utilize the supervised error correction method, in which errors can be corrected manually and rapidly because high-frequency instances (w1, w2) rarely have ambiguous classifications in different contexts. We therefore designed an editing tool to correct errors made by the parser in the classification of high-frequency Vt-N word pairs. After the manual correction operation, which takes 40 man-hours, we assign the correct classifications (w1, t1, w2, t2, rt) for 2,674 Vt-N structure types, which contain 10,263 instances, to create the ASBC+Correction dataset. Adding the corrected data to the original training data increases the precision rate to 88.40% and reduces the number of errors by approximately 31.76%, as shown in the Treebank+ASBC+Correction column of Table 7.

                             Treebank   Treebank+ASBC-Vi-N   Treebank+ASBC+Correction
Size of training instances   8,017      46,258               56,521
M5 - P2(%)                   83.00      83.90                88.40

Table 7. Experiment results of classifiers with different training data

We also used the precision and recall rates to evaluate the performance of the models on each type of relation. The results are shown in Table 8. Overall, the Treebank+ASBC+Correction method achieves the best performance in terms of the precision rate. The results for Treebank+ASBC-Vi-N show that the unsupervised data selection method can find some knowledge that helps identify NP structures. In addition, the proposed models achieve better precision rates than the PCFG parser. The results demonstrate that using our guidelines to design a disambiguation model to resolve the Vt-N problem is successful.

       Treebank        Treebank+ASBC-Vi-N   Treebank+ASBC+Correction   PCFG
       P(%)   R(%)     P(%)   R(%)          P(%)   R(%)                P(%)   R(%)
H/X    91.11  84.43    91.00  84.57         98.62  86.63               90.54  78.24
X/H    67.90  78.57    72.22  72.67         60.49  88.29               23.63  73.58
X/X    74.62  81.86    71.54  85.71         83.08  93.51               80.21  75.00

Table 8. Performance comparison of the different classification models
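Both acquisition steps of this experiment can be sketched in a few lines. The code below is our illustration, not the authors' editing tool: is_intransitive abstracts the PoS test for Vi (the tags shown are only plausible examples), and min_count is an arbitrary illustrative threshold; the correction step merely ranks auto-parsed Vt-N types by frequency so that an annotator can assign one relation label per high-frequency type.

    from collections import Counter

    def is_intransitive(tag):
        # Placeholder: Sinica PoS tags for intransitive verbs (illustrative).
        return tag in {"VA", "VH"}

    def select_np_instances(parsed_sentences):
        """Unsupervised NP selection: a Vi directly followed by an Na (distance=1)
        can only form an NP, so such bigrams are reliable 'X/H' training data."""
        for words, tags in parsed_sentences:
            for i in range(len(words) - 1):
                if is_intransitive(tags[i]) and tags[i + 1] == "Na":
                    yield (words[i], tags[i], words[i + 1], tags[i + 1], "X/H")

    def rank_types_for_correction(vtn_instances, min_count=5):
        """Supervised error correction: rank Vt-N *types* by frequency; an
        annotator then fixes one label per type instead of per instance."""
        counts = Counter((w1, t1, w2, t2) for (w1, t1, w2, t2, _) in vtn_instances)
        return [vtn for vtn, c in counts.most_common() if c >= min_count]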
4.4 Experiment 3: Integrating the Vt-N Classifier with the PCFG Parser

Identifying Vt-N structures correctly facilitates statistical parsing, machine translation, information retrieval, and text classification. In this experiment, we developed a baseline PCFG parser based on the feature-based grammar representation of Hsieh et al. (2012) to find the best tree structure (T) of a given sentence (S). The parser selects the best tree according to the evaluation score Score(T,S) over all possible trees. If there are n PCFG rules in the tree T, then Score(T,S) is the accumulation of the logarithmic probabilities RPi of the grammar rules, as shown in Formula (1):

Score(T,S) = Σ_{i=1}^{n} RPi    (1)

The Vt-N models can be easily integrated into the PCFG parser. Formula (2) represents the integrated structural evaluation model. We combine RPi and VtNPi with the weights w1 and w2 respectively, and set the value of w2 higher than that of w1. VtNPi is the probability produced by the Vt-N classifier for the relation type of the Vt-N bigram determined by the PCFG parsing. The classifier is triggered only when a [Vt, N] structure is encountered; otherwise, the Vt-N model is not applied (a code sketch of this integration follows Table 10).

Score(T,S) = Σ_{i=1}^{n} (w1 × RPi + w2 × VtNPi)    (2)

The results of evaluating the parsing model incorporating the Vt-N classifier (see Formula (2)) are shown in Tables 9 and 10. P2 is the accuracy of Vt-N classification on the test data; the bracketed f-score (BF) is the parsing performance metric, computed as (BP*BR*2)/(BP+BR), where BP is the bracketed precision and BR is the bracketed recall. Based on these results, the integrated model outperforms the PCFG parser in terms of Vt-N classification. Because the Vt-N classifier only considers sentences that contain Vt-N structures, it does not affect the parsing accuracy of other sentences.

                       P2(%)   BF(%)
PCFG + M5 (Treebank)   80.68   83.64
PCFG                   77.09   82.80

Table 9. The performance of the PCFG parser with and without model M5 trained on Treebank

                                       P2(%)   BF(%)
PCFG + M5 (Treebank+ASBC+Correction)   87.88   84.68
PCFG                                   77.09   82.80

Table 10. The performance of the PCFG parser with and without model M5 trained on the Treebank+ASBC+Correction data set
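The integration of Formula (2) amounts to adding a weighted classifier term to the baseline tree score. The sketch below is our reading of the formulas, not the authors' implementation; whether VtNPi enters as a raw probability or a log-probability is not spelled out in the paper, so the sketch logs it for consistency with the log-probabilities RPi, and the weights w1 and w2 are illustrative.

    import math

    def score_tree(rule_logprobs, vtn_probs, w1=1.0, w2=2.0):
        """rule_logprobs: log-probabilities RP_i of the n PCFG rules in tree T.
        vtn_probs: classifier probabilities VtNP_i, one per [Vt, N] bigram in T
        (empty when the sentence contains no Vt-N structure, in which case
        the score reduces to Formula (1))."""
        score = w1 * sum(rule_logprobs)                    # Formula (1) term
        score += w2 * sum(math.log(p) for p in vtn_probs)  # Vt-N classifier term
        return score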
4.5 Experiment 4: Comparison of Various Chinese Parsers

In this experiment, we compare several parsers: the PCFG parser (our baseline), the CDM parser (Hsieh et al., 2012), and the Berkeley parser (Petrov et al., 2006). The CDM parser achieved the best score in the Traditional Chinese Parsing task of the SIGHAN Bake-offs 2012 (Tseng et al., 2012), and Petrov's Berkeley parser (version 1.1, 2009) is the best open-source PCFG parser for non-English languages. In our comparison, we use the same training data to train all the models and parse the same test data based on gold-standard word segmentation and PoS tags. We have already discussed the PCFG parser in Section 4.4. For the CDM parser, we retrain the relevant model in our experiments. Since the Berkeley parser assumes a different tree structure (the Penn Treebank format), we transform the experimental data into the Berkeley CoNLL format and retrain a new model with the parameters "-treebank CHINESE -SMcycles 4", which gave the best results in the Traditional Chinese Parsing task of the SIGHAN Bake-offs 2012. Moreover, we use the "-useGoldPOS" parameter to parse the test data, and then transform the Berkeley parser's output back into the Sinica Treebank style. The tree structure formats of the Sinica Treebank and the Penn Treebank differ as follows:

Sinica Treebank:
S(NP(Head:Nh:他們)|Head:VC:散播|NP(Head:Na:熱情))

Penn Treebank:
( (S (NP (Head:Nh (Nh 他們))) (Head:VC (VC 散播)) (NP (Head:Na (Na 熱情)))))

The evaluation results on the test data, in terms of the P2 metric, are as follows: the accuracy of the PCFG parser is 77.09%; the CDM parser reaches 78.45%; and the Berkeley parser achieves 70.68%. The results show that the Vt-N problem cannot be solved well by any general parser, including the CDM parser and the Berkeley parser; a dedicated approach is necessary in addition to the general model. We therefore set as our target a better model for Vt-N classification that can be easily integrated into existing parsing models; so far, our best model achieves a P2 accuracy of 87.88%.

5 Concluding Remarks

We have proposed a classifier to resolve the ambiguity of Vt-N structures. The design of the classifier is based on three important guidelines, namely, adopting linguistically motivated features, using all available resources, and allowing easy integration into a parsing model. After analyzing Vt-N structures, we identified linguistically motivated features, such as lexical words, semantic knowledge, the morphological structures of verbs, neighboring parts-of-speech, and the syllabic lengths of words. We then designed a classifier to verify the usefulness of each feature. We also resolved the technical problems that affect the prediction of the semantic types and morph-structures of unknown words. In addition, we proposed a framework of unsupervised data selection and supervised error correction for learning more useful knowledge. Our experiment results show that the proposed Vt-N classifier significantly outperforms the PCFG Chinese parser in terms of Vt-N structure identification. Moreover, integrating the Vt-N classifier with a parsing model improves the overall parsing performance without side effects.

In our future research, we will exploit the proposed framework to resolve other parsing difficulties in Chinese, e.g., the N-N combination. We will also extend the semantic type prediction algorithm (Figure 2) to deal with all Chinese words. Finally, for real world knowledge learning, we will continue to learn more useful knowledge by auto-parsing to improve parsing performance.

Acknowledgments

We thank the anonymous reviewers for their valuable comments. This work was supported by the National Science Council under Grant NSC99-2221-E-001-014-MY3.

References

Li-li Chang, Keh-Jiann Chen, and Chu-Ren Huang. 2000. Alternation Across Semantic Fields: A Study on Mandarin Verbs of Emotion. International Journal of Computational Linguistics and Chinese Language Processing (IJCLCLP), 5(1):61-80.

Keh-Jiann Chen, Chu-Ren Huang, Chi-Ching Luo, Feng-Yi Chen, Ming-Chung Chang, Chao-Jan Chen, and Zhao-Ming Gao. 2003. Sinica Treebank: Design Criteria, Representational Issues and Implementation. In A. Abeillé (ed.), Treebanks: Building and Using Parsed Corpora, pages 231-248. Dordrecht, the Netherlands: Kluwer.

Li-jiang Chen. 2008. Autolabeling of VN Combination Based on Multi-classifier. Journal of Computer Engineering, 34(5):79-81.
Wenliang Chen, Daisuke Kawahara, Kiyotaka Uchimoto, Yujie Zhang, and Hitoshi Isahara. 2008. Dependency Parsing with Short Dependency Relations in Unlabeled Data. In Proceedings of the Third International Joint Conference on Natural Language Processing (IJCNLP), pages 88-94.

Chih-ming Chiu, Ji-Chin Lo, and Keh-Jiann Chen. 2004. Compositional Semantics of Mandarin Affix Verbs. In Proceedings of the Research on Computational Linguistics Conference (ROCLING), pages 131-139.

Yu-Ming Hsieh, Ming-Hong Bai, Jason S. Chang, and Keh-Jiann Chen. 2012. Improving PCFG Chinese Parsing with Context-Dependent Probability Re-estimation. In Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 216-221.

Shu-Ling Huang, You-Shan Chung, and Keh-Jiann Chen. 2008. E-HowNet: the Expansion of HowNet. In Proceedings of the First National HowNet Workshop, pages 10-22, Beijing, China.

Chunhi Liu. 2008. Xiandai Hanyu Shuxing Fanchou Yanjiu (現代漢語屬性範疇研究). Chengdu: Bashu Books.

Jiaju Mei, Yiming Lan, Yunqi Gao, and Yongxian Ying. 1983. A Dictionary of Synonyms. Shanghai Cishu Chubanshe.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning Accurate, Compact, and Interpretable Tree Annotation. In Proceedings of COLING/ACL, pages 433-440.

Likun Qiu. 2005. Constitutive Relation Analysis for V-N Phrases. Journal of Chinese Language and Computing, 15(3):173-183.

Richard Sproat and Thomas Emerson. 2003. The First International Chinese Word Segmentation Bakeoff. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 133-143.

Honglin Sun and Dan Jurafsky. 2003. The Effect of Rhythm on Structural Disambiguation in Chinese. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 39-46.

Yuen-Hsien Tseng, Lung-Hao Lee, and Liang-Chih Yu. 2012. Traditional Chinese Parsing Evaluation at SIGHAN Bake-offs 2012. In Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 199-205.

Andi Wu. 2003. Learning Verb-Noun Relations to Improve Parsing. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 119-124.

Kun Yu, Daisuke Kawahara, and Sadao Kurohashi. 2008. Chinese Dependency Parsing with Large Scale Automatically Constructed Case Structures. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 1049-1056.

Jun Zhao and Chang-ning Huang. 1999. The Complex-feature-based Model for Acquisition of VN-construction Structure Templates. Journal of Software, 10(1):92-99.

Le Zhang. 2004. Maximum Entropy Modeling Toolkit for Python and C++. Reference Manual.