Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge

Wilson Yiksen Wong
M.Sc. (Information and Communication Technology), 2005
B.IT. (HONS) (Data Communication), 2003

This thesis is presented for the degree of Doctor of Philosophy of The University of Western Australia, School of Computer Science and Software Engineering.

September 2009

To my wife Saujoe and my parents and sister

Abstract

The ability to provide abstractions of documents in the form of important concepts and their relations is a key asset, not only for bootstrapping the Semantic Web, but also for relieving us from the pressure of information overload. At present, the only viable solution for arriving at these abstractions is manual curation. In this research, ontology learning techniques are developed to automatically discover terms, concepts and relations from text documents.

Ontology learning techniques rely on extensive background knowledge, ranging from unstructured data such as text corpora to structured data such as semantic lexicons. Manually-curated background knowledge is a scarce resource for many domains and languages, and the effort and cost required to keep such resources up to date are often high. More importantly, the size and coverage of manually-curated background knowledge are often inadequate to meet the requirements of most ontology learning techniques.

This thesis investigates the use of the Web as the sole source of dynamic background knowledge across all phases of ontology learning for constructing term clouds (i.e. visual depictions of terms) and lightweight ontologies from documents. To demonstrate the significance of term clouds and lightweight ontologies, a system for ontology-assisted document skimming and scanning is developed. This thesis presents a novel ontology learning approach that is devoid of any manually-curated resources, and is applicable across a wide range of domains (the current focus is medicine, technology and economics). More specifically, this research proposes and develops a set of novel techniques that take advantage of Web data to address the following problems: (1) the absence of integrated techniques for cleaning noisy data; (2) the inability of current term extraction techniques to systematically explicate, diversify and consolidate their evidence; (3) the inability of current corpus construction techniques to automatically create very large, high-quality text corpora using a small number of seed terms; and (4) the difficulty of locating and preparing features for clustering and extracting relations.

This dissertation is organised as a series of published papers that contribute to a complete and coherent theme. The work on the individual techniques of the proposed ontology learning approach has resulted in a total of nineteen published articles: two book chapters, four journal articles, and thirteen refereed conference papers. The proposed approach makes several major contributions to each task in ontology learning.
These include: (1) a technique for simultaneously correcting noise such as spelling errors, expanding abbreviations and restoring improper casing in text; (2) a novel probabilistic measure for recognising multi-word phrases; (3) a probabilistic framework for recognising domain-relevant terms using formal word distribution models; (4) a novel technique for constructing very large, high-quality text corpora using only a small number of seed terms; and (5) novel techniques for clustering terms and discovering coarse-grained semantic relations using featureless similarity measures and dynamic Web data. In addition, a comprehensive review is included to provide background on ontology learning and recent advances in this area. The implementation details of the proposed techniques are provided at the end, together with a description of how the system is used to automatically discover term clouds and lightweight ontologies for document skimming and scanning.

Acknowledgements

First and foremost, this dissertation would not have come into being without the continuous support provided by my supervisors Dr Wei Liu and Prof Mohammed Bennamoun. Their insightful guidance, financial support and broad interest made my research journey at the School of Computer Science and Software Engineering (CSSE) an extremely fruitful and enjoyable one. I am proud to have Wei and Mohammed as my mentors and personal friends. I would also like to thank Dr Krystyna Haq, Mrs Jo Francis and Prof Robyn Owens for being there to answer my questions on general research skills and scholarships.

A very big thank you goes to the Australian Government and the University of Western Australia for sponsoring this research under the International Postgraduate Research Scholarship and the University Postgraduate Award for International Students. I am also very grateful to CSSE, and to Dr David Glance of the Centre for Software Practice (CSP), for providing me with a multitude of opportunities to pursue this research further.

I would like to thank the other members of CSSE, including Prof Rachel Cardell-Oliver, Assoc/Prof Chris McDonald and Prof Michael Wise, for their advice. My appreciation goes to my office mates Faisal, Syed, Suman and Majigaa. A special thank you to the members of CSSE's support team, namely Laurie, Ashley, Ryan, Sam and Joe, for always being there to restart the virtual machine and to fix my laptop computers after accidental spills. Not forgetting the amicable people in CSSE's administration office, namely Jen Redman, Nicola Hallsworth, Ilse Lorenzen, Rachael Offer, Jayjay Jegathesan and Jeff Pollard, for attending to my administrative and travel needs, and for making my stay at CSSE an extremely enjoyable one.

I also had the pleasure of meeting many reputable researchers during my travels, whose advice has been invaluable. To name a few: Prof Kyo Kageura, Prof Udo Hahn, Prof Robert Dale, Assoc/Prof Christian Gutl, Prof Arno Scharl, Prof Albert Yeap, Dr Timothy Baldwin, and Assoc/Prof Stephen Bird. A special thank you to the wonderful people at the Department of Information Science, University of Otago for being such gracious hosts during my visit to Dunedin. I would also like to extend my gratitude for the constant support and advice provided by researchers at the Curtin University of Technology, namely Prof Moses Tade, Assoc/Prof Hongwei Wu, Dr Nicoleta Balliu and Prof Tharam Dillon.
In addition, my appreciation goes to my previous mentors Assoc/Prof Ongsing Goh and Prof Shahrin Sahib of the Technical University of Malaysia Malacca (UTeM), and Assoc/Prof R. Mukundan of the University of Canterbury. I should also acknowledge my many friends and colleagues at the Faculty of Information and Communication Technology at UTeM. My thanks also go to the anonymous reviewers who have commented on all publications that have arisen from this thesis.

Last but not least, I will always remember the unwavering support provided by my wife Saujoe, my parents and my only sister, without which I would not have cruised through this research journey so pleasantly. Also, a special appreciation to the city of Perth for being such a nice place to live in and to undertake this research.

Contents

List of Figures
Publications Arising from this Thesis
Contribution of Candidate to Published Work

1 Introduction
   1.1 Problem Description
   1.2 Thesis Statement
   1.3 Overview of Solution
      1.3.1 Text Preprocessing (Chapter 3)
      1.3.2 Text Processing (Chapter 4)
      1.3.3 Term Recognition (Chapter 5)
      1.3.4 Corpus Construction for Term Recognition (Chapter 6)
      1.3.5 Term Clustering and Relation Acquisition (Chapters 7 and 8)
   1.4 Contributions
   1.5 Layout of Thesis

2 Background
   2.1 Ontologies
   2.2 Ontology Learning from Text
      2.2.1 Outputs from Ontology Learning
      2.2.2 Techniques for Ontology Learning
      2.2.3 Evaluation of Ontology Learning Techniques
   2.3 Existing Ontology Learning Systems
      2.3.1 Prominent Ontology Learning Systems
      2.3.2 Recent Advances in Ontology Learning
   2.4 Applications of Ontologies
   2.5 Chapter Summary

3 Text Preprocessing
   3.1 Introduction
   3.2 Related Work
   3.3 Basic ISSAC as Part of Text Preprocessing
   3.4 Enhancement of ISSAC
   3.5 Evaluation and Discussion
   3.6 Conclusion
   3.7 Acknowledgement
   3.8 Other Publications on this Topic

4 Text Processing
   4.1 Introduction
   4.2 Related Works
   4.3 A Probabilistic Measure for Unithood Determination
      4.3.1 Noun Phrase Extraction
      4.3.2 Determining the Unithood of Word Sequences
   4.4 Evaluations and Discussions
   4.5 Conclusion and Future Work
   4.6 Acknowledgement
   4.7 Other Publications on this Topic

5 Term Recognition
   5.1 Introduction
   5.2 Notations and Datasets
   5.3 Related Works
      5.3.1 Existing Probabilistic Models for Term Recognition
      5.3.2 Existing Ad-Hoc Techniques for Term Recognition
      5.3.3 Word Distribution Models
   5.4 A New Probabilistic Framework for Determining Termhood
      5.4.1 Parameters Estimation for Term Distribution Models
      5.4.2 Formalising Evidences in a Probabilistic Framework
   5.5 Evaluations and Discussions
      5.5.1 Qualitative Evaluation
      5.5.2 Quantitative Evaluation
   5.6 Conclusions
   5.7 Acknowledgement
   5.8 Other Publications on this Topic

6 Corpus Construction for Term Recognition
   6.1 Introduction
   6.2 Related Research
      6.2.1 Webpage Sourcing
      6.2.2 Relevant Text Identification
      6.2.3 Variability of Search Engine Counts
   6.3 Analysis of Website Contents for Corpus Construction
      6.3.1 Website Preparation
      6.3.2 Website Filtering
      6.3.3 Website Content Localisation
   6.4 Evaluations and Discussions
      6.4.1 The Impact of Search Engine Variations on Virtual Corpus Construction
      6.4.2 The Evaluation of HERCULES
      6.4.3 The Performance of Term Recognition using SPARTAN-based Corpora
   6.5 Conclusions
   6.6 Acknowledgement
   6.7 Other Publications on this Topic

7 Term Clustering for Relation Acquisition
   7.1 Introduction
   7.2 Existing Techniques for Term Clustering
   7.3 Background
      7.3.1 Normalised Google Distance
      7.3.2 Ant-based Clustering
   7.4 The Proposed Tree-Traversing Ants
      7.4.1 First-Pass using Normalised Google Distance
      7.4.2 n-degree of Wikipedia: A New Distance Metric
      7.4.3 Second-Pass using n-degree of Wikipedia
   7.5 Evaluations and Discussions
   7.6 Conclusion and Future Work
   7.7 Acknowledgement
   7.8 Other Publications on this Topic

8 Relation Acquisition
   8.1 Introduction
   8.2 Related Work
   8.3 A Hybrid Technique for Relation Acquisition
      8.3.1 Lexical Simplification
      8.3.2 Word Disambiguation
      8.3.3 Association Inference
   8.4 Initial Experiments and Discussions
   8.5 Conclusion and Future Work
   8.6 Acknowledgement

9 Implementation
   9.1 System Implementation
   9.2 Ontology-based Document Skimming and Scanning
   9.3 Chapter Summary

10 Conclusions and Future Work
   10.1 Summary of Contributions
   10.2 Limitations and Implications for Future Research

Bibliography

List of Figures

1.1 An overview of the five phases in the proposed ontology learning system, and how the details of each phase are outlined in certain chapters of this dissertation.

1.2 Overview of the ISSAC and HERCULES techniques used in the text preprocessing phase of the proposed ontology learning system. ISSAC and HERCULES are described in Chapters 3 and 6, respectively.

1.3 Overview of the UH and OU measures used in the text processing phase of the proposed ontology learning system. These two measures are described in Chapter 4.

1.4 Overview of the TH and OT measures used in the term recognition phase of the proposed ontology learning system. These two measures are described in Chapter 5.

1.5 Overview of the SPARTAN technique used in the corpus construction phase of the proposed ontology learning system. The SPARTAN technique is described in Chapter 6.

1.6 Overview of the ARCHILES technique used in the relation acquisition phase of the proposed ontology learning system. The ARCHILES technique is described in Chapter 8, while the TTA clustering technique and noW measure are described in Chapter 7.

2.1 The spectrum of ontology kinds, adapted from Giunchiglia & Zaihrayeu [89].

2.2 Overview of the outputs, tasks and techniques of ontology learning.

3.1 Examples of spelling errors, ad-hoc abbreviations and improper casing in a chat record.

3.2 The accuracy of basic ISSAC from previous evaluations.
3.3 The breakdown of the causes behind the incorrect replacements by basic ISSAC.

3.4 Accuracy of enhanced ISSAC over seven evaluations.

3.5 The breakdown of the causes behind the incorrect replacements by enhanced ISSAC.

4.1 The output by the Stanford Parser. The tokens in the "modifiee" column marked with squares are head nouns, and the corresponding tokens along the same rows in the "word" column are the modifiers. The first column "offset" is subsequently represented using the variable i.

4.2 The output of the head-driven noun phrase chunker. The tokens which are highlighted with a darker tone are the head nouns. The underlined tokens are the corresponding modifiers identified by the chunker.

4.3 The probabilities of the areas with darker shade are the denominators required by the evidences e1 and e2 for the estimation of OU(s).

4.4 The performance of OU (from Experiment 1) and UH (from Experiment 2) in terms of precision, recall and accuracy. The last column shows the difference between the performance of Experiments 1 and 2.

5.1 Summary of the datasets employed throughout this chapter for experiments and evaluations.

5.2 Distribution of 3,058 words randomly sampled from the domain corpus d. The line with the label "KM" is the aggregation of the individual probability of occurrence of word i in a document, 1 − P(0; αi, βi), using the K-mixture with αi and βi defined in Equations 5.21 and 5.20. The line with the label "ZM-MF" is the manually fitted Zipf-Mandelbrot model. The line labeled "RF" is the actual rate of occurrence computed as fi/F.

5.3 Parameters for the manually fitted Zipf-Mandelbrot model for the set of 3,058 words randomly drawn from d.

5.4 Distribution of the same 3,058 words as employed in Figure 5.2. The line with the label "ZM-OLR" is the Zipf-Mandelbrot model fitted using the ordinary least squares method. The line labeled "ZM-WLS" is the Zipf-Mandelbrot model fitted using the weighted least squares method, while "RF" is the actual rate of occurrence computed as fi/F.

5.5 Summary of the sum of squares of residuals, SSR, and the coefficient of determination, R2, for the regression using manually estimated parameters, parameters estimated using ordinary least squares (OLS), and parameters estimated using weighted least squares (WLS). Obviously, the smaller the SSR is, the better the fit. As for 0 ≤ R2 ≤ 1, the upper bound is achieved when the fit is perfect.

5.6 Parameters for the automatically fitted Zipf-Mandelbrot model for the set of 3,058 words randomly drawn.

5.7 Distribution of the 1,954 terms extracted from the domain corpus d sorted according to the corresponding scores provided by OT and TH. The single dark smooth line stretching from the left (highest value) to the right (lowest value) of the graph is the scores assigned by the respective measures. As for the two oscillating lines, the dark line is the domain frequencies while the light one is the contrastive frequencies.

5.8 Distribution of the 1,954 terms extracted from the domain corpus d sorted according to the corresponding scores provided by NCV and CW. The single dark smooth line stretching from the left (highest value) to the right (lowest value) of the graph is the scores assigned by the respective measures. As for the two oscillating lines, the dark line is the domain frequencies while the light one is the contrastive frequencies.

5.9 The means µ of the scores, standard deviations σ of the scores, sums of the domain frequencies and of the contrastive frequencies of all term candidates, and their ratio.

5.10 The Spearman rank correlation coefficients ρ between all possible pairs of measures under evaluation.

5.11 An example of a contingency table. The values in the cells TP, TN, FP and FN are employed to compute the precision, recall, Fα and accuracy. Note that |TC| is the total number of term candidates in the input set TC, and |TC| = TP + FP + FN + TN.

5.12 The collection of all contingency tables for all termhood measures X across all the 10 bins BjX. The first column contains the rank of the bins and the second column shows the number of term candidates in each bin. The third general column "termhood measures, X" holds all the 10 contingency tables for each measure X, organised column-wise, bringing the total number of contingency tables to 40 (i.e. 10 bins, organised in rows, by 4 measures). The structure of the individual contingency tables follows the one shown in Figure 5.11. The last column is the row-wise sums of TP + FP and FN + TN. The rows beginning from the second row until the second last are the rank bins. The last row is the column-wise sums of TP + FN and FP + TN.

5.13 Performance indicators for the four termhood measures in 10 respective bins. Each row shows the performance achieved by the four measures in a particular bin. The columns contain the performance indicators for the four measures. The notation pre stands for precision, rec is recall and acc is accuracy. We use two different α values, resulting in two F-scores, namely F0.1 and F1. The values of the performance measures with darker shades are the best performing ones.

6.1 A diagram summarising our web partitioning technique.

6.2 An illustration of an example sample space on which the probabilities employed by the filter are based. The space within the dot-filled circle consists of all webpages from all sites in J containing W. The m rectangles represent the collections of all webpages of the respective sites {u1, ..., um}. The shaded but not dot-filled portion of the space consists of all webpages from all sites in J that do not contain W. The individual shaded but not dot-filled portion within each rectangle is the collection of webpages in the respective sites ui ∈ J that do not contain W.

6.3 A summary of the number of websites returned by the respective search engines for each of the two domains. The number of common sites is also provided.

6.4 A summary of the Spearman's correlation coefficients between websites before and after re-ranking by PROSE. The native columns show the correlation between the websites when sorted according to their native ranks provided by the respective search engines.

6.5 The number of sites with OD less than −6 after re-ranking using PROSE based on page count information provided by the respective search engines.

6.6 A listing of the 43 sites included in SPARTAN-V.

6.7 The number of documents and tokens from the local and virtual corpora used in this evaluation.

6.8 The contingency tables summarising the term recognition results using the various specialised corpora.

6.9 A summary of the performance metrics for term recognition.

7.1 Example of TTA at work.

7.2 Experiment using 15 terms from the wine domain. Setting sT = 0.92 results in 5 clusters. Cluster A is simply red wine grapes or red wines, while Cluster E represents white wine grapes or white wines. Cluster B represents wines named after famous regions around the world, and they can be either red, white or rose. Cluster C represents white noble grapes for producing great wines. Cluster D represents red noble grapes. Even though uncommon, Shiraz is occasionally admitted to this group.

7.3 Experiment using 16 terms from the mushroom domain. Setting sT = 0.89 results in 4 clusters. Cluster A represents poisonous mushrooms. Cluster B comprises edible mushrooms which are prominent in East Asian cuisine, except for Agaricus Blazei. Nonetheless, this mushroom was included in this cluster probably due to its high content of beta glucan for potential use in cancer treatment, just like Shiitake. Moreover, China is the major exporter of Agaricus Blazei, also known as Himematsutake, further relating this mushroom to East Asia. Clusters C and D comprise edible mushrooms found mainly in Europe and North America, which are more prominent in Western cuisines.

7.4 Experiment using 20 terms from the disease domain. Setting sT = 0.86 results in 7 clusters. Cluster A represents skin diseases. Cluster B represents a class of blood disorders known as anaemia. Cluster C represents other kinds of blood disorders. Cluster D represents blood disorders characterised by a relatively low count of leukocytes (i.e. white blood cells) or platelets. Cluster E represents digestive diseases. Cluster F represents cardiovascular diseases characterised by both the inflammation and thrombosis (i.e. clotting) of arteries and veins. Cluster G represents cardiovascular diseases characterised by the inflammation of veins only.

7.5 Experiment using 16 terms from the animal domain. Setting sT = 0.60 produces 2 clusters. Cluster A comprises birds and Cluster B represents mammals.

7.6 Experiment using 16 terms from the animal domain (the same dataset from the experiment in Figure 7.5). Setting sT = 0.72 results in 5 clusters. Cluster A represents birds. Cluster B includes hoofed mammals (i.e. ungulates). Cluster C corresponds to predatory felines while Cluster D represents predatory canines. Cluster E constitutes animals kept as pets.

7.7 Experiment using 15 terms from the animal domain plus an additional term "Google". Setting sT = 0.58 (left screenshot), sT = 0.60 (middle screenshot) and sT = 0.72 (right screenshot) results in 2 clusters, 3 clusters and 5 clusters, respectively. In the left screenshot, Cluster A acts as the parent for the two recommended clusters "bird" and "mammal", while Cluster B includes the term "Google". In the middle screenshot, the recommended clusters "bird" and "mammal" are clearly reflected through Clusters A and C respectively. By setting sT higher, we dissected the recommended cluster "mammal" to obtain the discovered sub-clusters C, D and E as shown in the right screenshot.

7.8 Experiment using 31 terms from various domains. Setting sT = 0.70 results in 8 clusters. Cluster A represents actors and actresses. Cluster B represents musicians. Cluster C represents countries. Cluster D represents politics-related notions. Cluster E is transport. Cluster F includes finance and accounting matters. Cluster G constitutes technology and services on the Internet. Cluster H represents food.

7.9 Experiment using 60 terms from various domains. Setting sT = 0.76 results in 20 clusters. Clusters A and B represent herbs. Cluster C comprises pastry dishes while Cluster D represents dishes of Italian origin. Cluster E represents computing hardware. Cluster F is a group of politicians. Cluster G represents cities or towns in France while Cluster H includes countries and states other than France. Cluster I constitutes trees of the genus Eucalyptus. Cluster J represents marsupials. Cluster K represents finance and accounting matters. Cluster L comprises transports with four or more wheels. Cluster M includes plant organs. Cluster N represents beverages. Cluster O represents predatory birds. Cluster P comprises birds other than predatory birds. Cluster Q represents two-wheeled transports. Clusters R and S represent predatory mammals. Cluster T includes trees of the genus Acacia.

8.1 An overview of the proposed relation acquisition technique. The main phases are term mapping and term resolution, represented by black rectangles. The three steps involved in resolution are simplification, disambiguation and inference. The techniques represented by the white rounded rectangles were developed by the authors, while existing techniques and resources are shown using grey rounded rectangles.

8.2 Figure 8.2(a) shows the subgraph WT constructed for T = {'baking powder', 'whole wheat flour'} using Algorithm 8, which is later pruned to produce a lightweight ontology in Figure 8.2(b).

8.3 The computation of mutual information for all pairs of contiguous constituents of the composite terms "one cup whole wheat flour" and "salt to taste".

8.4 A graph showing the distribution of noW distance and the stepwise difference for the sequence of word senses for the term "pepper". The set of mapped terms is M = {"fettuccine", "fusilli", "tortellini", "vinegar", "garlic", "red onion", "coriander", "maple syrup", "whole wheat flour", "egg white", "baking powder", "buttermilk"}. The line "stepwise difference" shows the ∆i−1,i values. The line "average stepwise difference" is the constant value µ∆. Note that the first sense s1 is located at x = 0.

8.5 The result of clustering the non-existent term "conchiglioni" and the mapped terms M = {"fettuccine", "fusilli", "tortellini", "vinegar", "garlic", "red onion", "coriander", "maple syrup", "whole wheat flour", "egg white", "baking powder", "buttermilk", "carbonara", "pancetta"} using TTA.

8.6 The results of relation acquisition using the proposed technique for the genetics and the food domains. The labels "correctly xxx" and "incorrectly xxx" represent the true positives (TP) and false positives (FP). Precision is computed as TP/(TP + FP).

8.7 The lightweight domain ontologies generated using the two sets of input terms. The important vertices (i.e. NCAs, input terms, vertices with degree more than 3) have darker shades. The concepts genetics and food in the center of the graph are the NCAs. All input terms are located along the side of the graph.

9.1 The online interface for the HERCULES module.

9.2 The input section of the interface algorithm issac.pl shows the error sentence "Susan's imabbirity to Undeerstant the msg got her INTu trubble.". The correction provided by ISSAC is shown in the results section of the interface. The process log is also provided through this interface. Only a small portion of the process log is shown in this figure.

9.3 The online interface algorithm unithood.pl for the unithood module. The interface shows the collocational stability of different phrases determined using unithood. The various weights involved in determining the extent of stability are also provided in these figures.

9.4 The online interfaces for querying the virtual and local corpora created using the SPARTAN module.

9.5 Online interfaces related to the termhood module.

9.6 Online interfaces related to the ARCHILES module.

9.7 The interface data lightweightontology.pl for browsing pre-constructed lightweight ontologies for online news articles using the ARCHILES module.

9.8 The screenshot of the aggregated news services provided by Google (the left portion of the figure) and Yahoo (the right portion of the figure) on 11 June 2009.

9.9 A splash screen on the online interface for document skimming and scanning at http://explorer.csse.uwa.edu.au/research/.

9.10 The cross-domain term cloud summarising the main concepts occurring in all the 395 articles listed in the news browser. This cloud currently contains terms in the technology, medicine and economics domains.

9.11 The single-domain term cloud for the domain of medicine. This cloud summarises all the main concepts occurring in the 75 articles listed below in the news browser. Users can arrive at this single-domain cloud from the cross-domain cloud in Figure 9.10 by clicking on the [domain(s)] option in the latter.

9.12 The single-domain term cloud for the medicine domain. Users can view a list of articles describing a particular topic by clicking on the corresponding term in the single-domain cloud.
9.13 The use of the document term cloud and information from the lightweight ontology to summarise individual news articles. Based on the term size in the clouds, one can arrive at the conclusion that the news featured in Figure 9.13(b) carries more domain-relevant (i.e. medical-related) content than the news in Figure 9.13(a).

9.14 The document term cloud for the news "Tai Chi may ease arthritis pain". Users can focus on a particular concept in the annotated news by clicking on the corresponding term in the document cloud.

Publications Arising from this Thesis

This thesis contains published work and/or work prepared for publication, some of which has been co-authored. The bibliographical details of the work and where it appears in the thesis are outlined below.

Book Chapters (Fully Refereed)

[1] Wong, W., Liu, W. & Bennamoun, M. (2008) Determination of Unithood and Termhood for Term Recognition. In M. Song and Y. Wu (eds.), Handbook of Research on Text and Web Mining Technologies, IGI Global. This book chapter combines and summarises the ideas in [9][10] and [3][11][12], which form Chapter 4 and Chapter 5, respectively.

[2] Wong, W., Liu, W. & Bennamoun, M. (2008) Featureless Data Clustering. In M. Song and Y. Wu (eds.), Handbook of Research on Text and Web Mining Technologies, IGI Global. The clustering algorithm reported in [5], which contributes to Chapter 7, was generalised in this book chapter to work with both terms and Internet domain names.

Journal Publications (Fully Refereed)

[3] Wong, W., Liu, W. & Bennamoun, M. (2009) A Probabilistic Framework for Automatic Term Recognition. Intelligent Data Analysis, Volume 13, Issue 4, Pages 499-539. (Chapter 5)

[4] Wong, W., Liu, W. & Bennamoun, M. (2009) Constructing Specialised Corpora through Domain Representativeness Analysis of Websites. Accepted with revision by Language Resources and Evaluation. (Chapter 6)

[5] Wong, W., Liu, W. & Bennamoun, M. (2007) Tree-Traversing Ant Algorithm for Term Clustering based on Featureless Similarities. Data Mining and Knowledge Discovery, Volume 15, Issue 3, Pages 349-381. (Chapter 7)

[6] Liu, W. & Wong, W. (2009) Web Service Clustering using Text Mining Techniques. International Journal of Agent-Oriented Software Engineering, Volume 3, Issue 1, Pages 6-26. This paper is an invited submission. It extends the work reported in [17].

Conference Publications (Fully Refereed)

[7] Wong, W., Liu, W. & Bennamoun, M. (2006) Integrated Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration in Dirty Text. In the Proceedings of the 5th Australasian Conference on Data Mining (AusDM), Sydney, Australia. The preliminary ideas in this paper were refined and extended to contribute towards [8], which forms Chapter 3 of this thesis.

[8] Wong, W., Liu, W. & Bennamoun, M. (2007) Enhanced Integrated Scoring for Cleaning Dirty Texts. In the Proceedings of the IJCAI Workshop on Analytics for Noisy Unstructured Text Data (AND), Hyderabad, India. (Chapter 3)

[9] Wong, W., Liu, W. & Bennamoun, M. (2007) Determining the Unithood of Word Sequences using Mutual Information and Independence Measure. In the Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING), Melbourne, Australia. The ideas in this paper were refined and reformulated as a probabilistic framework to contribute towards [10], which forms Chapter 4 of this thesis.

[10] Wong, W., Liu, W. & Bennamoun, M. (2008) Determining the Unithood of Word Sequences using a Probabilistic Approach. In the Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP), Hyderabad, India. (Chapter 4)

[11] Wong, W., Liu, W. & Bennamoun, M. (2007) Determining Termhood for Learning Domain Ontologies using Domain Prevalence and Tendency. In the Proceedings of the 6th Australasian Conference on Data Mining (AusDM), Gold Coast, Australia. The ideas in this paper were refined and reformulated as a probabilistic framework to contribute towards [3], which forms Chapter 5 of this thesis.

[12] Wong, W., Liu, W. & Bennamoun, M. (2007) Determining Termhood for Learning Domain Ontologies in a Probabilistic Framework. In the Proceedings of the 6th Australasian Conference on Data Mining (AusDM), Gold Coast, Australia. The ideas and experiments in this paper were further extended to contribute towards [3], which forms Chapter 5 of this thesis.

[13] Wong, W., Liu, W. & Bennamoun, M. (2008) Constructing Web Corpora through Topical Web Partitioning for Term Recognition. In the Proceedings of the 21st Australasian Joint Conference on Artificial Intelligence (AI), Auckland, New Zealand. The preliminary ideas in this paper were improved and extended to contribute towards [4], which forms Chapter 6 of this thesis.

[14] Wong, W., Liu, W. & Bennamoun, M. (2006) Featureless Similarities for Terms Clustering using Tree-Traversing Ants. In the Proceedings of the International Symposium on Practical Cognitive Agents and Robots (PCAR), Perth, Australia. The preliminary ideas in this paper were refined to contribute towards [5], which forms Chapter 7 of this thesis.

[15] Wong, W., Liu, W. & Bennamoun, M. (2009) Acquiring Semantic Relations using the Web for Constructing Lightweight Ontologies. In the Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Bangkok, Thailand. (Chapter 8)

[16] Enkhsaikhan, M., Wong, W., Liu, W. & Reynolds, M. (2007) Measuring Data-Driven Ontology Changes using Text Mining. In the Proceedings of the 6th Australasian Conference on Data Mining (AusDM), Gold Coast, Australia. This paper reports a technique for detecting changes in ontologies. The ontologies used for evaluation in this paper were generated using the clustering technique in [5] and the term recognition technique in [3].

[17] Liu, W. & Wong, W. (2008) Discovering Homogenous Service Communities through Web Service Clustering. In the Proceedings of the AAMAS Workshop on Service-Oriented Computing: Agents, Semantics, and Engineering (SOCASE), Estoril, Portugal. This paper reports the results of discovering web service clusters using the extended clustering technique described in [2].

Conference Publications (Refereed on the basis of abstract)

[18] Wong, W., Liu, W., Liaw, S., Balliu, N., Wu, H. & Tade, M. (2008) Automatic Construction of Lightweight Domain Ontologies for Chemical Engineering Risk Management. In the Proceedings of the 11th Conference on Process Integration, Modelling and Optimisation for Energy Saving and Pollution Reduction (PRES), Prague, Czech Republic. This paper reports the results of the preliminary integration of the ideas in [1-15].

[19] Wong, W. (2008) Discovering Lightweight Ontologies using the Web. In the Proceedings of the 9th Postgraduate Electrical Engineering & Computing Symposium (PEECS), Perth, Australia. This paper reports the preliminary results of the integration of the ideas in [1-15] into a system for document skimming and scanning.
Note:

• The 14 publications [1-5] and [7-15] describe research work on developing various techniques for ontology learning. The contents of these papers contributed directly to Chapters 3-8 of this thesis. The 5 publications [6] and [16-19] are application papers that arose from the use of these techniques in several areas.

• All publications, except [18] and [19], are published in venues ranked B or higher by the Australasian Computing Research and Education Association (CORE). Data Mining and Knowledge Discovery (DMKD), Intelligent Data Analysis (IDA) and Language Resources and Evaluation (LRE) have 2008/2009 ISI journal impact factors of 2.421, 0.426 and 0.283, respectively.

Contribution of Candidate to Published Work

Some of the published work included in this thesis has been co-authored. The extent of the candidate's contribution towards the published work is outlined below.

• Publications [1-5] and [7-15]: The candidate is the first author of these 14 papers, with an 80% contribution. He co-authored them with his two supervisors. The candidate designed and implemented the algorithms, performed the experiments and wrote the papers. The candidate's supervisors reviewed the papers and provided useful advice for improvements.

• Publications [6] and [17]: The candidate is the second author of these papers with a 50% contribution. His primary supervisor (Dr Wei Liu) is the first author. The candidate contributed to the clustering algorithm used in these papers, and wrote the experiment sections.

• Publication [16]: The candidate is the second author of this paper with a 20% contribution. His primary supervisor (Dr Wei Liu) and two academic colleagues are the remaining authors. The candidate contributed to the clustering algorithm and term recognition technique used in this paper. The candidate conducted half of the experiments reported in this paper.

• Publication [18]: The candidate is the first author of this paper with a 40% contribution. He co-authored the paper with two academic colleagues, and three researchers from the Curtin University of Technology. All techniques reported in this paper were contributed by the candidate. The candidate wrote all sections of this paper with advice from his primary supervisor (Dr Wei Liu) and the domain experts from the Curtin University of Technology.

• Publication [19]: The candidate is the sole author of this paper with a 100% contribution.

Chapter 1

Introduction

"If HTML and the Web made all the online documents look like one huge book, [the Semantic Web] will make all the data in the world look like one huge database." - Tim Berners-Lee, Weaving the Web (1999)

Imagine that every text document you encounter comes with an abstraction of what is important. We would then no longer have to meticulously sift through every email, news article, search result or product review every day. If every document on the World Wide Web (the Web) had an abstraction of its important concepts and relations, we would be one crucial step closer to realising the vision of a Semantic Web. At the moment, the widely adopted technique for creating these abstractions is manual curation. For instance, authors of news articles create their own summaries. Regular users assign descriptive tags to webpages using Web 2.0 portals. Webmasters provide machine-readable metadata to describe their webpages for the Semantic Web. The need to automate the abstraction process becomes evident when we consider the fact that more than 90% of the data in the world appears in unstructured form [87].
Indeed, search engine giants such as Yahoo!, Google and Microsoft's Bing are slowly and strategically gearing towards the presentation of webpages using visual summaries and abstractions.

In this research, ontology learning techniques are proposed and developed to automatically discover terms, concepts and relations from documents. Together, these ontological elements are represented as lightweight ontologies. As with any process that involves extracting meaningful information from unstructured data, ontology learning relies on extensive background knowledge. This background knowledge can range from unstructured data such as a text corpus (i.e. a collection of documents) to structured data such as a semantic lexicon. From here on, we shall take background knowledge [32] in a broad sense as "information that is essential to understanding a situation or problem" (this definition is from http://wordnetweb.princeton.edu/perl/webwn?s=background%20knowledge). More and more researchers in ontology learning are turning to Web data to address certain inadequacies of static background knowledge. This thesis investigates the systematic use of the Web as the sole source of dynamic background knowledge for automatically learning term clouds (i.e. visual depictions of terms) and lightweight ontologies from text across different domains. The significance of term clouds and lightweight ontologies is best appreciated in the context of document skimming and scanning as a way to alleviate the pressure of information overload. Imagine hundreds of news articles, medical reports, product reviews and emails summarised using connected key concepts (i.e. lightweight ontologies) that stand out visually (i.e. term clouds). This thesis has produced an interface to do exactly this.

1.1 Problem Description

Ontology learning from text is a relatively new research area that draws on advances from related disciplines, especially text mining, data mining, natural language processing and information retrieval. The requirement for extensive background knowledge, be it in the form of text corpora or structured data, remains one of the greatest challenges facing the ontology learning community, and hence is the focus of this thesis.

The adequacy of background knowledge in language processing is determined by two traits, namely diversity and redundancy. Firstly, languages vary and evolve across different geographical regions, genres and time [183]. For instance, a general lexicon for the English language such as WordNet [177] is of little or no use to a system that processes medical texts or texts in other languages. Similarly, a text corpus conceived in the early 90s such as the British National Corpus (BNC) [36] cannot cope with the need to process texts that contain words such as "iPhone" or "metrosexual". Secondly, redundancy of data is an important prerequisite in both statistical and symbolic language processing. Redundancy allows language processing techniques to arrive at conclusions regarding many linguistic events based on observation and induction. If we observe "politics" and "hypocrisy" together often enough, we can say they are somehow related.

Static background knowledge has neither adequate diversity nor redundancy. According to Engels & Lech [66], "Many of the approaches found use of statistical methods on larger corpora...Such approaches tend to get into trouble when domains are dynamic or when no large corpora are present...". Indeed, many researchers have realised this, and hence gradually turned to Web data for the solution. For instance, in ontology learning, Wikipedia is used for relation acquisition [154, 236, 216] and word sense disambiguation [175]. Web search engines are employed for text corpus construction [15, 228], similarity measurement [50], and word collocation [224, 41]. However, the present use of Web data is typically confined to isolated cases where static background knowledge has outrun its course. There is currently no study concentrating on the systematic use of Web data as background knowledge for all phases of ontology learning. Research that focuses on the issues of diversity and redundancy of background knowledge in ontology learning is long overdue. How do we know if we have the necessary background knowledge to carry out all our ontology learning tasks? Where do we look for more background knowledge if we know that what we have is inadequate?

1.2 Thesis Statement

The thesis of this research is that the process of ontology learning, which includes discovering terms, concepts and coarse-grained relations from text across a wide range of domains, can be effectively automated by relying solely upon background knowledge on the Web. In other words, our proposed system employs Web data as the sole background knowledge for all techniques across all phases of ontology learning. The effectiveness of the proposed system is determined by its ability to satisfy two requirements:

(1) Avoid using any static resources commonly used by current ontology learning systems (e.g. semantic lexicons, text corpora).
(2) Ensure the applicability of the system across a wide range of domains (the current focus is technology, medicine and economics).

At the same time, this research addresses the following four problems by taking advantage of the diversity and redundancy of Web data as background knowledge:

(1) The absence of integrated techniques for cleaning noisy text.
(2) The inability of current term extraction techniques, which are heavily influenced by word frequency, to systematically explicate, diversify and consolidate termhood evidence.
(3) The inability of current corpus construction techniques to automatically create very large, high-quality text corpora using a small number of seed terms.
(4) The difficulty of locating and preparing features for clustering and acquiring relations between terms.

1.3 Overview of Solution

The ultimate goal of ontology learning in the context of this thesis is to discover terms, concepts and coarse-grained relations from documents using the Web as the sole source of background knowledge. Chapter 2 provides a thorough review of the existing techniques for discovering these ontological elements. An ontology learning system comprising five phases, namely text preprocessing, text processing, corpus construction, term recognition and relation acquisition, is proposed in this research. Figure 1.1 provides an overview of the system.

Figure 1.1: An overview of the five phases in the proposed ontology learning system, and how the details of each phase are outlined in certain chapters of this dissertation.

The common design and development methodology for the core techniques in each phase is: (1) first, perform an in-depth study of the requirements of each phase to determine the types of background knowledge required; (2) second, identify ways of exploiting data on the Web to satisfy the background knowledge requirements; and (3) third, devise high-performance techniques that take advantage of the diversity and redundancy of the background knowledge. The system takes as input a set of seed terms and natural language texts, and produces three outputs, namely text corpora, term clouds and lightweight ontologies. The solution to each phase is described in the following subsections.
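To make the overall data flow concrete before the individual phases are described, the following minimal Python sketch outlines how the five phases compose. It is only an illustrative skeleton: every function name and body here is an assumed placeholder, not the implementation described in Chapter 9; the point is simply the order of the phases and the three outputs.

from dataclasses import dataclass, field


@dataclass
class LightweightOntology:
    # Terms plus coarse-grained relations of the form (term, relation, term).
    terms: list = field(default_factory=list)
    relations: list = field(default_factory=list)


def preprocess(raw_text):
    # Phase 1: correct spelling errors, abbreviations and casing (ISSAC); stubbed here.
    return raw_text.strip()


def process(clean_text):
    # Phase 2: chunk noun phrases and keep collocationally stable units (UH/OU); stubbed here.
    return [chunk.strip() for chunk in clean_text.split(".") if chunk.strip()]


def build_corpus(seed_terms):
    # Phase 3: assemble a specialised text corpus from the Web (SPARTAN); stubbed here.
    return ["placeholder document about %s" % term for term in seed_terms]


def recognise_terms(candidates, corpus):
    # Phase 4: keep candidates judged domain-relevant against the corpus (TH/OT); stubbed here.
    return [c for c in candidates if any(c in doc for doc in corpus)]


def acquire_relations(terms):
    # Phase 5: cluster terms and derive coarse-grained relations (TTA/ARCHILES); stubbed here.
    return LightweightOntology(terms=list(terms))


def learn_ontology(raw_text, seed_terms):
    corpus = build_corpus(seed_terms)
    candidates = process(preprocess(raw_text))
    return acquire_relations(recognise_terms(candidates, corpus))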
1.3.1 Text Preprocessing (Chapter 3)

Figure 1.2 shows an overview of the techniques for text preprocessing. Unlike data developed in controlled settings, Web data come in varying degrees of quality and may contain spelling errors, abbreviations and improper casing. This calls for serious attention to the issue of data quality. A review of several prominent techniques for spelling error correction, abbreviation expansion and case restoration was conducted. Despite the blurring of the boundaries between these different errors in online data, there is little work on integrated correction techniques. For instance, is "ocat" a spelling error (with the possibilities "coat", "cat" or "oat"), or an abbreviation (with the expansion "Ontario Campaign for Action on Tobacco")? A technique called Integrated Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration (ISSAC) is proposed and developed for cleaning potentially noisy texts. ISSAC relies on edit distance, online dictionaries, search engine page counts and Aspell [9]. An experiment using 700 chat records showed that ISSAC achieved an average accuracy of 98%. In addition, a heuristic technique called Heuristic-based Cleaning Utility for Web Texts (HERCULES) is proposed for extracting relevant content from webpages amidst HTML tags, boilerplate, etc. Due to the significance of HERCULES to the corpus construction phase, the details of this technique are provided in Chapter 6.

Figure 1.2: Overview of the ISSAC and HERCULES techniques used in the text preprocessing phase of the proposed ontology learning system. ISSAC and HERCULES are described in Chapters 3 and 6, respectively.
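The exact ISSAC scoring function and its enhancements are given in Chapter 3. Purely as a hedged illustration of the underlying idea of consolidating several sources of evidence into one score, the sketch below ranks candidate replacements for a noisy token by combining string similarity with a log-scaled web-frequency weight; the weighting scheme and the hard-coded page counts are assumptions for illustration, not ISSAC itself.

import math
from difflib import SequenceMatcher


def page_count(phrase):
    # Stand-in for a search engine page-count lookup; the numbers are invented.
    counts = {"coat": 93_000_000, "cat": 620_000_000, "oat": 41_000_000}
    return counts.get(phrase.lower(), 1)


def replacement_score(noisy_token, candidate):
    # Combine string similarity (edit-distance-like evidence) with web-frequency evidence.
    similarity = SequenceMatcher(None, noisy_token.lower(), candidate.lower()).ratio()
    return similarity * math.log(1 + page_count(candidate))


def best_replacement(noisy_token, candidates):
    return max(candidates, key=lambda c: replacement_score(noisy_token, c))


print(best_replacement("ocat", ["coat", "cat", "oat"]))

In the real technique, abbreviation expansions and case variants compete in the same candidate pool as spelling corrections, which is what allows the three error types to be handled simultaneously.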
1.3.2 Text Processing (Chapter 4)

Figure 1.3 shows an overview of the techniques for text processing. The cleaned texts are processed using the Stanford Parser [132] and Minipar [150] to obtain part-of-speech and grammatical information. This information is then used for chunking noun phrases and extracting instantiated sub-categorisation frames (i.e. syntactic triples, ternary frames) in the form of <arg1,connector,arg2>. Two measures based on search engine page counts are introduced as part of the noun phrase chunking process. These two measures are used to determine the collocational stability of noun phrases. Noun phrases are considered unstable if they can be further broken down to create non-overlapping units that refer to semantically distinct concepts. For example, the phrase "Centers for Disease Control and Prevention" is a stable and semantically meaningful unit, while "Centre for Clinical Interventions and Royal Perth Hospital" is an unstable compound. The first measure, called Unithood (UH), is an adaptation of existing word association measures, while the second measure, called Odds of Unithood (OU), is a novel probabilistic measure that addresses the ad-hoc nature of combining evidence. An experiment using 1,825 test cases in the health domain showed that OU achieved a higher accuracy of 97.26%, compared to UH at only 94.52%.

Figure 1.3: Overview of the UH and OU measures used in the text processing phase of the proposed ontology learning system. These two measures are described in Chapter 4.
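UH and OU are defined precisely in Chapter 4. As a rough, hedged illustration of how page counts can serve as collocation evidence, the sketch below computes a pointwise mutual information (PMI) style score from page counts: the more the constituent words co-occur beyond what their individual frequencies predict, the more the phrase behaves like a stable unit. The counts and the index size N are hypothetical, and the actual measures in Chapter 4 combine more evidence than this single score.

import math

ASSUMED_INDEX_SIZE = 50_000_000_000  # hypothetical number of pages indexed by the search engine


def pmi_unithood(count_xy, count_x, count_y, n=ASSUMED_INDEX_SIZE):
    # log [ P(x, y) / (P(x) P(y)) ] estimated from page counts; values above 0 suggest a stable collocation.
    p_xy = count_xy / n
    p_x, p_y = count_x / n, count_y / n
    return math.log(p_xy / (p_x * p_y))


# Hypothetical page counts for a two-word phrase and its constituent words.
print(pmi_unithood(count_xy=12_000_000, count_x=480_000_000, count_y=350_000_000))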
1.3.3 Term Recognition (Chapter 5)

Figure 1.4 shows an overview of the techniques for term recognition. Noun phrases in the two arguments arg1 and arg2 of the instantiated sub-categorisation frames are used to create a list of term candidates. The extent to which each term candidate is relevant to the corresponding document is then determined. Several existing techniques for measuring termhood using various term characteristics as termhood evidence are reviewed. Major shortcomings of existing techniques are identified and discussed, including the heavy influence of word frequency (especially in techniques based on TF-IDF), mathematically unfounded derivation of weights, and implicit assumptions regarding term characteristics. An analysis is carried out using word distribution models and text corpora to predict word occurrences for quantifying termhood evidence. Models that are considered include the K-mixture, Poisson and Zipf-Mandelbrot models. Based on the analysis, two termhood measures are proposed, which combine evidence based on explicitly defined term characteristics. The first is a heuristic measure called Termhood (TH), while the second, called Odds of Termhood (OT), is a novel probabilistic measure founded on Bayes' theorem for formalising termhood evidence. These two measures are compared against two existing ones using the GENIA corpus [130] for molecular biology as the benchmark. An evaluation using 1,954 term candidates showed that TH and OT achieved the best precision at 98.5% and 98%, respectively, for the first 200 terms.

Figure 1.4: Overview of the TH and OT measures used in the term recognition phase of the proposed ontology learning system. These two measures are described in Chapter 5.
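The precise definition of OT, its evidence set and its estimation procedure appear in Chapter 5. The generic Bayes-odds template that such a measure builds on can be written as follows, where t denotes the event that a candidate is a domain term and e_1, ..., e_n are pieces of termhood evidence assumed conditionally independent; this is only the standard odds form of Bayes' theorem, not the thesis-specific formulation:

O(t | e_1, ..., e_n) = P(t | e_1, ..., e_n) / P(¬t | e_1, ..., e_n) = [ P(t) / P(¬t) ] × ∏_{i=1}^{n} [ P(e_i | t) / P(e_i | ¬t) ]

Each factor P(e_i | t) / P(e_i | ¬t) acts as a likelihood ratio that strengthens or weakens the odds, which is what allows heterogeneous evidence (for example, domain versus contrastive frequencies predicted by word distribution models) to be consolidated within a single probabilistic framework rather than through ad-hoc weighting.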
A Probabilistic Site Selector (PROSE) is proposed as part of SPARTAN to identify the most suitable and authoritative data for contributing to the corpora. A heuristic technique called HERCULES, mentioned earlier in the text preprocessing phase, is included in SPARTAN for extracting relevant content from the downloaded webpages. A comparison using the Cleaneval development set (http://cleaneval.sigwac.org.uk/devset.html) and a text comparison module based on the vector space model (http://search.cpan.org/~stro/Text-Compare-1.03/lib/Text/Compare.pm) showed that HERCULES achieved an 89.19% similarity with the gold standard. An evaluation was conducted to show that SPARTAN requires only a small number of seed terms (three to five), and that SPARTAN-based corpora are independent of the search engine employed. The performance of term recognition across four different corpora (both automatically constructed and manually curated) was assessed using the OT measure, 1,300 term candidates, and the GENIA corpus as the benchmark. The evaluation showed that term recognition using the SPARTAN-based corpus achieved the best precision at 99.56%.

1.3.5 Term Clustering and Relation Acquisition (Chapters 7 and 8)

Figure 1.6 shows an overview of the techniques for relation acquisition.

Figure 1.6: Overview of the ARCHILES technique used in the relation acquisition phase of the proposed ontology learning system. The ARCHILES technique is described in Chapter 8, while the TTA clustering technique and noW measure are described in Chapter 7.

The flat lists of domain-relevant terms obtained during the previous phase are organised into hierarchical structures during the relation acquisition phase. A review of the techniques for acquiring semantic relations between terms was conducted. Current techniques rely heavily on the presence of syntactic cues and static background knowledge such as semantic lexicons for acquiring relations. A novel technique named Acquiring Relations through Concept Hierarchy Disambiguation, Association Inference and Lexical Simplification (ARCHILES) is proposed for constructing lightweight ontologies using coarse-grained relations derived from Wikipedia and search engines. ARCHILES combines word disambiguation, which uses the distance measure n-degree of Wikipedia (noW), and lexical simplification to handle complex and ambiguous terms. ARCHILES also includes association inference using a novel multi-pass Tree-Traversing Ant (TTA) clustering algorithm with the Normalised Web Distance (NWD), a generalisation of the Normalised Google Distance (NGD) [50] that can employ any available Web search engine, as the similarity measure to cope with terms not covered by Wikipedia. This technique can be used to complement conventional techniques for acquiring fine-grained relations. Two small experiments using 11 terms in the genetics domain and 31 terms in the food domain revealed precision scores between 80% and 100%. The details of TTA and noW are provided in Chapter 7. The description of ARCHILES is included in Chapter 8.

1.4 Contributions

The standout contribution of this dissertation is the exploration of a complete solution to the complex problem of automatic ontology learning from text. This research has produced several other contributions to the field of ontology learning. The complete list is as follows:

• A technique which consolidates various evidence from existing tools and from search engines for simultaneously correcting spelling errors, expanding abbreviations and restoring improper casing.
• Two measures for determining the collocational strength of word sequences using page counts from search engines, namely, an adaptation of existing word association measures, and a novel probabilistic measure.

• In-depth experiments on parameter estimation and linear regression involving various word distribution models.

• Two measures for determining term relevance based on explicitly defined term characteristics and the distributional behaviour of terms across different corpora. The first measure is a heuristic measure, while the second measure is based on a novel probabilistic framework for consolidating evidence using formal word distribution models.

• In-depth experiments on the effects of search engine and page count variations on corpus construction.

• A novel technique for corpus construction that requires only a small number of seed terms to automatically produce very large, high-quality text corpora through the systematic analysis of website contents. The on-demand construction of new text corpora enables this and many other term recognition techniques to be widely applicable across different domains. A generally-applicable heuristic technique is also introduced for removing HTML tags and boilerplate, and extracting relevant content from webpages.

• In-depth experiments on the peculiarities of clustering terms as compared to other forms of feature-based data clustering.

• A novel technique for constructing lightweight ontologies in an iterative process of lexical simplification, association inference through term clustering, and word disambiguation using only Wikipedia and search engines. A generally-applicable technique is introduced for multi-pass term clustering using featureless similarity measurement based on Wikipedia and page counts from search engines.

• Demonstration of the use of term clouds and lightweight ontologies to assist the skimming and scanning of documents.

1.5 Layout of Thesis

Overall, this dissertation is organised as a series of papers published in internationally refereed book chapters, journals and conferences. Each paper constitutes an independent piece of work on ontology learning. Together, however, these papers contribute to a complete and coherent theme. In Chapter 2, a background to ontology learning and a review of several prominent ontology learning systems are presented. The core content of this dissertation is laid out in Chapters 3 to 8. Each of these chapters describes one of the five phases in our ontology learning system.

• Chapter 3 (Text Preprocessing) features an IJCAI workshop paper on the text cleaning technique called ISSAC.

• In Chapter 4 (Text Processing), an IJCNLP conference paper describing the two word association measures UH and OU is included.

• An Intelligent Data Analysis journal paper on the two term relevance measures TH and OT is included in Chapter 5 (Term Recognition).

• In Chapter 6 (Corpus Construction for Term Recognition), a Language Resources and Evaluation journal paper is included to describe the SPARTAN technique for automatically constructing text corpora for term recognition.

• A Data Mining and Knowledge Discovery journal paper that describes the TTA clustering technique and noW distance measure is included in Chapter 7 (Term Clustering for Relation Acquisition).
• In Chapter 8 (Relation Acquisition), a PAKDD conference paper is included to describe the ARCHILES technique for acquiring coarse-grained relations using TTA and noW. After the core content, Chapter 9 elaborates on the implementation details of the proposed ontology learning system, and the application of term clouds and lightweight ontologies for document skimming and scanning. In Chapter 10, we summarise our conclusions and provide suggestions for future work. CHAPTER 2 Background “A while ago, the Artificial Intelligence research community got together to find a way to enable knowledge sharing...They proposed an infrastructure stack that could enable this level of information exchange, and began work on the very difficult problems that arise.” - Thomas Gruber, Ontology of Folksonomy (2007) This chapter provides a comprehensive review on ontology learning. It also serves as a background introduction to ontologies in terms of what they are, why they are important, how they are obtained and where they can be applied. The definition of an ontology is first introduced before a discussion on the differences between lightweight ontologies and the conventional understanding of ontologies is provided. Then the process of ontology learning is described, with a focus on types of output, commonly-used techniques and evaluation approaches. Finally, several current applications and prominent systems are explored to appreciate the significance of ontologies and the remaining challenges in ontology learning. 2.1 Ontologies Ontologies can be thought of as directed graphs consisting of concepts as nodes, and relations as the edges between the nodes. A concept is essentially a mental symbol often realised by a corresponding lexical representation (i.e. natural language name). For instance, the concept “food” denotes the set of all substances that can be consumed for nutrition or pleasure. In Information Science, an ontology is a “formal, explicit specification of a shared conceptualisation” [92]. This definition imposes the requirement that the names of concepts, and how the concepts are related to one another have to be explicitly expressed and represented using formal languages such as Web Ontology Language (OWL). An important benefit of a formal representation is the ability to specify axioms for reasoning to determine validity and to define constraints in ontologies. As research into ontology progresses, the definition of what constitutes an ontology evolves. The extent of relational and axiomatic richness, and the formality of representation eventually gave rise to a spectrum of ontology kinds [253] as illustrated in Figure 2.1. At one end of the spectrum, we have ontologies that make little or no use of axioms referred to as lightweight ontologies [89]. At the other end, we have heavyweight ontologies [84] that make intensive use of axioms for specification. Ontologies are fundamental to the success of the Semantic Web as they 13 14 Chapter 2. Background Figure 2.1: The spectrum of ontology kinds, adapted from Giunchiglia & Zaihrayeu [89]. enable software agents to exchange, share, reuse and reason about concepts and relations using axioms. In the words of Tim Berners-Lee [24], “For the semantic web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning”. However, the truth remains that the automatic learning of axioms is not an easy task. 
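To make the directed-graph view of ontologies introduced above concrete, the following is a minimal sketch of a lightweight ontology as a graph of concepts connected by labelled relations, with no axioms. The class name, relation labels and concept names are hypothetical and are only meant to illustrate the structure, not any particular formalism such as OWL.

    from collections import defaultdict

    class LightweightOntology:
        """Concepts as nodes, labelled directed relations as edges; no axioms."""

        def __init__(self):
            self.relations = defaultdict(set)   # (source concept, label) -> {target concepts}

        def add(self, source: str, label: str, target: str) -> None:
            self.relations[(source, label)].add(target)

        def targets(self, source: str, label: str) -> set:
            return self.relations[(source, label)]

    onto = LightweightOntology()
    onto.add("egg tart", "is-a", "tart")          # taxonomic relation
    onto.add("tart", "is-a", "food")
    onto.add("tart", "part-of", "dessert menu")   # non-taxonomic relation
    print(onto.targets("tart", "is-a"))           # {'food'}

Automatically populating even such a minimal structure from free text is the challenge examined in the remainder of this chapter.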
Indeed, despite some success, many ontology learning systems are still struggling with the basics of extracting terms and relations [84]. For this reason, the majority of systems that claim to learn ontologies are in fact creating lightweight ontologies. At the moment, lightweight ontologies appear to be the most common type of ontology in a variety of Semantic Web applications (e.g. knowledge management, document retrieval, communities of practice, data integration) [59, 75].

2.2 Ontology Learning from Text

Ontology learning from text is the process of identifying terms, concepts, relations and, optionally, axioms from natural language text, and using them to construct and maintain an ontology. Even though the area of ontology learning is still in its infancy, many proven techniques from established fields such as text mining, data mining, natural language processing, information retrieval, as well as knowledge representation and reasoning, have powered a rapid growth in recent years. Information retrieval provides various algorithms to analyse associations between concepts in texts using vectors, matrices [76] and probabilistic theorems [280]. Machine learning and data mining, on the other hand, provide ontology learning with the ability to extract rules and patterns out of massive datasets in a supervised or unsupervised manner based on extensive statistical analysis. Natural language processing provides the tools for analysing natural language text on various language levels (e.g. morphology, syntax, semantics) to uncover concept representations and relations through linguistic cues. Knowledge representation and reasoning enables the ontological elements to be formally specified and represented such that new knowledge can be deduced.

Figure 2.2: Overview of the outputs, tasks and techniques of ontology learning.

In the following subsections, we look at the types of output, common techniques and evaluation approaches of a typical ontology learning process.

2.2.1 Outputs from Ontology Learning

There are five types of output in ontology learning, namely, terms, concepts, taxonomic relations, non-taxonomic relations and axioms. Some researchers [35] refer to this as the "Ontology Learning Layer Cake". To obtain each output, certain tasks have to be accomplished, and the techniques employed for each task may vary between systems. This view of the output-task relationship, independent of any implementation details, promotes modularity in designing and implementing ontology learning systems. Figure 2.2 shows the outputs and the corresponding tasks. Each output is a prerequisite for obtaining the next output as shown in the figure. Terms are used to form concepts, which in turn are organised according to relations. Relations can be further generalised to produce axioms. Terms are the most basic building blocks in ontology learning. Terms can be simple (i.e. single-word) or complex (i.e. multi-word), and are considered lexical realisations of everything important and relevant to a domain. The main tasks associated with terms are to preprocess texts and extract terms. Preprocessing ensures that the input texts are in an acceptable format. Some of the techniques relevant to preprocessing include noisy text analytics and the extraction of relevant content from webpages (i.e. boilerplate removal). The extraction of terms usually begins with some kind of part-of-speech tagging and sentence parsing.
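As an illustration of this first step, below is a minimal sketch of part-of-speech-based noun phrase candidate extraction using NLTK's off-the-shelf tagger and a regular-expression chunker. It is not the pipeline used in this thesis (which relies on the Stanford Parser and Minipar), and the chunk grammar is a simplifying assumption; it only shows the general shape of candidate extraction.

    import nltk

    # Assumes the NLTK models have been fetched, e.g.
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    GRAMMAR = "NP: {<JJ>*<NN.*>+}"   # adjectives followed by one or more nouns
    chunker = nltk.RegexpParser(GRAMMAR)

    def candidate_terms(sentence: str) -> list:
        """Return noun phrase chunks as candidate terms for one sentence."""
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = chunker.parse(tagged)
        return [" ".join(word for word, _ in subtree.leaves())
                for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]

    print(candidate_terms("The epidermal growth factor receptor binds a small ligand."))
    # e.g. ['epidermal growth factor receptor', 'small ligand']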
Statistical or probabilistic measures are then used to determine the extent of collocational strength and domain relevance of the term candidates. Concepts can be abstract or concrete, real or fictitious. Broadly speaking, a concept can be anything about which something is said. Concepts are formed by grouping similar terms. The main tasks are therefore to form concepts and label concepts. The task of forming concepts involve discovering the variants of a term and grouping them together. Term variants can be determined using predefined background knowledge, syntactic structure analysis or through clustering based on some similarity measures. As for deciding on the suitable label for a concept, existing background knowledge such as WordNet may be used to find the name of the nearest common ancestor. If a concept is determined through syntactic structure analysis, the heads of the complex terms can be used as the corresponding label. For instance, the common head noun “tart” can be used as the label for the concept comprising of “egg tart”, “French apple tart”, “chocolate tart”, etc. Relations are used to model the interactions between the concepts in a domain. There are two types of relations, namely, taxonomic relations and non-taxonomic relations. Taxonomic relations are the hypernymies between concepts. The main task is to construct hierarchies. Organising concepts into a hierarchy involves the discovery of hypernyms and hence, some researchers may also refer to this task as extracting taxonomic relations. Hierarchy construction can be performed in various ways such as using predefined relations from existing background knowledge, using statistical subsumption models, relying on semantic relatedness between concepts, and utilising linguistic and logical rules or patterns. Non-taxonomic relations are the interactions between concepts (e.g. meronymy, thematic roles, attributes, possession and causality) other than hypernymy. The less explicit and more complex use of words for specifying relations other than hypernymy causes the tasks to discover 2.2. Ontology Learning from Text non-taxonomic relations and label non-taxonomic relations to be more challenging. Discovering and labelling non-taxonomic relations are mainly reliant on the analysis of syntactic structures and dependencies. In this aspect, verbs are taken as good indicators for non-taxonomic relations and help from domain experts are usually required to label such relations. Lastly, axioms are propositions or sentences that are always taken as true. Axioms act as a starting point for deducing other truth, verifying correctness of existing ontological elements and defining constraints. The task involved here is to discover axioms. The task of learning axioms usually involve the generalisation or deduction of a large number of known relations that satisfy certain criteria. 2.2.2 Techniques for Ontology Learning The techniques employed by different systems may vary depending on the tasks to be accomplished. The techniques can generally be classified into statistics-based, linguistics-based, logic-based, or hybrid. Figure 2.2 illustrates the various commonlyused techniques, and each technique may be applicable to more than one task. The various statistics-based techniques for accomplishing the tasks in ontology learning are mostly derived from information retrieval, machine learning and data mining. 
The lack of consideration for the underlying semantics and relations between the components of a text makes statistics-based techniques more prevalent in the early stages of ontology learning. Some of the common techniques include clustering [272], latent semantic analysis [252], co-occurrence analysis [34], term subsumption [77], contrastive analysis [260] and association rule mining [239]. The main idea behind these techniques is that the extent of occurrence of terms and their contexts in documents often provides reliable estimates about the semantic identity of terms.

• In clustering, a measure of relatedness (e.g. similarity or distance) is employed to assign terms into groups for discovering concepts or constructing hierarchies [152]. The process of clustering can either begin with individual terms or concepts and group the most related ones (i.e. agglomerative clustering), or begin with all terms or concepts and divide them into smaller groups to maximise within-group relatedness (i.e. divisive clustering). Some of the major issues in clustering are working with high-dimensional data, and feature extraction and preparation for similarity measurement. This has given rise to a class of featureless similarity and distance measures based solely on the co-occurrence of words in large text corpora. The Normalised Web Distance (NWD) is one example [262].

• Relying on raw data to measure relatedness may lead to data sparseness [35]. In latent semantic analysis, dimension reduction techniques such as singular value decomposition are applied to the term-document matrix to overcome the problem [139]. In addition, inherent relations between terms can be revealed by applying correlation measures to the dimensionally-reduced matrix, leading to the formation of groups.

• The analysis of the occurrence of two or more terms within a well-defined unit of information, such as a sentence or, more generally, an n-gram, is known as co-occurrence analysis. Co-occurrence analysis is usually coupled with measures for determining the association strength between terms or the constituents of terms. Some of the popular measures include dependency measures (e.g. mutual information [47]), log-likelihood ratios [206] (e.g. chi-square test), rank correlations (e.g. Pearson's and Spearman's coefficient [244]), distance measures (e.g. Kullback-Leibler divergence [161]), and similarity measures (e.g. cosine measures [223]).

• In term subsumption, the conditional probabilities of the occurrence of terms in documents are employed to discover hierarchical relations between them [77]. A term subsumption measure is used to quantify the extent to which a term x is more general than another term y. The higher the subsumption value, the more general term x is with respect to y.

• The extent of occurrence of terms in individual documents and in text corpora is employed for relevance analysis. Some of the common relevance measures from information retrieval include the Term Frequency-Inverse Document Frequency (TF-IDF) [215] and its variants, and others based on language modelling [56] and probability [83]. Contrastive analysis [19] is a kind of relevance analysis based on the heuristic that general language-dependent phenomena should spread equally across different text corpora, while special-language phenomena should display odd behaviours.

• Given a set of concept pairs, association rule mining is employed to describe the associations between the concepts at the appropriate level of abstraction [115].
In the example by [162], given the already known concept pairs {chips, beer} and {peanuts, soda}, association rule mining is then employed to generalise the pairs to provide {snacks, drinks}. The key to determining the degree of abstraction in association rules is provided by user-defined thresholds such as 2.2. Ontology Learning from Text confidence and support. Linguistics-based techniques are applicable to almost all tasks in ontology learning and are mainly dependent on the natural language processing tools. Some of the techniques include part-of-speech tagging, sentence parsing, syntactic structure analysis and dependency analysis. Other techniques rely on the use of semantic lexicon, lexico-syntactic patterns, semantic templates, subcategorisation frames, and seed words. • Part-of-speech tagging and sentence parsing provide the syntactic structures and dependency information required for further linguistic analysis. Some examples of part-of-speech tagger are Brill Tagger [33] and TreeTagger [219]. Principar [149], Minipar [150] and Link Grammar Parser [247] are among the few common sentence parsers. Other more comprehensive toolkits for natural language processing include General Architecture for Text Engineering (GATE) [57], and Natural Language Toolkit (NLTK) [25]. Despite the placement under the linguistics-based category, certain parsers are built on statistical parsing systems. For instance, the Stanford Parser [132] is a lexicalised probabilistic parser. • Syntactic structure analysis and dependency analysis examines syntactic and dependency information to uncover terms and relations at the sentence level. In syntactic structure analysis, words and modifiers in syntactic structures (e.g. noun phrases, verb phrases and prepositional phrases) are analysed to discover potential terms and relations. For example, ADJ-NN or DT-NN can be extracted as potential terms, while ignoring phrases containing other part-of-speech such as verbs. In particular, the head-modifier principle has been employed extensively to identify complex terms related through hyponymy with the heads of the terms assuming the hypernym role [105]. In dependency analysis, grammatical relations such as subject, object, adjunct and complement are used for determining more complex relations [86, 48]. • Semantic lexicon can either be general such as WordNet [177] or domainspecific such as the Unified Medical Language System (UMLS) [151]. Semantic lexicon offers easy access to a large collection of predefined words and relations. Concepts from semantic lexicon are usually organised in sets of similar words (i.e. synsets). These synonyms are employed for discovering variants of terms [250]. Relations from semantic lexicon have also been proven useful to 19 20 Chapter 2. Background ontology learning. These relations include hypernym-hyponym (i.e. parentchild relation) and meronym-holonym (i.e. part-whole relation). Many of the work related to the use of relations in WordNet can be found in the area of word sense disambiguation [265, 145] and lexical acquisitions [190]. • The use of lexico-syntactic patterns was proposed by [102], and has been employed to extract hypernyms [236] and meronyms. Lexico-syntactic patterns capture hypernymy relations using patterns such as NP such as NP, NP,..., and NP. For extracting meronyms, patterns such as NP is part of NP can be useful. The use of patterns provide reasonable precision but the recall is low [35]. 
Due to the cost and time involved in manually producing such patterns, efforts [234] have been taken to study the possibility of learning them. Semantic templates [238, 257] are similar to lexico-syntactic patterns in terms of their purpose. However, semantic templates offer more detailed rules and conditions to extract not only taxonomic relations but also complex non-taxonomic relations. • In linguistic theory, the subcategorisation frame [5, 85] of a word is the number and kinds of other words that it selects when appearing in a sentence. For example, in the sentence “Joe wrote a letter”, the verb “write” selects “Joe” and “letter” as its subject and object, respectively. In other words, “Person” and “Written-Communication” are the restrictions of selection for the subject and object of the verb “write”. The restrictions of selection extracted from parsed texts can be used in conjunction with clustering techniques to discover concepts [68]. • The use of seed words (i.e. seed terms) [281] is a common practice in many systems to guide a wide range of tasks in ontology learning. Seed words provide good starting points for the discovery of additional terms relevant to that particular domain [110]. Seed words are also used to guide the automatic construction of text corpora from the Web [15]. Logic-based techniques are the least common in ontology learning and are mainly adopted for more complex tasks involving relations and axioms. Logic-based techniques have connections with advances in knowledge representation and reasoning, and machine learning. The two main techniques employed are inductive logic programming [141, 283] and logical inference [227]. 2.2. Ontology Learning from Text • In inductive logic programming, rules are derived from existing collection of concepts and relations which are divided into positive and negative examples. The rules proves all the positive and none of the negative examples. In an example by Oliveira et al. [191], induction begins with the first positive example “tigers have fur”. With the second positive example “cats have fur”, a generalisation of “felines have fur” is obtained. Given the third positive example “dogs have fur”, the technique will attempt to generalise that “mammals have fur”. When encountered with a negative example “humans do not have fur”, then the previous generalisation will be dropped, giving only “canines and felines have fur”. • In logical inference, implicit relations are derived from existing ones using rules such as transitivity and inheritance. Using the classic example, given the premises “Socrates is a man” and “All men are mortal”, we can discover a new attribute relation stating that “Socrates is mortal”. Despite the power of inference, the possibilities of introducing invalid or conflicting relations may occur if the design of the rules is not complete. Consider the example where “human eats chicken” and “chicken eats worm” yield a new relation that is not valid. This happened because the intransitivity of the relation “eat” was not explicitly specified in advance. 2.2.3 Evaluation of Ontology Learning Techniques Evaluation is an important aspect of ontology learning, just like any other research areas. Evaluation allows individuals who use ontology learning systems to assess the resulting ontologies, and to possibly guide and refine the learning process. 
An interesting aspect of evaluation in ontology learning, as opposed to information retrieval and other areas, is that ontologies are not an end product but rather a means of accomplishing other tasks. In this sense, an evaluation approach is also useful for assisting users in choosing the ontology that best fits their requirements when faced with a multitude of options. In document retrieval, the object of evaluation is documents and how well systems provide documents that satisfy user queries, either qualitatively or quantitatively. However, in ontology learning, we cannot simply measure how well a system constructs an ontology without raising more questions. For instance, is the ontology good enough? If so, with respect to what application? An ontology is made up of different layers such as terms, concepts and relations. If an ontology is inadequate for an application, then which part of the ontology is causing the problem? Considering the intricacies of evaluating ontologies, a myriad of evaluation approaches have been proposed in the past few years. Generally, these approaches can be grouped into one of four main categories depending on the kind of ontologies being evaluated and the purpose of the evaluation [30]:

• The first approach evaluates the adequacy of ontologies in the context of other applications. For example, Porzel & Malaka [202] evaluated the use of ontological relations in the context of speech recognition. The output from the speech recognition system is compared with a gold standard generated by humans.

• The second approach uses domain-specific data sources to determine to what extent the ontologies are able to cover the corresponding domain. For instance, Brewster et al. [31] described a number of methods to evaluate the 'fit' between an ontology and the domain knowledge in the form of text corpora.

• The third approach compares ontologies using benchmarks, including other ontologies [164].

• The last approach relies on domain experts to assess how well an ontology meets a set of predefined criteria [158].

Due to the complex nature of ontologies, evaluation approaches can also be distinguished by the layers of an ontology (e.g. term, concept, relation) they evaluate [202]. More specifically, evaluations can be performed to assess the (1) correctness at the terminology layer, (2) coverage at the conceptual layer, (3) wellness at the taxonomy layer, and (4) adequacy of the non-taxonomic relations. The focus of evaluation at the terminology layer is to determine whether the terms used to identify domain-relevant concepts are included and correct. Some form of lexical reference or benchmark is typically required for evaluation in this layer. Typical precision and recall measures from information retrieval are used together with exact matching or edit distance [164] to determine performance at the terminology layer. The lexical precision and recall reflect how well the extracted terms cover the target domain. Lexical Recall (LR) measures the number of relevant terms extracted ($e_{relevant}$) divided by the total number of relevant terms in the benchmark ($b_{relevant}$), while Lexical Precision (LP) measures the number of relevant terms extracted ($e_{relevant}$) divided by the total number of terms extracted ($e_{all}$). LP and LR are defined as [214]:

$LP = \frac{e_{relevant}}{e_{all}}$ (2.1)

$LR = \frac{e_{relevant}}{b_{relevant}}$ (2.2)

The precision and recall measures can also be combined to compute the corresponding $F_{\beta}$-score.
The general formula for non-negative real $\beta$ is:

$F_{\beta} = \frac{(1 + \beta^2)(precision \times recall)}{\beta^2 \times precision + recall}$ (2.3)

Evaluation measures at the conceptual level are concerned with whether the desired domain-relevant concepts are discovered or otherwise. Lexical Overlap (LO) measures the intersection between the discovered concepts ($C_d$) and the recommended concepts ($C_m$). LO is defined as:

$LO = \frac{|C_d \cap C_m|}{|C_m|}$ (2.4)

Ontological Improvement (OI) and Ontological Loss (OL) are two additional measures to account for newly discovered concepts that are absent from the benchmark, and for concepts which exist in the benchmark but were not discovered, respectively. They are defined as [214]:

$OI = \frac{|C_d - C_m|}{|C_m|}$ (2.5)

$OL = \frac{|C_m - C_d|}{|C_m|}$ (2.6)

Evaluations at the taxonomy layer are more complicated. Performance measures for the taxonomy layer are typically divided into local and global [60]. The similarity of the concepts' positions in the learned taxonomy and in the benchmark is used to compute the local measure. The global measure is then derived by averaging the local scores for all concept pairs. One of the few measures for the taxonomy layer is the Taxonomic Overlap (TO) [164]. The computation of the global similarity between two taxonomies begins with the local overlap of their individual terms. The semantic cotopy of a term, the set of all its super- and sub-concepts, varies depending on the taxonomy. The local similarity between two taxonomies given a particular term is determined based on the overlap of the term's semantic cotopy. The global taxonomic overlap is then defined as the average of the local overlaps of all the terms in the two taxonomies. The same idea can be applied to compare the adequacy of non-taxonomic relations.

2.3 Existing Ontology Learning Systems

Before looking into some of the prominent systems and recent advances in ontology learning, a recap of three previous independent surveys is provided. The first is a report by the OntoWeb Consortium [90], a body funded by the Information Society Technologies Programme of the Commission of the European Communities. This survey listed 36 approaches for ontology learning from text. Some of the important findings presented by this review are:

• There is no detailed methodology that guides the ontology learning process from text.

• There is no fully automated system for ontology learning. Some of the systems act as tools to assist in the acquisition of lexical-semantic knowledge, while others help to extract concepts and relations from annotated corpora with the involvement of users.

• There is no general approach for evaluating the accuracy of ontology learning, or for comparing the results produced by different systems.

The second survey, released around the same time as the OntoWeb Consortium survey, was performed by Shamsfard & Barforoush [226]. The authors claimed to have studied over fifty different approaches before selecting and including seven prominent ones in their survey. The main focus of the review was to introduce a framework for comparing ontology learning approaches. The approaches included in the review merely served as test cases to be fitted into the framework. Consequently, the review provided extensive coverage of the state of the art of the relevant techniques, but offered limited discussion of the underlying problems and future outlook.
The review arrived at the following list of problems: • Much work has been conducted on discovering taxonomic relations, while nontaxonomic relations were given less attention. • Research into axiom learning was nearly unexplored. • The focus of most research is on building domain ontologies. Most of the techniques were designed to make heavy use of domain-specific patterns and static background knowledge, with little regard to the portability of the systems across different domains. 2.3. Existing Ontology Learning Systems • Current ontology learning systems are evaluated within the confinement of their domains. Finding a formal, standard method to evaluate ontology learning systems remains an open problem. • Most systems are either semi-automated or tools for supporting domain experts in curating ontologies. Complete automation and elimination of user involvement requires more research. Lastly, Ding & Foo [62] presented a survey of 12 major ontology learning projects. The authors wrapped up their survey with following findings: • Input data are mostly structured. Learning from free texts remains within the realm of research. • The task of discovering relations is very complex and a difficult problem to solve. It has turned out to be the main impedance to the progress of ontology learning. • The techniques for discovering concepts have reached a certain level of maturity. A closer look into the three survey papers revealed a consensus on several aspects of ontology learning that required more work. These conclusions are in fact in line with the findings of our literature review in the following Sections 2.3.1 and 2.3.2. These conclusions are (1) fully automated ontology learning is still in the realm of research, (2) current approaches are heavily dependent on static background knowledge, and may face difficulty in porting across different domains and languages, (3) there is no common evaluation platform for ontology learning, and (4) there is a lack of research on discovering relations. The validity of some of these conclusions will become more evident as we look into several prominent systems and recent advances in ontology learning in the following two sections. 2.3.1 Prominent Ontology Learning Systems A summary of the techniques used by five prominent ontology learning systems, and the evaluation of these techniques are provided in this section. OntoLearn OntoLearn [178, 182, 259, 260], together with Consys (for ontology validation by experts) and SymOntoX (for updating and managing ontology by experts) are part 25 26 Chapter 2. Background of a project for developing an interoperable infrastructure for small and medium enterprises in the tourism sector under the Federated European Tourism Information System1 (FETISH). OntoLearn employs both linguistics and statistics-based techniques in four major tasks to discover terms, concepts and taxonomic relations. • Preprocess texts and extract terms: Domain and general corpora are first processed using part-of-speech tagging and sentence parsing tools to produce syntactic structures including noun phrases and prepositional phrases. For relevance analysis, the approach adopts two metrics known as Domain Relevance (DR) and Domain Consensus (DC). Domain relevance measures the specificity of term t with respect to the target domain Dk through comparative analysis across a list of predefined domains D1 , ..., Dn . 
The measure is defined as

$DR(t, D_k) = \frac{P(t|D_k)}{\sum_{i=1...n} P(t|D_i)}$

where $P(t|D_k)$ and $P(t|D_i)$ are estimated as $\frac{f_{t,k}}{\sum_{t \in D_k} f_{t,k}}$ and $\frac{f_{t,i}}{\sum_{t \in D_i} f_{t,i}}$, respectively. $f_{t,k}$ and $f_{t,i}$ are the frequencies of term t in domain $D_k$ and $D_i$, respectively. Domain consensus, on the other hand, is used to measure the appearance of a term in a single document as compared to the overall occurrence in the target domain. The domain consensus of a term t in domain $D_k$ is an entropy defined as

$DC(t, D_k) = \sum_{d \in D_k} P(t|d) \log \frac{1}{P(t|d)}$

where $P(t|d)$ is the probability of encountering term t in document d of domain $D_k$.

• Form concepts: After the list of relevant terms has been identified, concepts and glossary entries from WordNet are employed for associating the terms with existing concepts and for providing definitions. The authors refer to this process as semantic interpretation. If multi-word terms are involved, the approach evaluates all possible sense combinations by intersecting and weighting common semantic patterns in the glossary until it selects the best sense combinations.

• Construct hierarchy: Once semantic interpretation has been performed on the terms to form concepts, taxonomic relations are discovered using hypernyms from WordNet to organise the concepts into domain concept trees.

More information on FETISH is available via http://sourceforge.net/projects/fetishproj/ (last accessed 25 May 2009).

An evaluation of the term extraction technique was performed using the F-measure. A tourism corpus of about 200,000 words was manually constructed from the Web. The evaluation was done by manually inspecting 6,000 of the 14,383 candidate terms, marking all the terms judged to be good domain terms, and comparing the resulting list with the list of terms automatically filtered by the system. A precision of 85.42% and a recall of 52.74% were achieved.

Text-to-Onto

Text-to-Onto [51, 162, 163, 165] is a semi-automated system that is part of an ontology management infrastructure called KAON (http://kaon.semanticweb.org/). KAON is a comprehensive tool suite for ontology creation and management. The authors claimed that the approach has been applied to the tourism and insurance sectors, but no further information was presented. Instead, ontologies for some toy domains have been constructed using this approach (the term toy domain is widely used in the research community to describe work in extremely restricted domains); these ontologies can be downloaded from http://kaon.semanticweb.org/ontologies. Text-to-Onto employs both linguistics and statistics-based techniques in six major tasks to discover terms, concepts, taxonomic relations and non-taxonomic relations.

• Preprocess texts and extract terms: Plain text extraction is performed to extract plain domain texts from semi-structured sources (i.e. HTML documents) and other formats (e.g. PDF documents). Abbreviation expansion is performed on the plain texts using rules and dictionaries to replace abbreviations and acronyms. Part-of-speech tagging and sentence parsing are performed on the preprocessed texts to produce syntactic structures and dependencies. Syntactic structure analysis is performed using weighted finite state transducers to identify important noun phrases as terms. These natural language processing tools are provided by a system called Saarbruecken Message Extraction System (SMES) [184].

• Form concepts: Concepts from a domain lexicon are required to assign new terms to predefined concepts. Unlike other approaches that employ general background knowledge such as WordNet, the lexicon adopted by Text-to-Onto is domain-specific, containing over 120,000 terms. Each term is associated with concepts available in a concept taxonomy.
Other techniques for concept formation, such as co-occurrence analysis, are also employed, but no additional information was provided.

• Construct hierarchy: Once the concepts have been formed, taxonomic relations are discovered by exploiting hypernyms from WordNet. Lexico-syntactic patterns are also employed to identify hypernymy relations in the texts. The authors refer to these hypernyms as an oracle, denoted by H. The projection H(t) returns a set of tuples (x, y), where x is a hypernym for term t and y is the number of times the algorithm has found evidence for it. Using the cosine measure for similarity and the oracle, a bottom-up hierarchical clustering is carried out with a list T of n terms as input. When given two terms which are similar according to the cosine measure, the algorithm orders them as sub-concepts if one is a hypernym of the other. If this is not the case, the most frequent common hypernym h is selected to create a new concept that accommodates both terms as siblings.

• Discover non-taxonomic relations and label non-taxonomic relations: For non-taxonomic relation extraction, association rules together with two user-defined thresholds (i.e. confidence and support) are employed to determine associations between concepts at the right level of abstraction. Typically, users start with low support and confidence to explore general relations, and later increase the values to explore more specific relations. User participation is required to validate and label the non-taxonomic relations.

An evaluation of the relation discovery technique was performed using a measure called the Generic Relations Learning Accuracy (RLA). Given a set of discovered relations D, precision is defined as |D ∩ R|/|D| and recall as |D ∩ R|/|R|, where R is the set of non-taxonomic relations prepared by domain experts. RLA is a measure that captures intuitive notions of relation matches such as utterly wrong, rather bad, near miss and direct hit. RLA is the average accuracy with which the instances of discovered relations match their best counterparts in the manually-curated gold standard. As the learning algorithm is controlled by support and confidence parameters, the evaluation is done by varying the support and confidence values. When both the support and confidence thresholds are set to 0, 8,058 relations were produced with an RLA of 0.51. Both the number of relations and the recall decrease with growing support and confidence. Precision increases at first but drops when so few relations are discovered that almost none is a direct hit. The best RLA of 0.67 is achieved with a support of 0.04 and a confidence of 0.01.

ASIUM
The authors mentioned that the system has been tested by Dassault Aviation, and has been applied on a toy domain using cooking recipe corpora in French. ASIUM employs both linguistics and statistics-based techniques to carry out five tasks to discover terms, concepts and taxonomic relations. • Preprocess texts and discover subcategorisation frames: Sentence parsing is applied on the input text using functionalities provided by a sentence parser called SYLEX [54]. SYLEX produces all interpretations of parsed sentences including attachments of noun phrases to verbs and clauses. Syntactic structure and dependency analysis is performed to extract instantiated subcategorisation frames in the form of <verb><syntactic role|preposition:head noun>∗ where the wildcard character ∗ indicates the possibility of multiple occurrences. • Extract terms and form concepts: The nouns in the arguments of the subcategorisation frames extracted from the previous step are gathered to form basic classes based on the assumption “head words occurring after the same, different prepositions (or with the same, different syntactic roles), and with the same, different verbs represent the same concept” [68]. To illustrate, suppose that we have the nouns “ballpoint pen”, “pencil and “fountain pen” occurring in different clauses as adjunct of the verb “to write” after the preposition “with”. At the same time, these nouns are the direct object of the verb “to purchase”. From the assumption, these nouns are thus considered as variants representing the same concept. • Construct hierarchy: The basic classes from the previous task are successively aggregated to form concepts of the ontology and reveal the taxonomic relations using clustering. Distance between all pairs of basic classes is computed and two basic classes are only aggregated if the distance is less than the threshold set by the user. On the one hand, the distance between two classes containing the same words with same frequencies have the distance 0. On the other hand, a pair of classes without a single common word have distance 1. The clustering 30 Chapter 2. Background algorithm works bottom-up and performs first-best using basic classes as input and builds the ontology level by level. User participation is required to validate each new cluster before it can be aggregated to a concept. An evaluation of the term extraction technique was performed using the precision measure. The evaluation uses texts from the French journal Le Monde that have been manually filtered to ensure the presence of terrorist event descriptions. The results were evaluated by two domain experts who were not aware of the ontology building process using the following indicators: OK if extracted information is correct, FALSE if extracted information is incorrect, NONE if there were no extracted information, and FALSE for all other cases. Two precision values are computed, namely, precision1 which is the ratio between OK and FALSE, and precision2 which is the same as precision1 by taking into consideration NONE. Precision1 and precision2 have the value 86% and 89%, respectively. TextStorm/Clouds TextStorm/Clouds [191, 198] is a semi-automated ontology learning system that is part of an idea sharing and generation system called Dr. Divago [197]. The aim of this approach is to build and refine domain ontology for use in Dr. Divago for searching resources in a multi-domain environment to generate musical pieces or drawings. 
No information was provided on the availability of any real-world applications, nor on testing in toy domains. TextStorm/Clouds employs logic and linguistics-based techniques to carry out six tasks to discover terms, taxonomic relations, non-taxonomic relations and axioms.

• Preprocess texts and extract terms: The part-of-speech information in WordNet is used to annotate the input text. Later, syntactic structure and dependency analysis is performed using an augmented grammar to extract syntactic structures in the form of binary predicates. The Prolog-like binary predicates represent relations between two terms. Two types of binary predicates are considered. The first type captures terms in the form of a subject and an object connected by a main verb. The second type captures the properties of compound nouns, usually in the form of modifiers. For example, the sentence "Zebra eat green grass" will result in two binary predicates, namely, eat(Zebra, grass) and property(grass, green). When working with dependent sentences, finding the concepts may not be straightforward, and this approach performs anaphora resolution to resolve ambiguities. The anaphora resolution uses a history list of discourse entities generated from preceding sentences [6]. In the presence of an anaphora, the most recent entities are given higher priority.

• Construct hierarchy, discover non-taxonomic relations and label non-taxonomic relations: Next, the binary predicates are employed to gradually aggregate terms and relations into an existing ontology with user participation. Hypernymy relations appear in binary predicates of the form is-a(X,Y), while part-of(X,Y) and contain(X,Y) provide good indicators for meronyms. Attribute-value relations are obtainable from predicates of the form property(X,Y). During the aggregation process, users may be required to introduce new predicates to connect certain terms and relations to the ontology. For example, in order to attach the predicate is-a(predator, animal) to an ontology with the root node living entity, the user will have to introduce is-a(animal, living entity).

• Extract axioms: The approach employs inductive logic programming to learn regularities by observing recurrent concepts and relations in the predicates. For instance, given the extracted predicates below

    1: is-a(panther, carnivore)
    2: eat(panther, zebra)
    3: eat(panther, gazelle)
    4: eat(zebra, grass)
    5: is-a(zebra, herbivore)
    6: eat(gazelle, grass)
    7: is-a(gazelle, herbivore)

the approach will arrive at the conclusions that

    1: eat(A, zebra) :- is-a(A, carnivore)
    2: eat(A, grass) :- is-a(A, herbivore)

These axioms describe relations between concepts in terms of their context (i.e. the set of neighbourhood connections that the arguments have).

Using the accuracy measure, the performance of the binary predicate extraction task was evaluated to determine whether the relations hold between the corresponding concepts. A total of 21 articles from the scientific domain were collected and analysed by the system. Domain experts then determined the coherence of the predicates and their accuracy with respect to the corresponding input text. The authors reported an average accuracy of 52%.

SYNDIKATE

SYNDIKATE [96, 95] is a stand-alone automated ontology learning system. The authors have applied this approach in two toy domains, namely, information technology and medicine. However, no information was provided on the availability of any real-world applications.
SYNDIKATE employs purely linguistics-based techniques to carry out five tasks to discover terms, concepts, taxonomic relations and non-taxonomic relations.

• Extract terms: Syntactic structure and dependency analysis is performed on the input text using a lexicalised dependency grammar to capture binary valency constraints (valency refers to the capacity of a verb to take a specific number and type of arguments, i.e. noun phrase positions) between a syntactic head (e.g. noun) and possible modifiers (e.g. determiners, adjectives). In order to establish a dependency relation between a head and a modifier, the term order, morpho-syntactic feature compatibility and semantic criteria have to be met. Anaphora resolution based on the centering model is included to handle pronouns.

• Form concepts, construct hierarchy, discover non-taxonomic relations and label non-taxonomic relations: Using predefined semantic templates, each term in the syntactic dependency graph is associated with a concept in the domain knowledge and, at the same time, used to instantiate the text knowledge base. The text knowledge base is essentially an annotated representation of the input texts. For example, the term "hard disk" in the graph is associated with the concept HARD DISK in the domain knowledge and, at the same time, an instance called HARD DISK3 is created in the text knowledge base. The approach then tries to find all relational links between the conceptual correlates of two words in the subgraph if both grammatical and conceptual constraints are fulfilled. The linkage may be constrained by dependency relations, by intervening lexical material, or by conceptual compatibility between the concepts involved. In cases where unknown words occur, semantic interpretation of the dependency graph involving unknown lexical items in the text knowledge base is employed to derive concept hypotheses. The structural patterns of consistency, mutual justification and analogy relative to the concept descriptions already available in the text knowledge base are used as initial evidence to create linguistic and conceptual quality labels. An inference engine is then used to estimate the overall credibility of the concept hypotheses by taking the quality labels into account.

An evaluation using the precision, recall and accuracy measures was conducted to assess the concepts and relations extracted by this system. The use of semantic interpretation to discover the relations between conceptual correlates yielded 57% recall and 97% precision, and 31% recall and 94% precision, for medicine and information technology texts, respectively. As for the formation of concepts, an accuracy of 87% was achieved. The authors also presented the performance of other aspects of the system. For example, sentence parsing in the system exhibits linear time complexity while a third-party parser runs in exponential time complexity. This behaviour is caused by the latter's ability to cope with ungrammatical input. The incompleteness of the system's parser results in a 10% loss of structural information as compared to the complete third-party parser.
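To make the template-driven instantiation described above more concrete, the following is a minimal sketch of how parsed terms might be mapped to concepts and instantiated in a text knowledge base. The lexicon table, counters and function names are hypothetical and do not reflect SYNDIKATE's actual machinery; they only illustrate the general idea of creating numbered concept instances for recognised terms.

    from collections import defaultdict

    # Hypothetical lexicon-to-concept table and per-concept instance counters.
    CONCEPT_OF = {"hard disk": "HARD_DISK", "notebook": "NOTEBOOK"}
    _counts = defaultdict(int)
    knowledge_base = []   # instances of the form (instance_id, concept, relations)

    def instantiate(term, relations=None):
        """Create a knowledge-base instance for a term covered by the lexicon."""
        concept = CONCEPT_OF.get(term.lower())
        if concept is None:
            return None   # unknown word: would instead trigger concept-hypothesis generation
        _counts[concept] += 1
        instance_id = f"{concept}{_counts[concept]}"
        knowledge_base.append((instance_id, concept, dict(relations or {})))
        return instance_id

    disk = instantiate("hard disk")
    laptop = instantiate("notebook", {"HAS-PART": disk})
    print(knowledge_base)   # e.g. [('HARD_DISK1', 'HARD_DISK', {}), ('NOTEBOOK1', ...)]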
2.3.2 Recent Advances in Ontology Learning Since the publication of the three survey papers [62, 226, 90], the research activities within the ontology learning community have been mainly focusing on (1) the advancement of relation acquisition techniques, (2) the automatic labelling of concept and relation, (3) the use of structured and unstructured Web data for relation acquisition, and (4) the diversification of evidence for term recognition. On the advancement of relation acquisition techniques, Specia & Motta [237] presented an approach for extracting semantic relations between pairs of entities from texts. The approach makes use of a lemmatiser, syntactic parser, part-of-speech tagger, and word sense disambiguation models for language processing. New entities are recognised using a named-entity recognition system. The approach also relies on a domain ontology, a knowledge base, and lexical databases. Extracted entities that exist in the knowledge base are semantically annotated with their properties. Ciaramita et al. [48] employ syntactic dependencies as potential relations. The dependency paths are treated as bi-grams, and scored with statistical measures of correlation. At the same time, the arguments of the relations can be generalised to obtain abstract concepts using algorithms for Selectional Restrictions Learning [208]. Snow et al. [234, 235] also presented an approach that employs the dependency paths extracted from parse trees. The approach receives trainings using sets of text containing known hypernym pairs. The approach then automatically discovers 33 34 Chapter 2. Background useful dependency paths that can be applied to new corpora for identifying new hypernyms. On the automatic concept and relation labelling, Kavalec & Svatek [123] studied the feasibility of label identification for relations using semantically-tagged corpus and other background knowledge. The authors suggested that the use of verbs, identified through part-of-speech tagging, can be viewed as a rough approximation of relation labels. With the help of semantically-tagged corpus to resolve the verbs to the correct word sense, the quality of relations labelling may be increased. In addition, the authors also suggested that abstract verbs identified through generalisation via WordNet can be useful labels. Jones [119] proposed in her PhD research a semi-automated technique for identifying concepts and simple technique for labelling concepts using user-defined seed words. This research was carried out exclusively using small lists of words as input. In another PhD research by Rosario [211], the author proposed the use of statistical semantic parsing to extract concepts and relations from bioscience text. In addition, the research presented the use of statistical machine learning techniques to build a knowledge representation of the concepts. The concepts and relations extracted by the proposed approach are intended to be combined by some other systems to produce larger propositions which can then be used in areas such as abductive reasoning or inductive logic programming. This approach has only been tested with a small amount of data from toy domains. On the use of Web data for relation acquisition, Sombatsrisomboon [236] proposed a simple 3-step technique for discovering taxonomic relations (i.e. hypernym/hyponym) between pairs of terms using search engines. Search engine queries are first constructed using the term pairs and patterns such as X is a/an Y. 
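To illustrate how such pattern-based probes might be assembled, the sketch below builds quoted hypernymy queries of the kind just described. The pattern strings and function are hypothetical and are not the cited author's exact implementation; they simply show how a term pair can be turned into search engine queries.

    HYPERNYM_TEMPLATE = '"{x} is {article} {y}"'
    ENUMERATION_TEMPLATE = '"{y} such as {x}"'

    def probe_queries(x: str, y: str) -> list:
        """Quoted search engine queries that would support a hypernym(y, x) link."""
        article = "an" if y[0].lower() in "aeiou" else "a"
        return [
            HYPERNYM_TEMPLATE.format(x=x, article=article, y=y),
            ENUMERATION_TEMPLATE.format(x=x, y=y),
        ]

    print(probe_queries("salmonella", "bacterium"))
    # ['"salmonella is a bacterium"', '"bacterium such as salmonella"']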
The webpages provided by search engines are then gathered to create a small corpus. Sentence parsing and syntactic structure analysis is performed on the corpus to discover taxonomic relations between the terms. Such use of Web data redundancy and patterns can also be extended to discover non-taxonomic relations. Sanchez & Moreno [217] proposed methods for discovering non-taxonomic relations using Web data. The authors developed a technique for learning domain patterns using domain-relevant verb phrases extracted from webpages provided by search engines. These domain patterns are then used to extract and label non-taxonomic relations using linguistic and statistical analysis. There is also an increasing interest in the use of structured Web data such as Wikipedia for relation acquisition. Pei et al. [196] proposed an approach for constructing ontologies using Wikipedia. The approach uses a two-step technique, namely, name mapping and logic-based mapping, to deduce the type of relations between concepts in Wikipedia. Similarly, Liu et al. [154] developed a technique called Catriple for automatically extracting triples using Wikipedia’s categorical system. The approach focuses on category pairs containing both an explicit property and an explicit value (e.g. “Category:Songs by artist”-“Category:The Beatles songs”, where “artist” is the property and “The Beatles” is the value), and category pairs containing an explicit value but an implicit property (e.g. “Category:Rock songs”-“Category:British rock songs”, where “British” is a value with no property). Sentence parsers and syntactic rules are used to extract the explicit properties and values from the category names. Weber & Buitelaar [267] proposed a system called Information System for Ontology Learning and Domain Exploration (ISOLDE) for deriving domain ontologies using manually-curated text corpora, a general-purpose named-entity tagger, and structured data on the Web (i.e. Wikipedia, Wiktionary and a German online dictionary known as DWDS). On the diversification of evidence for term recognition, Sclano & Velardi [222] developed a system called TermExtractor for identifying relevant terms in two steps. TermExtractor uses a sentence parser to parse texts and extract syntactic structures such as noun compounds, and ADJ-N and N-PREP-N sequences. The list of term candidates is then ranked and filtered using a combination of measures for realising different evidence, namely, Domain Pertinence (DP), Domain Consensus (DC), Lexical Cohesion (LC) and Structural Relevance (SR). Wermter & Hahn [269] incorporated a linguistic property of terms as evidence, namely, limited paradigmatic modifiability, into an algorithm for extracting terms. The property of paradigmatic modifiability is concerned with the extent to which the constituents of a multi-word term can be modified or substituted. The more we are able to substitute the constituents by other words, the less probable it is that the corresponding multi-word lexical unit is a term. There is also increasing interest in automatically constructing the text corpora required for term extraction using Web data. Agbago & Barriere [4] proposed the use of richness estimators to assess the suitability of webpages provided by search engines for constructing corpora for use by terminologists. Baroni & Bernardini [15] developed the BootCat technique for bootstrapping text corpora and terms using Web data and search engines.
The technique requires as input a set of seed terms. The seeds are used to build a corpus using webpages suggested by search engines. New terms are then extracted from the initial corpus, which in turn are used as seeds to build larger corpora. There are several other recent advances that fall outside the aforementioned 35 36 Chapter 2. Background groups. Novacek & Smrz [187] developed two frameworks for bottom-up generation and merging of ontologies called OLE and BOLE. The latter is a domain-specific adaptation of the former for learning bio-ontologies. OLE is designed and implemented as a modular framework consisting of several components for providing solutions to different tasks in ontology learning. For example, the OLITE module is responsible for preprocessing plain text and creating mini ontologies. PALEA is a module responsible for extracting new semantic relation patterns while OLEMAN merges the mini ontologies resulting from the OLITE module and updates the base domain ontology. The authors mentioned that any techniques for automated ontology learning can be employed as an independent part of any modules. Another research which contributed from a systemic point of view is the CORPORUM system [72]. OntoExtract is part of the CORPORUM OntoBuilder toolbox that analyses natural language texts for generating lightweight ontologies. No specific detail was provided regarding the techniques employed by OntoExtract. The author [72] merely mentioned that OntoExtract uses a repository of background knowledge to parse, tokenise and analyse texts on both the lexical and syntactic level, and generates nodes and relations between key terms. Liu et al. [156] presented an approach to semi-automatically extend and refine ontologies using text mining techniques. The approach make use of news from media sites to expand a seed ontology by first creating a semantic network through co-occurrence analysis, trigger phrase analysis, and disambiguation based on the WordNet lexical dictionary. Spreading activation is then applied on the resulting semantic network to find the most probable candidates for inclusion in the extended ontology. 2.4 Applications of Ontologies Ontologies are an important part of the standard stack for the Semantic Web6 by the World Wide Web Consortium (W3C). Ontologies are used to exchange data among multiple heterogeneous systems, provide services in an agent-based environment, and promote the reusability of knowledge bases. While the dream of realising the Semantic Web is still years away, ontologies have already found their ways into a myriad of applications such as document retrieval, question answering, image retrieval, agent interoperability and document annotation. Some of the research areas which have found use for ontologies are: • Document retrieval: Paralic & Kostial [194] developed an document retrieval 6 http://www.w3.org/2001/sw/ 2.4. Applications of Ontologies 37 system based on the use of ontologies. The authors demonstrated that the retrieval precision and recall of the ontology-based information retrieval system outperforms techniques based on latent semantic indexing and full-text search. The system registers every new document to several concepts in the ontology. Whenever retrieval requests arrive, resources are retrieved based on the associations between concepts, and not on partial or exact term matching. Similarly, Vallet et al. [255] and Castells et al. [39] proposed a model that uses knowledge in ontologies for improving document retrieval. 
The retrieval model includes an annotation weighting algorithm and a ranking algorithm based on the classic vector-space model. Keyword-based search is incorporated into their approach to ensure robustness in the event of incompleteness of ontology. • Question answering: Atzeni et al. [10] reported the development of an ontologybased question answering system for the Web sites of two European universities. The system accepts questions and produces answers in natural language. The system is being investigated in the context of an European Union project called MOSES. • Image retrieval: Hyvonen et al. [111] developed a system that uses ontologies to assist image retrieval. Images are first annotated with concepts in an ontology. Users are then presented with the same ontology to facilitate focused image retrieval and the browsing of semantically-related images using the right concepts. • Multi-agent system interoperability: Malucelli & Oliveira [167] proposed an ontology-based service to assist the communication and negotiation between agents in a decentralised and distributed system architecture. The agents typically have their own heterogeneous private vocabularies. The service uses a central ontology agent to monitor and lead the communication process between the heterogeneous agents without having to map all the ontologies involved. • Document annotation: Corcho [55] surveyed several approaches for annotating webpages with ontological elements for improving information retrieval. Many of the approaches described in the survey paper rely on manually-curated ontologies for annotation using a variety of tools such as SHOE Annotator7 7 http://www.cs.umd.edu/projects/plus/SHOE/KnowledgeAnnotator.html 38 Chapter 2. Background [103], CREAM [101], MnM8 [258], and OntoAnnotate [241]. In addition to the above-mentioned research areas, ontologies have also been deployed in certain applications across different domains. One of the most successful application area of ontologies is bioinformatics. Bioinformatics have thrived on the advances in ontology learning techniques and the availability of manually-curated terminologies and ontologies (e.g. Unified Medical Language System [151], Gene Ontology [8] and other small domain ontologies at www.obofoundry.org). The computable knowledge in ontologies is also proving to be a valuable resource for reasoning and knowledge discovery in biomedical decision support systems. For example, the inference that a disease of the myocardium is a heart problem is possible using the subsumption relations in an ontology of disease classification based on anatomic locations [52]. In addition, terminologies and ontologies are commonly used for annotating biological datasets, biomedical literature and patient records, and improving the access and retrieval of biomedical information [27]. For instance, Baker et al. [13] presented a document query and delivery system for the field of lipidomics9 . The main aim of the system is to overcome the navigation challenges that hinder the translation of scientific literatures into actionable knowledge. The system allows users to access tagged documents containing lipid, protein and disease names using description logic-based query capability that comes with the semiautomatically created lipid ontology. The lipid ontology contains a total of 672 concepts. 
The ontology is the result of merging existing biological terminologies, knowledge from domain experts, and output from a customised text mining system that recognises lipid-specific nomenclature. (Lipidomics is the study of pathways and networks of cellular lipids in biological systems.) Another visible application of ontologies is in the manufacturing industry. Cho et al. [43] looked at the current approach for locating and comparing parts information in an e-procurement setting. At present, buyers are faced with the challenge of accessing and navigating through different parts libraries from multiple suppliers using different search procedures. The authors introduced the use of the “Parts Library Concept Ontology” to integrate heterogeneous parts libraries, enabling the consistent identification and systematic structuring of domain concepts. Lemaignan et al. [143] presented a proposal for a manufacturing upper ontology. The authors stressed the importance of ontologies as a common way of describing manufacturing processes for product lifecycle management. The use of ontologies ensures uniformity in assertions throughout a product’s lifecycle, and the seamless flow of data between heterogeneous manufacturing environments. For instance, assume that we have these relations in an ontology:
isMadeOf(part, rawMaterial)
isA(aluminium, rawMaterial)
isA(drilling, operation)
isMachinedBy(rawMaterial, operation)
and the drilling operation has the attributes drillSpeed and drillDiameter. Using these elements, we can easily specify rules such as: if isMachinedBy(aluminium, drilling) and drillDiameter is less than 5 mm, then drillSpeed should be 3000 rpm [143]. This ontology allows a uniform interpretation of assertions such as isMadeOf(part, aluminium) anywhere along the product lifecycle, thus facilitating the inference of standard information such as the drill speed.
2.5 Chapter Summary
In this chapter, an overview of ontologies and ontology learning from text was provided. In particular, we looked at the types of output, techniques and evaluation methods related to ontology learning. The differences between a heavyweight (i.e. formal) and a lightweight ontology were also explained. Several prominent ontology learning systems and some recent advances in the field were summarised. Finally, some current notable applications were included to demonstrate the applicability of ontologies to a wide range of domains. The use of ontologies for real-world applications in the areas of bioinformatics and the manufacturing industry was highlighted. Overall, it was concluded that the automatic and practical construction of full-fledged formal ontologies from text across different domains is currently beyond the reach of conventional systems. Many current ontology learning systems are still struggling to achieve high-performance term recognition, let alone more complex tasks (e.g. relation acquisition, axiom learning). An interesting point revealed during the literature review is that most systems ignore the fact that the static background knowledge relied upon by their techniques is a scarce resource and may not have adequate size and coverage. In particular, all existing term recognition techniques rest on the false assumption that the required domain corpora will always be available. Only recently has there been growing interest in automatically constructing text corpora using Web data.
However, the governing philosophy behind 40 Chapter 2. Background these existing corpus construction techniques is inadequate for creating very large high-quality text corpora. In regard to relation acquisition, existing techniques rely heavily on static background knowledge, especially semantic lexicon, such as WordNet. While there is an increasing interest in the use of dynamic Web data for relation acquisition, more research work is still required. For instance, new techniques are appearing every now and then that make use of Wikipedia for finding semantic relations between two words. However, these techniques often leave out the details on how to cope with words that do not appear in Wikipedia. Moreover, the use of clustering techniques for acquiring semantic relations may appear less attractive due to the complications in feature extraction and preparation. The literature review also exposes the lack of treatment for data cleanliness during ontology learning. As the use of Web data becomes more common, integrated techniques for removing noises in texts are turning into a necessity. All in all, it is safe to conclude that there is currently no single system that systematically uses dynamic Web data to meet the requirements for every stage of the ontology learning process. There are several key areas that require more attention, namely, (1) integrated techniques for cleaning noisy text, (2) high-performance term recognition techniques, (3) high-quality corpus construction for term recognition, and (4) dynamic Web data for clustering and relation acquisition. Our proposed ontology learning system is designed specifically to address these key areas. In the subsequent six chapters (i.e. Chapter 3 to 8), details are provided on the design, development and testing of novel techniques for the five phases (i.e. text preprocessing, text processing, term recognition, corpus construction, relation acquisition) of the proposed system. CHAPTER 3 Text Preprocessing Abstract An increasing number of ontology learning systems are gearing towards the use of online sources such as company intranet and the World Wide Web. Despite such rise, not much work can be found in aspects of preprocessing and cleaning noisy texts from online sources. This chapter presents an enhancement of the Integrated Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration (ISSAC) technique. ISSAC is implemented as part of the text preprocessing phase in an ontology learning system. New evaluations performed on the enhanced ISSAC using 700 chat records reveal an improved accuracy of 98% as compared to 96.5% and 71% based on the use of basic ISSAC and of Aspell, respectively. 3.1 Introduction Ontology is gaining applicability across a wide range of applications such as information retrieval, knowledge acquisition and management, and the Semantic Web. The manual construction and maintenance of ontologies was never a long-term solution due to factors such as the high cost of expertise and the constant change in knowledge. These factors have prompted an increasing effort in automatic and semiautomatic learning of ontologies using texts from electronic sources. A particular source of text that is becoming popular is the World Wide Web. The quality of texts from online sources for ontology learning can vary anywhere between noisy and clean. On the one hand, the quality of texts in the form of blogs, emails and chat logs can be extremely poor. 
The sentences in noisy texts are typically full of spelling errors, ad-hoc abbreviations and improper casing. On the other hand, clean sources are typically prepared and conformed to certain standards such as those in the academia and journalism. Some common clean sources include news articles from online media sites, and scientific papers. Different text quality requires different treatments during the preprocessing phase and noisy texts can be much more demanding. An increasing number of approaches are gearing towards the use of online sources 0 This chapter appeared in the Proceedings of the IJCAI Workshop on Analytics for Noisy Unstructured Text Data (AND), Hyderabad, India, 2007, with the title “Enhanced Integrated Scoring for Cleaning Dirty Texts”. 41 42 Chapter 3. Text Preprocessing such as corporate intranet [126] and search engines retrieved documents [51] for different aspects of ontology learning. Despite such growth, only a small number of researchers [165, 187] acknowledge the effect of text cleanliness on the quality of their ontology learning output. With the prevalence of online sources, this “...annoying phase of text cleaning...”[176] has become inevitable and ontology learning systems can no longer ignore the issue of text cleanliness. An effort by Tang et al. [246] showed that the accuracy of term extraction in text mining improved by 38-45% (F1 -measure) with the additional cleaning performed on the input texts (i.e. emails). Integrated techniques for correcting spelling errors, abbreviations and improper casing are becoming increasingly appealing as the boundaries between different errors in online sources are blurred. Along the same line of thought, Clark [53] defended that “...a unified tool is appropriate because of certain specific sorts of errors”. To illustrate this idea, consider the error word “cta”. Do we immediately take it as a spelling error and correct it as “cat”, or is there a problem with the letter casing, which makes it a probable acronym? It is obvious that the problems of spelling error, abbreviation and letter casing are inter-related to a certain extent. The challenge of providing a highly accurate integrated technique for automatically cleaning noisy text in ontology learning remains to be addressed. In an effort to provide an integrated technique to solve spelling errors, ad-hoc abbreviations and improper casing simultaneously, we have developed an Integrated Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration (ISSAC) 1 technique [273]. The basic ISSAC uses six weights from different sources for automatically correcting spelling error, expanding abbreviations and restoring improper casing. These includes the original rank by the spell checker Aspell [9], reuse factor, abbreviation factor, normalised edit distance, domain significance and general significance. Despite the achievement of 96.5% in accuracy by the basic ISSAC, several drawbacks have been identified that require additional work. In this chapter, we present the enhancement of the basic ISSAC. New evaluations performed on seven different sets of chat records yield an improved accuracy of 98% as compared to 96.5% and 71% based on the use of basic ISSAC and of Aspell, respectively. In Section 2, we present a summary of work related to spelling error detection and correction, abbreviation expansion, and other cleaning tasks in general. 
In Section 3, we summarise the basic ISSAC. In Section 4, we propose the enhancement strategies for ISSAC. The evaluation results and discussions are presented in Section 5. We summarise and conclude this chapter with a future outlook in Section 6. (The foundation work on ISSAC appeared in the Proceedings of the 5th Australasian Conference on Data Mining (AusDM), Sydney, Australia, 2006, with the title “Integrated Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration in Dirty Text”.)
3.2 Related Work
Spelling error detection and correction is the task of recognising misspellings in texts and providing suggestions for correcting the errors. For example, detecting “cta” as an error and suggesting that the error be replaced with “cat”, “act” or “tac”. More information is usually required to select a correct replacement from a list of suggestions. Two of the most studied classes of techniques are minimum edit distance and similarity key. The idea of minimum edit distance techniques began with Damerau [58] and Levenshtein [146]. The Damerau-Levenshtein distance is the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other. For example, changing the word “wear” to “beard” requires a minimum of two operations, namely, a substitution of ‘w’ with ‘b’, and an insertion of ‘d’. Many variants were developed subsequently, such as the algorithm by Wagner & Fischer [266]. The second class of techniques is the similarity key. The main idea behind similarity key techniques is to map every string into a key such that similarly spelt strings will have identical keys [135]. Hence, the key, computed for each spelling error, will act as a pointer to all similarly spelt words (i.e. suggestions) in the dictionary. One of the earliest implementations is the SOUNDEX system [189]. SOUNDEX is a phonetic algorithm for indexing words based on their pronunciation in English. SOUNDEX works by mapping a word into a key consisting of its first letter followed by a sequence of numbers. For example, SOUNDEX replaces a letter li ∈ {A, E, I, O, U, H, W, Y} with 0 and li ∈ {R} with 6, and hence, wear → w006 → w6 and ware → w060 → w6. Since SOUNDEX, many improved variants have been developed, such as the Metaphone and Double-Metaphone algorithms [199], the Daitch-Mokotoff Soundex [138] for Eastern European languages, and others [108]. One well-known implementation that utilises the similarity key technique is Aspell [9]. Aspell is based on the Metaphone algorithm and the near-miss strategy from its predecessor Ispell [134]. Aspell begins by converting a misspelt word to its soundslike equivalent (i.e. metaphone) and then finding all words that have a soundslike within one or two edit distances from the original word’s soundslike (see http://aspell.net/man-html/Aspell-Suggestion-Strategy.html). These soundslike words are the basis of the suggestions by Aspell.
Most of the work in detecting and correcting spelling errors, and in expanding abbreviations, is carried out separately. The task of abbreviation expansion deals with recognising shorter forms of words (e.g. “abbr.” or “abbrev.”), acronyms (e.g. “NATO”) and initialisms (e.g. “HTML”, “FBI”), and expanding them to their corresponding words. The work on detecting and expanding abbreviations is mostly conducted in the realm of named-entity recognition and word-sense disambiguation.
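As a concrete illustration of the minimum edit distance idea discussed above, the following is a generic, textbook-style sketch of the Damerau-Levenshtein distance (insertions, deletions, substitutions and adjacent transpositions); it is not the implementation used by Aspell or by ISSAC.

```python
# An illustrative implementation of the Damerau-Levenshtein distance: the
# minimal number of insertions, deletions, substitutions and adjacent
# transpositions needed to turn one string into another. This is a generic
# sketch, not the implementation used in this thesis.

def damerau_levenshtein(a: str, b: str) -> int:
    m, n = len(a), len(b)
    # d[i][j] = distance between the first i characters of a and first j of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost  # substitution
            )
            # adjacent transposition (the Damerau extension)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
    return d[m][n]

if __name__ == "__main__":
    print(damerau_levenshtein("wear", "beard"))  # 2: substitute 'w'->'b', insert 'd'
    print(damerau_levenshtein("cta", "cat"))     # 1: transpose 't' and 'a'
```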
The technique presented by Schwartz & Hearst [221] begins with the extraction of all abbreviations and definition candidates based on the adjacency to parentheses. A candidate is considered as the correct definition for an abbreviation if they appears in the same sentence, and the candidate has no more than min(|A| + 5, |A| ∗ 2) words, where |A| is the number of characters in an abbreviation A. Park & Byrd [195] presented an algorithm based on rules and heuristics for extracting definitions for abbreviations from texts. Several factors are employed in this technique such as syntactic cues, priority of rules, distance between abbreviation and definition and word casing. Pakhomov [193] proposed a semi-supervised technique that employs a hand-crafted table of abbreviations and their definitions for training a maximum entropy classifier. For case restoration, improper letter casings in words are detected and restored. For example, detecting the letter ‘j’ in “jones” as improper and correcting the word to produce “Jones”. Lita et al. [153] presented an approach for restoring cases based on the context in which the word exists. The approach first captures the context surrounding a word and approximates the meaning using ngrams. The casing of the letters in a word will depend on the most likely meaning of the sentence. Mikheev [176] presented a technique for identifying sentence boundaries, disambiguating capitalised words and identifying abbreviations using a list of common words. The technique can be described in four steps: identify abbreviations in texts, disambiguate ambiguously capitalised words, assign unambiguous sentence boundaries and disambiguate sentence boundaries if an abbreviation is followed by a proper name. In the context of ontology learning and other related areas such as text mining, spelling error correction and abbreviation expansion are mainly carried out as part of the text preprocessing (i.e. text cleaning, text normalisation) phase. Some other common tasks in text preprocessing include plain text extraction (i.e. format conversion, HTML/XML tag stripping, table identification [185]), sentence boundary detection [243], case restoration [176], part-of-speech tagging [33] and sentence 3 Some researchers refer to this relationship as abbreviation and definition or short-form and long-form. 3.3. Basic ISSAC as Part of Text Preprocessing parsing [149]. A review by Gomez-Perez & Manzano-Macho [90] showed that nearly all ontology learning systems in the survey perform only shallow linguistic analysis such as part-of-speech tagging during the text preprocessing phase. These existing systems require the input to be clean and hence, the techniques for correcting spelling errors, expanding abbreviations and restoring cases are considered as unnecessary. Ontology learning systems such as Text-to-Onto [165] and BOLE [187] are the few exceptions. In addition to shallow linguistic analysis, these systems incorporate some cleaning tasks. Text-to-Onto extracts plain text from various formats such as PDF, HTML, XML, and identifies and replaces abbreviations using substitution rules based on regular expressions. The text preprocessing phase of BOLE consists of sentence boundary detection, irrelevant sentence elimination and text tokenisation using Natural Language Toolkit (NLTK). In a text mining system for extracting topics from chat records, Castellanos [38] presented a comprehensive list of text preprocessing techniques. 
The system employs a thesaurus, constructed using the Smith-Waterman algorithm [233], for correcting spelling errors and identifying abbreviations. In addition, the system removes program code from texts, and detects sentence boundaries based on simple heuristics (e.g. shorter lines in program code, and punctuation marks followed by an upper case letter). Tang et al. [246] presented a cascaded technique for cleaning emails prior to text mining. The technique is composed of four passes: non-text filtering for eliminating irrelevant data such as email headers, sentence normalisation, case restoration, and spelling error correction for transforming relevant text into canonical form.
Many of the techniques mentioned above perform only one out of the three cleaning tasks (i.e. spelling error correction, abbreviation expansion, case restoration). In addition, the evaluations conducted to obtain the accuracy are performed in different settings (e.g. no common benchmark, test data or agreed measure of accuracy). Hence, it is not possible to compare these different techniques based on the accuracy reported in the respective papers. As pointed out earlier, only a small number of integrated techniques are available for handling all three tasks. Such techniques are usually embedded as part of a larger text preprocessing module. Consequently, evaluations of the individual cleaning tasks in such environments are not available.
3.3 Basic ISSAC as Part of Text Preprocessing
ISSAC was designed and implemented as part of the text preprocessing phase in an ontology learning system that uses chat records as input. The use of chat records has required us to place more effort on ensuring text cleanliness during the preprocessing phase. Figure 3.1 highlights the various spelling errors, ad-hoc abbreviations and improper casing that occur much more frequently in chat records than in clean texts.
Figure 3.1: Examples of spelling errors, ad-hoc abbreviations and improper casing in a chat record.
Prior to spelling error correction, abbreviation expansion and case restoration, three tasks are performed as part of the text preprocessing phase. Firstly, plain text extraction is conducted to remove HTML and XML tags from the chat records using regular expressions and the Perl modules XML::Twig (http://search.cpan.org/dist/XML-Twig-3.26/) and HTML::Strip (http://search.cpan.org/dist/HTML-Strip-1.06/). Secondly, identification of URLs, emails, emoticons and tables is performed (an emoticon, also called a smiley, is a sequence of ordinary printable characters or a small image intended to represent a human facial expression and convey an emotion). Such information is extracted and set aside for assisting in other business intelligence analysis. Tables are removed using the signatures of a table, such as multiple spaces between words and words aligned in columns over multiple lines [38]. Thirdly, sentence boundary detection is performed using the Lingua::EN::Sentence Perl module (http://search.cpan.org/dist/Lingua-EN-Sentence/).
Once these tasks are complete, each sentence in the input text (e.g. chat record) is tokenised to obtain a set of words T = {t_1, ..., t_w}. The set T is then fed into Aspell. For each word e that Aspell considers as erroneous, a list of ranked suggestions S is produced. Initially,
S = {s_{1,1}, ..., s_{n,n}} is an ordered list of n suggestions, where s_{j,i} is the j-th suggestion with rank i (a smaller i indicates higher confidence in the suggested word). If e appears in the abbreviation dictionary, the list S is augmented by adding all the corresponding m expansions in front of S as additional suggestions with rank 1. In addition, the error word e is appended at the end of S with rank n + 1. These augmentations produce an extended list S = {s_{1,1}, ..., s_{m,1}, s_{m+1,1}, ..., s_{m+n,n}, s_{m+n+1,n+1}}, which is a combination of m suggestions from the abbreviation dictionary (if e is a potential abbreviation), n suggestions by Aspell, and the error word e itself. Placing the error word e back into the list of possible replacements serves one purpose: to ensure that if no better replacement is available, we keep the error word e as it is.
Once the extended list S is obtained, each suggestion s_{j,i} is re-ranked using ISSAC. The new score for the j-th suggestion with original rank i is defined as

NS(s_{j,i}) = i^{-1} + NED(e, s_{j,i}) + RF(e, s_{j,i}) + AF(s_{j,i}) + DS(l, s_{j,i}, r) + GS(l, s_{j,i}, r)

where
• NED(e, s_{j,i}) ∈ (0, 1] is the normalised edit distance, defined as (ED(e, s_{j,i}) + 1)^{-1}, where ED is the minimum edit distance between e and s_{j,i}.
• RF(e, s_{j,i}) ∈ {0, 1} is the boolean reuse factor for providing more weight to a suggestion s_{j,i} that has been previously used for correcting the error e. The reuse factor is obtained through a lookup against a history list that ISSAC keeps to record previous corrections. RF(e, s_{j,i}) yields a factor of 1 if the error e has been previously corrected with s_{j,i}, and 0 otherwise.
• AF(s_{j,i}) ∈ {0, 1} is the abbreviation factor for denoting that s_{j,i} is a potential abbreviation. Through a lookup against the abbreviation dictionary, AF(s_{j,i}) yields a factor of 1 if the suggestion s_{j,i} exists in the dictionary, and 0 otherwise. When the scoring process takes place and the corresponding expansions for potential abbreviations are required, www.stands4.com is consulted. A copy of the expansion is stored in a local abbreviation dictionary for future reference.
• DS(l, s_{j,i}, r) ∈ [0, 1] measures the domain significance of the suggestion s_{j,i} based on its appearance in the domain corpora, taking into account the neighbouring words l and r. This domain significance weight is inspired by the TF-IDF [210] measure commonly used for information retrieval. The weight is defined as the ratio between the frequency of occurrence of s_{j,i} (individually, and within l and r) in the domain corpora and the sum of the frequencies of occurrence of all suggestions (individually, and within l and r).
• GS(l, s_{j,i}, r) ∈ [0, 1] measures the general significance of the suggestion s_{j,i} based on its appearance in the general collection (e.g. webpages indexed by the Google search engine). The purpose of this general significance weight is similar to that of the domain significance. In addition, the use of dynamic Web data allows ISSAC to cope with language change in a way that is not possible with static corpora and Aspell. The weight is defined as the ratio between the number of documents in the general collection containing s_{j,i} within l and r and the number of documents in the general collection that contain s_{j,i} alone.
Both the ratios in DS and GS are offset by a measure similar to that of the IDF [210]. For further details on DS and GS, please refer to Wong et al. [273].
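The following is a minimal sketch, under simplifying assumptions, of how the combined score NS(s_{j,i}) above might be assembled and used to pick a replacement. The history list, abbreviation dictionary and the DS/GS functions are stand-ins supplied by the caller; in the actual system the suggestions come from Aspell, and the significance weights are computed from the domain corpora and Web page counts.

```python
# A minimal sketch of the combined ISSAC score NS(s_{j,i}). The dictionaries
# and the domain/general significance functions are simplified stand-ins; in
# the actual system suggestions come from Aspell, and DS/GS are computed from
# the domain corpora and Web page counts respectively.

def edit_distance(a: str, b: str) -> int:
    # plain Levenshtein distance (insertions, deletions, substitutions)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def issac_score(error, suggestion, rank, left, right,
                history, abbreviations, domain_sig, general_sig):
    """NS = i^-1 + NED + RF + AF + DS + GS for one suggestion."""
    ned = 1.0 / (edit_distance(error, suggestion) + 1)        # NED in (0, 1]
    rf = 1.0 if history.get(error) == suggestion else 0.0     # reuse factor
    af = 1.0 if suggestion in abbreviations else 0.0          # abbreviation factor
    ds = domain_sig(left, suggestion, right)                  # domain significance
    gs = general_sig(left, suggestion, right)                 # general significance
    return (1.0 / rank) + ned + rf + af + ds + gs

def best_replacement(error, suggestions, left, right,
                     history, abbreviations, domain_sig, general_sig):
    """Pick the suggestion (word, original_rank) with the highest NS."""
    word, _ = max(suggestions,
                  key=lambda s: issac_score(error, s[0], s[1], left, right,
                                            history, abbreviations,
                                            domain_sig, general_sig))
    return word

if __name__ == "__main__":
    # toy example: with flat DS/GS, the Aspell rank and edit distance dominate
    suggestions = [("charge", 1), ("change", 6), ("chage", 7)]
    print(best_replacement("chage", suggestions, "cannot", "an",
                           history={}, abbreviations=set(),
                           domain_sig=lambda l, s, r: 0.0,
                           general_sig=lambda l, s, r: 0.0))  # -> "charge"
```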
3.4 Enhancement of ISSAC
The list of suggestions and the initial ranks provided by Aspell are an integral part of ISSAC. Figure 3.2 summarises the accuracy of basic ISSAC obtained from the previous evaluations [273] using four sets of chat records (where each set contains 100 chat records). The achievement of 74.4% accuracy by Aspell in the previous evaluations, given the extremely poor nature of the texts, demonstrated the strength of the Metaphone algorithm and the near-miss strategy. The further increase of 22% in accuracy using basic ISSAC demonstrated the potential of the combined weights NS(s_{j,i}).

                                            Evaluation 1   Evaluation 2   Evaluation 3   Evaluation 4   Average
Correct replacements using basic ISSAC      97.06%         97.07%         95.92%         96.20%         96.56%
Correct replacements using Aspell           74.61%         75.94%         71.81%         75.19%         74.39%
Figure 3.2: The accuracy of basic ISSAC from previous evaluations.

Based on the previous evaluation results, we discuss in detail the three causes behind the remaining 3.5% of errors which were incorrectly replaced. Figure 3.3 shows the breakdown of the causes behind the incorrect replacements by basic ISSAC. The three causes are summarised as follows:

Causes                                        Basic ISSAC
Correct replacement not in suggestion list    2.00%
Inadequate/erroneous neighbouring words       1.00%
Anomalies                                     0.50%
Figure 3.3: The breakdown of the causes behind the incorrect replacements by basic ISSAC.

1. The accuracy of the corrections by basic ISSAC is bounded by the coverage of the list of suggestions S produced by Aspell. About 2% of the wrong replacements are due to the absence of the correct suggestions from those produced by Aspell. For example, the error “prder” in the context of “The prder number” was incorrectly replaced by Aspell and basic ISSAC with “parader” and “prder”, respectively. After a look into the evaluation log, we realised that the correct replacement “order” was not in S.
2. The use of the two immediate neighbouring words l and r to inject more contextual consideration into the domain and general significance has contributed to a huge increase in accuracy. Nonetheless, the use of l and r in ISSAC is by no means perfect. About 1% of the wrong replacements are due to two flaws related to l and r, namely, neighbouring words with incorrect spelling, and inadequate neighbouring words. Incorrectly spelt neighbouring words inject false contextual information into the computation of DS and GS. The neighbouring words may also be considered inadequate due to their indiscriminative nature. For example, the left word “both” in “both ocats are” is too general and does not offer much discriminatory power for distinguishing between suggestions such as “coats”, “cats” and “acts”.
3. The remaining 0.5% are considered anomalies that basic ISSAC cannot address. There are two cases of anomalies: the equally likely nature of all possible suggestions, and the contrasting values of certain weights. As an example of the first case, consider the error in “Janice cheung has”. The left word is correctly spelt and has adequately confined the suggestions to proper names. In addition, the correct replacement “Cheung” is present in the suggestion list S. Despite all this, both Aspell and ISSAC decided to replace “cheung” with “Cheng”. A look into the evaluation log reveals that the surname “Cheung” is as common as “Cheng”. In such cases, the probability of replacing e with
the correct replacement is c^{-1} (i.e. 1/c), where c is the number of suggestions with approximately the same NS(s_{j,i}). The second case of anomalies is due to the contrasting values of certain weights, especially NED and i^{-1}, which cause wrong replacements. For example, in the case of “cannot chage an”, basic ISSAC replaced the error “chage” with “charge” instead of “change”. All the other weights for “change” are comparatively higher (i.e. DS and GS) than, or the same (i.e. RF, NED and AF) as, those for “charge”. This inclination indicates that “change” is the most appropriate replacement given the various cues. Nonetheless, the original rank by Aspell for “charge” is i = 1 while for “change” it is i = 6. As a smaller i indicates higher confidence, the inverse of the original Aspell rank, i^{-1}, results in the plummeting of the combined weight for “change”.
In this chapter, we approach the enhancement of ISSAC from the perspective of the first and second causes. For this purpose, we proposed three modifications to the basic ISSAC:
1. We proposed the use of additional spell checking facilities as the answer to the first cause (i.e. compensating for the inadequacy of Aspell). Google spellcheck, which is based on a statistical analysis of words on the World Wide Web (see http://www.google.com/help/features.html), appears to be the ideal candidate for complementing Aspell. Using the Google SOAP API (http://www.google.com/apis), we can have easy access to one of the many functions provided by Google, namely, Google spellcheck. Our new evaluations show that Google spellcheck works well for certain errors where Aspell fails to suggest the correct replacements. Similar to the expansions for abbreviations and the suggestions by Aspell, the suggestion provided by Google is added at the front of the list S with rank 1. This places the suggestion by Google on the same rank as the first suggestion by Aspell.
2. The basic ISSAC relies only on Aspell for determining if a word is an error. For this purpose, we decided to include Google spellcheck as a complement. If a word is detected as a possible error by either Aspell or Google spellcheck, then we have adequate evidence to proceed and correct it using enhanced ISSAC. In addition, errors that result in valid words are not recognised by Aspell. For example, Aspell does not recognise “hat” as an error. If we were to take into consideration the neighbours that it co-occurs with, namely, “suret hat they”, then “hat” is certainly an error. Google contributes in this aspect. Moreover, the use of Google spellcheck has also indirectly provided ISSAC with a partial solution to the second cause (i.e. erroneous neighbouring words). Whenever Google checks a word for spelling errors, the neighbouring words are simultaneously examined. For example, while providing a suggestion for the error “tha”, Google simultaneously takes into consideration the neighbours, namely, “sure tha tthey”, and suggests that the word to its right, “tthey”, be replaced with “they”. Google spellcheck’s ability to consider contextual information is empowered by its large search engine index and the statistical evidence that comes with it. Word collocations are ruled out as statistically improbable when their co-occurrences are extremely low. In such cases, Google attempts to suggest better collocates (i.e. neighbouring words).
3.
We have altered the reuse factor RF by eliminating the use of history list that gives more weight to suggestions that have been previously chosen to correct particular errors. We have come to realise that there is no guarantee a particular replacement for an error is correct. When a replacement is incorrect and is stored in the history list, the reuse factor will propagate the wrong replacement to the subsequent corrections. Therefore, we adapted the reuse factor to support the use of Google spellcheck in the form of entries in a local spelling dictionary. There are two types of entries in the spelling dictionary. The main type is the suggestions by Google for spelling errors. This type of entries is automatically updated every time Google suggest a replacement for an error. The second type, which is optional, is the suggestions for errors provided by users. Hence the modified reuse factor will now assign the weight of 1 to suggestions that are provided by Google spellcheck or predefined by users. Despite a certain level of superiority that Google spellcheck exhibits in the three enhancements, Aspell remains necessary. Google spellcheck is based on the occurrences of words on the World Wide Web. Determining whether a word is an error or not depends very much on its popularity. Even if a word does not exist in the English dictionary, Google will not judge it as an error as long as its popularity exceeds some threshold set by Google. This popularity approach has both its pros and cons. On the one hand, such approach is suitable for recognising proper nouns, especially emerging ones, such as “iPod” and “Xbox”. On the other hand, words such as “thanx” in the context of “[ok] [thanx] [for]” is not considered as an error by Google even though it should be corrected. 51 52 Chapter 3. Text Preprocessing The algorithm for text preprocessing that comprises of the basic ISSAC together with all its enhancements is described in Algorithm 1. 3.5 Evaluation and Discussion Evaluations are conducted using chat records provided by 247Customer.com. As a provider of customer lifecycle management services, the chat records by 247Customer.com offer a rich source of domain information in a natural setting (i.e. conversations between customers and agents). Consequently, these chat records are filled with spelling errors, ad-hoc abbreviations, improper casing and many other problems that are considered as intolerable by existing language and speech applications. Therefore, these chat records become the ideal source for evaluating ISSAC. Four sets of test data, each comes in an XML file of 100 chat sessions, were employed in the previous evaluations [273]. To evaluate the enhanced ISSAC, we have included an additional three sets, which brings the total number of chat records to 700. The chat records and the Google search engine constitute the domain corpora and the general collection, respectively. GNU Aspell version 0.60.4 [9] is employed for detecting errors and generating suggestions. Similar to the previous evaluations, determining the correctness of replacements by Aspell and enhanced ISSAC is a delicate process that must be performed manually. For example, it is difficult to automatically determine whether the error “itme” should be replaced with “time” or “item” without more information (e.g. the neighbouring words). The evaluation of the errors and replacements are conducted in a unified manner. The errors are not classified into spelling errors, ad-hoc abbreviations or improper casing. 
For example, should the error “az” (“AZ” is the abbreviation for the state of “Arizona”) in the context of “Glendale az <” be considered an abbreviation or improper casing? The boundaries between the different kinds of noise that occur in real-world texts, especially those from online sources, are not clear. After a careful evaluation of all replacements suggested by Aspell and by enhanced ISSAC for all 3,313 errors, we discovered a further improvement in accuracy using the latter. As shown in Figure 3.4, the use of the first suggestions by Aspell as replacements for spelling errors yields an average of 71%, which is a decrease from 74.4% in the previous evaluations. With the addition of the various weights which form basic ISSAC, an average increase of 22% was noted, resulting in an improved accuracy of 96.5%. As predicted, the enhanced ISSAC scored a much better accuracy of 98%. The increase of 1.5% over basic ISSAC is contributed by the suggestions from Google that complement the inadequacies of Aspell.

Algorithm 1 Enhanced ISSAC
1: input: chat records or other online documents
2: Remove all HTML or XML tags from input documents
3: Extract and keep URLs, emails, emoticons and tables
4: Detect and identify sentence boundaries
5: for each document do
6:   for each sentence in the document do
7:     tokenise the sentence to produce a set of words T = {t_1, ..., t_w}
8:     for each word t ∈ T do
9:       Identify the left word l and right word r for t
10:      if t consists of all upper case then
11:        Turn all letters in t to lower case
12:      else if t consists of all digits then
13:        next
14:      Feed t to Aspell
15:      if t is identified as an error by Aspell or Google spellcheck then
16:        Initialise S, the set of suggestions for error t, and NS, an array of new scores for all suggestions for error t
17:        Add the n suggestions for word t produced by Aspell to S according to the original rank from 1 to n
18:        Perform a lookup in the abbreviation dictionary and add all the corresponding m expansions for t at the front of S, all with rank 1
19:        Perform a lookup in the spelling dictionary and add the retrieved suggestion at the front of S with rank 1
20:        Add the error word t itself at the end of S, with rank n + 1
21:        The final S is {s_{1,1}, s_{2,1}, ..., s_{m+1,1}, s_{m+2,1}, ..., s_{m+n+1,n}, s_{m+n+2,n+1}}, where j and i in s_{j,i} are the element index and the rank, respectively
22:        for each suggestion s_{j,i} ∈ S do
23:          Determine i^{-1}, NED between the error e and the j-th suggestion, RF by looking into the spelling dictionary, AF by looking into the abbreviation dictionary, DS, and GS
24:          Sum the weights and push the sum into NS
25:        Correct word t with the suggestion that has the highest combined weight in array NS
26: output: documents with spelling errors corrected, abbreviations expanded and improper casing restored.

                                              Evaluation 1   Evaluation 2   Evaluation 3   Evaluation 4
Correct replacements using enhanced ISSAC     98.45%         97.91%         98.40%         98.23%
Correct replacements using basic ISSAC        97.06%         97.07%         95.92%         96.20%
Correct replacements using Aspell             74.61%         75.94%         71.81%         75.19%
(a) Evaluations 1 to 4.

                                              Evaluation 5   Evaluation 6   Evaluation 7   Average
Correct replacements using enhanced ISSAC     97.39%         97.85%         97.86%         98.01%
Correct replacements using basic ISSAC        95.64%         96.65%         97.14%         96.53%
Correct replacements using Aspell             63.62%         65.79%         70.24%         71.03%
(b) Evaluations 5 to 7.
Figure 3.4: Accuracy of enhanced ISSAC over seven evaluations.
Causes                                        Enhanced ISSAC
Correct replacement not in suggestion list    0.80%
Inadequate/erroneous neighbouring words       0.70%
Anomalies                                     0.50%
Figure 3.5: The breakdown of the causes behind the incorrect replacements by enhanced ISSAC.

A previous error, “prder” within the context of “The prder number”, that could not be corrected by basic ISSAC due to the first cause was solved after our enhancements. The correct replacement “order” was suggested by Google. Another error, “ffer” in the context of “youo ffer on”, that could not be corrected due to the second cause was successfully replaced with “offer” after Google had simultaneously corrected the left word to “you”. The increase in accuracy of 1.5% is in line with the drop in the number of errors with wrong replacements due to (1) the absence of correct replacements in Aspell’s suggestions, and (2) erroneous neighbouring words. There is a visible drop in the number of errors with wrong replacements due to the first and the second cause, from the existing 2% (in Figure 3.3) to 0.8% (in Figure 3.5), and from 1% (in Figure 3.3) to 0.7% (in Figure 3.5), respectively.
3.6 Conclusion
As an increasing number of ontology learning systems are opening up to the use of online sources, the need to handle noisy text becomes inevitable. Regardless of whether we acknowledge this fact, the quality of ontologies and the proper functioning of the systems are, to a certain extent, dependent on the cleanliness of the input texts. Most of the existing techniques for correcting spelling errors, expanding abbreviations and restoring cases are studied separately. We, along with an increasing number of researchers, have acknowledged the fact that much of the noise in text is composite in nature (i.e. multi-error). As we have demonstrated throughout this chapter, many errors are difficult to classify as either spelling errors, ad-hoc abbreviations or improper casing. In this chapter, we presented the enhancement of the ISSAC technique. The basic ISSAC was built upon the well-known spell checker Aspell to simultaneously provide a solution to spelling errors, abbreviations and improper casing. This scoring mechanism combines weights based on various information sources, namely, the original rank by Aspell, reuse factor, abbreviation factor, normalised edit distance, domain significance and general significance. In the course of evaluating basic ISSAC, we uncovered and discussed in detail three causes behind the replacement errors. We approached the enhancement of ISSAC from the first and the second cause, namely, the absence of correct replacements from Aspell’s suggestions, and the inadequacy of the neighbouring words. We proposed three modifications to the basic ISSAC, namely, (1) the use of Google spellcheck to compensate for the inadequacy of Aspell, (2) the incorporation of Google spellcheck for determining if a word is erroneous, and (3) the alteration of the reuse factor RF by shifting from the use of a history list to a spelling dictionary. Evaluations performed using the enhanced ISSAC on seven sets of chat records revealed a further improvement in accuracy to 98% from the previous 96.5% using basic ISSAC. Even though the idea for ISSAC was first motivated and conceived within the paradigm of ontology learning, we see great potential in further improvements and fine-tuning for a wide range of uses, especially in language and speech applications.
We hope that a unified technique such as ISSAC will pave the way for more research into providing a complete solution for text preprocessing (i.e. text cleaning) in general. 55 56 Chapter 3. Text Preprocessing 3.7 Acknowledgement This research was supported by the Australian Endeavour International Postgraduate Research Scholarship, and the Research Grant 2006 by the University of Western Australia. The authors would like to thank 247Customer.com for providing the evaluation data. Gratitude to the developer of GNU Aspell, Kevin Atkinson. 3.8 Other Publications on this Topic Wong, W., Liu, W. & Bennamoun, M. (2006) Integrated Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration in Dirty Text. In the Proceedings of the 5th Australasian Conference on Data Mining (AusDM), Sydney, Australia. This paper contains the preliminary ideas on basic ISSAC, which were extended and improved to contribute towards the conference paper on enhanced ISSAC that form Chapter 3. CHAPTER 4 Text Processing Abstract In ontology learning, research on word collocational stability or unithood is typically performed as part of a larger effort for term recognition. Consequently, independent work dedicated to the improvement of unithood measurement is limited. In addition, existing unithood measures were mostly empirically motivated and derived. This chapter presents a dedicated probabilistic measure that gathers linguistic evidence from parsed text and statistical evidence from Web search engines for determining unithood during noun phrase extraction. Our comparative study using 1, 825 test cases against an existing empirically-derived function revealed an improvement in terms of precision, recall and accuracy. 4.1 Introduction Automatic term recognition is the process of extracting stable noun phrases from text and filtering them for the purpose of identifying terms which characterise certain domains of interest. This process involves the determination of unithood and termhood. Unithood, which is the focus of this chapter, refers to “the degree of strength or stability of syntagmatic combinations or collocations” [120]. Measures for determining unithood can be used to decide whether or not word sequences can form collocationally stable and semantically meaningful compounds. Compounds are considered as unstable if they can be further broken down to create non-overlapping units that refer to semantically distinct concepts. For example, the noun phrase “Centers for Disease Control and Prevention” is a stable and meaningful unit while “Centre for Clinical Interventions and Royal Perth Hospital” is an unstable compound that refers to two separate entities. For this reason, unithood measures are typically used in term recognition for finding stable and meaningful noun phrases, which are considered as likelier terms. Recent reviews [275] showed that existing research on unithood is mostly conducted as part of larger efforts for termhood measurement. As a result, there is only a small number of existing measures dedicated to determining unithood. In addition, existing measures are usually derived using 0 This chapter appeared in the Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP), Hyderabad, India, 2008, with the title “Determining the Unithood of Word Sequences using a Probabilistic Approach”. 57 58 Chapter 4. Text Processing word frequency from static corpora, and are modified as per need. 
As such, the significance of the different weights that compose the measures typically assumes an empirical viewpoint [120]. The three objectives of this chapter are (1) to separate the measurement of unithood from the determination of termhood, (2) to devise a probabilistic measure which requires only one threshold for determining the unithood of word sequences using dynamic Web data, and (3) to demonstrate the superior performance of the new probabilistic measure against existing empirical measures. In regard to the first objective, we derive our probabilistic measure free from any influence of termhood determination. Following this, our unithood measure will be an independent tool that is applicable not only to term recognition, but also to other tasks in information extraction and text mining. Concerning the second objective, we devise our new measure, known as the Odds of Unithood (OU), using Bayes Theorem and several elementary probabilities. The probabilities are estimated using Google page counts to eliminate problems related to the use of static corpora. Moreover, only one threshold, namely OU_T, is required to control the functioning of OU. Regarding the third objective, we compare our new OU against an existing empirically-derived measure called Unithood (UH) [275] in terms of their precision, recall and accuracy. (The foundation work on dedicated unithood measures appeared in the Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING), Melbourne, Australia, 2007, with the title “Determining the Unithood of Word Sequences using Mutual Information and Independence Measure”.)
In Section 4.2, we provide a brief review of some existing techniques for measuring unithood. In Section 4.3, we present our new probabilistic measure and the accompanying theoretical and intuitive justification. In Section 4.4, we summarise some findings from our evaluations. Finally, we conclude this chapter with an outlook on future work in Section 4.5.
4.2 Related Works
Some of the common measures of unithood include pointwise mutual information (MI) [47] and the log-likelihood ratio [64]. In mutual information, the co-occurrence frequencies of the constituents of complex terms are utilised to measure their dependency. The mutual information for two words a and b is defined as:

MI(a, b) = \log_2 \frac{p(a, b)}{p(a)\,p(b)}    (4.1)

where p(a) and p(b) are the probabilities of occurrence of a and b. Many measures that apply statistical techniques assume a strict normal distribution and independence between the word occurrences [81]. For handling extremely uncommon words or small-sized corpora, the log-likelihood ratio delivers the best precision [136]. The log-likelihood ratio attempts to quantify how much more likely one pair of words is to occur compared to the others. Despite its potential, “How to apply this statistic measure to quantify structural dependency of a word sequence remains an interesting issue to explore.” [131]. Seretan et al. [224] examined the use of mutual information, log-likelihood ratio and t-tests with search engine page counts for determining the collocational strength of word pairs. However, no performance results were presented. Wong et al. [275] presented a hybrid measure inspired by mutual information in Equation 4.1, and Cvalue in Equation 4.3. The authors employ search engine page counts for the computation of statistical evidence to replace the use of frequencies obtained from static corpora.
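As an aside, a minimal sketch of how such page-count-based association scores might be computed is given below, using pointwise mutual information (Equation 4.1) as the example. The page_count function and the normalising constant N are stand-in assumptions for illustration; a real system would query a search engine API and use an estimate of its index size.

```python
# An illustrative sketch of estimating pointwise mutual information (Equation
# 4.1) from search engine page counts instead of a static corpus. The
# page_count function and the constant N (an assumed index size) are stand-in
# assumptions; a real system would query a search engine API.

import math

N = 1_000_000_000  # assumed number of indexed pages, used to normalise counts

def page_count(query: str) -> int:
    """Stand-in for a search engine hit count for `query`."""
    fake_counts = {
        "food": 100_000_000,
        "poisoning": 5_000_000,
        '"food poisoning"': 2_000_000,   # phrase query as a co-occurrence proxy
    }
    return fake_counts.get(query, 0)

def pmi(a: str, b: str) -> float:
    """MI(a, b) = log2( p(a, b) / (p(a) p(b)) ), with page-count estimates."""
    p_a = page_count(a) / N
    p_b = page_count(b) / N
    p_ab = page_count(f'"{a} {b}"') / N
    if min(p_a, p_b, p_ab) == 0:
        return float("-inf")
    return math.log2(p_ab / (p_a * p_b))

if __name__ == "__main__":
    print(round(pmi("food", "poisoning"), 2))  # 2.0: a positive score indicates association
```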
The authors proposed a measure known as Unithood (UH) for determining the mergeability of two lexical units a_x and a_y to produce a stable sequence of words s. The word sequences are organised as a set W = {s, a_x, a_y} where s = a_x b a_y is a term candidate, b can be any preposition, the coordinating conjunction “and” or an empty string, and a_x and a_y can either be noun phrases of the form ADJ* N+ or another s (i.e. defining a new s in terms of another s). The authors define UH as:

$$UH(a_x, a_y) = \begin{cases} 1 & \text{if } (MI(a_x, a_y) > MI^{+}) \ \lor \\ & \quad (MI^{+} \geq MI(a_x, a_y) \geq MI^{-} \ \land \ ID(a_x, s) \geq ID_T \ \land \\ & \quad \ ID(a_y, s) \geq ID_T \ \land \ IDR^{+} \geq IDR(a_x, a_y) \geq IDR^{-}) \\ 0 & \text{otherwise} \end{cases} \qquad (4.2)$$

where MI^+, MI^-, ID_T, IDR^+ and IDR^- are thresholds for determining the mergeability decision, MI(a_x, a_y) is the mutual information between a_x and a_y, and ID(a_x, s), ID(a_y, s) and IDR(a_x, a_y) are measures of the lexical independence of a_x and a_y from s. For brevity, let z be either a_x or a_y; the independence measure ID(z, s) is then defined as:

$$ID(z, s) = \begin{cases} \log_{10}(n_z - n_s) & \text{if } n_z > n_s \\ 0 & \text{otherwise} \end{cases}$$

where n_z and n_s are the Google page counts for z and s, respectively. IDR(a_x, a_y) is computed as ID(a_x, s)/ID(a_y, s). Intuitively, UH(a_x, a_y) states that the two lexical units a_x and a_y can only be merged in two cases, namely, (1) if a_x and a_y have extremely high mutual information (i.e. higher than a certain threshold MI^+), or (2) if a_x and a_y achieve average mutual information (i.e. within the acceptable range of the two thresholds MI^+ and MI^-) coupled with extremely high independence from s (i.e. higher than the threshold ID_T).

Frantzi [79] proposed a measure known as Cvalue for extracting complex terms. The measure is based upon the claim that a substring of a term candidate is a candidate itself given that it demonstrates adequate independence from the longer version it appears in. For example, “E. coli food poisoning”, “E. coli” and “food poisoning” are acceptable as valid complex term candidates. However, “E. coli food” is not. Given a word sequence a to be examined for unithood, the Cvalue is defined as:

$$Cvalue(a) = \begin{cases} \log_2|a| \, f_a & \text{if } |a| = g \\ \log_2|a| \left( f_a - \frac{\sum_{l \in L_a} f_l}{|L_a|} \right) & \text{otherwise} \end{cases} \qquad (4.3)$$

where |a| is the number of words in a, L_a is the set of longer term candidates that contain a, g is the longest n-gram considered, f_a is the frequency of occurrence of a, and a ∉ L_a. While certain researchers [131] consider Cvalue a termhood measure, others [180] accept it as a measure of unithood. One can observe that longer candidates tend to gain higher weights due to the inclusion of log_2|a| in Equation 4.3. In addition, the weights computed using Equation 4.3 are dependent purely on the frequency of a.

4.3 A Probabilistic Measure for Unithood Determination

The determination of the unithood (i.e. collocational strength) of word sequences using the new probabilistic measure is composed of two parts. Firstly, a list of noun phrases is extracted using syntactic and dependency analysis. Secondly, the collocational strength of the word sequences is examined based on several pieces of probabilistic evidence.

Figure 4.1: The output of the Stanford Parser. The tokens in the “modifiee” column marked with squares are head nouns, and the corresponding tokens along the same rows in the “word” column are the modifiers. The first column, “offset”, is subsequently represented using the variable i.
4.3.1 Noun Phrase Extraction Most techniques for extracting noun phrases rely on regular expressions, and part-of-speech and dependency information. Our extraction technique is implemented as a head-driven noun phrase chunker [271] that feeds on the output of Stanford Parser [132]. Figure 4.1 shows a sample output by the parser for the sentence “They’re living longer with HIV in the brain, explains Kathy Kopnisky of the NIH’s National Institute of Mental Health, which is spending about millions investigating neuroAIDS.”. Note that the words are lemmatised to obtain the root form. The noun phrase chunker begins by identifying a list of head nouns from the parser’s output. The head nouns are marked with squares in Figure 4.1. As the name suggests, the chunker uses the head nouns as the starting point, and proceeds to the left and later right in an attempt to identify maximal noun phrases using the head-modifier information. For example, the head “Institute” is modified by 61 62 Chapter 4. Text Processing Figure 4.2: The output of the head-driven noun phrase chunker. The tokens which are highlighted with a darker tone are the head nouns. The underlined tokens are the corresponding modifiers identified by the chunker. “NIH’s”, “National” and “of ”. Since modifiers of the type prep and poss cannot be straightforwardly chunked, the phrase “National Institute” was produced instead as shown in Figure 4.2. Similarly, the phrase “Mental Health” was also identified by the chunker. The fragments of noun phrases identified by the chunker which are separated by the coordinating conjunction “and” or prepositions (e.g. “National Institute”, “Mental Health”) are organised as pairs in the form of (ax , ay ) and placed in the set A. The i in ai is the word offset generated by the Stanford Parser (i.e. the “offset” column in Figure 4.1). If ax and ay are located immediately next to each other in the sentence, then x + 1 = y. If the pair is separated by a preposition or a conjunction, then x + 2 = y. 4.3.2 Determining the Unithood of Word Sequences The next step is to examine the collocational strength of the pairs in A. Word pairs in (ax , ay ) ∈ A that have very high unithood or collocational strength are combined to form stable noun phrases and hence, potential domain-relevant terms. Each pair (ax , ay ) that undergoes the examination for unithood is organised as W = {s, ax , ay } where s is the hypothetically-stable noun phrase composed of s = ax bay and b can either be an empty string, a preposition, or the coordinating conjunction “and”. Formally, the unithood of any two lexical units ax and ay can be defined as Definition 4.3.2.1. The unithood of two lexical units is the “degree of strength or stability of syntagmatic combinations and collocations” [120] between them. It then becomes obvious that the problem of measuring the unithood of any word sequences requires the determination of their “degree” of collocational strength as 4.3. A Probabilistic Measure for Unithood Determination mentioned in Definition 4.3.2.1. In practical terms, the “degree” mentioned above provides us with a quantitative means to determine if the units ax and ay should be combined to form s, or remain as separate units. The collocational strength of ax and ay that exceeds a certain threshold demonstrates to us that s has the potential of being a stable compound and hence, a better term candidate than ax and ay separated. It is worth mentioning that the size (i.e. number of words) of ax and ay is not limited to 1. 
For example, we can have a_x = “National Institute”, b = “of” and a_y = “Allergy and Infectious Diseases”. In addition, the size of a_x and a_y has no effect on the determination of their unithood using our measure.

As we have discussed in Section 4.2, most of the conventional measures employ frequency of occurrence from local corpora, and statistical tests or information-theoretic measures, to determine the coupling strength of word pairs. The two main problems associated with such measures are:

• Data sparseness is a problem that is well-documented by many researchers [124]. The problem is inherent to the use of local corpora and it can lead to poor estimation of parameters or weights; and

• The assumptions of independence and normality of word distribution are two of the many problems in language modelling [81]. While the independence assumption reduces text to simply a bag of words, the assumption of the normal distribution of words will often lead to incorrect conclusions during statistical tests.

As a general solution, we innovatively employ search engine page counts in a probabilistic framework for measuring unithood. We begin by defining the sample space N as the set of all documents indexed by the Google search engine. We can estimate the index size of Google, |N|, using function words as predictors. Function words such as “a”, “is” and “with”, as opposed to content words, appear with frequencies that are relatively stable over different domains. Next, we perform random draws (i.e. trials) of documents from N. For each lexical unit w ∈ W, there is a corresponding set of outcomes (i.e. events) from the draw. The three basic sets which are of interest to us are:

Definition 4.3.2.2. Basic events corresponding to each w ∈ W:
• X is the event that a_x occurs in the document
• Y is the event that a_y occurs in the document
• S is the event that s occurs in the document

It should be obvious to the reader that since the documents in S also contain the two units a_x and a_y, S is a subset of X ∩ Y, or S ⊆ X ∩ Y. It is worth noting that even though S ⊆ X ∩ Y, it is highly unlikely that S = X ∩ Y, since the two portions a_x and a_y may exist in the same document without being conjoined by b. Next, subscribing to the frequency interpretation of probability, we obtain the probabilities of the events in Definition 4.3.2.2 in terms of search engine page counts:

$$P(X) = \frac{n_x}{|N|}, \quad P(Y) = \frac{n_y}{|N|}, \quad P(S) = \frac{n_s}{|N|} \qquad (4.4)$$

where n_x, n_y and n_s are the page counts returned by a search engine using the terms [+“a_x”], [+“a_y”] and [+“s”], respectively. The pair of quotes that encapsulates a search term is the phrase operator, while the character “+” is the required operator supported by the Google search engine. As discussed earlier, the independence assumption required by certain information-theoretic measures may not always be valid. In our case, P(X ∩ Y) ≠ P(X)P(Y), since the occurrences of a_x and a_y in documents are inevitably governed by some hidden variables. Following this, we define the probabilities for two new sets which result from applying some set operations on the basic events in Definition 4.3.2.2:

$$P(X \cap Y) = \frac{n_{xy}}{|N|}, \quad P(X \cap Y \setminus S) = P(X \cap Y) - P(S) \qquad (4.5)$$

where n_xy is the page count returned by Google for the search using [+“a_x” +“a_y”]. Defining P(X ∩ Y) in terms of observable page counts, rather than as a combination of two independent events, allows us to avoid any unnecessary assumption of independence.
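The elementary probabilities in Equations 4.4 and 4.5 reduce to a handful of page-count lookups. The following is a minimal sketch of this estimation step; the type and function names, the example counts and the assumed index size are illustrative only and not part of the original implementation.

```python
from dataclasses import dataclass

@dataclass
class UnitEvidence:
    """Elementary probabilities of Equations 4.4 and 4.5 for W = {s, a_x, a_y}."""
    p_x: float           # P(X): a_x occurs in a document
    p_y: float           # P(Y): a_y occurs in a document
    p_s: float           # P(S): s = a_x b a_y occurs in a document
    p_xy: float          # P(X ∩ Y): a_x and a_y occur in the same document
    p_xy_minus_s: float  # P(X ∩ Y \ S)

def estimate_evidence(n_x: int, n_y: int, n_s: int, n_xy: int,
                      index_size: int) -> UnitEvidence:
    """Map raw page counts (for [+"a_x"], [+"a_y"], [+"s"] and [+"a_x" +"a_y"])
    to the probabilities used by the unithood measure."""
    p_x, p_y = n_x / index_size, n_y / index_size
    p_s, p_xy = n_s / index_size, n_xy / index_size
    # In theory S ⊆ X ∩ Y, but page counts are noisy, so clamp at zero.
    return UnitEvidence(p_x, p_y, p_s, p_xy, max(p_xy - p_s, 0.0))

# Made-up counts for a_x = "National Institute", a_y = "Mental Health" and
# s = "National Institute of Mental Health", with an assumed index size |N|.
ev = estimate_evidence(n_x=52_000_000, n_y=33_000_000, n_s=9_500_000,
                       n_xy=11_000_000, index_size=8_000_000_000)
print(ev.p_s, ev.p_xy_minus_s)
```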
Next, referring back to our main problem discussed in Definition 4.3.2.1, we are required to estimate the collocational strength of the two units a_x and a_y. Since there is no standard metric for such a measurement, we address the problem from a probabilistic perspective. We introduce the probability that s is a stable compound given the evidence s possesses:

Definition 4.3.2.3. The probability of unithood:

$$P(U|E) = \frac{P(E|U)\,P(U)}{P(E)}$$

where U is the event that s is a stable compound and E is the evidence belonging to s. P(U|E) is the posterior probability that s is a stable compound given the evidence E. P(U) is the prior probability that s is a unit without any evidence, and P(E) is the prior probability of the evidence held by s. As we shall see later, these two prior probabilities are immaterial in the final computation of unithood. Since s can either be a stable compound or not, we can state that

$$P(\bar{U}|E) = 1 - P(U|E) \qquad (4.6)$$

where Ū is the event that s is not a stable compound. Since Odds = P/(1 − P), we multiply both sides of Definition 4.3.2.3 by (1 − P(U|E))^{-1} to obtain

$$\frac{P(U|E)}{1 - P(U|E)} = \frac{P(E|U)\,P(U)}{P(E)\,(1 - P(U|E))} \qquad (4.7)$$

By substituting Equation 4.6 into Equation 4.7 and then applying the multiplication rule P(Ū|E)P(E) = P(E|Ū)P(Ū), we obtain:

$$\frac{P(U|E)}{P(\bar{U}|E)} = \frac{P(E|U)\,P(U)}{P(E|\bar{U})\,P(\bar{U})} \qquad (4.8)$$

We proceed to take the log of the odds in Equation 4.8 (i.e. the logit) to get:

$$\log \frac{P(U|E)}{P(\bar{U}|E)} = \log \frac{P(E|U)}{P(E|\bar{U})} + \log \frac{P(U)}{P(\bar{U})} \qquad (4.9)$$

While it is obvious that certain words tend to co-occur more frequently than others (i.e. idioms and collocations), such phenomena are largely arbitrary [231]. This makes the task of deciding on what constitutes an acceptable collocation difficult. The only way to objectively identify stable and meaningful compounds is through observations in samples of the language (e.g. a text corpus) [174]. In other words, assigning an a priori probability of collocational strength without empirical evidence is both subjective and difficult. As such, we are left with the option of assuming that the probabilities of s being a stable unit and of s not being a stable compound, in the absence of evidence, are the same (i.e. P(U) = P(Ū) = 0.5). As a result, the second term in Equation 4.9 evaluates to 0:

$$\log \frac{P(U|E)}{P(\bar{U}|E)} = \log \frac{P(E|U)}{P(E|\bar{U})} \qquad (4.10)$$

We introduce a new measure for determining the odds of s being a stable compound, known as the Odds of Unithood (OU):

Definition 4.3.2.4. Odds of unithood:

$$OU(s) = \log \frac{P(E|U)}{P(E|\bar{U})}$$

Assuming that the individual pieces of evidence in E are independent of one another, we can evaluate OU(s) as:

$$OU(s) = \log \frac{\prod_i P(e_i|U)}{\prod_i P(e_i|\bar{U})} = \sum_i \log \frac{P(e_i|U)}{P(e_i|\bar{U})} \qquad (4.11)$$

where e_i is an individual piece of evidence for s. With the introduction of Definition 4.3.2.4, we can examine the degree of collocational strength of a_x and a_y in forming a stable and meaningful s in terms of OU(s).

Figure 4.3: The probabilities of the areas with a darker shade are the denominators required by the evidences e_1 and e_2 for the estimation of OU(s). (a) The darker area is the set X ∩ Y \ S; the ratio of P(S) to the probability of this area gives us the first evidence. (b) The darker area is the set S′; the ratio of P(S) to the probability of this area (i.e. P(S′) = 1 − P(S)) gives us the second evidence.
With the base of the log in Definition 4.3.2.4 greater than 1, the upper and lower bounds of OU(s) are +∞ and −∞, respectively. OU(s) = +∞ and OU(s) = −∞ correspond to the highest and the lowest degree of stability of the two units a_x and a_y appearing as s, respectively. A high OU(s) indicates the suitability of the two units a_x and a_y to be merged to form s. Ultimately, we have reduced the vague problem of unithood determination introduced in Definition 4.3.2.1 to a practical and computable solution in Definition 4.3.2.4.

The evidence that we employ for determining unithood is based on the occurrence of s (the event S, if the reader recalls Definition 4.3.2.2). We are interested in two types of occurrence of s, namely, (1) the occurrence of s given that a_x and a_y have already occurred, or X ∩ Y, and (2) the occurrence of s as it is in our sample space N. We refer to the first evidence, e_1, as local occurrence, and the second, e_2, as global occurrence. We will discuss the justification behind each type of occurrence in the following paragraphs. Each piece of evidence e_i captures the occurrence of s within a different confinement. We estimate the evidence using the elementary probabilities already defined in Equations 4.4 and 4.5.

The first evidence e_1 captures the probability of occurrence of s within the confinement of a_x and a_y, or X ∩ Y. As such, P(e_1|U) can be interpreted as the probability of s occurring within X ∩ Y as a stable compound, or P(S|X ∩ Y). On the other hand, P(e_1|Ū) captures the probability of s occurring in X ∩ Y not as a unit. In other words, P(e_1|Ū) is the probability of s not occurring in X ∩ Y, or equivalently, P((X ∩ Y \ S)|(X ∩ Y)). The set X ∩ Y \ S is shown as the area with a darker shade in Figure 4.3(a). Let us define the odds based on the first evidence as:

$$O_L = \frac{P(e_1|U)}{P(e_1|\bar{U})} \qquad (4.12)$$

Substituting P(e_1|U) = P(S|X ∩ Y) and P(e_1|Ū) = P((X ∩ Y \ S)|(X ∩ Y)) into Equation 4.12 gives us:

$$O_L = \frac{P(S|X \cap Y)}{P((X \cap Y \setminus S)|(X \cap Y))} = \frac{P(S \cap (X \cap Y))}{P(X \cap Y)} \, \frac{P(X \cap Y)}{P((X \cap Y \setminus S) \cap (X \cap Y))} = \frac{P(S \cap (X \cap Y))}{P((X \cap Y \setminus S) \cap (X \cap Y))}$$

and since S ⊆ (X ∩ Y) and (X ∩ Y \ S) ⊆ (X ∩ Y),

$$O_L = \frac{P(S)}{P(X \cap Y \setminus S)}$$

if P(X ∩ Y \ S) ≠ 0, and O_L = 1 if P(X ∩ Y \ S) = 0.

The second evidence e_2 captures the probability of occurrence of s without confinement. If s is a stable compound, then its probability of occurrence in the sample space is simply P(S). On the other hand, if s occurs not as a unit, then its probability of non-occurrence is 1 − P(S). The complement of S, which is the set S′, is shown as the area with a darker shade in Figure 4.3(b). Let us define the odds based on the second evidence as:

$$O_G = \frac{P(e_2|U)}{P(e_2|\bar{U})} \qquad (4.13)$$

Substituting P(e_2|U) = P(S) and P(e_2|Ū) = 1 − P(S) into Equation 4.13 gives us:

$$O_G = \frac{P(S)}{1 - P(S)}$$

Intuitively, the first evidence attempts to capture the extent to which the existence of the two lexical units a_x and a_y is attributable to s. Referring back to O_L, whenever the denominator P(X ∩ Y \ S) becomes less than P(S), we can deduce that a_x and a_y actually exist together as s more often than in other forms. At one extreme, when P(X ∩ Y \ S) = 0, we can conclude that the co-occurrence of a_x and a_y is exclusively for s. As such, we can also refer to O_L as a measure of exclusivity for the use of a_x and a_y with respect to s.
This first evidence is a good indication for the unithood of s since the more the existence of ax and ay is attributed to s, the stronger the collocational strength of s becomes. Concerning the second evidence, OG attempts to capture the extent to which s occurs in general usage (i.e. World Wide Web). We can consider OG as a measure of pervasiveness for the use of s. As s becomes more widely used in text, the numerator in OG increases. This provides a good indication on the unithood of s since the more s appears in usage, the likelier it becomes that s is a stable compound instead of an occurrence by chance. As a result, the derivation of OU in terms of OL and OG offers a comprehensive way of determining unithood. Finally, expanding OU (s) in Equation 4.11 using Equations 4.12 and 4.13 gives us: OU (s) = log OL + log OG = log (4.14) P (S) P (S) + log P (X ∩ Y \ S) 1 − P (S) As such, the decision on whether ax and ay should be merged to form s is made based solely on OU defined in Equation 4.14. We merge ax and ay if their odds of unithood exceeds a certain threshold, OUT . 4.4 Evaluations and Discussions For this evaluation, we employed 500 news articles from Reuters in the health domain gathered between December 2006 to May 2007. These 500 articles are fed into the Stanford Parser whose output is then used by our head-driven noun phrase chunker [271, 275] to extract word sequences in the form of nouns and noun phrases. Pairs of word sequences (i.e. ax and ay ) located immediately next to each other, or separated by a preposition or the conjunction “and” in the same sentence are 4.4. Evaluations and Discussions measured for their unithood. Using the 500 news articles, we managed to obtain 1, 825 pairs of words to be tested for unithood. We performed a comparative study of our new probabilistic measure against the empirical measure described in Equation 4.2. Two experiments were conducted. In the first one, the decisions on whether or not to merge the 1, 825 pairs were performed automatically using our probabilistic measure OU. These decisions are known as the actual results. At the same time, we inspected the same list of word pairs to manually decide on their unithood. These decisions are known as the ideal results. The threshold OUT employed for our evaluation is determined empirically through experiments and is set to −8.39. However, since only one threshold is involved in deciding mergeability, training algorithms and datasets may be employed to automatically decide on an optimal number. This option is beyond the scope of this chapter. The actual and ideal results for this first experiment are organised into a contingency table (not shown here) for identifying the true and the false positives, and the true and the false negatives. In the second experiment, we conducted the same assessment as carried out in the first one but the decisions to merge the 1, 825 pairs are based on the UH measure described in Equation 4.2. The thresholds required for this measure are based on the values suggested by Wong et al. [275], namely, M I + = 0.9, M I − = 0.02, IDT = 6, IDR+ = 1.35, and IDR− = 0.93. Figure 4.4: The performance of OU (from Experiment 1) and UH (from Experiment 2) in terms of precision, recall and accuracy. The last column shows the difference between the performance of Experiment 1 and 2. Using the results from the contingency tables, we computed the precision, recall and accuracy for the two measures under evaluation. 
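To make the merging decision of Equation 4.14 and the contingency-table metrics above concrete, the following is a minimal sketch. It assumes P(S) and P(X ∩ Y) have already been estimated from page counts as in Equations 4.4 and 4.5; the function names are illustrative, and while the default threshold mirrors the empirically chosen OU_T = −8.39 reported here, the choice of natural logarithm is an assumption, so the threshold value is only indicative.

```python
import math

def odds_of_unithood(p_s: float, p_xy: float) -> float:
    """Odds of Unithood, Equation 4.14: OU(s) = log O_L + log O_G,
    with O_L = P(S)/P(X ∩ Y \\ S) and O_G = P(S)/(1 - P(S))."""
    if p_s <= 0.0:
        return float("-inf")               # s never observed: lowest possible odds
    p_xy_minus_s = max(p_xy - p_s, 0.0)    # P(X ∩ Y \ S), guarded against noisy counts
    o_l = p_s / p_xy_minus_s if p_xy_minus_s > 0 else 1.0  # O_L = 1 when P(X ∩ Y \ S) = 0
    o_g = p_s / (1.0 - p_s) if p_s < 1.0 else float("inf")
    return math.log(o_l) + math.log(o_g)   # log base assumed to be e in this sketch

def should_merge(p_s: float, p_xy: float, ou_threshold: float = -8.39) -> bool:
    """Merge a_x and a_y into s when OU(s) exceeds the single threshold OU_T."""
    return odds_of_unithood(p_s, p_xy) > ou_threshold

def precision_recall_accuracy(tp: int, fp: int, fn: int, tn: int):
    """Standard contingency-table metrics used to compare OU and UH."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Using the made-up page-count probabilities from the earlier sketch.
print(should_merge(p_s=9_500_000 / 8e9, p_xy=11_000_000 / 8e9))
```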
Figure 4.4 summarises the performance of OU and UH in determining the unithood of 1, 825 pairs of lexical units. One will notice that our new measure OU outperformed the empirical measure UH in all aspects, with an improvement of 2.63%, 3.33% and 2.74% for precision, recall and accuracy, respectively. Our new measure achieved a 100% precision with a lower recall at 95.83%. However, more evaluations using larger datasets and 69 70 Chapter 4. Text Processing statistical tests for significance are required to further validate the performance of the probabilistic measure OU. As with any measures that employ thresholds as a cut-off point for accepting or rejecting certain decisions, we can improve the recall of OU by decreasing the threshold OUT . In this way, there will be less false negatives (i.e. pairs which are supposed to be merged but are not) and hence, increases the recall rate. Unfortunately, recall will improve at the expense of precision since the number of false positives will definitely increase from the existing 0. Since our application (i.e. ontology learning) requires perfect precision in determining the unithood of noun phrases, OU is the ideal candidate. Moreover, with only one threshold (i.e. OUT ) required in controlling the performance of OU, we are able to reduce the amount of time and effort spent on optimising our results. 4.5 Conclusion and Future Work In this chapter, we highlighted the significance of unithood, and that its measurement should be given equal attention by researchers in ontology learning. We focused on the development of a dedicated probabilistic measure for determining the unithood of word sequences. We refer to this measure as the Odds of Unithood (OU). OU is derived using Bayes Theorem and is founded upon two evidence, namely, local occurrence and global occurrence. Elementary probabilities estimated using page counts from Web search engines are utilised to quantify the two evidence. The new probabilistic measure OU is then evaluated against an existing empirical measure known as Unithood (UH). Our new measure OU achieved a precision and a recall of 100% and 95.83%, respectively, with an accuracy at 97.26% in measuring the unithood of 1, 825 test cases. OU outperformed UH by 2.63%, 3.33% and 2.74% in terms of precision, recall and accuracy, respectively. Moreover, our new measure requires only one threshold, as compared to five in UH to control the mergeability decision. More work is required to establish the coverage and the depth of the World Wide Web with regard to the determination of unithood. While the Web has demonstrated reasonable strength in handling general news articles, we have yet to study its appropriateness in dealing with unithood determination for technical text (i.e. the depth of the Web). Similarly, it remains a question the extent to which the Web is able to satisfy the requirement of unithood determination for a wider range of domains (i.e. the coverage of the Web). Studies on the effect of noises (e.g. keyword spamming) and multiple word senses on unithood determination using the Web is 4.6. Acknowledgement another future research direction. 4.6 Acknowledgement This research was supported by the Australian Endeavour International Postgraduate Research Scholarship, and the Research Grant 2006 by the University of Western Australia. 4.7 Other Publications on this Topic Wong, W., Liu, W. & Bennamoun, M. (2007) Determining the Unithood of Word Sequences using Mutual Information and Independence Measure. 
In the Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING), Melbourne, Australia. This paper presents the work on the adaptation of existing word association measures to form the UH measure. The ideas on UH were later reformulated to give rise to the probabilistic measure OU. The description of OU forms the core contents of this Chapter 4. Wong, W., Liu, W. & Bennamoun, M. (2008) Determination of Unithood and Termhood for Term Recognition. M. Song and Y. Wu (eds.), Handbook of Research on Text and Web Mining Technologies, IGI Global. This book chapter combines the ideas on the UH measure, from Chapter 4, and the TH measure, from Chapter 5. 71 72 Chapter 4. Text Processing CHAPTER 5 Term Recognition Abstract Term recognition identifies domain-relevant terms which are essential for discovering domain concepts and for the construction of terminologies required by a wide range of natural language applications. Many techniques have been developed in an attempt to numerically determine or quantify termhood based on term characteristics. Some of the apparent shortcomings of existing techniques are the ad-hoc combination of termhood evidence, mathematically-unfounded derivation of scores and implicit assumptions concerning term characteristics. We propose a probabilistic framework for formalising and combining qualitative evidence based on explicitly defined term characteristics to produce a new termhood measure. Our qualitative and quantitative evaluations demonstrate consistently better precision, recall and accuracy compared to three other existing ad-hoc measures. 5.1 Introduction Technical terms, more commonly referred to as terms, are content-bearing lexical units which describe the various aspects of a particular domain. There are two types of terms, namely, simple terms (i.e. single-word terms) and complex terms (multi-word terms). In general, the task of identifying domain-relevant terms is referred to as automatic term recognition, term extraction or terminology mining. The broader scope of term recognition can also be viewed in terms of the computational problem of measuring termhood, which is the extent of a term’s relevance to a particular domain [120]. Terms are particularly important for labelling or designating domain-specific concepts, and for contributing to the construction of terminologies, which are essentially enumerations of technical terms in a domain. Manual efforts in term recognition are no longer viable as more new terms come into use and new meanings may be added to existing terms as a result of information explosion. Coupled with the significance of terminologies to a wide range of applications such as ontology learning, machine translation and thesaurus construction, automatic term recognition is the next logical solution. 0 This chapter appeared in Intelligent Data Analysis, Volume 13, Issue 4, Pages 499-539, 2009, with the title “A Probabilistic Framework for Automatic Term Recognition”. 73 74 Chapter 5. Term Recognition Very often, term recognition is considered as similar or equivalent to namedentity recognition, information retrieval and term relatedness measurement. An obvious dissimilarity between named-entity recognition and term recognition is that the former is a deterministic problem of classification whereas the latter involves the subjective measurement of relevance and ranking. 
Hence, unlike the evaluation of named-entity recognition where various platforms such as the BioCreAtIvE Task [106] and the Message Understanding Conference (MUC) [42] are readily available, determining the performance of term recognition remains an extremely subjective problem. Having closer resemblance to information retrieval in that both involve relevance ranking, term recognition does have its unique requirements [120]. Unlike information retrieval where information relevance can be evaluated based on user information needs, term recognition does not have user queries as evidence for deciding on the domain relevance of terms. In general, term recognition can be performed with or without initial seedterms as evidence. The seedterms enable term recognition to be conducted in a controlled environment and offer more predictable outcomes. Term recognition using seedterms, also referred to as guided term recognition is in some aspects similar to measuring term relatedness. The relevance of terms to a domain in guided term recognition is determined in terms of their semantic relatedness with the domain seedterms. Therefore, existing semantic similarity or relatedness measures based on lexical information (e.g. WordNet [206], Wikipedia [276]), corpus statistics (e.g. Web corpus [50]), or the combination of both [114] are available for use. Without using seedterms, term recognition relies solely on term characteristics as evidence. This term recognition approach is far more difficult and faces numerous challenges. The focus of this chapter is on term recognition without seedterms. In this chapter, we develop a formal framework for quantifying evidence based on qualitative term characteristics for the purpose of measuring termhood, and ultimately, term recognition. Several techniques have been developed in an attempt to numerically determine or quantify termhood based on a list of term characteristics. The shortcomings of existing techniques can be examined from three perspectives. Firstly, word or document frequency in text corpus has always been the main source of evidence due to its accessibility and computability. Despite a general agreement [37] that frequency is a good criteria for discriminating terms from non-terms, frequency alone is insufficient. Many researchers began realising this issue and more diverse evidence [131, 109] was incorporated, especially linguistic-based such as syntactic and semantic information. Unfortunately, as the types of evidence become 5.1. Introduction increasingly diversified (e.g. numerical and nominal), the consolidation of evidence by existing techniques becomes more ad-hoc. This issue is very obvious when one examines simple but crucial questions as to why certain measures take different bases for logarithm, or why two weights were combined using addition instead of multiplication. In the words of Kageura & Umino [120], most of the existing techniques “take an empirical or pragmatic standpoint regarding the meaning of weight”. Secondly, the underlying assumptions made by many techniques regarding term characteristics for deriving evidence were mostly implicit. This makes the task of characteristic attribution and tracing inaccuracies in termhood measurement difficult. Thirdly, many techniques for determining termhood failed to provide ways for selecting the final terms from a long list of term candidates. According to Cabre-Castellvi et al. 
[37], “all systems propose large lists of candidate terms, which at the end of the process have to be manually accepted or rejected.”. In short, the derivation of a formal termhood measure based on term characteristics for term recognition requires solutions to the following issues: • the development of a general framework to consolidate all evidence representing the various term characteristics; • the determination of the types of evidence to be included to ensure that the resulting score will closely reflect the actual state of termhood implied by a term’s characteristics; • the explicit definition of term characteristics and their attribution to linguistic theories (if any) or other justifications; and • the automatic determination of optimal thresholds to identify terms from the final lists of ranked term candidates. The main objective of this chapter is to address the development of a new probabilistic framework for incorporating qualitative evidence for measuring termhood which provides solutions to all four issues outlined above. This new framework is based on the general Bayes Theorem, and the word distributions required for computing termhood evidence are founded upon the Zipf-Mandelbrot model. The secondary objective of this chapter is to demonstrate the performance of this new term recognition technique in comparison with existing techniques using widelyaccessible benchmarks. In Section 5.2, we summarise the notations and datasets employed throughout this chapter for the formulation of equations, experiments 75 76 Chapter 5. Term Recognition and evaluations. Section 5.3.1 and 5.3.2 summarise several prominent probabilistic models and ad-hoc techniques related to term recognition. In Section 5.3.3, we discuss several commonly employed word distribution models that are crucial for formalising statistical and linguistic evidence. We outline in detail our proposed technique for term recognition in Section 5.4. We evaluate our new technique, both qualitatively and quantitatively, in Section 5.5 and compare its performance with several other existing techniques. In particular, Section 5.5.2 includes the detailed description of an automatic way of identifying actual terms from the list of ranked term candidates. We conclude this chapter in Section 5.6 with an outlook to future work. 5.2 Notations and Datasets In this section, we discuss briefly the types of termhood evidence. This section can also be used as a reference for readers who require clarification about the notations used at any point in this chapter. In addition, we summarise the origin and composition of the datasets employed in various parts of this chapter for experiments and evaluations. There is a wide array of evidence employed for term recognition ranging from statistical to linguistics. Word and document frequency is used extensively to measure the significance of a lexical unit. Depending on how frequency is employed, one can classify the term recognition techniques as either ad-hoc or probabilistic. Linguistic evidence, on the other hand, typically includes syntactical and semantic information. Syntactical evidence relies upon information about how distinct lexical units assume the role of heads and modifiers to form complex terms. For semantic evidence, some predefined knowledge is often employed to relate one lexical unit with others to form networks of related terms. Frequency refers to the number of occurrences of certain event or entity. 
There are mainly two types of frequency related to the area of term recognition. The first is document frequency. Document frequency refers to the number of documents in a corpus that contains some words of interest. There are many different notations but throughout this chapter, we will adopt the notation N as the number of documents in a corpus and na as the number of documents in the corpus which contains word a. In cases where more than one corpus is involved, nax is used to denote the number of documents in corpus x containing word a. The second type of frequency is term frequency. Term frequency is the number of occurrences of certain words in a corpus. In other words, term frequency is independent of the documents in the corpus. We will employ the notation fa as the number of occurrences of word a in a corpus and 5.2. Notations and Datasets F as the sum of the number of occurrences of all words in a corpus. In the case where different units of text are involved such as paragraphs, sentences, documents or even corpora, fax represents the frequency of candidate a in unit x. Given that P W is the set of all distinct words in a corpus, then F = ∀a∈W fa . With regard to term recognition, we will use the notation T C to represent the set of all term candidates extracted from some corpus for processing, and |T C| is the number of term candidates in T C. The notation a where a ∈ T C is used to represent a term candidate and it can either be simple or complex. For complex terms, term candidate a is made up of constituents where ah is the head and Ma is the set of modifiers. A term candidate can also be surrounded by a set of context words Ca . The notion of context words may differ across different techniques. Certain techniques consider all words surrounding term a located within a fixed-size window as context words of a, while others may employ grammatical relations to extract context words. The actual composition of Ca is not of concern at this point. Following this, Ca ∩ T C is simply the context words of a which are also term candidates themselves (i.e. context terms). Figure 5.1: Summary of the datasets employed throughout this chapter for experiments and evaluations. Throughout this chapter, we employ a standard set of corpora for experimenting with the various aspects of term recognition, and also for evaluation purposes. The corpora that we employ are divided into two groups. The first group, known as domain corpus, consists of a collection of abstracts in the domain of molecular biology that is made available through the GENIA corpus [130]. Currently, version 3.0 of the corpus consists of 2, 000 abstracts with a total of 402, 483 word count. 77 78 Chapter 5. Term Recognition The GENIA corpus is an ideal resource for evaluating term recognition techniques since the text in the corpus is marked-up with both part-of-speech tags and semantic categories. Biologically-relevant terms in the corpus were manually identified by two domain experts [130]. Hence, a gold standard (i.e. a list of terms relevant to the domain), represented as the set G, for the molecular biology domain can be constructed by extracting the terms which have semantic descriptors enclosed by cons tags. For reproducibility of our experiments, the corpus can be downloaded from http://www-tsujii.is.s.u-tokyo.ac.jp/∼genia/topics/Corpus/. The second collection of text is called the contrastive corpus and is made up of twelve different text collections gathered from various online sources. 
As the name implies, the second group of text serves to contrast and discriminate the content of the domain corpus. The writing style of the contrastive corpus is different from the domain corpus because the former tend to be prepared using journalistic writing (i.e. written in general language with minimal usage of technical terms), targeting general readers. The contrastive texts were automatically gathered from news provider such as Reuters between the period of February 2006 to July 2007. The summary of the domain corpus and contrastive corpus is presented in Figure 5.1. Note that for simplicity reasons, hereafter, d is used to represent the domain corpus and d¯ for contrastive corpus. 5.3 Related Works There are mainly two schools of techniques in term recognition. The first attempts to begin the empirical study of termhood from a theoretically-founded perspective, while the second is based upon the belief that a method should be judged for its quality of being of practical use. These two groups are by no means exclusive but they form a good platform for comparison. In the first group, probability and statistics are the main guidance for designing new techniques. Probability theory acts as the mathematical foundation for modelling the various components in the corpus, and drawing inferences about different aspects such as relevance and representativeness of documents or domains using descriptive and inferential statistics. In the second group, ad-hoc techniques are characterised by the pragmatic use of evidence to measure termhood. Ad-hoc techniques are usually put together and modified as per need as the observation of immediate results progresses. Obviously, such techniques are at most inspired by, but not derived from formal mathematical models [120]. Many critics claim that such techniques are unfounded and the results that are reported using these techniques are merely coincidental. 79 5.3. Related Works The details of some existing research work on the two groups of techniques relevant to term recognition are presented in the next two subsections. 5.3.1 Existing Probabilistic Models for Term Recognition There is currently no formal framework dedicated to the determination of termhood which combines both statistical and qualitative linguistic evidence. Formal probabilistic models related to dealing with terms in general are mainly studied within the realm of document retrieval and automatic indexing. In probabilistic indexing, one of the first few detailed quantitative models was proposed by Bookstein & Swanson [29]. In this model, the differences in the distributional behaviour of words are employed as a guide to determine if a word should be considered as an index term. This model is founded upon the research on how function words can be closely modeled by a Poisson distribution whereas content words deviate from it [256]. We will elaborate on Poisson and other related models in Section 5.3.3. An even larger collection of literature on probabilistic models can be found in the related area of document retrieval. The simplest of all the retrieval models is the Binary Independence Model [82, 147]. As with all other retrieval models, the Binary Independence Model is designed to estimate the probability that a document j is considered as relevant given a specific query k, which is essentially a bag of words. Let T = {t1 , ...tn } be the set of terms in the collection of documents (i.e. corpus). 
We can then represent the set of terms Tj occurring in document j as a binary vector vj = {x1 , ..., xn } where xi = 1 if ti ∈ Tj and xi = 0 otherwise. This way, the odds of document j, represented by a binary vector vj being relevant R to query k can be computed as [83] O(R|k, vj ) = P (R|k) P (vj |R, k) P (R|k, vj ) = P (R̄|k, vj ) P (R̄|k) P (vj |R̄, k) and based on the assumption of independence between the presence and absence of terms, n P (vj |R, k) Y P (xi |R, k) = P (vj |R̄, k) i=1 P (xi |R̄, k) Other more advanced models that take into consideration other factors such as term frequency, document frequency and document length have also been proposed by researchers such as Spark Jones et al. [117, 118]. There is also another line of research which treats the problem of term recognition as a supervised machine learning task. In this term recognition approach, each 80 Chapter 5. Term Recognition word from a corpus is classified as a term or non-term. Classifiers are trained using annotated domain corpora. The trained models can then be applied to other corpora in the same domain. Turney [251] presented a comparative study between a recognition model based on genetic algorithms and an implementation of the bagged C4.5 decision tree algorithm. Hulth [109] studied the impact of prior input word selection on the performance of term recognition. The author uses a classifier trained on 2, 000 abstracts in the domain of information technology to identify terms from non-terms, and concluded that limiting the input words to NP-chunks offered the best precision. This study further reaffirmed the benefit of incorporating linguistic evidence during the measurement of termhood. 5.3.2 Existing Ad-Hoc Techniques for Term Recognition Most of the current termhood measures for term recognition fall into this ad-hoc techniques group. Term frequency and document frequency are the main types of evidence used by ad-hoc techniques. Unlike the use of classifiers described in the previous section, techniques in this group employ termhood scores for ranking and selecting terms from non-terms. Most common ad-hoc techniques that employ raw frequencies are variants of Term Frequency Inverse Document Frequency (TF-IDF). TF attempts to capture the pervasiveness of a term candidate within some documents, while IDF measures the “informativeness” of a term candidate. Despite the mere heuristic background of TF-IDF, the robustness of this weighting scheme has given rise to a number of variants and has found its way into many retrieval applications. Certain researchers [210] have even attempted to provide theoretical justifications as to why the combination of TF and IDF works so well. Basili et al. [19] proposed a TF-IDF inspired measure for assigning terms with more accurate weights that reflect their specificity with respect to the target domain. This contrastive analysis is based on the heuristic that general language-dependent phenomena should spread similarly across different domain corpus and special-language phenomena should portray odd behaviours. The Contrastive Weight [19] for simple term candidate a in target domain d is defined as: ! P P f j i ij CW (a) = log fad log P (5.1) j faj where fad is the frequency of the simple term candidate a in the target domain d, P P fij is the sum of the frequencies of all term candidates in all domain corpora, j Pi and j faj is the sum of the frequencies of term candidate a in all domain corpora. 81 5.3. 
Related Works For complex term candidates, the frequencies of their heads are utilised to compute their weights. This is necessary because the low frequencies among complex terms make estimations difficult. Consequently, the weight for complex term candidate a in domain d is defined as: CW (a) = fad CW (ah ) (5.2) where fad is the frequency of the complex term candidate a in the target domain d, and CW (ah ) is the contrastive weight for the head, ah of the complex term candidate, a. The use of head noun by Basili et al. [19] for computing the contrastive weights of complex term candidates CW (a) reflects the head-modifier principle [105]. The principle suggests that the information being conveyed by complex terms manifests itself in the arrangement of the constituents. The head acts as the key that refers to a general category to which all other modifications of the head belong. The modifiers are responsible for distinguishing the head from other forms in the same category. Wong et al. [274] presented another termhood measure based on contrastive analysis called Termhood (TH) which places emphasis on the difference between the notion of prevalence and tendency. The measure computes a Discriminative Weight (DW) for each candidate a as: DW (a) = DP (a)DT (a) (5.3) This weight realises the heuristic that the task of discriminating terms from nonterms is a function of Domain Prevalence (DP) and Domain Tendency (DT). If a is a simple term candidate, its DP is defined as: P P j fjd + j fj d¯ DP (a) = log10 (fad + 10) log10 + 10 (5.4) fad + fad¯ P P where j fjd + j fj d¯ is the sum of the frequencies of occurrences of all a ∈ T C in both domain and contrastive corpus, while fad and fad¯ are the frequencies of occurrences of a in the domain corpus and contrastive corpus, respectively. DP simply increases, with offset for too frequent terms, along with the frequency of a in the domain corpus. If the term candidate is complex, the authors define its DP as: DP (a) = log10 (fad + 10)DP (ah )M F (a) (5.5) The reason behind the use of the DP of the complex term’s head (i.e. DP (ah )) in Equation 5.5 is similar to that of CW in Equation 5.2. DT , on the other hand, is 82 Chapter 5. Term Recognition employed to determine the extent of the inclination of the usage of term candidate a for domain and non-domain purposes. The authors defined DT as: fad + 1 DT (a) = log2 +1 (5.6) fad¯ + 1 where fad is the frequency of occurrences of a in the domain corpus, while fad¯ is the frequency of occurrences of a in the contrastive corpus. If term candidate a is equally common in both domain and non-domains (i.e. contrastive domains), DT = 1. If the usage of a is more inclined towards the target domain, fad > fad¯, then DT > 1, and DT < 1 otherwise. Besides contrastive analysis, the use of contextual evidence to assist in the correct identification of terms is also common. There are currently two dominant approaches to extract contextual information. Most of the existing researchers such as Maynard & Ananiadou [171] employed fixed-size windows for capturing context words for term candidates. The Keyword in Context (KWIC) [159] index can be employed to identify the appropriate windows of words surrounding the term candidates. Other researchers such as Basili et al. [20], LeMoigno et al. [144] and Wong et al. [278] employed grammatical relations to identify verb phrases or independent clauses containing the term candidates. 
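Before turning to measures that exploit such contextual evidence, the contrastive core of the TH measure for simple term candidates (Equations 5.3, 5.4 and 5.6) can be summarised in the short sketch below. The function and argument names and the example frequencies are illustrative only; this is a sketch of the published formulas, not the original implementation.

```python
import math

def domain_prevalence(f_ad: float, f_adbar: float,
                      total_d: float, total_dbar: float) -> float:
    """Domain Prevalence for a simple term candidate, Equation 5.4.
    f_ad / f_adbar: candidate frequency in the domain / contrastive corpus;
    total_d / total_dbar: summed frequencies of all candidates in each corpus."""
    return math.log10(f_ad + 10) * \
        math.log10((total_d + total_dbar) / (f_ad + f_adbar) + 10)

def domain_tendency(f_ad: float, f_adbar: float) -> float:
    """Domain Tendency, Equation 5.6: equals 1 for equally common usage,
    and exceeds 1 when usage leans towards the target domain."""
    return math.log2((f_ad + 1) / (f_adbar + 1) + 1)

def discriminative_weight(f_ad: float, f_adbar: float,
                          total_d: float, total_dbar: float) -> float:
    """Discriminative Weight, Equation 5.3: DW(a) = DP(a) * DT(a)."""
    return domain_prevalence(f_ad, f_adbar, total_d, total_dbar) * \
        domain_tendency(f_ad, f_adbar)

# Made-up frequencies for a candidate that is common in the domain corpus
# but rare in the contrastive corpus.
print(discriminative_weight(f_ad=120, f_adbar=3,
                            total_d=402_483, total_dbar=2_000_000))
```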
One of the work along the line of incorporating contextual information is NCvalue by Frantzi & Ananiadou [80]. Part of the NCvalue measure involves the assignment of weights to context words in the form of nouns, adjectives and verbs located within a fixed-size window from the term candidate. Given that T C is the set of all term candidates and c is a noun, verb or adjective appearing with term candidates, weight(c) is defined as: P |T Cc | e∈T Cc fe (5.7) + weight(c) = 0.5 |T C| fc P where T Cc is the set of term candidates that have c as a context word, e∈T Cc fe is the sum of the frequencies of term candidates that appear with c, and fc is the frequency of c in the corpus. After calculating the weights for all possible context words, the sum of the weights of context words appearing with each term candidate is obtained. Formally, for each term candidate a that has a set of accompanying context words Ca , the cumulative context weight is defined as: X cweight(a) = weight(c) + 1 (5.8) c∈Ca Eventually, the NCvalue for a term candidate is defined as: N Cvalue(a) = 1 Cvalue(a)cweight(a) log F (5.9) 83 5.3. Related Works where F is the number of words in the corpus. Cvalue(a) is given by, log |a|f if |a| = g a 2 P Cvalue(a) = log |a|(f − l∈La fl ) otherwise 2 a (5.10) |La | where |a| is the number of words that constitute a, La is the set of potential longer term candidates that contain a, g is the longest n-gram considered, and fa is frequency of occurrences of a in the corpus. The T H measure by Wong et al. [274] incorporates contextual evidence in the form of Average Contextual Discriminative Weight (ACDW). ACDW is the average DW of the context words of a adjusted based on the context’s relatedness with a: P DW (c)N GD(a, c) (5.11) ACDW (a) = c∈Ca |Ca | where NGD is the Normalised Google Distance by Cilibrasi & Vitanyi [50] that is used to determine the relatedness between two lexical units without any feature extraction or static background knowledge. The final termhood score for each term candidate a is given by [278]: T H(a) = DW (a) + ACC(a) (5.12) where ACC is the adjusted value of ACDW based on DW of Equation 5.3. The inclusion of semantic relatedness measure by Wong et al. [278] brings us to the use of semantic information during the determination of termhood. Maynard & Ananiadou [171, 172] employed the Unified Medical Language System (UMLS) to compute two weights, namely, positional and commonality. Positional weight is obtained based on the combined number of nodes belonging to each word, while commonality is measured by the number of shared common ancestors multiplied by the number of words. Accordingly, the similarity between two term candidates is defined as [171]: sim(a, b) = com(a, b) pos(a, b) (5.13) where com(a, b) and pos(a, b) is the commonality and positional weight, respectively, between term candidate a and b. The authors then modified the NCvalue discussed in Equation 5.9 by incorporating the new similarity measure as part of a Context Factor (CF). The context factor of a term candidate a is defined as: X X fb|a sim(a, b) (5.14) CF (a) = fc|a weight(c) + c∈Ca b∈CTa 84 Chapter 5. Term Recognition where Ca is the set of context words of a, fc|a is the frequency of c as a context word of a, weight(c) is the weight for context word c as defined in Equation 5.7, CTa is the set of context words of a which also happen to be term candidates (i.e. 
context terms), fb|a is the frequency of b as a context term of a, and sim(a, b) is the similarity between term candidate a and its context term b as defined in Equation 5.13. The new NCvalue is defined as: N Cvalue(a) = 0.8Cvalue(a) + 0.2CF (a) (5.15) Basili et al. [20] commented that the use of extensive and well-grounded semantic resources by Maynard & Ananiadou [171] faces the issue of portability to other domains. Instead, Basili et al. [20] combined the use of contextual information and the head-modifier principle to capture term candidates and their context words on a feature space for computing similarity using the cosine measure. According to the authors [20], “the term sense is usually determined by its head.”. On the contrary, such statement by the authors opposes the fundamental fact, not only in terminology but in general linguistics, that simple terms are polysemous and the modification of such terms is necessary to narrow down their possible interpretations [105]. Moreover, the size of corpus has to be very large, and the specificity and density of domain terms in the corpus has to be very high to allow for extraction of adequate features. In summary, while the existing techniques described above may be intuitively justifiable, the manner in which the weights were derived remains questionable. To illustrate, why are the products of the various variables in Equations 5.1 and 5.9 taken instead of their summations? What would happen to the resulting weights if the products are taken instead of the summations, and the summations taken instead of the products in Equations 5.7, 5.15 and 5.4? These are just minor but thought-provoking questions in comparison to more fundamental issues related to the decomposability and traceability of the weights back to their various constituents or individual evidence. The two main advantages of decomposability and traceability are (1) the ability to trace inaccuracies of termhood measurement to their origin (i.e. what went wrong and why), and (2) the attribution of the significance of the various weights to their intended term characteristics (i.e. what do the weights measure?). 5.3.3 Word Distribution Models An alternative to the use of relative frequency as practiced by many ad-hoc techniques discussed above in Section 5.3.2 is to develop models of the distribution 85 5.3. Related Works of words and employ such models to describe the various characteristics of terms, and the corpus or the domain they represent. It is worth pointing out that the modelling is done with respect to all words (i.e. terms and non-terms) that a corpus contains. This is important for capturing the behaviour of both terms and nonterms in the domain for discrimination purposes. Word distribution models can be used to normalise frequency of occurrence [7] and to solve problems related to data sparsity caused by the use of raw frequencies in ad-hoc techniques. The modelling of word distributions in documents or the entire corpus can also be employed as means of predicting the rate of occurrence of words. There are mainly two groups of models related to word distribution. The first group attempts to model the frequency distribution of all words in an entire corpus while the second group focuses on the distribution of a single word. The foundation of the first group of models is the relationship between the frequencies of words and their ranks. 
One of the most widely-used models in this group is the Zipf ’s Law [285] which describes the relationship between the frequency of a word, f and its rank, r as P (r; s, H) = 1 rs H (5.16) where s is an exponent characterising the distribution. Given that 1 ≤ r ≤ |W | where W is the set of all distinct words in the corpus, H is defined as the |W |-th P | −s harmonic number, H = |W i=1 i . The actual notation for H computed as the |W |-th harmonic number is H|W |,s . However, for brevity, we will continue with the use of the notation H. A generalised version of the Zipfian distribution is the Zipf-Mandelbrot Law [168] whose probability mass function is given by P (r; q, s, H) = 1 (r + q)s H (5.17) where q is a parameter for expressing the richness of word usage in the text. SimiP | −s larly, H can be computed as H = |W i=1 (i + q) . There is still a hyperbolic relation between rank and frequency in the Zipf-Mandelbrot Distribution. The additional parameter q can be used to model curves in the distribution, something not possible in the original Zipf. There are few other probability distributions that can or have been used to model word distribution such as the Pareto distribution [7], the Yule-Simon distribution [230] and the generalised inverse Gauss-Poisson law [12]. All of the distributions described above are discrete power law distributions, except for Pareto, that have the ability to model the unique property of word occurrences, 86 Chapter 5. Term Recognition (a) Distribution of words extracted from the domain corpus dispersed according to the domain corpus. (b) Distribution of words extracted from the domain corpus dispersed according to the contrastive corpus. Figure 5.2: Distribution of 3, 058 words randomly sampled from the domain corpus d. The line with the label “KM” is the aggregation of the individual probability of occurrence of word i in a document, 1 − P (0; αi , βi ) using K-mixture with αi and βi defined in Equations 5.21 and 5.20. The line with the label “ZM-MF” is the manually fitted Zipf-Mandelbrot model. The line labeled “RF” is the actual rate of occurrence computed as fi /F . 87 5.3. Related Works namely, the “long tail phenomenon”. One of the main problems that hinders the practical use of these distributions is the estimation of the various parameters [112]. To illustrate, Figure 5.3 summarises the parameters of the manually fitted ZipfMandelbrot models for the distribution of a set of 3, 058 words randomly drawn from our domain corpus d. The lines with the label “ZM-MF” shown in Figures 5.2(a) and 5.2(b) show the distributions of the words dispersed according to the domain corpus, and the contrastive corpus, respectively. One can notice that the distribution of the words in d¯ is particularly difficult to be fitted because they tend to have a bulge near the end. This is caused by the presence of many domain-specific terms (which are unique to d) in the set of 3, 058 words. Such domain-specific terms will have extremely low or most of the time, zero word count. Nevertheless, a better fit for the Figure 5.2(b) can be achieved through more trial-and-error. In addition to the trial-and-error exercise required in manual fitting, the values in Figure 5.3 clearly show that different parameters are required even for fitting the same set of words using different corpus. The manual fits we have carried out are far from perfect and some automatic fitting mechanism is required if we were to practically employ the Zipf-Mandelbrot model. 
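As a rough illustration of what such an automatic fitting mechanism might look like, the sketch below fits the two Zipf-Mandelbrot parameters of Equation 5.17 by a simple grid search against an observed rank-frequency list. The parameter ranges, the log-space squared-error criterion and the synthetic data are arbitrary choices made for this sketch; they are not the procedure behind the manual fits in Figure 5.3.

```python
import numpy as np

def zipf_mandelbrot(ranks: np.ndarray, q: float, s: float) -> np.ndarray:
    """Zipf-Mandelbrot probability mass, Equation 5.17:
    P(r) = 1 / ((r + q)^s * H), with H the normalising constant."""
    weights = 1.0 / (ranks + q) ** s
    return weights / weights.sum()

def fit_zipf_mandelbrot(freqs: np.ndarray):
    """Crude automatic fit of (q, s) by grid search, minimising squared error
    between observed relative frequencies and the model in log space."""
    freqs = np.sort(np.asarray(freqs, dtype=float))[::-1]
    observed = freqs / freqs.sum()
    ranks = np.arange(1, len(freqs) + 1)
    best, best_err = (0.0, 1.0), np.inf
    for q in np.linspace(0.0, 50.0, 101):
        for s in np.linspace(0.5, 2.5, 81):
            model = zipf_mandelbrot(ranks, q, s)
            err = np.sum((np.log(model) - np.log(observed + 1e-12)) ** 2)
            if err < best_err:
                best, best_err = (q, s), err
    return best

# Synthetic rank-frequency data generated from the model itself; the fit
# should recover parameters close to (q, s) = (5.0, 1.2).
ranks = np.arange(1, 501)
synthetic_freqs = np.round(zipf_mandelbrot(ranks, q=5.0, s=1.2) * 100_000)
print(fit_zipf_mandelbrot(synthetic_freqs))
```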
In the words of Edmundson [65], "a distribution with more or different parameters may be required. It is clear that computers should be used on this problem...".

Figure 5.3: Parameters of the manually fitted Zipf-Mandelbrot models for the set of 3,058 words randomly drawn from d.

In the second group of models, individual word distributions allow us to capture and express the behaviour of individual words in parts of a corpus. The standard probabilistic model for the distribution of some event over fixed-size units is the Poisson distribution. In the conventional case of individual word distribution, the event would be k occurrences of word i and the unit would be a document. The Poisson distribution is defined as

P(k; \lambda_i) = \frac{e^{-\lambda_i} \lambda_i^k}{k!}    (5.18)

where λ_i is the average number of occurrences of word i per document, or λ_i = f_i/N. Obviously, λ_i will vary between different words. P(0; λ_i) gives the probability that word i does not occur in a document, while 1 − P(0; λ_i) gives the probability that a candidate has at least one occurrence in a document. Other similarly unsuccessful attempts at better fits are the Binomial model and the Two-Poisson model [82, 240, 83]. These single-parameter distributions (i.e. Poisson and Binomial) have traditionally been employed to model individual word distributions based on unrealistic assumptions such as independence between word occurrences. As a result, they are poor fits of the actual word distribution. Nevertheless, such variation from the Poisson distribution, colloquially known as non-Poissonness, serves a purpose. It is well known throughout the literature [256, 45, 169] that the Poisson distribution is only a good fit for function words, while content words tend to deviate from it. Using this property, we can also employ the single Poisson as a predictor of whether a lexical unit is a content word or not, and hence as an indicator of possible termhood. A better fit for individual word distributions employs a mixture of Poissons [170, 46]. The Negative Binomial is one such mixture, but the involvement of large binomial coefficients makes it computationally unattractive. Another alternative is the K-mixture proposed by Katz [122], which allows the Poisson parameter λ_i to vary between documents. The distribution of k occurrences of word i in a document is given by:

P(k; \alpha_i, \beta_i) = (1 - \alpha_i)\,\delta_{k,0} + \frac{\alpha_i}{\beta_i + 1}\left(\frac{\beta_i}{\beta_i + 1}\right)^k    (5.19)

where δ_{k,0} is the Kronecker delta: δ_{k,0} = 1 if k = 0 and δ_{k,0} = 0 otherwise. The parameters β_i and α_i can be computed as:

\beta_i = \frac{f_i - n_i}{n_i}    (5.20)

\alpha_i = \frac{\lambda_i}{\beta_i}    (5.21)

where λ_i is the single-Poisson parameter of the observed mean and n_i is the number of documents containing word i. β_i determines the number of additional occurrences of word i per document that contains i, and α_i can be seen as a measure of the fraction of documents containing i. One of the properties of the K-mixture is that it is always a perfect fit at k = 0. This desirable property can be employed to accurately determine the probability of non-occurrence of word i in a document. P(0; α_i, β_i) gives us the probability that word i does not occur in a document, and 1 − P(0; α_i, β_i) gives us the probability that the word has at least one occurrence in a document (i.e. the candidate exists in a document). When k = 0, the K-mixture reduces to

P(0; \alpha_i, \beta_i) = (1 - \alpha_i) + \frac{\alpha_i}{\beta_i + 1}    (5.22)

Unlike fixed-size textual units such as documents, the notion of domains is elusive.
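The K-mixture of Equations 5.19-5.22 is straightforward to compute once f_i, n_i and N are known. Below is a minimal sketch, assuming f_i is the corpus frequency of word i, n_i the number of documents containing i, and N the total number of documents; the function names are illustrative, and the assumption f_i > n_i (so that β_i > 0) is a simplification of this sketch rather than part of the original formulation.

def k_mixture_params(f_i, n_i, N):
    """Equations 5.20-5.21 (assumes f_i > n_i so that beta_i > 0)."""
    lam = f_i / N                     # single-Poisson mean, lambda_i
    beta = (f_i - n_i) / n_i          # extra occurrences per document containing word i
    alpha = lam / beta
    return alpha, beta

def k_mixture_pmf(k, alpha, beta):
    """Equation 5.19: P(k; alpha_i, beta_i)."""
    delta = 1.0 if k == 0 else 0.0    # Kronecker delta
    return (1 - alpha) * delta + (alpha / (beta + 1)) * (beta / (beta + 1)) ** k

# Probability that word i occurs at least once in a document:
# 1 - k_mixture_pmf(0, alpha, beta), i.e. the complement of Equation 5.22.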
The lines labeled with “KM” in Figure 5.2 are the result of the aggregation of the individual probability of occurrence of word i in documents of the respective corpora. Figures 5.2(a) and 5.2(b) clearly show that models like K-mixture whose distributions are defined over documents or other units with clear, explicit boundaries cannot be employed directly as predictors for the actual rate of occurrence of words in domain. 5.4 A New Probabilistic Framework for Determining Termhood We begin the derivation of the new probabilistic framework for term recognition by examining the definition of termhood. Based on two prominent review papers on term recognition [120, 131], we define termhood as: Definition 5.4.1. Termhood is the degree to which a lexical unit is relevant to a domain of interest. As outlined in Section 5.1, our focus is to construct a formal framework which combines evidence, in the form of term characteristics, instead of seedterms for term recognition. The aggregated evidence can then used to determine the extent of relevance of the corresponding term with respect to a particular domain, as in the definition of termhood in Definition 5.4.1. The characteristics of terms manifest themselves in suitable corpora that represent the domain of interest. From here on, we will use the notation d to interchangeably denote the elusive notion of a domain and its tangible counterpart, the domain corpus. Since the quality of the termhood evidence with respect to the domain is dependent on the issue of representativeness of the corresponding corpus, the following assumption is necessary for us to proceed: Assumption 1. Corpus d is a balanced, unbiased and randomised sample of the population text representing the corresponding domain. The actual discussion on corpus representativeness is nevertheless important but the issue is beyond the scope of this chapter. Having Assumption 1 in place, we restate Definition 5.4.1 in terms of probability to allow us to formulate a probabilistic model for measuring termhood in the next two steps, 90 Chapter 5. Term Recognition Aim 1. What is the probability that a is relevant to domain d given the evidence a has? In the second step, we lay the foundation for the various term characteristics which are mentioned throughout this chapter, and used specifically for formalising termhood evidence in Section 5.4.2. We subscribed to the definition of ideal terms as adopted by many researchers [157]. Definition 5.4.2 outlines the primary characteristics of terms. These characteristics rarely exist in real-world settings since word ambiguity is a common phenomenon in linguistics. Nevertheless, this definition is necessary to establish a platform for determining the extent of deviation of the characteristics of terms in actual usage from the ideal cases, Definition 5.4.2. The primary characteristics of terms in ideal settings are: • Terms should not have synonyms. In other words, there should be no different terms implying the same meaning. • Meaning of terms is independent of context. • Meaning of terms should be precise and related directly to a concept. In other words, a term should not have different meanings or senses. In addition to Definition 5.4.2, there are several other related characteristics of terms which are of common knowledge in this area. Some of these characteristics follow from the general properties of words in linguistics. This list is not a standard and is by no means exhaustive or properly theorised. 
Nonetheless, as we have pointed out, such heuristically-motivated list is one of the foundation of automatic term recognition. They are as follow: Definition 5.4.3. The extended characteristics of terms: 1 Terms are properties of domain, not document [19]. 2 Terms tend to clump together [28] the same way content-bearing words do [285]. 3 Terms with longer length are rare in a corpus since the usage of words with shorter length is more predominant [284]. 4 Simple terms are often ambiguous and modifiers are required to reduce the number of possible interpretations. 91 5.4. A New Probabilistic Framework for Determining Termhood 5 Complex terms are preferred [80] since the specificity of such terms with respect to certain domains are well-defined. Definition 5.4.2 simply states that a term is unambiguously relevant to a domain. For instance, assume that once we encounter the term “bridge”, it has to immediately mean “a device that connects multiple network segments at the data link layer”, and nothing else. At the same time, such “device” should not be identifiable using other labels. If this is the case, all we need to do is to measure the extent to which a term candidate is relevant to a domain regardless of its relevance to other domains since an ideal term cannot be relevant to both (as implied in Definition 5.4.2). This brings us to the third step where we can now formulate our Aim 1 as a conditional probability between two events and pose it using Bayes Theorem, P (R1 |A) = P (A|R1 )P (R1 ) P (A) (5.23) where R1 is the event that a is relevant to domain d and A is the event that a is a candidate with evidence set V = {E1 , ..., Em }. P (R1 |A) is the posterior probability of candidate a being relevant to d given the evidence set V associated to a. P (R1 ) and P (A) are the prior probabilities of candidate a being relevant without any evidence, and the probability of a being a candidate with evidence V , respectively. One has to bare in mind that Equation 5.23 is founded upon the Bayesian interpretation of probability. Consequently, subjective rather than frequency-based assessments of P (R1 ) and P (A) are well-accepted, at least by the Bayesians. As we shall see later, these two prior probabilities will be immaterial in the final computation of weights for the candidates. In addition, we introduce the event that a is relevant to other ¯ R2 , which can be seen as the complementary event of R1 . Similar to domains d, ¯ Assumption 1, we subscribe to the following assumption for d, Assumption 2. Contrastive corpus d¯ is the set of balanced, unbiased and randomised sample of the population text representing approximately all major domains other than d. Based on the ideal characteristics of terms in Definition 5.4.2, and the new event R2 , we can state that P (R1 ∩ R2 ) = 0. In other words, R1 and R2 are mutually exclusive in ideal settings. Ignoring the fact that a term may appear in certain ¯ but definitely domains by chance, any candidate a can either be relevant to d or to d, not both. Unfortunately, a point worth noting is that “An impregnable barrier between words of a general language and terminologies does not exist.” [157]. For 92 Chapter 5. Term Recognition example, the word “bridge” has multiple meaning and is definitely relevant to more than one domain, or in other words, P (R1 ∩ R2 ) is not strictly 0 in reality. While people in the computer networking domain may accept and use the word “bridge” as a term, it is in fact not an ideal term. 
Words like “bridge” are often a poor choice of terms (i.e. not ideal terms) simply because they are simple terms, and inherently ambiguous as defined in Definition 5.4.3.4. Instead, a better term for denoting the concept which the word “bridge” attempts to represent would be “network bridge”. As such, we assume that: Assumption 3. Each concept represented using a polysemous simple term in a corpus has a corresponding unambiguous complex term representation occurring in the same corpus. From Assumption 3, since all important concepts of a domain have unambiguous manifestations in the corpus, the possibility of the ambiguous counterparts achieving lower ranks during our termhood measurement will have no effect on the overall term recognition output. In other words, polysemous simple terms can be considered as insignificant in our determination of termhood. Based on this alone, we can assume ¯ This brings us that the non-relevance to d approximately implies the relevance to d. to the next property about the prior probability of relevance of terms. The mutual exclusion and complementation properties of the relevance of terms in d, R1 , and in ¯ R2 are: d, • P (R1 ∩ R2 ) ≈ 0 • P (R1 ∪ R2 ) = P (R1 ) + P (R2 ) ≈ 1 Even in the presence of a prior probability to this approximation, the addition law of probability still has to hold. As such, we can extend this approximation of the sum of the probability of relevance without evidence to include the prior probability of evidence: P (R1 |A) + P (R2 |A) ≈ 1 (5.24) without violating the probability axioms. Knowing that P (R1 ∩ R2 ) only approximates to 0 in reality, we will need to make sure that the relevance of candidate a in domain d does not happen by chance. The occurrence of a term in a domain is considered as accidental if the concepts represented by the terms are not topical for that domain. Moreover, the accidental repeats of the same term in non-topical cases are possible [122]. Consequently, we 93 5.4. A New Probabilistic Framework for Determining Termhood need to demonstrate the odds of term candidate a being more relevant to d than to ¯ d: Aim 2. What are the odds of candidate a being relevant to domain d given the evidence it has? In this fourth step, we alter Equation 5.23 to reflect our new Aim 2 for determinP ing the odds rather than merely probabilities. Since Odds = 1−P , we can apply an 1 order-preserving transformation by multiplying 1−P (R1 |A) to Equation 5.23 to give us the odds of relevance given the evidence candidate a has: P (R1 |A) P (A|R1 )P (R1 ) = 1 − P (R1 |A) P (A)(1 − P (R1 |A)) (5.25) and since 1 − P (R1 |A) ≈ P (R2 |A) from Equation 5.24, we have: P (A|R1 )P (R1 ) P (R1 |A) = P (R2 |A) P (A)P (R2 |A) (5.26) and applying the multiplication rule P (R2 |A)P (A) = P (A|R2 )P (R2 ) to both sides of Equation 5.25 to obtain P (A|R1 ) P (R1 ) P (R1 |A) = P (R2 |A) P (A|R2 ) P (R2 ) (5.27) Equation 5.27 can also be called the odds of relevance of candidate a to d given the (R1 ) is the odds of relevance evidence a has. The second term in Equation 5.27, PP (R 2) of candidate a without evidence. We can use Equation 5.27 as a way to rank the candidates. Taking the log of odds, we have log P (R1 |A) P (R1 ) P (A|R1 ) = log − log P (A|R2 ) P (R2 |A) P (R2 ) P (A|R1 ) and P (A|R2 ) are the class conditional probabilities for a being a candidate with evidence V given its different states of relevance. 
Since probability of relevance to d and to d¯ of all candidates without any evidence are the same, we can safely ignore the second term (i.e. odds of relevance without evidence) in Equation 5.27 without committing the prosecutor’s fallacy [232]. This gives us log P (R1 |A) P (A|R1 ) ≈ log P (A|R2 ) P (R2 |A) (5.28) To facilitate the scoring and ranking of the candidates based on the evidence they have, we introduce a new function of evidence possessed by candidate a. We call this new function the Odds of Termhood (OT) 94 Chapter 5. Term Recognition P (A|R1 ) (5.29) P (A|R2 ) Since we are only interested in ranking and from Equation 5.28, ranking candidates according to OT (A) is the same as ranking the candidates according to our Aim 2 reflected through Equation 5.27. Obviously, from Equation 5.29, our initial predicament of not being able to empirically interpret the prior probabilities P (A) and P (R1 ) is no longer a problem. OT (A) = log Assumption 4. Independence between evidences in the set V . In the fifth step, we decompose the evidence set V associated with each candidate a to facilitate the assessment of the class conditional probabilities P (A|R1 ) and P (A|R2 ). Given Assumption 4, we can evaluate P (A|R1 ) as Y P (A|R1 ) = P (Ei |R1 ) (5.30) i and P (A|R2 ) as P (A|R2 ) = Y P (Ei |R2 ) (5.31) i where P (Ei |R1 ) and P (Ei |R2 ) are the probabilities of a as a candidate associated with evidence Ei given its different states of relevance R1 and R2 , respectively. Substituting Equation 5.30 and 5.31 in 5.29, we get OT (A) = X log i P (Ei |R1 ) P (Ei |R2 ) (5.32) Lastly, for the ease of computing the evidence, we define individual scores called evidential weight (Oi ) provided by each evidence Ei as Oi = P (Ei |R1 ) P (Ei |R2 ) (5.33) and substituting Equation 5.33 in 5.32 provides OT (A) = X log Oi (5.34) i The purpose of OT is similar to many other functions for scoring and ranking term candidates such as those reviewed in Section 5.3.2. However, what differentiates our new function from the existing ones is that OT is founded upon and derived in a probabilistic framework whose assumptions are made explicit. Moreover, as we will discuss in the following Section 5.4.2, the individual evidence is formulated using probability and the necessary term distributions are derived from formal distribution models to be discussed in Section 5.4.1. 95 5.4. A New Probabilistic Framework for Determining Termhood 5.4.1 Parameters Estimation for Term Distribution Models In Section 5.3.3, we presented a wide range of distribution models for individual terms and for all terms in the corpus. Our intention is to avoid the use of raw frequencies and also relative frequencies for computing the evidential weights, Oi . The shortcomings related to the use of raw frequencies have been clearly highlighted in Section 5.3. In this section, we discuss the two models that we employ for computing the evidential weights, namely, the Zipf-Mandelbrot model and the K-mixture model. The Zipf-Mandelbrot model is employed to predict the probability of occurrence of a term candidate in the domain, while the K-mixture model predicts the probability of a certain number of occurrences of a term candidate in a document of the domain. Most of the literature on Zipf and Zipf-Mandelbrot laws often leave out the single most important aspect that makes these two distributions applicable to real-world applications, namely, parameter estimation. 
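Before turning to parameter estimation, the scoring function just derived can be summarised operationally: Equations 5.33 and 5.34 amount to summing the logarithms of the individual evidential weights. The sketch below illustrates this, assuming each O_i has already been computed and is strictly positive; the function and variable names are illustrative only.

import math

def odds_of_termhood(evidential_weights):
    """OT(A) = sum_i log O_i (Equation 5.34), where each O_i is the ratio
    P(E_i | R1) / P(E_i | R2) of Equation 5.33 and is assumed to be > 0."""
    return sum(math.log(o_i) for o_i in evidential_weights)

def rank_by_termhood(candidates, evidence_functions):
    """Score every candidate by OT and rank in descending order.
    `evidence_functions` is a list of callables, one per evidence E_i,
    each returning the evidential weight O_i for a given candidate."""
    scored = [(a, odds_of_termhood([f(a) for f in evidence_functions]))
              for a in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)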
As we have discussed in Section 5.3.3, the manual process of deciding the parameters, namely s, q and H in the case of the Zipf-Mandelbrot distribution, is tedious and may not easily achieve the best fit. Moreover, the parameters for modelling the same set of terms vary across different corpora. A recent paper by Izsak [112] discussed some general aspects of standardising the process of fitting the Zipf-Mandelbrot model. We experimented with linear regression using the ordinary least squares method [200] and the weighted least squares method [63] to estimate the three parameters s, q and H. The results of our experiments are reported in this section. We would like to stress that our focus is on searching for appropriate parameters that achieve a good fit of the models to the observed data (i.e. raw frequencies and ranks). While we do not discount the importance of observing the various assumptions involved in linear regression, such as the normality of residuals and the homogeneity of variance, their discussion is beyond the scope of this chapter. We begin by linearising Equation 5.17 of the Zipf-Mandelbrot model to allow for linear regression. From here on, where there is no confusion, we will refer to the probability mass function of the Zipf-Mandelbrot model, P(r; q, s, H), as ZM_r for clarity. Taking the natural logarithm of both sides of Equation 5.17, we obtain:

\ln ZM_r = \ln H - s \ln(r + q)    (5.35)

Our aim is then to find the line defined by Equation 5.35 that best fits our observed points \{(\ln r, \ln \frac{f_r}{F}); r = 1, 2, ..., |W|\}. For sufficiently large r, \ln(r + q)/\ln(r) approximates to 1 and we have:

\ln ZM_r \approx \ln H - s \ln r

As a result, \ln ZM_r is an approximately linear function of \ln r, with the points scattered along a straight line with slope −s and an intercept of \ln H on the Y-axis. We can then move on to determine the estimates of \ln H and s. We attempt to minimise the sum of squared residuals (SSR) between the actual points \ln \frac{f_i}{F} and the predicted points \ln ZM_i:

SSR = \sum_{i=1}^{|W|} \left(\ln \frac{f_i}{F} - \ln ZM_i\right)^2    (5.36)

Given that |W| is the number of words, the least squares estimates of s and \ln H are defined as [1]:

s = \frac{|W| \sum_j (\ln \frac{f_j}{F})(\ln j) - \sum_j (\ln \frac{f_j}{F}) \sum_j (\ln j)}{|W| \sum_j (\ln j)^2 - \sum_j (\ln j) \sum_j (\ln j)}    (5.37)

and

\ln H = \frac{\sum_j (\ln ZM_j) - s \sum_j (\ln j)}{|W|}    (5.38)

Since the approximation of \ln \frac{f_r}{F} is given by \ln ZM_r = \ln H - s \ln(r + q), it holds that \ln \frac{f_1}{F} \approx \ln ZM_1. Following this, \ln \frac{f_1}{F} \approx \ln H - s \ln(1 + q). As a result, we can estimate q using s and \ln H as follows:

\ln \frac{f_1}{F} \approx \ln H - s \ln(1 + q)

\ln(1 + q) \approx \frac{1}{s}\left(\ln H - \ln \frac{f_1}{F}\right)

q \approx e^{\frac{1}{s}(\ln H - \ln \frac{f_1}{F})} - 1    (5.39)

To illustrate our process of automatically fitting the Zipf-Mandelbrot model, please refer to Figure 5.4. Figure 5.6 summarises the parameters of the automatically fitted Zipf-Mandelbrot models. The lines labelled "ZM-OLS" in Figures 5.4(a) and 5.4(b) show the automatically fitted Zipf-Mandelbrot model for the distribution of the same set of 3,058 words employed in Figure 5.2, using the ordinary least squares method. The line "RF" is the actual relative frequency that we are trying to fit. One will notice from the SSR column of Figure 5.5 that the automatic fit provided by the ordinary least squares method achieves relatively good results (i.e. low SSR) in dealing with the curves along the line in Figure 5.4(b).
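The closed-form estimates of Equations 5.37-5.39 can be computed directly from the observed ranks and relative frequencies. The following sketch mirrors that procedure; it assumes the words are already sorted by descending frequency so that rank r corresponds to index r − 1, and the sign handling follows the linearised form ln ZM_r ≈ ln H − s ln r (slope −s, intercept ln H) rather than reproducing the layout of Equation 5.37 verbatim. The names are illustrative.

import numpy as np

def fit_zipf_mandelbrot(frequencies):
    """Estimate s, ln(H) and q for the linearised Zipf-Mandelbrot model
    ln ZM_r ~= ln H - s ln r (Equations 5.35-5.39). `frequencies` must be
    sorted in descending order; rank r = index + 1."""
    freqs = np.asarray(frequencies, dtype=float)
    F = freqs.sum()
    y = np.log(freqs / F)                      # observed ln(f_r / F)
    x = np.log(np.arange(1, len(freqs) + 1))   # ln r
    n = len(freqs)
    slope = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
    intercept = (y.sum() - slope * x.sum()) / n
    s = -slope                                 # the fitted line has slope -s
    ln_H = intercept                           # intercept on the Y-axis
    q = np.exp((ln_H - y[0]) / s) - 1          # Equation 5.39, from the top-ranked word
    return s, ln_H, q

def ssr(frequencies, s, ln_H, q):
    """Sum of squared residuals between observed ln(f_r/F) and ln ZM_r (Equation 5.36)."""
    freqs = np.asarray(frequencies, dtype=float)
    F = freqs.sum()
    r = np.arange(1, len(freqs) + 1)
    predicted = ln_H - s * np.log(r + q)
    return float(((np.log(freqs / F) - predicted) ** 2).sum())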
A New Probabilistic Framework for Determining Termhood (a) Distribution of words extracted from the domain corpus dispersed according to the domain corpus. (b) Distribution of words extracted from the domain corpus dispersed according to the contrastive corpus. Figure 5.4: Distribution of the same 3, 058 words as employed in Figure 5.2. The line with the label “ZM-OLR” is the Zipf-Mandelbrot model fitted using ordinary least squares method. The line labeled “ZM-WLS” is the Zipf-Mandelbrot model fitted using weighted least squares method, while “RF” is the actual rate of occurrence computed as fi /F . 97 98 Chapter 5. Term Recognition Figure 5.5: Summary of the sum of squares of residuals, SSR and the coefficient of determination, R2 for the regression using manually estimated parameters, parameters estimated using ordinary least squares (OLS), and parameters estimated using weighted least squares (WLS). Obviously, the smaller the SSR is, the better the fit. As for 0 ≤ R2 ≤ 1, the upper bound is achieved when the fit is perfect. Figure 5.6: Parameters for the automatically fitted Zipf-Mandelbrot model for the set of 3, 058 words randomly drawn. We also attempted to fit the Zipf-Mandelbrot model using the second type of least squares method, namely, the weighted least squares. The idea is to assign to each point ln fi /F a weight that reflects the uncertainty of the observation. Instead of weighting all points equally, they are weighted such that points with a greater weight contribute more to the fit. Most of the time, the weight wi assigned to the i-th point is determined as a function of the variance of that observation, denoted as wi = σi−1 . In other words, we assign points with lower variances greater statistical weights. Instead of using variance, we propose the assignment of weights for the weighted least squares method based on the changes of the slopes at each segment of the distribution. The slope at the point (xi , yi ) is defined as the slope of the segment between the points (xi , yi ) and (xi−1 , yi−1 ), and it is given by: mi,i−1 = yi − yi−1 xi − xi−1 The weight to be assigned for each point (xi , yi ) is a function of the conditional cumulation of slopes up to that point. The cumulation of the slopes is conditional depending on the changes between slopes. The slope of the segment between point 99 5.4. A New Probabilistic Framework for Determining Termhood i and i − 1 is added to the cumulative slope if its rate of change from the previous segment i − 1 and i − 2 is between 1.1 and 0.9. In other words, the slopes between the two segments are approximately the same. If the change in slopes between the two segments is outside that range, the cumulative slope is reset to 0. Given that i = 1, 2, ..., |W |, computing the slope at point 1 uses a prior non-existence point, m1,0 = 0. Formally, we set the weight wi to be assigned to point i for the weighted least squares method as: 0 if(i = 1) mi,i−1 mi,i−1 + wi−1 if(i 6= 1 ∧ 1.1 ≤ ≤ 0.9) wi = (5.40) mi−1,i−2 0 otherwise Consequently, instead of minimising the sum of squares of the residuals (SSR) where all points are treated equally as in the ordinary least squares in Equation 5.36, we include the new weight wi defined in Equation 5.40 to give us: SSR = |W | X i=1 wi (ln fi − ln ZMi )2 F (5.41) Referring back to Figure 5.4, the lines with the label “ZM-WLS” demonstrate the fit of the Zipf-Mandelbrot model whose parameters are estimated using the weighted least squares method. 
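The slope-conditioned weights of Equation 5.40 can be accumulated in a single pass over the ranked points. A minimal sketch is given below; the 0.9-1.1 tolerance band is taken from the text, while treating a zero previous slope as a reset, and whether to accumulate signed slopes or their absolute values, are assumptions of this sketch rather than details fixed by the text.

def slope_weights(xs, ys, low=0.9, high=1.1):
    """Weights w_i of Equation 5.40 for weighted least squares fitting.
    The slope at point i is m_{i,i-1} = (y_i - y_{i-1}) / (x_i - x_{i-1});
    it is accumulated while consecutive slopes change by no more than the
    given band, and the cumulative slope is reset to zero otherwise."""
    weights = [0.0]                       # w_1 = 0 (no preceding point, m_{1,0} = 0)
    prev_slope = 0.0
    for i in range(1, len(xs)):
        slope = (ys[i] - ys[i - 1]) / (xs[i] - xs[i - 1])
        if prev_slope != 0.0 and low <= slope / prev_slope <= high:
            weights.append(slope + weights[-1])   # slopes approximately equal: accumulate
        else:
            weights.append(0.0)                   # rate of change outside the band: reset
        prev_slope = slope
    return weights

# The weighted SSR of Equation 5.41 is then sum_i w_i * (ln(f_i/F) - ln ZM_i)^2.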
The line “RF” is again the actual relative frequency we are trying to fit. Despite the curves, especially in the case of using the contrastive corpus ¯ the weighted least squares is able to provide a good fit. The constantly changing d, slopes especially in the middle of the distribution provide an increasing weight to each point, enabling such points to contribute more to the fitting. In the subsequent sections, we will utilise the Zipf-Mandelbrot model for modelling the distribution of term candidates in both the domain corpus (from which the candidates were extracted), and also the contrastive corpus. We employ P (r; q, s, H) of Equation 5.17 in Section 5.3.3 to compute the probability of occurrence of ranked words in both the domain corpus and the contrastive corpus. The parameters H, q and s are estimated as shown in Equation 5.38, 5.39 and 5.37, respectively. For standardisation purposes, we introduce the following notations: • ZMrd provides the probability of occurrence of a word with rank r in domain corpus d; and • ZMrd¯ provides the probability of occurrence of a word with rank r in the ¯ contrastive corpus d. 100 Chapter 5. Term Recognition In addition, we will also be using the K-mixture model as discussed in Section 5.3.3 for predicting the probability of occurrence of the term candidates in documents of the respective corpora. Recall that P (0; αi , βi ) in Equation 5.22 gives us the probability that word i does not occur in a document (i.e. probability of nonoccurrence) and 1 − P (0; αi , βi ) gives us the probability that word i has at least one occurrence in a document (i.e. probability of occurrence). The βi and αi are computed based on Equations 5.20 and 5.21. The distribution of words in either d or d¯ can be achieved by defining the parameters of the K-mixture model over the respective corpora. We will employ the following notations: • KMad is the probability of occurrence of word a in documents in the domain corpus d; and • KMad¯ is the probability of occurrence of word a in documents in contrastive ¯ corpus d; 5.4.2 Formalising Evidences in a Probabilistic Framework All existing techniques for term recognition are founded upon some heuristics or linguistic theories that define what makes a term candidate relevant. However, there are researchers [120, 107] who criticised such existing methods for the lack of proper theorisation despite the reasonable intuitions behind them. Definition 5.4.2.1 highlights a list of commonly adopted characteristics for determining the relevance of terms [120]. Definition 5.4.2.1. Characteristics of term relevance: 1 A term candidate is relevant to a domain if it appears relatively more frequent in that domain than in others. 2 A term candidate is relevant to a domain if it appears only in this one domain. 3 A term candidate relevant to a domain may have biased occurrences in that domain: 3.1 A term candidate of rare occurrence in a domain. Such candidates are also known as “hapax legomena” which manifest itself as the long tail in Zipf’s law. 3.2 A term candidate of common occurrence in a domain. 4 Following from Definition 5.4.3.4 and 5.4.3.5, a complex term candidate is relevant to a domain if its head is specific to that domain. 5.4. A New Probabilistic Framework for Determining Termhood 101 We propose a series of evidence as listed below to capture the individual characteristics presented in Definition 5.4.2.1 and 5.4.3. 
They are as follow: • Evidence 1: Occurrence of term candidate a • Evidence 2: Existence of term candidate a • Evidence 3: Specificity of the head ah of term candidate a • Evidence 4: Uniqueness of term candidate a • Evidence 5: Exclusivity of term candidate a • Evidence 6: Pervasiveness of term candidate a • Evidence 7: Clumping tendency of term candidate a The seven evidence is used to compute the corresponding evidential weights Oi which in turn are summed to produce the final ranking using OT as defined in Equation 5.34. Since OT served as a probabilistically-derived formulaic realisation of our Aim 2, we can consider the various Oi as manifestations of sub-aims of Aim 2. The formulation of the evidential weights begin with the associated definitions and subaims. Each sub-aim attempting to realise the associated definition has an equivalent mathematical formulation. The formula is then expanded into a series of probability functions connected through the addition and multiplication rule. There are four basic probability distributions that are required to compute the various evidential weights: • P(occurrence of a in d)=P (a, d): This distribution provides the probability of occurrence of a in the domain corpus d. By ranking the term candidates according to their frequency of occurrence in domain d, each term candidate will have a rank r. We employ ZMrd described in Section 5.4.1 for this purpose. For brevity, we use P (a, d) to denote the probability of occurrence of term candidate a in the domain corpus d. ¯ ¯ This distribution provides the probability of • P(occurrence of a in d)=P (a, d): ¯ By ranking the term candidates occurrence of a in the contrastive corpus d. ¯ each according to their frequency of occurrence in the contrastive corpus d, term candidate will have a rank r. We employ ZMrd¯ described in Section ¯ to denote the probability of 5.4.1 for this purpose. For brevity, we use P (a, d) ¯ occurrence of term candidate a in the contrastive corpus d. 102 Chapter 5. Term Recognition • P(occurrence of a in documents in d)=PK (a, d): This distribution provides the probability of occurrence of a in documents in domain corpus d where the subscript K refers to K-mixture. One should be immediately reminded of KMad described in Section 5.4.1. For brevity, we employ PK (a, d) to denote the probability of occurrence of term candidate a in documents in the domain corpus d. ¯ ¯ • P(occurrence of a in documents in d)=P K (a, d): This distribution provides ¯ the probability of occurrence of a in documents in the contrastive corpus d. We employ the distribution provided by KMad¯ described in Section 5.4.1. ¯ to denote the probability of occurrence of term For brevity, we use PK (a, d) ¯ candidate a in documents in the contrastive corpus d. ¯ described above are defined over Since the probability masses P (a, d) and P (a, d) ¯ we have the sample space of all words in the respective corpus (i.e. either d or d), that for any term candidate a ∈ W : • 0 ≤ P (a, d) ≤ 1 ¯ ≤1 • 0 ≤ P (a, d) and • • P ∀a∈W P ∀a∈W P (a, d) = 1 ¯ =1 P (a, d) ¯ are On the other hand, the other two distributions, namely, PK (a, d) and PK (a, d) defined over the sample space of all possible number of occurrences k = 0, 1, 2, ..., n of a particular term candidate a in a document using the K-mixture model. Hence, • 0 ≤ PK (a, d) ≤ 1 ¯ ≤1 • 0 ≤ PK (a, d) but • • P ∀a∈W P ∀a∈W PK (a, d) 6= 1 ¯ 6= 1 PK (a, d) 5.4. 
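As a minimal sketch of how the four base distributions above might be organised in practice, the container below bundles rank-indexed Zipf-Mandelbrot probabilities with per-word K-mixture parameters, reusing the k_mixture_pmf helper from the earlier sketch. The class name, constructor arguments and internal structure are illustrative assumptions, not a description of the actual implementation.

class BaseDistributions:
    """Bundles the four probabilities used by the evidential weights:
    P(a, d) and P(a, d_bar) from rank-indexed Zipf-Mandelbrot fits, and
    PK(a, d) and PK(a, d_bar) from per-word K-mixture parameters."""

    def __init__(self, zm_domain, zm_contrastive, rank_domain, rank_contrastive,
                 km_domain, km_contrastive):
        self.zm_d, self.zm_dbar = zm_domain, zm_contrastive          # arrays of ZM_r values
        self.rank_d, self.rank_dbar = rank_domain, rank_contrastive  # word -> rank r
        self.km_d, self.km_dbar = km_domain, km_contrastive          # word -> (alpha, beta)

    def p(self, a, contrastive=False):
        """P(a, d) or P(a, d_bar): probability of occurrence of a in the corpus."""
        zm, rank = (self.zm_dbar, self.rank_dbar) if contrastive else (self.zm_d, self.rank_d)
        return zm[rank[a] - 1]

    def pk(self, a, contrastive=False):
        """PK(a, d) or PK(a, d_bar): probability that a occurs at least once in a document."""
        alpha, beta = (self.km_dbar if contrastive else self.km_d)[a]
        return 1.0 - k_mixture_pmf(0, alpha, beta)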
A New Probabilistic Framework for Determining Termhood 103 In addition to the axioms above, there are three sets of related properties that require further clarification. The first set concerns the events of occurrence and non-occurrence of term candidates in the domain corpus d and in the contrastive ¯ corpus d: Property 1. Properties of the probability distributions of occurrence and nonoccurrence of term candidates in the domain corpus and the contrastive corpus: 1) The events of the occurrences of a in d and in d¯ are not mutually exclusive. In other words, the occurrence of a in d does not imply the non-occurrence of ¯ This is true since any term candidate can occur in d, d¯ or even both, a in d. either intentionally or by accident. ¯ = ◦ P (occurrence of a in d ∩ occurrence of a in d) 6 0 2) The occurrence of words in d does not affect (i.e. independent of) the probability of its occurrence in other domains d¯ and vice versa. ¯ = P (a, d)P (a, d) ¯ ◦ P (occurrence of a in d ∩ occurrence of a in d) 3) The events of occurrence and non-occurrence of the same candidate within the same domain are complementary. ◦ P (non-occurrence of a in d) = 1 − P (a, d) 4) Following from 1), the events of occurrence in d and non-occurrence in d¯ and vice versa of the same term are not mutually exclusive since candidate a can ¯ occur in both d and d. ¯ = ◦ P (occurrence of a in d ∩ non-occurrence of a in d) 6 0 5) Following from 2), the events of occurrence in d and non-occurrence in d¯ and vice versa of the same term are also independent. ¯ = P (a, d)(1 − P (a, d)) ¯ ◦ P (occurrence of a in d ∩ non-occurrence of a in d) The second set of properties is concerned with complex term candidates. Each complex candidate a is made up of a head ah and a set of modifiers Ma . Since ¯ they candidate a and its head ah have the possibility of both occurring in d or in d, are not mutually exclusive. As such, the probability of union of the two events of occurrence is not the sum of the individual probability of occurrence. Lastly, we will assume that the occurrences of candidate a and its head ah within the same domain 104 Chapter 5. Term Recognition ¯ are independent. While this may not be the case in reality, but as (i.e. either d or d) we shall see later, such property allows us to provide estimates for many non-trivial situations. As such, Property 2. The mutual exclusion and independence property of the occurrence of ¯ term candidate a and its head ah within the same corpus (i.e. either in d or in d): ◦ P (occurrence of a in d ∩ occurrence of ah in d) 6= 0 ◦ P (occurrence of a in d ∩ occurrence of ah in d) = P (a, d)P (ah , d) ◦ P (occurrence of a in d ∪ occurrence of ah in d) = P (a, d) + P (ah , d) − P (a, d)P (ah , d) The last set of properties is made in regard to the occurrence of candidates in documents in the corpus. Since the probability of occurrence of a candidate in documents is derived from Poisson mixture, then it follows that the probability of occurrence (where k ≥ 1) of a candidate in documents is the complement of the probability of non-occurrence of that candidate (where k = 0). Property 3. The complementation property of the occurrence and non-occurrence ¯ of term candidate a in documents within the same domain (i.e. either in d or in d): ◦ P (non-occurrence of a in documents in d) = 1 − PK (a, d) Next, we move on to define the odds that correspond to each of the evidence laid out earlier. • Odds of Occurrence: The first evidential weight O1 attempts to realise Defi¯ The nition 5.4.2.1.1. 
O1 captures the odds of whether a occurs in d or in d. notion of occurrence is the simplest among all the weights on which most other evidential weights are founded upon. Formally, O1 can be described as Sub-Aim 1. What are the odds of term candidate a occurring in d? and can be mathematically formulated as: P (occurrence P (occurrence P (occurrence = P (occurrence P (a, d) = ¯ P (a, d) O1 = of of of of a|R1 ) a|R2 ) a in d) ¯ a in d) (5.42) 5.4. A New Probabilistic Framework for Determining Termhood 105 • Odds of Existence: Similar to O1 , the second evidential weight O2 attempts to realise Definition 5.4.2.1.1 but keeping in mind Definition 5.4.3.3 and Definition 5.4.3.4. We can consider O2 as a realistic extension of O1 for reasons to be discussed below. Since the non-occurrence of term candidates in the corpus does not imply its conceptual absence or non-existence, we would like O2 to capture the following: Sub-Aim 2. What are the odds of term candidate a being in existence in d? The main issue related to the probability of occurrence is the fact that a big portion of candidates rest along the long tail of Zipf’s Law. Since most of the candidates are rare, their probabilities of occurrences alone do not reflect their actual existence or intended usage. What makes the situation worse is that longer words tend to have lower rate of occurrences [249] based on Definition 5.4.3.3. In the words of Zipf [284], “it seems reasonably clear that shorter words are distinctly more favoured in language than longer words.”. For example, consider the events that we observe more “bridge” occurring in the computer networking domain than its complex counterpart “network bridge”. The observed events do not imply that the concept represented by the complex term is different or of any less importance to the domain simply because “network bridge” occurs less than “bridge”. The fact that authors are more predisposed at using shorter terms whenever possible to represent the same concept demonstrate the Principle of Least Effort, which is the foundation behind most of Zipf’s Laws. This brings us to Definition 5.4.3.4 which requires us to assign higher importance to complex terms than their heads appearing as simple terms. We need to ensure that O2 captures these requirements. We can extend the lexical occurrence of complex candidates conceptually by including the lexical occurrence of their heads. Since the events of occurrences of a and its head ah are not mutually exclusive as discussed in Property 2, we will need to subtract the probability of the intersection of these two events from the sum of the two probabilities to obtain the probability of the union. Following this, and based on the assumptions about the probability of occurrences of complex candidates and their heads in Property 2, we can mathematically formulate O2 as P (existence P (existence P (existence = P (existence O2 = of of of of a|R1 ) a|R2 ) a in d) ¯ a in d) 106 Chapter 5. Term Recognition P (occurrence of a in d ∪ occurrence of ah in d) ¯ P (occurrence of a in d¯ ∪ occurrence of ah in d) P (a, d) + P (ah , d) − P (a, d)P (ah , d) = ¯ + P (ah , d) ¯ − P (a, d)P ¯ (ah , d) ¯ P (a, d) = In the case where candidate a is simple, the probability of occurrence of its head ah , and the probability of both a and ah occurring will be evaluated to zero. As a result, the second evidential weight O2 for simple terms will be equivalent to its first evidential weight O1 . 
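The first two evidential weights reduce to a handful of arithmetic operations over the base probabilities. The sketch below assumes `dist` is a BaseDistributions-style helper as sketched earlier and that `head(a)` is a hypothetical function returning the head of a complex candidate (None for a simple term); it is an illustration of the formulas above, not a definitive implementation.

def o1_occurrence(a, dist):
    """O_1 = P(a, d) / P(a, d_bar)  (Equation 5.42)."""
    return dist.p(a) / dist.p(a, contrastive=True)

def o2_existence(a, dist, head):
    """O_2: odds of existence, extending occurrence with the head a_h of a
    complex candidate via inclusion-exclusion; reduces to O_1 for simple terms."""
    a_h = head(a)
    if a_h is None:
        return o1_occurrence(a, dist)
    num = dist.p(a) + dist.p(a_h) - dist.p(a) * dist.p(a_h)
    den = (dist.p(a, contrastive=True) + dist.p(a_h, contrastive=True)
           - dist.p(a, contrastive=True) * dist.p(a_h, contrastive=True))
    return num / den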
Such formulation satisfies the additional Definition 5.4.3.5 that requires us to allocate higher weights to complex terms. • Odds of Specificity: The third evidential weight O3 specifically focuses on Definition 5.4.2.1.4 for complex term candidates. O3 is meant for capturing the odds of whether the inherently ambiguous head ah of a complex term a is specific to d. If the heads ah of complex terms are found to occur individually without a in large numbers across different domains, then the specificity of the concept represented by ah with regard to d may be questionable. O3 can be formally stated as: Sub-Aim 3. What are the odds that the head ah of a complex term candidate a is specific to d? The head of a complex candidate is considered as specific to a domain if the head and the candidate itself both have higher tendency of occurring together in that domain. The higher the intersection of the events of occurrences of a and ah in a certain domain, the more specific ah is to that domain. For example, if the event of both “bridge” and “network bridge” occurring together in the computer networking domain is very high, this means the possibly ambiguous head “bridge” is used in a very specific context in that domain. In such cases, when “bridge” is encountered in the domain of computer networking, one can safely deduce that it refers to the same domain-specific concept as “network bridge”. Consequently, the more specific the head ah is with respect to d, the less ambiguous its occurrence is in d. It follows from Definition 5.4.2.1.4 that the less ambiguous ah is, the chances of its complex counterpart a being relevant to d will be higher. Based on the assumptions about the probability of occurrence of complex candidates and their heads in Property 5.4. A New Probabilistic Framework for Determining Termhood 107 2, we define the third evidential weight for complex term candidates as: P (specificity of a|R1 ) P (specificity of a|R2 ) P (specificity of a to d) = ¯ P (specificity of a to d) P (occurrence of a in d ∩ occurrence of ah in d) = ¯ P (occurrence of a in d¯ ∩ occurrence of ah in d) P (a, d)P (ah , d) = ¯ (ah , d) ¯ P (a, d)P O3 = • Odds of Uniqueness: The fourth evidential weight O4 realises Definition 5.4.2.1.2 ¯ The notion of by capturing the odds of whether a is unique to d or to d. uniqueness defined here will be employed for the computation of the next two evidential weights O5 and O6 . Formally, O4 can be described as Sub-Aim 4. What are the odds of term candidate a being unique to d? A term candidate is considered as unique if it occurs only in one domain and not others. Based on the assumptions on the probability of occurrence and non-occurrence in Property 1, O4 can be mathematically formulated as: P (uniqueness of a|R1 ) P (uniqueness of a|R2 ) P (uniqueness of a to d) = ¯ P (uniqueness of a to d) ¯ P (occurrence of a in d ∩ non-occurrence of a in d) = P (occurrence of a in d¯ ∩ non-occurrence of a in d) ¯ P (a, d)(1 − P (a, d)) = ¯ − P (a, d)) P (a, d)(1 O4 = • Odds of Exclusivity: The fifth evidential weight O5 realises Definition 5.4.2.1.3.1 ¯ Forby capturing the probability of whether a is more exclusive in d or in d. mally, O5 can be described as Sub-Aim 5. What are the odds of term candidate a being exclusive in d? Something is regarded as exclusive if it exists only in a category (i.e. unique to that category) with certain restrictions such as limited usage. 
It is obvious at this point that a term candidate which is unique and rare in a domain is considered as exclusive in that domain. There are several ways of realising the rarity of terms in domains. For example, one can employ some measures 108 Chapter 5. Term Recognition of vocabulary richness or diversity [65] to quantify the extent of dispersion or concentration of term usage in a particular domain. However, the question on how such measures can be integrated into probabilistic frameworks such as the one proposed in this chapter remains a challenge. We propose the view that terms are considered as rare in a domain if they exist only in certain aspects of that domain. For example, in the domain of computer networking, we may encounter terms like “Fiber distributed data interface” and “WiMAX”. They may both be relevant to the domain but their distributional behaviour in the domain corpus is definitely different. While both may appear to represent certain similar concepts such as “high-speed transmission”, their existence are biased to different aspects of the same domain. The first term may be biased to certain aspects characterised by concepts such as “token ring” and “local area network”, while the second may appear biased to aspects like “wireless network” and “mobile application”. We propose to realise the notion of “domain aspects” through the documents that the domain contains. We consider the documents that made up the domain corpus as discussions of the various aspects of a domain. Consequently, a term candidate can be considered as rare if it has a low probability of occurrence in documents in the domain. Please note the difference in the probability of occurrence in a domain versus the probability of occurrence in documents in a domain. Following this, Property 3 and the probability of uniqueness discussed as part of O4 , we define the fifth evidential weight as: P (exclusivity of a|R1 ) P (exclusivity of a|R2 ) P (exclusivity of a in d) = ¯ P (exclusivity of a in d) P (uniqueness of a to d)P (rarity of a in d) = ¯ (rarity of a in d) ¯ P (uniqueness of a to d)P ¯ (rarity of a in d) P (a, d)(1 − P (a, d))P = ¯ − P (a, d))P (rarity of a in d) ¯ P (a, d)(1 O5 = If we subscribe to our definition of rarity proposed above where P (rarity of a in d)= 1 − PK (a, d), then, ¯ P (a, d)(1 − P (a, d))(1 − PK (a, d)) = ¯ ¯ P (a, d)(1 − P (a, d))(1 − PK (a, d)) The higher the probability that candidate a has no occurrence in documents in d, the rarer it becomes in d. Whenever the occurrence of a increases in documents in d, 1 − PK (a, d) reduces (i.e. getting less rare) and this leads to the 5.4. A New Probabilistic Framework for Determining Termhood 109 decrease in overall O5 or exclusivity. One may have noticed that the interpretation of the notion of rarity using the occurrences of candidates in documents may not be the most appropriate. Since the usage of terms in documents has an effect on the existence of terms in the domain, the independence assumption required to enable the product of the probability of uniqueness and of rarity to take place does not hold in reality. • Odds of Pervasiveness: The sixth evidential weight O6 attempts to capture Definition 5.4.2.1.3.2. Formally, O6 can be described as Sub-Aim 6. What are the odds of term candidate a being pervasive in d? Something is considered to be pervasive if it exists very commonly in only one category. This makes the notion of commonness the opposite of rarity. 
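Before moving on, the uniqueness and exclusivity weights formulated above can be sketched in the same style, again assuming a BaseDistributions-style helper and adopting 1 − PK(a, d) as the rarity of a in d, as proposed in the text; the function names are illustrative.

def o4_uniqueness(a, dist):
    """O_4 = P(a,d)(1 - P(a,d_bar)) / (P(a,d_bar)(1 - P(a,d)))."""
    p_d = dist.p(a)
    p_dbar = dist.p(a, contrastive=True)
    return (p_d * (1 - p_dbar)) / (p_dbar * (1 - p_d))

def o5_exclusivity(a, dist):
    """O_5 = O_4 scaled by the rarity ratio (1 - PK(a,d)) / (1 - PK(a,d_bar))."""
    rarity_d = 1 - dist.pk(a)
    rarity_dbar = 1 - dist.pk(a, contrastive=True)
    return o4_uniqueness(a, dist) * (rarity_d / rarity_dbar)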
A term candidate is said to be common if it occurs in most aspects of a domain. In other words, among all documents discussing about a domain, the term candidate has a high probability of occurring in most or nearly all of them. Following this and Property 3, we define the sixth evidential weight as: P (pervasiveness of a|R1 ) P (pervasiveness of a|R2 ) P (pervasiveness of a in d) = ¯ P (pervasiveness of a in d) P (uniqueness of a to d)P (commonness of a in d) = ¯ (commonness of a in d) ¯ P (uniqueness of a to d)P ¯ (commonness of a in d) P (a, d)(1 − P (a, d))P = ¯ − P (a, d))P (commonness of a in d) ¯ P (a, d)(1 O6 = If we follow the definition of rarity introduced as part of the fifth evidential weight O5 , then the notion of commonness is the complement of rarity which means P (commonness of a in d)=1-P (rarity of a in d), or = ¯ K (a, d) P (a, d)(1 − P (a, d))P ¯ − P (a, d))PK (a, d) ¯ P (a, d)(1 Similar to the interpretation of the notion of rarity, the use of the probability of non-occurrence in documents may not be the most suitable since the independence assumption that we need to make does not hold in reality. • Odds of Clumping Tendency: The last evidential weight involves the use of contextual information. There are two main issues involved in utilising contextual information. First is the question of what constitutes the context of 110 Chapter 5. Term Recognition a term candidate, and secondly, how to cultivate and employ contextual evidence. Regarding the first issue, it is obvious that the importance of terms and context in characterising domain concepts should be reflected through their heavy participations in the states or actions expressed by verbs. Following this, we put forward a few definitions related to what context is, and the relationship between terms and context. Definition 5.4.2.2. Words that are contributors to the same state or action as a term can be considered as context related to that term. Definition 5.4.2.3. Relationship between terms and their context 1 In relation to Definition 5.4.3.2, terms tend to congregate at different parts of text to describe or characterise certain aspects of the domain d. 2 Following the above, terms which clump together will eventually be each others’ context. 3 Following the above, context which are also terms (i.e. context terms) are more likely to be better context since actual terms tend to clump. 4 Context words which are also semantically related to their terms are more qualified at describing those terms. Regarding the second issue highlighted above, we can employ context to promote or demote the rank of terms based on the terms’ tendency to clump. Following Definition 5.4.2.3, terms with higher tendency to clump or occur together with their context should be promoted since such candidates are more likely to be the “actual” terms in a domain. Some readers may be able to recall from Definition 5.4.2 which states that the meaning of terms should be independent from their context. We would like to point out that the function of this last evidential weight is not to infer the meaning of terms from their context since such action conflicts with Definition 5.4.2. Instead, we employ context to investigate and reveal an important characteristic of terms as defined in Definition 5.4.3.2 and 5.4.2.3, namely, the tendency of terms to clump. We employ the linguistically-motivated technique by Wong et al. [275] to extract term candidates together with their context words in the form of instantiated sub-categorisation frames [271]. 
This seventh evidential weight O7 attempts to realise Definition 5.4.3.2 and 5.4.2.3. Formally, O7 can be described as 5.5. Evaluations and Discussions 111 Sub-Aim 7. What are the odds that term candidate a clumps with its context Ca in d? We can compute the clumping tendency of candidate a and its context words as the probability of candidate a occurring together with any of its context words. The higher the probability of candidate a and its context words occurring together in the same domain, the more likely it is that they clump. Since related context words are more qualified at describing the terms based on Definition 5.4.2.3.4, we have to include a semantic relatedness measure for that purpose. We employ Psim (a, c) to estimate the probability of relatedness between candidate a and its context word c ∈ Ca ∩ T C. Psim (a, c) is implemented using the semantic relatedness measure N GD by Cilibrasi & Vitanyi [50, 261] which has been discussed in Section 5.3.2. Let c ∈ Ca ∩ T C be the set of context terms, the last evidential weight can be mathematically formulated as: O7 = = = = = 5.5 P (clumping of a with its related context|R1 ) P (clumping of a with its related context|R2 ) P (clumping of a with its related context in d) ¯ P (clumping of a with its related context in d) P (occurrence of a with any related c ∈ Ca ∩ T C in d) ¯ P (occurrence of a with any related c ∈ Ca ∩ T C in d) P P (occurrence of a in d ∩ occurrence of c in d)Psim (a, c) P∀c∈Ca ∩T C ¯ sim (a, c) P (occurrence of a in d¯ ∩ occurrence of c in d)P P∀c∈Ca ∩T C P (a, d)P (c, d)Psim (a, c) P∀c∈Ca ∩T C ¯ ¯ ∀c∈Ca ∩T C P (a, d)P (c, d)Psim (a, c) Evaluations and Discussions In this evaluation, we studied the ability of our new probabilistic measure known as the Odds of Termhood (OT) in separating domain-relevant terms from general ones. We contrasted our new measure with three existing scoring and ranking schemes, namely, Contrastive Weight (CW), NCvalue (NCV) and Termhood (TH). The implementations of CW , N CV and T H are in accordance to Equation 5.1 and 5.2, 5.9, and 5.12 respectively. The evaluations of the four termhood measures were conducted in two parts: • Part 1: Qualitative evaluation through the analysis and discussion based on the frequency distribution and measures of dispersion, central tendency and correlation. 112 Chapter 5. Term Recognition • Part 2: Quantitative evaluation through the use of performance measures, namely, precision, recall, F-measure and accuracy, and the GENIA annotated text corpus as the gold standard G. In addition, the approach we chose to evaluate the termhood measures provides a way to automatically decide on a threshold for accepting and rejecting the ranked term candidates. For both parts of our evaluation, we employed a dataset containing a domain corpus describing the domain of molecular biology, and a contrastive corpus which spans across twelve different domains other than molecular biology. The datasets are described in Figure 5.1 in Section 5.2. Using the part-of-speech tags in the GENIA corpus, we extracted the maximal noun phrases as term candidates. Due to the large number of distinct words (over 400, 000 as mentioned in Section 5.2) in the GENIA corpus, we have extracted over 40, 000 lexically-distinct maximal noun phrases. The large number of distinct noun phrases is due to the absence of preprocessing to normalise the lexical variants of the same concept. For practical reasons, we randomly sampled the set of noun phrases for distinct term candidates. 
The resulting set T C contains 1, 954 term candidates. Following this, we performed the scoring and ranking procedure using the four measures included in this evaluation on the set of 1, 954 term candidates. 5.5.1 Qualitative Evaluation In the first part of the evaluation, we analysed the frequency distributions of the ranked term candidates generated by the four measures. Figures 5.7 and 5.8 show the frequency distributions of the candidates ranked in descending order according to the weights assigned by the four measures. The candidates are ranked in descending order according to their scores assigned by the respective measures. One can notice the interesting trends from the graphs by CW and N CV in Figures 5.8(b) and 5.8(a). The first half of the graph by CW , prior to the sudden surge of frequency, consists of only complex terms. Complex terms tend to have lower word counts compared to simple terms and hence, the disparity in the frequency distribution as shown in Figure 5.8(b). This is attributed to the biased treatment given to complex terms evident in Equation 5.2. However, priority is also given to complex terms by T H but as one can see from the distribution of candidates by T H, such undesirable trend does not occur. One of the explanation is the heavy reliance of frequency by CW while T H attempts to diversify the evidence in the computation of weights. While frequency may be a reliable source of evidence, the use of it alone is definitely inadequate [37]. As for N CV , Figure 5.8(a) reveals that scores are assigned to 5.5. Evaluations and Discussions 113 (a) Candidates ranked in descending order according to the scores assigned by OT . (b) Candidates ranked in descending order according to the scores assigned by T H. Figure 5.7: Distribution of the 1, 954 terms extracted from the domain corpus d sorted according to the corresponding scores provided by OT and T H. The single dark smooth line stretching from the left (highest value) to the right (lowest value) of the graph is the scores assigned by the respective measures. As for the two oscillating lines, the dark line is the domain frequencies while the light one is the contrastive frequencies. 114 Chapter 5. Term Recognition (a) Candidates ranked in descending order according to the scores assigned by N CV . (b) Candidates ranked in descending order according to the scores assigned by CW . Figure 5.8: Distribution of the 1, 954 terms extracted from the domain corpus d sorted according to the corresponding scores provided by N CV and CW . The single dark smooth line stretching from the left (highest value) to the right (lowest value) of the graph is the scores assigned by the respective measures. As for the two oscillating lines, the dark line is the domain frequencies while the light one is the contrastive frequencies. 5.5. Evaluations and Discussions 115 Figure 5.9: The means µ of the scores, standard deviations σ of the scores, sum of the domain frequencies and of the contrastive frequencies of all term candidates, and their ratio. Figure 5.10: The Spearman rank correlation coefficients ρ between all possible pairs of measure under evaluation. candidates by N CV based solely on the domain frequency. In other words, the measure N CV lacks the required contrastive analysis. As we have pointed out, terms can be ambiguous and we must not ignore the cross-domain distributional behaviour of terms. 
In addition, upon inspecting the actual list of ranked candidates, we noticed that higher scores were assigned to candidates which were accompanied by more context words. Another positive trait that T H exhibits is its ability to ¯ assign higher scores to terms which occur relatively more frequent in d and in d. This is evident through the gap between fd (dark oscillating line) and fd¯ (light oscillating line), especially at the beginning of the x-axis in Figure 5.7(b). One can notice that candidates along the end of the x-axis are those with fd¯ > fd . The same can be said about our new measure OT . However, the discriminating power of OT is apparently better since the gap between fd and fd¯ is larger and lasted longer. Figure 5.9 summarises the mean and standard deviation of the weights generated by the various measures. One can notice the extremely high dispersion from the mean of the scores generated by CW and N CV . We speculate that such trends are due to the erratic assignments of weights, heavily influenced by frequencies. In addition, we employed the Spearman rank correlation coefficient to study the possibility of any correlation between the four ranking schemes under evaluation. Figure 5.10 summarises the correlation coefficients between the various measures. Note that 116 Chapter 5. Term Recognition there is a relatively strong correlation between the ranks produced by our new probabilistic measure OT and the ranks by the ad-hoc measure T H. The correlation of T H with OT revealed the possibility of providing mathematical justifications for the former’s heuristically-motivated ad-hoc technique using a general probabilistic framework. 5.5.2 Quantitative Evaluation Figure 5.11: An example of a contingency table. The values in the cells T P , T N , F P and F N are employed to compute the precision, recall, Fα and accuracy. Note that |T C| is the total number of term candidates in the input set T C, and |T C| = TP + FP + FN + TN. In the second part of the evaluation, we employed the gold standard G generated from the GENIA corpus as discussed in Section 5.2 for evaluating our new term recognition technique using OT and three other existing ones (i.e. T H, N CV and CW ). We employed four measures [166] common to the field of information retrieval for performance comparison. These performance measures are precision, recall, Fα measure, and accuracy. These measures are computed by constructing a contingency table as shown in Figure 5.11: TP TP + FP TP recall = TP + FN (1 + α)(precision × recall) Fα = (α × precision) + recall TP + TN accuracy = TP + FP + FN + TN precision = where T P , T N , F P and F N are values from the four cells of the contingency table shown in Figure 5.11, and α is the weight for recall within the range (0, ∞). It suffices to know that as the α value increases, the weight of recall increases in the 5.5. Evaluations and Discussions 117 measure [204]. Two common α values are 0.5 and 2. F2 weighs recall twice as much as precision, and precision in F0.5 weighs two times more than recall. Recall and precision are evenly weighted in the traditional F1 measure. Before presenting the results for this second part of the evaluation, there are several points worth clarifying. Firstly, the gold standard G is a set of unordered collection of terms whose domain relevance has been established by experts. Secondly, as part of the evaluation, each term candidate a ∈ T C will be assigned a score by the respective termhood measures. 
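For reference, the four performance measures derived from the contingency table of Figure 5.11 can be stated compactly as below. This is a plain restatement of the standard formulas, with alpha as the recall weight; nothing beyond the definitions given above is assumed.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn, alpha=1.0):
    """F_alpha = (1 + alpha) * precision * recall / (alpha * precision + recall)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + alpha) * p * r / (alpha * p + r)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)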
These scores are used to rank the candidates in descending order, where larger scores correspond to higher ranks. As a result, there will be four new sets of ranked term candidates, each corresponding to a measure under evaluation. For example, TC_NCV is the output set after the scoring and ranking of the input term candidates TC by the measure NCV. The outputs from the measures TH, OT and CW are TC_TH, TC_OT and TC_CW, respectively. We remind the reader that |TC| = |TC_TH| = |TC_OT| = |TC_CW| = |TC_NCV|. The individual elements of the output sets appear as a_i, where a is the term candidate from TC and i is its rank.

Next, the challenge lies in how the resulting sets of ranked term candidates should be evaluated using the gold standard G. Generally, as in information retrieval, a binary classification is performed. In other words, we try to find a match in G for every a_i in TC_X, where X is any of the measures under evaluation. A positive match indicates that a_i is a term, while no match implies that a_i is a non-term. However, there is a problem with this approach. The elements (i.e. term candidates) in the four output sets are essentially the same. The difference between the four sets lies in the ranks of the term candidates, and not the candidates themselves. In other words, simply counting the matches between every TC_X and G will produce the same results (i.e. the same precision, recall, F-score and accuracy). Obviously, the ranks i assigned to the ranked term candidates a_i by the different termhood measures have a role to play. Following this, a cut-off point (i.e. threshold) for the ranked term candidates needs to be employed. To have an unbiased comparison of the four measures using the gold standard, we have to ensure that the cut-off rank for each termhood measure is optimal. Manually deciding on four different "magic numbers" for the four measures is a challenging and undesirable task.

To overcome the challenges involved in performing an unbiased comparative study of the four termhood measures as discussed above, we propose to examine their performance fairly, and possibly to decide on the optimal cut-off ranks, through rank binning. We briefly describe the rank binning process for each set TC_X:

• Decide on a standard size b for the bins;
• Create n = ⌈|TC_X|/b⌉ rank bins, where ⌈y⌉ is the ceiling of y;
• Assign each bin B_j^X a rank j, where 1 ≤ j ≤ ⌈|TC_X|/b⌉ and X is the identifier of the corresponding measure (i.e. NCV, CW, OT or TH). Bin B_1^X is considered a higher bin than B_2^X, and so on; and
• Distribute the ranked term candidates in set TC_X to their respective bins. Bin B_1^X holds the top b ranked term candidates from set TC_X. In general, bin B_j^X contains the top j × b ranked term candidates from TC_X, where 1 ≤ j ≤ ⌈|TC_X|/b⌉. Obviously, there is an exception for the last bin B_n^X, where n = ⌈|TC_X|/b⌉, if b is not a factor of |TC_X| (i.e. |TC_X| is not divisible by b). In such cases, the last bin B_n^X simply contains all the ranked term candidates in TC_X.

The results of binning the ranked term candidates produced by the four termhood measures using the input set of 1,954 term candidates are shown in Figure 5.12. We would like to point out that the choice of b has an effect on the performance indicators of each bin.
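Before turning to the choice of b, the following minimal Python sketch illustrates the binning and the per-bin contingency evaluation just described. It is not the thesis code: ranked_candidates, gold and the contingency_metrics helper sketched earlier are assumptions of ours, and the cut-off rule at the end is only one of the options discussed later in this section.

# A minimal sketch of rank binning: bin B_j holds the top j*b ranked candidates,
# and each bin is evaluated against the gold standard G via a contingency table.
import math

def rank_binning(ranked_candidates, gold, b=200, alpha=0.1):
    n_bins = math.ceil(len(ranked_candidates) / b)
    results = []
    for j in range(1, n_bins + 1):
        in_bin  = set(ranked_candidates[: j * b])   # the last bin holds all candidates
        out_bin = set(ranked_candidates[j * b:])
        tp = len(in_bin & gold)                     # predicted terms that are in G
        fp = len(in_bin - gold)                     # predicted terms not in G
        fn = len(out_bin & gold)                    # actual terms left outside the bin
        tn = len(out_bin - gold)
        results.append((j, contingency_metrics(tp, fp, fn, tn, alpha)))
    return results

# Hypothetical usage: pick the bin maximising a chosen F-score as the cut-off.
# bins = rank_binning(tc_ot, gold_standard, b=200, alpha=0.1)
# cutoff_bin, scores = max(bins, key=lambda item: item[1]["F0.1"])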
Setting the bin size too large may produce deceiving performance indicators that do not reflect the actual quality of the ranked term candidates. This occurs when an increasing number of ranked term candidates, which can be either actual terms or non-terms according to the gold standard, are mixed into the same bin. On the other hand, selecting bin sizes which are too small may defeat the purpose of collectively evaluating the ranked term candidates. Moreover, a large number of bins makes the interpretation of the results difficult. A rule of thumb is to select an appropriate bin size based on the size of TC and on a sensible number of bins for ensuring interpretable results. For example, setting b = 100 for a set of 500 term candidates is suitable since there are only 5 bins. However, using the same bin size on a set of 5,000 term candidates is inappropriate. In our case, to ensure the interpretability of the tables included in this chapter, we set the bin size to b = 200 for our 1,954 term candidates.

The results are organised into contingency tables as introduced in Figure 5.11. Each individual contingency table in Figure 5.12 contains four cells and is structured in the same way as Figure 5.11. Each individual table summarises the result obtained from the binary classification performed using the corresponding bin of term candidates against the gold standard G. Each measure has several contingency tables, where each table contains the values for determining the quality (i.e. relevance to the domain) of the term candidates that fall in the corresponding bin, as prescribed by the gold standard.

Figure 5.12: The collection of all contingency tables for all termhood measures X across all 10 bins B_j^X. The first column contains the rank of the bins and the second column shows the number of term candidates in each bin. The third general column, "termhood measures, X", holds the 10 contingency tables for each measure X, organised column-wise, bringing the total number of contingency tables to 40 (i.e. 10 bins, organised in rows, by 4 measures). The structure of the individual contingency tables follows the one shown in Figure 5.11. The last column contains the row-wise sums of TP + FP and FN + TN. The rows beginning from the second row until the second last are the rank bins. The last row contains the column-wise sums of TP + FN and FP + TN.

Using the values in the contingency tables in Figure 5.12, we computed the precision, recall, F-scores and accuracy of the four measures at the different bins. The performance results are summarised in Figure 5.13.

Figure 5.13: Performance indicators for the four termhood measures in the 10 respective bins. Each row shows the performance achieved by the four measures in a particular bin. The columns contain the performance indicators for the four measures. The notation pre stands for precision, rec for recall and acc for accuracy. We use two different α values, resulting in two F-scores, namely F0.1 and F1. The values of the performance measures with darker shades are the best-performing ones.

The accuracy indicators, acc, show the extent to which a termhood measure correctly predicted both terms (i.e. TP) and non-terms (i.e. TN). On the other hand, precision, pre, measures the extent to which the candidates predicted as terms (i.e. TP + FP) are actual terms (i.e. TP). The recall indicators, rec, capture the ability of the measures in correctly (i.e.
TP) identifying all the terms that exist in the set TC (i.e. TP + FN). As shown in the last bin (i.e. j = 10) of Figure 5.13, it is trivial to achieve a recall of 100% by simply binning all term candidates in TC into one large bin during evaluation. Recall alone is therefore not a good performance indicator; one also needs to take into consideration the number of non-terms mistakenly predicted as terms (i.e. FP), which is neglected in the computation of recall. Hence, we need to find a balance between recall and precision, which is aptly captured by the F1 score. To demonstrate an F-score which places more emphasis on precision, we also combine recall and precision to obtain the F0.1 score shown in Figure 5.13.

Before we begin discussing the quantitative results, we provide an example of how to interpret the performance indicators in Figure 5.13 to ensure absolute clarity. If we accept bin j = 8 produced by the termhood measure OT as the solution to term recognition, then 84.56% of the solution is precise, and 79.73% of the solution is accurate. In other words, 84.56% of the predicted terms in bin j = 8 are actual terms, and 79.73% of the term and non-term predictions are true, all according to the gold standard. The same approach is used to interpret the performance of all the other measures in all bins.

From Figure 5.13, we notice that the measures TH and OT are consistently better in terms of all the performance indicators compared to NCV and CW. Using any one of the 10 bins and any performance indicator for comparison, TH and OT offer the best performance. The close resemblance and consistency of the performance indicators of TH and OT support and objectively confirm the correlation between the two termhood measures suggested by the Spearman rank correlation coefficient in Section 5.5.1. OT and TH achieved the best precision in the first bin, at 98% and 98.5% respectively, obviously by sacrificing recall. The worst-performing termhood measure in terms of precision is NCV, with a maximum of only 76.87%. Since this precision of NCV lies in the last bin, its recall there also reaches the maximum of 100%. In fact, Figure 5.13 clearly shows that the maximum values of all performance indicators for NCV rest in the last bin, and that the precision values of NCV are erratically distributed across the bins. Ideally, a good termhood measure should attain its highest precision in the first bin, with the subsequent bins achieving decreasing precision. This is important to show that actual terms are assigned higher ranks by the termhood measure. This reaffirms our suggestion that the contrastive analysis present in OT, TH and CW is necessary for term recognition. The frequency distribution of terms ranked by NCV, shown in Figure 5.8(a) of Section 5.5.1, clearly illustrates the improperly ranked term candidates when only the frequencies from the domain corpus are considered. Generally, as we can observe from the precision and recall columns, "well-behaved" termhood measures usually have higher precision with lower recall in the first few bins. This is due to the more restrictive membership of the higher bins, where only highly ranked term candidates by the respective termhood measures are included.
The highest recall is achieved when there are no more false negatives, that is, when all term candidates in the set TC are included for scoring and ranking and all of them are simply predicted as terms. In our case, the highest recall is obviously 100%, as mentioned earlier, while the lowest precision is 76.87%. This relation between precision and recall is aptly captured by the F0.1 and F1 scores. The F0.1 scores begin at much higher values in the higher bins compared to the F1 scores. This is due to our emphasis on precision instead of recall, as justified earlier; since the higher bins have higher precision, F0.1 inevitably attains higher values. F0.1 becomes lower than F1 when precision falls below recall.

The cells with darker shades under each performance measure in Figure 5.13 indicate the maximum values for that measure. In other words, the termhood measure OT has its best accuracy of 81.47% at bin j = 9, and its maximum F0.1 score of 86.17% at bin j = 5. We can see that the highest F1 score and the best accuracy of OT are higher than those of TH. Assuming consistency of these results with other corpora, OT can be regarded as a measure which attempts to find a balance between precision and recall. If we weigh precision more, TH triumphs over OT based on their maximum F0.1 scores. Consequently, we can employ these maximum values of the different performance measures as highly flexible cut-off points for deciding which top n ranked term candidates are selected and considered as actual terms. These maximum values optimise the precision and the recall to ensure that the maximum number of actual terms is selected while minimising the inclusion of non-terms. This evaluation approach provides a solution to the problem discussed at the start of this chapter, which is also mentioned by Cabre-Castellvi et al. [37]: "all systems propose large lists of candidate terms, which at the end of the process have to be manually accepted or rejected." In addition, our proposed use of rank binning and of the maximum values of the various performance measures has allowed us to perform an unbiased comparison of all four termhood measures. In short, we have shown that:

• The new OT termhood measure can provide mathematical justifications for the heuristically-derived measure TH;
• The new OT termhood measure aims for a balance between precision and recall, and offers the most accurate solution to the requirements of term recognition compared to the other measures TH, NCV and CW; and
• The new OT termhood measure performs on par with the heuristically-derived measure TH, and the two are consistently the best-performing term recognition measures in terms of precision, recall, F-scores and accuracy compared to NCV and CW.

Some critics may simply disregard the results reported here as unimpressive and be inclined to compare them with results from other related but distinct disciplines such as named-entity recognition or document retrieval. However, one has to keep in mind several fundamental differences in regard to evaluations in term recognition. Firstly, unlike other established fields, term recognition is largely an unconsolidated research area which still lacks a common comparative platform [120]. As a result, individual techniques or systems are developed and tested with small datasets in highly specialised domains. According to Cabre-Castellvi et al. [37], "This lack of data makes it difficult to evaluate and compare them."
Secondly, we cannot emphasise enough the fact that term recognition is a subjective task in comparison to fields such as named-entity recognition and document retrieval. In its most primitive form, named-entity recognition, which is essentially a classification problem, can be performed deterministically through a finite set of rigid designators, resulting in near-human performance in common evaluation forums such as the Message Understanding Conference (MUC) [42]. While more subjective than named-entity recognition, the task of determining document relevance in document retrieval is guided by explicit user queries, with common evaluation platforms such as the Text Retrieval Conference (TREC) [264]. On the other hand, term recognition is based upon the elusive characteristics of terms. Moreover, the set of characteristics employed differs across a diverse range of term recognition techniques, and within each individual technique the characteristics may be subject to different implicit interpretations. The challenges of evaluating term recognition techniques become more obvious when one considers the survey by Cabre-Castellvi et al. [37], in which more than half of the systems reviewed remain unevaluated.

5.6 Conclusions

Term recognition is an important task for many natural language systems. Many techniques have been developed in an attempt to numerically determine or quantify termhood based on heuristically-motivated term characteristics. We have discussed several shortcomings shared by many existing techniques, such as the ad-hoc combination of termhood evidence, the mathematically unfounded derivation of scores, and implicit and possibly flawed assumptions concerning term characteristics. All these shortcomings lead to issues such as the non-decomposability and non-traceability of how the weights and scores are obtained. These issues raise the question of which term characteristics the different weights and scores are trying to embody, if any, and whether these individual weights or scores actually measure what they are supposed to capture. Termhood measures which cannot be traced or attributed to any term characteristics are fundamentally flawed.

In this chapter, we stated clearly the four main challenges in creating a formal and practical technique for measuring termhood. These challenges are (1) the formalisation of a general framework for consolidating evidence representing different term characteristics, (2) the formalisation of the various evidence representing the different term characteristics, (3) the explicit definition of term characteristics and their attribution to linguistic theories (if any) or other justifications, and (4) the automatic determination of optimal thresholds for selecting terms from the final lists of ranked term candidates. We addressed the first three challenges through a new probabilistically-derived measure called the Odds of Termhood (OT) for scoring and ranking term candidates for term recognition. The design of the measure begins with the derivation of a general probabilistic framework for integrating termhood evidence. Next, we introduced seven types of evidence, founded on formal models of word distribution, to facilitate the calculation of OT. The evidence captures the various characteristics of terms, which are either heuristically motivated or based on linguistic theories.
The fact that evidence can be added or removed makes OT a highly flexible framework that is adaptable to different applications' requirements and constraints. In fact, in the evaluation, we have shown a close correlation between our new measure OT and the ad-hoc measure TH. We believe that, by adjusting the inclusion or exclusion of the various evidence, other ad-hoc measures can be captured as well. Our two-part evaluation comparing OT with three other existing ad-hoc measures, namely CW, NCV and TH, has demonstrated the effectiveness of the new measure and the new framework. A qualitative evaluation studying the frequency distributions revealed the advantages of our new measure OT. A quantitative evaluation using the GENIA corpus as the gold standard and four performance measures further supported our claim that our new measure OT offers the best performance compared to the three existing ad-hoc measures. Our evaluation revealed that (1) the current evidence employed in OT can be seen as probabilistic realisations of the heuristically-derived measure TH, (2) OT offers a solution to the need for term recognition which is both accurate and balanced in terms of recall and precision, and (3) OT performs on par with the heuristically-derived measure TH, and the two are the best-performing term recognition measures in terms of precision, recall, F-scores and accuracy compared to NCV and CW. In addition, our approach of rank binning and the use of performance measures for deciding on optimal cut-off ranks addresses the fourth challenge.

5.7 Acknowledgement

This research was supported by the Australian Endeavour International Postgraduate Research Scholarship, the University Postgraduate Award (International Students) of the University of Western Australia, the 2008 UWA Research Grant, and the Curtin Chemical Engineering Inter-University Collaboration Fund. The authors would like to thank the anonymous reviewers for their invaluable comments.

5.8 Other Publications on this Topic

Wong, W., Liu, W. & Bennamoun, M. (2007) Determining Termhood for Learning Domain Ontologies using Domain Prevalence and Tendency. In the Proceedings of the 6th Australasian Conference on Data Mining (AusDM), Gold Coast, Australia. This paper describes a heuristic measure called TH for determining termhood based on explicitly defined term characteristics and the distributional behaviour of terms across different corpora. The ideas behind TH were later reformulated to give rise to the probabilistic measure OT. The description of OT forms the core contents of this Chapter 5.

Wong, W., Liu, W. & Bennamoun, M. (2007) Determining Termhood for Learning Domain Ontologies in a Probabilistic Framework. In the Proceedings of the 6th Australasian Conference on Data Mining (AusDM), Gold Coast, Australia. This paper describes the preliminary attempts at developing a probabilistic framework for consolidating termhood evidence based on explicitly defined term characteristics and formal word distribution models. This work was later extended to form the core contents of this Chapter 5.

Wong, W., Liu, W. & Bennamoun, M. (2008) Determination of Unithood and Termhood for Term Recognition. In M. Song and Y. Wu (eds.), Handbook of Research on Text and Web Mining Technologies, IGI Global. This book chapter combines the ideas on the UH measure and the TH measure from Chapters 4 and 5, respectively.
CHAPTER 6
Corpus Construction for Term Recognition

Abstract

The role of the Web in text corpus construction is becoming increasingly significant. However, its contribution is largely confined to the role of a general virtual corpus, or of poorly derived specialised corpora. In this chapter, we introduce a new technique for constructing specialised corpora from the Web based on the systematic analysis of website contents. Our evaluations show that the corpora constructed using our technique are independent of the search engines used, and that they outperform all corpora based on existing techniques for the task of term recognition. (This chapter was accepted with revision by Language Resources and Evaluation, 2009, under the title "Constructing Specialised Corpora through Domain Representativeness Analysis of Websites".)

6.1 Introduction

Broadly, a text corpus is any collection containing more than one text of a certain language. A general corpus is balanced with regard to the various types of information covered by the language of choice [173]. In contrast, the content of a specialised corpus, also known as a domain corpus, is biased towards a certain sub-language. For example, the British National Corpus (BNC) is a general corpus designed to represent modern British English. On the other hand, the specialised corpus GENIA contains solely texts from the molecular biology domain. Several connotations associated with text corpora, such as size, representativeness, balance and sampling, are the main topics of ongoing debate within the field of corpus linguistics. In reality, great manual effort is required to construct and maintain text corpora that satisfy these connotations. Although these curated corpora do play a significant role, several related inadequacies, such as the inability to incorporate frequent changes, the rarity of traditional corpora for certain domains, and limited corpus size, have hampered the development of corpus-driven applications in knowledge discovery and information extraction.

The increasingly accessible, diverse and inexpensive information on the World Wide Web (the Web) has attracted the attention of researchers in search of alternatives to the manual construction of corpora. Despite issues such as poor reproducibility of results, noise, duplicates and sampling, many researchers [40, 129, 16, 228, 74] agree that the vastness and diversity of the Web remain the most promising solution to the increasing need for very large corpora. Current work on using the Web for linguistic purposes can be broadly grouped into (1) the Web itself as a corpus, also known as a virtual corpus [97], and (2) the Web as a source of data for constructing locally-accessible corpora, known as Web-derived corpora. The contents of a virtual corpus are distributed over heterogeneous servers, and accessed using URLs and search engines. It is not difficult to see that these two types of corpora are not mutually exclusive, and that a Web-derived corpus can easily be constructed, albeit with some downloading time, using the URLs from the corresponding virtual corpus. The choice between the two types of corpora then becomes a question of trade-off between effort and control. On the one hand, applications which require stable counts and complete access to the texts for processing and analysis can opt for Web-derived corpora.
On the other hand, in applications where speed and corpus size supersede any other concerns, a virtual corpus alone suffices. The current state of the art mainly focuses on the construction of Web-derived corpora, ranging from the simple query-and-download approach using search engines [15] to the more ambitious custom Web crawlers for very large collections [155, 205]. BootCat [15] is a widely-used toolkit for constructing specialised Web-derived corpora. It employs a naive technique of downloading the webpages returned by search engines without further analysis. [228] extended the use of BootCat to construct a large general Web-derived corpus using 500 seed terms. This technique requires a large number of seed terms (in the order of hundreds) to produce very large Web-derived corpora, and the composition of the corpora varies depending on the search engines used. Instead of relying on search engines and seed terms, [155] constructed a very large general Web-derived corpus by crawling the Web using seed URLs. In this approach, the lack of control and the absence of further analysis cause topic drift as the crawler traverses further away from the seeds.

A closer look at the advances in this area reveals the lack of systematic analysis of website contents during corpus construction. Current techniques simply allow the search engines to dictate which webpages are suitable for the domain based solely on matching seed terms. Others allow their Web crawlers to run astray without systematic controls. We propose a technique, called Specialised Corpora Construction based on Web Texts Analysis (SPARTAN), to automatically analyse the contents of websites for discovering domain-specific texts with which to construct very large specialised corpora. (The foundation work on corpus construction using Web data appeared in the Proceedings of the 21st Australasian Joint Conference on Artificial Intelligence (AI), Auckland, New Zealand, 2008, under the title "Constructing Web Corpora through Topical Web Partitioning for Term Recognition".) The first part of our technique analyses the domain representativeness of websites for discovering specialised virtual corpora. The second part of the technique selectively localises the distributed contents of the websites in the virtual corpora to create specialised Web-derived corpora. This technique can also be employed to construct BNC-style balanced corpora through stratified random sampling from a balanced mixture of domain-categorised Web texts. In our experiments, we show that, unlike BootCat-derived corpora, which vary greatly across different search engines, our technique is independent of the search engine employed. Instead of blindly using the results returned by search engines, our systematic analysis allows the most suitable websites and their contents to surface and to contribute to the specialised corpora. This systematic analysis significantly improves the quality of our specialised corpora as compared to BootCat-based corpora and the naive Seed-Restricted Querying (SREQ) of the Web. This is verified using the term recognition task.
In short, the theses of this chapter are as follows: 1) Web-derived corpora are simply localised versions of the corresponding virtual corpora; 2) the often-mentioned problems of using search engines for corpus construction are in fact a revelation of the inadequacies of current techniques; 3) the use of websites, instead of webpages, as the basic units of analysis during corpus construction is more suitable for constructing very large corpora; and 4) the results provided by search engines cannot be directly accepted for constructing specialised corpora; the systematic analysis of website contents is fundamental to constructing high-quality corpora.

The main contributions of this chapter are (1) a technique for constructing very large, high-quality corpora using only a small number of seed terms, (2) the use of systematic content analysis for re-ranking websites based on their domain representativeness, which allows the corpora to be search engine independent, and (3) processes for extending user-provided seed terms and for localising domain-relevant contents. This chapter is structured as follows. In Section 6.2, we summarise current work on corpus construction. In Section 6.3, we outline our specialised corpora construction technique. In Section 6.4, we evaluate the specialised corpora constructed using our technique in the context of term recognition. We end this chapter with an outlook on future work in Section 6.5.

6.2 Related Research

The process of constructing corpora using data from the Web generally comprises webpage sourcing and relevant text identification, which are discussed in Sections 6.2.1 and 6.2.2, respectively. In Section 6.2.3, we outline several studies demonstrating the significance of search engine counts in natural language applications despite their inconsistencies.

6.2.1 Webpage Sourcing

Currently, there are two main approaches to sourcing webpages for constructing Web-derived corpora, namely, using seed terms as query strings for search engines [15, 74], and using seed URLs to guide custom crawlers [155, 207]. The first approach is popular among current corpus construction practices due to the toolkit known as BootCat [17]. BootCat requires several seed terms as input, and formulates queries as conjunctions of randomly selected seeds for submission to the Google search engine. The method then gathers the webpages listed in Google's search results to create a specialised corpus. There are several shortcomings related to the construction of large corpora using this technique:

• First, different search engines employ different algorithms and criteria for determining webpage relevance with respect to a certain query string. Since this technique simply downloads the top webpages returned by a search engine, the composition of the resulting corpora will vary greatly across different search engines for reasons beyond our knowledge and control. It is worth noting that webpages highly ranked by the different search engines may not have the necessary coverage of the domain terminology for constructing high-quality corpora. For example, the ranking by the Google search engine is primarily a popularity contest [116]. In the words of [228], "...results are ordered...using page-rank considerations".

• Second, the aim of creating very large Web-derived corpora using this technique may be far from realistic. Most major search engines have restrictions on the number of URLs served for each search query.
For instance, the AJAX Search API provided by Google returns a very low 32 search results for each query. (Google's Web search interface serves up to 1,000 results; however, automated crawling and scraping of that page for URLs results in the blocking of the offending IP addresses. The SOAP API by Google, which allows up to 1,000 queries per day, will be permanently phased out by August 2009.) The developers of BootCat [15] suggested that 5 to 15 seed terms are typically sufficient in many cases. Assuming each URL provides us with a valid readable page, 20 seed terms and their resulting 1,140 three-word combinations would produce a specialised corpus of only 1,140 × 32 = 36,480 webpages. Since the combinations are supposed to represent the same domain, duplicates will most likely occur when all search results are aggregated [228]. A 10% duplicate and download error rate for every search query reduces the corpus size to 32,832 webpages. For example, in order to produce a small corpus of only 40,000 webpages using BootCat, [228] had to prepare a startling 500 seed terms.

• Third, to overcome issues related to inadequate seed terms for creating very large corpora, BootCat uses terms extracted from the initial corpus to incrementally extend the corpus. [15] suggested using a reference corpus to automatically identify domain-relevant terms. However, this approach does not work well since the simple frequency-based techniques used by BootCat are known for their low to mediocre performance in identifying domain terms [279]. Without the use of control mechanisms and more precise techniques to recognise terms, this iterative feedback approach will cause topic drift in the final specialised corpora. Moreover, the idea of creating corpora by relying on other existing corpora is not very appealing.

In a similar approach, [74] used the most frequent words in the BNC, together with Microsoft's Live Search instead of the BootCat-preferred Google, to construct a very large BNC-like corpus from the Web. Fletcher provided the reasons behind his choice of Live Search, which include a generous query allowance, higher-quality search results, and greater responsiveness to changes on the Web.

The approach of gathering webpages using custom crawlers based on seed URLs is gaining wider acceptance as criticism of the use of search engines intensifies. Issues with the use of search engines, such as unknown algorithms for sorting search results [128] and restrictions on the amount of data that can be obtained [18], have become targets of critics in recent years. Some of the current work based on custom crawlers includes a general corpus of 10 billion words downloaded from the Web based on seed URLs from dmoz.org by [155]. Similarly, Renouf et al. [205] developed a Web crawler for finding a large subset of random texts from the Web using seed URLs from human experts and dmoz.org as part of the WebCorp project (www.webcorp.org.uk). Ravichandran et al. [203] demonstrated the use of a randomised algorithm to generate noun similarity lists from very large corpora. The authors used URLs from dmoz.org as seed links to guide their crawlers in downloading 70 million webpages. After boilerplate and duplicate removal, their corpus was reduced to approximately 31 million documents. Rather than sampling URLs from online directories, Baroni & Ueyama [18] used search engines to obtain webpage URLs for seeding their custom crawlers.
The authors used combinations of frequent Italian words to query Google, retrieving a maximum of 10 pages per query. The resulting 5,231 URLs were used to seed breadth-first crawling to obtain a final 4-million-document Italian corpus. The approach of custom crawling is not without its shortcomings. It is typically based on the assumption that webpages of one domain tend to link to others in the same domain. It is obvious that reliance on this assumption alone, without explicit control, will result in topic drift. Moreover, most authors do not provide explicit statements addressing important issues such as the selection policy (e.g. when to stop the crawl, where to crawl next) and the politeness policy (e.g. respecting the robot exclusion standard, how to handle webmasters disgruntled by the extra bandwidth). This trend of using custom crawlers calls for careful planning and justification. Issues such as cost-benefit analysis, hardware and software requirements, and sustainability in the long run have to be considered. Moreover, poorly-implemented crawlers are a nuisance on the Web, consuming bandwidth and clogging networks at the expense of others [248]. In fact, the worry about unknown ranking and data restrictions by search engines [155, 128, 228] exposes the inadequacies of the existing techniques for constructing Web-derived corpora (e.g. BootCat). These so-called 'shortcomings' of search engines are merely mismatches in expectations. Linguists expect white-box algorithms and unrestricted data access, something we know we will never get. Obviously, these two issues do place certain obstacles in our quest for very large corpora, but should we totally avoid search engines given their integral role on the Web? If so, would we risk missing the forest just for these few trees? The quick alternative, which is infesting the Web with more crawlers, poses even greater challenges. Rather than reinventing the wheel, we should think of how existing corpus construction techniques can be improved using the already available, large search engine repositories.

6.2.2 Relevant Text Identification

The process of identifying relevant texts, which usually comprises webpage filtering and content extraction, is an important step after the sourcing of webpages. A filtering phase is fundamental to identifying relevant texts since not all webpages returned by search engines or custom Web crawlers are suitable for specialised corpora. This phase, however, is often absent from most existing techniques such as BootCat. The commonly used techniques include some kind of richness or density measure with thresholds. For instance, [125] constructed domain corpora by collecting the top 100 webpages returned by search engines for each seed term. As a way of refining the corpora, webpages containing only a small number of user-provided seed terms are excluded. [4] proposed a knowledge-richness estimator that takes into account semantic relations to support the construction of Web-derived corpora. Webpages containing both the seed terms and the desired relations are considered better candidates for inclusion in the corpus. The candidate documents are ranked and manually filtered based on several term and relation richness measures. In addition to webpage filtering, content extraction (i.e. boilerplate removal) is necessary to remove HTML tags and boilerplates (e.g. texts used in navigation bars, headers and disclaimers).
HTMLCleaner by [88] is a boilerplate remover based on the heuristics that the content-rich sections of webpages have longer sentences, a lower number of links, and more function words than the boilerplates. [67] developed a boilerplate stripper called NCLEANER based on two character-level n-gram models. A text segment is considered a boilerplate and discarded if the 'dirty' model (based on texts to be cleaned) assigns it a higher probability than the 'clean' model (based on training data).

6.2.3 Variability of Search Engine Counts

Unstable page counts have always been one of the main complaints of critics who are against the use of search engines for language processing. Much work has been conducted to discredit the use of search engines by demonstrating the arbitrariness of page counts. The fact remains that page counts are merely estimations [148]. We are not here to argue otherwise. However, for natural language applications that deal mainly with relative frequencies, ratios and rankings, these variations have been shown to be insignificant. [181] conducted a study on using page counts to estimate n-gram frequencies for noun compound bracketing. They showed that the variability of page counts over time and across search engines does not significantly affect the results of their task. [140] examined the use of page counts for several NLP tasks such as spelling correction, compound bracketing, adjective ordering and prepositional phrase attachment. The authors concluded that, for the majority of the tasks conducted, simple and unsupervised techniques perform better when n-gram frequencies are obtained from the Web. This is in line with the study by [252], which showed that a simple algorithm relying on page counts outperforms a complex method trained on a smaller corpus for synonym detection. [124] used search engines to estimate frequencies for predicate-argument bigrams. They demonstrated the high correlations between search engine page counts and frequencies obtained from balanced, carefully edited corpora such as the BNC. Similarly, experiments by [26] showed that search engine page counts were reliable over a six-month period, and highly consistent with those reported by several manually-curated corpora including the Brown Corpus [78]. In short, we can safely conclude that page counts from search engines are far from accurate and stable [148]. Moreover, due to the inherent differences in their relevance ranking and index sizes, the page counts provided by different search engines are not comparable. Nevertheless, adequate studies have shown that n-gram frequency estimations obtained from search engines do work well for a certain class of applications. As such, one can either make good use of what is available, or stop harping on the primitive issue of unstable page counts. The key question now is not whether search engine counts are stable or otherwise, but rather, how they are used.

6.3 Analysis of Website Contents for Corpus Construction

It is apparent from our discussion in Section 6.2 that the current techniques for constructing corpora from the Web using search engines can be greatly improved. In this section, we address the question of how corpus construction can benefit from the current large search engine indexes despite several inherent mismatches in expectations. Due to the restrictions imposed by search engines, we only have access to a limited number of webpage URLs [128].
As such, the common BootCat technique of downloading the 'off-the-shelf' webpages returned by search engines to construct corpora is not the best approach since (1) the number of webpages provided is inadequate, and (2) not all contents are appropriate for a domain corpus [18]. Moreover, the authoritativeness of webpages has to be taken into consideration to eliminate low-quality contents from questionable sources. Taking these problems into consideration, we have developed a Probabilistic Site Selector (PROSE) to re-rank and filter the websites returned by search engines in order to construct virtual corpora. We discuss this analysis mechanism in detail in Sections 6.3.1 and 6.3.2. In addition, Section 6.3.3 outlines the Seed Term Expansion Process (STEP), the Selective Localisation Process (SLOP), and the Heuristic-based Cleaning Utility for Web Texts (HERCULES) designed to construct Web-derived corpora from virtual corpora, addressing the need of certain natural language applications to access local texts. An overview of the proposed technique is shown in Figure 6.1.

Figure 6.1: A diagram summarising our Web partitioning technique.

A summary of the three phases in SPARTAN is as follows:

Input
– A set of seed terms, W = {w_1, w_2, ..., w_n}.

Phase 1: Website Preparation
– Gather the top 1,000 webpages returned by search engines containing the seed terms. Search engines such as Yahoo will serve the first 1,000 pages when accessed using the provided API.
– Generalise the webpages to obtain a set of website URLs, J.

Phase 2: Website Filtering
– Obtain estimates of the inlinks, the number of webpages in each website, and the number of webpages in each website containing the seed terms.
– Analyse the domain representativeness of the websites in J using PROSE.
– Select websites with good domain representativeness to form a new set J′. These sites constitute our virtual corpora.

Phase 3: Website Content Localisation
– Obtain a set of expanded seed terms, W_X, using Wikipedia through the STEP module.
– Selectively download contents from the websites in J′ based on the expanded seed terms W_X using the SLOP module.
– Extract relevant contents from the downloaded webpages using HERCULES.

Output
– A specialised virtual corpus consisting of website URLs with high domain representativeness.
– A specialised Web-derived corpus consisting of domain-relevant contents downloaded from the websites in the virtual corpus.

6.3.1 Website Preparation

During this initial preparation phase, a set of candidate websites to represent the domain of interest, D, is generated. Methods such as random walks and random IP address generation have been suggested for obtaining random samples of webpages [104, 192]. Such random sampling methods may work well for constructing general or topic-diverse corpora from the Web if conducted under careful scrutiny. For our specialised corpora, we employ purposive sampling instead to seek items (i.e. websites) belonging to a specific, predefined group (i.e. domain D). Since there is no direct way of deciding whether a website belongs to domain D, a set of seed terms W = {w_1, w_2, ..., w_n} is employed as the determining factor. Next, we submit queries to the search engines for webpages containing the conjunction of the seed terms in W. The set of webpage URLs, which contains the purposive samples that we require, is returned as the result.
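As a minimal, hedged sketch of this sourcing step (not SPARTAN's actual code), the following Python fragment formulates the conjunction query and gathers the sample of webpage URLs. The helper search_engine_urls is a hypothetical wrapper around whichever search API is available; the use of the AND keyword is likewise an assumption about query syntax.

# A minimal sketch of webpage sourcing with a conjunction of seed terms.
def conjunction_query(seed_terms):
    # e.g. ['transcription factor', 'blood cell'] ->
    #      '"transcription factor" AND "blood cell"'
    return " AND ".join(f'"{w}"' for w in seed_terms)

def gather_sample_urls(seed_terms, search_engine_urls, limit=1000):
    query = conjunction_query(seed_terms)
    # Most engines only serve about the first 1,000 results for a query.
    return search_engine_urls(query, max_results=limit)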
At the moment, only webpages in the form of HTML files or plain text files are accepted. Since most search engines only serve the first 1,000 documents, the size of our sample is no larger than 1,000. We then process the webpage URLs to obtain the corresponding domain names of the websites. In other words, only the segment of the URL from the scheme (e.g. http://) up to the authority segment of the hierarchical part is considered for further processing. For example, in the URL http://web.csse.uwa.edu.au/research/areas/, only the segment http://web.csse.uwa.edu.au/ is applicable. This collection of distinct websites (i.e. collections of webpages), represented using the notation J, will be subjected to re-ranking and filtering in the next phase.

We have selected websites as the basic unit of analysis, instead of the typical webpages, for two main reasons. Firstly, websites are collections of related webpages belonging to the same theme. This allows us to construct a much larger corpus using the same number of units. For instance, assume that a search engine returns 1,000 distinct webpages belonging to 300 distinct websites. In this example, we can construct a corpus comprising at most 1,000 documents using the webpage as a unit. However, using the website as a unit, we would be able to derive a much larger 90,000-document corpus, assuming an average of 300 webpages per website. Secondly, the fine granularity and volatility of individual webpages make the analysis and maintenance of the corpus difficult. It has been accepted [3, 14, 188] that webpages disappear at a rate of 0.25 to 0.5% per week [73]. Considering this figure, virtual corpora based on webpage URLs are extremely unstable and require constant monitoring, as pointed out by Kilgarriff [127], to replace offline sources. Virtual corpora based on websites as units are far less volatile. This is especially true if the virtual corpora are composed of highly authoritative websites.

6.3.2 Website Filtering

In this section, we describe our probabilistic website selector, PROSE, for measuring and determining the domain representativeness of the candidate websites in J. The domain representativeness of a website is determined by PROSE based on the following criteria introduced by [277]:

• The extent to which the vocabulary covered by a website is inclined towards domain D;
• The extent to which the vocabulary of a website is specific to domain D; and
• The authoritativeness of a website with respect to domain D.

The websites from J which satisfy these criteria are considered sites with good domain representativeness, denoted as the set J′. The selected sites in J′ form our virtual corpus. In the next three subsections, we discuss in detail the notations involved, the means of quantifying the three criteria for measuring domain representativeness, and the ways of automatically determining the selection thresholds.

Notations

Each site u_i ∈ J has three pieces of important information, namely, an authority rank r_i, the number of webpages containing the conjunction of the seed terms in W, n_{w_i}, and the total number of webpages, n_{Ω_i}. The authority rank r_i is obtained by ranking the candidate sites in J according to their number of inlinks (i.e. a low numerical value indicates a high rank). The inlinks to a website can be obtained using the "link:" operator in certain search engines (e.g. Google, Yahoo).
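Anticipating the odds and thresholds that are formalised in the remainder of this subsection (Equations 6.1 to 6.13), the following minimal Python sketch shows one way the website generalisation, the per-site statistics and the PROSE selection could be wired together. It is not the thesis implementation: page_count and inlink_rank are hypothetical wrappers around a search engine's "site:" and "link:" style queries, the mean-based thresholds are only one of the options described below, and degenerate counts (e.g. a probability of exactly 1) are not handled.

# A minimal sketch of PROSE: generalise webpage URLs to websites, gather the
# three pieces of per-site information, and keep sites whose OD exceeds OD_T.
from urllib.parse import urlsplit

def site_of(url):
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"     # scheme + authority only

def odds(p):
    return p / (1.0 - p)                           # assumes 0 < p < 1 here

def prose_filter(webpage_urls, seed_query, page_count, inlink_rank):
    sites = sorted({site_of(u) for u in webpage_urls})                 # the set J
    n_w   = {u: page_count(f"{seed_query} site:{u}") for u in sites}   # n_wi
    n_all = {u: page_count(f"site:{u}") for u in sites}                # n_Omega_i
    rank  = inlink_rank(sites)                     # dict: site -> authority rank r_i
    total_w = sum(n_w.values())                    # n_w over the whole of J
    H = sum(1.0 / k for k in range(1, len(sites) + 1))   # harmonic number H_|J|

    od = {}
    for u in sites:
        oc  = odds(n_w[u] / total_w)               # vocabulary coverage (Eq. 6.2-6.3)
        osp = odds(n_w[u] / n_all[u])              # vocabulary specificity (Eq. 6.4-6.5)
        oa  = odds(1.0 / (rank[u] * H))            # authoritativeness (Eq. 6.6-6.8)
        od[u] = oc * osp * oa                      # OD(u), Eq. 6.1

    # Mean-based thresholds: both mean P_C and mean P_A reduce to 1/|J|.
    mean_ps = sum(n_w[u] / n_all[u] for u in sites) / len(sites)
    od_t = odds(1.0 / len(sites)) * odds(mean_ps) * odds(1.0 / len(sites))
    return [u for u in sites if od[u] > od_t]      # the virtual corpus J'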
As for the second (i.e. n_{w_i}) and the third (i.e. n_{Ω_i}) piece of information, additional queries using the "site:" operator need to be performed. The total number of webpages in site u_i ∈ J can be estimated by restricting the search (i.e. site search) as "site:u_i". The number of webpages in site u_i containing W can be obtained using the query "w site:u_i", where w is the conjunction of the seeds in W using the AND operator. Figure 6.2 shows the distribution of webpages within the sites in J. Each rectangle represents the collection of all webpages of a site in J. Each rectangle is further divided into the collection of webpages containing the seed terms W, and the collection of webpages not containing W. The size of the collection of webpages of site u_i that contain W is n_{w_i}. Using the total number of webpages of the i-th site, n_{Ω_i}, we estimate the number of webpages in the same site not containing W as n_{\bar{w}_i} = n_{Ω_i} − n_{w_i}. With the page counts n_{w_i} and n_{Ω_i}, we can obtain the total page count for webpages not containing W in J as

n_{\bar{w}} = N − n_w = \sum_{u_i \in J} n_{Ω_i} − \sum_{u_i \in J} n_{w_i} = \sum_{u_i \in J} (n_{Ω_i} − n_{w_i})

where N is the total number of webpages in J, and n_w is the total number of webpages in J which contain W (i.e. the area within the circle in Figure 6.2).

Figure 6.2: An illustration of the sample space on which the probabilities employed by the filter are based. The space within the dot-filled circle consists of all webpages from all sites in J containing W. The m rectangles represent the collections of all webpages of the respective sites {u_1, ..., u_m}. The shaded but not dot-filled portion of the space consists of all webpages from all sites in J that do not contain W. The individual shaded but not dot-filled portion within each rectangle is the collection of webpages in the respective site u_i ∈ J that do not contain W.

Probabilistic Site Selector

A site's domain representativeness is assessed based on three criteria, namely, vocabulary coverage, vocabulary specificity and authoritativeness. Assuming independence, the odds in favour of a site's ability to represent a domain, defined as the Odds of Domain Representativeness (OD), is measured as the product of the odds of realising each individual criterion:

OD(u) = OC(u) OS(u) OA(u)    (6.1)

where OC is the Odds of Vocabulary Coverage, OS is the Odds of Vocabulary Specificity, and OA is the Odds of Authoritativeness. OC quantifies the extent to which site u is able to cover the vocabulary of the domain represented by W, while OS captures the chances of the vocabulary of website u being specific to the domain represented by W. On the other hand, OA measures the chances of u being an authoritative website with respect to the domain represented by W. Next, we define the probabilities that make up these three odds.

• Odds of Vocabulary Coverage: Intuitively, the more webpages from site u_i contain W in comparison with other sites, the likelier it is that u_i has a good coverage of the vocabulary of the domain represented by W. As such, this factor requires a cross-site analysis of page counts. Let the sample space, set Y, be the collection of all webpages from all sites in J that contain W. This space is the area within the circle in Figure 6.2 and its size is |Y| = n_w. Following this, let Z be the set of all webpages in site u_i (i.e. any one rectangle in Figure 6.2), with size |Z| = n_{Ω_i}.
Subscribing to the frequency interpretation of probability, we compute the probability of encountering a webpage from site u_i among all webpages from all sites in J that contain W as:

P_C(n_{w_i}) = P(Z|Y) = \frac{P(Z \cap Y)}{P(Y)} = \frac{n_{w_i}}{n_w}    (6.2)

where |Z ∩ Y| = n_{w_i} is the number of webpages from site u_i containing W. We compute OC as:

OC(u_i) = \frac{P_C(n_{w_i})}{1 − P_C(n_{w_i})}    (6.3)

• Odds of Vocabulary Specificity: This odds acts as an offset for sites which have a high coverage of vocabulary across many different domains (i.e. whose vocabulary is not specific to a particular domain). This helps us to identify overly general sites, especially those encyclopaedic in nature which provide background knowledge across a broad range of disciplines. The vocabulary specificity of a site can be estimated using the variation of the page count of W with respect to the total page count of that site. Within a single site with a fixed total page count, an increase in the number of webpages containing W implies a decrease in the number of pages not containing W. In such cases, a larger portion of the site would be dedicated to discussing W and the domain represented by W. Intuitively, such a phenomenon would indicate a narrowing of the scope of word usage, and hence an increase in the specificity of the vocabulary. As such, the examination of the specificity of the vocabulary is confined within a single site, and hence is defined over the collection of all webpages within that site. Let Z be the set of all webpages in site u_i and V be the set of all webpages in site u_i that contain W. Following this, the probability of encountering a webpage that contains W in site u_i is defined as:

P_S(n_{w_i}) = P(V|Z) = \frac{P(V \cap Z)}{P(Z)} = \frac{n_{w_i}}{n_{Ω_i}}    (6.4)

where |V ∩ Z| = |V| = n_{w_i}. We compute OS as:

OS(u_i) = \frac{P_S(n_{w_i})}{1 − P_S(n_{w_i})}    (6.5)

• Odds of Authoritativeness: We first define a distribution for computing the probability that website u_i is authoritative with respect to W. It has been demonstrated that the various indicators of a website's authority, such as the number of inlinks, the number of outlinks and the frequency of visits, follow Zipf's ranked distribution [2]. As such, the probability that site u_i with authority rank r_i (i.e. a rank based on the number of inlinks to site u_i) is authoritative with respect to W can be defined using the probability mass function:

P_A(r_i) = P(r_i; |J|) = \frac{1}{r_i H_{|J|}}    (6.6)

where |J| is the number of websites under consideration, and H_{|J|} is the |J|-th generalised harmonic number (with exponent s = 1), computed as:

H_{|J|} = \sum_{k=1}^{|J|} \frac{1}{k}    (6.7)

We then compute OA as:

OA(u_i) = \frac{P_A(r_i)}{1 − P_A(r_i)}    (6.8)

Selection Thresholds

In order to select websites with good domain representativeness, a threshold for OD is derived automatically as a combination of the individual thresholds related to OC, OS and OA:

OD_T = OA_T OC_T OS_T    (6.9)

Depending on the desired output, these individual thresholds can be determined using one of the three options associated with each probability mass function. All sites u_i ∈ J whose odds OD(u_i) exceed OD_T are considered suitable candidates for representing the domain. These selected sites, denoted as the set J′, constitute our virtual corpus. We now go through the details of deriving the thresholds for the individual odds.

• Firstly, the threshold for OC is defined as:

OC_T = \frac{τ_C}{1 − τ_C}    (6.10)

where τ_C can be either \bar{P}_C, P_C^{max} or P_C^{min}.
The mean of the distribution is given by:

\bar{P}_C = \frac{1}{|J|} \sum_{u_i \in J} \frac{n_{w_i}}{n_w} = \frac{1}{|J|} \times \frac{n_w}{n_w} = \frac{1}{|J|}

while the highest and lowest probabilities are defined as:

P_C^{max} = \max_{u_i \in J} P_C(n_{w_i}), \qquad P_C^{min} = \min_{u_i \in J} P_C(n_{w_i})

where \max_{u_i \in J} P_C(n_{w_i}) returns the maximum value of the function P_C(n_{w_i}) as n_{w_i} ranges over the page counts of all websites u_i in J.

• Secondly, the threshold for OS is given by:

OS_T = \frac{τ_S}{1 − τ_S}    (6.11)

where τ_S can be either \bar{P}_S, P_S^{max} or P_S^{min}:

\bar{P}_S = \frac{\sum_{u_i \in J} P_S(n_{w_i})}{|J|}, \qquad P_S^{max} = \max_{u_i \in J} P_S(n_{w_i}), \qquad P_S^{min} = \min_{u_i \in J} P_S(n_{w_i})

Note that \bar{P}_S ≠ 1/|J|, since the sum of P_S(u_i) over all u_i ∈ J is not equal to 1.

• Thirdly, the threshold for OA is defined as:

OA_T = \frac{τ_A}{1 − τ_A}    (6.12)

where τ_A can be either \bar{P}_A, P_A^{max} or P_A^{min}. The expected value of the random variable X for the Zipfian distribution is defined as:

\bar{X} = \frac{H_{N,s−1}}{H_{N,s}}

and since s = 1 in our distribution of authority ranks, the expected value of the variable r can be obtained through:

\bar{r} = \frac{|J|}{H_{|J|}}

Using \bar{r}, we have \bar{P}_A as:

\bar{P}_A = \frac{1}{\bar{r} H_{|J|}} = \frac{1}{|J|}

The highest and lowest probabilities are given by:

P_A^{max} = \max_{u_i \in J} P_A(r_i), \qquad P_A^{min} = \min_{u_i \in J} P_A(r_i)    (6.13)

where \max_{u_i \in J} P_A(r_i) returns the maximum value of the function P_A(r_i) as r_i ranges over the authority ranks of all websites u_i in J.

6.3.3 Website Content Localisation

This content localisation phase is designed to construct Web-derived corpora using the virtual corpora created in the previous phase. The three main processes in this phase are seed term expansion (STEP), selective content downloading (SLOP), and content extraction (HERCULES). STEP uses the categorical organisation of Wikipedia topics to discover related terms to complement the user-provided seed terms. Under each Wikipedia category, there is typically a listing of subordinate topics. For instance, there is a category called "Category:Blood cells" which corresponds to the "blood cell" seed term. STEP begins by finding the category page "Category:w" on Wikipedia which corresponds to each w ∈ W (line 3 in Algorithm 2). Under the category page "Category:Blood cells" is a listing of the various types of blood cells, such as leukocytes, red blood cells, reticulocytes, etc. STEP relies on regular expressions to scrape the category page to obtain these related terms (line 4 in Algorithm 2). The related topics in the category pages are typically structured using the <li> tag. It is important to note that not all topics listed under a Wikipedia category adhere strictly to the hypernym-hyponym relation. Nevertheless, the terms obtained through such means are highly related to the encompassing category since they are determined by human contributors. These related terms can be relatively large in number. As such, we employ the Normalised Web Distance (NWD) [276], a generalised version of the Normalised Google Distance (NGD) by [50], to select the m most related ones (lines 6 and 8 in Algorithm 2). Algorithm 2 summarises STEP. The existing set of seed terms W = {w_1, w_2, ..., w_n} is expanded to become W_X = {W_1 = {w_1, ...}, W_2 = {w_2, ...}, ..., W_n = {w_n, ...}} through this process.
Algorithm 2 STEP(W, m)
1: initialise WX
2: for each wi ∈ W do
3:   page := getcategorypage(wi)
4:   relatedtopics := scrapepage(page)
5:   for each a ∈ relatedtopics do
6:     sim := NWD(a, wi)
7:   recall the m most related topics (a1, ..., am)
8:   Wi := {wi, a1, ..., am}
9:   add Wi to the set WX
10: return WX

SLOP then uses the expanded seed terms WX to selectively download the contents of the websites in J′. Firstly, all possible pairs of seed terms are obtained for every combination of sets Wi and Wj from WX:

C = {(x, y) | x ∈ Wi ∈ WX ∧ y ∈ Wj ∈ WX ∧ i < j ≤ |WX|}

Using the seed term pairs in C, SLOP localises the webpages of all websites in J′. For every site u ∈ J′, all pairs (x, y) in C are used to construct queries of the form q = "x" "y" site:u. These queries are then submitted to search engines to obtain the URLs of webpages from each site that contain the seed terms. This ensures that only relevant pages from a website are downloaded, and prevents the localising of boilerplate pages such as "about us", "disclaimer", "contact us", "home", "faq", etc., whose contents are not suitable for the specialised corpora. Currently, only HTML and plain text pages are considered. Using these URLs, SLOP downloads the corresponding webpages to a local repository.

The final step of content localisation makes use of HERCULES to extract contents from the downloaded webpages. HERCULES is based on the following sequence of heuristics: 1) all relevant texts are located within the <body> tag; 2) the contribution of invisible elements and formatting tags to determining the relevance of texts is insignificant; 3) the segmentation of relevant texts, typically into paragraphs, is defined by structural tags such as <br>, <p>, <span>, <div>, etc.; 4) the length of sentences in relevant texts is typically higher; 5) the concentration of function words in relevant texts is higher [88]; 6) the concentration of certain non-alphanumeric characters such as "|", "-", "." and "," in irrelevant texts is higher; 7) other common observations, such as the capitalisation of the first character of sentences and the termination of sentences by punctuation marks, also hold.

HERCULES begins the process by detecting the presence of the <body> and </body> tags, and extracting the contents between them. If no <body> tag is present, the complete HTML source code is used. Next, HERCULES removes all invisible elements (e.g. comments, javascript code) and all tags without contents (e.g. images, applets). Formatting tags such as <b>, <i>, <center>, etc. are also discarded. Structural tags are then used to break the remaining texts in the page into segments. The length of each segment relative to all other segments is determined. In addition, the ratio of function words and of certain non-alphanumeric characters (i.e. "|", "-", ".", ",") to the number of words in each segment is measured. The ratios related to non-alphanumeric characters are particularly useful for further removing boilerplate such as Disclaimer | Contact Us | ..., or the reference sections of academic papers, where the concentration of such characters is higher than normal. Using these indicators, HERCULES removes segments which do not satisfy heuristics 4) to 7). The remaining segments are aggregated and returned as contents.
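To make heuristics 4) to 7) concrete, the following is a minimal Python sketch of the segment filtering step. It is an illustration only, not the HERCULES implementation: the function-word list, threshold values and helper names (keep_segment, extract_content) are hypothetical placeholders.

```python
import re

# A tiny, hypothetical function-word list; a real filter would use a fuller stop list.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "for", "it"}
NOISY_CHARS = set("|-.,")

def keep_segment(segment, min_words=10, min_func_ratio=0.15, max_noise_ratio=0.2):
    """Return True if a text segment looks like running prose (cf. heuristics 4-7)."""
    words = segment.split()
    if len(words) < min_words:                  # heuristic 4 (simplified here to an absolute minimum)
        return False
    func_ratio = sum(w.lower() in FUNCTION_WORDS for w in words) / len(words)
    if func_ratio < min_func_ratio:             # heuristic 5: prose is rich in function words
        return False
    noise_ratio = sum(segment.count(c) for c in NOISY_CHARS) / len(words)
    if noise_ratio > max_noise_ratio:           # heuristic 6: boilerplate is rich in | - . ,
        return False
    sentences = [s.strip() for s in re.split(r"[.!?]", segment) if s.strip()]
    capitalised = sum(s[0].isupper() for s in sentences)
    return capitalised >= len(sentences) / 2    # heuristic 7: sentences begin with capitals

def extract_content(segments):
    """Aggregate the segments that pass the heuristics."""
    return "\n".join(s for s in segments if keep_segment(s))
```

In practice the thresholds would be tuned relative to the other segments of the same page rather than fixed globally, as described above.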
6.4 Evaluations and Discussions

In this section, we discuss the results of three experiments conducted to assess different aspects of our technique.

6.4.1 The Impact of Search Engine Variations on Virtual Corpus Construction

We conducted a three-part experiment to study the impact of the choice of search engine on the resulting virtual corpus. In this experiment, we first examine the extent of correlation between the websites as ranked by the different search engines. We then study whether or not the websites re-ranked using PROSE achieve a higher level of correlation. A high correlation between the websites re-ranked by PROSE would suggest that the composition of the virtual corpora remains relatively stable regardless of the choice of search engine.

We performed a scaled-down version of the virtual corpus construction procedure outlined in Sections 6.3.1 and 6.3.2. For this experiment, we employed the three major search engines, namely, Yahoo, Google and Live Search (by Microsoft), and their APIs for constructing virtual corpora. We chose the seed terms "transcription factor" and "blood cell" to represent the domain of molecular biology D1, while the reliability engineering domain D2 is represented using the seed terms "risk management" and "process safety". For each domain D1 and D2, we gathered the first 1,000 webpage URLs from the three search engines. We then processed the URLs to obtain the corresponding websites' addresses. The sets of websites obtained for domain D1 using Google, Yahoo and Live Search are denoted as J1G, J1Y and J1M, respectively. The same notation applies for domain D2. Next, these websites were assigned ranks based on their corresponding webpages' order of relevance as determined by the respective search engines. We refer to these ranks as native ranks. If a site has multiple webpages included in the search results, the highest rank prevails. This ranking information is kept for use in the later parts of this experiment. Figure 6.3 summarises the number of websites obtained from each search engine for each domain.

Figure 6.3: A summary of the number of websites returned by the respective search engines for each of the two domains. The number of common sites is also provided.

In the first part of this experiment, we sorted the 77 common websites for D1, denoted as J1C = J1G ∩ J1Y ∩ J1M, and the 103 in J2C = J2G ∩ J2Y ∩ J2M, using their native ranks (i.e. the ranks generated by the search engines). We then determined their Spearman's rank correlation coefficients. The native columns in Figures 6.4(a) and 6.4(b) show the correlations between the websites as sorted by different pairs of search engines. The correlation between websites based on native rank is moderate, ranging between 0.45 and 0.54. This extent of correlation does not come as a surprise. In fact, this result supports our implicit knowledge that different search engines rank the same webpages differently. Given the same query, the same webpage will inevitably be assigned distinct ranks due to the inherent differences in index size and in the ranking algorithms themselves. For this reason, the ranks generated by search engines (i.e. native ranks) do not necessarily reflect the domain representativeness of the webpages.
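For reference, the native-rank correlations reported above can be reproduced with a standard Spearman computation over the common sites. The sketch below is purely an illustration of the measurement (using scipy, with hypothetical inputs), not the code used in this experiment.

```python
from scipy.stats import spearmanr

def native_rank_correlation(ranking_a, ranking_b):
    """Spearman correlation between two engines' native rankings of their common sites.

    Each argument is a list of website addresses ordered by native rank (best first).
    """
    common = [site for site in ranking_a if site in set(ranking_b)]
    positions_a = [ranking_a.index(site) for site in common]
    positions_b = [ranking_b.index(site) for site in common]
    rho, _ = spearmanr(positions_a, positions_b)
    return rho

# Example with toy rankings from two hypothetical engines:
# native_rank_correlation(["a.org", "b.com", "c.net"], ["b.com", "a.org", "c.net"])
```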
In the second part of the experiment, we re-rank the websites in J{1,2}C using PROSE. For simplicity, we only employ the coverage and specificity criteria for determining the domain representativeness of websites, in the form of the odds of domain representativeness (OD). The information required by PROSE, namely, the number of webpages containing W, nwi, and the total number of webpages, nΩi, is obtained from the respective search engines. In other words, the OD of each website is estimated three times, each time using the nwi and nΩi obtained from a different search engine. The three variants of estimation are later translated into ranks for re-ordering the websites. Given the varying nature of page counts across different search engines, as discussed in Section 6.2.3, many would expect that re-ranking the websites using metrics based on such information would yield even worse correlations. On the contrary, the significant increases in correlation between the websites after re-ranking by PROSE, shown in the PROSE columns of Figures 6.4(a) and 6.4(b), demonstrate otherwise.

Figure 6.4: A summary of the Spearman's correlation coefficients between websites before and after re-ranking by PROSE: (a) the molecular biology domain; (b) the reliability engineering domain. The native columns show the correlations between the websites when sorted according to their native ranks provided by the respective search engines.

We now discuss the reasons behind this interesting finding. As mentioned before, search engine indexes vary greatly. For instance, based on page counts by Google, we have a 15,900/23,800,000 = 0.000668 probability of encountering a webpage from the site www.pubmedcentral.nih.gov that contains the bi-gram "blood cell". (This page count and all subsequent page counts derived from Google and Yahoo were obtained on 2 April 2009.) However, Yahoo provides us with a higher estimate at 0.001440. This is not because Yahoo is more accurate than Google or vice versa; they are just different. We have discussed this in detail in Section 6.2.3. This reaffirms that estimations using different search engines are by themselves not comparable. Consider the next example n-gram, "gasoline". Google and Yahoo provide the estimates 0.000046 and 0.000093 for the same site, respectively. Again, they are very different from one another. While the estimations are inconsistent (i.e. Google and Yahoo offer different page counts for the same n-grams), the conclusion is the same, namely, one has a better chance of encountering a page in www.pubmedcentral.nih.gov that contains "blood cell". In other words, estimations based on search engine counts have significance only in relation to something else (i.e. relativity). This is exactly how PROSE works. PROSE determines a site's OD based entirely on its contents, and OD is computed using search engine counts. Even though the analysis of the same site using different search engines eventually produces different ODs, the object of the study, namely, the content of the site, remains constant. In this sense, the only variable in the analysis by PROSE is the search engine count. Since the ODs generated by PROSE are used for inter-comparing the websites in J{1,2}C (i.e. ranking), the numerical differences introduced through the variable page counts of the different search engines become insignificant.
Ultimately, the same site analysed by PROSE using unstable page counts from different search engines can still achieve the same rank.

In the third part of this experiment, we examine the general 'quality' of the websites ranked by PROSE using information provided by the different search engines. As we elaborated in Section 6.3.2, PROSE measures the odds in favour of the websites' authority, vocabulary coverage and specificity. Websites with low OD can be considered poor representers of the domain. The ranking of sites by PROSE using information from Google consistently resulted in the largest number of websites with OD less than −6: about 70.13% in domain D1 and 34.95% in domain D2 are considered poor representers. On the other hand, the sites ranked using information from Yahoo and Live Search have relatively higher OD.

Figure 6.5: The number of sites with OD less than −6 after re-ranking using PROSE based on the page count information provided by the respective search engines.

To explain this trend, let us consider the seed terms {"transcription factor", "blood cell"}. According to Google, there are 23,800,000 webpages in www.pubmedcentral.nih.gov and, out of that number, 1,180 contain both seed terms. As for Yahoo, it indexes far fewer webpages from the same site (9,051,487) but offers approximately the same page count for the seed terms (1,060). This trend is consistent when we examine the page count for the non-related n-gram "vehicle" from the same site: Google and Yahoo report approximately the same page counts of 24,900 and 20,100, respectively. There are a few possibilities. Firstly, the remaining 23,800,000 − 9,051,487 = 14,748,513 webpages indexed by Google really do not contain the n-grams, or secondly, Google overestimates the overall figure of 23,800,000. The second possibility becomes more evident as we look at the page counts of other search engines. (Other commonly-used search engines such as AltaVista and AlltheWeb were not cited for comparison since they use the same search index as Yahoo's.) Live Search reports a total page count of 61,400 for the same site, with 1,460 webpages containing the seed terms {"transcription factor", "blood cell"}. Ask.com, with a much larger site index at 15,600,000, has 914 pages containing the seed terms. The index sizes of all these other search engines are much smaller than Google's, and yet they provide approximately the same number of pages containing the seed terms. Our finding is consistent with the recent report by Uyar [254], which concluded that the page counts provided by Google are usually higher than the estimates of other search engines. Due to such inflated figures, when we take the relative frequency of n-grams using Google's page counts, the significance of domain-relevant n-grams is greatly undermined. The seed terms (i.e. "transcription factor", "blood cell") achieve a much lower probability of 1,180/23,800,000 = 0.000049 when assessed using Google's page counts, compared to the probability of 1,060/9,051,487 = 0.000117 using Yahoo's. This explains the devaluation of domain-relevant seed terms when assessed by PROSE using information from Google, which leads to the falling OD of websites. In short, Live Search and Yahoo are comparatively better search engines for the task of measuring OD by PROSE. However, the index size of Live Search is undesirably small, a problem also noted by other researchers such as [74].
Moreover, the search facility using the "site:" operator is occasionally turned off by Microsoft, and it sometimes offers illogical estimates. While this problem is present in all search engines, it is particularly evident in Live Search when site search is used. For instance, there are about 61,400 pages from www.pubmedcentral.nih.gov indexed by Live Search, and yet Live Search reports that there are 159,000 pages in that site which contain the n-gram "transcription factor". For these reasons, we preferred the balance between index size and 'honesty' in page counts offered by Yahoo.

6.4.2 The Evaluation of HERCULES

We conducted a simple evaluation of our content extraction utility HERCULES using the Cleaneval development set (http://cleaneval.sigwac.org.uk/devset.html). Due to some implementation difficulties, the scoring program provided by Cleaneval could not be used for this evaluation. Instead, we employed a text comparison module written in Perl (Text::Compare, http://search.cpan.org/~stro/Text-Compare-1.03/lib/Text/Compare.pm). The module, based on the vector-space model, is used for comparing the contents of the texts cleaned by HERCULES with the gold standard provided by Cleaneval. The module uses a rudimentary stop list to filter out common words, and the cosine similarity measure is then employed to compute text similarity. The texts cleaned by HERCULES achieved a 0.8919 similarity with the gold standard, with a standard deviation of 0.0832. The relatively small standard deviation shows that HERCULES is able to consistently extract contents that meet the standard of human curators. We have made available an online demo of HERCULES at http://explorer.csse.uwa.edu.au/research/algorithm_hercules.pl (a slow response time is possible when the server is under heavy load).

6.4.3 The Performance of Term Recognition using SPARTAN-based Corpora

In this section, we evaluate the quality of the corpora constructed using SPARTAN in the context of term recognition for the domain of molecular biology. We compared the performance of term recognition using several specialised corpora, namely:
• SPARTAN-based corpora
• the manually-crafted GENIA corpus [130]
• a BootCat-derived corpus
• seed-restricted querying of the Web (SREQ), as a virtual corpus

We employed the gold standard reference provided with the GENIA corpus for evaluating term recognition. We used the same set of seed terms W = {"human", "blood cell", "transcription factor"} for various purposes throughout this evaluation. The reason behind our choice of seed terms is simple: these are the same seed terms used for the construction of GENIA, which is our gold standard.

BootCat-Derived Corpus

We downloaded and employed the BootCat toolkit (http://sslmit.unibo.it/~baroni/bootcat.html), with its new support for the Yahoo API, to construct a BootCat-derived corpus using the same set of seed terms W = {"human", "blood cell", "transcription factor"}. For the reasons discussed in Section 6.2.1, BootCat is not able to construct a large corpus using only three seed terms. The default settings of 3 terms per tuple and 10 randomly selected tuples for querying cannot be applied in our case. Moreover, we could not perceive any benefit in randomly selecting terms for constructing tuples. As such, we generated all possible combinations of all possible lengths in this experiment.
In other words, we have three 1-tuples, three 2-tuples, and one 3-tuple for use. While this move may appear redundant, since all webpages which contain the 3-tuple will also contain the 2-tuples, we can never be sure that the same webpages will be returned as results by the search engines. In addition, we altered a default setting in the BootCat script collect_urls_from_yahoo.pl which restricted our access to only the first 100 results for each query. Using the seven seed term combinations and the altered Perl script, we obtained 3,431 webpage URLs for downloading. We then employed the BootCat script retrieve_and_clean_pages_from_url_list.pl to download and clean the webpages, resulting in a final corpus of N = 3,174 documents with F = 7,641,018 tokens.

SPARTAN-Based Corpora and SREQ

We first constructed a virtual corpus using SPARTAN and the seed terms W. Yahoo was selected as our search engine of choice for this experiment for the reasons outlined in Section 6.4.1. We employed the API provided by Yahoo (more information on Yahoo Search, including API key registration, is available at http://developer.yahoo.com/search/web/V1/webSearch.html). All requests to Yahoo are sent to the server process http://search.yahooapis.com/WebSearchService/V1/webSearch?/ with query strings formatted as appid=APIKEY&query=SEEDTERMS&results=100. Additional options such as start=START are applied to enable SPARTAN to obtain results beyond the first 100 webpages. This service by Yahoo is limited to 5,000 queries per IP address per day; however, the implementation of this rule is actually quite lenient.

In the first phase of SPARTAN, we obtained 176 distinct websites from the first 1,000 webpages returned by Yahoo using the conjunction of the three seed terms. For the second phase of SPARTAN, we selected the average values, as described in Section 6.3.2, for all three thresholds, namely, τC, τS and τA, to derive our selection cut-off point ODT. The selection process using PROSE left us with a reduced set of 43 sites. The virtual corpus thus contains about N = 84,963,524 documents (i.e. webpages) distributed over 43 websites. In this evaluation, we refer to this virtual corpus as SPARTAN-V, where the letter V stands for virtual. We have made available an online query tool for SPARTAN-V at http://explorer.csse.uwa.edu.au/research/data_virtualcorpus.pl (a slow response time is possible when the server is under heavy load). Figure 6.6 shows the websites included in the virtual corpus for this evaluation.

Figure 6.6: A listing of the 43 sites included in SPARTAN-V.

We then extended the virtual corpus during the third phase of SPARTAN to construct a Web-derived corpus. We selected the three most related topics for each seed term in W during seed term expansion by STEP. The seed term "human" has no corresponding category page on Wikipedia and hence cannot be expanded. The set of expanded seed terms is WX = {{"human"}, {"blood cell", "erythropoiesis", "reticulocyte", "haematopoiesis"}, {"transcription factor", "CREB", "C-Fos", "E2F"}}. Using WX, SLOP gathered 80,633 webpage URLs for downloading. A total of 76,876 pages were actually downloaded, while the remaining 3,743 could not be reached for reasons such as connection errors. Finally, HERCULES is used to extract contents from the downloaded pages for constructing the Web-derived corpus. About 15% of the webpages were discarded by HERCULES due to the absence of proper contents. The final Web-derived corpus, denoted as SPARTAN-L (the letter L refers to local), is composed of N = 64,578 documents with F = 118,790,478 tokens. We have made available an online query tool for SPARTAN-L at http://explorer.csse.uwa.edu.au/research/data_localcorpus.pl (a slow response time is possible when the server is under heavy load). It is worth pointing out that, using SPARTAN and the same number of seed terms, we can easily construct a corpus that is at least 20 times larger than a BootCat-derived corpus.
Many researchers have found good use for page counts in a wide range of NLP applications, using search engines as gateways to the Web (i.e. as a general virtual corpus). In order to justify the need for content analysis during the construction of virtual corpora by SPARTAN, we included the use of guided search engine queries as a form of specialised virtual corpus during term recognition. We refer to this virtual corpus as SREQ, the seed-restricted querying of the Web. Quite simply, we append the conjunction of the seed terms W to every query made to the search engines. In a sense, we can consider SREQ as the portion of the Web which contains the seed terms W. For instance, the normal approach for obtaining the general page count (i.e. the number of pages on the Web) for "TNF beta" is to submit the n-gram as a query to any search engine. Using Yahoo, the general virtual corpus has 56,400 documents containing "TNF beta". In SREQ, the conjunction of the seeds in W is appended to "TNF beta", resulting in the query q = "TNF beta" "transcription factor" "blood cell" "human". Using this query, Yahoo provides us with 218 webpages, while the conjunction of the seed terms alone results in the page count N = 149,000. We can consider the latter as the size of SREQ (i.e. the total number of documents in SREQ), and the former as the number of documents in SREQ which contain the term "TNF beta".

GENIA Corpus and the Preparations for Term Recognition

In this section, we evaluate the performance of term recognition using the different corpora described above. Terms are content-bearing words which are unambiguous, highly specific and relevant to a certain domain of interest. Most existing term recognition techniques identify terms from among the candidates through some scoring and ranking mechanism. The performance of term recognition is heavily dependent on the quality and the coverage of the text corpora. Therefore, we find it appropriate to use this task to judge the adequacy and applicability of both SPARTAN-V and SPARTAN-L in real-world applications. The term candidates and the gold standard employed in this evaluation come with the GENIA corpus [130]. The term candidates were extracted from the GENIA corpus based on the readily-available part-of-speech and semantic mark-up. A gold standard, denoted as the set G, was constructed by extracting the terms which have semantic descriptors enclosed by cons tags. For practicality reasons, we randomly selected 1,300 term candidates for evaluation, denoted as T. We manually inspected the list of candidates and compared them against the gold standard. Out of the 1,300 candidates, 121 are non-terms (i.e. misses) while the remaining 1,179 are domain-relevant terms (i.e. hits).
Instead of relying on complex measures, we used a simple, unsupervised technique based solely on the cross-domain distributional behaviour of words for term recognition. Our intention is to observe the extent to which the quality of the corpora contributes to term recognition, without being obscured by the complexity of state-of-the-art techniques. We employed relative frequencies to determine whether a word (i.e. a term candidate) is a domain-relevant term or otherwise. The idea is simple: if a word is encountered more often in a specialised corpus than in the contrastive corpus, then the word is considered relevant to the domain represented by the former. As such, this technique places even more emphasis on the coverage and adequacy of the corpora to achieve good term recognition performance. For the contrastive corpus, we prepared a collection comprising texts from a broad range of domains other than our domain of interest, which is molecular biology. Figure 6.7 summarises the composition of the contrastive corpus.

Figure 6.7: The number of documents and tokens from the local and virtual corpora used in this evaluation.

The term recognition procedure is performed as follows. Firstly, we take note of the total number of tokens F in each local corpus (i.e. BootCat, GENIA, SPARTAN-L, contrastive corpus). For the two virtual corpora, namely SPARTAN-V and SREQ, the total page count (i.e. the total number of documents) N is used instead. Secondly, the word frequency ft for each candidate t ∈ T is obtained from each local corpus. We use page counts (i.e. document frequencies) nt as substitutes for the virtual corpora. Thirdly, the relative frequency pt for each t ∈ T is calculated as either ft/F or nt/N, depending on the corpus type (i.e. virtual or local). Fourthly, we evaluate the performance of term recognition using these relative frequencies. Note that when comparing local corpora (i.e. BootCat, GENIA, SPARTAN-L) with the contrastive corpus, the pt based on word frequency is used. The pt based on document frequency is used for comparing the virtual corpora (i.e. SPARTAN-V, SREQ) with the contrastive corpus. If the pt from a specialised corpus (i.e. BootCat, GENIA, SPARTAN-L, SPARTAN-V, SREQ), denoted as dt, is larger than or equal to the pt from the contrastive corpus, ct, then the candidate t is classified as a term. The candidate t is classified as a non-term if dt < ct. An assessment function, described in Algorithm 3, is employed to grade the decisions achieved using the various specialised corpora.

Algorithm 3 assessBinaryClassification(t, dt, ct, G)
1: initialise decision
2: if dt > ct ∧ t ∈ G then
3:   decision := "true positive"
4: else if dt > ct ∧ t ∉ G then
5:   decision := "false positive"
6: else if dt < ct ∧ t ∈ G then
7:   decision := "false negative"
8: else if dt < ct ∧ t ∉ G then
9:   decision := "true negative"
10: return decision

Term Recognition Results

Contingency tables are constructed using the numbers of false positives and negatives, and true positives and negatives, obtained from Algorithm 3. Figure 6.8 summarises the errors introduced during the classification process for term recognition using the different specialised corpora.

Figure 6.8: The contingency tables summarising the term recognition results using the various specialised corpora: (a) GENIA; (b) SPARTAN-V; (c) SPARTAN-L; (d) SREQ; (e) BootCat.

We then computed the precision, accuracy, F1 and F0.5 scores using the values in the contingency tables. Figure 6.9 summarises the performance metrics for term recognition using the different corpora.

Figure 6.9: A summary of the performance metrics for term recognition.
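As a companion to Algorithm 3, the sketch below (a hypothetical Python helper, not the evaluation code used here) tallies the four possible outcomes over a list of candidates and derives the precision, recall, accuracy and F-scores of the kind reported in Figure 6.9; the ≥ comparison follows the textual description of the classification rule given above.

```python
def evaluate_term_recognition(candidates, gold, d_freq, c_freq):
    """Grade the relative-frequency classification and compute the reported metrics.

    candidates -- the list of term candidates T
    gold       -- the set of gold-standard terms G
    d_freq     -- dict mapping a candidate t to its relative frequency d_t
                  in the specialised corpus
    c_freq     -- dict mapping a candidate t to its relative frequency c_t
                  in the contrastive corpus
    """
    tp = fp = fn = tn = 0
    for t in candidates:
        is_term = d_freq[t] >= c_freq[t]        # classified as a domain-relevant term
        if is_term and t in gold:
            tp += 1
        elif is_term:
            fp += 1
        elif t in gold:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)

    def f_score(beta):
        return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

    return {"precision": precision, "recall": recall, "accuracy": accuracy,
            "F1": f_score(1.0), "F0.5": f_score(0.5)}
```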
Firstly, in the context of local corpora, Figure 6.9 shows that SPARTAN-L achieved better performance than BootCat. While SPARTAN-L is merely 2.5% more precise than BootCat, the latter recorded the worst recall, at 65.06%, among all the corpora included in the evaluation. The poor recall of BootCat is due to its high false negative rate. In other words, true terms are not classified as terms using BootCat because of its low-quality composition (e.g. poor coverage and specificity). Many domain-relevant terms in the vocabulary of molecular biology are not covered by the BootCat-derived corpus. Despite being 19 times larger than GENIA, the F1 score of the BootCat-derived corpus is far from ideal. The SPARTAN-L corpus, which is 295 times larger than GENIA in terms of token size, has the closest performance to the gold standard at F1 = 92.87%. Assuming that size does matter, we speculate that a specialised Web-derived corpus at least 419 times larger than GENIA (using linear extrapolation) would be required to match the latter's high vocabulary coverage and specificity and achieve a 100% F1 score. At the moment, this conjecture remains to be tested. Given their inferior performance and effortless setup, BootCat-derived corpora can only serve as baselines in the task of term recognition using specialised Web-derived corpora.

Secondly, in the context of virtual corpora, term recognition using SPARTAN-V achieved the best performance across all metrics, with a 99.56% precision, even outperforming the local version SPARTAN-L. An interesting point here is that the other virtual corpus, SREQ, achieved a good result with precision and recall close to 90%, despite the relative ease of setting up the apparatus required for guided search engine querying. For this reason, we regard SREQ as the baseline for comparing the use of specialised virtual corpora in term recognition. In our opinion, a 9% improvement in precision justifies the additional systematic analysis of website content performed by SPARTAN for creating a virtual corpus. From our experience, the analysis of 200 websites generally requires, on average and ceteris paribus, 1 to 1.5 hours of processing time using the Yahoo API on a standard 1 GHz computer with a 256 Mbps Internet connection. The ad-hoc use of search engines for accessing the general virtual corpus may work for many NLP tasks. However, the relatively poor performance of SREQ here justifies the need for more systematic techniques such as SPARTAN when the Web is used as a specialised corpus for tasks such as term recognition.

Thirdly, comparing virtual and local corpora, only SPARTAN-V scored a recall above 90%, at 96.44%. Upon localising, the recall of SPARTAN-L dropped to 89.40%. This further confirms that term recognition requires large corpora with high vocabulary coverage, and that the SPARTAN technique has the ability to systematically construct virtual corpora with the required coverage. It is also interesting to note that a large 118 million token local corpus (i.e. SPARTAN-L) matches the recall of a 149,000 document virtual corpus (i.e. SREQ).
However, due to the heterogeneous nature of the Web and the inadequacy of simple seed term restriction, SREQ scored 6% less than SPARTAN-L in precision. This concurs with our earlier conclusion that ad-hoc querying, as in SREQ, is not the optimal way of using the Web as a specialised virtual corpus. Even the considerably smaller BootCat-derived corpus achieved a 4% higher precision than SREQ. This shows that size and coverage (there are 46 times more documents in SREQ than in BootCat) contribute only to recall, which explains SREQ's 24% better recall than BootCat. Due to SREQ's lack of vocabulary specificity, it recorded the lowest precision at 90.44%.

Overall, certain tasks indeed benefit from larger corpora, obviously when meticulously constructed. More specifically, tasks which do not require local access to the texts in the corpora, such as term recognition, may well benefit from the considerably larger and distributed nature of virtual corpora. This is evident from the fact that the SPARTAN-based corpus fared 3−7% worse across all metrics upon localising (i.e. SPARTAN-L). Furthermore, the very close F1 scores achieved by the worst performing virtual corpus (i.e. the baseline SREQ) and the best performing local corpus (SPARTAN-L) show that virtual corpora may indeed be more suitable for the task of term recognition. We speculate that several reasons are at play, including the ever-evolving vocabulary on the Web, and the sheer size of that vocabulary, which even Web-derived corpora cannot match.

In short, in the context of term recognition, the two most important factors which determine the adequacy of the constructed corpora are coverage and specificity. On the one hand, larger corpora, even when conceived in an ad-hoc manner, can potentially lead to higher coverage, which in turn contributes significantly to recall. On the other hand, the extra effort spent on systematic analysis leads to a more specific vocabulary, which in turn contributes to precision. Most existing techniques lack focus on one or both factors, leading to poorly constructed and inadequate virtual corpora and Web-derived corpora. For instance, BootCat has difficulty in practically constructing very large corpora, while ad-hoc techniques such as SREQ lack systematic analysis, which results in poor specificity. From our evaluation, only SPARTAN-V achieved a balanced F1 score exceeding 95%. In other words, the virtual corpora constructed using SPARTAN are both adequately large, with high coverage, and specific enough in vocabulary to achieve highly desirable term recognition performance.

We can construct much larger specialised corpora using SPARTAN by adjusting certain thresholds. We can adjust τC, τS and τA to allow more websites to be included in the virtual corpora, and we can also permit more related terms to be included as extended seed terms during STEP. This allows more webpages to be downloaded to create even larger Web-derived corpora. This is possible since the maximum number of pages derivable from the 43 websites is 84,963,524, as shown in Figure 6.7, whereas during the localisation phase only 64,578 webpages, a mere 0.07% of the total, were actually downloaded. In other words, the SPARTAN technique is highly customisable, creating both small and very large virtual and Web-derived corpora using only a few thresholds.
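As an illustration of this customisability, the following minimal sketch (an assumption-laden reading of Equations (6.9) to (6.13), not the thesis implementation) shows how choosing τC, τS and τA as the mean, maximum or minimum of their respective probability values yields the cut-off OD_T; looser choices admit more websites into the virtual corpus.

```python
def selection_cutoff(p_c, p_s, p_a, option="mean"):
    """Derive OD_T as the product of OC_T, OS_T and OA_T.

    p_c, p_s, p_a -- lists of the P_C, P_S and P_A values over all sites in J
    option        -- 'mean', 'max' or 'min'; the chosen statistic is used as
                     tau_C, tau_S and tau_A respectively
    """
    pick = {"mean": lambda p: sum(p) / len(p), "max": max, "min": min}[option]

    def odds(tau):
        return tau / (1.0 - tau)

    return odds(pick(p_c)) * odds(pick(p_s)) * odds(pick(p_a))

# Lowering the cut-off (e.g. option="min") lets more sites exceed OD_T and
# therefore enlarges the resulting virtual corpus.
```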
6.5 Conclusions

The sheer volume of textual data available on the Web, the ubiquitous coverage of topics, and the growth of content have become the catalysts promoting a wider acceptance of the Web for corpus construction in various applications of knowledge discovery and information extraction. Despite the extensive use of the Web as a general virtual corpus, very few studies have focused on the systematic analysis of website contents for constructing specialised corpora from the Web. Existing techniques such as BootCat simply pass the responsibility of deciding on suitable webpages to the search engines. Others allow their Web crawlers to run astray (subsequently resulting in topic drift) without systematic controls while downloading webpages for corpus construction. In the face of these inadequacies, we introduced a novel technique called SPARTAN which places emphasis on the analysis of the domain representativeness of websites for constructing virtual corpora. The technique also provides the means to extend the virtual corpora in a systematic way to construct specialised Web-derived corpora with high vocabulary coverage and specificity.

Overall, we have shown that SPARTAN is independent of the search engine used during corpus construction. SPARTAN re-ranks the websites provided by search engines based on their domain representativeness, allowing those with the highest vocabulary coverage, specificity and authority to surface. The systematic analysis performed by SPARTAN is adequately justified by the fact that term recognition using SPARTAN-based corpora achieved the best precision and recall in comparison to all other corpora based on existing techniques. Moreover, our evaluation showed that only the virtual corpora constructed using SPARTAN are both adequately large, with high coverage, and specific enough in vocabulary to achieve a balanced term recognition performance (i.e. the highest F1 score). Most existing techniques lack focus on one or both factors. We conclude that larger corpora, when constructed with consideration for vocabulary coverage and specificity, deliver the prerequisites required for producing consistent and high-quality output during term recognition.

Several directions of future work have been planned to further assess SPARTAN. In the near future, we hope to study the effect of corpus construction using different seed terms W. We also intend to examine how the content of SPARTAN-based corpora evolves over time and its effect on term recognition. Furthermore, we are planning to study the possibility of extending the use of virtual corpora to other applications which require contrastive analysis.

6.6 Acknowledgement

This research was supported by the Australian Endeavour International Postgraduate Research Scholarship. The authors would like to thank the anonymous reviewers for their invaluable comments.

6.7 Other Publications on this Topic

Wong, W., Liu, W. & Bennamoun, M. (2008) Constructing Web Corpora through Topical Web Partitioning for Term Recognition. In the Proceedings of the 21st Australasian Joint Conference on Artificial Intelligence (AI), Auckland, New Zealand.

This paper reports the preliminary ideas on the SPARTAN technique for creating text corpora using data from the Web. The SPARTAN technique was later improved and extended to form the core contents of this Chapter 6.
CHAPTER 7
Term Clustering for Relation Acquisition

Abstract

Many conventional techniques for concept formation in ontology learning rely on the use of predefined templates and rules, and on static background knowledge such as WordNet. These techniques are not only difficult to scale across different domains and to adapt to knowledge change; their results are also far from desirable. This chapter proposes a new multi-pass clustering algorithm for concept formation, known as the Tree-Traversing Ant (TTA), as part of an ontology learning system. The technique uses the Normalised Google Distance (NGD) and the n-degree of Wikipedia (noW) as measures of similarity and distance between terms to achieve highly adaptable clustering across different domains. Evaluations using seven datasets show promising results, with an average lexical overlap of 97% and an ontological improvement of 48%. In addition, the evaluations demonstrate several advantages that are not simultaneously present in standard ant-based and other conventional clustering techniques.

(This chapter appeared in Data Mining and Knowledge Discovery, Volume 15, Issue 3, Pages 349-381, with the title "Tree-Traversing Ant Algorithm for Term Clustering based on Featureless Similarities".)

7.1 Introduction

Ontologies are gaining increasing importance in modern information systems for providing inter-operable semantics. The increasing demand for ontologies makes labour-intensive creation more and more undesirable, if not impossible. Exacerbating the situation is the problem of knowledge change that results from ever-growing information sources, both online and offline. Since the late nineties, more and more researchers have started looking for solutions to relieve knowledge engineers from this increasingly acute situation. One of the main research areas with high impact, if successful, is the automatic or semi-automatic construction and maintenance of ontologies from electronic text. Ontology learning from text is the process of identifying concepts and relations from natural language text, and using them to construct and maintain ontologies. In ontology learning, terms are the lexical realisations of important concepts for characterising a domain. Consequently, the task of grouping together variants of terms to form concepts, known as term clustering, constitutes a crucial fundamental step in ontology learning. Unlike documents [242], webpages [44], and pixels in image segmentation and object recognition [113], terms alone are lexically featureless. The similarity of objects can normally be established by feature analysis based on visible (e.g. physical and behavioural) traits. Unfortunately, using object names (i.e. terms) alone, similarity depends on something less tangible, namely, the background knowledge which humans acquire through their senses over the years. The absence of features requires certain adjustments to be made to term clustering techniques. The most evident adaptation required is the use of context and other linguistic evidence as features for the computation of similarity. A recent survey [90] revealed that all ontology learning systems which apply clustering techniques rely on the contextual cues surrounding the terms as features. The large collection of documents, and the predefined patterns and templates required for the extraction of contextual cues, make the portability of such ontology learning systems difficult.
Consequently, non-feature similarity measures are fast becoming a necessity for term clustering in ontology learning from text. Along the same line of thought, Lagus et al. [137] stated that "In principle a document might be encoded as a histogram of its words...symbolic words as such retain no information of their relatedness". In addition to the problems associated with feature extraction in term clustering, much work is still required on the clustering algorithms themselves. Researchers [98] have shown that certain commonly adopted algorithms, such as K-means and average-link agglomerative clustering, yield mediocre results in comparison with ant-based algorithms, which are a relatively new paradigm. Handl et al. [98] demonstrated certain desirable properties of ant-based algorithms, such as tolerance to different cluster sizes and the ability to identify the number of clusters. Despite such advantages, the potential of ant-based algorithms remains relatively unexplored for possible applications in ontology learning.

In this chapter, we employ the established Normalised Google Distance (NGD) [50] together with a new hybrid, multi-pass algorithm called the Tree-Traversing Ant (TTA) for clustering terms in ontology learning. (The foundation work on term clustering using featureless similarity measures appeared in the Proceedings of the International Symposium on Practical Cognitive Agents and Robots (PCAR), Perth, Australia, 2006, with the title "Featureless Similarities for Terms Clustering using Tree-Traversing Ants".) TTA fuses the strengths of standard ant-based and conventional clustering techniques with the advantages of featureless similarity measures. In addition, a second pass is introduced in TTA for refining the results produced using NGD. During the second pass, the TTA employs a new distance measure called the n-degree of Wikipedia (noW) for quantifying the distance between two terms based on Wikipedia's categorical system. Evaluations using seven datasets show promising results, and reveal several advantages which are not simultaneously present in existing clustering algorithms.

In Section 2, we give an introduction to current term clustering techniques for ontology learning. In Section 3, a description of the NGD measure and an introduction to standard ant-based clustering are presented. In Section 4, we present the TTA, and how NGD and noW are employed to support term clustering. In Section 5, we summarise the results and findings from our evaluations. Finally, we conclude this chapter with an outlook to future work in Section 6.

7.2 Existing Techniques for Term Clustering

Faure & Nedellec [69] presented a corpus-based conceptual clustering technique as part of an ontology learning system called ASIUM. The clustering technique is designed for aggregating basic classes based on a distance measure inspired by the Hamming distance. The basic classes are formed prior to clustering in a phase for extracting subcategorisation frames [71]. Terms that appear on at least two different occasions with the same verb, and the same preposition or syntactic role, can be regarded as semantically similar such that they can be substituted with one another in that particular context. These semantically similar terms form the basic classes. The basic classes form the lowest level of the ontology and are successively aggregated to construct a hierarchy bottom-up. Each time, only two basic classes are compared.
The clustering begins by computing the distance between all pairs of basic classes and aggregating those with a distance less than a user-defined threshold. Two classes containing the same words with the same frequencies have a distance of 0. On the other hand, two classes without a single common word have a distance of 1. In other words, the terms in the basic classes act as features, allowing for inter-class comparison. The measure for distance is defined as

distance(C_1, C_2) = 1 - \frac{\sum F_{C_1} \times \frac{N_{comm}}{card(C_1)} + \sum F_{C_2} \times \frac{N_{comm}}{card(C_2)}}{\sum_{i=1}^{card(C_1)} f(word_i^{C_1}) + \sum_{i=1}^{card(C_2)} f(word_i^{C_2})}

where card(C1) and card(C2) are the numbers of words in C1 and C2, respectively, and Ncomm is the number of words common to both C1 and C2. ΣF_{C1} and ΣF_{C2} are the sums of the frequencies of the words in C1 and C2 which also occur in C2 and C1, respectively. f(word_i^{C1}) and f(word_i^{C2}) are the frequencies of the ith word of class C1 and C2, respectively.

Maedche & Volz [165] presented a bottom-up hierarchical clustering technique that is part of the ontology learning system Text-to-Onto. This term clustering technique relies on an all-knowing oracle, denoted by H, which is capable of returning possible hypernyms for a given term. In other words, the performance of the clustering algorithm has an upper bound limited by the ability of the oracle to know all possible hypernyms for a term. The oracle is constructed using WordNet and lexico-syntactic patterns [51]. During the clustering phase, the algorithm is provided with a list of terms, and the similarity between each pair is computed using the cosine measure. For this purpose, the syntactic dependencies of each term are extracted and used as the features for that term. The algorithm is an extremely long list of nested if-else statements. For the sake of brevity, it suffices to know that the algorithm examines the hypernymy relations between all pairs of terms before it decides on the placement of terms as parents, children or siblings of other terms. Each time information about the hypernym relations between two terms is required, the oracle is consulted. The projection H(t) returns a set of tuples (x, y) where x is a hypernym of term t and y is the number of times the algorithm has found evidence for it.

Shamsfard & Barforoush [225] presented two clustering algorithms as part of the ontology learning system Hasti. Concepts have to be formed prior to the clustering phase. It suffices to know that the process of forming the concepts and extracting the relations that are used as features for clustering involves a knowledge extractor, where "the knowledge extractor is a combination of logical, template driven and semantic analysis methods" [227]. In the concept-based clustering technique, a similarity matrix consisting of the similarities of all possible pairs of concepts is computed. The pair with the maximum similarity that is also greater than the merge-threshold is chosen to form a new super concept. In this technique, each intermediate (i.e. non-leaf) node in the conceptual hierarchy has at most two children, but the hierarchy is not a binary tree as each node may have more than one parent. As for the relation-based clustering technique, only non-taxonomic relations are considered. For every concept c, a set of assertions about the non-taxonomic relations NF(c) that c has with other concepts is identified.
In other words, these relations can be regarded as features that allow concepts to be merged according to what they share. If at least one related concept is common between assertions about that relation, then the set comprising the other concepts (called the merge-set) contains good candidates for merging. After all the relations have been examined, a list of merge-sets is obtained. The merge-set with the highest similarity between its members is chosen for merging. In both clustering algorithms, the similarity measure employed is defined as

similarity(a, b) = \sum_{j=1}^{maxlevel} \sum_{i=1}^{card(cm)} \left( W_{cm(i).r} + \sum_{k=1}^{valence(cm(i).r)} W_{cm(i).arg(k)} \right) \times L_j

where cm = NF(a) ∩ NF(b) is the intersection between the sets of assertions (i.e. the common relations) about a and b, and card(cm) is the cardinality of cm. W_{cm(i).r} is the weight of each common relation, and \sum_{k=1}^{valence(cm(i).r)} W_{cm(i).arg(k)} is the sum of the weights of all terms related to the common relation cm(i). Lj is the level constant assigned to each similarity level, which decreases as the level increases. The main aspect of the similarity measure is the common features between two concepts a and b (i.e. the intersection between the sets of non-taxonomic assertions NF(a) ∩ NF(b)). Each common feature cm(i).r, together with the corresponding weight W_{cm(i).r} and the weights of the related terms, is accumulated. In other words, the more features two concepts have in common, the higher the similarity between them.

Regardless of how the existing techniques described in this section are named, they share a common point, namely, the reliance on some linguistic (e.g. subcategorisation frames, lexico-syntactic patterns) or predefined semantic (e.g. WordNet) resources as features. These features are necessary for the computation of similarity using conventional measures and clustering algorithms. The ease of scalability across different domains and the resources required for feature extraction are among the questions our new clustering technique attempts to address. In addition, the new clustering technique fuses the strengths of recent innovations, such as ant-based algorithms and featureless similarity measures, that have yet to benefit ontology learning systems.

7.3 Background

7.3.1 Normalised Google Distance

The Normalised Google Distance (NGD) computes the semantic distance between objects based on their names, using only page counts from the Google search engine. A more generic name for the measure that employs page counts provided by any Web search engine is the Normalised Web Distance (NWD) [262]. NGD is a non-feature distance measure which attempts to capture every effective distance (e.g. Hamming distance, Euclidean distance, edit distances) in a single metric. NGD is based on the notions of Kolmogorov complexity [93] and Shannon-Fano coding [142].

The basis of NGD begins with the idea of the shortest binary program capable of producing a string x as output. The Kolmogorov complexity of the string x, K(x), is simply the length of that program in binary bits. Extending this notion to include an additional string y produces the information distance [23], where E(x, y) is the length of the shortest binary program that can produce x given y, and y given x. It was shown that [23]:

E(x, y) = K(x, y) - \min\{K(x), K(y)\}    (7.1)

where E(x, x) = 0, E(x, y) > 0 for x ≠ y, and E(x, y) = E(y, x).
Next, for every other computable distance D that is non-negative and symmetric, there is a binary program, given strings x and y, with a length equal to D(x, y). Formally, E(x, y) ≤ D(x, y) + c_D, where c_D is a constant that depends on the distance D and not on x and y. E(x, y) is called universal because it acts as a lower bound for all computable distances. In other words, if two strings x and y are close according to some distance D, then they are at least as close according to E [49]. Since all computable distances compare the closeness of strings through the quantification of certain common features they share, we can consider that information distance determines the distance between two strings according to the feature by which they are most similar. By normalising information distance, we have NID(x, y) ∈ (0, 1), where 0 means the two strings are the same and 1 means they are completely different in the sense that they share no features. The normalised information distance is defined as:

NID(x, y) = \frac{K(x, y) - \min\{K(x), K(y)\}}{\max\{K(x), K(y)\}}

Nonetheless, referring back to Kolmogorov complexity and Equation 7.1, the non-computability of K(x) implies the non-computability of NID(x, y). An approximation of K can nevertheless be achieved using real compression programs [261]. If C is a compressor, then C(x) denotes the length of the compressed version of string x. Approximating K(x) with C(x) results in:

NCD(x, y) = \frac{C(x, y) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}

The derivation of NGD continues by observing the workings of compressors. Compressors encode source words x into code words x′ such that the code words are shorter (i.e. |x′| < |x|). We can consider these code words from the perspective of Shannon-Fano coding. Shannon-Fano coding encodes a source word x using a code word of length log(1/p(x)). p(x) can be thought of as a probability mass function that maps each source word x to the code that achieves optimal compression of x. In Shannon-Fano coding, p(x) = n_x/N captures the probability of encountering source word x in a text or a stream of data from a source, where n_x is the number of occurrences of x and N is the total number of source words in the same text. Cilibrasi & Vitanyi [49] discussed the use of compressors for NCD and concluded that the existing compressors' inability to take external knowledge into consideration during compression makes them inadequate. Instead, the authors proposed to make use of a source that "...stands out as the most inclusive summary of statistical information" [49], namely, the World Wide Web. More specifically, the authors proposed the use of the Google search engine to devise a probability mass function that reflects the Shannon-Fano code. The Google equivalent of the Shannon-Fano code, known as the Google code, has length defined by [49]:

G(x) = \log \frac{1}{g(x)} \qquad G(x, y) = \log \frac{1}{g(x, y)}

where g(x) = |x|/N and g(x, y) = |x ∩ y|/N are the new probability mass functions that capture the probability of occurrence of the search terms x and y. x is the set of webpages returned by Google containing the single search term x (i.e. the singleton set) and, similarly, x ∩ y is the set of webpages returned by Google containing both search terms x and y (i.e. the doubleton set). N is the summation over all unique singleton and doubleton sets. Consequently, the Google search engine can be considered as a compressor for encoding search terms (i.e. source words) x to produce the meaning (i.e. compressed code words) with length G(x).
By rewriting the NCD, we obtain the new NGD, defined as:

NGD(x, y) = \frac{G(x, y) - \min\{G(x), G(y)\}}{\max\{G(x), G(y)\}}    (7.2)

All in all, NGD is an approximation of NCD and hence of NID, introduced to overcome the non-computability of Kolmogorov complexity. NGD employs the Google search engine as a compressor to generate Google codes based on Shannon-Fano coding. From the perspective of term clustering, NGD provides an innovative starting point which demonstrates the advantages of featureless similarity measures. In our new term clustering technique, we take such innovation a step further by employing NGD in a new clustering technique that combines the strengths of both conventional and ant-based algorithms.

7.3.2 Ant-based Clustering

The idea of ant-based clustering was first proposed by Deneubourg et al. [61] in 1991 as part of an attempt to explain the different types of emergent technologies inspired by nature. During simulation, the ants are represented as agents that move around the environment, a square grid, at random. Objects are randomly placed in this environment and the ants can pick up the objects, move them and drop them. These three basic operations are influenced by the distribution of the objects. Objects that are surrounded by dissimilar ones are more likely to be picked up and later dropped elsewhere in the surroundings of more similar ones. The picking up and dropping of objects are governed by the probabilities:

P_{pick}(i) = \left(\frac{k_p}{k_p + f(i)}\right)^2 \qquad P_{drop}(i) = \left(\frac{f(i)}{k_d + f(i)}\right)^2

where f(i) is an estimation of the distribution density of the objects in the ant's immediate environment (i.e. local neighbourhood) with respect to the object that the ant is considering picking up or dropping. The choice of f(i) varies depending on the cost and other factors related to the environment and the data items. As f(i) decreases below kp, the probability of picking up the object becomes very high, and the opposite occurs when f(i) exceeds kp. As for the probability of dropping an object, a high f(i) exceeding kd induces the ants to give up the object, while an f(i) less than kd encourages the ants to hold on to it. The combination of these three simple operations and the heuristics behind them gave birth to the notion of basic ants for clustering, also known as the standard ant clustering algorithm (SACA).

Gutowitz [94] examined the basic ants described by Deneubourg et al. and proposed a variant known as complexity-seeking ants. Such ants are capable of sensing local complexity and are inclined to work in regions of high interest (i.e. high complexity). Regions with high complexity are determined using a local measure that assesses the neighbouring cells and counts the number of pairs of contrasting cells (i.e. occupied or empty). Neighbourhoods with all empty or all occupied immediate cells have zero complexity, while regions with checkerboard patterns have high complexity. Hence, these modified ants are able to accomplish their task faster because they are more inclined to manipulate objects in regions of higher complexity [263].
Lumer & Faieta [160] further extended and improved the idea of ant-based clustering in terms of the numerical aspects of the algorithm and the convergence time. The authors represented the objects as numerical vectors and the distance between the vectors is computed using the Euclidean distance. Hence, given that δ(i, j) ∈ [0, 1] is the Euclidean distance between object i (i.e. i is the location of the object in the centre of the neighbourhood) and every other neighbouring object j, the neighbourhood function f(i) is defined by the authors as:

f(i) = (1/s²) Σ_j [1 − δ(i, j)/α]   if f(i) > 0
     = 0                            otherwise     (7.3)

where s² is the size of the local neighbourhood, and α ∈ [0, 1] is a constant for scaling the distance among objects. In other words, an ant has to consider the average similarity of object i with respect to all other objects j in the local neighbourhood before performing an operation (i.e. pick up or drop). As the value of f(i) is obtained by averaging the total similarities over the number of neighbouring cells s², empty cells which do not contribute to the overall similarity are penalised. In addition, the radius of perception (i.e. the extent to which objects are taken into consideration for f(i)) of each ant at the centre of the local neighbourhood is given by (s − 1)/2. The clustering algorithm using the basic ant SACA is defined in Algorithm 4.

Algorithm 4 Basic ant-based clustering defined by Handl et al. [99]
1: begin
2: //INITIALISATION PHASE
3: Randomly scatter data items on the toroidal grid
4: for each j in 1 to #agents do
5:   i := random_select(remaining_items)
6:   pick_up(agent(j), i)
7:   g := random_select(remaining_empty_grid_locations)
8:   place_agent(agent(j), g)
9: //MAIN LOOP
10: for each it_ctr in 1 to #iterations do
11:   j := random_select(all_agents)
12:   step(agent(j), stepsize)
13:   i := carried_item(agent(j))
14:   drop := drop_item?(f(i))
15:   if drop = TRUE then
16:     while pick = FALSE do
17:       i := random_select(free_data_items)
18:       pick := pick_item?(f(i))

Handl & Meyer [100] introduced several enhancements to make ant-based clustering more efficient. The first is the concept of eager ants, where idle phases are avoided by having the ants immediately pick up objects as soon as existing ones are dropped. The second is the notion of stagnation control. There are occasions in ant-based clustering when ants are occupied or blocked due to objects that are difficult to dispose of. In such cases, the ants are forced to drop whatever they are carrying after a certain number of unsuccessful drops. In a different paper [98], the authors have also demonstrated that the ant-based algorithm has several advantages:
• tolerance to different cluster sizes
• the ability to identify the number of clusters
• performance that increases with the size of the dataset
• graceful degradation in the face of overlapping clusters.
Nonetheless, the authors have also highlighted two shortcomings of ant-based clustering, namely, the inability to distinguish more refined clusters within coarser-level ones, and the fact that the inability to specify the number of clusters can be seen as a disadvantage when users have precise ideas about it.
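A minimal sketch of the neighbourhood function in Equation 7.3, assuming the caller supplies the objects found in the s × s local neighbourhood together with a distance function normalised to [0, 1]; the toy points and the value of α below are illustrative only:

def neighbourhood_density(i, neighbours, s, alpha, dist):
    # f(i): scaled similarity of object i to its neighbours, averaged over all
    # s*s cells; dividing by s*s (rather than by the number of occupied cells)
    # penalises empty cells, and negative totals are clipped to 0 as in Eq. 7.3
    total = sum(1.0 - dist(i, j) / alpha for j in neighbours)
    return max(0.0, total / (s * s))

euclidean = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
print(neighbourhood_density((0.1, 0.1), [(0.12, 0.1), (0.3, 0.2)], 3, 0.5, euclidean))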
Vizine et al. [263] proposed an adaptive ant clustering algorithm (A²CA) that improves upon the algorithm by Lumer & Faieta. The authors introduced two major modifications, namely, a progressive vision scheme and the use of pheromones on grid cells. The progressive vision scheme allows the dynamic adjustment of s². Whenever an ant perceives a larger cluster, it increases its radius of perception from the original (s − 1)/2 to the new (s′ − 1)/2. The second enhancement allows ants to mark regions that are recently constructed or under construction. The pheromones attract other ants, increasing the probability that relatively smaller regions are deconstructed and increasing the probability of dropping objects at denser clusters.

Ant-based algorithms have been employed to cluster objects that can be represented using numerical vectors. Similar to conventional algorithms, the similarity or distance measures used by existing ant-based algorithms are still feature-based. Consequently, they share similar problems such as difficult portability across domains. In addition, despite the strengths of standard ant-based algorithms, two disadvantages were identified. In our new technique, we make use of the known strengths of standard ant-based algorithms and some desirable traits from conventional ones for clustering terms using featureless similarity.

7.4 The Proposed Tree-Traversing Ants

The Tree-Traversing Ant (TTA) clustering technique is based on dynamic tree structures as compared to the toroidal grids in the case of standard ants. The dynamic tree begins with one root node r0 consisting of all terms T = {t1, ..., tn}, and branches out to new sub-nodes as required. In other words, the clustering process begins with r0 = {t1, ..., tn}. For example, the first snapshot in Figure 7.1 shows the start of the TTA clustering process with the root node r0 initialised with the terms t1, ..., tn=10. Essentially, each node in the tree is a set of terms ru = {t1, ..., tq}. The sizes of new sub-nodes |ru| reduce as fewer and fewer terms are assigned to them in the process of creating nodes with higher intra-node similarity. The clustering starts with only one ant, while an unbounded number of ants await to work at each of the new sub-nodes created. In the third snapshot in Figure 7.1, while the first ant moves on to work at the left sub-node r01, a new second ant proceeds to process the right sub-node r02. The number of possible new sub-nodes for each main node (i.e. the branching factor) in this version of TTA is two. In other words, for each main node rm, we have the sub-nodes rm1 and rm2. Similar to some of the current enhanced ants, the TTA ants are endowed with short-term memory for remembering similarities and distances acquired through their senses. The TTA is equipped with two types of senses, namely, NGD and n-degree of Wikipedia (noW). The standard ants have a radius of perception defined in terms of the cells immediately surrounding the ants. Instead, the perception radius of TTA ants covers all terms in the two sub-nodes created for each current node. A current node is simply a node originally consisting of terms to be sorted into the new sub-nodes. The TTA adopts a two-pass approach for term clustering. During the first-pass, the TTA recursively breaks nodes into sub-nodes and relocates terms until the ideal clusters are achieved. The resulting trees created in the first-pass are often good enough to reflect the natural clusters. Nonetheless, discrepancies do occur due to certain oddities in the co-occurrences of terms on the World Wide Web that manifest themselves through NGD. Accordingly, a second-pass is created that uses noW for relocating terms which are displaced due to NGD. The second-pass can be regarded as a refinement phase for producing clusters with higher quality.

Figure 7.1: Example of TTA at work
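Before moving into the details of the two passes, a possible data structure for the dynamic tree is sketched below; the class and field names are hypothetical and are only meant to make the node-splitting process easier to picture:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    terms: List[str]                                       # the terms currently held by this node
    children: List["Node"] = field(default_factory=list)   # at most two sub-nodes in this version of TTA

    def is_leaf(self) -> bool:
        return not self.children

# the clustering starts with a single root node r0 holding all n terms
r0 = Node(terms=["t1", "t2", "t3", "t4"])
print(r0.is_leaf())   # True until the first ant grows the sub-nodes r01 and r02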
7.4.1 First-Pass using Normalised Google Distance

The TTA begins clustering at the root node which consists of all n terms, r0 = {t1, ..., tn}. Each term can be considered as an element in the node. A TTA ant randomly picks a term, and proceeds to sense its similarity with every other term on that same node. The ant repeats this for all n terms until the similarities of all possible pairs of terms have been memorised. The similarity between two terms tx and ty is defined as:

s(tx, ty) = 1 − NGD(tx, ty)^α     (7.4)

where NGD(tx, ty) is the distance between terms tx and ty estimated using the original NGD defined in Equation 7.2, and α is a constant for scaling the distance between the two terms. The algorithm then grows two new sub-nodes to accommodate the two least similar terms ta and tb. The ant moves the first term ta from the main node rm to the first sub-node while emitting pheromones that trace back to tb in the process. The ant then follows the pheromone trail back to the second term tb to move it to the second sub-node. The second snapshot in Figure 7.1 shows two new sub-nodes r01 and r02. The ant moved the term t1 to r01 and the least similar term t6 to r02. Nonetheless, prior to the creation of new sub-nodes and the relocation of terms, an ideal intra-node similarity condition must be tested. The operation of moving the two least similar terms from the current node to create and initialise new sub-nodes is essentially a partitioning process. Eventually, each leaf node would end up with only one term if the TTA did not know when to stop. For this reason, we adopt an ideal intra-node similarity threshold sT for controlling the extent of branching out. Whenever an ant senses that the similarity between the two least similar terms exceeds sT, no further sub-nodes will be created and the partitioning process at that branch will cease. A high similarity (higher than sT) between the two most dissimilar terms in a node provides a simple but effective indication that the intra-node similarity has reached an ideal stage. More refined factors such as the mean and standard deviation of intra-node similarity are possible but have not been considered. If the similarity between the two most dissimilar terms is still less than sT, further branching out will be performed. In this case, the TTA ant repeatedly picks up the remaining terms on the current node one by one and senses their similarities with every other term already located in the sub-nodes. Formally, the probability of picking up term ti by an ant in the first-pass is defined as:

P¹pick(ti) = 1   if ti ∈ rm
           = 0   otherwise     (7.5)

where rm is the set of terms in the current node. In other words, the probability of picking up terms by an ant is always 1 as long as there are still terms remaining in the current node. Each term ti ∈ rm is moved to whichever of the two sub-nodes ru has the term tj ∈ ru with the highest similarity to ti. In other words, an ant considers multiple neighbourhoods prior to dropping a term. Snapshot 3 in Figure 7.1 illustrates the corresponding two sub-nodes r01 and r02 that have been populated with all the terms which were previously located at the current node r0.
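The similarity of Equation 7.4 only needs page counts; the sketch below stubs the search engine with invented counts (the terms, the counts and the total N are placeholders, not measurements) to show how Equations 7.2 and 7.4 fit together:

import math

# hypothetical page counts standing in for live search-engine queries
COUNTS = {("fox",): 2.1e8, ("wolf",): 1.3e8, ("fox", "wolf"): 2.4e7}
N = 1.0e10   # assumed total used to turn counts into the probability mass g

def G(*terms: str) -> float:
    # length of the Google code for the search term(s): log(1/g)
    return math.log(N / COUNTS[tuple(sorted(terms))])

def ngd(x: str, y: str) -> float:
    # Equation 7.2
    gx, gy, gxy = G(x), G(y), G(x, y)
    return (gxy - min(gx, gy)) / max(gx, gy)

def similarity(x: str, y: str, alpha: float = 1.0) -> float:
    # Equation 7.4: s(tx, ty) = 1 - NGD(tx, ty)^alpha
    return 1.0 - ngd(x, y) ** alpha

print(round(similarity("fox", "wolf"), 4))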
The standard neighbourhood function f(i) defined in Equation 7.3 represents the density of the neighbourhood as the average of the similarities between ti and every other term in its immediate surrounding (i.e. local neighbourhood) confined by s². Unlike the sense of basic ants, which covers only the surrounding cells s², the extent to which a TTA ant perceives covers all terms in the two sub-nodes (i.e. multiple neighbourhoods) corresponding to the immediate current node. Accordingly, instead of estimating f(i) as the averaged similarity defined over the s² terms surrounding the ant, the new neighbourhood function fTTA(ti, ru) is defined as the maximum similarity between term ti ∈ rm and the neighbourhood (i.e. sub-node) ru. The maximum similarity between ti and ru is the highest similarity between ti and all other terms tj ∈ ru. Formally, we define the density of neighbourhood ru with respect to term ti during the first-pass as:

f¹TTA(ti, ru) = maximum of s(ti, tj) w.r.t. tj ∈ ru     (7.6)

where the similarity between the two terms s(ti, tj) is computed using Equation 7.4. Besides deciding on whether to drop an object or not, as in the case of basic ants, the TTA ant has to decide on one additional issue, namely, where to drop. The TTA decides on where to drop a term based on the f¹TTA(ti, ru) that it has memorised for all sub-nodes ru of the current node rm. Formally, the decision on whether to drop term ti ∈ rm on sub-node rv depends on:

P¹drop(ti, rv) = 1   if f¹TTA(ti, rv) = maximum of f¹TTA(ti, ru) w.r.t. ru ∈ {rm1, rm2}
              = 0   otherwise     (7.7)

The current version of the TTA clustering algorithm is implemented in two parts. The first is the main function while the second is a recursive function. The main function is defined in Algorithm 5 while the recursive function for the first-pass elaborated in this subsection is reported in Algorithm 6.

Algorithm 5 Main function
1: input A list of terms, T = {t1, ..., tn}.
2: Create an initial tree with a root node r0 containing n terms.
3: Define the ideal intra-node similarity threshold sT and δT.
4: //first-pass using NGD
5: ant := new ant()
6: ant.ant_traverse(r0, r0)
7: //second-pass using noW
8: leaf_nodes := ant.pickup_trail() //return all leaf nodes marked by pheromones
9: for each rnext ∈ leaf_nodes do
10:   ant.ant_refine(leaf_nodes, rnext)

Algorithm 6 Function ant_traverse(rm, r0) using NGD
1: if |rm| = 1 then
2:   leave_trail(rm, r0) //leave trail from current leaf node to root node, for use in second-pass
3:   return //only one term left. return to root
4: {ta, tb} := find_most_dissimilar_terms(rm)
5: if s(ta, tb) > sT then
6:   leave_trail(rm, r0) //leave trail from current leaf node to root node, for use in second-pass
7:   return //ideal cluster has been achieved. return to root node
8: else
9:   {rm1, rm2} := grow_sub_nodes(rm)
10:   move_terms({ta, tb}, {rm1, rm2})
11:   for each term ti ∈ rm do
12:     pick(ti) //based on Eq. 7.5
13:     for each ru ∈ {rm1, rm2} do
14:       for each term tj ∈ ru do
15:         s(ti, tj) := sense_similarity(ti, tj) //based on Eq. 7.4
16:         remember_similarity(s(ti, tj))
17:       f¹TTA(ti, ru) := sense_neighbourhood() //based on Eq. 7.6
18:       remember_neighbourhood(f¹TTA(ti, ru))
19:     {∀u, f¹TTA(ti, ru)} := recall_neighbourhood()
20:     rv := decide_drop({∀u, f¹TTA(ti, ru)}) //based on Eq. 7.7
21:     drop({ti}, {rv})
22:   antm1 := new ant()
23:   antm1.ant_traverse(rm1, r0) //repeat the process recursively for each sub-node
24:   antm2 := new ant()
25:   antm2.ant_traverse(rm2, r0) //repeat the process recursively for each sub-node
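For readers who prefer running code to pseudocode, the following is a compact, simplified reading of the first-pass in Algorithm 6 (one recursive call plays the role of one ant, and the similarity function is passed in rather than sensed and memorised):

from itertools import combinations

def first_pass(terms, sim, s_T):
    # recursively partition a node of terms; returns the leaf clusters
    if len(terms) <= 1:
        return [terms]
    # the two least similar terms seed the two sub-nodes (Algorithm 6, line 4)
    ta, tb = min(combinations(terms, 2), key=lambda pair: sim(*pair))
    if sim(ta, tb) > s_T:
        return [terms]   # ideal intra-node similarity reached; stop branching
    sub = {ta: [ta], tb: [tb]}
    for t in terms:
        if t in (ta, tb):
            continue
        # drop t into the sub-node holding its most similar term (Eqs. 7.6 and 7.7)
        seed = max((ta, tb), key=lambda s: max(sim(t, u) for u in sub[s]))
        sub[seed].append(t)
    return first_pass(sub[ta], sim, s_T) + first_pass(sub[tb], sim, s_T)

# usage with any symmetric similarity in [0, 1], e.g. the NGD-based s() of Eq. 7.4
toy_sim = lambda a, b: 1.0 if a[0] == b[0] else 0.2
print(first_pass(["apple", "apricot", "banana", "blueberry"], toy_sim, 0.8))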
7.4.2 n-degree of Wikipedia: A New Distance Metric

The use of NGD for quantifying the similarity between two objects based on their names alone can occasionally produce low-quality clusters. We will highlight some of these discrepancies during our initial experiments in the next section. The initial tree of clusters generated by the TTA using NGD demonstrated promising results. Nonetheless, we reckoned that higher-quality clusters could be generated if we allowed the TTA ants to visit the nodes again for the purpose of refinement. Instead of using NGD, we present a new way to gauge the similarity between terms. Google can be regarded as the gateway to the huge volume of documents on the World Wide Web. The sheer size of Google’s index enables a relatively reliable estimate of term usage and occurrence using NGD. The page counts provided by the Google search engine, which are the essence of NGD, are used to compute the similarity between two terms based on the mutual information that they both share at the compressed level. As for Wikipedia, its number of articles is only a fraction of what Google indexes. Nonetheless, the restrictions imposed on the authoring of Wikipedia’s articles and their organisation provide a possibly new way of looking at similarity between terms. n-degree of Wikipedia (noW) [272] is inspired by a game for Wikipedians. 6-degree of Wikipedia (http://en.wikipedia.org/wiki/Six_Degrees_of_Wikipedia) is a task set out to study the characteristics of Wikipedia in terms of the similarity between its articles. An article in Wikipedia can be regarded as an entry of encyclopaedic information describing a particular topic. The articles are organised using categorical indices which eventually lead to the highest level, namely, “Categories” (http://en.wikipedia.org/wiki/Category:Categories). Each article can appear under more than one category. Hence, the organisation of articles in Wikipedia appears more as a directed acyclic graph with a root node than as a pure tree structure (http://en.wikipedia.org/wiki/Wikipedia:Categorization#Categories_do_not_form_a_tree). The huge volume of articles in Wikipedia, the organisation of articles in a graph structure, the open-source nature of the articles, and the availability of the articles in electronic form make Wikipedia the ideal candidate for our endeavour. We define Wikipedia as a directed graph W := (V, E). W is essentially a network of linked articles where V = {a1, ..., aω} is the set of articles. We limit the vertices to English articles only. At the moment, ω = |V| is reported to be 1,384,729 (http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons), making it the largest encyclopaedia (http://en.wikipedia.org/wiki/Wikipedia:Largest_encyclopedia) in merely five years since its conception. The interconnections between articles are represented as the set of ordered pairs of vertices E. At the moment, the edges are uniformly assigned the weight 1. Each article can be considered as an elaboration of a particular event, an entity or an abstract idea. In this sense, an article in Wikipedia is a manifestation of the information encoded in the terms. Consequently, we can represent each term ti using the corresponding article ai ∈ V in Wikipedia.
Hence, the problem of finding the distance between two terms ti and tj can be reduced to discovering how closely situated the two corresponding articles ai and aj are in the Wikipedia categorical indices. The problem of finding the degree of separation between two articles can be addressed in terms of the single-source shortest path problem. Since the weights are all positive, we have resorted to Dijkstra’s algorithm for finding the shortest path between two vertices (i.e. articles). Other algorithms for the shortest-path problem are available. However, a discussion on these algorithms is beyond the scope of this chapter. Formally, the noW value between terms tx and ty is defined as:

noW(tx, ty) = δ(ax, ay), where
δ(ax, ay) = Σ_{k=1}^{|SP|} cek   if ax ≠ ay ∧ ax, ay ∈ V
          = 0                    if ax = ay ∧ ax, ay ∈ V
          = ∞                    otherwise     (7.8)

where δ(ax, ay) is the degree of separation between the articles ax and ay which correspond to the terms tx and ty, respectively. The degree of separation is computed as the sum of the costs of all edges along the shortest path between articles ax and ay in the graph of Wikipedia articles W. SP is the set of edges along the shortest path, ek is the k-th edge or element in the set SP, |SP| is the number of edges along the shortest path, and cek is the cost associated with the k-th edge. It is also worth mentioning that while δ(ax, ay) ≥ 0 for ax, ay ∈ V, no upper bound can be ascertained. The noW value between terms that do not have corresponding articles in Wikipedia is set to ∞. There is a hypothesis (http://tools.wikimedia.de/sixdeg/index.jsp) stating that no two articles in Wikipedia are separated by more than six degrees. However, some Wikipedians have shown that certain articles can be separated by up to eight steps (http://en.wikipedia.org/wiki/Six_Degrees_of_Wikipedia). This is the reason why we adopted the name n-degree of Wikipedia instead of 6-degree of Wikipedia.
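Equation 7.8 amounts to a shortest-path query over the article graph W; the sketch below runs Dijkstra’s algorithm over a tiny hand-made graph with uniform edge costs of 1 (the articles and links are invented for the example and are not real Wikipedia data):

import heapq

def shortest_path_cost(graph, src, dst):
    # Dijkstra over a dict {article: {neighbouring_article: edge_cost}}
    if src == dst:
        return 0.0
    dist, heap = {src: 0.0}, [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for nb, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nb, float("inf")):
                dist[nb] = nd
                heapq.heappush(heap, (nd, nb))
    return float("inf")   # no connecting path

def noW(tx, ty, term_to_article, graph):
    ax, ay = term_to_article.get(tx), term_to_article.get(ty)
    if ax is None or ay is None:
        return float("inf")   # no corresponding article in Wikipedia
    return shortest_path_cost(graph, ax, ay)

graph = {"Dove": {"Birds": 1}, "Eagle": {"Birds": 1}, "Birds": {"Dove": 1, "Eagle": 1}}
articles = {"dove": "Dove", "eagle": "Eagle"}
print(noW("dove", "eagle", articles, graph))   # two degrees of separation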
7.4.3 Second-Pass using n-degree of Wikipedia

Upon completing the first-pass, there are at most n leaf nodes; in that extreme case each term in the initial set of all terms T ends up in an individual node (i.e. cluster). There are only two possibilities for such extreme cases. The first is when the ideal intra-node similarity threshold sT is set too high, while the second is when all the terms are extremely unrelated. In normal cases, most of the terms will be nicely clustered into nodes with intra-node similarities exceeding sT. Only a small number of terms is usually isolated into individual nodes. We refer to these terms as isolated terms. There are two possibilities that lead to isolated terms in normal cases, namely, (1) the term has been displaced during the first-pass due to discrepancies related to NGD, or (2) the term is in fact an outlier. The TTA ants leave pheromone trails on their return trip to the root node (as in lines 2 and 6 of Algorithm 6) to mark the paths to the leaf nodes. In order to relocate the isolated terms to other, more suitable nodes, the TTA ants return to the leaf nodes by following the pheromone trails. At each leaf node rl, the probability of picking up a term ti during the second-pass is 1 if the leaf node has only one term (an isolated term):

P²pick(ti) = 1   if |rl| = 1 ∧ ti ∈ rl
           = 0   otherwise     (7.9)

After picking up an isolated term, the TTA ant continues to move from one leaf node to the next. At each leaf node, the ant determines whether that particular leaf node (i.e. neighbourhood) rl is the most suitable one to house the isolated term ti based on the average distance between ti and all other existing terms in rl. Formally, the density of neighbourhood rl with respect to the isolated term ti during the second-pass is defined as:

f²TTA(ti, rl) = ( Σ_{j=1}^{|rl|} noW(ti, tj) ) / |rl|     (7.10)

where |rl| is the number of terms in the leaf node rl and the noW value between the two terms ti and tj is computed using Equation 7.8. This process of sensing the distance of the isolated term with all other terms in a leaf node is performed for all leaf nodes. The probability of the ant dropping the isolated term ti on the most suitable leaf node rv is evaluated once the ant returns to the original leaf node that used to contain ti. Back at the original leaf node of ti, the ant recalls the neighbourhood density f²TTA(ti, rl) that it has memorised for all neighbourhoods (i.e. leaf nodes). The TTA ant drops the isolated term ti on the leaf node rv if all terms in rv collectively yield the minimum average distance with ti that satisfies the outlier discrimination threshold δT. Formally,

P²drop(ti, rv) = 1   if (f²TTA(ti, rv) = minimum of f²TTA(ti, rl) w.r.t. rl ∈ L) ∧ (f²TTA(ti, rv) ≤ δT)
              = 0   otherwise     (7.11)

where L is the set of all leaf nodes. After the ant has visited all the leaf nodes and has failed to drop the isolated term, the term will be returned to its original location. The failure to drop the isolated term in a more suitable node indicates that the term is an outlier. Referring back to the example in Figure 7.1, assume that snapshot 5 represents the end of the first-pass where the intra-node similarity of all nodes has satisfied sT. While all other leaf nodes, namely, r011, r012 and r021, consist of multiple terms, leaf node r022 contains only one term, t6. Hence, at the end of the first-pass, all ants, namely, ant1, ant2, ant3 and ant4, retreat back to the root node r0. Then, during the second-pass, one TTA ant is deployed to relocate the isolated term t6 from r022 to either leaf node r011, r012 or r021, depending on the average distances of these leaf nodes with respect to t6. The algorithm for the second-pass using noW is described in Algorithm 7. Unlike the ant_traverse() function in Algorithm 6, where each new sub-node is processed as a separate iteration of ant_traverse() using an independent TTA ant, there is only one ant required throughout the second-pass.

Algorithm 7 Function ant_refine(leaf_nodes, ru) using noW
1: if |ru| = 1 then
2:   //current leaf node has isolated term ti
3:   pick(ti) //based on Eq. 7.9
4:   for each rl ∈ leaf_nodes do
5:     for each term tj in current leaf node rl do
6:       //jump from one leaf node to the next to sense neighbourhood density
7:       δ(ti, tj) := sense_distance(ti, tj) //based on Eq. 7.8
8:       remember_distance(δ(ti, tj))
9:     f²TTA(ti, rl) := sense_neighbourhood() //based on Eq. 7.10
10:    remember_neighbourhood(f²TTA(ti, rl))
11:   //back to original leaf node of term ti after visiting all other leaves
12:   {∀l, f²TTA(ti, rl)} := recall_neighbourhood()
13:   rv := decide_drop({∀l, f²TTA(ti, rl)}) //based on Eq. 7.11
14:   if rv not null then
15:     drop({ti}, {rv}) //drop at ideal leaf node
16:   else
17:     drop({ti}, {ru}) //outlier. no ideal leaf node. drop back at original leaf node

7.5 Evaluations and Discussions

In this section, we focus on evaluations at the conceptual layer of ontologies to verify the taxonomic structures discovered using TTA. We employ three existing metrics. The first is known as Lexical Overlap (LO) for evaluating the intersection between the discovered concepts (Cd) and the recommended (i.e. manually created) concepts (Cm) [164]. The manually created concepts can be regarded as the reference for our evaluations. LO is defined as:

LO = |Cd ∩ Cm| / |Cm|     (7.12)

Some minor changes were made in terms of how the intersection between the set of recommended clusters and discovered clusters (i.e. Cd ∩ Cm) is computed.
The normal way of having exact lexical matching of the concept identifiers cannot be applied to our experiments. Due to the ability of the TTA to discover concepts with varying levels of granularity depending on sT, we have to take into consideration the possibility of sub-clusters that collectively correspond to some recommended clusters. For our evaluations, the presence of discovered sub-clusters that correspond to some recommended cluster is considered a valid intersection. In other words, given that Cd = {c1, ..., cn} and Cm = {cx} where cx ∉ Cd, then |Cd ∩ Cm| = 1 if c1 ∪ ... ∪ cn = cx.

The second metric is used to account for valid discovered concepts that are absent from the reference set, while the third metric ensures that concepts which exist in the reference set but are not discovered are also taken into consideration. The second metric is referred to as Ontological Improvement (OI) and the third metric is known as Ontological Loss (OL). They are defined as [214]:

OI = |Cd − Cm| / |Cm|     (7.13)

OL = |Cm − Cd| / |Cm|     (7.14)

Ontology learning is an incremental process that involves the continuous maintenance of the ontology every time new terms are added. As such, we do not see the clustering of large datasets as a problem. In this section, we employ seven datasets to assess the quality of the discovered clusters using the three metrics described above. The origin of the datasets and some brief descriptions are provided below:
• Three of the datasets used for our experiments were obtained from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). These sets are labelled as WINE 15T, MUSHROOM 16T and DISEASE 20T. The accompanying numerical attributes, which were designed for use with feature-based similarities, were removed.
• We also employ the original animals dataset (i.e. ANIMAL 16T) proposed for use with Self-Organising Maps (SOMs) by Ritter & Kohonen [209].
• We constructed the remaining three datasets, called ANIMALGOOGLE 16T, MIX 31T and MIX 60T. ANIMALGOOGLE 16T is similar to the ANIMAL 16T dataset except for a single replacement with the term “Google”. The other two MIX datasets consist of a mixture of terms from a large number of domains.

Table 1. Summary of the datasets employed for experiments. Column Cm are the recommended clusters and Cd are clusters automatically discovered using TTA.

Table 2. Summary of the evaluation results for all ten experiments using the three metrics LO, OI and OL.

Table 1 summarises the datasets employed for our experiments. The column Cm are the recommended clusters and Cd are clusters automatically discovered using TTA. Table 2 summarises the evaluation of TTA using the three metrics for all ten experiments.
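The three metrics are easy to compute once clusters are represented as sets of terms; the sketch below follows Equations 7.12 to 7.14 under one reasonable reading of the relaxed intersection rule described above (a union of discovered sub-clusters that exactly reproduces a recommended cluster counts as a match). The example clusters are invented:

def lo_oi_ol(discovered, recommended):
    # discovered, recommended: lists of clusters, each cluster a frozenset of terms
    matched_rec, used_disc = set(), set()
    for rec in recommended:
        parts = [d for d in discovered if d <= rec]          # sub-clusters contained in rec
        if parts and frozenset().union(*parts) == rec:       # their union recreates rec
            matched_rec.add(rec)
            used_disc.update(parts)
    lo = len(matched_rec) / len(recommended)                                      # Eq. 7.12
    oi = sum(1 for d in discovered if d not in used_disc) / len(recommended)      # Eq. 7.13
    ol = sum(1 for r in recommended if r not in matched_rec) / len(recommended)   # Eq. 7.14
    return lo, oi, ol

recommended = [frozenset({"red", "shiraz"}), frozenset({"white", "riesling"})]
discovered = [frozenset({"red"}), frozenset({"shiraz"}),
              frozenset({"white", "riesling"}), frozenset({"rose"})]
print(lo_oi_ol(discovered, recommended))   # (1.0, 0.5, 0.0)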
The high lexical overlap (LO) shows the good domain coverage of the discovered clusters. The occasionally high ontological improvement (OI) demonstrates the ability of TTA to highlight new, interesting concepts that were ignored during the manual creation of the recommended clusters. During the experiments, snapshots were produced to show the results in two parts: results after the first-pass using NGD, and results after the second-pass using noW.

The first experiment uses WINE 15T. The original dataset has 178 nameless instances spread out over 3 clusters. Each instance has 13 attributes for use with feature-based similarity measures. We augment the dataset by introducing famous names in the wine domain and removing their numerical attributes. We maintained the three clusters, namely, “white”, “red” and “mix”. “Mix” refers to wines that were named after famous wine regions around the world. Such wines can either be red or white. As shown in Figure 7.2, setting sT = 0.92 produces five clusters. Clusters A and D are actually sub-clusters of the recommended cluster “red”, while Clusters C and E are sub-clusters of the recommended cluster “white”. Cluster B corresponds exactly to the recommended cluster “mix”.

Figure 7.2: Experiment using 15 terms from the wine domain. Setting sT = 0.92 results in 5 clusters. Cluster A is simply red wine grapes or red wines, while Cluster E represents white wine grapes or white wines. Cluster B represents wines named after famous regions around the world and they can either be red, white or rose. Cluster C represents white noble grapes for producing great wines. Cluster D represents red noble grapes. Even though uncommon, Shiraz is occasionally admitted to this group.

The second experiment uses MUSHROOM 16T. The original dataset has 8124 nameless instances spread out over two clusters. Each instance has 22 nominal attributes for use with feature-based similarity measures. We augment the dataset by introducing names of mushrooms that fit into one of the two recommended clusters, namely, “edible” and “poisonous”. As shown in Figure 7.3, setting sT = 0.89 produces 4 clusters. Cluster A corresponds exactly to the recommended cluster “poisonous”. The remaining three clusters are actually sub-clusters of the recommended cluster “edible”. Cluster B contains edible mushrooms prominent in East Asia, while Clusters C and D comprise mushrooms found mostly in North America and Europe, and are prominent in Western cuisines.

Figure 7.3: Experiment using 16 terms from the mushroom domain. Setting sT = 0.89 results in 4 clusters. Cluster A represents poisonous mushrooms. Cluster B comprises edible mushrooms which are prominent in East Asian cuisine except for Agaricus Blazei. Nonetheless, this mushroom was included in this cluster probably due to its high content of beta glucan for potential use in cancer treatment, just like Shiitake. Moreover, China is the major exporter of Agaricus Blazei, also known as Himematsutake, further relating this mushroom to East Asia. Clusters C and D comprise edible mushrooms found mainly in Europe and North America, and are more prominent in Western cuisines.

Similarly, the third experiment was conducted using DISEASE 20T with the results shown in Figure 7.4. At sT = 0.86, TTA discovered hidden sub-clusters within the four recommended clusters, namely, “skin”, “blood”, “cardiovascular” and “digestion”. In relation to this, Handl et al. [99] highlighted a shortcoming in their
evaluation of ant-based clustering algorithms. The authors stated that the algorithm “...only manages to identify these upper-level structures and fails to further distinguish between groups of data within them.”. In other words, unlike existing ant-based algorithms, the first three experiments demonstrated that our TTA has the ability to further distinguish hidden structures within clusters.

Figure 7.4: Experiment using 20 terms from the disease domain. Setting sT = 0.86 results in 7 clusters. Cluster A represents skin diseases. Cluster B represents a class of blood disorders known as anaemia. Cluster C represents other kinds of blood disorders. Cluster D represents blood disorders characterised by the relatively low count of leukocytes (i.e. white blood cells) or platelets. Cluster E represents digestive diseases. Cluster F represents cardiovascular diseases characterised by both the inflammation and thrombosis (i.e. clotting) of arteries and veins. Cluster G represents cardiovascular diseases characterised by the inflammation of veins only.

The fourth and fifth experiments were conducted using the ANIMAL 16T dataset. This dataset has been employed to evaluate both the standard ant-based clustering (SACA) and the improved version called A²CA by Vizine et al. [263]. The original dataset consists of 16 named instances, each representing an animal using binary feature attributes. Both SACA and A²CA discovered two natural clusters, one for “mammal” and the other for “bird”. While SACA was inconsistent in its results, A²CA yielded a 100% recall rate over ten runs. The authors of A²CA stated that the dataset can also be represented as three recommended clusters. In the spirit of the evaluation by Vizine et al., we performed the clustering of the 16 animals using TTA over ten runs. In our case, no features were used. Just like all experiments in this chapter, the 16 animals were clustered based on their names. As shown in the fourth experiment in Figure 7.5, by setting sT = 0.60, the TTA automatically discovered the two recommended clusters after the second-pass: “bird” and “mammal”. While ant-based techniques are known for their intrinsic capability of identifying clusters automatically, conventional clustering techniques (e.g. K-means, average link agglomerative clustering) rely on the specification of the number of clusters [99]. The inability to control the desired number of natural clusters can be troublesome. According to Vizine et al. [263], “in most cases, they generate a number of clusters that is much larger than the natural number of clusters”. Unlike both extremes, TTA has flexibility in regard to the discovery of clusters. The granularity and number of discovered clusters in TTA can be adjusted by simply modifying the threshold sT. By setting a higher sT, the number of discovered clusters for ANIMAL 16T is increased to five as shown in Figure 7.6. A lower value of the desired ideal intra-node similarity sT results in less branching out and hence, fewer clusters. Conversely, setting a higher sT produces more tightly coupled terms where the similarities between elements in the leaf nodes are very high. In the
fifth experiment depicted in Figure 7.6, the value of sT was raised to 0.72 and more refined clusters were discovered: “bird”, “mammal hoofed”, “mammal kept as pet”, “predatory canine” and “predatory feline”.

Figure 7.5: Experiment using 16 terms from the animal domain. Setting sT = 0.60 produces 2 clusters. Cluster A comprises birds and Cluster B represents mammals.

Figure 7.6: Experiment using 16 terms from the animal domain (the same dataset from the experiment in Figure 7.5). Setting sT = 0.72 results in 5 clusters. Cluster A represents birds. Cluster B includes hoofed mammals (i.e. ungulates). Cluster C corresponds to predatory felines while Cluster D represents predatory canines. Cluster E constitutes animals kept as pets.

The next three experiments were conducted using the ANIMALGOOGLE 16T dataset. These three experiments are meant to reveal another advantage of TTA through the presence of an outlier, namely, the term “Google”. An outlier can simply be considered as a term that does not fit into any of the clusters. In Figure 7.7, TTA successfully isolated the term “Google” while discovering clusters at different levels of granularity based on different sT. As similar terms are clustered into the same node, outliers are eventually singled out as isolated terms in individual leaf nodes. Consequently, unlike some conventional techniques such as K-means [282], clustering using TTA is not susceptible to poor results due to outliers. In fact, there are two ways of looking at the term “Google”: as an outlier as described above, or as an extremely small cluster with one term. Either way, the term “Google” demonstrates two abilities of TTA: the capability of identifying and isolating outliers, and tolerance to differing cluster sizes like its predecessors. Handl et al. [99] have shown through experiments that certain conventional clustering techniques such as K-means and one-dimensional self-organising maps perform poorly in the face of increasing deviations between cluster sizes.

Figure 7.7: Experiment using 15 terms from the animal domain plus an additional term “Google”. Setting sT = 0.58 (left screenshot), sT = 0.60 (middle screenshot) and sT = 0.72 (right screenshot) results in 2 clusters, 3 clusters and 5 clusters, respectively. In the left screenshot, Cluster A acts as the parent for the two recommended clusters “bird” and “mammal”, while Cluster B includes the term “Google”. In the middle screenshot, the recommended clusters “bird” and “mammal” were clearly reflected through Clusters A and C respectively. By setting sT higher, we dissected the recommended cluster “mammal” to obtain the discovered sub-clusters C, D and E as shown in the right screenshot.

The last two experiments were conducted using MIX 31T and MIX 60T. Figure 7.8 shows the results after the first-pass and second-pass using 31 terms while Figure 7.9 shows the final results using 60 terms. Similar to the previous experiments, the first-pass resulted in a number of clusters plus some isolated terms. The second-pass aims to relocate these isolated terms to the most appropriate clusters. Despite the rise in the number of terms from 31 to 60, all the clusters formed by the TTA after the second-pass correspond precisely to their occurrences in real life (i.e. natural clusters). With the absolute consistency of the results over ten runs, these two experiments yield 100% recall just like the previous experiments. Consequently, we can claim that TTA is able to produce consistent results, unlike the standard ant-based clustering where the solution does not stabilise and fails to converge. For example, in the evaluation by Vizine et al. [263], the standard ant-based clustering was inconsistent in its performance over the ten runs using the ANIMAL 16T dataset. This is a very common problem in ant-based clustering when “they constantly construct and deconstruct clusters during the iterative procedure of adaptation” [263]. There is also another advantage of TTA that is not found in the standard ants, namely, the ability to identify taxonomic relations between clusters.
Referring to all ten experiments conducted, we noticed that there is implicit hierarchical information that connects the discovered clusters. For example, referring to the most recent experiment in Figure 7.8, the two discovered Clusters A (which contains “Sandra Bullock”, “Jackie Chan” and “Brad Pitt”) and B (which contains “3 Doors Down”, “Aerosmith” and “Rod Stewart”) after the second-pass share the same parent node. We can employ the graph of Wikipedia articles W to find the nearest common ancestor of the two natural clusters and label it with the category name provided by Wikipedia. In our case, we can label the parent node of the two natural clusters as “Entertainers”. In fact, the natural clusters themselves can be named using the same approach. For example, the terms in the discovered Cluster B (which contains “3 Doors Down”, “Aerosmith” and “Rod Stewart”) fall under the same category “American musicians” in Wikipedia and hence, we can label this cluster using that category name. In other words, clustering using TTA with the help of NGD and noW not only produces flexible and consistent natural clusters, but is also able to identify implicit taxonomic relations between clusters. Nonetheless, we would like to point out that not all hierarchies of natural clusters formed by the TTA correspond to real-life hierarchical relations. More research is required to properly validate this capability of the TTA.

Figure 7.8: Experiment using 31 terms from various domains. Setting sT = 0.70 results in 8 clusters. Cluster A represents actors and actresses. Cluster B represents musicians. Cluster C represents countries. Cluster D represents politics-related notions. Cluster E is transport. Cluster F includes finance and accounting matters. Cluster G constitutes technology and services on the Internet. Cluster H represents food.

One can notice that in all the experiments in this section, the quality of the clustering output using TTA would be less desirable if we were to rely only on the results from the first-pass. As pointed out earlier, the second-pass is necessary to produce naturally-occurring clusters. The results after the first-pass usually contain isolated terms due to discrepancies in NGD. This is mainly due to the appearance of words and the popularity of word pairs that are not natural. For example, given the words “Fox”, “Wolf” and “Entertainment”, the first two should go together naturally. Unfortunately, due to the popularity of the name “Fox Entertainment”, a Google search using the pair “Fox” and “Wolf” generates a lower page count compared to “Fox” and “Entertainment”. A lower page count has adverse effects on Equation 7.2, resulting in lower similarity.
Using Equation 7.4, “Fox” and “Entertainment” achieve a similarity of 0.7488 while “Fox” and “Wolf” yield a lower similarity of 0.7364. Despite such shortcomings, search engine page counts and Wikipedia offer TTA the ability to handle technical terms and common words of any domain, regardless of whether they have been around for some time or are merely beginning to evolve into common use on the Web. Due to the reliance on names or nouns alone for clustering, some readers may question the ability of TTA to handle various linguistic issues such as synonyms and word senses. Looking back at Figure 7.4, the terms “Buerger’s disease” and “Thromboangiitis obliterans” are actually synonyms referring to the acute inflammation and thrombosis (clotting) of arteries and veins of the hands and feet. In the context of the experiment in Figure 7.2, the term “Bordeaux” was treated as “Bordeaux wine” instead of the “city of Bordeaux”, and was successfully clustered together with other wines from other famous regions such as “Burgundy”. In another experiment in Figure 7.9, the same term “Bordeaux” was automatically disambiguated and treated as a port city in the southwest of France instead. The TTA then automatically clustered this term together with other cities in France such as “Chamonix” and “Paris”. In short, TTA has the inherent capability of coping with synonyms, word senses and fluctuations in term usage.

Figure 7.9: Experiment using 60 terms from various domains. Setting sT = 0.76 results in 20 clusters. Clusters A and B represent herbs. Cluster C comprises pastry dishes while Cluster D represents dishes of Italian origin. Cluster E represents computing hardware. Cluster F is a group of politicians. Cluster G represents cities or towns in France while Cluster H includes countries and states other than France. Cluster I constitutes trees of the genus Eucalyptus. Cluster J represents marsupials. Cluster K represents finance and accounting matters. Cluster L comprises transports with four or more wheels. Cluster M includes plant organs. Cluster N represents beverages. Cluster O represents predatory birds. Cluster P comprises birds other than predatory birds. Cluster Q represents two-wheeled transports. Clusters R and S represent predatory mammals. Cluster T includes trees of the genus Acacia.

The quality of the clustering results is very much dependent on the choice of sT and, to a lesser extent, δT. Nonetheless, as an effective rule of thumb, sT should be set as high as possible. A higher sT will result in more leaf nodes, each having a possibly smaller number of terms that are tightly coupled together. A high sT will also enable the isolation of potential outliers. The isolated terms and outliers generated by a high sT can then be further refined in the second-pass. The ideal range of sT derived through our experiments is within 0.60 to 0.90. Setting sT too low will result in very coarse clusters like the ones shown in Figure 7.5 where potential sub-clusters are left uncovered. Regarding the value of δT, it is usually set inversely proportional to sT. As shown during our evaluations, the higher we set sT, the more we decrease the value of δT. The reason behind the choices of these two threshold values can be explained as follows: as we lower sT, TTA produces coarser clusters with loosely coupled terms.
The intra-node distance of such clusters is inevitably higher compared to that of the finer clusters because the terms in these coarse clusters are more likely to be less similar. In order for the second-pass to function appropriately during the relocation of isolated terms and the isolation of outliers, δT has to be set comparatively higher. Besides, a lower sT will not provide adequate discriminative ability for the TTA to distinguish or pick out the outliers. Another interesting point about sT is that setting it to the maximum (i.e. 1.0) results in a divisive clustering effect. In divisive clustering, the process starts with one all-inclusive cluster and, at each step, splits the cluster until only singleton clusters of individual terms remain [242].

7.6 Conclusion and Future Work

In this chapter, we introduced a decentralised multi-agent system for term clustering in ontology learning. Unlike document clustering or other forms of clustering in pattern recognition, clustering terms in ontology learning requires a different approach. The most evident adjustment required in term clustering is the measure of similarity and distance. Existing term clustering techniques in many ontology learning systems remain confined within the realm of conventional clustering algorithms and feature-based similarity measures. Since there is no explicit feature attached to terms, these existing techniques have come to rely on contextual cues surrounding the terms. These clustering techniques require extremely large collections of domain documents to reliably extract contextual cues for the computation of similarity matrices. In addition, the static background knowledge required for term clustering, such as WordNet, patterns and templates, makes such techniques even more difficult to scale across domains. Consequently, we introduced the use of featureless similarity and distance measures called Normalised Google Distance (NGD) and n-degree of Wikipedia (noW) for term clustering. The use of these two measures as part of a new multi-pass clustering algorithm called the Tree-Traversing Ant (TTA) demonstrated excellent results during our evaluations. Standard ant-based techniques exhibit certain characteristics that have been shown to be useful and superior compared to conventional clustering techniques. The TTA is the result of an attempt to inherit these strengths while avoiding some inherent drawbacks. In the process, certain advantages from conventional divisive clustering were incorporated, resulting in the appearance of a hybrid between ant-based and conventional algorithms. Seven of the most notable strengths of the TTA with NGD and noW are (1) the ability to further distinguish hidden structures within clusters, (2) flexibility in regard to the discovery of clusters, (3) the capability of identifying and isolating outliers, (4) tolerance to differing cluster sizes, (5) the ability to produce consistent results, (6) the ability to identify implicit taxonomic relations between clusters, and (7) the inherent capability of coping with synonyms, word senses and fluctuations in term usage. Nonetheless, much work is still required in certain aspects. One of the main items of future work we have planned is to ascertain the validity of, and make good use of, the implicit hierarchical relations discovered using TTA. The next issue that interests us is the automatic labelling of the natural clusters and the nodes in the hierarchy using Wikipedia.
Labelling has always been a hard problem in clustering, especially document and term clustering. We are also keen on conducting more studies on the interaction between the two thresholds in TTA, namely, sT and δT. If possible, we intend to find ways to enable the automatic adjustment of these threshold values to maximise the quality of the clustering output.

7.7 Acknowledgement

This research was supported by the Australian Endeavour International Postgraduate Research Scholarship, and a Research Grant 2006 from the University of Western Australia. The authors would like to thank the anonymous reviewers for their invaluable comments.

7.8 Other Publications on this Topic

Wong, W., Liu, W. & Bennamoun, M. (2006) Featureless Similarities for Terms Clustering using Tree-Traversing Ants. In the Proceedings of the International Symposium on Practical Cognitive Agents and Robots (PCAR), Perth, Australia. This paper reports the preliminary work on clustering terms using featureless similarity measures. The resulting clustering technique, called TTA, was later refined to contribute towards the core contents of Chapter 7.

Wong, W., Liu, W. & Bennamoun, M. (2008) Featureless Data Clustering. In M. Song and Y. Wu (eds.), Handbook of Research on Text and Web Mining Technologies, IGI Global. The research on TTA reported in Chapter 7 was generalised in this book chapter to work with both terms and Internet domain names.

CHAPTER 8
Relation Acquisition

Abstract

Common techniques for acquiring semantic relations rely on static domain and linguistic resources, predefined patterns, and the presence of syntactic cues. This chapter proposes a hybrid technique which brings together established and novel techniques in lexical simplification, word disambiguation and association inference for acquiring coarse-grained relations between potentially ambiguous and composite terms using only dynamic Web data. Our experiments using terms from two different domains demonstrate promising preliminary results.

8.1 Introduction

Relation acquisition, also known as relation extraction or relation discovery, is an important aspect of ontology learning. Traditionally, semantic relations are either extracted as verbs based on grammatical structures [217], induced through term co-occurrence using large text corpora [220], or discovered in the form of unnamed associations through cluster analysis [212]. Challenges faced by conventional techniques include (1) the reliance on static patterns and text corpora together with rare domain knowledge, (2) the need for named entities to guide relation acquisition, (3) the difficulty of classifying composite or ambiguous names into the required categories, and (4) the dependence on grammatical structures and the presence of verbs, which can result in indirect, implicit relations being overlooked. In recent years, there has been a growing trend in relation acquisition using Web data such as Wikipedia [245] and online ontologies (e.g. Swoogle) [213] to partially address the shortcomings of conventional techniques. In this chapter, we propose a hybrid technique which integrates lexical simplification, word disambiguation and association inference for acquiring semantic relations using only Web data (i.e. Wikipedia and Web search engines) for constructing lightweight domain ontologies.
The proposed technique performs an iterative process of term mapping and term resolution to identify coarse-grained relations between domain terms. The main contribution of this chapter is the resolution phase, which allows our relation acquisition technique to handle complex and ambiguous terms, and terms not covered by our background knowledge on the Web. The proposed technique can be used to complement conventional techniques for acquiring fine-grained relations and to automatically extend online structured data such as Wikipedia. The rest of the chapter is structured as follows. Sections 8.2 and 8.3 present existing work related to relation acquisition, and the details of our technique, respectively. The outcome of the initial experiment is summarised in Section 8.4. We conclude this chapter in Section 8.5. (This chapter appeared in the Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Bangkok, Thailand, 2009, with the title “Acquiring Semantic Relations using the Web for Constructing Lightweight Ontologies”.)

8.2 Related Work

Techniques for relation acquisition can be classified as symbolic-based, statistics-based or a hybrid of both. The use of linguistic patterns enables the discovery of fine-grained semantic relations. For instance, Poesio & Almuhareb [201] developed specific lexico-syntactic patterns to discover named relations such as part-of and causation. However, linguistic-based techniques using static rules tend to face difficulties in coping with the structural diversity of a language. The technique by Sanchez & Moreno [217] for extracting verbs as potential named relations is restricted to handling verbs in simple tenses and verb phrases which do not contain modifiers such as adverbs. In order to identify indirect relations, statistics-based techniques such as co-occurrence analysis and cluster analysis are necessary. Co-occurrence analysis employs the redundancy in large text corpora to detect the presence of statistically significant associations between terms. However, the textual resources required by such techniques are difficult to obtain, and remain static over a period of time. For example, Schutz & Buitelaar [220] manually constructed a corpus for the football domain containing only 1,219 documents from an online football site for relation acquisition. Cluster analysis [212], on the other hand, requires tremendous computational effort in preparing features from texts for similarity measurement. The lack of emphasis on indirect relations is also evident in existing techniques. Many relation acquisition techniques in information extraction acquire semantic relations with the guidance of named entities [229]. Relation acquisition techniques which require named entities have restricted applicability since many domain terms with important relations cannot be easily categorised. In addition, the common practice of extracting triples using only patterns and grammatical structures tends to disregard relations between syntactically unrelated terms. In view of the shortcomings of conventional techniques, there is a growing trend in relation acquisition which favours the exploration of rich, heterogeneous Web data over the use of static, rare background knowledge. SCARLET [213], which stemmed from a work in ontology matching, follows this paradigm by harvesting online ontologies on the Semantic Web to discover relations between concepts.
Sumida et al. [245] developed a technique for extracting a large set of hyponymy relations in Japanese using the hierarchical structures of Wikipedia. There is also a group of researchers who employ Web documents as input for relation acquisition [115]. Similar to the conventional techniques, this group of work still relies on the ubiquitous WordNet and other domain lexicons for determining the proper level of abstraction and the labelling of relations between the terms extracted from Web documents. Pei et al. [196] employed predefined local (i.e. WordNet) and online ontologies to name the unlabelled associations between concepts in Wikipedia. The labels are acquired through a mapping process which attempts to find lexical matches for Wikipedia concepts in the predefined ontologies. The obvious shortcomings include the inability to handle complex and new terms which do not have lexical matches in the predefined ontologies.

8.3 A Hybrid Technique for Relation Acquisition

Figure 8.1: An overview of the proposed relation acquisition technique. The main phases are term mapping and term resolution, represented by black rectangles. The three steps involved in resolution are simplification, disambiguation and inference. The techniques represented by the white rounded rectangles were developed by the authors, while existing techniques and resources are shown using grey rounded rectangles.

The proposed relation acquisition technique is composed of two phases, namely, term mapping and term resolution.

Algorithm 8 termMap(t, WT, M, root, iteration)
1: rslt := map(t)
2: if iteration equals to 1 then
3:   if rslt equals to undef then
4:     if t is multi-word then return composite
5:     else return non-existent
6:   else if rslt equals to Nt = (Vt, Et) ∧ Pt ≠ φ then
7:     return ambiguous
8:   else if rslt equals to Nt = (Vt, Et) ∧ Pt = φ then
9:     add neighbourhood Nt to the subgraph WT and iteration ← iteration + 1
10:    for each u ∈ Vt where (t, u) ∈ Ht ∪ At do
11:      termMap(u, WT, M, root, iteration)
12:    M ← M ∪ {t}
13:    return mapped
14: else if iteration more than 1 then
15:   if rslt equals to Nt = (Vt, Et) ∧ Pt = φ then
16:     add neighbourhood Nt to the subgraph WT and iteration ← iteration + 1
17:     for each u ∈ Vt where (t, u) ∈ Ht do
18:       if u not equal to root then termMap(u, WT, M, root, iteration)
19: else return //all paths from the origin t will arrive at the root
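The recursion in Algorithm 8 can be rendered in a few lines once the Wikipedia lookup is abstracted away; in the sketch below the function neighbourhood(t) and the toy link dictionary are hypothetical stand-ins for real queries against Wikipedia, and the return values mirror the composite/non-existent/ambiguous/mapped outcomes explained in the following paragraphs:

# toy background knowledge: topic -> (hierarchical links, associative links, polysemous links)
LINKS = {
    "baking powder": (["leavening agents"], [], []),
    "leavening agents": (["food ingredients"], [], []),
    "food ingredients": (["main topic classifications"], [], []),
    "main topic classifications": ([], [], []),
}
ROOT = "main topic classifications"

def neighbourhood(t):
    return LINKS.get(t)   # None plays the role of 'undef'

def term_map(t, subgraph, mapped, iteration=1):
    rslt = neighbourhood(t)
    if rslt is None:
        return "composite" if " " in t else "non-existent"
    hier, assoc, poly = rslt
    if iteration == 1 and poly:
        return "ambiguous"
    # follow hierarchical (and, on the first call, associative) links towards the root
    for u in hier + (assoc if iteration == 1 else []):
        subgraph.setdefault(t, set()).add(u)
        if u != ROOT:
            term_map(u, subgraph, mapped, iteration + 1)
    if iteration == 1:
        mapped.add(t)
    return "mapped"

subgraph, mapped = {}, set()
print(term_map("baking powder", subgraph, mapped))   # mapped
print(subgraph)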
The querying aspect is defined as the function map(t), which finds an equivalent topic u ∈ V for term t, and returns the closed neighbourhood Nt:

map(t) = { Nt = (Vt, Et)   if ∃u ∈ V, u ≡ t
         { undef           otherwise                (8.1)

The neighbourhood for term t is denoted as (Vt, Et) where Et = {(t, y) : (t, y) ∈ Ht ∪ At ∪ Pt ∧ y ∈ Vt} and Vt is the set of vertices in the neighbourhood. The sets Ht, At and Pt contain the hierarchical, associative and polysemous links which connect term t to its adjacent terms y ∈ Vt. The process of term mapping is summarised in Algorithm 8. The term mapper in Algorithm 8 is invoked once for every t ∈ T. The term mapper ceases the recursion upon encountering the base case, which consists of the root vertices of Wikipedia (e.g. "Main topic classifications"). An input term t ∈ T which is traced to the root vertex is considered as successfully mapped, and is moved from set T to set M. Figure 8.2(a) shows the subgraph WT created for the input set T={"baking powder", "whole wheat flour"}. In reality, many terms cannot be straightforwardly mapped because they do not have lexically equivalent topics in W due to (1) the non-exhaustive coverage of Wikipedia, (2) the tendency to modify terms for domain-specific uses, and (3) the polysemous nature of certain terms. The term mapper in Algorithm 8 returns different values, namely, composite, non-existent and ambiguous, to indicate the causes of mapping failures. The term resolution phase resolves mapping failures through the iterative process of lexical simplification, word disambiguation and association inference. Upon the completion of mapping and resolution of all input terms, any direct or indirect relations between the mapped terms t ∈ M can be identified by finding paths which connect them in the subgraph WT. Finally, we devise a 2-step technique to transform the subgraph WT into a lightweight domain ontology. Firstly, we identify the nearest common ancestor (NCA) for the mapped terms. Our simple algorithm for finding the NCA is presented in Algorithm 9. The discussion of more complex algorithms [21, 22] for finding the NCA is beyond the scope of this chapter. Secondly, we identify all directed paths in WT which connect the mapped terms to the new root NCA and use those paths to form the final lightweight domain ontology. The lightweight domain ontology for the subgraph WT in Figure 8.2(a) is shown in Figure 8.2(b).

Algorithm 9 findNCA(M, WT)
 1: initialise commonAnc = ∅, ancestors = ∅, continue = true
 2: for each m ∈ M do
 3:   Nm := map(m)
 4:   ancestor := {v : v ∈ Vm ∧ (m, v) ∈ Hm ∪ Am}
 5:   ancestors ← ancestors ∪ ancestor
 6: while continue equals true do
 7:   for each a ∈ ancestors do
 8:     initialise pthCnt = 0, sumD = 0
 9:     for each m ∈ M do
10:       dist := shortestDirectedPath(m, a, WT)
11:       if dist not infinite then
12:         pthCnt ← pthCnt + 1 and sumD ← sumD + dist
13:     if pthCnt equals |M| then
14:       commonAnc ← commonAnc ∪ {(a, sumD)}
15:   if commonAnc not equals ∅ then
16:     continue = false
17:   else
18:     initialise newAncestors = ∅
19:     for each a ∈ ancestors do
20:       Na := map(a)
21:       ancestor := {v : v ∈ Va ∧ (a, v) ∈ Ha ∪ Aa}
22:       newAncestors ← newAncestors ∪ ancestor
23:     ancestors ← newAncestors
24: return nca where (nca, dist) ∈ commonAnc and dist is the minimum distance

We discuss the details of the three parts of term resolution in the following three subsections.
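To make the mapping step concrete, the following is a minimal, self-contained sketch (in Python, not the system's Perl implementation) of the map() lookup in Equation 8.1 and the outcome classification used at the first iteration of Algorithm 8. The toy graph, its topic names and its link sets are invented stand-ins for Wikipedia, and the recursive subgraph construction and tracing to the root vertex are omitted; it illustrates only how a term is judged mapped, composite, non-existent or ambiguous:

# Minimal sketch (not the thesis implementation) of the map() lookup in
# Equation 8.1 and the outcome classification used by Algorithm 8. Wikipedia
# is stood in for by a small hand-made graph; the topic names, link sets and
# root label are illustrative assumptions only.
#   "H" = hierarchical links (classification scheme)
#   "A" = associative links ("See Also" sections)
#   "P" = polysemous links (disambiguation pages)
TOY_WIKI = {
    "baking powder":     {"H": ["leavening agents"], "A": [], "P": []},
    "whole wheat flour": {"H": ["flour"], "A": ["whole grain"], "P": []},
    "leavening agents":  {"H": ["food ingredients"], "A": [], "P": []},
    "flour":             {"H": ["food ingredients"], "A": [], "P": []},
    "whole grain":       {"H": ["cereals"], "A": [], "P": []},
    "food ingredients":  {"H": ["main topic classifications"], "A": [], "P": []},
    "cereals":           {"H": ["main topic classifications"], "A": [], "P": []},
    "pepper":            {"H": [], "A": [], "P": ["black pepper", "pepper (band)"]},
    "main topic classifications": {"H": [], "A": [], "P": []},
}

def wiki_map(term):
    """Equation 8.1: return the closed neighbourhood of a lexically matching
    topic, or None (undef) when no equivalent topic exists."""
    return TOY_WIKI.get(term)

def classify(term):
    """Reproduce the return values of Algorithm 8 for a first-pass mapping."""
    nbhd = wiki_map(term)
    if nbhd is None:
        # Unmapped multi-word terms are composite; single words are non-existent.
        return "composite" if " " in term else "non-existent"
    if nbhd["P"]:              # non-empty set of polysemous links
        return "ambiguous"
    return "mapped"

if __name__ == "__main__":
    for t in ["baking powder", "whole wheat flour", "pepper",
              "100g baby spinach", "conchiglioni"]:
        print(f"{t!r:25} -> {classify(t)}")

In the full algorithm, a mapped term's hierarchical and associative neighbours are themselves mapped recursively until a root vertex such as "Main topic classifications" is reached, which is what builds the subgraph WT used by Algorithm 9.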
8.3.1 Lexical Simplification

The term mapper in Algorithm 8 returns the composite value to indicate the inability to map a composite term (i.e. a multi-word term). Composite terms which have many modifiers tend to face difficulty during term mapping due to the absence of lexically equivalent topics in W. To address this, we designed a lexical simplification step to reduce the lexical complexity of composite terms in a bid to increase their chances of re-mapping. A composite term is comprised of a head noun altered by some pre-modifiers (e.g. adjectives and nouns) or post-modifiers (e.g. prepositional phrases). These modifiers are important in clarifying or limiting the extent of the semantics of the terms in a particular context. For instance, the modifier "one cup" as in "one cup whole wheat flour" is crucial for specifying the amount of "whole wheat flour" required for a particular pastry. However, the semantic diversity of terms created by certain modifiers is often unnecessary in a larger context. Our lexical simplifier makes use of this fact to reduce the complexity of a composite term for re-mapping.

Figure 8.2: Figure 8.2(a) shows the subgraph WT constructed for T={"baking powder", "whole wheat flour"} using Algorithm 8, which is later pruned to produce the lightweight ontology in Figure 8.2(b). In Figure 8.2(a), the dotted arrows represent additional hierarchical links from each vertex, and the only associative link is between "whole wheat flour" and "whole grain". In Figure 8.2(b), "Food ingredients" is the NCA.

Figure 8.3: The computation of mutual information for all pairs of contiguous constituents of the composite terms "one cup whole wheat flour" and "salt to taste".

The lexical simplification step breaks down a composite term into two structurally coherent parts, namely, an optional constituent and a mandatory constituent. A mandatory constituent is composed of, but not limited to, the head noun of the composite term, and has to be in common use in the language independent of the optional constituent. The lexical simplifier then finds the least dependent pair as the ideally decomposed constituents. The dependencies are measured by estimating the mutual information of all contiguous constituents of a term. A term with n words has n−1 possible pairs, denoted as <x1, y1>, ..., <x(n−1), y(n−1)>. The mutual information for each pair <x, y> of term t is computed as MI(x, y) = f(t)/(f(x)f(y)), where f is a frequency measure. In a previous work [278], we utilise the page count returned by Web search engines to compute the relative frequency required for mutual information. Given that Z = {t, x, y}, the relative frequency for each z ∈ Z is computed as f(z) = (nz/nZ) e^(−nz/nZ), where nz is the page count returned by Web search engines and nZ = Σ(u∈Z) nu. Figure 8.3 shows an example of finding the least dependent constituents of two complex terms. Upon identifying the two least dependent constituents, we re-map the mandatory portion. To retain the possibly significant semantics delivered by the modifiers, we also attempt to re-map the optional constituents. If the decomposed constituents are in turn not mapped, another iteration of term resolution is performed. Unrelated constituents will be discarded.
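A minimal sketch of the split-selection step just described follows (Python rather than the system's Perl, with invented page counts standing in for Web search engine queries). Every contiguous <x, y> split of a composite term is scored with MI(x, y) = f(t)/(f(x)f(y)) over the relative frequencies f(z) = (nz/nZ)e^(−nz/nZ), and the least dependent pair is returned; treating the right-hand part as the mandatory constituent is a simplifying assumption of this example:

import math

# Illustrative sketch only. Real page counts come from a Web search engine;
# the numbers below are made-up stand-ins chosen to exercise the formulas.
FAKE_PAGE_COUNTS = {
    "one cup whole wheat flour": 900,
    "one": 9_000_000,
    "cup whole wheat flour": 1_200,
    "one cup": 2_000_000,
    "whole wheat flour": 600_000,
    "one cup whole": 3_000,
    "wheat flour": 800_000,
    "one cup whole wheat": 2_500,
    "flour": 5_000_000,
}

def relative_freq(term, group_counts):
    """f(z) = (n_z / n_Z) * exp(-n_z / n_Z) over the group Z = {t, x, y}."""
    n_z = group_counts[term]
    n_Z = sum(group_counts.values())
    return (n_z / n_Z) * math.exp(-n_z / n_Z)

def mutual_information(t, x, y, counts):
    """MI(x, y) = f(t) / (f(x) f(y)) using the relative frequencies above."""
    group = {z: counts[z] for z in (t, x, y)}
    return relative_freq(t, group) / (relative_freq(x, group) * relative_freq(y, group))

def least_dependent_split(term, counts):
    """Return the contiguous <x, y> split of `term` with the lowest MI."""
    words = term.split()
    splits = [(" ".join(words[:i]), " ".join(words[i:])) for i in range(1, len(words))]
    return min(splits, key=lambda xy: mutual_information(term, xy[0], xy[1], counts))

if __name__ == "__main__":
    x, y = least_dependent_split("one cup whole wheat flour", FAKE_PAGE_COUNTS)
    print(f"optional constituent: {x!r}, mandatory constituent: {y!r}")

With the stand-in counts above, the split chosen for "one cup whole wheat flour" is "one cup" and "whole wheat flour", which matches the decomposition discussed in the text; with real page counts the outcome depends entirely on the search engine statistics.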
To decide whether a constituent is unrelated to the domain, we define the distance of a constituent with respect to the set of mapped terms M as:

δ({x, y}, M) = ( Σ(m∈M) noW({x, y}, m) ) / |M|                (8.2)

where noW(a, b) is a measure of geodesic distance between topics a and b based on Wikipedia, developed by Wong et al. [276] and known as n-degree of Wikipedia (noW). A constituent is discarded if δ({x, y}, M) > τ and the current set of mapped terms is not empty, i.e. |M| ≠ 0. The threshold τ = δ(M) + σ(M), where δ(M) and σ(M) are the average and the standard deviation of the intra-group distance of M.

8.3.2 Word Disambiguation

The term mapping phase in Algorithm 8 returns the ambiguous value if a term t has a non-empty set of polysemous links Pt in its neighbourhood. In such cases, the terms are considered as ambiguous and cannot be directly mapped. To address this, we include a word disambiguation step which automatically resolves ambiguous terms using noW [276]. Since all input terms in T belong to the same domain of interest, the word disambiguator finds the proper senses to replace the ambiguous terms by virtue of the senses' relatedness to the already mapped terms. Senses which are highly related to the mapped terms have lower noW values. For example, the term "pepper" is considered as ambiguous since its neighbourhood contains a non-empty set Ppepper with numerous polysemous links pointing to various senses in the food, music and sports domains. If the term "pepper" is provided as input together with terms such as "vinegar" and "garlic", we can eliminate all semantic categories except food.

Figure 8.4: A graph showing the distribution of noW distance and the stepwise difference for the sequence of word senses for the term "pepper". The set of mapped terms is M={"fettuccine", "fusilli", "tortellini", "vinegar", "garlic", "red onion", "coriander", "maple syrup", "whole wheat flour", "egg white", "baking powder", "buttermilk"}. The line "stepwise difference" shows the ∆i−1,i values. The line "average stepwise difference" is the constant value µ∆. Note that the first sense s1 is located at x = 0.

Each ambiguous term t has a set of senses St = {s : s ∈ Vt ∧ (t, s) ∈ Pt}. Equation 8.2, denoted as δ(s, M), is used to measure the distance between a sense s ∈ St and the set of mapped terms M. The senses are then sorted into a list (s1, ..., sn) in ascending order according to their distance from the mapped terms. The smaller the subscript, the smaller the distance, and therefore, the closer the sense is to the domain in consideration. An interesting observation is that many senses of an ambiguous term are in fact minor variations belonging to the same semantic category (i.e. paradigm). Referring back to our example term "pepper", within the food domain alone, multiple possible senses exist (e.g. "sichuan pepper", "bell pepper", "black pepper"). While these senses have their intrinsic differences, they are paradigmatically substitutable for one another. Using this property, we devise a sense selection mechanism to identify suitable paradigms covering highly related senses as substitutes for the ambiguous terms. The mechanism computes the difference in noW value as ∆i−1,i = δ(si, M) − δ(si−1, M) for 2 ≤ i ≤ n between every two consecutive senses. We currently employ the average stepwise difference of the sequence as the cutoff point. The average stepwise difference for a list of n senses is µ∆ = ( Σ(i=2..n) ∆i−1,i ) / (n − 1). Finally, the first k senses in the sequence with ∆i−1,i < µ∆ are accepted as belonging to a single paradigm for replacing the ambiguous term.
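A minimal illustration of this sense selection mechanism is sketched below (Python, not the system's Perl; the δ(s, M) values for "pepper" are invented stand-ins rather than real noW distances computed over Wikipedia's category graph). The senses are ranked by their distance to the mapped set, the stepwise differences and their average µ∆ are computed, and the leading senses whose step stays below µ∆ are kept:

# Minimal sketch (not the thesis code) of the sense selection mechanism for an
# ambiguous term. The distances below are invented stand-ins; in the real
# system they are derived from shortest paths over Wikipedia's category graph.
def stepwise_selection(sense_distances):
    """Given {sense: delta(sense, M)}, keep the leading senses whose stepwise
    difference stays below the average stepwise difference mu_delta."""
    # Sort senses by their distance to the mapped set M (ascending).
    ranked = sorted(sense_distances.items(), key=lambda kv: kv[1])
    senses = [s for s, _ in ranked]
    dists = [d for _, d in ranked]
    if len(senses) < 2:
        return senses
    # delta_{i-1,i} for consecutive senses, and the average cutoff mu_delta.
    steps = [dists[i] - dists[i - 1] for i in range(1, len(dists))]
    mu_delta = sum(steps) / len(steps)
    # Accept the closest sense, then keep adding while the step stays below mu_delta.
    selected = [senses[0]]
    for sense, step in zip(senses[1:], steps):
        if step >= mu_delta:
            break
        selected.append(sense)
    return selected

if __name__ == "__main__":
    # Hypothetical delta(s, M) values for senses of "pepper" against a food-domain M.
    pepper_senses = {
        "black pepper": 0.30, "allspice": 0.35, "melegueta pepper": 0.38,
        "cubeb": 0.42, "pepper (band)": 0.92, "dr pepper": 1.10,
    }
    print(stepwise_selection(pepper_senses))

With these hypothetical distances the sketch keeps the four food senses and drops the music and beverage senses, mirroring the kind of outcome shown in Figure 8.4.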
Using this mechanism, we have reduced the scope of the term "pepper" to only the food domain, out of the many senses across domains such as music (e.g. "pepper (band)") and beverage (e.g. "dr pepper"). In our example in Figure 8.4, the ambiguous term "pepper" is replaced by {"black pepper", "allspice", "melegueta pepper", "cubeb"}. These k = 4 word senses are selected as replacements since the stepwise difference at point i = 5, ∆4,5 = 0.5, exceeds µ∆ = 0.2417.

8.3.3 Association Inference

Terms that are labelled as non-existent by Algorithm 8 simply do not have any lexical matches on Wikipedia. We propose to use cluster analysis to infer potential associations for such non-existent terms. We employ our term clustering algorithm with featureless similarity measures, known as Tree-Traversing Ant (TTA) [276]. TTA is a hybrid algorithm inspired by ant-based methods and hierarchical clustering which utilises two featureless similarity measures, namely, Normalised Google Distance (NGD) [50] and noW. Unlike conventional clustering algorithms which involve feature extraction and selection, terms are automatically clustered using TTA based on their usage prevalence and co-occurrence on the Web. In this step, we perform term clustering on the non-existent terms together with the already mapped terms in M to infer hidden associations. The association inference step is based on the premise that terms grouped into similar clusters are bound by some common dominant properties. By inference, any non-existent terms which appear in the same clusters as the mapped terms should have similar properties. The TTA returns a set of term clusters C = {C1, ..., Cn} upon the completion of term clustering for each set of input terms. Each Ci ∈ C is a set of related terms as determined by TTA. Figure 8.5 shows the results of clustering the non-existent term "conchiglioni" with 14 mapped terms. The output is a set of three clusters {C1′, C2, C3}. Next, we acquire the parent topics of all mapped terms located in the same cluster as the non-existent term by calling the mapping function in Equation 8.1. We refer to such a cluster as the target cluster. These parent topics, represented as the set R, constitute the potential topics which may be associated with the non-existent term. In our example in Figure 8.5, the target cluster is C1′, and the elements of set R are {"pasta", "pasta", "pasta", "italian cuisine", "sauces", "cuts of pork", "dried meat", "italian cuisine", "pork", "salumi"}.

Figure 8.5: The result of clustering the non-existent term "conchiglioni" and the mapped terms M={"fettuccine", "fusilli", "tortellini", "vinegar", "garlic", "red onion", "coriander", "maple syrup", "whole wheat flour", "egg white", "baking powder", "buttermilk", "carbonara", "pancetta"} using TTA.

We devise a prevailing parent selection mechanism to identify the most suitable parent in R to which we attach the non-existent term. The prevailing parent is determined by assigning a weight to each parent r ∈ R, and ranking the parents according to their weights. Given the non-existent term t and a set of parents R, the prevailing parent weight ρr, where 0 ≤ ρr < 1, for each unique r ∈ R is defined as

ρr = common(r) · sim(r, t) · subsume(r, t) · δr

where sim(a, b) is given by 1 − NGD(a, b)θ, and NGD(a, b) is the Normalised Google Distance [50] between a and b. θ is a constant within the range (0, 1] for adjusting the NGD distance. The function common(r) = ( Σ(q∈R, q=r) 1 ) / |R| determines the relative number of occurrences of r in the set R. δr = 1 if subsume(r, t) > subsume(t, r), and δr = 0 otherwise. The subsumption measure subsume(x, y) [77] is the probability of x given y, computed as n(x, y)/n(y), where n(x, y) and n(y) are page counts obtained from Web search engines. This measure is used to quantify the extent of term x being more general than term y. The higher the subsumption value, the more general term x is with respect to y. Upon ranking the unique parents in R based on their weights, we select the prevailing parent r as the one with the largest ρ. A link is then created for the non-existent term t to hierarchically relate it to r.
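As an illustration only (Python, not the system's Perl; the NGD values and page counts below are invented stand-ins for Web search statistics, and θ = 0.5 is an arbitrary choice), the following sketch ranks candidate parents by ρr = common(r) · sim(r, t) · subsume(r, t) · δr and returns the prevailing parent for a non-existent term:

from collections import Counter

# Illustrative sketch only. Page counts and NGD values normally come from Web
# search engines; every number used in the example is an invented stand-in.
def subsume(x, y, pair_counts, counts):
    """subsume(x, y) = n(x, y) / n(y): probability of x given y from page counts."""
    return pair_counts[frozenset((x, y))] / counts[y]

def prevailing_parent(term, parents, ngd, pair_counts, counts, theta=0.5):
    """Rank each unique parent r in R by rho_r and return the best one."""
    occurrences = Counter(parents)
    best, best_rho = None, -1.0
    for r, freq in occurrences.items():
        common = freq / len(parents)                    # relative frequency of r in R
        sim = 1.0 - ngd[frozenset((r, term))] * theta   # sim(r, t) = 1 - NGD(r, t) * theta
        forward = subsume(r, term, pair_counts, counts)
        backward = subsume(term, r, pair_counts, counts)
        delta = 1.0 if forward > backward else 0.0      # r must be the more general term
        rho = common * sim * forward * delta
        if rho > best_rho:
            best, best_rho = r, rho
    return best, best_rho

if __name__ == "__main__":
    term = "conchiglioni"
    R = ["pasta", "pasta", "pasta", "italian cuisine", "sauces"]
    ngd = {frozenset(("pasta", term)): 0.2,
           frozenset(("italian cuisine", term)): 0.4,
           frozenset(("sauces", term)): 0.7}
    counts = {"pasta": 1_000_000, "italian cuisine": 2_000_000,
              "sauces": 3_000_000, term: 50_000}
    pair_counts = {frozenset(("pasta", term)): 40_000,
                   frozenset(("italian cuisine", term)): 20_000,
                   frozenset(("sauces", term)): 5_000}
    print(prevailing_parent(term, R, ngd, pair_counts, counts))

With these stand-in numbers the most frequent and most subsuming parent, "pasta", receives the largest weight and would become the hierarchical parent of the non-existent term.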
8.4 Initial Experiments and Discussions

Figure 8.6: The results of relation acquisition using the proposed technique for the genetics and the food domains. The labels "correctly xxx" and "incorrectly xxx" represent the true positives (TP) and false positives (FP). Precision is computed as TP/(TP + FP).

We experimented with the proposed technique shown in Figure 8.1 using two manually-constructed datasets, namely, a set of 11 terms in the genetics domain, and a set of 31 terms in the food domain. The system performed the initial mappings of the input terms at level 0. This resulted in 6 successfully mapped terms and 5 unmapped composite terms in the genetics domain. As for the terms in the food domain, 14 were mapped, 16 were composite and 1 was non-existent. At level 1, the 5 composite terms in the genetics domain were decomposed into 10 constituents, of which 8 were remapped and 2 required further level 2 resolution. For the food domain, the 16 composite terms were decomposed into 32 constituents at level 1, of which 10, 5 and 3 were still composite, non-existent and discarded, respectively. Together with the successfully clustered non-existent term and the 14 remapped constituents, there were a total of 15 remapped terms at level 1. Figure 8.6 summarises the experiment results. Overall, the system achieved 100% precision in the aspects of term mapping, lexical simplification and word disambiguation at all levels using the small set of 11 terms in the genetics domain, as shown in Figure 8.6. As for the set of food-related terms, there was one false positive (i.e. incorrectly mapped) involving the composite term "100g baby spinach", which resulted in an 80% precision at level 2. At level 1, this composite term was decomposed into the appropriate constituents "100g" and "baby spinach". At level 2, the term "baby spinach" was further decomposed and its constituent "spinach" was successfully remapped. The constituent "baby" in this case refers to the adjectival sense of "comparatively little". However, the modifier "baby" was inappropriately remapped and attached to the concept of "infant".

Figure 8.7: The lightweight domain ontologies generated using the two sets of input terms: (a) the lightweight domain ontology for genetics constructed using 11 terms; (b) the lightweight domain ontology for food constructed using 31 terms. The important vertices (i.e. NCAs, input terms, vertices with degree more than 3) have darker shades. The concepts genetics and food in the center of the graphs are the NCAs. All input terms are located along the side of the graph.
The lack of information on polysemes and synonyms for basic English words is the main cause of this problem. In this regard, we are planning to incorporate dynamic linguistic resources such as Wiktionary to complement the encyclopaedic nature of Wikipedia. Other established, static resources such as WordNet can also be used as a source of basic English vocabulary. Moreover, the incorporation of such complementary resources can assist in retaining and capturing additional semantics of complex terms by improving the mapping of constituents such as "dried" and "sliced". General words which act as modifiers in composite terms often do not have corresponding topics in Wikipedia, and are usually unable to satisfy the relatedness requirement outlined in Section 8.3.1. Such constituents are currently ignored, as shown by the high number of discarded constituents at level 2 in Figure 8.6. Moreover, the clustering of terms to discover new associations is only performed at level 1, and non-existent terms at level 2 and beyond are currently discarded. Upon obtaining the subgraphs WT for the two input sets, the system finds the corresponding nearest common ancestors. The NCAs for the genetics-related and the food-related terms are genetics and food, respectively. Using these NCAs, our system constructed the corresponding lightweight domain ontologies as shown in Figure 8.7. A detailed account of this experiment is available to the public1.

8.5 Conclusion and Future Work

Acquiring semantic relations is an important part of ontology learning. Many existing techniques face difficulty in extending to different domains, disregard implicit and indirect relations, and are unable to handle relations between composite, ambiguous and non-existent terms. We presented a hybrid technique which combines lexical simplification, word disambiguation and association inference for acquiring semantic relations between potentially composite and ambiguous terms using only dynamic Web data (i.e. Wikipedia and Web search engines). During our initial experiment, the technique demonstrated the ability to handle terms from different domains, to accurately acquire relations between composite and ambiguous terms, and to infer relations between terms which do not exist in Wikipedia. The lightweight ontologies discovered using this technique are a valuable resource to complement other techniques for constructing full-fledged ontologies. Our future work includes the diversification of domain and linguistic knowledge by incorporating online dictionaries to support general words not available on Wikipedia. Evaluation using larger datasets, and the study of the effect of clustering words beyond level 1, is also required.

1 http://explorer.csse.uwa.edu.au/research/sandbox_evaluation.pl

8.6 Acknowledgement

This research is supported by the Australian Endeavour International Postgraduate Research Scholarship, the DEST (Australia-China) Grant, and the Interuniversity Grant from the Department of Chemical Engineering, Curtin University of Technology.

CHAPTER 9 Implementation

"Thinking about information overload isn't accurately describing the problem; thinking about filter failure is." - Clay Shirky, the Web 2.0 Expo (2008)

The focus of this chapter is to illustrate the advantages of the seamless and automatic construction of term clouds and lightweight ontologies from text documents.
An application is developed to assist in the skimming and scanning of large amounts of news articles across different domains, including technology, medicine and economics. First, the implementation details of the proposed techniques described in the previous six chapters are provided. Second, three representative use cases are described to demonstrate our powerful interface for assisting document skimming and scanning. The details on how term clouds and lightweight ontologies can be used for this purpose are provided.

9.1 System Implementation

This section presents the implementation details of the core techniques discussed in the previous six chapters (i.e. Chapters 3-8). Overall, the proposed ontology learning system is implemented as a Web application hosted at http://explorer.csse.uwa.edu.au/research/. The techniques are developed entirely using the Perl programming language. The availability and reusability of a wide range of external modules on the Comprehensive Perl Archive Network (CPAN) for text mining, natural language processing, statistical analysis, and other Web services makes Perl the ideal development platform for this research. Moreover, the richer and more consistent regular expression syntax in Perl provides a powerful tool for manipulating and processing texts. The Prefuse visualisation toolkit1, written in the Java programming language, and the Perl module SVGGraph2 for creating Scalable Vector Graphics (SVG) graphs are also used for visualisation purposes. The CGI::Ajax3 module is employed to enable the system's online interfaces to asynchronously access backend Perl modules. The use of Ajax improves interactivity, bandwidth usage and load time by allowing back-end modules to be invoked and data to be returned without interfering with the interface behaviour. Overall, the Web application comprises a suite of modules with about 90,000 lines of properly documented Perl code.

1 http://prefuse.org/
2 http://search.cpan.org/dist/SVGGraph-0.07/
3 http://search.cpan.org/dist/CGI-Ajax-0.707/

Figure 9.1: The online interface for the HERCULES module. (a) The screenshot of a webpage containing a short abstract of a journal article, hosted at http://www.ncbi.nlm.nih.gov/pubmed/7602115; the relevant content, which is the abstract, extracted by HERCULES is shown in Figure 9.1(b). (b) The input section of the interface algorithm_hercules.pl shows the HTML source code of the webpage in Figure 9.1(a); the relevant content extracted from the HTML source by the HERCULES module is shown in the results section. The process log, not included in this figure, is also available in the results section.

Figure 9.2: The input section of the interface algorithm_issac.pl shows the error sentence "Susan's imabbirity to Undeerstant the msg got her INTu trubble.". The correction provided by ISSAC is shown in the results section of the interface. The process log is also provided through this interface. Only a small portion of the process log is shown in this figure.

The implementation details of the system's modules are as follows:

• The relevant content extraction technique, described in Chapter 6, is implemented as the HERCULES module that can be accessed and tested via the online interface algorithm_hercules.pl. The HERCULES module uses only regular expressions to implement the set of heuristic rules described in Chapter 6 for removing HTML tags and other non-content elements.
Figure 9.1(a) shows an example webpage that has both relevant content, and boilerplates such as navigation and complex search features in the header section and other related information in the right panel. The relevant content extracted by HERCULES is shown in Figure 9.1(b).

• The integrated technique for cleaning noisy text, described in Chapter 3, is implemented as the ISSAC module accessible via the interface algorithm_issac.pl. The implementation of ISSAC uses the following Perl modules, namely, Text::WagnerFischer4 for computing the Wagner-Fischer edit distance [266], WWW::Search::AcronymFinder5 for accessing the online dictionary www.acronymfinder.com, and Text::Aspell6 for interfacing with the GNU spell checker Aspell. In addition, the Yahoo APIs for spelling suggestion7 and Web search8 are used to obtain replacement candidates, and to obtain page counts for deriving the general significance score. Figure 9.2 shows the correction by ISSAC for the noisy sentence "Susan's imabbirity to Undeerstant the msg got her INTu trubble.".

4 http://search.cpan.org/dist/Text-WagnerFischer-0.04/
5 http://search.cpan.org/dist/WWW-Search-AcronymFinder-0.01/
6 http://search.cpan.org/dist/Text-Aspell/
7 http://developer.yahoo.com/search/web/V1/spellingSuggestion.html
8 http://developer.yahoo.com/search/web/V1/webSearch.html

Figure 9.3: The online interface algorithm_unithood.pl for the unithood module. The interface shows the collocational stability of different phrases determined using unithood. The various weights involved in determining the extent of stability are also provided in these figures.

• The two measures OU and UH described in Chapter 4 are implemented as a single module called unithood that can be accessed online via the interface algorithm_unithood.pl. The unithood module also uses the Yahoo API for Web search to access page counts for estimating the collocational strength of noun phrases. Figure 9.3 shows the results of checking the collocational strength of the phrases "Centers for Disease Control and Prevention" and "Drug Enforcement Administration and Federal Bureau of Investigation". As mentioned in Chapter 4, phrases containing both prepositions and conjunctions can be relatively difficult to deal with. The unithood module using the UH measure automatically decides that the second phrase "Drug Enforcement Administration and Federal Bureau of Investigation" does not form a stable noun phrase, as shown in Figure 9.3. The decision is correct considering that the unstable phrase can refer to two separate entities in the real world.

Figure 9.4: The online interfaces for querying the virtual and local corpora created using the SPARTAN module. (a) The data_virtualcorpus.pl interface for querying pre-constructed virtual corpora by SPARTAN. (b) The data_localcorpus.pl interface for querying pre-constructed local corpora by SPARTAN, and some other types of local corpora.

• The technique for constructing text corpora described in Chapter 6 is implemented under the SPARTAN module. As part of SPARTAN, three submodules PROSE, STEP and SLOP are implemented to filter websites, expand seed terms and localise webpage contents, respectively. No online corpus construction interface was provided for users due to the extensive storage space required for downloading and constructing text corpora.
Instead, an interface to query pre-constructed virtual and local corpora is made available via data_virtualcorpus.pl and data_localcorpus.pl, respectively. The SPARTAN module uses both the Yahoo APIs for Web search and site search9 throughout the corpus construction process. The Perl modules WWW::Wikipedia10 and LWP::UserAgent11 are used to access Wikipedia during seed term expansion and to access webpages using HTTP-style communication. Figure 9.4(a) shows the interface data_virtualcorpus.pl for querying the virtual corpora constructed using SPARTAN. Some statistics related to the virtual corpora, such as document frequency and word frequency, are provided in this interface. A simple implementation based on document frequency is also used in this interface to decide if the search term is relevant to the domain represented by the corpus or otherwise. For instance, Figure 9.4(a) shows the results of querying the virtual corpus in the medicine domain using the word "tumor necrosis factor". There are 322,065 documents that contain the word "tumor necrosis factor" out of the total 84 million in the domain corpus. There are, however, only 5 documents in the contrastive corpus that have this word. Based on these frequencies, the interface decides that "tumor necrosis factor" is relevant to the medicine domain. Figure 9.4(b) shows the interface data_localcorpus.pl for querying the localised versions of the virtual corpora, and other types of local corpora.

9 http://developer.yahoo.com/search/siteexplorer/siteexplorer.html
10 http://search.cpan.org/dist/WWW-Wikipedia-1.95/
11 http://search.cpan.org/dist/libwww-perl-5.826/lib/LWP/UserAgent.pm

Figure 9.5: Online interfaces related to the termhood module. (a) The online interface algorithm_termhood.pl accepts short text snippets as input and produces term clouds using the termhood module. (b) The online interface data_termcloud.pl for browsing pre-constructed term clouds using the termhood module; each term cloud is a summary of the important concepts of the corresponding news article. (c) The online interface data_corpus.pl summarises the text corpora available in the system for use by the termhood module.

Figure 9.6: Online interfaces related to the ARCHILES module. (a) The algorithm_nwd.pl interface for finding the semantic similarity between terms using the NWD module. (b) The algorithm_now.pl interface for finding the semantic distance between terms using the noW module. (c) The algorithm_tta.pl interface for clustering terms using the TTA module with the support of featureless similarity metrics by the NWD and noW modules.

Figure 9.7: The interface data_lightweightontology.pl for browsing pre-constructed lightweight ontologies for online news articles using the ARCHILES module.

• The two measures TH and OT described in Chapter 5 for recognising domain-relevant terms are implemented as the termhood module. An interface is created at algorithm_termhood.pl to allow users to access the termhood module online. Figure 9.5(a) shows the result of term recognition for the input sentence "Melanoma is one of the rarer types of skin cancer. Around 160,000 new cases of melanoma are diagnosed each year.". The termhood module presents the output as term clouds containing domain-relevant terms of different sizes. Larger terms assume a more significant role in representing the content of the input text.
The results section of the interface in Figure 9.5(a) also provides information on the composition of the text corpora used and the process log of text processing and term recognition. The termhood module has the option of using either the text corpora constructed through guided crawling of online news sites, the corpora (both local and virtual) built using the SPARTAN module, publicly-available collections (e.g. Reuters-21578, texts from the Gutenberg project, GENIA), or any combination thereof. Figure 9.5(c) shows the interface data_corpus.pl that summarises information about the available text corpora for use by the termhood module. A list of pre-constructed term clouds from online news articles is available for browsing at data_termcloud.pl as shown in Figure 9.5(b).

• The relation acquisition technique, described in Chapter 8, is implemented under the ARCHILES module. Due to some implementation challenges, an online interface cannot be provided for users to directly access the ARCHILES module. Nevertheless, a list of pre-constructed lightweight ontologies for online news articles is available for browsing using the interface data_lightweightontology.pl as shown in Figure 9.7. ARCHILES employs two featureless similarity measures, NWD and noW, and the TTA clustering technique described in Chapter 7 for disambiguating terms and discovering relations between unknown terms. The NWD and noW modules can be accessed online via algorithm_nwd.pl and algorithm_now.pl. The NWD module relies on the Yahoo API for Web search to access page counts for similarity estimation. The noW module uses the external Graph and WWW::Wikipedia Perl modules to simulate Wikipedia's categorical system, to compute shortest paths for deriving distance values, and to resolve ambiguous terms. The clustering technique is implemented as the TTA module with an online interface at algorithm_tta.pl. TTA uses both the noW and NWD modules to determine similarity for clustering terms.

9.2 Ontology-based Document Skimming and Scanning

The growth of textual information on the Web is a double-edged sword. On the one hand, we are blessed with unsurpassed freedom and accessibility to endless information. We all know that information is power, and so we thought the more the better. On the other hand, such an explosion of information on the Web (i.e. information explosion) can be a curse. While information has been growing exponentially since the conception of the Web, our cognitive abilities have not caught up. We have short attention spans on the Web [133], and we are slow at reading off the screen [91]. For this reason, users are finding it increasingly difficult to handle the excess amount of information being provided on a daily basis, an effect known as information overload. An interesting study at King's College London showed that information overload is actually doing more harm to our concentration than marijuana [270]. It has become apparent that "when it comes to information, sometimes less is more..." [179].

Figure 9.8: The screenshot of the aggregated news services provided by Google (the left portion of the figure) and Yahoo (the right portion of the figure) on 11 June 2009.

Figure 9.9: A splash screen on the online interface for document skimming and scanning at http://explorer.csse.uwa.edu.au/research/.

Figure 9.10: The cross-domain term cloud summarising the main concepts occurring in all the 395 articles listed in the news browser. This cloud currently contains terms in the technology, medicine and economics domains.

There are two key issues to be considered when attempting to address the problem of information overload. Firstly, it is becoming increasingly challenging for retrieval systems to locate relevant information amidst a growing Web, and secondly, users are finding it more difficult to interpret a growing amount of relevant information. While many studies have been conducted to improve the performance of retrieval systems, there is virtually no work on the issue of information interpretability. This lack of attention to information interpretability becomes obvious as we look at the way Google and Yahoo present search results, news articles and other documents to the users. At most, these systems rank the webpages for relevance and generate short snippets with keyword bolding to assist users in locating what they need. Studies [121, 11] have shown that these summaries often have poor readability and are inadequate in conveying the gist of the documents. In other words, the users would still have to painstakingly read through the documents in order to find the information they need. We take the two aggregated news services by Google and Yahoo shown in Figure 9.8 as examples to demonstrate the current lack of regard for information interpretability. The left portion of the figure shows the Google News interface, while the right portion shows the Yahoo News interface. Both interfaces are focused on the health news category. The interfaces in Figure 9.8 merely show half of all the news listed on 11 June 2009. The actual listings are considerably longer. A quick look at both interfaces would immediately reveal the time and cognitive effort that users have to invest in order to arrive at a summary of the texts or to find a particular piece of information. Over time, users adopted the techniques of skimming and scanning to keep up with such a constant flow of textual documents online [218, 186, 268].

Figure 9.11: The single-domain term cloud for the domain of medicine. This cloud summarises all the main concepts occurring in the 75 articles listed below in the news browser. Users can arrive at this single-domain cloud from the cross-domain cloud in Figure 9.10 by clicking on the [domain(s)] option in the latter.

Figure 9.12: The single-domain term cloud for the medicine domain. Users can view a list of articles describing a particular topic by clicking on the corresponding term in the single-domain cloud.

Users employ skimming to quickly identify the main ideas conveyed by a document, usually to decide if the text is interesting and whether one should read it in more detail. Scanning, on the other hand, is used to obtain specific information from a document (e.g. a particular page where a certain idea occurred). This section provides details on the use of term clouds and lightweight ontologies to aid document skimming and scanning for improving information interpretability. More specifically, term clouds and lightweight ontologies are employed to assist users in quickly identifying the overall ideas or specific information in individual documents or groups of documents. In particular, the following three cases are examined: (1) Can the users quickly guess (in 3 seconds or so) from the listing alone what the main topics of interest are across all articles for that day?
(2) Is there a better way to present the gist of individual news articles to the users other than the conventional, ineffective use of short text snippets as summaries? (3) Are there other options besides the typical [find] feature for users to quickly pinpoint a particular concept in an article or a group of articles?

Figure 9.13: The use of document term clouds and information from lightweight ontologies to summarise individual news articles: (a) abstraction of the news "Tai Chi may ease arthritis pain"; (b) abstraction of the news "Omega-3-fatty acids may slow macular disease". Based on the term size in the clouds, one can arrive at the conclusion that the news featured in Figure 9.13(b) carries more domain-relevant (i.e. medical related) content than the news in Figure 9.13(a).

Figure 9.14: The document term cloud for the news "Tai Chi may ease arthritis pain". Users can focus on a particular concept in the annotated news by clicking on the corresponding term in the document cloud.

For this purpose, an online interface for document skimming and scanning is incorporated into the Web application's homepage12. While news articles may be the focus of the current document skimming and scanning system, other text documents including product reviews, medical reports, emails and search results can equally benefit from such an automatic abstraction system. Figure 9.9 shows the splash screen of the interface for document skimming and scanning. This splash screen explains the need for better means to assist document skimming and scanning while the data (i.e. term clouds and lightweight ontologies) is loading. Figure 9.10 shows the main interface for skimming and scanning a list of news articles across different domains. The white canvas on the top right corner containing words of different colours and sizes is the cross-domain term cloud. This term cloud summarises the key concepts in all news articles across all domains listed in the news browser panel below. For instance, Figure 9.10 shows that there are 395 articles across three domains (i.e. technology, medicine and economics) listed in the news browser with a total of 727 terms in the term cloud. The solutions to the above three use cases using our ontology-based document skimming and scanning system are as follows:

12 http://explorer.csse.uwa.edu.au/research/

• Figure 9.11 shows the single-domain term cloud for summarising the key concepts in the medicine domain. This term cloud is obtained by simply selecting the medicine option in the [domain(s)] field. There are 75 articles in the news browser with a total of 136 terms in the cloud. Looking at this single-domain term cloud, one would immediately be able to conclude that some of the news articles are concerned with "diabetes", "drug", "gene", "hormone", "heart disease", "H1N1 swine flu" and so on. One can also say that "diabetes" was discussed more intensely in these articles than other topics such as "diarrhea". The users are able to grasp the gist of large groups of articles in a matter of seconds without any complex cognitive effort. Can the same be accomplished through the typical news listing and text snippets as summaries shown in Figure 9.8? The use of the cross-domain or single-domain term clouds for summarising the main topics across multiple documents addresses the first problem.
• If the users are interested in drilling down on a particular topic, they can do so by simply clicking on the terms in the cloud. A list of news articles describing the selected topic is provided in the news browser panel as shown in Figure 9.12. The context in which the selected topic exists is also provided. For instance, Figure 9.12 shows that the "diabetes" topic is mentioned in the context of "hypertension" in the news "Psoriasis linked to...". Clicking on the [back] option brings the users back to the complete listing of articles in the medicine domain as in Figure 9.11. The users can also preview the gist of a news article by simply clicking on the title in the news browser panel. Figures 9.13(a) and 9.13(b) show the document term clouds for the news "Tai Chi may ease arthritis pain" and "Omega-3-fatty acids may slow macular disease". These document term clouds summarise the content of the news articles and present the key terms in a visually appealing manner to enhance the interpretability and retention of information. The interfaces in Figures 9.13(a) and 9.13(b) also provide information derived from the corresponding lightweight ontologies. For instance, the root concept in the ontology is shown in the [this news is about] field. In the news "Tai Chi may ease arthritis pain", the root concept is "self-care". The parent concepts of the key terms in the ontology are presented as part of the field [the main concepts are]. In addition, based on the term size in the clouds, one can arrive at the conclusion that the news featured in Figure 9.13(b) carries more domain-relevant (i.e. medical related) content than the news in Figure 9.13(a). Can the users arrive at such comprehensive and abstract information regarding a document with minimal time and cognitive effort using the conventional news listing interfaces shown in Figure 9.8? The use of document term clouds and lightweight ontologies for presenting the gist of individual news articles addresses the second problem.

• The use of the following features [click for articles], [find term], [context terms] and [click to focus] helps users to locate a particular concept at different levels of granularity. At the document collection level, users can locate articles containing a particular term using the [click for articles], the [find term] or the [context terms] features. The [click for articles] feature allows users to view a list of articles (using the news browser) related to a particular topic in the cross-domain or the single-domain term cloud. The [find term] feature can be used at any time to refine and reduce the size of the cross-domain or the single-domain term cloud. Context terms are provided together with the listing of articles in the news browser when users select the [click for articles] feature. Clicking on any term under the column [context terms], as shown in Figure 9.12, will list all articles containing the selected term. At the individual document level, news articles are annotated with the key terms that occur in the document clouds to assist scanning activities. Users can employ the [click to focus] feature to pinpoint the occurrence of a particular concept in an article by clicking on the corresponding term in the document cloud. Figure 9.14 shows how a user clicked on "chronic tension headache" in the document term cloud, which triggered the auto-scrolling and highlighting of that term in the annotated news.
Can the users pinpoint a particular topic that occurred in a large document collection or a single lengthy document with minimal time and cognitive effort using the conventional interfaces shown in Figure 9.8? The various features provided by this system allow users to quickly pinpoint a particular concept, either in an article or a group of articles, to address the last problem.

9.3 Chapter Summary

This chapter provided the implementation details of the proposed ontology learning system as a Web application. The type of programming language, external tools and development environment was described. Online interfaces to several modules of the Web application were made publicly available. The benefits of using automatically-generated term clouds and lightweight ontologies for document skimming and scanning were highlighted using three use cases. It was qualitatively demonstrated that conventional news listing interfaces, unlike ontology-based document skimming and scanning, are unable to satisfy the following three common scenarios: (1) to grasp the gist of large groups of articles in a matter of seconds without any complex cognitive effort, (2) to arrive at a comprehensive and abstract overview of a document with minimal time and cognitive effort, and (3) to pinpoint a particular topic that occurred in a large document collection or a single lengthy document with minimal time and cognitive effort. In the next chapter, the research work presented throughout this dissertation is summarised. Plans for system improvement are outlined, and an outlook on future research directions in the area of ontology learning is provided.

CHAPTER 10 Conclusions and Future Work

"We can only see a short distance ahead, but we can see plenty there that needs to be done." - Alan Turing, Computing Machinery and Intelligence (1950)

Term clouds and lightweight ontologies are the key to bootstrapping the Semantic Web, creating better search engines, and providing effective document management for individuals and organisations. A major problem faced by current ontology learning systems is the reliance on rare, static background knowledge (e.g. WordNet, British National Corpus). This problem is described in detail in Chapter 1 and subsequently confirmed by the literature review in Chapter 2. Overall, this research demonstrates that the use of dynamic Web data as the sole background knowledge is a viable, long-term alternative for cross-domain ontology learning from text. This finding verifies the thesis statement in Chapter 1. In particular, four major research questions identified as part of the thesis statement are addressed in Chapters 3 to 8 with a common theme of taking advantage of the diversity and redundancy of Web data. These four problems are (1) the absence of integrated techniques for cleaning noisy data, (2) the inability of current term extraction techniques, which are heavily influenced by word frequency, to systematically explicate, diversify and consolidate their evidence, (3) the inability of current corpus construction techniques to automatically create very large, high-quality text corpora using a small number of seed terms, and (4) the difficulty of locating and preparing features for clustering and extracting relations. As a proof of concept, Chapter 9 of this thesis demonstrated the benefits of using automatically-constructed term clouds and lightweight ontologies for skimming and scanning large numbers of real-world documents.
More precisely, term clouds and lightweight ontologies are employed to assist users in quickly identifying the overall ideas or specific information in individual news articles or groups of news articles across different domains, including technology, medicine and economics. Chapter 9 also discussed the implementation details of the proposed ontology learning system.

10.1 Summary of Contributions

The major contributions to the field of ontology learning that arose from this thesis (described in Chapters 3 to 8) are summarised as follows. Chapter 3 addressed the first problem by proposing and implementing one of the first integrated techniques, called ISSAC, for cleaning noisy text. ISSAC simultaneously corrects spelling errors, expands abbreviations and restores improper casings. It was found that in order to cope with language change (e.g. the appearance of new words) and the blurring of boundaries between noises, the use of multiple dynamic Web data sources (in the form of statistical evidence from search engine page counts and online abbreviation dictionaries) was necessary. Evaluations using noisy chat records from industry demonstrated high accuracy of correction. To address the second problem, Chapter 4 first outlined two measures for determining word collocation strength during noun phrase extraction. These measures are UH, an adaptation of existing measures, and OU, a probabilistic measure. UH and OU rely on page counts from search engines to derive the statistical evidence required for measuring word collocation. It was found that the noun phrases extracted based on the probabilistic measure OU achieved the best precision compared to the heuristic measure UH. The stable noun phrases extracted using these measures constitute the input to the next stage of term recognition. Secondly, in Chapter 5, a novel probabilistic framework for recognising domain-relevant terms from stable noun phrases was developed. The framework allows different evidence to be added or removed depending on the implementation constraints and the desired term recognition output. The framework currently incorporates seven types of evidence which are formalised using word distribution models into a new probabilistic measure called OT. The adaptability of this framework is demonstrated through the close correlation between OT and its heuristic counterpart TH. It was concluded that OT offers the best term recognition solution (compared to three existing heuristic measures) that is both accurate and balanced in terms of recall and precision. Chapter 6 solved the third problem by introducing the SPARTAN technique for corpus construction to alleviate the dependence on manually-crafted corpora during term recognition. SPARTAN uses a probabilistic filter with statistical information gathered from search engine page counts to analyse the domain representativeness of websites for constructing both virtual and local corpora. It was found that adequately large corpora with high coverage and a specific enough vocabulary are necessary for high-performance term recognition. An elaborate evaluation proved that term recognition using SPARTAN-based corpora achieved the best precision and recall in comparison to all other corpora based on existing corpus construction techniques. Chapter 8 addressed the last problem through the proposal of a novel technique, ARCHILES.
It employs term clustering, word disambiguation and lexical simplification techniques with Wikipedia and search engines for acquiring coarse-grained semantic relations between terms. Chapter 7 discussed in detail the multi-pass clustering algorithm TTA with the featureless relatedness measures noW and NWD used by ARCHILES. It was found that the use of mutual information during lexical simplification, TTA and NWD for term clustering, and noW and Wikipedia for word disambiguation enables ARCHILES to cope with complex, uncommon and ambiguous terms during relation acquisition.

10.2 Limitations and Implications for Future Research

There are at least five interesting questions related to the proposed ontology learning system that remain unexplored. First, can the current text processing techniques, including the word collocation measures OU and UH, adapt to domains with highly complex vocabulary such as those involving biological entities (e.g. proteins, genes)? At the very least, the adaptation of existing sentence parsing and noun phrase chunking techniques will be required to meet the needs of these domains. Second, another area for future work is to incorporate sentence parsing and named entity tagging ability into the current SPARTAN-based corpus construction technique for creating annotated text corpora. Automatically-constructed annotated corpora will prove to be invaluable resources for a wide range of applications such as text categorisation and machine translation. Third, there is an increasing interest in mining opinions or sentiments from text. The two main research interests in opinion mining are the automatic building of sentiment dictionaries (typically comprising adjectives and adverbs as sentiments), and the recognition of sentiments expressed in text and their relations with other aspects of the text (e.g. who expressed the sentiment, the sentiment's target). Can the current term recognition technique using OT and TH, which focuses on domain-relevant noun phrases, be extended to handle other parts of speech for opinion mining? If yes, the proposed term recognition technique using SPARTAN-based corpora can ultimately be used to produce high-quality sentiment clouds and sentiment ontologies. Fourth, the current system lacks consideration for the temporal aspect of information, such as publication date, during the discovery of term clouds and lightweight ontologies. With the inclusion of the date factor into these abstractions, users can browse and see the evolution of important concepts and relations across different time periods. Lastly, care should be taken when interpreting the results from some of the preliminary experiments reported in this dissertation. In particular, more work is required to critically evaluate the ARCHILES and ISSAC techniques using larger datasets, and to demonstrate the significance of the results through statistical tests. For instance, the ARCHILES technique for acquiring coarse-grained relations between terms reported in this dissertation has only been tested using small datasets. Assessments using larger datasets are being planned for the near future. It would also be interesting to look at how ARCHILES can be used to complement other techniques for discovering fine-grained semantic relations. In fact, ARCHILES and all techniques reported in this dissertation are constantly undergoing further tests using real-world text from various domains.
New term clouds and lightweight ontologies are constantly being created automatically from recent news articles. These clouds and ontologies are available for browsing via our dedicated Web application1. In other words, this research, the new techniques, and the resulting Web application are subjected to continuous scrutiny and improvement in an effort to achieve better ontology learning performance and to define new application areas. In order to further demonstrate the system's overall ability at cross-domain ontology learning, practical applications using real-world text in several domains have been planned. In the long run, ontology learning research will cross paths with advances in ontology merging. As the number of automatically-created ontologies grows, the need to consolidate and merge them into one single extensive structure will arise. This thesis has only looked at the automatic learning of term clouds and ontologies from cross-domain documents in the English language. Overall, the proposed system represents only a few important steps in the vast area of ontology learning. One area of growing interest is cross-media ontology learning. While news articles may be the focus of the proposed ontology learning system, other text documents including product reviews, medical reports, financial reports, emails and search results can equally benefit from such an automatic abstraction service. The automatic generation of term clouds and lightweight ontologies using different media types such as audio (e.g. call centre recordings, interactive voice response systems) and video (e.g. teleconferencing, video surveillance) is an interesting research direction for future researchers. Another research direction that will gain greater attention in the future is cross-language ontology learning. It remains to be seen how well the proposed system can be transferred to other languages, as we cannot underestimate the level of difficulty involved in handling languages of different morphological and syntactic complexity. All in all, the suggestions and questions raised in this section provide interesting insights into future research directions in ontology learning from text.

1 http://explorer.csse.uwa.edu.au/research/

Bibliography

[1] H. Abdi. The method of least squares. In N. Salkind, editor, Encyclopedia of Measurement and Statistics. Thousand Oaks, CA, USA, 2007.
[2] L. Adamic and B. Huberman. Zipf's law and the internet. Glottometrics, 3(1):143-150, 2002.
[3] E. Adar, J. Teevan, S. Dumais, and J. Elsas. The web changes everything: Understanding the dynamics of web content. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, Barcelona, Spain, 2009.
[4] A. Agbago and C. Barriere. Corpus construction for terminology. In Proceedings of the Corpus Linguistics Conference, Birmingham, UK, 2005.
[5] A. Agustini, P. Gamallo, and G. Lopes. Selection restrictions acquisition for parsing and information retrieval improvement. In Proceedings of the 14th International Conference on Applications of Prolog, Tokyo, Japan, 2001.
[6] J. Allen. Natural Language Understanding. Benjamin/Cummings, California, 1995.
[7] G. Amati and C. van Rijsbergen. Term frequency normalization via Pareto distributions. In Proceedings of the 24th BCS-IRSG European Colloquium on Information Retrieval Research, Glasgow, UK, 2002.
[8] M. Ashburner, C. Ball, J. Blake, D. Botstein, H. Butler, M. Cherry, A. Davis, K. Dolinski, S. Dwight, and J. Eppig. Gene ontology: Tool for the unification of biology. Nature Genetics, 25(1):25-29, 2000.