Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge

Wilson Yiksen Wong
M.Sc. (Information and Communication Technology), 2005
B.IT. (HONS) (Data Communication), 2003

This thesis is presented for the degree of Doctor of Philosophy of The University of Western Australia, School of Computer Science and Software Engineering.

September 2009

To my wife Saujoe and my parents and sister

Abstract

The ability to provide abstractions of documents in the form of important concepts and their relations is a key asset, not only for bootstrapping the Semantic Web, but also for relieving us from the pressure of information overload. At present, the only viable solution for arriving at these abstractions is manual curation. In this research, ontology learning techniques are developed to automatically discover terms, concepts and relations from text documents.

Ontology learning techniques rely on extensive background knowledge, ranging from unstructured data such as text corpora to structured data such as semantic lexicons. Manually-curated background knowledge is a scarce resource for many domains and languages, and the effort and cost required to keep such resources up to date are often high. More importantly, the size and coverage of manually-curated background knowledge are often inadequate to meet the requirements of most ontology learning techniques.

This thesis investigates the use of the Web as the sole source of dynamic background knowledge across all phases of ontology learning for constructing term clouds (i.e. visual depictions of terms) and lightweight ontologies from documents. To demonstrate the significance of term clouds and lightweight ontologies, a system for ontology-assisted document skimming and scanning is developed. This thesis presents a novel ontology learning approach that is devoid of any manually-curated resources, and is applicable across a wide range of domains (the current focus is medicine, technology and economics). More specifically, this research proposes and develops a set of novel techniques that take advantage of Web data to address the following problems: (1) the absence of integrated techniques for cleaning noisy data; (2) the inability of current term extraction techniques to systematically explicate, diversify and consolidate their evidence; (3) the inability of current corpus construction techniques to automatically create very large, high-quality text corpora using a small number of seed terms; and (4) the difficulty of locating and preparing features for clustering and extracting relations.

This dissertation is organised as a series of published papers that contribute to a complete and coherent theme. The work on the individual techniques of the proposed ontology learning approach has resulted in a total of nineteen published articles: two book chapters, four journal articles, and thirteen refereed conference papers. The proposed approach makes several major contributions to each task in ontology learning.
These include: (1) a technique for simultaneously correcting noise such as spelling errors, expanding abbreviations and restoring improper casing in text; (2) a novel probabilistic measure for recognising multi-word phrases; (3) a probabilistic framework for recognising domain-relevant terms using formal word distribution models; (4) a novel technique for constructing very large, high-quality text corpora using only a small number of seed terms; and (5) novel techniques for clustering terms and discovering coarse-grained semantic relations using featureless similarity measures and dynamic Web data. In addition, a comprehensive review is included to provide background on ontology learning and recent advances in this area. The implementation details of the proposed techniques are provided at the end, together with a description of how the system is used to automatically discover term clouds and lightweight ontologies for document skimming and scanning.

Acknowledgements

First and foremost, this dissertation would not have come into being without the continuous support provided by my supervisors Dr Wei Liu and Prof Mohammed Bennamoun. Their insightful guidance, financial support and broad interest made my research journey at the School of Computer Science and Software Engineering (CSSE) an extremely fruitful and enjoyable one. I am proud to have Wei and Mohammed as my mentors and personal friends. I would also like to thank Dr Krystyna Haq, Mrs Jo Francis and Prof Robyn Owens for being there to answer my questions on general research skills and scholarships.

A very big thank you goes to the Australian Government and the University of Western Australia for sponsoring this research under the International Postgraduate Research Scholarship and the University Postgraduate Award for International Students. I am also very grateful to CSSE, and to Dr David Glance of the Centre for Software Practice (CSP), for providing me with a multitude of opportunities to pursue this research further.

I would like to thank the other members of CSSE, including Prof Rachel Cardell-Oliver, Assoc/Prof Chris McDonald and Prof Michael Wise, for their advice. My appreciation goes to my office mates Faisal, Syed, Suman and Majigaa. A special thank you to the members of CSSE's support team, namely Laurie, Ashley, Ryan, Sam and Joe, for always being there to restart the virtual machine and to fix my laptop computers after accidental spills. Not forgetting the amicable people in CSSE's administration office, namely Jen Redman, Nicola Hallsworth, Ilse Lorenzen, Rachael Offer, Jayjay Jegathesan and Jeff Pollard, for attending to my administrative and travel needs, and for making my stay at CSSE an extremely enjoyable one.

I also had the pleasure of meeting many reputable researchers during my travels, whose advice has been invaluable. To name a few: Prof Kyo Kageura, Prof Udo Hahn, Prof Robert Dale, Assoc/Prof Christian Gutl, Prof Arno Scharl, Prof Albert Yeap, Dr Timothy Baldwin, and Assoc/Prof Stephen Bird. A special thank you to the wonderful people at the Department of Information Science, University of Otago for being such gracious hosts during my visit to Dunedin. I would also like to extend my gratitude for the constant support and advice provided by researchers at the Curtin University of Technology, namely Prof Moses Tade, Assoc/Prof Hongwei Wu, Dr Nicoleta Balliu and Prof Tharam Dillon.
In addition, my appreciation goes to my previous mentors Assoc/Prof Ongsing Goh and Prof Shahrin Sahib of the Technical University of Malaysia Malacca (UTeM), and Assoc/Prof R. Mukundan of the University of Canterbury. I should also acknowledge my many friends and colleagues at the Faculty of Information and Communication Technology at UTeM. My thanks also go to the anonymous reviewers who have commented on all publications that have arisen from this thesis.

Last but not least, I will always remember the unwavering support provided by my wife Saujoe, my parents and my only sister, without which I would not have cruised through this research journey so pleasantly. Also, a special appreciation to the city of Perth for being such a nice place to live in and to undertake this research.

Contents

List of Figures
Publications Arising from this Thesis
Contribution of Candidate to Published Work

1 Introduction
   1.1 Problem Description
   1.2 Thesis Statement
   1.3 Overview of Solution
      1.3.1 Text Preprocessing (Chapter 3)
      1.3.2 Text Processing (Chapter 4)
      1.3.3 Term Recognition (Chapter 5)
      1.3.4 Corpus Construction for Term Recognition (Chapter 6)
      1.3.5 Term Clustering and Relation Acquisition (Chapters 7 and 8)
   1.4 Contributions
   1.5 Layout of Thesis

2 Background
   2.1 Ontologies
   2.2 Ontology Learning from Text
      2.2.1 Outputs from Ontology Learning
      2.2.2 Techniques for Ontology Learning
      2.2.3 Evaluation of Ontology Learning Techniques
   2.3 Existing Ontology Learning Systems
      2.3.1 Prominent Ontology Learning Systems
      2.3.2 Recent Advances in Ontology Learning
   2.4 Applications of Ontologies
   2.5 Chapter Summary

3 Text Preprocessing
   3.1 Introduction
   3.2 Related Work
   3.3 Basic ISSAC as Part of Text Preprocessing
   3.4 Enhancement of ISSAC
   3.5 Evaluation and Discussion
   3.6 Conclusion
   3.7 Acknowledgement
   3.8 Other Publications on this Topic

4 Text Processing
   4.1 Introduction
   4.2 Related Works
   4.3 A Probabilistic Measure for Unithood Determination
      4.3.1 Noun Phrase Extraction
      4.3.2 Determining the Unithood of Word Sequences
   4.4 Evaluations and Discussions
   4.5 Conclusion and Future Work
   4.6 Acknowledgement
   4.7 Other Publications on this Topic

5 Term Recognition
   5.1 Introduction
   5.2 Notations and Datasets
   5.3 Related Works
      5.3.1 Existing Probabilistic Models for Term Recognition
      5.3.2 Existing Ad-Hoc Techniques for Term Recognition
      5.3.3 Word Distribution Models
   5.4 A New Probabilistic Framework for Determining Termhood
      5.4.1 Parameters Estimation for Term Distribution Models
      5.4.2 Formalising Evidences in a Probabilistic Framework
   5.5 Evaluations and Discussions
      5.5.1 Qualitative Evaluation
      5.5.2 Quantitative Evaluation
   5.6 Conclusions
   5.7 Acknowledgement
   5.8 Other Publications on this Topic

6 Corpus Construction for Term Recognition
   6.1 Introduction
   6.2 Related Research
      6.2.1 Webpage Sourcing
      6.2.2 Relevant Text Identification
      6.2.3 Variability of Search Engine Counts
   6.3 Analysis of Website Contents for Corpus Construction
      6.3.1 Website Preparation
      6.3.2 Website Filtering
      6.3.3 Website Content Localisation
   6.4 Evaluations and Discussions
      6.4.1 The Impact of Search Engine Variations on Virtual Corpus Construction
      6.4.2 The Evaluation of HERCULES
      6.4.3 The Performance of Term Recognition using SPARTAN-based Corpora
   6.5 Conclusions
   6.6 Acknowledgement
   6.7 Other Publications on this Topic

7 Term Clustering for Relation Acquisition
   7.1 Introduction
   7.2 Existing Techniques for Term Clustering
   7.3 Background
      7.3.1 Normalised Google Distance
      7.3.2 Ant-based Clustering
   7.4 The Proposed Tree-Traversing Ants
      7.4.1 First-Pass using Normalised Google Distance
      7.4.2 n-degree of Wikipedia: A New Distance Metric
      7.4.3 Second-Pass using n-degree of Wikipedia
   7.5 Evaluations and Discussions
   7.6 Conclusion and Future Work
   7.7 Acknowledgement
   7.8 Other Publications on this Topic

8 Relation Acquisition
   8.1 Introduction
   8.2 Related Work
   8.3 A Hybrid Technique for Relation Acquisition
      8.3.1 Lexical Simplification
      8.3.2 Word Disambiguation
      8.3.3 Association Inference
   8.4 Initial Experiments and Discussions
   8.5 Conclusion and Future Work
   8.6 Acknowledgement

9 Implementation
   9.1 System Implementation
   9.2 Ontology-based Document Skimming and Scanning
   9.3 Chapter Summary

10 Conclusions and Future Work
   10.1 Summary of Contributions
   10.2 Limitations and Implications for Future Research

Bibliography

List of Figures

1.1 An overview of the five phases in the proposed ontology learning system, and how the details of each phase are outlined in certain chapters of this dissertation.

1.2 Overview of the ISSAC and HERCULES techniques used in the text preprocessing phase of the proposed ontology learning system. ISSAC and HERCULES are described in Chapters 3 and 6, respectively.

1.3 Overview of the UH and OU measures used in the text processing phase of the proposed ontology learning system. These two measures are described in Chapter 4.

1.4 Overview of the TH and OT measures used in the term recognition phase of the proposed ontology learning system. These two measures are described in Chapter 5.

1.5 Overview of the SPARTAN technique used in the corpus construction phase of the proposed ontology learning system. The SPARTAN technique is described in Chapter 6.

1.6 Overview of the ARCHILES technique used in the relation acquisition phase of the proposed ontology learning system. The ARCHILES technique is described in Chapter 8, while the TTA clustering technique and noW measure are described in Chapter 7.

2.1 The spectrum of ontology kinds, adapted from Giunchiglia & Zaihrayeu [89].

2.2 Overview of the outputs, tasks and techniques of ontology learning.

3.1 Examples of spelling errors, ad-hoc abbreviations and improper casing in a chat record.

3.2 The accuracy of basic ISSAC from previous evaluations.
3.3 The breakdown of the causes behind the incorrect replacements by basic ISSAC.

3.4 Accuracy of enhanced ISSAC over seven evaluations.

3.5 The breakdown of the causes behind the incorrect replacements by enhanced ISSAC.

4.1 The output by the Stanford Parser. The tokens in the "modifiee" column marked with squares are head nouns, and the corresponding tokens along the same rows in the "word" column are the modifiers. The first column "offset" is subsequently represented using the variable i.

4.2 The output of the head-driven noun phrase chunker. The tokens which are highlighted with a darker tone are the head nouns. The underlined tokens are the corresponding modifiers identified by the chunker.

4.3 The probabilities of the areas with darker shade are the denominators required by the evidences e1 and e2 for the estimation of OU(s).

4.4 The performance of OU (from Experiment 1) and UH (from Experiment 2) in terms of precision, recall and accuracy. The last column shows the difference between the performance of Experiments 1 and 2.

5.1 Summary of the datasets employed throughout this chapter for experiments and evaluations.

5.2 Distribution of 3,058 words randomly sampled from the domain corpus d. The line with the label "KM" is the aggregation of the individual probability of occurrence of word i in a document, 1 − P(0; αi, βi), using the K-mixture with αi and βi defined in Equations 5.21 and 5.20. The line with the label "ZM-MF" is the manually fitted Zipf-Mandelbrot model. The line labeled "RF" is the actual rate of occurrence computed as fi/F.

5.3 Parameters for the manually fitted Zipf-Mandelbrot model for the set of 3,058 words randomly drawn from d.

5.4 Distribution of the same 3,058 words as employed in Figure 5.2. The line with the label "ZM-OLR" is the Zipf-Mandelbrot model fitted using the ordinary least squares method. The line labeled "ZM-WLS" is the Zipf-Mandelbrot model fitted using the weighted least squares method, while "RF" is the actual rate of occurrence computed as fi/F.

5.5 Summary of the sum of squares of residuals, SSR, and the coefficient of determination, R2, for the regression using manually estimated parameters, parameters estimated using ordinary least squares (OLS), and parameters estimated using weighted least squares (WLS). Obviously, the smaller the SSR is, the better the fit. As for 0 ≤ R2 ≤ 1, the upper bound is achieved when the fit is perfect.

5.6 Parameters for the automatically fitted Zipf-Mandelbrot model for the set of 3,058 words randomly drawn.

5.7 Distribution of the 1,954 terms extracted from the domain corpus d sorted according to the corresponding scores provided by OT and TH. The single dark smooth line stretching from the left (highest value) to the right (lowest value) of the graph is the scores assigned by the respective measures. As for the two oscillating lines, the dark line is the domain frequencies while the light one is the contrastive frequencies.

5.8 Distribution of the 1,954 terms extracted from the domain corpus d sorted according to the corresponding scores provided by NCV and CW. The single dark smooth line stretching from the left (highest value) to the right (lowest value) of the graph is the scores assigned by the respective measures. As for the two oscillating lines, the dark line is the domain frequencies while the light one is the contrastive frequencies.

5.9 The means µ of the scores, standard deviations σ of the scores, sums of the domain frequencies and of the contrastive frequencies of all term candidates, and their ratio.

5.10 The Spearman rank correlation coefficients ρ between all possible pairs of measures under evaluation.

5.11 An example of a contingency table. The values in the cells TP, TN, FP and FN are employed to compute the precision, recall, Fα and accuracy. Note that |TC| is the total number of term candidates in the input set TC, and |TC| = TP + FP + FN + TN.

5.12 The collection of all contingency tables for all termhood measures X across all the 10 bins BjX. The first column contains the rank of the bins and the second column shows the number of term candidates in each bin. The third general column "termhood measures, X" holds all the 10 contingency tables for each measure X, organised column-wise, bringing the total number of contingency tables to 40 (i.e. 10 bins, organised in rows, by 4 measures). The structure of the individual contingency tables follows the one shown in Figure 5.11. The last column is the row-wise sums of TP + FP and FN + TN. The rows beginning from the second row until the second last are the rank bins. The last row is the column-wise sums of TP + FN and FP + TN.

5.13 Performance indicators for the four termhood measures in 10 respective bins. Each row shows the performance achieved by the four measures in a particular bin. The columns contain the performance indicators for the four measures. The notation pre stands for precision, rec is recall and acc is accuracy. We use two different α values, resulting in two F-scores, namely F0.1 and F1. The values of the performance measures with darker shades are the best performing ones.

6.1 A diagram summarising our web partitioning technique.

6.2 An illustration of an example sample space on which the probabilities employed by the filter are based. The space within the dot-filled circle consists of all webpages from all sites in J containing W. The m rectangles represent the collections of all webpages of the respective sites {u1, ..., um}. The shaded but not dot-filled portion of the space consists of all webpages from all sites in J that do not contain W. The individual shaded but not dot-filled portion within each rectangle is the collection of webpages in the respective sites ui ∈ J that do not contain W.

6.3 A summary of the number of websites returned by the respective search engines for each of the two domains. The number of common sites is also provided.

6.4 A summary of the Spearman's correlation coefficients between websites before and after re-ranking by PROSE. The native columns show the correlation between the websites when sorted according to their native ranks provided by the respective search engines.

6.5 The number of sites with OD less than −6 after re-ranking using PROSE based on page count information provided by the respective search engines.

6.6 A listing of the 43 sites included in SPARTAN-V.

6.7 The number of documents and tokens from the local and virtual corpora used in this evaluation.

6.8 The contingency tables summarising the term recognition results using the various specialised corpora.

6.9 A summary of the performance metrics for term recognition.

7.1 Example of TTA at work.

7.2 Experiment using 15 terms from the wine domain. Setting sT = 0.92 results in 5 clusters. Cluster A is simply red wine grapes or red wines, while Cluster E represents white wine grapes or white wines. Cluster B represents wines named after famous regions around the world, and they can be either red, white or rose. Cluster C represents white noble grapes for producing great wines. Cluster D represents red noble grapes. Even though uncommon, Shiraz is occasionally admitted to this group.

7.3 Experiment using 16 terms from the mushroom domain. Setting sT = 0.89 results in 4 clusters. Cluster A represents poisonous mushrooms. Cluster B comprises edible mushrooms which are prominent in East Asian cuisine, except for Agaricus Blazei. Nonetheless, this mushroom was included in this cluster probably due to its high content of beta glucan for potential use in cancer treatment, just like Shiitake. Moreover, China is the major exporter of Agaricus Blazei, also known as Himematsutake, further relating this mushroom to East Asia. Clusters C and D comprise edible mushrooms found mainly in Europe and North America, which are more prominent in Western cuisines.

7.4 Experiment using 20 terms from the disease domain. Setting sT = 0.86 results in 7 clusters. Cluster A represents skin diseases. Cluster B represents a class of blood disorders known as anaemia. Cluster C represents other kinds of blood disorders. Cluster D represents blood disorders characterised by a relatively low count of leukocytes (i.e. white blood cells) or platelets. Cluster E represents digestive diseases. Cluster F represents cardiovascular diseases characterised by both the inflammation and thrombosis (i.e. clotting) of arteries and veins. Cluster G represents cardiovascular diseases characterised by the inflammation of veins only.

7.5 Experiment using 16 terms from the animal domain. Setting sT = 0.60 produces 2 clusters. Cluster A comprises birds and Cluster B represents mammals.

7.6 Experiment using 16 terms from the animal domain (the same dataset from the experiment in Figure 7.5). Setting sT = 0.72 results in 5 clusters. Cluster A represents birds. Cluster B includes hoofed mammals (i.e. ungulates). Cluster C corresponds to predatory felines while Cluster D represents predatory canines. Cluster E constitutes animals kept as pets.

7.7 Experiment using 15 terms from the animal domain plus an additional term "Google". Setting sT = 0.58 (left screenshot), sT = 0.60 (middle screenshot) and sT = 0.72 (right screenshot) results in 2 clusters, 3 clusters and 5 clusters, respectively. In the left screenshot, Cluster A acts as the parent for the two recommended clusters "bird" and "mammal", while Cluster B includes the term "Google". In the middle screenshot, the recommended clusters "bird" and "mammal" are clearly reflected through Clusters A and C respectively. By setting sT higher, we dissected the recommended cluster "mammal" to obtain the discovered sub-clusters C, D and E as shown in the right screenshot.

7.8 Experiment using 31 terms from various domains. Setting sT = 0.70 results in 8 clusters. Cluster A represents actors and actresses. Cluster B represents musicians. Cluster C represents countries. Cluster D represents politics-related notions. Cluster E is transport. Cluster F includes finance and accounting matters. Cluster G constitutes technology and services on the Internet. Cluster H represents food.

7.9 Experiment using 60 terms from various domains. Setting sT = 0.76 results in 20 clusters. Clusters A and B represent herbs. Cluster C comprises pastry dishes while Cluster D represents dishes of Italian origin. Cluster E represents computing hardware. Cluster F is a group of politicians. Cluster G represents cities or towns in France while Cluster H includes countries and states other than France. Cluster I constitutes trees of the genus Eucalyptus. Cluster J represents marsupials. Cluster K represents finance and accounting matters. Cluster L comprises transports with four or more wheels. Cluster M includes plant organs. Cluster N represents beverages. Cluster O represents predatory birds. Cluster P comprises birds other than predatory birds. Cluster Q represents two-wheeled transports. Clusters R and S represent predatory mammals. Cluster T includes trees of the genus Acacia.

8.1 An overview of the proposed relation acquisition technique. The main phases are term mapping and term resolution, represented by black rectangles. The three steps involved in resolution are simplification, disambiguation and inference. The techniques represented by the white rounded rectangles were developed by the authors, while existing techniques and resources are shown using grey rounded rectangles.

8.2 Figure 8.2(a) shows the subgraph WT constructed for T = {'baking powder', 'whole wheat flour'} using Algorithm 8, which is later pruned to produce a lightweight ontology in Figure 8.2(b).

8.3 The computation of mutual information for all pairs of contiguous constituents of the composite terms "one cup whole wheat flour" and "salt to taste".

8.4 A graph showing the distribution of noW distance and the stepwise difference for the sequence of word senses for the term "pepper". The set of mapped terms is M = {"fettuccine", "fusilli", "tortellini", "vinegar", "garlic", "red onion", "coriander", "maple syrup", "whole wheat flour", "egg white", "baking powder", "buttermilk"}. The line "stepwise difference" shows the ∆i−1,i values. The line "average stepwise difference" is the constant value µ∆. Note that the first sense s1 is located at x = 0.

8.5 The result of clustering the non-existent term "conchiglioni" and the mapped terms M = {"fettuccine", "fusilli", "tortellini", "vinegar", "garlic", "red onion", "coriander", "maple syrup", "whole wheat flour", "egg white", "baking powder", "buttermilk", "carbonara", "pancetta"} using TTA.

8.6 The results of relation acquisition using the proposed technique for the genetics and the food domains. The labels "correctly xxx" and "incorrectly xxx" represent the true positives (TP) and false positives (FP). Precision is computed as TP/(TP + FP).

8.7 The lightweight domain ontologies generated using the two sets of input terms. The important vertices (i.e. NCAs, input terms, vertices with degree more than 3) have darker shades. The concepts genetics and food in the center of the graph are the NCAs. All input terms are located along the side of the graph.

9.1 The online interface for the HERCULES module.

9.2 The input section of the interface algorithm issac.pl shows the error sentence "Susan's imabbirity to Undeerstant the msg got her INTu trubble.". The correction provided by ISSAC is shown in the results section of the interface. The process log is also provided through this interface. Only a small portion of the process log is shown in this figure.

9.3 The online interface algorithm unithood.pl for the unithood module. The interface shows the collocational stability of different phrases determined using unithood. The various weights involved in determining the extent of stability are also provided in these figures.

9.4 The online interfaces for querying the virtual and local corpora created using the SPARTAN module.

9.5 Online interfaces related to the termhood module.

9.6 Online interfaces related to the ARCHILES module.

9.7 The interface data lightweightontology.pl for browsing pre-constructed lightweight ontologies for online news articles using the ARCHILES module.

9.8 The screenshot of the aggregated news services provided by Google (the left portion of the figure) and Yahoo (the right portion of the figure) on 11 June 2009.

9.9 A splash screen on the online interface for document skimming and scanning at http://explorer.csse.uwa.edu.au/research/.

9.10 The cross-domain term cloud summarising the main concepts occurring in all the 395 articles listed in the news browser. This cloud currently contains terms in the technology, medicine and economics domains.

9.11 The single-domain term cloud for the domain of medicine. This cloud summarises all the main concepts occurring in the 75 articles listed below in the news browser. Users can arrive at this single-domain cloud from the cross-domain cloud in Figure 9.10 by clicking on the [domain(s)] option in the latter.

9.12 The single-domain term cloud for the medicine domain. Users can view a list of articles describing a particular topic by clicking on the corresponding term in the single-domain cloud.
9.13 The use of the document term cloud and information from the lightweight ontology to summarise individual news articles. Based on the term size in the clouds, one can arrive at the conclusion that the news featured in Figure 9.13(b) carries more domain-relevant (i.e. medical-related) content than the news in Figure 9.13(a).

9.14 The document term cloud for the news "Tai Chi may ease arthritis pain". Users can focus on a particular concept in the annotated news by clicking on the corresponding term in the document cloud.

Publications Arising from this Thesis

This thesis contains published work and/or work prepared for publication, some of which has been co-authored. The bibliographical details of the work and where it appears in the thesis are outlined below.

Book Chapters (Fully Refereed)

[1] Wong, W., Liu, W. & Bennamoun, M. (2008) Determination of Unithood and Termhood for Term Recognition. In M. Song and Y. Wu (eds.), Handbook of Research on Text and Web Mining Technologies, IGI Global. This book chapter combines and summarises the ideas in [9][10] and [3][11][12], which form Chapter 4 and Chapter 5, respectively.

[2] Wong, W., Liu, W. & Bennamoun, M. (2008) Featureless Data Clustering. In M. Song and Y. Wu (eds.), Handbook of Research on Text and Web Mining Technologies, IGI Global. The clustering algorithm reported in [5], which contributes to Chapter 7, was generalised in this book chapter to work with both terms and Internet domain names.

Journal Publications (Fully Refereed)

[3] Wong, W., Liu, W. & Bennamoun, M. (2009) A Probabilistic Framework for Automatic Term Recognition. Intelligent Data Analysis, Volume 13, Issue 4, Pages 499-539. (Chapter 5)

[4] Wong, W., Liu, W. & Bennamoun, M. (2009) Constructing Specialised Corpora through Domain Representativeness Analysis of Websites. Accepted with revision by Language Resources and Evaluation. (Chapter 6)

[5] Wong, W., Liu, W. & Bennamoun, M. (2007) Tree-Traversing Ant Algorithm for Term Clustering based on Featureless Similarities. Data Mining and Knowledge Discovery, Volume 15, Issue 3, Pages 349-381. (Chapter 7)

[6] Liu, W. & Wong, W. (2009) Web Service Clustering using Text Mining Techniques. International Journal of Agent-Oriented Software Engineering, Volume 3, Issue 1, Pages 6-26. This paper is an invited submission. It extends the work reported in [17].

Conference Publications (Fully Refereed)

[7] Wong, W., Liu, W. & Bennamoun, M. (2006) Integrated Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration in Dirty Text. In the Proceedings of the 5th Australasian Conference on Data Mining (AusDM), Sydney, Australia. The preliminary ideas in this paper were refined and extended to contribute towards [8], which forms Chapter 3 of this thesis.

[8] Wong, W., Liu, W. & Bennamoun, M. (2007) Enhanced Integrated Scoring for Cleaning Dirty Texts. In the Proceedings of the IJCAI Workshop on Analytics for Noisy Unstructured Text Data (AND), Hyderabad, India. (Chapter 3)

[9] Wong, W., Liu, W. & Bennamoun, M. (2007) Determining the Unithood of Word Sequences using Mutual Information and Independence Measure. In the Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING), Melbourne, Australia. The ideas in this paper were refined and reformulated as a probabilistic framework to contribute towards [10], which forms Chapter 4 of this thesis.

[10] Wong, W., Liu, W. & Bennamoun, M. (2008) Determining the Unithood of Word Sequences using a Probabilistic Approach. In the Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP), Hyderabad, India. (Chapter 4)

[11] Wong, W., Liu, W. & Bennamoun, M. (2007) Determining Termhood for Learning Domain Ontologies using Domain Prevalence and Tendency. In the Proceedings of the 6th Australasian Conference on Data Mining (AusDM), Gold Coast, Australia. The ideas in this paper were refined and reformulated as a probabilistic framework to contribute towards [3], which forms Chapter 5 of this thesis.

[12] Wong, W., Liu, W. & Bennamoun, M. (2007) Determining Termhood for Learning Domain Ontologies in a Probabilistic Framework. In the Proceedings of the 6th Australasian Conference on Data Mining (AusDM), Gold Coast, Australia. The ideas and experiments in this paper were further extended to contribute towards [3], which forms Chapter 5 of this thesis.

[13] Wong, W., Liu, W. & Bennamoun, M. (2008) Constructing Web Corpora through Topical Web Partitioning for Term Recognition. In the Proceedings of the 21st Australasian Joint Conference on Artificial Intelligence (AI), Auckland, New Zealand. The preliminary ideas in this paper were improved and extended to contribute towards [4], which forms Chapter 6 of this thesis.

[14] Wong, W., Liu, W. & Bennamoun, M. (2006) Featureless Similarities for Terms Clustering using Tree-Traversing Ants. In the Proceedings of the International Symposium on Practical Cognitive Agents and Robots (PCAR), Perth, Australia. The preliminary ideas in this paper were refined to contribute towards [5], which forms Chapter 7 of this thesis.

[15] Wong, W., Liu, W. & Bennamoun, M. (2009) Acquiring Semantic Relations using the Web for Constructing Lightweight Ontologies. In the Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Bangkok, Thailand. (Chapter 8)

[16] Enkhsaikhan, M., Wong, W., Liu, W. & Reynolds, M. (2007) Measuring Data-Driven Ontology Changes using Text Mining. In the Proceedings of the 6th Australasian Conference on Data Mining (AusDM), Gold Coast, Australia. This paper reports a technique for detecting changes in ontologies. The ontologies used for evaluation in this paper were generated using the clustering technique in [5] and the term recognition technique in [3].

[17] Liu, W. & Wong, W. (2008) Discovering Homogenous Service Communities through Web Service Clustering. In the Proceedings of the AAMAS Workshop on Service-Oriented Computing: Agents, Semantics, and Engineering (SOCASE), Estoril, Portugal. This paper reports the results of discovering web service clusters using the extended clustering technique described in [2].

Conference Publications (Refereed on the basis of abstract)

[18] Wong, W., Liu, W., Liaw, S., Balliu, N., Wu, H. & Tade, M. (2008) Automatic Construction of Lightweight Domain Ontologies for Chemical Engineering Risk Management. In the Proceedings of the 11th Conference on Process Integration, Modelling and Optimisation for Energy Saving and Pollution Reduction (PRES), Prague, Czech Republic. This paper reports the results of the preliminary integration of the ideas in [1-15].

[19] Wong, W. (2008) Discovering Lightweight Ontologies using the Web. In the Proceedings of the 9th Postgraduate Electrical Engineering & Computing Symposium (PEECS), Perth, Australia. This paper reports the preliminary results of the integration of the ideas in [1-15] into a system for document skimming and scanning.
Note:

• The 14 publications [1-5] and [7-15] describe research work on developing various techniques for ontology learning. The contents of these papers contributed directly to Chapters 3-8 of this thesis. The 5 publications [6] and [16-19] are application papers that arose from the use of these techniques in several areas.

• All publications, except [18] and [19], are published in venues ranked B or higher by the Australasian Computing Research and Education Association (CORE). Data Mining and Knowledge Discovery (DMKD), Intelligent Data Analysis (IDA) and Language Resources and Evaluation (LRE) have 2008/2009 ISI journal impact factors of 2.421, 0.426 and 0.283, respectively.

Contribution of Candidate to Published Work

Some of the published work included in this thesis has been co-authored. The extent of the candidate's contribution towards the published work is outlined below.

• Publications [1-5] and [7-15]: The candidate is the first author of these 14 papers, with an 80% contribution. He co-authored them with his two supervisors. The candidate designed and implemented the algorithms, performed the experiments and wrote the papers. The candidate's supervisors reviewed the papers and provided useful advice for improvements.

• Publications [6] and [17]: The candidate is the second author of these papers with a 50% contribution. His primary supervisor (Dr Wei Liu) is the first author. The candidate contributed to the clustering algorithm used in these papers, and wrote the experiment sections.

• Publication [16]: The candidate is the second author of this paper with a 20% contribution. His primary supervisor (Dr Wei Liu) and two academic colleagues are the remaining authors. The candidate contributed to the clustering algorithm and term recognition technique used in this paper. The candidate conducted half of the experiments reported in this paper.

• Publication [18]: The candidate is the first author of this paper with a 40% contribution. He co-authored the paper with two academic colleagues, and three researchers from the Curtin University of Technology. All techniques reported in this paper were contributed by the candidate. The candidate wrote all sections of this paper with advice from his primary supervisor (Dr Wei Liu) and the domain experts from the Curtin University of Technology.

• Publication [19]: The candidate is the sole author of this paper with a 100% contribution.

Chapter 1

Introduction

"If HTML and the Web made all the online documents look like one huge book, [the Semantic Web] will make all the data in the world look like one huge database." - Tim Berners-Lee, Weaving the Web (1999)

Imagine that every text document you encounter comes with an abstraction of what is important. We would then no longer have to meticulously sift through every email, news article, search result or product review every day. If every document on the World Wide Web (the Web) had an abstraction of its important concepts and relations, we would be one crucial step closer to realising the vision of a Semantic Web. At the moment, the widely adopted technique for creating these abstractions is manual curation. For instance, authors of news articles create their own summaries. Regular users assign descriptive tags to webpages using Web 2.0 portals. Webmasters provide machine-readable metadata to describe their webpages for the Semantic Web. The need to automate the abstraction process becomes evident when we consider the fact that more than 90% of the data in the world appears in unstructured form [87].
Indeed, search engine giants such as Yahoo!, Google and Microsoft's Bing are slowly and strategically gearing towards the presentation of webpages using visual summaries and abstractions.

In this research, ontology learning techniques are proposed and developed to automatically discover terms, concepts and relations from documents. Together, these ontological elements are represented as lightweight ontologies. As with any process that involves extracting meaningful information from unstructured data, ontology learning relies on extensive background knowledge. This background knowledge can range from unstructured data such as a text corpus (i.e. a collection of documents) to structured data such as a semantic lexicon. From here on, we shall take background knowledge [32] in a broad sense as "information that is essential to understanding a situation or problem" (this definition is from http://wordnetweb.princeton.edu/perl/webwn?s=background%20knowledge). More and more researchers in ontology learning are turning to Web data to address certain inadequacies of static background knowledge. This thesis investigates the systematic use of the Web as the sole source of dynamic background knowledge for automatically learning term clouds (i.e. visual depictions of terms) and lightweight ontologies from text across different domains. The significance of term clouds and lightweight ontologies is best appreciated in the context of document skimming and scanning as a way to alleviate the pressure of information overload. Imagine hundreds of news articles, medical reports, product reviews and emails summarised using connected key concepts (i.e. lightweight ontologies) that stand out visually (i.e. term clouds). This thesis has produced an interface to do exactly this.

1.1 Problem Description

Ontology learning from text is a relatively new research area that draws on advances from related disciplines, especially text mining, data mining, natural language processing and information retrieval. The requirement for extensive background knowledge, be it in the form of text corpora or structured data, remains one of the greatest challenges facing the ontology learning community, and hence is the focus of this thesis.

The adequacy of background knowledge in language processing is determined by two traits, namely diversity and redundancy. Firstly, languages vary and evolve across different geographical regions, genres and time [183]. For instance, a general lexicon for the English language such as WordNet [177] is of little or no use to a system that processes medical texts or texts in other languages. Similarly, a text corpus conceived in the early 90s such as the British National Corpus (BNC) [36] cannot cope with the need to process texts that contain words such as "iPhone" or "metrosexual". Secondly, redundancy of data is an important prerequisite in both statistical and symbolic language processing. Redundancy allows language processing techniques to arrive at conclusions regarding many linguistic events based on observation and induction. If we observe "politics" and "hypocrisy" together often enough, we can say they are somehow related.

Static background knowledge has neither adequate diversity nor redundancy. According to Engels & Lech [66], "Many of the approaches found use of statistical methods on larger corpora...Such approaches tend to get into trouble when domains are dynamic or when no large corpora are present...". Indeed, many researchers have realised this, and hence gradually turned to Web data for the solution. For instance, in ontology learning, Wikipedia is used for relation acquisition [154, 236, 216] and word sense disambiguation [175]. Web search engines are employed for text corpus construction [15, 228], similarity measurement [50], and word collocation [224, 41]. However, the present use of Web data is typically confined to isolated cases where static background knowledge has outrun its course. There is currently no study concentrating on the systematic use of Web data as background knowledge for all phases of ontology learning. Research that focuses on the issues of diversity and redundancy of background knowledge in ontology learning is long overdue. How do we know if we have the necessary background knowledge to carry out all our ontology learning tasks? Where do we look for more background knowledge if we know that what we have is inadequate?

1.2 Thesis Statement

The thesis of this research is that the process of ontology learning, which includes discovering terms, concepts and coarse-grained relations from text across a wide range of domains, can be effectively automated by relying solely upon background knowledge on the Web. In other words, our proposed system employs Web data as the sole background knowledge for all techniques across all phases of ontology learning. The effectiveness of the proposed system is determined by its ability to satisfy two requirements:

(1) Avoid using any static resources commonly used by current ontology learning systems (e.g. semantic lexicons, text corpora).
(2) Ensure the applicability of the system across a wide range of domains (the current focus is technology, medicine and economics).

At the same time, this research addresses the following four problems by taking advantage of the diversity and redundancy of Web data as background knowledge:

(1) The absence of integrated techniques for cleaning noisy text.
(2) The inability of current term extraction techniques, which are heavily influenced by word frequency, to systematically explicate, diversify and consolidate termhood evidence.
(3) The inability of current corpus construction techniques to automatically create very large, high-quality text corpora using a small number of seed terms.
(4) The difficulty of locating and preparing features for clustering and acquiring relations between terms.

1.3 Overview of Solution

The ultimate goal of ontology learning in the context of this thesis is to discover terms, concepts and coarse-grained relations from documents using the Web as the sole source of background knowledge. Chapter 2 provides a thorough review of the existing techniques for discovering these ontological elements. An ontology learning system comprising five phases, namely text preprocessing, text processing, corpus construction, term recognition and relation acquisition, is proposed in this research. Figure 1.1 provides an overview of the system.

Figure 1.1: An overview of the five phases in the proposed ontology learning system, and how the details of each phase are outlined in certain chapters of this dissertation.

The common design and development methodology for the core techniques in each phase is: (1) first, perform an in-depth study of the requirements of each phase to determine the types of background knowledge required; (2) second, identify ways of exploiting data on the Web to satisfy the background knowledge requirements; and (3) third, devise high-performance techniques that take advantage of the diversity and redundancy of the background knowledge. The system takes as input a set of seed terms and natural language texts, and produces three outputs, namely text corpora, term clouds and lightweight ontologies. The solution to each phase is described in the following subsections.
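To make the overall data flow concrete before the individual phases are described, the following minimal Python sketch outlines how the five phases compose. It is only an illustrative skeleton: every function name and body here is an assumed placeholder, not the implementation described in Chapter 9; the point is simply the order of the phases and the three outputs.

from dataclasses import dataclass, field


@dataclass
class LightweightOntology:
    # Terms plus coarse-grained relations of the form (term, relation, term).
    terms: list = field(default_factory=list)
    relations: list = field(default_factory=list)


def preprocess(raw_text):
    # Phase 1: correct spelling errors, abbreviations and casing (ISSAC); stubbed here.
    return raw_text.strip()


def process(clean_text):
    # Phase 2: chunk noun phrases and keep collocationally stable units (UH/OU); stubbed here.
    return [chunk.strip() for chunk in clean_text.split(".") if chunk.strip()]


def build_corpus(seed_terms):
    # Phase 3: assemble a specialised text corpus from the Web (SPARTAN); stubbed here.
    return ["placeholder document about %s" % term for term in seed_terms]


def recognise_terms(candidates, corpus):
    # Phase 4: keep candidates judged domain-relevant against the corpus (TH/OT); stubbed here.
    return [c for c in candidates if any(c in doc for doc in corpus)]


def acquire_relations(terms):
    # Phase 5: cluster terms and derive coarse-grained relations (TTA/ARCHILES); stubbed here.
    return LightweightOntology(terms=list(terms))


def learn_ontology(raw_text, seed_terms):
    corpus = build_corpus(seed_terms)
    candidates = process(preprocess(raw_text))
    return acquire_relations(recognise_terms(candidates, corpus))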
1.3.1 Text Preprocessing (Chapter 3)

Figure 1.2 shows an overview of the techniques for text preprocessing. Unlike data developed in controlled settings, Web data come in varying degrees of quality and may contain spelling errors, abbreviations and improper casing. This calls for serious attention to the issue of data quality. A review of several prominent techniques for spelling error correction, abbreviation expansion and case restoration was conducted. Despite the blurring of the boundaries between these different errors in online data, there is little work on integrated correction techniques. For instance, is "ocat" a spelling error (with the possibilities "coat", "cat" or "oat"), or an abbreviation (with the expansion "Ontario Campaign for Action on Tobacco")? A technique called Integrated Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration (ISSAC) is proposed and developed for cleaning potentially noisy texts. ISSAC relies on edit distance, online dictionaries, search engine page counts and Aspell [9]. An experiment using 700 chat records showed that ISSAC achieved an average accuracy of 98%. In addition, a heuristic technique called Heuristic-based Cleaning Utility for Web Texts (HERCULES) is proposed for extracting relevant content from webpages amidst HTML tags, boilerplate, etc. Due to the significance of HERCULES to the corpus construction phase, the details of this technique are provided in Chapter 6.

Figure 1.2: Overview of the ISSAC and HERCULES techniques used in the text preprocessing phase of the proposed ontology learning system. ISSAC and HERCULES are described in Chapters 3 and 6, respectively.
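The exact ISSAC scoring function and its enhancements are given in Chapter 3. Purely as a hedged illustration of the underlying idea of consolidating several sources of evidence into one score, the sketch below ranks candidate replacements for a noisy token by combining string similarity with a log-scaled web-frequency weight; the weighting scheme and the hard-coded page counts are assumptions for illustration, not ISSAC itself.

import math
from difflib import SequenceMatcher


def page_count(phrase):
    # Stand-in for a search engine page-count lookup; the numbers are invented.
    counts = {"coat": 93_000_000, "cat": 620_000_000, "oat": 41_000_000}
    return counts.get(phrase.lower(), 1)


def replacement_score(noisy_token, candidate):
    # Combine string similarity (edit-distance-like evidence) with web-frequency evidence.
    similarity = SequenceMatcher(None, noisy_token.lower(), candidate.lower()).ratio()
    return similarity * math.log(1 + page_count(candidate))


def best_replacement(noisy_token, candidates):
    return max(candidates, key=lambda c: replacement_score(noisy_token, c))


print(best_replacement("ocat", ["coat", "cat", "oat"]))

In the real technique, abbreviation expansions and case variants compete in the same candidate pool as spelling corrections, which is what allows the three error types to be handled simultaneously.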
1.3.2 Text Processing (Chapter 4)

Figure 1.3 shows an overview of the techniques for text processing. The cleaned texts are processed using the Stanford Parser [132] and Minipar [150] to obtain part-of-speech and grammatical information. This information is then used for chunking noun phrases and extracting instantiated sub-categorisation frames (i.e. syntactic triples, ternary frames) in the form of <arg1,connector,arg2>. Two measures based on search engine page counts are introduced as part of the noun phrase chunking process. These two measures are used to determine the collocational stability of noun phrases. Noun phrases are considered unstable if they can be further broken down to create non-overlapping units that refer to semantically distinct concepts. For example, the phrase "Centers for Disease Control and Prevention" is a stable and semantically meaningful unit, while "Centre for Clinical Interventions and Royal Perth Hospital" is an unstable compound. The first measure, called Unithood (UH), is an adaptation of existing word association measures, while the second measure, called Odds of Unithood (OU), is a novel probabilistic measure that addresses the ad-hoc nature of combining evidence. An experiment using 1,825 test cases in the health domain showed that OU achieved a higher accuracy of 97.26%, compared to UH at only 94.52%.

Figure 1.3: Overview of the UH and OU measures used in the text processing phase of the proposed ontology learning system. These two measures are described in Chapter 4.
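UH and OU are defined precisely in Chapter 4. As a rough, hedged illustration of how page counts can serve as collocation evidence, the sketch below computes a pointwise mutual information (PMI) style score from page counts: the more the constituent words co-occur beyond what their individual frequencies predict, the more the phrase behaves like a stable unit. The counts and the index size N are hypothetical, and the actual measures in Chapter 4 combine more evidence than this single score.

import math

ASSUMED_INDEX_SIZE = 50_000_000_000  # hypothetical number of pages indexed by the search engine


def pmi_unithood(count_xy, count_x, count_y, n=ASSUMED_INDEX_SIZE):
    # log [ P(x, y) / (P(x) P(y)) ] estimated from page counts; values above 0 suggest a stable collocation.
    p_xy = count_xy / n
    p_x, p_y = count_x / n, count_y / n
    return math.log(p_xy / (p_x * p_y))


# Hypothetical page counts for a two-word phrase and its constituent words.
print(pmi_unithood(count_xy=12_000_000, count_x=480_000_000, count_y=350_000_000))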
1.3.3 Term Recognition (Chapter 5)

Figure 1.4 shows an overview of the techniques for term recognition. Noun phrases in the two arguments arg1 and arg2 of the instantiated sub-categorisation frames are used to create a list of term candidates. The extent to which each term candidate is relevant to the corresponding document is then determined. Several existing techniques for measuring termhood using various term characteristics as termhood evidence are reviewed. Major shortcomings of existing techniques are identified and discussed, including the heavy influence of word frequency (especially in techniques based on TF-IDF), mathematically unfounded derivation of weights, and implicit assumptions regarding term characteristics. An analysis is carried out using word distribution models and text corpora to predict word occurrences for quantifying termhood evidence. Models that are considered include the K-mixture, Poisson and Zipf-Mandelbrot models. Based on the analysis, two termhood measures are proposed, which combine evidence based on explicitly defined term characteristics. The first is a heuristic measure called Termhood (TH), while the second, called Odds of Termhood (OT), is a novel probabilistic measure founded on Bayes' theorem for formalising termhood evidence. These two measures are compared against two existing ones using the GENIA corpus [130] for molecular biology as the benchmark. An evaluation using 1,954 term candidates showed that TH and OT achieved the best precision at 98.5% and 98%, respectively, for the first 200 terms.

Figure 1.4: Overview of the TH and OT measures used in the term recognition phase of the proposed ontology learning system. These two measures are described in Chapter 5.
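The precise definition of OT, its evidence set and its estimation procedure appear in Chapter 5. The generic Bayes-odds template that such a measure builds on can be written as follows, where t denotes the event that a candidate is a domain term and e_1, ..., e_n are pieces of termhood evidence assumed conditionally independent; this is only the standard odds form of Bayes' theorem, not the thesis-specific formulation:

O(t | e_1, ..., e_n) = P(t | e_1, ..., e_n) / P(¬t | e_1, ..., e_n) = [ P(t) / P(¬t) ] × ∏_{i=1}^{n} [ P(e_i | t) / P(e_i | ¬t) ]

Each factor P(e_i | t) / P(e_i | ¬t) acts as a likelihood ratio that strengthens or weakens the odds, which is what allows heterogeneous evidence (for example, domain versus contrastive frequencies predicted by word distribution models) to be consolidated within a single probabilistic framework rather than through ad-hoc weighting.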
A Probabilistic Site Selector (PROSE) is proposed as part of SPARTAN to identify the most suitable and authoritative data for contributing to the corpora. A heuristic technique called HERCULES, mentioned earlier in the text preprocessing phase, is included in SPARTAN for extracting relevant content from the downloaded webpages. A comparison using the Cleaneval development set (http://cleaneval.sigwac.org.uk/devset.html) and a text comparison module based on the vector space model (http://search.cpan.org/~stro/Text-Compare-1.03/lib/Text/Compare.pm) showed that HERCULES achieved an 89.19% similarity with the gold standard. An evaluation was conducted to show that SPARTAN requires only a small number of seed terms (three to five), and that SPARTAN-based corpora are independent of the search engine employed. The performance of term recognition across four different corpora (both automatically constructed and manually curated) was assessed using the OT measure, 1,300 term candidates, and the GENIA corpus as the benchmark. The evaluation showed that term recognition using the SPARTAN-based corpus achieved the best precision at 99.56%.

1.3.5 Term Clustering and Relation Acquisition (Chapters 7 and 8)

Figure 1.6 shows an overview of the techniques for relation acquisition.

Figure 1.6: Overview of the ARCHILES technique used in the relation acquisition phase of the proposed ontology learning system. The ARCHILES technique is described in Chapter 8, while the TTA clustering technique and noW measure are described in Chapter 7.

The flat lists of domain-relevant terms obtained during the previous phase are organised into hierarchical structures during the relation acquisition phase. A review of the techniques for acquiring semantic relations between terms was conducted. Current techniques rely heavily on the presence of syntactic cues and static background knowledge such as semantic lexicons for acquiring relations. A novel technique named Acquiring Relations through Concept Hierarchy Disambiguation, Association Inference and Lexical Simplification (ARCHILES) is proposed for constructing lightweight ontologies using coarse-grained relations derived from Wikipedia and search engines. ARCHILES combines word disambiguation, which uses the distance measure n-degree of Wikipedia (noW), and lexical simplification to handle complex and ambiguous terms. ARCHILES also includes association inference using a novel multi-pass Tree-Traversing Ant (TTA) clustering algorithm with the Normalised Web Distance (NWD), a generalisation of the Normalised Google Distance (NGD) [50] that can employ any available Web search engine, as the similarity measure to cope with terms not covered by Wikipedia. This technique can be used to complement conventional techniques for acquiring fine-grained relations. Two small experiments using 11 terms in the genetics domain and 31 terms in the food domain revealed precision scores between 80% and 100%. The details of TTA and noW are provided in Chapter 7. The description of ARCHILES is included in Chapter 8.

1.4 Contributions

The standout contribution of this dissertation is the exploration of a complete solution to the complex problem of automatic ontology learning from text. This research has produced several other contributions to the field of ontology learning. The complete list is as follows:

• A technique which consolidates various evidence from existing tools and from search engines for simultaneously correcting spelling errors, expanding abbreviations and restoring improper casing.
• Two measures for determining the collocational strength of word sequences using page counts from search engines, namely, an adaptation of existing word association measures, and a novel probabilistic measure.

• In-depth experiments on parameter estimation and linear regression involving various word distribution models.

• Two measures for determining term relevance based on explicitly defined term characteristics and the distributional behaviour of terms across different corpora. The first measure is a heuristic measure, while the second measure is based on a novel probabilistic framework for consolidating evidence using formal word distribution models.

• In-depth experiments on the effects of search engine and page count variations on corpus construction.

• A novel technique for corpus construction that requires only a small number of seed terms to automatically produce very large, high-quality text corpora through the systematic analysis of website contents. The on-demand construction of new text corpora enables this and many other term recognition techniques to be widely applicable across different domains. A generally-applicable heuristic technique is also introduced for removing HTML tags and boilerplate, and extracting relevant content from webpages.

• In-depth experiments on the peculiarities of clustering terms as compared to other forms of feature-based data clustering.

• A novel technique for constructing lightweight ontologies in an iterative process of lexical simplification, association inference through term clustering, and word disambiguation using only Wikipedia and search engines. A generally-applicable technique is introduced for multi-pass term clustering using featureless similarity measurement based on Wikipedia and page counts from search engines.

• Demonstration of the use of term clouds and lightweight ontologies to assist the skimming and scanning of documents.

1.5 Layout of Thesis

Overall, this dissertation is organised as a series of papers published in internationally refereed book chapters, journals and conferences. Each paper constitutes an independent piece of work on ontology learning. Together, however, these papers contribute to a complete and coherent theme. In Chapter 2, a background to ontology learning and a review of several prominent ontology learning systems are presented. The core content of this dissertation is laid out in Chapters 3 to 8. Each of these chapters describes one of the five phases in our ontology learning system.

• Chapter 3 (Text Preprocessing) features an IJCAI workshop paper on the text cleaning technique called ISSAC.

• In Chapter 4 (Text Processing), an IJCNLP conference paper describing the two word association measures UH and OU is included.

• An Intelligent Data Analysis journal paper on the two term relevance measures TH and OT is included in Chapter 5 (Term Recognition).

• In Chapter 6 (Corpus Construction for Term Recognition), a Language Resources and Evaluation journal paper is included to describe the SPARTAN technique for automatically constructing text corpora for term recognition.

• A Data Mining and Knowledge Discovery journal paper that describes the TTA clustering technique and noW distance measure is included in Chapter 7 (Term Clustering for Relation Acquisition).
• In Chapter 8 (Relation Acquisition), a PAKDD conference paper is included to describe the ARCHILES technique for acquiring coarse-grained relations using TTA and noW. After the core content, Chapter 9 elaborates on the implementation details of the proposed ontology learning system, and the application of term clouds and lightweight ontologies for document skimming and scanning. In Chapter 10, we summarise our conclusions and provide suggestions for future work. CHAPTER 2 Background “A while ago, the Artificial Intelligence research community got together to find a way to enable knowledge sharing...They proposed an infrastructure stack that could enable this level of information exchange, and began work on the very difficult problems that arise.” - Thomas Gruber, Ontology of Folksonomy (2007) This chapter provides a comprehensive review on ontology learning. It also serves as a background introduction to ontologies in terms of what they are, why they are important, how they are obtained and where they can be applied. The definition of an ontology is first introduced before a discussion on the differences between lightweight ontologies and the conventional understanding of ontologies is provided. Then the process of ontology learning is described, with a focus on types of output, commonly-used techniques and evaluation approaches. Finally, several current applications and prominent systems are explored to appreciate the significance of ontologies and the remaining challenges in ontology learning. 2.1 Ontologies Ontologies can be thought of as directed graphs consisting of concepts as nodes, and relations as the edges between the nodes. A concept is essentially a mental symbol often realised by a corresponding lexical representation (i.e. natural language name). For instance, the concept “food” denotes the set of all substances that can be consumed for nutrition or pleasure. In Information Science, an ontology is a “formal, explicit specification of a shared conceptualisation” [92]. This definition imposes the requirement that the names of concepts, and how the concepts are related to one another have to be explicitly expressed and represented using formal languages such as Web Ontology Language (OWL). An important benefit of a formal representation is the ability to specify axioms for reasoning to determine validity and to define constraints in ontologies. As research into ontology progresses, the definition of what constitutes an ontology evolves. The extent of relational and axiomatic richness, and the formality of representation eventually gave rise to a spectrum of ontology kinds [253] as illustrated in Figure 2.1. At one end of the spectrum, we have ontologies that make little or no use of axioms referred to as lightweight ontologies [89]. At the other end, we have heavyweight ontologies [84] that make intensive use of axioms for specification. Ontologies are fundamental to the success of the Semantic Web as they 13 14 Chapter 2. Background Figure 2.1: The spectrum of ontology kinds, adapted from Giunchiglia & Zaihrayeu [89]. enable software agents to exchange, share, reuse and reason about concepts and relations using axioms. In the words of Tim Berners-Lee [24], “For the semantic web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning”. However, the truth remains that the automatic learning of axioms is not an easy task. 
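To make the directed-graph view of ontologies introduced above concrete, the following is a minimal sketch of a lightweight ontology as a graph of concepts connected by labelled relations, with no axioms. The class name, relation labels and concept names are hypothetical and are only meant to illustrate the structure, not any particular formalism such as OWL.

    from collections import defaultdict

    class LightweightOntology:
        """Concepts as nodes, labelled directed relations as edges; no axioms."""

        def __init__(self):
            self.relations = defaultdict(set)   # (source concept, label) -> {target concepts}

        def add(self, source: str, label: str, target: str) -> None:
            self.relations[(source, label)].add(target)

        def targets(self, source: str, label: str) -> set:
            return self.relations[(source, label)]

    onto = LightweightOntology()
    onto.add("egg tart", "is-a", "tart")          # taxonomic relation
    onto.add("tart", "is-a", "food")
    onto.add("tart", "part-of", "dessert menu")   # non-taxonomic relation
    print(onto.targets("tart", "is-a"))           # {'food'}

Automatically populating even such a minimal structure from free text is the challenge examined in the remainder of this chapter.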
Indeed, despite some success, many ontology learning systems are still struggling with the basics of extracting terms and relations [84]. For this reason, the majority of systems that claim to learn ontologies are in fact creating lightweight ontologies. At the moment, lightweight ontologies appear to be the most common type of ontology in a variety of Semantic Web applications (e.g. knowledge management, document retrieval, communities of practice, data integration) [59, 75].

2.2 Ontology Learning from Text

Ontology learning from text is the process of identifying terms, concepts, relations and, optionally, axioms from natural language text, and using them to construct and maintain an ontology. Even though the area of ontology learning is still in its infancy, many proven techniques from established fields such as text mining, data mining, natural language processing, information retrieval, as well as knowledge representation and reasoning, have powered a rapid growth in recent years. Information retrieval provides various algorithms to analyse associations between concepts in texts using vectors, matrices [76] and probabilistic theorems [280]. Machine learning and data mining, on the other hand, provide ontology learning with the ability to extract rules and patterns out of massive datasets in a supervised or unsupervised manner based on extensive statistical analysis. Natural language processing provides the tools for analysing natural language text on various language levels (e.g. morphology, syntax, semantics) to uncover concept representations and relations through linguistic cues. Knowledge representation and reasoning enables the ontological elements to be formally specified and represented such that new knowledge can be deduced.

Figure 2.2: Overview of the outputs, tasks and techniques of ontology learning.

In the following subsections, we look at the types of output, common techniques and evaluation approaches of a typical ontology learning process.

2.2.1 Outputs from Ontology Learning

There are five types of output in ontology learning, namely, terms, concepts, taxonomic relations, non-taxonomic relations and axioms. Some researchers [35] refer to this as the "Ontology Learning Layer Cake". To obtain each output, certain tasks have to be accomplished, and the techniques employed for each task may vary between systems. This view of the output-task relationship, independent of any implementation details, promotes modularity in designing and implementing ontology learning systems. Figure 2.2 shows the outputs and the corresponding tasks. Each output is a prerequisite for obtaining the next output as shown in the figure. Terms are used to form concepts, which in turn are organised according to relations. Relations can be further generalised to produce axioms. Terms are the most basic building blocks in ontology learning. Terms can be simple (i.e. single-word) or complex (i.e. multi-word), and are considered lexical realisations of everything important and relevant to a domain. The main tasks associated with terms are to preprocess texts and extract terms. Preprocessing ensures that the input texts are in an acceptable format. Some of the techniques relevant to preprocessing include noisy text analytics and the extraction of relevant content from webpages (i.e. boilerplate removal). The extraction of terms usually begins with some kind of part-of-speech tagging and sentence parsing.
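As an illustration of this first step, below is a minimal sketch of part-of-speech-based noun phrase candidate extraction using NLTK's off-the-shelf tagger and a regular-expression chunker. It is not the pipeline used in this thesis (which relies on the Stanford Parser and Minipar), and the chunk grammar is a simplifying assumption; it only shows the general shape of candidate extraction.

    import nltk

    # Assumes the NLTK models have been fetched, e.g.
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    GRAMMAR = "NP: {<JJ>*<NN.*>+}"   # adjectives followed by one or more nouns
    chunker = nltk.RegexpParser(GRAMMAR)

    def candidate_terms(sentence: str) -> list:
        """Return noun phrase chunks as candidate terms for one sentence."""
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = chunker.parse(tagged)
        return [" ".join(word for word, _ in subtree.leaves())
                for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]

    print(candidate_terms("The epidermal growth factor receptor binds a small ligand."))
    # e.g. ['epidermal growth factor receptor', 'small ligand']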
Statistical or probabilistic measures are then used to determine the extent of collocational strength and domain relevance of the term candidates. Concepts can be abstract or concrete, real or fictitious. Broadly speaking, a concept can be anything about which something is said. Concepts are formed by grouping similar terms. The main tasks are therefore to form concepts and label concepts. The task of forming concepts involve discovering the variants of a term and grouping them together. Term variants can be determined using predefined background knowledge, syntactic structure analysis or through clustering based on some similarity measures. As for deciding on the suitable label for a concept, existing background knowledge such as WordNet may be used to find the name of the nearest common ancestor. If a concept is determined through syntactic structure analysis, the heads of the complex terms can be used as the corresponding label. For instance, the common head noun “tart” can be used as the label for the concept comprising of “egg tart”, “French apple tart”, “chocolate tart”, etc. Relations are used to model the interactions between the concepts in a domain. There are two types of relations, namely, taxonomic relations and non-taxonomic relations. Taxonomic relations are the hypernymies between concepts. The main task is to construct hierarchies. Organising concepts into a hierarchy involves the discovery of hypernyms and hence, some researchers may also refer to this task as extracting taxonomic relations. Hierarchy construction can be performed in various ways such as using predefined relations from existing background knowledge, using statistical subsumption models, relying on semantic relatedness between concepts, and utilising linguistic and logical rules or patterns. Non-taxonomic relations are the interactions between concepts (e.g. meronymy, thematic roles, attributes, possession and causality) other than hypernymy. The less explicit and more complex use of words for specifying relations other than hypernymy causes the tasks to discover 2.2. Ontology Learning from Text non-taxonomic relations and label non-taxonomic relations to be more challenging. Discovering and labelling non-taxonomic relations are mainly reliant on the analysis of syntactic structures and dependencies. In this aspect, verbs are taken as good indicators for non-taxonomic relations and help from domain experts are usually required to label such relations. Lastly, axioms are propositions or sentences that are always taken as true. Axioms act as a starting point for deducing other truth, verifying correctness of existing ontological elements and defining constraints. The task involved here is to discover axioms. The task of learning axioms usually involve the generalisation or deduction of a large number of known relations that satisfy certain criteria. 2.2.2 Techniques for Ontology Learning The techniques employed by different systems may vary depending on the tasks to be accomplished. The techniques can generally be classified into statistics-based, linguistics-based, logic-based, or hybrid. Figure 2.2 illustrates the various commonlyused techniques, and each technique may be applicable to more than one task. The various statistics-based techniques for accomplishing the tasks in ontology learning are mostly derived from information retrieval, machine learning and data mining. 
The lack of consideration for the underlying semantics and relations between the components of a text makes statistics-based techniques more prevalent in the early stages of ontology learning. Some of the common techniques include clustering [272], latent semantic analysis [252], co-occurrence analysis [34], term subsumption [77], contrastive analysis [260] and association rule mining [239]. The main idea behind these techniques is that the extent of occurrence of terms and their contexts in documents often provides reliable estimates about the semantic identity of terms.

• In clustering, a measure of relatedness (e.g. similarity or distance) is employed to assign terms into groups for discovering concepts or constructing hierarchies [152]. The process of clustering can either begin with individual terms or concepts and group the most related ones (i.e. agglomerative clustering), or begin with all terms or concepts and divide them into smaller groups to maximise within-group relatedness (i.e. divisive clustering). Some of the major issues in clustering are working with high-dimensional data, and feature extraction and preparation for similarity measurement. This has given rise to a class of featureless similarity and distance measures based solely on the co-occurrence of words in large text corpora. The Normalised Web Distance (NWD) is one example [262].

• Relying on raw data to measure relatedness may lead to data sparseness [35]. In latent semantic analysis, dimension reduction techniques such as singular value decomposition are applied to the term-document matrix to overcome the problem [139]. In addition, inherent relations between terms can be revealed by applying correlation measures to the dimensionally-reduced matrix, leading to the formation of groups.

• The analysis of the occurrence of two or more terms within a well-defined unit of information, such as a sentence or, more generally, an n-gram, is known as co-occurrence analysis. Co-occurrence analysis is usually coupled with measures for determining the association strength between terms or the constituents of terms. Some of the popular measures include dependency measures (e.g. mutual information [47]), log-likelihood ratios [206] (e.g. chi-square test), rank correlations (e.g. Pearson's and Spearman's coefficient [244]), distance measures (e.g. Kullback-Leibler divergence [161]), and similarity measures (e.g. cosine measures [223]).

• In term subsumption, the conditional probabilities of the occurrence of terms in documents are employed to discover hierarchical relations between them [77]. A term subsumption measure is used to quantify the extent to which a term x is more general than another term y. The higher the subsumption value, the more general term x is with respect to y.

• The extent of occurrence of terms in individual documents and in text corpora is employed for relevance analysis. Some of the common relevance measures from information retrieval include the Term Frequency-Inverse Document Frequency (TF-IDF) [215] and its variants, and others based on language modelling [56] and probability [83]. Contrastive analysis [19] is a kind of relevance analysis based on the heuristic that general language-dependent phenomena should spread equally across different text corpora, while special-language phenomena should display odd behaviours.

• Given a set of concept pairs, association rule mining is employed to describe the associations between the concepts at the appropriate level of abstraction [115].
In the example by [162], given the already known concept pairs {chips, beer} and {peanuts, soda}, association rule mining is then employed to generalise the pairs to provide {snacks, drinks}. The key to determining the degree of abstraction in association rules is provided by user-defined thresholds such as 2.2. Ontology Learning from Text confidence and support. Linguistics-based techniques are applicable to almost all tasks in ontology learning and are mainly dependent on the natural language processing tools. Some of the techniques include part-of-speech tagging, sentence parsing, syntactic structure analysis and dependency analysis. Other techniques rely on the use of semantic lexicon, lexico-syntactic patterns, semantic templates, subcategorisation frames, and seed words. • Part-of-speech tagging and sentence parsing provide the syntactic structures and dependency information required for further linguistic analysis. Some examples of part-of-speech tagger are Brill Tagger [33] and TreeTagger [219]. Principar [149], Minipar [150] and Link Grammar Parser [247] are among the few common sentence parsers. Other more comprehensive toolkits for natural language processing include General Architecture for Text Engineering (GATE) [57], and Natural Language Toolkit (NLTK) [25]. Despite the placement under the linguistics-based category, certain parsers are built on statistical parsing systems. For instance, the Stanford Parser [132] is a lexicalised probabilistic parser. • Syntactic structure analysis and dependency analysis examines syntactic and dependency information to uncover terms and relations at the sentence level. In syntactic structure analysis, words and modifiers in syntactic structures (e.g. noun phrases, verb phrases and prepositional phrases) are analysed to discover potential terms and relations. For example, ADJ-NN or DT-NN can be extracted as potential terms, while ignoring phrases containing other part-of-speech such as verbs. In particular, the head-modifier principle has been employed extensively to identify complex terms related through hyponymy with the heads of the terms assuming the hypernym role [105]. In dependency analysis, grammatical relations such as subject, object, adjunct and complement are used for determining more complex relations [86, 48]. • Semantic lexicon can either be general such as WordNet [177] or domainspecific such as the Unified Medical Language System (UMLS) [151]. Semantic lexicon offers easy access to a large collection of predefined words and relations. Concepts from semantic lexicon are usually organised in sets of similar words (i.e. synsets). These synonyms are employed for discovering variants of terms [250]. Relations from semantic lexicon have also been proven useful to 19 20 Chapter 2. Background ontology learning. These relations include hypernym-hyponym (i.e. parentchild relation) and meronym-holonym (i.e. part-whole relation). Many of the work related to the use of relations in WordNet can be found in the area of word sense disambiguation [265, 145] and lexical acquisitions [190]. • The use of lexico-syntactic patterns was proposed by [102], and has been employed to extract hypernyms [236] and meronyms. Lexico-syntactic patterns capture hypernymy relations using patterns such as NP such as NP, NP,..., and NP. For extracting meronyms, patterns such as NP is part of NP can be useful. The use of patterns provide reasonable precision but the recall is low [35]. 
Due to the cost and time involved in manually producing such patterns, efforts [234] have been taken to study the possibility of learning them. Semantic templates [238, 257] are similar to lexico-syntactic patterns in terms of their purpose. However, semantic templates offer more detailed rules and conditions to extract not only taxonomic relations but also complex non-taxonomic relations. • In linguistic theory, the subcategorisation frame [5, 85] of a word is the number and kinds of other words that it selects when appearing in a sentence. For example, in the sentence “Joe wrote a letter”, the verb “write” selects “Joe” and “letter” as its subject and object, respectively. In other words, “Person” and “Written-Communication” are the restrictions of selection for the subject and object of the verb “write”. The restrictions of selection extracted from parsed texts can be used in conjunction with clustering techniques to discover concepts [68]. • The use of seed words (i.e. seed terms) [281] is a common practice in many systems to guide a wide range of tasks in ontology learning. Seed words provide good starting points for the discovery of additional terms relevant to that particular domain [110]. Seed words are also used to guide the automatic construction of text corpora from the Web [15]. Logic-based techniques are the least common in ontology learning and are mainly adopted for more complex tasks involving relations and axioms. Logic-based techniques have connections with advances in knowledge representation and reasoning, and machine learning. The two main techniques employed are inductive logic programming [141, 283] and logical inference [227]. 2.2. Ontology Learning from Text • In inductive logic programming, rules are derived from existing collection of concepts and relations which are divided into positive and negative examples. The rules proves all the positive and none of the negative examples. In an example by Oliveira et al. [191], induction begins with the first positive example “tigers have fur”. With the second positive example “cats have fur”, a generalisation of “felines have fur” is obtained. Given the third positive example “dogs have fur”, the technique will attempt to generalise that “mammals have fur”. When encountered with a negative example “humans do not have fur”, then the previous generalisation will be dropped, giving only “canines and felines have fur”. • In logical inference, implicit relations are derived from existing ones using rules such as transitivity and inheritance. Using the classic example, given the premises “Socrates is a man” and “All men are mortal”, we can discover a new attribute relation stating that “Socrates is mortal”. Despite the power of inference, the possibilities of introducing invalid or conflicting relations may occur if the design of the rules is not complete. Consider the example where “human eats chicken” and “chicken eats worm” yield a new relation that is not valid. This happened because the intransitivity of the relation “eat” was not explicitly specified in advance. 2.2.3 Evaluation of Ontology Learning Techniques Evaluation is an important aspect of ontology learning, just like any other research areas. Evaluation allows individuals who use ontology learning systems to assess the resulting ontologies, and to possibly guide and refine the learning process. 
An interesting aspect of evaluation in ontology learning, as opposed to information retrieval and other areas, is that ontologies are not an end product but rather a means of accomplishing other tasks. In this sense, an evaluation approach is also useful for assisting users in choosing the ontology that best fits their requirements when faced with a multitude of options. In document retrieval, the object of evaluation is documents and how well systems provide documents that satisfy user queries, either qualitatively or quantitatively. However, in ontology learning, we cannot simply measure how well a system constructs an ontology without raising more questions. For instance, is the ontology good enough? If so, with respect to what application? An ontology is made up of different layers such as terms, concepts and relations. If an ontology is inadequate for an application, then which part of the ontology is causing the problem? Considering the intricacies of evaluating ontologies, a myriad of evaluation approaches have been proposed in the past few years. Generally, these approaches can be grouped into one of four main categories depending on the kind of ontologies being evaluated and the purpose of the evaluation [30]:

• The first approach evaluates the adequacy of ontologies in the context of other applications. For example, Porzel & Malaka [202] evaluated the use of ontological relations in the context of speech recognition. The output from the speech recognition system is compared with a gold standard generated by humans.

• The second approach uses domain-specific data sources to determine to what extent the ontologies are able to cover the corresponding domain. For instance, Brewster et al. [31] described a number of methods to evaluate the 'fit' between an ontology and the domain knowledge in the form of text corpora.

• The third approach compares ontologies using benchmarks, including other ontologies [164].

• The last approach relies on domain experts to assess how well an ontology meets a set of predefined criteria [158].

Due to the complex nature of ontologies, evaluation approaches can also be distinguished by the layers of an ontology (e.g. term, concept, relation) they evaluate [202]. More specifically, evaluations can be performed to assess the (1) correctness at the terminology layer, (2) coverage at the conceptual layer, (3) wellness at the taxonomy layer, and (4) adequacy of the non-taxonomic relations. The focus of evaluation at the terminology layer is to determine whether the terms used to identify domain-relevant concepts are included and correct. Some form of lexical reference or benchmark is typically required for evaluation in this layer. Typical precision and recall measures from information retrieval are used together with exact matching or edit distance [164] to determine performance at the terminology layer. The lexical precision and recall reflect how well the extracted terms cover the target domain. Lexical Recall (LR) measures the number of relevant terms extracted ($e_{relevant}$) divided by the total number of relevant terms in the benchmark ($b_{relevant}$), while Lexical Precision (LP) measures the number of relevant terms extracted ($e_{relevant}$) divided by the total number of terms extracted ($e_{all}$). LP and LR are defined as [214]:

$LP = \frac{e_{relevant}}{e_{all}}$ (2.1)

$LR = \frac{e_{relevant}}{b_{relevant}}$ (2.2)

The precision and recall measures can also be combined to compute the corresponding $F_{\beta}$-score.
The general formula for non-negative real $\beta$ is:

$F_{\beta} = \frac{(1 + \beta^2)(precision \times recall)}{\beta^2 \times precision + recall}$ (2.3)

Evaluation measures at the conceptual level are concerned with whether the desired domain-relevant concepts are discovered or otherwise. Lexical Overlap (LO) measures the intersection between the discovered concepts ($C_d$) and the recommended concepts ($C_m$). LO is defined as:

$LO = \frac{|C_d \cap C_m|}{|C_m|}$ (2.4)

Ontological Improvement (OI) and Ontological Loss (OL) are two additional measures to account for newly discovered concepts that are absent from the benchmark, and for concepts which exist in the benchmark but were not discovered, respectively. They are defined as [214]:

$OI = \frac{|C_d - C_m|}{|C_m|}$ (2.5)

$OL = \frac{|C_m - C_d|}{|C_m|}$ (2.6)

Evaluations at the taxonomy layer are more complicated. Performance measures for the taxonomy layer are typically divided into local and global [60]. The similarity of the concepts' positions in the learned taxonomy and in the benchmark is used to compute the local measure. The global measure is then derived by averaging the local scores for all concept pairs. One of the few measures for the taxonomy layer is the Taxonomic Overlap (TO) [164]. The computation of the global similarity between two taxonomies begins with the local overlap of their individual terms. The semantic cotopy of a term, the set of all its super- and sub-concepts, varies depending on the taxonomy. The local similarity between two taxonomies given a particular term is determined based on the overlap of the term's semantic cotopy. The global taxonomic overlap is then defined as the average of the local overlaps of all the terms in the two taxonomies. The same idea can be applied to compare the adequacy of non-taxonomic relations.

2.3 Existing Ontology Learning Systems

Before looking into some of the prominent systems and recent advances in ontology learning, a recap of three previous independent surveys is provided. The first is a report by the OntoWeb Consortium [90], a body funded by the Information Society Technologies Programme of the Commission of the European Communities. This survey listed 36 approaches for ontology learning from text. Some of the important findings presented by this review are:

• There is no detailed methodology that guides the ontology learning process from text.

• There is no fully automated system for ontology learning. Some of the systems act as tools to assist in the acquisition of lexical-semantic knowledge, while others help to extract concepts and relations from annotated corpora with the involvement of users.

• There is no general approach for evaluating the accuracy of ontology learning, or for comparing the results produced by different systems.

The second survey, released around the same time as the OntoWeb Consortium survey, was performed by Shamsfard & Barforoush [226]. The authors claimed to have studied over fifty different approaches before selecting and including seven prominent ones in their survey. The main focus of the review was to introduce a framework for comparing ontology learning approaches. The approaches included in the review merely served as test cases to be fitted into the framework. Consequently, the review provided extensive coverage of the state of the art of the relevant techniques, but offered limited discussion of the underlying problems and future outlook.
The review arrived at the following list of problems: • Much work has been conducted on discovering taxonomic relations, while nontaxonomic relations were given less attention. • Research into axiom learning was nearly unexplored. • The focus of most research is on building domain ontologies. Most of the techniques were designed to make heavy use of domain-specific patterns and static background knowledge, with little regard to the portability of the systems across different domains. 2.3. Existing Ontology Learning Systems • Current ontology learning systems are evaluated within the confinement of their domains. Finding a formal, standard method to evaluate ontology learning systems remains an open problem. • Most systems are either semi-automated or tools for supporting domain experts in curating ontologies. Complete automation and elimination of user involvement requires more research. Lastly, Ding & Foo [62] presented a survey of 12 major ontology learning projects. The authors wrapped up their survey with following findings: • Input data are mostly structured. Learning from free texts remains within the realm of research. • The task of discovering relations is very complex and a difficult problem to solve. It has turned out to be the main impedance to the progress of ontology learning. • The techniques for discovering concepts have reached a certain level of maturity. A closer look into the three survey papers revealed a consensus on several aspects of ontology learning that required more work. These conclusions are in fact in line with the findings of our literature review in the following Sections 2.3.1 and 2.3.2. These conclusions are (1) fully automated ontology learning is still in the realm of research, (2) current approaches are heavily dependent on static background knowledge, and may face difficulty in porting across different domains and languages, (3) there is no common evaluation platform for ontology learning, and (4) there is a lack of research on discovering relations. The validity of some of these conclusions will become more evident as we look into several prominent systems and recent advances in ontology learning in the following two sections. 2.3.1 Prominent Ontology Learning Systems A summary of the techniques used by five prominent ontology learning systems, and the evaluation of these techniques are provided in this section. OntoLearn OntoLearn [178, 182, 259, 260], together with Consys (for ontology validation by experts) and SymOntoX (for updating and managing ontology by experts) are part 25 26 Chapter 2. Background of a project for developing an interoperable infrastructure for small and medium enterprises in the tourism sector under the Federated European Tourism Information System1 (FETISH). OntoLearn employs both linguistics and statistics-based techniques in four major tasks to discover terms, concepts and taxonomic relations. • Preprocess texts and extract terms: Domain and general corpora are first processed using part-of-speech tagging and sentence parsing tools to produce syntactic structures including noun phrases and prepositional phrases. For relevance analysis, the approach adopts two metrics known as Domain Relevance (DR) and Domain Consensus (DC). Domain relevance measures the specificity of term t with respect to the target domain Dk through comparative analysis across a list of predefined domains D1 , ..., Dn . 
The measure is defined as

$DR(t, D_k) = \frac{P(t|D_k)}{\sum_{i=1...n} P(t|D_i)}$

where $P(t|D_k)$ and $P(t|D_i)$ are estimated as $\frac{f_{t,k}}{\sum_{t \in D_k} f_{t,k}}$ and $\frac{f_{t,i}}{\sum_{t \in D_i} f_{t,i}}$, respectively. $f_{t,k}$ and $f_{t,i}$ are the frequencies of term t in domain $D_k$ and $D_i$, respectively. Domain consensus, on the other hand, is used to measure the appearance of a term in a single document as compared to the overall occurrence in the target domain. The domain consensus of a term t in domain $D_k$ is an entropy defined as

$DC(t, D_k) = \sum_{d \in D_k} P(t|d) \log \frac{1}{P(t|d)}$

where $P(t|d)$ is the probability of encountering term t in document d of domain $D_k$.

• Form concepts: After the list of relevant terms has been identified, concepts and glossary entries from WordNet are employed for associating the terms with existing concepts and for providing definitions. The authors refer to this process as semantic interpretation. If multi-word terms are involved, the approach evaluates all possible sense combinations by intersecting and weighting common semantic patterns in the glossary until it selects the best sense combinations.

• Construct hierarchy: Once semantic interpretation has been performed on the terms to form concepts, taxonomic relations are discovered using hypernyms from WordNet to organise the concepts into domain concept trees.

More information on FETISH is available via http://sourceforge.net/projects/fetishproj/ (last accessed 25 May 2009).

An evaluation of the term extraction technique was performed using the F-measure. A tourism corpus of about 200,000 words was manually constructed from the Web. The evaluation was done by manually inspecting 6,000 of the 14,383 candidate terms, marking all the terms judged to be good domain terms, and comparing the resulting list with the list of terms automatically filtered by the system. A precision of 85.42% and a recall of 52.74% were achieved.

Text-to-Onto

Text-to-Onto [51, 162, 163, 165] is a semi-automated system that is part of an ontology management infrastructure called KAON (http://kaon.semanticweb.org/). KAON is a comprehensive tool suite for ontology creation and management. The authors claimed that the approach has been applied to the tourism and insurance sectors, but no further information was presented. Instead, ontologies for some toy domains have been constructed using this approach (the term toy domain is widely used in the research community to describe work in extremely restricted domains); these ontologies can be downloaded from http://kaon.semanticweb.org/ontologies. Text-to-Onto employs both linguistics and statistics-based techniques in six major tasks to discover terms, concepts, taxonomic relations and non-taxonomic relations.

• Preprocess texts and extract terms: Plain text extraction is performed to extract plain domain texts from semi-structured sources (i.e. HTML documents) and other formats (e.g. PDF documents). Abbreviation expansion is performed on the plain texts using rules and dictionaries to replace abbreviations and acronyms. Part-of-speech tagging and sentence parsing are performed on the preprocessed texts to produce syntactic structures and dependencies. Syntactic structure analysis is performed using weighted finite state transducers to identify important noun phrases as terms. These natural language processing tools are provided by a system called Saarbruecken Message Extraction System (SMES) [184].

• Form concepts: Concepts from a domain lexicon are required to assign new terms to predefined concepts. Unlike other approaches that employ general background knowledge such as WordNet, the lexicon adopted by Text-to-Onto is domain-specific, containing over 120,000 terms. Each term is associated with concepts available in a concept taxonomy.
Other techniques for concept formation, such as co-occurrence analysis, are also employed, but no additional information was provided.

• Construct hierarchy: Once the concepts have been formed, taxonomic relations are discovered by exploiting hypernyms from WordNet. Lexico-syntactic patterns are also employed to identify hypernymy relations in the texts. The authors refer to these hypernyms as an oracle, denoted by H. The projection H(t) returns a set of tuples (x, y), where x is a hypernym for term t and y is the number of times the algorithm has found evidence for it. Using the cosine measure for similarity and the oracle, a bottom-up hierarchical clustering is carried out with a list T of n terms as input. When given two terms which are similar according to the cosine measure, the algorithm orders them as sub-concepts if one is a hypernym of the other. If this is not the case, the most frequent common hypernym h is selected to create a new concept that accommodates both terms as siblings.

• Discover non-taxonomic relations and label non-taxonomic relations: For non-taxonomic relation extraction, association rules together with two user-defined thresholds (i.e. confidence and support) are employed to determine associations between concepts at the right level of abstraction. Typically, users start with low support and confidence to explore general relations, and later increase the values to explore more specific relations. User participation is required to validate and label the non-taxonomic relations.

An evaluation of the relation discovery technique was performed using a measure called the Generic Relations Learning Accuracy (RLA). Given a set of discovered relations D, precision is defined as |D ∩ R|/|D| and recall as |D ∩ R|/|R|, where R is the set of non-taxonomic relations prepared by domain experts. RLA is a measure that captures intuitive notions of relation matches such as utterly wrong, rather bad, near miss and direct hit. RLA is the average accuracy with which the instances of discovered relations match their best counterparts in the manually-curated gold standard. As the learning algorithm is controlled by support and confidence parameters, the evaluation is done by varying the support and confidence values. When both the support and confidence thresholds are set to 0, 8,058 relations were produced with an RLA of 0.51. Both the number of relations and the recall decrease with growing support and confidence. Precision increases at first but drops when so few relations are discovered that almost none is a direct hit. The best RLA of 0.67 is achieved with a support of 0.04 and a confidence of 0.01.

ASIUM
The authors mentioned that the system has been tested by Dassault Aviation, and has been applied on a toy domain using cooking recipe corpora in French. ASIUM employs both linguistics and statistics-based techniques to carry out five tasks to discover terms, concepts and taxonomic relations. • Preprocess texts and discover subcategorisation frames: Sentence parsing is applied on the input text using functionalities provided by a sentence parser called SYLEX [54]. SYLEX produces all interpretations of parsed sentences including attachments of noun phrases to verbs and clauses. Syntactic structure and dependency analysis is performed to extract instantiated subcategorisation frames in the form of <verb><syntactic role|preposition:head noun>∗ where the wildcard character ∗ indicates the possibility of multiple occurrences. • Extract terms and form concepts: The nouns in the arguments of the subcategorisation frames extracted from the previous step are gathered to form basic classes based on the assumption “head words occurring after the same, different prepositions (or with the same, different syntactic roles), and with the same, different verbs represent the same concept” [68]. To illustrate, suppose that we have the nouns “ballpoint pen”, “pencil and “fountain pen” occurring in different clauses as adjunct of the verb “to write” after the preposition “with”. At the same time, these nouns are the direct object of the verb “to purchase”. From the assumption, these nouns are thus considered as variants representing the same concept. • Construct hierarchy: The basic classes from the previous task are successively aggregated to form concepts of the ontology and reveal the taxonomic relations using clustering. Distance between all pairs of basic classes is computed and two basic classes are only aggregated if the distance is less than the threshold set by the user. On the one hand, the distance between two classes containing the same words with same frequencies have the distance 0. On the other hand, a pair of classes without a single common word have distance 1. The clustering 30 Chapter 2. Background algorithm works bottom-up and performs first-best using basic classes as input and builds the ontology level by level. User participation is required to validate each new cluster before it can be aggregated to a concept. An evaluation of the term extraction technique was performed using the precision measure. The evaluation uses texts from the French journal Le Monde that have been manually filtered to ensure the presence of terrorist event descriptions. The results were evaluated by two domain experts who were not aware of the ontology building process using the following indicators: OK if extracted information is correct, FALSE if extracted information is incorrect, NONE if there were no extracted information, and FALSE for all other cases. Two precision values are computed, namely, precision1 which is the ratio between OK and FALSE, and precision2 which is the same as precision1 by taking into consideration NONE. Precision1 and precision2 have the value 86% and 89%, respectively. TextStorm/Clouds TextStorm/Clouds [191, 198] is a semi-automated ontology learning system that is part of an idea sharing and generation system called Dr. Divago [197]. The aim of this approach is to build and refine domain ontology for use in Dr. Divago for searching resources in a multi-domain environment to generate musical pieces or drawings. 
No information was provided on the availability of any real-world applications, nor on testing in toy domains. TextStorm/Clouds employs logic and linguistics-based techniques to carry out six tasks to discover terms, taxonomic relations, non-taxonomic relations and axioms.

• Preprocess texts and extract terms: The part-of-speech information in WordNet is used to annotate the input text. Later, syntactic structure and dependency analysis is performed using an augmented grammar to extract syntactic structures in the form of binary predicates. The Prolog-like binary predicates represent relations between two terms. Two types of binary predicates are considered. The first type captures terms in the form of a subject and an object connected by a main verb. The second type captures the properties of compound nouns, usually in the form of modifiers. For example, the sentence "Zebra eat green grass" will result in two binary predicates, namely, eat(Zebra, grass) and property(grass, green). When working with dependent sentences, finding the concepts may not be straightforward, and this approach performs anaphora resolution to resolve ambiguities. The anaphora resolution uses a history list of discourse entities generated from preceding sentences [6]. In the presence of an anaphora, the most recent entities are given higher priority.

• Construct hierarchy, discover non-taxonomic relations and label non-taxonomic relations: Next, the binary predicates are employed to gradually aggregate terms and relations into an existing ontology with user participation. Hypernymy relations appear in binary predicates of the form is-a(X,Y), while part-of(X,Y) and contain(X,Y) provide good indicators for meronyms. Attribute-value relations are obtainable from predicates of the form property(X,Y). During the aggregation process, users may be required to introduce new predicates to connect certain terms and relations to the ontology. For example, in order to attach the predicate is-a(predator, animal) to an ontology with the root node living entity, the user will have to introduce is-a(animal, living entity).

• Extract axioms: The approach employs inductive logic programming to learn regularities by observing recurrent concepts and relations in the predicates. For instance, given the extracted predicates below

    1: is-a(panther, carnivore)
    2: eat(panther, zebra)
    3: eat(panther, gazelle)
    4: eat(zebra, grass)
    5: is-a(zebra, herbivore)
    6: eat(gazelle, grass)
    7: is-a(gazelle, herbivore)

the approach will arrive at the conclusions that

    1: eat(A, zebra) :- is-a(A, carnivore)
    2: eat(A, grass) :- is-a(A, herbivore)

These axioms describe relations between concepts in terms of their context (i.e. the set of neighbourhood connections that the arguments have).

Using the accuracy measure, the performance of the binary predicate extraction task was evaluated to determine whether the relations hold between the corresponding concepts. A total of 21 articles from the scientific domain were collected and analysed by the system. Domain experts then determined the coherence of the predicates and their accuracy with respect to the corresponding input text. The authors reported an average accuracy of 52%.

SYNDIKATE

SYNDIKATE [96, 95] is a stand-alone automated ontology learning system. The authors have applied this approach in two toy domains, namely, information technology and medicine. However, no information was provided on the availability of any real-world applications.
SYNDIKATE employs purely linguistics-based techniques to carry out five tasks to discover terms, concepts, taxonomic relations and non-taxonomic relations.

• Extract terms: Syntactic structure and dependency analysis is performed on the input text using a lexicalised dependency grammar to capture binary valency constraints (valency refers to the capacity of a verb to take a specific number and type of arguments, i.e. noun phrase positions) between a syntactic head (e.g. noun) and possible modifiers (e.g. determiners, adjectives). In order to establish a dependency relation between a head and a modifier, the term order, morpho-syntactic feature compatibility and semantic criteria have to be met. Anaphora resolution based on the centering model is included to handle pronouns.

• Form concepts, construct hierarchy, discover non-taxonomic relations and label non-taxonomic relations: Using predefined semantic templates, each term in the syntactic dependency graph is associated with a concept in the domain knowledge and, at the same time, used to instantiate the text knowledge base. The text knowledge base is essentially an annotated representation of the input texts. For example, the term "hard disk" in the graph is associated with the concept HARD DISK in the domain knowledge and, at the same time, an instance called HARD DISK3 is created in the text knowledge base. The approach then tries to find all relational links between the conceptual correlates of two words in the subgraph if both grammatical and conceptual constraints are fulfilled. The linkage may be constrained by dependency relations, by intervening lexical material, or by conceptual compatibility between the concepts involved. In cases where unknown words occur, semantic interpretation of the dependency graph involving unknown lexical items in the text knowledge base is employed to derive concept hypotheses. The structural patterns of consistency, mutual justification and analogy relative to the concept descriptions already available in the text knowledge base are used as initial evidence to create linguistic and conceptual quality labels. An inference engine is then used to estimate the overall credibility of the concept hypotheses by taking the quality labels into account.

An evaluation using the precision, recall and accuracy measures was conducted to assess the concepts and relations extracted by this system. The use of semantic interpretation to discover the relations between conceptual correlates yielded 57% recall and 97% precision, and 31% recall and 94% precision, for medicine and information technology texts, respectively. As for the formation of concepts, an accuracy of 87% was achieved. The authors also presented the performance of other aspects of the system. For example, sentence parsing in the system exhibits linear time complexity while a third-party parser runs in exponential time complexity. This behaviour is caused by the latter's ability to cope with ungrammatical input. The incompleteness of the system's parser results in a 10% loss of structural information as compared to the complete third-party parser.
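To make the template-driven instantiation described above more concrete, the following is a minimal sketch of how parsed terms might be mapped to concepts and instantiated in a text knowledge base. The lexicon table, counters and function names are hypothetical and do not reflect SYNDIKATE's actual machinery; they only illustrate the general idea of creating numbered concept instances for recognised terms.

    from collections import defaultdict

    # Hypothetical lexicon-to-concept table and per-concept instance counters.
    CONCEPT_OF = {"hard disk": "HARD_DISK", "notebook": "NOTEBOOK"}
    _counts = defaultdict(int)
    knowledge_base = []   # instances of the form (instance_id, concept, relations)

    def instantiate(term, relations=None):
        """Create a knowledge-base instance for a term covered by the lexicon."""
        concept = CONCEPT_OF.get(term.lower())
        if concept is None:
            return None   # unknown word: would instead trigger concept-hypothesis generation
        _counts[concept] += 1
        instance_id = f"{concept}{_counts[concept]}"
        knowledge_base.append((instance_id, concept, dict(relations or {})))
        return instance_id

    disk = instantiate("hard disk")
    laptop = instantiate("notebook", {"HAS-PART": disk})
    print(knowledge_base)   # e.g. [('HARD_DISK1', 'HARD_DISK', {}), ('NOTEBOOK1', ...)]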
2.3.2 Recent Advances in Ontology Learning Since the publication of the three survey papers [62, 226, 90], the research activities within the ontology learning community have been mainly focusing on (1) the advancement of relation acquisition techniques, (2) the automatic labelling of concept and relation, (3) the use of structured and unstructured Web data for relation acquisition, and (4) the diversification of evidence for term recognition. On the advancement of relation acquisition techniques, Specia & Motta [237] presented an approach for extracting semantic relations between pairs of entities from texts. The approach makes use of a lemmatiser, syntactic parser, part-of-speech tagger, and word sense disambiguation models for language processing. New entities are recognised using a named-entity recognition system. The approach also relies on a domain ontology, a knowledge base, and lexical databases. Extracted entities that exist in the knowledge base are semantically annotated with their properties. Ciaramita et al. [48] employ syntactic dependencies as potential relations. The dependency paths are treated as bi-grams, and scored with statistical measures of correlation. At the same time, the arguments of the relations can be generalised to obtain abstract concepts using algorithms for Selectional Restrictions Learning [208]. Snow et al. [234, 235] also presented an approach that employs the dependency paths extracted from parse trees. The approach receives trainings using sets of text containing known hypernym pairs. The approach then automatically discovers 33 34 Chapter 2. Background useful dependency paths that can be applied to new corpora for identifying new hypernyms. On the automatic concept and relation labelling, Kavalec & Svatek [123] studied the feasibility of label identification for relations using semantically-tagged corpus and other background knowledge. The authors suggested that the use of verbs, identified through part-of-speech tagging, can be viewed as a rough approximation of relation labels. With the help of semantically-tagged corpus to resolve the verbs to the correct word sense, the quality of relations labelling may be increased. In addition, the authors also suggested that abstract verbs identified through generalisation via WordNet can be useful labels. Jones [119] proposed in her PhD research a semi-automated technique for identifying concepts and simple technique for labelling concepts using user-defined seed words. This research was carried out exclusively using small lists of words as input. In another PhD research by Rosario [211], the author proposed the use of statistical semantic parsing to extract concepts and relations from bioscience text. In addition, the research presented the use of statistical machine learning techniques to build a knowledge representation of the concepts. The concepts and relations extracted by the proposed approach are intended to be combined by some other systems to produce larger propositions which can then be used in areas such as abductive reasoning or inductive logic programming. This approach has only been tested with a small amount of data from toy domains. On the use of Web data for relation acquisition, Sombatsrisomboon [236] proposed a simple 3-step technique for discovering taxonomic relations (i.e. hypernym/hyponym) between pairs of terms using search engines. Search engine queries are first constructed using the term pairs and patterns such as X is a/an Y. 
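To illustrate how such pattern-based probes might be assembled, the sketch below builds quoted hypernymy queries of the kind just described. The pattern strings and function are hypothetical and are not the cited author's exact implementation; they simply show how a term pair can be turned into search engine queries.

    HYPERNYM_TEMPLATE = '"{x} is {article} {y}"'
    ENUMERATION_TEMPLATE = '"{y} such as {x}"'

    def probe_queries(x: str, y: str) -> list:
        """Quoted search engine queries that would support a hypernym(y, x) link."""
        article = "an" if y[0].lower() in "aeiou" else "a"
        return [
            HYPERNYM_TEMPLATE.format(x=x, article=article, y=y),
            ENUMERATION_TEMPLATE.format(x=x, y=y),
        ]

    print(probe_queries("salmonella", "bacterium"))
    # ['"salmonella is a bacterium"', '"bacterium such as salmonella"']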
The webpages provided by search engines are then gathered to create a small corpus. Sentence parsing and syntactic structure analysis is performed on the corpus to discover taxonomic relations between the terms. Such use of Web data redundancy and patterns can also be extended to discover non-taxonomic relations. Sanchez & Moreno [217] proposed methods for discovering non-taxonomic relations using Web data. The authors developed a technique for learning domain patterns using domain-relevant verb phrases extracted from webpages provided by search engines. These domain patterns are then used to extract and label non-taxonomic relations using linguistic and statistical analysis. There is also an increasing interest in the use of structured Web data such as Wikipedia for relation acquisition. Pei et al. [196] proposed an approach for constructing ontologies using Wikipedia. The approach uses a two-step technique, namely, name mapping and logic-based mapping, to deduce the type of relations between concepts in Wikipedia. Similarly, Liu et al. [154] developed a technique called Catriple for automatically extracting triples using Wikipedia’s categorical system. The approach focuses on category pairs containing both an explicit property and an explicit value (e.g. “Category:Songs by artist”-“Category:The Beatles songs”, where “artist” is the property and “The Beatles” is the value), and category pairs containing an explicit value but an implicit property (e.g. “Category:Rock songs”-“Category:British rock songs”, where “British” is a value with no property). Sentence parsers and syntactic rules are used to extract the explicit properties and values from the category names. Weber & Buitelaar [267] proposed a system called Information System for Ontology Learning and Domain Exploration (ISOLDE) for deriving domain ontologies using manually-curated text corpora, a general-purpose named-entity tagger, and structured data on the Web (i.e. Wikipedia, Wiktionary and a German online dictionary known as DWDS). On the diversification of evidence for term recognition, Sclano & Velardi [222] developed a system called TermExtractor for identifying relevant terms in two steps. TermExtractor uses a sentence parser to parse texts and extract syntactic structures such as noun compounds, and ADJ-N and N-PREP-N sequences. The list of term candidates is then ranked and filtered using a combination of measures for realising different evidence, namely, Domain Pertinence (DP), Domain Consensus (DC), Lexical Cohesion (LC) and Structural Relevance (SR). Wermter & Hahn [269] incorporated a linguistic property of terms as evidence, namely, limited paradigmatic modifiability, into an algorithm for extracting terms. The property of paradigmatic modifiability is concerned with the extent to which the constituents of a multi-word term can be modified or substituted. The more we are able to substitute the constituents by other words, the less probable it is that the corresponding multi-word lexical unit is a term. There is also increasing interest in automatically constructing the text corpora required for term extraction using Web data. Agbago & Barriere [4] proposed the use of richness estimators to assess the suitability of webpages provided by search engines for constructing corpora for use by terminologists. Baroni & Bernardini [15] developed the BootCat technique for bootstrapping text corpora and terms using Web data and search engines.
The technique requires as input a set of seed terms. The seeds are used to build a corpus using webpages suggested by search engines. New terms are then extracted from the initial corpus, which in turn are used as seeds to build larger corpora. There are several other recent advances that fall outside the aforementioned 35 36 Chapter 2. Background groups. Novacek & Smrz [187] developed two frameworks for bottom-up generation and merging of ontologies called OLE and BOLE. The latter is a domain-specific adaptation of the former for learning bio-ontologies. OLE is designed and implemented as a modular framework consisting of several components for providing solutions to different tasks in ontology learning. For example, the OLITE module is responsible for preprocessing plain text and creating mini ontologies. PALEA is a module responsible for extracting new semantic relation patterns while OLEMAN merges the mini ontologies resulting from the OLITE module and updates the base domain ontology. The authors mentioned that any techniques for automated ontology learning can be employed as an independent part of any modules. Another research which contributed from a systemic point of view is the CORPORUM system [72]. OntoExtract is part of the CORPORUM OntoBuilder toolbox that analyses natural language texts for generating lightweight ontologies. No specific detail was provided regarding the techniques employed by OntoExtract. The author [72] merely mentioned that OntoExtract uses a repository of background knowledge to parse, tokenise and analyse texts on both the lexical and syntactic level, and generates nodes and relations between key terms. Liu et al. [156] presented an approach to semi-automatically extend and refine ontologies using text mining techniques. The approach make use of news from media sites to expand a seed ontology by first creating a semantic network through co-occurrence analysis, trigger phrase analysis, and disambiguation based on the WordNet lexical dictionary. Spreading activation is then applied on the resulting semantic network to find the most probable candidates for inclusion in the extended ontology. 2.4 Applications of Ontologies Ontologies are an important part of the standard stack for the Semantic Web6 by the World Wide Web Consortium (W3C). Ontologies are used to exchange data among multiple heterogeneous systems, provide services in an agent-based environment, and promote the reusability of knowledge bases. While the dream of realising the Semantic Web is still years away, ontologies have already found their ways into a myriad of applications such as document retrieval, question answering, image retrieval, agent interoperability and document annotation. Some of the research areas which have found use for ontologies are: • Document retrieval: Paralic & Kostial [194] developed an document retrieval 6 http://www.w3.org/2001/sw/ 2.4. Applications of Ontologies 37 system based on the use of ontologies. The authors demonstrated that the retrieval precision and recall of the ontology-based information retrieval system outperforms techniques based on latent semantic indexing and full-text search. The system registers every new document to several concepts in the ontology. Whenever retrieval requests arrive, resources are retrieved based on the associations between concepts, and not on partial or exact term matching. Similarly, Vallet et al. [255] and Castells et al. [39] proposed a model that uses knowledge in ontologies for improving document retrieval. 
The retrieval model includes an annotation weighting algorithm and a ranking algorithm based on the classic vector-space model. Keyword-based search is incorporated into their approach to ensure robustness in the event of incompleteness of ontology. • Question answering: Atzeni et al. [10] reported the development of an ontologybased question answering system for the Web sites of two European universities. The system accepts questions and produces answers in natural language. The system is being investigated in the context of an European Union project called MOSES. • Image retrieval: Hyvonen et al. [111] developed a system that uses ontologies to assist image retrieval. Images are first annotated with concepts in an ontology. Users are then presented with the same ontology to facilitate focused image retrieval and the browsing of semantically-related images using the right concepts. • Multi-agent system interoperability: Malucelli & Oliveira [167] proposed an ontology-based service to assist the communication and negotiation between agents in a decentralised and distributed system architecture. The agents typically have their own heterogeneous private vocabularies. The service uses a central ontology agent to monitor and lead the communication process between the heterogeneous agents without having to map all the ontologies involved. • Document annotation: Corcho [55] surveyed several approaches for annotating webpages with ontological elements for improving information retrieval. Many of the approaches described in the survey paper rely on manually-curated ontologies for annotation using a variety of tools such as SHOE Annotator7 7 http://www.cs.umd.edu/projects/plus/SHOE/KnowledgeAnnotator.html 38 Chapter 2. Background [103], CREAM [101], MnM8 [258], and OntoAnnotate [241]. In addition to the above-mentioned research areas, ontologies have also been deployed in certain applications across different domains. One of the most successful application area of ontologies is bioinformatics. Bioinformatics have thrived on the advances in ontology learning techniques and the availability of manually-curated terminologies and ontologies (e.g. Unified Medical Language System [151], Gene Ontology [8] and other small domain ontologies at www.obofoundry.org). The computable knowledge in ontologies is also proving to be a valuable resource for reasoning and knowledge discovery in biomedical decision support systems. For example, the inference that a disease of the myocardium is a heart problem is possible using the subsumption relations in an ontology of disease classification based on anatomic locations [52]. In addition, terminologies and ontologies are commonly used for annotating biological datasets, biomedical literature and patient records, and improving the access and retrieval of biomedical information [27]. For instance, Baker et al. [13] presented a document query and delivery system for the field of lipidomics9 . The main aim of the system is to overcome the navigation challenges that hinder the translation of scientific literatures into actionable knowledge. The system allows users to access tagged documents containing lipid, protein and disease names using description logic-based query capability that comes with the semiautomatically created lipid ontology. The lipid ontology contains a total of 672 concepts. 
The ontology is the result of merging existing biological terminologies, knowledge from domain experts, and output from a customised text mining system that recognises lipid-specific nomenclature. (Lipidomics is the study of pathways and networks of cellular lipids in biological systems.) Another visible application of ontologies is in the manufacturing industry. Cho et al. [43] looked at the current approach for locating and comparing parts information in an e-procurement setting. At present, buyers are faced with the challenge of accessing and navigating through different parts libraries from multiple suppliers using different search procedures. The authors introduced the use of the “Parts Library Concept Ontology” to integrate heterogeneous parts libraries, enabling the consistent identification and systematic structuring of domain concepts. Lemaignan et al. [143] presented a proposal for a manufacturing upper ontology. The authors stressed the importance of ontologies as a common way of describing manufacturing processes for product lifecycle management. The use of ontologies ensures uniformity in assertions throughout a product’s lifecycle, and the seamless flow of data between heterogeneous manufacturing environments. For instance, assume that we have these relations in an ontology:
isMadeOf(part, rawMaterial)
isA(aluminium, rawMaterial)
isA(drilling, operation)
isMachinedBy(rawMaterial, operation)
and the drilling operation has the attributes drillSpeed and drillDiameter. Using these elements, we can easily specify rules such as: if isMachinedBy(aluminium, drilling) and drillDiameter is less than 5 mm, then drillSpeed should be 3000 rpm [143]. This ontology allows a uniform interpretation of assertions such as isMadeOf(part, aluminium) anywhere along the product lifecycle, thus facilitating the inference of standard information such as the drill speed.
2.5 Chapter Summary
In this chapter, an overview of ontologies and ontology learning from text was provided. In particular, we looked at the types of output, techniques and evaluation methods related to ontology learning. The differences between a heavyweight (i.e. formal) and a lightweight ontology were also explained. Several prominent ontology learning systems and some recent advances in the field were summarised. Finally, some current notable applications were included to demonstrate the applicability of ontologies to a wide range of domains. The use of ontologies for real-world applications in the areas of bioinformatics and the manufacturing industry was highlighted. Overall, it was concluded that the automatic and practical construction of full-fledged formal ontologies from text across different domains is currently beyond the reach of conventional systems. Many current ontology learning systems are still struggling to achieve high-performance term recognition, let alone more complex tasks (e.g. relation acquisition, axiom learning). An interesting point revealed during the literature review is that most systems ignore the fact that the static background knowledge relied upon by their techniques is a scarce resource and may not have adequate size and coverage. In particular, all existing term recognition techniques rest on the false assumption that the required domain corpora will always be available. Only recently has there been growing interest in automatically constructing text corpora using Web data.
However, the governing philosophy behind 40 Chapter 2. Background these existing corpus construction techniques is inadequate for creating very large high-quality text corpora. In regard to relation acquisition, existing techniques rely heavily on static background knowledge, especially semantic lexicon, such as WordNet. While there is an increasing interest in the use of dynamic Web data for relation acquisition, more research work is still required. For instance, new techniques are appearing every now and then that make use of Wikipedia for finding semantic relations between two words. However, these techniques often leave out the details on how to cope with words that do not appear in Wikipedia. Moreover, the use of clustering techniques for acquiring semantic relations may appear less attractive due to the complications in feature extraction and preparation. The literature review also exposes the lack of treatment for data cleanliness during ontology learning. As the use of Web data becomes more common, integrated techniques for removing noises in texts are turning into a necessity. All in all, it is safe to conclude that there is currently no single system that systematically uses dynamic Web data to meet the requirements for every stage of the ontology learning process. There are several key areas that require more attention, namely, (1) integrated techniques for cleaning noisy text, (2) high-performance term recognition techniques, (3) high-quality corpus construction for term recognition, and (4) dynamic Web data for clustering and relation acquisition. Our proposed ontology learning system is designed specifically to address these key areas. In the subsequent six chapters (i.e. Chapter 3 to 8), details are provided on the design, development and testing of novel techniques for the five phases (i.e. text preprocessing, text processing, term recognition, corpus construction, relation acquisition) of the proposed system. CHAPTER 3 Text Preprocessing Abstract An increasing number of ontology learning systems are gearing towards the use of online sources such as company intranet and the World Wide Web. Despite such rise, not much work can be found in aspects of preprocessing and cleaning noisy texts from online sources. This chapter presents an enhancement of the Integrated Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration (ISSAC) technique. ISSAC is implemented as part of the text preprocessing phase in an ontology learning system. New evaluations performed on the enhanced ISSAC using 700 chat records reveal an improved accuracy of 98% as compared to 96.5% and 71% based on the use of basic ISSAC and of Aspell, respectively. 3.1 Introduction Ontology is gaining applicability across a wide range of applications such as information retrieval, knowledge acquisition and management, and the Semantic Web. The manual construction and maintenance of ontologies was never a long-term solution due to factors such as the high cost of expertise and the constant change in knowledge. These factors have prompted an increasing effort in automatic and semiautomatic learning of ontologies using texts from electronic sources. A particular source of text that is becoming popular is the World Wide Web. The quality of texts from online sources for ontology learning can vary anywhere between noisy and clean. On the one hand, the quality of texts in the form of blogs, emails and chat logs can be extremely poor. 
The sentences in noisy texts are typically full of spelling errors, ad-hoc abbreviations and improper casing. On the other hand, clean sources are typically prepared and conformed to certain standards such as those in the academia and journalism. Some common clean sources include news articles from online media sites, and scientific papers. Different text quality requires different treatments during the preprocessing phase and noisy texts can be much more demanding. An increasing number of approaches are gearing towards the use of online sources 0 This chapter appeared in the Proceedings of the IJCAI Workshop on Analytics for Noisy Unstructured Text Data (AND), Hyderabad, India, 2007, with the title “Enhanced Integrated Scoring for Cleaning Dirty Texts”. 41 42 Chapter 3. Text Preprocessing such as corporate intranet [126] and search engines retrieved documents [51] for different aspects of ontology learning. Despite such growth, only a small number of researchers [165, 187] acknowledge the effect of text cleanliness on the quality of their ontology learning output. With the prevalence of online sources, this “...annoying phase of text cleaning...”[176] has become inevitable and ontology learning systems can no longer ignore the issue of text cleanliness. An effort by Tang et al. [246] showed that the accuracy of term extraction in text mining improved by 38-45% (F1 -measure) with the additional cleaning performed on the input texts (i.e. emails). Integrated techniques for correcting spelling errors, abbreviations and improper casing are becoming increasingly appealing as the boundaries between different errors in online sources are blurred. Along the same line of thought, Clark [53] defended that “...a unified tool is appropriate because of certain specific sorts of errors”. To illustrate this idea, consider the error word “cta”. Do we immediately take it as a spelling error and correct it as “cat”, or is there a problem with the letter casing, which makes it a probable acronym? It is obvious that the problems of spelling error, abbreviation and letter casing are inter-related to a certain extent. The challenge of providing a highly accurate integrated technique for automatically cleaning noisy text in ontology learning remains to be addressed. In an effort to provide an integrated technique to solve spelling errors, ad-hoc abbreviations and improper casing simultaneously, we have developed an Integrated Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration (ISSAC) 1 technique [273]. The basic ISSAC uses six weights from different sources for automatically correcting spelling error, expanding abbreviations and restoring improper casing. These includes the original rank by the spell checker Aspell [9], reuse factor, abbreviation factor, normalised edit distance, domain significance and general significance. Despite the achievement of 96.5% in accuracy by the basic ISSAC, several drawbacks have been identified that require additional work. In this chapter, we present the enhancement of the basic ISSAC. New evaluations performed on seven different sets of chat records yield an improved accuracy of 98% as compared to 96.5% and 71% based on the use of basic ISSAC and of Aspell, respectively. In Section 2, we present a summary of work related to spelling error detection and correction, abbreviation expansion, and other cleaning tasks in general. 
In Section 3, we summarise the basic ISSAC. In Section 4, we propose the enhancement strategies for ISSAC. The evaluation results and discussions are presented in Section 5. We summarise and conclude this chapter with a future outlook in Section 6. (The foundation work on ISSAC appeared in the Proceedings of the 5th Australasian Conference on Data Mining (AusDM), Sydney, Australia, 2006, with the title “Integrated Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration in Dirty Text”.)
3.2 Related Work
Spelling error detection and correction is the task of recognising misspellings in texts and providing suggestions for correcting the errors. For example, detecting “cta” as an error and suggesting that the error be replaced with “cat”, “act” or “tac”. More information is usually required to select a correct replacement from a list of suggestions. Two of the most studied classes of techniques are minimum edit distance and similarity key. The idea of minimum edit distance techniques began with Damerau [58] and Levenshtein [146]. The Damerau-Levenshtein distance is the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other. For example, changing the word “wear” to “beard” requires a minimum of two operations, namely, a substitution of ‘w’ with ‘b’, and an insertion of ‘d’. Many variants were developed subsequently, such as the algorithm by Wagner & Fischer [266]. The second class of techniques is the similarity key. The main idea behind similarity key techniques is to map every string into a key such that similarly spelt strings will have identical keys [135]. Hence, the key, computed for each spelling error, will act as a pointer to all similarly spelt words (i.e. suggestions) in the dictionary. One of the earliest implementations is the SOUNDEX system [189]. SOUNDEX is a phonetic algorithm for indexing words based on their pronunciation in English. SOUNDEX works by mapping a word into a key consisting of its first letter followed by a sequence of numbers. For example, SOUNDEX replaces a letter li ∈ {A, E, I, O, U, H, W, Y} with 0 and li ∈ {R} with 6, and hence, wear → w006 → w6 and ware → w060 → w6. Since SOUNDEX, many improved variants have been developed, such as the Metaphone and Double-Metaphone algorithms [199], the Daitch-Mokotoff Soundex [138] for Eastern European languages, and others [108]. One well-known implementation that utilises the similarity key technique is Aspell [9]. Aspell is based on the Metaphone algorithm and the near-miss strategy from its predecessor Ispell [134]. Aspell begins by converting a misspelt word to its soundslike equivalent (i.e. metaphone) and then finding all words that have a soundslike within one or two edit distances from the original word’s soundslike (see http://aspell.net/man-html/Aspell-Suggestion-Strategy.html). These soundslike words are the basis of the suggestions by Aspell.
Most of the work in detecting and correcting spelling errors, and in expanding abbreviations, is carried out separately. The task of abbreviation expansion deals with recognising shorter forms of words (e.g. “abbr.” or “abbrev.”), acronyms (e.g. “NATO”) and initialisms (e.g. “HTML”, “FBI”), and expanding them to their corresponding words. The work on detecting and expanding abbreviations is mostly conducted in the realm of named-entity recognition and word-sense disambiguation.
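As a concrete illustration of the minimum edit distance idea discussed above, the following is a generic, textbook-style sketch of the Damerau-Levenshtein distance (insertions, deletions, substitutions and adjacent transpositions); it is not the implementation used by Aspell or by ISSAC.

```python
# An illustrative implementation of the Damerau-Levenshtein distance: the
# minimal number of insertions, deletions, substitutions and adjacent
# transpositions needed to turn one string into another. This is a generic
# sketch, not the implementation used in this thesis.

def damerau_levenshtein(a: str, b: str) -> int:
    m, n = len(a), len(b)
    # d[i][j] = distance between the first i characters of a and first j of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost  # substitution
            )
            # adjacent transposition (the Damerau extension)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
    return d[m][n]

if __name__ == "__main__":
    print(damerau_levenshtein("wear", "beard"))  # 2: substitute 'w'->'b', insert 'd'
    print(damerau_levenshtein("cta", "cat"))     # 1: transpose 't' and 'a'
```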
The technique presented by Schwartz & Hearst [221] begins with the extraction of all abbreviations and definition candidates based on the adjacency to parentheses. A candidate is considered as the correct definition for an abbreviation if they appears in the same sentence, and the candidate has no more than min(|A| + 5, |A| ∗ 2) words, where |A| is the number of characters in an abbreviation A. Park & Byrd [195] presented an algorithm based on rules and heuristics for extracting definitions for abbreviations from texts. Several factors are employed in this technique such as syntactic cues, priority of rules, distance between abbreviation and definition and word casing. Pakhomov [193] proposed a semi-supervised technique that employs a hand-crafted table of abbreviations and their definitions for training a maximum entropy classifier. For case restoration, improper letter casings in words are detected and restored. For example, detecting the letter ‘j’ in “jones” as improper and correcting the word to produce “Jones”. Lita et al. [153] presented an approach for restoring cases based on the context in which the word exists. The approach first captures the context surrounding a word and approximates the meaning using ngrams. The casing of the letters in a word will depend on the most likely meaning of the sentence. Mikheev [176] presented a technique for identifying sentence boundaries, disambiguating capitalised words and identifying abbreviations using a list of common words. The technique can be described in four steps: identify abbreviations in texts, disambiguate ambiguously capitalised words, assign unambiguous sentence boundaries and disambiguate sentence boundaries if an abbreviation is followed by a proper name. In the context of ontology learning and other related areas such as text mining, spelling error correction and abbreviation expansion are mainly carried out as part of the text preprocessing (i.e. text cleaning, text normalisation) phase. Some other common tasks in text preprocessing include plain text extraction (i.e. format conversion, HTML/XML tag stripping, table identification [185]), sentence boundary detection [243], case restoration [176], part-of-speech tagging [33] and sentence 3 Some researchers refer to this relationship as abbreviation and definition or short-form and long-form. 3.3. Basic ISSAC as Part of Text Preprocessing parsing [149]. A review by Gomez-Perez & Manzano-Macho [90] showed that nearly all ontology learning systems in the survey perform only shallow linguistic analysis such as part-of-speech tagging during the text preprocessing phase. These existing systems require the input to be clean and hence, the techniques for correcting spelling errors, expanding abbreviations and restoring cases are considered as unnecessary. Ontology learning systems such as Text-to-Onto [165] and BOLE [187] are the few exceptions. In addition to shallow linguistic analysis, these systems incorporate some cleaning tasks. Text-to-Onto extracts plain text from various formats such as PDF, HTML, XML, and identifies and replaces abbreviations using substitution rules based on regular expressions. The text preprocessing phase of BOLE consists of sentence boundary detection, irrelevant sentence elimination and text tokenisation using Natural Language Toolkit (NLTK). In a text mining system for extracting topics from chat records, Castellanos [38] presented a comprehensive list of text preprocessing techniques. 
The system employs a thesaurus, constructed using the Smith-Waterman algorithm [233], for correcting spelling errors and identifying abbreviations. In addition, the system removes program code from texts, and detects sentence boundaries based on simple heuristics (e.g. shorter lines in program code, and punctuation marks followed by an upper case letter). Tang et al. [246] presented a cascaded technique for cleaning emails prior to text mining. The technique is composed of four passes: non-text filtering for eliminating irrelevant data such as email headers, sentence normalisation, case restoration, and spelling error correction for transforming relevant text into canonical form.
Many of the techniques mentioned above perform only one out of the three cleaning tasks (i.e. spelling error correction, abbreviation expansion, case restoration). In addition, the evaluations conducted to obtain the accuracy are performed in different settings (e.g. no common benchmark, test data or agreed measure of accuracy). Hence, it is not possible to compare these different techniques based on the accuracy reported in the respective papers. As pointed out earlier, only a small number of integrated techniques are available for handling all three tasks. Such techniques are usually embedded as part of a larger text preprocessing module. Consequently, evaluations of the individual cleaning tasks in such environments are not available.
3.3 Basic ISSAC as Part of Text Preprocessing
ISSAC was designed and implemented as part of the text preprocessing phase in an ontology learning system that uses chat records as input. The use of chat records has required us to place more effort on ensuring text cleanliness during the preprocessing phase. Figure 3.1 highlights the various spelling errors, ad-hoc abbreviations and improper casing that occur much more frequently in chat records than in clean texts.
Figure 3.1: Examples of spelling errors, ad-hoc abbreviations and improper casing in a chat record.
Prior to spelling error correction, abbreviation expansion and case restoration, three tasks are performed as part of the text preprocessing phase. Firstly, plain text extraction is conducted to remove HTML and XML tags from the chat records using regular expressions and the Perl modules XML::Twig (http://search.cpan.org/dist/XML-Twig-3.26/) and HTML::Strip (http://search.cpan.org/dist/HTML-Strip-1.06/). Secondly, identification of URLs, emails, emoticons and tables is performed (an emoticon, also called a smiley, is a sequence of ordinary printable characters or a small image intended to represent a human facial expression and convey an emotion). Such information is extracted and set aside for assisting in other business intelligence analysis. Tables are removed using the signatures of a table, such as multiple spaces between words and words aligned in columns over multiple lines [38]. Thirdly, sentence boundary detection is performed using the Lingua::EN::Sentence Perl module (http://search.cpan.org/dist/Lingua-EN-Sentence/).
Once these tasks are complete, each sentence in the input text (e.g. chat record) is tokenised to obtain a set of words T = {t_1, ..., t_w}. The set T is then fed into Aspell. For each word e that Aspell considers as erroneous, a list of ranked suggestions S is produced. Initially,
S = {s_{1,1}, ..., s_{n,n}} is an ordered list of n suggestions, where s_{j,i} is the j-th suggestion with rank i (a smaller i indicates higher confidence in the suggested word). If e appears in the abbreviation dictionary, the list S is augmented by adding all the corresponding m expansions in front of S as additional suggestions with rank 1. In addition, the error word e is appended at the end of S with rank n + 1. These augmentations produce an extended list S = {s_{1,1}, ..., s_{m,1}, s_{m+1,1}, ..., s_{m+n,n}, s_{m+n+1,n+1}}, which is a combination of m suggestions from the abbreviation dictionary (if e is a potential abbreviation), n suggestions by Aspell, and the error word e itself. Placing the error word e back into the list of possible replacements serves one purpose: to ensure that if no better replacement is available, we keep the error word e as it is.
Once the extended list S is obtained, each suggestion s_{j,i} is re-ranked using ISSAC. The new score for the j-th suggestion with original rank i is defined as

NS(s_{j,i}) = i^{-1} + NED(e, s_{j,i}) + RF(e, s_{j,i}) + AF(s_{j,i}) + DS(l, s_{j,i}, r) + GS(l, s_{j,i}, r)

where
• NED(e, s_{j,i}) ∈ (0, 1] is the normalised edit distance, defined as (ED(e, s_{j,i}) + 1)^{-1}, where ED is the minimum edit distance between e and s_{j,i}.
• RF(e, s_{j,i}) ∈ {0, 1} is the boolean reuse factor for providing more weight to a suggestion s_{j,i} that has been previously used for correcting the error e. The reuse factor is obtained through a lookup against a history list that ISSAC keeps to record previous corrections. RF(e, s_{j,i}) yields a factor of 1 if the error e has been previously corrected with s_{j,i}, and 0 otherwise.
• AF(s_{j,i}) ∈ {0, 1} is the abbreviation factor for denoting that s_{j,i} is a potential abbreviation. Through a lookup against the abbreviation dictionary, AF(s_{j,i}) yields a factor of 1 if the suggestion s_{j,i} exists in the dictionary, and 0 otherwise. When the scoring process takes place and the corresponding expansions for potential abbreviations are required, www.stands4.com is consulted. A copy of the expansion is stored in a local abbreviation dictionary for future reference.
• DS(l, s_{j,i}, r) ∈ [0, 1] measures the domain significance of the suggestion s_{j,i} based on its appearance in the domain corpora, taking into account the neighbouring words l and r. This domain significance weight is inspired by the TF-IDF [210] measure commonly used for information retrieval. The weight is defined as the ratio between the frequency of occurrence of s_{j,i} (individually, and within l and r) in the domain corpora and the sum of the frequencies of occurrence of all suggestions (individually, and within l and r).
• GS(l, s_{j,i}, r) ∈ [0, 1] measures the general significance of the suggestion s_{j,i} based on its appearance in the general collection (e.g. webpages indexed by the Google search engine). The purpose of this general significance weight is similar to that of the domain significance. In addition, the use of dynamic Web data allows ISSAC to cope with language change in a way that is not possible with static corpora and Aspell. The weight is defined as the ratio between the number of documents in the general collection containing s_{j,i} within l and r and the number of documents in the general collection that contain s_{j,i} alone.
Both the ratios in DS and GS are offset by a measure similar to that of the IDF [210]. For further details on DS and GS, please refer to Wong et al. [273].
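The following is a minimal sketch, under simplifying assumptions, of how the combined score NS(s_{j,i}) above might be assembled and used to pick a replacement. The history list, abbreviation dictionary and the DS/GS functions are stand-ins supplied by the caller; in the actual system the suggestions come from Aspell, and the significance weights are computed from the domain corpora and Web page counts.

```python
# A minimal sketch of the combined ISSAC score NS(s_{j,i}). The dictionaries
# and the domain/general significance functions are simplified stand-ins; in
# the actual system suggestions come from Aspell, and DS/GS are computed from
# the domain corpora and Web page counts respectively.

def edit_distance(a: str, b: str) -> int:
    # plain Levenshtein distance (insertions, deletions, substitutions)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def issac_score(error, suggestion, rank, left, right,
                history, abbreviations, domain_sig, general_sig):
    """NS = i^-1 + NED + RF + AF + DS + GS for one suggestion."""
    ned = 1.0 / (edit_distance(error, suggestion) + 1)        # NED in (0, 1]
    rf = 1.0 if history.get(error) == suggestion else 0.0     # reuse factor
    af = 1.0 if suggestion in abbreviations else 0.0          # abbreviation factor
    ds = domain_sig(left, suggestion, right)                  # domain significance
    gs = general_sig(left, suggestion, right)                 # general significance
    return (1.0 / rank) + ned + rf + af + ds + gs

def best_replacement(error, suggestions, left, right,
                     history, abbreviations, domain_sig, general_sig):
    """Pick the suggestion (word, original_rank) with the highest NS."""
    word, _ = max(suggestions,
                  key=lambda s: issac_score(error, s[0], s[1], left, right,
                                            history, abbreviations,
                                            domain_sig, general_sig))
    return word

if __name__ == "__main__":
    # toy example: with flat DS/GS, the Aspell rank and edit distance dominate
    suggestions = [("charge", 1), ("change", 6), ("chage", 7)]
    print(best_replacement("chage", suggestions, "cannot", "an",
                           history={}, abbreviations=set(),
                           domain_sig=lambda l, s, r: 0.0,
                           general_sig=lambda l, s, r: 0.0))  # -> "charge"
```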
3.4 Enhancement of ISSAC
The list of suggestions and the initial ranks provided by Aspell are an integral part of ISSAC. Figure 3.2 summarises the accuracy of basic ISSAC obtained from the previous evaluations [273] using four sets of chat records (where each set contains 100 chat records). The achievement of 74.4% accuracy by Aspell in the previous evaluations, given the extremely poor nature of the texts, demonstrated the strength of the Metaphone algorithm and the near-miss strategy. The further increase of 22% in accuracy using basic ISSAC demonstrated the potential of the combined weights NS(s_{j,i}).

                                            Evaluation 1   Evaluation 2   Evaluation 3   Evaluation 4   Average
Correct replacements using basic ISSAC      97.06%         97.07%         95.92%         96.20%         96.56%
Correct replacements using Aspell           74.61%         75.94%         71.81%         75.19%         74.39%
Figure 3.2: The accuracy of basic ISSAC from previous evaluations.

Based on the previous evaluation results, we discuss in detail the three causes behind the remaining 3.5% of errors which were incorrectly replaced. Figure 3.3 shows the breakdown of the causes behind the incorrect replacements by basic ISSAC. The three causes are summarised as follows:

Causes                                        Basic ISSAC
Correct replacement not in suggestion list    2.00%
Inadequate/erroneous neighbouring words       1.00%
Anomalies                                     0.50%
Figure 3.3: The breakdown of the causes behind the incorrect replacements by basic ISSAC.

1. The accuracy of the corrections by basic ISSAC is bounded by the coverage of the list of suggestions S produced by Aspell. About 2% of the wrong replacements are due to the absence of the correct suggestions from those produced by Aspell. For example, the error “prder” in the context of “The prder number” was incorrectly replaced by Aspell and basic ISSAC with “parader” and “prder”, respectively. After a look into the evaluation log, we realised that the correct replacement “order” was not in S.
2. The use of the two immediate neighbouring words l and r to inject more contextual consideration into the domain and general significance has contributed to a huge increase in accuracy. Nonetheless, the use of l and r in ISSAC is by no means perfect. About 1% of the wrong replacements are due to two flaws related to l and r, namely, neighbouring words with incorrect spelling, and inadequate neighbouring words. Incorrectly spelt neighbouring words inject false contextual information into the computation of DS and GS. The neighbouring words may also be considered inadequate due to their indiscriminative nature. For example, the left word “both” in “both ocats are” is too general and does not offer much discriminatory power for distinguishing between suggestions such as “coats”, “cats” and “acts”.
3. The remaining 0.5% are considered anomalies that basic ISSAC cannot address. There are two cases of anomalies: the equally likely nature of all possible suggestions, and the contrasting values of certain weights. As an example of the first case, consider the error in “Janice cheung has”. The left word is correctly spelt and has adequately confined the suggestions to proper names. In addition, the correct replacement “Cheung” is present in the suggestion list S. Despite all this, both Aspell and ISSAC decided to replace “cheung” with “Cheng”. A look into the evaluation log reveals that the surname “Cheung” is as common as “Cheng”. In such cases, the probability of replacing e with
the correct replacement is c^{-1} (i.e. 1/c), where c is the number of suggestions with approximately the same NS(s_{j,i}). The second case of anomalies is due to the contrasting values of certain weights, especially NED and i^{-1}, which cause wrong replacements. For example, in the case of “cannot chage an”, basic ISSAC replaced the error “chage” with “charge” instead of “change”. All the other weights for “change” are comparatively higher (i.e. DS and GS) than, or the same (i.e. RF, NED and AF) as, those for “charge”. This inclination indicates that “change” is the most appropriate replacement given the various cues. Nonetheless, the original rank by Aspell for “charge” is i = 1 while for “change” it is i = 6. As a smaller i indicates higher confidence, the inverse of the original Aspell rank, i^{-1}, results in the plummeting of the combined weight for “change”.
In this chapter, we approach the enhancement of ISSAC from the perspective of the first and second causes. For this purpose, we proposed three modifications to the basic ISSAC:
1. We proposed the use of additional spell checking facilities as the answer to the first cause (i.e. compensating for the inadequacy of Aspell). Google spellcheck, which is based on a statistical analysis of words on the World Wide Web (see http://www.google.com/help/features.html), appears to be the ideal candidate for complementing Aspell. Using the Google SOAP API (http://www.google.com/apis), we can have easy access to one of the many functions provided by Google, namely, Google spellcheck. Our new evaluations show that Google spellcheck works well for certain errors where Aspell fails to suggest the correct replacements. Similar to the expansions for abbreviations and the suggestions by Aspell, the suggestion provided by Google is added at the front of the list S with rank 1. This places the suggestion by Google on the same rank as the first suggestion by Aspell.
2. The basic ISSAC relies only on Aspell for determining if a word is an error. For this purpose, we decided to include Google spellcheck as a complement. If a word is detected as a possible error by either Aspell or Google spellcheck, then we have adequate evidence to proceed and correct it using enhanced ISSAC. In addition, errors that result in valid words are not recognised by Aspell. For example, Aspell does not recognise “hat” as an error. If we were to take into consideration the neighbours that it co-occurs with, namely, “suret hat they”, then “hat” is certainly an error. Google contributes in this aspect. Moreover, the use of Google spellcheck has also indirectly provided ISSAC with a partial solution to the second cause (i.e. erroneous neighbouring words). Whenever Google checks a word for spelling errors, the neighbouring words are simultaneously examined. For example, while providing a suggestion for the error “tha”, Google simultaneously takes into consideration the neighbours, namely, “sure tha tthey”, and suggests that the word to its right, “tthey”, be replaced with “they”. Google spellcheck’s ability to consider contextual information is empowered by its large search engine index and the statistical evidence that comes with it. Word collocations are ruled out as statistically improbable when their co-occurrences are extremely low. In such cases, Google attempts to suggest better collocates (i.e. neighbouring words).
3.
We have altered the reuse factor RF by eliminating the use of history list that gives more weight to suggestions that have been previously chosen to correct particular errors. We have come to realise that there is no guarantee a particular replacement for an error is correct. When a replacement is incorrect and is stored in the history list, the reuse factor will propagate the wrong replacement to the subsequent corrections. Therefore, we adapted the reuse factor to support the use of Google spellcheck in the form of entries in a local spelling dictionary. There are two types of entries in the spelling dictionary. The main type is the suggestions by Google for spelling errors. This type of entries is automatically updated every time Google suggest a replacement for an error. The second type, which is optional, is the suggestions for errors provided by users. Hence the modified reuse factor will now assign the weight of 1 to suggestions that are provided by Google spellcheck or predefined by users. Despite a certain level of superiority that Google spellcheck exhibits in the three enhancements, Aspell remains necessary. Google spellcheck is based on the occurrences of words on the World Wide Web. Determining whether a word is an error or not depends very much on its popularity. Even if a word does not exist in the English dictionary, Google will not judge it as an error as long as its popularity exceeds some threshold set by Google. This popularity approach has both its pros and cons. On the one hand, such approach is suitable for recognising proper nouns, especially emerging ones, such as “iPod” and “Xbox”. On the other hand, words such as “thanx” in the context of “[ok] [thanx] [for]” is not considered as an error by Google even though it should be corrected. 51 52 Chapter 3. Text Preprocessing The algorithm for text preprocessing that comprises of the basic ISSAC together with all its enhancements is described in Algorithm 1. 3.5 Evaluation and Discussion Evaluations are conducted using chat records provided by 247Customer.com. As a provider of customer lifecycle management services, the chat records by 247Customer.com offer a rich source of domain information in a natural setting (i.e. conversations between customers and agents). Consequently, these chat records are filled with spelling errors, ad-hoc abbreviations, improper casing and many other problems that are considered as intolerable by existing language and speech applications. Therefore, these chat records become the ideal source for evaluating ISSAC. Four sets of test data, each comes in an XML file of 100 chat sessions, were employed in the previous evaluations [273]. To evaluate the enhanced ISSAC, we have included an additional three sets, which brings the total number of chat records to 700. The chat records and the Google search engine constitute the domain corpora and the general collection, respectively. GNU Aspell version 0.60.4 [9] is employed for detecting errors and generating suggestions. Similar to the previous evaluations, determining the correctness of replacements by Aspell and enhanced ISSAC is a delicate process that must be performed manually. For example, it is difficult to automatically determine whether the error “itme” should be replaced with “time” or “item” without more information (e.g. the neighbouring words). The evaluation of the errors and replacements are conducted in a unified manner. The errors are not classified into spelling errors, ad-hoc abbreviations or improper casing. 
For example, should the error “az” (“AZ” is the abbreviation for the state of “Arizona”) in the context of “Glendale az <” be considered an abbreviation or improper casing? The boundaries between the different kinds of noise that occur in real-world texts, especially those from online sources, are not clear. After a careful evaluation of all replacements suggested by Aspell and by enhanced ISSAC for all 3,313 errors, we discovered a further improvement in accuracy using the latter. As shown in Figure 3.4, the use of the first suggestions by Aspell as replacements for spelling errors yields an average of 71%, which is a decrease from 74.4% in the previous evaluations. With the addition of the various weights which form basic ISSAC, an average increase of 22% was noted, resulting in an improved accuracy of 96.5%. As predicted, the enhanced ISSAC scored a much better accuracy of 98%. The increase of 1.5% over basic ISSAC is contributed by the suggestions from Google that complement the inadequacies of Aspell.

Algorithm 1 Enhanced ISSAC
1: input: chat records or other online documents
2: Remove all HTML or XML tags from input documents
3: Extract and keep URLs, emails, emoticons and tables
4: Detect and identify sentence boundaries
5: for each document do
6:   for each sentence in the document do
7:     tokenise the sentence to produce a set of words T = {t_1, ..., t_w}
8:     for each word t ∈ T do
9:       Identify the left word l and right word r for t
10:      if t consists of all upper case then
11:        Turn all letters in t to lower case
12:      else if t consists of all digits then
13:        next
14:      Feed t to Aspell
15:      if t is identified as an error by Aspell or Google spellcheck then
16:        Initialise S, the set of suggestions for error t, and NS, an array of new scores for all suggestions for error t
17:        Add the n suggestions for word t produced by Aspell to S according to the original rank from 1 to n
18:        Perform a lookup in the abbreviation dictionary and add all the corresponding m expansions for t at the front of S, all with rank 1
19:        Perform a lookup in the spelling dictionary and add the retrieved suggestion at the front of S with rank 1
20:        Add the error word t itself at the end of S, with rank n + 1
21:        The final S is {s_{1,1}, s_{2,1}, ..., s_{m+1,1}, s_{m+2,1}, ..., s_{m+n+1,n}, s_{m+n+2,n+1}}, where j and i in s_{j,i} are the element index and the rank, respectively
22:        for each suggestion s_{j,i} ∈ S do
23:          Determine i^{-1}, NED between the error e and the j-th suggestion, RF by looking into the spelling dictionary, AF by looking into the abbreviation dictionary, DS, and GS
24:          Sum the weights and push the sum into NS
25:        Correct word t with the suggestion that has the highest combined weight in array NS
26: output: documents with spelling errors corrected, abbreviations expanded and improper casing restored.

                                              Evaluation 1   Evaluation 2   Evaluation 3   Evaluation 4
Correct replacements using enhanced ISSAC     98.45%         97.91%         98.40%         98.23%
Correct replacements using basic ISSAC        97.06%         97.07%         95.92%         96.20%
Correct replacements using Aspell             74.61%         75.94%         71.81%         75.19%
(a) Evaluations 1 to 4.

                                              Evaluation 5   Evaluation 6   Evaluation 7   Average
Correct replacements using enhanced ISSAC     97.39%         97.85%         97.86%         98.01%
Correct replacements using basic ISSAC        95.64%         96.65%         97.14%         96.53%
Correct replacements using Aspell             63.62%         65.79%         70.24%         71.03%
(b) Evaluations 5 to 7.
Figure 3.4: Accuracy of enhanced ISSAC over seven evaluations.
Causes                                        Enhanced ISSAC
Correct replacement not in suggestion list    0.80%
Inadequate/erroneous neighbouring words       0.70%
Anomalies                                     0.50%
Figure 3.5: The breakdown of the causes behind the incorrect replacements by enhanced ISSAC.

A previous error, “prder” within the context of “The prder number”, that could not be corrected by basic ISSAC due to the first cause was solved after our enhancements. The correct replacement “order” was suggested by Google. Another error, “ffer” in the context of “youo ffer on”, that could not be corrected due to the second cause was successfully replaced with “offer” after Google had simultaneously corrected the left word to “you”. The increase in accuracy of 1.5% is in line with the drop in the number of errors with wrong replacements due to (1) the absence of correct replacements in Aspell’s suggestions, and (2) erroneous neighbouring words. There is a visible drop in the number of errors with wrong replacements due to the first and the second cause, from the existing 2% (in Figure 3.3) to 0.8% (in Figure 3.5), and from 1% (in Figure 3.3) to 0.7% (in Figure 3.5), respectively.
3.6 Conclusion
As an increasing number of ontology learning systems are opening up to the use of online sources, the need to handle noisy text becomes inevitable. Regardless of whether we acknowledge this fact, the quality of ontologies and the proper functioning of the systems are, to a certain extent, dependent on the cleanliness of the input texts. Most of the existing techniques for correcting spelling errors, expanding abbreviations and restoring cases are studied separately. We, along with an increasing number of researchers, have acknowledged the fact that much of the noise in text is composite in nature (i.e. multi-error). As we have demonstrated throughout this chapter, many errors are difficult to classify as either spelling errors, ad-hoc abbreviations or improper casing. In this chapter, we presented the enhancement of the ISSAC technique. The basic ISSAC was built upon the well-known spell checker Aspell to simultaneously provide a solution to spelling errors, abbreviations and improper casing. This scoring mechanism combines weights based on various information sources, namely, the original rank by Aspell, reuse factor, abbreviation factor, normalised edit distance, domain significance and general significance. In the course of evaluating basic ISSAC, we uncovered and discussed in detail three causes behind the replacement errors. We approached the enhancement of ISSAC from the first and the second cause, namely, the absence of correct replacements from Aspell’s suggestions, and the inadequacy of the neighbouring words. We proposed three modifications to the basic ISSAC, namely, (1) the use of Google spellcheck to compensate for the inadequacy of Aspell, (2) the incorporation of Google spellcheck for determining if a word is erroneous, and (3) the alteration of the reuse factor RF by shifting from the use of a history list to a spelling dictionary. Evaluations performed using the enhanced ISSAC on seven sets of chat records revealed a further improvement in accuracy to 98% from the previous 96.5% using basic ISSAC. Even though the idea for ISSAC was first motivated and conceived within the paradigm of ontology learning, we see great potential in further improvements and fine-tuning for a wide range of uses, especially in language and speech applications.
We hope that a unified technique such as ISSAC will pave the way for more research into providing a complete solution for text preprocessing (i.e. text cleaning) in general. 55 56 Chapter 3. Text Preprocessing 3.7 Acknowledgement This research was supported by the Australian Endeavour International Postgraduate Research Scholarship, and the Research Grant 2006 by the University of Western Australia. The authors would like to thank 247Customer.com for providing the evaluation data. Gratitude to the developer of GNU Aspell, Kevin Atkinson. 3.8 Other Publications on this Topic Wong, W., Liu, W. & Bennamoun, M. (2006) Integrated Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration in Dirty Text. In the Proceedings of the 5th Australasian Conference on Data Mining (AusDM), Sydney, Australia. This paper contains the preliminary ideas on basic ISSAC, which were extended and improved to contribute towards the conference paper on enhanced ISSAC that form Chapter 3. CHAPTER 4 Text Processing Abstract In ontology learning, research on word collocational stability or unithood is typically performed as part of a larger effort for term recognition. Consequently, independent work dedicated to the improvement of unithood measurement is limited. In addition, existing unithood measures were mostly empirically motivated and derived. This chapter presents a dedicated probabilistic measure that gathers linguistic evidence from parsed text and statistical evidence from Web search engines for determining unithood during noun phrase extraction. Our comparative study using 1, 825 test cases against an existing empirically-derived function revealed an improvement in terms of precision, recall and accuracy. 4.1 Introduction Automatic term recognition is the process of extracting stable noun phrases from text and filtering them for the purpose of identifying terms which characterise certain domains of interest. This process involves the determination of unithood and termhood. Unithood, which is the focus of this chapter, refers to “the degree of strength or stability of syntagmatic combinations or collocations” [120]. Measures for determining unithood can be used to decide whether or not word sequences can form collocationally stable and semantically meaningful compounds. Compounds are considered as unstable if they can be further broken down to create non-overlapping units that refer to semantically distinct concepts. For example, the noun phrase “Centers for Disease Control and Prevention” is a stable and meaningful unit while “Centre for Clinical Interventions and Royal Perth Hospital” is an unstable compound that refers to two separate entities. For this reason, unithood measures are typically used in term recognition for finding stable and meaningful noun phrases, which are considered as likelier terms. Recent reviews [275] showed that existing research on unithood is mostly conducted as part of larger efforts for termhood measurement. As a result, there is only a small number of existing measures dedicated to determining unithood. In addition, existing measures are usually derived using 0 This chapter appeared in the Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP), Hyderabad, India, 2008, with the title “Determining the Unithood of Word Sequences using a Probabilistic Approach”. 57 58 Chapter 4. Text Processing word frequency from static corpora, and are modified as per need. 
As such, the significance of the different weights that compose the measures typically assumes an empirical viewpoint [120]. The three objectives of this chapter are (1) to separate the measurement of unithood from the determination of termhood, (2) to devise a probabilistic measure which requires only one threshold for determining the unithood of word sequences using dynamic Web data, and (3) to demonstrate the superior performance of the new probabilistic measure against existing empirical measures. In regard to the first objective, we derive our probabilistic measure free from any influence of termhood determination. Following this, our unithood measure will be an independent tool that is applicable not only to term recognition, but also to other tasks in information extraction and text mining. Concerning the second objective, we devise our new measure, known as the Odds of Unithood (OU), using Bayes Theorem and several elementary probabilities. The probabilities are estimated using Google page counts to eliminate problems related to the use of static corpora. Moreover, only one threshold, namely OU_T, is required to control the functioning of OU. Regarding the third objective, we compare our new OU against an existing empirically-derived measure called Unithood (UH) [275] in terms of their precision, recall and accuracy. (The foundation work on dedicated unithood measures appeared in the Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING), Melbourne, Australia, 2007, with the title “Determining the Unithood of Word Sequences using Mutual Information and Independence Measure”.)
In Section 4.2, we provide a brief review of some existing techniques for measuring unithood. In Section 4.3, we present our new probabilistic measure and the accompanying theoretical and intuitive justification. In Section 4.4, we summarise some findings from our evaluations. Finally, we conclude this chapter with an outlook on future work in Section 4.5.
4.2 Related Works
Some of the common measures of unithood include pointwise mutual information (MI) [47] and the log-likelihood ratio [64]. In mutual information, the co-occurrence frequencies of the constituents of complex terms are utilised to measure their dependency. The mutual information for two words a and b is defined as:

MI(a, b) = \log_2 \frac{p(a, b)}{p(a)\,p(b)}    (4.1)

where p(a) and p(b) are the probabilities of occurrence of a and b. Many measures that apply statistical techniques assume a strict normal distribution and independence between the word occurrences [81]. For handling extremely uncommon words or small-sized corpora, the log-likelihood ratio delivers the best precision [136]. The log-likelihood ratio attempts to quantify how much more likely one pair of words is to occur compared to the others. Despite its potential, “How to apply this statistic measure to quantify structural dependency of a word sequence remains an interesting issue to explore.” [131]. Seretan et al. [224] examined the use of mutual information, log-likelihood ratio and t-tests with search engine page counts for determining the collocational strength of word pairs. However, no performance results were presented. Wong et al. [275] presented a hybrid measure inspired by mutual information in Equation 4.1, and Cvalue in Equation 4.3. The authors employ search engine page counts for the computation of statistical evidence to replace the use of frequencies obtained from static corpora.
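As an aside, a minimal sketch of how such page-count-based association scores might be computed is given below, using pointwise mutual information (Equation 4.1) as the example. The page_count function and the normalising constant N are stand-in assumptions for illustration; a real system would query a search engine API and use an estimate of its index size.

```python
# An illustrative sketch of estimating pointwise mutual information (Equation
# 4.1) from search engine page counts instead of a static corpus. The
# page_count function and the constant N (an assumed index size) are stand-in
# assumptions; a real system would query a search engine API.

import math

N = 1_000_000_000  # assumed number of indexed pages, used to normalise counts

def page_count(query: str) -> int:
    """Stand-in for a search engine hit count for `query`."""
    fake_counts = {
        "food": 100_000_000,
        "poisoning": 5_000_000,
        '"food poisoning"': 2_000_000,   # phrase query as a co-occurrence proxy
    }
    return fake_counts.get(query, 0)

def pmi(a: str, b: str) -> float:
    """MI(a, b) = log2( p(a, b) / (p(a) p(b)) ), with page-count estimates."""
    p_a = page_count(a) / N
    p_b = page_count(b) / N
    p_ab = page_count(f'"{a} {b}"') / N
    if min(p_a, p_b, p_ab) == 0:
        return float("-inf")
    return math.log2(p_ab / (p_a * p_b))

if __name__ == "__main__":
    print(round(pmi("food", "poisoning"), 2))  # 2.0: a positive score indicates association
```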
The authors proposed a measure known as Unithood (UH) for determining the mergeability of two lexical units a_x and a_y to produce a stable sequence of words s. The word sequences are organised as a set W = {s, a_x, a_y} where s = a_x b a_y is a term candidate, b can be any preposition, the coordinating conjunction “and” or an empty string, and a_x and a_y can either be noun phrases of the form ADJ* N+ or another s (i.e. defining a new s in terms of another s). The authors define UH as:

$$UH(a_x, a_y) = \begin{cases} 1 & \text{if } (MI(a_x, a_y) > MI^{+}) \ \lor \\ & \quad (MI^{+} \geq MI(a_x, a_y) \geq MI^{-} \ \land \ ID(a_x, s) \geq ID_T \ \land \\ & \quad \ ID(a_y, s) \geq ID_T \ \land \ IDR^{+} \geq IDR(a_x, a_y) \geq IDR^{-}) \\ 0 & \text{otherwise} \end{cases} \qquad (4.2)$$

where MI^+, MI^-, ID_T, IDR^+ and IDR^- are thresholds for determining the mergeability decision, MI(a_x, a_y) is the mutual information between a_x and a_y, and ID(a_x, s), ID(a_y, s) and IDR(a_x, a_y) are measures of the lexical independence of a_x and a_y from s. For brevity, let z be either a_x or a_y; the independence measure ID(z, s) is then defined as:

$$ID(z, s) = \begin{cases} \log_{10}(n_z - n_s) & \text{if } n_z > n_s \\ 0 & \text{otherwise} \end{cases}$$

where n_z and n_s are the Google page counts for z and s, respectively. IDR(a_x, a_y) is computed as ID(a_x, s)/ID(a_y, s). Intuitively, UH(a_x, a_y) states that the two lexical units a_x and a_y can only be merged in two cases, namely, (1) if a_x and a_y have extremely high mutual information (i.e. higher than a certain threshold MI^+), or (2) if a_x and a_y achieve average mutual information (i.e. within the acceptable range of the two thresholds MI^+ and MI^-) coupled with extremely high independence from s (i.e. higher than the threshold ID_T).

Frantzi [79] proposed a measure known as Cvalue for extracting complex terms. The measure is based upon the claim that a substring of a term candidate is a candidate itself given that it demonstrates adequate independence from the longer version it appears in. For example, “E. coli food poisoning”, “E. coli” and “food poisoning” are acceptable as valid complex term candidates. However, “E. coli food” is not. Given a word sequence a to be examined for unithood, the Cvalue is defined as:

$$Cvalue(a) = \begin{cases} \log_2|a| \, f_a & \text{if } |a| = g \\ \log_2|a| \left( f_a - \frac{\sum_{l \in L_a} f_l}{|L_a|} \right) & \text{otherwise} \end{cases} \qquad (4.3)$$

where |a| is the number of words in a, L_a is the set of longer term candidates that contain a, g is the longest n-gram considered, f_a is the frequency of occurrence of a, and a ∉ L_a. While certain researchers [131] consider Cvalue a termhood measure, others [180] accept it as a measure of unithood. One can observe that longer candidates tend to gain higher weights due to the inclusion of log_2|a| in Equation 4.3. In addition, the weights computed using Equation 4.3 are dependent purely on the frequency of a.

4.3 A Probabilistic Measure for Unithood Determination

The determination of the unithood (i.e. collocational strength) of word sequences using the new probabilistic measure is composed of two parts. Firstly, a list of noun phrases is extracted using syntactic and dependency analysis. Secondly, the collocational strength of the word sequences is examined based on several pieces of probabilistic evidence.

Figure 4.1: The output of the Stanford Parser. The tokens in the “modifiee” column marked with squares are head nouns, and the corresponding tokens along the same rows in the “word” column are the modifiers. The first column, “offset”, is subsequently represented using the variable i.
4.3.1 Noun Phrase Extraction Most techniques for extracting noun phrases rely on regular expressions, and part-of-speech and dependency information. Our extraction technique is implemented as a head-driven noun phrase chunker [271] that feeds on the output of Stanford Parser [132]. Figure 4.1 shows a sample output by the parser for the sentence “They’re living longer with HIV in the brain, explains Kathy Kopnisky of the NIH’s National Institute of Mental Health, which is spending about millions investigating neuroAIDS.”. Note that the words are lemmatised to obtain the root form. The noun phrase chunker begins by identifying a list of head nouns from the parser’s output. The head nouns are marked with squares in Figure 4.1. As the name suggests, the chunker uses the head nouns as the starting point, and proceeds to the left and later right in an attempt to identify maximal noun phrases using the head-modifier information. For example, the head “Institute” is modified by 61 62 Chapter 4. Text Processing Figure 4.2: The output of the head-driven noun phrase chunker. The tokens which are highlighted with a darker tone are the head nouns. The underlined tokens are the corresponding modifiers identified by the chunker. “NIH’s”, “National” and “of ”. Since modifiers of the type prep and poss cannot be straightforwardly chunked, the phrase “National Institute” was produced instead as shown in Figure 4.2. Similarly, the phrase “Mental Health” was also identified by the chunker. The fragments of noun phrases identified by the chunker which are separated by the coordinating conjunction “and” or prepositions (e.g. “National Institute”, “Mental Health”) are organised as pairs in the form of (ax , ay ) and placed in the set A. The i in ai is the word offset generated by the Stanford Parser (i.e. the “offset” column in Figure 4.1). If ax and ay are located immediately next to each other in the sentence, then x + 1 = y. If the pair is separated by a preposition or a conjunction, then x + 2 = y. 4.3.2 Determining the Unithood of Word Sequences The next step is to examine the collocational strength of the pairs in A. Word pairs in (ax , ay ) ∈ A that have very high unithood or collocational strength are combined to form stable noun phrases and hence, potential domain-relevant terms. Each pair (ax , ay ) that undergoes the examination for unithood is organised as W = {s, ax , ay } where s is the hypothetically-stable noun phrase composed of s = ax bay and b can either be an empty string, a preposition, or the coordinating conjunction “and”. Formally, the unithood of any two lexical units ax and ay can be defined as Definition 4.3.2.1. The unithood of two lexical units is the “degree of strength or stability of syntagmatic combinations and collocations” [120] between them. It then becomes obvious that the problem of measuring the unithood of any word sequences requires the determination of their “degree” of collocational strength as 4.3. A Probabilistic Measure for Unithood Determination mentioned in Definition 4.3.2.1. In practical terms, the “degree” mentioned above provides us with a quantitative means to determine if the units ax and ay should be combined to form s, or remain as separate units. The collocational strength of ax and ay that exceeds a certain threshold demonstrates to us that s has the potential of being a stable compound and hence, a better term candidate than ax and ay separated. It is worth mentioning that the size (i.e. number of words) of ax and ay is not limited to 1. 
For example, we can have a_x = “National Institute”, b = “of” and a_y = “Allergy and Infectious Diseases”. In addition, the size of a_x and a_y has no effect on the determination of their unithood using our measure.

As we have discussed in Section 4.2, most of the conventional measures employ frequency of occurrence from local corpora, and statistical tests or information-theoretic measures, to determine the coupling strength of word pairs. The two main problems associated with such measures are:

• Data sparseness is a problem that is well-documented by many researchers [124]. The problem is inherent to the use of local corpora and it can lead to poor estimation of parameters or weights; and

• The assumptions of independence and normality of word distribution are two of the many problems in language modelling [81]. While the independence assumption reduces text to simply a bag of words, the assumption of the normal distribution of words will often lead to incorrect conclusions during statistical tests.

As a general solution, we innovatively employ search engine page counts in a probabilistic framework for measuring unithood. We begin by defining the sample space N as the set of all documents indexed by the Google search engine. We can estimate the index size of Google, |N|, using function words as predictors. Function words such as “a”, “is” and “with”, as opposed to content words, appear with frequencies that are relatively stable over different domains. Next, we perform random draws (i.e. trials) of documents from N. For each lexical unit w ∈ W, there is a corresponding set of outcomes (i.e. events) from the draw. The three basic sets which are of interest to us are:

Definition 4.3.2.2. Basic events corresponding to each w ∈ W:
• X is the event that a_x occurs in the document
• Y is the event that a_y occurs in the document
• S is the event that s occurs in the document

It should be obvious to the reader that since the documents in S also contain the two units a_x and a_y, S is a subset of X ∩ Y, or S ⊆ X ∩ Y. It is worth noting that even though S ⊆ X ∩ Y, it is highly unlikely that S = X ∩ Y, since the two portions a_x and a_y may exist in the same document without being conjoined by b. Next, subscribing to the frequency interpretation of probability, we obtain the probabilities of the events in Definition 4.3.2.2 in terms of search engine page counts:

$$P(X) = \frac{n_x}{|N|}, \quad P(Y) = \frac{n_y}{|N|}, \quad P(S) = \frac{n_s}{|N|} \qquad (4.4)$$

where n_x, n_y and n_s are the page counts returned by a search engine using the terms [+“a_x”], [+“a_y”] and [+“s”], respectively. The pair of quotes that encapsulates a search term is the phrase operator, while the character “+” is the required operator supported by the Google search engine. As discussed earlier, the independence assumption required by certain information-theoretic measures may not always be valid. In our case, P(X ∩ Y) ≠ P(X)P(Y), since the occurrences of a_x and a_y in documents are inevitably governed by some hidden variables. Following this, we define the probabilities for two new sets which result from applying some set operations on the basic events in Definition 4.3.2.2:

$$P(X \cap Y) = \frac{n_{xy}}{|N|}, \quad P(X \cap Y \setminus S) = P(X \cap Y) - P(S) \qquad (4.5)$$

where n_xy is the page count returned by Google for the search using [+“a_x” +“a_y”]. Defining P(X ∩ Y) in terms of observable page counts, rather than as a combination of two independent events, allows us to avoid any unnecessary assumption of independence.
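The elementary probabilities in Equations 4.4 and 4.5 reduce to a handful of page-count lookups. The following is a minimal sketch of this estimation step; the type and function names, the example counts and the assumed index size are illustrative only and not part of the original implementation.

```python
from dataclasses import dataclass

@dataclass
class UnitEvidence:
    """Elementary probabilities of Equations 4.4 and 4.5 for W = {s, a_x, a_y}."""
    p_x: float           # P(X): a_x occurs in a document
    p_y: float           # P(Y): a_y occurs in a document
    p_s: float           # P(S): s = a_x b a_y occurs in a document
    p_xy: float          # P(X ∩ Y): a_x and a_y occur in the same document
    p_xy_minus_s: float  # P(X ∩ Y \ S)

def estimate_evidence(n_x: int, n_y: int, n_s: int, n_xy: int,
                      index_size: int) -> UnitEvidence:
    """Map raw page counts (for [+"a_x"], [+"a_y"], [+"s"] and [+"a_x" +"a_y"])
    to the probabilities used by the unithood measure."""
    p_x, p_y = n_x / index_size, n_y / index_size
    p_s, p_xy = n_s / index_size, n_xy / index_size
    # In theory S ⊆ X ∩ Y, but page counts are noisy, so clamp at zero.
    return UnitEvidence(p_x, p_y, p_s, p_xy, max(p_xy - p_s, 0.0))

# Made-up counts for a_x = "National Institute", a_y = "Mental Health" and
# s = "National Institute of Mental Health", with an assumed index size |N|.
ev = estimate_evidence(n_x=52_000_000, n_y=33_000_000, n_s=9_500_000,
                       n_xy=11_000_000, index_size=8_000_000_000)
print(ev.p_s, ev.p_xy_minus_s)
```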
Next, referring back to our main problem discussed in Definition 4.3.2.1, we are required to estimate the collocational strength of the two units a_x and a_y. Since there is no standard metric for such a measurement, we address the problem from a probabilistic perspective. We introduce the probability that s is a stable compound given the evidence s possesses:

Definition 4.3.2.3. The probability of unithood:

$$P(U|E) = \frac{P(E|U)\,P(U)}{P(E)}$$

where U is the event that s is a stable compound and E is the evidence belonging to s. P(U|E) is the posterior probability that s is a stable compound given the evidence E. P(U) is the prior probability that s is a unit without any evidence, and P(E) is the prior probability of the evidence held by s. As we shall see later, these two prior probabilities are immaterial in the final computation of unithood. Since s can either be a stable compound or not, we can state that

$$P(\bar{U}|E) = 1 - P(U|E) \qquad (4.6)$$

where Ū is the event that s is not a stable compound. Since Odds = P/(1 − P), we multiply both sides of Definition 4.3.2.3 by (1 − P(U|E))^{-1} to obtain

$$\frac{P(U|E)}{1 - P(U|E)} = \frac{P(E|U)\,P(U)}{P(E)\,(1 - P(U|E))} \qquad (4.7)$$

By substituting Equation 4.6 into Equation 4.7 and then applying the multiplication rule P(Ū|E)P(E) = P(E|Ū)P(Ū), we obtain:

$$\frac{P(U|E)}{P(\bar{U}|E)} = \frac{P(E|U)\,P(U)}{P(E|\bar{U})\,P(\bar{U})} \qquad (4.8)$$

We proceed to take the log of the odds in Equation 4.8 (i.e. the logit) to get:

$$\log \frac{P(U|E)}{P(\bar{U}|E)} = \log \frac{P(E|U)}{P(E|\bar{U})} + \log \frac{P(U)}{P(\bar{U})} \qquad (4.9)$$

While it is obvious that certain words tend to co-occur more frequently than others (i.e. idioms and collocations), such phenomena are largely arbitrary [231]. This makes the task of deciding on what constitutes an acceptable collocation difficult. The only way to objectively identify stable and meaningful compounds is through observations in samples of the language (e.g. a text corpus) [174]. In other words, assigning an a priori probability of collocational strength without empirical evidence is both subjective and difficult. As such, we are left with the option of assuming that the probabilities of s being a stable unit and of s not being a stable compound, in the absence of evidence, are the same (i.e. P(U) = P(Ū) = 0.5). As a result, the second term in Equation 4.9 evaluates to 0:

$$\log \frac{P(U|E)}{P(\bar{U}|E)} = \log \frac{P(E|U)}{P(E|\bar{U})} \qquad (4.10)$$

We introduce a new measure for determining the odds of s being a stable compound, known as the Odds of Unithood (OU):

Definition 4.3.2.4. Odds of unithood:

$$OU(s) = \log \frac{P(E|U)}{P(E|\bar{U})}$$

Assuming that the individual pieces of evidence in E are independent of one another, we can evaluate OU(s) as:

$$OU(s) = \log \frac{\prod_i P(e_i|U)}{\prod_i P(e_i|\bar{U})} = \sum_i \log \frac{P(e_i|U)}{P(e_i|\bar{U})} \qquad (4.11)$$

where e_i is an individual piece of evidence for s. With the introduction of Definition 4.3.2.4, we can examine the degree of collocational strength of a_x and a_y in forming a stable and meaningful s in terms of OU(s).

Figure 4.3: The probabilities of the areas with a darker shade are the denominators required by the evidences e_1 and e_2 for the estimation of OU(s). (a) The darker area is the set X ∩ Y \ S; the ratio of P(S) to the probability of this area gives us the first evidence. (b) The darker area is the set S′; the ratio of P(S) to the probability of this area (i.e. P(S′) = 1 − P(S)) gives us the second evidence.
With the base of the log in Definition 4.3.2.4 greater than 1, the upper and lower bounds of OU(s) are +∞ and −∞, respectively. OU(s) = +∞ and OU(s) = −∞ correspond to the highest and the lowest degree of stability of the two units a_x and a_y appearing as s, respectively. A high OU(s) indicates the suitability of the two units a_x and a_y to be merged to form s. Ultimately, we have reduced the vague problem of unithood determination introduced in Definition 4.3.2.1 to a practical and computable solution in Definition 4.3.2.4.

The evidence that we employ for determining unithood is based on the occurrence of s (the event S, if the reader recalls Definition 4.3.2.2). We are interested in two types of occurrence of s, namely, (1) the occurrence of s given that a_x and a_y have already occurred, or X ∩ Y, and (2) the occurrence of s as it is in our sample space N. We refer to the first evidence, e_1, as local occurrence, and the second, e_2, as global occurrence. We will discuss the justification behind each type of occurrence in the following paragraphs. Each piece of evidence e_i captures the occurrence of s within a different confinement. We estimate the evidence using the elementary probabilities already defined in Equations 4.4 and 4.5.

The first evidence e_1 captures the probability of occurrence of s within the confinement of a_x and a_y, or X ∩ Y. As such, P(e_1|U) can be interpreted as the probability of s occurring within X ∩ Y as a stable compound, or P(S|X ∩ Y). On the other hand, P(e_1|Ū) captures the probability of s occurring in X ∩ Y not as a unit. In other words, P(e_1|Ū) is the probability of s not occurring in X ∩ Y, or equivalently, P((X ∩ Y \ S)|(X ∩ Y)). The set X ∩ Y \ S is shown as the area with a darker shade in Figure 4.3(a). Let us define the odds based on the first evidence as:

$$O_L = \frac{P(e_1|U)}{P(e_1|\bar{U})} \qquad (4.12)$$

Substituting P(e_1|U) = P(S|X ∩ Y) and P(e_1|Ū) = P((X ∩ Y \ S)|(X ∩ Y)) into Equation 4.12 gives us:

$$O_L = \frac{P(S|X \cap Y)}{P((X \cap Y \setminus S)|(X \cap Y))} = \frac{P(S \cap (X \cap Y))}{P(X \cap Y)} \, \frac{P(X \cap Y)}{P((X \cap Y \setminus S) \cap (X \cap Y))} = \frac{P(S \cap (X \cap Y))}{P((X \cap Y \setminus S) \cap (X \cap Y))}$$

and since S ⊆ (X ∩ Y) and (X ∩ Y \ S) ⊆ (X ∩ Y),

$$O_L = \frac{P(S)}{P(X \cap Y \setminus S)}$$

if P(X ∩ Y \ S) ≠ 0, and O_L = 1 if P(X ∩ Y \ S) = 0.

The second evidence e_2 captures the probability of occurrence of s without confinement. If s is a stable compound, then its probability of occurrence in the sample space is simply P(S). On the other hand, if s occurs not as a unit, then its probability of non-occurrence is 1 − P(S). The complement of S, which is the set S′, is shown as the area with a darker shade in Figure 4.3(b). Let us define the odds based on the second evidence as:

$$O_G = \frac{P(e_2|U)}{P(e_2|\bar{U})} \qquad (4.13)$$

Substituting P(e_2|U) = P(S) and P(e_2|Ū) = 1 − P(S) into Equation 4.13 gives us:

$$O_G = \frac{P(S)}{1 - P(S)}$$

Intuitively, the first evidence attempts to capture the extent to which the existence of the two lexical units a_x and a_y is attributable to s. Referring back to O_L, whenever the denominator P(X ∩ Y \ S) becomes less than P(S), we can deduce that a_x and a_y actually exist together as s more often than in other forms. At one extreme, when P(X ∩ Y \ S) = 0, we can conclude that the co-occurrence of a_x and a_y is exclusively for s. As such, we can also refer to O_L as a measure of exclusivity for the use of a_x and a_y with respect to s.
This first evidence is a good indication for the unithood of s since the more the existence of ax and ay is attributed to s, the stronger the collocational strength of s becomes. Concerning the second evidence, OG attempts to capture the extent to which s occurs in general usage (i.e. World Wide Web). We can consider OG as a measure of pervasiveness for the use of s. As s becomes more widely used in text, the numerator in OG increases. This provides a good indication on the unithood of s since the more s appears in usage, the likelier it becomes that s is a stable compound instead of an occurrence by chance. As a result, the derivation of OU in terms of OL and OG offers a comprehensive way of determining unithood. Finally, expanding OU (s) in Equation 4.11 using Equations 4.12 and 4.13 gives us: OU (s) = log OL + log OG = log (4.14) P (S) P (S) + log P (X ∩ Y \ S) 1 − P (S) As such, the decision on whether ax and ay should be merged to form s is made based solely on OU defined in Equation 4.14. We merge ax and ay if their odds of unithood exceeds a certain threshold, OUT . 4.4 Evaluations and Discussions For this evaluation, we employed 500 news articles from Reuters in the health domain gathered between December 2006 to May 2007. These 500 articles are fed into the Stanford Parser whose output is then used by our head-driven noun phrase chunker [271, 275] to extract word sequences in the form of nouns and noun phrases. Pairs of word sequences (i.e. ax and ay ) located immediately next to each other, or separated by a preposition or the conjunction “and” in the same sentence are 4.4. Evaluations and Discussions measured for their unithood. Using the 500 news articles, we managed to obtain 1, 825 pairs of words to be tested for unithood. We performed a comparative study of our new probabilistic measure against the empirical measure described in Equation 4.2. Two experiments were conducted. In the first one, the decisions on whether or not to merge the 1, 825 pairs were performed automatically using our probabilistic measure OU. These decisions are known as the actual results. At the same time, we inspected the same list of word pairs to manually decide on their unithood. These decisions are known as the ideal results. The threshold OUT employed for our evaluation is determined empirically through experiments and is set to −8.39. However, since only one threshold is involved in deciding mergeability, training algorithms and datasets may be employed to automatically decide on an optimal number. This option is beyond the scope of this chapter. The actual and ideal results for this first experiment are organised into a contingency table (not shown here) for identifying the true and the false positives, and the true and the false negatives. In the second experiment, we conducted the same assessment as carried out in the first one but the decisions to merge the 1, 825 pairs are based on the UH measure described in Equation 4.2. The thresholds required for this measure are based on the values suggested by Wong et al. [275], namely, M I + = 0.9, M I − = 0.02, IDT = 6, IDR+ = 1.35, and IDR− = 0.93. Figure 4.4: The performance of OU (from Experiment 1) and UH (from Experiment 2) in terms of precision, recall and accuracy. The last column shows the difference between the performance of Experiment 1 and 2. Using the results from the contingency tables, we computed the precision, recall and accuracy for the two measures under evaluation. 
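To make the merging decision of Equation 4.14 and the contingency-table metrics above concrete, the following is a minimal sketch. It assumes P(S) and P(X ∩ Y) have already been estimated from page counts as in Equations 4.4 and 4.5; the function names are illustrative, and while the default threshold mirrors the empirically chosen OU_T = −8.39 reported here, the choice of natural logarithm is an assumption, so the threshold value is only indicative.

```python
import math

def odds_of_unithood(p_s: float, p_xy: float) -> float:
    """Odds of Unithood, Equation 4.14: OU(s) = log O_L + log O_G,
    with O_L = P(S)/P(X ∩ Y \\ S) and O_G = P(S)/(1 - P(S))."""
    if p_s <= 0.0:
        return float("-inf")               # s never observed: lowest possible odds
    p_xy_minus_s = max(p_xy - p_s, 0.0)    # P(X ∩ Y \ S), guarded against noisy counts
    o_l = p_s / p_xy_minus_s if p_xy_minus_s > 0 else 1.0  # O_L = 1 when P(X ∩ Y \ S) = 0
    o_g = p_s / (1.0 - p_s) if p_s < 1.0 else float("inf")
    return math.log(o_l) + math.log(o_g)   # log base assumed to be e in this sketch

def should_merge(p_s: float, p_xy: float, ou_threshold: float = -8.39) -> bool:
    """Merge a_x and a_y into s when OU(s) exceeds the single threshold OU_T."""
    return odds_of_unithood(p_s, p_xy) > ou_threshold

def precision_recall_accuracy(tp: int, fp: int, fn: int, tn: int):
    """Standard contingency-table metrics used to compare OU and UH."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Using the made-up page-count probabilities from the earlier sketch.
print(should_merge(p_s=9_500_000 / 8e9, p_xy=11_000_000 / 8e9))
```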
Figure 4.4 summarises the performance of OU and UH in determining the unithood of 1, 825 pairs of lexical units. One will notice that our new measure OU outperformed the empirical measure UH in all aspects, with an improvement of 2.63%, 3.33% and 2.74% for precision, recall and accuracy, respectively. Our new measure achieved a 100% precision with a lower recall at 95.83%. However, more evaluations using larger datasets and 69 70 Chapter 4. Text Processing statistical tests for significance are required to further validate the performance of the probabilistic measure OU. As with any measures that employ thresholds as a cut-off point for accepting or rejecting certain decisions, we can improve the recall of OU by decreasing the threshold OUT . In this way, there will be less false negatives (i.e. pairs which are supposed to be merged but are not) and hence, increases the recall rate. Unfortunately, recall will improve at the expense of precision since the number of false positives will definitely increase from the existing 0. Since our application (i.e. ontology learning) requires perfect precision in determining the unithood of noun phrases, OU is the ideal candidate. Moreover, with only one threshold (i.e. OUT ) required in controlling the performance of OU, we are able to reduce the amount of time and effort spent on optimising our results. 4.5 Conclusion and Future Work In this chapter, we highlighted the significance of unithood, and that its measurement should be given equal attention by researchers in ontology learning. We focused on the development of a dedicated probabilistic measure for determining the unithood of word sequences. We refer to this measure as the Odds of Unithood (OU). OU is derived using Bayes Theorem and is founded upon two evidence, namely, local occurrence and global occurrence. Elementary probabilities estimated using page counts from Web search engines are utilised to quantify the two evidence. The new probabilistic measure OU is then evaluated against an existing empirical measure known as Unithood (UH). Our new measure OU achieved a precision and a recall of 100% and 95.83%, respectively, with an accuracy at 97.26% in measuring the unithood of 1, 825 test cases. OU outperformed UH by 2.63%, 3.33% and 2.74% in terms of precision, recall and accuracy, respectively. Moreover, our new measure requires only one threshold, as compared to five in UH to control the mergeability decision. More work is required to establish the coverage and the depth of the World Wide Web with regard to the determination of unithood. While the Web has demonstrated reasonable strength in handling general news articles, we have yet to study its appropriateness in dealing with unithood determination for technical text (i.e. the depth of the Web). Similarly, it remains a question the extent to which the Web is able to satisfy the requirement of unithood determination for a wider range of domains (i.e. the coverage of the Web). Studies on the effect of noises (e.g. keyword spamming) and multiple word senses on unithood determination using the Web is 4.6. Acknowledgement another future research direction. 4.6 Acknowledgement This research was supported by the Australian Endeavour International Postgraduate Research Scholarship, and the Research Grant 2006 by the University of Western Australia. 4.7 Other Publications on this Topic Wong, W., Liu, W. & Bennamoun, M. (2007) Determining the Unithood of Word Sequences using Mutual Information and Independence Measure. 
In the Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING), Melbourne, Australia. This paper presents the work on the adaptation of existing word association measures to form the UH measure. The ideas on UH were later reformulated to give rise to the probabilistic measure OU. The description of OU forms the core contents of this Chapter 4. Wong, W., Liu, W. & Bennamoun, M. (2008) Determination of Unithood and Termhood for Term Recognition. M. Song and Y. Wu (eds.), Handbook of Research on Text and Web Mining Technologies, IGI Global. This book chapter combines the ideas on the UH measure, from Chapter 4, and the TH measure, from Chapter 5. 71 72 Chapter 4. Text Processing CHAPTER 5 Term Recognition Abstract Term recognition identifies domain-relevant terms which are essential for discovering domain concepts and for the construction of terminologies required by a wide range of natural language applications. Many techniques have been developed in an attempt to numerically determine or quantify termhood based on term characteristics. Some of the apparent shortcomings of existing techniques are the ad-hoc combination of termhood evidence, mathematically-unfounded derivation of scores and implicit assumptions concerning term characteristics. We propose a probabilistic framework for formalising and combining qualitative evidence based on explicitly defined term characteristics to produce a new termhood measure. Our qualitative and quantitative evaluations demonstrate consistently better precision, recall and accuracy compared to three other existing ad-hoc measures. 5.1 Introduction Technical terms, more commonly referred to as terms, are content-bearing lexical units which describe the various aspects of a particular domain. There are two types of terms, namely, simple terms (i.e. single-word terms) and complex terms (multi-word terms). In general, the task of identifying domain-relevant terms is referred to as automatic term recognition, term extraction or terminology mining. The broader scope of term recognition can also be viewed in terms of the computational problem of measuring termhood, which is the extent of a term’s relevance to a particular domain [120]. Terms are particularly important for labelling or designating domain-specific concepts, and for contributing to the construction of terminologies, which are essentially enumerations of technical terms in a domain. Manual efforts in term recognition are no longer viable as more new terms come into use and new meanings may be added to existing terms as a result of information explosion. Coupled with the significance of terminologies to a wide range of applications such as ontology learning, machine translation and thesaurus construction, automatic term recognition is the next logical solution. 0 This chapter appeared in Intelligent Data Analysis, Volume 13, Issue 4, Pages 499-539, 2009, with the title “A Probabilistic Framework for Automatic Term Recognition”. 73 74 Chapter 5. Term Recognition Very often, term recognition is considered as similar or equivalent to namedentity recognition, information retrieval and term relatedness measurement. An obvious dissimilarity between named-entity recognition and term recognition is that the former is a deterministic problem of classification whereas the latter involves the subjective measurement of relevance and ranking. 
Hence, unlike the evaluation of named-entity recognition where various platforms such as the BioCreAtIvE Task [106] and the Message Understanding Conference (MUC) [42] are readily available, determining the performance of term recognition remains an extremely subjective problem. Having closer resemblance to information retrieval in that both involve relevance ranking, term recognition does have its unique requirements [120]. Unlike information retrieval where information relevance can be evaluated based on user information needs, term recognition does not have user queries as evidence for deciding on the domain relevance of terms. In general, term recognition can be performed with or without initial seedterms as evidence. The seedterms enable term recognition to be conducted in a controlled environment and offer more predictable outcomes. Term recognition using seedterms, also referred to as guided term recognition is in some aspects similar to measuring term relatedness. The relevance of terms to a domain in guided term recognition is determined in terms of their semantic relatedness with the domain seedterms. Therefore, existing semantic similarity or relatedness measures based on lexical information (e.g. WordNet [206], Wikipedia [276]), corpus statistics (e.g. Web corpus [50]), or the combination of both [114] are available for use. Without using seedterms, term recognition relies solely on term characteristics as evidence. This term recognition approach is far more difficult and faces numerous challenges. The focus of this chapter is on term recognition without seedterms. In this chapter, we develop a formal framework for quantifying evidence based on qualitative term characteristics for the purpose of measuring termhood, and ultimately, term recognition. Several techniques have been developed in an attempt to numerically determine or quantify termhood based on a list of term characteristics. The shortcomings of existing techniques can be examined from three perspectives. Firstly, word or document frequency in text corpus has always been the main source of evidence due to its accessibility and computability. Despite a general agreement [37] that frequency is a good criteria for discriminating terms from non-terms, frequency alone is insufficient. Many researchers began realising this issue and more diverse evidence [131, 109] was incorporated, especially linguistic-based such as syntactic and semantic information. Unfortunately, as the types of evidence become 5.1. Introduction increasingly diversified (e.g. numerical and nominal), the consolidation of evidence by existing techniques becomes more ad-hoc. This issue is very obvious when one examines simple but crucial questions as to why certain measures take different bases for logarithm, or why two weights were combined using addition instead of multiplication. In the words of Kageura & Umino [120], most of the existing techniques “take an empirical or pragmatic standpoint regarding the meaning of weight”. Secondly, the underlying assumptions made by many techniques regarding term characteristics for deriving evidence were mostly implicit. This makes the task of characteristic attribution and tracing inaccuracies in termhood measurement difficult. Thirdly, many techniques for determining termhood failed to provide ways for selecting the final terms from a long list of term candidates. According to Cabre-Castellvi et al. 
[37], “all systems propose large lists of candidate terms, which at the end of the process have to be manually accepted or rejected.”. In short, the derivation of a formal termhood measure based on term characteristics for term recognition requires solutions to the following issues: • the development of a general framework to consolidate all evidence representing the various term characteristics; • the determination of the types of evidence to be included to ensure that the resulting score will closely reflect the actual state of termhood implied by a term’s characteristics; • the explicit definition of term characteristics and their attribution to linguistic theories (if any) or other justifications; and • the automatic determination of optimal thresholds to identify terms from the final lists of ranked term candidates. The main objective of this chapter is to address the development of a new probabilistic framework for incorporating qualitative evidence for measuring termhood which provides solutions to all four issues outlined above. This new framework is based on the general Bayes Theorem, and the word distributions required for computing termhood evidence are founded upon the Zipf-Mandelbrot model. The secondary objective of this chapter is to demonstrate the performance of this new term recognition technique in comparison with existing techniques using widelyaccessible benchmarks. In Section 5.2, we summarise the notations and datasets employed throughout this chapter for the formulation of equations, experiments 75 76 Chapter 5. Term Recognition and evaluations. Section 5.3.1 and 5.3.2 summarise several prominent probabilistic models and ad-hoc techniques related to term recognition. In Section 5.3.3, we discuss several commonly employed word distribution models that are crucial for formalising statistical and linguistic evidence. We outline in detail our proposed technique for term recognition in Section 5.4. We evaluate our new technique, both qualitatively and quantitatively, in Section 5.5 and compare its performance with several other existing techniques. In particular, Section 5.5.2 includes the detailed description of an automatic way of identifying actual terms from the list of ranked term candidates. We conclude this chapter in Section 5.6 with an outlook to future work. 5.2 Notations and Datasets In this section, we discuss briefly the types of termhood evidence. This section can also be used as a reference for readers who require clarification about the notations used at any point in this chapter. In addition, we summarise the origin and composition of the datasets employed in various parts of this chapter for experiments and evaluations. There is a wide array of evidence employed for term recognition ranging from statistical to linguistics. Word and document frequency is used extensively to measure the significance of a lexical unit. Depending on how frequency is employed, one can classify the term recognition techniques as either ad-hoc or probabilistic. Linguistic evidence, on the other hand, typically includes syntactical and semantic information. Syntactical evidence relies upon information about how distinct lexical units assume the role of heads and modifiers to form complex terms. For semantic evidence, some predefined knowledge is often employed to relate one lexical unit with others to form networks of related terms. Frequency refers to the number of occurrences of certain event or entity. 
There are mainly two types of frequency related to the area of term recognition. The first is document frequency. Document frequency refers to the number of documents in a corpus that contains some words of interest. There are many different notations but throughout this chapter, we will adopt the notation N as the number of documents in a corpus and na as the number of documents in the corpus which contains word a. In cases where more than one corpus is involved, nax is used to denote the number of documents in corpus x containing word a. The second type of frequency is term frequency. Term frequency is the number of occurrences of certain words in a corpus. In other words, term frequency is independent of the documents in the corpus. We will employ the notation fa as the number of occurrences of word a in a corpus and 5.2. Notations and Datasets F as the sum of the number of occurrences of all words in a corpus. In the case where different units of text are involved such as paragraphs, sentences, documents or even corpora, fax represents the frequency of candidate a in unit x. Given that P W is the set of all distinct words in a corpus, then F = ∀a∈W fa . With regard to term recognition, we will use the notation T C to represent the set of all term candidates extracted from some corpus for processing, and |T C| is the number of term candidates in T C. The notation a where a ∈ T C is used to represent a term candidate and it can either be simple or complex. For complex terms, term candidate a is made up of constituents where ah is the head and Ma is the set of modifiers. A term candidate can also be surrounded by a set of context words Ca . The notion of context words may differ across different techniques. Certain techniques consider all words surrounding term a located within a fixed-size window as context words of a, while others may employ grammatical relations to extract context words. The actual composition of Ca is not of concern at this point. Following this, Ca ∩ T C is simply the context words of a which are also term candidates themselves (i.e. context terms). Figure 5.1: Summary of the datasets employed throughout this chapter for experiments and evaluations. Throughout this chapter, we employ a standard set of corpora for experimenting with the various aspects of term recognition, and also for evaluation purposes. The corpora that we employ are divided into two groups. The first group, known as domain corpus, consists of a collection of abstracts in the domain of molecular biology that is made available through the GENIA corpus [130]. Currently, version 3.0 of the corpus consists of 2, 000 abstracts with a total of 402, 483 word count. 77 78 Chapter 5. Term Recognition The GENIA corpus is an ideal resource for evaluating term recognition techniques since the text in the corpus is marked-up with both part-of-speech tags and semantic categories. Biologically-relevant terms in the corpus were manually identified by two domain experts [130]. Hence, a gold standard (i.e. a list of terms relevant to the domain), represented as the set G, for the molecular biology domain can be constructed by extracting the terms which have semantic descriptors enclosed by cons tags. For reproducibility of our experiments, the corpus can be downloaded from http://www-tsujii.is.s.u-tokyo.ac.jp/∼genia/topics/Corpus/. The second collection of text is called the contrastive corpus and is made up of twelve different text collections gathered from various online sources. 
As the name implies, the second group of text serves to contrast and discriminate the content of the domain corpus. The writing style of the contrastive corpus is different from the domain corpus because the former tend to be prepared using journalistic writing (i.e. written in general language with minimal usage of technical terms), targeting general readers. The contrastive texts were automatically gathered from news provider such as Reuters between the period of February 2006 to July 2007. The summary of the domain corpus and contrastive corpus is presented in Figure 5.1. Note that for simplicity reasons, hereafter, d is used to represent the domain corpus and d¯ for contrastive corpus. 5.3 Related Works There are mainly two schools of techniques in term recognition. The first attempts to begin the empirical study of termhood from a theoretically-founded perspective, while the second is based upon the belief that a method should be judged for its quality of being of practical use. These two groups are by no means exclusive but they form a good platform for comparison. In the first group, probability and statistics are the main guidance for designing new techniques. Probability theory acts as the mathematical foundation for modelling the various components in the corpus, and drawing inferences about different aspects such as relevance and representativeness of documents or domains using descriptive and inferential statistics. In the second group, ad-hoc techniques are characterised by the pragmatic use of evidence to measure termhood. Ad-hoc techniques are usually put together and modified as per need as the observation of immediate results progresses. Obviously, such techniques are at most inspired by, but not derived from formal mathematical models [120]. Many critics claim that such techniques are unfounded and the results that are reported using these techniques are merely coincidental. 79 5.3. Related Works The details of some existing research work on the two groups of techniques relevant to term recognition are presented in the next two subsections. 5.3.1 Existing Probabilistic Models for Term Recognition There is currently no formal framework dedicated to the determination of termhood which combines both statistical and qualitative linguistic evidence. Formal probabilistic models related to dealing with terms in general are mainly studied within the realm of document retrieval and automatic indexing. In probabilistic indexing, one of the first few detailed quantitative models was proposed by Bookstein & Swanson [29]. In this model, the differences in the distributional behaviour of words are employed as a guide to determine if a word should be considered as an index term. This model is founded upon the research on how function words can be closely modeled by a Poisson distribution whereas content words deviate from it [256]. We will elaborate on Poisson and other related models in Section 5.3.3. An even larger collection of literature on probabilistic models can be found in the related area of document retrieval. The simplest of all the retrieval models is the Binary Independence Model [82, 147]. As with all other retrieval models, the Binary Independence Model is designed to estimate the probability that a document j is considered as relevant given a specific query k, which is essentially a bag of words. Let T = {t1 , ...tn } be the set of terms in the collection of documents (i.e. corpus). 
We can then represent the set of terms Tj occurring in document j as a binary vector vj = {x1 , ..., xn } where xi = 1 if ti ∈ Tj and xi = 0 otherwise. This way, the odds of document j, represented by a binary vector vj being relevant R to query k can be computed as [83] O(R|k, vj ) = P (R|k) P (vj |R, k) P (R|k, vj ) = P (R̄|k, vj ) P (R̄|k) P (vj |R̄, k) and based on the assumption of independence between the presence and absence of terms, n P (vj |R, k) Y P (xi |R, k) = P (vj |R̄, k) i=1 P (xi |R̄, k) Other more advanced models that take into consideration other factors such as term frequency, document frequency and document length have also been proposed by researchers such as Spark Jones et al. [117, 118]. There is also another line of research which treats the problem of term recognition as a supervised machine learning task. In this term recognition approach, each 80 Chapter 5. Term Recognition word from a corpus is classified as a term or non-term. Classifiers are trained using annotated domain corpora. The trained models can then be applied to other corpora in the same domain. Turney [251] presented a comparative study between a recognition model based on genetic algorithms and an implementation of the bagged C4.5 decision tree algorithm. Hulth [109] studied the impact of prior input word selection on the performance of term recognition. The author uses a classifier trained on 2, 000 abstracts in the domain of information technology to identify terms from non-terms, and concluded that limiting the input words to NP-chunks offered the best precision. This study further reaffirmed the benefit of incorporating linguistic evidence during the measurement of termhood. 5.3.2 Existing Ad-Hoc Techniques for Term Recognition Most of the current termhood measures for term recognition fall into this ad-hoc techniques group. Term frequency and document frequency are the main types of evidence used by ad-hoc techniques. Unlike the use of classifiers described in the previous section, techniques in this group employ termhood scores for ranking and selecting terms from non-terms. Most common ad-hoc techniques that employ raw frequencies are variants of Term Frequency Inverse Document Frequency (TF-IDF). TF attempts to capture the pervasiveness of a term candidate within some documents, while IDF measures the “informativeness” of a term candidate. Despite the mere heuristic background of TF-IDF, the robustness of this weighting scheme has given rise to a number of variants and has found its way into many retrieval applications. Certain researchers [210] have even attempted to provide theoretical justifications as to why the combination of TF and IDF works so well. Basili et al. [19] proposed a TF-IDF inspired measure for assigning terms with more accurate weights that reflect their specificity with respect to the target domain. This contrastive analysis is based on the heuristic that general language-dependent phenomena should spread similarly across different domain corpus and special-language phenomena should portray odd behaviours. The Contrastive Weight [19] for simple term candidate a in target domain d is defined as: ! P P f j i ij CW (a) = log fad log P (5.1) j faj where fad is the frequency of the simple term candidate a in the target domain d, P P fij is the sum of the frequencies of all term candidates in all domain corpora, j Pi and j faj is the sum of the frequencies of term candidate a in all domain corpora. 81 5.3. 
Related Works For complex term candidates, the frequencies of their heads are utilised to compute their weights. This is necessary because the low frequencies among complex terms make estimations difficult. Consequently, the weight for complex term candidate a in domain d is defined as: CW (a) = fad CW (ah ) (5.2) where fad is the frequency of the complex term candidate a in the target domain d, and CW (ah ) is the contrastive weight for the head, ah of the complex term candidate, a. The use of head noun by Basili et al. [19] for computing the contrastive weights of complex term candidates CW (a) reflects the head-modifier principle [105]. The principle suggests that the information being conveyed by complex terms manifests itself in the arrangement of the constituents. The head acts as the key that refers to a general category to which all other modifications of the head belong. The modifiers are responsible for distinguishing the head from other forms in the same category. Wong et al. [274] presented another termhood measure based on contrastive analysis called Termhood (TH) which places emphasis on the difference between the notion of prevalence and tendency. The measure computes a Discriminative Weight (DW) for each candidate a as: DW (a) = DP (a)DT (a) (5.3) This weight realises the heuristic that the task of discriminating terms from nonterms is a function of Domain Prevalence (DP) and Domain Tendency (DT). If a is a simple term candidate, its DP is defined as: P P j fjd + j fj d¯ DP (a) = log10 (fad + 10) log10 + 10 (5.4) fad + fad¯ P P where j fjd + j fj d¯ is the sum of the frequencies of occurrences of all a ∈ T C in both domain and contrastive corpus, while fad and fad¯ are the frequencies of occurrences of a in the domain corpus and contrastive corpus, respectively. DP simply increases, with offset for too frequent terms, along with the frequency of a in the domain corpus. If the term candidate is complex, the authors define its DP as: DP (a) = log10 (fad + 10)DP (ah )M F (a) (5.5) The reason behind the use of the DP of the complex term’s head (i.e. DP (ah )) in Equation 5.5 is similar to that of CW in Equation 5.2. DT , on the other hand, is 82 Chapter 5. Term Recognition employed to determine the extent of the inclination of the usage of term candidate a for domain and non-domain purposes. The authors defined DT as: fad + 1 DT (a) = log2 +1 (5.6) fad¯ + 1 where fad is the frequency of occurrences of a in the domain corpus, while fad¯ is the frequency of occurrences of a in the contrastive corpus. If term candidate a is equally common in both domain and non-domains (i.e. contrastive domains), DT = 1. If the usage of a is more inclined towards the target domain, fad > fad¯, then DT > 1, and DT < 1 otherwise. Besides contrastive analysis, the use of contextual evidence to assist in the correct identification of terms is also common. There are currently two dominant approaches to extract contextual information. Most of the existing researchers such as Maynard & Ananiadou [171] employed fixed-size windows for capturing context words for term candidates. The Keyword in Context (KWIC) [159] index can be employed to identify the appropriate windows of words surrounding the term candidates. Other researchers such as Basili et al. [20], LeMoigno et al. [144] and Wong et al. [278] employed grammatical relations to identify verb phrases or independent clauses containing the term candidates. 
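Before turning to measures that exploit such contextual evidence, the contrastive core of the TH measure for simple term candidates (Equations 5.3, 5.4 and 5.6) can be summarised in the short sketch below. The function and argument names and the example frequencies are illustrative only; this is a sketch of the published formulas, not the original implementation.

```python
import math

def domain_prevalence(f_ad: float, f_adbar: float,
                      total_d: float, total_dbar: float) -> float:
    """Domain Prevalence for a simple term candidate, Equation 5.4.
    f_ad / f_adbar: candidate frequency in the domain / contrastive corpus;
    total_d / total_dbar: summed frequencies of all candidates in each corpus."""
    return math.log10(f_ad + 10) * \
        math.log10((total_d + total_dbar) / (f_ad + f_adbar) + 10)

def domain_tendency(f_ad: float, f_adbar: float) -> float:
    """Domain Tendency, Equation 5.6: equals 1 for equally common usage,
    and exceeds 1 when usage leans towards the target domain."""
    return math.log2((f_ad + 1) / (f_adbar + 1) + 1)

def discriminative_weight(f_ad: float, f_adbar: float,
                          total_d: float, total_dbar: float) -> float:
    """Discriminative Weight, Equation 5.3: DW(a) = DP(a) * DT(a)."""
    return domain_prevalence(f_ad, f_adbar, total_d, total_dbar) * \
        domain_tendency(f_ad, f_adbar)

# Made-up frequencies for a candidate that is common in the domain corpus
# but rare in the contrastive corpus.
print(discriminative_weight(f_ad=120, f_adbar=3,
                            total_d=402_483, total_dbar=2_000_000))
```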
One of the work along the line of incorporating contextual information is NCvalue by Frantzi & Ananiadou [80]. Part of the NCvalue measure involves the assignment of weights to context words in the form of nouns, adjectives and verbs located within a fixed-size window from the term candidate. Given that T C is the set of all term candidates and c is a noun, verb or adjective appearing with term candidates, weight(c) is defined as: P |T Cc | e∈T Cc fe (5.7) + weight(c) = 0.5 |T C| fc P where T Cc is the set of term candidates that have c as a context word, e∈T Cc fe is the sum of the frequencies of term candidates that appear with c, and fc is the frequency of c in the corpus. After calculating the weights for all possible context words, the sum of the weights of context words appearing with each term candidate is obtained. Formally, for each term candidate a that has a set of accompanying context words Ca , the cumulative context weight is defined as: X cweight(a) = weight(c) + 1 (5.8) c∈Ca Eventually, the NCvalue for a term candidate is defined as: N Cvalue(a) = 1 Cvalue(a)cweight(a) log F (5.9) 83 5.3. Related Works where F is the number of words in the corpus. Cvalue(a) is given by, log |a|f if |a| = g a 2 P Cvalue(a) = log |a|(f − l∈La fl ) otherwise 2 a (5.10) |La | where |a| is the number of words that constitute a, La is the set of potential longer term candidates that contain a, g is the longest n-gram considered, and fa is frequency of occurrences of a in the corpus. The T H measure by Wong et al. [274] incorporates contextual evidence in the form of Average Contextual Discriminative Weight (ACDW). ACDW is the average DW of the context words of a adjusted based on the context’s relatedness with a: P DW (c)N GD(a, c) (5.11) ACDW (a) = c∈Ca |Ca | where NGD is the Normalised Google Distance by Cilibrasi & Vitanyi [50] that is used to determine the relatedness between two lexical units without any feature extraction or static background knowledge. The final termhood score for each term candidate a is given by [278]: T H(a) = DW (a) + ACC(a) (5.12) where ACC is the adjusted value of ACDW based on DW of Equation 5.3. The inclusion of semantic relatedness measure by Wong et al. [278] brings us to the use of semantic information during the determination of termhood. Maynard & Ananiadou [171, 172] employed the Unified Medical Language System (UMLS) to compute two weights, namely, positional and commonality. Positional weight is obtained based on the combined number of nodes belonging to each word, while commonality is measured by the number of shared common ancestors multiplied by the number of words. Accordingly, the similarity between two term candidates is defined as [171]: sim(a, b) = com(a, b) pos(a, b) (5.13) where com(a, b) and pos(a, b) is the commonality and positional weight, respectively, between term candidate a and b. The authors then modified the NCvalue discussed in Equation 5.9 by incorporating the new similarity measure as part of a Context Factor (CF). The context factor of a term candidate a is defined as: X X fb|a sim(a, b) (5.14) CF (a) = fc|a weight(c) + c∈Ca b∈CTa 84 Chapter 5. Term Recognition where Ca is the set of context words of a, fc|a is the frequency of c as a context word of a, weight(c) is the weight for context word c as defined in Equation 5.7, CTa is the set of context words of a which also happen to be term candidates (i.e. 
context terms), fb|a is the frequency of b as a context term of a, and sim(a, b) is the similarity between term candidate a and its context term b as defined in Equation 5.13. The new NCvalue is defined as: N Cvalue(a) = 0.8Cvalue(a) + 0.2CF (a) (5.15) Basili et al. [20] commented that the use of extensive and well-grounded semantic resources by Maynard & Ananiadou [171] faces the issue of portability to other domains. Instead, Basili et al. [20] combined the use of contextual information and the head-modifier principle to capture term candidates and their context words on a feature space for computing similarity using the cosine measure. According to the authors [20], “the term sense is usually determined by its head.”. On the contrary, such statement by the authors opposes the fundamental fact, not only in terminology but in general linguistics, that simple terms are polysemous and the modification of such terms is necessary to narrow down their possible interpretations [105]. Moreover, the size of corpus has to be very large, and the specificity and density of domain terms in the corpus has to be very high to allow for extraction of adequate features. In summary, while the existing techniques described above may be intuitively justifiable, the manner in which the weights were derived remains questionable. To illustrate, why are the products of the various variables in Equations 5.1 and 5.9 taken instead of their summations? What would happen to the resulting weights if the products are taken instead of the summations, and the summations taken instead of the products in Equations 5.7, 5.15 and 5.4? These are just minor but thought-provoking questions in comparison to more fundamental issues related to the decomposability and traceability of the weights back to their various constituents or individual evidence. The two main advantages of decomposability and traceability are (1) the ability to trace inaccuracies of termhood measurement to their origin (i.e. what went wrong and why), and (2) the attribution of the significance of the various weights to their intended term characteristics (i.e. what do the weights measure?). 5.3.3 Word Distribution Models An alternative to the use of relative frequency as practiced by many ad-hoc techniques discussed above in Section 5.3.2 is to develop models of the distribution 85 5.3. Related Works of words and employ such models to describe the various characteristics of terms, and the corpus or the domain they represent. It is worth pointing out that the modelling is done with respect to all words (i.e. terms and non-terms) that a corpus contains. This is important for capturing the behaviour of both terms and nonterms in the domain for discrimination purposes. Word distribution models can be used to normalise frequency of occurrence [7] and to solve problems related to data sparsity caused by the use of raw frequencies in ad-hoc techniques. The modelling of word distributions in documents or the entire corpus can also be employed as means of predicting the rate of occurrence of words. There are mainly two groups of models related to word distribution. The first group attempts to model the frequency distribution of all words in an entire corpus while the second group focuses on the distribution of a single word. The foundation of the first group of models is the relationship between the frequencies of words and their ranks. 
One of the most widely-used models in this group is the Zipf ’s Law [285] which describes the relationship between the frequency of a word, f and its rank, r as P (r; s, H) = 1 rs H (5.16) where s is an exponent characterising the distribution. Given that 1 ≤ r ≤ |W | where W is the set of all distinct words in the corpus, H is defined as the |W |-th P | −s harmonic number, H = |W i=1 i . The actual notation for H computed as the |W |-th harmonic number is H|W |,s . However, for brevity, we will continue with the use of the notation H. A generalised version of the Zipfian distribution is the Zipf-Mandelbrot Law [168] whose probability mass function is given by P (r; q, s, H) = 1 (r + q)s H (5.17) where q is a parameter for expressing the richness of word usage in the text. SimiP | −s larly, H can be computed as H = |W i=1 (i + q) . There is still a hyperbolic relation between rank and frequency in the Zipf-Mandelbrot Distribution. The additional parameter q can be used to model curves in the distribution, something not possible in the original Zipf. There are few other probability distributions that can or have been used to model word distribution such as the Pareto distribution [7], the Yule-Simon distribution [230] and the generalised inverse Gauss-Poisson law [12]. All of the distributions described above are discrete power law distributions, except for Pareto, that have the ability to model the unique property of word occurrences, 86 Chapter 5. Term Recognition (a) Distribution of words extracted from the domain corpus dispersed according to the domain corpus. (b) Distribution of words extracted from the domain corpus dispersed according to the contrastive corpus. Figure 5.2: Distribution of 3, 058 words randomly sampled from the domain corpus d. The line with the label “KM” is the aggregation of the individual probability of occurrence of word i in a document, 1 − P (0; αi , βi ) using K-mixture with αi and βi defined in Equations 5.21 and 5.20. The line with the label “ZM-MF” is the manually fitted Zipf-Mandelbrot model. The line labeled “RF” is the actual rate of occurrence computed as fi /F . 87 5.3. Related Works namely, the “long tail phenomenon”. One of the main problems that hinders the practical use of these distributions is the estimation of the various parameters [112]. To illustrate, Figure 5.3 summarises the parameters of the manually fitted ZipfMandelbrot models for the distribution of a set of 3, 058 words randomly drawn from our domain corpus d. The lines with the label “ZM-MF” shown in Figures 5.2(a) and 5.2(b) show the distributions of the words dispersed according to the domain corpus, and the contrastive corpus, respectively. One can notice that the distribution of the words in d¯ is particularly difficult to be fitted because they tend to have a bulge near the end. This is caused by the presence of many domain-specific terms (which are unique to d) in the set of 3, 058 words. Such domain-specific terms will have extremely low or most of the time, zero word count. Nevertheless, a better fit for the Figure 5.2(b) can be achieved through more trial-and-error. In addition to the trial-and-error exercise required in manual fitting, the values in Figure 5.3 clearly show that different parameters are required even for fitting the same set of words using different corpus. The manual fits we have carried out are far from perfect and some automatic fitting mechanism is required if we were to practically employ the Zipf-Mandelbrot model. 
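As a rough illustration of what such an automatic fitting mechanism might look like, the sketch below fits the two Zipf-Mandelbrot parameters of Equation 5.17 by a simple grid search against an observed rank-frequency list. The parameter ranges, the log-space squared-error criterion and the synthetic data are arbitrary choices made for this sketch; they are not the procedure behind the manual fits in Figure 5.3.

```python
import numpy as np

def zipf_mandelbrot(ranks: np.ndarray, q: float, s: float) -> np.ndarray:
    """Zipf-Mandelbrot probability mass, Equation 5.17:
    P(r) = 1 / ((r + q)^s * H), with H the normalising constant."""
    weights = 1.0 / (ranks + q) ** s
    return weights / weights.sum()

def fit_zipf_mandelbrot(freqs: np.ndarray):
    """Crude automatic fit of (q, s) by grid search, minimising squared error
    between observed relative frequencies and the model in log space."""
    freqs = np.sort(np.asarray(freqs, dtype=float))[::-1]
    observed = freqs / freqs.sum()
    ranks = np.arange(1, len(freqs) + 1)
    best, best_err = (0.0, 1.0), np.inf
    for q in np.linspace(0.0, 50.0, 101):
        for s in np.linspace(0.5, 2.5, 81):
            model = zipf_mandelbrot(ranks, q, s)
            err = np.sum((np.log(model) - np.log(observed + 1e-12)) ** 2)
            if err < best_err:
                best, best_err = (q, s), err
    return best

# Synthetic rank-frequency data generated from the model itself; the fit
# should recover parameters close to (q, s) = (5.0, 1.2).
ranks = np.arange(1, 501)
synthetic_freqs = np.round(zipf_mandelbrot(ranks, q=5.0, s=1.2) * 100_000)
print(fit_zipf_mandelbrot(synthetic_freqs))
```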
In the words of Edmundson [65], "a distribution with more or different parameters may be required. It is clear that computers should be used on this problem...".

Figure 5.3: Parameters of the manually fitted Zipf-Mandelbrot models for the set of 3,058 words randomly drawn from d.

In the second group of models, individual word distributions allow us to capture and express the behaviour of individual words in parts of a corpus. The standard probabilistic model for the distribution of some event over fixed-size units is the Poisson distribution. In the conventional case of individual word distribution, the event would be k occurrences of word i and the unit would be a document. The Poisson distribution is defined as

P(k; \lambda_i) = \frac{e^{-\lambda_i} \lambda_i^k}{k!}    (5.18)

where λ_i is the average number of occurrences of word i per document, or λ_i = f_i/N. Obviously, λ_i will vary between different words. P(0; λ_i) gives the probability that word i does not occur in a document, while 1 − P(0; λ_i) gives the probability that a candidate has at least one occurrence in a document. Other similarly unsuccessful attempts at better fits are the Binomial model and the Two-Poisson model [82, 240, 83]. These single-parameter distributions (i.e. Poisson and Binomial) have traditionally been employed to model individual word distributions based on unrealistic assumptions such as independence between word occurrences. As a result, they are poor fits of the actual word distribution. Nevertheless, such variation from the Poisson distribution, colloquially known as non-Poissonness, serves a purpose. It is well known throughout the literature [256, 45, 169] that the Poisson distribution is only a good fit for function words, while content words tend to deviate from it. Using this property, we can also employ the single Poisson as a predictor of whether a lexical unit is a content word or not, and hence as an indicator of possible termhood. A better fit for individual word distributions employs a mixture of Poissons [170, 46]. The Negative Binomial is one such mixture, but the involvement of large binomial coefficients makes it computationally unattractive. Another alternative is the K-mixture proposed by Katz [122], which allows the Poisson parameter λ_i to vary between documents. The distribution of k occurrences of word i in a document is given by:

P(k; \alpha_i, \beta_i) = (1 - \alpha_i)\,\delta_{k,0} + \frac{\alpha_i}{\beta_i + 1}\left(\frac{\beta_i}{\beta_i + 1}\right)^k    (5.19)

where δ_{k,0} is the Kronecker delta: δ_{k,0} = 1 if k = 0 and δ_{k,0} = 0 otherwise. The parameters β_i and α_i can be computed as:

\beta_i = \frac{f_i - n_i}{n_i}    (5.20)

\alpha_i = \frac{\lambda_i}{\beta_i}    (5.21)

where λ_i is the single-Poisson parameter of the observed mean and n_i is the number of documents containing word i. β_i determines the number of additional occurrences of word i per document that contains i, and α_i can be seen as a measure of the fraction of documents containing i. One of the properties of the K-mixture is that it is always a perfect fit at k = 0. This desirable property can be employed to accurately determine the probability of non-occurrence of word i in a document. P(0; α_i, β_i) gives us the probability that word i does not occur in a document, and 1 − P(0; α_i, β_i) gives us the probability that the word has at least one occurrence in a document (i.e. the candidate exists in a document). When k = 0, the K-mixture reduces to

P(0; \alpha_i, \beta_i) = (1 - \alpha_i) + \frac{\alpha_i}{\beta_i + 1}    (5.22)

Unlike fixed-size textual units such as documents, the notion of domains is elusive.
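The K-mixture of Equations 5.19-5.22 is straightforward to compute once f_i, n_i and N are known. Below is a minimal sketch, assuming f_i is the corpus frequency of word i, n_i the number of documents containing i, and N the total number of documents; the function names are illustrative, and the assumption f_i > n_i (so that β_i > 0) is a simplification of this sketch rather than part of the original formulation.

def k_mixture_params(f_i, n_i, N):
    """Equations 5.20-5.21 (assumes f_i > n_i so that beta_i > 0)."""
    lam = f_i / N                     # single-Poisson mean, lambda_i
    beta = (f_i - n_i) / n_i          # extra occurrences per document containing word i
    alpha = lam / beta
    return alpha, beta

def k_mixture_pmf(k, alpha, beta):
    """Equation 5.19: P(k; alpha_i, beta_i)."""
    delta = 1.0 if k == 0 else 0.0    # Kronecker delta
    return (1 - alpha) * delta + (alpha / (beta + 1)) * (beta / (beta + 1)) ** k

# Probability that word i occurs at least once in a document:
# 1 - k_mixture_pmf(0, alpha, beta), i.e. the complement of Equation 5.22.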
The lines labeled with “KM” in Figure 5.2 are the result of the aggregation of the individual probability of occurrence of word i in documents of the respective corpora. Figures 5.2(a) and 5.2(b) clearly show that models like K-mixture whose distributions are defined over documents or other units with clear, explicit boundaries cannot be employed directly as predictors for the actual rate of occurrence of words in domain. 5.4 A New Probabilistic Framework for Determining Termhood We begin the derivation of the new probabilistic framework for term recognition by examining the definition of termhood. Based on two prominent review papers on term recognition [120, 131], we define termhood as: Definition 5.4.1. Termhood is the degree to which a lexical unit is relevant to a domain of interest. As outlined in Section 5.1, our focus is to construct a formal framework which combines evidence, in the form of term characteristics, instead of seedterms for term recognition. The aggregated evidence can then used to determine the extent of relevance of the corresponding term with respect to a particular domain, as in the definition of termhood in Definition 5.4.1. The characteristics of terms manifest themselves in suitable corpora that represent the domain of interest. From here on, we will use the notation d to interchangeably denote the elusive notion of a domain and its tangible counterpart, the domain corpus. Since the quality of the termhood evidence with respect to the domain is dependent on the issue of representativeness of the corresponding corpus, the following assumption is necessary for us to proceed: Assumption 1. Corpus d is a balanced, unbiased and randomised sample of the population text representing the corresponding domain. The actual discussion on corpus representativeness is nevertheless important but the issue is beyond the scope of this chapter. Having Assumption 1 in place, we restate Definition 5.4.1 in terms of probability to allow us to formulate a probabilistic model for measuring termhood in the next two steps, 90 Chapter 5. Term Recognition Aim 1. What is the probability that a is relevant to domain d given the evidence a has? In the second step, we lay the foundation for the various term characteristics which are mentioned throughout this chapter, and used specifically for formalising termhood evidence in Section 5.4.2. We subscribed to the definition of ideal terms as adopted by many researchers [157]. Definition 5.4.2 outlines the primary characteristics of terms. These characteristics rarely exist in real-world settings since word ambiguity is a common phenomenon in linguistics. Nevertheless, this definition is necessary to establish a platform for determining the extent of deviation of the characteristics of terms in actual usage from the ideal cases, Definition 5.4.2. The primary characteristics of terms in ideal settings are: • Terms should not have synonyms. In other words, there should be no different terms implying the same meaning. • Meaning of terms is independent of context. • Meaning of terms should be precise and related directly to a concept. In other words, a term should not have different meanings or senses. In addition to Definition 5.4.2, there are several other related characteristics of terms which are of common knowledge in this area. Some of these characteristics follow from the general properties of words in linguistics. This list is not a standard and is by no means exhaustive or properly theorised. 
Nonetheless, as we have pointed out, such heuristically-motivated list is one of the foundation of automatic term recognition. They are as follow: Definition 5.4.3. The extended characteristics of terms: 1 Terms are properties of domain, not document [19]. 2 Terms tend to clump together [28] the same way content-bearing words do [285]. 3 Terms with longer length are rare in a corpus since the usage of words with shorter length is more predominant [284]. 4 Simple terms are often ambiguous and modifiers are required to reduce the number of possible interpretations. 91 5.4. A New Probabilistic Framework for Determining Termhood 5 Complex terms are preferred [80] since the specificity of such terms with respect to certain domains are well-defined. Definition 5.4.2 simply states that a term is unambiguously relevant to a domain. For instance, assume that once we encounter the term “bridge”, it has to immediately mean “a device that connects multiple network segments at the data link layer”, and nothing else. At the same time, such “device” should not be identifiable using other labels. If this is the case, all we need to do is to measure the extent to which a term candidate is relevant to a domain regardless of its relevance to other domains since an ideal term cannot be relevant to both (as implied in Definition 5.4.2). This brings us to the third step where we can now formulate our Aim 1 as a conditional probability between two events and pose it using Bayes Theorem, P (R1 |A) = P (A|R1 )P (R1 ) P (A) (5.23) where R1 is the event that a is relevant to domain d and A is the event that a is a candidate with evidence set V = {E1 , ..., Em }. P (R1 |A) is the posterior probability of candidate a being relevant to d given the evidence set V associated to a. P (R1 ) and P (A) are the prior probabilities of candidate a being relevant without any evidence, and the probability of a being a candidate with evidence V , respectively. One has to bare in mind that Equation 5.23 is founded upon the Bayesian interpretation of probability. Consequently, subjective rather than frequency-based assessments of P (R1 ) and P (A) are well-accepted, at least by the Bayesians. As we shall see later, these two prior probabilities will be immaterial in the final computation of weights for the candidates. In addition, we introduce the event that a is relevant to other ¯ R2 , which can be seen as the complementary event of R1 . Similar to domains d, ¯ Assumption 1, we subscribe to the following assumption for d, Assumption 2. Contrastive corpus d¯ is the set of balanced, unbiased and randomised sample of the population text representing approximately all major domains other than d. Based on the ideal characteristics of terms in Definition 5.4.2, and the new event R2 , we can state that P (R1 ∩ R2 ) = 0. In other words, R1 and R2 are mutually exclusive in ideal settings. Ignoring the fact that a term may appear in certain ¯ but definitely domains by chance, any candidate a can either be relevant to d or to d, not both. Unfortunately, a point worth noting is that “An impregnable barrier between words of a general language and terminologies does not exist.” [157]. For 92 Chapter 5. Term Recognition example, the word “bridge” has multiple meaning and is definitely relevant to more than one domain, or in other words, P (R1 ∩ R2 ) is not strictly 0 in reality. While people in the computer networking domain may accept and use the word “bridge” as a term, it is in fact not an ideal term. 
Words like “bridge” are often a poor choice of terms (i.e. not ideal terms) simply because they are simple terms, and inherently ambiguous as defined in Definition 5.4.3.4. Instead, a better term for denoting the concept which the word “bridge” attempts to represent would be “network bridge”. As such, we assume that: Assumption 3. Each concept represented using a polysemous simple term in a corpus has a corresponding unambiguous complex term representation occurring in the same corpus. From Assumption 3, since all important concepts of a domain have unambiguous manifestations in the corpus, the possibility of the ambiguous counterparts achieving lower ranks during our termhood measurement will have no effect on the overall term recognition output. In other words, polysemous simple terms can be considered as insignificant in our determination of termhood. Based on this alone, we can assume ¯ This brings us that the non-relevance to d approximately implies the relevance to d. to the next property about the prior probability of relevance of terms. The mutual exclusion and complementation properties of the relevance of terms in d, R1 , and in ¯ R2 are: d, • P (R1 ∩ R2 ) ≈ 0 • P (R1 ∪ R2 ) = P (R1 ) + P (R2 ) ≈ 1 Even in the presence of a prior probability to this approximation, the addition law of probability still has to hold. As such, we can extend this approximation of the sum of the probability of relevance without evidence to include the prior probability of evidence: P (R1 |A) + P (R2 |A) ≈ 1 (5.24) without violating the probability axioms. Knowing that P (R1 ∩ R2 ) only approximates to 0 in reality, we will need to make sure that the relevance of candidate a in domain d does not happen by chance. The occurrence of a term in a domain is considered as accidental if the concepts represented by the terms are not topical for that domain. Moreover, the accidental repeats of the same term in non-topical cases are possible [122]. Consequently, we 93 5.4. A New Probabilistic Framework for Determining Termhood need to demonstrate the odds of term candidate a being more relevant to d than to ¯ d: Aim 2. What are the odds of candidate a being relevant to domain d given the evidence it has? In this fourth step, we alter Equation 5.23 to reflect our new Aim 2 for determinP ing the odds rather than merely probabilities. Since Odds = 1−P , we can apply an 1 order-preserving transformation by multiplying 1−P (R1 |A) to Equation 5.23 to give us the odds of relevance given the evidence candidate a has: P (R1 |A) P (A|R1 )P (R1 ) = 1 − P (R1 |A) P (A)(1 − P (R1 |A)) (5.25) and since 1 − P (R1 |A) ≈ P (R2 |A) from Equation 5.24, we have: P (A|R1 )P (R1 ) P (R1 |A) = P (R2 |A) P (A)P (R2 |A) (5.26) and applying the multiplication rule P (R2 |A)P (A) = P (A|R2 )P (R2 ) to both sides of Equation 5.25 to obtain P (A|R1 ) P (R1 ) P (R1 |A) = P (R2 |A) P (A|R2 ) P (R2 ) (5.27) Equation 5.27 can also be called the odds of relevance of candidate a to d given the (R1 ) is the odds of relevance evidence a has. The second term in Equation 5.27, PP (R 2) of candidate a without evidence. We can use Equation 5.27 as a way to rank the candidates. Taking the log of odds, we have log P (R1 |A) P (R1 ) P (A|R1 ) = log − log P (A|R2 ) P (R2 |A) P (R2 ) P (A|R1 ) and P (A|R2 ) are the class conditional probabilities for a being a candidate with evidence V given its different states of relevance. 
Since probability of relevance to d and to d¯ of all candidates without any evidence are the same, we can safely ignore the second term (i.e. odds of relevance without evidence) in Equation 5.27 without committing the prosecutor’s fallacy [232]. This gives us log P (R1 |A) P (A|R1 ) ≈ log P (A|R2 ) P (R2 |A) (5.28) To facilitate the scoring and ranking of the candidates based on the evidence they have, we introduce a new function of evidence possessed by candidate a. We call this new function the Odds of Termhood (OT) 94 Chapter 5. Term Recognition P (A|R1 ) (5.29) P (A|R2 ) Since we are only interested in ranking and from Equation 5.28, ranking candidates according to OT (A) is the same as ranking the candidates according to our Aim 2 reflected through Equation 5.27. Obviously, from Equation 5.29, our initial predicament of not being able to empirically interpret the prior probabilities P (A) and P (R1 ) is no longer a problem. OT (A) = log Assumption 4. Independence between evidences in the set V . In the fifth step, we decompose the evidence set V associated with each candidate a to facilitate the assessment of the class conditional probabilities P (A|R1 ) and P (A|R2 ). Given Assumption 4, we can evaluate P (A|R1 ) as Y P (A|R1 ) = P (Ei |R1 ) (5.30) i and P (A|R2 ) as P (A|R2 ) = Y P (Ei |R2 ) (5.31) i where P (Ei |R1 ) and P (Ei |R2 ) are the probabilities of a as a candidate associated with evidence Ei given its different states of relevance R1 and R2 , respectively. Substituting Equation 5.30 and 5.31 in 5.29, we get OT (A) = X log i P (Ei |R1 ) P (Ei |R2 ) (5.32) Lastly, for the ease of computing the evidence, we define individual scores called evidential weight (Oi ) provided by each evidence Ei as Oi = P (Ei |R1 ) P (Ei |R2 ) (5.33) and substituting Equation 5.33 in 5.32 provides OT (A) = X log Oi (5.34) i The purpose of OT is similar to many other functions for scoring and ranking term candidates such as those reviewed in Section 5.3.2. However, what differentiates our new function from the existing ones is that OT is founded upon and derived in a probabilistic framework whose assumptions are made explicit. Moreover, as we will discuss in the following Section 5.4.2, the individual evidence is formulated using probability and the necessary term distributions are derived from formal distribution models to be discussed in Section 5.4.1. 95 5.4. A New Probabilistic Framework for Determining Termhood 5.4.1 Parameters Estimation for Term Distribution Models In Section 5.3.3, we presented a wide range of distribution models for individual terms and for all terms in the corpus. Our intention is to avoid the use of raw frequencies and also relative frequencies for computing the evidential weights, Oi . The shortcomings related to the use of raw frequencies have been clearly highlighted in Section 5.3. In this section, we discuss the two models that we employ for computing the evidential weights, namely, the Zipf-Mandelbrot model and the K-mixture model. The Zipf-Mandelbrot model is employed to predict the probability of occurrence of a term candidate in the domain, while the K-mixture model predicts the probability of a certain number of occurrences of a term candidate in a document of the domain. Most of the literature on Zipf and Zipf-Mandelbrot laws often leave out the single most important aspect that makes these two distributions applicable to real-world applications, namely, parameter estimation. 
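Before turning to parameter estimation, the scoring function just derived can be summarised operationally: Equations 5.33 and 5.34 amount to summing the logarithms of the individual evidential weights. The sketch below illustrates this, assuming each O_i has already been computed and is strictly positive; the function and variable names are illustrative only.

import math

def odds_of_termhood(evidential_weights):
    """OT(A) = sum_i log O_i (Equation 5.34), where each O_i is the ratio
    P(E_i | R1) / P(E_i | R2) of Equation 5.33 and is assumed to be > 0."""
    return sum(math.log(o_i) for o_i in evidential_weights)

def rank_by_termhood(candidates, evidence_functions):
    """Score every candidate by OT and rank in descending order.
    `evidence_functions` is a list of callables, one per evidence E_i,
    each returning the evidential weight O_i for a given candidate."""
    scored = [(a, odds_of_termhood([f(a) for f in evidence_functions]))
              for a in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)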
As we have discussed in Section 5.3.3, the manual process of deciding the parameters, namely s, q and H in the case of the Zipf-Mandelbrot distribution, is tedious and may not easily achieve the best fit. Moreover, the parameters for modelling the same set of terms vary across different corpora. A recent paper by Izsak [112] discussed some general aspects of standardising the process of fitting the Zipf-Mandelbrot model. We experimented with linear regression using the ordinary least squares method [200] and the weighted least squares method [63] to estimate the three parameters s, q and H. The results of our experiments are reported in this section. We would like to stress that our focus is on searching for appropriate parameters that achieve a good fit of the models to the observed data (i.e. raw frequencies and ranks). While we do not discount the importance of observing the various assumptions involved in linear regression, such as the normality of residuals and the homogeneity of variance, their discussion is beyond the scope of this chapter. We begin by linearising Equation 5.17 of the Zipf-Mandelbrot model to allow for linear regression. From here on, where there is no confusion, we will refer to the probability mass function of the Zipf-Mandelbrot model, P(r; q, s, H), as ZM_r for clarity. Taking the natural logarithm of both sides of Equation 5.17, we obtain:

\ln ZM_r = \ln H - s \ln(r + q)    (5.35)

Our aim is then to find the line defined by Equation 5.35 that best fits our observed points \{(\ln r, \ln \frac{f_r}{F}); r = 1, 2, ..., |W|\}. For sufficiently large r, \ln(r + q)/\ln(r) approximates to 1 and we have:

\ln ZM_r \approx \ln H - s \ln r

As a result, \ln ZM_r is an approximately linear function of \ln r, with the points scattered along a straight line with slope −s and an intercept of \ln H on the Y-axis. We can then move on to determine the estimates of \ln H and s. We attempt to minimise the sum of squared residuals (SSR) between the actual points \ln \frac{f_i}{F} and the predicted points \ln ZM_i:

SSR = \sum_{i=1}^{|W|} \left(\ln \frac{f_i}{F} - \ln ZM_i\right)^2    (5.36)

Given that |W| is the number of words, the least squares estimates of s and \ln H are defined as [1]:

s = \frac{|W| \sum_j (\ln \frac{f_j}{F})(\ln j) - \sum_j (\ln \frac{f_j}{F}) \sum_j (\ln j)}{|W| \sum_j (\ln j)^2 - \sum_j (\ln j) \sum_j (\ln j)}    (5.37)

and

\ln H = \frac{\sum_j (\ln ZM_j) - s \sum_j (\ln j)}{|W|}    (5.38)

Since the approximation of \ln \frac{f_r}{F} is given by \ln ZM_r = \ln H - s \ln(r + q), it holds that \ln \frac{f_1}{F} \approx \ln ZM_1. Following this, \ln \frac{f_1}{F} \approx \ln H - s \ln(1 + q). As a result, we can estimate q using s and \ln H as follows:

\ln \frac{f_1}{F} \approx \ln H - s \ln(1 + q)

\ln(1 + q) \approx \frac{1}{s}\left(\ln H - \ln \frac{f_1}{F}\right)

q \approx e^{\frac{1}{s}(\ln H - \ln \frac{f_1}{F})} - 1    (5.39)

To illustrate our process of automatically fitting the Zipf-Mandelbrot model, please refer to Figure 5.4. Figure 5.6 summarises the parameters of the automatically fitted Zipf-Mandelbrot models. The lines labelled "ZM-OLS" in Figures 5.4(a) and 5.4(b) show the automatically fitted Zipf-Mandelbrot model for the distribution of the same set of 3,058 words employed in Figure 5.2, using the ordinary least squares method. The line "RF" is the actual relative frequency that we are trying to fit. One will notice from the SSR column of Figure 5.5 that the automatic fit provided by the ordinary least squares method achieves relatively good results (i.e. low SSR) in dealing with the curves along the line in Figure 5.4(b).
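The closed-form estimates of Equations 5.37-5.39 can be computed directly from the observed ranks and relative frequencies. The following sketch mirrors that procedure; it assumes the words are already sorted by descending frequency so that rank r corresponds to index r − 1, and the sign handling follows the linearised form ln ZM_r ≈ ln H − s ln r (slope −s, intercept ln H) rather than reproducing the layout of Equation 5.37 verbatim. The names are illustrative.

import numpy as np

def fit_zipf_mandelbrot(frequencies):
    """Estimate s, ln(H) and q for the linearised Zipf-Mandelbrot model
    ln ZM_r ~= ln H - s ln r (Equations 5.35-5.39). `frequencies` must be
    sorted in descending order; rank r = index + 1."""
    freqs = np.asarray(frequencies, dtype=float)
    F = freqs.sum()
    y = np.log(freqs / F)                      # observed ln(f_r / F)
    x = np.log(np.arange(1, len(freqs) + 1))   # ln r
    n = len(freqs)
    slope = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
    intercept = (y.sum() - slope * x.sum()) / n
    s = -slope                                 # the fitted line has slope -s
    ln_H = intercept                           # intercept on the Y-axis
    q = np.exp((ln_H - y[0]) / s) - 1          # Equation 5.39, from the top-ranked word
    return s, ln_H, q

def ssr(frequencies, s, ln_H, q):
    """Sum of squared residuals between observed ln(f_r/F) and ln ZM_r (Equation 5.36)."""
    freqs = np.asarray(frequencies, dtype=float)
    F = freqs.sum()
    r = np.arange(1, len(freqs) + 1)
    predicted = ln_H - s * np.log(r + q)
    return float(((np.log(freqs / F) - predicted) ** 2).sum())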
A New Probabilistic Framework for Determining Termhood (a) Distribution of words extracted from the domain corpus dispersed according to the domain corpus. (b) Distribution of words extracted from the domain corpus dispersed according to the contrastive corpus. Figure 5.4: Distribution of the same 3, 058 words as employed in Figure 5.2. The line with the label “ZM-OLR” is the Zipf-Mandelbrot model fitted using ordinary least squares method. The line labeled “ZM-WLS” is the Zipf-Mandelbrot model fitted using weighted least squares method, while “RF” is the actual rate of occurrence computed as fi /F . 97 98 Chapter 5. Term Recognition Figure 5.5: Summary of the sum of squares of residuals, SSR and the coefficient of determination, R2 for the regression using manually estimated parameters, parameters estimated using ordinary least squares (OLS), and parameters estimated using weighted least squares (WLS). Obviously, the smaller the SSR is, the better the fit. As for 0 ≤ R2 ≤ 1, the upper bound is achieved when the fit is perfect. Figure 5.6: Parameters for the automatically fitted Zipf-Mandelbrot model for the set of 3, 058 words randomly drawn. We also attempted to fit the Zipf-Mandelbrot model using the second type of least squares method, namely, the weighted least squares. The idea is to assign to each point ln fi /F a weight that reflects the uncertainty of the observation. Instead of weighting all points equally, they are weighted such that points with a greater weight contribute more to the fit. Most of the time, the weight wi assigned to the i-th point is determined as a function of the variance of that observation, denoted as wi = σi−1 . In other words, we assign points with lower variances greater statistical weights. Instead of using variance, we propose the assignment of weights for the weighted least squares method based on the changes of the slopes at each segment of the distribution. The slope at the point (xi , yi ) is defined as the slope of the segment between the points (xi , yi ) and (xi−1 , yi−1 ), and it is given by: mi,i−1 = yi − yi−1 xi − xi−1 The weight to be assigned for each point (xi , yi ) is a function of the conditional cumulation of slopes up to that point. The cumulation of the slopes is conditional depending on the changes between slopes. The slope of the segment between point 99 5.4. A New Probabilistic Framework for Determining Termhood i and i − 1 is added to the cumulative slope if its rate of change from the previous segment i − 1 and i − 2 is between 1.1 and 0.9. In other words, the slopes between the two segments are approximately the same. If the change in slopes between the two segments is outside that range, the cumulative slope is reset to 0. Given that i = 1, 2, ..., |W |, computing the slope at point 1 uses a prior non-existence point, m1,0 = 0. Formally, we set the weight wi to be assigned to point i for the weighted least squares method as: 0 if(i = 1) mi,i−1 mi,i−1 + wi−1 if(i 6= 1 ∧ 1.1 ≤ ≤ 0.9) wi = (5.40) mi−1,i−2 0 otherwise Consequently, instead of minimising the sum of squares of the residuals (SSR) where all points are treated equally as in the ordinary least squares in Equation 5.36, we include the new weight wi defined in Equation 5.40 to give us: SSR = |W | X i=1 wi (ln fi − ln ZMi )2 F (5.41) Referring back to Figure 5.4, the lines with the label “ZM-WLS” demonstrate the fit of the Zipf-Mandelbrot model whose parameters are estimated using the weighted least squares method. 
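The slope-conditioned weights of Equation 5.40 can be accumulated in a single pass over the ranked points. A minimal sketch is given below; the 0.9-1.1 tolerance band is taken from the text, while treating a zero previous slope as a reset, and whether to accumulate signed slopes or their absolute values, are assumptions of this sketch rather than details fixed by the text.

def slope_weights(xs, ys, low=0.9, high=1.1):
    """Weights w_i of Equation 5.40 for weighted least squares fitting.
    The slope at point i is m_{i,i-1} = (y_i - y_{i-1}) / (x_i - x_{i-1});
    it is accumulated while consecutive slopes change by no more than the
    given band, and the cumulative slope is reset to zero otherwise."""
    weights = [0.0]                       # w_1 = 0 (no preceding point, m_{1,0} = 0)
    prev_slope = 0.0
    for i in range(1, len(xs)):
        slope = (ys[i] - ys[i - 1]) / (xs[i] - xs[i - 1])
        if prev_slope != 0.0 and low <= slope / prev_slope <= high:
            weights.append(slope + weights[-1])   # slopes approximately equal: accumulate
        else:
            weights.append(0.0)                   # rate of change outside the band: reset
        prev_slope = slope
    return weights

# The weighted SSR of Equation 5.41 is then sum_i w_i * (ln(f_i/F) - ln ZM_i)^2.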
The line “RF” is again the actual relative frequency we are trying to fit. Despite the curves, especially in the case of using the contrastive corpus ¯ the weighted least squares is able to provide a good fit. The constantly changing d, slopes especially in the middle of the distribution provide an increasing weight to each point, enabling such points to contribute more to the fitting. In the subsequent sections, we will utilise the Zipf-Mandelbrot model for modelling the distribution of term candidates in both the domain corpus (from which the candidates were extracted), and also the contrastive corpus. We employ P (r; q, s, H) of Equation 5.17 in Section 5.3.3 to compute the probability of occurrence of ranked words in both the domain corpus and the contrastive corpus. The parameters H, q and s are estimated as shown in Equation 5.38, 5.39 and 5.37, respectively. For standardisation purposes, we introduce the following notations: • ZMrd provides the probability of occurrence of a word with rank r in domain corpus d; and • ZMrd¯ provides the probability of occurrence of a word with rank r in the ¯ contrastive corpus d. 100 Chapter 5. Term Recognition In addition, we will also be using the K-mixture model as discussed in Section 5.3.3 for predicting the probability of occurrence of the term candidates in documents of the respective corpora. Recall that P (0; αi , βi ) in Equation 5.22 gives us the probability that word i does not occur in a document (i.e. probability of nonoccurrence) and 1 − P (0; αi , βi ) gives us the probability that word i has at least one occurrence in a document (i.e. probability of occurrence). The βi and αi are computed based on Equations 5.20 and 5.21. The distribution of words in either d or d¯ can be achieved by defining the parameters of the K-mixture model over the respective corpora. We will employ the following notations: • KMad is the probability of occurrence of word a in documents in the domain corpus d; and • KMad¯ is the probability of occurrence of word a in documents in contrastive ¯ corpus d; 5.4.2 Formalising Evidences in a Probabilistic Framework All existing techniques for term recognition are founded upon some heuristics or linguistic theories that define what makes a term candidate relevant. However, there are researchers [120, 107] who criticised such existing methods for the lack of proper theorisation despite the reasonable intuitions behind them. Definition 5.4.2.1 highlights a list of commonly adopted characteristics for determining the relevance of terms [120]. Definition 5.4.2.1. Characteristics of term relevance: 1 A term candidate is relevant to a domain if it appears relatively more frequent in that domain than in others. 2 A term candidate is relevant to a domain if it appears only in this one domain. 3 A term candidate relevant to a domain may have biased occurrences in that domain: 3.1 A term candidate of rare occurrence in a domain. Such candidates are also known as “hapax legomena” which manifest itself as the long tail in Zipf’s law. 3.2 A term candidate of common occurrence in a domain. 4 Following from Definition 5.4.3.4 and 5.4.3.5, a complex term candidate is relevant to a domain if its head is specific to that domain. 5.4. A New Probabilistic Framework for Determining Termhood 101 We propose a series of evidence as listed below to capture the individual characteristics presented in Definition 5.4.2.1 and 5.4.3. 
They are as follow: • Evidence 1: Occurrence of term candidate a • Evidence 2: Existence of term candidate a • Evidence 3: Specificity of the head ah of term candidate a • Evidence 4: Uniqueness of term candidate a • Evidence 5: Exclusivity of term candidate a • Evidence 6: Pervasiveness of term candidate a • Evidence 7: Clumping tendency of term candidate a The seven evidence is used to compute the corresponding evidential weights Oi which in turn are summed to produce the final ranking using OT as defined in Equation 5.34. Since OT served as a probabilistically-derived formulaic realisation of our Aim 2, we can consider the various Oi as manifestations of sub-aims of Aim 2. The formulation of the evidential weights begin with the associated definitions and subaims. Each sub-aim attempting to realise the associated definition has an equivalent mathematical formulation. The formula is then expanded into a series of probability functions connected through the addition and multiplication rule. There are four basic probability distributions that are required to compute the various evidential weights: • P(occurrence of a in d)=P (a, d): This distribution provides the probability of occurrence of a in the domain corpus d. By ranking the term candidates according to their frequency of occurrence in domain d, each term candidate will have a rank r. We employ ZMrd described in Section 5.4.1 for this purpose. For brevity, we use P (a, d) to denote the probability of occurrence of term candidate a in the domain corpus d. ¯ ¯ This distribution provides the probability of • P(occurrence of a in d)=P (a, d): ¯ By ranking the term candidates occurrence of a in the contrastive corpus d. ¯ each according to their frequency of occurrence in the contrastive corpus d, term candidate will have a rank r. We employ ZMrd¯ described in Section ¯ to denote the probability of 5.4.1 for this purpose. For brevity, we use P (a, d) ¯ occurrence of term candidate a in the contrastive corpus d. 102 Chapter 5. Term Recognition • P(occurrence of a in documents in d)=PK (a, d): This distribution provides the probability of occurrence of a in documents in domain corpus d where the subscript K refers to K-mixture. One should be immediately reminded of KMad described in Section 5.4.1. For brevity, we employ PK (a, d) to denote the probability of occurrence of term candidate a in documents in the domain corpus d. ¯ ¯ • P(occurrence of a in documents in d)=P K (a, d): This distribution provides ¯ the probability of occurrence of a in documents in the contrastive corpus d. We employ the distribution provided by KMad¯ described in Section 5.4.1. ¯ to denote the probability of occurrence of term For brevity, we use PK (a, d) ¯ candidate a in documents in the contrastive corpus d. ¯ described above are defined over Since the probability masses P (a, d) and P (a, d) ¯ we have the sample space of all words in the respective corpus (i.e. either d or d), that for any term candidate a ∈ W : • 0 ≤ P (a, d) ≤ 1 ¯ ≤1 • 0 ≤ P (a, d) and • • P ∀a∈W P ∀a∈W P (a, d) = 1 ¯ =1 P (a, d) ¯ are On the other hand, the other two distributions, namely, PK (a, d) and PK (a, d) defined over the sample space of all possible number of occurrences k = 0, 1, 2, ..., n of a particular term candidate a in a document using the K-mixture model. Hence, • 0 ≤ PK (a, d) ≤ 1 ¯ ≤1 • 0 ≤ PK (a, d) but • • P ∀a∈W P ∀a∈W PK (a, d) 6= 1 ¯ 6= 1 PK (a, d) 5.4. 
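As a minimal sketch of how the four base distributions above might be organised in practice, the container below bundles rank-indexed Zipf-Mandelbrot probabilities with per-word K-mixture parameters, reusing the k_mixture_pmf helper from the earlier sketch. The class name, constructor arguments and internal structure are illustrative assumptions, not a description of the actual implementation.

class BaseDistributions:
    """Bundles the four probabilities used by the evidential weights:
    P(a, d) and P(a, d_bar) from rank-indexed Zipf-Mandelbrot fits, and
    PK(a, d) and PK(a, d_bar) from per-word K-mixture parameters."""

    def __init__(self, zm_domain, zm_contrastive, rank_domain, rank_contrastive,
                 km_domain, km_contrastive):
        self.zm_d, self.zm_dbar = zm_domain, zm_contrastive          # arrays of ZM_r values
        self.rank_d, self.rank_dbar = rank_domain, rank_contrastive  # word -> rank r
        self.km_d, self.km_dbar = km_domain, km_contrastive          # word -> (alpha, beta)

    def p(self, a, contrastive=False):
        """P(a, d) or P(a, d_bar): probability of occurrence of a in the corpus."""
        zm, rank = (self.zm_dbar, self.rank_dbar) if contrastive else (self.zm_d, self.rank_d)
        return zm[rank[a] - 1]

    def pk(self, a, contrastive=False):
        """PK(a, d) or PK(a, d_bar): probability that a occurs at least once in a document."""
        alpha, beta = (self.km_dbar if contrastive else self.km_d)[a]
        return 1.0 - k_mixture_pmf(0, alpha, beta)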
A New Probabilistic Framework for Determining Termhood 103 In addition to the axioms above, there are three sets of related properties that require further clarification. The first set concerns the events of occurrence and non-occurrence of term candidates in the domain corpus d and in the contrastive ¯ corpus d: Property 1. Properties of the probability distributions of occurrence and nonoccurrence of term candidates in the domain corpus and the contrastive corpus: 1) The events of the occurrences of a in d and in d¯ are not mutually exclusive. In other words, the occurrence of a in d does not imply the non-occurrence of ¯ This is true since any term candidate can occur in d, d¯ or even both, a in d. either intentionally or by accident. ¯ = ◦ P (occurrence of a in d ∩ occurrence of a in d) 6 0 2) The occurrence of words in d does not affect (i.e. independent of) the probability of its occurrence in other domains d¯ and vice versa. ¯ = P (a, d)P (a, d) ¯ ◦ P (occurrence of a in d ∩ occurrence of a in d) 3) The events of occurrence and non-occurrence of the same candidate within the same domain are complementary. ◦ P (non-occurrence of a in d) = 1 − P (a, d) 4) Following from 1), the events of occurrence in d and non-occurrence in d¯ and vice versa of the same term are not mutually exclusive since candidate a can ¯ occur in both d and d. ¯ = ◦ P (occurrence of a in d ∩ non-occurrence of a in d) 6 0 5) Following from 2), the events of occurrence in d and non-occurrence in d¯ and vice versa of the same term are also independent. ¯ = P (a, d)(1 − P (a, d)) ¯ ◦ P (occurrence of a in d ∩ non-occurrence of a in d) The second set of properties is concerned with complex term candidates. Each complex candidate a is made up of a head ah and a set of modifiers Ma . Since ¯ they candidate a and its head ah have the possibility of both occurring in d or in d, are not mutually exclusive. As such, the probability of union of the two events of occurrence is not the sum of the individual probability of occurrence. Lastly, we will assume that the occurrences of candidate a and its head ah within the same domain 104 Chapter 5. Term Recognition ¯ are independent. While this may not be the case in reality, but as (i.e. either d or d) we shall see later, such property allows us to provide estimates for many non-trivial situations. As such, Property 2. The mutual exclusion and independence property of the occurrence of ¯ term candidate a and its head ah within the same corpus (i.e. either in d or in d): ◦ P (occurrence of a in d ∩ occurrence of ah in d) 6= 0 ◦ P (occurrence of a in d ∩ occurrence of ah in d) = P (a, d)P (ah , d) ◦ P (occurrence of a in d ∪ occurrence of ah in d) = P (a, d) + P (ah , d) − P (a, d)P (ah , d) The last set of properties is made in regard to the occurrence of candidates in documents in the corpus. Since the probability of occurrence of a candidate in documents is derived from Poisson mixture, then it follows that the probability of occurrence (where k ≥ 1) of a candidate in documents is the complement of the probability of non-occurrence of that candidate (where k = 0). Property 3. The complementation property of the occurrence and non-occurrence ¯ of term candidate a in documents within the same domain (i.e. either in d or in d): ◦ P (non-occurrence of a in documents in d) = 1 − PK (a, d) Next, we move on to define the odds that correspond to each of the evidence laid out earlier. • Odds of Occurrence: The first evidential weight O1 attempts to realise Defi¯ The nition 5.4.2.1.1. 
O1 captures the odds of whether a occurs in d or in d. notion of occurrence is the simplest among all the weights on which most other evidential weights are founded upon. Formally, O1 can be described as Sub-Aim 1. What are the odds of term candidate a occurring in d? and can be mathematically formulated as: P (occurrence P (occurrence P (occurrence = P (occurrence P (a, d) = ¯ P (a, d) O1 = of of of of a|R1 ) a|R2 ) a in d) ¯ a in d) (5.42) 5.4. A New Probabilistic Framework for Determining Termhood 105 • Odds of Existence: Similar to O1 , the second evidential weight O2 attempts to realise Definition 5.4.2.1.1 but keeping in mind Definition 5.4.3.3 and Definition 5.4.3.4. We can consider O2 as a realistic extension of O1 for reasons to be discussed below. Since the non-occurrence of term candidates in the corpus does not imply its conceptual absence or non-existence, we would like O2 to capture the following: Sub-Aim 2. What are the odds of term candidate a being in existence in d? The main issue related to the probability of occurrence is the fact that a big portion of candidates rest along the long tail of Zipf’s Law. Since most of the candidates are rare, their probabilities of occurrences alone do not reflect their actual existence or intended usage. What makes the situation worse is that longer words tend to have lower rate of occurrences [249] based on Definition 5.4.3.3. In the words of Zipf [284], “it seems reasonably clear that shorter words are distinctly more favoured in language than longer words.”. For example, consider the events that we observe more “bridge” occurring in the computer networking domain than its complex counterpart “network bridge”. The observed events do not imply that the concept represented by the complex term is different or of any less importance to the domain simply because “network bridge” occurs less than “bridge”. The fact that authors are more predisposed at using shorter terms whenever possible to represent the same concept demonstrate the Principle of Least Effort, which is the foundation behind most of Zipf’s Laws. This brings us to Definition 5.4.3.4 which requires us to assign higher importance to complex terms than their heads appearing as simple terms. We need to ensure that O2 captures these requirements. We can extend the lexical occurrence of complex candidates conceptually by including the lexical occurrence of their heads. Since the events of occurrences of a and its head ah are not mutually exclusive as discussed in Property 2, we will need to subtract the probability of the intersection of these two events from the sum of the two probabilities to obtain the probability of the union. Following this, and based on the assumptions about the probability of occurrences of complex candidates and their heads in Property 2, we can mathematically formulate O2 as P (existence P (existence P (existence = P (existence O2 = of of of of a|R1 ) a|R2 ) a in d) ¯ a in d) 106 Chapter 5. Term Recognition P (occurrence of a in d ∪ occurrence of ah in d) ¯ P (occurrence of a in d¯ ∪ occurrence of ah in d) P (a, d) + P (ah , d) − P (a, d)P (ah , d) = ¯ + P (ah , d) ¯ − P (a, d)P ¯ (ah , d) ¯ P (a, d) = In the case where candidate a is simple, the probability of occurrence of its head ah , and the probability of both a and ah occurring will be evaluated to zero. As a result, the second evidential weight O2 for simple terms will be equivalent to its first evidential weight O1 . 
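The first two evidential weights reduce to a handful of arithmetic operations over the base probabilities. The sketch below assumes `dist` is a BaseDistributions-style helper as sketched earlier and that `head(a)` is a hypothetical function returning the head of a complex candidate (None for a simple term); it is an illustration of the formulas above, not a definitive implementation.

def o1_occurrence(a, dist):
    """O_1 = P(a, d) / P(a, d_bar)  (Equation 5.42)."""
    return dist.p(a) / dist.p(a, contrastive=True)

def o2_existence(a, dist, head):
    """O_2: odds of existence, extending occurrence with the head a_h of a
    complex candidate via inclusion-exclusion; reduces to O_1 for simple terms."""
    a_h = head(a)
    if a_h is None:
        return o1_occurrence(a, dist)
    num = dist.p(a) + dist.p(a_h) - dist.p(a) * dist.p(a_h)
    den = (dist.p(a, contrastive=True) + dist.p(a_h, contrastive=True)
           - dist.p(a, contrastive=True) * dist.p(a_h, contrastive=True))
    return num / den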
Such formulation satisfies the additional Definition 5.4.3.5 that requires us to allocate higher weights to complex terms. • Odds of Specificity: The third evidential weight O3 specifically focuses on Definition 5.4.2.1.4 for complex term candidates. O3 is meant for capturing the odds of whether the inherently ambiguous head ah of a complex term a is specific to d. If the heads ah of complex terms are found to occur individually without a in large numbers across different domains, then the specificity of the concept represented by ah with regard to d may be questionable. O3 can be formally stated as: Sub-Aim 3. What are the odds that the head ah of a complex term candidate a is specific to d? The head of a complex candidate is considered as specific to a domain if the head and the candidate itself both have higher tendency of occurring together in that domain. The higher the intersection of the events of occurrences of a and ah in a certain domain, the more specific ah is to that domain. For example, if the event of both “bridge” and “network bridge” occurring together in the computer networking domain is very high, this means the possibly ambiguous head “bridge” is used in a very specific context in that domain. In such cases, when “bridge” is encountered in the domain of computer networking, one can safely deduce that it refers to the same domain-specific concept as “network bridge”. Consequently, the more specific the head ah is with respect to d, the less ambiguous its occurrence is in d. It follows from Definition 5.4.2.1.4 that the less ambiguous ah is, the chances of its complex counterpart a being relevant to d will be higher. Based on the assumptions about the probability of occurrence of complex candidates and their heads in Property 5.4. A New Probabilistic Framework for Determining Termhood 107 2, we define the third evidential weight for complex term candidates as: P (specificity of a|R1 ) P (specificity of a|R2 ) P (specificity of a to d) = ¯ P (specificity of a to d) P (occurrence of a in d ∩ occurrence of ah in d) = ¯ P (occurrence of a in d¯ ∩ occurrence of ah in d) P (a, d)P (ah , d) = ¯ (ah , d) ¯ P (a, d)P O3 = • Odds of Uniqueness: The fourth evidential weight O4 realises Definition 5.4.2.1.2 ¯ The notion of by capturing the odds of whether a is unique to d or to d. uniqueness defined here will be employed for the computation of the next two evidential weights O5 and O6 . Formally, O4 can be described as Sub-Aim 4. What are the odds of term candidate a being unique to d? A term candidate is considered as unique if it occurs only in one domain and not others. Based on the assumptions on the probability of occurrence and non-occurrence in Property 1, O4 can be mathematically formulated as: P (uniqueness of a|R1 ) P (uniqueness of a|R2 ) P (uniqueness of a to d) = ¯ P (uniqueness of a to d) ¯ P (occurrence of a in d ∩ non-occurrence of a in d) = P (occurrence of a in d¯ ∩ non-occurrence of a in d) ¯ P (a, d)(1 − P (a, d)) = ¯ − P (a, d)) P (a, d)(1 O4 = • Odds of Exclusivity: The fifth evidential weight O5 realises Definition 5.4.2.1.3.1 ¯ Forby capturing the probability of whether a is more exclusive in d or in d. mally, O5 can be described as Sub-Aim 5. What are the odds of term candidate a being exclusive in d? Something is regarded as exclusive if it exists only in a category (i.e. unique to that category) with certain restrictions such as limited usage. 
It is obvious at this point that a term candidate which is unique and rare in a domain is considered as exclusive in that domain. There are several ways of realising the rarity of terms in domains. For example, one can employ some measures 108 Chapter 5. Term Recognition of vocabulary richness or diversity [65] to quantify the extent of dispersion or concentration of term usage in a particular domain. However, the question on how such measures can be integrated into probabilistic frameworks such as the one proposed in this chapter remains a challenge. We propose the view that terms are considered as rare in a domain if they exist only in certain aspects of that domain. For example, in the domain of computer networking, we may encounter terms like “Fiber distributed data interface” and “WiMAX”. They may both be relevant to the domain but their distributional behaviour in the domain corpus is definitely different. While both may appear to represent certain similar concepts such as “high-speed transmission”, their existence are biased to different aspects of the same domain. The first term may be biased to certain aspects characterised by concepts such as “token ring” and “local area network”, while the second may appear biased to aspects like “wireless network” and “mobile application”. We propose to realise the notion of “domain aspects” through the documents that the domain contains. We consider the documents that made up the domain corpus as discussions of the various aspects of a domain. Consequently, a term candidate can be considered as rare if it has a low probability of occurrence in documents in the domain. Please note the difference in the probability of occurrence in a domain versus the probability of occurrence in documents in a domain. Following this, Property 3 and the probability of uniqueness discussed as part of O4 , we define the fifth evidential weight as: P (exclusivity of a|R1 ) P (exclusivity of a|R2 ) P (exclusivity of a in d) = ¯ P (exclusivity of a in d) P (uniqueness of a to d)P (rarity of a in d) = ¯ (rarity of a in d) ¯ P (uniqueness of a to d)P ¯ (rarity of a in d) P (a, d)(1 − P (a, d))P = ¯ − P (a, d))P (rarity of a in d) ¯ P (a, d)(1 O5 = If we subscribe to our definition of rarity proposed above where P (rarity of a in d)= 1 − PK (a, d), then, ¯ P (a, d)(1 − P (a, d))(1 − PK (a, d)) = ¯ ¯ P (a, d)(1 − P (a, d))(1 − PK (a, d)) The higher the probability that candidate a has no occurrence in documents in d, the rarer it becomes in d. Whenever the occurrence of a increases in documents in d, 1 − PK (a, d) reduces (i.e. getting less rare) and this leads to the 5.4. A New Probabilistic Framework for Determining Termhood 109 decrease in overall O5 or exclusivity. One may have noticed that the interpretation of the notion of rarity using the occurrences of candidates in documents may not be the most appropriate. Since the usage of terms in documents has an effect on the existence of terms in the domain, the independence assumption required to enable the product of the probability of uniqueness and of rarity to take place does not hold in reality. • Odds of Pervasiveness: The sixth evidential weight O6 attempts to capture Definition 5.4.2.1.3.2. Formally, O6 can be described as Sub-Aim 6. What are the odds of term candidate a being pervasive in d? Something is considered to be pervasive if it exists very commonly in only one category. This makes the notion of commonness the opposite of rarity. 
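Before moving on, the uniqueness and exclusivity weights formulated above can be sketched in the same style, again assuming a BaseDistributions-style helper and adopting 1 − PK(a, d) as the rarity of a in d, as proposed in the text; the function names are illustrative.

def o4_uniqueness(a, dist):
    """O_4 = P(a,d)(1 - P(a,d_bar)) / (P(a,d_bar)(1 - P(a,d)))."""
    p_d = dist.p(a)
    p_dbar = dist.p(a, contrastive=True)
    return (p_d * (1 - p_dbar)) / (p_dbar * (1 - p_d))

def o5_exclusivity(a, dist):
    """O_5 = O_4 scaled by the rarity ratio (1 - PK(a,d)) / (1 - PK(a,d_bar))."""
    rarity_d = 1 - dist.pk(a)
    rarity_dbar = 1 - dist.pk(a, contrastive=True)
    return o4_uniqueness(a, dist) * (rarity_d / rarity_dbar)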
A term candidate is said to be common if it occurs in most aspects of a domain. In other words, among all documents discussing about a domain, the term candidate has a high probability of occurring in most or nearly all of them. Following this and Property 3, we define the sixth evidential weight as: P (pervasiveness of a|R1 ) P (pervasiveness of a|R2 ) P (pervasiveness of a in d) = ¯ P (pervasiveness of a in d) P (uniqueness of a to d)P (commonness of a in d) = ¯ (commonness of a in d) ¯ P (uniqueness of a to d)P ¯ (commonness of a in d) P (a, d)(1 − P (a, d))P = ¯ − P (a, d))P (commonness of a in d) ¯ P (a, d)(1 O6 = If we follow the definition of rarity introduced as part of the fifth evidential weight O5 , then the notion of commonness is the complement of rarity which means P (commonness of a in d)=1-P (rarity of a in d), or = ¯ K (a, d) P (a, d)(1 − P (a, d))P ¯ − P (a, d))PK (a, d) ¯ P (a, d)(1 Similar to the interpretation of the notion of rarity, the use of the probability of non-occurrence in documents may not be the most suitable since the independence assumption that we need to make does not hold in reality. • Odds of Clumping Tendency: The last evidential weight involves the use of contextual information. There are two main issues involved in utilising contextual information. First is the question of what constitutes the context of 110 Chapter 5. Term Recognition a term candidate, and secondly, how to cultivate and employ contextual evidence. Regarding the first issue, it is obvious that the importance of terms and context in characterising domain concepts should be reflected through their heavy participations in the states or actions expressed by verbs. Following this, we put forward a few definitions related to what context is, and the relationship between terms and context. Definition 5.4.2.2. Words that are contributors to the same state or action as a term can be considered as context related to that term. Definition 5.4.2.3. Relationship between terms and their context 1 In relation to Definition 5.4.3.2, terms tend to congregate at different parts of text to describe or characterise certain aspects of the domain d. 2 Following the above, terms which clump together will eventually be each others’ context. 3 Following the above, context which are also terms (i.e. context terms) are more likely to be better context since actual terms tend to clump. 4 Context words which are also semantically related to their terms are more qualified at describing those terms. Regarding the second issue highlighted above, we can employ context to promote or demote the rank of terms based on the terms’ tendency to clump. Following Definition 5.4.2.3, terms with higher tendency to clump or occur together with their context should be promoted since such candidates are more likely to be the “actual” terms in a domain. Some readers may be able to recall from Definition 5.4.2 which states that the meaning of terms should be independent from their context. We would like to point out that the function of this last evidential weight is not to infer the meaning of terms from their context since such action conflicts with Definition 5.4.2. Instead, we employ context to investigate and reveal an important characteristic of terms as defined in Definition 5.4.3.2 and 5.4.2.3, namely, the tendency of terms to clump. We employ the linguistically-motivated technique by Wong et al. [275] to extract term candidates together with their context words in the form of instantiated sub-categorisation frames [271]. 
This seventh evidential weight O7 attempts to realise Definition 5.4.3.2 and 5.4.2.3. Formally, O7 can be described as 5.5. Evaluations and Discussions 111 Sub-Aim 7. What are the odds that term candidate a clumps with its context Ca in d? We can compute the clumping tendency of candidate a and its context words as the probability of candidate a occurring together with any of its context words. The higher the probability of candidate a and its context words occurring together in the same domain, the more likely it is that they clump. Since related context words are more qualified at describing the terms based on Definition 5.4.2.3.4, we have to include a semantic relatedness measure for that purpose. We employ Psim (a, c) to estimate the probability of relatedness between candidate a and its context word c ∈ Ca ∩ T C. Psim (a, c) is implemented using the semantic relatedness measure N GD by Cilibrasi & Vitanyi [50, 261] which has been discussed in Section 5.3.2. Let c ∈ Ca ∩ T C be the set of context terms, the last evidential weight can be mathematically formulated as: O7 = = = = = 5.5 P (clumping of a with its related context|R1 ) P (clumping of a with its related context|R2 ) P (clumping of a with its related context in d) ¯ P (clumping of a with its related context in d) P (occurrence of a with any related c ∈ Ca ∩ T C in d) ¯ P (occurrence of a with any related c ∈ Ca ∩ T C in d) P P (occurrence of a in d ∩ occurrence of c in d)Psim (a, c) P∀c∈Ca ∩T C ¯ sim (a, c) P (occurrence of a in d¯ ∩ occurrence of c in d)P P∀c∈Ca ∩T C P (a, d)P (c, d)Psim (a, c) P∀c∈Ca ∩T C ¯ ¯ ∀c∈Ca ∩T C P (a, d)P (c, d)Psim (a, c) Evaluations and Discussions In this evaluation, we studied the ability of our new probabilistic measure known as the Odds of Termhood (OT) in separating domain-relevant terms from general ones. We contrasted our new measure with three existing scoring and ranking schemes, namely, Contrastive Weight (CW), NCvalue (NCV) and Termhood (TH). The implementations of CW , N CV and T H are in accordance to Equation 5.1 and 5.2, 5.9, and 5.12 respectively. The evaluations of the four termhood measures were conducted in two parts: • Part 1: Qualitative evaluation through the analysis and discussion based on the frequency distribution and measures of dispersion, central tendency and correlation. 112 Chapter 5. Term Recognition • Part 2: Quantitative evaluation through the use of performance measures, namely, precision, recall, F-measure and accuracy, and the GENIA annotated text corpus as the gold standard G. In addition, the approach we chose to evaluate the termhood measures provides a way to automatically decide on a threshold for accepting and rejecting the ranked term candidates. For both parts of our evaluation, we employed a dataset containing a domain corpus describing the domain of molecular biology, and a contrastive corpus which spans across twelve different domains other than molecular biology. The datasets are described in Figure 5.1 in Section 5.2. Using the part-of-speech tags in the GENIA corpus, we extracted the maximal noun phrases as term candidates. Due to the large number of distinct words (over 400, 000 as mentioned in Section 5.2) in the GENIA corpus, we have extracted over 40, 000 lexically-distinct maximal noun phrases. The large number of distinct noun phrases is due to the absence of preprocessing to normalise the lexical variants of the same concept. For practical reasons, we randomly sampled the set of noun phrases for distinct term candidates. 
The resulting set T C contains 1, 954 term candidates. Following this, we performed the scoring and ranking procedure using the four measures included in this evaluation on the set of 1, 954 term candidates. 5.5.1 Qualitative Evaluation In the first part of the evaluation, we analysed the frequency distributions of the ranked term candidates generated by the four measures. Figures 5.7 and 5.8 show the frequency distributions of the candidates ranked in descending order according to the weights assigned by the four measures. The candidates are ranked in descending order according to their scores assigned by the respective measures. One can notice the interesting trends from the graphs by CW and N CV in Figures 5.8(b) and 5.8(a). The first half of the graph by CW , prior to the sudden surge of frequency, consists of only complex terms. Complex terms tend to have lower word counts compared to simple terms and hence, the disparity in the frequency distribution as shown in Figure 5.8(b). This is attributed to the biased treatment given to complex terms evident in Equation 5.2. However, priority is also given to complex terms by T H but as one can see from the distribution of candidates by T H, such undesirable trend does not occur. One of the explanation is the heavy reliance of frequency by CW while T H attempts to diversify the evidence in the computation of weights. While frequency may be a reliable source of evidence, the use of it alone is definitely inadequate [37]. As for N CV , Figure 5.8(a) reveals that scores are assigned to 5.5. Evaluations and Discussions 113 (a) Candidates ranked in descending order according to the scores assigned by OT . (b) Candidates ranked in descending order according to the scores assigned by T H. Figure 5.7: Distribution of the 1, 954 terms extracted from the domain corpus d sorted according to the corresponding scores provided by OT and T H. The single dark smooth line stretching from the left (highest value) to the right (lowest value) of the graph is the scores assigned by the respective measures. As for the two oscillating lines, the dark line is the domain frequencies while the light one is the contrastive frequencies. 114 Chapter 5. Term Recognition (a) Candidates ranked in descending order according to the scores assigned by N CV . (b) Candidates ranked in descending order according to the scores assigned by CW . Figure 5.8: Distribution of the 1, 954 terms extracted from the domain corpus d sorted according to the corresponding scores provided by N CV and CW . The single dark smooth line stretching from the left (highest value) to the right (lowest value) of the graph is the scores assigned by the respective measures. As for the two oscillating lines, the dark line is the domain frequencies while the light one is the contrastive frequencies. 5.5. Evaluations and Discussions 115 Figure 5.9: The means µ of the scores, standard deviations σ of the scores, sum of the domain frequencies and of the contrastive frequencies of all term candidates, and their ratio. Figure 5.10: The Spearman rank correlation coefficients ρ between all possible pairs of measure under evaluation. candidates by N CV based solely on the domain frequency. In other words, the measure N CV lacks the required contrastive analysis. As we have pointed out, terms can be ambiguous and we must not ignore the cross-domain distributional behaviour of terms. 
In addition, upon inspecting the actual list of ranked candidates, we noticed that higher scores were assigned to candidates which were accompanied by more context words. Another positive trait that T H exhibits is its ability to ¯ assign higher scores to terms which occur relatively more frequent in d and in d. This is evident through the gap between fd (dark oscillating line) and fd¯ (light oscillating line), especially at the beginning of the x-axis in Figure 5.7(b). One can notice that candidates along the end of the x-axis are those with fd¯ > fd . The same can be said about our new measure OT . However, the discriminating power of OT is apparently better since the gap between fd and fd¯ is larger and lasted longer. Figure 5.9 summarises the mean and standard deviation of the weights generated by the various measures. One can notice the extremely high dispersion from the mean of the scores generated by CW and N CV . We speculate that such trends are due to the erratic assignments of weights, heavily influenced by frequencies. In addition, we employed the Spearman rank correlation coefficient to study the possibility of any correlation between the four ranking schemes under evaluation. Figure 5.10 summarises the correlation coefficients between the various measures. Note that 116 Chapter 5. Term Recognition there is a relatively strong correlation between the ranks produced by our new probabilistic measure OT and the ranks by the ad-hoc measure T H. The correlation of T H with OT revealed the possibility of providing mathematical justifications for the former’s heuristically-motivated ad-hoc technique using a general probabilistic framework. 5.5.2 Quantitative Evaluation Figure 5.11: An example of a contingency table. The values in the cells T P , T N , F P and F N are employed to compute the precision, recall, Fα and accuracy. Note that |T C| is the total number of term candidates in the input set T C, and |T C| = TP + FP + FN + TN. In the second part of the evaluation, we employed the gold standard G generated from the GENIA corpus as discussed in Section 5.2 for evaluating our new term recognition technique using OT and three other existing ones (i.e. T H, N CV and CW ). We employed four measures [166] common to the field of information retrieval for performance comparison. These performance measures are precision, recall, Fα measure, and accuracy. These measures are computed by constructing a contingency table as shown in Figure 5.11: TP TP + FP TP recall = TP + FN (1 + α)(precision × recall) Fα = (α × precision) + recall TP + TN accuracy = TP + FP + FN + TN precision = where T P , T N , F P and F N are values from the four cells of the contingency table shown in Figure 5.11, and α is the weight for recall within the range (0, ∞). It suffices to know that as the α value increases, the weight of recall increases in the 5.5. Evaluations and Discussions 117 measure [204]. Two common α values are 0.5 and 2. F2 weighs recall twice as much as precision, and precision in F0.5 weighs two times more than recall. Recall and precision are evenly weighted in the traditional F1 measure. Before presenting the results for this second part of the evaluation, there are several points worth clarifying. Firstly, the gold standard G is a set of unordered collection of terms whose domain relevance has been established by experts. Secondly, as part of the evaluation, each term candidate a ∈ T C will be assigned a score by the respective termhood measures. 
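For reference, the four performance measures derived from the contingency table of Figure 5.11 can be stated compactly as below. This is a plain restatement of the standard formulas, with alpha as the recall weight; nothing beyond the definitions given above is assumed.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn, alpha=1.0):
    """F_alpha = (1 + alpha) * precision * recall / (alpha * precision + recall)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + alpha) * p * r / (alpha * p + r)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)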
These scores are used to rank the candidates in descending order, where larger scores correspond to higher ranks. As a result, there will be four new sets of ranked term candidates, each corresponding to a measure under evaluation. For example, TC_NCV is the output set after the scoring and ranking of the input term candidates TC by the measure NCV. The outputs from the measures TH, OT and CW are TC_TH, TC_OT and TC_CW, respectively. We remind the reader that |TC| = |TC_TH| = |TC_OT| = |TC_CW| = |TC_NCV|. The individual elements of the output sets appear as a_i, where a is the term candidate from TC and i is its rank.

Next, the challenge lies in how the resulting sets of ranked term candidates should be evaluated using the gold standard G. Generally, as in information retrieval, a binary classification is performed. In other words, we try to find a match in G for every a_i in TC_X, where X is any of the measures under evaluation. A positive match indicates that a_i is a term, while no match implies that a_i is a non-term. However, there is a problem with this approach. The elements (i.e. term candidates) in the four output sets are essentially the same. The difference between the four sets lies in the ranks of the term candidates, and not the candidates themselves. In other words, simply counting the matches between every TC_X and G will produce the same results (i.e. the same precision, recall, F-score and accuracy). Obviously, the ranks i assigned to the ranked term candidates a_i by the different termhood measures have a role to play. Following this, a cut-off point (i.e. threshold) for the ranked term candidates needs to be employed. To have an unbiased comparison of the four measures using the gold standard, we have to ensure that the cut-off rank for each termhood measure is optimal. Manually deciding on four different "magic numbers" for the four measures is a challenging and undesirable task.

To overcome the challenges involved in performing an unbiased comparative study of the four termhood measures as discussed above, we propose to examine their performance fairly, and possibly to decide on the optimal cut-off ranks, through rank binning. We briefly describe the rank binning process for each set TC_X:

• Decide on a standard size b for the bins;
• Create n = ⌈|TC_X|/b⌉ rank bins, where ⌈y⌉ is the ceiling of y;
• Assign each bin B_j^X a rank j, where 1 ≤ j ≤ ⌈|TC_X|/b⌉ and X is the identifier of the corresponding measure (i.e. NCV, CW, OT or TH). Bin B_1^X is considered a higher bin than B_2^X, and so on; and
• Distribute the ranked term candidates in set TC_X to their respective bins. Bin B_1^X holds the top b ranked term candidates from set TC_X. In general, bin B_j^X contains the top j × b ranked term candidates from TC_X, where 1 ≤ j ≤ ⌈|TC_X|/b⌉. Obviously, there is an exception for the last bin B_n^X, where n = ⌈|TC_X|/b⌉, if b is not a factor of |TC_X| (i.e. |TC_X| is not divisible by b). In such cases, the last bin B_n^X simply contains all the ranked term candidates in TC_X.

The results of binning the ranked term candidates produced by the four termhood measures using the input set of 1,954 term candidates are shown in Figure 5.12. We would like to point out that the choice of b has an effect on the performance indicators of each bin.
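Before turning to the choice of b, the following minimal Python sketch illustrates the binning and the per-bin contingency evaluation just described. It is not the thesis code: ranked_candidates, gold and the contingency_metrics helper sketched earlier are assumptions of ours, and the cut-off rule at the end is only one of the options discussed later in this section.

# A minimal sketch of rank binning: bin B_j holds the top j*b ranked candidates,
# and each bin is evaluated against the gold standard G via a contingency table.
import math

def rank_binning(ranked_candidates, gold, b=200, alpha=0.1):
    n_bins = math.ceil(len(ranked_candidates) / b)
    results = []
    for j in range(1, n_bins + 1):
        in_bin  = set(ranked_candidates[: j * b])   # the last bin holds all candidates
        out_bin = set(ranked_candidates[j * b:])
        tp = len(in_bin & gold)                     # predicted terms that are in G
        fp = len(in_bin - gold)                     # predicted terms not in G
        fn = len(out_bin & gold)                    # actual terms left outside the bin
        tn = len(out_bin - gold)
        results.append((j, contingency_metrics(tp, fp, fn, tn, alpha)))
    return results

# Hypothetical usage: pick the bin maximising a chosen F-score as the cut-off.
# bins = rank_binning(tc_ot, gold_standard, b=200, alpha=0.1)
# cutoff_bin, scores = max(bins, key=lambda item: item[1]["F0.1"])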
Setting the bin size too large may produce deceiving performance indicators that do not reflect the actual quality of the ranked term candidates. This occurs when an increasing number of ranked term candidates, which can be either actual terms or non-terms according to the gold standard, are mixed into the same bin. On the other hand, selecting bin sizes which are too small may defeat the purpose of collectively evaluating the ranked term candidates. Moreover, a large number of bins makes the interpretation of the results difficult. A rule of thumb is to select an appropriate bin size based on the size of TC and on a sensible number of bins for ensuring interpretable results. For example, setting b = 100 for a set of 500 term candidates is suitable since there are only 5 bins. However, using the same bin size on a set of 5,000 term candidates is inappropriate. In our case, to ensure the interpretability of the tables included in this chapter, we set the bin size to b = 200 for our 1,954 term candidates.

The results are organised into contingency tables as introduced in Figure 5.11. Each individual contingency table in Figure 5.12 contains four cells and is structured in the same way as Figure 5.11. Each individual table summarises the result obtained from the binary classification performed using the corresponding bin of term candidates against the gold standard G. Each measure has several contingency tables, where each table contains the values for determining the quality (i.e. relevance to the domain) of the term candidates that fall in the corresponding bin, as prescribed by the gold standard.

Figure 5.12: The collection of all contingency tables for all termhood measures X across all 10 bins B_j^X. The first column contains the rank of the bins and the second column shows the number of term candidates in each bin. The third general column, "termhood measures, X", holds the 10 contingency tables for each measure X, organised column-wise, bringing the total number of contingency tables to 40 (i.e. 10 bins, organised in rows, by 4 measures). The structure of the individual contingency tables follows the one shown in Figure 5.11. The last column contains the row-wise sums of TP + FP and FN + TN. The rows beginning from the second row until the second last are the rank bins. The last row contains the column-wise sums of TP + FN and FP + TN.

Using the values in the contingency tables in Figure 5.12, we computed the precision, recall, F-scores and accuracy of the four measures at the different bins. The performance results are summarised in Figure 5.13.

Figure 5.13: Performance indicators for the four termhood measures in the 10 respective bins. Each row shows the performance achieved by the four measures in a particular bin. The columns contain the performance indicators for the four measures. The notation pre stands for precision, rec for recall and acc for accuracy. We use two different α values, resulting in two F-scores, namely F0.1 and F1. The values of the performance measures with darker shades are the best-performing ones.

The accuracy indicators, acc, show the extent to which a termhood measure correctly predicted both terms (i.e. TP) and non-terms (i.e. TN). On the other hand, precision, pre, measures the extent to which the candidates predicted as terms (i.e. TP + FP) are actual terms (i.e. TP). The recall indicators, rec, capture the ability of the measures in correctly (i.e.
TP) identifying all the terms that exist in the set TC (i.e. TP + FN). As shown in the last bin (i.e. j = 10) of Figure 5.13, it is trivial to achieve a recall of 100% by simply binning all term candidates in TC into one large bin during evaluation. Recall alone is therefore not a good performance indicator; one also needs to take into consideration the number of non-terms mistakenly predicted as terms (i.e. FP), which is neglected in the computation of recall. Hence, we need to find a balance between recall and precision, which is aptly captured by the F1 score. To demonstrate an F-score which places more emphasis on precision, we also combine recall and precision to obtain the F0.1 score shown in Figure 5.13.

Before we begin discussing the quantitative results, we provide an example of how to interpret the performance indicators in Figure 5.13 to ensure absolute clarity. If we accept bin j = 8 produced by the termhood measure OT as the solution to term recognition, then 84.56% of the solution is precise, and 79.73% of the solution is accurate. In other words, 84.56% of the predicted terms in bin j = 8 are actual terms, and 79.73% of the term and non-term predictions are true, all according to the gold standard. The same approach is used to interpret the performance of all the other measures in all bins.

From Figure 5.13, we notice that the measures TH and OT are consistently better in terms of all the performance indicators compared to NCV and CW. Using any one of the 10 bins and any performance indicator for comparison, TH and OT offer the best performance. The close resemblance and consistency of the performance indicators of TH and OT support and objectively confirm the correlation between the two termhood measures suggested by the Spearman rank correlation coefficient in Section 5.5.1. OT and TH achieved the best precision in the first bin, at 98% and 98.5% respectively, obviously by sacrificing recall. The worst-performing termhood measure in terms of precision is NCV, with a maximum of only 76.87%. Since this precision of NCV lies in the last bin, its recall there also reaches the maximum of 100%. In fact, Figure 5.13 clearly shows that the maximum values of all performance indicators for NCV rest in the last bin, and that the precision values of NCV are erratically distributed across the bins. Ideally, a good termhood measure should attain its highest precision in the first bin, with the subsequent bins achieving decreasing precision. This is important to show that actual terms are assigned higher ranks by the termhood measure. This reaffirms our suggestion that the contrastive analysis present in OT, TH and CW is necessary for term recognition. The frequency distribution of terms ranked by NCV, shown in Figure 5.8(a) of Section 5.5.1, clearly illustrates the improperly ranked term candidates when only the frequencies from the domain corpus are considered. Generally, as we can observe from the precision and recall columns, "well-behaved" termhood measures usually have higher precision with lower recall in the first few bins. This is due to the more restrictive membership of the higher bins, where only highly ranked term candidates by the respective termhood measures are included.
The highest recall is achieved when there are no more false negatives, that is, when all term candidates in the set TC are included for scoring and ranking and all of them are simply predicted as terms. In our case, the highest recall is obviously 100%, as mentioned earlier, while the lowest precision is 76.87%. This relation between precision and recall is aptly captured by the F0.1 and F1 scores. The F0.1 scores begin at much higher values in the higher bins compared to the F1 scores. This is due to our emphasis on precision instead of recall, as justified earlier; since the higher bins have higher precision, F0.1 inevitably attains higher values. F0.1 becomes lower than F1 when precision falls below recall.

The cells with darker shades under each performance measure in Figure 5.13 indicate the maximum values for that measure. In other words, the termhood measure OT has its best accuracy of 81.47% at bin j = 9, and its maximum F0.1 score of 86.17% at bin j = 5. We can see that the highest F1 score and the best accuracy of OT are higher than those of TH. Assuming consistency of these results with other corpora, OT can be regarded as a measure which attempts to find a balance between precision and recall. If we weigh precision more, TH triumphs over OT based on their maximum F0.1 scores. Consequently, we can employ these maximum values of the different performance measures as highly flexible cut-off points for deciding which top n ranked term candidates are selected and considered as actual terms. These maximum values optimise the precision and the recall to ensure that the maximum number of actual terms is selected while minimising the inclusion of non-terms. This evaluation approach provides a solution to the problem discussed at the start of this chapter, which is also mentioned by Cabre-Castellvi et al. [37]: "all systems propose large lists of candidate terms, which at the end of the process have to be manually accepted or rejected." In addition, our proposed use of rank binning and of the maximum values of the various performance measures has allowed us to perform an unbiased comparison of all four termhood measures. In short, we have shown that:

• The new OT termhood measure can provide mathematical justifications for the heuristically-derived measure TH;
• The new OT termhood measure aims for a balance between precision and recall, and offers the most accurate solution to the requirements of term recognition compared to the other measures TH, NCV and CW; and
• The new OT termhood measure performs on par with the heuristically-derived measure TH, and the two are consistently the best-performing term recognition measures in terms of precision, recall, F-scores and accuracy compared to NCV and CW.

Some critics may simply disregard the results reported here as unimpressive and be inclined to compare them with results from other related but distinct disciplines such as named-entity recognition or document retrieval. However, one has to keep in mind several fundamental differences in regard to evaluations in term recognition. Firstly, unlike other established fields, term recognition is largely an unconsolidated research area which still lacks a common comparative platform [120]. As a result, individual techniques or systems are developed and tested with small datasets in highly specialised domains. According to Cabre-Castellvi et al. [37], "This lack of data makes it difficult to evaluate and compare them."
Secondly, we cannot emphasise enough the fact that term recognition is a subjective task in comparison to fields such as named-entity recognition and document retrieval. In its most primitive form, named-entity recognition, which is essentially a classification problem, can be performed deterministically through a finite set of rigid designators, resulting in near-human performance in common evaluation forums such as the Message Understanding Conference (MUC) [42]. While more subjective than named-entity recognition, the task of determining document relevance in document retrieval is guided by explicit user queries, with common evaluation platforms such as the Text Retrieval Conference (TREC) [264]. On the other hand, term recognition is based upon the elusive characteristics of terms. Moreover, the set of characteristics employed differs across a diverse range of term recognition techniques, and within each individual technique the characteristics may be subject to different implicit interpretations. The challenges of evaluating term recognition techniques become more obvious when one considers the survey by Cabre-Castellvi et al. [37], in which more than half of the systems reviewed remain unevaluated.

5.6 Conclusions

Term recognition is an important task for many natural language systems. Many techniques have been developed in an attempt to numerically determine or quantify termhood based on heuristically-motivated term characteristics. We have discussed several shortcomings shared by many existing techniques, such as the ad-hoc combination of termhood evidence, the mathematically unfounded derivation of scores, and implicit and possibly flawed assumptions concerning term characteristics. All these shortcomings lead to issues such as the non-decomposability and non-traceability of how the weights and scores are obtained. These issues raise the question of which term characteristics the different weights and scores are trying to embody, if any, and whether these individual weights or scores actually measure what they are supposed to capture. Termhood measures which cannot be traced or attributed to any term characteristics are fundamentally flawed.

In this chapter, we stated clearly the four main challenges in creating a formal and practical technique for measuring termhood. These challenges are (1) the formalisation of a general framework for consolidating evidence representing different term characteristics, (2) the formalisation of the various evidence representing the different term characteristics, (3) the explicit definition of term characteristics and their attribution to linguistic theories (if any) or other justifications, and (4) the automatic determination of optimal thresholds for selecting terms from the final lists of ranked term candidates. We addressed the first three challenges through a new probabilistically-derived measure called the Odds of Termhood (OT) for scoring and ranking term candidates for term recognition. The design of the measure begins with the derivation of a general probabilistic framework for integrating termhood evidence. Next, we introduced seven types of evidence, founded on formal models of word distribution, to facilitate the calculation of OT. The evidence captures the various characteristics of terms, which are either heuristically motivated or based on linguistic theories.
The fact that evidence can be added or removed makes OT a highly flexible framework that is adaptable to different applications' requirements and constraints. In fact, in the evaluation, we have shown a close correlation between our new measure OT and the ad-hoc measure TH. We believe that, by adjusting the inclusion or exclusion of the various evidence, other ad-hoc measures can be captured as well. Our two-part evaluation comparing OT with three other existing ad-hoc measures, namely CW, NCV and TH, has demonstrated the effectiveness of the new measure and the new framework. A qualitative evaluation studying the frequency distributions revealed the advantages of our new measure OT. A quantitative evaluation using the GENIA corpus as the gold standard and four performance measures further supported our claim that our new measure OT offers the best performance compared to the three existing ad-hoc measures. Our evaluation revealed that (1) the current evidence employed in OT can be seen as probabilistic realisations of the heuristically-derived measure TH, (2) OT offers a solution to the need for term recognition which is both accurate and balanced in terms of recall and precision, and (3) OT performs on par with the heuristically-derived measure TH, and the two are the best-performing term recognition measures in terms of precision, recall, F-scores and accuracy compared to NCV and CW. In addition, our approach of rank binning and the use of performance measures for deciding on optimal cut-off ranks addresses the fourth challenge.

5.7 Acknowledgement

This research was supported by the Australian Endeavour International Postgraduate Research Scholarship, the University Postgraduate Award (International Students) of the University of Western Australia, the 2008 UWA Research Grant, and the Curtin Chemical Engineering Inter-University Collaboration Fund. The authors would like to thank the anonymous reviewers for their invaluable comments.

5.8 Other Publications on this Topic

Wong, W., Liu, W. & Bennamoun, M. (2007) Determining Termhood for Learning Domain Ontologies using Domain Prevalence and Tendency. In the Proceedings of the 6th Australasian Conference on Data Mining (AusDM), Gold Coast, Australia. This paper describes a heuristic measure called TH for determining termhood based on explicitly defined term characteristics and the distributional behaviour of terms across different corpora. The ideas behind TH were later reformulated to give rise to the probabilistic measure OT. The description of OT forms the core contents of this Chapter 5.

Wong, W., Liu, W. & Bennamoun, M. (2007) Determining Termhood for Learning Domain Ontologies in a Probabilistic Framework. In the Proceedings of the 6th Australasian Conference on Data Mining (AusDM), Gold Coast, Australia. This paper describes the preliminary attempts at developing a probabilistic framework for consolidating termhood evidence based on explicitly defined term characteristics and formal word distribution models. This work was later extended to form the core contents of this Chapter 5.

Wong, W., Liu, W. & Bennamoun, M. (2008) Determination of Unithood and Termhood for Term Recognition. In M. Song and Y. Wu (eds.), Handbook of Research on Text and Web Mining Technologies, IGI Global. This book chapter combines the ideas on the UH measure and the TH measure from Chapters 4 and 5, respectively.
CHAPTER 6
Corpus Construction for Term Recognition

Abstract

The role of the Web in text corpus construction is becoming increasingly significant. However, its contribution is largely confined to the role of a general virtual corpus, or of poorly derived specialised corpora. In this chapter, we introduce a new technique for constructing specialised corpora from the Web based on the systematic analysis of website contents. Our evaluations show that the corpora constructed using our technique are independent of the search engines used, and that they outperform all corpora based on existing techniques for the task of term recognition. (This chapter was accepted with revision by Language Resources and Evaluation, 2009, under the title "Constructing Specialised Corpora through Domain Representativeness Analysis of Websites".)

6.1 Introduction

Broadly, a text corpus is any collection containing more than one text of a certain language. A general corpus is balanced with regard to the various types of information covered by the language of choice [173]. In contrast, the content of a specialised corpus, also known as a domain corpus, is biased towards a certain sub-language. For example, the British National Corpus (BNC) is a general corpus designed to represent modern British English. On the other hand, the specialised corpus GENIA contains solely texts from the molecular biology domain. Several connotations associated with text corpora, such as size, representativeness, balance and sampling, are the main topics of ongoing debate within the field of corpus linguistics. In reality, great manual effort is required to construct and maintain text corpora that satisfy these connotations. Although these curated corpora do play a significant role, several related inadequacies, such as the inability to incorporate frequent changes, the rarity of traditional corpora for certain domains, and limited corpus size, have hampered the development of corpus-driven applications in knowledge discovery and information extraction.

The increasingly accessible, diverse and inexpensive information on the World Wide Web (the Web) has attracted the attention of researchers in search of alternatives to the manual construction of corpora. Despite issues such as poor reproducibility of results, noise, duplicates and sampling, many researchers [40, 129, 16, 228, 74] agree that the vastness and diversity of the Web remain the most promising solution to the increasing need for very large corpora. Current work on using the Web for linguistic purposes can be broadly grouped into (1) the Web itself as a corpus, also known as a virtual corpus [97], and (2) the Web as a source of data for constructing locally-accessible corpora, known as Web-derived corpora. The contents of a virtual corpus are distributed over heterogeneous servers, and accessed using URLs and search engines. It is not difficult to see that these two types of corpora are not mutually exclusive, and that a Web-derived corpus can easily be constructed, albeit with some downloading time, using the URLs from the corresponding virtual corpus. The choice between the two types of corpora then becomes a question of trade-off between effort and control. On the one hand, applications which require stable counts and complete access to the texts for processing and analysis can opt for Web-derived corpora.
On the other hand, in applications where speed and corpus size supersede any other concerns, a virtual corpus alone suffices. The current state of the art mainly focuses on the construction of Web-derived corpora, ranging from the simple query-and-download approach using search engines [15] to the more ambitious custom Web crawlers for very large collections [155, 205]. BootCat [15] is a widely-used toolkit for constructing specialised Web-derived corpora. It employs a naive technique of downloading the webpages returned by search engines without further analysis. [228] extended the use of BootCat to construct a large general Web-derived corpus using 500 seed terms. This technique requires a large number of seed terms (in the order of hundreds) to produce very large Web-derived corpora, and the composition of the corpora varies depending on the search engines used. Instead of relying on search engines and seed terms, [155] constructed a very large general Web-derived corpus by crawling the Web using seed URLs. In this approach, the lack of control and the absence of further analysis cause topic drift as the crawler traverses further away from the seeds.

A closer look at the advances in this area reveals the lack of systematic analysis of website contents during corpus construction. Current techniques simply allow the search engines to dictate which webpages are suitable for the domain based solely on matching seed terms. Others allow their Web crawlers to run astray without systematic controls. We propose a technique, called Specialised Corpora Construction based on Web Texts Analysis (SPARTAN), to automatically analyse the contents of websites for discovering domain-specific texts with which to construct very large specialised corpora. (The foundation work on corpus construction using Web data appeared in the Proceedings of the 21st Australasian Joint Conference on Artificial Intelligence (AI), Auckland, New Zealand, 2008, under the title "Constructing Web Corpora through Topical Web Partitioning for Term Recognition".) The first part of our technique analyses the domain representativeness of websites for discovering specialised virtual corpora. The second part of the technique selectively localises the distributed contents of the websites in the virtual corpora to create specialised Web-derived corpora. This technique can also be employed to construct BNC-style balanced corpora through stratified random sampling from a balanced mixture of domain-categorised Web texts. In our experiments, we show that, unlike BootCat-derived corpora, which vary greatly across different search engines, our technique is independent of the search engine employed. Instead of blindly using the results returned by search engines, our systematic analysis allows the most suitable websites and their contents to surface and to contribute to the specialised corpora. This systematic analysis significantly improves the quality of our specialised corpora as compared to BootCat-based corpora and the naive Seed-Restricted Querying (SREQ) of the Web. This is verified using the term recognition task.
In short, the theses of this chapter are as follows: 1) Web-derived corpora are simply localised versions of the corresponding virtual corpora; 2) the often-mentioned problems of using search engines for corpus construction are in fact a revelation of the inadequacies of current techniques; 3) the use of websites, instead of webpages, as the basic units of analysis during corpus construction is more suitable for constructing very large corpora; and 4) the results provided by search engines cannot be directly accepted for constructing specialised corpora; the systematic analysis of website contents is fundamental to constructing high-quality corpora.

The main contributions of this chapter are (1) a technique for constructing very large, high-quality corpora using only a small number of seed terms, (2) the use of systematic content analysis for re-ranking websites based on their domain representativeness, which allows the corpora to be search engine independent, and (3) processes for extending user-provided seed terms and for localising domain-relevant contents. This chapter is structured as follows. In Section 6.2, we summarise current work on corpus construction. In Section 6.3, we outline our specialised corpora construction technique. In Section 6.4, we evaluate the specialised corpora constructed using our technique in the context of term recognition. We end this chapter with an outlook on future work in Section 6.5.

6.2 Related Research

The process of constructing corpora using data from the Web generally comprises webpage sourcing and relevant text identification, which are discussed in Sections 6.2.1 and 6.2.2, respectively. In Section 6.2.3, we outline several studies demonstrating the significance of search engine counts in natural language applications despite their inconsistencies.

6.2.1 Webpage Sourcing

Currently, there are two main approaches to sourcing webpages for constructing Web-derived corpora, namely, using seed terms as query strings for search engines [15, 74], and using seed URLs to guide custom crawlers [155, 207]. The first approach is popular among current corpus construction practices due to the toolkit known as BootCat [17]. BootCat requires several seed terms as input, and formulates queries as conjunctions of randomly selected seeds for submission to the Google search engine. The method then gathers the webpages listed in Google's search results to create a specialised corpus. There are several shortcomings related to the construction of large corpora using this technique:

• First, different search engines employ different algorithms and criteria for determining webpage relevance with respect to a certain query string. Since this technique simply downloads the top webpages returned by a search engine, the composition of the resulting corpora will vary greatly across different search engines for reasons beyond our knowledge and control. It is worth noting that webpages highly ranked by the different search engines may not have the necessary coverage of the domain terminology for constructing high-quality corpora. For example, the ranking by the Google search engine is primarily a popularity contest [116]. In the words of [228], "...results are ordered...using page-rank considerations".

• Second, the aim of creating very large Web-derived corpora using this technique may be far from realistic. Most major search engines have restrictions on the number of URLs served for each search query.
For instance, the AJAX Search API provided by Google returns a very low 32 search results for each query. (Google's Web search interface serves up to 1,000 results; however, automated crawling and scraping of that page for URLs results in the blocking of the offending IP addresses. The SOAP API by Google, which allows up to 1,000 queries per day, will be permanently phased out by August 2009.) The developers of BootCat [15] suggested that 5 to 15 seed terms are typically sufficient in many cases. Assuming each URL provides us with a valid readable page, 20 seed terms and their resulting 1,140 three-word combinations would produce a specialised corpus of only 1,140 × 32 = 36,480 webpages. Since the combinations are supposed to represent the same domain, duplicates will most likely occur when all search results are aggregated [228]. A 10% duplicate and download error rate for every search query reduces the corpus size to 32,832 webpages. For example, in order to produce a small corpus of only 40,000 webpages using BootCat, [228] had to prepare a startling 500 seed terms.

• Third, to overcome issues related to inadequate seed terms for creating very large corpora, BootCat uses terms extracted from the initial corpus to incrementally extend the corpus. [15] suggested using a reference corpus to automatically identify domain-relevant terms. However, this approach does not work well since the simple frequency-based techniques used by BootCat are known for their low to mediocre performance in identifying domain terms [279]. Without the use of control mechanisms and more precise techniques to recognise terms, this iterative feedback approach will cause topic drift in the final specialised corpora. Moreover, the idea of creating corpora by relying on other existing corpora is not very appealing.

In a similar approach, [74] used the most frequent words in the BNC, together with Microsoft's Live Search instead of the BootCat-preferred Google, to construct a very large BNC-like corpus from the Web. Fletcher provided the reasons behind his choice of Live Search, which include a generous query allowance, higher-quality search results, and greater responsiveness to changes on the Web.

The approach of gathering webpages using custom crawlers based on seed URLs is gaining wider acceptance as criticism of the use of search engines intensifies. Issues with the use of search engines, such as unknown algorithms for sorting search results [128] and restrictions on the amount of data that can be obtained [18], have become targets of critics in recent years. Some of the current work based on custom crawlers includes a general corpus of 10 billion words downloaded from the Web based on seed URLs from dmoz.org by [155]. Similarly, Renouf et al. [205] developed a Web crawler for finding a large subset of random texts from the Web using seed URLs from human experts and dmoz.org as part of the WebCorp project (www.webcorp.org.uk). Ravichandran et al. [203] demonstrated the use of a randomised algorithm to generate noun similarity lists from very large corpora. The authors used URLs from dmoz.org as seed links to guide their crawlers in downloading 70 million webpages. After boilerplate and duplicate removal, their corpus was reduced to approximately 31 million documents. Rather than sampling URLs from online directories, Baroni & Ueyama [18] used search engines to obtain webpage URLs for seeding their custom crawlers.
The authors used combinations of frequent Italian words to query Google, retrieving a maximum of 10 pages per query. The resulting 5,231 URLs were used to seed breadth-first crawling to obtain a final 4-million-document Italian corpus. The approach of custom crawling is not without its shortcomings. It is typically based on the assumption that webpages of one domain tend to link to others in the same domain. It is obvious that reliance on this assumption alone, without explicit control, will result in topic drift. Moreover, most authors do not provide explicit statements addressing important issues such as the selection policy (e.g. when to stop the crawl, where to crawl next) and the politeness policy (e.g. respecting the robot exclusion standard, how to handle webmasters disgruntled by the extra bandwidth). This trend of using custom crawlers calls for careful planning and justification. Issues such as cost-benefit analysis, hardware and software requirements, and sustainability in the long run have to be considered. Moreover, poorly-implemented crawlers are a nuisance on the Web, consuming bandwidth and clogging networks at the expense of others [248]. In fact, the worry about unknown ranking and data restrictions by search engines [155, 128, 228] exposes the inadequacies of the existing techniques for constructing Web-derived corpora (e.g. BootCat). These so-called 'shortcomings' of search engines are merely mismatches in expectations. Linguists expect white-box algorithms and unrestricted data access, something we know we will never get. Obviously, these two issues do place certain obstacles in our quest for very large corpora, but should we totally avoid search engines given their integral role on the Web? If so, would we risk missing the forest just for these few trees? The quick alternative, which is infesting the Web with more crawlers, poses even greater challenges. Rather than reinventing the wheel, we should think of how existing corpus construction techniques can be improved using the already available, large search engine repositories.

6.2.2 Relevant Text Identification

The process of identifying relevant texts, which usually comprises webpage filtering and content extraction, is an important step after the sourcing of webpages. A filtering phase is fundamental to identifying relevant texts since not all webpages returned by search engines or custom Web crawlers are suitable for specialised corpora. This phase, however, is often absent from most existing techniques such as BootCat. The commonly used techniques include some kind of richness or density measure with thresholds. For instance, [125] constructed domain corpora by collecting the top 100 webpages returned by search engines for each seed term. As a way of refining the corpora, webpages containing only a small number of user-provided seed terms are excluded. [4] proposed a knowledge-richness estimator that takes into account semantic relations to support the construction of Web-derived corpora. Webpages containing both the seed terms and the desired relations are considered better candidates for inclusion in the corpus. The candidate documents are ranked and manually filtered based on several term and relation richness measures. In addition to webpage filtering, content extraction (i.e. boilerplate removal) is necessary to remove HTML tags and boilerplates (e.g. texts used in navigation bars, headers and disclaimers).
HTMLCleaner by [88] is a boilerplate remover based on the heuristics that the content-rich sections of webpages have longer sentences, a lower number of links, and more function words than the boilerplates. [67] developed a boilerplate stripper called NCLEANER based on two character-level n-gram models. A text segment is considered a boilerplate and discarded if the 'dirty' model (based on texts to be cleaned) assigns it a higher probability than the 'clean' model (based on training data).

6.2.3 Variability of Search Engine Counts

Unstable page counts have always been one of the main complaints of critics who are against the use of search engines for language processing. Much work has been conducted to discredit the use of search engines by demonstrating the arbitrariness of page counts. The fact remains that page counts are merely estimations [148]. We are not here to argue otherwise. However, for natural language applications that deal mainly with relative frequencies, ratios and rankings, these variations have been shown to be insignificant. [181] conducted a study on using page counts to estimate n-gram frequencies for noun compound bracketing. They showed that the variability of page counts over time and across search engines does not significantly affect the results of their task. [140] examined the use of page counts for several NLP tasks such as spelling correction, compound bracketing, adjective ordering and prepositional phrase attachment. The authors concluded that, for the majority of the tasks conducted, simple and unsupervised techniques perform better when n-gram frequencies are obtained from the Web. This is in line with the study by [252], which showed that a simple algorithm relying on page counts outperforms a complex method trained on a smaller corpus for synonym detection. [124] used search engines to estimate frequencies for predicate-argument bigrams. They demonstrated the high correlations between search engine page counts and frequencies obtained from balanced, carefully edited corpora such as the BNC. Similarly, experiments by [26] showed that search engine page counts were reliable over a six-month period, and highly consistent with those reported by several manually-curated corpora including the Brown Corpus [78]. In short, we can safely conclude that page counts from search engines are far from accurate and stable [148]. Moreover, due to the inherent differences in their relevance ranking and index sizes, the page counts provided by different search engines are not comparable. Nevertheless, adequate studies have shown that n-gram frequency estimations obtained from search engines do work well for a certain class of applications. As such, one can either make good use of what is available, or stop harping on the primitive issue of unstable page counts. The key question now is not whether search engine counts are stable or otherwise, but rather, how they are used.

6.3 Analysis of Website Contents for Corpus Construction

It is apparent from our discussion in Section 6.2 that the current techniques for constructing corpora from the Web using search engines can be greatly improved. In this section, we address the question of how corpus construction can benefit from the current large search engine indexes despite several inherent mismatches in expectations. Due to the restrictions imposed by search engines, we only have access to a limited number of webpage URLs [128].
As such, the common BootCat technique of downloading the 'off-the-shelf' webpages returned by search engines to construct corpora is not the best approach since (1) the number of webpages provided is inadequate, and (2) not all contents are appropriate for a domain corpus [18]. Moreover, the authoritativeness of webpages has to be taken into consideration to eliminate low-quality contents from questionable sources. Taking these problems into consideration, we have developed a Probabilistic Site Selector (PROSE) to re-rank and filter the websites returned by search engines in order to construct virtual corpora. We discuss this analysis mechanism in detail in Sections 6.3.1 and 6.3.2. In addition, Section 6.3.3 outlines the Seed Term Expansion Process (STEP), the Selective Localisation Process (SLOP), and the Heuristic-based Cleaning Utility for Web Texts (HERCULES) designed to construct Web-derived corpora from virtual corpora, addressing the need of certain natural language applications to access local texts. An overview of the proposed technique is shown in Figure 6.1.

Figure 6.1: A diagram summarising our Web partitioning technique.

A summary of the three phases in SPARTAN is as follows:

Input
– A set of seed terms, W = {w_1, w_2, ..., w_n}.

Phase 1: Website Preparation
– Gather the top 1,000 webpages returned by search engines containing the seed terms. Search engines such as Yahoo will serve the first 1,000 pages when accessed using the provided API.
– Generalise the webpages to obtain a set of website URLs, J.

Phase 2: Website Filtering
– Obtain estimates of the inlinks, the number of webpages in each website, and the number of webpages in each website containing the seed terms.
– Analyse the domain representativeness of the websites in J using PROSE.
– Select websites with good domain representativeness to form a new set J′. These sites constitute our virtual corpora.

Phase 3: Website Content Localisation
– Obtain a set of expanded seed terms, W_X, using Wikipedia through the STEP module.
– Selectively download contents from the websites in J′ based on the expanded seed terms W_X using the SLOP module.
– Extract relevant contents from the downloaded webpages using HERCULES.

Output
– A specialised virtual corpus consisting of website URLs with high domain representativeness.
– A specialised Web-derived corpus consisting of domain-relevant contents downloaded from the websites in the virtual corpus.

6.3.1 Website Preparation

During this initial preparation phase, a set of candidate websites to represent the domain of interest, D, is generated. Methods such as random walks and random IP address generation have been suggested for obtaining random samples of webpages [104, 192]. Such random sampling methods may work well for constructing general or topic-diverse corpora from the Web if conducted under careful scrutiny. For our specialised corpora, we employ purposive sampling instead to seek items (i.e. websites) belonging to a specific, predefined group (i.e. domain D). Since there is no direct way of deciding whether a website belongs to domain D, a set of seed terms W = {w_1, w_2, ..., w_n} is employed as the determining factor. Next, we submit queries to the search engines for webpages containing the conjunction of the seed terms in W. The set of webpage URLs, which contains the purposive samples that we require, is returned as the result.
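As a minimal, hedged sketch of this sourcing step (not SPARTAN's actual code), the following Python fragment formulates the conjunction query and gathers the sample of webpage URLs. The helper search_engine_urls is a hypothetical wrapper around whichever search API is available; the use of the AND keyword is likewise an assumption about query syntax.

# A minimal sketch of webpage sourcing with a conjunction of seed terms.
def conjunction_query(seed_terms):
    # e.g. ['transcription factor', 'blood cell'] ->
    #      '"transcription factor" AND "blood cell"'
    return " AND ".join(f'"{w}"' for w in seed_terms)

def gather_sample_urls(seed_terms, search_engine_urls, limit=1000):
    query = conjunction_query(seed_terms)
    # Most engines only serve about the first 1,000 results for a query.
    return search_engine_urls(query, max_results=limit)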
At the moment, only webpages in the form of HTML files or plain text files are accepted. Since most search engines only serve the first 1,000 documents, the size of our sample is no larger than 1,000. We then process the webpage URLs to obtain the corresponding domain names of the websites. In other words, only the segment of the URL from the scheme (e.g. http://) up to the authority segment of the hierarchical part is considered for further processing. For example, in the URL http://web.csse.uwa.edu.au/research/areas/, only the segment http://web.csse.uwa.edu.au/ is applicable. This collection of distinct websites (i.e. collections of webpages), represented using the notation J, will be subjected to re-ranking and filtering in the next phase.

We have selected websites as the basic unit of analysis, instead of the typical webpages, for two main reasons. Firstly, websites are collections of related webpages belonging to the same theme. This allows us to construct a much larger corpus using the same number of units. For instance, assume that a search engine returns 1,000 distinct webpages belonging to 300 distinct websites. In this example, we can construct a corpus comprising at most 1,000 documents using the webpage as a unit. However, using the website as a unit, we would be able to derive a much larger 90,000-document corpus, assuming an average of 300 webpages per website. Secondly, the fine granularity and volatility of individual webpages make the analysis and maintenance of the corpus difficult. It has been accepted [3, 14, 188] that webpages disappear at a rate of 0.25 to 0.5% per week [73]. Considering this figure, virtual corpora based on webpage URLs are extremely unstable and require constant monitoring, as pointed out by Kilgarriff [127], to replace offline sources. Virtual corpora based on websites as units are far less volatile. This is especially true if the virtual corpora are composed of highly authoritative websites.

6.3.2 Website Filtering

In this section, we describe our probabilistic website selector, PROSE, for measuring and determining the domain representativeness of the candidate websites in J. The domain representativeness of a website is determined by PROSE based on the following criteria introduced by [277]:

• The extent to which the vocabulary covered by a website is inclined towards domain D;
• The extent to which the vocabulary of a website is specific to domain D; and
• The authoritativeness of a website with respect to domain D.

The websites from J which satisfy these criteria are considered sites with good domain representativeness, denoted as the set J′. The selected sites in J′ form our virtual corpus. In the next three subsections, we discuss in detail the notations involved, the means of quantifying the three criteria for measuring domain representativeness, and the ways of automatically determining the selection thresholds.

Notations

Each site u_i ∈ J has three pieces of important information, namely, an authority rank r_i, the number of webpages containing the conjunction of the seed terms in W, n_{w_i}, and the total number of webpages, n_{Ω_i}. The authority rank r_i is obtained by ranking the candidate sites in J according to their number of inlinks (i.e. a low numerical value indicates a high rank). The inlinks to a website can be obtained using the "link:" operator in certain search engines (e.g. Google, Yahoo).
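Anticipating the odds and thresholds that are formalised in the remainder of this subsection (Equations 6.1 to 6.13), the following minimal Python sketch shows one way the website generalisation, the per-site statistics and the PROSE selection could be wired together. It is not the thesis implementation: page_count and inlink_rank are hypothetical wrappers around a search engine's "site:" and "link:" style queries, the mean-based thresholds are only one of the options described below, and degenerate counts (e.g. a probability of exactly 1) are not handled.

# A minimal sketch of PROSE: generalise webpage URLs to websites, gather the
# three pieces of per-site information, and keep sites whose OD exceeds OD_T.
from urllib.parse import urlsplit

def site_of(url):
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"     # scheme + authority only

def odds(p):
    return p / (1.0 - p)                           # assumes 0 < p < 1 here

def prose_filter(webpage_urls, seed_query, page_count, inlink_rank):
    sites = sorted({site_of(u) for u in webpage_urls})                 # the set J
    n_w   = {u: page_count(f"{seed_query} site:{u}") for u in sites}   # n_wi
    n_all = {u: page_count(f"site:{u}") for u in sites}                # n_Omega_i
    rank  = inlink_rank(sites)                     # dict: site -> authority rank r_i
    total_w = sum(n_w.values())                    # n_w over the whole of J
    H = sum(1.0 / k for k in range(1, len(sites) + 1))   # harmonic number H_|J|

    od = {}
    for u in sites:
        oc  = odds(n_w[u] / total_w)               # vocabulary coverage (Eq. 6.2-6.3)
        osp = odds(n_w[u] / n_all[u])              # vocabulary specificity (Eq. 6.4-6.5)
        oa  = odds(1.0 / (rank[u] * H))            # authoritativeness (Eq. 6.6-6.8)
        od[u] = oc * osp * oa                      # OD(u), Eq. 6.1

    # Mean-based thresholds: both mean P_C and mean P_A reduce to 1/|J|.
    mean_ps = sum(n_w[u] / n_all[u] for u in sites) / len(sites)
    od_t = odds(1.0 / len(sites)) * odds(mean_ps) * odds(1.0 / len(sites))
    return [u for u in sites if od[u] > od_t]      # the virtual corpus J'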
As for the second (i.e. n_{w_i}) and the third (i.e. n_{Ω_i}) piece of information, additional queries using the "site:" operator need to be performed. The total number of webpages in site u_i ∈ J can be estimated by restricting the search (i.e. site search) as "site:u_i". The number of webpages in site u_i containing W can be obtained using the query "w site:u_i", where w is the conjunction of the seeds in W using the AND operator. Figure 6.2 shows the distribution of webpages within the sites in J. Each rectangle represents the collection of all webpages of a site in J. Each rectangle is further divided into the collection of webpages containing the seed terms W, and the collection of webpages not containing W. The size of the collection of webpages of site u_i that contain W is n_{w_i}. Using the total number of webpages of the i-th site, n_{Ω_i}, we estimate the number of webpages in the same site not containing W as n_{\bar{w}_i} = n_{Ω_i} − n_{w_i}. With the page counts n_{w_i} and n_{Ω_i}, we can obtain the total page count for webpages not containing W in J as

n_{\bar{w}} = N − n_w = \sum_{u_i \in J} n_{Ω_i} − \sum_{u_i \in J} n_{w_i} = \sum_{u_i \in J} (n_{Ω_i} − n_{w_i})

where N is the total number of webpages in J, and n_w is the total number of webpages in J which contain W (i.e. the area within the circle in Figure 6.2).

Figure 6.2: An illustration of the sample space on which the probabilities employed by the filter are based. The space within the dot-filled circle consists of all webpages from all sites in J containing W. The m rectangles represent the collections of all webpages of the respective sites {u_1, ..., u_m}. The shaded but not dot-filled portion of the space consists of all webpages from all sites in J that do not contain W. The individual shaded but not dot-filled portion within each rectangle is the collection of webpages in the respective site u_i ∈ J that do not contain W.

Probabilistic Site Selector

A site's domain representativeness is assessed based on three criteria, namely, vocabulary coverage, vocabulary specificity and authoritativeness. Assuming independence, the odds in favour of a site's ability to represent a domain, defined as the Odds of Domain Representativeness (OD), is measured as the product of the odds of realising each individual criterion:

OD(u) = OC(u) OS(u) OA(u)    (6.1)

where OC is the Odds of Vocabulary Coverage, OS is the Odds of Vocabulary Specificity, and OA is the Odds of Authoritativeness. OC quantifies the extent to which site u is able to cover the vocabulary of the domain represented by W, while OS captures the chances of the vocabulary of website u being specific to the domain represented by W. On the other hand, OA measures the chances of u being an authoritative website with respect to the domain represented by W. Next, we define the probabilities that make up these three odds.

• Odds of Vocabulary Coverage: Intuitively, the more webpages from site u_i contain W in comparison with other sites, the likelier it is that u_i has a good coverage of the vocabulary of the domain represented by W. As such, this factor requires a cross-site analysis of page counts. Let the sample space, set Y, be the collection of all webpages from all sites in J that contain W. This space is the area within the circle in Figure 6.2 and its size is |Y| = n_w. Following this, let Z be the set of all webpages in site u_i (i.e. any one rectangle in Figure 6.2), with size |Z| = n_{Ω_i}.
Subscribing to the frequency interpretation of probability, we compute the probability of encountering a webpage from site u_i among all webpages from all sites in J that contain W as:

P_C(n_{w_i}) = P(Z|Y) = \frac{P(Z \cap Y)}{P(Y)} = \frac{n_{w_i}}{n_w}    (6.2)

where |Z ∩ Y| = n_{w_i} is the number of webpages from site u_i containing W. We compute OC as:

OC(u_i) = \frac{P_C(n_{w_i})}{1 − P_C(n_{w_i})}    (6.3)

• Odds of Vocabulary Specificity: This odds acts as an offset for sites which have a high coverage of vocabulary across many different domains (i.e. whose vocabulary is not specific to a particular domain). This helps us to identify overly general sites, especially those encyclopaedic in nature which provide background knowledge across a broad range of disciplines. The vocabulary specificity of a site can be estimated using the variation of the page count of W with respect to the total page count of that site. Within a single site with a fixed total page count, an increase in the number of webpages containing W implies a decrease in the number of pages not containing W. In such cases, a larger portion of the site would be dedicated to discussing W and the domain represented by W. Intuitively, such a phenomenon would indicate a narrowing of the scope of word usage, and hence an increase in the specificity of the vocabulary. As such, the examination of the specificity of the vocabulary is confined within a single site, and hence is defined over the collection of all webpages within that site. Let Z be the set of all webpages in site u_i and V be the set of all webpages in site u_i that contain W. Following this, the probability of encountering a webpage that contains W in site u_i is defined as:

P_S(n_{w_i}) = P(V|Z) = \frac{P(V \cap Z)}{P(Z)} = \frac{n_{w_i}}{n_{Ω_i}}    (6.4)

where |V ∩ Z| = |V| = n_{w_i}. We compute OS as:

OS(u_i) = \frac{P_S(n_{w_i})}{1 − P_S(n_{w_i})}    (6.5)

• Odds of Authoritativeness: We first define a distribution for computing the probability that website u_i is authoritative with respect to W. It has been demonstrated that the various indicators of a website's authority, such as the number of inlinks, the number of outlinks and the frequency of visits, follow Zipf's ranked distribution [2]. As such, the probability that site u_i with authority rank r_i (i.e. a rank based on the number of inlinks to site u_i) is authoritative with respect to W can be defined using the probability mass function:

P_A(r_i) = P(r_i; |J|) = \frac{1}{r_i H_{|J|}}    (6.6)

where |J| is the number of websites under consideration, and H_{|J|} is the |J|-th generalised harmonic number (with exponent s = 1), computed as:

H_{|J|} = \sum_{k=1}^{|J|} \frac{1}{k}    (6.7)

We then compute OA as:

OA(u_i) = \frac{P_A(r_i)}{1 − P_A(r_i)}    (6.8)

Selection Thresholds

In order to select websites with good domain representativeness, a threshold for OD is derived automatically as a combination of the individual thresholds related to OC, OS and OA:

OD_T = OA_T OC_T OS_T    (6.9)

Depending on the desired output, these individual thresholds can be determined using one of the three options associated with each probability mass function. All sites u_i ∈ J whose odds OD(u_i) exceed OD_T are considered suitable candidates for representing the domain. These selected sites, denoted as the set J′, constitute our virtual corpus. We now go through the details of deriving the thresholds for the individual odds.

• Firstly, the threshold for OC is defined as:

OC_T = \frac{τ_C}{1 − τ_C}    (6.10)

where τ_C can be either \bar{P}_C, P_C^{max} or P_C^{min}.
The mean of the distribution is given by:

\bar{P}_C = \frac{1}{|J|} \sum_{u_i \in J} \frac{n_{w_i}}{n_w} = \frac{1}{|J|} \times \frac{n_w}{n_w} = \frac{1}{|J|}

while the highest and lowest probabilities are defined as:

P_C^{max} = \max_{u_i \in J} P_C(n_{w_i}), \qquad P_C^{min} = \min_{u_i \in J} P_C(n_{w_i})

where \max_{u_i \in J} P_C(n_{w_i}) returns the maximum value of the function P_C(n_{w_i}) as n_{w_i} ranges over the page counts of all websites u_i in J.

• Secondly, the threshold for OS is given by:

OS_T = \frac{τ_S}{1 − τ_S}    (6.11)

where τ_S can be either \bar{P}_S, P_S^{max} or P_S^{min}:

\bar{P}_S = \frac{\sum_{u_i \in J} P_S(n_{w_i})}{|J|}, \qquad P_S^{max} = \max_{u_i \in J} P_S(n_{w_i}), \qquad P_S^{min} = \min_{u_i \in J} P_S(n_{w_i})

Note that \bar{P}_S ≠ 1/|J|, since the sum of P_S(u_i) over all u_i ∈ J is not equal to 1.

• Thirdly, the threshold for OA is defined as:

OA_T = \frac{τ_A}{1 − τ_A}    (6.12)

where τ_A can be either \bar{P}_A, P_A^{max} or P_A^{min}. The expected value of the random variable X for the Zipfian distribution is defined as:

\bar{X} = \frac{H_{N,s−1}}{H_{N,s}}

and since s = 1 in our distribution of authority ranks, the expected value of the variable r can be obtained through:

\bar{r} = \frac{|J|}{H_{|J|}}

Using \bar{r}, we have \bar{P}_A as:

\bar{P}_A = \frac{1}{\bar{r} H_{|J|}} = \frac{1}{|J|}

The highest and lowest probabilities are given by:

P_A^{max} = \max_{u_i \in J} P_A(r_i), \qquad P_A^{min} = \min_{u_i \in J} P_A(r_i)    (6.13)

where \max_{u_i \in J} P_A(r_i) returns the maximum value of the function P_A(r_i) as r_i ranges over the authority ranks of all websites u_i in J.

6.3.3 Website Content Localisation

This content localisation phase is designed to construct Web-derived corpora using the virtual corpora created in the previous phase. The three main processes in this phase are seed term expansion (STEP), selective content downloading (SLOP), and content extraction (HERCULES). STEP uses the categorical organisation of Wikipedia topics to discover related terms to complement the user-provided seed terms. Under each Wikipedia category, there is typically a listing of subordinate topics. For instance, there is a category called "Category:Blood cells" which corresponds to the "blood cell" seed term. STEP begins by finding the category page "Category:w" on Wikipedia which corresponds to each w ∈ W (line 3 in Algorithm 2). Under the category page "Category:Blood cells" is a listing of the various types of blood cells, such as leukocytes, red blood cells, reticulocytes, etc. STEP relies on regular expressions to scrape the category page to obtain these related terms (line 4 in Algorithm 2). The related topics in the category pages are typically structured using the <li> tag. It is important to note that not all topics listed under a Wikipedia category adhere strictly to the hypernym-hyponym relation. Nevertheless, the terms obtained through such means are highly related to the encompassing category since they are determined by human contributors. These related terms can be relatively large in number. As such, we employ the Normalised Web Distance (NWD) [276], a generalised version of the Normalised Google Distance (NGD) by [50], to select the m most related ones (lines 6 and 8 in Algorithm 2). Algorithm 2 summarises STEP. The existing set of seed terms W = {w_1, w_2, ..., w_n} is expanded to become W_X = {W_1 = {w_1, ...}, W_2 = {w_2, ...}, ..., W_n = {w_n, ...}} through this process.
Algorithm 2 STEP(W, m)
1: initialise WX
2: for each wi ∈ W do
3:   page := getcategorypage(wi)
4:   relatedtopics := scrapepage(page)
5:   for each a ∈ relatedtopics do
6:     sim := NWD(a, wi)
7:   recall the m most related topics (a1, ..., am)
8:   Wi := {wi, a1, ..., am}
9:   add Wi to the set WX
10: return WX

SLOP then uses the expanded seed terms WX to selectively download the contents of the websites in J′. Firstly, all possible pairs of seed terms are obtained for every combination of sets Wi and Wj from WX:

C = {(x, y) | x ∈ Wi ∈ WX ∧ y ∈ Wj ∈ WX ∧ i < j ≤ |WX|}

Using the seed term pairs in C, SLOP localises the webpages of all websites in J′. For every site u ∈ J′, all pairs (x, y) in C are used to construct queries of the form q = "x" "y" site:u. These queries are then submitted to search engines to obtain the URLs of webpages from each site that contain the seed terms. This ensures that only relevant pages from a website are downloaded, and prevents the localising of boilerplate pages such as "about us", "disclaimer", "contact us", "home", "faq", etc., whose contents are not suitable for the specialised corpora. Currently, only HTML and plain text pages are considered. Using these URLs, SLOP downloads the corresponding webpages to a local repository.

The final step of content localisation makes use of HERCULES to extract contents from the downloaded webpages. HERCULES is based on the following sequence of heuristics: 1) all relevant texts are located within the <body> tag; 2) the contribution of invisible elements and formatting tags to determining the relevance of texts is insignificant; 3) the segmentation of relevant texts, typically into paragraphs, is defined by structural tags such as <br>, <p>, <span>, <div>, etc.; 4) the length of sentences in relevant texts is typically higher; 5) the concentration of function words in relevant texts is higher [88]; 6) the concentration of certain non-alphanumeric characters such as "|", "-", "." and "," in irrelevant texts is higher; 7) other common observations, such as the capitalisation of the first character of sentences and the termination of sentences by punctuation marks, also hold.

HERCULES begins the process by detecting the presence of the <body> and </body> tags, and extracting the contents between them. If no <body> tag is present, the complete HTML source code is used. Next, HERCULES removes all invisible elements (e.g. comments, javascript code) and all tags without contents (e.g. images, applets). Formatting tags such as <b>, <i>, <center>, etc. are also discarded. Structural tags are then used to break the remaining texts in the page into segments. The length of each segment relative to all other segments is determined. In addition, the ratio of function words and of certain non-alphanumeric characters (i.e. "|", "-", ".", ",") to the number of words in each segment is measured. The ratios related to non-alphanumeric characters are particularly useful for further removing boilerplate such as Disclaimer | Contact Us | ..., or the reference sections of academic papers, where the concentration of such characters is higher than normal. Using these indicators, HERCULES removes segments which do not satisfy heuristics 4) to 7). The remaining segments are aggregated and returned as contents.
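To make heuristics 4) to 7) concrete, the following is a minimal Python sketch of the segment filtering step. It is an illustration only, not the HERCULES implementation: the function-word list, threshold values and helper names (keep_segment, extract_content) are hypothetical placeholders.

```python
import re

# A tiny, hypothetical function-word list; a real filter would use a fuller stop list.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "for", "it"}
NOISY_CHARS = set("|-.,")

def keep_segment(segment, min_words=10, min_func_ratio=0.15, max_noise_ratio=0.2):
    """Return True if a text segment looks like running prose (cf. heuristics 4-7)."""
    words = segment.split()
    if len(words) < min_words:                  # heuristic 4 (simplified here to an absolute minimum)
        return False
    func_ratio = sum(w.lower() in FUNCTION_WORDS for w in words) / len(words)
    if func_ratio < min_func_ratio:             # heuristic 5: prose is rich in function words
        return False
    noise_ratio = sum(segment.count(c) for c in NOISY_CHARS) / len(words)
    if noise_ratio > max_noise_ratio:           # heuristic 6: boilerplate is rich in | - . ,
        return False
    sentences = [s.strip() for s in re.split(r"[.!?]", segment) if s.strip()]
    capitalised = sum(s[0].isupper() for s in sentences)
    return capitalised >= len(sentences) / 2    # heuristic 7: sentences begin with capitals

def extract_content(segments):
    """Aggregate the segments that pass the heuristics."""
    return "\n".join(s for s in segments if keep_segment(s))
```

In practice the thresholds would be tuned relative to the other segments of the same page rather than fixed globally, as described above.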
6.4 Evaluations and Discussions

In this section, we discuss the results of three experiments conducted to assess different aspects of our technique.

6.4.1 The Impact of Search Engine Variations on Virtual Corpus Construction

We conducted a three-part experiment to study the impact of the choice of search engine on the resulting virtual corpus. In this experiment, we first examine the extent of correlation between the websites as ranked by the different search engines. We then study whether or not the websites re-ranked using PROSE achieve a higher level of correlation. A high correlation between the websites re-ranked by PROSE would suggest that the composition of the virtual corpora remains relatively stable regardless of the choice of search engine.

We performed a scaled-down version of the virtual corpus construction procedure outlined in Sections 6.3.1 and 6.3.2. For this experiment, we employed the three major search engines, namely, Yahoo, Google and Live Search (by Microsoft), and their APIs for constructing virtual corpora. We chose the seed terms "transcription factor" and "blood cell" to represent the domain of molecular biology D1, while the reliability engineering domain D2 is represented using the seed terms "risk management" and "process safety". For each domain D1 and D2, we gathered the first 1,000 webpage URLs from the three search engines. We then processed the URLs to obtain the corresponding websites' addresses. The sets of websites obtained for domain D1 using Google, Yahoo and Live Search are denoted as J1G, J1Y and J1M, respectively. The same notation applies for domain D2. Next, these websites were assigned ranks based on their corresponding webpages' order of relevance as determined by the respective search engines. We refer to these ranks as native ranks. If a site has multiple webpages included in the search results, the highest rank prevails. This ranking information is kept for use in the later parts of this experiment. Figure 6.3 summarises the number of websites obtained from each search engine for each domain.

Figure 6.3: A summary of the number of websites returned by the respective search engines for each of the two domains. The number of common sites is also provided.

In the first part of this experiment, we sorted the 77 common websites for D1, denoted as J1C = J1G ∩ J1Y ∩ J1M, and the 103 in J2C = J2G ∩ J2Y ∩ J2M, using their native ranks (i.e. the ranks generated by the search engines). We then determined their Spearman's rank correlation coefficients. The native columns in Figures 6.4(a) and 6.4(b) show the correlations between the websites as sorted by different pairs of search engines. The correlation between websites based on native rank is moderate, ranging between 0.45 and 0.54. This extent of correlation does not come as a surprise. In fact, this result supports our implicit knowledge that different search engines rank the same webpages differently. Given the same query, the same webpage will inevitably be assigned distinct ranks due to the inherent differences in index size and in the ranking algorithms themselves. For this reason, the ranks generated by search engines (i.e. native ranks) do not necessarily reflect the domain representativeness of the webpages.
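For reference, the native-rank correlations reported above can be reproduced with a standard Spearman computation over the common sites. The sketch below is purely an illustration of the measurement (using scipy, with hypothetical inputs), not the code used in this experiment.

```python
from scipy.stats import spearmanr

def native_rank_correlation(ranking_a, ranking_b):
    """Spearman correlation between two engines' native rankings of their common sites.

    Each argument is a list of website addresses ordered by native rank (best first).
    """
    common = [site for site in ranking_a if site in set(ranking_b)]
    positions_a = [ranking_a.index(site) for site in common]
    positions_b = [ranking_b.index(site) for site in common]
    rho, _ = spearmanr(positions_a, positions_b)
    return rho

# Example with toy rankings from two hypothetical engines:
# native_rank_correlation(["a.org", "b.com", "c.net"], ["b.com", "a.org", "c.net"])
```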
In the second part of the experiment, we re-rank the websites in J{1,2}C using PROSE. For simplicity, we only employ the coverage and specificity criteria for determining the domain representativeness of websites, in the form of the odds of domain representativeness (OD). The information required by PROSE, namely, the number of webpages containing W, nwi, and the total number of webpages, nΩi, is obtained from the respective search engines. In other words, the OD of each website is estimated three times, each time using the nwi and nΩi obtained from a different search engine. The three variants of estimation are later translated into ranks for re-ordering the websites. Given the varying nature of page counts across different search engines, as discussed in Section 6.2.3, many would expect that re-ranking the websites using metrics based on such information would yield even worse correlations. On the contrary, the significant increases in correlation between the websites after re-ranking by PROSE, shown in the PROSE columns of Figures 6.4(a) and 6.4(b), demonstrate otherwise.

Figure 6.4: A summary of the Spearman's correlation coefficients between websites before and after re-ranking by PROSE: (a) the molecular biology domain; (b) the reliability engineering domain. The native columns show the correlations between the websites when sorted according to their native ranks provided by the respective search engines.

We now discuss the reasons behind this interesting finding. As mentioned before, search engine indexes vary greatly. For instance, based on page counts by Google, we have a 15,900/23,800,000 = 0.000668 probability of encountering a webpage from the site www.pubmedcentral.nih.gov that contains the bi-gram "blood cell". (This page count and all subsequent page counts derived from Google and Yahoo were obtained on 2 April 2009.) However, Yahoo provides us with a higher estimate at 0.001440. This is not because Yahoo is more accurate than Google or vice versa; they are just different. We have discussed this in detail in Section 6.2.3. This reaffirms that estimations using different search engines are by themselves not comparable. Consider the next example n-gram, "gasoline". Google and Yahoo provide the estimates 0.000046 and 0.000093 for the same site, respectively. Again, they are very different from one another. While the estimations are inconsistent (i.e. Google and Yahoo offer different page counts for the same n-grams), the conclusion is the same, namely, one has a better chance of encountering a page in www.pubmedcentral.nih.gov that contains "blood cell". In other words, estimations based on search engine counts have significance only in relation to something else (i.e. relativity). This is exactly how PROSE works. PROSE determines a site's OD based entirely on its contents, and OD is computed using search engine counts. Even though the analysis of the same site using different search engines eventually produces different ODs, the object of the study, namely, the content of the site, remains constant. In this sense, the only variable in the analysis by PROSE is the search engine count. Since the ODs generated by PROSE are used for inter-comparing the websites in J{1,2}C (i.e. ranking), the numerical differences introduced through the variable page counts of the different search engines become insignificant.
Ultimately, the same site analysed by PROSE using unstable page counts from different search engines can still achieve the same rank.

In the third part of this experiment, we examine the general 'quality' of the websites ranked by PROSE using information provided by the different search engines. As we elaborated in Section 6.3.2, PROSE measures the odds in favour of the websites' authority, vocabulary coverage and specificity. Websites with low OD can be considered poor representers of the domain. The ranking of sites by PROSE using information from Google consistently resulted in the largest number of websites with OD less than −6: about 70.13% in domain D1 and 34.95% in domain D2 are considered poor representers. On the other hand, the sites ranked using information from Yahoo and Live Search have relatively higher OD.

Figure 6.5: The number of sites with OD less than −6 after re-ranking using PROSE based on the page count information provided by the respective search engines.

To explain this trend, let us consider the seed terms {"transcription factor", "blood cell"}. According to Google, there are 23,800,000 webpages in www.pubmedcentral.nih.gov and, out of that number, 1,180 contain both seed terms. As for Yahoo, it indexes far fewer webpages from the same site (9,051,487) but offers approximately the same page count for the seed terms (1,060). This trend is consistent when we examine the page count for the non-related n-gram "vehicle" from the same site: Google and Yahoo report approximately the same page counts of 24,900 and 20,100, respectively. There are a few possibilities. Firstly, the remaining 23,800,000 − 9,051,487 = 14,748,513 webpages indexed by Google really do not contain the n-grams, or secondly, Google overestimates the overall figure of 23,800,000. The second possibility becomes more evident as we look at the page counts of other search engines. (Other commonly-used search engines such as AltaVista and AlltheWeb were not cited for comparison since they use the same search index as Yahoo's.) Live Search reports a total page count of 61,400 for the same site, with 1,460 webpages containing the seed terms {"transcription factor", "blood cell"}. Ask.com, with a much larger site index at 15,600,000, has 914 pages containing the seed terms. The index sizes of all these other search engines are much smaller than Google's, and yet they provide approximately the same number of pages containing the seed terms. Our finding is consistent with the recent report by Uyar [254], which concluded that the page counts provided by Google are usually higher than the estimates of other search engines. Due to such inflated figures, when we take the relative frequency of n-grams using Google's page counts, the significance of domain-relevant n-grams is greatly undermined. The seed terms (i.e. "transcription factor", "blood cell") achieve a much lower probability of 1,180/23,800,000 = 0.000049 when assessed using Google's page counts, compared to the probability of 1,060/9,051,487 = 0.000117 using Yahoo's. This explains the devaluation of domain-relevant seed terms when assessed by PROSE using information from Google, which leads to the falling OD of websites. In short, Live Search and Yahoo are comparatively better search engines for the task of measuring OD by PROSE. However, the index size of Live Search is undesirably small, a problem also noted by other researchers such as [74].
Moreover, the search facility using the "site:" operator is occasionally turned off by Microsoft, and it sometimes offers illogical estimates. While this problem is present in all search engines, it is particularly evident in Live Search when site search is used. For instance, there are about 61,400 pages from www.pubmedcentral.nih.gov indexed by Live Search, and yet Live Search reports that there are 159,000 pages in that site which contain the n-gram "transcription factor". For these reasons, we preferred the balance between index size and 'honesty' in page counts offered by Yahoo.

6.4.2 The Evaluation of HERCULES

We conducted a simple evaluation of our content extraction utility HERCULES using the Cleaneval development set (http://cleaneval.sigwac.org.uk/devset.html). Due to some implementation difficulties, the scoring program provided by Cleaneval could not be used for this evaluation. Instead, we employed a text comparison module written in Perl (Text::Compare, http://search.cpan.org/~stro/Text-Compare-1.03/lib/Text/Compare.pm). The module, based on the vector-space model, is used for comparing the contents of the texts cleaned by HERCULES with the gold standard provided by Cleaneval. The module uses a rudimentary stop list to filter out common words, and the cosine similarity measure is then employed to compute text similarity. The texts cleaned by HERCULES achieved a 0.8919 similarity with the gold standard, with a standard deviation of 0.0832. The relatively small standard deviation shows that HERCULES is able to consistently extract contents that meet the standard of human curators. We have made available an online demo of HERCULES at http://explorer.csse.uwa.edu.au/research/algorithm_hercules.pl (a slow response time is possible when the server is under heavy load).

6.4.3 The Performance of Term Recognition using SPARTAN-based Corpora

In this section, we evaluate the quality of the corpora constructed using SPARTAN in the context of term recognition for the domain of molecular biology. We compared the performance of term recognition using several specialised corpora, namely:
• SPARTAN-based corpora
• the manually-crafted GENIA corpus [130]
• a BootCat-derived corpus
• seed-restricted querying of the Web (SREQ), as a virtual corpus

We employed the gold standard reference provided with the GENIA corpus for evaluating term recognition. We used the same set of seed terms W = {"human", "blood cell", "transcription factor"} for various purposes throughout this evaluation. The reason behind our choice of seed terms is simple: these are the same seed terms used for the construction of GENIA, which is our gold standard.

BootCat-Derived Corpus

We downloaded and employed the BootCat toolkit (http://sslmit.unibo.it/~baroni/bootcat.html), with its new support for the Yahoo API, to construct a BootCat-derived corpus using the same set of seed terms W = {"human", "blood cell", "transcription factor"}. For the reasons discussed in Section 6.2.1, BootCat is not able to construct a large corpus using only three seed terms. The default settings of 3 terms per tuple and 10 randomly selected tuples for querying cannot be applied in our case. Moreover, we could not perceive any benefit in randomly selecting terms for constructing tuples. As such, we generated all possible combinations of all possible lengths in this experiment.
In other words, we have three 1-tuples, three 2-tuples, and one 3-tuple for use. While this move may appear redundant, since all webpages which contain the 3-tuple will also contain the 2-tuples, we can never be sure that the same webpages will be returned as results by the search engines. In addition, we altered a default setting in the BootCat script collect_urls_from_yahoo.pl which restricted our access to only the first 100 results for each query. Using the seven seed term combinations and the altered Perl script, we obtained 3,431 webpage URLs for downloading. We then employed the BootCat script retrieve_and_clean_pages_from_url_list.pl to download and clean the webpages, resulting in a final corpus of N = 3,174 documents with F = 7,641,018 tokens.

SPARTAN-Based Corpora and SREQ

We first constructed a virtual corpus using SPARTAN and the seed terms W. Yahoo was selected as our search engine of choice for this experiment for the reasons outlined in Section 6.4.1. We employed the API provided by Yahoo (more information on Yahoo Search, including API key registration, is available at http://developer.yahoo.com/search/web/V1/webSearch.html). All requests to Yahoo are sent to the server process http://search.yahooapis.com/WebSearchService/V1/webSearch?/ with query strings formatted as appid=APIKEY&query=SEEDTERMS&results=100. Additional options such as start=START are applied to enable SPARTAN to obtain results beyond the first 100 webpages. This service by Yahoo is limited to 5,000 queries per IP address per day; however, the implementation of this rule is actually quite lenient.

In the first phase of SPARTAN, we obtained 176 distinct websites from the first 1,000 webpages returned by Yahoo using the conjunction of the three seed terms. For the second phase of SPARTAN, we selected the average values, as described in Section 6.3.2, for all three thresholds, namely, τC, τS and τA, to derive our selection cut-off point ODT. The selection process using PROSE left us with a reduced set of 43 sites. The virtual corpus thus contains about N = 84,963,524 documents (i.e. webpages) distributed over 43 websites. In this evaluation, we refer to this virtual corpus as SPARTAN-V, where the letter V stands for virtual. We have made available an online query tool for SPARTAN-V at http://explorer.csse.uwa.edu.au/research/data_virtualcorpus.pl (a slow response time is possible when the server is under heavy load). Figure 6.6 shows the websites included in the virtual corpus for this evaluation.

Figure 6.6: A listing of the 43 sites included in SPARTAN-V.

We then extended the virtual corpus during the third phase of SPARTAN to construct a Web-derived corpus. We selected the three most related topics for each seed term in W during seed term expansion by STEP. The seed term "human" has no corresponding category page on Wikipedia and hence cannot be expanded. The set of expanded seed terms is WX = {{"human"}, {"blood cell", "erythropoiesis", "reticulocyte", "haematopoiesis"}, {"transcription factor", "CREB", "C-Fos", "E2F"}}. Using WX, SLOP gathered 80,633 webpage URLs for downloading. A total of 76,876 pages were actually downloaded, while the remaining 3,743 could not be reached for reasons such as connection errors. Finally, HERCULES is used to extract contents from the downloaded pages for constructing the Web-derived corpus. About 15% of the webpages were discarded by HERCULES due to the absence of proper contents. The final Web-derived corpus, denoted as SPARTAN-L (the letter L refers to local), is composed of N = 64,578 documents with F = 118,790,478 tokens. We have made available an online query tool for SPARTAN-L at http://explorer.csse.uwa.edu.au/research/data_localcorpus.pl (a slow response time is possible when the server is under heavy load). It is worth pointing out that, using SPARTAN and the same number of seed terms, we can easily construct a corpus that is at least 20 times larger than a BootCat-derived corpus.
Many researchers have found good use for page counts in a wide range of NLP applications, using search engines as gateways to the Web (i.e. as a general virtual corpus). In order to justify the need for content analysis during the construction of virtual corpora by SPARTAN, we included the use of guided search engine queries as a form of specialised virtual corpus during term recognition. We refer to this virtual corpus as SREQ, the seed-restricted querying of the Web. Quite simply, we append the conjunction of the seed terms W to every query made to the search engines. In a sense, we can consider SREQ as the portion of the Web which contains the seed terms W. For instance, the normal approach for obtaining the general page count (i.e. the number of pages on the Web) for "TNF beta" is to submit the n-gram as a query to any search engine. Using Yahoo, the general virtual corpus has 56,400 documents containing "TNF beta". In SREQ, the conjunction of the seeds in W is appended to "TNF beta", resulting in the query q = "TNF beta" "transcription factor" "blood cell" "human". Using this query, Yahoo provides us with 218 webpages, while the conjunction of the seed terms alone results in the page count N = 149,000. We can consider the latter as the size of SREQ (i.e. the total number of documents in SREQ), and the former as the number of documents in SREQ which contain the term "TNF beta".

GENIA Corpus and the Preparations for Term Recognition

In this section, we evaluate the performance of term recognition using the different corpora described above. Terms are content-bearing words which are unambiguous, highly specific and relevant to a certain domain of interest. Most existing term recognition techniques identify terms from among the candidates through some scoring and ranking mechanism. The performance of term recognition is heavily dependent on the quality and the coverage of the text corpora. Therefore, we find it appropriate to use this task to judge the adequacy and applicability of both SPARTAN-V and SPARTAN-L in real-world applications. The term candidates and the gold standard employed in this evaluation come with the GENIA corpus [130]. The term candidates were extracted from the GENIA corpus based on the readily-available part-of-speech and semantic mark-up. A gold standard, denoted as the set G, was constructed by extracting the terms which have semantic descriptors enclosed by cons tags. For practicality reasons, we randomly selected 1,300 term candidates for evaluation, denoted as T. We manually inspected the list of candidates and compared them against the gold standard. Out of the 1,300 candidates, 121 are non-terms (i.e. misses) while the remaining 1,179 are domain-relevant terms (i.e. hits).
Instead of relying on complex measures, we used a simple, unsupervised technique based solely on the cross-domain distributional behaviour of words for term recognition. Our intention is to observe the extent to which the quality of the corpora contributes to term recognition, without being obscured by the complexity of state-of-the-art techniques. We employed relative frequencies to determine whether a word (i.e. a term candidate) is a domain-relevant term or otherwise. The idea is simple: if a word is encountered more often in a specialised corpus than in the contrastive corpus, then the word is considered relevant to the domain represented by the former. As such, this technique places even more emphasis on the coverage and adequacy of the corpora to achieve good term recognition performance. For the contrastive corpus, we prepared a collection comprising texts from a broad range of domains other than our domain of interest, which is molecular biology. Figure 6.7 summarises the composition of the contrastive corpus.

Figure 6.7: The number of documents and tokens from the local and virtual corpora used in this evaluation.

The term recognition procedure is performed as follows. Firstly, we take note of the total number of tokens F in each local corpus (i.e. BootCat, GENIA, SPARTAN-L, contrastive corpus). For the two virtual corpora, namely SPARTAN-V and SREQ, the total page count (i.e. the total number of documents) N is used instead. Secondly, the word frequency ft for each candidate t ∈ T is obtained from each local corpus. We use page counts (i.e. document frequencies) nt as substitutes for the virtual corpora. Thirdly, the relative frequency pt for each t ∈ T is calculated as either ft/F or nt/N, depending on the corpus type (i.e. virtual or local). Fourthly, we evaluate the performance of term recognition using these relative frequencies. Note that when comparing local corpora (i.e. BootCat, GENIA, SPARTAN-L) with the contrastive corpus, the pt based on word frequency is used. The pt based on document frequency is used for comparing the virtual corpora (i.e. SPARTAN-V, SREQ) with the contrastive corpus. If the pt from a specialised corpus (i.e. BootCat, GENIA, SPARTAN-L, SPARTAN-V, SREQ), denoted as dt, is larger than or equal to the pt from the contrastive corpus, ct, then the candidate t is classified as a term. The candidate t is classified as a non-term if dt < ct. An assessment function, described in Algorithm 3, is employed to grade the decisions achieved using the various specialised corpora.

Algorithm 3 assessBinaryClassification(t, dt, ct, G)
1: initialise decision
2: if dt > ct ∧ t ∈ G then
3:   decision := "true positive"
4: else if dt > ct ∧ t ∉ G then
5:   decision := "false positive"
6: else if dt < ct ∧ t ∈ G then
7:   decision := "false negative"
8: else if dt < ct ∧ t ∉ G then
9:   decision := "true negative"
10: return decision

Term Recognition Results

Contingency tables are constructed using the numbers of false positives and negatives, and true positives and negatives, obtained from Algorithm 3. Figure 6.8 summarises the errors introduced during the classification process for term recognition using the different specialised corpora.

Figure 6.8: The contingency tables summarising the term recognition results using the various specialised corpora: (a) GENIA; (b) SPARTAN-V; (c) SPARTAN-L; (d) SREQ; (e) BootCat.

We then computed the precision, accuracy, F1 and F0.5 scores using the values in the contingency tables. Figure 6.9 summarises the performance metrics for term recognition using the different corpora.

Figure 6.9: A summary of the performance metrics for term recognition.
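As a companion to Algorithm 3, the sketch below (a hypothetical Python helper, not the evaluation code used here) tallies the four possible outcomes over a list of candidates and derives the precision, recall, accuracy and F-scores of the kind reported in Figure 6.9; the ≥ comparison follows the textual description of the classification rule given above.

```python
def evaluate_term_recognition(candidates, gold, d_freq, c_freq):
    """Grade the relative-frequency classification and compute the reported metrics.

    candidates -- the list of term candidates T
    gold       -- the set of gold-standard terms G
    d_freq     -- dict mapping a candidate t to its relative frequency d_t
                  in the specialised corpus
    c_freq     -- dict mapping a candidate t to its relative frequency c_t
                  in the contrastive corpus
    """
    tp = fp = fn = tn = 0
    for t in candidates:
        is_term = d_freq[t] >= c_freq[t]        # classified as a domain-relevant term
        if is_term and t in gold:
            tp += 1
        elif is_term:
            fp += 1
        elif t in gold:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)

    def f_score(beta):
        return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

    return {"precision": precision, "recall": recall, "accuracy": accuracy,
            "F1": f_score(1.0), "F0.5": f_score(0.5)}
```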
Firstly, in the context of local corpora, Figure 6.9 shows that SPARTAN-L achieved better performance than BootCat. While SPARTAN-L is merely 2.5% more precise than BootCat, the latter recorded the worst recall, at 65.06%, among all the corpora included in the evaluation. The poor recall of BootCat is due to its high false negative rate. In other words, true terms are not classified as terms using BootCat because of its low-quality composition (e.g. poor coverage and specificity). Many domain-relevant terms in the vocabulary of molecular biology are not covered by the BootCat-derived corpus. Despite being 19 times larger than GENIA, the F1 score of the BootCat-derived corpus is far from ideal. The SPARTAN-L corpus, which is 295 times larger than GENIA in terms of token size, has the closest performance to the gold standard at F1 = 92.87%. Assuming that size does matter, we speculate that a specialised Web-derived corpus at least 419 times larger than GENIA (using linear extrapolation) would be required to match the latter's high vocabulary coverage and specificity and achieve a 100% F1 score. At the moment, this conjecture remains to be tested. Given their inferior performance and effortless setup, BootCat-derived corpora can only serve as baselines in the task of term recognition using specialised Web-derived corpora.

Secondly, in the context of virtual corpora, term recognition using SPARTAN-V achieved the best performance across all metrics, with a 99.56% precision, even outperforming the local version SPARTAN-L. An interesting point here is that the other virtual corpus, SREQ, achieved a good result with precision and recall close to 90%, despite the relative ease of setting up the apparatus required for guided search engine querying. For this reason, we regard SREQ as the baseline for comparing the use of specialised virtual corpora in term recognition. In our opinion, a 9% improvement in precision justifies the additional systematic analysis of website content performed by SPARTAN for creating a virtual corpus. From our experience, the analysis of 200 websites generally requires, on average and ceteris paribus, 1 to 1.5 hours of processing time using the Yahoo API on a standard 1 GHz computer with a 256 Mbps Internet connection. The ad-hoc use of search engines for accessing the general virtual corpus may work for many NLP tasks. However, the relatively poor performance of SREQ here justifies the need for more systematic techniques such as SPARTAN when the Web is used as a specialised corpus for tasks such as term recognition.

Thirdly, comparing virtual and local corpora, only SPARTAN-V scored a recall above 90%, at 96.44%. Upon localising, the recall of SPARTAN-L dropped to 89.40%. This further confirms that term recognition requires large corpora with high vocabulary coverage, and that the SPARTAN technique has the ability to systematically construct virtual corpora with the required coverage. It is also interesting to note that a large 118 million token local corpus (i.e. SPARTAN-L) matches the recall of a 149,000 document virtual corpus (i.e. SREQ).
However, due to the heterogeneous nature of the Web and the inadequacy of simple seed term restriction, SREQ scored 6% less than SPARTAN-L in precision. This concurs with our earlier conclusion that ad-hoc querying, as in SREQ, is not the optimal way of using the Web as a specialised virtual corpus. Even the considerably smaller BootCat-derived corpus achieved a 4% higher precision than SREQ. This shows that size and coverage (there are 46 times more documents in SREQ than in BootCat) contribute only to recall, which explains SREQ's 24% better recall than BootCat. Due to SREQ's lack of vocabulary specificity, it recorded the lowest precision at 90.44%.

Overall, certain tasks indeed benefit from larger corpora, obviously when meticulously constructed. More specifically, tasks which do not require local access to the texts in the corpora, such as term recognition, may well benefit from the considerably larger and distributed nature of virtual corpora. This is evident from the fact that the SPARTAN-based corpus fared 3−7% worse across all metrics upon localising (i.e. SPARTAN-L). Furthermore, the very close F1 scores achieved by the worst performing virtual corpus (i.e. the baseline SREQ) and the best performing local corpus (SPARTAN-L) show that virtual corpora may indeed be more suitable for the task of term recognition. We speculate that several reasons are at play, including the ever-evolving vocabulary on the Web, and the sheer size of that vocabulary, which even Web-derived corpora cannot match.

In short, in the context of term recognition, the two most important factors which determine the adequacy of the constructed corpora are coverage and specificity. On the one hand, larger corpora, even when conceived in an ad-hoc manner, can potentially lead to higher coverage, which in turn contributes significantly to recall. On the other hand, the extra effort spent on systematic analysis leads to a more specific vocabulary, which in turn contributes to precision. Most existing techniques lack focus on one or both factors, leading to poorly constructed and inadequate virtual corpora and Web-derived corpora. For instance, BootCat has difficulty in practically constructing very large corpora, while ad-hoc techniques such as SREQ lack systematic analysis, which results in poor specificity. From our evaluation, only SPARTAN-V achieved a balanced F1 score exceeding 95%. In other words, the virtual corpora constructed using SPARTAN are both adequately large, with high coverage, and specific enough in vocabulary to achieve highly desirable term recognition performance.

We can construct much larger specialised corpora using SPARTAN by adjusting certain thresholds. We can adjust τC, τS and τA to allow more websites to be included in the virtual corpora, and we can also permit more related terms to be included as extended seed terms during STEP. This allows more webpages to be downloaded to create even larger Web-derived corpora. This is possible since the maximum number of pages derivable from the 43 websites is 84,963,524, as shown in Figure 6.7, whereas during the localisation phase only 64,578 webpages, a mere 0.07% of the total, were actually downloaded. In other words, the SPARTAN technique is highly customisable, creating both small and very large virtual and Web-derived corpora using only a few thresholds.
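As an illustration of this customisability, the following minimal sketch (an assumption-laden reading of Equations (6.9) to (6.13), not the thesis implementation) shows how choosing τC, τS and τA as the mean, maximum or minimum of their respective probability values yields the cut-off OD_T; looser choices admit more websites into the virtual corpus.

```python
def selection_cutoff(p_c, p_s, p_a, option="mean"):
    """Derive OD_T as the product of OC_T, OS_T and OA_T.

    p_c, p_s, p_a -- lists of the P_C, P_S and P_A values over all sites in J
    option        -- 'mean', 'max' or 'min'; the chosen statistic is used as
                     tau_C, tau_S and tau_A respectively
    """
    pick = {"mean": lambda p: sum(p) / len(p), "max": max, "min": min}[option]

    def odds(tau):
        return tau / (1.0 - tau)

    return odds(pick(p_c)) * odds(pick(p_s)) * odds(pick(p_a))

# Lowering the cut-off (e.g. option="min") lets more sites exceed OD_T and
# therefore enlarges the resulting virtual corpus.
```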
6.5 Conclusions

The sheer volume of textual data available on the Web, the ubiquitous coverage of topics, and the growth of content have become the catalysts promoting a wider acceptance of the Web for corpus construction in various applications of knowledge discovery and information extraction. Despite the extensive use of the Web as a general virtual corpus, very few studies have focused on the systematic analysis of website contents for constructing specialised corpora from the Web. Existing techniques such as BootCat simply pass the responsibility of deciding on suitable webpages to the search engines. Others allow their Web crawlers to run astray (subsequently resulting in topic drift) without systematic controls while downloading webpages for corpus construction. In the face of these inadequacies, we introduced a novel technique called SPARTAN which places emphasis on the analysis of the domain representativeness of websites for constructing virtual corpora. The technique also provides the means to extend the virtual corpora in a systematic way to construct specialised Web-derived corpora with high vocabulary coverage and specificity.

Overall, we have shown that SPARTAN is independent of the search engine used during corpus construction. SPARTAN re-ranks the websites provided by search engines based on their domain representativeness, allowing those with the highest vocabulary coverage, specificity and authority to surface. The systematic analysis performed by SPARTAN is adequately justified by the fact that term recognition using SPARTAN-based corpora achieved the best precision and recall in comparison to all other corpora based on existing techniques. Moreover, our evaluation showed that only the virtual corpora constructed using SPARTAN are both adequately large, with high coverage, and specific enough in vocabulary to achieve a balanced term recognition performance (i.e. the highest F1 score). Most existing techniques lack focus on one or both factors. We conclude that larger corpora, when constructed with consideration for vocabulary coverage and specificity, deliver the prerequisites required for producing consistent and high-quality output during term recognition.

Several directions of future work have been planned to further assess SPARTAN. In the near future, we hope to study the effect of corpus construction using different seed terms W. We also intend to examine how the content of SPARTAN-based corpora evolves over time and its effect on term recognition. Furthermore, we are planning to study the possibility of extending the use of virtual corpora to other applications which require contrastive analysis.

6.6 Acknowledgement

This research was supported by the Australian Endeavour International Postgraduate Research Scholarship. The authors would like to thank the anonymous reviewers for their invaluable comments.

6.7 Other Publications on this Topic

Wong, W., Liu, W. & Bennamoun, M. (2008) Constructing Web Corpora through Topical Web Partitioning for Term Recognition. In the Proceedings of the 21st Australasian Joint Conference on Artificial Intelligence (AI), Auckland, New Zealand.

This paper reports the preliminary ideas on the SPARTAN technique for creating text corpora using data from the Web. The SPARTAN technique was later improved and extended to form the core contents of this Chapter 6.
CHAPTER 7
Term Clustering for Relation Acquisition

Abstract

Many conventional techniques for concept formation in ontology learning rely on the use of predefined templates and rules, and on static background knowledge such as WordNet. These techniques are not only difficult to scale across different domains and to adapt to knowledge change; their results are also far from desirable. This chapter proposes a new multi-pass clustering algorithm for concept formation, known as the Tree-Traversing Ant (TTA), as part of an ontology learning system. The technique uses the Normalised Google Distance (NGD) and the n-degree of Wikipedia (noW) as measures of similarity and distance between terms to achieve highly adaptable clustering across different domains. Evaluations using seven datasets show promising results, with an average lexical overlap of 97% and an ontological improvement of 48%. In addition, the evaluations demonstrate several advantages that are not simultaneously present in standard ant-based and other conventional clustering techniques.

(This chapter appeared in Data Mining and Knowledge Discovery, Volume 15, Issue 3, Pages 349-381, with the title "Tree-Traversing Ant Algorithm for Term Clustering based on Featureless Similarities".)

7.1 Introduction

Ontologies are gaining increasing importance in modern information systems for providing inter-operable semantics. The increasing demand for ontologies makes labour-intensive creation more and more undesirable, if not impossible. Exacerbating the situation is the problem of knowledge change that results from ever-growing information sources, both online and offline. Since the late nineties, more and more researchers have started looking for solutions to relieve knowledge engineers from this increasingly acute situation. One of the main research areas with high impact, if successful, is the automatic or semi-automatic construction and maintenance of ontologies from electronic text. Ontology learning from text is the process of identifying concepts and relations from natural language text, and using them to construct and maintain ontologies. In ontology learning, terms are the lexical realisations of important concepts for characterising a domain. Consequently, the task of grouping together variants of terms to form concepts, known as term clustering, constitutes a crucial fundamental step in ontology learning. Unlike documents [242], webpages [44], and pixels in image segmentation and object recognition [113], terms alone are lexically featureless. The similarity of objects can normally be established by feature analysis based on visible (e.g. physical and behavioural) traits. Unfortunately, using object names (i.e. terms) alone, similarity depends on something less tangible, namely, the background knowledge which humans acquire through their senses over the years. The absence of features requires certain adjustments to be made to term clustering techniques. The most evident adaptation required is the use of context and other linguistic evidence as features for the computation of similarity. A recent survey [90] revealed that all ontology learning systems which apply clustering techniques rely on the contextual cues surrounding the terms as features. The large collection of documents, and the predefined patterns and templates required for the extraction of contextual cues, make the portability of such ontology learning systems difficult.
Consequently, non-feature similarity measures are fast becoming a necessity for term clustering in ontology learning from text. Along the same line of thought, Lagus et al. [137] stated that "In principle a document might be encoded as a histogram of its words...symbolic words as such retain no information of their relatedness". In addition to the problems associated with feature extraction in term clustering, much work is still required on the clustering algorithms themselves. Researchers [98] have shown that certain commonly adopted algorithms, such as K-means and average-link agglomerative clustering, yield mediocre results in comparison with ant-based algorithms, which are a relatively new paradigm. Handl et al. [98] demonstrated certain desirable properties of ant-based algorithms, such as tolerance to different cluster sizes and the ability to identify the number of clusters. Despite such advantages, the potential of ant-based algorithms remains relatively unexplored for possible applications in ontology learning.

In this chapter, we employ the established Normalised Google Distance (NGD) [50] together with a new hybrid, multi-pass algorithm called the Tree-Traversing Ant (TTA) for clustering terms in ontology learning. (The foundation work on term clustering using featureless similarity measures appeared in the Proceedings of the International Symposium on Practical Cognitive Agents and Robots (PCAR), Perth, Australia, 2006, with the title "Featureless Similarities for Terms Clustering using Tree-Traversing Ants".) TTA fuses the strengths of standard ant-based and conventional clustering techniques with the advantages of featureless similarity measures. In addition, a second pass is introduced in TTA for refining the results produced using NGD. During the second pass, the TTA employs a new distance measure called the n-degree of Wikipedia (noW) for quantifying the distance between two terms based on Wikipedia's categorical system. Evaluations using seven datasets show promising results, and reveal several advantages which are not simultaneously present in existing clustering algorithms.

In Section 2, we give an introduction to current term clustering techniques for ontology learning. In Section 3, a description of the NGD measure and an introduction to standard ant-based clustering are presented. In Section 4, we present the TTA, and how NGD and noW are employed to support term clustering. In Section 5, we summarise the results and findings from our evaluations. Finally, we conclude this chapter with an outlook to future work in Section 6.

7.2 Existing Techniques for Term Clustering

Faure & Nedellec [69] presented a corpus-based conceptual clustering technique as part of an ontology learning system called ASIUM. The clustering technique is designed for aggregating basic classes based on a distance measure inspired by the Hamming distance. The basic classes are formed prior to clustering in a phase for extracting subcategorisation frames [71]. Terms that appear on at least two different occasions with the same verb, and the same preposition or syntactic role, can be regarded as semantically similar such that they can be substituted with one another in that particular context. These semantically similar terms form the basic classes. The basic classes form the lowest level of the ontology and are successively aggregated to construct a hierarchy bottom-up. Each time, only two basic classes are compared.
The clustering begins by computing the distance between all pairs of basic classes and aggregating those with a distance less than a user-defined threshold. Two classes containing the same words with the same frequencies have a distance of 0. On the other hand, two classes without a single common word have a distance of 1. In other words, the terms in the basic classes act as features, allowing for inter-class comparison. The measure for distance is defined as

distance(C_1, C_2) = 1 - \frac{\sum F_{C_1} \times \frac{N_{comm}}{card(C_1)} + \sum F_{C_2} \times \frac{N_{comm}}{card(C_2)}}{\sum_{i=1}^{card(C_1)} f(word_i^{C_1}) + \sum_{i=1}^{card(C_2)} f(word_i^{C_2})}

where card(C1) and card(C2) are the numbers of words in C1 and C2, respectively, and Ncomm is the number of words common to both C1 and C2. ΣF_{C1} and ΣF_{C2} are the sums of the frequencies of the words in C1 and C2 which also occur in C2 and C1, respectively. f(word_i^{C1}) and f(word_i^{C2}) are the frequencies of the ith word of class C1 and C2, respectively.

Maedche & Volz [165] presented a bottom-up hierarchical clustering technique that is part of the ontology learning system Text-to-Onto. This term clustering technique relies on an all-knowing oracle, denoted by H, which is capable of returning possible hypernyms for a given term. In other words, the performance of the clustering algorithm has an upper bound limited by the ability of the oracle to know all possible hypernyms for a term. The oracle is constructed using WordNet and lexico-syntactic patterns [51]. During the clustering phase, the algorithm is provided with a list of terms, and the similarity between each pair is computed using the cosine measure. For this purpose, the syntactic dependencies of each term are extracted and used as the features for that term. The algorithm is an extremely long list of nested if-else statements. For the sake of brevity, it suffices to know that the algorithm examines the hypernymy relations between all pairs of terms before it decides on the placement of terms as parents, children or siblings of other terms. Each time information about the hypernym relations between two terms is required, the oracle is consulted. The projection H(t) returns a set of tuples (x, y) where x is a hypernym of term t and y is the number of times the algorithm has found evidence for it.

Shamsfard & Barforoush [225] presented two clustering algorithms as part of the ontology learning system Hasti. Concepts have to be formed prior to the clustering phase. It suffices to know that the process of forming the concepts and extracting the relations that are used as features for clustering involves a knowledge extractor, where "the knowledge extractor is a combination of logical, template driven and semantic analysis methods" [227]. In the concept-based clustering technique, a similarity matrix consisting of the similarities of all possible pairs of concepts is computed. The pair with the maximum similarity that is also greater than the merge-threshold is chosen to form a new super concept. In this technique, each intermediate (i.e. non-leaf) node in the conceptual hierarchy has at most two children, but the hierarchy is not a binary tree as each node may have more than one parent. As for the relation-based clustering technique, only non-taxonomic relations are considered. For every concept c, a set of assertions about the non-taxonomic relations NF(c) that c has with other concepts is identified.
In other words, these relations can be regarded as features that allow concepts to be merged according to what they share. If at least one related concept is common between assertions about that relation, then the set comprising the other concepts (called the merge-set) contains good candidates for merging. After all the relations have been examined, a list of merge-sets is obtained. The merge-set with the highest similarity between its members is chosen for merging. In both clustering algorithms, the similarity measure employed is defined as

similarity(a, b) = \sum_{j=1}^{maxlevel} \sum_{i=1}^{card(cm)} \left( W_{cm(i).r} + \sum_{k=1}^{valence(cm(i).r)} W_{cm(i).arg(k)} \right) \times L_j

where cm = NF(a) ∩ NF(b) is the intersection between the sets of assertions (i.e. the common relations) about a and b, and card(cm) is the cardinality of cm. W_{cm(i).r} is the weight of each common relation, and \sum_{k=1}^{valence(cm(i).r)} W_{cm(i).arg(k)} is the sum of the weights of all terms related to the common relation cm(i). Lj is the level constant assigned to each similarity level, which decreases as the level increases. The main aspect of the similarity measure is the common features between two concepts a and b (i.e. the intersection between the sets of non-taxonomic assertions NF(a) ∩ NF(b)). Each common feature cm(i).r, together with the corresponding weight W_{cm(i).r} and the weights of the related terms, is accumulated. In other words, the more features two concepts have in common, the higher the similarity between them.

Regardless of how the existing techniques described in this section are named, they share a common point, namely, the reliance on some linguistic (e.g. subcategorisation frames, lexico-syntactic patterns) or predefined semantic (e.g. WordNet) resources as features. These features are necessary for the computation of similarity using conventional measures and clustering algorithms. The ease of scalability across different domains and the resources required for feature extraction are among the questions our new clustering technique attempts to address. In addition, the new clustering technique fuses the strengths of recent innovations, such as ant-based algorithms and featureless similarity measures, that have yet to benefit ontology learning systems.

7.3 Background

7.3.1 Normalised Google Distance

The Normalised Google Distance (NGD) computes the semantic distance between objects based on their names, using only page counts from the Google search engine. A more generic name for the measure that employs page counts provided by any Web search engine is the Normalised Web Distance (NWD) [262]. NGD is a non-feature distance measure which attempts to capture every effective distance (e.g. Hamming distance, Euclidean distance, edit distances) in a single metric. NGD is based on the notions of Kolmogorov complexity [93] and Shannon-Fano coding [142].

The basis of NGD begins with the idea of the shortest binary program capable of producing a string x as output. The Kolmogorov complexity of the string x, K(x), is simply the length of that program in binary bits. Extending this notion to include an additional string y produces the information distance [23], where E(x, y) is the length of the shortest binary program that can produce x given y, and y given x. It was shown that [23]:

E(x, y) = K(x, y) - \min\{K(x), K(y)\}    (7.1)

where E(x, x) = 0, E(x, y) > 0 for x ≠ y, and E(x, y) = E(y, x).
Next, for every other computable distance D that is non-negative and symmetric, there is a binary program, given strings x and y, with a length equal to D(x, y). Formally, E(x, y) ≤ D(x, y) + c_D, where c_D is a constant that depends on the distance D and not on x and y. E(x, y) is called universal because it acts as a lower bound for all computable distances. In other words, if two strings x and y are close according to some distance D, then they are at least as close according to E [49]. Since all computable distances compare the closeness of strings through the quantification of certain common features they share, we can consider that information distance determines the distance between two strings according to the feature by which they are most similar. By normalising information distance, we have NID(x, y) ∈ (0, 1), where 0 means the two strings are the same and 1 means they are completely different in the sense that they share no features. The normalised information distance is defined as:

NID(x, y) = \frac{K(x, y) - \min\{K(x), K(y)\}}{\max\{K(x), K(y)\}}

Nonetheless, referring back to Kolmogorov complexity and Equation 7.1, the non-computability of K(x) implies the non-computability of NID(x, y). An approximation of K can nevertheless be achieved using real compression programs [261]. If C is a compressor, then C(x) denotes the length of the compressed version of string x. Approximating K(x) with C(x) results in:

NCD(x, y) = \frac{C(x, y) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}

The derivation of NGD continues by observing the workings of compressors. Compressors encode source words x into code words x′ such that the code words are shorter (i.e. |x′| < |x|). We can consider these code words from the perspective of Shannon-Fano coding. Shannon-Fano coding encodes a source word x using a code word of length log(1/p(x)). p(x) can be thought of as a probability mass function that maps each source word x to the code that achieves optimal compression of x. In Shannon-Fano coding, p(x) = n_x/N captures the probability of encountering source word x in a text or a stream of data from a source, where n_x is the number of occurrences of x and N is the total number of source words in the same text. Cilibrasi & Vitanyi [49] discussed the use of compressors for NCD and concluded that the existing compressors' inability to take external knowledge into consideration during compression makes them inadequate. Instead, the authors proposed to make use of a source that "...stands out as the most inclusive summary of statistical information" [49], namely, the World Wide Web. More specifically, the authors proposed the use of the Google search engine to devise a probability mass function that reflects the Shannon-Fano code. The Google equivalent of the Shannon-Fano code, known as the Google code, has length defined by [49]:

G(x) = \log \frac{1}{g(x)} \qquad G(x, y) = \log \frac{1}{g(x, y)}

where g(x) = |x|/N and g(x, y) = |x ∩ y|/N are the new probability mass functions that capture the probability of occurrence of the search terms x and y. x is the set of webpages returned by Google containing the single search term x (i.e. the singleton set) and, similarly, x ∩ y is the set of webpages returned by Google containing both search terms x and y (i.e. the doubleton set). N is the summation over all unique singleton and doubleton sets. Consequently, the Google search engine can be considered as a compressor for encoding search terms (i.e. source words) x to produce the meaning (i.e. compressed code words) with length G(x).
By rewriting the NCD, we obtain the new NGD, defined as:

NGD(x, y) = \frac{G(x, y) - \min\{G(x), G(y)\}}{\max\{G(x), G(y)\}}    (7.2)

All in all, NGD is an approximation of NCD and hence of NID, introduced to overcome the non-computability of Kolmogorov complexity. NGD employs the Google search engine as a compressor to generate Google codes based on Shannon-Fano coding. From the perspective of term clustering, NGD provides an innovative starting point which demonstrates the advantages of featureless similarity measures. In our new term clustering technique, we take such innovation a step further by employing NGD in a new clustering technique that combines the strengths of both conventional and ant-based algorithms.

7.3.2 Ant-based Clustering

The idea of ant-based clustering was first proposed by Deneubourg et al. [61] in 1991 as part of an attempt to explain the different types of emergent technologies inspired by nature. During simulation, the ants are represented as agents that move around the environment, a square grid, at random. Objects are randomly placed in this environment and the ants can pick up the objects, move them and drop them. These three basic operations are influenced by the distribution of the objects. Objects that are surrounded by dissimilar ones are more likely to be picked up and later dropped elsewhere in the surroundings of more similar ones. The picking up and dropping of objects are governed by the probabilities:

P_{pick}(i) = \left(\frac{k_p}{k_p + f(i)}\right)^2 \qquad P_{drop}(i) = \left(\frac{f(i)}{k_d + f(i)}\right)^2

where f(i) is an estimation of the distribution density of the objects in the ant's immediate environment (i.e. local neighbourhood) with respect to the object that the ant is considering picking up or dropping. The choice of f(i) varies depending on the cost and other factors related to the environment and the data items. As f(i) decreases below kp, the probability of picking up the object becomes very high, and the opposite occurs when f(i) exceeds kp. As for the probability of dropping an object, a high f(i) exceeding kd induces the ants to give up the object, while an f(i) less than kd encourages the ants to hold on to it. The combination of these three simple operations and the heuristics behind them gave birth to the notion of basic ants for clustering, also known as the standard ant clustering algorithm (SACA).

Gutowitz [94] examined the basic ants described by Deneubourg et al. and proposed a variant known as complexity-seeking ants. Such ants are capable of sensing local complexity and are inclined to work in regions of high interest (i.e. high complexity). Regions with high complexity are determined using a local measure that assesses the neighbouring cells and counts the number of pairs of contrasting cells (i.e. occupied or empty). Neighbourhoods with all empty or all occupied immediate cells have zero complexity, while regions with checkerboard patterns have high complexity. Hence, these modified ants are able to accomplish their task faster because they are more inclined to manipulate objects in regions of higher complexity [263].
Lumer & Faieta [160] further extended and improved the idea of ant-based clustering in terms of the numerical aspects of the algorithm and the convergence time. The authors represented the objects as numerical vectors and the distance between the vectors is computed using the Euclidean distance. Hence, given that δ(i, j) ∈ [0, 1] is the Euclidean distance between object i (i.e. i is the location of the object in the centre of the neighbourhood) and every other neighbouring object j, the neighbourhood function f(i) is defined by the authors as:

f(i) = (1/s²) Σ_j [1 − δ(i, j)/α]   if f(i) > 0
     = 0                            otherwise     (7.3)

where s² is the size of the local neighbourhood, and α ∈ [0, 1] is a constant for scaling the distance among objects. In other words, an ant has to consider the average similarity of object i with respect to all other objects j in the local neighbourhood before performing an operation (i.e. pick up or drop). As the value of f(i) is obtained by averaging the total similarities over the number of neighbouring cells s², empty cells which do not contribute to the overall similarity are penalised. In addition, the radius of perception (i.e. the extent to which objects are taken into consideration for f(i)) of each ant at the centre of the local neighbourhood is given by (s − 1)/2. The clustering algorithm using the basic ant SACA is defined in Algorithm 4.

Algorithm 4 Basic ant-based clustering defined by Handl et al. [99]
1: begin
2: //INITIALISATION PHASE
3: Randomly scatter data items on the toroidal grid
4: for each j in 1 to #agents do
5:   i := random_select(remaining_items)
6:   pick_up(agent(j), i)
7:   g := random_select(remaining_empty_grid_locations)
8:   place_agent(agent(j), g)
9: //MAIN LOOP
10: for each it_ctr in 1 to #iterations do
11:   j := random_select(all_agents)
12:   step(agent(j), stepsize)
13:   i := carried_item(agent(j))
14:   drop := drop_item?(f(i))
15:   if drop = TRUE then
16:     while pick = FALSE do
17:       i := random_select(free_data_items)
18:       pick := pick_item?(f(i))

Handl & Meyer [100] introduced several enhancements to make ant-based clustering more efficient. The first is the concept of eager ants, where idle phases are avoided by having the ants immediately pick up objects as soon as existing ones are dropped. The second is the notion of stagnation control. There are occasions in ant-based clustering when ants are occupied or blocked due to objects that are difficult to dispose of. In such cases, the ants are forced to drop whatever they are carrying after a certain number of unsuccessful drops. In a different paper [98], the authors have also demonstrated that the ant-based algorithm has several advantages:
• tolerance to different cluster sizes
• the ability to identify the number of clusters
• performance that increases with the size of the dataset
• graceful degradation in the face of overlapping clusters.
Nonetheless, the authors have also highlighted two shortcomings of ant-based clustering, namely, the inability to distinguish more refined clusters within coarser-level ones, and the fact that the inability to specify the number of clusters can be seen as a disadvantage when users have precise ideas about it.
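A minimal sketch of the neighbourhood function in Equation 7.3, assuming the caller supplies the objects found in the s × s local neighbourhood together with a distance function normalised to [0, 1]; the toy points and the value of α below are illustrative only:

def neighbourhood_density(i, neighbours, s, alpha, dist):
    # f(i): scaled similarity of object i to its neighbours, averaged over all
    # s*s cells; dividing by s*s (rather than by the number of occupied cells)
    # penalises empty cells, and negative totals are clipped to 0 as in Eq. 7.3
    total = sum(1.0 - dist(i, j) / alpha for j in neighbours)
    return max(0.0, total / (s * s))

euclidean = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
print(neighbourhood_density((0.1, 0.1), [(0.12, 0.1), (0.3, 0.2)], 3, 0.5, euclidean))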
Vizine et al. [263] proposed an adaptive ant clustering algorithm (A²CA) that improves upon the algorithm by Lumer & Faieta. The authors introduced two major modifications, namely, a progressive vision scheme and the use of pheromones on grid cells. The progressive vision scheme allows the dynamic adjustment of s². Whenever an ant perceives a larger cluster, it increases its radius of perception from the original (s − 1)/2 to the new (s′ − 1)/2. The second enhancement allows ants to mark regions that are recently constructed or under construction. The pheromones attract other ants, increasing the probability that relatively smaller regions are deconstructed and increasing the probability of dropping objects at denser clusters.

Ant-based algorithms have been employed to cluster objects that can be represented using numerical vectors. Similar to conventional algorithms, the similarity or distance measures used by existing ant-based algorithms are still feature-based. Consequently, they share similar problems such as difficult portability across domains. In addition, despite the strengths of standard ant-based algorithms, two disadvantages were identified. In our new technique, we make use of the known strengths of standard ant-based algorithms and some desirable traits from conventional ones for clustering terms using featureless similarity.

7.4 The Proposed Tree-Traversing Ants

The Tree-Traversing Ant (TTA) clustering technique is based on dynamic tree structures as compared to the toroidal grids in the case of standard ants. The dynamic tree begins with one root node r0 consisting of all terms T = {t1, ..., tn}, and branches out to new sub-nodes as required. In other words, the clustering process begins with r0 = {t1, ..., tn}. For example, the first snapshot in Figure 7.1 shows the start of the TTA clustering process with the root node r0 initialised with the terms t1, ..., tn=10. Essentially, each node in the tree is a set of terms ru = {t1, ..., tq}. The sizes of new sub-nodes |ru| reduce as fewer and fewer terms are assigned to them in the process of creating nodes with higher intra-node similarity. The clustering starts with only one ant, while an unbounded number of ants await to work at each of the new sub-nodes created. In the third snapshot in Figure 7.1, while the first ant moves on to work at the left sub-node r01, a new second ant proceeds to process the right sub-node r02. The number of possible new sub-nodes for each main node (i.e. the branching factor) in this version of TTA is two. In other words, for each main node rm, we have the sub-nodes rm1 and rm2. Similar to some of the current enhanced ants, the TTA ants are endowed with short-term memory for remembering similarities and distances acquired through their senses. The TTA is equipped with two types of senses, namely, NGD and n-degree of Wikipedia (noW). The standard ants have a radius of perception defined in terms of the cells immediately surrounding the ants. Instead, the perception radius of TTA ants covers all terms in the two sub-nodes created for each current node. A current node is simply a node originally consisting of terms to be sorted into the new sub-nodes. The TTA adopts a two-pass approach for term clustering. During the first-pass, the TTA recursively breaks nodes into sub-nodes and relocates terms until the ideal clusters are achieved. The resulting trees created in the first-pass are often good enough to reflect the natural clusters. Nonetheless, discrepancies do occur due to certain oddities in the co-occurrences of terms on the World Wide Web that manifest themselves through NGD. Accordingly, a second-pass is created that uses noW for relocating terms which are displaced due to NGD. The second-pass can be regarded as a refinement phase for producing clusters with higher quality.

Figure 7.1: Example of TTA at work
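Before moving into the details of the two passes, a possible data structure for the dynamic tree is sketched below; the class and field names are hypothetical and are only meant to make the node-splitting process easier to picture:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    terms: List[str]                                       # the terms currently held by this node
    children: List["Node"] = field(default_factory=list)   # at most two sub-nodes in this version of TTA

    def is_leaf(self) -> bool:
        return not self.children

# the clustering starts with a single root node r0 holding all n terms
r0 = Node(terms=["t1", "t2", "t3", "t4"])
print(r0.is_leaf())   # True until the first ant grows the sub-nodes r01 and r02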
7.4.1 First-Pass using Normalised Google Distance

The TTA begins clustering at the root node which consists of all n terms, r0 = {t1, ..., tn}. Each term can be considered as an element in the node. A TTA ant randomly picks a term, and proceeds to sense its similarity with every other term on that same node. The ant repeats this for all n terms until the similarities of all possible pairs of terms have been memorised. The similarity between two terms tx and ty is defined as:

s(tx, ty) = 1 − NGD(tx, ty)^α     (7.4)

where NGD(tx, ty) is the distance between terms tx and ty estimated using the original NGD defined in Equation 7.2, and α is a constant for scaling the distance between the two terms. The algorithm then grows two new sub-nodes to accommodate the two least similar terms ta and tb. The ant moves the first term ta from the main node rm to the first sub-node while emitting pheromones that trace back to tb in the process. The ant then follows the pheromone trail back to the second term tb to move it to the second sub-node. The second snapshot in Figure 7.1 shows two new sub-nodes r01 and r02. The ant moved the term t1 to r01 and the least similar term t6 to r02. Nonetheless, prior to the creation of new sub-nodes and the relocation of terms, an ideal intra-node similarity condition must be tested. The operation of moving the two least similar terms from the current node to create and initialise new sub-nodes is essentially a partitioning process. Eventually, each leaf node would end up with only one term if the TTA did not know when to stop. For this reason, we adopt an ideal intra-node similarity threshold sT for controlling the extent of branching out. Whenever an ant senses that the similarity between the two least similar terms exceeds sT, no further sub-nodes will be created and the partitioning process at that branch will cease. A high similarity (higher than sT) between the two most dissimilar terms in a node provides a simple but effective indication that the intra-node similarity has reached an ideal stage. More refined factors such as the mean and standard deviation of intra-node similarity are possible but have not been considered. If the similarity between the two most dissimilar terms is still less than sT, further branching out will be performed. In this case, the TTA ant repeatedly picks up the remaining terms on the current node one by one and senses their similarities with every other term already located in the sub-nodes. Formally, the probability of picking up term ti by an ant in the first-pass is defined as:

P¹pick(ti) = 1   if ti ∈ rm
           = 0   otherwise     (7.5)

where rm is the set of terms in the current node. In other words, the probability of picking up terms by an ant is always 1 as long as there are still terms remaining in the current node. Each term ti ∈ rm is moved to whichever of the two sub-nodes ru has the term tj ∈ ru with the highest similarity to ti. In other words, an ant considers multiple neighbourhoods prior to dropping a term. Snapshot 3 in Figure 7.1 illustrates the corresponding two sub-nodes r01 and r02 that have been populated with all the terms which were previously located at the current node r0.
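The similarity of Equation 7.4 only needs page counts; the sketch below stubs the search engine with invented counts (the terms, the counts and the total N are placeholders, not measurements) to show how Equations 7.2 and 7.4 fit together:

import math

# hypothetical page counts standing in for live search-engine queries
COUNTS = {("fox",): 2.1e8, ("wolf",): 1.3e8, ("fox", "wolf"): 2.4e7}
N = 1.0e10   # assumed total used to turn counts into the probability mass g

def G(*terms: str) -> float:
    # length of the Google code for the search term(s): log(1/g)
    return math.log(N / COUNTS[tuple(sorted(terms))])

def ngd(x: str, y: str) -> float:
    # Equation 7.2
    gx, gy, gxy = G(x), G(y), G(x, y)
    return (gxy - min(gx, gy)) / max(gx, gy)

def similarity(x: str, y: str, alpha: float = 1.0) -> float:
    # Equation 7.4: s(tx, ty) = 1 - NGD(tx, ty)^alpha
    return 1.0 - ngd(x, y) ** alpha

print(round(similarity("fox", "wolf"), 4))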
The standard neighbourhood function f(i) defined in Equation 7.3 represents the density of the neighbourhood as the average of the similarities between ti and every other term in its immediate surrounding (i.e. local neighbourhood) confined by s². Unlike the sense of basic ants, which covers only the surrounding cells s², the extent to which a TTA ant perceives covers all terms in the two sub-nodes (i.e. multiple neighbourhoods) corresponding to the immediate current node. Accordingly, instead of estimating f(i) as the averaged similarity defined over the s² terms surrounding the ant, the new neighbourhood function fTTA(ti, ru) is defined as the maximum similarity between term ti ∈ rm and the neighbourhood (i.e. sub-node) ru. The maximum similarity between ti and ru is the highest similarity between ti and all other terms tj ∈ ru. Formally, we define the density of neighbourhood ru with respect to term ti during the first-pass as:

f¹TTA(ti, ru) = maximum of s(ti, tj) w.r.t. tj ∈ ru     (7.6)

where the similarity between the two terms s(ti, tj) is computed using Equation 7.4. Besides deciding on whether to drop an object or not, as in the case of basic ants, the TTA ant has to decide on one additional issue, namely, where to drop. The TTA decides on where to drop a term based on the f¹TTA(ti, ru) that it has memorised for all sub-nodes ru of the current node rm. Formally, the decision on whether to drop term ti ∈ rm on sub-node rv depends on:

P¹drop(ti, rv) = 1   if f¹TTA(ti, rv) = maximum of f¹TTA(ti, ru) w.r.t. ru ∈ {rm1, rm2}
              = 0   otherwise     (7.7)

The current version of the TTA clustering algorithm is implemented in two parts. The first is the main function while the second is a recursive function. The main function is defined in Algorithm 5 while the recursive function for the first-pass elaborated in this subsection is reported in Algorithm 6.

Algorithm 5 Main function
1: input A list of terms, T = {t1, ..., tn}.
2: Create an initial tree with a root node r0 containing n terms.
3: Define the ideal intra-node similarity threshold sT and δT.
4: //first-pass using NGD
5: ant := new ant()
6: ant.ant_traverse(r0, r0)
7: //second-pass using noW
8: leaf_nodes := ant.pickup_trail() //return all leaf nodes marked by pheromones
9: for each rnext ∈ leaf_nodes do
10:   ant.ant_refine(leaf_nodes, rnext)

Algorithm 6 Function ant_traverse(rm, r0) using NGD
1: if |rm| = 1 then
2:   leave_trail(rm, r0) //leave trail from current leaf node to root node, for use in second-pass
3:   return //only one term left. return to root
4: {ta, tb} := find_most_dissimilar_terms(rm)
5: if s(ta, tb) > sT then
6:   leave_trail(rm, r0) //leave trail from current leaf node to root node, for use in second-pass
7:   return //ideal cluster has been achieved. return to root node
8: else
9:   {rm1, rm2} := grow_sub_nodes(rm)
10:   move_terms({ta, tb}, {rm1, rm2})
11:   for each term ti ∈ rm do
12:     pick(ti) //based on Eq. 7.5
13:     for each ru ∈ {rm1, rm2} do
14:       for each term tj ∈ ru do
15:         s(ti, tj) := sense_similarity(ti, tj) //based on Eq. 7.4
16:         remember_similarity(s(ti, tj))
17:       f¹TTA(ti, ru) := sense_neighbourhood() //based on Eq. 7.6
18:       remember_neighbourhood(f¹TTA(ti, ru))
19:     {∀u, f¹TTA(ti, ru)} := recall_neighbourhood()
20:     rv := decide_drop({∀u, f¹TTA(ti, ru)}) //based on Eq. 7.7
21:     drop({ti}, {rv})
22:   antm1 := new ant()
23:   antm1.ant_traverse(rm1, r0) //repeat the process recursively for each sub-node
24:   antm2 := new ant()
25:   antm2.ant_traverse(rm2, r0) //repeat the process recursively for each sub-node
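For readers who prefer running code to pseudocode, the following is a compact, simplified reading of the first-pass in Algorithm 6 (one recursive call plays the role of one ant, and the similarity function is passed in rather than sensed and memorised):

from itertools import combinations

def first_pass(terms, sim, s_T):
    # recursively partition a node of terms; returns the leaf clusters
    if len(terms) <= 1:
        return [terms]
    # the two least similar terms seed the two sub-nodes (Algorithm 6, line 4)
    ta, tb = min(combinations(terms, 2), key=lambda pair: sim(*pair))
    if sim(ta, tb) > s_T:
        return [terms]   # ideal intra-node similarity reached; stop branching
    sub = {ta: [ta], tb: [tb]}
    for t in terms:
        if t in (ta, tb):
            continue
        # drop t into the sub-node holding its most similar term (Eqs. 7.6 and 7.7)
        seed = max((ta, tb), key=lambda s: max(sim(t, u) for u in sub[s]))
        sub[seed].append(t)
    return first_pass(sub[ta], sim, s_T) + first_pass(sub[tb], sim, s_T)

# usage with any symmetric similarity in [0, 1], e.g. the NGD-based s() of Eq. 7.4
toy_sim = lambda a, b: 1.0 if a[0] == b[0] else 0.2
print(first_pass(["apple", "apricot", "banana", "blueberry"], toy_sim, 0.8))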
7.4.2 n-degree of Wikipedia: A New Distance Metric

The use of NGD for quantifying the similarity between two objects based on their names alone can occasionally produce low-quality clusters. We will highlight some of these discrepancies during our initial experiments in the next section. The initial tree of clusters generated by the TTA using NGD demonstrated promising results. Nonetheless, we reckoned that higher-quality clusters could be generated if we allowed the TTA ants to visit the nodes again for the purpose of refinement. Instead of using NGD, we present a new way to gauge the similarity between terms. Google can be regarded as the gateway to the huge volume of documents on the World Wide Web. The sheer size of Google’s index enables a relatively reliable estimate of term usage and occurrence using NGD. The page counts provided by the Google search engine, which are the essence of NGD, are used to compute the similarity between two terms based on the mutual information that they both share at the compressed level. As for Wikipedia, its number of articles is only a fraction of what Google indexes. Nonetheless, the restrictions imposed on the authoring of Wikipedia’s articles and their organisation provide a possibly new way of looking at similarity between terms. n-degree of Wikipedia (noW) [272] is inspired by a game for Wikipedians. 6-degree of Wikipedia (http://en.wikipedia.org/wiki/Six_Degrees_of_Wikipedia) is a task set out to study the characteristics of Wikipedia in terms of the similarity between its articles. An article in Wikipedia can be regarded as an entry of encyclopaedic information describing a particular topic. The articles are organised using categorical indices which eventually lead to the highest level, namely, “Categories” (http://en.wikipedia.org/wiki/Category:Categories). Each article can appear under more than one category. Hence, the organisation of articles in Wikipedia appears more as a directed acyclic graph with a root node than as a pure tree structure (http://en.wikipedia.org/wiki/Wikipedia:Categorization#Categories_do_not_form_a_tree). The huge volume of articles in Wikipedia, the organisation of articles in a graph structure, the open-source nature of the articles, and the availability of the articles in electronic form make Wikipedia the ideal candidate for our endeavour. We define Wikipedia as a directed graph W := (V, E). W is essentially a network of linked articles where V = {a1, ..., aω} is the set of articles. We limit the vertices to English articles only. At the moment, ω = |V| is reported to be 1,384,729 (http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons), making it the largest encyclopaedia (http://en.wikipedia.org/wiki/Wikipedia:Largest_encyclopedia) in merely five years since its conception. The interconnections between articles are represented as the set of ordered pairs of vertices E. At the moment, the edges are uniformly assigned the weight 1. Each article can be considered as an elaboration of a particular event, an entity or an abstract idea. In this sense, an article in Wikipedia is a manifestation of the information encoded in the terms. Consequently, we can represent each term ti using the corresponding article ai ∈ V in Wikipedia.
Hence, the problem of finding the distance between two terms ti and tj can be reduced to discovering how closely situated the two corresponding articles ai and aj are in the Wikipedia categorical indices. The problem of finding the degree of separation between two articles can be addressed in terms of the single-source shortest path problem. Since the weights are all positive, we have resorted to Dijkstra’s algorithm for finding the shortest path between two vertices (i.e. articles). Other algorithms for the shortest-path problem are available. However, a discussion on these algorithms is beyond the scope of this chapter. Formally, the noW value between terms tx and ty is defined as:

noW(tx, ty) = δ(ax, ay), where
δ(ax, ay) = Σ_{k=1}^{|SP|} cek   if ax ≠ ay ∧ ax, ay ∈ V
          = 0                    if ax = ay ∧ ax, ay ∈ V
          = ∞                    otherwise     (7.8)

where δ(ax, ay) is the degree of separation between the articles ax and ay which correspond to the terms tx and ty, respectively. The degree of separation is computed as the sum of the costs of all edges along the shortest path between articles ax and ay in the graph of Wikipedia articles W. SP is the set of edges along the shortest path, ek is the k-th edge or element in the set SP, |SP| is the number of edges along the shortest path, and cek is the cost associated with the k-th edge. It is also worth mentioning that while δ(ax, ay) ≥ 0 for ax, ay ∈ V, no upper bound can be ascertained. The noW value between terms that do not have corresponding articles in Wikipedia is set to ∞. There is a hypothesis (http://tools.wikimedia.de/sixdeg/index.jsp) stating that no two articles in Wikipedia are separated by more than six degrees. However, some Wikipedians have shown that certain articles can be separated by up to eight steps (http://en.wikipedia.org/wiki/Six_Degrees_of_Wikipedia). This is the reason why we adopted the name n-degree of Wikipedia instead of 6-degree of Wikipedia.
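Equation 7.8 amounts to a shortest-path query over the article graph W; the sketch below runs Dijkstra’s algorithm over a tiny hand-made graph with uniform edge costs of 1 (the articles and links are invented for the example and are not real Wikipedia data):

import heapq

def shortest_path_cost(graph, src, dst):
    # Dijkstra over a dict {article: {neighbouring_article: edge_cost}}
    if src == dst:
        return 0.0
    dist, heap = {src: 0.0}, [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for nb, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nb, float("inf")):
                dist[nb] = nd
                heapq.heappush(heap, (nd, nb))
    return float("inf")   # no connecting path

def noW(tx, ty, term_to_article, graph):
    ax, ay = term_to_article.get(tx), term_to_article.get(ty)
    if ax is None or ay is None:
        return float("inf")   # no corresponding article in Wikipedia
    return shortest_path_cost(graph, ax, ay)

graph = {"Dove": {"Birds": 1}, "Eagle": {"Birds": 1}, "Birds": {"Dove": 1, "Eagle": 1}}
articles = {"dove": "Dove", "eagle": "Eagle"}
print(noW("dove", "eagle", articles, graph))   # two degrees of separation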
7.4.3 Second-Pass using n-degree of Wikipedia

Upon completing the first-pass, there are at most n leaf nodes; in that extreme case each term in the initial set of all terms T ends up in an individual node (i.e. cluster). There are only two possibilities for such extreme cases. The first is when the ideal intra-node similarity threshold sT is set too high, while the second is when all the terms are extremely unrelated. In normal cases, most of the terms will be nicely clustered into nodes with intra-node similarities exceeding sT. Only a small number of terms is usually isolated into individual nodes. We refer to these terms as isolated terms. There are two possibilities that lead to isolated terms in normal cases, namely, (1) the term has been displaced during the first-pass due to discrepancies related to NGD, or (2) the term is in fact an outlier. The TTA ants leave pheromone trails on their return trip to the root node (as in lines 2 and 6 of Algorithm 6) to mark the paths to the leaf nodes. In order to relocate the isolated terms to other, more suitable nodes, the TTA ants return to the leaf nodes by following the pheromone trails. At each leaf node rl, the probability of picking up a term ti during the second-pass is 1 if the leaf node has only one term (an isolated term):

P²pick(ti) = 1   if |rl| = 1 ∧ ti ∈ rl
           = 0   otherwise     (7.9)

After picking up an isolated term, the TTA ant continues to move from one leaf node to the next. At each leaf node, the ant determines whether that particular leaf node (i.e. neighbourhood) rl is the most suitable one to house the isolated term ti based on the average distance between ti and all other existing terms in rl. Formally, the density of neighbourhood rl with respect to the isolated term ti during the second-pass is defined as:

f²TTA(ti, rl) = ( Σ_{j=1}^{|rl|} noW(ti, tj) ) / |rl|     (7.10)

where |rl| is the number of terms in the leaf node rl and the noW value between the two terms ti and tj is computed using Equation 7.8. This process of sensing the distance of the isolated term with all other terms in a leaf node is performed for all leaf nodes. The probability of the ant dropping the isolated term ti on the most suitable leaf node rv is evaluated once the ant returns to the original leaf node that used to contain ti. Back at the original leaf node of ti, the ant recalls the neighbourhood density f²TTA(ti, rl) that it has memorised for all neighbourhoods (i.e. leaf nodes). The TTA ant drops the isolated term ti on the leaf node rv if all terms in rv collectively yield the minimum average distance with ti that satisfies the outlier discrimination threshold δT. Formally,

P²drop(ti, rv) = 1   if (f²TTA(ti, rv) = minimum of f²TTA(ti, rl) w.r.t. rl ∈ L) ∧ (f²TTA(ti, rv) ≤ δT)
              = 0   otherwise     (7.11)

where L is the set of all leaf nodes. After the ant has visited all the leaf nodes and has failed to drop the isolated term, the term will be returned to its original location. The failure to drop the isolated term in a more suitable node indicates that the term is an outlier. Referring back to the example in Figure 7.1, assume that snapshot 5 represents the end of the first-pass where the intra-node similarity of all nodes has satisfied sT. While all other leaf nodes, namely, r011, r012 and r021, consist of multiple terms, leaf node r022 contains only one term, t6. Hence, at the end of the first-pass, all ants, namely, ant1, ant2, ant3 and ant4, retreat back to the root node r0. Then, during the second-pass, one TTA ant is deployed to relocate the isolated term t6 from r022 to either leaf node r011, r012 or r021, depending on the average distances of these leaf nodes with respect to t6. The algorithm for the second-pass using noW is described in Algorithm 7. Unlike the ant_traverse() function in Algorithm 6, where each new sub-node is processed as a separate iteration of ant_traverse() using an independent TTA ant, there is only one ant required throughout the second-pass.

Algorithm 7 Function ant_refine(leaf_nodes, ru) using noW
1: if |ru| = 1 then
2:   //current leaf node has isolated term ti
3:   pick(ti) //based on Eq. 7.9
4:   for each rl ∈ leaf_nodes do
5:     for each term tj in current leaf node rl do
6:       //jump from one leaf node to the next to sense neighbourhood density
7:       δ(ti, tj) := sense_distance(ti, tj) //based on Eq. 7.8
8:       remember_distance(δ(ti, tj))
9:     f²TTA(ti, rl) := sense_neighbourhood() //based on Eq. 7.10
10:    remember_neighbourhood(f²TTA(ti, rl))
11:   //back to original leaf node of term ti after visiting all other leaves
12:   {∀l, f²TTA(ti, rl)} := recall_neighbourhood()
13:   rv := decide_drop({∀l, f²TTA(ti, rl)}) //based on Eq. 7.11
14:   if rv not null then
15:     drop({ti}, {rv}) //drop at ideal leaf node
16:   else
17:     drop({ti}, {ru}) //outlier. no ideal leaf node. drop back at original leaf node

7.5 Evaluations and Discussions

In this section, we focus on evaluations at the conceptual layer of ontologies to verify the taxonomic structures discovered using TTA. We employ three existing metrics. The first is known as Lexical Overlap (LO) for evaluating the intersection between the discovered concepts (Cd) and the recommended (i.e. manually created) concepts (Cm) [164]. The manually created concepts can be regarded as the reference for our evaluations. LO is defined as:

LO = |Cd ∩ Cm| / |Cm|     (7.12)

Some minor changes were made in terms of how the intersection between the set of recommended clusters and discovered clusters (i.e. Cd ∩ Cm) is computed.
The normal way of having exact lexical matching of the concept identifiers cannot be applied to our experiments. Due to the ability of the TTA to discover concepts with varying levels of granularity depending on sT, we have to take into consideration the possibility of sub-clusters that collectively correspond to some recommended clusters. For our evaluations, the presence of discovered sub-clusters that correspond to some recommended cluster is considered a valid intersection. In other words, given that Cd = {c1, ..., cn} and Cm = {cx} where cx ∉ Cd, then |Cd ∩ Cm| = 1 if c1 ∪ ... ∪ cn = cx.

The second metric is used to account for valid discovered concepts that are absent from the reference set, while the third metric ensures that concepts which exist in the reference set but are not discovered are also taken into consideration. The second metric is referred to as Ontological Improvement (OI) and the third metric is known as Ontological Loss (OL). They are defined as [214]:

OI = |Cd − Cm| / |Cm|     (7.13)

OL = |Cm − Cd| / |Cm|     (7.14)

Ontology learning is an incremental process that involves the continuous maintenance of the ontology every time new terms are added. As such, we do not see the clustering of large datasets as a problem. In this section, we employ seven datasets to assess the quality of the discovered clusters using the three metrics described above. The origin of the datasets and some brief descriptions are provided below:
• Three of the datasets used for our experiments were obtained from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). These sets are labelled as WINE 15T, MUSHROOM 16T and DISEASE 20T. The accompanying numerical attributes, which were designed for use with feature-based similarities, were removed.
• We also employ the original animals dataset (i.e. ANIMAL 16T) proposed for use with Self-Organising Maps (SOMs) by Ritter & Kohonen [209].
• We constructed the remaining three datasets, called ANIMALGOOGLE 16T, MIX 31T and MIX 60T. ANIMALGOOGLE 16T is similar to the ANIMAL 16T dataset except for a single replacement with the term “Google”. The other two MIX datasets consist of a mixture of terms from a large number of domains.

Table 1. Summary of the datasets employed for experiments. Column Cm are the recommended clusters and Cd are clusters automatically discovered using TTA.

Table 2. Summary of the evaluation results for all ten experiments using the three metrics LO, OI and OL.

Table 1 summarises the datasets employed for our experiments. The column Cm are the recommended clusters and Cd are clusters automatically discovered using TTA. Table 2 summarises the evaluation of TTA using the three metrics for all ten experiments.
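The three metrics are easy to compute once clusters are represented as sets of terms; the sketch below follows Equations 7.12 to 7.14 under one reasonable reading of the relaxed intersection rule described above (a union of discovered sub-clusters that exactly reproduces a recommended cluster counts as a match). The example clusters are invented:

def lo_oi_ol(discovered, recommended):
    # discovered, recommended: lists of clusters, each cluster a frozenset of terms
    matched_rec, used_disc = set(), set()
    for rec in recommended:
        parts = [d for d in discovered if d <= rec]          # sub-clusters contained in rec
        if parts and frozenset().union(*parts) == rec:       # their union recreates rec
            matched_rec.add(rec)
            used_disc.update(parts)
    lo = len(matched_rec) / len(recommended)                                      # Eq. 7.12
    oi = sum(1 for d in discovered if d not in used_disc) / len(recommended)      # Eq. 7.13
    ol = sum(1 for r in recommended if r not in matched_rec) / len(recommended)   # Eq. 7.14
    return lo, oi, ol

recommended = [frozenset({"red", "shiraz"}), frozenset({"white", "riesling"})]
discovered = [frozenset({"red"}), frozenset({"shiraz"}),
              frozenset({"white", "riesling"}), frozenset({"rose"})]
print(lo_oi_ol(discovered, recommended))   # (1.0, 0.5, 0.0)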
The high lexical overlap (LO) shows the good domain coverage of the discovered clusters. The occasionally high ontological improvement (OI) demonstrates the ability of TTA to highlight new, interesting concepts that were ignored during the manual creation of the recommended clusters. During the experiments, snapshots were produced to show the results in two parts: results after the first-pass using NGD, and results after the second-pass using noW.

The first experiment uses WINE 15T. The original dataset has 178 nameless instances spread out over 3 clusters. Each instance has 13 attributes for use with feature-based similarity measures. We augment the dataset by introducing famous names in the wine domain and removing their numerical attributes. We maintained the three clusters, namely, “white”, “red” and “mix”. “Mix” refers to wines that were named after famous wine regions around the world. Such wines can either be red or white. As shown in Figure 7.2, setting sT = 0.92 produces five clusters. Clusters A and D are actually sub-clusters of the recommended cluster “red”, while Clusters C and E are sub-clusters of the recommended cluster “white”. Cluster B corresponds exactly to the recommended cluster “mix”.

Figure 7.2: Experiment using 15 terms from the wine domain. Setting sT = 0.92 results in 5 clusters. Cluster A is simply red wine grapes or red wines, while Cluster E represents white wine grapes or white wines. Cluster B represents wines named after famous regions around the world and they can either be red, white or rose. Cluster C represents white noble grapes for producing great wines. Cluster D represents red noble grapes. Even though uncommon, Shiraz is occasionally admitted to this group.

The second experiment uses MUSHROOM 16T. The original dataset has 8124 nameless instances spread out over two clusters. Each instance has 22 nominal attributes for use with feature-based similarity measures. We augment the dataset by introducing names of mushrooms that fit into one of the two recommended clusters, namely, “edible” and “poisonous”. As shown in Figure 7.3, setting sT = 0.89 produces 4 clusters. Cluster A corresponds exactly to the recommended cluster “poisonous”. The remaining three clusters are actually sub-clusters of the recommended cluster “edible”. Cluster B contains edible mushrooms prominent in East Asia, while Clusters C and D comprise mushrooms found mostly in North America and Europe, and are prominent in Western cuisines.

Figure 7.3: Experiment using 16 terms from the mushroom domain. Setting sT = 0.89 results in 4 clusters. Cluster A represents poisonous mushrooms. Cluster B comprises edible mushrooms which are prominent in East Asian cuisine except for Agaricus Blazei. Nonetheless, this mushroom was included in this cluster probably due to its high content of beta glucan for potential use in cancer treatment, just like Shiitake. Moreover, China is the major exporter of Agaricus Blazei, also known as Himematsutake, further relating this mushroom to East Asia. Clusters C and D comprise edible mushrooms found mainly in Europe and North America, and are more prominent in Western cuisines.

Similarly, the third experiment was conducted using DISEASE 20T with the results shown in Figure 7.4. At sT = 0.86, TTA discovered hidden sub-clusters within the four recommended clusters, namely, “skin”, “blood”, “cardiovascular” and “digestion”. In relation to this, Handl et al. [99] highlighted a shortcoming in their
evaluation of ant-based clustering algorithms. The authors stated that the algorithm “...only manages to identify these upper-level structures and fails to further distinguish between groups of data within them.”. In other words, unlike existing ant-based algorithms, the first three experiments demonstrated that our TTA has the ability to further distinguish hidden structures within clusters.

Figure 7.4: Experiment using 20 terms from the disease domain. Setting sT = 0.86 results in 7 clusters. Cluster A represents skin diseases. Cluster B represents a class of blood disorders known as anaemia. Cluster C represents other kinds of blood disorders. Cluster D represents blood disorders characterised by the relatively low count of leukocytes (i.e. white blood cells) or platelets. Cluster E represents digestive diseases. Cluster F represents cardiovascular diseases characterised by both the inflammation and thrombosis (i.e. clotting) of arteries and veins. Cluster G represents cardiovascular diseases characterised by the inflammation of veins only.

The fourth and fifth experiments were conducted using the ANIMAL 16T dataset. This dataset has been employed to evaluate both the standard ant-based clustering (SACA) and the improved version called A²CA by Vizine et al. [263]. The original dataset consists of 16 named instances, each representing an animal using binary feature attributes. Both SACA and A²CA discovered two natural clusters, one for “mammal” and the other for “bird”. While SACA was inconsistent in its results, A²CA yielded a 100% recall rate over ten runs. The authors of A²CA stated that the dataset can also be represented as three recommended clusters. In the spirit of the evaluation by Vizine et al., we performed the clustering of the 16 animals using TTA over ten runs. In our case, no features were used. Just like all experiments in this chapter, the 16 animals were clustered based on their names. As shown in the fourth experiment in Figure 7.5, by setting sT = 0.60, the TTA automatically discovered the two recommended clusters after the second-pass: “bird” and “mammal”. While ant-based techniques are known for their intrinsic capability of identifying clusters automatically, conventional clustering techniques (e.g. K-means, average link agglomerative clustering) rely on the specification of the number of clusters [99]. The inability to control the desired number of natural clusters can be troublesome. According to Vizine et al. [263], “in most cases, they generate a number of clusters that is much larger than the natural number of clusters”. Unlike both extremes, TTA has flexibility in regard to the discovery of clusters. The granularity and number of discovered clusters in TTA can be adjusted by simply modifying the threshold sT. By setting a higher sT, the number of discovered clusters for ANIMAL 16T is increased to five as shown in Figure 7.6. A lower value of the desired ideal intra-node similarity sT results in less branching out and hence, fewer clusters. Conversely, setting a higher sT produces more tightly coupled terms where the similarities between elements in the leaf nodes are very high. In the
fifth experiment depicted in Figure 7.6, the value of sT was raised to 0.72 and more refined clusters were discovered: “bird”, “mammal hoofed”, “mammal kept as pet”, “predatory canine” and “predatory feline”.

Figure 7.5: Experiment using 16 terms from the animal domain. Setting sT = 0.60 produces 2 clusters. Cluster A comprises birds and Cluster B represents mammals.

Figure 7.6: Experiment using 16 terms from the animal domain (the same dataset from the experiment in Figure 7.5). Setting sT = 0.72 results in 5 clusters. Cluster A represents birds. Cluster B includes hoofed mammals (i.e. ungulates). Cluster C corresponds to predatory felines while Cluster D represents predatory canines. Cluster E constitutes animals kept as pets.

The next three experiments were conducted using the ANIMALGOOGLE 16T dataset. These three experiments are meant to reveal another advantage of TTA through the presence of an outlier, namely, the term “Google”. An outlier can simply be considered as a term that does not fit into any of the clusters. In Figure 7.7, TTA successfully isolated the term “Google” while discovering clusters at different levels of granularity based on different sT. As similar terms are clustered into the same node, outliers are eventually singled out as isolated terms in individual leaf nodes. Consequently, unlike some conventional techniques such as K-means [282], clustering using TTA is not susceptible to poor results due to outliers. In fact, there are two ways of looking at the term “Google”: as an outlier as described above, or as an extremely small cluster with one term. Either way, the term “Google” demonstrates two abilities of TTA: the capability of identifying and isolating outliers, and tolerance to differing cluster sizes like its predecessors. Handl et al. [99] have shown through experiments that certain conventional clustering techniques such as K-means and one-dimensional self-organising maps perform poorly in the face of increasing deviations between cluster sizes.

Figure 7.7: Experiment using 15 terms from the animal domain plus an additional term “Google”. Setting sT = 0.58 (left screenshot), sT = 0.60 (middle screenshot) and sT = 0.72 (right screenshot) results in 2 clusters, 3 clusters and 5 clusters, respectively. In the left screenshot, Cluster A acts as the parent for the two recommended clusters “bird” and “mammal”, while Cluster B includes the term “Google”. In the middle screenshot, the recommended clusters “bird” and “mammal” were clearly reflected through Clusters A and C respectively. By setting sT higher, we dissected the recommended cluster “mammal” to obtain the discovered sub-clusters C, D and E as shown in the right screenshot.

The last two experiments were conducted using MIX 31T and MIX 60T. Figure 7.8 shows the results after the first-pass and second-pass using 31 terms while Figure 7.9 shows the final results using 60 terms. Similar to the previous experiments, the first-pass resulted in a number of clusters plus some isolated terms. The second-pass aims to relocate these isolated terms to the most appropriate clusters. Despite the rise in the number of terms from 31 to 60, all the clusters formed by the TTA after the second-pass correspond precisely to their occurrences in real life (i.e. natural clusters). With the absolute consistency of the results over ten runs, these two experiments yield 100% recall just like the previous experiments. Consequently, we can claim that TTA is able to produce consistent results, unlike the standard ant-based clustering where the solution does not stabilise and fails to converge. For example, in the evaluation by Vizine et al. [263], the standard ant-based clustering was inconsistent in its performance over the ten runs using the ANIMAL 16T dataset. This is a very common problem in ant-based clustering when “they constantly construct and deconstruct clusters during the iterative procedure of adaptation” [263]. There is also another advantage of TTA that is not found in the standard ants, namely, the ability to identify taxonomic relations between clusters.
Referring to all ten experiments conducted, we noticed that there is implicit hierarchical information that connects the discovered clusters. For example, referring to the most recent experiment in Figure 7.8, the two discovered Clusters A (which contains “Sandra Bullock”, “Jackie Chan” and “Brad Pitt”) and B (which contains “3 Doors Down”, “Aerosmith” and “Rod Stewart”) after the second-pass share the same parent node. We can employ the graph of Wikipedia articles W to find the nearest common ancestor of the two natural clusters and label it with the category name provided by Wikipedia. In our case, we can label the parent node of the two natural clusters as “Entertainers”. In fact, the natural clusters themselves can be named using the same approach. For example, the terms in the discovered Cluster B (which contains “3 Doors Down”, “Aerosmith” and “Rod Stewart”) fall under the same category “American musicians” in Wikipedia and hence, we can label this cluster using that category name. In other words, clustering using TTA with the help of NGD and noW not only produces flexible and consistent natural clusters, but is also able to identify implicit taxonomic relations between clusters. Nonetheless, we would like to point out that not all hierarchies of natural clusters formed by the TTA correspond to real-life hierarchical relations. More research is required to properly validate this capability of the TTA.

Figure 7.8: Experiment using 31 terms from various domains. Setting sT = 0.70 results in 8 clusters. Cluster A represents actors and actresses. Cluster B represents musicians. Cluster C represents countries. Cluster D represents politics-related notions. Cluster E is transport. Cluster F includes finance and accounting matters. Cluster G constitutes technology and services on the Internet. Cluster H represents food.

One can notice that in all the experiments in this section, the quality of the clustering output using TTA would be less desirable if we were to rely only on the results from the first-pass. As pointed out earlier, the second-pass is necessary to produce naturally-occurring clusters. The results after the first-pass usually contain isolated terms due to discrepancies in NGD. This is mainly due to the appearance of words and the popularity of word pairs that are not natural. For example, given the words “Fox”, “Wolf” and “Entertainment”, the first two should go together naturally. Unfortunately, due to the popularity of the name “Fox Entertainment”, a Google search using the pair “Fox” and “Wolf” generates a lower page count compared to “Fox” and “Entertainment”. A lower page count has adverse effects on Equation 7.2, resulting in lower similarity.
Using Equation 7.4, “Fox” and “Entertainment” achieve a similarity of 0.7488 while “Fox” and “Wolf” yield a lower similarity of 0.7364. Despite such shortcomings, search engine page counts and Wikipedia offer TTA the ability to handle technical terms and common words of any domain, regardless of whether they have been around for some time or are merely beginning to evolve into common use on the Web. Due to the reliance on names or nouns alone for clustering, some readers may question the ability of TTA to handle various linguistic issues such as synonyms and word senses. Looking back at Figure 7.4, the terms “Buerger’s disease” and “Thromboangiitis obliterans” are actually synonyms referring to the acute inflammation and thrombosis (clotting) of arteries and veins of the hands and feet. In the context of the experiment in Figure 7.2, the term “Bordeaux” was treated as “Bordeaux wine” instead of the “city of Bordeaux”, and was successfully clustered together with other wines from other famous regions such as “Burgundy”. In another experiment in Figure 7.9, the same term “Bordeaux” was automatically disambiguated and treated as a port city in the southwest of France instead. The TTA then automatically clustered this term together with other cities in France such as “Chamonix” and “Paris”. In short, TTA has the inherent capability of coping with synonyms, word senses and fluctuations in term usage.

Figure 7.9: Experiment using 60 terms from various domains. Setting sT = 0.76 results in 20 clusters. Clusters A and B represent herbs. Cluster C comprises pastry dishes while Cluster D represents dishes of Italian origin. Cluster E represents computing hardware. Cluster F is a group of politicians. Cluster G represents cities or towns in France while Cluster H includes countries and states other than France. Cluster I constitutes trees of the genus Eucalyptus. Cluster J represents marsupials. Cluster K represents finance and accounting matters. Cluster L comprises transports with four or more wheels. Cluster M includes plant organs. Cluster N represents beverages. Cluster O represents predatory birds. Cluster P comprises birds other than predatory birds. Cluster Q represents two-wheeled transports. Clusters R and S represent predatory mammals. Cluster T includes trees of the genus Acacia.

The quality of the clustering results is very much dependent on the choice of sT and, to a lesser extent, δT. Nonetheless, as an effective rule of thumb, sT should be set as high as possible. A higher sT will result in more leaf nodes, each having a possibly smaller number of terms that are tightly coupled together. A high sT will also enable the isolation of potential outliers. The isolated terms and outliers generated by a high sT can then be further refined in the second-pass. The ideal range of sT derived through our experiments is within 0.60 to 0.90. Setting sT too low will result in very coarse clusters like the ones shown in Figure 7.5 where potential sub-clusters are left uncovered. Regarding the value of δT, it is usually set inversely proportional to sT. As shown during our evaluations, the higher we set sT, the more we decrease the value of δT. The reason behind the choices of these two threshold values can be explained as follows: as we lower sT, TTA produces coarser clusters with loosely coupled terms.
The intra-node distance of such clusters is inevitably higher compared to that of the finer clusters because the terms in these coarse clusters are more likely to be less similar. In order for the second-pass to function appropriately during the relocation of isolated terms and the isolation of outliers, δT has to be set comparatively higher. Besides, a lower sT will not provide adequate discriminative ability for the TTA to distinguish or pick out the outliers. Another interesting point about sT is that setting it to the maximum (i.e. 1.0) results in a divisive clustering effect. In divisive clustering, the process starts with one all-inclusive cluster and, at each step, splits the cluster until only singleton clusters of individual terms remain [242].

7.6 Conclusion and Future Work

In this chapter, we introduced a decentralised multi-agent system for term clustering in ontology learning. Unlike document clustering or other forms of clustering in pattern recognition, clustering terms in ontology learning requires a different approach. The most evident adjustment required in term clustering is the measure of similarity and distance. Existing term clustering techniques in many ontology learning systems remain confined within the realm of conventional clustering algorithms and feature-based similarity measures. Since there is no explicit feature attached to terms, these existing techniques have come to rely on contextual cues surrounding the terms. These clustering techniques require extremely large collections of domain documents to reliably extract contextual cues for the computation of similarity matrices. In addition, the static background knowledge required for term clustering, such as WordNet, patterns and templates, makes such techniques even more difficult to scale across domains. Consequently, we introduced the use of featureless similarity and distance measures called Normalised Google Distance (NGD) and n-degree of Wikipedia (noW) for term clustering. The use of these two measures as part of a new multi-pass clustering algorithm called the Tree-Traversing Ant (TTA) demonstrated excellent results during our evaluations. Standard ant-based techniques exhibit certain characteristics that have been shown to be useful and superior compared to conventional clustering techniques. The TTA is the result of an attempt to inherit these strengths while avoiding some inherent drawbacks. In the process, certain advantages from conventional divisive clustering were incorporated, resulting in the appearance of a hybrid between ant-based and conventional algorithms. Seven of the most notable strengths of the TTA with NGD and noW are (1) the ability to further distinguish hidden structures within clusters, (2) flexibility in regard to the discovery of clusters, (3) the capability of identifying and isolating outliers, (4) tolerance to differing cluster sizes, (5) the ability to produce consistent results, (6) the ability to identify implicit taxonomic relations between clusters, and (7) the inherent capability of coping with synonyms, word senses and fluctuations in term usage. Nonetheless, much work is still required in certain aspects. One of the main items of future work we have planned is to ascertain the validity of, and make good use of, the implicit hierarchical relations discovered using TTA. The next issue that interests us is the automatic labelling of the natural clusters and the nodes in the hierarchy using Wikipedia.
Labelling has always been a hard problem in clustering, especially document and term clustering. We are also keen on conducting more studies on the interaction between the two thresholds in TTA, namely, sT and δT. If possible, we intend to find ways to enable the automatic adjustment of these threshold values to maximise the quality of the clustering output.

7.7 Acknowledgement

This research was supported by the Australian Endeavour International Postgraduate Research Scholarship, and a Research Grant 2006 from the University of Western Australia. The authors would like to thank the anonymous reviewers for their invaluable comments.

7.8 Other Publications on this Topic

Wong, W., Liu, W. & Bennamoun, M. (2006) Featureless Similarities for Terms Clustering using Tree-Traversing Ants. In the Proceedings of the International Symposium on Practical Cognitive Agents and Robots (PCAR), Perth, Australia. This paper reports the preliminary work on clustering terms using featureless similarity measures. The resulting clustering technique, called TTA, was later refined to contribute towards the core contents of Chapter 7.

Wong, W., Liu, W. & Bennamoun, M. (2008) Featureless Data Clustering. In M. Song and Y. Wu (eds.), Handbook of Research on Text and Web Mining Technologies, IGI Global. The research on TTA reported in Chapter 7 was generalised in this book chapter to work with both terms and Internet domain names.

CHAPTER 8
Relation Acquisition

Abstract

Common techniques for acquiring semantic relations rely on static domain and linguistic resources, predefined patterns, and the presence of syntactic cues. This chapter proposes a hybrid technique which brings together established and novel techniques in lexical simplification, word disambiguation and association inference for acquiring coarse-grained relations between potentially ambiguous and composite terms using only dynamic Web data. Our experiments using terms from two different domains demonstrate promising preliminary results.

8.1 Introduction

Relation acquisition, also known as relation extraction or relation discovery, is an important aspect of ontology learning. Traditionally, semantic relations are either extracted as verbs based on grammatical structures [217], induced through term co-occurrence using large text corpora [220], or discovered in the form of unnamed associations through cluster analysis [212]. Challenges faced by conventional techniques include (1) the reliance on static patterns and text corpora together with rare domain knowledge, (2) the need for named entities to guide relation acquisition, (3) the difficulty of classifying composite or ambiguous names into the required categories, and (4) the dependence on grammatical structures and the presence of verbs, which can result in indirect, implicit relations being overlooked. In recent years, there has been a growing trend in relation acquisition using Web data such as Wikipedia [245] and online ontologies (e.g. Swoogle) [213] to partially address the shortcomings of conventional techniques. In this chapter, we propose a hybrid technique which integrates lexical simplification, word disambiguation and association inference for acquiring semantic relations using only Web data (i.e. Wikipedia and Web search engines) for constructing lightweight domain ontologies.
The proposed technique performs an iterative process of term mapping and term resolution to identify coarse-grained relations between domain terms. The main contribution of this chapter is the resolution phase, which allows our relation acquisition technique to handle complex and ambiguous terms, and terms not covered by our background knowledge on the Web. The proposed technique can be used to complement conventional techniques for acquiring fine-grained relations and to automatically extend online structured data such as Wikipedia. The rest of the chapter is structured as follows. Sections 8.2 and 8.3 present existing work related to relation acquisition, and the details of our technique, respectively. The outcome of the initial experiment is summarised in Section 8.4. We conclude this chapter in Section 8.5. (This chapter appeared in the Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Bangkok, Thailand, 2009, with the title “Acquiring Semantic Relations using the Web for Constructing Lightweight Ontologies”.)

8.2 Related Work

Techniques for relation acquisition can be classified as symbolic-based, statistics-based or a hybrid of both. The use of linguistic patterns enables the discovery of fine-grained semantic relations. For instance, Poesio & Almuhareb [201] developed specific lexico-syntactic patterns to discover named relations such as part-of and causation. However, linguistic-based techniques using static rules tend to face difficulties in coping with the structural diversity of a language. The technique by Sanchez & Moreno [217] for extracting verbs as potential named relations is restricted to handling verbs in simple tenses and verb phrases which do not contain modifiers such as adverbs. In order to identify indirect relations, statistics-based techniques such as co-occurrence analysis and cluster analysis are necessary. Co-occurrence analysis employs the redundancy in large text corpora to detect the presence of statistically significant associations between terms. However, the textual resources required by such techniques are difficult to obtain, and remain static over a period of time. For example, Schutz & Buitelaar [220] manually constructed a corpus for the football domain containing only 1,219 documents from an online football site for relation acquisition. Cluster analysis [212], on the other hand, requires tremendous computational effort in preparing features from texts for similarity measurement. The lack of emphasis on indirect relations is also evident in existing techniques. Many relation acquisition techniques in information extraction acquire semantic relations with the guidance of named entities [229]. Relation acquisition techniques which require named entities have restricted applicability since many domain terms with important relations cannot be easily categorised. In addition, the common practice of extracting triples using only patterns and grammatical structures tends to disregard relations between syntactically unrelated terms. In view of the shortcomings of conventional techniques, there is a growing trend in relation acquisition which favours the exploration of rich, heterogeneous Web data over the use of static, rare background knowledge. SCARLET [213], which stemmed from a work in ontology matching, follows this paradigm by harvesting online ontologies on the Semantic Web to discover relations between concepts.
Sumida et al. [245] developed a technique for extracting a large set of hyponymy relations in Japanese using the hierarchical structures of Wikipedia. There is also a group of researchers who employ Web documents as input for relation acquisition [115]. Similar to the conventional techniques, this group of work still relies on the ubiquitous WordNet and other domain lexicons for determining the proper level of abstraction and the labelling of relations between the terms extracted from Web documents. Pei et al. [196] employed predefined local (i.e. WordNet) and online ontologies to name the unlabelled associations between concepts in Wikipedia. The labels are acquired through a mapping process which attempts to find lexical matches for Wikipedia concepts in the predefined ontologies. The obvious shortcomings include the inability to handle complex and new terms which do not have lexical matches in the predefined ontologies.

8.3 A Hybrid Technique for Relation Acquisition

Figure 8.1: An overview of the proposed relation acquisition technique. The main phases are term mapping and term resolution, represented by black rectangles. The three steps involved in resolution are simplification, disambiguation and inference. The techniques represented by the white rounded rectangles were developed by the authors, while existing techniques and resources are shown using grey rounded rectangles.

The proposed relation acquisition technique is composed of two phases, namely, term mapping and term resolution.

Algorithm 8 termMap(t, WT, M, root, iteration)
1: rslt := map(t)
2: if iteration equals to 1 then
3:   if rslt equals to undef then
4:     if t is multi-word then return composite
5:     else return non-existent
6:   else if rslt equals to Nt = (Vt, Et) ∧ Pt ≠ φ then
7:     return ambiguous
8:   else if rslt equals to Nt = (Vt, Et) ∧ Pt = φ then
9:     add neighbourhood Nt to the subgraph WT and iteration ← iteration + 1
10:    for each u ∈ Vt where (t, u) ∈ Ht ∪ At do
11:      termMap(u, WT, M, root, iteration)
12:    M ← M ∪ {t}
13:    return mapped
14: else if iteration more than 1 then
15:   if rslt equals to Nt = (Vt, Et) ∧ Pt = φ then
16:     add neighbourhood Nt to the subgraph WT and iteration ← iteration + 1
17:     for each u ∈ Vt where (t, u) ∈ Ht do
18:       if u not equal to root then termMap(u, WT, M, root, iteration)
19: else return //all paths from the origin t will arrive at the root
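The recursion in Algorithm 8 can be rendered in a few lines once the Wikipedia lookup is abstracted away; in the sketch below the function neighbourhood(t) and the toy link dictionary are hypothetical stand-ins for real queries against Wikipedia, and the return values mirror the composite/non-existent/ambiguous/mapped outcomes explained in the following paragraphs:

# toy background knowledge: topic -> (hierarchical links, associative links, polysemous links)
LINKS = {
    "baking powder": (["leavening agents"], [], []),
    "leavening agents": (["food ingredients"], [], []),
    "food ingredients": (["main topic classifications"], [], []),
    "main topic classifications": ([], [], []),
}
ROOT = "main topic classifications"

def neighbourhood(t):
    return LINKS.get(t)   # None plays the role of 'undef'

def term_map(t, subgraph, mapped, iteration=1):
    rslt = neighbourhood(t)
    if rslt is None:
        return "composite" if " " in t else "non-existent"
    hier, assoc, poly = rslt
    if iteration == 1 and poly:
        return "ambiguous"
    # follow hierarchical (and, on the first call, associative) links towards the root
    for u in hier + (assoc if iteration == 1 else []):
        subgraph.setdefault(t, set()).add(u)
        if u != ROOT:
            term_map(u, subgraph, mapped, iteration + 1)
    if iteration == 1:
        mapped.add(t)
    return "mapped"

subgraph, mapped = {}, set()
print(term_map("baking powder", subgraph, mapped))   # mapped
print(subgraph)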
The querying aspect is defined as the function map(t), which finds an equivalent topic u ∈ V for term t, and returns the closed neighbourhood Nt:

map(t) = { Nt = (Vt, Et)   if ∃u ∈ V, u ≡ t
         { undef           otherwise                (8.1)

The neighbourhood for term t is denoted as (Vt, Et) where Et = {(t, y) : (t, y) ∈ Ht ∪ At ∪ Pt ∧ y ∈ Vt} and Vt is the set of vertices in the neighbourhood. The sets Ht, At and Pt contain the hierarchical, associative and polysemous links which connect term t to its adjacent terms y ∈ Vt. The process of term mapping is summarised in Algorithm 8. The term mapper in Algorithm 8 is invoked once for every t ∈ T. The term mapper ceases the recursion upon encountering the base case, which consists of the root vertices of Wikipedia (e.g. "Main topic classifications"). An input term t ∈ T which is traced to the root vertex is considered as successfully mapped, and is moved from set T to set M. Figure 8.2(a) shows the subgraph WT created for the input set T={"baking powder", "whole wheat flour"}. In reality, many terms cannot be straightforwardly mapped because they do not have lexically equivalent topics in W due to (1) the non-exhaustive coverage of Wikipedia, (2) the tendency to modify terms for domain-specific uses, and (3) the polysemous nature of certain terms. The term mapper in Algorithm 8 returns different values, namely, composite, non-existent and ambiguous, to indicate the causes of mapping failures. The term resolution phase resolves mapping failures through the iterative process of lexical simplification, word disambiguation and association inference. Upon the completion of mapping and resolution of all input terms, any direct or indirect relations between the mapped terms t ∈ M can be identified by finding paths which connect them in the subgraph WT. Finally, we devise a 2-step technique to transform the subgraph WT into a lightweight domain ontology. Firstly, we identify the nearest common ancestor (NCA) for the mapped terms. Our simple algorithm for finding the NCA is presented in Algorithm 9. The discussion of more complex algorithms [21, 22] for finding the NCA is beyond the scope of this chapter. Secondly, we identify all directed paths in WT which connect the mapped terms to the new root NCA and use those paths to form the final lightweight domain ontology. The lightweight domain ontology for the subgraph WT in Figure 8.2(a) is shown in Figure 8.2(b).

Algorithm 9 findNCA(M, WT)
 1: initialise commonAnc = ∅, ancestors = ∅, continue = true
 2: for each m ∈ M do
 3:   Nm := map(m)
 4:   ancestor := {v : v ∈ Vm ∧ (m, v) ∈ Hm ∪ Am}
 5:   ancestors ← ancestors ∪ ancestor
 6: while continue equals true do
 7:   for each a ∈ ancestors do
 8:     initialise pthCnt = 0, sumD = 0
 9:     for each m ∈ M do
10:       dist := shortestDirectedPath(m, a, WT)
11:       if dist not infinite then
12:         pthCnt ← pthCnt + 1 and sumD ← sumD + dist
13:     if pthCnt equals |M| then
14:       commonAnc ← commonAnc ∪ {(a, sumD)}
15:   if commonAnc not equals ∅ then
16:     continue = false
17:   else
18:     initialise newAncestors = ∅
19:     for each a ∈ ancestors do
20:       Na := map(a)
21:       ancestor := {v : v ∈ Va ∧ (a, v) ∈ Ha ∪ Aa}
22:       newAncestors ← newAncestors ∪ ancestor
23:     ancestors ← newAncestors
24: return nca where (nca, dist) ∈ commonAnc and dist is the minimum distance

We discuss the details of the three parts of term resolution in the following three subsections.
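To make the mapping step concrete, the following is a minimal, self-contained sketch (in Python, not the system's Perl implementation) of the map() lookup in Equation 8.1 and the outcome classification used at the first iteration of Algorithm 8. The toy graph, its topic names and its link sets are invented stand-ins for Wikipedia, and the recursive subgraph construction and tracing to the root vertex are omitted; it illustrates only how a term is judged mapped, composite, non-existent or ambiguous:

# Minimal sketch (not the thesis implementation) of the map() lookup in
# Equation 8.1 and the outcome classification used by Algorithm 8. Wikipedia
# is stood in for by a small hand-made graph; the topic names, link sets and
# root label are illustrative assumptions only.
#   "H" = hierarchical links (classification scheme)
#   "A" = associative links ("See Also" sections)
#   "P" = polysemous links (disambiguation pages)
TOY_WIKI = {
    "baking powder":     {"H": ["leavening agents"], "A": [], "P": []},
    "whole wheat flour": {"H": ["flour"], "A": ["whole grain"], "P": []},
    "leavening agents":  {"H": ["food ingredients"], "A": [], "P": []},
    "flour":             {"H": ["food ingredients"], "A": [], "P": []},
    "whole grain":       {"H": ["cereals"], "A": [], "P": []},
    "food ingredients":  {"H": ["main topic classifications"], "A": [], "P": []},
    "cereals":           {"H": ["main topic classifications"], "A": [], "P": []},
    "pepper":            {"H": [], "A": [], "P": ["black pepper", "pepper (band)"]},
    "main topic classifications": {"H": [], "A": [], "P": []},
}

def wiki_map(term):
    """Equation 8.1: return the closed neighbourhood of a lexically matching
    topic, or None (undef) when no equivalent topic exists."""
    return TOY_WIKI.get(term)

def classify(term):
    """Reproduce the return values of Algorithm 8 for a first-pass mapping."""
    nbhd = wiki_map(term)
    if nbhd is None:
        # Unmapped multi-word terms are composite; single words are non-existent.
        return "composite" if " " in term else "non-existent"
    if nbhd["P"]:              # non-empty set of polysemous links
        return "ambiguous"
    return "mapped"

if __name__ == "__main__":
    for t in ["baking powder", "whole wheat flour", "pepper",
              "100g baby spinach", "conchiglioni"]:
        print(f"{t!r:25} -> {classify(t)}")

In the full algorithm, a mapped term's hierarchical and associative neighbours are themselves mapped recursively until a root vertex such as "Main topic classifications" is reached, which is what builds the subgraph WT used by Algorithm 9.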
8.3.1 Lexical Simplification

The term mapper in Algorithm 8 returns the composite value to indicate the inability to map a composite term (i.e. a multi-word term). Composite terms which have many modifiers tend to face difficulty during term mapping due to the absence of lexically equivalent topics in W. To address this, we designed a lexical simplification step to reduce the lexical complexity of composite terms in a bid to increase their chances of re-mapping. A composite term is comprised of a head noun altered by some pre-modifiers (e.g. adjectives and nouns) or post-modifiers (e.g. prepositional phrases). These modifiers are important in clarifying or limiting the extent of the semantics of the terms in a particular context. For instance, the modifier "one cup" as in "one cup whole wheat flour" is crucial for specifying the amount of "whole wheat flour" required for a particular pastry. However, the semantic diversity of terms created by certain modifiers is often unnecessary in a larger context. Our lexical simplifier makes use of this fact to reduce the complexity of a composite term for re-mapping.

Figure 8.2: Figure 8.2(a) shows the subgraph WT constructed for T={"baking powder", "whole wheat flour"} using Algorithm 8, which is later pruned to produce the lightweight ontology in Figure 8.2(b). In Figure 8.2(a), the dotted arrows represent additional hierarchical links from each vertex, and the only associative link is between "whole wheat flour" and "whole grain". In Figure 8.2(b), "Food ingredients" is the NCA.

Figure 8.3: The computation of mutual information for all pairs of contiguous constituents of the composite terms "one cup whole wheat flour" and "salt to taste".

The lexical simplification step breaks down a composite term into two structurally coherent parts, namely, an optional constituent and a mandatory constituent. A mandatory constituent is composed of, but not limited to, the head noun of the composite term, and has to be in common use in the language independent of the optional constituent. The lexical simplifier then finds the least dependent pair as the ideally decomposed constituents. The dependencies are measured by estimating the mutual information of all contiguous constituents of a term. A term with n words has n−1 possible pairs, denoted as <x1, y1>, ..., <x(n−1), y(n−1)>. The mutual information for each pair <x, y> of term t is computed as MI(x, y) = f(t)/(f(x)f(y)), where f is a frequency measure. In a previous work [278], we utilise the page count returned by Web search engines to compute the relative frequency required for mutual information. Given that Z = {t, x, y}, the relative frequency for each z ∈ Z is computed as f(z) = (nz/nZ) e^(−nz/nZ), where nz is the page count returned by Web search engines and nZ = Σ(u∈Z) nu. Figure 8.3 shows an example of finding the least dependent constituents of two complex terms. Upon identifying the two least dependent constituents, we re-map the mandatory portion. To retain the possibly significant semantics delivered by the modifiers, we also attempt to re-map the optional constituents. If the decomposed constituents are in turn not mapped, another iteration of term resolution is performed. Unrelated constituents will be discarded.
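A minimal sketch of the split-selection step just described follows (Python rather than the system's Perl, with invented page counts standing in for Web search engine queries). Every contiguous <x, y> split of a composite term is scored with MI(x, y) = f(t)/(f(x)f(y)) over the relative frequencies f(z) = (nz/nZ)e^(−nz/nZ), and the least dependent pair is returned; treating the right-hand part as the mandatory constituent is a simplifying assumption of this example:

import math

# Illustrative sketch only. Real page counts come from a Web search engine;
# the numbers below are made-up stand-ins chosen to exercise the formulas.
FAKE_PAGE_COUNTS = {
    "one cup whole wheat flour": 900,
    "one": 9_000_000,
    "cup whole wheat flour": 1_200,
    "one cup": 2_000_000,
    "whole wheat flour": 600_000,
    "one cup whole": 3_000,
    "wheat flour": 800_000,
    "one cup whole wheat": 2_500,
    "flour": 5_000_000,
}

def relative_freq(term, group_counts):
    """f(z) = (n_z / n_Z) * exp(-n_z / n_Z) over the group Z = {t, x, y}."""
    n_z = group_counts[term]
    n_Z = sum(group_counts.values())
    return (n_z / n_Z) * math.exp(-n_z / n_Z)

def mutual_information(t, x, y, counts):
    """MI(x, y) = f(t) / (f(x) f(y)) using the relative frequencies above."""
    group = {z: counts[z] for z in (t, x, y)}
    return relative_freq(t, group) / (relative_freq(x, group) * relative_freq(y, group))

def least_dependent_split(term, counts):
    """Return the contiguous <x, y> split of `term` with the lowest MI."""
    words = term.split()
    splits = [(" ".join(words[:i]), " ".join(words[i:])) for i in range(1, len(words))]
    return min(splits, key=lambda xy: mutual_information(term, xy[0], xy[1], counts))

if __name__ == "__main__":
    x, y = least_dependent_split("one cup whole wheat flour", FAKE_PAGE_COUNTS)
    print(f"optional constituent: {x!r}, mandatory constituent: {y!r}")

With the stand-in counts above, the split chosen for "one cup whole wheat flour" is "one cup" and "whole wheat flour", which matches the decomposition discussed in the text; with real page counts the outcome depends entirely on the search engine statistics.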
To decide whether a constituent is unrelated to the domain, we define the distance of a constituent with respect to the set of mapped terms M as:

δ({x, y}, M) = ( Σ(m∈M) noW({x, y}, m) ) / |M|                (8.2)

where noW(a, b) is a measure of geodesic distance between topics a and b based on Wikipedia, developed by Wong et al. [276] and known as n-degree of Wikipedia (noW). A constituent is discarded if δ({x, y}, M) > τ and the current set of mapped terms is not empty, i.e. |M| ≠ 0. The threshold τ = δ(M) + σ(M), where δ(M) and σ(M) are the average and the standard deviation of the intra-group distance of M.

8.3.2 Word Disambiguation

The term mapping phase in Algorithm 8 returns the ambiguous value if a term t has a non-empty set of polysemous links Pt in its neighbourhood. In such cases, the terms are considered as ambiguous and cannot be directly mapped. To address this, we include a word disambiguation step which automatically resolves ambiguous terms using noW [276]. Since all input terms in T belong to the same domain of interest, the word disambiguator finds the proper senses to replace the ambiguous terms by virtue of the senses' relatedness to the already mapped terms. Senses which are highly related to the mapped terms have lower noW values. For example, the term "pepper" is considered as ambiguous since its neighbourhood contains a non-empty set Ppepper with numerous polysemous links pointing to various senses in the food, music and sports domains. If the term "pepper" is provided as input together with terms such as "vinegar" and "garlic", we can eliminate all semantic categories except food.

Figure 8.4: A graph showing the distribution of noW distance and the stepwise difference for the sequence of word senses for the term "pepper". The set of mapped terms is M={"fettuccine", "fusilli", "tortellini", "vinegar", "garlic", "red onion", "coriander", "maple syrup", "whole wheat flour", "egg white", "baking powder", "buttermilk"}. The line "stepwise difference" shows the ∆i−1,i values. The line "average stepwise difference" is the constant value µ∆. Note that the first sense s1 is located at x = 0.

Each ambiguous term t has a set of senses St = {s : s ∈ Vt ∧ (t, s) ∈ Pt}. Equation 8.2, denoted as δ(s, M), is used to measure the distance between a sense s ∈ St and the set of mapped terms M. The senses are then sorted into a list (s1, ..., sn) in ascending order according to their distance from the mapped terms. The smaller the subscript, the smaller the distance, and therefore, the closer the sense is to the domain in consideration. An interesting observation is that many senses of an ambiguous term are in fact minor variations belonging to the same semantic category (i.e. paradigm). Referring back to our example term "pepper", within the food domain alone, multiple possible senses exist (e.g. "sichuan pepper", "bell pepper", "black pepper"). While these senses have their intrinsic differences, they are paradigmatically substitutable for one another. Using this property, we devise a sense selection mechanism to identify suitable paradigms covering highly related senses as substitutes for the ambiguous terms. The mechanism computes the difference in noW value as ∆i−1,i = δ(si, M) − δ(si−1, M) for 2 ≤ i ≤ n between every two consecutive senses. We currently employ the average stepwise difference of the sequence as the cutoff point. The average stepwise difference for a list of n senses is µ∆ = ( Σ(i=2..n) ∆i−1,i ) / (n − 1). Finally, the first k senses in the sequence with ∆i−1,i < µ∆ are accepted as belonging to a single paradigm for replacing the ambiguous term.
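A minimal illustration of this sense selection mechanism is sketched below (Python, not the system's Perl; the δ(s, M) values for "pepper" are invented stand-ins rather than real noW distances computed over Wikipedia's category graph). The senses are ranked by their distance to the mapped set, the stepwise differences and their average µ∆ are computed, and the leading senses whose step stays below µ∆ are kept:

# Minimal sketch (not the thesis code) of the sense selection mechanism for an
# ambiguous term. The distances below are invented stand-ins; in the real
# system they are derived from shortest paths over Wikipedia's category graph.
def stepwise_selection(sense_distances):
    """Given {sense: delta(sense, M)}, keep the leading senses whose stepwise
    difference stays below the average stepwise difference mu_delta."""
    # Sort senses by their distance to the mapped set M (ascending).
    ranked = sorted(sense_distances.items(), key=lambda kv: kv[1])
    senses = [s for s, _ in ranked]
    dists = [d for _, d in ranked]
    if len(senses) < 2:
        return senses
    # delta_{i-1,i} for consecutive senses, and the average cutoff mu_delta.
    steps = [dists[i] - dists[i - 1] for i in range(1, len(dists))]
    mu_delta = sum(steps) / len(steps)
    # Accept the closest sense, then keep adding while the step stays below mu_delta.
    selected = [senses[0]]
    for sense, step in zip(senses[1:], steps):
        if step >= mu_delta:
            break
        selected.append(sense)
    return selected

if __name__ == "__main__":
    # Hypothetical delta(s, M) values for senses of "pepper" against a food-domain M.
    pepper_senses = {
        "black pepper": 0.30, "allspice": 0.35, "melegueta pepper": 0.38,
        "cubeb": 0.42, "pepper (band)": 0.92, "dr pepper": 1.10,
    }
    print(stepwise_selection(pepper_senses))

With these hypothetical distances the sketch keeps the four food senses and drops the music and beverage senses, mirroring the kind of outcome shown in Figure 8.4.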
Using this mechanism, we have reduced the scope of the term "pepper" to only the food domain, out of the many senses across domains such as music (e.g. "pepper (band)") and beverage (e.g. "dr pepper"). In our example in Figure 8.4, the ambiguous term "pepper" is replaced by {"black pepper", "allspice", "melegueta pepper", "cubeb"}. These k = 4 word senses are selected as replacements since the stepwise difference at point i = 5, ∆4,5 = 0.5, exceeds µ∆ = 0.2417.

8.3.3 Association Inference

Terms that are labelled as non-existent by Algorithm 8 simply do not have any lexical matches on Wikipedia. We propose to use cluster analysis to infer potential associations for such non-existent terms. We employ our term clustering algorithm with featureless similarity measures, known as Tree-Traversing Ant (TTA) [276]. TTA is a hybrid algorithm inspired by ant-based methods and hierarchical clustering which utilises two featureless similarity measures, namely, Normalised Google Distance (NGD) [50] and noW. Unlike conventional clustering algorithms which involve feature extraction and selection, terms are automatically clustered using TTA based on their usage prevalence and co-occurrence on the Web. In this step, we perform term clustering on the non-existent terms together with the already mapped terms in M to infer hidden associations. The association inference step is based on the premise that terms grouped into similar clusters are bound by some common dominant properties. By inference, any non-existent terms which appear in the same clusters as the mapped terms should have similar properties. The TTA returns a set of term clusters C = {C1, ..., Cn} upon the completion of term clustering for each set of input terms. Each Ci ∈ C is a set of related terms as determined by TTA. Figure 8.5 shows the results of clustering the non-existent term "conchiglioni" with 14 mapped terms. The output is a set of three clusters {C1′, C2, C3}. Next, we acquire the parent topics of all mapped terms located in the same cluster as the non-existent term by calling the mapping function in Equation 8.1. We refer to such a cluster as the target cluster. These parent topics, represented as the set R, constitute the potential topics which may be associated with the non-existent term. In our example in Figure 8.5, the target cluster is C1′, and the elements of set R are {"pasta", "pasta", "pasta", "italian cuisine", "sauces", "cuts of pork", "dried meat", "italian cuisine", "pork", "salumi"}.

Figure 8.5: The result of clustering the non-existent term "conchiglioni" and the mapped terms M={"fettuccine", "fusilli", "tortellini", "vinegar", "garlic", "red onion", "coriander", "maple syrup", "whole wheat flour", "egg white", "baking powder", "buttermilk", "carbonara", "pancetta"} using TTA.

We devise a prevailing parent selection mechanism to identify the most suitable parent in R to which we attach the non-existent term. The prevailing parent is determined by assigning a weight to each parent r ∈ R, and ranking the parents according to their weights. Given the non-existent term t and a set of parents R, the prevailing parent weight ρr, where 0 ≤ ρr < 1, for each unique r ∈ R is defined as

ρr = common(r) · sim(r, t) · subsume(r, t) · δr

where sim(a, b) is given by 1 − NGD(a, b)θ, and NGD(a, b) is the Normalised Google Distance [50] between a and b. θ is a constant within the range (0, 1] for adjusting the NGD distance. The function common(r) = ( Σ(q∈R, q=r) 1 ) / |R| determines the relative number of occurrences of r in the set R. δr = 1 if subsume(r, t) > subsume(t, r), and δr = 0 otherwise. The subsumption measure subsume(x, y) [77] is the probability of x given y, computed as n(x, y)/n(y), where n(x, y) and n(y) are page counts obtained from Web search engines. This measure is used to quantify the extent of term x being more general than term y. The higher the subsumption value, the more general term x is with respect to y. Upon ranking the unique parents in R based on their weights, we select the prevailing parent r as the one with the largest ρ. A link is then created for the non-existent term t to hierarchically relate it to r.
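As an illustration only (Python, not the system's Perl; the NGD values and page counts below are invented stand-ins for Web search statistics, and θ = 0.5 is an arbitrary choice), the following sketch ranks candidate parents by ρr = common(r) · sim(r, t) · subsume(r, t) · δr and returns the prevailing parent for a non-existent term:

from collections import Counter

# Illustrative sketch only. Page counts and NGD values normally come from Web
# search engines; every number used in the example is an invented stand-in.
def subsume(x, y, pair_counts, counts):
    """subsume(x, y) = n(x, y) / n(y): probability of x given y from page counts."""
    return pair_counts[frozenset((x, y))] / counts[y]

def prevailing_parent(term, parents, ngd, pair_counts, counts, theta=0.5):
    """Rank each unique parent r in R by rho_r and return the best one."""
    occurrences = Counter(parents)
    best, best_rho = None, -1.0
    for r, freq in occurrences.items():
        common = freq / len(parents)                    # relative frequency of r in R
        sim = 1.0 - ngd[frozenset((r, term))] * theta   # sim(r, t) = 1 - NGD(r, t) * theta
        forward = subsume(r, term, pair_counts, counts)
        backward = subsume(term, r, pair_counts, counts)
        delta = 1.0 if forward > backward else 0.0      # r must be the more general term
        rho = common * sim * forward * delta
        if rho > best_rho:
            best, best_rho = r, rho
    return best, best_rho

if __name__ == "__main__":
    term = "conchiglioni"
    R = ["pasta", "pasta", "pasta", "italian cuisine", "sauces"]
    ngd = {frozenset(("pasta", term)): 0.2,
           frozenset(("italian cuisine", term)): 0.4,
           frozenset(("sauces", term)): 0.7}
    counts = {"pasta": 1_000_000, "italian cuisine": 2_000_000,
              "sauces": 3_000_000, term: 50_000}
    pair_counts = {frozenset(("pasta", term)): 40_000,
                   frozenset(("italian cuisine", term)): 20_000,
                   frozenset(("sauces", term)): 5_000}
    print(prevailing_parent(term, R, ngd, pair_counts, counts))

With these stand-in numbers the most frequent and most subsuming parent, "pasta", receives the largest weight and would become the hierarchical parent of the non-existent term.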
8.4 Initial Experiments and Discussions

Figure 8.6: The results of relation acquisition using the proposed technique for the genetics and the food domains. The labels "correctly xxx" and "incorrectly xxx" represent the true positives (TP) and false positives (FP). Precision is computed as TP/(TP + FP).

We experimented with the proposed technique shown in Figure 8.1 using two manually-constructed datasets, namely, a set of 11 terms in the genetics domain, and a set of 31 terms in the food domain. The system performed the initial mappings of the input terms at level 0. This resulted in 6 successfully mapped terms and 5 unmapped composite terms in the genetics domain. As for the terms in the food domain, 14 were mapped, 16 were composite and 1 was non-existent. At level 1, the 5 composite terms in the genetics domain were decomposed into 10 constituents, of which 8 were remapped and 2 required further level 2 resolution. For the food domain, the 16 composite terms were decomposed into 32 constituents at level 1, of which 10, 5 and 3 were still composite, non-existent and discarded, respectively. Together with the successfully clustered non-existent term and the 14 remapped constituents, there were a total of 15 remapped terms at level 1. Figure 8.6 summarises the experiment results. Overall, the system achieved 100% precision in the aspects of term mapping, lexical simplification and word disambiguation at all levels using the small set of 11 terms in the genetics domain, as shown in Figure 8.6. As for the set of food-related terms, there was one false positive (i.e. incorrectly mapped) involving the composite term "100g baby spinach", which resulted in an 80% precision at level 2. At level 1, this composite term was decomposed into the appropriate constituents "100g" and "baby spinach". At level 2, the term "baby spinach" was further decomposed and its constituent "spinach" was successfully remapped. The constituent "baby" in this case refers to the adjectival sense of "comparatively little". However, the modifier "baby" was inappropriately remapped and attached to the concept of "infant".

Figure 8.7: The lightweight domain ontologies generated using the two sets of input terms: (a) the lightweight domain ontology for genetics constructed using 11 terms; (b) the lightweight domain ontology for food constructed using 31 terms. The important vertices (i.e. NCAs, input terms, vertices with degree more than 3) have darker shades. The concepts genetics and food in the center of the graphs are the NCAs. All input terms are located along the side of the graph.
The lack of information on polysemes and synonyms for basic English words is the main cause of this problem. In this regard, we are planning to incorporate dynamic linguistic resources such as Wiktionary to complement the encyclopaedic nature of Wikipedia. Other established, static resources such as WordNet can also be used as a source of basic English vocabulary. Moreover, the incorporation of such complementary resources can assist in retaining and capturing additional semantics of complex terms by improving the mapping of constituents such as "dried" and "sliced". General words which act as modifiers in composite terms often do not have corresponding topics in Wikipedia, and are usually unable to satisfy the relatedness requirement outlined in Section 8.3.1. Such constituents are currently ignored, as shown by the high number of discarded constituents at level 2 in Figure 8.6. Moreover, the clustering of terms to discover new associations is only performed at level 1, and non-existent terms at level 2 and beyond are currently discarded. Upon obtaining the subgraphs WT for the two input sets, the system finds the corresponding nearest common ancestors. The NCAs for the genetics-related and the food-related terms are genetics and food, respectively. Using these NCAs, our system constructed the corresponding lightweight domain ontologies as shown in Figure 8.7. A detailed account of this experiment is available to the public1.

8.5 Conclusion and Future Work

Acquiring semantic relations is an important part of ontology learning. Many existing techniques face difficulty in extending to different domains, disregard implicit and indirect relations, and are unable to handle relations between composite, ambiguous and non-existent terms. We presented a hybrid technique which combines lexical simplification, word disambiguation and association inference for acquiring semantic relations between potentially composite and ambiguous terms using only dynamic Web data (i.e. Wikipedia and Web search engines). During our initial experiment, the technique demonstrated the ability to handle terms from different domains, to accurately acquire relations between composite and ambiguous terms, and to infer relations between terms which do not exist in Wikipedia. The lightweight ontologies discovered using this technique are a valuable resource to complement other techniques for constructing full-fledged ontologies. Our future work includes the diversification of domain and linguistic knowledge by incorporating online dictionaries to support general words not available on Wikipedia. Evaluation using larger datasets, and the study of the effect of clustering words beyond level 1, is also required.

1 http://explorer.csse.uwa.edu.au/research/sandbox_evaluation.pl

8.6 Acknowledgement

This research is supported by the Australian Endeavour International Postgraduate Research Scholarship, the DEST (Australia-China) Grant, and the Interuniversity Grant from the Department of Chemical Engineering, Curtin University of Technology.

CHAPTER 9 Implementation

"Thinking about information overload isn't accurately describing the problem; thinking about filter failure is." - Clay Shirky, the Web 2.0 Expo (2008)

The focus of this chapter is to illustrate the advantages of the seamless and automatic construction of term clouds and lightweight ontologies from text documents.
An application is developed to assist in the skimming and scanning of large amounts of news articles across different domains, including technology, medicine and economics. First, the implementation details of the proposed techniques described in the previous six chapters are provided. Second, three representative use cases are described to demonstrate our powerful interface for assisting document skimming and scanning. The details on how term clouds and lightweight ontologies can be used for this purpose are provided.

9.1 System Implementation

This section presents the implementation details of the core techniques discussed in the previous six chapters (i.e. Chapters 3-8). Overall, the proposed ontology learning system is implemented as a Web application hosted at http://explorer.csse.uwa.edu.au/research/. The techniques are developed entirely using the Perl programming language. The availability and reusability of a wide range of external modules on the Comprehensive Perl Archive Network (CPAN) for text mining, natural language processing, statistical analysis, and other Web services makes Perl the ideal development platform for this research. Moreover, the richer and more consistent regular expression syntax in Perl provides a powerful tool for manipulating and processing texts. The Prefuse visualisation toolkit1, written in the Java programming language, and the Perl module SVGGraph2 for creating Scalable Vector Graphics (SVG) graphs are also used for visualisation purposes. The CGI::Ajax3 module is employed to enable the system's online interfaces to asynchronously access backend Perl modules. The use of Ajax improves interactivity, bandwidth usage and load time by allowing back-end modules to be invoked and data to be returned without interfering with the interface behaviour. Overall, the Web application comprises a suite of modules with about 90,000 lines of properly documented Perl code.

1 http://prefuse.org/
2 http://search.cpan.org/dist/SVGGraph-0.07/
3 http://search.cpan.org/dist/CGI-Ajax-0.707/

Figure 9.1: The online interface for the HERCULES module. (a) The screenshot of a webpage containing a short abstract of a journal article, hosted at http://www.ncbi.nlm.nih.gov/pubmed/7602115; the relevant content, which is the abstract, extracted by HERCULES is shown in Figure 9.1(b). (b) The input section of the interface algorithm_hercules.pl shows the HTML source code of the webpage in Figure 9.1(a); the relevant content extracted from the HTML source by the HERCULES module is shown in the results section. The process log, not included in this figure, is also available in the results section.

Figure 9.2: The input section of the interface algorithm_issac.pl shows the error sentence "Susan's imabbirity to Undeerstant the msg got her INTu trubble.". The correction provided by ISSAC is shown in the results section of the interface. The process log is also provided through this interface. Only a small portion of the process log is shown in this figure.

The implementation details of the system's modules are as follows:

• The relevant content extraction technique, described in Chapter 6, is implemented as the HERCULES module that can be accessed and tested via the online interface algorithm_hercules.pl. The HERCULES module uses only regular expressions to implement the set of heuristic rules described in Chapter 6 for removing HTML tags and other non-content elements.
Figure 9.1(a) shows an example webpage that has both relevant content, and boilerplates such as navigation and complex search features in the header section and other related information in the right panel. The relevant content extracted by HERCULES is shown in Figure 9.1(b).

• The integrated technique for cleaning noisy text, described in Chapter 3, is implemented as the ISSAC module accessible via the interface algorithm_issac.pl. The implementation of ISSAC uses the following Perl modules, namely, Text::WagnerFischer4 for computing the Wagner-Fischer edit distance [266], WWW::Search::AcronymFinder5 for accessing the online dictionary www.acronymfinder.com, and Text::Aspell6 for interfacing with the GNU spell checker Aspell. In addition, the Yahoo APIs for spelling suggestion7 and Web search8 are used to obtain replacement candidates, and to obtain page counts for deriving the general significance score. Figure 9.2 shows the correction by ISSAC for the noisy sentence "Susan's imabbirity to Undeerstant the msg got her INTu trubble.".

4 http://search.cpan.org/dist/Text-WagnerFischer-0.04/
5 http://search.cpan.org/dist/WWW-Search-AcronymFinder-0.01/
6 http://search.cpan.org/dist/Text-Aspell/
7 http://developer.yahoo.com/search/web/V1/spellingSuggestion.html
8 http://developer.yahoo.com/search/web/V1/webSearch.html

Figure 9.3: The online interface algorithm_unithood.pl for the unithood module. The interface shows the collocational stability of different phrases determined using unithood. The various weights involved in determining the extent of stability are also provided in these figures.

• The two measures OU and UH described in Chapter 4 are implemented as a single module called unithood that can be accessed online via the interface algorithm_unithood.pl. The unithood module also uses the Yahoo API for Web search to access page counts for estimating the collocational strength of noun phrases. Figure 9.3 shows the results of checking the collocational strength of the phrases "Centers for Disease Control and Prevention" and "Drug Enforcement Administration and Federal Bureau of Investigation". As mentioned in Chapter 4, phrases containing both prepositions and conjunctions can be relatively difficult to deal with. The unithood module using the UH measure automatically decides that the second phrase "Drug Enforcement Administration and Federal Bureau of Investigation" does not form a stable noun phrase, as shown in Figure 9.3. The decision is correct considering that the unstable phrase can refer to two separate entities in the real world.

Figure 9.4: The online interfaces for querying the virtual and local corpora created using the SPARTAN module. (a) The data_virtualcorpus.pl interface for querying pre-constructed virtual corpora by SPARTAN. (b) The data_localcorpus.pl interface for querying pre-constructed local corpora by SPARTAN, and some other types of local corpora.

• The technique for constructing text corpora described in Chapter 6 is implemented under the SPARTAN module. As part of SPARTAN, three submodules PROSE, STEP and SLOP are implemented to filter websites, expand seed terms and localise webpage contents, respectively. No online corpus construction interface was provided for users due to the extensive storage space required for downloading and constructing text corpora.
Instead, an interface to query pre-constructed virtual and local corpora is made available via data_virtualcorpus.pl and data_localcorpus.pl, respectively. The SPARTAN module uses both the Yahoo APIs for Web search and site search9 throughout the corpus construction process. The Perl modules WWW::Wikipedia10 and LWP::UserAgent11 are used to access Wikipedia during seed term expansion and to access webpages using HTTP-style communication. Figure 9.4(a) shows the interface data_virtualcorpus.pl for querying the virtual corpora constructed using SPARTAN. Some statistics related to the virtual corpora, such as document frequency and word frequency, are provided in this interface. A simple implementation based on document frequency is also used in this interface to decide if the search term is relevant to the domain represented by the corpus or otherwise. For instance, Figure 9.4(a) shows the results of querying the virtual corpus in the medicine domain using the word "tumor necrosis factor". There are 322,065 documents that contain the word "tumor necrosis factor" out of the total 84 million in the domain corpus. There are, however, only 5 documents in the contrastive corpus that have this word. Based on these frequencies, the interface decides that "tumor necrosis factor" is relevant to the medicine domain. Figure 9.4(b) shows the interface data_localcorpus.pl for querying the localised versions of the virtual corpora, and other types of local corpora.

9 http://developer.yahoo.com/search/siteexplorer/siteexplorer.html
10 http://search.cpan.org/dist/WWW-Wikipedia-1.95/
11 http://search.cpan.org/dist/libwww-perl-5.826/lib/LWP/UserAgent.pm

Figure 9.5: Online interfaces related to the termhood module. (a) The online interface algorithm_termhood.pl accepts short text snippets as input and produces term clouds using the termhood module. (b) The online interface data_termcloud.pl for browsing pre-constructed term clouds using the termhood module; each term cloud is a summary of the important concepts of the corresponding news article. (c) The online interface data_corpus.pl summarises the text corpora available in the system for use by the termhood module.

Figure 9.6: Online interfaces related to the ARCHILES module. (a) The algorithm_nwd.pl interface for finding the semantic similarity between terms using the NWD module. (b) The algorithm_now.pl interface for finding the semantic distance between terms using the noW module. (c) The algorithm_tta.pl interface for clustering terms using the TTA module with the support of featureless similarity metrics by the NWD and noW modules.

Figure 9.7: The interface data_lightweightontology.pl for browsing pre-constructed lightweight ontologies for online news articles using the ARCHILES module.

• The two measures TH and OT described in Chapter 5 for recognising domain-relevant terms are implemented as the termhood module. An interface is created at algorithm_termhood.pl to allow users to access the termhood module online. Figure 9.5(a) shows the result of term recognition for the input sentence "Melanoma is one of the rarer types of skin cancer. Around 160,000 new cases of melanoma are diagnosed each year.". The termhood module presents the output as term clouds containing domain-relevant terms of different sizes. Larger terms assume a more significant role in representing the content of the input text.
The results section of the interface in Figure 9.5(a) also provides information on the composition of the text corpora used and the process log of text processing and term recognition. The termhood module has the option of using either the text corpora constructed through guided crawling of online news sites, the corpora (both local and virtual) built using the SPARTAN module, publicly-available collections (e.g. Reuters-21578, texts from the Gutenberg project, GENIA), or any combination thereof. Figure 9.5(c) shows the interface data_corpus.pl that summarises information about the available text corpora for use by the termhood module. A list of pre-constructed term clouds from online news articles is available for browsing at data_termcloud.pl as shown in Figure 9.5(b).

• The relation acquisition technique, described in Chapter 8, is implemented under the ARCHILES module. Due to some implementation challenges, an online interface cannot be provided for users to directly access the ARCHILES module. Nevertheless, a list of pre-constructed lightweight ontologies for online news articles is available for browsing using the interface data_lightweightontology.pl as shown in Figure 9.7. ARCHILES employs two featureless similarity measures, NWD and noW, and the TTA clustering technique described in Chapter 7 for disambiguating terms and discovering relations between unknown terms. The NWD and noW modules can be accessed online via algorithm_nwd.pl and algorithm_now.pl. The NWD module relies on the Yahoo API for Web search to access page counts for similarity estimation. The noW module uses the external Graph and WWW::Wikipedia Perl modules to simulate Wikipedia's categorical system, to compute shortest paths for deriving distance values, and to resolve ambiguous terms. The clustering technique is implemented as the TTA module with an online interface at algorithm_tta.pl. TTA uses both the noW and NWD modules to determine similarity for clustering terms.

9.2 Ontology-based Document Skimming and Scanning

The growth of textual information on the Web is a double-edged sword. On the one hand, we are blessed with unsurpassed freedom and accessibility to endless information. We all know that information is power, and so we thought the more the better. On the other hand, such an explosion of information on the Web (i.e. information explosion) can be a curse. While information has been growing exponentially since the conception of the Web, our cognitive abilities have not caught up. We have short attention spans on the Web [133], and we are slow at reading off the screen [91]. For this reason, users are finding it increasingly difficult to handle the excess amount of information being provided on a daily basis, an effect known as information overload. An interesting study at King's College London showed that information overload is actually doing more harm to our concentration than marijuana [270]. It has become apparent that "when it comes to information, sometimes less is more..." [179].

Figure 9.8: The screenshot of the aggregated news services provided by Google (the left portion of the figure) and Yahoo (the right portion of the figure) on 11 June 2009.

Figure 9.9: A splash screen on the online interface for document skimming and scanning at http://explorer.csse.uwa.edu.au/research/.

Figure 9.10: The cross-domain term cloud summarising the main concepts occurring in all the 395 articles listed in the news browser. This cloud currently contains terms in the technology, medicine and economics domains.

There are two key issues to be considered when attempting to address the problem of information overload. Firstly, it is becoming increasingly challenging for retrieval systems to locate relevant information amidst a growing Web, and secondly, users are finding it more difficult to interpret a growing amount of relevant information. While many studies have been conducted to improve the performance of retrieval systems, there is virtually no work on the issue of information interpretability. This lack of attention to information interpretability becomes obvious as we look at the way Google and Yahoo present search results, news articles and other documents to the users. At most, these systems rank the webpages for relevance and generate short snippets with keyword bolding to assist users in locating what they need. Studies [121, 11] have shown that these summaries often have poor readability and are inadequate in conveying the gist of the documents. In other words, the users would still have to painstakingly read through the documents in order to find the information they need. We take the two aggregated news services by Google and Yahoo shown in Figure 9.8 as examples to demonstrate the current lack of regard for information interpretability. The left portion of the figure shows the Google News interface, while the right portion shows the Yahoo News interface. Both interfaces are focused on the health news category. The interfaces in Figure 9.8 merely show half of all the news listed on 11 June 2009. The actual listings are considerably longer. A quick look at both interfaces would immediately reveal the time and cognitive effort that users have to invest in order to arrive at a summary of the texts or to find a particular piece of information. Over time, users adopted the techniques of skimming and scanning to keep up with such a constant flow of textual documents online [218, 186, 268].

Figure 9.11: The single-domain term cloud for the domain of medicine. This cloud summarises all the main concepts occurring in the 75 articles listed below in the news browser. Users can arrive at this single-domain cloud from the cross-domain cloud in Figure 9.10 by clicking on the [domain(s)] option in the latter.

Figure 9.12: The single-domain term cloud for the medicine domain. Users can view a list of articles describing a particular topic by clicking on the corresponding term in the single-domain cloud.

Users employ skimming to quickly identify the main ideas conveyed by a document, usually to decide if the text is interesting and whether one should read it in more detail. Scanning, on the other hand, is used to obtain specific information from a document (e.g. a particular page where a certain idea occurred). This section provides details on the use of term clouds and lightweight ontologies to aid document skimming and scanning for improving information interpretability. More specifically, term clouds and lightweight ontologies are employed to assist users in quickly identifying the overall ideas or specific information in individual documents or groups of documents. In particular, the following three cases are examined: (1) Can the users quickly guess (in 3 seconds or so) from the listing alone what the main topics of interest are across all articles for that day?
(2) Is there a better way to present the gist of individual news articles to the users other than the conventional, ineffective use of short text snippets as summaries? (3) Are there other options besides the typical [find] feature for users to quickly pinpoint a particular concept in an article or a group of articles?

Figure 9.13: The use of document term clouds and information from lightweight ontologies to summarise individual news articles: (a) abstraction of the news "Tai Chi may ease arthritis pain"; (b) abstraction of the news "Omega-3-fatty acids may slow macular disease". Based on the term size in the clouds, one can arrive at the conclusion that the news featured in Figure 9.13(b) carries more domain-relevant (i.e. medical related) content than the news in Figure 9.13(a).

Figure 9.14: The document term cloud for the news "Tai Chi may ease arthritis pain". Users can focus on a particular concept in the annotated news by clicking on the corresponding term in the document cloud.

For this purpose, an online interface for document skimming and scanning is incorporated into the Web application's homepage12. While news articles may be the focus of the current document skimming and scanning system, other text documents including product reviews, medical reports, emails and search results can equally benefit from such an automatic abstraction system. Figure 9.9 shows the splash screen of the interface for document skimming and scanning. This splash screen explains the need for better means to assist document skimming and scanning while the data (i.e. term clouds and lightweight ontologies) is loading. Figure 9.10 shows the main interface for skimming and scanning a list of news articles across different domains. The white canvas on the top right corner containing words of different colours and sizes is the cross-domain term cloud. This term cloud summarises the key concepts in all news articles across all domains listed in the news browser panel below. For instance, Figure 9.10 shows that there are 395 articles across three domains (i.e. technology, medicine and economics) listed in the news browser with a total of 727 terms in the term cloud. The solutions to the above three use cases using our ontology-based document skimming and scanning system are as follows:

12 http://explorer.csse.uwa.edu.au/research/

• Figure 9.11 shows the single-domain term cloud for summarising the key concepts in the medicine domain. This term cloud is obtained by simply selecting the medicine option in the [domain(s)] field. There are 75 articles in the news browser with a total of 136 terms in the cloud. Looking at this single-domain term cloud, one would immediately be able to conclude that some of the news articles are concerned with "diabetes", "drug", "gene", "hormone", "heart disease", "H1N1 swine flu" and so on. One can also say that "diabetes" was discussed more intensely in these articles than other topics such as "diarrhea". The users are able to grasp the gist of large groups of articles in a matter of seconds without any complex cognitive effort. Can the same be accomplished through the typical news listing and text snippets as summaries shown in Figure 9.8? The use of the cross-domain or single-domain term clouds for summarising the main topics across multiple documents addresses the first problem.
• If the users are interested in drilling down on a particular topic, they can do so by simply clicking on the terms in the cloud. A list of news articles describing the selected topic is provided in the news browser panel as shown in Figure 9.12. The context in which the selected topic exists is also provided. For instance, Figure 9.12 shows that the "diabetes" topic is mentioned in the context of "hypertension" in the news "Psoriasis linked to...". Clicking on the [back] option brings the users back to the complete listing of articles in the medicine domain as in Figure 9.11. The users can also preview the gist of a news article by simply clicking on the title in the news browser panel. Figures 9.13(a) and 9.13(b) show the document term clouds for the news "Tai Chi may ease arthritis pain" and "Omega-3-fatty acids may slow macular disease". These document term clouds summarise the content of the news articles and present the key terms in a visually appealing manner to enhance the interpretability and retention of information. The interfaces in Figures 9.13(a) and 9.13(b) also provide information derived from the corresponding lightweight ontologies. For instance, the root concept in the ontology is shown in the [this news is about] field. In the news "Tai Chi may ease arthritis pain", the root concept is "self-care". The parent concepts of the key terms in the ontology are presented as part of the field [the main concepts are]. In addition, based on the term size in the clouds, one can arrive at the conclusion that the news featured in Figure 9.13(b) carries more domain-relevant (i.e. medical related) content than the news in Figure 9.13(a). Can the users arrive at such comprehensive and abstract information regarding a document with minimal time and cognitive effort using the conventional news listing interfaces shown in Figure 9.8? The use of document term clouds and lightweight ontologies for presenting the gist of individual news articles addresses the second problem.

• The use of the following features [click for articles], [find term], [context terms] and [click to focus] helps users to locate a particular concept at different levels of granularity. At the document collection level, users can locate articles containing a particular term using the [click for articles], the [find term] or the [context terms] features. The [click for articles] feature allows users to view a list of articles (using the news browser) related to a particular topic in the cross-domain or the single-domain term cloud. The [find term] feature can be used at any time to refine and reduce the size of the cross-domain or the single-domain term cloud. Context terms are provided together with the listing of articles in the news browser when users select the [click for articles] feature. Clicking on any term under the column [context terms], as shown in Figure 9.12, will list all articles containing the selected term. At the individual document level, news articles are annotated with the key terms that occur in the document clouds to assist scanning activities. Users can employ the [click to focus] feature to pinpoint the occurrence of a particular concept in an article by clicking on the corresponding term in the document cloud. Figure 9.14 shows how a user clicked on "chronic tension headache" in the document term cloud, which triggered the auto-scrolling and highlighting of that term in the annotated news.
Can the users pinpoint a particular topic that occurred in a large document collection or a single lengthy document with minimal time and cognitive effort using the conventional interfaces shown in Figure 9.8? The various features provided by this system allow users to quickly pinpoint a particular concept, either in an article or a group of articles, to address the last problem.

9.3 Chapter Summary

This chapter provided the implementation details of the proposed ontology learning system as a Web application. The type of programming language, external tools and development environment was described. Online interfaces to several modules of the Web application were made publicly available. The benefits of using automatically-generated term clouds and lightweight ontologies for document skimming and scanning were highlighted using three use cases. It was qualitatively demonstrated that conventional news listing interfaces, unlike ontology-based document skimming and scanning, are unable to satisfy the following three common scenarios: (1) to grasp the gist of large groups of articles in a matter of seconds without any complex cognitive effort, (2) to arrive at a comprehensive and abstract overview of a document with minimal time and cognitive effort, and (3) to pinpoint a particular topic that occurred in a large document collection or a single lengthy document with minimal time and cognitive effort. In the next chapter, the research work presented throughout this dissertation is summarised. Plans for system improvement are outlined, and an outlook on future research directions in the area of ontology learning is provided.

CHAPTER 10 Conclusions and Future Work

"We can only see a short distance ahead, but we can see plenty there that needs to be done." - Alan Turing, Computing Machinery and Intelligence (1950)

Term clouds and lightweight ontologies are the key to bootstrapping the Semantic Web, creating better search engines, and providing effective document management for individuals and organisations. A major problem faced by current ontology learning systems is the reliance on rare, static background knowledge (e.g. WordNet, British National Corpus). This problem is described in detail in Chapter 1 and subsequently confirmed by the literature review in Chapter 2. Overall, this research demonstrates that the use of dynamic Web data as the sole background knowledge is a viable, long-term alternative for cross-domain ontology learning from text. This finding verifies the thesis statement in Chapter 1. In particular, four major research questions identified as part of the thesis statement are addressed in Chapters 3 to 8 with a common theme of taking advantage of the diversity and redundancy of Web data. These four problems are (1) the absence of integrated techniques for cleaning noisy data, (2) the inability of current term extraction techniques, which are heavily influenced by word frequency, to systematically explicate, diversify and consolidate their evidence, (3) the inability of current corpus construction techniques to automatically create very large, high-quality text corpora using a small number of seed terms, and (4) the difficulty of locating and preparing features for clustering and extracting relations. As a proof of concept, Chapter 9 of this thesis demonstrated the benefits of using automatically-constructed term clouds and lightweight ontologies for skimming and scanning large numbers of real-world documents.
More precisely, term clouds and lightweight ontologies are employed to assist users in quickly identifying the overall ideas or specific information in individual news articles or groups of news articles across different domains, including technology, medicine and economics. Chapter 9 also discussed the implementation details of the proposed ontology learning system.

10.1 Summary of Contributions

The major contributions to the field of ontology learning that arose from this thesis (described in Chapters 3 to 8) are summarised as follows. Chapter 3 addressed the first problem by proposing and implementing one of the first integrated techniques, called ISSAC, for cleaning noisy text. ISSAC simultaneously corrects spelling errors, expands abbreviations and restores improper casings. It was found that in order to cope with language change (e.g. the appearance of new words) and the blurring of boundaries between noises, the use of multiple dynamic Web data sources (in the form of statistical evidence from search engine page counts and online abbreviation dictionaries) was necessary. Evaluations using noisy chat records from industry demonstrated high accuracy of correction. To address the second problem, Chapter 4 first outlined two measures for determining word collocation strength during noun phrase extraction. These measures are UH, an adaptation of existing measures, and OU, a probabilistic measure. UH and OU rely on page counts from search engines to derive the statistical evidence required for measuring word collocation. It was found that the noun phrases extracted based on the probabilistic measure OU achieved the best precision compared to the heuristic measure UH. The stable noun phrases extracted using these measures constitute the input to the next stage of term recognition. Secondly, in Chapter 5, a novel probabilistic framework for recognising domain-relevant terms from stable noun phrases was developed. The framework allows different evidence to be added or removed depending on the implementation constraints and the desired term recognition output. The framework currently incorporates seven types of evidence which are formalised using word distribution models into a new probabilistic measure called OT. The adaptability of this framework is demonstrated through the close correlation between OT and its heuristic counterpart TH. It was concluded that OT offers the best term recognition solution (compared to three existing heuristic measures) that is both accurate and balanced in terms of recall and precision. Chapter 6 solved the third problem by introducing the SPARTAN technique for corpus construction to alleviate the dependence on manually-crafted corpora during term recognition. SPARTAN uses a probabilistic filter with statistical information gathered from search engine page counts to analyse the domain representativeness of websites for constructing both virtual and local corpora. It was found that adequately large corpora with high coverage and a specific enough vocabulary are necessary for high-performance term recognition. An elaborate evaluation proved that term recognition using SPARTAN-based corpora achieved the best precision and recall in comparison to all other corpora based on existing corpus construction techniques. Chapter 8 addressed the last problem through the proposal of a novel technique, ARCHILES.
It employs term clustering, word disambiguation and lexical simplification techniques with Wikipedia and search engines for acquiring coarse-grained semantic relations between terms. Chapter 7 discussed in detail the multi-pass clustering algorithm TTA with the featureless relatedness measures noW and NWD used by ARCHILES. It was found that the use of mutual information during lexical simplification, TTA and NWD for term clustering, and noW and Wikipedia for word disambiguation enables ARCHILES to cope with complex, uncommon and ambiguous terms during relation acquisition.

10.2 Limitations and Implications for Future Research

There are at least five interesting questions related to the proposed ontology learning system that remain unexplored. First, can the current text processing techniques, including the word collocation measures OU and UH, adapt to domains with highly complex vocabulary such as those involving biological entities (e.g. proteins, genes)? At the very least, the adaptation of existing sentence parsing and noun phrase chunking techniques will be required to meet the needs of these domains. Second, another area for future work is to incorporate sentence parsing and named entity tagging ability into the current SPARTAN-based corpus construction technique for creating annotated text corpora. Automatically-constructed annotated corpora will prove to be invaluable resources for a wide range of applications such as text categorisation and machine translation. Third, there is an increasing interest in mining opinions or sentiments from text. The two main research interests in opinion mining are the automatic building of sentiment dictionaries (typically comprising adjectives and adverbs as sentiments), and the recognition of sentiments expressed in text and their relations with other aspects of the text (e.g. who expressed the sentiment, the sentiment's target). Can the current term recognition technique using OT and TH, which focuses on domain-relevant noun phrases, be extended to handle other parts of speech for opinion mining? If yes, the proposed term recognition technique using SPARTAN-based corpora can ultimately be used to produce high-quality sentiment clouds and sentiment ontologies. Fourth, the current system lacks consideration for the temporal aspect of information, such as publication date, during the discovery of term clouds and lightweight ontologies. With the inclusion of the date factor into these abstractions, users can browse and see the evolution of important concepts and relations across different time periods. Lastly, care should be taken when interpreting the results from some of the preliminary experiments reported in this dissertation. In particular, more work is required to critically evaluate the ARCHILES and ISSAC techniques using larger datasets, and to demonstrate the significance of the results through statistical tests. For instance, the ARCHILES technique for acquiring coarse-grained relations between terms reported in this dissertation has only been tested using small datasets. Assessments using larger datasets are being planned for the near future. It would also be interesting to look at how ARCHILES can be used to complement other techniques for discovering fine-grained semantic relations. In fact, ARCHILES and all techniques reported in this dissertation are constantly undergoing further tests using real-world text from various domains.
New term clouds and lightweight ontologies are constantly being created automatically from recent news articles. These clouds and ontologies are available for browsing via our dedicated Web application1. In other words, this research, the new techniques, and the resulting Web application are subjected to continuous scrutiny and improvement in an effort to achieve better ontology learning performance and to define new application areas. In order to further demonstrate the system's overall ability at cross-domain ontology learning, practical applications using real-world text in several domains have been planned. In the long run, ontology learning research will cross paths with advances in ontology merging. As the number of automatically-created ontologies grows, the need to consolidate and merge them into one single extensive structure will arise. This thesis has only looked at the automatic learning of term clouds and ontologies from cross-domain documents in the English language. Overall, the proposed system represents only a few important steps in the vast area of ontology learning. One area of growing interest is cross-media ontology learning. While news articles may be the focus of the proposed ontology learning system, other text documents including product reviews, medical reports, financial reports, emails and search results can equally benefit from such an automatic abstraction service. The automatic generation of term clouds and lightweight ontologies using different media types such as audio (e.g. call centre recordings, interactive voice response systems) and video (e.g. teleconferencing, video surveillance) is an interesting research direction for future researchers. Another research direction that will gain greater attention in the future is cross-language ontology learning. It remains to be seen how well the proposed system can be transferred to other languages, as we cannot underestimate the level of difficulty involved in handling languages of different morphological and syntactic complexity. All in all, the suggestions and questions raised in this section provide interesting insights into future research directions in ontology learning from text.

1 http://explorer.csse.uwa.edu.au/research/

Bibliography

[1] H. Abdi. The method of least squares. In N. Salkind, editor, Encyclopedia of Measurement and Statistics. Thousand Oaks, CA, USA, 2007.
[2] L. Adamic and B. Huberman. Zipf's law and the internet. Glottometrics, 3(1):143-150, 2002.
[3] E. Adar, J. Teevan, S. Dumais, and J. Elsas. The web changes everything: Understanding the dynamics of web content. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, Barcelona, Spain, 2009.
[4] A. Agbago and C. Barriere. Corpus construction for terminology. In Proceedings of the Corpus Linguistics Conference, Birmingham, UK, 2005.
[5] A. Agustini, P. Gamallo, and G. Lopes. Selection restrictions acquisition for parsing and information retrieval improvement. In Proceedings of the 14th International Conference on Applications of Prolog, Tokyo, Japan, 2001.
[6] J. Allen. Natural Language Understanding. Benjamin/Cummings, California, 1995.
[7] G. Amati and C. van Rijsbergen. Term frequency normalization via Pareto distributions. In Proceedings of the 24th BCS-IRSG European Colloquium on Information Retrieval Research, Glasgow, UK, 2002.
[8] M. Ashburner, C. Ball, J. Blake, D. Botstein, H. Butler, M. Cherry, A. Davis, K. Dolinski, S. Dwight, and J. Eppig. Gene ontology: Tool for the unification of biology. Nature Genetics, 25(1):25-29, 2000.