VIET NAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY
NGUYEN CAM TU
HIDDEN TOPIC DISCOVERY TOWARD
CLASSIFICATION AND CLUSTERING IN
VIETNAMESE WEB DOCUMENTS
MASTER THESIS
HANOI - 2008
VIET NAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY
NGUYEN CAM TU
HIDDEN TOPIC DISCOVERY TOWARD
CLASSIFICATION AND CLUSTERING IN
VIETNAMESE WEB DOCUMENTS
Major: Information Technology
Specificity: Information Systems
Code: 60 48 05
MASTER THESIS
SUPERVISOR: Prof. Dr. Ha Quang Thuy
HANOI - 2008
Acknowledgements
My deepest thanks must first go to my research advisor, Prof. Dr. Ha Quang Thuy, who has offered me endless inspiration in scientific research and led me to this research area. I particularly appreciate his unconditional support and advice, in both the academic environment and daily life, during the last four years.
Many thanks go to Dr. Phan Xuan Hieu, who has given me much advice and many comments. This work would not have been possible without his support. I would also like to thank him for being my friend and older brother, who has taught me many lessons in both scientific research and daily life.
My thanks also go to all members of the “data mining” seminar group. In particular, I would like to thank BSc. Nguyen Thu Trang for her great help in collecting data and conducting experiments.
I gratefully acknowledge the invaluable support and advice, in both technical matters and daily life, of my teachers and colleagues in the Department of Information Systems, Faculty of Technology, Vietnam National University, Hanoi.
I also want to acknowledge the support from the Project QC.06.07 “Vietnamese Named Entity Resolution and Tracking crossover Web Documents”, Vietnam National University, Hanoi; the Project 203906 “Information Extraction Models for finding Entities and Semantic Relations in Vietnamese Web Pages” of the Ministry of Science and Technology, Vietnam; and the National Project 02/2006/HĐ-ĐTCT-KC.01/06-10 “Developing content filter systems to support management and implementation public security – ensure policy”.
Finally, from the bottom of my heart, I would especially like to thank all the members of my family and all my friends. They are truly an endless source of encouragement in my life.
Nguyen Cam Tu
Assurance
I certify that the achievements in this thesis are my own and are not copied from anyone else’s work. Throughout the dissertation, the content presented is either my own proposal or summarized from other sources. All references have clear origins and are properly quoted. I take full responsibility for this statement.
Hanoi, November 15, 2007
Nguyen Cam Tu
Table of Contents
Introduction ..........................................................................................................................1
Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis ...............3
1.1. Introduction ...............................................................................................................3
1.2. The Early Methods ....................................................................................................5
1.2.1. Latent Semantic Analysis ...................................................................................5
1.2.2. Probabilistic Latent Semantic Analysis..............................................................8
1.3. Latent Dirichlet Allocation......................................................................................11
1.3.1. Generative Model in LDA................................................................................12
1.3.2. Likelihood.........................................................................................................13
1.3.3. Parameter Estimation and Inference via Gibbs Sampling................................14
1.3.4. Applications......................................................................................................17
1.4. Summary..................................................................................................................17
Chapter 2. Frameworks of Learning with Hidden Topics..................................................19
2.1. Learning with External Resources: Related Works ................................................19
2.2. General Learning Frameworks ................................................................................20
2.2.1. Frameworks for Learning with Hidden Topics ................................................20
2.2.2. Large-Scale Web Collections as Universal Dataset .........................................22
2.3. Advantages of the Frameworks ...............................................................................23
2.4. Summary..................................................................................................................23
Chapter 3. Topics Analysis of Large-Scale Web Dataset ..................................................24
3.1. Some Characteristics of Vietnamese .......................................................................24
3.1.1. Sound ................................................................................................................24
3.1.2. Syllable Structure .............................................................................................26
3.1.3. Vietnamese Word .............................................................................................26
3.2. Preprocessing and Transformation ..........................................................................27
3.2.1. Sentence Segmentation.....................................................................................27
3.2.2. Sentence Tokenization......................................................................................28
3.2.3. Word Segmentation ..........................................................................................28
3.2.4. Filters ................................................................................................................28
3.2.5. Remove Non Topic-Oriented Words ...............................................................28
3.3. Topic Analysis for VnExpress Dataset ...................................................................29
3.4. Topic Analysis for Vietnamese Wikipedia Dataset ................................................30
3.5. Discussion................................................................................................................31
3.6. Summary..................................................................................................................32
Chapter 4. Deployments of General Frameworks ..............................................................33
4.1. Classification with Hidden Topics ..........................................................................33
4.1.1. Classification Method.......................................................................................33
4.1.2. Experiments ......................................................................................................36
4.2. Clustering with Hidden Topics................................................................................40
4.2.1. Clustering Method ............................................................................................40
4.2.2. Experiments ......................................................................................................45
4.3. Summary..................................................................................................................49
Conclusion ..........................................................................................................................50
Achievements throughout the thesis...............................................................................50
Future Works ..................................................................................................................50
References ..........................................................................................................................52
Vietnamese References ..................................................................................................52
English References .........................................................................................................52
Appendix: Some Clustering Results...................................................................................56
List of Figures
Figure 1.1. Graphical model representation of the aspect model in the asymmetric (a) and
symmetric (b) parameterization. ( [55]) ...............................................................................9
Figure 1.2. Sketch of the probability sub-simplex spanned by the aspect model ( [55])...10
Figure 1.3. Graphical model representation of LDA - The boxes are “plates” representing
replicates. The outer plate represents documents, while the inner plate represents the
repeated choice of topics and words within a document [20] ............................................12
Figure 1.4. Generative model for Latent Dirichlet allocation; Here, Dir, Poiss and Mult
stand for Dirichlet, Poisson, Multinomial distributions respectively.................................13
Figure 1.5. Quantities in the model of latent Dirichlet allocation......................................13
Figure 1.6. Gibbs sampling algorithm for Latent Dirichlet Allocation..............................16
Figure 2.1. Classification with Hidden Topics...................................................................20
Figure 2.2. Clustering with Hidden Topics ........................................................................21
Figure 3.1. Pipeline of Data Preprocessing and Transformation .......................................27
Figure 4.1. Classification with VnExpress topics .............................................................33
Figure 4.2 Combination of one snippet with its topics: an example ..................................35
Figure 4.3. Learning with different topic models of VnExpress dataset; and the baseline
(without topics)...................................................................................................................37
Figure 4.4. Test-out-of-train with increasing numbers of training examples. Here, the
number of topics is set at 60 topics .....................................................................................37
Figure 4.5 F1-Measure for classes and average (over all classes) in learning with 60
topics...................................................................................................................................39
Figure 4.6. Clustering with Hidden Topics ........................................................................40
Figure 4.7. Dendrogram in Agglomerative Hierarchical Clustering..................................42
Figure 4.8 Precision of top 5 (and 10, 20) in best clusters for each query.........................47
Figure 4.9 Coverage of the top 5 (and 10) good clusters for each query ...........................47
List of Tables
Table 3.1. Vowels in Vietnamese.......................................................................................24
Table 3.2. Tones in Vietnamese .........................................................................................25
Table 3.3. Consonants of the Hanoi variety ........................................................................26
Table 3.4. Structure of Vietnamese syllables ....................................................................26
Table 3.5. Functional words in Vietnamese .......................................................................29
Table 3.6. Statistics of topics assigned by humans in VnExpress Dataset.........................29
Table 3.7. Statistics of VnExpress dataset .........................................................................30
Table 3.8 Most likely words for sample topics. Here, we conduct topic analysis with 100
topics...................................................................................................................................30
Table 3.9. Statistic of Vietnamese Wikipedia Dataset ......................................................31
Table 3.10 Most likely words for sample topics. Here, we conduct topic analysis with 200
topics...................................................................................................................................31
Table 4.1 Google search results as training and testing dataset. The search phrases for
training and test data are designed to be exclusive ............................................................34
Table 4.2. Experimental results of baseline (learning without topics)...............................38
Table 4.3. Experimental results of learning with 60 topics of VnExpress dataset.............38
Table 4.4. Some collocations with highest values of chi-square statistic ..........................44
Table 4.5. Queries submitted to Google.............................................................................45
Table 4.6. Parameters for clustering web search results ....................................................46
Notations & Abbreviations
Word or phrase                              Abbreviation
Information Retrieval                       IR
Latent Semantic Analysis                    LSA
Probabilistic Latent Semantic Analysis      PLSA
Latent Dirichlet Allocation                 LDA
Dynamic Topic Models                        DTM
Correlated Topic Models                     CTM
Singular Value Decomposition                SVD
Introduction
The World Wide Web has influenced many aspects of our lives, changing the way we communicate, conduct business, shop, entertain, and so on. However, a large portion of Web data is not organized in systematic and well-structured forms, a situation which poses great challenges to those seeking information on the Web. Consequently, many tasks enabling users to search, navigate and organize web pages more effectively have been posed in the last decade, such as searching, page ranking, web clustering, text classification, etc. To this end, there have been many success stories, such as Google, Yahoo, the Open Directory Project (Dmoz), and Clusty, to name but a few.
Inspired by this trend, the aim of this thesis is to develop efficient systems which are able to overcome the difficulties of dealing with sparse data. The main motivation is that, while being overwhelmed by a huge amount of online data, we sometimes lack data to search or learn effectively. Let us take web search clustering as an example. In order to meet the real-time condition, that is, the response time must be short enough, most online clustering systems only work with the small pieces of text returned from search engines. Unfortunately, those pieces are not long and rich enough to build a good clustering system. A similar situation occurs when searching images based only on captions. Because image captions are only very short and sparse chunks of text, most current image retrieval systems still fail to achieve high accuracy. As a result, much effort has been made recently to take advantage of external resources, such as learning with knowledge-base support, semi-supervised learning, etc., in order to improve accuracy. These approaches, however, have some difficulties: (1) constructing a knowledge base is very time-consuming and labor-intensive, and (2) the results of semi-supervised learning in one application cannot be reused in another one, even in the same domain.
In the thesis, we introduce two general frameworks for learning with hidden topics
discovered from large-scale data collections: one for clustering and another for
classification. Unlike semi-supervised learning, we approach this issue from the point of
view of text/web data analysis that is based on recently successful topic analysis models,
such as Latent Semantic Analysis, Probabilistic-Latent Semantic Analysis, or Latent
Dirichlet Allocation. The underlying idea of the frameworks is that for a domain we
collect a very large external data collection called “universal dataset”, and then build the
learner on both the original data (like snippets or image captions) and a rich set of hidden
topics discovered from the universal data collection. The general frameworks are flexible and general enough to apply to a wide range of domains and languages. Once we analyze
a universal dataset, the resulting hidden topics can be used for several learning tasks in the
same domain. This is also particularly useful for sparse data mining. Sparse data like
snippets returned from a search engine can be expanded and enriched with hidden topics.
Thus, a better performance can be achieved. Moreover, because the method can learn with
smaller data (the meaningful hidden topics rather than all unlabeled data), it requires less
computational resources than semi-supervised learning.
Roadmap: The remainder of this thesis is organized as follows.
Chapter 1 reviews some typical topic analysis methods such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. These models can be considered the basic building blocks of a general framework for probabilistic text modeling and can be used to develop more sophisticated and application-oriented models, such as hierarchical models, author-role models, entity models, and so on. They can also be considered key components of our proposals in subsequent chapters.
Chapter 2 introduces two general frameworks for learning with hidden topics: one for classification and one for clustering. These frameworks are flexible and general enough to apply in many application domains. The key phase shared between the two frameworks is topic analysis of large-scale collections of web documents. The quality of the hidden topics described in this chapter strongly influences the performance of the subsequent stages.
Chapter 3 summarizes some major issues in analyzing collections of Vietnamese documents/web pages. We first review some characteristics of Vietnamese which are significant for data preprocessing and transformation in the subsequent processes. Next, we discuss each step of preprocessing and transforming the data in more detail. Important notes, including specific characteristics of Vietnamese, are highlighted. We also demonstrate the results of topic analysis using LDA on the clean, preprocessed dataset.
Chapter 4 describes the deployments of the general frameworks proposed in Chapter 2 for two tasks: search result classification and search result clustering. The two implementations are based on the topic model estimated from a universal dataset as shown in Chapter 3.
The Conclusion sums up the achievements throughout the previous four chapters. Some
future research topics are also mentioned in this section.
Chapter 1. The Problem of Modeling Text Corpora and Hidden
Topic Analysis
1.1. Introduction
The goal of modeling text corpora and other collections of discrete data is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, clustering, summarization, and similarity and relevance judgments.
Significant achievements have been made on this problem by researchers in the context of information retrieval (IR). The vector space model [48] (Salton and McGill, 1983) – a methodology successfully deployed in modern search technologies – is a typical approach proposed by IR researchers for modeling text corpora. In this model, documents are represented as vectors in a multidimensional Euclidean space. Each axis in this space corresponds to a term (or word). The i-th coordinate of a vector is some function of the number of times the i-th term occurs in the document represented by the vector. The end result is a term-by-document matrix X whose columns contain the coordinates of each of the documents in the corpus. Thus, this model reduces documents of arbitrary length to fixed-length lists of numbers.
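To make the representation concrete, the following is a minimal Python sketch of building such a term-by-document count matrix. The toy corpus and all names are my own choices for illustration, not material from the thesis.

```python
# A minimal sketch of the vector space model: each document becomes a column of
# raw term counts (toy corpus and variable names are illustrative only).
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Build the vocabulary (one axis per term) and the term-by-document matrix X.
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
X = [[Counter(doc)[term] for doc in tokenized] for term in vocab]  # rows: terms, cols: docs

for term, row in zip(vocab, X):
    print(f"{term:8s} {row}")
```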
While the vector space model has some appealing features – notably its identification of sets of words that are discriminative for documents in the collection – the approach also provides a relatively small amount of reduction in description length and reveals little in the way of inter- or intra-document statistical structure. To overcome these shortcomings, IR researchers have proposed other modeling methods such as the generalized vector space model, the topic-based vector space model, etc., among which latent semantic analysis (LSA - Deerwester et al., 1990) [13][26] is the most notable. LSA uses a singular value decomposition of the term-by-document matrix X to identify a linear subspace in the space of term weight features that captures most of the variance in the collection. This approach can achieve a considerable reduction in description length for large collections. Furthermore, Deerwester et al. argue that this method can reveal some aspects of basic linguistic notions such as synonymy or polysemy.
In 1998, Papadimitriou et al. [40] developed a generative probabilistic model of text corpora to study the ability of the LSA approach to recover aspects of the generative model from data. However, once we have a generative model in hand, it is not clear why we should follow the LSI approach – we can attempt to proceed more directly, fitting the model to data using maximum likelihood or Bayesian methods.
Probabilistic LSI (pLSI - Hofmann, 1999) [21][22] is a significant step in this regard. pLSI models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of “topics”. Consequently, each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a probability distribution over a fixed set of topics. This distribution can be considered a “reduced description” associated with the document.
While Hofmann’s work is a useful step toward probabilistic text modeling, it suffers from a severe overfitting problem: the number of parameters grows linearly with the number of documents. Additionally, although pLSA is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents. Latent Dirichlet Allocation (LDA) [5][20], proposed by Blei et al. (2003), is one solution to these problems. Like all of the above methods, LDA is based on the “bag-of-words” assumption – that the order of words in a document can be neglected. In addition, although less often stated formally, these methods also assume that documents are exchangeable; the specific ordering of the documents in a corpus can also be neglected. According to de Finetti (1990), any collection of exchangeable random variables can be represented as a mixture distribution – in general an infinite mixture. Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This is the key idea of the LDA model, which we will consider carefully in Section 1.3.
More recently, Blei et al. have developed two extensions to LDA: Dynamic Topic Models (DTM, 2006) [7] and Correlated Topic Models (CTM, 2007) [8]. DTM is suitable for time-series data analysis thanks to its non-exchangeable modeling of documents. On the other hand, CTM is capable of revealing topic correlations; for example, a document about genetics is more likely to also be about disease than about X-ray astronomy. Though CTM gives a better fit to the data than LDA, it is complicated by the fact that it loses the conjugate relationship between the prior distribution and the likelihood.
In the following sections, we discuss the issues behind these modeling methods in more detail, with particular attention to LDA – a well-known model that has shown its efficiency and success in many applications.
1.2. The Early Methods
1.2.1. Latent Semantic Analysis
The main challenge for machine learning systems is to determine the distinction between the lexical level of “what actually has been said or written” and the semantic level of “what is intended” or “what was referred to” in a text or utterance. This problem is twofold: (i) polysemy, i.e., a word may have multiple meanings and multiple types of usage in different contexts, and (ii) synonymy and semantically related words, i.e., different words may have a similar sense; at least in certain contexts they denote the same concept or, in a weaker sense, the same topic.
Latent semantic analysis (LSA - Deerwester et al., 1990) [13][24][26] is a well-known technique which partially addresses this problem. The key idea is to map from document vectors in word space to a lower-dimensional representation in a so-called concept space or latent semantic space. Mathematically, LSA relies on singular value decomposition (SVD), a well-known factorization method in linear algebra.
a. Latent Semantic Analysis by SVD
In the first step, we represent the text corpus as a term-by-document matrix in which element (i, j) describes the occurrences of term i in document j. Let X be such a matrix; X looks like this:

X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix}

where row i corresponds to term t_i and column j corresponds to document d_j.
Now a row of this matrix is a vector corresponding to a term, giving its relation to each document:

t_i^T = \begin{bmatrix} x_{i,1} & \cdots & x_{i,n} \end{bmatrix}

Likewise, a column of this matrix is a vector corresponding to a document, giving its relation to each term:

d_j^T = \begin{bmatrix} x_{1,j} & \cdots & x_{m,j} \end{bmatrix}
Now, the dot product t_i^T t_p between two term vectors gives us the correlation between the terms over the documents. The matrix product X X^T contains all these dot products: element (i, p) (which equals element (p, i) due to symmetry) contains the dot product t_i^T t_p (= t_p^T t_i). Similarly, the matrix X^T X contains the dot products between all the document vectors, giving their correlation over the terms: d_j^T d_q = d_q^T d_j.

In the next step, we compute the standard SVD of the matrix X and get X = U Σ V^T, where U and V are orthogonal matrices (U^T U = V^T V = I) and the diagonal matrix Σ contains the singular values of X. The matrix products giving us the term and document correlations then become X X^T = U Σ Σ^T U^T and X^T X = V Σ^T Σ V^T respectively.

Since Σ Σ^T and Σ^T Σ are diagonal, U must contain the eigenvectors of X X^T, while V must contain the eigenvectors of X^T X. Both products have the same non-zero eigenvalues, given by the non-zero entries of Σ Σ^T, or equally, the non-zero entries of Σ^T Σ.
Now the decomposition looks like this:

X = U \Sigma V^T = \begin{bmatrix} u_1 & \cdots & u_l \end{bmatrix} \operatorname{diag}(\sigma_1, \ldots, \sigma_l) \begin{bmatrix} v_1 & \cdots & v_l \end{bmatrix}^T

The values σ_1, ..., σ_l are called the singular values, and u_1, ..., u_l and v_1, ..., v_l the left and right singular vectors. Note that the only part of U that contributes to t_i is the i-th row; let this row vector be called t̂_i. Likewise, the only part of V^T that contributes to d_j is the j-th column, d̂_j. These are not the eigenvectors, but depend on all the eigenvectors.
The LSA approximation of X is computed by keeping only the k largest singular values and their corresponding singular vectors from U and V. This yields the rank-k approximation to X with the smallest error, which we write as X_k = U_k Σ_k V_k^T. The appealing thing about this approximation is that not only does it have minimal error, but it also translates the term and document vectors into a concept space. The vector t̂_i then has k entries, each giving the degree to which term i participates in one of the k concepts. Similarly, the vector d̂_j gives the relation between document j and each concept. Based on this approximation, we can now do the following:
- See how related documents j and q are in the concept space by comparing the vectors d̂_j and d̂_q (usually by cosine similarity). This gives us a clustering of the documents.
- Compare terms i and p by comparing the vectors t̂_i and t̂_p, giving us a clustering of the terms in the concept space.
- Given a query, view it as a mini document and compare it to the documents in the concept space.
To do the latter, we must first translate the query into the concept space with the same transformation used on the documents: since d_j = U_k Σ_k d̂_j, we have d̂_j = Σ_k^{-1} U_k^T d_j. This means that, given a query vector q, we must compute q̂ = Σ_k^{-1} U_k^T q before comparing it to the document vectors in the concept space.
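As an illustration of the procedure above, here is a small NumPy sketch that computes the truncated SVD, folds a query into the concept space, and ranks documents by cosine similarity. The toy matrix and all names are my own, not code from the thesis.

```python
# A minimal LSA sketch with NumPy (illustrative only).
import numpy as np

# Toy term-by-document count matrix X (m terms x n documents).
X = np.array([
    [2, 0, 1, 0],   # term "car"
    [1, 1, 0, 0],   # term "truck"
    [0, 0, 3, 2],   # term "flower"
    [0, 1, 2, 1],   # term "tree"
], dtype=float)

k = 2  # number of latent "concepts" to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Documents in the concept space: the columns of S_k @ Vt_k are the d-hat vectors.
doc_concepts = S_k @ Vt_k

# Fold a query into the concept space: q_hat = S_k^{-1} U_k^T q.
q = np.array([1, 1, 0, 0], dtype=float)            # query containing "car", "truck"
q_hat = np.linalg.inv(S_k) @ U_k.T @ q

# Rank documents by cosine similarity to the query in the concept space.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

scores = [cosine(q_hat, doc_concepts[:, j]) for j in range(doc_concepts.shape[1])]
print(scores)
```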
b. Applications
The new concept space can typically be used to:
- Compare documents in the latent semantic space. This is useful for typical learning tasks such as data clustering or document classification.
- Find similar documents across languages, after analyzing a base set of translated documents.
- Find relations between terms (synonymy and polysemy). Synonymy and polysemy are fundamental problems in natural language processing:
  o Synonymy is the phenomenon where different words describe the same idea. Thus, a query to a search engine may fail to retrieve a relevant document that does not contain the words which appeared in the query.
  o Polysemy is the phenomenon where the same word has multiple meanings. So a search may retrieve irrelevant documents containing the desired words in the wrong meaning. For example, a botanist and a computer scientist looking for the word "tree" probably desire different sets of documents.
- Given a query of terms, translate it into the concept space and find matching documents (information retrieval).
c. Limitations
LSA has two drawbacks:
- The resulting dimensions might be difficult to interpret. For instance, in
  {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}
  the (1.3452 * car + 0.2828 * truck) component could be interpreted as "vehicle". However, it is very likely that cases close to
  {(car), (bottle), (flower)} --> {(1.3452 * car + 0.2828 * bottle), (flower)}
  will occur. This leads to results which can be justified on the mathematical level, but have no interpretable meaning in natural language.
- The probabilistic model of LSA does not match observed data: LSA assumes that words and documents form a joint Gaussian model (ergodic hypothesis), while a Poisson distribution has been observed. Thus, a newer alternative is probabilistic latent semantic analysis, based on a multinomial model, which is reported to give better results than standard LSA.
1.2.2. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (PLSA) [21][22] is a statistical technique for the analysis of two-mode and co-occurrence data, with applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas. Compared to standard LSA, PLSA is based on a mixture decomposition derived from a latent class model. This results in a more principled approach with a solid foundation in statistics.
a. The Aspect Model
Suppose we are given a collection of text documents D = \{d_1, \ldots, d_N\} with terms from a vocabulary W = \{w_1, \ldots, w_M\}. The starting point for PLSA is a statistical model called the aspect model. The aspect model is a latent variable model for co-occurrence data in which an unobserved variable z \in Z = \{z_1, \ldots, z_K\} is introduced to capture the hidden topics implied in the documents. Here, N, M and K are the numbers of documents, words, and topics respectively. Hence, we model the joint probability over D \times W by the mixture

P(d, w) = P(d)\, P(w \mid d), \qquad P(w \mid d) = \sum_{z \in Z} P(w \mid z)\, P(z \mid d) \qquad (1.1)

Like virtually all statistical latent variable models, the aspect model relies on a conditional independence assumption, i.e., d and w are independent conditioned on the state of the associated latent variable (the graphical model representing this is shown in Figure 1.1(a)).
Figure 1.1. Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b)
parameterization. ( [53])
It is necessary to note that the aspect model can be equivalently parameterized by (cf. Figure 1.1(b))

P(d, w) = \sum_{z \in Z} P(z)\, P(d \mid z)\, P(w \mid z) \qquad (1.2)

which is perfectly symmetric with respect to both documents and words.
b. Model Fitting with the Expectation Maximization Algorithm
The aspect model is estimated by the traditional procedure for maximum likelihood estimation, i.e., Expectation Maximization (EM). EM iterates two coupled steps: (i) an expectation (E) step in which posterior probabilities are computed for the latent variables; and (ii) a maximization (M) step in which the parameters are updated. Standard calculations give us the E-step formula

P(z \mid d, w) = \frac{P(z)\, P(d \mid z)\, P(w \mid z)}{\sum_{z' \in Z} P(z')\, P(d \mid z')\, P(w \mid z')} \qquad (1.3)

as well as the following M-step equations

P(w \mid z) \propto \sum_{d \in D} n(d, w)\, P(z \mid d, w) \qquad (1.4)

P(d \mid z) \propto \sum_{w \in W} n(d, w)\, P(z \mid d, w) \qquad (1.5)

P(z) \propto \sum_{d \in D} \sum_{w \in W} n(d, w)\, P(z \mid d, w) \qquad (1.6)

where n(d, w) denotes the number of occurrences of word w in document d.
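For illustration, the following NumPy sketch implements the EM updates of Eqs. 1.3-1.6 on a random toy count matrix. The variable names and toy data are my own; this is only a sketch of the procedure, not estimation code used elsewhere in the thesis.

```python
# A rough sketch of the PLSA EM updates (Eqs. 1.3-1.6) with NumPy.
import numpy as np

rng = np.random.default_rng(0)
n = rng.integers(0, 5, size=(8, 20)).astype(float)     # toy counts: 8 terms x 20 documents
M, N, K = n.shape[0], n.shape[1], 3                     # words, documents, topics

# Random initialization of P(w|z), P(d|z), P(z), each normalized.
p_w_z = rng.random((M, K)); p_w_z /= p_w_z.sum(axis=0)
p_d_z = rng.random((N, K)); p_d_z /= p_d_z.sum(axis=0)
p_z = np.full(K, 1.0 / K)

for _ in range(50):
    # E-step: P(z|d,w) for every (word, document) pair, shape (M, N, K).
    joint = p_z[None, None, :] * p_w_z[:, None, :] * p_d_z[None, :, :]
    p_z_dw = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)

    # M-step: re-estimate parameters from the expected counts n(d,w) * P(z|d,w).
    weighted = n[:, :, None] * p_z_dw
    p_w_z = weighted.sum(axis=1); p_w_z /= p_w_z.sum(axis=0)
    p_d_z = weighted.sum(axis=0); p_d_z /= p_d_z.sum(axis=0)
    p_z = weighted.sum(axis=(0, 1)); p_z /= p_z.sum()

print("P(z) =", np.round(p_z, 3))
```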
c. Probabilistic Latent Semantic Space
Let us consider the topic-conditional multinomial distributions p(·|z) over the vocabulary as points on the (M−1)-dimensional simplex of all possible multinomials. Via their convex hull, the K points define an L ≤ K−1 dimensional sub-simplex. The modeling assumption expressed by (1.1) is that the conditional distribution P(w|d) of every document is approximated by a multinomial representable as a convex combination of the P(w|z), in which the mixture weights P(z|d) uniquely define a point on the spanned sub-simplex, which can be identified with a concept space. A simple illustration of this idea is shown in Figure 1.2.
Figure 1.2. Sketch of the probability sub-simplex spanned by the aspect model ( [53])
In order to clarify the relation to LSA, it is useful to reformulate the aspect model as parameterized by (1.2) in matrix notation. By defining the matrices Û = (P(d_i | z_k))_{i,k}, V̂ = (P(w_j | z_k))_{j,k} and Σ̂ = diag(P(z_k))_k, we can write the joint probability model P as a matrix product P = Û Σ̂ V̂^T. Comparing this with SVD, we can draw the following observations: (i) outer products between rows of Û and V̂ reflect conditional independence in PLSA; (ii) the mixture proportions in PLSA substitute for the singular values. Nevertheless, the main difference between PLSA and LSA lies in the objective function used to specify the optimal approximation. While LSA uses the L2 or Frobenius norm, which corresponds to an implicit additive Gaussian noise assumption on counts, PLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model. As is well known, this corresponds to a minimization of the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model, which is very different from any view based on squared deviations. On the modeling side, this offers crucial advantages: for example, the mixture approximation P of the term-by-document matrix is a well-defined probability distribution. In contrast, LSA does not define a properly normalized probability distribution, and the approximation of the term-by-document matrix may contain negative entries. In addition, there is no obvious interpretation of the directions in the LSA latent space, while the directions in the PLSA space are interpretable as multinomial word distributions. The probabilistic approach can also take advantage of the well-established statistical theory for model selection and complexity control, e.g., to determine the optimal number of latent space dimensions. Choosing the number of dimensions in LSA, on the other hand, is typically based on ad hoc heuristics.
d. Limitations
In the aspect model, notice that d is a dummy index into the list of documents in the training set. Consequently, d is a multinomial random variable with as many possible values as there are training documents, and the model learns the topic mixtures p(z|d) only for those documents on which it is trained. For this reason, pLSI is not a well-defined generative model of documents; there is no natural way to assign probability to a previously unseen document.
A further difficulty with pLSA, which also originates from the use of a distribution indexed by training documents, is that the number of parameters grows linearly with the number of training documents. The parameters for a K-topic pLSI model are K multinomial distributions of size V and M mixtures over the K hidden topics. This gives KV + KM parameters and therefore linear growth in M. The linear growth in parameters suggests that the model is prone to overfitting and, empirically, overfitting is indeed a serious problem. In practice, a tempering heuristic is used to smooth the parameters of the model for acceptable predictive performance. It has been shown, however, that overfitting can occur even when tempering is used (Popescul et al., 2001 [41]).
Latent Dirichlet Allocation (LDA), which is described in Section 1.3, overcomes both of these problems by treating the topic mixture weights as a K-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set.
1.3. Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) [7][20] is a generative probabilistic model for
collections of discrete data such as text corpora. It was developed by David Blei, Andrew
Ng, and Michael Jordan in 2003. By nature, LDA is a three-level hierarchical Bayesian
model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic, in turn, is modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. In the following sections, we discuss the generative model, parameter estimation, and inference in LDA in more detail.
1.3.1. Generative Model in LDA
Given a corpus of M documents denoted by D = \{d_1, d_2, \ldots, d_M\}, in which document m consists of N_m words w_i drawn from a vocabulary of terms \{t_1, \ldots, t_V\}, the goal of LDA is to find the latent structure of “topics” or “concepts” which captures the meaning of the text and which is imagined to be obscured by “word choice” noise. Though the terminology of “hidden topics” or “latent concepts” has already been encountered in LSA and pLSA, LDA provides a complete generative model that has shown better results than the earlier approaches.
Consider the graphical model representation of LDA shown in Figure 1.3. The generative process can be interpreted as follows: LDA generates a stream of observable words w_{m,n}, partitioned into documents d_m. For each of these documents, a topic proportion ϑ_m is drawn, and from this, topic-specific words are emitted. That is, for each word, a topic indicator z_{m,n} is sampled according to the document-specific mixture proportion, and then the corresponding topic-specific term distribution ϕ_{z_{m,n}} is used to draw a word. The topics ϕ_k are sampled once for the entire corpus. The complete (annotated) generative model is presented in Figure 1.4, and Figure 1.5 gives a list of all involved quantities.
Figure 1.3. Graphical model representation of LDA - The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document [20]
Figure 1.4. Generative model for Latent Dirichlet allocation; Here, Dir, Poiss and Mult stand for Dirichlet,
Poisson, Multinomial distributions respectively.
Figure 1.5. Quantities in the model of latent Dirichlet allocation
1.3.2. Likelihood
According to the model, the probability that a word w_{m,n} instantiates a particular term t given the LDA parameters is:

p(w_{m,n} = t \mid \vec{\vartheta}_m, \Phi) = \sum_{k=1}^{K} p(w_{m,n} = t \mid \vec{\varphi}_k)\, p(z_{m,n} = k \mid \vec{\vartheta}_m) \qquad (1.7)

which corresponds to one iteration on the word plate of the graphical model. From the topology of the graphical model, we can further specify the complete-data likelihood of a document, i.e., the joint distribution of all known and hidden variables given the hyperparameters:

p(\vec{d}_m, \vec{z}_m, \vec{\vartheta}_m, \Phi \mid \vec{\alpha}, \vec{\beta}) = \underbrace{\prod_{n=1}^{N_m} \underbrace{p(w_{m,n} \mid \vec{\varphi}_{z_{m,n}})\, p(z_{m,n} \mid \vec{\vartheta}_m)}_{\text{word plate}} \cdot p(\vec{\vartheta}_m \mid \vec{\alpha})}_{\text{document plate (1 document)}} \cdot \underbrace{p(\Phi \mid \vec{\beta})}_{\text{topic plate}} \qquad (1.8)

Specifying this distribution is often simple and useful as a basis for other derivations. We can obtain the likelihood of a document \vec{d}_m, i.e., of the joint event of all word occurrences, as one of its marginal distributions by integrating out the distributions \vec{\vartheta}_m and \Phi and summing over z_{m,n}:

p(\vec{d}_m \mid \vec{\alpha}, \vec{\beta}) = \iint p(\vec{\vartheta}_m \mid \vec{\alpha})\, p(\Phi \mid \vec{\beta}) \prod_{n=1}^{N_m} \sum_{z_{m,n}} p(w_{m,n} \mid \vec{\varphi}_{z_{m,n}})\, p(z_{m,n} \mid \vec{\vartheta}_m)\, d\Phi\, d\vec{\vartheta}_m \qquad (1.9)

\phantom{p(\vec{d}_m \mid \vec{\alpha}, \vec{\beta})} = \iint p(\vec{\vartheta}_m \mid \vec{\alpha})\, p(\Phi \mid \vec{\beta}) \prod_{n=1}^{N_m} p(w_{m,n} \mid \vec{\vartheta}_m, \Phi)\, d\Phi\, d\vec{\vartheta}_m \qquad (1.10)

Finally, the likelihood of the complete corpus W = \{\vec{d}_m\}_{m=1}^{M} is determined by the product of the likelihoods of the independent documents:

p(W \mid \vec{\alpha}, \vec{\beta}) = \prod_{m=1}^{M} p(\vec{d}_m \mid \vec{\alpha}, \vec{\beta}) \qquad (1.11)
1.3.3. Parameter Estimation and Inference via Gibbs Sampling
Exact estimation for LDA is generally intractable. The common solution is to use approximate inference algorithms such as mean-field variational expectation maximization, expectation propagation, and Gibbs sampling [20].
a. Gibbs Sampling
Gibbs sampling is a special case of Markov chain Monte Carlo (MCMC) and often yields relatively simple algorithms for approximate inference in high-dimensional models such as LDA. Through the stationary behavior of a Markov chain, MCMC methods can emulate high-dimensional probability distributions p(\vec{x}). This means that one sample is generated for each transition in the chain after a stationary state of the chain has been reached, which happens after a so-called “burn-in period” that eliminates the influence of the initialization parameters. In Gibbs sampling, the dimensions x_i of the distribution are sampled alternately one at a time, conditioned on the values of all other dimensions, which we denote \vec{x}_{\neg i}. The algorithm works as follows:
1. Choose dimension i (randomly or by permutation)
2. Sample x_i from p(x_i \mid \vec{x}_{\neg i})
To build a Gibbs sampler, the full conditionals p(x_i \mid \vec{x}_{\neg i}) must be calculated, which is possible using

p(x_i \mid \vec{x}_{\neg i}) = \frac{p(\vec{x})}{\int p(\vec{x})\, dx_i} \quad \text{with } \vec{x} = \{x_i, \vec{x}_{\neg i}\} \qquad (1.12)

For models that contain hidden variables \vec{z}, their posterior given the evidence, p(\vec{z} \mid \vec{x}), is a distribution commonly wanted. With Eq. 1.12, the general formulation of a Gibbs sampler for such latent-variable models becomes:

p(z_i \mid \vec{z}_{\neg i}, \vec{x}) = \frac{p(\vec{z}, \vec{x})}{\int_Z p(\vec{z}, \vec{x})\, dz_i} \qquad (1.13)

where the integral changes to a sum for discrete variables. With a sufficient number of samples \tilde{\vec{z}}_r, r \in [1, R], the latent-variable posterior can be approximated using:

p(\vec{z} \mid \vec{x}) \approx \frac{1}{R} \sum_{r=1}^{R} \delta(\vec{z} - \tilde{\vec{z}}_r) \qquad (1.14)

with the Kronecker delta \delta(\vec{u}) = \{1 \text{ if } \vec{u} = 0;\ 0 \text{ otherwise}\}.
b. Parameter Estimation
Heinrich [20] applied the above hidden-variable method to develop a Gibbs sampler for LDA as shown in Figure 1.6. The hidden variables here are z_{m,n}, i.e., the topics that appear with the words of the corpus w_{m,n}. We do not need to include the parameter sets Θ and Φ because they are just the statistics of the associations between the observed w_{m,n} and the corresponding z_{m,n}, the state variables of the Markov chain. Heinrich [20] shows a sequence of calculations leading to the following formulation of the full conditional for LDA:

p(z_i \mid \vec{z}_{\neg i}, \vec{w}) = \frac{n_{z,\neg i}^{(t)} + \beta_t}{\left[\sum_{v=1}^{V} n_z^{(v)} + \beta_v\right] - 1} \cdot \frac{n_{m,\neg i}^{(z)} + \alpha_z}{\left[\sum_{z'=1}^{K} n_m^{(z')} + \alpha_{z'}\right] - 1} \qquad (1.15)

The other hidden variables of LDA can then be calculated as follows:

\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{v=1}^{V} n_k^{(v)} + \beta_v} \qquad (1.16)

\vartheta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{z=1}^{K} n_m^{(z)} + \alpha_z} \qquad (1.17)
- Initialization
  zero all count variables n_m^{(z)}, n_m, n_z^{(t)}, n_z
  for all documents m ∈ [1, M] do
    for all words n ∈ [1, N_m] in document m do
      sample topic index z_{m,n} ~ Mult(1/K)
      increment document-topic count: n_m^{(z)} + 1
      increment document-topic sum: n_m + 1
      increment topic-term count: n_z^{(t)} + 1
      increment topic-term sum: n_z + 1
    end for
  end for
- Gibbs sampling over burn-in period and sampling period
  while not finished do
    for all documents m ∈ [1, M] do
      for all words n ∈ [1, N_m] in document m do
        - for the current assignment of topic z to term t for word w_{m,n}:
          decrement counts and sums: n_m^{(z)} − 1; n_m − 1; n_z^{(t)} − 1; n_z − 1
        - multinomial sampling according to Eq. 1.15 (using the decremented counts):
          sample topic index z̃ ~ p(z_i | z_{¬i}, w)
        - use the new assignment of topic z̃ to term t for word w_{m,n} to:
          increment counts and sums: n_m^{(z̃)} + 1; n_m + 1; n_{z̃}^{(t)} + 1; n_{z̃} + 1
      end for
    end for
    - check convergence and read out parameters
    if converged and L sampling iterations since last read out then
      - the different parameter read-outs are averaged
      read out parameter set Φ according to Eq. 1.16
      read out parameter set Θ according to Eq. 1.17
    end if
  end while
Figure 1.6. Gibbs sampling algorithm for Latent Dirichlet Allocation.
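For readers who prefer code to pseudocode, the following is a compact Python sketch of the collapsed Gibbs sampler of Figure 1.6 on a toy corpus. It is an illustrative re-implementation under my own naming and data, not the estimation tool used for the experiments reported later in the thesis.

```python
# A compact collapsed Gibbs sampler for LDA, following the algorithm in Figure 1.6.
import numpy as np

docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 1, 4, 4]]    # documents as lists of term ids
V, K, alpha, beta = 5, 2, 0.5, 0.1                    # vocabulary size, topics, hyperparameters
rng = np.random.default_rng(1)

n_mz = np.zeros((len(docs), K))   # document-topic counts n_m^{(z)}
n_zt = np.zeros((K, V))           # topic-term counts n_z^{(t)}
n_z = np.zeros(K)                 # topic sums n_z
z_assign = []                     # current topic of every word

# Initialization: assign each word a random topic and set the counts.
for m, doc in enumerate(docs):
    z_assign.append([])
    for t in doc:
        z = rng.integers(K)
        z_assign[m].append(z)
        n_mz[m, z] += 1; n_zt[z, t] += 1; n_z[z] += 1

# Gibbs sampling: resample each word's topic from the full conditional (Eq. 1.15).
for it in range(200):
    for m, doc in enumerate(docs):
        for n, t in enumerate(doc):
            z = z_assign[m][n]
            n_mz[m, z] -= 1; n_zt[z, t] -= 1; n_z[z] -= 1
            p = (n_zt[:, t] + beta) / (n_z + V * beta) * (n_mz[m] + alpha)
            z = rng.choice(K, p=p / p.sum())
            z_assign[m][n] = z
            n_mz[m, z] += 1; n_zt[z, t] += 1; n_z[z] += 1

phi = (n_zt + beta) / (n_z[:, None] + V * beta)                          # Eq. 1.16
theta = (n_mz + alpha) / (n_mz.sum(axis=1, keepdims=True) + K * alpha)   # Eq. 1.17
print(np.round(phi, 2)); print(np.round(theta, 2))
```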
c. Inference
Given an estimated LDA model, we can now do topic inference for unknown documents by a similar sampling procedure. A new document m̃ is a vector of words \tilde{\vec{w}}; our goal is to estimate the posterior distribution of topics \tilde{\vec{z}} given the word vector \tilde{\vec{w}} of the query and the LDA model L(Θ, Φ): p(\tilde{\vec{z}} \mid \tilde{\vec{w}}, L) = p(\tilde{\vec{z}} \mid \tilde{\vec{w}}, \vec{w}, \vec{z}). In order to find the required counts for a complete new document, similar reasoning leads to the Gibbs sampling update:

p(\tilde{z}_i = k \mid \tilde{\vec{z}}_{\neg i}, \tilde{\vec{w}}; \vec{z}, \vec{w}) = \frac{n_k^{(t)} + \tilde{n}_{k,\neg i}^{(t)} + \beta_t}{\left[\sum_{v=1}^{V} n_k^{(v)} + \tilde{n}_k^{(v)} + \beta_v\right] - 1} \cdot \frac{\tilde{n}_{\tilde{m},\neg i}^{(k)} + \alpha_k}{\left[\sum_{z=1}^{K} n_{\tilde{m}}^{(z)} + \alpha_z\right] - 1} \qquad (1.18)

where the new variable \tilde{n}_k^{(t)} counts the observations of term t and topic k in the unseen document. This equation gives a colorful example of the workings of Gibbs posterior sampling: the high estimated word-topic associations n_k^{(t)} will dominate the multinomial masses compared to the contributions of \tilde{n}_k^{(t)} and \tilde{n}_{\tilde{m}}^{(k)}, which are chosen randomly. Consequently, by repeatedly sampling from the distribution and updating \tilde{n}_{\tilde{m}}^{(k)}, the masses of the topic-word associations are propagated into document-topic associations. Note the smoothing influence of the Dirichlet hyperparameters.
Applying Eq. 1.17 gives the topic distribution for the unknown document:

\vartheta_{\tilde{m},k} = \frac{n_{\tilde{m}}^{(k)} + \alpha_k}{\sum_{z=1}^{K} n_{\tilde{m}}^{(z)} + \alpha_z} \qquad (1.19)
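A corresponding sketch of topic inference for an unseen document (Eqs. 1.18-1.19) is given below. The estimated topic-term counts are assumed given and kept fixed; all names and toy numbers are my own, and the code is only illustrative.

```python
# A sketch of topic inference for an unseen document: the topic-term counts n_zt
# from a previously estimated model are held fixed, and only the new document's
# counts (the "tilde" counts) are resampled.
import numpy as np

def infer_topics(new_doc, n_zt, alpha, beta, iters=100, seed=0):
    """new_doc: list of term ids; n_zt: (K, V) topic-term counts from estimation."""
    rng = np.random.default_rng(seed)
    K, V = n_zt.shape
    n_z = n_zt.sum(axis=1)
    new_mz = np.zeros(K)                  # document-topic counts of the new document
    new_zt = np.zeros((K, V))             # its topic-term counts
    z_assign = []
    for t in new_doc:                     # random initialization
        z = rng.integers(K)
        z_assign.append(z); new_mz[z] += 1; new_zt[z, t] += 1
    for _ in range(iters):
        for n, t in enumerate(new_doc):
            z = z_assign[n]
            new_mz[z] -= 1; new_zt[z, t] -= 1
            p = (n_zt[:, t] + new_zt[:, t] + beta) / (n_z + new_zt.sum(axis=1) + V * beta) \
                * (new_mz + alpha)
            z = rng.choice(K, p=p / p.sum())
            z_assign[n] = z; new_mz[z] += 1; new_zt[z, t] += 1
    return (new_mz + alpha) / (new_mz.sum() + K * alpha)   # Eq. 1.19

# Toy usage: two topics over five terms, and a short "document" of term ids.
n_zt = np.array([[30., 20., 5., 1., 1.], [1., 2., 10., 25., 20.]])
print(np.round(infer_topics([0, 1, 1, 2], n_zt, alpha=0.5, beta=0.1), 3))
```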
1.3.4. Applications
LDA has been successfully applied to text modeling and feature reduction in text
classification [5]. Recent work has also used LDA as a building block in more
sophisticated topic models such as author-document models [42], abstract-reference
models [15], syntax-semantic models [18] and image-caption models [6]. Additionally,
the same kinds of modeling tools have been used in a variety of non-text settings, such as
image processing [46], and the modeling of user profiles [17].
1.4. Summary
This chapter has presented some typical topic analysis methods such as LSA, PLSA, and LDA. These models can be considered the basic building blocks of a general framework for probabilistic text modeling and can be used to develop more sophisticated and application-oriented models. They can also be seen as key components of our proposals in subsequent chapters.
Among the topic analysis methods, we pay particular attention to LDA, a generative probabilistic model for collections of discrete data such as text corpora. It was developed by David Blei, Andrew Ng, and Michael Jordan in 2003 and has proven its success in many applications. Given the data, the goal is to reverse the generative process to estimate the model parameters. However, exact inference or estimation, even for a not-so-complex model like LDA, is intractable. Consequently, there have been many attempts to apply approximate approaches to this task, among which Gibbs sampling is one of the most suitable. Gibbs sampling, which is also described in this chapter, is a special case of Markov chain Monte Carlo (MCMC) and often yields relatively simple algorithms for approximate inference in high-dimensional models like LDA.
Chapter 2. Frameworks of Learning with Hidden Topics
2.1. Learning with External Resources: Related Works
In recent time, there were a lot of attempts making use of external resources to enhance
learning performance. Depending on types of external resources, these methods can be
roughly classified into 2 categories: those make use of unlabeled data, and those exploit
structured or semi-structured data.
The first category is commonly referred to under the name of semi-supervised learning. The key argument is that unlabeled examples are significantly easier to collect than labeled ones. One example of this is web-page classification. Suppose that we want a program to electronically visit some web site and download all the web pages of interest to us, such as all the Computer Science faculty pages, or all the course home pages at some university. To train such a system to automatically classify web pages, one would typically rely on hand-labeled web pages. Unfortunately, these labeled examples are fairly expensive to obtain because they require human effort. In contrast, the web has hundreds of millions of unlabeled web pages that can be inexpensively gathered using a web crawler. Therefore, we would like the learning algorithms to take as much advantage of the unlabeled data as possible.
Semi-supervised learning has received a lot of attention in the last decade. Yarowsky (1995) uses self-training for word sense disambiguation, e.g., deciding whether the word “plant” means a living organism or a factory in a given context. Rosenberg et al. (2005) apply it to object detection from images, and show that the semi-supervised technique compares favorably with a state-of-the-art detector. In 2000, Nigam and Ghani [30] performed extensive empirical experiments to compare co-training with generative mixture models and Expectation Maximization (EM). Jones (2005) used co-training, co-EM and other related methods for information extraction from text. In addition, many works have applied Transductive Support Vector Machines (TSVMs), which use unlabeled data to determine the optimal decision boundary.
The second category covers works exploiting resources like Wikipedia to support the learning process. Gabrilovich et al. (2007) [16] demonstrated the value of using Wikipedia as an additional source of features for text classification and for determining the semantic relatedness between texts. Banerjee et al. (2007) [3] also extract titles of Wikipedia articles and use them as features for clustering short texts. Unfortunately, this approach is not very flexible in the sense that it depends heavily on the external resource and the application.
This chapter describes frameworks for learning with the support of a topic model estimated from a large universal dataset. This topic model can be considered background knowledge for the domain of application. It also helps the learning process to capture the hidden topics of the domain, the relationships between topics and words as well as between words themselves, thus partially overcoming the limitation of different word choices in text.
2.2. General Learning Frameworks
This section presents general frameworks for learning with the support of hidden topics. The main motivation is to gain benefits from huge sources of online data in order to enhance the quality of text/Web clustering and classification. Unlike previous studies of learning with external resources, we approach this issue from the point of view of text/Web data analysis, based on recently successful latent topic analysis models like LSA, pLSA, and LDA. The underlying idea of the frameworks is that, for each learning task, we collect a very large external data collection called a “universal dataset”, and then build a learner on both the learning data and a rich set of hidden topics discovered from that data collection.
2.2.1. Frameworks for Learning with Hidden Topics
Corresponding to the two typical learning problems, i.e., classification and clustering, we describe two frameworks with some differences in their architectures.
a. Framework for Classification
Figure 2.1. Classification with Hidden Topics
Nowadays, the continuous development of the Internet has created a huge number of documents which are difficult to manage, organize and navigate. As a result, the task of automatic classification, which is to categorize textual documents into two or more predefined classes, has received a lot of attention.
Several machine-learning methods have been applied to text classification, including decision trees, neural networks, support vector machines, etc. In typical applications of machine-learning methods, the training data is passed to a learning phase. The result of the learning step is an appropriate classifier capable of categorizing new documents. However, in cases where the training data is not as plentiful as expected or the data to be classified is too sparse [52], learning with only the training data cannot provide a satisfactory classifier. Inspired by this fact, we propose a framework that enables us to enrich both the training data and the new incoming data with hidden topics from an available large dataset, so as to enhance the performance of text classification.
Classification with hidden topics is described in Figure 2.1. We first collect a very large external data collection called the “universal dataset”. Next, a topic analysis technique such as pLSA, LDA, etc. is applied to this dataset. The result of this step is an estimated topic model which consists of hidden topics and the probability distributions of words over these topics. Using this model, we can do topic inference for the training dataset and for new data. For each document, the output of topic inference is a probability distribution over the hidden topics – the topics analyzed in the estimation phase – given the document. The topic distributions of the training dataset are then combined with the training dataset itself to learn the classifier. In a similar way, new documents that need to be classified are combined with their topic distributions to create the so-called “new data with hidden topics” before being passed to the learned classifier.
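The sketch below illustrates this combination step under some assumptions of my own: topic inference is wrapped in a hypothetical infer_topic_distribution() function (e.g., the Gibbs-based inference sketched in Chapter 1), topics above a probability cutoff are appended to the text as pseudo-words, and scikit-learn stands in for the learner; the thesis does not prescribe a particular toolkit or combination scheme at this point.

```python
# A rough sketch of the classification framework in Figure 2.1: each snippet is
# enriched with pseudo-words for its most probable hidden topics before training.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def enrich(text, infer_topic_distribution, cutoff=0.05):
    """Append a pseudo-word 'topicK' for every topic whose probability exceeds cutoff."""
    theta = infer_topic_distribution(text)   # assumed: returns a list of K probabilities
    topic_words = " ".join(f"topic{k}" for k, p in enumerate(theta) if p > cutoff)
    return text + " " + topic_words

def train_with_hidden_topics(snippets, labels, infer_topic_distribution):
    enriched = [enrich(s, infer_topic_distribution) for s in snippets]
    model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(enriched, labels)
    return model

# Toy usage with a dummy inference function (fixed topic distribution), just to
# exercise the code; new snippets are enriched the same way before prediction.
if __name__ == "__main__":
    dummy_infer = lambda text: [0.5, 0.3, 0.2]
    clf = train_with_hidden_topics(["python courses online", "cheap flights to hanoi"],
                                   ["education", "travel"], dummy_infer)
    print(clf.predict([enrich("learn python programming", dummy_infer)]))
```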
b. Framework for Clustering
Figure 2.2. Clustering with Hidden Topics
Text clustering is to automatically generate groups (clusters) of documents based on the similarity or distance among documents. Unlike classification, the clusters are not known in advance. The user can optionally specify the desired number of clusters. The documents are then organized into clusters, each of which contains “close” documents.
Clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative (“bottom-up”) or divisive (“top-down”). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger ones. Divisive algorithms begin with the whole set and divide it into smaller ones.
The distance measure, which determines how the similarity of two documents is calculated, is a key to the success of any text clustering algorithm. Some documents may be close to one another according to one distance and farther away according to another. Common distance functions are the Euclidean distance, the Manhattan distance (also called the taxicab norm or 1-norm), and the maximum norm, to name but a few.
Web clustering, which is a type of text clustering specific to web pages, can be offline or online. Offline clustering is applied to the whole store of available web documents and has no constraint on response time. In online clustering, the algorithms need to meet the “real-time condition”, i.e., the system needs to perform clustering as fast as possible. For example, the algorithm should take the document snippets instead of the whole documents as input, since downloading the original documents is time-consuming. The question here is how to enhance the quality of clustering for such document snippets in online web clustering. Inspired by the fact that those snippets are only small pieces of text (and thus poor in content), we propose a framework that enriches them with hidden topics for clustering (Figure 2.2). This framework and its topic analysis phase are similar to the ones for classification. The only differences are due to the essential differences between classification and clustering.
2.2.2. Large-Scale Web Collections as Universal Dataset
Despite the obvious differences between the two learning frameworks, they share a key phase – the phase of analyzing topics for the previously collected dataset. Here are some important considerations for this phase:
- The degree of coverage of the dataset: the universal dataset should be large enough to cover the topics underlying the domain of application.
- Preprocessing: this step is very important for obtaining good analysis results. Although there is no general instruction for all languages, the common advice is to remove as many noise words as possible, such as functional words, stop words, and too frequent or too rare words.
- Methods for topic analysis: some applicable analysis methods have been mentioned in Chapter 1. The trade-off between the quality of topic analysis and time complexity should be taken into account. For example, topic analysis for snippets in online clustering should be as short as possible to meet the “real-time” condition.
2.3. Advantages of the Frameworks
- The frameworks are flexible and general enough to apply to any domain or language. Once we have estimated a topic model from a universal dataset, its hidden topics can be useful for several learning tasks in the same domain.
- They are particularly useful for sparse data mining. Sparse data such as snippets returned from a search engine can be enriched with hidden topics, and thus enhanced performance can be achieved.
- Because they learn from smaller data, the presented methods require less computational resources than semi-supervised learning.
- Thanks to the generative model for analyzing topics of new documents (in the case of LDA), we have a natural way to map documents from the term space into the topic space. This is an advantage over the heuristic-based mapping in previous approaches [16][3][10].
2.4. Summary
This chapter described two general frameworks for learning with hidden topics – one for classification and one for clustering – and their advantages. The main advantages of our frameworks are that they are flexible and general enough to apply to any domain or language and are able to deal with sparse data. The key phase shared by the two frameworks is topic analysis for a large-scale web collection called the “universal dataset”. The quality of the topic model estimated from this data greatly influences the performance of learning in the later phases.
Chapter 3. Topics Analysis of Large-Scale Web Dataset
As mentioned earlier, topic analysis of a universal dataset is key to the success of our proposed methods. Thus, toward Vietnamese text mining, this chapter discusses the problem of topic analysis for large-scale web datasets in Vietnamese.
3.1. Some Characteristics of Vietnamese
Vietnamese is the national and official language of Vietnam [48]. It is the mother tongue of the Vietnamese people, who constitute 86% of Vietnam’s population, and of about three million overseas Vietnamese. It is also spoken as a second language by some ethnic minorities of Vietnam. Many words in Vietnamese are borrowed from Chinese, and the language was originally written in a Chinese-like writing system. The current writing system of Vietnamese is a modified Latin alphabet, with additional diacritics for tones and certain letters.
3.1.1. Sound
a. Vowels
Like other Southeast Asian languages, Vietnamese has a comparatively large number of vowels. Below is a chart of the vowels in Vietnamese:
Table 3.1. Vowels in Vietnamese
The correspondence between the orthography and pronunciation is rather complicated.
For example, the vowel i is often written as y; both may represent [i], in which case the
difference is in the quality of the preceding vowel. For instance, “tai” (ear) is [tāi] while
tay (hand/arm) is [tāj].
In addition to single vowels (monophthongs), Vietnamese has diphthongs (âm đôi). Three diphthongs consist of a vowel plus a: “ia”, “ua”, “ưa” (when followed by a consonant, they become “iê”, “uô”, and “ươ”, respectively). The other diphthongs consist of a vowel plus a semivowel; there are two such semivowels, /j/ (written i or y) and /w/ (written o or u). The majority of diphthongs in Vietnamese are formed this way.
Furthermore, these semivowels may also follow the first three diphthongs (“ia”, “ua”, “ưa”), resulting in triphthongs.
b. Tones
Vietnamese vowels are all pronounced with an inherent tone. Tones differ in pitch, length, contour melody, intensity, and glottalization (with or without accompanying constricted vocal cords).
Tone is indicated by diacritics written above or below the vowel (most tone diacritics appear above the vowel; however, the “nặng” tone dot goes below the vowel). The six tones in Vietnamese are:
Table 3.2. Tones in Vietnamese
c. Consonants
The consonants of the Hanoi variety are listed below in Vietnamese orthography, except for the bilabial approximant, which is written here as “w” (in the writing system it is written the same as the vowels “o” and “u”). Some consonant sounds are written with a single letter (like “p”), others with a two-letter digraph (like “ph”), and others with more than one letter or digraph (the velar stop is written variously as “c”, “k”, or “q”).
Table 3.3. Consonants of the Hanoi variety
3.1.2. Syllable Structure
Syllables are elementary units that have one way of pronunciation. In documents, they are usually delimited by white space. Despite being elementary units, Vietnamese syllables are not indivisible elements but have an internal structure. Table 3.4 depicts the general structure of a Vietnamese syllable:
Table 3.4. Structure of Vietnamese syllables

                            TONE MARK
    First Consonant | Rhyme
                    |   Secondary Vowel  |  Main Vowel  |  Last Consonant
Generally, a Vietnamese syllable can have up to five parts: first consonant, secondary vowel, main vowel, last consonant, and a tone mark. For instance, the syllable “tuần” (week) has a tone mark (grave accent), a first consonant (t), a secondary vowel (u), a main vowel (â), and a last consonant (n). However, except for the main vowel, which is required for all syllables, the other parts may be absent. For example, the syllable “anh” (brother) has no tone mark, no secondary vowel, and no first consonant, while the syllable “hoa” (flower) has a secondary vowel (o) but no last consonant.
3.1.3. Vietnamese Word
Vietnamese is often erroneously considered to be a "monosyllabic" language. It is true that Vietnamese has many words consisting of only one syllable; however, most words indeed contain more than one syllable.
Based on the way words are constructed from syllables, we can classify them into three classes: single words, complex words, and reduplicative words. A single word has only one syllable that carries a specific meaning, for example “tôi” (I), “bạn” (you), “nhà” (house), etc. Words that involve more than one syllable are called complex words. The syllables in complex words are combined based on semantic relationships, which are either coordinated (“bơi lội” – swim) or “principal and accessory” (“đường sắt” – railway). A word is considered a reduplicative word if its syllables have reduplicated phonetic components (Table 3.4), for instance “đùng đùng” (full reduplication) or “lung linh” (first consonant reduplicated). This type of word is usually used for scene or sound description, particularly in literature.
3.2. Preprocessing and Transformation
Data preprocessing and transformation are necessary steps for any data mining process in general and for hidden topic mining in particular. After these steps, the data is clean, complete, reduced, partially free of noise, and ready to be mined. The main steps of our preprocessing and transformation are described in the subsequent sections and shown in the following chart:
Figure 3.1. Pipeline of Data Preprocessing and Transformation
3.2.1. Sentence Segmentation
Sentence segmentation is to determine whether a ‘sentence delimiter’ is really a sentence boundary. As in English, sentence delimiters in Vietnamese are the full stop, the exclamation mark, and the question mark (. ! ?). The exclamation mark and the question mark do not really pose problems. The critical element is the period: (1) the period can be a sentence-ending character (full stop); (2) the period can denote an abbreviation; (3) the period can be used in expressions such as URLs, e-mail addresses, numbers, etc.; and (4) in some cases, a period can assume both functions (1) and (2) at once.
Given an input string, the result of this detector is a list of sentences, one per line. This output is then passed to the sentence tokenization step.
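A minimal heuristic splitter is sketched below purely for illustration; the thesis does not specify its exact rules, and the abbreviation list here is a hypothetical example.

    import re

    ABBREVIATIONS = {"TP.", "Tp.", "GS.", "TS.", "Mr.", "Dr."}   # hypothetical list

    def split_sentences(text):
        """Split on '.', '!' or '?' at the end of a token, but keep periods that
        belong to abbreviations, URLs, e-mail addresses, or numbers."""
        sentences, current = [], []
        for tok in text.split():
            current.append(tok)
            if tok[-1] in ".!?":
                if tok in ABBREVIATIONS or "@" in tok or tok.startswith("http"):
                    continue
                if re.fullmatch(r"\d+([.,]\d+)*\.", tok):
                    continue
                sentences.append(" ".join(current))
                current = []
        if current:
            sentences.append(" ".join(current))
        return sentences

    print(split_sentences("Ông A đến TP. HCM ngày 20/11. Hôm sau ông trở về."))
    # ['Ông A đến TP. HCM ngày 20/11.', 'Hôm sau ông trở về.']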
3.2.2. Sentence Tokenization
Sentence tokenization is the process of detaching punctuation marks from the words in a sentence. For example, we would like to detach “,” from the word preceding it.
3.2.3. Word Segmentation
As mentioned in Section 3.1, Vietnamese words are not always delimited by white space, because a word can contain more than one syllable. This gives rise to the task of word segmentation, i.e., segmenting a sentence into a sequence of words. Vietnamese word segmentation is a prerequisite for any further processing and text mining. Though quite basic, it is not a trivial task because of the following ambiguities:
- Overlapping ambiguity: a string αβγ has an overlapping ambiguity when both αβ and βγ are valid Vietnamese words. For example, in “học sinh học sinh học” (Students study biology), both “học sinh” (student) and “sinh học” (biology) are found in the Vietnamese dictionary.
- Combination ambiguity: a string αβγ has a combination ambiguity when (α, β, αβ) are all possible choices. For instance, in “bàn là một dụng cụ” (A table is a tool), “bàn” (table), “bàn là” (iron), and “là” (is) are all found in the Vietnamese dictionary.
In this work, we use the Conditional Random Fields approach to segment Vietnamese words [31]. The output of this step is a sequence of syllables joined to form words.
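To make the ambiguities concrete, the toy longest-matching segmenter below (with a hypothetical mini-dictionary) greedily prefers longer dictionary entries; note that it resolves both example sentences incorrectly, which is exactly why the thesis relies on a CRF-based segmenter [31] rather than such a heuristic.

    DICTIONARY = {"học_sinh", "sinh_học", "bàn_là", "dụng_cụ", "bàn", "là", "một"}

    def greedy_segment(syllables, max_len=3):
        words, i = [], 0
        while i < len(syllables):
            # take the longest dictionary entry starting at position i (fall back to 1 syllable)
            for n in range(min(max_len, len(syllables) - i), 0, -1):
                candidate = "_".join(syllables[i:i + n])
                if n == 1 or candidate in DICTIONARY:
                    words.append(candidate)
                    i += n
                    break
        return words

    # Overlapping ambiguity: greedy matching yields 'học_sinh học_sinh học'
    # instead of the intended 'học_sinh học sinh_học'.
    print(greedy_segment("học sinh học sinh học".split()))
    # Combination ambiguity: greedy matching picks 'bàn_là' (iron)
    # instead of 'bàn' (table) + 'là' (is).
    print(greedy_segment("bàn là một dụng cụ".split()))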
3.2.4. Filters
After word segmentation, tokens are separated by white space. Filters remove tokens that are trivial for the analysis process, i.e., number tokens, date/time tokens, and too-short tokens (fewer than 2 characters). Too-short sentences, English sentences, and Vietnamese sentences written without tones (Vietnamese text is sometimes written without tone marks) should also be filtered out or handled in this phase.
3.2.5. Remove Non Topic-Oriented Words
Non-topic-oriented words are those we consider trivial for the topic analysis process. These words can cause much noise and have negative effects on our analysis. Here, we treat functional words and too rare or too common words as non-topic-oriented words. The following table gives more details about functional words in Vietnamese; a small filtering sketch follows the table.
Table 3.5. Functional words in Vietnamese
Part of Speech (POS)        Examples
Classifier noun             cái, chiếc, con, bài, câu, cây, tờ, lá, việc
Major/minor conjunction     bởi chưng, bởi vậy, chẳng những, …
Combination conjunction     cho, cho nên, cơ mà, cùng, dẫu, dù, và
Introductory word           gì, hẳn, hết, …
Numeral                     nhiều, vô số, một, một số, …
Pronoun                     anh ấy, cô ấy, …
Adjunct                     sẽ, sắp sửa, suýt, …
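The sketch below shows one way such a filter could be implemented; the stop-word subset and the document-frequency thresholds are illustrative assumptions, not the exact values used in this work.

    from collections import Counter

    STOP_WORDS = {"cái", "chiếc", "con", "và", "là", "sẽ", "những"}   # hypothetical subset

    def remove_non_topic_words(documents, min_df=5, max_df_ratio=0.5):
        """Drop functional/stop words and words that are too rare or too common.

        documents is a list of token lists; df counts in how many documents a word occurs.
        """
        n_docs = len(documents)
        df = Counter(w for doc in documents for w in set(doc))
        keep = {w for w, c in df.items()
                if w not in STOP_WORDS and c >= min_df and c / n_docs <= max_df_ratio}
        return [[w for w in doc if w in keep] for doc in documents]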
3.3. Topic Analysis for VnExpress Dataset
We collected a large dataset from VnExpress [47] using Nutch [36] and then performed preprocessing and transformation. The statistics of the topics assigned by humans and other parameters of the dataset are shown in the tables below:
Table 3.6. Statistics of topics assigned by humans in VnExpress Dataset
Society: Education, Entrance Exams, Life of Youths …
International: Analysis, Files, Lifestyle …
Business: Business man, Stock, Integration …
Culture: Music, Fashion, Stage – Cinema …
Sport: Football, Tennis
Life: Family, Health …
Science: New Techniques, Natural Life, Psychology
And Others …
Note that the information about topics assigned by humans is listed here for reference only and is not used in the topic analysis process. After data preprocessing and transformation, we obtain 53MB of data (40,268 documents, 257,533 words; vocabulary size of 128,768). This data is fed into GibbsLDA++ [38] – a tool for Latent Dirichlet Allocation using Gibbs sampling (see Section 1.3). The results of topic analysis with K = 100 topics are shown in Table 3.8.
Table 3.7. Statistics of VnExpress dataset
After removing html, doing sentence and word segmentation:
size ≈ 219M, number of docs = 40,328
After filtering and removing non-topic oriented words:
size ≈ 53M, number of docs = 40,268
number of words = 5,512,251; vocabulary size = 128,768
Table 3.8 Most likely words for sample topics. Here, we conduct topic analysis with 100 topics.
Topic 1
Tòa (Court)
Điều tra (Investigate)
Luật sư (Lawyer)
Tội (Crime)
Tòa án (court)
Kiện (Lawsuits)
Buộc tội (Accuse)
Xét xử (Judge)
Bị cáo (Accused)
Phán quyết (Sentence)
Bằng chứng (Evidence)
Thẩm phán (Judge)
Topic 3
0.0192
0.0180
0.0162
0.0142
0.0108
0.0092
0.0076
0.0076
0.0065
0.0060
0.0046
0.0050
Topic 9
Du lịch (Tourism)
Khách (Passengers)
Khách sạn (Hotel)
Du khách (Tourists)
Tour
Tham quan (Visit)
Biển (Sea)
Chuyến đi (Journey)
Giải trí (Entertainment)
Khám phá (Discovery)
Lữ hành (Travel)
Điểm đến (Destination)
Topic 7
Trường (School)
0.0660
Lớp (Class)
0.0562
Học sinh (Pupil)
0.0471
Giáo dục (Education)
0.0192
Dạy (Teach)
0.0183
Giáo viên (Teacher)
0.0179
Môn (Subject)
0.0080
Tiểu học (Primary school)0.0070
Hiệu trưởng (Rector)
0.0067
Trung học (High school) 0.0064
Tốt nghiệp (Graduation) 0.0063
Năm học (Academic year)0.0062
Game
Trò chơi (Game)
Người chơi (Gamer)
Nhân vật (Characters)
Online
Giải trí (Entertainment)
Trực tuyến (Online)
Phát hành (Release)
Điều khiển (Control)
Nhiệm vụ (Mission)
Chiến đấu (Fight)
Phiên bản (Version)
Topic 14
0.0542
0.0314
0.0276
0.0239
0.0117
0.0097
0.0075
0.0050
0.0044
0.0044
0.0039
0.0034
Thời trang (Fashion)
Người mẫu (Models)
Mặc (Wear)
Mẫu (Sample)
Trang phục (Clothing)
Đẹp (Nice)
Thiết kế (Design)
Sưu tập (Collection)
Váy (Skirt)
Quần áo (Clothes)
Phong cách (Styles)
Trình diễn (Perform)
0.0869
0.0386
0.0211
0.0118
0.0082
0.0074
0.0063
0.0055
0.0052
0.0041
0.0038
0.0038
Topic 15
0.0482
0.0407
0.0326
0.0305
0.0254
0.0249
0.0229
0.0108
0.0105
0.0092
0.0089
0.0051
Bóng đá (Football)
0.0285
Đội (Team)
0.0273
Cầu thủ (Football Players)0.0241
HLV (Coach)
0.0201
Thi đấu (Compete)
0.0197
Thể thao (Sports)
0.0176
Đội tuyển (Team)
0.0139
CLB (Club)
0.0138
Vô địch (Championship) 0.0089
Mùa (Season)
0.0063
Liên đoàn (Federal)
0.0056
Tập huấn (Training)
0.0042
3.4. Topic Analysis for Vietnamese Wikipedia Dataset
The second dataset is collected from Vietnamese Wikipedia, and contains D=29,043
documents. We preprocessed this dataset in the same way as described in Section 3.2. This resulted in a vocabulary of V = 63,150 words and a total of 4,784,036 word tokens. In the hidden topic mining phase, the number of topics K was fixed at 200, and the hyperparameters α and β were set to 0.25 and 0.1, respectively.
Table 3.9. Statistic of Vietnamese Wikipedia Dataset
After removing html, doing sentence and word segmentation:
size ≈ 270M, number of docs = 29,043
After filtering and removing non-topic oriented words:
size ≈ 48M, number of docs = 17,428
number of words = 4,784,036; vocabulary size = 63,150
Table 3.10 Most likely words for sample topics. Here, we conduct topic analysis with 200 topics
Topic 2
Tàu (Ship)
Hải quân (Navy)
Hạm đội (Fleet)
Thuyền (Ship)
Đô đốc (Admiral)
Tàu chiến (Warship)
Cảng (Harbour)
Tấn công (Attack)
Lục chiến (Marine)
Thủy quân (Seaman)
Căn cứ (Army Base)
Chiến hạm (Gunboat)
Topic 5
0.0527
0.0437
0.0201
0.0100
0.0097
0.0092
0.0086
0.0081
0.0075
0.0067
0.0066
0.0058
Topic 8
Nguyên tố (Element)
Nguyên tử (Atom)
Hợp chất (Compound)
Hóa học (Chemical)
Đồng vị (Isotope)
Kim loại (Metal)
Hidro (Hidro)
Phản ứng (Reaction)
Phóng xạ (Radioactivity)
Tuần hoàn (Circulation)
Hạt nhân (Nuclear)
Điện tử (Electronics)
Độc lập (Independence) 0.0095
Lãnh đạo (Lead)
0.0088
Tổng thống (President) 0.0084
Đất nước (Country)
0.0070
Quyền lực (Power)
0.0069
Dân chủ (Democratic) 0.0068
Chính quyền (Government)0.0067
Ủng hộ (Support)
0.0065
Chế độ (System)
0.0063
Kiểm soát (Control)
0.0058
Lãnh thổ (Territory)
0.0058
Liên bang (Federal)
0.0051
Topic 9
0.0383
0.0174
0.0172
0.0154
0.0149
0.0148
0.0142
0.0123
0.0092
0.0086
0.0078
0.0076
Trang (page)
0.0490
Web (Web)
0.0189
Google (Google)
0.0143
Thông tin (information) 0.0113
Quảng cáo(advertisement)0.0065
Người dùng(user)
0.0058
Yahoo (Yahoo)
0.0054
Internet (Internet)
0.0051
Cơ sở dữ liệu (database) 0.0044
Rss (RSS)
0.0041
HTML (html)
0.0039
Dữ liệu (data)
0.0038
Topic 6
Động vật (Animal)
0.0220
Chim (Bird)
0.0146
Lớp (Class)
0.0123
Cá sấu (Crocodiles)
0.0116
Côn trùng (Insect)
0.0113
Trứng (Eggs)
0.0093
Cánh (Wing)
0.0092
Vây (Fin)
0.0077
Xương (Bone)
0.0075
Phân loại (Classify)
0.0054
Môi trường (Environment)0.0049
Xương sống (Spine)
0.0049
Topic 17
Lực (Force)
Chuyển động (Move)
Định luật (Law)
Khối lượng (Mass)
Quy chiếu (Reference)
Vận tốc (Velocity)
Quán tính (Inertia)
Vật thể (Object)
Newton (Newton)
Cơ học (Mechanics)
Hấp dẫn (Attractive)
Tác động (Influence)
0.0487
0.0323
0.0289
0.0203
0.0180
0.0179
0.0173
0.0165
0.0150
0.0149
0.0121
0.0114
3.5. Discussion
The hidden topic analysis using LDA for both the VnExpress and the Vietnamese Wikipedia datasets has shown satisfactory results. While the VnExpress dataset is more suitable for analyzing daily-life topics, the Vietnamese Wikipedia dataset is good for scientific topic modeling. Which one is suitable for a task depends largely on the domain of application.
From the experiments, it can be seen that the number of topics should be appropriate to the nature of the dataset and the domain of application. If we choose a large number of topics, the analysis process can generate many topics that are semantically too close to each other. On the other hand, if we assign a small number of topics, the results can be too general, and the learning process benefits less from the topic information.
When conducting topic analysis, one should consider the data very carefully. Preprocessing and transformation are important steps because noise words can cause negative effects. In Vietnamese, the focus should be on word segmentation and stop-word filtering. Also, common Vietnamese personal names should be removed. In some cases, it is necessary either to remove all Vietnamese sentences written without tones (this writing style is quite common in Vietnamese online data) or to do tone recovery for them. Other considerations, such as language identification or encoding conversion, should also be made due to the complex variety of online data.
3.6. Summary
This chapter summarized the major issues of topic analysis for two specific datasets in Vietnamese. We first reviewed some characteristics of Vietnamese; these considerations are significant for dataset preprocessing and transformation in the subsequent processes. We then described each step of preprocessing and transforming the data, highlighting notes specific to Vietnamese. In the last part, we demonstrated the results of topic analysis using LDA for the two datasets. The results show that LDA is a promising method for topic analysis in Vietnamese.
Chapter 4. Deployments of General Frameworks
This chapter goes into the details of deploying the general frameworks for two tasks: classification and clustering of Vietnamese web search results. Evaluation and analysis of our proposals are also presented in the corresponding subsections.
4.1. Classification with Hidden Topics
4.1.1. Classification Method
Figure 4.1. Classification with VnExpress topics
The objective of classification is to automatically categorize incoming documents into one of k classes. Given a moderate training dataset, an estimated topic model, and k classes, we would like to build a classifier based on the framework in Figure 4.1. Here, we use the model estimated from the VnExpress dataset with LDA (see Section 3.3 for more details). In the following subsections, we discuss the important issues of this deployment.
a. Data Description
For the training and testing data, we first submit queries to Google and collect the results through the Google API [19]. The numbers of query phrases and snippets in the training and testing datasets are shown in Table 4.1. Note that the search phrases for the training and testing data are designed to be as exclusive as possible.
b. Combining Data with Hidden Topics
The output of topic inference for training/new data is a set of topic distributions, each of which corresponds to one snippet. We now have to combine each snippet with its hidden topics. This can be done by a simple procedure in which the occurrence frequency of a topic in the combination depends on its probability. For example, a topic with probability greater than 0.03 and less than 0.05 has 2 occurrences, while a topic with probability less than 0.01 is not included in the combination. An example is shown in Figure 4.2.
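A sketch of one possible combination rule consistent with this description is given below; the exact bin width is an assumption chosen so that the examples above hold.

    def occurrences(p, step=0.02, minimum=0.01):
        """Map a topic probability to the number of times its pseudo-token is repeated."""
        if p < minimum:
            return 0                     # e.g. p < 0.01 -> topic not included
        return max(1, int(p / step))     # e.g. p around 0.04 -> 2 occurrences

    def combine(snippet_tokens, theta):
        """Append repeated topic pseudo-tokens to a tokenized snippet."""
        extra = []
        for k, p in enumerate(theta):
            extra.extend(["topic:%d" % k] * occurrences(p))
        return snippet_tokens + extra

    print(combine(["giá", "vàng", "tăng"], [0.002, 0.04, 0.11]))
    # -> the snippet tokens plus 2 copies of 'topic:1' and 5 copies of 'topic:2'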
Table 4.1. Google search results used as training and testing datasets. The search phrases for training and testing are designed to be exclusive.

                      Training dataset          Testing dataset
Domains               #phrases   #snippets      #phrases   #snippets
Business              50         1,479          9          270
Culture-Arts          49         1,350          10         285
Health                45         1,311          8          240
Laws                  52         1,558          10         300
Politics              32         957            9          270
Science-Education     41         1,229          9          259
Life-Society          19         552            8          240
Sports                45         1,267          9          223
Technologies          51         1,482          9          270
c. Maximum Entropy Classifier
The motivating idea behind maximum entropy [34][35] is that one should prefer the most uniform model that also satisfies any given constraints. For example, consider a four-class text classification task where we are told only that, on average, 40% of the documents containing the word “professor” are in the faculty class. Intuitively, when given a document with “professor” in it, we would say it has a 40% chance of being a faculty document and a 20% chance for each of the other three classes. If a document does not contain “professor”, we would guess the uniform class distribution, 25% each. This model is exactly the maximum entropy model that conforms to our known constraint.
Although maximum entropy can be used to estimate any probability distribution, we only
consider here the classification task; thus we limit the problem to learning conditional
distributions from labeled training data. Specifically, we would like to learn the
conditional distribution of the class label given a document.
Figure 4.2 Combination of one snippet with its topics: an example
Constraints and Features
In maximum entropy, the training data is used to set constraints on the conditional distribution. Each constraint expresses a characteristic of the training data that should also be present in the learned distribution. Any real-valued function of the document and the class can be a feature, f_i(d, c). Maximum entropy allows us to restrict the model distribution to have the same expected value for this feature as observed in the training data D. Thus, we stipulate that the learned conditional distribution P(c|d) (here, c stands for class and d represents a document) must have the following form:
P(c \mid d) = \frac{1}{Z(d)} \exp\Big( \sum_i \lambda_i f_i(d, c) \Big)                  (4.1)

where each f_i(d, c) is a feature, \lambda_i is a parameter to be estimated, and Z(d) = \sum_c \exp\big( \sum_i \lambda_i f_i(d, c) \big) is simply the normalizing factor that ensures a proper probability distribution.
There are several methods for estimating maximum entropy model from training data
such as IIS (improved iterative scaling), GIS, L-BFGS, and so forth.
36
Maximum Entropy for Classification
In order to apply maximum entropy, we need to select a set of features. In this work, we use the words in documents as features. More specifically, for each word-class combination we instantiate a feature as:

f_{w,c'}(d, c) = \begin{cases} 1 & \text{if } c = c' \text{ and } d \text{ contains } w \\ 0 & \text{otherwise} \end{cases}

Here, c' is a class, w is a specific word, and d is the current document. This feature checks whether the document d contains the word w and belongs to the class c'. The predicate stating that “this document d contains the word w” is called the “context predicate” of the feature.
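As a small numeric illustration of Eq. (4.1) with such indicator features, the sketch below computes P(c|d) for a toy document; the weights are made up for the example, whereas in practice they are estimated with methods such as GIS, IIS, or L-BFGS.

    import math

    def p_class_given_doc(doc_words, classes, lam):
        """Eq. (4.1): scores are exponentiated feature sums, normalized by Z(d)."""
        scores = {c: math.exp(sum(lam.get((w, c), 0.0) for w in doc_words)) for c in classes}
        z = sum(scores.values())                      # Z(d)
        return {c: s / z for c, s in scores.items()}

    lam = {("professor", "faculty"): 1.2, ("professor", "student"): 0.1}   # toy weights
    print(p_class_given_doc({"professor", "teaches"},
                            ["faculty", "student", "staff", "course"], lam))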
4.1.2. Experiments
a. Experimental Settings
All the experiments are based on hidden topic analysis with LDA as described in the previous chapter. We conduct several experiments: one for learning without hidden topics, and the others for learning with different topic models of the VnExpress dataset, generated by analyzing the VnExpress dataset with 60, 80, 100, 120, 140, and 160 topics.
For learning the maximum entropy classifier, we use JMaxent [39] with the context predicate and feature thresholds set to zero; the other parameters are set to their defaults.
b. Evaluation
Traditional classification uses precision, recall, and the F-1 measure to evaluate the performance of a system. These measures are defined as follows.
The precision of a classifier with respect to a class is the fraction of examples correctly categorized into that class over the number of examples classified into that class:

\text{Precision}_c = \frac{\#\text{ examples correctly classified as class } c}{\#\text{ examples classified as class } c}

The recall of a classifier with respect to a class is the fraction of examples correctly categorized into that class over the number of examples that belong to that class (by human assignment):

\text{Recall}_c = \frac{\#\text{ examples correctly classified as class } c}{\#\text{ examples belonging to class } c}

The performance of a classifier is usually measured by the F-1 measure, the harmonic mean of precision and recall:

\text{F-1}_c = \frac{2 \times \text{Precision}_c \times \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}
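For instance, these definitions can be checked against one row of Table 4.2 below (Business: Human = 270, Model = 347, Match = 203), where precision = Match/Model and recall = Match/Human:

    def prf(human, model, match):
        precision = match / model
        recall = match / human
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    print([round(100 * x, 2) for x in prf(270, 347, 203)])   # [58.5, 75.19, 65.8]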
c. Experimental Results and Discussion
Figure 4.3. Learning with different topic models of VnExpress dataset; and the baseline (without topics)
Figure 4.4. Test-out-of-train with increasing sizes of labeled training data (x1000 examples). Here, the number of topics is set at 60.
Table 4.2. Experimental results of baseline (learning without topics)
Class          Human   Model   Match   Pre.    Rec.    F1-score
Business       270     347     203     58.50   75.19   65.80
Culture-Arts   285     260     183     70.38   64.21   67.16
Health         240     275     179     65.09   74.58   69.51
Laws           300     374     246     65.78   82.00   73.00
Politics       270     244     192     78.69   71.11   74.71
Science        259     187     121     64.71   46.72   54.26
Society        240     155     106     68.39   44.17   53.67
Sports         223     230     175     76.09   78.48   77.26
Technologies   270     285     164     57.54   60.74   59.10
Avg.1                                  67.24   66.35   66.79
Avg.2          2357    2357    1569    66.57   66.57   66.57
Table 4.3. Experimental results of learning with 60 topics of VnExpress dataset
Class          Human   Model   Match   Pre.    Rec.    F1-score
Business       270     275     197     71.64   72.96   72.29
Culture-Arts   285     340     227     66.76   79.65   72.64
Health         240     256     186     72.66   77.50   75.00
Laws           300     386     252     65.28   84.00   73.47
Politics       270     242     206     85.12   76.30   80.47
Science        259     274     177     64.60   68.34   66.42
Society        240     124     97      78.23   40.42   53.30
Sports         223     205     173     84.39   77.58   80.84
Technologies   270     255     180     70.59   66.67   68.57
Avg.1                                  73.25   71.49   72.36
Avg.2          2357    2357    1695    71.91   71.91   71.91
Figure 4.5 F1-Measure for classes and average (over all classes) in learning with 60 topics
Figure 4.3 shows the results of learning with different settings (without topics; with 60, 80, 100, 120, 140 topics), among which learning with 60 topics achieves the highest F-1 measure (71.91% in comparison with 66.57% for the baseline – see Table 4.2 and Table 4.3). When the number of topics increases, the F-1 measure varies around 70–71% (learning with 100, 120, 140 topics). This shows that learning with hidden topics improves the performance of the classifier regardless of how many topics are chosen.
Figure 4.4 depicts the results of learning with 60 topics and different numbers of training examples. Because the testing and training datasets are relatively exclusive, the performance does not always improve when the training size increases. In all cases, however, the results for learning with topics are better than for learning without topics. Even with a small training dataset (1,300 examples), the F-1 measure of learning with topics is quite good (70.68%). Also, the variation of the F-1 measure in the experiments with topics (about 2%, from 70% to 72%) is smaller than that without topics (from 62% to 66%). From these observations, we see that our method is effective even with little training data.
4.2. Clustering with Hidden Topics
4.2.1. Clustering Method
Figure 4.6. Clustering with Hidden Topics from VnExpress and Wikipedia data
Web search clustering is a solution for reorganizing search results in a way that is more convenient for users. For example, when a user submits the query “jaguar” to Google and wants results related to “big cats”, s/he needs to go to the 10th, 11th, 32nd, and 71st results. If there were a group named “big cats”, the four relevant results could be ranked high in the corresponding list. Among previous works, the most noticeable and successful clustering system is Vivisimo [49], whose techniques are not publicly disclosed. This section considers the deployment issues of clustering web search results with hidden topics in Vietnamese.
a. Topic Inference and Similarity
For each snippet, topic inference yields the probability distribution of topics given the snippet. From this distribution we construct a topic vector for the snippet as follows: the weight of a topic is set to zero if its probability is less than a predefined “cut-off threshold”, and to the value of its probability otherwise. Assuming that the weights of the words in the term vector of the snippet have been normalized in some way (tf, tf-idf, etc.), the combined vector corresponding to the i-th snippet has the following form:

d_i = \{ t_1, t_2, \ldots, t_K, w_1, \ldots, w_{|V|} \}                  (4.2)

Here, t_k is the weight of the k-th of the K analyzed topics (K is a constant parameter of LDA), and w_j is the weight of the j-th word/term in the vocabulary V of all snippets.
Next, for two snippets i and j, we use the cosine similarity to measure the similarity between the topic parts as well as between the word parts of the two vectors:

sim_{i,j}(\text{topic-parts}) = \frac{\sum_{k=1}^{K} t_{i,k} \, t_{j,k}}{\sqrt{\sum_{k=1}^{K} t_{i,k}^2} \; \sqrt{\sum_{k=1}^{K} t_{j,k}^2}}

sim_{i,j}(\text{word-parts}) = \frac{\sum_{t=1}^{|V|} w_{i,t} \, w_{j,t}}{\sqrt{\sum_{t=1}^{|V|} w_{i,t}^2} \; \sqrt{\sum_{t=1}^{|V|} w_{j,t}^2}}
We then propose the following combination to measure the final similarity between them:

sim(d_i, d_j) = \lambda \times sim(\text{topic-parts}) + (1 - \lambda) \times sim(\text{word-parts})                  (4.3)

Here, \lambda is a mixture constant. If \lambda = 0, we calculate the similarity without the support of hidden topics. If \lambda = 1, we measure the similarity between two snippets from their hidden topic distributions alone, without considering the words in the snippets.
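A small sketch of Eqs. (4.2)–(4.3) follows, with the topic part and the word part of each snippet stored as separate dictionaries for clarity; this is an illustration, not the thesis implementation.

    import math

    def cosine(u, v):
        dot = sum(u[k] * v.get(k, 0.0) for k in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def combined_sim(topic_i, word_i, topic_j, word_j, lam=0.4):
        """Eq. (4.3): mix the topic-part and word-part cosine similarities."""
        return lam * cosine(topic_i, topic_j) + (1 - lam) * cosine(word_i, word_j)

    # two snippets sharing no words but sharing a hidden topic are still somewhat similar
    print(combined_sim({7: 0.6}, {"sao": 1.0}, {7: 0.4}, {"hành_tinh": 1.0}))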
b. Agglomerative Hierarchical Clustering
Hierarchical clustering [48] builds (agglomerative) or breaks up (divisive) a hierarchy of clusters. The traditional representation of this hierarchy is a tree called a dendrogram. Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters.
Cutting the tree at a given height gives a clustering at a selected precision. In the example in Figure 4.7, cutting after the second row generates the clusters {a} {b c} {d e} {f}. Cutting after the third row yields the clusters {a} {b c} {d e f}, which is a coarser clustering with a smaller number of larger clusters.
The method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge into a cluster. Usually, we take the two closest elements according to the chosen similarity.
Optionally, one can construct a similarity matrix at this stage, where the number in the i-th row and j-th column is the similarity/distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the similarities are updated. This is a common way to implement this type of clustering and has the benefit of caching distances between clusters.
Figure 4.7. Dendrogram in Agglomerative Hierarchical Clustering
Suppose that we have merged the two closest elements b and c; we now have the clusters {a}, {b, c}, {d}, {e}, and {f} and want to merge them further. To do that, we need to measure the similarity/distance between {a} and {b c}, or more generally, the similarity between two clusters. Usually the similarity between two clusters A and B is calculated as one of the following:
- The minimum similarity between elements of the two clusters (also called complete linkage clustering): \min \{ sim(x, y) : x \in A, y \in B \}
- The maximum similarity between elements of the two clusters (also called single linkage clustering): \max \{ sim(x, y) : x \in A, y \in B \}
- The mean similarity between elements of the two clusters (also called average linkage clustering): \frac{1}{|A||B|} \sum_{x \in A} \sum_{y \in B} sim(x, y)
Each agglomeration occurs at a smaller similarity between clusters than the previous one, and one can decide to stop clustering either when the clusters are too far apart to be merged (similarity criterion) or when there is a sufficiently small number of clusters (number criterion). A sketch of this procedure with average linkage is given below.
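The following compact sketch (not the thesis code) implements average-linkage agglomerative clustering over a user-supplied similarity function, stopping when no pair of clusters is more similar than merge_threshold (the similarity criterion above).

    def hac(items, sim, merge_threshold):
        """Average-linkage agglomerative clustering with a similarity stopping criterion."""
        clusters = [[x] for x in items]

        def cluster_sim(a, b):
            # mean similarity between all pairs of elements (average linkage)
            return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

        while len(clusters) > 1:
            best, pair = -1.0, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    s = cluster_sim(clusters[i], clusters[j])
                    if s > best:
                        best, pair = s, (i, j)
            if best < merge_threshold:
                break                                   # clusters too far apart to merge
            i, j = pair
            clusters[i] += clusters[j]                  # merge cluster j into cluster i
            del clusters[j]
        return clusters

    print(hac([0.0, 0.1, 5.0, 5.2], lambda x, y: 1.0 / (1.0 + abs(x - y)), 0.5))
    # [[0.0, 0.1], [5.0, 5.2]]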
c. Labeling Clusters
Given a set of clusters for a text collection, our goal is to generate understandable semantic labels for each cluster. We state the problem of cluster labeling similarly to the “topic labeling” problem [27] as follows:
Definition 1: A cluster c ∈ C in a text collection consists of a set of “close” snippets (here, we consider snippets as small documents). Each cluster is characterized by an “expected topic distribution” θ_c, which is the average of the topic distributions of all snippets in the cluster.
Definition 2: A “cluster label”, or “label”, l for a cluster c ∈ C is a sequence of words which is semantically meaningful and covers the latent meaning of θ_c. Words, phrases, and sentences are all valid labels under this definition.
Definition 3 (Relevance Score): The relevance score of a label to a cluster θ_c, s(l, θ_c), measures the semantic similarity between the label and the topic distribution. Given that both l_1 and l_2 are meaningful candidate labels, l_1 is a better label for c than l_2 if s(l_1, θ_c) > s(l_2, θ_c).
With these definitions, the problem of cluster labeling can be stated as follows: let C = {c_1, c_2, ..., c_N} be a set of N clusters and L_i = {l_{i,1}, l_{i,2}, ..., l_{i,s}} be the set of candidate labels for the i-th cluster in C. Our goal is to select the most likely label for each cluster.
Candidate Label Generation
Candidate label generation is the first phase of cluster labeling. In this work, we generate candidates based on “n-gram testing”, which extracts meaningful phrases from word n-grams using statistical tests. There are many methods for testing whether an n-gram is a meaningful collocation/phrase or just a co-occurrence by accident. Some methods rely on statistical measures such as mutual information; others rely on hypothesis testing techniques. The null hypothesis usually assumes that “the words in an n-gram are independent”, and different test statistics have been proposed to test the significance of violating the null hypothesis.
For the experiments, we use n-gram hypothesis testing (n <= 2) based on the chi-square test [11] to find meaningful phrases. In other words, there are two types of label candidates: (1) non-stop words (1-grams); and (2) phrases of 2 consecutive words (2-grams) whose chi-square value, calculated from a large text collection, is greater than a threshold – the “colocThreshold”.
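A sketch of the chi-square statistic for a 2-gram (w1, w2), computed from a 2x2 contingency table of corpus counts, is shown below; the counts in the example call are illustrative.

    def chi_square(n_w1w2, n_w1, n_w2, n_total):
        """Chi-square statistic for the bigram (w1, w2) over a corpus of n_total bigrams."""
        o11 = n_w1w2                              # w1 followed by w2
        o12 = n_w1 - n_w1w2                       # w1 followed by another word
        o21 = n_w2 - n_w1w2                       # another word followed by w2
        o22 = n_total - n_w1 - n_w2 + n_w1w2      # neither w1 nor w2 in the expected slot
        num = n_total * (o11 * o22 - o12 * o21) ** 2
        den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        return num / den if den else 0.0

    # a bigram is kept as a label candidate when its statistic exceeds colocThreshold
    print(chi_square(n_w1w2=200, n_w1=250, n_w2=300, n_total=5_000_000) > 2500.0)   # True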
Table 4.4. Some collocations with highest values of chi-square statistic
Collocation (Meaning in English)                            Chi-square value
TP HCM (HCM City)                                           2.098409912555148E11
Monte Carlo (Monte Carlo)                                   2.3750623868571806E9
Thuần_phong mỹ_tục (Habits and Customs)                     8.404755120045843E8
Bin Laden (Bin Laden)                                       5.938943787195972E8
Bộ Vi_xử_lý (Central Processing Unit)                       3.5782968749839115E8
Thép_miền_nam Cảng_Sài_gòn (a football club)                2.5598593044043452E8
Trận chung_kết (Final Match)                                1.939850017618072E8
Đất_khách quê_người (Foreign Land)                          1.8430912500609657E8
Vạn_lý trường_thành (the Great Wall of China)               1.6699845099865612E8
Đi_tắt đón_đầu (Take a short-cut, Wait in front)            1.0498738800702788E8
Xướng_ca vô_loài                                            1.0469589600052954E8
Ổ cứng (Hard Disk)                                          9.693021145946936E7
Sao_mai Điểm_hạn (a music competition)                      8.833816801460913E7
Bảng xếp_hạng (Ranking Table)                               8.55072554114269E7
Sơ_yếu lý_lịch (Curriculum Vitae)                           8.152866670394194E7
Vốn điều_lệ (Charter Capital)                               5.578214903954915E7
Xứ_sở sương_mù (England)                                    4.9596802405895464E7
Windows XP (Windows XP)                                     4.8020442441390194E7
Thụ_tinh ống_nghiệm (Test-tube Fertilization)               4.750102933435161E7
Outlook Express (Outlook Express)                           3.490459668749844E7
Công_nghệ thông_tin (Information Technology)                1587584.1576983468
Hệ_thống thông_tin (Information System)                     19716.68246929993
Silicon Valley (Silicon Valley)                             1589327.942940336
Relevance Score
We borrow the ideas of the simple score and the inter-cluster score from [27]. The simple score measures the relevance of a label to a specific cluster without considering the other clusters. The inter-cluster score of a label and a cluster, on the other hand, looks not only at the cluster of interest but also at the other clusters. As a result, labels chosen using the inter-cluster score discriminate between clusters better than those chosen with the simple score.
To obtain the relevance between a label candidate l and a cluster c using the simple score, we use three types of features: the topic similarity (topsim) between the topic distribution of the label candidate and the “expected topic distribution” of the cluster, the length of the candidate (len_l), and the number of snippets in cluster c containing the phrase (cdf_{l,c}). More concretely, given a candidate label l = w_1 w_2 \ldots w_l, we first infer the topic distribution θ_l of the label, again using the model estimated from the universal dataset. The simple relevance score of l with respect to a cluster θ_c is then measured using the cosine similarity [48] for the topic similarity:

splscore(l, \theta_c) = \alpha \times cosine(\theta_l, \theta_c) + \beta \times cdf_{l,c} + \gamma \times len_l                  (4.4)
A good cluster label is not only relevant to the current cluster but also helps to distinguish this cluster from the others. It is therefore useful to penalize the relevance score of a label with respect to the current cluster c by its relevance scores with respect to the other clusters c' (c' ∈ C and c' ≠ c). Thus, we obtain the following inter-cluster scoring function:

score(l, \theta_c) = splscore(l, \theta_c) - \mu \sum_{c' \in C,\; c' \neq c} splscore(l, \theta_{c'})                  (4.5)

The candidate labels of a cluster are sorted in descending order of relevance, and the 4 most relevant candidates are chosen as labels for the cluster.
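The two scoring functions can be sketched as follows, with topic distributions represented as lists and the default weights taken from Table 4.6; this is only an illustrative transcription of Eqs. (4.4) and (4.5).

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu, nv = math.sqrt(sum(a * a for a in u)), math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def splscore(theta_l, theta_c, cdf_lc, len_l, alpha=10.0, beta=2.0, gamma=2.0):
        # Eq. (4.4): topic similarity + cluster document frequency + label length
        return alpha * cosine(theta_l, theta_c) + beta * cdf_lc + gamma * len_l

    def score(theta_l, cluster_thetas, cdfs, len_l, target, mu=0.35):
        # Eq. (4.5): penalize by the label's simple score in every other cluster
        base = splscore(theta_l, cluster_thetas[target], cdfs[target], len_l)
        penalty = mu * sum(splscore(theta_l, cluster_thetas[c], cdfs[c], len_l)
                           for c in cluster_thetas if c != target)
        return base - penalty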
d. Ranking within Cluster
We reorder the documents in each cluster by the relevance between their topic distributions and the “expected topic distribution” of the cluster. If the relevance scores of two snippets within one cluster are equal, their original ranks in the complete list returned by Google determine the final order; in other words, of two snippets with the same relevance score, the one with the higher original rank is ranked higher within the cluster.
4.2.2. Experiments
a. Experimental Settings
For the experiments, we first submit 10 ambiguous queries to Google and retrieve about 200 search result snippets for each (Table 4.5). We chose these queries because they are likely to contain multiple sub-topics, so clustering the search results is more beneficial for them.
Table 4.5. Queries submitted to Google
Types               Queries
General Terms       sản phẩm (products), thị trường (market), triển lãm (exhibition),
                    công nghệ (technology), đầu tư (investment), hàng hóa (goods)
Ambiguous Terms     ma trận (matrix), tài khoản (account), hoa hồng (rose/money),
                    ngôi sao (star)
For each query, we cluster the search results using HAC and the hidden topics discovered from the VnExpress and Wikipedia collections (see the previous chapter) with 200 topics. The parameters for clustering are shown in the following table.
Table 4.6. Parameters for clustering web search results
Parameter                    Meaning                                                              Value
Normalized                   Method for constructing the word-part vector of the snippet         TF
Similarity between Clusters  How to calculate the similarity between 2 clusters, based on        Average Linkage
                             the similarities between pairs of snippets
Cut-off                      A topic with probability less than this value is weighted as        0.01
                             zero in the topic-part vector of the snippet
Lambda                       Mixture of the topic part and word part in the snippet vector       0.4
mergeThreshold               The smallest similarity with which two clusters can be merged       0.13
                             (similarity criterion) [see 4.2.1.b]
Alpha                        Weight of the topic-similarity feature in the simple scoring        10
                             method [Eq. 4.4]
Beta                         Weight of the "cdf" feature in the simple scoring method [Eq. 4.4]  2
Gamma                        Weight of the "len" feature in the simple scoring method [Eq. 4.4]  2
Mu                           Parameter in the inter-cluster scoring method [Eq. 4.5]             0.35
colocThreshold               The smallest chi-square value for a 2-gram to be accepted as a      2500.0
                             collocation (a meaningful phrase for labeling)
b. Evaluation
In order to evaluate the clustering method, for each query we identify “good clusters”, i.e., clusters whose snippets describe a coherent topic. We count the number of snippets in the good clusters for each query and calculate the coverage as follows:

coverage = \frac{\text{number of snippets in the selected "good clusters"}}{\text{number of snippets for this query}}                  (4.6)

For each good cluster of a query, we evaluate both the quality of the cluster and the ranking policy by calculating P@5, P@10, and P@20, which are the precision at the top 5, top 10, and top 20 snippets, respectively.
c. Experimental Results and Discussions
Figure 4.8 Precision of top 5 (and 10, 20) in best clusters for each query
Figure 4.9 Coverage of the top 5 (and 10) good clusters for each query
Figure 4.8 shows the precision at the top 5 (and 10, 20) snippets of the best clusters for each query. Although the performance depends heavily on the search results returned by the web search engine, the overall quality is satisfactory (the precision is above 80% on average). For some queries such as “công nghệ” (technology), the returned snippets focus mostly on the topic of information technology, so the clustering system has to rely heavily on word similarities to determine the clusters; as a result, the clustering quality is not as good as for other queries. For queries such as “ma trận” (matrix), the search results span multiple domains (movies, games, mathematics, technology – such as the matrix in cameras), which makes the topic information really beneficial, and the performance for these queries is quite good (for “ma trận”, P@5, P@10 and P@20 are 100%, 98%, and 96%, respectively).
The coverage of the best 5 (and 10) clusters for each query is shown in Figure 4.9. From this figure, we can see that the coverage of the 10 best clusters for each query is around 40–50% (of about 250 snippets). This means that these clusters can help users navigate efficiently through about 10 result pages returned by Google (assuming the default of 10 snippets per page).
Figure 4.10. Word and Topic sharing among 3 snippets of the same cluster about astronomy
Although more work is needed to fully verify our method, these results partly demonstrate its effectiveness. The main advantage of our clustering method is that not only snippets sharing many word choices are considered similar, but also those sharing hidden topics. As a result, it goes beyond the limitations of different word choices (see Figure 4.10).
4.3. Summary
This chapter described the details of deploying the general frameworks for classification and clustering in Vietnamese. Good results have been observed in both tasks: we obtain an improvement of about 8% for the task of classification with sparse data, and the topic-oriented clustering method has shown its efficiency in improving the quality of clustering search results as well as in labeling and re-ranking clusters. These results can be seen as practical evidence for our arguments in the previous chapters.
Conclusion
Achievements throughout the thesis
The main contributions of this thesis are as follows:
- Chapter 1 summarizes the major text modeling and hidden topic models, with particular attention to LDA, which has recently shown its success in many applications such as entity resolution, classification, and feature selection. These models are the foundation for our proposals in the subsequent chapters.
- In Chapter 2, two general frameworks are proposed for learning with the support of hidden topics. The main motivation is to gain benefits from huge sources of online data in order to enhance the quality of text/web clustering and classification. Unlike previous studies of learning with external resources, we approach this issue from the point of view of text/web data analysis, based on recently successful latent topic analysis models such as LSA, pLSA, and LDA. The underlying idea of the frameworks is that, for each learning task, we collect a very large external data collection called the “universal dataset” and then build a learner on both the learning data and the rich set of hidden topics discovered from that collection.
- In Chapter 3, we discuss important issues and results of topic analysis with LDA for two datasets: VnExpress (199MB) and Wikipedia (270MB). Significant considerations about preprocessing and transformation in Vietnamese as well as topic analysis are highlighted. From the experimental results, we see that LDA is a suitable method for topic analysis in Vietnamese.
- Chapter 4 describes two deployments of the general frameworks for two tasks: classifying and clustering search results in Vietnamese. The significant improvements in classification and clustering show the success of our proposed methods.
Future Works
Topic analysis is attractive to many researchers because of its widespread applications in various areas as well as the different research directions it opens up. In the future, the following directions could be taken into consideration:
- Deployment of the general frameworks for page ranking and summarization in a search engine. In the ranking task, we consider a query as a short document and infer a topic distribution for it. Based on the topic distributions of the returned pages, we then order them with respect to their relevance to the topic distribution of the query, thus providing the user with topic-oriented ranking. In the summarization problem, for each page result we can take the sentences that are closest in topic distribution to the query and contain its keywords as the summary for that page. For the implementation, topic inference can be done offline for all web pages stored in the search engine, so as to reduce the online computation.
- Tracking online news over time using Dynamic Topic Models: DTM is an extension of Latent Dirichlet Allocation proposed by Blei et al. (2006). It is a useful tool for tracking and visualizing the development of topics over time. One application is to track business news so that one can answer questions like “during which time will attention be paid to some business field”. This could be of great help to stockbrokers in making investment decisions.
References
Vietnamese References
[1]. Mai, N.C., Vu, D.N., Hoang, T.P. (1997), “Cơ sở ngôn ngữ học và tiếng Việt”, Nhà
xuất bản Giáo dục
English References
[2]. Andrieu, C., Freitas, N.D., Doucet, A. and M.I. Jordan (2003), “An Introduction to
MCMC for Machine Learning”, Machine Learning Journal, pp. 5- 43.
[3]. Banerjee, S., Ramanathan, K., and Gupta, A (2007), “Clustering Short Texts Using
Wikipedia”, In Proceedings of ACL.
[4]. Bhattacharya, I. and Getoor, L. (2006), “A Latent Dirichlet Allocation Model for
Entity Resolution”, In Proceedings of 6th SIAM Conference on Data Mining,
Maryland, USA.
[5]. Blei, D.M., Ng, A.Y. and Jordan, M.I. (2003), “Latent Dirichlet Allocation”, Journal of Machine Learning Research 3, pp. 993-1022
[6]. Blei, D. and Jordan, M. (2003), “Modeling annotated data”, In Proceedings of the
26th annual International ACM SIGIR Conference on Research and Development in
Information Retrieval 127–134. ACM Press, New York, NY.
[7]. Blei, M. and Lafferty, J. (2006), “Dynamic Topic Models”, In Proceedings of the
23rd International Conference on Machine Learning, Pittsburgh, PA.
[8]. Blei, M. and Lafferty, J. (2007), “A Correlated Topic Model of Science”, The
Annals of Applied Statistics. 1, pp. 17-35
[9]. Blum, A. And Mitchell, T. (1998), “Combining Labeled and Unlabeled Data with
Co-training”. In Proceedings of COLT.
[10]. Bollegala, D., Matsuo, Y., And Ishizuka, M. (2007), “Measuring Semantic
Similarity between Words using Web Search Engines”. In Proceedings of WWW.
[11]. Christopher, D.M., Hinrich, S. (Jun, 1999), Foundations of Statistical Natural
Language Processing.
[12]. Chuang, S.L., and Chien, L.F. (2005), “Taxonomy Generation for Text Segments: a
Practical Web-based Approach”, ACM Transactions on Information Systems. 23,
pp.363-386.
[13]. Deerwester, S., Furnas, G.W., Landauer, T.K., and Harshman, R.(1990), “Indexing
by Latent Semantic Analysis”, Journal of the American Society for Information
Science. 41, 391-407.
[14]. Dhillon, I., Mallela, S., And Kumar, R. (2002), “Enhanced Word Clustering for
Hierarchical Text Classification”, In Proceedings of ACM SIGKDD
[15]. Erosheva, E., Fienberg, S. and Lafferty, J. (2004). Mixed-membership models of
scientific publications. Proc. Natl. Acad. Sci. USA 97 11885–11892.
[16]. Gabrilovich, E. and Markovitch, S. (2007), “Computing Semantic Relatedness using
Wikipedia-based Explicit Semantic Analysis”, In Proceedings of IJCAI
[17]. Girolami, M. and Kaban, A. (2004). “Simplicial mixtures of Markov chains:
Distributed modelling of dynamic user profiles”. In Advances in Neural Information
Procesing Systems 16 9–16. MIT Press, Cambridge, MA.
[18]. Griffiths, T., Steyvers, M., Blei, D. and Tenenbaum, J. (2005), “Integrating topics
and syntax”, In Advances in Neural Information Processing Systems 17 537–544,
MIT Press, Cambridge, MA.
[19]. Google Search, 2007. http://www.google.com/.
[20]. Heinrich, G., “Parameter Estimation for Text Analysis”, Technical Report.
[21]. Hofmann, T., “Probabilistic Latent Semantic Analysis”, In Proceedings of UAI
[22]. Hofmann, T., (2001), “Unsupervised Learning by Probabilistic Latent Semantic
Analysis”, Machine Learning. 42, pp. 177-196
[23]. Lawrie, D. J. and Croft, W.B. (2003), “Generating Hierarchical Summaries for Web
Searches”, In Proceedings of ACM SIGIR.
[24]. Letsche, T. A. and Berry, M. W. (1997), “Large-Scale Information Retrieval with
Latent Semantic Analysis”, Information Science. 100, pp. 105-137
[25]. Liu, B., Chin, C. W., and Ng, H. T., “Mining Topic-Specific Concepts and
Definitions on the Web”, In Proceedings of WWW
[26]. Latent Semantic Analysis, http://en.wikipedia.org/wiki/Latent_semantic_indexing
[27]. Mei, Q., Shen, X., And, Zhai, C., “Automatic Labeling of Multinomial Topic
Models”. In Proceedings of ACM SIGKDD, 2007
[28]. Modha, D.S. and Spangler, W.S., “Clustering Hypertext with Applications to Web
Searching”, In Proceedings of the 11th ACM on Hypertext and Hypermedia
[29]. Mccallum, A., Corrada-emmanuel, A. And Wang, X. (2004). “The author–
recipient–topic model for topic and role discovery in social networks: Experiments
with Enron and academic email”, Technical report, Univ. Massachusetts, Amherst.
[30]. Nigam, K., McCallum, A., Thrun, S., and Mitchell, T., “Text Classification from Labeled and Unlabeled Documents using EM”, Machine Learning. 39, pp. 103-134
[31]. Nguyen, C.T., Nguyen, T.K., Phan, X.H., Nguyen, L.M. and Ha, Q.T., “Vietnamese Word Segmentation with CRFs and SVMs: An Investigation”, In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation (PACLIC20), pp. 215-222, Wuhan, China, 1-3 November 2006
[32]. Nguyen, C.T., “JVnSegmenter: A Java-based Vietnamese Word Segmentation
Tool”, http://jvnsegmenter.sourceforge.net/, 2007
[33]. Nguyen, C.T., Tran, T.O., Ha, Q.T., Phan, X.H, “Named Entity Recognition in
Vietnamese Free-Text and Web Documents Using Conditional Random Fields”, The
Workshop on Asian Applied NLP and language resource development, Sirindhorn
International Institute of Technology, Pathumthani, Thailand, March 13, 2007.
[34]. Nguyen, V.C., Nguyen, T.T.L., Ha, Q.T.., Phan, X.H. (2006), “A Maximum Entropy
Model for Text Classification”, In Proceeding of International Conference on
Internet Information Retrieval, pp. 143-149, Korea.
[35]. Nigam, K., Lafferty, J., McCallum, A. (1999), “Using Maximum Entropy for Text
Classification”, In Proceeding of the International Joint Conference on Artificial
Intelligence.
[36]. Nutch: an open-source search engine, http://lucene.apache.org/nutch/
[37]. Phan, X.H., “JWebPro: A Java-based Web Processing Toolkit”, http://jwebpro.sourceforge.net/, 2007
[38]. Phan, X.H, “GibbsLDA++: A C/C++ and Gibbs Sampling based Implementation of
Latent Dirichlet Allocation (LDA)”, http://gibbslda.sourceforge.net/, 2007
[39]. Phan, X.H., “JTextPro: A Java-based Text Processing Toolkit”, http://jtextpro.sourceforge.net/
[40]. Papadimitriou, C., Tamaki, H., Raghavan, P., and Vempala, S., “Latent Semantic
Indexing: A probabilistic analysis”, pages 159-168, 1998
[41]. Popescul, A., Ungar, L., Pennock, D., and Lawrence, S. (2001), “Probabilistic Models for Unified Collaborative and Content-based Recommendation in Sparse-data Environments”, In Uncertainty in Artificial Intelligence, Proceedings of the Seventeenth Conference.
[42]. Rosen-zvi, M., Griffiths, T., Steyvers, M. and Smith, P. (2004), “The author-topic
model for authors and documents”, In AUAI’04: Proceedings of the 20th Conference
on Uncertainty in Artificial Intelligence 487–494. AUAI Press, Arlington, VA.
[43]. Sahami, M., and Heilman, T.D. (2006), “A Web-based Kernel Function for
Measuring the Similarity of Short Text Snippets”, In Proceedings of WWW
[44]. Sato, I. and Nakagawa, H., “Knowledge Discovery of Multiple-Topic Document
using Parametric Mixture Model with Dirichlet Prior”, In Proceedings of ACM
SIGKDD, 2007
[45]. Schonofen, P., “Identifying Document Topics using the Wikipedia Category
Network”, In Proceedings of IEEE/WIC/ACM International Conference on Web
Intelligence, 2006
[46]. Sivic, J., Rusell, B., Efros, A., Zisserman, A. and Freeman, W. (2005), “Discovering
object categories in image collections” Technical report, CSAIL, Massachusetts
Institute of Technology
[47]. VnExpress: the online Vietnamese news, http://vnexpress.net/
[48]. Vector Space Model, http://en.wikipedia.org/wiki/Vector_space_model
[49]. Vivisimo Web Search, http://vivisimo.com/
[50]. Wikipedia: the free encyclopedia, http://wikipedia.org/, 2007
[51]. Zamir, O. and Etzioni, O., "Grouper: A Dynamic Clustering Interface to Web Search Results", In Proceedings of WWW, 1999.
[52]. Zeng, H.J., He, Q.C., Chen, Z., Ma, W.Y., and Ma, J., "Learning to Cluster Web Search Results", In Proceedings of ACM SIGIR, 2004
Appendix: Some Clustering Results
Clustering results for the query “ma trận” (matrix)
Cluster 1: hệ_phương_trình(18) (system of equations)
Keywords: tuyến_tính, đại_số, định_thức, số_phức, bài_toán, thuần_nhất, lượng_giác, số_ảo, khả
Snippets:
1. Bửu_bối giúp đậu mấy môn toán đại_cương phương_pháp tính hàm ... lấy giới_hạn ma_trận số_thực ma_trận số_ảo lượng_giác thực lượng_giác ảo
2. Đại_số Diễn_Đàn Sinh_Viên Quy_Nhơn Bài_4 Cho ma_trận vuông có các phần_tử trên đường_chéo chính là 2007 các phần_tử Giải hệ_phương_trình tuyến_tính thuần_nhất với là ma_trận cột
3. Đề_cương môn_học Biết cách biểu_diễn các đồng cấu bởi ma_trận đồng_thời biết sử_dụng các phép_toán trên ma_trận và biết cách giải một hệ_phương_trình đại_số tuyến_tính ứng

Cluster 2: phim(12) (film)
Keywords: bộ phim, diễn_viên, vai diễn, điện_ảnh thế_giới, ngôi_sao điện_ảnh, phim ăn_khách, phim_truyện, ngôi_sao, diễn
Snippets:
1. phim truyen han_quoc ma tran tai_hien dvd Những khách_hàng mua Phim_Truyện Hàn_Quốc Ma_Trận Tái_Hiện DVD HQ_Pro cũng mua những món_hàng sau_đây
2. 24h.com.vn Giới_thiệu các ngôi_sao điện_ảnh thế_giới và Việt_nam Với vai diễn Neo trong The_Matrix Ma_trận 1999 Reeves đã đưa bộ phim trở thành_bộ phim ăn_khách và được nhiều người săn_lùng
3. Ma_Trận I Ii Iii Tổng_Hợp 3 Phần Của_Phim VN Zoom_forum Phim The_Matrix quả_là một phim_hay không_chỉ bởi diễn_viên bởi kỹ_xảo hình_ảnh âm_thanh

Cluster 3: game(8)
Keywords: game, game trực_tuyến, game nhập_vai, trò_chơi, game online, picachu, phiêu_lưu, tải game, giải_trí trực_tuyến, chơi game
Snippets:
1. Game Online Cuộc Chiến Ma_Trận Cuộc Chiến Ngoài Hành_Tinh Cuộc Phiêu_Lưu Của Chó_Con Cướp_Biển_Caribê Dap_Tho Đi Tìm Cún_Con Lau_Kính
2. Giới_thiệu các trò_chơi mới hướng_dẫn chơi game Thất_bại vẫn tiếp_tục đeo_bám Ma_Trận khi The Matrix_Online game nhập_vai trực_tuyến đầy tham_vọng
3. game_online tải game game_flash game_mini game hay game trực_tuyến Tên_Game Ma_trận Description Chúng_ta đang ở trong ma_trận hãy tiêu_diệt hết bọn nhân_bản nào

Cluster 4: led(5) (LED)
Keywords: led, máy_ảnh, cảm_biến, nét, tat ca, kiểu ma_trận, linh_kiện, tu van, đo, sáng
Snippets:
1. ĐIỀU_KHIỂN MA_TRẬN LED VÀ BÀN_PHÍM HEX LED MATRIX_
2. Nội_dung các con chíp_LED nguyên_vật_liệu đèn phát_quang các thiết_bị năng_lượng cao linh_kiện điện_tử có độ_phân_giải cao điểm ma_trận
3. Nikon D2Xs Máy_ảnh chuyên_dụng tốt nhất DIỄN_ĐÀN CÔNG_NGHỆ VIỆT_NAM Nikon còn trang_bị mạch đo sáng kiểu ma_trận 3D

Cluster 5: máy_in(2) (printer)
Keywords: máy_in, đòi_hỏi, ứng_dụng rộng_rãi, hóa đơn, http, môi_trường, series, tốc_độ, in_nhanh, raovat
Snippets:
1. MÁY_IN HÓA ĐƠN POSIFLEX_PP 5600_SERIES Mua_ban rao_vat Raovat Tốc_độ in_nhanh in theo phương_pháp ma_trận điểm
2. Các máy_in ma_trận dòng_T6212 T6215 và T6218 là những máy_in tốc_độ cao phù_hợp trong môi_trường đòi_hỏi in_ấn với số_lượng lớn Tốc_độ tương_ứng
Clustering results for the query “ngôi sao” (star)
Cluster 1: điện_ảnh(23) (cinema)
Keywords: ngôi_sao điện_ảnh, điện_ảnh thế_giới, phim, bộ phim, nữ diễn_viên, ngôi_sao phim, thế_giới điện_ảnh, diễn_viên, thế_giới
Snippets:
1. Giới_thiệu các ngôi_sao điện_ảnh thế_giới và Việt_nam Khi công_bố nữ diễn_viên chính sẽ đóng cặp với ngôi_sao điện_ảnh xứ_Hàn Bae_Yong_Joon là Lee_Ji_Ah khiến ai cũng ngỡ_ngàng
2. Hai ngôi_sao võ_thuật của Trung_Quốc góp_mặt trong bộ phim mới Hai ngôi_sao võ_thuật của Trung_Quốc góp_mặt trong bộ phim mới King of_Kungfu
3. phim nhiều thể loại thế_giới điện_ảnh với các thông_tin nóng_hổi Phim đời_tư các ngôi_sao điện_ảnh của thế_giới và việt_nam Phim bình_luận về các bộ phim hay Lịch chiếu_phim trên_HBO CINEMAX STAR_MOVIE

Cluster 2: ca_nhạc(12) (music)
Keywords: ngôi_sao ca_nhạc, ban_nhạc, nhạc pop, thể_thao, yêu_mến, mariah, mariah_carey, nhạc_sĩ, pop
Snippets:
1. Các ngôi_sao ca_nhạc ban_nhạc nổi_tiếng Trang chân_dung nghệ_sĩ ca_sĩ nhạc_sĩ và ban_nhạc nổi_tiếng
2. Năm 2005 Eva ký hợp_đồng với Oréal hãng mỹ_phẩm nổi_tiếng của Pháp hợp_đồng đã đưa cô lên ngang_hàng với các ngôi_sao như ca_sĩ da_màu Beyonce
3. Các ngôi_sao ca_nhạc ban_nhạc nổi_tiếng Mariah Tuy_nhiên lợi_thế trẻ_trung của teen_star vẫn chưa đủ_sức làm lu_mờ một số ngôi_sao gạo_cội trong_đó có Mariah_Carey Người_ta gọi cô là Diva quên tuổi

Cluster 3: bóng_đá(10) (football)
Keywords: sân_cỏ, ngôi_sao sân_cỏ, premiership, vòng chung_kết, vòng thi, trung_tâm, tphcm, giải, ronaldo
Snippets:
1. Milan chưa từ_bỏ ý_định mua_Ronaldinho Chủ_tịch Milan Silvio_Berlusconi cho_biết ông sẵn_sàng nối_lại đàm phám với Barca về trường_hợp của Ronaldinho khi đội_bóng chủ sân
2. Tin nhanh bóng_đá Ngôi_sao sân_cỏ Cháy_mãi Ronaldo 24h.com.vn Tin nhanh bóng_đá
3. 24h.com.vn Tin nhanh bóng_đá Ngôi_sao sân_cỏ Premiership 24h.com.vn Tin nhanh bóng_đá Ngôi_sao sân_cỏ Các tin khác của mục Ngôi_sao sân_cỏ Tổng_hợp 24H

Cluster 4: mặt_trời(5) (sun)
Keywords: quỹ_đạo, quanh, thiên_hà, địa_cầu, hành_tinh, get the, lùn, single_network, vietnamkhoa_học
Snippets:
1. Phát hiện 28 hành_tinh mới ngoài hệ mặt_trời
2. Người góp_chìa Sao lùn cực nhẹ là những ngôi_sao có khối_lượng nhỏ hơn 0,3 lần khối_lượng mặt_trời
3. an binh hanh_phuc Kích_thước của các ngôi_sao và khoảng_cách của chúng đối_với trái_đất vượt Mặt_trời của chúng_ta là ngôi_sao gần chúng_ta nhất cách địa_cầu độ_chừng

Cluster 5: mặc đẹp(3) (well-dressed)
Keywords: mặc, thời_trang, mot so, nhat, beyonce knowles, tạp_chí life, biên_tập_viên, knowles, nữ_ca_sĩ
Snippets:
1. VnExpress Anh 10 ngoi sao thoi trang nhat the_gioi Nữ_ca_sĩ Beyonce Knowles được các biên_tập_viên của tạp_chí Life Style Mỹ bầu chọn là ngôi_sao mặc đẹp và cá_tính nhất năm_nay
2. 10 Ngôi_Sao Quyến_Rũ Nhất Trung_Quốc eVietBay 10 Ngôi_Sao Quyến_Rũ Nhất Trung_Quốc Tin_Tức về Thời_Trang Điện_Ảnh
Clustering results for the query “thị trường” (market)
Cluster 1: otc(29) (OTC market)
Keywords: thị_trường otc, niêm_yết, cổ_phiếu otc, cổ_phiếu, một_số cổ_phiếu, công_ty niêm_yết
Snippets:
1. Chứng_khoán Ngân_hàng THÔNG_TIN THỊ_TRƯỜNG NGHIÊN_CỨU PHÂN_TÍCH TIN VCBS
2. Vietstock Vietnam Stock Market News and_Information Thong_tin Thị_trường bất_động_sản Cơ_hội vàng cho các nhà đầu_tư
3. Chứng_khoán Biển_Việt Báo_Cáo Tổng_Quan Thị_Trường Cổ_Phiếu

Cluster 2: kinh_tế thị_trường(20) (market economy)
Keywords: nền kinh_tế, tăng_trưởng kinh_tế, tốc_độ tăng_trưởng, tăng_trưởng mạnh_mẽ, động_thái
Snippets:
1. Làm điếm trong nền kinh_tế thị_trường H-A O
2. Các chuyên_gia phân_tích thị_trường nhận_định thị_trường ĐTDĐ trong năm_nay sẽ không duy_trì được tốc_độ tăng_trưởng hai con_số như trong quý_IV
3. Chính_phủ và thị_trường Diễn_đàn X cafe Vận_hành nền kinh_tế thị_trường có_nghĩa là nhiều vấn_đề sẽ

Cluster 3: đất(17) (land)
Keywords: đất, thị_trường bất_động_sản, bất_động_sản, căn_hộ chung_cư, đất dự_án, từ_liêm, quy_hoạch, lô_đất, giá đất, nhà_ở
Snippets:
1. Saigon bất_động_sản Bất_động_sản Nhà_đất Địa_ốc Xây_dựng Bước vào quý_
2. Thị_trường bất_động_sản Thành_phố Hồ_Chí_Minh vẫn đang sôi VietNamNet_Bridge
3. Thị_trường BĐS TP.HCM Vùng ven lên_giá Cotec Group Website Đất các quận_huyện vùng ven_như Thủ_Đức Cần_Giờ Bình_Chánh

Cluster 4: điện_thoại di_động(11) (mobile phone)
Keywords: điện_thoại di_động, fpt, viễn_thông di_động, công_nghệ cdma, dịch_vụ điện_thoại
Snippets:
1. MOBILENET Mobile Online Magazine Cú đột_phá trên thị_trường điện_thoại Năm 2006 khi E Com dịch_vụ điện_thoại cố_định
2. Thị trường viễn thông di động sẽ có sự thay đổi lớn kể từ năm 2006
3. Thị trường di động Việt Nam nửa đầu năm 2007