Journal of Information and Communication Technology (Journal of ICT), 15(1), June 2016, pp. 57-81. http://jict.uum.edu.my
GF-CLUST: A NATURE-INSPIRED ALGORITHM FOR AUTOMATIC TEXT CLUSTERING

1 Athraa Jasim Mohammed, 2 Yuhanis Yusof & 2 Husniza Husni
1 University of Technology, Baghdad, Iraq
2 Universiti Utara Malaysia, Malaysia
[email protected]; [email protected]; [email protected]

ABSTRACT

Text clustering is the task of grouping similar documents into a cluster while assigning dissimilar ones to other clusters. The well-known K-means clustering algorithm is extensively employed in many disciplines; however, determining the number of clusters for K-means remains a major challenge. This paper presents a new clustering algorithm, termed Gravity Firefly Clustering (GF-CLUST), that utilizes the Firefly Algorithm for dynamic document clustering. GF-CLUST can identify the appropriate number of clusters for a given text collection, which is a challenging problem in document clustering. It determines documents having strong force as centers and creates clusters based on cosine similarity measurement. This is followed by selecting potential clusters and merging small clusters into them. Experiments on various document datasets, such as 20Newsgroups, Reuters-21578 and the TREC collection, are conducted to evaluate the performance of the proposed GF-CLUST. The purity, F-measure and Entropy results of GF-CLUST outperform those produced by existing clustering techniques, such as K-means, Particle Swarm Optimization (PSO) and the practical General Stochastic Clustering Method (pGSCM). Furthermore, the number of clusters obtained by GF-CLUST is closer to the actual number of clusters than that of pGSCM.

Keywords: Firefly algorithm, text clustering, divisive clustering, dynamic clustering.

INTRODUCTION

Text documents, such as articles, blogs and news, increase continuously on the Web. This growth forces online users to spend more time and effort to obtain relevant information, and manual analysis and discovery of useful information become very difficult. Hence, it is important to provide automatic tools for analysing large textual collections. To meet such needs, data mining tasks such as classification, association analysis and clustering are commonly integrated into these tools.

Clustering is a technique of grouping similar documents into one cluster and dissimilar documents into different clusters (Aggarwal & Reddy, 2014). It is a descriptive data mining task in which the algorithm learns by identifying similarities between items in a collection. According to Gil-Garicia and Pons-Porrata (2010), clustering algorithms are classified into two categories based on prior information (i.e., the number of clusters): static and dynamic. In static clustering, a set of objects is classified into a predetermined number of clusters, while in dynamic clustering, the objects are automatically grouped based on some criteria in order to discover the right number of clusters. Clustering can also be carried out in two ways: soft clustering or hard clustering (Aliguliyev, 2009). In soft clustering, each object may be a member of any or all clusters with different membership grades, while in hard clustering each object is a member of a single cluster.
Based on the mechanics of constructing the clusters, clustering methods can be divided into two categories: hierarchical clustering and partitional clustering (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013; Luo, Li, & Chung, 2009). Hierarchical clustering methods create a tree of clusters, while partitional clustering methods create a flat set of clusters (Forsati et al., 2013). This study focuses on dynamic, hard and hierarchical clustering.

LITERATURE REVIEW

Previous studies divided clustering algorithms into two main categories: hierarchical and partitional (Forsati et al., 2013; Luo et al., 2009). Hierarchical clustering methods create a hierarchy of clusters (Forsati et al., 2013). Hierarchical clustering is an efficient method for document clustering in information retrieval, as it provides views of the data at different levels and organizes the document collection in a structured manner. Based on the mechanics of constructing the hierarchy, there are two approaches: agglomerative and divisive hierarchical clustering. The agglomerative approach operates from bottom to top, where every document initially forms a single cluster. The approach then repeatedly merges the closest clusters based on a dissimilarity matrix. Three methods are commonly used for merging: single linkage, complete linkage and average linkage, the latter also known as UPGMA (un-weighted pair group method with arithmetic mean). Details on these methods can be found in previous work, such as Yujian and Liye (2010). On the other hand, divisive clustering builds a tree of multilevel clusters from the top down (Forsati et al., 2013). All objects are initially placed in a single cluster and, at each level, a cluster is split into two clusters. Such an operation is demonstrated in Bisect K-means (Kashef & Kamel, 2009, 2010). In the splitting process, it employs one of the partitional clustering algorithms driven by an objective function that minimizes the distance between a center and the objects of its cluster and maximizes the distance between clusters.

A well-known example of hard and static partitional clustering is K-means (Jain, 2010). It initially determines a number of clusters and then assigns the objects of a collection to the predefined clusters while trying to minimize the sum of squared error over all clusters. This process continues until a specific number of iterations is reached. Figure 1 illustrates the steps involved in K-means.

Step 1: Randomly choose K cluster centers.
Step 2: Assign each object to the closest center using Euclidean distance.
Step 3: Re-calculate the centers.
Step 4: Repeat Step 2 and Step 3 until the stopping condition is reached.

Figure 1. Steps of K-means (Jain, 2010).
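To make the procedure in Figure 1 concrete, the following is a minimal, illustrative Python sketch of the K-means loop applied to TF-IDF document vectors; the variable names, the fixed iteration budget and the toy data are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def kmeans(docs, k, iterations=50, seed=0):
    """Basic K-means over row-wise document vectors (e.g., TF-IDF weights)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose K documents as the initial cluster centers.
    centers = docs[rng.choice(len(docs), size=k, replace=False)]
    labels = np.zeros(len(docs), dtype=int)
    for _ in range(iterations):
        # Step 2: assign each document to the closest center (Euclidean distance).
        dists = np.linalg.norm(docs[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-calculate each center as the mean of its assigned documents.
        for j in range(k):
            members = docs[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
        # Step 4: repeat until the iteration budget (the stopping condition) is used up.
    return labels, centers

# Example: six toy documents in a 4-term TF-IDF space, grouped into 2 clusters.
if __name__ == "__main__":
    docs = np.array([[1.0, 0.9, 0.0, 0.0],
                     [0.8, 1.1, 0.1, 0.0],
                     [0.9, 1.0, 0.0, 0.1],
                     [0.0, 0.1, 1.0, 0.9],
                     [0.1, 0.0, 0.9, 1.1],
                     [0.0, 0.0, 1.1, 1.0]])
    labels, _ = kmeans(docs, k=2)
    print(labels)
```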
The K-means algorithm is easy to implement and efficient. However, it suffers from some drawbacks: the random initialization of centers (including the determination of the number of clusters) may cause the solution to be trapped in local optima. To overcome this weakness, a research area that employs meta-heuristics has been developed. A meta-heuristic optimizes the solution by searching for a global optimal or near-optimal solution using an objective function (Tang, Fong, Yang, & Deb, 2012), and problems are formulated as either minimization or maximization functions (Rothlauf, 2011). There are two types of meta-heuristic approaches: single-solution meta-heuristics and population-based meta-heuristics (Boussaïd, Lepagnot, & Siarry, 2013). A single-solution meta-heuristic initializes one solution and moves away from it, as implemented in Simulated Annealing (Kirkpatrick, Gelatt, & Vecchi, 1983) and Tabu Search (Glover, 1986). A population-based meta-heuristic initializes multiple solutions and chooses the best solution based on the evaluation of the solutions at each iteration, as in the Genetic Algorithm (Beasley, Bull, & Martin, 1993), Evolutionary Programming (Fogel, 1994), Differential Evolution (Aliguliyev, 2009) and nature-inspired algorithms (Bonabeau, Dorigo, & Theraulaz, 1999). Nature-inspired algorithms, also known as swarm intelligence, relate to the collective behavior of social insects or animals in solving hard problems (Rothlauf, 2011). Swarm intelligence includes Particle Swarm Optimization (PSO) (Kennedy & Eberhart, 1995), Artificial Bee Colony (Mahmuddin, 2008; Mustaffa, Yusof, & Kamaruddin, 2013) and the Firefly Algorithm (Yang, 2010).

With regard to the population-based meta-heuristic approach, Aliguliyev (2009) developed a modified Differential Evolution (DE) algorithm for text clustering that optimized various density-based criterion functions: internal, external and hybrid functions. The results indicate that the proposed DE algorithm speeds up convergence. On the other hand, PSO, proposed by Kennedy and Eberhart (1995), was used for document clustering by Cui, Potok and Palathingal (2005). The basic idea of PSO comes from the flocking and foraging behavior of birds, where each solution occupies a position in an n-dimensional search space. The birds are modelled as points, so each solution is called a "particle". Each particle has a fitness value and a velocity that determines its flight direction and distance. The basic PSO clustering algorithm (Cui, Potok, & Palathingal, 2005) is illustrated in Figure 2. Initially, each particle randomly chooses a number of cluster centers from the vectors of the document dataset. Then, each particle performs three steps: creating clusters, evaluating clusters and creating new solutions. Creating clusters is done by assigning documents to the closest center. Evaluating clusters is done by scoring the created clusters using a fitness function, namely the average distance between documents and the cluster centers (ADDC), and selecting the best solution among the particles. The last step, creating new solutions, is done by updating the velocity and position of each particle. These three steps are repeated until one of the stopping conditions is reached: the maximum number of iterations, or the average change in the centers falling below a predefined threshold.

Step 1: Each particle randomly chooses K cluster centers.
Step 2: For each particle:
  - Assign each document to the closest center.
  - Compute the fitness value based on the average distance between documents and the centers (ADDC).
  - Update the velocity and position of the particle.
Step 3: Repeat Step 2 until one of the stopping conditions is reached: the maximum number of iterations, or the average change in the centers is less than a predefined threshold.

Figure 2. Steps in standard PSO clustering (Cui, Potok, & Palathingal, 2005).
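Below is a compact, illustrative Python sketch of PSO clustering in the spirit of Figure 2, with the ADDC fitness from the text; the inertia and acceleration constants (w, c1, c2) and the representation of a particle as an array of K candidate centers are assumptions made for illustration, not the parameter settings of Cui et al. (2005).

```python
import numpy as np

def addc(docs, centers):
    """Fitness: average distance between documents and their closest center (ADDC)."""
    dists = np.linalg.norm(docs[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    per_cluster = []
    for j in range(len(centers)):
        member_d = dists[labels == j, j]
        if len(member_d) > 0:
            per_cluster.append(member_d.mean())
    return float(np.mean(per_cluster))

def pso_clustering(docs, k, n_particles=10, iterations=30,
                   w=0.72, c1=1.49, c2=1.49, seed=0):
    rng = np.random.default_rng(seed)
    n, dim = docs.shape
    # Each particle holds K candidate centers, initialized from random documents.
    pos = np.stack([docs[rng.choice(n, k, replace=False)] for _ in range(n_particles)])
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([addc(docs, p) for p in pos])
    gbest = pbest[pbest_fit.argmin()].copy()
    for _ in range(iterations):
        for i in range(n_particles):
            # Standard PSO velocity and position update with inertia w.
            r1, r2 = rng.random(pos[i].shape), rng.random(pos[i].shape)
            vel[i] = (w * vel[i]
                      + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] = pos[i] + vel[i]
            # Evaluate the new set of centers with the ADDC fitness (smaller is better).
            fit = addc(docs, pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i].copy(), fit
        gbest = pbest[pbest_fit.argmin()].copy()
    return gbest
```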
Further work in clustering includes that of Rashedi, Nezamabadi-pour and Saryazdi (2009), who proposed an optimization algorithm that utilizes the law of gravity and the law of motion, known as the Gravitational Search Algorithm (GSA). They considered each agent as an object whose performance is determined by its mass, which is derived from the fitness value. Agents move towards heavier masses, and the heavy masses represent good solutions. Later, a hybrid of the Gravitational Search Algorithm with K-means (GSA-KM) for numerical data clustering was presented (Hatamlou, Abdullah, & Nezamabadi-pour, 2012): the GSA component prevents K-means from being trapped in local optima, whereas K-means speeds up the convergence of GSA. On the other hand, a Fuzzy C-means (FCM) algorithm (a type of soft clustering) based on gravity and cluster merging was presented by Zhong, Liu and Li (2010). It tries to find good initial cluster centers and to solve the outlier sensitivity problem. It also utilizes the law of gravity, but the calculation of mass is different: the mass of an object p is the number of objects in the neighbourhood of p.

The clustering algorithms discussed above (Cui, Potok, & Palathingal, 2005; Jain, 2010; Zhong, Liu, & Li, 2010) are static methods that initially require a pre-defined number of clusters. Hence, such algorithms are not appropriate for clustering data collections that are not accompanied by this information (i.e., the number of classes or clusters). To date, such issues are solved using two approaches: estimation, or a dynamic swarm-based approach. The first approach employs a validity index to drive the selection of the optimal number of clusters. It starts by determining a range for the number of clusters (a minimum and a maximum). It then performs clustering with the various numbers of clusters and chooses the number of clusters that produces the best quality. In the work of Sayed, Hacid and Zighed (2009), hierarchical agglomerative clustering is combined with a validity index (VI): at each merging step, the index of the two closest clusters is calculated before and after merging, and the merge is finalized only if the VI improves. This process continues until an optimal clustering solution is reached.
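A minimal sketch of the estimation approach described above, assuming scikit-learn's KMeans and the silhouette index as the validity measure (the paper does not prescribe a particular index or library; both choices here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_k(docs, k_min=2, k_max=10, seed=0):
    """Run clustering over a range of k and keep the k with the best validity index."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(docs)
        score = silhouette_score(docs, labels)   # higher silhouette = better separated clusters
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```

As the surrounding text notes, the main practical difficulty of this approach is choosing sensible lower and upper bounds (here k_min and k_max) for each dataset.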
Similarly, Kuo and Zulvia (2013) proposed an automatic clustering method known as Automatic Clustering using Particle Swarm Optimization (ACPSO). It is based on PSO, which identifies the number of clusters, combined with K-means, which adjusts the cluster centers. ACPSO determines the appropriate number of clusters in the range [2, Nmax]. The results show that ACPSO produces better accuracy and consistency than the Dynamic Clustering Particle Swarm Optimization and Genetic (DCPG) algorithm, the Dynamic Clustering Genetic Algorithm (DCGA) and Dynamic Clustering Particle Swarm Optimization (DCPSO). In the work of Mahmuddin (2008), a modified K-means and the Bees Algorithm are integrated to estimate the total number of clusters in a dataset. The Bees Algorithm is used to identify centroids that are as close as possible to the right ones, while K-means is utilized to identify the best clusters.

From the previous discussion, it can be seen that the estimation approach is suitable for problems about which little or no prior knowledge is available; however, it is difficult to determine the range of clusters (the lower and upper bounds on the number of clusters) for each dataset. On the other hand, the dynamic swarm-based approach can automatically find the appropriate number of clusters in a given data collection without any such support, and hence it offers a more convenient cluster analysis. The dynamic swarm-based approach adapts the mechanism of a specific insect or animal found in nature and converts it into heuristic rules. Each member of the swarm acts like an agent that follows the heuristic rules to carry out the sorting and grouping of objects (Tan, 2012). In the literature, examples of this approach to clustering include the flocking-based approach (Cui, Gao, & Potok, 2006; Picarougne, Azzag, Venturini, & Guinot, 2007) and ant-based clustering (Tan, Ting, & Teng, 2011a, 2011b). The flocking-based approach draws on swarm intelligence (Bonabeau et al., 1999): a group of agents moves in a 2D or 3D search space following the rules of flocks, getting close to similar agents and moving away from dissimilar ones (Picarougne, Azzag, Venturini, & Guinot, 2007). This approach is computationally expensive as it requires multiple distance computations. The ant-based approach, in turn, mimics the behavior of ants, where each ant can perform sorting and corpse cleaning. It works by distributing the data objects randomly over a 2D grid (the search space) and then letting a specific number of ants (agents) move randomly on this grid, picking up a data item if the ant does not hold any object and dropping the object where it finds similar objects. This process continues until a specific number of iterations is reached (Deneubourg et al., 1991).

Tan, Ting and Teng (2011b) proposed the practical General Stochastic Clustering Method (pGSCM), which is a simplification of the ant-based clustering approach and is used to cluster multivariate real-world data. The pseudo code of pGSCM is illustrated in Figure 3. The input of pGSCM is a dataset D that contains n objects, and the output is the set of clusters discovered by pGSCM without any prior knowledge of their number. In the initialization of pGSCM, a dissimilarity threshold is estimated for the n objects. Then, n bins are created, where each bin initially contains one object from dataset D.
During the operation of pGSCM, two objects are selected at random from the dataset; if the distance between them is less than their dissimilarity thresholds, the levels of support of the two objects are compared. If object i has less support than object j, the lesser one is moved to the greater one, and vice versa. At the end of the iterations, a number of small and large bins have been created. The large bins are selected as the output clusters, while the objects of the small bins are reassigned to the most similar centers among the large bins. The selection of large bins is based on a threshold of n/20 (i.e., 5% of the size of the dataset, n); this threshold follows the criterion used by Picarougne, Azzag, Venturini, and Guinot (2007). The method performs well compared with state-of-the-art methods; however, randomly selecting two objects in each iteration may create other issues. There is a chance that, in some iterations, the same objects are selected or that some objects are never selected at all. Furthermore, the selection process requires a large number of iterations to increase the probability of selecting different objects.

Step 1: Input: dataset D of n objects.
Step 2: Estimate the dissimilarity threshold for the n objects.
Step 3: Initialize n bins by allocating each object in D to a bin.
Step 4: While iteration <= max_iteration
  - i = random_select(D)
  - j = random_select(D)
  - If d(i, j) < min(Ti, Tj)   {Ti, Tj are the dissimilarity thresholds}
    - Store the comparison outcome in Vi and Vj.
    - If c(i) < c(j), move i to j; else move j to i.
  - End if
  - End while
Step 5: Return all non-empty bins as the set of final clusters.

Figure 3. Pseudo code of the pGSCM clustering algorithm (Tan, Ting, & Teng, 2011b).
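An illustrative Python sketch of the bin-merging loop in Figure 3, assuming Euclidean distance, per-object thresholds set from a sample of pairwise distances, and the level of support c(·) approximated by the current bin size; these are simplifying assumptions made here, not details taken from Tan, Ting and Teng (2011b).

```python
import numpy as np

def pgscm_sketch(data, max_iteration=5000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    # Rough per-object dissimilarity threshold: half the mean distance to a random sample.
    sample = data[rng.choice(n, size=min(n, 30), replace=False)]
    thresholds = np.array([np.linalg.norm(sample - x, axis=1).mean() * 0.5 for x in data])
    bin_of = np.arange(n)                      # each object starts in its own bin
    for _ in range(max_iteration):
        i, j = rng.integers(n), rng.integers(n)
        if bin_of[i] == bin_of[j]:
            continue
        if np.linalg.norm(data[i] - data[j]) < min(thresholds[i], thresholds[j]):
            # Support approximated by current bin size; merge the weaker bin into the stronger one.
            size_i = np.count_nonzero(bin_of == bin_of[i])
            size_j = np.count_nonzero(bin_of == bin_of[j])
            src, dst = (bin_of[i], bin_of[j]) if size_i < size_j else (bin_of[j], bin_of[i])
            bin_of[bin_of == src] = dst
    return bin_of   # the non-empty bins are the discovered clusters
```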
Of late, a newly inspired meta-heuristic algorithm has appeared, known as the Firefly Algorithm (FA). FA was developed and presented at Cambridge University by Xin-She Yang in 2008. The Firefly Algorithm is related to the behavior of fireflies, insects that produce short, rhythmic flashes of light; the rate of the rhythmic flashes and the flash duration bring two fireflies together. Furthermore, the distance between two fireflies affects the light: it becomes weaker and weaker as the distance increases. Yang formulated this mechanism by associating the flashing light with an objective function f(x), where x is the position of the firefly and every position has a different value of flashing light. Depending on the problem to be solved (maximization or minimization), the objective function is handled accordingly. Another factor in the FA, affected by the distance between two fireflies, is the attractiveness β. This factor changes with the distance between two fireflies: when two fireflies are attracted to each other, the brighter one attracts the dimmer one; this process changes the positions of the two fireflies and, in turn, the value of β. The pseudo code of the standard Firefly Algorithm is illustrated in Figure 4.

Step 1:
  - Objective function f(x), x = (x1, ..., xn)^T.
  - Generate an initial population of fireflies xi (i = 1, 2, ..., n) randomly.
  - Light intensity I at xi is determined by f(xi).
  - Define the light absorption coefficient γ.
Step 2:
  - While (t < MaxGeneration)
    - For i = 1 to N (N: all fireflies)
      - For j = 1 to N
        - If (Ii < Ij), move firefly i towards j; end if
        - Vary attractiveness with distance r via exp[-γr]
        - Evaluate new solutions and update light intensity
      - End For j
    - End For i
Step 3:
  - Rank the fireflies and find the current global best g*
  - End while
  - Post-process results and visualization

Figure 4. Pseudo code of the standard Firefly Algorithm (Yang, 2010).

The Firefly Algorithm (Yang, 2010) has been applied in many disciplines and proven successful, for example in image segmentation (Hassanzadeh, Vojodi, & Moghadam, 2011) and in the economic emissions load dispatch problem (Apostolopoulos & Vlachos, 2011). It has also been utilized successfully in numeric data clustering. Senthilnath, Omkar and Mani (2011) used the Firefly Algorithm for supervised clustering (the class of each object is known) and in a static manner (the number of clusters is defined). In their algorithm, each firefly at a specific location x in the 2D search space evaluates its fitness using an objective function related to the sum of Euclidean distances over all training data. The results demonstrate that FA can be used efficiently for clustering. Banati and Bajaj (2013) implemented the Firefly Algorithm differently, using FA for unsupervised learning (the class of each object is unknown). However, their implementation is still static, as in Senthilnath, Omkar and Mani's (2011) work, where the number of clusters is pre-defined.
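The following is a minimal, illustrative Python sketch of the loop in Figure 4 for a minimization problem; the attractiveness and randomization constants (beta0, gamma, alpha), the bounds and the test objective are assumptions chosen for demonstration, following the common textbook form of the update rule rather than any specific published implementation.

```python
import numpy as np

def firefly_minimize(objective, dim=2, n_fireflies=20, max_generation=100,
                     beta0=1.0, gamma=1.0, alpha=0.2, bounds=(-5.0, 5.0), seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_fireflies, dim))          # firefly positions
    intensity = np.array([objective(p) for p in x])           # lower objective = brighter here
    for _ in range(max_generation):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if intensity[j] < intensity[i]:                # firefly j is brighter than i
                    r = np.linalg.norm(x[i] - x[j])
                    beta = beta0 * np.exp(-gamma * r ** 2)     # attractiveness decays with distance
                    x[i] = (x[i] + beta * (x[j] - x[i])
                            + alpha * (rng.random(dim) - 0.5)) # small random step
                    x[i] = np.clip(x[i], lo, hi)
                    intensity[i] = objective(x[i])             # update light intensity
    best = x[intensity.argmin()]
    return best, intensity.min()

# Example: minimize the sphere function.
if __name__ == "__main__":
    best, value = firefly_minimize(lambda p: float(np.sum(p ** 2)))
    print(best, value)
```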
In this paper, the Firefly Algorithm is proposed for clustering documents automatically using a divisive clustering approach; the algorithm is termed Gravity Firefly Clustering (GF-CLUST). The proposed GF-CLUST integrates the work on the Gravitation Firefly Algorithm (GFA) (Mohammed, Yusof, & Husni, 2014a) with a cluster selection criterion (Picarougne, Azzag, Venturini, & Guinot, 2007). GF-CLUST operates on randomly positioned documents and employs the law of gravity to find the force between documents, which is used as the objective function.

METHODOLOGY

The proposed GF-CLUST works in three steps: data pre-processing, development of the vector space model and data clustering, as shown in Figure 5.

Data Pre-processing

This step is very important in any machine learning system. It can be defined as the process of converting a set of documents from an unstructured into a structured form. It involves three sub-steps, as shown in Figure 5: data cleaning, stop word removal and word stemming. Initially, the text is extracted from each document. The extracted text is cleaned of special characters and digits. Then, the text undergoes a splitting process that divides each cleaned text into a set of words. Next, words shorter than three characters (such as in, on, at) and stop words (such as prepositions and conjunctions) are removed. The last step in pre-processing is stemming, where every word is reduced to its root (Manning, Raghavan, & Schütze, 2008).

Development of Vector Space Model

This step is commonly used in information retrieval and data mining to represent the data in a vector space. In this work, each cleaned document is represented in a 2D matrix (columns and rows): the columns denote the m terms extracted from the documents, while the rows refer to the n documents. The term frequency-inverse document frequency (TF-IDF) is an efficient scheme used to identify the significance of terms in the text; the benefit of TF-IDF is the balance between local and global term weighting in a document (Aliguliyev, 2009; Manning, Raghavan, & Schütze, 2008). Equation 1 is used to calculate the TF-IDF:

tfidf_{t,d} = tf_{t,d} \times \log(N / df_t)   (1)

where tf_{t,d} is the frequency of term t in document d, df_t is the number of documents containing term t, and N is the total number of documents.
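A small, illustrative Python sketch of the pre-processing and TF-IDF weighting described above (Equation 1); the tiny stop word list and the crude suffix-stripping stemmer are stand-ins for the full resources a real system would use, not the authors' tooling.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "and", "for", "that", "with", "this", "are", "was"}   # illustrative subset

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer (e.g., Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())        # data cleaning: drop digits/specials
    words = [w for w in text.split() if len(w) >= 3]         # drop very short words
    words = [stem(w) for w in words if w not in STOP_WORDS]  # stop word removal + stemming
    return words

def tfidf_matrix(documents):
    """Build term weights per document using tfidf = tf * log(N / df)  (Equation 1)."""
    tokenized = [preprocess(d) for d in documents]
    n_docs = len(documents)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

if __name__ == "__main__":
    docs = ["The baseball game was exciting",
            "New hardware drivers for the printer",
            "Electronic circuits and hardware design"]
    for w in tfidf_matrix(docs):
        print(w)
```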
Figure 5. The process of Gravity Firefly Clustering (GF-CLUST): a document collection passes through data pre-processing (data cleaning, stop word removal, word stemming), vector space model development (constructing the TF-IDF matrix) and data clustering (identifying centers and creating clusters, then cluster selection), which yields the produced clusters.

Data Clustering

This step includes two processes: identifying centers and creating clusters, and cluster selection. The identification of centers is achieved using the GFA (Mohammed, Yusof, & Husni, 2014a), which employs Newton's law of gravity to determine the force between documents and uses it as the objective function. Newton's law of gravity states that "Every point mass attracts every single other point mass by a force pointing along the line intersecting both points. The force is proportional to the product of the two masses and inversely proportional to the square of the distance between them" (Rashedi, Nezamabadi-pour, & Saryazdi, 2009). Equation 2 shows Newton's law of gravity:

F = G \frac{M_1 M_2}{R^2}   (2)

where F is the force between two masses, G is the gravitational constant, M1 is the first mass, M2 is the second mass, and R is the distance between the two masses.

In the GFA (Mohammed, Yusof, & Husni, 2014a), F is the force between two documents, as shown in Equation 3, while G is replaced by the cosine similarity between the two documents, calculated using Equation 4 (Luo, Li, & Chung, 2009). M1 and M2 represent the total weights of the first and second documents and are calculated using Equation 5 (Mohammed, Yusof, & Husni, 2014b). The value of R is based on the Cartesian distance (Cdist) between the positions of the two documents, obtained using Equation 6 (Yang, 2010):

F(d_i, d_j) = CosineSimilarity(d_i, d_j) \times \frac{T(d_i) \times T(d_j)}{Cdist^2}   (3)

CosineSimilarity(d_i, d_j) = \sum_{k=1}^{m} d_{ik} \, d_{jk}   (4)

T(d_j) = \sum_{i=1}^{m} tfidf_{t_i, d_j}   (5)

Cdist(X_i, X_j) = \sqrt{(X_i - X_j)^2}   (6)

The representation of document positions in the GFA is illustrated in Figure 6, where x is a random value (for example, in the 20Newsgroups dataset the value is in the range 1-300) and y is fixed at 0.5.
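An illustrative Python sketch of the force computation in Equations 3-6, treating each document as a TF-IDF vector with a randomly assigned 2D position (x random, y = 0.5) as described above; the epsilon guard against a zero distance is an implementation detail added here, not part of the published formulation.

```python
import numpy as np

def document_force(tfidf_i, tfidf_j, pos_i, pos_j, eps=1e-9):
    """Force between two documents: F = CosineSimilarity * (T_i * T_j) / Cdist^2 (Eqs. 3-6)."""
    # Equation 4: cosine similarity, computed here as a dot product (assumes unit-normalized vectors).
    g = float(np.dot(tfidf_i, tfidf_j))
    # Equation 5: the "mass" of a document is the sum of its TF-IDF weights.
    m1, m2 = tfidf_i.sum(), tfidf_j.sum()
    # Equation 6: Cartesian distance between the two document positions.
    cdist = np.linalg.norm(pos_i - pos_j)
    # Equation 3: gravity-like force, with cosine similarity playing the role of G.
    return g * (m1 * m2) / (cdist ** 2 + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    tfidf = rng.random((4, 10))                                        # 4 documents, 10 terms (toy weights)
    positions = np.column_stack([rng.uniform(1, 300, 4), np.full(4, 0.5)])
    print(document_force(tfidf[0], tfidf[1], positions[0], positions[1]))
```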
Every firefly ni k i ni kkk iii11 ni ni ADDC will compete with each other; if a firefly has a brighter light than (7) another, then ADDC ADDC ADDC KK (7)(7) j 11 K (7)β between 1 it will attract the onesjjj1with less bright light. The attraction value two fireflies changes basedmon the distance between these fireflies. m m 2 22 m Dis ( d d j,j,,d ii))) jn in))) 2 ddddinin Dis ( d d 222 Dis ( d j i (((dddjnjn (8) Dis ( dj , di ) n (8)(8) jn in ) 1 (8)the firefly The position of the less nbright firefly will change. Changes of n11 position will then lead to the change of*C the force value (objective function in k ** j) P PP((( C P kkkof ( *C Cjjj))) firefly). After a specific number Purity Purity this algorithm that represents the light each Purity (9)(9) Purity kk (9) NN 1,.., N 1 ,.., ccc (9) light and kk N of iterations, GFA identifies firefly (document) with the brightest ,.., c 11,.., denotes it asPan cluster The pseudo-code for the identification of kkinitial *C Cj ) Maxkk center. k C j PP(((( ** Max C (10) P kk* CCjjj))) kk C (10) in Max kkk in CjjFigure j cluster centers GFA isMax presented 7. (10) (10) C 22**RR((k , ,CCj ) )**PP (( , ) j C , ) CCjj ))**PP(the (kkkk,first ,CCjjj))cluster starts by max the process (( 22**RR((of identified, kkk,,creating Once a center j FFF((( max (kk))))is (11) F max ( R C P C ( , ) * ( , ) max C jC ,...,Ck ( (i.e., (11) kk similar finding the most documents as (11) in equation R(k using , Cj )cosine * P (k similarity , Cj ) 1,..., C C C (11) C Cjjj C111,..., Ckkk RR((kkk,,CCjjj))**PP((kkk,,CCjjj)) ,...,C C 4). Documents that have high similarity to the centroid is located in the The number of the members class k in value cluster(in j this paper, first cluster. approach requires amembers specificofof threshold R( k , CThis ) The The number of the members of class kkin in cluster number of the class k cluster jjj (12) j The number of the members of class in cluster RR((( ,,C C )) value k ,threshold The number ofdifferent the members of class k R ) different is used for dataset). Documents that (12) do not j (12) C j kk The number of the members of class j (12) The number of of kkk Theare number of the themembers members of class class belong to the firstThe cluster ranked (step 17 in GFA process) based on it number of the members of class k in cluster j P( k , C The number of the members of class k in cluster j j ) isThe brightness. This to find a new center and later, create a new cluster. Such a number of the members of class k in cluster j of the of class k injcluster j (13) The number ofmembers the members of class PP((( C )) The number k ,,,C P ) j (13) C j processkkis repeated untilThe all number documents aremembers grouped accordingly. (13) The of the of class j (13) Thenumber numberof of the themembers membersof of class class jjj C C c k j k j 68C HC ( j ) cc CCj log CCj k k C (14) kkC jj jj kc1 kk C HC log j j HC log HC(((jjj))) log (14) 11 (14) CCj CCj kkk (14) C C 1 jj jj http://jict.uum.edu.my Journal of ICT, 15, No. 
Figure 6. The representation of document positions in the GFA: (a) initial positions of the fireflies (documents), and the positions after (b) the 1st iteration, (c) the 2nd iteration, (d) the 5th iteration, (e) the 10th iteration and (f) 20 iterations.

Step 1: Generate an initial population of fireflies xi, where i = 1, 2, ..., n and n = number of fireflies (documents).
Step 2: Initialize the light intensity I as the force between two documents using Equation 3.
Step 3: Define the light absorption coefficient γ; initially γ = 1.
Step 4: Define the randomization parameter α; α = 0.2.
Step 5: Define the initial attractiveness β.
Step 6: While t < number of iterations
Step 7:   For i = 1 to N
Step 8:     For j = 1 to N
Step 9:       If (Force Ii < Force Ij) then
Step 10:        Calculate the distance between i and j using Equation 6.
Step 11:        Calculate the attractiveness β.
Step 12:        Move document i towards j.
Step 13:        Update the force between the two documents (light intensity).
Step 14:      End For j
Step 15:    End For i
Step 16: Loop
Step 17: Rank to identify the center (brightest light).

Figure 7. Pseudo code of the GFA.
Upon obtaining the clusters, a cluster selection process is conducted. This is carried out by choosing the clusters whose size exceeds an identified threshold (i.e., clusters that contain more than 5% of the size of the dataset, n). To date, the threshold is set to n/20 for normally distributed data, such as the 20Newsgroups (20NewsgroupsDataSet, 2006) and Reuters-21578 (Lewis, 1999) collections (normally distributed here means that every class includes the same number of documents). This threshold is based on the criterion used by Tan et al. (2011), and the idea of merging clusters is adopted from Picarougne, Azzag, Venturini, and Guinot (2007): the merging step assigns the smaller clusters to the bigger ones. On the other hand, a threshold value of n/40 is used for datasets that are not normally distributed, such as TR11 and TR12, retrieved from the TREC collection (TREC, 1999).
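An illustrative Python sketch of this selection-and-merging step: clusters larger than the size threshold are kept, and documents from the remaining small clusters are reassigned to the most similar kept center by cosine similarity. The centroid-based reassignment is an assumption for illustration.

```python
import numpy as np

def select_and_merge(clusters, tfidf, divisor=20):
    """Keep clusters larger than n/divisor and fold the small ones into them."""
    n = len(tfidf)
    min_size = n / divisor                      # n/20 (5% of n) or n/40, depending on the dataset
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    unit = tfidf / np.where(norms == 0, 1, norms)
    kept = [c for c in clusters if len(c) > min_size]
    small = [c for c in clusters if len(c) <= min_size]
    if not kept:                                # degenerate case: nothing passes the threshold
        return clusters
    centers = np.stack([unit[c].mean(axis=0) for c in kept])   # centroid of each kept cluster
    for c in small:
        for doc in c:
            best = int(np.argmax(centers @ unit[doc]))          # most similar kept center
            kept[best].append(doc)
    return kept
```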
RESULTS

Data Sets

Four datasets are used in evaluating the performance of GF-CLUST. They were obtained from different resources: 20Newsgroups (20NewsgroupsDataSet, 2006), Reuters-21578 (Lewis, 1999) and the TREC collection (TREC, 1999). Table 1 displays the description of the chosen datasets.

Table 1
Description of Datasets

Dataset        Source         No. of Doc.  Classes  Min class size  Max class size  No. of Terms
20Newsgroups   20Newsgroups   300          3        100             100             2275
Reuters-21578  Reuters-21578  300          6        50              50              1212
TR11           TREC           414          9        6               132             6429
TR12           TREC           313          8        9               93              5804

The first dataset, 20Newsgroups, contains 300 documents separated into three classes: hardware, baseball and electronics. Each class contains 100 documents, and the collection has 2,275 terms. The second dataset, Reuters-21578, contains 300 documents distributed over six classes: 'earn', 'sugar', 'trade', 'ship', 'money-supply' and 'gold'. Each class includes 50 documents, and the collection has 1,212 terms. The third dataset, TR11, is derived from the TREC collection. It includes 414 documents distributed over nine classes; the smallest class contains six documents, the largest class contains 132 documents, and the collection comprises 6,429 terms. The fourth dataset, TR12, is also obtained from the TREC collection. It includes 313 documents with eight classes and 5,804 terms; the smallest class contains nine documents while the largest contains 93 documents.

Evaluation Metrics

In order to evaluate the performance of GF-CLUST against the state-of-the-art methods K-means (Jain, 2010), PSO (Cui, Potok, & Palathingal, 2005) and pGSCM (Tan, Ting, & Teng, 2011a), four evaluation metrics are employed: ADDC, Purity, F-measure and Entropy (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013; Murugesan & Zhang, 2011).

The first metric, ADDC, measures the compactness of the obtained clusters. A smaller value of ADDC indicates a better clustering that satisfies the optimization constraints (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013). The ADDC is defined in Equations 7 and 8:

ADDC = \frac{1}{K} \sum_{i=1}^{K} \left( \frac{1}{n_i} \sum_{j=1}^{n_i} Dis(c_i, d_j) \right)   (7)

Dis(d_j, d_i) = \sqrt{\sum_{n=1}^{m} (d_{jn} - d_{in})^2}   (8)

where K refers to the number of clusters, n_i refers to the number of documents in cluster i, c_i refers to the center of cluster i, d_j refers to a document in cluster i, and Dis(c_i, d_j) is the Euclidean distance (Murugesan & Zhang, 2011) calculated using Equation 8.

Purity is defined as the weighted sum of the individual cluster purities, as shown in Equation 9. The purity of a cluster is calculated from the largest class of documents assigned to that cluster, as shown in Equation 10. The larger the value of purity, the better the clustering solution is (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013; Murugesan & Zhang, 2011):

Purity = \frac{1}{N} \sum_{j} P(k, C_j)   (9)

P(k, C_j) = \max_{k} |k \cap C_j|   (10)

On the other hand, the F-measure metric measures the accuracy of the clustering solution, as shown in Equation 11. It is obtained from two metrics that are widely used in the evaluation of information retrieval systems: recall and precision. Recall is the number of documents of a specific class in a specific cluster divided by the number of members of that class in the whole dataset, as shown in Equation 12, while precision is the number of documents of a specific class in a specific cluster divided by the size of that cluster, as shown in Equation 13. A larger value of F-measure indicates a better clustering solution (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013; Murugesan & Zhang, 2011):

F(k) = \max_{C_j \in \{C_1, \dots, C_K\}} \frac{2 \times R(k, C_j) \times P(k, C_j)}{R(k, C_j) + P(k, C_j)}   (11)

R(k, C_j) = \frac{\text{number of members of class } k \text{ in cluster } j}{\text{number of members of class } k}   (12)

P(k, C_j) = \frac{\text{number of members of class } k \text{ in cluster } j}{\text{number of members of cluster } j}   (13)

The Entropy measures the goodness and randomness of the clusters (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013; Murugesan & Zhang, 2011). It also measures the distribution of classes in each cluster. A clustering solution reaches high performance when its clusters contain documents from a single class; in this situation, the entropy value of the clustering solution will be zero. A smaller value of entropy demonstrates a better clustering performance. Equations 14 and 15 are used to compute the Entropy:

HC(j) = -\sum_{k=1}^{c} \frac{|k \cap C_j|}{|C_j|} \log \frac{|k \cap C_j|}{|C_j|}   (14)

H = \sum_{j=1}^{K} \frac{HC(j) \times |C_j|}{N}   (15)

Experimental Result and Discussion

The proposed GF-CLUST, K-means (Jain, 2010), PSO (Cui, Potok, & Palathingal, 2005) and pGSCM (Tan et al., 2011a) were each executed 10 times before the average values of the evaluation metrics were obtained. All experiments were carried out in Matlab on Windows 8 with a 2000 MHz processor and 4 GB of memory. Table 2 tabulates the results of Purity, F-measure, Entropy, ADDC and the number of obtained clusters.

Table 2
Performance Results and Standard Deviation: GF-CLUST vs. K-means vs. PSO vs. pGSCM

Validity index / Dataset   GF-CLUST        K-means          PSO              pGSCM
Purity
  20Newsgroups             0.5667 (0.00)   0.3443 (0.0314)  0.4097 (0.0691)  0.3853 (0.0159)
  Reuters                  0.4867 (0.00)   0.3240 (0.0699)  0.3497 (0.0697)  0.2357 (0.0147)
  TR11                     0.4710 (0.00)   0.4087 (0.0811)  0.4092 (0.0606)  0.3372 (0.0184)
  TR12                     0.4920 (0.00)   0.4096 (0.0761)  0.3712 (0.0405)  0.3029 (0.0072)
F-measure
  20Newsgroups             0.5218 (0.00)   0.4974 (0.0073)  0.5085 (0.0390)  0.3571 (0.0275)
  Reuters                  0.3699 (0.00)   0.3575 (0.0664)  0.3522 (0.0654)  0.2400 (0.0108)
  TR11                     0.3213 (0.00)   0.3792 (0.0861)  0.3492 (0.0539)  0.2520 (0.0346)
  TR12                     0.3851 (0.00)   0.3678 (0.0958)  0.3161 (0.0444)  0.2245 (0.0171)
Entropy
  20Newsgroups             1.3172 (0.00)   1.5751 (0.0272)  1.4847 (0.0935)  1.5630 (0.0132)
  Reuters                  1.6392 (0.00)   2.0964 (0.2358)  2.1483 (0.1563)  2.5144 (0.0245)
  TR11                     2.0119 (0.00)   2.2620 (0.3850)  2.2970 (0.1913)  2.5660 (0.0586)
  TR12                     1.9636 (0.00)   2.2768 (0.2834)  2.4571 (0.1551)  2.6647 (0.0353)
ADDC
  20Newsgroups             1.9739 (0.00)   0.5802 (0.2057)  1.8955 (0.2166)  1.7517 (0.0159)
  Reuters                  1.5987 (0.00)   0.6957 (0.1341)  1.3806 (0.1115)  1.4078 (0.0142)
  TR11                     1.1316 (0.00)   0.5600 (0.2648)  0.7283 (0.1362)  1.0410 (0.0115)
  TR12                     1.1246 (0.00)   0.4592 (0.1001)  0.7486 (0.1794)  1.0604 (0.0065)
Number of clusters
  20Newsgroups             5               3                3                6
  Reuters                  7               6                6                6.3
  TR11                     8               9                9                7.2
  TR12                     8               8                8                6.3

Note: In the original table, the best value in each row is highlighted in bold.

As can be seen from Table 2 and Figure 8, the Purity value of GF-CLUST is higher than that of K-means, PSO and pGSCM in all datasets used in this paper (20Newsgroups, Reuters, TR11 and TR12). It is also noted that PSO outperforms K-means and pGSCM in most datasets. On the other hand, pGSCM produces the lowest purity in all datasets except 20Newsgroups, where it is better than K-means.
Based on the literature, a better clustering is one that produces a high purity value (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013; Murugesan & Zhang, 2011).

Figure 8. Purity result: GF-CLUST vs. K-means vs. PSO vs. pGSCM.

The comparative results of F-measure for GF-CLUST, K-means, PSO and pGSCM are tabulated in Table 2 and represented graphically in Figure 9. The F-measure evaluates the accuracy of the clustering. It can be noticed from Figure 9 and Table 2 that the proposed GF-CLUST has a higher F-measure value in most datasets (20Newsgroups, Reuters and TR12), despite not being supported with any information on the number of clusters. Nevertheless, for the TR11 dataset, K-means outperforms GF-CLUST, generating 0.3792 compared to 0.3213.

Figure 9. F-measure result: GF-CLUST vs. K-means vs. PSO vs. pGSCM.

Further, as can be seen in Table 2, GF-CLUST has the best (lowest) Entropy compared to K-means, PSO and pGSCM in all datasets tested in this work (20Newsgroups, Reuters, TR11 and TR12). GF-CLUST produces 1.3172, 1.6392, 2.0119 and 1.9636 for 20Newsgroups, Reuters, TR11 and TR12, respectively. It is also noted that K-means is a better algorithm than PSO and pGSCM in clustering most of the datasets (Reuters, TR11 and TR12), as it produces lower Entropy (i.e., 2.0964, 2.2620 and 2.2768). Figure 10 illustrates a graphical representation of the Entropy for GF-CLUST, K-means, PSO and pGSCM.
Figure 10. Entropy result: GF-CLUST vs. K-means vs. PSO vs. pGSCM.

In terms of the distance between documents in a cluster, the ADDC values (based on Euclidean distance) are displayed in Table 2 for GF-CLUST, K-means, PSO and pGSCM. This value is used to show how well an algorithm satisfies the optimization constraints. The interpretation of ADDC is similar to that of the Entropy metric, where a smaller value indicates a better algorithm (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013). The ADDC results indicate that K-means is better than PSO, pGSCM and the proposed GF-CLUST on this metric in all datasets. The reason behind this performance is that K-means utilizes the Euclidean distance in assigning documents to clusters, whereas GF-CLUST employs cosine similarity with a specific threshold to identify the members of a cluster; such an approach does not take into account the physical distance between documents. Figure 11 illustrates a graphical representation of the ADDC values of GF-CLUST, K-means, PSO and pGSCM.

Figure 11. ADDC result: GF-CLUST vs. K-means vs. PSO vs. pGSCM.

Research in clustering aims at producing clusters that match the actual number of clusters. As can be seen in Table 2, the average numbers of clusters obtained by GF-CLUST are 5, 7, 8 and 8 for the four datasets 20Newsgroups, Reuters-21578, TR11 and TR12. These values are close to the actual numbers of clusters, which are 3, 6, 9 and 8. The result of the proposed GF-CLUST is better than that of the dynamic method pGSCM, which generates 6, 6.3, 7.2 and 6.3 for the said datasets. Figure 12 shows a graphical representation of the number of obtained clusters for the different algorithms and datasets.

Figure 12. Results of the number of generated clusters by GF-CLUST, pGSCM, K-means and PSO.
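For reference, the scores discussed above can be reproduced from a cluster assignment with a few lines of code. The following is an illustrative Python sketch of the Purity, F-measure and Entropy computations in Equations 9-15; the label encoding, the natural logarithm and the class-size weighting used to aggregate the per-class F values into a single score are assumptions made here, not the authors' evaluation scripts.

```python
import numpy as np
from collections import Counter

def purity(classes, clusters):
    """Equations 9-10: sum of the largest class count per cluster, divided by N."""
    n = len(classes)
    total = 0
    for c in set(clusters):
        members = [classes[i] for i in range(n) if clusters[i] == c]
        total += Counter(members).most_common(1)[0][1]
    return total / n

def f_measure(classes, clusters):
    """Equations 11-13: for each class take the best F over clusters, then weight by class size."""
    n = len(classes)
    score = 0.0
    for k in set(classes):
        class_size = sum(1 for c in classes if c == k)
        best = 0.0
        for j in set(clusters):
            cluster_size = sum(1 for c in clusters if c == j)
            overlap = sum(1 for i in range(n) if classes[i] == k and clusters[i] == j)
            if overlap == 0:
                continue
            recall, precision = overlap / class_size, overlap / cluster_size
            best = max(best, 2 * recall * precision / (recall + precision))
        score += (class_size / n) * best
    return score

def entropy(classes, clusters):
    """Equations 14-15: per-cluster class entropy, weighted by cluster size."""
    n = len(classes)
    h = 0.0
    for j in set(clusters):
        members = [classes[i] for i in range(n) if clusters[i] == j]
        hc = 0.0
        for count in Counter(members).values():
            p = count / len(members)
            hc -= p * np.log(p)
        h += hc * len(members) / n
    return h

if __name__ == "__main__":
    classes  = ["a", "a", "a", "b", "b", "c"]   # true class labels
    clusters = [ 0,   0,   1,   1,   1,   2 ]   # clustering result
    print(purity(classes, clusters), f_measure(classes, clusters), entropy(classes, clusters))
```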
CONCLUSION

This paper presents a new method, known as the Gravity Firefly clustering method (GF-CLUST), for automatic document clustering, where the idea mimics the behavior of the firefly insect in nature. GF-CLUST has the ability to identify a near optimal number of clusters, and this is achieved in three steps: data pre-processing, development of the vector space model, and clustering. In the clustering step, GF-CLUST uses GFA to identify documents with high force as centers and creates clusters based on cosine similarity. It later chooses quality clusters and assigns small-size clusters to them. Results on four benchmark datasets indicate that GF-CLUST is a better clustering algorithm than pGSCM. Furthermore, GF-CLUST performs better than K-means, PSO and pGSCM in terms of cluster quality (purity, F-measure and Entropy), hence indicating that GF-CLUST can become a competitor in the area of dynamic swarm-based clustering.

REFERENCES

20NewsgroupsDataSet. (2006). Retrieved from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-4/text-learning/www/datasets.html

Aggarwal, C. C., & Reddy, C. K. (2014). Data clustering: Algorithms and applications. CRC Press, Taylor and Francis Group.

Aliguliyev, R. M. (2009). Clustering of document collection - A weighted approach. Expert Systems with Applications, 36(4), 7904–7916. doi:10.1016/j.eswa.2008.11.017

Apostolopoulos, T., & Vlachos, A. (2011). Application of the firefly algorithm for solving the economic emissions load dispatch problem. International Journal of Combinatorics, 2011, 23. doi:10.1155/2011/523806

Banati, H., & Bajaj, M. (2013). Performance analysis of Firefly algorithm for data clustering. International Journal of Swarm Intelligence, 1(1), 19–35.

Beasley, D., Bull, D. R., & Martin, R. R. (1993). An overview of genetic algorithms: Part 1, fundamentals. University Computing, 15(2), 58–69.

Bonabeau, E., Dorigo, M., & Theraulaz, G. (1999). Swarm intelligence: From natural to artificial systems. New York, NY: Oxford University Press, Santa Fe Institute Studies in the Sciences of Complexity.

Boussaïd, I., Lepagnot, J., & Siarry, P. (2013). A survey on optimization metaheuristics. Information Sciences, 237, 82–117.

Cui, X., Gao, J., & Potok, T. E. (2006). A flocking based algorithm for document clustering analysis. Journal of Systems Architecture, 52(8–9), 505–515.

Cui, X., Potok, T. E., & Palathingal, P. (2005). Document clustering using particle swarm optimization. In Proceedings of the 2005 IEEE Swarm Intelligence Symposium, SIS 2005 (pp. 185–191). IEEE. doi:10.1109/SIS.2005.1501621

Deneubourg, J. L., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C., & Chrétien, L. (1991). The dynamics of collective sorting: Robot-like ants and ant-like robots.
In Proceedings of the First International Conference on Simulation of Adaptive Behavior: From Animals to Animats (pp. 356–363). Cambridge, MA: MIT Press.

Fogel, D. B. (1994). Asymptotic convergence properties of genetic algorithms and evolutionary programming. Cybernetics and Systems, 25(3), 389–407.

Forsati, R., Mahdavi, M., Shamsfard, M., & Meybodi, M. R. (2013). Efficient stochastic algorithms for document clustering. Information Sciences, 220, 269–291. doi:10.1016/j.ins.2012.07.025

Gil-Garicia, R., & Pons-Porrata, A. (2010). Dynamic hierarchical algorithms for document clustering. Pattern Recognition Letters, 31(6), 469–477. doi:10.1016/j.patrec.2009.11.011

Glover, F. (1986). Future paths for integer programming and links to artificial intelligence. Computers and Operations Research, 13(5), 533–549.

Hassanzadeh, T., Vojodi, H., & Moghadam, A. M. E. (2011). An image segmentation approach based on maximum variance intra-cluster method and Firefly algorithm. In Seventh International Conference on Natural Computation (ICNC) (Vol. 3, pp. 1817–1821). Shanghai: IEEE. doi:10.1109/ICNC.2011.6022379

Hatamlou, A., Abdullah, S., & Nezamabadi-pour, H. (2012). A combined approach for clustering based on K-means and gravitational search algorithms. Swarm and Evolutionary Computation, 6, 47–52. doi:10.1016/j.swevo.2012.02.003

Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651–666. doi:10.1016/j.patrec.2009.09.011

Kashef, R., & Kamel, M. (2010). Cooperative clustering. Pattern Recognition, 43(6), 2315–2329. doi:10.1016/j.patcog.2009.12.018

Kashef, R., & Kamel, M. S. (2009). Enhanced bisecting k-means clustering using intermediate cooperation. Pattern Recognition, 42(11), 2557–2569.

Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks IV. Perth, WA: IEEE. doi:10.1109/ICNN.1995.488968

Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, New Series, 220(4598), 671–680.

Kuo, R. J., & Zulvia, F. E. (2013). Automatic clustering using an improved particle swarm optimization. Journal of Industrial and Intelligent Information, 1(1), 46–51.

Lewis, D. (1999). The Reuters-21578 text categorization test collection. Retrieved from http://kdd.ics.uci.edu/database/reuters21578/reuters21578.html

Luo, C., Li, Y., & Chung, S. M. (2009). Text document clustering based on neighbors. Data & Knowledge Engineering, 68(11), 1271–1288. doi:10.1016/j.datak.2009.06.007

Mahmuddin, M. (2008). Optimisation using Bees algorithm on unlabelled data problems. Manufacturing Engineering Centre. Cardiff: Cardiff University, UK.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval (1st ed.). New York, NY: Cambridge University Press.

Mohammed, A. J., Yusof, Y., & Husni, H. (2014). A Newton's universal gravitation inspired Firefly algorithm for document clustering. In H. Y. Jeong, N. Y. Yen, & J. J. (Jong Hyuk) Park (Eds.), Advances in Computer Science and its Applications, Lecture Notes in Electrical Engineering, Vol. 279 (pp. 1259–1264). Springer Berlin Heidelberg.
Mohammed, A. J., Yusof, Y., & Husni, H. (2014). Weight-based Firefly algorithm for document clustering. In T. Herawan, M. M. Deris, & J. Abawajy (Eds.), Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013), Lecture Notes in Electrical Engineering, Vol. 285 (pp. 259–266). Springer Berlin Heidelberg.

Murugesan, K., & Zhang, J. (2011). Hybrid hierarchical clustering: An experimental analysis (p. 26). University of Kentucky.

Mustaffa, Z., Yusof, Y., & Kamaruddin, S. (2013). Enhanced ABC-LSSVM for energy fuel price prediction. Journal of Information and Communication Technology, 12, 73–101. Retrieved from http://www.jict.uum.edu.my/index.php

Picarougne, F., Azzag, H., Venturini, G., & Guinot, C. (2007). A new approach of data clustering using a flock of agents. Evolutionary Computation, 15(3), 345–367. Cambridge, MA: MIT Press.

Rashedi, E., Nezamabadi-pour, H., & Saryazdi, S. (2009). GSA: A gravitational search algorithm. Information Sciences, 179(13), 2232–2248.

Rothlauf, F. (2011). Design of modern heuristics: Principles and application. Springer-Verlag Berlin Heidelberg. doi:10.1007/978-3-540-72962-4

Sayed, A., Hacid, H., & Zighed, D. (2009). Exploring validity indices for clustering textual data. In Mining Complex Data, 165, 281–300.

Senthilnath, J., Omkar, S. N., & Mani, V. (2011). Clustering using Firefly algorithm: Performance study. Swarm and Evolutionary Computation, 1(3), 164–171. doi:10.1016/j.swevo.2011.06.003

Tan, S. C. (2012). Simplifying and improving swarm based clustering. In IEEE Congress on Evolutionary Computation (CEC) (pp. 1–8). Brisbane, QLD: IEEE.

Tan, S. C., Ting, K. M., & Teng, S. W. (2011a). A general stochastic clustering method for automatic cluster discovery. Pattern Recognition, 44(10–11), 2786–2799.

Tan, S. C., Ting, K. M., & Teng, S. W. (2011b). Simplifying and improving ant-based clustering. In Procedia Computer Science (pp. 46–55).

Tang, R., Fong, S., Yang, X. S., & Deb, S. (2012). Integrating nature-inspired optimization algorithms to K-means clustering. In Seventh International Conference on Digital Information Management (ICDIM), 2012 (pp. 116–123). Macau: IEEE. doi:10.1109/ICDIM.2012.6360145

TREC. (1999). Text REtrieval Conference (TREC). Retrieved from http://trec.nist.gov

Yang, X. S. (2010). Nature-inspired metaheuristic algorithms (2nd ed.). United Kingdom: Luniver Press.

Yujian, L., & Liye, X. (2010). Unweighted multiple group method with arithmetic mean. In IEEE Fifth International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA) (pp. 830–834). Changsha: IEEE. doi:10.1109/BICTA.2010.5645232

Zhong, J., Liu, L., & Li, Z. (2010). A novel clustering algorithm based on gravity and cluster merging. Advanced Data Mining and Applications, 6440, 302–309. doi:10.1007/978-3-642-17316-5_30