Journal of Information and Communication Technology (Journal of ICT), 15(1), June 2016, pp. 57-81. http://jict.uum.edu.my
GF-CLUST: A NATURE-INSPIRED ALGORITHM FOR AUTOMATIC TEXT CLUSTERING

1 Athraa Jasim Mohammed, 2 Yuhanis Yusof & 2 Husniza Husni
1 University of Technology, Baghdad, Iraq
2 Universiti Utara Malaysia, Malaysia
[email protected]; [email protected]; [email protected]

ABSTRACT

Text clustering is the task of grouping similar documents into a cluster while assigning dissimilar ones to other clusters. The well-known K-means clustering algorithm is extensively employed in many disciplines; however, determining the number of clusters for K-means remains a major challenge. This paper presents a new clustering algorithm, termed Gravity Firefly Clustering (GF-CLUST), that utilizes the Firefly Algorithm for dynamic document clustering. GF-CLUST can identify the appropriate number of clusters for a given text collection, which is a challenging problem in document clustering. It determines documents having strong force as centers and creates clusters based on cosine similarity measurement. This is followed by selecting potential clusters and merging small clusters into them. Experiments on various document datasets, such as 20Newsgroups, Reuters-21578 and the TREC collection, are conducted to evaluate the performance of the proposed GF-CLUST. The purity, F-measure and Entropy results of GF-CLUST outperform those produced by existing clustering techniques, such as K-means, Particle Swarm Optimization (PSO) and the practical General Stochastic Clustering Method (pGSCM). Furthermore, the number of clusters obtained by GF-CLUST is closer to the actual number of clusters than that of pGSCM.

Keywords: Firefly algorithm, text clustering, divisive clustering, dynamic clustering.

INTRODUCTION

Text documents, such as articles, blogs and news, increase continuously on the Web. This growth forces online users to spend more time and effort to obtain relevant information, and manual analysis and discovery of useful information become very difficult. Hence, it is important to provide automatic tools for analysing large textual collections. To meet such needs, data mining tasks such as classification, association analysis and clustering are commonly integrated into these tools.

Clustering is a technique of grouping similar documents into one cluster and dissimilar documents into different clusters (Aggarwal & Reddy, 2014). It is a descriptive data mining task in which the algorithm learns by identifying similarities between items in a collection. According to Gil-Garicia and Pons-Porrata (2010), clustering algorithms are classified into two categories based on prior information (i.e., the number of clusters): static and dynamic. In static clustering, a set of objects is classified into a predetermined number of clusters, while in dynamic clustering, the objects are automatically grouped based on some criteria in order to discover the right number of clusters. Clustering can also be carried out in two ways: soft clustering or hard clustering (Aliguliyev, 2009). In soft clustering, each object may be a member of any or all clusters with different membership grades, while in hard clustering each object is a member of a single cluster.
Based on the mechanics of constructing the clusters, clustering methods can be divided into two categories: hierarchical clustering and partitional clustering (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013; Luo, Li, & Chung, 2009). Hierarchical clustering methods create a tree of clusters, while partitional clustering methods create a flat set of clusters (Forsati et al., 2013). This study focuses on dynamic, hard and hierarchical clustering.

LITERATURE REVIEW

Previous studies divided clustering algorithms into two main categories: hierarchical and partitional (Forsati et al., 2013; Luo et al., 2009). Hierarchical clustering methods create a hierarchy of clusters (Forsati et al., 2013). Hierarchical clustering is an efficient method for document clustering in information retrieval, as it provides views of the data at different levels and organizes the document collection in a structured manner. Based on the mechanics of constructing the hierarchy, there are two approaches: agglomerative and divisive hierarchical clustering. The agglomerative approach operates from bottom to top, where every document initially forms a single cluster. The approach then repeatedly merges the closest clusters based on a dissimilarity matrix. Three methods are commonly used for merging: single linkage, complete linkage and average linkage, the latter also known as UPGMA (un-weighted pair group method with arithmetic mean). Details on these methods can be found in previous work, such as Yujian and Liye (2010). On the other hand, divisive clustering builds a tree of multilevel clusters from the top down (Forsati et al., 2013). All objects are initially placed in a single cluster and, at each level, a cluster is split into two clusters. Such an operation is demonstrated in Bisect K-means (Kashef & Kamel, 2009, 2010). In the splitting process, it employs one of the partitional clustering algorithms driven by an objective function that minimizes the distance between a center and the objects of its cluster and maximizes the distance between clusters.

A well-known example of hard and static partitional clustering is K-means (Jain, 2010). It initially determines a number of clusters and then assigns the objects of a collection to the predefined clusters while trying to minimize the sum of squared error over all clusters. This process continues until a specific number of iterations is reached. Figure 1 illustrates the steps involved in K-means.

Step 1: Randomly choose K cluster centers.
Step 2: Assign each object to the closest center using Euclidean distance.
Step 3: Re-calculate the centers.
Step 4: Repeat Step 2 and Step 3 until the stopping condition is reached.

Figure 1. Steps of K-means (Jain, 2010).
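To make the procedure in Figure 1 concrete, the following is a minimal, illustrative Python sketch of the K-means loop applied to TF-IDF document vectors; the variable names, the fixed iteration budget and the toy data are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def kmeans(docs, k, iterations=50, seed=0):
    """Basic K-means over row-wise document vectors (e.g., TF-IDF weights)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose K documents as the initial cluster centers.
    centers = docs[rng.choice(len(docs), size=k, replace=False)]
    labels = np.zeros(len(docs), dtype=int)
    for _ in range(iterations):
        # Step 2: assign each document to the closest center (Euclidean distance).
        dists = np.linalg.norm(docs[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-calculate each center as the mean of its assigned documents.
        for j in range(k):
            members = docs[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
        # Step 4: repeat until the iteration budget (the stopping condition) is used up.
    return labels, centers

# Example: six toy documents in a 4-term TF-IDF space, grouped into 2 clusters.
if __name__ == "__main__":
    docs = np.array([[1.0, 0.9, 0.0, 0.0],
                     [0.8, 1.1, 0.1, 0.0],
                     [0.9, 1.0, 0.0, 0.1],
                     [0.0, 0.1, 1.0, 0.9],
                     [0.1, 0.0, 0.9, 1.1],
                     [0.0, 0.0, 1.1, 1.0]])
    labels, _ = kmeans(docs, k=2)
    print(labels)
```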
The K-means algorithm is easy to implement and efficient. However, it suffers from some drawbacks: the random initialization of centers (including the determination of the number of clusters) may cause the solution to be trapped in local optima. To overcome this weakness, a research area that employs meta-heuristics has been developed. A meta-heuristic optimizes the solution by searching for a global optimal or near-optimal solution using an objective function (Tang, Fong, Yang, & Deb, 2012), and problems are formulated as either minimization or maximization functions (Rothlauf, 2011). There are two types of meta-heuristic approaches: single-solution meta-heuristics and population-based meta-heuristics (Boussaïd, Lepagnot, & Siarry, 2013). A single-solution meta-heuristic initializes one solution and moves away from it, as implemented in Simulated Annealing (Kirkpatrick, Gelatt, & Vecchi, 1983) and Tabu Search (Glover, 1986). A population-based meta-heuristic initializes multiple solutions and chooses the best solution based on the evaluation of the solutions at each iteration, as in the Genetic Algorithm (Beasley, Bull, & Martin, 1993), Evolutionary Programming (Fogel, 1994), Differential Evolution (Aliguliyev, 2009) and nature-inspired algorithms (Bonabeau, Dorigo, & Theraulaz, 1999). Nature-inspired algorithms, also known as swarm intelligence, relate to the collective behavior of social insects or animals in solving hard problems (Rothlauf, 2011). Swarm intelligence includes Particle Swarm Optimization (PSO) (Kennedy & Eberhart, 1995), Artificial Bee Colony (Mahmuddin, 2008; Mustaffa, Yusof, & Kamaruddin, 2013) and the Firefly Algorithm (Yang, 2010).

With regard to the population-based meta-heuristic approach, Aliguliyev (2009) developed a modified Differential Evolution (DE) algorithm for text clustering that optimized various density-based criterion functions: internal, external and hybrid functions. The results indicate that the proposed DE algorithm speeds up convergence. On the other hand, PSO, proposed by Kennedy and Eberhart (1995), was used for document clustering by Cui, Potok and Palathingal (2005). The basic idea of PSO comes from the flocking and foraging behavior of birds, where each solution occupies a position in an n-dimensional search space. The birds are modelled as points, so each solution is called a "particle". Each particle has a fitness value and a velocity that determines its flight direction and distance. The basic PSO clustering algorithm (Cui, Potok, & Palathingal, 2005) is illustrated in Figure 2. Initially, each particle randomly chooses a number of cluster centers from the vectors of the document dataset. Then, each particle performs three steps: creating clusters, evaluating clusters and creating new solutions. Creating clusters is done by assigning documents to the closest center. Evaluating clusters is done by scoring the created clusters using a fitness function, namely the average distance between documents and the cluster centers (ADDC), and selecting the best solution among the particles. The last step, creating new solutions, is done by updating the velocity and position of each particle. These three steps are repeated until one of the stopping conditions is reached: the maximum number of iterations, or the average change in the centers falling below a predefined threshold.

Step 1: Each particle randomly chooses K cluster centers.
Step 2: For each particle:
  - Assign each document to the closest center.
  - Compute the fitness value based on the average distance between documents and the centers (ADDC).
  - Update the velocity and position of the particle.
Step 3: Repeat Step 2 until one of the stopping conditions is reached: the maximum number of iterations, or the average change in the centers is less than a predefined threshold.

Figure 2. Steps in standard PSO clustering (Cui, Potok, & Palathingal, 2005).
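Below is a compact, illustrative Python sketch of PSO clustering in the spirit of Figure 2, with the ADDC fitness from the text; the inertia and acceleration constants (w, c1, c2) and the representation of a particle as an array of K candidate centers are assumptions made for illustration, not the parameter settings of Cui et al. (2005).

```python
import numpy as np

def addc(docs, centers):
    """Fitness: average distance between documents and their closest center (ADDC)."""
    dists = np.linalg.norm(docs[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    per_cluster = []
    for j in range(len(centers)):
        member_d = dists[labels == j, j]
        if len(member_d) > 0:
            per_cluster.append(member_d.mean())
    return float(np.mean(per_cluster))

def pso_clustering(docs, k, n_particles=10, iterations=30,
                   w=0.72, c1=1.49, c2=1.49, seed=0):
    rng = np.random.default_rng(seed)
    n, dim = docs.shape
    # Each particle holds K candidate centers, initialized from random documents.
    pos = np.stack([docs[rng.choice(n, k, replace=False)] for _ in range(n_particles)])
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([addc(docs, p) for p in pos])
    gbest = pbest[pbest_fit.argmin()].copy()
    for _ in range(iterations):
        for i in range(n_particles):
            # Standard PSO velocity and position update with inertia w.
            r1, r2 = rng.random(pos[i].shape), rng.random(pos[i].shape)
            vel[i] = (w * vel[i]
                      + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] = pos[i] + vel[i]
            # Evaluate the new set of centers with the ADDC fitness (smaller is better).
            fit = addc(docs, pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i].copy(), fit
        gbest = pbest[pbest_fit.argmin()].copy()
    return gbest
```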
Further work in clustering includes that of Rashedi, Nezamabadi-pour and Saryazdi (2009), who proposed an optimization algorithm that utilizes the law of gravity and the law of motion, known as the Gravitational Search Algorithm (GSA). They considered each agent as an object whose performance is determined by its mass, which is derived from the fitness value. Agents move towards heavier masses, and the heavy masses represent good solutions. Later, a hybrid of the Gravitational Search Algorithm with K-means (GSA-KM) for numerical data clustering was presented (Hatamlou, Abdullah, & Nezamabadi-pour, 2012): the GSA component prevents K-means from being trapped in local optima, whereas K-means speeds up the convergence of GSA. On the other hand, a Fuzzy C-means (FCM) algorithm (a type of soft clustering) based on gravity and cluster merging was presented by Zhong, Liu and Li (2010). It tries to find good initial cluster centers and to solve the outlier sensitivity problem. It also utilizes the law of gravity, but the calculation of mass is different: the mass of an object p is the number of objects in the neighbourhood of p.

The clustering algorithms discussed above (Cui, Potok, & Palathingal, 2005; Jain, 2010; Zhong, Liu, & Li, 2010) are static methods that initially require a pre-defined number of clusters. Hence, such algorithms are not appropriate for clustering data collections that are not accompanied by this information (i.e., the number of classes or clusters). To date, such issues are solved using two approaches: estimation, or a dynamic swarm-based approach. The first approach employs a validity index to drive the selection of the optimal number of clusters. It starts by determining a range for the number of clusters (a minimum and a maximum). It then performs clustering with the various numbers of clusters and chooses the number of clusters that produces the best quality. In the work of Sayed, Hacid and Zighed (2009), hierarchical agglomerative clustering is combined with a validity index (VI): at each merging step, the index of the two closest clusters is calculated before and after merging, and the merge is finalized only if the VI improves. This process continues until an optimal clustering solution is reached.
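A minimal sketch of the estimation approach described above, assuming scikit-learn's KMeans and the silhouette index as the validity measure (the paper does not prescribe a particular index or library; both choices here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_k(docs, k_min=2, k_max=10, seed=0):
    """Run clustering over a range of k and keep the k with the best validity index."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(docs)
        score = silhouette_score(docs, labels)   # higher silhouette = better separated clusters
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```

As the surrounding text notes, the main practical difficulty of this approach is choosing sensible lower and upper bounds (here k_min and k_max) for each dataset.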
Similarly, Kuo and Zulvia (2013) proposed an automatic clustering method known as Automatic Clustering using Particle Swarm Optimization (ACPSO). It is based on PSO, which identifies the number of clusters, combined with K-means, which adjusts the cluster centers. ACPSO determines the appropriate number of clusters in the range [2, Nmax]. The results show that ACPSO produces better accuracy and consistency than the Dynamic Clustering Particle Swarm Optimization and Genetic (DCPG) algorithm, the Dynamic Clustering Genetic Algorithm (DCGA) and Dynamic Clustering Particle Swarm Optimization (DCPSO). In the work of Mahmuddin (2008), a modified K-means and the Bees Algorithm are integrated to estimate the total number of clusters in a dataset. The Bees Algorithm is used to identify centroids that are as close as possible to the right ones, while K-means is utilized to identify the best clusters.

From the previous discussion, it can be seen that the estimation approach is suitable for problems about which little or no prior knowledge is available; however, it is difficult to determine the range of clusters (the lower and upper bounds on the number of clusters) for each dataset. On the other hand, the dynamic swarm-based approach can automatically find the appropriate number of clusters in a given data collection without any such support, and hence it offers a more convenient cluster analysis. The dynamic swarm-based approach adapts the mechanism of a specific insect or animal found in nature and converts it into heuristic rules. Each member of the swarm acts like an agent that follows the heuristic rules to carry out the sorting and grouping of objects (Tan, 2012). In the literature, examples of this approach to clustering include the flocking-based approach (Cui, Gao, & Potok, 2006; Picarougne, Azzag, Venturini, & Guinot, 2007) and ant-based clustering (Tan, Ting, & Teng, 2011a, 2011b). The flocking-based approach draws on swarm intelligence (Bonabeau et al., 1999): a group of agents moves in a 2D or 3D search space following the rules of flocks, getting close to similar agents and moving away from dissimilar ones (Picarougne, Azzag, Venturini, & Guinot, 2007). This approach is computationally expensive as it requires multiple distance computations. The ant-based approach, in turn, mimics the behavior of ants, where each ant can perform sorting and corpse cleaning. It works by distributing the data objects randomly over a 2D grid (the search space) and then letting a specific number of ants (agents) move randomly on this grid, picking up a data item if the ant does not hold any object and dropping the object where it finds similar objects. This process continues until a specific number of iterations is reached (Deneubourg et al., 1991).

Tan, Ting and Teng (2011b) proposed the practical General Stochastic Clustering Method (pGSCM), which is a simplification of the ant-based clustering approach and is used to cluster multivariate real-world data. The pseudo code of pGSCM is illustrated in Figure 3. The input of pGSCM is a dataset D that contains n objects, and the output is the set of clusters discovered by pGSCM without any prior knowledge of their number. In the initialization of pGSCM, a dissimilarity threshold is estimated for the n objects. Then, n bins are created, where each bin initially contains one object from dataset D.
During the operation of pGSCM, two objects are selected at random from the dataset; if the distance between them is less than their dissimilarity thresholds, the levels of support of the two objects are compared. If object i has less support than object j, the lesser one is moved to the greater one, and vice versa. At the end of the iterations, a number of small and large bins have been created. The large bins are selected as the output clusters, while the objects of the small bins are reassigned to the most similar centers among the large bins. The selection of large bins is based on a threshold of n/20 (i.e., 5% of the size of the dataset, n); this threshold follows the criterion used by Picarougne, Azzag, Venturini, and Guinot (2007). The method performs well compared with state-of-the-art methods; however, randomly selecting two objects in each iteration may create other issues. There is a chance that, in some iterations, the same objects are selected or that some objects are never selected at all. Furthermore, the selection process requires a large number of iterations to increase the probability of selecting different objects.

Step 1: Input: dataset D of n objects.
Step 2: Estimate the dissimilarity threshold for the n objects.
Step 3: Initialize n bins by allocating each object in D to a bin.
Step 4: While iteration <= max_iteration
  - i = random_select(D)
  - j = random_select(D)
  - If d(i, j) < min(Ti, Tj)   {Ti, Tj are the dissimilarity thresholds}
    - Store the comparison outcome in Vi and Vj.
    - If c(i) < c(j), move i to j; else move j to i.
  - End if
  - End while
Step 5: Return all non-empty bins as the set of final clusters.

Figure 3. Pseudo code of the pGSCM clustering algorithm (Tan, Ting, & Teng, 2011b).
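An illustrative Python sketch of the bin-merging loop in Figure 3, assuming Euclidean distance, per-object thresholds set from a sample of pairwise distances, and the level of support c(·) approximated by the current bin size; these are simplifying assumptions made here, not details taken from Tan, Ting and Teng (2011b).

```python
import numpy as np

def pgscm_sketch(data, max_iteration=5000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    # Rough per-object dissimilarity threshold: half the mean distance to a random sample.
    sample = data[rng.choice(n, size=min(n, 30), replace=False)]
    thresholds = np.array([np.linalg.norm(sample - x, axis=1).mean() * 0.5 for x in data])
    bin_of = np.arange(n)                      # each object starts in its own bin
    for _ in range(max_iteration):
        i, j = rng.integers(n), rng.integers(n)
        if bin_of[i] == bin_of[j]:
            continue
        if np.linalg.norm(data[i] - data[j]) < min(thresholds[i], thresholds[j]):
            # Support approximated by current bin size; merge the weaker bin into the stronger one.
            size_i = np.count_nonzero(bin_of == bin_of[i])
            size_j = np.count_nonzero(bin_of == bin_of[j])
            src, dst = (bin_of[i], bin_of[j]) if size_i < size_j else (bin_of[j], bin_of[i])
            bin_of[bin_of == src] = dst
    return bin_of   # the non-empty bins are the discovered clusters
```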
Of late, a newly inspired meta-heuristic algorithm has appeared, known as the Firefly Algorithm (FA). FA was developed and presented at Cambridge University by Xin-She Yang in 2008. The Firefly Algorithm is related to the behavior of fireflies, insects that produce short, rhythmic flashes of light; the rate of the rhythmic flashes and the flash duration bring two fireflies together. Furthermore, the distance between two fireflies affects the light: it becomes weaker and weaker as the distance increases. Yang formulated this mechanism by associating the flashing light with an objective function f(x), where x is the position of the firefly and every position has a different value of flashing light. Depending on the problem to be solved (maximization or minimization), the objective function is handled accordingly. Another factor in the FA, affected by the distance between two fireflies, is the attractiveness β. This factor changes with the distance between two fireflies: when two fireflies are attracted to each other, the brighter one attracts the dimmer one; this process changes the positions of the two fireflies and, in turn, the value of β. The pseudo code of the standard Firefly Algorithm is illustrated in Figure 4.

Step 1:
  - Objective function f(x), x = (x1, ..., xn)^T.
  - Generate an initial population of fireflies xi (i = 1, 2, ..., n) randomly.
  - Light intensity I at xi is determined by f(xi).
  - Define the light absorption coefficient γ.
Step 2:
  - While (t < MaxGeneration)
    - For i = 1 to N (N: all fireflies)
      - For j = 1 to N
        - If (Ii < Ij), move firefly i towards j; end if
        - Vary attractiveness with distance r via exp[-γr]
        - Evaluate new solutions and update light intensity
      - End For j
    - End For i
Step 3:
  - Rank the fireflies and find the current global best g*
  - End while
  - Post-process results and visualization

Figure 4. Pseudo code of the standard Firefly Algorithm (Yang, 2010).

The Firefly Algorithm (Yang, 2010) has been applied in many disciplines and proven successful, for example in image segmentation (Hassanzadeh, Vojodi, & Moghadam, 2011) and in the economic emissions load dispatch problem (Apostolopoulos & Vlachos, 2011). It has also been utilized successfully in numeric data clustering. Senthilnath, Omkar and Mani (2011) used the Firefly Algorithm for supervised clustering (the class of each object is known) and in a static manner (the number of clusters is defined). In their algorithm, each firefly at a specific location x in the 2D search space evaluates its fitness using an objective function related to the sum of Euclidean distances over all training data. The results demonstrate that FA can be used efficiently for clustering. Banati and Bajaj (2013) implemented the Firefly Algorithm differently, using FA for unsupervised learning (the class of each object is unknown). However, their implementation is still static, as in Senthilnath, Omkar and Mani's (2011) work, where the number of clusters is pre-defined.
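The following is a minimal, illustrative Python sketch of the loop in Figure 4 for a minimization problem; the attractiveness and randomization constants (beta0, gamma, alpha), the bounds and the test objective are assumptions chosen for demonstration, following the common textbook form of the update rule rather than any specific published implementation.

```python
import numpy as np

def firefly_minimize(objective, dim=2, n_fireflies=20, max_generation=100,
                     beta0=1.0, gamma=1.0, alpha=0.2, bounds=(-5.0, 5.0), seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_fireflies, dim))          # firefly positions
    intensity = np.array([objective(p) for p in x])           # lower objective = brighter here
    for _ in range(max_generation):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if intensity[j] < intensity[i]:                # firefly j is brighter than i
                    r = np.linalg.norm(x[i] - x[j])
                    beta = beta0 * np.exp(-gamma * r ** 2)     # attractiveness decays with distance
                    x[i] = (x[i] + beta * (x[j] - x[i])
                            + alpha * (rng.random(dim) - 0.5)) # small random step
                    x[i] = np.clip(x[i], lo, hi)
                    intensity[i] = objective(x[i])             # update light intensity
    best = x[intensity.argmin()]
    return best, intensity.min()

# Example: minimize the sphere function.
if __name__ == "__main__":
    best, value = firefly_minimize(lambda p: float(np.sum(p ** 2)))
    print(best, value)
```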
In this paper, the Firefly Algorithm is proposed for clustering documents automatically using a divisive clustering approach; the algorithm is termed Gravity Firefly Clustering (GF-CLUST). The proposed GF-CLUST integrates the work on the Gravitation Firefly Algorithm (GFA) (Mohammed, Yusof, & Husni, 2014a) with a cluster selection criterion (Picarougne, Azzag, Venturini, & Guinot, 2007). GF-CLUST operates on randomly positioned documents and employs the law of gravity to find the force between documents, which is used as the objective function.

METHODOLOGY

The proposed GF-CLUST works in three steps: data pre-processing, development of the vector space model and data clustering, as shown in Figure 5.

Data Pre-processing

This step is very important in any machine learning system. It can be defined as the process of converting a set of documents from an unstructured into a structured form. It involves three sub-steps, as shown in Figure 5: data cleaning, stop word removal and word stemming. Initially, the text is extracted from each document. The extracted text is cleaned of special characters and digits. Then, the text undergoes a splitting process that divides each cleaned text into a set of words. Next, words shorter than three characters (such as in, on, at) and stop words (such as prepositions and conjunctions) are removed. The last step in pre-processing is stemming, where every word is reduced to its root (Manning, Raghavan, & Schütze, 2008).

Development of Vector Space Model

This step is commonly used in information retrieval and data mining to represent the data in a vector space. In this work, each cleaned document is represented in a 2D matrix (columns and rows): the columns denote the m terms extracted from the documents, while the rows refer to the n documents. The term frequency-inverse document frequency (TF-IDF) is an efficient scheme used to identify the significance of terms in the text; the benefit of TF-IDF is the balance between local and global term weighting in a document (Aliguliyev, 2009; Manning, Raghavan, & Schütze, 2008). Equation 1 is used to calculate the TF-IDF:

tfidf_{t,d} = tf_{t,d} \times \log(N / df_t)   (1)

where tf_{t,d} is the frequency of term t in document d, df_t is the number of documents containing term t, and N is the total number of documents.
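A small, illustrative Python sketch of the pre-processing and TF-IDF weighting described above (Equation 1); the tiny stop word list and the crude suffix-stripping stemmer are stand-ins for the full resources a real system would use, not the authors' tooling.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "and", "for", "that", "with", "this", "are", "was"}   # illustrative subset

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer (e.g., Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())        # data cleaning: drop digits/specials
    words = [w for w in text.split() if len(w) >= 3]         # drop very short words
    words = [stem(w) for w in words if w not in STOP_WORDS]  # stop word removal + stemming
    return words

def tfidf_matrix(documents):
    """Build term weights per document using tfidf = tf * log(N / df)  (Equation 1)."""
    tokenized = [preprocess(d) for d in documents]
    n_docs = len(documents)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

if __name__ == "__main__":
    docs = ["The baseball game was exciting",
            "New hardware drivers for the printer",
            "Electronic circuits and hardware design"]
    for w in tfidf_matrix(docs):
        print(w)
```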
Figure 5. The process of Gravity Firefly Clustering (GF-CLUST): a document collection passes through data pre-processing (data cleaning, stop word removal, word stemming), vector space model development (constructing the TF-IDF matrix) and data clustering (identifying centers and creating clusters, then cluster selection), which yields the produced clusters.

Data Clustering

This step includes two processes: identifying centers and creating clusters, and cluster selection. The identification of centers is achieved using the GFA (Mohammed, Yusof, & Husni, 2014a), which employs Newton's law of gravity to determine the force between documents and uses it as the objective function. Newton's law of gravity states that "Every point mass attracts every single other point mass by a force pointing along the line intersecting both points. The force is proportional to the product of the two masses and inversely proportional to the square of the distance between them" (Rashedi, Nezamabadi-pour, & Saryazdi, 2009). Equation 2 shows Newton's law of gravity:

F = G \frac{M_1 M_2}{R^2}   (2)

where F is the force between two masses, G is the gravitational constant, M1 is the first mass, M2 is the second mass, and R is the distance between the two masses.

In the GFA (Mohammed, Yusof, & Husni, 2014a), F is the force between two documents, as shown in Equation 3, while G is replaced by the cosine similarity between the two documents, calculated using Equation 4 (Luo, Li, & Chung, 2009). M1 and M2 represent the total weights of the first and second documents and are calculated using Equation 5 (Mohammed, Yusof, & Husni, 2014b). The value of R is based on the Cartesian distance (Cdist) between the positions of the two documents, obtained using Equation 6 (Yang, 2010):

F(d_i, d_j) = CosineSimilarity(d_i, d_j) \times \frac{T(d_i) \times T(d_j)}{Cdist^2}   (3)

CosineSimilarity(d_i, d_j) = \sum_{k=1}^{m} d_{ik} \, d_{jk}   (4)

T(d_j) = \sum_{i=1}^{m} tfidf_{t_i, d_j}   (5)

Cdist(X_i, X_j) = \sqrt{(X_i - X_j)^2}   (6)

The representation of document positions in the GFA is illustrated in Figure 6, where x is a random value (for example, in the 20Newsgroups dataset the value is in the range 1-300) and y is fixed at 0.5.
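An illustrative Python sketch of the force computation in Equations 3-6, treating each document as a TF-IDF vector with a randomly assigned 2D position (x random, y = 0.5) as described above; the epsilon guard against a zero distance is an implementation detail added here, not part of the published formulation.

```python
import numpy as np

def document_force(tfidf_i, tfidf_j, pos_i, pos_j, eps=1e-9):
    """Force between two documents: F = CosineSimilarity * (T_i * T_j) / Cdist^2 (Eqs. 3-6)."""
    # Equation 4: cosine similarity, computed here as a dot product (assumes unit-normalized vectors).
    g = float(np.dot(tfidf_i, tfidf_j))
    # Equation 5: the "mass" of a document is the sum of its TF-IDF weights.
    m1, m2 = tfidf_i.sum(), tfidf_j.sum()
    # Equation 6: Cartesian distance between the two document positions.
    cdist = np.linalg.norm(pos_i - pos_j)
    # Equation 3: gravity-like force, with cosine similarity playing the role of G.
    return g * (m1 * m2) / (cdist ** 2 + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    tfidf = rng.random((4, 10))                                        # 4 documents, 10 terms (toy weights)
    positions = np.column_stack([rng.uniform(1, 300, 4), np.full(4, 0.5)])
    print(document_force(tfidf[0], tfidf[1], positions[0], positions[1]))
```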
Every firefly ni k i ni kkk iii11 ni ni ADDC will compete with each other; if a firefly has a brighter light than (7) another, then ADDC ADDC ADDC KK (7)(7) j 11 K (7)β between 1 it will attract the onesjjj1with less bright light. The attraction value two fireflies changes basedmon the distance between these fireflies. m m 2 22 m Dis ( d d j,j,,d ii))) jn in))) 2 ddddinin Dis ( d d 222 Dis ( d j i (((dddjnjn (8) Dis ( dj , di ) n (8)(8) jn in ) 1 (8)the firefly The position of the less nbright firefly will change. Changes of n11 position will then lead to the change of*C the force value (objective function in k ** j) P PP((( C P kkkof ( *C Cjjj))) firefly). After a specific number Purity Purity this algorithm that represents the light each Purity (9)(9) Purity kk (9) NN 1,.., N 1 ,.., ccc (9) light and kk N of iterations, GFA identifies firefly (document) with the brightest ,.., c 11,.., denotes it asPan cluster The pseudo-code for the identification of kkinitial *C Cj ) Maxkk center. k C j PP(((( ** Max C (10) P kk* CCjjj))) kk C (10) in Max kkk in CjjFigure j cluster centers GFA isMax presented 7. (10) (10) C 22**RR((k , ,CCj ) )**PP (( , ) j C , ) CCjj ))**PP(the (kkkk,first ,CCjjj))cluster starts by max the process (( 22**RR((of identified, kkk,,creating Once a center j FFF((( max (kk))))is (11) F max ( R C P C ( , ) * ( , ) max C jC ,...,Ck ( (i.e., (11) kk similar finding the most documents as (11) in equation R(k using , Cj )cosine * P (k similarity , Cj ) 1,..., C C C (11) C Cjjj C111,..., Ckkk RR((kkk,,CCjjj))**PP((kkk,,CCjjj)) ,...,C C 4). Documents that have high similarity to the centroid is located in the The number of the members class k in value cluster(in j this paper, first cluster. approach requires amembers specificofof threshold R( k , CThis ) The The number of the members of class kkin in cluster number of the class k cluster jjj (12) j The number of the members of class in cluster RR((( ,,C C )) value k ,threshold The number ofdifferent the members of class k R ) different is used for dataset). Documents that (12) do not j (12) C j kk The number of the members of class j (12) The number of of kkk Theare number of the themembers members of class class belong to the firstThe cluster ranked (step 17 in GFA process) based on it number of the members of class k in cluster j P( k , C The number of the members of class k in cluster j j ) isThe brightness. This to find a new center and later, create a new cluster. Such a number of the members of class k in cluster j of the of class k injcluster j (13) The number ofmembers the members of class PP((( C )) The number k ,,,C P ) j (13) C j processkkis repeated untilThe all number documents aremembers grouped accordingly. (13) The of the of class j (13) Thenumber numberof of the themembers membersof of class class jjj C C c k j k j 68C HC ( j ) cc CCj log CCj k k C (14) kkC jj jj kc1 kk C HC log j j HC log HC(((jjj))) log (14) 11 (14) CCj CCj kkk (14) C C 1 jj jj http://jict.uum.edu.my Journal of ICT, 15, No. 
Figure 6. The representation of document positions in the GFA: (a) initial positions of the fireflies (documents), and the positions after (b) the 1st iteration, (c) the 2nd iteration, (d) the 5th iteration, (e) the 10th iteration and (f) 20 iterations.

Step 1: Generate an initial population of fireflies xi, where i = 1, 2, ..., n and n = number of fireflies (documents).
Step 2: Initialize the light intensity I as the force between two documents using Equation 3.
Step 3: Define the light absorption coefficient γ; initially γ = 1.
Step 4: Define the randomization parameter α; α = 0.2.
Step 5: Define the initial attractiveness β.
Step 6: While t < number of iterations
Step 7:   For i = 1 to N
Step 8:     For j = 1 to N
Step 9:       If (Force Ii < Force Ij) then
Step 10:        Calculate the distance between i and j using Equation 6.
Step 11:        Calculate the attractiveness β.
Step 12:        Move document i towards j.
Step 13:        Update the force between the two documents (light intensity).
Step 14:      End For j
Step 15:    End For i
Step 16: Loop
Step 17: Rank to identify the center (brightest light).

Figure 7. Pseudo code of the GFA.
Upon obtaining the clusters, a cluster selection process is conducted. This is carried out by choosing the clusters whose size exceeds an identified threshold (i.e., clusters that contain more than 5% of the size of the dataset, n). To date, the threshold is set to n/20 for normally distributed data, such as the 20Newsgroups (20NewsgroupsDataSet, 2006) and Reuters-21578 (Lewis, 1999) collections (normally distributed here means that every class includes the same number of documents). This threshold is based on the criterion used by Tan et al. (2011), and the idea of merging clusters is adopted from Picarougne, Azzag, Venturini, and Guinot (2007): the merging step assigns the smaller clusters to the bigger ones. On the other hand, a threshold value of n/40 is used for datasets that are not normally distributed, such as TR11 and TR12, retrieved from the TREC collection (TREC, 1999).
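An illustrative Python sketch of this selection-and-merging step: clusters larger than the size threshold are kept, and documents from the remaining small clusters are reassigned to the most similar kept center by cosine similarity. The centroid-based reassignment is an assumption for illustration.

```python
import numpy as np

def select_and_merge(clusters, tfidf, divisor=20):
    """Keep clusters larger than n/divisor and fold the small ones into them."""
    n = len(tfidf)
    min_size = n / divisor                      # n/20 (5% of n) or n/40, depending on the dataset
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    unit = tfidf / np.where(norms == 0, 1, norms)
    kept = [c for c in clusters if len(c) > min_size]
    small = [c for c in clusters if len(c) <= min_size]
    if not kept:                                # degenerate case: nothing passes the threshold
        return clusters
    centers = np.stack([unit[c].mean(axis=0) for c in kept])   # centroid of each kept cluster
    for c in small:
        for doc in c:
            best = int(np.argmax(centers @ unit[doc]))          # most similar kept center
            kept[best].append(doc)
    return kept
```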
RESULTS

Data Sets

Four datasets are used in evaluating the performance of GF-CLUST. They were obtained from different resources: 20Newsgroups (20NewsgroupsDataSet, 2006), Reuters-21578 (Lewis, 1999) and the TREC collection (TREC, 1999). Table 1 displays the description of the chosen datasets.

Table 1
Description of Datasets

Dataset        Source         No. of Doc.  Classes  Min class size  Max class size  No. of Terms
20Newsgroups   20Newsgroups   300          3        100             100             2275
Reuters-21578  Reuters-21578  300          6        50              50              1212
TR11           TREC           414          9        6               132             6429
TR12           TREC           313          8        9               93              5804

The first dataset, 20Newsgroups, contains 300 documents separated into three classes: hardware, baseball and electronics. Each class contains 100 documents, and the collection has 2,275 terms. The second dataset, Reuters-21578, contains 300 documents distributed over six classes: 'earn', 'sugar', 'trade', 'ship', 'money-supply' and 'gold'. Each class includes 50 documents, and the collection has 1,212 terms. The third dataset, TR11, is derived from the TREC collection. It includes 414 documents distributed over nine classes; the smallest class contains six documents, the largest class contains 132 documents, and the collection comprises 6,429 terms. The fourth dataset, TR12, is also obtained from the TREC collection. It includes 313 documents with eight classes and 5,804 terms; the smallest class contains nine documents while the largest contains 93 documents.

Evaluation Metrics

In order to evaluate the performance of GF-CLUST against the state-of-the-art methods K-means (Jain, 2010), PSO (Cui, Potok, & Palathingal, 2005) and pGSCM (Tan, Ting, & Teng, 2011a), four evaluation metrics are employed: ADDC, Purity, F-measure and Entropy (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013; Murugesan & Zhang, 2011).

The first metric, ADDC, measures the compactness of the obtained clusters. A smaller value of ADDC indicates a better clustering that satisfies the optimization constraints (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013). The ADDC is defined in Equations 7 and 8:

ADDC = \frac{1}{K} \sum_{i=1}^{K} \left( \frac{1}{n_i} \sum_{j=1}^{n_i} Dis(c_i, d_j) \right)   (7)

Dis(d_j, d_i) = \sqrt{\sum_{n=1}^{m} (d_{jn} - d_{in})^2}   (8)

where K refers to the number of clusters, n_i refers to the number of documents in cluster i, c_i refers to the center of cluster i, d_j refers to a document in cluster i, and Dis(c_i, d_j) is the Euclidean distance (Murugesan & Zhang, 2011) calculated using Equation 8.

Purity is defined as the weighted sum of the individual cluster purities, as shown in Equation 9. The purity of a cluster is calculated from the largest class of documents assigned to that cluster, as shown in Equation 10. The larger the value of purity, the better the clustering solution is (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013; Murugesan & Zhang, 2011):

Purity = \frac{1}{N} \sum_{j} P(k, C_j)   (9)

P(k, C_j) = \max_{k} |k \cap C_j|   (10)

On the other hand, the F-measure metric measures the accuracy of the clustering solution, as shown in Equation 11. It is obtained from two metrics that are widely used in the evaluation of information retrieval systems: recall and precision. Recall is the number of documents of a specific class in a specific cluster divided by the number of members of that class in the whole dataset, as shown in Equation 12, while precision is the number of documents of a specific class in a specific cluster divided by the size of that cluster, as shown in Equation 13. A larger value of F-measure indicates a better clustering solution (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013; Murugesan & Zhang, 2011):

F(k) = \max_{C_j \in \{C_1, \dots, C_K\}} \frac{2 \times R(k, C_j) \times P(k, C_j)}{R(k, C_j) + P(k, C_j)}   (11)

R(k, C_j) = \frac{\text{number of members of class } k \text{ in cluster } j}{\text{number of members of class } k}   (12)

P(k, C_j) = \frac{\text{number of members of class } k \text{ in cluster } j}{\text{number of members of cluster } j}   (13)

The Entropy measures the goodness and randomness of the clusters (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013; Murugesan & Zhang, 2011). It also measures the distribution of classes in each cluster. A clustering solution reaches high performance when its clusters contain documents from a single class; in this situation, the entropy value of the clustering solution will be zero. A smaller value of entropy demonstrates a better clustering performance. Equations 14 and 15 are used to compute the Entropy:

HC(j) = -\sum_{k=1}^{c} \frac{|k \cap C_j|}{|C_j|} \log \frac{|k \cap C_j|}{|C_j|}   (14)

H = \sum_{j=1}^{K} \frac{HC(j) \times |C_j|}{N}   (15)

Experimental Result and Discussion

The proposed GF-CLUST, K-means (Jain, 2010), PSO (Cui, Potok, & Palathingal, 2005) and pGSCM (Tan et al., 2011a) were each executed 10 times before the average values of the evaluation metrics were obtained. All experiments were carried out in Matlab on Windows 8 with a 2000 MHz processor and 4 GB of memory. Table 2 tabulates the results of Purity, F-measure, Entropy, ADDC and the number of obtained clusters.

Table 2
Performance Results and Standard Deviation: GF-CLUST vs. K-means vs. PSO vs. pGSCM

Validity index / Dataset   GF-CLUST        K-means          PSO              pGSCM
Purity
  20Newsgroups             0.5667 (0.00)   0.3443 (0.0314)  0.4097 (0.0691)  0.3853 (0.0159)
  Reuters                  0.4867 (0.00)   0.3240 (0.0699)  0.3497 (0.0697)  0.2357 (0.0147)
  TR11                     0.4710 (0.00)   0.4087 (0.0811)  0.4092 (0.0606)  0.3372 (0.0184)
  TR12                     0.4920 (0.00)   0.4096 (0.0761)  0.3712 (0.0405)  0.3029 (0.0072)
F-measure
  20Newsgroups             0.5218 (0.00)   0.4974 (0.0073)  0.5085 (0.0390)  0.3571 (0.0275)
  Reuters                  0.3699 (0.00)   0.3575 (0.0664)  0.3522 (0.0654)  0.2400 (0.0108)
  TR11                     0.3213 (0.00)   0.3792 (0.0861)  0.3492 (0.0539)  0.2520 (0.0346)
  TR12                     0.3851 (0.00)   0.3678 (0.0958)  0.3161 (0.0444)  0.2245 (0.0171)
Entropy
  20Newsgroups             1.3172 (0.00)   1.5751 (0.0272)  1.4847 (0.0935)  1.5630 (0.0132)
  Reuters                  1.6392 (0.00)   2.0964 (0.2358)  2.1483 (0.1563)  2.5144 (0.0245)
  TR11                     2.0119 (0.00)   2.2620 (0.3850)  2.2970 (0.1913)  2.5660 (0.0586)
  TR12                     1.9636 (0.00)   2.2768 (0.2834)  2.4571 (0.1551)  2.6647 (0.0353)
ADDC
  20Newsgroups             1.9739 (0.00)   0.5802 (0.2057)  1.8955 (0.2166)  1.7517 (0.0159)
  Reuters                  1.5987 (0.00)   0.6957 (0.1341)  1.3806 (0.1115)  1.4078 (0.0142)
  TR11                     1.1316 (0.00)   0.5600 (0.2648)  0.7283 (0.1362)  1.0410 (0.0115)
  TR12                     1.1246 (0.00)   0.4592 (0.1001)  0.7486 (0.1794)  1.0604 (0.0065)
Number of clusters
  20Newsgroups             5               3                3                6
  Reuters                  7               6                6                6.3
  TR11                     8               9                9                7.2
  TR12                     8               8                8                6.3

Note: In the original table, the best value in each row is highlighted in bold.

As can be seen from Table 2 and Figure 8, the Purity value of GF-CLUST is higher than that of K-means, PSO and pGSCM in all datasets used in this paper (20Newsgroups, Reuters, TR11 and TR12). It is also noted that PSO outperforms K-means and pGSCM in most datasets. On the other hand, pGSCM produces the lowest purity in all datasets except 20Newsgroups, where it is better than K-means.
Based on the literature, a better clustering is one that produces a high purity value (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013; Murugesan & Zhang, 2011).

Figure 8. Purity result: GF-CLUST vs. K-means vs. PSO vs. pGSCM.

The comparative results of F-measure for GF-CLUST, K-means, PSO and pGSCM are tabulated in Table 2 and represented graphically in Figure 9. The F-measure evaluates the accuracy of the clustering. It can be noticed from Figure 9 and Table 2 that the proposed GF-CLUST has a higher F-measure value in most datasets (20Newsgroups, Reuters and TR12), despite not being supported with any information on the number of clusters. Nevertheless, for the TR11 dataset, K-means outperforms GF-CLUST, generating 0.3792 compared to 0.3213.

Figure 9. F-measure result: GF-CLUST vs. K-means vs. PSO vs. pGSCM.

Further, as can be seen in Table 2, GF-CLUST has the best (lowest) Entropy compared to K-means, PSO and pGSCM in all datasets tested in this work (20Newsgroups, Reuters, TR11 and TR12). GF-CLUST produces 1.3172, 1.6392, 2.0119 and 1.9636 for 20Newsgroups, Reuters, TR11 and TR12, respectively. It is also noted that K-means is a better algorithm than PSO and pGSCM in clustering most of the datasets (Reuters, TR11 and TR12), as it produces lower Entropy (i.e., 2.0964, 2.2620 and 2.2768). Figure 10 illustrates a graphical representation of the Entropy for GF-CLUST, K-means, PSO and pGSCM.
Figure 10. Entropy result: GF-CLUST vs. K-means vs. PSO vs. pGSCM.

In terms of the distance between documents in a cluster, the ADDC values (based on Euclidean distance) are displayed in Table 2 for GF-CLUST, K-means, PSO and pGSCM. This value is used to show how well an algorithm satisfies the optimization constraints. The interpretation of ADDC is similar to that of the Entropy metric, where a smaller value indicates a better algorithm (Forsati, Mahdavi, Shamsfard, & Meybodi, 2013). The ADDC results indicate that K-means is better than PSO, pGSCM and the proposed GF-CLUST on this metric in all datasets. The reason behind this performance is that K-means utilizes the Euclidean distance in assigning documents to clusters, whereas GF-CLUST employs cosine similarity with a specific threshold to identify the members of a cluster; such an approach does not take into account the physical distance between documents. Figure 11 illustrates a graphical representation of the ADDC values of GF-CLUST, K-means, PSO and pGSCM.

Figure 11. ADDC result: GF-CLUST vs. K-means vs. PSO vs. pGSCM.

Research in clustering aims at producing clusters that match the actual number of clusters. As can be seen in Table 2, the average numbers of clusters obtained by GF-CLUST are 5, 7, 8 and 8 for the four datasets 20Newsgroups, Reuters-21578, TR11 and TR12. These values are close to the actual numbers of clusters, which are 3, 6, 9 and 8. The result of the proposed GF-CLUST is better than that of the dynamic method pGSCM, which generates 6, 6.3, 7.2 and 6.3 for the said datasets. Figure 12 shows a graphical representation of the number of obtained clusters for the different algorithms and datasets.

Figure 12. Results of the number of generated clusters by GF-CLUST, pGSCM, K-means and PSO.
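For reference, the scores discussed above can be reproduced from a cluster assignment with a few lines of code. The following is an illustrative Python sketch of the Purity, F-measure and Entropy computations in Equations 9-15; the label encoding, the natural logarithm and the class-size weighting used to aggregate the per-class F values into a single score are assumptions made here, not the authors' evaluation scripts.

```python
import numpy as np
from collections import Counter

def purity(classes, clusters):
    """Equations 9-10: sum of the largest class count per cluster, divided by N."""
    n = len(classes)
    total = 0
    for c in set(clusters):
        members = [classes[i] for i in range(n) if clusters[i] == c]
        total += Counter(members).most_common(1)[0][1]
    return total / n

def f_measure(classes, clusters):
    """Equations 11-13: for each class take the best F over clusters, then weight by class size."""
    n = len(classes)
    score = 0.0
    for k in set(classes):
        class_size = sum(1 for c in classes if c == k)
        best = 0.0
        for j in set(clusters):
            cluster_size = sum(1 for c in clusters if c == j)
            overlap = sum(1 for i in range(n) if classes[i] == k and clusters[i] == j)
            if overlap == 0:
                continue
            recall, precision = overlap / class_size, overlap / cluster_size
            best = max(best, 2 * recall * precision / (recall + precision))
        score += (class_size / n) * best
    return score

def entropy(classes, clusters):
    """Equations 14-15: per-cluster class entropy, weighted by cluster size."""
    n = len(classes)
    h = 0.0
    for j in set(clusters):
        members = [classes[i] for i in range(n) if clusters[i] == j]
        hc = 0.0
        for count in Counter(members).values():
            p = count / len(members)
            hc -= p * np.log(p)
        h += hc * len(members) / n
    return h

if __name__ == "__main__":
    classes  = ["a", "a", "a", "b", "b", "c"]   # true class labels
    clusters = [ 0,   0,   1,   1,   1,   2 ]   # clustering result
    print(purity(classes, clusters), f_measure(classes, clusters), entropy(classes, clusters))
```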
CONCLUSION

This paper presents a new method, known as the Gravity Firefly clustering method (GF-CLUST), for automatic document clustering, where the idea mimics the behavior of the firefly insect in nature. GF-CLUST has the ability to identify a near optimal number of clusters, and this is achieved in three steps: data pre-processing, development of the vector space model, and clustering. In the clustering step, GF-CLUST uses GFA to identify documents with high force as centers and creates clusters based on cosine similarity. It later chooses quality clusters and assigns small-size clusters to them. Results on four benchmark datasets indicate that GF-CLUST is a better clustering algorithm than pGSCM. Furthermore, GF-CLUST performs better than K-means, PSO and pGSCM in terms of cluster quality (purity, F-measure and Entropy), hence indicating that GF-CLUST can become a competitor in the area of dynamic swarm-based clustering.

REFERENCES

20NewsgroupsDataSet. (2006). Retrieved from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-4/text-learning/www/datasets.html

Aggarwal, C. C., & Reddy, C. K. (2014). Data clustering: Algorithms and applications. CRC Press, Taylor and Francis Group.

Aliguliyev, R. M. (2009). Clustering of document collection - A weighted approach. Expert Systems with Applications, 36(4), 7904–7916. doi:10.1016/j.eswa.2008.11.017

Apostolopoulos, T., & Vlachos, A. (2011). Application of the firefly algorithm for solving the economic emissions load dispatch problem. International Journal of Combinatorics, 2011, 23. doi:10.1155/2011/523806

Banati, H., & Bajaj, M. (2013). Performance analysis of Firefly algorithm for data clustering. International Journal of Swarm Intelligence, 1(1), 19–35.

Beasley, D., Bull, D. R., & Martin, R. R. (1993). An overview of genetic algorithms: Part 1, fundamentals. University Computing, 15(2), 58–69.

Bonabeau, E., Dorigo, M., & Theraulaz, G. (1999). Swarm intelligence: From natural to artificial systems. New York, NY: Oxford University Press, Santa Fe Institute Studies in the Sciences of Complexity.

Boussaïd, I., Lepagnot, J., & Siarry, P. (2013). A survey on optimization metaheuristics. Information Sciences, 237, 82–117.

Cui, X., Gao, J., & Potok, T. E. (2006). A flocking based algorithm for document clustering analysis. Journal of Systems Architecture, 52(8–9), 505–515.

Cui, X., Potok, T. E., & Palathingal, P. (2005). Document clustering using particle swarm optimization. In Proceedings of the 2005 IEEE Swarm Intelligence Symposium, SIS 2005 (pp. 185–191). IEEE. doi:10.1109/SIS.2005.1501621

Deneubourg, J. L., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C., & Chrétien, L. (1991). The dynamics of collective sorting: Robot-like ants and ant-like robots.
In Proceedings of the First International Conference on Simulation of Adaptive Behavior: From Animals to Animats (pp. 356–363). Cambridge, MA: MIT Press.

Fogel, D. B. (1994). Asymptotic convergence properties of genetic algorithms and evolutionary programming. Cybernetics and Systems, 25(3), 389–407.

Forsati, R., Mahdavi, M., Shamsfard, M., & Meybodi, M. R. (2013). Efficient stochastic algorithms for document clustering. Information Sciences, 220, 269–291. doi:10.1016/j.ins.2012.07.025

Gil-Garicia, R., & Pons-Porrata, A. (2010). Dynamic hierarchical algorithms for document clustering. Pattern Recognition Letters, 31(6), 469–477. doi:10.1016/j.patrec.2009.11.011

Glover, F. (1986). Future paths for integer programming and links to artificial intelligence. Computers and Operations Research, 13(5), 533–549.

Hassanzadeh, T., Vojodi, H., & Moghadam, A. M. E. (2011). An image segmentation approach based on maximum variance intra-cluster method and Firefly algorithm. In Seventh International Conference on Natural Computation (ICNC) (Vol. 3, pp. 1817–1821). Shanghai: IEEE. doi:10.1109/ICNC.2011.6022379

Hatamlou, A., Abdullah, S., & Nezamabadi-pour, H. (2012). A combined approach for clustering based on K-means and gravitational search algorithms. Swarm and Evolutionary Computation, 6, 47–52. doi:10.1016/j.swevo.2012.02.003

Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651–666. doi:10.1016/j.patrec.2009.09.011

Kashef, R., & Kamel, M. (2010). Cooperative clustering. Pattern Recognition, 43(6), 2315–2329. doi:10.1016/j.patcog.2009.12.018

Kashef, R., & Kamel, M. S. (2009). Enhanced bisecting k-means clustering using intermediate cooperation. Pattern Recognition, 42(11), 2557–2569.

Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks IV. Perth, WA: IEEE. doi:10.1109/ICNN.1995.488968

Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, New Series, 220(4598), 671–680.

Kuo, R. J., & Zulvia, F. E. (2013). Automatic clustering using an improved particle swarm optimization. Journal of Industrial and Intelligent Information, 1(1), 46–51.

Lewis, D. (1999). The Reuters-21578 text categorization test collection. Retrieved from http://kdd.ics.uci.edu/database/reuters21578/reuters21578.html

Luo, C., Li, Y., & Chung, S. M. (2009). Text document clustering based on neighbors. Data & Knowledge Engineering, 68(11), 1271–1288. doi:10.1016/j.datak.2009.06.007

Mahmuddin, M. (2008). Optimisation using Bees algorithm on unlabelled data problems. Manufacturing Engineering Centre. Cardiff: Cardiff University, UK.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval (1st ed.). New York, NY: Cambridge University Press.

Mohammed, A. J., Yusof, Y., & Husni, H. (2014). A Newton's universal gravitation inspired Firefly algorithm for document clustering. In H. Y. Jeong, N. Y. Yen, & J. J. (Jong Hyuk) Park (Eds.), Advances in Computer Science and its Applications, Lecture Notes in Electrical Engineering, Vol. 279 (pp. 1259–1264). Springer Berlin Heidelberg.
Mohammed, A. J., Yusof, Y., & Husni, H. (2014). Weight-based Firefly algorithm for document clustering. In T. Herawan, M. M. Deris, & J. Abawajy (Eds.), Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013), Lecture Notes in Electrical Engineering, Vol. 285 (pp. 259–266). Springer Berlin Heidelberg.

Murugesan, K., & Zhang, J. (2011). Hybrid hierarchical clustering: An experimental analysis (p. 26). University of Kentucky.

Mustaffa, Z., Yusof, Y., & Kamaruddin, S. (2013). Enhanced ABC-LSSVM for energy fuel price prediction. Journal of Information and Communication Technology, 12, 73–101. Retrieved from http://www.jict.uum.edu.my/index.php

Picarougne, F., Azzag, H., Venturini, G., & Guinot, C. (2007). A new approach of data clustering using a flock of agents. Evolutionary Computation, 15(3), 345–367. Cambridge, MA: MIT Press.

Rashedi, E., Nezamabadi-pour, H., & Saryazdi, S. (2009). GSA: A gravitational search algorithm. Information Sciences, 179(13), 2232–2248.

Rothlauf, F. (2011). Design of modern heuristics: Principles and application. Springer-Verlag Berlin Heidelberg. doi:10.1007/978-3-540-72962-4

Sayed, A., Hacid, H., & Zighed, D. (2009). Exploring validity indices for clustering textual data. In Mining Complex Data, 165, 281–300.

Senthilnath, J., Omkar, S. N., & Mani, V. (2011). Clustering using Firefly algorithm: Performance study. Swarm and Evolutionary Computation, 1(3), 164–171. doi:10.1016/j.swevo.2011.06.003

Tan, S. C. (2012). Simplifying and improving swarm based clustering. In IEEE Congress on Evolutionary Computation (CEC) (pp. 1–8). Brisbane, QLD: IEEE.

Tan, S. C., Ting, K. M., & Teng, S. W. (2011a). A general stochastic clustering method for automatic cluster discovery. Pattern Recognition, 44(10–11), 2786–2799.

Tan, S. C., Ting, K. M., & Teng, S. W. (2011b). Simplifying and improving ant-based clustering. In Procedia Computer Science (pp. 46–55).

Tang, R., Fong, S., Yang, X. S., & Deb, S. (2012). Integrating nature-inspired optimization algorithms to K-means clustering. In Seventh International Conference on Digital Information Management (ICDIM), 2012 (pp. 116–123). Macau: IEEE. doi:10.1109/ICDIM.2012.6360145

TREC. (1999). Text REtrieval Conference (TREC). Retrieved from http://trec.nist.gov

Yang, X. S. (2010). Nature-inspired metaheuristic algorithms (2nd ed.). United Kingdom: Luniver Press.

Yujian, L., & Liye, X. (2010). Unweighted multiple group method with arithmetic mean. In IEEE Fifth International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA) (pp. 830–834). Changsha: IEEE. doi:10.1109/BICTA.2010.5645232

Zhong, J., Liu, L., & Li, Z. (2010). A novel clustering algorithm based on gravity and cluster merging. Advanced Data Mining and Applications, 6440, 302–309. doi:10.1007/978-3-642-17316-5_30