Deepthi_Webclustering Report
Web Clustering Engines

SEMINAR REPORT
2009-2011

In partial fulfillment of the requirements for the Degree of Master of Technology in COMPUTER & INFORMATION SCIENCE

SUBMITTED BY
DEEPTHI THERESA K.K.

DEPARTMENT OF COMPUTER SCIENCE
COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
KOCHI – 682 022

CERTIFICATE

This is to certify that the seminar report entitled “Web Clustering Engines”, submitted by Deepthi Theresa K.K. in partial fulfillment of the requirements for the award of M.Tech in Computer & Information Science, is a bonafide record of the seminar presented by her during the academic year 2010.

Mr. G. Santhosh Kumar, Lecturer, Dept. of Computer Science
Prof. Dr. K. Poulose Jacob, Director, Dept. of Computer Science

ACKNOWLEDGEMENT

First of all, let me thank our Director, Prof. Dr. K. Poulose Jacob, Dept. of Computer Science, CUSAT, who provided the necessary facilities and advice. I am also thankful to Mr. G. Santhosh Kumar, Lecturer, Dept. of Computer Science, CUSAT, for his valuable suggestions and support for the completion of this seminar. With great pleasure I remember Dr. Sumam Mary Idicula, Reader, Dept. of Computer Science, CUSAT, for her sincere guidance. I am also thankful to all the teaching and non-teaching staff in the department and to my friends for extending their warm kindness and help. I would like to thank my parents, without whose blessings and support I would not have been able to accomplish my goal. I also extend my thanks to all my well-wishers. Finally, I thank the Almighty for guidance and blessings.

ABSTRACT

Web clustering engines are an emerging trend in the field of information retrieval. They organize search results by topic, thus offering a complementary view to the flat ranked list returned by conventional search engines.
The search results returned by traditional search engines for different subtopics or meanings of a query are mixed together in the list, so the user may have to sift through a large number of irrelevant items to locate those of interest. Web clustering engines categorize the search results into hierarchical groups (clusters) and display the cluster labels, so the user can locate the desired document quickly. In this seminar we discuss in detail the phases in the implementation of web clustering engines and present some web clustering algorithms together with their advantages and issues. We also introduce some web clustering engines currently in use. Some future research directions are also presented.

Additional Key Words and Phrases: Web clustering engines, information retrieval, meta search engines, search results clustering, search results acquisition, preprocessing, cluster construction and labeling, Vector Space Model, data-centric clustering algorithms, description-aware algorithms

Contents
1. Introduction
   1.1 Motivation
   1.2 Goal of web clustering engines
   1.3 Issues in the implementation of clusters
2. Architecture and techniques of web clustering engines
   2.1 Architecture of web clustering engines
       2.1.1 Search results acquisition
       2.1.2 Preprocessing of search results
       2.1.3 Cluster construction and labeling
             2.1.3.1 Data centric clustering algorithms
             2.1.3.2 Description aware algorithms
       2.1.4 Visualization of clustered results
3. Efficiency and future works
   3.1 Search results clustering efficiency factors
   3.2 Improve efficiency of clustering
   3.3 Performance evaluation
   3.4 Research directions and future works
4. Conclusion
5. References
6. Appendix

Seminar Report 2010, Web Clustering Engine

1. INTRODUCTION

1.1 MOTIVATION

Search engines are an invaluable tool for retrieving information from the Web.
In response to a user query, they return a list of results ranked in order of relevance to the query. The user starts at the top of the list and follows it down, examining one result at a time, until the sought information has been found. Nowadays efficient search engines such as Google and Yahoo! are available. Although they work well for navigational and transactional searches, they are less effective for ambiguous queries, that is, queries that have different meanings in different contexts. The search results returned by conventional search engines for different subtopics or meanings of a query are mixed together in the list, so the user may have to sift through a large number of irrelevant items to locate those of interest.

This is where clustering of search results comes into the picture. Clustering is the act of grouping similar objects into sets. The distance between objects in the same cluster (intra-cluster variation) should be minimal, and the distance between objects in different clusters (inter-cluster variation) should be maximal. In the web search context, clustering means organizing web pages (search results) into groups so that different groups correspond to different user needs. In 1979 Van Rijsbergen introduced the Cluster Hypothesis in the field of information retrieval. It states that “Closely related documents tend to be relevant to the same requests.”

Web clustering engines are systems that perform clustering of web search results. These systems group the results returned by a search engine into a hierarchy of labeled clusters (also called categories).

To illustrate, Figure 1 in the appendix shows the clustered results returned for the query “tiger”. This result was produced by a very popular web clustering engine called Vivisimo (as of March 5, 2010).
Like many queries on the Web, “tiger” has multiple meanings: the feline, the Mac OS X computer operating system, the golf champion, and so on. These different meanings are well represented in Figure 1. By contrast, if we submit the query “tiger” to Google or Yahoo! (Figure 2), the items for each meaning are scattered through the ranked list of search results, often across a large number of result pages.

The first commercial clustering engine was Northern Light, at the end of the 1990s. It was based on a predefined set of categories to which the search results were assigned. A major breakthrough was then made by Vivisimo, whose clusters and cluster labels were generated dynamically from the search results. Other available clustering engines include Clusty, Grokker, KartOO, Lingo3G and CREDO.

1.2 GOAL OF WEB CLUSTERING ENGINES

Web clustering engines organize search results by topic, thus offering a complementary view to the flat ranked list returned by conventional search engines. The main advantages of the cluster hierarchy are the following.

It makes for shortcuts to the items that relate to the same meaning. Since web clustering engines group the search results that have the same meaning within the same cluster, it is easy for the user to find similar documents, and search time is reduced.

It allows better topic understanding. Since web clustering engines give a high-level view of the query, they are useful for informational searches in unknown or dynamic domains.

It favors systematic exploration of search results. Because a clustering engine summarizes the content of many search results in a single view on the first result page, the user may review hundreds of potentially relevant results without the need to download and scroll through subsequent pages.
A clustering engine tries to address the limitations of current search engines by providing clustered results as an added feature to their standard user interface.

1.3 ISSUES IN THE IMPLEMENTATION OF CLUSTERS

Unlike conventional document clustering, web search results clustering deals with billions of constantly changing pages. The data are mainly unstructured and heterogeneous, and there is additional information to consider (e.g., links and click-through data). This dynamic nature of the data, together with the interactive use of clustered results, poses new requirements and challenges to clustering technology:

Short input data description. For computational reasons, the data available to the clustering algorithm for each search result are usually limited to a URL, an optional title, and a short excerpt of the document's text (the snippet).

Meaningful labels. Each cluster label should indicate the contents of the items within that cluster.

Selection of similarity measure. Many methods are known for measuring the similarity or dissimilarity between two items, such as Euclidean distance and Manhattan distance.

Grouping of objects into clusters. Many approaches are available for grouping the objects, such as agglomerative clustering, suffix tree clustering and k-means clustering.

Computational efficiency. Search results clustering is performed online, within an application that requires overall subsecond response times. The critical step is the acquisition of search results, whereas the efficiency of the cluster construction algorithm is less important because of the low number of input results.

Overlapping clusters. Since the same result may apply to different themes, overlapping clusters may be allowed. Handling overlapping clusters in a dynamic environment is an open issue.

Unknown number of clusters.
In search results clustering, neither the number nor the size of clusters can be predetermined, because they vary with the query.

2. ARCHITECTURE AND TECHNIQUES OF WEB CLUSTERING ENGINES

2.1 ARCHITECTURE OF WEB CLUSTERING ENGINES

Practical implementations of web search clustering engines usually consist of four general components: search results acquisition, input preprocessing, cluster construction, and visualization of clustered results, all arranged in a processing pipeline.

2.1.1 SEARCH RESULTS ACQUISITION

The task of the search results acquisition component is to provide input for the rest of the system. Based on the query, the acquisition component must deliver 50 to 500 results, each of which should contain a title, a contextual snippet, and the URL pointing to the full text being referred to. The source of search results can be any public search engine, such as Google or Yahoo!. Clustering is applied to this smaller set of documents, returned by the conventional search engines in response to the query. The most elegant way of fetching results from such search engines is to use the application programming interfaces (APIs) these engines provide.

2.1.2 PREPROCESSING OF SEARCH RESULTS

Input preprocessing is a step that is common to all search results clustering systems. Its primary aim is to convert the contents of search results (output by the acquisition component) into a sequence of features used by the actual clustering algorithm. The steps for feature extraction are language identification, tokenization, stemming, and selection of features.

Clustering engines that support multilingual content must perform initial language recognition on each search result in the input.
During the tokenization step, the text of each search result is split into a sequence of basic independent units called tokens, which usually represent single words, numbers, symbols and so on. Tokenization becomes much more complex for languages where white spaces are not present (such as Chinese) or where the text may switch direction (such as an Arabic text within which English phrases are quoted).

The aim of stemming is to remove the inflectional prefixes and suffixes of each word and thus reduce different grammatical forms of the word to a common base form called a stem. For example, the words connected, connecting and interconnection would be transformed to the word connect. Here connect is the stem.

Last but not least, the preprocessing step needs to extract features for each search result present in the input. Features are atomic entities by which we can describe an object and represent its most important characteristics to an algorithm. When looking at text, the most intuitive set of features would be simply the words of a given language. But this is not the only possibility. The features can vary from single words and fixed-length tuples of words (n-grams) to frequent phrases (variable-length sequences of words) and very algorithm-specific data structures, such as approximate sentences.

One method for representing a text is the Vector Space Model (VSM). A document d is represented in the VSM as a vector [w_t0, w_t1, ..., w_tn], where t0, t1, ..., tn is a global set of words (features) and w_ti expresses the weight (importance) of feature ti to document d. Weights in a document vector typically reflect the distribution of occurrences of features in that document. For example, a term vector for the phrase “Polly had a dog and the dog had Polly” could appear as follows (weights are simply counts of words; articles are rarely specific to any document and normally would be omitted):

Term:    a  and  dog  had  Polly  the
Weight:  1   1    2    2     2     1
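The word-count weighting used for the “Polly” phrase can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name term_vector is hypothetical, tokens are obtained by lowercasing and splitting on white space, and no stemming or stop-word removal is applied.

```python
from collections import Counter


def term_vector(text: str, vocabulary: list[str]) -> list[int]:
    """Return raw term counts of `text`, one weight per vocabulary term."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


if __name__ == "__main__":
    phrase = "Polly had a dog and the dog had Polly"
    # The global feature set is built here from the phrase itself.
    vocabulary = sorted(set(phrase.lower().split()))
    for term, weight in zip(vocabulary, term_vector(phrase, vocabulary)):
        print(f"{term}: {weight}")
```

In a real engine the vocabulary would be shared by all search results, so that every document vector has the same dimensions and vectors can be compared with a distance measure.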
2.1.3 CLUSTER CONSTRUCTION AND LABELLING

The set of search results, along with the features extracted in the preprocessing step, is given as input to the clustering algorithm, which is responsible for building the clusters and labeling them. A number of clustering algorithms are available. We can classify them into two categories: data-centric and description-aware. In search results clustering, users are the ultimate consumers of the clusters, so the clusters must be aptly labeled. The labels should be unique, unambiguous, comprehensive and faithful to the content. A poorly labeled cluster is useless even though it contains closely related, relevant documents.

2.1.3.1 DATA CENTRIC CLUSTERING ALGORITHMS

The representatives of this group are conventional data clustering algorithms such as Agglomerative Hierarchical Clustering (AHC) and K-means.

Scatter/Gather, developed in 1992 at Xerox PARC, is a landmark example of a data-centric system. It is commonly perceived as a predecessor and conceptual parent of all clustering systems that appeared later. The system uses the VSM for text representation, and its clustering technique is agglomerative hierarchical clustering (AHC) with an average-link merge criterion. It starts with an initial clustering of a collection of documents into a set of k clusters (scatter). At query time the user selects the clusters of interest (gather), and the system re-clusters those documents. This process repeats until a small cluster of relevant documents is found. The following figure depicts the function of a Scatter/Gather system.

Agglomerative Hierarchical Clustering (AHC) is a typical example of a data-centric clustering algorithm. It is a bottom-up approach. Initially each document is in its own cluster. A distance matrix (dissimilarity matrix) is built for every pair of clusters.
The two closest clusters are merged, and a new distance matrix is built in which the merged clusters are replaced by a single cluster. This process continues until the desired number of clusters, k, is reached. The complexity of this algorithm is at least O(n^2), where n is the number of initial clusters (documents), because the full distance matrix must be built.

Another data-centric algorithm is K-means clustering. Here k is the predefined number of clusters, and each cluster is represented by the mean (average) of its members, its centroid; hence the name. The steps are:

1. Choose the number of clusters k.
2. Randomly generate k clusters and determine each cluster's representative (centroid).
3. Calculate the distance between each cluster centroid and each document.
4. Assign each document to the nearest cluster centroid.
5. Recompute the cluster centroids.
6. Repeat steps 3 to 5 until some convergence criterion is met.

The complexity is O(knT), where k is the number of clusters, n is the number of documents and T is the number of iterations needed to reach a stable state (one in which no document changes its cluster membership).

Data-centric algorithms borrow their strengths from well-known and proven techniques targeted at clustering numeric data. Even though they use simple keyword-based features, they are still powerful methods. But these algorithms have some difficulties. None of them is incremental. “Incremental” here means that as each document arrives from the web, we “clean” it and add it to the available model. Another difficulty of data-centric approaches concerns meaningful labels. In these algorithms cluster labels are created by selecting frequent keywords from the documents in the cluster. This keyword-based representation is insufficient from the user's perspective.
Once a text is converted to a document vector we can hardly speak of the text's meaning, because the vector is basically a collection of unrelated terms. With features extracted in a keyword-based approach, the content of a cluster is not very readable.

To support this argument, refer to Figure 3 in the appendix. The query used there is: retrieve the top 250 documents that contain the word star. We ask Scatter/Gather to place the 250 documents into 5 groups. The figure contains only the first scattered clusters. Shown are the clusters' sizes (how many documents they contain), a list of topical terms, and a list of document titles. One can see from the topical terms of Cluster 1 that this cluster contains documents that involve stars as symbols, as in military rank and patriotic songs. Cluster 2 has 68 documents that appear mainly to be about movie and TV stars. Cluster 3 contains 97 documents that have to do with aspects of astrophysics. Cluster 4 contains 67 documents, also about astronomy and astrophysics; this cluster contains many articles about people who are astronomers. Cluster 5 contains all the articles that discuss animals or plants and that happen to contain the word star, for example, starfish. But by looking at the keyword lists of these clusters alone, we can hardly arrive at these descriptions of the cluster contents. To obtain more detailed cluster labels we can use description-aware algorithms.

2.1.3.2 DESCRIPTION AWARE ALGORITHMS

Description-aware algorithms are aware of this labeling problem and try to ensure that the construction of cluster descriptions is feasible and yields results interpretable to a human.
One way to achieve this goal is to use a monothetic clustering algorithm (i.e., one in which objects are assigned to clusters based on a single feature) and to select the features carefully, so that they are immediately recognizable to the user as something meaningful. If features are meaningful and precise, they can be used to describe the output clusters accurately and sufficiently. The algorithm that first implemented this idea was Suffix Tree Clustering (STC), described in a few seminal papers by Zamir and Etzioni in 1998 and 1999 and implemented in a system called Grouper. In practice, STC was a breakthrough in search results clustering.

Suffix Tree Clustering (STC) uses a data structure called a suffix tree. It uses phrases (ordered sequences of words) as its atomic features rather than keywords. Suffix tree clustering is performed in three steps: data cleaning, identifying base clusters, and combining base clusters. We define a base cluster to be a set of documents that share a common phrase.

A suffix tree: definition
1. A suffix tree of a string S is a compact trie containing all suffixes of S.
2. It is a rooted tree.
3. Each internal node has at least two children.
4. Each edge is labeled with a non-empty substring of S. The label of a node is the concatenation of the edge labels on the path from the root to that node.
5. No two edges out of the same node can have edge labels that begin with the same word.

For example, the suffixes of the sentence “mouse ate cheese too” are:

Suffix no.   Suffix
1.           mouse ate cheese too
2.           ate cheese too
3.           cheese too
4.           too

A Generalized Suffix Tree (GST) is a suffix tree that contains all the suffixes of two or more sentences.
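As a small illustration of the suffix listing above, the word-level suffixes of a sentence can be enumerated in a few lines of Python. This is a sketch under stated assumptions, not an implementation of STC: the helper names word_suffixes and generalized_suffixes are hypothetical, tokens come from a plain white-space split, and the code only lists suffixes; it does not build a suffix tree.

```python
def word_suffixes(sentence: str) -> list[str]:
    """Return every word-level suffix of a sentence, longest first."""
    words = sentence.split()  # assumption: whitespace tokenization
    return [" ".join(words[i:]) for i in range(len(words))]


def generalized_suffixes(sentences: list[str]) -> dict[tuple[int, int], str]:
    """Map (sentence number, suffix number) to the suffix text.

    These pairs are the entries a generalized suffix tree would index.
    """
    table = {}
    for s_no, sentence in enumerate(sentences, start=1):
        for x_no, suffix in enumerate(word_suffixes(sentence), start=1):
            table[(s_no, x_no)] = suffix
    return table


if __name__ == "__main__":
    for number, suffix in enumerate(word_suffixes("mouse ate cheese too"), start=1):
        print(f"{number}. {suffix}")
```

Listing suffixes this way takes quadratic space in the sentence length; a real suffix tree avoids this by sharing common prefixes among suffixes in a compact trie.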
Step 1 - Data Cleaning

In this step, the string of text representing each document is transformed using a light stemming algorithm (deleting word prefixes and suffixes and reducing plural to singular). Sentence boundaries (identified via punctuation and HTML tags) are marked, and non-word tokens (such as numbers, HTML tags and most punctuation) are stripped.

Step 2 - Identifying Base Clusters

The following picture is an example of a Generalized Suffix Tree for the set of strings 1) "cat ate cheese", 2) "mouse ate cheese too" and 3) "cat ate mouse too". The nodes of the suffix tree are drawn as circles. Each suffix-node has one or more boxes attached to it designating the string(s) it originated from. The first number in each box designates the string of origin (1-3 in our example, in the order the strings appear above); the second number designates which suffix of that string labels that suffix-node.

Each node of the suffix tree represents a group of documents and a phrase that is common to all of them. The label of the node represents the common phrase; the set of documents tagging the suffix-nodes that are descendants of the node makes up the document group. Therefore, each node represents a base cluster. The following table lists the six marked nodes (a-f) from the example shown above and their corresponding base clusters:

Node   Phrase       Documents
a      cat ate      1, 3
b      ate          1, 2, 3
c      cheese       1, 2
d      mouse        2, 3
e      too          2, 3
f      ate cheese   1, 2

Each base cluster is assigned a score that is a function of the number of documents it contains and the words that make up its phrase. The score s(B) of base cluster B with phrase P is given by:

s(B) = |B| · f(|P|)

where |B| is the number of documents in base cluster B, and |P| is the number of words in P that have a non-zero score (i.e., the effective length of the phrase). In the formulation of Zamir and Etzioni, the function f penalizes single-word phrases, is linear for phrases that are two to six words long, and becomes constant for longer phrases.

Step 3 - Combining Base Clusters

This step of the algorithm merges base clusters that have a high overlap in their document sets. For this we use a base cluster graph. The nodes in this graph are base clusters.
The base clusters are combined according to a similarity measure. The following figure is a base cluster graph of the previous example.

We define a binary similarity measure. Consider two base clusters Bm and Bn with sizes |Bm| and |Bn| respectively, and let |Bm ∩ Bn| be the number of documents common to both base clusters. We define the similarity between Bm and Bn to be 1 iff:

|Bm ∩ Bn| / |Bm| > 0.5 and |Bm ∩ Bn| / |Bn| > 0.5

Otherwise the similarity is 0. If the similarity between two base clusters is 1, an edge is drawn connecting those base clusters. A cluster is defined as a connected component of the base cluster graph. Each cluster contains the union of the documents of all its base clusters. In the above base cluster example there is one connected component, and therefore one cluster.

The advantages of STC over data-centric algorithms are:

The suffix tree can be constructed in linear time.
It is incremental in nature.
The method focuses attention on the descriptiveness of cluster labels, so the labels are more effective.
STC supports overlapping clusters.

The following picture gives an overview of the clusters created by the Suffix Tree Clustering method. The query used here is ‘salsa’. Only the first 5 clusters are shown. The words in bold are the shared phrases found in the clusters. Note the descriptive power of phrases such as "Puerto Rico", "Latin Music" and "York Salsa Dancers".

2.1.4 VISUALIZATION OF CLUSTERED RESULTS

Powerful visualizations are now available for web clustering engines. One prominent approach is based on hierarchical folders. Web clustering engines such as Clusty, CREDO and Lingo3G use the hierarchical folder approach. A well-known clustering engine called Grokker uses a nesting and zooming approach. Some search engines have also used graph-based interfaces.
KartOO is such a system.

Some clustering engines and their visualizations are described below.

Clusty

Clusty is a clustering engine developed by the company Vivisimo. Vivisimo won the “best meta-search engine award” assigned by SearchEngineWatch.com from 2001 to 2003. Vivisimo means lively, bright, or clever in Spanish. Vivisimo's founders picked the name to express their vision of optimizing and giving life to our information. Clusty is a meta search engine, meaning that it combines results from a variety of different sources. It uses an algorithm to cluster content based on textual similarity. For every search, Clusty pulls together data from other engines such as Ask, MSN and Wisenut. It then organizes the search results in a way that helps us navigate away from ambiguity towards a specific cluster of results.

Clusty uses a hierarchical folder approach. It is a very simple method and familiar to everyone. Figure 1 in the appendix is a screenshot of Clusty (taken on March 5, 2010). The hierarchical folders are confined to the left side of the screen, so that the user can quickly choose any cluster he may need.

CREDO

CREDO (Conceptual REorganization of DOcuments) has been developed at Fondazione Ugo Bordoni by Claudio Carpineto and Gianni Romano. CREDO groups the results of a web search (currently Yahoo API search results) in a lattice of conceptual clusters that highlight the contents of the retrieved documents. CREDO is based on a mathematical data representation termed a concept lattice. Compared to other systems for clustering web results, the clusters produced by CREDO are more justifiable, are easier to navigate because they are organized in a lattice rather than a strict hierarchy, and allow discovery of causal associations between the words contained in the results.
CREDO is an interesting example of a system that attempts to build the taxonomy of topics and their descriptions simultaneously. Even though CREDO does not follow a strict hierarchical organization, it can still use a tree-based visualization. Refer to Figure 4 in the appendix (taken on March 6, 2010) for the visualization of CREDO. Versions of CREDO for PDAs (Credino) and for cellular phones (SmartCREDO) have been developed in collaboration with Stefano Mizzaro and Andrea Della Pietra (University of Udine).

Grokker

Grokker was developed by a company called Groxis, a technology company based in San Francisco, California. The name Grokker is inspired by the 1961 Robert A. Heinlein science fiction classic Stranger in a Strange Land, in which Grok is a Martian word meaning literally ‘to drink’ and metaphorically ‘to be one with.’ To grok something is to understand it so well that it is fully absorbed into oneself. It is to look at every problem, opportunity, action, and point of view from any and all perspectives.

Grokker sits on top of multiple sources. After Grokker retrieves the information, it "federates" it, meaning that it meshes it all together. Finally, it clusters the returned results into categories. End users most frequently look at fewer than three screens out of the thousands of returned search results. Using Grokker, users immediately see the cluster(s) of greatest relevance and drill down only within the cluster(s) that matter to them.

Grokker uses a nesting and zooming approach. A screenshot of Grokker is shown in Figure 5 in the appendix. This Map View is a visual representation of the returned hits. When the user clicks on one of the circles, its subcategories are shown. By clicking on Search Options the user can change the number of hits returned. The user can also choose which sites to search: Yahoo, Wikipedia and/or Amazon.
Simultaneous searching of different sites is also permitted. Finally, we can limit our results by using the tools on the left side of the screen.

Some universities used Grokker as their search tool. Stanford University was one of the first customers of Grokker. The platform provided faculty and students with a single point of access to multiple resources, including library catalogs, proprietary subscription databases, and the Web. It helped Stanford users to be more efficient in their research and in navigating among the numerous available resources. The desktop version of Stanford Grokker is no longer supported and is not available for download. In March 2009, Groxis ceased operations.

KartOO

KartOO was a meta search engine with a visual interface. It operated from 2001 to early 2010. KartOO had an advanced Adobe Flash GUI, as opposed to a text-based list of results, and used a graph-based approach. Its color scheme was to a degree reminiscent of Apple Computer's Aqua interface. Search results were presented as a "map", with blob-like masses of varying color connecting each item. The shape of each blob depended on the relevance to the query of the keyword corresponding to that blob. If one began a search with a general topic, KartOO sometimes helped to narrow it down. Every "blob" clicked added another word to the search query. The map would often succeed in presenting keywords or subtopics that defined the topic being searched. Refer to Figure 6 in the appendix for the visualization of KartOO.

KartOO was co-founded in France by two cousins, Laurent and Nicholas Baleydier, and launched in 2001. In 2004, KartOO launched a new version called UJIKO. In January 2010 KartOO closed down, removing all content from the KartOO and UJIKO websites but leaving a small message in French thanking its users for their support.

3. EFFICIENCY AND FUTURE WORKS

3.1 SEARCH RESULTS CLUSTERING EFFICIENCY FACTORS

The most critical tasks involve the first three components presented, namely search results acquisition, preprocessing, and clustering. The visualization component is not likely to affect the overall system efficiency in a significant manner.

Search Results Acquisition

The number of search results required for clustering cannot be fetched in one remote request. The Yahoo! API allows up to 50 search results to be retrieved in one request, while the Google SOAP API returns a mere 10 results per remote call. Acquisition time depends on network congestion, on the capability of the local equipment used, and on the specific server processing the request on the search engine side.

Preprocessing

The performance of tokenization is a critical concern in the preprocessing of search results. Tokenizers have different performance characteristics depending on whether they were hand-written or automatically generated. Tokenization becomes much more complex for languages where white spaces are not present (such as Chinese) or where the text may switch direction (such as an Arabic text within which English phrases are quoted).

Clustering

Depending on the specific algorithm used, the clustering phase can contribute significantly to the overall processing time. Search results clustering systems must be optimized to handle small instances and process them as fast as possible.

3.2 IMPROVE EFFICIENCY OF CLUSTERING

A number of techniques can be used to improve the computational performance of a search results clustering engine.

Client side processing

The majority of currently available search clustering engines perform all processing on the server side.
One possible problem with this approach is that during high query rate periods the response times can increase significantly and thus degrade the user experience. To avoid this, some of the processing can be performed using client-side resources. In this way, scalability issues and the resulting problems could be avoided.

Incremental processing

One desirable feature of search results clustering would be incremental processing: as each document arrives from the web, we “clean” it and add it to the available model.

Pretokenized documents

The input to a Web clustering engine is the set of search results returned by conventional search engines. These search engines already apply some preprocessing to their results before returning them. If clustering engines could reuse the resulting tokens, it would be an added advantage.

3.3 PERFORMANCE EVALUATION

Clustering engines are designed to overcome the limitations of plain search engines. We therefore need to evaluate whether the use of clustered results yields a gain in retrieval performance over flat ranked lists. Some methods are explained below.

The first method relies on the conventional notions of recall and precision. To apply these measures, the retrieved results must form a linear list rather than a set of clusters. One obvious way to perform such a clustering linearization would be to preserve the order in which clusters are presented and just expand their content, but this would amount to ignoring the role played by the user in the choice of the clusters to be expanded. One of the earliest and simplest linearization techniques is to assume that the user can choose the cluster with the highest density of relevant documents and to consider only the documents contained in it, ranked in order of relevance.
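As an illustration only, this simplest linearization can be sketched in Python. The cluster labels, document identifiers, relevance judgments, and function names below are hypothetical and are not taken from any of the systems discussed in this report. The sketch assumes each cluster already lists its documents in order of relevance.

```python
from typing import Dict, List, Set, Tuple


def best_cluster_linearization(
    clusters: Dict[str, List[str]], relevant: Set[str]
) -> Tuple[str, List[str]]:
    """Return the label and documents of the cluster with the highest
    density of relevant documents (relevant documents / cluster size)."""

    def density(docs: List[str]) -> float:
        if not docs:
            return 0.0
        return sum(1 for d in docs if d in relevant) / len(docs)

    # On ties, max() keeps the first cluster in presentation order.
    label = max(clusters, key=lambda name: density(clusters[name]))
    return label, clusters[label]


def precision_recall(ranked: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Conventional precision and recall of a linear result list."""
    hits = sum(1 for d in ranked if d in relevant)
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical clustered results for an ambiguous query.
    clusters = {
        "animal": ["d1", "d2", "d3"],
        "car": ["d4", "d5"],
        "operating system": ["d6"],
    }
    relevant = {"d1", "d2", "d4"}  # hypothetical relevance judgments

    label, docs = best_cluster_linearization(clusters, relevant)
    precision, recall = precision_recall(docs, relevant)
    print(label, round(precision, 2), round(recall, 2))
```

The linearized list is simply the content of the chosen cluster, so precision and recall can then be computed exactly as for a flat ranked list. Note that this measures a best case, since it assumes the user always picks the best cluster.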
A more analytic approach is based on the reach time: a model of the time taken to locate a relevant document in the hierarchy.

Another method is to analyze user logs: search engine logs are compared with clustering engine logs by computing several metrics such as the number of documents followed, the time spent, and the click distance. The interpretation of user logs is, however, difficult.

To date, the evaluation issue has probably not received sufficient attention, and it remains an open issue. Nevertheless, some experimental findings suggest that Web clustering engines may be more effective than plain search engines. Owing to the lack of an effective method for evaluating their performance, clustering engines have still not attracted wide attention.

3.4 RESEARCH DIRECTIONS AND FUTURE WORKS

The most important research issue is how to improve the quality and usability of output hierarchies. To improve clustering efficiency, more powerful features should be extracted. Developers should adopt methods for generating more expressive and effective descriptions of clusters. Finding optimal cluster representatives is another approach for increasing the efficiency of the clustering phase: if a better cluster representative can be found, fewer iterations are needed to reach a stable clustering, which means a shorter response time. Combinations of existing clustering algorithms can also be used to obtain better clusters.

One advanced concept is personalized clustering. Since the clustering process does not depend only on the search results but is also influenced by the user's characteristics, we speak of personalization. Personalization means that, instead of optimizing the construction of the hierarchy structure, one can try to reorganize a given structure based on user actions.
The proposed techniques exploit user feedback to filter out parts of the hierarchy that are presumably of no interest to the user.

One of the recent topics in the field of search results clustering is the growing market of mobile search. Two mobile versions of CREDO, suitable for personal digital assistants and cellular phones, have been developed. The systems, termed Credino (small CREDO, in Italian) and SmartCREDO, are exclusively based on the search results and are freely available online. Screenshots of Credino are available in the appendix as Figure 7, Figure 8, and Figure 9 (taken on March 6, 2010).

The Semantic Web is a recent research topic. In the Semantic Web the meaning (semantics) of information on the web is defined, making it possible for machines to process it. Google has provided a good example of Semantic Web technology with its "rich snippets". Swoogle is a Semantic Web search engine. In future, clustering can also be applied to Semantic Web search engines.

4. CONCLUSION

Web clustering engines organize search results by topic, thus offering a complementary view to the flat-ranked list returned by conventional search engines. Web clustering engines have reached a level of maturity at which a body of research has been developed and commercial systems are being deployed. A number of advances must still be made in cluster labels, coherence of the cluster structure, performance evaluation studies, and advanced visualization techniques. Only then can Web clustering engines entirely fulfill the promise of being the PageRank of the future.

5. REFERENCES

Journal/Paper:
Claudio Carpineto, Stanisław Osiński, Giovanni Romano and Dawid Weiss, "A Survey of Web Clustering Engines", ACM Computing Surveys, Vol. 41, No. 3, Article 17, July 2009.
Oren Zamir and Oren Etzioni, "Web Document Clustering: A Feasibility Demonstration", In Proc. 21st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 46-54, 1998.

Books:
C. J. van Rijsbergen, Information Retrieval, Butterworth, 1979.
Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison Wesley Longman Publishing Co. Inc., 1999.

Websites:
http://clusty.com/ (March 5, 2010)
http://credo.fub.it (March 8, 2010)
http://www2.parc.com/istl/projects/ia/sg-example1.html (March 4, 2010)
http://credino.dimi.uniud.it/ (March 10, 2010)
http://smartcredo.dimi.uniud.it (March 10, 2010)

6. APPENDIX

Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9