Web Clustering Engines
SEMINAR REPORT
2009-2011
In partial fulfillment of Requirements in
Degree of Master of Technology
In
COMPUTER & INFORMATION SCIENCE
SUBMITTED BY
DEEPTHI THERESA K.K.
DEPARTMENT OF COMPUTER SCIENCE
COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
KOCHI – 682 022
COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
KOCHI – 682 022
DEPARTMENT OF COMPUTER SCIENCE
CERTIFICATE
This is to certify that the seminar report entitled “Web Clustering Engines”, submitted
by Deepthi Theresa K.K. in partial fulfillment of the requirements for the award of
M.Tech in Computer & Information Science, is a bona fide record of the seminar presented by
her during the academic year 2010.
Mr. G.Santhosh Kumar
Lecturer
Dept. of Computer Science
Prof. Dr.K.Poulose Jacob
Director
Dept. of Computer Science
ACKNOWLEDGEMENT
First of all, let me thank our Director, Prof. Dr. K. Poulose Jacob, Dept. of Computer
Science, CUSAT, who provided the necessary facilities and advice. I am also
thankful to Mr. G. Santhosh Kumar, Lecturer, Dept. of Computer Science, CUSAT, for his
valuable suggestions and support in the completion of this seminar. With great pleasure I
remember Dr. Sumam Mary Idicula, Reader, Dept. of Computer Science, CUSAT, for her
sincere guidance. I am also thankful to all the teaching and non-teaching staff in the
department and to my friends for extending their warm kindness and help.
I would like to thank my parents, without whose blessings and support I would not have
been able to accomplish my goal. I also extend my thanks to all my well-wishers. Finally, I
thank the Almighty for His guidance and blessings.
ABSTRACT
Web clustering engines are an emerging trend in the field of information retrieval.
They organize search results by topic, thus offering a complementary view to the flat ranked
list returned by conventional search engines. The search results returned by traditional
search engines on different subtopics or meanings of a query are mixed together in the list,
so the user may have to sift through a large number of irrelevant items to locate those of
interest. Web clustering engines categorize the search results into hierarchical
groups (clusters) and display the cluster labels. Hence the user can locate the desired
document very quickly.
In this seminar we discuss in detail the different phases in the implementation of web
clustering engines, and we present some web clustering algorithms together with their
advantages and issues. We also introduce some web clustering engines currently in use. Some
future research directions are also presented.
Additional Key Words and Phrases: Web Clustering Engines, Information retrieval, meta
search engines, search results clustering, Search results acquisition, Preprocessing, Cluster
construction and labeling, Vector Space model, data centric clustering algorithms, description
aware algorithms
Contents
1. Introduction
1.1 Motivation
1.2 Goal of web clustering engines
1.3 Issues in the implementation of clusters
2. Architecture and techniques of web clustering engines
2.1 Architecture of web clustering engines
2.1.1 Search results acquisition
2.1.2 Preprocessing of search results
2.1.3 Cluster construction and labeling
2.1.3.1 Data centric clustering algorithms
2.1.3.2 Description aware algorithms
2.1.4 Visualization of clustered results
3. Efficiency and future works
3.1 Search results clustering efficiency factors
3.2 Improve efficiency of clustering
3.3 Performance evaluation
3.4 Research directions and future works
4. Conclusion
5. References
6. Appendix
Seminar Report 2010
Web Clustering Engine
1. INTRODUCTION
1.1 MOTIVATION
Search engines are an invaluable tool for retrieving information from the Web. In
response to a user query, they return a list of results ranked in order of relevance to the
query. The user starts at the top of the list and follows it down examining one result at a
time, until the sought information has been found.
Nowadays efficient search engines such as Google and Yahoo! are available. Although
they are definitely good for navigational and transactional searching,
they are much less effective for ambiguous queries.
An ambiguous query is one that has multiple meanings in different contexts. The
search results returned by conventional search engines on different subtopics or meanings
of a query are mixed together in the list, so the user may have to sift through a
large number of irrelevant items to locate those of interest. In this context, clustering of
search results comes into the picture.
Clustering is the act of grouping similar objects into sets. The distance between
objects in the same cluster (intra-cluster variation) should be minimal and the distance
between objects in different clusters (inter-cluster variation) should be maximal. In the
web search context, clustering means organizing web pages (search results) into groups so
that different groups correspond to different user needs.
In 1979 Van Rijsbergen introduced the Cluster Hypothesis in the field of
information retrieval. It states that “Closely related documents tend to be relevant to the
same requests.”
Web Clustering Engines are systems that perform clustering of web search
results. These systems group the results returned by a search engine into a hierarchy of
labeled clusters (also called categories).
Dept. Of Computer Science
CUSAT
To illustrate, Figure 1 in the appendix shows the clustered results returned for the
query “tiger”. This result was produced by Vivisimo, one of the most popular web clustering
engines (as of March 5, 2010). Like many queries on the Web, “tiger” has multiple
meanings: the feline, the Mac OS X computer operating system, the golf champion,
and so on. These different meanings are well represented in Figure 1. By contrast, if we
submit the query “tiger” to Google or Yahoo! (Figure 2), we can see that each meaning’s
items are scattered in the ranked list of search results, often through a large number of
result pages.
The first commercial clustering engine was Northern Light, at the end of the
1990s. It was based on a predefined set of categories, to which the search results were
assigned. A major breakthrough was then made by Vivisimo, whose clusters and cluster
labels were dynamically generated from the search results. Some other available
clustering engines are Clusty, Grokker, KartOO, Lingo3G, and CREDO.
1.2 GOAL OF WEB CLUSTERING ENGINES
Web Clustering Engines organize search results by topic, thus offering a
complementary view to the flat ranked list returned by the conventional search engines.
The main advantages of the cluster hierarchy are:
• It provides shortcuts to the items that relate to the same meaning. Since Web
Clustering Engines group the search results having the same meaning within the
same cluster, it is very easy for the user to find similar documents. Hence the
search time is reduced.
• It allows better topic understanding. Since Web Clustering Engines give a
high-level view of the query, they are useful for informational searches in
unknown or dynamic domains.
• It favors systematic exploration of search results. Since a clustering engine
summarizes the content of many search results in one single view on the first
result page, the user may review hundreds of potentially relevant results
without the need to download and scroll to subsequent pages.
A clustering engine tries to address the limitations of current search engines by
providing clustered results as an added feature to their standard user interface.
1.3 ISSUES IN THE IMPLEMENTATION OF CLUSTERS
Unlike conventional document clustering, Web search results clustering deals with billions
of constantly changing pages. The data are mainly unstructured and heterogeneous, and there
is additional information to consider (e.g., links, click-through data).
This dynamic nature of the data, together with the interactive use of clustered
results, poses new requirements and challenges to clustering technology:
• Short input data description. For computational reasons, the data available
to the clustering algorithm for each search result are usually limited to a URL,
an optional title, and a short excerpt of the document’s text (the snippet).
• Meaningful labels. Each cluster label should indicate the contents of the
items within that cluster.
• Selection of similarity measure. Many methods are known for measuring the
similarity or dissimilarity between two items, such as Euclidean
distance and Manhattan distance.
• Grouping of objects into clusters. Many approaches are available for
grouping the objects, such as agglomerative clustering, suffix tree clustering, and
k-means clustering.
• Computational efficiency. Search results clustering is performed online, within
an application that requires overall subsecond response times. The critical step
is the acquisition of search results, whereas the efficiency of the cluster
construction algorithm is less important due to the low number of input
results.
• Overlapping clusters. Since the same result may apply to different themes,
we may allow overlapping clusters. Handling overlapping clusters in a
dynamic environment is an open issue.
• Unknown number of clusters. In search results clustering, both the number and
the size of clusters cannot be predetermined because they vary with the query.
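The two distance measures named in the list above can be sketched in a few lines. This is only an illustration on made-up numeric vectors; the function names and sample values are assumptions, not part of any particular clustering engine.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Two hypothetical document vectors (for example, word counts).
doc_a = [4, 0, 3]
doc_b = [0, 0, 0]
d_euclid = euclidean(doc_a, doc_b)
d_manhattan = manhattan(doc_a, doc_b)
```

A smaller value means the two items are more similar; which measure suits a given engine depends on how its features are weighted.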
2. ARCHITECTURE AND TECHNIQUES OF WEB
CLUSTERING ENGINES
2.1 ARCHITECTURE OF WEB CLUSTERING ENGINES
Practical implementations of Web search clustering engines will usually consist of
four general components: search results acquisition, input preprocessing, cluster
construction, and visualization of clustered results, all arranged in a processing pipeline.
2.1.1 SEARCH RESULTS ACQUISITION
The task of the search results acquisition component is to provide input for the
rest of the system. Based on the query, the acquisition component must deliver 50 to 500
results, each of which should contain a title, a contextual snippet, and the URL pointing
to the full text being referred to.
The source of search results can be any public search engine, such as Google or
Yahoo!. Clustering is applied to the small set of documents returned by the
conventional search engine in response to the query. The most elegant way of fetching
results from such search engines is to use the application programming interfaces (APIs)
that these engines provide.
2.1.2 PREPROCESSING OF SEARCH RESULTS
Input preprocessing is a step that is common to all search results clustering
systems. Its primary aim is to convert the contents of search results (output by the
acquisition component) into a sequence of features used by the actual clustering
algorithm.
The steps of feature extraction are language identification, tokenization, stemming,
and selection of features.
Clustering engines that support multilingual content must perform initial
language recognition on each search result in the input.
During the tokenization step, the text of each search result is split into a
sequence of basic independent units called tokens, which usually represent single
words, numbers, symbols and so on. Tokenization becomes much more complex for
languages where white spaces are not present (such as Chinese) or where the text may
switch direction (such as an Arabic text within which English phrases are quoted).
The aim of stemming is to remove the inflectional prefixes and suffixes of each
word and thus reduce different grammatical forms of the word to a common base form
called a stem. For example, the words connected, connecting and interconnection would
all be transformed to the stem connect.
Last but not least, the preprocessing step needs to extract features for each search
result present in the input. Features are atomic entities by which we can describe an
object and represent its most important characteristics to an algorithm. When looking at
text, the most intuitive set of features would be simply words of a given language. But
this is not the only possibility. The features can vary from single words and fixed-length
tuples of words (n-grams) to frequent phrases (variable-length sequences of words), and
very algorithm-specific data structures, such as approximate sentences.
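As a rough illustration of tokenization and of word n-grams as features, the following minimal sketch splits a snippet into lowercase tokens and builds fixed-length tuples from them. It assumes whitespace-delimited text in the basic Latin alphabet, so it does not address the Chinese or Arabic cases mentioned above; the function names and the sample snippet are made up for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Return every tuple of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

snippet = "Tiger Woods wins; the tiger (Panthera tigris) roars."
tokens = tokenize(snippet)
bigrams = ngrams(tokens, 2)
```

Single words correspond to n = 1; larger values of n yield the fixed-length tuples described in the text.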
One method for representing a text is the Vector Space Model (VSM). A document d
is represented in the VSM as a vector $[w_{t_0}, w_{t_1}, \ldots, w_{t_n}]$, where
$t_0, t_1, \ldots, t_n$ is a global set of words (features) and $w_{t_i}$ expresses the
weight (importance) of feature $t_i$ to document d.
Weights in a document vector typically reflect the distribution of occurrences of features
in that document. For example, in a term vector for the phrase “Polly had a dog and the dog
had Polly”, with weights taken simply as word counts, the weights are: a = 1, and = 1,
dog = 2, had = 2, Polly = 2, the = 1. Articles are rarely specific to any document and
would normally be omitted.
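The term vector for the example phrase can be reproduced with a short sketch. It uses raw word counts as weights, as in the example; the function name and the choice of an alphabetically sorted vocabulary are assumptions made for illustration.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Return the count of each vocabulary word in the text, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

phrase = "Polly had a dog and the dog had Polly"
# The global feature set; here it is just the distinct words of the phrase.
vocabulary = sorted(set(phrase.lower().split()))
vector = term_vector(phrase, vocabulary)
```

In a real engine the vocabulary would span all search results, and the counts would usually be replaced by a weighting scheme that discounts common words.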
2.1.3 CLUSTER CONSTRUCTION AND LABELLING
The set of search results, along with the features extracted in the preprocessing
step, is given as input to the clustering algorithm, which is responsible for building the
clusters and labeling them. A number of algorithms are available for clustering. We
can classify them into two categories: data-centric and description-aware.
In search results clustering, users are the ultimate consumers of the clusters. Hence the
created clusters should be aptly labeled. The labels should be unique, unambiguous,
comprehensive and faithful to the content. A poorly labeled cluster is useless
even though it contains closely related, relevant documents.
2.1.3.1 DATA CENTRIC CLUSTERING ALGORITHMS
Representatives of this group are conventional data clustering
algorithms such as Agglomerative Hierarchical Clustering (AHC) and K-means.
Scatter/Gather, developed in 1992 at Xerox PARC, is a landmark example of a data-centric
system and is commonly perceived as a predecessor and conceptual
parent of all clustering systems that appeared later. This system uses the VSM for text
representation, and the clustering technique used is agglomerative hierarchical clustering
(AHC) with an average-link merge criterion. It first clusters a collection of
documents into a set of k clusters (scatter). At query time the user selects clusters of
interest (gather) and the system re-clusters those documents. This process repeats until a
small cluster with relevant documents is found. The following figure depicts the operation
of a Scatter/Gather system.
Agglomerative Hierarchical Clustering (AHC) is a typical example of a data-centric
clustering algorithm. It is a bottom-up approach. Initially each document is in its own
cluster. A distance matrix (dissimilarity matrix) is built for every pair of clusters. The
two closest clusters are merged, and the distance matrix is rebuilt with the merged pair
replaced by a single cluster. This process continues until the desired number k of clusters
is reached. The complexity of this algorithm is at least $O(n^2)$, since a distance matrix
over all pairs must be built, where n is the number of documents (that is, the initial
number of clusters).
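The merge loop described above can be sketched as follows. This is a minimal illustration, not an optimized implementation: it uses the average-link criterion mentioned for Scatter/Gather, assumes Euclidean distance, and recomputes pairwise cluster distances in every round instead of maintaining a stored matrix. All names and sample points are made up.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def average_link(c1, c2, vectors):
    """Average pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def ahc(vectors, k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]  # each document starts alone
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_link(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters

docs = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]]
result = ahc(docs, 3)
```

Each cluster in the result is a list of document indices; stopping at k = 1 instead would yield the full merge hierarchy if the intermediate states were recorded.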
Another data-centric algorithm is K-means clustering. K is a predefined
number of clusters, and each cluster is represented by the mean (average) of its
members, called the centroid; hence the name. First choose the number of clusters k.
Randomly generate k initial clusters and compute a representative (centroid) for each.
Calculate the distance between each centroid and each document. Assign each document to the
nearest cluster centroid. Recompute the cluster centroids. Repeat these steps until some
convergence criterion is met.
The complexity is O(knT), where k is the number of clusters, n is the number of
documents and T is the number of iterations needed to reach a stable
state (one in which no document changes its cluster membership).
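A minimal sketch of these steps is given below. It makes several assumptions that the text leaves open: the initial centroids are k randomly chosen documents, distance is squared Euclidean, and convergence means that no document changes cluster. The function names, the `seed` parameter and the sample points are illustrative only.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct documents as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iterations):
        # Assign each document to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # stable membership: converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return assignment, centroids

docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = k_means(docs, 2)
```

The loop body touches every document and every centroid once per iteration, which is where the O(knT) cost comes from.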
Data-centric algorithms borrow their strengths from well-known and proven
techniques targeted at clustering numeric data. Even though they use simple keyword-based
features, they are still powerful.
But this set of algorithms has some difficulties. None of these algorithms is
incremental in nature. ‘Incremental’ means that, as each document arrives from the web,
we “clean” it and add it to the available model. None of the above algorithms has this
incremental property.
Another difficulty with data-centric approaches concerns meaningful labels.
In these algorithms cluster labels are created by selecting frequent keywords from the set
of cluster documents. This keyword-based representation appears to be insufficient from
the user's perspective. Once a text is converted to a document vector we can hardly speak
of the text’s meaning, because the vector is basically a collection of unrelated terms.
When the extracted features are used in a keyword-based approach, the content of the cluster
is not very readable.
To justify this argument, refer to Figure 3 in the appendix.
The query used here is:
Retrieve the top 250 documents that contain the word star.
We ask Scatter/Gather to place the 250 documents into 5 groups. The figure
shows only the first set of scattered clusters. Shown here are the clusters' sizes (how many
documents they contain), a list of topical terms, and a list of document titles.
One can see from the topical terms of Cluster 1 that this cluster contains
documents that involve stars as symbols, as in military rank and patriotic songs. Cluster 2
has 68 documents that appear mainly to be about movie and TV stars. Cluster 3 contains
97 documents that have to do with aspects of astrophysics. Cluster 4 contains 67
documents also about astronomy and astrophysics. This cluster contains many articles
about people who are astronomers. Cluster 5 contains all the articles that discuss animals
or plants and that happen to contain the word star, for example, star fish.
But by looking at these clusters alone we can hardly reach these conclusions about the
cluster contents. To obtain more descriptive cluster labels we can use description-aware
algorithms.
2.1.3.2 DESCRIPTION AWARE ALGORITHMS
Description-aware algorithms are aware of this labeling problem and try to ensure
that the construction of cluster descriptions is feasible and yields results
interpretable to a human. One way to achieve this goal is to use a monothetic clustering
algorithm (i.e., one in which objects are assigned to clusters based on a single feature)
and carefully select the features so that they are immediately recognizable to the user as
something meaningful. If features are meaningful and precise then they can be used to
describe the output clusters accurately and sufficiently. The algorithm that first
implemented this idea was Suffix Tree Clustering (STC), described in a few seminal
papers by Zamir and Etzioni in 1998 and 1999, and implemented in a system called Grouper.
In practice, STC was a major breakthrough in search results clustering.
Suffix Tree Clustering (STC) uses a data structure called a suffix tree. It uses
phrases (ordered sequences of words) as its atomic features rather than keywords. Suffix
tree clustering is performed in three steps: data cleaning, identifying base
clusters, and combining base clusters. We define a base cluster to be a set of documents
that share a common phrase.
A suffix tree - Definition
1. A suffix tree of a string S is a compact trie containing all suffixes of S.
2. It is a rooted tree.
3. Each internal node has at least two children.
4. Each edge is labeled with a non-empty substring of S. The label of a node is the
concatenation of the edge labels on the path from the root to that node.
5. No two edges out of the same node can have edge labels that begin with the same word.
For example, the suffixes of the sentence “mouse ate cheese too” are:
Suffix no.    Suffix
1.            mouse ate cheese too
2.            ate cheese too
3.            cheese too
4.            too
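The word-level suffixes listed above can be generated with a one-line helper. This is only a sketch of the enumeration; an actual suffix tree stores these suffixes compactly rather than as separate strings, and the function name is an assumption.

```python
def word_suffixes(sentence):
    """Return every suffix of the sentence, treating words as the atomic units."""
    words = sentence.split()
    return [" ".join(words[i:]) for i in range(len(words))]

suffixes = word_suffixes("mouse ate cheese too")
```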
A General Suffix Tree (GST), more commonly called a generalized suffix tree, is a suffix
tree that contains all the suffixes of two or more sentences.
Step 1 - Data Cleaning
In this step, the string of text representing each document is transformed using a
light stemming algorithm (deleting word prefixes and suffixes and reducing plural to
singular). Sentence boundaries (identified via punctuation and HTML tags) are marked
and non-word tokens (such as numbers, HTML tags and most punctuation) are stripped.
Step 2 - Identifying base clusters
The following picture is an example of a General Suffix Tree for the set of strings
1) "cat ate cheese", 2) "mouse ate cheese too" and 3) "cat ate mouse too". The nodes of the
suffix tree are drawn as circles. Each suffix-node has one or more boxes attached to it
designating the string(s) it originated from. The first number in each box designates the
string of origin (1-3 in our example, by the order the strings appear above); the second
number designates which suffix of that string labels that suffix-node.
Each node of the suffix tree represents a group of documents and a phrase that is
common to all of them. The label of the node represents the common phrase; the set of
documents tagging the suffix-nodes that are descendants of the node make up the
document group. Therefore, each node represents a base cluster.
The following table lists the six marked nodes (a-f) from the example shown above
and their corresponding base clusters:
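The base clusters of the three example strings can also be derived programmatically. The sketch below is a brute-force stand-in for the suffix tree, not a linear-time construction: it enumerates every word sequence, keeps those shared by at least two documents, and (to mimic the compact tree, in which a phrase that is always followed by the same word is not a node of its own) keeps only phrases with at least two distinct continuations. The function and variable names are assumptions.

```python
from collections import defaultdict

def base_clusters(documents):
    """Map each shared phrase to the set of document numbers that contain it."""
    docs_of = defaultdict(set)    # phrase -> documents containing it
    followers = defaultdict(set)  # phrase -> distinct continuations observed
    for doc_id, text in enumerate(documents, start=1):
        words = text.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase = tuple(words[start:end])
                docs_of[phrase].add(doc_id)
                # The end of a document counts as a continuation unique to it.
                nxt = words[end] if end < len(words) else ("<end>", doc_id)
                followers[phrase].add(nxt)
    return {
        " ".join(phrase): docs
        for phrase, docs in docs_of.items()
        if len(docs) >= 2 and len(followers[phrase]) >= 2
    }

strings = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(strings)
```

For the example this yields six phrases, each paired with the documents that share it. The phrase "cat" is dropped because it is always followed by "ate".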
Each base cluster is assigned a score that is a function of the number of
documents it contains and the words that make up its phrase. The score s(B) of base
cluster B with phrase P is given by:
$s(B) = |B| \cdot f(|P|)$
where |B| is the number of documents in base cluster B, |P| is the number of words in
P that have a non-zero score (i.e., the effective length of the phrase), and f is a
function that penalizes single-word phrases, is linear for phrases that are two to six
words long, and is constant for longer phrases.
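The scoring step can be sketched as follows, assuming the score takes the product form s(B) = |B| · f(|P|) used by Zamir and Etzioni. The stoplist, the single-word penalty of 0.5 and the cap at six words are illustrative assumptions; the exact constants are not given in this report.

```python
# Hypothetical stoplist: these words are assumed to score zero.
STOPWORDS = {"a", "an", "the", "and", "of", "to"}

def effective_length(phrase):
    """Number of words in the phrase that are not on the stoplist."""
    return sum(1 for word in phrase.split() if word.lower() not in STOPWORDS)

def f(length):
    """Assumed phrase-length factor: penalize single words, then linear up to 6."""
    if length == 0:
        return 0.0
    if length == 1:
        return 0.5  # assumed penalty for single-word phrases
    return float(min(length, 6))  # constant beyond six words

def score(documents, phrase):
    """s(B) = |B| * f(|P|) for a base cluster with the given documents and phrase."""
    return len(documents) * f(effective_length(phrase))

s_cat_ate = score({1, 3}, "cat ate")
s_ate = score({1, 2, 3}, "ate")
```

Under these assumed constants the two-word phrase shared by two documents outranks the single word shared by three, which is the intended effect of penalizing single-word phrases.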
Step 3 - Combining Base Clusters
This step of the algorithm merges base clusters that have a high overlap in their
document sets. For this we use a base cluster graph, whose nodes
are the base clusters. Base clusters are combined according to a similarity measure.
The following figure is the base cluster graph of the previous example.
We define a binary similarity measure. Given two base clusters $B_m$ and $B_n$, with
sizes $|B_m|$ and $|B_n|$ respectively, $|B_m \cap B_n|$ is the number of documents common
to both base clusters. We define the similarity between $B_m$ and $B_n$ to be 1 iff:
$|B_m \cap B_n| / |B_m| > 0.5$ and
$|B_m \cap B_n| / |B_n| > 0.5$
Otherwise the similarity is 0.
If similarity between base clusters is equal to 1 then draw an edge connecting
those base clusters. A cluster is defined as being a connected component in the base
cluster graph. Each cluster contains the union of the documents of all its base clusters. In
the above base cluster example there is one connected component, therefore one cluster.
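The merging step can be sketched as a connected-components search over the base cluster graph. The base clusters used below are the phrases shared by at least two of the three example strings, computed by hand from those strings; the function names are assumptions.

```python
def similar(b_m, b_n):
    """Binary similarity: both overlap ratios must exceed 0.5."""
    common = len(b_m & b_n)
    return common / len(b_m) > 0.5 and common / len(b_n) > 0.5

def merge_base_clusters(base):
    """Return one (phrases, documents) pair per connected component."""
    names = list(base)
    seen = set()
    merged = []
    for name in names:
        if name in seen:
            continue
        component, stack = [], [name]
        seen.add(name)
        while stack:  # depth-first walk over edges where similarity is 1
            current = stack.pop()
            component.append(current)
            for other in names:
                if other not in seen and similar(base[current], base[other]):
                    seen.add(other)
                    stack.append(other)
        documents = set().union(*(base[n] for n in component))
        merged.append((sorted(component), documents))
    return merged

base = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
merged = merge_base_clusters(base)
```

Here "ate" overlaps sufficiently with every other base cluster, so all six end up in a single component covering documents 1 to 3, in line with the single cluster noted above.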
The advantages of STC over data-centric algorithms are as follows. The suffix tree can be
constructed in time linear in the size of the input. STC is incremental in nature. It
focuses attention on cluster label descriptiveness, so that the cluster labels are more
effective. STC also supports overlapping clusters.
The following picture gives us an overview about the clusters created by Suffix
Tree Clustering method:
The query used here is ‘salsa’. Only the first 5 clusters are shown here. The
words in bold are the shared phrases found in the clusters. Note the descriptive power of
phrases such as "Puerto Rico", "Latin Music" and "York Salsa Dancers".
2.1.4. VISUALIZATION OF CLUSTERED RESULTS
Powerful visualizations are now available for Web Clustering Engines. One
prominent approach is based on hierarchical folders. Web Clustering Engines such as
Clusty, CREDO and Lingo3G use the hierarchical folder visualization approach.
Grokker, a well-known clustering engine, uses a nesting and zooming approach. Some
engines have also used graph-based interfaces; KartOO is such a system.
Some Clustering Engines and their visualizations are mentioned below:
Clusty
Clusty is a clustering engine developed by the company Vivisimo. Vivisimo won
the “best meta-search engine award” assigned by SearchEngineWatch.com from 2001 to
2003. Vivisimo means lively, bright, or clever in Spanish. Vivisimo's founders picked the
name to express their vision of optimizing and giving life to our information. Clusty is a
meta search engine, meaning it combines results from a variety of different sources. It
uses an algorithm to cluster content based on textual similarity. For every search,
Clusty pulls together the data from other engines such as Ask, MSN and Wisenut. It then
organizes the search results in a way that helps us navigate away from ambiguity towards
a specific cluster of results.
Clusty uses a hierarchical folder approach, a very simple method that is familiar
to everyone. Figure 1 in the appendix is a screenshot of Clusty (taken on March 5, 2010).
The hierarchical folders are confined to the left side of the screen so that the user can
quickly choose any cluster he or she needs.
CREDO
CREDO (Conceptual REorganization of DOcuments) has been developed at
Fondazione Ugo Bordoni by Claudio Carpineto and Gianni Romano. CREDO groups the
results of a web search (currently Yahoo APIs search results) in a lattice of conceptual
clusters that highlight the contents of the retrieved documents. CREDO is based on a
mathematical data representation termed a concept lattice. Compared to other systems for
clustering Web results, the clusters produced by CREDO are more justifiable, are easier
to navigate because they are organized in a lattice rather than a strict hierarchy, and allow
discovery of causal associations between the words contained in the results. CREDO is
an interesting example of a system that attempts to build the taxonomy of topics and their
descriptions simultaneously. Even though CREDO does not follow a strict hierarchical
organization, it can still use a tree-based visualization. Refer to Figure 4 (taken on
March 6, 2010) in the appendix for the visualization of CREDO.
Versions of CREDO for PDAs (Credino) and for cellular phones (SmartCREDO)
have been developed in collaboration with Stefano Mizzaro and Andrea Della Pietra
(University of Udine).
Grokker
Grokker was developed by Groxis, a technology company
based in San Francisco, California. The name Grokker is inspired by Robert A.
Heinlein's 1961 science fiction classic Stranger in a Strange Land, in which grok is a
Martian word meaning literally ‘to drink’ and metaphorically ‘to be one with.’ To grok
something is to understand it so well that it is fully absorbed into oneself. It is to look at
every problem, opportunity, action, and point of view from any and all perspectives.
Grokker sits on top of multiple sources. After Grokker retrieves the information, it
"federates" it, meaning it meshes it all together. Finally, it clusters the returned results
into categories. End users most frequently look at fewer than three screens of the thousands
of returned search results. Using Grokker, users immediately see the cluster(s) of greatest
relevance and drill down only within the cluster(s) that matter to them.
Grokker uses a nesting and zooming approach. A screenshot of Grokker
is shown in Figure 5 of the appendix. This Map View is a visual representation of the
returned hits. When the user clicks on one of the circles, its subcategories are shown. By
clicking on Search Options the user can change the number of hits returned. The
user can also choose which sites to search: Yahoo, Wikipedia and/or Amazon.
Simultaneous searching of different sites is also permitted. Finally, the
results can be limited by using the tools on the left side of the screen.
Some universities are using Grokker as their searching tool. Stanford University
was one of the first customers of Grokker. The new platform provides faculty and
students with a single point of access to multiple resources, including library catalogs,
proprietary subscription databases, and the Web. It helps Stanford users to be more
efficient in their research and navigation among the numerous available resources. The
desktop version of Stanford Grokker is no longer being supported, and is not available for
download. In March of 2009, Groxis ceased operations.
KartOO
KartOO was a meta search engine with a visual interface. It operated
from 2001 to early 2010. KartOO had an advanced Adobe Flash GUI, as opposed to a
text-based list of results, and used a graph-based approach. Its color scheme was to a degree
reminiscent of Apple Computer's Aqua interface. Search results were presented as a
"map", with blob-like masses of varying color connecting each item. The shape of the
blobs depended on the relevance to the query of the keyword corresponding to that
blob. If one began a search with a general topic, KartOO sometimes
helped to narrow it down. Every "blob" clicked added another word to the search query.
The map would often succeed in presenting keywords or subtopics that defined the topic
one was searching on. Refer to Figure 6 in the appendix for the visualization of KartOO.
It was co-founded in France by two cousins, Laurent and Nicholas Baleydier. This
project was then launched in 2001. In 2004, KartOO launched a new version called
UJIKO. In January 2010 KartOO closed down, removing all content from the KartOO
and UJIKO websites, but leaving a small message in French thanking its users for their
support.
3. EFFICIENCY AND FUTURE WORKS
3.1 SEARCH RESULTS CLUSTERING EFFICIENCY FACTORS
The most critical tasks involve the first three components presented, namely
search results acquisition, preprocessing, and clustering. The visualization component is
not likely to affect the overall system efficiency in a significant manner.
Search Results Acquisition
The number of search results required for clustering cannot be fetched in one
remote request. The Yahoo! API allows up to 50 search results to be retrieved in one
request, while the Google SOAP API returns a mere 10 results per remote call. The
time needed to fetch the results depends on network congestion, on the capability of the
local equipment used, and also on the specific server processing the request on the search
engine side.
Preprocessing
The performance of tokenization is a critical concern in the case of
preprocessing of search results. Tokenizers will have a different performance
characteristic depending on whether they were hand-written or automatically generated.
Tokenization becomes much more complex for languages where white spaces are not
present (such as Chinese) or where the text may switch direction (such as an Arabic text,
within which English phrases are quoted).
Clustering
Depending on the specific algorithm used, the clustering phase can significantly
contribute to the overall processing time. Search results clustering systems must be
optimized to handle smaller instances and process them as fast as possible.
3.2 IMPROVE EFFICIENCY OF CLUSTERING
There are a number of techniques that can be used to improve the computational
performance of a search results clustering engine.
Client side processing
The majority of currently available search clustering engines perform all
processing on the server side. One possible problem with this approach is
that during periods of high query rate the response times can increase significantly and
thus degrade the user experience. To avoid this, some of the processing can be done using
client-side resources. In this way, scalability issues and the resulting problems could be
avoided.
Incremental processing
One desirable feature of search results clustering would be incremental
processing: as each document arrives from the web, we "clean" it and add it to the
available model.
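A minimal sketch of this idea follows. The report does not say what the "model"
contains, so the class below assumes it is a table of document frequencies, and the
cleaning step (tag stripping and lower-casing) is likewise an assumption. IncrementalModel
is a hypothetical name.

```python
import re
from collections import Counter


class IncrementalModel:
    """Term statistics that are updated one snippet at a time."""

    def __init__(self):
        self.doc_count = 0
        self.doc_freq = Counter()  # term -> number of snippets containing it

    def add(self, snippet):
        # "Clean" the snippet: strip HTML tags, lower-case, split into words.
        text = re.sub(r"<[^>]+>", " ", snippet).lower()
        terms = set(re.findall(r"\w+", text))
        self.doc_count += 1
        self.doc_freq.update(terms)


model = IncrementalModel()
for snippet in ["<b>Jaguar</b> cars for sale", "Jaguar habitat and diet"]:
    model.add(snippet)  # called as each result arrives, not after all have arrived
print(model.doc_count, model.doc_freq["jaguar"])
```

Because each call to add touches only one snippet, the model is ready as soon as the last
result arrives instead of being built in a separate pass.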
Pretokenized documents
The input to the Web Clustering Engine is the set of search results returned by
conventional search engines. These search engines already apply some preprocessing
to their results before they are returned. If the clustering engine can reuse the
resulting tokens for its own work, it will be an added advantage.
3.3 PERFORMANCE EVALUATION
Clustering engines are designed to overcome the limitations of plain search
engines. So we need to evaluate whether the use of clustered results does yield a gain in
retrieval performance over flat ranked lists. Some methods are explained below:
The first method is based on the conventional notions of recall and precision.
To apply these measures, the retrieved results must form a linear list rather than a
clustered structure. One obvious way to perform such a linearization would be to preserve
the order in which clusters are presented and just expand their content, but this would
amount to ignoring the role played by the user in the choice of the clusters to be
expanded. One of the earliest and simplest linearization techniques is to assume that the
user can choose the cluster with the highest density of relevant documents and to consider
only the documents contained in it, ranked in order of relevance.
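The simplest linearization described above can be sketched as follows. The cluster
labels, document identifiers, and relevance judgments are made-up illustrative data, and
best_cluster and precision_recall are hypothetical helper names.

```python
def best_cluster(clusters, relevant):
    """Label of the cluster with the highest density of relevant documents."""
    def density(docs):
        return sum(d in relevant for d in docs) / len(docs) if docs else 0.0
    return max(clusters, key=lambda label: density(clusters[label]))


def precision_recall(ranked, relevant):
    """Precision and recall of a linear list against the relevant set."""
    hits = sum(d in relevant for d in ranked)
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data: documents inside each cluster are already in rank order.
clusters = {"animal": ["d1", "d2", "d3"], "car": ["d4", "d5"], "os": ["d6"]}
relevant = {"d1", "d2", "d4"}

chosen = best_cluster(clusters, relevant)
precision, recall = precision_recall(clusters[chosen], relevant)
print(chosen, round(precision, 2), round(recall, 2))
```

The same precision_recall function can be applied to the top of the flat ranked list, so
that the two presentations are compared on equal terms.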
A more analytic approach is based on the reach time: a model of the time
taken to locate a relevant document in the hierarchy.
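The report does not give a formula for the reach time. The sketch below assumes one
simple cost model: the user reads every label at each level on the way down, then scans
the documents of the chosen cluster until the relevant one is found. Both the cost model
and the example hierarchy are assumptions made for illustration.

```python
def reach_time(hierarchy, path, doc):
    """Cost of reaching doc by following path through a nested hierarchy.

    hierarchy maps each label either to a sub-hierarchy (dict) or to a
    ranked list of documents (leaf cluster).
    """
    cost = 0
    node = hierarchy
    for label in path:
        cost += len(node)         # read all sibling labels at this level
        node = node[label]
    cost += node.index(doc) + 1   # scan the leaf cluster down to the document
    return cost


hierarchy = {
    "jaguar car": {"dealers": ["d1", "d2"], "reviews": ["d3"]},
    "jaguar animal": ["d4", "d5"],
    "other": ["d6"],
}
print(reach_time(hierarchy, ["jaguar car", "dealers"], "d2"))
print(reach_time(hierarchy, ["jaguar animal"], "d4"))
```

Averaging this cost over all relevant documents gives a single figure that can be
compared with the rank positions of the same documents in the flat list.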
Another method is to analyze user logs: the logs of a search engine are compared
with those of a clustering engine by computing several metrics, such as the number of
documents followed, the time spent, and the click distance. The interpretation of user
logs is, however, difficult.
To date, the evaluation issue has probably not received sufficient attention, and it
remains an open problem. Nevertheless, some experimental findings suggest that
Web Clustering Engines may be more effective than plain search engines. Owing to the lack
of an effective method for evaluating their performance, clustering engines have not yet
attracted wide attention.
3.4 RESEARCH DIRECTIONS AND FUTURE WORKS
The most important research issue is how to improve the quality and
usability of the output hierarchies. To improve the clusters, more powerful features
should be extracted. Developers should also adopt methods for generating more expressive
and effective descriptions of clusters.
Finding optimal cluster representatives is another approach to increasing the
efficiency of the clustering phase. If better cluster representatives can be found, fewer
iterations are needed to reach a stable clustering, which means a shorter response time.
A combination of existing clustering algorithms can also be used to obtain better clusters.
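The effect of the initial representatives on the number of iterations can be seen in a
small experiment. The report does not name a particular algorithm, so the sketch assumes
a k-means-style procedure on one-dimensional data; kmeans_1d and the data points are
illustrative only.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Return (final_centroids, iterations_used) for 1-D k-means."""
    centroids = list(centroids)
    for iteration in range(1, max_iter + 1):
        # Assign every point to its nearest representative.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            groups[nearest].append(p)
        # Recompute each representative as the mean of its group.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:  # stable: no representative moved
            return updated, iteration
        centroids = updated
    return centroids, max_iter


points = [1, 2, 3, 10, 11, 12]
_, good_iters = kmeans_1d(points, [2, 11])  # representatives near the true groups
_, poor_iters = kmeans_1d(points, [1, 2])   # both representatives in one group
print(good_iters, poor_iters)
```

On this toy data the well-placed representatives are already stable in the first pass,
while the poorly placed ones need additional passes before the clustering settles.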
One advanced concept is personalized clustering. When the clustering
process does not depend only on the search results but is also influenced by the user's
characteristics, we speak of personalization. Instead of optimizing the construction of
the hierarchy, one can try to reorganize a given structure based on user actions. The
proposed techniques exploit user feedback to filter out parts of the hierarchy that are
presumably of no interest to the user.
One of the recent topics in the field of search result clustering is the growing
market of mobile search. Two mobile versions of CREDO have been developed, suitable for
personal digital assistants and cellular phones. The systems, termed Credino (small
CREDO, in Italian) and SmartCREDO, are based exclusively on the search results and are
freely available online. Screenshots of Credino, taken on March 6, 2010, are shown in
Figure 7, Figure 8, and Figure 9 of the appendix.
The Semantic Web is a recent research topic. In the Semantic Web the meaning
(semantics) of information on the web is defined, making it possible for machines to
process it. Google has provided a good example of Semantic Web technology with its
"rich snippets". Swoogle is a Semantic Web search engine. In future, clustering can also
be applied to Semantic Web search engines.
4. CONCLUSION
Web clustering engines organize search results by topic, thus offering a
complementary view to the flat ranked list returned by conventional search engines. Web
clustering engines have reached a level of maturity at which research results have been
put into practice and commercial systems are being deployed. A number of advances must
still be made: better cluster labels, a more coherent cluster structure, performance
evaluation studies, and advanced visualization techniques. Only then can Web Clustering
Engines entirely fulfill the promise of being the PageRank of the future.
5. REFERENCES
Journal/Paper:
Claudio Carpineto, Stanisław Osiński, Giovanni Romano and Dawid Weiss, "A Survey
of Web Clustering Engines", ACM Computing Surveys, Vol. 41, No. 3, Article 17, July
2009.
Oren Zamir and Oren Etzioni, "Web Document Clustering: A Feasibility
Demonstration", In Proc. 21st Annual Int. ACM SIGIR Conf. on Research and
Development in Information Retrieval, pp. 46-54, 1998.
Books:
C. J. van Rijsbergen, Information Retrieval, Butterworths, 1979.
Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval,
Addison Wesley Longman Publishing Co. Inc., 1999.
Websites:
http://clusty.com/ (March 5, 2010)
http://credo.fub.it (March 8, 2010)
http://www2.parc.com/istl/projects/ia/sg-example1.html (March 4, 2010)
http://credino.dimi.uniud.it/ (March 10, 2010)
http://smartcredo.dimi.uniud.it (March 10, 2010)
6. APPENDIX
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9