A novel approach for comparing web sites by using MicroGenres

Transcription

Engineering Applications of Artificial Intelligence 35 (2014) 187–198
Contents lists available at ScienceDirect
Engineering Applications of Artificial Intelligence
journal homepage: www.elsevier.com/locate/engappai
A novel approach for comparing web sites by using MicroGenres
Milos Kudelka b, Vaclav Snasel b, Zdenek Horak b, Aboul Ella Hassanien a,
Ajith Abraham b, Juan D. Velásquez c,n
a
Cairo University, Egypt
VSB Technical University of Ostrava, Czech Republic
c
University of Chile, Chile
b
art ic l e i nf o
a b s t r a c t
Article history:
Received 21 July 2013
Received in revised form
28 May 2014
Accepted 16 June 2014
Available online 20 July 2014
In this paper, a novel approach is introduced to compare web sites by analysing their web page content.
Each web page can be expressed as a set of entities called MicroGenres, which in turn are abstractions
about design patterns and genres for representing the page content. This description is useful for web
page and web site classification and for a deeper insight into the web site's social context.
The web site comparison is useful for extracting patterns which can be used for improving Web
search engine effectiveness, the identification of best practices in web site design and of course in the
organization of web page content to personalize the web user experience on a web site.
The effectiveness of the proposed approach was tested in a real world case, with e-shop web sites
showing that a web site can be represented in a high level of abstraction by using MicroGenres, the
contents of which can then be compared and given a measure corresponding to web site similarity. This
measure is very useful for detecting web communities on the Web, i.e., a group of web sites sharing
similar contents, and the result is essential in performing a focused and effective information search as
well as minimizing web page retrieval.
& 2014 Elsevier Ltd. All rights reserved.
Keywords:
Web design pattern
Genre
Microgenre
Detection
1. Introduction
The current Web due to its dimensions (number of web pages and
web sites) is beginning to exceed the scope of human perception and
understanding. This leads to a problem with orientation within the
Web space, as evidenced by the problems associated with contemporary search engines giving the relevant answers to user queries.
Therefore new tools are still in high demand and researchers propose
new approaches supporting interaction between users and computers in the Internet environment.
Consider as an example the various Internet search engines
that add new functions related to the purpose of their pages. In
search engines such as Google, Bing or Yahoo, users can specify the
context of their queries to news, shopping, blogs, sports, etc.
However, these contexts cannot currently be combined and the
obtained results do not always meet the users' expectations.
In this approach, an attempt is made to address these issues by
not classifying the web page into a single context (genre) but
n
Corresponding author at: Department of Industrial Engineering, University of
Chile Av. República 701, P.O. Box 8370439, Santiago, Chile. Tel.: þ562 29784834.
E-mail addresses: [email protected] (M. Kudelka),
[email protected] (V. Snasel), [email protected] (Z. Horak),
[email protected] (A. Ella Hassanien), [email protected] (A. Abraham),
[email protected] (J.D. Velásquez).
http://dx.doi.org/10.1016/j.engappai.2014.06.011
0952-1976/& 2014 Elsevier Ltd. All rights reserved.
instead by understanding the page as a set of entities (MicroGenres) where every single entity has been included on the page
for a specific purpose and is used by the user for a specific reason.
Thus the web page is perceived as a place that provides interaction
between the web page owners, developers and users. This interaction is permanent and the development is just a consequence of this
interaction. At first glance it might seem that this interaction is
hidden. However, especially in commercial situations, approaches
can be found that provide evidence for this model. It is in the field of
user testing (see Barnum and Dragga, 2001), where chosen users
directly influence the look and feel of web pages.
Therefore the question of whether web site similarity can be
measured (see Fig. 1) leads to the conviction that with a reasonable amount of abstraction, techniques allowing this measurement
can be formulated. This technology may be used for Web reduction, which corresponds to Web segmentation in certain social
contexts as given by the web site authors and the web site target
audience.
This paper is organized as follows. In Section 2, the web page is
described from the viewpoint of groups of people appearing over
the course of the web page's lifetime. The notion of the MicroGenre as a building block of web pages is also defined and a new
method for MicroGenre detection is described. In Section 3 an
overview of related approaches and methods is provided, mostly
from the area of web page similarity, web genre identification and
188
M. Kudelka et al. / Engineering Applications of Artificial Intelligence 35 (2014) 187–198
Fig. 1. Page similarity can be measured. What about web site similarity?.
Fig. 2. Three different points of view on web page development.
detection, web design patterns and information extraction methods. Section 4 describes tools and techniques required for our
experiments. In Section 5 several observations and experiments
with web site description are presented to verify the correctness of
the new approach. The last section summarizes our contribution
and focuses on possible directions of further research.
2. Social context of the Web
Every single web page (or group of web pages) can be
perceived from three different points of view. Consideration of
the individual points of view was inspired by specialists on web
design (see Van Duyne et al., 2002) and on the communication of
humans with computers (see Borchers, 2000). Three groups of
people can be naturally defined (Kudelka et al., 2009) that are
somehow involved in the course of events (Fig. 2):
1. The first group is composed of those whose intention is that the
user finds what he expects on the web page (those who usually
pay for the web page and later expect some profit). The
intention which the web page is supposed to fulfil is consequently represented by this group.
2. The second group consists of developers responsible for the
creation and development of the web page. They are therefore
consequently responsible for fulfilling the goals of the two
other groups.
3. The third group is made up of users who work with the web
page, use its functions and create its content. This group
consequently represents how the web page should appear
outwardly to the user. It is important that this performance
satisfies a particular need of the user.
Remark. Consider blogs as an example. The first group is the
companies, which offer an environment and technological background for blog authors, and to some extent also define the formal
aspects of blogs. The second group is the developers who implement the task given by the previous group. A visible attribute of
this group is that they – to a certain degree – share their
techniques and policies. The third group consists of blog authors
(in the sense of content creation) and blog readers. They influence
the previous two groups retroactively (by their behaviour, requests,
and needs).
From this point of view, web pages are elements around which
the social networks are formed. Therefore it is natural to create the
analysis and descriptive model based on the social networks
around web pages. For further details and references, see Adamic
and Adar (2005) and Kumar et al. (2010) (which also consider the
aspect of network evolution).
2.1. MicroGenres
In all sectors of the arts, genres are vague categories without
fixed boundaries which are essentially formed by sets of conventions. Many works of art are cross-genre and employ and combine
these conventions. Probably the most deeply theoretically studied
genres are the literary ones. They allow experts to systematize the
world of the literature such that it can be considered as a subject
of scientific examination. The term MicroGenre can be found in
this field.
For example, in Martin (1995) the MicroGenre is seen as part of
a combined text. This term has been introduced to identify the
contribution of inserted genres to the overall organization of text.
The motivation of using the term “MicroGenre” is because it is
used as a building block of the analytic descriptive system. On the
other hand, Web design patterns are used more technically and
provide a means for good web page solutions.
From our point of view, web pages are structured similar to
literary texts, using parts which are relatively independent and
have their own purpose. However, the architecture of the web
page (individual parts) is based on Web design patterns. For these
parts the term “MicroGenre” has been chosen. Contemporary web
pages are often very complex; nevertheless they can usually be
described using several MicroGenres. This kind of description can
be more flexible than the genre description (which usually
represents the whole page). Genre and design patterns have a
social background. Within this background there are different
groups of people with the same common interests.
Definition. (Web) MicroGenre is a part of a web page,1
1. Whose purpose is general and repeats frequently
2. Which can be named intelligibly and more or less unambiguously so that the name is understandable for the web page user
(developer, designer, etc).
3. Which is detectable on a web page using an algorithm
Remark. From an external point of view, a given web site can
appear differently to different users. The reasons to visit a web site
may also vary. Consider an e-shop as an example. It is certain that
1
The MicroGenre can, but does not have to, strictly relate to the structure of a
web page in a technical sense, see below for detailed explanations.
the aim of the e-shop owner is to sell the most goods. The aims of
visitors may vary. A user can visit the e-shop to explore the kinds
and prices of goods and read the terms of sale. The main target is
the price information and maybe the price comparison. Another
user wants to directly buy the goods. In that case he will be
interested in pages with a purchasing possibility. A third user may
be interested in product parameters and the opinions of other
users. Therefore he will prefer pages containing technical features
of products, discussion, FAQ, customer reviews and ratings. Therefore the description of a web site based on the purpose of its
particular pages and the intents of its users seems to be appropriate. These purposes and intents are contained in MicroGenres.
2.2. MicroGenre description
1. Historically, the oldest resource for detecting useful objects on
web pages is the so-called page wrappers, which serve to
detect and extract data from the context of mostly product
pages (but also from many others) with known or partially
known structure. The user's needs are linked to the field of Web
content mining.
2. The second resource is the genres. The key problem here is the
need for document classification (web pages in particular)
however nowadays it seems to be complicated to classify a
189
web page within only a single genre. Automatic Web page
genre detection is again one of the Web content mining tasks.
Its application is obvious from the aforementioned search
engine functionality.
3. The final resource is the Web design patterns. Their purpose is to
describe the widely used practices for web page creation. These
descriptions should be general enough, but easily understandable
by the software architects, designers and web page developers.
All these resources are therefore the results of considering how
to describe the useful parts of a web page. These resources also
rely on user interaction and due to this interaction they change
over time. These resources can therefore be understood as a timevariant database or catalog of objects that is meaningful to detect
on web pages (see Fig. 4).
This catalog will never be too large (there are dozens of
MicroGenres at the most according to our estimation). This
introduces quite an important advantage to using MicroGenres
for web page representation and for using this representation for
classification and measuring web page similarity.
2.3. MicroGenre detection – Pattrio method
We began to detect objects on web pages in 2006 (see Kudelka
et al., 2006). We have tried to find a general principle which can be
Fig. 3. MicroGenre elements.
Fig. 4. Catalog of MicroGenres.
190
used to describe most of the features that can be (with high
relevance) used to detect MicroGenres on a web page. Generally
speaking, we use a two-step approach for automated detection.
The first step is to analyse the web page content and structure.
The goal of this analysis is to find web page elements, which are
usually contained in MicroGenres. We have incorporated approaches
Fig. 5. Gestalt principles illustrated on particular e-shop page as a part of the MicroGenre detection process.
191
Fig. 6. MicroGenre detection process.
known from the field of Web information extraction (see Chang et al.,
2006). To successfully perform this step we have to know which
words, short phrases and data types are characteristic for a particular
MicroGenre (and in what context, see Fig. 3). We obtain this
information for every single MicroGenre by analysing a large amount
of data.
The second step is to apply rules that are related to the layout of
web page elements. Here we use the so-called Gestalt principles (see
Fig. 5) that describe the general rules that are valid for visual systems
(see Tidwell, 2005). Basically, if a group of elements should be
perceived as a whole, then these elements should be close enough,
can repeat themselves or follow each other. As an example more
closely related segments can be expected in the Discussion MicroGenre
(single contributions) and more varied segments in the Technical
features MicroGenre (list of parameters – attribute names, values and
physical units). The whole process is depicted in Fig. 6 and is described
in detail (including algorithms, etc.) in our previous publications.
For each MicroGenre a configuration containing all the necessary inputs for the detection process has been prepared. This
process is heuristic in nature and answers the following question;
with what relevance does the given web page contain the
specified MicroGenre? A simplified input for the Discussion MicroGenre is shown below.
Example – Discussion (Forum)
Words: {Main: re, reply, discussion, forum, author, question, answer, thread, contribution, subject, sent;
Complementary: date, name, post, topic}
Data types: {Main: date, time, first name; Complementary:}
Technical {link, label, input, table}
elements:
Rules: {proximity: 16; closure: normal; similarity: high;
continuity: low}
The result of the detection process for each web page is
therefore a list of contained MicroGenres (accompanied by the
relevance of each MicroGenre). On the technical level web pages
can be represented as vectors and used in further applications
based on a vector-space model. Another possibility is to use binary
or fuzzy relevance evaluation, in which pages can then be
represented using an object-attribute model. In the text below
both of these approaches are used.
Remark: Based on the experiments with the Discussion and the
Purchase possibility MicroGenres, the accuracy of our Pattrio
method is about 80% (for details see Kocibova et al., 2007).
3. Related work
In this section other studies somewhat related to our approach
will be described. None of these approaches will be imitated, but
in some details our approach is similar to some of this work. This
holds true particularly for several web classification methods, web
page similarity measurements, genre detection, web design patterns and web information extraction. The main difference of our
approach is not a technical solution but what information about
the web page and web site will be offered to users.
Web classification: Web classification is understood as the task
of assigning web pages or web sites to pre-defined categories
according to their content. Both manual and automated
approaches can be distinguished. Manual evaluation relies on the
judgments of individuals about a web site. On the other hand,
there are many approaches of automated Web classification into
relevance categories that eliminate the human effort required to
maintain repositories of sources.
Bauer and Scharl (2000) suggested a valuable methodology for
automated content analysis of web sites of commercial, educational and non-profit organizations. Sun et al. (2002) studied the
effect of using context features in web classification using SVM
classifiers. They used two kinds of context features – web page
titles and hyperlinks. Shen et al. (2004) proposed classification
algorithms based on web summarization, for extracting the most
relevant features from web pages. Yu et al. (2004) presented a
framework for web page classification. The framework applied a
learning algorithm based on positive and unlabelled data.
A survey in Choi and Yao (2005) described systems that
automatically classify web pages into meaningful categories of
subject-based and genre-based classifications. Lindemann and
Littig (2006) analysed structural properties which reflect the
functionality of web sites. These structural properties consider
the size, the organization and composition of URLs, and the link
structure of web sites. The result of their approach is classified into
relevant functional classes (Academic, Blog, Community, Corporate, Information, Non-profit, Personal, and Shopping). Qi and
Davison (2009) in their survey examined the space of web
classification approaches to find new areas for research. They
192
reviewed the web-specific features and algorithms that were
found to be useful for web page classification. Rajaraman (2009)
described a general-purpose topic exploration engine using a
federated search approach. A presented taxonomy consists of
millions of topics organized hierarchically.
Web page similarity: The similarity of web pages based on their
content (there are also methods based on link analysis) can be
evaluated with different methods and the results of these methods
can be used in various ways. There are different views on how to
measure the similarity of web pages. It depends on the target of a
method that is used for web page analysis. The methods can be
roughly divided into three groups (see Kudelka et al., 2010). The
first group includes methods which are aimed at the semantic
content of web pages, and as a typical example methods from the
field of Information Retrieval (IR) can be mentioned (Velásquez
et al., 2011; Velásquez, 2012). In IR, the similarity is based on the
terms and the advantages of a vector space model that are often
used (see e.g. Kobayashi and Takeda, 2000). The second group
contains methods working with patterns, semantic blocks or the
genre of web pages (see e.g. Kudelka et al., 2008). The third is a
group including methods evaluating the visual similarity of web
pages (see e.g. Takama and Mitsuhashi, 2005). Obviously, there can
be some methods which combine approaches from different
groups.
Genre detection: In the context of documents, the genre is a
taxonomy that incorporates the style, form and content of a
document which is orthogonal to topic, with fuzzy classification
into multiple genres (see Boese, 2005). In the same paper, existing
classifications are described. As a result, 115 genres are identified,
e.g. Ads, Calendars, Classifieds, Comparisons, Contact info, Coupons, Demographics, Documentaries, E-cards, FAQs, Homepages,
Hubs, Lists, News, Newsgroups, Reports, etc. For classification,
there are many approaches and also many methods for genre
identification. Kennedy and Shepherd (2005) analysed home page
genres (personal home page, corporate home page or organization
home page). Chaker and Habib (2007) proposed a flexible
approach for web page genre categorization. Flexibility means
that the approach assigns a document to all predefined genres
with different weights. Dong et al. (2008) described a set of
experiments to examine the effect of various attributes of web
genres for the automatic identification of the genre of web pages.
Four different genres were used in the data set (FAQ, News,
E-Shopping and Personal Home Pages).
Rosso (2008) explored the use of genre as a document descriptor in order to improve the effectiveness of web searching. The
author formulated three hypotheses: (1) Users of the system must
possess sufficient knowledge of the genre. (2) Searchers must be
able to relate the genres to their information needs. (3) Genre
must be predictable by a machine-applied algorithm because it is
not typically explicitly contained in the document.
Web design patterns: Design patterns are based on the work of
Alexander et al. (1977). From the mid-sixties to mid-seventies,
Alexander et al. defined a new approach to architectural design.
The new approach centred on the concept of pattern languages is
described in a series of books (Alexander et al., 1977). Alexander's
definition of pattern is as follows: “Each pattern describes a
problem, which occurs over and over again in our environment,
and then describes the core of the solution to that problem, in
such a way that you can use this solution a million times over,
without ever doing it the same way twice.”
According to Tidwell (2005), patterns are structural and behavioural features that improve the applicability of software architecture, a user interface, a web site or something else in another
domain. They make things more usable and easier to understand.
Patterns are descriptions of best practices within a given design
domain. They capture common and widely accepted solutions, and
their validity is empirically proved. Patterns are not novel ideas,
but patterns are captured experiences with each of their implementation a little different. Good examples are also the so-called
“Web design patterns,” which are patterns for design related to the
Web (see van Welie, 2008).
Web information extraction: There is a wide area of methods,
which aim to detect objects related to MicroGenres and extract
their semantic details (e.g. Opinion extraction, Lee et al., 2008;
News extraction, Zheng et al., 2007; Web discussion extraction,
Limanto et al., 2005; Product detail extraction, Nie et al., 2007; and
Technical features extraction, Schmidt et al., 2005).
Contrary to such notions from the Semantic web as MicroData
and Schema.org,2 the MicroGenres represent a fully automatic
approach relying on existing data. There is no need to update
existing pages or manually annotate them. However this approach
is mainly focused on detecting particular MicroGenres and, unlike
the mentioned approaches, has only limited data extraction and
description capabilities.
4. Tools and techniques
The object-attribute representation of MicroGenres is well
suited to being analysed using Formal Concept Analysis (FCA).
This well-known method is capable of finding all meaningful
groups of objects and attributes that are valid in the data. Therefore a deeper view can be got into the structure of web sites, their
properties and relations between their groups. The problem is that
it cannot often be precisely stated whether a page contains a
particular MicroGenre or not. Using a multiple-degree scale the
presence of MicroGenre on the page can be estimated. This
information can then be processed using a fuzzy version of Formal
Context Analysis.
However, in an attempt to find all meaningful groups in a large
dataset (which is what a collection of web sites truly is) we will
end up with a large number of groups. There will be difficulties in
navigating the results. Marginal sets will receive the same attention as important ones. Starting with a large quantity of web sites,
the computations may be unable to be finished at all. Matrix
reduction methods have previously been used successfully in large
data environments; therefore the so-called Nonnegative Matrix
Factorization has been used to focus on the important data on the
web site – MicroGenre relationships.
First of all, some basic notions of these techniques will be
introduced. Detailed information about their usage can be found in
the Experiment section.
4.1. Formal concept analysis
Now only a brief overview of FCA in fuzzy environments is
given (approach of Belohlávek and Vychodil, 2005, further details
and comparison with other approaches can be found in the same
place). Consider the so-called complete residuated lattice
L ¼ 〈L; 3 ; 4 ; ; -; 0; 1〉, where 〈L; 3 ; 4 ; 0; 1〉 is the complete
lattice (with 0 and 1 being the smallest and biggest elements),
〈L; ; 1〉 is a commutative monoid and 〈 ; -〉 is the adjoint pair of
binary operations (truth functions of fuzzy conjunction and fuzzy
implication). The notion of being an element of a set can be
replaced by the degree to which the element is contained in the
set (A(x)). The notion of subset can be inferred in a similar way.
Now we can define the fuzzy L-context as L ¼ 〈X; Y; I〉 with X
being the set of objects, Y being the set of attributes and I being the
fuzzy relation I : X Y-L. For fuzzy set A A LX ; B A LY (A is a fuzzy
2
http://www.schema.org
set of objects, B is a fuzzy set of attributes) we define fuzzy set of
attributes A0 A LY and fuzzy set of objects B0 A LX as
A0 ðyÞ ¼ ⋀ ðAðxÞnX -Iðx; yÞÞ;
xAX
B0 ðxÞ ¼ ⋀ ðBðyÞnY -Iðx; yÞÞ:
yAY
The nX and nY are the so-called truth-stressing functions or simple
hedges which allow us to control the size of the resulting lattice.
Formal fuzzy concept is a pair 〈A; B〉, A A LX ; B A LY , such that
A0 ¼ B and B0 ¼ A. In this case, we will call A as the extent of the
concept, B as its intent. As 〈BðX nX ; Y nY ; IÞ; r 〉 we denote that the set
of all concepts, which, accompanied by the induced order is called
fuzzy concept lattice. Simply said, the fuzzy concept lattice can
provide us a structure of all meaningful groups of data present in
the analysed dataset. Using a specific structure the fuzzy case can
naturally degenerate into the binary environment. This effect is
used in our experiments several times.
4.2. Non-negative matrix factorization
Methods of matrix decomposition have succeeded in reducing
the dimensions of input data (see Snasel et al., 2008 for applications connected with Formal Concept Analysis and Berry et al.,
1995, Lee and Seung, 1999 for overview).
Matrix factorization methods decompose one – usually huge –
matrix into several smaller matrices. Such decomposition methods
are common in the field of information retrieval. Non-negative
matrix factorization differs by the use of constraints that produce
non-negative basis vectors, which create parts-based representation. Therefore a matrix containing original object-attribute data
can be decomposed, less important parts omitted and the matrix
recomposed. As a result, a much simpler lattice is obtained
maintaining the most important properties and relations.
The NMF is usually computed as an approximation of the
original matrix V by computing a (W,H) pair to minimize the
Frobenius norm of the difference V WH. Let V A Rmn be a nonnegative matrix and W A Rmk and H A Rkn for 0 o k⪡minðm; nÞ.
Then, the minimization problem can be stated as
min‖V WH‖2
with W ij 4 0 and H ij 4 0 for each i and j. The algorithm used was
the one proposed in Lee and Seung (1999).
5. Observations, experiments and results
In this paper we attempt to find a description of a web site such
as that coming from the analysis of the architecture of pages
belonging to this web site.
We use our Pattrio method for MicroGenre identification. Our
web site description indicates the intent and purpose of the web
site. From this point of view the description provides interesting
information about web sites.
In order to demonstrate the feasibility and usefulness of this
new approach, it has been decided to perform several experiments, which are described below, beginning with the background
of the experiment.
We have implemented a web application with user interface
connected to the API of different search engines (google.com, msn.
com, yahoo.com and the Czech search engine jyxo.cz). Users from
a group of students and teachers of high schools and VŠB
(Technical University of Ostrava, Czech Republic) were using this
application for more than one year to search for everyday
information. The process of searching was not influenced by us
in any way. The purpose of this part of the experiment was to view
193
the World Wide Web from the perspective of users (as the search
engines play a key role in World Wide Web navigation). In the end
a dataset with more than 115,000 web pages was obtained. After
cleaning up, 77,850 unique Czech pages from 16,644 web sites
remained. For every single web page the detection of 13 MicroGenres have been performed. The page did not have to contain any
MicroGenre, just as it may as well have theoretically contained 13
MicroGenres (Price information, Purchase possibility, Special offer,
Hire sale, Second hand, Discussion and comments, Review and
opinion, Technical features, News, Enquire, Login, Something to
read, Price per item).3 We used such a pre-processed dataset for all
the following experiments.
5.1. Web site visualization and classification based on MicroGenres
This part is focused on visualization of the web site description
based on MicroGenres. We would like to illustrate that this kind of
visualization can quickly inform users about the whole web site
and serve as an additional source of information for example
during a Web search. Using this visualization the users can also
distinguish between different types of web sites and avoid
possible user confusion when landing on the web site.
For the visualization of extracted information one common
graph drawing method has been adopted, which works as follows;
at the centre of Fig. 7 the circle representing the web site can be
seen. The size of the circle is determined by the relative size
(number of pages) in the dataset. The circles around the web site
correspond to individual MicroGenres. The size of circles is again
determined by their relative presence in the dataset.
The web site is connected to MicroGenres using a straight line.
The strength of this connection (represented by the line width)
corresponds to the detected degree of MicroGenres present on the
web site.
Figs. 7 and 8 contain the described visualization of some web
sites from the Czech Republic – typical web sites aimed at selling
products, sharing information and education.
This observation shows that MicroGenres can describe web
sites in user-understandable terms. MicroGenres may be used to
differentiate between different web sites, therefore they divide the
space of web sites and can be used in the search process without
introducing new unfamiliar concepts.
Fig. 9 depicts two web sites with similar intent, the selling of
products. The basic MicroGenres related to e-shops are shared (Price
per item, Price information and Purchase possibility). The second one
differs in allowing users to share information in addition to a
purchase possibility as well as allowing purchase on hire.
5.2. Web site clustering based on MicroGenres
In this experiment we wanted to prove that MicroGenres can be
used to divide the set of web sites and allow the structural view of a
set of web sites, with a potential of navigating within the dataset
from the global view as well as from the view of one specific topic. As
an input we have used the list of web sites created in the previous
step. Only web sites with more than 20 pages in the dataset have
been taken into consideration. Each web site was accompanied by
detected MicroGenres. This list was transformed into a binary matrix
and considered as a formal context. Using the methods of FCA we
have computed a conceptual lattice which can be seen in Fig. 10. The
resulting matrix has 516 rows (objects – web sites) and 13 columns
(attributes – MicroGenres) while the computed conceptual lattice
contains 259 concepts.
3
The names of MicroGenres emerged from the discussions between the first
three authors and students who took part in the experiments.
194
Fig. 7. Typical web site aimed at selling products.
Fig. 9. Comparison of two product-oriented web sites.
Fig. 8. Typical web site aimed at information sharing (left part) and typical
university web site (right part).
From the computed lattice we have selected a sub-lattice
containing 18 web sites dealing with cell phones. Only five
attributes have been selected and the visualization was created
in a slightly different manner (see Fig. 11 and attached legend).
Each node of the graph corresponds to one formal concept. To
increase the visualization value, the attributes (MicroGenres) are
represented by icons and the set of objects (web sites) is depicted
using small filled/empty squares in the lower part. It can be easily
seen that the whole set of web sites can be divided into two
groups, the first one containing sites where users are enabled to
buy cell phones and a second one where users are allowed to have
a discussion. Deeper insight gives more detailed information about
web site structures and relations.
The conceptual lattice forms a graph, which can be interpreted
as an expression of relation between different web sites. As a
result, the conceptual lattice also describes an automatic classification of web sites.
Complexity of the view: To address the concept lattice complexity more generally, the NMF method has been used to decompose
the original matrix into two, much smaller matrices. The detailed
look at these matrices gives us an insight into extracted partsbased representation. According to this method, the six most
195
important elements (which form the base for most of the web
sites) are the following:
6. Price information, Purchase possibility, Special offer and Price
per item (present in 77 web sites).
1. Login (present in 148 web sites).
2. Something to read (present in 448 web sites).
3. Discussion and comments and Review and opinion (present in
50 web sites).
4. Price info, Purchase possibility and Price per item (present in
164 web sites).
5. Technical features (present in 229 web sites).
A histogram of attributes can be seen in Fig. 13. A naturally
simplified lattice constructed from these reduced data can be seen
in Fig. 12.
It is apparent from the presented results that most of the
analysed web sites contain various forms of the Something to read
MicroGenre. More specific MicroGenres provide better insight into
web sites and the services they provide. It can be seen from the
Fig. 10. Lattice calculated from whole dataset.
Fig. 12. Lattice calculated from reduced data set (rank 6).
Fig. 11. Part of lattice.
196
Fig. 13. Attribute histogram.
Fig. 14. Web site map (e-shops and universities highlighted).
MicroGenre histogram and the underlying data that there are
many web sites containing Price information (234), while some of
them also provide Purchase possibility (166). Out of this group,
many web sites contain Special offer (77). Only a small part of these
web sites provide (with respect to our detection) Hire sale (12).
One of the specific attributes of the Czech Web is the frequent sale
of used products. There are specialized second-hand web sites, but
the same service is often provided by ordinary e-shops.
These MicroGenre statistics may provide useful information in
additional applications. For example when evaluating the relevancy of particular MicroGenres w.r.t. search results, we may
incorporate the ideas behind the Term Frequency–Inverse Document Frequency (TF–IDF) methodology. Therefore the role of
highly represented MicroGenres (like Something to read) will be
diminished. Clearly, this MicroGenre is more important on the
page level than at the level of the whole site.
The structure built over MicroGenres can be used in the search
process for refining the search results based on user feedback with
the potential of raising the user's satisfaction. Each time the user
specifies a MicroGenre that is (or is not) interesting, technically
that means navigation within the conceptual lattice and selecting
a subset of search results.
5.3. Web site similarity
This experiment verifies the usage of MicroGenres for calculating web site similarity and obtaining a potential map of the Web.
Using cosine similarity, defined as follows:
similarity ¼ cos ðθÞ ¼
AB
JAJ JBJ
we have computed the similarity between each pair of web sites.
The computed results have been considered as a graph, where
nodes are web sites and the edge between nodes exists if the
similarity between web sites is above some specified threshold.
Using Pajek software4 and Kamada–Kawai force-based placing
algorithm a simple web site map has been produced (see Fig. 14)
with e-shops and university sites highlighted.
One concept has been randomly selected from the whole
lattice. This gave a concept in common with the web sites
aaamobil.cz, hifishop.cz, mshop.cz and o-d.cz. These web sites
have Price information, Purchase possibility, Special offer, Hire sale,
Technical features, Login, Something to read and Price per item
attributes in common. These web sites are highlighted in the left
lower part of Fig. 14. Their cosine similarity ranges from 0.9 to
1.0 and their visualization is in Fig. 15.
As an interesting result of this experiment the overall map of
web sites was considered:
1. On the upper right part of the map can be found (as well as
many university web sites) web sites containing mostly static
information pages. Only a few general MicroGenres and hardly
4
http://pajek.imfm.si
197
Fig. 15. Selected web sites (e-shops).
any specific MicroGenres, such as Price information or Discussion and comments, have been detected on them. From the view
of MicroGenres (which in principle ignores the semantic content of the pages) these web sites are very similar. Therefore
they are placed closely on the map and so besides universities,
other educational web sites can also be found, such as The
Academy of Sciences of the Czech Republic.
2. On the opposite side of the map (bottom left corner) can be
found, apart from the already mentioned e-shop web sites,
many other e-shops with various products (cell phones, sports,
electronics, etc). They are spread widely throughout the area
because a higher number of MicroGenres have been detected
and therefore this group is more diversified from the similarity
viewpoint.
3. The web sites which are not very specific and usually provide
both static information content and purchase possibility are
placed somewhere between the mentioned parts of the map.
This means that the web sites which are similar from the user's
point of view (universities and e-shops) are similar also from the
point of view of MicroGenres. Therefore they divide the search
space according to the user's expectations and are supposed to be
useful in the search process.
6. Conclusions and future work
In this paper three kinds of social groups have been described
which take part in web page creation and usage. We distinguish
these groups using their relation with the web page.
The emergence and evolution of web pages and web sites is the
result of interaction between these groups. These groups have
agreed on several basic building blocks of web pages. In this paper
we identify some of these blocks and call them MicroGenres. The
agreement resulted in a uniform name for the MicroGenre,
predictable behaviour and identifiable appearance of the element.
Each web page and web site indirectly defines the social context
within which the tasks linked with Web development and usage are
realized. In this context, we have tried to look at web sites from the
point of view of their similarity. When browsing the Web (especially
using Internet search engines) the need is clear for automatic
classification of web pages, and with respect to this classification,
the extraction of user-understandable data as well. Our MicroGenres
could be useful in this classification, and combined with the frequency
analysis of their usage on individual web sites the similarity can be
described and measured at a high level of abstraction.
More specifically, in the first observation a way of looking at
web sites was proposed that allows comprehensive insight into
198
the web site essence. As a result it also allows us to measure the
similarity of web sites. Obviously one can imagine better graphic
design of the visualization, incorporating for example pictograms
for particular MicroGenres, etc. However the main goal was
illustrated. The first experiment presents that the MicroGenres
on web sites can also represent a structural view, which can then
be used for a quick overview or detailed navigation. In the last
experiment a similarity between web sites was able to be
evaluated which can have both local and global impacts.
In our future research we want to apply our methodology to
more complex web sites, such us the adaptive web sites, which
change their structure and content depending on web user
browsing behaviour.
References
Adamic, L., Adar, E., 2005. How to search a social network. Soc. Netw. 27 (3),
187–203.
Alexander, C., Ishikawa, S., Silverstein, M., 1977. A Pattern Language: Towns,
Buildings, Construction. Oxford University Press, USA.
Barnum, C., Dragga, S., 2001. Usability Testing and Research. Allyn & Bacon Inc.,
Needham Heights, MA, USA.
Bauer, C., Scharl, A., 2000. Quantitative evaluation of Web site content and
structure. Internet Res.: Electron. Netw. Appl. Policy 10 (1), 31–43.
Belohlávek, R., Vychodil, V., 2005. What is a fuzzy concept lattice. In: Proceedings of
CLA, vol. 5, pp. 34–45.
Berry, M.W., Dumais, S.T., Letsche, T.A., 1995. Computational methods for intelligent
information access. In: Proceedings of the IEEE/ACM SC95 Conference on
Supercomputing, 1995. IEEE, Washington, DC, USA, p. 20.
Boese, E., 2005. Stereotyping the Web: Genre Classification of Web Documents
(Ph.D. thesis). Colorado State University, USA.
Borchers, J.O., 2000. Interaction design patterns: twelve theses. In: Workshop, vol. 2,
The Hague. pp. 3–9.
Chaker, J., Habib, O., 2007. Genre categorization of web pages. In: Proceedings of the
Seventh IEEE International Conference on Data Mining Workshops. IEEE
Computer Society Washington, Washington, DC, USA, pp. 455–464.
Chang, C., Kayed, M., Girgis, M., Shaalan, K., 2006. A survey of web information
extraction systems. IEEE Trans. Knowl. Data Eng. 18 (10), 1411–1428.
Choi, B., Yao, Z., 2005. Web page classificationn. In: Foundations and Advances in
Data Mining. Springer, Berlin, Germany, pp. 221–274.
Dong, L., Watters, C., Duffy, J., Shepherd, M., 2008. An examination of genre
attributes for web page classification. In: Proceedings of the 41st Annual
Hawaii International Conference on System Sciences. IEEE Computer Society,
Washington, DC, USA, pp. 133–143.
Kennedy, A., Shepherd, M., 2005. Automatic identification of home pages on the
web. In: Proceedings of the 38th Annual Hawaii International Conference on
System Sciences, 2005, HICSS'05. IEEE, p. 1–9.
Kobayashi, M., Takeda, K., 2000. Information retrieval on the web. ACM Comput.
Surv. 32 (2), 144–173.
Kocibova, J., Klos, K., Lehecka, O., Kudelka, M., Snasel, V., 2007. Web page analysis:
experiments based on discussion and purchase web patterns. In: Web Intelligence and Intelligent Agent Technology Workshops, pp. 221–225.
Kudelka, M., Snasel, V., Horak, Z., Hassanien, A., Abraham, A., 2009. Web communities defined by web page content. Comput. Social Netw. Anal., 349–370.
Kudelka, M., Snasel, V., Klos, K., Pokorny, J., 2010. Visual similarity of web pages. In:
Advances in Intelligent Web Mastering: Proceedings of the Sixth Atlantic Web
Intelligence Conference—AWIC'2009, Prague, Czech Republic, September 2009.
Springer Verlag, Washington, DC, USA; London, UK, p. 135.
Kudelka, M., Snasel, V., Lehecka, O., El-Qawasmeh, E., 2006. Semantic analysis of
Web pages using Web patterns. In: Proceedings of the 2006 IEEE/WIC/ACM
International Conference on Web Intelligence. IEEE Computer Society,
Washington, DC, USA; London, UK, pp. 329–333.
Kudelka, M., Snasel, V., Lehecka, O., El-Qawasmeh, E., Pokorny, J., 2008. Web pages
reordering and clustering based on web patterns. In: SOFSEM 2008: Theory and
Practice of Computer Science, pp. 731–742.
Kumar, R., Novak, J., Tomkins, A., 2010. Structure and evolution of online social
networks. In: Link Mining: Models, Algorithms, and Applications. Springer,
Berlin, Germany, pp. 337–357.
Lee, D., Jeong, O., Lee, S., 2008. Opinion mining of customer feedback data on the
web. In: Proceedings of the Second International Conference on Ubiquitous
Information Management and Communication. ACM, New York, NY, USA,
pp. 230–235.
Lee, D., Seung, H., 1999. Learning the parts of objects by non-negative matrix
factorization. Nature 401 (6755), 788–791.
Limanto, H., Giang, N., Trung, V., Zhang, J., He, Q., Huy, N., 2005. An information
extraction engine for web discussion forums. In: Special Interest Tracks and
Posters of the 14th International Conference on World Wide Web. ACM, New
York, NY, USA, pp. 978–979.
Lindemann, C., Littig, L., 2006. Coarse-grained classification of web sites by their
structural properties. In: Proceedings of the Eighth Annual ACM International
Workshop on Web Information and Data Management. ACM, New York, NY,
USA, pp. 35–42.
Martin, J., 1995. Text and clause: fractal resonance. Text (The Hague) 15 (1), 5–42.
Nie, Z., Wen, J., Ma, W., 2007. Object-level vertical search. In: CIDR, pp. 235–246.
Qi, X., Davison, B., 2009. Web page classification: features and algorithms. ACM
Comput. Surv. 41 (2), 1–31.
Rajaraman, A., 2009. Kosmix: high-performance topic exploration using the deep
web. Proc. VLDB Endow. 2 (2), 1524–1529.
Rosso, M., 2008. User-based identification of Web genres. J. Am. Soc. Inf. Sci. 59 (7),
1053–1072.
Schmidt, S., Stoyan, H., Weichselgarten, A., 2005. Web-based extraction of technical
features of products. GI Jahrestag. 1, 246–250.
Shen, D., Chen, Z., Yang, Q., Zeng, H., Zhang, B., Lu, Y., Ma, W., 2004. Web-page
classification through summarization. In: Proceedings of the 27th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, pp. 242–249.
Snasel, V., Polovincak, M., Dahwa, H., Horak, Z., 2008. On concept lattices and
implication bases from reduced contexts. In: Supplementary Proceedings of the
16th International Conference on Conceptual Structures, ICCS, pp. 83–90.
Sun, A., Lim, E., Ng, W., 2002. Web classification using support vector machine. In:
Proceedings of the Fourth International Workshop on Web Information and
Data Management. ACM, New York, NY, USA, pp. 96–99.
Takama, Y., Mitsuhashi, N., 2005. Visual similarity comparison for Web page
retrieval. In: Proceedings of the 2005 IEEE/WIC/ACM International Conference
on Web Intelligence, pp. 301–304.
Tidwell, J., 2005. Designing Interfaces: Patterns for Effective Interaction Design.
O'Reilly, CA, USA.
Van Duyne, D., Landay, J., Hong, J., 2002. The Design of Sites: Patterns, Principles,
and Processes for Crafting a Customer-centered Web Experience. AddisonWesley Professional, Boston, USA.
van Welie, M., 2008. A Pattern Library for Interaction Design. 〈www.welie.com〉
(last accessed March 2010).
Velásquez, J.D., 2012. Web site keywords: a methodology for improving gradually
the web site text content. Intell. Data Anal. 16 (2), 327–348.
Velásquez, J.D., Dujovne, L.E., L'Huillier, 2011. Extracting significant website key
objects: a semantic web mining approach. Eng. Appl. Artif. Intell. 24 (8),
1532–1541.
Yu, H., Han, J., Chang, K., 2004. PEBL: Web page classification without negative
examples. IEEE Trans. Knowl. Data Eng. 16 (1), 70–81.
Zheng, S., Song, R., Wen, J., 2007. Template-independent news extraction based on
visual consistency. In: Proceedings of the National Conference on Artificial
Intelligence, vol. 7, pp. 1507–1513.

A novel approach for comparing web sites by using MicroGenres

Transcription

Similar documents

How to Tie a Woggle

lnstallation - ConservationMart.com

Turkey - EOPOWER

Key - Scioly.org

fuzzy`s pit box lounge

RÉSoNANSS: a regional contribution to

Real Time Control of Hand Prosthesis Using Surface EMG

Kerntech Diagnose-System Loose Part Monitoring System

CoUNTerMEASURES CoUNTerMEASURES

PrimeFilm 1800 AFL Low-Cost 35mm Slide/Film Scanning Through

pdf file - School of Mathematics and Statistics

Using the Köppen classification to quantify climate variation and