A novel approach for comparing web sites by using MicroGenres
Engineering Applications of Artificial Intelligence 35 (2014) 187–198
journal homepage: www.elsevier.com/locate/engappai
http://dx.doi.org/10.1016/j.engappai.2014.06.011
0952-1976/© 2014 Elsevier Ltd. All rights reserved.

Milos Kudelka (b), Vaclav Snasel (b), Zdenek Horak (b), Aboul Ella Hassanien (a), Ajith Abraham (b), Juan D. Velásquez (c,*)
(a) Cairo University, Egypt
(b) VSB Technical University of Ostrava, Czech Republic
(c) University of Chile, Chile
(*) Corresponding author at: Department of Industrial Engineering, University of Chile, Av. República 701, P.O. Box 8370439, Santiago, Chile. Tel.: +562 29784834.

Article history: Received 21 July 2013; received in revised form 28 May 2014; accepted 16 June 2014; available online 20 July 2014.

Keywords: Web design pattern; Genre; Microgenre; Detection

Abstract

In this paper, a novel approach is introduced to compare web sites by analysing their web page content. Each web page can be expressed as a set of entities called MicroGenres, which in turn are abstractions of design patterns and genres for representing the page content. This description is useful for web page and web site classification and for a deeper insight into the web site's social context. Web site comparison is useful for extracting patterns which can be used for improving Web search engine effectiveness, for identifying best practices in web site design and for organizing web page content to personalize the user experience on a web site. The effectiveness of the proposed approach was tested in a real-world case with e-shop web sites, showing that a web site can be represented at a high level of abstraction by using MicroGenres, the contents of which can then be compared and given a measure corresponding to web site similarity. This measure is very useful for detecting web communities on the Web, i.e., groups of web sites sharing similar contents, and the result is essential in performing a focused and effective information search as well as in minimizing web page retrieval.

1. Introduction

Because of its size (the number of web pages and web sites), the current Web is beginning to exceed the scope of human perception and understanding. This leads to a problem with orientation within the Web space, as evidenced by the difficulty contemporary search engines have in giving relevant answers to user queries. New tools are therefore still in high demand, and researchers propose new approaches supporting interaction between users and computers in the Internet environment. Consider as an example the various Internet search engines that add new functions related to the purpose of their pages. In search engines such as Google, Bing or Yahoo, users can restrict the context of their queries to news, shopping, blogs, sports, etc. However, these contexts cannot currently be combined and the obtained results do not always meet the users' expectations.

In this approach, an attempt is made to address these issues by not classifying the web page into a single context (genre) but instead by understanding the page as a set of entities (MicroGenres), where every single entity has been included on the page for a specific purpose and is used by the user for a specific reason. Thus the web page is perceived as a place that provides interaction between the web page owners, developers and users. This interaction is permanent and the development is just a consequence of this interaction. At first glance it might seem that this interaction is hidden.
However, especially in commercial settings, approaches can be found that provide evidence for this model. One example is the field of user testing (see Barnum and Dragga, 2001), where selected users directly influence the look and feel of web pages. The question of whether web site similarity can be measured (see Fig. 1) therefore leads to the conviction that, with a reasonable amount of abstraction, techniques allowing this measurement can be formulated. This technology may be used for Web reduction, which corresponds to Web segmentation in certain social contexts as given by the web site authors and the web site target audience.

Fig. 1. Page similarity can be measured. What about web site similarity?

This paper is organized as follows. In Section 2, the web page is described from the viewpoint of groups of people appearing over the course of the web page's lifetime. The notion of the MicroGenre as a building block of web pages is also defined and a new method for MicroGenre detection is described. In Section 3 an overview of related approaches and methods is provided, mostly from the areas of web page similarity, web genre identification and detection, web design patterns and information extraction methods. Section 4 describes tools and techniques required for our experiments. In Section 5 several observations and experiments with web site description are presented to verify the correctness of the new approach. The last section summarizes our contribution and focuses on possible directions of further research.

2. Social context of the Web

Fig. 2. Three different points of view on web page development.

Every single web page (or group of web pages) can be perceived from three different points of view.
Consideration of the individual points of view was inspired by specialists on web design (see Van Duyne et al., 2002) and on the communication of humans with computers (see Borchers, 2000). Three groups of people can be naturally defined (Kudelka et al., 2009) that are somehow involved in the course of events (Fig. 2): 1. The first group is composed of those whose intention is that the user finds what he expects on the web page (those who usually pay for the web page and later expect some profit). The intention which the web page is supposed to fulfil is consequently represented by this group. 2. The second group consists of developers responsible for the creation and development of the web page. They are therefore consequently responsible for fulfilling the goals of the two other groups. 3. The third group is made up of users who work with the web page, use its functions and create its content. This group consequently represents how the web page should appear outwardly to the user. It is important that this performance satisfies a particular need of the user. Remark. Consider blogs as an example. The first group is the companies, which offer an environment and technological background for blog authors, and to some extent also define the formal aspects of blogs. The second group is the developers who implement the task given by the previous group. A visible attribute of this group is that they – to a certain degree – share their techniques and policies. The third group consists of blog authors (in the sense of content creation) and blog readers. They influence the previous two groups retroactively (by their behaviour, requests, and needs). From this point of view, web pages are elements around which the social networks are formed. Therefore it is natural to create the analysis and descriptive model based on the social networks around web pages. For further details and references, see Adamic and Adar (2005) and Kumar et al. 
(2010) (which also consider the aspect of network evolution). 2.1. MicroGenres In all sectors of the arts, genres are vague categories without fixed boundaries which are essentially formed by sets of conventions. Many works of art are cross-genre and employ and combine these conventions. Probably the most deeply theoretically studied genres are the literary ones. They allow experts to systematize the world of the literature such that it can be considered as a subject of scientific examination. The term MicroGenre can be found in this field. For example, in Martin (1995) the MicroGenre is seen as part of a combined text. This term has been introduced to identify the contribution of inserted genres to the overall organization of text. The motivation of using the term “MicroGenre” is because it is used as a building block of the analytic descriptive system. On the other hand, Web design patterns are used more technically and provide a means for good web page solutions. From our point of view, web pages are structured similar to literary texts, using parts which are relatively independent and have their own purpose. However, the architecture of the web page (individual parts) is based on Web design patterns. For these parts the term “MicroGenre” has been chosen. Contemporary web pages are often very complex; nevertheless they can usually be described using several MicroGenres. This kind of description can be more flexible than the genre description (which usually represents the whole page). Genre and design patterns have a social background. Within this background there are different groups of people with the same common interests. Definition. (Web) MicroGenre is a part of a web page,1 1. Whose purpose is general and repeats frequently 2. Which can be named intelligibly and more or less unambiguously so that the name is understandable for the web page user (developer, designer, etc). 3. Which is detectable on a web page using an algorithm Remark. 
From an external point of view, a given web site can appear differently to different users. The reasons to visit a web site may also vary. Consider an e-shop as an example. It is certain that the aim of the e-shop owner is to sell as many goods as possible. The aims of visitors may vary. One user may visit the e-shop to explore the kinds and prices of goods and read the terms of sale; the main target is the price information and perhaps a price comparison. Another user wants to buy the goods directly and will therefore be interested in pages with a purchasing possibility. A third user may be interested in product parameters and the opinions of other users, and will therefore prefer pages containing technical features of products, discussion, FAQ, customer reviews and ratings. A description of a web site based on the purpose of its particular pages and the intents of its users therefore seems appropriate. These purposes and intents are contained in MicroGenres.

1 The MicroGenre can, but does not have to, strictly relate to the structure of a web page in a technical sense; see below for detailed explanations.

2.2. MicroGenre description

1. Historically, the oldest resource for detecting useful objects on web pages is the so-called page wrappers, which serve to detect and extract data from the context of mostly product pages (but also from many others) with known or partially known structure. The user's needs are linked to the field of Web content mining.
2. The second resource is the genres. The key problem here is the need for document classification (of web pages in particular); however, nowadays it seems complicated to classify a web page within only a single genre. Automatic Web page genre detection is again one of the Web content mining tasks. Its application is obvious from the aforementioned search engine functionality.
3. The final resource is the Web design patterns.
Their purpose is to describe the widely used practices for web page creation. These descriptions should be general enough, but easily understandable by software architects, designers and web page developers.

All these resources are therefore the results of considering how to describe the useful parts of a web page. These resources also rely on user interaction, and due to this interaction they change over time. They can therefore be understood as a time-variant database or catalog of objects that are meaningful to detect on web pages (see Fig. 4). This catalog will never be too large (according to our estimation, there are dozens of MicroGenres at most). This is an important advantage when using MicroGenres for web page representation and when using this representation for classification and for measuring web page similarity.

Fig. 3. MicroGenre elements.
Fig. 4. Catalog of MicroGenres.

2.3. MicroGenre detection – Pattrio method

Fig. 5. Gestalt principles illustrated on particular e-shop page as a part of the MicroGenre detection process.
Fig. 6. MicroGenre detection process.

We began to detect objects on web pages in 2006 (see Kudelka et al., 2006). We have tried to find a general principle which can be used to describe most of the features that can be used (with high relevance) to detect MicroGenres on a web page. Generally speaking, we use a two-step approach for automated detection. The first step is to analyse the web page content and structure. The goal of this analysis is to find web page elements which are usually contained in MicroGenres. We have incorporated approaches known from the field of Web information extraction (see Chang et al., 2006).
To successfully perform this step we have to know which words, short phrases and data types are characteristic for a particular MicroGenre (and in what context, see Fig. 3). We obtain this information for every single MicroGenre by analysing a large amount of data. The second step is to apply rules that are related to the layout of web page elements. Here we use the so-called Gestalt principles (see Fig. 5) that describe the general rules that are valid for visual systems (see Tidwell, 2005). Basically, if a group of elements should be perceived as a whole, then these elements should be close enough, can repeat themselves or follow each other. As an example, more closely related segments can be expected in the Discussion MicroGenre (single contributions) and more varied segments in the Technical features MicroGenre (list of parameters: attribute names, values and physical units). The whole process is depicted in Fig. 6 and is described in detail (including algorithms, etc.) in our previous publications.

For each MicroGenre a configuration containing all the necessary inputs for the detection process has been prepared. This process is heuristic in nature and answers the following question: with what relevance does the given web page contain the specified MicroGenre? A simplified input for the Discussion MicroGenre is shown below.

Example: Discussion (Forum)
Words: {Main: re, reply, discussion, forum, author, question, answer, thread, contribution, subject, sent; Complementary: date, name, post, topic}
Data types: {Main: date, time, first name; Complementary:}
Technical elements: {link, label, input, table}
Rules: {proximity: 16; closure: normal; similarity: high; continuity: low}

The result of the detection process for each web page is therefore a list of contained MicroGenres (accompanied by the relevance of each MicroGenre). On the technical level web pages can be represented as vectors and used in further applications based on a vector-space model.
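As a rough illustration of the vector representation just mentioned, the following minimal sketch maps a page's detected MicroGenre relevances to a fixed-order vector and compares two pages by cosine similarity. It is not the Pattrio implementation: the function names, the four-entry subset of the catalog, the choice of cosine similarity and the relevance values are all assumptions made for the example.

```python
import math

# Illustrative subset of the MicroGenre catalog; the order fixes the vector layout.
MICROGENRES = [
    "Price information",
    "Purchase possibility",
    "Discussion and comments",
    "Technical features",
]


def to_vector(relevances):
    """Map {MicroGenre name: relevance in [0, 1]} to a fixed-order vector."""
    return [float(relevances.get(name, 0.0)) for name in MICROGENRES]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up detection results for two pages (not data from the paper).
product_page = to_vector({
    "Price information": 0.9,
    "Purchase possibility": 0.8,
    "Technical features": 0.6,
})
forum_page = to_vector({"Discussion and comments": 0.95})
```

Because the catalog is small, such vectors stay short regardless of how much text a page contains, which is the practical appeal of this representation.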
Another possibility is to use binary or fuzzy relevance evaluation, in which pages can then be represented using an object-attribute model. In the text below both of these approaches are used. Remark: Based on the experiments with the Discussion and the Purchase possibility MicroGenres, the accuracy of our Pattrio method is about 80% (for details see Kocibova et al., 2007). 3. Related work In this section other studies somewhat related to our approach will be described. None of these approaches will be imitated, but in some details our approach is similar to some of this work. This holds true particularly for several web classification methods, web page similarity measurements, genre detection, web design patterns and web information extraction. The main difference of our approach is not a technical solution but what information about the web page and web site will be offered to users. Web classification: Web classification is understood as the task of assigning web pages or web sites to pre-defined categories according to their content. Both manual and automated approaches can be distinguished. Manual evaluation relies on the judgments of individuals about a web site. On the other hand, there are many approaches of automated Web classification into relevance categories that eliminate the human effort required to maintain repositories of sources. Bauer and Scharl (2000) suggested a valuable methodology for automated content analysis of web sites of commercial, educational and non-profit organizations. Sun et al. (2002) studied the effect of using context features in web classification using SVM classifiers. They used two kinds of context features – web page titles and hyperlinks. Shen et al. (2004) proposed classification algorithms based on web summarization, for extracting the most relevant features from web pages. Yu et al. (2004) presented a framework for web page classification. The framework applied a learning algorithm based on positive and unlabelled data. 
A survey in Choi and Yao (2005) described systems that automatically classify web pages into meaningful categories using subject-based and genre-based classifications. Lindemann and Littig (2006) analysed structural properties which reflect the functionality of web sites. These structural properties consider the size, the organization and composition of URLs, and the link structure of web sites. Their approach classifies web sites into functional classes (Academic, Blog, Community, Corporate, Information, Non-profit, Personal, and Shopping). Qi and Davison (2009) in their survey examined the space of web classification approaches to find new areas for research. They reviewed the web-specific features and algorithms that were found to be useful for web page classification. Rajaraman (2009) described a general-purpose topic exploration engine using a federated search approach. The presented taxonomy consists of millions of topics organized hierarchically.

Web page similarity: The similarity of web pages based on their content (there are also methods based on link analysis) can be evaluated with different methods, and the results of these methods can be used in various ways. There are different views on how to measure the similarity of web pages; the choice depends on the target of the method used for web page analysis. The methods can be roughly divided into three groups (see Kudelka et al., 2010). The first group includes methods which are aimed at the semantic content of web pages; typical examples are methods from the field of Information Retrieval (IR) (Velásquez et al., 2011; Velásquez, 2012). In IR, similarity is based on terms, and the advantages of a vector space model are often exploited (see e.g. Kobayashi and Takeda, 2000).
The second group contains methods working with patterns, semantic blocks or the genre of web pages (see e.g. Kudelka et al., 2008). The third is a group including methods evaluating the visual similarity of web pages (see e.g. Takama and Mitsuhashi, 2005). Obviously, there can be some methods which combine approaches from different groups. Genre detection: In the context of documents, the genre is a taxonomy that incorporates the style, form and content of a document which is orthogonal to topic, with fuzzy classification into multiple genres (see Boese, 2005). In the same paper, existing classifications are described. As a result, 115 genres are identified, e.g. Ads, Calendars, Classifieds, Comparisons, Contact info, Coupons, Demographics, Documentaries, E-cards, FAQs, Homepages, Hubs, Lists, News, Newsgroups, Reports, etc. For classification, there are many approaches and also many methods for genre identification. Kennedy and Shepherd (2005) analysed home page genres (personal home page, corporate home page or organization home page). Chaker and Habib (2007) proposed a flexible approach for web page genre categorization. Flexibility means that the approach assigns a document to all predefined genres with different weights. Dong et al. (2008) described a set of experiments to examine the effect of various attributes of web genres for the automatic identification of the genre of web pages. Four different genres were used in the data set (FAQ, News, E-Shopping and Personal Home Pages). Rosso (2008) explored the use of genre as a document descriptor in order to improve the effectiveness of web searching. The author formulated three hypotheses: (1) Users of the system must possess sufficient knowledge of the genre. (2) Searchers must be able to relate the genres to their information needs. (3) Genre must be predictable by a machine-applied algorithm because it is not typically explicitly contained in the document. 
Web design patterns: Design patterns are based on the work of Alexander et al. (1977). From the mid-sixties to mid-seventies, Alexander et al. defined a new approach to architectural design. The new approach centred on the concept of pattern languages is described in a series of books (Alexander et al., 1977). Alexander's definition of pattern is as follows: “Each pattern describes a problem, which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice.” According to Tidwell (2005), patterns are structural and behavioural features that improve the applicability of software architecture, a user interface, a web site or something else in another domain. They make things more usable and easier to understand. Patterns are descriptions of best practices within a given design domain. They capture common and widely accepted solutions, and their validity is empirically proved. Patterns are not novel ideas, but patterns are captured experiences with each of their implementation a little different. Good examples are also the so-called “Web design patterns,” which are patterns for design related to the Web (see van Welie, 2008). Web information extraction: There is a wide area of methods, which aim to detect objects related to MicroGenres and extract their semantic details (e.g. Opinion extraction, Lee et al., 2008; News extraction, Zheng et al., 2007; Web discussion extraction, Limanto et al., 2005; Product detail extraction, Nie et al., 2007; and Technical features extraction, Schmidt et al., 2005). Contrary to such notions from the Semantic web as MicroData and Schema.org,2 the MicroGenres represent a fully automatic approach relying on existing data. There is no need to update existing pages or manually annotate them. 
However, this approach is mainly focused on detecting particular MicroGenres and, unlike the mentioned approaches, has only limited data extraction and description capabilities.

4. Tools and techniques

The object-attribute representation of MicroGenres is well suited to being analysed using Formal Concept Analysis (FCA). This well-known method is capable of finding all meaningful groups of objects and attributes that are valid in the data. It therefore gives a deeper view into the structure of web sites, their properties and the relations between their groups. The problem is that it often cannot be precisely stated whether a page contains a particular MicroGenre or not. Using a multiple-degree scale, the presence of a MicroGenre on the page can be estimated. This information can then be processed using a fuzzy version of Formal Concept Analysis.

However, in an attempt to find all meaningful groups in a large dataset (which is what a collection of web sites truly is) we will end up with a large number of groups. There will be difficulties in navigating the results, and marginal sets will receive the same attention as important ones. Starting with a large quantity of web sites, the computations may not finish at all. Matrix reduction methods have previously been used successfully in large data environments; therefore the so-called Non-negative Matrix Factorization (NMF) has been used to focus on the important data in the relationships between web sites and MicroGenres. First of all, some basic notions of these techniques will be introduced. Detailed information about their usage can be found in Section 5.

4.1. Formal concept analysis

Only a brief overview of FCA in fuzzy environments is given here (following the approach of Belohlávek and Vychodil, 2005, where further details and a comparison with other approaches can be found).
Consider the so-called complete residuated lattice $\mathbf{L} = \langle L, \wedge, \vee, \otimes, \rightarrow, 0, 1 \rangle$, where $\langle L, \wedge, \vee, 0, 1 \rangle$ is a complete lattice (with 0 and 1 being the smallest and biggest elements), $\langle L, \otimes, 1 \rangle$ is a commutative monoid and $\langle \otimes, \rightarrow \rangle$ is an adjoint pair of binary operations (truth functions of fuzzy conjunction and fuzzy implication). The notion of being an element of a set can be replaced by the degree to which the element is contained in the set, $A(x)$. The notion of subset can be inferred in a similar way. Now we can define the fuzzy L-context as a triple $\langle X, Y, I \rangle$, with $X$ being the set of objects, $Y$ being the set of attributes and $I$ being the fuzzy relation $I : X \times Y \rightarrow L$. For fuzzy sets $A \in L^X$, $B \in L^Y$ ($A$ is a fuzzy set of objects, $B$ is a fuzzy set of attributes) we define the fuzzy set of attributes $A' \in L^Y$ and the fuzzy set of objects $B' \in L^X$ as

$$A'(y) = \bigwedge_{x \in X} \left( A(x)^{*_X} \rightarrow I(x, y) \right), \qquad B'(x) = \bigwedge_{y \in Y} \left( B(y)^{*_Y} \rightarrow I(x, y) \right).$$

The operations $*_X$ and $*_Y$ are the so-called truth-stressing functions or simple hedges, which allow us to control the size of the resulting lattice. A formal fuzzy concept is a pair $\langle A, B \rangle$, $A \in L^X$, $B \in L^Y$, such that $A' = B$ and $B' = A$. In this case, we call $A$ the extent of the concept and $B$ its intent. By $\langle \mathcal{B}(X^{*_X}, Y^{*_Y}, I), \leq \rangle$ we denote the set of all concepts which, accompanied by the induced order, is called the fuzzy concept lattice. Simply said, the fuzzy concept lattice provides a structure of all meaningful groups of data present in the analysed dataset. Using a specific structure, the fuzzy case can naturally degenerate into the binary environment. This effect is used in our experiments several times.

2 http://www.schema.org

4.2. Non-negative matrix factorization

Methods of matrix decomposition have succeeded in reducing the dimensions of input data (see Snasel et al., 2008 for applications connected with Formal Concept Analysis, and Berry et al., 1995 and Lee and Seung, 1999 for an overview). Matrix factorization methods decompose one, usually huge, matrix into several smaller matrices. Such decomposition methods are common in the field of information retrieval. Non-negative matrix factorization differs by the use of constraints that produce non-negative basis vectors, which create a parts-based representation. A matrix containing the original object-attribute data can therefore be decomposed, the less important parts omitted and the matrix recomposed. As a result, a much simpler lattice is obtained which maintains the most important properties and relations.

The NMF is usually computed as an approximation of the original matrix $V$ by computing a pair $(W, H)$ that minimizes the Frobenius norm of the difference $V - WH$. Let $V \in \mathbb{R}^{m \times n}$ be a non-negative matrix and let $W \in \mathbb{R}^{m \times k}$ and $H \in \mathbb{R}^{k \times n}$ for $0 < k \ll \min(m, n)$. Then the minimization problem can be stated as

$$\min \| V - WH \|_F^2$$

with $W_{ij} > 0$ and $H_{ij} > 0$ for each $i$ and $j$. The algorithm used was the one proposed in Lee and Seung (1999).

5. Observations, experiments and results

In this paper we attempt to find a description of a web site that comes from the analysis of the architecture of the pages belonging to this web site. We use our Pattrio method for MicroGenre identification. Our web site description indicates the intent and purpose of the web site. From this point of view the description provides interesting information about web sites. In order to demonstrate the feasibility and usefulness of this new approach, several experiments have been performed; they are described below, beginning with the background of the experiment.

We have implemented a web application with a user interface connected to the API of different search engines (google.com, msn.com, yahoo.com and the Czech search engine jyxo.cz). Users from a group of students and teachers of high schools and VŠB (Technical University of Ostrava, Czech Republic) used this application for more than one year to search for everyday information. The process of searching was not influenced by us in any way. The purpose of this part of the experiment was to view the World Wide Web from the perspective of users (as the search engines play a key role in World Wide Web navigation). In the end a dataset with more than 115,000 web pages was obtained. After cleaning up, 77,850 unique Czech pages from 16,644 web sites remained. For every single web page the detection of 13 MicroGenres has been performed. A page did not have to contain any MicroGenre, and could in theory contain all 13 (Price information, Purchase possibility, Special offer, Hire sale, Second hand, Discussion and comments, Review and opinion, Technical features, News, Enquire, Login, Something to read, Price per item).3 We used this pre-processed dataset for all the following experiments.

5.1. Web site visualization and classification based on MicroGenres

This part is focused on visualization of the web site description based on MicroGenres. We would like to illustrate that this kind of visualization can quickly inform users about the whole web site and serve as an additional source of information, for example during a Web search. Using this visualization the users can also distinguish between different types of web sites and avoid possible confusion when landing on the web site. For the visualization of extracted information one common graph drawing method has been adopted, which works as follows. At the centre of Fig. 7 the circle representing the web site can be seen. The size of the circle is determined by the relative size (number of pages) in the dataset. The circles around the web site correspond to individual MicroGenres.
The size of circles is again determined by their relative presence in the dataset. The web site is connected to MicroGenres using a straight line. The strength of this connection (represented by the line width) corresponds to the detected degree of MicroGenres present on the web site. Figs. 7 and 8 contain the described visualization of some web sites from the Czech Republic – typical web sites aimed at selling products, sharing information and education. This observation shows that MicroGenres can describe web sites in user-understandable terms. MicroGenres may be used to differentiate between different web sites, therefore they divide the space of web sites and can be used in the search process without introducing new unfamiliar concepts. Fig. 9 depicts two web sites with similar intent, the selling of products. The basic MicroGenres related to e-shops are shared (Price per item, Price information and Purchase possibility). The second one differs in allowing users to share information in addition to a purchase possibility as well as allowing purchase on hire. 5.2. Web site clustering based on MicroGenres In this experiment we wanted to prove that MicroGenres can be used to divide the set of web sites and allow the structural view of a set of web sites, with a potential of navigating within the dataset from the global view as well as from the view of one specific topic. As an input we have used the list of web sites created in the previous step. Only web sites with more than 20 pages in the dataset have been taken into consideration. Each web site was accompanied by detected MicroGenres. This list was transformed into a binary matrix and considered as a formal context. Using the methods of FCA we have computed a conceptual lattice which can be seen in Fig. 10. The resulting matrix has 516 rows (objects – web sites) and 13 columns (attributes – MicroGenres) while the computed conceptual lattice contains 259 concepts. 
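To make the step from a binary web site/MicroGenre matrix to a conceptual lattice more concrete, the following is a minimal brute-force sketch of binary FCA on a toy context. It is an illustration only, not the tool used in the experiments: the site names, the toy data and the function names are hypothetical, and the enumeration over all attribute subsets is chosen for brevity (with 13 attributes that is 8192 subsets, which is still feasible).

```python
from itertools import combinations


def extent(context, attrs):
    """Objects (web sites) that have every attribute (MicroGenre) in attrs."""
    return frozenset(obj for obj, has in context.items() if attrs <= has)


def intent(context, objs, all_attrs):
    """Attributes shared by every object in objs (all attributes if objs is empty)."""
    shared = set(all_attrs)
    for obj in objs:
        shared &= context[obj]
    return frozenset(shared)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every attribute subset."""
    all_attrs = frozenset().union(*context.values())
    concepts = set()
    for size in range(len(all_attrs) + 1):
        for subset in combinations(sorted(all_attrs), size):
            objs = extent(context, frozenset(subset))
            concepts.add((objs, intent(context, objs, all_attrs)))
    return concepts


# Hypothetical toy context: web site -> set of detected MicroGenres.
toy_context = {
    "shop-a": {"Price information", "Purchase possibility"},
    "shop-b": {"Price information", "Purchase possibility", "Discussion and comments"},
    "forum-c": {"Discussion and comments"},
}
toy_concepts = formal_concepts(toy_context)
```

In this toy context the two shops share a concept whose intent is Price information and Purchase possibility, while the second shop and the forum share the Discussion and comments concept, mirroring the kind of grouping discussed for the cell phone sub-lattice.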
Footnote 3: The names of MicroGenres emerged from the discussions between the first three authors and students who took part in the experiments.

Fig. 7. Typical web site aimed at selling products.
Fig. 8. Typical web site aimed at information sharing (left part) and typical university web site (right part).
Fig. 9. Comparison of two product-oriented web sites.

From the computed lattice we have selected a sub-lattice containing 18 web sites dealing with cell phones. Only five attributes have been selected and the visualization was created in a slightly different manner (see Fig. 11 and attached legend). Each node of the graph corresponds to one formal concept. To increase the visualization value, the attributes (MicroGenres) are represented by icons and the set of objects (web sites) is depicted using small filled/empty squares in the lower part. It can be easily seen that the whole set of web sites can be divided into two groups, the first one containing sites where users can buy cell phones and the second one containing sites where users can have a discussion. A closer look gives more detailed information about web site structures and relations. The conceptual lattice forms a graph, which can be interpreted as an expression of the relations between different web sites. As a result, the conceptual lattice also describes an automatic classification of web sites.

Complexity of the view: To address the concept lattice complexity more generally, the non-negative matrix factorization (NMF) method has been used to decompose the original matrix into two much smaller matrices. A detailed look at these matrices gives us an insight into the extracted parts-based representation. According to this method, the six most important elements (which form the base for most of the web sites) are the following:

1. Login (present in 148 web sites).
2. Something to read (present in 448 web sites).
3. Discussion and comments and Review and opinion (present in 50 web sites).
4. Price information, Purchase possibility and Price per item (present in 164 web sites).
5. Technical features (present in 229 web sites).
6. Price information, Purchase possibility, Special offer and Price per item (present in 77 web sites).

Fig. 10. Lattice calculated from whole dataset.
Fig. 11. Part of lattice.
Fig. 12. Lattice calculated from reduced data set (rank 6).
Fig. 13. Attribute histogram.
Fig. 14. Web site map (e-shops and universities highlighted).

A histogram of attributes can be seen in Fig. 13. A naturally simplified lattice constructed from these reduced data can be seen in Fig. 12. It is apparent from the presented results that most of the analysed web sites contain various forms of the Something to read MicroGenre. More specific MicroGenres provide better insight into web sites and the services they provide. It can be seen from the MicroGenre histogram and the underlying data that there are many web sites containing Price information (234), while some of them also provide Purchase possibility (166). Out of this group, many web sites contain Special offer (77). Only a small part of these web sites provide (with respect to our detection) Hire sale (12). One of the specific attributes of the Czech Web is the frequent sale of used products. There are specialized second-hand web sites, but the same service is often provided by ordinary e-shops. These MicroGenre statistics may provide useful information in additional applications. For example, when evaluating the relevancy of particular MicroGenres w.r.t. search results, we may incorporate the ideas behind the Term Frequency–Inverse Document Frequency (TF–IDF) methodology. 
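As a minimal sketch of this idea, the inverse-frequency part of TF–IDF can be applied to the site counts quoted above, treating each web site as a "document" and each MicroGenre as a "term". The plain logarithmic weight log(N / count) is an assumption chosen for illustration, not a formula given in the text, and the count of 448 for Something to read is taken from the NMF element list and assumed to approximate the attribute count.

```python
import math

N_SITES = 516  # number of web sites (rows) in the binary matrix

# Number of web sites on which each MicroGenre was detected (values from the text).
site_counts = {
    "Something to read": 448,  # assumption: taken from the NMF element list
    "Price information": 234,
    "Purchase possibility": 166,
    "Special offer": 77,
    "Hire sale": 12,
}


def inverse_site_frequency(count, total=N_SITES):
    """Weight a MicroGenre by how rare it is across web sites (IDF analogue)."""
    return math.log(total / count)


weights = {name: inverse_site_frequency(c) for name, c in site_counts.items()}

# Print MicroGenres from least to most discriminative.
for name, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} {weight:.2f}")
```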
Therefore the role of highly represented MicroGenres (like Something to read) will be diminished. Clearly, this MicroGenre is more important on the page level than at the level of the whole site. The structure built over MicroGenres can be used in the search process for refining the search results based on user feedback, with the potential of raising the user's satisfaction. Each time the user specifies a MicroGenre that is (or is not) interesting, this technically means navigating within the conceptual lattice and selecting a subset of search results.

5.3. Web site similarity

This experiment verifies the usage of MicroGenres for calculating web site similarity and obtaining a potential map of the Web. Using cosine similarity, defined as follows:

$$\mathrm{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where A and B are the MicroGenre vectors of the two web sites being compared, we have computed the similarity between each pair of web sites. The computed results have been considered as a graph, where nodes are web sites and an edge between two nodes exists if the similarity between the web sites is above some specified threshold. Using the Pajek software (see footnote 4) and the Kamada–Kawai force-based placement algorithm, a simple web site map has been produced (see Fig. 14) with e-shops and university sites highlighted. One concept has been randomly selected from the whole lattice. This gave a concept in common with the web sites aaamobil.cz, hifishop.cz, mshop.cz and o-d.cz. These web sites have the Price information, Purchase possibility, Special offer, Hire sale, Technical features, Login, Something to read and Price per item attributes in common. These web sites are highlighted in the lower left part of Fig. 14. Their cosine similarity ranges from 0.9 to 1.0 and their visualization is in Fig. 15.

Footnote 4: http://pajek.imfm.si

Fig. 15. Selected web sites (e-shops).

As an interesting result of this experiment the overall map of web sites was considered:

1. In the upper right part of the map, alongside many university web sites, are web sites containing mostly static information pages. Only a few general MicroGenres and hardly any specific MicroGenres, such as Price information or Discussion and comments, have been detected on them. From the view of MicroGenres (which in principle ignores the semantic content of the pages) these web sites are very similar. Therefore they are placed close together on the map, and so besides universities, other educational web sites can also be found there, such as The Academy of Sciences of the Czech Republic.
2. On the opposite side of the map (bottom left corner) can be found, apart from the already mentioned e-shop web sites, many other e-shops with various products (cell phones, sports, electronics, etc.). They are spread widely throughout the area because a higher number of MicroGenres have been detected on them, and therefore this group is more diversified from the similarity viewpoint.
3. The web sites which are not very specific and usually provide both static information content and purchase possibility are placed somewhere between the mentioned parts of the map.

This means that web sites which are similar from the user's point of view (universities to one another, e-shops to one another) are also similar from the point of view of MicroGenres. MicroGenres therefore divide the search space according to the user's expectations and are expected to be useful in the search process.

6. Conclusions and future work

In this paper three kinds of social groups have been described which take part in web page creation and usage. We distinguish these groups using their relation with the web page. The emergence and evolution of web pages and web sites is the result of interaction between these groups. These groups have agreed on several basic building blocks of web pages. In this paper we identify some of these blocks and call them MicroGenres. The agreement resulted in a uniform name for the MicroGenre, predictable behaviour and identifiable appearance of the element. 
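As a recap of the similarity step from Section 5.3, the following sketch computes pairwise cosine similarity between MicroGenre vectors and keeps an edge only when the similarity is above a threshold. The site names, the four-dimensional vectors and the threshold of 0.9 are hypothetical values chosen for illustration; the text does not report the threshold that was used.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for a site with no detected MicroGenre
    return dot / (norm_a * norm_b)


def similarity_graph(vectors, threshold):
    """Return edges (site1, site2, similarity) with similarity above threshold."""
    edges = []
    for (s1, v1), (s2, v2) in combinations(vectors.items(), 2):
        sim = cosine_similarity(v1, v2)
        if sim > threshold:
            edges.append((s1, s2, sim))
    return edges


# Hypothetical degrees of presence for four MicroGenres, in the order:
# Price information, Purchase possibility, Discussion and comments, Something to read
vectors = {
    "shop-a": [0.8, 0.7, 0.0, 0.3],
    "shop-b": [0.9, 0.6, 0.1, 0.2],
    "uni-c": [0.0, 0.0, 0.0, 0.9],
}

edges = similarity_graph(vectors, threshold=0.9)
for s1, s2, sim in edges:
    print(f"{s1} -- {s2}: {sim:.2f}")
```

The resulting edge list could then be exported to a graph tool for layout, which is the role Pajek plays in the experiment described above.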
Each web page and web site indirectly defines the social context within which the tasks linked with Web development and usage are realized. In this context, we have tried to look at web sites from the point of view of their similarity. When browsing the Web (especially using Internet search engines) there is a clear need for automatic classification of web pages and, with respect to this classification, for the extraction of user-understandable data as well. Our MicroGenres could be useful in this classification, and combined with the frequency analysis of their usage on individual web sites, the similarity can be described and measured at a high level of abstraction. More specifically, in the first observation a way of looking at web sites was proposed that allows comprehensive insight into the web site essence. As a result it also allows us to measure the similarity of web sites. Obviously one can imagine a better graphic design of the visualization, incorporating for example pictograms for particular MicroGenres. However, the main goal was illustrated. The first experiment shows that the MicroGenres on web sites can also provide a structural view, which can then be used for a quick overview or detailed navigation. In the last experiment the similarity between web sites was evaluated, which can have both local and global impacts. In our future research we want to apply our methodology to more complex web sites, such as adaptive web sites, which change their structure and content depending on web user browsing behaviour.

References

Adamic, L., Adar, E., 2005. How to search a social network. Soc. Netw. 27 (3), 187–203. Alexander, C., Ishikawa, S., Silverstein, M., 1977. A Pattern Language: Towns, Buildings, Construction. Oxford University Press, USA. Barnum, C., Dragga, S., 2001. Usability Testing and Research. Allyn & Bacon Inc., Needham Heights, MA, USA. 
Bauer, C., Scharl, A., 2000. Quantitative evaluation of Web site content and structure. Internet Res.: Electron. Netw. Appl. Policy 10 (1), 31–43. Belohlávek, R., Vychodil, V., 2005. What is a fuzzy concept lattice. In: Proceedings of CLA, vol. 5, pp. 34–45. Berry, M.W., Dumais, S.T., Letsche, T.A., 1995. Computational methods for intelligent information access. In: Proceedings of the IEEE/ACM SC95 Conference on Supercomputing, 1995. IEEE, Washington, DC, USA, p. 20. Boese, E., 2005. Stereotyping the Web: Genre Classification of Web Documents (Ph.D. thesis). Colorado State University, USA. Borchers, J.O., 2000. Interaction design patterns: twelve theses. In: Workshop, vol. 2, The Hague. pp. 3–9. Chaker, J., Habib, O., 2007. Genre categorization of web pages. In: Proceedings of the Seventh IEEE International Conference on Data Mining Workshops. IEEE Computer Society, Washington, DC, USA, pp. 455–464. Chang, C., Kayed, M., Girgis, M., Shaalan, K., 2006. A survey of web information extraction systems. IEEE Trans. Knowl. Data Eng. 18 (10), 1411–1428. Choi, B., Yao, Z., 2005. Web page classification. In: Foundations and Advances in Data Mining. Springer, Berlin, Germany, pp. 221–274. Dong, L., Watters, C., Duffy, J., Shepherd, M., 2008. An examination of genre attributes for web page classification. In: Proceedings of the 41st Annual Hawaii International Conference on System Sciences. IEEE Computer Society, Washington, DC, USA, pp. 133–143. Kennedy, A., Shepherd, M., 2005. Automatic identification of home pages on the web. In: Proceedings of the 38th Annual Hawaii International Conference on System Sciences, 2005, HICSS'05. IEEE, pp. 1–9. Kobayashi, M., Takeda, K., 2000. Information retrieval on the web. ACM Comput. Surv. 32 (2), 144–173. Kocibova, J., Klos, K., Lehecka, O., Kudelka, M., Snasel, V., 2007. Web page analysis: experiments based on discussion and purchase web patterns. In: Web Intelligence and Intelligent Agent Technology Workshops, pp. 221–225. 
Kudelka, M., Snasel, V., Horak, Z., Hassanien, A., Abraham, A., 2009. Web communities defined by web page content. Comput. Social Netw. Anal., 349–370. Kudelka, M., Snasel, V., Klos, K., Pokorny, J., 2010. Visual similarity of web pages. In: Advances in Intelligent Web Mastering: Proceedings of the Sixth Atlantic Web Intelligence Conference—AWIC'2009, Prague, Czech Republic, September 2009. Springer Verlag, Washington, DC, USA; London, UK, p. 135. Kudelka, M., Snasel, V., Lehecka, O., El-Qawasmeh, E., 2006. Semantic analysis of Web pages using Web patterns. In: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence. IEEE Computer Society, Washington, DC, USA; London, UK, pp. 329–333. Kudelka, M., Snasel, V., Lehecka, O., El-Qawasmeh, E., Pokorny, J., 2008. Web pages reordering and clustering based on web patterns. In: SOFSEM 2008: Theory and Practice of Computer Science, pp. 731–742. Kumar, R., Novak, J., Tomkins, A., 2010. Structure and evolution of online social networks. In: Link Mining: Models, Algorithms, and Applications. Springer, Berlin, Germany, pp. 337–357. Lee, D., Jeong, O., Lee, S., 2008. Opinion mining of customer feedback data on the web. In: Proceedings of the Second International Conference on Ubiquitous Information Management and Communication. ACM, New York, NY, USA, pp. 230–235. Lee, D., Seung, H., 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), 788–791. Limanto, H., Giang, N., Trung, V., Zhang, J., He, Q., Huy, N., 2005. An information extraction engine for web discussion forums. In: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web. ACM, New York, NY, USA, pp. 978–979. Lindemann, C., Littig, L., 2006. Coarse-grained classification of web sites by their structural properties. In: Proceedings of the Eighth Annual ACM International Workshop on Web Information and Data Management. ACM, New York, NY, USA, pp. 35–42. Martin, J., 1995. 
Text and clause: fractal resonance. Text (The Hague) 15 (1), 5–42. Nie, Z., Wen, J., Ma, W., 2007. Object-level vertical search. In: CIDR, pp. 235–246. Qi, X., Davison, B., 2009. Web page classification: features and algorithms. ACM Comput. Surv. 41 (2), 1–31. Rajaraman, A., 2009. Kosmix: high-performance topic exploration using the deep web. Proc. VLDB Endow. 2 (2), 1524–1529. Rosso, M., 2008. User-based identification of Web genres. J. Am. Soc. Inf. Sci. 59 (7), 1053–1072. Schmidt, S., Stoyan, H., Weichselgarten, A., 2005. Web-based extraction of technical features of products. GI Jahrestag. 1, 246–250. Shen, D., Chen, Z., Yang, Q., Zeng, H., Zhang, B., Lu, Y., Ma, W., 2004. Web-page classification through summarization. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, pp. 242–249. Snasel, V., Polovincak, M., Dahwa, H., Horak, Z., 2008. On concept lattices and implication bases from reduced contexts. In: Supplementary Proceedings of the 16th International Conference on Conceptual Structures, ICCS, pp. 83–90. Sun, A., Lim, E., Ng, W., 2002. Web classification using support vector machine. In: Proceedings of the Fourth International Workshop on Web Information and Data Management. ACM, New York, NY, USA, pp. 96–99. Takama, Y., Mitsuhashi, N., 2005. Visual similarity comparison for Web page retrieval. In: Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 301–304. Tidwell, J., 2005. Designing Interfaces: Patterns for Effective Interaction Design. O'Reilly, CA, USA. Van Duyne, D., Landay, J., Hong, J., 2002. The Design of Sites: Patterns, Principles, and Processes for Crafting a Customer-centered Web Experience. AddisonWesley Professional, Boston, USA. van Welie, M., 2008. A Pattern Library for Interaction Design. 〈www.welie.com〉 (last accessed March 2010). Velásquez, J.D., 2012. 
Web site keywords: a methodology for improving gradually the web site text content. Intell. Data Anal. 16 (2), 327–348. Velásquez, J.D., Dujovne, L.E., L'Huillier, 2011. Extracting significant website key objects: a semantic web mining approach. Eng. Appl. Artif. Intell. 24 (8), 1532–1541. Yu, H., Han, J., Chang, K., 2004. PEBL: Web page classification without negative examples. IEEE Trans. Knowl. Data Eng. 16 (1), 70–81. Zheng, S., Song, R., Wen, J., 2007. Template-independent news extraction based on visual consistency. In: Proceedings of the National Conference on Artificial Intelligence, vol. 7, pp. 1507–1513.