Context-aware recommender system for large e-commerce platforms
by
Jacek Wasilewski
A thesis submitted in partial satisfaction of the
requirements for the degree of
Master of Science
in
Computer Science
in the
Institute of Computing Science
of the
Poznań University of Technology, Poznań
Thesis supervisor:
Mikołaj Morzy, PhD, DSc.
June 2013
Contents

1 Introduction
2 Related work
  2.1 Context
  2.2 Recommender system
  2.3 Context-aware recommender systems
3 Dataset
  3.1 Source
  3.2 Dataset structure
4 Text preprocessing
  4.1 General transformations
  4.2 Stop words removal
  4.3 Stemming
  4.4 Tags removal
5 Context-aware recommender system
  5.1 General idea
  5.2 Text preprocessing
  5.3 Category-based context creation
    5.3.1 Description
    5.3.2 Experiment
  5.4 Latent Semantic Indexing-based context creation
    5.4.1 Description
    5.4.2 Experiment
  5.5 Network-based context creation
    5.5.1 Description
    5.5.2 Experiment
6 Conclusions
Bibliography
A Category-based context creation experiment data
  A.1 Items data
  A.2 Recommended items
B Latent Semantic Indexing experiment data
  B.1 Items data
  B.2 Term-document frequency matrix
  B.3 Singular Value Decomposition
  B.4 Queries
C Network-based context creation experiment data
  C.1 Categories networks
  C.2 Compressed categories networks
  C.3 Modules of merged network
List of Figures
List of Tables
Chapter 1
Introduction
Nowadays, the amount of data and the number of different items we can buy exceed, several times over, what was available even a few years ago. It is no longer possible to compare every item of the type we are interested in purchasing in order to choose the one that suits us perfectly. We might also not be aware of the different variants of a product, or of all the additional accessories we might need while using the product we have chosen. Given today's technological progress, we expect specialized systems to analyze our needs and the available products, and then suggest which one would be best for us and what else we might need or be interested in. Without a doubt, personalization and recommendation are key to all of this.
There are several types of recommender systems, but the majority of existing approaches focus on recommending the most similar items, according to users' interests and recent views, and usually do not take into account other, sometimes important, information such as time, place, mood or connections between items. For instance, if we are viewing an eBay listing for a bag of vanilla-flavored ground coffee, a typical recommender system might return other listings of the same type of coffee. Usually this is what we expect from recommender systems.
However, in many cases, rather than items of the same type as recently viewed ones, we might expect items that have something in common with the viewed item. For example, if we are looking for a volleyball, we might want to see, besides other volleyballs, also t-shirts and shorts. Likewise, while browsing coffee machines, we would most likely also like to buy, or at least see, bags of coffee or thermoses. This idea of highlighting connected items is not new; it is called cross-selling and is widely used even in traditional shops, where laces are placed near the shoes because customers might need an additional pair.
Cross-selling is a very simple idea and is easily applicable in small e-commerce systems, because all we need is to set connections between items. When it comes to larger systems such as Amazon, eBay or Allegro (allegro.pl), with millions of accessible items, this task becomes very difficult, and doing it manually is unreasonable and almost impossible. We must also mention that items sold via large e-commerce platforms are usually offered by different sellers who describe them in various ways, so the same item can be presented differently. The amount of data and the variety of descriptions make this problem complicated, and it requires automatic and dynamic solutions.
In this thesis we discuss a specific type of recommender system - context-aware recommender systems (CARS) - which tries to build a model enriched with information about complementary products. We also address the main problems that occur while processing the data, show how these complex tasks can be divided into subtasks, and for each subtask present approaches that can be used to perform it. Because the solutions must be ready to work with huge amounts of data, we take the complexity of each task into consideration.
The rest of the thesis is organized as follows. Chapter 2 discusses the general notions of context and recommender systems, as well as how we connect these ideas. In Chapter 3 we present information about the dataset used in the experiments. Chapter 4 describes methods of text and web preprocessing. After that, in Chapter 5, we present the idea of how contexts can be constructed, along with a few methods of building them: category-based context creation, Latent Semantic Indexing-based context creation and network-based context creation. Chapter 6 contains conclusions and presents some opportunities for further work.
Chapter 2
Related work
Before we describe what exactly a context-aware recommender system is, in Section 2.1 we start by discussing the general notion of context. Then, in Section 2.2, we focus on different approaches to recommender systems. After that, in Section 2.3, we present the ideas behind context-aware recommender systems.
2.1 Context
What is a context? The question is simple, but there is no simple answer. According to the Webster dictionary [1], it is described as the interrelated conditions in which something exists or occurs; it is a kind of environment. The topic has been studied across many scientific disciplines, and there is even a conference called CONTEXT devoted to the interdisciplinary use of context in the cognitive sciences (linguistics, psychology, computer science, neuroscience) and social sciences, but also in medicine, law and business. The well-known business researcher Coimbatore Krishnarao Prahalad has suggested that companies must not only deliver competitive products but also provide unique, real-time customer experiences shaped by customer context, and that this would be the next big thing for CRM (Customer Relationship Management) practitioners [22].
This is only one of many definitions of what context is. Mary Bazire and Patrick Brézillon analyzed over 150 different definitions, coming mainly from the web, and observed that it is difficult to find one definition satisfying every discipline. The multifaceted nature of this intuitive concept makes it difficult to define it precisely for a given area [17]. Since we focus on recommender systems in the area of e-commerce applications, we discuss the definition from the data mining, e-commerce personalization, databases, information retrieval, context-aware pervasive systems and marketing points of view. In this we follow [11] and [8].
Data mining According to [18], context is defined as events that characterize the life stages of a customer and that can discriminate his or her preferences. Examples of context in data mining include the purchase of a new car, planning a wedding, changing a job, or the sets of items bought in a shop. Knowledge of these situations helps in discovering data mining patterns that are well fitted and accurate for them. In some cases we might be looking for information about which data are relevant to a chosen problem, e.g. which data describe the possibility of an infection occurring after a liver transplant; in other cases we might want to find out what other insurance products should have been offered to us if we recently bought car insurance for a new sports car.
E-commerce personalization In the field of e-commerce, following [8], we say that the intent of a purchase made by a customer interested in buying a particular item is our context. A single customer may have many contexts, one for each different intent. For instance, buying a new DVD player, a set of horror movies on DVD as a gift for a friend, a new hairdryer and an old set of cups gives us four different contexts, regardless of whether all of those items were bought by the same customer, with the same account, or even from the same seller; each time the reason for the purchase was different. In [8], separate profiles are built for each reason, depending on customer behavior. What is more, in this case the same context can be shared by many customers.
Context-aware pervasive systems Initially, the term context-aware system was used for systems which, depending on the location of the user's mobile phone, provided information about objects near the phone. More specifically, a context-aware system was classified as one that can adapt according to this information [7]. In later works this context-awareness pertained not only to location but also to date, season, temperature [20], the user's emotional status and any other information describing the relationship between the user and the application [6].
Information retrieval Information retrieval is strongly connected with Web search, which is one of its most common applications. In this case we describe context as a set of other topics related to the search query. It has been shown that contextual data positively affect the results of information retrieval, but most existing retrieval systems base their results only on queries, ignoring the context of the query. In those that do use a contextual approach, the retrieval techniques focus on providing a short-term context, such as the context of the current query, in contrast to a long-term context that includes the user's tastes and preferences. Google Web search, for example, tries to adjust search results to the user's search history.
Marketing In the field of marketing and management, research has shown that items, their quality and their other attributes depend on the context or purpose of the purchase. In many situations customers make different buying decisions because they apply different strategies when buying, for example, a car versus a hairdryer. According to [12], consumers vary in their decision-making rules because of the usage situation, the use of the good or service (for family, for gift, for self) and the purchase situation (catalog sale, in-store shelf selection, and salesperson-aided purchase).
The previously mentioned Prahalad [22] describes context as the precise physical location of a customer at any given time, the exact minute he or she needs the service, and the kind of mobile device over which that experience [or service - our addition] will be received. Prahalad also focuses on delivering unique, real-time customer experiences, i.e. services like VOD (Video On Demand), where the customer gets exactly what he or she wants, exactly when and where he or she wants it. This is the three-dimensional space described by Prahalad. Although he mentions real-time experiences, the idea can be applied to any time and place in the future.
This section has shown that there is not just one definition of context; it is a more complex term and, in fact, its meaning depends on the situation in which it is used.
2.2 Recommender system
The recommender system field derives directly from other fields of science such as cognitive science, approximation theory, information retrieval, forecasting theory and customer choice modeling, and became an individual research field in the mid-1990s when researchers started to focus on the rating problem. Commonly, the problem of creating a recommender system is reduced to the problem of providing an appropriate scoring function that also handles new users and previously unseen items [10].
Formally, let S be the set of all users and T the set of all possible and available items. Now, let u be a utility function, $u : S \times T \to \mathbb{R}$, that represents the usefulness of item $t \in T$ to user $s \in S$.
Recommender systems are usually classified into the following three categories, based on how recommendations are made: content-based, collaborative filtering and hybrid approaches. We now present each of them.
Content-based
The content-based approach derives from information retrieval and information search research. Because of the advances in those fields and the importance of text-based applications, many recently developed content-based recommender systems focus on recommending items containing textual information. These systems are also supported by user profiles containing data such as tastes, preferences and needs. A user's preferences can be collected directly from the user via questionnaires or indirectly, using machine learning techniques, based on previous activity.
Formally, let Content(t) be an item profile characterized by a set of attributes previously extracted from the item description. This set of attributes is used to find the most appropriate or similar items. Because content-based recommender systems are commonly used with text-based items, the content is usually represented as a set of keywords. Measuring the importance of each keyword $k_i$ in document $d_j$ is possible if some weighting $w_{ij}$ is applied, which can be defined in many ways.
One of the most popular methods of weighting words in documents, derived from information retrieval, is term frequency/inverse document frequency (TF-IDF), defined as follows [10]. Let $N$ be the total number of documents and assume that word $k_i$ appears in $n_i$ of them. Also assume that $f_{i,j}$ is the number of times word $k_i$ appears in document $d_j$. Then the term frequency of word $k_i$ in document $d_j$ is defined as:

$$TF_{i,j} = \frac{f_{i,j}}{\max_z f_{z,j}}$$
where the maximum frequency is computed over all words $k_z$ that occur in document $d_j$. The frequent occurrence of a word in most documents usually does not help to distinguish them, so the inverse document frequency (IDF) is often used together with TF. The inverse document frequency for word $k_i$ is defined as:

$$IDF_i = \log \frac{N}{n_i}$$

Then the TF-IDF weight of word $k_i$ in document $d_j$ is defined as:

$$w_{ij} = TF_{i,j} \cdot IDF_i$$

and the content of the document as:

$$Content(d_j) = (w_{1j}, w_{2j}, \ldots, w_{kj})$$
In content-based systems, the utility function $u(s,t)$ is usually defined as:

$$u(s,t) = score(UserContentPreferences(s), Content(t))$$

where $UserContentPreferences$ can be defined in different ways and, in general, is computed based on previous user behavior. Both of these profiles can be represented as TF-IDF vectors of keyword weights, $\vec{w}_s$ and $\vec{w}_t$. One of the most popular utility functions used for comparing sets of keywords is the cosine similarity measure, defined as follows:

$$u(s,t) = \cos(\vec{w}_s, \vec{w}_t) = \frac{\vec{w}_s \cdot \vec{w}_t}{\|\vec{w}_s\| \, \|\vec{w}_t\|} = \frac{\sum_{i=1}^{K} w_{i,s} w_{i,t}}{\sqrt{\sum_{i=1}^{K} w_{i,s}^2} \sqrt{\sum_{i=1}^{K} w_{i,t}^2}}$$
where K is the total number of words.
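To make the formulas above concrete, the following is a minimal sketch (not the thesis implementation) of computing TF-IDF keyword weights and scoring items against a user profile with cosine similarity; the toy documents and the choice of the user profile are made up for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Compute a TF-IDF weight vector for every document (list of tokens)."""
    n_docs = len(documents)
    # document frequency n_i: in how many documents word k_i appears
    df = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        max_freq = max(counts.values())  # max_z f_{z,j}
        vectors.append({
            word: (freq / max_freq) * math.log(n_docs / df[word])  # TF * IDF
            for word, freq in counts.items()
        })
    return vectors

def cosine(v, w):
    """Cosine similarity between two sparse keyword-weight vectors."""
    dot = sum(v[k] * w[k] for k in v.keys() & w.keys())
    norm = math.sqrt(sum(x * x for x in v.values())) * math.sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0

# Toy example: item keyword lists and a user profile taken from a liked item.
items = [["coffee", "ground", "vanilla", "coffee"],
         ["coffee", "grinder", "steel"],
         ["volleyball", "ball", "beach"]]
item_vectors = tf_idf_vectors(items)
user_profile = item_vectors[0]            # stands in for UserContentPreferences(s)
scores = [(i, cosine(user_profile, v)) for i, v in enumerate(item_vectors)]
print(sorted(scores, key=lambda s: -s[1]))  # u(s, t) for every item t
```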
We can specify three main problems connected with the content-based approach: limited content analysis, over-specialization and the new user problem.
Content-based techniques are limited by the features that are explicitly
associated with the objects recommended by those systems. Therefore, in
order to have a sufficient set of features, the content must either be in a form
that can be parsed automatically by a computer, or the features should be
assigned to items manually. Another problem with limited content analysis
is that if two different items are represented by the same set of features they
are indistinguishable.
When the system can only recommend items that score highly against a user's profile, the user is limited to receiving items similar to those already rated - we call this over-specialization. This problem, which has also been studied in other domains, is often addressed by introducing some randomness into the recommendation process. Also, in certain cases, items should not be recommended if they are too similar to something the user has already seen; the diversity of recommendations is often a desirable feature in recommender systems.
The user has to rate a sufficient number of items before a content-based recommender system can really understand the user's preferences and present reliable recommendations. Therefore, a new user, having very few ratings, would not be able to get accurate recommendations.
Collaborative filtering
Different approaches exist within the collaborative filtering method, where the system tries to predict relevant items based on users' ratings and on finding users with similar taste. More formally, the utility function $u(s,t)$, where $s$ is a user and $t$ is an item, is predicted based on the utilities $u(s_j, t)$ assigned to item $t$ by those users $s_j \in S$ who have a taste similar to user $s$.
Algorithms used in collaborative filtering can be clustered into two groups:
memory-based and model-based.
Memory-based algorithms are, in general, heuristics that make rating predictions based on the entire set of previously rated items. The rating for an unseen item is usually calculated as an aggregation of other users' ratings for that item:

$$r_{s,t} = \operatorname{aggr}_{s' \in \hat{S}} \, r_{s',t}$$

where $\hat{S}$ is a subset of $S$ - usually the most similar users. As the aggregation function we can use a simple average, but the weighted sum or the adjusted weighted sum are used in most cases:

$$r_{s,t} = k \sum_{s' \in \hat{S}} sim(s, s') \times r_{s',t}$$

and

$$r_{s,t} = \bar{r}_s + k \sum_{s' \in \hat{S}} sim(s, s') \times (r_{s',t} - \bar{r}_{s'})$$

where $k$ is a normalizing factor.
To determine the similarity between different users we can use the cosine similarity described in Section 2.2, or the Pearson correlation coefficient, defined as:

$$sim(x,y) = \frac{\sum_{t \in T_{xy}} (r_{x,t} - \bar{r}_x)(r_{y,t} - \bar{r}_y)}{\sqrt{\sum_{t \in T_{xy}} (r_{x,t} - \bar{r}_x)^2} \sqrt{\sum_{t \in T_{xy}} (r_{y,t} - \bar{r}_y)^2}}$$

where $T_{xy}$ is the set of items rated by both users $x$ and $y$.
The only difference between the cosine similarity used in the content-based approach and in collaborative filtering is that in the former we calculate the similarity between TF-IDF vectors, whereas in collaborative filtering we use vectors of user-specified ratings.
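As an illustration of the memory-based formulas above, here is a small sketch (with made-up users, items and ratings) that predicts a rating with the adjusted weighted sum, using the Pearson correlation as the similarity and an assumed normalization k = 1 / Σ|sim|.

```python
import math

# Toy user-item ratings; users and items are hypothetical.
ratings = {
    "alice": {"i1": 5, "i2": 3, "i3": 4},
    "bob":   {"i1": 4, "i2": 2, "i3": 5, "i4": 4},
    "carol": {"i1": 2, "i2": 5, "i4": 1},
}

def mean(r):
    return sum(r.values()) / len(r)

def pearson(x, y):
    """Pearson correlation over the items T_xy rated by both users."""
    common = ratings[x].keys() & ratings[y].keys()
    if not common:
        return 0.0
    mx, my = mean(ratings[x]), mean(ratings[y])
    num = sum((ratings[x][t] - mx) * (ratings[y][t] - my) for t in common)
    den = math.sqrt(sum((ratings[x][t] - mx) ** 2 for t in common)) * \
          math.sqrt(sum((ratings[y][t] - my) ** 2 for t in common))
    return num / den if den else 0.0

def predict(user, item):
    """Adjusted weighted sum over the users who rated the item."""
    neighbours = [u for u in ratings if u != user and item in ratings[u]]
    sims = {u: pearson(user, u) for u in neighbours}
    norm = sum(abs(s) for s in sims.values())  # assumed k = 1 / sum of |sim|
    if norm == 0:
        return mean(ratings[user])
    return mean(ratings[user]) + sum(
        sims[u] * (ratings[u][item] - mean(ratings[u])) for u in neighbours) / norm

print(predict("alice", "i4"))  # predicted rating r_{alice,i4}
```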
The collaborative filtering approach is traditionally used to compare users, but it can also be used to compare items. In this case we use correlation-based and cosine-based techniques to compute the similarity between items and obtain ratings for them. This idea has been extended to the top-N recommendation method.
In contrast to the memory-based approach, the model-based approach uses the collection of ratings to learn a model. Machine learning gives us many methods that can be used for model training, such as naive Bayes, neural networks, clustering, SVMs, decision trees and others. The literature also mentions approaches from different fields, such as statistical models, Gibbs sampling, probabilistic relational models, linear regression, maximum entropy models and more complex probabilistic models, for instance Markov decision processes, probabilistic latent semantic analysis or the generative semantics of Latent Dirichlet Allocation. The main problem with the model-based approach is, in many cases, the complexity of the methods and how they handle huge amounts of data.
The collaborative method also has some problems it must face: the appearance of new users and new items, and sparsity of the dataset.
The new user problem is the same as in the content-based method. In order to make accurate recommendations, the system must first learn the user's preferences from the ratings the user makes. There is no simple solution to this problem; however, its impact is minimized by using the hybrid approach we describe further on.
New items are added regularly to recommender systems. Collaborative systems rely solely on users' preferences to make recommendations. Therefore, until a new item is rated by a substantial number of users, the recommender system is not able to recommend it. This problem is similar to the new user problem and can also be minimized by hybrid methods.
Data sparsity is another problem of the collaborative filtering method. In any recommender system, the number of ratings already obtained is usually very small compared to the number of ratings that need to be predicted. The success of a collaborative recommender system depends on the availability of captured user preferences. As a result, for a user whose taste is unusual compared to the rest of the population, there might not be any other users with similar taste, so the recommendation results might be poor. One way to reduce this effect is to apply demographic filtering, in which we look for other users of, e.g., the same age, location or social status [10].
Several recommender systems use a hybrid approach that combines collaborative and content-based methods, which helps to avoid certain limitations of content-based and collaborative systems. We can describe four different methods of putting the collaborative and content-based approaches together, which we shortly introduce. One way to build a hybrid recommender system is to implement separate collaborative and content-based systems and then combine their outputs into one final recommendation, using either a linear combination of ratings or voting (a minimal sketch of this combination is given below). The second method is to add collaborative characteristics to content-based models; we can achieve that by, for example, using some dimensionality reduction technique on a group of content-based profiles. The third method is to add content-based characteristics to collaborative models; here we use traditional collaborative techniques but also maintain content-based profiles for each user, and use these profiles to calculate the similarity between two users. Finally, we can develop a unified recommendation model by mixing, e.g., the user's age and the article subject.
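A minimal sketch of the first hybrid method mentioned above: the content-based and collaborative scores (here just made-up numbers standing in for the outputs of two separate systems) are combined with a linear combination; the weight alpha is an assumed tuning parameter.

```python
def hybrid_scores(content_scores, collaborative_scores, alpha=0.5):
    """Linear combination of two recommenders' scores for the same items."""
    items = content_scores.keys() | collaborative_scores.keys()
    return {
        item: alpha * content_scores.get(item, 0.0)
              + (1 - alpha) * collaborative_scores.get(item, 0.0)
        for item in items
    }

# Hypothetical outputs of the two separate systems for the same user.
content = {"i1": 0.9, "i2": 0.4, "i3": 0.7}
collaborative = {"i1": 0.2, "i2": 0.8, "i4": 0.6}
ranked = sorted(hybrid_scores(content, collaborative).items(), key=lambda x: -x[1])
print(ranked)  # final, combined recommendation ranking
```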
Expert knowledge can also be used to augment hybrid recommender systems, especially to prevent the new user or new item problem. Its main drawback is the implicit requirement that the subject domain must be well specified or described using ontologies.
This section has presented the main approaches in the field of recommender systems, their typical characterization and applications. We have also introduced the most commonly used formulas, which in general can be useful in solving such problems.
2.3 Context-aware recommender systems
In this section we focus on context-aware recommender systems as a union of the notions of context and recommender systems introduced before.
Traditionally, all types of recommender systems deal with two types of entities, users and items, but as we have said in Section 2.1, all our behaviors have their backgrounds and reasons, which describe our current needs and explain our recent decisions. In the traditional two-dimensional user-item approach the reasons, e.g. for ratings, are hidden, but we cannot ignore their existence. Sometimes it might even be more important why we have made a decision than the decision itself - it certainly gives us a lot of information to provide more accurate recommendations.
Substantial work in this field was done by the authors of [11], who extended the user-item approach with support for additional contextual data. This contextual data can be introduced in three ways: manually, as a set of questions e.g. about preferences; automatically, provided by the user's device, e.g. the location of a mobile phone; or by analyzing the user's set of actions.
The different methods of using contextual data can, in general, be grouped into two categories: recommendation via context-driven querying and search, and recommendation via contextual preference elicitation and estimation. The context-driven querying and search approach is widely used when, for example, we are looking for a restaurant serving Korean cuisine - the recommender system finds the best matching restaurant in the neighborhood that is open. The second method, contextual preference elicitation and estimation, tries to learn a model based on the user's activities and interactions with other users, but also on the user's feedback about previously recommended items. To achieve that, we can use data analysis techniques from machine learning or data mining. These two concepts can also be combined into one that has features of both.
The paper mentioned before introduces three concepts of applying context to the workflow of a recommender system - Figure 2.1 presents all of them.
In a contextual pre-filtering approach, contextual information is used to
filter the data set before applying a traditional recommendation algorithm.
In a contextual post-filtering approach, recommendations are generated on
the entire data set. The result set of recommendations is adjusted using
the contextual information. Contextual modeling approaches use contextual
information directly in the recommendation function as an explicit predictor
of a rating for an item. Whereas contextual pre-filtering and post-filtering
approaches can use traditional recommendation algorithms, the contextual modeling approach uses multidimensional recommendation algorithms [13]. Examples of heuristic-based and model-based approaches have been described in [11].

Figure 2.1: Paradigms for incorporating context in recommender systems [11]: (a) contextual pre-filtering, (b) contextual post-filtering, (c) contextual modeling.
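To illustrate the difference between the pre-filtering and post-filtering paradigms, here is a minimal sketch; the rating data, the context values and the plain recommend_2d function are all hypothetical stand-ins for a traditional 2D recommender.

```python
# Each rating is (user, item, context, rating); the data are made up.
ratings = [
    ("u1", "i1", "saturday", 5), ("u1", "i2", "weekday", 3),
    ("u2", "i1", "saturday", 4), ("u2", "i3", "saturday", 5),
]

def recommend_2d(data, user, n=2):
    """Stand-in for any traditional 2D recommender: rank items by mean rating."""
    scores = {}
    for u, item, _, r in data:
        scores.setdefault(item, []).append(r)
    ranked = sorted(scores, key=lambda i: -sum(scores[i]) / len(scores[i]))
    return ranked[:n]

def pre_filtering(user, context, n=2):
    # Contextual pre-filtering: keep only the ratings made in this context,
    # then run the ordinary 2D recommender on the filtered data.
    contextual = [r for r in ratings if r[2] == context]
    return recommend_2d(contextual, user, n)

def post_filtering(user, context, n=2):
    # Contextual post-filtering: recommend on all data first,
    # then adjust (here: simply filter) the result using the context.
    candidates = recommend_2d(ratings, user, n=len(ratings))
    in_context = {item for u, item, c, r in ratings if c == context}
    return [item for item in candidates if item in in_context][:n]

print(pre_filtering("u1", "saturday"))
print(post_filtering("u1", "saturday"))
```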
In this section we have presented the general and most common description of how context and recommender systems work together.

This chapter has presented some general information about current work in the fields of context definition, recommender system approaches and the context-awareness of recommender systems. In Section 2.1 we saw a few definitions of context from different points of view, especially e-commerce and marketing. Next, we had a brief review of traditional recommender system approaches. The last section connected the information from the two previous sections and introduced the definition of a general context-aware recommender system.
Chapter 3
Dataset
In this chapter we present the characteristics of a typical dataset used in e-commerce platforms. We introduce information about its source, and also present its structure and data types to provide a good understanding of the data.
3.1 Source
Since we are working on a real, industrial problem, the best way to obtain reliable results is to work with real data. Luckily, we have been able to receive a real dataset from a running e-commerce platform, Allegro.pl. Allegro.pl (allegro.pl) is the biggest e-commerce platform in Poland and Eastern Europe, with several million users in Poland alone. Every day, about 500 thousand items are sold on this platform out of about 18 million available items.
The dataset contains information about items and purchases from the period between September 2006 and April 2007, with a total of 305899 items from 17519 categories and 1327872 purchases. An initial subset of users was picked, and for every user in this subset their items and connected data were retrieved.
It is necessary to mention that a current dataset would probably differ from the one we have access to. The reason for this is the change in customers' and sellers' behaviors. For example, in our dataset 65% of all items have the "buy now" option, whereas currently (in 2013) it is about 88%. Also, most sellers are now professional sellers and companies that use the platform as another distribution channel. Because of that, the way descriptions are created, and their quality, might be different.
3.2 Dataset structure
The dataset structure is presented in Figure 3.1.

Figure 3.1: Entity-Relationship Diagram representing the dataset. Entities and their attributes: Item (id, name, price, description, starting_price, buy_now_price, starting_time, ending_time, bid_count, quantity, photo_count, quantity_sold), Category (id, name), User (id, creating_time, activation_time, rating, group, super_seller_status), Bid (date, quantity, amount), Feedback (id, creation_time, type, description).

Every item in the dataset is described by a few features, such as name, description, prices, starting and ending dates, and information about quantity and the number of sold items. Items are grouped into predefined categories, where the categories are organized as a forest - that is the reason why every item can be connected to many categories: one leaf category and the ancestors of this leaf category. Every category is described by a name.
For privacy reasons, the users included in the dataset had to be anonymized in a way that prevents recognition. Because of that, the dataset does not include personal user data, but only a hashed ID and information about account creation, the user's rating on the platform and internal user groups. Every user can own an item, which means that they are the seller of this item. A user is also able to buy an item or bid in an auction - in this case the user is called a buyer, and information about that is stored in the Bid relation. For every bid (transaction), the users involved (seller and buyers) can give feedback about the transaction.
The dataset contains different types of data: text, integer, float and date. All price fields use the float type, time fields use date, and the rating, quantity, amount and count fields use integers. The name and description fields use the text type, which is the most difficult type of field because of the high dimensionality hidden in the data. The next chapter focuses on methods of preprocessing text fields so that they can be used in further processing.
Chapter 4
Text preprocessing
In this chapter we present some preprocessing tasks that usually have to be performed before any further processing. In this we follow [16]. In Section 4.1 we focus on general and simple text transformations; then, in Section 4.2, we present the stop words removal problem, both in general and in the more specific form that occurs for e-commerce item descriptions. Section 4.3 is about stemming: what it is, why it is important and how it may improve the quality of the dataset after preprocessing. Finally, in Section 4.4, we focus on tag removal and parsing web pages.
4.1 General transformations
The first step, which is usually performed at the beginning, is to make sure that all letters are converted to either upper or lower case.
The first and most common problem in cleansing text is the existence of digits. In traditional information retrieval systems, numbers occurring in words or terms are removed, because typically such a token is not a word. An exception might be made for specific types, like dates, times and other tokens restricted by a regular expression. It is worth mentioning that in specific situations, such as search engines, words containing digits are not removed and are indexed.
Different treatments of punctuation marks, hyphens and other special characters can lead to different, sometimes inconsistent, results, so there is no perfect way to handle them. For example, in English there are words that can be written in different ways by different people: state-of-the-art or state of the art. If we replace hyphens with spaces in the first one, we eliminate the problem of inconsistency. But we cannot simply replace special characters with spaces in every situation - the action that should be executed is not always obvious. Some words may have, e.g., a hyphen as an integral part of the word. In general we use two types of special character removal: (1) simply removing the special character from the original text without leaving anything in place of the removed sign, e.g. state-of-the-art becomes stateoftheart, and (2) replacing the special character with a space. In some situations both forms of the same word have to be stored and used in processing, because determining which form is correct is hard; for example, if we convert pre-processing into pre processing and then look for the word preprocessing, we may receive no results.
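A minimal sketch of the general transformations described above (lowercasing, dropping purely numeric tokens, and keeping both variants of a hyphenated word); the rules and the punctuation set are illustrative assumptions, not the exact rules used in the thesis.

```python
def general_transformations(text):
    """Lowercase, handle hyphenated words both ways and drop bare numbers."""
    text = text.lower()
    tokens = []
    for token in text.split():
        token = token.strip(".,!?;:()\"'")
        if not token:
            continue
        if token.isdigit():          # bare numbers are usually not words
            continue
        if "-" in token:
            # keep both forms: hyphen removed and hyphen replaced by a space
            tokens.append(token.replace("-", ""))
            tokens.extend(t for t in token.split("-") if t)
        else:
            tokens.append(token)
    return tokens

print(general_transformations("State-of-the-art pre-processing, since 2006!"))
# ['stateoftheart', 'state', 'of', 'the', 'art', 'preprocessing', 'pre', 'processing', 'since']
```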
4.2 Stop words removal
Stop words are words that are filtered out prior to natural language processing. Any group of words can be used as stop words, but usually they are words that occur frequently, are insignificant in a language and only help to construct sentences without representing any specific content. For the English language, common stop words include: a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the, these, this, to, was, what, when, where, who, will, with and more.
Besides language stop words, we can define other stop words depending on our needs and the application. For example, we can create a list of words that in specific situations create noise and do not provide any useful information, e.g. words often used in item titles only to attract attention without adding any information.
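A minimal sketch of stop word removal combining a language stop word list with a platform-specific list; both lists below are short, made-up examples, not the lists actually used.

```python
# Short illustrative lists; real lists would be much longer.
LANGUAGE_STOP_WORDS = {"a", "an", "the", "of", "on", "in", "is", "to", "with"}
PLATFORM_STOP_WORDS = {"superb", "brand", "new", "invoice", "promotion"}

def remove_stop_words(tokens, extra_stop_words=frozenset()):
    """Drop language, platform and any extra stop words from a token list."""
    stop = LANGUAGE_STOP_WORDS | PLATFORM_STOP_WORDS | set(extra_stop_words)
    return [t for t in tokens if t not in stop]

print(remove_stop_words(["superb", "brand", "new", "cross", "bike", "with", "invoice"]))
# ['cross', 'bike']
```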
4.3 Stemming
Word formation and declension are part of many languages: based on one word we create another with a very similar meaning, according to the context. For example, in English we can create adjectives and adverbs from nouns, nouns have plural forms, verbs have gerund forms (created by adding -ing), and verbs used in the past tense differ from the present tense. These are treated as syntactic variations of the same word. Such variations can lead to worse results in, e.g., document searching, because a relevant document may contain a variation of a query word but not the exact word. One solution to this problem is stemming.
Stemming refers to the process of reducing words to their stems. A stem is the portion of a word that is left after removing its prefixes and suffixes. In English, most variants of a word are generated by the introduction of suffixes (rather than prefixes), so stemming in English usually means suffix removal, or stripping. In other languages, such as Polish, the stemming process might be different, because the word formation rules of the language are different. Stemming enables different variations of a word to be considered in retrieval, which improves recall.
Many researchers have focused on the advantages and disadvantages of using stemming. Certainly, stemming increases recall and can reduce the size of the indexing structure. However, it can make precision worse, because many irrelevant documents may be considered relevant. Although many experiments have been conducted, there is still no simple answer to the question of whether one should use stemming or not - it depends on the dataset, so the results of stemming should always be checked to measure its usefulness.
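As a small illustration for English (the thesis itself uses the Polish Morfologik library, described in Section 5.2), the NLTK Porter stemmer reduces word variants to a common stem; this is only an assumed example, not the preprocessing used for the dataset.

```python
from nltk.stem import PorterStemmer  # requires the nltk package

stemmer = PorterStemmer()
words = ["connection", "connected", "connecting", "connects"]
print([stemmer.stem(w) for w in words])
# all four variants are reduced to the same stem, e.g. 'connect'
```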
4.4 Tags removal
If we have to process documents that are not pure text, for example web pages or XML documents, then the approaches presented before might not be enough. There are different types of preprocessing that might be useful; we focus only on the most common ones and those suitable for our dataset.
Tag removal (whether HTML or XML tags) can be dealt with similarly to punctuation or hyphen removal. There is, however, one issue that needs careful consideration. Tags make up the structure of the document, and their removal destroys this structure and makes the document inconsistent. For example, in an HTML document of a typical commercial page, information is presented in many rectangular blocks or layers; simply removing the HTML tags may cause problems by joining text that should not be joined. The problem is worse with XML documents, where by removing XML tags we can lose semantic information and merge two different strings into one text.
In this chapter we have presented some basic methods of text preprocessing, which are further used in the data preparation and normalization phase.
Chapter 5
Context-aware recommender system
As we have mentioned before in Chapter 2, we can specify different meanings of context. In this chapter we describe the idea of an item type context and how this idea can be used in recommendations. Section 5.1 introduces the concept of this method and some information about the implementation. After that, in Section 5.2, we present how text data are processed to extract features. Section 5.3 shows a simple method of context creation based only on term frequencies, and its evaluation. The next section, Section 5.4, presents the Latent Semantic Indexing idea as a method to find connections inside the data and, based on that, to build item contexts. Lastly, we propose network-based context creation in Section 5.5.
5.1 General idea
In typical applications of context-aware recommender systems, we usually define context as a set of features that describe the user's or item's environment, such as location, mood or preferences. According to that, if we are looking for a new car, we will receive cars that are being sold in the near neighborhood. Assuming that we have bought this car, the next thing recommended to us by a typical content-based or collaborative filtering recommender system would be another car, which is unreasonable. This example may be very particular, but it perfectly describes the disadvantage of a typical recommender system. More likely, after we have bought a car we would be looking for additional tires, windscreen wiper blades, motor oil or an audio system. So rather than buying the same type of item, we would probably be looking for complementary items, which form the set of items we might need, connected with the main item. Imagine we have bought a new coffee machine; instead of recommendations of other coffee machines, we would be more interested in equipment that works with our new coffee machine, such as filters and coffee. We might also be interested in buying a coffee grinder, new cups or spoons. All these items create the set of connected products that all together create the context of a product.
Having these examples in mind, we now define what the context of items is. If we have a set of item types, in which each type is similar to the other types from the same set, then we call this set of types a context of items. As we can see, the context is based not on the items themselves but rather on the categories of items.
The recommendation process is quite similar to a typical recommender system workflow; it is presented in Figure 5.1.

Figure 5.1: Context-aware recommendation process (viewing item A → find the type of item A → find contexts for the found item type → retrieve items from the contexts → filter items with constraints → rank items → recommendation results).

As we can see, everything starts from the item that is currently viewed by the user. For this item, instead of looking for other similar items, we determine its type - each item has exactly one type, but every item type may belong to an unlimited number of contexts. For example, a coffee machine may occur in the context of coffee equipment, together with coffee, grinders and thermoses, but also as a piece of kitchen equipment, together with microwaves and dishwashers. At this point it is hard to determine, based only on the viewed item, for which particular reason this item has been viewed. After the contexts are determined, in general, we retrieve all items from these contexts. Not all items from a context would be appropriate, so item filtering is performed, and the items that remain after this process are ranked in the next step. In comparison, in a typical content-based recommender system we would perform filtering and ranking on the items from the whole dataset rather than on a subset created by the context.
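The workflow of Figure 5.1 can be summarized as a skeleton like the one below; every helper function is a hypothetical placeholder for the components discussed in the following sections (item-type lookup, context creation, filtering and ranking).

```python
def recommend(viewed_item, item_type_of, contexts_of, items_in, passes_filters, rank, n=5):
    """Skeleton of the context-aware recommendation process from Figure 5.1."""
    item_type = item_type_of(viewed_item)        # every item has exactly one type
    contexts = contexts_of(item_type)            # a type may belong to many contexts
    candidates = [item
                  for context in contexts
                  for item in items_in(context)]  # retrieve items from the contexts
    candidates = [item for item in candidates
                  if passes_filters(viewed_item, item)]  # filter with constraints
    return rank(viewed_item, candidates)[:n]     # rank and return the top-n items
```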
The recommendation process is simple in itself, but many difficult problems are hidden behind it. We highlight five of them and describe them shortly. The first and main problem is how to create contexts, and whether it should be performed on-line or off-line. Creating a context is a kind of data-mining clustering, where we are trying to find groups of similar items taking some constraints into account. Because of the size of the dataset, performance issues are more than important. It is impossible to perform typical agglomerative clustering on a dataset of 21 million items in reasonable time; on the other hand, we are looking for items available at the time of viewing, and if a new type of item occurs in the dataset we want to receive results for it as well. These reasons show that some tasks should be performed on-line and others off-line to provide optimal recommendation results. The second problem is how to select the most relevant items from the previously selected contexts. The number of items can still be huge, and from this subset we would like to separate only those which, e.g., are compatible with the currently viewed item. The next problem, similar to the filtering problem, is ranking items. A lot of effort has been put into this area, but it is still a complex problem how to pick the top 5 items for which the probability of the user taking an interest in viewing them will be very high. Text processing is also one of the problems that occur when working with content-type data; the results of the other steps may depend on how well text processing has been performed. The last problem, which is an effect of the methods used to solve every step, is performance. Recommendations must be not only precise but also fast, so all solutions must be prepared taking the execution time into consideration.
5.2 Text preprocessing
Since all items in the dataset are mainly text, a unified procedure for their preprocessing was prepared. It is a fairly typical natural language processing workflow, presented in Figure 5.2 and described below. This procedure is used with every text field that occurs in the dataset model.

Figure 5.2: Text preprocessing workflow (original text → lowercasing → tags stripping (optional) → tokenization → text stemming → text cleansing → removal of language, platform-specific and other stop words → word length filtering → processed text).

The first step of this procedure is lowercasing the string. The reason for this step is quite obvious: if we have two strings TExt and text, then without lowercasing a string comparison would conclude that these two words are different; lowercasing makes them the same.
If we have to handle texts that contain different types of tags, such as HTML or XML tags, then we perform tags stripping, which simply means removing any kind of tags from the original text. It is useful if we have to process descriptions full of HTML tags that only format the description and place some images inside; after stripping we obtain plain text. As mentioned in Chapter 4, sometimes tag removal merges two parts of text that should be separated and together they do not provide any useful information. Because we have two types of text fields - titles and descriptions - we apply tag removal to descriptions and not to titles.
Tokenizing text means splitting sentences into words. Normally it is done by splitting on spaces, but we also split text into words on other special characters. This is helpful when we have titles like great ***new*** bike: after tokenizing only on spaces we would receive great, ***new***, bike, but after tokenizing also on other special characters we would have the words great, new, bike, which is what we wanted to achieve. Unfortunately, there is a possibility of losing some information here. Imagine that the title of an item contains words like C-3PO or R2-D2 which are, e.g., model numbers. If we tokenize on hyphens, then we are going to receive the words C, 3PO, R2, D2 and we have lost information.
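A minimal sketch of tokenization on spaces and other special characters, with an assumed exception list protecting tokens such as model numbers (the exact exceptions used in practice would have to be defined per dataset):

```python
import re

# Tokens that must not be split on special characters (assumed examples).
PROTECTED = {"c-3po", "r2-d2"}

def tokenize(text):
    """Split on spaces first, then on special characters, keeping protected tokens."""
    tokens = []
    for raw in text.lower().split():
        if raw in PROTECTED:
            tokens.append(raw)
        else:
            # split on anything that is not a letter or digit
            tokens.extend(t for t in re.split(r"[^\w]+", raw) if t)
    return tokens

print(tokenize("great ***new*** bike"))   # ['great', 'new', 'bike']
print(tokenize("droid R2-D2 figure"))     # ['droid', 'r2-d2', 'figure']
```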
The most important step, which has the biggest influence on the results of preprocessing, is text stemming. There is no universal method, because it depends on the language and the available libraries. In general, the goal is to reverse, e.g., word formation back to the root word while keeping the meaning in context. Since we are working with a Polish dataset, we use a library called Morfologik, which does its task quite well.
Once our titles and descriptions are split into words and stemmed, we can sometimes find in our set of words terms like R2, 3PO or 12345, which usually cannot be treated as normal words; they are just mistakes, part numbers, or information about 24-hour shipping or service. If we are looking for the words that describe an item best, words with numbers will probably not provide any useful information. At this stage we also remove verbs from descriptions: since verbs describe an action rather than the item, we find them useless, because they usually appear in phrases like See it now! and do not add any information.
After we have obtained terms that are likely to be existing words, it is time to remove from this set the words whose role in the language is only to create and connect other words. Usually this is done by applying a stop word list; some examples of English stop words were presented in Chapter 4. In general, stop word removal is meant to remove terms that we consider useless for some reason and that only add noise to the text. In our application we have also decided to analyze words used by sellers only to emphasize and highlight titles. For example, if we have a title like Superb brand new cross bike - invoice and words like superb, brand, invoice are used by a lot of sellers, then they do not give any important information about the item at all and only disrupt the whole title. Other types of stop word lists can also be used, more specific to the platform, for example to exclude usernames or platform names.
At the end, the length of the terms is checked. A one-letter word is most likely not a correct word, and sometimes even two-letter words do not exist, so they can be removed from the final set of words.
It must be said that text preprocessing is a very important part of the whole processing, because it can change the quality of the output. It should certainly be performed very carefully, and there is neither a simple rule nor a setting that will work with any dataset.
5.3 Category-based context creation

5.3.1 Description
In typical e-commerce platforms all goods are labeled with a category. In most cases the category describes the type of the item. Categories usually form a category tree, where the deeper we go, the more detailed the type becomes. An example of a category tree can be found in Figure 5.3. Although items in the same category create contexts, those contexts are not the ones we defined previously: in our definition, the sets of items found in the same category are the groups that contexts are built from. So, for example, if we focus on the context of coffee, the category where we can find coffee machines is only one of many elements that create the coffee context. Those elements are usually placed in different categories, which may not even share the same root of the tree, yet they can construct the context.

Figure 5.3: Example of a category tree structure, with top-level categories such as Clothes, Shoes & Accessories (e.g. Women's Clothing, Men's Clothing, Trousers, Socks, T-Shirts, Bodies, Lingerie & Nightwear, Bras & Bra Sets) and Music (e.g. CDs, Records, 7" Singles, 12" Singles).
Every category can be described by the items placed inside it. Since every item is characterized by a title, a description and a set of attributes, all of this can be transformed into a set of keywords characterizing one item. Having that, we can create a description of the category. Because a description created directly from the items would be very long and pointless, we count the occurrences of each term and create an ordered list of frequencies. We can say that the most relevant terms describe the category best, and they form the new category description. The category description obtained this way can be treated as a virtual item that represents all items of the selected category and can be used as an element of a context.

Having those virtual items, they have to be connected to create a context. Because their descriptions are bags of keywords, simple cosine similarity can be used to calculate a similarity value; the virtual items can then be ranked by those values to find the most similar ones, which are the candidates from which the context is built.

After that, when the context candidates are selected, the real items from the corresponding categories are picked for further filtering.
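A minimal sketch of the category-based approach described above: each category is turned into a virtual item (a term-frequency bag built from the keywords of its items) and the categories most similar by cosine similarity become the context candidates. The category names and keyword lists are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical categories with the keyword lists of their items.
categories = {
    "coffee_beans":    ["coffee", "arabica", "ground", "coffee", "vanilla"],
    "coffee_machines": ["coffee", "machine", "espresso", "pressure"],
    "grinders":        ["grinder", "coffee", "steel", "burr"],
    "volleyballs":     ["volleyball", "ball", "beach"],
}

def virtual_item(keywords, top_n=50):
    """Category description: the most frequent terms with their counts."""
    return dict(Counter(keywords).most_common(top_n))

def cosine(v, w):
    dot = sum(v[k] * w[k] for k in v.keys() & w.keys())
    norm = math.sqrt(sum(x * x for x in v.values())) * math.sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0

virtual_items = {name: virtual_item(kw) for name, kw in categories.items()}

def context_candidates(category, k=2):
    """The k categories most similar to the given one form its context candidates."""
    ranked = sorted(((other, cosine(virtual_items[category], v))
                     for other, v in virtual_items.items() if other != category),
                    key=lambda x: -x[1])
    return ranked[:k]

print(context_candidates("coffee_beans"))
```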
5.3.2 Experiment
The accuracy of the presented method has been validated by selecting different items from different categories and then manually evaluating the correctness, according to intuition, and the usefulness of the recommended items or categories/contexts. To achieve that, we selected 7 items of different types and from different categories, and for every item we prepared 4 recommendations. More information about these items can be found in Appendix A.1. Because items have been recommended randomly according to the calculated contexts, constraints and filters, examples of them can be found in Appendix A.2.
The first item (I1) for which we looked for recommendations is a bag of coffee. As a result we received another bag of coffee (R11), a coffee grinder (R12), a thermal cup (R13) and a set of cups for coffee and tea. The categories of these items are varied, which means that their descriptions have been quite good and they have probably been connected by the coffee keyword. Also, because of the variety of categories, it all together creates a context of items - the recommended items are not of the same type as the viewed item.
The second item (I2) is also connected with coffee: it is a coffee machine. As a result we got 4 items of the same kind - coffee machines. The only thing that distinguishes them is the category. It means that for the category of the viewed item the most important word is not a word connected with coffee, as could be expected, but probably some other term connected with this machine. The recommender system has acted like a typical one, serving items of the same type. Because our assumptions were different, the results are not acceptable.
The third item (I3) is a specific type of bicycle. In this case our result set contains a cover for a bike (R31), a kids' bike (R32), a bicycle (R33) and a book about cycling (R34), since we are not looking only for items of the same type but also for others. Here the book about cycling and the bike cover can be treated as complementary items, and the bicycle (R33) as a suggestion of another bike. It is hard to determine whether the bike for children is adequate, because a 6-7 year old child is probably not looking for a bike themselves; more likely a parent is doing this, and in that case it would be more like looking for a gift, but we were not originally looking for a kids' bike. Nevertheless, those recommendations partially build a context.
The fourth item (I4) is a dress. In this case we got 4 other dresses, but when we analyze their categories we can see that the original item is an adult dress while the recommended ones are for children. Although we received recommendations of relevant items because of the keyword, they do not create a valid context and are not useful. The reason is probably the type of items we have in the dataset, but it shows that even if we have quite strong keywords in the category descriptions, the connections between categories might be constructed badly.
A volleyball is our fifth item (I5) that we checked. In the result set we can find an American football (R51), a rugby ball (R52), a soccer video game (R53) and a garden ball toy (R54). Although all these items are related to balls, it is hard to say they are from the same context. Item R53 is certainly not connected to the viewed item; it is in the results because of phrases that exist in the Polish language. The other items are strongly connected with balls, and if we are looking for a ball then the context has been constructed quite well, but we think it is more likely that we are looking for things connected with volleyball, and the context should be created around this concept.
The sixth item (I6) is Apple's MP3 player, and we got two other MP3 players and two chargers for this kind of device. The context is very narrow, but we have been suggested other devices and additional equipment for the currently viewed one.
The last item (I7) is Apple's laptop, and as a result we have been offered two covers for Apple's computers. The items are connected by the manufacturer but are not relevant in this case.
As we can see, recommending items by creating contexts between categories/item types may vary in its results. It all depends on how well the items are described and what category descriptions they create. Because of that, this method is also unstable and can be easily manipulated - in this respect it behaves like a naive Bayes classifier. This approach also does not work well with categories whose keywords are very general and common and can be used in many contexts. The reason for that is the use of single keywords - using n-grams instead of single words to describe categories can result in more accurate results in some cases, but in others it can narrow the context. These cons may be pros in some particular situations, for example if we have items with a manufacturer name and we are looking for compatible parts or equipment. Another advantage of this method is its low complexity and execution speed. Collecting category descriptions can be done off-line periodically, and it is not a very complex task. Also, searching for categories to create a context can be done efficiently when we use indexers such as Lucene/Solr.
5.4 Latent Semantic Indexing-based context creation

5.4.1 Description
Typically, when we talk about calculating similarity between documents, the most common method is cosine similarity based on term frequency. This is a good method if documents about the same thing use the same words; however, this situation does not occur too often, and many concepts or objects can be described in multiple ways (using different words) due to the context and people's language habits. If our query uses words different from the words used in a document, the document will not be retrieved although it may be relevant, because the document may use some synonyms of the query words. This results in low recall. For example, "picture", "image" and "photo" are synonyms in the context of digital cameras. If the user query only has the word "picture", relevant documents that contain "image" or "photo" but not "picture" will not be retrieved.
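To make the vocabulary-mismatch problem concrete, the minimal sketch below (our own illustration, not part of the thesis pipeline; the example sentences are hypothetical) computes plain term-frequency cosine similarity and shows that a query using only "picture" has similarity 0 with a text that only says "photo":

```python
import numpy as np
from collections import Counter

def tf_vector(tokens, vocabulary):
    # term-frequency vector over a fixed vocabulary
    counts = Counter(tokens)
    return np.array([counts[t] for t in vocabulary], dtype=float)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b) / denom

doc1 = "digital camera takes a sharp picture".split()
doc2 = "digital camera takes a sharp photo".split()
vocab = sorted(set(doc1) | set(doc2))

print(cosine(tf_vector(doc1, vocab), tf_vector(doc2, vocab)))   # high, shared words match
print(cosine(tf_vector(["picture"], vocab),
             tf_vector(["photo"], vocab)))                      # 0.0 - synonyms never match
```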
Latent Semantic Indexing (LSI), proposed in [9], tries to deal with this problem by identifying statistical associations of terms. It is assumed that there is some latent semantic structure in the data that is hidden by the randomness of word choice. To mine this latent structure and remove noise, a statistical technique called singular value decomposition (SVD) is used. This structure is also called the hidden concept space, and it connects syntactically different but semantically similar documents.
Let D be the text collection, m the number of distinct words in D, and n the number of documents in D. As input, LSI takes an m × n term-document matrix A. Documents are represented as columns of A and words are represented by rows. The matrix is usually computed using term frequency, but TF-IDF can also be used. Every cell of matrix A, denoted by A_ij, stores the number of occurrences of word i in document j.
SVD is used in LSI to factorize matrix A into three matrices:

A = U Σ V^T

where U is an m × r matrix whose columns are eigenvectors associated with the r non-zero eigenvalues of A A^T. Moreover, the columns of U are unit orthogonal vectors, which means U^T U = I.
V is an n × r matrix whose columns are eigenvectors associated with the r non-zero eigenvalues of A^T A. The columns of V are also unit orthogonal vectors, which means V^T V = I.
Σ is an r × r diagonal matrix, Σ = diag(σ1, σ2, ..., σr), σi > 0. The values σ1, σ2, ..., σr are the positive square roots of the r non-zero eigenvalues of A A^T. They are ordered decreasingly, i.e., σ1 ≥ σ2 ≥ ... ≥ σr > 0.
One of the features of SVD is that we can delete some insignificant dimensions in the transformed space to optimally approximate matrix A. The significance of the dimensions is indicated by the magnitudes of the singular values in Σ. Let us use only the k largest singular values in Σ and set the remaining small ones to zero. The approximated matrix of A is denoted by A_k. We also have to reduce the size of the matrices Σ, U and V by deleting the last r − k rows and columns from Σ, the last r − k columns in U and the last r − k columns in V. As a result we obtain

A_k = U_k Σ_k V_k^T

which means that we use the k largest singular triplets to approximate the original matrix A. The new space is called the k-concept space. Figure 5.4 shows the original and reduced matrices schematically.
[Figure 5.4 is a diagram: the term-document matrix A (m × n) and its approximation A_k are written as the product U × Σ × V^T, with U of size m × r, Σ of size r × r and V^T of size r × n; the first k columns of U, the k × k top-left block of Σ and the first k rows of V^T form the reduced factors U_k, Σ_k and V_k^T.]
Figure 5.4: Schematic representation of singular value decomposition of the
matrix A.
Latent Semantic Indexing does not reconstruct the original matrix A perfectly. The truncated SVD captures most of the important basic structure in the association of terms and documents and at the same time removes the noise caused by the variability of word usage [16].
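As a minimal sketch of the truncated decomposition A_k = U_k Σ_k V_k^T (our own illustration with numpy; the matrix A and the value of k are assumed to come from the preprocessing described above):

```python
import numpy as np

def lsi_decompose(A, k):
    """Return U_k, Sigma_k, V_k for an m x n term-document matrix A."""
    # full_matrices=False yields U (m x r), s (r,), Vt (r x n) with r = min(m, n)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]              # keep the first k columns of U
    Sigma_k = np.diag(s[:k])    # k x k diagonal matrix of the largest singular values
    V_k = Vt[:k, :].T           # rows of V_k correspond to documents
    return U_k, Sigma_k, V_k

# Example with a random stand-in for a 34 x 13 term-document matrix and k = 2
A = np.random.randint(0, 3, size=(34, 13)).astype(float)
U_k, Sigma_k, V_k = lsi_decompose(A, k=2)
A_k = U_k @ Sigma_k @ V_k.T     # rank-k approximation of A
```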
In general, Latent Semantic Indexing is used to reveal hidden connections between texts about the same thing when different phrases are used. This characteristic can be used to create contexts of items: all items similar to each other belong to the same concept space, which means that they represent the same context.
5.4.2 Experiment
To check how the LSI approach works for items available in the dataset, we have selected 13 items from three categories: Bras, Socks and Bra Accessories. All these categories have a common ancestor category, which is Clothes. This example has been selected because of the two synonymous names commonly used in the Polish language for a bra - stanik and biustonosz - which allow a simple evaluation of whether the LSI method is working or not. The type of each item is presented below in table 5.1; more information about the items is given in Appendix B.1.
No  | Category        | Item type
D1  | Socks           | Suit socks
D2  | Socks           | Sport socks
D3  | Socks           | Sport socks
D4  | Bras            | Bra
D5  | Bras            | Bra
D6  | Bras            | Bra
D7  | Bra accessories | Bra accessories
D8  | Socks           | Socks
D9  | Bras            | Bra
D10 | Bras            | Bra
D11 | Bras            | Sport bra
D12 | Bras            | Sport bra
D13 | Bras            | Sport bra (only manufacturer name)
Table 5.1: Selected items and their types.
The first step of the experiment is creating the term-document frequency matrix, where item titles are treated as documents. Every title is tokenized, all non-words are removed and stemming is performed. After this preprocessing of every title, we create the set of words from all titles and count their occurrences in the documents; the results are put into matrix A. For the presented example, the term-document frequency matrix A can be found in Appendix B.2.
With the matrix prepared this way, we can now apply the SVD decomposition to it. As a result we receive three matrices: U (size 34 × 13), Σ (size 13 × 13) and V (size 13 × 13), which are enclosed in Appendix B.3.
The next step of Latent Semantic Indexing is choosing the value of k which is used to prune the matrices. After analyzing matrix Σ we have decided to evaluate with k = 3 and k = 2, because the drop of the values on the diagonal after k = 3 is relatively big. We have received better results with k = 2 and those are the results presented below; we only highlight differences between the results for k = 2 and k = 3. With the given value of k we create new matrices: U_k of size 34 × k, Σ_k of size k × k and V_k of size 13 × k.
To evaluate this method we have prepared 6 queries built from the available words. These queries are itemized below and their query vectors are shown in Appendix B.4:
• Q1: biustonosz,
• Q2: biustonosz sport,
• Q3: push up,
• Q4: skarpeta,
• Q5: sport,
• Q6: stanik.
Each query word vector is now transformed using the matrices created in the SVD process by the equation:

v_k = q^T U_k Σ_k^{-1}

where:
q^T is the transposed vector that represents a single query,
U_k is matrix U after resizing to k columns,
Σ_k^{-1} is the inverse of matrix Σ after resizing to k × k dimensions,
v_k is the resulting vector of the query.
Now, to evaluate whether Latent Semantic Indexing is working, we calculate the similarity between the vector v_k and every document, represented by a row of the V_k matrix. This can be done using cosine similarity:

sim(v_k, V_k^(i)) = (v_k · V_k^(i)) / (||v_k|| ||V_k^(i)||),  i = 1, 2, ..., d

where:
d is the total number of documents,
V_k^(i) is the i-th row of the V_k matrix.
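A small sketch of this query folding and scoring (our own illustration; the variable names are ours, and U_k, Sigma_k, V_k are assumed to come from the decomposition above):

```python
import numpy as np

def fold_in_query(q, U_k, Sigma_k):
    """Project a raw term-count query vector q (length m) into the k-concept space."""
    return q @ U_k @ np.linalg.inv(Sigma_k)        # v_k = q^T U_k Sigma_k^{-1}

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_documents(q, U_k, Sigma_k, V_k):
    """Return (document index, similarity) pairs sorted by decreasing similarity."""
    v_k = fold_in_query(q, U_k, Sigma_k)
    sims = [(i, cosine(v_k, V_k[i])) for i in range(V_k.shape[0])]
    return sorted(sims, key=lambda x: -x[1])
```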
In our case, results for every query and document are presented in table 5.2. Queries are represented by columns and documents by rows.
        Q1      Q2      Q3      Q4      Q5      Q6
D1      0,01    0,39   -0,13    0,99    0,72   -0,15
D2     -0,06    0,32   -0,21    0,99    0,66   -0,23
D3      0,23    0,58    0,08    0,96    0,85    0,06
D4      0,99    0,87    0,99   -0,11    0,62    0,99
D5      0,99    0,91    0,99   -0,04    0,67    0,99
D6      0,99    0,88    0,99   -0,10    0,63    0,99
D7      0,98    0,84    0,99   -0,19    0,56    0,99
D8      0,26    0,61    0,11    0,96    0,87    0,09
D9      0,98    0,83    0,99   -0,20    0,55    0,99
D10     0,99    0,86    0,99   -0,14    0,59    0,99
D11     0,96    0,98    0,92    0,22    0,85    0,91
D12     0,92    0,99    0,86    0,35    0,91    0,85
D13     0,73    0,94    0,62    0,66    0,99    0,61
Table 5.2: Similarities between documents and queries.
According to the results, Query 1, which uses one of the names for a bra, also finds the relevant documents that use the second name. Even though this name has not been used in document D13, the similarity is still quite high. Bra accessories have also been included in the group of similar items. The similarity between this query and socks is close to zero in 2 of the 4 cases; the other cases have a similarity of about 0.25, probably because of the occurrence of the word sport in the titles, which is used not only for bras but also for socks. Query 2 has checked this hypothesis: adding the word sport to the query slightly increases the similarities between the query and the items from the socks group. Query 3 has checked how the similarities look if we are looking for a push-up bra, which is one of the subtypes of bra; in the results we can see that it has a high similarity value for almost all bra items. Query 4 shows the similarities for socks items and how they differ from bra items; however, document D13 has a high similarity value because it uses words that also occur in the titles of socks items, like sport and elegancki. Query 5 has tested how using common words like sport affects the similarities; in this case it does not distinguish the items well. The last query, Query 6, is similar to Query 1 but uses the second word for bra, and the results are very similar.
Using the same equations we can also measure the similarities between the queries themselves; they are presented in table 5.3.
      Q1     Q2     Q3     Q4     Q5     Q6
Q1    1,00   0,92   0,98  -0,01   0,69   0,98
Q2           1,00   0,85   0,36   0,91   0,84
Q3                  1,00  -0,16   0,58   0,99
Q4                         1,00   0,70  -0,18
Q5                                1,00   0,56
Q6                                       1,00
Table 5.3: Similarities between queries.
These results also confirm the strong connections between some types of items and their keywords. Query 1 gets similar results to Query 2, Query 3 and Query 6, which are all about bras. Query 4 has the strongest connection with Query 5, but Query 5 is an example of a query about the very general and widely used word sport, so its similarity is above average with every query.
Another way to compare the results of LSI is to calculate document-document similarities. The similarities between documents before applying LSI are presented in table 5.4 and those after applying LSI in table 5.5.
      D1     D2     D3     D4     D5     D6     D7     D8     D9     D10    D11    D12    D13
D1    1,00
D2    0,38   1,00
D3    0,43   0,50   1,00
D4    0,00   0,00   0,00   1,00
D5    0,00   0,00   0,00   0,22   1,00
D6    0,00   0,00   0,15   0,18   0,00   1,00
D7    0,00   0,00   0,00   0,20   0,00   0,54   1,00
D8    0,00   0,38   0,21   0,00   0,00   0,00   0,00   1,00
D9    0,00   0,00   0,00   0,00   0,00   0,20   0,22   0,00   1,00
D10   0,00   0,00   0,00   0,25   0,00   0,23   0,25   0,00   0,00   1,00
D11   0,00   0,00   0,16   0,40   0,22   0,00   0,00   0,25   0,00   0,00   1,00
D12   0,00   0,00   0,21   0,25   0,28   0,00   0,00   0,33   0,00   0,00   0,51   1,00
D13   0,33   0,00   0,21   0,00   0,00   0,00   0,00   0,33   0,00   0,33   0,25   0,33   1,00
Table 5.4: Similarities between documents before applying LSI.
      D1     D2     D3     D4     D5     D6     D7     D8     D9     D10    D11    D12    D13
D1    1,00
D2    0,99   1,00
D3    0,97   0,95   1,00
D4    0,09   0,17   0,13   1,00
D5    0,01   0,09   0,20   0,99   1,00
D6    0,07   0,15   0,14   0,99   0,99   1,00
D7    0,16   0,24   0,05   0,99   0,98   0,99   1,00
D8    0,96   0,94   0,99   0,16   0,23   0,17   0,08   1,00
D9    0,17   0,25   0,04   0,99   0,98   0,99   0,99   0,07   1,00
D10  -0,11  -0,19   0,10   0,99   0,99   0,99   0,99   0,13   0,99   1,00
D11   0,25   0,17   0,46   0,93   0,96   0,94   0,91   0,49   0,90   0,92   1,00
D12   0,38   0,30   0,57   0,88   0,91   0,89   0,84   0,60   0,84   0,87   0,99   1,00
D13   0,68   0,62   0,82   0,66   0,71   0,67   0,60   0,84   0,59   0,64   0,88   0,93   1,00
Table 5.5: Similarities between documents after applying LSI.
We can see how the similarities have changed and how connections that had been hidden have appeared, e.g. between D1 and D8 or between D4 and D9. Weak connections have also been strengthened; comparing the corresponding values in tables 5.4 and 5.5 shows these specific changes in the similarities.
Latent Semantic Indexing is a method that could be used to create clusters of similar items. It connects items together by discovering and exposing the latent context of items. However, this method also has some disadvantages. The main disadvantage is the complexity of SVD, which is O(nm^2) or O(n^2 m). Computing such big matrices, with dimensions in the millions, is practically impossible in the original form. Another problem that may occur is noise caused by words which connect a lot of items although the term itself is too general or does not mean anything.
5.5 Network-based context creation

5.5.1 Description
The network-based context creation approach mixes the two approaches presented in Section 5.3 and Section 5.4 and minimizes their disadvantages. The main disadvantage of category-based context creation is that it cannot find contexts more specific than a category - like a sub-context - or more general - like a context whose items are distributed across many categories. This can be achieved using LSI, but LSI has its own disadvantages, like complexity, and it does not produce typical contexts as a result. In network-based context creation we use the tag-cloud property from category-based context creation together with information about how those tags are connected with each other. Those connections create a structured network that keeps hidden information about contexts. Dividing this network into subnetworks gives us more and more concrete contexts. There are two ways of creating this network: (1) taking all items from the dataset and analyzing them all together, or (2) analyzing items per category in order to create a small network for each category and then merging them into one network. Selecting the first solution would not differ from the LSI-based approach, and because of that we have decided to divide the computations per category. For every context corresponding to a subnetwork we look for terms that represent items from this context. Those terms can be treated as labels of the context and of the items from this context.
Internally, this approach uses network connectivity, network modularity, the PageRank algorithm, the Hyperlink-Induced Topic Search (HITS) algorithm and an inverted index, which we briefly present before the details of the approach itself.
In an undirected graph G, two vertices u and v are called connected if
G contains a path from u to v. Otherwise, they are called disconnected.
A graph is said to be connected if every pair of vertices in the graph is
connected. A connected component is a maximal connected subgraph of G.
Each vertex belongs to exactly one connected component, as does each edge
[2].
The term modularity has many different meanings depending on the field and context in which it is used. We focus on the meaning of modularity in terms of network and graph analysis. According to [4], modularity is a measure of the structure of a network or graph. It measures how likely the graph is to be divided into modules (groups, clusters). Graphs with a high modularity value have nodes strongly connected within the same module and weakly connected with nodes from different modules. The modularity is, up to a multiplicative constant, the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random [19]. The modularity can be either positive or negative, which makes it possible to search for module structure precisely by looking for divisions of a network that have positive modularity values. A precise mathematical formulation of modularity for networks with two or more modules can be found in [19].
PageRank is a link analysis algorithm that assigns a numerical weight to each element of a connected set of documents, with the purpose of measuring its relative importance within the set. The algorithm may be applied not only to Web pages, which were its first application, but to any collection of entities in which connections exist. More precisely, PageRank is a probability distribution used to represent the likelihood that we will be interested in seeing specific documents. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that at the beginning of the computational process the distribution is evenly divided among all documents in the collection. The PageRank computation requires several passes through the collection to adjust the approximate PageRank values so that they reflect the theoretical values more closely. A probability is expressed as a numeric value between 0 and 1 [5]. More information about the PageRank algorithm can be found in [21].
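For illustration, a minimal power-iteration sketch of PageRank over an adjacency list (our own simplified version; in practice an existing implementation, e.g. the one in networkx, would be used):

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict node -> list of neighbour nodes; returns dict node -> score."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}           # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = links[v]
            if not out:                           # dangling node: spread its rank evenly
                for u in nodes:
                    new_rank[u] += damping * rank[v] / n
            else:
                for u in out:
                    new_rank[u] += damping * rank[v] / len(out)
        rank = new_rank
    return rank

# toy example
print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```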
Hyperlink-Induced Topic Search (HITS), also known as hubs and authorities, is a link analysis algorithm that originally rated networks of Web pages. The idea behind hubs and authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming: certain web pages, known as hubs, served as large directories that were not actually authoritative in the information they held, but were used as compilations of a broad catalog of information that led users directly to other, authoritative pages. In other words, a good hub is a page that points to many other pages, and a good authority is a page that is linked by many different hubs. The scheme therefore assigns two scores to each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages [3]. More information about the HITS algorithm and the calculation of its values can be found in [14, 15].
The inverted index of a document collection is basically a data structure that attaches to each distinct term a list of all documents that contain the term. Given a set of documents D = {d1, d2, ..., dN}, each document has a unique identifier (ID). An inverted index consists of two parts: a vocabulary V, containing all the distinct terms in the document set, and, for each distinct term ti, an inverted list of postings. Each posting stores the ID (denoted by idj) of a document dj that contains term ti, together with other pieces of information about term ti in document dj [16].
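A minimal sketch of such an inverted index (our own illustration; the postings here store only the document ID and the term frequency):

```python
from collections import defaultdict

def build_inverted_index(documents):
    """documents: dict doc_id -> list of tokens. Returns term -> list of (doc_id, tf)."""
    index = defaultdict(dict)
    for doc_id, tokens in documents.items():
        for token in tokens:
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}

docs = {1: ["stanik", "push", "up"], 2: ["skarpeta", "sport"], 3: ["biustonosz", "sport"]}
index = build_inverted_index(docs)
print(index["sport"])   # [(2, 1), (3, 1)]
```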
Context creation based on network analysis can be divided into 5 steps: (1) analyzing the category network, (2) analyzing possible connections between categories, (3) simplification of the category network, (4) item mapping and (5) network analysis of the general network. The scheme of context creation based on network analysis is presented in figure 5.5. We now describe all these steps more precisely.
[Figure 5.5 shows the workflow: for every category, create a network with terms as nodes and find the most relevant terms in the category; find duplicated terms; simplify the network; map items; merge the networks; find contexts; re-map items.]
Figure 5.5: Schematic workflow of network-based context creation approach.
Analyzing all items from the dataset is a very complex task which requires a lot of resources. Since all items are divided into categories, their analysis can also be divided and performed separately for each category. In this step we want to obtain, as a result, words that describe all (or most) items from the selected category. We are also interested in the connections between those words, which correspond to their similarity. To achieve that, the first thing we have to do is calculate the term frequency (TF) matrix for the items from the current category. Instead of processing all items it is possible to use a representative subset of items, e.g. 10000 items; the reason is the SVD which we perform in the next step, while applying LSI to the items of the category. Latent Semantic Indexing is parametrized by the dimension to which we want to reduce our matrices, and this k-value can be set manually or automatically,
e.g. so as to retain 99% of the variance of the Σ matrix, which means:

(σ_1 + σ_2 + ... + σ_k) / (σ_1 + σ_2 + ... + σ_n) ≥ 0.99
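A small sketch of choosing k by this rule (our own illustration; the singular values in the example are the ones from the Σ matrix of the previous experiment, truncated for brevity):

```python
import numpy as np

def choose_k(singular_values, threshold=0.99):
    """Smallest k whose leading singular values account for the given share of the total."""
    s = np.asarray(singular_values, dtype=float)
    ratios = np.cumsum(s) / np.sum(s)
    return int(np.searchsorted(ratios, threshold) + 1)

print(choose_k([3.71, 3.17, 2.89, 2.21, 2.01]))  # retention rule described above
```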
Next, we create queries as was done in Section 5.4.2, but for every term that occurs in the TF matrix. After that we calculate the cosine similarities between the queries; as a result we have an m × m matrix M with values between 0 and 1. Before we use this matrix as the input matrix for graph creation, we multiply every cell of the matrix by the corresponding frequency of the term from the TF matrix. The reason for this is that we want to add a word-popularity factor and distinguish words whose similarities might be the same but where one word should be more important than the other. It is worth noting that we do not need to fill all cells: since similarity is symmetric, a lower or upper triangular matrix is enough to create the undirected network, which we now build from matrix M. Once we have the network corresponding to matrix M, we calculate the modularity for every connected component. If the modularity value is greater than or equal to some value p, we split the input graph (network) into subgraphs (subnetworks). We repeat this recursively until the modularity value of a subnetwork is less than p. For every subnetwork obtained this way we run the PageRank algorithm to find the most relevant nodes, which correspond to the most relevant words in the selected category. From those most relevant words in every subnetwork we select the top N, which we use in the further steps. A minimal sketch of this splitting procedure is given below.
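The sketch uses networkx and is our own simplified interpretation of the procedure; the threshold p, the edge weights and top_n are assumed parameters:

```python
import networkx as nx
from networkx.algorithms import community

def split_into_contexts(G, p=0.1):
    """Recursively split a weighted, undirected graph while modularity stays >= p."""
    contexts = []
    for component in (G.subgraph(c).copy() for c in nx.connected_components(G)):
        parts = community.greedy_modularity_communities(component, weight="weight")
        if len(parts) > 1 and community.modularity(component, parts, weight="weight") >= p:
            for part in parts:
                contexts.extend(split_into_contexts(component.subgraph(part).copy(), p))
        else:
            contexts.append(component)
    return contexts

def top_terms(subnetwork, top_n=5):
    """Most relevant words of a context, ranked by PageRank on the term network."""
    ranks = nx.pagerank(subnetwork, weight="weight")
    return sorted(ranks, key=ranks.get, reverse=True)[:top_n]
```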
Retrieving the top N important words and discarding all the less relevant ones in a category has some disadvantages. One of them is the possible loss of information and of connections between words about the same thing, or from the same context, which have not been important enough for the PageRank algorithm. Also, ideally we do not want disconnected networks representing different categories; to analyze them better, we need information about how close they are to each other and whether they have overlapping nodes, an overlapping context. If we do not do that, we can end up with many disjoint networks. To prevent this, and to find candidates that might connect networks together, we look for common words in the whole dataset. In other words, we are looking for duplicated words across all items. This can be done efficiently using e.g. prefix trees. As a result of this step we get the set of words that occur at least a specified number of times.
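As a minimal sketch of this step (our own illustration using a plain Counter rather than a prefix tree; the threshold is an assumed parameter):

```python
from collections import Counter

def duplicated_terms(items_per_category, min_occurrences=2):
    """items_per_category: dict category -> list of token lists (one list per item title).
    Returns the set of terms occurring at least min_occurrences times in the whole dataset."""
    counts = Counter()
    for titles in items_per_category.values():
        for tokens in titles:
            counts.update(set(tokens))        # count each term once per item title
    return {term for term, c in counts.items() if c >= min_occurrences}
```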
In the third step we take the set of terms retrieved in steps #1 and #2 and delete from the network created in step #1 all terms that are not in this set. While removing nodes we also remove all edges incident to those nodes. After that we have a smaller network which can be used in further processing and analysis.
Once we have shrunk the network to the most relevant and necessary words, every item has to be bound to the terms that describe it. This can be done by creating inverted indices over the vocabulary used in the third step and all items from the category.
The last step is the actual step where contexts are built. At the beginning we merge all the small networks corresponding to categories into one bigger network that represents the contexts of all items from the dataset and the connections between them. On this network we calculate the modularity for every connected component and, in the same way as in step #1, if the modularity value is greater than or equal to some value p, we split the input network into subnetworks and repeat this recursively until the modularity value of a subnetwork is less than p. Every subnetwork we receive creates a context, and more general contexts contain more specific, smaller contexts. For debugging purposes the PageRank algorithm can be run on every subnetwork to check what the context is about by looking at the most important terms in the ranking. From the process of splitting the network into modules we create a hierarchical graph that represents the contexts and their divisions into more specific contexts. After that we have to re-map the item bindings from the modules created in step #1 and simplified in step #3 to the new modules created after merging and modularity analysis.
Because as a result we have received a graph that shows the contexts' structure, the recommendation process can be described as a random walk through the graph. A sample graph is presented in figure 5.6. Every node represents a context and the subnetwork which we have received after the modularity analysis of the network.

[Figure 5.6 shows a tree of contexts C1 and C2 with children C1.1, C1.2 and C2.1, C2.2, C2.3 and further sub-contexts (C1.2.1-C1.2.3, C2.1.1, C2.1.2, C2.2.1, C2.2.2), together with the two kinds of jumps, p1 and p2.]
Figure 5.6: Contexts' structure graph and possible recommendation jumps.
We distinguish two types of moves: changing the context on the same level, or changing the parent context - the red and blue arrows in the figure show these two possibilities. While changing contexts or sub-contexts, we set the probabilities that determine which route should be chosen: probability p1 for changing the context to another one on the same level and p2 for changing the parent context of the current context. Changing the parent context can give more diversity in the results, so usually p1 > p2.
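A minimal sketch of such a traversal (our own interpretation; the tree structure, the probabilities and the item lookup are assumed inputs):

```python
import random

def next_context(current, parent_of, children_of, p1=0.3, p2=0.1):
    """One step of the walk: move to a sibling context (prob. p1),
    move up to the parent context (prob. p2), or stay."""
    r = random.random()
    parent = parent_of.get(current)
    if r < p1 and parent is not None:
        siblings = [c for c in children_of.get(parent, []) if c != current]
        if siblings:
            return random.choice(siblings)
    elif r < p1 + p2 and parent is not None:
        return parent
    return current

def recommend(start_context, items_in_context, parent_of, children_of, steps=3):
    """Collect candidate items from the contexts visited by a short random walk."""
    context, recommendations = start_context, []
    for _ in range(steps):
        recommendations.extend(items_in_context.get(context, []))
        context = next_context(context, parent_of, children_of)
    return recommendations
```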
While visualizing the network created after merging, it is likely that some modules are connected with others by words that should have been filtered out by one of the stop-word lists - their meaning is too general and does not introduce any information to any context, they are meaningless. If we check the hub values after applying the HITS algorithm and dividing the network into modules, we can see that, in many cases, words with a high hub value which connect different modules should be removed. This step might be done manually and periodically to improve the correctness of context creation.
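A small sketch of how such hub candidates could be listed (our own illustration using the HITS implementation from networkx; the module assignment is an assumed input):

```python
import networkx as nx

def hub_candidates(G, modules, top_n=10):
    """Return the terms with the highest hub scores that connect different modules.
    modules: dict node -> module id."""
    hubs, _authorities = nx.hits(G, max_iter=1000)
    bridging = [n for n in G
                if any(modules.get(n) != modules.get(m) for m in G.neighbors(n))]
    return sorted(bridging, key=lambda n: hubs[n], reverse=True)[:top_n]
```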
It is also worth mentioning that in this approach only the recommendation process is done on-line - all other tasks are done off-line and periodically, because the average characterization of items (the typical, most common terms used to describe items) in one category does not change quickly or often. All new items that appear between calculations need to be mapped to an appropriate context; to achieve that, we have to store all terms from which each context has been created, together with its corresponding network, and calculate e.g. the cosine similarity between the item and those terms to find the most appropriate context or contexts.
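A minimal sketch of mapping a new item to its closest context (our own illustration; contexts are assumed to be stored as sets of terms):

```python
def best_context(item_tokens, context_terms):
    """context_terms: dict context id -> set of terms describing that context.
    Returns the context whose term set is most similar to the item title
    (cosine similarity on binary term vectors)."""
    item = set(item_tokens)
    best, best_score = None, 0.0
    for context_id, terms in context_terms.items():
        overlap = len(item & terms)
        if overlap == 0:
            continue
        score = overlap / ((len(item) ** 0.5) * (len(terms) ** 0.5))
        if score > best_score:
            best, best_score = context_id, score
    return best
```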
5.5.2 Experiment
In order to evaluate network-based context creation we have picked the 12 categories presented in table 5.6. Altogether these categories contain 1872 items. All items have been processed, the titles divided into terms and the TF matrices created. To show how our approach changes the number of terms describing a category during processing, we count them; the counts for the original data are presented in table 5.7. For every category an LSI analysis has been performed and the weighted similarity has been calculated. After that we have created a network between terms, with the weighted similarity as the weight values on the edges. Visualizations of those networks can be found in Appendix C.1. Based on those networks we have selected the most important words, which are used next to describe the items connected to those categories - we have picked the 5 most important words per category, and 103 terms in total for all categories. We have also determined 97 duplicated terms within the selected categories.
Having the most important words for every category and the list of duplicated terms, we have been able to shrink the original networks into their compressed versions. All those networks can be found in Appendix C.2, and the term counts for them are also given in table 5.7. As we can see, the number of distinct terms decreases by anywhere from a few percent up to almost 80%. At this point we can merge all the networks presented in Appendix C.2 into one network; figure 5.7a presents the result of this process.
(a) Before hubs removal.
(b) After hubs removal.
Figure 5.7: Network representing 12 categories merged together.
The network created from the compressed networks, which represent connections between terms, has only one connected component, although there is a chance that some of the terms that connect different modules (every module has a different node color) should not exist in this network. To check that, the HITS algorithm has been performed, and its results are reflected in the sizes of the nodes. After a brief analysis of the hub nodes that connect different modules, we have decided to remove a few more nodes which, in our opinion, cannot distinguish any specific type of items. After this removal we have obtained the network in figure 5.7b.
Instead of one connected component we now have two connected components. This shows that the hub analysis of nodes connecting modules (which can also improve stop-word lists) has quite a large impact on the network structure. The modularity values of the top and bottom subnetworks are equal to 0.102 and 0.386, so if we set the value p = 0.1 then we have to continue the analysis on their submodules. The structures of the components can be found in figure 5.8.
For every component we have analyzed its modules; their visualizations are enclosed in Appendix C.3. Because all modules have to be analyzed to check whether they can be divided into submodules, module #1.3 has in the end been divided into 3 submodules and 2 sub-submodules - see figure C.9 in Appendix C.3.

(a) Connected component #1. (b) Connected component #2.
Figure 5.8: Connected components.
For debugging purposes, the PageRank algorithm has been executed on the final submodules to give, more or less, the topic of each context. The divisions based on the modularity values have created the context structure shown in figure 5.9. After analyzing the top-ranked terms we can find out approximately what the topics of those contexts are; our interpretation is presented in table 5.8. We can see that the topics are generally divided into two groups because of our two connected components, which is reflected in the context division. The recommendation process - graph traversal - will result in picking items from connected contexts; e.g. if we are viewing an item whose context is #2.4 (Socks), as a result we can receive items from context #2.3 (Leg warmers, which are also a kind of socks) but also from #2.1 (Bras), which are also a kind of lingerie and might be useful.

[Figure 5.9 shows the hierarchy of contexts: context 1 with children 1.1, 1.2, 1.3 and 1.4, where 1.3 splits into 1.3.1, 1.3.2 and 1.3.3, and 1.3.3 further into 1.3.3.1 and 1.3.3.2; and context 2 with children 2.1, 2.2, 2.3 and 2.4.]
Figure 5.9: Graph of contexts structure.

In this section we have presented a few different ideas of how to create contexts based on item keywords. We have started with the simple idea of analyzing items from categories and creating category descriptions. Although this method is fast, because it can take advantage of existing technologies and methods, the results might not be satisfactory. To find hidden relations and to connect contexts more strongly,
we have tested the Latent Semantic Indexing approach in order to create contexts. This approach, on the other hand, finds hidden connections between items, but it is very demanding in terms of resources, especially if it has to be performed on huge datasets. Combining those two approaches, we have created the network-based method, which performs the category analysis separately and finds hidden connections between items inside one category, and then analyzes the whole dataset using network and graph algorithms to find contexts in the data.
No | Category name              | Category path                                                                   | Items type
1  | Skarpetki i podkolanówki   | Odzież, Obuwie, Dodatki / Odzież damska / Bielizna / Skarpetki i podkolanówki   | Socks
2  | Przelewowe                 | RTV i AGD / Sprzęt AGD / Kuchnia / Ekspresy do kawy / Przelewowe                | Coffee machines
3  | Ciśnieniowe                | RTV i AGD / Sprzęt AGD / Kuchnia / Ekspresy do kawy / Ciśnieniowe               | Coffee machines
4  | Młynki do kawy             | RTV i AGD / Sprzęt AGD / Kuchnia / Młynki do kawy                               | Coffee grinders
5  | Silikonowe                 | Odzież, Obuwie, Dodatki / Odzież damska / Bielizna / Biustonosze / Silikonowe   | Bras
6  | Sportowe                   | Odzież, Obuwie, Dodatki / Odzież damska / Bielizna / Biustonosze / Sportowe     | Bras
7  | Typu push-up               | Odzież, Obuwie, Dodatki / Odzież damska / Bielizna / Biustonosze / Typu push-up | Bras
8  | Naczynia do kawy i herbaty | Antyki i Sztuka / Antyki / Ceramika / Naczynia do kawy i herbaty                | Antique coffee & tea cups
9  | Mielone                    | Dom i Ogród / Żywność / Kawy / Mielone                                          | Coffee
10 | Rozpuszczalne              | Dom i Ogród / Żywność / Kawy / Rozpuszczalne                                    | Coffee
11 | Ziarniste                  | Dom i Ogród / Żywność / Kawy / Ziarniste                                        | Coffee
12 | Inne kawy                  | Dom i Ogród / Żywność / Kawy / Inne kawy                                        | Coffee
Table 5.6: Selected categories to evaluation.
No | Number of distinct terms | Number of distinct terms after compression
1  | 104                      | 32
2  | 19                       | 13
3  | 102                      | 25
4  | 11                       | 10
5  | 17                       | 11
6  | 20                       | 13
7  | 130                      | 29
8  | 16                       | 11
9  | 151                      | 54
10 | 16                       | 12
11 | 55                       | 41
12 | 43                       | 26
Table 5.7: Distinct term counts for original and compressed category network.
Context No | Topic interpretation
1          | Coffee equipment
1.1        | Coffee grinders
1.2        | Coffee
1.3        | Coffee machines
1.3.1      | Coffee machines
1.3.2      | Coffee machines (specific manufacturer)
1.3.3      | Coffee & breakfast
1.3.3.1    | Coffee
1.3.3.2    | Breakfast accessories
1.4        | Grounded coffee
2          | Lingerie
2.1        | Bras
2.2        | Push-up bras
2.3        | Leg warmers
2.4        | Socks
Table 5.8: Topics interpretation of created contexts.
Chapter 6
Conclusions
The topic of this thesis was to create approaches that could be used in the context creation process and then in context-aware recommender systems. The first problem was defining the meaning of context; we have ended up with the meaning of context as a set of complementary things that might be recommended to create a full set of connected items. Then we have presented three methods, where the task of each subsequent one was to create contexts while eliminating the problems observed in the results of the previously evaluated one, so each new approach has been the consequence of the previous one. As a result we have obtained a solution that mixes different approaches and fields of computer science to perform its task in the best way. It analyzes groups of items categorized by people, removes unnecessary information, builds groups of items and tries to connect them all together in order to divide them into contexts shared between categories. In order to achieve that we use Latent Semantic Indexing and network analysis methods such as the PageRank algorithm, the HITS algorithm and modularity. We think that this method may yield decent results in the recommendation task on a real e-commerce platform.
Further work. It is impossible to judge the efficiency of any method presented in this thesis based only on our subjective opinion. Objective evaluation of recommender systems is hard because it depends on the users' experience, their current needs, age, sex and social status. Some recommended items might be suitable for one user and totally unacceptable for another. Well-known validation methods from machine learning often cannot be used because of the lack of objectivity - we cannot just check an equation. Nevertheless, some methods of validating recommender systems have been presented in the literature; e.g. to check users' satisfaction we can run user trials. We consider this the first thing that should be done as further work.
In all methods we have relied on the results of text processing techniques. We have observed that this is the most important step in the whole context creation flow, and it could be improved by using better stemmers or by analyzing texts more precisely. For instance, we have been splitting titles into, and then analyzing, terms built from single words. Instead of that we could use n-grams and check the impact of this decision on the final results. An n-gram itself creates a small context inside; in the network-based context creation approach, n-grams might correspond to small cliques. Checking the impact of using n-grams might be the next thing for further verification.
Some of the methods used to perform subtasks, such as SVD for decomposing matrices, are quite complex and expensive in terms of processor and memory resources. We have been trying to minimize the datasets they have to handle, but there are other ways to adapt the algorithms to handle larger sets of data. For instance, instead of the standard SVD, CUR matrix approximation or stochastic SVD might be used. However, this might have an impact on the final results, so it has to be evaluated before being used in real systems.
In this thesis we have presented a simple idea of graph traversal for serving recommendation results. Other, more complex probabilistic models could be applied to provide more accurate and realistic behavior of context switching, and this can be a topic for further research.
Bibliography

[1] Webster Dictionary - Context. Available at http://www.merriam-webster.com/dictionary/context, last checked 13/06/22.
[2] Wikipedia: Connectivity. Available at http://en.wikipedia.org/wiki/Connectivity_(graph_theory), last checked 13/06/11.
[3] Wikipedia: HITS algorithm. Available at http://en.wikipedia.org/wiki/HITS_algorithm, last checked 13/06/11.
[4] Wikipedia: Modularity (networks). Available at http://en.wikipedia.org/wiki/Modularity_(networks), last checked 13/06/10.
[5] Wikipedia: PageRank. Available at http://en.wikipedia.org/wiki/PageRank, last checked 13/06/11.
[6] A. Dey, G. Abowd, D. Salber. A Conceptual Framework and a Toolkit for Supporting the Rapid Prototyping of Context-Aware Applications. Human Computer Interaction, 16(2):97–166, 2001.
[7] B. Schilit, M. Theimer. Disseminating Active Map Information to Mobile Hosts. IEEE Network, 8(5):22–32, 1994.
[8] C. Palmisano, A. Tuzhilin, M. Gorgoglione. Using Context to Improve Predictive Models of Customers in Personalization Applications. IEEE Transactions on Knowledge and Data Engineering, 20(11):1535–1549, 2008.
[9] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[10] G. Adomavicius, A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.
[11] G. Adomavicius, A. Tuzhilin. Context-Aware Recommender Systems. In Recommender Systems Handbook. Springer, 2011.
[12] G. Lilien, P. Kotler, S. Moorthy. Marketing Models. USA: Prentice Hall, pages 22–23, 1992.
[13] K. Verbert, N. Manouselis, X. Ochoa, M. Wolpers, H. Drachsler, I. Bosnic, E. Duval. Context-Aware Recommender Systems for Learning: A Survey and Future Challenges. IEEE Transactions on Learning Technologies, 5(4):318–335, 2012.
[14] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, September 1999.
[15] Jon M. Kleinberg. Hubs, authorities, and communities. ACM Comput. Surv., 31(4es), December 1999.
[16] Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[17] M. Bazire, P. Brézillon. Understanding context before using it. In 5th International and Interdisciplinary Conference on Modeling and Using Context, volume 3554 of Lecture Notes in Artificial Intelligence, pages 29–40. Springer-Verlag, 2005.
[18] M. Berry, G. Linoff. Data mining techniques: For marketing, sales, and customer support. Wiley, 1997.
[19] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, June 2006.
[20] P. Brown, J. Bovey, X. Chen. Context-Aware Applications: From the Laboratory to the Marketplace. IEEE Personal Communications, 4(5):58–64, 1997.
[21] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab, November 1999.
Appendix A
Category-based context creation experiment data
A.1 Items data
No | Item title                                     | Category path
I1 | Kawa Senseo “Jacobs Kronung”                   | Dom i Ogród / Żywność / Pozostałe
I2 | EKSPRES CIŚNIENIOWO-PRZELEWOWY                 | RTV i AGD / Sprzęt AGD / Kuchnia / Ekspresy do kawy / Ciśnieniowe
I3 | rower GIANT BOULDER 500                        | Sport i Turystyka / Rowery i akcesoria / Rowery / Szosowe
I4 | Czarna subtelna sukienko tunika ESSENCE, r.48/50 | Odzież, Obuwie, Dodatki / Odzież damska / Odzież / Suknie i sukienki / Sukienki
I5 | Piłka do siatkówki, piłka siatkowa (80806-7)   | Sport i Turystyka / Sporty drużynowe / Siatkówka / Piłki
I6 | IPOD APPLE 2GB                                 | RTV i AGD / Sprzęt audio przenośny / MP4 / 2GB
I7 | NOWY MacBook Pro 17” 2,33GHz/2GB RAM,F.Vat.Gw  | Komputery / Komputery - inne / Apple / Notebooki
Table A.1: Items selected to experiment.
A.2 Recommended items

No  | Item title                                       | Category path
R11 | Kawa Segafredo Intermezzo 1kg - oferta POLCAFFE  | Dom i Ogród / Żywność / Kawy / Ziarniste
R12 | MŁYNEK DO KAWY FIRST GWARAN24M-C SERWIS, FA.VA   | RTV i AGD / Sprzęt AGD / Kuchnia / Młynki do kawy
R13 | KUBEK TERMICZNY 400ML 16CM KAWA w SAMOCHODZIE    | Motoryzacja / Gadżety motoryzacyjne / Kubki
R14 | ZESTAW DO KAWY I HERBATY                         | Dom i Ogród / Wyposażenie / Kubki i filiżanki / Zestawy
Table A.2: Items recommended for item I1.
No  | Item title                                        | Category path
R21 | Ekspres kawowy Ciśnieniowy przelewowy clatronik   | RTV i AGD / Sprzęt AGD / Kuchnia / Ekspresy do kawy / Ciśnieniowe
R22 | Ekspres przelewowy PREDOM ZELMER typ 215.2        | RTV i AGD / Sprzęt AGD / Kuchnia / Ekspresy do kawy / Przelewowe
R23 | NAJLEPSZY EKSPRES CIŚNIENIOWY 50% CENY, NAJTANIEJ | RTV i AGD / Sprzęt AGD / Kuchnia / Ekspresy do kawy / Ciśnieniowe
R24 | ekspres ciśnieniowy automatyczny Saeco            | Firma i Przemysł / Gastronomia / Wyposaż. i akcesoria barowe / Ekspresy ciśnieniowe
Table A.3: Items recommended for item I2.
No  | Item title                                          | Category path
R31 | POKROWIEC NA ROWER-MOTOR-SKUTER ITD. SUPER CENA     | Sport i Turystyka / Rowery i akcesoria / Akcesoria / Torby i sakwy / Pozostałe
R32 | ROWER DLA DZIECKA 4-8 LAT KOŁA 14” UŻYWANY          | Dla Dzieci / Zabawki / Pojazdy / Na pedały
R33 | DAMSKI ROWER SUNCITY 26” ALU SUPER CENA!*****       | Sport i Turystyka / Rowery i akcesoria / Rowery / Trekkingowe
R34 | JAZDA ROWEREM GORSKIM LOPES NOWOSC KRAKOW KSIEG     | Sport i Turystyka / Rowery i akcesoria / Literatura, instrukcje
Table A.4: Items recommended for item I3.
No  | Item title                                      | Category path
R41 | Sukienka sztruksowa + bluza 92 2-3 lata SUPER!!! | Dla Dzieci / Odzież / Niemowlęta i małe dzieci / Rozmiar 92 / Sukienki
R42 | SUKIENKA NEXT WIOSNA/LATO 2007 R.86             | Dla Dzieci / Odzież / Niemowlęta i małe dzieci / Rozmiar 86 / Sukienki
R43 | SUKIENKA DLA MAŁEJ WRÓŻKI                       | Dla Dzieci / Odzież / Niemowlęta i małe dzieci / Rozmiar 68 / Sukienki
R44 | SUKIENKA,TUNICZKA-92,98CM.                      | Dla Dzieci / Odzież / Niemowlęta i małe dzieci / Rozmiar 92 / Sukienki
Table A.5: Items recommended for item I4.
No  | Item title                                        | Category path
R51 | WILSON Piłka do futbolu amerykańskiego NFL        | Sport i Turystyka / Sporty drużynowe / Futbol amerykański / Piłki
R52 | Piłka, Okazja                                     | Sport i Turystyka / Sporty drużynowe / Rugby / Piłki
R53 | SUPER HIT GRA SOCCER PIŁKA NOŻNA ZOBACZ SAM!!!!   | Gry / Konsole i automaty / Game Boy Color / Gry / Sportowe
R54 | bo-bas/do skakania KONIK - PIŁKA * SUPER JAKOŚĆ   | Dla Dzieci / Zabawki / Ogrodowe / Pozostałe
Table A.6: Items recommended for item I5.
No  | Item title                                          | Category path
R61 | Ładowarka sieciowa - iPod Mini Nano Video 3G 4G     | RTV i AGD / Sprzęt audio przenośny / Pozostałe
R62 | OKAZJA! M10 MP4 512MB AUDIO/VIDEO/FM/PENDR/DYK GW.  | RTV i AGD / Sprzęt audio przenośny / MP4 / 512MB
R63 | Ładowarka sieciowa - iPod Mini Nano Video 3G 4G     | RTV i AGD / Sprzęt audio przenośny / Pozostałe
R64 | PROMOCJA! M06 MP4 2GB AUDIO/VIDEO/FM/PEN/DYK GW.    | RTV i AGD / Sprzęt audio przenośny / MP4 / 2GB
Table A.7: Items recommended for item I6.
No  | Item title                                    | Category path
R71 | NOWY Pokrowiec Tucano na iMac 20” F.Vat !!!   | Komputery / Komputery - inne / Apple / Komputery
R72 | NOWY Pokrowiec Tucano na iMac G5 17” F.Vat    | Komputery / Komputery - inne / Apple / Komputery
Table A.8: Items recommended for item I7.
Appendix B
Latent Semantic Indexing experiment data
B.1 Items data
No  | Category        | Item type       | Item title
D1  | Socks           | Suit socks      | CZARNE ELEGANCKIE SKARPETKI HENDERSON 39 - 42
D2  | Socks           | Sport socks     | FROTOWE SKARPETKI NIKE 42-46 FROTA SKARPETY #
D3  | Socks           | Sport socks     | SKARPETY SKARPETKI SPORTOWE EXTREME E3 rozm.29-30
D4  | Bras            | Bra             | STANIK BIUSTONOSZ PUSH-UP 80D - BIAŁY (31)
D5  | Bras            | Bra             | Miss Selfridge ŚWIETNY KORONKOWY BIUSTONOSZ 75D
D6  | Bras            | Bra             | SILIKONOWY STANIK CharaBra -PIĘKNY BIUST ! ROZM B
D7  | Bra accessories | Bra accessories | WKŁADKI SILIKONOWE pod STANIK - Powiększ BIUST !
D8  | Socks           | Socks           | 3 PARY SPORTOWYCH FROTA 44-46
D9  | Bras            | Bra             | UN BRA SILIKONOWY Stanik B,C,D FACECI OSZALEJĄ
D10 | Bras            | Bra             | Stanik Triumph dla aktywnych, Nowy tanio r.75D
D11 | Bras            | Sport bra       | DKNY -75 C- BIAŁY SPORTOWY TOP-BIUSTONOSZ
D12 | Bras            | Sport bra       | 60-70A BIUSTONOSZ - SPORTOWY - NOWY 0d 1zl! 224
D13 | Bras            | Sport bra       | Sportowy,elegancki Triumph,80A
Table B.1: Items selected to experiment.
B.2 Term-document frequency matrix
Term         D1  D2  D3  D4  D5  D6  D7  D8  D9  D10 D11 D12 D13
czarny        1   0   0   0   0   0   0   0   0   0   0   0   0
elegancki     1   0   0   0   0   0   0   0   0   0   0   0   1
skarpeta      1   2   2   0   0   0   0   0   0   0   0   0   0
henderson     0   0   0   0   0   0   0   0   0   0   0   0   0
frota         0   2   0   0   0   0   0   1   0   0   0   0   0
nike          0   1   0   0   0   0   0   0   0   0   0   0   0
sport         0   0   1   0   0   0   0   1   0   0   1   1   1
extreme       0   0   1   0   0   0   0   0   0   0   0   0   0
rozmiar       0   0   1   0   0   1   0   0   0   0   0   0   0
stanik        0   0   0   1   0   1   1   0   0   1   0   0   0
biustonosz    0   0   0   1   1   0   0   0   0   0   1   1   0
push          0   0   0   1   0   0   0   0   0   0   0   0   0
up            0   0   0   1   0   0   0   0   0   0   0   0   0
biały         0   0   0   1   0   0   0   0   0   0   1   0   0
miss          0   0   0   0   1   0   0   0   0   0   0   0   0
selfridge     0   0   0   0   1   0   0   0   0   0   0   0   0
świetny       0   0   0   0   0   0   0   0   0   0   0   0   0
koronka       0   0   0   0   1   0   0   0   0   0   0   0   0
silikon       0   0   0   0   0   1   1   0   1   0   0   0   0
charabra      0   0   0   0   0   1   0   0   0   0   0   0   0
piękny        0   0   0   0   0   1   0   0   0   0   0   0   0
biust         0   0   0   0   0   1   1   0   0   0   0   0   0
wkładka       0   0   0   0   0   0   1   0   0   0   0   0   0
powiększyć    0   0   0   0   0   0   1   0   0   0   0   0   0
para          0   0   0   0   0   0   0   1   0   0   0   0   0
un            0   0   0   0   0   0   0   0   1   0   0   0   0
bra           0   0   0   0   0   0   0   0   1   0   0   0   0
facet         0   0   0   0   0   0   0   0   1   0   0   0   0
oszaleć       0   0   0   0   0   0   0   0   0   0   0   0   0
triumph       0   0   0   0   0   0   0   0   0   1   0   0   1
aktywny       0   0   0   0   0   0   0   0   0   1   0   0   0
nowy          0   0   0   0   0   0   0   0   0   0   0   1   0
dkny          0   0   0   0   0   0   0   0   0   0   1   0   0
top           0   0   0   0   0   0   0   0   0   0   1   0   0
Table B.2: Term-document frequency matrix of selected items.
B.3 Singular Value Decomposition
(34 × 13; rows correspond to the terms in the order of table B.2, columns to the 13 singular directions)

 0,07  -0,02   0,02   0,07   0,10   0,06   0,23   0,08   0,25  -0,35   0,12   0,03  -0,54
 0,10   0,00  -0,05   0,27   0,17  -0,13   0,42   0,02   0,25  -0,48   0,17  -0,06   0,18
 0,75  -0,20   0,16   0,00   0,12   0,31   0,12   0,16   0,11   0,11  -0,08   0,07  -0,06
 0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00
 0,43  -0,16   0,09  -0,44  -0,18  -0,47  -0,12  -0,19  -0,11  -0,14   0,04  -0,04   0,03
 0,18  -0,08   0,07  -0,23  -0,03  -0,11   0,00  -0,02  -0,01  -0,07  -0,10   0,20   0,24
 0,33   0,15  -0,42   0,50  -0,15  -0,25  -0,16  -0,14  -0,01   0,13   0,16  -0,10   0,08
 0,16  -0,01   0,00   0,19   0,05   0,23  -0,06   0,06  -0,06   0,30   0,01  -0,18   0,01
 0,19   0,17   0,14   0,25   0,05   0,32  -0,21  -0,05  -0,28  -0,03   0,00  -0,10   0,03
 0,07   0,53   0,17  -0,17   0,35  -0,14   0,07   0,04  -0,10   0,14   0,01   0,00  -0,20
 0,11   0,31  -0,52  -0,25  -0,10   0,19   0,09  -0,08   0,02  -0,01   0,08   0,29  -0,04
 0,02   0,13  -0,10  -0,21   0,16   0,03   0,02   0,24  -0,08   0,01   0,30  -0,16   0,15
 0,02   0,13  -0,10  -0,21   0,16   0,03   0,02   0,24  -0,08   0,01   0,30  -0,16   0,15
 0,06   0,22  -0,29  -0,18   0,08   0,00  -0,12   0,35   0,05  -0,13  -0,15  -0,18   0,07
 0,01   0,04  -0,10  -0,13  -0,09   0,21   0,24  -0,33  -0,03   0,01  -0,07  -0,17   0,02
 0,01   0,04  -0,10  -0,13  -0,09   0,21   0,24  -0,33  -0,03   0,01  -0,07  -0,17   0,02
 0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00
 0,01   0,04  -0,10  -0,13  -0,09   0,21   0,24  -0,33  -0,03   0,01  -0,07  -0,17   0,02
 0,05   0,39   0,31   0,06  -0,37  -0,02   0,08   0,06   0,08  -0,04   0,00   0,02   0,05
 0,03   0,18   0,14   0,06   0,00   0,08  -0,15  -0,11  -0,22  -0,33   0,00   0,08   0,03
 0,03   0,18   0,14   0,06   0,00   0,08  -0,15  -0,11  -0,22  -0,33   0,00   0,08   0,03
 0,04   0,33   0,25   0,03  -0,02   0,00  -0,15  -0,19   0,21  -0,09   0,00   0,03   0,08
 0,01   0,16   0,12  -0,03  -0,02  -0,08   0,00  -0,07   0,44   0,24   0,00  -0,06   0,05
 0,01   0,16   0,12  -0,03  -0,02  -0,08   0,00  -0,07   0,44   0,24   0,00  -0,06   0,05
 0,06   0,00  -0,05   0,02  -0,11  -0,25  -0,12  -0,15  -0,09  -0,01   0,24  -0,44  -0,45
 0,00   0,05   0,06   0,03  -0,36  -0,02   0,23   0,25  -0,14   0,04   0,00  -0,01  -0,02
 0,00   0,05   0,06   0,03  -0,36  -0,02   0,23   0,25  -0,14   0,04   0,00  -0,01  -0,02
 0,00   0,05   0,06   0,03  -0,36  -0,02   0,23   0,25  -0,14   0,04   0,00  -0,01  -0,02
 0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00   0,00
 0,04   0,09  -0,04   0,21   0,28  -0,36   0,38  -0,09  -0,23   0,09  -0,24   0,04   0,30
 0,01   0,07   0,02   0,01   0,20  -0,17   0,19  -0,02  -0,24   0,21  -0,29   0,13  -0,42
 0,03   0,05  -0,13   0,07  -0,08  -0,02  -0,03  -0,10   0,00   0,12   0,30   0,64  -0,13
 0,04   0,08  -0,19   0,03  -0,08  -0,03  -0,15   0,11   0,13  -0,14  -0,45  -0,02  -0,08
 0,04   0,08  -0,19   0,03  -0,08  -0,03  -0,15   0,11   0,13  -0,14  -0,45  -0,02  -0,08
Table B.3: Matrix U .
Σ = diag(3,71; 3,17; 2,89; 2,21; 2,01; 1,98; 1,83; 1,80; 1,56; 1,43; 1,34; 1,14; 0,88)
(13 × 13 diagonal matrix; all off-diagonal entries are 0,00)
Table B.4: Matrix Σ.
(13 × 13; rows correspond to documents D1-D13, columns to the 13 singular directions)

D1    0,25  -0,07   0,04   0,15   0,19   0,12   0,42   0,15   0,39  -0,51   0,16   0,03  -0,48
D2    0,68  -0,25   0,19  -0,50  -0,07  -0,22   0,00  -0,04  -0,02  -0,09  -0,14   0,23   0,21
D3    0,59  -0,03   0,01   0,42   0,10   0,46  -0,10   0,11  -0,09   0,43   0,01  -0,21   0,01
D4    0,08   0,42  -0,29  -0,46   0,32   0,06   0,04   0,44  -0,12   0,02   0,40  -0,18   0,13
D5    0,04   0,14  -0,28  -0,29  -0,18   0,42   0,44  -0,60  -0,05   0,01  -0,09  -0,19   0,02
D6    0,11   0,56   0,40   0,12   0,01   0,17  -0,27  -0,20  -0,35  -0,47  -0,01   0,09   0,02
D7    0,05   0,49   0,34  -0,07  -0,04  -0,16   0,00  -0,13   0,68   0,35   0,00  -0,07   0,05
D8    0,22   0,00  -0,13   0,04  -0,22  -0,49  -0,22  -0,26  -0,14  -0,02   0,33  -0,51  -0,39
D9    0,02   0,17   0,17   0,07  -0,72  -0,05   0,42   0,44  -0,21   0,06   0,00  -0,01  -0,02
D10   0,03   0,22   0,05   0,03   0,41  -0,33   0,35  -0,04  -0,37   0,31  -0,39   0,15  -0,37
D11   0,16   0,27  -0,56   0,06  -0,17  -0,06  -0,27   0,19   0,20  -0,21  -0,60  -0,02  -0,07
D12   0,13   0,16  -0,37   0,15  -0,16  -0,04  -0,06  -0,18   0,01   0,17   0,41   0,73  -0,12
D13   0,13   0,08  -0,18   0,45   0,14  -0,37   0,35  -0,12   0,00  -0,18   0,07  -0,10   0,63
Table B.5: Matrix V .
B.4 Queries
Term         Q1  Q2  Q3  Q4  Q5  Q6
czarny        0   0   0   0   0   0
elegancki     0   0   0   0   0   0
skarpeta      0   0   0   1   0   0
henderson     0   0   0   0   0   0
frota         0   0   0   0   0   0
nike          0   0   0   0   0   0
sport         0   1   0   0   1   0
extreme       0   0   0   0   0   0
rozmiar       0   0   0   0   0   0
stanik        0   0   0   0   0   1
biustonosz    1   1   0   0   0   0
push          0   0   1   0   0   0
up            0   0   1   0   0   0
biały         0   0   0   0   0   0
miss          0   0   0   0   0   0
selfridge     0   0   0   0   0   0
świetny       0   0   0   0   0   0
koronka       0   0   0   0   0   0
silikon       0   0   0   0   0   0
charabra      0   0   0   0   0   0
piękny        0   0   0   0   0   0
biust         0   0   0   0   0   0
wkładka       0   0   0   0   0   0
powiększyć    0   0   0   0   0   0
para          0   0   0   0   0   0
un            0   0   0   0   0   0
bra           0   0   0   0   0   0
facet         0   0   0   0   0   0
oszaleć       0   0   0   0   0   0
triumph       0   0   0   0   0   1
aktywny       0   0   0   0   0   0
nowy          0   0   0   0   0   0
dkny          0   0   0   0   0   0
top           0   0   0   0   0   0
Table B.6: Word vectors of queries used in an evaluation.
Appendix C
Network-based context creation experiment data
C.1 Categories networks
(a) Network between terms in category #1.
(b) Network between terms in category #2.
(c) Network between terms in category #3.
(d) Network between terms in category #4.
Figure C.1: Networks between terms for categories #1 - #4.
(a) Network between terms in category #5.
(b) Network between terms in category #6.
(c) Network between terms in category #7.
(d) Network between terms in category #8.
Figure C.2: Networks between terms for categories #5 - #8.
(a) Network between terms in category #9.
(b) Network between terms in category #10.
(c) Network between terms in category #11.
(d) Network between terms in category #12.
Figure C.3: Networks between terms for categories #9 - #12.
C.2 Compressed categories networks
(a) Compressed network between terms in category #1.
(b) Compressed network between terms in category #2.
(c) Compressed network between terms in category #3.
(d) Compressed network between terms in category #4.
Figure C.4: Compressed networks between terms for categories #1 - #4.
(a) Compressed network between terms in category #5.
(b) Compressed network between terms in category #6.
(c) Compressed network between terms in category #7.
(d) Compressed network between terms in category #8.
Figure C.5: Compressed networks between terms for categories #5 - #8.
(a) Compressed network between terms in category #9.
(b) Compressed network between terms in category #10.
(c) Compressed network between terms in category #11.
(d) Compressed network between terms in category #12.
Figure C.6: Compressed networks between terms for categories #9 - #12.
C.3 Modules of merged network
(a) Submodule #1.1
(b) Submodule #1.2
(c) Submodule #1.3
(d) Submodule #1.4
Figure C.7: Submodules of component #1.
(a) Submodule #2.1
(b) Submodule #2.2
(c) Submodule #2.3
(d) Submodule #2.4
Figure C.8: Submodules of component #2.
(a) Submodule #1.3.1.
(b) Submodule #1.3.2.
(c) Submodule #1.3.3 and its 2 submodules
- #1.3.3.1 and #1.3.3.2.
Figure C.9: Submodules of module #1.3.
List of Figures
2.1  Paradigms for incorporating context in recommender systems. [11]   16
3.1  Entity-Relation Diagram representing the dataset.   19
5.1  Context-aware recommendation process.   29
5.2  Text preprocessing workflow.   31
5.3  Example of category tree structure.   34
5.4  Schematic representation of singular value decomposition of the matrix A.   39
5.5  Schematic workflow of network-based context creation approach.   48
5.6  Contexts' structure graph and possible recommendation jumps.   51
5.7  Network representing 12 categories merged together.   53
5.8  Connected components.   54
5.9  Graph of contexts structure.   55
C.1  Networks between terms for categories #1 - #4.   78
C.2  Networks between terms for categories #5 - #8.   79
C.3  Networks between terms for categories #9 - #12.   80
C.4  Compressed networks between terms for categories #1 - #4.   81
C.5  Compressed networks between terms for categories #5 - #8.   82
C.6  Compressed networks between terms for categories #9 - #12.   83
C.7  Submodules of component #1.   84
C.8  Submodules of component #2.   85
C.9  Submodules of module #1.3.   86
List of Tables
5.1  Selected items and their types.   40
5.2  Similarities between documents and queries.   42
5.3  Similarities between queries.   43
5.4  Similarities between documents before applying LSI.   44
5.5  Similarities between documents after applying LSI.   44
5.6  Selected categories to evaluation.   56
5.7  Distinct term counts for original and compressed category network.   57
5.8  Topics interpretation of created contexts.   57
A.1  Items selected to experiment.   66
A.2  Items recommended for item I1.   66
A.3  Items recommended for item I2.   67
A.4  Items recommended for item I3.   67
A.5  Items recommended for item I4.   68
A.6  Items recommended for item I5.   68
A.7  Items recommended for item I6.   68
A.8  Items recommended for item I7.   69
B.1  Items selected to experiment.   72
B.2  Term-document frequency matrix of selected items.   73
B.3  Matrix U.   74
B.4  Matrix Σ.   75
B.5  Matrix V.   75
B.6  Word vectors of queries used in an evaluation.   76