Computational Intelligence, Volume 0, Number 0, 2016
SWALLOW: RESOURCE AND TAG RECOMMENDER SYSTEM BASED
ON HEAT DIFFUSION ALGORITHM IN SOCIAL
ANNOTATION SYSTEMS
VAHIDEH AMEL MAHBOOB, MEHRDAD JALALI, MAJID VAFAEI JAHAN, AND
PEGAH BAREKATI
Mashhad Branch, Islamic Azad University, Mashhad, Iran
Social annotation systems (SAS) allow users to annotate online resources with keywords (tags).
These systems help users find, organize, and retrieve online resources, and in doing so they provide collaborative semantic data that recommender systems can exploit. Earlier studies on SAS concentrated
on tag recommendation. Recently, SAS-based resource recommendation has received more attention from scholars.
Most such systems recommend resources according to the tags a user searches for, and the user’s
recent behavior and click-through are not taken into account. To design and implement
a more precise recommender system, the current study uses previous users’ tagging data together with the current user’s click-through
and addresses both the resource (web pages, research papers, etc.) and the tag recommendation problem. Moreover, applying a heat diffusion algorithm during the recommendation process presents more diverse
options to the user. After extracting data such as users, tags, resources, and the relations between them,
the recommender system, called “Swallow,” creates a graph-based pattern from system log files. Then, by following the active user’s path and observing heat conduction on the created pattern, the user’s further goals are anticipated
and recommended to him. Test results on SAS data sets demonstrate that the proposed algorithm improves on the
accuracy of former recommendation algorithms.
Received 5 August 2014; Revised 9 October 2015; Accepted 28 October 2015
Key words: social annotation systems, web recommender systems, heat diffusion, graph.
1. INTRODUCTION
One of the most successful Web 2.0 applications is the social annotation service (SAS), such as
Delicious, CiteULike, and Flickr, which has developed significantly in recent years. In SAS,
users can easily organize, share, and retrieve online resources, such as bookmarks in
Delicious, research papers in CiteULike, and photos in Flickr, by means of annotation. As these systems have developed, SAS users have generated a great amount
of annotation data, which has attracted the attention of many research communities. Considering the
volume of online resources, finding the resources that each user is interested in is highly significant. Most annotation services give access to the resources whose tags match
the keywords searched by the user. However, searching in this way does not solve the user’s problem of detecting interesting resources, because the number of returned resources is
so large that finding a proper one among thousands of resources is confusing
and time-consuming.
In other words, such methods retrieve only the resources that
match the given tags and do not consider semantically related resources. For
instance, when a user searches web pages for “clothing,” pages labeled “garment”
may not be retrieved. Further, formulating appropriate queries and refining
them is difficult and tedious. Therefore, a recommender module is required
that recommends to users their most favored resources among thousands or even millions
of resources. Since the first emergence of collaborative filtering (CF) systems, recommender systems have received considerable attention from industry and academia (Resnick
et al. 1994; Maes and Shardanand 1995). Older recommender systems focused on users’
explicit rankings (e.g., movie rankings), whereas SAS data have distinctive properties. Figure 1
shows two bookmarks saved by a user in Delicious. In each bookmark, the web page title
and its allocated tags are shown on the upper left and the lower right, respectively.
Address correspondence to Mehrdad Jalali, Mashhad Branch, Islamic Azad University, Mashhad, Islamic Republic of
Iran; e-mail: [email protected]
© 2016 Wiley Periodicals, Inc.
FIGURE 1. Two saved bookmarks in Delicious by a user.
FIGURE 2. Tag recommender (the user would like to annotate a resource with a tag, so a list of tags is recommended
to him).
The main differences between tagging data and ranked data are as follows:
(1) Unlike ranked data, tagging data do not contain users’ explicit preference information on
resources.
(2) Tagging data are composed of three elements (user, tag, and resource), whereas ranked
data are composed of only two (user and resource).
These differences create some opportunities for the recommendation problem on tagging data.
There are two kinds of recommenders in SAS. The first is the tag recommender, used when
a user wants to annotate a resource, as shown in Figure 2. In this case, the user has one
resource and wants to allocate a proper tag to it, and the recommender system suggests
tags that are semantically related to the resource. The second is the resource recommender, which suggests (according to the searched
tags) bookmarked resources that the user has not yet visited,
as illustrated in Figure 3. In this figure, the user searches for a tag, and the resources
associated with that tag are recommended to the user. Previous systems did not take the active
user’s progress into account; they were evaluated and analyzed only by
the accuracy and diversity of the resources recommended to the user.
Applying the heat diffusion algorithm, the current research focuses on a resource and tag
recommendation system based on tagging data and on users’ current activity and click-through. In contrast to previous methods, which start recommending resources only after a tag has been searched, this recommender system
examines the user’s current activity in the SAS: by
following the user’s click-through, it anticipates the user’s interests and then recommends
suitable resources and tags. The learning graph used in this method
combines two similarity criteria, which makes the semantic relations between graph
vertices (resources and tags) more precise. In fact, both the relations among the available resources and the relations between resources and tags are considered in this graph. Another distinctive notion in
this study is the use of the heat diffusion algorithm in recommending resources to increase the
accuracy and diversity of the recommended resources. The system is able to recommend both
resources and tags; moreover, the proposed algorithm is not designed only for SAS but
is also applicable to common recommender systems, for example query recommender systems and photo recommender systems.
The proposed approach is run on the Delicious and Movielens SAS data sets. The results are
evaluated by comparing the resources and tags recommended by the system with the resources the user actually visited, and the success of the proposed system is measured with evaluation criteria.
The data set is divided into two parts, train and test, and the recommendations generated
from the training set are checked against the test set. The results reveal that the proposed
system improves the precision of recommendations considerably over former approaches.
FIGURE 3. Resource recommender (the user searches for a tag, then many resources related to the searched keywords
are recommended to him).
As mentioned before, we use the heat diffusion algorithm, which is inspired by nature; therefore, we were
interested in naming our system after a natural phenomenon. There is an appealing similarity
between this study, which follows a path toward a goal with sagacity (to convey a user to the
resources nearest to that goal in annotation systems), and the migration of the swallow,
a bird that uses heat to find its final destination. For this reason, to enliven this study
and to introduce the proposed system, the name “Swallow” was chosen, and its symbol is
used in the corresponding graphs.
The current article is organized as follows. Section 2 presents definitions and a literature
review. Section 3 introduces the proposed system (Swallow) and discusses
its design. Section 4 presents the implementation and evaluation results of the proposed system. Section 5 presents
conclusions and future improvements.
2. DEFINITIONS AND LITERATURE REVIEW
2.1. Definitions
2.1.1. Tagging Data Structure. As depicted in Figure 4, tagging data involve three connected
elements: users, tags, and documents. Tagging data can
be seen as a set of triples (Heymann et al. 2008; Guan et al. 2009; Markines et al. 2009). Each
triple (u, r, t) states that a user u has assigned tag t to resource r. To obtain weighted relations
between resources and tags, the triples can be summed over any one of the three elements (user, tag, or resource).
Thereafter,
(1) Summing over the users who have assigned a tag to a document gives the number of times
that the tag has been assigned to the document.
(2) Summing over the documents to which a specified tag has been assigned gives the
number of times that users have used that tag.
FIGURE 4. Presentation of upper level of tagging data structure (Guan et al. 2010).
In the proposed method (Swallow), counts similar to the two just mentioned are used
to create a weighted bipartite graph between resources and tags.
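The two counts described above can be sketched in a few lines. This is only an illustration: the triples and the variable names are hypothetical and are not taken from the Swallow implementation.

```python
from collections import Counter

# Hypothetical (user, resource, tag) triples; all names are illustrative only.
triples = [
    ("u1", "r1", "python"),
    ("u2", "r1", "python"),
    ("u2", "r1", "tutorial"),
    ("u1", "r2", "python"),
]

# Summing over users: how many times each tag was assigned to each resource.
resource_tag_count = Counter((r, t) for _, r, t in triples)

# Summing over resources: how many times each user has used each tag.
user_tag_count = Counter((u, t) for u, _, t in triples)
```

The first counter supplies candidate edge weights for a weighted resource-tag bipartite graph.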
2.1.2. Diffusion on the Graph. In this section, a diffusion model on graphs based on heat diffusion
is introduced. This model can be applied to both directed and undirected graphs. First,
we discuss how its parameters are derived from the graph structure. Then, an example
is presented.
2.1.2.1. Heat Diffusion. Heat diffusion is a physical phenomenon. Generally, heat
moves from a position of higher temperature to a position of lower temperature. Recently,
heat diffusion-based methods have been successfully applied in various domains, such as
classification and dimension reduction (Lafferty and Kondor 2002; Niyogi and
Belkin 2003; Lebanon and Lafferty 2005). Lebanon and Lafferty approximated the heat kernel in closed
form for the multinomial family, which improved on the Gaussian and linear kernels (Lebanon and Lafferty 2005). Kondor and Lafferty proposed a discrete heat diffusion kernel
for classification and showed that a simple diffusion kernel on a hypercube achieves acceptable
performance on this kind of data (Lafferty and Kondor 2002).
Belkin and Niyogi, in turn, used the heat kernel to weight a neighborhood graph and
applied it in a dimension reduction algorithm (Niyogi and Belkin 2003). Yang et al. (2007)
used heat diffusion to propose a ranking algorithm called DiffusionRank; their simulations indicate that this method is highly effective against web spam. In this article, we use
heat diffusion to find the similarity between the user’s click-through and previously annotated resources. In nature and physics, heat diffusion takes place on manifolds, but
we propagate it on a graph.
2.1.2.2. Heat Diffusion on Undirected Graph. Consider the undirected graph
$G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of graph vertices and
$E = \{(v_i, v_j) \mid \text{there is an edge between } v_i \text{ and } v_j\}$ is the set of edges. We regard the edge
$(v_i, v_j)$ as a pipe connecting the vertices $v_i$ and $v_j$. The value $f_i(t)$ is the amount of heat
at vertex $v_i$ at time $t$, starting from an initial distribution of heat given by $f_i(0)$ at time 0, and
$f(t)$ denotes the vector whose components are $f_i(t)$. The mathematical model follows the approach of Ma et al. (2012): suppose that at time $t$, the vertex
$v_i$ receives an amount $M(i, j, t, \Delta t)$ of heat from its neighbor vertex $v_j$ during a time period $\Delta t$.
The amount of exchanged heat should be related to $\Delta t$ and to the heat difference
$|f_i(t) - f_j(t)|$. Moreover, heat flows from the hotter vertex to the cooler one through the pipe connecting them. So
we can say

\[ M(i, j, t, \Delta t) = \alpha \left( f_j(t) - f_i(t) \right) \Delta t \tag{1} \]

where $\alpha$ is the thermal conductivity, that is, the heat diffusion coefficient. As a result, the change
in the amount of heat at vertex $v_i$ between times $t$ and $t + \Delta t$ equals the sum of the heat that this
vertex receives from all of its neighbors. Mathematically speaking, we have the following:

\[ \frac{f_i(t + \Delta t) - f_i(t)}{\Delta t} = \alpha \sum_{j : (v_j, v_i) \in E} \left( f_j(t) - f_i(t) \right) \tag{2} \]

where $E$ is the set of graph edges.
To find a closed-form solution, we rewrite the aforementioned equation in matrix form:

\[ \frac{f(t + \Delta t) - f(t)}{\Delta t} = \alpha (H - D) f(t) \tag{3} \]

where

\[ H_{ij} = \begin{cases} 1 & (v_i, v_j) \in E \text{ or } (v_j, v_i) \in E \\ 0 & \text{otherwise} \end{cases} \tag{4} \]

and

\[ D_{ij} = \begin{cases} d(v_i) & i = j \\ 0 & \text{otherwise} \end{cases} \tag{5} \]

in which $d(v_i)$ is the degree of the vertex $v_i$. From the definition, it is clear that $D$ is a
diagonal matrix. To obtain a more general representation, we normalize the elements of the matrices $H$ and $D$ by the
vertex degrees, so that the matrices become

\[ H_{ij} = \begin{cases} \dfrac{1}{d(v_j)} & (v_i, v_j) \in E \\ 0 & \text{otherwise} \end{cases} \tag{6} \]

and

\[ D_{ij} = \begin{cases} 1 & i = j \\ 0 & \text{otherwise} \end{cases} \tag{7} \]

In the limit $\Delta t \to 0$, the equation becomes

\[ \frac{d}{dt} f(t) = \alpha (H - D) f(t) \tag{8} \]

This is a linear differential equation in which the derivative of $f$ is proportional to $f$ itself,
so its solution is an exponential. At time $t = 1$ we have the following:

\[ f(1) = e^{\alpha (H - D)} f(0) \tag{9} \]

The matrix $e^{\alpha (H - D)}$ can be calculated using the following expansion:

\[ e^{\alpha (H - D)} = I + \alpha (H - D) + \frac{\alpha^2}{2!} (H - D)^2 + \frac{\alpha^3}{3!} (H - D)^3 + \frac{\alpha^4}{4!} (H - D)^4 + \cdots \tag{10} \]
FIGURE 5. Two simple heat diffusion examples on an undirected graph (Ma et al. 2012).
The matrix $e^{\alpha (H - D)}$ is called the diffusion kernel, in the sense that the heat diffusion process continues infinitely many times from the initial heat distribution. This property is important
when assigning heat values to the vertices of a graph. Ma et al. (2012) discuss heat diffusion on an
undirected graph and the time complexity of the related algorithm.
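The series expansion of the diffusion kernel can be evaluated numerically by truncating it after a fixed number of terms. The sketch below is a plain-Python illustration, not the authors' code; the function names and the default of 30 terms are assumptions, and a production system would more likely rely on a numerical library.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(h_minus_d, alpha=1.0, terms=30):
    """Approximate exp(alpha * (H - D)) by the first `terms` terms of the series."""
    n = len(h_minus_d)
    a = [[alpha * x for x in row] for row in h_minus_d]
    kernel = [[float(i == j) for j in range(n)] for i in range(n)]  # identity I
    term = [row[:] for row in kernel]  # holds A^k / k!, starting at k = 0
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return kernel


# Two vertices joined by one edge: H - D = [[-1, 1], [1, -1]].
two_node = [[-1.0, 1.0], [1.0, -1.0]]
kernel = diffusion_kernel(two_node)
```

Multiplying the resulting kernel by an initial heat vector f(0) gives the heat distribution at time 1.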
An example of heat diffusion follows. To interpret Equation (8) and the heat-diffusion process more intuitively, we construct a small undirected graph with only five
nodes, as shown in Figure 5(a). Initially, at time zero, suppose node 1 is given three units of
heat and node 2 is given two units of heat; then the vector $f(0)$ equals $[3, 2, 0, 0, 0]^T$. The
entries of the matrix $H - D$ are as follows:

\[ H - D = \begin{bmatrix} -1 & 1 & 1 & 1 & 1 \\ 1/4 & -1 & 0 & 0 & 0 \\ 1/4 & 0 & -1 & 0 & 0 \\ 1/4 & 0 & 0 & -1 & 0 \\ 1/4 & 0 & 0 & 0 & -1 \end{bmatrix} \]

Without loss of generality, we set the thermal conductivity $\alpha = 1$ and vary time $t$ from 0
to 1 with a step of 0.05. The curve of the amount of heat at each node over time is shown
in Figure 5(b). It can be seen that, as time passes, the heat sources node 1 and node 2
diffuse their heat to nodes 3, 4, and 5.
The heat of nodes 3, 4, and 5 increases, and the trends of their heat
curves are the same because these three nodes are symmetric in this graph.
Another example is shown in Figure 5(c). Initially, at time zero, suppose node 1 is given
four units of heat; then the vector $f(0)$ equals $[4, 0, 0, 0]^T$. The related heat curve is shown
in Figure 5(d). We can see that node 2, the node nearest to the heat source, gains more
heat than the other nodes. This also indicates that if a node has more paths connected to the heat
source, it will potentially obtain more heat. This is a perfect property for recommending
relevant nodes on a graph.
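The first example can be reproduced with a short simulation. The sketch below is an illustration under stated assumptions: it uses the normalization H_ij = 1/d(v_j) implied by the example matrix, integrates the diffusion equation with small explicit Euler steps instead of the matrix exponential, and the function names and step size are hypothetical.

```python
def build_h_minus_d(n, edges):
    """Return H - D for an undirected graph, with H_ij = 1/d(v_j) and D = I."""
    neighbors = {v: set() for v in range(n)}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # An isolated vertex neither gains nor loses heat.
        m[i][i] = -1.0 if neighbors[i] else 0.0
        for j in neighbors[i]:
            m[i][j] = 1.0 / len(neighbors[j])
    return m


def diffuse(m, f0, alpha=1.0, t_end=1.0, dt=0.001):
    """Integrate df/dt = alpha * (H - D) f with small explicit Euler steps."""
    f = list(f0)
    size = len(f)
    for _ in range(round(t_end / dt)):
        flow = [sum(m[i][j] * f[j] for j in range(size)) for i in range(size)]
        f = [f[i] + alpha * dt * flow[i] for i in range(size)]
    return f


# Star graph of Figure 5(a): node 1 (index 0) is linked to nodes 2 to 5.
star = build_h_minus_d(5, [(0, 1), (0, 2), (0, 3), (0, 4)])
heat = diffuse(star, [3.0, 2.0, 0.0, 0.0, 0.0])
```

Because every column of H - D sums to zero, the total heat stays at five units throughout, and the three symmetric leaf nodes end with identical heat values.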
2.2. Literature Review
Because developing new ideas in scientific research
depends on knowing their background thoroughly, the design of the
Swallow hybrid system owes much to ideas and studies from three research domains over the past
10 years:
(1) Recommender by using social tagging data
(2) Graph-based recommendation
(3) Heat diffusion–based recommendation
A brief look at these three domains helps in understanding the Swallow system.
2.2.1. A Recommender by Using Tagging Data. Recently, research interest has turned to
resource recommendation in SASs. Bogers and van den Bosch applied CiteULike data to recommending scientific resources (such as papers). They studied three
different CF algorithms and found that the user-based collaborative filtering algorithm
performs very well (CiteULike 2004). Moreover, a large number of studies
enhance older methods by means of tagging data (Flickr 2005; Resnick et al. 1994; Maes
and Shardanand 1995; Guan et al. 2010; Gemmell et al. 2012). De Gemmis et al. (2008)
utilized tagging data to enrich the content used by user-based recommender systems, treating
tags as part of the content of documents (Flickr 2005). Tso-Sutter et al. (2008) reduced the three
dimensions (user, item, and tag) of tagging data to two, so that tags are regarded either as users or as items, and then applied item-based and user-based CF
algorithms, respectively. They fused the scores computed by the CF algorithms through linear combination to generate the final recommendation (Heymann et al. 2008).
Nakamoto et al. (2008) modified the user-based CF algorithm to incorporate tagging data in each
step (Markines et al. 2009). Iofciu and Diederich (2006) created TF-IDF tag property vectors
and used them to measure user–user similarities in a user-based algorithm
(Sigurbjörnsson and van Zwol 2008). Wetzker modified probabilistic latent semantic analysis to model co-occurrence relations between users and resources and between resources and
tags (Wetzker et al. 2009). He assumed that these two types of co-occurrence
relations share the same set of latent topics, and resources are recommended to the user
according to their estimated probability. The five aforementioned
studies incorporate tagging data into older methods heuristically; they do not explore the structural information in tagging data. More recent studies use tagging data in a more structured way.
Shepitsen proposed a personalized recommendation algorithm for SASs based on
hierarchical tag clustering (Shepitsen et al. 2008). This algorithm requires
users to enter query tags to obtain recommendations. The linear hybrid method
is one of the most innovative methods for recommending resources in SAS. In this method, various two-dimensional combinations of (tag–resource–user) are used, and an optimal linear hybrid of
them is found by a hill-climbing approach; the proposed method of this study is compared with
this method (Gemmell et al. 2012). Another approach first inferred the user’s tag preferences and then computed item
preferences from the tag preferences to produce item recommendations. This method needs additional information beyond annotations (e.g.,
click information and rating information) and
is not limited to entered queries. Such click and rating information may be hard to obtain
because it does not usually exist in SAS data. This study instead uses a structured combination of
(tag–resource) and (resource–resource) relations, with the user’s click-through in place of the user’s
query.
Another problem related to SAS is tag recommendation. A tag recommender
differs from a resource recommender in that it suggests suitable tags for a resource (possibly with respect to the user’s own tags, i.e., personalization),
whereas a resource recommender proposes resources to the user (Lafferty and Kondor
2002; Niyogi and Belkin 2003; Lebanon and Lafferty 2005; Iofciu and Diederich 2006;
Yang et al. 2007; De Gemmis et al. 2008; Sigurbjörnsson and van Zwol 2008; van den Bosch
and Bogers 2008; Guan et al. 2009). Guan proposed a graph-based ranking approach for personalized
tag recommendation (Guan et al. 2009). The algorithm takes the user’s tagging
history and the target document as the query, ranks the tags
with respect to them, and recommends the highly ranked tags. However, this algorithm does not explicitly include the users, who are vital for resource recommendation.
Therefore, it cannot be directly applied to resource recommendation.
2.2.2. Graph-Based Recommender. These studies are related to graph-based subspace learning. Typical algorithms include Laplacian eigenmaps and locally linear
embedding (Niyogi and Belkin 2003; Guan et al. 2009). The basic idea of these algorithms is to construct an
affinity graph as an approximation to the underlying data manifold and to learn a
low-dimensional representation of the data by preserving the structure of the affinity graph.
In the learned space, the similarity between two arbitrary
samples can be estimated. Earlier graph-based subspace learning algorithms consider just one type of
data object, whereas the proposed algorithm deals with two kinds of related data. Recently,
manifold alignment has become an active research topic (Lafferty
and Kondor 2002). The problem is to “align” two data manifolds, corresponding to two types
of data objects, in a common space by means of pairwise correspondence examples between the two
types of objects. In another study, Guan addressed resource recommendation by using
a graph learning algorithm. He built two weighted bipartite graphs (user-tag and resource-tag), an unweighted user-resource graph, and finally a content-based graph for the
documents. From these, a semantic space for users, tags, and documents is learned that
best preserves the connectivity structure of the graphs and reveals the documents closest to a user
that the user has not yet tagged (Guan et al. 2010). In this article, the graph is constructed in a way similar to Guan
et al. (2010); to achieve better accuracy, a
resource-resource similarity graph is added to the resource-tag graph.
2.2.3. Recommender by Heat-Diffusion Algorithm. Recently, researchers have applied the heat-diffusion
algorithm in web recommender systems. Shang
et al. (2010) used diffusion to measure the similarity between users and concluded that
this similarity is much more effective than others (such as cosine similarity). The heat-diffusion
algorithm has also been used for suggesting friends in social networks. Zhang et al. (2010) first
created a tripartite graph (user-item-tag) from the web log file of a SAS. Then, taking the present user’s query as the heat source, they ran the heat-diffusion algorithm on
the graph and finally recommended the items with the highest heat (Zhang et al. 2007; Shang et al. 2010;
Guan et al. 2012; Lüa et al. 2012). Aarthi et al., presenting a general framework for mining
web graphs for recommendations with the heat-diffusion method, first proposed a recommendation algorithm that aggregates items from similar customers, eliminates items
the user has already rated, and recommends the remaining items to the user (Aarthi and
Sampath 2013). In this article, the user’s query is replaced by the user’s click-through.
Heat diffusion has also been used for query recommendation (Ma et al. 2012). In
this model, a bipartite graph is first built that represents the relation between
queries and links. Second, given the query searched by the user, a subgraph is
extracted from the main graph, and the heat-diffusion algorithm (with all nodes set to one
and the searched node set to zero) is executed on it. Finally, the high-heat nodes are
recommended to the user. The model used in that study has been applied in the Swallow system.
3. PROPOSED SYSTEM (SWALLOW)
In this section, the new system and its modules are defined in detail, and their
real-world implementation is discussed. In addition, the parameters required by the experiments
that assess the system’s results are introduced.
3.1. Introduction of “Swallow”
Swallow addresses the resource and tag recommendation problem in SASs. The system
consists of two separate parts, offline and online. In the first, offline phase,
a hybrid graph is created within a learning framework. This graph is a combination of the
weighted links between resources and tags. In Swallow, the link graph of resources
and tags is extracted implicitly and is combined with a graph that represents the
similarity of resources, created from users’ usage data.
In the online phase, by using the graph pattern obtained in the previous step, following the current user’s click-through, and performing heat diffusion on the graph, the
resources and tags that the user is likely to bookmark are predicted and recommended to him. In the following, the
details of the method and the proposed algorithm are described (the proposed system framework
is shown in Figure 6).
3.1.1. Introduction: Preprocessing. This section explains the starting point of Swallow’s
migration. This step only prepares the data for entry into the system.
FIGURE 6. The proposed system (Swallow) structure.
Swallow was run on users’ tagging data taken from the Delicious
and Movielens SASs (from October 20 to December 15, 2008) (Gemmell et al. 2012). Every
record of registered data (a user’s annotation) consists of the following items:
(1) Date on which the data were registered
(2) User identification (ID(u))
(3) The resource labeled by the user: a URL in Delicious or a movie in Movielens;
both are referred to as resources hereafter
(4) Tags
The information available in the data set can be represented by a quadruple (D, U, R, T), where
R is the resource (the URL available in the table) and T stands for the tag. In this research, to build the graph of
relations between resources and tags, the pair (R, T) was used, and the user-related information and
the registration time of the data were ignored.
At the beginning, the 300 users with the highest number of annotations among all
users were selected. Then, resources that had been used fewer than five times
were omitted.
3.1.1.1. Data Cleaning. In this part, unrelated entries are deleted from the data set
in the following four steps:
(1) Deleting the records without tags.
(2) Deleting the records with non-English tags.
(3) Deleting the records whose tags are in the blacklist (a list of stop words such as
and, or, the, etc.).
(4) Applying the p-core algorithm: the p-core parameter is
applied to the data set in the way that Gemmell et al. (2012) used it.
The p-core parameter guarantees that each user, page, and tag appears in at least P annotations of
the training database. Applying this parameter removes some data from the database;
to clarify its effect, its behavior can be expressed as the intersection of the following two
sets:
(1) Users who have labeled pages more than P times.
(2) Those users of the first set who have
labeled more than P pages.
As a result of this intersection, users who have labeled a single page more than P times
are not placed in the final list. In addition, pages that were labeled more than P times
but appear only once in the database, where this count is shared by more than one user,
do not appear either.
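A p-core filter is commonly implemented as a repeated pruning loop. The sketch below shows one possible form, operating directly on (user, resource, tag) triples; this is a simplifying assumption, and neither the function name nor the data comes from the Swallow implementation or from Gemmell et al. (2012).

```python
from collections import Counter


def p_core(annotations, p):
    """Keep only (user, resource, tag) triples whose user, resource, and tag
    each occur in at least p of the remaining triples."""
    current = list(annotations)
    while True:
        users = Counter(u for u, _, _ in current)
        resources = Counter(r for _, r, _ in current)
        tags = Counter(t for _, _, t in current)
        kept = [(u, r, t) for u, r, t in current
                if users[u] >= p and resources[r] >= p and tags[t] >= p]
        if len(kept) == len(current):
            return kept  # nothing was removed in this pass, so we are done
        current = kept


# Hypothetical annotations used only to exercise the function.
data = [("u1", "r1", "a"), ("u2", "r1", "a"),
        ("u1", "r2", "a"), ("u2", "r2", "a"),
        ("u3", "r3", "b")]
core = p_core(data, 2)
```

The loop has to repeat because removing one sparse triple can push another user, resource, or tag below the threshold.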
3.1.2. The First Step: Making Resources Similarity Graph. In this section, the resources similarity graph is built from
the previous users’ data registered in the log file: after the
preprocessing step, the similarity matrix is produced. The similarity matrix shows the correlation among the resources. Suppose there are $K$ resources. User activity is first recorded in an $M \times K$ usage matrix $R$,
where $M$ and $K$ are the number of users and resources, respectively. In this matrix, $R_{hi} = 1$ if resource $i$ has been used by user $h$, and $R_{hi} = 0$ otherwise. Each column of the matrix
is taken as a column vector $v_i = (R_{1i}, \ldots, R_{Mi})$, which states how
resource $i$ is used by each user.
The similarity between two resources can be calculated by set similarity and
Euclidean distance, and a total similarity is obtained by a proper weighting of these two
measures. These concepts are given as follows.
Set similarity:

\[ \mathrm{SetSim}(v_i, v_j) = \frac{|V_i \cap V_j|}{|V_i \cup V_j|} \tag{11} \]

Euclidean distance:

\[ \mathrm{ED}(v_i, v_j) = \sqrt{\sum_{k=1}^{M} (v_{ki} - v_{kj})^2} \tag{12} \]

Normalization:

\[ \mathrm{ND}(v_i, v_j) = 1 - \sqrt{\frac{\sum_{k=1}^{M} (v_{ki} - v_{kj})^2}{M}} \tag{13} \]

Total similarity:

\[ S_{ij} = W_{SS} \cdot \mathrm{SetSim}(v_i, v_j) + W_{ND} \cdot \mathrm{ND}(v_i, v_j) \tag{14} \]

and

\[ W_{SS} + W_{ND} = 1 \tag{15} \]

The total similarity is thus the weighted sum of the two preceding formulas. The final matrix $S$ is a $K \times K$ similarity matrix
whose component $S_{ij}$ represents the degree of
similarity between resources $i$ and $j$. The matrix $S$, whose rows and columns are both indexed by the resources $R_1, R_2, \ldots, R_K$, is represented as follows:

\[ S = \begin{bmatrix} S_{11} & S_{12} & \cdots & S_{1K} \\ S_{21} & S_{22} & \cdots & S_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ S_{K1} & S_{K2} & \cdots & S_{KK} \end{bmatrix} \tag{16} \]
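The set similarity, the normalized distance, and their weighted combination can be sketched as follows. The usage matrix is hypothetical, the equal weights (W_SS = W_ND = 0.5) are an arbitrary assumption, and the set V_i is taken to be the set of users with a nonzero entry in v_i.

```python
import math


def set_sim(vi, vj):
    """Set similarity of two binary usage vectors (Equation 11)."""
    both = sum(1 for a, b in zip(vi, vj) if a and b)
    either = sum(1 for a, b in zip(vi, vj) if a or b)
    return both / either if either else 0.0


def normalized_distance_sim(vi, vj):
    """One minus the root-mean-square difference of the vectors (Equation 13)."""
    m = len(vi)
    return 1.0 - math.sqrt(sum((a - b) ** 2 for a, b in zip(vi, vj)) / m)


def total_similarity(vi, vj, w_ss=0.5):
    """Weighted sum of both measures with W_SS + W_ND = 1 (Equations 14 and 15)."""
    return (w_ss * set_sim(vi, vj)
            + (1.0 - w_ss) * normalized_distance_sim(vi, vj))


# Hypothetical usage matrix: rows are users, columns are resources.
usage = [[1, 1, 0],
         [1, 0, 0],
         [0, 1, 1],
         [1, 1, 0]]
columns = list(zip(*usage))  # one usage vector per resource
S = [[total_similarity(ci, cj) for cj in columns] for ci in columns]
```

The resulting matrix is symmetric with ones on the diagonal, as expected of a similarity matrix over resources.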
Other similarity criteria, namely cosine similarity and the Pearson correlation coefficient (defined below), were also tried in place of the Euclidean distance.
The results achieved by these alternatives are compared in the evaluation section.
Cosine similarity:

\[ \mathrm{CS}(v_i, v_j) = \frac{\sum_{k=1}^{M} v_{ki} v_{kj}}{\sqrt{\sum_{k=1}^{M} v_{ki}^2} \, \sqrt{\sum_{k=1}^{M} v_{kj}^2}} \tag{17} \]

Pearson correlation coefficient similarity:

\[ \mathrm{PCC}(v_i, v_j) = \frac{n \sum V_i V_j - \sum V_i \sum V_j}{\sqrt{\left[ n \sum V_i^2 - \left( \sum V_i \right)^2 \right] \left[ n \sum V_j^2 - \left( \sum V_j \right)^2 \right]}} \tag{18} \]

where the sums run over the $n$ components of the vectors.
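Both alternative measures translate directly into code. The following sketch uses only the standard library; the function names are illustrative, and returning 0.0 for a zero denominator is an assumption made here to keep the functions total.

```python
import math


def cosine_similarity(vi, vj):
    """Cosine of the angle between two usage vectors (Equation 17)."""
    dot = sum(a * b for a, b in zip(vi, vj))
    norm = math.sqrt(sum(a * a for a in vi)) * math.sqrt(sum(b * b for b in vj))
    return dot / norm if norm else 0.0


def pearson_similarity(vi, vj):
    """Pearson correlation coefficient of two usage vectors (Equation 18)."""
    n = len(vi)
    sum_i, sum_j = sum(vi), sum(vj)
    num = n * sum(a * b for a, b in zip(vi, vj)) - sum_i * sum_j
    den = math.sqrt((n * sum(a * a for a in vi) - sum_i ** 2)
                    * (n * sum(b * b for b in vj) - sum_j ** 2))
    return num / den if den else 0.0
```

Unlike the set similarity, the Pearson coefficient can be negative, so it may need rescaling before it is used as an edge weight.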
3.1.3. The Second Step: Making the Bipartite Graph “Resource-tag.” To create the
resource-tag bipartite graph, an undirected bipartite graph $B_{rt} = (V_{rt}, E_{rt})$ is considered
such that $V_{rt} = R \cup T$, with $R = \{r_1, r_2, \ldots, r_M\}$ and $T = \{t_1, t_2, \ldots, t_K\}$, and the set of edges is
$E_{rt} = \{(r_i, t_j) \mid \text{there is an edge between } r_i \text{ and } t_j\}$. The edge $(r_i, t_j)$ exists if a user
has labeled the resource $r_i$ with the tag $t_j$. For instance, in Figure 7(a), the value on each edge
indicates how many times a tag has been attributed to a resource. This graph, which is extracted
from users’ clicks, cannot be used directly in the heat-diffusion algorithm because it
is undirected and does not precisely express the relations between resources and tags. Hence,
it is modified to the graph in Figure 7(b), in which each undirected edge of the
first graph is replaced by two directed edges. The weight of each edge of the directed resource-tag
graph is normalized by the number of times that the resource has
been labeled, and the weight of each edge of the (tag-resource) graph is normalized by the
number of times that the tag has been used.
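The information of this graph is stored in an $M \times K$ matrix, where $M$ is the number of
resources and $K$ is the number of tags. The final matrices $RT$ and $TR$, showing the relations between
resources and tags in the two directions, are as follows:

\[ RT = \begin{bmatrix} rt_{11} & rt_{12} & \cdots & rt_{1K} \\ rt_{21} & rt_{22} & \cdots & rt_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ rt_{M1} & rt_{M2} & \cdots & rt_{MK} \end{bmatrix} \tag{19} \]

\[ TR = \begin{bmatrix} tr_{11} & tr_{12} & \cdots & tr_{1M} \\ tr_{21} & tr_{22} & \cdots & tr_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ tr_{K1} & tr_{K2} & \cdots & tr_{KM} \end{bmatrix} \tag{20} \]

The rows of $RT$ correspond to the resources $R_1, \ldots, R_M$ and its columns to the tags $T_1, \ldots, T_K$; the rows of $TR$ correspond to the tags and its columns to the resources.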
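The two normalizations can be illustrated on a handful of annotation pairs. The data and variable names below are hypothetical, and nested dictionaries stand in for the RT and TR matrices.

```python
from collections import Counter, defaultdict

# Hypothetical (resource, tag) pairs taken from annotation records.
pairs = [("r1", "t1"), ("r1", "t1"), ("r1", "t2"), ("r2", "t1")]

counts = Counter(pairs)
resource_total = Counter(r for r, _ in pairs)  # times each resource was labeled
tag_total = Counter(t for _, t in pairs)       # times each tag was used

rt = defaultdict(dict)  # rt[r][t]: weight of the directed edge resource -> tag
tr = defaultdict(dict)  # tr[t][r]: weight of the directed edge tag -> resource
for (r, t), c in counts.items():
    rt[r][t] = c / resource_total[r]
    tr[t][r] = c / tag_total[t]
```

With this normalization the outgoing weights of every resource sum to one, and so do the outgoing weights of every tag.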
FIGURE 7. Making the graph in the second step: (a) resource-tag graph; (b) modified resource-tag bipartite
graph.
3.1.4. The Third Step: Combining Two Graphs. In this step, the two graphs described above
(resource-resource and resource-tag) are combined into a single graph,
as shown in Figure 8. Thus, after the heat-diffusion algorithm is run with a given
resource as the heat source, both similar resources and similar tags can be
predicted. This brings two similarity criteria into the
recommendation problem, which increases the accuracy of the proposed system.
The final obtained matrix is a combination of three matrices, S, RT, and TR.
$$
\text{Final matrix} =
\begin{bmatrix}
S_{11} & \cdots & S_{1m} & rt_{11} & \cdots & rt_{1k} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
S_{m1} & \cdots & S_{mm} & rt_{m1} & \cdots & rt_{mk} \\
tr_{11} & \cdots & tr_{1m} & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
tr_{k1} & \cdots & tr_{km} & 0 & \cdots & 0
\end{bmatrix}
$$
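The block layout of the final matrix can be sketched as follows. This is only an illustration under the assumption that S is M × M, RT is M × K, and TR is K × M, with a zero block for tag-to-tag relations; the function name `combine_graphs` is hypothetical.

```python
def combine_graphs(s, rt, tr):
    """Assemble the (M + K) x (M + K) block matrix [[S, RT], [TR, 0]].

    s  : M x M resource-resource similarity matrix
    rt : M x K resource -> tag weights
    tr : K x M tag -> resource weights
    """
    m, k = len(rt), len(tr)
    if len(s) != m or any(len(row) != m for row in s):
        raise ValueError("S must be M x M, matching the rows of RT")
    # Upper block rows: similarity to resources, then weights to tags.
    top = [list(s[i]) + list(rt[i]) for i in range(m)]
    # Lower block rows: weights to resources, then zeros (no tag-tag edges).
    bottom = [list(tr[j]) + [0.0] * k for j in range(k)]
    return top + bottom
```

With this ordering, row and column indices 0 to M-1 refer to resources and indices M to M+K-1 refer to tags.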
3.1.5. The Fourth Step: Determining the Initial Heat Resources. Here, according to the user’s current session and click-through, the observed and labeled resources are taken as the initial heat resources, and their heat is set to 1, whereas all other nodes are set to zero. To clarify, if the user annotates the resources r1, r2, r5 with t3, t7, t2, respectively, then the initial temperature of these nodes is set to 1 in the combined graph. The result is a vector f(0) whose number of elements equals the number of resources plus the number of tags; all of its values are zero except those of the mentioned nodes, which are 1.
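The example above can be written out as a short sketch. It assumes that resources occupy the first positions of the vector and tags follow them, and that the identifiers are 1-based as in r1 and t3; the function name `initial_heat` is hypothetical.

```python
def initial_heat(num_resources, num_tags, visited_resources, used_tags):
    """Return f(0): 1.0 for visited resources and used tags, 0.0 elsewhere.

    Resources are assumed to come first in the vector, followed by tags.
    Identifiers are 1-based (r1 -> position 0, t1 -> position num_resources).
    """
    f0 = [0.0] * (num_resources + num_tags)
    for r in visited_resources:
        f0[r - 1] = 1.0
    for t in used_tags:
        f0[num_resources + t - 1] = 1.0
    return f0


# The user annotates r1, r2, r5 with t3, t7, t2 in a graph of
# 5 resources and 7 tags (sizes chosen only for illustration).
f0 = initial_heat(5, 7, [1, 2, 5], [3, 7, 2])
```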
3.1.6. The Fifth Step: Determining a Proper Subgraph. To reduce the running time of the heat-diffusion algorithm, the size of the final graph was decreased. To this end, an algorithm was implemented that extracts from the main graph only the initial heat resources and the nodes related to them; all other nodes are omitted. The algorithm is shown in Table 1.
FIGURE 8. Combining resource-tag graph and resource-similarity graph.
TABLE 1. Making a subgraph for heat-diffusion algorithm’s entry.
Algorithm 1. Making subgraph $\hat{G}$ (G, R, T)
1. Inputs: the graph $G = (V, E)$, where V is the set of vertices, composed of resources and tags; $R = \{r_1, \ldots, r_k\}$, the visited resources; and $T = \{t_1, \ldots, t_j\}$, the visited tags.
2. For each resource and tag in R and T, a subgraph is constructed by using depth-first search in G. The search stops when the number of nodes is larger than a predefined number.
3. All the above subgraphs are combined together so that all visited vertices are placed in $\hat{V}$ and the selected edges in $\hat{E}$.
4. Graph $\hat{G} = (\hat{V}, \hat{E})$ is the appropriate subgraph.
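A minimal sketch of the idea in Algorithm 1 follows. It assumes the graph is stored as a dictionary of weighted adjacency dictionaries and that each search stops as soon as it has collected `max_nodes` vertices; the names `bounded_dfs`, `make_subgraph`, and `max_nodes` are hypothetical.

```python
def bounded_dfs(adj, start, max_nodes):
    """Depth-first search from start, stopping once max_nodes are collected."""
    seen = set()
    stack = [start]
    while stack and len(seen) < max_nodes:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        # Push neighbors in reverse sorted order so they are visited in sorted order.
        for neighbor in sorted(adj.get(node, {}), reverse=True):
            if neighbor not in seen:
                stack.append(neighbor)
    return seen


def make_subgraph(adj, heat_sources, max_nodes):
    """Union the bounded searches and keep only edges between kept vertices."""
    kept = set()
    for source in heat_sources:
        kept |= bounded_dfs(adj, source, max_nodes)
    return {u: {v: w for v, w in adj.get(u, {}).items() if v in kept}
            for u in kept}
```

The bound is applied per search, so the combined subgraph may hold up to `max_nodes` vertices for each initial heat resource.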
TABLE 2. Swallow recommendation algorithm.
Algorithm 2. Making the recommendation set (G, R, T)
1. Input: a hybrid graph $G = (V^{+} \cup V, E^{+} \cup E)$, where the vertices are composed of two sets, the resources $V^{+} = \{r_1, \ldots, r_n\}$ and the tags $V = \{t_1, \ldots, t_n\}$, and the edges include the bidirectional weighted edges $E$, showing the resource-tag relations, and $E^{+}$, which contains the edges showing the relations between resources.
2. Taking as input the sets $R = \{r_1, \ldots, r_k\}$ and $T = \{t_1, \ldots, t_j\}$ of visited resources and annotated tags, a subgraph of G is made by means of Algorithm 1.
3. Let $\alpha = 1$ and let the initial temperature of R and T be 1, i.e., $f_0(r_1, \ldots, r_k) = 1$ and $f_0(t_1, \ldots, t_j) = 1$.
4. The heat-diffusion algorithm begins by taking $f(1) = e^{\alpha (H - D)} f(0)$.
5. From the final $f(1)$, the 10 highest temperatures among the resources and the 10 highest among the tags are suggested to the user.
3.1.7. The Sixth Step: Resources Recommendation. In the final step, the heat-diffusion algorithm was performed on the final subgraph, and after some time the nodes with the highest temperature were suggested to the user as resources and tags of interest. This was performed with respect to Equations (9) and (10), by which the vector f(1) was computed. From f(1), the 10 tags and the 10 resources with the largest values that had not yet been observed by the user were proposed to him. The algorithm is shown in Table 2.
Swallow thus suggests both resources and tags that are likely to match the user’s click-through.
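The diffusion and selection steps can be sketched as below. This is not the authors' implementation: the argument `generator` stands for the matrix written as H − D in Table 2, whose exact construction follows Equations (9) and (10) and is not reproduced here, and the matrix exponential is approximated by a truncated Taylor series, which is an assumption of this sketch. The names `diffuse`, `top_unseen`, and the "r"/"t" label prefixes are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def diffuse(generator, f0, alpha=1.0, terms=30):
    """Approximate f(1) = exp(alpha * generator) f(0) by a Taylor series."""
    result = list(f0)
    term = list(f0)
    for n in range(1, terms):
        # term becomes (alpha * generator)^n f(0) / n!
        term = [alpha * value / n for value in mat_vec(generator, term)]
        result = [r + t for r, t in zip(result, term)]
    return result


def top_unseen(f1, labels, seen, prefix, n=10):
    """Return up to n hottest unseen nodes whose label starts with prefix."""
    candidates = [(heat, label) for heat, label in zip(f1, labels)
                  if label.startswith(prefix) and label not in seen]
    candidates.sort(key=lambda pair: -pair[0])
    return [label for _, label in candidates[:n]]
```

Calling `top_unseen` once with the prefix "r" and once with "t" yields the two lists of 10 resources and 10 tags described in step 5 of Table 2.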
4. ESTIMATION AND EVALUATION
4.1. Evaluation Criteria
In this study, a common method, the linear hybrid method, was used as the baseline for evaluating Swallow. In most methods, the efficacy of recommendations is assessed with respect to precision and coverage.
Precision is a criterion for assessing the correctness of the recommended pages, and coverage is a criterion for evaluating the system’s capability of including those items that the user tends to visit.
Precision is defined as the ratio of the number of items common to the system’s recommended data, A, and the user’s observed data, B, to the number of the system’s recommended pages, |A|.
$$
\text{Precision} = \frac{|A \cap B|}{|A|} \tag{21}
$$
Coverage is defined as the ratio of the number of items common to the system’s recommended data, A, and the user’s observed data, B, to the number of the user’s observed pages, |B|.
$$
\text{Coverage} = \frac{|A \cap B|}{|B|} \tag{22}
$$
The precision and coverage calculated over the interactions of the test data set were averaged, and the system efficiency was assessed. After preparation, the data sets were divided into a train set and a test set such that 70% of the users were in the train set and 30% in the test set. Then, the first 50% of the test-set users’ data was used to generate the recommendations, the resources in the remaining 50% were compared with the recommended resources, and the precision and coverage were computed.
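Equations (21) and (22) translate directly into set operations. The sketch below treats the recommended items A and the observed items B as sets; returning 0.0 for an empty denominator is an assumption of the sketch, and the page identifiers are hypothetical.

```python
def precision(recommended, observed):
    """|A intersect B| / |A|, with A the recommended and B the observed items."""
    a, b = set(recommended), set(observed)
    return len(a & b) / len(a) if a else 0.0


def coverage(recommended, observed):
    """|A intersect B| / |B|, with A the recommended and B the observed items."""
    a, b = set(recommended), set(observed)
    return len(a & b) / len(b) if b else 0.0


# Four recommended pages, of which two appear among three observed pages.
recommended_pages = ["p1", "p2", "p3", "p4"]
observed_pages = ["p2", "p4", "p9"]
p = precision(recommended_pages, observed_pages)
c = coverage(recommended_pages, observed_pages)
```

In this example two of the four recommended pages were observed, so precision is 2/4, and two of the three observed pages were recommended, so coverage is 2/3.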
4.2. The Effect of Resources Similarity Criterion on Swallow
To calculate the degree of resource similarity, the three similarity criteria mentioned in Section 3.1.3 were used: weighted combination of Euclidean distance, set similarity, cosine similarity, and Pearson similarity. The results of running the algorithm with each criterion are compared in Figure 9. As can be seen, the Pearson correlation coefficient gave better results than the other criteria. Therefore, Pearson correlation, having the highest precision, was applied for the final computation of web page similarity. As Figure 9 shows, in most cases cosine similarity and Pearson similarity resulted in nearly the same recommendation precision, but as the number of recommended pages increased, the Pearson correlation coefficient had higher precision.
The results show that the precision of the resource similarity measure directly affects the precision of Swallow. For this reason, it seems reasonable for researchers to use this system.
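For readers comparing the two closest criteria, the textbook forms of cosine similarity and the Pearson correlation coefficient are sketched below. The exact feature vectors used in Section 3.1.3 are not restated here, so the inputs are assumed to be equal-length numeric profiles of two resources; the function names are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def pearson_similarity(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x],
                             [b - mean_y for b in y])
```

The two measures differ only in the mean-centering step, which is consistent with their similar precision in most cases: for the profiles [1, 2, 3] and [3, 2, 1], cosine similarity stays positive while the Pearson coefficient is negative.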
4.3. Resources Recommendation Evaluation
4.3.1. Comparing the Proposed Method with the Linear Hybrid Methods (Resource Recommendation). In this section, the results obtained by Swallow for resource recommendation are compared with those of the linear combination method (Gemmell et al. 2012), which addresses resource recommendation in SAS. In that method, a three-dimensional matrix (user, resource, and tag) is first produced, and, to score the resources, an optimal linear hybrid is calculated from multiple two-dimensional projections of the three-dimensional matrix. In this part, to determine the effect of the number of recommended pages on the assessment criteria, precision and coverage were computed for different numbers of recommended pages. The number of recommended pages was varied from 1 to 10, and for each value the precision of the system was calculated. As Figure 10 shows, Swallow reaches its maximum precision at RP = 1, and in all cases its recommendations are more accurate than those of the linear hybrid system. As stated earlier, the proposed system was evaluated on two data sets. In both data sets (Movielens and Delicious), the accuracy of Swallow is higher than that of the linear hybrid method.
FIGURE 9. The effect of resources similarity on Swallow.
FIGURE 10. Comparing Swallow precision and linear hybrid precision in resource recommendation with respect to the number of recommended pages (RP) in Delicious and Movielens data set.
FIGURE 11. Comparing Swallow coverage and the linear hybrid coverage in resource recommendation
with respect to the number of recommended pages (RP) in Movielens and Delicious.
FIGURE 12. Comparing precision of Swallow and linear hybrid systems in tag recommendation considering the number of recommended pages (RP). (Plot axes: Precision (%) versus Recommended Page Count (RP), 1 to 10; series: Swallow and Linear Hybrid.)
Comparing the coverage of Swallow with that of the linear hybrid system, as depicted in Figure 11, shows that Swallow achieves higher coverage than the linear hybrid system plotted alongside it.
FIGURE 13. Comparing coverage of Swallow and linear hybrid systems in tag recommendation considering the number of recommended pages (RP). (Plot axes: Coverage (%) versus Recommended Pages Count (RP), 1 to 10; series: Swallow and Linear Hybrid.)
4.4. Evaluating Tag Recommendation
4.4.1. Comparing Swallow System with Linear Hybrid Method (Tag Recommendation). As mentioned earlier, when a user wants to annotate a resource with a tag in the SAS, Swallow can also predict tags. Here, the proposed system was compared with the linear hybrid method for tag recommendation by Gemmell et al. (2011). Precision and recall (coverage) were again calculated, this time for tag recommendation. Swallow proved more accurate, with higher coverage and precision (as shown in Figures 12 and 13).
5. CONCLUSION AND FUTURE DEVELOPMENTS
5.1. Conclusion
In this article, a new recommendation system based on the heat-diffusion algorithm has been developed to recommend resources and tags. Because it relies on a natural phenomenon (heat diffusion) and resembles the route-finding instinct of migrating birds, the proposed system was named Swallow. By observing the online user’s click-through and referring to previous users’ tagging data, Swallow helps SAS to identify the interests of current users and to present matching items to them. In this study, first, the similarity of resources was obtained as a graph by using the information in the log files. Then, this graph was enhanced with the available relations between resources and the tags annotated on the website by the users. The obtained pattern is Swallow’s migration map. Swallow checks the online user’s behavior, heats the visited points on its pattern, and, by following the heat-diffusion process in the graph, reaches the resources the user is aiming for. These are then presented to the user in order of priority. Analyzing and assessing Swallow on the precision and coverage criteria and comparing it with the linear hybrid method reveals an improvement in both criteria.
To develop Swallow further and bring it closer to a thorough, useful, and modern system, the following directions are proposed:
(1) According to the assessment results in Figure 9, a single one of the evaluated criteria is used in Swallow for computing resource similarity. It is therefore suggested that context-based methods be used to increase similarity precision.
(2) Because adding the resource similarity graph to the bipartite resource-tag graph helps to improve the precision of the proposed system, utilizing other relations, such as tag similarity, can also be helpful.
(3) Because speed is an important factor in these systems, one proposed improvement is to apply clustering approaches that group resources and tags into clusters. Instead of heating a single resource or tag in the hybrid graph, the system could heat the cluster most similar to the considered items. This not only improves speed but also addresses the cold-start problem: if a selected resource or tag is not available as an initial heat resource, the available resources in its most similar cluster can be heated instead.
REFERENCES
AARTHI, S., and S. SAMPATH. 2013. A heat diffusion method for mining web graphs for recommendations using
recommendation algorithm. International Journal of Engineering Research and Technology (IJERT), 2(8):
961–966.
CITEULIKE. 2004. Available at: http://www.citeulike.org.
DE GEMMIS, M., P. LOPS, G. SEMERARO, and P. BASILE. 2008. Integrating tags in a semantic content-based
recommender. In Proceedings of the 2008 ACM Conference on Recommender Systems. ACM: New York,
pp. 163–170.
DELICIOUS. 2003. Available at: http://delicious.com.
FLICKR. 2005. Available at: http://www.flickr.com.
GEMMELL, J., T. SCHIMOLER, B. MOBASHER, and R. BURKE. 2011. Tag-based resource recommendation in
social annotation applications. In Proceedings of the 19th International Conference on User Modeling,
Adaption, and Personalization. Springer-Verlag: Berlin, Heidelberg, pp. 111–122.
GEMMELL, J., T. SCHIMOLER, B. MOBASHER, and R. BURKE. 2012. Resource recommendation in social
annotation systems: a linear-weighted hybrid approach. Journal of Computer and System Sciences, 78(4):
1160–1174.
GUAN, Z., J. BU, Q. MEI, C. CHEN, and C. WANG. 2009. Personalized tag recommendation using graph-based
ranking on multi-type interrelated objects. ACM: New York, pp. 540–547.
GUAN, Z., C. WANG, J. BU, C. CHEN, K. YANG, D. CAI, and X. HE. 2010. Document recommendation in social
tagging services. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC,
pp. 391–400.
GUAN, Z., G. MIAO, R. MCLOUGHLIN, X. YAN, and D. CAI. 2012. Co-occurrence-based diffusion for expert
search on the web. IEEE Transactions on Knowledge and Data Engineering, 25(99): 1001–1014.
HAM, J., D. D. LEE, and L. SAUL. 2005. Semisupervised alignment of manifolds. In Proceedings of the Annual
Conference on Uncertainty in Artificial Intelligence, Edinburgh, UK, pp. 120–127.
HEYMANN, P., D. RAMAGE, and M. GARCIA-MOLINA. 2008. Social tag prediction. In Proceedings of the 31st
annual international ACM SIGIR conference. ACM: New York, pp. 531–538.
IOFCIU, T., and J. DIEDERICH. 2006. Finding communities of practice from user profiles based on folksonomies.
In Proceedings of the 1st International Workshop on Building Technology Enhanced Learning Solutions
for Communities of Practice, EC-TEL, Crete, Greece, pp. 308–410.
KARYPIS, G. 2001. Evaluation of item-based top-n recommendation algorithms. In Proceedings of the Tenth
International Conference on Information and Knowledge Management. ACM: New York, pp. 247–254.
LAFFERTY, J. D., and R. I. KONDOR. 2002. Diffusion kernels on graphs and other discrete input spaces. In ICML
’02 Proceedings of the Nineteenth International Conference on Machine Learning. Morgan Kaufmann: San
Francisco, pp. 315–322.
LEBANON, G., and J. D. LAFFERTY. 2005. Diffusion kernels on statistical manifolds. Journal of Machine
Learning Research, 6: 129–163.
LI, X., S. LIN, S. YAN, and D. XU. 2008. Discriminant locally linear embedding with high-order tensor data.
IEEE Transactions on Systems, Man, and Cybernetics, Part B, 38: 342–352.
LÜ, L., M. MEDO, C. H. YEUNG, Y.-C. ZHANG, Z.-K. ZHANG, and T. ZHOU. 2012. Recommender systems.
Physics Reports, 519: 1–49.
MA, H., I. KING, and M. R. LYU. 2012. Mining web graphs for recommendations. IEEE Transactions on
Knowledge and Data Engineering, 24: 1051–1064.
MAES, P., and U. SHARDANAND. 1995. Social information filtering: algorithms for automating “word of
mouth.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM Press/
Addison-Wesley: New York, pp. 210–217.
MARKINES, B., C. CATTUTO, F. MENCZER, D. BENZ, A. HOTHO, and G. STUMME. 2009. Evaluating similarity
measures for emergent-semantics of social tagging. In Proceedings of the 18th International Conference on
World Wide Web, Madrid, pp. 641–650.
MIN, W., K. LU, and X. HE. 2004. Locality pursuit embedding. Pattern Recognition, 34: 781–788.
NAKAMOTO, R. Y., S. NAKAJIMA, J. MIYAZAKI, S. UEMURA, H. KATO, and Y. INAGAKI. 2008. Reasonable
tag-based collaborative filtering for social tagging systems. In Proceeding of the 2nd ACM Workshop on
Information Credibility on the Web. ACM: New York, pp. 11–18.
NIYOGI, P., and M. BELKIN. 2003. Laplacian eigenmaps for dimensionality reduction and data representation.
Neural Computation, 15: 1373–1396.
RESNICK, P., N. IACOVOU, M. SUCHAK, P. BERGSTROM, and J. RIEDL. 1994. Grouplens: an open architecture
for collaborative filtering of netnews. In Proceedings of ACM 1994 Conference on Computer Supported
Cooperative Work. ACM: New York, pp. 175–186.
SAUL, L., and S. ROWEIS. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:
2323–2326.
SHANG, M. S., Z.-K. ZHANG, T. ZHOU, and Y.-C. ZHANG. 2010. Collaborative filtering with diffusion-based
similarity on tripartite graphs. Physica A, 389: 1259–1264.
SHEPITSEN, A., J. GEMMELL, B. MOBASHER, and R. BURKE. 2008. Personalized recommendation in
social tagging systems using hierarchical clustering. In Proceedings of the 2008 ACM Conference on
Recommender Systems. ACM: New York, pp. 259–266.
SIGURBJÖRNSSON, B., and R. VAN ZWOL. 2008. Flickr tag recommendation based on collective knowledge. In
Proceedings of the 17th International Conference on World Wide Web. ACM: New York, pp. 327–336.
TSO-SUTTER, K., L. B. MARINHO, and L. SCHMIDT-THIEME. 2008. Tag-aware recommender systems by fusion
of collaborative filtering algorithms. In Proceedings of the 2008 ACM Symposium on Applied Computing.
ACM: New York, pp. 1995–1999.
VAN DEN BOSCH, A., and T. BOGERS. 2008. Recommending scientific articles using CiteULike. In Proceedings
of the 2008 ACM Conference on Recommender Systems. ACM: New York, pp. 287–290.
WETZKER, R., W. UMBRATH, and A. SAID. 2009. A hybrid approach to item recommendation in folksonomies.
In Proceedings of the WSDM’09 Workshop on Exploiting Semantic Annotations in Information Retrieval.
ACM: New York, pp. 25–29.
YANG, H., I. KING, and M. R. LYU. 2007. DiffusionRank: a possible penicillin for web spamming. In SIGIR
’07 Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval. ACM: New York, pp. 431–438.
ZHANG, Y.-C., M. BLATTNER, and Y.-K. YU. 2007. Heat conduction process on community networks as a
recommendation model. Physical Review Letters, 99(14): 1–5.
ZHANG, Z.-K., T. ZHOU, and Y.-C. ZHANG. 2010. Personalized recommendation via integrated diffusion on
user–item–tag tripartite graphs. Physica A: Statistical Mechanics and Its Applications, 389: 179–186.
