Personalized Resource Categorisation in Folksonomies

Muzaffer Ege Alper
Şule Gündüz Öğüdücü
Faculty of Computer and Informatics
Istanbul Technical University
Maslak, Istanbul, Turkey
Faculty of Computer and Informatics
Istanbul Technical University
Maslak, Istanbul, Turkey
[email protected]
[email protected]
ABSTRACT
Folksonomies constitute an important type of Web 2.0 service, where users collectively annotate (or “tag”) resources to create custom categories. The semantic relations among these categories hint at the possibility of another categorization at a higher level. Discovering these more general categories, called “topics”, is an important task. One problem is to discover these semantically coherent topics and the accompanying small sets of tags that cover these topics in order to facilitate more detailed item search. Another important problem is to find words/phrases that describe these topics, i.e. labels or “meta-tags”. These labeled topics can immensely increase the item search efficiency of users in a folksonomy service. However, this possibility has not been sufficiently exploited to date. In this paper, a probabilistic model is used to identify topics in a folksonomy, which are then associated with relevant, descriptive meta-tags. In addition, a small set of diverse and relevant tags is found which covers the semantics of the topic well. The resulting topics form a personalized categorization of folksonomy data due to the personalized nature of the model employed. The results show that the proposed method is successful at discovering important topics and the corresponding identifying meta-tags.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Text Mining
General Terms
Algorithms
Keywords
Statistical Topic Models, Topic Model Labeling, Meta-Tag
Generation, Personalized Information Retrieval
1. INTRODUCTION
(c) 2012 Association for Computing Machinery. ACM acknowledges that
this contribution was authored or co-authored by an employee, contractor
or affiliate of the national government of Turkey. As such, the government
of Turkey retains a nonexclusive, royalty-free right to publish or reproduce
this article, or to allow others to do so, for Government purposes only.
MDS August 12-16 2012, Beijing, China
Copyright 2012 ACM 978-1-4503-1546-3/12/08 . . . $15.00.
One of the core challenges of Data Mining is to turn vast amounts of data into manageable information. Web 2.0 services have immensely increased the acquisition of data, increasing the demand for more advanced methods to mine information from them. Folksonomies constitute one type of such services, where users collectively annotate (or “tag”) resources to create custom categories. These tags, however, are semantically related. These relations can be exploited to form higher level categorizations, also called “topics”. One approach to form topics would be to employ dictionaries to find clusters of semantically related tags. However, another approach, called topic modeling, has recently gained popularity, where the semantic relations among the words are assumed to show themselves in co-occurrences of the words in documents. Latent Dirichlet Allocation (LDA) is a particularly successful method of this kind [2]. The semantic relations (hypernymy, synonymy, and others) among tags can then be discovered and exploited further using the resulting tag-topic distribution as in [13].
Another important goal is to find identifying words/phrases for the discovered topics, that is, the topic labeling problem [8, 10]. It has gained more attention in recent years due to its broad application areas in text mining and information retrieval, such as multi-document summarization [4] and opinion mining [9]. The corresponding labels are useful as a summary of each topic and also make the topic model more transparent to the user [8]. Most of these models extract the labels from text collections according to the contents of these collections, without considering the needs of users. However, a label for a topic may be very difficult for one user to understand whereas it could be very meaningful for another user. Thus, in folksonomies it is highly desirable to automatically generate meaningful personalized labels for a topic.
In the context of folksonomies, labels also enable categorization of resources, so that resources tagged “computer”
and “informatics” may fall in the same category. So, in a
sense, tags are tools for organizing resources and meta-tags
are generalizations of them. This higher level categorization can immensely increase the item search efficiency in a
folksonomy service. Since the computed categorizations are
considered as an aid to the users to locate their resources
of interest faster, construction of personalized categories is
also crucial. For this reason, different usage patterns of users
should also be taken into account in the categorization process. However, to the best of our knowledge, this possibility
has not been sufficiently exploited to date.
A more serious problem with topic labeling is that the methods found in the literature are not applicable to the problem of meta-tag generation in folksonomies. As noted in other works, multiple words/phrases are necessary for identifying names, since single words are usually too broad in meaning or because a single word is usually found inside a longer phrase [10]. However, their solution is also not suitable because they assume that there is an underlying document set behind the bag-of-words dataset employed in the topic modeling process. Contents of these documents are used to find identifying phrases for topics. However, the tagging data in folksonomies naturally come in a bag-of-words format, and this bag-of-words representation is not derived directly from the contents of documents/resources. Rather, it reflects the interest of the user in a resource as well as the personal understanding of the resource. Besides, many folksonomy services have resources which are not text based (video, picture, animation, ...). Even for bookmarking services such as “Del.icio.us”1, whose resources are websites, collection of content is problematic due to Flash-based websites, video hosting sites with little (or noisy) textual content, etc. For these reasons, it is desirable to construct meta-tags using only the bag-of-words data obtained from the user annotations of the resources.
In this paper, a probabilistic model, called Latent Interest
Model (LIM) [1], is used to identify topics in a folksonomy.
This model is similar to the method proposed in [15]; but
also has important differences as mentioned in Section 2.3.
The success of LIM as a personalized probabilistic model
of folksonomy was previously shown in the context of item
recommendation [1], where the items are tags or resources.
Here, it is employed to discover personalized categorization
(taxonomy) of resources in folksonomies.
The contribution of this paper is threefold. First, a compact and more informative representation of topics is proposed (using the criteria of minimum redundancy and maximum relevance [12]) instead of the usual listing of the top-K most probable words per topic. Second, a method to assign clear, identifying words/phrases to topics without assuming an underlying textual dataset is proposed. Finally, these methods, together with the application of LIM, are employed to discover personalized resource categorizations in folksonomies. The resulting categorisation has two levels: the level of topics (meta-tags) and the lower level of tags in a given topic (see the second stage in Figure 2). The quality of the generated meta-tags and the resource categorization are also demonstrated with a user study. Note that the proposed methods can also be employed in any other application of topic models.
The paper is organized as follows: in Section 2, formal definitions of a folksonomy and the topic labeling problem are given together with a brief discussion of the particular personalized topic model employed. Section 3 explains the proposed solution to the labeling problem in detail. Section 4 demonstrates the efficiency of the proposed method both qualitatively and quantitatively. Section 5 reviews the related work and, finally, Section 6 concludes the paper.
2. BACKGROUND
This section provides the necessary background on the problem of meta-tag generation in folksonomies. First the notion of a folksonomy is discussed and the problem of discovering meta-tags in a folksonomy is presented. Finally, the particular model employed in this work, LIM, a personalized probabilistic model for folksonomy data previously proposed in [1], is explained.

Table 1: A sample topic from LIM
Tag (w)        Probability (p(w|z))
writing        0.248
reference      0.066
grammar        0.055
english        0.054
language       0.043
quotes         0.037
literature     0.026
copywriting    0.021
resources      0.019
tools          0.019

1 http://delicious.com/
2.1 Folksonomies
A folksonomy can be formally defined as a quadruplet
of sets, (T, U, R, P ) where T is the set of all possible tag
words, U is the set of all users, R is the set of all resources
and P ⊆ T × U × R, is the set of annotations. Elements of
P are called “tagging triplets” or just “taggings” for short.
We say a user u bookmarked a resource r if there exists a
triplet (t, u, r) ∈ P for some t ∈ T . Note that there are also
other rare cases where a user bookmarks a resource without
tagging it. In this case, we assume that a tagging exists but
the tag is unknown.
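The definition above can be sketched as a small data structure. This is only an illustrative sketch: the names `Tagging` and `bookmarked`, the sample users and tags, and the use of `None` for an unknown tag are assumptions made here, not part of the original model description.

```python
from typing import NamedTuple, Optional


class Tagging(NamedTuple):
    """One element of P, a (tag, user, resource) triplet."""
    tag: Optional[str]  # None stands for "a tagging exists but the tag is unknown"
    user: str
    resource: str


def bookmarked(P, user, resource):
    """True if some triplet (t, user, resource) exists in P for any tag t."""
    return any(p.user == user and p.resource == resource for p in P)


# A toy set of annotations P (hypothetical data).
P = {
    Tagging("python", "alice", "projecteuler.net"),
    Tagging("math", "alice", "projecteuler.net"),
    Tagging(None, "bob", "delicious.com"),  # bookmarked without a known tag
}

# The remaining components of the quadruplet (T, U, R, P) follow from P.
T = {p.tag for p in P if p.tag is not None}
U = {p.user for p in P}
R = {p.resource for p in P}

# A bookmark is a distinct (user, resource) pair; one bookmark can carry
# several taggings, which is why taggings outnumber bookmarks in practice.
bookmarks = {(p.user, p.resource) for p in P}
```

Under this reading, the three taggings above correspond to only two bookmarks.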
2.2 Topic Labeling
Topic models are probabilistic models of textual data (in bag-of-words format), which are stated in terms of conditional distributions p(w|z) of words (w) given topics (z) and distributions of topics themselves as p(z). It is generally assumed that a successful topic model gives most of the probability weight to semantically related words. Recently, there have been attempts to use probabilistic models in topic extraction. Common methods are probabilistic Latent Semantic Analysis (pLSA, also known as pLSI) [5] and LDA [2]. The intuition behind these models is that the observed texts are derived from a generative model. In such a model, unseen latent variables (the topics) are represented as random variables over the set of words. The words and the documents in the data set are generated from these latent variables.
Table 1 shows a sample topic with probability values computed using LIM, which is also a probabilistic topic model.
Note that only the first ten words are listed. This topic is obviously related to English grammar with related references,
resources and tools.
The problem of topic labeling, or, in the particular case of folksonomies, meta-tag generation, is to find words/phrases that are representative of the topic, i.e. that summarize the information in the conditional word probability distribution p(w|z).
The solution to this problem is composed of two steps [8, 10]. The first step is to select or generate a list of candidate labels. Subsequently, the best candidate label is selected using a score function. The proposed method considers only the words in the topic model; as a result, the scoring function has a simple form, which is the product of the conditional probabilities p(w|z) of the two tags that make up a given label (a word/tag pair). The contribution of this paper is mainly in finding non-redundant and descriptive candidate labels. As an example, the method proposed in this paper assigns “writing/grammar” to the topic shown in Table 1.
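To illustrate why candidate generation matters more than the score function, the following sketch applies the product score to the first five tags of Table 1. The helper name `label_score` and the idea of scoring every possible pair are assumptions made for illustration only; the paper restricts scoring to a smaller candidate set.

```python
from itertools import combinations

# First five entries of Table 1: p(w|z) for the sample topic.
p_w_given_z = {
    "writing": 0.248,
    "reference": 0.066,
    "grammar": 0.055,
    "english": 0.054,
    "language": 0.043,
}


def label_score(label, p):
    """Score of a two-tag label: the product of the tags' conditional probabilities."""
    wi, wj = label
    return p[wi] * p[wj]


# Scoring ALL pairs (not what the paper does) simply picks the two most probable tags.
all_pairs = list(combinations(p_w_given_z, 2))
best_unrestricted = max(all_pairs, key=lambda pair: label_score(pair, p_w_given_z))

score_reported = label_score(("writing", "grammar"), p_w_given_z)
score_unrestricted = label_score(best_unrestricted, p_w_given_z)
```

Without any restriction the score prefers "writing/reference" (0.248 × 0.066) over the reported "writing/grammar" (0.248 × 0.055), which suggests that the reported label arises from how the candidates are chosen rather than from the score alone.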
2.3 Introduction to LIM
Much of the previous work on probabilistic modeling of
folksonomies assumed a global meaning for tags [7]. LIM,
on the contrary, is a personalized model that assumes meaning (or “interest context”) of a tagging event is determined
by the user, tag and resource collectively. This is achieved
by incorporating users and resources into the model as random variables. Figure 1 shows the graphical model of this
generative process.
Figure 1: Graphical Model of LIM

Notice that the latent variables z have a broader meaning in our model than simply being a topic of words as in LDA [2]. They are better interpreted as “interest contexts”, since the tagging triplets they generate can be regarded as statements of what interest a given resource is to a user, declared by the user with the associated tag. Such a reading of a folksonomy assumes that each tag a user attaches to a resource is a declaration of why the user is interested in the resource and/or how the user is intending to use it.
The (hyper)parameters in the proposed model are α, β, γ and ν. These are, respectively, the asymmetric Dirichlet prior for the distribution over “interest contexts” (θ) and the symmetric Dirichlet priors for the distributions of resources, users and tags (Φr, Φu and Φt) [14]. To estimate these distributions, we employ Collapsed Gibbs Sampling [3]. We sample from the posterior p(z|u, r, t, D), where the variables Φu, Φr, Φt and θ have been integrated out, using

$$p(z^i = k \mid z^{-i}, u, t, r, \alpha, \beta, \gamma, \nu, D) = \frac{a^i_k \, b^i_{uk} \, c^i_{tk} \, d^i_{rk}}{\sum_{k'} a^i_{k'} \, b^i_{uk'} \, c^i_{tk'} \, d^i_{rk'}} \tag{1}$$

$$a^i_k = N^{-i}_k + \alpha_k, \qquad b^i_{uk} = \frac{N^{-i}_{uk} + \gamma}{N^{-i}_k + U\gamma}, \qquad c^i_{tk} = \frac{N^{-i}_{tk} + \nu}{N^{-i}_k + T\nu}, \qquad d^i_{rk} = \frac{N^{-i}_{rk} + \beta}{N^{-i}_k + R\beta}$$

In the expressions above, the index i in z^i denotes the ith triplet in the corpus, while z^i is the associated “interest” variable and z^{-i} are the interest variables corresponding to the other triplets. N^{-i}_{urtk} is used as the count of a particular triplet associated with a particular “interest” in the corpus, other than the ith triplet. In this form it can be either 0 or 1: it is 1 exactly when the triplet (t, u, r) is not the ith triplet and the interest context sampled for this triplet is the kth interest context. The dropped indices express a summation over that index. For example,

$$N^{-i}_{uk} = \sum_{r \in R,\, t \in T} N^{-i}_{urtk}$$

In the denominators, U, T and R denote the sizes of the corresponding sets. Notice the effect of the Dirichlet prior parameters as pseudocounts, which is the consequence of the conjugacy of the Dirichlet and multinomial distributions.
The (Bayes) estimates of the four essential conditional probabilities (Φ̂u^z, Φ̂r^z, Φ̂t^z, θ̂^z) in the model are:

$$\hat{\Phi}^z_t = \frac{N_{tz} + \nu}{N_z + T\nu}, \qquad \hat{\Phi}^z_u = \frac{N_{uz} + \gamma}{N_z + U\gamma}, \qquad \hat{\Phi}^z_r = \frac{N_{rz} + \beta}{N_z + R\beta}, \qquad \hat{\theta}^z = \frac{N_z + \alpha_z}{N + \sum_k \alpha_k}$$
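A minimal sketch of one collapsed Gibbs sweep based on Equation (1) and of the Bayes estimates is given below. It assumes tags, users and resources are encoded as integer indices and uses plain Python lists; the function names (`init_state`, `gibbs_sweep`, `estimates`), the toy triplets and the value of α are illustrative assumptions, not the authors' implementation.

```python
import random


def init_state(triplets, K, U, T, R, rng):
    """Assign a random interest context to each (t, u, r) triplet and build the counts."""
    z = [rng.randrange(K) for _ in triplets]
    N_k = [0] * K
    N_uk = [[0] * K for _ in range(U)]
    N_tk = [[0] * K for _ in range(T)]
    N_rk = [[0] * K for _ in range(R)]
    for (t, u, r), k in zip(triplets, z):
        for counts in (N_k, N_uk[u], N_tk[t], N_rk[r]):
            counts[k] += 1
    return z, N_k, N_uk, N_tk, N_rk


def gibbs_sweep(triplets, state, alpha, beta, gamma, nu, rng):
    """Resample the interest context of every triplet once, following Equation (1)."""
    z, N_k, N_uk, N_tk, N_rk = state
    K, U, T, R = len(N_k), len(N_uk), len(N_tk), len(N_rk)
    for i, (t, u, r) in enumerate(triplets):
        # Remove triplet i from the counts; these are the N^{-i} quantities.
        for counts in (N_k, N_uk[u], N_tk[t], N_rk[r]):
            counts[z[i]] -= 1
        weights = []
        for k in range(K):
            a = N_k[k] + alpha[k]
            b = (N_uk[u][k] + gamma) / (N_k[k] + U * gamma)
            c = (N_tk[t][k] + nu) / (N_k[k] + T * nu)
            d = (N_rk[r][k] + beta) / (N_k[k] + R * beta)
            weights.append(a * b * c * d)
        # random.choices normalizes the weights, i.e. applies the denominator of (1).
        z[i] = rng.choices(range(K), weights=weights)[0]
        for counts in (N_k, N_uk[u], N_tk[t], N_rk[r]):
            counts[z[i]] += 1


def estimates(state, alpha, beta, gamma, nu):
    """Bayes estimates of theta and of the tag/user/resource distributions per context."""
    _, N_k, N_uk, N_tk, N_rk = state
    K, U, T, R = len(N_k), len(N_uk), len(N_tk), len(N_rk)
    N = sum(N_k)
    theta = [(N_k[k] + alpha[k]) / (N + sum(alpha)) for k in range(K)]
    phi_t = [[(N_tk[t][k] + nu) / (N_k[k] + T * nu) for k in range(K)] for t in range(T)]
    phi_u = [[(N_uk[u][k] + gamma) / (N_k[k] + U * gamma) for k in range(K)] for u in range(U)]
    phi_r = [[(N_rk[r][k] + beta) / (N_k[k] + R * beta) for k in range(K)] for r in range(R)]
    return theta, phi_t, phi_u, phi_r


# Toy corpus of (tag, user, resource) index triplets (hypothetical data).
triplets = [(0, 0, 0), (1, 0, 1), (0, 1, 0), (2, 1, 2), (2, 2, 2), (1, 2, 1)]
alpha, beta, gamma, nu = [0.5, 0.5], 0.1, 1.0, 0.01
rng = random.Random(0)
state = init_state(triplets, K=2, U=3, T=3, R=3, rng=rng)
for _ in range(50):
    gibbs_sweep(triplets, state, alpha, beta, gamma, nu, rng)
theta, phi_t, phi_u, phi_r = estimates(state, alpha, beta, gamma, nu)
```

Because the priors act as pseudocounts, every weight stays strictly positive even for contexts that currently hold no triplets, so each context can always be sampled.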
2.4 Resource Categorization using LIM
Resource categorization using LIM can be achieved in two ways. First, each “interest context” provides a ranked list of resources using p(r|z) and listing the top-K resources. Since the interest contexts in a folksonomy reflect different interests of different users, only some of these topics are listed to a particular user, determined by applying a threshold δ to p(z|u) ∝ p(u|z)p(z). The parameters K and δ determine the number of resources which are categorized for user u. In the Results section, the effect of varying K on categorization performance is explored.
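The first kind of categorization can be sketched as follows, assuming the estimated distributions are available as nested lists indexed by interest context. The function names and the toy probabilities are assumptions for illustration.

```python
def user_topics(u, p_u_given_z, p_z, delta):
    """Interest contexts whose posterior p(z|u), proportional to p(u|z)p(z), reaches delta."""
    joint = [p_u_given_z[k][u] * p_z[k] for k in range(len(p_z))]
    total = sum(joint)
    return [k for k, j in enumerate(joint) if j / total >= delta]


def top_resources(k, p_r_given_z, K):
    """Indices of the K most probable resources under interest context k."""
    ranked = sorted(range(len(p_r_given_z[k])),
                    key=lambda r: p_r_given_z[k][r], reverse=True)
    return ranked[:K]


def categorize(u, p_u_given_z, p_r_given_z, p_z, K, delta):
    """Personalized categories for user u: {interest context: top-K resources}."""
    return {k: top_resources(k, p_r_given_z, K)
            for k in user_topics(u, p_u_given_z, p_z, delta)}


# Toy model with 2 interest contexts, 2 users and 3 resources (hypothetical values).
p_z = [0.5, 0.5]
p_u_given_z = [[0.9, 0.1], [0.2, 0.8]]            # p_u_given_z[k][u]
p_r_given_z = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]  # p_r_given_z[k][r]

categories_user0 = categorize(0, p_u_given_z, p_r_given_z, p_z, K=2, delta=0.3)
categories_user1 = categorize(1, p_u_given_z, p_r_given_z, p_z, K=2, delta=0.3)
```

Raising δ shows fewer categories to a user, while raising K lengthens each category; together they control how many resources are categorized for that user.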
The second kind of resource categorization is via p(r|t, u).
Notice that, in this form, this listing does not necessarily
correspond to a particular topic. However, in the next section, a method to compute a set of tags that are maximally
related to the topic and have as small redundancy as possible is introduced. As a result, the tags in this set and the
related resources can be loosely seen as sub-categories of the
corresponding topics. Finally, once the topic is itself labeled
with a descriptive phrase, the task of computing a personalized hierarchical resource categorization is complete. The
whole process can be visualized as in Figure 2.
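The second kind of listing relies on p(r|t, u), for which the text gives no explicit formula. The sketch below assumes that tag, user and resource are conditionally independent given the interest context, as the product form of Equation (1) suggests, so that p(r|t, u) = Σ_z p(r|z) p(z|t, u) with p(z|t, u) ∝ p(t|z) p(u|z) p(z). The function name and the toy values are illustrative.

```python
def p_r_given_t_u(t, u, p_t_given_z, p_u_given_z, p_r_given_z, p_z):
    """Distribution over resources for tag t and user u, marginalizing the interest context."""
    K = len(p_z)
    # Unnormalized posterior over interest contexts given the tag and the user.
    w = [p_t_given_z[k][t] * p_u_given_z[k][u] * p_z[k] for k in range(K)]
    total = sum(w)
    n_resources = len(p_r_given_z[0])
    return [sum(w[k] / total * p_r_given_z[k][r] for k in range(K))
            for r in range(n_resources)]


# Toy model with 2 interest contexts (hypothetical values).
p_z = [0.5, 0.5]
p_t_given_z = [[0.7, 0.3], [0.2, 0.8]]            # p_t_given_z[k][t]
p_u_given_z = [[0.9, 0.1], [0.2, 0.8]]            # p_u_given_z[k][u]
p_r_given_z = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]  # p_r_given_z[k][r]

dist = p_r_given_t_u(0, 0, p_t_given_z, p_u_given_z, p_r_given_z, p_z)
ranking = sorted(range(len(dist)), key=lambda r: dist[r], reverse=True)
```

The same tag can therefore produce different resource rankings for different users, which is what makes the tag-level sub-categories personalized.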
3. META-TAG GENERATION
Intuitively, a meta-tag should be representative of its topic. This is achieved when the meta-tag is both specific to the topic and general enough to cover the semantic variety in the topic [8]. In this section, the proposed method to automatically determine meta-tags for topics in a folksonomy is presented. Unlike [10], the proposed model does not assume an underlying text dataset, which makes it applicable to folksonomy data. In this section, first a method to find a compact representation for topics is discussed. Then, this new representation is employed to find descriptive labels for the topics.

Figure 2: The overall process of personalized hierarchical resource categorization

3.1 Compact Topic Representation
The common method to present a topic is to list the top-K most probable words given the topic. This list, however, can contain redundancies. Consider the topic in Table 2. This topic is obviously related to websites with blogs, references and resources for webdesign. It can be directly observed that several tags repeat these themes, like “webdesign”, “design”, “web” or “resources”, “web2.0”. In this section, a method to produce a more compact and informative representation is presented. The results are shown in Table 3. With a comparable number of words, the tags shown in the table give us a much more diverse and detailed picture of the concepts significantly related to the topic.

Table 2: A topic from LIM with redundancies
Tag            Probability
webdesign      0.134
design         0.132
blog           0.117
tutorial       0.094
inspiration    0.079
resources      0.056
web            0.05
css            0.044
photoshop      0.028
web2.0         0.016

Table 3: A topic from LIM with compact representation
Tag            Score
webdesign      0.183
jquery         0.161
blog           0.158
news           0.147
magazine       0.132
bookmarks      0.115
icons          0.111
php            0.099
javascript     0.096
css3           0.096
textures       0.094

In order to eliminate redundancies, the concepts of relevance and redundancy are employed. The proposed method is similar to the mRMR feature selection method in spirit [12]. However, the definitions of relevance and redundancy are slightly different in this work. Relevance of a word (tag) w to a topic z is defined as

$$\mathrm{Rel}(w, z) = \log p(z \mid w) - \log p(z) = \log p(w \mid z) - \log p(w)$$

In other words, relevance is the difference between the self-information of the topic and that of the topic given the tag/word. Notice that relevance is defined as a function of a realization of a random variable rather than the random variable itself, so there is no expectation. One would like to define redundancy similarly:

$$\mathrm{Red}^{*}(w_i, w_j) = \log p(w^t = w_i \mid w^{t+1} = w_j) - \log p(w^t = w_i)$$

where w^t is the random variable corresponding to the word sampled at time t, while w_i and w_j are particular realizations of this random variable (we drop the superscript t for realizations since the set of possible words which we sample is the same for all time steps). However, the model assumptions of LIM state that the tags/words are distributed i.i.d. That is, one sample at step t does not yield additional information for the sample at step t + 1. To circumvent this issue, another measure for similarity and consequently redundancy is used:

$$\mathrm{Red}(w_i, w_j) = \log p(w^t = w_i \mid \tau(w_i) = \tau(w_j)) - \log p(w^t = w_i)$$

$$p(w^t = w_i \mid \tau(w_i) = \tau(w_j)) = \sum_{z} p(z \mid w^{t+1} = w_j) \, p(w^t = w_i \mid z)$$

where τ(.) indicates the corresponding topic of the given word, so that the above equation can be read as follows: redundancy is the information supplied by knowing that the words w^t and w^{t+1} have the same “interest context” with respect to the event of observing w_i at a given time. Observe that if the words w_i and w_j have similar probabilities over several different topics, − log p(w^t = w_i | τ(w_i) = τ(w_j)) will be much smaller than − log p(w^t = w_i), indicating a large redundancy.
The algorithm then proceeds in an iterative fashion, where at each iteration the tag (which is not already in the list) with the highest score value is added.

$$\mathrm{Score}(w, z, S) = p(w \mid z) \cdot \Big( \mathrm{Rel}(w, z) - \sum_{w_j \in S} \mathrm{Red}(w, w_j) \Big)$$

$$S_{t+1} = S_t \cup \{ \arg\max_{w \notin S_t} \mathrm{Score}(w, z, S_t) \}$$

In the above equation, S_t is the set of tags at the t’th step. Note that the step t denotes the iteration of the algorithm, whereas “time” t in the above paragraph refers to the sampling time of a word in the corpus according to the probabilistic model. S_0 is defined as the empty set. Note that the candidate words/tags are picked from the top-L most probable words given the topic. This L is taken to be W/T, where W is the number of words and T is the number of topics. This constraint improves the computational efficiency of the method while leaving the results the same, since most of the probability mass is allocated to this set by the topic model LIM.
Finally, a threshold value has to be determined so that tags with low scores, which are irrelevant to the topic or are redundant, are not included in the candidate label set. After trying different alternatives using our dataset, the “best” threshold is empirically determined to be one eighth of the value of the highest score for a given topic. Note that the “compact” representation usually results in a list with a much smaller number of identifying words in it than, say, ten. However, this is not always the case, as can be seen from Table 3. In this case, the words which are eliminated due to redundancy are replaced with other words that are non-redundant and more informative about the topic. It is, of course, possible to think of other viable options for determining cut-off points in the compact list. One option would be to consider resource coverage as defined in Section 4.3, possibly setting a cut-off point satisfying a given coverage threshold in probability (see Table 9 and the related discussion). However, for the purposes of meta-tag generation, we noticed that the simple approach used in this work suffices.

3.2 Labeling Topics
The compact representation of topics presented in the previous section is more efficient than the top-10 or top-5 most probable words listing in the literature, but it is still too long to use as labels in a taxonomy of resources. Therefore, in this section, a method to derive labels using these compact lists is introduced.
As discussed previously, the goal is to find a relevant phrase that also strikes a balance between generality and specificity. The generality criterion is satisfied by the top scoring members of the compact list (which may or may not be the most probable words given the topic). However, it is observed that sometimes these tags are too general to fully specify the topic, so accompanying words which increase the specificity are necessary. For example, “science” is a very broad meta-tag, whereas “science/astronomy” is much more specific and helpful to the users. Additionally, some tags are naturally used in pairs with other tags, such as “social” and “web”, and despite one being redundant to the other, combined they are more informative than alone. For these reasons, a method to find accompanying tags to the tags in the compact list is discussed in this section. Subsequently, the final meta-tag is selected from these augmented tags using a simple criterion.
The accompanying/augmenting tags are determined using redundancy to the respective tag, among the top-L most probable tags as in Section 3.1.

$$\mathrm{Aug}(w, z) = \arg\max_{w_j \in \mathrm{top}L} \mathrm{Red}(w_j, w)$$

The combinations of tags formed in this fashion are candidates for topic labels. In this work, we simply choose the most probable combination (p(w_i, w_j|z) = p(w_i|z)p(w_j|z)) among these candidates as the meta-tag for the corresponding topic (z).
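The two stages of this section can be sketched together as follows. Several details are assumptions made here rather than statements from the paper: the one-eighth threshold is applied as a stopping rule relative to the first selected score, a tag is excluded from its own augmentation candidates, and the word-topic probabilities are toy values.

```python
import math


def word_marginal(p_w_z, p_z):
    """p(w) = sum_z p(w|z) p(z)."""
    return [sum(p_w_z[k][w] * p_z[k] for k in range(len(p_z)))
            for w in range(len(p_w_z[0]))]


def relevance(w, z, p_w_z, p_w):
    """Rel(w, z) = log p(w|z) - log p(w)."""
    return math.log(p_w_z[z][w]) - math.log(p_w[w])


def redundancy(wi, wj, p_w_z, p_z, p_w):
    """Red(wi, wj), using p(wi | same context as wj) = sum_z p(z|wj) p(wi|z)."""
    same_ctx = sum(p_w_z[k][wj] * p_z[k] / p_w[wj] * p_w_z[k][wi]
                   for k in range(len(p_z)))
    return math.log(same_ctx) - math.log(p_w[wi])


def compact_list(z, p_w_z, p_z, L, ratio=1 / 8):
    """Greedy selection by Score(w, z, S), restricted to the top-L words of topic z."""
    p_w = word_marginal(p_w_z, p_z)
    top_l = sorted(range(len(p_w)), key=lambda w: -p_w_z[z][w])[:L]
    S, first_score = [], None
    while len(S) < len(top_l):
        def score(w):
            red = sum(redundancy(w, wj, p_w_z, p_z, p_w) for wj in S)
            return p_w_z[z][w] * (relevance(w, z, p_w_z, p_w) - red)
        best = max((w for w in top_l if w not in S), key=score)
        s = score(best)
        if first_score is None:
            first_score = s  # highest score for this topic (assumed positive)
        elif s < ratio * first_score:
            break            # assumed reading of the one-eighth threshold
        S.append(best)
    return S, top_l, p_w


def meta_tag(z, p_w_z, p_z, L):
    """Augment each compact tag with its most redundant top-L tag; keep the likeliest pair."""
    S, top_l, p_w = compact_list(z, p_w_z, p_z, L)
    candidates = []
    for w in S:
        others = [wj for wj in top_l if wj != w]  # assumption: a tag cannot augment itself
        aug = max(others, key=lambda wj: redundancy(wj, w, p_w_z, p_z, p_w))
        candidates.append((w, aug))
    return max(candidates, key=lambda c: p_w_z[z][c[0]] * p_w_z[z][c[1]])


# Toy model: 2 topics over 5 tags (hypothetical probabilities).
vocab = ["webdesign", "design", "css", "blog", "news"]
p_z = [0.5, 0.5]
p_w_z = [[0.35, 0.30, 0.20, 0.10, 0.05],
         [0.05, 0.05, 0.05, 0.45, 0.40]]

compact, top_l, _ = compact_list(0, p_w_z, p_z, L=4)
label = meta_tag(0, p_w_z, p_z, L=4)
label_text = "/".join(vocab[w] for w in label)
```

On this toy data the most redundant companion of "webdesign" is "design", so the resulting label pairs two closely related tags, in line with the remark that naturally paired tags can be more informative together than alone.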
4. RESULTS
In this section, the results of the proposed meta-tag generation method are shown using data collected from the popular folksonomy website Del.icio.us. The dataset consists of 34,665 users, 6,429 resources, 9,641 tags, 5,546,813 taggings and 1,358,522 bookmarks. This set is produced by collecting data from Del.icio.us and performing standard cleaning operations such as stemming of tags using standard methods such as Pling, removing non-English websites and removing stop-words from tags. In addition, the resources which were not bookmarked by at least 100 users and tags which were not used in at least 100 bookmarks were removed (the resulting dataset is also called a 100-core), similar to other studies on folksonomies such as [6].
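The core filtering step can be sketched as below. The text states only the two minimum-count conditions; repeating the filter until nothing changes is an assumption made here (removing a tag can push a resource below the threshold and vice versa), and the function name and toy data are illustrative.

```python
def p_core(taggings, p):
    """Keep taggings whose resource has >= p users and whose tag has >= p bookmarks.

    `taggings` is a set of (tag, user, resource) triplets. The filter is repeated
    until a fixed point is reached (an assumed reading of "p-core").
    """
    current = set(taggings)
    while True:
        users_per_resource = {}
        bookmarks_per_tag = {}
        for t, u, r in current:
            users_per_resource.setdefault(r, set()).add(u)
            bookmarks_per_tag.setdefault(t, set()).add((u, r))
        kept = {(t, u, r) for t, u, r in current
                if len(users_per_resource[r]) >= p
                and len(bookmarks_per_tag[t]) >= p}
        if kept == current:
            return kept
        current = kept


# Toy data with p = 2 instead of the paper's 100 (hypothetical triplets).
taggings = {
    ("a", "u1", "r1"),
    ("a", "u2", "r1"),
    ("b", "u1", "r1"),  # tag "b" appears in only one bookmark
    ("a", "u3", "r2"),  # resource "r2" has only one user
}
core = p_core(taggings, 2)
```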
The proposed model contains one asymmetric Dirichlet prior parameter vector α and three other parameters, β, γ and ν, for the symmetric priors. Additionally, the number of “interest contexts” is also a parameter of the model. Traditionally, the last parameter (number of topics) is manually tuned either using perplexity on unseen data or using another performance criterion regarding the specific application. In this work, this value is found using the precision/recall curve on a validation dataset. The other parameters, however, can also be computed by maximum (marginal) likelihood estimation, which is the prevalent method in the literature. We have observed that validation based parameter estimation performed better in our setting [1]. The estimated parameters used in this paper are β = 0.1, γ = 1 and ν = 0.01, with 200 topics.
The results will be reported in three parts. First, the solution to the classical topic labeling problem is reported by showing ten randomly sampled topics together with the related tag probabilities. The goal here is to show that the meta-tags can successfully represent the underlying tag distribution. The second part shows the efficiency of resource categorization using the proposed method. Here, twenty randomly sampled topics are used to test the categorization performance. The resulting precision/recall values of this set of experiments are shown, together with two samples of “best” and “worst” categories, followed by a discussion of these results. Finally, the efficiency of the compact representation is discussed by showing the improved resource coverage results.
4.1 Topic Labeling Results
The results are reported using 10 randomly selected topics, where for each topic the classical top-10 most probable words list, the compact list and the meta-tags are shown in Table 4, Table 5 and Table 6 respectively. The results in Table 6 show the descriptive value of the computed meta-tags. They also hint at the effect of these labels or meta-tags in improving the resource search efficiency of users in a folksonomy.

Table 4: 10 random topics from LIM
Topic 0: tools, internet, dns, network, web, test, networking, ip, speed, speedtest
Topic 1: javascript, tools, programming, development, webdev, code, web, js, testing, ajax
Topic 2: torrent, download, bittorrent, search, p2p, music, movies, software, video, rapidshare
Topic 3: wiki, reference, wikipedia, encyclopedia, web2.0, collaboration, wikis, tools, research, information
Topic 4: tools, pdf, converter, online, conversion, free, convert, software, file, video
Topic 5: dictionary, language, reference, english, tools, translation, word, thesaurus, education, online
Topic 6: photo, photography, images, flickr, tools, web2.0, pictures, search, sharing, gallery
Topic 7: programming, processing, software, art, visualization, design, opensource, graphics, interactive, code
Topic 8: generator, tools, webdesign, design, web, favicon, css, online, text, html
Topic 9: typography, fonts, webdesign, css, design, tools, web, @font-face, type, resources

Table 5: Compact representations of 10 random topics from LIM
Topic 0: internet, tools, resources, reference, flash, download, search, technology, visualization, web2.0
Topic 1: javascript, tools, online, collaboration, free, web2.0, software
Topic 2: torrent, blog
Topic 3: wiki
Topic 4: pdf, tools, html, music, graphics, photo, resources, images, technology, howto
Topic 5: dictionary, web, web2.0, social
Topic 6: photo, browser, searchengine
Topic 7: programming, processing, inspiration, images, art, blog, media, graphic, research, linux
Topic 8: generator
Topic 9: typography, tools, online
In order to better appreciate the compact topic representation, it is important to have a deeper understanding of the kind of “redundancy” that this representation eliminates. To this end, consider a simple scenario where a user selects a topic and the relevant resources (ordered w.r.t. p(r|z)) are listed to the user alongside the descriptive tags in the compact representation. This general list is useful; however, the user might also want to narrow the search by selecting one of the tags listed, in which case the resources are listed w.r.t. p(r|t, u). Now if one were to use the top-K most probable tags as the representation of the topics, many of the tags would lead to a similar listing of resources, thus being redundant. This issue is discussed further in Section 4.3.
4.2 Resource Categorization Results
The previous discussion demonstrates the effectiveness of the proposed method in finding a label that covers and summarizes the underlying tag distribution. However, the major claim of this paper is that these meta-tags are also representative of the related resources and that these resources form a useful categorization of the resources in a folksonomy. Success at this task is evaluated by using the judgements of five subjects (4 of whom are Master's students and 1 is a PhD student), who scored each resource among the 20 most probable resources for 20 randomly selected topics (out of 200) according to whether it is related to the associated meta-tag. These binary scores, indicating relevance or irrelevance, are used to compute precision/recall values. Following this, the two best and worst topics are also shown with the meta-tags and the corresponding resources. A discussion of these results provides significant insights into the different properties of the proposed method.

Table 7: Average relevance of resources to the suggested meta-tags at different coverage levels
Coverage    Average Precision
62%         76.6%
31%         84%
Table 7 shows the average precision, that is, the ratio of the average scores to the number of given resources, at different coverage (or recall) values. Results for two coverage values are reported to indicate the performance difference between the top 10 and top 20 most probable resources, with coverage 31% and 62% respectively. The coverage of these results is computed by taking the ratio of the number of included websites to the number of all websites. Note that this value is calculated by assuming that the precision in the sampled topics is an indicator for the precision of the rest. Furthermore, we assumed that there are few cases of overlapping resources among the 20 most probable resources of topics. Thus, this value is just a rough estimate, which is helpful to get a sense of the generality of our results.
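The reported coverage values can be reproduced with a short calculation. It assumes, as one reading consistent with the text, that coverage is the number of topics (200) times the number of listed resources per topic, divided by the number of resources in the dataset (6,429), with no overlap between topics; the function name is illustrative.

```python
N_TOPICS = 200        # number of interest contexts used in the experiments
N_RESOURCES = 6429    # resources in the cleaned dataset


def estimated_coverage(top_k, n_topics=N_TOPICS, n_resources=N_RESOURCES):
    """Fraction of all resources listed when each topic shows its top_k resources,
    assuming no resource appears under more than one topic."""
    return n_topics * top_k / n_resources


coverage_top10 = estimated_coverage(10)  # about 0.311
coverage_top20 = estimated_coverage(20)  # about 0.622
```

Rounded to whole percentages these give 31% and 62%, matching Table 7. Any overlap between topics would lower the true coverage, which is why the figure is only a rough estimate.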
The two best and worst scored topics are shown in Table 8.
Notice that in this table, only the root URLs of the websites
are shown, due to space restrictions. Thus some websites
appear to have been listed twice when, in fact, they refer to
different pages of the websites.
Several important conclusions can be drawn from these results. First of all, the first word in the meta-tag tends to have a broader meaning, while the second term usually makes the meta-tag more specific. The two “best” topics clearly show the usefulness of the augmenting tag in forming a descriptive and specific meta-tag. However, in the worst topics, it is seen that this specificity can also hurt. For example, the “programming/python” topic includes URLs of websites that are broadly related to programming but unrelated to python. In quantitative terms, 20% of the 20 websites listed under this topic would be scored a 1 instead of a 0 if the augmenting tag “python” was not shown. However, we believe the qualitative results show the overall usefulness of the augmenting tags. Indeed, when asked, the test subjects indicated that in 75% of the topics (15/20), the augmenting tags served to produce a more specific label without disturbing the relevance to the resources. Another interesting observation is that, sometimes, the users in our dataset and the test subjects disagree on the relevance of resources to some of the concepts. For example, “projecteuler.net” is judged as irrelevant by all test subjects due to its apparent irrelevance to “python”, but it is tagged with “python” by 5 users in our dataset. This, we believe, is again due to the personalized usage of websites. For many, the aforementioned website is irrelevant to python, but for some it is a resource for python exercises. This result is another indication of the importance of personalized meta-tag generation. The non-personalized nature of our experiments is a factor that lowers the scores. In practice, this topic would only be shown to users with similar “tastes” and this resource would have a higher probability of being useful/meaningful.

Table 6: Meta-Tags for the 10 random topics
Topic 0: internet/dns
Topic 1: javascript/js
Topic 2: torrent/download
Topic 5: dictionary/language
Topic 6: photo/flickr
Topic 7: processing/programming

Table 9: Coverage of resources using most probable tags (M1) and the compact tag set (M2)
Measure        M1             M2             p-value
Cardinality    47 ± 1.74      50 ± 1.5       0.001
Probability    4.33 ± 0.19    4.59 ± 0.18    0.008
4.3
Efficiency of the Compact Representation
The compact representation discussed in Section 3.1 is expected
to yield a set of tags that covers the semantics of the
particular interest context well. This can be shown using a
sample scenario. Assume that a user selects a related meta-tag
and browses this topic by selecting a tag from the associated
set of tags. The user will presumably be patient enough to try
only a few tags and to browse a limited number of resources
within the set of resources associated with each tag (i.e. the
set of resources listed according to p(r|t)). The goal is then
to select this set of tags such that the resulting set of
resources is as diverse and relevant as possible. In other
words, we must avoid repeating resources. To show the advantage
of our method in such a scenario, we follow this procedure. For
each user in the dataset, we find the most relevant topic.
Then, in this topic, we select a set of tags, using either the
most probable tags or the proposed compact tag set, and find
the associated set of resources. Finally, the sets are measured
using either set cardinality or the total probability mass of
the set. These two values are shown in Table 9, where we select
three tags and show 20 resources to the user. The p-values from
the t-test are also shown in the last column. The results show
a statistically significant improvement in categorisation
efficiency using the compact tag set.
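The coverage measurement in this procedure can be sketched as
follows. This is a minimal illustration under stated
assumptions, not the implementation used in the experiments:
the function names and the toy probabilities are hypothetical,
the limit of n resources is read as applying per tag, and the
probability mass is assumed to be summed over a topic-level
resource distribution.

```python
from typing import Dict, List, Set, Tuple


def top_resources(p_r_given_t: Dict[str, float], n: int) -> List[str]:
    """Return the n resources with the highest p(r|t) for a single tag."""
    # Ties are broken by resource name so the ranking is deterministic.
    ranked = sorted(p_r_given_t.items(), key=lambda kv: (-kv[1], kv[0]))
    return [resource for resource, _ in ranked[:n]]


def coverage(
    tags: List[str],
    p_r_given_tag: Dict[str, Dict[str, float]],
    p_r_given_topic: Dict[str, float],
    n: int = 20,
) -> Tuple[int, float]:
    """Measure the union of the top-n resource lists of the selected tags.

    Returns (cardinality, probability mass). Using a topic-level
    distribution for the mass is an assumption of this sketch.
    """
    shown: Set[str] = set()
    for tag in tags:
        shown.update(top_resources(p_r_given_tag[tag], n))
    mass = sum(p_r_given_topic.get(resource, 0.0) for resource in shown)
    return len(shown), mass


# Toy data (hypothetical): two tags with overlapping resource lists
# and one tag that leads to different resources.
P_R_GIVEN_TAG = {
    "python": {"r1": 0.5, "r2": 0.3, "r3": 0.2},
    "programming": {"r1": 0.6, "r2": 0.3, "r4": 0.1},
    "tutorial": {"r3": 0.5, "r4": 0.4, "r1": 0.1},
}
P_R_GIVEN_TOPIC = {"r1": 0.4, "r2": 0.3, "r3": 0.2, "r4": 0.1}

redundant = coverage(["python", "programming"], P_R_GIVEN_TAG, P_R_GIVEN_TOPIC, n=2)
diverse = coverage(["python", "tutorial"], P_R_GIVEN_TAG, P_R_GIVEN_TOPIC, n=2)
```

In this toy setting the redundant pair of tags repeats the same
two resources, while the diverse pair reaches all four, which
is the effect that both measures in Table 9 are meant to
capture.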
5.
RELATED WORK
The problem of meta-tag generation to enhance resource
categorization is, to the best of our knowledge, novel. The
most similar works in the literature consider the problem of
finding representative words [8] or phrases [10] for
multinomial word distributions. Lau et al. [8] consider
training a Support Vector Regression method to determine the
order of words for topics. Their features include WordNet-based
word relations, coverage of words in terms of mean conditional
probability computed using Wikipedia word frequencies, and the
distributional similarity score of Pantel and Lin [11]. The
data set is prepared by manually labeling selected topics. They
observed that many users disagreed in selecting the most
representative words in some of the topics, and consequently
the method's performance on such topics is low. The supervised
nature of this method prohibits its use in a web-based
folksonomy service, since either online learning or frequent
re-training of the topic models would require perpetual
collection of manual tags, which is unrealistic. In addition,
using a single word to describe topics is generally
insufficient, as observed in [10]. Mei et al. [10] proposed
using phrases to label topics. The phrases were obtained from
the n-grams of the actual documents. These n-grams are then
scored with respect to relevance and coverage. The paper
defines coverage using the concept of redundancy; however, they
choose to minimize the maximum redundancy between words instead
of the sum of the redundancies, as in this paper. Another
criterion that the authors employed is the discriminating
capability of tags, which is the relevance of a tag to the
desired topic minus the sum of its relevance to other topics.
This criterion is meaningful given their goal. However, in the
setting of this paper, taking discrimination into account is
not desired, since it is expected, due to the personalized
nature of LIM, for some topics to have similar tag
distributions but different resource distributions. For
example, LIM might provide two different topics on “movies”
with similar tag distributions but leading to different movie
resources. Finally, they assume that the underlying documents
are easily available, but in our case the underlying resources
consist of many non-textual websites, and acquiring the text
data and constructing all possible n-grams would be inefficient
even for the text-based websites. The proposed method employs
only the available bag-of-words data in labeling topics.

Table 6 (continued): Meta-Tags for the 10 topics
Topic 3    wiki/wikipedia
Topic 4    pdf/converter
Topic 8    generator/tools
Topic 9    typography/fonts
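The difference between the redundancy criteria and the effect
of the discrimination criterion discussed in this section can
be illustrated with a small sketch. All names and numeric
values below are hypothetical toy inputs chosen for
illustration; neither the redundancy measure of [10] nor the
one used in this paper is reproduced here.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

Redundancy = Dict[FrozenSet[str], float]


def pairwise(tags: List[str], red: Redundancy) -> List[float]:
    """Look up the redundancy of every unordered pair of tags."""
    return [red[frozenset(pair)] for pair in combinations(tags, 2)]


def sum_redundancy(tags: List[str], red: Redundancy) -> float:
    """Criterion that penalizes the total redundancy of a tag set."""
    return sum(pairwise(tags, red))


def max_redundancy(tags: List[str], red: Redundancy) -> float:
    """Criterion that penalizes only the most redundant pair."""
    return max(pairwise(tags, red))


def discrimination(tag: str, topic: str, rel: Dict[str, Dict[str, float]]) -> float:
    """Relevance to the desired topic minus the summed relevance to other topics."""
    others = sum(score for z, score in rel[tag].items() if z != topic)
    return rel[tag][topic] - others


# Toy pairwise redundancies (hypothetical values).
RED: Redundancy = {
    frozenset({"a", "b"}): 0.6,
    frozenset({"a", "c"}): 0.0,
    frozenset({"b", "c"}): 0.0,
    frozenset({"a", "d"}): 0.4,
    frozenset({"a", "e"}): 0.4,
    frozenset({"d", "e"}): 0.4,
}
SET_X = ["a", "b", "c"]  # one highly redundant pair, otherwise none
SET_Y = ["a", "d", "e"]  # moderate redundancy in every pair

# Toy relevance of one tag to three topics (hypothetical values):
# two topics share the tag, as in the two "movies" topics example.
REL = {"movies": {"topic_1": 0.5, "topic_2": 0.45, "topic_3": 0.0}}
movies_score = discrimination("movies", "topic_1", REL)
```

With these toy values the sum criterion prefers SET_X while the
max criterion prefers SET_Y, so the two criteria can rank
candidate tag sets differently. The tag “movies” receives a
discrimination score close to zero for topic_1 even though it
is highly relevant to it, which is why such a criterion would
penalize topics that share tags but lead to different
resources.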
6.
CONCLUSIONS
With the rapid growth of social tagging systems, so-called
folksonomies, organizing the vast amounts of online resources
on these systems according to their topics has become a
critical issue. Another important characteristic of folksonomy
systems is that users can choose any keyword as a tag and can
assign one or more tags to a resource, resulting in a wide
variety of tags that can be redundant and ambiguous. Although
statistical topic modeling has been well studied, there are no
existing methods for automatically generating personalized
topic labels in folksonomy systems.
In this paper, a novel method for personalized resource
categorization and labeling in folksonomies is introduced. The
method is capable of creating a personalized taxonomy of
resources with meaningful labels, called meta-tags, so that
users can easily locate resources of interest. The resulting
meta-tags offer a balance between generality and specificity.
The qualitative results from randomly drawn topics and the
quantitative results from human judgements indicate the
usefulness of the proposed method.

Table 8: Two best and worst scored topics
Topics    Meta-Tags       Resources
Best 1    photo/flickr    www.flickr.com, www.tineye.com,
                          www.compfight.com, compfight.com,
                          labs.ideeinc.com, bighugelabs.com/flickr,
                          tineye.com, photobucket.com, taggalaxy.de,
                          www.airtightinteractive.com,
                          hugin.sourceforge.net, wylio.com,
                          www.cooliris.com, www.flickriver.com,
                          min.us, www.dropmocks.com,
                          www.smugmug.com, labs.systemone.at,
                          labs.ideeinc.com, photosynth.net
Best 2    food/recipes    www.tastespotting.com, www.epicurious.com,
                          www.supercook.com, smittenkitchen.com,
                          www.seriouseats.com,
                          www.cookingforengineers.com,
                          allrecipes.com, www.stilltasty.com,
                          www.101cookbooks.com, www.chow.com,
                          www.foodnetwork.com,
                          www.opensourcefood.com,
                          www.cookingbynumbers.com,
                          www.jamieoliver.com, www.thekitchn.com,
                          thisiswhyyourefat.com, foodgawker.com,
                          www.101cookbooks.com, www.recipezaar.com,
                          mingmakescupcakes.yolasite.com
7.
ACKNOWLEDGMENTS
This project was partially funded by TUBITAK under
project ID 110E027.
8.
REFERENCES
[1] M. E. Alper and S. Gunduz-Oguducu. Personalized
recommendation in folksonomies using a joint
probabilistic model of users, resources and tags.
Submitted to a conference, 2012.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent
dirichlet allocation. The Journal of Machine Learning
Research, 3:993–1022, 2003.
[3] T. L. Griffiths and M. Steyvers. Finding scientific
topics. Proceedings of the National Academy of Sciences,
101(suppl 1):5228–5235, Jan. 2004.
[4] A. Haghighi and L. Vanderwende. Exploring content
models for multi-document summarization. In
Proceedings of Human Language Technologies,
NAACL ’09, pages 362–370, Stroudsburg, PA, USA,
2009. Association for Computational Linguistics.
[5] T. Hofmann. Probabilistic latent semantic analysis. In
Proc. of Uncertainty in Artificial Intelligence, UAI’99,
pages 289–296, 1999.
[6] R. Jäschke, L. Marinho, A. Hotho, L. Schmidt-Thieme, and
G. Stumme. Tag recommendations in folksonomies. In
PKDD 2007, pages 506–514, Berlin, Heidelberg, 2007.
Springer-Verlag.
[7] R. Krestel, P. Fankhauser, and W. Nejdl. Latent
dirichlet allocation for tag recommendation. In
Proceedings of the third ACM conference on
Recommender systems, RecSys ’09, pages 61–68, New
York, NY, USA, 2009. ACM.
[8] J. H. Lau, D. Newman, S. Karimi, and T. Baldwin.
Best topic word selection for topic labelling. In
Proceedings of the 23rd International Conference on
Computational Linguistics: Posters, COLING ’10,
pages 605–613, Stroudsburg, PA, USA, 2010.
Association for Computational Linguistics.
[9] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai.
Topic sentiment mixture: modeling facets and
opinions in weblogs. In Proceedings of the 16th
international conference on World Wide Web, WWW
’07, pages 171–180, New York, NY, USA, 2007. ACM.
[10] Q. Mei, X. Shen, and C. Zhai. Automatic labeling of
multinomial topic models. In Proceedings of the 13th
ACM SIGKDD international conference on Knowledge
discovery and data mining, KDD ’07, pages 490–499,
New York, NY, USA, 2007. ACM.
[11] P. Pantel and D. Lin. Discovering word senses from
text. In Proceedings of the eighth ACM SIGKDD
international conference on Knowledge discovery and
data mining, KDD ’02, pages 613–619, New York, NY,
USA, 2002. ACM.
[12] H. Peng, F. Long, and C. Ding. Feature selection
based on mutual information criteria of
max-dependency, max-relevance, and min-redundancy.
Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 27(8):1226–1238, Aug. 2005.
[13] J. Tang, H.-f. Leung, Q. Luo, D. Chen, and J. Gong.
Towards ontology learning from folksonomies. In
Proceedings of the 21st international joint conference
on Artificial intelligence, IJCAI’09, pages 2089–2094,
San Francisco, CA, USA, 2009. Morgan Kaufmann
Publishers Inc.
[14] H. M. Wallach, D. Mimno, and A. McCallum.
Rethinking LDA: Why Priors Matter. In Proceedings
of NIPS, 2009.
[15] X. Wu, L. Zhang, and Y. Yu. Exploring social
annotations for the semantic web. In Proceedings of
the 15th international conference on World Wide Web,
WWW ’06, pages 417–426, New York, NY, USA,
2006. ACM.

Table 8 (continued): Two best and worst scored topics
Topics     Meta-Tags             Resources
Worst 1    programming/python    projecteuler.net,
                                 gettingreal.37signals.com,
                                 stackoverflow.com, code.google.com/edu,
                                 mitpress.mit.edu/sicp,
                                 samizdat.mines.edu/howto,
                                 diveintohtml5.org,
                                 www.e-booksdirectory.com,
                                 diveintopython.org,
                                 www.freetechbooks.com,
                                 mitpress.mit.edu/sicp, www.python.org,
                                 www.indiangeek.net, jqfundamentals.com,
                                 www.gigamonkeys.com/book,
                                 detexify.kirelabs.org,
                                 blog.objectmentor.com,
                                 eloquentjavascript.net,
                                 developer.mozilla.org,
                                 learnyouahaskell.com
Worst 2    health/fitness        www.nutritiondata.com, www.fitbit.com,
                                 hundredpushups.com, www.coolrunning.com,
                                 www.gmap-pedometer.com, preyproject.com,
                                 hundredpushups.com, www.webmd.com,
                                 www.mapmyrun.com,
                                 www.informationisbeautiful.net,
                                 www.fitday.com, www.thedailyplate.com,
                                 www.mayoclinic.com,
                                 www.patientslikeme.com,
                                 adeona.cs.washington.edu,
                                 www.lumosity.com,
                                 www.successwithtracywhite.com,
                                 www.walkscore.com, www.bikely.com,
                                 www.boston.com