Proceedings of the 3rd ACM RecSys'11 Workshop on Recommender Systems and the Social Web

EDITORS
Jill Freyne, Sarabjot Singh Anand, Ido Guy, Andreas Hotho

October 23rd, 2011, Chicago, Illinois, USA

ORGANIZING COMMITTEE
Jill Freyne, CSIRO, TasICT Centre, Australia
Sarabjot Singh Anand, University of Warwick, UK
Ido Guy, IBM Research, Haifa, Israel
Andreas Hotho, University of Würzburg, Germany

PROGRAM COMMITTEE
Shlomo Berkovsky, CSIRO, Australia
Peter Brusilovsky, University of Pittsburgh, USA
Robin Burke, DePaul University, USA
Xiongcai Cai, University of New South Wales, Australia
Elizabeth Daly, IBM Research, Cambridge, MA, USA
Jon Dron, Athabasca University, Canada
Casey Dugan, IBM Research, USA
Rosta Farzan, Carnegie Mellon University, USA
Werner Geyer, IBM Research, USA
Max Harper, University of Minnesota, USA
C. Lee Giles, The Pennsylvania State University, USA
Kristina Lerman, University of Southern California, USA
Luiz Pizzato, The University of Sydney, Australia
Lars Schmidt-Thieme, University of Hildesheim, Germany
Shilad Sen, Macalester College, St. Paul, USA
Aaditeshwar Seth, IIT Delhi, India
Barry Smyth, University College Dublin, Ireland
Gerd Stumme, University of Kassel, Germany
Juergen Vogel, SAP Research

FOREWORD
The exponential growth of the social web poses challenges and presents new opportunities for recommender system research. The social web has turned information consumers into active contributors creating massive amounts of information. Finding relevant and interesting content at the right time and in the right context is challenging for existing recommender approaches. At the same time, social systems by their definition encourage interaction between users and both online content and other users, thus generating new sources of knowledge for recommender systems. Web 2.0 users explicitly provide personal information and implicitly express preferences through their interactions with others and the system (e.g. commenting, friending, rating, etc.). These various new sources of knowledge can be leveraged to improve recommendation techniques and develop new strategies which focus on social recommendation. The Social Web provides huge opportunities for recommender technology and in turn recommender technologies can play a part in fuelling the success of the Social Web phenomenon.

The goal of this workshop was to bring together researchers and practitioners to explore, discuss, and understand challenges and new opportunities for recommender systems and the Social Web. We encouraged contributions in the following areas:

• Case studies and novel fielded social recommender applications
• Economy of community-based systems: using recommenders to encourage users to contribute and sustain participation
• Social network and folksonomy development: recommending friends, tags, bookmarks, blogs, music, communities etc.
• Recommender systems mash-ups, Web 2.0 user interfaces, rich media recommender systems
• Collaborative knowledge authoring, collective intelligence
• Recommender applications involving users or groups directly in the recommendation process
• Exploiting folksonomies, social network information, interaction, user context and communities or groups for recommendations
• Trust and reputation aware social recommendations
• Semantic Web recommender systems, use of ontologies or microformats
• Empirical evaluation of social recommender techniques, success and failure measures
• Social recommender systems in the enterprise

The workshop consisted of both technical sessions, in which selected participants presented their results or ongoing research, and informal breakout sessions on more focused topics. Papers discussing various aspects of recommender systems in the Social Web were submitted, and 10 papers were selected for presentation and discussion in the workshop through a formal reviewing process.

The Workshop Chairs
October 2011

Third Workshop on Recommender Systems and the Social Web
23rd October, 2011

Workshop Programme

8:30 - 9:00 - Opening & Introductions

9:00 - 10:15 - Paper Session I - Social Search and Discovery
Kevin McNally, Michael O'Mahony and Barry Smyth. Evaluating User Reputation in Collaborative Web Search (15+5)
Owen Phelan, Kevin McCarthy and Barry Smyth. Yokie - A Curated, Real-time Search & Discovery System using Twitter (15+5)
Tamara Heck, Isabella Peters and Wolfgang G. Stock. Testing Collaborative Filtering against Co-Citation Analysis and Bibliographic Coupling for Academic Author Recommendation (15+5)
Discussion (15)

10:15 - 10:45 - Coffee Break

10:45 - 12:30 - Paper Session II - Groups, Communities, and Networks
Lara Quijano-Sanchez, Juan Recio-Garcia and Belen Diaz-Agudo. Group recommendation methods for social network environments (15+5)
Yu Chen and Pearl Pu. Do You Feel How We Feel? An Affective Interface in Social Group Recommender Systems (10+5)
Amit Sharma, Meethu Malu and Dan Cosley. PopCore: A system for Network-Centric Recommendations (10+5)
Shaghayegh Sahebi and William Cohen. Community-Based Recommendations: a Solution to the Cold Start Problem (10+5)
Maria Terzi, Maria-Angela Ferrario and Jon Whittle. Free Text In User Reviews: Their Role In Recommender (10+5)
Discussion (20)

12:30 - 14:00 - Lunch

14:00 - 15:00 - Keynote Presentation (Werner Geyer, title TBD)

15:05 - 15:45 - Paper Session III - User Generated Content
Sandra Garcia Esparza, Michael O'Mahony and Barry Smyth. A Multi-Criteria Evaluation of a User Generated Content Based Recommender System (15+5)
Jonathan Gemmell, Tom Schimoler, Bamshad Mobasher and Robin Burke. Personalized Recommendation by Example (15+5)
Discussion (15)

15:45 - 16:15 - Coffee Break

16:15 - 18:00 - Breakout Sessions in Groups

18:00 - Closing

Contents

Evaluating User Reputation in Collaborative Web Search
Kevin McNally, Michael O'Mahony and Barry Smyth (p. 1)

Yokie - A Curated, Real-time Search & Discovery System using Twitter
Owen Phelan, Kevin McCarthy and Barry Smyth (p. 9)

Testing Collaborative Filtering against Co-Citation Analysis and Bibliographic Coupling for Academic Author Recommendation
Tamara Heck, Isabella Peters and Wolfgang G. Stock (p. 16)

Group recommendation methods for social network environments
Lara Quijano-Sanchez, Juan Recio-Garcia and Belen Diaz-Agudo (p. 24)

Do You Feel How We Feel? An Affective Interface in Social Group Recommender Systems
Yu Chen and Pearl Pu (p. 32)
PopCore: A system for Network-Centric Recommendations
Amit Sharma, Meethu Malu and Dan Cosley (p. 36)

Community-Based Recommendations: a Solution to the Cold Start Problem
Shaghayegh Sahebi and William Cohen (p. 40)

Free Text In User Reviews: Their Role In Recommender
Maria Terzi, Maria-Angela Ferrario and Jon Whittle (p. 45)

A Multi-Criteria Evaluation of a User Generated Content Based Recommender System
Sandra Garcia Esparza, Michael O'Mahony and Barry Smyth (p. 49)

Personalized Recommendation by Example
Jonathan Gemmell, Tom Schimoler, Bamshad Mobasher and Robin Burke (p. 57)

Evaluating User Reputation in Collaborative Web Search

Kevin McNally, Michael P. O'Mahony, Barry Smyth
CLARITY Centre for Sensor Web Technologies
School of Computer Science & Informatics
University College Dublin
{firstname.lastname}@ucd.ie

ABSTRACT
Often today's recommender systems look to past user activity in order to influence future recommendations. In the case of social web search, employing collaborative recommendation techniques allows for personalization of search results. If recommendations arise from past user activity, the expertise of those users driving the recommendation process can play an important role when it comes to ensuring recommendation quality. Hence the reputation of users is important in collaborative and social search tasks, in addition to result relevance as traditionally considered in web search. In this paper we explore this concept of reputation; specifically, we investigate how reputation can enhance the recommendation engine at the core of the HeyStaks social search utility. We evaluate a number of different reputation models in the context of the HeyStaks system, and demonstrate how incorporating reputation into the recommendation process can enhance the relevance of results recommended by HeyStaks.

1. INTRODUCTION
The early years of web search (1995-1998) were characterised by innovation as researchers came to discover some of the shortcomings of traditional term-based information retrieval techniques in the face of large-scale, heterogeneous web content, and in the face of queries from users who were far from search experts. While traditional term-based matching techniques played an important role in result selection, they were not sufficiently robust when it came to delivering a reliable and relevant ranking of search results. The significant breakthrough that led to modern web search engines came about through the work of Brin and Page [1], and Kleinberg [6], highlighting the importance of link connectivity when it came to understanding the importance of web pages. In the end, ranking metrics based on this type of connectivity data came to provide a key signal for all of today's mainstream search engines.

By and large the world of web search has remained relatively stable over the past decade or more. Mainstream search engines have innovated around the edges of search but their core approaches have remained intact.
However there are signs that this is now changing and it is an interesting time in the world of mainstream web search, especially as all of the mainstream players look to the world of social networks to provide new types of search content and, importantly in this paper, new sources of ranking signals. There is now considerable interest in the concept of social search, based on the idea that information in our social graphs can be used to improve mainstream search. For example, the HeyStaks system [19] has been developed to add a layer of social search onto mainstream search engines, using recommendation techniques to automatically suggest results to users based on pages that members of their social graphs have found to be interesting for similar queries in the past. HeyStaks adds collaboration to conventional web search and allows us to benefit from the past search histories of people we trust and on topics that matter to us.

In this paper we examine the role of reputation in HeyStaks' recommendation engine. Previously we have described how to estimate the reputation of a searcher by analysing how frequently their past search efforts have translated into useful recommendations for other users [9, 11]. We have also examined user behaviour in HeyStaks, and highlighted the potential for reputation to unearth users who have gained the most benefit from the system and whose activity benefits others [10]. For example, if my previous searches (and the pages that I find) lead to result recommendations to others that are regularly acted on (selected, tagged, shared etc.), then my reputation should increase, whereas if my past search efforts rarely translate into useful recommendations then my reputation should decline. In this paper we expand on previous work by considering a number of user reputation models, showing how these models can be used to estimate result reputation, and comparing the ability of these models to influence recommendation quality based on recent live-user data.

2. RELATED WORK
Recently there has been considerable interest in reputation systems as mechanisms to evaluate user reputation and inter-user trust across a growing number of social web and e-commerce applications. For example, the reputation system used by eBay has been examined by Jøsang et al. [5] and Resnick et al. [16]. Briefly, eBay elicits feedback from buyers and sellers regarding their interactions with each other, and that information is aggregated in order to calculate user reputation scores. The aim is to reward good behaviour on the site and to improve robustness by leveraging reputation to predict whether a vendor will honour future transactions. Resnick found that using information received directly from users to calculate reputation is not without its problems [16]. Feedback is generally reciprocal; users almost always give positive feedback if they themselves had received positive feedback from the person they performed a transaction with. Jøsang confirms this, stating that this reciprocity may lead to malicious use of the system and as such requires manual curation.

The work of O'Donovan and Smyth [14] addresses reputation in recommender systems. In this case, a standard collaborative filtering algorithm is modified to add a user-user trust score to complement the normal profile or item-based similarity score, so that recommendation partners are chosen from those users that are not only similar to the target user, but who have also had a positive recommendation history with that user.
It is posited that reputation can be estimated by measuring the accuracy of a profile at making predictions over time. Using this metric, average prediction error is improved by 22%.

Other recent research has examined reputation systems employed in social networking platforms. Lazzari performed a case study of the professional social networking site Naymz [8]. He warns that calculating reputation on a global level allows users who have interacted with only a small number of others to accrue a high degree of reputation, making the system vulnerable to malicious use. Similar to Jøsang in [5], Lazzari suggests that the vulnerability lies in the site itself, allowing malicious users to game the reputation system for their own ends. Moreover, applying reputation globally affords malicious users influence over the entire system, which adds to its vulnerability.

As outlined above, in this paper we present different reputation models to be applied to HeyStaks users. These models are in part derived from constructing a graph based on collaborations that occur in the HeyStaks community. Perhaps the two most well-known link analysis algorithms applied to online social network graphs are PageRank and HITS. PageRank is the well-known algorithm employed by the Google search engine to rank web search results [1]. The key intuition behind PageRank is that pages on the web can be modeled as vertices in a directed graph, where the edge set is determined by the hyperlinks between pages. PageRank leverages this link structure to produce an estimate of the relative importance of web pages, with inlinks from pages seen as a form of recommendation from page authors. Important pages are considered to be those with a relatively large number of inlinks. Moreover, pages that are linked to by many other important pages receive higher ranks themselves. PageRank is a recursive algorithm, where the ranks of pages are a function of the ranks of those pages that link to them.

The HITS algorithm [6] was also developed to rank web search results and, like PageRank, makes use of the link structure of the web to perform ranking. In particular, HITS computes two distinct scores for each page: an authority score and a hub score. The former provides an estimate of the value of a page's content while the latter measures the value of its links to other pages. Pages receive higher authority scores if they are linked to by pages with high hub scores, and receive higher hub scores if they link to many pages with high authority scores. HITS is an iterative algorithm where authority and hub scores are computed recursively.

Much work has been done on link analysis in the social web space in the recent past, often employing the techniques introduced by Page and Kleinberg. For example, the well-known FolkRank algorithm [4], an adaptation of PageRank, exploits users' disposition for adding metadata to online content in order to construct a graph based on social tagging information. Work by Schifanella et al. [18] expands on the idea behind FolkRank, and claims that examination of folksonomy data can help in predicting links between people in the social network graphs of Flickr and Last.fm. This differs from our approach in that we wish to leverage the HeyStaks social graph to determine who provides the best quality content as determined by their community.

In this paper we consider reputation models in the context of the HeyStaks social search service which seek to capture the quality of search knowledge that is contributed by users. Further, we present a framework in which user reputation is employed to influence the recommendations that are made by HeyStaks.
Using data from a live-user trial, we show how this approach leads to significant improvements in the ranking of recommendations from a quality perspective.

3. THE HEYSTAKS RECOMMENDATION ENGINE
In this section we review the HeyStaks recommendation engine to provide sufficient context for this work. Further details can be found in [19] (which focuses on the relevance model) and in [11] (which focuses on the reputation model).

3.0.1 Profiling Stak Pages
Each stak in HeyStaks captures the search activities of its stak members. The basic unit of stak information is a result (URL) and each stak (S) is associated with a set of results, $S = \{r_1, ..., r_k\}$. Each result is also anonymously associated with a number of implicit and explicit interest indicators, based on the type of actions (for example, selecting, voting, tagging and sharing) that users can perform on these pages. These actions can be associated with a degree of confidence that the user finds the page to be relevant. Each result page $r_i^S$ from stak S is associated with relevance indicators: the number of times a result has been selected ($Sl$), the query terms ($q_1, ..., q_n$) that led to its selection, the terms contained in the snippet of the selected result ($s_1, ..., s_k$), the number of times a result has been tagged ($Tg$), the terms used to tag it ($t_1, ..., t_m$), the votes it has received ($v^+$, $v^-$), and the number of people it has been shared with ($Sh$), as per Equation 1.

    $r_i^S = \{q_1, ..., q_n, s_1, ..., s_k, t_1, ..., t_m, v^+, v^-, Sl, Tg, Sh\}$    (1)

Importantly, this means each result page is associated with a set of term data (query and/or tag terms) and a set of usage data (the selection, tag, share, and voting counts). The term data provides the basis for retrieving and ranking recommendation candidates. The usage data provides an additional source of evidence that can be used to filter results and to generate a final set of recommendations.

3.0.2 Recommending Search Results
At search time, the searcher's query q and current stak S are used to generate a list of recommendations. Here we discuss recommendation generation from the current stak S only, although recommendations may also come from other staks that the user has joined or created. There are two key steps when it comes to generating recommendations. First, a set of recommendation candidates is retrieved from S based on the overlap between the query terms and the terms used to index each recommendation (query, snippet, and tag terms). These recommendations are then filtered and ranked. Results that do not exceed certain activity thresholds are eliminated, such as, for example, results with only a single selection or results with more negative votes than positive votes (see [19]). Remaining recommendation candidates are then ranked according to a weighted score of their relevance and reputation (Equation 2), where w is used to adjust the relative influence of relevance and reputation.

    $score(r, q) = w \times rep(r, t) + (1 - w) \times rel(q, r)$    (2)

The relevance of a result r with respect to a query q is computed using TF-IDF [17], which gives high weights to terms that are popular for a result r but rare across other stak results, thereby serving to prioritise results that match distinguishing index terms, as per Equation 3.

    $rel(q, r) = \sum_{t \in q} tf(t, r) \times idf(t)^2$    (3)

The reputation of a result r at time t ($rep(r, t)$) is an orthogonal measure of recommendation quality. The intuition is that we should prefer results that originate from more reputable stak members. We explore user reputation and how it can be computed in the next section.
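To make the ranking concrete, the following is a minimal Python sketch of Equations 2 and 3. The data layout (a result's index terms as a flat list, a stak as a list of such term lists) and the smoothed IDF variant are our own illustrative assumptions, not the HeyStaks implementation.

```python
import math
from collections import Counter

def idf(term, stak_results):
    """Inverse document frequency of `term` across the stak's results
    (a common smoothed variant; the paper does not spell out its form)."""
    n_containing = sum(1 for terms in stak_results if term in terms)
    return math.log(len(stak_results) / (1.0 + n_containing))

def rel(query_terms, result_terms, stak_results):
    """Equation 3: TF-IDF relevance of a result for a query."""
    counts = Counter(result_terms)  # term frequencies for this result
    return sum(counts[t] * idf(t, stak_results) ** 2 for t in query_terms)

def score(query_terms, result_terms, stak_results, result_rep, w=0.5):
    """Equation 2: blend result reputation with term relevance; w adjusts
    the relative influence of the two signals."""
    return w * result_rep + (1 - w) * rel(query_terms, result_terms, stak_results)

# Toy stak: each result represented by its query/snippet/tag terms.
stak = [["python", "tutorial"], ["python", "recursion"], ["java", "tutorial"]]
print(score(["python", "recursion"], stak[1], stak, result_rep=0.8))
```

Note that with w = 0 the score reduces to the standard relevance-only ranking, a property the evaluation in Section 5 relies on as its baseline.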
4. REPUTATION MODELS FOR SOCIAL SEARCH
For HeyStaks, searchers themselves play a crucial role in determining what gets recommended and to whom, and so the quality of these searchers can be an important factor to consider during recommendation. Recommendation candidates originating from the activities of very experienced users, for example, might be considered ahead of candidates that come from the activity of less experienced users. This is particularly important given the potential for malicious users to disrupt stak quality by introducing dubious results to a stak. For example, as it stands it is feasible for a malicious user to flood a stak with results in the hope that at least some will be recommended to other users at search time. This type of gaming has the potential to significantly degrade recommendation quality; see also recent related research on malicious users and robustness by the recommender systems community [3, 7, 13, 15]. For this reason we propose to complement the relevance of a page, during recommendation, with an orthogonal measure of reputation to reflect the predicted quality of the users who are responsible for this recommendation. In fact we propose a variety of reputation models and in Section 5 we evaluate their effectiveness in practice.

4.1 Search, Collaboration, and Reputation
The long-term value of HeyStaks as a social search service depends critically on the ability of users to benefit from its quality search knowledge; if, for example, all of the best search experiences are tied up in private staks and never shared, then this long-term value will be greatly diminished.

Figure 1: Collaboration and reputation: (a) the consumer c selects result r, which has been recommended based on the producer p's previous activity, so that c confers some unit of reputation (rep) on p. (b) The consumer c selects a result r that has been produced by several producers, $p_1, ..., p_k$; reputation is shared amongst these producers, with each user receiving an equal share of rep/k units of reputation.

Thus, our model of reputation must recognise the quality of shared search knowledge. There is a way to capture this notion of shared search knowledge quality in a manner that serves to incentivise users to behave in just the right way to grow long-term value for all. The key idea is that the quality of shared search knowledge can be estimated by looking at the search collaborations that naturally occur within HeyStaks. If HeyStaks recommends a result to a searcher, and the searcher chooses to act on this result (i.e. select, tag, vote or share), then we can view this as a single instance of search collaboration. The current searcher who chooses to act on the recommendation is known as the consumer and, in the simplest case, the original searcher, whose earlier action on this result caused it to be added to the search stak, and ultimately recommended, is known as the producer.
In other words, the producer created search knowledge that was deemed to be relevant enough to be recommended and useful enough for the consumer to act upon it. The basic idea behind our reputation models is that this act of implicit collaboration between producer and consumer confers some unit of reputation on the producer (Figure 1(a)). The reputation models that we present in what follows differ in the way that they distribute and aggregate reputation among these collaborations.

4.2 Graph-Based Reputation Models
We can treat the collaborations that occur among HeyStaks users as a type of graph. Each node represents a unique user and the edges represent collaborations between pairs of users. These edges are directed to reflect the producer/consumer relationships; reputation flows along these edges and is aggregated at the nodes. As such, the extent to which users collaborate (i.e., the number of times each user is a producer in a collaboration event) is used to weight the nodes in the collaboration graph. We now present a series of graph-based reputation model alternatives.

4.2.1 Reputation as a Weighted Count of Collaboration Events
Our first and simplest reputation model calculates the reputation of a producer as a weighted sum of the collaboration events in which they have participated. The simplest case is captured by Figure 1(a), where a single producer participates in a collaboration event with a given consumer and benefits from a single unit of reputation as a result. More generally however, at the time when the consumer acts (selects, tags, votes etc.) on the promoted result, there may have been a number of past producers who each contributed part of the search knowledge that caused this result to be promoted. A specific producer may have been the first to select the result in a given stak, but subsequent users may have selected it for different queries, or they may have voted on it or tagged it or shared it with others independently of its other producers. Alternatively, a collaboration event can have a knock-on effect, where the original producer-consumer relationship is broadened as more people act on the same recommendation over time. The original consumer becomes a second producer as a new user acts on the same recommendation, and so on. Thus we need to be able to share reputation across these different producers; see Figure 1(b).

More formally, let us consider the selection of a result r by a user c, the consumer, at time t. The producers responsible for the recommendation of this result are given by $producers(r, t)$ as per Equation 4, such that each $p_i$ denotes a specific user $u_i$ in a specific stak $S_j$.

    $producers(r, t) = \{p_1, ..., p_k\}$    (4)

Then, for each producer $p_i$ of r, we update its reputation as in Equation 5. In this way reputation is shared equally among its k contributing producers.

    $rep(p_i, t) = rep(p_i, t - 1) + 1/k$    (5)

As it stands this reputation model is susceptible to gaming in the following manner. To increase their reputation, malicious users could attempt to flood a stak with pages in the hope that at least some are recommended and subsequently acted on by other users. If this happens, then these malicious producers will benefit from increased reputation, and further pages from these users may continue to be recommended. The problem is that the current reputation model distributes reputation equally among all producers. To address this we can adjust our reputation model by changing the way in which reputation is distributed. The basic idea is that a producer should receive more reputation if many of their past contributions have been consumed by other users, but they should receive less reputation if most of their contributions have not been consumed.

More formally, for a producer $p_i$, let $n_t(p_i, t - 1)$ be the total number of distinct results that this user has added to the stak in question prior to time t; remember that $p_i$ refers to a user $u_i$ and a specific stak $S_j$. Further, let $n_r(p_i, t - 1)$ be the number of these results that have been subsequently recommended and consumed by other users. We define the consumption ratio according to Equation 6; $\kappa$ is an initialization constant that is set to 0.01 in our experiments.

    $consumption\_ratio(p_i, t) = \frac{\kappa + n_r(p_i, t - 1)}{n_t(p_i, t - 1)}$    (6)

Accordingly, if a producer has a high consumption ratio it means that many of their contributions have been consumed by other users, suggesting that the producer has added useful content to the stak. In contrast, if a user has a low consumption ratio then it means that few of their contributions have proven to be useful to other users. Thus, given the selection of a result r by a consumer c at time t: if $p_1, ..., p_k$ are the contributing producers, then we can use their consumption ratios as the basis for sharing reputation according to Equation 7.

    $rep(p_i, t) = rep(p_i, t - 1) + \frac{consumption\_ratio(p_i, t)}{\sum_{p \in \{p_1, ..., p_k\}} consumption\_ratio(p, t)}$    (7)

In this way, users who have a history of contributing many irrelevant results to a stak (that is, users with low consumption ratios) will receive a small proportion of the reputation share compared to users who have a history of contributing many useful results.
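The following is a minimal sketch of this consumption-ratio scheme (Equations 6 and 7), under the assumption that the per-producer counters n_r and n_t are maintained as dictionaries keyed by (user, stak) producer identifiers; the bookkeeping is illustrative rather than the paper's actual code.

```python
KAPPA = 0.01  # initialisation constant used in the paper's experiments

def consumption_ratio(p, n_r, n_t):
    """Equation 6: the (smoothed) fraction of producer p's distinct stak
    contributions that were later recommended to, and consumed by, others."""
    return (KAPPA + n_r.get(p, 0)) / n_t[p]

def share_reputation(rep, producers, n_r, n_t):
    """Equation 7: on a consumption event, split one unit of reputation
    among the result's producers in proportion to their consumption ratios."""
    ratios = {p: consumption_ratio(p, n_r, n_t) for p in producers}
    total = sum(ratios.values())
    for p in producers:
        rep[p] = rep.get(p, 0.0) + ratios[p] / total
```

Note that when all contributing producers happen to have identical consumption ratios, this sharing reduces to the equal rep/k split of Equation 5.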
The basic idea is that a producer should receive more reputation if many of their past contributions have been consumed by other users but the should receive less reputation if most of their contributions have not been consumed. More formally, for a producer pi , let nt (pi , t − 1) be the total number of distinct results that this user has added to the stak in question prior to time t; remember that pi refers to a user ui and a specific stak Sj . Further, let nr (pi , t − 1) be the number of these results that have been subsequently recommended and consumed by other users. We define the consumption ratio according to Equation 6; κ is an initialization constant that is set to 0.01 in our experiments. Accordingly, if a producer has a high consumption ratio it means that many of their contributions have been consumed by other users, suggesting that the producer has added useful content to the stak. In contrast, if a user has a low consumption ratio then it means that few of their contributions have proven to be useful to other users. consumption ratio(pi , t) = κ + nr (pi , t − 1) . nt (pi , t − 1) (6) Thus, given the selection of a result r by a consumer c at time t: if p1 , ..., pk are the contributing producers, then we can use their consumption ratios as the basis for sharing reputation according to Equation 7. consumption ratio(pi , t) . ∀p{p1 ,...,pk } consumption ratio(p, t) (7) In this way, users who have a history of contributing many irrelevant results to a stak (that is, users with low consumption ratios) will receive a small proportion of the reputation share compared to users who have a history of contributing many useful results. rep(pi , t) = rep(pi , t−1)+ P 4.2.2 Reputation as PageRank The PageRank algorithm can be readily applied to compute the reputation of HeyStaks users, which take the place of web pages in the graph. When a collaboration event occurs, directed links are inserted from the consumer (i.e. the user who selects or votes etc. on the recommended page) to each of the producers (i.e. the set of users whose previous activity on the page caused it to be recommended by HeyStaks). Once all the collaboration events up to some point in time, t, have been captured on the graph, the PageRank algorithm is then executed and the reputation (PageRank) of each user pi at time t is computed as: X P R(pj ) 1−d +d , (8) P R(pi ) = N |L(pj |) pj ∈M (pi ) where d s a damping factor, N is the number of users, M (pi ) is the set of inlinks (from consumers) to (producer) pi and L(pj ) is the the set of outlinks from pj (i.e. the other users from whom pj has consumed results). In this paper, we use the JUNG (Java Universal Network/Graph) Framework (http://jung.sourceforge.net/) implementation of PageRank. 4.2.3 Reputation as HITS As with PageRank, we use the collaboration graph and the HITS algorithm to estimate user reputation. In this regard, it seems appropriate to consider producers as authorities and consumers as hubs. However, as we will discuss in Section 5, hub scores are useful when it comes to identifying a particular class of users which act both as useful consumers and producers of high quality search knowledge. Thus we model user reputation using both authority and hub scores, which we compute using the JUNG implementation of the HITS algorithm. Briefly, the algorithm operates as follows. After initialisation, repeated iterations are used to update the authority (auth(pi )) and hub scores (hub(pi )) for each user pi . 
4.3 Reputation and Result Recommendation
In the previous sections we have described reputation models for users. Individual stak members accumulate reputation when results that they have added to the stak are recommended and acted on by other users. We have described how reputation is distributed between multiple producers during these collaboration events. In this section we describe how this reputation information can be used to produce better recommendations at search time.

The recommendation engine described in Section 3 operates at the level of an individual result page and scores each recommendation candidate based on how relevant it is to the target query. If we are to allow reputation to influence recommendation ranking, as well as relevance, then we need to transform our user-based reputation measure into a result-based reputation measure. How then can we compute the reputation of a result that has been recommended by a set of producers? Before the reputation of a page is calculated, the reputation score of each producer is normalized according to the maximum user reputation score existing in the stak at the time that the recommendation is made. But how can we calculate the reputation of a page based on that of its producers? One option is to simply add the reputation scores of the producers. However, this favours results that have been produced by lots of producers, even if the reputation of these producers is low. Another option is to compute the average of the reputation scores of the producers, but this tends to depress the reputation of results that have been produced by many low-reputation users even if some users have very high reputation scores. In our work we have found a third option to work best. The reputation of a result page r (at time t) is simply the maximum reputation of its associated producers; see Equation 11. Thus, as long as at least some of the producers are considered reputable then this result will receive a high reputation score, even if many of the producers have low reputation scores. These less reputable users might be novices and so their low reputations are not so much of a concern in the face of highly reputable producers.

    $rep(r, t) = \max_{p_i \in \{p_1, ..., p_k\}} rep(p_i, t)$    (11)

Now we have two ways to evaluate the appropriateness of a page for recommendation: the relevance of the page as per Equation 3 and its reputation as per Equation 11. We can combine these two scores using a simple weighted sum according to Equation 2 to calculate the rank score of a result page r and its producers $p_1, ..., p_k$ at time t, with respect to query q.
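A minimal sketch of this max-based aggregation (Equation 11), including the per-stak normalisation described above; the dictionary representation of user reputation scores is assumed for illustration.

```python
def result_reputation(producers, rep):
    """Equation 11: a result inherits the maximum reputation of its
    producers, after normalising by the highest user reputation
    currently in the stak."""
    max_stak_rep = max(rep.values()) or 1.0  # guard against an all-zero stak
    return max(rep[p] for p in producers) / max_stak_rep
```

The returned value plugs directly into the rep(r, t) term of the score function sketched after Section 3.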
5. EVALUATION
In previous work [19] we have demonstrated how the standard relevance-based recommendations generated by HeyStaks can be more relevant than the top-ranking results of Google. In this work we wish to compare HeyStaks' relevance-based recommendation technique to an extended version of the system that also includes reputation. In more recent prior work, our initial proof-of-concept reputation model was outlined and motivated, and a preliminary evaluation of reputation scores assigned to early adopters of the HeyStaks system was carried out [11]. We have also showed that user reputation scores can be used to positively influence HeyStaks recommendations [12]; however, this work focused on only one model. The purpose of this paper has been to build on previous work by proposing a number of alternatives for estimating the reputation of users (producers) who are helping other users (consumers) to search within the HeyStaks social search service. The aim is to explore known link-analysis techniques to find a mechanism that best captures HeyStaks users' reputation in terms of the quality of content they provide their community. We measure each model's effectiveness by allowing the scores to influence recommendations made by HeyStaks: the hypothesis is that by allowing reputation, as well as relevance, to influence the ranking of result recommendations, we can improve the overall quality of search results. In this section we evaluate these reputation models using data generated during a recent closed, live-user trial of HeyStaks, designed to evaluate the utility of HeyStaks' brand of collaborative search in fact-finding information discovery tasks.

Table 1: A sample of the user-trial questions.
1. Who was the last Briton to win the men's singles at Wimbledon?
2. Which Old Testament book is about the sufferings of one man?
3. Which reporter fronted the film footage that sparked off Band Aid?
4. Which space probes failed to find life on Mars?

5.1 Dataset and Methodology
Our live-user trial involved 64 first-year undergraduate university students with varying degrees of search expertise. Users were asked to participate in a general knowledge quiz during a supervised laboratory session, answering as many questions as they could from a set of 20 questions in the space of 1 hour. Each student received the same set of questions, which were randomly presented to avoid any ordering bias. The questions were selected for their obscurity and difficulty; see Table 1 for a sample of these questions. Each user was allocated a desktop computer with the Firefox web browser and the HeyStaks toolbar pre-installed; they were permitted to use Google, enhanced by HeyStaks functionality, as an aid in the quiz.

The 64 students were randomly divided into search groups. Each group was associated with a newly created search stak, which would act as a repository for the group's search knowledge. We created 6 solitary staks, each containing just a single user, and 4 shared staks containing 5, 9, 19, and 25 users. The solitary staks served as a benchmark to evaluate the search effectiveness of individual users in a non-collaborative search setting, whereas the different sizes of shared staks provided an opportunity to examine the effectiveness of collaborative search across a range of different group sizes.

All activity on both Google search results and HeyStaks recommendations was logged, as well as all queries submitted during the experiment. During the 60-minute trial, 3,124 queries and 1,998 result activities (selections, tagging, voting, popouts) were logged, and 724 unique results were selected. During the course of the trial, result selections, the typical form of search activity, dominated over HeyStaks-specific activities such as tagging and voting. On average, across all staks, result selections accounted for just over 81% of all activities, with tagging accounting for just under 12% and voting for 6%.
In recent work we described the performance results of this trial, showing how larger groups tended to benefit from the increased collaboration effects of HeyStaks [9]. Members of shared staks answered significantly more questions correctly, and with fewer queries, than the members of solitary staks who did not benefit from collaboration. In this paper we are interested in exploring reputation. No reputation model was used during the live-user trial and so recommendations were ranked based on relevance only. However, the data produced makes it possible for us to replay the user trial so that we can construct our reputation models and use them to re-rank HeyStaks recommendations. We can retrospectively test the quality of re-ranked results versus the original ranking against a ground-truth relevance since, as part of the post-trial analysis, each selected result was manually classified by experts as relevant (the result contained the answer to a question), partially relevant (the result referred to an answer, but not explicitly), or not-relevant (the result did not contain any reference to an answer).

5.2 User Reputation
We now examine the type of user reputation values that are generated from the trial data. In Figure 2, box-plots are shown for the median reputation scores across the 4 shared staks and for each reputation model.

Figure 2: Reputation scores per user, per stak for the four reputation models: (a) WeightedSum, (b) PageRank, (c) Hubs, (d) Authority.

Here we see that for the WeightedSum model there is a clear difference in the median reputation score for members of the 5 person stak when compared to members of the larger staks. This is not evident in results for the PageRank model, which shows very similar reputation scores regardless of stak size. For the Hubs and Authority models we see very exaggerated median reputation scores for the largest 25-person stak, whereas the median reputation scores for members of the smaller staks are orders of magnitude less.

Next we consider, for members of each stak, how the reputation scores produced by the four reputation models compare. The pairwise rank correlations between user reputation scores given by each reputation model are shown in Table 2. With the exception of the 5 person stak (likely due to the relatively small number of users in this particular stak), correlations are seen to be high between the WeightedSum, PageRank and Authority models. For example, pairwise correlations between these models in the range 0.90-0.94 are observed for the 25 person stak. In contrast, the correlations between the Hubs model and the other models are much lower, and indeed are negative for the smaller 5 and 9 person staks. It is difficult to draw precise conclusions about the Hubs correlations for each of the staks concerned (given the constrained nature of the user-trial and the different numbers of users in each stak), but since the HITS Hubs metric is designed to identify pages that contain useful links towards authoritative pages in the web search domain (analogous to good consumers rather than producers in our context), such low correlations with the other models, which more directly focus on producer activity, are to be expected.

Further, a desirable property of a reputation model is that it should capture consumption diversity, meaning that in order for producers to gain high reputation, many consumers should benefit from the content that producers contribute to staks. Table 3 shows the Pearson correlation between the number of distinct consumers per producer (per stak) and producer reputation according to each of the four reputation models tested.
Table 2: Pairwise rank correlations between user reputation scores given by each reputation model for (a) 5 person, (b) 9 person, (c) 19 person and (d) 25 person staks.

(a) 5 person       PageRank   Hubs     Authority
    WeightedSum    0.90       -0.60    0.30
    PageRank                  -0.70    0.50
    Hubs                               -0.90

(b) 9 person       PageRank   Hubs     Authority
    WeightedSum    0.88       -0.67    0.72
    PageRank                  -0.68    0.70
    Hubs                               -0.98

(c) 19 person      PageRank   Hubs     Authority
    WeightedSum    0.84       0.31     0.83
    PageRank                  0.31     0.91
    Hubs                               0.37

(d) 25 person      PageRank   Hubs     Authority
    WeightedSum    0.94       0.35     0.90
    PageRank                  0.30     0.92
    Hubs                               0.18

Table 3: Correlations between the number of distinct consumers per producer per stak and producer reputation.

    Stak Size   WeightedSum   PageRank   Hubs     Authority
    5           0.41          0.75       -0.86    1.00
    9           0.52          0.58       -0.63    1.00
    19          0.78          0.85       0.43     0.98
    25          0.85          0.92       0.26     0.99

Across all staks, Authority displays the highest correlations (between 0.98 and 1), indicating that this model is particularly effective in capturing consumption diversity. This is to be expected, given that user Authority scores are directly influenced by the number of consumers interacting with them. In contrast, and given the nature of the Hubs model, it unsurprisingly fails to capture consumption diversity. For the larger staks, we can see that good correlations are also achieved for the WeightedSum and PageRank models, but less so for the smaller staks. In future work, we plan on refining our WeightedSum model in order to better reflect consumption diversity for such small-sized staks.

Figure 2 shows that there are significant differences in user reputation scores produced by the four different models. But how best to interpret these differences? In this work, we consider that the true test of these reputation models is the extent to which they improve the quality of results recommended by HeyStaks. We have described how HeyStaks combines term-based relevance and user reputation to generate its recommendation rankings (see Equation 2); in the following section we regenerate each of the recommendation lists produced during the trial using our reputation models and compare the performance of each.

5.3 From Reputation to Quality
Since we have ground-truth relevance information for all of the recommendations (relative to the quiz questions), we can determine the quality of the resulting recommendations. Specifically, we focus on the top recommended result and note whether it is relevant (that is, contains the answer to the question) or not relevant (does not contain the answer to the question). For each reputation model we compute an overall relevance rate, as the ratio of the percentage of recommendation sessions where the top result was deemed to be relevant, to the percentage of those where the top result was not-relevant. Moreover, we can compare this to the relevance rate of the recommendations made by the standard HeyStaks ranking (i.e. when w = 0 in Equation 2) in the trial to compute an overall relevance benefit, such that a relevance benefit of 40%, for a given reputation model, means that this model generated 40% more relevant recommendations than the standard HeyStaks ranking scheme.
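As an illustration, the relevance rate and relevance benefit just defined could be computed as follows. The session representation (one relevant/not-relevant flag per recommendation session) is our assumption, and the handling of partially relevant results is not specified here.

```python
def relevance_rate(top_is_relevant):
    """Ratio of sessions whose top-ranked result was relevant to sessions
    whose top-ranked result was not (one boolean per session)."""
    relevant = sum(top_is_relevant)
    return relevant / (len(top_is_relevant) - relevant)

def relevance_benefit(model_sessions, baseline_sessions):
    """Relative improvement (%) of a reputation model's relevance rate
    over the standard relevance-only ranking (w = 0 in Equation 2)."""
    base = relevance_rate(baseline_sessions)
    return 100.0 * (relevance_rate(model_sessions) - base) / base
```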
Figure 3 presents a graph of relevance benefit versus the weighting (w) used in Equation 2 to adjust the influence of term-based relevance versus user reputation during recommendation. The results for all four reputation models indicate a significant benefit in recommendation quality when compared to the standard HeyStaks recommendations. As we increase the influence of reputation over relevance during recommendation (by increasing w) we see a consistent increase in the relevance benefit, up to values of w in the range 0.5-0.7. For example, we can see that for w = 0.5, the reputation models are driving a relative improvement in recommendation relevance of about 30-40% compared to default HeyStaks’ relevance-only based recommendations. Overall the Hubs model performs best. It consistently outperforms the other models across all values of w and achieves a maximum relevance benefit of about 45% at w = 0.7. Looking at mean relevance benefit across reputation models, Hubs is clearly the best performer. For example, Hubs achieves a mean relevance benefit of 31%, while the other models achieve similar mean relevance benefits of between 21-25%. In a sense, this finding is counter-intuitive and highlights an interesting property of the HITS algorithm in this context. One might expect, for example, that the Authority model would outperform Hubs, given that Authority scores capture the extent to which users are good producers of quality search knowledge (i.e. users whose recommendations are frequently selected by other users), while Hubs captures the extent to which users are good consumers (i.e. users who select, tag, vote etc. HeyStaks recommendations deriving from the activity of good producers). However, given the man- !"#"$%&'"()"&"*+(,-.( '!" &!" %!" $!" #!" -./012.3456" 780.98:;" <5=>" ?521@A/2B" !" !" !(#" !($" !(%" !(&" !('" !()" !(*" !(+" !(," /"012+( Figure 3: Relevance benefit vs. reputation model. #" ner in which the collaboration graph is constructed (Section 4.2), once a user has consumed a recommended result, then that user is also considered to be a producer of the result in question if it is recommended by HeyStaks and selected by other users at future points in time. Thus, good consumers — who select recommended results from many good producers (i.e. producers with high Authority scores) — serve a “filter” for a broad base of quality search knowledge, and hence re-ranking default HeyStaks recommendations using reputation scores from the Hubs model leads to the better recommendation performance observed in Figure 3. 5.4 Limitations In this evaluation we have compared a number of reputation models based on live-user search data. One limitation of this approach is that although the evaluation uses liveuser search data, the final recommendations are not themselves evaluated using live-users. Instead we replay users’ searches to generate reputation-enhanced recommendations. The reason for this is the difficulty in securing sufficiently many live-users for a trial of this nature, which combines a number of reputation models and therefore a number of experimental conditions. That being said, our evaluation methodology is sound since we evaluate the final recommendations with respect to their ground-truth relevance. We have an objective measure of page relevance based on the Q&A nature of the trial and we use this to evaluate the genuine relevance of the final recommendations. 
The fact that our reputation models deliver relevance benefits above and beyond the standard HeyStaks recommendation algorithm is a clear indication that reputation provides a valuable ranking signal. Of course this evaluation cannot tell whether users will actually select these reputation-ranked recommendations, although there is no reason to suppose that they would treat these recommendations differently from the default HeyStaks recommendations, which they are inclined to select. We view this as a matter for future work.

Another point worth noting is that the live-user trial is limited to a specific type of search task, in this case a Q&A search task. Although such a task is informational in nature (according to stipulations set out by Broder [2]), it would be unsafe to draw general conclusions in relation to other more open-ended search tasks. However, this type of focused search task is not uncommon among web searchers and as such we feel it represents an important and suitable use-case that is worthy of evaluation. Moreover, previous work [19] has looked at the role of HeyStaks in more open-ended search tasks to note related benefits to end-users from its default relevance-based recommendations. As part of our future work we are currently in the process of deploying and evaluating our reputation model across similar general-purpose search tasks.

6. CONCLUSIONS
In this paper we have described a number of different user reputation models designed to mediate result recommendation in collaborative search systems. We have described the results of a comparative evaluation in the context of real-user data which highlights the ability of these models to improve overall recommendation quality when combined with conventional recommendation ranking metrics. Moreover, we have found that one model, based on the well-known HITS Hubs metric, seems to perform especially well, delivering relative improvements of up to 45%. We believe that this work lays the groundwork for future research in this area, which will focus on scaling up the role of reputation in HeyStaks and refining the combination of relevance and reputation during recommendation.

Our reputation model is utility-based [11], based on an analysis of the usefulness of producer recommendations during collaboration events. Currently, in HeyStaks the identity of users (producers and consumers) is not revealed and so users do not see where their recommendations come from. In the future it may be appropriate to relax this anonymity condition in certain circumstances (under user control). By doing so it will then be possible for individual users to better understand the source of their recommendations and the reputation of their collaborating users. As such this model can ultimately lead to the formation of trust-based relationships via search collaboration.

7. ACKNOWLEDGMENTS
This work is supported by Science Foundation Ireland under grant 07/CE/I1147.

8. REFERENCES
[1] S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In WWW '98: Proceedings of the 7th International Conference on World Wide Web, pages 107–117, Brisbane, Australia, 1998. ACM.
[2] A. Broder. A taxonomy of web search. SIGIR Forum, 36:3–10, September 2002.
[3] K. Bryan, M. O'Mahony, and P. Cunningham. Unsupervised Retrieval of Attack Profiles in Collaborative Recommender Systems. In RecSys '08: Proceedings of the 2008 ACM Conference on Recommender Systems, pages 155–162, New York, NY, USA, 2008. ACM.
[4] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme. FolkRank: A Ranking Algorithm for Folksonomies. In FGIR 2006, pages 2–5. CiteSeer, 2006.
[5] A. Jøsang, R. Ismail, and C. Boyd. A Survey of Trust and Reputation Systems for Online Service Provision. Decision Support Systems, 43(2):618–644, 2007.
[6] J. M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46(5):604–632, 1999.
[7] S. K. Lam and J. Riedl. Shilling recommender systems for fun and profit. In Proceedings of the 13th International World Wide Web Conference (WWW '04), pages 393–402, New York, NY, USA, May 17–20 2004. ACM.
[8] M. Lazzari. An Experiment on the Weakness of Reputation Algorithms Used in Professional Social Networks: The Case of Naymz. In IADIS International Conference e-Society, pages 519–522, 2010.
[9] K. McNally, M. P. O'Mahony, and B. Smyth. Social and Collaborative Web Search: An Evaluation Study. In Proceedings of the 16th International Conference on Intelligent User Interfaces (IUI '11), pages 387–390, Palo Alto, California, USA, 2011. ACM.
[10] K. McNally, M. P. O'Mahony, B. Smyth, M. Coyle, and P. Briggs. Collaboration and Reputation in Social Web Search. In 2nd Workshop on Recommender Systems and the Social Web, in association with the 4th ACM Conference on Recommender Systems (RecSys 2010), 2010.
[11] K. McNally, M. P. O'Mahony, B. Smyth, M. Coyle, and P. Briggs. Towards a Reputation-based Model of Social Web Search. In Proceedings of the 14th International Conference on Intelligent User Interfaces (IUI 2010), pages 179–188, Hong Kong, China, 2010.
[12] K. McNally, M. P. O'Mahony, B. Smyth, M. Coyle, and P. Briggs. A Case-study of Collaboration and Reputation in Social Web Search. ACM Transactions on Intelligent Systems and Technology (in press), 2011.
[13] B. Mobasher, R. Burke, R. Bhaumik, and C. Williams. Toward trustworthy recommender systems: An analysis of attack models and algorithm robustness. ACM Transactions on Internet Technology (TOIT), 7(4):1–40, 2007.
[14] J. O'Donovan and B. Smyth. Trust in Recommender Systems. In IUI '05: Proceedings of the 10th International Conference on Intelligent User Interfaces, pages 167–174, 2005.
[15] M. P. O'Mahony, N. J. Hurley, and G. C. M. Silvestre. Promoting recommendations: An attack on collaborative filtering. In Proceedings of the 13th International Conference on Database and Expert Systems Applications (DEXA 2002), pages 494–503, Aix-en-Provence, France, 2002. Springer.
[16] P. Resnick and R. Zeckhauser. Trust Among Strangers in Internet Transactions: Empirical Analysis of eBay's Reputation System. Advances in Applied Microeconomics, 11:127–157, 2002.
[17] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[18] R. Schifanella, A. Barrat, C. Cattuto, B. Markines, and F. Menczer. Folks in folksonomies: social link prediction from shared metadata. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10), pages 271–280, New York, New York, USA, 2010. ACM.
[19] B. Smyth, P. Briggs, M. Coyle, and M. P. O'Mahony. Google Shared. A Case Study in Social Search. In User Modeling, Adaptation and Personalization. Springer-Verlag, June 2009.
Yokie - A Curated, Real-time Search & Discovery System using Twitter

Owen Phelan, Kevin McCarthy and Barry Smyth
CLARITY Centre for Sensor Web Technologies
School of Computer Science & Informatics
University College Dublin
Email: [email protected]

ABSTRACT
Social networks and the Real-time Web (RTW) have joined Search and Discovery as central pillars of online human activities. These are staple venues of interaction, with vast social graphs facilitating messaging and sharing of information. Twitter(1), for example, boasts 200 million users posting over 150 million messages every day. Such volumes of content being disseminated make for a tempting source of relevant content on the web. In this paper, we present Yokie, a novel search and discovery system that sources its index from the shared URLs of a curated selection of Twitter users. The added benefit of this method is that tweets containing these URLs carry extra contextual information, from terms describing the URL and publishing time down to tweet metadata which can include location and user data. Also, since we are exploiting a social graph structure of content sharing, it is possible to explore novel reputation ranking of content. The mixture of contextual data with the fundamental harnessing of sharing activities amongst a curated set of users combines to produce a novel system that, in an initial online user trial, has shown promising results.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Algorithms, Experimentation, Theory

Keywords
Search, Discovery, Information Retrieval, Relevance, Reputation, Twitter

(1) Twitter - http://www.twitter.com

Figure 1: Analysis of 5 public Twitter datasets of varying sizes consisting of public tweets, with the percentage of tweets containing URLs. Datasets were gathered at various points between 2009 and 2011. Sets 1, 2 and 3 were focussed scrapes, specific to a set of hashtags. Sets 4 and 5 were general public scrapes of the Twitter firehose.

              Tweet count   Tweet count (with URL)   %
    Sample 1  54221         11964                    22.065251
    Sample 2  1411784       331445                   23.47703
    Sample 3  6924205       1539323                  22.231043
    Sample 4  7453870       1647295                  22.09986
    Sample 5  60042573      13113525                 21.840378
                            Average:                 22.468298
                            Std. Dev.:               0.67627121

1. INTRODUCTION
Google(2), Bing(3) and Yahoo!(4) are household tools for finding relevant items on the web, of varying quality and relevance to the user's search query or task. These systems rely on the use of automatic software "crawlers" that build queryable indexes by navigating the web of documents. These crawlers index documents based on their content, find edges between each document (hyperlinks), and perform a set of weighting and relevance calculations to decide on hubs and authorities of the web, while improving index quality [3]. More recently, search systems have started to introduce context into their ranking and retrieval strategies, such as location and time of document publication. These are mostly content-based (related to documents' actual content), as it is difficult for a web crawler to determine the precise contextual features of a web document.

(2) Google - http://www.google.com
(3) Microsoft Bing - http://www.bing.com
(4) Yahoo! - http://www.yahoo.com
These signals are mostly content-based (related to the document's actual content), as it is difficult for a web crawler to determine the precise contextual features of a web document. Social networks are an abundant resource of social activity and discussion. In the case of Twitter, we estimate that an average of 22% of tweets contain a hyperlink to a document (analysis shown in Figure 1). This rate has held steady despite the three-fold increase in Twitter's tweet-per-day rate in the past year, and a ten-fold increase between 2009 and 2010. These URLs can point to news items, photos, geo-located "check-ins" and videos, as well as vanilla websites [13]. The interesting dynamic here is that these networks allow users to repost, or retweet, other people's items, which allows these links to propagate throughout the graphs of users on the service.

2 Google - http://www.google.com
3 Microsoft Bing - http://www.bing.com
4 Yahoo! - http://www.yahoo.com

In this paper, we present Yokie, a novel search and discovery system with several main attributes relating to content sources, querying, ranking and retrieval:

1: The system uses posted and shared content that contains hyperlinks as the basis of an index of webpages, the main content of which is the user-generated text included with each hyperlink.

2: It operates using a curated list of sources; in this case, these sources are Twitter users who post Tweets. These curated lists of users are called Search Parties.

3: The querying component of the system allows users to add extra contextual filters in addition to query strings, in the form of a temporal window (between two dates). It extracts a range of contextual features from shared content, and as such the system's querying UI can be adapted to exploit these also.

4: The added contextual data from these messages can give the user interesting and more relevant ways of ranking content over traditional approaches, as well as interesting item discovery opportunities.

We use the contextual data for presentation and added ranking functionality. Emphasis is also placed on contextual filtering, such as temporal windows in queries, rather than just single-attribute keyword search. In the following sections, we will describe the system in greater detail, along with the details of a live user evaluation of the prototype and a discussion of the results and future directions.

2. BACKGROUND
Social network activity dominates traffic and per-user time spent on the web [13]. Real-time web products and services provide access to new types of information, and the real-time nature of these data streams provides as many opportunities as challenges. In addition, companies like Twitter have adopted a very open approach to making their data available via APIs5. It is no surprise then that the recent literature includes analyses of Twitter's real-time data and graph, largely with a view to developing an understanding of why and how people are using services like Twitter; see for example [5, 7, 8, 11, 13, 15]. For instance, the work of Kwak et al. [13] describes a very comprehensive analysis of Twitter users and Twitter usage, covering almost 42 million users, nearly 1.5 billion social connections, and over 100 million tweets. The authors examined reciprocity and homophily among Twitter users, compared a number of different ways to evaluate user influence, and investigated how information diffuses through the Twitter ecosystem as a result of social relationships and retweeting behaviour.

5 Twitter Developer API - http://developer.twitter.com

Figure 2: Example of a Twitter tweet containing a hyperlink and contextual information (user @phelo, time 16:23 GMT, location Dublin, Ireland; text "obama in japan on #g20 #ecotalks http://bit.ly/82os5zx"; extracted TERMS: obama, japan; #HASHTAGS: #g20, #ecotalks; URL: http://bit.ly/82os5zx).

Some of our own previous work has explored using Twitter as a news discovery and recommendation service, with item discovery appearing to be a prominently useful feature [15]. Krishnamurthy et al. identify classes of Twitter users based on behaviours and geographical dispersion [12]. They highlight the process of producing and consuming content based on retweet actions, where users source and disseminate information through the network. Other work by Chen et al. and Bernstein et al. has looked at using the content of messages to recommend content topically [1, 4].

Curation and content-editorial are age-old practices in publishing. News organizations operate editorial teams to filter output for relevant, interesting, topical and aesthetic content for their audiences. In the domain of recommender systems, curation is an interesting avenue of exploration, for example as a benchmark against automatic or intelligent methods of item recommendation. Related to the idea of curation are the notions of the Trust, Provenance and Reputation of those who provide input into the system. Reputation scoring is an active field in Recommender Systems [17] and Social Search Systems [2]; in particular, focus is placed on finding reputable sources of information from which to extract and present content. As an example, the TrustRank technique proposed by Gyongyi et al. computes a reputation score for elements in a web-graph with the purpose of detecting spam [6], while several explorations such as those by McNally et al. have considered computing reputable users in a social search context [14]. Yokie's inherent novelty starts with the broad range of related research fields that it opens for exploration. The system's main features and technologies are described in the next section.

3. YOKIE
Twitter is an expansive natural resource of user-generated content; while each item may seem to comprise only 140 characters, it also contains a rich quantity of metadata and contextual information that is published in a timely manner. Rather than an automatic crawler locating documents on the web, Yokie follows a curated set of users on Twitter, and is capable of receiving a stream of documents contributed by users in a timely fashion. These documents contain hyperlinks, descriptive text, and other metadata such as time-of-publishing and the number of people in a potential audience. In the example of Figure 2, the user @phelo has posted a URL with a set of text (obama in japan on #g20 #ecotalks) at a given time. The system extracts the URL, resolves it (expanding it to e.g. www.cnn.com/obama.html) and stores it. Instead of using the content of that URL as the basis of the search index, it uses the set of surrounding text. The index also takes into account the time the tweet was published. The index item for that URL also contains related content from tweets posted by the curated set of users that contain the same URL. These contextual pieces of metadata are all stored so the system can perform content ranking and re-ranking.
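To make the indexing idea concrete, here is a minimal Python sketch (our illustration, not the authors' implementation) of folding a tweet's surrounding text, rather than the page content, into an index entry keyed by its resolved URL; the resolve step is stubbed out and all names are hypothetical.

import re
from dataclasses import dataclass, field

URL_RE = re.compile(r"https?://\S+")

@dataclass
class IndexEntry:
    url: str                                   # resolved target URL
    terms: set = field(default_factory=set)    # surrounding tweet text, not page content
    hashtags: set = field(default_factory=set)
    first_seen: int = 0                        # Unix timestamp of the earliest mentioning tweet
    mentions: int = 0                          # number of tweets sharing this URL

def resolve(short_url):
    # Placeholder: a real system would follow redirects (e.g. bit.ly -> cnn.com).
    return short_url

def index_tweet(index, text, timestamp):
    # Fold one tweet into the index, keyed by resolved URL.
    for short in URL_RE.findall(text):
        url = resolve(short)
        entry = index.setdefault(url, IndexEntry(url=url, first_seen=timestamp))
        words = [w.lower() for w in text.split() if not w.startswith("http")]
        entry.hashtags.update(w for w in words if w.startswith("#"))
        entry.terms.update(w for w in words if not w.startswith("#"))
        entry.first_seen = min(entry.first_seen, timestamp)
        entry.mentions += 1

index = {}
index_tweet(index, "obama in japan on #g20 #ecotalks http://bit.ly/82os5zx", 1312212180)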
To give a typical use case: a user may query the term "obama" in a traditional search system and be returned relevant content based on some ranking strategy. In Yokie, the UI allows the user to directly query a search term along with a defined temporal window, so the query will look like "obama" between "6 hours ago" and "now". Yokie has the ability to parse natural-language date-time strings, which we feel allows for an easier definition of the search task than cumbersome date-picking UIs. The results list can then be re-ranked using a number of features. In the following sections, we will delve into greater detail on how the system accomplishes its data gathering, storage, retrieval, presentation and user interactions. We will also discuss several ranking strategies that enable the user to explore results based on the contextual data from both their originating and related tweets.

3.1 Architecture
The architecture of the system, as presented in Figure 4A, highlights how data is gathered, stored and queried, and how results are presented to the user. Briefly, the data-gathering agent scrapes tweets (defined by a Search Party), filters and parses their content, resolves URLs, finds mentions of those URLs and analyzes tweet metadata; the indexer stores tweet content, timestamp, URL and hashtags in the main content index; the querying system parses the query term and date window, queries the main content index and retrieves results together with their metadata; and the re-ranking system re-orders results based on that metadata (age, reputation, mentions, relevance, etc.). Here, we will discuss each main component and how it operates at query, retrieval, results-presentation and interaction time.

3.1.1 The Search Party
A key component of the system is a curated list of content sources, which we have termed a Search Party6. An example could be a user curating a list of Twitter users who they believe to be related to, or indeed talk about, a given domain. This allows users to curate dedicated search engines for personal and community use based around a domain-specific topic. In the current prototype we have curated a seed list of 140 Twitter users who mostly discuss Technology and have been listed in Twitter's list feature under a Technology category.

6 It is intended that when the system has a number of these parties, they are indexed separately.
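As a toy sketch of the Search Party idea (our illustration; the handles and stream are hypothetical), the gathering agent simply keeps tweets whose author belongs to the curated list:

# A Search Party is a curated set of source accounts; the data-gathering
# agent keeps only tweets posted by its members.
tech_party = {"@phelo", "@tech_writer", "@gadget_news"}   # hypothetical handles

stream = [
    {"author": "@phelo", "text": "obama in japan on #g20 http://bit.ly/82os5zx"},
    {"author": "@random_user", "text": "no links here"},
]
curated = [t for t in stream if t["author"] in tech_party]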
Figure 4: Yokie System - (A) Full system architecture (illustrated with the query "iPad" from "1 day ago" to "now"). (B) The process of describing and indexing a document: tweets by curated users sharing the same URL are merged into one Solr index document, while metadata and public retweets or mentions of the URL are stored in the database.

Figure 3: Yokie in a browser. As shown above, the system takes a traditional approach to layout; it includes functions for viewing extra metadata related to an item (pane to the right of the "DesignTaxi" item).

3.1.2 Content Gathering
The data-gathering agent uses Twitter's API to scrape a domain of tweets, or a subset of the total stream. The system can be adapted to listen to the public stream, or sources can be curated based on user lists, keywords, geographical metadata or algorithmic analysis of relevant, interesting or important content. These messages are then stored and indexed using the service described in the following subsection. This component also carries out real-time language classification and finds related messages that contain the same URL so the system can calculate item popularity.

3.1.3 Storage & Indexing
Once content is gathered, it is pushed to the Storage and Indexing subsystem. This is responsible for extracting metadata from the tweets, for instance timestamp data, hashtags (#obama, etc.), user profile information and location, as well as the message content itself. The process by which Yokie indexes and stores metadata is described in Figure 4B. The main content, together with the urlID of the URL mentioned in the message and the timestamp, is pushed to an indexer for storage and querying. In our current implementation we use Apache Solr7 for this. We store the remaining extracted metadata in a database; the current implementation uses the MongoDB8 NoSQL system for this functionality. These content indexes and databases give us a quick, programmable way of querying the content, while providing an equally handy way of gathering associated metadata for presenting the results of a contextual query, re-ranking based on metadata and presenting further metadata to the user.

7 Apache Solr - http://lucene.apache.org/solr/
8 MongoDB - http://mongodb.org

3.1.4 User Interface
The system is fronted by a query interface, currently comprising a query-string field and two temporal fields, from and to. The system takes in a query string with an associated time window, which can be expressed either in natural language (e.g. "1 day ago", "now", "last week") or as a fixed date ("12 December 2010"). The UI also allows users to drill down on results to explore related content such as the original tweet that the URL was shared with, the time and day it was shared, and the related Tweet mentions (if any). A re-ranking menu is also presented, which allows users to re-rank the results (this will be discussed further below). Such interface tweaks have been successfully applied to systems in the past and have been shown to provide added value for users and to motivate participation [16] (in this case, for the curated list of Twitter users).

3.1.5 Querying
The querying subsystem is the largest component of the system.
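As a minimal illustration of the kind of natural-language window parsing described next (our sketch, standard library only; Yokie's actual parser is not specified at this level of detail):

import re
from datetime import datetime, timedelta

UNITS = {"hour": 3600, "day": 86400, "week": 604800}

def parse_time(expr, now):
    # Turn "now", "yesterday", "N <unit>s ago" or a fixed date into a Unix timestamp.
    expr = expr.strip().lower()
    if expr == "now":
        return int(now.timestamp())
    if expr == "yesterday":  # 12am the day before, as in the paper
        midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
        return int((midnight - timedelta(days=1)).timestamp())
    m = re.match(r"(\d+)\s+(hour|day|week)s?\s+ago", expr)
    if m:
        return int(now.timestamp()) - int(m.group(1)) * UNITS[m.group(2)]
    # Fixed dates such as "12 december 2010" (local time, for illustration only).
    return int(datetime.strptime(expr, "%d %B %Y").timestamp())

now = datetime(2011, 8, 1, 12, 0, 0)
# The query triple {QueryString, Tmax, Tmin}:
query = ("obama", parse_time("now", now), parse_time("6 hours ago", now))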
This subsystem parses user queries based on the triple {QueryString, Tmax, Tmin}. The natural-language date strings it receives (e.g. "1 week ago" to "1 hour ago") are parsed into a computer-readable format (e.g. 12 June 2011 12:31:41 translates to the Unix timestamp 1307881901). Users can specify exact dates, as well as special keywords such as "yesterday" (12am the day before) and "now". The triple is pushed to the querying subsystem, and a set of database IDs of URLs (urlIDs) is returned. The querying system takes these resulting urlIDs and finds the complete database object for each URL. These objects contain pertinent metadata for the URL: its title, expanded hyperlink and description, as well as the surrounding Tweet content related to the initial tweet that mentioned it. At querying time, the system uses the expanded metadata from the database to re-rank the vector of URLs based on the user's specified ranking strategy; the strategies are explained in the following subsection.

3.2 Ranking Strategies
For our initial prototype of the system, we implemented several ranking strategies that are used to order the results lists presented to users. The main aim in exposing several strategies beyond the main relevance strategy was to expose the potential benefits of the added contextual data extracted from the stream of tweets.

3.2.1 Relevance
Traditional IR systems use Term Frequency x Inverse Document Frequency scoring (TFxIDF) [18]; we label this "Relevance" in Yokie, and it represents this evaluation's benchmark. The indexing component of the architecture natively ranks items by relevance. The remaining strategies rank items algorithmically post retrieval-time, as described in the following subsections.

3.2.2 Item Age
Since the indexed content is timely and the querying system has temporal windowing, it is prudent to offer the ability to rank items by their age. We allow the user to rank the list with newer or older items first. This is particularly useful in the context of the temporal window, as users may query between a certain date or time and "now", then rank by newest first. This gives them near-real-time updating of content related to the query.

3.2.3 Item Popularity
When an item is indexed by the data-gathering agent, a separate thread begins that searches Twitter for mentions of the same URL. As such, we consider Popularity a useful metric: the total number of unique mentions of a given URL inside the query time window. These related tweets are sourced from the public feed, as well as from amongst the users of the curated Search Party.

3.2.4 Item Longevity
Longevity describes the total length of time an item appears in the domain (the amount of time between the first mention/activity and the last mention/activity of the item). This score applies to items that have more than one occurrence in the set. For example, a given URL U has a longevity score l(U) = Tmax - Tmin, the difference between the Unix timestamp of the latest mention Tmax and that of the first mention Tmin.

3.2.5 Reputation
As described in Section 2, reputation is an interesting subfield appearing in recommender systems and search contexts. In the current iteration of Yokie, we use reputation scoring of Twitter users to rank items, placing items from more reputable users higher in a descending list. In this iteration of the system, we use a shallow summation of the total potential audience of the URL, based on the sum of the follower counts of each person in the curated domain list. Our motivation is the notion that follower relationships in Twitter's directed social graph may act as a form of promotion, or a vote in favour of a person worth following. In future iterations of the system we hope to explore a range of more comprehensive reputation scoring based on graph analyses and topic detection.

In the following section we will discuss a live user evaluation which, among other things, encouraged user re-ranking of results lists using the properties discussed above.

4. ONLINE USER EVALUATION
In order to capture some preliminary usage statistics, we developed the prototype (as seen in Figure 3) for a live user evaluation. Our aim was to examine patterns in how users interacted with the querying interface and the subsequent results lists on a session-level basis. We launched the prototype system, which was online for one week. In this setup, we curated a list of 140 users on Twitter who we believed contributed mostly technology content. Yokie's data-gathering agent gathered past and current content from each account; in all, during the evaluation it captured 75,021 unique URLs, ranging from May 2007 to the present hour of the evaluation. Coverage varied depending on the number of statuses and the posting frequency of each account in the curated list. The evaluation participants were broadly technology and Computer Science research students and colleagues within our group. The definition of the technology-oriented search party was explained at the outset. We captured a range of participant activities in the system. For each user, we counted search sessions, which contained one or more queries, as well as any click-throughs from the results list and highlighting actions in the metadata window. Initially, we defaulted the temporal window to between "one month ago" and "now" (the time the query was performed). Data from each result interaction was captured, including list positions and scores. The results of these interactions are presented in the following section.

5. PRELIMINARY RESULTS
In this section, we focus on preliminary results from the evaluation that represent usage trends and patterns, and highlight the novel nature of the system and its interface. In total, the system ran for seven days, with 35 unique participants who performed 223 queries.

5.1 Sessions
In a broad sense, each user partook in search sessions that contained queries and other interactions. A session is defined as the total set of interactions (queries, click-throughs, hovers, reformulations, etc.) between the initial query and the final instance of user interaction. The evaluation gathered 69 search sessions in total across all 35 participants. On average, each user performed 3 queries per session. Here, we will discuss query activity, focusing particularly on the temporal windowing aspect of the query.

5.1.1 Queries
As mentioned, there were in total 223 queries performed by all users across the 7 days. When we delve deeper into the makeup of the queries themselves, we see that among those 223 queries there were 116 unique query strings, showing high overlap and duplication across the set. In some of these cases, the overlap was due to query reformulation, which will be discussed in a section below.
The average query length amongst this set of unique queries was 1.54 terms, or an average of 9 characters.

5.1.2 Temporal windowing
As mentioned in the previous sections, Yokie's querying interface places an emphasis on the searcher providing a temporal window for each query. Across the 223 queries, the average time window was 55 days in size; however, 198 of the queries had the temporal window starting from "now" and extending backwards, which means people were interested in currently updating content relating to a query. Some further results relating to the windowing are described in Section 5.3.

5.2 Result-lists & User Interactions
The main user interactions with results lists were item "hovering", where users were encouraged to explore extra metadata regarding the item, and clicking on the items themselves. Here we describe these interactions and some details of the results-lists makeup.

5.2.1 Results-lists
For each of the 223 queries, the results lists ranged between 3 and 4000 returned items, with an average size of 294 items. In each case, the user was only presented with the top 50 according to the ranking strategy. For each interaction with items on the results list, data such as the position of the item in the list was captured.

5.2.2 Clicks & Hovers
Once results lists were presented to the user post-querying, the user had the option either to "peek" at extra metadata relating to the URL, as shown in the screenshot in Figure 3, or to click on the item in a traditional fashion to visit the page. One interesting metric was that while there was a reasonable number of 80 click-throughs, there was tremendous interest in the metadata pane, which garnered 267 activations. Also interesting was that 90% of the click-throughs the system captured occurred on items that had their metadata exposed. If we consider click-throughs to be the ultimate success metric of a query, this shows that users are highly interested in exploring more information relating to their results.

Figure 5: Breakdown of the frequency of re-ranking per strategy (y-axis: number of re-ranks; strategies: Relevance, Newer First, Older First, Mentions, Reputation).

5.2.3 Re-ranking of results
The ranking strategies outlined in Section 3, namely Relevance, Newest first, Oldest first, Popularity, Reputation and Longevity, were implemented in the system, and each was explained to the users at the beginning of the live evaluation. As shown in Figure 5, users preferred re-ranking using each of the strategies over the benchmark relevance metric. Reputation was the most popular re-ranking strategy employed, though only slightly ahead of Popularity. Users also liked to rank with newer items at the top of the results lists. This, perhaps, shows the utility of a real-time search system, which can potentially provide up-to-the-minute results based solely on freshness.

5.3 Query Reformulation
One interesting result that emerged concerns the user activity of query reformulation. Related search-analysis work [9, 10] has discussed the user practice of reformulating queries in search sessions. In all, there were 24 sessions that contained a reformulation of the query, but in each of these reformulations the user did not change the query term in any way; the reformulation exclusively took the form of a modification of the time window.
In 8 of the cases, the users simply refreshed the results: the Tmin value (the "date to" field) was set at "now", which meant that content was potentially changing at a real-time rate, although the results count did not in fact change during these reformulations. In 5 cases, the reformulation narrowed the time window; for example, a user's query for "iPad" from "3 weeks ago" to "3 days ago" was reformulated to "1 week ago" to "1 day ago". The remaining 11 cases of reformulation widened the time window, with users querying over a broader period of time; these typically yielded significantly larger results lists. These time-based query reformulations may have occurred because we provided an explicit interface for modifying the time window. Another interesting interaction, or lack thereof, relates to re-ranking: within the queries that users reformulated, they did not once re-rank first.

6. CONCLUSIONS
In this paper, we have presented a novel search and discovery platform that harnesses users' urge to share content on the real-time web as a basis for finding and indexing relevant content. We also explore other emerging themes in information discovery, such as curation as a means of selecting and editing the underlying structure of a system. As mentioned in the previous section, an initial user evaluation has shown the system to provide an engaging search interface that allows, among other things, easy query reformulation. Presently we are formulating changes to the system based on the outcomes and observations of the evaluation described in this paper. We hope to explore further the potential of the contextual features that are extracted from Twitter and other social networks. Analysis of social graphs is commonly exploited in recommendation, and the unique architecture we have employed in Yokie will allow us to explore the use of these graphs further, especially in the context of user and item reputation scoring. The curation system merits further exploration, particularly as a basis of evaluation compared with standard automatic neighborhood formation. One considerable experiment would examine the role and usefulness of curation in such a system, as compared with automated systems for content indexing. Techniques surrounding tag recommendation and content and query expansion would be starting avenues, as would topic detection using algorithms such as Latent Dirichlet Allocation to group items and relate them to a query based on topical similarity. Naturally, all of these proposed techniques would culminate in a larger-scale user trial involving many more participants, with a more focused agenda to explore each of them. Yokie is novel in that it is positioned at the union of many different fields of research, including IR, Recommender Systems, Social Search systems and social networks, to name but a few. As such, it has wide potential for users and research goals.

7. ACKNOWLEDGEMENTS
With sincere thanks to our evaluation participants. This work is generously supported by Science Foundation Ireland under Grant No. 07/CE/11147 CLARITY CSET.

8. REFERENCES
[1] M. S. Bernstein, Bongwon Suh, Lichan Hong, Jilin Chen, Sanjay Kairam, and E. H. Chi. Eddi: Interactive topic-based browsing of social status streams. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology (UIST '10), pages 303–312, 2010.
[2] Oisín Boydell and Barry Smyth. Capturing community search expertise for personalized web search using snippet-indexes.
In Proceedings of the 15th ACM international conference on Information and knowledge management, CIKM '06, pages 277–286, New York, NY, USA, 2006. ACM.
[3] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30:107–117, April 1998.
[4] Jilin Chen, Rowan Nairn, Les Nelson, Michael Bernstein, and Ed Chi. Short and tweet: experiments on recommending content from information streams. In Proceedings of the 28th international conference on Human factors in computing systems, CHI '10, pages 1185–1194, New York, NY, USA, 2010. ACM.
[5] Sandra Garcia Esparza, Michael P. O'Mahony, and Barry Smyth. On the real-time web as a source of recommendation knowledge. In RecSys 2010, Barcelona, Spain, September 26-30 2010. ACM.
[6] Zoltán Gyöngyi, Hector Garcia-Molina, and Jan Pedersen. Combating web spam with trustrank. In VLDB '04: Proceedings of the Thirtieth international conference on Very large data bases, pages 576–587. VLDB Endowment, 2004.
[7] John Hannon, Mike Bennett, and Barry Smyth. Recommending twitter users to follow using content and collaborative filtering approaches. In RecSys 2010: Proceedings of the 4th ACM Conference on Recommender Systems, Barcelona, Spain, September 26-30 2010. ACM.
[8] Bernardo A. Huberman, Daniel M. Romero, and Fang Wu. Social networks that matter: Twitter under the microscope. SSRN eLibrary, 2008.
[9] Bernard J. Jansen, Danielle L. Booth, and Amanda Spink. Patterns of query reformulation during web searching. Journal of the American Society for Information Science and Technology, 60(7):1358–1371, July 2009.
[10] Bernard J. Jansen, Amanda Spink, and Vinish Kathuria. How to define searching sessions on web search engines. In Proceedings of the 8th international conference on Knowledge discovery on the web: Advances in web mining and web usage analysis, WebKDD '06, pages 92–109, Berlin, Heidelberg, 2007. Springer-Verlag.
[11] Akshay Java, Xiaodan Song, Tim Finin, and Belle Tseng. Why we twitter: understanding microblogging usage and communities. In Proceedings of the Joint 9th WEBKDD and 1st SNA-KDD Workshop, pages 56–65, 2007.
[12] Balachander Krishnamurthy, Phillipa Gill, and Martin Arlitt. A few chirps about twitter. In WOSP '08: Proceedings of the first workshop on Online social networks, pages 19–24, NY, USA, 2008. ACM.
[13] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is twitter, a social network or a news media? In WWW '10, pages 591–600, 2010.
[14] Kevin McNally, Michael P. O'Mahony, Barry Smyth, Maurice Coyle, and Peter Briggs. Towards a reputation-based model of social web search. In Proceedings of the 15th international conference on Intelligent user interfaces, IUI '10, pages 179–188, New York, NY, USA, 2010. ACM.
[15] Owen Phelan, Kevin McCarthy, Mike Bennett, and Barry Smyth. Terms of a feather: content-based news recommendation and discovery using twitter. In Proceedings of the 33rd European conference on Advances in information retrieval, ECIR '11, pages 448–459, Berlin, Heidelberg, 2011. Springer-Verlag.
[16] Al M. Rashid, Kimberly Ling, Regina D. Tassone, Paul Resnick, Robert Kraut, and John Riedl. Motivating participation by displaying the value of contribution. In Proceedings of the SIGCHI conference on Human Factors in computing systems, CHI '06, pages 955–958, New York, NY, USA, 2006. ACM.
[17] Paul Resnick, Ko Kuwabara, Richard Zeckhauser, and Eric Friedman. Reputation systems. Commun. ACM, 43:45–48, December 2000.
[18] Fabrizio Sebastiani.
Machine learning in automated text categorization. ACM Comput. Surv., 34:1–47, March 2002.

Testing Collaborative Filtering against Co-Citation Analysis and Bibliographic Coupling for Academic Author Recommendation

Tamara Heck, Isabella Peters and Wolfgang G. Stock
Heinrich-Heine-University, Dept. of Information Science
D-40225 Düsseldorf, Germany
[email protected], [email protected], [email protected]

ABSTRACT
Recommendation systems have become an important tool for overcoming information overload and helping people make the right choice among needed items, which can be e.g. documents, products, tags or even other people. The last item type has aroused our interest: scientists need different collaboration partners to work with, i.e. experts on topics similar to their own research field. Co-citation and bibliographic coupling have become standard measurements in scientometrics for detecting author similarity, but it can be laborious to collect these data accurately. As collaborative filtering (CF) has been shown to give acceptable results in recommender systems, we investigate a comparison of scientometric analysis methods and CF methods. We use data from the social bookmarking service CiteULike as well as from the multi-discipline information services Web of Science and Scopus to recommend authors as potential collaborators for a target scientist. The paper aims to answer how a relevant author cluster for a target scientist can be proposed with CF, and how the results differ in comparison with co-citation and bibliographic coupling. In this paper we show first results, complemented by an explicit user evaluation with the help of the target authors.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications – Scientific databases. H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Information filtering. H.3.5 [Information Storage and Retrieval]: Online Information Services – Web-based services.

General Terms
Measurement, Experimentation, Human Factors, Management.

Keywords
Collaborative Filtering, Recommendation, Evaluation, Social Bookmarking, Personalization, Similarity Measurement, Bibliographic Coupling, Author Co-Citation, Social Tagging.

1. INTRODUCTION
An important task for knowledge management in academic settings and in knowledge-intensive companies is to find the "right" people who can work together to successfully solve a scientific or technological problem. This can either be a partner having the same skills and providing similar know-how, or someone with complementary skills to form a collaborative team. In both cases the research interests must be similar. Amongst other things, this interest can be figured out from a person's scientific publications.
As examples, we list some situations in which expert recommendations are very useful:
• compiling a (formal) working group in a large university department or company,
• assembling researchers to prepare a project proposal for a research grant (inside and outside the department or company),
• forming a Community of Practice (CoP), independent of institutional affiliation and following only shared interests,
• approaching colleagues in preparation for a congress, a panel or a workshop,
• asking colleagues for contributions to a handbook or a specialized journal issue,
• finding appropriate co-authors.

For cooperation in science and technology it is very important that the reputation of the experts is established [15]. A recommendation service must not suggest just anybody who is possibly relevant, but has to check up on the expert's reputation. The reputation of a person in science and technology grows with his or her number of publications in peer-reviewed journals and with the citations of those publications [14]. So we are going to use academic information services, which store publication and citation data, as the basis for our author recommendation. Multi-discipline information services which allow publication and citation counts are Web of Science (WoS) and Scopus [34, 40, 41]. Additionally, our experimental expert recommendation also applies data from CiteULike, a social bookmarking service for academic literature [18, 22]. So we can consider not only the authors' perspectives (by tracking their publications, references and citations via WoS and Scopus), but also the perspectives of the readers (by tracking their bookmarks and tags via CiteULike) to recommend relevant partners. Our research questions are: 1) Can we propose a relevant author cluster for a target scientist with CF applying CiteULike data? 2) Are these results different from the results based on co-citation and bibliographic coupling?

Recommender systems (RS) nowadays use different methods and algorithms to recommend items, e.g. products, movies, music or articles, to a Web user. The aim is personalized recommendation [5], i.e. to produce a list of items which are unknown to the target user and which he might be interested in. One problem is to find the best resources for user a and to rank them according to their relevance [16]. Two approaches are normally distinguished (among other recommender methods and hybridizations): the content-based approach, which tries to identify similarities between items, based on their content, with items positively rated by user a; and the collaborative filtering approach (CF), which considers not only the ratings of user a, but also the ratings of other users [a.o. 16, 20, 25, 35, 37, 42, 48]. One advantage of CF over the content-based method is that recommendations rely on the evaluation of other users and not only on the item's content, which can be an inappropriate indicator of quality. RS work with user ratings assigned to the items, also called user-item responses [16]: they can be scalar (e.g. 1-5 stars), binary (like/dislike) or unary, i.e. a user doesn't rate an item, but his purchase or access of the item is taken as a positive response. The latter kind of user-item response can also be used for recommendations in social tagging systems (STS), e.g. social bookmarking systems like BibSonomy, CiteULike and Connotea [38]. STS have a folksonomy structure with user-resource-tag relations, which is the basis for CF.
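As a toy illustration of the unary case (our sketch, with hypothetical users and papers): a bookmark is an implicit positive response, so the user-item "matrix" reduces to a set of (user, resource) pairs from which candidate neighbours can be read off.

# A bookmark is a unary (implicit, positive) user-item response, so the
# "rating matrix" reduces to a set of (user, resource) pairs.
bookmarks = {
    ("u1", "paper_a"), ("u1", "paper_b"),
    ("u2", "paper_a"), ("u2", "paper_c"),
    ("u3", "paper_c"),
}

def items_of(user, pairs):
    return {r for (u, r) in pairs if u == user}

# Users sharing at least one bookmarked resource with u1 are candidate
# CF neighbours for u1.
target_items = items_of("u1", bookmarks)
neighbours = {u for (u, r) in bookmarks if r in target_items and u != "u1"}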
In STS not only recommendations of items are possible, but also recommendations of tags and users; the latter is the basis for our academic author recommendations. We apply approaches of CF to recommend potential collaboration partners to academic researchers. Hereby we ask whether CF in an STS recommends different results than the established scientometric measurements, author co-citation and bibliographic coupling of authors. In general these measurements are not explicitly used for recommendation, but rather for author and scientific network analysis [54].

2. RELATED WORK
RS can be constructed in many different ways, e.g. choosing the appropriate algorithm, especially for personal recommendation [54], defining user interactions and user models [44], facing criteria like RS accuracy, efficiency and stability [16], and focusing on optimal RS learning models [47]. With the appearance of bookmarking and collaboration services on the Web, several algorithms and hybridizations have been developed [27]. They may differ in the combination of the considered relations between users, items and tags, and in the weights used. Similarity fusion [59], for example, combines user- and item-based filtering (subcategories of CF) and additionally adds ratings of similar items by similar users. Cacheda et al. give an overview of different algorithms and compare the performance of the methods, also proposing a new algorithm which takes account of the users' positive or negative ratings of the items [11]. Bogers and van den Bosch compare three different collaborative filtering algorithms, two item-based and one user-based; the latter outperformed the others over a period of 37 months [8]. But the most evident problem seems to be the cold start, i.e. new items cannot be recommended at the beginning [2]. Said et al. are also concerned with the cold-start problem and the performance of different algorithms within a time span: adding tag similarity measures can improve the quality of item recommendation because tags offer more detailed information about items [50]. Hotho et al. propose FolkRank [27], a graph-based approach similar in idea to PageRank, which can be applied in a system with a folksonomy structure such as a bookmarking service. Here users, tags and resources are the nodes of the graph, and the relations between them become the weighted edges, taking weight spreading into account as PageRank does. In the current approach, similarity based on users and tags within CiteULike is measured separately; using the relations between them, as the FolkRank method does, may lead to better recommendations. However, this method cannot be applied to bibliographic coupling and author co-citation (see Section 3) without modifications.

Several papers investigate expert recommendation mainly for business institutions [45, 46, 62]. Petry et al. developed the expert recommendation system ICARE, which recommends experts within an organization. Its focus lies not on an author's publications and citations, but for example on his organizational level, his availability and his reputation [45]. Reichling and Wulf investigated a recommender system for a European industrial association supporting its knowledge management, preceded by a field study and interviews with the employees. Experts were defined according to their collections of written documents, which were automatically analyzed.
Additionally, a subsequently integrated user profile with information about the employees' backgrounds and jobs is used [46]. Using user profiles in bookmarking services could also be helpful to provide further information about a user's interests and to improve user recommendation, which could be a new research approach worth investigating; however, this approach has serious problems with privacy and data security on the Web. Apart from people recommendation for commercial companies [a.o. 12, 51], other approaches concentrate on Web 2.0 users and academics. Au Yeung et al., using the non-academic bookmarking system Del.icio.us, define an expert user as someone who has high-quality documents in his bookmark collection (many other users with high expertise have them in their collections) and who tends to identify useful documents before other users do (according to the timestamps of bookmarks) [3]. Their SPEAR algorithm is better at finding such experts than the HITS algorithm, which is used for link-structure analysis. In terms of the current approach, the "high-quality documents" in this experiment are the publications of our target author, i.e. a user who has bookmarked one of these publications is important for our user-based recommendation (see Section 3). A weighted approach like that of Au Yeung et al., who weighted a user's bookmarks according to their quality, could also be interesting to test. Blazek focuses on recommending expert sets of articles to a "Domain Novice Researcher", for example a new academic entering a new domain, using a collection of academic documents [7]. A main aspect here is again the cold-start problem: citation analysis can hardly be applied to novice researchers as long as there are no, or only few, references and citations. Therefore, in the current approach only target authors were chosen who have published at least five articles in the last five years. Blazek understands his expert recommendation mainly as a recommendation of relevant documents. Heck and Peters propose using social bookmarking systems for scientific literature, such as BibSonomy, CiteULike and Connotea, to recommend researchers who are unknown to the target researcher but share the same interests and are therefore potential cooperation partners for building CoPs [24]. Users are recommended when they have either common bookmarks or common tags, a method founded on the idea of CF. A condition is that the researcher who should get relevant expert recommendations must be active in the social bookmarking system and add his relevant literature to his online library. In this project, besides the additional comparison of CF against co-citation and bibliographic coupling, we avoid the problem of the "active researcher": we look at the users in CiteULike who have bookmarked our target researcher's publications. The recommendation therefore does not depend on the target scientist himself (i.e. on his bookmarks and assigned tags), but on the bookmarking users and their collaborative filtering. The approach of Cabanac is similar to [24], but he concentrates only on user-similarity networks and relevant articles, not on the recommendation of unknown researchers [10]. He uses the concepts of Ben Jabeur et al. to build a social network for recommending relevant literature [4]. The following entities can be used: co-authorship, authorship, citation, reference, bookmarking, tagging, annotation and friendship.
Additionally, Cabanac adds social clues like the connectivity of researchers and meeting opportunities at scientific conferences. According to him, these social clues lead to a better performance of the recommendation system. Both approaches [4, 10] aim to build a social network that shows the researchers' connectivity to each other. In this project, co-authorship for example is not important, as we try to recommend unknown researchers or academics whom our target author does not have in mind. Zanardi and Capra, proposing a "Social Ranking", calculate similarity between users based on shared tags, and similarity between tag pairs based on the shared bookmarks they describe [63]. The tag similarity is compared with a user's query tag; user and tag similarity are then combined. The results show that user similarity improves accuracy whereas tag similarity improves coverage. Another important aspect of RS is their evaluation. RS should demonstrate not only accuracy and efficiency, but also usefulness for the users [26]. The users' needs must be detected to make the best recommendation. Besides RS evaluation based on models [29], some papers investigate user evaluation [39]. McNee et al. show recommender pitfalls that must be avoided to assure user acceptance and growing usage of recommenders as knowledge management tools. This is also one of our main concerns in this paper, because we want to recommend potential collaboration partners to our target scientists: they have to judge whether the recommended people are useful for their scientific work.

3. MODELING RECOMMENDATION
3.1 Similarity Algorithm
The most common similarity measures in Information Science are Cosine, Dice and Jaccard-Sneath [1, 2, 31, 33, 49, 57]. The last two are similar and come to similar results [17]. Additionally, Hamers et al. proved that, for citation measurements, the cosine coefficient yields values twice as high as the Jaccard coefficient [21]. According to van Eck and Waltman, the most popular similarity measures are the association strength, the cosine, the inclusion index and the Jaccard index [58]. In our comparative experiment we make use of the cosine. Our own experience [23] and results from the literature [49] show that the cosine works well, but in later project steps we want to extend the similarity measures to Dice and Jaccard-Sneath as well.

3.2 Collaborative Filtering Using Bookmarks and Tags in CiteULike
Social bookmarking systems like BibSonomy, Connotea and CiteULike have become very popular [36]: unlike bookmarking systems like Del.icio.us, they focus on academic literature management. The basis for social recommendation is their folksonomies. A folksonomy [38, 43] is defined as a tuple F := (U, T, R, Y), where U, T and R are finite sets whose elements are usernames, tags and resources, and Y is a ternary relation between them, Y ⊆ U × T × R, whose elements are called tag actions or assignments. The tripartite structure allows matching of users, resources or tags which are similar to each other. CF uses data about the users in a system to measure similarity [16]. To get 2-dimensional matrices for applying traditional CF, which is not possible on the ternary relation Y, one can split F into three 2-dimensional subsets: the docsonomy DF := (T, R, Z) where Z ⊆ T × R, the personomy PUT := (U, T, X) where X ⊆ U × T, and the user-resource relation, which we call in our case the personal bookmark list (PBL): PBLUR := (U, R, W) where W ⊆ U × R. In our experimental comparison we want to cluster scientific authors who have similar research interests.
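Before turning to the scientometric side, a toy illustration of the three projections just defined (our sketch, with hypothetical triples):

# Y: the ternary tag-assignment relation, a set of (user, tag, resource) triples.
Y = {
    ("u1", "nanotube", "paper_a"),
    ("u1", "spectroscopy", "paper_a"),
    ("u2", "nanotube", "paper_b"),
}

DF   = {(t, r) for (u, t, r) in Y}   # docsonomy: tag-resource pairs
P_UT = {(u, t) for (u, t, r) in Y}   # personomy: user-tag pairs
PBL  = {(u, r) for (u, t, r) in Y}   # personal bookmark lists: user-resource pairs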
The scientometric analyses are co-citation and bibliographic coupling, which we compare with data from CiteULike using CF. We are therefore not interested in the CiteULike users themselves, but in the tags and bookmarks they connect with our target author, i.e. with the bookmarked papers which our target author published. We set Ra as the set of all bookmarked articles which our target author a published, and Ta as the set of all tags assigned to those articles. To build our database for the author similarity measure we have two possible methods:

1. We search for all users u ∈ U who have at least one article of the target author a in their bookmark list: PBLURa := (U, Ra, W) where W ⊆ U × Ra.
2. We search for all documents to which users assigned the same tags as to the target author's a articles: DFa := (Ta, R, Z) where Z ⊆ Ta × R.

The disadvantage of the first method, in our case, is the small number of users; it can be difficult to rely only on these users for identifying similarity [30]. Therefore we use the second method: resources (here: scientific papers) can be supposed similar if the same tags have been assigned to them. Our assumption is that the authors of these documents are then also similar, because users describe their papers with the same keywords. Tags show topical relations, and authors with thematic relations concerning their research field are potential collaboration partners. Additionally, the more common tags two documents have, the more similar they are. In some cases very general tags like "nanotube" and "spectroscopy" were assigned to our target authors' articles, so we decided to set a minimum number of unique tags a document must have in common with a target author's document:

DFa := (Ta, R, Z) where Z ⊆ Ta × R and, for each r ∈ R, |{t ∈ Ta : (t, r) ∈ Z}| ≥ 2   (1)

On this database we measure author similarity in two different ways: (A) based on common tags t assigned to the authors' documents by users; (B) based on common users u. We use the cosine coefficient as explained above:

a) sim(a, b) := |Ta ∩ Tb| / √(|Ta| · |Tb|)     b) sim(a, b) := |Ua ∩ Ub| / √(|Ua| · |Ub|)   (2)

Note that the latter method leads to different results than applying the first method of database modeling. If we applied the first method, we would find all users who have at least one document of target author a in their bookmark list. With the second method, we get all users who have at least one document in their bookmark list which is similar to any of target author's a articles, i.e. users who bookmarked a document of a may be left out. As we want to apply one unique dataset for the author similarity measure, we do not merge both methods, but measure tag-based and user-based similarity on the dataset described above. Nevertheless, where no tags were available, we chose the first method (see Section 5).

3.3 Author Co-Citation and Bibliographic Coupling of Authors
There are four relations between two authors concerning their publications, references and citations: co-authorship, direct citation, bibliographic coupling of authors and author co-citation. The first two relationships are not appropriate for our problem, for here it is certain that one author knows the other: of course, one knows one's co-authors, and we can assume that an author knows
We aggregate the data from the document level to the author level. Bibliographic coupling of authors means that two authors a and b are linked if they cite the same authors in their references. We mine data about bibliographic coupling of authors by using WoS, for this information service allows searches for “related records”, where the relation is calculated by the number of references a certain document has in common with the source article [13, 56]. Our assumption is: Two authors who have two documents with a high number of same references are more similar than two authors who have a high number of same references in many documents, i.e. the number of same references per document is important. Consider authors a, b and c: if 𝑠𝑖𝑚(𝑎, 𝑏) > 𝑠𝑖𝑚(𝑎, 𝑐) 𝑅𝑒𝑓𝑎 ∩ 𝑅𝑒𝑓𝑏 . 𝐷𝑎 ∪ 𝐷𝑏 > 𝑅𝑒𝑓𝑎 ∩ 𝑅𝑒𝑓𝑐 . 𝐷𝑎 ∪ 𝐷𝑐 (3) (4) where Ref is the set of references of an author and D the set of documents of an author {d ∈ D x Refa}. For example: author a has 6 references in common with author b and c. These 6 common references are found in two unique documents of author a, respectively of author b, but in 6 unique documents of author c, i.e.: 6 6 > (5) 2+6 2+2 Therefore it can be said that authors a and b are similar if there documents have similar reference lists. Our assumption leads to the following dataset model for BC, where we take all authors of related documents with at least n common references with any of the target author’s publications, where n may vary in different cases: BC:= (Refd(a), D, S) where S ⊆ Refd(a x D and {d ∈ D | Refd(a)| ≥ n, n 𝜖 ℕ} (6) where Refd(a) is the number of references in one document d of target author a. Unique authors of the dataset are accomplished; the list of the generated authors of the related documents is cut at m 𝜖 ℕ unique authors (m > 30) because their publications and references for BC have to be analyzed manually in WoS. For these related authors we measure similarity with the cosine (Eq. 2a), where T is substituted with H and Ha is the number of unique references of target author a and Hb the number of references of author b. Author Co-Citation (ACC) [32, 52, 53, 60, 61] means that two authors a and b are linked if they are cited in the same documents. ACC is then measured with cosine (Eq. 2a), where T is substituted with J and Ja is the number of unique citing articles which cite target author a and Jb is the number of unique citing articles which cite author b. To mine the author-co-citation data it is not possible to work with WoS, for in the references section of a bibliographic entry there is only the first author of the cited documents and not, what is needed, a declaration of all authors [65]. Therefore we are going to mine those data from Scopus, for here we can find more than one author of the cited literature. We perform an inclusive all-author co-citation, i.e. two authors are considered co-cited when a paper they co-authored is cited [64]. The dataset is based on the documents which cite at least one of the target author’s articles in Scopus: ACC:= (D, Ca, Q) where Q ⊆ D x Ca with |Q| > 0 (7) where Ca is the set of cited articles of target author a. The list of potential similar authors is cut at m 𝜖 ℕ unique authors (m > 30) because their publications for ACC have to be analyzed manually in Scopus. With regards to the results of the research literature, both methods, BC and ACC in combination, perform best to represent research activities [6, 9, 19]. 
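A compact sketch of the cosine of Eq. (2a) as it is reused for BC and ACC (our illustration; the reference sets are toy data):

from math import sqrt

def cosine(x, y):
    # Eq. (2a): |X ∩ Y| / sqrt(|X| * |Y|)
    if not x or not y:
        return 0.0
    return len(x & y) / sqrt(len(x) * len(y))

# Toy reference sets: bibliographic coupling links authors who cite the same works.
refs = {"a": {"r1", "r2", "r3", "r4"}, "b": {"r1", "r2", "r5"}, "c": {"r4", "r6"}}
ranked = sorted((x for x in refs if x != "a"),
                key=lambda x: cosine(refs["a"], refs[x]), reverse=True)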
Applying the proposed four mined datasets and similarity approaches we can assemble four different sets of potential similar authors, which we call clusters. One cluster is based on BC in WoS, one cluster is based on ACC in Scopus, one cluster is based on common users in CiteULike and one cluster is based on common tags in CiteULike. We can now analyze the authors who are most similar to our target author according to the cosine coefficient and evaluate the results. Additionally based on the mined datasets we can also measure similarity between all authors of a cluster. These results are shown in visualizations, which we call graphs. Therefore for each cluster a visualized graph exists which will also be evaluated. 4. DATASET LIMITATIONS While filtering the information in the three information services different problems arise, which we would point out briefly, because the recommendation results highly depend on the source dataset. In Scopus we detected differences in the metadata: An identical article may appear in different ways, i.e. for example title and authors may be complete in one reference list of an article, but incomplete in a reference list of another article. In our case, several co-authors in the dataset are missed and could not be considered for co-citation. The completeness of co-authorship highly varies: In a random sample, where the co-citation dataset is adjusted with data of the Scopus website, five of 14 authors have a complete coverage, three of them have coverage between 70 and 90 %, five between 55 and 70 % and one author only has coverage of about 33 %. In the information services there is the problem of homonymy concerning author names. Additionally in CiteULike users also misspell author names, which were rechecked for our dataset. The id-number for an author in Scopus is practical for identification, but it may also fail when two or more authors with the same name are allocated to the same research field and change their working place several times. In WoS we don’t have an author-id and it is more difficult to distinct a single person. Therefore we check the filtered author’s document list and if necessary correct it based on the articles’ subject area. 5. EXPERIMENTAL RESULTS We cooperate with physicists of the Forschungszentrum Jülich and worked with 6 researchers so far. For any of the 6 target academic authors (35-50 years old) we build individual clusters with authors who are supposed to be similar to them. We limit source for the dataset modeling to the authors’ publications between 2006-2011 to make recommendations based on the actual research interest of the physicist. To summarize, any scientist got the following four clusters: 1. Based on author co-citation (COCI) in Scopus, 2. Based on bibliographic coupling (BICO) in WoS, 3. Based on common users in CiteULike (CULU) and 4. Based on common tags in CiteULike (CULT). Based on the cosine similarity we are also able to show graphs of all four clusters using the cosine coefficient for similarity measure between all authors (e.g. Fig. 1 and Fig. 2). We applied the software Gephi 1 for the cluster visualization. The nodes (=authornames) are sized according to their connections, the edges are seized according to the cosine weight. Consider that the CiteULike graphs are much bigger because all related authors are taken into account. To get a 1 http://gephi.org/ 20 Figure 1. Extract of a CULT graph, circle = target author, cosine interval 0.99-0.49. Figure 2. 
To get a clear graph arrangement for a better evaluation, we set thresholds based on the cosine coefficient when needed. Additionally, we left out author pairs with a similarity of 1 if they had only one user or tag (in the CiteULike dataset) in common, because this would distort the results. We briefly summarize the interesting answers of part one: as confirmed in our earlier studies [24], most of the physicists work in research teams, i.e. they collaborate in small groups (in general not more than 5 people). The choice of people for possible collaboration highly depends on their research interest: there must be a high thematic overlap. On the other hand, if the overlap is too high, it can be disadvantageous. Some authors who rated a similar author in a cluster as important stated that they would not cooperate with him because he does exactly the same research, i.e. he is important for their own work, but rather a competitor. Another argument against collaboration was too little thematic overlap. Successful collaborations with international institutes are aspired to. In general our interviewees meet new colleagues at conferences and scientific workshops. While modeling the datasets we found that for one of the six authors no users had bookmarked any of his articles in CiteULike. Some articles were found, but they had been added to the system by the CiteULike operators themselves, so the CiteULike clusters could not be modeled for this scientist. One researcher's articles were bookmarked, but not tagged; in this case, we used method 1 in 3.2 to model the dataset. In all four clusters we ranked the similar authors with the cosine. In general the cosine coefficient for BC is very low compared to the one for ACC and the similarity measures in CiteULike. This is because some authors have a lot of references, which lowers similarity. Conversely, similarity is comparatively very high for measurements in CiteULike because the number of users and assigned tags related to the target authors' publications was relatively low.

6. EVALUATION

To validate our experimental results we let our 6 target physicists evaluate the clusters as well as the graphs. The evaluation is divided into three parts. Part one is a semi-structured interview with questions about the scientist's research behavior and the acquisition of relevant literature as well as his working behavior, i.e. is he organized in teams and with whom does he cooperate? These questions should give a picture of the scientist's work and help to interpret the following evaluation results. In the second part the target author has to rank the proposed similar authors according to their relevance. For this purpose the ten top authors of all four measurements are listed in alphabetical order (co-authors eliminated). The interviewee should tell whether he knows the proposed authors, how important these authors are for his research (rating from not important (1) to very important (10)), with whom he would cooperate and which important authors he misses. In part three the author has to evaluate the cluster graphs (rating from 1 to 10) according to the distribution of the authors and the generated groups. Here the questions are: 1. In your individual judgment, does the distribution of the authors reflect reality with respect to the research community and their collaborations? 2. Are there any other important authors you didn't remember before? 3.
Would this graph, i.e. the recommendation of similar authors, help you e.g. to organize a workshop or find collaboration partners?

Figure 3. Coverage of important authors in the recommendation of the Top 20 authors.

Part two of the evaluation is concerned with the similar-author ranking. We analyze all authors an interviewee rated as important with at least a rating of 5, plus all important authors which the researcher additionally added and which were not on the Top 10 list of any cluster. In general our target authors have up to 30 people they consider most important for their recent scientific work. Figure 3 shows the coverage of these important authors for the first 20 ranks based on the cosine (note that author 6 didn't have any publication bookmarked in CiteULike). For example, concerning target author 1: 30 % of the 20 most similar authors of the co-citation cluster (COCI) are rated important. In the bibliographic coupling cluster (BICO) it is 20 %, in the CiteULike cluster based on users (CULU) 30 % and in the CiteULike cluster based on tags (CULT) 25 %. Between the target authors there are great differences. The BICO and COCI clusters can be said to provide the best results, except for authors 1 and 5. The CiteULike clusters are slightly worse, but not in all cases: for author 1, CULU provides the same coverage as COCI, and both CULU and CULT are better than BICO. For author 5 (no CULT because no tags were assigned) CULU has full coverage, which means that all 20 top authors ranked by the cosine are rated important by the target author. Beyond the coverage results shown in Figure 3, it is interesting to look at the important authors who were only found in the CiteULike clusters: e.g. 6 of the 29 important authors of target author 1 appear only in the CiteULike clusters, and the same applies to 5 of the 19 important authors of target author 2. The great differences may also depend on the interviewees' recent research activities: some of the physicists said that they had slightly changed their research interest, so similar authors who were important in the past are not important anymore. One problem with our applied similarity measure may be that it is based on past data, i.e. publications of the last five years, whereas the authors the interviewees marked as important are important for recent research. Had we considered all authors who are or have been important, the results for the clusters would have been better. In the third part of the evaluation the interviewee had to evaluate the graphs. The average cluster relevance ratings (based on six target users) are 5.08 for COCI, 8.7 for BICO, 2.13 for CULU and 5.25 for CULT. Note that only four authors had publications and tags in CiteULike to be analyzed. For author 5, for whom we applied method 1 (see 3.2) in the case of missing tags, no CiteULike graph could be modeled because only one user bookmarked his articles and we measured author similarity only on the number of authors this user had in his literature list. Two authors considered BICO and CULT very relevant and proposed to combine these two to get all important authors and relevant research communities. In BICO and COCI some interviewees missed important authors. Two of the interviewees stated that the authors in BICO and COCI are too obvious to be similar and were interested in bigger graphs with new potentially interesting colleagues.
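The coverage values reported in Figure 3 amount to a simple top-k measure; a sketch with our own, purely illustrative data layout:

```python
def coverage_at_k(ranked_authors, important_authors, k=20):
    """Share of the top-k cosine-ranked similar authors that the
    target author rated as important (rating >= 5)."""
    top_k = ranked_authors[:k]
    return sum(a in important_authors for a in top_k) / len(top_k)

# Target author 1, COCI cluster: 6 of the 20 top-ranked authors
# were rated important, i.e. a coverage of 30 %.
```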
A combined cluster could help them to find researcher groups and partners for cooperation, and it would help to intensify relationships among colleagues. Looking at the graphs, almost all target authors recollected important colleagues who had not come to mind at first, which they found very helpful. They stated that bigger graphs like CULT show more unknown and possibly relevant people. However, to give a clear statement about the similar researchers who were unknown to the target user, the interviewee would have had to look at these researchers' publications. It can be assumed that if an unknown person is clearly connected to a known relevant researcher group, this person does similar relevant work. As the interviewees stated that the distribution of the researchers is shown correctly, it is likely, though not explicitly proven, that the unknown scientists are also allocated correctly within the graph. An important factor for all interviewees is a clear cluster arrangement. A problem which may concern the CiteULike clusters is the sparsity of the dataset, i.e. if only few tags were assigned to one author's publications or only one user bookmarked them, the cluster cannot show clearly distinguishable communities. That was the case with authors 2 and 5. Author 2 gave worse ratings to the CiteULike graphs because they didn't show clear distributions and author groups. Further categorizations of authors, e.g. via tags or author keywords, might help to classify scientists' work.

7. DISCUSSION

In our project we analyzed academic author recommendation based on different author relations in three information services. We combined two classical approaches (co-citation and bibliographic coupling) with collaborative filtering methods. First results and the evaluation show that the combination of different methods leads to the best results. Similarity based on users and assigned tags of an online bookmarking system may complement co-citation and bibliographic coupling. For some target authors more important similar authors were found in CiteULike than in Scopus or WoS. The interviewees confirmed this with the graph relevance ranking. They and other researchers in former studies confirm that there is a need for author recommendation: many physicists don't work by themselves, but in project teams. The cooperation with colleagues of the same research field is essential, and a recommender system could support them. Our paper shows a new approach to recommending relevant collaboration partners to scientific authors. The challenge will be to combine the different similarity approaches. One method is the simple summation of the cosine values. The cumulated cosine values provide better ranking results for some relevant researchers, but they are not satisfactory. Further investigations will be made into a weighted algorithm which considers the results of all four clusters (sketched below). The relations between user- and tag-based similarity in a bookmarking system should also be considered and tested, e.g. with a graph-based approach like FolkRank [27] or expertise analysis (SPEAR) [3]. Besides this, the paper did not study important aspects of a running recommender system such as accuracy and efficiency; research remains to be done in these fields. An issue which may as well be addressed is social network analysis and graph construction.
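One possible form of the weighted combination mentioned above; the weights are free parameters to be tuned, not values determined in the paper, and all names are illustrative:

```python
def combined_score(author, clusters, weights):
    """Weighted sum of the cosine values an author reaches in the
    four clusters; authors missing from a cluster contribute 0."""
    return sum(weights[name] * scores.get(author, 0.0)
               for name, scores in clusters.items())

clusters = {"COCI": {"author_x": 0.41}, "BICO": {"author_x": 0.12},
            "CULU": {"author_x": 0.80}, "CULT": {}}
weights = {"COCI": 0.3, "BICO": 0.3, "CULU": 0.2, "CULT": 0.2}
print(combined_score("author_x", clusters, weights))  # 0.319
```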
8. ACKNOWLEDGMENTS

Tamara Heck is financed by a grant of the Strategische Forschungsfonds of the Heinrich-Heine-University Düsseldorf. We would like to thank Oliver Hanraths for data mining, Stefanie Haustein for valuable discussions and the physicists of Forschungszentrum Jülich for participating in the evaluation.

9. REFERENCES

[1] Ahlgren, P., Jarneving, B. and Rousseau, R. 2003. Requirements for a cocitation similarity measure, with special reference to Pearson's correlation coefficient. Journal of the American Society for Information Science and Technology, 54(6), 550-560.
[2] Ahn, H. J. 2008. A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem. Information Sciences, 178, 37-51.
[3] Au Yeung, C. M., Noll, M., Gibbins, N., Meinel, C. and Shadbolt, N. 2009. On measuring expertise in collaborative tagging systems. Web Science Conference: Society On-Line, 18th-20th March 2009, Athens, Greece.
[4] Ben Jabeur, L., Tamine, L. and Boughanem, M. 2010. A social model for literature access: towards a weighted social network of authors. Proceedings of RIAO '10 International Conference on Adaptivity, Personalization and Fusion of Heterogeneous Information. Paris, France, 32-39.
[5] Berkovsky, S., Kuflik, T. and Ricci, F. 2007. Mediation of user models for enhanced personalization in recommender systems. User Modeling and User-Adapted Interaction, 18, 245-286.
[6] Bichteler, J. and Eaton, E. A. 1980. The combined use of bibliographic coupling and cocitation for document retrieval. Journal of the American Society for Information Science, 31(4), 278-282.
[7] Blazek, R. 2007. Author-Statement Citation Analysis Applied as a Recommender System to Support Non-Domain-Expert Academic Research. Doctoral Dissertation. Fort Lauderdale, FL: Nova Southeastern University.
[8] Bogers, T. and van den Bosch, A. 2008. Recommending scientific articles using CiteULike. Proceedings of the 2008 ACM Conference on Recommender Systems. New York, NY, 287-290.
[9] Boyack, K. W. and Klavans, R. 2010. Co-citation analysis, bibliographic coupling, and direct citation. Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology, 61(12), 2389-2404.
[10] Cabanac, G. 2010. Accuracy of inter-researcher similarity measures based on topical and social clues. Scientometrics, 87(3), 597-620.
[11] Cacheda, F., Carneiro, V., Fernández, D. and Formoso, V. 2011. Comparison of collaborative filtering algorithms: Limitations of current techniques and proposals for scalable, high-performance recommender systems. ACM Transactions on the Web, 5(1), article 2.
[12] Cai, X., Bain, M., Krzywicki, A., Wobcke, W., Kim, Y. S., Compton, P. and Mahidadia, A. 2011. Collaborative filtering for people to people recommendation in social networks. Lecture Notes in Computer Science, 6464, 476-485.
[13] Cawkell, T. 2000. Methods of information retrieval using Web of Science. Pulmonary hypertension as a subject example. Journal of Information Science, 26(1), 66-70.
[14] Cronin, B. 1984. The Citation Process. The Role and Significance of Citations in Scientific Communication. London, UK: Taylor Graham.
[15] Cruz, C. C. P., Motta, C. L. R., Santoro, F. M. and Elia, M. 2009. Applying reputation mechanisms in communities of practice. A case study. Journal of Universal Computer Science, 15(9), 1886-1906.
[16] Desrosiers, C. and Karypis, G. 2011. A comprehensive survey of neighborhood-based recommendation methods (pp. 107-144). In Ricci, F., Rokach, L., Shapira, B. and Kantor, P. B. (Eds.), Recommender Systems Handbook. Springer, NY.
[17] Egghe, L. 2010. Good properties of similarity measures and their complementarity. Journal of the American Society for Information Science and Technology, 61(10), 2151-2160.
[18] Emamy, K. and Cameron, R. 2007. CiteULike. A researcher's bookmarking service. Ariadne, 51.
[19] Gmür, M. 2003. Co-citation analysis and the search for invisible colleges. A methodological evaluation. Scientometrics, 57(1), 27-57.
[20] Goldberg, D., Nichols, D., Oki, B. M. and Terry, D. 1992. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), 61-70.
[21] Hamers, L., Hemeryck, Y., Herweyers, G. and Janssen, M. 1989. Similarity measures in scientometric research: The Jaccard Index versus Salton's cosine formula. Information Processing & Management, 25(3), 315-318.
[22] Haustein, S. and Siebenlist, T. 2011. Applying social bookmarking data to evaluate journal usage. Journal of Informetrics, 5, 446-457.
[23] Heck, T. 2011. A comparison of different user-similarity measures as basis for research and scientific cooperation. Information Science and Social Media International Conference, August 24-26, Åbo/Turku, Finland.
[24] Heck, T. and Peters, I. 2010. Expert recommender systems: Establishing Communities of Practice based on social bookmarking systems. In Proceedings of I-Know 2010, 10th International Conference on Knowledge Management and Knowledge Technologies, 458-464.
[25] Herlocker, J. L., Konstan, J. A., Borchers, A. and Riedl, J. 1999. An algorithmic framework for performing collaborative filtering. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, 230-237.
[26] Herlocker, J. L., Konstan, J. A., Terveen, L. G. and Riedl, J. T. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1), 5-53.
[27] Hotho, A., Jäschke, R., Schmitz, C. and Stumme, G. 2006. Information retrieval in folksonomies: Search and ranking (pp. 411-426). In Sure, Y. and Domingue, J. (Eds.), The Semantic Web: Research and Applications, Lecture Notes in Computer Science 4011, Springer, Heidelberg.
[28] Kessler, M. M. 1963. Bibliographic coupling between scientific papers. American Documentation, 14, 10-25.
[29] Krohn-Grimberghe, A., Nanopoulos, A. and Schmidt-Thieme, L. 2010. A novel multidimensional framework for evaluating recommender systems. In Proceedings of the ACM RecSys 2010 Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces (UCERSTI). New York, NY, ACM.
[30] Lee, D. H. and Brusilovsky, P. 2010a. Social networks and interest similarity. The case of CiteULike. In Proceedings of the 21st ACM Conference on Hypertext & Hypermedia, Toronto, Canada, 151-155.
[31] Lee, D. H. and Brusilovsky, P. 2010b. Using self-defined group activities for improving recommendations in collaborative tagging systems. In Proceedings of the Fourth ACM Conference on Recommender Systems. NY, 221-224.
[32] Leydesdorff, L. 2005. Similarity measures, author cocitation analysis, and information theory. Journal of the American Society for Information Science and Technology, 56(7), 769-772.
[33] Leydesdorff, L. 2008. On the normalization and visualization of author co-citation data. Salton's cosine versus the Jaccard index. Journal of the American Society for Information Science and Technology, 59(1), 77-85.
[34] Li, J., Burnham, J. F., Lemley, T. and Britton, R. M. 2010. Citation analysis. Comparison of Web of Science, Scopus, SciFinder, and Google Scholar. Journal of Electronic Resources in Medical Libraries, 7(3), 196-217.
[35] Liang, H., Xu, Y., Li, Y. and Nayak, R. 2008. Collaborative filtering recommender systems using tag information. ACM International Conference on Web Intelligence and Intelligent Agent Technology. New York, NY, 59-62.
[36] Linde, F. and Stock, W. G. 2011. Information Markets. Berlin, Germany, New York, NY: De Gruyter Saur.
[37] Luo, H., Niu, C., Shen, R. and Ullrich, C. 2008. A collaborative filtering framework based on both local user similarity and global user similarity. Machine Learning, 72(3), 231-245.
[38] Marinho, L. B., Nanopoulos, A., Schmidt-Thieme, L., Jäschke, R., Hotho, A., Stumme, G. and Symeonidis, P. 2011. Social tagging recommender systems (pp. 615-644). In Ricci, F., Rokach, L., Shapira, B. and Kantor, P. B. (Eds.), Recommender Systems Handbook. Springer, NY.
[39] McNee, S. M., Kapoor, N. and Konstan, J. A. 2006. Don't look stupid. Avoiding pitfalls when recommending research papers. In Proceedings of the 20th Anniversary Conference on Computer Supported Cooperative Work. New York, NY, ACM, 171-180.
[40] Meho, L. I. and Rogers, Y. 2008. Citation counting, citation ranking, and h-index of human-computer interaction researchers. A comparison of Scopus and Web of Science. Journal of the American Society for Information Science and Technology, 59(11), 1711-1726.
[41] Meho, L. I. and Sugimoto, C. R. 2009. Assessing the scholarly impact of information studies. A tale of two citation databases – Scopus and Web of Science. Journal of the American Society for Information Science and Technology, 60(12), 2499-2508.
[42] Parra, D. and Brusilovsky, P. 2009. Collaborative filtering for social tagging systems. An experiment with CiteULike. In Proceedings of the Third ACM Conference on Recommender Systems. New York, NY, ACM, 237-240.
[43] Peters, I. 2009. Folksonomies. Indexing and Retrieval in Web 2.0. Berlin, Germany: De Gruyter Saur.
[44] Ramezani, M., Bergman, L., Thompson, R., Burke, R. and Mobasher, B. 2008. Selecting and applying recommendation technology. In Proceedings of the International Workshop on Recommendation and Collaboration, in conjunction with the 2008 International ACM Conference on Intelligent User Interfaces. Gran Canaria, Canary Islands, Spain.
[45] Petry, H., Tedesco, P., Vieira, V. and Salgado, A. C. 2008. ICARE. A context-sensitive expert recommendation system. In The 18th European Conference on Artificial Intelligence, Workshop on Recommender Systems. Patras, Greece, 53-58.
[46] Reichling, T. and Wulf, V. 2009. Expert recommender systems in practice. Evaluating semi-automatic profile generation. In Proceedings of the 27th International Conference on Human Factors in Computing Systems (CHI '09). New York, NY, 59-68.
[47] Rendle, S., Marinho, L. B., Nanopoulos, A. and Schmidt-Thieme, L. 2009. Learning optimal ranking with tensor factorization for tag recommendation. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, 727-736.
[48] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P. and Riedl, J. 1994. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of CSCW '94, ACM Conference on Computer Supported Cooperative Work. New York, NY, ACM, 175-186.
[49] Rorvig, M. 1999. Images of similarity: A visual exploration of optimal similarity metrics and scaling properties of TREC topic-document sets. Journal of the American Society for Information Science and Technology, 50(8), 639-651.
[50] Said, A., Wetzker, R., Umbrath, W. and Hennig, L. 2009. A hybrid PLSA approach for warmer cold start in folksonomy recommendation. In Proceedings of the ACM Conference on Recommender Systems. New York, NY, 87-90.
[51] Sarwar, B., Karypis, G., Konstan, J. and Riedl, J. 2000. Analysis of recommendation algorithms for e-commerce. In Proceedings of the 2nd ACM Conference on Electronic Commerce. New York, NY, 158-167.
[52] Schneider, J. W. and Borlund, P. 2007a. Matrix comparison, Part 1: Motivation and important issues for measuring the resemblance between proximity measures or ordination results. Journal of the American Society for Information Science and Technology, 58(11), 1586-1595.
[53] Schneider, J. W. and Borlund, P. 2007b. Matrix comparison, Part 2: Measuring the resemblance between proximity measures or ordination results by use of the Mantel and Procrustes statistics. Journal of the American Society for Information Science and Technology, 58(11), 1596-1609.
[54] Shepitsen, A., Gemmell, J., Mobasher, B. and Burke, R. 2008. Personalized recommendation in social tagging systems using hierarchical clustering. In Proceedings of the 2008 ACM Conference on Recommender Systems. NY, 259-266.
[55] Small, H. 1973. Cocitation in scientific literature. New measure of relationship between 2 documents. Journal of the American Society for Information Science, 24(4), 265-269.
[56] Stock, W. G. 1999. Web of Science. Ein Netz wissenschaftlicher Informationen – gesponnen aus Fußnoten [Web of Science. A web of scientific information – cocooned from footnotes]. Password, no. 7+8, 21-25.
[57] Van Eck, N. J. and Waltman, L. 2008. Appropriate similarity measures for author co-citation analysis. Journal of the American Society for Information Science and Technology, 59(10), 1653-1661.
[58] Van Eck, N. J. and Waltman, L. 2009. How to normalize cooccurrence data? An analysis of some well-known similarity measures. Journal of the American Society for Information Science and Technology, 60(8), 1635-1651.
[59] Wang, J., de Vries, A. P. and Reinders, M. J. T. 2006. Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. NY, 501-508.
[60] White, H. D. and Griffith, B. C. 1981. Author cocitation. A literature measure of intellectual structure. Journal of the American Society for Information Science, 32(3), 163-171.
[61] White, H. D. and Griffith, B. C. 1982. Authors as markers of intellectual space. Co-citation in studies of science, technology and society. Journal of Documentation, 38(4), 255-272.
[62] Yukawa, T., Kasahara, K., Kita, T. and Kato, T. 2001. An expert recommendation system using concept-based relevance discernment. In Proceedings of ICTAI '01, 13th IEEE International Conference on Tools with Artificial Intelligence. Dallas, Texas, 257-264.
[63] Zanardi, V. and Capra, L. 2008. Social ranking: Uncovering relevant content using tag-based recommender systems. Proceedings of the 2008 ACM Conference on Recommender Systems. New York, NY, 51-58.
[64] Zhao, D. and Strotmann, A. 2007. All-author vs. first-author co-citation analysis of the Information Science field using Scopus. Proceedings of the 70th Annual Meeting of the American Society for Information Science and Technology, 44(1). Milwaukee, Wisconsin, 1-12.
[65] Zhao, D. and Strotmann, A. 2011. Counting first, last, or all authors in citation analysis. A comprehensive comparison in the highly collaborative stem cell research field. Journal of the American Society for Information Science and Technology, 62(4), 654-676.
Group recommendation methods for social network environments

Lara Quijano-Sanchez, Juan A. Recio-Garcia and Belen Diaz-Agudo
Universidad Complutense de Madrid, Spain
[email protected], [email protected], [email protected]

ABSTRACT

Social networks present an opportunity for enhancing the design of social recommender systems, particularly group recommenders. Concepts like personality, tie strength or emotional contagion are social factors that increase the accuracy of the system's prediction when included in a group recommendation model. This paper analyses the inclusion of social factors in current preference aggregation strategies by exploiting the knowledge available in social network environments. The proposed techniques are evaluated in a real application for Facebook that recommends movies for groups of users.

General Terms
Algorithms, Human Factors, Performance

Keywords
Recommender Systems, Social Networks, Personality, Trust

1. INTRODUCTION

Recommender systems are born from the necessity of having some kind of guidance when searching through complex product spaces. More precisely, group recommenders are built to help groups of people decide on a common activity or item. Nowadays, social networks are commonly used to organize events and activities for groups of users. Therefore, they are an ideal environment for exploiting recommendation techniques. Moreover, we can use the various relationships captured in these communities (trust, confidence, tie strength, etc.) in new ways, by incorporating better indicators of relationship information. There is a proliferation of recommender systems that cope with the challenge of addressing recommendations for groups of users in different domains, like MusicFX [12], FlyTrap [2] or LET'S BROWSE [9] among many. What all these recommenders have in common is that the group recommendations take the personal preferences obtained from their users into account, but they consider each user equal to the others. The recommendation is not influenced by their personality or the way each one behaves in a group when joining a decision-making process. In our approach we propose to study how people interact depending on their personality or their closeness in order to improve group recommendations. Our previous work [16, 15, 17] studies how to measure the personality of users and the trust between them. It also proposes several recommendation methods that incorporate both factors. Group recommendation approaches are typically based on generating an aggregated preference from the users' individual preferences. As stated in [6], the main approaches to generate the preference aggregation are (a) merging the recommendations made for individuals, (b) aggregation of ratings for individuals and (c) constructing a group preference model. Masthoff [10] presents a compilation of the most important preference aggregation techniques. These basic approaches merge the ratings predicted individually for each item to calculate a global prediction for the group.
The selection of a proper aggregation strategy is a key element in the success of recommendations. The main contribution of this paper is a comparative analysis of these strategies applied to our social-enhanced recommendation methods. This study will help us choose the best aggregation strategy for our group recommendation approach. The second goal of this paper is to illustrate the potential of the proposed social recommendation techniques, to be able to evaluate them in a real environment and to make them reachable to our users. To do so, we have developed HappyMovie, a movie recommendation application for Facebook. Our application applies the methods presented in this paper to help groups of users decide which is the best movie to watch together. Our system takes advantage of the popularity among users of organizing events through social networks. It is becoming very common that someone proposes an activity and invites several friends to the event using Facebook or any other online community. HappyMovie goes one step beyond this and guides groups of friends in deciding on an activity to perform together (in our case, selecting a proper movie for the group). Summing up, in this paper we present a group recommender application embedded in a social network that allows us to study and improve (as we will later conclude) the performance of different aggregation techniques when using them in our recommendation method based on personality and trust factors. Section 2 introduces our Facebook application HappyMovie. Our social-enhanced group recommendation method is explained in Section 3. We present the case study in the movie recommendation domain and the experimental evaluation of the presented methods in Section 4. Finally, Section 5 concludes the paper.

2. FACEBOOK'S GROUP RECOMMENDATION APPLICATION: HAPPYMOVIE

Figure 1: HappyMovie Main Page
Figure 2: Personality test in HappyMovie

HappyMovie is a Facebook application where we provide a movie recommendation for a group of people planning to go to the cinema together. The application works like Facebook's event application, where the event is going to the cinema. The application recommends to the attending users a movie from the current movie listing of a selected place. In HappyMovie users can create new events, invite their Facebook friends to any of their events and remove themselves from them. When a user starts the application, as shown in Figure 1, she can (note that activities 1 and 2 must be performed before any of the others):

1. Answer a personality test: In previous works we have studied the different behaviours that people show in conflict situations according to their personality. In [16, 15] we presented a group recommendation method that distinguishes different types of individuals regarding their personality. There are different approaches that can be used to obtain the different roles that people play when interacting in a decision-making process, for example the TKI test [19]. We have used a movie metaphor that consists of displaying two movie characters with opposite personalities for five possible personality aspects. We have shown in [17] that with this test we are able to obtain, in a less tedious way, the personality measures that the TKI test obtains, with comparable results in the recommendations. In our test, one character represents the essential characteristics of the personality feature, while the other one represents the opposite ones.
What the user has to do is to choose which of each pair of characters she identifies with more, by simply moving an arrow that indicates the degree of conformity, as shown in Figure 2. Note that additional information about the characters and the category they represent is provided to our users.

2. Perform a movie preferences test: This test is shown in Figure 3. This information about the user is required to predict the ratings for the movies to be recommended. Our group recommendation strategies combine individual recommendations to find a movie suitable for all the users in the group. This individual recommender estimates the ratings that a user would assign to each product in the catalogue. It is built using the jCOLIBRI framework [3] and follows a content-based approach [14]. To do so, it assigns an estimated rating to the products to be recommended based on an average of the ratings given by the user in the preference test to the most similar products. In our case, it returns an average of the three most similar rated items.

Figure 3: Preferences test in HappyMovie

3. Create a new event: Figure 4 shows how this option is presented. To create an event, users must establish the place, the deadline for users to join the event and the date when the event will take place. Invited users receive a notification of the event and are able to accept or reject it. Once the event has been created, any attending user can see the date and place of the event and a proposal of three movies, which are the best ones that our group recommender has found for the current group of attending users.

4. Access events the user is already attending: When a user enters an event, the application calculates the trust between the user and all the other users that have joined the event so far. Current research has pointed out that people tend to rely more on recommendations from people they trust (friends) than on recommendations based on anonymous ratings [18]. This social element is even more important when we are performing a group recommendation where users have to decide on an item for the whole group. This kind of recommendation usually follows an argumentation process, where each user defends her preferences and rebuts others' opinions. Here, when users must change their mind to reach a common decision, the trust between users is the major factor. Note that trust is also related to tie strength; previous works have reported that the two are conceptually different but correlated [8]. The calculation of trust is what benefits most from embedding the application in a social network. With a standalone application, the task of obtaining the data required to compute the trust between users is very tedious. Now, we are able to calculate the trust between users by extracting the specific information from each of their profiles in the social network. Users in Facebook can post on their profiles a huge amount of personal information that can be analysed to compute the trust with other users: distance in the social network, number of shared comments, likes and interests, personal information, pictures, games, duration of the friendship, etc. [5, 4]. We analyse 10 different trust factors comparing the information stored in their Facebook profiles. Next, these factors are combined using a weighted average. A detailed explanation of the trust factors obtained from Facebook and the combination process is provided in [16].
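Two of the building blocks just described can be sketched as follows: the content-based individual prediction (average of the three most similar rated items) and the weighted average over trust factors. The data layout and the similarity argument are our illustrative assumptions, not the paper's actual code:

```python
def predict_rating(item, rated, similarity, k=3):
    """Content-based estimate: mean rating of the k rated items most
    similar to the target item (k=3 in HappyMovie)."""
    top = sorted(rated, key=lambda r: similarity(item, r["item"]),
                 reverse=True)[:k]
    return sum(r["rating"] for r in top) / len(top)

def trust(factors, weights):
    """Weighted average of the trust factors (10 in the paper) mined
    from two users' Facebook profiles, each scaled to [0, 1]."""
    return sum(w * f for w, f in zip(weights, factors)) / sum(weights)
```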
When the event is created, the application looks up the current movie listing of the selected city and provides a list of 3 movies, which represent the best 3 movies that the recommender has found in the movie listing for the users that have joined the event so far; this is shown in Figure 5. This list is automatically updated every time a user joins or leaves the event. This process continues until the last possible day to join or leave the event, the deadline date. From that date until the date when the event takes place, users can vote on the 3 final proposed movies. When the final date arrives, the votes are counted and the most voted movie is the one presented.

Figure 4: How to create an activity in HappyMovie
Figure 5: How events look in HappyMovie

3. GROUP RECOMMENDATION METHOD

Our group recommendation method is based on the typical preference aggregation approaches plus personality and social factors. The novelty presented in this paper is the evaluation of the different existing aggregation approaches when using them with our group recommendation method. These approaches [11, 13] aggregate the users' individual predicted ratings pred(u, i) to obtain an estimation gpred(G, i) for the group. Then the item with the highest group predicted score is proposed:

gpred(G, i) = ⊕_{∀u∈G} pred(u, i)   (1)

Here G is a group of users, to which user u belongs, and ⊕ is an aggregation function. This function provides an aggregated value that predicts the group preference for a given item i. By using this estimation, our group recommender proposes the set of k items with the highest group predicted score. This is what we will later refer to as a "Standard recommender". In our proposal, we modify the individual ratings with the personality and trust factors. This way, we modify the impact of the individual preferences as shown in Equation 2:

gpred(G, i) = ⊕_{∀u∈G} pred′(u, i),  with  pred′(u, i) = ⊕_{∀v∈G} f(pred(u, i), p_u, t_{u,v})   (2)

where gpred(G, i) is the group rating prediction for a given item i, pred(u, i) is the original individual prediction for user u and item i, p_u is the personality value for user u and t_{u,v} is the trust value between users u and v. There are several ways to modify the predicted rating for a user according to the personality and trust factors. The one that has proven most efficient in our previous experiments [16, 15] is the delegation-based method. The idea behind this method is that users form their opinions based on the opinions of their friends. The delegation-based method tries to simulate the following behaviour: when we are deciding which item to choose within a group of users, we ask the people we trust. Moreover, we also take their personality into account to give a certain importance to their opinions (for example, because we know that a selfish person may get angry if we do not choose her preferred item). The delegation-based rating dbr(u, i) for a user u and an item i is computed as:

dbr(u, i) = (1 / |Σ_{v∈G} t_{u,v}|) · Σ_{v∈G ∧ v≠u} t_{u,v} · (pred(v, i) + p_v)   (3)

In this formula, we take into account the recommendation pred(v, i) of every friend v for item i. This rating is increased or decreased depending on her personality (p_v), and finally it is weighted according to the level of trust (t_{u,v}). Note that this formula is not normalized by the group size and uses the accumulated personality. Therefore, it can return a value outside the ratings range. As we are only interested in producing a final ordered list of the user's preferences over the products of a given catalogue, it is not necessary to normalize the results of our formula.
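A direct transcription of Eq. 3 as a sketch; the containers are our own, and self-trust t_{u,u} is assumed to be absent:

```python
def dbr(u, item, group, pred, personality, trust):
    """Delegation-based rating (Eq. 3): trust-weighted, personality-
    shifted opinions of the other group members."""
    norm = abs(sum(trust[u][v] for v in group if v != u))
    opinions = sum(trust[u][v] * (pred(v, item) + personality[v])
                   for v in group if v != u)
    return opinions / norm
```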
Next, we explain the aggregation functions that can be applied to combine the individual estimations.

3.1 Aggregation Functions

A wide set of aggregation functions has been devised for combining individual preferences [10], with average satisfaction and least misery being the most commonly used. In our previous research we only evaluated the performance of our approach using social factors with the average satisfaction function. As said before, choosing the aggregation function that performs best is a key element in providing good recommendations. Here we explain the ones we have studied for our group recommendation method (a compact sketch of several of them follows the list).

• Average Satisfaction: Refers to the common arithmetic mean, a method to derive the central tendency of a sample space. It computes the average of the predicted ratings of the members of the group:

gpred(G, i) = (1 / |G|) Σ_{u∈G} pred′(u, i)   (4)

where pred′(u, i) is the predicted rating for each user u and item i, and gpred(G, i) is the final rating of item i for the group.

• Borda Count (Borda, 1971): The Borda count is a single-winner election method in which users rank candidates in order of preference. It determines the winner of an election by giving each candidate a certain number of points corresponding to the position in which she is ranked by each voter. Once all votes have been counted, the candidate with the most points is the winner. Because it sometimes elects broadly acceptable candidates, rather than those preferred by the majority, the Borda count is often described as a consensus-based electoral system rather than a majority-based one. To obtain the group preference ordering, the points awarded by the individuals are added up:

gpred(G, i) = Σ_{u∈G} bs(u, i),  bs(u, i) = pos(pred′(u, i), OL),  OL = {pred′(u, i₁), ..., pred′(u, iₙ)} where pred′(u, i_p) ≤ pred′(u, i_{p+1})   (5)

where bs(u, i) is the Borda score assigned to each item rated by user u. It is obtained as the position of the estimated rating for item i in the ordered list OL of the ratings predicted for all the items. A problem arises when an individual has multiple alternatives with the same rating; we have decided to distribute the points.

• Copeland Rule (Copeland, 1951): Ranks the alternatives according to the difference between the number of alternatives they beat and the number of alternatives they lose against. It is a good procedure to overcome problems resulting from voting cycles [7]:

gpred(G, i) = Σ_{u∈G} cs(u, i)   (6)

where cs(u, i) = 1 if wins(u, i) > losses(u, i), −1 if wins(u, i) < losses(u, i), and 0 in any other case, with wins(u, i) = |{j ≠ i : pred′(u, i) > pred′(u, j)}| and losses(u, i) = |{j ≠ i : pred′(u, i) < pred′(u, j)}|.

• Approval Voting: A single-winner voting system used for elections. Each voter may vote for (approve of) as many candidates as she wishes, and the winner is the candidate receiving the most votes. Users vote for all alternatives with a rating higher than a certain threshold δ, as this means voting for all alternatives they like at least a little bit:

gpred(G, i) = Σ_{u∈G} as(u, i),  as(u, i) = 1 if pred′(u, i) ≥ δ, 0 in any other case   (7)

• Least Misery: This strategy follows the idea that, even if the average satisfaction is high, a solution that leaves one or more members very dissatisfied is likely to be considered undesirable. It considers that a group is as happy as its least happy member: the final rating of each item is the minimum of the individual ratings. A disadvantage is that if the majority really likes one item, but one person does not, it will never be chosen:

gpred(G, i) = min_{u∈G} pred′(u, i)   (8)

• Most Pleasure: The opposite of the previous strategy; it chooses the highest rating for each item to form the final list of predicted ratings:

gpred(G, i) = max_{u∈G} pred′(u, i)   (9)

• Average Without Misery: Takes the average of the individual ratings, but preferences below a certain threshold δ are not considered:

gpred(G, i) = (1 / |G|) Σ_{u∈G | pred′(u,i) > δ} pred′(u, i)   (10)
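A compact sketch of several of the strategies above; `ratings` maps each group member to her modified prediction pred′(u, i) for one item, and the thresholds are illustrative defaults, not values from the paper:

```python
def average_satisfaction(ratings):
    return sum(ratings.values()) / len(ratings)            # Eq. 4

def least_misery(ratings):
    return min(ratings.values())                           # Eq. 8

def most_pleasure(ratings):
    return max(ratings.values())                           # Eq. 9

def approval_voting(ratings, delta=3.0):
    return sum(r >= delta for r in ratings.values())       # Eq. 7

def average_without_misery(ratings, delta=2.0):
    # Eq. 10: average over the whole group, ignoring ratings <= delta.
    return sum(r for r in ratings.values() if r > delta) / len(ratings)
```

Borda and Copeland additionally need each user's full ranking over the catalogue, so they operate on all items at once rather than on a single item's ratings.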
In the next section we study how the different aggregation functions influence the results of the group recommendation, and we also prove the validity of our method.

4. EXPERIMENTAL EVALUATION

We have run an experiment in the movie recommendation domain. In it we have been able to benefit from the advantages of having the group recommendation application, HappyMovie, embedded in a social network. We have tested our theories with groups of real users and also with synthetically generated users. The reason for using synthetically generated users, besides the real users, is that we wanted to have control over the data distribution, which is not possible when using real data. This synthetic data set lets us explore every group composition and personality distribution within the group. It also lets us reproduce the behaviour of large groups that are very difficult to organize in experiments with real users. With both the synthetic and real data we are able to explore the realistic and the purely combinational distributions. After obtaining the necessary data to perform our case study, we have implemented two versions of our system: a standard recommender that only aggregates preferences (the baseline explained in Equation 1, which we will refer to from now on as "Base recommender") and another one that implements our method, the delegation-based method (we will refer to it as "Delegation Method recommender"). During the experiment we have compared the results obtained with these two recommenders and the different aggregation functions, for the real and synthetic data. Next we describe the details of the experiment.

• Experimental set-up of the real data: As we want to evaluate the real performance of our group recommendation method, we test our Facebook application HappyMovie. From it we obtain the real user data used in this first experiment. To do so we create different events in the social network as explained in Section 2. In these events we ask volunteers to complete some tests that let us obtain two of the factors used by our system: personality and personal preferences. The demographic data about our participants (mean age, gender, etc.) is quite varied because they have been selected among our friends, family and students. We finally have 58 users. The tests are the ones presented in Section 2 (Figures 2 and 3) that let us obtain the personality and individual preferences of each user. Note that trust is obtained by analysing Facebook profiles. In order to make a good recommendation it is necessary to have accurate information about users' individual preferences.
To obtain this, we have asked our users to rate at least 20 movies in the personal preferences test (we determined that 20 ratings is the minimum needed to have a representative profile of the user's preferences). On average, the users have rated 30 movies from the list of 50 movies to rate. Now we have all the information required to build the individual profile of our users, which is necessary for our recommendation method. This profile is based, as explained before, on three different aspects: personality, individual preferences and trust with other users. We need a way to measure the accuracy of the group recommendation. To be able to compare our results with what would happen in real life, we brought our users together in person and asked them to mix in several different ways and simulate that they are going to the cinema together, forming different groups that would actually occur in reality. We provide them 15 movies that represent the current movie listing and we ask them to choose which 3 movies they would actually watch together. We have chosen to ask for just 3 movies because real users are only interested in a few movies they really want to watch. And knowing that a movie listing normally consists of no more than 15 movies, we have considered that 3 would be the maximum number of movies that a user would really be interested in watching at a time. Later, users created events in HappyMovie and joined them with the same configuration as they did in reality; we managed to gather 15 groups (which means 15 events in the application) of 9 (4 groups), 5 (6 groups) or 3 (5 groups) members. The three movies that each group chooses are stored as the real group favorites set (rgf). Our goal is to evaluate the accuracy of our recommender by comparing the set of the 3 final proposed movies in each event, as shown in Figure 5 (the pgf set), with the real preferences rgf. The more the predicted movie set resembles the real one, the better the results our application is providing. The evaluation metrics applied to compare both sets are explained in Section 4.1.

• Experimental set-up of the synthetic data: We have performed a second experiment, where we have simulated the behaviour of 100 people. In it we assign our synthetic users a random personality value. We define five different types of personality according to the range provided by the TKI [19] and our movie-metaphor personality test: very selfish, selfish, tolerant, cooperative and very cooperative. For example, if we consider a very selfish person, her personality value must be contained in the range [0.8, 1.0]. When we analyzed the range of the personality values of the real users, some of these ranges were unexplored, because with smaller samples (58) and a non-fixed group composition not all possible situations appear. As we wanted to study all possible behaviours, we decided to use the synthetic data. We must note that the validity of experimenting with these synthetically generated users has already been proven in our previous studies [16]. To study the effects of the different types of personalities we generate 20 users for each type of personality. We group users in sets of 3, 5, 10, 15, 20 and 40 people. For each group size we select the components of the group so that the personality distributions contain all the possible combinations: groups of very selfish, selfish, tolerant, cooperative, very cooperative, very selfish & very cooperative, very selfish & tolerant, ...
and so on until we reach 13 possible combinations and groups for each size. In the end we have 76 groups (13 different distributions for each size, except for the 40-people groups, where we only had 11 combinations due to the resemblance of personalities in such big groups). The next step in our experiment is to obtain the real group favorites (rgf) so that we can measure the recommendation accuracy. We have to simulate the individual preferences test, so that we know which movies each of our users would have chosen individually (from the same cinema movie listing that we proposed to our real users). Afterwards we also have to determine which of those movies the group as a whole would have finally decided to watch. To do so, we use the descriptions of the movies to predict which movies each user likes. We have given each of our synthetically generated users a random profile. These profiles are constructed according to the typical movie preferences of real-life people with respect to their age, sex and interests, studying the MovieLens data set [1]. For example, a user with a childish profile would give very high ratings to animation, children's or musical movies and very low ratings to drama, horror, documentaries, etc. From these profiles we are able to predict the individual likes of our users. We have selected the same list of 50 heterogeneous movies from the preferences test that HappyMovie offers (see Figure 3) and we rated them for each user according to their profile. Afterwards, with a content-based recommender, we rate and order the cinema listing by preference, comparing the items with the ones in the simulated preferences. We chose the top 3 and marked them as the real individual favorites (rif). Secondly, we need to obtain the decision of the group. Now that we know which movies the individual users would argue for, we reproduce a real-life situation where everyone discusses their preferences, taking into account the personalities and the friendship between them, and we finally obtain the real favorite movies for the group, rgf. We use this information to evaluate the accuracy of our recommender by comparing how many of the first 3 recommended movies in the predicted group favorites (pgf) belong to the rgf set of that group.

4.1 Evaluation Metrics

As we need an evaluation function to measure the accuracy of our method, we have studied several aspects before deciding what to take into account in the evaluation process of our experiment. We need to compare the results of our recommender system to the real preferences of the users (that is, what would happen in a real-life situation). We must also note: 1) The number of estimated movies that we take into account: long lists of ordered items are of no use in this scenario. Real users are only interested in a few movies they really want to watch. This fact rules out several evaluation metrics that compare the ordering of the items in the real list of favourite movies and the estimated one (MAE, nDCG, etc.). 2) The number of relevant and retrieved items in our system is fixed: therefore, we cannot use general measures like recall or precision. However, there are some metrics used in the Information Extraction field [20] that limit the retrieved set. This is the case of the precision@n measure, which computes the precision after n items have been retrieved.
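The cut-off measures used below can be written in a few lines; the set-based formulation is our sketch, not the paper's code:

```python
def precision_at_n(predicted, relevant, n=3):
    """Fraction of the first n predictions that are relevant."""
    return len(set(predicted[:n]) & set(relevant)) / n

def success_at_n(predicted, relevant, n=3, hits=1):
    """1 if at least `hits` of the first n predictions are relevant;
    hits=2 yields the 2success@n variant used in Section 4.2."""
    return int(len(set(predicted[:n]) & set(relevant)) >= hits)

# pgf vs. rgf with |rgf| = 3: one of three proposals is a hit.
print(precision_at_n(["m1", "m2", "m3"], ["m2", "m4", "m5"]))  # 0.33
```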
We have decided to use precision@3 to evaluate how many of the movies in pgf are in the rgf set (note that |rgf| = 3). This kind of evaluation can be seen from a different point of view: we are usually interested in having at least one of the movies from pgf in the rgf set. This measure is called success@n and returns 1 if there is at least one hit in the first n positions. Therefore, we can use success@3 to evaluate our system by computing the rate of recommendations where we have at least one hit in the real group favorites list. For example, an accuracy of 90% using success@3 means that the recommender suggests at least one correct movie for 90% of the evaluated groups. In fact, success@3 is equivalent to having precision@3 > 1/3, when considering retrievals individually before computing the average. We can also define a 2success@3 metric (equivalent to precision@3 > 2/3) that represents how many times the estimated favorites list pgf contains at least two movies from rgf. Obviously, it is much more difficult to achieve high results using this second measure.

4.2 Results

This section describes the results obtained in the two experiments: using real vs. synthetic data. As explained before, we have built two recommenders, the standard recommender ("Base recommender") and the one using the delegation-based method ("Delegation Method recommender"), both explained in Section 3. With them we have tested the 7 different types of aggregation functions explained in Section 3.1.

Figure 6: Results for the real data and the synthetic data using our method for all the different merging functions.

Figure 6 shows the results for the real and the synthetic data using the recommender that implements the delegation-based method, for all the different merging functions. We can see that the best two merging functions are the average satisfaction function for the real data and least misery for the synthetic data. We have also analysed the improvement of our method in comparison with the base recommender for all the different aggregation functions. This improvement is shown in Figures 7 and 8. From them we conclude that our method improves the base recommender for the real data by 10% for the success@3 measure and by 12% for the 2success@3 measure. For the synthetic data, the recommender that implements our method improves the results by 16% for the success@3 measure and by 7% for the 2success@3 measure. With these results we are able to conclude that our method significantly improves on the base recommender. In Figures 9 and 10 we have studied the results of the recommender that implements our method with the real and synthetic data when varying the group size. We have performed the analysis for all the different aggregation functions. We can see that while some aggregation functions like average satisfaction give better results for small groups (we consider groups of size 10 or less small), others like least misery or most pleasure work the other way round and give better results for big groups. In this particular case we can clearly see the necessity of using our synthetically generated data, because with the real data we only had 3 different group sizes and the results were not significant. This is why, on average, as we can see in Figure 6, the best function for the real data is average satisfaction (recall that the group sizes in the real data are 9, 5 & 3, all of them considered small groups).
On the other hand, on average, the best function for the synthetic data is least misery, which makes sense as the results in Figure 10 show that this function works better for big groups and on average the synthetic data groups are big (with groups of 15, 20 & 40). Figures 9 and 10 reflect a similar behaviour as the group size grows. This difference in the results of the different aggregation functions when varying the group size opens the possibility of improving our method with an adaptive group recommender, where the recommendation algorithm adapts itself to the personality distribution in the group, its size and other characteristics. Figures 9 and 10 also show that for the groups that are equal in size and therefore comparable (3, 5 & 9), results between synthetic and real data for all the different merging functions differ on average by no more than 0.11. So we show that our synthetically generated data is valid and provides consistent results.

Figure 7: Comparison of the results for the real data with and without our method for all the different merging functions.
Figure 8: Comparison of the results for the synthetic data with and without our method for all the different merging functions.
Figure 9: Comparison of the results of all the different merging functions when varying the group size for the real data.
Figure 10: Comparison of the results of all the different merging functions when varying the group size for the synthetic data.

5. CONCLUSIONS

This paper introduces HappyMovie, a movie recommender system for groups that models the social side of users to provide better recommendations. It is integrated in Facebook to infer the social behaviours within the group. It is clear that groups have an influence on individuals when reaching a common decision. This is commonly referred to as emotional contagion. This contagion is usually proportional to the trust between individuals, as closer friends have a higher influence. Therefore our system analyses users' interactions (interchanged messages, shared photos, friends in common, ...) to calculate this social factor. However, the influence of the group also depends on the degree of conformity of the individual. The degree of conformity is counteracted by the individual's behaviour when facing a conflict situation. Here, the personality influences the acceptance of others' proposals. Our model also includes the personality by asking the users to answer a short personality test. Both variables, personality and trust, are used to estimate the preference of each user for a given item. To do so, we modify the ratings estimated by a standard content-based recommender in our delegation-based method. This method models the effect of emotional contagion and obtains an estimation that is based on the estimations of other users with a close relationship. This closeness is inferred according to the trust between users. Finally, the personality variable represents the degree of conformity with the items preferred by these closer individuals. Individual predictions must be combined to obtain a global recommendation for the group. Several aggregation strategies have been proposed to obtain this final value: average satisfaction, least misery, approval voting, etc. We have evaluated these strategies applied to our social-enhanced recommendation method with real and synthetically generated users.
Results show that average satisfaction (computing the average estimated rating) and least misery are the best options, and that our method improves the accuracy of standard approaches by 12%.
6. ACKNOWLEDGMENTS
Supported by the Spanish Ministry of Science & Education (TIN2009-13692-C03-03) and the Madrid Education Council and UCM (Group 910494). We also thank the friends and students who participated in the experiment.
Do You Feel How I Feel? An Affective Interface in Social Group Recommender Systems
Yu Chen and Pearl Pu
Human Computer Interaction Group, Swiss Federal Institute of Technology, CH-1015 Lausanne, Switzerland
[email protected], [email protected]
ABSTRACT
Group and social recommender systems aim to recommend items of interest to a group or a community of people. The user issues in such systems cannot be addressed by examining the satisfaction of their members as individuals; rather, group satisfaction should be studied as a result of the interaction and interface methods that support group awareness and interaction. In this paper, we introduce the Affective Color Tagging Interface (ACTI), which supports emotional tagging and feedback within a social group in pursuit of an affective recommender system. We apply ACTI to GroupFun, a music social group recommender system, and report the results of a field study, in particular how the social relationships within a group influence users' acceptance of and attitudes towards ACTI.
Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: User Interfaces - Graphical user interfaces (GUI), User-centered design. H.5.3 [Information Interfaces and Presentation]: Group and Organization Interfaces - Organizational design, Web-based interaction
General Terms
Design, Human Factors
Keywords
Group and Social Recommender Systems, Interface Design, Interaction Design, Affective Interface, Emotional Contagion
1. INTRODUCTION
With the proliferation of social networks, social groups have extended their meaning from families, friends and colleagues to people who share the same interests or experiences in online communities, or people who are socially connected.
The field trial version of Google+ allows users to define social "circles" of connections, share "sparks" they find interesting and join unplanned "hangouts"; interest groups on Last.fm are formed by members who support the same singers or music bands; and members of LinkedIn groups are usually people who share a similar academic, industrial or technical background. Meanwhile, we have learned from theories in psychology that multimedia files such as music, video and pictures evoke emotion. Purely numeric ratings are not sufficient for users to accurately provide feedback, and attempts have been made to recommend music by other means. Hu and Pu [3] have shown that personality quizzes reveal more hidden aspects of user preferences and enhance recommendation accuracy. Musicovery [2] has developed an interactive interface for users to select a music category based on their mood. However, in group and social recommender systems, the affective state of the group is more than the sum of that of its members: users' emotions can be influenced not only by the recommended items, e.g., music, but also by the emotions of others. Group formation and characteristics have been investigated as a premise for studying social group recommender systems [4].
In this paper, we describe the results of an empirical study of an affective interface for providing feedback in a social group recommender system. Our goal is to establish a basic understanding of this area, with a particular focus on the following two questions: 1) How do users like an affective user interface in a social group environment? 2) How does social relationship influence user behavior and attitude?
To answer these questions, we implemented GroupFun, a music recommender system that suggests music playlists to groups of people. We conducted a field study in which users indicate their preferences by rating music; we then invited them to evaluate the experimental Affective Color Tagging Interface (ACTI), followed by interview questions investigating their attitudes towards ACTI.
The next section discusses existing work and how it relates to ours. This is followed by descriptions of the functionalities and interface of GroupFun and the design of ACTI in Section 3. Section 4 describes the hypotheses and procedure of a pilot study. After reporting the study results in Section 5, the paper concludes with limitations and future work in Section 6.
2. RELATED WORK
2.1 Emotion in Recommender Systems
The main goal of studying recommender systems is to improve user satisfaction. However, satisfaction is a highly subjective metric. Masthoff and Gatt [5] have considered satisfaction as an affective state or mood, based on the following aspects of social and psychological theories: 1) mood impacts judgement; 2) retrospective feelings can differ from feelings experienced; 3) expectation can influence emotion; and 4) emotions wear off over time. However, they did not propose any feasible methods for applying these psychological theories.
Musicovery is a typical example of a website recommending music by user-selected mood. Musicovery classifies mood along two dimensions, dark-positive and energetic-calm, and uses a highly interactive interface for users to experience different emotion categories and their corresponding music. However, such a recommender does not directly apply in a social group environment, as individual emotions differ from one another.
2.2 Emotional Contagion
Masthoff and Gatt [5] also showed that in group recommender systems, members' emotions can influence each other, a phenomenon called emotional contagion. Hancock et al. [6] investigated emotional contagion and showed that emotions can be sensed in text-based computer-mediated communications; more significantly, they showed that emotional contagion occurs between partners. Sy et al. [7] carried out a large-scale user study involving 189 users forming 56 groups, and showed that leaders transfer their moods to group members and that leaders' moods impact the effort and the coordination of groups. However, to the best of our knowledge, the implementation of features related to emotional contagion in group recommender systems is lacking.
2.3 Group Relationships
Emotional contagion depends on the relationships of group members. Masthoff defined and distinguished different types of relationships. In a communal sharing relationship (e.g., friends), group members are more likely to care about each other, and their emotions are more likely to influence each other: if your friend feels happy, you are likely to feel happy, and if your friend feels sad, you are likely to feel sad. In an equality matching relationship (e.g., strangers), members are less likely to be influenced by others. Such differences caused by relationships have not been verified experimentally. In our work, we compare how two such groups differ in their attitude towards an affective interface in a group environment.
3. PROTOTYPE SYSTEM
We have developed a music group recommender system named GroupFun (http://apps.facebook.com/groupfun/), a Facebook application that allows groups of users to share music for events such as graduation parties. The functions of GroupFun mainly include 1) group management and 2) music playlist recommendation. Users are able to create a group, invite their friends to the group, and join other groups. Each user rates uploaded songs (the algorithms for recommending playlists and songs are introduced in another paper). The ratings follow a 5-point scale, as shown in Figure 1.
Figure 1. Screenshot of original rating interface
We further designed the Affective Color Tagging Interface (ACTI) as an additional widget that allows users to tag the emotions evoked by the music they have listened to. Different from Musicovery, which recommends music based on user mood, ACTI uses emotions as an explanation, feedback and re-communication channel. In a group environment, each user usually actively tries to persuade the other users to adopt his/her own preferences. In previous user studies (reported in another paper), we observed that users would not use free text for explanations because of the effort involved. Users also commented that their preferences for music differ with context, and contexts usually correspond to emotions in music: for instance, peaceful music is usually selected for chatting, while energetic music is a top candidate for parties.
We adopted the Geneva Emotional Music Scale (GEMS) for evaluating user emotions [8]. GEMS is the first instrument designed to evaluate music-evoked emotions. We adopt the short version, GEMS-9, consisting of 9 classes of emotions: wonder, transcendence, power, tenderness, nostalgia, peacefulness, joyful, sadness and tension. Each class provides a scale from 1 to 5 indicating the intensity of the emotion. However, asking users to evaluate the emotions evoked by music is not our research focus; rather, we want to provide a ludic affective feedback interface in which a group of users can participate, and a survey-style questionnaire easily distracts users from the system itself. Inspired by the Geneva Emotional Wheel (GEW) [9], we visualize the evaluation scale as a color wheel, as shown in Figure 2. The wheel contains the 9 dimensions defined by GEMS, and each dimension contains 5 degrees of intensity, visualized by the sizes of the circles, their distances from the center and the saturation of their colors: the smaller the circle, the less intense the emotion. Users can tag their evoked emotions in any of the 9 dimensions according to their intensity.
Figure 2. GEMW distribution in GroupFun
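As an illustration of the visual encoding just described, here is a minimal sketch (our reconstruction, not the GroupFun code) that maps a GEMS-9 dimension and an intensity degree to a circle's position, size and color saturation; all constants and names are assumptions.

import math

GEMS9 = ["wonder", "transcendence", "power", "tenderness", "nostalgia",
         "peacefulness", "joyful", "sadness", "tension"]

def wheel_circle(dimension, intensity, max_radius=100.0):
    """Return (x, y, circle_size, saturation) for one emotion tag,
    with intensity in 1..5 (5 = most intense)."""
    angle = 2 * math.pi * GEMS9.index(dimension) / len(GEMS9)
    scale = intensity / 5.0                       # 0.2 .. 1.0
    radius = max_radius * scale                   # farther from center = more intense
    return (radius * math.cos(angle), radius * math.sin(angle),
            4 + 8 * scale,                        # bigger circle = more intense
            scale)                                # more saturated = more intense

print(wheel_circle("joyful", 3))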
We further include the ACTI widget in the GroupFun rating interface. Users can give emotional feedback on songs by clicking the emotion button at the left side of the song ratings. The emotional rating interface then pops up with the collective emotions of the group, which visualize the overall emotional character of the music (Figure 3). This serves as an alternative channel for explanation and persuasion.
Figure 3. Screenshot of rating interface with ACTI
4. PILOT STUDY
4.1 Hypotheses
We postulated that interface and interaction design in social group recommender systems depends on the relationships among group members; in other words, that the appropriate design differs between groups made up of friends and groups made up of strangers. We also hypothesized that groups whose members are in a close relationship would like the ACTI interface, while groups whose members are not close to each other would be less likely to use ACTI.
4.2 Participants
In total, two groups (4 participants each) joined the user study. Two PhD students from a course voluntarily participated, and each of them invited another three people to form a group. In order to investigate the effect of relationship on group behavior, we asked the first student to invite people who are not familiar with each other, and the second student to invite people who are familiar with each other.
4.3 Scenarios and Roles
In order to organize the experiment, we designed a car-driving scenario: a group of 4 people travelling together from Lausanne to Geneva by car, a ride of around 45 minutes. Based on the different group relationships, we designed the first scenario as strangers sharing a car ride and the second scenario as friends travelling together. In each group, one of the participants is the driver and the other three are passengers, and they use GroupFun to select a playlist for their trip. This scenario is a typical group activity where all members consume a recommendation list simultaneously. It provides two types of roles, one driver and three passengers; the driver could be the user who creates the group. Since the focus group is small, it is not realistic to design a large-group scenario such as a party. In this experiment, GroupFun provides 15 songs, which suits the experiment in the sense that 15 songs approximately fill a 45-minute ride.
4.4 Experiment Design and Procedure
We carried out the user study on the two groups over two days. We first invited each member of the group to meet in our lab with their favorite music, briefed each group on the procedure of the study and the usage of GroupFun, and assigned roles within each group. Users started following the scenario after exploring the original GroupFun for around five minutes. In each group, the driver first created a group, sent invitations to his/her friends (the other participants of the group), and uploaded his/her favorite music. The invited users accepted the invitation, joined the group and contributed songs respectively. Meanwhile, they could listen to the group music and rate songs. They then continued with a survey questionnaire evaluating GroupFun. After that, we invited them to explore the experimental GroupFun interface with ACTI, followed by an interview. The user study process in both groups was recorded for further analysis.
5. INTERVIEW RESULTS
In order to verify the differences in relationship among group members, each participant indicated their closeness to every other member (except themselves), ranging from 1 (don't know him/her) to 5 (know him/her very well). For example, if Participant A is in a very close relationship with Participant B, A is expected to rate 5 for Participant B.
The total closeness score in Group 1 is 33 while that in Group 2 is 44 (with four members each rating three others, totals can range from 12 to 60). This result verifies the group differences we expected.
It is easy to see from the video recordings that both groups enjoyed listening to the group music very much; for example, some of them even sang along while listening. More interestingly, we found that members of Group 2 even discussed with each other a song that they particularly enjoyed.
During the interview phase, we first asked the participants to recall the experiment scenario and whether the music influenced their mood. All of them agreed that the music evoked emotions. Each participant then compared the original GroupFun interface with the experimental interface with ACTI. In Group 1, only one member supported ACTI, while the others thought it was not necessary and that including additional features was too complicated and cost them more effort. By contrast, members of Group 2 were highly positive about ACTI. As one user said, "It is interesting to listen to music while evaluating their mood. It is even more interesting to compare their results with group emotion."
Following this question, we asked them whether their emotions were influenced by the group emotion. As we hypothesized, members of Group 1 did not see any influence, but they still enjoyed using GroupFun because of the recommended group music. By contrast, 3 out of 4 participants in Group 2 said they would like to view the group emotion when tagging their own mood evoked by the music.
The last question asked participants for their suggestions on the interaction and interface design of GroupFun, and we collected some valuable ones. One participant in Group 1 regarded simplicity as an important factor: he would prefer interesting functions, but the interface should not distract from the main function of GroupFun, which explains why he did not like ACTI. On the other hand, one member of Group 2 said some interesting group activity would help them listen to the suggested songs more carefully and therefore provide more accurate and responsible ratings. Another participant in Group 2 also mentioned that they would like to see more enjoyable interfaces in GroupFun as an entertaining application.
6. CONCLUSIONS
We designed an affective color tagging interface (ACTI) and applied it to GroupFun, a social group music recommender system. We invited two different types of groups for our user study: the members of one group did not know each other well, while the other group consisted of friends in a good relationship. Users in both groups preferred simple but interesting interfaces. Even though both groups of users enjoyed listening to the group songs and indicated positive attitudes towards GroupFun, the user study did show obvious differences in group behavior. Members of the group with close relationships were active in discussion with each other, liked interfaces that convey group emotion, and considered them interesting and entertaining. Meanwhile, the group whose members did not know each other well considered the affective interface complicated and not useful. This supports our hypothesis that interface design in social group recommender systems should take group formation and relationships into account.
However, this work is still at a preliminary stage and has some limitations. First, as a pilot study, we only invited two groups of users to survey their needs. In order to establish design guidelines, we need more groups, more types of groups, and larger-scale user studies.
Furthermore, a browser-based affective interface is limited; user emotions could instead be captured automatically in an ambient environment. Our future work therefore also includes ambient affective interfaces in social group recommender systems.
7. ACKNOWLEDGEMENT
We thank all participants for their interest in our project, their valuable time and their suggestions.
8. REFERENCES
[1] J. McCarthy and T. Anagnost. MusicFX: An arbiter of group preferences for computer supported collaborative workouts. In Proceedings of the 1998 ACM Conference on Computer Supported Cooperative Work, pages 363-372, 1998.
[2] Musicovery. http://musicovery.com/
[3] R. Hu and P. Pu. A study on user perception of personality-based recommender systems. In P. De Bra, A. Kobsa, and D. Chin, editors, UMAP 2010, volume 6075 of LNCS, pages 291-302, 2010.
[4] M. O'Connor, D. Cosley, J. A. Konstan, and J. Riedl. PolyLens: A recommender system for groups of users. In Proceedings of the Seventh European Conference on Computer Supported Cooperative Work, pages 199-218, Bonn, Germany, 2001.
[5] J. Masthoff and A. Gatt. In pursuit of satisfaction and the prevention of embarrassment: affective state in group recommender systems. User Modeling and User-Adapted Interaction, 16(3):281-319, 2006.
[6] J. T. Hancock, K. Gee, et al. I'm sad you're sad: emotional contagion in CMC. In Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work, pages 295-298, San Diego, CA, USA, 2008.
[7] T. Sy, S. Côté, et al. The contagious leader: Impact of the leader's mood on the mood of group members, group affective tone, and group processes. Journal of Applied Psychology, 90(2):295-305, 2005.
[8] M. Zentner, D. Grandjean, and K. R. Scherer. Emotions evoked by the sound of music: Characterization, classification, and measurement. Emotion, 8:494-521, 2008.
[9] T. Banziger, V. Tran, and K. R. Scherer. The Emotion Wheel: A tool for the verbal report of emotional reactions. In Proceedings of the General Meeting of the International Society for Research on Emotions (ISRE), Bari, Italy, July 2005.

PopCore: A System for Network-Centric Recommendations
Amit Sharma, Dept. of Computer Science, Cornell University, Ithaca, NY 14853, [email protected]
Meethu Malu, Information Science, Cornell University, Ithaca, NY 14850, [email protected]
Dan Cosley, Information Science, Cornell University, Ithaca, NY 14850, [email protected]
ABSTRACT
In this paper we explore the idea of network-centric recommendations. In contrast to the individually-oriented recommendations enabled by social network data, a network-centric approach to recommendations introduces new goals such as effective information exchange, enabling shared experiences, and supporting user-initiated suggestions, in addition to conventional goals like recommendation accuracy. We are building a Facebook application, PopCore, to study how to support these goals in a real network, using recommendations in the entertainment domain. We describe the design and implementation of the system and initial experiments. We end with a discussion of a set of possible research questions and short-term goals for the system.
Keywords
recommender systems, social recommendation, network-centric
1. INTRODUCTION
Users are increasingly disclosing information about themselves and their relationships on social websites such as Facebook, Twitter, and Google+. These data provide signals that have been used to augment traditional collaborative filtering techniques by making network-aware recommendations [8, 9].
Such recommenders use social data to support prediction, provide social context for the recommendations, and help alleviate the cold-start problem typically found in recommender systems. Much of their power comes from social forces, such as homophily, trust and influence, and thus these recommenders do not just provide better recommendations; they can also support the study of these forces. For example, in [4], the authors divide a user's social contacts into familiarity and similarity networks (proxies for trust and homophily, respectively), and study their relative impact on the quality of recommendation.
But we can take this a step further. Just as a user's network can influence the recommendations he/she receives, the recommendations, in turn, can also influence the network and alter the underlying social processes. For instance, new recommendations can alter the diversity of the set of items within a network, while group recommendations can strengthen social ties. Thinking of recommendations as being embedded in a network, rather than informed by it, provides a new context for analyzing and designing recommender systems, and an important one, given people's increasing interaction and consumption in online social networks.
In this paper, we lay out our approach to exploring this network-centric approach to recommendation system design. We start by discussing the new concerns such systems foreground, focusing on design goals that come from thinking about the social aspects of recommendations that are embedded in a network, compared to more individually-focused systems. Second, we introduce PopCore, the network-centric recommender system we are building in Facebook to support these goals. We have already deployed a proof-of-concept version to conduct initial experiments around network-aware algorithms [13]; here, we discuss how we are evolving the system to support the social design goals. We close by laying out the issues that doing network-centric recommendations raises, most notably around the tension between social sharing, privacy, and identity management, and by outlining the initial questions we hope to address as we design and build both the system and the community.
2. NETWORK-CENTRIC DESIGN GOALS
Thinking about recommendations as embedded in a social network raises a number of questions, ranging from using network data to improve individual recommendations to using people's behavior to study large-scale patterns of diffusion and other social science forces at work. Given our goal to design a useful network-centric recommender system, here we focus on design goals that capture social elements that are more salient than they would be in a typical e-commerce recommender application.
Directed Recommendations. An integral part of social experience is sharing information with others person-to-person.
Such user-generated directed suggestions have been studied for link-sharing [2] and are ripe for study in other domains, for integration with automated applications, and for application to a social network context. Allowing directed suggestions might encourage people to be more active participants in the system and give them ways to express their identity. These suggestions may also be more accurate than collaborative filtering for certain tasks [6], and in aggregate they support data mining and automated recommendation from this user-generated 'buzz' [10].
Figure 1: A mockup of the PopCore interface. By default, recommendations are shown from all three domains: Movies, Books and TV. The controls at the top help a user decide the composition of the list of items, while the lower section provides contextual visualization for the recommendations.
Shared Experiences and Conversation. For many items, especially in the entertainment domain, enjoyment depends not just on personal preferences but also on social experiences such as enjoying the content with other people [3]. Given an item such as a movie, is it possible to predict the people who may join you for it? This is slightly different from group recommendation, which is typically aimed at a predefined group of people [1], and we expect that leveraging network information will make such predictions more effective than earlier approaches that combined individual lists of recommendations [11]. Conversation is another social experience, and since people who disagree about movies have livelier conversations [7], algorithms might focus on recommending items that evoke strong reactions, or even "antirecommendations", along with the traditional goal of accurate recommendation. Systems aimed at individuals are unlikely to want to recommend hated items, but people often like to talk about them, and this propagation of negative opinion may also help others avoid bad experiences.
Network Awareness. Negative information is a specific kind of awareness, and people have a broad interest in awareness of what is happening in their social network [5]. From the point of view of information, taste, and fashion, it is useful to know who the opinion leaders are, who the active and effective recommenders are, which items are becoming hot or not, and who is knowledgeable about a given topic [12]. Thus, supporting social interaction not just between individuals but at the network level is likely to be valuable in a network-centric recommender system.
3. POPCORE: THE PLATFORM
We now discuss how we are starting to realize these goals in PopCore, a Facebook application we are developing for providing and studying network-centric recommendations. We chose Facebook because it provides us with both network and preference data (through Likes), and also supports a diverse set of item domains. PopCore works by fetching a user's and her friends' profile data on Facebook (subject to the user's permission) and providing recommendations based on those signals. Currently, we restrict PopCore to the entertainment domain, including movies, books, TV shows and music. These categories have a fair amount of activity and broad popular appeal.
3.1 System Description/Design
We decided on a simple three-part interface, as shown in Fig. 1. The center section contains the main content to be shown (a list of items), while the top and bottom sections show content-filtering controls and contextual visualizations, respectively.
Each of the interface components, from the logo on down, is designed to support both the goals outlined above and the collection of interesting data to study.
PopMix. The top section is the control panel, PopMix, which gives a user full control over the type of items shown in the content section. The controls are designed to be intuitive, inspired by a common interface metaphor: a music equalizer. Just as a music mixer allows a user to shape the sound output according to his tastes, the PopMix interface gives the user control over the domain, genre, popularity and other parameters he may choose. In order to account for temporal preferences, we also include a special recency knob that allows users to select the proportion of recent versus older items shown. In addition, users may view items expected to be available soon. For such current and 'future' items, users may notify and invite their friends (chosen manually or from a system-recommended list). This supports the goal of shared consumption. Eventually, we plan to implement filters that allow people to control social network parameters as well. For instance, a user might choose the relative importance or proportion of network signals used for recommendation, such as the link-distance of people from the user, interaction strength, or people's age and location. People might also select a subset of people manually, or a named group of people, in which case the recommendation morphs more into a stream of items from those sources.
Stackpiles. The middle section shows a number of views relevant to user tasks such as getting automated recommendations, receiving directed suggestions, and remembering suggestions to follow up on. The top right corner of this section contains the tab buttons to switch views, as shown in Fig. 1. The primary view presents automatically generated suggestions filtered by the user's PopMix settings. The recommendation algorithm is a ranking algorithm that ranks a user's friends based on their relevance to the user along a list of parameters, such as interaction strength and the number of commonly Liked items. The most popular items according to this weighted user popularity are then chosen.
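A minimal sketch of how such friend-weighted item popularity could be computed follows; this is our reading of the description above, not PopCore's actual implementation, and the relevance signals, the linear weighting and all names are assumptions.

from collections import defaultdict

def friend_relevance(interactions, common_likes, alpha=1.0, beta=1.0):
    """Score each friend from two signals; `interactions` and `common_likes`
    map friend -> count. The linear form and alpha/beta are illustrative."""
    friends = set(interactions) | set(common_likes)
    return {f: alpha * interactions.get(f, 0) + beta * common_likes.get(f, 0)
            for f in friends}

def rank_items(friend_likes, relevance, top_n=10):
    """Each item's popularity is the summed relevance of friends who Like it;
    `friend_likes` maps friend -> set of Liked items."""
    score = defaultdict(float)
    for friend, items in friend_likes.items():
        for item in items:
            score[item] += relevance.get(friend, 0.0)
    return sorted(score, key=score.get, reverse=True)[:top_n]

rel = friend_relevance({"amy": 12, "raj": 3}, {"amy": 5, "raj": 9})
print(rank_items({"amy": {"Inception"}, "raj": {"Inception", "Dune"}}, rel))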
In a given view, a user is shown a list of items arranged as cards in distributed stackpiles. Items are grouped into stackpiles based on their similarity, using k-means clustering over their attributes; the number of piles and their distribution is generated dynamically. On flipping an item card, a user gets options to Like, Dislike, or rate the movie on a scale from 0.5 to 5. Giving a number of ways to interact with an item supports rich data collection and, in the case of Dislike, the idea of antirecommendations. Users may also directly suggest an item to one or more friends using the PopCore button. These suggestions are sent to the target users as a wall post or private message, based on the user's preferences. PopCore members can also see these suggestions as a view in the content window by clicking the "N" button at its upper right. People can type in any friend; PopCore also suggests people who may be a good fit for enjoying the item with.
The other main view is a user's personal library, which contains the user's 'For later' list and the list of items for which the user has provided strong feedback. The 'For later' list may be thought of as a non-linear queue (and can be accessed through the "Q" button). The list benefits from the same stackpiling metaphor, allowing a user a more visual and organized view of his/her library. Items are stackpiled based on their similarity and recency in the list by default, but users have full control to customize the groups.
Visualizations. The bottom section contains visualizations of network activity around items, supporting the network awareness goals described earlier. The default view is a word cloud showing items weighted by the number of the user's friends who have Liked them. Other visualizations include showing the friends who have contributed the most to the content shown to a user (either through directed recommendations or algorithmically) along with the items that have been recommended (Fig. 2), or a timeline showing the entry and growth of recent items in a user's network. The goal of these visualizations is to help the user navigate the multi-part social activity information in a clear, intuitive fashion.
Figure 2: A visualization showing aggregated behavior among the user's friends weighted by the amount of recommendations they make, with a detailed view of each item that has been recommended in the user's network.
4. ISSUES AND RESEARCH QUESTIONS
We conclude by discussing the major issues we expect around deploying a real network-centric recommender.
4.1 Trading off social and private elements
A primary issue is that having access to more information enhances the social discovery and consumption experience, but there is a direct trade-off with privacy. For example, the visualization component is designed to show individual activity about either items or people, and aggregate information about the other. "Activity" in this case might represent making or receiving directed suggestions, rating items, getting recommendations, adding items to one's queue, and so on. Consider showing items as the detail, people as the aggregate, and queuing as the activity. Users might want to know which items their friends are intending to consume, but it may often be the case that an individual using the system will queue a sequence of movies. Her picture will grow as her stream of queued movies grows, immediately conveying her queuing behavior to others; hence her privacy is compromised.
Identity management also comes into play. Having Likes and Dislikes visible to all friends makes it easier for a user's friends to follow his/her interests, but does it then affect Liking behavior, based on concerns about privacy and identity management? Similarly, a queue is a definite indication of interest, and making it accessible to others would directly benefit shared experiences and cooperation, but it is unclear whether users would want a public queue. For now, we have decided to keep everything except Likes and Dislikes private, but to give the user an option to selectively enable items for sharing whenever an action is taken, with the hope of balancing identity, privacy, and discovery without imposing too much work.
4.2 Long-term goals and short-term questions
The other major issue we see is that building out a network-centric recommender while building up its userbase promises to consume a fair amount of time. Thus, our short-term goal is to answer questions that need no or limited social interaction while the system and userbase develop.
Tradeoffs in doing network-centric recommendations. A network-centric approach affords fast algorithms, real-time capabilities, and modest user requirements compared to conventional collaborative filtering's use of large datasets, but it places a lot of emphasis on a person's immediate social network.
This reduces the pool of available items, and may also lead to a loss of diversity among the items recommended. We plan to pit our algorithm against state-of-the-art collaborative filters and compare the performance of both in terms of the activity generated around recommendations and users' satisfaction with the automated recommendations they receive from each. Eventually we hope to develop recommendation strategies that use recommendations computed both on the full dataset and in a network-centric way in the user's local network.
Interpreting actions and developing metrics. PopCore provides a wide variety of actions that users can take with an item, including putting it in their queue, publicly Liking or Disliking it, or suggesting it to friends. All these actions may convey signals that can be used both to improve the quality of recommendations and to evaluate them, although we need to learn to interpret them. What is the difference between a "Like" (which is public) and a 5-star rating (which is probably not)? Sharing an item provides an indication of "interestingness", but unlike ratings it does not provide a definite scale of enjoyment, and in fact people may share disliked items.
Exploring cross-domain recommendations. The network-centric approach relies heavily on people and their connections, and less on the items. This suggests that we may be able to cross-recommend items based on a user's network information and his/her preferences in a related domain, a task at which collaborative filters have not been very successful. Designing algorithms for cross-domain recommendation within a network is an interesting question in itself.
Social explanations. Right now PopCore uses data harvested from Wikipedia to present additional information about items to help people make decisions. However, that data does not explain why a recommendation was made, which is a commonly wanted feature in real-world recommendation systems [14]. Using network information to help justify automated recommendations may be a powerful feature, given the way people already rely on this information to make decisions.
Once we have built the userbase, we will be in a better position to ask questions about the explicit social elements we are designing for. Comparing directed to automatic recommendations, studying the value of awareness of network activity around items, exploring how recommendations and consumption propagate in networks, and developing effective metrics for measuring social outcomes are all questions that we hope to address in the long term, and that we think are key for recommender systems as they move into social networks.
Acknowledgement. We would like to acknowledge support from NSF grant IIS 0910664.
5. REFERENCES
[1] S. Amer-Yahia, S. B. Roy, A. Chawlat, G. Das, and C. Yu. Group recommendation: Semantics and efficiency. Proc. VLDB Endow., 2:754-765, August 2009.
[2] M. S. Bernstein, A. Marcus, D. R. Karger, and R. C. Miller. Enhancing directed content sharing on the web. In Proc. CHI, pages 971-980, 2010.
[3] P. Brandtzag, A. Folstad, and J. Heim. Enjoyment: Lessons from Karasek. In M. Blythe, K. Overbeeke, A. Monk, and P. Wright, editors, Funology, volume 3 of Human-Computer Interaction Series, pages 55-65. Springer Netherlands, 2005.
[4] I. Guy, N. Zwerdling, D. Carmel, I. Ronen, E. Uziel, S. Yogev, and S. Ofek-Koifman. Personalized recommendation of social software items based on social relations. In Proc. RecSys, pages 53-60, 2009.
[5] A. N. Joinson.
Looking at, looking up or keeping up with people?: Motives and use of Facebook. In Proc. CHI, pages 1027-1036, 2008.
[6] V. Krishnan, P. Narayanashetty, M. Nathan, R. Davies, and J. Konstan. Who predicts better? Results from an online study comparing humans and an online recommender system. In Proc. RecSys, pages 211-218, Lausanne, Switzerland, 2008.
[7] P. J. Ludford, D. Cosley, D. Frankowski, and L. Terveen. Think different: Increasing online community participation using uniqueness and group dissimilarity. In Proc. CHI, pages 631-638, 2004.
[8] H. Ma, I. King, and M. R. Lyu. Learning to recommend with social trust ensemble. In Proc. SIGIR, pages 203-210, 2009.
[9] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. Recommender systems with social regularization. In Proc. WSDM, pages 287-296, 2011.
[10] H. Nguyen, N. Parikh, and N. Sundaresan. A software system for buzz-based recommendations. In Proc. SIGKDD, pages 1093-1096, 2008.
[11] M. O'Connor, D. Cosley, J. A. Konstan, and J. Riedl. PolyLens: A recommender system for groups of users. In Proc. ECSCW, pages 199-218, Norwell, MA, USA, 2001.
[12] N. S. Shami, Y. C. Yuan, D. Cosley, L. Xia, and G. Gay. That's what friends are for: Facilitating 'who knows what' across group boundaries. In Proc. GROUP, pages 379-382, 2007.
[13] A. Sharma and D. Cosley. Network-centric recommendation: Personalization with and in social networks. In IEEE International Conference on Social Computing, Boston, MA, USA, 2011.
[14] K. Swearingen and R. Sinha. Beyond algorithms: An HCI perspective on recommender systems. In ACM SIGIR Workshop on Recommender Systems, New Orleans, USA, 2001.

Community-Based Recommendations: a Solution to the Cold Start Problem
Shaghayegh Sahebi, Intelligent Systems Program, University of Pittsburgh, [email protected]
William W. Cohen, Machine Learning Department, Carnegie Mellon University, [email protected]
ABSTRACT
The "cold-start" problem is a well-known issue in recommendation systems: there is relatively little information about each user, which results in an inability to draw inferences to recommend items to users. In this paper, we propose a solution to this problem based on homophily in social networks: we can use social network information to fill the gap in the cold-start setting and find similarities between users. In this study, we use communities, extracted from different dimensions of social networks, to capture the similarities across these dimensions and, accordingly, to help recommendation systems work based on the discovered latent similarities. By different dimensions, we mean the friendship network, the item similarity network, the commenting network, and so on.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - information filtering
Keywords
Recommendation, Cold-Start, Community Detection, Social Media
1. INTRODUCTION
Recommendation systems have been developed as one of the possible solutions to the information overload problem. The cold start problem [7] is a typical problem in recommendation systems, and in recent years several studies have tried to address it. For example, in [6] and [7], hybrid recommendation approaches that combine content and usage data are proposed, and in [1], a new similarity measure considering impact, popularity, and proximity is introduced as a solution to this problem. Most of these approaches consider content information or demographic data, and not connection information, for performing the recommendations; however, in some cases this information might not be available. In this paper, we suggest user connections and ratings in social networks as a replacement. With the advance of the OpenID protocol and the emergence of new social networks, user activities, connections and ratings in various networks are now more accessible. Social networks offer connections of different dimensions: people may be friends with each other, they might have similar interests, and they may rate content similarly. These different dimensions can be used to detect communities among people. Using community detection techniques, the collective behavior of users becomes predictable. For example, in [5], a comparison is made between familiarity-network-based and similarity-network-based recommendations. In [4], a typical traditional collaborative filtering (CF) approach is compared to a social recommender/social filtering approach. These studies do not utilize latent community detection techniques to address the cold start problem. This study aims to use different dimensions of social networks to extract latent communities and to use these communities to provide a solution to the cold start problem.
In this paper, we first give a brief introduction to community detection methods. We then describe the Principal Modularity Maximization method [8] in Section 2.1. After that, we propose our approaches to utilizing the community detection algorithm in Section 2.2, describe the dataset in Section 3, and discuss the experiments in Section 4.
2. COMMUNITY DETECTION
With the growth of social networking web sites, the number of subjects within these networks has been growing rapidly. Community detection in social media analysis [3] helps to understand more of users' collective behavior. Community detection techniques aim to find subgroups among subjects such that the amount of interaction within a group is larger than the interaction outside it. Multiple statistical and graph-based methods have recently been used for community detection: Bayesian generative models [2], graph clustering approaches, hierarchical clustering, and modularity-based methods [3] are a few examples. While existing social networks consist of multiple types of subjects and interactions among them, most of these techniques focus on only one dimension of these interactions. Consider the example of blog networks, in which people can connect to each other, comment on each other's posts, post links to other posts in their blogs, or blog about similar subjects. By considering only one of these dimensions, e.g. the connection network, we lose important information about the other dimensions, and the resulting communities will represent only a part of the existing ones. In this paper we use the modularity-based community detection method for multi-dimensional networks presented by Tang et al. [8], as briefly described in the following subsection.
2.1 Principal Modularity Maximization
Modularity-based methods assess the strength of a community partition of a real-world network by taking into account the degree distribution of its nodes. The modularity measure captures how far the within-group interaction of the found communities deviates from a uniform random graph with the same degree distribution, and is defined as

Q = \frac{1}{2m} \mathrm{Tr}(S^T B S)   (1)

B = A - \frac{d d^T}{2m}   (2)

where S is a matrix indicating community membership (S_{ij} = 1 if node i belongs to community j and 0 otherwise) and B is the modularity matrix defined in equation 2, which measures the deviation of the network interactions from a random graph. In equation 2, A represents the sparse interaction matrix between the actors of the network, d is the vector of node degrees, and m is the total number of existing edges. The goal in modularity-based methods is to maximize Q, the strength of the community partition. By relaxing S to a matrix with continuous elements, the optimal S can be computed as the top k eigenvectors of the modularity matrix B [8].
As said before, communities can span multiple dimensions, such as the friendship, co-rating, and commenting dimensions. Principal Modularity Maximization (PMM) [8] is a modularity-based method for finding hidden communities in multi-dimensional networks. The idea is to integrate the network information of multiple dimensions in order to discover cross-dimension group structures. The method is a two-phase strategy for identifying the hidden structures shared across dimensions. In the first phase, structural features are extracted from each dimension of the network via modularity analysis (structural feature extraction); then the features are integrated to find a community structure among the nodes (cross-dimension integration). The assumption behind this cross-dimensional integration is that the structures of all of the dimensions in the network should be similar to each other.
In the first step, structural features are defined as the network-extracted dimensions that are indicative of community structure; they can be computed as a low-dimensional embedding using the top eigenvectors of the modularity matrix. Minimizing the difference among the features of the various dimensions in the cross-dimension integration step is equivalent to performing Principal Component Analysis (PCA) on them. This results in a continuous community membership matrix S, which shows how much each node belongs to each community. To group all the nodes into discrete community memberships based on these features, a simple clustering algorithm such as k-means is run on S; as a result, each node belongs to exactly one community.
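The two-phase procedure can be sketched as follows. This is our illustration of the steps described above (per-dimension top eigenvectors of the modularity matrix, PCA-style integration, then k-means), not the authors' implementation; details such as the centering step and the feature counts are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def modularity_matrix(A):
    """B = A - d d^T / (2m) for a symmetric adjacency matrix A (equation 2)."""
    d = A.sum(axis=1)
    m = d.sum() / 2.0
    return A - np.outer(d, d) / (2.0 * m)

def structural_features(A, k):
    """Phase 1: top-k eigenvectors of one dimension's modularity matrix."""
    w, v = np.linalg.eigh(modularity_matrix(A))
    return v[:, np.argsort(w)[-k:]]

def pmm(adjacencies, k_features=10, k_communities=30):
    """Phase 2: concatenate per-dimension features, integrate them (PCA via an
    SVD of the centered concatenation), then cluster the soft memberships.
    Assumes k_communities <= len(adjacencies) * k_features."""
    X = np.hstack([structural_features(A, k_features) for A in adjacencies])
    X = X - X.mean(axis=0)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    S = U[:, :k_communities]          # continuous community membership matrix
    labels = KMeans(n_clusters=k_communities, n_init=10).fit_predict(S)
    return S, labels                  # soft memberships and one community per node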
As a result, the recommendation system cannot recommend any items to such a user. Regarding the cold start problem for items, when an item is new in the usage based recommendation systems, no users have rated that item. So, it does not exist in any user profile. Since in collaborative filtering the items consumed in similar user profiles are recommended to the user, this new item cannot be considered for recommendation to anyone. In this paper, we concentrate on cold start problem for new users. we propose that if a user is new in one system, but has a history in another system, we can use his/her external profile to recommend relevant items, in the new system, to this user. As an example, consider a new user in youtube, of whom we are aware of his/her profile in Facebook. A comprehensive profile of the user can be produced by the movies he/she posted, liked or commented on in Facebook and this profile can be used to recommend relevant movies in youtube to the same user. In this example, the type of recommended items are the same: movies. Another hypothesis, is that users’ interest in specific items, might reveal his/her interest in other items. This is the same hypothesis that exists in multidimensional network community detection: we expect multiple dimensions of a network to have a similar structure. As an example, if a user is new to the books section of a system, but has a profile in the movies section, we can consider similar users to him/her, in terms of movie ratings, to have a similar taste on books with him/her; or if two users are friends, we expect them to have more similar behavior in the system. We utilize user profiles in other dimensions to predict their interests in another dimension can be used as a solution to the cold start problem. Community detection can provide us with a group of users similar to the target user considering multiple dimensions. We can use this information in multiple ways as suggested in the following. In traditional collaborative filtering, the predicted rating of active user a on each item j is calculated as a weighted sum of similar users’ rankings on the same item: Equation 3. Where n is the number of similar users we would like to take into account, α is a normalizer, vi,j is the vote of user i on item j, v̄i is the average rating of user i and w(a, i) is the weight of this n similar users. pa,j = v̄a + α n X w(a, i)(vi,j − v̄i ) (3) i=1 The value of w(a, i) can be calculated in many ways. Common methods are Cosine similarity, Euclidean similarity, or Pearson Correlation on user profiles. we proposed and tried multiple approaches in community based collaborative filtering to predict user ratings. Once we have found latent communities in the data, we need to use this information to help with the recommendation of content to users. Our 42 assumption is that users within the same latent community are a better representative of user interests in comparison with all users. We propose approaches that consist of combinations of the following: 1. Using a community based similarity measure to calculate w(a, i): This is specifically useful in PMM community detection algorithm. Here, a matrix S, an N × K matrix, which is an indicator of multi dimensional community membership is produced. It shows how much each user belongs to each community. we define the community-based similarity measure among users of the system as an N × N matrix W in equation 4 and use it as a weight function in equation 3. 
We proposed and tried multiple approaches to community-based collaborative filtering for predicting user ratings. Once we have found the latent communities in the data, we need to use this information to help with the recommendation of content to users. Our assumption is that the users within the same latent community are a better representative of a user's interests than all users. We propose approaches that consist of combinations of the following:
1. Using a community-based similarity measure to calculate w(a, i): This is specifically useful with the PMM community detection algorithm. Here, an N × K matrix S, which is an indicator of multi-dimensional community membership, is produced; it shows how much each user belongs to each community. We define the community-based similarity measure among the users of the system as an N × N matrix W in equation 4 and use it as the weight function in equation 3. Here, N is the total number of users, and each element of the matrix shows the similarity between two users based on the communities they belong to.

W = S S^T   (4)

2. Using co-community users (the users within the active user's community) instead of the k-nearest neighbors: we define the predicted rating as in equation 5, in which community(a) indicates the community assigned to the active user by the community detection algorithm. Based on that, only the users within the active user's community are considered in the CF algorithm.

p_{a,j} = \bar{v}_a + \alpha \sum_{i \in community(a)} w(a, i)(v_{i,j} - \bar{v}_i)   (5)

Besides addressing the cold-start problem, we believe the second case is useful when there is a large number of users and, as a result, the traditional collaborative filtering approach takes a lot of space and time. Instead, we can detect the community a user belongs to and use that community's members to find relevant items for the user.
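Both proposals are small variations on the same predictor; a minimal sketch (our illustration, with names and the normalization chosen as in the earlier sketch):

import numpy as np

def community_similarity(S):
    """Equation 4: W = S S^T, with S the N x K soft membership matrix from PMM."""
    return S @ S.T

def predict_within_community(a, j, ratings, means, W, labels):
    """Equation 5: the weighted sum runs only over the active user's community;
    `labels` holds the discrete community assigned to each user by k-means."""
    num = norm = 0.0
    for i, c in enumerate(labels):
        if c == labels[a] and i != a and j in ratings[i]:
            num += W[a, i] * (ratings[i][j] - means[i])
            norm += abs(W[a, i])
    return means[a] + num / norm if norm else means[a]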
Once we have the different dimensions of the network, we can run PMM on the friendship, books, and movies graphs to obtain the latent communities. In this experiment we set both the number of communities and the number of neighbors in the collaborative filtering approach to 30. Graphical results of performing PMM, created with the Gephi software (www.gephi.org), are shown in Figures 3 and 4.

Figure 3: Communities detected by PMM shown in a graph sketched by Gephi. Each community forms a square; nodes are imhonet users and links are their friendship connections.

Figure 4: Pie chart of the number of users in each community. Each color represents a community.

We considered different combinations of the approaches proposed in Section 2.2 as follows:

1. We consider a vector space model for book and movie ratings and build user profiles by concatenating these two vectors in a combined space; we then perform traditional collaborative filtering using Pearson correlation on the concatenated vector (CF).

2. As described in case 1, but we perform collaborative filtering for all users using their community memberships as the similarity measure (CF with Community Simil).

3. As in case 1, we perform traditional collaborative filtering, but only within the community (CF within Community).

4. We perform collaborative filtering using the community-based similarity measure within the community, a combination of approaches 1 and 2 from Section 2.2 (CF with Community Simil within Community).

The performance of these different combinations is reported in Figure 5 in terms of nDCG at the top k recommendations, for k ranging from one to ten.

Figure 5: nDCG at top k recommendations

Notice that collaborative filtering within the members of a community works slightly better than the other methods. Performing CF within a community, whether with the community-based similarity measure or with Pearson correlation, also works better than performing CF with a constant number of kNN neighbors. On the other hand, using the community-based similarity reduces nDCG for both the within-community and the global CF methods. So while restricting CF to community members helps in recommending more interesting items to users, Pearson correlation works better than the community-based measure as the similarity function for CF. Generally, the nDCG results we obtain for the cold start problem are reasonable, given that the problem is simulated in such a way that, without information from the other dimensions, recommending items to these users would be impossible.
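Since the comparison above is reported in terms of nDCG at top k, the following sketch shows the standard computation for a single user. This is our illustration only; the paper does not specify its exact gain and discount conventions, so log2 discounting with held-out ratings as gains is an assumption.

import numpy as np

def ndcg_at_k(ranked_items, test_ratings, k=10):
    """nDCG@k for one user (illustrative; log2 discount assumed).

    ranked_items : recommendation list, best first
    test_ratings : dict mapping held-out item -> rating (used as the gain)
    """
    dcg = sum(test_ratings.get(item, 0.0) / np.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k]))
    ideal = sorted(test_ratings.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0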
5. CONCLUSIONS AND FUTURE WORK

We showed that performing collaborative filtering within community members is more effective than running collaborative filtering on all users. We also showed that using other dimensions of user interests or user connections helps achieve a reasonable nDCG on the cold-start problem. Based on our experiments, the number of members in each community follows a power law. It would therefore be interesting to measure the performance of the proposed community-based recommendation methods on communities of different sizes, to see whether these methods help most in small, mid-size, or large communities. Another interesting study is the effect of the number of neighbors in the simple collaborative filtering approach: which neighborhood size works best, and whether that number is related to the average detected community size. Further future work would be to consider Bayesian generative models of community detection and to study how grouping a user's connections and assigning them to each of the network's dimensions would affect recommendation quality.

6. ACKNOWLEDGMENTS

We would like to thank the administration of imhonet, who kindly provided anonymized data for our study. We also thank Dr. Peter Brusilovsky and Daniel Mills for their help during this study. This research is partially supported by the National Science Foundation under Grants No. 1059577 and 1138094.

7. REFERENCES

[1] H. J. Ahn. A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem. Information Sciences, 178(1):37-51, 2008.
[2] C. Delong and K. Erickson. Social topic models for community extraction. 2008.
[3] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75-174, 2010.
[4] G. Groh and C. Ehmig. Recommendations in taste related domains: collaborative filtering vs. social filtering. In Proc. of the 2007 International ACM Conf. on Supporting Group Work, GROUP '07, pages 127-136, New York, NY, USA, 2007. ACM.
[5] I. Guy, N. Zwerdling, D. Carmel, I. Ronen, E. Uziel, S. Yogev, and S. Ofek-Koifman. Personalized recommendation of social software items based on social relations. In Proc. of the Third ACM Conf. on Recommender Systems, RecSys '09, pages 53-60, New York, NY, USA, 2009. ACM.
[6] S.-T. Park and W. Chu. Pairwise preference regression for cold-start recommendation. In RecSys '09, pages 21-28, New York, NY, USA, 2009. ACM.
[7] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In Proc. of the ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 253-260. ACM Press, 2002.
[8] L. Tang, X. Wang, and H. Liu. Uncovering groups via heterogeneous interaction analysis. In ICDM, 2009.
Free Text In User Reviews: Their Role In Recommender Systems

Maria Terzi, Maria-Angela Ferrario, Jon Whittle
School of Computing & Communications, InfoLab21, Lancaster University, LA1 4WA, Lancaster, UK
[email protected], [email protected], [email protected]

ABSTRACT
As short free text user-generated reviews become ubiquitous on the social web, opportunities emerge for new approaches to recommender systems that can harness users' reviews in open text form. In this paper we present a first experiment towards the development of a hybrid recommender system which calculates users' similarity based on the content of users' reviews. We apply this approach to the movie domain and evaluate the performance of LSA, a state-of-the-art similarity measure, at estimating the similarity of users' reviews. Our initial investigation indicates that users' similarity is not well reflected in traditional score-based recommender systems which rely solely on users' ratings. We argue that short free text reviews can be used as a complementary and effective information source. However, we also find that LSA underperforms when measuring the similarity of short, informal user-generated reviews, and we argue that further research is needed to develop similarity measures better suited to noisy short text.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information Search and Retrieval - information filtering

General Terms
Algorithms, Human Factors, Experimentation

Keywords
Recommender systems, social web, similarity measures, user reviews.

1. INTRODUCTION
Recommender systems provide personalized suggestions to users about products or services they might be interested in. The two main approaches to developing recommender systems, namely 'content-based' and 'collaborative filtering', are principally based on user ratings. Collaborative filtering approaches compare the ratings of two users to recommend new items to each. They work by pairing users that have rated the same items similarly, and recommending items to each user based on the rating history of the other paired user. Content-based filtering approaches (e.g., Amazon, http://www.amazon.com/) analyse item features and descriptions to identify items that are likely to be of interest to the user. This is achieved by building user profiles based on the features of items a user has previously rated, then measuring the similarity of a profile against the extracted features of other items; potentially interesting items are then recommended to the user based on this similarity. The main problem with making recommendations based on item ratings is that "we do not understand how users compose their judgment of the varying attributes of an item into a single rating, or how users' usage of rating schemes will vary between each other" [1].
Indeed, people may rate items similarly, but their ratings may be based on different combinations of item features. Thus, a recommender system based blindly on ratings (without any indication of the underlying reasons why a user gave a particular item a particular rating) may not produce accurate suggestions. For example, a collaborative filtering system recommending movies will match two users that rated a number of movies with the same (high) score. However, one of the users may rate a movie highly because his favorite actor plays the lead role, while the other may do so because he likes the special effects. Any recommendations made to these two users based on their respective ratings of these films could therefore be inappropriate.

Current implementations of the content-based approach utilize multi-rating systems [2, 3], whereby a user rates each aspect of an item. The additional structure in the user rating enables a more effective correlation of the rating with product features, thereby allowing more accurate delivery of recommended content. These systems, however, depend critically on the willingness of the user to explicitly rate the various features of an item. Surprisingly, despite the popularity of publicly viewable reviewing on the social web, limited work has been undertaken to utilize it to improve recommender systems, as recent research has mainly focused on product feature extraction. Recent work in the field aims to extract item features and sentiment from user reviews to enhance item recommendations. Jakob et al. [4] use opinion mining and sentiment analysis techniques to infer ratings of item aspects from user reviews and improve the prediction accuracy of star-rating-based recommender systems. Long et al. [5] propose a method to estimate feature ratings from users' textual reviews. In particular, they focus on 'specialized reviews', that is, reviews that extensively discuss one feature, and suggest that recommender systems can use specialized reviews to make recommendations based on their main feature.

In this paper we undertake a feasibility study for a potential recommender system which calculates users' similarity based on their reviews. The core of such a system is a similarity measure that identifies similar reviews, that is, reviews that refer to the same features or use the same adjectives to describe an item. The proposed system will use free text user reviews to a) match users that provided similar reviews for items and b) match users with items based on their reviews. Our approach differs from existing work [4, 5] focusing on feature extraction: the proposed system does not attempt to extract predictive features of items from long-text reviews, but aims to pair users with other users and with items based on the similarity of short text reviews. We argue that such a system may overcome the limitations of current rating-based approaches, as further described in Section 2. To investigate the feasibility of such a system, we present the results of an experiment in which a corpus of short free text movie reviews is analyzed to determine whether the content of a review matches the rating given by the user. Furthermore, we investigate the accuracy of current rating-based collaborative filtering approaches by measuring the similarity of reviews that have the same movie rating.
In addition, we evaluate the effectiveness of LSA (Latent Semantic Analysis), a state-of-the-art similarity measure, for judging the similarity of user reviews. This provides an indication of the feasibility of implementing the proposed system using the similarity of user reviews. The remainder of the article is structured as follows. Section 2 briefly describes the functionality of the proposed system. Section 3 reviews state-of-the-art similarity measures for short text, including LSA. Section 4 describes a pilot study undertaken to test the feasibility of the approach and the performance of LSA as a short-review similarity measure. In Section 5, we provide a summary and an outlook on further investigations.

2. TOWARDS A FREE TEXT RECOMMENDER SYSTEM

The recommender system we propose builds on the assumption that a free text review contains the reasons why a rating was given. The system is a hybrid approach consisting of collaborative filtering coupled with a content-based analysis of the user reviews. The collaborative filtering aspect of the system measures the similarity between two users by comparing the similarity of their reviews (instead of their ratings) for the items that both users have reviewed, and then recommends new items of potential interest to the target user. The advantage of this system is that it overcomes a common limitation of rating-based collaborative filtering approaches: the system is able to match users based on the similarity of their reviews even if they rated an item differently.

To highlight this point, in Figure 1 we present three reviews from the RottenTomatoes movie review site (http://www.rottentomatoes.com/). All three reviews are about the movie Pirates of the Caribbean: On Stranger Tides, the fourth in the series.

Figure 1. User reviews from RottenTomatoes.

Two reviews have the same rating, so a traditional rating-based collaborative filtering approach would judge them as similar. Evidently, though, these two reviews are different: one reviewer referred to the main actor, while the other referred to the storyline. Additionally, each user has a different opinion about this movie compared with the previous films in the series. These two users rated the movie with the same score, yet the reasons why they assigned that score differ; in such circumstances, considering these two reviews equivalent would be naïve. Finally, the third reviewer in Figure 1 comments positively on the storyline, a review which is highly similar to the second despite the difference in the final score. Based on these reviews, the system we propose would class these two reviewers as similar, in sharp contrast to how their similarity would be evaluated by a rating-based approach.

The proposed system will also adopt a content-based approach. It will build item profiles based on the reviews shared about them, and user profiles based on the reviews that users share. The system will then compare the similarity between a target user profile and item profiles to identify items of potential interest for each user. For example, if a user often comments on special effects, his profile would be identified as similar to movies that users have reviewed as having good special effects. The advantage of this approach over current recommender systems is that it enables recommendations based not only on pre-defined movie features, but also on other, potentially interesting features of a movie.
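As a rough illustration of the collaborative filtering component just described, the sketch below scores two users by the average textual similarity of their reviews for co-reviewed items. This is our own toy rendering, not the paper's implementation: it uses plain TF-IDF cosine as a stand-in for whichever short-text similarity measure is eventually adopted (Section 3 evaluates LSA for this role), and reviews_u and reviews_v are hypothetical item-to-review-text mappings.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def review_based_similarity(reviews_u, reviews_v):
    """Average review-text similarity over co-reviewed items (toy sketch).

    reviews_u, reviews_v : dicts mapping item id -> review text.
    Users who discuss the same aspects score high even when their
    numeric ratings for the item disagree.
    """
    common = set(reviews_u) & set(reviews_v)
    if not common:
        return 0.0
    scores = []
    for item in common:
        vectors = TfidfVectorizer().fit_transform(
            [reviews_u[item], reviews_v[item]])
        scores.append(cosine_similarity(vectors[0], vectors[1])[0, 0])
    return sum(scores) / len(scores)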
Since commenting and reviewing is already one of the main user activities on the social web, and no additional interaction from users is required, the proposed system can potentially be applied to any popular review website or social networking site. For the development of this system we require a measure that is able to identify similar reviews. We propose similarity measures over user reviews as a different approach to measuring similarity and making recommendations, rather than extracting features and sentiment from reviews. For that reason, we present an overview of current similarity measures for short text in order to select and evaluate the one that performs best on user reviews.

3. SIMILARITY MEASURES

The majority of user reviews on the social web can be described as noisy short text: short free text that often contains abbreviations, spelling mistakes, and informal language. Some work has been undertaken on evaluating similarity measures on short user-generated text. Baeza-Yates et al. [6] propose a method for recommending search engine queries based on similar query logs. To identify similar queries they applied cosine similarity on a variation of the term frequency-inverse document frequency (tf-idf) weighted Vector Space Model (VSM) that uses URL click popularity in place of the idf. Their results indicate that this particular similarity measure cannot be used as a criterion for making recommendations on its own. However, more sophisticated similarity measures exist and could potentially provide better results.

VSM is one of the first approaches to computing semantic similarity. However, it depends on exact matches between words present in queries and documents, and is therefore subject to problems such as polysemy and synonymy. Latent semantic models can moderate these issues using co-occurrences of words over the entire collection. The simplest and best known latent model is Latent Semantic Analysis (LSA) [7], where the term vectors are mapped into a lower-dimensional space based on singular value decomposition. LSA was compared with word and n-gram vectors by Lee et al. [8] on short text comments; the results indicated that LSA performs better than the word and n-gram vectors. Li et al. [9] propose a hybrid measure called STASIS that computes similarity between short texts using information from WordNet (http://wordnetweb.princeton.edu/perl/webwn), a large lexical database of nouns, verbs, adjectives, and adverbs grouped into sets of cognitive synonyms. The measure STS, proposed by Islam et al. [10], calculates similarity between short texts using string similarity along with corpus-based word similarity. Omiotis, proposed by Tsatsaronis et al. [11], is a measure of semantic relatedness between texts based on a word-to-word measure that uses semantic links between words from WordNet. LSA, STASIS, STS, and Omiotis were compared by Tsatsaronis et al. [11] on the task of measuring the similarity of sentences from a dictionary. Their results indicated that Omiotis had the highest Spearman's correlation with human judgments (r=0.8959), with LSA second best (r=0.8614). Based on these reported results, the availability of the measures, and the indicated performance of LSA on different short-text datasets, we decided to carry out an experiment to evaluate LSA's effectiveness in measuring the similarity of user reviews against human judgments. The experimental design and results follow.
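For readers who want to reproduce an LSA-style score locally, the following sketch builds a small latent space with truncated SVD over TF-IDF vectors and scores review pairs by cosine similarity in that space. It is only an approximation under our own assumptions; the experiment below uses the LSA implementation hosted at lsa.colorado.edu, not this code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsa_pair_scores(corpus, pairs, n_components=100):
    """Score review pairs in a small LSA space (illustrative only).

    corpus : list of review strings the model is fitted on
    pairs  : (i, j) index pairs into the corpus to score
    """
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    # LSA: project TF-IDF term vectors into a latent space via SVD.
    k = max(1, min(n_components, min(tfidf.shape) - 1))
    vectors = TruncatedSVD(n_components=k).fit_transform(tfidf)
    return [cosine_similarity(vectors[i:i + 1], vectors[j:j + 1])[0, 0]
            for i, j in pairs]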
4. EXPERIMENT

A three-phase experiment was carried out to evaluate the feasibility of the proposed system. We used movie reviews as a platform, specifically collecting data from RottenTomatoes, a movie review website that allows users to express their opinions about movies with a scalar rating and a text review. In the first phase, we investigate whether users' reviews contain useful content, that is, content that represents the underlying reasons why a rating was composed. In the second phase, we investigate whether reviews associated with the same rating refer to the same features, such as actors and plot. In the third phase, we evaluate the performance of LSA in measuring the similarity of user reviews.

4.1 Data

Our data sample consists of 80 short review pairs (each review under 200 characters). Each pair contains two reviews associated with the same score. Four score rating categories (0-15, 20-30, 35-45, 50-60) were used to obtain a broad representation of reviews with different scores. Our sample was extracted in the following steps: first, we randomly selected 100 reviews from each of the top 20 movies as ranked by the RottenTomatoes box office; second, we selected the 600 reviews that contained a reference to movie features; third, we randomly selected two reviews associated with the same score from each of the four categories. The third step was repeated for each of the top 20 movies.

4.2 Procedure

In the first phase of the experiment, the 2000 reviews were manually classified into two groups: those that reference movie features and those that do not. This classification was made in order to simplify the similarity judgment procedure, since a clear definition of similarity enabled participants to judge the similarity of pairs of reviews systematically. In the second phase, the main experiment was carried out by three participants independently. Participants were instructed to rate the similarity of the 80 pairs on a three-point scale: 1) not similar, 2) somehow similar, and 3) similar. Similarity was defined as the presence of the same movie features in a pair of reviews. Explicit directions for measuring similarity based on features were given, as shown in Table 1.

Table 1. Feature similarity scores and explanations
  Score 1 (not similar): the reviews don't contain a reference to the same feature(s)
  Score 2 (somehow similar): the reviews contain a reference to some of the same features
  Score 3 (similar): the reviews contain a reference to exactly the same feature(s)

In the third phase, we measured the correlation of the human similarity scores with scores produced by LSA. We report LSA values obtained from the LSA website (http://lsa.colorado.edu).

4.3 Results and Discussion

During the first phase, we identified that user movie reviews contain content that can be used for making recommendations. 600 of the 2000 reviews collected contained references to movie features, 1300 reviews contained general adjectives about the movie, and only 100 did not contain any useful content or were in a different language. This suggests that there are mainly two types of movie reviews: "feature-based" reviews, which refer to specific features, and "discussion-based" reviews, which describe the user's general opinion of the movie. This result is promising since it suggests that movie reviews contain content which can be used for making accurate recommendations.
For the second phase, an inter-rater reliability analysis using the Kappa statistic was performed to determine consistency among the raters. The inter-rater reliability was found to be Kappa=0.725 (p<0.001), 95% CI (0.631, 0.804), which indicates substantial agreement. As presented in Figure 2, 42 of the movie review pairs with the same rating (52.5%) were judged "not similar" (mode 1), 22 review pairs (27.5%) were judged "somehow similar" (mode 2), and 16 pairs (20%) were judged "similar" (mode 3). Thus, the majority of the comments were not similar, referring to completely different movie features despite identical ratings.

Figure 2. Number of review pairs per similarity category

Based on the assumption that users' reviews represent the reasons behind a rating, these results suggest that people compose their ratings based on different aspects of a film; thus, ratings alone may be an insufficient source of knowledge for making accurate recommendations.

Furthermore, a Spearman's rank correlation coefficient analysis was carried out to examine whether there is a relation between the mode of the human judgments and the LSA similarity scores. The results revealed a significant positive relationship (r=0.406, N=80, p<0.001), although the correlation was weak in strength. According to related work [11], LSA has a high correlation (r≈0.8) with human judgments when measuring similarity, which was the principal motivation for choosing this measure. The weak correlation found in this experiment may be due to the nature of the dataset: free text user reviews are noisy, and often include spelling mistakes, movie-specific terms, and abbreviations which LSA cannot recognize. In light of this, the performance of LSA could be enhanced using a larger dictionary containing terms frequently used in movie reviewing, such as movie features.

The initial results of this experiment suggest that user reviews are a promising source of knowledge for recommender systems. Moreover, the majority of pairs of randomly selected reviews with the same rating score (52.5%) are not similar: even if users rate a movie with the same score, this does not necessarily mean that each rating was based on similar reasons. Thus, current recommender systems using only ratings may lack accuracy. In addition, this experiment shows that while LSA performed well at judging similarity in related work, its performance on short user reviews is not sufficient.
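The phase-two and phase-three analyses can be reproduced with standard statistical tooling. The sketch below is purely illustrative: the arrays are made-up stand-ins for the real judgments, and the study's three raters would call for Fleiss' kappa rather than the pairwise Cohen's kappa shown here.

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical stand-ins for the study's data: per-pair similarity modes
# (1 = not similar, 2 = somehow similar, 3 = similar) and LSA scores.
human_modes = np.array([1, 3, 2, 1, 1, 2, 3, 1])
lsa_scores = np.array([0.10, 0.81, 0.44, 0.05, 0.22, 0.52, 0.73, 0.15])

# Phase two: inter-rater agreement for a pair of raters.
rater_a = np.array([1, 3, 2, 1, 1, 2, 3, 1])
rater_b = np.array([1, 3, 2, 2, 1, 2, 3, 1])
print("kappa:", cohen_kappa_score(rater_a, rater_b))

# Phase three: Spearman rank correlation of human judgments vs. LSA.
rho, p = spearmanr(human_modes, lsa_scores)
print(f"Spearman r={rho:.3f}, p={p:.3f}")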
5. CONCLUSION AND OUTLOOK

In this article we presented an initial investigation towards the development of a recommender system that makes recommendations based on user-generated free text reviews. The results of our small experiment show that reviews can be used as an alternative way of building user and item profiles for recommender systems, since reviews typically represent the opinion of a user about an item or some of its features. Moreover, the results show that users' similarity is not well reflected in current rating-based recommender systems: for any two user ratings of a particular item, there is a high possibility that each rating was based on different features of that item. However, replications of this experiment using a larger dataset and more participants are needed to validate these results. Additionally, a limitation of this experiment was its focus only on "feature-based" reviews; further experiments will be conducted to determine which reviews are most useful for recommendations and how to extract them automatically. The results also indicated that LSA performs weakly in measuring the similarity of short user-generated reviews. Further investigation into the reasons for this weak performance is needed, along with the evaluation of other approaches.

In conclusion, additional work must be carried out to evaluate the feasibility of the proposed system and establish its effectiveness over conventional rating-based recommender systems. Our preliminary results, however, do lend weight to the idea that item reviews, one of the most prominent features of the social web, offer a natural source of rating information. While still work in progress, our preliminary results constitute a useful addition to the current line of research in this field, while complementing research in the field of similarity measures.

6. REFERENCES

[1] Lathia, N., Hailes, S., and Capra, L. 2008. The effect of correlation coefficients on communities of recommenders. In Proceedings of SAC 2008, 2000-2005.
[2] Lakiotaki, K., Matsatsinis, N.F., and Tsoukiàs, A. 2011. Multicriteria user modeling in recommender systems. IEEE Intelligent Systems, 2011, 64-76.
[3] Wang, Y., Stash, N., Aroyo, L., Hollink, L., and Schreiber, G. 2009. Semantic relations for content-based recommendations. In Proceedings of K-CAP 2009, 209-210.
[4] Jakob, N., Weber, S.H., Müller, M.-C., and Gurevych, I. 2009. Beyond the stars: exploiting free-text user reviews to improve the accuracy of movie recommendations. In Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion.
[5] Long, C., Zhang, J., Huang, M., Zhu, X., Li, M., and Ma, B. 2009. Specialized review selection for feature rating estimation. In Proceedings of the IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT '09.
[6] Baeza-Yates, R., Hurtado, C., and Mendoza, M. 2004. Query recommendation using query logs in search engines. Current Trends in Database Technology, EDBT 2004.
[7] Landauer, T.K., and Dumais, S.T. 1997. A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge.
[8] Lee, M.D., Pincombe, B., and Welsh, M. 2005. An empirical evaluation of models of text document similarity. In Proceedings of the 27th Annual Conference of the Cognitive Science Society, 1254-1259.
[9] Li, Y., McLean, D., Bandar, Z., O'Shea, J., and Crockett, K.A. 2006. Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering.
[10] Islam, A., and Inkpen, D. 2008. Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data.
[11] Tsatsaronis, G., Varlamis, I., and Vazirgiannis, M. 2010. Text relatedness based on a word thesaurus. Journal of Artificial Intelligence Research (JAIR), 2010, 1-39.

A Multi-Criteria Evaluation of a User-Generated Content Based Recommender System

Sandra Garcia Esparza, Michael P. O'Mahony, Barry Smyth
CLARITY: Centre for Sensor Web Technologies
School of Computer Science and Informatics
University College Dublin, Ireland
{sandra.garcia-esparza, michael.omahony, barry.smyth}@ucd.ie

ABSTRACT
The Social Web provides new and exciting sources of information that may be used by recommender systems as a complementary source of recommendation knowledge. For example, User-Generated Content, such as reviews, tags, comments, and tweets, can provide a useful source of item information and user preference data, if a clear signal can be extracted from the inevitable noise that exists within these sources.
In previous work we explored this idea, mining term-based recommendation knowledge from user reviews, to develop a recommender that compares favourably to conventional collaborative-filtering-style techniques across a range of product types. However, this previous work focused solely on recommendation accuracy, and it is now well accepted in the literature that accuracy alone tells just part of the recommendation story. For example, for many, the promise of recommender systems lies in their ability to surprise with novel recommendations for less popular items that users might otherwise miss. This makes for a riskier recommendation prospect, of course, but it could greatly enhance the practical value of recommender systems to end-users. In this paper we analyse our User-Generated Content (UGC) approach to recommendation using metrics such as novelty, diversity, and coverage, and demonstrate superior performance, when compared to conventional user-based and item-based collaborative filtering techniques, while highlighting a number of interesting performance trade-offs.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Algorithms, Experimentation

Keywords
Recommender Systems, User-Generated Content, Performance Metrics

1. INTRODUCTION

Recommender systems allow users to discover information, products and services by predicting their needs based on the past behaviour of like-minded users. Typically these systems can be classified into three main categories: collaborative filtering (CF), content-based (CB), and hybrid approaches. In CF approaches [4, 7], users are recommended items that users with similar interests have liked in the past, where users' interests in items are represented by ratings. In contrast, in CB approaches [14], users are recommended items that are similar to those the user liked in the past, where item descriptions (e.g. movies can be described using metadata such as actors, genres, etc.) are used to measure the similarity between items. Finally, researchers have looked at the potential of combining CF and CB approaches as the basis for hybrid recommendation strategies [5]. However, one of the problems with these systems is that they need sufficient amounts of data in order to provide useful recommendations, and sometimes neither ratings nor metadata are available in such quantities. For this reason researchers have started looking into additional sources of recommendation data.

In the last few years, the Social Web has experienced significant growth, with the emergence of new services, such as Twitter, Flixster and Foursquare, whose users collectively generate very large volumes of content in the form of micro-blogs, reviews, ratings and check-ins. These rich sources of information, namely User-Generated Content (UGC), which sometimes relate to products and services (such as movie reviews or restaurant check-ins), are becoming increasingly plentiful, and researchers have already started to utilise this content for the purposes of recommendation.

Here we focus on UGC in the form of product reviews (see, for example, Figure 1). We believe this type of information offers important advantages in comparison to other sources of recommendation knowledge such as ratings and metadata. For instance, when reviewing products, users often discuss particular aspects that they like about them (e.g. "Tom Hanks and Tim Allen at their best"), as well as commenting on general interests (e.g. "I love animation"), which is not always reflected in other sources of recommendation knowledge. In this sense, in a collaborative filtering approach, two users that have rated movies similarly are treated as people with similar interests.
However, it often happens that users like the same movies for different reasons. For example, one user may have rated a movie with a high score because they loved the special effects, while the other rated the same movie highly because they loved the plot and the actors' performances. In a similar way, a content-based approach usually relies on product descriptions to draw item similarities, but it does not consider how the user feels about each of these descriptions. A solution to this problem is to consider multi-criteria ratings, where different aspects of a product (or service) are rated separately [2]. For example, when rating restaurants on TripAdvisor, users can rate along the dimensions of service, food, value and atmosphere. One of the disadvantages of this approach is that these multi-criteria ratings tend to be predefined and thus can restrict users from commenting on other aspects of the product. Social tagging provides a solution to this by allowing users to associate tags with content. In [18] these tags, which reflect users' preferences, are introduced into recommender algorithms called tagommenders, and results show that these algorithms perform better than state-of-the-art algorithms. [8] also comments on the benefits of tag-based user profiles and investigates different techniques to build condensed and optimised profiles. However, not many systems provide this social tagging functionality; users' ratings and reviews are more common and abundant, and for this reason we believe it is important to consider them for recommendation purposes.

The most common way to evaluate recommenders is to measure how accurate they are at predicting items that users like. However, one of the problems with evaluating the accuracy of top-N recommendation lists is that current metrics (such as precision and recall) reward algorithms that are able to accurately predict test set items (typically constructed by selecting a random subset of users' known ratings), while failing to consider items that are not present in test sets but which users may in fact like. Hence current recommendation accuracy evaluation techniques are limited, which may result in promising new algorithms being labelled as poor performers. Further, it has been shown that accuracy on its own is insufficient to fully measure the utility of recommendations [19]. Other performance criteria, such as the novelty and diversity of recommended lists, are now acknowledged as also being important when it comes to user satisfaction. In addition, the ability to recommend as many products as possible, known as coverage, is also a desirable system property. While these performance criteria have been considered in the past [20, 24, 25], they have been less explored in the context of UGC-based recommenders.
In past work we implemented a recommendation approach where UGC in the form of micro-reviews was used as the source of recommendation knowledge [9]. An evaluation performed on 4 different domains showed that UGC, while inherently noisy, provided a useful recommendation signal and outperformed a variation of a collaborative-filtering-based approach. Here, we expand on this work by considering additional performance criteria, such as novelty, diversity and coverage, and we compare the performance of our approach with traditional user-based and item-based collaborative filtering approaches [4, 7].

2. RELATED WORK

UGC has been leveraged by recommender systems for different purposes, such as enriching user profiles or extracting ratings from reviews using sentiment analysis techniques. In addition, in the last few years it has been shown that accuracy on its own is insufficient to fully measure the utility of recommendations, and new metrics have been proposed. Here, we provide an overview of some of the work that has been carried out in these areas.

Figure 1: A Flixster review of the movie 'Toy Story'.

2.1 Review-based Recommendations

Recent work has focused on leveraging UGC in the form of reviews for recommendations. For example, a methodology to build a recommender system which leverages user-generated content is described in [23]. Although an evaluation is not performed, the authors propose a hybrid of a collaborative filtering and a content-based approach to recommend hotels and attractions, where the collaborative filtering component uses the review text to compute user similarities in place of traditional preference-based similarity computations. They also comment on the advantages of using user-generated content for recommender systems, such as providing a better rationale for recommended products and increasing user trust in the system.

An early attempt to build a recommender system based on user-generated review data is described in [1]. Here, an ontology is used to extract concepts from camera reviews, and recommendations are provided based on users' requests about a product; for example, "I would like to know if Sony361 is a good camera, specifically its interface and battery consumption". In this case, the features interface and battery are identified, and for each a score is computed according to the opinions (i.e. polarities) of other users and presented to the user. Similar ideas are described in [3], which looks at using user-generated movie reviews from IMDb in combination with movie metadata (e.g. keywords, genres, plot outlines and synopses) as input for a movie recommender system. Their results show that user reviews provide the best source of information for movie recommendations, followed by movie genre data. In addition, in [12, 15], the number of ratings in a collaborative filtering system is increased by inferring new ratings from user reviews using sentiment analysis techniques. While [15] generate ratings for Flixster reviews by extracting the overall sentiment expressed in the review, [12] extract features and their associated opinions from IMDb reviews and create a rating by averaging the opinion polarities (i.e. positive or negative) across the various features. Both approaches achieve better performance when using the ratings inferred from reviews than when using ratings predicted by traditional collaborative filtering approaches.
2.2 Beyond Accuracy Metrics

Typically, the performance evaluation of recommendation algorithms is done in terms of accuracy, which measures how well a recommender system can make recommendations. However, it has been shown that accuracy on its own is insufficient to fully measure the utility of recommendations, and over the last few years new metrics have been proposed [13, 19]. Here we are interested in coverage, novelty and diversity.

Coverage is a measure of the domain of items in the system over which the recommender system can form predictions or make recommendations [10]. Sometimes algorithms can provide highly accurate recommendations, but only for a small portion of the item space. Systems with poor coverage may be capable of recommending only well-known or popular products, while less mainstream products (belonging to the long tail) are rarely if ever recommended, resulting in a poor experience for recommendation consumers and providers alike. For this reason it is useful to examine accuracy and coverage in combination; a "good" recommender will achieve high performance along both dimensions. Novelty measures how new or different recommendations are from a user's perspective. For instance, a movie recommender that keeps suggesting movies the user is already aware of is unlikely to be very useful, although the recommendations may well have high accuracy. Diversity refers to how dissimilar the products in recommendation lists are. For example, a music recommendation list where most albums are from the same band will have low diversity, and such a situation is not desirable. Indeed, in [20] it is argued that diversity can sometimes be as important as similarity.

Current research focuses on improving the novelty and diversity of recommendations without sacrificing accuracy. For instance, in [24] the authors suggest an approach to recommending novel items by partitioning the user profile into clusters of similar items. Further, in [25], the authors introduce diversity into their recommendation lists, and results show that although the new lists exhibit a decrease in accuracy, users are more satisfied with the diversified lists. UGC in the form of tags has also been used to introduce diversity into recommendation lists, and results showed that this method was also able to improve accuracy [22].

The goal of this paper is to explore the advantages of using UGC in the form of reviews in a recommender system. Similar work has been discussed above; however, those approaches have been evaluated in terms of accuracy only, and other metrics, which have proven to be equally important, have not been taken into account. In this paper we provide an evaluation of a UGC-based recommendation approach which considers metrics that go beyond accuracy; in particular we are interested in the properties of novelty, diversity and coverage.

3. METHODOLOGY

The goal of this paper is to evaluate our UGC-based recommender through a multi-criteria evaluation. The recommender is similar to that described in our previous work [9], where UGC in the form of product micro-reviews is used as the source of recommendation knowledge. In particular, we propose an index-based approach where users and items are represented by the terms used in their associated reviews. This section describes two variants of our index-based approach and two variants of collaborative recommenders, which are used as benchmark techniques.
3.1 Index-based Approaches

Figure 2: An index-based approach for recommendation.

Our approach to recommending products to users is an index-based approach in which products and users are represented by the terms in their associated reviews. In this initial work we only consider terms from positive user reviews (i.e. reviews associated with a rating greater than 3 on a 5-point scale), since in such reviews users tend to talk about the aspects of products that they like. This poses some obvious limitations, discussed in Section 5, which will be addressed in future work. Our recommendation approach involves the creation of a product index, where each product $P_i$ can be viewed as a document made up of the set of terms, $t_1, \ldots, t_n$, used in its associated user reviews, $r_1, \ldots, r_k$, as per Equation 1:

$$P_i = \{r_1, \ldots, r_k\} = \{t_1, \ldots, t_n\} \qquad (1)$$

Using techniques from the information retrieval community, we can weight the terms associated with a particular product according to how representative they are of that product. In this work we have used the well-known TFIDF approach [17] to term weighting (Equation 2). Briefly, the weight of a term $t_j$ in a product $P_i$, with respect to some collection of products $\mathcal{P}$, is proportional to the frequency of occurrence of $t_j$ in $P_i$ (denoted by $n_{t_j,P_i}$), but inversely proportional to the frequency of occurrence of $t_j$ in $\mathcal{P}$ overall, thus giving preference to terms that help to discriminate $P_i$ from the other products in the collection. We use Lucene (http://lucene.apache.org/) to provide this indexing, term-weighting and retrieval functionality.

$$\mathrm{TFIDF}(P_i, t_j, \mathcal{P}) = \frac{n_{t_j,P_i}}{\sum_{t_k \in P_i} n_{t_k,P_i}} \times \log \frac{|\mathcal{P}|}{|\{P_k \in \mathcal{P} : t_j \in P_k\}|} \qquad (2)$$
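The paper obtains this weighting from Lucene; purely for illustration, the sketch below reproduces Equation 2's weighting and the profile-as-query retrieval described next in plain Python. The names tfidf_index and recommend are hypothetical, and the token lists stand in for preprocessed review text.

import math
from collections import Counter

def tfidf_index(products):
    """Equation 2 weights for a toy product index.

    products : dict mapping product id -> list of review tokens
    (assumed already stemmed, with stop words removed).
    """
    n = len(products)
    df = Counter(t for terms in products.values() for t in set(terms))
    index = {}
    for pid, terms in products.items():
        counts, total = Counter(terms), len(terms)
        index[pid] = {t: (c / total) * math.log(n / df[t])
                      for t, c in counts.items()}
    return index

def recommend(profile_terms, index, top_n=10):
    """Use a user profile as a query; return the best-matching products."""
    scores = {pid: sum(weights.get(t, 0.0) for t in profile_terms)
              for pid, weights in index.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

A real deployment would keep Lucene's inverted index and scoring; the point here is only the shape of Equation 2 and of the query step.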
user interests extracted by analysing that user’s Twitter messages). This independence allows for cross-domain possibilities which in turn can be used to mitigate the cold start problem. 3.2 User-based CF (UBCF) [4]. In order to provide a topN list of recommended items for a target user, the k most similar users (neighbours) to that user are selected using cosine similarity. Then, the union of the products rated by each of the neighbours (less those already present in the target user’s profile) is returned, ordered by descending frequency of occurrence. The top-N products are returned as recommendations. Item-based CF (IBCF) [7]. In this case, recommended products are generated by first taking the union of the k most similar products (using cosine similarity) to each of the products in the target user’s profile, again ordered by descending frequency of occurrence. After removing the products already present in the target user’s profile, the top-N products are selected as recommendations. 4.1 EVALUATION Metrics We use four different metrics in order to evaluate our index-based recommendation approach. Accuracy measures the extent to which the system can predict the items that the users like. We measure this in terms of the F1 metric (Equation 3), which is the harmonic mean of precision and recall [21]. 2 PN i=1 (1 − popularity(i)) . (4) Novelty = N We define the diversity of a top-N list of recommended products as the average of their pairwise dissimilarities [20]. Let i and j denote two products and let dissimilarity(i, j) = 1 − similarity(i, j) and, assuming a symmetric similarity measure, diversity is given by: Collaborative Filtering Approaches We study two variations of collaborative filtering (CF): user-based and item-based techniques. For these techniques, entries in the user-product ratings matrix consist of 1 (if the user likes a product, i.e. has assigned a rating of ≥ 4) or the special symbol ⊥ (meaning that the user has not reviewed the product or has assigned a rating of ≤ 3). 4. 2 × Precision × Recall . (3) Precision + Recall Novelty measures how new or different recommended products are to a user. Typically, the most popular items in a system are the ones that users will be the most familiar with. Likewise, less mainstream items are more likely to be unknown by users. In related work novelty is often based on item popularity [6, 25]. Here we follow a similar approach and compute the novelty of a product as one minus its popularity; where the popularity (popularity(i)) of a product, i, is given by the number of reviews submitted for the product divided by the maximum number of reviews submitted over all products. Hence, the novelty of a top-N list of recommended products is computed as the average of each product’s novelty as per Equation 4. F1 = Nouns and adjectives are extracted from review text using the Stanford Parser (http://www-nlp.stanford.edu/ software/lex-parser.shtml). Diversity = 2 PN−1 PN i=1 j=i+1 1 − similarity(i, j) N × (N − 1) . (5) The similarity, similarity(i, j), between two products is computed using cosine similarity on the corresponding columns of the ratings matrix; for normalisation purposes, this is divided by the maximum similarity obtained between two products. Similar trends were obtained by computed similarity over the documents in the product index. Finally, the ability to make recommendations for as many products as possible is also a desirable system property. 
4.2 Dataset and Methodology

In this paper we consider Flixster (http://www.flixster.com) as our source of data. Flixster is an online movie website where users can rate movies and also write reviews about them. We selected reviews authored in English only and performed some standard preprocessing on the reviews, such as removing stop-words, special symbols, digits and multiple character repetitions (e.g. we reduce cooool to cool). Further, we selected users and movies with at least 10 associated positive reviews. This produced a total of 43179 reviews (and ratings) by 2157 users on 763 movies. The average number of reviews per user is 20, and per movie, 57.

To evaluate each algorithm, we first randomly split each user's reviews and ratings into training (60%) and test (40%) sets. Second, we create the product index or ratings matrix using the training data. Then, for each user, we produce a top-N list of recommendations using the approaches described in Section 3 and compute accuracy using the test set. We also compute diversity, novelty and coverage as described in Section 4, and compute averages across all users. We repeated this procedure five times and again averaged the metrics. We note that when generating user recommendations using the index-based approaches, we first remove the reviews in each user profile from the product index. For the CF approaches, we performed evaluations using different neighbourhood sizes and found that the best accuracy for UBCF and IBCF was achieved for k = 200 and k = 100, respectively. These are the values used when comparing the CF algorithms against the index-based approaches.

Figure 3: (a) Accuracy, (b) novelty, (c) diversity and (d) coverage for the index-based and collaborative filtering approaches.

4.3 Results

The results shown in Figure 3 indicate that the collaborative filtering approaches outperformed both index-based approaches in terms of accuracy, with user-based CF performing best overall. The index-based approach using nouns and adjectives (IB+) actually performed slightly worse than the standard bag-of-words approach (IB), which indicates a loss of information when only the selected terms are used. Although the accuracy results may seem discouraging at first, our index-based approaches outperformed both CF approaches in terms of diversity, novelty and coverage. The worst performing approach in terms of coverage was IBCF (only 63.23%), while both IB and IB+ achieved in excess of 90% coverage. In terms of novelty, IB+ was the best approach, with 63% novelty for top-10 recommended lists, followed by IB and IBCF, with UBCF providing the poorest novelty (approximately 34%). In terms of diversity, the index-based approaches performed significantly better, with IB+ providing 87% diversity for top-10 recommended lists, compared to 77% for the best CF approach (IBCF).
It is interesting to note that in the above results, the neighbourhood size (k) for both CF approaches was tuned to deliver the best accuracy performance. An interesting question that arises is how well the CF approaches would perform if neighbourhood sizes were tuned according to other performance criteria (e.g. novelty or diversity): would the recommendation accuracy provided by CF still outperform the index-based approaches? To answer this question, we repeated the above analysis using different neighbourhood sizes for CF. In particular, we computed accuracy versus novelty and diversity for the two CF approaches and compared them with the IB+ approach.

Figure 4: Novelty vs. Accuracy for (a) UBCF and (b) IBCF, and Diversity vs. Accuracy for (c) UBCF and (d) IBCF.

Figure 5: Novelty vs. Coverage for (a) UBCF and (b) IBCF, and Diversity vs. Coverage for (c) UBCF and (d) IBCF.

Results are presented in Figure 4 and show that by reducing the neighbourhood size, CF can achieve better diversity and novelty than when using a bigger neighbourhood size. However, this comes at the cost of reduced accuracy. In fact, for neighbourhood sizes of k = 10, the novelty and diversity performance of the CF approaches is closest to that achieved by the IB+ approach, but at the cost of poorer (in the case of UBCF) or almost equivalent (in the case of IBCF) accuracy compared to IB+. For comparison purposes, we also show novelty and diversity versus coverage in Figure 5. It can be seen that, for the CF approaches, higher coverage is achieved when using larger neighbourhood sizes, although neither CF approach can beat the coverage achieved by the IB+ approach. For the CF approaches, the results in Figure 5 are also interesting in that they show a clear trade-off between optimising for coverage performance on the one hand (larger neighbourhood sizes) and optimising for novelty and diversity performance on the other (smaller neighbourhood sizes).

Further, examining the results in Figures 4 and 5 together, although similar accuracy to IB+ was achieved by the CF approaches when using a small neighbourhood (k = 10), we can also see that the coverage achieved at k = 10 is only 9% and 11% for UBCF and IBCF, respectively. This scenario is never desirable, since 90% of the system's items cannot be recommended, showing that if we want to maintain coverage while achieving high levels of accuracy, novelty and diversity, then the UGC-based approaches are preferable to both CF approaches in this evaluation setting. Hence we can conclude that the index-based approaches compare quite favourably to the collaborative filtering techniques when the range of performance metrics evaluated in this work is taken into consideration. This is a noteworthy result, and underlines the potential of the UGC-based recommenders described in this paper.

5. DISCUSSION AND FUTURE WORK

While past work has focused on improving recommendation accuracy, recent work has shown that other metrics need to be explored in order to improve the user experience.
In fact, algorithms which recommend novel items are likely to be more useful to end-users than those recommending more popular items. Such algorithms may, for example, introduce users to an entirely new space of items, regardless of the rating that they might actually give to said novel items. Further, a live evaluation performed by [25] showed that users preferred more diverse lists over more accurate (and less diverse) lists.
In this paper we consider a multi-criteria performance evaluation and, although trade-offs exist for all evaluated approaches, we believe the findings indicate that the UGC-based approach offers the best trade-off between all metrics and algorithms considered. For example, in order to achieve levels of novelty and diversity with the CF approaches similar to those achieved by the UGC-based approach, a significant loss in coverage must be accepted. We believe a reason for the higher novelty and diversity performance achieved by the UGC-based approach is that profiles created using this technique often reflect particular aspects and topics that users are interested in, allowing for more diverse (and often novel) recommendation lists compared to using ratings alone. For example, even if a user rated only science fiction movies, there will be specific aspects that differentiate this user from another who is also a fan of this genre (e.g. one may prefer aliens and strange creatures while the other might prefer the more romantic elements in the storyline).
There is an interesting range of future work to be carried out in order to improve our current approach and to explore other benefits of UGC content:
• Enhancing the Real-Time Web. In this paper we used UGC in the form of long-form movie reviews. However, we are also interested in UGC in the form of Real-Time Web data (e.g. Twitter messages), which captures users' preferences in real-time. For instance, people often post messages about the movies they liked, their favourite football team or their dream vacation experience. This data facilitates the building of rich user profiles, which in turn allows recommenders to better address users' needs. In fact, in past work [9], we demonstrated that micro-blogging messages can provide a useful recommendation signal despite their short form and inconsistent use of language. Further, as discussed above, in our index-based approach user profiles can be independent from the product index, allowing us to use a product index from one particular source (e.g. Flixster) with user profiles from another source (e.g. Twitter). This allows for cross-domain possibilities which will be explored in future work.
• Improved index-based approach. One of the limitations of our approach is that it is based on positive reviews. In future we will also consider negative reviews, which may be useful to avoid recommending certain products to users. Another limitation is that positive reviews may have negative aspects (e.g.
‘I don’t generally like romantic comedies but I loved this one’) or negative reviews may have positive aspects (e.g. ‘Susan Sarandon was the only good thing in this movie’). To address this problem, we will extend our approach by using feature extraction techniques together with sentiment analysis [11, 16] in order to create richer user profiles and product indexes. Choosing the optimum terms to represent users and items is also a problem to be solved in order to reduce the sparsity of the term-based profiles. Evaluating the effect of these improvements on the various performance metrics will also be carried out in future work.
• The cold-start problem. In future work we will also explore how UGC can help in solving (or at least mitigating) the well-known cold-start problem, which is related to the new user and new item problems. One of the advantages of using UGC as recommendation knowledge is that it facilitates a cross-domain approach for users who have not reviewed any products from a particular product index. If data relating to such users can be sourced from other domains (e.g. Twitter or Facebook feeds), then they can still benefit from recommendations. Further, a system which does not have reviews for particular products could provide recommendations for these products by building an index based on reviews from another system.
• Integrating UGC in traditional recommenders. Collaborative filtering algorithms have proven to be effective when there is a sufficient amount of ratings available, but their performance decreases when the number of ratings is limited. Our work shows preliminary evidence that UGC-based approaches have the potential to complement recommendation knowledge in the form of ratings and to improve the response of recommender systems to data sparsity. In future work we will study the performance of a hybrid recommender that benefits from the strengths of multiple data sources.

6. CONCLUSIONS
In this paper we have considered an alternative source of recommendation knowledge based on user-generated product reviews. Our findings indicate that recommenders utilising this source of knowledge can deliver recommendation performance comparable to traditional CF-based techniques across a range of criteria. In future work we would like to study other properties of UGC for recommendation, such as the ability to address the well-known cold-start problem. Further, despite the simplicity of the current approach, our results are promising, and further improvements can be incorporated in order to increase recommendation accuracy while maintaining high levels of coverage, novelty and diversity.

7. ACKNOWLEDGMENTS
Based on work supported by Science Foundation Ireland, Grant No. 07/CE/I1147.

8. REFERENCES
[1] S. Aciar, D. Zhang, S. Simoff, and J. Debenham. Recommender system based on consumer product reviews. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI-IATW ’06), pages 719–723, Washington, DC, USA, 2006. IEEE Computer Society.
[2] G. Adomavicius and Y. Kwon. New recommendation techniques for multicriteria rating systems. IEEE Intelligent Systems, 22(3):48–55, 2007.
[3] S. Ahn and C.-K. Shi. Exploring movie recommendation system using cultural metadata. Transactions on Edutainment II, pages 119–134, 2009.
[4] J. S. Breese, D. Heckerman, and C. M. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In G. F. Cooper and S.
Moral, editors, Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI ’98), pages 43–52. Morgan Kaufmann, 1998.
[5] R. Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331–370, 2002.
[6] O. Celma and P. Herrera. A new approach to evaluating novel recommendations. In Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys ’08), pages 179–186, New York, NY, USA, 2008. ACM.
[7] M. Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACM Transactions on Information Systems, 22(1):143–177, 2004.
[8] C. S. Firan, W. Nejdl, and R. Paiu. The benefit of using tag-based profiles. In Proceedings of the 2007 Latin American Web Conference, pages 32–41, Washington, DC, USA, 2007. IEEE Computer Society.
[9] S. Garcia Esparza, M. P. O’Mahony, and B. Smyth. Effective product recommendation using the real-time web. In Proceedings of the 30th International Conference on Artificial Intelligence (SGAI ’10), 2010.
[10] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1):5–53, January 2004.
[11] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’04), pages 168–177, New York, NY, USA, 2004. ACM.
[12] N. Jakob, S. H. Weber, M. C. Müller, and I. Gurevych. Beyond the stars: exploiting free-text user reviews to improve the accuracy of movie recommendations. In Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion (TSA ’09), pages 57–64, New York, NY, USA, 2009. ACM.
[13] S. M. McNee, J. Riedl, and J. A. Konstan. Making recommendations better: an analytic model for human-recommender interaction. In CHI ’06 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’06), pages 1103–1108, New York, NY, USA, 2006. ACM.
[14] M. J. Pazzani and D. Billsus. Content-based recommendation systems. In P. Brusilovsky, A. Kobsa, and W. Nejdl, editors, The Adaptive Web: Methods and Strategies of Web Personalization, pages 325–341. Springer-Verlag, Berlin, Heidelberg, 2007.
[15] D. Poirier, I. Tellier, F. Françoise, and S. Julien. Toward text-based recommendations. In Proceedings of the 9th International Conference on Adaptivity, Personalization and Fusion of Heterogeneous Information (RIAO ’10), Paris, France, 2010.
[16] A.-M. Popescu and O. Etzioni. Extracting product features and opinions from reviews. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT ’05), pages 339–346, Morristown, NJ, USA, 2005. Association for Computational Linguistics.
[17] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., 1986.
[18] S. Sen, J. Vig, and J. Riedl. Tagommenders: connecting users to items through tags. In Proceedings of the 18th International Conference on World Wide Web (WWW ’09), pages 671–680, New York, NY, USA, 2009. ACM.
[19] G. Shani and A. Gunawardana. Evaluating recommendation systems. Recommender Systems Handbook, pages 257–298, 2009.
[20] B. Smyth and P. McClave. Similarity vs. diversity. In Proceedings of the 4th International Conference on Case-Based Reasoning: Case-Based Reasoning Research and Development (ICCBR ’01), pages 347–361, London, UK, 2001. Springer-Verlag.
[21] C. J. van Rijsbergen.
Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 1979.
[22] C. Wartena and M. Wibbels. Improving tag-based recommendation by topic diversification. In Proceedings of the 33rd European Conference on Advances in Information Retrieval (ECIR ’11), pages 43–54, Berlin, Heidelberg, 2011. Springer-Verlag.
[23] R. T. A. Wietsma and F. Ricci. Product reviews in mobile decision aid systems. In Pervasive Mobile Interaction Devices (PERMID 2005), pages 15–18, Munich, Germany, 2005.
[24] M. Zhang and N. Hurley. Statistical modeling of diversity in top-n recommender systems. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT ’09), volume 01, pages 490–497, Washington, DC, USA, 2009. IEEE Computer Society.
[25] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In Proceedings of the 14th International Conference on World Wide Web (WWW ’05), pages 22–32, New York, NY, USA, 2005. ACM.

Personalized Recommendation by Example in Social Annotation Systems
Jonathan Gemmell, Thomas Schimoler, Bamshad Mobasher, Robin Burke
Center for Web Intelligence, School of Computing, DePaul University, Chicago, Illinois, USA
jgemmell,tschimoler,mobasher,[email protected]

ABSTRACT
Resource recommendation by example allows users to explore new resources similar to an example or query resource. While common in many contemporary Internet applications, this function is not commonly personalized. This work looks at a particular type of Internet application: social annotation systems, which enable users to annotate resources with tags. In these systems recommendation by example occurs naturally, as users often navigate through the resource space by clicking on resources. We propose cascading hybrids to combine personalized and non-personalized approaches. Our extensive evaluation on three real-world datasets reveals that personalization is indeed beneficial, that cascading hybrids can effectively integrate personalized and non-personalized recommenders, and that the characteristics of the underlying data influence the effectiveness of the hybrids.

1. INTRODUCTION
The discovery of new and interesting resources remains a central role of the World Wide Web. As the Web has evolved to encompass new forms of social interaction, so too have new forms of resource interaction developed. In the so-called Social Web, users rate, annotate, share, upload, promote and blog about online resources. This abundance of data offers new ways to model resources as well as users, and injects new life into old resource discovery tasks.
In this work we focus on a particular type of resource discovery, recommendation by example, in which a user asks for resources similar to an example. Users in the movie domain may ask for more movies like “The Godfather.” The recommendation engine is then responsible for generating a list of similar movies, such as “The Deer Hunter.” The user could then select this new film, producing a new recommendation set. In this manner the user can discover new movies as he navigates through the resource space. This type of functionality is common in today’s Internet. When
a user views an item in Amazon1, Netflix2 or LastFM3, the system often presents the user with related items. However, these functions are often not personalized. While the example (or query) resource is clearly a necessary input to a recommendation by example algorithm, we assert that the user profile is equally important. Two users may both select “The Godfather.” However, one might be more interested in crime drama, while the other is particularly interested in Marlon Brando. Recommendation by example algorithms must take into account the user’s preferences in order to maximize the benefit of the system.
In this paper we focus on a particular type of system, the social annotation system, in which users interact with the application by annotating resources with tags. Users are drawn to these applications due to their low entry barrier and freedom from preconceived hierarchies; users can annotate a resource with any tag they wish. The result is a rich landscape of users, resources and tags.
For our experimental study we make the assumption that if a user annotates two resources with identical tags then the two resources are, in the eyes of that user, similar. This assumption permits our evaluation of personalized recommendation by example algorithms. We evaluate several component recommenders, some of which ignore the user profile and some of which ignore the example resource. We then construct cascading hybrid recommenders to exploit the benefits of both. A recommendation list of size k is produced by one algorithm. That list is then reordered by a second algorithm and n elements are presented to the user.
Our extensive evaluation conducted on three real-world datasets found that personalization improves the effectiveness of recommendation by example algorithms. Cascading hybrid recommenders effectively incorporate algorithms that focus on the user profile with those that rely on the example resource. Finally, differences in the underlying characteristics of the social annotation systems require different combinations of component recommenders to maximize the effectiveness of the cascading hybrid.
In the next section we explore related work. In Section 3, we formalize the notion of personalized resource recommendation by example, define six algorithms based on cosine similarity and collaborative filtering, and present our cascading hybrid models. Our experimental results and evaluation follow in Section 4. We conclude the paper with a general discussion of our results.

1 www.amazon.com
2 www.netflix.com
3 www.lastfm.com

Figure 1: An example of resource recommendation by example: similar artists to Radiohead are recommended in LastFM. If the user were to select Arcade Fire from the list, a new list of similar artists would be presented. In this manner the user can explore the resource space by moving from resource to resource.

2. RELATED WORK
The notion of recommendation by example is a core component of information retrieval systems, particularly in the domain of e-commerce recommenders [27, 32]. Early approaches include association rule mining [1] and content-based classification [17]. Content-based filtering has been combined with collaborative filtering in several ways [2, 3, 22] in order to improve prediction effectiveness for personalized retrieval.
More generally, hybrid recommender systems [4] have been shown to be an effective method of drawing out the best performance among several independent component algorithms. This work draws from these prior efforts in applying a hybrid recommender to the domain of social annotation systems and specifically accommodating a recommendation by example query.
There has been considerable work on the general recommendation problem in social annotation systems. A generalizable latent-variable retrieval model for annotation systems [31] can be used to determine resource relevance for queries of several forms. Tagging data has been combined with classic collaborative filtering in order to further filter a user’s domain of interest [28]. More recently, several techniques [12, 13, 18] have built upon and refined this earlier work. None of these approaches, however, deal with the possibility of resources themselves as queries.
Some work has focused on resource-to-resource comparison in social annotation, although little in the way of direct recommendation. Some have considered the problem of measuring the similarity of resources (as well as tags) in a social annotation system by various means of aggregation [19]. An author-topic latent variable model has been used in order to determine web resources with identical functionality [23]. These approaches do not, however, specifically seek to recommend resources to a particular user, but rather simply enable resource discovery utilizing the annotation data.
Our own previous work regarding annotation systems has focused on the use of tag clusters for personalized recommendation [11, 30] and linear weighted hybrid recommenders for both tag [6] and resource [7, 8, 10] recommendation. Here we extend our examination to cascading hybrids composed of simple components for the specific problem of recommendation by example.

3. RECOMMENDATION BY EXAMPLE
Recommenders are a critical component of today’s Web applications, reducing the burden of information overload and offering the user a personalized view of the information space. One such example is recommendation by example, in which users select an example resource and view additional resources which are similar. Figure 1 illustrates a common paradigm. A user is viewing information about the band Radiohead. Near the bottom several other bands are displayed under the heading “similar artists.” A user might click on the link to Arcade Fire and discover yet more bands. In this way the user can explore the resource space, jumping from artist to artist. Similarity in such a system can be measured based on content features or co-occurrences in user transactions. In social annotation systems, like that shown in Figure 1, similarity is often measured by the correlation of tags assigned to the resources.
Many Web sites offer the functionality of resource recommendation by example. However, it is often not personalized. Personalization focuses the results based on the interests of the user. For example, the system may know that the user is fond of rich vocals or heavy bass. By incorporating these preferences into the recommendation, the system can better serve the user.
In this work we seek to explore the impact of personalization on recommendation by example recommenders. We focus our experimentation on social annotation systems. To that end we first present the data model for these systems.
We then discuss recommendation by example for social annotation systems in a general sense before presenting algorithms based on cosine similarity and collaborative filtering. We present the framework for cascading hybrids, which integrate both approaches. Finally, we compare the cascading hybrid to other integrative techniques.

3.1 Data Model
The foundation of a social annotation system is the annotation: the record of a user labeling a resource with one or more tags. A collection of annotations results in a complex network of interrelated users, resources and tags [20].
A social annotation system can be described as a four-tuple, D = ⟨U, R, T, A⟩, where U is a set of users, R is a set of resources, T is a set of tags, and A is a set of annotations. Each annotation a is a tuple ⟨u, r, Tur⟩, where u is a user, r is a resource, and Tur is the set of tags that u has applied to r.
It is sometimes useful to view a social annotation system as a three-dimensional matrix, URT, in which an entry URT(u, r, t) is 1 if u has tagged r with t. Many of our component recommenders rely on two-dimensional projections of the three-dimensional annotation data [21]. These projections sacrifice some of the informational content but reduce the dimensionality of the data, making it easier for algorithms to leverage.
We define the relation between resources and tags as RT(r, t), the number of users that have applied t to r, a notion strongly resembling the “bag-of-words” vector space model [25]. Similarly, we can produce the projection UT: a user is modeled as a vector over the set of tags, where each weight, UT(u, t), measures how often the user applied a particular tag across all resources. In all, there are six possible two-dimensional projections: UR, UT, RU, RT, TU, TR (or three, if one considers transposes). In the case of UR, we do not weight resources by the number of tags a user applies, as this is not representative of the user’s interest. Rather, we define UR to be binary, indicating whether or not the user has annotated the resource.
In our previous work we explored the notion of information channels [10]. An information channel describes the relationship between the underlying dimensions in a tagging system: users, resources and tags. A strong information channel between two dimensions means that information in the first dimension will be useful in building a predictor for the second dimension. While our previous efforts describe various ways of calculating the strength of an information channel, in this work we simply rely on the notion to evaluate the experimental results.

3.2 Recommendation by Example in Social Annotation Systems
Personalized resource recommendation engines reduce the effort of navigating large information spaces by giving each user a tailored view of the resource space. To achieve this end, the algorithms presented in this section accept a user u, an example rq and a candidate resource r. With these considerations in mind, we can view any recommendation by example algorithm as a function:

\phi : U \times R \times R \to \mathbb{R} \quad (1)

Given a particular instance, ϕ(u, rq, r), the function assigns a real-valued score to the candidate resource r given the user profile u and the example (or query) resource rq. A system computing such a function can iterate over all potential recommendations and recommend those with the highest scores. This general framework, of course, requires a means to calculate the relevance. The following sections provide several techniques.
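To make the projections and the scoring interface concrete, here is a minimal sketch in Python (ours, not the authors' code) that builds RT, UT and the binary UR from a toy list of annotations and then iterates over candidates as Equation 1 suggests. All names and data are illustrative.

from collections import defaultdict

# Toy annotations: (user, resource, tags) tuples; all data is illustrative.
annotations = [
    ("u1", "r1", {"rock", "indie"}),
    ("u2", "r1", {"rock"}),
    ("u2", "r2", {"jazz", "rock"}),
]

RT = defaultdict(lambda: defaultdict(int))  # RT(r, t): users applying t to r
UT = defaultdict(lambda: defaultdict(int))  # UT(u, t): times u applied t
UR = defaultdict(set)                       # UR(u, r): binary membership

for u, r, tags in annotations:
    UR[u].add(r)
    for t in tags:
        RT[r][t] += 1
        UT[u][t] += 1

def recommend(phi, u, rq, candidates, n=10):
    # Equation 1 in use: score each candidate with phi(u, rq, r)
    # and return the n highest-scoring resources.
    return sorted(candidates, key=lambda r: phi(u, rq, r), reverse=True)[:n]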
3.3 Cosine Models
Cosine similarity is commonly used in information retrieval to measure the agreement between two vectors. By modeling resources as vectors, the cosine similarity between two resources can be computed. Our first recommender, CSrt, models resources as vectors of tags taken from RT and computes the cosine similarity between the example resource and the potential recommendation:

\phi(u, r_q, r) = \frac{\sum_{t \in T} RT(r_q, t) \, RT(r, t)}{\sqrt{\sum_{t \in T} RT(r_q, t)^2} \, \sqrt{\sum_{t \in T} RT(r, t)^2}} \quad (2)

Resources can also be modeled as vectors of users taken from RU; we call this approach CSru. Neither of these approaches is personalized: they completely ignore the input user u. Yet they are the type of recommender one might expect in a recommendation by example scenario, making them a useful baseline.

3.4 Collaborative Filtering
In order to take advantage of the user profile we rely on collaborative filtering algorithms. More specifically, we employ two user-based algorithms [16, 29] and two item-based algorithms [5, 26].
The user-based approaches, KNNur and KNNut, model users as vectors of either resources or tags gathered from UR or UT. To make recommendations, we filter the potential neighbors to only those who have used the example resource rq. We use cosine similarity to find the k nearest neighbors to u and then use these neighbors to recommend resources via a weighted sum based on user-user similarity:

\phi(u, r_q, r) = \sum_{v \in N_{r_q}} \sigma(u, v) \, \theta(v, r) \quad (3)

where N_{r_q} is the neighborhood of users that have annotated rq, σ(u, v) is the cosine similarity between the users u and v, and θ(v, r) is 1 if v has annotated r and 0 otherwise. Filtering users by the query resource focuses the algorithm on the user’s query, but still leaves a great deal of room for resources dissimilar to the example to make their way into the recommendation set. These two approaches, however, are strongly personalized.
The item-based algorithms rely on the similarity between resources rather than between users. When modeling resources as users we call the algorithm KNNru; when modeling them as tags we call it KNNrt. Again a weighted sum is used, as is common in collaborative filtering:

\phi(u, r_q, r) = \sum_{s \in N_r} \sigma(r, s) \, \theta(u, s) \quad (4)

where N_r is the set of the k resources nearest to r drawn from the user profile and σ(r, s) is the similarity between s and r. This procedure ignores the query resource entirely, instead focusing on the similarity of the potential recommendations to those the user has already annotated.
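The following sketch (again ours, with toy data; all names are illustrative) instantiates Equation 2 as CSrt and Equation 3 as KNNur over small dictionary projections of the kind built in the previous sketch.

from math import sqrt

RT = {"r1": {"rock": 2, "indie": 1}, "r2": {"rock": 1, "jazz": 3}}
UR = {"u1": {"r1"}, "u2": {"r1", "r2"}, "u3": {"r1"}}

def cosine(a, b):
    # Cosine similarity between two sparse vectors held as dicts.
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cs_rt(u, rq, r):
    # Equation 2: tag-vector similarity; the user u is ignored.
    return cosine(RT.get(rq, {}), RT.get(r, {}))

def user_sim(u, v):
    # Cosine over binary UR vectors: |A ∩ B| / sqrt(|A| * |B|).
    a, b = UR.get(u, set()), UR.get(v, set())
    return len(a & b) / sqrt(len(a) * len(b)) if a and b else 0.0

def knn_ur(u, rq, r, k=2):
    # Equation 3: neighbours are the k users most similar to u among
    # those who annotated rq; each neighbour that annotated r votes
    # with weight sigma(u, v).
    neighbours = sorted((v for v in UR if v != u and rq in UR[v]),
                        key=lambda v: user_sim(u, v), reverse=True)[:k]
    return sum(user_sim(u, v) for v in neighbours if r in UR[v])

print(cs_rt("u1", "r1", "r2"), knn_ur("u1", "r1", "r2"))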
3.5 Cascading Hybrids
Hybrid recommenders are powerful tools used to combine the results of multiple components into a single framework [4]. In this work we focus on cascading recommenders, which reorder the output of one recommender by the results of a second. The variable k determines how many resources are taken from the first recommender and passed to the second. If k is set very small, the second recommender is severely limited in what it can recommend. On the other hand, if k is set very large, the first algorithm has very little influence on the final result, since all the resources might be passed on to the second recommender.
Given the six recommenders described above, several cascading hybrids are possible, thirty in all. We limit the scope of this paper to an examination of hybrids that start with a component based on cosine similarity and reorder the results based on collaborative filtering; in our preliminary experiments these combinations offered the best results. Ideally, the first recommender will focus the results on the example resource. The value of k will be tuned such that the most relevant resources are passed to the second recommender. The second recommender will then reorder the resources based on the user’s profile, thereby personalizing the final recommendation list. More formally, we describe the recommender as:

\phi(u, r_q, r) = \chi_1(k, r) \, \phi_2(u, r_q, r) \quad (5)

where \phi_2(u, r_q, r) is the score taken from the second recommender and \chi_1(k, r) is 1 if r is ranked among the top k resources by the first recommender, and 0 otherwise.
The cascading recommender has many advantages. First is its efficiency: the cosine similarities can be computed offline, and the collaborative filtering approaches, which are more computationally intense, only need to rerank a small subset of the resources. Second, the hybrid can leverage multiple information channels of the data. For example, the resource-tag channel exploited by CSrt and the user-resource channel exploited by KNNur can be combined in a single cascading recommender. Third, by leveraging multiple information channels, the hybrid can produce results superior to what either component can produce alone.

3.6 Other Integrative Models
Hybrid recommenders are not the only means to integrate multiple information channels of social annotation data. Graph-based approaches such as Adapted PageRank [15] and tensor factorization algorithms such as Pairwise Interaction Tensor Factorization (PITF) [24] have met with great success, particularly in tag recommendation. Adapted PageRank models the data as a graph composed of users, resources and tags connected through the annotations. A preference vector is used to model the input; the PageRank values are then calculated and used to make recommendations. However, the computational requirements of Adapted PageRank make it ill-suited for large-scale deployment, as the PageRank vector must be calculated for each recommendation. PITF, on the other hand, offers a far better running time. It achieves excellent results for tag recommendation, but it is not clear how to adapt the algorithm to the recommendation by example scenario. First, in the context of tag recommendation PITF prioritizes tags from both a user and a resource model in order to make recommendations, thereby reusing tags. In resource recommendation, however, the algorithm cannot promote resources from the user profile, as these are already known to the user; this requirement conflicts with the assumptions of the prioritization model, since all possible candidate recommendations are in effect treated as negative examples. Second, many proposed tensor factorization methods, PITF included, require an element from two of the data spaces in order to produce elements from the third. For example, a user and a resource can be used to produce tags. In recommendation by example, the input is a user and a resource while the expected output also comes from the resource space. Finally, in our investigation into tag-based resource recommendation [9], we found that hybrid recommenders often outperform PITF. While these previous efforts are not evaluated in this paper, due to either their scalability or their applicability to the recommendation by example paradigm, they do demonstrate the benefit of an integrative model leveraging multiple channels of the data, and they provide inspiration for the cascading hybrids.
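As an illustration of Equation 5, here is a minimal sketch of the cascade (ours, not the authors' implementation), where phi1 and phi2 stand for any two of the component scorers defined above.

def cascade(phi1, phi2, u, rq, candidates, k=20, n=10):
    # chi_1(k, r): keep only the top-k resources under the first
    # recommender; the second recommender then ranks that shortlist.
    shortlist = sorted(candidates, key=lambda r: phi1(u, rq, r),
                       reverse=True)[:k]
    return sorted(shortlist, key=lambda r: phi2(u, rq, r), reverse=True)[:n]

# For example, a CSrt -> KNNur cascade over the earlier sketches:
# cascade(cs_rt, knn_ur, "u1", "r1", candidates, k=20, n=10)

Note how the expensive, personalized scorer only ever sees k candidates, which is the efficiency argument made above.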
4. EXPERIMENTAL EVALUATION
Here we describe the methods used to gather and pre-process our three real-world datasets. We describe how test cases were generated and the cross-fold validation used in our study. The evaluation metrics are presented, and the results for each dataset are given separately before we draw some final conclusions.

4.1 Datasets
Our experiments into recommendation by example were completed on three real-world social annotation systems. As part of our data processing we generated p-cores [15] for each dataset: users, resources and tags were removed in order to produce a residual dataset in which each user, resource and tag is guaranteed to occur in at least p annotations. 5-cores or 20-cores were generated depending on the size of the dataset, thus allowing five-fold cross validation.
Bibsonomy users annotate both URL bookmarks and journal articles. The dataset was gathered on 1 January 2009 and is made available online [14]. A 5-core was taken, producing 13,909 annotations with 357 users, 1,738 resources and 1,573 tags.
MovieLens is administered by the GroupLens research lab at the University of Minnesota. They provide a dataset containing users, ratings of movies, and tags. A 5-core generated 35,366 annotations with 819 users, 2,445 resources and 2,309 tags.
LastFM users share their musical tastes online. Users have the option to tag songs, artists or albums. The experiments here are limited to album data, though experiments with artist and song data show similar trends. A p-core of 20 was drawn from the data and contains 2,368 users, 2,350 resources, 1,141 tags and 172,177 annotations.

4.2 Methodology
Recommendation by example plays a critical role in modern Internet applications. However, it is difficult to directly capture the user’s perception of how two resources relate to one another. Moreover, to the best of our knowledge, there do not exist datasets in which a user has explicitly stated that he believes two items to be similar. For these reasons we limit the scope of our experiments to social annotation systems. In these systems a user applies tags to resources, in effect describing them in a way that is important to the user. The basic assumption of this work is that if a user annotates two resources with the same tags then the two resources are similar from that user’s perspective. Since a user may annotate resources with any number of tags, we segment the results into cases in which one, two, three, four or five tags are in agreement. This segmentation allows us to analyze the results when there is a very high probability that two resources are similar (when a user applies several identical tags to both resources) or when the probability is lower (when only a single common tag is applied to both resources). A sketch of this test-case generation is given below.
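The following is a minimal sketch of the test-case generation just described, under the paper's shared-tag assumption. It is our reading rather than the authors' code, and the tuple layout is hypothetical.

def test_cases(train, test):
    # train and test are lists of (user, resource, tags) annotations.
    # Yields (user, example_resource, target_resource, shared_tag_count).
    profile = {}
    for u, r, tags in train:
        profile.setdefault(u, []).append((r, set(tags)))
    for u, target, tags in test:
        for example, train_tags in profile.get(u, []):
            shared = len(set(tags) & train_tags)
            if shared:  # at least one common tag creates a test case
                yield u, example, target, shared

# Each case is later bucketed by shared_tag_count (one to five common
# tags) and judged with the hit ratio of the top-10 recommendation set.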
The variables were tuned for the collaborative filtering algorithms as well as the cascading hybrid. Then the testing partition was discarded and we continued with four-fold cross validation to produce our results. To evaluate the algorithms, we iterated over all annotations in the testing data. Each annotation contains a user, a resource and a set of tags applied by the user to the resource. We compare these tags to the tags in the user’s annotations from the training data. If there is a match we generate a test case consisting of the user, the resource from the training data as the example resources and the resource from the holdout data as the target resource. The example resource and target resource may have one tag in common or several. We evaluate these cases separately looking at as many as five common tags. Since each test case has only one target resource, we judge the effectiveness of the recommenders with the hit ratio, the percentage of times the target resource is found in the recommendation set. We measure hit ratio for the top 10 recommended resources. The results are averaged for each user, averaged over all users, and finally averaged over all four folds. 4.3 Experimental Results Table 1 reports the k values used in the collaborative filtering algorithms. Experiments were conducted using values of 2, 5, 10, 20, 30 and 50. Tables 2 and 3 present the values used in the cascading recommenders. The values 20, 30, 50, 75, 100 and 150 were tested. The experimental results are given in Figure 2. The first subfigure (Bibsonomy - CSrt) shows the results for CSrt and the four hybrids that incorporate CSrt with the collaborative filtering approaches. For each case the results are displayed when one, two, three, four or five tags are common among the example resource and the target resource. In general when a single tag is common between two annotations it is difficult to know the users intent and the hit ratio is low. However, with more tags in agreement it is safer to assume that the user views the two resources in similar terms and the hit ratio increases to almost 30%. When comparing algorithms, we find little difference in these two cases. For readability, the re- maining subfigures only report the case when five tags are shared between the example and target resource. The y-axis displays the hit ratio for a recommendation set of 10 resources. In general we find that personalization is important for resource recommendation by example algorithms. It is particularly important for domains in which personal taste strongly influences the user’s consumption of resources. MovieLens and LastFM, which connect users with movies and music, both receive a benefit from our personalized cascading hybrids. In contrast, Bibsonomy, which allows its users to annotate journal articles, receives very little benefit. These users often annotate articles for professional reasons. Since their individual preferences plays a small role in their annotations, the impact of personalizing the recommendation by example algorithms is diminished. A second finding is that the cascading hybrid recommenders effectively merge the benefits of its component parts. If one component exploits the information channel between users and resources while a second leverages the channel between resources and tags, then the cascading hybrid is benefited by both. Thirdly, we see that the underlying characteristics of the annotation systems vary. Some have strong user-resource channels. 
Others have better-developed user-tag channels. The cascading recommenders expose these differences and offer insights into how user behavior might differ from system to system. In the remainder of this section we evaluate each dataset in greater detail.

Figure 2: The hit ratio for a recommendation set of size 10 for the baseline recommenders (CSrt and CSru) and the corresponding cascading recommenders leveraging the collaborative filtering algorithms (KNNur, KNNut, KNNru and KNNrt).

4.3.1 Bibsonomy
The top left graph of Figure 2 shows that CSrt alone achieves a hit ratio of approximately 4% when the results are limited to cases where a single tag is applied by the user to both the query resource and the recommended resource. When two tags are in common the hit ratio rises to approximately 8%; when five tags are in common it jumps to 30%. We assume that with five tags in common between the query resource and recommended resource, the likelihood that the user views the two resources in a similar way is greater than when only one tag is in common. The increase in hit ratio appears to bear this out. Furthermore, we notice that, except in certain rare circumstances, the relative performance between algorithms is the same regardless of how many shared tags are considered. For simplicity, we restrict our discussion to cases where five tags overlap.
In Bibsonomy CSrt clearly outperforms CSru: it achieves a hit ratio of nearly 30% while the other does little better than 10%. In this application it appears that resources are better modeled by tags than by users, at least for the task of recommending resources by example.
The personalization afforded by the cascading recommenders appears to offer little benefit. In the best case, CSrt/KNNur, the improvement is only a fraction of a percent. Moreover, in the remaining hybrids, the results are actually worse than CSrt alone. When using CSru as the initial recommender, the hybrid composed of CSru and KNNrt produces an improvement, but its overall performance is still less than CSrt alone.
These results suggest that the resource-tag information channel is particularly informative in Bibsonomy. Examining how this system is used seems to offer some explanation of why this might be the case. Bibsonomy allows users to annotate journal articles and Web pages. Many users employ the system to organize papers relevant to their research. To that end they select tags useful for retrieving their resources at a later date, and they often use tags drawn from their area of expertise. The result is a highly informative tag space that CSrt is able to exploit.
Since CSrt draws from the resource-tag space, it is not surprising that KNNrt is not able to improve the results. Yet the other approaches appear not to add information to the recommender either. A look at Table 2 shows that 20 resources were taken from CSrt for the other algorithms to reorder. This value was tuned to produce the best performance and is the lowest among all the cascading hybrids.
It may be that the reliance of the collaborative algorithms on the user profiles rather than the example resource profiles is particularly detrimental in a system where resources are organized not by personal taste but by the content of the resources.

4.3.2 MovieLens
In MovieLens we see dramatically different results. First, CSru still does not perform as well as CSrt, but it is far more competitive. This suggests that in this domain several information channels of the data might be exploited to benefit the recommendation engine. Second, the cascading hybrids improve upon the performance of CSrt by as much as 19 percentage points: alone, CSrt results in a hit ratio of 29%, whereas coupled with KNNur it results in 48%. The other cascading hybrids also improve the results, but to a lesser degree. These results imply that combining component recommenders which rely on different channels of the data can increase accuracy. We also see that CSru is improved by combining it with the collaborative recommenders; the best results are achieved by combining it with KNNrt, producing a hit ratio of 38%.
In this system users tag movies, often using genres or an actor’s name to annotate a film. In this sense they are similar to users that annotate journal articles in Bibsonomy: in both cases, tags are often drawn from a common domain-specific vocabulary. Yet in MovieLens we observe that the user’s preferences play a more important role. By means of an example, two users may both like science fiction and annotate their finds as such. If one is drawn toward British television (Dr. Who) and the other prefers modern blockbusters (The Transformers), then even though they both use the tag scifi, it means something different to each of them. A recommender like CSrt could whittle the possible films down to a general area, but a collaborative algorithm such as KNNur can narrow the field down even further by focusing on the user’s preferences.
It is also informative to note that whereas KNNur is the better secondary model for CSrt, KNNrt is the better secondary model for CSru. This may be because those secondary algorithms complement the primary recommenders. For example, CSrt focuses on the resource-tag channel, so KNNrt does not offer substantially new information, since it also focuses on the resource-tag channel; better results are achieved by combining CSrt with KNNur, an algorithm that leverages the user-resource channel.

4.3.3 LastFM
LastFM provides another example of how personalization can improve upon standard cosine models. In this case, however, the best cascading hybrid is built from CSru, which models the resource as a vector of users, and KNNru, a collaborative algorithm that also models resources over the user space. The result is a more modest improvement in performance, from 25% to 29%. We observe that KNNru also makes the best cascading recommender of those combined with CSrt.
The strength of this component may be explained by an evaluation of the system itself. LastFM users annotate music: tracks, artists and albums. Like MovieLens users, they often use tags common to the domain, such as genre or a singer’s name. However, they also use tags like sawlive or albumiown: the tag space is used not only to describe the resource but also the user’s relationship to it, making the tag space far noisier. Furthermore, LastFM users more commonly focus on a particular genre of music, whereas MovieLens users are likely to watch a broader swath of genres, from horror to adventure to action.
This interaction makes the user space in LastFM slightly cleaner than in MovieLens. These two differences suggest that the user space would be a better model for resources than the tag space. In fact, that is what we observe: CSru is the best initial component and KNNru is the best secondary component.

5. CONCLUSION
In this paper we have investigated the use of cascading hybrids for the personalization of recommendation by example algorithms in social annotation systems. This form of interaction offers great utility to users as they navigate large information spaces, yet it is rarely personalized. We contend that personalization is an important ingredient toward satisfying the user’s needs. Our experimental analysis on three real-world datasets reveals that in some cases personalization offers little benefit. However, in other contexts, particularly where resource consumption is driven by personal preference, the benefits can be quite large.
Our proposed cascading hybrids leverage multiple information channels of the data, producing superior results, and they offer advantages beyond accuracy: since much of the work can be completed offline and only a small subset of resources needs to be evaluated for reranking, the algorithm is very efficient, permitting fast online personalized recommendations.
Cascading hybrids built from different combinations of component recommenders performed differently across our three datasets. These results suggest that users interact with social annotation systems in varying ways, producing datasets with different underlying characteristics. The cascading hybrids expose these characteristics and allow investigations into why they might occur.

6. ACKNOWLEDGMENTS
This work was supported in part by a grant from the Department of Education, Graduate Assistance in the Area of National Need, P200A070536.

7. REFERENCES
[1] R. Agrawal, T. Imielinski, and A. N. Swami. Mining Association Rules between Sets of Items in Large Databases. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, Washington, D.C., 1993.
[2] M. Balabanović and Y. Shoham. Fab: content-based, collaborative recommendation. Communications of the ACM, 40(3):66–72, Mar. 1997.
[3] C. Basu, H. Hirsh, and W. W. Cohen. Recommendation as Classification: Using Social and Content-Based Information in Recommendation. In AAAI/IAAI, pages 714–720, 1998.
[4] R. Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331–370, 2002.
[5] M. Deshpande and G. Karypis. Item-Based Top-N Recommendation Algorithms. ACM Transactions on Information Systems, 22(1):143–177, 2004.
[6] J. Gemmell, M. Ramezani, T. Schimoler, L. Christiansen, and B. Mobasher. A fast effective multi-channeled tag recommender. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases Discovery Challenge, Bled, Slovenia, 2009.
[7] J. Gemmell, T. Schimoler, B. Mobasher, and R. Burke. Hybrid tag recommendation for social annotation systems. In 19th ACM International Conference on Information and Knowledge Management, Toronto, Canada, 2010.
[8] J. Gemmell, T. Schimoler, B. Mobasher, and R. Burke. Resource Recommendation in Collaborative Tagging Applications. In E-Commerce and Web Technologies, Bilbao, Spain, 2010.
[9] J. Gemmell, T. Schimoler, B. Mobasher, and R. Burke. Tag-based resource recommendation in social annotation applications.
In User Modeling, Adaptation and Personalization, Girona, Spain, 2011.
[10] J. Gemmell, T. Schimoler, M. Ramezani, L. Christiansen, and B. Mobasher. Resource Recommendation for Social Tagging: A Multi-Channel Hybrid Approach. In Recommender Systems & the Social Web, Barcelona, Spain, 2010.
[11] J. Gemmell, A. Shepitsen, B. Mobasher, and R. Burke. Personalizing Navigation in Folksonomies Using Hierarchical Tag Clustering. In 10th International Conference on Data Warehousing and Knowledge Discovery, Turin, Italy, 2008.
[12] Z. Guan, C. Wang, J. Bu, C. Chen, K. Yang, D. Cai, and X. He. Document recommendation in social tagging services. In Proceedings of the 19th International Conference on World Wide Web (WWW ’10), pages 391–400, New York, NY, USA, 2010. ACM.
[13] I. Guy, N. Zwerdling, I. Ronen, D. Carmel, and E. Uziel. Social media recommendation based on people and tags. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’10), pages 194–201, New York, NY, USA, 2010. ACM.
[14] A. Hotho, R. Jaschke, C. Schmitz, and G. Stumme. BibSonomy: A social bookmark and publication sharing system. In Proceedings of the Conceptual Structures Tool Interoperability Workshop at the 14th International Conference on Conceptual Structures, Aalborg, Denmark, 2006.
[15] R. Jaschke, L. Marinho, A. Hotho, L. Schmidt-Thieme, and G. Stumme. Tag Recommendations in Folksonomies. Lecture Notes in Computer Science, 4702:506–513, 2007.
[16] J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, and J. Riedl. GroupLens: Applying Collaborative Filtering to Usenet News. Communications of the ACM, 40(3):87, 1997.
[17] D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka. Training algorithms for linear text classifiers. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’96), pages 298–306, New York, NY, USA, 1996. ACM.
[18] H. Liang, Y. Xu, Y. Li, R. Nayak, and X. Tao. Connecting users and items with weighted tags for personalized item recommendations. In Proceedings of the 21st ACM Conference on Hypertext and Hypermedia (HT ’10), pages 51–60, New York, NY, USA, 2010. ACM.
[19] B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and G. Stumme. Evaluating similarity measures for emergent semantics of social tagging. In Proceedings of the 18th International Conference on World Wide Web (WWW ’09), pages 641–650, New York, NY, USA, 2009. ACM.
[20] A. Mathes. Folksonomies: Cooperative Classification and Communication Through Shared Metadata. Computer Mediated Communication (Doctoral Seminar), Graduate School of Library and Information Science, University of Illinois Urbana-Champaign, December 2004.
[21] P. Mika. Ontologies are us: A unified model of social networks and semantics. Web Semantics: Science, Services and Agents on the World Wide Web, 5(1):5–15, 2007.
[22] D. M. Pennock, S. Lawrence, R. Popescul, and L. H. Ungar. Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pages 437–444, 2001.
[23] A. Plangprasopchok and K. Lerman. Exploiting Social Annotation for Automatic Resource Discovery. In Proceedings of the AAAI Workshop on Information Integration, Apr. 2007.
[24] S. Rendle and L. Schmidt-Thieme. Pairwise Interaction Tensor Factorization for Personalized Tag Recommendation.
In Proceedings of the Third ACM International Conference on Web Search and Data Mining, New York, New York, 2010.
[25] G. Salton, A. Wong, and C. Yang. A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11):613–620, 1975.
[26] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-Based Collaborative Filtering Recommendation Algorithms. In 10th International Conference on World Wide Web, Hong Kong, China, 2001.
[27] J. B. Schafer, J. A. Konstan, and J. Riedl. E-Commerce Recommendation Applications. Data Mining and Knowledge Discovery, 5(1):115–153, 2001.
[28] S. Sen, J. Vig, and J. Riedl. Tagommenders: connecting users to items through tags. In WWW ’09: Proceedings of the 18th International Conference on World Wide Web, pages 671–680, New York, NY, USA, 2009. ACM.
[29] U. Shardanand and P. Maes. Social Information Filtering: Algorithms for Automating “Word of Mouth”. In SIGCHI Conference on Human Factors in Computing Systems, Denver, Colorado, 1995.
[30] A. Shepitsen, J. Gemmell, B. Mobasher, and R. Burke. Personalized Recommendation in Social Tagging Systems using Hierarchical Clustering. In ACM Conference on Recommender Systems, Lausanne, Switzerland, 2008.
[31] X. Wu, L. Zhang, and Y. Yu. Exploring social annotations for the semantic web. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland, 2006.
[32] Y. Yang and C. G. Chute. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems, 12:252–277, July 1994.