CSE 6240: Web Search and Text Mining
Memory-Based Collaborative Filtering
Yi Zhen
College of Computing, Georgia Institute of Technology

1 Recommendation Methods
2 Memory-based Collaborative Filtering
3 Experiments
4 Summary

1 Recommendation Methods

Content-based Recommender Systems (Lops et al. 2011)

Content-based vs. Collaborative Recommendation

Content-based recommendation: explicit profiling of users and items
- user independence: each user is profiled independently
- data collection: requires domain knowledge and is time-consuming
- profile flexibility: profiles are domain-specific

Collaborative recommendation: relies on past user behavior (rating, purchasing)
- avoids extensive data collection
- needs little domain knowledge: domain-independent
- discovers implicit patterns that are impossible to profile explicitly

2 Memory-based Collaborative Filtering

Collaborative Filtering (CF): Problem Formulation

- Users: $u, v \in U$; items: $i, j \in I$
- Ratings: $r_{ui}$ is the degree of preference of user $u$ for item $i$
  - $r_{ui} > r_{uj}$ implies that user $u$ prefers item $i$ to item $j$
- Problem: given the observed ratings, predict the missing ratings

Incomplete rating matrix ("?" marks a missing rating):

         Casablanca   Godfather   Harry Potter   Lion King
David    5            4           2              ?
John     3            2           ?              5
Jenny    5            2           5              ?
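The problem formulation above can be made concrete with a small sketch. It assumes the flattened rating table reads as three users by four movies, with `None` standing in for "?"; the names `ITEMS`, `RATINGS`, `missing_entries`, and `prefers` are hypothetical helpers introduced only for illustration.

```python
# Toy rating matrix; None marks a missing rating ("?").
# The cell layout is an assumption recovered from the flattened slide table.
ITEMS = ["Casablanca", "Godfather", "Harry Potter", "Lion King"]
RATINGS = {
    "David": [5, 4, 2, None],
    "John": [3, 2, None, 5],
    "Jenny": [5, 2, 5, None],
}


def missing_entries(ratings, items):
    """Return the (user, item) pairs whose rating has to be predicted."""
    return [
        (user, items[idx])
        for user, row in ratings.items()
        for idx, value in enumerate(row)
        if value is None
    ]


def prefers(ratings, items, user, item_i, item_j):
    """Return True if r_ui > r_uj, i.e. the user prefers item_i to item_j.

    Returns None when either rating is unobserved, because the
    preference cannot be read off the matrix in that case.
    """
    r_i = ratings[user][items.index(item_i)]
    r_j = ratings[user][items.index(item_j)]
    if r_i is None or r_j is None:
        return None
    return r_i > r_j


to_predict = missing_entries(RATINGS, ITEMS)
```

Here `to_predict` lists exactly the cells a CF method is asked to fill in, and `prefers` reads the ordering $r_{ui} > r_{uj}$ directly from observed ratings.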
Collaborative Filtering: Methods

Memory-based CF
- User centric: for a given user with a past rating history, how do we recommend other items to her?
- Item centric: for a given item rated by some users before, to which other users should we recommend it?
- The duality between users and items

Model-based CF

In the User Centric World

Problem. For a given user with a past purchasing and/or rating history, how do we recommend new items to her?

User-based CF
- Find other users similar to the given user
- Recommend items those similar users liked

Item-based CF
- Find items similar to the items rated by the given user
- Recommend those items with high ratings

User-based CF (Breese et al., 1998)

- Notation: an active user $a$, an item $i$, and the rating $r_{ai}$
- For any user $u$, let $I_u = \{ j \mid r_{uj} \neq ? \}$ be the set of items rated by $u$
- Mean user rating:
$$\bar{r}_u = \frac{1}{|I_u|} \sum_{j \in I_u} r_{uj}$$
- Prediction for $r_{ai}$:
$$\hat{r}_{ai} = \bar{r}_a + \kappa \sum_{u} \mathrm{sim}(a, u)\,(r_{ui} - \bar{r}_u)$$
where $u$ ranges over the set of neighbors and $\kappa$ is a normalization factor.

Limitations

- The set of neighbors is fixed and independent of the item to be predicted
- The best $k$ neighbors may not even have an opinion about the particular item

Solution: dynamically select the $k$ best neighbors who have rated the item, $N(a, i)$:
- users who tend to rate similarly to $a$ (neighbors)
- and who have actually rated $i$
$$\hat{r}_{ai} = \bar{r}_a + \kappa \sum_{u \in N(a,i)} \mathrm{sim}(a, u)\,(r_{ui} - \bar{r}_u)$$

Similarity between Users

K-nearest neighbor:
$$\mathrm{sim}(a, u) = \begin{cases} 1 & \text{if } u \in \mathrm{Neighborhood}(a) \\ 0 & \text{otherwise} \end{cases}$$

Pearson correlation coefficient:
$$\mathrm{sim}(a, u) = \frac{\sum_{i} (r_{ai} - \bar{r}_a)(r_{ui} - \bar{r}_u)}{\sqrt{\sum_{i} (r_{ai} - \bar{r}_a)^2}\,\sqrt{\sum_{i} (r_{ui} - \bar{r}_u)^2}}$$
where the summation is over $i \in I_a \cap I_u \equiv I_{au}$.

Cosine similarity:
$$\mathrm{sim}(a, u) = \frac{\sum_{i} r_{ai}\, r_{ui}}{\sqrt{\sum_{k \in I_a} r_{ak}^2}\,\sqrt{\sum_{k \in I_u} r_{uk}^2}}$$

User Similarity Extensions

- Inverse user frequency: down-weight items that appear in many $I_u$
  - analogous to inverse document frequency in IR
  - many variations exist, e.g. $\log(M / M_i)$, where $M_i$ is the number of sets $I_u$ in which item $i$ appears
- Case amplification: making $\mathrm{sim}(a, u)$ more extreme
- Support/confidence of user similarities

Issues for User-based Methods

- Scaling issues: complexity $O(M^2 N)$, where $M$ is the number of users and $N$ is the number of items
  - in practice more like $O(M^2)$, because each user likes only a small number of items
- Some remedies:
  - sampling users
  - clustering users
  - offline computation of user similarity: inappropriate when user activity changes frequently
  - other fast similarity computation methods: hashing, ...

Item-based CF (Badrul Sarwar et al., 2001)

Problem. For a given user with a past rating history, how do we recommend other items to her?
- For an item, compute its correlation with other items
- For the given user, aggregate her previous ratings of the items highly correlated with the current item

Item-based CF (2)

- Offline computation of item similarity: complexity $O(M N^2)$.
- Online look-up of similar items does not depend on $M$ or $N$, but rather on how many items the user purchased or rated in the past
- Item similarity is more stable than user similarity; hence it works for users with limited data, even just one item purchase or rating

3 Experiments

Evaluation Metrics

Given a test set of ratings $T$, compare the true ratings $r_{ui}$ with the estimates $\hat{r}_{ui}$.

Root mean squared error (RMSE):
$$\left( \frac{1}{|T|} \sum_{r_{ui} \in T} (r_{ui} - \hat{r}_{ui})^2 \right)^{1/2}$$

Mean absolute error (MAE):
$$\frac{1}{|T|} \sum_{r_{ui} \in T} |r_{ui} - \hat{r}_{ui}|$$

Benchmark: MovieLens Data

A public data set from movielens.org
- MovieLens 100K
  - users with 20+ ratings
  - 100,000 ratings in a 943 × 1682 user-item matrix
- MovieLens 1M
  - 1M ratings, 4K movies, and 6K users
  - about 4% of the ratings are observed
- MovieLens 10M
  - 10M ratings, 100K tags, 10K movies, and 72K users

Item-based CF: Compare Item Similarities

[Figure: relative performance of different similarity measures. Y-axis: MAE. Bars: adjusted cosine, pure cosine, correlation.]

Item-based CF: Item Neighborhood Size

[Figure: sensitivity of the neighborhood size. Y-axis: MAE. X-axis: No. of neighbors (10 to 200). Series: itm-itm, itm-reg.]

Item-based vs. User-based

[Figure: item-item vs. user-user at selected neighborhood sizes (at x = 0.8). Y-axis: MAE. X-axis: No. of neighbors. Series: user-user, item-item, item-item-regression, nonpers. A second panel is truncated in the source.]

4 Summary

Memory-based Collaborative Filtering
- Simple and reasonable assumption
- Easy to implement
- Large storage and computation cost
- Hard to deal with cold-start users or items
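To tie the pieces together, the user-based prediction rule with item-dependent neighbors $N(a, i)$, the Pearson similarity, and the MAE/RMSE metrics can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used in the experiments: the toy ratings are invented for the example, $\kappa$ is taken to be $1 / \sum_{u} |\mathrm{sim}(a, u)|$ (one common choice; the slides only call it a normalization factor), neighbors are ranked by absolute similarity, and all function names are hypothetical.

```python
import math

# Toy data (user -> {item: rating}); values are illustrative assumptions.
RATINGS = {
    "David": {"Casablanca": 5, "Godfather": 4, "Harry Potter": 2},
    "John": {"Casablanca": 3, "Godfather": 2, "Lion King": 5},
    "Jenny": {"Casablanca": 5, "Godfather": 2, "Harry Potter": 5},
}


def mean_rating(ratings, u):
    """Mean of user u's observed ratings (the r-bar_u of the slides)."""
    values = ratings[u].values()
    return sum(values) / len(values)


def pearson(ratings, a, u):
    """Pearson similarity summed over the items both users rated (I_au)."""
    common = set(ratings[a]) & set(ratings[u])
    if not common:
        return 0.0
    mean_a, mean_u = mean_rating(ratings, a), mean_rating(ratings, u)
    num = sum((ratings[a][i] - mean_a) * (ratings[u][i] - mean_u) for i in common)
    den_a = math.sqrt(sum((ratings[a][i] - mean_a) ** 2 for i in common))
    den_u = math.sqrt(sum((ratings[u][i] - mean_u) ** 2 for i in common))
    if den_a == 0 or den_u == 0:
        return 0.0  # similarity undefined; treat as uncorrelated
    return num / (den_a * den_u)


def predict(ratings, a, i, k=2):
    """Predict r_ai from the k most similar users who actually rated i."""
    mean_a = mean_rating(ratings, a)
    # N(a, i): candidates are restricted to users who have rated item i.
    candidates = [
        (pearson(ratings, a, u), u)
        for u in ratings
        if u != a and i in ratings[u]
    ]
    # Assumption: rank neighbors by absolute similarity.
    neighbors = sorted(candidates, key=lambda su: abs(su[0]), reverse=True)[:k]
    norm = sum(abs(s) for s, _ in neighbors)  # 1 / kappa (assumed form)
    if norm == 0:
        return mean_a  # no usable neighbor: fall back to the user's mean
    offset = sum(s * (ratings[u][i] - mean_rating(ratings, u)) for s, u in neighbors)
    return mean_a + offset / norm


def mae(pairs):
    """Mean absolute error over (true, predicted) pairs."""
    return sum(abs(r - p) for r, p in pairs) / len(pairs)


def rmse(pairs):
    """Root mean squared error over (true, predicted) pairs."""
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))
```

For example, `predict(RATINGS, "David", "Lion King")` can only draw on John, the one user who rated that item, which is the situation the dynamic neighborhood $N(a, i)$ is designed for. Held-out ratings paired with their predictions can then be passed to `mae` and `rmse`.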