CSE 6240: Web Search and Text Mining

Memory-Based Collaborative Filtering
Yi Zhen
College of Computing
Georgia Institute of Technology
1. Recommendation Methods
2. Memory-based Collaborative Filtering
3. Experiments
4. Summary
Recommendation Methods
Content-based Recommender Systems (Lops et al. 2011)
Content-based vs. Collaborative Recommendation
Content-based recommendation: explicit profiling of users and items
— user independence: profile each user independently
— data collection: domain knowledge, time-consuming
— profile flexibility: domain-specific
Collaborative recommendation: rely on past user behaviors (rating,
purchasing)
— avoid extensive data collection
— little domain knowledge: domain-independent
— discover implicit patterns: impossible to profile explicitly
Memory-based Collaborative Filtering
Collaborative Filtering (CF): Problem Formulation
Users: $u, v \in U$; Items: $i, j \in I$
Ratings: $r_{ui}$ is the degree of preference of user $u$ for item $i$, so $r_{ui} > r_{uj}$ ⇒ user $u$ prefers item $i$ to item $j$
Problem: given the observed ratings, predict the missing ratings
Incomplete rating matrix

         Casablanca   God Father   Harry Potter   Lion King
David        5            4             2             ?
John         3            2             ?             5
Jenny        5            2             5             ?
Collaborative Filtering: Methods
Memory-based CF
User centric: for a given user with past rating history, how to
recommend other items to her?
Item centric: for a given item rated by some users before, to which
other users should we recommend it?
The duality between users and items
Model-based CF
In the User Centric World
Problem. For a given user with past purchasing and/or rating history,
how to recommend new items to her?
User-based CF
Find other similar users for the given user
Recommend items those similar users liked
Item-based CF
Find other similar items for items rated by the given user
Recommend those items with high ratings
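User-based CF (Breese et al., 1998)
Notation: an active user $a$, an item $i$, and the rating $r_{ai}$
For any user $u$, let $I_u = \{ j \mid r_{uj} \neq\ ? \}$ (the items $u$ has rated)
Mean user rating:
$$\bar{r}_u = \frac{1}{|I_u|} \sum_{j \in I_u} r_{uj}$$
Prediction for $r_{ai}$:
$$\hat{r}_{ai} = \bar{r}_a + \kappa \sum_{u} \mathrm{sim}(a, u)\,(r_{ui} - \bar{r}_u)$$
where the sum over $u$ runs over the set of neighbors and $\kappa$ is a normalization factor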
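The prediction rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the method of Breese et al. verbatim: the function names, the use of `np.nan` for missing ratings, the choice $\kappa = 1 / \sum_u |\mathrm{sim}(a, u)|$, and the made-up similarity values are all assumptions.

```python
import numpy as np

# Toy rating matrix from the problem-formulation slide; np.nan marks a missing rating.
# Rows: David, John, Jenny. Columns: Casablanca, God Father, Harry Potter, Lion King.
R = np.array([
    [5.0, 4.0, 2.0, np.nan],
    [3.0, 2.0, np.nan, 5.0],
    [5.0, 2.0, 5.0, np.nan],
])

def user_means(R):
    """Mean of each user's observed ratings (r-bar_u over I_u)."""
    return np.nanmean(R, axis=1)

def predict(R, a, i, sim, neighbors):
    """Predict r_ai from the mean-centered ratings of the given neighbors.

    sim[u] holds sim(a, u). Assumption: kappa = 1 / sum of |sim(a, u)|
    over the neighbors that actually rated item i.
    """
    means = user_means(R)
    num, den = 0.0, 0.0
    for u in neighbors:
        if u == a or np.isnan(R[u, i]):
            continue  # a neighbor without a rating for i contributes nothing
        num += sim[u] * (R[u, i] - means[u])
        den += abs(sim[u])
    if den == 0.0:
        return float(means[a])  # fall back to the active user's mean
    return float(means[a] + num / den)

# David (row 0) and Lion King (column 3), with John and Jenny as neighbors.
sim_to_david = np.array([0.0, 1.0, 1.0])  # made-up similarities for illustration
print(predict(R, a=0, i=3, sim=sim_to_david, neighbors=[1, 2]))
```

Because the deviation is added to the active user's own mean, the result can land outside the 1 to 5 rating scale (it does in this toy call), so clipping to the scale is a common extra step.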
Limitations
The set of neighbors is fixed and independent of the item to be
predicted
The best k neighbors may not even have an opinion about the
particular item
Solution:
Dynamically select the k best neighbors who have rated the item, $N(a, i)$: users who tend to rate similarly to $a$ and who have actually rated $i$
$$\hat{r}_{ai} = \bar{r}_a + \kappa \sum_{u \in N(a,i)} \mathrm{sim}(a, u)\,(r_{ui} - \bar{r}_u)$$
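A small sketch of the item-dependent neighbor selection follows. The helper name `select_neighbors`, the `np.nan` encoding of missing ratings, and the similarity values are assumptions made for illustration only.

```python
import numpy as np

# Toy rating matrix from the problem-formulation slide; np.nan marks a missing rating.
R = np.array([
    [5.0, 4.0, 2.0, np.nan],   # David
    [3.0, 2.0, np.nan, 5.0],   # John
    [5.0, 2.0, 5.0, np.nan],   # Jenny
])

def select_neighbors(R, a, i, sim, k):
    """Return N(a, i): up to k users most similar to a among those who rated i."""
    candidates = [u for u in range(R.shape[0])
                  if u != a and not np.isnan(R[u, i])]
    # Rank the remaining candidates by decreasing similarity to the active user
    candidates.sort(key=lambda u: sim[u], reverse=True)
    return candidates[:k]

# Hypothetical similarities of every user to John (row 1)
sim_to_john = np.array([0.5, 1.0, 0.8])
print(select_neighbors(R, a=1, i=2, sim=sim_to_john, k=2))
```

Filtering by "has rated i" before taking the top k is what makes the neighborhood depend on the item, which addresses both limitations listed above.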
Similarity between Users
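K-nearest neighbor:
$$\mathrm{sim}(a, u) = \begin{cases} 1 & \text{if } u \in \mathrm{Neighborhood}(a) \\ 0 & \text{otherwise} \end{cases}$$
Pearson correlation coefficient:
$$\mathrm{sim}(a, u) = \frac{\sum_{i} (r_{ai} - \bar{r}_a)(r_{ui} - \bar{r}_u)}{\sqrt{\sum_{i} (r_{ai} - \bar{r}_a)^2}\,\sqrt{\sum_{i} (r_{ui} - \bar{r}_u)^2}}$$
where the summation is over $i \in I_a \cap I_u \equiv I_{au}$
Cosine similarity:
$$\mathrm{sim}(a, u) = \frac{\sum_{i} r_{ai} r_{ui}}{\sqrt{\sum_{k \in I_a} r_{ak}^2}\,\sqrt{\sum_{k \in I_u} r_{uk}^2}}$$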
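The Pearson and cosine measures can be sketched as follows. This is an illustrative implementation under stated assumptions: missing ratings are `np.nan`, user means are taken over all items each user rated, the cosine numerator runs over co-rated items, and pairs with fewer than two co-rated items get similarity 0.

```python
import numpy as np

def pearson_sim(ra, ru):
    """Pearson correlation between two users, summed over co-rated items I_au."""
    both = ~np.isnan(ra) & ~np.isnan(ru)
    if both.sum() < 2:
        return 0.0  # assumption: too little overlap counts as no similarity
    # Deviations from each user's mean over all of that user's rated items
    da = ra[both] - np.nanmean(ra)
    du = ru[both] - np.nanmean(ru)
    den = np.sqrt((da ** 2).sum()) * np.sqrt((du ** 2).sum())
    return float((da * du).sum() / den) if den > 0 else 0.0

def cosine_sim(ra, ru):
    """Cosine similarity; the norms run over each user's own rated items."""
    both = ~np.isnan(ra) & ~np.isnan(ru)
    num = (ra[both] * ru[both]).sum()
    den = np.sqrt(np.nansum(ra ** 2)) * np.sqrt(np.nansum(ru ** 2))
    return float(num / den) if den > 0 else 0.0

# David and John from the toy rating matrix
david = np.array([5.0, 4.0, 2.0, np.nan])
john = np.array([3.0, 2.0, np.nan, 5.0])
print(pearson_sim(david, john), cosine_sim(david, john))
```

On this pair the two measures disagree in sign: cosine is positive because all ratings are positive, while Pearson is negative because it compares deviations from each user's mean.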
User Similarity Extensions
Inverse user frequency: down-weight items that appear in many $I_u$, analogous to inverse document frequency in IR. Many variations exist, e.g. the weight $\log(M / M_i)$, where $M$ is the number of users and $M_i$ is the number of users whose $I_u$ contains item $i$
Case amplification: making sim(a, u) more extreme
Support/confidence of user similarities
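Two of these extensions are easy to sketch. The code below is a hedged illustration: the function names are hypothetical, the case-amplification form $\mathrm{sim} \cdot |\mathrm{sim}|^{\rho - 1}$ with default $\rho = 2.5$ follows the description in Breese et al. as best recalled, and giving unrated items weight 0 is an assumption.

```python
import numpy as np

# Toy rating matrix from the problem-formulation slide; np.nan marks a missing rating.
R = np.array([
    [5.0, 4.0, 2.0, np.nan],
    [3.0, 2.0, np.nan, 5.0],
    [5.0, 2.0, 5.0, np.nan],
])

def inverse_user_frequency(R):
    """Per-item weight log(M / M_i), where M_i counts the users who rated item i."""
    M = R.shape[0]
    M_i = (~np.isnan(R)).sum(axis=0)
    # Assumption: items nobody rated get weight 0 (also avoids division by zero)
    return np.where(M_i > 0, np.log(M / np.maximum(M_i, 1)), 0.0)

def case_amplify(sim, rho=2.5):
    """Make similarities more extreme: sign is kept, small magnitudes shrink."""
    return sim * np.abs(sim) ** (rho - 1.0)

print(inverse_user_frequency(R))
print(case_amplify(np.array([0.9, 0.5, -0.5])))
```

Items rated by every user receive weight 0, mirroring how a term that occurs in every document carries no information under IDF.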
Issues for User-based Methods
Scaling issues: complexity $O(M^2 N)$
M: # of users, and N: # of items
in practice more like $O(M^2)$, because each user likes only a small number of items
Some remedies:
sampling users
clustering users
offline computation of user similarity: inappropriate when user activity changes frequently
other fast similarity computation methods, e.g. hashing
Item-based CF (Sarwar et al., 2001)
Problem. For a given user with past rating history, how to recommend
other items to her?
For each item, compute its correlation with other items
For the given user, aggregate her previous ratings of the items highly
correlated with the current item
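The two steps can be sketched as below. This is an assumption-laden illustration rather than the exact procedure of the cited paper: it uses adjusted cosine as the item similarity (one of the measures named in the experiments), aggregates with a similarity-weighted average, and keeps only positively similar items. The matrix `R2` and the function names are made up for the example.

```python
import numpy as np

def adjusted_cosine(R, i, j):
    """Item-item similarity on user-mean-centered ratings, over users who rated both."""
    means = np.nanmean(R, axis=1)
    both = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])
    di = R[both, i] - means[both]
    dj = R[both, j] - means[both]
    den = np.sqrt((di ** 2).sum()) * np.sqrt((dj ** 2).sum())
    return float((di * dj).sum() / den) if den > 0 else 0.0

def predict_item_based(R, u, i, k=2):
    """Weighted average of user u's ratings on the k items most similar to item i."""
    rated = [j for j in range(R.shape[1]) if j != i and not np.isnan(R[u, j])]
    sims = {j: adjusted_cosine(R, i, j) for j in rated}
    # Assumption: only positively similar items take part in the aggregation
    top = [j for j in sorted(rated, key=lambda j: sims[j], reverse=True)
           if sims[j] > 0][:k]
    den = sum(sims[j] for j in top)
    if den == 0:
        return float(np.nanmean(R[u]))  # fall back to the user's mean rating
    return float(sum(sims[j] * R[u, j] for j in top) / den)

# Hypothetical 3 x 3 example: items 0 and 1 are rated alike, item 2 the opposite way.
R2 = np.array([
    [5.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [5.0, np.nan, 1.0],
])
print(predict_item_based(R2, u=2, i=1))
```

Since the third user rated item 0 highly and item 0 behaves like item 1 for the other users, the prediction for item 1 follows her rating of item 0.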
Item-based CF (2)
Offline computation of item similarity: complexity $O(M N^2)$
Online look-up of similar items does not depend on M or N, but rather on how many items the user purchased/rated in the past
Item similarity is more stable than user similarity; hence it works for users with limited data, even just one item purchase/rating
Experiments
Evaluation Metrics
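Given a testing set of ratings $T$, compare the true ratings $r_{ui}$ with the estimates $\hat{r}_{ui}$
Root mean squared error (RMSE):
$$\mathrm{RMSE} = \left( \frac{1}{|T|} \sum_{r_{ui} \in T} (r_{ui} - \hat{r}_{ui})^2 \right)^{1/2}$$
Mean absolute error (MAE):
$$\mathrm{MAE} = \frac{1}{|T|} \sum_{r_{ui} \in T} |r_{ui} - \hat{r}_{ui}|$$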
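Both metrics are one-liners in NumPy. The sketch below assumes the test set is given as two aligned sequences of true and predicted ratings; the function names and sample values are illustrative.

```python
import numpy as np

def rmse(true, pred):
    """Root mean squared error over aligned true and predicted ratings."""
    true, pred = np.asarray(true, dtype=float), np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((true - pred) ** 2)))

def mae(true, pred):
    """Mean absolute error over aligned true and predicted ratings."""
    true, pred = np.asarray(true, dtype=float), np.asarray(pred, dtype=float)
    return float(np.mean(np.abs(true - pred)))

true_ratings = [5, 3, 4]
predicted = [4, 3, 2]
print(rmse(true_ratings, predicted), mae(true_ratings, predicted))
```

RMSE squares the errors before averaging, so the single error of 2 weighs more heavily in it than in MAE.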
Benchmark: MovieLens Data
A public data set from movielens.org
MovieLens 100K
— users with 20+ ratings
— used 100,000 ratings with a 943 × 1682 user-item matrix
MovieLens 1M
— 1 M ratings, 4K movies, and 6K users.
— About 4% of the ratings are observed.
MovieLens 10M
— 10M ratings, 100K tags, 10K movies, and 72K users
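The quoted share of observed ratings is simply the number of ratings divided by the size of the user-item matrix. A quick check, using the counts listed above (the 1M figures are the rounded 6K and 4K values, so the result is approximate):

```python
def density(num_ratings, num_users, num_items):
    """Fraction of the user-item matrix that is observed."""
    return num_ratings / (num_users * num_items)

ml_100k = density(100_000, 943, 1682)      # about 6.3% observed
ml_1m = density(1_000_000, 6_000, 4_000)   # about 4.2% observed, i.e. "about 4%"

print(f"MovieLens 100K: {ml_100k:.1%}, MovieLens 1M: {ml_1m:.1%}")
```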
Item-based CF: Compare Item Similarities
[Figure: relative performance of different similarity measures; MAE (y-axis, 0.66 to 0.86) for adjusted cosine, pure cosine, and correlation]
Item-based CF: Item Neighborhood Size
[Figure: sensitivity of the neighborhood size; MAE vs. no. of neighbors (10 to 200); legend entries as extracted: itm-itm, itm-reg]
Item-based vs. User-based
[Figure: item-item vs. user-user at selected neighborhood sizes (at x=0.8); MAE vs. no. of neighbors; legend: user-user, item-item, item-item-regression, nonpers]
Summary
Memory-based Collaborative Filtering
Simple and reasonable assumption
Easy to implement
Large storage and computation cost
Hard to deal with cold-start users or items