Large Scale Collaborative Filtering Algorithms
Chih-Chao Ma
Department of Computer Science, National Taiwan University
Outline
- Introduction
- Singular Value Decomposition
- Post-Processing
- Experiments
- Conclusions
Introduction
- Recommendation systems give people advice
  - Web shopping
  - News (Yahoo) and videos (YouTube)
- Collaborative filtering
  - Makes predictions from taste information, e.g. the ratings of products given by users
Problem Definition
- Preferences for m objects from n users
- V ∈ R^{n×m} is the sparse matrix of the scores, with an indicator I ∈ {0, 1}^{n×m}
  - Some users do not score some objects
- Predict the unknown scores in the matrix, represented by another sparse matrix A ∈ R^{n×m}
Evaluation Criteria
- Performance is measured by RMSE (Root Mean Square Error)
- P ∈ R^{n×m} is the prediction matrix, with indicator J
RMSE(P, A) = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{m} J_{ij} (A_{ij} - P_{ij})^2}{\sum_{i=1}^{n} \sum_{j=1}^{m} J_{ij}}}
- Give advice to users equally, regardless of their numbers of scores
- Test data is distributed uniformly over users
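To make the criterion concrete, here is a minimal NumPy sketch of this RMSE over dense matrices (the function name and dense layout are assumptions for illustration; the real data is sparse):

```python
import numpy as np

def rmse(P, A, J):
    """RMSE over the scored entries only.

    P : n x m prediction matrix
    A : n x m answer matrix
    J : n x m 0/1 indicator of which answers exist
    """
    J = J.astype(bool)
    # Average the squared errors only where J_ij = 1, then take the square root.
    return float(np.sqrt(np.mean((A[J] - P[J]) ** 2)))
```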
The Netflix Prize
- A contest held by Netflix
- 17,770 movies and 480,189 users
- 100,480,507 training scores (values given)
- 2,817,131 test scores (values unknown)
  - Select the 9 most recent scores of each user
  - Divided into probe, quiz, and test sets
- The probe set is used for offline validation; the quiz and test sets are the test data
Related Work
- KNN (K-Nearest Neighbors)
  - Define the similarity between users or objects
  - Predict an unknown score from other "similar" ones
- SVD (Singular Value Decomposition)
  - Find the features of users and objects
  - Predict the scores by a predefined function
- Regression
Singular Value Decomposition
Formulation
- User and object features
  - V ∈ R^{n×m} is the training matrix
  - Find feature matrices U ∈ R^{f×n} and M ∈ R^{f×m}, where f is the dimension of the SVD
  - Predict scores by a function p(U_i, M_j)
- Objective function:
E = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} I_{ij} (V_{ij} - p(U_i, M_j))^2 + \frac{k_u}{2} \sum_{i=1}^{n} \|U_i\|^2 + \frac{k_m}{2} \sum_{j=1}^{m} \|M_j\|^2
Prediction Function
- Dot product: p(U_i, M_j) = U_i^T M_j
  - A matrix factorization problem, U^T M ≈ V
- The scores often have a range [a, b], so use p(U_i, M_j) = a + U_i^T M_j
- Linear model: negative gradients
-\frac{\partial E}{\partial U_i} = \sum_{j=1}^{m} I_{ij} (V_{ij} - p(U_i, M_j)) M_j - k_u U_i

-\frac{\partial E}{\partial M_j} = \sum_{i=1}^{n} I_{ij} (V_{ij} - p(U_i, M_j)) U_i - k_m M_j
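A small NumPy sketch of these negative gradients for the linear model, written densely for clarity (names, the dense layout, and the regularization values are illustrative assumptions):

```python
import numpy as np

def negative_gradients(V, I, U, M, a, ku, km):
    """Negative gradients of E for the linear model p(U_i, M_j) = a + U_i^T M_j.

    V, I : n x m score matrix and its 0/1 indicator
    U    : n x f user features (row i is U_i)
    M    : m x f object features (row j is M_j)
    """
    Err = I * (V - (a + U @ M.T))        # I_ij (V_ij - p(U_i, M_j))
    neg_dE_dU = Err @ M - ku * U         # row i is -dE/dU_i
    neg_dE_dM = Err.T @ U - km * M       # row j is -dE/dM_j
    return neg_dE_dU, neg_dE_dM
```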
Learning Types
- Batch learning optimizes all variables at a time
  - Looks through all training scores
- Incomplete incremental learning
  - Considers one user (or one object) at a time
- Complete incremental learning
  - Considers one score at a time
  - A score V_ij updates the feature vectors U_i and M_j with the per-score objective:
E_{ij} = \frac{1}{2} (V_{ij} - p(U_i, M_j))^2 + \frac{k_u}{2} \|U_i\|^2 + \frac{k_m}{2} \|M_j\|^2
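A minimal sketch of complete incremental learning under the linear model, assuming gradient steps with a fixed learning rate (all names, initial values, and hyperparameters are illustrative, not the settings used here):

```python
import numpy as np

def train_complete_incremental(scores, n, m, f=16, lr=0.01, ku=0.02, km=0.02,
                               a=1.0, epochs=30, seed=0):
    """scores: list of (i, j, v) triples, processed one observed score at a time."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n, f))   # user feature vectors U_i
    M = 0.1 * rng.standard_normal((m, f))   # object feature vectors M_j
    for _ in range(epochs):
        for i, j, v in scores:
            ui, mj = U[i].copy(), M[j].copy()
            err = v - (a + ui @ mj)              # V_ij - p(U_i, M_j)
            U[i] += lr * (err * mj - ku * ui)    # step along -dE_ij/dU_i
            M[j] += lr * (err * ui - km * mj)    # step along -dE_ij/dM_j
    return U, M
```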
Variants
- Add per-user biases α and per-object biases β
  - p(U_i, M_j, α_i, β_j) = a + U_i^T M_j + α_i + β_j
  - The biases are also updated by gradient descent
- Add constraints on the feature vectors
  - Constrained SVD [Salakhutdinov and Mnih, 2007]
  - Introduce a constraint matrix W ∈ R^{f×m}:

U_i = Y_i + \frac{\sum_{k=1}^{m} I_{ik} W_k}{\sum_{k=1}^{m} I_{ik}}

- Inappropriate for complete incremental learning
Compound SVD
- Combine both biases and constraints
- Optimize three matrices Y, M, W and the biases α, β
  - Update Y, M, α, β by complete incremental learning
  - Update W by incomplete incremental learning
-\nabla_{Y_i} = (V_{ij} - p(U_i, M_j, \alpha_i, \beta_j)) M_j - k_y Y_i

-\nabla_{W_k} = I_{ik} \sum_{j=1}^{m} I_{ij} \frac{(V_{ij} - p(U_i, M_j, \alpha_i, \beta_j)) M_j}{\sum_{k=1}^{m} I_{ik}} - I_{ik} k_w W_k
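To make the compound model concrete, here is a sketch of one prediction, where the effective user feature is Y_i plus the average constraint vector W_k over the objects the user has scored (the dense indicator and names are assumptions):

```python
import numpy as np

def compound_predict(i, j, Y, M, W, I, alpha, beta, a=1.0):
    """p(U_i, M_j, alpha_i, beta_j) = a + U_i^T M_j + alpha_i + beta_j,
    with U_i = Y_i + (sum_k I_ik W_k) / (sum_k I_ik).

    Y : n x f, M : m x f, W : m x f (row k is W_k), I : n x m 0/1 indicator."""
    rated = I[i].astype(bool)                              # objects scored by user i
    Ui = Y[i] + (W[rated].mean(axis=0) if rated.any() else 0.0)
    return a + Ui @ M[j] + alpha[i] + beta[j]
```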
Post-Processing
- An SVD algorithm predicts scores with errors
  - Estimate the test errors by the training errors
- R ∈ R^{n×m} is the residual matrix of the training data
  - R_ij = V_ij − p(U_i, M_j, α_i, β_j) if V_ij exists
- Update the biases by the residuals
  - R̄_j is the average residual of object j
  - β_j ← β_j − R̄_j
  - R_ij ← R_ij − R̄_j if I_ij = 1
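A short sketch of this shift step on a dense residual matrix, following the update rules exactly as written above (layout and names are illustrative):

```python
import numpy as np

def shift_by_object_residuals(R, I, beta):
    """SHIFT-style post-processing on the training residuals.

    R    : n x m residual matrix (only entries with I_ij = 1 are meaningful)
    I    : n x m 0/1 indicator of training scores
    beta : length-m object biases
    """
    counts = I.sum(axis=0)                                # scores per object
    rbar = (R * I).sum(axis=0) / np.maximum(counts, 1)    # average residual R̄_j
    beta = beta - rbar                                    # beta_j <- beta_j - R̄_j
    R = R - I * rbar                                      # R_ij <- R_ij - R̄_j where I_ij = 1
    return R, beta
```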
Regression
- Build a model for each user
- Ridge regression
  - Features X ∈ R^{t×f}, target values Y ∈ R^{t×1}
  - Involves a kernel function K(x_i, x_j):
\hat{y} = K(x, X) (K(X, X) + \lambda I_t)^{-1} Y

K(x_i, x_j) = \begin{cases} (x_i x_j^T)^p & \text{if } x_i x_j^T \ge 0 \\ 0 & \text{if } x_i x_j^T < 0 \end{cases}

- p = 5 to 20 works well in experiments
- Only trust a neighbor with high similarity
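A per-user kernel ridge regression sketch using the thresholded polynomial kernel above (the values of λ and p and the dense solve are illustrative choices, not the exact settings reported here):

```python
import numpy as np

def poly_kernel(XA, XB, p=10):
    """K(x_i, x_j) = (x_i x_j^T)^p if x_i x_j^T >= 0, else 0."""
    S = XA @ XB.T
    return np.where(S > 0.0, S, 0.0) ** p

def krr_predict(x, X, Y, lam=0.5, p=10):
    """Kernel ridge regression prediction for one user.

    X : t x f feature rows, Y : length-t target values, x : length-f query row."""
    t = X.shape[0]
    alpha = np.linalg.solve(poly_kernel(X, X, p) + lam * np.eye(t), Y)   # (K + λI_t)^{-1} Y
    return float(poly_kernel(x[None, :], X, p) @ alpha)                  # K(x, X) (K + λI_t)^{-1} Y
```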
Acceleration
- The computation of (K(X, X) + λI_t)^{-1} is expensive
  - Kernel evaluation and inversion of a t × t matrix
  - Use a threshold on t
- Acceleration under the polynomial kernel
  - Replace K(X, X) with I_t, so (K(X, X) + λI_t)^{-1} = (1 / (1 + λ)) I_t
  - Still use the kernel function in prediction
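With this simplification the inverse disappears and prediction reduces to one kernel row scaled by 1 / (1 + λ); a tiny sketch under the same assumed polynomial kernel:

```python
import numpy as np

def fast_krr_predict(x, X, Y, lam=0.5, p=10):
    """Accelerated prediction with K(X, X) replaced by I_t:
    y_hat = K(x, X) Y / (1 + λ)."""
    s = X @ x
    w = np.where(s > 0.0, s, 0.0) ** p     # K(x, x_a); the kernel is still used here
    return float(w @ Y / (1.0 + lam))
```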
Weighted Average
- The simplified algorithm is like a weighted sum:

\hat{y} = K(x, X) \frac{1}{1 + \lambda} I_t Y = \sum_{a=1}^{t} \frac{K(x, x_a)}{1 + \lambda} y_a

- A weighted average is more reliable, so modify the form again:

\hat{y} = \frac{\sum_{a=1}^{t} K(x, x_a) y_a}{\sum_{a=1}^{t} K(x, x_a) + \lambda}
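The weighted-average form only changes the denominator; a minimal sketch under the same assumed kernel:

```python
import numpy as np

def weighted_average_predict(x, X, Y, lam=0.5, p=10):
    """y_hat = sum_a K(x, x_a) y_a / (sum_a K(x, x_a) + λ)."""
    s = X @ x
    w = np.where(s > 0.0, s, 0.0) ** p     # kernel weights K(x, x_a)
    return float(w @ Y / (w.sum() + lam))
```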
Experiments
Data Sets
- MovieLens
  - 6,040 users and 3,706 movies
  - 1,000,209 scores (density = 4.61%)
  - Select 3 scores of each user as test data
- Netflix
  - 480,189 users and 17,770 movies
  - 100,480,507 scores (density = 1.18%)
  - Use the probe set (1,408,395 scores) as test data
Algorithms in Experiments
Algorithms used for comparison:
- AVGB: a simple baseline predictor, P_ij = μ_j + b_i (a small sketch follows the table below)
- SVDNR: SVD without regularization terms
- SVD: SVD with complete incremental learning
- SVDUSER: SVD with incomplete incremental learning in the order of users
- CSVD: the compound SVD algorithm
Dataset     AVGB     SVDNR    SVD      CSVD
MovieLens   0.9313   0.8796   0.8713   0.8706
Netflix     0.9879   0.9280   0.9229   0.9178
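The AVGB baseline is only stated as P_ij = μ_j + b_i; one plausible reading, sketched under the assumption that μ_j is the mean score of object j and b_i is user i's average offset from those means:

```python
import numpy as np

def avgb_predict(V, I):
    """Assumed AVGB-style baseline: P_ij = mu_j + b_i."""
    I = I.astype(bool)
    n, m = V.shape
    # mu_j: mean of the observed scores of object j (0 if the object has no scores).
    mu = np.array([V[I[:, j], j].mean() if I[:, j].any() else 0.0 for j in range(m)])
    dev = V - mu                                               # offsets from the object means
    # b_i: user i's average offset over the objects that user scored.
    b = np.array([dev[i, I[i]].mean() if I[i].any() else 0.0 for i in range(n)])
    return mu[None, :] + b[:, None]                            # P_ij = mu_j + b_i
```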
Performance versus Time

[Figure: performance versus training time]
Post-Processing Algorithms
- Start from the best algorithm, CSVD
- CSVD: compound SVD without post-processing
- SHIFT: update the biases by the training residuals
- KNN: K-nearest neighbors on the residuals
- KRR: kernel ridge regression on the residuals
- WAVG: weighted average on the residuals
Dataset     CSVD     SHIFT    KNN      KRR      WAVG
MovieLens   0.8706   0.8677   0.8644   0.8635   0.8621
Netflix     0.9178   0.9175   0.9139   0.9101   0.9097
Competition for Netflix Prize
- Use compound SVD with dimension f = 256
- Use the probe set for validation
- Apply the weighted average algorithm
- RMSE = 0.8868, 35th place when submitted
  - 6.79% better than the baseline RMSE of 0.9514 (a 10% improvement is required for the Grand Prize)
- Ordinary SVD gives results around 0.91 to 0.93
Conclusions
- Complete incremental learning for SVD
  - Update after looking at a single score
- Compound SVD outperforms the original SVD
  - A combination of biases and constraints
- Post-processing algorithms on the residuals
  - KNN, regression, and weighted average