
(1)

Large Scale Collaborative Filtering Algorithms

Chih-Chao Ma

Department of Computer Science, National Taiwan University

(2)

Outline

Introduction

Singular Value Decomposition
Post-Processing

Experiments
Conclusions

(3)

Introduction

Outline

Introduction

Singular Value Decomposition
Post-Processing

Experiments
Conclusions

(4)

Introduction

Introduction

Recommendation systems give people advice
Web shopping
News (Yahoo) and videos (YouTube)
Collaborative Filtering
Make predictions from taste information, e.g. the ratings of products given by users

(5)

Introduction

Problem Definition

Preferences for m objects from n users
V is the sparse matrix of the scores
Some users do not score some objects
V ∈ R^{n×m} with an indicator I ∈ {0, 1}^{n×m}
Predict the unknown scores in the matrix

Represented by another sparse matrix A ∈ R^{n×m}

(6)

Introduction

Evaluation Criteria

Performance is measured by RMSE (Root Mean Square Error)

P ∈ R^{n×m} is the prediction matrix with indicator J

RMSE(P, A) = sqrt[ Σ_{i=1}^{n} Σ_{j=1}^{m} J_ij (A_ij − P_ij)² / Σ_{i=1}^{n} Σ_{j=1}^{m} J_ij ]

Give advice to users equally
Regardless of the number of scores each user has given
Uniform distribution over users for the test data
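As a concrete reading of this formula, a minimal NumPy sketch is given below; the function name rmse and the argument layout are illustrative, mirroring the symbols P, A, and J on this slide.

import numpy as np

def rmse(P, A, J):
    # P: predictions, A: true scores, J: 0/1 indicator of test entries
    # Only entries with J = 1 contribute to the error.
    sq_err = J * (A - P) ** 2
    return np.sqrt(sq_err.sum() / J.sum())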

(7)

Introduction

The Netflix Prize

A contest held by Netflix

17,770 movies and 480,189 users

100,480,507 training scores (values given)
2,817,131 test scores (values unknown)
Select the 9 most recent scores of each user
Divided into probe, quiz, and test sets

Probe set is used for offline validation
Quiz and test sets are the test data

(8)

Introduction

Related Work

KNN (K-Nearest Neighbors)
Define the similarity between users or objects
Predict an unknown score from other “similar” ones
SVD (Singular Value Decomposition)
Find the features of users and objects
Predict the scores by a predefined function
Regression

(9)

Singular Value Decomposition

Outline

Introduction

Singular Value Decomposition
Post-Processing

Experiments
Conclusions

(10)

Singular Value Decomposition

Formulation

User and object features
V ∈ R^{n×m} is the training matrix
Find feature matrices U ∈ R^{f×n} and M ∈ R^{f×m}
f is the dimension of SVD
Predict scores by a function p(U_i, M_j)

Objective function

E = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{m} I_ij (V_ij − p(U_i, M_j))² + (k_u/2) Σ_{i=1}^{n} ||U_i||² + (k_m/2) Σ_{j=1}^{m} ||M_j||²
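As an illustration only, a minimal NumPy sketch of evaluating this objective is shown below, assuming the dot-product prediction p(U_i, M_j) = U_i^T M_j introduced on the next slide; the function name and argument layout are not from the original slides.

import numpy as np

def objective(U, M, V, I, ku, km):
    # U: f x n user features, M: f x m object features
    # V: n x m training scores, I: n x m 0/1 indicator of known scores
    P = U.T @ M                                   # predictions p(U_i, M_j) = U_i^T M_j
    err = 0.5 * np.sum(I * (V - P) ** 2)          # squared error on known scores only
    reg = 0.5 * ku * np.sum(U ** 2) + 0.5 * km * np.sum(M ** 2)   # regularization terms
    return err + reg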

(11)

Singular Value Decomposition

Prediction Function

Dot product: p(U_i, M_j) = U_i^T M_j
Matrix factorization problem U^T M ≈ V
The scores often have a range [a, b]: p(U_i, M_j) = a + U_i^T M_j
Linear model

−∂E/∂U_i = Σ_{j=1}^{m} I_ij (V_ij − p(U_i, M_j)) M_j − k_u U_i

−∂E/∂M_j = Σ_{i=1}^{n} I_ij (V_ij − p(U_i, M_j)) U_i − k_m M_j

(12)

Singular Value Decomposition

Learning Types

Batch learning optimizes all variables at a time
Look through all training scores
Incomplete incremental learning
Consider one user (or one object) at a time
Complete incremental learning
Consider one score at a time
Score V_ij → feature vectors U_i, M_j

E_ij = (1/2) (V_ij − p(U_i, M_j))² + (k_u/2) ||U_i||² + (k_m/2) ||M_j||²
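A minimal sketch of one such per-score step is given below, using the bounded dot-product prediction a + U_i^T M_j and a gradient-descent update derived from E_ij; the learning rate lr is an assumed hyper-parameter not specified on this slide.

def update_one_score(U, M, V, i, j, a, ku, km, lr):
    # Complete incremental learning: adjust only U_i and M_j from the single score V_ij.
    Ui, Mj = U[:, i].copy(), M[:, j].copy()
    r = V[i, j] - (a + Ui @ Mj)            # residual of the current prediction
    U[:, i] += lr * (r * Mj - ku * Ui)     # negative gradient of E_ij w.r.t. U_i
    M[:, j] += lr * (r * Ui - km * Mj)     # negative gradient of E_ij w.r.t. M_j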

(13)

Singular Value Decomposition

Variants

Add per-user biases α and per-object biases β
p(U_i, M_j, α_i, β_j) = a + U_i^T M_j + α_i + β_j
Also updated by gradient descent
Add constraints on feature vectors
Constrained SVD [Salakhutdinov and Mnih, 2007]
Import a constraint matrix W ∈ R^{f×m}

U_i = Y_i + ( Σ_{k=1}^{m} I_ik W_k ) / ( Σ_{k=1}^{m} I_ik )

Inappropriate for complete incremental learning
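A minimal sketch of assembling the constrained user feature is shown below, assuming Y (f x n) and W (f x m) are NumPy arrays and I is the 0/1 indicator matrix; the handling of users with no scores is an added assumption.

def constrained_user_feature(Y, W, I, i):
    # U_i = Y_i + (sum_k I_ik W_k) / (sum_k I_ik): shift Y_i by the average
    # constraint vector of the objects that user i has scored.
    rated = I[i] == 1
    if not rated.any():
        return Y[:, i].copy()
    return Y[:, i] + W[:, rated].mean(axis=1)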

(14)

Singular Value Decomposition

Compound SVD

Combine both biases and constraints

Optimize three matrices Y, M, W and biases α, β
Update Y, M, α, β by complete incremental learning
Update W by incomplete incremental learning

−∇Y_i = (V_ij − p(U_i, M_j, α_i, β_j)) M_j − k_y Y_i

−∇W_k = I_ik [ Σ_{j=1}^{m} I_ij (V_ij − p(U_i, M_j, α_i, β_j)) M_j ] / [ Σ_{k=1}^{m} I_ik ] − I_ik k_w W_k
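A minimal sketch of the incomplete incremental step on W for a single user i is shown below; lr and kw are assumed hyper-parameters, and constrained_user_feature is the helper sketched on the previous slide.

def update_W_for_user(W, M, Y, V, I, alpha, beta, a, i, kw, lr):
    # One gradient step on the constraint vectors of every object scored by user i.
    rated = I[i] == 1
    c = int(rated.sum())
    if c == 0:
        return
    Ui = constrained_user_feature(Y, W, I, i)
    r = V[i, rated] - (a + Ui @ M[:, rated] + alpha[i] + beta[rated])   # residuals
    g = (M[:, rated] * r).sum(axis=1) / c          # shared part of the negative gradient
    W[:, rated] += lr * (g[:, None] - kw * W[:, rated])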

(15)

Post-Processing

Outline

Introduction

Singular Value Decomposition
Post-Processing

Experiments
Conclusions

(16)

Post-Processing

Post-Processing

An SVD algorithm predicts scores with errors
Estimate the test errors by the training errors
R ∈ R^{n×m} is the residual matrix of the training data
R_ij = V_ij − p(U_i, M_j, α_i, β_j) if V_ij exists
Update the biases by the residuals
R̄_j is the average residual of object j

β_j ← β_j + R̄_j

R_ij ← R_ij − R̄_j if I_ij = 1
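A minimal NumPy sketch of this shift is given below, assuming R holds the training residuals (zero where no score exists), I is the indicator matrix, and beta is the per-object bias vector; the function name is illustrative.

import numpy as np

def shift_object_biases(R, I, beta):
    # Average residual of each object over its known scores.
    counts = I.sum(axis=0)
    avg = np.where(counts > 0, (I * R).sum(axis=0) / np.maximum(counts, 1), 0.0)
    beta += avg          # absorb the average residual into the object bias
    R -= I * avg         # residuals of the known scores shrink accordingly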

(17)

Post-Processing

Regression

Build a model for each user
Ridge regression
Features X ∈ R^{t×f}, target values Y ∈ R^{t×1}
Involve a kernel function K(x_i, x_j)

ŷ = K(x, X) (K(X, X) + λI_t)^{-1} Y

K(x_i, x_j) = (x_i x_j^T)^p if x_i x_j^T ≥ 0, and 0 if x_i x_j^T < 0

p between 5 and 20 works well in experiments
Only trust a neighbor with high similarity
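A minimal sketch of one per-user model is given below. Following the later slide that applies KRR on residuals, it is assumed here that the rows of X are feature vectors of the objects the user has scored and Y holds the corresponding training residuals; p and lam (λ) are hyper-parameters, and the function names are illustrative.

import numpy as np

def kernel(Xa, Xb, p):
    # Truncated polynomial kernel: (x_i x_j^T)^p if the dot product is non-negative, else 0.
    return np.maximum(Xa @ Xb.T, 0.0) ** p

def krr_predict(x, X, Y, p, lam):
    # y_hat = K(x, X) (K(X, X) + lam I_t)^{-1} Y for one query feature vector x.
    t = X.shape[0]
    alpha = np.linalg.solve(kernel(X, X, p) + lam * np.eye(t), Y)
    return kernel(x[None, :], X, p) @ alpha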

(18)

Post-Processing

Acceleration

The computation of (K(X, X) + λI_t)^{-1} is expensive
Kernel function and inversion on a t × t matrix
Use a threshold on t
Acceleration under the polynomial kernel
Replace K(X, X) with I_t:

(K(X, X) + λI_t)^{-1} = (1/(1 + λ)) I_t

Still use the kernel function in prediction

(19)

Post-Processing

Weighted Average

The simplified algorithm is like a weighted sum

ŷ = K(x, X) (1/(1 + λ)) I_t Y = Σ_{a=1}^{t} K(x, x_a) y_a / (1 + λ)

A weighted average is more reliable
Modify the form again

ŷ = Σ_{a=1}^{t} K(x, x_a) y_a / ( Σ_{a=1}^{t} K(x, x_a) + λ )
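A minimal sketch of this final weighted-average predictor, reusing the truncated polynomial kernel from the regression slide; as before, the rows of X and the entries of Y are assumed to be the scored objects' feature vectors and their training residuals.

import numpy as np

def weighted_average_predict(x, X, Y, p, lam):
    # y_hat = sum_a K(x, x_a) y_a / (sum_a K(x, x_a) + lam): no t x t inversion needed.
    w = np.maximum(X @ x, 0.0) ** p        # similarities K(x, x_a)
    return (w @ Y) / (w.sum() + lam)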

(20)

Experiments

Outline

Introduction

Singular Value Decomposition
Post-Processing

Experiments
Conclusions

(21)

Experiments

Data Sets

Movielens

6,040 users and 3,706 movies
1,000,209 scores (density = 4.61%)
Select 3 scores of each user as test data

Netflix

480,189 users and 17,770 movies
100,480,507 scores (density = 1.18%)
Use the probe set (1,408,395 scores) as test data

(22)

Experiments

Algorithms in Experiments

Algorithms used for comparison

AVGB: a simple baseline predictor P_ij = µ_j + b_i
SVDNR: SVD without regularization terms
SVD: SVD with complete incremental learning
SVDUSER: SVD with incomplete incremental learning in the order of users
CSVD: the compound SVD algorithm

Dataset     AVGB     SVDNR    SVD      CSVD
Movielens   0.9313   0.8796   0.8713   0.8706
Netflix     0.9879   0.9280   0.9229   0.9178

(23)

Experiments

Performance versus Time

(24)

Experiments

Post-Processing Algorithms

Start from the best algorithm, CSVD

CSVD: compound SVD without post-processing
SHIFT: update biases by training residuals
KNN: K-nearest neighbors on residuals
KRR: kernel ridge regression on residuals
WAVG: weighted average on residuals

Dataset     CSVD     SHIFT    KNN      KRR      WAVG
Movielens   0.8706   0.8677   0.8644   0.8635   0.8621
Netflix     0.9178   0.9175   0.9139   0.9101   0.9097

(25)

Experiments

Competition for Netflix Prize

Use compound SVD with dimension f = 256
Use the probe set for validation
Apply the weighted average algorithm

RMSE = 0.8868, 35th place at the time of submission
6.79% better than the baseline RMSE of 0.9514 (a 10% improvement is required for the Grand Prize)

Ordinary SVD gives results in the range 0.91–0.93

(26)

Conclusions

Outline

Introduction

Singular Value Decomposition
Post-Processing

Experiments
Conclusions

(27)

Conclusions

Conclusions

Complete incremental learning for SVD
Update after looking at a single score

Compound SVD outperforms the original SVD
Combination of biases and constraints

Post-processing algorithms on the residuals
KNN, regression, weighted average
