Matrix Factorization and Factorization Machines for Recommender Systems
Chih-Jen Lin
Department of Computer Science National Taiwan University
Talk at 4th Workshop on Large-Scale Recommender Systems, ACM RecSys, 2016
Outline
1 Matrix factorization
2 Factorization machines
3 Field-aware factorization machines
4 Optimization methods for large-scale training
5 Discussion and conclusions
Matrix Factorization
Matrix Factorization is an effective method for recommender systems (e.g., Netflix Prize and KDD Cup 2011)
A group of users give ratings to some items:

User  Item  Rating
   1     5     100
   1    13      30
 ...   ...     ...
   u     v     r_{u,v}
 ...   ...     ...
The information can be represented by a rating matrix R
Matrix Factorization (Cont’d)
[Figure: the m × n rating matrix R; entry r_{u,v} sits at row u, column v, and the unknown entry ?_{2,2} is marked]

m, n: numbers of users and items
u, v: indices of the uth user and the vth item
r_{u,v}: the rating the uth user gives to the vth item
Matrix Factorization (Cont’d)
[Figure: R (m × n) ≈ P^T (m × k) × Q (k × n); the rows of P^T are p_1^T, ..., p_m^T, the columns of Q are q_1, ..., q_n, and the unknown entry ?_{2,2} is marked]

k: number of latent dimensions

r_{u,v} = p_u^T q_v

The unknown entry is predicted as ?_{2,2} = p_2^T q_2
Matrix Factorization (Cont’d)
A non-convex optimization problem:
\min_{P,Q} \sum_{(u,v) \in R} \Big( (r_{u,v} - p_u^T q_v)^2 + \lambda_P \|p_u\|_F^2 + \lambda_Q \|q_v\|_F^2 \Big)

\lambda_P and \lambda_Q are regularization parameters
Many optimization methods have been successfully applied
Overall, MF is a mature technique
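To make the objective concrete, here is a minimal NumPy sketch (an illustration only; the function and variable names are my own) that evaluates the regularized error over the observed ratings:

```python
import numpy as np

def mf_objective(ratings, P, Q, lam_p, lam_q):
    """ratings: list of (u, v, r) triples; P is k x m, Q is k x n."""
    total = 0.0
    for u, v, r in ratings:
        e = r - P[:, u] @ Q[:, v]  # residual r_{u,v} - p_u^T q_v
        total += e ** 2 + lam_p * (P[:, u] @ P[:, u]) + lam_q * (Q[:, v] @ Q[:, v])
    return total
```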
Outline
1 Matrix factorization
2 Factorization machines
3 Field-aware factorization machines
4 Optimization methods for large-scale training
5 Discussion and conclusions
MF versus Classification/Regression
MF solves

\min_{P,Q} \sum_{(u,v) \in R} \big( r_{u,v} - p_u^T q_v \big)^2
Note that I omit the regularization term. Ratings are the only given information
This doesn’t sound like a classification or regression problem
But indeed we can make some interesting connections
Handling User/Item Features
What if instead of user/item IDs we are given user and item features?
Assume user u and item v have feature vectors f_u ∈ R^U and g_v ∈ R^V, where

U ≡ number of user features
V ≡ number of item features

How can we use these features to build a model?
Handling User/Item Features (Cont’d)
We can consider a regression problem where each data instance is

value      features
 ...         ...
r_{u,v}    f_u^T  g_v^T
 ...         ...

and solve

\min_{w} \sum_{(u,v) \in R} \left( r_{u,v} - w^T \begin{bmatrix} f_u \\ g_v \end{bmatrix} \right)^2
Feature Combinations
However, this does not take the interaction between users and items into account
Following the concept of degree-2 polynomial mappings in SVM, we can generate new features
(f_u)_t (g_v)_s, \quad t = 1, \ldots, U, \; s = 1, \ldots, V

and solve

\min_{w_{t,s} \, \forall t,s} \sum_{(u,v) \in R} \left( r_{u,v} - \sum_{t=1}^{U} \sum_{s=1}^{V} w_{t,s} (f_u)_t (g_v)_s \right)^2
Feature Combinations (Cont’d)
This is equivalent to

\min_{W} \sum_{(u,v) \in R} \left( r_{u,v} - f_u^T W g_v \right)^2,

where W ∈ R^{U×V} is a matrix

If we define vec(W) by concatenating W's columns, another form is

\min_{W} \sum_{(u,v) \in R} \left( r_{u,v} - \text{vec}(W)^T \begin{bmatrix} \vdots \\ (f_u)_t (g_v)_s \\ \vdots \end{bmatrix} \right)^2
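As a quick sanity check of this equivalence, a small NumPy sketch (my own illustration, not from the talk) verifying that f_u^T W g_v matches vec(W)^T applied to the vector of products (f_u)_t (g_v)_s:

```python
import numpy as np

rng = np.random.default_rng(0)
U, V = 4, 3
f, g = rng.normal(size=U), rng.normal(size=V)
W = rng.normal(size=(U, V))

bilinear = f @ W @ g  # f_u^T W g_v
# np.outer(f, g)[t, s] = (f)_t (g)_s; flattening in column-major ('F')
# order matches concatenating W's columns into vec(W)
products = np.outer(f, g).flatten(order="F")
assert np.isclose(bilinear, W.flatten(order="F") @ products)
```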
Feature Combinations (Cont’d)
However, this setting fails for extremely sparse features
Consider the most extreme situation. Assume we have
user ID and item ID as features
Then

U = m, \; V = n, \quad f_u = [\underbrace{0, \ldots, 0}_{u-1}, 1, 0, \ldots, 0]^T,

and similarly for g_v
Feature Combinations (Cont’d)
The optimal solution is

W_{u,v} = \begin{cases} r_{u,v}, & \text{if } (u,v) \in R \\ 0, & \text{if } (u,v) \notin R \end{cases}

We can never predict

r_{u,v}, \quad (u,v) \notin R
Factorization Machines
The reason why we cannot predict unseen data is that in the optimization problem

# variables = mn
# instances = |R|

Overfitting occurs. Remedy: we can let

W ≈ P^T Q,

where P and Q are low-rank matrices. This becomes matrix factorization
Factorization Machines (Cont’d)
This can be generalized to sparse user and item features
\min_{P,Q} \sum_{(u,v) \in R} \left( r_{u,v} - f_u^T P^T Q g_v \right)^2

That is, we think

P f_u \text{ and } Q g_v

are latent representations of user u and item v, respectively

We can also consider the interaction between elements in f_u (or elements in g_v)
Factorization Machines (Cont’d)
The new formulation is

\min_{P,Q} \sum_{(u,v) \in R} \left( r_{u,v} - \begin{bmatrix} f_u^T & g_v^T \end{bmatrix} \begin{bmatrix} P^T \\ Q^T \end{bmatrix} \begin{bmatrix} P & Q \end{bmatrix} \begin{bmatrix} f_u \\ g_v \end{bmatrix} \right)^2
This becomes factorization machines (Rendle, 2010)
Factorization Machines (Cont’d)
Similar ideas have been used in other places such as Stern et al. (2009)
We see that such ideas can be used not only for recommender systems.
They may be useful for any classification problems with very sparse features
FM for Classification
In a classification setting, assume a data instance is x ∈ R^n

Linear model:

w^T x

Degree-2 polynomial mapping:

x^T W x
FM for Classification (Cont’d)
FM:

x^T P^T P x

or alternatively

\sum_{i,j} x_i \, p_i^T p_j \, x_j,

where p_i, p_j ∈ R^k
That is, in FM each feature is associated with a latent factor
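In code, the FM score can be computed without forming the double sum explicitly, since x^T P^T P x = ‖Px‖²; a minimal sketch (the names are my own):

```python
import numpy as np

def fm_score(x, P):
    """FM score x^T P^T P x; P is k x n, with column i the latent vector p_i.

    Computing z = P x costs O(nk), versus O(n^2 k) for the naive
    double sum over x_i p_i^T p_j x_j.
    """
    z = P @ x
    return z @ z
```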
Outline
1 Matrix factorization
2 Factorization machines
3 Field-aware factorization machines
4 Optimization methods for large-scale training
5 Discussion and conclusions
Field-aware Factorization Machines
We have seen that FM seems to be useful for handling highly sparse features such as user IDs
What if we have more than two ID fields?
For example, in CTR (click-through rate) prediction for computational advertising, we may have
clicked   features
  ...       ...
  Yes     user ID, Ad ID, site ID
  ...       ...
Field-aware Factorization Machines (Cont’d)
FM can be generalized to handle different interactions between fields
Two latent matrices for user ID and Ad ID
Two latent matrices for user ID and site ID
...
We call this approach FFM (field-aware factorization machines)
An early study on three fields is Rendle and Schmidt-Thieme (2010)
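To make the field-aware idea concrete, here is a sketch of an FFM score over a sparse 0/1 instance (my own notation, not the reference implementation; each feature keeps one latent vector per field):

```python
import numpy as np

def ffm_score(active, W, field_of):
    """FFM score for a sparse instance whose active features are 0/1.

    active:   list of indices of the active features
    W:        array of shape (n_features, n_fields, k); W[i, f] is the
              latent vector that feature i uses against field f
    field_of: field index of each feature
    """
    score = 0.0
    for a, i in enumerate(active):
        for j in active[a + 1:]:
            # feature i uses its vector for j's field, and vice versa
            score += W[i, field_of[j]] @ W[j, field_of[i]]
    return score
```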
FFM for CTR Prediction
It was used by Jahrer et al. (2012) to win the 2nd prize of KDD Cup 2012

In 2014 my students used FFM to win two Kaggle CTR competitions

After we used FFM to win the first competition, in the second competition all top teams used FFM

Note that for CTR prediction, the logistic rather than the squared loss is used
Practical Use of FFM
Recently we conducted a detailed study on FFM (Juan et al., 2016)
Here I briefly discuss some results from that study
Numerical Features
For categorical features like IDs, we have

ID: field
ID index: feature

Each field has many 0/1 features
But how about numerical features?
Two possibilities:

Dummy fields: the field has only one real-valued feature
Discretization: transform a numerical feature into a categorical one, and then into many binary features
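A minimal sketch of the discretization option (the equal-width binning and the bin count are my own arbitrary choices):

```python
import numpy as np

def discretize(values, n_bins=10):
    """Map a numerical feature to 0/1 features via equal-width bins."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    bins = np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)
    onehot = np.zeros((len(values), n_bins))
    onehot[np.arange(len(values)), bins] = 1.0
    return onehot
```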
Normalization
After obtaining the feature vector, empirically we find that instance-wise normalization is useful: faster convergence and better test accuracy
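A one-function sketch of instance-wise normalization (assuming, as one common choice, division by each instance's Euclidean norm):

```python
import numpy as np

def normalize_instances(X):
    """Scale each row (instance) of X to unit Euclidean norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0.0, 1.0, norms)
```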
Impact of Parameters
We have the following parameters:

k: number of latent factors
λ: regularization parameter
parameters of the optimization methods (e.g., the learning rate of stochastic gradient)

Performance is more sensitive to some of these than to others
Example: Regularization Parameter λ
[Figure: test logloss versus training epochs for λ = 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}]
Too large λ: the model is not good
Too small λ: a better model, but overfitting occurs easily

Similar situations occur for SG learning rates

Early stopping by a validation procedure is needed
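A minimal sketch of such an early-stopping loop (the two callables are hypothetical placeholders for one training epoch and the validation logloss):

```python
def train_with_early_stopping(run_epoch, val_logloss, max_epochs=100, patience=3):
    """Stop once validation logloss fails to improve for `patience` epochs."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(max_epochs):
        run_epoch()                 # hypothetical: one pass of SG training
        loss = val_logloss()        # hypothetical: logloss on a validation set
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch, best_loss
```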
Experiments: Two CTR Sets
Set 1:

method   test logloss   rank
Linear   0.46224        91
Poly2    0.44956        14
FM       0.44922        14
FM       0.44847        11
FFM      0.44603         3

Set 2:

method   test logloss   rank
Linear   0.38833        64
Poly2    0.38347        10
FM       0.38407        11
FM       0.38531        15
FFM      0.38223         6

For the same method (e.g., FM), we try different parameters
Experiments: Two CTR Sets (Cont’d)
For these two sets, FFM is the best
For winning competitions, some additional tricks are used
Experiments: Other Sets
• Can FFM work well for other sets? Can we identify when it's useful?
• We try the following data
Data Set # instances # features # fields
KDD2010-bridge 20,012,499 651,166 9
KDD2012 20,950,284 19,147,857 11
phishing 11,055 100 30
adult 48,842 308 14
cod-rna (dummy fields) 331,152 8 8
cod-rna (discretization) 331,152 2,296 8
ijcnn (dummy fields) 141,691 22 22
ijcnn (discretization) 141,691 69,867 22
Experiments: Other Sets (Cont’d)
Data Set                    LM        Poly2      FM         FFM
KDD2010-bridge              0.30910   0.27448    0.28437    *0.26899
KDD2012                     0.49375   0.49640    0.49292    *0.48700
phishing                    0.11493   0.09659    0.09461    *0.09374
adult                       0.30897   *0.30757   0.30959    0.30760
cod-rna (dummy fields)      0.13829   0.12874    *0.12580   0.12914
cod-rna (discretization)    0.16455   0.17576    0.16570    *0.14993
ijcnn (dummy fields)        0.20627   0.09209    0.07696    *0.07668
ijcnn (discretization)      0.21686   0.22546    0.22259    *0.18635

The best result for each data set is marked with *
Experiments: Other Sets (Cont’d)
For data with categorical features, FFM works well

For some data (e.g., adult), feature interactions are not useful
It’s not easy for FFM to handle numerical features
Outline
1 Matrix factorization
2 Factorization machines
3 Field-aware factorization machines
4 Optimization methods for large-scale training
5 Discussion and conclusions
Solving the Optimization Problem
MF, FM, and FFM all involve optimization problems. The optimization techniques for them are related, but differ due to the different problem structures

Given the time constraint, I will only briefly discuss some optimization techniques for matrix factorization
Matrix Factorization
Recall we have a non-convex optimization problem:
\min_{P,Q} \sum_{(u,v) \in R} \Big( (r_{u,v} - p_u^T q_v)^2 + \lambda_P \|p_u\|_F^2 + \lambda_Q \|q_v\|_F^2 \Big)
Existing optimization techniques include

ALS: alternating least squares
CD: coordinate descent
SG: stochastic gradient
Complexity in Training MF
To update P and Q once:

ALS: O(|R|k^2 + (m + n)k^3)
CD: O(|R|k)

To go through the |R| elements once:

SG: O(|R|k)
I don’t discuss details, but this indicates that CD and SG are generally more efficient
Stochastic Gradient for Matrix Factorization
SG update rule:
p_u ← p_u + γ (e_{u,v} q_v − λ_P p_u)
q_v ← q_v + γ (e_{u,v} p_u − λ_Q q_v)

where

e_{u,v} ≡ r_{u,v} − p_u^T q_v

Two issues:

SG is sensitive to the learning rate
SG is inherently sequential
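A direct NumPy sketch of this update rule (illustrative only; γ, λ_P, λ_Q as defined above):

```python
import numpy as np

def sg_update(P, Q, u, v, r, gamma, lam_p, lam_q):
    """One SG step on rating (u, v, r); P is k x m, Q is k x n."""
    e = r - P[:, u] @ Q[:, v]   # e_{u,v} = r_{u,v} - p_u^T q_v
    pu_old = P[:, u].copy()     # q_v's update uses the old p_u
    P[:, u] += gamma * (e * Q[:, v] - lam_p * P[:, u])
    Q[:, v] += gamma * (e * pu_old - lam_q * Q[:, v])
```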
SG’s Learning Rate
We can apply advanced settings such as ADAGRAD (Duchi et al., 2011)
Each element of latent vectors pu, qv has its own learning rate
Maintaining so many learning rates can be quite expensive
How about a modification that lets the whole p_u (or the whole q_v) be associated with a single rate? (Chin et al., 2015b)

This is an example of taking MF's properties into account
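A sketch of this per-vector scheme (my reading of the idea, with details simplified): keep one accumulated scalar per latent vector instead of one per coordinate as in ADAGRAD.

```python
import numpy as np

def per_vector_sg_update(P, Q, G_p, G_q, u, v, r, eta, lam_p, lam_q):
    """SG step where each latent vector shares one adaptive learning rate.

    G_p[u], G_q[v]: accumulated average squared gradient of p_u and q_v,
    one scalar per vector rather than one per coordinate.
    """
    e = r - P[:, u] @ Q[:, v]
    g_p = -e * Q[:, v] + lam_p * P[:, u]   # gradient w.r.t. p_u
    g_q = -e * P[:, u] + lam_q * Q[:, v]   # gradient w.r.t. q_v
    G_p[u] += (g_p @ g_p) / g_p.size
    G_q[v] += (g_q @ g_q) / g_q.size
    P[:, u] -= eta / np.sqrt(G_p[u]) * g_p
    Q[:, v] -= eta / np.sqrt(G_q[v]) * g_q
```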
SG for Parallel MF
[Figure: a 6 × 6 rating matrix; after r_{3,3} is selected, the ratings in the gray blocks (those sharing its row or column) cannot be updated]

But r_{6,6} can be used: the updates for

r_{3,1} = p_3^T q_1, r_{3,2} = p_3^T q_2, ..., r_{3,6} = p_3^T q_6

all share p_3, while

r_{3,3} = p_3^T q_3 and r_{6,6} = p_6^T q_6

involve no common variables
SG for Parallel MF (Cont’d)
We can split the matrix into blocks and update those that do not share any p or q

[Figure: the 6 × 6 matrix split into blocks; blocks sharing no rows or columns can be updated in parallel]

This concept is simple, but there are many issues in getting a right implementation under the given architecture
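To illustrate only the block idea (real implementations handle scheduling, load balance, and memory layout far more carefully), a toy round-based scheduler; sg_on_block is a hypothetical callable that runs SG on the ratings of one block:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_block_pass(sg_on_block, n_blocks):
    """One pass over an n_blocks x n_blocks grid of rating blocks.

    Within a round, blocks (i, (i + shift) % n_blocks) share no users
    and no items, so workers can update them simultaneously.
    """
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        for shift in range(n_blocks):
            blocks = [(i, (i + shift) % n_blocks) for i in range(n_blocks)]
            list(pool.map(sg_on_block, blocks))
```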
SG for Parallel MF (Cont’d)
Past developments of SG for parallel MF include Gemulla et al. (2011); Chin et al. (2015a); Yun et al. (2014)
However, the idea of block splits applies to MF only. We haven't seen an easy way to extend it to FM or FFM
This is another example where we take problem structure into account
Outline
1 Matrix factorization
2 Factorization machines
3 Field-aware factorization machines
4 Optimization methods for large-scale training
5 Discussion and conclusions
Discussion and Conclusions
In this talk we briefly discuss three models for recommender systems
MF, FM, and FFM
They are related, but are useful in different situations
Different algorithms may be needed due to different properties of the optimization problems
Acknowledgments
Past and current students who have contributed to this work:
Wei-Sheng Chin, Yu-Chin Juan, Meng-Yuan Yang, Bo-Wen Yuan, Yong Zhuang
We thank the Ministry of Science and Technology in Taiwan for its support