Matrix Factorization and Factorization Machines for Recommender Systems
Chih-Jen Lin
Department of Computer Science National Taiwan University
Talk at Facebook, November 13, 2015
Matrix factorization (MF) and its extensions are now widely used in recommender systems
In this talk I will briefly discuss three research works related to this topic
1 Parallel matrix factorization in shared-memory systems
2 Optimization algorithms for one-class matrix factorization
3 From matrix factorization to factorization machines and more
4 Conclusions
Outline
1 Parallel matrix factorization in shared-memory systems
2 Optimization algorithms for one-class matrix factorization
3 From matrix factorization to factorization machines and more
4 Conclusions
Matrix Factorization
Matrix Factorization is an effective method for recommender systems (e.g., Netflix Prize and KDD Cup 2011)
But training is slow.
We developed a parallel MF package LIBMF for shared-memory systems
http://www.csie.ntu.edu.tw/~cjlin/libmf
Best paper award at ACM RecSys 2013
Matrix Factorization (Cont’d)
A group of users give ratings to some items:

User  Item  Rating
   1     5     100
   1    10      80
   1    13      30
 ...   ...     ...
   u     v       r
 ...   ...     ...
The information can be represented by a rating matrix R
Matrix Factorization (Cont’d)
[Figure: the m × n rating matrix R; entry r_{u,v} sits at row u, column v, and unknown entries such as ?_{2,2} are to be predicted]

m, n : numbers of users and items
u, v : indices for the uth user and the vth item
r_{u,v} : the rating the uth user gives to the vth item
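In practice R is stored sparsely, since only observed ratings are kept. As a toy illustration (my own sketch, assuming scipy; the sizes m, n and the 0-based indexing are made up, while the triples come from the table above):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy sizes for illustration only
m, n = 1000, 2000

# The (user, item, rating) triples from the table above, 0-indexed
users   = np.array([0, 0, 0])        # user 1
items   = np.array([4, 9, 12])       # items 5, 10, 13
ratings = np.array([100.0, 80.0, 30.0])

# Sparse m x n rating matrix R; unobserved entries are implicit zeros
R = csr_matrix((ratings, (users, items)), shape=(m, n))
```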
Matrix Factorization (Cont’d)
R (m × n) ≈ P^T (m × k) × Q (k × n)

[Figure: row u of P^T is p_u^T and column v of Q is q_v; the missing entry ?_{2,2} is predicted from the factors]

k : number of latent dimensions
r_{u,v} = p_u^T q_v, e.g., ?_{2,2} = p_2^T q_2
Matrix Factorization (Cont’d)
A non-convex optimization problem:

min_{P,Q} Σ_{(u,v)∈R} (r_{u,v} − p_u^T q_v)^2 + λ_P ||p_u||_F^2 + λ_Q ||q_v||_F^2

λ_P and λ_Q are regularization parameters
SG (stochastic gradient) is now a popular optimization method for MF
It loops over ratings in the training set.
Matrix Factorization (Cont’d)
SG update rule:

p_u ← p_u + γ (e_{u,v} q_v − λ_P p_u)
q_v ← q_v + γ (e_{u,v} p_u − λ_Q q_v)

where e_{u,v} ≡ r_{u,v} − p_u^T q_v

SG is inherently sequential
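As a concrete reference, here is a minimal sketch of one SG pass implementing the update rule above (my own toy code, not LIBMF; R is assumed to be a list of (u, v, r) triples, and P, Q store p_u, q_v as rows):

```python
import numpy as np

def sg_epoch(R, P, Q, gamma=0.05, lam_p=0.05, lam_q=0.05):
    """One SG pass over the ratings, in random order."""
    for idx in np.random.permutation(len(R)):
        u, v, r = R[idx]
        e = r - P[u] @ Q[v]                 # e_{u,v} = r_{u,v} - p_u^T q_v
        pu_old = P[u].copy()                # keep p_u before it is updated
        P[u] += gamma * (e * Q[v] - lam_p * P[u])
        Q[v] += gamma * (e * pu_old - lam_q * Q[v])
```

The copy of p_u matters: the q_v update must use p_u from before the step, exactly as in the rule above.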
SG for Parallel MF
After r_{3,3} is selected, ratings in the gray blocks cannot be updated, because

r_{3,1} = p_3^T q_1, r_{3,2} = p_3^T q_2, ..., r_{3,6} = p_3^T q_6

all share p_3 with r_{3,3} = p_3^T q_3 (and likewise the entries in column 3 share q_3)

But r_{6,6} = p_6^T q_6 can be used, as it shares neither p_3 nor q_3

[Figure: a 6 × 6 rating matrix; after r_{3,3} is selected, row 3 and column 3 are grayed out while r_{6,6} remains available]
SG for Parallel MF (Cont’d)
We can split the matrix into blocks and update in parallel those blocks that share neither p nor q

[Figure: a 6 × 6 grid of blocks; non-conflicting blocks can be updated simultaneously]

This concept is simple, but there are many issues in getting a right implementation under the given architecture
Our Approach in the Package LIBMF
Parallelization (Zhuang et al., 2013; Chin et al., 2015a)
Effective block splitting to avoid synchronization time
Partial random method for the order of SG updates
Adaptive learning rate for SG updates (Chin et al., 2015b)
Details omitted due to time constraint
Block Splitting and Synchronization
A naive way for T nodes is to split the matrix into T × T blocks
This is used in DSGD (Gemulla et al., 2011) for distributed systems, where communication cost is the main concern
In distributed systems, it is difficult to move data or model
Block Splitting and Synchronization (Cont’d)
• For shared-memory systems, synchronization becomes a concern

Example: we have 3 threads on a 3 × 3 grid of blocks, where processing takes
• Block 1: 20s
• Block 2: 10s
• Block 3: 20s

Thread  0→10s  10→20s
1       Busy   Busy
2       Busy   Idle
3       Busy   Busy

10s wasted!!
Lock-Free Scheduling
We split the matrix into enough blocks. For example, with two threads, we split the matrix into 4 × 4 blocks

0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0

Each 0 is an update counter recording the number of times that block has been updated
Lock-Free Scheduling (Cont’d)
First, T1 randomly selects a block

[Figure: the 4 × 4 grid of zero counters; T1's block is marked in green, and blocks sharing its row or column are gray]
Lock-Free Scheduling (Cont’d)
Then T2 randomly selects a block that is neither green nor gray

[Figure: the grid with T1's and T2's blocks both marked]
Lock-Free Scheduling (Cont’d)
After T1 finishes, the counter for the corresponding block is increased by one

[Figure: the grid with that block's counter now 1; T2 is still running]
Lock-Free Scheduling (Cont’d)
T1 can then select available blocks to update. Rule: select one that is least updated

[Figure: the grid; T1 chooses among the non-conflicting blocks with the smallest counter]
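A toy sketch of this scheduling rule (my own simplification, not LIBMF's implementation). "Lock-free" here refers to the absence of a global barrier between passes; a tiny mutex protects only the scheduler's bookkeeping:

```python
import random
import threading

class BlockScheduler:
    """Hand out non-conflicting blocks, preferring the least-updated ones."""

    def __init__(self, nblocks):
        self.n = nblocks                              # grid is n x n blocks
        self.counts = [[0] * nblocks for _ in range(nblocks)]
        self.busy_rows, self.busy_cols = set(), set()
        self.mutex = threading.Lock()                 # guards bookkeeping only

    def get_block(self):
        with self.mutex:
            # Blocks that share no row/column with any running block
            free = [(i, j) for i in range(self.n) for j in range(self.n)
                    if i not in self.busy_rows and j not in self.busy_cols]
            if not free:
                return None                           # caller retries later
            least = min(self.counts[i][j] for i, j in free)
            i, j = random.choice([b for b in free
                                  if self.counts[b[0]][b[1]] == least])
            self.busy_rows.add(i)
            self.busy_cols.add(j)
            return i, j

    def put_block(self, i, j):
        with self.mutex:
            self.counts[i][j] += 1                    # one more pass finished
            self.busy_rows.discard(i)
            self.busy_cols.discard(j)
```

Each worker thread would loop: get_block, run SG over that block's ratings, put_block.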
Lock-Free Scheduling (Cont’d)
SG: applying lock-free scheduling
SG**: applying DSGD-like scheduling

[Figure: RMSE vs. time on MovieLens 10M (left) and Yahoo!Music (right); SG reaches each RMSE level faster than SG**]

Time to reach a fixed RMSE:
MovieLens 10M: 18.71s → 9.72s (RMSE 0.835)
Yahoo!Music: 728.23s → 462.55s (RMSE 21.985)
Memory Discontinuity
Discontinuous memory access can dramatically increase the training time. For SG, two possible update orders are

Update order  Advantages         Disadvantages
Random        Faster and stable  Memory discontinuity
Sequential    Memory continuity  Not stable

[Figure: access patterns of R under the random and sequential orders]

Our lock-free scheduling gives randomness, but the resulting code may not be cache friendly
Partial Random Method
Our solution: within each block, access both R̂ and P̂ continuously

[Figure: one block R̂ = P̂^T × Q̂, with its ratings updated in sequential order 1, 2, ..., 6]

Partial: sequential in each block
Random: random when selecting blocks
Partial Random Method (Cont’d)
[Figure: RMSE vs. time on MovieLens 10M (left) and Yahoo!Music (right); the partial random method converges faster and more stably than the random method]

The performance of the partial random method is better than that of the random method
Experiments
State-of-the-art methods compared
LIBPMF: a parallel coordinate descent method (Yu et al., 2012)
NOMAD: an asynchronous SG method (Yun et al., 2014)
LIBMF: an earlier version of LIBMF (Zhuang et al., 2013; Chin et al., 2015a)
LIBMF++: with adaptive learning rates for SG (Chin et al., 2015c)
Details of data sets are omitted; the largest has 1.7B ratings.
Results: k = 100
[Figure: RMSE vs. time with k = 100 on Netflix, Yahoo!Music, Webscope-R1, and Hugewiki, comparing NOMAD, LIBPMF, LIBMF, and LIBMF++ (NOMAD is absent on Hugewiki); LIBMF and LIBMF++ converge fastest]
Outline
1 Parallel matrix factorization in shared-memory systems
2 Optimization algorithms for one-class matrix factorization
3 From matrix factorization to factorization machines and more
4 Conclusions
One-class Matrix Factorization
• Some applications have only two possible ratings: positive (1, watched) and negative (0, not watched)
• One-class observations (i.e., implicit feedback) ⇒ only some of the positive actions are recorded

User  Item  Watched ∈ {0, 1}
  61     7  1
  61    23  1
1647   128  1

• Past works include Pan et al. (2008); Hu et al. (2008); Pan and Scholz (2009); Li et al. (2010); Paquet and Koenigstein (2013)
Selection of Negative Samples
• One popular solution: treat some missing entries as negative
  Why? Most unknown entries are negative ⇒ a user cannot watch all the movies

min_{P,Q} Σ_{(u,v)∈Ω+} C_{uv}(1 − p_u^T q_v)^2 + Σ_{(u,v)∈Ω−} C_{uv}(0 − p_u^T q_v)^2 + λ(||P||_F^2 + ||Q||_F^2)

• Ω+: observed positive entries
• Ω−: negative entries sampled from the missing entries
• C_{uv}: weights
Two Ways to Select Negative Entries
We may use a subset of the missing entries or include all of them

• Subsampled: |Ω−| = O(|Ω+|) ≪ mn
• Full: Ω− = {(u, v) | (u, v) ∉ Ω+}

• Subsampled is just an approximation of Full
• Full: no need to worry about the selection
• Full: O(mn) elements lead to a hard optimization problem
• Subsampled: existing MF techniques can be directly applied
Full for One-class Matrix Factorization
• Include all missing entries in Ω−:

Ω = Ω+ ∪ Ω− = {1, ..., m} × {1, ..., n}

• Weighted matrix factorization:

min_{P,Q} Σ_{(u,v)∈Ω+} C_{uv}(A_{uv} − p_u^T q_v)^2 + Σ_{(u,v)∈Ω−} C_{uv}(0 − p_u^T q_v)^2 + λ(||P||_F^2 + ||Q||_F^2)

• For most MF algorithms, the complexity is proportional to O(|Ω|). Now |Ω| = mn can be huge
• Therefore, even if Full gives better performance, it is not useful without efficient training techniques
New Optimization Techniques
Under certain conditions on C_{uv}, we reduce the mn term to |Ω+| in the optimization algorithms:

• ALS: alternating least squares
  This has been done by Pan and Scholz (2009)
• CD: coordinate descent
• SG: stochastic gradient

             ALS                   CD                   SG
rating-MF    O(|Ω|k² + (m+n)k³)    O(|Ω|k)              O(|Ω|k)
Subsampled   O(|Ω+|k² + (m+n)k³)   O(|Ω+|k)             O(|Ω+|k)
Full-direct  O(mnk² + (m+n)k³)     O(mnk)               O(mnk)
Full-new     O(|Ω+|k² + (m+n)k³)   O(|Ω+|k + (m+n)k²)   ??
New Optimization Techniques (Cont’d)
The weights C_{uv} should satisfy certain conditions. Like Pan and Scholz (2009), we assume they factor as

C_{uv} = p_u q_v, ∀(u, v) ∉ R

This is often satisfied in practice. For example, we may have

C_{uv} ∝ |Ω+_u|, ∀u, when v is fixed,

where |Ω+_u| = the number of user u's observed entries
New Optimization Techniques (Cont’d)
Σ_{(u,v)∈Ω+} (···) + Σ_{(u,v)∈Ω−} (···), u = 1, ..., m

can be written as

Σ_{(u,v)∈Ω+} (··· + something from the 2nd summation) + (a term involving u) Σ_{v=1}^{n} (a term involving v), ∀u

The derivation is a bit complicated. Also, some implementation issues must be carefully addressed.
Reducing the complexity of SG remains a challenging issue.
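To make the idea concrete, here is a minimal sketch (my own notation and code, not the authors' implementation) of the ALS case: if the weights on missing entries factor as C_{uv} = a_u b_v, the O(mn) part of every user's subproblem collapses into one shared k × k Gramian.

```python
import numpy as np

def als_update_users(pos, A, C, P, Q, a, b, lam):
    """Update every p_u once for the Full one-class problem.

    pos : pos[u] = list of items v with (u, v) in Omega+
    A, C: dicts (u, v) -> observed value / weight on positive entries
    a, b: per-user / per-item factors of the missing-entry weights
    """
    m, k = P.shape
    # Shared Gramian sum_v b_v q_v q_v^T: O(n k^2), computed once for all users
    M = (Q * b[:, None]).T @ Q
    for u in range(m):
        H = a[u] * M + lam * np.eye(k)   # contribution of ALL mn entries
        g = np.zeros(k)
        for v in pos[u]:                 # correct only the |Omega+_u| positives
            qv = Q[v]
            H += (C[u, v] - a[u] * b[v]) * np.outer(qv, qv)
            g += C[u, v] * A[u, v] * qv
        P[u] = np.linalg.solve(H, g)
    return P
```

The per-user cost is O(|Ω+_u|k² + k³) plus the shared O(nk²), roughly matching the Full-new ALS entry in the table above rather than O(mnk²).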
Comparison: Subsampled and Full
ml10m       nDCG@1  nDCG@10  nHLU   MAP    AUC
Subsampled    9.33    12.10  13.31  10.00  0.97293
Full         25.64    23.81  24.94  17.70  0.97372

netflix     nDCG@1  nDCG@10  nHLU   MAP    AUC
Subsampled   10.62    11.27  12.03   8.91  0.97224
Full         27.04    22.62  22.72  14.36  0.96879

For nDCG, nHLU, and MAP, Full is much better than Subsampled. AUC isn't a good criterion, as we now care more about top recommendations
Summary for One-class MF
With the developed optimization techniques, the Full approach of treating all missing entries as negative becomes practical
This work was done while I visited Microsoft
Outline
1 Parallel matrix factorization in shared-memory systems
2 Optimization algorithms for one-class matrix factorization
3 From matrix factorization to factorization machines and more
4 Conclusions
MF versus Classification/Regression
MF solves

min_{P,Q} Σ_{(u,v)∈R} (r_{u,v} − p_u^T q_v)^2

Note that I omit the regularization term. Ratings are the only given information.

This doesn't sound like a classification or regression problem
In the last part of this talk we will make a connection and introduce FM (Factorization Machines)
Handling User/Item Features
What if instead of user/item IDs we are given user and item features?
Assume user u and item v have feature vectors

f_u ∈ R^U and g_v ∈ R^V,

where
U ≡ number of user features
V ≡ number of item features

How to use these features to build a model?
Handling User/Item Features (Cont’d)
We can consider a regression problem where the data instances are

value     features
 ...       ...
r_{u,v}   [f_u^T, g_v^T]
 ...       ...

and solve

min_w Σ_{(u,v)∈R} (r_{u,v} − w^T [f_u; g_v])^2
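A minimal sketch of this regression (toy code under my own naming; F and G hold the user and item feature vectors as rows):

```python
import numpy as np

def fit_concat_regression(F, G, ratings):
    """Least squares on concatenated features [f_u; g_v].

    F: m x U, G: n x V, ratings: list of (u, v, r) triples.
    """
    X = np.array([np.concatenate([F[u], G[v]]) for u, v, _ in ratings])
    y = np.array([r for _, _, r in ratings])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # min_w ||Xw - y||^2
    return w
```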
Feature Combinations
However, this does not take the interaction between users and items into account
Following the concept of degree-2 polynomial mappings in SVM, we can generate new features
(f_u)_t (g_v)_s, t = 1, ..., U, s = 1, ..., V

and solve

min_{w_{t,s}, ∀t,s} Σ_{(u,v)∈R} (r_{u,v} − Σ_{t=1}^{U} Σ_{s=1}^{V} w_{t,s} (f_u)_t (g_v)_s)^2
Feature Combinations (Cont’d)
This is equivalent to

min_W Σ_{(u,v)∈R} (r_{u,v} − f_u^T W g_v)^2,

where W ∈ R^{U×V} is a matrix

If we form vec(W) by concatenating W's columns, another form is

min_W Σ_{(u,v)∈R} (r_{u,v} − vec(W)^T [··· (f_u)_t (g_v)_s ···]^T)^2,
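A quick numerical check (toy sizes, my own sketch) that the two forms agree: f_u^T W g_v equals vec(W)^T applied to the stacked degree-2 features, which is the Kronecker product g_v ⊗ f_u when vec(W) stacks columns.

```python
import numpy as np

rng = np.random.default_rng(0)
U, V = 4, 3
f = rng.standard_normal(U)
g = rng.standard_normal(V)
W = rng.standard_normal((U, V))

lhs = f @ W @ g                             # f^T W g
# vec(W) concatenates W's columns; the matching feature vector is g ⊗ f
rhs = W.flatten(order="F") @ np.kron(g, f)
assert np.isclose(lhs, rhs)
```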
Feature Combinations (Cont’d)
However, this setting fails for extremely sparse features
Consider the most extreme situation. Assume we have the user ID and item ID as features. Then

U = m, V = n, f_u = [0, ..., 0, 1, 0, ..., 0]^T,

where the single 1 is at position u (preceded by u − 1 zeros)
Feature Combinations (Cont’d)
The optimal solution is

W_{u,v} = r_{u,v} if (u, v) ∈ R, and W_{u,v} = 0 if (u, v) ∉ R

We can never predict

r_{u,v}, (u, v) ∉ R
Factorization Machines
The reason we cannot predict unseen data is that in the optimization problem

# variables = mn, while # instances = |R|

Overfitting occurs. Remedy: we can let

W ≈ P^T Q,

where P and Q are low-rank matrices. This becomes matrix factorization
Factorization Machines (Cont’d)
This can be generalized to sparse user and item features:

min_{P,Q} Σ_{(u,v)∈R} (r_{u,v} − f_u^T P^T Q g_v)^2

That is, we think

P f_u and Q g_v

are latent representations of user u and item v, respectively

This becomes factorization machines (Rendle, 2010)
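A sketch of the resulting prediction function (the shapes are my assumption for illustration: P is k × U and Q is k × V):

```python
import numpy as np

def fm_predict(P, Q, f_u, g_v):
    """Predict r_{u,v} = f_u^T P^T Q g_v via the latent representations."""
    return (P @ f_u) @ (Q @ g_v)
```

With one-hot ID features, P @ f_u simply selects column u of P, so the model reduces to the MF prediction p_u^T q_v.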
Factorization Machines (Cont’d)
Similar ideas have been used in other places such as Stern et al. (2009)
We see that such ideas can be used not only for recommender systems.
They may be useful for any classification problem with very sparse features
Field-aware Factorization Machines
We have seen that FM is useful to handle highly sparse features such as user IDs
What if we have more than two ID fields?
For example, in CTR prediction for computational advertising, we may have
value  features
 ...    ...
 CTR    user ID, Ad ID, site ID
 ...    ...
Field-aware Factorization Machines (Cont’d)
FM can be generalized to handle different interactions between fields
• Two latent matrices for the user-ID/Ad-ID interaction
• Two latent matrices for the user-ID/site-ID interaction
• ...
This becomes FFM: field-aware factorization machines (Rendle and Schmidt-Thieme, 2010)
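A toy sketch of the FFM idea under my own notation (not LIBFFM's code): each feature j keeps a separate latent vector per field, and the interaction of feature j1 in field f1 with feature j2 in field f2 uses w_{j1,f2} and w_{j2,f1}.

```python
import numpy as np

def ffm_score(W, x):
    """W[j][f]: latent vector of feature j for field f.
    x: list of (field, feature, value) triples of the active features."""
    s = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            f1, j1, v1 = x[i]
            f2, j2, v2 = x[j]
            s += np.dot(W[j1][f2], W[j2][f1]) * v1 * v2
    return s
```

For CTR prediction this score would be plugged into the logistic loss, as noted on the next slide.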
FFM for CTR Prediction
It’s used by Jahrer et al. (2012) to win the 2nd prize of KDD Cup 2012
Recently my students used FFM to win two Kaggle competitions
After we used FFM to win the first competition, in the second competition all top teams use FFM Note that for CTR prediction, logistic rather than squared loss is used
Discussion
How to decide which field interactions to use?
If features are not extremely sparse, can the result still be better than degree-2 polynomial mappings?
Note that we lose the convexity here
We provide the software LIBFFM for public use
http://www.csie.ntu.edu.tw/~cjlin/libffm
Outline
1 Parallel matrix factorization in shared-memory systems
2 Optimization algorithms for one-class matrix factorization
3 From matrix factorization to factorization machines and more
4 Conclusions
Discussion and Conclusions
From my limited experience on recommender systems, I feel that their practical use is very problem dependent
For example, sometimes many features are available, but sometimes you only have ratings
Developing general algorithms becomes difficult. An algorithm may be useful only for certain scenarios
Discussion and Conclusions (Cont’d)
This situation is different from data classification, where the process is more standardized
I am still learning different aspects of recommender systems. Your comments/suggestions are very welcome
Acknowledgments
Collaborators for works mentioned in this talk:
National Taiwan University: Wei-Sheng Chin, Yu-Chin Juan, Bo-Wen Yuan, Yong Zhuang
UT Austin: Hsiang-Fu Yu
Microsoft: Misha Bilenko