### Matrix Factorization and Factorization Machines for Recommender Systems

### Chih-Jen Lin

Department of Computer Science, National Taiwan University

Talk at SDM workshop on Machine Learning Methods for Recommender Systems

### Outline

1 Matrix factorization

2 Factorization machines

3 Conclusions

In this talk I will briefly discuss two related topics:

- Fast matrix factorization (MF) in shared-memory systems
- Factorization machines (FM) for recommender systems and classification/regression

Note that MF is a special case of FM

Matrix factorization

### Outline

1 Matrix factorization

Introduction and issues for parallelization
Our approach in the package LIBMF

2 Factorization machines

3 Conclusions

Matrix factorization Introduction and issues for parallelization


### Matrix Factorization

Matrix Factorization is an effective method for recommender systems (e.g., Netflix Prize and KDD Cup 2011)

But training is slow.

We developed a parallel MF package LIBMF for shared-memory systems

http://www.csie.ntu.edu.tw/~cjlin/libmf

Best paper award at ACM RecSys 2013


### Matrix Factorization (Cont’d)

For recommender systems: a group of users give ratings to some items

| User | Item | Rating |
|------|------|--------|
| 1 | 5 | 100 |
| 1 | 10 | 80 |
| 1 | 13 | 30 |
| … | … | … |
| u | v | r |
| … | … | … |

The information can be represented by a rating matrix R


### Matrix Factorization (Cont’d)

[Figure: the m × n rating matrix R; entry r_{u,v} sits at row u, column v, and unobserved entries such as ?_{2,2} are unknown]

m, n: numbers of users and items
u, v: indices of the u-th user and the v-th item
r_{u,v}: the rating the u-th user gives to the v-th item


### Matrix Factorization (Cont’d)

[Figure: R (m × n) ≈ P^{T} (m × k) × Q (k × n); the rows of P^{T} are p^{T}_{1}, …, p^{T}_{m} and the columns of Q are q_{1}, …, q_{n}]

k: number of latent dimensions

r_{u,v} = p^{T}_{u} q_{v}

The unknown rating is predicted as ?_{2,2} = p^{T}_{2} q_{2}


### Matrix Factorization (Cont’d)

A non-convex optimization problem:

min_{P,Q} ∑_{(u,v)∈R} ( (r_{u,v} − p^{T}_{u} q_{v})^{2} + λ_{P} ‖p_{u}‖^{2}_{F} + λ_{Q} ‖q_{v}‖^{2}_{F} )

λ_{P} and λ_{Q} are regularization parameters

SG (stochastic gradient) is now a popular optimization method for MF

It loops over ratings in the training set.


### Matrix Factorization (Cont’d)

SG update rule:

p_{u} ← p_{u} + γ (e_{u,v} q_{v} − λ_{P} p_{u})
q_{v} ← q_{v} + γ (e_{u,v} p_{u} − λ_{Q} q_{v})

where

e_{u,v} ≡ r_{u,v} − p^{T}_{u} q_{v}

SG is inherently sequential
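As an illustration, the update rule above can be sketched in a few lines of NumPy (a minimal sketch, not LIBMF's implementation; the function names and default values of γ, λ_{P}, λ_{Q} are arbitrary choices for the example):

```python
import numpy as np

def sgd_mf(ratings, m, n, k=8, gamma=0.05, lam_p=0.05, lam_q=0.05,
           epochs=20, seed=0):
    """Plain SG for MF: loop over observed (u, v, r) triples and apply
    p_u <- p_u + gamma * (e_uv * q_v - lam_p * p_u)
    q_v <- q_v + gamma * (e_uv * p_u - lam_q * q_v)."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((m, k))
    Q = 0.1 * rng.standard_normal((n, k))
    for _ in range(epochs):
        for u, v, r in ratings:
            e = r - P[u] @ Q[v]      # e_{u,v} = r_{u,v} - p_u^T q_v
            pu = P[u].copy()         # keep the old p_u for q_v's update
            P[u] += gamma * (e * Q[v] - lam_p * pu)
            Q[v] += gamma * (e * pu - lam_q * Q[v])
    return P, Q

def rmse(ratings, P, Q):
    err = [(r - P[u] @ Q[v]) ** 2 for u, v, r in ratings]
    return float(np.sqrt(np.mean(err)))
```

Note how each step touches only one p_{u} and one q_{v}: this is exactly why two ratings sharing a user or an item cannot be updated at the same time.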


### SG for Parallel MF

After r_{3,3} is selected, ratings in the same row or column (gray blocks) cannot be updated, because

r_{3,1} = p^{T}_{3} q_{1}
r_{3,2} = p^{T}_{3} q_{2}
…
r_{3,6} = p^{T}_{3} q_{6}

all share p_{3} with r_{3,3} = p^{T}_{3} q_{3}

But r_{6,6} = p^{T}_{6} q_{6} can be updated at the same time


### SG for Parallel MF (Cont’d)

We can split the matrix into blocks.

Then threads update blocks in parallel, where ratings in different blocks don't share p or q

[Figure: a 6 × 6 rating matrix split into independent blocks]


### SG for Parallel MF (Cont’d)

This concept of splitting data into independent blocks seems to work

However, there are many issues in getting a right implementation for the given architecture

Matrix factorization Our approach in the package LIBMF


### Our approach in the package LIBMF

Parallelization (Zhuang et al., 2013; Chin et al., 2015a)

- Effective block splitting to avoid synchronization time
- Partial random method for the order of SG updates

Adaptive learning rate for SG updates (Chin et al., 2015b)

Details omitted due to time constraints


### Block Splitting and Synchronization

A naive way for T nodes is to split the matrix into T × T blocks

This is used in DSGD (Gemulla et al., 2011) for distributed systems. The setting is reasonable there because communication cost is the main concern: in distributed systems, it is difficult to move data or the model


### Block Splitting and Synchronization (Cont’d)

However, for shared-memory systems, synchronization is a concern. Suppose a 3 × 3 grid of blocks is updated by 3 threads, and in one round

- Block 1 takes 20s
- Block 2 takes 10s
- Block 3 takes 20s

| Thread | 0→10 | 10→20 |
|--------|------|-------|
| 1 | Busy | Busy |
| 2 | Busy | Idle |
| 3 | Busy | Busy |

10s wasted!!


### Lock-Free Scheduling

We split the matrix into enough blocks. For example, with two threads, we split the matrix into 4 × 4 blocks

[Figure: a 4 × 4 grid of blocks, each with an update counter initialized to 0]

The counter records the number of times each block has been updated


### Lock-Free Scheduling (Cont’d)

First, T_{1} randomly selects a block

[Figure: T_{1} occupies one block of the 4 × 4 grid; all counters are 0]


### Lock-Free Scheduling (Cont’d)

Then T_{2} randomly selects a block that shares no row or column with T_{1}'s block (neither green nor gray in the figure)

[Figure: T_{1} and T_{2} occupy independent blocks; all counters are 0]


### Lock-Free Scheduling (Cont’d)

After T_{1} finishes, the counter of the corresponding block is increased by one

[Figure: the finished block's counter is now 1; T_{2} still occupies its block]


### Lock-Free Scheduling (Cont’d)

T_{1} can then select any available block to update

Rule: select the one that has been updated least

[Figure: T_{2} occupies its block; T_{1} chooses among the available blocks with counter 0]
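The scheduling policy above can be sketched as follows (an illustrative Python sketch, not LIBMF's C++ implementation; `BlockScheduler` and its methods are hypothetical names). "Lock-free" refers to the SG updates on p and q needing no locks; the scheduler itself uses one short critical section:

```python
import random
import threading

class BlockScheduler:
    """Split R into a grid of blocks; a thread may take a block only if
    no other thread holds a block in the same row or column, and among
    the candidates it takes a least-updated one at random."""

    def __init__(self, grid=4):
        self.grid = grid
        self.count = [[0] * grid for _ in range(grid)]  # update counters
        self.busy_rows = set()
        self.busy_cols = set()
        self.mutex = threading.Lock()

    def acquire(self):
        with self.mutex:
            free = [(i, j)
                    for i in range(self.grid) for j in range(self.grid)
                    if i not in self.busy_rows and j not in self.busy_cols]
            if not free:
                return None
            least = min(self.count[i][j] for i, j in free)
            i, j = random.choice(
                [b for b in free if self.count[b[0]][b[1]] == least])
            self.busy_rows.add(i)
            self.busy_cols.add(j)
            return (i, j)

    def release(self, block):
        i, j = block
        with self.mutex:
            self.count[i][j] += 1     # one more update pass finished
            self.busy_rows.discard(i)
            self.busy_cols.discard(j)
```

A worker thread would loop `b = acquire(); run SG over block b; release(b)`, so no two threads ever share a p_{u} or q_{v}.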


### Lock-Free Scheduling (Cont’d)

SG: applying lock-free scheduling
SG**: applying DSGD-like scheduling

[Figure: RMSE vs. time on MovieLens 10M (left) and Yahoo!Music (right); SG reaches the target RMSE faster than SG**]

MovieLens 10M: 18.71s → 9.72s (RMSE 0.835)
Yahoo!Music: 728.23s → 462.55s (RMSE 21.985)


### Memory Discontinuity

Discontinuous memory access can dramatically increase the training time. For SG, two possible update orders are

| Update order | Advantages | Disadvantages |
|--------------|------------|----------------|
| Random | Faster and stable | Memory discontinuity |
| Sequential | Memory continuity | Not stable |

[Figure: random vs. sequential access patterns over R]

Our lock-free scheduling gives randomness, but the resulting code may not be cache friendly


### Partial Random Method

Our solution is that for each block, we access both R̂ and P̂ continuously

[Figure: one block R̂ = P̂^{T} × Q̂; ratings within the block are visited in the order 1, 2, …, 6]

Partial: sequential access within each block
Random: random selection of the next block
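A sketch of how such an ordering might be produced (illustrative Python; `partial_random_order` is a hypothetical helper, not part of LIBMF):

```python
import random
from collections import defaultdict

def partial_random_order(ratings, m, n, grid=4):
    """Partial random ordering: blocks are visited in random order
    (randomness helps convergence), but the ratings inside a block keep
    their original, memory-continuous order (cache friendliness)."""
    rb = (m + grid - 1) // grid   # user rows per block
    cb = (n + grid - 1) // grid   # item columns per block
    blocks = defaultdict(list)
    for u, v, r in ratings:
        blocks[(u // rb, v // cb)].append((u, v, r))
    order = list(blocks)
    random.shuffle(order)         # random across blocks
    out = []
    for b in order:
        out.extend(blocks[b])     # sequential within a block
    return out
```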


### Partial Random Method (Cont’d)

[Figure: RMSE vs. time on MovieLens 10M (left) and Yahoo!Music (right), comparing the Random and Partial Random methods]

The Partial Random method performs better than the Random method


### Experiments

State-of-the-art methods compared

LIBPMF: a parallel coordinate descent method (Yu et al., 2012)

NOMAD: an asynchronous SG method (Yun et al., 2014)

LIBMF: earlier version of LIBMF (Zhuang et al., 2013; Chin et al., 2015a)

LIBMF++: with adaptive learning rates for SG (Chin et al., 2015c)


### Experiments (Cont’d)

| Data set | m | n | #ratings |
|----------|---|---|----------|
| Netflix | 2,649,429 | 17,770 | 99,072,112 |
| Yahoo!Music | 1,000,990 | 624,961 | 252,800,275 |
| Webscope-R1 | 1,948,883 | 1,101,750 | 104,215,016 |
| Hugewiki | 39,706 | 25,000,000 | 1,703,429,136 |

• Due to machine capacity, Hugewiki here is about half of the original

• k = 100


### Experiments (Cont’d)

[Figure: RMSE vs. time on Netflix, Yahoo!Music, and Webscope-R1 (comparing NOMAD, LIBPMF, LIBMF, LIBMF++) and on Hugewiki (comparing CCD++, FPSG, FPSG++)]


### Non-negative Matrix Factorization (NMF)

Our method has been extended to solve NMF:

min_{P,Q} ∑_{(u,v)∈R} (r_{u,v} − p^{T}_{u} q_{v})^{2} + λ_{P} ‖p_{u}‖^{2}_{F} + λ_{Q} ‖q_{v}‖^{2}_{F}

subject to P_{i,u} ≥ 0, Q_{i,v} ≥ 0, ∀i, u, v
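One simple way to respect the non-negativity constraints in an SG setting is to project negative entries back to zero after each update (a projected-SG sketch for illustration only; not necessarily the solver LIBMF uses):

```python
import numpy as np

def sgd_nmf(ratings, m, n, k=4, gamma=0.05, lam=0.05, epochs=30, seed=0):
    """Projected SG sketch for NMF: after each SG step, clip negative
    entries of p_u and q_v to zero so P and Q stay non-negative."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.random((m, k))      # non-negative initialization
    Q = 0.1 * rng.random((n, k))
    for _ in range(epochs):
        for u, v, r in ratings:
            e = r - P[u] @ Q[v]
            pu = P[u].copy()
            P[u] = np.maximum(P[u] + gamma * (e * Q[v] - lam * pu), 0.0)
            Q[v] = np.maximum(Q[v] + gamma * (e * pu - lam * Q[v]), 0.0)
    return P, Q
```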

Factorization machines


### MF and Classification/Regression

MF solves

min_{P,Q} ∑_{(u,v)∈R} (r_{u,v} − p^{T}_{u} q_{v})^{2}

Note that I omit the regularization term

Ratings are the only given information

This doesn’t sound like a classification or regression problem

In the second part of this talk we will make a connection and introduce FM (Factorization Machines)

Factorization machines

### Handling User/Item Features

What if instead of user/item IDs we are given user and item features?

Assume user u and item v have feature vectors
f_{u} and g_{v}

How to use these features to build a model?


### Handling User/Item Features (Cont’d)

We can consider a regression problem where data instances are

| value | features |
|-------|----------|
| … | … |
| r_{u,v} | f_{u}^{T}, g_{v}^{T} |
| … | … |

and solve

min_{w} ∑_{(u,v)∈R} ( r_{u,v} − w^{T} [f_{u}; g_{v}] )^{2}

where [f_{u}; g_{v}] stacks the two feature vectors


### Feature Combinations

However, this does not take the interaction between users and items into account

Note that we are approximating the rating r_{u,v} of user u and item v

Let

U ≡ number of user features
V ≡ number of item features

Then

f_{u} ∈ R^{U}, u = 1, …, m, and g_{v} ∈ R^{V}, v = 1, …, n


### Feature Combinations (Cont’d)

Following the concept of degree-2 polynomial mappings in SVM, we can generate new features

(f_{u})_{t} (g_{v})_{s}, t = 1, …, U, s = 1, …, V

and solve

min_{w_{t,s}, ∀t,s} ∑_{(u,v)∈R} ( r_{u,v} − ∑_{t'=1}^{U} ∑_{s'=1}^{V} w_{t',s'} (f_{u})_{t'} (g_{v})_{s'} )^{2}


### Feature Combinations (Cont’d)

This is equivalent to

min_{W} ∑_{(u,v)∈R} (r_{u,v} − f_{u}^{T} W g_{v})^{2},

where W ∈ R^{U×V} is a matrix

If we form vec(W) by concatenating W's columns, another form is

min_{W} ∑_{(u,v)∈R} ( r_{u,v} − vec(W)^{T} [ … (f_{u})_{t} (g_{v})_{s} … ]^{T} )^{2}
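The three forms above (double sum, f_{u}^{T} W g_{v}, and the vec form) can be checked numerically; a small NumPy sanity check with randomly drawn f_{u}, g_{v}, W:

```python
import numpy as np

rng = np.random.default_rng(0)
U, V = 3, 4
f = rng.standard_normal(U)          # user features f_u
g = rng.standard_normal(V)          # item features g_v
W = rng.standard_normal((U, V))

# double-sum form: sum_{t,s} w_{t,s} (f_u)_t (g_v)_s
a = sum(W[t, s] * f[t] * g[s] for t in range(U) for s in range(V))
# matrix form: f_u^T W g_v
b = f @ W @ g
# vec form: vec(W) (column-wise) against the matching pair features,
# which are exactly the entries of f_u g_v^T stacked the same way
c = W.flatten(order="F") @ np.outer(f, g).flatten(order="F")
```

All three evaluate to the same number.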


### Feature Combinations (Cont’d)

However, this setting fails for extremely sparse features

Consider the most extreme situation, where we have only the user ID and item ID as features. Then

U = m, V = n, f_{u} = [0, …, 0, 1, 0, …, 0]^{T} (u − 1 leading zeros, then a single 1)


### Feature Combinations (Cont’d)

The optimal solution is

W_{u,v} = r_{u,v} if (u, v) ∈ R, and 0 if (u, v) ∉ R

We can never predict

r_{u,v}, (u, v) ∉ R


### Factorization Machines

The reason we cannot predict unseen data is that in the optimization problem

# variables = mn
# instances = |R|

Overfitting occurs

Remedy: we can let

W ≈ P^{T}Q,

where P and Q are low-rank matrices. This becomes matrix factorization
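To see how much the low-rank remedy shrinks the problem, compare the variable counts (a back-of-the-envelope check using the Netflix dimensions and k = 100 from the experiment slides; the numbers are illustrative):

```python
# W ≈ P^T Q replaces the mn variables of W by the k(m + n) variables of
# P and Q. With m = 2,649,429 users, n = 17,770 items, and k = 100:
m, n, k = 2_649_429, 17_770, 100
full = m * n            # variables in W
lowrank = k * (m + n)   # variables in P and Q
print(full, lowrank)    # roughly 4.7e10 vs 2.7e8
```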


### Factorization Machines (Cont’d)

This can be generalized to sparse user and item features

min_{P,Q} ∑_{(u,v)∈R} ( r_{u,v} − f_{u}^{T} P^{T} Q g_{v} )^{2}

That is, we think

P f_{u} and Q g_{v}

are latent representations of user u and item v , respectively

This becomes factorization machines (Rendle, 2010)
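With one-hot ID features, f_{u}^{T} P^{T} Q g_{v} reduces to p^{T}_{u} q_{v}, i.e., plain MF. A small NumPy check (illustrative; `fm_predict` is a hypothetical helper, not a library function):

```python
import numpy as np

def fm_predict(f, g, P, Q):
    """Predict a rating as f^T P^T Q g; Pf and Qg are the latent
    representations of the user and the item."""
    return (P @ f) @ (Q @ g)

rng = np.random.default_rng(0)
m, n, k = 4, 5, 3
P = rng.standard_normal((k, m))   # columns of P are the p_u
Q = rng.standard_normal((k, n))   # columns of Q are the q_v
u, v = 1, 3
f = np.eye(m)[u]                  # one-hot user-ID feature
g = np.eye(n)[v]                  # one-hot item-ID feature
```

Here `fm_predict(f, g, P, Q)` equals `P[:, u] @ Q[:, v]`, the MF prediction.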


### Factorization Machines (Cont’d)

Similar ideas have been used in other places such as Stern, Herbrich, and Graepel (2009)

In summary, we connect MF and classification/regression by the following settings:

- We need combinations of different feature types (e.g., user, item, etc.)
- However, overfitting occurs if features are very sparse
- We use a product of low-rank matrices to avoid overfitting


### Factorization Machines (Cont’d)

We see that such ideas can be used not only for recommender systems.

They may be useful for any classification problem with very sparse features


### Field-aware Factorization Machines

We have seen that FM is useful to handle highly sparse features such as user IDs

What if we have more than two ID fields?

For example, in CTR prediction for computational advertising, we may have

| value | features |
|-------|----------|
| … | … |
| CTR | user ID, Ad ID, site ID |
| … | … |


### Field-aware Factorization Machines (Cont’d)

FM can be generalized to handle different interactions between fields

- Two latent matrices for the user ID and Ad ID interaction
- Two latent matrices for the user ID and site ID interaction
- …

This becomes FFM: field-aware factorization machines (Rendle and Schmidt-Thieme, 2010)
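The field-aware interaction can be sketched as follows (illustrative Python, not LIBFFM's API; here each feature j keeps one latent vector per field, and the pair (j1, j2) contributes the inner product of W[j1][field(j2)] and W[j2][field(j1)]):

```python
import numpy as np

def ffm_predict(x, fields, W):
    """FFM sketch (hypothetical helper): x lists the active feature
    indices, fields[j] is the field of feature j, and W[j][f] is the
    latent vector feature j uses when interacting with field f."""
    s = 0.0
    for a in range(len(x)):
        for b in range(a + 1, len(x)):
            j1, j2 = x[a], x[b]
            s += W[j1][fields[j2]] @ W[j2][fields[j1]]
    return s
```

With a single user field and a single item field this degenerates to FM's p^{T}_{u} q_{v}; the gain of FFM is that each pair of fields gets its own latent representation.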


### FFM for CTR Prediction

It was used by Jahrer et al. (2012) to win the 2nd prize of KDD Cup 2012

Recently my students used FFM to win two Kaggle competitions

After we used FFM to win the first, in the second competition all top teams used FFM

Note that for CTR prediction, the logistic loss rather than the squared loss is used


### Discussion

How to decide which field interactions to use?

If features are not extremely sparse, can the result still be better than degree-2 polynomial mappings?

Note that we lose the convexity here

We provide the software LIBFFM for public use

http://www.csie.ntu.edu.tw/~cjlin/libffm


### Experiments

We see that replacing

W ⇒ P^{T} Q

reduces the number of variables

What if we instead map the pair features

[ … (f_{u})_{t} (g_{v})_{s} … ]^{T} ⇒ a shorter vector

to reduce the number of features/variables?
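A minimal sketch of the hashing idea (`pair_bin` is a hypothetical helper for illustration): with far fewer bins than pairs, unrelated pair features are forced to share one weight w̄:

```python
from collections import Counter

def pair_bin(t, s, n_bins):
    """Hashing trick sketch: map the pair feature (t, s) to one of
    n_bins shared weights; distinct pairs may collide."""
    return hash((t, s)) % n_bins

pairs = [(t, s) for t in range(10) for s in range(10)]   # 100 pair features
bins = Counter(pair_bin(t, s, 4) for t, s in pairs)      # only 4 weights
```

By the pigeonhole principle, some bin must hold many unrelated pairs, which is exactly the problem illustrated on the next slide.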


### Experiments (Cont’d)

However, we may have something like

(r_{1,2} − W_{1,2})^{2} ⇒ (r_{1,2} − w̄_{1})^{2}   (1)
(r_{1,4} − W_{1,4})^{2} ⇒ (r_{1,4} − w̄_{2})^{2}
(r_{2,1} − W_{2,1})^{2} ⇒ (r_{2,1} − w̄_{3})^{2}
(r_{2,3} − W_{2,3})^{2} ⇒ (r_{2,3} − w̄_{1})^{2}   (2)

Clearly, there is no reason why (1) and (2) should share the same variable w̄_{1}

In contrast, in MF, we connect r_{1,2} and r_{1,3} through p_{1}


### Experiments (Cont’d)

A simple comparison on MovieLens

# training: 9,301,274; # test: 698,780; # users: 71,567; # items: 65,133

Results of MF: RMSE = 0.836

Results of Poly-2 + hashing: RMSE = 1.14568 (10^{6} bins), 3.62299 (10^{8} bins), 3.76699 (all pairs)

We can clearly see that MF is much better

Conclusions


### Conclusions

In this talk we have discussed MF and FFM

MF is a mature technique, so we investigated its fast training

FFM is relatively new. We introduced its basic concepts and practical use


### Acknowledgments

The following students have contributed to the works mentioned in this talk

Wei-Sheng Chin Yu-Chin Juan Bo-Wen Yuan Yong Zhuang