Matrix Factorization and Factorization Machines for Recommender Systems
Chih-Jen Lin
Department of Computer Science, National Taiwan University
Talk at SDM workshop on Machine Learning Methods on
Outline
1 Matrix factorization
  Introduction and issues for parallelization
  Our approach in the package LIBMF
2 Factorization machines
3 Conclusions
In this talk I will briefly discuss two related topics:
Fast matrix factorization (MF) in shared-memory systems
Factorization machines (FM) for recommender systems and classification/regression
Note that MF is a special case of FM
Matrix Factorization
Matrix Factorization is an effective method for recommender systems (e.g., Netflix Prize and KDD Cup 2011)
But training is slow.
We developed a parallel MF package LIBMF for shared-memory systems
http://www.csie.ntu.edu.tw/~cjlin/libmf
Best paper award at ACM RecSys 2013
Matrix Factorization (Cont’d)
For recommender systems: a group of users give ratings to some items
User   Item   Rating
1      5      100
1      10     80
1      13     30
...    ...    ...
u      v      r_{u,v}
...    ...    ...
The information can be represented by a rating matrix R
Matrix Factorization (Cont’d)
[Figure: the m × n rating matrix R; entry r_{u,v} sits in row u and column v, and unknown entries such as ?_{2,2} are to be predicted]
m, n: numbers of users and items
u, v: indices of the uth user and the vth item
r_{u,v}: the rating the uth user gives to the vth item
Matrix Factorization (Cont’d)
[Figure: R (m × n) ≈ P^T (m × k) × Q (k × n), where row u of P^T is p_u^T and column v of Q is q_v]
k: number of latent dimensions
r_{u,v} = p_u^T q_v
The unknown entry ?_{2,2} is predicted by p_2^T q_2
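In code, predicting a missing entry is just an inner product of the corresponding latent vectors; a minimal NumPy illustration, where the random P and Q stand in for trained factors:

```python
import numpy as np

k, m, n = 8, 4, 6                  # latent dimension, #users, #items
P = np.random.rand(k, m)           # column u of P is p_u
Q = np.random.rand(k, n)           # column v of Q is q_v

R_hat = P.T @ Q                    # m x n matrix of all predictions p_u^T q_v
r_22_hat = P[:, 1] @ Q[:, 1]       # prediction for ?_{2,2} (0-based index 1)
```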
Matrix Factorization (Cont’d)
A non-convex optimization problem:
\min_{P,Q} \sum_{(u,v)\in R} \bigl( (r_{u,v} - p_u^T q_v)^2 + \lambda_P \|p_u\|_F^2 + \lambda_Q \|q_v\|_F^2 \bigr)
\lambda_P and \lambda_Q are regularization parameters
SG (stochastic gradient) is now a popular optimization method for MF
It loops over ratings in the training set.
Matrix Factorization (Cont’d)
SG update rule:
p_u \leftarrow p_u + \gamma (e_{u,v} q_v - \lambda_P p_u), \quad q_v \leftarrow q_v + \gamma (e_{u,v} p_u - \lambda_Q q_v)
where
e_{u,v} \equiv r_{u,v} - p_u^T q_v
SG is inherently sequential
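A minimal sketch of one SG pass over the ratings, assuming 0-based (u, v, r) triplets and dense NumPy factor matrices; the function name and default parameters are illustrative, not LIBMF's API:

```python
import numpy as np

def sgd_epoch(ratings, P, Q, gamma=0.05, lam_p=0.05, lam_q=0.05, rng=None):
    """One stochastic-gradient pass.

    ratings: list of (u, v, r) triplets with 0-based indices
    P: m x k latent user matrix (row u is p_u)
    Q: n x k latent item matrix (row v is q_v)
    """
    rng = rng or np.random.default_rng()
    for idx in rng.permutation(len(ratings)):
        u, v, r = ratings[idx]
        e = r - P[u] @ Q[v]                        # e_{u,v} = r_{u,v} - p_u^T q_v
        pu = P[u].copy()                           # keep old p_u for the q_v update
        P[u] += gamma * (e * Q[v] - lam_p * P[u])
        Q[v] += gamma * (e * pu   - lam_q * Q[v])
```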
SG for Parallel MF
After r3,3 is selected, ratings in gray blocks cannot be updated
[Figure: a 6 × 6 rating matrix; once r_{3,3} is selected, the other ratings in row 3 and column 3 (gray) are blocked]
But r_{6,6} can be updated at the same time:
r_{3,1} = p_3^T q_1, r_{3,2} = p_3^T q_2, ..., r_{3,6} = p_3^T q_6 all involve p_3,
while r_{3,3} = p_3^T q_3 and r_{6,6} = p_6^T q_6 share no variables
SG for Parallel MF (Cont’d)
We can split the matrix into blocks.
Then use threads to update blocks in parallel, where ratings in different blocks don't share any p or q
SG for Parallel MF (Cont’d)
This concept of splitting the data into independent blocks seems to work
However, there are many issues in getting a right implementation for the given architecture
Our approach in the package LIBMF
Parallelization (Zhuang et al., 2013; Chin et al., 2015a)
Effective block splitting to avoid synchronization time
Partial random method for the order of SG updates
Adaptive learning rate for SG updates (Chin et al., 2015b)
Details omitted due to time constraints
Block Splitting and Synchronization
A naive way for T nodes is to split the matrix into T × T blocks
This is used in DSGD (Gemulla et al., 2011) for distributed systems. The setting is reasonable because communication cost is the main concern: in distributed systems, it is difficult to move data or the model
Block Splitting and Synchronization (Cont’d)
• However, for shared memory systems, synchronization is a concern
[Figure: the matrix split into 3 × 3 blocks, with the three diagonal blocks assigned to three threads]
• Block 1: 20s
• Block 2: 10s
• Block 3: 20s
With 3 threads:
Thread   0→10s   10→20s
1        Busy    Busy
2        Busy    Idle
3        Busy    Busy
10s wasted!!
Lock-Free Scheduling
We split the matrix into enough blocks. For example, with two threads, we split the matrix into 4 × 4 blocks
[Figure: a 4 × 4 grid of blocks, each with its counter initialized to 0]
Each counter records the number of times its block has been updated
Lock-Free Scheduling (Cont’d)
Firstly, T1 selects a block randomly
[Figure: the 4 × 4 counter grid with T1's chosen block marked; all counters are 0]
Lock-Free Scheduling (Cont’d)
T2 then randomly selects a block that is neither green (being processed) nor gray (sharing a row or column with a block being processed)
[Figure: the 4 × 4 counter grid with T1's and T2's blocks marked; all counters are 0]
Lock-Free Scheduling (Cont’d)
After T1 finishes, the counter of the corresponding block is increased by one
[Figure: the counter of T1's finished block is now 1; T2 is still processing its block]
Lock-Free Scheduling (Cont’d)
T1 can then select another available block to update
Rule: select a block that has been updated the least
[Figure: the 4 × 4 counter grid guiding T1's next selection]
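A rough Python sketch of the scheduling rule just described: a thread grabs a free block (its row and column are not in use by another thread), preferring one that has been updated the least, and releases it when done. The class and method names are made up for illustration, and LIBMF's actual scheduler is more involved; "lock-free" refers to the SG updates themselves, which need no locking, while the small lock here only guards the scheduler's bookkeeping:

```python
import threading

class BlockScheduler:
    """Simplified scheduler: block (i, j) is free when neither row i nor
    column j is being updated; among free blocks, pick a least-updated one."""

    def __init__(self, num_blocks):
        self.n = num_blocks
        self.counts = [[0] * num_blocks for _ in range(num_blocks)]
        self.busy_rows, self.busy_cols = set(), set()
        self.lock = threading.Lock()        # protects only the bookkeeping

    def acquire(self):
        with self.lock:
            free = [(i, j) for i in range(self.n) for j in range(self.n)
                    if i not in self.busy_rows and j not in self.busy_cols]
            if not free:
                return None
            i, j = min(free, key=lambda b: self.counts[b[0]][b[1]])
            self.busy_rows.add(i)
            self.busy_cols.add(j)
            return i, j

    def release(self, block):
        i, j = block
        with self.lock:
            self.counts[i][j] += 1          # one more update for this block
            self.busy_rows.discard(i)
            self.busy_cols.discard(j)
```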
Lock-Free Scheduling (Cont’d)
SG: applying lock-free scheduling
SG**: applying DSGD-like scheduling
[Figure: RMSE vs. time (s) on MovieLens 10M (left) and Yahoo!Music (right), comparing SG and SG**]
MovieLens 10M: 18.71s → 9.72s (RMSE: 0.835)
Yahoo!Music: 728.23s → 462.55s (RMSE: 21.985)
Memory Discontinuity
Discontinuous memory access can dramatically increase the training time. For SG, two possible update orders are:
Update order   Advantages          Disadvantages
Random         Faster and stable   Memory discontinuity
Sequential     Memory continuity   Not stable
[Figure: random vs. sequential access patterns over the rating matrix R]
Our lock-free scheduling gives randomness, but the resulting code may not be cache friendly
Partial Random Method
Our solution: within each block, access both R̂ and P̂ continuously
[Figure: one block R̂ of R is approximated by P̂^T × Q̂; the ratings inside the block are visited in a fixed, sequential order]
Partial: sequential in each block
Random: random when selecting block
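A sketch of the data layout behind this idea, assuming ratings are (u, v, r) triplets with 0-based indices; block selection stays random, but each block's ratings are stored in sorted order so SG reads them (and the corresponding rows of P) sequentially. Names are illustrative, not LIBMF's internals:

```python
from collections import defaultdict

def split_into_blocks(ratings, m, n, grid):
    """Group (u, v, r) triplets into a grid x grid set of blocks and sort
    each block in row-major order ("partial": sequential inside a block;
    which block to run next is still chosen at random)."""
    blocks = defaultdict(list)
    bu, bv = (m + grid - 1) // grid, (n + grid - 1) // grid   # block sizes
    for u, v, r in ratings:
        blocks[(u // bu, v // bv)].append((u, v, r))
    for b in blocks.values():
        b.sort()                        # sequential access within the block
    return blocks
```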
Partial Random Method (Cont’d)
[Figure: RMSE vs. time (s) on MovieLens 10M (left) and Yahoo!Music (right), comparing the random and partial random methods]
The partial random method performs better than the random method
Experiments
State-of-the-art methods compared
LIBPMF: a parallel coordinate descent method (Yu et al., 2012)
NOMAD: an asynchronous SG method (Yun et al., 2014)
LIBMF: earlier version of LIBMF (Zhuang et al., 2013; Chin et al., 2015a)
LIBMF++: with adaptive learning rates for SG (Chin et al., 2015c)
Experiments (Cont’d)
Data set       m            n            #ratings
Netflix        2,649,429    17,770       99,072,112
Yahoo!Music    1,000,990    624,961      252,800,275
Webscope-R1    1,948,883    1,101,750    104,215,016
Hugewiki       39,706       25,000,000   1,703,429,136
• Due to machine capacity, Hugewiki here is about half of the original
• k = 100
Experiments (Cont’d)
[Figure: RMSE vs. time (sec.) on Netflix, Yahoo!Music, Webscope-R1, and Hugewiki; methods compared: NOMAD, LIBPMF, LIBMF, LIBMF++ (the Hugewiki panel labels its curves CCD++, FPSG, and FPSG++)]
Non-negative Matrix Factorization (NMF)
Our method has been extended to solve NMF:
\min_{P,Q} \sum_{(u,v)\in R} (r_{u,v} - p_u^T q_v)^2 + \lambda_P \|p_u\|_F^2 + \lambda_Q \|q_v\|_F^2
subject to P_{i,u} \ge 0, \; Q_{i,v} \ge 0, \; \forall i, u, v
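The non-negativity constraints can be handled in an SG setting, for example, by projecting each updated vector back onto the non-negative orthant after the gradient step; this is only an illustrative sketch, and the exact update used in LIBMF may differ:

```python
import numpy as np

def sgd_step_nmf(u, v, r, P, Q, gamma, lam_p, lam_q):
    """One projected SG step for NMF (P, Q have non-negative rows)."""
    e = r - P[u] @ Q[v]
    pu = P[u].copy()
    # Gradient step followed by projection onto the non-negative orthant
    P[u] = np.maximum(0.0, P[u] + gamma * (e * Q[v] - lam_p * P[u]))
    Q[v] = np.maximum(0.0, Q[v] + gamma * (e * pu   - lam_q * Q[v]))
```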
MF and Classification/Regression
MF solves
\min_{P,Q} \sum_{(u,v)\in R} \bigl( r_{u,v} - p_u^T q_v \bigr)^2
Note that I omit the regularization term. Ratings are the only given information
This doesn’t sound like a classification or regression problem
In the second part of this talk we will make a connection and introduce FM (Factorization Machines)
Handling User/Item Features
What if instead of user/item IDs we are given user and item features?
Assume user u and item v have feature vectors fu and gv
How to use these features to build a model?
Handling User/Item Features (Cont’d)
We can consider a regression problem where data instances are
value      features
...        ...
r_{u,v}    [f_u^T, g_v^T]
...        ...
and solve
\min_{w} \sum_{(u,v)\in R} \Bigl( r_{u,v} - w^T \begin{bmatrix} f_u \\ g_v \end{bmatrix} \Bigr)^2
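A small sketch of this baseline with dense features: each instance is the concatenation [f_u; g_v] with target r_{u,v}, fit by ordinary least squares. The function names are illustrative:

```python
import numpy as np

def linear_baseline(ratings, F, G):
    """Least-squares fit of r_{u,v} ~ w^T [f_u; g_v].

    ratings: iterable of (u, v, r) with 0-based indices
    F[u] = f_u and G[v] = g_v (dense 2-D arrays)
    """
    X = np.array([np.concatenate([F[u], G[v]]) for u, v, _ in ratings])
    y = np.array([r for _, _, r in ratings])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # min_w ||y - X w||^2
    return w

def predict_linear(w, f_u, g_v):
    return w @ np.concatenate([f_u, g_v])
```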
Feature Combinations
However, this does not take the interaction between users and items into account
Note that we are approximating the rating ru,v of user u and item v
Let
U \equiv number of user features, \quad V \equiv number of item features
Then
f_u \in R^U, \; u = 1, \ldots, m, \quad g_v \in R^V, \; v = 1, \ldots, n
Feature Combinations (Cont’d)
Following the concept of degree-2 polynomial mappings in SVM, we can generate new features
(f_u)_t (g_v)_s, \quad t = 1, \ldots, U, \; s = 1, \ldots, V
and solve
\min_{w_{t,s}, \forall t,s} \sum_{(u,v)\in R} \Bigl( r_{u,v} - \sum_{t'=1}^{U} \sum_{s'=1}^{V} w_{t',s'} (f_u)_{t'} (g_v)_{s'} \Bigr)^2
Feature Combinations (Cont’d)
This is equivalent to
\min_{W} \sum_{(u,v)\in R} (r_{u,v} - f_u^T W g_v)^2,
where W \in R^{U \times V} is a matrix
If we form vec(W) by concatenating W's columns, another form is
\min_{W} \sum_{(u,v)\in R} \Bigl( r_{u,v} - \text{vec}(W)^T \bigl[ \ldots, (f_u)_t (g_v)_s, \ldots \bigr]^T \Bigr)^2,
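A sketch of this full-interaction model: the prediction is f_u^T W g_v, one weight per (user-feature, item-feature) pair. The SG fitting loop below is only an illustration, not the method advocated here:

```python
import numpy as np

def predict_full_W(f_u, g_v, W):
    # One weight W[t, s] for every feature pair (f_u)_t (g_v)_s
    return f_u @ W @ g_v

def fit_full_W(ratings, F, G, gamma=0.01, epochs=10):
    """Plain SG on min_W sum_{(u,v) in R} (r_{u,v} - f_u^T W g_v)^2."""
    W = np.zeros((F.shape[1], G.shape[1]))          # U x V weight matrix
    for _ in range(epochs):
        for u, v, r in ratings:
            e = r - F[u] @ W @ G[v]
            W += gamma * e * np.outer(F[u], G[v])   # gradient step (factor 2 folded into gamma)
    return W
```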
Feature Combinations (Cont’d)
However, this setting fails for extremely sparse features
Consider the most extreme situation. Assume we have
user ID and item ID as features
Then
U = m, \quad V = n, \quad f_i = [\,\underbrace{0, \ldots, 0}_{i-1}, 1, 0, \ldots, 0\,]^T
Feature Combinations (Cont’d)
The optimal solution is
W_{u,v} = \begin{cases} r_{u,v}, & \text{if } (u,v) \in R \\ 0, & \text{if } (u,v) \notin R \end{cases}
We can never predict
r_{u,v}, \quad (u,v) \notin R
Factorization Machines
The reason we cannot predict unseen data is that in the optimization problem
# variables = mn, while # instances = |R|
Overfitting occurs
Remedy: we can let
W \approx P^T Q,
where P and Q are low-rank matrices. This becomes matrix factorization
Factorization Machines (Cont’d)
This can be generalized to sparse user and item features
\min_{P,Q} \sum_{(u,v)\in R} \bigl( r_{u,v} - f_u^T P^T Q g_v \bigr)^2
That is, we think
P f_u and Q g_v
are latent representations of user u and item v, respectively
This becomes factorization machines (Rendle, 2010)
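A minimal sketch of this factorized prediction with dense NumPy arrays (an illustration, not LIBMF or LIBFFM code):

```python
import numpy as np

def predict_fm(f_u, g_v, P, Q):
    """Prediction with W approximated by P^T Q (P: k x U, Q: k x V).

    P @ f_u and Q @ g_v are the latent representations of user u and item v."""
    return (P @ f_u) @ (Q @ g_v)
```

When f_u and g_v are one-hot ID indicators, P @ f_u simply picks column u of P, so the prediction reduces to p_u^T q_v and we recover plain MF.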
Factorization Machines (Cont’d)
Similar ideas have been used in other places such as Stern, Herbrich, and Graepel (2009)
In summary, we connect MF and classification/regression by the following settings:
We need combinations of different feature types (e.g., user, item, etc.)
However, overfitting occurs if features are very sparse
We use product of low-rank matrices to avoid overfitting
Factorization Machines (Cont’d)
We see that such ideas can be used not only for recommender systems.
They may be useful for any classification problem with very sparse features
Field-aware Factorization Machines
We have seen that FM is useful to handle highly sparse features such as user IDs
What if we have more than two ID fields?
For example, in CTR prediction for computational advertising, we may have
value features
... ...
CTR user ID, Ad ID, site ID
... ...
Field-aware Factorization Machines (Cont’d)
FM can be generalized to handle different interactions between fields
Two latent matrices for the user ID and Ad ID interaction
Two latent matrices for the user ID and site ID interaction
...
This becomes FFM: field-aware factorization machines (Rendle and Schmidt-Thieme, 2010)
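A rough sketch of the field-aware prediction for one-hot fields: each active feature keeps one latent vector per other field, and every pair of fields contributes one inner product. The data layout and names here are hypothetical, not LIBFFM's API:

```python
import numpy as np

def ffm_predict(active, latent):
    """active: one (field, feature_index) pair per field,
    e.g. [("user", 7), ("ad", 3), ("site", 12)].
    latent[(field, feature, other_field)]: length-k latent vector."""
    score = 0.0
    for i in range(len(active)):
        for j in range(i + 1, len(active)):
            fi, xi = active[i]
            fj, xj = active[j]
            # latent vector of feature xi dedicated to field fj, dotted with
            # the latent vector of feature xj dedicated to field fi
            score += latent[(fi, xi, fj)] @ latent[(fj, xj, fi)]
    return score

# Tiny usage with random latent vectors (k = 4)
rng, k = np.random.default_rng(0), 4
active = [("user", 7), ("ad", 3), ("site", 12)]
latent = {(f, x, g): rng.normal(size=k)
          for f, x in active for g, _ in active if g != f}
print(ffm_predict(active, latent))
```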
FFM for CTR Prediction
It’s used by Jahrer et al. (2012) to win the 2nd prize of KDD Cup 2012
Recently my students used FFM to win two Kaggle competitions
After we used FFM to win the first, in the second competition all top teams use FFM
Note that for CTR prediction, logistic rather than squared loss is used
Discussion
How to decide which field interactions to use?
If features are not extremely sparse, can the result still be better than degree-2 polynomial mappings?
Note that we lose the convexity here
We provide the software LIBFFM for public use
http://www.csie.ntu.edu.tw/~cjlin/libffm
Experiments
We see that
W \Rightarrow P^T Q reduces the number of variables
What if we map
\bigl[ \ldots, (f_u)_t (g_v)_s, \ldots \bigr]^T \Rightarrow a shorter vector
to reduce the number of features/variables?
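A sketch of this hashed poly-2 alternative: each feature pair (t, s) is hashed into one of a fixed number of bins and shares that bin's weight. The hash function and names are only for illustration:

```python
def hashed_poly2_predict(user_feat_idx, item_feat_idx, w, num_bins):
    """Score with degree-2 features hashed into num_bins buckets.

    user_feat_idx / item_feat_idx: indices of the nonzero (binary)
    user and item features; w: weight vector of length num_bins."""
    score = 0.0
    for t in user_feat_idx:
        for s in item_feat_idx:
            score += w[hash((t, s)) % num_bins]   # colliding pairs share a weight
    return score
```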
Experiments (Cont’d)
However, we may have something like
(r_{1,2} - W_{1,2})^2 \Rightarrow (r_{1,2} - \bar{w}_1)^2   (1)
(r_{1,4} - W_{1,4})^2 \Rightarrow (r_{1,4} - \bar{w}_2)^2
(r_{2,1} - W_{2,1})^2 \Rightarrow (r_{2,1} - \bar{w}_3)^2
(r_{2,3} - W_{2,3})^2 \Rightarrow (r_{2,3} - \bar{w}_1)^2   (2)
Clearly, there is no reason why (1) and (2) should share the same variable \bar{w}_1
In contrast, in MF, we connect r_{1,2} and r_{1,3} through p_1
Experiments (Cont’d)
A simple comparison on MovieLens
# training: 9,301,274, # test: 698,780, # users: 71,567, # items: 65,133
Results of MF: RMSE = 0.836
Results of Poly-2 + hashing: RMSE = 1.14568 (10^6 bins), 3.62299 (10^8 bins), 3.76699 (all pairs)
We can clearly see that MF is much better
Conclusions
In this talk we have talked about MF and FFM
MF is a mature technique, so we investigated its fast training
FFM is relatively new. We introduced its basic concepts and practical use
Acknowledgments
The following students have contributed to the works mentioned in this talk:
Wei-Sheng Chin, Yu-Chin Juan, Bo-Wen Yuan, Yong Zhuang