Matrix Factorization and Factorization Machines for Recommender Systems
Chih-Jen Lin
Department of Computer Science National Taiwan University
Talk at Facebook, November 13, 2015
Matrix factorization (MF) and its extensions are now widely used in recommender systems
In this talk I will briefly discuss three research works related to this topic
1 Parallel matrix factorization in shared-memory systems
2 Optimization algorithms for one-class matrix factorization
3 From matrix factorization to factorization machines and more
4 Conclusions
Outline
1 Parallel matrix factorization in shared-memory systems
2 Optimization algorithms for one-class matrix factorization
3 From matrix factorization to factorization machines and more
4 Conclusions
Matrix Factorization
Matrix Factorization is an effective method for recommender systems (e.g., Netflix Prize and KDD Cup 2011)
But training is slow.
We developed a parallel MF package LIBMF for shared-memory systems
http://www.csie.ntu.edu.tw/~cjlin/libmf
Best paper award at ACM RecSys 2013
Matrix Factorization (Cont’d)
A group of users give ratings to some items:

User  Item  Rating
   1     5     100
   1    10      80
   1    13      30
 ...   ...     ...
   u     v       r
 ...   ...     ...
The information can be represented by a rating matrix R
Matrix Factorization (Cont’d)
[Figure: the m × n rating matrix R; entry r_{u,v} sits at row u, column v, and unknown entries such as ?_{2,2} are to be predicted]

m, n : numbers of users and items
u, v : indices for the uth user and the vth item
r_{u,v} : the rating the uth user gives to the vth item
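In practice R is stored sparsely, since only observed ratings are kept. As a toy illustration (my own sketch, assuming scipy; the sizes m, n and the 0-based indexing are made up, while the triples come from the table above):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy sizes for illustration only
m, n = 1000, 2000

# The (user, item, rating) triples from the table above, 0-indexed
users   = np.array([0, 0, 0])        # user 1
items   = np.array([4, 9, 12])       # items 5, 10, 13
ratings = np.array([100.0, 80.0, 30.0])

# Sparse m x n rating matrix R; unobserved entries are implicit zeros
R = csr_matrix((ratings, (users, items)), shape=(m, n))
```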
Matrix Factorization (Cont’d)
R (m × n) ≈ P^T (m × k) × Q (k × n)

[Figure: row u of P^T is p_u^T and column v of Q is q_v; the missing entry ?_{2,2} is predicted from the factors]

k : number of latent dimensions
r_{u,v} = p_u^T q_v, e.g., ?_{2,2} = p_2^T q_2
Matrix Factorization (Cont’d)
A non-convex optimization problem:

min_{P,Q} Σ_{(u,v)∈R} (r_{u,v} − p_u^T q_v)^2 + λ_P ||p_u||_F^2 + λ_Q ||q_v||_F^2

λ_P and λ_Q are regularization parameters
SG (stochastic gradient) is now a popular optimization method for MF
It loops over ratings in the training set.
Matrix Factorization (Cont’d)
SG update rule:

p_u ← p_u + γ (e_{u,v} q_v − λ_P p_u)
q_v ← q_v + γ (e_{u,v} p_u − λ_Q q_v)

where e_{u,v} ≡ r_{u,v} − p_u^T q_v

SG is inherently sequential
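As a concrete reference, here is a minimal sketch of one SG pass implementing the update rule above (my own toy code, not LIBMF; R is assumed to be a list of (u, v, r) triples, and P, Q store p_u, q_v as rows):

```python
import numpy as np

def sg_epoch(R, P, Q, gamma=0.05, lam_p=0.05, lam_q=0.05):
    """One SG pass over the ratings, in random order."""
    for idx in np.random.permutation(len(R)):
        u, v, r = R[idx]
        e = r - P[u] @ Q[v]                 # e_{u,v} = r_{u,v} - p_u^T q_v
        pu_old = P[u].copy()                # keep p_u before it is updated
        P[u] += gamma * (e * Q[v] - lam_p * P[u])
        Q[v] += gamma * (e * pu_old - lam_q * Q[v])
```

The copy of p_u matters: the q_v update must use p_u from before the step, exactly as in the rule above.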
SG for Parallel MF
After r_{3,3} is selected, ratings in the gray blocks cannot be updated, because

r_{3,1} = p_3^T q_1, r_{3,2} = p_3^T q_2, ..., r_{3,6} = p_3^T q_6

all share p_3 with r_{3,3} = p_3^T q_3 (and likewise the entries in column 3 share q_3)

But r_{6,6} = p_6^T q_6 can be used, as it shares neither p_3 nor q_3

[Figure: a 6 × 6 rating matrix; after r_{3,3} is selected, row 3 and column 3 are grayed out while r_{6,6} remains available]
SG for Parallel MF (Cont’d)
We can split the matrix into blocks and update in parallel those blocks that share neither p nor q

[Figure: a 6 × 6 grid of blocks; non-conflicting blocks can be updated simultaneously]

This concept is simple, but there are many issues in getting a right implementation under the given architecture
Our Approach in the Package LIBMF
Parallelization (Zhuang et al., 2013; Chin et al., 2015a)
Effective block splitting to avoid synchronization time
Partial random method for the order of SG updates
Adaptive learning rate for SG updates (Chin et al., 2015b)
Details omitted due to time constraint
Block Splitting and Synchronization
A naive way for T nodes is to split the matrix into T × T blocks
This is used in DSGD (Gemulla et al., 2011) for distributed systems, where communication cost is the main concern
In distributed systems, it is difficult to move data or model
Block Splitting and Synchronization (Cont’d)
• For shared-memory systems, synchronization becomes a concern

Example: we have 3 threads on a 3 × 3 grid of blocks, where processing takes
• Block 1: 20s
• Block 2: 10s
• Block 3: 20s

Thread  0→10s  10→20s
1       Busy   Busy
2       Busy   Idle
3       Busy   Busy

10s wasted!!
Lock-Free Scheduling
We split the matrix into enough blocks. For example, with two threads, we split the matrix into 4 × 4 blocks

0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0

Each 0 is an update counter recording the number of times that block has been updated
Lock-Free Scheduling (Cont’d)
First, T1 randomly selects a block

[Figure: the 4 × 4 grid of zero counters; T1's block is marked in green, and blocks sharing its row or column are gray]
Lock-Free Scheduling (Cont’d)
Then T2 randomly selects a block that is neither green nor gray

[Figure: the grid with T1's and T2's blocks both marked]
Lock-Free Scheduling (Cont’d)
After T1 finishes, the counter for the corresponding block is increased by one

[Figure: the grid with that block's counter now 1; T2 is still running]
Lock-Free Scheduling (Cont’d)
T1 can then select available blocks to update. Rule: select one that is least updated

[Figure: the grid; T1 chooses among the non-conflicting blocks with the smallest counter]
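A toy sketch of this scheduling rule (my own simplification, not LIBMF's implementation). "Lock-free" here refers to the absence of a global barrier between passes; a tiny mutex protects only the scheduler's bookkeeping:

```python
import random
import threading

class BlockScheduler:
    """Hand out non-conflicting blocks, preferring the least-updated ones."""

    def __init__(self, nblocks):
        self.n = nblocks                              # grid is n x n blocks
        self.counts = [[0] * nblocks for _ in range(nblocks)]
        self.busy_rows, self.busy_cols = set(), set()
        self.mutex = threading.Lock()                 # guards bookkeeping only

    def get_block(self):
        with self.mutex:
            # Blocks that share no row/column with any running block
            free = [(i, j) for i in range(self.n) for j in range(self.n)
                    if i not in self.busy_rows and j not in self.busy_cols]
            if not free:
                return None                           # caller retries later
            least = min(self.counts[i][j] for i, j in free)
            i, j = random.choice([b for b in free
                                  if self.counts[b[0]][b[1]] == least])
            self.busy_rows.add(i)
            self.busy_cols.add(j)
            return i, j

    def put_block(self, i, j):
        with self.mutex:
            self.counts[i][j] += 1                    # one more pass finished
            self.busy_rows.discard(i)
            self.busy_cols.discard(j)
```

Each worker thread would loop: get_block, run SG over that block's ratings, put_block.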
Lock-Free Scheduling (Cont’d)
SG: applying lock-free scheduling
SG**: applying DSGD-like scheduling

[Figure: RMSE vs. time on MovieLens 10M (left) and Yahoo!Music (right); SG reaches each RMSE level faster than SG**]

Time to reach a fixed RMSE:
MovieLens 10M: 18.71s → 9.72s (RMSE 0.835)
Yahoo!Music: 728.23s → 462.55s (RMSE 21.985)
Memory Discontinuity
Discontinuous memory access can dramatically increase the training time. For SG, two possible update orders are

Update order  Advantages         Disadvantages
Random        Faster and stable  Memory discontinuity
Sequential    Memory continuity  Not stable

[Figure: access patterns of R under the random and sequential orders]

Our lock-free scheduling gives randomness, but the resulting code may not be cache friendly
Partial Random Method
Our solution: within each block, access both R̂ and P̂ continuously

[Figure: one block R̂ = P̂^T × Q̂, with its ratings updated in sequential order 1, 2, ..., 6]

Partial: sequential in each block
Random: random when selecting blocks
Partial Random Method (Cont’d)
[Figure: RMSE vs. time on MovieLens 10M (left) and Yahoo!Music (right); the partial random method converges faster and more stably than the random method]

The performance of the partial random method is better than that of the random method
Experiments
State-of-the-art methods compared
LIBPMF: a parallel coordinate descent method (Yu et al., 2012)
NOMAD: an asynchronous SG method (Yun et al., 2014)
LIBMF: an earlier version of LIBMF (Zhuang et al., 2013; Chin et al., 2015a)
LIBMF++: with adaptive learning rates for SG (Chin et al., 2015c)
Details of data sets are omitted; the largest has 1.7B ratings.
Results: k = 100
[Figure: RMSE vs. time with k = 100 on Netflix, Yahoo!Music, Webscope-R1, and Hugewiki, comparing NOMAD, LIBPMF, LIBMF, and LIBMF++ (NOMAD is absent on Hugewiki); LIBMF and LIBMF++ converge fastest]
Outline
1 Parallel matrix factorization in shared-memory systems
2 Optimization algorithms for one-class matrix factorization
3 From matrix factorization to factorization machines and more
4 Conclusions
One-class Matrix Factorization
• Some applications have only two possible ratings: positive (1, watched) and negative (0, not watched)
• One-class observations (i.e., implicit feedback) ⇒ only some of the positive actions are recorded

User  Item  Watched ∈ {0, 1}
  61     7  1
  61    23  1
1647   128  1

• Past works include Pan et al. (2008); Hu et al. (2008); Pan and Scholz (2009); Li et al. (2010); Paquet and Koenigstein (2013)
Selection of Negative Samples
• One popular solution: treat some missing entries as negative
  Why? Most unknown entries are negative ⇒ a user cannot watch all the movies

min_{P,Q} Σ_{(u,v)∈Ω+} C_{uv}(1 − p_u^T q_v)^2 + Σ_{(u,v)∈Ω−} C_{uv}(0 − p_u^T q_v)^2 + λ(||P||_F^2 + ||Q||_F^2)

• Ω+: observed positive entries
• Ω−: negative entries sampled from the missing entries
• C_{uv}: weights
Two Ways to Select Negative Entries
We may use a subset of the missing entries or include all of them

• Subsampled: |Ω−| = O(|Ω+|) ≪ mn
• Full: Ω− = {(u, v) | (u, v) ∉ Ω+}

• Subsampled is just an approximation of Full
• Full: no need to worry about the selection
• Full: O(mn) elements lead to a hard optimization problem
• Subsampled: existing MF techniques can be directly applied
Full for One-class Matrix Factorization
• Include all missing entries in Ω−:

Ω = Ω+ ∪ Ω− = {1, ..., m} × {1, ..., n}

• Weighted matrix factorization:

min_{P,Q} Σ_{(u,v)∈Ω+} C_{uv}(A_{uv} − p_u^T q_v)^2 + Σ_{(u,v)∈Ω−} C_{uv}(0 − p_u^T q_v)^2 + λ(||P||_F^2 + ||Q||_F^2)

• For most MF algorithms, the complexity is proportional to O(|Ω|). Now |Ω| = mn can be huge
• Therefore, even if Full gives better performance, it is not useful without efficient training techniques
New Optimization Techniques
Under certain conditions on C_{uv}, we reduce the mn term to |Ω+| in the optimization algorithms:

• ALS: alternating least squares
  This has been done by Pan and Scholz (2009)
• CD: coordinate descent
• SG: stochastic gradient

             ALS                   CD                   SG
rating-MF    O(|Ω|k² + (m+n)k³)    O(|Ω|k)              O(|Ω|k)
Subsampled   O(|Ω+|k² + (m+n)k³)   O(|Ω+|k)             O(|Ω+|k)
Full-direct  O(mnk² + (m+n)k³)     O(mnk)               O(mnk)
Full-new     O(|Ω+|k² + (m+n)k³)   O(|Ω+|k + (m+n)k²)   ??
New Optimization Techniques (Cont’d)
The weights C_{uv} should satisfy certain conditions. Like Pan and Scholz (2009), we assume they factor as

C_{uv} = p_u q_v, ∀(u, v) ∉ R

This is often satisfied in practice. For example, we may have

C_{uv} ∝ |Ω+_u|, ∀u, when v is fixed,

where |Ω+_u| = the number of user u's observed entries
New Optimization Techniques (Cont’d)
Σ_{(u,v)∈Ω+} (···) + Σ_{(u,v)∈Ω−} (···), u = 1, ..., m

can be written as

Σ_{(u,v)∈Ω+} (··· + something from the 2nd summation) + (a term involving u) Σ_{v=1}^{n} (a term involving v), ∀u

The derivation is a bit complicated. Also, some implementation issues must be carefully addressed.
Reducing the complexity of SG remains a challenging issue.
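To make the idea concrete, here is a minimal sketch (my own notation and code, not the authors' implementation) of the ALS case: if the weights on missing entries factor as C_{uv} = a_u b_v, the O(mn) part of every user's subproblem collapses into one shared k × k Gramian.

```python
import numpy as np

def als_update_users(pos, A, C, P, Q, a, b, lam):
    """Update every p_u once for the Full one-class problem.

    pos : pos[u] = list of items v with (u, v) in Omega+
    A, C: dicts (u, v) -> observed value / weight on positive entries
    a, b: per-user / per-item factors of the missing-entry weights
    """
    m, k = P.shape
    # Shared Gramian sum_v b_v q_v q_v^T: O(n k^2), computed once for all users
    M = (Q * b[:, None]).T @ Q
    for u in range(m):
        H = a[u] * M + lam * np.eye(k)   # contribution of ALL mn entries
        g = np.zeros(k)
        for v in pos[u]:                 # correct only the |Omega+_u| positives
            qv = Q[v]
            H += (C[u, v] - a[u] * b[v]) * np.outer(qv, qv)
            g += C[u, v] * A[u, v] * qv
        P[u] = np.linalg.solve(H, g)
    return P
```

The per-user cost is O(|Ω+_u|k² + k³) plus the shared O(nk²), roughly matching the Full-new ALS entry in the table above rather than O(mnk²).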
Comparison: Subsampled and Full
ml10m       nDCG@1  nDCG@10  nHLU   MAP    AUC
Subsampled    9.33    12.10  13.31  10.00  0.97293
Full         25.64    23.81  24.94  17.70  0.97372

netflix     nDCG@1  nDCG@10  nHLU   MAP    AUC
Subsampled   10.62    11.27  12.03   8.91  0.97224
Full         27.04    22.62  22.72  14.36  0.96879

For nDCG, nHLU, and MAP, Full is much better than Subsampled. AUC isn't a good criterion, as we now care more about top recommendations
Summary for One-class MF
With the developed optimization techniques, the Full approach of treating all missing entries as negative becomes practical
This work was done while I visited Microsoft
Outline
1 Parallel matrix factorization in shared-memory systems
2 Optimization algorithms for one-class matrix factorization
3 From matrix factorization to factorization machines and more
4 Conclusions
MF versus Classification/Regression
MF solves

min_{P,Q} Σ_{(u,v)∈R} (r_{u,v} − p_u^T q_v)^2

Note that I omit the regularization term. Ratings are the only given information.

This doesn't sound like a classification or regression problem
In the last part of this talk we will make a connection and introduce FM (Factorization Machines)
Handling User/Item Features
What if instead of user/item IDs we are given user and item features?
Assume user u and item v have feature vectors

f_u ∈ R^U and g_v ∈ R^V,

where
U ≡ number of user features
V ≡ number of item features

How to use these features to build a model?
Handling User/Item Features (Cont’d)
We can consider a regression problem where the data instances are

value     features
 ...       ...
r_{u,v}   [f_u^T, g_v^T]
 ...       ...

and solve

min_w Σ_{(u,v)∈R} (r_{u,v} − w^T [f_u; g_v])^2
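A minimal sketch of this regression (toy code under my own naming; F and G hold the user and item feature vectors as rows):

```python
import numpy as np

def fit_concat_regression(F, G, ratings):
    """Least squares on concatenated features [f_u; g_v].

    F: m x U, G: n x V, ratings: list of (u, v, r) triples.
    """
    X = np.array([np.concatenate([F[u], G[v]]) for u, v, _ in ratings])
    y = np.array([r for _, _, r in ratings])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # min_w ||Xw - y||^2
    return w
```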
Feature Combinations
However, this does not take the interaction between users and items into account
Following the concept of degree-2 polynomial mappings in SVM, we can generate new features
(f_u)_t (g_v)_s, t = 1, ..., U, s = 1, ..., V

and solve

min_{w_{t,s}, ∀t,s} Σ_{(u,v)∈R} (r_{u,v} − Σ_{t=1}^{U} Σ_{s=1}^{V} w_{t,s} (f_u)_t (g_v)_s)^2
Feature Combinations (Cont’d)
This is equivalent to

min_W Σ_{(u,v)∈R} (r_{u,v} − f_u^T W g_v)^2,

where W ∈ R^{U×V} is a matrix

If we form vec(W) by concatenating W's columns, another form is

min_W Σ_{(u,v)∈R} (r_{u,v} − vec(W)^T [··· (f_u)_t (g_v)_s ···]^T)^2,
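A quick numerical check (toy sizes, my own sketch) that the two forms agree: f_u^T W g_v equals vec(W)^T applied to the stacked degree-2 features, which is the Kronecker product g_v ⊗ f_u when vec(W) stacks columns.

```python
import numpy as np

rng = np.random.default_rng(0)
U, V = 4, 3
f = rng.standard_normal(U)
g = rng.standard_normal(V)
W = rng.standard_normal((U, V))

lhs = f @ W @ g                             # f^T W g
# vec(W) concatenates W's columns; the matching feature vector is g ⊗ f
rhs = W.flatten(order="F") @ np.kron(g, f)
assert np.isclose(lhs, rhs)
```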
Feature Combinations (Cont’d)
However, this setting fails for extremely sparse features
Consider the most extreme situation. Assume we have the user ID and item ID as features. Then

U = m, V = n, f_u = [0, ..., 0, 1, 0, ..., 0]^T,

where the single 1 is at position u (preceded by u − 1 zeros)
Feature Combinations (Cont’d)
The optimal solution is

W_{u,v} = r_{u,v} if (u, v) ∈ R, and W_{u,v} = 0 if (u, v) ∉ R

We can never predict

r_{u,v}, (u, v) ∉ R
Factorization Machines
The reason we cannot predict unseen data is that in the optimization problem

# variables = mn, while # instances = |R|

Overfitting occurs. Remedy: we can let

W ≈ P^T Q,

where P and Q are low-rank matrices. This becomes matrix factorization
Factorization Machines (Cont’d)
This can be generalized to sparse user and item features:

min_{P,Q} Σ_{(u,v)∈R} (r_{u,v} − f_u^T P^T Q g_v)^2

That is, we think

P f_u and Q g_v

are latent representations of user u and item v, respectively

This becomes factorization machines (Rendle, 2010)
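A sketch of the resulting prediction function (the shapes are my assumption for illustration: P is k × U and Q is k × V):

```python
import numpy as np

def fm_predict(P, Q, f_u, g_v):
    """Predict r_{u,v} = f_u^T P^T Q g_v via the latent representations."""
    return (P @ f_u) @ (Q @ g_v)
```

With one-hot ID features, P @ f_u simply selects column u of P, so the model reduces to the MF prediction p_u^T q_v.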
Factorization Machines (Cont’d)
Similar ideas have been used in other places such as Stern et al. (2009)
We see that such ideas can be used not only for recommender systems.
They may be useful for any classification problem with very sparse features
Field-aware Factorization Machines
We have seen that FM is useful to handle highly sparse features such as user IDs
What if we have more than two ID fields?
For example, in CTR prediction for computational advertising, we may have
value  features
 ...    ...
 CTR    user ID, Ad ID, site ID
 ...    ...
Field-aware Factorization Machines (Cont’d)
FM can be generalized to handle different interactions between fields
• Two latent matrices for the user-ID/Ad-ID interaction
• Two latent matrices for the user-ID/site-ID interaction
• ...
This becomes FFM: field-aware factorization machines (Rendle and Schmidt-Thieme, 2010)
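A toy sketch of the FFM idea under my own notation (not LIBFFM's code): each feature j keeps a separate latent vector per field, and the interaction of feature j1 in field f1 with feature j2 in field f2 uses w_{j1,f2} and w_{j2,f1}.

```python
import numpy as np

def ffm_score(W, x):
    """W[j][f]: latent vector of feature j for field f.
    x: list of (field, feature, value) triples of the active features."""
    s = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            f1, j1, v1 = x[i]
            f2, j2, v2 = x[j]
            s += np.dot(W[j1][f2], W[j2][f1]) * v1 * v2
    return s
```

For CTR prediction this score would be plugged into the logistic loss, as noted on the next slide.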
FFM for CTR Prediction
It’s used by Jahrer et al. (2012) to win the 2nd prize of KDD Cup 2012
Recently my students used FFM to win two Kaggle competitions
After we used FFM to win the first competition, in the second competition all top teams use FFM Note that for CTR prediction, logistic rather than squared loss is used
Discussion
How to decide which field interactions to use?
If features are not extremely sparse, can the result still be better than degree-2 polynomial mappings?
Note that we lose the convexity here
We provide the software LIBFFM for public use
http://www.csie.ntu.edu.tw/~cjlin/libffm
Outline
1 Parallel matrix factorization in shared-memory systems
2 Optimization algorithms for one-class matrix factorization
3 From matrix factorization to factorization machines and more
4 Conclusions
Discussion and Conclusions
From my limited experience on recommender systems, I feel that their practical use is very problem dependent
For example, sometimes many features are available, but sometimes you only have ratings
Developing general algorithms becomes difficult. An algorithm may be useful only for certain scenarios
Discussion and Conclusions (Cont’d)
This situation is different from data classification, where the process is more standardized
I am still learning different aspects of recommender systems. Your comments/suggestions are very welcome
Acknowledgments
Collaborators for works mentioned in this talk:
National Taiwan University: Wei-Sheng Chin, Yu-Chin Juan, Bo-Wen Yuan, Yong Zhuang
UT Austin: Hsiang-Fu Yu
Microsoft: Misha Bilenko