Matrix Factorization and Factorization Machines for Recommender Systems
Chih-Jen Lin
Department of Computer Science National Taiwan University
Talk at 4th Workshop on Large-Scale Recommender Systems, ACM RecSys, 2016
Outline
1 Matrix factorization
2 Factorization machines
3 Field-aware factorization machines
4 Optimization methods for large-scale training
5 Discussion and conclusions
Matrix Factorization
Matrix Factorization is an effective method for recommender systems (e.g., Netflix Prize and KDD Cup 2011)
A group of users give ratings to some items:

User  Item  Rating
   1     5     100
   1    13      30
 ...   ...     ...
   u     v     r_{u,v}
 ...   ...     ...
The information can be represented by a rating matrix R
Matrix Factorization (Cont’d)
[Figure: the m × n rating matrix R; entry r_{u,v} sits at row u, column v, and the unknown entry ?_{2,2} is marked]

m, n: numbers of users and items
u, v: indices of the uth user and the vth item
r_{u,v}: the rating the uth user gives to the vth item
Matrix Factorization (Cont’d)
[Figure: R (m × n) ≈ P^T (m × k) × Q (k × n); the rows of P^T are p_1^T, ..., p_m^T, the columns of Q are q_1, ..., q_n, and the unknown entry ?_{2,2} is marked]

k: number of latent dimensions

r_{u,v} = p_u^T q_v

The unknown entry is predicted as ?_{2,2} = p_2^T q_2
Matrix Factorization (Cont’d)
A non-convex optimization problem:
\min_{P,Q} \sum_{(u,v) \in R} \Big( (r_{u,v} - p_u^T q_v)^2 + \lambda_P \|p_u\|_F^2 + \lambda_Q \|q_v\|_F^2 \Big)

\lambda_P and \lambda_Q are regularization parameters
Many optimization methods have been successfully applied
Overall, MF is a mature technique
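To make the objective concrete, here is a minimal NumPy sketch (an illustration only; the function and variable names are my own) that evaluates the regularized error over the observed ratings:

```python
import numpy as np

def mf_objective(ratings, P, Q, lam_p, lam_q):
    """ratings: list of (u, v, r) triples; P is k x m, Q is k x n."""
    total = 0.0
    for u, v, r in ratings:
        e = r - P[:, u] @ Q[:, v]  # residual r_{u,v} - p_u^T q_v
        total += e ** 2 + lam_p * (P[:, u] @ P[:, u]) + lam_q * (Q[:, v] @ Q[:, v])
    return total
```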
Outline
1 Matrix factorization
2 Factorization machines
3 Field-aware factorization machines
4 Optimization methods for large-scale training
5 Discussion and conclusions
MF versus Classification/Regression
MF solves

\min_{P,Q} \sum_{(u,v) \in R} \big( r_{u,v} - p_u^T q_v \big)^2
Note that I omit the regularization term. Ratings are the only given information
This doesn’t sound like a classification or regression problem
But indeed we can make some interesting connections
Handling User/Item Features
What if instead of user/item IDs we are given user and item features?
Assume user u and item v have feature vectors f_u ∈ R^U and g_v ∈ R^V, where

U ≡ number of user features
V ≡ number of item features

How can we use these features to build a model?
Handling User/Item Features (Cont’d)
We can consider a regression problem where each data instance is

value      features
 ...         ...
r_{u,v}    f_u^T  g_v^T
 ...         ...

and solve

\min_{w} \sum_{(u,v) \in R} \left( r_{u,v} - w^T \begin{bmatrix} f_u \\ g_v \end{bmatrix} \right)^2
Feature Combinations
However, this does not take the interaction between users and items into account
Following the concept of degree-2 polynomial mappings in SVM, we can generate new features
(f_u)_t (g_v)_s, \quad t = 1, \ldots, U, \; s = 1, \ldots, V

and solve

\min_{w_{t,s} \, \forall t,s} \sum_{(u,v) \in R} \left( r_{u,v} - \sum_{t=1}^{U} \sum_{s=1}^{V} w_{t,s} (f_u)_t (g_v)_s \right)^2
Feature Combinations (Cont’d)
This is equivalent to

\min_{W} \sum_{(u,v) \in R} \left( r_{u,v} - f_u^T W g_v \right)^2,

where W ∈ R^{U×V} is a matrix

If we define vec(W) by concatenating W's columns, another form is

\min_{W} \sum_{(u,v) \in R} \left( r_{u,v} - \text{vec}(W)^T \begin{bmatrix} \vdots \\ (f_u)_t (g_v)_s \\ \vdots \end{bmatrix} \right)^2
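As a quick sanity check of this equivalence, a small NumPy sketch (my own illustration, not from the talk) verifying that f_u^T W g_v matches vec(W)^T applied to the vector of products (f_u)_t (g_v)_s:

```python
import numpy as np

rng = np.random.default_rng(0)
U, V = 4, 3
f, g = rng.normal(size=U), rng.normal(size=V)
W = rng.normal(size=(U, V))

bilinear = f @ W @ g  # f_u^T W g_v
# np.outer(f, g)[t, s] = (f)_t (g)_s; flattening in column-major ('F')
# order matches concatenating W's columns into vec(W)
products = np.outer(f, g).flatten(order="F")
assert np.isclose(bilinear, W.flatten(order="F") @ products)
```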
Feature Combinations (Cont’d)
However, this setting fails for extremely sparse features
Consider the most extreme situation. Assume we have
user ID and item ID as features
Then

U = m, \; V = n, \quad f_u = [\underbrace{0, \ldots, 0}_{u-1}, 1, 0, \ldots, 0]^T,

and similarly for g_v
Feature Combinations (Cont’d)
The optimal solution is

W_{u,v} = \begin{cases} r_{u,v}, & \text{if } (u,v) \in R \\ 0, & \text{if } (u,v) \notin R \end{cases}

We can never predict

r_{u,v}, \quad (u,v) \notin R
Factorization Machines
The reason why we cannot predict unseen data is that in the optimization problem

# variables = mn
# instances = |R|

Overfitting occurs. Remedy: we can let

W ≈ P^T Q,

where P and Q are low-rank matrices. This becomes matrix factorization
Factorization Machines (Cont’d)
This can be generalized to sparse user and item features
\min_{P,Q} \sum_{(u,v) \in R} \left( r_{u,v} - f_u^T P^T Q g_v \right)^2

That is, we think

P f_u \text{ and } Q g_v

are latent representations of user u and item v, respectively

We can also consider the interaction between elements in f_u (or elements in g_v)
Factorization Machines (Cont’d)
The new formulation is

\min_{P,Q} \sum_{(u,v) \in R} \left( r_{u,v} - \begin{bmatrix} f_u^T & g_v^T \end{bmatrix} \begin{bmatrix} P^T \\ Q^T \end{bmatrix} \begin{bmatrix} P & Q \end{bmatrix} \begin{bmatrix} f_u \\ g_v \end{bmatrix} \right)^2
This becomes factorization machines (Rendle, 2010)
Factorization Machines (Cont’d)
Similar ideas have been used in other places such as Stern et al. (2009)
We see that such ideas can be used not only for recommender systems.
They may be useful for any classification problems with very sparse features
FM for Classification
In a classification setting, assume a data instance is x ∈ R^n

Linear model:

w^T x

Degree-2 polynomial mapping:

x^T W x
FM for Classification (Cont’d)
FM:

x^T P^T P x

or alternatively

\sum_{i,j} x_i \, p_i^T p_j \, x_j,

where p_i, p_j ∈ R^k
That is, in FM each feature is associated with a latent factor
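In code, the FM score can be computed without forming the double sum explicitly, since x^T P^T P x = ‖Px‖²; a minimal sketch (the names are my own):

```python
import numpy as np

def fm_score(x, P):
    """FM score x^T P^T P x; P is k x n, with column i the latent vector p_i.

    Computing z = P x costs O(nk), versus O(n^2 k) for the naive
    double sum over x_i p_i^T p_j x_j.
    """
    z = P @ x
    return z @ z
```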
Outline
1 Matrix factorization
2 Factorization machines
3 Field-aware factorization machines
4 Optimization methods for large-scale training
5 Discussion and conclusions
Field-aware Factorization Machines
We have seen that FM seems to be useful for handling highly sparse features such as user IDs
What if we have more than two ID fields?
For example, in CTR (click-through rate) prediction for computational advertising, we may have
clicked   features
  ...       ...
  Yes     user ID, Ad ID, site ID
  ...       ...
Field-aware Factorization Machines (Cont’d)
FM can be generalized to handle different interactions between fields
Two latent matrices for user ID and Ad ID
Two latent matrices for user ID and site ID
...
We call this approach FFM (field-aware factorization machines)
An early study on three fields is Rendle and Schmidt-Thieme (2010)
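To make the field-aware idea concrete, here is a sketch of an FFM score over a sparse 0/1 instance (my own notation, not the reference implementation; each feature keeps one latent vector per field):

```python
import numpy as np

def ffm_score(active, W, field_of):
    """FFM score for a sparse instance whose active features are 0/1.

    active:   list of indices of the active features
    W:        array of shape (n_features, n_fields, k); W[i, f] is the
              latent vector that feature i uses against field f
    field_of: field index of each feature
    """
    score = 0.0
    for a, i in enumerate(active):
        for j in active[a + 1:]:
            # feature i uses its vector for j's field, and vice versa
            score += W[i, field_of[j]] @ W[j, field_of[i]]
    return score
```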
FFM for CTR Prediction
It was used by Jahrer et al. (2012) to win the 2nd prize of KDD Cup 2012

In 2014 my students used FFM to win two Kaggle CTR competitions

After we used FFM to win the first competition, in the second competition all top teams used FFM

Note that for CTR prediction, the logistic rather than the squared loss is used
Practical Use of FFM
Recently we conducted a detailed study on FFM (Juan et al., 2016)
Here I briefly discuss some results from that study
Numerical Features
For categorical features like IDs, we have

ID: field
ID index: feature

Each field has many 0/1 features
But how about numerical features?
Two possibilities:

Dummy fields: the field has only one real-valued feature
Discretization: transform a numerical feature into a categorical one, and then into many binary features
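A minimal sketch of the discretization option (the equal-width binning and the bin count are my own arbitrary choices):

```python
import numpy as np

def discretize(values, n_bins=10):
    """Map a numerical feature to 0/1 features via equal-width bins."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    bins = np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)
    onehot = np.zeros((len(values), n_bins))
    onehot[np.arange(len(values)), bins] = 1.0
    return onehot
```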
Normalization
After obtaining the feature vector, empirically we find that instance-wise normalization is useful: faster convergence and better test accuracy
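A one-function sketch of instance-wise normalization (assuming, as one common choice, division by each instance's Euclidean norm):

```python
import numpy as np

def normalize_instances(X):
    """Scale each row (instance) of X to unit Euclidean norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0.0, 1.0, norms)
```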
Impact of Parameters
We have the following parameters:

k: number of latent factors
λ: regularization parameter
parameters of the optimization methods (e.g., the learning rate of stochastic gradient)

Performance is more sensitive to some of these than to others
Example: Regularization Parameter λ
[Figure: test logloss versus training epochs for λ = 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}]
Too large λ: the model is not good
Too small λ: a better model, but overfitting occurs easily

Similar situations occur for SG learning rates

Early stopping by a validation procedure is needed
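A minimal sketch of such an early-stopping loop (the two callables are hypothetical placeholders for one training epoch and the validation logloss):

```python
def train_with_early_stopping(run_epoch, val_logloss, max_epochs=100, patience=3):
    """Stop once validation logloss fails to improve for `patience` epochs."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(max_epochs):
        run_epoch()                 # hypothetical: one pass of SG training
        loss = val_logloss()        # hypothetical: logloss on a validation set
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch, best_loss
```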
Experiments: Two CTR Sets
Set 1:

method   test logloss   rank
Linear   0.46224        91
Poly2    0.44956        14
FM       0.44922        14
FM       0.44847        11
FFM      0.44603         3

Set 2:

method   test logloss   rank
Linear   0.38833        64
Poly2    0.38347        10
FM       0.38407        11
FM       0.38531        15
FFM      0.38223         6

For the same method (e.g., FM), we try different parameters
Experiments: Two CTR Sets (Cont’d)
For these two sets, FFM is the best
For winning competitions, some additional tricks are used
Experiments: Other Sets
• Can FFM work well for other sets? Can we identify when it's useful?
• We try the following data
Data Set # instances # features # fields
KDD2010-bridge 20,012,499 651,166 9
KDD2012 20,950,284 19,147,857 11
phishing 11,055 100 30
adult 48,842 308 14
cod-rna (dummy fields) 331,152 8 8
cod-rna (discretization) 331,152 2,296 8
ijcnn (dummy fields) 141,691 22 22
ijcnn (discretization) 141,691 69,867 22
Experiments: Other Sets (Cont’d)
Data Set                    LM        Poly2      FM         FFM
KDD2010-bridge              0.30910   0.27448    0.28437    *0.26899
KDD2012                     0.49375   0.49640    0.49292    *0.48700
phishing                    0.11493   0.09659    0.09461    *0.09374
adult                       0.30897   *0.30757   0.30959    0.30760
cod-rna (dummy fields)      0.13829   0.12874    *0.12580   0.12914
cod-rna (discretization)    0.16455   0.17576    0.16570    *0.14993
ijcnn (dummy fields)        0.20627   0.09209    0.07696    *0.07668
ijcnn (discretization)      0.21686   0.22546    0.22259    *0.18635

The best result for each data set is marked with *
Experiments: Other Sets (Cont’d)
For data with categorical features, FFM works well

For some data (e.g., adult), feature interactions are not useful
It’s not easy for FFM to handle numerical features
Outline
1 Matrix factorization
2 Factorization machines
3 Field-aware factorization machines
4 Optimization methods for large-scale training
5 Discussion and conclusions
Solving the Optimization Problem
MF, FM, and FFM all involve optimization problems. The optimization techniques for them are related, but differ due to the different problem structures

Given the time constraint, I will only briefly discuss some optimization techniques for matrix factorization
Matrix Factorization
Recall we have a non-convex optimization problem:
\min_{P,Q} \sum_{(u,v) \in R} \Big( (r_{u,v} - p_u^T q_v)^2 + \lambda_P \|p_u\|_F^2 + \lambda_Q \|q_v\|_F^2 \Big)
Existing optimization techniques include

ALS: alternating least squares
CD: coordinate descent
SG: stochastic gradient
Complexity in Training MF
To update P and Q once:

ALS: O(|R|k^2 + (m + n)k^3)
CD: O(|R|k)

To go through the |R| elements once:

SG: O(|R|k)
I don’t discuss details, but this indicates that CD and SG are generally more efficient
Stochastic Gradient for Matrix Factorization
SG update rule:
p_u ← p_u + γ (e_{u,v} q_v − λ_P p_u)
q_v ← q_v + γ (e_{u,v} p_u − λ_Q q_v)

where

e_{u,v} ≡ r_{u,v} − p_u^T q_v

Two issues:

SG is sensitive to the learning rate
SG is inherently sequential
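A direct NumPy sketch of this update rule (illustrative only; γ, λ_P, λ_Q as defined above):

```python
import numpy as np

def sg_update(P, Q, u, v, r, gamma, lam_p, lam_q):
    """One SG step on rating (u, v, r); P is k x m, Q is k x n."""
    e = r - P[:, u] @ Q[:, v]   # e_{u,v} = r_{u,v} - p_u^T q_v
    pu_old = P[:, u].copy()     # q_v's update uses the old p_u
    P[:, u] += gamma * (e * Q[:, v] - lam_p * P[:, u])
    Q[:, v] += gamma * (e * pu_old - lam_q * Q[:, v])
```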
SG’s Learning Rate
We can apply advanced settings such as ADAGRAD (Duchi et al., 2011)
Each element of latent vectors pu, qv has its own learning rate
Maintaining so many learning rates can be quite expensive
How about a modification that lets the whole p_u (or the whole q_v) be associated with a single rate? (Chin et al., 2015b)

This is an example of taking MF's properties into account
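A sketch of this per-vector scheme (my reading of the idea, with details simplified): keep one accumulated scalar per latent vector instead of one per coordinate as in ADAGRAD.

```python
import numpy as np

def per_vector_sg_update(P, Q, G_p, G_q, u, v, r, eta, lam_p, lam_q):
    """SG step where each latent vector shares one adaptive learning rate.

    G_p[u], G_q[v]: accumulated average squared gradient of p_u and q_v,
    one scalar per vector rather than one per coordinate.
    """
    e = r - P[:, u] @ Q[:, v]
    g_p = -e * Q[:, v] + lam_p * P[:, u]   # gradient w.r.t. p_u
    g_q = -e * P[:, u] + lam_q * Q[:, v]   # gradient w.r.t. q_v
    G_p[u] += (g_p @ g_p) / g_p.size
    G_q[v] += (g_q @ g_q) / g_q.size
    P[:, u] -= eta / np.sqrt(G_p[u]) * g_p
    Q[:, v] -= eta / np.sqrt(G_q[v]) * g_q
```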
SG for Parallel MF
[Figure: a 6 × 6 rating matrix; after r_{3,3} is selected, the ratings in the gray blocks (those sharing its row or column) cannot be updated]

But r_{6,6} can be used: the updates for

r_{3,1} = p_3^T q_1, r_{3,2} = p_3^T q_2, ..., r_{3,6} = p_3^T q_6

all share p_3, while

r_{3,3} = p_3^T q_3 and r_{6,6} = p_6^T q_6

involve no common variables
SG for Parallel MF (Cont’d)
We can split the matrix into blocks and update those that do not share any p or q

[Figure: the 6 × 6 matrix split into blocks; blocks sharing no rows or columns can be updated in parallel]

This concept is simple, but there are many issues in getting a right implementation under the given architecture
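To illustrate only the block idea (real implementations handle scheduling, load balance, and memory layout far more carefully), a toy round-based scheduler; sg_on_block is a hypothetical callable that runs SG on the ratings of one block:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_block_pass(sg_on_block, n_blocks):
    """One pass over an n_blocks x n_blocks grid of rating blocks.

    Within a round, blocks (i, (i + shift) % n_blocks) share no users
    and no items, so workers can update them simultaneously.
    """
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        for shift in range(n_blocks):
            blocks = [(i, (i + shift) % n_blocks) for i in range(n_blocks)]
            list(pool.map(sg_on_block, blocks))
```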
SG for Parallel MF (Cont’d)
Past developments of SG for parallel MF include Gemulla et al. (2011); Chin et al. (2015a); Yun et al. (2014)
However, the idea of block splits applies to MF only. We haven't seen an easy way to extend it to FM or FFM
This is another example where we take problem structure into account
Outline
1 Matrix factorization
2 Factorization machines
3 Field-aware factorization machines
4 Optimization methods for large-scale training
5 Discussion and conclusions
Discussion and Conclusions
In this talk we briefly discuss three models for recommender systems
MF, FM, and FFM
They are related, but are useful in different situations
Different algorithms may be needed due to different properties of the optimization problems
Acknowledgments
Past and current students who have contributed to this work:
Wei-Sheng Chin, Yu-Chin Juan, Meng-Yuan Yang, Bo-Wen Yuan, Yong Zhuang
We thank the Ministry of Science and Technology in Taiwan for its support