(1)

Matrix Factorization and Factorization Machines for Recommender Systems

Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at 4th Workshop on Large-Scale Recommender Systems, ACM RecSys, 2016

(2)

1 Matrix factorization

2 Factorization machines

3 Field-aware factorization machines

4 Optimization methods for large-scale training

5 Discussion and conclusions

(3)

Outline

1 Matrix factorization

2 Factorization machines

3 Field-aware factorization machines

4 Optimization methods for large-scale training

5 Discussion and conclusions

(4)

Matrix Factorization

Matrix Factorization is an effective method for recommender systems (e.g., Netflix Prize and KDD Cup 2011)

A group of users give ratings to some items:

User  Item  Rating
   1     5     100
   1    13      30
 ...   ...     ...
   u     v       r
 ...   ...     ...

The information can be represented by a rating matrix R

(5)

Matrix Factorization (Cont’d)

[Figure: the m × n rating matrix R; rows are users u = 1, ..., m, columns are items v = 1, ..., n, entries are ratings r_{u,v}, and ?_{2,2} marks an unobserved entry]

m, n: numbers of users and items

u, v: indices for the uth user and the vth item

r_{u,v}: the uth user gives a rating r_{u,v} to the vth item

(6)

Matrix Factorization (Cont’d)

[Figure: R (m × n) ≈ P^T (m × k) × Q (k × n); the uth row of P^T is p_u^T and the vth column of Q is q_v]

k: number of latent dimensions

r_{u,v} = p_u^T q_v

The unobserved entry ?_{2,2} is predicted by p_2^T q_2
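As a concrete illustration (not from the original slides), the following minimal Python sketch builds small random latent matrices P and Q and predicts a missing entry exactly as p_u^T q_v; the sizes m, n, k and the index (2, 2) are arbitrary choices.

```python
import numpy as np

# Toy sizes (arbitrary): m users, n items, k latent dimensions
m, n, k = 4, 6, 2
rng = np.random.default_rng(0)

P = rng.normal(size=(k, m))  # column u is the latent vector p_u
Q = rng.normal(size=(k, n))  # column v is the latent vector q_v

# Predicted rating of user u for item v: p_u^T q_v
u, v = 1, 1                  # 0-based indices for the "?_{2,2}" entry
r_hat = P[:, u] @ Q[:, v]
print(f"predicted r_(2,2) = {r_hat:.3f}")

# The full reconstructed rating matrix is P^T Q (m x n)
R_hat = P.T @ Q
assert np.isclose(R_hat[u, v], r_hat)
```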

(7)

Matrix Factorization (Cont’d)

A non-convex optimization problem:

\min_{P,Q} \sum_{(u,v) \in R} \left( (r_{u,v} - p_u^T q_v)^2 + \lambda_P \|p_u\|_F^2 + \lambda_Q \|q_v\|_F^2 \right)

\lambda_P and \lambda_Q are regularization parameters

Many optimization methods have been successfully applied

Overall, MF is a mature technique
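To make the objective concrete, here is a short sketch (my own illustration, with made-up observed entries) that evaluates this regularized squared-error objective for given P and Q.

```python
import numpy as np

def mf_objective(P, Q, ratings, lam_p, lam_q):
    """Regularized MF objective over observed entries.

    ratings: list of (u, v, r_uv) triples; P is k x m, Q is k x n.
    """
    obj = 0.0
    for u, v, r in ratings:
        err = r - P[:, u] @ Q[:, v]
        obj += err**2 + lam_p * np.sum(P[:, u]**2) + lam_q * np.sum(Q[:, v]**2)
    return obj

rng = np.random.default_rng(0)
P, Q = rng.normal(size=(2, 4)), rng.normal(size=(2, 6))
ratings = [(0, 4, 5.0), (0, 2, 3.0), (3, 1, 4.0)]   # toy (u, v, rating) triples
print(mf_objective(P, Q, ratings, lam_p=0.05, lam_q=0.05))
```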

(8)

Outline

1 Matrix factorization

2 Factorization machines

3 Field-aware factorization machines

4 Optimization methods for large-scale training

5 Discussion and conclusions

(9)

MF versus Classification/Regression

MF solves

\min_{P,Q} \sum_{(u,v) \in R} (r_{u,v} - p_u^T q_v)^2

Note that I omit the regularization term

Ratings are the only given information

This doesn't sound like a classification or regression problem

But indeed we can make some interesting connections

(10)

Handling User/Item Features

What if instead of user/item IDs we are given user and item features?

Assume user u and item v have feature vectors f_u ∈ R^U and g_v ∈ R^V, where

U ≡ number of user features
V ≡ number of item features

How to use these features to build a model?

(11)

Handling User/Item Features (Cont’d)

We can consider a regression problem where data instances are

value     features
 ...        ...
r_{u,v}   f_u^T  g_v^T
 ...        ...

and solve

\min_{w} \sum_{(u,v) \in R} \left( r_{u,v} - w^T \begin{bmatrix} f_u \\ g_v \end{bmatrix} \right)^2
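A minimal sketch of this baseline (my own illustration with toy data): each observed rating becomes one regression instance whose feature vector is the concatenation [f_u; g_v], fit here with ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
U, V = 3, 4                                    # toy numbers of user/item features
F = rng.normal(size=(5, U))                    # row u is f_u
G = rng.normal(size=(6, V))                    # row v is g_v
ratings = [(0, 2, 4.0), (1, 5, 2.0), (3, 0, 5.0), (4, 3, 1.0)]

# Build the regression problem: one row [f_u, g_v] per observed rating
X = np.array([np.concatenate([F[u], G[v]]) for u, v, _ in ratings])
y = np.array([r for _, _, r in ratings])

w, *_ = np.linalg.lstsq(X, y, rcond=None)      # least-squares solution
print("prediction for first pair:", X[0] @ w)
```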

(12)

Feature Combinations

However, this does not take the interaction between users and items into account

Following the concept of degree-2 polynomial mappings in SVM, we can generate new features

(f_u)_t (g_v)_s,  t = 1, ..., U,  s = 1, ..., V

and solve

\min_{w_{t,s}, \forall t,s} \sum_{(u,v) \in R} \left( r_{u,v} - \sum_{t=1}^{U} \sum_{s=1}^{V} w_{t,s} (f_u)_t (g_v)_s \right)^2

(13)

Feature Combinations (Cont’d)

This is equivalent to

\min_{W} \sum_{(u,v) \in R} (r_{u,v} - f_u^T W g_v)^2,

where W ∈ R^{U×V} is a matrix

If we have vec(W) by concatenating W's columns, another form is

\min_{W} \sum_{(u,v) \in R} \left( r_{u,v} - \mathrm{vec}(W)^T \begin{bmatrix} \vdots \\ (f_u)_t (g_v)_s \\ \vdots \end{bmatrix} \right)^2
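The following short sketch (my own check, with random toy vectors) verifies numerically that the double sum over feature combinations, the bilinear form f_u^T W g_v, and the vec(W) inner product all give the same value.

```python
import numpy as np

rng = np.random.default_rng(0)
U, V = 3, 4
f_u, g_v = rng.normal(size=U), rng.normal(size=V)
W = rng.normal(size=(U, V))

# 1) explicit double sum over feature combinations
s1 = sum(W[t, s] * f_u[t] * g_v[s] for t in range(U) for s in range(V))
# 2) bilinear form f_u^T W g_v
s2 = f_u @ W @ g_v
# 3) inner product of vec(W) with the vector of products (f_u)_t (g_v)_s
s3 = W.flatten(order="F") @ np.outer(f_u, g_v).flatten(order="F")

assert np.allclose([s1, s2], s3)
print(s1, s2, s3)
```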

(14)

Feature Combinations (Cont’d)

However, this setting fails for extremely sparse features

Consider the most extreme situation. Assume we have user ID and item ID as features

Then

U = m,  V = n,  f_i = [0, ..., 0, 1, 0, ..., 0]^T,

where the single 1 is at position i (after i − 1 zeros)

(15)

Feature Combinations (Cont’d)

The optimal solution is

W_{u,v} = r_{u,v} if (u,v) ∈ R, and W_{u,v} = 0 if (u,v) ∉ R

We can never predict

r_{u,v},  (u,v) ∉ R

(16)

Factorization Machines

The reason why we cannot predict unseen data is that in the optimization problem

# variables = mn ≫ # instances = |R|

Overfitting occurs

Remedy: we can let

W ≈ P^T Q,

where P and Q are low-rank matrices. This becomes matrix factorization

(17)

Factorization Machines (Cont’d)

This can be generalized to sparse user and item features

\min_{P,Q} \sum_{(u,v) \in R} (r_{u,v} - f_u^T P^T Q g_v)^2

That is, we think

P f_u and Q g_v

are latent representations of user u and item v, respectively

We can also consider the interaction between elements in f_u (or elements in g_v)
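A minimal sketch (my own illustration) of this prediction rule: with one-hot f_u and g_v it reduces exactly to p_u^T q_v of plain MF, and with richer feature vectors P f_u and Q g_v act as the latent representations.

```python
import numpy as np

rng = np.random.default_rng(0)
U, V, k = 5, 6, 2                       # toy feature and latent dimensions
P = rng.normal(size=(k, U))             # latent matrix for user features
Q = rng.normal(size=(k, V))             # latent matrix for item features

def predict(f_u, g_v):
    # f_u^T P^T Q g_v = (P f_u)^T (Q g_v)
    return (P @ f_u) @ (Q @ g_v)

# With one-hot IDs this is exactly p_u^T q_v of plain MF
f_u = np.eye(U)[2]                      # user 2
g_v = np.eye(V)[4]                      # item 4
assert np.isclose(predict(f_u, g_v), P[:, 2] @ Q[:, 4])

# With general (possibly dense) features, P f_u and Q g_v are the latent representations
print(predict(rng.normal(size=U), rng.normal(size=V)))
```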

(18)

Factorization Machines (Cont’d)

The new formulation is

\min_{P,Q} \sum_{(u,v) \in R} \left( r_{u,v} - \begin{bmatrix} f_u^T & g_v^T \end{bmatrix} \begin{bmatrix} P^T \\ Q^T \end{bmatrix} \begin{bmatrix} P & Q \end{bmatrix} \begin{bmatrix} f_u \\ g_v \end{bmatrix} \right)^2

This becomes factorization machines (Rendle, 2010)

(19)

Factorization Machines (Cont’d)

Similar ideas have been used in other places such as Stern et al. (2009)

We see that such ideas can be used not only for recommender systems.

They may be useful for any classification problems with very sparse features

(20)

FM for Classification

In a classification setting, assume a data instance is x ∈ R^n

Linear model: w^T x

Degree-2 polynomial mapping: x^T W x

(21)

FM for Classification (Cont’d)

FM:

x^T P^T P x

or alternatively

\sum_{i,j} x_i p_i^T p_j x_j,

where p_i, p_j ∈ R^k

That is, in FM each feature is associated with a latent factor
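As a quick numerical check (my own sketch, not from the slides), x^T P^T P x equals ||P x||^2, so the pairwise sum over all (i, j) never needs to be formed explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
x = rng.normal(size=n)
P = rng.normal(size=(k, n))             # column i is the latent vector p_i of feature i

# Pairwise form: sum_{i,j} x_i p_i^T p_j x_j
pairwise = sum(x[i] * (P[:, i] @ P[:, j]) * x[j] for i in range(n) for j in range(n))

# Compact form: x^T P^T P x = ||P x||^2, computed in O(nk) instead of O(n^2 k)
compact = np.sum((P @ x) ** 2)

assert np.isclose(pairwise, compact)
print(pairwise, compact)
```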

(22)

Outline

1 Matrix factorization

2 Factorization machines

3 Field-aware factorization machines

4 Optimization methods for large-scale training

5 Discussion and conclusions

(23)

Field-aware Factorization Machines

We have seen that FM seems useful for handling highly sparse features such as user IDs

What if we have more than two ID fields?

For example, in CTR (click-through rate) prediction for computational advertising, we may have

clicked    features
  ...        ...
  Yes      user ID, Ad ID, site ID
  ...        ...

(24)

Field-aware Factorization Machines (Cont’d)

FM can be generalized to handle different interactions between fields

Two latent matrices for user ID and Ad ID
Two latent matrices for user ID and site ID
...

We call this approach FFM (field-aware factorization machines)

An early study on three fields is Rendle and Schmidt-Thieme (2010)
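To illustrate the field-aware idea, here is a minimal sketch (my own illustration; it covers only the pairwise interaction term and omits any linear and bias terms): each feature j keeps a separate latent vector for every other field, and the interaction between feature j1 (in field f1) and feature j2 (in field f2) uses the vectors w_{j1,f2} and w_{j2,f1}.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_fields, k = 6, 3, 2
field_of = [0, 0, 1, 1, 2, 2]                    # toy: which field each feature belongs to

# W[j, f] is the latent vector feature j uses when interacting with field f
W = rng.normal(size=(n_features, n_fields, k))

def ffm_score(x):
    """Sum of pairwise field-aware interactions for a feature vector x."""
    score = 0.0
    for j1 in range(n_features):
        for j2 in range(j1 + 1, n_features):
            f1, f2 = field_of[j1], field_of[j2]
            score += (W[j1, f2] @ W[j2, f1]) * x[j1] * x[j2]
    return score

x = np.array([1.0, 0, 0, 1.0, 0, 1.0])           # one active feature per field (e.g., IDs)
print(ffm_score(x))
```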

(25)

FFM for CTR Prediction

It’s used by Jahrer et al. (2012) to win the 2nd prize of KDD Cup 2012

In 2014 my students used FFM to win two Kaggle CTR competitions

After we used FFM to win the first competition, in the second competition all top teams used FFM

Note that for CTR prediction, the logistic loss rather than the squared loss is used

(26)

Practical Use of FFM

Recently we conducted a detailed study on FFM (Juan et al., 2016)

Here I briefly discuss some results from that study

(27)

Numerical Features

For categorical features like IDs, we have

ID: field
ID index: feature

Each field has many 0/1 features

But how about numerical features?

Two possibilities:

Dummy fields: the field has only one real-valued feature

Discretization: transform a numerical feature into a categorical one, and then into many binary features
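A small sketch (my own illustration) of the two encodings for a numerical feature such as age: as a dummy field it stays a single real-valued feature, while discretization maps it into one of several 0/1 bucket features; the bucket boundaries here are arbitrary.

```python
import numpy as np

age = 37.0

# Dummy field: the field contributes a single real-valued feature
dummy_field = np.array([age])

# Discretization: bucket the value, then one-hot encode the bucket
bins = [18, 25, 35, 45, 60]                 # arbitrary bucket boundaries
bucket = np.digitize(age, bins)             # index of the bucket age falls into
discretized = np.zeros(len(bins) + 1)
discretized[bucket] = 1.0

print(dummy_field)        # [37.]
print(discretized)        # [0. 0. 0. 1. 0. 0.]
```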

(28)

Normalization

After obtaining the feature vector, empirically we find that instance-wise normalization is useful

It gives faster convergence and better test accuracy
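For concreteness, a tiny sketch of one form of instance-wise normalization (my own illustration; here each instance is scaled to unit L2 norm, which may differ from the exact scaling used in Juan et al. (2016)).

```python
import numpy as np

def normalize_instances(X, eps=1e-12):
    """Scale each row (instance) of X to unit L2 norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

X = np.array([[3.0, 4.0, 0.0],
              [1.0, 0.0, 1.0]])
print(normalize_instances(X))   # each row now has norm 1
```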

(29)

Impact of Parameters

We have the following parameters

k: number of latent factors
λ: regularization parameter
parameters of the optimization method (e.g., the learning rate of stochastic gradient)

Their impact on the performance varies

(30)

Example: Regularization Parameter λ

[Figure: test logloss (roughly 0.44 to 0.50) versus training epochs (20 to 140) for λ = 1e−6, 1e−5, 1e−4, 1e−3]

Too large λ: model not good

Too small λ: better model, but easily overfitting

Similar situations occur for SG learning rates

Early stopping by a validation procedure is needed

(31)

Experiments: Two CTR Sets

First set:

method   test logloss   rank
Linear   0.46224        91
Poly2    0.44956        14
FM       0.44922        14
FM       0.44847        11
FFM      0.44603         3

Second set:

method   test logloss   rank
Linear   0.38833        64
Poly2    0.38347        10
FM       0.38407        11
FM       0.38531        15
FFM      0.38223         6

For the same method (e.g., FM), we try different parameters

(32)

Experiments: Two CTR Sets (Cont’d)

For these two sets, FFM is the best

For winning competitions, some additional tricks are used

(33)

Experiments: Other Sets

• Can FFM work well for other sets? Can we identify when it is useful?

• We try the following data

Data Set # instances # features # fields

KDD2010-bridge 20,012,499 651,166 9

KDD2012 20,950,284 19,147,857 11

phishing 11,055 100 30

adult 48,842 308 14

cod-rna (dummy fields) 331,152 8 8

cod-rna (discretization) 331,152 2,296 8

ijcnn (dummy fields) 141,691 22 22

ijcnn (discretization) 141,691 69,867 22

(34)

Experiments: Other Sets (Cont’d)

Data Set                  LM        Poly2     FM        FFM
KDD2010-bridge            0.30910   0.27448   0.28437   0.26899*
KDD2012                   0.49375   0.49640   0.49292   0.48700*
phishing                  0.11493   0.09659   0.09461   0.09374*
adult                     0.30897   0.30757*  0.30959   0.30760
cod-rna (dummy fields)    0.13829   0.12874   0.12580*  0.12914
cod-rna (discretization)  0.16455   0.17576   0.16570   0.14993*
ijcnn (dummy fields)      0.20627   0.09209   0.07696   0.07668*
ijcnn (discretization)    0.21686   0.22546   0.22259   0.18635*

The best result in each row is marked with *

(35)

Experiments: Other Sets (Cont’d)

For data with categorical features, FFM works well

For some data (e.g., adult), feature interactions are not useful

It's not easy for FFM to handle numerical features

(36)

Outline

1 Matrix factorization

2 Factorization machines

3 Field-aware factorization machines

4 Optimization methods for large-scale training

5 Discussion and conclusions

(37)

Solving the Optimization Problem

MF, FM, and FFM all involve optimization problems

The optimization techniques for them are related, but differ due to different problem structures

Due to the time constraint, I will only briefly discuss some optimization techniques for matrix factorization

(38)

Matrix Factorization

Recall we have a non-convex optimization problem:

\min_{P,Q} \sum_{(u,v) \in R} \left( (r_{u,v} - p_u^T q_v)^2 + \lambda_P \|p_u\|_F^2 + \lambda_Q \|q_v\|_F^2 \right)

Existing optimization techniques include

ALS: alternating least squares
CD: coordinate descent
SG: stochastic gradient

(39)

Complexity in Training MF

To update P, Q once

ALS: O(|R|k^2 + (m + n)k^3)
CD: O(|R|k)

To go through |R| elements once

SG: O(|R|k)

I don't discuss details, but this indicates that CD and SG are generally more efficient

(40)

Stochastic Gradient for Matrix Factorization

SG update rule:

p_u ← p_u + γ (e_{u,v} q_v − λ_P p_u)
q_v ← q_v + γ (e_{u,v} p_u − λ_Q q_v)

where

e_{u,v} ≡ r_{u,v} − p_u^T q_v

Two issues:

SG is sensitive to the learning rate
SG is inherently sequential

(41)

SG’s Learning Rate

We can apply advanced settings such as ADAGRAD (Duchi et al., 2011)

Each element of latent vectors pu, qv has its own learning rate

Maintaining so many learning rates can be quite expensive

How about a modification that lets the whole p_u (or the whole q_v) associate with one learning rate? (Chin et al., 2015b)

This is an example of taking MF's properties into account
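A rough sketch (my own simplification, not the exact scheme of Chin et al. (2015b)) contrasting ADAGRAD's per-element learning rates with a per-vector variant that keeps a single accumulated statistic, and hence a single rate, for the whole vector p_u.

```python
import numpy as np

def adagrad_update(p, g, G, gamma0=0.1, eps=1e-8):
    """Per-element ADAGRAD: each coordinate of p has its own learning rate."""
    G += g ** 2                                   # accumulated squared gradient per element
    p -= gamma0 * g / (np.sqrt(G) + eps)
    return p, G

def per_vector_update(p, g, G, gamma0=0.1, eps=1e-8):
    """Per-vector variant: one accumulated scalar (and one rate) for the whole vector."""
    G += np.mean(g ** 2)                          # single scalar statistic for the vector
    p -= gamma0 * g / (np.sqrt(G) + eps)
    return p, G

p = np.zeros(4)
g = np.array([0.5, -1.0, 0.1, 0.0])              # toy gradient of one SG step
print(adagrad_update(p.copy(), g, np.zeros(4)))
print(per_vector_update(p.copy(), g, 0.0))
```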

(42)

SG for Parallel MF

After r_{3,3} is selected, ratings in the gray blocks (its row and its column) cannot be updated, but r_{6,6} can be used

[Figure: a 6 × 6 rating matrix; picking r_{3,3} blocks out row 3 (r_{3,1}, ..., r_{3,6}) and column 3, while r_{6,6} lies outside both]

r_{3,1} = p_3^T q_1
r_{3,2} = p_3^T q_2
...
r_{3,6} = p_3^T q_6

all involve p_3, as does r_{3,3} = p_3^T q_3, whereas r_{6,6} = p_6^T q_6 shares neither p_3 nor q_3

(43)

SG for Parallel MF (Cont’d)

We can split the matrix into blocks and update in parallel those blocks that don't share p or q

[Figure: a 6 × 6 grid of blocks; blocks lying in different rows and different columns can be updated concurrently]

This concept is simple, but there are many issues in getting a right implementation on the given architecture
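A tiny sketch (my own illustration) of the block idea: with the rating matrix split into a B × B grid, the blocks processed in one round have distinct row and column indices, so they touch disjoint sets of p_u and q_v and can be updated by different workers at the same time. The diagonal-style schedule below is one simple way to pick such blocks.

```python
# Illustrative block schedule: in round d, worker b handles block (b, (b + d) % B);
# no two blocks in a round share a block row (users) or block column (items).
B = 4  # number of block rows/columns (and workers)

for d in range(B):                      # B rounds cover all B*B blocks
    round_blocks = [(b, (b + d) % B) for b in range(B)]
    rows = [i for i, _ in round_blocks]
    cols = [j for _, j in round_blocks]
    assert len(set(rows)) == B and len(set(cols)) == B   # no shared p or q within a round
    print(f"round {d}: update blocks {round_blocks} in parallel")
```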

(44)

SG for Parallel MF (Cont’d)

Past developments of SG for parallel MF include Gemulla et al. (2011); Chin et al. (2015a); Yun et al. (2014)

However, the idea of block splits applies to MF only; we haven't seen an easy way to extend it to FM or FFM

This is another example where we take problem structure into account

(45)

Outline

1 Matrix factorization

2 Factorization machines

3 Field-aware factorization machines

4 Optimization methods for large-scale training

5 Discussion and conclusions

(46)

Discussion and Conclusions

In this talk we briefly discuss three models for recommender systems

MF, FM, and FFM

They are related, but are useful in different situations

Different algorithms may be needed due to different properties of the optimization problems

(47)

Acknowledgments

Past and current students who have contributed to this work:

Wei-Sheng Chin
Yu-Chin Juan
Meng-Yuan Yang
Bo-Wen Yuan
Yong Zhuang

We thank the Ministry of Science and Technology in Taiwan for its support
