(1)

Solutions and Experiences from KDD Cup 2011

Track 1: A Linear Ensemble of Individual and Blended Models for Music Rating Prediction

Po-Lung Chen, Chen-Tse Tsai, Yao-Nan Chen, Ku-Chun Chou, Chun-Liang Li, Cheng-Hao Tsai, Kuan-Wei Wu, Yu-Cheng Chou, Chung-Yi Li, Wei-Shih Lin, Shu-Hao Yu, Rong-Bing Chiu, Chieh-Yen Lin,

Chien-Chih Wang, Po-Wei Wang, Wei-Lun Su, Chen-Hung Wu, Tsung-Ting Kuo, Todd G. McKenzie, Ya-Hsuan Chang, Chun-Sung Ferng,

Chia-Mau Ni, Hsuan-Tien Lin, Chih-Jen Lin and Shou-De Lin

National Taiwan University

(2)

What is KDD Cup?

Background

an annual competition on KDD (knowledge discovery and data mining)

organized by ACM SIGKDD, starting from 1997; now the most prestigious data mining competition

usually lasts 3-4 months

participants include famous research labs (IBM, AT&T) and top universities (Stanford, Berkeley)

Aim

bridge the gap between theory and practice; define the state-of-the-art

(3)

KDD Cup 2011

Music Recommendation Systems

host: Yahoo!

11 years of Yahoo! Music data

2 tracks of competition

official dates: March 15 to June 30

1878 teams submitted to track 1; 1854 teams submitted to track 2

(4)

NTU Team for KDD Cup 2011

3 faculty members:

Profs. Chih-Jen Lin, Hsuan-Tien Lin and Shou-De Lin

1 course (similar to what was done in 2010):

Data Mining and Machine Learning: Theory and Practice

3 TAs and 19 students:

most were inexperienced in music recommendation at the beginning

official classes: April to June;

actual classes: December to June

our motto: study state-of-the-art approaches and then creatively improve them

(5)

The Track 1 Problem (1/2)

Given Data

263M examples (user u, item i, rating r_ui, date t_ui, time)

user  item  rating  date  time
1     21    10      102   23:52
1     213   90      1032  21:01
4     45    95      768   09:15

u, i: abstract IDs

r_ui: integer between 0 and 100, mostly multiples of 10

Additional Information: Item Hierarchy

track (46.85%), album (19.01%), artist (28.84%), genre (5.30%)

(6)

The Track 1 Problem (2/2)

Data Partitioned by Organizers

training: 253M; validation: 4M; test (w/o rating): 6M

per user, training < validation < test in time

≥ 20 examples total per user

4 examples in validation; 6 in test

fixed random half of test: leaderboard; another half: award decision

Goal

predictions r̂_ui ≈ r_ui on the test set, measured by RMSE = √(average (r̂_ui − r_ui)²)

note: one submission allowed every eight hours

(7)

Three Properties of Track 1 Data

R =

        track1  track2  album3  author4  ...  genreI
user1   100     80      70      ?        ...
user2   0       ?               80       ...
...
userU   ?       20                       ...  0

similar to Netflix data, but with the following differences...

scale: larger data

training: study mature models that are computationally feasible

taxonomy: relation graph of tracks, albums, authors and genres

include as features for combining models nonlinearly

time: detailed; training earlier than validation earlier than test

include as features for combining models nonlinearly; respect time-closeness during training

(8)

Framework of Our Solution

single models—computationally feasible models that are diverse:

individual models: matrix factorization (& pPCA), pLSA

residual models: R-Boltz. machine, k-NN

derivative model: regression with statistical & model-based features

validation-set blending:

combine models nonlinearly while respecting time-closeness

test-set blending:

combine models linearly while fitting the leaderboard feedback

post processing:

polish predictions using findings during data analysis

(9)

RMSE Performance at Each Stage of Framework

single models: 22.7915

individual models: best RMSE 22.9022 (MF)

residual models: best RMSE 22.7915 (k-NN + MF)

derivative model: best RMSE 24.1251 (but helps in later stages)

validation-set blending: 21.3598 [improvement 1.4317]

test-set blending: (estimated) 21.0253 [improvement 0.3345]

post processing: 21.0147 [improvement 0.0106]

both blending stages: key to the system

(10)

Single Model: Matrix Factorization (1/2)

Basic Idea

R ≈ R̂ = PᵀQ on the known examples

Pᵀ: U (number of users) by F (number of factors) user-factor matrix

Q: F (number of factors) by I (number of items) item-factor matrix

one of the most commonly-used single models

Training

learn P and Q from data:

min_{P,Q} Σ_{(u,i)∈data} (r̂_ui − r_ui)², each summand denoted E_ui(·)

s.t. r̂_ui = p_uᵀ q_i

large-scale optimization tool: stochastic gradient descent (SGD)

1. randomly pick one example (u, i)

2. P ← P − η ∇E_ui(P) (similar for Q)
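
To make the SGD procedure concrete, below is a minimal NumPy sketch of the plain factorization model above (no bias or time terms yet); the factor dimension, learning rate, regularization constant, and epoch count are illustrative placeholders, not the team's actual settings.

# A minimal sketch of MF trained with SGD; hyperparameters are illustrative only.
import random
import numpy as np

def train_mf(ratings, num_users, num_items, F=50, lr=0.002, reg=0.02, epochs=10):
    """ratings: list of (u, i, r_ui) triples with 0 <= r_ui <= 100."""
    rng = np.random.default_rng(0)
    P = 0.01 * rng.standard_normal((num_users, F))   # user factors p_u
    Q = 0.01 * rng.standard_normal((num_items, F))   # item factors q_i
    for _ in range(epochs):
        random.shuffle(ratings)                      # randomly pick one example at a time
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]                    # r_ui - p_u^T q_i
            p_u = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * p_u)    # P <- P - lr * grad of E_ui w.r.t. p_u
            Q[i] += lr * (err * p_u - reg * Q[i])    # same step for q_i
    return P, Q

def predict(P, Q, u, i):
    """r_hat_ui = p_u^T q_i, clipped to the rating scale."""
    return float(np.clip(P[u] @ Q[i], 0, 100))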

(11)

Single Model: Matrix Factorization (2/2)

Matrix Factorization Variants

min_{P,Q,...} regularization + Σ_{(u,i)∈data} (r̂_ui − r_ui)²

s.t. r̂_ui = p_uᵀ q_i + r̄ + b_u + b_i + ... + c_i (t_ui − t_i^begin)

extended terms (overall bias r̄, user bias b_u, item bias b_i, time factor c_i, etc.): enhance the power of the model

regularization: control the complexity of the model

parameter selection: tried Automatic Parameter Tuning tool

included many variants in the final solution for diversity
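
As a rough illustration of how the extended terms enter the prediction, here is a sketch of a biased prediction rule; the symbol names (r_bar, b_user, b_item, c_time, t_begin) are placeholders for whatever bias and time parameters a particular variant actually learns.

# Sketch of an MF variant's prediction with bias and time terms (names are illustrative).
def predict_extended(P, Q, r_bar, b_user, b_item, c_time, t_begin, u, i, t_ui):
    return (P[u] @ Q[i]                         # latent-factor term p_u^T q_i
            + r_bar                             # overall bias
            + b_user[u]                         # user bias
            + b_item[i]                         # item bias
            + c_time[i] * (t_ui - t_begin[i]))  # time factor relative to the item's first rating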

(12)

Selected Ideas that Worked (1/5):

Time Emphasis in Stochastic Gradient Descent

Background

SGD for minimizing the sum of per-example E_ui(P):

randomly pick one example (u, i); P ← P − η ∇E_ui(P)

Idea

last M steps of SGD: effectively consider only the last M examples picked, so the final P is biased towards those

need: P respects time-closeness to the test examples

heuristic: deterministically pick the “newer” examples as last (see the sketch below)

consistent 0.05 RMSE improvement for MF
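
One way to realize the heuristic is sketched here: run the usual shuffled SGD passes first, then finish training on the newest examples so that the last updates reflect them. The fraction treated as "newer" is an arbitrary illustrative choice, not the team's value.

# Sketch of time-emphasized SGD: ordinary shuffled passes first, newest examples last.
import random

def sgd_with_time_emphasis(examples, sgd_step, epochs=10, newest_fraction=0.1):
    """examples: list of (u, i, r, date); sgd_step: applies one SGD update for an example."""
    by_date = sorted(examples, key=lambda ex: ex[3])
    cut = int(len(by_date) * (1 - newest_fraction))
    older, newer = by_date[:cut], by_date[cut:]
    for _ in range(epochs):
        batch = older[:]
        random.shuffle(batch)                 # usual random picking for most of training
        for ex in batch:
            sgd_step(ex)
    for ex in newer:                          # deterministically pick the "newer" examples as last
        sgd_step(ex)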

(13)

Single Model:

Probabilistic Principal Component Analysis

Basic Idea

P(r | u, i) = N(p_uᵀ q_i + r̄_u, σ²)

r̄_u: user rating average

can be viewed as probabilistic MF

prediction r̂_ui: expected rating over P(r | u, i)

Training

Expectation Maximization (EM) over maximum likelihood formulation

very similar to MF in the final solution

(14)

Single Model:

Probabilistic Latent Semantic Analysis

Basic Idea

P(r | u, i) = Σ_{κ=1..k} P(r | i, z = κ) P(z = κ | u)

z: the hidden variable representing user type

can be viewed as another way of factorization

prediction r̂_ui: expected rating over P(r | u, i)

Training

basic: EM over maximum likelihood formulation

improvement: tempered EM—EM + annealing (0.468 RMSE improvement)

not strong individually, but quite different from MF solutions
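
The prediction step, the expected rating under P(r | u, i), can be written in a few lines; the array layouts below are assumptions for illustration (the learned EM parameters could be stored in any convenient form).

# Sketch of the pLSA prediction as an expected rating; array layouts are assumptions.
import numpy as np

def plsa_predict(P_r_given_iz, P_z_given_u, rating_values, u, i):
    """P_r_given_iz: (num_items, k, num_values); P_z_given_u: (num_users, k);
    rating_values: the discrete rating levels (e.g. 0, 10, ..., 100)."""
    p_r = P_r_given_iz[i].T @ P_z_given_u[u]   # P(r | u, i) = sum_z P(r | i, z) P(z | u)
    return float(rating_values @ p_r)          # expected rating r_hat_ui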

(15)

Residual Model: Restricted Boltzmann Machine

Basic Idea

[diagram: discrete hidden factors above a visible layer of per-user incomplete discrete ratings; output: predicted continuous ratings]

a recursive NNet

popularly used in Netflix competition

Training

find the fixed point of the NNet weights by contrastive divergence

not better than MF individually,

but can be used to process residuals (see below)

(16)

Selected Ideas that Worked (2/5):

Gaussian RBM as Residual Model

Background

RBM: a recursive NNet; can be used as an individual model:

[diagram: discrete hidden factors above a visible layer of per-user incomplete discrete ratings; output: predicted continuous ratings]

as individual: RMSE 24.7433, worse than MF (22.9974)

Idea

MF (a first-order model) efficiently gets better performance, but can RBM digest something different?

need: RBM that learns from the residuals of MF, r_ui − r̂_ui^MF (continuous values)

(17)

Selected Ideas that Worked (2/5):

Gaussian RBM as Residual Model

Background

[diagram: discrete hidden factors above a visible layer of per-user incomplete discrete ratings; output: predicted continuous ratings]

Idea

need: RBM that learns from the residuals of MF

choice: Gaussian RBM (gRBM)

[diagram: discrete hidden factors above a visible layer of per-user incomplete continuous residuals; output: predicted continuous residuals]

MF+gRBM: 22.8008;

better than individual MF (22.9974) or RBM (24.7433) (a residual-modeling sketch follows)
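
The residual-modeling recipe itself is generic, as the sketch below shows: train a second model on r_ui − r̂_ui^MF and add its output back at prediction time. SecondStageModel is a placeholder for the Gaussian RBM (or any other residual learner); its interface here is assumed.

# Sketch of residual modeling on top of MF; SecondStageModel is a placeholder learner.
def fit_on_residuals(ratings, mf_predict, SecondStageModel):
    """ratings: list of (u, i, r_ui); mf_predict(u, i): first-stage MF prediction."""
    residuals = [(u, i, r - mf_predict(u, i)) for u, i, r in ratings]
    model = SecondStageModel()
    model.fit(residuals)                      # learn the continuous residuals r_ui - r_hat_ui^MF
    return model

def residual_predict(mf_predict, residual_model, u, i):
    return mf_predict(u, i) + residual_model.predict(u, i)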

(18)

Residual Model: k -Nearest Neighbor

Basic Idea

r̂_ui = Σ_{j∈G_k(u,i)} w_ij · r_uj / Σ_{j∈G_k(u,i)} w_ij

G_k(u, i): item-neighbors of item i (for user u)

w_ij: correlation between items i and j

Training

efficiently compute suitable neighbor and correlation functions

like RBM, not better than MF individually,

but can be used to process residuals
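
A direct rendering of the weighted-average formula above is sketched here; how the similarity w_ij is computed and how neighbors are found efficiently are left abstract, since those are the engineering parts the slide refers to.

# Sketch of the item-based k-NN prediction: correlation-weighted average over G_k(u, i).
def knn_predict(user_ratings, similarity, i, k=20):
    """user_ratings: dict {item j: r_uj} for one user u; similarity(i, j) -> w_ij."""
    neighbors = sorted(user_ratings, key=lambda j: similarity(i, j), reverse=True)[:k]
    num = sum(similarity(i, j) * user_ratings[j] for j in neighbors)
    den = sum(similarity(i, j) for j in neighbors)
    return num / den if den > 0 else 50.0     # fall back to mid-scale when no useful neighbors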

(19)

Derivative Model: Regression

Basic Idea

r̂_ui = g(x_ui)

x_ui ∈ R^d: some features related to user u and item i

statistical: number of ratings from u, number of genres of item i, etc.

model-based: p_u in MF, w_ij in k-NN, etc.

g: a regressor from R^d to R

linear regression

NNet

gradient boosting regression tree

can be flexibly used to include “side information” like hierarchy and time
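
A sketch of the derivative model: assemble statistical and model-based features per (u, i) pair and fit a regressor g. The two statistical features and the use of scikit-learn's GradientBoostingRegressor are illustrative choices, not a description of the team's exact feature set or settings.

# Sketch of the derivative regression model; the feature set and regressor are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def build_feature(u, i, user_num_ratings, item_num_genres, P_mf):
    return np.concatenate(([user_num_ratings[u],      # statistical: # ratings from u
                            item_num_genres[i]],      # statistical: # genres of item i
                           P_mf[u]))                  # model-based: p_u from MF

def fit_derivative_model(examples, user_num_ratings, item_num_genres, P_mf):
    """examples: list of (u, i, r_ui)."""
    X = np.array([build_feature(u, i, user_num_ratings, item_num_genres, P_mf)
                  for u, i, _ in examples])
    y = np.array([r for _, _, r in examples])
    g = GradientBoostingRegressor(n_estimators=100, max_depth=3)
    g.fit(X, y)
    return g     # r_hat_ui = g.predict([build_feature(u, i, ...)])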

(20)

Glance of Single Model RMSE

model              # used   best    average   worst   contribution
MF                 81       22.90   23.92     26.94   0.3645
pPCA               2        24.46   24.61     24.75   0.0014
pLSA               7        24.83   25.53     26.09   0.0042
R-Boltz. machine   8        22.80   24.75     26.08   0.0314
k-NN               18       22.79   25.06     42.94   0.0298
regression         10       24.13   28.01     35.14   0.0261

contribution (before val.-set blending): estimated RMSE diff. via leave-the-model-out in test-set blending

MF: most important (absorbing pPCA)

residual models: both quite important

derivative model: individually weak but adds diversity

val.-set blending: 95 models, best 21.36, average 23.53, worst 31.70

(21)

Selected Ideas that Worked (3/5) in Val.-Set Blending:

Multi-Feature and Multi-Stage Binned Lin. Reg.

Background

Binned Linear Regression: a conditional aggregation model

different model strength on different “types” of examples

different blending weights for different types (bins) to utilize strength

bins              # rating ≤ 1   1 < # rating ≤ 2   others
weight of MF-1    0.4            0.7                1.0
weight of RBM-1   0.5            0.1                0.0
weight of RBM-2   0.1            0.2                0.0

a simplified regression tree with one level (on one feature)
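
A sketch of one-feature binned linear regression is given below: validation examples are split into bins by a single feature (for example the user's rating count, as in the table above), and a separate least-squares blend of the model predictions is fit inside each bin. The bin edges and plain least-squares solver are illustrative.

# Sketch of one-feature binned linear regression blending; bin edges are illustrative.
import numpy as np

def fit_binned_blend(Z, y, bin_feature, bin_edges=(1, 2)):
    """Z: (N, M) model predictions on validation; y: (N,) true ratings;
    bin_feature: (N,) value used for binning (e.g. the user's # of ratings)."""
    bin_ids = np.digitize(bin_feature, bin_edges, right=True)   # 0: <=1, 1: (1,2], 2: others
    weights = {}
    for b in np.unique(bin_ids):
        mask = bin_ids == b
        weights[b], *_ = np.linalg.lstsq(Z[mask], y[mask], rcond=None)  # per-bin blend weights
    return weights

def binned_blend_predict(Z, bin_feature, weights, bin_edges=(1, 2)):
    bin_ids = np.digitize(bin_feature, bin_edges, right=True)
    return np.array([Z[n] @ weights[bin_ids[n]] for n in range(len(Z))])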

(22)

Selected Ideas that Worked (3/5) in Val.-Set Blending:

Multi-Feature and Multi-Stage Binned Lin. Reg.

Background

Binned Linear Regression

—different blending weights for different types (bins) of examples

Idea: multi-feature BLR

rationale: “type” more sophisticated than a 1-feature bin

a special multi-level decision tree

prevent overfitting by limiting height and bin size

heuristic algorithm instead of a traditional decision tree, due to the simplicity of extending from one-feature BLR

multi-feature   1-feature   4-feature   6-feature
RMSE            22.0829     21.8605     21.8128

(23)

Selected Ideas that Worked (3/5) in Val.-Set Blending:

Multi-Feature and Multi-Stage Binned Lin. Reg.

Background

Binned Linear Regression

—different blending weights for different types (bins) of examples

Idea: multi-stage BLR

rationale: more diverse but good models before test-set blending

bins              1     2     3
weight of MF-1    ...   ...   ...
weight of RBM-1   ...   ...   ...
weight of RBM-2   ...   ...   ...
weight of BLR-1   ...   ...   ...
weight of BLR-2   ...   ...   ...

multi-stage   1-stage   2-stage   3-stage
RMSE          21.7140   21.4591   21.4287

(24)

Selected Ideas that Worked (4/5) in Test-Set Blending:

Offline Test Performance Predictor

Background

given: columns z_m = test-set prediction of model m

test-set linear regression: w(z_1, z_2, ..., z_M, λ) = (ZᵀZ + λI)⁻¹ Zᵀr

true ratings r unknown, but zᵀr can be estimated by

2 zᵀr = zᵀz + rᵀr − (z − r)ᵀ(z − r) ≈ zᵀz + N·RMSE(0)² − N·RMSE(z)²

common technique for RMSE ever since the Netflix competition

(25)

Selected Ideas that Worked (4/5) in Test-Set Blending:

Offline Test Performance Predictor

Background

2 zᵀr = zᵀz + rᵀr − (z − r)ᵀ(z − r) ≈ zᵀz + N·RMSE(0)² − N·RMSE(z)²

Idea

want: decide which z_m's and λ to use

restriction: one submission every eight hours

solution: estimate the RMSE of w offline, without submitting more than the individual z_m's

N·RMSE(w)² = (Zw − r)ᵀ(Zw − r) = wᵀZᵀZw − 2wᵀZᵀr + rᵀr

compute the contribution of models; choose 221 from 300 models & decide λ = 10⁻⁶ offline (a sketch follows below)
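
The estimate translates directly into code: with Z holding each chosen model's test predictions as columns, their leaderboard RMSEs, the RMSE of the all-zero submission, and the number of leaderboard examples N, both Zᵀr and the RMSE of any linear blend w can be estimated offline. Variable names and the ridge solve below are illustrative.

# Sketch of the offline test-set blending estimate (no extra leaderboard submissions needed).
import numpy as np

def estimate_ZTr(Z, rmse_models, rmse_zero, N):
    """2 z^T r = z^T z + N*RMSE(0)^2 - N*RMSE(z)^2, applied column by column."""
    zTz = np.sum(Z * Z, axis=0)
    return 0.5 * (zTz + N * rmse_zero ** 2 - N * np.asarray(rmse_models) ** 2)

def blend_and_estimate_rmse(Z, rmse_models, rmse_zero, N, lam=1e-6):
    ZTr = estimate_ZTr(Z, rmse_models, rmse_zero, N)
    ZTZ = Z.T @ Z
    w = np.linalg.solve(ZTZ + lam * np.eye(Z.shape[1]), ZTr)    # w = (Z^T Z + lam I)^-1 Z^T r
    # N*RMSE(w)^2 = w^T Z^T Z w - 2 w^T Z^T r + r^T r, with r^T r = N*RMSE(0)^2
    mse = (w @ ZTZ @ w - 2 * w @ ZTr + N * rmse_zero ** 2) / N
    return w, float(np.sqrt(mse))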

(26)

Selected Ideas that Worked (5/5) in Post-Processing:

Clipping for Old Four-Star Days

Background

some very different rating systems observed during data analysis:

four-star rating: {0, 30, 50, 70, 90}

five-star rating: {0, 20, 40, 60, 80, 100}

100-point scale

suspect changes in the user interface of Yahoo! Music

Idea

existing: in five-star or 100-point scale, clip prediction to [0, 100]

new: for four-star, clip prediction to [0, 90]

what dates? [3365, 5982] (7 years) or [4281, 6170] (5 years)

~0.02 RMSE improvement on most models
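
The clipping rule itself is a one-liner per prediction; the sketch below uses the first of the two candidate date windows mentioned above.

# Sketch of the post-processing clip for the suspected four-star era.
import numpy as np

def clip_predictions(pred, dates, four_star_window=(3365, 5982)):
    """pred, dates: equal-length arrays; dates use the dataset's day numbering."""
    lo, hi = four_star_window
    in_four_star = (dates >= lo) & (dates <= hi)
    upper = np.where(in_four_star, 90.0, 100.0)   # four-star era capped at 90, otherwise 100
    return np.clip(pred, 0.0, upper)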

(27)

Revisit: RMSE Performance at Each Stage of Framework

single models: 22.7915

individual models: best RMSE 22.9022 (MF)

residual models: best RMSE 22.7915 (k-NN + MF)

derivative model: best RMSE 24.1251 (but helps in later stages)

validation-set blending: 21.3598 [improvement 1.4317]

test-set blending: (estimated) 21.0253 [improvement 0.3345]

post processing: 21.0147 [improvement 0.0106]

both blending stages: key to the system

(28)

Selected Ideas that Did Not Work:

Deal with Zero-Variance Users

Background

zero-variance users (7% of all users)

—if a user gives 60, 60, 60, ... in all training ratings, how’d she rate the next item?

Occam’s razor prediction: 60

—only true for 80% of users, 20% changed their mind!

Idea

conditionally (for the 80%) post-process the predictions

difficult to distinguish the two groups, and thus it failed

(29)

Track 1 Mini-Summary

single models

individual: MF (& pPCA), pLSA

residual: RBM, k-NN

derivative: regression

—concept of diversity important

blending

validation: deeply and non-linearly, to improve model power

test: broadly and linearly, to use leaderboard feedback properly (with good estimation)

Next: Track 2 by Prof. Shou-De Lin
