(1)

Solutions and Experiences from KDD Cup 2011

Track 1: A Linear Ensemble of Individual and Blended Models for Music Rating Prediction

Po-Lung Chen, Chen-Tse Tsai, Yao-Nan Chen, Ku-Chun Chou, Chun-Liang Li, Cheng-Hao Tsai, Kuan-Wei Wu, Yu-Cheng Chou, Chung-Yi Li, Wei-Shih Lin, Shu-Hao Yu, Rong-Bing Chiu, Chieh-Yen Lin,

Chien-Chih Wang, Po-Wei Wang, Wei-Lun Su, Chen-Hung Wu, Tsung-Ting Kuo, Todd G. McKenzie, Ya-Hsuan Chang, Chun-Sung Ferng,

Chia-Mau Ni, Hsuan-Tien Lin, Chih-Jen Lin and Shou-De Lin

National Taiwan University

(2)

What is KDD Cup?

Background

an annual competition on KDD (knowledge discovery and data mining)

organized by ACM SIGKDD, starting from 1997; now the most prestigious data mining competition

usually lasts 3-4 months

participants include famous research labs (IBM, AT&T) and top universities (Stanford, Berkeley)

Aim

bridge the gap between theory and practice; define the state-of-the-art

(3)

KDD Cup 2011

Music Recommendation Systems

host: Yahoo!

11 years of Yahoo! Music data

2 tracks of competition

official dates: March 15 to June 30

1878 teams submitted to track 1; 1854 teams submitted to track 2

(4)

NTU Team for KDD Cup 2011

3 faculty members:

Profs. Chih-Jen Lin, Hsuan-Tien Lin and Shou-De Lin

1 course (similar to what was done in 2010):

Data Mining and Machine Learning: Theory and Practice

3 TAs and 19 students:

most were inexperienced in music recommendation at the beginning

official classes: April to June;

actual classes: December to June

our motto: study state-of-the-art approaches and then creatively improve them

(5)

The Track 1 Problem (1/2)

Given Data

263M examples (user u, item i, rating r_ui, date t_ui, time)

user  item  rating  date  time
1     21    10      102   23:52
1     213   90      1032  21:01
4     45    95      768   09:15

u, i: abstract IDs

r_ui: integer between 0 and 100, mostly multiples of 10

Additional Information: Item Hierarchy

track (46.85%), album (19.01%), artist (28.84%), genre (5.30%)

(6)

The Track 1 Problem (2/2)

Data Partitioned by Organizers

training: 253M; validation: 4M; test (w/o rating): 6M

per user, training < validation < test in time

≥ 20 examples total per user

4 examples in validation; 6 in test

fixed random half of test: leaderboard; another half: award decision

Goal

predictions r̂_ui ≈ r_ui on the test set, measured by RMSE = √(average (r̂_ui − r_ui)²)

note: one submission allowed every eight hours

(7)

Three Properties of Track 1 Data

R =

        track1  track2  album3  author4  ...  genreI
user1   100     80      70      ?        ...
user2   0       ?               80       ...
...
userU   ?       20                       ...  0

similar to Netflix data, but with the following differences...

scale: larger data

training: study mature models that are computationally feasible

taxonomy: relation graph of tracks, albums, authors and genres

include as features for combining models nonlinearly

time: detailed; training earlier than validation earlier than test

include as features for combining models nonlinearly; respect time-closeness during training

(8)

Framework of Our Solution

single models—computationally feasible models that are diverse:

individual models: matrix factorization (& pPCA), pLSA

residual models: R-Boltz. machine, k-NN

derivative model: regression with statistical & model-based features

validation-set blending:

combine models nonlinearly while respecting time-closeness

test-set blending:

combine models linearly while fitting the leaderboard feedback

post processing:

polish predictions using findings during data analysis

(9)

RMSE Performance at Each Stage of Framework

single models: 22.7915

individual models: best RMSE 22.9022 (MF)

residual models: best RMSE 22.7915 (k-NN + MF)

derivative model: best RMSE 24.1251 (but helps in later stages)

validation-set blending: 21.3598 [improvement 1.4317]

test-set blending: (estimated) 21.0253 [improvement 0.3345]

post processing: 21.0147 [improvement 0.0106]

both blending stages: key to the system

(10)

Single Model: Matrix Factorization (1/2)

Basic Idea

R ≈ R̂ = PᵀQ on the known examples

Pᵀ: U (number of users) by F (number of factors) user-factor matrix

Q: F (number of factors) by I (number of items) item-factor matrix

one of the most commonly-used single models

Training

learn P and Q from data:

min_{P,Q} Σ_{(u,i)∈data} (r̂_ui − r_ui)², each summand denoted E_ui(·)

s.t. r̂_ui = p_uᵀ q_i

large-scale optimization tool: stochastic gradient descent (SGD)

1. randomly pick one example (u, i)

2. P ← P − η ∇E_ui(P) (similar for Q)
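
To make the SGD procedure concrete, below is a minimal NumPy sketch of the plain factorization model above (no bias or time terms yet); the factor dimension, learning rate, regularization constant, and epoch count are illustrative placeholders, not the team's actual settings.

# A minimal sketch of MF trained with SGD; hyperparameters are illustrative only.
import random
import numpy as np

def train_mf(ratings, num_users, num_items, F=50, lr=0.002, reg=0.02, epochs=10):
    """ratings: list of (u, i, r_ui) triples with 0 <= r_ui <= 100."""
    rng = np.random.default_rng(0)
    P = 0.01 * rng.standard_normal((num_users, F))   # user factors p_u
    Q = 0.01 * rng.standard_normal((num_items, F))   # item factors q_i
    for _ in range(epochs):
        random.shuffle(ratings)                      # randomly pick one example at a time
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]                    # r_ui - p_u^T q_i
            p_u = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * p_u)    # P <- P - lr * grad of E_ui w.r.t. p_u
            Q[i] += lr * (err * p_u - reg * Q[i])    # same step for q_i
    return P, Q

def predict(P, Q, u, i):
    """r_hat_ui = p_u^T q_i, clipped to the rating scale."""
    return float(np.clip(P[u] @ Q[i], 0, 100))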

(11)

Single Model: Matrix Factorization (2/2)

Matrix Factorization Variants

min_{P,Q,...} regularization + Σ_{(u,i)∈data} (r̂_ui − r_ui)²

s.t. r̂_ui = p_uᵀ q_i + r̄ + b_u + b_i + ... + c_i (t_ui − t_i^begin)

extended terms (overall bias r̄, user bias b_u, item bias b_i, time factor c_i, etc.): enhance the power of the model

regularization: control the complexity of the model

parameter selection: tried Automatic Parameter Tuning tool

included many variants in the final solution for diversity
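
As a rough illustration of how the extended terms enter the prediction, here is a sketch of a biased prediction rule; the symbol names (r_bar, b_user, b_item, c_time, t_begin) are placeholders for whatever bias and time parameters a particular variant actually learns.

# Sketch of an MF variant's prediction with bias and time terms (names are illustrative).
def predict_extended(P, Q, r_bar, b_user, b_item, c_time, t_begin, u, i, t_ui):
    return (P[u] @ Q[i]                         # latent-factor term p_u^T q_i
            + r_bar                             # overall bias
            + b_user[u]                         # user bias
            + b_item[i]                         # item bias
            + c_time[i] * (t_ui - t_begin[i]))  # time factor relative to the item's first rating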

(12)

Selected Ideas that Worked (1/5):

Time Emphasis in Stochastic Gradient Descent

Background

SGD for minimizing the sum of per-example E_ui(P):

randomly pick one example (u, i); P ← P − η ∇E_ui(P)

Idea

last M steps of SGD: effectively consider only the last M examples picked, so the final P is biased towards those

need: P respects time-closeness to the test examples

heuristic: deterministically pick the “newer” examples as last (see the sketch below)

consistent 0.05 RMSE improvement for MF
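
One way to realize the heuristic is sketched here: run the usual shuffled SGD passes first, then finish training on the newest examples so that the last updates reflect them. The fraction treated as "newer" is an arbitrary illustrative choice, not the team's value.

# Sketch of time-emphasized SGD: ordinary shuffled passes first, newest examples last.
import random

def sgd_with_time_emphasis(examples, sgd_step, epochs=10, newest_fraction=0.1):
    """examples: list of (u, i, r, date); sgd_step: applies one SGD update for an example."""
    by_date = sorted(examples, key=lambda ex: ex[3])
    cut = int(len(by_date) * (1 - newest_fraction))
    older, newer = by_date[:cut], by_date[cut:]
    for _ in range(epochs):
        batch = older[:]
        random.shuffle(batch)                 # usual random picking for most of training
        for ex in batch:
            sgd_step(ex)
    for ex in newer:                          # deterministically pick the "newer" examples as last
        sgd_step(ex)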

(13)

Single Model:

Probabilistic Principal Component Analysis

Basic Idea

P(r | u, i) = N(p_uᵀ q_i + r̄_u, σ²)

r̄_u: user rating average

can be viewed as probabilistic MF

prediction r̂_ui: expected rating over P(r | u, i)

Training

Expectation Maximization (EM) over maximum likelihood formulation

very similar to MF in the final solution

(14)

Single Model:

Probabilistic Latent Semantic Analysis

Basic Idea

P(r | u, i) = Σ_{κ=1..k} P(r | i, z = κ) P(z = κ | u)

z: the hidden variable representing user type

can be viewed as another way of factorization

prediction r̂_ui: expected rating over P(r | u, i)

Training

basic: EM over maximum likelihood formulation

improvement: tempered EM—EM + annealing (0.468 RMSE improvement)

not strong individually, but quite different from MF solutions
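
The prediction step, the expected rating under P(r | u, i), can be written in a few lines; the array layouts below are assumptions for illustration (the learned EM parameters could be stored in any convenient form).

# Sketch of the pLSA prediction as an expected rating; array layouts are assumptions.
import numpy as np

def plsa_predict(P_r_given_iz, P_z_given_u, rating_values, u, i):
    """P_r_given_iz: (num_items, k, num_values); P_z_given_u: (num_users, k);
    rating_values: the discrete rating levels (e.g. 0, 10, ..., 100)."""
    p_r = P_r_given_iz[i].T @ P_z_given_u[u]   # P(r | u, i) = sum_z P(r | i, z) P(z | u)
    return float(rating_values @ p_r)          # expected rating r_hat_ui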

(15)

Residual Model: Restricted Boltzmann Machine

Basic Idea

[diagram: discrete hidden factors above a visible layer of per-user incomplete discrete ratings; output: predicted continuous ratings]

a recursive NNet

popularly used in Netflix competition

Training

find the fixed point of the NNet weights by contrastive divergence

not better than MF individually,

but can be used to process residuals (see below)

(16)

Selected Ideas that Worked (2/5):

Gaussian RBM as Residual Model

Background

RBM: a recursive NNet; can be used as an individual model:

[diagram: discrete hidden factors above a visible layer of per-user incomplete discrete ratings; output: predicted continuous ratings]

as individual: RMSE 24.7433, worse than MF (22.9974)

Idea

MF (a first-order model) efficiently gets better performance, but can RBM digest something different?

need: RBM that learns from the residuals of MF, r_ui − r̂_ui^MF (continuous values)

(17)

Selected Ideas that Worked (2/5):

Gaussian RBM as Residual Model

Background

[diagram: discrete hidden factors above a visible layer of per-user incomplete discrete ratings; output: predicted continuous ratings]

Idea

need: RBM that learns from the residuals of MF

choice: Gaussian RBM (gRBM)

[diagram: discrete hidden factors above a visible layer of per-user incomplete continuous residuals; output: predicted continuous residuals]

MF+gRBM: 22.8008;

better than individual MF (22.9974) or RBM (24.7433) (a residual-modeling sketch follows)
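
The residual-modeling recipe itself is generic, as the sketch below shows: train a second model on r_ui − r̂_ui^MF and add its output back at prediction time. SecondStageModel is a placeholder for the Gaussian RBM (or any other residual learner); its interface here is assumed.

# Sketch of residual modeling on top of MF; SecondStageModel is a placeholder learner.
def fit_on_residuals(ratings, mf_predict, SecondStageModel):
    """ratings: list of (u, i, r_ui); mf_predict(u, i): first-stage MF prediction."""
    residuals = [(u, i, r - mf_predict(u, i)) for u, i, r in ratings]
    model = SecondStageModel()
    model.fit(residuals)                      # learn the continuous residuals r_ui - r_hat_ui^MF
    return model

def residual_predict(mf_predict, residual_model, u, i):
    return mf_predict(u, i) + residual_model.predict(u, i)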

(18)

Residual Model: k -Nearest Neighbor

Basic Idea

r̂_ui = Σ_{j∈G_k(u,i)} w_ij · r_uj / Σ_{j∈G_k(u,i)} w_ij

G_k(u, i): item-neighbors of item i (for user u)

w_ij: correlation between items i and j

Training

efficiently compute suitable neighbor and correlation functions

like RBM, not better than MF individually,

but can be used to process residuals
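
A direct rendering of the weighted-average formula above is sketched here; how the similarity w_ij is computed and how neighbors are found efficiently are left abstract, since those are the engineering parts the slide refers to.

# Sketch of the item-based k-NN prediction: correlation-weighted average over G_k(u, i).
def knn_predict(user_ratings, similarity, i, k=20):
    """user_ratings: dict {item j: r_uj} for one user u; similarity(i, j) -> w_ij."""
    neighbors = sorted(user_ratings, key=lambda j: similarity(i, j), reverse=True)[:k]
    num = sum(similarity(i, j) * user_ratings[j] for j in neighbors)
    den = sum(similarity(i, j) for j in neighbors)
    return num / den if den > 0 else 50.0     # fall back to mid-scale when no useful neighbors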

(19)

Derivative Model: Regression

Basic Idea

r̂_ui = g(x_ui)

x_ui ∈ R^d: some features related to user u and item i

statistical: number of ratings from u, number of genres of item i, etc.

model-based: p_u in MF, w_ij in k-NN, etc.

g: a regressor from R^d to R

linear regression

NNet

gradient boosting regression tree

can be flexibly used to include “side information” like hierarchy and time
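
A sketch of the derivative model: assemble statistical and model-based features per (u, i) pair and fit a regressor g. The two statistical features and the use of scikit-learn's GradientBoostingRegressor are illustrative choices, not a description of the team's exact feature set or settings.

# Sketch of the derivative regression model; the feature set and regressor are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def build_feature(u, i, user_num_ratings, item_num_genres, P_mf):
    return np.concatenate(([user_num_ratings[u],      # statistical: # ratings from u
                            item_num_genres[i]],      # statistical: # genres of item i
                           P_mf[u]))                  # model-based: p_u from MF

def fit_derivative_model(examples, user_num_ratings, item_num_genres, P_mf):
    """examples: list of (u, i, r_ui)."""
    X = np.array([build_feature(u, i, user_num_ratings, item_num_genres, P_mf)
                  for u, i, _ in examples])
    y = np.array([r for _, _, r in examples])
    g = GradientBoostingRegressor(n_estimators=100, max_depth=3)
    g.fit(X, y)
    return g     # r_hat_ui = g.predict([build_feature(u, i, ...)])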

(20)

Glance of Single Model RMSE

model              # used   best    average   worst   contribution
MF                 81       22.90   23.92     26.94   0.3645
pPCA               2        24.46   24.61     24.75   0.0014
pLSA               7        24.83   25.53     26.09   0.0042
R-Boltz. machine   8        22.80   24.75     26.08   0.0314
k-NN               18       22.79   25.06     42.94   0.0298
regression         10       24.13   28.01     35.14   0.0261

contribution (before val.-set blending): estimated RMSE diff. via leave-the-model-out in test-set blending

MF: most important (absorbing pPCA)

residual models: both quite important

derivative model: individually weak but adds diversity

val.-set blending: 95 models, best 21.36, average 23.53, worst 31.70

(21)

Selected Ideas that Worked (3/5) in Val.-Set Blending:

Multi-Feature and Multi-Stage Binned Lin. Reg.

Background

Binned Linear Regression: a conditional aggregation model

different model strength on different “types” of examples

different blending weights for different types (bins) to utilize strength

bins              # rating ≤ 1   1 < # rating ≤ 2   others
weight of MF-1    0.4            0.7                1.0
weight of RBM-1   0.5            0.1                0.0
weight of RBM-2   0.1            0.2                0.0

a simplified regression tree with one level (on one feature)
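
A sketch of one-feature binned linear regression is given below: validation examples are split into bins by a single feature (for example the user's rating count, as in the table above), and a separate least-squares blend of the model predictions is fit inside each bin. The bin edges and plain least-squares solver are illustrative.

# Sketch of one-feature binned linear regression blending; bin edges are illustrative.
import numpy as np

def fit_binned_blend(Z, y, bin_feature, bin_edges=(1, 2)):
    """Z: (N, M) model predictions on validation; y: (N,) true ratings;
    bin_feature: (N,) value used for binning (e.g. the user's # of ratings)."""
    bin_ids = np.digitize(bin_feature, bin_edges, right=True)   # 0: <=1, 1: (1,2], 2: others
    weights = {}
    for b in np.unique(bin_ids):
        mask = bin_ids == b
        weights[b], *_ = np.linalg.lstsq(Z[mask], y[mask], rcond=None)  # per-bin blend weights
    return weights

def binned_blend_predict(Z, bin_feature, weights, bin_edges=(1, 2)):
    bin_ids = np.digitize(bin_feature, bin_edges, right=True)
    return np.array([Z[n] @ weights[bin_ids[n]] for n in range(len(Z))])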

(22)

Selected Ideas that Worked (3/5) in Val.-Set Blending:

Multi-Feature and Multi-Stage Binned Lin. Reg.

Background

Binned Linear Regression

—different blending weights for different types (bins) of examples

Idea: multi-feature BLR

rationale: “type” more sophisticated than a 1-feature bin

a special multi-level decision tree

prevent overfitting by limiting height and bin size

heuristic algorithm instead of a traditional decision tree, due to the simplicity of extending from one-feature BLR

multi-feature   1-feature   4-feature   6-feature
RMSE            22.0829     21.8605     21.8128

(23)

Selected Ideas that Worked (3/5) in Val.-Set Blending:

Multi-Feature and Multi-Stage Binned Lin. Reg.

Background

Binned Linear Regression

—different blending weights for different types (bins) of examples

Idea: multi-stage BLR

rationale: more diverse but good models before test-set blending

bins              1     2     3
weight of MF-1    ...   ...   ...
weight of RBM-1   ...   ...   ...
weight of RBM-2   ...   ...   ...
weight of BLR-1   ...   ...   ...
weight of BLR-2   ...   ...   ...

multi-stage   1-stage   2-stage   3-stage
RMSE          21.7140   21.4591   21.4287

(24)

Selected Ideas that Worked (4/5) in Test-Set Blending:

Offline Test Performance Predictor

Background

given: columns z_m = test-set prediction of model m

test-set linear regression: w(z_1, z_2, ..., z_M, λ) = (ZᵀZ + λI)⁻¹ Zᵀr

true ratings r unknown, but zᵀr can be estimated by

2 zᵀr = zᵀz + rᵀr − (z − r)ᵀ(z − r) ≈ zᵀz + N·RMSE(0)² − N·RMSE(z)²

common technique for RMSE ever since the Netflix competition

(25)

Selected Ideas that Worked (4/5) in Test-Set Blending:

Offline Test Performance Predictor

Background

2 zᵀr = zᵀz + rᵀr − (z − r)ᵀ(z − r) ≈ zᵀz + N·RMSE(0)² − N·RMSE(z)²

Idea

want: decide which z_m's and λ to use

restriction: one submission every eight hours

solution: estimate the RMSE of w offline, without submitting more than the individual z_m's

N·RMSE(w)² = (Zw − r)ᵀ(Zw − r) = wᵀZᵀZw − 2wᵀZᵀr + rᵀr

compute the contribution of models; choose 221 from 300 models & decide λ = 10⁻⁶ offline (a sketch follows below)
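
The estimate translates directly into code: with Z holding each chosen model's test predictions as columns, their leaderboard RMSEs, the RMSE of the all-zero submission, and the number of leaderboard examples N, both Zᵀr and the RMSE of any linear blend w can be estimated offline. Variable names and the ridge solve below are illustrative.

# Sketch of the offline test-set blending estimate (no extra leaderboard submissions needed).
import numpy as np

def estimate_ZTr(Z, rmse_models, rmse_zero, N):
    """2 z^T r = z^T z + N*RMSE(0)^2 - N*RMSE(z)^2, applied column by column."""
    zTz = np.sum(Z * Z, axis=0)
    return 0.5 * (zTz + N * rmse_zero ** 2 - N * np.asarray(rmse_models) ** 2)

def blend_and_estimate_rmse(Z, rmse_models, rmse_zero, N, lam=1e-6):
    ZTr = estimate_ZTr(Z, rmse_models, rmse_zero, N)
    ZTZ = Z.T @ Z
    w = np.linalg.solve(ZTZ + lam * np.eye(Z.shape[1]), ZTr)    # w = (Z^T Z + lam I)^-1 Z^T r
    # N*RMSE(w)^2 = w^T Z^T Z w - 2 w^T Z^T r + r^T r, with r^T r = N*RMSE(0)^2
    mse = (w @ ZTZ @ w - 2 * w @ ZTr + N * rmse_zero ** 2) / N
    return w, float(np.sqrt(mse))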

(26)

Selected Ideas that Worked (5/5) in Post-Processing:

Clipping for Old Four-Star Days

Background

some very different rating systems observed during data analysis:

four-star rating: {0, 30, 50, 70, 90}

five-star rating: {0, 20, 40, 60, 80, 100}

100-point scale

suspect changes in the user interface of Yahoo! Music

Idea

existing: in five-star or 100-point scale, clip prediction to [0, 100]

new: for four-star, clip prediction to [0, 90]

what dates? [3365, 5982] (7 years) or [4281, 6170] (5 years)

~0.02 RMSE improvement on most models
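
The clipping rule itself is a one-liner per prediction; the sketch below uses the first of the two candidate date windows mentioned above.

# Sketch of the post-processing clip for the suspected four-star era.
import numpy as np

def clip_predictions(pred, dates, four_star_window=(3365, 5982)):
    """pred, dates: equal-length arrays; dates use the dataset's day numbering."""
    lo, hi = four_star_window
    in_four_star = (dates >= lo) & (dates <= hi)
    upper = np.where(in_four_star, 90.0, 100.0)   # four-star era capped at 90, otherwise 100
    return np.clip(pred, 0.0, upper)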

(27)

Revisit: RMSE Performance at Each Stage of Framework

single models: 22.7915

individual models: best RMSE 22.9022 (MF)

residual models: best RMSE 22.7915 (k-NN + MF)

derivative model: best RMSE 24.1251 (but helps in later stages)

validation-set blending: 21.3598 [improvement 1.4317]

test-set blending: (estimated) 21.0253 [improvement 0.3345]

post processing: 21.0147 [improvement 0.0106]

both blending stages: key to the system

(28)

Selected Ideas that Did Not Work:

Deal with Zero-Variance Users

Background

zero-variance users (7% of all users)

—if a user gives 60, 60, 60, ... in all training ratings, how’d she rate the next item?

Occam’s razor prediction: 60

—only true for 80% of users, 20% changed their mind!

Idea

conditionally (for the 80%) post-process the predictions

difficult to distinguish the two groups, and thus it failed

(29)

Track 1 Mini-Summary

single models

individual: MF (& pPCA), pLSA

residual: RBM, k-NN

derivative: regression

—concept of diversity important

blending

validation: deeply and non-linearly, to improve model power

test: broadly and linearly, to use leaderboard feedback properly (with good estimation)

Next: Track 2 by Prof. Shou-De Lin
