Machine Learning Techniques (機器學習技巧)
Lecture 14: Miscellaneous Models
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/23
Miscellaneous Models
Agenda
Lecture 14: Miscellaneous Models
• Matrix Factorization
• Gradient Boosted Decision Tree
• Naive Bayes
• Bayesian Learning
Miscellaneous Models Matrix Factorization
Recommender System Revisited
data → ML → skill
• data: how 'many users' have rated 'some movies'
• skill: predict how a user would rate an unrated movie
A Hot Problem
• competition held by Netflix in 2006
• 100,480,507 ratings that 480,189 users gave to 17,770 movies
• 10% improvement = 1 million dollar prize
• data D_j for the j-th movie: {(x_n = (i), y_n = r_ij)}_{n=1}^{N_j}
  — abstract features x_n = (i)
how to learn our preferences from all D_j?
Linear Model for Recommender System
consider one linear model for each D_j = {(x_n = (i), y_n = r_ij)}_{n=1}^{N_j}, with a shared transform Φ:

  y ≈ h_j(x) = w_j^T Φ(x) for the j-th movie

• Φ(i): named v_i, to be learned from data, like NNet/RBF Net
• then, r_ij = y_n ≈ w_j^T v_i
• overall E_in with squared error:

  E_in({w_j}, {v_i}) = (Σ_j N_j E_in^(j)(w_j, {v_i})) / (Σ_j N_j)
                     = (1/N) Σ_{known (i,j)} (r_ij − w_j^T v_i)²

how to minimize? SGD by sampling known (i, j)
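The slide's recipe (sample one known (i, j), then nudge w_j and v_i along the negative gradient of the squared error) can be sketched in NumPy. The toy ratings, factor dimension, step size, and iteration count below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy known ratings: (user i, movie j, rating r_ij); values are made up
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
I, J, d = 3, 3, 2                        # users, movies, latent factors

V = rng.normal(scale=0.1, size=(I, d))   # v_i: viewer factors
W = rng.normal(scale=0.1, size=(J, d))   # w_j: movie factors
eta = 0.05                               # learning rate

for step in range(5000):
    i, j, r = ratings[rng.integers(len(ratings))]  # sample one known (i, j)
    err = r - V[i] @ W[j]                          # residual r_ij - v_i^T w_j
    # SGD step on (r_ij - v_i^T w_j)^2 with respect to v_i and w_j
    V[i], W[j] = V[i] + eta * err * W[j], W[j] + eta * err * V[i]

for i, j, r in ratings:
    print(f"r_{i}{j}: true {r}, predicted {V[i] @ W[j]:.2f}")
```

Because the update is bilinear, both factor vectors must be updated from their old values, hence the simultaneous tuple assignment.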
Matrix Factorization
  r_ij ≈ w_j^T v_i = v_i^T w_j

  R        movie 1   movie 2   ···   movie J
  user 1     100        ?      ···      −
  user 2      −        70      ···      −
  ···        ···       ···     ···     ···
  user I      ?         −      ···      0

  R ≈ V W, where V has rows v_1^T, v_2^T, ···, v_I^T and W has columns w_1, w_2, ···, w_J

[figure: "Match movie and viewer factors" — movie factors (comedy content, action content, blockbuster?, Tom Cruise in it?) matched against viewer factors (likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?); the predicted rating adds contributions from each factor]

Matrix Factorization Model
learning: known ratings → learned factors w_j and v_i → unknown rating predictions
similar modeling can be used for abstract features
Fun Time
Miscellaneous Models Gradient Boosted Decision Tree
Coordinate Descent for Linear Blending
Consider a linear blending problem: for G = {g_ℓ},

  min_β (1/N) Σ_{n=1}^N exp(−y_n Σ_{ℓ=1}^L β_ℓ g_ℓ(x_n))

• why exponential error exp(−y G(x)): a convex upper bound on err_0/1, used as the surrogate error
• how to minimize? — GD, SGD, ... if few {g_ℓ}
• what if lots of, or infinitely many, g_ℓ?
  — pick one good g_i, and update its β_i only

coordinate descent: in each iteration
• pick a good coordinate i (the best one for the next step)
• minimize by setting β_i^new ← β_i^old + Δ
Coordinate Descent View of AdaBoost
Consider the same linear blending problem: for G = {g_ℓ},

  min_β (1/N) Σ_{n=1}^N exp(−y_n Σ_{ℓ=1}^L β_ℓ g_ℓ(x_n))

coordinate descent: in each iteration
• pick a good coordinate i (the best one for the next step)
• minimize by setting β_i^new ← β_i^old + Δ

AdaBoost: in each iteration
• pick a good hypothesis g_t
• set α_t^new ← 0 + (1/2) ln((1 − ε_t)/ε_t)

after some derivations (ML2012Fall HW7.5):
AdaBoost = coordinate descent + exponential error
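The equivalence can be checked numerically. In this sketch (toy data and stump pool are made up), each iteration picks the coordinate whose weighted error ε_i is farthest from 1/2 and applies the exact line minimizer Δ = (1/2) ln((1 − ε_i)/ε_i), which is exactly the AdaBoost step:

```python
import numpy as np

# toy 1-D data and a small pool G = {g_l} of decision stumps g_l(x) = sign(x - theta)
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([-1, -1, 1, -1, 1, 1])
G = np.array([np.where(X > th, 1, -1) for th in (-1.5, -0.75, 0.0, 0.75, 1.5)])

beta = np.zeros(len(G))
exp_err = lambda b: np.mean(np.exp(-y * (b @ G)))   # the exponential error

losses = [exp_err(beta)]
for t in range(10):
    u = np.exp(-y * (beta @ G))                           # AdaBoost-style example weights
    eps = np.array([u[g != y].sum() / u.sum() for g in G])  # weighted error of each g_l
    i = np.argmax(np.abs(eps - 0.5))                      # best coordinate for the next step
    beta[i] += 0.5 * np.log((1 - eps[i]) / eps[i])        # exact minimizer along coordinate i
    losses.append(exp_err(beta))

print([round(l, 3) for l in losses])  # exponential error never increases
```

Unlike vanilla AdaBoost, coordinate descent may revisit a coordinate and further adjust its β_i; each closed-form step still exactly minimizes the loss along that coordinate.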
Gradient Boosted Decision Tree
Consider another linear blending problem:

  min_β (1/N) Σ_{n=1}^N (y_n − Σ_{ℓ=1}^L β_ℓ g_ℓ(x_n))²

• best coordinate at the t-th iteration (under assumptions):

  min_{g_ℓ} (1/N) Σ_{n=1}^N ((y_n − G_{t−1}(x_n)) − g_ℓ(x_n))²

  — best hypothesis on {(x_n, residual_n)}
• best β_ℓ^new: one-dimensional linear regression

gradient boosted decision tree (GBDT): the above + find the best g_ℓ by decision tree
(a 'regression' extension of AdaBoost)
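A minimal sketch of the GBDT loop, with hand-rolled one-split regression stumps standing in for full decision trees (the data, number of rounds, and stump learner are illustrative choices, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, size=80))
y = np.sin(x) + 0.1 * rng.normal(size=80)

def fit_stump(x, r):
    """Least-squares one-split regression stump on residuals r."""
    best_sse, best = np.inf, None
    for s in x[:-1]:                     # candidate split points
        left, right = r[x <= s], r[x > s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (s, left.mean(), right.mean())
    return best

predict = lambda stump, x: np.where(x <= stump[0], stump[1], stump[2])

G = np.zeros_like(y)                          # current ensemble G_{t-1}(x_n)
for t in range(50):
    residual = y - G
    g = fit_stump(x, residual)                # best g_l on {(x_n, residual_n)}
    pred = predict(g, x)
    beta = (pred @ residual) / (pred @ pred)  # one-dimensional linear regression
    G += beta * pred

print(np.mean((y - G) ** 2))                  # training error after 50 rounds
```

Note that because a least-squares stump already matches the residual's scale, β comes out as 1 here; the one-dimensional regression step matters when g_ℓ is not itself least-squares-fitted to the residuals.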
Fun Time
Miscellaneous Models Naive Bayes
Naive Bayes Model
want: getting P(y|x) (e.g. logistic regression) for classification
• Bayes rule: P(y|x) ∝ P(x|y) P(y)
• estimating P(y): frequency of y_n = y in D (easy!)
• joint distribution P(x|y): easier if P(x|y) = P(x_1|y) P(x_2|y) ··· P(x_d|y)
  — conditional independence
• marginal distributions P(x_i|y): piece-wise discrete, Gaussian, etc.

Naive Bayes model: h(x) = P(x_1|y) P(x_2|y) ··· P(x_d|y) P(y)
with your choice of distribution families
More about Naive Bayes
find g(x) = P(x_1|y) ··· P(x_d|y) P(y) by 'good estimate' of all RHS terms

for binary classification:

  g(x) = sign( [P(x_1|+1) ··· P(x_d|+1) P(+1)] / [P(x_1|−1) ··· P(x_d|−1) P(−1)] − 1 )
       = sign( (P(+1)/P(−1)) · Π_{i=1}^d P(x_i|+1)/P(x_i|−1) − 1 )
       = sign( log(P(+1)/P(−1)) + Σ_{i=1}^d log(P(x_i|+1)/P(x_i|−1)) )    (since sign(A − 1) = sign(log A))
       = sign( w_0 + Σ_{i=1}^d 1 · φ_i(x) )

with w_0 = log(P(+1)/P(−1)) and φ_i(x) = log(P(x_i|+1)/P(x_i|−1))
— also a naive linear model with 'heuristic/learned' transform and bias

a simple (heuristic) model, usually super fast
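With Gaussian marginals P(x_i|y), 'learning' is just class frequencies plus per-dimension means and standard deviations. A sketch on made-up 2-D data (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(2)
# toy 2-D data: class +1 centered at (2, 2), class -1 at (-2, -2)
Xp = rng.normal(loc=2.0, size=(50, 2))
Xn = rng.normal(loc=-2.0, size=(50, 2))
X = np.vstack([Xp, Xn])
y = np.array([1] * 50 + [-1] * 50)

# estimate P(y) by frequency and each P(x_i|y) by a Gaussian fit
stats = {}
for c in (+1, -1):
    Xc = X[y == c]
    stats[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.std(axis=0))

def log_score(x, c):
    prior, mu, sigma = stats[c]
    # log P(y) + sum_i log N(x_i; mu_i, sigma_i), dropping the constant
    # 0.5*log(2*pi) per dimension since it cancels between classes
    return np.log(prior) - np.sum(np.log(sigma) + 0.5 * ((x - mu) / sigma) ** 2)

def h(x):
    return 1 if log_score(x, +1) > log_score(x, -1) else -1

print(h(np.array([1.5, 2.5])), h(np.array([-3.0, -1.0])))   # → 1 -1
```

Everything is computed in a single pass over the data, which is where the 'usually super fast' remark comes from.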
ICDM 2006 Top 10 Data Mining Algorithms
1. C4.5: decision tree
2. K-means: clustering, taught with RBF Network
3. SVM: large-margin/kernel
4. Apriori: for frequent itemset mining
5. EM: the 'gradient descent' in Bayesian learning
6. PageRank: for link analysis, similar to matrix factorization
7. AdaBoost: aggregation
8. k-NN: taught very shortly within RBF Network
9. Naive Bayes: linear model with heuristic transform
10. CART: decision tree

personal view of four missing ML competitors: LinReg, LogReg, Random Forest, NNet
Fun Time
Miscellaneous Models Bayesian Learning
Disclaimer
Part of the following lecture borrows
Prof. Yaser S. Abu-Mostafa’s slides with permission.
The prior
P(h = f | D) requires an additional probability distribution:

  P(h = f | D) = P(D | h = f) P(h = f) / P(D) ∝ P(D | h = f) P(h = f)

P(h = f) is the prior
P(h = f | D) is the posterior
Given the prior, we have the full distribution
Learning From Data - Lecture 18
Example of a prior
Consider a perceptron: h is determined by w = w_0, w_1, ···, w_d
A possible prior on w: each w_i is independent, uniform over [−1, 1]
This determines the prior over h: P(h = f)
Given D, we can compute P(D | h = f)
Putting them together, we get P(h = f | D) ∝ P(h = f) P(D | h = f)
A prior is an assumption
Even the most neutral prior:

  [figure: "x is unknown", x somewhere in [−1, 1], versus "x is random" with P(x) uniform over [−1, 1]]

The true equivalent would be:

  [figure: "x is unknown", x somewhere in [−1, 1], versus "x is random" with P(x) = δ(x − a) for some unknown a in [−1, 1]]
If we knew the prior . . .
we could compute P(h = f | D) for every h ∈ H
⇒ we can find the most probable h given the data
  we can derive E(h(x)) for every x
  we can derive the error bar for every x
  we can derive everything in a principled way
One Instance of Using Posterior
• logistic regression: know how to calculate the likelihood P(D | w = w_f)
• define a Gaussian prior P(w = w_f) = N(0, σ²I)
• posterior ∝ Gaussian · (logistic likelihood)
• maximize posterior = maximize [log Gaussian + log logistic likelihood]
  = regularized logistic regression

regularized logistic regression
= min augmented error (with iid assumption + effective d_VC heuristic + surrogate error)
= max prior · likelihood (with iid assumption + prior/likelihood assumptions)
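The last equivalence can be written out. With the logistic likelihood θ(s) = 1/(1 + e^{−s}) and the Gaussian prior above (a standard derivation; the regularization weight plays the role of 1/(2σ²)):

```latex
\begin{aligned}
\max_{\mathbf{w}}\ \underbrace{P(\mathbf{w})}_{\text{prior}}\,
  \underbrace{\textstyle\prod_{n=1}^{N}\theta\!\left(y_n\mathbf{w}^{T}\mathbf{x}_n\right)}_{\text{likelihood}}
&= \max_{\mathbf{w}}\ \log P(\mathbf{w})+\sum_{n=1}^{N}\log\theta\!\left(y_n\mathbf{w}^{T}\mathbf{x}_n\right)\\
&= \min_{\mathbf{w}}\ \frac{1}{2\sigma^{2}}\,\|\mathbf{w}\|^{2}
   +\sum_{n=1}^{N}\ln\!\left(1+\exp\!\left(-y_n\mathbf{w}^{T}\mathbf{x}_n\right)\right),
\end{aligned}
```

which is regularized logistic regression with regularization strength proportional to 1/σ²: a wider prior (large σ) means weaker regularization.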
When is Bayesian learning justified?
1. The prior is valid
   — trumps all other methods
2. The prior is irrelevant
   — just a computational catalyst
My Biased View
in reality: