(1)

Machine Learning Techniques (機器學習技巧)

Lecture 14: Miscellaneous Models

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science

& Information Engineering

National Taiwan University

(國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/23

(2)

Miscellaneous Models

Agenda

Lecture 14: Miscellaneous Models
• Matrix Factorization
• Gradient Boosted Decision Tree
• Naive Bayes
• Bayesian Learning

(3)

Miscellaneous Models Matrix Factorization

Recommender System Revisited

data → ML → skill

• data: how ‘many users’ have rated ‘some movies’

• skill: predict how a user would rate an unrated movie

A Hot Problem

competition held by Netflix in 2006

• 100,480,507 ratings that 480,189 users gave to 17,770 movies

• 10% improvement = 1 million dollar prize

data $\mathcal{D}_j$ for the $j$-th movie: $\{(\mathbf{x}_n = (i),\ y_n = r_{ij})\}_{n=1}^{N_j}$ — abstract feature $\mathbf{x}_n = (i)$

how to learn our preferences from all $\mathcal{D}_j$?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 2/23

(4)

Miscellaneous Models Matrix Factorization

Linear Model for Recommender System

consider one linear model for each $\mathcal{D}_j = \{(\mathbf{x}_n = (i),\ y_n = r_{ij})\}_{n=1}^{N_j}$, with a shared transform $\Phi$:

$$y \approx h_j(\mathbf{x}) = \mathbf{w}_j^T \Phi(\mathbf{x}) \quad \text{for the } j\text{-th movie}$$

• $\Phi(i)$: named $\mathbf{v}_i$, to be learned from data, like NNet/RBF Net

then $r_{ij} = y_n \approx \mathbf{w}_j^T \mathbf{v}_i$

overall $E_{\text{in}}$ with squared error:

$$E_{\text{in}}(\{\mathbf{w}_j\}, \{\mathbf{v}_i\}) = \frac{\sum_j N_j\, E_{\text{in}}^{(j)}(\mathbf{w}_j, \{\mathbf{v}_i\})}{\sum_j N_j} = \frac{1}{N} \sum_{\text{known } (i,j)} \left(r_{ij} - \mathbf{w}_j^T \mathbf{v}_i\right)^2$$

how to minimize? SGD by sampling known $(i, j)$
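To make the SGD recipe concrete, here is a minimal NumPy sketch of the per-rating update (the learning rate, number of epochs, factor dimension, and the toy ratings in the usage lines are illustrative assumptions, not values from the lecture):

```python
import numpy as np

def mf_sgd(ratings, num_users, num_movies, d=10, eta=0.1, epochs=200, seed=0):
    """SGD for matrix factorization on squared error.

    ratings: list of (i, j, r_ij) triples for the known entries.
    Returns user factors V (num_users x d) and movie factors W (num_movies x d).
    """
    rng = np.random.default_rng(seed)
    V = 0.1 * rng.standard_normal((num_users, d))    # v_i: one row per user
    W = 0.1 * rng.standard_normal((num_movies, d))   # w_j: one row per movie
    for _ in range(epochs):
        for k in rng.permutation(len(ratings)):      # sample known (i, j) in random order
            i, j, r = ratings[k]
            err = r - W[j] @ V[i]                    # residual of prediction w_j^T v_i
            grad_v = err * W[j]                      # (negative) gradient direction for v_i
            grad_w = err * V[i]                      # (negative) gradient direction for w_j
            V[i] += eta * grad_v
            W[j] += eta * grad_w
    return V, W

# toy usage with three known ratings (user i, movie j, rating r_ij), 1-5 scale assumed
V, W = mf_sgd([(0, 0, 5.0), (1, 1, 3.5), (2, 2, 1.0)], num_users=3, num_movies=3)
print(V[0] @ W[0])   # roughly reconstructs the known rating 5.0 after training
print(V[0] @ W[1])   # prediction for an unrated (user 0, movie 1) pair
```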

(5)

Miscellaneous Models Matrix Factorization

Matrix Factorization

$$r_{ij} \approx \mathbf{w}_j^T \mathbf{v}_i = \mathbf{v}_i^T \mathbf{w}_j$$

ratings matrix $R$ (known entries shown; ?/− for unrated):

         movie 1   movie 2   · · ·   movie J
user 1     100        ?      · · ·     −
user 2      −        70      · · ·     −
· · ·     · · ·     · · ·    · · ·   · · ·
user I      ?         −      · · ·     0

factor matrices: $V$ with rows $\mathbf{v}_1^T, \mathbf{v}_2^T, \ldots, \mathbf{v}_I^T$ (viewer factors) and $W$ with columns $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_J$ (movie factors), so $R \approx VW$

[Figure: matching movie factors (comedy content, action content, blockbuster?, Tom Cruise in it?) against viewer factors (likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?); the predicted rating adds the contributions from each factor]

Matrix Factorization Model

learning: known ratings → learned factors $\mathbf{w}_j$ and $\mathbf{v}_i$ → prediction of unknown ratings

similar modeling can be used for abstract features

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 4/23

(6)

Miscellaneous Models Matrix Factorization

Fun Time

(7)

Miscellaneous Models Gradient Boosted Decision Tree

Coordinate Descent for Linear Blending

Consider a linear blending problem: for $\mathcal{G} = \{g_\ell\}$,

$$\min_{\boldsymbol{\beta}}\ \frac{1}{N} \sum_{n=1}^{N} \exp\left(-y_n \sum_{\ell=1}^{L} \beta_\ell\, g_\ell(\mathbf{x}_n)\right)$$

why exponential error $\exp(-y\, G(\mathbf{x}))$: a convex upper bound on $\text{err}_{0/1}$, used as the surrogate $\widehat{\text{err}}$

how to minimize? — GD, SGD, . . . if few $\{g_\ell\}$

what if lots of, or infinitely many, $g_\ell$? — pick one good $g_i$, and update its $\beta_i$ only

coordinate descent: in each iteration
• pick a good coordinate $i$ (the best one for the next step)
• minimize by setting $\beta_i^{\text{new}} \leftarrow \beta_i^{\text{old}} + \Delta$

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 6/23
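A minimal sketch of this coordinate descent loop on the exponential error, assuming a fixed finite pool of ±1 base hypotheses given by their predictions on the training set; the pool, the coordinate-picking heuristic, and the iteration count are illustrative assumptions:

```python
import numpy as np

def coordinate_descent_blending(preds, y, iters=50):
    """Coordinate descent on the exponential error of a linear blend.

    preds: (L, N) array, row l holds g_l(x_n) in {-1, +1} for all n.
    y:     (N,) array of labels in {-1, +1}.
    Returns the blending coefficients beta (length L).
    """
    L, N = preds.shape
    beta = np.zeros(L)
    for _ in range(iters):
        margins = y * (beta @ preds)            # y_n * sum_l beta_l g_l(x_n)
        u = np.exp(-margins)                    # current exponential-error weights
        # pick a coordinate with large weighted agreement
        # (a simple heuristic for the 'good coordinate')
        weighted_corr = preds @ (u * y)
        i = int(np.argmax(np.abs(weighted_corr)))
        # closed-form best step for a +/-1 hypothesis:
        # delta = 1/2 ln( sum_{correct} u_n / sum_{wrong} u_n )
        correct = (preds[i] == y)
        delta = 0.5 * np.log(u[correct].sum() / max(u[~correct].sum(), 1e-12))
        beta[i] += delta
    return beta
```

With the closed-form step shown in the comment, each iteration is the AdaBoost-style update that the next slide identifies.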

(8)

Miscellaneous Models Gradient Boosted Decision Tree

Coordinate Descent View of AdaBoost

Consider a linear blending problem: for $\mathcal{G} = \{g_\ell\}$,

$$\min_{\boldsymbol{\beta}}\ \frac{1}{N} \sum_{n=1}^{N} \exp\left(-y_n \sum_{\ell=1}^{L} \beta_\ell\, g_\ell(\mathbf{x}_n)\right)$$

coordinate descent: in each iteration
• pick a good coordinate $i$ (the best one for the next step)
• minimize by setting $\beta_i^{\text{new}} \leftarrow \beta_i^{\text{old}} + \Delta$

AdaBoost: in each iteration
• pick a good hypothesis $g_t$
• set $\alpha_t^{\text{new}} \leftarrow 0 + \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$

after some derivations (ML2012Fall HW7.5):

AdaBoost = coordinate descent + exponential error
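As a worked detail behind the $\alpha_t$ formula, here is a short LaTeX sketch of the one-variable minimization, with $u_n$ denoting the current exponential-error weights and $\epsilon_t$ the weighted error of $g_t$ (the notation is mine, chosen to match the slide):

```latex
% minimizing the exponential error over the step size \alpha for a +/-1 hypothesis g_t
\min_{\alpha}\ \sum_{n} u_n \exp\!\big(-y_n \alpha\, g_t(\mathbf{x}_n)\big)
  = \min_{\alpha}\ \Big[(1-\epsilon_t)\,U e^{-\alpha} + \epsilon_t\,U e^{+\alpha}\Big],
  \qquad U = \sum_n u_n
% setting the derivative to zero:
-(1-\epsilon_t)\,e^{-\alpha} + \epsilon_t\,e^{+\alpha} = 0
  \;\Longrightarrow\;
  \alpha_t = \tfrac{1}{2}\ln\tfrac{1-\epsilon_t}{\epsilon_t}
```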

(9)

Miscellaneous Models Gradient Boosted Decision Tree

Gradient Boosted Decision Tree

Consider another linear blending problem:

$$\min_{\boldsymbol{\beta}}\ \frac{1}{N} \sum_{n=1}^{N} \left(y_n - \sum_{\ell=1}^{L} \beta_\ell\, g_\ell(\mathbf{x}_n)\right)^2$$

• best coordinate at the $t$-th iteration (under assumptions):

$$\min_{g_\ell}\ \frac{1}{N} \sum_{n=1}^{N} \left(y_n - G_{t-1}(\mathbf{x}_n) - g_\ell(\mathbf{x}_n)\right)^2$$

— the best hypothesis on $\{(\mathbf{x}_n, \text{residual}_n)\}$

• best $\beta_\ell^{\text{new}}$: one-dimensional linear regression

gradient boosted decision tree (GBDT): the above + find the best $g_\ell$ by a decision tree

(a 'regression' extension of AdaBoost)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 8/23
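A minimal sketch of this residual-fitting loop, using scikit-learn's DecisionTreeRegressor as the base learner; the tree depth, number of rounds, and the plain least-squares $\beta$ are assumptions for illustration, and production GBDT libraries add shrinkage, subsampling, and other refinements:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, rounds=50, max_depth=3):
    """Fit a simple gradient-boosted regression-tree ensemble on squared error."""
    trees, betas = [], []
    G = np.zeros(len(y))                       # G_{t-1}(x_n), starts at 0
    for _ in range(rounds):
        residual = y - G                       # best g fits {(x_n, residual_n)}
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        g = tree.predict(X)
        # best beta: one-dimensional linear regression of residual on g
        beta = (g @ residual) / max(g @ g, 1e-12)
        G += beta * g
        trees.append(tree)
        betas.append(beta)
    return trees, betas

def gbdt_predict(trees, betas, X):
    return sum(b * t.predict(X) for t, b in zip(trees, betas))
```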

(10)

Miscellaneous Models Gradient Boosted Decision Tree

Fun Time

(11)

Miscellaneous Models Naive Bayes

Naive Bayes Model

want: getting $P(y \mid \mathbf{x})$ (e.g. logistic regression) for classification

Bayes rule: $P(y \mid \mathbf{x}) \propto P(\mathbf{x} \mid y)\, P(y)$

• estimating $P(y)$: frequency of $y_n = y$ in $\mathcal{D}$ (easy!)

• joint distribution $P(\mathbf{x} \mid y)$: easier if $P(\mathbf{x} \mid y) = P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_d \mid y)$ — conditional independence

• marginal distribution $P(x_i \mid y)$: piece-wise discrete, Gaussian, etc.

Naive Bayes model:
$$h(\mathbf{x}) = P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_d \mid y)\, P(y)$$
with your choice of distribution families

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/23
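As one concrete choice of distribution family, a minimal Gaussian naive Bayes sketch: class priors from frequencies, and a per-feature Gaussian $P(x_i \mid y)$ with per-class mean and variance estimated from the data (the small variance-smoothing constant is an added assumption for numerical safety):

```python
import numpy as np

def gnb_fit(X, y):
    """Estimate P(y) and per-feature Gaussian P(x_i | y) from data."""
    classes = np.unique(y)
    prior = {c: np.mean(y == c) for c in classes}               # frequency of y_n = y
    mean  = {c: X[y == c].mean(axis=0) for c in classes}
    var   = {c: X[y == c].var(axis=0) + 1e-9 for c in classes}  # small smoothing
    return classes, prior, mean, var

def gnb_predict(model, X):
    classes, prior, mean, var = model
    scores = []
    for c in classes:
        # log P(y) + sum_i log P(x_i | y), using conditional independence
        log_lik = -0.5 * (np.log(2 * np.pi * var[c]) + (X - mean[c]) ** 2 / var[c])
        scores.append(np.log(prior[c]) + log_lik.sum(axis=1))
    return classes[np.argmax(np.vstack(scores), axis=0)]
```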

(12)

Miscellaneous Models Naive Bayes

More about Naive Bayes

find $g(\mathbf{x}) = P(x_1 \mid y) \cdots P(x_d \mid y)\, P(y)$ by a 'good estimate' of all RHS terms

for binary classification:

$$\text{sign}\left(\frac{P(x_1 \mid +1)\, P(x_2 \mid +1) \cdots P(x_d \mid +1)\, P(+1)}{P(x_1 \mid -1)\, P(x_2 \mid -1) \cdots P(x_d \mid -1)\, P(-1)} - 1\right)
= \text{sign}\left(\frac{P(+1)}{P(-1)} \prod_{i=1}^{d} \frac{P(x_i \mid +1)}{P(x_i \mid -1)} - 1\right)$$

$$= \text{sign}\Bigg(\underbrace{\log\frac{P(+1)}{P(-1)}}_{w_0} + \sum_{i=1}^{d} \underbrace{\log\frac{P(x_i \mid +1)}{P(x_i \mid -1)}}_{\phi_i(\mathbf{x})}\Bigg)
= \text{sign}\left(w_0 + \sum_{i=1}^{d} \phi_i(\mathbf{x})\right)$$

— also a naive linear model with a 'heuristic/learned' transform and bias

a simple (heuristic) model, usually super fast
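To illustrate the 'linear model with heuristic transform' reading, a small sketch that scores a query as $w_0 + \sum_i \phi_i(\mathbf{x})$ for discrete features (binary ±1 labels assumed; the Laplace smoothing count `alpha` is an added assumption to avoid zero probabilities):

```python
import numpy as np

def nb_linear_view(X, y, x_query, alpha=1.0):
    """Classify a query as sign(w_0 + sum_i phi_i(x)) with log-ratio transforms.

    X: (N, d) array of discrete feature values, y: (N,) labels in {-1, +1}.
    """
    X, y = np.asarray(X), np.asarray(y)
    pos, neg = (y == +1), (y == -1)
    w0 = np.log(pos.mean() / neg.mean())                 # log P(+1)/P(-1)
    score = w0
    for i, v in enumerate(x_query):
        k = len(np.unique(X[:, i]))                      # number of values feature i takes
        p_pos = (np.sum(X[pos, i] == v) + alpha) / (pos.sum() + k * alpha)
        p_neg = (np.sum(X[neg, i] == v) + alpha) / (neg.sum() + k * alpha)
        score += np.log(p_pos / p_neg)                   # phi_i(x)
    return np.sign(score)
```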

(13)

Miscellaneous Models Naive Bayes

ICDM 2006 Top 10 Data Mining Algorithms

1. C4.5: decision tree
2. K-means: clustering, taught with RBF Network
3. SVM: large-margin/kernel
4. Apriori: for frequent itemset mining
5. EM: the 'gradient descent' in Bayesian learning
6. PageRank: for link analysis, similar to matrix factorization
7. AdaBoost: aggregation
8. k-NN: taught very shortly within RBF Network
9. Naive Bayes: linear model with heuristic transform
10. CART: decision tree

personal view of four missing ML competitors: LinReg, LogReg, Random Forest, NNet

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 12/23

(14)

Miscellaneous Models Naive Bayes

Fun Time

(15)

Miscellaneous Models Bayesian Learning

Disclaimer

Part of the following lecture borrows

Prof. Yaser S. Abu-Mostafa’s slides with permission.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 14/23

(16)

Miscellaneous Models Bayesian Learning

The prior

$P(h = f \mid \mathcal{D})$ requires an additional probability distribution:

$$P(h = f \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid h = f)\, P(h = f)}{P(\mathcal{D})} \propto P(\mathcal{D} \mid h = f)\, P(h = f)$$

$P(h = f)$ is the prior

$P(h = f \mid \mathcal{D})$ is the posterior

Given the prior, we have the full distribution

Learning From Data - Lecture 18 7/23
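A tiny numerical illustration of this rule over a finite hypothesis set; the three candidate hypotheses, their prior, and their likelihoods on $\mathcal{D}$ are all made up for the example:

```python
import numpy as np

# made-up prior P(h = f) over three candidate hypotheses
prior      = np.array([0.5, 0.3, 0.2])
# made-up likelihoods P(D | h = f) of the observed data under each candidate
likelihood = np.array([0.01, 0.20, 0.05])

unnormalized = likelihood * prior              # P(D | h = f) P(h = f)
posterior = unnormalized / unnormalized.sum()  # divide by P(D) to normalize
print(posterior)  # posterior P(h = f | D); here the second candidate dominates
```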

(17)

Miscellaneous Models Bayesian Learning

Example of a prior

Consider a perceptron: $h$ is determined by $\mathbf{w} = w_0, w_1, \cdots, w_d$

A possible prior on $\mathbf{w}$: each $w_i$ is independent, uniform over $[-1, 1]$

This determines the prior over $h$: $P(h = f)$

Given $\mathcal{D}$, we can compute $P(\mathcal{D} \mid h = f)$

Putting them together, we get $P(h = f \mid \mathcal{D}) \propto P(h = f)\, P(\mathcal{D} \mid h = f)$

Learning From Data - Lecture 18 8/23

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 16/23

(18)

Miscellaneous Models Bayesian Learning

A prior is an assumption

Even the most neutral prior:

[Figure: '$x$ is unknown' depicted as a single point somewhere in $[-1, 1]$, versus '$x$ is random' with a uniform density $P(x)$ over $[-1, 1]$]

The true equivalent would be:

[Figure: '$x$ is unknown' versus '$x$ is random' with a delta-function density $\delta(x - a)$ concentrated at the unknown value $a \in [-1, 1]$]

Learning From Data - Lecture 18 9/23

(19)

Miscellaneous Models Bayesian Learning

If we knew the prior . . .

. . . we could compute $P(h = f \mid \mathcal{D})$ for every $h \in \mathcal{H}$

$\Longrightarrow$ we can find the most probable $h$ given the data

we can derive $E(h(\mathbf{x}))$ for every $\mathbf{x}$

we can derive the error bar for every $\mathbf{x}$

we can derive everything in a principled way

Learning From Data - Lecture 18 10/23

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 18/23

(20)

Miscellaneous Models Bayesian Learning

One Instance of Using Posterior

logistic regression: know how to calculate the likelihood $P(\mathcal{D} \mid \mathbf{w} = \mathbf{w}_f)$

• define a Gaussian prior $P(\mathbf{w} = \mathbf{w}_f) = \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$

• posterior $\propto$ Gaussian $\times$ (logistic likelihood)

maximize posterior = maximize [log Gaussian + log logistic likelihood] = regularized logistic regression

regularized logistic regression
= min augmented error (with iid assumption + effective $d_{\mathrm{VC}}$ heuristic + surrogate error $\widehat{\text{err}}$)
= max prior $\times$ likelihood (with iid assumption + prior/likelihood assumptions)
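A minimal sketch of this correspondence under stated assumptions (the $\sigma^2$, learning rate, and step count are illustrative): maximizing the log posterior with a Gaussian prior is the same computation as minimizing an L2-regularized logistic loss.

```python
import numpy as np

def map_logistic_regression(X, y, sigma2=1.0, eta=0.1, steps=1000):
    """MAP estimate for logistic regression with a N(0, sigma^2 I) prior on w.

    Maximizing log P(D | w) + log P(w) equals minimizing the averaged logistic
    loss plus an L2 penalty with lambda = 1 / (2 * sigma2 * N).
    """
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # gradient of the averaged negative log likelihood (logistic loss), y in {-1, +1}
        grad_nll = -(y / (1 + np.exp(y * (X @ w)))) @ X / N
        # gradient of the negative log prior (up to constants), scaled to match
        grad_prior = w / (sigma2 * N)
        w -= eta * (grad_nll + grad_prior)
    return w
```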

(21)

Miscellaneous Models Bayesian Learning

When is Bayesian learning justified?

1. The prior is valid
   trumps all other methods

2. The prior is irrelevant
   just a computational catalyst

Learning From Data - Lecture 18 11/23

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 20/23

(22)

Miscellaneous Models Bayesian Learning

My Biased View

in reality:
• prior/likelihood mostly invalid (Gaussian, conditional independence, etc.), shooting for computational ease
• prior/likelihood irrelevant? I don't know

(23)

Miscellaneous Models Bayesian Learning

Fun Time

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 22/23

(24)

Miscellaneous Models Bayesian Learning

Summary

Lecture 14: Miscellaneous Models
• Matrix Factorization
• Gradient Boosted Decision Tree
• Naive Bayes
• Bayesian Learning
