# Machine Learning Techniques (機器學習技巧)

### Lecture 14: Miscellaneous Models

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw


## Recommender System Revisited

- competition held by Netflix in 2006
- data for the $j$-th movie: $D_j = \{(x_n = (i),\ y_n = r_{ij})\}_{n=1}^{N_j}$
- **abstract features** $x = (i)$: just the user ID $i$
- how to **learn** from all $D_j$?

## Linear Model for Recommender System

consider a linear model for each $D_j = \{(x_n = (i),\ y_n = r_{ij})\}$, with a shared transform $\Phi$:

$$y \approx h(x) = w_j^T \Phi(x)$$

with $\Phi$ to be **learned** from data, like NNet/RBF Net; for the abstract feature $x = (i)$, write $\Phi(x) = v_i$, so $w_j^T v_i \approx y$

overall $E_{\text{in}}$ with squared error:

$$E_{\text{in}}(\{w_j\}, \{v_i\}) = \frac{1}{N} \sum_{\text{known } (i,j)} \left(r_{ij} - w_j^T v_i\right)^2$$

how to minimize? **SGD** by sampling known $(i, j)$

## Matrix Factorization

$$R \approx V^T W: \quad r_{ij} \approx v_i^T w_j$$

learning: known ratings → learned $v_i$ (user factors) and $w_j$ (movie factors) → unknown rating prediction

similar modeling can be used for other abstract features
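The SGD rule above can be sketched directly. A minimal matrix-factorization sketch on a tiny synthetic rating list; the dimension `d`, learning rate, and data are illustrative assumptions, not from the lecture:

```python
import numpy as np

def mf_sgd(ratings, d=2, eta=0.05, epochs=20000, seed=0):
    """SGD matrix factorization: fit r_ij ~ v_i . w_j on known (i, j) pairs.

    ratings: list of (user i, movie j, rating r) triples.
    Returns user factors V and movie factors W."""
    rng = np.random.default_rng(seed)
    n_users = 1 + max(i for i, _, _ in ratings)
    n_movies = 1 + max(j for _, j, _ in ratings)
    V = rng.normal(scale=0.1, size=(n_users, d))   # small random init
    W = rng.normal(scale=0.1, size=(n_movies, d))
    for _ in range(epochs):
        i, j, r = ratings[rng.integers(len(ratings))]  # sample a known rating
        err = r - V[i] @ W[j]                          # residual of current model
        # gradient step on (r_ij - v_i . w_j)^2 w.r.t. both factors
        V[i], W[j] = V[i] + eta * err * W[j], W[j] + eta * err * V[i]
    return V, W

ratings = [(0, 0, 5), (0, 1, 1), (1, 0, 4), (1, 1, 2), (2, 1, 5)]
V, W = mf_sgd(ratings)
```

After training, `V[i] @ W[j]` approximates known ratings and predicts unknown ones.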


## Coordinate Descent for Linear Blending

Consider a linear blending problem: for $G = \{g_t\}_{t=1}^T$,

$$\min_{\alpha} \ \frac{1}{N} \sum_{n=1}^{N} \exp\!\left(-y_n \sum_{t=1}^{T} \alpha_t g_t(x_n)\right)$$

- why exponential error? $\exp(-y_n s)$ acts as an upper bound on err$_{0/1}$, as in AdaBoost
- how to minimize? GD, SGD, . . . if few $\alpha_t$
- what if lots of $\alpha_t$? pick **one coordinate**, and update its value only
- pick a good **coordinate** $i$ (the best one for the next step)
- minimize by setting the optimal step size along that coordinate

## Coordinate Descent View of AdaBoost

Consider the same linear blending problem: for $G = \{g_t\}$,

$$\min_{\alpha} \ \frac{1}{N} \sum_{n=1}^{N} \exp\!\left(-y_n \sum_{t} \alpha_t g_t(x_n)\right)$$

- pick a good **coordinate** (the best one for the next step): pick a good $g_t$
- minimize by setting the optimal step size: set $\alpha_t$

after some derivations (ML2012Fall HW7.5):

$$\alpha_t = \ln \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}$$

with $\epsilon_t$ the **weighted error** of $g_t$ — AdaBoost performs coordinate descent on the exponential error
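The optimal step can be verified numerically. A small sketch with hypothetical toy labels and predictions, checking that $\alpha_t = \ln\sqrt{(1-\epsilon_t)/\epsilon_t}$ minimizes the exponential error along one coordinate:

```python
import math

def exp_error(alpha, y, g, s):
    """Exponential error after stepping size alpha along coordinate g,
    with s the current aggregated scores."""
    return sum(math.exp(-yn * (sn + alpha * gn))
               for yn, gn, sn in zip(y, g, s)) / len(y)

# toy setup: current scores s = 0, so example weights are uniform
y = [+1, +1, +1, -1, -1]
g = [+1, +1, -1, -1, +1]          # g errs on two of five examples -> eps = 0.4
s = [0.0] * len(y)

eps = sum(yn != gn for yn, gn in zip(y, g)) / len(y)
alpha_star = math.log(math.sqrt((1 - eps) / eps))   # closed-form optimal step

# numerical check: no nearby alpha does better than the closed form
best = min(exp_error(a / 100, y, g, s) for a in range(-300, 301))
print(round(alpha_star, 3))  # → 0.203
```

Here $\alpha^\star = \tfrac{1}{2}\ln(0.6/0.4) \approx 0.203$, matching the grid search.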

## Gradient Boosted Decision Tree

Consider another linear blending problem, with squared error:

$$\min_{\beta} \ \frac{1}{N} \sum_{n=1}^{N} \left(y_n - \sum_{\ell} \beta_\ell g_\ell(x_n)\right)^2$$

- best **coordinate** $g_t$ at $t$-th iteration (under assumptions):

  $$\min_{g} \ \frac{1}{N} \sum_{n=1}^{N} \left((y_n - s_n) - g(x_n)\right)^2$$

  — best **regression** on $\{(x_n,\ y_n - s_n)\}$, the residuals under the current score $s_n$

- best $\beta_{\text{new}}$: one-dimensional **linear regression**

Gradient Boosted Decision Tree (GBDT): above + find $g$ by **decision tree**
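The residual-fitting loop can be sketched with regression stumps standing in for the decision trees, and $\beta$ fixed to 1 for simplicity; the data and parameters are illustrative:

```python
def fit_stump(xs, residuals):
    """Best regression stump on {(x_n, residual_n)}: a threshold and two means."""
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def gbdt(xs, ys, rounds=20):
    """Gradient boosting on squared error: each round fits a stump to residuals."""
    scores = [0.0] * len(xs)
    trees = []
    for _ in range(rounds):
        g = fit_stump(xs, [y - s for y, s in zip(ys, scores)])  # best g on residuals
        trees.append(g)
        scores = [s + g(x) for s, x in zip(scores, xs)]  # beta = 1 for simplicity
    return lambda x: sum(g(x) for g in trees)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 3.0, 5.0]
model = gbdt(xs, ys)
```

Each round drives the residuals toward zero, so after a few rounds `model(x)` reproduces the training targets closely.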


## Naive Bayes Model

want: getting $P(y \mid x)$ (e.g. logistic regression) for classification

Bayes rule: $P(y \mid x) \propto P(x \mid y)\, P(y)$

- estimating $P(y)$: frequency of $y$ in $D$ (easy!)
- joint distribution $P(x \mid y)$: easier if $P(x \mid y) = P(x_1 \mid y) P(x_2 \mid y) \cdots P(x_d \mid y)$ — the **conditional independence** assumption
- marginal distribution $P(x_i \mid y)$: piece-wise discrete, Gaussian, etc.

Naive Bayes model:

$$h(x) = P(x_1 \mid y) P(x_2 \mid y) \cdots P(x_d \mid y)\, P(y)$$

with your choice of distribution families

find $g(x) = P(x_1 \mid y) \cdots P(x_d \mid y)\, P(y)$ by 'good estimate' of all RHS terms. For binary classification,

$$g(x) = \text{sign}\left(\frac{P(+1) \prod_{i} P(x_i \mid +1)}{P(-1) \prod_{i} P(x_i \mid -1)} - 1\right) = \text{sign}\Bigg(\underbrace{\log\frac{P(+1)}{P(-1)}}_{\text{bias}} + \sum_{i} \underbrace{\log\frac{P(x_i \mid +1)}{P(x_i \mid -1)}}_{\phi_i(x)}\Bigg)$$

— also a naive **linear model** with 'heuristic/learned' transform $\phi_i$ and **bias**

a simple (heuristic) model that usually **works well** in practice
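The model above can be sketched with Gaussian marginals for each $P(x_i \mid y)$; the toy data is illustrative:

```python
import math

def gaussian_nb_fit(X, y):
    """Estimate P(y) from class frequencies and each P(x_i | y) as a 1-D Gaussian."""
    model = {}
    for c in set(y):
        rows = [x for x, yc in zip(X, y) if yc == c]
        prior = len(rows) / len(X)
        stats = []
        for i in range(len(X[0])):
            vals = [r[i] for r in rows]
            mu = sum(vals) / len(vals)
            var = sum((v - mu) ** 2 for v in vals) / len(vals) + 1e-9  # smoothed
            stats.append((mu, var))
        model[c] = (prior, stats)
    return model

def gaussian_nb_predict(model, x):
    """h(x) = argmax_y P(y) * prod_i P(x_i | y), computed in log space."""
    def log_score(c):
        prior, stats = model[c]
        s = math.log(prior)
        for xi, (mu, var) in zip(x, stats):
            s += -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        return s
    return max(model, key=log_score)

X = [[1.0, 2.1], [0.9, 1.9], [1.1, 2.0], [5.0, 6.2], [5.2, 5.8], [4.8, 6.0]]
y = [+1, +1, +1, -1, -1, -1]
m = gaussian_nb_fit(X, y)
print(gaussian_nb_predict(m, [1.0, 2.0]), gaussian_nb_predict(m, [5.0, 6.0]))  # → 1 -1
```

Working in log space turns the product of marginals into the sum-of-$\phi_i$ linear form derived above.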

## ICDM 2006 Top 10 Data Mining Algorithms

1. C4.5: decision tree
2. k-means: clustering, taught with RBF Network
3. SVM: large-margin/kernel
4. Apriori: for frequent itemset mining
5. EM: in Bayesian learning
6. PageRank: for link analysis
7. AdaBoost: boosting, a core aggregation model
8. k-NN: taught very briefly within RBF Network
9. Naive Bayes: linear model with heuristic transform
10. CART: decision tree

personal view of four missing ML competitors: LinReg, LogReg, Random Forest, NNet


## Disclaimer

The following pages are excerpted from Lecture 18 of the Learning From Data course.

the prior: $P(h = f)$ is the **prior**; $P(h = f \mid D)$ is the **posterior**; given the prior, we have the full distribution

Learning From Data - Lecture 18, 7/23


## Bayesian Learning

### Example of a prior

Consider a perceptron: $h$ is determined by $\mathbf{w} = w_0, w_1, \ldots, w_d$

A possible prior on $\mathbf{w}$: each $w_i$ is independent, uniform over $[-1, 1]$

This determines the prior over $h$: $P(h = f)$

Given $D$, we can compute $P(D \mid h = f)$

Putting them together, we get $P(h = f \mid D) \propto P(h = f)\, P(D \mid h = f)$

Learning From Data - Lecture 18, 8/23

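The recipe (uniform prior on $\mathbf{w}$, likelihood from $D$, posterior by Bayes rule) can be made concrete on a toy one-parameter "perceptron" with a gridded prior; the flip-noise likelihood and grid resolution are illustrative assumptions:

```python
# toy 1-D "perceptron": h_w(x) = sign(x - w), with w the only parameter
def likelihood(w, data, noise=0.1):
    """P(D | h = f) under a simple flip-noise model: each observed label agrees
    with sign(x - w) with probability 1 - noise."""
    p = 1.0
    for x, yl in data:
        agree = (1 if x - w >= 0 else -1) == yl
        p *= (1 - noise) if agree else noise
    return p

data = [(-0.8, -1), (-0.2, -1), (0.3, 1), (0.9, 1)]  # consistent with w near 0
grid = [i / 100 for i in range(-100, 101)]            # uniform prior over [-1, 1]
post = [likelihood(w, data) for w in grid]            # uniform prior -> constant factor
Z = sum(post)
post = [p / Z for p in post]                          # normalize the posterior

w_map = grid[post.index(max(post))]                   # most probable w given the data
```

Every threshold `w` that separates the data gets equal maximal posterior here, so `w_map` is one such separating threshold, and `post` gives error bars, expectations, and anything else "in a principled way".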


### A prior is an assumption

Even the most neutral prior, e.g. "$x$ is uniform over $[-1, 1]$", is an assumption. The true equivalent of "$x$ is unknown" would be a point mass at the true value: $\delta(x - a)$.

Learning From Data - Lecture 18, 9/23


### If we knew the prior

. . . we could compute $P(h = f \mid D)$ for every $h \in \mathcal{H}$

$\Rightarrow$

- we can find the most probable $h$ given the data
- we can derive $\mathbb{E}[h(x)]$ for every $x$
- we can derive the error bar for every $x$
- we can derive everything in a principled way

Learning From Data - Lecture 18, 10/23


## One Instance of Using Posterior

- logistic regression: know how to calculate the **likelihood** $P(D \mid \mathbf{w})$
- maximize **posterior** $P(\mathbf{w} \mid D) \propto P(\mathbf{w})\, P(D \mid \mathbf{w})$ = maximize [log Gaussian prior + log likelihood] = min **augmented error** (with iid assumption + Gaussian prior assumption)
- maximize posterior reduces to maximize **likelihood** (with iid assumption + uniform prior assumption)

that is, MAP with a Gaussian prior on $\mathbf{w}$ gives L2-regularized logistic regression
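The equivalence can be checked directly: gradient descent on $-\log$ posterior with a Gaussian prior is exactly gradient descent on the L2-augmented logistic error. A sketch with illustrative data and hyperparameters:

```python
import math

def neg_log_posterior_grad(w, X, y, sigma2=1.0):
    """Gradient of -log posterior = -log Gaussian prior - log likelihood,
    i.e. the gradient of the L2-augmented logistic error (up to constants)."""
    g = [wi / sigma2 for wi in w]                 # from prior: grad of ||w||^2 / (2 sigma^2)
    for x, yl in zip(X, y):
        s = sum(wi * xi for wi, xi in zip(w, x))
        # grad of -log(1 / (1 + exp(-yl * s))), the logistic likelihood term
        g = [gi - yl * xi / (1 + math.exp(yl * s)) for gi, xi in zip(g, x)]
    return g

def map_fit(X, y, eta=0.1, steps=2000, sigma2=1.0):
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = neg_log_posterior_grad(w, X, y, sigma2)
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w

# linearly separable toy data (first feature is a constant for bias):
# without the prior ||w|| would diverge; the Gaussian prior keeps it finite
X = [[1.0, 2.0], [1.0, 1.5], [1.0, -1.0], [1.0, -2.0]]
y = [+1, +1, -1, -1]
w = map_fit(X, y)
```

The resulting `w` classifies all examples correctly while staying bounded, exactly the behavior of the regularized (augmented-error) solution.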


When isBayesian learningjustied?

1. Thepriorisvalid

trumpsallothermethods

2. Thepriorisirrelevant

justa omputational atalyst

LearningFromData-Le ture18 11/23

When isBayesian learningjustied?

1. Thepriorisvalid

trumpsallothermethods

2. Thepriorisirrelevant

justa omputational atalyst

LearningFromData-Le ture18 11/23

When isBayesian learningjustied?

1. Thepriorisvalid

trumpsallothermethods

2. Thepriorisirrelevant

justa omputational atalyst

LearningFromData-Le ture18 11/23

When isBayesian learningjustied?

1. Thepriorisvalid

trumpsallothermethods

2. Thepriorisirrelevant

justa omputational atalyst

LearningFromData-Le ture18 11/23

When isBayesian learningjustied?

1. Thepriorisvalid

trumpsallothermethods

2. Thepriorisirrelevant

justa omputational atalyst

LearningFromData-Le ture18 11/23

(22)

## My Biased View

in reality:

- prior/likelihood mostly **invalid** (Gaussian, conditional independence, etc.), shooting for computational ease
- prior/likelihood


## Summary

### Bayesian Learning
