
# Machine Learning Techniques (機器學習技法)


(1)

## Machine Learning Techniques (機器學習技法)

### Lecture 11: Gradient Boosted Decision Tree

Hsuan-Tien Lin (林軒田)

htlin@csie.ntu.edu.tw

(2)

### Roadmap

2 Combining Predictive Features: Aggregation Models

*Lecture 10: Random Forest (bagging of randomized C&RT trees with automatic validation and feature selection)*

*Lecture 11: Gradient Boosted Decision Tree*

(3)

## From Random Forest to AdaBoost-DTree

```
function RandomForest(D)
    for t = 1, 2, ..., T
        request size-N data set D̃_t by bootstrapping with D
        obtain tree g_t by Randomized-DTree(D̃_t)
    return G = Uniform({g_t})

function AdaBoost-DTree(D)
    for t = 1, 2, ..., T
        reweight data by example weights u^(t)
        obtain tree g_t by DTree(D, u^(t))
        calculate 'vote' α_t of g_t
    return G = LinearHypothesis({(g_t, α_t)})
```

need: weighted DTree(D, u^(t))

(4)

## Weighted Decision Tree Algorithm

### Weighted Algorithm

minimize (regularized) E_in^u(h) = (1/N) Σ_{n=1}^{N} u_n · err(y_n, h(x_n))

if using an existing algorithm as a **black box** (no modifications), to get E_in^u approximately optimized:

- weights u expressed by **bootstrap-sampled copies**: request size-N data D̃_t by bootstrapping with D (as in bagging)
- weights u expressed by **sampling proportional to u_n**: request size-N data D̃_t by sampling ∝ u_n on D

AdaBoost-DTree: AdaBoost + sampling ∝ u_n^(t) + DTree(D̃_t) without modifying DTree

(5)

## Weak Decision Tree Algorithm

recall: AdaBoost sets α_t = ln ◆_t = ln √((1−ε_t)/ε_t), with ε_t the weighted error of g_t

if a fully grown tree trained on all x_n
⇒ E_in(g_t) = 0 if all x_n different
⇒ weighted error ε_t = 0
⇒ α_t = ∞ (autocracy!!)

need: a tree trained on **some** x_n (sampling ∝ u^(t)) and **pruned** to be weak

AdaBoost-DTree: AdaBoost + sampling ∝ u_n^(t) + **pruned** DTree(D̃_t)

(6)

what if DTree with **height ≤ 1** (extremely pruned)? then DTree learns merely a **decision stump**:

learn h by argmin over decision stumps of Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)

if the impurity is the weighted 0/1 error, AdaBoost-DTree with height ≤ 1 is simply **AdaBoost-Stump**, a special case of AdaBoost-DTree
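As an illustration (not from the slides), AdaBoost with decision stumps as the weak learner can be sketched in NumPy; all function names here are invented for the example, and the stump search is a brute-force scan over features and thresholds:

```python
import numpy as np

def stump_predict(X, feat, thresh, sign):
    # decision stump: sign * (+1 if x[feat] > thresh else -1)
    return sign * np.where(X[:, feat] > thresh, 1.0, -1.0)

def best_stump(X, y, u):
    # pick the stump minimizing the weighted 0/1 error under example weights u
    best = (np.inf, 0, 0.0, 1.0)
    for feat in range(X.shape[1]):
        vals = np.sort(np.unique(X[:, feat]))
        thresholds = np.concatenate(([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2.0))
        for thresh in thresholds:
            for sign in (1.0, -1.0):
                pred = stump_predict(X, feat, thresh, sign)
                err = np.sum(u[pred != y]) / np.sum(u)
                if err < best[0]:
                    best = (err, feat, thresh, sign)
    return best

def adaboost_stump(X, y, T=10):
    N = len(y)
    u = np.full(N, 1.0 / N)                     # u^(1): uniform weights
    G = []
    for _ in range(T):
        eps, feat, thresh, sign = best_stump(X, y, u)
        eps = float(np.clip(eps, 1e-12, 1 - 1e-12))  # avoid alpha = infinity ("autocracy")
        diamond = np.sqrt((1 - eps) / eps)
        alpha = np.log(diamond)                 # vote alpha_t = ln sqrt((1-eps)/eps)
        pred = stump_predict(X, feat, thresh, sign)
        u = np.where(pred != y, u * diamond, u / diamond)  # reweight examples
        G.append((alpha, feat, thresh, sign))
    return G

def predict(G, X):
    score = sum(a * stump_predict(X, f, th, s) for a, f, th, s in G)
    return np.sign(score)
```

The clipping of eps is exactly the "weak tree" concern above: a stump that is perfect on the (weighted) data would otherwise receive an infinite vote.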

(7)

## Fun Time

When running AdaBoost-DTree with sampling and getting a decision tree g_t that achieves zero error on the sampled data set D̃_t, which of the following is possible?

1. α_t < 0
2. α_t = 0
3. α_t > 0
4. **all of the above**

Answer: 4. While g_t achieves zero error on D̃_t, g_t may not achieve zero weighted error on (D, u^(t)), and hence ε_t can be anything, even ≥ 1/2. Then α_t can be ≤ 0.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 6/25


(9)

## Example Weights of AdaBoost

AdaBoost reweights with ◆_t = √((1−ε_t)/ε_t) and α_t = ln ◆_t:

u_n^(t+1) = u_n^(t) · ◆_t (if incorrect) or u_n^(t) / ◆_t (if correct)
= u_n^(t) · ◆_t^(−y_n g_t(x_n))
= u_n^(t) · exp(−y_n α_t g_t(x_n))

so after T rounds:

u_n^(T+1) = u_n^(1) · Π_{t=1}^{T} exp(−y_n α_t g_t(x_n)) = (1/N) · exp(−y_n Σ_{t=1}^{T} α_t g_t(x_n))

### •

recall: G(x) = sign(Σ_{t=1}^{T} α_t g_t(x)), where Σ_{t=1}^{T} α_t g_t(x) is the **voting score** of {g_t} on x

u_n^(T+1) ∝ exp(−y_n (voting score on x_n))
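As a quick sanity check (not part of the slides), the two forms of the weight update agree numerically once α_t = ln ◆_t, for all four sign combinations of y and g(x):

```python
import numpy as np

eps = 0.3                               # assumed weighted error of some g_t
diamond = np.sqrt((1 - eps) / eps)      # the scaling factor ◆_t
alpha = np.log(diamond)                 # vote alpha_t = ln ◆_t

for y, g in [(+1, +1), (+1, -1), (-1, +1), (-1, -1)]:
    scaled = diamond ** (-y * g)        # multiply by ◆_t if wrong, divide if correct
    exp_form = np.exp(-y * alpha * g)   # exp(-y_n * alpha_t * g_t(x_n))
    assert np.isclose(scaled, exp_form)
```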

(10)

## Voting Score and Margin

linear blending = linear model + hypotheses as transform (+ constraints):

G(x_n) = sign( Σ_{t=1}^{T} α_t g_t(x_n) )

the voting score Σ_t α_t g_t(x_n) plays the role of w^T φ(x_n), and in hard-margin SVM the margin is y_n · (w^T φ(x_n) + b) / ‖w‖

so y_n (voting score) = a signed & unnormalized margin

want y_n (voting score) positive & large
⇐⇒ want exp(−y_n (voting score)) small
⇐⇒ want u_n^(T+1) small

(11)

## AdaBoost Error Function

claim: AdaBoost decreases Σ_{n=1}^{N} u_n^(t) and thus somewhat minimizes

Ê_ADA = (1/N) Σ_{n=1}^{N} exp( −y_n Σ_{t=1}^{T} α_t g_t(x_n) )

with **linear score** s = Σ_{t=1}^{T} α_t g_t(x_n):

- err_{0/1}(s, y) = ⟦ys ≤ 0⟧
- err_ADA(s, y) = exp(−ys): a convex upper bound of err_{0/1}, called the **exponential error measure**
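A small numeric check (illustrative, not from the slides) that the exponential error indeed upper-bounds the 0/1 error everywhere:

```python
import numpy as np

s = np.linspace(-3.0, 3.0, 601)              # linear scores
for y in (-1.0, +1.0):
    err_01 = (y * s <= 0).astype(float)      # err_0/1(s, y) = [[ys <= 0]]
    err_ada = np.exp(-y * s)                 # err_ADA(s, y) = exp(-ys)
    assert np.all(err_ada >= err_01)         # exponential error upper-bounds 0/1 error
```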


(15)

## Gradient Descent on AdaBoost Error Function

recall: gradient descent (remember? :-)), at iteration t:

min_{‖v‖=1} E_in(w_t + η v) ≈ E_in(w_t) [known] + η [given positive] · v^T ∇E_in(w_t) [known]

at iteration t, to find a good **function direction** h, solve

min_h Ê_ADA = (1/N) Σ_{n=1}^{N} exp( −y_n ( Σ_{τ=1}^{t−1} α_τ g_τ(x_n) + η h(x_n) ) )
= Σ_{n=1}^{N} u_n^(t) exp( −y_n η h(x_n) )
≈ Σ_{n=1}^{N} u_n^(t) ( 1 − y_n η h(x_n) )   (Taylor expansion for small η)
= Σ_{n=1}^{N} u_n^(t) + η Σ_{n=1}^{N} u_n^(t) ( −y_n h(x_n) )

good h: one that minimizes Σ_{n=1}^{N} u_n^(t) ( −y_n h(x_n) )

(16)

## Learning Hypothesis as Optimization

finding a good **h** (function direction) ⇔ minimizing Σ_{n=1}^{N} u_n^(t) (−y_n h(x_n))

for binary classification, where y_n and h(x_n) are both ∈ {−1, +1}:

Σ_{n=1}^{N} u_n^(t) (−y_n h(x_n))
= Σ_{n=1}^{N} u_n^(t) · ( −1 if y_n = h(x_n); +1 if y_n ≠ h(x_n) )
= −Σ_{n=1}^{N} u_n^(t) + Σ_{n=1}^{N} u_n^(t) · ( 0 if y_n = h(x_n); 2 if y_n ≠ h(x_n) )
= −Σ_{n=1}^{N} u_n^(t) + 2 E_in^{u(t)}(h) · N

who minimizes E_in^{u(t)}(h)? the base algorithm **A** in AdaBoost!

AdaBoost: the good function direction for gradient descent is exactly h = g_t returned by A

(17)

## Deciding Blending Weight as Optimization

g_t obtained by A as an approximate solution of min_h Σ_{n=1}^{N} u_n^(t) exp(−y_n η h(x_n)); after finding g_t, solve for the **optimal η**:

min_η Ê_ADA = Σ_{n=1}^{N} u_n^(t) exp(−y_n η g_t(x_n))

the optimal η_t is somewhat **'greedily faster'** than a fixed (small) η: called **steepest descent** for optimization

### •

two cases inside the summation:

- y_n = g_t(x_n): u_n^(t) exp(−η)
- y_n ≠ g_t(x_n): u_n^(t) exp(+η)

so Ê_ADA = ( Σ_{n=1}^{N} u_n^(t) ) · ( (1 − ε_t) exp(−η) + ε_t exp(+η) )

by solving ∂Ê_ADA/∂η = 0, the steepest η_t = ln √((1−ε_t)/ε_t) = α_t

AdaBoost: steepest descent with an approximate functional gradient
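A numeric check (illustrative, with an assumed ε_t) that η_t = ln √((1−ε_t)/ε_t) is indeed where the derivative vanishes and the minimum is attained:

```python
import numpy as np

eps = 0.2                                            # assumed epsilon_t
eta_star = np.log(np.sqrt((1 - eps) / eps))          # claimed steepest eta_t (= alpha_t)

def e_hat(eta):
    # E_ADA up to the positive factor sum_n u_n^(t)
    return (1 - eps) * np.exp(-eta) + eps * np.exp(eta)

# derivative -(1-eps)exp(-eta) + eps exp(+eta) vanishes at eta_star ...
assert np.isclose(-(1 - eps) * np.exp(-eta_star) + eps * np.exp(eta_star), 0.0)
# ... and eta_star beats every candidate eta on a grid
assert np.all(e_hat(eta_star) <= e_hat(np.linspace(-2.0, 2.0, 2001)))
```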

(18)

## Fun Time

For Ê_ADA = (Σ_{n=1}^{N} u_n^(t)) · ((1 − ε_t) exp(−η) + ε_t exp(+η)), which of the following is ∂Ê_ADA/∂η, which can be used for solving the optimal η?

1. (Σ_n u_n^(t)) · ( +(1 − ε_t) exp(−η) + ε_t exp(+η) )
2. (Σ_n u_n^(t)) · ( +(1 − ε_t) exp(−η) − ε_t exp(+η) )
3. (Σ_n u_n^(t)) · ( −(1 − ε_t) exp(−η) + ε_t exp(+η) )
4. (Σ_n u_n^(t)) · ( −(1 − ε_t) exp(−η) − ε_t exp(+η) )

Answer: 3. Differentiate exp(−η) and exp(+η) with respect to η and you should easily get the result.


(20)

## Gradient Boosting for Arbitrary Error Function

AdaBoost:

min_η min_h (1/N) Σ_{n=1}^{N} exp( −y_n ( Σ_{τ=1}^{t−1} α_τ g_τ(x_n) + η h(x_n) ) )

with binary-output hypothesis h

GradientBoost:

min_η min_h (1/N) Σ_{n=1}^{N} err( Σ_{τ=1}^{t−1} α_τ g_τ(x_n) + η h(x_n), y_n )

with any hypothesis h (usually a regression hypothesis)

allows extension to different **err** for regression/soft classification/etc.

(21)

## GradientBoost for Regression

min_η min_h (1/N) Σ_{n=1}^{N} err( s_n + η h(x_n), y_n ) with s_n = Σ_{τ=1}^{t−1} α_τ g_τ(x_n) and err(s, y) = (s − y)²

min_h . . .
≈ min_h (1/N) Σ_{n=1}^{N} err(s_n, y_n) [constant] + (η/N) Σ_{n=1}^{N} h(x_n) · ∂err(s, y_n)/∂s |_{s=s_n}   (first-order Taylor)
= min_h constants + (η/N) Σ_{n=1}^{N} h(x_n) · 2(s_n − y_n)

naïve solution: h(x_n) = −∞ · (s_n − y_n) if there is no constraint on h

(22)

## Learning Hypothesis as Optimization

min_h constants + (η/N) Σ_{n=1}^{N} h(x_n) · 2(s_n − y_n)

- magnitude of **h** does not matter: because **η** will be optimized next
- **penalize large magnitude** to avoid the naïve solution: add a regularizer (h(x_n))²

min_h constants + (1/N) Σ_{n=1}^{N} ( 2 h(x_n)(s_n − y_n) + (h(x_n))² )
= min_h constants + (1/N) Σ_{n=1}^{N} ( h(x_n) − (y_n − s_n) )² + more constants

solution: g_t = squared-error regression on the **residuals** {(x_n, y_n − s_n)}

find g_t by regression with squared error on {(x_n, y_n − s_n)}
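An illustrative sketch of the residual-fitting step with assumed toy values: the regression target is y_n − s_n, not y_n. Here the "regression algorithm" is simply a least-squares line, just to show the plumbing:

```python
import numpy as np

# assumed toy data: 1-d inputs, targets, and current scores s_n (zero before round 1)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
s = np.zeros_like(y)

residual = y - s                             # regression targets are y_n - s_n
# squared-error regression: here simply a line h(x) = w*x + b by least squares
A = np.vstack([x, np.ones_like(x)]).T
(w, b), *_ = np.linalg.lstsq(A, residual, rcond=None)
assert np.allclose(A @ np.array([w, b]), residual)   # this toy residual is exactly linear
```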

(23)

## Deciding Blending Weight as Optimization

after finding g_t, solve for the optimal η:

min_η (1/N) Σ_{n=1}^{N} err( s_n + η g_t(x_n), y_n ) with err(s, y) = (s − y)²

min_η (1/N) Σ_{n=1}^{N} ( s_n + η g_t(x_n) − y_n )² = (1/N) Σ_{n=1}^{N} ( (y_n − s_n) − η g_t(x_n) )²

a one-variable linear regression on {(g_t-transformed input, residual)} = {(g_t(x_n), y_n − s_n)}

GradientBoost for regression: α_t = optimal η found by this g_t-transformed linear regression
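The one-variable linear regression (without intercept) has a closed form, which gives α_t directly; a small check with assumed toy values for g_t(x_n) and the residuals:

```python
import numpy as np

# assumed toy values: g_t evaluated on the data, and current residuals y_n - s_n
g = np.array([1.0, -1.0, 1.0, -1.0])         # g_t(x_n)
res = np.array([0.9, -1.1, 1.2, -0.8])       # y_n - s_n

# one-variable linear regression without intercept: min_eta sum((res - eta*g)^2)
eta = np.sum(g * res) / np.sum(g * g)        # closed-form optimal eta (= alpha_t)

def sq_loss(e):
    return np.mean((res - e * g) ** 2)

# eta is no worse than any candidate on a grid
assert all(sq_loss(eta) <= sq_loss(e) for e in np.linspace(-3.0, 3.0, 601))
```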

(24)

## Putting Everything Together

### Gradient Boosted Decision Tree (GBDT)

```
s_1 = s_2 = . . . = s_N = 0
for t = 1, 2, . . . , T
    1. obtain g_t by A({(x_n, y_n − s_n)}) where A is a (squared-error)
       regression algorithm (how about a sampled and pruned C&RT?)
    2. compute α_t = OneVarLinearRegression({(g_t(x_n), y_n − s_n)})
    3. update s_n ← s_n + α_t g_t(x_n)
return G(x) = Σ_{t=1}^{T} α_t g_t(x)
```

### GBDT: 'regression sibling' of AdaBoost-DTree

popular in practice
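A minimal illustrative sketch of the GBDT loop, assuming 1-d inputs and depth-1 regression trees as the base algorithm A (all function names are invented for this example):

```python
import numpy as np

def fit_reg_stump(x, r):
    # base algorithm A: squared-error regression with a depth-1 tree on 1-d input;
    # returns (threshold, left-mean, right-mean) minimizing squared error on r
    xs = np.sort(np.unique(x))
    best = (np.inf, -np.inf, float(np.mean(r)), float(np.mean(r)))
    for th in (xs[:-1] + xs[1:]) / 2.0:
        left, right = r[x <= th], r[x > th]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best[0]:
            best = (sse, th, float(left.mean()), float(right.mean()))
    return best[1:]

def stump_out(stump, x):
    th, lo, hi = stump
    return np.where(x <= th, lo, hi)

def gbdt(x, y, T=20):
    s = np.zeros_like(y)                        # s_1 = ... = s_N = 0
    G = []
    for _ in range(T):
        g = fit_reg_stump(x, y - s)             # g_t: regression on residuals
        gx = stump_out(g, x)
        denom = np.sum(gx * gx)
        if denom < 1e-12:                       # residuals already fit perfectly
            break
        alpha = np.sum(gx * (y - s)) / denom    # one-variable linear regression
        s = s + alpha * gx                      # update scores s_n
        G.append((alpha, g))
    return G

def gbdt_predict(G, x):
    return sum(a * stump_out(g, x) for a, g in G)
```

Real GBDT implementations use deeper (but still pruned) trees, multi-dimensional inputs, and sampling; the loop structure is the same.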

(25)

## Fun Time

Which of the following is the optimal η for

min_η (1/N) Σ_{n=1}^{N} ( (y_n − s_n) − η g_t(x_n) )² ?

1. (Σ_n g_t(x_n)(y_n − s_n)) · (Σ_n g_t²(x_n))
2. (Σ_n g_t(x_n)(y_n − s_n)) / (Σ_n g_t²(x_n))
3. (Σ_n g_t(x_n)(y_n − s_n)) + (Σ_n g_t²(x_n))
4. (Σ_n g_t(x_n)(y_n − s_n)) − (Σ_n g_t²(x_n))

Answer: 2. Derived within Lecture 9 of ML Foundations,

### remember? :-)


(27)

Gradient Boosted Decision Tree Summary of Aggregation Models

## Map of Blending Models

blending: aggregate {g_t} **after getting** them

- **uniform**: simple voting/averaging of g_t
- **non-uniform**: linear model on g_t-transformed inputs
- **conditional**: nonlinear model on g_t-transformed inputs

### uniform for 'stability';

non-uniform/conditional **carefully** for 'complexity'

(28)


## Map of Aggregation-Learning Models

learning: aggregate **as well as getting** {g_t}

- **Bagging**: diverse g_t by bootstrapping; uniform vote by nothing :-)
- **AdaBoost**: diverse g_t by reweighting; linear vote by steepest search
- **Decision Tree**: diverse g_t by data splitting; conditional vote by branching
- **GradientBoost**: diverse g_t by residual fitting; linear vote by steepest search

### boosting-like algorithms

most popular

(29)


## Map of Aggregation of Aggregation Models

- **Random Forest**: randomized bagging + 'strong' DTree
- **AdaBoost-DTree**: AdaBoost + 'weak' DTree
- **GBDT**: GradientBoost + 'weak' DTree

### all three

frequently used in practice

(30)


## Specialty of Aggregation Models

G(x) ‘strong’

aggregation

=⇒

G(x) ‘moderate’

aggregation

=⇒

### regularization

proper aggregation (a.k.a. ‘ensemble’)

=⇒

### better performance

(31)


## Fun Time

Which of the following aggregation models learns diverse g_t by fitting **residuals** and calculates α_t by a **one-variable linear regression**?

1. Random Forest
2. GBDT
3. Decision Tree
4. Linear Blending

Answer: 2. Congratulations on being an

### expert

in aggregation models!

### :-)


(33)


## Summary

2 Combining Predictive Features: Aggregation Models

*Lecture 11: Gradient Boosted Decision Tree*

- weighted decision tree algorithm: weighted E_in approximately optimized via sampling ∝ u
- weak decision tree algorithm: pruned (even height ≤ 1) trees to avoid autocracy
- AdaBoost as optimization: steepest descent on the exponential error with an approximate functional gradient
- gradient boosting: the same view with any error; GBDT fits residuals by regression trees and blends by one-variable linear regression

### • next: extract features other than hypotheses

3 Distilling Implicit Features: Extraction Models