
(1)

Machine Learning Techniques (機器學習技法)

Lecture 11: Gradient Boosted Decision Tree

Hsuan-Tien Lin (林軒田)

htlin@csie.ntu.edu.tw

Department of Computer Science

& Information Engineering

National Taiwan University

(國立台灣大學資訊工程系)

(2)

Gradient Boosted Decision Tree

Roadmap

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

Lecture 10: Random Forest
bagging of randomized C&RT trees with automatic validation and feature selection

Lecture 11: Gradient Boosted Decision Tree
Adaptive Boosted Decision Tree
Optimization View of AdaBoost
Gradient Boosting
Summary of Aggregation Models

3 Distilling Implicit Features: Extraction Models

(3)

Gradient Boosted Decision Tree Adaptive Boosted Decision Tree

From Random Forest to AdaBoost-DTree

function RandomForest(D)
  for t = 1, 2, . . . , T
    1 request size-N′ data D̃_t by bootstrapping with D
    2 obtain tree g_t by Randomized-DTree(D̃_t)
  return G = Uniform({g_t})

function AdaBoost-DTree(D)
  for t = 1, 2, . . . , T
    1 reweight data by u^(t)
    2 obtain tree g_t by DTree(D, u^(t))
    3 calculate 'vote' α_t of g_t
  return G = LinearHypo({(g_t, α_t)})

need: weighted DTree(D, u^(t))
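As a concrete reference point for the left-hand skeleton, here is a minimal Python sketch of the RandomForest side, assuming numpy and scikit-learn's DecisionTreeClassifier as a rough stand-in for Randomized-DTree, with ±1 labels; the function name random_forest is illustrative only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(X, y, T=100, rng=None):
    """Bagging of randomized trees with a uniform vote (labels assumed in {-1, +1})."""
    rng = rng or np.random.default_rng(0)
    N = len(X)
    trees = []
    for _ in range(T):
        idx = rng.choice(N, size=N, replace=True)   # size-N' bootstrap sample D~_t (N' = N here)
        # max_features adds per-branch feature randomness, a rough stand-in for Randomized-DTree
        trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))
    return lambda Xq: np.sign(sum(g.predict(Xq) for g in trees))   # G = Uniform({g_t})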

(4)

Gradient Boosted Decision Tree Adaptive Boosted Decision Tree

Weighted Decision Tree Algorithm

Weighted Algorithm
minimize (regularized) E_in^u(h) = (1/N) Σ_{n=1}^{N} u_n · err(y_n, h(x_n))

if using existing algorithm as black box (no modifications), to get E_in^u approximately optimized...

'Weighted' Algorithm in Bagging
weights u expressed by bootstrap-sampled copies
—request size-N′ data D̃_t by bootstrapping with D

A General Randomized Base Algorithm
weights u expressed by sampling proportional to u_n
—request size-N′ data D̃_t by sampling ∝ u on D

AdaBoost-DTree: often via AdaBoost + sampling ∝ u^(t) + DTree(D̃_t) without modifying DTree
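The 'weights expressed by sampling' idea takes only a few lines of numpy; this is a minimal sketch, and the helper name sample_by_weight is not from the slides.

import numpy as np

def sample_by_weight(X, y, u, n_prime=None, rng=None):
    """Request a size-N' data set D~_t by sampling examples with probability
    proportional to u_n, so an unmodified base algorithm can be trained on it."""
    rng = rng or np.random.default_rng(0)
    n_prime = n_prime or len(X)
    p = np.asarray(u, dtype=float)
    p = p / p.sum()                                   # turn weights into sampling probabilities
    idx = rng.choice(len(X), size=n_prime, replace=True, p=p)
    return X[idx], y[idx]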

(5)

Gradient Boosted Decision Tree Adaptive Boosted Decision Tree

Weak Decision Tree Algorithm

AdaBoost: votes α_t = ln ♦_t = ln √((1−ε_t)/ε_t) with weighted error rate ε_t

if fully grown tree trained on all x_n
⟹ E_in(g_t) = 0 if all x_n different
⟹ E_in^u(g_t) = 0
⟹ ε_t = 0
⟹ α_t = ∞ (autocracy!!)

need: pruned tree trained on some x_n to be weak
• pruned: usual pruning, or just limiting tree height
• some: sampling ∝ u^(t)

AdaBoost-DTree: often via AdaBoost + sampling ∝ u^(t) + pruned DTree(D̃_t)
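Combining the sampling trick with a height-limited tree gives a rough AdaBoost-DTree sketch; scikit-learn's DecisionTreeClassifier (with max_depth acting as the 'pruning') is an assumption for illustration, and labels are assumed ±1.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_dtree(X, y, T=100, max_depth=3, rng=None):
    rng = rng or np.random.default_rng(0)
    N = len(X)
    u = np.full(N, 1.0 / N)                                   # u^(1)
    trees, alphas = [], []
    for _ in range(T):
        idx = rng.choice(N, size=N, replace=True, p=u / u.sum())             # sampling ~ u^(t)
        g = DecisionTreeClassifier(max_depth=max_depth).fit(X[idx], y[idx])  # 'weak' pruned tree
        pred = g.predict(X)
        eps = np.sum(u * (pred != y)) / np.sum(u)             # weighted error rate eps_t
        eps = np.clip(eps, 1e-10, 1 - 1e-10)                  # guard against eps_t = 0 (autocracy)
        alpha = np.log(np.sqrt((1 - eps) / eps))              # alpha_t = ln sqrt((1 - eps_t)/eps_t)
        u = u * np.exp(-alpha * y * pred)                     # reweight for the next round
        trees.append(g); alphas.append(alpha)
    return lambda Xq: np.sign(sum(a * g.predict(Xq) for a, g in zip(alphas, trees)))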

(6)

Gradient Boosted Decision Tree Adaptive Boosted Decision Tree

AdaBoost with Extremely-Pruned Tree

what if DTree with height ≤ 1 (extremely pruned)?

DTree (C&RT) with height ≤ 1: learn branching criteria

b(x) = argmin_{decision stumps h(x)} Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)

—if impurity = binary classification error: just a decision stump, remember? :-)

AdaBoost-Stump = special case of AdaBoost-DTree
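For the height-≤-1 case, the branching criterion reduces to a search over decision stumps; below is a brute-force numpy sketch using weighted classification error as the impurity (the weighted analogue of the criterion above). The helper name weighted_stump and the O(d·N²) search are illustrative, not from the slides.

import numpy as np

def weighted_stump(X, y, u):
    """Height-1 C&RT with classification-error impurity: the stump
    h(x) = s * sign(x_i - theta) minimizing the weighted error."""
    N, d = X.shape
    best = (np.inf, 0, -np.inf, 1)                          # (error, feature, theta, direction)
    for i in range(d):
        for theta in np.r_[-np.inf, np.unique(X[:, i])]:    # candidate thresholds
            for s in (+1, -1):
                pred = s * np.where(X[:, i] > theta, 1, -1)
                err = np.sum(u * (pred != y))               # weighted classification error
                if err < best[0]:
                    best = (err, i, theta, s)
    _, i, theta, s = best
    return lambda Xq: s * np.where(Xq[:, i] > theta, 1, -1)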

(7)

Gradient Boosted Decision Tree Adaptive Boosted Decision Tree

Fun Time

When running AdaBoost-DTree with sampling, suppose the decision tree g_t achieves zero error on the sampled data set D̃_t. Which of the following is possible?

1 α_t < 0
2 α_t = 0
3 α_t > 0
4 all of the above

Reference Answer: 4

While g_t achieves zero error on D̃_t, g_t may not achieve zero weighted error on (D, u^(t)), and hence ε_t can be anything, even ≥ 1/2. Then, α_t can be ≤ 0.


(9)

Gradient Boosted Decision Tree Optimization View of AdaBoost

Example Weights of AdaBoost

u_n^(t+1) = u_n^(t) · ♦_t if incorrect, u_n^(t) / ♦_t if correct
          = u_n^(t) · ♦_t^(−y_n g_t(x_n))
          = u_n^(t) · exp(−y_n α_t g_t(x_n))

u_n^(T+1) = u_n^(1) · Π_{t=1}^{T} exp(−y_n α_t g_t(x_n)) = (1/N) · exp(−y_n Σ_{t=1}^{T} α_t g_t(x_n))

recall: G(x) = sign(Σ_{t=1}^{T} α_t g_t(x))

Σ_{t=1}^{T} α_t g_t(x): voting score of {g_t} on x

AdaBoost: u_n^(T+1) ∝ exp(−y_n (voting score on x_n))
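A tiny numeric check (numpy, toy ±1 labels and predictions) that the round-by-round update indeed telescopes into the closed form u_n^(T+1) = (1/N) · exp(−y_n · voting score on x_n):

import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 4
y = rng.choice([-1, 1], size=N)               # toy labels y_n
G = rng.choice([-1, 1], size=(T, N))          # toy predictions g_t(x_n)
alpha = rng.uniform(0.1, 1.0, size=T)         # toy votes alpha_t

u = np.full(N, 1.0 / N)                       # u^(1) = 1/N
for t in range(T):
    u = u * np.exp(-y * alpha[t] * G[t])      # u^(t+1) = u^(t) * exp(-y_n alpha_t g_t(x_n))

closed_form = np.exp(-y * (alpha[:, None] * G).sum(axis=0)) / N
print(np.allclose(u, closed_form))            # True: weight proportional to exp(-y_n * voting score)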

(10)

Gradient Boosted Decision Tree Optimization View of AdaBoost

Voting Score and Margin

linear blending = LinModel + hypotheses as transform (+ constraints, crossed out)

G(x_n) = sign( Σ_{t=1}^{T} α_t g_t(x_n) )    [voting score: α_t plays the role of w_i and g_t(x_n) the role of φ_i(x_n)]

and hard-margin SVM margin = y_n · (w^T φ(x_n) + b) / ‖w‖, remember? :-)

y_n (voting score) = signed & unnormalized margin
⟸ want y_n (voting score) positive & large
⟸ want exp(−y_n (voting score)) small
⟸ want u_n^(T+1) small

claim: AdaBoost decreases Σ_{n=1}^{N} u_n^(t)

(11)

Gradient Boosted Decision Tree Optimization View of AdaBoost

AdaBoost Error Function

claim: AdaBoost decreases Σ_{n=1}^{N} u_n^(t) and thus somewhat minimizes

Σ_{n=1}^{N} u_n^(T+1) = (1/N) Σ_{n=1}^{N} exp(−y_n Σ_{t=1}^{T} α_t g_t(x_n))

linear score s = Σ_{t=1}^{T} α_t g_t(x_n)
• err_0/1(s, y) = [[ys ≤ 0]]
• err_ADA(s, y) = exp(−ys): upper bound of err_0/1 —called exponential error measure

[figure: err_0/1 and err_ADA plotted against ys]

err_ADA: algorithmic error measure by convex upper bound of err_0/1
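The figure (err_0/1 versus err_ADA against ys) can also be sanity-checked numerically; a minimal numpy sketch confirming that exp(−ys) upper-bounds the 0/1 error on the plotted range:

import numpy as np

ys = np.linspace(-3, 3, 601)              # horizontal axis of the slide's plot
err01 = (ys <= 0).astype(float)           # err_0/1(s, y) = [[ys <= 0]]
err_ada = np.exp(-ys)                     # err_ADA(s, y) = exp(-ys)
print(np.all(err_ada >= err01))           # True: convex upper bound of err_0/1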


(15)

Gradient Boosted Decision Tree Optimization View of AdaBoost

Gradient Descent on AdaBoost Error Function

recall: gradient descent (remember? :-)), at iteration t

min_{‖v‖=1} E_in(w_t + ηv) ≈ E_in(w_t) [known] + η [given positive] · v^T ∇E_in(w_t) [known]

at iteration t, to find g_t, solve

min_h Ê_ADA = (1/N) Σ_{n=1}^{N} exp(−y_n (Σ_{τ=1}^{t−1} α_τ g_τ(x_n) + ηh(x_n)))
            = Σ_{n=1}^{N} u_n^(t) exp(−y_n ηh(x_n))
   (taylor) ≈ Σ_{n=1}^{N} u_n^(t) (1 − y_n ηh(x_n)) = Σ_{n=1}^{N} u_n^(t) − η Σ_{n=1}^{N} u_n^(t) y_n h(x_n)

good h: minimize Σ_{n=1}^{N} u_n^(t) (−y_n h(x_n))

(16)

Gradient Boosted Decision Tree Optimization View of AdaBoost

Learning Hypothesis as Optimization

finding good h (function direction) ⇔ minimize Σ_{n=1}^{N} u_n^(t) (−y_n h(x_n))

for binary classification, where y_n and h(x_n) both ∈ {−1, +1}:

Σ_{n=1}^{N} u_n^(t) (−y_n h(x_n)) = Σ_{n=1}^{N} u_n^(t) · (−1 if y_n = h(x_n); +1 if y_n ≠ h(x_n))
  = −Σ_{n=1}^{N} u_n^(t) + Σ_{n=1}^{N} u_n^(t) · (0 if y_n = h(x_n); 2 if y_n ≠ h(x_n))
  = −Σ_{n=1}^{N} u_n^(t) + 2 E_in^{u(t)}(h) · N

—who minimizes E_in^{u(t)}(h)? A in AdaBoost! :-)

A: good g_t = h for 'gradient descent'
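A quick numeric check of the identity above (numpy, ±1 labels and predictions): the functional-gradient objective differs from the weighted in-sample error only by a constant and a positive scaling, so A minimizes both.

import numpy as np

rng = np.random.default_rng(1)
N = 7
u = rng.uniform(0.0, 1.0, size=N)             # example weights u_n^(t)
y = rng.choice([-1, 1], size=N)               # labels y_n
h = rng.choice([-1, 1], size=N)               # predictions h(x_n)

lhs = np.sum(u * (-y * h))                    # sum_n u_n^(t) (-y_n h(x_n))
E_in_u = np.sum(u * (y != h)) / N             # E_in^u(h), the weighted in-sample error
rhs = -np.sum(u) + 2 * E_in_u * N
print(np.isclose(lhs, rhs))                   # True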

(17)

Gradient Boosted Decision Tree Optimization View of AdaBoost

Deciding Blending Weight as Optimization

AdaBoost finds g_t by approximately min_h Ê_ADA = Σ_{n=1}^{N} u_n^(t) exp(−y_n ηh(x_n))

after finding g_t, how about min_η Ê_ADA = Σ_{n=1}^{N} u_n^(t) exp(−y_n ηg_t(x_n))?

optimal η_t somewhat 'greedily faster' than fixed (small) η
—called steepest descent for optimization

two cases inside summation:
• y_n = g_t(x_n): u_n^(t) exp(−η) (correct)
• y_n ≠ g_t(x_n): u_n^(t) exp(+η) (incorrect)

Ê_ADA = (Σ_{n=1}^{N} u_n^(t)) · ((1 − ε_t) exp(−η) + ε_t exp(+η))

by solving ∂Ê_ADA/∂η = 0: steepest η_t = ln √((1−ε_t)/ε_t) = α_t, remember? :-)

—AdaBoost: steepest descent with approximate functional gradient
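A short numeric sanity check (numpy, with an arbitrary ε_t = 0.3) that the closed-form steepest step matches a dense grid search over η:

import numpy as np

eps_t = 0.3                                                  # some weighted error rate
etas = np.linspace(-3, 3, 200001)                            # dense grid of step sizes
E = (1 - eps_t) * np.exp(-etas) + eps_t * np.exp(etas)       # E_ADA up to the constant sum of u_n
print(etas[np.argmin(E)])                                    # ~0.4236 from the grid search
print(np.log(np.sqrt((1 - eps_t) / eps_t)))                  # 0.4236... = steepest eta_t = alpha_t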

(18)

Gradient Boosted Decision Tree Optimization View of AdaBoost

Fun Time

With Ê_ADA = (Σ_{n=1}^{N} u_n^(t)) · ((1 − ε_t) exp(−η) + ε_t exp(+η)), which of the following is ∂Ê_ADA/∂η that can be used for solving the optimal η_t?

1 (Σ_{n=1}^{N} u_n^(t)) · (+(1 − ε_t) exp(−η) + ε_t exp(+η))
2 (Σ_{n=1}^{N} u_n^(t)) · (+(1 − ε_t) exp(−η) − ε_t exp(+η))
3 (Σ_{n=1}^{N} u_n^(t)) · (−(1 − ε_t) exp(−η) + ε_t exp(+η))
4 (Σ_{n=1}^{N} u_n^(t)) · (−(1 − ε_t) exp(−η) − ε_t exp(+η))

Reference Answer: 3

Differentiate exp(−η) and exp(+η) with respect to η and you should easily get the result.


(20)

Gradient Boosted Decision Tree Gradient Boosting

Gradient Boosting for Arbitrary Error Function

AdaBoost:
min_η min_h (1/N) Σ_{n=1}^{N} exp(−y_n (Σ_{τ=1}^{t−1} α_τ g_τ(x_n) + ηh(x_n)))
with binary-output hypothesis h

GradientBoost:
min_η min_h (1/N) Σ_{n=1}^{N} err(Σ_{τ=1}^{t−1} α_τ g_τ(x_n) + ηh(x_n), y_n)
with any hypothesis h (usually real-output hypothesis)

GradientBoost: allows extension to different err for regression/soft classification/etc.

(21)

Gradient Boosted Decision Tree Gradient Boosting

GradientBoost for Regression

min_η min_h (1/N) Σ_{n=1}^{N} err( Σ_{τ=1}^{t−1} α_τ g_τ(x_n) [call this s_n] + ηh(x_n), y_n )  with err(s, y) = (s − y)^2

min_h . . . (taylor) ≈ min_h (1/N) Σ_{n=1}^{N} err(s_n, y_n) [constant] + (1/N) Σ_{n=1}^{N} ηh(x_n) · ∂err(s, y_n)/∂s |_(s = s_n)
                     = min_h constants + (η/N) Σ_{n=1}^{N} h(x_n) · 2(s_n − y_n)

naïve solution: h(x_n) = −∞ · (s_n − y_n) if no constraint on h

(22)

Gradient Boosted Decision Tree Gradient Boosting

Learning Hypothesis as Optimization

min_h constants + (η/N) Σ_{n=1}^{N} 2 h(x_n)(s_n − y_n)

• magnitude of h does not matter: because η will be optimized next
• penalize large magnitude to avoid naïve solution

min_h constants + (η/N) Σ_{n=1}^{N} (2 h(x_n)(s_n − y_n) + (h(x_n))^2)
    = constants + (η/N) Σ_{n=1}^{N} (constant + (h(x_n) − (y_n − s_n))^2)

solution of penalized approximate functional gradient: squared-error regression on {(x_n, y_n − s_n [residual])}

GradientBoost for regression: find g_t = h by regression with residuals
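In code, one 'regression with residuals' step could look like the sketch below, assuming scikit-learn's DecisionTreeRegressor as the squared-error regression algorithm (the slide only requires some regressor; the helper name is illustrative and y, s are numpy arrays).

from sklearn.tree import DecisionTreeRegressor

def fit_residual_regressor(X, y, s, max_depth=3):
    """One functional-gradient step: fit g_t to the residuals y_n - s_n
    with a squared-error regressor (here a depth-limited regression tree)."""
    residual = y - s                      # the residual y_n - s_n
    return DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)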

(23)

Gradient Boosted Decision Tree Gradient Boosting

Deciding Blending Weight as Optimization

after finding g_t = h:

min_η (1/N) Σ_{n=1}^{N} err( Σ_{τ=1}^{t−1} α_τ g_τ(x_n) [= s_n] + ηg_t(x_n), y_n )  with err(s, y) = (s − y)^2

min_η (1/N) Σ_{n=1}^{N} (s_n + ηg_t(x_n) − y_n)^2 = (1/N) Σ_{n=1}^{N} ((y_n − s_n) − ηg_t(x_n))^2

—one-variable linear regression on {(g_t-transformed input, residual)}

GradientBoost for regression: α_t = optimal η by g_t-transformed linear regression
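The one-variable linear regression (through the origin) has a simple closed form; a minimal numpy sketch, where g_pred holds the values g_t(x_n) and residual holds y_n − s_n (both names are illustrative):

import numpy as np

def optimal_eta(g_pred, residual):
    """alpha_t = argmin_eta sum_n ((y_n - s_n) - eta * g_t(x_n))^2
               = sum_n g_t(x_n)(y_n - s_n) / sum_n g_t(x_n)^2."""
    return np.dot(g_pred, residual) / np.dot(g_pred, g_pred)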

(24)

Gradient Boosted Decision Tree Gradient Boosting

Putting Everything Together

Gradient Boosted Decision Tree (GBDT)

s_1 = s_2 = . . . = s_N = 0
for t = 1, 2, . . . , T
  1 obtain g_t by A({(x_n, y_n − s_n)}) where A is a (squared-error) regression algorithm
    —how about sampled and pruned C&RT?
  2 compute α_t = OneVarLinearRegression({(g_t(x_n), y_n − s_n)})
  3 update s_n ← s_n + α_t g_t(x_n)
return G(x) = Σ_{t=1}^{T} α_t g_t(x)

GBDT: 'regression sibling' of AdaBoost-DTree
—popular in practice
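Putting the three steps into runnable form, here is a minimal GBDT sketch under the slide's conventions (s_n initialized to 0, squared-error regression trees as A, α_t from one-variable linear regression). scikit-learn's DecisionTreeRegressor is an assumption, and practical implementations typically add shrinkage and subsampling, which the slide does not discuss.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt(X, y, T=100, max_depth=3):
    s = np.zeros(len(y))                                    # s_1 = ... = s_N = 0
    trees, alphas = [], []
    for _ in range(T):
        residual = y - s                                    # y_n - s_n
        g = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)   # step 1: A on {(x_n, y_n - s_n)}
        pred = g.predict(X)                                 # g_t(x_n)
        alpha = pred @ residual / (pred @ pred + 1e-12)     # step 2: one-variable linear regression
        s = s + alpha * pred                                # step 3: s_n <- s_n + alpha_t g_t(x_n)
        trees.append(g); alphas.append(alpha)
    return lambda Xq: sum(a * g.predict(Xq) for a, g in zip(alphas, trees))   # G(x) = sum_t alpha_t g_t(x)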

(25)

Gradient Boosted Decision Tree Gradient Boosting

Fun Time

Which of the following is the optimal η for
min_η (1/N) Σ_{n=1}^{N} ((y_n − s_n) − ηg_t(x_n))^2 ?

1 (Σ_{n=1}^{N} g_t(x_n)(y_n − s_n)) · (Σ_{n=1}^{N} g_t^2(x_n))
2 (Σ_{n=1}^{N} g_t(x_n)(y_n − s_n)) / (Σ_{n=1}^{N} g_t^2(x_n))
3 (Σ_{n=1}^{N} g_t(x_n)(y_n − s_n)) + (Σ_{n=1}^{N} g_t^2(x_n))
4 (Σ_{n=1}^{N} g_t(x_n)(y_n − s_n)) − (Σ_{n=1}^{N} g_t^2(x_n))

Reference Answer: 2

Derived within Lecture 9 of ML Foundations, remember? :-)


(27)

Gradient Boosted Decision Tree Summary of Aggregation Models

Map of Blending Models

blending: aggregate after getting diverse g_t

• uniform: simple voting/averaging of g_t
• non-uniform: linear model on g_t-transformed inputs
• conditional: nonlinear model on g_t-transformed inputs

uniform for 'stability'; non-uniform/conditional carefully for 'complexity'

(28)

Gradient Boosted Decision Tree Summary of Aggregation Models

Map of Aggregation-Learning Models

learning: aggregate as well as getting diverse g_t

• Bagging: diverse g_t by bootstrapping; uniform vote by nothing :-)
• AdaBoost: diverse g_t by reweighting; linear vote by steepest search
• Decision Tree: diverse g_t by data splitting; conditional vote by branching
• GradientBoost: diverse g_t by residual fitting; linear vote by steepest search

boosting-like algorithms most popular

(29)

Gradient Boosted Decision Tree Summary of Aggregation Models

Map of Aggregation of Aggregation Models

• Random Forest = randomized Bagging + 'strong' DTree
• AdaBoost-DTree = AdaBoost + 'weak' DTree
• GBDT = GradientBoost + 'weak' DTree

all three frequently used in practice

(30)

Gradient Boosted Decision Tree Summary of Aggregation Models

Specialty of Aggregation Models

cure underfitting: G(x) 'strong'; aggregation ⟹ feature transform
cure overfitting: G(x) 'moderate'; aggregation ⟹ regularization

proper aggregation (a.k.a. 'ensemble') ⟹ better performance

(31)

Gradient Boosted Decision Tree Summary of Aggregation Models

Fun Time

Which of the following aggregation models learns diverse g_t by reweighting and calculates the linear vote by steepest search?

1 AdaBoost
2 Random Forest
3 Decision Tree
4 Linear Blending

Reference Answer: 1

Congratulations on being an expert in aggregation models! :-)


(33)

Gradient Boosted Decision Tree Summary of Aggregation Models

Summary

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

Lecture 11: Gradient Boosted Decision Tree
• Adaptive Boosted Decision Tree: sampling and pruning for 'weak' trees
• Optimization View of AdaBoost: functional gradient descent on exponential error
• Gradient Boosting: iterative steepest residual fitting
• Summary of Aggregation Models: some cure underfitting; some cure overfitting

3 Distilling Implicit Features: Extraction Models

next: extract features other than hypotheses
