
(1)

Machine Learning Techniques (機器學習技法)

Lecture 11: Gradient Boosted Decision Tree

Hsuan-Tien Lin (林軒田)

htlin@csie.ntu.edu.tw

Department of Computer Science

& Information Engineering

National Taiwan University

(國立台灣大學資訊工程系)

(2)

Gradient Boosted Decision Tree

Roadmap

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

Lecture 10: Random Forest
bagging of randomized C&RT trees with automatic validation and feature selection

Lecture 11: Gradient Boosted Decision Tree
Adaptive Boosted Decision Tree
Optimization View of AdaBoost
Gradient Boosting
Summary of Aggregation Models

3 Distilling Implicit Features: Extraction Models

(3)

Gradient Boosted Decision Tree Adaptive Boosted Decision Tree

From Random Forest to AdaBoost-DTree

function RandomForest(D)
  for t = 1, 2, . . . , T
    1 request size-N′ data D̃_t by bootstrapping with D
    2 obtain tree g_t by Randomized-DTree(D̃_t)
  return G = Uniform({g_t})

function AdaBoost-DTree(D)
  for t = 1, 2, . . . , T
    1 reweight data by u^(t)
    2 obtain tree g_t by DTree(D, u^(t))
    3 calculate 'vote' α_t of g_t
  return G = LinearHypo({(g_t, α_t)})

need: weighted DTree(D, u^(t))
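As a concrete reference point for the left-hand skeleton, here is a minimal Python sketch of the RandomForest side, assuming numpy and scikit-learn's DecisionTreeClassifier as a rough stand-in for Randomized-DTree, with ±1 labels; the function name random_forest is illustrative only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(X, y, T=100, rng=None):
    """Bagging of randomized trees with a uniform vote (labels assumed in {-1, +1})."""
    rng = rng or np.random.default_rng(0)
    N = len(X)
    trees = []
    for _ in range(T):
        idx = rng.choice(N, size=N, replace=True)   # size-N' bootstrap sample D~_t (N' = N here)
        # max_features adds per-branch feature randomness, a rough stand-in for Randomized-DTree
        trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))
    return lambda Xq: np.sign(sum(g.predict(Xq) for g in trees))   # G = Uniform({g_t})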

(4)

Gradient Boosted Decision Tree Adaptive Boosted Decision Tree

Weighted Decision Tree Algorithm

Weighted Algorithm
minimize (regularized) E_in^u(h) = (1/N) Σ_{n=1}^{N} u_n · err(y_n, h(x_n))

if using existing algorithm as black box (no modifications), to get E_in^u approximately optimized...

'Weighted' Algorithm in Bagging
weights u expressed by bootstrap-sampled copies
—request size-N′ data D̃_t by bootstrapping with D

A General Randomized Base Algorithm
weights u expressed by sampling proportional to u_n
—request size-N′ data D̃_t by sampling ∝ u on D

AdaBoost-DTree: often via AdaBoost + sampling ∝ u^(t) + DTree(D̃_t) without modifying DTree
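The 'weights expressed by sampling' idea takes only a few lines of numpy; this is a minimal sketch, and the helper name sample_by_weight is not from the slides.

import numpy as np

def sample_by_weight(X, y, u, n_prime=None, rng=None):
    """Request a size-N' data set D~_t by sampling examples with probability
    proportional to u_n, so an unmodified base algorithm can be trained on it."""
    rng = rng or np.random.default_rng(0)
    n_prime = n_prime or len(X)
    p = np.asarray(u, dtype=float)
    p = p / p.sum()                                   # turn weights into sampling probabilities
    idx = rng.choice(len(X), size=n_prime, replace=True, p=p)
    return X[idx], y[idx]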

(5)

Gradient Boosted Decision Tree Adaptive Boosted Decision Tree

Weak Decision Tree Algorithm

AdaBoost: votes α_t = ln ♦_t = ln √((1−ε_t)/ε_t) with weighted error rate ε_t

if fully grown tree trained on all x_n
⟹ E_in(g_t) = 0 if all x_n different
⟹ E_in^u(g_t) = 0
⟹ ε_t = 0
⟹ α_t = ∞ (autocracy!!)

need: pruned tree trained on some x_n to be weak
• pruned: usual pruning, or just limiting tree height
• some: sampling ∝ u^(t)

AdaBoost-DTree: often via AdaBoost + sampling ∝ u^(t) + pruned DTree(D̃_t)
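Combining the sampling trick with a height-limited tree gives a rough AdaBoost-DTree sketch; scikit-learn's DecisionTreeClassifier (with max_depth acting as the 'pruning') is an assumption for illustration, and labels are assumed ±1.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_dtree(X, y, T=100, max_depth=3, rng=None):
    rng = rng or np.random.default_rng(0)
    N = len(X)
    u = np.full(N, 1.0 / N)                                   # u^(1)
    trees, alphas = [], []
    for _ in range(T):
        idx = rng.choice(N, size=N, replace=True, p=u / u.sum())             # sampling ~ u^(t)
        g = DecisionTreeClassifier(max_depth=max_depth).fit(X[idx], y[idx])  # 'weak' pruned tree
        pred = g.predict(X)
        eps = np.sum(u * (pred != y)) / np.sum(u)             # weighted error rate eps_t
        eps = np.clip(eps, 1e-10, 1 - 1e-10)                  # guard against eps_t = 0 (autocracy)
        alpha = np.log(np.sqrt((1 - eps) / eps))              # alpha_t = ln sqrt((1 - eps_t)/eps_t)
        u = u * np.exp(-alpha * y * pred)                     # reweight for the next round
        trees.append(g); alphas.append(alpha)
    return lambda Xq: np.sign(sum(a * g.predict(Xq) for a, g in zip(alphas, trees)))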

(6)

Gradient Boosted Decision Tree Adaptive Boosted Decision Tree

AdaBoost with Extremely-Pruned Tree

what if DTree with height ≤ 1 (extremely pruned)?

DTree (C&RT) with height ≤ 1: learn branching criteria

b(x) = argmin_{decision stumps h(x)} Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)

—if impurity = binary classification error: just a decision stump, remember? :-)

AdaBoost-Stump = special case of AdaBoost-DTree
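For the height-≤-1 case, the branching criterion reduces to a search over decision stumps; below is a brute-force numpy sketch using weighted classification error as the impurity (the weighted analogue of the criterion above). The helper name weighted_stump and the O(d·N²) search are illustrative, not from the slides.

import numpy as np

def weighted_stump(X, y, u):
    """Height-1 C&RT with classification-error impurity: the stump
    h(x) = s * sign(x_i - theta) minimizing the weighted error."""
    N, d = X.shape
    best = (np.inf, 0, -np.inf, 1)                          # (error, feature, theta, direction)
    for i in range(d):
        for theta in np.r_[-np.inf, np.unique(X[:, i])]:    # candidate thresholds
            for s in (+1, -1):
                pred = s * np.where(X[:, i] > theta, 1, -1)
                err = np.sum(u * (pred != y))               # weighted classification error
                if err < best[0]:
                    best = (err, i, theta, s)
    _, i, theta, s = best
    return lambda Xq: s * np.where(Xq[:, i] > theta, 1, -1)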

(7)

Gradient Boosted Decision Tree Adaptive Boosted Decision Tree

Fun Time

When running AdaBoost-DTree with sampling, suppose the decision tree g_t achieves zero error on the sampled data set D̃_t. Which of the following is possible?

1 α_t < 0
2 α_t = 0
3 α_t > 0
4 all of the above

Reference Answer: 4

While g_t achieves zero error on D̃_t, g_t may not achieve zero weighted error on (D, u^(t)), and hence ε_t can be anything, even ≥ 1/2. Then, α_t can be ≤ 0.


(9)

Gradient Boosted Decision Tree Optimization View of AdaBoost

Example Weights of AdaBoost

u_n^(t+1) = u_n^(t) · ♦_t if incorrect, u_n^(t) / ♦_t if correct
          = u_n^(t) · ♦_t^(−y_n g_t(x_n))
          = u_n^(t) · exp(−y_n α_t g_t(x_n))

u_n^(T+1) = u_n^(1) · Π_{t=1}^{T} exp(−y_n α_t g_t(x_n)) = (1/N) · exp(−y_n Σ_{t=1}^{T} α_t g_t(x_n))

recall: G(x) = sign(Σ_{t=1}^{T} α_t g_t(x))

Σ_{t=1}^{T} α_t g_t(x): voting score of {g_t} on x

AdaBoost: u_n^(T+1) ∝ exp(−y_n (voting score on x_n))
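A tiny numeric check (numpy, toy ±1 labels and predictions) that the round-by-round update indeed telescopes into the closed form u_n^(T+1) = (1/N) · exp(−y_n · voting score on x_n):

import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 4
y = rng.choice([-1, 1], size=N)               # toy labels y_n
G = rng.choice([-1, 1], size=(T, N))          # toy predictions g_t(x_n)
alpha = rng.uniform(0.1, 1.0, size=T)         # toy votes alpha_t

u = np.full(N, 1.0 / N)                       # u^(1) = 1/N
for t in range(T):
    u = u * np.exp(-y * alpha[t] * G[t])      # u^(t+1) = u^(t) * exp(-y_n alpha_t g_t(x_n))

closed_form = np.exp(-y * (alpha[:, None] * G).sum(axis=0)) / N
print(np.allclose(u, closed_form))            # True: weight proportional to exp(-y_n * voting score)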

(10)

Gradient Boosted Decision Tree Optimization View of AdaBoost

Voting Score and Margin

linear blending = LinModel + hypotheses as transform (+ constraints, crossed out)

G(x_n) = sign( Σ_{t=1}^{T} α_t g_t(x_n) )    [voting score: α_t plays the role of w_i and g_t(x_n) the role of φ_i(x_n)]

and hard-margin SVM margin = y_n · (w^T φ(x_n) + b) / ‖w‖, remember? :-)

y_n (voting score) = signed & unnormalized margin
⟸ want y_n (voting score) positive & large
⟸ want exp(−y_n (voting score)) small
⟸ want u_n^(T+1) small

claim: AdaBoost decreases Σ_{n=1}^{N} u_n^(t)

(11)

Gradient Boosted Decision Tree Optimization View of AdaBoost

AdaBoost Error Function

claim: AdaBoost decreases Σ_{n=1}^{N} u_n^(t) and thus somewhat minimizes

Σ_{n=1}^{N} u_n^(T+1) = (1/N) Σ_{n=1}^{N} exp(−y_n Σ_{t=1}^{T} α_t g_t(x_n))

linear score s = Σ_{t=1}^{T} α_t g_t(x_n)
• err_0/1(s, y) = [[ys ≤ 0]]
• err_ADA(s, y) = exp(−ys): upper bound of err_0/1 —called exponential error measure

[figure: err_0/1 and err_ADA plotted against ys]

err_ADA: algorithmic error measure by convex upper bound of err_0/1
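The figure (err_0/1 versus err_ADA against ys) can also be sanity-checked numerically; a minimal numpy sketch confirming that exp(−ys) upper-bounds the 0/1 error on the plotted range:

import numpy as np

ys = np.linspace(-3, 3, 601)              # horizontal axis of the slide's plot
err01 = (ys <= 0).astype(float)           # err_0/1(s, y) = [[ys <= 0]]
err_ada = np.exp(-ys)                     # err_ADA(s, y) = exp(-ys)
print(np.all(err_ada >= err01))           # True: convex upper bound of err_0/1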


(15)

Gradient Boosted Decision Tree Optimization View of AdaBoost

Gradient Descent on AdaBoost Error Function

recall: gradient descent (remember? :-)), at iteration t

min_{‖v‖=1} E_in(w_t + ηv) ≈ E_in(w_t) [known] + η [given positive] · v^T ∇E_in(w_t) [known]

at iteration t, to find g_t, solve

min_h Ê_ADA = (1/N) Σ_{n=1}^{N} exp(−y_n (Σ_{τ=1}^{t−1} α_τ g_τ(x_n) + ηh(x_n)))
            = Σ_{n=1}^{N} u_n^(t) exp(−y_n ηh(x_n))
   (taylor) ≈ Σ_{n=1}^{N} u_n^(t) (1 − y_n ηh(x_n)) = Σ_{n=1}^{N} u_n^(t) − η Σ_{n=1}^{N} u_n^(t) y_n h(x_n)

good h: minimize Σ_{n=1}^{N} u_n^(t) (−y_n h(x_n))

(16)

Gradient Boosted Decision Tree Optimization View of AdaBoost

Learning Hypothesis as Optimization

finding good h (function direction) ⇔ minimize Σ_{n=1}^{N} u_n^(t) (−y_n h(x_n))

for binary classification, where y_n and h(x_n) both ∈ {−1, +1}:

Σ_{n=1}^{N} u_n^(t) (−y_n h(x_n)) = Σ_{n=1}^{N} u_n^(t) · (−1 if y_n = h(x_n); +1 if y_n ≠ h(x_n))
  = −Σ_{n=1}^{N} u_n^(t) + Σ_{n=1}^{N} u_n^(t) · (0 if y_n = h(x_n); 2 if y_n ≠ h(x_n))
  = −Σ_{n=1}^{N} u_n^(t) + 2 E_in^{u(t)}(h) · N

—who minimizes E_in^{u(t)}(h)? A in AdaBoost! :-)

A: good g_t = h for 'gradient descent'
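A quick numeric check of the identity above (numpy, ±1 labels and predictions): the functional-gradient objective differs from the weighted in-sample error only by a constant and a positive scaling, so A minimizes both.

import numpy as np

rng = np.random.default_rng(1)
N = 7
u = rng.uniform(0.0, 1.0, size=N)             # example weights u_n^(t)
y = rng.choice([-1, 1], size=N)               # labels y_n
h = rng.choice([-1, 1], size=N)               # predictions h(x_n)

lhs = np.sum(u * (-y * h))                    # sum_n u_n^(t) (-y_n h(x_n))
E_in_u = np.sum(u * (y != h)) / N             # E_in^u(h), the weighted in-sample error
rhs = -np.sum(u) + 2 * E_in_u * N
print(np.isclose(lhs, rhs))                   # True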

(17)

Gradient Boosted Decision Tree Optimization View of AdaBoost

Deciding Blending Weight as Optimization

AdaBoost finds g_t by approximately min_h Ê_ADA = Σ_{n=1}^{N} u_n^(t) exp(−y_n ηh(x_n))

after finding g_t, how about min_η Ê_ADA = Σ_{n=1}^{N} u_n^(t) exp(−y_n ηg_t(x_n))?

optimal η_t somewhat 'greedily faster' than fixed (small) η
—called steepest descent for optimization

two cases inside summation:
• y_n = g_t(x_n): u_n^(t) exp(−η) (correct)
• y_n ≠ g_t(x_n): u_n^(t) exp(+η) (incorrect)

Ê_ADA = (Σ_{n=1}^{N} u_n^(t)) · ((1 − ε_t) exp(−η) + ε_t exp(+η))

by solving ∂Ê_ADA/∂η = 0: steepest η_t = ln √((1−ε_t)/ε_t) = α_t, remember? :-)

—AdaBoost: steepest descent with approximate functional gradient
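A short numeric sanity check (numpy, with an arbitrary ε_t = 0.3) that the closed-form steepest step matches a dense grid search over η:

import numpy as np

eps_t = 0.3                                                  # some weighted error rate
etas = np.linspace(-3, 3, 200001)                            # dense grid of step sizes
E = (1 - eps_t) * np.exp(-etas) + eps_t * np.exp(etas)       # E_ADA up to the constant sum of u_n
print(etas[np.argmin(E)])                                    # ~0.4236 from the grid search
print(np.log(np.sqrt((1 - eps_t) / eps_t)))                  # 0.4236... = steepest eta_t = alpha_t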

(18)

Gradient Boosted Decision Tree Optimization View of AdaBoost

Fun Time

With Ê_ADA = (Σ_{n=1}^{N} u_n^(t)) · ((1 − ε_t) exp(−η) + ε_t exp(+η)), which of the following is ∂Ê_ADA/∂η that can be used for solving the optimal η_t?

1 (Σ_{n=1}^{N} u_n^(t)) · (+(1 − ε_t) exp(−η) + ε_t exp(+η))
2 (Σ_{n=1}^{N} u_n^(t)) · (+(1 − ε_t) exp(−η) − ε_t exp(+η))
3 (Σ_{n=1}^{N} u_n^(t)) · (−(1 − ε_t) exp(−η) + ε_t exp(+η))
4 (Σ_{n=1}^{N} u_n^(t)) · (−(1 − ε_t) exp(−η) − ε_t exp(+η))

Reference Answer: 3

Differentiate exp(−η) and exp(+η) with respect to η and you should easily get the result.


(20)

Gradient Boosted Decision Tree Gradient Boosting

Gradient Boosting for Arbitrary Error Function

AdaBoost:
min_η min_h (1/N) Σ_{n=1}^{N} exp(−y_n (Σ_{τ=1}^{t−1} α_τ g_τ(x_n) + ηh(x_n)))
with binary-output hypothesis h

GradientBoost:
min_η min_h (1/N) Σ_{n=1}^{N} err(Σ_{τ=1}^{t−1} α_τ g_τ(x_n) + ηh(x_n), y_n)
with any hypothesis h (usually real-output hypothesis)

GradientBoost: allows extension to different err for regression/soft classification/etc.

(21)

Gradient Boosted Decision Tree Gradient Boosting

GradientBoost for Regression

min_η min_h (1/N) Σ_{n=1}^{N} err( Σ_{τ=1}^{t−1} α_τ g_τ(x_n) [call this s_n] + ηh(x_n), y_n )  with err(s, y) = (s − y)^2

min_h . . . (taylor) ≈ min_h (1/N) Σ_{n=1}^{N} err(s_n, y_n) [constant] + (1/N) Σ_{n=1}^{N} ηh(x_n) · ∂err(s, y_n)/∂s |_(s = s_n)
                     = min_h constants + (η/N) Σ_{n=1}^{N} h(x_n) · 2(s_n − y_n)

naïve solution: h(x_n) = −∞ · (s_n − y_n) if no constraint on h

(22)

Gradient Boosted Decision Tree Gradient Boosting

Learning Hypothesis as Optimization

min_h constants + (η/N) Σ_{n=1}^{N} 2 h(x_n)(s_n − y_n)

• magnitude of h does not matter: because η will be optimized next
• penalize large magnitude to avoid naïve solution

min_h constants + (η/N) Σ_{n=1}^{N} (2 h(x_n)(s_n − y_n) + (h(x_n))^2)
    = constants + (η/N) Σ_{n=1}^{N} (constant + (h(x_n) − (y_n − s_n))^2)

solution of penalized approximate functional gradient: squared-error regression on {(x_n, y_n − s_n [residual])}

GradientBoost for regression: find g_t = h by regression with residuals
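In code, one 'regression with residuals' step could look like the sketch below, assuming scikit-learn's DecisionTreeRegressor as the squared-error regression algorithm (the slide only requires some regressor; the helper name is illustrative and y, s are numpy arrays).

from sklearn.tree import DecisionTreeRegressor

def fit_residual_regressor(X, y, s, max_depth=3):
    """One functional-gradient step: fit g_t to the residuals y_n - s_n
    with a squared-error regressor (here a depth-limited regression tree)."""
    residual = y - s                      # the residual y_n - s_n
    return DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)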

(23)

Gradient Boosted Decision Tree Gradient Boosting

Deciding Blending Weight as Optimization

after finding g_t = h:

min_η (1/N) Σ_{n=1}^{N} err( Σ_{τ=1}^{t−1} α_τ g_τ(x_n) [= s_n] + ηg_t(x_n), y_n )  with err(s, y) = (s − y)^2

min_η (1/N) Σ_{n=1}^{N} (s_n + ηg_t(x_n) − y_n)^2 = (1/N) Σ_{n=1}^{N} ((y_n − s_n) − ηg_t(x_n))^2

—one-variable linear regression on {(g_t-transformed input, residual)}

GradientBoost for regression: α_t = optimal η by g_t-transformed linear regression
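The one-variable linear regression (through the origin) has a simple closed form; a minimal numpy sketch, where g_pred holds the values g_t(x_n) and residual holds y_n − s_n (both names are illustrative):

import numpy as np

def optimal_eta(g_pred, residual):
    """alpha_t = argmin_eta sum_n ((y_n - s_n) - eta * g_t(x_n))^2
               = sum_n g_t(x_n)(y_n - s_n) / sum_n g_t(x_n)^2."""
    return np.dot(g_pred, residual) / np.dot(g_pred, g_pred)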

(24)

Gradient Boosted Decision Tree Gradient Boosting

Putting Everything Together

Gradient Boosted Decision Tree (GBDT)

s_1 = s_2 = . . . = s_N = 0
for t = 1, 2, . . . , T
  1 obtain g_t by A({(x_n, y_n − s_n)}) where A is a (squared-error) regression algorithm
    —how about sampled and pruned C&RT?
  2 compute α_t = OneVarLinearRegression({(g_t(x_n), y_n − s_n)})
  3 update s_n ← s_n + α_t g_t(x_n)
return G(x) = Σ_{t=1}^{T} α_t g_t(x)

GBDT: 'regression sibling' of AdaBoost-DTree
—popular in practice
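Putting the three steps into runnable form, here is a minimal GBDT sketch under the slide's conventions (s_n initialized to 0, squared-error regression trees as A, α_t from one-variable linear regression). scikit-learn's DecisionTreeRegressor is an assumption, and practical implementations typically add shrinkage and subsampling, which the slide does not discuss.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt(X, y, T=100, max_depth=3):
    s = np.zeros(len(y))                                    # s_1 = ... = s_N = 0
    trees, alphas = [], []
    for _ in range(T):
        residual = y - s                                    # y_n - s_n
        g = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)   # step 1: A on {(x_n, y_n - s_n)}
        pred = g.predict(X)                                 # g_t(x_n)
        alpha = pred @ residual / (pred @ pred + 1e-12)     # step 2: one-variable linear regression
        s = s + alpha * pred                                # step 3: s_n <- s_n + alpha_t g_t(x_n)
        trees.append(g); alphas.append(alpha)
    return lambda Xq: sum(a * g.predict(Xq) for a, g in zip(alphas, trees))   # G(x) = sum_t alpha_t g_t(x)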

(25)

Gradient Boosted Decision Tree Gradient Boosting

Fun Time

Which of the following is the optimal η for
min_η (1/N) Σ_{n=1}^{N} ((y_n − s_n) − ηg_t(x_n))^2 ?

1 (Σ_{n=1}^{N} g_t(x_n)(y_n − s_n)) · (Σ_{n=1}^{N} g_t^2(x_n))
2 (Σ_{n=1}^{N} g_t(x_n)(y_n − s_n)) / (Σ_{n=1}^{N} g_t^2(x_n))
3 (Σ_{n=1}^{N} g_t(x_n)(y_n − s_n)) + (Σ_{n=1}^{N} g_t^2(x_n))
4 (Σ_{n=1}^{N} g_t(x_n)(y_n − s_n)) − (Σ_{n=1}^{N} g_t^2(x_n))

Reference Answer: 2

Derived within Lecture 9 of ML Foundations, remember? :-)


(27)

Gradient Boosted Decision Tree Summary of Aggregation Models

Map of Blending Models

blending: aggregate after getting diverse g_t

• uniform: simple voting/averaging of g_t
• non-uniform: linear model on g_t-transformed inputs
• conditional: nonlinear model on g_t-transformed inputs

uniform for 'stability'; non-uniform/conditional carefully for 'complexity'

(28)

Gradient Boosted Decision Tree Summary of Aggregation Models

Map of Aggregation-Learning Models

learning: aggregate as well as getting diverse g_t

• Bagging: diverse g_t by bootstrapping; uniform vote by nothing :-)
• AdaBoost: diverse g_t by reweighting; linear vote by steepest search
• Decision Tree: diverse g_t by data splitting; conditional vote by branching
• GradientBoost: diverse g_t by residual fitting; linear vote by steepest search

boosting-like algorithms most popular

(29)

Gradient Boosted Decision Tree Summary of Aggregation Models

Map of Aggregation of Aggregation Models

• Random Forest = randomized Bagging + 'strong' DTree
• AdaBoost-DTree = AdaBoost + 'weak' DTree
• GBDT = GradientBoost + 'weak' DTree

all three frequently used in practice

(30)

Gradient Boosted Decision Tree Summary of Aggregation Models

Specialty of Aggregation Models

cure underfitting: G(x) 'strong'; aggregation ⟹ feature transform
cure overfitting: G(x) 'moderate'; aggregation ⟹ regularization

proper aggregation (a.k.a. 'ensemble') ⟹ better performance

(31)

Gradient Boosted Decision Tree Summary of Aggregation Models

Fun Time

Which of the following aggregation models learns diverse g_t by reweighting and calculates the linear vote by steepest search?

1 AdaBoost
2 Random Forest
3 Decision Tree
4 Linear Blending

Reference Answer: 1

Congratulations on being an expert in aggregation models! :-)


(33)

Gradient Boosted Decision Tree Summary of Aggregation Models

Summary

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

Lecture 11: Gradient Boosted Decision Tree
• Adaptive Boosted Decision Tree: sampling and pruning for 'weak' trees
• Optimization View of AdaBoost: functional gradient descent on exponential error
• Gradient Boosting: iterative steepest residual fitting
• Summary of Aggregation Models: some cure underfitting; some cure overfitting

3 Distilling Implicit Features: Extraction Models

next: extract features other than hypotheses
