(1)

Machine Learning Techniques
(機器學習技法)

Lecture 11: Gradient Boosted Decision Tree

Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/25

(2)

Gradient Boosted Decision Tree

Roadmap

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

Lecture 10: Random Forest
bagging of randomized C&RT trees with automatic validation and feature selection

Lecture 11: Gradient Boosted Decision Tree
Adaptive Boosted Decision Tree
Optimization View of AdaBoost
Gradient Boosting
Summary of Aggregation Models

3 Distilling Implicit Features: Extraction Models

(3)

Gradient Boosted Decision Tree / Adaptive Boosted Decision Tree

From Random Forest to AdaBoost-DTree

function RandomForest(D)
  for t = 1, 2, . . . , T
    1. request size-N′ data D̃_t by bootstrapping with D
    2. obtain tree g_t by Randomized-DTree(D̃_t)
  return G = Uniform({g_t})

function AdaBoost-DTree(D)
  for t = 1, 2, . . . , T
    1. reweight data by u^(t)
    2. obtain tree g_t by DTree(D, u^(t))
    3. calculate 'vote' α_t of g_t
  return G = LinearHypo({(g_t, α_t)})

need: weighted DTree(D, u^(t))

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 2/25
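A minimal Python sketch of the two loops above, assuming NumPy and hypothetical base learners randomized_dtree(X, y) and weighted_dtree(X, y, u) that return fitted models with a predict method (not the course code):

import numpy as np

def random_forest(X, y, T, randomized_dtree):
    """Bagging of randomized fully-grown trees, uniform vote."""
    N, trees = len(X), []
    for _ in range(T):
        idx = np.random.randint(0, N, size=N)      # bootstrap a size-N′ copy of D (here N′ = N)
        trees.append(randomized_dtree(X[idx], y[idx]))
    return trees                                   # G = Uniform({g_t})

def adaboost_dtree(X, y, T, weighted_dtree):
    """Boosting of weighted trees, linear vote by alpha_t."""
    N = len(X)
    u = np.full(N, 1.0 / N)                        # u^(1)
    ensemble = []
    for _ in range(T):
        g = weighted_dtree(X, y, u)                # the missing piece: weighted DTree(D, u^(t))
        pred = g.predict(X)
        eps = np.sum(u * (pred != y)) / np.sum(u)  # weighted error rate epsilon_t (assume 0 < eps < 1)
        diamond = np.sqrt((1.0 - eps) / eps)
        alpha = np.log(diamond)                    # 'vote' alpha_t of g_t
        u = u * np.where(pred != y, diamond, 1.0 / diamond)
        ensemble.append((g, alpha))
    return ensemble                                # G = LinearHypo({(g_t, alpha_t)})

The only piece AdaBoost-DTree still needs is the weighted base learner, which the next slide addresses.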

(9)

Gradient Boosted Decision Tree / Adaptive Boosted Decision Tree

Weighted Decision Tree Algorithm

Weighted Algorithm
minimize (regularized) E_in^u(h) = (1/N) Σ_{n=1}^N u_n · err(y_n, h(x_n))

if using existing algorithm as black box (no modifications), to get E_in^u approximately optimized...

'Weighted' Algorithm in Bagging
weights u expressed by bootstrap-sampled copies:
request size-N′ data D̃_t by bootstrapping with D

A General Randomized Base Algorithm
weights u expressed by sampling proportional to u_n:
request size-N′ data D̃_t by sampling ∝ u on D

AdaBoost-DTree: often via AdaBoost + sampling ∝ u^(t) + DTree(D̃_t), without modifying DTree

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 3/25
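One way to realize the sampling view above, as a sketch (assumes NumPy; the function name is hypothetical):

import numpy as np

def sample_by_weight(X, y, u, n_prime, rng=None):
    """Request a size-N′ data set D̃_t by sampling proportional to u_n,
    so an unmodified (black-box) DTree on D̃_t roughly optimizes E_in^u."""
    rng = rng or np.random.default_rng()
    p = np.asarray(u, dtype=float)
    p = p / p.sum()                                # turn weights into sampling probabilities
    idx = rng.choice(len(X), size=n_prime, replace=True, p=p)
    return X[idx], y[idx]

Bagging is the special case of uniform u: every example is drawn with probability 1/N.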

(15)

Gradient Boosted Decision Tree / Adaptive Boosted Decision Tree

Weak Decision Tree Algorithm

AdaBoost: votes α_t = ln ♦_t = ln √((1 − ε_t)/ε_t), with weighted error rate ε_t

if fully grown tree trained on all x_n
⟹ E_in(g_t) = 0 if all x_n different
⟹ E_in^u(g_t) = 0
⟹ ε_t = 0
⟹ α_t = ∞ (autocracy!!)

need: pruned tree trained on some x_n to be weak
• pruned: usual pruning, or just limiting tree height
• some: sampling ∝ u^(t)

AdaBoost-DTree: often via AdaBoost + sampling ∝ u^(t) + pruned DTree(D̃_t)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 4/25
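A sketch of the 'often via' recipe, using scikit-learn's DecisionTreeClassifier with max_depth as a stand-in for pruning by limiting tree height (an assumption for illustration, not the course implementation):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def weak_tree(X, y, u, max_depth=3, n_prime=None, rng=None):
    """One AdaBoost-DTree base learner: sample ∝ u^(t), then fit a height-limited tree."""
    rng = rng or np.random.default_rng()
    n_prime = n_prime or len(X)
    p = np.asarray(u, dtype=float)
    p = p / p.sum()
    idx = rng.choice(len(X), size=n_prime, replace=True, p=p)   # 'some x_n'
    g = DecisionTreeClassifier(max_depth=max_depth)             # 'pruned' by limiting height
    g.fit(X[idx], y[idx])
    return g

Such a g_t is usually weak enough that ε_t > 0 and α_t stays finite.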

(28)

Gradient Boosted Decision Tree / Adaptive Boosted Decision Tree

AdaBoost with Extremely-Pruned Tree

what if DTree with height ≤ 1 (extremely pruned)?

DTree (C&RT) with height ≤ 1
learn branching criterion
  b(x) = argmin over decision stumps h(x) of Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)

if impurity = binary classification error, just a decision stump, remember? :-)

AdaBoost-Stump = special case of AdaBoost-DTree

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 5/25
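A sketch of the height-≤1 branching rule with binary classification error as the impurity, i.e. an exhaustive decision-stump search over (feature, threshold) pairs (hypothetical helper, assumes NumPy and ±1 labels):

import numpy as np

def stump_branching(X, y):
    """b(x) = argmin over decision stumps of sum_c |D_c| * impurity(D_c),
    with impurity = binary classification error."""
    best_cost, best_stump = np.inf, None
    for i in range(X.shape[1]):
        for theta in np.unique(X[:, i]):
            left, right = y[X[:, i] <= theta], y[X[:, i] > theta]
            cost = 0.0
            for part in (left, right):
                if len(part):
                    err = min(np.mean(part == +1), np.mean(part == -1))  # error of majority vote
                    cost += len(part) * err                              # |D_c| * impurity(D_c)
            if cost < best_cost:
                best_cost, best_stump = cost, (i, theta)
    return best_stump   # (feature index, threshold)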

(32)

Gradient Boosted Decision Tree / Adaptive Boosted Decision Tree

Fun Time

When running AdaBoost-DTree with sampling and getting a decision tree g_t that achieves zero error on the sampled data set D̃_t, which of the following is possible?

1 α_t < 0
2 α_t = 0
3 α_t > 0
4 all of the above

Reference Answer: 4

While g_t achieves zero error on D̃_t, g_t may not achieve zero weighted error on (D, u^(t)), and hence ε_t can be anything, even ≥ 1/2. Then, α_t can be ≤ 0.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 6/25

(34)

Gradient Boosted Decision Tree / Optimization View of AdaBoost

Example Weights of AdaBoost

u_n^(t+1) = u_n^(t) · ♦_t   if incorrect
            u_n^(t) / ♦_t   if correct
          = u_n^(t) · ♦_t^(−y_n g_t(x_n))
          = u_n^(t) · exp(−y_n α_t g_t(x_n))

u_n^(T+1) = u_n^(1) · Π_{t=1}^T exp(−y_n α_t g_t(x_n)) = (1/N) · exp(−y_n Σ_{t=1}^T α_t g_t(x_n))

recall: G(x) = sign(Σ_{t=1}^T α_t g_t(x))

Σ_{t=1}^T α_t g_t(x): voting score of {g_t} on x

AdaBoost: u_n^(T+1) ∝ exp(−y_n (voting score on x_n))

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 7/25
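A quick numeric check (assuming ±1 labels and predictions, and NumPy) that the three forms of the update agree, since α_t = ln ♦_t:

import numpy as np

u, y, g_x, eps = 0.1, +1, -1, 0.25                # one misclassified example, eps_t = 0.25
diamond = np.sqrt((1 - eps) / eps)                # scaling factor ♦_t
alpha = np.log(diamond)                           # alpha_t = ln ♦_t

up_rule  = u * diamond                            # 'multiply by ♦_t if incorrect'
up_power = u * diamond ** (-y * g_x)              # u_n^(t) · ♦_t^(−y_n g_t(x_n))
up_exp   = u * np.exp(-y * alpha * g_x)           # u_n^(t) · exp(−y_n α_t g_t(x_n))
assert np.allclose([up_rule, up_power], up_exp)   # all three coincide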

(42)

Gradient Boosted Decision Tree / Optimization View of AdaBoost

Voting Score and Margin

linear blending = LinModel + hypotheses as transform + constraints (crossed out)

G(x_n) = sign(Σ_{t=1}^T α_t g_t(x_n)),
where the sum is the voting score, α_t plays the role of w_i, and g_t(x_n) plays the role of φ_i(x_n)

and hard-margin SVM margin = y_n · (w^T φ(x_n) + b) / ‖w‖, remember? :-)

y_n (voting score) = signed & unnormalized margin
⟸ want y_n (voting score) positive & large
   want exp(−y_n (voting score)) small
   want u_n^(T+1) small

claim: AdaBoost decreases Σ_{n=1}^N u_n^(t)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 8/25

(49)

Gradient Boosted Decision Tree / Optimization View of AdaBoost

AdaBoost Error Function

claim: AdaBoost decreases Σ_{n=1}^N u_n^(t) and thus somewhat minimizes

  Σ_{n=1}^N u_n^(T+1) = (1/N) Σ_{n=1}^N exp(−y_n Σ_{t=1}^T α_t g_t(x_n))

with linear score s = Σ_{t=1}^T α_t g_t(x_n):
• err_0/1(s, y) = [[ ys ≤ 0 ]]
• êrr_ADA(s, y) = exp(−ys): upper bound of err_0/1, called the exponential error measure

(figure: err_0/1 and êrr_ADA plotted against ys)

êrr_ADA: algorithmic error measure by convex upper bound of err_0/1

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 9/25
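A tiny NumPy check that êrr_ADA(s, y) = exp(−ys) pointwise upper-bounds err_0/1(s, y) = [[ys ≤ 0]]:

import numpy as np

ys = np.linspace(-3, 3, 61)             # the signed score y · s
err01   = (ys <= 0).astype(float)       # err_0/1(s, y) = [[ys <= 0]]
err_ada = np.exp(-ys)                   # êrr_ADA(s, y) = exp(−ys)
assert np.all(err_ada >= err01)         # convex upper bound of the 0/1 error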

(54)

Gradient Boosted Decision Tree / Optimization View of AdaBoost

Gradient Descent on AdaBoost Error Function

recall: gradient descent (remember? :-)), at iteration t

  min_{‖v‖=1} E_in(w_t + ηv) ≈ E_in(w_t) [known] + η [given positive] · v^T ∇E_in(w_t) [known]

at iteration t, to find g_t, solve

  min_h Ê_ADA = (1/N) Σ_{n=1}^N exp(−y_n (Σ_{τ=1}^{t−1} α_τ g_τ(x_n) + η h(x_n)))
              = Σ_{n=1}^N u_n^(t) exp(−y_n η h(x_n))
  ≈ (taylor)    Σ_{n=1}^N u_n^(t) (1 − y_n η h(x_n))
              = Σ_{n=1}^N u_n^(t) − η Σ_{n=1}^N u_n^(t) y_n h(x_n)

good h: minimize Σ_{n=1}^N u_n^(t) (−y_n h(x_n))

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/25
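For binary h with ±1 outputs, −y_n h(x_n) = 2 [[y_n ≠ h(x_n)]] − 1, so minimizing the expression above over h is exactly minimizing u^(t)-weighted in-sample error, which is what the sampled/weighted base learner already does. A small check of the identity (assumes NumPy):

import numpy as np

def steepest_direction_objective(u, y, h_x):
    """sum_n u_n * (−y_n h(x_n)) for ±1 labels and predictions."""
    return np.sum(u * (-y * h_x))

u   = np.array([0.1, 0.4, 0.2, 0.3])
y   = np.array([+1, -1, +1, -1])
h_x = np.array([+1, +1, -1, -1])
lhs = steepest_direction_objective(u, y, h_x)
rhs = 2 * np.sum(u * (y != h_x)) - np.sum(u)   # 2 · (weighted #errors) − sum(u)
assert np.isclose(lhs, rhs)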
