## Machine Learning Techniques (機器學習技法)

### Lecture 11: Gradient Boosted Decision Tree

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

### Department of Computer Science & Information Engineering

### National Taiwan University (國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/25

## Roadmap

### 1 Embedding Numerous Features: Kernel Models

### 2 Combining Predictive Features: Aggregation Models

Lecture 10: Random Forest — **bagging** of **randomized C&RT** trees with **automatic validation** and **feature selection**

### Lecture 11: Gradient Boosted Decision Tree

- Adaptive Boosted Decision Tree
- Optimization View of AdaBoost
- Gradient Boosting
- Summary of Aggregation Models

### 3 Distilling Implicit Features: Extraction Models
## From Random Forest to AdaBoost-DTree

**function RandomForest($D$)**
For $t = 1, 2, \ldots, T$:
1. request size-$N'$ data $\tilde{D}_t$ by **bootstrapping** with $D$
2. obtain tree $g_t$ by Randomized-DTree($\tilde{D}_t$)

return $G = \text{Uniform}(\{g_t\})$

**function AdaBoost-DTree($D$)**
For $t = 1, 2, \ldots, T$:
1. reweight data by $\mathbf{u}^{(t)}$
2. obtain tree $g_t$ by DTree($D, \mathbf{u}^{(t)}$)
3. calculate 'vote' $\alpha_t$ of $g_t$

return $G = \text{LinearHypo}(\{(g_t, \alpha_t)\})$

need: **weighted** DTree($D, \mathbf{u}^{(t)}$)
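The AdaBoost-DTree loop above can be sketched in plain numpy. This is a minimal illustration, not the course's reference code; it uses a weighted decision stump (a height-1 tree, as discussed later in this lecture) in place of a full DTree, and all function names are illustrative:

```python
import numpy as np

def weighted_stump(X, y, u):
    """Height<=1 tree: pick (feature, threshold, sign) minimizing weighted error."""
    best, best_err = None, np.inf
    for i in range(X.shape[1]):
        for thr in np.unique(X[:, i]):
            for s in (1, -1):
                pred = np.where(X[:, i] > thr, s, -s)
                err = u[pred != y].sum() / u.sum()   # weighted error rate eps_t
                if err < best_err:
                    best, best_err = (i, thr, s), err
    return best, best_err

def adaboost_dtree(X, y, T=5):
    N = len(y)
    u = np.full(N, 1.0 / N)                          # u^(1)
    G = []
    for _ in range(T):
        (i, thr, s), eps = weighted_stump(X, y, u)   # step 2: obtain g_t
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        diamond = np.sqrt((1 - eps) / eps)
        alpha = np.log(diamond)                      # step 3: vote alpha_t
        pred = np.where(X[:, i] > thr, s, -s)
        u = np.where(pred != y, u * diamond, u / diamond)  # step 1: reweight
        G.append((alpha, i, thr, s))
    return G

def predict(G, X):
    score = sum(a * np.where(X[:, i] > thr, s, -s) for a, i, thr, s in G)
    return np.sign(score)                            # G = LinearHypo({(g_t, alpha_t)})
```

On a small linearly separable toy set, a few rounds suffice to fit the training data.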

## Weighted Decision Tree Algorithm

### Weighted Algorithm

minimize (regularized) $E_{\text{in}}^{\mathbf{u}}(h) = \frac{1}{N}\sum_{n=1}^{N} u_n \cdot \text{err}(y_n, h(\mathbf{x}_n))$

if using existing algorithm as **black box** (no modifications), to get $E_{\text{in}}^{\mathbf{u}}$ approximately optimized...

### 'Weighted' Algorithm in Bagging

weights $\mathbf{u}$ expressed by bootstrap-sampled **copies** — request size-$N'$ data $\tilde{D}_t$ by **bootstrapping** with $D$

### A General Randomized Base Algorithm

weights $\mathbf{u}$ expressed by **sampling** proportional to $u_n$ — request size-$N'$ data $\tilde{D}_t$ by **sampling** $\propto \mathbf{u}$ on $D$

AdaBoost-DTree: often via AdaBoost + **sampling** $\propto \mathbf{u}^{(t)}$ + DTree($\tilde{D}_t$), without modifying DTree
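The "sampling $\propto \mathbf{u}$" trick can be sketched as a small helper (illustrative name, assuming numpy):

```python
import numpy as np

def sample_prop_to_u(X, y, u, n_prime, seed=0):
    """Request size-N' data D~_t by sampling with replacement, P(n) proportional to u_n."""
    rng = np.random.default_rng(seed)
    p = np.asarray(u, dtype=float)
    p = p / p.sum()                      # normalize weights into probabilities
    idx = rng.choice(len(y), size=n_prime, replace=True, p=p)
    return X[idx], y[idx]
```

The unmodified base DTree then runs on the sampled $(\tilde{X}, \tilde{y})$; heavily weighted examples appear as multiple copies, so $E_{\text{in}}^{\mathbf{u}}$ is optimized in expectation.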

## Weak Decision Tree Algorithm

AdaBoost: **votes** $\alpha_t = \ln \diamond_t = \ln\sqrt{\frac{1-\epsilon_t}{\epsilon_t}}$ with **weighted error rate** $\epsilon_t$

if **fully grown** tree trained on **all** $\mathbf{x}_n$:
$\Longrightarrow E_{\text{in}}(g_t) = 0$ if all $\mathbf{x}_n$ different
$\Longrightarrow E_{\text{in}}^{\mathbf{u}}(g_t) = 0 \Longrightarrow \epsilon_t = 0$
$\Longrightarrow \alpha_t = \infty$ (autocracy!!)

need: **pruned** tree trained on **some** $\mathbf{x}_n$ to be **weak**

- **pruned**: usual pruning, or just **limiting tree height**
- **some**: **sampling** $\propto \mathbf{u}^{(t)}$

AdaBoost-DTree: often via AdaBoost + **sampling** $\propto \mathbf{u}^{(t)}$ + **pruned** DTree($\tilde{D}$)
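A quick numeric check of the vote formula makes the "autocracy" problem concrete (writing $\diamond_t$ as `diamond` inside the function):

```python
import numpy as np

def alpha(eps):
    """AdaBoost vote: alpha_t = ln diamond_t = ln sqrt((1 - eps) / eps)."""
    return np.log(np.sqrt((1.0 - eps) / eps))

# eps = 1/2 (random guessing) gets zero vote;
# eps -> 0 (a fully grown, unbeatable tree) makes alpha blow up to infinity.
```

This is why the base tree must be kept weak: a tree with $\epsilon_t = 0$ would receive an infinite vote and silence all other hypotheses.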


## AdaBoost with Extremely-Pruned Tree

what if DTree with **height ≤ 1** (extremely pruned)?

### DTree (C&RT) with height ≤ 1

learn **branching criteria**

$$b(\mathbf{x}) = \mathop{\text{argmin}}_{\text{decision stumps } h(\mathbf{x})} \sum_{c=1}^{2} |D_c \text{ with } h| \cdot \text{impurity}(D_c \text{ with } h)$$

— if **impurity** = binary classification error, **just a decision stump, remember? :-)**

AdaBoost-Stump = **special case** of AdaBoost-DTree
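The branching criterion can be sketched directly. This illustrative code uses binary classification error as the impurity and, for simplicity, unweighted data:

```python
import numpy as np

def branch(X, y):
    """b(x): argmin over stumps of sum_c |D_c| * impurity(D_c), impurity = binary error."""
    def impurity(labels):
        if len(labels) == 0:
            return 0.0
        frac = np.mean(labels == 1)
        return min(frac, 1.0 - frac)   # fraction misclassified by the majority label
    best, best_cost = None, np.inf
    for i in range(X.shape[1]):
        for thr in np.unique(X[:, i]):
            left, right = y[X[:, i] <= thr], y[X[:, i] > thr]
            cost = len(left) * impurity(left) + len(right) * impurity(right)
            if cost < best_cost:
                best, best_cost = (i, float(thr)), cost
    return best, best_cost
```

With this impurity, minimizing the criterion is exactly learning a decision stump, matching the slide's claim.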

## Fun Time

Suppose running AdaBoost-DTree with sampling yields a decision tree $g_t$ that achieves zero error on the sampled data set $\tilde{D}_t$. Which of the following is possible?

1. $\alpha_t < 0$
2. $\alpha_t = 0$
3. $\alpha_t > 0$
4. all of the above

### Reference Answer: 4

While $g_t$ achieves zero error on $\tilde{D}_t$, $g_t$ may not achieve zero weighted error on $(D, \mathbf{u}^{(t)})$, and hence $\epsilon_t$ can be anything, even $\geq \frac{1}{2}$. Then $\alpha_t$ can be $\leq 0$.


## Example Weights of AdaBoost

$$u_n^{(t+1)} = \begin{cases} u_n^{(t)} \cdot \diamond_t & \textbf{if incorrect} \\ u_n^{(t)} / \diamond_t & \textbf{if correct} \end{cases} \;=\; u_n^{(t)} \cdot \diamond_t^{-y_n g_t(\mathbf{x}_n)} \;=\; u_n^{(t)} \cdot \exp\left(-y_n \alpha_t g_t(\mathbf{x}_n)\right)$$

$$u_n^{(T+1)} = u_n^{(1)} \cdot \prod_{t=1}^{T} \exp\left(-y_n \alpha_t g_t(\mathbf{x}_n)\right) = \frac{1}{N} \cdot \exp\left(-y_n \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}_n)\right)$$

- recall: $G(\mathbf{x}) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t g_t(\mathbf{x})\right)$
- $\sum_{t=1}^{T} \alpha_t g_t(\mathbf{x})$: **voting score** of $\{g_t\}$ on $\mathbf{x}$

AdaBoost: $u_n^{(T+1)} \propto \exp\left(-y_n \left(\text{voting score on } \mathbf{x}_n\right)\right)$
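The per-round recursion and its closed form can be checked numerically; all the data below is synthetic (random $y_n$, $g_t(\mathbf{x}_n)$, and $\alpha_t$ values standing in for a real run):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 6, 4
y = rng.choice([-1, 1], size=N)            # labels y_n
g = rng.choice([-1, 1], size=(T, N))       # g_t(x_n) for synthetic hypotheses
alpha = rng.uniform(0.1, 1.0, size=T)      # votes alpha_t

# recursion: u^(t+1)_n = u^(t)_n * exp(-y_n alpha_t g_t(x_n))
u = np.full(N, 1.0 / N)                    # u^(1)_n = 1/N
for t in range(T):
    u = u * np.exp(-y * alpha[t] * g[t])

# closed form: u^(T+1)_n = (1/N) exp(-y_n sum_t alpha_t g_t(x_n))
closed = np.exp(-y * (alpha[:, None] * g).sum(axis=0)) / N
assert np.allclose(u, closed)
```

The product of per-round exponential factors telescopes into a single exponential of the voting score, which is exactly the slide's derivation.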

## Voting Score and Margin

linear blending = **LinModel** + **hypotheses as transform** + ~~constraints~~

$$G(\mathbf{x}_n) = \text{sign}\Bigg(\underbrace{\sum_{t=1}^{T} \underbrace{\alpha_t}_{w_i} \underbrace{g_t(\mathbf{x}_n)}_{\phi_i(\mathbf{x}_n)}}_{\text{voting score}}\Bigg)$$

and hard-margin SVM **margin** $= \frac{y_n \cdot (\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) + b)}{\|\mathbf{w}\|}$, **remember? :-)**

$y_n(\text{voting score})$ = signed & unnormalized **margin**

⟸ want $y_n(\text{voting score})$ **positive & large**
⇔ want $\exp(-y_n(\text{voting score}))$ **small**
⇔ want $u_n^{(T+1)}$ **small**

claim: AdaBoost **decreases** $\sum_{n=1}^{N} u_n^{(t)}$
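The chain of equivalences can be seen numerically: since $u_n^{(T+1)} = \frac{1}{N}\exp(-y_n \cdot \text{voting score})$, a larger signed margin always means a smaller final example weight. The scores below are synthetic stand-ins for a real ensemble:

```python
import numpy as np

N = 5
score = np.array([3.0, 1.0, -0.5, 2.0, 0.2])   # synthetic voting scores
y = np.array([1, 1, 1, -1, -1])
margin = y * score                             # signed, unnormalized margin
u_final = np.exp(-margin) / N                  # u_n^(T+1) per the slide

# sorting examples by increasing margin gives non-increasing weights
order = np.argsort(margin)
assert (np.diff(u_final[order]) <= 0).all()
```

Large-margin examples thus contribute almost nothing to $\sum_n u_n^{(T+1)}$, so driving that sum down pushes margins up.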


## AdaBoost Error Function

claim: AdaBoost **decreases** $\sum_{n=1}^{N} u_n^{(t)}$ and thus somewhat **minimizes**

$$
\sum_{n=1}^{N} u_n^{(T+1)} = \frac{1}{N}\sum_{n=1}^{N} \exp\Bigg(-y_n \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}_n)\Bigg)
$$

linear score $s = \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}_n)$

- $\text{err}_{0/1}(s, y) = [\![ys \le 0]\!]$
- $\widehat{\text{err}}_{\text{ADA}}(s, y) = \exp(-ys)$: upper bound of $\text{err}_{0/1}$, called the **exponential error measure**

[figure: $\text{err}_{0/1}$ and $\widehat{\text{err}}_{\text{ADA}}$ plotted against $ys$ on $[-3, 3]$]

$\widehat{\text{err}}_{\text{ADA}}$: **algorithmic error measure** by **convex upper bound** of $\text{err}_{0/1}$
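The upper-bound claim in the bullet list can be verified pointwise: $\exp(-ys) \ge [\![ys \le 0]\!]$ for every $ys$, with the two curves meeting at $ys = 0$, and $\exp(-ys)$ is convex. A quick numerical check; the grid of $ys$ values is arbitrary, chosen to match the slide's axis range:

```python
import numpy as np

ys = np.linspace(-3, 3, 601)          # grid of y*s values, as on the slide's axis
err01 = (ys <= 0).astype(float)       # err_0/1(s, y) = [[ys <= 0]]
err_ada = np.exp(-ys)                 # exponential error measure exp(-ys)

# exp(-ys) upper-bounds the 0/1 error everywhere...
assert np.all(err_ada >= err01)

# ...and is convex (midpoint of any chord lies above the curve),
# so it is a convex upper bound suitable for optimization
mid = 0.5 * (err_ada[:-2] + err_ada[2:])
assert np.all(mid >= err_ada[1:-1] - 1e-12)
```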

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 9/25


## Gradient Descent on AdaBoost Error Function

recall: gradient descent (remember? :-)), at iteration $t$

$$
\min_{\|\mathbf{v}\|=1} E_{\text{in}}(\mathbf{w}_t + \eta\mathbf{v}) \approx \underbrace{E_{\text{in}}(\mathbf{w}_t)}_{\text{known}} + \underbrace{\eta}_{\text{given positive}} \mathbf{v}^T \underbrace{\nabla E_{\text{in}}(\mathbf{w}_t)}_{\text{known}}
$$

at iteration $t$, to find $g_t$, solve

$$
\begin{aligned}
\min_{h}\; \widehat{E}_{\text{ADA}}
&= \frac{1}{N}\sum_{n=1}^{N} \exp\Bigg(-y_n \Bigg(\sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(\mathbf{x}_n) + \eta h(\mathbf{x}_n)\Bigg)\Bigg) \\
&= \sum_{n=1}^{N} u_n^{(t)} \exp\big(-y_n \eta h(\mathbf{x}_n)\big) \\
&\overset{\text{Taylor}}{\approx} \sum_{n=1}^{N} u_n^{(t)} \big(1 - y_n \eta h(\mathbf{x}_n)\big)
= \sum_{n=1}^{N} u_n^{(t)} - \eta \sum_{n=1}^{N} u_n^{(t)} y_n h(\mathbf{x}_n)
\end{aligned}
$$

good $h$: minimize $\sum_{n=1}^{N} u_n^{(t)} \big(-y_n h(\mathbf{x}_n)\big)$
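For binary $h(\mathbf{x}_n) \in \{-1, +1\}$, the final objective rewrites as $\sum_n u_n^{(t)}(-y_n h(\mathbf{x}_n)) = -\sum_n u_n^{(t)} + 2\sum_n u_n^{(t)} [\![y_n \ne h(\mathbf{x}_n)]\!]$, so the best direction $h$ is exactly the hypothesis with the smallest weighted 0/1 error, which a weighted base learner such as a decision stump returns. A sketch checking the identity on toy data; the labels, hypothesis outputs, and weights are made up for illustration:

```python
import numpy as np

# toy data: labels y_n, a binary hypothesis h(x_n), and current weights u_n^{(t)}
y = np.array([+1, -1, +1, +1, -1])
h = np.array([+1, +1, -1, +1, -1])
u = np.array([0.1, 0.2, 0.3, 0.25, 0.15])

lhs = np.sum(u * (-y * h))              # objective from the slide
weighted_err = np.sum(u * (y != h))     # weighted 0/1 error of h
rhs = -np.sum(u) + 2 * weighted_err     # rewritten form

# the two expressions agree, so minimizing sum u_n (-y_n h(x_n)) over h
# is the same as minimizing the weighted 0/1 error of h
assert np.isclose(lhs, rhs)
```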


Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/25
