## Machine Learning Techniques (機器學習技法)

### Lecture 11: Gradient Boosted Decision Tree

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

### Department of Computer Science & Information Engineering

### National Taiwan University (國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/25

## Roadmap

### 1 Embedding Numerous Features: Kernel Models

### 2 Combining Predictive Features: Aggregation Models

Lecture 10: Random Forest — **bagging** of **randomized C&RT** trees with **automatic validation** and **feature selection**

### Lecture 11: Gradient Boosted Decision Tree

- Adaptive Boosted Decision Tree
- Optimization View of AdaBoost
- Gradient Boosting
- Summary of Aggregation Models

### 3 Distilling Implicit Features: Extraction Models
## From Random Forest to AdaBoost-DTree

**function RandomForest($D$)**
For $t = 1, 2, \ldots, T$:
1. request size-$N'$ data $\tilde{D}_t$ by **bootstrapping** with $D$
2. obtain tree $g_t$ by Randomized-DTree($\tilde{D}_t$)

return $G = \text{Uniform}(\{g_t\})$

**function AdaBoost-DTree($D$)**
For $t = 1, 2, \ldots, T$:
1. reweight data by $\mathbf{u}^{(t)}$
2. obtain tree $g_t$ by DTree($D, \mathbf{u}^{(t)}$)
3. calculate 'vote' $\alpha_t$ of $g_t$

return $G = \text{LinearHypo}(\{(g_t, \alpha_t)\})$

need: **weighted** DTree($D, \mathbf{u}^{(t)}$)
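The AdaBoost-DTree loop above can be sketched in plain numpy. This is a minimal illustration, not the course's reference code; it uses a weighted decision stump (a height-1 tree, as discussed later in this lecture) in place of a full DTree, and all function names are illustrative:

```python
import numpy as np

def weighted_stump(X, y, u):
    """Height<=1 tree: pick (feature, threshold, sign) minimizing weighted error."""
    best, best_err = None, np.inf
    for i in range(X.shape[1]):
        for thr in np.unique(X[:, i]):
            for s in (1, -1):
                pred = np.where(X[:, i] > thr, s, -s)
                err = u[pred != y].sum() / u.sum()   # weighted error rate eps_t
                if err < best_err:
                    best, best_err = (i, thr, s), err
    return best, best_err

def adaboost_dtree(X, y, T=5):
    N = len(y)
    u = np.full(N, 1.0 / N)                          # u^(1)
    G = []
    for _ in range(T):
        (i, thr, s), eps = weighted_stump(X, y, u)   # step 2: obtain g_t
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        diamond = np.sqrt((1 - eps) / eps)
        alpha = np.log(diamond)                      # step 3: vote alpha_t
        pred = np.where(X[:, i] > thr, s, -s)
        u = np.where(pred != y, u * diamond, u / diamond)  # step 1: reweight
        G.append((alpha, i, thr, s))
    return G

def predict(G, X):
    score = sum(a * np.where(X[:, i] > thr, s, -s) for a, i, thr, s in G)
    return np.sign(score)                            # G = LinearHypo({(g_t, alpha_t)})
```

On a small linearly separable toy set, a few rounds suffice to fit the training data.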

## Weighted Decision Tree Algorithm

### Weighted Algorithm

minimize (regularized) $E_{\text{in}}^{\mathbf{u}}(h) = \frac{1}{N}\sum_{n=1}^{N} u_n \cdot \text{err}(y_n, h(\mathbf{x}_n))$

if using existing algorithm as **black box** (no modifications), to get $E_{\text{in}}^{\mathbf{u}}$ approximately optimized...

### 'Weighted' Algorithm in Bagging

weights $\mathbf{u}$ expressed by bootstrap-sampled **copies** — request size-$N'$ data $\tilde{D}_t$ by **bootstrapping** with $D$

### A General Randomized Base Algorithm

weights $\mathbf{u}$ expressed by **sampling** proportional to $u_n$ — request size-$N'$ data $\tilde{D}_t$ by **sampling** $\propto \mathbf{u}$ on $D$

AdaBoost-DTree: often via AdaBoost + **sampling** $\propto \mathbf{u}^{(t)}$ + DTree($\tilde{D}_t$), without modifying DTree
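The "sampling $\propto \mathbf{u}$" trick can be sketched as a small helper (illustrative name, assuming numpy):

```python
import numpy as np

def sample_prop_to_u(X, y, u, n_prime, seed=0):
    """Request size-N' data D~_t by sampling with replacement, P(n) proportional to u_n."""
    rng = np.random.default_rng(seed)
    p = np.asarray(u, dtype=float)
    p = p / p.sum()                      # normalize weights into probabilities
    idx = rng.choice(len(y), size=n_prime, replace=True, p=p)
    return X[idx], y[idx]
```

The unmodified base DTree then runs on the sampled $(\tilde{X}, \tilde{y})$; heavily weighted examples appear as multiple copies, so $E_{\text{in}}^{\mathbf{u}}$ is optimized in expectation.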

## Weak Decision Tree Algorithm

AdaBoost: **votes** $\alpha_t = \ln \diamond_t = \ln\sqrt{\frac{1-\epsilon_t}{\epsilon_t}}$ with **weighted error rate** $\epsilon_t$

if **fully grown** tree trained on **all** $\mathbf{x}_n$:
$\Longrightarrow E_{\text{in}}(g_t) = 0$ if all $\mathbf{x}_n$ different
$\Longrightarrow E_{\text{in}}^{\mathbf{u}}(g_t) = 0 \Longrightarrow \epsilon_t = 0$
$\Longrightarrow \alpha_t = \infty$ (autocracy!!)

need: **pruned** tree trained on **some** $\mathbf{x}_n$ to be **weak**

- **pruned**: usual pruning, or just **limiting tree height**
- **some**: **sampling** $\propto \mathbf{u}^{(t)}$

AdaBoost-DTree: often via AdaBoost + **sampling** $\propto \mathbf{u}^{(t)}$ + **pruned** DTree($\tilde{D}$)
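A quick numeric check of the vote formula makes the "autocracy" problem concrete (writing $\diamond_t$ as `diamond` inside the function):

```python
import numpy as np

def alpha(eps):
    """AdaBoost vote: alpha_t = ln diamond_t = ln sqrt((1 - eps) / eps)."""
    return np.log(np.sqrt((1.0 - eps) / eps))

# eps = 1/2 (random guessing) gets zero vote;
# eps -> 0 (a fully grown, unbeatable tree) makes alpha blow up to infinity.
```

This is why the base tree must be kept weak: a tree with $\epsilon_t = 0$ would receive an infinite vote and silence all other hypotheses.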


## AdaBoost with Extremely-Pruned Tree

what if DTree with **height ≤ 1** (extremely pruned)?

### DTree (C&RT) with height ≤ 1

learn **branching criteria**

$$b(\mathbf{x}) = \mathop{\text{argmin}}_{\text{decision stumps } h(\mathbf{x})} \sum_{c=1}^{2} |D_c \text{ with } h| \cdot \text{impurity}(D_c \text{ with } h)$$

— if **impurity** = binary classification error, **just a decision stump, remember? :-)**

AdaBoost-Stump = **special case** of AdaBoost-DTree
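The branching criterion can be sketched directly. This illustrative code uses binary classification error as the impurity and, for simplicity, unweighted data:

```python
import numpy as np

def branch(X, y):
    """b(x): argmin over stumps of sum_c |D_c| * impurity(D_c), impurity = binary error."""
    def impurity(labels):
        if len(labels) == 0:
            return 0.0
        frac = np.mean(labels == 1)
        return min(frac, 1.0 - frac)   # fraction misclassified by the majority label
    best, best_cost = None, np.inf
    for i in range(X.shape[1]):
        for thr in np.unique(X[:, i]):
            left, right = y[X[:, i] <= thr], y[X[:, i] > thr]
            cost = len(left) * impurity(left) + len(right) * impurity(right)
            if cost < best_cost:
                best, best_cost = (i, float(thr)), cost
    return best, best_cost
```

With this impurity, minimizing the criterion is exactly learning a decision stump, matching the slide's claim.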

## Fun Time

Suppose running AdaBoost-DTree with sampling yields a decision tree $g_t$ that achieves zero error on the sampled data set $\tilde{D}_t$. Which of the following is possible?

1. $\alpha_t < 0$
2. $\alpha_t = 0$
3. $\alpha_t > 0$
4. all of the above

### Reference Answer: 4

While $g_t$ achieves zero error on $\tilde{D}_t$, $g_t$ may not achieve zero weighted error on $(D, \mathbf{u}^{(t)})$, and hence $\epsilon_t$ can be anything, even $\geq \frac{1}{2}$. Then $\alpha_t$ can be $\leq 0$.


## Example Weights of AdaBoost

$$u_n^{(t+1)} = \begin{cases} u_n^{(t)} \cdot \diamond_t & \textbf{if incorrect} \\ u_n^{(t)} / \diamond_t & \textbf{if correct} \end{cases} \;=\; u_n^{(t)} \cdot \diamond_t^{-y_n g_t(\mathbf{x}_n)} \;=\; u_n^{(t)} \cdot \exp\left(-y_n \alpha_t g_t(\mathbf{x}_n)\right)$$

$$u_n^{(T+1)} = u_n^{(1)} \cdot \prod_{t=1}^{T} \exp\left(-y_n \alpha_t g_t(\mathbf{x}_n)\right) = \frac{1}{N} \cdot \exp\left(-y_n \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}_n)\right)$$

- recall: $G(\mathbf{x}) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t g_t(\mathbf{x})\right)$
- $\sum_{t=1}^{T} \alpha_t g_t(\mathbf{x})$: **voting score** of $\{g_t\}$ on $\mathbf{x}$

AdaBoost: $u_n^{(T+1)} \propto \exp\left(-y_n \left(\text{voting score on } \mathbf{x}_n\right)\right)$
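The per-round recursion and its closed form can be checked numerically; all the data below is synthetic (random $y_n$, $g_t(\mathbf{x}_n)$, and $\alpha_t$ values standing in for a real run):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 6, 4
y = rng.choice([-1, 1], size=N)            # labels y_n
g = rng.choice([-1, 1], size=(T, N))       # g_t(x_n) for synthetic hypotheses
alpha = rng.uniform(0.1, 1.0, size=T)      # votes alpha_t

# recursion: u^(t+1)_n = u^(t)_n * exp(-y_n alpha_t g_t(x_n))
u = np.full(N, 1.0 / N)                    # u^(1)_n = 1/N
for t in range(T):
    u = u * np.exp(-y * alpha[t] * g[t])

# closed form: u^(T+1)_n = (1/N) exp(-y_n sum_t alpha_t g_t(x_n))
closed = np.exp(-y * (alpha[:, None] * g).sum(axis=0)) / N
assert np.allclose(u, closed)
```

The product of per-round exponential factors telescopes into a single exponential of the voting score, which is exactly the slide's derivation.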

## Voting Score and Margin

linear blending = **LinModel** + **hypotheses as transform** + ~~constraints~~

$$G(\mathbf{x}_n) = \text{sign}\Bigg(\underbrace{\sum_{t=1}^{T} \underbrace{\alpha_t}_{w_i} \underbrace{g_t(\mathbf{x}_n)}_{\phi_i(\mathbf{x}_n)}}_{\text{voting score}}\Bigg)$$

and hard-margin SVM **margin** $= \frac{y_n \cdot (\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) + b)}{\|\mathbf{w}\|}$, **remember? :-)**

$y_n(\text{voting score})$ = signed & unnormalized **margin**

⟸ want $y_n(\text{voting score})$ **positive & large**
⇔ want $\exp(-y_n(\text{voting score}))$ **small**
⇔ want $u_n^{(T+1)}$ **small**

claim: AdaBoost **decreases** $\sum_{n=1}^{N} u_n^{(t)}$
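The chain of equivalences can be seen numerically: since $u_n^{(T+1)} = \frac{1}{N}\exp(-y_n \cdot \text{voting score})$, a larger signed margin always means a smaller final example weight. The scores below are synthetic stand-ins for a real ensemble:

```python
import numpy as np

N = 5
score = np.array([3.0, 1.0, -0.5, 2.0, 0.2])   # synthetic voting scores
y = np.array([1, 1, 1, -1, -1])
margin = y * score                             # signed, unnormalized margin
u_final = np.exp(-margin) / N                  # u_n^(T+1) per the slide

# sorting examples by increasing margin gives non-increasing weights
order = np.argsort(margin)
assert (np.diff(u_final[order]) <= 0).all()
```

Large-margin examples thus contribute almost nothing to $\sum_n u_n^{(T+1)}$, so driving that sum down pushes margins up.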


## AdaBoost Error Function

claim: AdaBoost **decreases** $\sum_{n=1}^{N} u_n^{(t)}$ and thus somewhat **minimizes**

$$
\sum_{n=1}^{N} u_n^{(T+1)} = \frac{1}{N}\sum_{n=1}^{N} \exp\Bigg(-y_n \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}_n)\Bigg)
$$

linear score $s = \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}_n)$

- $\text{err}_{0/1}(s, y) = [\![ys \le 0]\!]$
- $\widehat{\text{err}}_{\text{ADA}}(s, y) = \exp(-ys)$: upper bound of $\text{err}_{0/1}$, called the **exponential error measure**

[figure: $\text{err}_{0/1}$ and $\widehat{\text{err}}_{\text{ADA}}$ plotted against $ys$ on $[-3, 3]$]

$\widehat{\text{err}}_{\text{ADA}}$: **algorithmic error measure** by **convex upper bound** of $\text{err}_{0/1}$
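The upper-bound claim in the bullet list can be verified pointwise: $\exp(-ys) \ge [\![ys \le 0]\!]$ for every $ys$, with the two curves meeting at $ys = 0$, and $\exp(-ys)$ is convex. A quick numerical check; the grid of $ys$ values is arbitrary, chosen to match the slide's axis range:

```python
import numpy as np

ys = np.linspace(-3, 3, 601)          # grid of y*s values, as on the slide's axis
err01 = (ys <= 0).astype(float)       # err_0/1(s, y) = [[ys <= 0]]
err_ada = np.exp(-ys)                 # exponential error measure exp(-ys)

# exp(-ys) upper-bounds the 0/1 error everywhere...
assert np.all(err_ada >= err01)

# ...and is convex (midpoint of any chord lies above the curve),
# so it is a convex upper bound suitable for optimization
mid = 0.5 * (err_ada[:-2] + err_ada[2:])
assert np.all(mid >= err_ada[1:-1] - 1e-12)
```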

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 9/25


## Gradient Descent on AdaBoost Error Function

recall: gradient descent (remember? :-)), at iteration $t$

$$
\min_{\|\mathbf{v}\|=1} E_{\text{in}}(\mathbf{w}_t + \eta\mathbf{v}) \approx \underbrace{E_{\text{in}}(\mathbf{w}_t)}_{\text{known}} + \underbrace{\eta}_{\text{given positive}} \mathbf{v}^T \underbrace{\nabla E_{\text{in}}(\mathbf{w}_t)}_{\text{known}}
$$

at iteration $t$, to find $g_t$, solve

$$
\begin{aligned}
\min_{h}\; \widehat{E}_{\text{ADA}}
&= \frac{1}{N}\sum_{n=1}^{N} \exp\Bigg(-y_n \Bigg(\sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(\mathbf{x}_n) + \eta h(\mathbf{x}_n)\Bigg)\Bigg) \\
&= \sum_{n=1}^{N} u_n^{(t)} \exp\big(-y_n \eta h(\mathbf{x}_n)\big) \\
&\overset{\text{Taylor}}{\approx} \sum_{n=1}^{N} u_n^{(t)} \big(1 - y_n \eta h(\mathbf{x}_n)\big)
= \sum_{n=1}^{N} u_n^{(t)} - \eta \sum_{n=1}^{N} u_n^{(t)} y_n h(\mathbf{x}_n)
\end{aligned}
$$

good $h$: minimize $\sum_{n=1}^{N} u_n^{(t)} \big(-y_n h(\mathbf{x}_n)\big)$
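For binary $h(\mathbf{x}_n) \in \{-1, +1\}$, the final objective rewrites as $\sum_n u_n^{(t)}(-y_n h(\mathbf{x}_n)) = -\sum_n u_n^{(t)} + 2\sum_n u_n^{(t)} [\![y_n \ne h(\mathbf{x}_n)]\!]$, so the best direction $h$ is exactly the hypothesis with the smallest weighted 0/1 error, which a weighted base learner such as a decision stump returns. A sketch checking the identity on toy data; the labels, hypothesis outputs, and weights are made up for illustration:

```python
import numpy as np

# toy data: labels y_n, a binary hypothesis h(x_n), and current weights u_n^{(t)}
y = np.array([+1, -1, +1, +1, -1])
h = np.array([+1, +1, -1, +1, -1])
u = np.array([0.1, 0.2, 0.3, 0.25, 0.15])

lhs = np.sum(u * (-y * h))              # objective from the slide
weighted_err = np.sum(u * (y != h))     # weighted 0/1 error of h
rhs = -np.sum(u) + 2 * weighted_err     # rewritten form

# the two expressions agree, so minimizing sum u_n (-y_n h(x_n)) over h
# is the same as minimizing the weighted 0/1 error of h
assert np.isclose(lhs, rhs)
```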


Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/25
