## Machine Learning Techniques (機器學習技法)

### Lecture 11: Gradient Boosted Decision Tree

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

### Department of Computer Science & Information Engineering

### National Taiwan University (國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/25

## Roadmap

### 1 Embedding Numerous Features: Kernel Models

### 2 Combining Predictive Features: Aggregation Models

Lecture 10: Random Forest — **bagging** of **randomized C&RT** trees with **automatic validation** and **feature selection**

### Lecture 11: Gradient Boosted Decision Tree

- Adaptive Boosted Decision Tree
- Optimization View of AdaBoost
- Gradient Boosting
- Summary of Aggregation Models

### 3 Distilling Implicit Features: Extraction Models
## From Random Forest to AdaBoost-DTree

**function RandomForest($D$)**
For $t = 1, 2, \ldots, T$:
1. request size-$N'$ data $\tilde{D}_t$ by **bootstrapping** with $D$
2. obtain tree $g_t$ by Randomized-DTree($\tilde{D}_t$)

return $G = \text{Uniform}(\{g_t\})$

**function AdaBoost-DTree($D$)**
For $t = 1, 2, \ldots, T$:
1. reweight data by $\mathbf{u}^{(t)}$
2. obtain tree $g_t$ by DTree($D, \mathbf{u}^{(t)}$)
3. calculate 'vote' $\alpha_t$ of $g_t$

return $G = \text{LinearHypo}(\{(g_t, \alpha_t)\})$

need: **weighted** DTree($D, \mathbf{u}^{(t)}$)
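The AdaBoost-DTree loop above can be sketched in plain numpy. This is a minimal illustration, not the course's reference code; it uses a weighted decision stump (a height-1 tree, as discussed later in this lecture) in place of a full DTree, and all function names are illustrative:

```python
import numpy as np

def weighted_stump(X, y, u):
    """Height<=1 tree: pick (feature, threshold, sign) minimizing weighted error."""
    best, best_err = None, np.inf
    for i in range(X.shape[1]):
        for thr in np.unique(X[:, i]):
            for s in (1, -1):
                pred = np.where(X[:, i] > thr, s, -s)
                err = u[pred != y].sum() / u.sum()   # weighted error rate eps_t
                if err < best_err:
                    best, best_err = (i, thr, s), err
    return best, best_err

def adaboost_dtree(X, y, T=5):
    N = len(y)
    u = np.full(N, 1.0 / N)                          # u^(1)
    G = []
    for _ in range(T):
        (i, thr, s), eps = weighted_stump(X, y, u)   # step 2: obtain g_t
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        diamond = np.sqrt((1 - eps) / eps)
        alpha = np.log(diamond)                      # step 3: vote alpha_t
        pred = np.where(X[:, i] > thr, s, -s)
        u = np.where(pred != y, u * diamond, u / diamond)  # step 1: reweight
        G.append((alpha, i, thr, s))
    return G

def predict(G, X):
    score = sum(a * np.where(X[:, i] > thr, s, -s) for a, i, thr, s in G)
    return np.sign(score)                            # G = LinearHypo({(g_t, alpha_t)})
```

On a small linearly separable toy set, a few rounds suffice to fit the training data.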

## Weighted Decision Tree Algorithm

### Weighted Algorithm

minimize (regularized) $E_{\text{in}}^{\mathbf{u}}(h) = \frac{1}{N}\sum_{n=1}^{N} u_n \cdot \text{err}(y_n, h(\mathbf{x}_n))$

if using existing algorithm as **black box** (no modifications), to get $E_{\text{in}}^{\mathbf{u}}$ approximately optimized...

### 'Weighted' Algorithm in Bagging

weights $\mathbf{u}$ expressed by bootstrap-sampled **copies** — request size-$N'$ data $\tilde{D}_t$ by **bootstrapping** with $D$

### A General Randomized Base Algorithm

weights $\mathbf{u}$ expressed by **sampling** proportional to $u_n$ — request size-$N'$ data $\tilde{D}_t$ by **sampling** $\propto \mathbf{u}$ on $D$

AdaBoost-DTree: often via AdaBoost + **sampling** $\propto \mathbf{u}^{(t)}$ + DTree($\tilde{D}_t$), without modifying DTree
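The "sampling $\propto \mathbf{u}$" trick can be sketched as a small helper (illustrative name, assuming numpy):

```python
import numpy as np

def sample_prop_to_u(X, y, u, n_prime, seed=0):
    """Request size-N' data D~_t by sampling with replacement, P(n) proportional to u_n."""
    rng = np.random.default_rng(seed)
    p = np.asarray(u, dtype=float)
    p = p / p.sum()                      # normalize weights into probabilities
    idx = rng.choice(len(y), size=n_prime, replace=True, p=p)
    return X[idx], y[idx]
```

The unmodified base DTree then runs on the sampled $(\tilde{X}, \tilde{y})$; heavily weighted examples appear as multiple copies, so $E_{\text{in}}^{\mathbf{u}}$ is optimized in expectation.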

## Weak Decision Tree Algorithm

AdaBoost: **votes** $\alpha_t = \ln \diamond_t = \ln\sqrt{\frac{1-\epsilon_t}{\epsilon_t}}$ with **weighted error rate** $\epsilon_t$

if **fully grown** tree trained on **all** $\mathbf{x}_n$:
$\Longrightarrow E_{\text{in}}(g_t) = 0$ if all $\mathbf{x}_n$ different
$\Longrightarrow E_{\text{in}}^{\mathbf{u}}(g_t) = 0 \Longrightarrow \epsilon_t = 0$
$\Longrightarrow \alpha_t = \infty$ (autocracy!!)

need: **pruned** tree trained on **some** $\mathbf{x}_n$ to be **weak**

- **pruned**: usual pruning, or just **limiting tree height**
- **some**: **sampling** $\propto \mathbf{u}^{(t)}$

AdaBoost-DTree: often via AdaBoost + **sampling** $\propto \mathbf{u}^{(t)}$ + **pruned** DTree($\tilde{D}$)
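A quick numeric check of the vote formula makes the "autocracy" problem concrete (writing $\diamond_t$ as `diamond` inside the function):

```python
import numpy as np

def alpha(eps):
    """AdaBoost vote: alpha_t = ln diamond_t = ln sqrt((1 - eps) / eps)."""
    return np.log(np.sqrt((1.0 - eps) / eps))

# eps = 1/2 (random guessing) gets zero vote;
# eps -> 0 (a fully grown, unbeatable tree) makes alpha blow up to infinity.
```

This is why the base tree must be kept weak: a tree with $\epsilon_t = 0$ would receive an infinite vote and silence all other hypotheses.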


## AdaBoost with Extremely-Pruned Tree

what if DTree with **height ≤ 1** (extremely pruned)?

### DTree (C&RT) with height ≤ 1

learn **branching criteria**

$$b(\mathbf{x}) = \mathop{\text{argmin}}_{\text{decision stumps } h(\mathbf{x})} \sum_{c=1}^{2} |D_c \text{ with } h| \cdot \text{impurity}(D_c \text{ with } h)$$

— if **impurity** = binary classification error, **just a decision stump, remember? :-)**

AdaBoost-Stump = **special case** of AdaBoost-DTree
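The branching criterion can be sketched directly. This illustrative code uses binary classification error as the impurity and, for simplicity, unweighted data:

```python
import numpy as np

def branch(X, y):
    """b(x): argmin over stumps of sum_c |D_c| * impurity(D_c), impurity = binary error."""
    def impurity(labels):
        if len(labels) == 0:
            return 0.0
        frac = np.mean(labels == 1)
        return min(frac, 1.0 - frac)   # fraction misclassified by the majority label
    best, best_cost = None, np.inf
    for i in range(X.shape[1]):
        for thr in np.unique(X[:, i]):
            left, right = y[X[:, i] <= thr], y[X[:, i] > thr]
            cost = len(left) * impurity(left) + len(right) * impurity(right)
            if cost < best_cost:
                best, best_cost = (i, float(thr)), cost
    return best, best_cost
```

With this impurity, minimizing the criterion is exactly learning a decision stump, matching the slide's claim.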

## Fun Time

Suppose running AdaBoost-DTree with sampling yields a decision tree $g_t$ that achieves zero error on the sampled data set $\tilde{D}_t$. Which of the following is possible?

1. $\alpha_t < 0$
2. $\alpha_t = 0$
3. $\alpha_t > 0$
4. all of the above

### Reference Answer: 4

While $g_t$ achieves zero error on $\tilde{D}_t$, $g_t$ may not achieve zero weighted error on $(D, \mathbf{u}^{(t)})$, and hence $\epsilon_t$ can be anything, even $\geq \frac{1}{2}$. Then $\alpha_t$ can be $\leq 0$.


## Example Weights of AdaBoost

$$u_n^{(t+1)} = \begin{cases} u_n^{(t)} \cdot \diamond_t & \textbf{if incorrect} \\ u_n^{(t)} / \diamond_t & \textbf{if correct} \end{cases} \;=\; u_n^{(t)} \cdot \diamond_t^{-y_n g_t(\mathbf{x}_n)} \;=\; u_n^{(t)} \cdot \exp\left(-y_n \alpha_t g_t(\mathbf{x}_n)\right)$$

$$u_n^{(T+1)} = u_n^{(1)} \cdot \prod_{t=1}^{T} \exp\left(-y_n \alpha_t g_t(\mathbf{x}_n)\right) = \frac{1}{N} \cdot \exp\left(-y_n \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}_n)\right)$$

- recall: $G(\mathbf{x}) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t g_t(\mathbf{x})\right)$
- $\sum_{t=1}^{T} \alpha_t g_t(\mathbf{x})$: **voting score** of $\{g_t\}$ on $\mathbf{x}$

AdaBoost: $u_n^{(T+1)} \propto \exp\left(-y_n \left(\text{voting score on } \mathbf{x}_n\right)\right)$
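The per-round recursion and its closed form can be checked numerically; all the data below is synthetic (random $y_n$, $g_t(\mathbf{x}_n)$, and $\alpha_t$ values standing in for a real run):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 6, 4
y = rng.choice([-1, 1], size=N)            # labels y_n
g = rng.choice([-1, 1], size=(T, N))       # g_t(x_n) for synthetic hypotheses
alpha = rng.uniform(0.1, 1.0, size=T)      # votes alpha_t

# recursion: u^(t+1)_n = u^(t)_n * exp(-y_n alpha_t g_t(x_n))
u = np.full(N, 1.0 / N)                    # u^(1)_n = 1/N
for t in range(T):
    u = u * np.exp(-y * alpha[t] * g[t])

# closed form: u^(T+1)_n = (1/N) exp(-y_n sum_t alpha_t g_t(x_n))
closed = np.exp(-y * (alpha[:, None] * g).sum(axis=0)) / N
assert np.allclose(u, closed)
```

The product of per-round exponential factors telescopes into a single exponential of the voting score, which is exactly the slide's derivation.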

## Voting Score and Margin

linear blending = **LinModel** + **hypotheses as transform** + ~~constraints~~

$$G(\mathbf{x}_n) = \text{sign}\Bigg(\underbrace{\sum_{t=1}^{T} \underbrace{\alpha_t}_{w_i} \underbrace{g_t(\mathbf{x}_n)}_{\phi_i(\mathbf{x}_n)}}_{\text{voting score}}\Bigg)$$

and hard-margin SVM **margin** $= \frac{y_n \cdot (\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) + b)}{\|\mathbf{w}\|}$, **remember? :-)**

$y_n(\text{voting score})$ = signed & unnormalized **margin**

⟸ want $y_n(\text{voting score})$ **positive & large**
⇔ want $\exp(-y_n(\text{voting score}))$ **small**
⇔ want $u_n^{(T+1)}$ **small**

claim: AdaBoost **decreases** $\sum_{n=1}^{N} u_n^{(t)}$
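The chain of equivalences can be seen numerically: since $u_n^{(T+1)} = \frac{1}{N}\exp(-y_n \cdot \text{voting score})$, a larger signed margin always means a smaller final example weight. The scores below are synthetic stand-ins for a real ensemble:

```python
import numpy as np

N = 5
score = np.array([3.0, 1.0, -0.5, 2.0, 0.2])   # synthetic voting scores
y = np.array([1, 1, 1, -1, -1])
margin = y * score                             # signed, unnormalized margin
u_final = np.exp(-margin) / N                  # u_n^(T+1) per the slide

# sorting examples by increasing margin gives non-increasing weights
order = np.argsort(margin)
assert (np.diff(u_final[order]) <= 0).all()
```

Large-margin examples thus contribute almost nothing to $\sum_n u_n^{(T+1)}$, so driving that sum down pushes margins up.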


## AdaBoost Error Function

claim: AdaBoost **decreases** $\sum_{n=1}^{N} u_n^{(t)}$ and thus somewhat **minimizes**

$$
\sum_{n=1}^{N} u_n^{(T+1)} = \frac{1}{N}\sum_{n=1}^{N} \exp\Bigg(-y_n \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}_n)\Bigg)
$$

linear score $s = \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}_n)$

- $\text{err}_{0/1}(s, y) = [\![ys \le 0]\!]$
- $\widehat{\text{err}}_{\text{ADA}}(s, y) = \exp(-ys)$: upper bound of $\text{err}_{0/1}$, called the **exponential error measure**

[figure: $\text{err}_{0/1}$ and $\widehat{\text{err}}_{\text{ADA}}$ plotted against $ys$ on $[-3, 3]$]

$\widehat{\text{err}}_{\text{ADA}}$: **algorithmic error measure** by **convex upper bound** of $\text{err}_{0/1}$
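The upper-bound claim in the bullet list can be verified pointwise: $\exp(-ys) \ge [\![ys \le 0]\!]$ for every $ys$, with the two curves meeting at $ys = 0$, and $\exp(-ys)$ is convex. A quick numerical check; the grid of $ys$ values is arbitrary, chosen to match the slide's axis range:

```python
import numpy as np

ys = np.linspace(-3, 3, 601)          # grid of y*s values, as on the slide's axis
err01 = (ys <= 0).astype(float)       # err_0/1(s, y) = [[ys <= 0]]
err_ada = np.exp(-ys)                 # exponential error measure exp(-ys)

# exp(-ys) upper-bounds the 0/1 error everywhere...
assert np.all(err_ada >= err01)

# ...and is convex (midpoint of any chord lies above the curve),
# so it is a convex upper bound suitable for optimization
mid = 0.5 * (err_ada[:-2] + err_ada[2:])
assert np.all(mid >= err_ada[1:-1] - 1e-12)
```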

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 9/25


## Gradient Descent on AdaBoost Error Function

recall: gradient descent (remember? :-)), at iteration $t$

$$
\min_{\|\mathbf{v}\|=1} E_{\text{in}}(\mathbf{w}_t + \eta\mathbf{v}) \approx \underbrace{E_{\text{in}}(\mathbf{w}_t)}_{\text{known}} + \underbrace{\eta}_{\text{given positive}} \mathbf{v}^T \underbrace{\nabla E_{\text{in}}(\mathbf{w}_t)}_{\text{known}}
$$

at iteration $t$, to find $g_t$, solve

$$
\begin{aligned}
\min_{h}\; \widehat{E}_{\text{ADA}}
&= \frac{1}{N}\sum_{n=1}^{N} \exp\Bigg(-y_n \Bigg(\sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(\mathbf{x}_n) + \eta h(\mathbf{x}_n)\Bigg)\Bigg) \\
&= \sum_{n=1}^{N} u_n^{(t)} \exp\big(-y_n \eta h(\mathbf{x}_n)\big) \\
&\overset{\text{Taylor}}{\approx} \sum_{n=1}^{N} u_n^{(t)} \big(1 - y_n \eta h(\mathbf{x}_n)\big)
= \sum_{n=1}^{N} u_n^{(t)} - \eta \sum_{n=1}^{N} u_n^{(t)} y_n h(\mathbf{x}_n)
\end{aligned}
$$

good $h$: minimize $\sum_{n=1}^{N} u_n^{(t)} \big(-y_n h(\mathbf{x}_n)\big)$
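For binary $h(\mathbf{x}_n) \in \{-1, +1\}$, the final objective rewrites as $\sum_n u_n^{(t)}(-y_n h(\mathbf{x}_n)) = -\sum_n u_n^{(t)} + 2\sum_n u_n^{(t)} [\![y_n \ne h(\mathbf{x}_n)]\!]$, so the best direction $h$ is exactly the hypothesis with the smallest weighted 0/1 error, which a weighted base learner such as a decision stump returns. A sketch checking the identity on toy data; the labels, hypothesis outputs, and weights are made up for illustration:

```python
import numpy as np

# toy data: labels y_n, a binary hypothesis h(x_n), and current weights u_n^{(t)}
y = np.array([+1, -1, +1, +1, -1])
h = np.array([+1, +1, -1, +1, -1])
u = np.array([0.1, 0.2, 0.3, 0.25, 0.15])

lhs = np.sum(u * (-y * h))              # objective from the slide
weighted_err = np.sum(u * (y != h))     # weighted 0/1 error of h
rhs = -np.sum(u) + 2 * weighted_err     # rewritten form

# the two expressions agree, so minimizing sum u_n (-y_n h(x_n)) over h
# is the same as minimizing the weighted 0/1 error of h
assert np.isclose(lhs, rhs)
```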


Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/25
