Machine Learning Techniques (機器學習技法)
Lecture 11: Gradient Boosted Decision Tree
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Gradient Boosted Decision Tree
Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models

Lecture 10: Random Forest
bagging of randomized C&RT trees with automatic validation and feature selection

Lecture 11: Gradient Boosted Decision Tree
Adaptive Boosted Decision Tree
Optimization View of AdaBoost
Gradient Boosting
Summary of Aggregation Models

3 Distilling Implicit Features: Extraction Models
Adaptive Boosted Decision Tree
From Random Forest to AdaBoost-DTree
function RandomForest(D)
  For t = 1, 2, ..., T
    1. request size-N' data D̃_t by bootstrapping with D
    2. obtain tree g_t by Randomized-DTree(D̃_t)
  return G = Uniform({g_t})

function AdaBoost-DTree(D)
  For t = 1, 2, ..., T
    1. reweight data by u^(t)
    2. obtain tree g_t by DTree(D, u^(t))
    3. calculate 'vote' α_t of g_t
  return G = LinearHypo({(g_t, α_t)})

need: weighted DTree(D, u^(t))
Weighted Decision Tree Algorithm
Weighted Algorithm
minimize (regularized) $E_{\mathrm{in}}^{\mathbf{u}}(h) = \frac{1}{N}\sum_{n=1}^{N} u_n \cdot \mathrm{err}(y_n, h(\mathbf{x}_n))$

if using an existing algorithm as a black box (no modifications), how to get $E_{\mathrm{in}}^{\mathbf{u}}$ approximately optimized?

'Weighted' Algorithm in Bagging
weights u expressed by bootstrap-sampled copies
—request size-N' data D̃_t by bootstrapping with D

A General Randomized Base Algorithm
weights u expressed by sampling proportional to u_n
—request size-N' data D̃_t by sampling ∝ u on D

AdaBoost-DTree: often via AdaBoost + sampling ∝ u^(t) + DTree(D̃_t), without modifying DTree
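A minimal sketch of expressing weights by sampling (my own illustration; the helper name sample_by_weight and the NumPy usage are not from the lecture):

import numpy as np

# a minimal sketch of expressing example weights by sampling proportional to u_n
# (helper name sample_by_weight is illustrative, not from the lecture)
def sample_by_weight(X, y, u, size, rng=None):
    rng = rng or np.random.default_rng()
    p = u / u.sum()                            # sampling probabilities ∝ u_n
    idx = rng.choice(len(y), size=size, p=p)   # request a size-N' sample D̃_t from D
    return X[idx], y[idx]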
Weak Decision Tree Algorithm
AdaBoost:
votes $\alpha_t = \ln \blacklozenge_t = \ln\sqrt{\frac{1-\epsilon_t}{\epsilon_t}}$ with weighted error rate $\epsilon_t$

⟹ if fully grown tree trained on all x_n
⟹ E_in(g_t) = 0 if all x_n different
⟹ E_in^u(g_t) = 0
⟹ ε_t = 0
⟹ α_t = ∞ (autocracy!!)

need: pruned tree trained on some x_n to be weak
• pruned: usual pruning, or just limiting tree height
• some: sampling ∝ u^(t)

AdaBoost-DTree: often via AdaBoost + sampling ∝ u^(t) + pruned DTree(D̃_t)
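A minimal sketch of this recipe (assuming scikit-learn's DecisionTreeClassifier as the height-limited 'pruned' tree, labels in {−1, +1}, and reusing sample_by_weight from above; all names, parameter values, and the early-stopping choice are illustrative, not prescribed by the lecture):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# a minimal AdaBoost-DTree sketch: sampling ∝ u^(t) + height-limited ('pruned') C&RT,
# without modifying the tree algorithm
def adaboost_dtree(X, y, T=100, max_depth=3, rng=None):
    rng = rng or np.random.default_rng()
    N = len(y)
    u = np.full(N, 1.0 / N)                                   # u^(1) = 1/N
    trees, alphas = [], []
    for _ in range(T):
        Xs, ys = sample_by_weight(X, y, u, size=N, rng=rng)   # D̃_t sampled ∝ u^(t)
        g = DecisionTreeClassifier(max_depth=max_depth).fit(Xs, ys)
        pred = g.predict(X)
        eps = np.sum(u * (pred != y)) / np.sum(u)             # weighted error rate ε_t on (D, u^(t))
        if eps == 0 or eps >= 0.5:                            # degenerate vote: stop early (a practical choice)
            break
        diamond = np.sqrt((1 - eps) / eps)
        alpha = np.log(diamond)                               # α_t = ln ♦_t
        u *= np.where(pred != y, diamond, 1.0 / diamond)      # reweight to u^(t+1)
        trees.append(g)
        alphas.append(alpha)
    return trees, alphas

def adaboost_predict(trees, alphas, X):
    votes = sum(a * t.predict(X) for a, t in zip(alphas, trees))
    return np.sign(votes)                                     # G(x) = sign(Σ α_t g_t(x))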
AdaBoost with Extremely-Pruned Tree

what if DTree with height ≤ 1 (extremely pruned)?

DTree (C&RT) with height ≤ 1: learn branching criteria
$b(\mathbf{x}) = \mathop{\mathrm{argmin}}_{\text{decision stumps } h(\mathbf{x})} \sum_{c=1}^{2} |D_c \text{ with } h| \cdot \mathrm{impurity}(D_c \text{ with } h)$

—if impurity = binary classification error, just a decision stump, remember? :-)

AdaBoost-Stump = special case of AdaBoost-DTree
Fun Time
When running AdaBoost-DTree with sampling, suppose the decision tree g_t achieves zero error on the sampled data set D̃_t. Which of the following is possible?
1 α_t < 0
2 α_t = 0
3 α_t > 0
4 all of the above

Reference Answer: 4

While g_t achieves zero error on D̃_t, g_t may not achieve zero weighted error on (D, u^(t)), and hence ε_t can be anything, even ≥ 1/2. Then, α_t can be ≤ 0.
Optimization View of AdaBoost
Example Weights of AdaBoost
$$u_n^{(t+1)} = \begin{cases} u_n^{(t)} \cdot \blacklozenge_t & \text{if incorrect} \\ u_n^{(t)} / \blacklozenge_t & \text{if correct} \end{cases} \;=\; u_n^{(t)} \cdot \blacklozenge_t^{-y_n g_t(\mathbf{x}_n)} \;=\; u_n^{(t)} \cdot \exp\left(-y_n \alpha_t g_t(\mathbf{x}_n)\right)$$

$$u_n^{(T+1)} = u_n^{(1)} \cdot \prod_{t=1}^{T} \exp\left(-y_n \alpha_t g_t(\mathbf{x}_n)\right) = \frac{1}{N} \cdot \exp\left(-y_n \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}_n)\right)$$

• recall: $G(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t g_t(\mathbf{x})\right)$
• $\sum_{t=1}^{T} \alpha_t g_t(\mathbf{x})$: voting score of {g_t} on x

AdaBoost: $u_n^{(T+1)} \propto \exp\left(-y_n (\text{voting score on } \mathbf{x}_n)\right)$
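The equivalence of the two update forms can be checked numerically; a tiny sketch of my own, with made-up values and labels/predictions in {−1, +1}:

import numpy as np

# check that multiplying/dividing by ♦_t equals multiplying by exp(-y_n α_t g_t(x_n))
eps = 0.25
diamond = np.sqrt((1 - eps) / eps)
alpha = np.log(diamond)
y = np.array([+1, +1, -1, -1])          # labels y_n
g = np.array([+1, -1, -1, +1])          # predictions g_t(x_n): 2 correct, 2 incorrect
u = np.array([0.1, 0.2, 0.3, 0.4])      # example weights u_n^(t)

u_rule = u * np.where(g != y, diamond, 1.0 / diamond)    # multiply by ♦_t if incorrect, divide if correct
u_exp  = u * np.exp(-y * alpha * g)                      # u_n^(t) · exp(-y_n α_t g_t(x_n))
assert np.allclose(u_rule, u_exp)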
Voting Score and Margin
linear blending = LinModel + hypotheses as transform + constraints (crossed out)

$$G(\mathbf{x}_n) = \mathrm{sign}\Bigg(\underbrace{\sum_{t=1}^{T} \underbrace{\alpha_t}_{w_i} \underbrace{g_t(\mathbf{x}_n)}_{\phi_i(\mathbf{x}_n)}}_{\text{voting score}}\Bigg)$$

and hard-margin SVM margin $= \frac{y_n \left(\mathbf{w}^T \phi(\mathbf{x}_n) + b\right)}{\|\mathbf{w}\|}$, remember? :-)

y_n (voting score) = signed & unnormalized margin
⟸ want y_n (voting score) positive & large
⇔ want exp(−y_n (voting score)) small
⇔ want u_n^(T+1) small

claim: AdaBoost decreases $\sum_{n=1}^{N} u_n^{(t)}$
AdaBoost Error Function

claim: AdaBoost decreases $\sum_{n=1}^{N} u_n^{(t)}$ and thus somewhat minimizes

$$\sum_{n=1}^{N} u_n^{(T+1)} = \frac{1}{N} \sum_{n=1}^{N} \exp\left(-y_n \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}_n)\right)$$

linear score $s = \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}_n)$
• err_0/1(s, y) = [[ys ≤ 0]]
• $\widehat{\mathrm{err}}_{\mathrm{ADA}}(s, y) = \exp(-ys)$: upper bound of err_0/1 (since ys ≤ 0 implies exp(−ys) ≥ 1, and exp(−ys) > 0 always) —called exponential error measure

[figure: err_0/1 and êrr_ADA plotted against ys]

êrr_ADA: algorithmic error measure by convex upper bound of err_0/1
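A quick numeric sanity check of the upper-bound claim (my own sketch, not part of the slides):

import numpy as np

# verify err_0/1(s, y) = [[ys <= 0]] is upper-bounded by err_ADA(s, y) = exp(-ys)
ys = np.linspace(-3.0, 3.0, 601)            # the product y*s on a grid
err01 = (ys <= 0).astype(float)             # 0/1 error
err_ada = np.exp(-ys)                       # exponential error measure
assert np.all(err_ada >= err01)             # convex upper bound holds everywhere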
Gradient Descent on AdaBoost Error Function

recall: gradient descent (remember? :-)), at iteration t

$$\min_{\|\mathbf{v}\|=1} E_{\mathrm{in}}(\mathbf{w}_t + \eta\mathbf{v}) \approx \underbrace{E_{\mathrm{in}}(\mathbf{w}_t)}_{\text{known}} + \underbrace{\eta}_{\text{given positive}} \underbrace{\mathbf{v}^T \nabla E_{\mathrm{in}}(\mathbf{w}_t)}_{\text{known}}$$

at iteration t, to find g_t, solve

$$\min_h \widehat{E}_{\mathrm{ADA}} = \frac{1}{N} \sum_{n=1}^{N} \exp\left(-y_n \left(\sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(\mathbf{x}_n) + \eta h(\mathbf{x}_n)\right)\right) = \sum_{n=1}^{N} u_n^{(t)} \exp\left(-y_n \eta h(\mathbf{x}_n)\right)$$

$$\overset{\text{taylor}}{\approx} \sum_{n=1}^{N} u_n^{(t)} \left(1 - y_n \eta h(\mathbf{x}_n)\right) = \sum_{n=1}^{N} u_n^{(t)} - \eta \sum_{n=1}^{N} u_n^{(t)} y_n h(\mathbf{x}_n)$$

good h: minimize $\sum_{n=1}^{N} u_n^{(t)} \left(-y_n h(\mathbf{x}_n)\right)$
Learning Hypothesis as Optimization
finding good h (function direction) ⇔ minimize $\sum_{n=1}^{N} u_n^{(t)} \left(-y_n h(\mathbf{x}_n)\right)$

for binary classification, where y_n and h(x_n) are both ∈ {−1, +1}:

$$\sum_{n=1}^{N} u_n^{(t)} \left(-y_n h(\mathbf{x}_n)\right) = \sum_{n=1}^{N} u_n^{(t)} \cdot \begin{cases} -1 & \text{if } y_n = h(\mathbf{x}_n) \\ +1 & \text{if } y_n \ne h(\mathbf{x}_n) \end{cases}$$

$$= -\sum_{n=1}^{N} u_n^{(t)} + \sum_{n=1}^{N} u_n^{(t)} \cdot \begin{cases} 0 & \text{if } y_n = h(\mathbf{x}_n) \\ 2 & \text{if } y_n \ne h(\mathbf{x}_n) \end{cases} = -\sum_{n=1}^{N} u_n^{(t)} + 2\, E_{\mathrm{in}}^{\mathbf{u}^{(t)}}(h) \cdot N$$

—who minimizes $E_{\mathrm{in}}^{\mathbf{u}^{(t)}}(h)$? A in AdaBoost! :-)

A: good g_t = h for 'gradient descent'
Deciding Blending Weight as Optimization

AdaBoost finds g_t by approximately solving $\min_h \widehat{E}_{\mathrm{ADA}} = \sum_{n=1}^{N} u_n^{(t)} \exp\left(-y_n \eta h(\mathbf{x}_n)\right)$

after finding g_t, how about $\min_\eta \widehat{E}_{\mathrm{ADA}} = \sum_{n=1}^{N} u_n^{(t)} \exp\left(-y_n \eta g_t(\mathbf{x}_n)\right)$?

• optimal η_t somewhat 'greedily faster' than a fixed (small) η —called steepest descent for optimization
• two cases inside the summation:
  • y_n = g_t(x_n): u_n^(t) exp(−η) (correct)
  • y_n ≠ g_t(x_n): u_n^(t) exp(+η) (incorrect)
• $\widehat{E}_{\mathrm{ADA}} = \sum_{n=1}^{N} u_n^{(t)} \cdot \left((1-\epsilon_t)\exp(-\eta) + \epsilon_t \exp(+\eta)\right)$

by solving $\frac{\partial \widehat{E}_{\mathrm{ADA}}}{\partial \eta} = 0$, steepest $\eta_t = \ln\sqrt{\frac{1-\epsilon_t}{\epsilon_t}} = \alpha_t$, remember? :-)

—AdaBoost: steepest descent with approximate functional gradient
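Filling in the omitted algebra, setting the derivative to zero gives

$$\frac{\partial \widehat{E}_{\mathrm{ADA}}}{\partial \eta} = \sum_{n=1}^{N} u_n^{(t)} \cdot \left(-(1-\epsilon_t)\exp(-\eta) + \epsilon_t \exp(+\eta)\right) = 0 \;\Rightarrow\; \exp(2\eta) = \frac{1-\epsilon_t}{\epsilon_t} \;\Rightarrow\; \eta_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} = \ln\sqrt{\frac{1-\epsilon_t}{\epsilon_t}} = \alpha_t$$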
Fun Time

With $\widehat{E}_{\mathrm{ADA}} = \sum_{n=1}^{N} u_n^{(t)} \cdot \left((1-\epsilon_t)\exp(-\eta) + \epsilon_t\exp(+\eta)\right)$, which of the following is $\frac{\partial \widehat{E}_{\mathrm{ADA}}}{\partial \eta}$ that can be used for solving the optimal η_t?
1 $\sum_{n=1}^{N} u_n^{(t)} \cdot \left(+(1-\epsilon_t)\exp(-\eta) + \epsilon_t\exp(+\eta)\right)$
2 $\sum_{n=1}^{N} u_n^{(t)} \cdot \left(+(1-\epsilon_t)\exp(-\eta) - \epsilon_t\exp(+\eta)\right)$
3 $\sum_{n=1}^{N} u_n^{(t)} \cdot \left(-(1-\epsilon_t)\exp(-\eta) + \epsilon_t\exp(+\eta)\right)$
4 $\sum_{n=1}^{N} u_n^{(t)} \cdot \left(-(1-\epsilon_t)\exp(-\eta) - \epsilon_t\exp(+\eta)\right)$

Reference Answer: 3

Differentiate exp(−η) and exp(+η) with respect to η and you should easily get the result.
Gradient Boosting
Gradient Boosting for Arbitrary Error Function
AdaBoost:

$$\min_\eta \min_h \frac{1}{N} \sum_{n=1}^{N} \exp\left(-y_n \left(\sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(\mathbf{x}_n) + \eta h(\mathbf{x}_n)\right)\right)$$

with binary-output hypothesis h

GradientBoost:

$$\min_\eta \min_h \frac{1}{N} \sum_{n=1}^{N} \mathrm{err}\left(\sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(\mathbf{x}_n) + \eta h(\mathbf{x}_n),\; y_n\right)$$

with any hypothesis h (usually a real-output hypothesis)

GradientBoost: allows extension to different err for regression/soft classification/etc.
GradientBoost for Regression
$$\min_\eta \min_h \frac{1}{N} \sum_{n=1}^{N} \mathrm{err}\Big(\underbrace{\sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(\mathbf{x}_n)}_{s_n} + \eta h(\mathbf{x}_n),\; y_n\Big) \quad \text{with } \mathrm{err}(s, y) = (s-y)^2$$

$$\min_h \;\ldots\; \overset{\text{taylor}}{\approx} \min_h \frac{1}{N} \sum_{n=1}^{N} \underbrace{\mathrm{err}(s_n, y_n)}_{\text{constant}} + \frac{1}{N} \sum_{n=1}^{N} \eta h(\mathbf{x}_n) \left.\frac{\partial\, \mathrm{err}(s, y_n)}{\partial s}\right|_{s=s_n} = \min_h \text{constants} + \frac{\eta}{N} \sum_{n=1}^{N} h(\mathbf{x}_n) \cdot 2(s_n - y_n)$$

naïve solution: h(x_n) = −∞ · (s_n − y_n) if no constraint on h
Learning Hypothesis as Optimization
$$\min_h \text{constants} + \frac{\eta}{N} \sum_{n=1}^{N} 2 h(\mathbf{x}_n)(s_n - y_n)$$

• magnitude of h does not matter: because η will be optimized next
• penalize large magnitude to avoid the naïve solution

$$\min_h \text{constants} + \frac{\eta}{N} \sum_{n=1}^{N} \left(2 h(\mathbf{x}_n)(s_n - y_n) + (h(\mathbf{x}_n))^2\right) = \text{constants} + \frac{\eta}{N} \sum_{n=1}^{N} \left(\text{constant} + \big(h(\mathbf{x}_n) - (y_n - s_n)\big)^2\right)$$

• solution of the penalized approximate functional gradient: squared-error regression on $\{(\mathbf{x}_n, \underbrace{y_n - s_n}_{\text{residual}})\}$

GradientBoost for regression: find g_t = h by regression with residuals
Deciding Blending Weight as Optimization
after finding g_t = h,

$$\min_\eta \frac{1}{N} \sum_{n=1}^{N} \mathrm{err}\Big(\underbrace{\sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(\mathbf{x}_n)}_{s_n} + \eta g_t(\mathbf{x}_n),\; y_n\Big) \quad \text{with } \mathrm{err}(s, y) = (s-y)^2$$

$$\min_\eta \frac{1}{N} \sum_{n=1}^{N} \left(s_n + \eta g_t(\mathbf{x}_n) - y_n\right)^2 = \frac{1}{N} \sum_{n=1}^{N} \left((y_n - s_n) - \eta g_t(\mathbf{x}_n)\right)^2$$

—one-variable linear regression on {(g_t-transformed input, residual)}

GradientBoost for regression: α_t = optimal η by g_t-transformed linear regression
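In closed form, the one-variable linear regression is a single line of code (a sketch of my own; the helper name optimal_eta is not from the lecture):

import numpy as np

# one-variable linear regression without intercept:
# alpha_t = sum_n g_t(x_n)(y_n - s_n) / sum_n g_t(x_n)^2
def optimal_eta(g_values, residuals):
    """g_values: g_t(x_n) for all n; residuals: y_n - s_n for all n."""
    return np.dot(g_values, residuals) / np.dot(g_values, g_values)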
Putting Everything Together

Gradient Boosted Decision Tree (GBDT)

s_1 = s_2 = ... = s_N = 0
for t = 1, 2, ..., T
  1. obtain g_t by A({(x_n, y_n − s_n)}), where A is a (squared-error) regression algorithm —how about sampled and pruned C&RT?
  2. compute α_t = OneVarLinearRegression({(g_t(x_n), y_n − s_n)})
  3. update s_n ← s_n + α_t g_t(x_n)
return G(x) = Σ_{t=1}^T α_t g_t(x)

GBDT: 'regression sibling' of AdaBoost-DTree —popular in practice
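A minimal GBDT sketch following the algorithm above (assuming scikit-learn's DecisionTreeRegressor as the pruned C&RT regression algorithm A; T, max_depth, and all names are illustrative choices, not prescribed by the lecture):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, T=100, max_depth=3):
    s = np.zeros(len(y))                           # s_1 = ... = s_N = 0
    trees, alphas = [], []
    for _ in range(T):
        residual = y - s                           # (1) regression targets y_n - s_n
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        g = tree.predict(X)
        denom = np.dot(g, g)
        if denom == 0:                             # residuals already fit perfectly
            break
        alpha = np.dot(g, residual) / denom        # (2) one-variable linear regression
        s += alpha * g                             # (3) update s_n <- s_n + alpha_t g_t(x_n)
        trees.append(tree)
        alphas.append(alpha)
    return trees, alphas

def gbdt_predict(trees, alphas, X):
    return sum(a * t.predict(X) for a, t in zip(alphas, trees))   # G(x) = sum_t alpha_t g_t(x)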
Fun Time

Which of the following is the optimal η for $\min_\eta \frac{1}{N} \sum_{n=1}^{N} \left((y_n - s_n) - \eta g_t(\mathbf{x}_n)\right)^2$?
1 $\left(\sum_{n=1}^{N} g_t(\mathbf{x}_n)(y_n - s_n)\right) \cdot \left(\sum_{n=1}^{N} g_t^2(\mathbf{x}_n)\right)$
2 $\left(\sum_{n=1}^{N} g_t(\mathbf{x}_n)(y_n - s_n)\right) / \left(\sum_{n=1}^{N} g_t^2(\mathbf{x}_n)\right)$
3 $\left(\sum_{n=1}^{N} g_t(\mathbf{x}_n)(y_n - s_n)\right) + \left(\sum_{n=1}^{N} g_t^2(\mathbf{x}_n)\right)$
4 $\left(\sum_{n=1}^{N} g_t(\mathbf{x}_n)(y_n - s_n)\right) - \left(\sum_{n=1}^{N} g_t^2(\mathbf{x}_n)\right)$

Reference Answer: 2

Derived within Lecture 9 of ML Foundations, remember? :-)
Summary of Aggregation Models
Map of Blending Models
blending: aggregate after getting diverse g_t

• uniform: simple voting/averaging of g_t
• non-uniform: linear model on g_t-transformed inputs
• conditional: nonlinear model on g_t-transformed inputs

uniform for 'stability'; non-uniform/conditional carefully for 'complexity'
Map of Aggregation-Learning Models
learning: aggregate as well as getting diverse g_t

• Bagging: diverse g_t by bootstrapping; uniform vote by nothing :-)
• AdaBoost: diverse g_t by reweighting; linear vote by steepest search
• Decision Tree: diverse g_t by data splitting; conditional vote by branching
• GradientBoost: diverse g_t by residual fitting; linear vote by steepest search

boosting-like algorithms most popular
Map of Aggregation of Aggregation Models
combining Bagging / AdaBoost / GradientBoost with Decision Tree:

• Random Forest = randomized bagging + 'strong' DTree
• AdaBoost-DTree = AdaBoost + 'weak' DTree
• GBDT = GradientBoost + 'weak' DTree

all three frequently used in practice
Specialty of Aggregation Models
cure underfitting
• G(x) 'strong'
• aggregation ⟹ feature transform

cure overfitting
• G(x) 'moderate'
• aggregation ⟹ regularization

proper aggregation (a.k.a. 'ensemble') ⟹ better performance
Fun Time
Which of the following aggregation models learns diverse g_t by reweighting and calculates the linear vote by steepest search?
1 AdaBoost
2 Random Forest
3 Decision Tree
4 Linear Blending

Reference Answer: 1

Congratulations on being an expert in aggregation models! :-)