Gradient Boosted Decision Tree Gradient Boosting
Deciding Blending Weight as Optimization

after finding g_t = h, fix g_t and optimize the blending weight alone (the inner minimization over h has already been solved):

    \min_{\eta}\; \frac{1}{N} \sum_{n=1}^{N} \mathrm{err}\Big( \underbrace{\sum_{\tau=1}^{t-1} \alpha_\tau\, g_\tau(x_n)}_{s_n} + \eta\, g_t(x_n),\; y_n \Big)
    \quad\text{with } \mathrm{err}(s, y) = (s - y)^2

    \min_{\eta}\; \frac{1}{N} \sum_{n=1}^{N} \big( s_n + \eta\, g_t(x_n) - y_n \big)^2
    \;=\; \frac{1}{N} \sum_{n=1}^{N} \big( (y_n - s_n) - \eta\, g_t(x_n) \big)^2

—one-variable linear regression on {(g_t-transformed input, residual)}

GradientBoost for regression: α_t = optimal η by g_t-transformed linear regression
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 17/25
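This blending-weight step has a closed form. Below is a minimal NumPy sketch (the function name optimal_eta and its interface are illustrative, not from the lecture) that returns the η minimizing (1/N) Σ_n ((y_n − s_n) − η g_t(x_n))²:

    import numpy as np

    def optimal_eta(g_values, residuals):
        """One-variable linear regression through the origin: fit residual ≈ η · g_t(x).

        g_values  : g_t(x_n) for n = 1..N  (the g_t-transformed inputs)
        residuals : y_n − s_n for n = 1..N (the current residuals)
        """
        g = np.asarray(g_values, dtype=float)
        r = np.asarray(residuals, dtype=float)
        denom = np.dot(g, g)
        if denom == 0.0:
            # g_t is zero on every example, so any η gives the same error; pick 0
            return 0.0
        return np.dot(g, r) / denom   # Σ g_t(x_n)(y_n − s_n) / Σ g_t²(x_n)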
Gradient Boosted Decision Tree Gradient Boosting
Putting Everything Together

Gradient Boosted Decision Tree (GBDT)

s_1 = s_2 = ... = s_N = 0
for t = 1, 2, ..., T
  1. obtain g_t by A({(x_n, y_n − s_n)}), where A is a (squared-error) regression algorithm
     —how about a sampled and pruned C&RT?
  2. compute α_t = OneVarLinearRegression({(g_t(x_n), y_n − s_n)})
  3. update s_n ← s_n + α_t g_t(x_n)
return G(x) = Σ_{t=1}^{T} α_t g_t(x)

GBDT: 'regression sibling' of AdaBoost-DTree—popular in practice
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 18/25
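A minimal Python sketch of this loop, assuming scikit-learn's DecisionTreeRegressor as a stand-in for the base regression algorithm A (the lecture suggests a sampled and pruned C&RT; the names gbdt_fit and gbdt_predict are illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gbdt_fit(X, y, T=100, max_depth=3):
        """Fit the GBDT ensemble; returns a list of (alpha_t, g_t) pairs."""
        s = np.zeros(len(y))                       # s_1 = ... = s_N = 0
        ensemble = []
        for t in range(T):
            residual = y - s                       # y_n − s_n
            g_t = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
            g = g_t.predict(X)                     # g_t(x_n) on the training points
            gg = np.dot(g, g)
            alpha_t = np.dot(g, residual) / gg if gg > 0 else 0.0   # one-variable linear regression
            s = s + alpha_t * g                    # s_n ← s_n + α_t g_t(x_n)
            ensemble.append((alpha_t, g_t))
        return ensemble

    def gbdt_predict(ensemble, X):
        """G(x) = Σ_t α_t g_t(x)."""
        return sum(alpha_t * g_t.predict(X) for alpha_t, g_t in ensemble)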
Gradient Boosted Decision Tree Gradient Boosting
Fun Time

Which of the following is the optimal η for

    \min_{\eta}\; \frac{1}{N} \sum_{n=1}^{N} \big( (y_n - s_n) - \eta\, g_t(x_n) \big)^2 ?

1. ( Σ_{n=1}^{N} g_t(x_n)(y_n − s_n) ) · ( Σ_{n=1}^{N} g_t²(x_n) )
2. ( Σ_{n=1}^{N} g_t(x_n)(y_n − s_n) ) / ( Σ_{n=1}^{N} g_t²(x_n) )
3. ( Σ_{n=1}^{N} g_t(x_n)(y_n − s_n) ) + ( Σ_{n=1}^{N} g_t²(x_n) )
4. ( Σ_{n=1}^{N} g_t(x_n)(y_n − s_n) ) − ( Σ_{n=1}^{N} g_t²(x_n) )

Reference Answer: 2

Derived within Lecture 9 of ML Foundations, remember? :-)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 19/25
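For completeness, a short check of the reference answer: setting the derivative of the objective with respect to η to zero gives the closed form in option 2.

    \frac{\partial}{\partial \eta}\, \frac{1}{N}\sum_{n=1}^{N}\big((y_n - s_n) - \eta\, g_t(x_n)\big)^2
      = -\frac{2}{N}\sum_{n=1}^{N} g_t(x_n)\big((y_n - s_n) - \eta\, g_t(x_n)\big) = 0
    \quad\Longrightarrow\quad
    \eta^{\ast} = \frac{\sum_{n=1}^{N} g_t(x_n)\,(y_n - s_n)}{\sum_{n=1}^{N} g_t^{2}(x_n)}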
Gradient Boosted Decision Tree Summary of Aggregation Models
Map of Blending Models

blending: aggregate after getting diverse g_t
  - uniform: simple voting/averaging of g_t
  - non-uniform: linear model on g_t-transformed inputs
  - conditional: nonlinear model on g_t-transformed inputs

uniform for 'stability'; non-uniform/conditional carefully for 'complexity'
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 20/25
Gradient Boosted Decision Tree Summary of Aggregation Models
Map of Aggregation-Learning Models

learning: aggregate as well as getting diverse g_t
  - Bagging: diverse g_t by bootstrapping; uniform vote by nothing :-)
  - AdaBoost: diverse g_t by reweighting; linear vote by steepest search
  - Decision Tree: diverse g_t by data splitting; conditional vote by branching
  - GradientBoost: diverse g_t by residual fitting; linear vote by steepest search

boosting-like algorithms most popular
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 21/25
Gradient Boosted Decision Tree Summary of Aggregation Models
Map of Aggregation of Aggregation Models

combining Bagging, AdaBoost, and Decision Tree:
  - Random Forest: randomized bagging + 'strong' DTree
  - AdaBoost-DTree: AdaBoost + 'weak' DTree