Machine Learning Techniques (機器學習技法)
Lecture 7: Blending and Bagging
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Blending and Bagging
Roadmap
1 Embedding Numerous Features: Kernel Models
Lecture 6: Support Vector Regression
kernel ridge regression (dense) via ridge regression + representer theorem;
support vector regression (sparse) via regularized tube error + Lagrange dual
2 Combining Predictive Features: Aggregation Models
Lecture 7: Blending and Bagging
Motivation of Aggregation
Uniform Blending
Linear and Any Blending
Bagging (Bootstrap Aggregation)
3 Distilling Implicit Features: Extraction Models
Blending and Bagging Motivation of Aggregation
An Aggregation Story
Your T friends g_1, · · · , g_T predict whether the stock will go up, as g_t(x). You can . . .
• select the most trustworthy friend based on their usual performance
—validation!
• mix the predictions from all your friends uniformly
—let them vote!
• mix the predictions from all your friends non-uniformly
—let them vote, but give some friends more ballots
• combine the predictions conditionally
—if [t satisfies some condition], give some ballots to friend t
• . . .
aggregation models: mix or combine hypotheses (for better performance)
Blending and Bagging Motivation of Aggregation
Aggregation with Math Notations
Your T friends g_1, · · · , g_T predict whether the stock will go up, as g_t(x).
• select the most trustworthy friend based on their usual performance:
G(x) = g_{t*}(x) with t* = argmin_{t ∈ {1,2,··· ,T}} E_val(g_t^−)
• mix the predictions from all your friends uniformly:
G(x) = sign( Σ_{t=1}^{T} 1 · g_t(x) )
• mix the predictions from all your friends non-uniformly:
G(x) = sign( Σ_{t=1}^{T} α_t · g_t(x) ) with α_t ≥ 0
• include select: α_t = ⟦E_val(g_t^−) smallest⟧
• include uniformly: α_t = 1
• combine the predictions conditionally:
G(x) = sign( Σ_{t=1}^{T} q_t(x) · g_t(x) ) with q_t(x) ≥ 0
• include non-uniformly: q_t(x) = α_t
aggregation models: a rich family
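A rough sketch (not from the slides) of these four aggregation modes, assuming each hypothesis is available as a ±1-valued prediction vector and the validation errors E_val(g_t^−) are known:

```python
import numpy as np

def select(preds, e_val):
    """Selection: return the predictions of the g_t with the smallest validation error."""
    return preds[np.argmin(e_val)]

def uniform_blend(preds):
    """Uniform blending: G(x) = sign(sum_t 1 * g_t(x))."""
    return np.sign(preds.sum(axis=0))

def nonuniform_blend(preds, alpha):
    """Non-uniform blending: G(x) = sign(sum_t alpha_t * g_t(x)), with alpha_t >= 0."""
    return np.sign(alpha @ preds)

def conditional_blend(preds, q):
    """Conditional blending: q_t(x) >= 0 may depend on x (q has the same shape as preds)."""
    return np.sign((q * preds).sum(axis=0))

# toy usage: T = 3 hypotheses, each evaluated on 5 points, predictions in {-1, +1}
preds = np.array([[+1, +1, -1, -1, +1],
                  [+1, -1, -1, +1, +1],
                  [-1, -1, -1, +1, +1]])
e_val = np.array([0.30, 0.20, 0.40])   # assumed validation errors of g_1^-, g_2^-, g_3^-
alpha = np.array([1.0, 2.0, 0.5])      # assumed ballots
print(select(preds, e_val))
print(uniform_blend(preds))
print(nonuniform_blend(preds, alpha))
```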
Blending and Bagging Motivation of Aggregation
Recall: Selection by Validation
G(x) = g_{t*}(x) with t* = argmin_{t ∈ {1,2,··· ,T}} E_val(g_t^−)
• simple and popular
• what if we use E_in(g_t) instead of E_val(g_t^−)? complexity price on d_VC, remember? :-)
• need one strong g_t^− to guarantee small E_val (and small E_out)
selection: relies on one strong hypothesis
aggregation: can we do better with many (possibly weaker) hypotheses?
Blending and Bagging Motivation of Aggregation
Why Might Aggregation Work?
• mix different weak hypotheses uniformly—G(x) ‘strong’
• aggregation =⇒ feature transform (?)
• mix different random-PLA hypotheses uniformly—G(x) ‘moderate’
• aggregation =⇒ regularization (?)
proper aggregation =⇒ better performance
Blending and Bagging Motivation of Aggregation
Fun Time
Consider three decision stump hypotheses from R to {−1, +1}:
g_1(x) = sign(1 − x), g_2(x) = sign(1 + x), g_3(x) = −1.
When mixing the three hypotheses uniformly, what is the resulting G(x)?
1  2⟦|x| ≤ 1⟧ − 1
2  2⟦|x| ≥ 1⟧ − 1
3  2⟦x ≤ −1⟧ − 1
4  2⟦x ≥ +1⟧ − 1
Reference Answer: 1
The ‘region’ that gets two positive votes from g_1 and g_2 is |x| ≤ 1, and thus G(x) is positive only within that region. We see that the three decision stumps g_t can be aggregated to form a more sophisticated hypothesis G.
Blending and Bagging Uniform Blending
Uniform Blending (Voting) for Classification
uniform blending: known g_t, each with 1 ballot
G(x) = sign( Σ_{t=1}^{T} 1 · g_t(x) )
• same g_t (autocracy): as good as one single g_t
• very different g_t (diversity + democracy): majority can correct minority
• similar results with uniform voting for multiclass:
G(x) = argmax_{1 ≤ k ≤ K} Σ_{t=1}^{T} ⟦g_t(x) = k⟧
how about regression?
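A minimal sketch of the multiclass uniform vote G(x) = argmax_k Σ_t ⟦g_t(x) = k⟧ at a single point x, assuming labels are 1, . . . , K:

```python
import numpy as np

def uniform_vote_multiclass(votes, num_classes):
    """votes: length-T array with g_1(x), ..., g_T(x), each a label in 1..K."""
    counts = np.bincount(votes, minlength=num_classes + 1)  # counts[k] = #{t : g_t(x) = k}
    return int(counts[1:].argmax()) + 1                     # ties broken toward the smaller label

# toy usage: T = 5 hypotheses vote on one point, K = 3 classes
print(uniform_vote_multiclass(np.array([2, 3, 2, 1, 2]), num_classes=3))  # -> 2
```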
Blending and Bagging Uniform Blending
Uniform Blending for Regression
G(x) = (1/T) Σ_{t=1}^{T} g_t(x)
• same g_t (autocracy): as good as one single g_t
• very different g_t (diversity + democracy):
=⇒ some g_t(x) > f(x), some g_t(x) < f(x)
=⇒ the average could be more accurate than the individuals
diverse hypotheses: even simple uniform blending can be better than any single hypothesis
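A tiny illustration (with made-up numbers, not from the slides) of the overshoot/undershoot intuition: the uniform average lands closer to f(x) than any single g_t(x):

```python
import numpy as np

f_x = 3.0                              # assumed target value f(x) at one point x
g_x = np.array([2.2, 2.7, 3.6, 3.9])   # assumed predictions g_1(x), ..., g_4(x)

G_x = g_x.mean()                       # G(x) = (1/T) * sum_t g_t(x) = 3.1
print(np.abs(g_x - f_x))               # individual errors: [0.8 0.3 0.6 0.9]
print(abs(G_x - f_x))                  # blended error: 0.1, smaller than every individual
```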
Blending and Bagging Uniform Blending
Theoretical Analysis of Uniform Blending
G(x) = (1/T) Σ_{t=1}^{T} g_t(x)

avg( (g_t(x) − f(x))² )
= avg( g_t² − 2 g_t f + f² )
= avg( g_t² ) − 2 G f + f²
= avg( g_t² ) − G² + (G − f)²
= avg( g_t² ) − 2 G² + G² + (G − f)²
= avg( g_t² − 2 g_t G + G² ) + (G − f)²
= avg( (g_t − G)² ) + (G − f)²

avg( E_out(g_t) ) = avg( E[(g_t − G)²] ) + E_out(G) ≥ E_out(G)
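A quick numerical check (illustrative values only) of the identity avg((g_t − f)²) = avg((g_t − G)²) + (G − f)² at one point x:

```python
import numpy as np

f_x = 3.0                              # assumed f(x)
g_x = np.array([2.2, 2.7, 3.6, 3.9])   # assumed g_t(x) for t = 1..T
G_x = g_x.mean()                       # G(x)

lhs = np.mean((g_x - f_x) ** 2)                       # average squared error of the g_t
rhs = np.mean((g_x - G_x) ** 2) + (G_x - f_x) ** 2    # deviation to consensus + consensus error
print(lhs, rhs)                                       # both 0.475, so avg error of g_t >= error of G
```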
Blending and Bagging Uniform Blending
Some Special g_t
consider a virtual iterative process that, for t = 1, 2, . . . , T,
1 requests size-N data D_t from P^N (i.i.d.)
2 obtains g_t by A(D_t)

ḡ = lim_{T→∞} G = lim_{T→∞} (1/T) Σ_{t=1}^{T} g_t = E_D[ A(D) ]

avg( E_out(g_t) ) = avg( E[(g_t − ḡ)²] ) + E_out(ḡ)
expected performance of A = expected deviation to consensus + performance of consensus
• performance of consensus: called bias
• expected deviation to consensus: called variance
uniform blending: reduces variance for more stable performance
Blending and Bagging Uniform Blending
Fun Time
Consider applying uniform blending G(x) = (1/T) Σ_{t=1}^{T} g_t(x) on linear regression hypotheses g_t(x) = innerprod(w_t, x). Which of the following properties best describes the resulting G(x)?
1  a constant function of x
2  a linear function of x
3  a quadratic function of x
4  none of the other choices
Reference Answer: 2
G(x) = innerprod( (1/T) Σ_{t=1}^{T} w_t, x ), which is clearly a linear function of x. Note that we write ‘innerprod’ instead of the usual ‘transpose’ notation to avoid a symbol conflict with T (the number of hypotheses).
Blending and Bagging Linear and Any Blending
Linear Blending
linear blending: known g_t, each to be given α_t ballots
G(x) = sign( Σ_{t=1}^{T} α_t · g_t(x) ) with α_t ≥ 0
computing ‘good’ α_t: min_{α_t ≥ 0} E_in(α)

linear blending for regression:
min_{α_t ≥ 0} (1/N) Σ_{n=1}^{N} ( y_n − Σ_{t=1}^{T} α_t g_t(x_n) )²

LinReg + transformation:
min_{w_i} (1/N) Σ_{n=1}^{N} ( y_n − Σ_{i=1}^{d̃} w_i φ_i(x_n) )²

like two-level learning, remember? :-)
linear blending = LinModel + hypotheses as transform + constraints
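A minimal sketch of the two-level-learning view of linear blending for regression: treat Φ(x) = (g_1(x), . . . , g_T(x)) as a feature transform and run plain linear regression on it. The α_t ≥ 0 constraint is dropped here (as the next slide notes is common in practice), the toy hypotheses are purely illustrative, and in practice α would be fit on a validation set with g_t^− trained on the rest:

```python
import numpy as np

def linear_blend_fit(g_list, X, y):
    """Fit blending weights alpha by least squares on Z = Phi(X)."""
    Z = np.column_stack([g(X) for g in g_list])     # N x T transformed data
    alpha, *_ = np.linalg.lstsq(Z, y, rcond=None)   # unconstrained: alpha_t may be negative
    return alpha

def linear_blend_predict(g_list, alpha, X):
    return np.column_stack([g(X) for g in g_list]) @ alpha

# toy usage: three assumed hypotheses g_t mapping R^2 -> R
g_list = [lambda X: X[:, 0], lambda X: X[:, 1], lambda X: X[:, 0] * X[:, 1]]
X = np.random.randn(200, 2)
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + 0.01 * np.random.randn(200)
alpha = linear_blend_fit(g_list, X, y)
print(alpha)   # roughly [0.5, -0.2, 0.0]
```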
Blending and Bagging Linear and Any Blending
Constraint on α_t
linear blending = LinModel + hypotheses as transform + constraints:
min_{α_t ≥ 0} (1/N) Σ_{n=1}^{N} err( y_n, Σ_{t=1}^{T} α_t g_t(x_n) )

linear blending for binary classification:
if α_t < 0 =⇒ α_t g_t(x) = |α_t| (−g_t(x))
• negative α_t for g_t ≡ positive |α_t| for −g_t
• if you have a stock up/down classifier with 99% error, tell me! :-)

in practice, often
linear blending = LinModel + hypotheses as transform, with the constraints dropped
Blending and Bagging Linear and Any Blending
Linear Blending versus Selection
in practice, often g_1 ∈ H_1, g_2 ∈ H_2, . . . , g_T ∈ H_T by minimum E_in
• recall: selection by minimum E_in
—best of best, paying d_VC( ∪_{t=1}^{T} H_t )
• recall: linear blending includes selection as a special case
—by setting α_t = ⟦E_val(g_t^−) smallest⟧
• complexity price of linear blending with E_in (aggregation of best):
≥ d_VC( ∪_{t=1}^{T} H_t )
like selection, blending is practically done with
(E_val instead of E_in) + (g_t^− from minimum E_train)
Blending and Bagging Linear and Any Blending
Any Blending
Given g_1^−, g_2^−, . . . , g_T^− from D_train, transform (x_n, y_n) in D_val to (z_n = Φ^−(x_n), y_n), where Φ^−(x) = (g_1^−(x), . . . , g_T^−(x))

Linear Blending
1 compute α = LinearModel( {(z_n, y_n)} )
2 return G_LINB(x) = LinearHypothesis_α(Φ(x))

Any Blending (Stacking)
1 compute g̃ = AnyModel( {(z_n, y_n)} )
2 return G_ANYB(x) = g̃(Φ(x))

where Φ(x) = (g_1(x), . . . , g_T(x))

any blending:
• powerful, achieves conditional blending
• but danger of overfitting, as always :-(
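A minimal stacking sketch under the definitions above; the choice of a scikit-learn decision tree as ‘AnyModel’ is an assumption made purely for illustration, not something prescribed by the course:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor   # stands in for "AnyModel"; any regressor works

def phi(g_list, X):
    """Phi(x) = (g_1(x), ..., g_T(x)), stacked as an N x T matrix."""
    return np.column_stack([g(X) for g in g_list])

def any_blend_fit(g_minus_list, X_val, y_val):
    """Stacking: train g_tilde on the transformed validation set {(z_n, y_n)}."""
    return DecisionTreeRegressor(max_depth=3).fit(phi(g_minus_list, X_val), y_val)

def any_blend_predict(g_tilde, g_list, X):
    """G_ANYB(x) = g_tilde(Phi(x))."""
    return g_tilde.predict(phi(g_list, X))
```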
Blending and Bagging Linear and Any Blending
Blending in Practice
(Chen et al., A Linear Ensemble of Individual and Blended Models for Music Rating Prediction, 2012)
KDDCup 2011 Track 1: world champion solution by NTU
• validation set blending: a special any blending model
E_test (squared): 519.45 =⇒ 456.24
—helped secure the lead in the last two weeks
• test set blending: linear blending using Ẽ_test
E_test (squared): 456.24 =⇒ 442.06
—helped turn the tables in the last hour
blending ‘useful’ in practice, despite the computational burden
Blending and Bagging Linear and Any Blending
Fun Time
Consider three decision stump hypotheses from R to {−1, +1}:
g_1(x) = sign(1 − x), g_2(x) = sign(1 + x), g_3(x) = −1.
When x = 0, what is the resulting Φ(x) = (g_1(x), g_2(x), g_3(x)) used in the returned hypothesis of linear/any blending?
1  (+1, +1, +1)
2  (+1, +1, −1)
3  (+1, −1, −1)
4  (−1, −1, −1)
Reference Answer: 2
Too easy? :-)
Blending and Bagging Bagging (Bootstrap Aggregation)
What We Have Done
blending: aggregate after getting g_t; learning: aggregate as well as getting g_t

aggregation type | blending          | learning
uniform          | voting/averaging  | ?
non-uniform      | linear            | ?
conditional      | stacking          | ?

learning g_t for uniform aggregation: diversity is important
• diversity by different models: g_1 ∈ H_1, g_2 ∈ H_2, . . . , g_T ∈ H_T
• diversity by different parameters: GD with η = 0.001, 0.01, . . . , 10
• diversity by algorithmic randomness: random PLA with different random seeds
• diversity by data randomness: within-cross-validation hypotheses g_v^−
next: diversity by data randomness without g^−
Blending and Bagging Bagging (Bootstrap Aggregation)
Revisit of Bias-Variance
expected performance of A = expected deviation to consensus + performance of consensus
consensus ḡ = expected g_t from D_t ∼ P^N
• consensus is more stable than a direct A(D), but comes from many more D_t than the one D on hand
• want: approximate ḡ by
• finite (large) T
• approximate g_t = A(D_t) from D_t ∼ P^N using only D
bootstrapping: a statistical tool that re-samples from D to ‘simulate’ D_t
Blending and Bagging Bagging (Bootstrap Aggregation)
Bootstrap Aggregation
bootstrapping
bootstrap sample D̃_t: re-sample N examples from D uniformly with replacement
—can also use an arbitrary N′ instead of the original N

virtual aggregation
consider a virtual iterative process that, for t = 1, 2, . . . , T,
1 requests size-N data D_t from P^N (i.i.d.)
2 obtains g_t by A(D_t)
G = Uniform({g_t})

bootstrap aggregation
consider a physical iterative process that, for t = 1, 2, . . . , T,
1 requests size-N′ data D̃_t from bootstrapping
2 obtains g_t by A(D̃_t)
G = Uniform({g_t})

bootstrap aggregation (BAGging): a simple meta algorithm on top of base algorithm A
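A minimal bagging sketch (not from the slides), assuming the base algorithm is given as a function train(X, y) -> g and each g is a callable; the uniform aggregation averages the g_t (take the sign of the average for binary classification):

```python
import numpy as np

def bagging(train, X, y, T=25, n_prime=None, rng=None):
    """Bootstrap AGgregation: train T hypotheses on bootstrap samples of (X, y)."""
    rng = np.random.default_rng(rng)
    N = len(X)
    n_prime = n_prime or N                      # can re-sample N' != N examples if desired
    g_list = []
    for _ in range(T):
        idx = rng.integers(0, N, size=n_prime)  # re-sample uniformly with replacement
        g_list.append(train(X[idx], y[idx]))
    return lambda X_new: np.mean([g(X_new) for g in g_list], axis=0)  # G = Uniform({g_t})
```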
Blending and Bagging Bagging (Bootstrap Aggregation)
Bagging Pocket in Action
T_POCKET = 1000; T_BAG = 25
• very diverse g_t from bagging
• proper non-linear boundary after aggregating the binary classifiers
bagging works reasonably well if the base algorithm is sensitive to data randomness
Blending and Bagging Bagging (Bootstrap Aggregation)
Fun Time
When using bootstrapping to re-sample N examples into D̃_t from a data set D with N examples, what is the probability of getting D̃_t exactly the same as D?
1  0/N^N = 0
2  1/N^N
3  N!/N^N
4  N^N/N^N = 1
Reference Answer: 3
Consider re-sampling in an ordered manner for N steps. Then there are N^N possible outcomes D̃_t, each with equal probability. Most importantly, N! of the outcomes are permutations of the original D, and thus the answer.
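As a small check of this answer (not from the slides): for N = 3 the probability is 3!/3³ = 6/27 ≈ 0.22, and it decays very quickly with N:

```python
from math import factorial

for N in (3, 5, 10):
    # probability that an ordered bootstrap draw is exactly a permutation of D
    print(N, factorial(N) / N ** N)   # 0.2222..., 0.0384, 0.00036288
```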