Machine Learning Techniques
Lecture 7: Blending and Bagging

Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University
Agenda
• Motivation of Aggregation
• Uniform Blending
• Linear and Any Blending
• Bagging
Motivation of Aggregation

An Aggregation Story
Your T friends g_1, ..., g_T predict whether the stock will go up, as g_t(x). You can . . .
• select the most trustworthy friend from their usual performance — validation!
• mix the predictions from all your friends uniformly — let them vote!
• mix the predictions from all your friends non-uniformly — let them vote, but give some friends more ballots
• combine the predictions conditionally — if [condition t true], give some ballots to friend t
• . . .

aggregation models: mix or combine hypotheses (for better performance)
Aggregation with Math Notations
Your T friends g_1, ..., g_T predict whether the stock will go up, as g_t(x).
• select the most trustworthy friend from their usual performance:
  G(x) = g_{t^*}(x) \text{ with } t^* = \operatorname{argmin}_{t \in \{1,2,\ldots,T\}} E_{\mathrm{val}}(g_t)
• mix the predictions from all your friends uniformly:
  G(x) = \operatorname{sign}\Bigl( \sum_{t=1}^{T} 1 \cdot g_t(x) \Bigr)
• mix the predictions from all your friends non-uniformly:
  G(x) = \operatorname{sign}\Bigl( \sum_{t=1}^{T} \alpha_t \cdot g_t(x) \Bigr) \text{ with } \alpha_t \ge 0
  • includes select: α_t = ⟦E_val(g_t) smallest⟧
  • includes uniform: α_t = 1
• combine the predictions conditionally:
  G(x) = \operatorname{sign}\Bigl( \sum_{t=1}^{T} q_t(x) \cdot g_t(x) \Bigr) \text{ with } q_t(x) \ge 0
  • includes non-uniform: q_t(x) = α_t

aggregation models: a rich family
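The four strategies differ only in how the ballots are formed. Below is a minimal numpy sketch; the T×N prediction matrix `preds` (entries in {−1, +1}) and the validation-error vector `e_val` are assumed inputs, not objects defined in the lecture.

```python
import numpy as np

# preds[t, n]: prediction of g_t on example x_n, in {-1, +1} (assumed given)
# e_val[t]:    validation error of g_t (assumed given)

def select(preds, e_val):
    """Selection: use only the g_t with smallest validation error."""
    return preds[np.argmin(e_val)]

def uniform_mix(preds):
    """Uniform mixing: one ballot each, G(x) = sign(sum_t g_t(x)).

    Note: np.sign maps a tied vote (sum 0) to 0; break ties as desired."""
    return np.sign(preds.sum(axis=0))

def linear_mix(preds, alpha):
    """Non-uniform mixing: alpha_t >= 0 ballots for g_t."""
    return np.sign(alpha @ preds)

def conditional_mix(preds, q):
    """Conditional combining: q[t, n] = q_t(x_n) >= 0 ballots, per example."""
    return np.sign((q * preds).sum(axis=0))
```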
Recall: Selection by Validation

G(x) = g_{t^*}(x) \text{ with } t^* = \operatorname{argmin}_{t \in \{1,2,\ldots,T\}} E_{\mathrm{val}}(g_t)

• simple and popular
• can also use E_in instead of E_val (with a complexity price on d_VC)
• need one strong g_t to guarantee small E_val (and small E_out)

selection: rely on one strong hypothesis
aggregation: can we do better with many (possibly weaker) hypotheses?
Why Might Aggregation Work?
• mix different weak hypotheses uniformly — G(x) 'strong'
  aggregation ⇒ feature transform (?)
• mix different random-PLA hypotheses uniformly — G(x) 'moderate'
  aggregation ⇒ regularization (?)

proper aggregation ⇒ better performance
Uniform Blending

Uniform Blending (Voting) for Classification
uniform blending: known g_t, each with 1 ballot

G(x) = \operatorname{sign}\Bigl( \sum_{t=1}^{T} 1 \cdot g_t(x) \Bigr)

• same g_t (autocracy): as good as one single g_t
• very different g_t (diversity + democracy): majority can correct minority
• similar results with uniform voting for multiclass:

G(x) = \operatorname{argmax}_{1 \le k \le K} \sum_{t=1}^{T} ⟦ g_t(x) = k ⟧

how about regression?
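A small sketch of both voting rules; the prediction matrices are assumed inputs, and the helper names are hypothetical.

```python
import numpy as np

def vote_binary(preds):
    """Uniform voting, preds[t, n] in {-1, +1}; a tied vote yields 0 here."""
    return np.sign(preds.sum(axis=0))

def vote_multiclass(preds, K):
    """Uniform voting, preds[t, n] in {0, ..., K-1}: argmax of ballot counts."""
    T, N = preds.shape
    counts = np.zeros((K, N), dtype=int)
    for t in range(T):
        counts[preds[t], np.arange(N)] += 1   # one ballot per hypothesis
    return counts.argmax(axis=0)
```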
Uniform Blending for Regression

G(x) = \frac{1}{T} \sum_{t=1}^{T} g_t(x)

• same g_t (autocracy): as good as one single g_t
• very different g_t (diversity + democracy):
  ⇒ some g_t(x) > f(x), some g_t(x) < f(x)
  ⇒ average could be more accurate than individual

diverse hypotheses: even simple uniform blending can be better than one
Theoretical Analysis of Uniform Blending

G(x) = \frac{1}{T} \sum_{t=1}^{T} g_t(x)

For a fixed x, with avg denoting the average over t (so that avg(g_t) = G):

\begin{aligned}
\operatorname{avg}\bigl((g_t(x) - f(x))^2\bigr)
&= \operatorname{avg}\bigl(g_t^2 - 2 g_t f + f^2\bigr) \\
&= \operatorname{avg}(g_t^2) - 2Gf + f^2 \\
&= \operatorname{avg}(g_t^2) - G^2 + (G - f)^2 \\
&= \operatorname{avg}(g_t^2) - 2G^2 + G^2 + (G - f)^2 \\
&= \operatorname{avg}\bigl(g_t^2 - 2 g_t G + G^2\bigr) + (G - f)^2 \\
&= \operatorname{avg}\bigl((g_t - G)^2\bigr) + (G - f)^2
\end{aligned}

Taking expectation over x:

\operatorname{avg}\bigl(E_{\mathrm{out}}(g_t)\bigr) = \operatorname{avg}\Bigl(\mathbb{E}\bigl[(g_t - G)^2\bigr]\Bigr) + E_{\mathrm{out}}(G)
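The per-point identity in the last line of the derivation can be verified numerically; the values below for g_t(x) and f(x) at a single x are made up for illustration.

```python
import numpy as np

g = np.array([1.2, 0.7, 1.9, 0.4])   # hypothetical values g_t(x) at one point x
f = 1.0                               # hypothetical target value f(x)
G = g.mean()                          # uniform blend G(x)

lhs = np.mean((g - f) ** 2)                  # avg of individual squared errors
rhs = np.mean((g - G) ** 2) + (G - f) ** 2   # deviation to blend + blend error
assert np.isclose(lhs, rhs)                  # the identity holds exactly
```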
Some Special g_t

consider a virtual iterative process that, for t = 1, 2, ..., T:
1. requests size-N data D_t from P^N (i.i.d.)
2. obtains g_t by A(D_t)

\bar{g} = \lim_{T \to \infty} G = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} g_t = \mathbb{E}_{\mathcal{D}} \, A(\mathcal{D})

\operatorname{avg}\bigl(E_{\mathrm{out}}(g_t)\bigr) = \operatorname{avg}\Bigl(\mathbb{E}\bigl[(g_t - \bar{g})^2\bigr]\Bigr) + E_{\mathrm{out}}(\bar{g})

expected performance of A = expected deviation to consensus + performance of consensus
• performance of consensus: called bias
• expected deviation to consensus: called variance

uniform blending: reduces variance for stabler performance
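The bias and variance of an algorithm can be estimated by simulating the virtual process. A minimal sketch under assumed toy choices: target f(x) = x² on [−1, 1], noiseless data, and least-squares lines as the base algorithm A.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                  # assumed toy target
N, T = 20, 2000                       # dataset size, number of virtual rounds
x_test = np.linspace(-1, 1, 200)

# virtual process: D_t ~ P^N, g_t = A(D_t); here A fits a line by least squares
preds = np.empty((T, x_test.size))
for t in range(T):
    x = rng.uniform(-1, 1, N)
    w1, w0 = np.polyfit(x, f(x), 1)
    preds[t] = w1 * x_test + w0

g_bar = preds.mean(axis=0)                    # finite-T estimate of consensus
bias = np.mean((g_bar - f(x_test)) ** 2)      # performance of consensus
variance = np.mean((preds - g_bar) ** 2)      # expected deviation to consensus
print(bias, variance)                         # bias + variance ≈ avg(E_out(g_t))
```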
Linear and Any Blending

Linear Blending
linear blending: known g_t, each to be given α_t ballots

G(x) = \operatorname{sign}\Bigl( \sum_{t=1}^{T} \alpha_t \cdot g_t(x) \Bigr) \text{ with } \alpha_t \ge 0

computing 'good' α_t: \min_{\alpha_t \ge 0} E_{\mathrm{in}}(\boldsymbol{\alpha})

linear blending for regression:
\min_{\alpha_t \ge 0} \frac{1}{N} \sum_{n=1}^{N} \Bigl( y_n - \sum_{t=1}^{T} \alpha_t g_t(x_n) \Bigr)^2

LinReg + transformation:
\min_{w_i} \frac{1}{N} \sum_{n=1}^{N} \Bigl( y_n - \sum_{i=1}^{\tilde{d}} w_i \phi_i(x_n) \Bigr)^2

linear blending = LinModel + hypotheses as transform + constraints
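Seen this way, unconstrained linear blending for regression is just linear regression on the transformed inputs z_n = (g_1(x_n), ..., g_T(x_n)). A minimal sketch, dropping the α_t ≥ 0 constraint as the next slide argues is common; `preds_val` is an assumed N×T matrix of g_t(x_n) values.

```python
import numpy as np

def linear_blend(preds_val, y_val):
    """Fit blending weights by least squares on validation-set predictions.

    preds_val[n, t] = g_t(x_n), y_val[n] = y_n (both assumed given).
    """
    alpha, *_ = np.linalg.lstsq(preds_val, y_val, rcond=None)
    return alpha

# blended prediction on new data: G(x) = sum_t alpha_t * g_t(x), i.e.
#   y_hat = preds_new @ alpha        (wrap in np.sign(...) for classification)
```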
Constraint on α_t
linear blending = LinModel + hypotheses as transform + constraints:

\min_{\alpha_t \ge 0} \frac{1}{N} \sum_{n=1}^{N} \operatorname{err}\Bigl( y_n, \sum_{t=1}^{T} \alpha_t g_t(x_n) \Bigr)

linear blending for binary classification:
if α_t < 0 ⇒ α_t g_t(x) = |α_t| (−g_t(x))
• negative α_t for g_t ≡ positive |α_t| for −g_t
• if you have a stock up/down classifier with 99% error, tell me! :-)

in practice, often: linear blending = LinModel + hypotheses as transform (constraints dropped)
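The flip equivalence is easy to check numerically; the classifier outputs and the negative weight below are made-up values.

```python
import numpy as np

g = np.array([1, -1, 1, 1])          # hypothetical predictions of some g_t
alpha = -0.7                          # a negative blending weight

# a negative ballot for g_t is exactly a positive ballot for the flipped -g_t
assert np.allclose(alpha * g, abs(alpha) * (-g))
```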
Linear Blending versus Selection

in practice, often g_1 ∈ H_1, g_2 ∈ H_2, ..., g_T ∈ H_T by minimum E_in
• recall: selection by minimum E_in — best of best, paying d_VC of ∪_{t=1}^{T} H_t
• recall: linear blending includes selection as a special case — by setting α_t = ⟦E_val(g_t) smallest⟧
• complexity price of linear blending with E_in (aggregation of best): d_VC of ∪_{t=1}^{T} H_t

like selection, blending practically done with (E_val instead of E_in) + (g_t from E_train)
Any Blending

Linear Blending
Given g_1, g_2, ..., g_T:
1. transform (x_n, y_n) in D_val to (z_n = Φ(x_n), y_n), where Φ(x) = (g_1(x), ..., g_T(x))
2. compute α = Lin({(z_n, y_n)})
return G_LINB(x) = LinH(α^T Φ(x))

Any Blending (Stacking)
Given g_1, g_2, ..., g_T:
1. transform (x_n, y_n) in D_val to (z_n = Φ(x_n), y_n), where Φ(x) = (g_1(x), ..., g_T(x))
2. compute g̃ = Any({(z_n, y_n)})
return G_ANYB(x) = g̃(Φ(x))

if AnyModel = quadratic polynomial:

G_{\mathrm{ANYB}}(x) = \sum_{t=1}^{T} \underbrace{\Bigl( \alpha_t + \sum_{\tau=1}^{T} \alpha_{\tau,t} \, g_\tau(x) \Bigr)}_{q_t(x)} \cdot g_t(x) — conditional aggregation

danger: overfitting with any blending!
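A minimal sketch of any blending with AnyModel = quadratic polynomial, fit by least squares on validation-set predictions as in the box above; `preds_val` (an N×T matrix of g_t(x_n) values) is an assumed input.

```python
import numpy as np

def quad_features(Z):
    """Quadratic transform of level-0 predictions: [1, z_t, z_i * z_j]."""
    N, T = Z.shape
    cross = np.stack([Z[:, i] * Z[:, j]
                      for i in range(T) for j in range(i, T)], axis=1)
    return np.hstack([np.ones((N, 1)), Z, cross])

def any_blend_quadratic(preds_val, y_val):
    """Stacking with a quadratic level-1 model, fit on D_val predictions."""
    w, *_ = np.linalg.lstsq(quad_features(preds_val), y_val, rcond=None)
    return lambda preds_new: quad_features(preds_new) @ w   # blended predictor
```

Fitting on D_val rather than on the data that produced the g_t is what keeps the overfitting danger in check.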
Blending in Practice
KDDCup 2012 Track 1: World Champion Solution by NTU
• validation set blending: a special any blending
  E_test(squared): 519.45 ⇒ 456.24 — helped secure the lead in the last two weeks
• test set blending: linear blending using Ẽ_test
  E_test(squared): 456.24 ⇒ 442.06 — helped turn the tables in the last hour

blending 'useful' in practice, despite the computational burden
Bagging

What We Have Done
blending: aggregate after getting g_t; learning: aggregate as well as getting g_t

aggregation type | blending         | learning
uniform          | voting/averaging | ?
non-uniform      | linear           | ?
conditional      | stacking         | ?

learning g_t for uniform aggregation: diversity is important
• diversity by different models: g_1 ∈ H_1, g_2 ∈ H_2, ..., g_T ∈ H_T
• diversity by different parameters: GD with η = 0.001, 0.01, ..., 10
• diversity by algorithmic randomness: random PLA with different random seeds
• diversity by data randomness: within-cross-validation hypotheses g_v^-

next: diversity by data randomness without g^-
Revisit of Bias-Variance

expected performance of A = expected deviation to consensus + performance of consensus

consensus ḡ = expected g_t from D_t ~ P^N
• consensus more stable than direct A(D), but comes from many more D_t than the one D on hand
• want: approximate ḡ by
  • finite (large) T
  • approximating g_t = A(D_t) from D_t ~ P^N using only D

bootstrapping: a statistical tool that re-samples from D to 'simulate' D_t
Bootstrap Aggregation

bootstrapping
bootstrap sample D̃_t: re-sample N examples from D with replacement

virtual aggregation
consider a virtual iterative process that, for t = 1, 2, ..., T:
1. request size-N data D_t from P^N (i.i.d.)
2. obtain g_t by A(D_t)
G = Uniform(g_t)

bootstrap aggregation
consider a physical iterative process that, for t = 1, 2, ..., T:
1. request size-N data D̃_t from bootstrapping
2. obtain g_t by A(D̃_t)
G = Uniform(g_t)

bootstrap aggregation (BAGging): a simple meta algorithm on top of base algorithm A
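A minimal sketch of the physical process, assuming a `base_fit(X, y)` function that returns a ±1-valued predictor (standing in for the base algorithm A).

```python
import numpy as np

def bagging(X, y, base_fit, T=25, seed=0):
    """Bootstrap AGgregation: T bootstrap samples, uniform vote on top.

    base_fit(X, y) -> callable g with g(X_new) in {-1, +1} (assumed interface).
    """
    rng = np.random.default_rng(seed)
    N = len(X)
    hypotheses = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)          # re-sample N with replacement
        hypotheses.append(base_fit(X[idx], y[idx]))
    def G(X_new):                                  # uniform blending of the g_t
        return np.sign(np.sum([g(X_new) for g in hypotheses], axis=0))
    return G
```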
Bagging Pocket in Action
[figure: pocket classifiers with T_POCKET = 1000, bagged over T_BAG = 25 rounds]
• very diverse g_t from bagging
• proper non-linear boundary after aggregating binary classifiers

bagging works reasonably well if the base algorithm is sensitive to data randomness