## Machine Learning Techniques (機器學習技法)

### Lecture 7: Blending and Bagging

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/23

## Roadmap

### 1 Embedding Numerous Features: Kernel Models

Lecture 6: Support Vector Regression
**kernel ridge regression** (dense) via ridge regression + **representer theorem**;
**support vector regression** (sparse) via regularized **tube** error + **Lagrange dual**

### 2 Combining Predictive Features: Aggregation Models

Lecture 7: Blending and Bagging

- Motivation of Aggregation
- Uniform Blending
- Linear and Any Blending
- Bagging (Bootstrap Aggregation)

### 3 Distilling Implicit Features: Extraction Models

## An Aggregation Story

Your T friends g_1, · · ·, g_T predict whether the stock will go up, as g_t(x). You can . . .

- **select** the most trustworthy friend from their **usual performance** —validation!
- **mix** the predictions from all your friends **uniformly** —let them **vote!**
- **mix** the predictions from all your friends **non-uniformly** —let them vote, but **give some more ballots**
- **combine** the predictions **conditionally** —if **[t satisfies some condition]**, give some ballots to friend t
- . . .

**aggregation** models: **mix** or **combine** hypotheses (for better performance)

## Aggregation with Math Notations

Your T friends g_1, · · ·, g_T predict whether the stock will go up, as g_t(x).

- **select** the most trustworthy friend from their **usual performance**:
  G(x) = g_{t*}(x) with t* = argmin_{t ∈ {1,2,··· ,T}} E_val(g_t⁻)
- **mix** the predictions from all your friends **uniformly**:
  G(x) = sign( Σ_{t=1}^{T} 1 · g_t(x) )
- **mix** the predictions from all your friends **non-uniformly**:
  G(x) = sign( Σ_{t=1}^{T} α_t · g_t(x) ) with α_t ≥ 0
  - includes **select**: α_t = ⟦E_val(g_t⁻) smallest⟧
  - includes **uniform**: α_t = 1
- **combine** the predictions **conditionally**:
  G(x) = sign( Σ_{t=1}^{T} q_t(x) · g_t(x) ) with q_t(x) ≥ 0
  - includes **non-uniform**: q_t(x) = α_t

aggregation models: a **rich family**
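The four aggregation schemes above can be put side by side in code. A minimal sketch, using three hypothetical decision stumps as the "friends" (the stump family also used in the quizzes below); all names are illustrative:

```python
import numpy as np

# hypothetical "friends": three decision stumps on R
g = [lambda x: np.sign(1 - x),
     lambda x: np.sign(1 + x),
     lambda x: -np.ones_like(x)]

def select(x, e_val):
    """G(x) = g_{t*}(x): pick the g_t with the smallest validation error."""
    return g[int(np.argmin(e_val))](x)

def uniform_mix(x):
    """Every g_t gets one ballot."""
    return np.sign(sum(gt(x) for gt in g))

def nonuniform_mix(x, alpha):
    """g_t gets alpha_t >= 0 ballots; alpha_t = 1 recovers uniform mixing."""
    return np.sign(sum(a * gt(x) for a, gt in zip(alpha, g)))

def conditional_mix(x, q):
    """q_t(x) >= 0 ballots depending on x; q_t(x) = alpha_t recovers non-uniform."""
    return np.sign(sum(qt(x) * gt(x) for qt, gt in zip(q, g)))

x = np.array([0.0])
print(uniform_mix(x))                 # the three stumps vote at x = 0
print(nonuniform_mix(x, [1, 1, 3]))   # g_3 outvotes the others with 3 ballots
```

Note how each scheme strictly generalizes the previous one, mirroring the "rich family" remark above.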

## Recall: Selection by Validation

G(x) = g_{t*}(x) with t* = argmin_{t ∈ {1,2,··· ,T}} E_val(g_t⁻)

- **simple** and popular
- what if we use E_in(g_t) instead of E_val(g_t⁻)? **complexity price on d_VC, remember? :-)**
- need **one strong** g_t⁻ to guarantee small E_val (and small E_out)

**selection:** rely on one strong hypothesis
**aggregation:** can we do better with many (possibly weaker) hypotheses?
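Selection by validation is just an argmin over stored validation errors. A minimal sketch with made-up E_val values:

```python
# hypothetical validation errors E_val(g_t^-) for T = 4 candidate hypotheses
e_val = {1: 0.31, 2: 0.18, 3: 0.25, 4: 0.40}

# t* = argmin_{t in {1,...,T}} E_val(g_t^-)
t_star = min(e_val, key=e_val.get)
print(t_star)  # -> 2
```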

## Why Might Aggregation Work?

- mix **different weak hypotheses** uniformly —G(x) ‘strong’
  - aggregation =⇒ **feature transform (?)**
- mix **different random-PLA hypotheses** uniformly —G(x) ‘moderate’
  - aggregation =⇒ **regularization (?)**

proper aggregation =⇒ **better performance**

## Fun Time

Consider three decision stump hypotheses from R to {−1, +1}:
g_1(x) = sign(1 − x), g_2(x) = sign(1 + x), g_3(x) = −1. When mixing the three hypotheses uniformly, what is the resulting G(x)?

1. 2⟦|x| ≤ 1⟧ − 1
2. 2⟦|x| ≥ 1⟧ − 1
3. 2⟦x ≤ −1⟧ − 1
4. 2⟦x ≥ +1⟧ − 1

**Reference Answer: 1**

The ‘region’ that gets two positive votes, from g_1 and g_2, is |x| ≤ 1, and thus G(x) is positive only within that region. We see that the three decision stumps g_t can be aggregated to form a more sophisticated hypothesis G.
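The reference answer can be double-checked numerically. A sketch that compares the uniform vote of the three stumps against choice 1 on a grid (the tie points x = ±1, where sign(0) = 0, are skipped):

```python
import numpy as np

g1 = lambda x: np.sign(1 - x)
g2 = lambda x: np.sign(1 + x)
g3 = lambda x: -1.0

def G(x):        # uniform vote of the three stumps
    return np.sign(g1(x) + g2(x) + g3(x))

def answer1(x):  # 2*[|x| <= 1] - 1
    return 2.0 * (abs(x) <= 1) - 1.0

# spot-check on a grid, avoiding the tie points x = +1 and x = -1
for x in np.linspace(-3, 3, 121):
    if abs(abs(x) - 1.0) > 1e-9:
        assert G(x) == answer1(x)
print("G(x) matches 2*[|x| <= 1] - 1 on the grid")
```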


## Uniform Blending (Voting) for Classification

uniform blending: known g_t, each with 1 ballot

G(x) = sign( Σ_{t=1}^{T} 1 · g_t(x) )

- same g_t (autocracy): as good as one single g_t
- very different g_t (diversity + **democracy**): majority can **correct** minority
- similar results with uniform voting for multiclass:
  G(x) = argmax_{1≤k≤K} Σ_{t=1}^{T} ⟦g_t(x) = k⟧

how about **regression?**
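The multiclass voting rule above amounts to a vote count per label followed by an argmax. A minimal sketch with hypothetical predictions:

```python
import numpy as np

def uniform_vote_multiclass(preds, K):
    """G(x) = argmax_k sum_t [g_t(x) = k], with preds[t] the label from g_t(x)."""
    counts = np.bincount(preds, minlength=K)  # one ballot per g_t
    return int(np.argmax(counts))

# hypothetical predictions from T = 5 classifiers for one x, labels in {0, 1, 2}
print(uniform_vote_multiclass(np.array([2, 0, 2, 1, 2]), K=3))  # majority label
```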

## Uniform Blending for Regression

G(x) = (1/T) Σ_{t=1}^{T} g_t(x)

- same g_t (autocracy): as good as one single g_t
- very different g_t (diversity + **democracy**): some g_t(x) > f(x), some g_t(x) < f(x), so the average **could be** more accurate than each individual

**diverse hypotheses:** even simple uniform blending can be better than any single hypothesis
## Theoretical Analysis of Uniform Blending

G(x) = (1/T) Σ_{t=1}^{T} g_t(x)

For any fixed x (writing g_t for g_t(x), and using avg(g_t) = G):

avg( (g_t − f)² ) = avg( g_t² − 2 g_t f + f² )
= avg( g_t² ) − 2 G f + f²
= avg( g_t² ) − G² + (G − f)²
= avg( g_t² ) − 2 G² + G² + (G − f)²
= avg( g_t² − 2 g_t G + G² ) + (G − f)²
= avg( (g_t − G)² ) + (G − f)²

Averaging over x then gives

avg( E_out(g_t) ) = avg( 𝔼_x (g_t − G)² ) + E_out(G) ≥ E_out(G)
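The final identity avg((g_t − f)²) = avg((g_t − G)²) + (G − f)² holds for any fixed x, so it can be verified numerically with arbitrary made-up values of g_t(x) and f(x):

```python
import numpy as np

rng = np.random.default_rng(0)
gt = rng.normal(size=50)   # hypothetical values g_t(x) at one fixed x, T = 50
f = 0.3                    # hypothetical target value f(x)
G = gt.mean()              # uniform blend G(x) = (1/T) sum_t g_t(x)

lhs = np.mean((gt - f) ** 2)                 # avg squared error of individuals
rhs = np.mean((gt - G) ** 2) + (G - f) ** 2  # deviation-to-blend + blend error
assert abs(lhs - rhs) < 1e-12

# since avg((g_t - G)^2) >= 0, the average error of the g_t >= the error of G
assert lhs >= (G - f) ** 2
print("identity holds:", lhs, "=", rhs)
```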

## Some Special g_t

consider a **virtual** iterative process that, for t = 1, 2, . . . , T:

1. request size-N data D_t from P^N (i.i.d.)
2. obtain g_t by A(D_t)

ḡ = lim_{T→∞} G = lim_{T→∞} (1/T) Σ_{t=1}^{T} g_t = 𝔼_D A(D)

avg( E_out(g_t) ) = avg( 𝔼(g_t − ḡ)² ) + E_out(ḡ)

expected performance of A = expected deviation to consensus + performance of consensus

- performance of consensus: called **bias**
- expected deviation to consensus: called **variance**

uniform blending: reduces **variance** for more stable performance
## Fun Time

Consider applying uniform blending G(x) = (1/T) Σ_{t=1}^{T} g_t(x) on linear regression hypotheses g_t(x) = innerprod(**w**_t, **x**). Which of the following properties best describes the resulting G(x)?

1. a constant function of **x**
2. a linear function of **x**
3. a quadratic function of **x**
4. none of the other choices

**Reference Answer: 2**

G(x) = innerprod( (1/T) Σ_{t=1}^{T} **w**_t, **x** ), which is clearly a linear function of **x**. Note that we write ‘innerprod’ instead of the usual ‘transpose’ notation to avoid a symbol conflict with T (the number of hypotheses).


## Linear Blending

linear blending: known g_t, each to be given α_t ballots

G(x) = sign( Σ_{t=1}^{T} α_t · g_t(x) ) with α_t ≥ 0

computing ‘good’ α_t: min_{α_t ≥ 0} E_in(α)

linear blending for regression:

min_{α_t ≥ 0} (1/N) Σ_{n=1}^{N} ( y_n − Σ_{t=1}^{T} α_t g_t(x_n) )²

linear regression + transformation:

min_{w_i} (1/N) Σ_{n=1}^{N} ( y_n − Σ_{i=1}^{d̃} w_i φ_i(x_n) )²

**like two-level learning, remember? :-)**

linear blending = LinModel + **hypotheses as transform** + constraints
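The two-level-learning view suggests implementing linear blending as linear regression on the transformed features z_n = (g_1(x_n), . . . , g_T(x_n)). A sketch with hypothetical regressors and data; since in practice the α_t ≥ 0 constraint is often dropped, plain least squares suffices here:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical setup: blend T = 3 simple regressors g_t on N = 100 points
x = rng.uniform(-2, 2, size=100)
y = np.sin(x) + 0.1 * rng.normal(size=100)   # made-up regression target
g = [lambda u: u, lambda u: u ** 3, lambda u: np.ones_like(u)]

# two-level learning: z_n = (g_1(x_n), ..., g_T(x_n)) are the transformed features
Z = np.column_stack([gt(x) for gt in g])

# unconstrained least squares stands in for min over alpha (constraint dropped)
alpha, *_ = np.linalg.lstsq(Z, y, rcond=None)
G = Z @ alpha                                # blended predictions

print("alpha:", alpha)
print("E_in of blend:", np.mean((G - y) ** 2))
```

Because the constant regressor is among the g_t, the blend's in-sample error can never exceed that of predicting the mean of y.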

## Constraint on α_t

linear blending = LinModel + **hypotheses as transform** + constraints:

min_{α_t ≥ 0} (1/N) Σ_{n=1}^{N} err( y_n, Σ_{t=1}^{T} α_t g_t(x_n) )

linear blending for binary classification: if α_t < 0, then α_t g_t(x) = |α_t| (−g_t(x))

- negative α_t for g_t ≡ positive |α_t| for −g_t
- **if you have a stock up/down classifier with 99% error, tell me! :-)**

in practice, often

linear blending = LinModel + **hypotheses as transform** + ~~constraints~~

## Linear Blending versus Selection

in practice, often g_1 ∈ H_1, g_2 ∈ H_2, . . . , g_T ∈ H_T by **minimum E_in**

- recall: **selection by minimum E_in** —best of best, paying d_VC( ∪_{t=1}^{T} H_t )
- recall: linear blending includes **selection** as a special case —by setting α_t = ⟦E_val(g_t⁻) smallest⟧
- complexity price of linear blending with E_in (aggregation of best): ≥ d_VC( ∪_{t=1}^{T} H_t )

like selection, blending is practically done with (E_val instead of E_in) + (g_t⁻ from minimum E_train)

## Any Blending

Given g_1⁻, g_2⁻, . . . , g_T⁻ from D_train, transform (x_n, y_n) in D_val to (z_n = Φ⁻(x_n), y_n), where Φ⁻(x) = (g_1⁻(x), . . . , g_T⁻(x))

### Linear Blending

1. compute α = LinearModel({(z_n, y_n)})
2. return G_LINB(x) = LinearHypothesis_α(Φ(x))

### Any Blending (Stacking)

1. compute g̃ = **AnyModel**({(z_n, y_n)})
2. return G_ANYB(x) = g̃(Φ(x))

where Φ(x) = (g_1(x), . . . , g_T(x))

**any** blending:

- **powerful, achieves conditional blending**
- but **danger of overfitting, as always :-(**
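Stacking can be sketched end-to-end with a hypothetical D_train/D_val split: level-one regressors g_t⁻ fit on D_train only, then an "AnyModel" (here, an illustrative quadratic model on z) fit on the transformed validation set:

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical data, split into D_train (fits the g_t^-) and D_val (fits g~)
x = rng.uniform(-2, 2, size=200)
y = np.sin(3 * x) + 0.1 * rng.normal(size=200)
xt, yt = x[:120], y[:120]          # D_train
xv, yv = x[120:], y[120:]          # D_val

# level one: two crude regressors g_t^- fit on D_train only
w1 = np.linalg.lstsq(xt[:, None], yt, rcond=None)[0]         # g_1^-(x) = w1 * x
w3 = np.linalg.lstsq((xt ** 3)[:, None], yt, rcond=None)[0]  # g_2^-(x) = w3 * x^3
phi = lambda u: np.column_stack([u * w1, (u ** 3) * w3])     # Phi^-(x)

# level two "AnyModel": a quadratic model on z = Phi^-(x), fit on D_val
Zv = phi(xv)
feats = lambda Z: np.column_stack(
    [np.ones(len(Z)), Z, Z ** 2, Z[:, :1] * Z[:, 1:]])
beta = np.linalg.lstsq(feats(Zv), yv, rcond=None)[0]

G_anyb = lambda u: feats(phi(u)) @ beta    # G_ANYB(x) = g~(Phi(x))
print("val MSE:", np.mean((G_anyb(xv) - yv) ** 2))
```

The extra flexibility of g̃ is exactly where the overfitting danger noted above comes from, which is why g̃ is fit on D_val rather than on D_train.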

## Blending in Practice

(Chen et al., A linear ensemble of individual and blended models for music rating prediction, 2012)

KDDCup 2011 Track 1: World Champion Solution by NTU

- validation set blending: a special any blending
  E_test (squared): 519.45 =⇒ 456.24 —helped **secure the lead** in the last two weeks
- test set blending: linear blending using Ẽ_test
  E_test (squared): 456.24 =⇒ 442.06 —helped **turn the tables** in the last hour

blending ‘useful’ in practice, **despite the computational burden**

## Fun Time

Consider three decision stump hypotheses from R to {−1, +1}:
g_1(x) = sign(1 − x), g_2(x) = sign(1 + x), g_3(x) = −1. When x = 0, what is the resulting Φ(x) = (g_1(x), g_2(x), g_3(x)) used in the returned hypothesis of linear/any blending?

1. (+1, +1, +1)
2. (+1, +1, −1)
3. (+1, −1, −1)
4. (−1, −1, −1)

**Reference Answer: 2** **Too easy? :-)**


## What We Have Done

blending: aggregate **after** getting g_t; learning: aggregate **as well as** getting g_t

| aggregation type | blending | learning |
|---|---|---|
| uniform | voting/averaging | **?** |
| non-uniform | linear | **?** |
| conditional | stacking | **?** |

learning g_t for uniform aggregation: **diversity** important

- diversity by different models: g_1 ∈ H_1, g_2 ∈ H_2, . . . , g_T ∈ H_T
- diversity by different parameters: GD with η = 0.001, 0.01, . . . , 10
- diversity by algorithmic randomness: random PLA with different random seeds
- diversity by data randomness: within-cross-validation hypotheses g_v⁻

next: **diversity** by data randomness **without** g⁻

## Revisit of Bias-Variance

expected performance of A = expected deviation to consensus + performance of consensus

consensus ḡ = expected g_t from D_t ∼ P^N

- consensus is more stable than direct A(D), but comes from many more D_t than the one D on hand
- want: approximate ḡ by
  - finite (large) T
  - approximate g_t = A(D_t) from D_t ∼ P^N using only D

**bootstrapping:** a statistical tool that re-samples from D to ‘simulate’ D_t

## Bootstrap Aggregation

### bootstrapping

bootstrap sample D̃_t: re-sample N examples from D **uniformly with replacement** —can also use an arbitrary N′ instead of the original N

### virtual aggregation

consider a **virtual** iterative process that, for t = 1, 2, . . . , T:

1. request size-N data D_t from P^N (i.i.d.)
2. obtain g_t by A(D_t)

G = Uniform({g_t})

### bootstrap aggregation

consider a **physical** iterative process that, for t = 1, 2, . . . , T:

1. request size-N′ data D̃_t from bootstrapping
2. obtain g_t by A(D̃_t)

G = Uniform({g_t})

bootstrap aggregation (BAGging): a simple **meta algorithm** on top of a **base algorithm** A
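The physical process can be sketched directly as a meta-algorithm over a base algorithm A; the base algorithm here is a hypothetical one-variable least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(3)

def bagging(A, D_x, D_y, T):
    """Meta-algorithm: run base algorithm A on T bootstrap samples of D."""
    N = len(D_x)
    models = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)  # re-sample N examples with replacement
        models.append(A(D_x[idx], D_y[idx]))
    return models

# hypothetical base algorithm: fit a line y = w*x + b by least squares
def A(x, y):
    X = np.column_stack([x, np.ones_like(x)])
    return np.linalg.lstsq(X, y, rcond=None)[0]

x = rng.uniform(0, 1, size=50)
y = 2 * x + 0.3 * rng.normal(size=50)   # made-up data with true slope 2
models = bagging(A, x, y, T=25)

# G = uniform average of the g_t (averaging lines = averaging coefficients)
w_bar, b_bar = np.mean(models, axis=0)
print("aggregated line: y =", w_bar, "* x +", b_bar)
```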
## Bagging Pocket in Action

T_POCKET = 1000; T_BAG = 25

- very **diverse** g_t from bagging
- proper **non-linear** boundary after aggregating the binary classifiers

bagging works reasonably well **if the base algorithm is sensitive to data randomness**

## Fun Time

When using bootstrapping to re-sample N examples D̃_t from a data set D with N examples, what is the probability of getting D̃_t exactly the same as D?

1. 0/N^N = 0
2. 1/N^N
3. N!/N^N
4. N^N/N^N = 1

**Reference Answer: 3**

Consider re-sampling in an ordered manner for N steps. Then there are N^N possible outcomes D̃_t, each with equal probability. Most importantly, N! of the outcomes are permutations of the original D, and thus the answer.
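The answer N!/N^N can be confirmed by brute-force enumeration for a small N: count the ordered draws that are permutations of D.

```python
from itertools import product
from math import factorial

# exact check of N!/N^N by enumeration for a small N
N = 3
D = list(range(N))  # stand-in data set of N distinct examples
hits = sum(1 for draw in product(D, repeat=N) if sorted(draw) == D)
print(hits, "/", N ** N)        # permutations among all ordered draws
assert hits == factorial(N)     # 3! = 6 out of 3^3 = 27
```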
