Machine Learning Techniques (機器學習技法)
Lecture 7: Blending and Bagging
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Blending and Bagging
Roadmap
1 Embedding Numerous Features: Kernel Models
Lecture 6: Support Vector Regression
kernel ridge regression (dense) via ridge regression + representer theorem;
support vector regression (sparse) via regularized tube error + Lagrange dual
2 Combining Predictive Features: Aggregation Models
Lecture 7: Blending and Bagging
Motivation of Aggregation
Uniform Blending
Linear and Any Blending
Bagging (Bootstrap Aggregation)
3 Distilling Implicit Features: Extraction Models
Blending and Bagging Motivation of Aggregation
An Aggregation Story
Your T friends g_1, · · · , g_T predict whether the stock will go up, as g_t(x). You can . . .
• select the most trustworthy friend based on their usual performance
—validation!
• mix the predictions from all your friends uniformly
—let them vote!
• mix the predictions from all your friends non-uniformly
—let them vote, but give some friends more ballots
• combine the predictions conditionally
—if [t satisfies some condition], give some ballots to friend t
• . . .
aggregation models: mix or combine hypotheses (for better performance)
Blending and Bagging Motivation of Aggregation
Aggregation with Math Notations
Your T friends g_1, · · · , g_T predict whether the stock will go up, as g_t(x).
• select the most trustworthy friend based on their usual performance:
G(x) = g_{t*}(x) with t* = argmin_{t ∈ {1,2,··· ,T}} E_val(g_t^−)
• mix the predictions from all your friends uniformly:
G(x) = sign( Σ_{t=1}^{T} 1 · g_t(x) )
• mix the predictions from all your friends non-uniformly:
G(x) = sign( Σ_{t=1}^{T} α_t · g_t(x) ) with α_t ≥ 0
• include select: α_t = ⟦E_val(g_t^−) smallest⟧
• include uniformly: α_t = 1
• combine the predictions conditionally:
G(x) = sign( Σ_{t=1}^{T} q_t(x) · g_t(x) ) with q_t(x) ≥ 0
• include non-uniformly: q_t(x) = α_t
aggregation models: a rich family
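A rough sketch (not from the slides) of these four aggregation modes, assuming each hypothesis is available as a ±1-valued prediction vector and the validation errors E_val(g_t^−) are known:

```python
import numpy as np

def select(preds, e_val):
    """Selection: return the predictions of the g_t with the smallest validation error."""
    return preds[np.argmin(e_val)]

def uniform_blend(preds):
    """Uniform blending: G(x) = sign(sum_t 1 * g_t(x))."""
    return np.sign(preds.sum(axis=0))

def nonuniform_blend(preds, alpha):
    """Non-uniform blending: G(x) = sign(sum_t alpha_t * g_t(x)), with alpha_t >= 0."""
    return np.sign(alpha @ preds)

def conditional_blend(preds, q):
    """Conditional blending: q_t(x) >= 0 may depend on x (q has the same shape as preds)."""
    return np.sign((q * preds).sum(axis=0))

# toy usage: T = 3 hypotheses, each evaluated on 5 points, predictions in {-1, +1}
preds = np.array([[+1, +1, -1, -1, +1],
                  [+1, -1, -1, +1, +1],
                  [-1, -1, -1, +1, +1]])
e_val = np.array([0.30, 0.20, 0.40])   # assumed validation errors of g_1^-, g_2^-, g_3^-
alpha = np.array([1.0, 2.0, 0.5])      # assumed ballots
print(select(preds, e_val))
print(uniform_blend(preds))
print(nonuniform_blend(preds, alpha))
```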
Blending and Bagging Motivation of Aggregation
Recall: Selection by Validation
G(x) = g_{t*}(x) with t* = argmin_{t ∈ {1,2,··· ,T}} E_val(g_t^−)
• simple and popular
• what if we use E_in(g_t) instead of E_val(g_t^−)? complexity price on d_VC, remember? :-)
• need one strong g_t^− to guarantee small E_val (and small E_out)
selection: relies on one strong hypothesis
aggregation: can we do better with many (possibly weaker) hypotheses?
Blending and Bagging Motivation of Aggregation
Why Might Aggregation Work?
• mix different weak hypotheses uniformly—G(x) ‘strong’
• aggregation =⇒ feature transform (?)
• mix different random-PLA hypotheses uniformly—G(x) ‘moderate’
• aggregation =⇒ regularization (?)
proper aggregation =⇒ better performance
Blending and Bagging Motivation of Aggregation
Fun Time
Consider three decision stump hypotheses from R to {−1, +1}:
g_1(x) = sign(1 − x), g_2(x) = sign(1 + x), g_3(x) = −1.
When mixing the three hypotheses uniformly, what is the resulting G(x)?
1  2⟦|x| ≤ 1⟧ − 1
2  2⟦|x| ≥ 1⟧ − 1
3  2⟦x ≤ −1⟧ − 1
4  2⟦x ≥ +1⟧ − 1
Reference Answer: 1
The ‘region’ that gets two positive votes from g_1 and g_2 is |x| ≤ 1, and thus G(x) is positive only within that region. We see that the three decision stumps g_t can be aggregated to form a more sophisticated hypothesis G.
Blending and Bagging Uniform Blending
Uniform Blending (Voting) for Classification
uniform blending: known g_t, each with 1 ballot
G(x) = sign( Σ_{t=1}^{T} 1 · g_t(x) )
• same g_t (autocracy): as good as one single g_t
• very different g_t (diversity + democracy): majority can correct minority
• similar results with uniform voting for multiclass:
G(x) = argmax_{1 ≤ k ≤ K} Σ_{t=1}^{T} ⟦g_t(x) = k⟧
how about regression?
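A minimal sketch of the multiclass uniform vote G(x) = argmax_k Σ_t ⟦g_t(x) = k⟧ at a single point x, assuming labels are 1, . . . , K:

```python
import numpy as np

def uniform_vote_multiclass(votes, num_classes):
    """votes: length-T array with g_1(x), ..., g_T(x), each a label in 1..K."""
    counts = np.bincount(votes, minlength=num_classes + 1)  # counts[k] = #{t : g_t(x) = k}
    return int(counts[1:].argmax()) + 1                     # ties broken toward the smaller label

# toy usage: T = 5 hypotheses vote on one point, K = 3 classes
print(uniform_vote_multiclass(np.array([2, 3, 2, 1, 2]), num_classes=3))  # -> 2
```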
Blending and Bagging Uniform Blending
Uniform Blending for Regression
G(x) = (1/T) Σ_{t=1}^{T} g_t(x)
• same g_t (autocracy): as good as one single g_t
• very different g_t (diversity + democracy):
=⇒ some g_t(x) > f(x), some g_t(x) < f(x)
=⇒ the average could be more accurate than the individuals
diverse hypotheses: even simple uniform blending can be better than any single hypothesis
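A tiny illustration (with made-up numbers, not from the slides) of the overshoot/undershoot intuition: the uniform average lands closer to f(x) than any single g_t(x):

```python
import numpy as np

f_x = 3.0                              # assumed target value f(x) at one point x
g_x = np.array([2.2, 2.7, 3.6, 3.9])   # assumed predictions g_1(x), ..., g_4(x)

G_x = g_x.mean()                       # G(x) = (1/T) * sum_t g_t(x) = 3.1
print(np.abs(g_x - f_x))               # individual errors: [0.8 0.3 0.6 0.9]
print(abs(G_x - f_x))                  # blended error: 0.1, smaller than every individual
```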
Blending and Bagging Uniform Blending
Theoretical Analysis of Uniform Blending
G(x) = (1/T) Σ_{t=1}^{T} g_t(x)

avg( (g_t(x) − f(x))² )
= avg( g_t² − 2 g_t f + f² )
= avg( g_t² ) − 2 G f + f²
= avg( g_t² ) − G² + (G − f)²
= avg( g_t² ) − 2 G² + G² + (G − f)²
= avg( g_t² − 2 g_t G + G² ) + (G − f)²
= avg( (g_t − G)² ) + (G − f)²

avg( E_out(g_t) ) = avg( E[(g_t − G)²] ) + E_out(G) ≥ E_out(G)
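A quick numerical check (illustrative values only) of the identity avg((g_t − f)²) = avg((g_t − G)²) + (G − f)² at one point x:

```python
import numpy as np

f_x = 3.0                              # assumed f(x)
g_x = np.array([2.2, 2.7, 3.6, 3.9])   # assumed g_t(x) for t = 1..T
G_x = g_x.mean()                       # G(x)

lhs = np.mean((g_x - f_x) ** 2)                       # average squared error of the g_t
rhs = np.mean((g_x - G_x) ** 2) + (G_x - f_x) ** 2    # deviation to consensus + consensus error
print(lhs, rhs)                                       # both 0.475, so avg error of g_t >= error of G
```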
Blending and Bagging Uniform Blending
Some Special g_t
consider a virtual iterative process that, for t = 1, 2, . . . , T,
1 requests size-N data D_t from P^N (i.i.d.)
2 obtains g_t by A(D_t)

ḡ = lim_{T→∞} G = lim_{T→∞} (1/T) Σ_{t=1}^{T} g_t = E_D[ A(D) ]

avg( E_out(g_t) ) = avg( E[(g_t − ḡ)²] ) + E_out(ḡ)
expected performance of A = expected deviation to consensus + performance of consensus
• performance of consensus: called bias
• expected deviation to consensus: called variance
uniform blending: reduces variance for more stable performance
Blending and Bagging Uniform Blending
Fun Time
Consider applying uniform blending G(x) = (1/T) Σ_{t=1}^{T} g_t(x) on linear regression hypotheses g_t(x) = innerprod(w_t, x). Which of the following properties best describes the resulting G(x)?
1  a constant function of x
2  a linear function of x
3  a quadratic function of x
4  none of the other choices
Reference Answer: 2
G(x) = innerprod( (1/T) Σ_{t=1}^{T} w_t, x ), which is clearly a linear function of x. Note that we write ‘innerprod’ instead of the usual ‘transpose’ notation to avoid a symbol conflict with T (the number of hypotheses).
Blending and Bagging Linear and Any Blending
Linear Blending
linear blending: known g_t, each to be given α_t ballots
G(x) = sign( Σ_{t=1}^{T} α_t · g_t(x) ) with α_t ≥ 0
computing ‘good’ α_t: min_{α_t ≥ 0} E_in(α)

linear blending for regression:
min_{α_t ≥ 0} (1/N) Σ_{n=1}^{N} ( y_n − Σ_{t=1}^{T} α_t g_t(x_n) )²

LinReg + transformation:
min_{w_i} (1/N) Σ_{n=1}^{N} ( y_n − Σ_{i=1}^{d̃} w_i φ_i(x_n) )²

like two-level learning, remember? :-)
linear blending = LinModel + hypotheses as transform + constraints
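A minimal sketch of the two-level-learning view of linear blending for regression: treat Φ(x) = (g_1(x), . . . , g_T(x)) as a feature transform and run plain linear regression on it. The α_t ≥ 0 constraint is dropped here (as the next slide notes is common in practice), the toy hypotheses are purely illustrative, and in practice α would be fit on a validation set with g_t^− trained on the rest:

```python
import numpy as np

def linear_blend_fit(g_list, X, y):
    """Fit blending weights alpha by least squares on Z = Phi(X)."""
    Z = np.column_stack([g(X) for g in g_list])     # N x T transformed data
    alpha, *_ = np.linalg.lstsq(Z, y, rcond=None)   # unconstrained: alpha_t may be negative
    return alpha

def linear_blend_predict(g_list, alpha, X):
    return np.column_stack([g(X) for g in g_list]) @ alpha

# toy usage: three assumed hypotheses g_t mapping R^2 -> R
g_list = [lambda X: X[:, 0], lambda X: X[:, 1], lambda X: X[:, 0] * X[:, 1]]
X = np.random.randn(200, 2)
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + 0.01 * np.random.randn(200)
alpha = linear_blend_fit(g_list, X, y)
print(alpha)   # roughly [0.5, -0.2, 0.0]
```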
Blending and Bagging Linear and Any Blending
Constraint on α_t
linear blending = LinModel + hypotheses as transform + constraints:
min_{α_t ≥ 0} (1/N) Σ_{n=1}^{N} err( y_n, Σ_{t=1}^{T} α_t g_t(x_n) )

linear blending for binary classification:
if α_t < 0 =⇒ α_t g_t(x) = |α_t| (−g_t(x))
• negative α_t for g_t ≡ positive |α_t| for −g_t
• if you have a stock up/down classifier with 99% error, tell me! :-)

in practice, often
linear blending = LinModel + hypotheses as transform, with the constraints dropped
Blending and Bagging Linear and Any Blending
Linear Blending versus Selection
in practice, often g_1 ∈ H_1, g_2 ∈ H_2, . . . , g_T ∈ H_T by minimum E_in
• recall: selection by minimum E_in
—best of best, paying d_VC( ∪_{t=1}^{T} H_t )
• recall: linear blending includes selection as a special case
—by setting α_t = ⟦E_val(g_t^−) smallest⟧
• complexity price of linear blending with E_in (aggregation of best):
≥ d_VC( ∪_{t=1}^{T} H_t )
like selection, blending is practically done with
(E_val instead of E_in) + (g_t^− from minimum E_train)
Blending and Bagging Linear and Any Blending
Any Blending
Given g_1^−, g_2^−, . . . , g_T^− from D_train, transform (x_n, y_n) in D_val to (z_n = Φ^−(x_n), y_n), where Φ^−(x) = (g_1^−(x), . . . , g_T^−(x))

Linear Blending
1 compute α = LinearModel( {(z_n, y_n)} )
2 return G_LINB(x) = LinearHypothesis_α(Φ(x))

Any Blending (Stacking)
1 compute g̃ = AnyModel( {(z_n, y_n)} )
2 return G_ANYB(x) = g̃(Φ(x))

where Φ(x) = (g_1(x), . . . , g_T(x))

any blending:
• powerful, achieves conditional blending
• but danger of overfitting, as always :-(
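A minimal stacking sketch under the definitions above; the choice of a scikit-learn decision tree as ‘AnyModel’ is an assumption made purely for illustration, not something prescribed by the course:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor   # stands in for "AnyModel"; any regressor works

def phi(g_list, X):
    """Phi(x) = (g_1(x), ..., g_T(x)), stacked as an N x T matrix."""
    return np.column_stack([g(X) for g in g_list])

def any_blend_fit(g_minus_list, X_val, y_val):
    """Stacking: train g_tilde on the transformed validation set {(z_n, y_n)}."""
    return DecisionTreeRegressor(max_depth=3).fit(phi(g_minus_list, X_val), y_val)

def any_blend_predict(g_tilde, g_list, X):
    """G_ANYB(x) = g_tilde(Phi(x))."""
    return g_tilde.predict(phi(g_list, X))
```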
Blending and Bagging Linear and Any Blending
Blending in Practice
(Chen et al., A Linear Ensemble of Individual and Blended Models for Music Rating Prediction, 2012)
KDDCup 2011 Track 1: world champion solution by NTU
• validation set blending: a special any blending model
E_test (squared): 519.45 =⇒ 456.24
—helped secure the lead in the last two weeks
• test set blending: linear blending using Ẽ_test
E_test (squared): 456.24 =⇒ 442.06
—helped turn the tables in the last hour
blending ‘useful’ in practice, despite the computational burden
Blending and Bagging Linear and Any Blending
Fun Time
Consider three decision stump hypotheses from R to {−1, +1}:
g_1(x) = sign(1 − x), g_2(x) = sign(1 + x), g_3(x) = −1.
When x = 0, what is the resulting Φ(x) = (g_1(x), g_2(x), g_3(x)) used in the returned hypothesis of linear/any blending?
1  (+1, +1, +1)
2  (+1, +1, −1)
3  (+1, −1, −1)
4  (−1, −1, −1)
Reference Answer: 2
Too easy? :-)
Blending and Bagging Bagging (Bootstrap Aggregation)
What We Have Done
blending: aggregate after getting g_t; learning: aggregate as well as getting g_t

aggregation type | blending          | learning
uniform          | voting/averaging  | ?
non-uniform      | linear            | ?
conditional      | stacking          | ?

learning g_t for uniform aggregation: diversity is important
• diversity by different models: g_1 ∈ H_1, g_2 ∈ H_2, . . . , g_T ∈ H_T
• diversity by different parameters: GD with η = 0.001, 0.01, . . . , 10
• diversity by algorithmic randomness: random PLA with different random seeds
• diversity by data randomness: within-cross-validation hypotheses g_v^−
next: diversity by data randomness without g^−
Blending and Bagging Bagging (Bootstrap Aggregation)
Revisit of Bias-Variance
expected performance of A = expected deviation to consensus + performance of consensus
consensus ḡ = expected g_t from D_t ∼ P^N
• consensus is more stable than a direct A(D), but comes from many more D_t than the one D on hand
• want: approximate ḡ by
• finite (large) T
• approximate g_t = A(D_t) from D_t ∼ P^N using only D
bootstrapping: a statistical tool that re-samples from D to ‘simulate’ D_t
Blending and Bagging Bagging (Bootstrap Aggregation)
Bootstrap Aggregation
bootstrapping
bootstrap sample D̃_t: re-sample N examples from D uniformly with replacement
—can also use an arbitrary N′ instead of the original N

virtual aggregation
consider a virtual iterative process that, for t = 1, 2, . . . , T,
1 requests size-N data D_t from P^N (i.i.d.)
2 obtains g_t by A(D_t)
G = Uniform({g_t})

bootstrap aggregation
consider a physical iterative process that, for t = 1, 2, . . . , T,
1 requests size-N′ data D̃_t from bootstrapping
2 obtains g_t by A(D̃_t)
G = Uniform({g_t})

bootstrap aggregation (BAGging): a simple meta algorithm on top of base algorithm A
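A minimal bagging sketch (not from the slides), assuming the base algorithm is given as a function train(X, y) -> g and each g is a callable; the uniform aggregation averages the g_t (take the sign of the average for binary classification):

```python
import numpy as np

def bagging(train, X, y, T=25, n_prime=None, rng=None):
    """Bootstrap AGgregation: train T hypotheses on bootstrap samples of (X, y)."""
    rng = np.random.default_rng(rng)
    N = len(X)
    n_prime = n_prime or N                      # can re-sample N' != N examples if desired
    g_list = []
    for _ in range(T):
        idx = rng.integers(0, N, size=n_prime)  # re-sample uniformly with replacement
        g_list.append(train(X[idx], y[idx]))
    return lambda X_new: np.mean([g(X_new) for g in g_list], axis=0)  # G = Uniform({g_t})
```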
Blending and Bagging Bagging (Bootstrap Aggregation)
Bagging Pocket in Action
T_POCKET = 1000; T_BAG = 25
• very diverse g_t from bagging
• proper non-linear boundary after aggregating the binary classifiers
bagging works reasonably well if the base algorithm is sensitive to data randomness
Blending and Bagging Bagging (Bootstrap Aggregation)
Fun Time
When using bootstrapping to re-sample N examples into D̃_t from a data set D with N examples, what is the probability of getting D̃_t exactly the same as D?
1  0/N^N = 0
2  1/N^N
3  N!/N^N
4  N^N/N^N = 1
Reference Answer: 3
Consider re-sampling in an ordered manner for N steps. Then there are N^N possible outcomes D̃_t, each with equal probability. Most importantly, N! of the outcomes are permutations of the original D, and thus the answer.
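As a small check of this answer (not from the slides): for N = 3 the probability is 3!/3³ = 6/27 ≈ 0.22, and it decays very quickly with N:

```python
from math import factorial

for N in (3, 5, 10):
    # probability that an ordered bootstrap draw is exactly a permutation of D
    print(N, factorial(N) / N ** N)   # 0.2222..., 0.0384, 0.00036288
```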