(1)

Machine Learning Techniques
(機器學習技巧)

Lecture 7: Blending and Bagging

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)

(2)

Blending and Bagging

Agenda

Lecture 7: Blending and Bagging
• Motivation of Aggregation
• Uniform Blending
• Linear and Any Blending
• Bagging

(3)

Blending and Bagging Motivation of Aggregation

An Aggregation Story

Your T friends g_1, ..., g_T predict whether the stock will go up, as g_t(x).

You can ...
• select the most trustworthy friend based on their usual performance —validation!
• mix the predictions from all your friends uniformly —let them vote!
• mix the predictions from all your friends non-uniformly —let them vote, but give some friends more ballots
• combine the predictions conditionally —if [condition t true], give some ballots to friend t
• ...

aggregation models: mix or combine hypotheses (for better performance)

(4)

Blending and Bagging Motivation of Aggregation

Aggregation with Math Notations

Your T friends g_1, ..., g_T predict whether the stock will go up, as g_t(x).

• select the most trustworthy friend based on their usual performance:
  G(x) = g_{t^*}(x) \text{ with } t^* = \operatorname{argmin}_{t \in \{1,2,\ldots,T\}} E_{\mathrm{val}}(g_t)
• mix the predictions from all your friends uniformly:
  G(x) = \operatorname{sign}\Big( \sum_{t=1}^{T} 1 \cdot g_t(x) \Big)
• mix the predictions from all your friends non-uniformly:
  G(x) = \operatorname{sign}\Big( \sum_{t=1}^{T} \alpha_t \cdot g_t(x) \Big) \text{ with } \alpha_t \ge 0
  • includes select: \alpha_t = \llbracket E_{\mathrm{val}}(g_t) \text{ smallest} \rrbracket
  • includes uniform: \alpha_t = 1
• combine the predictions conditionally:
  G(x) = \operatorname{sign}\Big( \sum_{t=1}^{T} q_t(x) \cdot g_t(x) \Big) \text{ with } q_t(x) \ge 0
  • includes non-uniform: q_t(x) = \alpha_t

aggregation models: a rich family

(5)

Blending and Bagging Motivation of Aggregation

Recall: Selection by Validation

G(x) = g_{t^*}(x) \text{ with } t^* = \operatorname{argmin}_{t \in \{1,2,\ldots,T\}} E_{\mathrm{val}}(g_t)

• simple and popular
• can also use E_{\mathrm{in}} instead of E_{\mathrm{val}} (with a complexity price on d_{\mathrm{VC}})
• needs one strong g_t to guarantee small E_{\mathrm{val}} (and small E_{\mathrm{out}})

selection: relies on one strong hypothesis
aggregation: can we do better with many (possibly weaker) hypotheses?
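A minimal sketch of selection by validation, assuming each trained g_t is an object with a .predict(X) method returning ±1 labels (an assumed interface, not lecture code):

```python
import numpy as np

def select_by_validation(models, X_val, y_val):
    """Return the single g_t with the smallest validation 0/1 error."""
    def e_val(g):
        return np.mean(g.predict(X_val) != y_val)   # E_val(g_t)
    return min(models, key=e_val)                   # G = g_{t*}
```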

(6)

Blending and Bagging Motivation of Aggregation

Why Might Aggregation Work?

• mix different weak hypotheses uniformly —G(x) 'strong'
  aggregation =⇒ feature transform (?)
• mix different random-PLA hypotheses uniformly —G(x) 'moderate'
  aggregation =⇒ regularization (?)

proper aggregation =⇒ better performance

(7)

Blending and Bagging Motivation of Aggregation

Fun Time

(8)

Blending and Bagging Uniform Blending

Uniform Blending (Voting) for Classification

uniform blending: known g_t, each with 1 ballot

G(x) = \operatorname{sign}\Big( \sum_{t=1}^{T} 1 \cdot g_t(x) \Big)

• same g_t (autocracy): as good as one single g_t
• very different g_t (diversity + democracy): majority can correct minority
• similar results with uniform voting for multiclass:
  G(x) = \operatorname{argmax}_{1 \le k \le K} \sum_{t=1}^{T} \llbracket g_t(x) = k \rrbracket

how about regression?
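A minimal sketch of uniform voting, assuming the predictions are already collected into a (T, N) array with one row per g_t (an assumed setup, not lecture code):

```python
import numpy as np

def uniform_vote_binary(preds):
    """preds: (T, N) array of ±1 predictions; ties (sum 0) are returned as 0."""
    return np.sign(preds.sum(axis=0))               # G(x) = sign(sum_t g_t(x))

def uniform_vote_multiclass(preds, K):
    """preds: (T, N) array of class labels in {0, ..., K-1}."""
    ballots = np.array([(preds == k).sum(axis=0) for k in range(K)])   # (K, N)
    return ballots.argmax(axis=0)                   # argmax_k sum_t [g_t(x) = k]
```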

(9)

Blending and Bagging Uniform Blending

Uniform Blending for Regression

G(x) = \frac{1}{T} \sum_{t=1}^{T} g_t(x)

• same g_t (autocracy): as good as one single g_t
• very different g_t (diversity + democracy):
  =⇒ some g_t(x) > f(x), some g_t(x) < f(x)
  =⇒ the average could be more accurate than the individual g_t

diverse hypotheses: even simple uniform blending can be better than one
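The regression counterpart, under the same assumed (T, N) prediction-array setup as above:

```python
import numpy as np

def uniform_average(preds):
    """preds: (T, N) array of real-valued predictions g_t(x_n)."""
    return preds.mean(axis=0)                       # G(x) = (1/T) sum_t g_t(x)
```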

(10)

Blending and Bagging Uniform Blending

Theoretical Analysis of Uniform Blending

G(x) = \frac{1}{T} \sum_{t=1}^{T} g_t(x)

For a fixed x, writing avg(·) for the average over t:

avg\big( (g_t(x) - f(x))^2 \big)
  = avg\big( g_t^2 - 2 g_t f + f^2 \big)
  = avg\big( g_t^2 \big) - 2 G f + f^2
  = avg\big( g_t^2 \big) - G^2 + (G - f)^2
  = avg\big( g_t^2 \big) - 2 G^2 + G^2 + (G - f)^2
  = avg\big( g_t^2 - 2 g_t G + G^2 \big) + (G - f)^2
  = avg\big( (g_t - G)^2 \big) + (G - f)^2

Taking the expectation over x on both sides:

avg\big( E_{\mathrm{out}}(g_t) \big) = avg\big( \mathcal{E}\,(g_t - G)^2 \big) + E_{\mathrm{out}}(G)
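A quick numerical check of the decomposition at a single fixed x (a sketch with made-up numbers, not lecture code):

```python
import numpy as np

rng = np.random.default_rng(0)
f_x = 1.7                                  # target value f(x) at some fixed x
g_x = rng.normal(2.0, 0.5, size=10)        # predictions g_1(x), ..., g_T(x)
G_x = g_x.mean()                           # uniform blend G(x)

lhs = np.mean((g_x - f_x) ** 2)            # avg((g_t - f)^2)
rhs = np.mean((g_x - G_x) ** 2) + (G_x - f_x) ** 2
print(lhs, rhs)                            # identical up to floating-point error
```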

(11)

Blending and Bagging Uniform Blending

Some Special g_t

consider a virtual iterative process that, for t = 1, 2, ..., T:
1. request size-N data D_t from P^N (i.i.d.)
2. obtain g_t by A(D_t)

\bar{g} = \lim_{T \to \infty} G = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} g_t = \mathop{\mathcal{E}}_{\mathcal{D}} A(\mathcal{D})

avg\big( E_{\mathrm{out}}(g_t) \big) = avg\big( \mathcal{E}\,(g_t - \bar{g})^2 \big) + E_{\mathrm{out}}(\bar{g})

expected performance of A = expected deviation to consensus + performance of consensus

• performance of consensus: called bias
• expected deviation to consensus: called variance

uniform blending: reduces variance for stabler performance
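A small simulation of this virtual process (an illustrative sketch: the base algorithm A fits a constant to a sine target under squared error; all names and numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(np.pi * x)               # illustrative target function

N, T = 5, 2000                             # size of each D_t, number of draws

def A(X, y):
    """Toy base algorithm: the best constant hypothesis under squared error."""
    return y.mean()

# virtual process: draw D_t ~ P^N, obtain g_t = A(D_t)
g = np.empty(T)
for t in range(T):
    X_t = rng.uniform(-1, 1, N)
    g[t] = A(X_t, f(X_t))

G = g.mean()                               # consensus (finite-T stand-in for g_bar)
X_test = rng.uniform(-1, 1, 100_000)
y_test = f(X_test)

avg_Eout = np.mean([np.mean((gt - y_test) ** 2) for gt in g])
variance = np.mean((g - G) ** 2)           # expected deviation to consensus
bias = np.mean((G - y_test) ** 2)          # E_out of the consensus
print(avg_Eout, variance + bias)           # the two sides match
```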

(12)

Blending and Bagging Uniform Blending

Fun Time

(13)

Blending and Bagging Linear and Any Blending

Linear Blending

linear blending: known g_t, each to be given α_t ballots

G(x) = \operatorname{sign}\Big( \sum_{t=1}^{T} \alpha_t \cdot g_t(x) \Big) \text{ with } \alpha_t \ge 0

computing 'good' α_t: \min_{\alpha_t \ge 0} E_{\mathrm{in}}(\boldsymbol{\alpha})

linear blending for regression:
\min_{\alpha_t \ge 0} \frac{1}{N} \sum_{n=1}^{N} \Big( y_n - \sum_{t=1}^{T} \alpha_t g_t(x_n) \Big)^2

LinReg + transformation:
\min_{w_i} \frac{1}{N} \sum_{n=1}^{N} \Big( y_n - \sum_{i=1}^{\tilde{d}} w_i \phi_i(x_n) \Big)^2

linear blending = LinModel + hypotheses as transform + constraints
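A minimal sketch of linear blending as linear regression on the transformed features z_n = (g_1(x_n), ..., g_T(x_n)), dropping the α_t ≥ 0 constraint (the next slide notes this is often done in practice); the model .predict interface is an assumption:

```python
import numpy as np

def linear_blend(models, X_val, y_val):
    """Learn blending weights alpha by least squares on a validation set."""
    Z = np.column_stack([g.predict(X_val) for g in models])   # rows are Phi(x_n)
    alpha, *_ = np.linalg.lstsq(Z, y_val, rcond=None)         # min ||Z a - y||^2
    return alpha

def blend_predict(models, alpha, X):
    Z = np.column_stack([g.predict(X) for g in models])
    return Z @ alpha             # take the sign of this score for classification
```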

(14)

Blending and Bagging Linear and Any Blending

Constraint on α_t

linear blending = LinModel + hypotheses as transform + constraints:

\min_{\alpha_t \ge 0} \frac{1}{N} \sum_{n=1}^{N} \mathrm{err}\Big( y_n, \sum_{t=1}^{T} \alpha_t g_t(x_n) \Big)

linear blending for binary classification:
if α_t < 0 =⇒ α_t g_t(x) = |α_t| (−g_t(x))

negative α_t for g_t ≡ positive |α_t| for −g_t
(if you have a stock up/down classifier with 99% error, tell me! :-))

in practice, often: linear blending = LinModel + hypotheses as transform (constraints dropped)

(15)

Blending and Bagging Linear and Any Blending

Linear Blending versus Selection

in practice, often g_1 ∈ H_1, g_2 ∈ H_2, ..., g_T ∈ H_T, each obtained by minimum E_in

• recall: selection by minimum E_in
  —best of best, paying d_{\mathrm{VC}}\Big( \bigcup_{t=1}^{T} H_t \Big)
• recall: linear blending includes selection as a special case
  —by setting α_t = \llbracket E_{\mathrm{val}}(g_t) \text{ smallest} \rrbracket
• complexity price of linear blending with E_in (aggregation of best):
  d_{\mathrm{VC}}\Big( \bigcup_{t=1}^{T} H_t \Big)

like selection, blending is practically done with
(E_val instead of E_in) + (g_t from minimum E_train)
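A sketch of that practical recipe (an assumed model interface with fit/predict where fit returns the model itself; split sizes and names are illustrative): each g_t is trained on the training part only, and the blending weights are learned on the held-out validation part.

```python
import numpy as np

def blend_with_validation(model_builders, X, y, val_fraction=0.25, seed=0):
    """Practical linear blending: g_t from E_train, alpha from E_val."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]

    models = [build().fit(X[train_idx], y[train_idx]) for build in model_builders]
    Z_val = np.column_stack([g.predict(X[val_idx]) for g in models])
    alpha, *_ = np.linalg.lstsq(Z_val, y[val_idx], rcond=None)
    return models, alpha
```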

(16)

Blending and Bagging Linear and Any Blending

Any Blending

Linear Blending
Given g_1, g_2, ..., g_T:
1. transform (x_n, y_n) in D_val to (z_n = Φ(x_n), y_n), where Φ(x) = (g_1(x), ..., g_T(x))
2. compute α = Lin({(z_n, y_n)})
return G_LINB(x) = LinH(α^T Φ(x))

Any Blending (Stacking)
Given g_1, g_2, ..., g_T:
1. transform (x_n, y_n) in D_val to (z_n = Φ(x_n), y_n), where Φ(x) = (g_1(x), ..., g_T(x))
2. compute g̃ = Any({(z_n, y_n)})
return G_ANYB(x) = g̃(Φ(x))

if AnyModel = quadratic polynomial:
G_{\mathrm{ANYB}}(x) = \sum_{t=1}^{T} \underbrace{\Big( \alpha_t + \sum_{\tau=1}^{T} \alpha_{\tau,t}\, g_\tau(x) \Big)}_{q_t(x)} \cdot g_t(x) —conditional aggregation

danger: overfitting with any blending!
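A minimal stacking sketch along these lines, with a plain least-squares quadratic model as the second-level learner (one possible choice; the helper names and the model .predict interface are assumptions). Fitting it on a held-out D_val rather than the training data is what keeps the overfitting danger in check.

```python
import numpy as np

def quad_features(Z):
    """Quadratic expansion of Phi(x): constant, z_t, and all products z_i * z_j."""
    cross = [Z[:, [i]] * Z[:, i:] for i in range(Z.shape[1])]
    return np.column_stack([np.ones(len(Z)), Z, *cross])

def stack_fit(models, X_val, y_val):
    """Any blending (stacking): fit a quadratic g~ on the transformed D_val."""
    Z = np.column_stack([g.predict(X_val) for g in models])
    w, *_ = np.linalg.lstsq(quad_features(Z), y_val, rcond=None)
    return w

def stack_predict(models, w, X):
    Z = np.column_stack([g.predict(X) for g in models])
    return quad_features(Z) @ w      # take the sign for binary classification
```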

(17)

Blending and Bagging Linear and Any Blending

Blending in Practice

KDDCup 2012 Track 1: world-champion solution by NTU

• validation set blending: a special any blending
  model E_test (squared): 519.45 =⇒ 456.24
  —helped secure the lead in the last two weeks
• test set blending: linear blending using Ẽ_test
  E_test (squared): 456.24 =⇒ 442.06
  —helped turn the tables in the last hour

blending 'useful' in practice, despite the computational burden

(18)

Blending and Bagging Linear and Any Blending

Fun Time

(19)

Blending and Bagging Bagging

What We Have Done

blending: aggregate after getting g_t; learning: aggregate as well as getting g_t

aggregation type | blending         | learning
uniform          | voting/averaging | ?
non-uniform      | linear           | ?
conditional      | stacking         | ?

learning g_t for uniform aggregation: diversity is important
• diversity by different models: g_1 ∈ H_1, g_2 ∈ H_2, ..., g_T ∈ H_T
• diversity by different parameters: GD with η = 0.001, 0.01, ..., 10
• diversity by algorithmic randomness: random PLA with different random seeds
• diversity by data randomness: within-cross-validation hypotheses g_v^-

next: diversity by data randomness without g^-

(20)

Blending and Bagging Bagging

Revisit of Bias-Variance

expected performance of A = expected deviation to consensus + performance of consensus

consensus \bar{g} = expected g_t from D_t ∼ P^N

• consensus: more stable than a direct A(D), but it comes from many more D_t than the single D on hand

want: approximate \bar{g} by
• finite (large) T
• approximate g_t = A(D_t) from D_t ∼ P^N using only D

bootstrapping: a statistical tool that re-samples from D to 'simulate' D_t
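A one-step bootstrap resample in numpy (a sketch of the re-sampling step only):

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Simulate D~_t: draw N indices from D uniformly, with replacement."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]
```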

(21)

Blending and Bagging Bagging

Bootstrap Aggregation

bootstrapping
• bootstrap sample D̃_t: re-sample N examples from D with replacement

virtual aggregation
consider a virtual iterative process that, for t = 1, 2, ..., T:
1. request size-N data D_t from P^N (i.i.d.)
2. obtain g_t by A(D_t)
G = Uniform({g_t})

bootstrap aggregation
consider a physical iterative process that, for t = 1, 2, ..., T:
1. request size-N data D̃_t from bootstrapping
2. obtain g_t by A(D̃_t)
G = Uniform({g_t})

bootstrap aggregation (BAGging): a simple meta algorithm on top of base algorithm A
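A minimal bagging sketch on top of an arbitrary base algorithm (the base_algorithm callable and its .predict interface are assumptions; the uniform vote at the end assumes ±1 labels):

```python
import numpy as np

def bagging(base_algorithm, X, y, T=25, seed=0):
    """BAGging: train T hypotheses, each on a bootstrap sample of D."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(T):
        idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample D~_t
        models.append(base_algorithm(X[idx], y[idx]))    # g_t = A(D~_t)
    return models

def bag_predict(models, X):
    votes = np.array([g.predict(X) for g in models])     # (T, N) array of ±1
    return np.sign(votes.sum(axis=0))                    # G = Uniform(g_t)
```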

(22)

Blending and Bagging Bagging

Bagging Pocket in Action

T_POCKET = 1000; T_BAG = 25

• very diverse g_t from bagging
• proper non-linear boundary after aggregating the binary classifiers

bagging works reasonably well if the base algorithm is sensitive to data randomness

(23)

Blending and Bagging Bagging

Fun Time

(24)

Blending and Bagging Bagging

Summary

Lecture 7: Blending and Bagging
• Motivation of Aggregation: strong and/or moderate
• Uniform Blending: one hypothesis, one vote, one value
• Linear and Any Blending: learning with hypotheses as transform
• Bagging: bootstrapping for diverse g_t
