(1)

Machine Learning Techniques (機器學習技法)

Lecture 7: Blending and Bagging

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

(Dept. of Computer Science and Information Engineering, National Taiwan University)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/23

(2)

Blending and Bagging

Lecture 6: Support Vector Regression
kernel ridge regression (dense) via ridge regression + kernel trick;
support vector regression (sparse) via regularized tube error + Lagrange dual

Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
  Lecture 7: Blending and Bagging
3 Distilling Implicit Features: Extraction Models

(3)

Blending and Bagging: Motivation of Aggregation

An Aggregation Story

Your T friends g_1, ..., g_T predict whether the stock will go up, as g_t(x). You can

• select the most trust-worthy friend from their usual performance
  —validation!
• mix the predictions from all your friends uniformly
  —let them vote!
• mix the predictions from all your friends non-uniformly
  —let them vote, but give some friends more ballots
• combine the predictions conditionally
  —if [t satisfies some condition] give some ballots to friend t
• ...

aggregation models: mix or combine hypotheses (for better performance)

(4)

Blending and Bagging: Motivation of Aggregation

Aggregation with Math Notations

Your T friends g_1, ..., g_T predict whether the stock will go up, as g_t(x).

• select the most trust-worthy friend from their usual performance
  G(x) = g_{t*}(x) with t* = argmin_{t ∈ {1,2,...,T}} E_val(g_t^-)
• mix the predictions from all your friends uniformly
  G(x) = sign( Σ_{t=1}^T 1 · g_t(x) )
• mix the predictions from all your friends non-uniformly
  G(x) = sign( Σ_{t=1}^T α_t · g_t(x) ) with α_t ≥ 0
• combine the predictions conditionally
  G(x) = sign( Σ_{t=1}^T q_t(x) · g_t(x) ) with q_t(x) ≥ 0
  —includes non-uniform mixing as the special case q_t(x) = α_t

aggregation models: a rich family

(5)

Blending and Bagging: Motivation of Aggregation

Recall: Selection by Validation

G(x) = g_{t*}(x) with t* = argmin_{t ∈ {1,2,...,T}} E_val(g_t^-)

• simple and popular
• what if we use E_in(g_t) instead? —pays a complexity price on d_VC
• need one strong g_t^- to guarantee small E_val (and small E_out)

selection: rely on one strong hypothesis
aggregation: can we do better with many (possibly weaker) hypotheses?

(6)

Blending and Bagging: Motivation of Aggregation

Why Might Aggregation Work?

• mix different weak hypotheses uniformly
  —G(x) 'strong'
  aggregation =⇒ feature transform (?)
• mix different random hypotheses uniformly
  —G(x) 'moderate'
  aggregation =⇒ regularization (?)

proper aggregation =⇒ better performance

(7)

Blending and Bagging: Motivation of Aggregation

Fun Time

Consider three decision stump hypotheses from R to {−1, +1}:
g_1(x) = sign(1 − x), g_2(x) = sign(1 + x), g_3(x) = −1.
When mixing the three hypotheses uniformly, what is the resulting G(x)?

1  2⟦|x| ≤ 1⟧ − 1
2  2⟦|x| ≥ 1⟧ − 1
3  2⟦x ≤ −1⟧ − 1
4  2⟦x ≥ +1⟧ − 1

Reference Answer: 1

The 'region' that gets two positive votes, from g_1 and g_2, is |x| ≤ 1, and thus G(x) is positive only within that region. We see that the three decision stumps g_t can be aggregated to form a more sophisticated hypothesis G.


(9)

Blending and Bagging: Uniform Blending

Uniform Blending (Voting) for Classification

uniform blending: known g_t, each with 1 ballot

G(x) = sign( Σ_{t=1}^T 1 · g_t(x) )

• same g_t (autocracy): as good as one single g_t
• very different g_t (diversity + democracy): majority can correct minority
• similar results with uniform voting for multiclass:
  G(x) = argmax_{1 ≤ k ≤ K} Σ_{t=1}^T ⟦g_t(x) = k⟧

how about regression?

(10)

Blending and Bagging: Uniform Blending

Uniform Blending for Regression

G(x) = (1/T) Σ_{t=1}^T g_t(x)

• same g_t (autocracy): as good as one single g_t
• very different g_t (diversity + democracy):
  =⇒ some g_t(x) > f(x), some g_t(x) < f(x)
  =⇒ the average could be more accurate than any individual

even simple uniform blending can be better than any single hypothesis
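The two blending rules above can be sketched in a few lines (a minimal illustration of ours, not the course's code; `sign` maps non-positive sums to −1, and the stumps are the ones from the earlier Fun Time question):

```python
def sign(v):
    """Binary-classification sign convention: non-positive maps to -1."""
    return 1 if v > 0 else -1

def uniform_blend_classify(gs, x):
    """Uniform blending (voting): G(x) = sign(sum_t 1 * g_t(x))."""
    return sign(sum(g(x) for g in gs))

def uniform_blend_regress(gs, x):
    """Uniform blending for regression: G(x) = (1/T) * sum_t g_t(x)."""
    return sum(g(x) for g in gs) / len(gs)

# the three decision stumps from the Fun Time question
g1 = lambda x: sign(1 - x)
g2 = lambda x: sign(1 + x)
g3 = lambda x: -1
```

With these stumps, the voting blend returns +1 for |x| < 1 and −1 for |x| > 1, matching the Fun Time answer.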

(11)

Blending and Bagging: Uniform Blending

Theoretical Analysis of Uniform Blending

For G(x) = (1/T) Σ_{t=1}^T g_t(x) and any fixed x (dropping the (x) for brevity), with avg(·) denoting (1/T) Σ_{t=1}^T (·):

avg( (g_t − f)² )
  = avg( g_t² − 2 g_t f + f² )
  = avg(g_t²) − 2 G f + f²              (since avg(g_t) = G)
  = avg(g_t²) − G² + (G − f)²
  = avg( g_t² − 2 g_t G + G² ) + (G − f)²
  = avg( (g_t − G)² ) + (G − f)²

averaging over all x:

avg( E_out(g_t) ) = avg( ℰ(g_t − G)² ) + E_out(G) ≥ E_out(G)
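Since every step above is algebraic, the decomposition holds exactly for any set of g_t; a quick numeric check (a standalone sketch with synthetic targets and predictions, not part of the lecture) confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 5, 1000
x = rng.uniform(-1, 1, N)
f = np.sin(np.pi * x)                                # target values f(x)
gs = [f + rng.normal(0, 0.5, N) for _ in range(T)]   # T noisy hypotheses' predictions
G = np.mean(gs, axis=0)                              # uniform blend

avg_eout = np.mean([np.mean((g - f) ** 2) for g in gs])    # avg(E_out(g_t))
deviation = np.mean([np.mean((g - G) ** 2) for g in gs])   # avg((g_t - G)^2)
blend_eout = np.mean((G - f) ** 2)                         # E_out(G)

assert np.isclose(avg_eout, deviation + blend_eout)        # the identity above
assert blend_eout <= avg_eout                              # G beats the average g_t
```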

(12)

Blending and Bagging: Uniform Blending

Some Special g_t

consider a virtual iterative process that for t = 1, 2, ..., T
  1  requests size-N data D_t from P^N (i.i.d.)
  2  obtains g_t by A(D_t)

ḡ = lim_{T→∞} G = lim_{T→∞} (1/T) Σ_{t=1}^T g_t = ℰ_D A(D)

avg( E_out(g_t) ) = avg( ℰ(g_t − ḡ)² ) + E_out(ḡ)

expected performance of A = expected deviation to consensus + performance of consensus
• performance of consensus ḡ: called bias
• expected deviation to consensus: called variance

uniform blending: reduces variance for more stable performance

(13)

Blending and Bagging: Uniform Blending

Fun Time

Consider applying uniform blending G(x) = (1/T) Σ_{t=1}^T g_t(x) on linear regression hypotheses g_t(x) = innerprod(w_t, x). Which of the following properties best describes the resulting G(x)?

1  a constant function of x
2  a linear function of x
4  none of the other choices

Reference Answer: 2

G(x) = innerprod( (1/T) Σ_{t=1}^T w_t, x ),
which is clearly a linear function of x. Note that we write 'innerprod' instead of the usual 'transpose' notation to avoid symbol conflict with T (the number of hypotheses).
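The answer can be double-checked numerically: averaging the predictions of linear hypotheses equals predicting with the averaged weight vector (a standalone sketch with random weights, by linearity of the inner product):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 4, 3
ws = rng.normal(size=(T, d))    # weight vectors of T linear hypotheses
x = rng.normal(size=d)

avg_of_preds = np.mean([w @ x for w in ws])   # (1/T) sum_t innerprod(w_t, x)
pred_of_avg = ws.mean(axis=0) @ x             # innerprod((1/T) sum_t w_t, x)
assert np.isclose(avg_of_preds, pred_of_avg)  # G is linear, with weights avg(w_t)
```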

(14)

Blending and Bagging Uniform Blending

Consider applying uniform blending G(x) =

P

g

t

(x) on linear regression hypotheses g

t

(x) = innerprod(w

,

x). Which of the following

property best describes the resulting G(x)?

1

a constant function of

2

a linear function of

4

none of the other choices

G(x) = innerprod 1 T

X

,

x

!

which is clearly a linear function of

x. Note that

we write ‘innerprod’ instead of the usual

‘transpose’ notation to avoid symbol conflict with T (number of hypotheses).

(15)

Blending and Bagging: Linear and Any Blending

Linear Blending

linear blending: known g_t, each to be given α_t ballots

G(x) = sign( Σ_{t=1}^T α_t · g_t(x) ) with α_t ≥ 0

computing 'good' α_t: min_{α_t ≥ 0} E_in(α)

linear blending for regression:
  min_{α_t ≥ 0} (1/N) Σ_{n=1}^N ( y_n − Σ_{t=1}^T α_t g_t(x_n) )²

linear regression + transformation:
  min_{w_i} (1/N) Σ_{n=1}^N ( y_n − Σ_{i=1}^{d̃} w_i φ_i(x_n) )²

like two-level learning, remember? :-)

linear blending = LinModel + hypotheses as transform + constraints
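The two-level-learning view can be sketched directly (function names are ours; the α_t ≥ 0 constraints are dropped, which the next slide notes is common in practice), with linear blending as least squares on the transformed features z = (g_1(x), ..., g_T(x)):

```python
import numpy as np

def linear_blend_fit(gs, X, y):
    """Fit blending weights alpha by least squares on z_n = (g_1(x_n), ..., g_T(x_n))."""
    Z = np.column_stack([g(X) for g in gs])   # N x T matrix of hypothesis outputs
    alpha, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return alpha

def linear_blend_predict(gs, alpha, X):
    return np.column_stack([g(X) for g in gs]) @ alpha

# example: blending a slope hypothesis and a constant hypothesis
X = np.linspace(-1, 1, 50)
y = 2 * X + 3
hypotheses = [lambda X: X, lambda X: np.ones_like(X)]
alpha = linear_blend_fit(hypotheses, X, y)
```

Here `alpha` recovers (2, 3) exactly, since y is itself a linear blend of the two hypotheses.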

(16)

Blending and Bagging: Linear and Any Blending

Constraint on α_t

linear blending = LinModel + hypotheses as transform + constraints:

  min_{α_t ≥ 0} (1/N) Σ_{n=1}^N err( y_n, Σ_{t=1}^T α_t g_t(x_n) )

linear blending for binary classification:
if α_t < 0 =⇒ α_t g_t(x) = |α_t| (−g_t(x))

negative α_t for g_t ≡ positive |α_t| for −g_t
—if you have a stock up/down classifier with 99% error, tell me! :-)

in practice, often
linear blending = LinModel + hypotheses as transform (constraints dropped)

(17)

Blending and Bagging: Linear and Any Blending

Linear Blending versus Selection

in practice, often g_1 ∈ H_1, g_2 ∈ H_2, ..., g_T ∈ H_T by minimum E_in

• recall: selection by minimum E_in
  —best of best, paying complexity price d_VC( ∪_{t=1}^T H_t )
• recall: linear blending includes selection as a special case
  —by setting α_t = ⟦E_val(g_t^-) smallest⟧
• complexity price of linear blending with E_in (aggregation of best):
  ≥ d_VC( ∪_{t=1}^T H_t )

like selection, blending is practically done with
(E_val instead of E_in) + (g_t^- from minimum E_train)

(18)

Blending and Bagging: Linear and Any Blending

Any Blending

Given g_1^-, g_2^-, ..., g_T^- from D_train, transform (x_n, y_n) in D_val to (z_n = Φ^-(x_n), y_n), where Φ^-(x) = (g_1^-(x), ..., g_T^-(x))

Linear Blending
  1  compute α = LinearModel( {(z_n, y_n)} )
  2  return G_LINB(x) = LinearHypothesis_α(Φ(x)), where Φ(x) = (g_1(x), ..., g_T(x))

Any Blending (Stacking)
  1  compute g̃ = AnyModel( {(z_n, y_n)} )
  2  return G_ANYB(x) = g̃(Φ(x)), where Φ(x) = (g_1(x), ..., g_T(x))

any blending: powerful, achieves conditional blending,
but danger of overfitting, as always :-(
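The procedure above can be sketched as follows (a hedged sketch: `train_models` and `meta_model` are hypothetical callables of ours, and for simplicity the returned blend reuses the g_t^- at prediction time instead of retraining g_t on all the data as the slide does):

```python
import numpy as np

def any_blend(train_models, meta_model, X_tr, y_tr, X_val, y_val):
    """Any blending (stacking): fit each g_t^- on D_train, transform D_val
    through Phi^-(x) = (g_1^-(x), ..., g_T^-(x)), then fit any meta model
    g~ on the transformed validation data."""
    gs = [fit(X_tr, y_tr) for fit in train_models]       # the g_t^-
    Z_val = np.column_stack([g(X_val) for g in gs])      # z_n = Phi^-(x_n)
    g_tilde = meta_model(Z_val, y_val)                   # step 1
    return lambda X: g_tilde(np.column_stack([g(X) for g in gs]))  # step 2
```

Using a linear `meta_model` recovers linear blending; any richer meta model gives conditional blending, with the overfitting risk noted above.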

(19)

Blending and Bagging: Linear and Any Blending

Blending in Practice

• validation set blending: a special any blending model
  reduced E_test (squared) —helped secure the lead
• test set blending: linear blending using Ẽ_test
  further reduced E_test (squared) —helped in the last hour

blending 'useful' in practice, despite the computational burden

(20)

Blending and Bagging: Linear and Any Blending

Fun Time

Consider three decision stump hypotheses from R to {−1, +1}:
g_1(x) = sign(1 − x), g_2(x) = sign(1 + x), g_3(x) = −1.
When x = 0, what is the resulting (g_1(x), g_2(x), g_3(x)) used in the returned hypothesis of linear/any blending?

1  (+1, +1, +1)
2  (+1, +1, −1)
3  (+1, −1, −1)
4  (−1, −1, −1)

Reference Answer: 2

Too easy? :-)


(22)

Blending and Bagging: Bagging (Bootstrap Aggregation)

What We Have Done

blending: aggregate after getting the g_t;
learning: aggregate as well as getting the g_t

for uniform aggregation: diversity is important
• diversity by different models: g_1 ∈ H_1, g_2 ∈ H_2, ..., g_T ∈ H_T
• diversity by different parameters: GD with η = 0.001, 0.01, ..., 10
• diversity by algorithmic randomness: random PLA with different random seeds
• diversity by data randomness: within-cross-validation hypotheses g_v^-

next: diversity by data randomness without g^-

(23)

Blending and Bagging: Bagging (Bootstrap Aggregation)

Revisit of Bias-Variance

expected performance of A = expected deviation to consensus + performance of consensus

consensus ḡ = expected g_t from D_t ∼ P^N

• consensus more stable than direct A(D),
  but comes from many more D_t than the single D on hand
• want: approximate ḡ by
  - finite (large) T
  - approximate g_t = A(D_t) from D_t ∼ P^N, using only D

bootstrapping: a statistical tool that re-samples from D to 'simulate' D_t

(24)

Blending and Bagging: Bagging (Bootstrap Aggregation)

Bootstrap Aggregation

bootstrap sample D̃_t: re-sample N examples from D uniformly with replacement

virtual aggregation:
consider a virtual iterative process that for t = 1, 2, ..., T
  1  requests size-N data D_t from P^N (i.i.d.)
  2  obtains g_t by A(D_t)
and G = Uniform({g_t})

bootstrap aggregation:
consider a physical iterative process that for t = 1, 2, ..., T
  1  requests size-N' bootstrap sample D̃_t from D
  2  obtains g_t by A(D̃_t)
and G = Uniform({g_t})

bootstrap aggregation (BAGging): a simple meta-algorithm on top of base algorithm A
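The physical process above can be sketched in a few lines (function names are ours; the base algorithm is any procedure that fits (X, y) and returns a predict function, and the blend here averages, as for regression):

```python
import numpy as np

def bagging(base_algorithm, X, y, T=25, seed=0):
    """Bootstrap aggregation: run base_algorithm on T bootstrap samples
    of the one data set on hand, then blend the g_t uniformly."""
    rng = np.random.default_rng(seed)
    N = len(X)
    gs = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)   # re-sample N examples with replacement
        gs.append(base_algorithm(X[idx], y[idx]))
    return lambda Xq: np.mean([g(Xq) for g in gs], axis=0)
```

For ±1 classification one would instead return the sign of the averaged votes, as in bagging pocket on the next slide.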

(25)

Blending and Bagging: Bagging (Bootstrap Aggregation)

Bagging Pocket in Action

T_POCKET = 1000; T_BAG = 25

[figure: decision boundaries of the 25 pocket hypotheses and of their aggregate]

• very diverse g_t from bagging
• proper non-linear boundary after aggregating the binary classifiers

bagging works reasonably well
if the base algorithm is sensitive to data randomness

(26)

Blending and Bagging: Bagging (Bootstrap Aggregation)

Fun Time

When using bootstrapping to re-sample N examples D̃_t from a data set D with N examples, what is the probability of getting D̃_t exactly the same as D?

1  0 / N^N = 0
2  1 / N^N
3  N! / N^N
4  N^N / N^N = 1

Reference Answer: 3

Consider re-sampling in an ordered manner for N steps. Then there are N^N possible outcomes D̃_t, each with equal probability. Most importantly, N! of the outcomes are permutations of the original D, and thus the answer.
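The answer is easy to evaluate (a small standalone sketch, assuming the N examples in D are distinct):

```python
import math

def prob_bootstrap_equals_D(N):
    """P[bootstrap sample is a permutation of the N distinct originals] = N!/N^N."""
    return math.factorial(N) / N ** N
```

The probability shrinks rapidly: 1/2 already at N = 2, and below 0.04% by N = 10.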


(28)

Blending and Bagging: Bagging (Bootstrap Aggregation)

Summary

1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
  Lecture 7: Blending and Bagging
3 Distilling Implicit Features: Extraction Models
