(1)

Machine Learning Foundations (機器學習基石)

Lecture 15: Validation

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)


(2)

Validation

Roadmap

1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Lecture 14: Regularization
minimizes augmented error, where the added regularizer effectively limits model complexity

Lecture 15: Validation
Model Selection Problem
Validation
Leave-One-Out Cross Validation
V-Fold Cross Validation

(3)

Validation / Model Selection Problem

So Many Models Learned

Even Just for Binary Classification . . .

A ∈ {PLA, pocket, linear regression, logistic regression}
× T ∈ {100, 1000, 10000}
× η ∈ {1, 0.01, 0.0001}
× Φ ∈ {linear, quadratic, poly-10, Legendre-poly-10}
× Ω(w) ∈ {L2 regularizer, L1 regularizer, symmetry regularizer}
× λ ∈ {0, 0.01, 1}

in addition to your favorite combination, you may need to try other combinations to get a good g


(4)

Validation / Model Selection Problem

Model Selection Problem

(figure: two candidate classifiers, one from H_1 and one from H_2; which one do you prefer? :-))

given: M models H_1, H_2, . . . , H_M, each with corresponding algorithm A_1, A_2, . . . , A_M

goal: select H_{m*} such that g_{m*} = A_{m*}(D) is of low E_out(g_{m*})

• unknown E_out due to unknown P(x) & P(y|x), as always :-)
• arguably the most important practical problem of ML

how to select? visually? —no, remember Lecture 12? :-)

(5)

Validation / Model Selection Problem

Model Selection by Best E_in

select by best E_in?
m* = argmin_{1≤m≤M} ( E_m = E_in(A_m(D)) )

• Φ_1126 always more preferred over Φ_1; λ = 0 always more preferred over λ = 0.1—overfitting?
• if A_1 minimizes E_in over H_1 and A_2 minimizes E_in over H_2,
  ⇒ g_{m*} achieves minimal E_in over H_1 ∪ H_2
  ⇒ 'model selection + learning' pays d_VC(H_1 ∪ H_2)—bad generalization?

selecting by E_in is dangerous
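The selection optimism is easy to see numerically. Below is a tiny illustrative simulation (my own sketch, not from the lecture): even when every hypothesis is a pure guesser with true E_out = 0.5, picking the best of M by E_in looks far better than 0.5 in-sample.

```python
import numpy as np

# selection-bias demo: the best of M coin-flip "hypotheses" by E_in
# looks good in-sample although every one has E_out exactly 0.5
rng = np.random.default_rng(0)
N, M = 50, 500
y = rng.choice([-1, 1], size=N)              # labels are pure noise
preds = rng.choice([-1, 1], size=(M, N))     # M hypotheses that just guess
E_in = np.mean(preds != y, axis=1)           # in-sample error of each hypothesis
print(E_in.min())                            # typically well below 0.5 after selection
```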


(6)

Validation / Model Selection Problem

Model Selection by Best E_test

select by best E_test, which is evaluated on a fresh D_test?
m* = argmin_{1≤m≤M} ( E_m = E_test(A_m(D)) )

generalization guarantee (finite-bin Hoeffding):
E_out(g_{m*}) ≤ E_test(g_{m*}) + O(√(log M / N_test))
—yes! strong guarantee :-)

but where is D_test?—your boss's safe, maybe? :-(

selecting by E_test is infeasible and cheating

(7)

Validation / Model Selection Problem

Comparison between E_in and E_test

in-sample error E_in:
• calculated from D
• feasible: on hand
• 'contaminated': D also used by A_m to 'select' g_m

test error E_test:
• calculated from D_test
• infeasible: in boss's safe
• 'clean': D_test never used for selection before

something in between: E_val, calculated from D_val ⊂ D:
• feasible: on hand
• 'clean': if D_val never used by A_m before

selecting by E_val: legal cheating :-)


(8)

Validation / Model Selection Problem

Fun Time

For X = R^d, consider two hypothesis sets, H+ and H−. The first hypothesis set contains all perceptrons with w_1 ≥ 0, and the second contains all perceptrons with w_1 ≤ 0. Denote g+ and g− as the minimum-E_in hypothesis in each hypothesis set, respectively. Which statement below is true?

1 If E_in(g+) < E_in(g−), then g+ is the minimum-E_in hypothesis of all perceptrons in R^d.
2 If E_test(g+) < E_test(g−), then g+ is the minimum-E_test hypothesis of all perceptrons in R^d.
3 The two hypothesis sets are disjoint.
4 None of the above

Reference Answer: 1

Note that the two hypothesis sets are not disjoint (sharing the 'w_1 = 0' perceptrons) but their union is all perceptrons: the smaller of E_in(g+) and E_in(g−) is therefore the minimum E_in over all perceptrons.


(10)

Validation / Validation

Validation Set D_val

split D (size N) into D_train (size N − K) for training and D_val (size K) for validation:
E_in(h) is evaluated on D; E_val(h) is evaluated on D_val

g_m = A_m(D),  g_m^- = A_m(D_train)

• D_val ⊂ D: called the validation set—'on-hand' simulation of the test set
• to connect E_val with E_out: D_val iid ∼ P(x, y) ⇐= select K examples from D at random
• to make sure D_val is 'clean': feed only D_train to A_m for model selection

E_out(g_m^-) ≤ E_val(g_m^-) + O(√(log M / K))

(11)

Validation / Validation

Model Selection by Best E_val

m* = argmin_{1≤m≤M} ( E_m = E_val(A_m(D_train)) )

generalization guarantee for all m:
E_out(g_m^-) ≤ E_val(g_m^-) + O(√(log M / K))

heuristic gain from training on N − K examples back to all N:
E_out( g_{m*} = A_{m*}(D) ) ≤ E_out( g_{m*}^- = A_{m*}(D_train) )
—learning curve, remember? :-)

(figure: run A_1, . . . , A_M on D_train to get g_1^-, . . . , g_M^-; evaluate E_1, . . . , E_M on D_val; pick the best (H_{m*}, E_{m*}); retrain on the full D to output g_{m*})

E_out(g_{m*}) ≤ E_out(g_{m*}^-) ≤ E_val(g_{m*}^-) + O(√(log M / K))

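The select-then-retrain recipe above fits in a few lines of code. Below is a minimal NumPy sketch, assuming ridge-regression candidates with λ ∈ {0, 0.01, 1} as the M models; the data, helper names, and candidate list are illustrative assumptions, not from the lecture.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Analytic ridge regression: w = (X'X + lam*I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def sq_err(w, X, y):
    """Mean squared error of the linear hypothesis w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

def select_by_validation(X, y, lambdas, K, rng):
    """Train each candidate on D_train, pick m* by E_val, retrain on the full D."""
    idx = rng.permutation(len(y))            # random split keeps D_val iid from P(x, y)
    val, train = idx[:K], idx[K:]            # D_val (size K) and D_train (size N - K)
    e_val = [sq_err(ridge_fit(X[train], y[train], lam), X[val], y[val])
             for lam in lambdas]             # E_val(g_m^-) for every candidate
    m_star = int(np.argmin(e_val))           # m* = argmin E_m
    return ridge_fit(X, y, lambdas[m_star])  # g_{m*} = A_{m*}(D), retrained on all N

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.3 * rng.normal(size=100)
g = select_by_validation(X, y, lambdas=[0.0, 0.01, 1.0], K=100 // 5, rng=rng)
```

Note the two-step shape: E_val drives the choice, but the reported hypothesis is retrained on all N examples, which is exactly the heuristic gain the slide describes.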

(12)

Validation / Validation

Validation in Practice

use validation to select between H_{Φ_5} and H_{Φ_10}

(figure: expected E_out versus validation set size K, for K from 5 to 25, with curves for optimal validation g_{m*}, in-sample g_{m̂}, and validation g_{m*}^- and g_{m*})

• in-sample: selection with E_in
• optimal: cheating-selection with E_test
• sub-g: selection with E_val, reporting g_{m*}^-
• full-g: selection with E_val, reporting g_{m*}
—E_out(g_{m*}) ≤ E_out(g_{m*}^-) indeed

why is sub-g sometimes worse than in-sample?

(13)

Validation / Validation

The Dilemma about K

reasoning of validation:
E_out(g) ≈ E_out(g^-) ≈ E_val(g^-)
(the first ≈ holds for small K; the second for large K)

• large K: every E_val ≈ E_out, but all g_m^- much worse than g_m
• small K: every g_m^- ≈ g_m, but E_val far from E_out

(figure: the same expected-E_out-versus-K plot as on the previous slide)

practical rule of thumb: K = N/5


(14)

Validation / Validation

Fun Time

For a learning model that takes N^2 seconds of training when using N examples, what is the total number of seconds needed to run the whole validation procedure with K = N/5 on 25 such models with different parameters to get the final g_{m*}?

1 6N^2
2 17N^2
3 25N^2
4 26N^2

Reference Answer: 2

Each candidate trains on N − K = 4N/5 examples, taking (4N/5)^2 = (16/25)N^2 seconds, so getting all the g_m^- needs (16/25)N^2 · 25 = 16N^2 seconds. Then to get g_{m*}, we need another N^2 seconds of retraining on the full D. So in total we need 17N^2 seconds.
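A quick exact-arithmetic check of this count (an illustrative snippet, not part of the slides):

```python
from fractions import Fraction

per_model = Fraction(4, 5) ** 2   # training on N - K = 4N/5 examples costs (16/25) N^2
total = per_model * 25 + 1        # 25 candidates, plus one final retrain on all N examples
print(total)                      # 17, i.e. 17 N^2 seconds
```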


(16)

Validation / Leave-One-Out Cross Validation

Extreme Case: K = 1

reasoning of validation:
E_out(g) ≈ E_out(g^-) ≈ E_val(g^-)
(the first ≈ holds for small K; the second for large K)

take K = 1? then D_val^(n) = {(x_n, y_n)} and E_val^(n)(g_n^-) = err(g_n^-(x_n), y_n) = e_n

make e_n closer to E_out(g)?—average over all possible E_val^(n)

leave-one-out cross validation estimate:
E_loocv(H, A) = (1/N) Σ_{n=1}^{N} e_n = (1/N) Σ_{n=1}^{N} err(g_n^-(x_n), y_n)

hope: E_loocv(H, A) ≈ E_out(g)
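The estimate translates directly into code. Here is a minimal sketch of E_loocv for an arbitrary training routine; the helper names and the constant-prediction learner (which reappears in the Fun Time two slides below) are my own illustrative choices, not the lecture's.

```python
import numpy as np

def e_loocv(X, y, fit, err):
    """E_loocv(H, A) = (1/N) * sum_n err(g_n^-(x_n), y_n): leave one point out per round."""
    N = len(y)
    total = 0.0
    for n in range(N):
        keep = np.arange(N) != n            # D_n: everything except the n-th example
        g_minus = fit(X[keep], y[keep])     # g_n^- trained on N - 1 examples
        total += err(g_minus(X[n]), y[n])   # e_n: error on the held-out point
    return total / N

# constant-prediction learner: the optimal constant under squared error is the mean of y
fit_const = lambda X, y: (lambda x, c=float(np.mean(y)): c)
sq_err = lambda y_hat, y_true: (y_hat - y_true) ** 2

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 5.0, 7.0])
print(e_loocv(X, y, fit_const, sq_err))     # 14.0
```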

(17)

Validation / Leave-One-Out Cross Validation

Illustration of Leave-One-Out

(figure: three data points fit by a line, each point left out in turn, giving errors e_1, e_2, e_3)
E_loocv(linear) = (1/3)(e_1 + e_2 + e_3)

(figure: the same three points fit by a constant, each point left out in turn)
E_loocv(constant) = (1/3)(e_1 + e_2 + e_3)

which one would you choose?
m* = argmin_{1≤m≤M} ( E_m = E_loocv(H_m, A_m) )


(18)

Validation / Leave-One-Out Cross Validation

Theoretical Guarantee of Leave-One-Out Estimate

does E_loocv(H, A) say something about E_out(g)?
yes, for the average E_out on size-(N − 1) data: writing D as D_n ∪ {(x_n, y_n)},

E_D[ E_loocv(H, A) ] = E_D[ (1/N) Σ_{n=1}^{N} e_n ]
= (1/N) Σ_{n=1}^{N} E_D[ e_n ]
= (1/N) Σ_{n=1}^{N} E_{D_n} E_{(x_n, y_n)}[ err(g_n^-(x_n), y_n) ]
= (1/N) Σ_{n=1}^{N} E_{D_n}[ E_out(g_n^-) ]
= (1/N) Σ_{n=1}^{N} Ē_out(N − 1)
= Ē_out(N − 1)

expected E_loocv(H, A) says something about expected E_out(g^-)
—often called an 'almost unbiased estimate of E_out(g)'
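The near-unbiasedness is easy to witness by simulation. The sketch below (an illustrative assumption: y ∼ N(0, 1) with no x-dependence, mean predictor under squared error) compares the Monte-Carlo average of E_loocv with the closed-form expected E_out of a size-(N − 1) hypothesis, Var(mean) + Var(y) = 1/(N − 1) + 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 10, 20000

def loocv_mean(y):
    """E_loocv for the mean predictor under squared error."""
    return np.mean([(np.delete(y, n).mean() - y[n]) ** 2 for n in range(len(y))])

loocv_avg = np.mean([loocv_mean(rng.normal(size=N)) for _ in range(trials)])
print(loocv_avg, 1 / (N - 1) + 1)   # the two numbers should nearly agree (~1.111)
```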

(19)

Validation / Leave-One-Out Cross Validation

Leave-One-Out in Practice

(figures: handwritten-digit classification of '1' versus 'not 1' using average intensity and symmetry features; decision boundaries selected by E_in versus by E_loocv)

(figure: error versus number of features used, 5 to 20, with curves for E_out, E_cv, and E_in)

E_loocv much better than E_in


(20)

Validation / Leave-One-Out Cross Validation

Fun Time

Consider three examples (x_1, y_1), (x_2, y_2), (x_3, y_3) with y_1 = 1, y_2 = 5, y_3 = 7. Suppose we use E_loocv to estimate the performance of a learning algorithm that predicts with the average y value of its data set, the optimal constant prediction with respect to the squared error. What is E_loocv(squared error) of the algorithm?

1 0
2 56/9
3 60/9
4 14

Reference Answer: 4

This follows from a simple calculation: e_1 = (1 − 6)^2 = 25, e_2 = (5 − 4)^2 = 1, e_3 = (7 − 3)^2 = 16, and their average is 42/3 = 14.


(22)

Validation / V-Fold Cross Validation

Disadvantages of Leave-One-Out Estimate

Computation:
E_loocv(H, A) = (1/N) Σ_{n=1}^{N} e_n = (1/N) Σ_{n=1}^{N} err(g_n^-(x_n), y_n)
requires N 'additional' trainings per model, not always feasible in practice
(except in 'special cases' like the analytic solution for linear regression)

Stability: the variance of single-point estimates makes E_loocv jumpy

(figure: error versus number of features used, with a visibly jagged E_cv curve)

E_loocv: not often used practically
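The linear-regression 'special case' mentioned above uses a standard identity: with hat matrix H = X(X'X)^{-1}X', the leave-one-out error of point n is ((y_n − ŷ_n)/(1 − H_nn))^2, so no retraining is needed. A short sketch (the data is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X'X)^{-1} X'
resid = y - H @ y                       # in-sample residuals y - yhat
e = (resid / (1 - np.diag(H))) ** 2     # all N leave-one-out errors at once
print(np.mean(e))                       # E_loocv without N retrainings
```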

(23)

Validation / V-Fold Cross Validation

V-Fold Cross Validation

how to decrease the computation needed for cross validation?

• essence of leave-one-out cross validation: partition D into N parts, taking N − 1 for training and 1 for validation in turn
• V-fold cross-validation: random partition of D into V equal parts D_1, D_2, . . . , D_V; take V − 1 for training and 1 for validation in turn

(figure: for V = 10, D split into D_1, . . . , D_10 with one fold validating while the rest train)

E_cv(H, A) = (1/V) Σ_{v=1}^{V} E_val^(v)(g_v^-)

selection by E_cv:
m* = argmin_{1≤m≤M} ( E_m = E_cv(H_m, A_m) )

practical rule of thumb: V = 10
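A minimal sketch of E_cv and selection by it, reusing the same illustrative ridge candidates as the earlier validation snippet (the helpers are my assumptions, not lecture code):

```python
import numpy as np

def e_cv(X, y, fit, err, V, rng):
    """E_cv(H, A) = (1/V) * sum_v E_val^(v)(g_v^-) over a random V-way partition of D."""
    folds = np.array_split(rng.permutation(len(y)), V)
    total = 0.0
    for v in range(V):
        train = np.concatenate([f for u, f in enumerate(folds) if u != v])
        g_minus = fit(X[train], y[train])   # g_v^-: trained on the other V - 1 folds
        total += float(np.mean(err(g_minus(X[folds[v]]), y[folds[v]])))
    return total / V

def make_ridge(lam):
    """Candidate A_m: analytic ridge regression with regularization lam."""
    def fit(X, y):
        w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
        return lambda X_new: X_new @ w
    return fit

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.3 * rng.normal(size=100)
sq_err = lambda y_hat, y_true: (y_hat - y_true) ** 2
scores = {lam: e_cv(X, y, make_ridge(lam), sq_err, V=10, rng=rng)
          for lam in [0.0, 0.01, 1.0]}
lam_star = min(scores, key=scores.get)   # m* = argmin E_cv(H_m, A_m)
g = make_ridge(lam_star)(X, y)           # final g: retrain A_{m*} on the full D
```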


(24)

Validation / V-Fold Cross Validation

Final Words on Validation

'Selecting' Validation Tool
• V-fold is generally preferred over single validation if computation allows
• 5-fold or 10-fold generally works well: no need to trade V-fold for leave-one-out

Nature of Validation
• all training models: select among hypotheses
• all validation schemes: select among finalists
• all testing methods: just evaluate

validation is still more optimistic than testing;
do not fool yourself and others :-), report test result, not best validation result

(25)

Validation / V-Fold Cross Validation

Fun Time

For a learning model that takes N^2 seconds of training when using N examples, what is the total number of seconds needed to run 10-fold cross validation on 25 such models with different parameters to get the final g_{m*}?

1 (47/2)N^2
2 47N^2
3 (407/2)N^2
4 407N^2

Reference Answer: 3

Each fold trains on 9N/10 examples, taking (9N/10)^2 = (81/100)N^2 seconds, so getting all the E_cv needs (81/100)N^2 · 10 · 25 = (405/2)N^2 seconds. Then to get g_{m*}, we need another N^2 seconds of retraining on the full D. So in total we need (407/2)N^2 seconds.
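The same exact-arithmetic check applies here (illustrative, not from the slides):

```python
from fractions import Fraction

per_fold = Fraction(9, 10) ** 2   # each fold trains on 9N/10 examples: (81/100) N^2 seconds
total = per_fold * 10 * 25 + 1    # 10 folds x 25 models, plus the final retrain on all N
print(total)                      # 407/2, i.e. (407/2) N^2 seconds
```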



(27)

Validation / V-Fold Cross Validation

Summary

1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Lecture 14: Regularization
Lecture 15: Validation
• Model Selection Problem: dangerous by E_in and dishonest by E_test
• Validation: select with E_val(D_train) while returning A_{m*}(D)
• Leave-One-Out Cross Validation: huge computation for an almost unbiased estimate
• V-Fold Cross Validation: reasonable computation and performance

next: something 'up my sleeve'

