# Machine Learning Foundations (機器學習基石)


### Lecture 15: Validation

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

### National Taiwan University (國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 0/22

### Roadmap

4. How Can Machines Learn Better?

Lecture 14: Regularization — minimizes augmented error, where the added regularizer effectively limits model complexity

Lecture 15: Validation
- Model Selection Problem
- Validation
- Leave-One-Out Cross Validation
- V-Fold Cross Validation

## Model Selection Problem

### Even Just for Binary Classification . . .

- algorithm A ∈ {PLA, pocket, linear regression, logistic regression}
- × iterations T ∈ {100, 1000, 10000}
- × learning rate η ∈ {1, 0.01, 0.0001}
- × transform Φ ∈ {linear, quadratic, poly-10, Legendre-poly-10}
- × regularizer Ω(w) ∈ {L2 regularizer, L1 regularizer, symmetry regularizer}
- × regularization coefficient λ ∈ {0, 0.01, 1}

in addition to your favorite combination, may need to try other combinations to get a good g
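The combinatorics above can be made concrete with a short script (a sketch; the dictionary keys are illustrative labels, not course notation): even this modest grid already yields over a thousand candidate models.

```python
from itertools import product

# hypothetical grid mirroring the choices listed above
grid = {
    "algorithm": ["PLA", "pocket", "linear regression", "logistic regression"],
    "T": [100, 1000, 10000],
    "eta": [1, 0.01, 0.0001],
    "transform": ["linear", "quadratic", "poly-10", "Legendre-poly-10"],
    "regularizer": ["L2", "L1", "symmetry"],
    "lambda": [0, 0.01, 1],
}

combinations = list(product(*grid.values()))
print(len(combinations))  # 4 * 3 * 3 * 4 * 3 * 3 = 1296 candidate models
```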

### Model Selection Problem

- given: M models H_1, H_2, . . . , H_M, each with corresponding algorithm A_1, A_2, . . . , A_M
- goal: select H_{m*} such that g_{m*} = A_{m*}(D) is of low E_out(g_{m*})
- unknown E_out due to unknown P(x) & P(y|x), as always
- arguably the most important practical problem of ML

how to select? visually?
—no, remember Lecture 12? :-)

### Model Selection by Best E_in?

select by best E_in: m* = argmin_{1 ≤ m ≤ M} (E_m = E_in(A_m(D)))

- Φ_1126 always preferred over Φ_1; λ = 0 always preferred over λ = 0.1 —overfitting?
- if A_1 minimizes E_in over H_1 and A_2 minimizes E_in over H_2,
  ⇒ g_{m*} achieves minimal E_in over H_1 ∪ H_2
  ⇒ 'model selection + learning' pays d_VC(H_1 ∪ H_2) —bad generalization?

selecting by E_in is dangerous
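A small sketch makes the danger concrete: with nested models (here, polynomial fits of increasing degree on a toy dataset; the data and names are illustrative, not from the course), the model with the lowest E_in is essentially always the most complex one.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy noisy data; any generic dataset behaves the same way
x = np.linspace(-1, 1, 12)
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(12)

degrees = [1, 2, 3, 4, 5, 6]
e_in = []
for d in degrees:
    coef = np.polyfit(x, y, d)                      # least-squares fit of degree d
    e_in.append(np.mean((np.polyval(coef, x) - y) ** 2))

best = degrees[int(np.argmin(e_in))]
print(best)  # 6: selecting by E_in picks the most complex hypothesis set
```

Because the polynomial hypothesis sets are nested, E_in can only decrease as the degree grows, so argmin-by-E_in rewards complexity rather than generalization.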

### Model Selection by Best E_test?

select by best E_test, which is evaluated on a fresh D_test:
m* = argmin_{1 ≤ m ≤ M} (E_m = E_test(A_m(D)))

- generalization guarantee (finite-bin Hoeffding):
  E_out(g_{m*}) ≤ E_test(g_{m*}) + O(√(log M / N_test))
  —yes! strong guarantee :-)
- but where is D_test? your boss's safe, maybe? :-(

selecting by E_test is infeasible and cheating

### Comparison between E_in, E_test, and E_val

- E_in(g_m): calculated from D on hand, but 'contaminated' as D is also used by A_m to 'select' g_m
- E_test(g_m): 'clean' as D_test is never used for selection before, but infeasible as D_test sits in the boss's safe
- E_val(g_m): calculated from D_val ⊂ D on hand, and 'clean' if D_val is never used by A_m before

selecting by E_val: legal cheating :-)

### Fun Time

For X = R^d, consider two hypothesis sets, H^+ and H^−. The first hypothesis set contains all perceptrons with w_1 ≥ 0, and the second hypothesis set contains all perceptrons with w_1 ≤ 0. Denote g^+ and g^− as the minimum-E_in hypothesis in each hypothesis set, respectively. Which statement below is true?

1. If E_in(g^+) < E_in(g^−), then g^+ is the minimum-E_in hypothesis of all perceptrons in R^d.
2. If E_test(g^+) < E_test(g^−), then g^+ is the minimum-E_test hypothesis of all perceptrons in R^d.
3. The two hypothesis sets are disjoint.
4. None of the above.

Answer: 1. Note that the two hypothesis sets are not disjoint (sharing the 'w_1 = 0' perceptrons) but their union is all perceptrons, so the better of g^+ and g^− minimizes E_in over all perceptrons.

## Validation

### Validation Set D_val

E_in(h) is calculated from D of size N; E_val(h) is calculated from D_val of size K.

split D into D_train of size N − K and D_val of size K:
- g_m = A_m(D), trained on all of D
- g_m^− = A_m(D_train), trained without D_val

D_val ⊂ D: called the validation set
- to connect E_val with E_out: D_val iid ∼ P(x, y) ⇐= select K examples from D at random
- to make sure D_val is 'clean': feed only D_train to A_m for model selection

E_out(g_m^−) ≤ E_val(g_m^−) + O(√(log M / K))

### Model Selection by Best E_val

m* = argmin_{1 ≤ m ≤ M} (E_m = E_val(A_m(D_train)))

- generalization guarantee for all m:
  E_out(g_m^−) ≤ E_val(g_m^−) + O(√(log M / K))
- heuristic gain from training with N − K examples to training with N examples:
  g_{m*} = A_{m*}(D) usually better than g_{m*}^− = A_{m*}(D_train)
  —learning curve, remember? :-)

E_out(g_{m*}) ≤ E_out(g_{m*}^−) ≤ E_val(g_{m*}^−) + O(√(log M / K))
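The whole procedure—random split, select by E_val on models trained with D_train, then retrain the winner on all of D—fits in a few lines. A sketch on toy data; the polynomial 'models', the dataset, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy regression data; "models" = polynomial degrees
x = rng.uniform(-1, 1, 60)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(60)

K = len(x) // 5                      # rule of thumb: K = N/5
idx = rng.permutation(len(x))
val, train = idx[:K], idx[K:]        # random split: D_val and D_train

def e_val(degree):
    # g_m^- = A_m(D_train), evaluated on the clean D_val
    coef = np.polyfit(x[train], y[train], degree)
    return np.mean((np.polyval(coef, x[val]) - y[val]) ** 2)

degrees = [1, 2, 3, 5, 8]
best = min(degrees, key=e_val)       # m* = argmin E_val
g_star = np.polyfit(x, y, best)      # heuristic gain: retrain A_{m*} on all of D
print(best)
```

Note the last line: the selected model is retrained on the full data set, matching the heuristic-gain step above.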

### Validation in Practice

use validation to select between H_{Φ5} and H_{Φ10} (5th- versus 10th-order polynomial transform)

[figure: expected E_out versus validation set size K (about 5 to 25), with E_out between about 0.48 and 0.56, comparing the optimal choice, in-sample selection, and validation selection]

- in-sample: selection with E_in
- validation: selection with E_val —E_out(g_{m*}) ≤ E_out(g_{m̂}) indeed
- why is the sub-g (g_{m*}^−) sometimes worse than in-sample selection?

### The Dilemma about K

reasoning of validation: E_out(g) ≈ E_out(g^−) ≈ E_val(g^−)

- large K: every E_val ≈ E_out of its g^−, but all g^− much worse than g
- small K: every g^− ≈ g, but E_val far from E_out

[figure: expected E_out versus validation set size K (about 5 to 25, E_out about 0.48 to 0.56), with curves for optimal validation g_{m*}, in-sample g_{m̂}, and validation g_{m*}^−]

practical rule of thumb: K = N/5

### Fun Time

For a learning model that takes N^2 seconds of training when using N examples, what is the total number of seconds needed when running the whole validation procedure with K = N/5 on 25 such models with different parameters to get the final g_{m*}?

1. 6N^2
2. 17N^2
3. 25N^2
4. 26N^2

Answer: 2. To get all the g_m^−, each model trains on N − K = 4N/5 examples and takes (4N/5)^2 = (16/25)N^2 seconds, so the 25 models need (16/25)N^2 · 25 = 16N^2 seconds. Then to get g_{m*}, we need another N^2 seconds of retraining on all N examples. So in total we need 17N^2 seconds.
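The same bookkeeping in exact arithmetic, counting time in units of N^2 seconds:

```python
from fractions import Fraction

# cost model: training on n examples takes n^2 seconds; work with coefficients of N^2
N = Fraction(1)
K = N / 5
per_model = (N - K) ** 2           # (4/5)^2 = 16/25 per model, trained on D_train
total = 25 * per_model + N ** 2    # 25 candidate models + final retraining on all N
print(total)                       # 17, i.e. 17 N^2 seconds
```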

## Leave-One-Out Cross Validation

### Extreme Case: K = 1

reasoning of validation: E_out(g) ≈ E_out(g^−) ≈ E_val(g^−)

take K = 1? let D_val^(n) = {(x_n, y_n)} and
E_val^(n)(g_n^−) = err(g_n^−(x_n), y_n) = e_n

how to make e_n closer to E_out(g)? —average over all possible E_val^(n)

leave-one-out cross validation estimate:
E_loocv(H, A) = (1/N) Σ_{n=1}^{N} e_n = (1/N) Σ_{n=1}^{N} err(g_n^−(x_n), y_n)

hope: E_loocv(H, A) ≈ E_out(g)
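A direct implementation of the estimate (a sketch; the fit/predict/err interfaces and the toy numbers are illustrative, not from the course):

```python
import numpy as np

def e_loocv(x, y, fit, predict, err):
    """Leave-one-out CV estimate: average err of g_n^- on its held-out example."""
    n = len(x)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i                 # D minus the n-th example
        g_minus = fit(x[keep], y[keep])          # g_n^- trained on N-1 examples
        total += err(predict(g_minus, x[i]), y[i])
    return total / n

# demo with a constant (mean) predictor under squared error
x = np.array([0.0, 1.0, 2.0])
y = np.array([2.0, 4.0, 9.0])
est = e_loocv(
    x, y,
    fit=lambda xs, ys: ys.mean(),
    predict=lambda g, xi: g,
    err=lambda yhat, yi: (yhat - yi) ** 2,
)
print(est)  # (20.25 + 2.25 + 36) / 3 = 19.5
```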

### Illustration of Leave-One-Out

[figure: three data points (x, y) fit twice—left by a linear model, right by a constant model—each panel marking the leave-one-out errors e_1, e_2, e_3]

E_loocv(linear) = (1/3)(e_1 + e_2 + e_3) and E_loocv(constant) = (1/3)(e_1 + e_2 + e_3), each computed with its own model's errors—which one would you choose?

selection by E_loocv: m* = argmin_{1 ≤ m ≤ M} (E_m = E_loocv(H_m, A_m))

### Theoretical Guarantee of Leave-One-Out Estimate

does E_loocv(H, A) say something about E_out(g)? Take the expectation over all data sets D of size N:

E_D[E_loocv(H, A)]
= E_D (1/N) Σ_{n=1}^{N} e_n
= (1/N) Σ_{n=1}^{N} E_{D_n} E_{(x_n, y_n)} err(g_n^−(x_n), y_n)
= (1/N) Σ_{n=1}^{N} E_{D_n} E_out(g_n^−)
= (1/N) Σ_{n=1}^{N} Ē_out(N − 1)
= Ē_out(N − 1)

so E_loocv equals the expected E_out(g^−) of models trained on N − 1 examples
—often called an 'almost unbiased estimate of E_out(g)'

### Leave-One-Out in Practice

handwritten digit recognition ('1' versus 'not 1') using features such as average intensity and symmetry

[figure: decision boundaries when selecting the feature set by E_in versus by E_cv, plus a plot of error versus number of features used (about 5 to 20): E_in keeps decreasing with more features, while E_cv tracks E_out (about 0.01 to 0.03)]

selecting by E_cv: much better than selecting by E_in

### Fun Time

Consider three examples (x_1, y_1), (x_2, y_2), (x_3, y_3) with y_1 = 1, y_2 = 5, y_3 = 7. Suppose we use E_loocv to estimate the performance of a learning algorithm that predicts with the average y value of the data set—the optimal constant prediction with respect to the squared error. What is E_loocv (squared error) of the algorithm?

1. 0
2. 9/4
3. 14

Answer: 3. This is based on a simple calculation of e_1 = (1 − 6)^2 = 25, e_2 = (5 − 4)^2 = 1, e_3 = (7 − 3)^2 = 16, so E_loocv = (25 + 1 + 16)/3 = 14.

## V-Fold Cross Validation

### Disadvantages of Leave-One-Out Estimate

E_loocv(H, A) = (1/N) Σ_{n=1}^{N} err(g_n^−(x_n), y_n)

computation:
- N 'additional' trainings per model, not always feasible in practice
- except 'special cases' like the analytic solution for linear regression

stability:
- [figure: error versus number of features used, as before; the E_cv curve fluctuates noticeably because each e_n comes from a single example]

E_loocv: not often used practically
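The linear-regression 'special case' refers to a classical shortcut (a standard result, not spelled out on the slides): for least squares, every leave-one-out error comes from a single fit via the hat matrix H = X(XᵀX)⁻¹Xᵀ, as e_n = ((y_n − ŷ_n)/(1 − H_nn))^2. A sketch comparing the shortcut against N brute-force refits:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 30, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])  # with bias column
y = X @ rng.standard_normal(d + 1) + 0.1 * rng.standard_normal(N)

# one fit: hat matrix H = X (X^T X)^{-1} X^T, then per-example correction
H = X @ np.linalg.solve(X.T @ X, X.T)
residual = y - H @ y
e_loocv_fast = np.mean((residual / (1 - np.diag(H))) ** 2)

# brute force: N refits, each leaving one example out
errs = []
for n in range(N):
    keep = np.arange(N) != n
    w = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    errs.append((X[n] @ w - y[n]) ** 2)

print(np.isclose(e_loocv_fast, np.mean(errs)))  # True: both agree
```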

### V-Fold Cross Validation

how to decrease the computation needed for cross validation?

- essence of leave-one-out cross validation: partition D into N parts, taking N − 1 for training and 1 for validation orderly
- V-fold cross validation: random partition of D into V equal parts; take V − 1 parts for training and 1 for validation orderly

E_cv(H, A) = (1/V) Σ_{v=1}^{V} E_val^(v)(g_v^−)

selection by E_cv: m* = argmin_{1 ≤ m ≤ M} (E_m = E_cv(H_m, A_m))

practical rule of thumb: V = 10
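The scheme above can be sketched generically (toy data again; the fit/predict/err interfaces and all names are illustrative):

```python
import numpy as np

def e_cv(x, y, fit, predict, err, V=10, seed=0):
    """V-fold cross validation: random partition of D, average validation error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, V)                # V (nearly) equal parts
    total = 0.0
    for v in range(V):
        val = folds[v]
        train = np.concatenate(folds[:v] + folds[v + 1:])
        g = fit(x[train], y[train])               # g_v^- trained on V-1 parts
        total += err(predict(g, x[val]), y[val])  # E_val^(v)
    return total / V

# choose a polynomial degree by E_cv on toy data
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 50)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(50)

fit_deg = lambda d: (lambda xs, ys: np.polyfit(xs, ys, d))
sq = lambda yhat, yv: np.mean((yhat - yv) ** 2)

best = min([1, 3, 5, 8], key=lambda d: e_cv(x, y, fit_deg(d), np.polyval, sq))
print(best)
```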

### Final Words on Validation

- V-fold generally preferred over single validation if computation allows
- 5-fold or 10-fold generally works well: not necessary to trade V-fold for leave-one-out

nature of validation:
- all training models: select among hypotheses
- all validation schemes: select among finalists
- all testing methods: just evaluate

validation is still more optimistic than testing, so do not fool yourself and others:
report test result, not best validation result

### Fun Time

For a learning model that takes N^2 seconds of training when using N examples, what is the total number of seconds needed when running 10-fold cross validation on 25 such models with different parameters to get the final g_{m*}?

Answer: within 10-fold cross validation, each training uses 9N/10 examples and hence takes (9N/10)^2 = (81/100)N^2 seconds. To get all the E_cv values, we need (81/100)N^2 · 10 · 25 = 202.5N^2 seconds. Then to get g_{m*}, we need another N^2 seconds of retraining on all N examples. So in total we need 203.5N^2 (= 407N^2/2) seconds.
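As before, the arithmetic can be checked with exact fractions (time counted in units of N^2 seconds):

```python
from fractions import Fraction

# cost model: training on n examples takes n^2 seconds; coefficients of N^2
per_fit = Fraction(9, 10) ** 2     # each fold trains on 9N/10 examples
cv_total = per_fit * 10 * 25       # 10 folds for each of 25 models
total = cv_total + 1               # plus one final retraining on all N examples
print(total)                       # 407/2, i.e. 203.5 N^2 seconds
```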

### Summary

4. How Can Machines Learn Better?

Lecture 15: Validation
- Model Selection Problem: dangerous by E_in and dishonest by E_test
- Validation: select with E_val(A_m(D_train)) while returning A_{m*}(D)
- Leave-One-Out Cross Validation: huge computation for an almost unbiased estimate
- V-Fold Cross Validation: reasonable computation and performance

- next: something 'up my sleeve'
