### Machine Learning Foundations (機器學習基石)

### Lecture 15: Validation

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering,
National Taiwan University (國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 0/22

### Roadmap

1. When Can Machines Learn?
2. Why Can Machines Learn?
3. How Can Machines Learn?
4. How Can Machines Learn **Better?**

### Lecture 14: Regularization

minimizes **augmented error**, where the added **regularizer** effectively **limits model complexity**

### Lecture 15: Validation

- Model Selection Problem
- Validation
- Leave-One-Out Cross Validation
- V-Fold Cross Validation

### So Many Models Learned

Even just for binary classification...

$\mathcal{A} \in$ {PLA, pocket, linear regression, logistic regression}
× $T \in$ {100, 1000, 10000}
× $\eta \in$ {1, 0.01, 0.0001}
× $\Phi \in$ {linear, quadratic, poly-10, Legendre-poly-10}
× $\Omega(\mathbf{w}) \in$ {L2 regularizer, L1 regularizer, symmetry regularizer}
× $\lambda \in$ {0, 0.01, 1}

in addition to your **favorite** combination, you may need to try other combinations to get a good $g$

### Model Selection Problem

**which one do you prefer? :-)**

- given: $M$ models $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_M$, each with corresponding algorithm $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_M$
- goal: select $\mathcal{H}_{m^*}$ such that $g_{m^*} = \mathcal{A}_{m^*}(\mathcal{D})$ is of low $E_{\text{out}}(g_{m^*})$
- unknown $E_{\text{out}}$ due to unknown $P(\mathbf{x})$ & $P(y|\mathbf{x})$, as always :-)
- arguably the **most important** practical problem of ML

how to select? **visually?** no, remember Lecture 12? :-)

### Model Selection by Best $E_{\text{in}}$

select by best $E_{\text{in}}$?

$$m^* = \mathop{\arg\min}_{1 \le m \le M} \Big( E_m = E_{\text{in}}\big(\mathcal{A}_m(\mathcal{D})\big) \Big)$$

- $\Phi_{1126}$ always more preferred over $\Phi_1$; $\lambda = 0$ always more preferred over $\lambda = 0.1$ (overfitting?)
- if $\mathcal{A}_1$ minimizes $E_{\text{in}}$ over $\mathcal{H}_1$ and $\mathcal{A}_2$ minimizes $E_{\text{in}}$ over $\mathcal{H}_2$, then $g_{m^*}$ achieves minimal $E_{\text{in}}$ over $\mathcal{H}_1 \cup \mathcal{H}_2$, so 'model selection + learning' pays $d_{\text{VC}}(\mathcal{H}_1 \cup \mathcal{H}_2)$ (bad generalization?)

selecting by $E_{\text{in}}$ is **dangerous**

### Model Selection by Best $E_{\text{test}}$

select by best $E_{\text{test}}$, which is evaluated on a fresh $\mathcal{D}_{\text{test}}$?

$$m^* = \mathop{\arg\min}_{1 \le m \le M} \Big( E_m = E_{\text{test}}\big(\mathcal{A}_m(\mathcal{D})\big) \Big)$$

- generalization guarantee (finite-bin Hoeffding): $E_{\text{out}}(g_{m^*}) \le E_{\text{test}}(g_{m^*}) + O\left(\sqrt{\tfrac{\log M}{N_{\text{test}}}}\right)$, yes! a strong guarantee :-)
- but where is $\mathcal{D}_{\text{test}}$? your boss's safe, maybe? :-(

selecting by $E_{\text{test}}$ is **infeasible** and **cheating**

### Comparison between $E_{\text{in}}$ and $E_{\text{test}}$

| | in-sample error $E_{\text{in}}$ | test error $E_{\text{test}}$ | something in between: $E_{\text{val}}$ |
|---|---|---|---|
| calculated from | $\mathcal{D}$ | $\mathcal{D}_{\text{test}}$ | $\mathcal{D}_{\text{val}} \subset \mathcal{D}$ |
| feasibility | **feasible**, on hand | **infeasible**, in boss's safe | **feasible**, on hand |
| cleanliness | 'contaminated', as $\mathcal{D}$ also used by $\mathcal{A}_m$ to 'select' $g_m$ | 'clean', as $\mathcal{D}_{\text{test}}$ never used for selection before | 'clean' **if** $\mathcal{D}_{\text{val}}$ never used by $\mathcal{A}_m$ before |

selecting by $E_{\text{val}}$: **legal cheating :-)**

### Fun Time

For $\mathcal{X} = \mathbb{R}^d$, consider two hypothesis sets, $\mathcal{H}^+$ and $\mathcal{H}^-$. The first hypothesis set contains all perceptrons with $w_1 \ge 0$, and the second hypothesis set contains all perceptrons with $w_1 \le 0$. Denote $g_+$ and $g_-$ as the minimum-$E_{\text{in}}$ hypothesis in each hypothesis set, respectively. Which statement below is true?

1. If $E_{\text{in}}(g_+) < E_{\text{in}}(g_-)$, then $g_+$ is the minimum-$E_{\text{in}}$ hypothesis of all perceptrons in $\mathbb{R}^d$.
2. If $E_{\text{test}}(g_+) < E_{\text{test}}(g_-)$, then $g_+$ is the minimum-$E_{\text{test}}$ hypothesis of all perceptrons in $\mathbb{R}^d$.
3. The two hypothesis sets are disjoint.
4. None of the above.

**Reference Answer: 1**

Note that the two hypothesis sets are not disjoint (they share the '$w_1 = 0$' perceptrons) but their union is all perceptrons.

### Validation Set $\mathcal{D}_{\text{val}}$

$$\underbrace{\mathcal{D}}_{\text{size } N} \;\longrightarrow\; \underbrace{\mathcal{D}_{\text{train}}}_{\text{size } N-K} \;\cup\; \underbrace{\mathcal{D}_{\text{val}}}_{\text{size } K}$$

$$g_m = \mathcal{A}_m(\mathcal{D}), \qquad g_m^- = \mathcal{A}_m(\mathcal{D}_{\text{train}})$$

with $E_{\text{in}}(h)$ computed on $\mathcal{D}$ and $E_{\text{val}}(h)$ computed on $\mathcal{D}_{\text{val}}$.

- $\mathcal{D}_{\text{val}} \subset \mathcal{D}$: called the **validation set**, an 'on-hand' simulation of the test set
- to connect $E_{\text{val}}$ with $E_{\text{out}}$: $\mathcal{D}_{\text{val}} \overset{\text{iid}}{\sim} P(\mathbf{x}, y)$, achieved by selecting $K$ examples from $\mathcal{D}$ at random
- to make sure $\mathcal{D}_{\text{val}}$ is 'clean': feed only $\mathcal{D}_{\text{train}}$ to $\mathcal{A}_m$ for model selection

$$E_{\text{out}}(g_m^-) \le E_{\text{val}}(g_m^-) + O\left(\sqrt{\tfrac{\log M}{K}}\right)$$
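The random construction of $\mathcal{D}_{\text{val}}$ above can be sketched in a few lines of Python (my own illustration, not from the lecture; the function name is made up):

```python
import random

def split_train_val(D, K, seed=1126):
    """Randomly split D (size N) into D_train (size N - K) and D_val (size K)."""
    rng = random.Random(seed)           # fixed seed keeps the sketch reproducible
    shuffled = list(D)
    rng.shuffle(shuffled)               # iid choice of the K validation examples
    return shuffled[K:], shuffled[:K]   # D_train, D_val

# toy data set of (x, y) pairs
D = [(n, 2 * n) for n in range(10)]
D_train, D_val = split_train_val(D, K=2)
```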

### Model Selection by Best $E_{\text{val}}$

$$m^* = \mathop{\arg\min}_{1 \le m \le M} \Big( E_m = E_{\text{val}}\big(\mathcal{A}_m(\mathcal{D}_{\text{train}})\big) \Big)$$

- generalization guarantee for all $m$: $E_{\text{out}}(g_m^-) \le E_{\text{val}}(g_m^-) + O\left(\sqrt{\tfrac{\log M}{K}}\right)$
- heuristic gain from $N-K$ to $N$ examples: $E_{\text{out}}\Big(\underbrace{g_{m^*}}_{\mathcal{A}_{m^*}(\mathcal{D})}\Big) \le E_{\text{out}}\Big(\underbrace{g_{m^*}^-}_{\mathcal{A}_{m^*}(\mathcal{D}_{\text{train}})}\Big)$ (learning curve, remember? :-))

[figure: train $g_1^-, g_2^-, \ldots, g_M^-$ from $\mathcal{H}_1, \ldots, \mathcal{H}_M$ on $\mathcal{D}_{\text{train}}$, compute $E_1, \ldots, E_M$ on $\mathcal{D}_{\text{val}}$, pick the best $(\mathcal{H}_{m^*}, E_{m^*})$, then retrain on the full $\mathcal{D}$ to get $g_{m^*}$]

$$E_{\text{out}}(g_{m^*}) \le E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\left(\sqrt{\tfrac{\log M}{K}}\right)$$
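The whole select-then-retrain procedure can be sketched as follows (a minimal Python illustration of my own, assuming each 'model' is a pair of train and error functions; the toy mean/median algorithms are mine, not the lecture's):

```python
import statistics

def select_by_Eval(models, D_train, D_val, D):
    """models: list of (train, err) pairs, where train(data) returns a hypothesis g
    and err(g, data) returns its error on data. Implements
    m* = argmin_m E_val(A_m(D_train)), then returns g_{m*} = A_{m*}(D)."""
    g_minus = [train(D_train) for train, _ in models]              # g_m^-
    E = [err(g, D_val) for g, (_, err) in zip(g_minus, models)]    # E_m
    m_star = min(range(len(models)), key=E.__getitem__)
    train_star, _ = models[m_star]
    return train_star(D)                  # heuristic gain: retrain on the full D

# two toy 'algorithms': predict the mean y, or the median y; squared error
def mean_model(data):
    c = statistics.mean(y for _, y in data)
    return lambda x: c

def median_model(data):
    c = statistics.median(y for _, y in data)
    return lambda x: c

def sq_err(g, data):
    return sum((g(x) - y) ** 2 for x, y in data) / len(data)

models = [(mean_model, sq_err), (median_model, sq_err)]
D_train = [(0, 1), (1, 1), (2, 1), (3, 100)]   # one outlier label
D_val = [(4, 1)]
g = select_by_Eval(models, D_train, D_val, D_train + D_val)
```

On this toy data the outlier inflates the mean, so validation picks the median model, which is then retrained on all five examples.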


### Validation in Practice

use validation to select between $\mathcal{H}_{\Phi_5}$ and $\mathcal{H}_{\Phi_{10}}$

[figure: expected $E_{\text{out}}$ versus validation set size $K$ for the four selection strategies below]

- in-sample: selection with $E_{\text{in}}$
- optimal: cheating-selection with $E_{\text{test}}$
- sub-g: selection with $E_{\text{val}}$, reporting $g_{m^*}^-$
- full-g: selection with $E_{\text{val}}$, reporting $g_{m^*}$, and indeed $E_{\text{out}}(g_{m^*}) \le E_{\text{out}}(g_{m^*}^-)$

why is **sub-g** sometimes worse than in-sample?

### The Dilemma about K

reasoning of validation:

$$E_{\text{out}}(g) \;\underset{\text{(small } K\text{)}}{\approx}\; E_{\text{out}}(g^-) \;\underset{\text{(large } K\text{)}}{\approx}\; E_{\text{val}}(g^-)$$

- large $K$: every $E_{\text{val}} \approx E_{\text{out}}$, but all $g_m^-$ much worse than $g_m$
- small $K$: every $g_m^- \approx g_m$, but $E_{\text{val}}$ far from $E_{\text{out}}$

[figure: the same expected $E_{\text{out}}$ versus $K$ curves as in the previous slide]

practical rule of thumb: $K = \frac{N}{5}$

### Fun Time

For a learning model that takes $N^2$ seconds of training when using $N$ examples, what is the total amount of seconds needed when running the whole validation procedure with $K = \frac{N}{5}$ on 25 such models with different parameters to get the final $g_{m^*}$?

1. $6N^2$
2. $17N^2$
3. $25N^2$
4. $26N^2$

**Reference Answer: 2**

To get all the $g_m^-$, we need $\frac{16}{25}N^2 \cdot 25$ seconds. Then to get $g_{m^*}$, we need another $N^2$ seconds. So in total we need $17N^2$ seconds.
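The counting can be double-checked with exact fractions (my own check, not part of the lecture):

```python
from fractions import Fraction

# training cost is N^2 seconds; with K = N/5, each of the 25 models
# trains on N - K = 4N/5 examples, costing (4/5)^2 N^2 = (16/25) N^2
per_model = Fraction(4, 5) ** 2
total = 25 * per_model + 1      # 25 trainings on D_train + one final training on D
print(total)                    # 17, i.e. 17 N^2 seconds
```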

### Extreme Case: K = 1

reasoning of validation:

$$E_{\text{out}}(g) \;\underset{\text{(small } K\text{)}}{\approx}\; E_{\text{out}}(g^-) \;\underset{\text{(large } K\text{)}}{\approx}\; E_{\text{val}}(g^-)$$

- take $K = 1$? Then $\mathcal{D}_{\text{val}}^{(n)} = \{(\mathbf{x}_n, y_n)\}$ and $E_{\text{val}}^{(n)}(g_n^-) = \text{err}\big(g_n^-(\mathbf{x}_n), y_n\big) = e_n$
- make $e_n$ closer to $E_{\text{out}}(g)$? average over all possible $E_{\text{val}}^{(n)}$
- **leave-one-out cross validation** estimate:

$$E_{\text{loocv}}(\mathcal{H}, \mathcal{A}) = \frac{1}{N} \sum_{n=1}^{N} e_n = \frac{1}{N} \sum_{n=1}^{N} \text{err}\big(g_n^-(\mathbf{x}_n), y_n\big)$$

hope: $E_{\text{loocv}}(\mathcal{H}, \mathcal{A}) \approx E_{\text{out}}(g)$
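The estimate translates directly into code; below is a minimal sketch of my own (the helper names are made up), with a toy constant-prediction algorithm under squared error:

```python
def E_loocv(D, train, err):
    """E_loocv(H, A) = (1/N) sum_n err(g_n^-(x_n), y_n), where g_n^- is
    trained on D with example n left out."""
    N = len(D)
    total = 0.0
    for n in range(N):
        g = train(D[:n] + D[n + 1:])     # g_n^- : leave (x_n, y_n) out
        x_n, y_n = D[n]
        total += err(g(x_n), y_n)        # e_n
    return total / N

# toy 'algorithm': constant mean-of-y prediction, with squared error
train_const = lambda data: (lambda x, c=sum(y for _, y in data) / len(data): c)
sq = lambda y_hat, y: (y_hat - y) ** 2
```

For example, on the three points $(0,0), (1,3), (2,3)$ the leave-one-out errors are $9, 2.25, 2.25$, averaging to $4.5$.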
### Illustration of Leave-One-Out

[figure: three data points fitted by a linear model and by a constant model, each with the three leave-one-out errors $e_1, e_2, e_3$ marked]

$$E_{\text{loocv}}(\text{linear}) = \tfrac{1}{3}(e_1 + e_2 + e_3), \qquad E_{\text{loocv}}(\text{constant}) = \tfrac{1}{3}(e_1 + e_2 + e_3)$$

which one would you choose?

$$m^* = \mathop{\arg\min}_{1 \le m \le M} \Big( E_m = E_{\text{loocv}}(\mathcal{H}_m, \mathcal{A}_m) \Big)$$

### Theoretical Guarantee of Leave-One-Out Estimate

does $E_{\text{loocv}}(\mathcal{H}, \mathcal{A})$ say something about $E_{\text{out}}(g)$?
**yes, for the average $E_{\text{out}}$ on size-$(N-1)$ data:**

$$
\mathop{\mathbb{E}}_{\mathcal{D}} E_{\text{loocv}}(\mathcal{H}, \mathcal{A})
= \mathop{\mathbb{E}}_{\mathcal{D}} \frac{1}{N} \sum_{n=1}^{N} e_n
= \frac{1}{N} \sum_{n=1}^{N} \mathop{\mathbb{E}}_{\mathcal{D}} e_n
= \frac{1}{N} \sum_{n=1}^{N} \mathop{\mathbb{E}}_{\mathcal{D}_n} \mathop{\mathbb{E}}_{(\mathbf{x}_n, y_n)} \text{err}\big(g_n^-(\mathbf{x}_n), y_n\big)
$$

$$
= \frac{1}{N} \sum_{n=1}^{N} \mathop{\mathbb{E}}_{\mathcal{D}_n} E_{\text{out}}(g_n^-)
= \frac{1}{N} \sum_{n=1}^{N} \overline{E_{\text{out}}}(N-1)
= \overline{E_{\text{out}}}(N-1)
$$

expected $E_{\text{loocv}}(\mathcal{H}, \mathcal{A})$ says something about expected $E_{\text{out}}(g^-)$, often called an 'almost unbiased estimate of $E_{\text{out}}(g)$'
### Leave-One-Out in Practice

[figure: '1' versus 'not 1' digit classification with average intensity and symmetry features: the boundary selected by $E_{\text{in}}$ versus the one selected by $E_{\text{loocv}}$, and the curves of $E_{\text{out}}$, $E_{\text{cv}}$, $E_{\text{in}}$ against the number of features used]

$E_{\text{loocv}}$ much better than $E_{\text{in}}$

### Fun Time

Consider three examples $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), (\mathbf{x}_3, y_3)$ with $y_1 = 1$, $y_2 = 5$, $y_3 = 7$. If we use $E_{\text{loocv}}$ to estimate the performance of a learning algorithm that predicts with the average $y$ value of the data set (the optimal constant prediction with respect to the squared error), what is $E_{\text{loocv}}$ (squared error) of the algorithm?

1. $0$
2. $\frac{56}{9}$
3. $\frac{60}{9}$
4. $14$

**Reference Answer: 4**

This is based on a simple calculation of $e_1 = (1-6)^2$, $e_2 = (5-4)^2$, $e_3 = (7-3)^2$, whose average is $\frac{25 + 1 + 16}{3} = 14$.
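The calculation can be replayed in a few lines (my own check, not part of the lecture):

```python
ys = [1, 5, 7]
es = []
for n, y in enumerate(ys):
    rest = ys[:n] + ys[n + 1:]
    g = sum(rest) / len(rest)       # constant prediction: mean of the other two
    es.append((g - y) ** 2)         # e_n
print(es, sum(es) / len(es))        # [25.0, 1.0, 16.0] 14.0
```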

### Disadvantages of Leave-One-Out Estimate

**Computation**

$$E_{\text{loocv}}(\mathcal{H}, \mathcal{A}) = \frac{1}{N} \sum_{n=1}^{N} e_n = \frac{1}{N} \sum_{n=1}^{N} \text{err}\big(g_n^-(\mathbf{x}_n), y_n\big)$$

- $N$ 'additional' trainings per model, not always feasible in practice
- except in 'special cases' like the analytic solution for linear regression

**Stability** (due to the variance of single-point estimates)

[figure: the $E_{\text{cv}}$ curve against the number of features used fluctuates noticeably around $E_{\text{out}}$]

$E_{\text{loocv}}$: not often used practically
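The linear-regression special case refers to the well-known closed form for leave-one-out residuals: with hat matrix $H = X(X^\top X)^{-1}X^\top$, one can show $e_n = \big((y_n - \hat{y}_n)/(1 - H_{nn})\big)^2$, so the $N$ retrainings are avoided. A numpy sketch of my own (not the lecture's code), checked against the explicit loop:

```python
import numpy as np

def loocv_linreg(X, y):
    """Analytic LOOCV for least-squares linear regression (squared error):
    e_n = ((y_n - yhat_n) / (1 - H_nn))^2 with hat matrix H."""
    H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix X (X^T X)^{-1} X^T
    residual = y - H @ y
    e = (residual / (1.0 - np.diag(H))) ** 2
    return float(e.mean())

# sanity check against the explicit leave-one-out loop
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=6)])
y = rng.normal(size=6)
fast = loocv_linreg(X, y)
slow = 0.0
for n in range(6):
    keep = np.arange(6) != n
    w, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    slow += float((X[n] @ w - y[n]) ** 2)
slow /= 6
```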
### V-fold Cross Validation

how to **decrease the computation needed** for cross validation?

- essence of leave-one-out cross validation: partition $\mathcal{D}$ into $N$ parts, taking $N-1$ for training and $1$ for validation orderly
- $V$-fold cross validation: random partition of $\mathcal{D}$ **into $V$ equal parts** $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_V$; **take $V-1$ for training and $1$ for validation orderly**

$$E_{\text{cv}}(\mathcal{H}, \mathcal{A}) = \frac{1}{V} \sum_{v=1}^{V} E_{\text{val}}^{(v)}(g_v^-)$$

- selection by $E_{\text{cv}}$: $m^* = \mathop{\arg\min}_{1 \le m \le M} \Big( E_m = E_{\text{cv}}(\mathcal{H}_m, \mathcal{A}_m) \Big)$

practical rule of thumb: $V = 10$
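A minimal sketch of $E_{\text{cv}}$ in Python (my own illustration, not the lecture's code; the partition into $V$ near-equal parts uses simple striding after a random shuffle):

```python
import random

def E_cv(D, train, err, V=10, seed=1126):
    """V-fold cross validation: random partition of D into V near-equal parts;
    train on V-1 parts, validate on the held-out part, average the V errors."""
    rng = random.Random(seed)
    shuffled = list(D)
    rng.shuffle(shuffled)
    folds = [shuffled[v::V] for v in range(V)]            # V near-equal parts
    total = 0.0
    for v in range(V):
        held_in = [ex for u in range(V) if u != v for ex in folds[u]]
        g = train(held_in)                                # g_v^-
        total += err(g, folds[v])                         # E_val^{(v)}(g_v^-)
    return total / V

# toy model: constant mean-of-y prediction with mean squared error
train_mean = lambda data: (lambda x, c=sum(y for _, y in data) / len(data): c)
mse = lambda g, data: sum((g(x) - y) ** 2 for x, y in data) / len(data)
```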


### Final Words on Validation

**'Selecting' a Validation Tool**

- **$V$-fold** generally preferred over single validation if computation allows
- **5-fold or 10-fold** generally works well: not necessary to trade $V$-fold for leave-one-out

**Nature of Validation**

- all training models: select among hypotheses
- all validation schemes: select among finalists
- all testing methods: just evaluate

validation is still **more optimistic than testing**; do not fool yourself and others :-), **report the test result, not the best validation result**

### Fun Time

For a learning model that takes $N^2$ seconds of training when using $N$ examples, what is the total amount of seconds needed when running 10-fold cross validation on 25 such models with different parameters to get the final $g_{m^*}$?

1. $\frac{47}{2}N^2$
2. $47N^2$
3. $\frac{407}{2}N^2$
4. $407N^2$

**Reference Answer: 3**

To get all the $E_{\text{cv}}$, we need $\frac{81}{100}N^2 \cdot 10 \cdot 25$ seconds. Then to get $g_{m^*}$, we need another $N^2$ seconds. So in total we need $\frac{407}{2}N^2$ seconds.
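Again the counting can be double-checked with exact fractions (my own check):

```python
from fractions import Fraction

# each fold trains on 9N/10 examples: (9/10)^2 N^2 seconds per training,
# repeated for 10 folds and 25 models, plus one final training on all N examples
total = Fraction(9, 10) ** 2 * 10 * 25 + 1
print(total)        # 407/2, matching (407/2) N^2 seconds
```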

### Summary

1. When Can Machines Learn?
2. Why Can Machines Learn?
3. How Can Machines Learn?
4. How Can Machines Learn **Better?**

### Lecture 14: Regularization

### Lecture 15: Validation

- Model Selection Problem: **dangerous by $E_{\text{in}}$ and dishonest by $E_{\text{test}}$**
- Validation: **select with $E_{\text{val}}(\mathcal{D}_{\text{train}})$ while returning $\mathcal{A}_{m^*}(\mathcal{D})$**
- Leave-One-Out Cross Validation: **huge computation for an almost unbiased estimate**
- V-Fold Cross Validation: **reasonable computation and performance**
- **next: something 'up my sleeve'**