Machine Learning Foundations (機器學習基石)
Lecture 15: Validation
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Lecture 14: Regularization
minimizes augmented error, where the added regularizer effectively limits model complexity

Lecture 15: Validation
• Model Selection Problem
• Validation
• Leave-One-Out Cross Validation
• V-Fold Cross Validation
Model Selection Problem

So Many Models Learned
Even just for binary classification, the choices multiply:
A ∈ {PLA, pocket, linear regression, logistic regression}
× T ∈ {100, 1000, 10000}
× η ∈ {1, 0.01, 0.0001}
× Φ ∈ {linear, quadratic, poly-10, Legendre-poly-10}
× Ω(w) ∈ {L2 regularizer, L1 regularizer, symmetry regularizer}
× λ ∈ {0, 0.01, 1}
× ...

in addition to your favorite combination, you may need to try other combinations to get a good g
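As a back-of-the-envelope illustration, the grid above can be enumerated in a few lines; this is a sketch with made-up list names, not code from the course:

```python
from itertools import product

algorithms = ["PLA", "pocket", "linear regression", "logistic regression"]
iterations = [100, 1000, 10000]             # T
learning_rates = [1, 0.01, 0.0001]          # eta
transforms = ["linear", "quadratic", "poly-10", "Legendre-poly-10"]
regularizers = ["L2", "L1", "symmetry"]     # Omega(w)
lambdas = [0, 0.01, 1]

# every combination is, in principle, a different model to select among
grid = list(product(algorithms, iterations, learning_rates,
                    transforms, regularizers, lambdas))
print(len(grid))  # 4 * 3 * 3 * 4 * 3 * 3 = 1296 candidate models
```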
Model Selection Problem
[figure: two learned boundaries, one from H1 and one from H2; which one do you prefer? :-)]
• given: M models H1, H2, ..., HM, each with corresponding algorithm A1, A2, ..., AM
• goal: select H_{m*} such that g_{m*} = A_{m*}(D) is of low E_out(g_{m*})
• unknown E_out due to unknown P(x) & P(y|x), as always :-)
• arguably the most important practical problem of ML

how to select? visually? no, remember Lecture 12? :-)
Model Selection by Best E_in
select by best E_in? m* = argmin_{1≤m≤M} (E_m = E_in(A_m(D)))
• Φ_1126 always more preferred over Φ_1; λ = 0 always more preferred over λ = 0.1: overfitting?
• if A1 minimizes E_in over H1 and A2 minimizes E_in over H2, then g_{m*} achieves minimal E_in over H1 ∪ H2, so 'model selection + learning' pays d_VC(H1 ∪ H2): bad generalization?

selecting by E_in is dangerous
Model Selection by Best E_test
select by best E_test, which is evaluated on a fresh D_test? m* = argmin_{1≤m≤M} (E_m = E_test(A_m(D)))
• generalization guarantee (finite-bin Hoeffding): E_out(g_{m*}) ≤ E_test(g_{m*}) + O(√(log M / N_test)), yes! strong guarantee :-)
• but where is D_test? your boss's safe, maybe? :-(

selecting by E_test is infeasible and cheating
Comparison between E_in and E_test

in-sample error E_in:
• calculated from D
• feasible: on hand
• 'contaminated': D also used by A_m to 'select' g_m

test error E_test:
• calculated from D_test
• infeasible: in boss's safe
• 'clean': D_test never used for selection before

something in between: E_val
• calculated from D_val ⊂ D
• feasible: on hand
• 'clean' if D_val never used by A_m before

selecting by E_val: legal cheating :-)
Fun Time
For X = R^d, consider two hypothesis sets, H+ and H−. The first hypothesis set contains all perceptrons with w1 ≥ 0, and the second hypothesis set contains all perceptrons with w1 ≤ 0. Denote g+ and g− as the minimum-E_in hypothesis in each hypothesis set, respectively. Which statement below is true?
1 If E_in(g+) < E_in(g−), then g+ is the minimum-E_in hypothesis of all perceptrons in R^d.
2 If E_test(g+) < E_test(g−), then g+ is the minimum-E_test hypothesis of all perceptrons in R^d.
3 The two hypothesis sets are disjoint.
4 None of the above

Reference Answer: 1
Note that the two hypothesis sets are not disjoint (sharing the 'w1 = 0' perceptrons) but their union is all perceptrons. Since the union covers every perceptron, the better of g+ and g− in terms of E_in is the global minimum-E_in perceptron in R^d.
Validation

Validation Set D_val
split D of size N into D_train of size N−K and D_val of size K:
• training on the full D gives g_m = A_m(D), whose error on D is E_in(h)
• training on D_train gives g_m^− = A_m(D_train), whose error on D_val is E_val(h)
• D_val ⊂ D: called the validation set, an 'on-hand' simulation of the test set
• to connect E_val with E_out: make D_val iid ∼ P(x, y) by selecting K examples from D at random
• to make sure D_val is 'clean': feed only D_train to A_m for model selection

E_out(g_m^−) ≤ E_val(g_m^−) + O(√(log M / K))
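A minimal sketch of the random split, assuming the data set sits in a NumPy array; the function name and seed are illustrative:

```python
import numpy as np

def train_val_split(D, K, seed=1126):
    """Randomly hold out K examples of D as D_val; the rest form D_train."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(D))
    return D[idx[K:]], D[idx[:K]]  # D_train (size N-K), D_val (size K)
```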
Model Selection by Best E_val
m* = argmin_{1≤m≤M} (E_m = E_val(A_m(D_train)))
• generalization guarantee for all m: E_out(g_m^−) ≤ E_val(g_m^−) + O(√(log M / K))
• heuristic gain from retraining on all N examples instead of N−K: E_out(g_{m*}) ≤ E_out(g_{m*}^−), where g_{m*} = A_{m*}(D) and g_{m*}^− = A_{m*}(D_train) (learning curve, remember? :-))

[figure: each of H_1, ..., H_M is trained by its A_m on D_train to get g_1^−, ..., g_M^−, scored on D_val as E_1, ..., E_M; pick the best (H_{m*}, E_{m*}), then retrain on the full D to obtain g_{m*}]

E_out(g_{m*}) ≤ E_out(g_{m*}^−) ≤ E_val(g_{m*}^−) + O(√(log M / K))
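The select-then-retrain procedure, as a sketch; the fit/error interface here is hypothetical, standing in for whatever each A_m actually does:

```python
def select_by_validation(models, D_train, D_val, D):
    """Pick the model with lowest E_val, then retrain the winner on all of D."""
    best_model, best_E_val = None, float("inf")
    for A_m in models:
        g_minus = A_m.fit(D_train)          # g_m^- = A_m(D_train)
        E_m = g_minus.error(D_val)          # E_m = E_val(g_m^-)
        if E_m < best_E_val:
            best_model, best_E_val = A_m, E_m
    return best_model.fit(D)                # g_{m*} = A_{m*}(D)
```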
Validation in Practice
use validation to select between H_{Φ5} and H_{Φ10}

[figure: expected E_out versus validation set size K (ticks 5, 15, 25; expected E_out roughly from 0.48 to 0.56), with curves for optimal validation g_{m*}, in-sample selection, and validation selection]

• in-sample: selection with E_in
• optimal: cheating-selection with E_test
• sub-g: selection with E_val, reporting g_{m*}^−
• full-g: selection with E_val, reporting g_{m*}
• E_out(g_{m*}) ≤ E_out(g_{m*}^−) indeed

why is sub-g sometimes worse than in-sample?
The Dilemma about K
reasoning of validation:
E_out(g) ≈ E_out(g^−) ≈ E_val(g^−)
  (small K)       (large K)
• large K: every E_val ≈ E_out, but all g_m^− much worse than g_m
• small K: every g_m^− ≈ g_m, but E_val far from E_out

[figure: the same expected-E_out-versus-K plot as before, showing the trade-off between the two regimes]

practical rule of thumb: K = N/5
Fun Time
For a learning model that takes N² seconds of training when using N examples, what is the total number of seconds needed when running the whole validation procedure with K = N/5 on 25 such models with different parameters to get the final g_{m*}?
1 6N²
2 17N²
3 25N²
4 26N²

Reference Answer: 2
To get all the g_m^−, each model trains on N − K = 4N/5 examples, taking (4N/5)² = (16/25)N² seconds; over 25 models that is (16/25)N² × 25 = 16N² seconds. Then to get g_{m*}, we need another N² seconds of retraining on the full D. So in total we need 17N² seconds.
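A quick check of this accounting with exact fractions; the helper below assumes only the timing model stated in the question (training on a fraction f of the N examples costs f²·N² seconds):

```python
from fractions import Fraction

def total_cost(num_models, folds, frac_train):
    """Cost in units of N^2: each model trains `folds` times on frac_train*N
    examples; the winner is then retrained once on all N examples."""
    return num_models * folds * frac_train ** 2 + 1

print(total_cost(25, 1, Fraction(4, 5)))   # 17, i.e. 17 N^2 seconds
```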
Leave-One-Out Cross Validation

Extreme Case: K = 1
reasoning of validation:
E_out(g) ≈ E_out(g^−) ≈ E_val(g^−)
  (small K)       (large K)
• take K = 1? then D_val^(n) = {(x_n, y_n)} and E_val^(n)(g_n^−) = err(g_n^−(x_n), y_n) = e_n
• make e_n closer to E_out(g)? average over all possible E_val^(n)
• leave-one-out cross validation estimate:
E_loocv(H, A) = (1/N) Σ_{n=1}^{N} e_n = (1/N) Σ_{n=1}^{N} err(g_n^−(x_n), y_n)

hope: E_loocv(H, A) ≈ E_out(g)
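In code, the leave-one-out estimate is a loop over held-out points; a sketch under the same hypothetical fit/err interface as the earlier snippets:

```python
def E_loocv(A, D):
    """Average leave-one-out error of algorithm A on data set D, a list of (x, y)."""
    total = 0.0
    for n in range(len(D)):
        D_minus = D[:n] + D[n + 1:]        # leave example n out
        g_minus = A.fit(D_minus)           # g_n^-
        x_n, y_n = D[n]
        total += g_minus.err(x_n, y_n)     # e_n = err(g_n^-(x_n), y_n)
    return total / len(D)
```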
Illustration of Leave-One-Out
[figure: three data points in the (x, y) plane fitted by a linear model, one panel per held-out point measuring e1, e2, e3; E_loocv(linear) = (1/3)(e1 + e2 + e3)]
[figure: the same three points fitted by a constant model, again measuring e1, e2, e3; E_loocv(constant) = (1/3)(e1 + e2 + e3)]

which one would you choose?
m* = argmin_{1≤m≤M} (E_m = E_loocv(H_m, A_m))
Theoretical Guarantee of Leave-One-Out Estimate
does E_loocv(H, A) say something about E_out(g)? yes, for the average E_out on size-(N−1) data:

E_D[E_loocv(H, A)] = E_D[(1/N) Σ_{n=1}^{N} e_n]
                   = (1/N) Σ_{n=1}^{N} E_D[e_n]
                   = (1/N) Σ_{n=1}^{N} E_{D_n} E_{(x_n, y_n)}[err(g_n^−(x_n), y_n)]
                   = (1/N) Σ_{n=1}^{N} E_{D_n}[E_out(g_n^−)]
                   = (1/N) Σ_{n=1}^{N} Ē_out(N − 1)
                   = Ē_out(N − 1)

expected E_loocv(H, A) says something about expected E_out(g^−): often called an 'almost unbiased estimate of E_out(g)'
Leave-One-Out in Practice
[figure: handwritten-digit classification ('1' versus 'not 1') on average intensity and symmetry features; decision boundaries when selecting by E_in versus by E_loocv]
[figure: error versus number of features used (ticks 5 to 20; error roughly from 0.01 to 0.03), with curves for E_out, E_cv, and E_in]

E_loocv much better than E_in
Fun Time
Consider three examples (x1, y1), (x2, y2), (x3, y3) with y1 = 1, y2 = 5, y3 = 7. We use E_loocv to estimate the performance of a learning algorithm that predicts with the average y value of its data set, the optimal constant prediction with respect to the squared error. What is E_loocv (squared error) of the algorithm?
1 0
2 56/9
3 60/9
4 14

Reference Answer: 4
This is based on a simple calculation: e1 = (1 − 6)² = 25, e2 = (5 − 4)² = 1, e3 = (7 − 3)² = 16, so E_loocv = (25 + 1 + 16)/3 = 14.
V-Fold Cross Validation

Disadvantages of Leave-One-Out Estimate
Computation:
E_loocv(H, A) = (1/N) Σ_{n=1}^{N} e_n = (1/N) Σ_{n=1}^{N} err(g_n^−(x_n), y_n)
• N 'additional' training runs per model, not always feasible in practice
• except in 'special cases' like the analytic solution for linear regression

Stability: suffers from the variance of single-point estimates
[figure: the error-versus-number-of-features plot from before, with curves for E_out, E_cv, E_in, illustrating the fluctuation of E_cv]

E_loocv: not often used practically
V-Fold Cross Validation
how to decrease the computation needed for cross validation?
• essence of leave-one-out cross validation: partition D into N parts, taking N−1 for training and 1 for validation, in turn
• V-fold cross validation: randomly partition D into V equal parts D1, D2, ..., D_V (e.g. D1 through D10 for V = 10), taking V−1 parts for training and 1 for validation, in turn

E_cv(H, A) = (1/V) Σ_{v=1}^{V} E_val^(v)(g_v^−)

• selection by E_cv: m* = argmin_{1≤m≤M} (E_m = E_cv(H_m, A_m))

practical rule of thumb: V = 10
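A sketch of computing E_cv with a random partition via NumPy; the fit/error interface remains hypothetical:

```python
import numpy as np

def E_cv(A, D, V=10, seed=1126):
    """V-fold cross validation error of algorithm A on data set D."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(D))
    folds = np.array_split(idx, V)          # random partition into V parts
    errors = []
    for v in range(V):
        train_idx = np.concatenate([folds[u] for u in range(V) if u != v])
        g_minus = A.fit(D[train_idx])               # train on V-1 parts
        errors.append(g_minus.error(D[folds[v]]))   # validate on the held-out part
    return float(np.mean(errors))
```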
Final Words on Validation
'Selecting' Validation Tool
• V-Fold generally preferred over single validation if computation allows
• 5-Fold or 10-Fold generally works well: not necessary to trade V-Fold for Leave-One-Out

Nature of Validation
• all training models: select among hypotheses
• all validation schemes: select among finalists
• all testing methods: just evaluate
• validation is still more optimistic than testing

do not fool yourself and others :-): report the test result, not the best validation result
Fun Time
For a learning model that takes N² seconds of training when using N examples, what is the total number of seconds needed when running 10-fold cross validation on 25 such models with different parameters to get the final g_{m*}?
1 (47/2)N²
2 47N²
3 (407/2)N²
4 407N²

Reference Answer: 3
To get all the E_cv values, each fold trains on (9/10)N examples, taking (81/100)N² seconds; over 10 folds and 25 models that is (81/100)N² × 10 × 25 = (405/2)N² seconds. Then to get g_{m*}, we need another N² seconds of retraining on the full D. So in total we need (407/2)N² seconds.
Summary
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Lecture 14: Regularization
Lecture 15: Validation
• Model Selection Problem: dangerous by E_in and dishonest by E_test
• Validation: select with E_val(D_train) while returning A_{m*}(D)
• Leave-One-Out Cross Validation: huge computation for an almost unbiased estimate
• V-Fold Cross Validation: reasonable computation and performance

• next: something 'up my sleeve'