Machine Learning Foundations (機器學習基石)
Lecture 15: Validation
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Lecture 14: Regularization
minimizes augmented error, where the added regularizer effectively limits model complexity

Lecture 15: Validation
• Model Selection Problem
• Validation
• Leave-One-Out Cross Validation
• V-Fold Cross Validation
Model Selection Problem

So Many Models Learned
Even just for binary classification, the choices multiply:
A ∈ {PLA, pocket, linear regression, logistic regression}
× T ∈ {100, 1000, 10000}
× η ∈ {1, 0.01, 0.0001}
× Φ ∈ {linear, quadratic, poly-10, Legendre-poly-10}
× Ω(w) ∈ {L2 regularizer, L1 regularizer, symmetry regularizer}
× λ ∈ {0, 0.01, 1}
× ...

in addition to your favorite combination, you may need to try other combinations to get a good g
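As a back-of-the-envelope illustration, the grid above can be enumerated in a few lines; this is a sketch with made-up list names, not code from the course:

```python
from itertools import product

algorithms = ["PLA", "pocket", "linear regression", "logistic regression"]
iterations = [100, 1000, 10000]             # T
learning_rates = [1, 0.01, 0.0001]          # eta
transforms = ["linear", "quadratic", "poly-10", "Legendre-poly-10"]
regularizers = ["L2", "L1", "symmetry"]     # Omega(w)
lambdas = [0, 0.01, 1]

# every combination is, in principle, a different model to select among
grid = list(product(algorithms, iterations, learning_rates,
                    transforms, regularizers, lambdas))
print(len(grid))  # 4 * 3 * 3 * 4 * 3 * 3 = 1296 candidate models
```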
Model Selection Problem
[figure: two learned boundaries, one from H1 and one from H2; which one do you prefer? :-)]
• given: M models H1, H2, ..., HM, each with corresponding algorithm A1, A2, ..., AM
• goal: select H_{m*} such that g_{m*} = A_{m*}(D) is of low E_out(g_{m*})
• unknown E_out due to unknown P(x) & P(y|x), as always :-)
• arguably the most important practical problem of ML

how to select? visually? no, remember Lecture 12? :-)
Model Selection by Best E_in
select by best E_in? m* = argmin_{1≤m≤M} (E_m = E_in(A_m(D)))
• Φ_1126 always more preferred over Φ_1; λ = 0 always more preferred over λ = 0.1: overfitting?
• if A1 minimizes E_in over H1 and A2 minimizes E_in over H2, then g_{m*} achieves minimal E_in over H1 ∪ H2, so 'model selection + learning' pays d_VC(H1 ∪ H2): bad generalization?

selecting by E_in is dangerous
Model Selection by Best E_test
select by best E_test, which is evaluated on a fresh D_test? m* = argmin_{1≤m≤M} (E_m = E_test(A_m(D)))
• generalization guarantee (finite-bin Hoeffding): E_out(g_{m*}) ≤ E_test(g_{m*}) + O(√(log M / N_test)), yes! strong guarantee :-)
• but where is D_test? your boss's safe, maybe? :-(

selecting by E_test is infeasible and cheating
Comparison between E_in and E_test

in-sample error E_in:
• calculated from D
• feasible: on hand
• 'contaminated': D also used by A_m to 'select' g_m

test error E_test:
• calculated from D_test
• infeasible: in boss's safe
• 'clean': D_test never used for selection before

something in between: E_val
• calculated from D_val ⊂ D
• feasible: on hand
• 'clean' if D_val never used by A_m before

selecting by E_val: legal cheating :-)
Fun Time
For X = R^d, consider two hypothesis sets, H+ and H−. The first hypothesis set contains all perceptrons with w1 ≥ 0, and the second hypothesis set contains all perceptrons with w1 ≤ 0. Denote g+ and g− as the minimum-E_in hypothesis in each hypothesis set, respectively. Which statement below is true?
1 If E_in(g+) < E_in(g−), then g+ is the minimum-E_in hypothesis of all perceptrons in R^d.
2 If E_test(g+) < E_test(g−), then g+ is the minimum-E_test hypothesis of all perceptrons in R^d.
3 The two hypothesis sets are disjoint.
4 None of the above

Reference Answer: 1
Note that the two hypothesis sets are not disjoint (sharing the 'w1 = 0' perceptrons) but their union is all perceptrons. Since the union covers every perceptron, the better of g+ and g− in terms of E_in is the global minimum-E_in perceptron in R^d.
Validation

Validation Set D_val
split D of size N into D_train of size N−K and D_val of size K:
• training on the full D gives g_m = A_m(D), whose error on D is E_in(h)
• training on D_train gives g_m^− = A_m(D_train), whose error on D_val is E_val(h)
• D_val ⊂ D: called the validation set, an 'on-hand' simulation of the test set
• to connect E_val with E_out: make D_val iid ∼ P(x, y) by selecting K examples from D at random
• to make sure D_val is 'clean': feed only D_train to A_m for model selection

E_out(g_m^−) ≤ E_val(g_m^−) + O(√(log M / K))
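A minimal sketch of the random split, assuming the data set sits in a NumPy array; the function name and seed are illustrative:

```python
import numpy as np

def train_val_split(D, K, seed=1126):
    """Randomly hold out K examples of D as D_val; the rest form D_train."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(D))
    return D[idx[K:]], D[idx[:K]]  # D_train (size N-K), D_val (size K)
```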
Model Selection by Best E_val
m* = argmin_{1≤m≤M} (E_m = E_val(A_m(D_train)))
• generalization guarantee for all m: E_out(g_m^−) ≤ E_val(g_m^−) + O(√(log M / K))
• heuristic gain from retraining on all N examples instead of N−K: E_out(g_{m*}) ≤ E_out(g_{m*}^−), where g_{m*} = A_{m*}(D) and g_{m*}^− = A_{m*}(D_train) (learning curve, remember? :-))

[figure: each of H_1, ..., H_M is trained by its A_m on D_train to get g_1^−, ..., g_M^−, scored on D_val as E_1, ..., E_M; pick the best (H_{m*}, E_{m*}), then retrain on the full D to obtain g_{m*}]

E_out(g_{m*}) ≤ E_out(g_{m*}^−) ≤ E_val(g_{m*}^−) + O(√(log M / K))
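The select-then-retrain procedure, as a sketch; the fit/error interface here is hypothetical, standing in for whatever each A_m actually does:

```python
def select_by_validation(models, D_train, D_val, D):
    """Pick the model with lowest E_val, then retrain the winner on all of D."""
    best_model, best_E_val = None, float("inf")
    for A_m in models:
        g_minus = A_m.fit(D_train)          # g_m^- = A_m(D_train)
        E_m = g_minus.error(D_val)          # E_m = E_val(g_m^-)
        if E_m < best_E_val:
            best_model, best_E_val = A_m, E_m
    return best_model.fit(D)                # g_{m*} = A_{m*}(D)
```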
Validation in Practice
use validation to select between H_{Φ5} and H_{Φ10}

[figure: expected E_out versus validation set size K (ticks 5, 15, 25; expected E_out roughly from 0.48 to 0.56), with curves for optimal validation g_{m*}, in-sample selection, and validation selection]

• in-sample: selection with E_in
• optimal: cheating-selection with E_test
• sub-g: selection with E_val, reporting g_{m*}^−
• full-g: selection with E_val, reporting g_{m*}
• E_out(g_{m*}) ≤ E_out(g_{m*}^−) indeed

why is sub-g sometimes worse than in-sample?
The Dilemma about K
reasoning of validation:
E_out(g) ≈ E_out(g^−) ≈ E_val(g^−)
  (small K)       (large K)
• large K: every E_val ≈ E_out, but all g_m^− much worse than g_m
• small K: every g_m^− ≈ g_m, but E_val far from E_out

[figure: the same expected-E_out-versus-K plot as before, showing the trade-off between the two regimes]

practical rule of thumb: K = N/5
Fun Time
For a learning model that takes N² seconds of training when using N examples, what is the total number of seconds needed when running the whole validation procedure with K = N/5 on 25 such models with different parameters to get the final g_{m*}?
1 6N²
2 17N²
3 25N²
4 26N²

Reference Answer: 2
To get all the g_m^−, each model trains on N − K = 4N/5 examples, taking (4N/5)² = (16/25)N² seconds; over 25 models that is (16/25)N² × 25 = 16N² seconds. Then to get g_{m*}, we need another N² seconds of retraining on the full D. So in total we need 17N² seconds.
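A quick check of this accounting with exact fractions; the helper below assumes only the timing model stated in the question (training on a fraction f of the N examples costs f²·N² seconds):

```python
from fractions import Fraction

def total_cost(num_models, folds, frac_train):
    """Cost in units of N^2: each model trains `folds` times on frac_train*N
    examples; the winner is then retrained once on all N examples."""
    return num_models * folds * frac_train ** 2 + 1

print(total_cost(25, 1, Fraction(4, 5)))   # 17, i.e. 17 N^2 seconds
```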
Leave-One-Out Cross Validation

Extreme Case: K = 1
reasoning of validation:
E_out(g) ≈ E_out(g^−) ≈ E_val(g^−)
  (small K)       (large K)
• take K = 1? then D_val^(n) = {(x_n, y_n)} and E_val^(n)(g_n^−) = err(g_n^−(x_n), y_n) = e_n
• make e_n closer to E_out(g)? average over all possible E_val^(n)
• leave-one-out cross validation estimate:
E_loocv(H, A) = (1/N) Σ_{n=1}^{N} e_n = (1/N) Σ_{n=1}^{N} err(g_n^−(x_n), y_n)

hope: E_loocv(H, A) ≈ E_out(g)
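In code, the leave-one-out estimate is a loop over held-out points; a sketch under the same hypothetical fit/err interface as the earlier snippets:

```python
def E_loocv(A, D):
    """Average leave-one-out error of algorithm A on data set D, a list of (x, y)."""
    total = 0.0
    for n in range(len(D)):
        D_minus = D[:n] + D[n + 1:]        # leave example n out
        g_minus = A.fit(D_minus)           # g_n^-
        x_n, y_n = D[n]
        total += g_minus.err(x_n, y_n)     # e_n = err(g_n^-(x_n), y_n)
    return total / len(D)
```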
Illustration of Leave-One-Out
[figure: three data points in the (x, y) plane fitted by a linear model, one panel per held-out point measuring e1, e2, e3; E_loocv(linear) = (1/3)(e1 + e2 + e3)]
[figure: the same three points fitted by a constant model, again measuring e1, e2, e3; E_loocv(constant) = (1/3)(e1 + e2 + e3)]

which one would you choose?
m* = argmin_{1≤m≤M} (E_m = E_loocv(H_m, A_m))
Theoretical Guarantee of Leave-One-Out Estimate
does E_loocv(H, A) say something about E_out(g)? yes, for the average E_out on size-(N−1) data:

E_D[E_loocv(H, A)] = E_D[(1/N) Σ_{n=1}^{N} e_n]
                   = (1/N) Σ_{n=1}^{N} E_D[e_n]
                   = (1/N) Σ_{n=1}^{N} E_{D_n} E_{(x_n, y_n)}[err(g_n^−(x_n), y_n)]
                   = (1/N) Σ_{n=1}^{N} E_{D_n}[E_out(g_n^−)]
                   = (1/N) Σ_{n=1}^{N} Ē_out(N − 1)
                   = Ē_out(N − 1)

expected E_loocv(H, A) says something about expected E_out(g^−): often called an 'almost unbiased estimate of E_out(g)'
Leave-One-Out in Practice
[figure: handwritten-digit classification ('1' versus 'not 1') on average intensity and symmetry features; decision boundaries when selecting by E_in versus by E_loocv]
[figure: error versus number of features used (ticks 5 to 20; error roughly from 0.01 to 0.03), with curves for E_out, E_cv, and E_in]

E_loocv much better than E_in
Fun Time
Consider three examples (x1, y1), (x2, y2), (x3, y3) with y1 = 1, y2 = 5, y3 = 7. We use E_loocv to estimate the performance of a learning algorithm that predicts with the average y value of its data set, the optimal constant prediction with respect to the squared error. What is E_loocv (squared error) of the algorithm?
1 0
2 56/9
3 60/9
4 14

Reference Answer: 4
This is based on a simple calculation: e1 = (1 − 6)² = 25, e2 = (5 − 4)² = 1, e3 = (7 − 3)² = 16, so E_loocv = (25 + 1 + 16)/3 = 14.
V-Fold Cross Validation

Disadvantages of Leave-One-Out Estimate
Computation:
E_loocv(H, A) = (1/N) Σ_{n=1}^{N} e_n = (1/N) Σ_{n=1}^{N} err(g_n^−(x_n), y_n)
• N 'additional' training runs per model, not always feasible in practice
• except in 'special cases' like the analytic solution for linear regression

Stability: suffers from the variance of single-point estimates
[figure: the error-versus-number-of-features plot from before, with curves for E_out, E_cv, E_in, illustrating the fluctuation of E_cv]

E_loocv: not often used practically
V-Fold Cross Validation
how to decrease the computation needed for cross validation?
• essence of leave-one-out cross validation: partition D into N parts, taking N−1 for training and 1 for validation, in turn
• V-fold cross validation: randomly partition D into V equal parts D1, D2, ..., D_V (e.g. D1 through D10 for V = 10), taking V−1 parts for training and 1 for validation, in turn

E_cv(H, A) = (1/V) Σ_{v=1}^{V} E_val^(v)(g_v^−)

• selection by E_cv: m* = argmin_{1≤m≤M} (E_m = E_cv(H_m, A_m))

practical rule of thumb: V = 10
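A sketch of computing E_cv with a random partition via NumPy; the fit/error interface remains hypothetical:

```python
import numpy as np

def E_cv(A, D, V=10, seed=1126):
    """V-fold cross validation error of algorithm A on data set D."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(D))
    folds = np.array_split(idx, V)          # random partition into V parts
    errors = []
    for v in range(V):
        train_idx = np.concatenate([folds[u] for u in range(V) if u != v])
        g_minus = A.fit(D[train_idx])               # train on V-1 parts
        errors.append(g_minus.error(D[folds[v]]))   # validate on the held-out part
    return float(np.mean(errors))
```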
Final Words on Validation
'Selecting' Validation Tool
• V-Fold generally preferred over single validation if computation allows
• 5-Fold or 10-Fold generally works well: not necessary to trade V-Fold for Leave-One-Out

Nature of Validation
• all training models: select among hypotheses
• all validation schemes: select among finalists
• all testing methods: just evaluate
• validation is still more optimistic than testing

do not fool yourself and others :-): report the test result, not the best validation result
Fun Time
For a learning model that takes N² seconds of training when using N examples, what is the total number of seconds needed when running 10-fold cross validation on 25 such models with different parameters to get the final g_{m*}?
1 (47/2)N²
2 47N²
3 (407/2)N²
4 407N²

Reference Answer: 3
To get all the E_cv values, each fold trains on (9/10)N examples, taking (81/100)N² seconds; over 10 folds and 25 models that is (81/100)N² × 10 × 25 = (405/2)N² seconds. Then to get g_{m*}, we need another N² seconds of retraining on the full D. So in total we need (407/2)N² seconds.
Summary
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Lecture 14: Regularization
Lecture 15: Validation
• Model Selection Problem: dangerous by E_in and dishonest by E_test
• Validation: select with E_val(D_train) while returning A_{m*}(D)
• Leave-One-Out Cross Validation: huge computation for an almost unbiased estimate
• V-Fold Cross Validation: reasonable computation and performance

• next: something 'up my sleeve'