Machine Learning Foundations (機器學習基石)
Lecture 7: The VC Dimension

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering,
National Taiwan University (國立台灣大學資訊工程系)
Roadmap

1 When Can Machines Learn?
2 Why Can Machines Learn?
  Lecture 6: Theory of Generalization
  E_out ≈ E_in is possible if m_H(N) breaks somewhere and N is large enough.
  Lecture 7: The VC Dimension
  - Definition of VC Dimension
  - VC Dimension of Perceptrons
  - Physical Intuition of VC Dimension
  - Interpreting VC Dimension
3 How Can Machines Learn?
4 How Can Machines Learn Better?
Definition of VC Dimension

Recap: More on Growth Function

For a hypothesis set with break point k,

    m_H(N) ≤ B(N, k) = Σ_{i=0}^{k−1} C(N, i),    with highest-order term N^{k−1}
B(N, k):
           k=1   k=2   k=3   k=4   k=5
    N=1      1     2     2     2     2
    N=2      1     3     4     4     4
    N=3      1     4     7     8     8
    N=4      1     5    11    15    16
    N=5      1     6    16    26    31
    N=6      1     7    22    42    57

N^{k−1}:
           k=1   k=2   k=3   k=4   k=5
    N=1      1     1     1     1     1
    N=2      1     2     4     8    16
    N=3      1     3     9    27    81
    N=4      1     4    16    64   256
    N=5      1     5    25   125   625
    N=6      1     6    36   216  1296
provably, and loosely, for N ≥ 2 and k ≥ 3:

    m_H(N) ≤ B(N, k) = Σ_{i=0}^{k−1} C(N, i) ≤ N^{k−1}
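To make the comparison concrete, here is a small self-contained check (my own sketch, not from the lecture; the function name B is chosen to match the slide) that computes the binomial sum and reproduces both tables:

```python
from math import comb

def B(N: int, k: int) -> int:
    """Upper bound on m_H(N) when k is a break point: sum_{i=0}^{k-1} C(N, i)."""
    return sum(comb(N, i) for i in range(k))

for N in range(1, 7):
    row_B = [B(N, k) for k in range(1, 6)]
    row_poly = [N ** (k - 1) for k in range(1, 6)]
    print(f"N={N}:  B(N,k) = {row_B}   N^(k-1) = {row_poly}")
# e.g. N=4 gives B(N,k) = [1, 5, 11, 15, 16] and N^(k-1) = [1, 4, 16, 64, 256],
# matching the tables; B(N,k) <= N^(k-1) once N >= 2 and k >= 3.
```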
Recap: More on Vapnik-Chervonenkis (VC) Bound

For any g = A(D) ∈ H and 'statistically' large D, for N ≥ 2 and k ≥ 3:

    P_D[ |E_in(g) − E_out(g)| > ε ]
      ≤ P_D[ ∃ h ∈ H s.t. |E_in(h) − E_out(h)| > ε ]
      ≤ 4 m_H(2N) exp(−(1/8) ε² N)
      ≤ 4 (2N)^{k−1} exp(−(1/8) ε² N)    (if break point k exists)
Putting it together:
1 if m_H(N) breaks at k (good H), and
2 if N is large enough (good D),
    ⟹ probably generalized: 'E_out ≈ E_in', and
3 if A picks a g with small E_in (good A),
    ⟹ probably learned! (:-) good luck)

(a quick numerical look at this bound follows below)
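As a hedged sketch of what the polynomial form of the bound looks like numerically (the function name vc_bound is mine, not from the lecture), take 2D perceptrons, whose break point is k = 4:

```python
from math import exp

def vc_bound(N: int, eps: float, k: int) -> float:
    """Upper bound on P[|E_in - E_out| > eps] when m_H breaks at k:
    4 * (2N)^(k-1) * exp(-eps^2 * N / 8)."""
    return 4 * (2 * N) ** (k - 1) * exp(-(eps ** 2) * N / 8)

# with eps = 0.1, the bound only becomes meaningful for quite large N:
for N in (1_000, 10_000, 100_000):
    print(f"N = {N:>7}: bound = {vc_bound(N, eps=0.1, k=4):.3g}")
```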
VC Dimension

the formal name of the maximum non-break point

Definition: the VC dimension of H, denoted d_VC(H), is the largest N for which m_H(N) = 2^N.

- the maximum number of inputs that H can shatter
- d_VC = 'minimum break point k' − 1
- N ≤ d_VC  ⟹  H can shatter some N inputs
- k > d_VC  ⟹  k is a break point for H

if N ≥ 2 and d_VC ≥ 2:  m_H(N) ≤ N^{d_VC}
The Four VC Dimensions

- positive rays:      m_H(N) = N + 1                 ⟹ d_VC = 1
- positive intervals: m_H(N) = (1/2)N² + (1/2)N + 1  ⟹ d_VC = 2
- convex sets:        m_H(N) = 2^N                   ⟹ d_VC = ∞
- 2D perceptrons:     m_H(N) ≤ N³ for N ≥ 2          ⟹ d_VC = 3

good: finite d_VC (a small enumeration check follows below)
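As a concrete instance of one of these growth functions, a short enumeration sketch (my own code; ray_dichotomies is a hypothetical helper name) confirms m_H(N) = N + 1 for positive rays, so the largest N with m_H(N) = 2^N is 1:

```python
import numpy as np

def ray_dichotomies(xs):
    """All dichotomies realizable by h(x) = sign(x - a) on the points xs."""
    xs = np.sort(xs)
    # one threshold below all points, one in each gap, one above all points
    thresholds = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    return {tuple(np.sign(xs - a).astype(int)) for a in thresholds}

rng = np.random.default_rng(0)
for N in (1, 2, 3, 4):
    m = len(ray_dichotomies(rng.random(N)))
    print(N, m, 2 ** N)   # m = N + 1; equals 2^N only at N = 1, so d_VC = 1
```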
VC Dimension and Learning

finite d_VC ⟹ g 'will' generalize (E_out(g) ≈ E_in(g)):
- regardless of learning algorithm A
- regardless of input distribution P
- regardless of target function f

[learning-flow diagram: unknown target function f: X → Y (ideal credit approval formula) and unknown P on X generate training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank); learning algorithm A searches the hypothesis set H (set of candidate formulas) to produce a final hypothesis g ≈ f ('learned' formula to be used)]

a 'worst case' guarantee on generalization
Fun Time

If there is a set of N inputs that cannot be shattered by H, what can we conclude about d_VC(H) based only on this information?
1 d_VC(H) > N
2 d_VC(H) = N
3 d_VC(H) < N
4 no conclusion can be made

Reference Answer: 4
It is possible that some other set of N inputs can be shattered, which would mean d_VC ≥ N. It is also possible that no set of N inputs can be shattered, which would mean d_VC < N. Neither case can be ruled out by one non-shattering set.
VC Dimension of Perceptrons

2D PLA Revisited

- linearly separable D with x_n ~ P and y_n = f(x_n), plus T large
    ⟹ PLA can converge ⟹ E_in(g) = 0
- d_VC = 3, plus N large
    ⟹ P[|E_in(g) − E_out(g)| > ε] ≤ ... ⟹ E_out(g) ≈ E_in(g)
- together ⟹ E_out(g) ≈ 0 :-)

what about general PLA for x with more than 2 features? (a minimal sketch follows below)
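As a sketch of that generalization (my own minimal implementation, not the course's official code), PLA works unchanged in any dimension, assuming each x already carries the bias coordinate x_0 = 1:

```python
import numpy as np

def pla(X: np.ndarray, y: np.ndarray, max_updates: int = 100_000) -> np.ndarray:
    """Minimal PLA: X is N x (d+1) with x_0 = 1, y in {-1, +1}.
    Converges (E_in = 0) whenever the data are linearly separable."""
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        mistakes = np.flatnonzero(np.sign(X @ w) != y)
        if mistakes.size == 0:
            return w                    # no mistakes left: E_in(w) = 0
        n = mistakes[0]
        w += y[n] * X[n]                # correct the first mistake
    raise RuntimeError("no convergence: data may not be linearly separable")
```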
VC Dimension of Perceptrons

- 1D perceptron (pos/neg rays): d_VC = 2
- 2D perceptrons: d_VC = 3
  - d_VC ≥ 3: some set of 3 inputs can be shattered
  - d_VC ≤ 3: no set of 4 inputs can be shattered (e.g. the × ◦ ◦ × pattern fails)
- d-dimensional perceptrons: d_VC = d + 1?

two steps to prove it:
- d_VC ≥ d + 1
- d_VC ≤ d + 1
Extra Fun Time

Which statement below shows that d_VC ≥ d + 1?
1 There is some set of d + 1 inputs we can shatter.
2 We can shatter any set of d + 1 inputs.
3 There is some set of d + 2 inputs we cannot shatter.
4 We cannot shatter any set of d + 2 inputs.

Reference Answer: 1
d_VC is the largest N for which m_H(N) = 2^N, and m_H(N) is the maximum number of dichotomies over all sets of N inputs. So if we can find 2^{d+1} dichotomies on some d + 1 inputs, then m_H(d + 1) = 2^{d+1} and hence d_VC ≥ d + 1.
d_VC ≥ d + 1

There is some set of d + 1 inputs we can shatter. Take some 'trivial' inputs as the rows of

    X = [ x_1^T     ]   [ 1 0 0 ... 0 ]
        [ x_2^T     ]   [ 1 1 0 ... 0 ]
        [ x_3^T     ] = [ 1 0 1 ... 0 ]
        [   ...     ]   [ ...         ]
        [ x_{d+1}^T ]   [ 1 0 0 ... 1 ]

note: X is invertible!
Can We Shatter X?

With the invertible X above, to shatter: for any y = (y_1, ..., y_{d+1})^T, find w such that

    sign(Xw) = y  ⟸  Xw = y  ⟺  w = X^{−1} y

the 'special' X can be shattered ⟹ d_VC ≥ d + 1 (a quick numerical check follows below)
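A quick numerical verification of this construction (my own sketch): build the special X, solve w = X^{−1} y for every dichotomy y, and confirm each one is realized.

```python
import itertools
import numpy as np

d = 4
# rows: x_1 = (1, 0, ..., 0) and x_{i+1} = (1, e_i) for i = 1..d
X = np.hstack([np.ones((d + 1, 1)),
               np.vstack([np.zeros((1, d)), np.eye(d)])])

for y in itertools.product([-1.0, 1.0], repeat=d + 1):
    w = np.linalg.solve(X, np.array(y))          # Xw = y exactly
    assert np.array_equal(np.sign(X @ w), np.array(y))
print(f"all {2 ** (d + 1)} dichotomies realized -> d_VC >= {d + 1}")
```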
Extra Fun Time

Which statement below shows that d_VC ≤ d + 1?
1 There is some set of d + 1 inputs we can shatter.
2 We can shatter any set of d + 1 inputs.
3 There is some set of d + 2 inputs we cannot shatter.
4 We cannot shatter any set of d + 2 inputs.

Reference Answer: 4
d_VC is the largest N for which m_H(N) = 2^N, and m_H(N) is the maximum number of dichotomies over all sets of N inputs. So if we cannot find 2^{d+2} dichotomies on any d + 2 inputs (i.e. d + 2 is a break point), then m_H(d + 2) < 2^{d+2} and hence d_VC < d + 2, that is, d_VC ≤ d + 1.
d_VC ≤ d + 1 (1/2)

A 2D Special Case. Consider the four inputs

    X = [ x_1^T ]   [ 1 0 0 ]
        [ x_2^T ] = [ 1 1 0 ]
        [ x_3^T ]   [ 1 0 1 ]
        [ x_4^T ]   [ 1 1 1 ]

and the dichotomy (×, ◦, ◦, ?) on (x_1, x_2, x_3, x_4). The '?' cannot be ×: since x_4 = x_2 + x_3 − x_1,

    w^T x_4 = w^T x_2 + w^T x_3 − w^T x_1 > 0
              (◦: > 0)  (◦: > 0)   (×: < 0)

linear dependence restricts the realizable dichotomies (a numerical illustration follows below)
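The inequality above is the actual argument; purely as an illustration (my own sketch, not a proof), a random search over w never produces × on x_4 once the first three signs are (×, ◦, ◦):

```python
import numpy as np

X = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 1., 1.]])   # note x_4 = x_2 + x_3 - x_1

rng = np.random.default_rng(0)
hits = 0
for _ in range(100_000):
    w = rng.standard_normal(3)
    if tuple(np.sign(X[:3] @ w)) == (-1.0, 1.0, 1.0):   # (x, o, o) on x_1..x_3
        hits += 1
        assert X[3] @ w > 0       # then x_4 is forced onto the o side
print(f"checked {hits} candidate w's; none produced x on x_4")
```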
d_VC ≤ d + 1 (2/2)

The d-Dimensional General Case. Any d + 2 inputs

    X = [ x_1^T; x_2^T; ...; x_{d+1}^T; x_{d+2}^T ]

form a matrix with more rows than columns, so there is linear dependence (with some a_i non-zero):

    x_{d+2} = a_1 x_1 + a_2 x_2 + ... + a_{d+1} x_{d+1}

Can any w generate the dichotomy (sign(a_1), sign(a_2), ..., sign(a_{d+1}), ×)? No: if sign(w^T x_i) = sign(a_i) for every i, then every non-zero term a_i w^T x_i is positive, so

    w^T x_{d+2} = a_1 w^T x_1 + a_2 w^T x_2 + ... + a_{d+1} w^T x_{d+1} > 0    (contradiction!)

the 'general' X cannot be shattered ⟹ d_VC ≤ d + 1 (a sketch of the dependence step follows below)
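A numerical sketch of the key dependence step (my own code; any null-space routine would do, here the last right-singular vector from plain NumPy SVD):

```python
import numpy as np

d = 3
rng = np.random.default_rng(1)
# d + 2 random points in homogeneous coordinates: (d+2) x (d+1)
X = np.hstack([np.ones((d + 2, 1)), rng.standard_normal((d + 2, d))])

# null-space direction of X^T: a non-zero c with sum_i c_i x_i = 0
c = np.linalg.svd(X.T)[2][-1]
k = int(np.argmax(np.abs(c)))            # express x_k via the other points
a = -np.delete(c, k) / c[k]
assert np.allclose(X[k], a @ np.delete(X, k, axis=0))
print(f"point {k} is a linear combination of the rest, a = {np.round(a, 3)}")
```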
Fun Time

Based on the proof above, what is the d_VC of 1126-dimensional perceptrons?
1 1024
2 1126
3 1127
4 6211

Reference Answer: 3
d_VC = d + 1 = 1127. Well, too much fun for this section! :-)
Physical Intuition of VC Dimension

Degrees of Freedom

[figure: banks of adjustable knobs illustrating degrees of freedom; modified from the work of Hugues Vermeiren on http://www.texample.net]

- hypothesis parameters w = (w_0, w_1, ..., w_d): create degrees of freedom
- hypothesis quantity M = |H|: 'analog' degrees of freedom
- hypothesis 'power' d_VC = d + 1: effective 'binary' degrees of freedom

d_VC(H): the powerfulness of H
Two Old Friends

- Positive Rays (d_VC = 1): h(x) = −1 for x < a, h(x) = +1 for x > a;
  free parameter: a
- Positive Intervals (d_VC = 2): h(x) = +1 for ℓ < x < r, h(x) = −1 elsewhere;
  free parameters: ℓ, r

practical rule of thumb: d_VC ≈ #free parameters (but not always; a brute-force check follows below)
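A brute-force check of the rule of thumb for positive intervals (my own sketch; interval_dichotomies is a hypothetical helper name): two free parameters (ℓ, r), and indeed 2 points can be shattered while 3 cannot:

```python
import itertools

def interval_dichotomies(xs):
    """All dichotomies realizable by h(x) = +1 iff l < x < r on the points xs."""
    xs = sorted(xs)
    cuts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    # l = r yields the empty interval (all -1)
    return {tuple(+1 if l < x < r else -1 for x in xs)
            for l, r in itertools.combinations_with_replacement(cuts, 2)}

for N in (1, 2, 3):
    m = len(interval_dichotomies(range(N)))
    print(N, m, 2 ** N)   # m = N(N+1)/2 + 1; equals 2^N up to N = 2, so d_VC = 2
```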
M and d_VC

copied from Lecture 5 :-)
1 can we make sure that E_out(g) is close enough to E_in(g)?
2 can we make E_in(g) small enough?

                 question 1                                  question 2
    small M      Yes! P[BAD] ≤ 2 · M · exp(...)              No! too few choices
    large M      No!  P[BAD] ≤ 2 · M · exp(...)              Yes! many choices
    small d_VC   Yes! P[BAD] ≤ 4 · (2N)^{d_VC} · exp(...)    No! too limited power
    large d_VC   No!  P[BAD] ≤ 4 · (2N)^{d_VC} · exp(...)    Yes! lots of power

using the right d_VC (or H) is important
Fun Time

Origin-crossing hyperplanes are essentially perceptrons with w_0 fixed at 0. Make a guess about the d_VC of origin-crossing hyperplanes in R^d.
1 1
2 d
3 d + 1
4 ∞

Reference Answer: 2
The proof is almost the same as for the usual perceptrons, but it is the intuition (d_VC ≈ #free parameters) that you should use to answer this quiz.
Interpreting VC Dimension

VC Bound Rephrase: Penalty for Model Complexity

For any g = A(D) ∈ H and 'statistically' large D, for N ≥ 2 and d_VC ≥ 2:

    P_D[ |E_in(g) − E_out(g)| > ε ]  (BAD)  ≤  4 (2N)^{d_VC} exp(−(1/8) ε² N)  =: δ

Rephrase: with probability ≥ 1 − δ, GOOD: |E_in(g) − E_out(g)| ≤ ε. Setting δ = 4 (2N)^{d_VC} exp(−(1/8) ε² N) and solving for ε:

    δ / (4 (2N)^{d_VC}) = exp(−(1/8) ε² N)
    ln( 4 (2N)^{d_VC} / δ ) = (1/8) ε² N
    ε = √( (8/N) ln( 4 (2N)^{d_VC} / δ ) )

the √(...) term is Ω(N, H, δ): the penalty for model complexity
Therefore, with probability ≥ 1 − δ, the generalization error is bounded:

    |E_in(g) − E_out(g)| ≤ √( (8/N) ln( 4 (2N)^{d_VC} / δ ) ) = Ω(N, H, δ)

    E_in(g) − Ω(N, H, δ) ≤ E_out(g) ≤ E_in(g) + Ω(N, H, δ)

(a small calculator sketch follows below)
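A small calculator for the rephrased bound (my own sketch; the name omega and the example numbers are mine). Note how large the penalty still is even at N = 10,000:

```python
from math import log, sqrt

def omega(N: int, d_vc: int, delta: float) -> float:
    """Model-complexity penalty: sqrt( (8/N) * ln( 4 * (2N)^d_vc / delta ) )."""
    return sqrt(8 / N * log(4 * (2 * N) ** d_vc / delta))

E_in, N, d_vc, delta = 0.05, 10_000, 3, 0.1
eps = omega(N, d_vc, delta)
print(f"with prob >= {1 - delta}: "
      f"E_out(g) in [{E_in - eps:.3f}, {E_in + eps:.3f}]")   # roughly +/- 0.16
```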
THE VC Message

with a high probability,

    E_out(g) ≤ E_in(g) + √( (8/N) ln( 4 (2N)^{d_VC} / δ ) )
             = E_in(g) + Ω(N, H, δ)
               (in-sample error + model complexity)

[figure: error versus VC dimension; E_in decreases with d_VC, the model complexity Ω increases, and the out-of-sample error is U-shaped with its minimum at some d_VC*]

- d_VC ↑: E_in ↓ but Ω ↑
- d_VC ↓: Ω ↓ but E_in ↑
- the best d_VC* lies in the middle

a powerful H is not always good!
VC Bound Rephrase: Sample Complexity

Recall: for N ≥ 2 and d_VC ≥ 2,

    P_D[ |E_in(g) − E_out(g)| > ε ]  (BAD)  ≤  4 (2N)^{d_VC} exp(−(1/8) ε² N)  =: δ

given specs ε = 0.1, δ = 0.1, d_VC = 3: want 4 (2N)^{d_VC} exp(−(1/8) ε² N) ≤ δ

    N          bound
    100        2.82 × 10^7
    1,000      9.17 × 10^9
    10,000     1.19 × 10^8
    100,000    1.65 × 10^{−38}
    29,300     9.99 × 10^{−2}

sample complexity: need N ≈ 10,000 · d_VC in theory

practical rule of thumb: N ≈ 10 · d_VC is often enough! (a small solver sketch follows below)
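A small solver sketch (my own code and names) that reproduces the table's conclusion by finding the smallest N meeting the spec:

```python
from math import exp

def bound(N: int, eps: float, d_vc: int) -> float:
    return 4 * (2 * N) ** d_vc * exp(-(eps ** 2) * N / 8)

def sample_complexity(eps: float = 0.1, delta: float = 0.1, d_vc: int = 3) -> int:
    lo, hi = 1, 2
    while bound(hi, eps, d_vc) > delta:   # double until the bound drops below delta
        lo, hi = hi, hi * 2
    while lo + 1 < hi:                    # bisect the single down-crossing
        mid = (lo + hi) // 2
        lo, hi = (mid, hi) if bound(mid, eps, d_vc) > delta else (lo, mid)
    return hi

print(sample_complexity())   # about 29,300: roughly 10,000 * d_VC
```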
Looseness of VC Bound

    P_D[ |E_in(g) − E_out(g)| > ε ] ≤ 4 (2N)^{d_VC} exp(−(1/8) ε² N)

theory: N ≈ 10,000 · d_VC; practice: N ≈ 10 · d_VC. Why so loose?

- Hoeffding for unknown E_out: works for any distribution, any target
- m_H(N) instead of |H(x_1, ..., x_N)|: works for 'any' data
- N^{d_VC} instead of m_H(N): works for 'any' H of the same d_VC
- union bound over worst cases: works for any choice made by A

but the bound is hardly improvable, and it is 'similarly loose for all models'

the philosophical message of the VC bound is important for improving ML
Fun Time

Consider the VC bound below. How can we decrease the probability of getting BAD data?

    P_D[ |E_in(g) − E_out(g)| > ε ] ≤ 4 (2N)^{d_VC} exp(−(1/8) ε² N)

1 decrease the model complexity d_VC
2 increase the data size N a lot
3 increase the generalization error tolerance ε
4 all of the above

Reference Answer: 4
Congratulations on becoming a Master of the VC bound! :-)