Machine Learning Foundations (機器學習基石)

Lecture 7: The VC Dimension

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering,
National Taiwan University (國立台灣大學資訊工程系)
The VC Dimension

Roadmap

1 When Can Machines Learn?
2 Why Can Machines Learn?

   Lecture 6: Theory of Generalization
   E_out ≈ E_in is possible if m_H(N) breaks somewhere and N is large enough

   Lecture 7: The VC Dimension
   • Definition of VC Dimension
   • VC Dimension of Perceptrons
   • Physical Intuition of VC Dimension
   • Interpreting VC Dimension

3 How Can Machines Learn?
4 How Can Machines Learn Better?
The VC Dimension / Definition of VC Dimension

Recap: More on Growth Function

For a hypothesis set with break point k,

$$m_{\mathcal{H}}(N) \le B(N, k) = \underbrace{\sum_{i=0}^{k-1} \binom{N}{i}}_{\text{highest term } N^{k-1}}$$

B(N, k):
          k=1   k=2   k=3   k=4   k=5
  N=1      1     2     2     2     2
  N=2      1     3     4     4     4
  N=3      1     4     7     8     8
  N=4      1     5    11    15    16
  N=5      1     6    16    26    31
  N=6      1     7    22    42    57

N^{k-1}:
          k=1   k=2   k=3   k=4   k=5
  N=1      1     1     1     1     1
  N=2      1     2     4     8    16
  N=3      1     3     9    27    81
  N=4      1     4    16    64   256
  N=5      1     5    25   125   625
  N=6      1     6    36   216  1296

provably & loosely, for N ≥ 2, k ≥ 3:

$$m_{\mathcal{H}}(N) \le B(N, k) = \sum_{i=0}^{k-1} \binom{N}{i} \le N^{k-1}$$
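As a sanity check, the table above can be regenerated from the binomial sum directly. A minimal sketch in Python (the helper name bound_B is ours, not from the course):

```python
from math import comb

def bound_B(N, k):
    """The bound sum_{i=0}^{k-1} C(N, i) on B(N, k) from the slide."""
    return sum(comb(N, i) for i in range(k))

# reproduce the B(N, k) table for N = 1..6, k = 1..5
for N in range(1, 7):
    print([bound_B(N, k) for k in range(1, 6)])

# the loose polynomial bound: for N >= 2 and k >= 3, the sum is at most N^(k-1)
assert all(bound_B(N, k) <= N ** (k - 1)
           for N in range(2, 50) for k in range(3, 10))
```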
The VC Dimension / Definition of VC Dimension

Recap: More on Vapnik-Chervonenkis (VC) Bound

For any g = A(D) ∈ H and 'statistical' large D, for N ≥ 2, k ≥ 3:

$$\mathbb{P}_{\mathcal{D}}\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right] \;\le\; \mathbb{P}_{\mathcal{D}}\left[\,\exists\, h \in \mathcal{H} \text{ s.t. } |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\right]$$
$$\le\; 4\, m_{\mathcal{H}}(2N)\, \exp\!\left(-\tfrac{1}{8}\epsilon^2 N\right) \;\le\; 4\,(2N)^{k-1} \exp\!\left(-\tfrac{1}{8}\epsilon^2 N\right) \quad \text{(if break point } k \text{ exists)}$$

if (1) m_H(N) breaks at k (good H), and
if (2) N is large enough (good D)
  ⇒ probably generalized: 'E_out ≈ E_in', and
if (3) A picks a g with small E_in (good A)
  ⇒ probably learned! (:-) good luck)
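To get a feel for why 'N large enough' wins against the polynomial factor, the last bound can be evaluated numerically. A small sketch of ours (not part of the lecture), using break point k = 4, as for 2D perceptrons, and ε = 0.1:

```python
import math

def vc_bound(N, epsilon, k):
    """Evaluate 4 * (2N)^(k-1) * exp(-epsilon^2 * N / 8)."""
    return 4 * (2 * N) ** (k - 1) * math.exp(-0.125 * epsilon ** 2 * N)

for N in (1_000, 10_000, 100_000, 1_000_000):
    print(f"N = {N:>9,d}: bound <= {vc_bound(N, 0.1, 4):.3g}")
# the polynomial grows, but the exponential eventually crushes it:
# the bound is vacuous (> 1) for small N, yet tiny by N around 10^5
```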
The VC Dimension / Definition of VC Dimension

VC Dimension

the formal name of the maximum non-break point

Definition: the VC dimension of H, denoted d_VC(H), is the largest N for which m_H(N) = 2^N

• the most inputs that H can shatter
• d_VC = 'minimum k' − 1
• N ≤ d_VC ⇒ H can shatter some N inputs
• k > d_VC ⇒ k is a break point for H

if N ≥ 2 and d_VC ≥ 2, then m_H(N) ≤ N^{d_VC}
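The definition can be checked mechanically for a simple hypothesis set. A sketch (our own brute-force helper, not course code) for positive rays h(x) = sign(x − a) on the real line:

```python
def ray_dichotomies(points):
    """All dichotomies that positive rays sign(x - a) can produce on the points."""
    points = sorted(points)
    # thresholds between neighbors (and beyond both ends) realize every behavior
    cuts = ([points[0] - 1]
            + [(p + q) / 2 for p, q in zip(points, points[1:])]
            + [points[-1] + 1])
    return {tuple(1 if x > a else -1 for x in points) for a in cuts}

for N in range(1, 5):
    m = len(ray_dichotomies(range(N)))   # any N distinct points behave alike here
    print(N, m, "shattered" if m == 2 ** N else "not shattered")
# m_H(N) = N + 1, so only N = 1 is shattered: d_VC = 1 for positive rays
```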
The VC Dimension / Definition of VC Dimension

The Four VC Dimensions

• positive rays:       m_H(N) = N + 1                  ⇒ d_VC = 1
• positive intervals:  m_H(N) = (1/2)N² + (1/2)N + 1   ⇒ d_VC = 2
• convex sets:         m_H(N) = 2^N                    ⇒ d_VC = ∞
• 2D perceptrons:      m_H(N) ≤ N³ for N ≥ 2           ⇒ d_VC = 3

good: finite d_VC
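For instance, plugging small N into the positive-interval formula confirms the claimed d_VC:

$$m_{\mathcal{H}}(2) = \tfrac{1}{2}\cdot 4 + \tfrac{1}{2}\cdot 2 + 1 = 4 = 2^2, \qquad m_{\mathcal{H}}(3) = \tfrac{1}{2}\cdot 9 + \tfrac{1}{2}\cdot 3 + 1 = 7 < 2^3,$$

so some 2 inputs are shattered but no 3 inputs are: d_VC = 2.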
The VC Dimension / Definition of VC Dimension

VC Dimension and Learning

finite d_VC ⇒ g 'will' generalize (E_out(g) ≈ E_in(g)):
• regardless of learning algorithm A
• regardless of input distribution P
• regardless of target function f

[learning-flow diagram: an unknown target function f: X → Y (ideal credit approval formula) and an unknown P on X generate training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank); learning algorithm A searches hypothesis set H (set of candidate formulas) and outputs final hypothesis g ≈ f ('learned' formula to be used)]

a 'worst case' guarantee on generalization
The VC Dimension / Definition of VC Dimension

Fun Time

Suppose there is a set of N inputs that cannot be shattered by H. Based only on this information, what can we conclude about d_VC(H)?

1 d_VC(H) > N
2 d_VC(H) = N
3 d_VC(H) < N
4 no conclusion can be made

Reference Answer: 4

It is possible that there is another set of N inputs that can be shattered, which would mean d_VC ≥ N. It is also possible that no set of N inputs can be shattered, which would mean d_VC < N. Neither case can be ruled out by one non-shattering set.
The VC Dimension / VC Dimension of Perceptrons

2D PLA Revisited

[flow of the argument:]
linearly separable D with x_n ∼ P and y_n = f(x_n)
  → PLA can converge (T large) ⇒ E_in(g) = 0
  → P[|E_in(g) − E_out(g)| > ε] ≤ ... by d_VC = 3 (N large) ⇒ E_out(g) ≈ E_in(g)
  → together: E_out(g) ≈ 0 :-)

general PLA for x with more than 2 features?
The VC Dimension / VC Dimension of Perceptrons

VC Dimension of Perceptrons

• 1D perceptron (pos/neg rays): d_VC = 2
• 2D perceptrons: d_VC = 3
  • d_VC ≥ 3: some set of 3 inputs can be shattered
  • d_VC ≤ 3: no 4 inputs can be shattered (e.g. the × ◦ / ◦ × pattern on a 2×2 grid is impossible)
• d-D perceptrons: d_VC = d + 1?

two steps, numerically probed in the sketch below:
• d_VC ≥ d + 1
• d_VC ≤ d + 1
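Both directions can be explored numerically before the proofs: a dichotomy y is realizable by a perceptron iff some w satisfies y_n w^T x_n ≥ 1 for all n (rescaling any strict separator gives the margin). A sketch of ours using a linear-program feasibility check, with the three-point and four-point input sets as assumptions:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def realizable(X, y):
    """Is there a w with y_n * (w^T x_n) >= 1 for all n?  (x includes the constant 1)"""
    A_ub = -np.asarray(y)[:, None] * X           # encodes -y_n x_n^T w <= -1
    res = linprog(c=np.zeros(X.shape[1]), A_ub=A_ub, b_ub=-np.ones(len(y)),
                  bounds=[(None, None)] * X.shape[1])
    return res.status == 0                       # 0 = feasible solution found

def count_dichotomies(X):
    return sum(realizable(X, y) for y in itertools.product([-1, 1], repeat=len(X)))

tri  = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1]])             # 3 non-collinear points
quad = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]])  # 4 points on a grid
print(count_dichotomies(tri))    # 8 = 2^3: these 3 inputs are shattered, so d_VC >= 3
print(count_dichotomies(quad))   # 14 < 2^4: the two XOR-like dichotomies fail
```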
The VC Dimension / VC Dimension of Perceptrons

Extra Fun Time

Which statement below shows that d_VC ≥ d + 1?

1 There are some d + 1 inputs we can shatter.
2 We can shatter any set of d + 1 inputs.
3 There are some d + 2 inputs we cannot shatter.
4 We cannot shatter any set of d + 2 inputs.

Reference Answer: 1

d_VC is the largest N with m_H(N) = 2^N, and m_H(N) is the largest number of dichotomies on N inputs. So if we can find 2^{d+1} dichotomies on some d + 1 inputs, then m_H(d + 1) = 2^{d+1} and hence d_VC ≥ d + 1.
The VC Dimension / VC Dimension of Perceptrons

d_VC ≥ d + 1

There are some d + 1 inputs we can shatter. Take some 'trivial' inputs (each x_n includes the constant x_0 = 1):

$$X = \begin{bmatrix} \text{---}\,x_1^T\,\text{---} \\ \text{---}\,x_2^T\,\text{---} \\ \text{---}\,x_3^T\,\text{---} \\ \vdots \\ \text{---}\,x_{d+1}^T\,\text{---} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & & 0 \\ \vdots & & & \ddots & \\ 1 & 0 & 0 & \cdots & 1 \end{bmatrix}$$

(visually in 2D: the points (0, 0), (1, 0), (0, 1))

note: X is invertible!
The VC Dimension / VC Dimension of Perceptrons

Can We Shatter X?

$$X = \begin{bmatrix} \text{---}\,x_1^T\,\text{---} \\ \text{---}\,x_2^T\,\text{---} \\ \vdots \\ \text{---}\,x_{d+1}^T\,\text{---} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \\ 1 & 0 & \cdots & 0 & 1 \end{bmatrix} \quad \text{(invertible)}$$

to shatter: for any y = (y_1, ..., y_{d+1})^T, find w such that

$$\text{sign}(Xw) = y \;\Longleftarrow\; Xw = y \;\Longleftrightarrow\; w = X^{-1}y \quad (X \text{ invertible!})$$

'special' X can be shattered ⇒ d_VC ≥ d + 1
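This construction runs as written: build the special X and solve w = X^{-1}y for each of the 2^{d+1} target dichotomies. A minimal sketch (d is a free parameter):

```python
import itertools
import numpy as np

d = 4
# rows: (1, 0, ..., 0) and then (1, e_n) for each standard basis vector e_n
X = np.hstack([np.ones((d + 1, 1)), np.vstack([np.zeros(d), np.eye(d)])])

for y in itertools.product([-1.0, 1.0], repeat=d + 1):
    w = np.linalg.solve(X, np.array(y))        # w = X^{-1} y, since X is invertible
    assert np.array_equal(np.sign(X @ w), np.array(y))
print(f"all {2 ** (d + 1)} dichotomies realized on {d + 1} inputs: d_VC >= d + 1")
```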
The VC Dimension / VC Dimension of Perceptrons

Extra Fun Time

Which statement below shows that d_VC ≤ d + 1?

1 There are some d + 1 inputs we can shatter.
2 We can shatter any set of d + 1 inputs.
3 There are some d + 2 inputs we cannot shatter.
4 We cannot shatter any set of d + 2 inputs.

Reference Answer: 4

d_VC is the largest N with m_H(N) = 2^N, and m_H(N) is the largest number of dichotomies on N inputs. So if we cannot find 2^{d+2} dichotomies on any d + 2 inputs (i.e. d + 2 is a break point), then m_H(d + 2) < 2^{d+2} and hence d_VC < d + 2; that is, d_VC ≤ d + 1.
The VC Dimension / VC Dimension of Perceptrons

d_VC ≤ d + 1 (1/2): A 2D Special Case

$$X = \begin{bmatrix} \text{---}\,x_1^T\,\text{---} \\ \text{---}\,x_2^T\,\text{---} \\ \text{---}\,x_3^T\,\text{---} \\ \text{---}\,x_4^T\,\text{---} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$

for the dichotomy (x_1, x_2, x_3) = (×, ◦, ◦), the label of x_4 cannot be ×: since x_4 = x_2 + x_3 − x_1,

$$w^T x_4 = \underbrace{w^T x_2}_{\circ,\,>0} + \underbrace{w^T x_3}_{\circ,\,>0} - \underbrace{w^T x_1}_{\times,\,<0} > 0$$

linear dependence restricts the achievable dichotomies
The VC Dimension / VC Dimension of Perceptrons

d_VC ≤ d + 1 (2/2): d-D General Case

$$X = \begin{bmatrix} \text{---}\,x_1^T\,\text{---} \\ \text{---}\,x_2^T\,\text{---} \\ \vdots \\ \text{---}\,x_{d+1}^T\,\text{---} \\ \text{---}\,x_{d+2}^T\,\text{---} \end{bmatrix}$$

more rows (d + 2) than columns (d + 1) forces linear dependence (with some a_i non-zero):

$$x_{d+2} = a_1 x_1 + a_2 x_2 + \dots + a_{d+1} x_{d+1}$$

• can you generate the dichotomy (sign(a_1), sign(a_2), ..., sign(a_{d+1}), ×)? if so, with what w? Any such w has sign(w^T x_n) = sign(a_n), so every term a_n w^T x_n below is positive (where a_n ≠ 0), and

$$w^T x_{d+2} = \underbrace{a_1 w^T x_1}_{>0} + \underbrace{a_2 w^T x_2}_{>0} + \dots + \underbrace{a_{d+1} w^T x_{d+1}}_{>0} > 0$$

so x_{d+2} is forced to be ◦, never × (contradiction!)

'general' X cannot be shattered ⇒ d_VC ≤ d + 1
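The same contradiction can be watched numerically: recover the dependence coefficients a, pick a w that matches sign(a_n) on the first d + 1 points, and w^T x_{d+2} = Σ a_n w^T x_n comes out positive automatically. A sketch of ours, assuming random points (so all a_n are non-zero almost surely):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
X = np.hstack([np.ones((d + 2, 1)), rng.standard_normal((d + 2, d))])  # d+2 rows, d+1 cols

# linear dependence: x_{d+2} = sum_n a_n x_n (exact: the first d+1 rows form a square basis)
a, *_ = np.linalg.lstsq(X[:d + 1].T, X[d + 1], rcond=None)

# one w realizing sign(w^T x_n) = sign(a_n) on the first d+1 points
w = np.linalg.solve(X[:d + 1], np.sign(a))     # w^T x_n = sign(a_n) exactly

# so x_{d+2} is forced to the label +1, never x: that dichotomy is unrealizable
print(w @ X[d + 1], "=", np.abs(a).sum())      # both equal sum_n |a_n| > 0
```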
The VC Dimension / VC Dimension of Perceptrons

Fun Time

Based on the proof above, what is d_VC of 1126-D perceptrons?

1 1024
2 1126
3 1127
4 6211

Reference Answer: 3

Well, too much fun for this section! :-)
The VC Dimension / Physical Intuition of VC Dimension

Degrees of Freedom

[figure: rows of dials with positions 0-18, illustrating freely adjustable 'knobs'; modified from the work of Hugues Vermeiren on http://www.texample.net]

• hypothesis parameters w = (w_0, w_1, ..., w_d): create degrees of freedom
• hypothesis quantity M = |H|: 'analog' degrees of freedom
• hypothesis 'power' d_VC = d + 1: effective 'binary' degrees of freedom

d_VC(H): 'powerfulness' of H