Machine Learning Foundations
(機器學習基石)
Lecture 6: Theory of Generalization
Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?

Lecture 5: Training versus Testing
effective price of choice in training: (wishfully) growth function m_H(N) with a break point

Lecture 6: Theory of Generalization
Restriction of Break Point
Bounding Function: Basic Cases
Bounding Function: Inductive Cases
A Pictorial Proof

3 How Can Machines Learn?
4 How Can Machines Learn Better?
Theory of Generalization Restriction of Break Point

The Four Break Points
growth function m_H(N): max number of dichotomies

• positive rays: m_H(N) = N + 1
  m_H(2) = 3 < 2^2: break point at 2
• positive intervals: m_H(N) = (1/2)N^2 + (1/2)N + 1
  m_H(3) = 7 < 2^3: break point at 3
• convex sets: m_H(N) = 2^N always: no break point
• 2D perceptrons: m_H(N) < 2^N in some cases
  m_H(4) = 14 < 2^4: break point at 4

break point k =⇒ break point k + 1, . . . what else?
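The break points above can be checked mechanically: the break point of a growth function is the smallest k with m_H(k) < 2^k. A minimal sketch in Python, using the growth functions stated above:

```python
def break_point(mH, max_k=20):
    """Smallest k with m_H(k) < 2^k, or None if no break point up to max_k."""
    for k in range(1, max_k + 1):
        if mH(k) < 2 ** k:
            return k
    return None

print(break_point(lambda N: N + 1))                 # positive rays: 2
print(break_point(lambda N: N * (N + 1) // 2 + 1))  # positive intervals: 3
print(break_point(lambda N: 2 ** N))                # convex sets: None
```

Note that N(N + 1)/2 + 1 is just the positive-interval formula (1/2)N^2 + (1/2)N + 1 written with integer arithmetic.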
Restriction of Break Point (1/2)
what ‘must be true’ when minimum break point k = 2

• N = 1: every m_H(N) = 2 by definition
• N = 2: every m_H(N) < 4 by definition (so maximum possible = 3)

maximum possible m_H(N) when N = 3 and k = 2? Try adding dichotomies one by one, checking each time whether any two points become shattered:

x1 x2 x3
◦  ◦  ◦
◦  ◦  ×
◦  ×  ◦
×  ◦  ◦

These 4 dichotomies shatter no pair of points. (The alternative fourth dichotomy ◦ × × would shatter (x2, x3).) Adding any fifth dichotomy, such as × ◦ ×, × × ◦, or × × ×, shatters some pair of points.

maximum possible so far: 4 dichotomies :-(
Restriction of Break Point (2/2)
what ‘must be true’ when minimum break point k = 2

• N = 1: every m_H(N) = 2 by definition
• N = 2: every m_H(N) < 4 by definition (so maximum possible = 3)
• N = 3: maximum possible = 4 < 2^3

break point k restricts maximum possible m_H(N) a lot for N > k

idea: m_H(N) ≤ maximum possible m_H(N) given k ≤ poly(N)
Fun Time
When minimum break point k = 1, what is the maximum possible m_H(N) when N = 3?
1: 1
2: 2
3: 4
4: 8

Reference Answer: 1
Because k = 1, the hypothesis set cannot even shatter one point. Thus, no ‘column’ of the table can contain both ◦ and ×. Then, after including the first dichotomy, it is not possible to include any other different dichotomy: for instance, after ◦ × ◦, the dichotomy ◦ × × would make column x3 contain both symbols. Thus, the maximum possible m_H(N) is 1.
Theory of Generalization Bounding Function: Basic Cases

Bounding Function
bounding function B(N, k): maximum possible m_H(N) when break point = k

• combinatorial quantity: maximum number of length-N vectors over (◦, ×) with ‘no shatter’ of any length-k subvector
• irrelevant of the details of H, e.g. B(N, 3) bounds both
  • positive intervals (k = 3)
  • 1D perceptrons (k = 3)

new goal: B(N, k) ≤ poly(N)?
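The definition can be verified by exhaustive search for tiny N: enumerate every subset of the 2^N possible dichotomies and keep the largest subset in which no k points are shattered. A brute-force sketch (runtime is exponential in 2^N, so it is only for checking small table entries):

```python
from itertools import combinations, product

def shatters(dichotomies, cols):
    """True if the dichotomies realize all 2^|cols| patterns on the given columns."""
    patterns = {tuple(d[c] for c in cols) for d in dichotomies}
    return len(patterns) == 2 ** len(cols)

def B_bruteforce(N, k):
    """Largest set of {0,1}^N vectors shattering no k columns (tiny N only)."""
    all_dich = list(product([0, 1], repeat=N))
    best = 0
    # enumerate every subset of the 2^N dichotomies
    for mask in range(1, 2 ** len(all_dich)):
        subset = [d for i, d in enumerate(all_dich) if mask >> i & 1]
        if len(subset) <= best:
            continue
        if not any(shatters(subset, cols)
                   for cols in combinations(range(N), k)):
            best = len(subset)
    return best

print(B_bruteforce(2, 2))  # 3
print(B_bruteforce(3, 2))  # 4
```

This reproduces B(2, 2) = 3 and B(3, 2) = 4 from the ‘pictorial’ proof.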
Table of Bounding Function (1/4)

B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6  ...
N=1
N=2             3
N=3             4
N=4
N=5
N=6
...

Known
• B(2, 2) = 3 (maximum < 4)
• B(3, 2) = 4 (‘pictorial’ proof previously)
Table of Bounding Function (2/4)

B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6  ...
N=1        1
N=2        1    3
N=3        1    4
N=4        1
N=5        1
N=6        1
...

Known
• B(N, 1) = 1 (see previous quiz)

Table of Bounding Function (3/4)
B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6  ...
N=1        1    2    2    2    2    2   ...
N=2        1    3    4    4    4    4   ...
N=3        1    4         8    8    8   ...
N=4        1                   16   16  ...
N=5        1                        32  ...
N=6        1                            ...
...

Known
• B(N, k) = 2^N for N < k: including all dichotomies still does not violate the ‘breaking condition’
Table of Bounding Function (4/4)

B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6  ...
N=1        1    2    2    2    2    2   ...
N=2        1    3    4    4    4    4   ...
N=3        1    4    7    8    8    8   ...
N=4        1              15   16   16  ...
N=5        1                   31   32  ...
N=6        1                        63  ...
...

Known
• B(N, k) = 2^N − 1 for N = k: removing a single dichotomy satisfies the ‘breaking condition’

more than halfway done! :-)
Fun Time
For the 2D perceptrons, which of the following claims is true?
1: minimum break point k = 2
2: m_H(4) = 15
3: m_H(N) < B(N, k) when N = k = minimum break point
4: m_H(N) > B(N, k) when N = k = minimum break point

Reference Answer: 3
As discussed previously, the minimum break point for 2D perceptrons is 4, with m_H(4) = 14. Also, note that B(4, 4) = 15. So the bounding function B(N, k) can be ‘loose’ in bounding m_H(N).

Theory of Generalization Bounding Function: Inductive Cases
Estimating B(4, 3)

B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6  ...
N=1        1    2    2    2    2    2   ...
N=2        1    3    4    4    4    4   ...
N=3        1    4    7    8    8    8   ...
N=4        1         ?    15   16   16  ...
N=5        1                   31   32  ...
N=6        1                        63  ...
...

Motivation
• B(4, 3) shall be related to B(3, ?): ‘adding’ one point from B(3, ?)

next: reduce B(4, 3) to B(3, ?)
‘Achieving’ Dichotomies of B(4, 3)
after checking all 2^(2^4) sets of dichotomies, the winner is . . .

     x1 x2 x3 x4
01   ◦  ◦  ◦  ◦
02   ×  ◦  ◦  ◦
03   ◦  ×  ◦  ◦
04   ◦  ◦  ×  ◦
05   ◦  ◦  ◦  ×
06   ×  ×  ◦  ×
07   ×  ◦  ×  ◦
08   ×  ◦  ◦  ×
09   ◦  ×  ×  ◦
10   ◦  ×  ◦  ×
11   ◦  ◦  ×  ×

so B(4, 3) = 11 fills the table:

B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6
N=1        1    2    2    2    2    2
N=2        1    3    4    4    4    4
N=3        1    4    7    8    8    8
N=4        1         11   15   16   16
N=5        1                   31   32
N=6        1                        63

how to reduce B(4, 3) to B(3, ?) cases?
Reorganized Dichotomies of B(4, 3)
reorder the 11 winning dichotomies so that rows sharing the same (x1, x2, x3) sit together:

     x1 x2 x3 x4
01   ◦  ◦  ◦  ◦
05   ◦  ◦  ◦  ×
02   ×  ◦  ◦  ◦
08   ×  ◦  ◦  ×
03   ◦  ×  ◦  ◦
10   ◦  ×  ◦  ×
04   ◦  ◦  ×  ◦
11   ◦  ◦  ×  ×
06   ×  ×  ◦  ×
07   ×  ◦  ×  ◦
09   ◦  ×  ×  ◦

orange (first eight rows): pairs that differ only in x4; purple (last three rows): singles
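The reorganization can be verified programmatically: group the 11 dichotomies by their (x1, x2, x3) part and count pairs (α) versus singles (β). A quick sketch, encoding ◦ as 'o' and × as 'x':

```python
from collections import Counter

# the 11 dichotomies achieving B(4, 3) = 11, copied from the table above
dichotomies = [
    "oooo", "xooo", "oxoo", "ooxo", "ooox", "xxox",
    "xoxo", "xoox", "oxxo", "oxox", "ooxx",
]

counts = Counter(d[:3] for d in dichotomies)       # group on (x1, x2, x3)
alpha = sum(1 for c in counts.values() if c == 2)  # rows whose x4 appears paired
beta = sum(1 for c in counts.values() if c == 1)   # rows whose x4 appears once
print(alpha, beta, 2 * alpha + beta)  # 4 3 11
```

So α = 4, β = 3, and indeed 2α + β = 11, matching the decomposition used next.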
Estimating Part of B(4, 3) (1/2)
B(4, 3) = 11 = 2α + β

projected onto (x1, x2, x3):

     x1 x2 x3
α:   ◦  ◦  ◦
     ×  ◦  ◦
     ◦  ×  ◦
     ◦  ◦  ×
β:   ×  ×  ◦
     ×  ◦  ×
     ◦  ×  ×

• α + β: number of distinct dichotomies on (x1, x2, x3)
• B(4, 3) ‘no shatter’ of any 3 inputs =⇒ the α + β projected dichotomies ‘no shatter’ of any 3 of (x1, x2, x3)

α + β ≤ B(3, 3)
Estimating Part of B(4, 3) (2/2)
B(4, 3) = 11 = 2α + β

     x1 x2 x3
α:   ◦  ◦  ◦
     ×  ◦  ◦
     ◦  ×  ◦
     ◦  ◦  ×

• α: dichotomies on (x1, x2, x3) with x4 paired
• B(4, 3) ‘no shatter’ of any 3 inputs =⇒ α ‘no shatter’ of any 2 of (x1, x2, x3), because each α dichotomy appears with both values of x4: shattering 2 of (x1, x2, x3) would then shatter those 2 points plus x4

α ≤ B(3, 2)

Putting It All Together
B(4, 3) = 2α + β
α + β ≤ B(3, 3)
α ≤ B(3, 2)
⇒ B(4, 3) ≤ B(3, 3) + B(3, 2)

B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6
N=1        1    2    2    2    2    2
N=2        1    3    4    4    4    4
N=3        1    4    7    8    8    8
N=4        1   ≤5    11   15   16   16
N=5        1   ≤6   ≤16  ≤26   31   32
N=6        1   ≤7   ≤22  ≤42  ≤57   63

now have upper bound of bounding function
Putting It All Together
B(N, k) = 2α + β
α + β ≤ B(N − 1, k)
α ≤ B(N − 1, k − 1)
⇒ B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1)

B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6
N=1        1    2    2    2    2    2
N=2        1    3    4    4    4    4
N=3        1    4    7    8    8    8
N=4        1   ≤5    11   15   16   16
N=5        1   ≤6   ≤16  ≤26   31   32
N=6        1   ≤7   ≤22  ≤42  ≤57   63

now have upper bound of bounding function

Bounding Function: The Theorem
B(N, k) ≤ Σ_{i=0}^{k−1} (N choose i)    (highest-order term: N^(k−1))

• simple induction using the boundary and inductive formulas
• for fixed k, B(N, k) is upper bounded by poly(N) =⇒ m_H(N) is poly(N) if a break point exists

‘≤’ can actually be ‘=’;
go play and prove it if you are a math lover! :-)
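Both the inductive upper bound and the closed form can be tabulated in a few lines; comparing them for small N and k illustrates that the ‘≤’ is indeed an ‘=’. A sketch:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B_upper(N, k):
    """Upper bound on B(N, k) from the boundary and inductive formulas."""
    if k == 1:
        return 1             # cannot even shatter one point
    if N < k:
        return 2 ** N        # all dichotomies allowed
    if N == k:
        return 2 ** N - 1    # remove a single dichotomy
    return B_upper(N - 1, k) + B_upper(N - 1, k - 1)

def B_closed(N, k):
    """Closed form: sum of (N choose i) for i = 0 .. k-1."""
    return sum(comb(N, i) for i in range(k))

# the two formulas coincide (Pascal's rule drives the induction)
assert all(B_upper(N, k) == B_closed(N, k)
           for N in range(1, 12) for k in range(1, 12))
print(B_upper(4, 3))  # 11
```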
The Three Break Points
B(N, k) ≤ Σ_{i=0}^{k−1} (N choose i)    (highest-order term: N^(k−1))

• positive rays: m_H(N) = N + 1 ≤ N + 1
  m_H(2) = 3 < 2^2: break point at 2
• positive intervals: m_H(N) = (1/2)N^2 + (1/2)N + 1 ≤ (1/2)N^2 + (1/2)N + 1
  m_H(3) = 7 < 2^3: break point at 3
• 2D perceptrons: m_H(N) = ? ≤ (1/6)N^3 + (5/6)N + 1
  m_H(4) = 14 < 2^4: break point at 4

can bound m_H(N) by only one break point
Fun Time
For 1D perceptrons (positive and negative rays), we know that m_H(N) = 2N. Let k be the minimum break point. Which of the following is not true?
1: k = 3
2: for some integers N > 0, m_H(N) = Σ_{i=0}^{k−1} (N choose i)
3: for all integers N > 0, m_H(N) = Σ_{i=0}^{k−1} (N choose i)
4: for all integers N > 2, m_H(N) < Σ_{i=0}^{k−1} (N choose i)

Reference Answer: 3
The proof follows directly from the definitions. For 2, N = 1 or 2 gives the equality. One thing to notice is 4: the upper bound can be ‘loose’.
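The claims above can be checked numerically; a minimal sketch assuming m_H(N) = 2N and minimum break point k = 3 for 1D perceptrons:

```python
from math import comb

def mH(N):
    """Growth function of 1D perceptrons (positive and negative rays)."""
    return 2 * N

def bound(N, k=3):
    """Bounding-function upper bound: sum of (N choose i) for i < k."""
    return sum(comb(N, i) for i in range(k))

# equality holds only at N = 1 and N = 2; the bound is loose afterwards
print([(N, mH(N), bound(N)) for N in range(1, 5)])
```

Running it shows m_H(1) = bound(1) = 2 and m_H(2) = bound(2) = 4, then strict inequality from N = 3 on, which is exactly why choice 3 is the false statement.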
Theory of Generalization A Pictorial Proof
BAD Bound for General H
want:
P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε] ≤ 2 · 2 m_H(2N) · exp(−2 · (1/16) ε² N)

actually, when N is large enough,
P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε] ≤ 2 · 2 m_H(2N) · exp(−2 · (1/16) ε² N)

next: sketch of proof

Step 1: Replace E_out by E'_in
(1/2) P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε] ≤ P[∃h ∈ H s.t. |E_in(h) − E'_in(h)| > ε/2]

• E_in(h) takes finitely many values, E_out(h) infinitely many: replace the evil E_out first
• how? sample a verification set D' of size N to calculate E'_in
• BAD h of E_in − E_out probably =⇒ BAD h of E_in − E'_in

[figure: probability distributions of E_in and E'_in, both concentrating around E_out]

evil E_out removed by verification with ‘ghost data’
Step 2: Decompose H by Kind

BAD ≤ 2 P[∃h ∈ H s.t. |E_in(h) − E'_in(h)| > ε/2] ≤ 2 m_H(2N) P[fixed h s.t. |E_in(h) − E'_in(h)| > ε/2]

• E_in with D, E'_in with D': now m_H comes to play
• how? infinite H becomes |H(x_1, . . . , x_N, x'_1, . . . , x'_N)| kinds
• union bound on m_H(2N) kinds

[figure: space of data sets D under (a) Hoeffding Inequality, (b) Union Bound, (c) Now]

use m_H(2N) to calculate BAD-overlap properly
Step 3: Use Hoeffding without Replacement

BAD ≤ 2 m_H(2N) P[fixed h s.t. |E_in(h) − E'_in(h)| > ε/2] ≤ 2 m_H(2N) · 2 exp(−2 (ε/4)² N)

• consider a bin of 2N examples; choose N for E_in, leave the others for E'_in
• |E_in − E'_in| > ε/2 ⇔ |E_in − (E_in + E'_in)/2| > ε/4
• so? just a ‘smaller bin’, a ‘smaller ε’, and Hoeffding without replacement

[figure: sampling N of the 2N examples (top) for E_in from a small bin (bottom)]

use Hoeffding after zooming to a fixed h
That’s All!
Vapnik-Chervonenkis (VC) bound:
P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε] ≤ 4 m_H(2N) exp(−(1/8) ε² N)

• replace E_out by E'_in
• decompose H by kind
• use Hoeffding without replacement

2D perceptrons:
• break point? 4
• m_H(N)? O(N^3)

learning with 2D perceptrons is feasible! :-)
Fun Time
For positive rays, m_H(N) = N + 1. Plug it into the VC bound for ε = 0.1 and N = 10000. What is the VC bound of BAD events?

P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε] ≤ 4 m_H(2N) exp(−(1/8) ε² N)

1: 2.77 × 10^−87
2: 5.54 × 10^−83
3: 2.98 × 10^−1
4: 2.29 × 10^2

Reference Answer: 3
Simple calculation. Note that the BAD probability bound is not very small even with 10000 examples.
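The answer can be reproduced numerically; a small sketch plugging m_H(2N) = 2N + 1 for positive rays into the VC bound:

```python
import math

def vc_bound(mH_2N, eps, N):
    """VC bound on the BAD-event probability: 4 * m_H(2N) * exp(-eps^2 * N / 8)."""
    return 4 * mH_2N * math.exp(-(1 / 8) * eps ** 2 * N)

N, eps = 10000, 0.1
print(round(vc_bound(2 * N + 1, eps, N), 3))  # 0.298
```

So even with 10000 examples, the bound only guarantees the BAD probability is below about 0.3, matching choice 3.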