Machine Learning Foundations
(機器學習基石)
Lecture 5: Training versus Testing
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
Training versus Testing

Roadmap
1. When Can Machines Learn?
   Lecture 4: Feasibility of Learning (learning is PAC-possible if there is enough statistical data and |H| is finite)
2. Why Can Machines Learn?
   Lecture 5: Training versus Testing
   (Recap and Preview; Effective Number of Lines; Effective Number of Hypotheses; Break Point)
3. How Can Machines Learn?
4. How Can Machines Learn Better?
Recap and Preview

Recap: the 'Statistical' Learning Flow

If |H| = M is finite and N is large enough, then for whatever g is picked by A, E_out(g) ≈ E_in(g).
If A finds one g with E_in(g) ≈ 0, the PAC guarantee yields E_out(g) ≈ 0, so learning is possible :-)

The learning flow:
• unknown target function f: X → Y (the ideal credit approval formula)
• training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in the bank), with x_1, x_2, ..., x_N drawn from an unknown distribution P on X
• learning algorithm A, searching the hypothesis set H (the set of candidate formulas)
• final hypothesis g ≈ f (the 'learned' formula to be used)

In short: E_out(g) ≈ E_in(g) (guaranteed by the 'test' side) and E_in(g) ≈ 0 (achieved on the 'train' side).
Two Central Questions

For batch & supervised binary classification (lecture 3), g ≈ f (lecture 1) ⟺ E_out(g) ≈ 0, achieved through E_out(g) ≈ E_in(g) (lecture 4) and E_in(g) ≈ 0 (lecture 2).

Learning thus splits into two central questions:
1. can we make sure that E_out(g) is close enough to E_in(g)?
2. can we make E_in(g) small enough?

What role does M = |H| play for the two questions?
Trade-off on M

1. can we make sure that E_out(g) is close enough to E_in(g)?
2. can we make E_in(g) small enough?

• small M: (1) Yes!, since P[BAD] ≤ 2 · M · exp(...) is small; (2) No!, too few choices
• large M: (1) No!, since P[BAD] ≤ 2 · M · exp(...) may be large; (2) Yes!, many choices

Using the right M (or H) is important. Is M = ∞ doomed?
Preview

Known:
P[|E_in(g) − E_out(g)| > ε] ≤ 2 · M · exp(−2ε²N)

Todo:
• establish a finite quantity m_H that replaces M:
  P[|E_in(g) − E_out(g)| > ε] ≤? 2 · m_H · exp(−2ε²N)
• justify the feasibility of learning for infinite M
• study m_H to understand its trade-off for the 'right' H, just like M

The mysterious PLA will then be fully resolved, after 3 more lectures :-)
Fun Time
Data size: how large do we need?

One way to use the inequality
  P[|E_in(g) − E_out(g)| > ε] ≤ 2 · M · exp(−2ε²N) = δ
is to pick a tolerable difference ε as well as a tolerable BAD probability δ, and then gather data of size N large enough to achieve those tolerance criteria. Let ε = 0.1, δ = 0.05, and M = 100. What is the data size needed?
1. 215
2. 415
3. 615
4. 815

Reference Answer: 2
We can simply express N as a function of those 'known' variables: the needed N = (1/(2ε²)) ln(2M/δ) ≈ 415.
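A minimal Python sketch of this calculation (the function name and the rounding-up choice are mine, not from the lecture):

```python
import math

def sample_complexity(epsilon, delta, M):
    """Smallest N with 2 * M * exp(-2 * epsilon**2 * N) <= delta,
    i.e. N >= ln(2 * M / delta) / (2 * epsilon**2)."""
    return math.ceil(math.log(2 * M / delta) / (2 * epsilon ** 2))

print(sample_complexity(epsilon=0.1, delta=0.05, M=100))  # 415
```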
Effective Number of Lines

Where Did M Come From?

P[|E_in(g) − E_out(g)| > ε] ≤ 2 · M · exp(−2ε²N)

• BAD events B_m: |E_in(h_m) − E_out(h_m)| > ε
• to give A freedom of choice: bound P[B_1 or B_2 or ... or B_M]
• worst case, all B_m non-overlapping (union bound):
  P[B_1 or B_2 or ... or B_M] ≤ P[B_1] + P[B_2] + ... + P[B_M]

What did the union bound fail to consider, especially for M = ∞?
Where Did Union Bound Fail?

union bound: P[B_1] + P[B_2] + ... + P[B_M]

• the BAD events B_m: |E_in(h_m) − E_out(h_m)| > ε overlap for similar hypotheses h_1 ≈ h_2. Why?
  1. E_out(h_1) ≈ E_out(h_2)
  2. for most D, E_in(h_1) = E_in(h_2)
• so the union bound over-estimates, as in the overlapping regions B_1, B_2, B_3 pictured on the slide

To account for the overlap, can we group similar hypotheses by kind?
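A tiny simulation of the 'similar hypotheses' point, under assumed details that are not in the lecture (inputs uniform on [0, 1], two thresholds 0.0001 apart): on most datasets the two hypotheses predict identically, so E_in(h_1) = E_in(h_2) and their BAD events essentially coincide, which is exactly what adding their probabilities ignores.

```python
import random

rng = random.Random(0)
h1 = lambda x: x > 0.5       # two 'similar' threshold hypotheses
h2 = lambda x: x > 0.5001    # (hypothetical example, not from the slides)

trials, N, same = 10_000, 20, 0
for _ in range(trials):
    D = [rng.random() for _ in range(N)]
    same += all(h1(x) == h2(x) for x in D)

# close to 1: the two hypotheses agree on every point of most datasets D
print(same / trials)
```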
How Many Lines Are There? (1/2)

H = {all lines in R²}

• how many lines? ∞
• how many kinds of lines, if viewed from one input vector x_1?
• 2 kinds: the h_1-like lines with h(x_1) = ◦, or the h_2-like lines with h(x_1) = ×
How Many Lines Are There? (2/2)

H = {all lines in R²}

• how many kinds of lines, if viewed from two inputs x_1, x_2?
• 4 kinds: ◦◦, ◦×, ×◦, ××

one input: 2 kinds; two inputs: 4 kinds; three inputs?
How Many Kinds of Lines for Three Inputs? (1/2)

H = {all lines in R²}, viewed from three inputs x_1, x_2, x_3

8 kinds: ◦◦◦, ×××, ◦◦×, ××◦, ×◦◦, ◦××, ×◦×, ◦×◦

always 8 for three inputs?
How Many Kinds of Lines for Three Inputs? (2/2)

H = {all lines in R²}, viewed from another three inputs x_1, x_2, x_3

'fewer than 8' when the inputs are degenerate (e.g. collinear or identical inputs):
6 kinds: ◦◦◦, ×××, ◦◦×, ××◦, ×◦◦, ◦××
How Many Kinds of Lines for Four Inputs?

H = {all lines in R²}, viewed from four inputs x_1, x_2, x_3, x_4

for any four inputs, at most 14 kinds: for the four inputs pictured, the two 'XOR-like' labelings (diagonally opposite inputs sharing a label) cannot be produced by a single line, leaving 16 − 2 = 14
Effective Number of Lines

maximum number of kinds of lines with respect to N inputs x_1, x_2, ..., x_N ⟺ the effective number of lines, effective(N)

• must be ≤ 2^N (why?)
• a finite 'grouping' of the infinitely many lines in H
• wish: P[|E_in(g) − E_out(g)| > ε] ≤ 2 · effective(N) · exp(−2ε²N)

lines in 2D:
  N   effective(N)
  1   2
  2   4
  3   8
  4   14 < 2^N

If (1) effective(N) can replace M, and (2) effective(N) ≪ 2^N, then learning is possible even with infinitely many lines :-)
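The table above can be checked numerically. Here is a rough Python sketch (my own construction, not from the lecture) that samples many random lines sign(w_0 + w_1·x + w_2·y) and counts the distinct ◦/× patterns they produce on fixed inputs; the point placements are illustrative and assumed to be in general position.

```python
import random

def num_line_kinds(points, trials=200_000, seed=0):
    """Count the distinct label patterns that randomly sampled lines produce on the points."""
    rng = random.Random(seed)
    patterns = set()
    for _ in range(trials):
        w0, w1, w2 = (rng.gauss(0, 1) for _ in range(3))
        patterns.add(tuple(w0 + w1 * x + w2 * y > 0 for x, y in points))
    return len(patterns)

general_position = [(0.0, 0.0), (1.0, 0.2), (0.3, 1.0), (1.1, 1.3)]
for n in range(1, 5):
    print(n, num_line_kinds(general_position[:n]))  # typically 2, 4, 8, 14, as in the table
```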
Fun Time
What is the effective number of lines for five inputs in R²?
1. 14
2. 16
3. 22
4. 32

Reference Answer: 3
If you put the inputs roughly around a circle, you can pick any consecutive inputs to be on one side of the line and the remaining inputs to be on the other side. The procedure leads to effectively 22 kinds of lines, which is much smaller than 2^5 = 32. You shall find it difficult to generate more kinds by varying the inputs; a formal proof will come in future lectures.
(figure: five inputs x_1, ..., x_5 placed around a circle)
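A short sketch of the counting argument in this answer (my own enumeration, not the lecture's): for inputs in convex position, a line puts a consecutive 'arc' of inputs on its ◦ side and the rest on its × side, so counting arcs counts the kinds of lines.

```python
def arc_kinds(n):
    """Labelings of n inputs in convex position whose o's form a consecutive arc
    (including the all-o and all-x labelings): n * (n - 1) + 2 of them."""
    labelings = {('x',) * n, ('o',) * n}
    for start in range(n):
        for length in range(1, n):
            labels = ['x'] * n
            for k in range(length):
                labels[(start + k) % n] = 'o'
            labelings.add(tuple(labels))
    return len(labelings)

print([arc_kinds(n) for n in range(2, 6)])  # [4, 8, 14, 22]: matches the table and this answer
```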
Effective Number of Hypotheses

Dichotomies: Mini-hypotheses

H = {hypothesis h: X → {×, ◦}}

• call h(x_1, x_2, ..., x_N) = (h(x_1), h(x_2), ..., h(x_N)) ∈ {×, ◦}^N a dichotomy: a hypothesis 'limited' to the eyes of x_1, x_2, ..., x_N
• H(x_1, x_2, ..., x_N): all dichotomies 'implemented' by H on x_1, x_2, ..., x_N

            hypotheses H            dichotomies H(x_1, x_2, ..., x_N)
  e.g.      all lines in R²         {◦◦◦◦, ◦◦◦×, ◦◦××, ...}
  size      possibly infinite       upper bounded by 2^N

|H(x_1, x_2, ..., x_N)|: a candidate for replacing M
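In code, H(x_1, ..., x_N) is simply a set of label tuples. A toy sketch with a few hand-picked 1D threshold hypotheses (an illustrative example of the definition, not the lecture's):

```python
def dichotomies(hypotheses, points):
    """H(x_1, ..., x_N): the distinct tuples (h(x_1), ..., h(x_N)); at most 2^N of them."""
    return {tuple(h(x) for x in points) for h in hypotheses}

# three hypothetical threshold hypotheses, viewed through three inputs
hs = [lambda x, a=a: 'o' if x > a else 'x' for a in (0.5, 1.5, 2.5)]
print(dichotomies(hs, [1.0, 2.0, 3.0]))
# a set of 3 dichotomies: ('o','o','o'), ('x','o','o'), ('x','x','o')
```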
Growth Function

• |H(x_1, x_2, ..., x_N)| depends on the particular inputs (x_1, x_2, ..., x_N)
• growth function: remove the dependence by taking the max over all possible inputs,
  m_H(N) = max_{x_1, x_2, ..., x_N ∈ X} |H(x_1, x_2, ..., x_N)|
• finite, upper bounded by 2^N

lines in 2D:
  N   m_H(N)
  1   2
  2   4
  3   max(..., 6, 8) = 8
  4   14 < 2^N

how to 'calculate' the growth function?
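A rough way to see the 'max over inputs' in action for 2D lines, reusing the random-line sampling idea from the earlier sketch (the two placements below are my own examples): a degenerate collinear triple gives only 6 dichotomies, while a triple in general position gives the maximum 8, so m_H(3) = 8.

```python
import random

def num_line_kinds(points, trials=100_000, seed=0):
    """Distinct label patterns produced by random lines w0 + w1*x + w2*y = 0 on the points."""
    rng = random.Random(seed)
    patterns = set()
    for _ in range(trials):
        w0, w1, w2 = (rng.gauss(0, 1) for _ in range(3))
        patterns.add(tuple(w0 + w1 * x + w2 * y > 0 for x, y in points))
    return len(patterns)

print(num_line_kinds([(0, 0), (1, 0), (2, 0)]))  # typically 6: a collinear placement
print(num_line_kinds([(0, 0), (1, 0), (0, 1)]))  # typically 8: general position, so m_H(3) = 8
```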
Growth Function for Positive Rays

(figure: 1D inputs x_1, x_2, x_3, ..., x_N on a line, with h(x) = −1 to the left of the threshold a and h(x) = +1 to the right)

• X = R (one dimensional)
• H contains the hypotheses h(x) = sign(x − a) for every threshold a
• the 'positive half' of the 1D perceptrons

one dichotomy for a in each spot between consecutive inputs (plus the two outer spots), so m_H(N) = N + 1, and N + 1 ≪ 2^N when N is large!

e.g. on x_1, x_2, x_3, x_4:
◦ ◦ ◦ ◦
× ◦ ◦ ◦
× × ◦ ◦
× × × ◦
× × × ×
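A direct enumeration sketch (the threshold placements are my own choice, not the lecture's): put the threshold below all inputs, in each internal gap, and above all inputs, then collect the dichotomies.

```python
def positive_ray_dichotomies(xs):
    """Dichotomies of h(x) = sign(x - a) on sorted distinct 1D inputs: N + 1 of them."""
    xs = sorted(xs)
    gaps = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple('o' if x > a else 'x' for x in xs) for a in gaps}

print(len(positive_ray_dichotomies([1.0, 2.0, 3.0, 4.0])))  # 5 = N + 1 for N = 4
```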
Growth Function for Positive Intervals

(figure: 1D inputs x_1, x_2, x_3, ..., x_N on a line, with h(x) = −1, then +1 inside the interval, then −1 again)

• X = R (one dimensional)
• H contains the hypotheses h(x) = +1 iff x ∈ [ℓ, r), −1 otherwise

one dichotomy for each 'interval kind':
m_H(N) = (N + 1 choose 2) + 1 = (1/2)N² + (1/2)N + 1
(the two interval ends fall in two of the N + 1 spots, plus 1 for the all-× case), and (1/2)N² + (1/2)N + 1 ≪ 2^N when N is large!

e.g. the 11 dichotomies on x_1, x_2, x_3, x_4:
◦ × × ×, ◦ ◦ × ×, ◦ ◦ ◦ ×, ◦ ◦ ◦ ◦, × ◦ × ×, × ◦ ◦ ×, × ◦ ◦ ◦, × × ◦ ×, × × ◦ ◦, × × × ◦, × × × ×
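A matching enumeration sketch (again with my own choice of candidate interval ends): pick the two ends among the N + 1 gaps, and add the empty interval.

```python
from itertools import combinations

def positive_interval_dichotomies(xs):
    """Dichotomies of h(x) = +1 iff x in [l, r) on sorted distinct 1D inputs."""
    xs = sorted(xs)
    gaps = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    dichos = {('x',) * len(xs)}                      # the all-'x' (empty interval) case
    for l, r in combinations(gaps, 2):
        dichos.add(tuple('o' if l <= x < r else 'x' for x in xs))
    return dichos

print(len(positive_interval_dichotomies([1.0, 2.0, 3.0, 4.0])))  # 11 = C(5, 2) + 1
```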
Growth Function for Convex Sets (1/2)

(figures: a convex region versus a non-convex region in the plane)

• X = R² (two dimensional)
• H contains the hypotheses h(x) = +1 iff x lies inside a convex region, −1 otherwise

what is m_H(N)?
Growth Function for Convex Sets (2/2)

• one possible set of N inputs: x_1, x_2, ..., x_N placed on a big circle
• every dichotomy can then be implemented by H, using a convex region slightly extended from the contour of the positive inputs, so m_H(N) = 2^N
• we call such N inputs 'shattered' by H

m_H(N) = 2^N ⟺ there exist N inputs that can be shattered
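A sketch that verifies the shattering claim for a small N (the circle placement and the hull test are my own construction, not the lecture's): for inputs on a circle, the convex hull of any subset contains none of the other inputs, so the hypothesis '+1 iff x is inside that hull' realizes every labeling.

```python
import math
from itertools import product

def in_convex_hull(p, hull):
    """Is p inside (or on) the convex polygon whose vertices are listed counter-clockwise?"""
    if len(hull) == 1:
        return p == hull[0]
    for (ax, ay), (bx, by) in zip(hull, hull[1:] + hull[:1]):
        if (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax) < 0:
            return False
    return True

N = 5
circle = [(math.cos(2 * math.pi * k / N), math.sin(2 * math.pi * k / N)) for k in range(N)]
shattered = True
for labels in product([True, False], repeat=N):
    positives = [x for x, lab in zip(circle, labels) if lab]
    # the convex-set hypothesis: +1 iff the input lies in the hull of the positives
    realized = tuple(in_convex_hull(x, positives) if positives else False for x in circle)
    shattered = shattered and (realized == labels)
print(shattered)  # True: all 2^N labelings are realized, so these N inputs are shattered
```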
Fun Time
Consider positive and negative rays as H, which is equivalent to the perceptron hypothesis set in 1D. The hypothesis set is often called the 'decision stump' to describe the shape of its hypotheses. What is the growth function m_H(N)?
1. N
2. N + 1
3. 2N
4. 2^N

Reference Answer: 3
Two dichotomies when the threshold falls in each of the N − 1 'internal' spots, plus two dichotomies for the all-◦ and all-× cases: 2(N − 1) + 2 = 2N.
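A quick enumeration sketch mirroring that count (the threshold placement is my own choice):

```python
def decision_stump_dichotomies(xs):
    """Dichotomies of h(x) = s * sign(x - a), s in {+1, -1}, on sorted distinct 1D inputs."""
    xs = sorted(xs)
    gaps = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple('o' if s * (x - a) > 0 else 'x' for x in xs)
            for a in gaps for s in (+1, -1)}

print(len(decision_stump_dichotomies([1.0, 2.0, 3.0, 4.0])))  # 8 = 2N for N = 4
```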
Break Point

The Four Growth Functions

• positive rays: m_H(N) = N + 1
• positive intervals: m_H(N) = (1/2)N² + (1/2)N + 1
• convex sets: m_H(N) = 2^N
• 2D perceptrons: m_H(N) < 2^N in some cases

What if m_H(N) replaces M?
P[|E_in(g) − E_out(g)| > ε] ≤? 2 · m_H(N) · exp(−2ε²N)
polynomial: good; exponential: bad
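To see why polynomial is good and exponential is bad, one can compare the logarithm of the hoped-for bound for a polynomial m_H(N) against m_H(N) = 2^N (a quick numerical sketch with an assumed ε = 0.1):

```python
import math

def log_bound(log_m, N, epsilon=0.1):
    """log of the hoped-for bound 2 * m_H(N) * exp(-2 * epsilon**2 * N)."""
    return math.log(2) + log_m - 2 * epsilon ** 2 * N

for N in (100, 1_000, 10_000):
    poly = log_bound(math.log(0.5 * N * N + 0.5 * N + 1), N)  # e.g. positive intervals
    expo = log_bound(N * math.log(2), N)                      # e.g. convex sets, m_H(N) = 2^N
    print(N, round(poly, 1), round(expo, 1))
# polynomial m_H(N): the log-bound heads to minus infinity, so the bound goes to 0;
# m_H(N) = 2^N: the log-bound grows with N, so the bound gives no guarantee at all
```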
for 2D or general perceptrons,