## Machine Learning Foundations (機器學習基石)

### Lecture 5: Training versus Testing

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

## Roadmap

### 1 When Can Machines Learn?

Lecture 4: Feasibility of Learning
learning is **PAC-possible** if there is enough **statistical data** and **finite** $|\mathcal{H}|$

### 2 Why Can Machines Learn?

Lecture 5: Training versus Testing

- Recap and Preview
- Effective Number of Lines
- Effective Number of Hypotheses
- Break Point

### 3 How Can Machines Learn?

### 4 How Can Machines Learn Better?

Training versus Testing Recap and Preview

## Recap: the ‘Statistical’ Learning Flow

if $|\mathcal{H}| = M$ finite and $N$ large enough:
for whatever $g$ picked by $\mathcal{A}$, $E_{\text{out}}(g) \approx E_{\text{in}}(g)$

if $\mathcal{A}$ finds one $g$ with $E_{\text{in}}(g) \approx 0$:
PAC guarantee for $E_{\text{out}}(g) \approx 0$ $\Longrightarrow$ **learning possible :-)**

the flow:

- unknown target function $f\colon \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula)
- training examples $\mathcal{D}\colon (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
- learning algorithm $\mathcal{A}$ with hypothesis set $\mathcal{H}$ (set of candidate formulas)
- final hypothesis $g \approx f$ (‘learned’ formula to be used)
- unknown $P$ on $\mathcal{X}$, generating $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N$ and the test point $\mathbf{x}$

$$\underbrace{E_{\text{out}}(g)}_{\text{test}} \approx \underbrace{E_{\text{in}}(g)}_{\text{train}} \approx 0$$

## Two Central Questions

for batch & supervised binary classification:

$$\underbrace{g \approx f}_{\text{lecture 1}} \iff \underbrace{E_{\text{out}}(g) \approx 0}_{\text{lecture 3}}$$

achieved through

$$\underbrace{E_{\text{out}}(g) \approx E_{\text{in}}(g)}_{\text{lecture 4}} \quad\text{and}\quad \underbrace{E_{\text{in}}(g) \approx 0}_{\text{lecture 2}}$$

learning split into two central questions:

1. can we make sure that $E_{\text{out}}(g)$ is close enough to $E_{\text{in}}(g)$?
2. can we make $E_{\text{in}}(g)$ small enough?

what role does $M = |\mathcal{H}|$ play for the two questions?


## Trade-off on M

1. can we make sure that $E_{\text{out}}(g)$ is close enough to $E_{\text{in}}(g)$?
2. can we make $E_{\text{in}}(g)$ small enough?

**small M**: 1) Yes! $\mathbb{P}[\text{BAD}] \le 2 \cdot M \cdot \exp(\ldots)$ is small; 2) No! too few choices

**large M**: 1) No! $\mathbb{P}[\text{BAD}] \le 2 \cdot M \cdot \exp(\ldots)$ is large; 2) Yes! many choices

using the right $M$ (or $\mathcal{H}$) is important;
$M = \infty$ **doomed?**

## Preview

**Known:**

$$\mathbb{P}\left[\,\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\,\right] \le 2 \cdot M \cdot \exp\left(-2\epsilon^2 N\right)$$

**Todo:**

- establish **a finite quantity** that replaces $M$:
  $$\mathbb{P}\left[\,\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\,\right] \overset{?}{\le} 2 \cdot m_{\mathcal{H}} \cdot \exp\left(-2\epsilon^2 N\right)$$
- justify the feasibility of learning for infinite $M$
- study $m_{\mathcal{H}}$ to understand its trade-off for the ‘right’ $\mathcal{H}$, just like $M$

mysterious PLA to be fully resolved **after 3 more lectures :-)**


## Fun Time

### Data size: how large do we need?

One way to use the inequality

$$\mathbb{P}\left[\,\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\,\right] \le \underbrace{2 \cdot M \cdot \exp\left(-2\epsilon^2 N\right)}_{\delta}$$

is to pick a tolerable difference $\epsilon$ as well as a tolerable BAD probability $\delta$, and then gather data with size $N$ large enough to achieve those tolerance criteria. Let $\epsilon = 0.1$, $\delta = 0.05$, and $M = 100$. What is the data size needed?

1. 215
2. 415
3. 615
4. 815

**Reference Answer: 2**

We can simply express $N$ as a function of those ‘known’ variables. Setting $2M\exp(-2\epsilon^2 N) = \delta$ gives

$$N = \frac{1}{2\epsilon^2}\ln\frac{2M}{\delta} = 50 \ln 4000 \approx 414.7,$$

so $N = 415$ examples suffice.
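The calculation above can be checked directly; a minimal sketch (the function name is ours):

```python
import math

def sample_size(eps, delta, M):
    """Smallest N with 2 * M * exp(-2 * eps**2 * N) <= delta
    (Hoeffding bound combined with the union bound over M hypotheses)."""
    return math.ceil(math.log(2 * M / delta) / (2 * eps ** 2))

print(sample_size(0.1, 0.05, 100))  # → 415
```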
## Where Did M Come From?

$$\mathbb{P}\left[\,\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\,\right] \le 2 \cdot M \cdot \exp\left(-2\epsilon^2 N\right)$$

- **BAD events** $\mathcal{B}_m$: $\left|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)\right| > \epsilon$
- to give $\mathcal{A}$ freedom of choice: bound $\mathbb{P}[\mathcal{B}_1 \text{ or } \mathcal{B}_2 \text{ or } \ldots \text{ or } \mathcal{B}_M]$
- worst case: all $\mathcal{B}_m$ non-overlapping, so
  $$\mathbb{P}[\mathcal{B}_1 \text{ or } \ldots \text{ or } \mathcal{B}_M] \underbrace{\le}_{\text{union bound}} \mathbb{P}[\mathcal{B}_1] + \mathbb{P}[\mathcal{B}_2] + \ldots + \mathbb{P}[\mathcal{B}_M]$$

where did the **union bound fail** to consider for $M = \infty$?
Training versus Testing Effective Number of Lines

## Where Did Union Bound Fail?

$$\text{union bound: } \mathbb{P}[\mathcal{B}_1] + \mathbb{P}[\mathcal{B}_2] + \ldots + \mathbb{P}[\mathcal{B}_M]$$

- **BAD events** $\mathcal{B}_m$: $\left|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)\right| > \epsilon$, **overlapping** for similar hypotheses $h_1 \approx h_2$
- why? 1) $E_{\text{out}}(h_1) \approx E_{\text{out}}(h_2)$; 2) for most $\mathcal{D}$, $E_{\text{in}}(h_1) = E_{\text{in}}(h_2)$
- union bound **over-estimating** the total BAD probability

to account for overlap, can we group similar hypotheses by **kind**?
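The over-estimation can be seen in a small Monte Carlo sketch (the setup, the function name, and the 1% disagreement rate are our own illustration): two hypotheses that agree on 99% of inputs have almost identical BAD events, so summing their probabilities roughly double-counts.

```python
import random

def bad_event_rates(n_trials=20000, N=100, eps=0.1, seed=1):
    """Estimate P[B1], P[B2], P[B1 or B2] for two hypotheses with
    E_out = 0.5 that disagree on about 1% of inputs."""
    rng = random.Random(seed)
    b1 = b2 = union = 0
    for _ in range(n_trials):
        errs1 = [rng.random() < 0.5 for _ in range(N)]       # h1's errors on D
        errs2 = [e != (rng.random() < 0.01) for e in errs1]  # h2 nearly equals h1
        bad1 = abs(sum(errs1) / N - 0.5) > eps               # |E_in - E_out| > eps
        bad2 = abs(sum(errs2) / N - 0.5) > eps
        b1 += bad1
        b2 += bad2
        union += bad1 or bad2
    return b1 / n_trials, b2 / n_trials, union / n_trials

p1, p2, pu = bad_event_rates()
print(pu <= p1 + p2, round(pu / (p1 + p2), 2))  # bound holds, but ~2x too loose here
```

Per trial, `bad1 or bad2` can never exceed `bad1 + bad2`, so the union bound always holds; the point is how much slack it leaves when the events overlap.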

## How Many Lines Are There? (1/2)

$$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$$

- how many lines? $\infty$
- how many **kinds of** lines if viewed from one input vector $\mathbf{x}_1$?

**2 kinds**: $h_1$-like with $h(\mathbf{x}_1) = \circ$, or $h_2$-like with $h(\mathbf{x}_1) = \times$

## How Many Lines Are There? (2/2)

$$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$$

- how many **kinds of** lines if viewed from two inputs $\mathbf{x}_1, \mathbf{x}_2$?

**4 kinds**: $\circ\circ$, $\circ\times$, $\times\circ$, $\times\times$

one input: 2; two inputs: 4; **three inputs?**

## How Many Kinds of Lines for Three Inputs? (1/2)

$$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$$

for three inputs $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$:

**8 kinds**: $\circ\circ\circ$, $\circ\circ\times$, $\circ\times\circ$, $\times\circ\circ$, $\circ\times\times$, $\times\circ\times$, $\times\times\circ$, $\times\times\times$

always 8 **for three inputs?**

## How Many Kinds of Lines for Three Inputs? (2/2)

$$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$$

for **another** three inputs $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$:

**‘fewer than 8’** when degenerate (e.g. collinear or identical inputs): with three collinear points, no line can give the middle point a label different from both outer points, so $\circ\times\circ$ and $\times\circ\times$ are impossible, leaving only **6 kinds**.

## How Many Kinds of Lines for Four Inputs?

$$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$$

for four inputs $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4$: **at most 14** for any four inputs

**14 kinds**: out of the $2^4 = 16$ patterns, the two ‘XOR-like’ ones (such as $\circ\times\circ\times$ with opposite corners of a square sharing the same label) cannot be cut out by a single line.

## Effective Number of Lines

maximum kinds of lines with respect to $N$ inputs $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N$ $\iff$ **effective number of lines**, effective($N$)

- must be $\le 2^N$ (each of the $N$ inputs gets one of two labels)
- a finite ‘grouping’ of the infinitely many lines in $\mathcal{H}$
- wish: $\mathbb{P}\left[\,\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\,\right] \le 2 \cdot \text{effective}(N) \cdot \exp\left(-2\epsilon^2 N\right)$

lines in 2D:

| $N$ | effective($N$) |
| --- | --- |
| 1 | 2 |
| 2 | 4 |
| 3 | 8 |
| 4 | 14 $< 2^N$ |

if 1) effective($N$) can replace $M$ and 2) effective($N$) $\ll 2^N$:
**learning possible with infinite lines :-)**
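For inputs in convex position (e.g. on a circle), the positive side of any line picks out exactly a contiguous arc of the inputs, so effective($N$) can be counted by brute force. A sketch under that assumption (the function name is ours):

```python
from itertools import product

def effective_lines(N):
    """effective(N) for lines in 2D with N inputs in convex position:
    the +1 side of any line is a contiguous circular arc of the inputs."""
    def is_arc(labels):
        ones = {i for i, v in enumerate(labels) if v == 1}
        k = len(ones)
        if k in (0, N):
            return True  # all-o and all-x are both realizable
        # contiguous on the circle: some start s with s, s+1, ..., s+k-1 all +1
        return any(all((s + j) % N in ones for j in range(k)) for s in ones)
    return sum(is_arc(labels) for labels in product([0, 1], repeat=N))

print([effective_lines(N) for N in range(1, 6)])  # → [2, 4, 8, 14, 22]
```

The counts match the table above, and the value 22 for $N = 5$ previews the quiz below.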

## Fun Time

### What is the effective number of lines for five inputs $\in \mathbb{R}^2$?

1. 14
2. 16
3. 22
4. 32

**Reference Answer: 3**

If you put the inputs roughly around a circle, you can pick any consecutive inputs to be on one side of the line and the other inputs to be on the other side. The procedure leads to effectively 22 kinds of lines, which is **much smaller than** $2^5 = 32$. You shall find it difficult to generate more kinds by varying the inputs, and we will give a formal proof in future lectures.

Training versus Testing Effective Number of Hypotheses

## Dichotomies: Mini-hypotheses

$$\mathcal{H} = \{\text{hypothesis } h\colon \mathcal{X} \to \{\times, \circ\}\}$$

- call $h(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N) = (h(\mathbf{x}_1), h(\mathbf{x}_2), \ldots, h(\mathbf{x}_N)) \in \{\times, \circ\}^N$ a **dichotomy**: a hypothesis ‘limited’ to the eyes of $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$
- $\mathcal{H}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)$: **all dichotomies ‘implemented’ by $\mathcal{H}$ on $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$**

| | hypotheses $\mathcal{H}$ | dichotomies $\mathcal{H}(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ |
| --- | --- | --- |
| e.g. | all lines in $\mathbb{R}^2$ | $\{\circ\circ\circ\circ,\ \circ\circ\circ\times,\ \circ\circ\times\times,\ \ldots\}$ |
| size | possibly infinite | upper bounded by $2^N$ |

$|\mathcal{H}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)|$: candidate for **replacing $M$**

## Growth Function

- $|\mathcal{H}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)|$: depends on the inputs $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)$
- growth function: remove the dependence by **taking the max over all possible $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)$**:

$$m_{\mathcal{H}}(N) = \max_{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N \in \mathcal{X}} |\mathcal{H}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)|$$

- finite, upper bounded by $2^N$

lines in 2D:

| $N$ | $m_{\mathcal{H}}(N)$ |
| --- | --- |
| 1 | 2 |
| 2 | 4 |
| 3 | $\max(\ldots, 6, 8) = 8$ |
| 4 | 14 $< 2^N$ |

how to ‘calculate’ the growth function?
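One way to get a feel for $|\mathcal{H}(\mathbf{x}_1, \ldots, \mathbf{x}_N)|$ on a concrete input set is to sample many random hypotheses and collect the distinct dichotomies they produce. This is only a lower-bound sketch for 2D perceptrons, not a calculation of $m_{\mathcal{H}}$ itself (the function name is ours):

```python
import random

def sampled_dichotomies(points, n_lines=100_000, seed=0):
    """Lower-bound |H(x_1, ..., x_N)| for 2D perceptrons by sampling
    random lines sign(w0 + w1*x + w2*y) and collecting distinct patterns."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_lines):
        w0, w1, w2 = (rng.gauss(0, 1) for _ in range(3))
        seen.add(tuple(1 if w0 + w1 * x + w2 * y > 0 else -1 for x, y in points))
    return len(seen)

print(sampled_dichotomies([(0, 0), (1, 0), (0, 1)]))  # → 8: three points shattered
```

With enough samples this reliably recovers all realizable dichotomies for small point sets, e.g. 14 (not 16) for the four corners of a square.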


## Growth Function for Positive Rays

- $\mathcal{X} = \mathbb{R}$ (one dimensional)
- $\mathcal{H}$ contains $h$, where **each $h(x) = \text{sign}(x - a)$ for threshold $a$**
- the ‘positive half’ of 1D perceptrons: $h(x) = -1$ left of $a$, $h(x) = +1$ right of $a$

one dichotomy for $a$ in each of the $N + 1$ spots between (and outside) the sorted inputs:

$$m_{\mathcal{H}}(N) = N + 1 \ll 2^N \text{ when } N \text{ large!}$$

e.g. $N = 4$:

| $x_1$ | $x_2$ | $x_3$ | $x_4$ |
| --- | --- | --- | --- |
| ◦ | ◦ | ◦ | ◦ |
| × | ◦ | ◦ | ◦ |
| × | × | ◦ | ◦ |
| × | × | × | ◦ |
| × | × | × | × |
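The count can be verified by enumerating one threshold per ‘spot’; a brute-force sketch (the function name is ours):

```python
def positive_ray_dichotomies(xs):
    """Distinct dichotomies of h(x) = sign(x - a) on inputs xs: one
    threshold per 'spot' between (and outside) the sorted inputs suffices."""
    xs = sorted(xs)
    spots = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(1 if x > a else -1 for x in xs) for a in spots}

print(len(positive_ray_dichotomies([0.3, 1.2, 2.5, 4.0])))  # → 5, i.e. N + 1
```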

## Growth Function for Positive Intervals

- $\mathcal{X} = \mathbb{R}$ (one dimensional)
- $\mathcal{H}$ contains $h$, where **each $h(x) = +1$ iff $x \in [\ell, r)$, $-1$ otherwise**

one dichotomy for each ‘interval kind’:

$$m_{\mathcal{H}}(N) = \underbrace{\binom{N+1}{2}}_{\text{interval ends in } N+1 \text{ spots}} + \underbrace{1}_{\text{all } \times} = \frac{1}{2}N^2 + \frac{1}{2}N + 1 \ll 2^N \text{ when } N \text{ large!}$$

e.g. $N = 4$ (11 dichotomies):

| $x_1$ | $x_2$ | $x_3$ | $x_4$ |
| --- | --- | --- | --- |
| ◦ | × | × | × |
| ◦ | ◦ | × | × |
| ◦ | ◦ | ◦ | × |
| ◦ | ◦ | ◦ | ◦ |
| × | ◦ | × | × |
| × | ◦ | ◦ | × |
| × | ◦ | ◦ | ◦ |
| × | × | ◦ | × |
| × | × | ◦ | ◦ |
| × | × | × | ◦ |
| × | × | × | × |
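The same brute-force check works here: pick the two interval ends among the $N + 1$ spots, plus the empty interval for the all-$\times$ case (the function name is ours):

```python
from itertools import combinations

def positive_interval_dichotomies(xs):
    """Distinct dichotomies of h(x) = +1 iff x in [l, r): choose two
    interval ends among the N + 1 'spots', plus the empty-interval (all -1) case."""
    xs = sorted(xs)
    spots = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    dichotomies = {tuple(-1 for _ in xs)}  # all-x: the interval covers no input
    for l, r in combinations(spots, 2):
        dichotomies.add(tuple(1 if l <= x < r else -1 for x in xs))
    return dichotomies

print(len(positive_interval_dichotomies([1, 2, 3, 4])))  # → 11 = 16/2 + 4/2 + 1
```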


## Growth Function for Convex Sets (1/2)

- $\mathcal{X} = \mathbb{R}^2$ (two dimensional)
- $\mathcal{H}$ contains $h$, where $h(\mathbf{x}) = +1$ iff $\mathbf{x}$ in a **convex region**, $-1$ otherwise

(figure: a convex region versus a non-convex region)

what is $m_{\mathcal{H}}(N)$?

## Growth Function for Convex Sets (2/2)

- one possible set of $N$ inputs: $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ on a big circle
- **every dichotomy can be implemented** by $\mathcal{H}$ using a convex region slightly extended from the contour of the positive inputs

$$m_{\mathcal{H}}(N) = 2^N$$

- call those $N$ inputs **‘shattered’ by $\mathcal{H}$**

$$m_{\mathcal{H}}(N) = 2^N \iff \textbf{exists } N \text{ inputs that can be shattered}$$
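The shattering argument can be checked numerically: for points on a circle, the convex hull of any subset of positive points contains none of the negative points, so the hull itself realizes the dichotomy. A brute-force sketch (the function names are ours):

```python
from itertools import product
import math

def convex_shatters(N):
    """True iff N points on a circle are shattered by convex regions:
    for every labeling, the hull of the +1 points avoids every -1 point."""
    pts = [(math.cos(2 * math.pi * i / N), math.sin(2 * math.pi * i / N))
           for i in range(N)]

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def hull_contains(poly, p):  # poly: counter-clockwise convex polygon
        if len(poly) < 3:
            return False  # a point or a segment has no interior
        return all(cross(poly[i], poly[(i + 1) % len(poly)], p) > 0
                   for i in range(len(poly)))

    for labels in product([1, -1], repeat=N):
        pos = [q for q, v in zip(pts, labels) if v == 1]  # already in ccw order
        if any(hull_contains(pos, q) for q, v in zip(pts, labels) if v == -1):
            return False
    return True

print(convex_shatters(8))  # → True: all 2^8 dichotomies are realizable
```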

## Fun Time

### Consider positive **and negative** rays as H, which is equivalent to the perceptron hypothesis set in 1D. The hypothesis set is often called **‘decision stump’** to describe the shape of its hypotheses. What is the growth function $m_{\mathcal{H}}(N)$?

1. $N$
2. $N + 1$
3. $2N$
4. $2^N$

**Reference Answer: 3**

Two dichotomies when the threshold is in each of the $N - 1$ ‘internal’ spots, plus two dichotomies for the all-$\circ$ and all-$\times$ cases: $2(N - 1) + 2 = 2N$.

## The Four Growth Functions

- positive rays: $m_{\mathcal{H}}(N) = N + 1$
- positive intervals: $m_{\mathcal{H}}(N) = \frac{1}{2}N^2 + \frac{1}{2}N + 1$
- convex sets: $m_{\mathcal{H}}(N) = 2^N$
- 2D perceptrons: $m_{\mathcal{H}}(N) < 2^N$ **in some cases**

what if $m_{\mathcal{H}}(N)$ replaces $M$?

$$\mathbb{P}\left[\,\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\,\right] \overset{?}{\le} 2 \cdot m_{\mathcal{H}}(N) \cdot \exp\left(-2\epsilon^2 N\right)$$

**polynomial: good; exponential: bad**

for 2D or general perceptrons: is $m_{\mathcal{H}}(N)$ **polynomial?**

Training versus Testing Break Point

## Break Point of H

what do we know about 2D perceptrons now?
**three inputs: ‘exists’ shatter; four inputs: ‘for all’ no shatter**

if no $k$ inputs can be shattered by $\mathcal{H}$, call $k$ a **break point** for $\mathcal{H}$

- $m_{\mathcal{H}}(k) < 2^k$
- $k + 1$, $k + 2$, $k + 3$, $\ldots$ are also break points!
- will study the **minimum break point** $k$

2D perceptrons: **break point at 4**
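The minimum break point follows mechanically from the growth function; a sketch (the function name is ours, and the 2D-perceptron formula $N^2 - N + 2$ is the known closed form for points in general position, consistent with the table values 2, 4, 8, 14, though not derived in this lecture):

```python
def min_break_point(m, max_k=25):
    """Smallest k with m(k) < 2**k, or None if no break point up to max_k."""
    return next((k for k in range(1, max_k + 1) if m(k) < 2 ** k), None)

print(min_break_point(lambda N: N + 1))                 # positive rays → 2
print(min_break_point(lambda N: (N * N + N) // 2 + 1))  # positive intervals → 3
print(min_break_point(lambda N: N * N - N + 2))         # 2D perceptrons → 4
print(min_break_point(lambda N: 2 ** N))                # convex sets → None
```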

## The Four Break Points

- positive rays: $m_{\mathcal{H}}(N) = N + 1 = O(N)$; break point at 2
- positive intervals: $m_{\mathcal{H}}(N) = \frac{1}{2}N^2 + \frac{1}{2}N + 1 = O(N^2)$; break point at 3
- convex sets: $m_{\mathcal{H}}(N) = 2^N$; no break point
- 2D perceptrons: $m_{\mathcal{H}}(N) < 2^N$ **in some cases**; break point at 4

conjecture:

- no break point: $m_{\mathcal{H}}(N) = 2^N$ (sure!)
- break point $k$: $m_{\mathcal{H}}(N) = O(N^{k-1})$

**excited? wait for next lecture :-)**


## Fun Time

### Consider positive **and negative** rays as H, which is equivalent to the perceptron hypothesis set in 1D. As discussed in an earlier quiz question, the growth function $m_{\mathcal{H}}(N) = 2N$. What is the minimum break point for H?

1. 1
2. 2
3. 3
4. 4

**Reference Answer: 3**

At $k = 3$, $m_{\mathcal{H}}(3) = 6 < 2^3 = 8$, so 3 is a break point; at $k = 2$, $m_{\mathcal{H}}(2) = 4 = 2^2$, so two inputs can still be shattered. The minimum break point is therefore 3.
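Both decision-stump quiz answers can be checked by enumerating the stump dichotomies directly; a brute-force sketch (the function name is ours):

```python
def stump_dichotomies(xs):
    """Distinct dichotomies of 1D decision stumps h(x) = s * sign(x - a),
    with direction s in {+1, -1} and one threshold a per 'spot'."""
    xs = sorted(xs)
    spots = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(s * (1 if x > a else -1) for x in xs)
            for a in spots for s in (1, -1)}

counts = {N: len(stump_dichotomies(list(range(N)))) for N in range(2, 7)}
print(counts)  # m_H(N) = 2N for each N
print(min(k for k, c in counts.items() if c < 2 ** k))  # → 3: minimum break point
```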