Machine Learning Foundations
(機器學習基石)
Lecture 6: Theory of Generalization
Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?

Lecture 5: Training versus Testing
effective price of choice in training: (wishfully) growth function m_H(N) with a break point

Lecture 6: Theory of Generalization
Restriction of Break Point
Bounding Function: Basic Cases
Bounding Function: Inductive Cases
A Pictorial Proof

3 How Can Machines Learn?
4 How Can Machines Learn Better?
Theory of Generalization Restriction of Break Point

The Four Break Points
growth function m_H(N): max number of dichotomies

• positive rays: m_H(N) = N + 1
  m_H(2) = 3 < 2^2: break point at 2
• positive intervals: m_H(N) = (1/2)N^2 + (1/2)N + 1
  m_H(3) = 7 < 2^3: break point at 3
• convex sets: m_H(N) = 2^N always: no break point
• 2D perceptrons: m_H(N) < 2^N in some cases
  m_H(4) = 14 < 2^4: break point at 4

break point k =⇒ break point k + 1, . . . what else?
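The break points above can be checked mechanically: the break point of a growth function is the smallest k with m_H(k) < 2^k. A minimal sketch in Python, using the growth functions stated above:

```python
def break_point(mH, max_k=20):
    """Smallest k with m_H(k) < 2^k, or None if no break point up to max_k."""
    for k in range(1, max_k + 1):
        if mH(k) < 2 ** k:
            return k
    return None

print(break_point(lambda N: N + 1))                 # positive rays: 2
print(break_point(lambda N: N * (N + 1) // 2 + 1))  # positive intervals: 3
print(break_point(lambda N: 2 ** N))                # convex sets: None
```

Note that N(N + 1)/2 + 1 is just the positive-interval formula (1/2)N^2 + (1/2)N + 1 written with integer arithmetic.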
Restriction of Break Point (1/2)
what ‘must be true’ when minimum break point k = 2

• N = 1: every m_H(N) = 2 by definition
• N = 2: every m_H(N) < 4 by definition (so maximum possible = 3)

maximum possible m_H(N) when N = 3 and k = 2? Try adding dichotomies one by one, checking each time whether any two points become shattered:

x1 x2 x3
◦  ◦  ◦
◦  ◦  ×
◦  ×  ◦
×  ◦  ◦

These 4 dichotomies shatter no pair of points. (The alternative fourth dichotomy ◦ × × would shatter (x2, x3).) Adding any fifth dichotomy, such as × ◦ ×, × × ◦, or × × ×, shatters some pair of points.

maximum possible so far: 4 dichotomies :-(
Restriction of Break Point (2/2)
what ‘must be true’ when minimum break point k = 2

• N = 1: every m_H(N) = 2 by definition
• N = 2: every m_H(N) < 4 by definition (so maximum possible = 3)
• N = 3: maximum possible = 4 < 2^3

break point k restricts maximum possible m_H(N) a lot for N > k

idea: m_H(N) ≤ maximum possible m_H(N) given k ≤ poly(N)
Fun Time
When minimum break point k = 1, what is the maximum possible m_H(N) when N = 3?
1: 1
2: 2
3: 4
4: 8

Reference Answer: 1
Because k = 1, the hypothesis set cannot even shatter one point. Thus, no ‘column’ of the table can contain both ◦ and ×. Then, after including the first dichotomy, it is not possible to include any other different dichotomy: for instance, after ◦ × ◦, the dichotomy ◦ × × would make column x3 contain both symbols. Thus, the maximum possible m_H(N) is 1.
Theory of Generalization Bounding Function: Basic Cases

Bounding Function
bounding function B(N, k): maximum possible m_H(N) when break point = k

• combinatorial quantity: maximum number of length-N vectors over (◦, ×) with ‘no shatter’ of any length-k subvector
• irrelevant of the details of H, e.g. B(N, 3) bounds both
  • positive intervals (k = 3)
  • 1D perceptrons (k = 3)

new goal: B(N, k) ≤ poly(N)?
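The definition can be verified by exhaustive search for tiny N: enumerate every subset of the 2^N possible dichotomies and keep the largest subset in which no k points are shattered. A brute-force sketch (runtime is exponential in 2^N, so it is only for checking small table entries):

```python
from itertools import combinations, product

def shatters(dichotomies, cols):
    """True if the dichotomies realize all 2^|cols| patterns on the given columns."""
    patterns = {tuple(d[c] for c in cols) for d in dichotomies}
    return len(patterns) == 2 ** len(cols)

def B_bruteforce(N, k):
    """Largest set of {0,1}^N vectors shattering no k columns (tiny N only)."""
    all_dich = list(product([0, 1], repeat=N))
    best = 0
    # enumerate every subset of the 2^N dichotomies
    for mask in range(1, 2 ** len(all_dich)):
        subset = [d for i, d in enumerate(all_dich) if mask >> i & 1]
        if len(subset) <= best:
            continue
        if not any(shatters(subset, cols)
                   for cols in combinations(range(N), k)):
            best = len(subset)
    return best

print(B_bruteforce(2, 2))  # 3
print(B_bruteforce(3, 2))  # 4
```

This reproduces B(2, 2) = 3 and B(3, 2) = 4 from the ‘pictorial’ proof.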
Table of Bounding Function (1/4)

B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6  ...
N=1
N=2             3
N=3             4
N=4
N=5
N=6
...

Known
• B(2, 2) = 3 (maximum < 4)
• B(3, 2) = 4 (‘pictorial’ proof previously)
Table of Bounding Function (2/4)

B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6  ...
N=1        1
N=2        1    3
N=3        1    4
N=4        1
N=5        1
N=6        1
...

Known
• B(N, 1) = 1 (see previous quiz)

Table of Bounding Function (3/4)
B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6  ...
N=1        1    2    2    2    2    2   ...
N=2        1    3    4    4    4    4   ...
N=3        1    4         8    8    8   ...
N=4        1                   16   16  ...
N=5        1                        32  ...
N=6        1                            ...
...

Known
• B(N, k) = 2^N for N < k: including all dichotomies still does not violate the ‘breaking condition’
Table of Bounding Function (4/4)

B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6  ...
N=1        1    2    2    2    2    2   ...
N=2        1    3    4    4    4    4   ...
N=3        1    4    7    8    8    8   ...
N=4        1              15   16   16  ...
N=5        1                   31   32  ...
N=6        1                        63  ...
...

Known
• B(N, k) = 2^N − 1 for N = k: removing a single dichotomy satisfies the ‘breaking condition’

more than halfway done! :-)
Fun Time
For the 2D perceptrons, which of the following claims is true?
1: minimum break point k = 2
2: m_H(4) = 15
3: m_H(N) < B(N, k) when N = k = minimum break point
4: m_H(N) > B(N, k) when N = k = minimum break point

Reference Answer: 3
As discussed previously, the minimum break point for 2D perceptrons is 4, with m_H(4) = 14. Also, note that B(4, 4) = 15. So the bounding function B(N, k) can be ‘loose’ in bounding m_H(N).

Theory of Generalization Bounding Function: Inductive Cases
Estimating B(4, 3)

B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6  ...
N=1        1    2    2    2    2    2   ...
N=2        1    3    4    4    4    4   ...
N=3        1    4    7    8    8    8   ...
N=4        1         ?    15   16   16  ...
N=5        1                   31   32  ...
N=6        1                        63  ...
...

Motivation
• B(4, 3) shall be related to B(3, ?): ‘adding’ one point from B(3, ?)

next: reduce B(4, 3) to B(3, ?)
‘Achieving’ Dichotomies of B(4, 3)
after checking all 2^(2^4) sets of dichotomies, the winner is . . .

     x1 x2 x3 x4
01   ◦  ◦  ◦  ◦
02   ×  ◦  ◦  ◦
03   ◦  ×  ◦  ◦
04   ◦  ◦  ×  ◦
05   ◦  ◦  ◦  ×
06   ×  ×  ◦  ×
07   ×  ◦  ×  ◦
08   ×  ◦  ◦  ×
09   ◦  ×  ×  ◦
10   ◦  ×  ◦  ×
11   ◦  ◦  ×  ×

so B(4, 3) = 11 fills the table:

B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6
N=1        1    2    2    2    2    2
N=2        1    3    4    4    4    4
N=3        1    4    7    8    8    8
N=4        1         11   15   16   16
N=5        1                   31   32
N=6        1                        63

how to reduce B(4, 3) to B(3, ?) cases?
Reorganized Dichotomies of B(4, 3)
reorder the 11 winning dichotomies so that rows sharing the same (x1, x2, x3) sit together:

     x1 x2 x3 x4
01   ◦  ◦  ◦  ◦
05   ◦  ◦  ◦  ×
02   ×  ◦  ◦  ◦
08   ×  ◦  ◦  ×
03   ◦  ×  ◦  ◦
10   ◦  ×  ◦  ×
04   ◦  ◦  ×  ◦
11   ◦  ◦  ×  ×
06   ×  ×  ◦  ×
07   ×  ◦  ×  ◦
09   ◦  ×  ×  ◦

orange (first eight rows): pairs that differ only in x4; purple (last three rows): singles
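The reorganization can be verified programmatically: group the 11 dichotomies by their (x1, x2, x3) part and count pairs (α) versus singles (β). A quick sketch, encoding ◦ as 'o' and × as 'x':

```python
from collections import Counter

# the 11 dichotomies achieving B(4, 3) = 11, copied from the table above
dichotomies = [
    "oooo", "xooo", "oxoo", "ooxo", "ooox", "xxox",
    "xoxo", "xoox", "oxxo", "oxox", "ooxx",
]

counts = Counter(d[:3] for d in dichotomies)       # group on (x1, x2, x3)
alpha = sum(1 for c in counts.values() if c == 2)  # rows whose x4 appears paired
beta = sum(1 for c in counts.values() if c == 1)   # rows whose x4 appears once
print(alpha, beta, 2 * alpha + beta)  # 4 3 11
```

So α = 4, β = 3, and indeed 2α + β = 11, matching the decomposition used next.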
Estimating Part of B(4, 3) (1/2)
B(4, 3) = 11 = 2α + β

projected onto (x1, x2, x3):

     x1 x2 x3
α:   ◦  ◦  ◦
     ×  ◦  ◦
     ◦  ×  ◦
     ◦  ◦  ×
β:   ×  ×  ◦
     ×  ◦  ×
     ◦  ×  ×

• α + β: number of distinct dichotomies on (x1, x2, x3)
• B(4, 3) ‘no shatter’ of any 3 inputs =⇒ the α + β projected dichotomies ‘no shatter’ of any 3 of (x1, x2, x3)

α + β ≤ B(3, 3)
Estimating Part of B(4, 3) (2/2)
B(4, 3) = 11 = 2α + β

     x1 x2 x3
α:   ◦  ◦  ◦
     ×  ◦  ◦
     ◦  ×  ◦
     ◦  ◦  ×

• α: dichotomies on (x1, x2, x3) with x4 paired
• B(4, 3) ‘no shatter’ of any 3 inputs =⇒ α ‘no shatter’ of any 2 of (x1, x2, x3), because each α dichotomy appears with both values of x4: shattering 2 of (x1, x2, x3) would then shatter those 2 points plus x4

α ≤ B(3, 2)

Putting It All Together
B(4, 3) = 2α + β
α + β ≤ B(3, 3)
α ≤ B(3, 2)
⇒ B(4, 3) ≤ B(3, 3) + B(3, 2)

B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6
N=1        1    2    2    2    2    2
N=2        1    3    4    4    4    4
N=3        1    4    7    8    8    8
N=4        1   ≤5    11   15   16   16
N=5        1   ≤6   ≤16  ≤26   31   32
N=6        1   ≤7   ≤22  ≤42  ≤57   63

now have upper bound of bounding function
Putting It All Together
B(N, k) = 2α + β
α + β ≤ B(N − 1, k)
α ≤ B(N − 1, k − 1)
⇒ B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1)

B(N, k)   k=1  k=2  k=3  k=4  k=5  k=6
N=1        1    2    2    2    2    2
N=2        1    3    4    4    4    4
N=3        1    4    7    8    8    8
N=4        1   ≤5    11   15   16   16
N=5        1   ≤6   ≤16  ≤26   31   32
N=6        1   ≤7   ≤22  ≤42  ≤57   63

now have upper bound of bounding function

Bounding Function: The Theorem
B(N, k) ≤ Σ_{i=0}^{k−1} (N choose i)    (highest-order term: N^(k−1))

• simple induction using the boundary and inductive formulas
• for fixed k, B(N, k) is upper bounded by poly(N) =⇒ m_H(N) is poly(N) if a break point exists

‘≤’ can actually be ‘=’;
go play and prove it if you are a math lover! :-)
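Both the inductive upper bound and the closed form can be tabulated in a few lines; comparing them for small N and k illustrates that the ‘≤’ is indeed an ‘=’. A sketch:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B_upper(N, k):
    """Upper bound on B(N, k) from the boundary and inductive formulas."""
    if k == 1:
        return 1             # cannot even shatter one point
    if N < k:
        return 2 ** N        # all dichotomies allowed
    if N == k:
        return 2 ** N - 1    # remove a single dichotomy
    return B_upper(N - 1, k) + B_upper(N - 1, k - 1)

def B_closed(N, k):
    """Closed form: sum of (N choose i) for i = 0 .. k-1."""
    return sum(comb(N, i) for i in range(k))

# the two formulas coincide (Pascal's rule drives the induction)
assert all(B_upper(N, k) == B_closed(N, k)
           for N in range(1, 12) for k in range(1, 12))
print(B_upper(4, 3))  # 11
```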
The Three Break Points
B(N, k) ≤ Σ_{i=0}^{k−1} (N choose i)    (highest-order term: N^(k−1))

• positive rays: m_H(N) = N + 1 ≤ N + 1
  m_H(2) = 3 < 2^2: break point at 2
• positive intervals: m_H(N) = (1/2)N^2 + (1/2)N + 1 ≤ (1/2)N^2 + (1/2)N + 1
  m_H(3) = 7 < 2^3: break point at 3
• 2D perceptrons: m_H(N) = ? ≤ (1/6)N^3 + (5/6)N + 1
  m_H(4) = 14 < 2^4: break point at 4

can bound m_H(N) by only one break point
Fun Time
For 1D perceptrons (positive and negative rays), we know that m_H(N) = 2N. Let k be the minimum break point. Which of the following is not true?
1: k = 3
2: for some integers N > 0, m_H(N) = Σ_{i=0}^{k−1} (N choose i)
3: for all integers N > 0, m_H(N) = Σ_{i=0}^{k−1} (N choose i)
4: for all integers N > 2, m_H(N) < Σ_{i=0}^{k−1} (N choose i)

Reference Answer: 3
The proof follows directly from the definitions. For 2, N = 1 or 2 gives the equality. One thing to notice is 4: the upper bound can be ‘loose’.
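The claims above can be checked numerically; a minimal sketch assuming m_H(N) = 2N and minimum break point k = 3 for 1D perceptrons:

```python
from math import comb

def mH(N):
    """Growth function of 1D perceptrons (positive and negative rays)."""
    return 2 * N

def bound(N, k=3):
    """Bounding-function upper bound: sum of (N choose i) for i < k."""
    return sum(comb(N, i) for i in range(k))

# equality holds only at N = 1 and N = 2; the bound is loose afterwards
print([(N, mH(N), bound(N)) for N in range(1, 5)])
```

Running it shows m_H(1) = bound(1) = 2 and m_H(2) = bound(2) = 4, then strict inequality from N = 3 on, which is exactly why choice 3 is the false statement.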
Theory of Generalization A Pictorial Proof
BAD Bound for General H
want:
P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε] ≤ 2 · 2 m_H(2N) · exp(−2 · (1/16) ε² N)

actually, when N is large enough,
P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε] ≤ 2 · 2 m_H(2N) · exp(−2 · (1/16) ε² N)

next: sketch of proof

Step 1: Replace E_out by E'_in
(1/2) P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε] ≤ P[∃h ∈ H s.t. |E_in(h) − E'_in(h)| > ε/2]

• E_in(h) takes finitely many values, E_out(h) infinitely many: replace the evil E_out first
• how? sample a verification set D' of size N to calculate E'_in
• BAD h of E_in − E_out probably =⇒ BAD h of E_in − E'_in

[figure: probability distributions of E_in and E'_in, both concentrating around E_out]

evil E_out removed by verification with ‘ghost data’
Step 2: Decompose H by Kind

BAD ≤ 2 P[∃h ∈ H s.t. |E_in(h) − E'_in(h)| > ε/2] ≤ 2 m_H(2N) P[fixed h s.t. |E_in(h) − E'_in(h)| > ε/2]

• E_in with D, E'_in with D': now m_H comes to play
• how? infinite H becomes |H(x_1, . . . , x_N, x'_1, . . . , x'_N)| kinds
• union bound on m_H(2N) kinds

[figure: space of data sets D under (a) Hoeffding Inequality, (b) Union Bound, (c) Now]

use m_H(2N) to calculate BAD-overlap properly
Step 3: Use Hoeffding without Replacement

BAD ≤ 2 m_H(2N) P[fixed h s.t. |E_in(h) − E'_in(h)| > ε/2] ≤ 2 m_H(2N) · 2 exp(−2 (ε/4)² N)

• consider a bin of 2N examples; choose N for E_in, leave the others for E'_in
• |E_in − E'_in| > ε/2 ⇔ |E_in − (E_in + E'_in)/2| > ε/4
• so? just a ‘smaller bin’, a ‘smaller ε’, and Hoeffding without replacement

[figure: sampling N of the 2N examples (top) for E_in from a small bin (bottom)]

use Hoeffding after zooming to a fixed h
That’s All!
Vapnik-Chervonenkis (VC) bound:
P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε] ≤ 4 m_H(2N) exp(−(1/8) ε² N)

• replace E_out by E'_in
• decompose H by kind
• use Hoeffding without replacement

2D perceptrons:
• break point? 4
• m_H(N)? O(N^3)

learning with 2D perceptrons is feasible! :-)
Fun Time
For positive rays, m_H(N) = N + 1. Plug it into the VC bound for ε = 0.1 and N = 10000. What is the VC bound of BAD events?

P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε] ≤ 4 m_H(2N) exp(−(1/8) ε² N)

1: 2.77 × 10^−87
2: 5.54 × 10^−83
3: 2.98 × 10^−1
4: 2.29 × 10^2

Reference Answer: 3
Simple calculation. Note that the BAD probability bound is not very small even with 10000 examples.
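The answer can be reproduced numerically; a small sketch plugging m_H(2N) = 2N + 1 for positive rays into the VC bound:

```python
import math

def vc_bound(mH_2N, eps, N):
    """VC bound on the BAD-event probability: 4 * m_H(2N) * exp(-eps^2 * N / 8)."""
    return 4 * mH_2N * math.exp(-(1 / 8) * eps ** 2 * N)

N, eps = 10000, 0.1
print(round(vc_bound(2 * N + 1, eps, N), 3))  # 0.298
```

So even with 10000 examples, the bound only guarantees the BAD probability is below about 0.3, matching choice 3.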