(1)

Machine Learning Foundations (機器學習基石)

Lecture 6: Theory of Generalization

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)

(2)

Roadmap

1 When Can Machines Learn?
2 Why Can Machines Learn?

Lecture 5: Training versus Testing
effective price of choice in training: (wishfully) growth function m_H(N) with a break point

Lecture 6: Theory of Generalization
Restriction of Break Point
Bounding Function: Basic Cases
Bounding Function: Inductive Cases
A Pictorial Proof

3 How Can Machines Learn?
4 How Can Machines Learn Better?

(3)

The Four Break Points

growth function m_H(N): max number of dichotomies

positive rays: m_H(N) = N + 1
  m_H(2) = 3 < 2^2: break point at 2

positive intervals: m_H(N) = (1/2)N^2 + (1/2)N + 1
  m_H(3) = 7 < 2^3: break point at 3

convex sets: m_H(N) = 2^N
  m_H(N) = 2^N always: no break point

2D perceptrons: m_H(N) < 2^N in some cases
  m_H(4) = 14 < 2^4: break point at 4

break point at k ⇒ break points at k + 1, k + 2, . . .

what else?
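The counts above can be verified mechanically. As a minimal illustration (my own sketch, not part of the lecture), the following Python snippet enumerates the dichotomies that positive rays can produce on N points and confirms m_H(N) = N + 1, so m_H(2) = 3 < 2^2 exhibits the break point at 2:

```python
def positive_ray_dichotomies(n):
    """Dichotomies produced by h(x) = sign(x - a) on n sorted points:
    points left of the threshold get x, points to the right get o."""
    # the threshold can fall into any of the n + 1 gaps between points
    return {tuple('x' if i < cut else 'o' for i in range(n))
            for cut in range(n + 1)}

for n in range(1, 6):
    m = len(positive_ray_dichotomies(n))
    assert m == n + 1          # growth function m_H(N) = N + 1
    print(n, m, 2 ** n)        # m < 2^n from N = 2 onward
```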

(4–12)

Restriction of Break Point (1/2)

what ‘must be true’ when minimum break point k = 2

N = 1: every m_H(N) = 2 by definition
N = 2: every m_H(N) < 4 by definition (so maximum possible = 3)

maximum possible m_H(N) when N = 3 and k = 2?
(built up one dichotomy at a time)

1, 2, or 3 dichotomies, e.g. {◦◦◦, ◦◦×, ◦×◦}: shatter any two points? no

4 dichotomies: it depends on the choice
  {◦◦◦, ◦◦×, ◦×◦, ◦××} shatters (x1, x2)... no wait, it shatters (x2, x3): not allowed
  {◦◦◦, ◦◦×, ◦×◦, ×◦◦} shatters no pair: allowed

5 dichotomies: starting from the allowed four above, every remaining candidate fails
  adding ×◦× shatters (x1, x3)
  adding ××◦ shatters (x1, x2)
  adding ××× shatters (x1, x2)
  (and adding ◦×× shatters (x2, x3))

maximum possible so far: 4 dichotomies

  x1 x2 x3
  ◦  ◦  ◦
  ◦  ◦  ×
  ◦  ×  ◦
  ×  ◦  ◦

no fifth dichotomy survives :-(

(13)

Restriction of Break Point (2/2)

what ‘must be true’ when minimum break point k = 2

N = 1: every m_H(N) = 2 by definition
N = 2: every m_H(N) < 4 by definition (so maximum possible = 3)
N = 3: maximum possible = 4 ≪ 2^3

—break point k restricts maximum possible m_H(N) a lot for N > k

idea: m_H(N) ≤ maximum possible m_H(N) given k ≤ poly(N)?

(14)

Fun Time

When minimum break point k = 1, what is the maximum possible m_H(N) when N = 3?

1  1
2  2
3  4
4  8

Reference Answer: 1

Because k = 1, the hypothesis set cannot even shatter one point. Thus, no ‘column’ of the table can contain both ◦ and ×. Then, after including the first dichotomy, it is not possible to include any other different dichotomy: a second dichotomy would differ from the first on some point, and that point would then be shattered. For instance, once ◦×◦ is included, adding ◦×× would shatter x3. Thus, the maximum possible m_H(N) is 1.

(15)

Bounding Function

bounding function B(N, k): maximum possible m_H(N) when break point = k

combinatorial quantity: maximum number of length-N vectors of (◦, ×) with ‘no shatter’ of any length-k subvector

irrelevant of the details of H, e.g. B(N, 3) bounds both
• positive intervals (k = 3)
• 1D perceptrons (k = 3)

new goal: B(N, k) ≤ poly(N)?
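Since B(N, k) is a purely combinatorial quantity, it can be computed by brute force for small N. A sketch directly under that definition (my own illustration, not lecture code; exponential time, so toy sizes only):

```python
from itertools import combinations, product

def shattered(vectors, cols):
    """True if the vectors realize all 2^|cols| patterns on cols."""
    return len({tuple(v[c] for c in cols) for v in vectors}) == 2 ** len(cols)

def B(N, k):
    """Brute-force bounding function: the largest set of length-N
    o/x vectors in which no k positions are shattered.
    (For N < k there are no k positions at all, so it returns 2^N.)"""
    vecs = list(product('ox', repeat=N))
    for size in range(2 ** N, 0, -1):               # try largest sets first
        for subset in combinations(vecs, size):
            if not any(shattered(subset, cols)
                       for cols in combinations(range(N), k)):
                return size

print(B(2, 2), B(3, 2), B(3, 3))   # 3 4 7, matching the table below
```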

(16–19)

Table of Bounding Function

  B(N, k)   k=1   k=2   k=3   k=4   k=5   k=6   ...
  N=1         1     2     2     2     2     2   ...
  N=2         1     3     4     4     4     4   ...
  N=3         1     4     7     8     8     8   ...
  N=4         1          15    16    16
  N=5         1                      31    32
  N=6         1                            63

Known
• B(2, 2) = 3 (maximum < 4) and B(3, 2) = 4 (‘pictorial’ proof previously)
• B(N, 1) = 1 (see previous quiz)
• B(N, k) = 2^N for N < k —with fewer than k points, including all dichotomies cannot violate the ‘breaking condition’
• B(N, k) = 2^N − 1 for N = k —removing a single dichotomy satisfies the ‘breaking condition’

more than halfway done! :-)

(20)

Fun Time

For the 2D perceptrons, which of the following claims is true?

1  minimum break point k = 2
2  m_H(4) = 15
3  m_H(N) < B(N, k) when N = k = minimum break point
4  m_H(N) > B(N, k) when N = k = minimum break point

Reference Answer: 3

As discussed previously, the minimum break point for 2D perceptrons is 4, with m_H(4) = 14. Also, note that B(4, 4) = 15. So the bounding function B(N, k) can be ‘loose’ in bounding m_H(N).

(21)

Estimating B(4, 3)

  B(N, k)   k=1   k=2   k=3   k=4   k=5   k=6   ...
  N=1         1     2     2     2     2     2   ...
  N=2         1     3     4     4     4     4   ...
  N=3         1     4     7     8     8     8   ...
  N=4         1           ?    15    16    16
  N=5         1                      31    32
  N=6         1                            63

Motivation

B(4, 3) shall be related to B(3, ?)
—‘adding’ one point from B(3, ?)

next: reduce B(4, 3) to B(3, ?)

(22)

‘Achieving’ Dichotomies of B(4, 3)

after checking all 2^(2^4) sets of dichotomies, the winner is . . .

      x1 x2 x3 x4
  01  ◦  ◦  ◦  ◦
  02  ×  ◦  ◦  ◦
  03  ◦  ×  ◦  ◦
  04  ◦  ◦  ×  ◦
  05  ◦  ◦  ◦  ×
  06  ×  ×  ◦  ×
  07  ×  ◦  ×  ◦
  08  ×  ◦  ◦  ×
  09  ◦  ×  ×  ◦
  10  ◦  ×  ◦  ×
  11  ◦  ◦  ×  ×

  B(N, k)   k=1   k=2   k=3   k=4   k=5   k=6
  N=1         1     2     2     2     2     2
  N=2         1     3     4     4     4     4
  N=3         1     4     7     8     8     8
  N=4         1          11    15    16    16
  N=5         1                      31    32
  N=6         1                            63

how to reduce B(4, 3) to B(3, ?) cases?

(23)

Reorganized Dichotomies of B(4, 3)

the 11 dichotomies above, reorganized so that dichotomies differing only in x4 sit next to each other:

      x1 x2 x3 x4
  01  ◦  ◦  ◦  ◦
  05  ◦  ◦  ◦  ×   (pair)

  02  ×  ◦  ◦  ◦
  08  ×  ◦  ◦  ×   (pair)

  03  ◦  ×  ◦  ◦
  10  ◦  ×  ◦  ×   (pair)

  04  ◦  ◦  ×  ◦
  11  ◦  ◦  ×  ×   (pair)

  06  ×  ×  ◦  ×   (single)
  07  ×  ◦  ×  ◦   (single)
  09  ◦  ×  ×  ◦   (single)

(in the original figure, orange marks the pairs and purple the singles)

(24)

Estimating Part of B(4, 3) (1/2)

B(4, 3) = 11 = 2α + β
α: the 4 paired patterns on (x1, x2, x3), each appearing with both x4 = ◦ and x4 = ×
β: the 3 single patterns

        x1 x2 x3
   α    ◦  ◦  ◦
        ×  ◦  ◦
        ◦  ×  ◦
        ◦  ◦  ×
   β    ×  ×  ◦
        ×  ◦  ×
        ◦  ×  ×

• α + β: distinct dichotomies on (x1, x2, x3)

B(4, 3) ‘no shatter’ any 3 inputs
⇒ α + β ‘no shatter’ any 3 of (x1, x2, x3)
⇒ α + β ≤ B(3, 3)

(25)

Estimating Part of B(4, 3) (2/2)

B(4, 3) = 11 = 2α + β

        x1 x2 x3
   α    ◦  ◦  ◦
        ×  ◦  ◦
        ◦  ×  ◦
        ◦  ◦  ×

• α: dichotomies on (x1, x2, x3) that appear with x4 paired

B(4, 3) ‘no shatter’ any 3 inputs
⇒ α ‘no shatter’ any 2 of (x1, x2, x3)
  (if two points were shattered within α, then those two points plus x4, which takes both ◦ and × for every α-pattern, would form 3 shattered points)
⇒ α ≤ B(3, 2)
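The α/β bookkeeping is easy to check on the 11 winning dichotomies (again my own sketch, not lecture code):

```python
from collections import defaultdict

# the 11 dichotomies achieving B(4, 3), copied from the earlier slide
dichotomies = ["oooo", "xooo", "oxoo", "ooxo", "ooox", "xxox",
               "xoxo", "xoox", "oxxo", "oxox", "ooxx"]

# group by (x1, x2, x3): pairs carry both values of x4, singles only one
groups = defaultdict(set)
for d in dichotomies:
    groups[d[:3]].add(d[3])

alpha = sum(len(t) == 2 for t in groups.values())   # paired patterns
beta = sum(len(t) == 1 for t in groups.values())    # single patterns
print(alpha, beta, 2 * alpha + beta)   # 4 3 11
print(alpha + beta <= 7)               # alpha + beta <= B(3, 3) = 7
print(alpha <= 4)                      # alpha <= B(3, 2) = 4
```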

(26)

Putting It All Together

B(4, 3) = 2α + β
α + β ≤ B(3, 3)
α ≤ B(3, 2)
⇒ B(4, 3) ≤ B(3, 3) + B(3, 2) = 7 + 4 = 11

  B(N, k)   k=1   k=2   k=3   k=4   k=5   k=6
  N=1         1     2     2     2     2     2
  N=2         1     3     4     4     4     4
  N=3         1     4     7     8     8     8
  N=4         1    ≤5    11    15    16    16
  N=5         1    ≤6   ≤16   ≤26    31    32
  N=6         1    ≤7   ≤22   ≤42   ≤57    63

now have upper bound of bounding function

(27)

Putting It All Together

in general, B(N, k) = 2α + β with
α + β ≤ B(N − 1, k)
α ≤ B(N − 1, k − 1)
⇒ B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1)

  B(N, k)   k=1   k=2   k=3   k=4   k=5   k=6
  N=1         1     2     2     2     2     2
  N=2         1     3     4     4     4     4
  N=3         1     4     7     8     8     8
  N=4         1    ≤5    11    15    16    16
  N=5         1    ≤6   ≤16   ≤26    31    32
  N=6         1    ≤7   ≤22   ≤42   ≤57    63

now have upper bound of bounding function
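The boundary cases plus the inductive formula are all one needs to tabulate these upper bounds; a short sketch of my own using memoized recursion:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def B_upper(N, k):
    """Upper bound on B(N, k) from the boundary cases and the recursion
    B(N, k) <= B(N-1, k) + B(N-1, k-1)."""
    if k == 1:
        return 1              # cannot even shatter one point
    if N < k:
        return 2 ** N         # too few points to shatter k of them
    if N == k:
        return 2 ** N - 1     # all dichotomies minus one
    return B_upper(N - 1, k) + B_upper(N - 1, k - 1)

for N in range(1, 7):
    print([B_upper(N, k) for k in range(1, 7)])
# the N = 4 row prints [1, 5, 11, 15, 16, 16], matching the <=5 and 11 above
```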

(28)

Bounding Function: The Theorem

B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i), with highest-order term N^(k−1)

simple induction using the boundary and inductive formulas

for fixed k, B(N, k) is upper bounded by poly(N)
⇒ m_H(N) is poly(N) if a break point exists

‘≤’ can actually be ‘=’;
go play and prove it if you are a math lover! :-)
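For the would-be prover, the inductive step mirrors Pascal's rule; a sketch of the key manipulation in LaTeX (my own write-up of the hinted induction, not the lecture's proof):

```latex
\begin{aligned}
B(N,k) &\le B(N-1,k) + B(N-1,k-1)\\
&\le \sum_{i=0}^{k-1}\binom{N-1}{i} + \sum_{i=0}^{k-2}\binom{N-1}{i}
  && \text{(induction hypothesis)}\\
&= \binom{N-1}{0} + \sum_{i=1}^{k-1}\left[\binom{N-1}{i} + \binom{N-1}{i-1}\right]\\
&= \sum_{i=0}^{k-1}\binom{N}{i}
  && \text{(Pascal's rule)}
\end{aligned}
```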

(29)

The Three Break Points

B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i), with highest-order term N^(k−1)

positive rays: m_H(N) = N + 1 ≤ N + 1
  m_H(2) = 3 < 2^2: break point at 2

positive intervals: m_H(N) = (1/2)N^2 + (1/2)N + 1 ≤ (1/2)N^2 + (1/2)N + 1
  m_H(3) = 7 < 2^3: break point at 3

2D perceptrons: m_H(N) = ? ≤ (1/6)N^3 + (5/6)N + 1
  m_H(4) = 14 < 2^4: break point at 4

can bound m_H(N) by only one break point
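As a quick numeric sanity check (an illustration of mine, not from the lecture), the k = 4 sum indeed collapses to the cubic quoted for 2D perceptrons, and at N = 4 it gives 15 ≥ m_H(4) = 14:

```python
from math import comb

def poly_bound(N, k):
    """sum_{i=0}^{k-1} C(N, i), the theorem's bound on B(N, k)."""
    return sum(comb(N, i) for i in range(k))

for N in (4, 5, 10):
    closed_form = (N**3 + 5 * N + 6) // 6   # (1/6)N^3 + (5/6)N + 1
    print(N, poly_bound(N, 4), closed_form) # the two columns agree
```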

(30)

Fun Time

For 1D perceptrons (positive and negative rays), we know that m_H(N) = 2N. Let k be the minimum break point. Which of the following is not true?

1  k = 3
2  for some integers N > 0, m_H(N) = Σ_{i=0}^{k−1} C(N, i)
3  for all integers N > 0, m_H(N) = Σ_{i=0}^{k−1} C(N, i)
4  for all integers N > 2, m_H(N) < Σ_{i=0}^{k−1} C(N, i)

Reference Answer: 3

The proof is generally trivial by listing the definitions. For 2, N = 1 or 2 gives equality. One thing to notice is 4: the upper bound can be ‘loose’.

(31)

BAD Bound for General H

want:
P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε] ≤ 2 · m_H(N) · exp(−2ε^2 N)

actually, when N large enough,
P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε] ≤ 2 · 2 · m_H(2N) · exp(−2 · (1/16) ε^2 N)

next: sketch of proof

(32)

Step 1: Replace E_out by E'_in

(1/2) P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε]
  ≤ P[∃h ∈ H s.t. |E_in(h) − E'_in(h)| > ε/2]

E_in(h): finitely many; E_out(h): infinitely many
—replace the evil E_out first

how? sample a verification set D' of size N to calculate E'_in

BAD h of |E_in − E_out| probably ⇒ BAD h of |E_in − E'_in|

[figure: probability distributions of E_in and E'_in centered around E_out]

evil E_out removed by verification with ‘ghost data’

(33)

Step 2: Decompose H by Kind

BAD ≤ 2 P[∃h ∈ H s.t. |E_in(h) − E'_in(h)| > ε/2]
    ≤ 2 m_H(2N) P[fixed h s.t. |E_in(h) − E'_in(h)| > ε/2]

E_in with D, E'_in with D'
—now m_H comes to play

how? infinite H becomes |H(x_1, . . . , x_N, x'_1, . . . , x'_N)| kinds
• union bound on m_H(2N) kinds

[figure: space of data sets D under (a) Hoeffding Inequality, (b) Union Bound, (c) Now]

use m_H(2N) to calculate BAD-overlap properly

(34)

Step 3: Use Hoeffding without Replacement

BAD ≤ 2 m_H(2N) P[fixed h s.t. |E_in(h) − E'_in(h)| > ε/2]
    ≤ 2 m_H(2N) · 2 exp(−2 (ε/4)^2 N)

consider a bin of 2N examples; choose N for E_in, leave the others for E'_in

|E_in − E'_in| > ε/2  ⟺  |E_in − (E_in + E'_in)/2| > ε/4

so? just ‘smaller bin’, ‘smaller ε’, and Hoeffding without replacement

[figure: sampling N of the 2N examples for E_in, a ‘small bin’]

use Hoeffding after zooming to fixed h

(35)

That's All!

Vapnik-Chervonenkis (VC) bound:

P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε] ≤ 4 m_H(2N) exp(−(1/8) ε^2 N)

• replace E_out by E'_in
• decompose H by kind
• use Hoeffding without replacement

2D perceptrons:
break point? 4
m_H(N)? O(N^3)

learning with 2D perceptrons feasible! :-)

(36)

Fun Time

For positive rays, m_H(N) = N + 1. Plug it into the VC bound for ε = 0.1 and N = 10000. What is the VC bound on BAD events?

P[∃h ∈ H s.t. |E_in(h) − E_out(h)| > ε] ≤ 4 m_H(2N) exp(−(1/8) ε^2 N)

1  2.77 × 10^(−87)
2  5.54 × 10^(−83)
3  2.98 × 10^(−1)
4  2.29 × 10^2

Reference Answer: 3

Simple calculation. Note that the BAD probability bound is not very small even with 10000 examples.
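The ‘simple calculation’ spelled out (a quick check of the quoted answer, not lecture code):

```python
from math import exp

eps, N = 0.1, 10000
mH = 2 * N + 1                                # m_H(2N) = 2N + 1 for positive rays
bound = 4 * mH * exp(-(1 / 8) * eps**2 * N)
print(bound)                                  # ~0.298, i.e. 2.98 x 10^(-1)
```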

(37)

Summary

1 When Can Machines Learn?
2 Why Can Machines Learn?

Lecture 5: Training versus Testing
Lecture 6: Theory of Generalization
• Restriction of Break Point: break point ‘breaks’ consequent points
• Bounding Function: Basic Cases: B(N, k) bounds m_H(N) with break point k
• Bounding Function: Inductive Cases: B(N, k) is poly(N)
• A Pictorial Proof: m_H(N) can replace M with a few changes

next: how to ‘use’ the break point?

3 How Can Machines Learn?
4 How Can Machines Learn Better?
