(1)

Machine Learning Foundations (機器學習基石)

Lecture 7: The VC Dimension

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)


(2)

The VC Dimension

Roadmap

1 When Can Machines Learn?

2 Why Can Machines Learn?

Lecture 6: Theory of Generalization
E_out ≈ E_in is possible if m_H(N) breaks somewhere and N is large enough.

Lecture 7: The VC Dimension
  Definition of VC Dimension
  VC Dimension of Perceptrons
  Physical Intuition of VC Dimension
  Interpreting VC Dimension

3 How Can Machines Learn?

4 How Can Machines Learn Better?

(3)

The VC Dimension Definition of VC Dimension

Recap: More on Growth Function

m_H(N) with break point k is bounded by the bounding function
$B(N,k) = \sum_{i=0}^{k-1} \binom{N}{i}$, whose highest-order term is $N^{k-1}$:

B(N,k)     k=1   k=2   k=3   k=4   k=5
  N=1        1     2     2     2     2
  N=2        1     3     4     4     4
  N=3        1     4     7     8     8
  N=4        1     5    11    15    16
  N=5        1     6    16    26    31
  N=6        1     7    22    42    57

N^{k-1}    k=1   k=2   k=3   k=4   k=5
  N=1        1     1     1     1     1
  N=2        1     2     4     8    16
  N=3        1     3     9    27    81
  N=4        1     4    16    64   256
  N=5        1     5    25   125   625
  N=6        1     6    36   216  1296

provably and loosely, for N ≥ 2 and k ≥ 3:

$m_{\mathcal{H}}(N) \le B(N,k) = \sum_{i=0}^{k-1} \binom{N}{i} \le N^{\,k-1}$

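To make the recap tangible, here is a minimal Python sketch (mine, not part of the lecture; the helper name B is just for illustration) that evaluates B(N, k) from its definition, reproduces the two tables above, and checks the loose polynomial bound N^{k-1}:

```python
from math import comb

def B(N, k):
    """Bounding function: B(N, k) = sum_{i=0}^{k-1} C(N, i)."""
    return sum(comb(N, i) for i in range(k))

# reproduce the two tables on this slide
for N in range(1, 7):
    print(f"N={N}  B(N,k): {[B(N, k) for k in range(1, 6)]}"
          f"  N^(k-1): {[N ** (k - 1) for k in range(1, 6)]}")

# the loose bound B(N, k) <= N^(k-1) for N >= 2, k >= 3
assert all(B(N, k) <= N ** (k - 1)
           for N in range(2, 30) for k in range(3, 8))
```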

(4)

The VC Dimension Definition of VC Dimension

Recap: More on Vapnik-Chervonenkis (VC) Bound

For any g = A(D) ∈ H and 'statistical' large D, for N ≥ 2, k ≥ 3:

$\mathbb{P}_{\mathcal{D}}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big]
\;\le\; \mathbb{P}_{\mathcal{D}}\big[\,\exists\, h \in \mathcal{H} \text{ s.t. } |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\big]
\;\le\; 4\, m_{\mathcal{H}}(2N)\, \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)
\;\le\; 4\,(2N)^{k-1} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)$

where the last step holds if a break point k exists.

if 1 m_H(N) breaks at k (good H)
and if 2 N is large enough (good D)
  ⇒ probably generalized: 'E_out ≈ E_in',
and if 3 A picks a g with small E_in (good A)
  ⇒ probably learned! (:-) good luck)
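The chain of inequalities above is easy to evaluate numerically. Here is a small sketch (my illustration, not from the slides; the function name vc_bound is mine) of the final polynomial-times-exponential form, showing how the bound shrinks once N grows relative to the break point k:

```python
import math

def vc_bound(N, epsilon, k):
    """4 (2N)^(k-1) exp(-epsilon^2 N / 8): bound on P[|E_in - E_out| > epsilon]
    when m_H breaks at k, using m_H(2N) <= (2N)^(k-1)."""
    return 4 * (2 * N) ** (k - 1) * math.exp(-(epsilon ** 2) * N / 8)

# polynomial growth in N is eventually dominated by the exponential decay
for N in (100, 1_000, 10_000, 100_000):
    print(f"N={N:>7}  bound={vc_bound(N, epsilon=0.1, k=4):.3g}")
```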

(5)

The VC Dimension Definition of VC Dimension

VC Dimension

the formal name of the maximum non-break point

Definition: the VC dimension of H, denoted d_VC(H), is the largest N for which m_H(N) = 2^N, i.e. the most inputs that H can shatter.

• d_VC = 'minimum break point k' − 1
• N ≤ d_VC  ⇒  H can shatter some N inputs
• k > d_VC  ⇒  k is a break point for H

if N ≥ 2 and d_VC ≥ 2:  m_H(N) ≤ N^{d_VC}


(6)

The VC Dimension Definition of VC Dimension

The Four VC Dimensions

positive rays:       m_H(N) = N + 1                       d_VC = 1
positive intervals:  m_H(N) = (1/2)N^2 + (1/2)N + 1       d_VC = 2
convex sets:         m_H(N) = 2^N                          d_VC = ∞
2D perceptrons:      m_H(N) ≤ N^3 for N ≥ 2               d_VC = 3

good: finite d_VC
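The first two rows are easy to verify by brute force. Below is a minimal sketch (not from the slides; the helper names are mine) that enumerates the dichotomies realizable by positive rays and positive intervals on N ordered points, matching m_H(N) = N + 1 and m_H(N) = (1/2)N² + (1/2)N + 1, hence d_VC = 1 and 2:

```python
def ray_dichotomies(N):
    """Dichotomies of x_1 < ... < x_N realizable by h(x) = sign(x - a)."""
    return {tuple([-1] * cut + [+1] * (N - cut)) for cut in range(N + 1)}

def interval_dichotomies(N):
    """Dichotomies realizable by h(x) = +1 iff x lies in the interval (l, r)."""
    return {tuple([-1] * l + [+1] * (r - l) + [-1] * (N - r))
            for l in range(N + 1) for r in range(l, N + 1)}

for N in range(1, 7):
    rays, intervals = len(ray_dichotomies(N)), len(interval_dichotomies(N))
    assert rays == N + 1 and intervals == N * (N + 1) // 2 + 1
    print(N, rays, intervals)
# rays shatter N = 1 but not N = 2 (3 < 4); intervals shatter N = 2 but not N = 3 (7 < 8)
```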

(7)

The VC Dimension Definition of VC Dimension

VC Dimension and Learning

finite d_VC  ⇒  g 'will' generalize (E_out(g) ≈ E_in(g)):
• regardless of learning algorithm A
• regardless of input distribution P
• regardless of target function f

[learning-flow diagram: an unknown target function f: X → Y (the ideal credit approval formula) and an unknown distribution P on X generate x_1, x_2, ..., x_N and future x; the training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in the bank) are fed to the learning algorithm A, which searches the hypothesis set H (set of candidate formulas) and outputs the final hypothesis g ≈ f (the 'learned' formula to be used)]

a 'worst case' guarantee on generalization


(8)

The VC Dimension Definition of VC Dimension

Fun Time

If there is a set of N inputs that cannot be shattered by H, what can we conclude about d_VC(H) based only on this information?

1  d_VC(H) > N
2  d_VC(H) = N
3  d_VC(H) < N
4  no conclusion can be made

Reference Answer: 4

It is possible that there is another set of N inputs that can be shattered, which would mean d_VC ≥ N. It is also possible that no set of N inputs can be shattered, which would mean d_VC < N. Neither case can be ruled out by one non-shattering set.

(9)

The VC Dimension VC Dimension of Perceptrons

2D PLA Revisited

E_out(g) ≈ 0 :-) comes from combining two ingredients:
• E_in(g) = 0: PLA can converge when D is linearly separable (with x_n ∼ P and y_n = f(x_n)) and T is large
• E_out(g) ≈ E_in(g): P[|E_in(g) − E_out(g)| > ε] ≤ ... by d_VC = 3 and N large

what about general PLA for x with more than 2 features?


(10)

The VC Dimension VC Dimension of Perceptrons

VC Dimension of Perceptrons

1D perceptron (pos/neg rays): d_VC = 2
2D perceptrons: d_VC = 3
• d_VC ≥ 3: there are some 3 inputs we can shatter
• d_VC ≤ 3: no 4 inputs can be shattered (e.g. the × ◦ / ◦ × labeling cannot be realized)

d-D perceptrons: d_VC = d + 1?

two steps to prove it:
• d_VC ≥ d + 1
• d_VC ≤ d + 1

(11)

The VC Dimension VC Dimension of Perceptrons

Extra Fun Time

Which statement below shows that d_VC ≥ d + 1?

1  There are some d + 1 inputs we can shatter.
2  We can shatter any set of d + 1 inputs.
3  There are some d + 2 inputs we cannot shatter.
4  We cannot shatter any set of d + 2 inputs.

Reference Answer: 1

d_VC is the maximum N for which m_H(N) = 2^N, and m_H(N) is the largest number of dichotomies on N inputs. So if we can find 2^{d+1} dichotomies on some d + 1 inputs, then m_H(d + 1) = 2^{d+1} and hence d_VC ≥ d + 1.


(12)

The VC Dimension VC Dimension of Perceptrons

d_VC ≥ d + 1

There are some d + 1 inputs we can shatter. Take some 'trivial' inputs (each row includes the constant coordinate x_0 = 1):

$X = \begin{bmatrix} x_1^T \\ x_2^T \\ x_3^T \\ \vdots \\ x_{d+1}^T \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & & 0 \\ \vdots & & & \ddots & \\ 1 & 0 & \cdots & 0 & 1 \end{bmatrix}$

(visually in 2D: the points (0,0), (1,0), (0,1))

note: X is invertible!

(13)

The VC Dimension VC Dimension of Perceptrons

Can We Shatter X?

$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_{d+1}^T \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \\ 1 & 0 & \cdots & 0 & 1 \end{bmatrix}$ is invertible.

to shatter: for any $y = (y_1, \ldots, y_{d+1})^T$, find w such that sign(Xw) = y:

$\text{sign}(Xw) = y \;\Longleftarrow\; Xw = y \;\Longleftrightarrow\; w = X^{-1} y$  (X invertible!)

this 'special' X can be shattered  ⇒  d_VC ≥ d + 1

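A numerical version of this argument (my sketch, not part of the slides): build the 'trivial' X for some d, solve w = X^{-1} y for every sign vector y, and confirm that all 2^{d+1} dichotomies are realized:

```python
from itertools import product
import numpy as np

d = 4
# rows are x_1, ..., x_{d+1}; column 0 is the constant coordinate x_0 = 1
X = np.eye(d + 1)
X[:, 0] = 1.0                       # the 'trivial' inputs: invertible by construction
X_inv = np.linalg.inv(X)

for y in product([-1.0, 1.0], repeat=d + 1):
    y = np.array(y)
    w = X_inv @ y                   # Xw = y, so sign(Xw) = y
    assert np.array_equal(np.sign(X @ w), y)

print(f"all {2 ** (d + 1)} dichotomies realized, so d_VC >= d + 1 for d = {d}")
```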

(14)

The VC Dimension VC Dimension of Perceptrons

Extra Fun Time

Which statement below shows that d_VC ≤ d + 1?

1  There are some d + 1 inputs we can shatter.
2  We can shatter any set of d + 1 inputs.
3  There are some d + 2 inputs we cannot shatter.
4  We cannot shatter any set of d + 2 inputs.

Reference Answer: 4

d_VC is the maximum N for which m_H(N) = 2^N, and m_H(N) is the largest number of dichotomies on N inputs. So if we cannot find 2^{d+2} dichotomies on any d + 2 inputs (i.e. d + 2 is a break point), then m_H(d + 2) < 2^{d+2} and hence d_VC < d + 2. That is, d_VC ≤ d + 1.

(15)

The VC Dimension VC Dimension of Perceptrons

d_VC ≤ d + 1 (1/2)

A 2D Special Case

$X = \begin{bmatrix} x_1^T \\ x_2^T \\ x_3^T \\ x_4^T \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}$

if x_1 is × and x_2, x_3 are ◦, can x_4 be ×? No: since x_4 = x_2 + x_3 − x_1,

$w^T x_4 = \underbrace{w^T x_2}_{\circ,\;>0} + \underbrace{w^T x_3}_{\circ,\;>0} - \underbrace{w^T x_1}_{\times,\;<0} > 0$

so x_4 is forced to be ◦:

linear dependence restricts the dichotomy

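To see the restriction concretely, here is a hedged sketch (mine, not from the slides; it uses a linear-programming feasibility test via scipy, which the lecture does not) that checks which of the 16 dichotomies of these four points a 2D perceptron can realize; only the two labelings ruled out by the dependence x_4 = x_2 + x_3 − x_1 are missing, so the four points cannot be shattered:

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

# the four inputs of the 2D special case, with bias coordinate x_0 = 1
X = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 1., 1.]])

def separable(y):
    """Feasibility LP: is there a w with y_n * (w . x_n) >= 1 for every n?"""
    res = linprog(c=np.zeros(3),
                  A_ub=-(y[:, None] * X), b_ub=-np.ones(len(y)),
                  bounds=[(None, None)] * 3)
    return res.success

dichotomies = [np.array(y) for y in product([-1., 1.], repeat=4)]
missing = [tuple(int(v) for v in y) for y in dichotomies if not separable(y)]
print(len(dichotomies) - len(missing), "of 16 dichotomies realizable")  # 14, not 16
print("forbidden by the linear dependence:", missing)
```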

(16)

The VC Dimension VC Dimension of Perceptrons

d_VC ≤ d + 1 (2/2)

d-D General Case

$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_{d+1}^T \\ x_{d+2}^T \end{bmatrix}$ has more rows than columns, so the rows are linearly dependent (with some a_i non-zero):

$x_{d+2} = a_1 x_1 + a_2 x_2 + \cdots + a_{d+1} x_{d+1}$

can you generate the dichotomy (sign(a_1), sign(a_2), ..., sign(a_{d+1}), ×)? if so, with what w? Any w realizing the first d + 1 labels satisfies sign(w^T x_i) = sign(a_i) wherever a_i ≠ 0, so every term below is ≥ 0 and at least one is > 0:

$w^T x_{d+2} = \underbrace{a_1 w^T x_1}_{\ge 0} + \underbrace{a_2 w^T x_2}_{\ge 0} + \cdots + \underbrace{a_{d+1} w^T x_{d+1}}_{\ge 0} > 0$

(contradiction: x_{d+2} cannot be ×)

'general' X (any d + 2 inputs) cannot be shattered  ⇒  d_VC ≤ d + 1

(17)

The VC Dimension VC Dimension of Perceptrons

Fun Time

Based on the proof above, what is d_VC of 1126-D perceptrons?

1  1024
2  1126
3  1127
4  6211

Reference Answer: 3

Well, too much fun for this section! :-)


(18)

The VC Dimension Physical Intuition of VC Dimension

Degrees of Freedom

[figure: tuning dials illustrating degrees of freedom (modified from the work of Hugues Vermeiren on http://www.texample.net)]

hypothesis parameters w = (w_0, w_1, ..., w_d): create degrees of freedom
hypothesis quantity M = |H|: 'analog' degrees of freedom
hypothesis 'power' d_VC = d + 1: effective 'binary' degrees of freedom

d_VC(H): the powerfulness of H

(19)

The VC Dimension Physical Intuition of VC Dimension

Two Old Friends

Positive Rays (d_VC = 1): on points x_1, x_2, ..., x_N, h(x) = −1 to the left of the threshold a and h(x) = +1 to the right; free parameter: a

Positive Intervals (d_VC = 2): h(x) = +1 inside the interval (ℓ, r) and −1 outside; free parameters: ℓ, r

practical rule of thumb:

d_VC ≈ #free parameters (but not always)


(20)

The VC Dimension Physical Intuition of VC Dimension

M and d_VC

copied from Lecture 5 :-)

1 can we make sure that E_out(g) is close enough to E_in(g)?
2 can we make E_in(g) small enough?

small M:     1 Yes!, P[BAD] ≤ 2 · M · exp(...)              2 No!, too few choices
large M:     1 No!,  P[BAD] ≤ 2 · M · exp(...)              2 Yes!, many choices
small d_VC:  1 Yes!, P[BAD] ≤ 4 · (2N)^{d_VC} · exp(...)    2 No!, too limited power
large d_VC:  1 No!,  P[BAD] ≤ 4 · (2N)^{d_VC} · exp(...)    2 Yes!, lots of power

using the right d_VC (or H) is important

(21)

The VC Dimension Physical Intuition of VC Dimension

Fun Time

Origin-crossing hyperplanes are essentially perceptrons with w_0 fixed at 0. Make a guess about the d_VC of origin-crossing hyperplanes in R^d.

1  1
2  d
3  d + 1
4

Reference Answer: 2

The proof is almost the same as that for the usual perceptrons, but it is the intuition (d_VC ≈ #free parameters) that you shall use to answer this quiz.


(22)

The VC Dimension Interpreting VC Dimension

VC Bound Rephrase: Penalty for Model Complexity

For any g = A(D) ∈ H and 'statistical' large D, for N ≥ 2, d_VC ≥ 2 (if a break point exists):

$\underbrace{\mathbb{P}_{\mathcal{D}}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big]}_{\text{BAD}} \;\le\; \underbrace{4\,(2N)^{d_{\text{VC}}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)}_{\delta}$

Rephrase: with probability ≥ 1 − δ, GOOD: |E_in(g) − E_out(g)| ≤ ε. Setting

$\delta = 4\,(2N)^{d_{\text{VC}}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)$

and solving for ε:

$\frac{\delta}{4\,(2N)^{d_{\text{VC}}}} = \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)
\;\Longleftrightarrow\;
\ln\!\Big(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\Big) = \tfrac{1}{8}\epsilon^2 N
\;\Longleftrightarrow\;
\epsilon = \sqrt{\frac{8}{N}\ln\!\Big(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\Big)}$

$\sqrt{\;\cdots\;} = \Omega(N, \mathcal{H}, \delta)$: the penalty for model complexity

(23)

The VC Dimension Interpreting VC Dimension

VC Bound Rephrase: Penalty for Model Complexity

For any g = A(D) ∈ H and 'statistical' large D, for N ≥ 2, d_VC ≥ 2, the same bound

$\underbrace{\mathbb{P}_{\mathcal{D}}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big]}_{\text{BAD}} \;\le\; \underbrace{4\,(2N)^{d_{\text{VC}}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)}_{\delta}$

rephrases to: with probability ≥ 1 − δ, GOOD!

generalization error: $|E_{\text{in}}(g) - E_{\text{out}}(g)| \le \sqrt{\frac{8}{N}\ln\!\Big(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\Big)}$

that is,

$E_{\text{in}}(g) - \sqrt{\frac{8}{N}\ln\!\Big(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\Big)} \;\le\; E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \sqrt{\frac{8}{N}\ln\!\Big(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\Big)}$

$\sqrt{\;\cdots\;} = \Omega(N, \mathcal{H}, \delta)$: the penalty for model complexity

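A minimal numerical sketch of the rephrased bound (my illustration, not from the slides; the function name omega is mine): compute the error bar Ω(N, H, δ) and see how it grows with d_VC and shrinks with N:

```python
import math

def omega(N, d_vc, delta):
    """Model-complexity penalty: sqrt( (8/N) * ln( 4 (2N)^d_vc / delta ) )."""
    return math.sqrt(8.0 / N * math.log(4 * (2 * N) ** d_vc / delta))

# with probability >= 1 - delta:  E_out(g) <= E_in(g) + omega(N, d_vc, delta)
for d_vc in (1, 3, 10):
    bars = [round(omega(N, d_vc, delta=0.1), 3) for N in (100, 1_000, 10_000)]
    print(f"d_VC={d_vc:>2}  omega for N=100, 1e3, 1e4: {bars}")
```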

(24)

The VC Dimension Interpreting VC Dimension

THE VC Message

with a high probability,

$E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \underbrace{\sqrt{\frac{8}{N}\ln\!\Big(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\Big)}}_{\Omega(N,\mathcal{H},\delta)}$

[figure: Error versus VC dimension d_VC, showing the in-sample error, the model complexity, and the out-of-sample error]

d_VC ↑: E_in ↓ but Ω ↑
d_VC ↓: Ω ↓ but E_in ↑
best d_VC in the middle

powerful H not always good!

(25)

The VC Dimension Interpreting VC Dimension

VC Bound Rephrase: Sample Complexity

For any g = A(D) ∈ H and 'statistical' large D, for N ≥ 2, d_VC ≥ 2 (if a break point exists):

$\underbrace{\mathbb{P}_{\mathcal{D}}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big]}_{\text{BAD}} \;\le\; \underbrace{4\,(2N)^{d_{\text{VC}}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)}_{\delta}$

given specs ε = 0.1, δ = 0.1, d_VC = 3, we want $4\,(2N)^{d_{\text{VC}}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) \le \delta$:

      N       bound
    100       2.82 × 10^7
  1,000       9.17 × 10^9
 10,000       1.19 × 10^8
100,000       1.65 × 10^{-38}
 29,300       9.99 × 10^{-2}

sample complexity: need N ≈ 10,000 · d_VC in theory

practical rule of thumb: N ≈ 10 · d_VC is often enough!

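As a quick check of the table (my sketch, not from the slides; the helper names are mine), scan N until the bound drops below δ; with ε = δ = 0.1 and d_VC = 3 the crossing is indeed near N ≈ 29,300:

```python
import math

def vc_bound(N, epsilon, d_vc):
    return 4 * (2 * N) ** d_vc * math.exp(-(epsilon ** 2) * N / 8)

def sample_complexity(epsilon, delta, d_vc, step=100):
    """Smallest N on a grid of `step` with 4 (2N)^d_vc exp(-eps^2 N / 8) <= delta."""
    N = step
    while vc_bound(N, epsilon, d_vc) > delta:
        N += step
    return N

print(sample_complexity(epsilon=0.1, delta=0.1, d_vc=3))   # about 29,300
```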

(26)

The VC Dimension Interpreting VC Dimension

Looseness of VC Bound

$\mathbb{P}_{\mathcal{D}}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big] \;\le\; 4\,(2N)^{d_{\text{VC}}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)$  (if a break point exists)

theory: N ≈ 10,000 d_VC; practice: N ≈ 10 d_VC. Why so loose?

• Hoeffding for an unknown E_out: works for any distribution and any target
• m_H(N) instead of |H(x_1, ..., x_N)|: works for 'any' data
• N^{d_VC} instead of m_H(N): works for 'any' H of the same d_VC
• union bound on the worst cases: works for any choice made by A

but alternatives are hardly better, and the bound is 'similarly loose for all models', so the philosophical message of the VC bound remains important for improving ML

(27)

The VC Dimension Interpreting VC Dimension

Fun Time

Consider the VC Bound below. How can we decrease the probability of getting BAD data?

$\mathbb{P}_{\mathcal{D}}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big] \;\le\; 4\,(2N)^{d_{\text{VC}}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)$  (if a break point exists)

1  decrease model complexity d_VC
2  increase data size N a lot
3  increase generalization error tolerance
4  all of the above

Reference Answer: 4

Congratulations on being a Master of the VC bound! :-)


(28)

The VC Dimension Interpreting VC Dimension

Summary

1 When Can Machines Learn?

2 Why Can Machines Learn?

Lecture 6: Theory of Generalization
Lecture 7: The VC Dimension
  Definition of VC Dimension: the maximum non-break point
  VC Dimension of Perceptrons: d_VC(H) = d + 1
  Physical Intuition of VC Dimension: d_VC ≈ #free parameters
  Interpreting VC Dimension: loosely, model complexity and sample complexity

next: more than noiseless binary classification?

3 How Can Machines Learn?

4 How Can Machines Learn Better?
