(1)

Machine Learning Foundations (機器學習基石)

Lecture 7: The VC Dimension

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)


(2)

The VC Dimension

Roadmap

1 When Can Machines Learn?

2 Why Can Machines Learn?

Lecture 6: Theory of Generalization
E_out ≈ E_in is possible if m_H(N) breaks somewhere and N is large enough.

Lecture 7: The VC Dimension
  Definition of VC Dimension
  VC Dimension of Perceptrons
  Physical Intuition of VC Dimension
  Interpreting VC Dimension

3 How Can Machines Learn?

4 How Can Machines Learn Better?

(3)

The VC Dimension Definition of VC Dimension

Recap: More on Growth Function

m_H(N) with break point k is bounded by the bounding function
$B(N,k) = \sum_{i=0}^{k-1} \binom{N}{i}$, whose highest-order term is $N^{k-1}$:

B(N,k)     k=1   k=2   k=3   k=4   k=5
  N=1        1     2     2     2     2
  N=2        1     3     4     4     4
  N=3        1     4     7     8     8
  N=4        1     5    11    15    16
  N=5        1     6    16    26    31
  N=6        1     7    22    42    57

N^{k-1}    k=1   k=2   k=3   k=4   k=5
  N=1        1     1     1     1     1
  N=2        1     2     4     8    16
  N=3        1     3     9    27    81
  N=4        1     4    16    64   256
  N=5        1     5    25   125   625
  N=6        1     6    36   216  1296

provably and loosely, for N ≥ 2 and k ≥ 3:

$m_{\mathcal{H}}(N) \le B(N,k) = \sum_{i=0}^{k-1} \binom{N}{i} \le N^{\,k-1}$

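To make the recap tangible, here is a minimal Python sketch (mine, not part of the lecture; the helper name B is just for illustration) that evaluates B(N, k) from its definition, reproduces the two tables above, and checks the loose polynomial bound N^{k-1}:

```python
from math import comb

def B(N, k):
    """Bounding function: B(N, k) = sum_{i=0}^{k-1} C(N, i)."""
    return sum(comb(N, i) for i in range(k))

# reproduce the two tables on this slide
for N in range(1, 7):
    print(f"N={N}  B(N,k): {[B(N, k) for k in range(1, 6)]}"
          f"  N^(k-1): {[N ** (k - 1) for k in range(1, 6)]}")

# the loose bound B(N, k) <= N^(k-1) for N >= 2, k >= 3
assert all(B(N, k) <= N ** (k - 1)
           for N in range(2, 30) for k in range(3, 8))
```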

(4)

The VC Dimension Definition of VC Dimension

Recap: More on Vapnik-Chervonenkis (VC) Bound

For any g = A(D) ∈ H and 'statistical' large D, for N ≥ 2, k ≥ 3:

$\mathbb{P}_{\mathcal{D}}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big]
\;\le\; \mathbb{P}_{\mathcal{D}}\big[\,\exists\, h \in \mathcal{H} \text{ s.t. } |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\big]
\;\le\; 4\, m_{\mathcal{H}}(2N)\, \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)
\;\le\; 4\,(2N)^{k-1} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)$

where the last step holds if a break point k exists.

if 1 m_H(N) breaks at k (good H)
and if 2 N is large enough (good D)
  ⇒ probably generalized: 'E_out ≈ E_in',
and if 3 A picks a g with small E_in (good A)
  ⇒ probably learned! (:-) good luck)
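The chain of inequalities above is easy to evaluate numerically. Here is a small sketch (my illustration, not from the slides; the function name vc_bound is mine) of the final polynomial-times-exponential form, showing how the bound shrinks once N grows relative to the break point k:

```python
import math

def vc_bound(N, epsilon, k):
    """4 (2N)^(k-1) exp(-epsilon^2 N / 8): bound on P[|E_in - E_out| > epsilon]
    when m_H breaks at k, using m_H(2N) <= (2N)^(k-1)."""
    return 4 * (2 * N) ** (k - 1) * math.exp(-(epsilon ** 2) * N / 8)

# polynomial growth in N is eventually dominated by the exponential decay
for N in (100, 1_000, 10_000, 100_000):
    print(f"N={N:>7}  bound={vc_bound(N, epsilon=0.1, k=4):.3g}")
```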

(5)

The VC Dimension Definition of VC Dimension

VC Dimension

the formal name of the maximum non-break point

Definition: the VC dimension of H, denoted d_VC(H), is the largest N for which m_H(N) = 2^N, i.e. the most inputs that H can shatter.

• d_VC = 'minimum break point k' − 1
• N ≤ d_VC  ⇒  H can shatter some N inputs
• k > d_VC  ⇒  k is a break point for H

if N ≥ 2 and d_VC ≥ 2:  m_H(N) ≤ N^{d_VC}


(6)

The VC Dimension Definition of VC Dimension

The Four VC Dimensions

positive rays:       m_H(N) = N + 1                       d_VC = 1
positive intervals:  m_H(N) = (1/2)N^2 + (1/2)N + 1       d_VC = 2
convex sets:         m_H(N) = 2^N                          d_VC = ∞
2D perceptrons:      m_H(N) ≤ N^3 for N ≥ 2               d_VC = 3

good: finite d_VC
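The first two rows are easy to verify by brute force. Below is a minimal sketch (not from the slides; the helper names are mine) that enumerates the dichotomies realizable by positive rays and positive intervals on N ordered points, matching m_H(N) = N + 1 and m_H(N) = (1/2)N² + (1/2)N + 1, hence d_VC = 1 and 2:

```python
def ray_dichotomies(N):
    """Dichotomies of x_1 < ... < x_N realizable by h(x) = sign(x - a)."""
    return {tuple([-1] * cut + [+1] * (N - cut)) for cut in range(N + 1)}

def interval_dichotomies(N):
    """Dichotomies realizable by h(x) = +1 iff x lies in the interval (l, r)."""
    return {tuple([-1] * l + [+1] * (r - l) + [-1] * (N - r))
            for l in range(N + 1) for r in range(l, N + 1)}

for N in range(1, 7):
    rays, intervals = len(ray_dichotomies(N)), len(interval_dichotomies(N))
    assert rays == N + 1 and intervals == N * (N + 1) // 2 + 1
    print(N, rays, intervals)
# rays shatter N = 1 but not N = 2 (3 < 4); intervals shatter N = 2 but not N = 3 (7 < 8)
```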

(7)

The VC Dimension Definition of VC Dimension

VC Dimension and Learning

finite d_VC  ⇒  g 'will' generalize (E_out(g) ≈ E_in(g)):
• regardless of learning algorithm A
• regardless of input distribution P
• regardless of target function f

[learning-flow diagram: an unknown target function f: X → Y (the ideal credit approval formula) and an unknown distribution P on X generate x_1, x_2, ..., x_N and future x; the training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in the bank) are fed to the learning algorithm A, which searches the hypothesis set H (set of candidate formulas) and outputs the final hypothesis g ≈ f (the 'learned' formula to be used)]

a 'worst case' guarantee on generalization


(8)

The VC Dimension Definition of VC Dimension

Fun Time

If there is a set of N inputs that cannot be shattered by H, what can we conclude about d_VC(H) based only on this information?

1  d_VC(H) > N
2  d_VC(H) = N
3  d_VC(H) < N
4  no conclusion can be made

Reference Answer: 4

It is possible that there is another set of N inputs that can be shattered, which would mean d_VC ≥ N. It is also possible that no set of N inputs can be shattered, which would mean d_VC < N. Neither case can be ruled out by one non-shattering set.

(9)

The VC Dimension VC Dimension of Perceptrons

2D PLA Revisited

E_out(g) ≈ 0 :-) comes from combining two ingredients:
• E_in(g) = 0: PLA can converge when D is linearly separable (with x_n ∼ P and y_n = f(x_n)) and T is large
• E_out(g) ≈ E_in(g): P[|E_in(g) − E_out(g)| > ε] ≤ ... by d_VC = 3 and N large

what about general PLA for x with more than 2 features?


(10)

The VC Dimension VC Dimension of Perceptrons

VC Dimension of Perceptrons

1D perceptron (pos/neg rays): d_VC = 2
2D perceptrons: d_VC = 3
• d_VC ≥ 3: there are some 3 inputs we can shatter
• d_VC ≤ 3: no 4 inputs can be shattered (e.g. the × ◦ / ◦ × labeling cannot be realized)

d-D perceptrons: d_VC = d + 1?

two steps to prove it:
• d_VC ≥ d + 1
• d_VC ≤ d + 1

(11)

The VC Dimension VC Dimension of Perceptrons

Extra Fun Time

Which statement below shows that d_VC ≥ d + 1?

1  There are some d + 1 inputs we can shatter.
2  We can shatter any set of d + 1 inputs.
3  There are some d + 2 inputs we cannot shatter.
4  We cannot shatter any set of d + 2 inputs.

Reference Answer: 1

d_VC is the maximum N for which m_H(N) = 2^N, and m_H(N) is the largest number of dichotomies on N inputs. So if we can find 2^{d+1} dichotomies on some d + 1 inputs, then m_H(d + 1) = 2^{d+1} and hence d_VC ≥ d + 1.


(12)

The VC Dimension VC Dimension of Perceptrons

d_VC ≥ d + 1

There are some d + 1 inputs we can shatter. Take some 'trivial' inputs (each row includes the constant coordinate x_0 = 1):

$X = \begin{bmatrix} x_1^T \\ x_2^T \\ x_3^T \\ \vdots \\ x_{d+1}^T \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & & 0 \\ \vdots & & & \ddots & \\ 1 & 0 & \cdots & 0 & 1 \end{bmatrix}$

(visually in 2D: the points (0,0), (1,0), (0,1))

note: X is invertible!

(13)

The VC Dimension VC Dimension of Perceptrons

Can We Shatter X?

$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_{d+1}^T \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \\ 1 & 0 & \cdots & 0 & 1 \end{bmatrix}$ is invertible.

to shatter: for any $y = (y_1, \ldots, y_{d+1})^T$, find w such that sign(Xw) = y:

$\text{sign}(Xw) = y \;\Longleftarrow\; Xw = y \;\Longleftrightarrow\; w = X^{-1} y$  (X invertible!)

this 'special' X can be shattered  ⇒  d_VC ≥ d + 1

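A numerical version of this argument (my sketch, not part of the slides): build the 'trivial' X for some d, solve w = X^{-1} y for every sign vector y, and confirm that all 2^{d+1} dichotomies are realized:

```python
from itertools import product
import numpy as np

d = 4
# rows are x_1, ..., x_{d+1}; column 0 is the constant coordinate x_0 = 1
X = np.eye(d + 1)
X[:, 0] = 1.0                       # the 'trivial' inputs: invertible by construction
X_inv = np.linalg.inv(X)

for y in product([-1.0, 1.0], repeat=d + 1):
    y = np.array(y)
    w = X_inv @ y                   # Xw = y, so sign(Xw) = y
    assert np.array_equal(np.sign(X @ w), y)

print(f"all {2 ** (d + 1)} dichotomies realized, so d_VC >= d + 1 for d = {d}")
```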

(14)

The VC Dimension VC Dimension of Perceptrons

Extra Fun Time

Which statement below shows that d_VC ≤ d + 1?

1  There are some d + 1 inputs we can shatter.
2  We can shatter any set of d + 1 inputs.
3  There are some d + 2 inputs we cannot shatter.
4  We cannot shatter any set of d + 2 inputs.

Reference Answer: 4

d_VC is the maximum N for which m_H(N) = 2^N, and m_H(N) is the largest number of dichotomies on N inputs. So if we cannot find 2^{d+2} dichotomies on any d + 2 inputs (i.e. d + 2 is a break point), then m_H(d + 2) < 2^{d+2} and hence d_VC < d + 2. That is, d_VC ≤ d + 1.

(15)

The VC Dimension VC Dimension of Perceptrons

d_VC ≤ d + 1 (1/2)

A 2D Special Case

$X = \begin{bmatrix} x_1^T \\ x_2^T \\ x_3^T \\ x_4^T \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}$

if x_1 is × and x_2, x_3 are ◦, can x_4 be ×? No: since x_4 = x_2 + x_3 − x_1,

$w^T x_4 = \underbrace{w^T x_2}_{\circ,\;>0} + \underbrace{w^T x_3}_{\circ,\;>0} - \underbrace{w^T x_1}_{\times,\;<0} > 0$

so x_4 is forced to be ◦:

linear dependence restricts the dichotomy

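To see the restriction concretely, here is a hedged sketch (mine, not from the slides; it uses a linear-programming feasibility test via scipy, which the lecture does not) that checks which of the 16 dichotomies of these four points a 2D perceptron can realize; only the two labelings ruled out by the dependence x_4 = x_2 + x_3 − x_1 are missing, so the four points cannot be shattered:

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

# the four inputs of the 2D special case, with bias coordinate x_0 = 1
X = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 1., 1.]])

def separable(y):
    """Feasibility LP: is there a w with y_n * (w . x_n) >= 1 for every n?"""
    res = linprog(c=np.zeros(3),
                  A_ub=-(y[:, None] * X), b_ub=-np.ones(len(y)),
                  bounds=[(None, None)] * 3)
    return res.success

dichotomies = [np.array(y) for y in product([-1., 1.], repeat=4)]
missing = [tuple(int(v) for v in y) for y in dichotomies if not separable(y)]
print(len(dichotomies) - len(missing), "of 16 dichotomies realizable")  # 14, not 16
print("forbidden by the linear dependence:", missing)
```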

(16)

The VC Dimension VC Dimension of Perceptrons

d_VC ≤ d + 1 (2/2)

d-D General Case

$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_{d+1}^T \\ x_{d+2}^T \end{bmatrix}$ has more rows than columns, so the rows are linearly dependent (with some a_i non-zero):

$x_{d+2} = a_1 x_1 + a_2 x_2 + \cdots + a_{d+1} x_{d+1}$

can you generate the dichotomy (sign(a_1), sign(a_2), ..., sign(a_{d+1}), ×)? if so, with what w? Any w realizing the first d + 1 labels satisfies sign(w^T x_i) = sign(a_i) wherever a_i ≠ 0, so every term below is ≥ 0 and at least one is > 0:

$w^T x_{d+2} = \underbrace{a_1 w^T x_1}_{\ge 0} + \underbrace{a_2 w^T x_2}_{\ge 0} + \cdots + \underbrace{a_{d+1} w^T x_{d+1}}_{\ge 0} > 0$

(contradiction: x_{d+2} cannot be ×)

'general' X (any d + 2 inputs) cannot be shattered  ⇒  d_VC ≤ d + 1

(17)

The VC Dimension VC Dimension of Perceptrons

Fun Time

Based on the proof above, what is d_VC of 1126-D perceptrons?

1  1024
2  1126
3  1127
4  6211

Reference Answer: 3

Well, too much fun for this section! :-)


(18)

The VC Dimension Physical Intuition of VC Dimension

Degrees of Freedom

[figure: tuning dials illustrating degrees of freedom (modified from the work of Hugues Vermeiren on http://www.texample.net)]

hypothesis parameters w = (w_0, w_1, ..., w_d): create degrees of freedom
hypothesis quantity M = |H|: 'analog' degrees of freedom
hypothesis 'power' d_VC = d + 1: effective 'binary' degrees of freedom

d_VC(H): the powerfulness of H

(19)

The VC Dimension Physical Intuition of VC Dimension

Two Old Friends

Positive Rays (d_VC = 1): on points x_1, x_2, ..., x_N, h(x) = −1 to the left of the threshold a and h(x) = +1 to the right; free parameter: a

Positive Intervals (d_VC = 2): h(x) = +1 inside the interval (ℓ, r) and −1 outside; free parameters: ℓ, r

practical rule of thumb:

d_VC ≈ #free parameters (but not always)


(20)

The VC Dimension Physical Intuition of VC Dimension

M and d_VC

copied from Lecture 5 :-)

1 can we make sure that E_out(g) is close enough to E_in(g)?
2 can we make E_in(g) small enough?

small M:     1 Yes!, P[BAD] ≤ 2 · M · exp(...)              2 No!, too few choices
large M:     1 No!,  P[BAD] ≤ 2 · M · exp(...)              2 Yes!, many choices
small d_VC:  1 Yes!, P[BAD] ≤ 4 · (2N)^{d_VC} · exp(...)    2 No!, too limited power
large d_VC:  1 No!,  P[BAD] ≤ 4 · (2N)^{d_VC} · exp(...)    2 Yes!, lots of power

using the right d_VC (or H) is important

(21)

The VC Dimension Physical Intuition of VC Dimension

Fun Time

Origin-crossing hyperplanes are essentially perceptrons with w_0 fixed at 0. Make a guess about the d_VC of origin-crossing hyperplanes in R^d.

1  1
2  d
3  d + 1
4

Reference Answer: 2

The proof is almost the same as that for the usual perceptrons, but it is the intuition (d_VC ≈ #free parameters) that you shall use to answer this quiz.


(22)

The VC Dimension Interpreting VC Dimension

VC Bound Rephrase: Penalty for Model Complexity

For any g = A(D) ∈ H and 'statistical' large D, for N ≥ 2, d_VC ≥ 2 (if a break point exists):

$\underbrace{\mathbb{P}_{\mathcal{D}}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big]}_{\text{BAD}} \;\le\; \underbrace{4\,(2N)^{d_{\text{VC}}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)}_{\delta}$

Rephrase: with probability ≥ 1 − δ, GOOD: |E_in(g) − E_out(g)| ≤ ε. Setting

$\delta = 4\,(2N)^{d_{\text{VC}}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)$

and solving for ε:

$\frac{\delta}{4\,(2N)^{d_{\text{VC}}}} = \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)
\;\Longleftrightarrow\;
\ln\!\Big(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\Big) = \tfrac{1}{8}\epsilon^2 N
\;\Longleftrightarrow\;
\epsilon = \sqrt{\frac{8}{N}\ln\!\Big(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\Big)}$

$\sqrt{\;\cdots\;} = \Omega(N, \mathcal{H}, \delta)$: the penalty for model complexity

(23)

The VC Dimension Interpreting VC Dimension

VC Bound Rephrase: Penalty for Model Complexity

For any g = A(D) ∈ H and 'statistical' large D, for N ≥ 2, d_VC ≥ 2, the same bound

$\underbrace{\mathbb{P}_{\mathcal{D}}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big]}_{\text{BAD}} \;\le\; \underbrace{4\,(2N)^{d_{\text{VC}}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)}_{\delta}$

rephrases to: with probability ≥ 1 − δ, GOOD!

generalization error: $|E_{\text{in}}(g) - E_{\text{out}}(g)| \le \sqrt{\frac{8}{N}\ln\!\Big(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\Big)}$

that is,

$E_{\text{in}}(g) - \sqrt{\frac{8}{N}\ln\!\Big(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\Big)} \;\le\; E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \sqrt{\frac{8}{N}\ln\!\Big(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\Big)}$

$\sqrt{\;\cdots\;} = \Omega(N, \mathcal{H}, \delta)$: the penalty for model complexity

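A minimal numerical sketch of the rephrased bound (my illustration, not from the slides; the function name omega is mine): compute the error bar Ω(N, H, δ) and see how it grows with d_VC and shrinks with N:

```python
import math

def omega(N, d_vc, delta):
    """Model-complexity penalty: sqrt( (8/N) * ln( 4 (2N)^d_vc / delta ) )."""
    return math.sqrt(8.0 / N * math.log(4 * (2 * N) ** d_vc / delta))

# with probability >= 1 - delta:  E_out(g) <= E_in(g) + omega(N, d_vc, delta)
for d_vc in (1, 3, 10):
    bars = [round(omega(N, d_vc, delta=0.1), 3) for N in (100, 1_000, 10_000)]
    print(f"d_VC={d_vc:>2}  omega for N=100, 1e3, 1e4: {bars}")
```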

(24)

The VC Dimension Interpreting VC Dimension

THE VC Message

with a high probability,

$E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \underbrace{\sqrt{\frac{8}{N}\ln\!\Big(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\Big)}}_{\Omega(N,\mathcal{H},\delta)}$

[figure: Error versus VC dimension d_VC, showing the in-sample error, the model complexity, and the out-of-sample error]

d_VC ↑: E_in ↓ but Ω ↑
d_VC ↓: Ω ↓ but E_in ↑
best d_VC in the middle

powerful H not always good!

(25)

The VC Dimension Interpreting VC Dimension

VC Bound Rephrase: Sample Complexity

For any g = A(D) ∈ H and 'statistical' large D, for N ≥ 2, d_VC ≥ 2 (if a break point exists):

$\underbrace{\mathbb{P}_{\mathcal{D}}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big]}_{\text{BAD}} \;\le\; \underbrace{4\,(2N)^{d_{\text{VC}}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)}_{\delta}$

given specs ε = 0.1, δ = 0.1, d_VC = 3, we want $4\,(2N)^{d_{\text{VC}}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) \le \delta$:

      N       bound
    100       2.82 × 10^7
  1,000       9.17 × 10^9
 10,000       1.19 × 10^8
100,000       1.65 × 10^{-38}
 29,300       9.99 × 10^{-2}

sample complexity: need N ≈ 10,000 · d_VC in theory

practical rule of thumb: N ≈ 10 · d_VC is often enough!

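As a quick check of the table (my sketch, not from the slides; the helper names are mine), scan N until the bound drops below δ; with ε = δ = 0.1 and d_VC = 3 the crossing is indeed near N ≈ 29,300:

```python
import math

def vc_bound(N, epsilon, d_vc):
    return 4 * (2 * N) ** d_vc * math.exp(-(epsilon ** 2) * N / 8)

def sample_complexity(epsilon, delta, d_vc, step=100):
    """Smallest N on a grid of `step` with 4 (2N)^d_vc exp(-eps^2 N / 8) <= delta."""
    N = step
    while vc_bound(N, epsilon, d_vc) > delta:
        N += step
    return N

print(sample_complexity(epsilon=0.1, delta=0.1, d_vc=3))   # about 29,300
```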

(26)

The VC Dimension Interpreting VC Dimension

Looseness of VC Bound

$\mathbb{P}_{\mathcal{D}}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big] \;\le\; 4\,(2N)^{d_{\text{VC}}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)$  (if a break point exists)

theory: N ≈ 10,000 d_VC; practice: N ≈ 10 d_VC. Why so loose?

• Hoeffding for an unknown E_out: works for any distribution and any target
• m_H(N) instead of |H(x_1, ..., x_N)|: works for 'any' data
• N^{d_VC} instead of m_H(N): works for 'any' H of the same d_VC
• union bound on the worst cases: works for any choice made by A

but alternatives are hardly better, and the bound is 'similarly loose for all models', so the philosophical message of the VC bound remains important for improving ML

(27)

The VC Dimension Interpreting VC Dimension

Fun Time

Consider the VC Bound below. How can we decrease the probability of getting BAD data?

$\mathbb{P}_{\mathcal{D}}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big] \;\le\; 4\,(2N)^{d_{\text{VC}}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big)$  (if a break point exists)

1  decrease model complexity d_VC
2  increase data size N a lot
3  increase generalization error tolerance
4  all of the above

Reference Answer: 4

Congratulations on being a Master of the VC bound! :-)


(28)

The VC Dimension Interpreting VC Dimension

Summary

1 When Can Machines Learn?

2 Why Can Machines Learn?

Lecture 6: Theory of Generalization
Lecture 7: The VC Dimension
  Definition of VC Dimension: the maximum non-break point
  VC Dimension of Perceptrons: d_VC(H) = d + 1
  Physical Intuition of VC Dimension: d_VC ≈ #free parameters
  Interpreting VC Dimension: loosely, model complexity and sample complexity

next: more than noiseless binary classification?

3 How Can Machines Learn?

4 How Can Machines Learn Better?
