(1)

## Machine Learning Foundations (機器學習基石)

### Lecture 7: The VC Dimension

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

### Department of Computer Science and Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 0/26

(2)

The VC Dimension

### 1 When Can Machines Learn?

### 2 Why Can Machines Learn?

Lecture 6: Theory of Generalization: learning is possible if $m_{\mathcal{H}}(N)$ breaks somewhere and $N$ is large enough

**Lecture 7: The VC Dimension**

### 3 How Can Machines Learn?

### 4 How Can Machines Learn Better?

(3)

The VC Dimension · Definition of VC Dimension

## Recap: More on Growth Function

for $\mathcal{H}$ with break point $k$:

$$m_{\mathcal{H}}(N) \;\le\; \underbrace{\sum_{i=0}^{k-1} \binom{N}{i}}_{\text{highest term } N^{\,k-1}}$$

provably & loosely, for $N \ge 2$, $k \ge 3$:

$$m_{\mathcal{H}}(N) \le N^{k-1}$$
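As a quick numeric check (my own illustrative sketch, not from the slides), the binomial sum and its polynomial upper bound can be compared directly; here I use `k = 4`, the break point of 2D perceptrons mentioned in the course:

```python
from math import comb

def growth_bound(N: int, k: int) -> int:
    """Upper bound on m_H(N) when k is a break point: sum_{i=0}^{k-1} C(N, i)."""
    return sum(comb(N, i) for i in range(k))

# for 2D perceptrons the break point is k = 4
N, k = 10, 4
tight = growth_bound(N, k)        # the binomial-sum bound
loose = N ** (k - 1)              # the looser polynomial bound (N >= 2, k >= 3)
assert tight <= loose < 2 ** N    # polynomial in N beats exponential 2^N
print(tight, loose)               # 176 1000
```

Both bounds are polynomial in N, which is exactly what makes the VC bound below non-trivial.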

(4)

## Recap: More on Vapnik-Chervonenkis (VC) Bound

For any $g = \mathcal{A}(\mathcal{D}) \in \mathcal{H}$ and 'statistical' data $\mathcal{D}$,

$$
\mathbb{P}_{\mathcal{D}}\Big[\big|E_{\text{in}}(g) - E_{\text{out}}(g)\big| > \epsilon\Big]
\;\le\; \mathbb{P}_{\mathcal{D}}\Big[\exists h \in \mathcal{H} \text{ s.t. } \big|E_{\text{in}}(h) - E_{\text{out}}(h)\big| > \epsilon\Big]
$$

$$
\le\; 4\, m_{\mathcal{H}}(2N) \exp\!\left(-\tfrac{1}{8}\epsilon^2 N\right)
\;\le\; 4\, (2N)^{k-1} \exp\!\left(-\tfrac{1}{8}\epsilon^2 N\right)
$$

if 1 $m_{\mathcal{H}}(N)$ breaks at $k$ (good $\mathcal{H}$) and 2 $N$ large enough (good $\mathcal{D}$)
=⇒ probably generalized '$E_{\text{out}} \approx E_{\text{in}}$',
and if 3 $\mathcal{A}$ picks a $g$ with small $E_{\text{in}}$ (good $\mathcal{A}$)
=⇒ **probably** learned! (:-) good luck)

(5)

## VC Dimension

the formal name of **maximum non-break point**.

Definition: the VC dimension of $\mathcal{H}$, denoted $d_{\text{VC}}(\mathcal{H})$, is the largest $N$ for which $m_{\mathcal{H}}(N) = 2^N$, i.e. the **most** inputs that $\mathcal{H}$ can shatter.

- $d_{\text{VC}}$ = 'minimum break point $k$' − 1
- $N \le d_{\text{VC}}$ =⇒ $\mathcal{H}$ can shatter some $N$ inputs
- $k > d_{\text{VC}}$ =⇒ $k$ is a break point for $\mathcal{H}$

so, if $N \ge 2$ and $d_{\text{VC}} \ge 2$,

$$m_{\mathcal{H}}(N) \le N^{d_{\text{VC}}}$$
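The definition can be made concrete with a brute-force shattering check. This sketch (mine, not from the slides) enumerates the dichotomies that positive rays $h(x) = \text{sign}(x - a)$ realize and confirms $d_{\text{VC}} = 1$:

```python
def dichotomies_positive_rays(xs):
    """All dichotomies h(x) = sign(x - a) can realize on the inputs xs."""
    xs = sorted(xs)
    # one representative threshold per gap, plus one on each side
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(+1 if x > a else -1 for x in xs) for a in thresholds}

def can_shatter(xs):
    return len(dichotomies_positive_rays(xs)) == 2 ** len(xs)

assert can_shatter([0.5])                      # 1 input shattered: d_VC >= 1
assert not can_shatter([0.2, 0.7])             # (+1, -1) unreachable: d_VC < 2
assert len(dichotomies_positive_rays([1, 2, 3])) == 3 + 1   # m_H(N) = N + 1
```

The same brute-force idea works for any hypothesis set whose dichotomies can be enumerated.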

(6)

## The Four VC Dimensions

- positive rays: $m_{\mathcal{H}}(N) = N + 1$; $d_{\text{VC}} = 1$
- positive intervals: $m_{\mathcal{H}}(N) = \frac{1}{2}N^2 + \frac{1}{2}N + 1$; $d_{\text{VC}} = 2$
- convex sets: $m_{\mathcal{H}}(N) = 2^N$; $d_{\text{VC}} = \infty$
- 2D perceptrons: $m_{\mathcal{H}}(N) \le N^3$ for $N \ge 2$; $d_{\text{VC}} = 3$

good: **finite $d_{\text{VC}}$**
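The positive-interval count above is easy to verify by enumeration. A small sketch of mine (indices stand in for sorted point positions; the interval $[i, j)$ gets label +1):

```python
from itertools import combinations

def positive_interval_dichotomies(N: int) -> int:
    """Count the dichotomies positive intervals realize on N ordered points."""
    dichos = {(-1,) * N}                      # the empty interval: all -1
    for i, j in combinations(range(N + 1), 2):
        dichos.add(tuple(+1 if i <= p < j else -1 for p in range(N)))
    return len(dichos)

# matches the slide's formula m_H(N) = (1/2) N^2 + (1/2) N + 1
for N in range(1, 8):
    assert positive_interval_dichotomies(N) == N * (N + 1) // 2 + 1

# and the claimed d_VC = 2: N = 2 is shattered, N = 3 is not
assert positive_interval_dichotomies(2) == 2 ** 2
assert positive_interval_dichotomies(3) < 2 ** 3
```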

(7)

## VC Dimension and Learning

**finite $d_{\text{VC}}$** =⇒ $g$ 'will generalize' ($E_{\text{out}}(g) \approx E_{\text{in}}(g)$):

- regardless of learning algorithm $\mathcal{A}$
- regardless of input distribution $P$
- regardless of target function $f$

(learning-flow diagram omitted)

'worst case' guarantee on generalization

(8)

## Fun Time

Suppose you find one set of $N$ inputs that $\mathcal{H}$ cannot shatter. Based only on this fact, what can we conclude about $d_{\text{VC}}(\mathcal{H})$?

1 $d_{\text{VC}}(\mathcal{H}) > N$

2 $d_{\text{VC}}(\mathcal{H}) = N$

3 $d_{\text{VC}}(\mathcal{H}) < N$

4 none of the above

### Answer: 4

It is possible that there is another set of $N$ inputs that can be shattered, which means $d_{\text{VC}} \ge N$. It is also possible that no set of $N$ inputs can be shattered, which means $d_{\text{VC}} < N$. Neither case can be ruled out by one non-shattering set.

(9)

The VC Dimension · VC Dimension of Perceptrons

## 2D PLA Revisited

(PLA learning-flow diagram omitted: PLA on linearly separable $\mathcal{D}$ finds a $g$ with $E_{\text{in}}(g) = 0$, and $d_{\text{VC}} = 3$ bounds its generalization)

what about general PLA for $\mathbf{x}$ with more than 2 features?

(10)

## VC Dimension of Perceptrons

- 1D perceptron (pos/neg rays): $d_{\text{VC}} = 2$
- 2D perceptrons: $d_{\text{VC}} = 3$
- claim: $d$-D perceptrons: $d_{\text{VC}} = d + 1$

two steps to prove it:

- step 1: $d_{\text{VC}} \ge d + 1$
- step 2: $d_{\text{VC}} \le d + 1$

(11)

## Extra Fun Time

Which statement, if true, implies $d_{\text{VC}} \ge d + 1$ for $d$-D perceptrons?

1 There are some $d + 1$ inputs we can shatter.

2 We can shatter any set of $d + 1$ inputs.

3 There are some $d + 2$ inputs we cannot shatter.

4 We cannot shatter any set of $d + 2$ inputs.

### Answer: 1

$d_{\text{VC}}$ is the maximum $N$ for which $m_{\mathcal{H}}(N) = 2^N$, and $m_{\mathcal{H}}(N)$ is the most number of dichotomies on $N$ inputs. So if we can find $2^{d+1}$ dichotomies on some $d + 1$ inputs, then $m_{\mathcal{H}}(d + 1) = 2^{d+1}$ and hence $d_{\text{VC}} \ge d + 1$.

(12)

## d_VC ≥ d + 1

There are **some** $d + 1$ inputs we can shatter.

some 'trivial' inputs (each row $\mathbf{x}_i^T$ includes the constant coordinate $x_0 = 1$):

$$
X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_{d+1}^T \end{bmatrix}
  = \begin{bmatrix}
      1 & 0 & 0 & \cdots & 0 \\
      1 & 1 & 0 & \cdots & 0 \\
      1 & 0 & 1 & \cdots & 0 \\
      \vdots & & & \ddots & \\
      1 & 0 & 0 & \cdots & 1
    \end{bmatrix}
$$

note: $X$ invertible!

(13)

## Can We Shatter X?

to shatter the trivial (invertible) $X$: for any desired dichotomy

$$\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_{d+1} \end{bmatrix},$$

find $\mathbf{w}$ with

$$\text{sign}(X\mathbf{w}) = \mathbf{y} \;\Longleftarrow\; X\mathbf{w} = \mathbf{y} \;\Longleftrightarrow\; \mathbf{w} = X^{-1}\mathbf{y}$$

'special' $X$ can be shattered =⇒ $d_{\text{VC}} \ge d + 1$
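This construction is easy to verify numerically. A sketch of mine (`d = 3` is an arbitrary small choice): build the trivial $X$, solve $\mathbf{w} = X^{-1}\mathbf{y}$ for every labeling, and check all dichotomies are realized.

```python
import numpy as np
from itertools import product

d = 3  # an arbitrary small dimension for illustration
# the slide's 'trivial' inputs: constant coordinate x0 = 1, then the identity pattern
X = np.hstack([np.ones((d + 1, 1)),
               np.vstack([np.zeros((1, d)), np.eye(d)])])

for labels in product([-1.0, 1.0], repeat=d + 1):
    y = np.array(labels)
    w = np.linalg.solve(X, y)                 # w = X^{-1} y, so Xw = y exactly
    assert np.array_equal(np.sign(X @ w), y)  # every dichotomy is realized

print(f"all {2 ** (d + 1)} dichotomies realized: d_VC >= {d + 1}")
```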

(14)

## Extra Fun Time

Which statement, if true, implies $d_{\text{VC}} \le d + 1$ for $d$-D perceptrons?

1 There are some $d + 1$ inputs we can shatter.

2 We can shatter any set of $d + 1$ inputs.

3 There are some $d + 2$ inputs we cannot shatter.

4 We cannot shatter any set of $d + 2$ inputs.

### Answer: 4

$d_{\text{VC}}$ is the maximum $N$ for which $m_{\mathcal{H}}(N) = 2^N$, and $m_{\mathcal{H}}(N)$ is the most number of dichotomies on $N$ inputs. So if we cannot find $2^{d+2}$ dichotomies on any $d + 2$ inputs (i.e. $d + 2$ is a break point), then $m_{\mathcal{H}}(d + 2) < 2^{d+2}$ and hence $d_{\text{VC}} < d + 2$. That is, $d_{\text{VC}} \le d + 1$.

(15)

## d_VC ≤ d + 1 (1/2)

a 2D example with 4 inputs: suppose

$$\mathbf{x}_4 = \mathbf{x}_2 + \mathbf{x}_3 - \mathbf{x}_1.$$

Then for any weights $\mathbf{w}$,

$$\mathbf{w}^T\mathbf{x}_4 = \mathbf{w}^T\mathbf{x}_2 + \mathbf{w}^T\mathbf{x}_3 - \mathbf{w}^T\mathbf{x}_1,$$

so if a dichotomy labels $\mathbf{x}_2, \mathbf{x}_3$ as ○ ($\mathbf{w}^T\mathbf{x} > 0$) and $\mathbf{x}_1$ as × ($\mathbf{w}^T\mathbf{x} < 0$), then necessarily $\mathbf{w}^T\mathbf{x}_4 > 0$: $\mathbf{x}_4$ cannot be ×.

**linear dependence restricts dichotomy**

(16)

## d_VC ≤ d + 1 (2/2)

a general set of $d + 2$ inputs in $(d+1)$-dimensional (with $x_0$) space:

$$X = \begin{bmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_{d+2}^T \end{bmatrix}$$

more rows than columns =⇒ linear dependence (some $a_i$ non-zero):

$$\mathbf{x}_{d+2} = a_1\mathbf{x}_1 + a_2\mathbf{x}_2 + \ldots + a_{d+1}\mathbf{x}_{d+1}$$

can you generate the dichotomy $\big(\text{sign}(a_1), \text{sign}(a_2), \ldots, \text{sign}(a_{d+1}), \times\big)$? for any $\mathbf{w}$ realizing its first $d + 1$ labels, each non-zero term $a_i\,\mathbf{w}^T\mathbf{x}_i$ is positive, so

$$\mathbf{w}^T\mathbf{x}_{d+2} = \underbrace{a_1\mathbf{w}^T\mathbf{x}_1 + a_2\mathbf{w}^T\mathbf{x}_2 + \ldots + a_{d+1}\mathbf{w}^T\mathbf{x}_{d+1}}_{\text{each term} \,\ge\, 0,\ \text{some} \,>\, 0} > 0,$$

and $\mathbf{x}_{d+2}$ cannot be ×.

'general' $X$ no-shatter =⇒ $d_{\text{VC}} \le d + 1$
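To make the linear-dependence argument concrete, here is a small numeric check of mine (the specific 2D points are my own assumption): with $\mathbf{x}_4 = \mathbf{x}_2 + \mathbf{x}_3 - \mathbf{x}_1$, any $\mathbf{w}$ realizing (×, ○, ○) on the first three points is forced to label $\mathbf{x}_4$ as ○.

```python
import numpy as np

rng = np.random.default_rng(0)
# four 2D inputs (with bias coordinate x0 = 1) chosen so that x4 = x2 + x3 - x1
x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([1.0, 1.0, 0.0])
x3 = np.array([1.0, 0.0, 1.0])
x4 = x2 + x3 - x1

# search for a w realizing the dichotomy (x, o, o, x) = (-1, +1, +1, -1)
for _ in range(100_000):
    w = rng.normal(size=3)
    if w @ x1 < 0 and w @ x2 > 0 and w @ x3 > 0:
        # then w.x4 = w.x2 + w.x3 - w.x1 > 0 necessarily
        assert w @ x4 > 0
print("dichotomy (x, o, o, x) is never realized: x4 is forced to o")
```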

(17)

## Fun Time

Based on the results of this section, what is $d_{\text{VC}}$ of the 1126-D perceptron?

1 1024

2 1126

3 1127

4 6211

### Answer: 3

$d + 1 = 1127$. Well, **too much fun for this section!** :-)

(18)

The VC Dimension · Physical Intuition of VC Dimension

## Degrees of Freedom

(number-line figures illustrating the hypotheses' free parameters omitted)

- hypothesis quantity $M = |\mathcal{H}|$: 'analog' degrees of freedom
- hypothesis 'power' $d_{\text{VC}}$ (= $d + 1$ for perceptrons): effective 'binary' degrees of freedom

$d_{\text{VC}}(\mathcal{H})$: the 'powerfulness' of $\mathcal{H}$

(19)

## Two Old Friends

- positive rays ($d_{\text{VC}} = 1$): free parameter $a$ (figure omitted)
- positive intervals ($d_{\text{VC}} = 2$): free parameters $\ell, r$ (figure omitted)

practical rule of thumb:

$$d_{\text{VC}} \approx \text{\#free parameters (but not always)}$$

(20)

## M and d_VC

copied from Lecture 5 :-)

| | small $d_{\text{VC}}$ | large $d_{\text{VC}}$ |
| --- | --- | --- |
| 1 $E_{\text{out}}(g) \approx E_{\text{in}}(g)$? ($\mathbb{P}[\text{BAD}] \le 4(2N)^{d_{\text{VC}}} \cdot \exp(\ldots)$) | Yes! | No! |
| 2 $E_{\text{in}}(g)$ small enough? | No!, limited power | Yes!, lots of power |

using the right $d_{\text{VC}}$ (or $\mathcal{H}$) is important

(21)

## Fun Time

Origin-crossing hyperplanes are essentially perceptrons with $w_0$ fixed at 0. What is $d_{\text{VC}}$ of origin-crossing hyperplanes in $d$ dimensions?

1 1

2 $d$

3 $d + 1$

4 $\infty$

### Answer: 2

The proof is almost the same as proving $d_{\text{VC}}$ for the usual perceptrons, but it is the **intuition** ($d_{\text{VC}} \approx$ #free parameters, here $w_1, \ldots, w_d$) that you shall use to answer this quiz.

(22)

The VC Dimension · Interpreting VC Dimension

## VC Bound Rephrase: Penalty for Model Complexity

For any $g = \mathcal{A}(\mathcal{D}) \in \mathcal{H}$ and 'statistical' data $\mathcal{D}$, with finite $d_{\text{VC}}$,

$$
\mathbb{P}\Big[\underbrace{\big|E_{\text{in}}(g) - E_{\text{out}}(g)\big| > \epsilon}_{\text{BAD}}\Big]
\;\le\; \underbrace{4\,(2N)^{d_{\text{VC}}} \exp\!\left(-\tfrac{1}{8}\epsilon^2 N\right)}_{\delta}
$$

so, with probability $\ge 1 - \delta$, GOOD: $|E_{\text{in}}(g) - E_{\text{out}}(g)| \le \epsilon$. Setting

$$\delta = 4\,(2N)^{d_{\text{VC}}} \exp\!\left(-\tfrac{1}{8}\epsilon^2 N\right)$$

and solving for $\epsilon$:

$$
\frac{\delta}{4\,(2N)^{d_{\text{VC}}}} = \exp\!\left(-\tfrac{1}{8}\epsilon^2 N\right)
\;\Longleftrightarrow\;
\ln\!\left(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\right) = \tfrac{1}{8}\epsilon^2 N
\;\Longleftrightarrow\;
\epsilon = \sqrt{\frac{8}{N}\ln\!\left(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\right)}
$$

(23)

## VC Bound Rephrase: Penalty for Model Complexity (cont'd)

so, with probability $\ge 1 - \delta$, the generalization error satisfies

$$\big|E_{\text{in}}(g) - E_{\text{out}}(g)\big| \;\le\; \sqrt{\frac{8}{N}\ln\!\left(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\right)}$$

and in particular

$$E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \underbrace{\sqrt{\frac{8}{N}\ln\!\left(\frac{4\,(2N)^{d_{\text{VC}}}}{\delta}\right)}}_{\Omega(N,\mathcal{H},\delta)}$$

$\Omega(N, \mathcal{H}, \delta)$: penalty for **model complexity**
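The penalty term can be computed directly. A quick numeric sketch (the sample values are my own choices, matching the specs used later in the lecture):

```python
from math import log, sqrt

def omega(N: int, d_vc: int, delta: float) -> float:
    """Model-complexity penalty: sqrt((8/N) * ln(4 * (2N)^d_vc / delta))."""
    return sqrt(8.0 / N * log(4.0 * (2.0 * N) ** d_vc / delta))

# the penalty shrinks with more data and grows with a more powerful H
assert omega(10_000, 3, 0.1) < omega(1_000, 3, 0.1)
assert omega(1_000, 10, 0.1) > omega(1_000, 3, 0.1)
print(round(omega(29_300, 3, 0.1), 3))   # 0.1
```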

(24)

## THE VC Message

with a high probability,

$$E_{\text{out}}(g) \;\le\; \underbrace{E_{\text{in}}(g)}_{\text{in-sample error}} + \underbrace{\Omega(N, \mathcal{H}, \delta)}_{\text{model complexity}}$$

(error-versus-$d_{\text{VC}}$ figure omitted: in-sample error decreases with $d_{\text{VC}}$, model complexity increases, and out-of-sample error is U-shaped with a best $d_{\text{VC}}^*$ in between)

- $d_{\text{VC}}$ ↑: $E_{\text{in}}$ ↓ but $\Omega$ ↑
- $d_{\text{VC}}$ ↓: $\Omega$ ↓ but $E_{\text{in}}$ ↑
- best $d_{\text{VC}}^*$ in the middle

**powerful $\mathcal{H}$ not always good!**

(25)

## VC Bound Rephrase: Sample Complexity

For any $g = \mathcal{A}(\mathcal{D}) \in \mathcal{H}$ and 'statistical' data $\mathcal{D}$, with finite $d_{\text{VC}}$,

$$\mathbb{P}\Big[\big|E_{\text{in}}(g) - E_{\text{out}}(g)\big| > \epsilon\Big] \;\le\; \underbrace{4\,(2N)^{d_{\text{VC}}} \exp\!\left(-\tfrac{1}{8}\epsilon^2 N\right)}_{\delta}$$

given **specs** $\epsilon = 0.1$, $\delta = 0.1$, $d_{\text{VC}} = 3$: how large an $N$ do we want?

| $N$ | bound |
| --- | --- |
| 100 | ≈ 2.82 × 10^7 |
| 1,000 | ≈ 9.17 × 10^9 |
| 10,000 | ≈ 1.19 × 10^8 |
| 100,000 | ≈ 1.65 × 10^{−38} |
| 29,300 | ≈ 9.99 × 10^{−2} |

**in theory**: $N \approx 10{,}000 \cdot d_{\text{VC}}$; practical rule of thumb: $N \approx 10 \cdot d_{\text{VC}}$ **often enough!**
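The table's entries and the needed sample size can be reproduced directly (a sketch of mine; the coarse step of 100 for the search is an arbitrary choice):

```python
from math import exp

def vc_bound(N: int, d_vc: int, eps: float) -> float:
    """The VC bound value 4 (2N)^d_vc exp(-eps^2 N / 8)."""
    return 4.0 * (2.0 * N) ** d_vc * exp(-(eps ** 2) * N / 8.0)

eps, delta, d_vc = 0.1, 0.1, 3
for N in (100, 1_000, 10_000, 100_000):
    print(N, vc_bound(N, d_vc, eps))

# smallest N (to the nearest hundred) with bound <= delta
N = 100
while vc_bound(N, d_vc, eps) > delta:
    N += 100
print(N)   # around 29,300: roughly 10,000 * d_vc samples in theory
```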

(26)

## Looseness of VC Bound

$$\mathbb{P}\Big[\big|E_{\text{in}}(g) - E_{\text{out}}(g)\big| > \epsilon\Big] \;\le\; 4\,(2N)^{d_{\text{VC}}} \exp\!\left(-\tfrac{1}{8}\epsilon^2 N\right)$$

theory: $N \approx 10{,}000 \cdot d_{\text{VC}}$; practice: $N \approx 10 \cdot d_{\text{VC}}$. why is the bound so loose?

- Hoeffding for unknown $E_{\text{out}}$: works for 'any' distribution and target
- $m_{\mathcal{H}}(N)$ instead of $|\mathcal{H}(x_1, \ldots, x_N)|$: works for 'any' data
- $N^{d_{\text{VC}}}$ instead of $m_{\mathcal{H}}(N)$: works for 'any' $\mathcal{H}$ of the same $d_{\text{VC}}$
- union bound on worst cases: works for **any choice made by $\mathcal{A}$**

but hardly better, and 'similarly loose for all models': the **philosophical message** of the VC bound is important for improving ML

(27)

## Fun Time

Consider the VC bound below. Which of the following can lower the probability of BAD generalization?

$$\mathbb{P}\Big[\big|E_{\text{in}}(g) - E_{\text{out}}(g)\big| > \epsilon\Big] \;\le\; 4\,(2N)^{d_{\text{VC}}} \exp\!\left(-\tfrac{1}{8}\epsilon^2 N\right)$$

1 decrease model complexity $d_{\text{VC}}$

2 increase data size $N$ a lot

3 increase generalization error tolerance $\epsilon$

4 all of the above

### Answer: 4

### Congratulations on being a Master of the VC bound! :-)

(28)

## Summary

### 2 Why Can Machines Learn?

Lecture 7: The VC Dimension

- Definition of VC Dimension: maximum non-break point
- VC Dimension of Perceptrons: $d_{\text{VC}} = d + 1$
- Physical Intuition of VC Dimension: $d_{\text{VC}} \approx$ #free parameters
- Interpreting VC Dimension: loosely: model complexity & sample complexity

### 4 How Can Machines Learn Better?
