
Machine Learning Foundations
(機器學習基石)

Lecture 5: Training versus Testing

Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering

Roadmap

1 When Can Machines Learn?
  Lecture 4: Feasibility of Learning
  learning is PAC-possible if enough statistical data and finite |H|
2 Why Can Machines Learn?
  Lecture 5: Training versus Testing
  Recap and Preview
  Effective Number of Lines
  Effective Number of Hypotheses
  Break Point
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Training versus Testing: Recap and Preview

Recap: the ‘Statistical’ Learning Flow

if |H| = M finite and N large enough,
  for whatever g picked by A, E_out(g) ≈ E_in(g)
if A finds one g with E_in(g) ≈ 0,
  PAC guarantee for E_out(g) ≈ 0
⇒ learning possible :-)

The learning flow: an unknown target function f : X → Y (ideal credit approval formula) and an unknown distribution P on X generate the training examples D : (x_1, y_1), . . . , (x_N, y_N) (historical records in bank) as well as future test points x. The learning algorithm A, using the hypothesis set H (set of candidate formulas), produces the final hypothesis g ≈ f (the ‘learned’ formula to be used). The goal: E_out(g) ≈ E_in(g) (by testing) and E_in(g) ≈ 0 (by training).


Two Central Questions

for batch & supervised binary classification (lecture 3):
g ≈ f (lecture 1) ⇐⇒ E_out(g) ≈ 0,
achieved through E_out(g) ≈ E_in(g) (lecture 4) and E_in(g) ≈ 0 (lecture 2)

learning split into two central questions:
1 can we make sure that E_out(g) is close enough to E_in(g)?
2 can we make E_in(g) small enough?

what role does M = |H| play for the two questions?


Trade-off on M

1 can we make sure that E_out(g) is close enough to E_in(g)?
2 can we make E_in(g) small enough?

small M:
1 Yes!, since P[BAD] ≤ 2 · M · exp(. . .) stays small
2 No!, too few choices

large M:
1 No!, since P[BAD] ≤ 2 · M · exp(. . .) may blow up
2 Yes!, many choices

using the right M (or H) is important;
M = ∞ doomed?

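To make the trade-off concrete, here is a minimal numerical sketch (our own code, with illustrative values of ε and N that are not from the slides) evaluating the union-bound estimate 2 · M · exp(−2ε²N) for a small and a large M:

import math

def bad_bound(M, eps, N):
    # Union-bound estimate of P[BAD]: 2 * M * exp(-2 * eps^2 * N).
    return 2 * M * math.exp(-2 * eps**2 * N)

# Illustrative values (assumptions, not from the lecture): eps = 0.1, N = 500.
for M in (10, 1_000_000):
    print(M, bad_bound(M, 0.1, 500))
# M = 10 keeps the bound around 1e-3; M = 1,000,000 pushes it far above 1,
# making the guarantee vacuous -- a hint of why M = infinity looks doomed.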

Preview

Known:
P[ |E_in(g) − E_out(g)| > ε ] ≤ 2 · M · exp(−2ε²N)

Todo:
establish a finite quantity m_H that replaces M:
P[ |E_in(g) − E_out(g)| > ε ] ≤? 2 · m_H · exp(−2ε²N)
justify the feasibility of learning for infinite M
study m_H to understand its trade-off for ‘right’ H, just like M

mysterious PLA to be fully resolved after 3 more lectures :-)


Fun Time

Data size: how large do we need?

One way to use the inequality
P[ |E_in(g) − E_out(g)| > ε ] ≤ 2 · M · exp(−2ε²N)
(call the right-hand side δ) is to pick a tolerable difference ε as well as a tolerable BAD probability δ, and then gather data with size (N) large enough to achieve those tolerance criteria. Let ε = 0.1, δ = 0.05, and M = 100. What is the data size needed?

1 215
2 415
3 615
4 815

Reference Answer: 2

We can simply express N as a function of those ‘known’ variables. Then, the needed N = (1 / (2ε²)) · ln(2M / δ), which is about 415 here.

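A quick numerical check of the reference answer (a minimal sketch; the helper name is our own):

import math

def sample_size(eps, delta, M):
    # Smallest N with 2 * M * exp(-2 * eps^2 * N) <= delta, from inverting the bound.
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

print(sample_size(0.1, 0.05, 100))  # -> 415, matching choice 2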

Training versus Testing: Effective Number of Lines

Where Did M Come From?

P[ |E_in(g) − E_out(g)| > ε ] ≤ 2 · M · exp(−2ε²N)

BAD events B_m: |E_in(h_m) − E_out(h_m)| > ε
to give A freedom of choice: bound P[B_1 or B_2 or . . . B_M]
worst case: all B_m non-overlapping, so by the union bound
P[B_1 or B_2 or . . . B_M] ≤ P[B_1] + P[B_2] + . . . + P[B_M]

what did the union bound fail to consider for M = ∞?


Where Did the Union Bound Fail?

union bound: P[B_1] + P[B_2] + . . . + P[B_M]

the BAD events B_m: |E_in(h_m) − E_out(h_m)| > ε are overlapping for similar hypotheses h_1 ≈ h_2. Why?
1 E_out(h_1) ≈ E_out(h_2)
2 for most D, E_in(h_1) = E_in(h_2)

so the union bound is over-estimating: the events B_1, B_2, B_3, . . . largely overlap

to account for overlap, can we group similar hypotheses by kind?


How Many Lines Are There? (1/2)

H = { all lines in R² }

how many lines? ∞
how many kinds of lines if viewed from one input vector x_1?

2 kinds: h_1-like with h(x_1) = ◦, or h_2-like with h(x_1) = ×


How Many Lines Are There? (2/2)

H = { all lines in R² }

how many kinds of lines if viewed from two inputs x_1, x_2?

4 kinds: ◦◦, ◦×, ×◦, ××

one input: 2; two inputs: 4; three inputs?


How Many Kinds of Lines for Three Inputs? (1/2)

H = { all lines in R² }

for three inputs x_1, x_2, x_3:

8 kinds: ◦◦◦, ◦◦×, ◦×◦, ×◦◦, ◦××, ×◦×, ××◦, ×××

always 8 for three inputs?


How Many Kinds of Lines for Three Inputs? (2/2)

H = { all lines in R² }

for another three inputs x_1, x_2, x_3: ‘fewer than 8’ when degenerate (e.g. collinear or same inputs)

6 kinds for three collinear inputs: ◦◦◦, ◦◦×, ◦××, ×××, ××◦, ×◦◦
(no single line can produce ◦×◦ or ×◦×)


How Many Kinds of Lines for Four Inputs?

H = { all lines in R² }

for four inputs x_1, x_2, x_3, x_4: at most 14 kinds for any four inputs

14 < 2^4 = 16: for four inputs in convex position, the two ‘XOR-like’ dichotomies (each diagonal pair labelled alike, the two pairs labelled differently) cannot be produced by any line


Effective Number of Lines

maximum kinds of lines with respect to N inputs x_1, x_2, . . . , x_N
⇐⇒ effective number of lines
must be ≤ 2^N (why?)

finite ‘grouping’ of infinitely-many lines ∈ H

wish:
P[ |E_in(g) − E_out(g)| > ε ] ≤ 2 · effective(N) · exp(−2ε²N)

lines in 2D:
N    effective(N)
1    2
2    4
3    8
4    14 (< 2^N)

if (1) effective(N) can replace M and (2) effective(N) ≪ 2^N,
learning possible with infinite lines :-)


Fun Time

What is the effective number of lines for five inputs ∈ R²?

1 14
2 16
3 22
4 32

Reference Answer: 3

If you put the inputs roughly around a circle, you can then pick any consecutive inputs to be on one side of the line, and the other inputs to be on the other side. The procedure leads to effectively 22 kinds of lines, which is much smaller than 2^5 = 32. You shall find it difficult to generate more kinds by varying the inputs, and we will give a formal proof in future lectures.

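The counts 8, 14, and 22 can be verified by brute force. Below is a minimal sketch (our own code, not part of the course): it enumerates every ±1 labelling of a given point set and uses a small linear-programming feasibility check (via scipy) to decide whether some line w·x + b produces that labelling; the number of feasible labellings is the effective number of lines for those inputs.

import itertools
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    # Feasibility LP: find (w1, w2, b) with y_i * (w . x_i + b) >= 1 for all i.
    N = len(y)
    # Rewrite as A_ub @ z <= b_ub with z = (w1, w2, b): -y_i * (x_i . w + b) <= -1.
    A_ub = -np.column_stack([X, np.ones(N)]) * np.asarray(y)[:, None]
    b_ub = -np.ones(N)
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def effective_number_of_lines(X):
    # Count the dichotomies on X that some line in R^2 can produce.
    N = len(X)
    return sum(is_linearly_separable(X, y)
               for y in itertools.product([-1, 1], repeat=N))

# Three / four points in general position, then five points on a circle.
print(effective_number_of_lines(np.array([[0, 0], [1, 0], [0, 1]])))          # 8
print(effective_number_of_lines(np.array([[0, 0], [1, 0], [0, 1], [1, 1]])))  # 14
angles = 2 * np.pi * np.arange(5) / 5
print(effective_number_of_lines(np.column_stack([np.cos(angles),
                                                 np.sin(angles)])))           # 22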

Training versus Testing: Effective Number of Hypotheses

Dichotomies: Mini-hypotheses

H = { hypothesis h : X → {×, ◦} }

call h(x_1, x_2, . . . , x_N) = (h(x_1), h(x_2), . . . , h(x_N)) ∈ {×, ◦}^N
a dichotomy: hypothesis ‘limited’ to the eyes of x_1, x_2, . . . , x_N

H(x_1, x_2, . . . , x_N): all dichotomies ‘implemented’ by H on x_1, x_2, . . . , x_N

            hypotheses H         dichotomies H(x_1, x_2, . . . , x_N)
e.g.        all lines in R²      {◦◦◦◦, ◦◦◦×, ◦◦××, . . .}
size        possibly infinite    upper bounded by 2^N

|H(x_1, x_2, . . . , x_N)|: candidate for replacing M

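Since a dichotomy is just the tuple of predictions on the fixed inputs, H(x_1, . . . , x_N) can be collected directly. A tiny sketch with a few hand-picked line hypotheses (our own illustrative choices, not from the slides):

import numpy as np

# Fixed inputs x_1, x_2, x_3.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# A few hypotheses from H = {all lines in R^2}, each given by (w, b): h(x) = sign(w . x + b).
hypotheses = [(np.array([1.0, 0.0]), -0.5),
              (np.array([0.0, 1.0]), -0.5),
              (np.array([1.0, 1.0]), -1.5),
              (np.array([-1.0, -1.0]), 0.2)]

def dichotomy(w, b):
    return tuple('o' if v > 0 else 'x' for v in X @ w + b)

# H(x_1, x_2, x_3): the set of distinct prediction tuples, at most 2^3 = 8 of them.
print({dichotomy(w, b) for w, b in hypotheses})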

Growth Function

|H(x_1, x_2, . . . , x_N)| depends on the inputs (x_1, x_2, . . . , x_N)

growth function: remove the dependence by taking the max over all possible (x_1, x_2, . . . , x_N):

m_H(N) = max_{x_1, x_2, . . . , x_N ∈ X} |H(x_1, x_2, . . . , x_N)|

finite, upper-bounded by 2^N

lines in 2D:
N    m_H(N)
1    2
2    4
3    max(. . . , 6, 8) = 8
4    14 (< 2^N)

how to ‘calculate’ the growth function?


Growth Function for Positive Rays

X = R (one dimensional)
H contains h, where each h(x) = sign(x − a) for threshold a
(the ‘positive half’ of 1D perceptrons: h(x) = −1 to the left of a, h(x) = +1 to the right)

one dichotomy for a in each of the N + 1 spots separated by the inputs (. . . , (x_n, x_{n+1}), . . .):
m_H(N) = N + 1

e.g. for x_1 < x_2 < x_3 < x_4: ◦◦◦◦, ×◦◦◦, ××◦◦, ×××◦, ××××

(N + 1) ≪ 2^N when N large!

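A tiny enumeration (our own sketch) confirming that positive rays produce exactly N + 1 dichotomies, by trying one threshold per gap between sorted inputs:

def positive_ray_dichotomies(xs):
    # Dichotomies of h(x) = sign(x - a) on sorted points, one threshold per gap.
    xs = sorted(xs)
    gaps = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(+1 if x > a else -1 for x in xs) for a in gaps}

points = [0.3, 1.2, 2.5, 4.0]
print(len(positive_ray_dichotomies(points)))  # -> 5 = N + 1 for N = 4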

Growth Function for Positive Intervals

X = R (one dimensional)
H contains h, where each h(x) = +1 iff x ∈ [ℓ, r), −1 otherwise
(h is −1, then +1 inside the interval, then −1 again)

one dichotomy for each ‘interval kind’:
m_H(N) = C(N + 1, 2) + 1 = (1/2)N² + (1/2)N + 1
(choose the two interval ends among N + 1 spots, plus 1 for the all-× dichotomy)

e.g. for N = 4, the 11 dichotomies: ◦×××, ◦◦××, ◦◦◦×, ◦◦◦◦, ×◦××, ×◦◦×, ×◦◦◦, ××◦×, ××◦◦, ×××◦, ××××

(1/2)N² + (1/2)N + 1 ≪ 2^N when N large!

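And the matching enumeration for positive intervals (our own sketch), placing the two interval ends in any two of the N + 1 gaps:

from itertools import combinations

def positive_interval_dichotomies(xs):
    # Dichotomies of h(x) = +1 iff l <= x < r on sorted points, one (l, r) per pair of gaps.
    xs = sorted(xs)
    spots = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    dichotomies = {tuple(-1 for _ in xs)}                # the all-'x' dichotomy
    for l, r in combinations(spots, 2):                  # interval ends in two spots
        dichotomies.add(tuple(+1 if l <= x < r else -1 for x in xs))
    return dichotomies

points = [0.3, 1.2, 2.5, 4.0]
N = len(points)
# Count matches C(N+1, 2) + 1 = (1/2)N^2 + (1/2)N + 1.
print(len(positive_interval_dichotomies(points)), N * (N + 1) // 2 + 1)  # -> 11 11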

Growth Function for Convex Sets (1/2)

(figure: a convex region versus a non-convex region)

X = R² (two dimensional)
H contains h, where h(x) = +1 iff x is in a convex region, −1 otherwise

what is m_H(N)?


Growth Function for Convex Sets (2/2)

one possible set of N inputs: x_1, x_2, . . . , x_N on a big circle

every dichotomy can be implemented by H, using a convex region slightly extended from the contour of the positive inputs
⇒ m_H(N) = 2^N

call those N inputs ‘shattered’ by H

m_H(N) = 2^N ⇐⇒ there exist N inputs that can be shattered

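The circle argument can also be checked numerically. A minimal sketch (our own code): for a fixed point set, a dichotomy is realizable by some convex region exactly when no ×-labelled point lies in the convex hull of the ◦-labelled points, and that hull-membership test is again a small linear program (via scipy).

import itertools
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(q, P):
    # Is q a convex combination of the rows of P? (LP feasibility check)
    k = len(P)
    A_eq = np.vstack([P.T, np.ones(k)])   # sum_i lam_i * P_i = q, sum_i lam_i = 1
    b_eq = np.append(q, 1.0)
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * k, method="highs")
    return res.success

def convex_set_dichotomies(X):
    # Count dichotomies realizable by 'h = +1 inside some convex region'.
    count = 0
    for y in itertools.product([-1, 1], repeat=len(X)):
        pos = X[np.array(y) == 1]
        neg = X[np.array(y) == -1]
        # Realizable iff no negative point sits inside the hull of the positives.
        if len(pos) == 0 or not any(in_convex_hull(q, pos) for q in neg):
            count += 1
    return count

angles = 2 * np.pi * np.arange(5) / 5
circle = np.column_stack([np.cos(angles), np.sin(angles)])
print(convex_set_dichotomies(circle))   # -> 32 = 2^5: the circle points are shattered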

Fun Time

Consider positive and negative rays as H, which is equivalent to the perceptron hypothesis set in 1D. The hypothesis set is often called ‘decision stump’ to describe the shape of its hypotheses. What is the growth function m_H(N)?

1 N
2 N + 1
3 2N
4 2^N

Reference Answer: 3

Two dichotomies when the threshold is in each of the N − 1 ‘internal’ spots; two dichotomies for the all-◦ and all-× cases. That gives 2(N − 1) + 2 = 2N.

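A quick enumeration check of the reference answer (our own sketch), using the same gap-threshold trick as for positive rays but allowing both ray directions:

def decision_stump_dichotomies(xs):
    # Dichotomies of h(x) = s * sign(x - a) for s in {+1, -1} on sorted points.
    xs = sorted(xs)
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(s * (+1 if x > a else -1) for x in xs)
            for a in thresholds for s in (+1, -1)}

for N in (3, 4, 5):
    print(N, len(decision_stump_dichotomies(list(range(N)))))  # -> 2N each time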

Training versus Testing: Break Point

The Four Growth Functions

positive rays: m_H(N) = N + 1
positive intervals: m_H(N) = (1/2)N² + (1/2)N + 1
convex sets: m_H(N) = 2^N
2D perceptrons: m_H(N) < 2^N in some cases

what if m_H(N) replaces M?
P[ |E_in(g) − E_out(g)| > ε ] ≤? 2 · m_H(N) · exp(−2ε²N)

polynomial: good; exponential: bad

for 2D or general perceptrons, is m_H(N) polynomial?
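To see why ‘polynomial: good; exponential: bad’, here is a small sketch (our own code, with an illustrative ε) comparing the hoped-for bound 2 · m_H(N) · exp(−2ε²N) for a polynomial and an exponential growth function:

import math

def bound(m_H, N, eps=0.1):
    # Hoped-for bound 2 * m_H(N) * exp(-2 * eps^2 * N) with m_H(N) in place of M.
    return 2 * m_H(N) * math.exp(-2 * eps**2 * N)

poly = lambda N: 0.5 * N**2 + 0.5 * N + 1   # e.g. positive intervals
expo = lambda N: 2.0 ** N                   # e.g. convex sets

for N in (100, 500, 1000):
    print(N, bound(poly, N), bound(expo, N))
# With a polynomial m_H the bound heads to 0 as N grows (about 2e-3 at N = 1000);
# with an exponential m_H it explodes, so the guarantee becomes useless.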
