# Machine Learning Foundations (ᘤ9M)



### Lecture 5: Training versus Testing

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering, National Taiwan University

Training versus Testing

learning is PAC-possible if there are enough statistical data and finite $|\mathcal{H}|$

1 When Can Machines Learn?
2 Why Can Machines Learn?
(Lecture 5: Training versus Testing)
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Training versus Testing Recap and Preview

## Recap: the ‘Statistical’ Learning Flow

if $|\mathcal{H}| = M$ finite, $N$ large enough,
for whatever $g$ picked by $\mathcal{A}$, $E_{\text{out}}(g) \approx E_{\text{in}}(g)$

if $\mathcal{A}$ finds one $g$ with $E_{\text{in}}(g) \approx 0$,
PAC guarantee for $E_{\text{out}}(g) \approx 0$

$\Longrightarrow$ learning possible:

$$E_{\text{out}}(g) \underbrace{\approx}_{\text{test}} E_{\text{in}}(g) \underbrace{\approx}_{\text{train}} 0$$


## Two Central Questions

for batch & supervised binary classification, $g \approx f$ is achieved through
$E_{\text{out}}(g) \approx E_{\text{in}}(g)$ and $E_{\text{in}}(g) \approx 0$ (lecture 2)

learning split to two central questions:

1. can we make sure that $E_{\text{out}}(g)$ is close enough to $E_{\text{in}}(g)$?
2. can we make $E_{\text{in}}(g)$ small enough?

what role does $M = |\mathcal{H}|$ play for the two questions?


## Trade-off on M

small $M$:
1. Yes!, $\mathbb{P}[\text{BAD}] \le 2 \cdot M \cdot \exp(\ldots)$ small
2. No!, too few choices for $\mathcal{A}$ to make $E_{\text{in}}(g)$ small

large $M$:
1. No!, $\mathbb{P}[\text{BAD}] \le 2 \cdot M \cdot \exp(\ldots)$ large
2. Yes!, many choices for $\mathcal{A}$

using the right $M$ (or $\mathcal{H}$) is important;
$M = \infty$ doomed?


## Preview

known:
$$\mathbb{P}\left[\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\right] \le 2 \cdot M \cdot \exp\left(-2\epsilon^2 N\right)$$

establish a finite quantity $m_{\mathcal{H}}$ that replaces $M$:
$$\mathbb{P}\left[\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\right] \le 2 \cdot m_{\mathcal{H}} \cdot \exp\left(-2\epsilon^2 N\right)$$

- justify the feasibility of learning for infinite $M$
- study $m_{\mathcal{H}}$ to understand its trade-off for the ‘right’ $\mathcal{H}$, just like $M$

mysterious PLA to be fully resolved after 3 more lectures :-)


## Fun Time

### Data size: how large do we need?

One way to use the inequality
$$\mathbb{P}\left[\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\right] \le \underbrace{2 \cdot M \cdot \exp\left(-2\epsilon^2 N\right)}_{\delta}$$
is to pick a tolerable difference $\epsilon$ as well as a tolerable BAD probability $\delta$, and then gather data with size $N$ large enough to achieve those tolerance criteria. Let $\epsilon = 0.1$, $\delta = 0.05$, and $M = 100$. What is the data size needed?

1. 215
2. 415
3. 615
4. 815

Reference Answer: 2

We can simply express $N$ as a function of those ‘known’ variables. Then, the needed
$$N = \frac{1}{2\epsilon^2} \ln \frac{2M}{\delta} \approx 415.$$
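The arithmetic above is easy to check in a few lines of Python: solving $2 M \exp(-2\epsilon^2 N) \le \delta$ for $N$ gives $N \ge \ln(2M/\delta) / (2\epsilon^2)$. A minimal sketch (the function name `sample_size` is just for illustration):

```python
import math

def sample_size(eps, delta, M):
    """Smallest N with 2*M*exp(-2*eps^2*N) <= delta,
    i.e. N >= ln(2*M / delta) / (2 * eps^2)."""
    return math.ceil(math.log(2 * M / delta) / (2 * eps ** 2))

print(sample_size(0.1, 0.05, 100))  # 415
```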


Training versus Testing Effective Number of Lines

## Where Did M Come From?

$$\mathbb{P}\left[\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\right] \le 2 \cdot M \cdot \exp\left(-2\epsilon^2 N\right)$$

BAD events $\mathcal{B}_m$: $\left|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)\right| > \epsilon$

- to give $\mathcal{A}$ freedom of choice: bound $\mathbb{P}[\mathcal{B}_1 \text{ or } \mathcal{B}_2 \text{ or } \ldots \mathcal{B}_M]$
- worst case: all $\mathcal{B}_m$ non-overlapping,
$$\mathbb{P}[\mathcal{B}_1 \text{ or } \mathcal{B}_2 \text{ or } \ldots \mathcal{B}_M] \le \mathbb{P}[\mathcal{B}_1] + \mathbb{P}[\mathcal{B}_2] + \ldots + \mathbb{P}[\mathcal{B}_M] \quad \text{(union bound)}$$

where did the union bound fail to consider for $M = \infty$?


## Where Did Union Bound Fail?

union bound $\mathbb{P}[\mathcal{B}_1] + \mathbb{P}[\mathcal{B}_2] + \ldots + \mathbb{P}[\mathcal{B}_M]$:
the BAD events $\mathcal{B}_m$: $\left|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)\right| > \epsilon$ are overlapping for similar hypotheses $h_1 \approx h_2$

why?
1. $E_{\text{out}}(h_1) \approx E_{\text{out}}(h_2)$
2. for most $\mathcal{D}$, $E_{\text{in}}(h_1) = E_{\text{in}}(h_2)$

union bound over-estimating

to account for overlap, can we group similar hypotheses by kind?


## How Many Lines Are There? (1/2)

$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$

- how many lines? $\infty$
- how many kinds of lines if viewed from one input vector $x_1$?

2 kinds: $h_1$-like with $h(x_1) = \circ$, or $h_2$-like with $h(x_1) = \times$


## How Many Lines Are There? (2/2)

$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$

how many kinds of lines if viewed from two inputs $x_1, x_2$?

4 kinds: $(\circ,\circ)$, $(\circ,\times)$, $(\times,\circ)$, $(\times,\times)$

one input: 2; two inputs: 4; three inputs?


## How Many Kinds of Lines for Three Inputs? (1/2)

$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$

for three non-degenerate inputs $x_1, x_2, x_3$: 8 kinds, since every one of the $2^3 = 8$ dichotomies can always be generated by some line


## How Many Kinds of Lines for Three Inputs? (2/2)

$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$

‘fewer than 8’ kinds when the inputs $x_1, x_2, x_3$ are degenerate (e.g. collinear or same inputs): a dichotomy whose middle point differs from both ends, such as $(\circ, \times, \circ)$ on three collinear inputs, cannot be generated by any line


## How Many Kinds of Lines for Four Inputs?

$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$

at most 14 kinds for any four inputs $x_1, x_2, x_3, x_4$: the two ‘XOR-like’ dichotomies, where the matching pair sits on a diagonal, cannot be generated by any line


## Effective Number of Lines

maximum kinds of lines with respect to $N$ inputs $x_1, x_2, \cdots, x_N$
$\Longleftrightarrow$ effective number of lines

- must be $\le 2^N$ (why?)
- finite ‘grouping’ of infinitely-many lines $\in \mathcal{H}$

wish:
$$\mathbb{P}\left[\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\right] \le 2 \cdot \text{effective}(N) \cdot \exp\left(-2\epsilon^2 N\right)$$

| $N$ | effective($N$) |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 8 |
| 4 | 14 ($< 2^N$) |

if 1. effective($N$) can replace $M$, and if 2. effective($N$) $\ll 2^N$:
learning possible with infinite lines :-)
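The table entries can be reproduced by brute force over all $2^N$ labelings. This sketch (function names are illustrative) relies on a standard geometric fact assumed here: any linearly separable labeling of $N \ge 2$ points admits a separating line passing through two of them, with points exactly on the line free to go to either side after a tiny perturbation.

```python
from itertools import combinations, product

def is_separable(points, labels):
    """Is this +-/-1 labeling of 2D points realizable by some line?
    Test only lines through two of the inputs (both orientations);
    inputs exactly on the line may take either label."""
    for i, j in combinations(range(len(points)), 2):
        (x1, y1), (x2, y2) = points[i], points[j]
        ok_pos = ok_neg = True  # the two orientations of the line
        for label, (x, y) in zip(labels, points):
            cross = (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)
            if cross == 0:      # on the line: either side works
                continue
            side = 1 if cross > 0 else -1
            if side != label:
                ok_pos = False
            if side != -label:
                ok_neg = False
        if ok_pos or ok_neg:
            return True
    return False

def effective_lines(points):
    """Count the dichotomies that lines generate on N >= 2 points."""
    return sum(is_separable(points, labels)
               for labels in product((1, -1), repeat=len(points)))

print(effective_lines([(0, 0), (1, 0), (0, 1)]))          # 8
print(effective_lines([(0, 0), (1, 0), (0, 1), (1, 1)]))  # 14
```

With five points placed on a circle, the same routine returns 22, matching the Fun Time answer below.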


## Fun Time

### What is the effective number of lines for five inputs in $\mathbb{R}^2$?

1. 14
2. 16
3. 22
4. 32

Reference Answer: 3

If you put the inputs roughly around a circle, you can then pick any consecutive inputs to be on one side of the line, and the other inputs to be on the other side. The procedure leads to effectively 22 kinds of lines, which is much smaller than $2^5 = 32$. You shall find it difficult to generate more kinds by varying the inputs, and we will give a formal proof in future lectures.


Training versus Testing Effective Number of Hypotheses

## Dichotomies: Mini-hypotheses

$\mathcal{H} = \{\text{hypothesis } h : \mathcal{X} \to \{\times, \circ\}\}$

call
$$h(x_1, x_2, \ldots, x_N) = (h(x_1), h(x_2), \ldots, h(x_N)) \in \{\times, \circ\}^N$$
a dichotomy: hypothesis ‘limited’ to the eyes of $x_1, x_2, \ldots, x_N$

$\mathcal{H}(x_1, x_2, \ldots, x_N)$: all dichotomies ‘implemented’ by $\mathcal{H}$ on $x_1, x_2, \ldots, x_N$

|      | hypotheses $\mathcal{H}$ | dichotomies $\mathcal{H}(x_1, x_2, \ldots, x_N)$ |
|---|---|---|
| e.g. | all lines in $\mathbb{R}^2$ | $\{\circ\circ\circ\circ, \circ\circ\circ\times, \circ\circ\times\times, \ldots\}$ |
| size | possibly infinite | upper bounded by $2^N$ |

$|\mathcal{H}(x_1, x_2, \ldots, x_N)|$: candidate for replacing $M$


## Growth Function

$|\mathcal{H}(x_1, x_2, \ldots, x_N)|$: depends on inputs $(x_1, x_2, \ldots, x_N)$

- growth function: remove the dependence by taking the maximum over all inputs:
$$m_{\mathcal{H}}(N) = \max_{x_1, x_2, \ldots, x_N \in \mathcal{X}} |\mathcal{H}(x_1, x_2, \ldots, x_N)|$$
- finite, upper-bounded by $2^N$

| $N$ | $m_{\mathcal{H}}(N)$ |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | $\max(\ldots, 6, 8) = 8$ |
| 4 | 14 ($< 2^N$) |

how to ‘calculate’ the growth function?


## Growth Function for Positive Rays

$x_1 \quad x_2 \quad x_3 \quad \ldots \quad x_N$

- $\mathcal{X} = \mathbb{R}$ (one dimensional)
- $\mathcal{H}$ contains $h$, where each $h(x) = \text{sign}(x - a)$ for some threshold $a$
- ‘positive half’ of 1D perceptrons

one dichotomy for $a$ in each spot between or outside the sorted inputs:
$$m_{\mathcal{H}}(N) = N + 1 \quad \ll 2^N \text{ when } N \text{ large!}$$
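One way to ‘calculate’ a small growth function is plain enumeration. A minimal sketch (helper name illustrative; assumes distinct inputs) that counts the dichotomies positive rays generate:

```python
def ray_dichotomies(xs):
    """Dichotomies of positive rays h(x) = sign(x - a) on inputs xs:
    sweep the threshold a through every spot around the sorted inputs."""
    xs = sorted(xs)
    # candidate thresholds: below all inputs, in each internal gap,
    # and above all inputs -- N + 1 spots in total
    spots = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    patterns = {tuple(1 if x > a else -1 for x in xs) for a in spots}
    return len(patterns)

for n in (1, 2, 3, 4, 10):
    assert ray_dichotomies(list(range(n))) == n + 1  # m_H(N) = N + 1
print("positive rays: m_H(N) = N + 1 confirmed")
```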


## Growth Function for Positive Intervals

$x_1 \quad x_2 \quad x_3 \quad \ldots \quad x_N$

- $\mathcal{X} = \mathbb{R}$ (one dimensional)
- $\mathcal{H}$ contains $h$, where each $h(x) = +1$ iff $x \in [\ell, r)$, $-1$ otherwise

one dichotomy for each ‘interval kind’:
$$m_{\mathcal{H}}(N) = \binom{N+1}{2} + 1 = \frac{1}{2}N^2 + \frac{1}{2}N + 1 \quad \ll 2^N \text{ when } N \text{ large!}$$
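The same enumeration idea checks the positive-interval count: place $\ell$ and $r$ in any two of the $N+1$ spots around the sorted inputs, plus one empty interval. A sketch (names illustrative; assumes distinct inputs):

```python
from itertools import combinations

def interval_dichotomies(xs):
    """Dichotomies of positive intervals h(x) = +1 iff x in [l, r)
    on inputs xs, by sweeping l, r through the N + 1 spots."""
    xs = sorted(xs)
    spots = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    patterns = {tuple(-1 for _ in xs)}  # empty interval: all -1
    for l, r in combinations(spots, 2):
        patterns.add(tuple(1 if l <= x < r else -1 for x in xs))
    return len(patterns)

for n in (1, 2, 3, 4, 10):
    assert interval_dichotomies(list(range(n))) == n * (n + 1) // 2 + 1
print("positive intervals: m_H(N) = N(N+1)/2 + 1 confirmed")
```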


## Growth Function for Convex Sets (1/2)

- $\mathcal{X} = \mathbb{R}^2$ (two dimensional)
- $\mathcal{H}$ contains $h$, where each $h(x) = +1$ iff $x$ in a convex region, $-1$ otherwise

(figure: a convex region versus a non-convex region)

what is $m_{\mathcal{H}}(N)$?


## Growth Function for Convex Sets (2/2)

- one possible set of $N$ inputs: $x_1, x_2, \ldots, x_N$ on a big circle
- every dichotomy can be implemented by $\mathcal{H}$ using a convex region: connect the $+$ inputs
- call those $N$ inputs ‘shattered’ by $\mathcal{H}$

$$m_{\mathcal{H}}(N) = 2^N \Longleftrightarrow \text{exists } N \text{ inputs that can be shattered}$$
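The shattering claim can be checked by brute force. A sketch under two assumptions: a dichotomy is realizable by a convex region iff the convex hull of the $+$ inputs contains no $-$ input, and no $-$ input lies exactly on a segment between $+$ inputs (function names illustrative):

```python
import math
from itertools import product

def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(pts):
    """Convex hull in CCW order (Andrew's monotone chain)."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def strictly_inside(poly, p):
    """Is p strictly inside the CCW convex polygon poly?"""
    return len(poly) >= 3 and all(
        cross(poly[i], poly[(i + 1) % len(poly)], p) > 0
        for i in range(len(poly)))

def shattered_by_convex_sets(points):
    """Do convex regions realize all 2^N dichotomies on the points?"""
    for labels in product((1, -1), repeat=len(points)):
        hull = convex_hull([p for p, y in zip(points, labels) if y == 1])
        if any(strictly_inside(hull, p)
               for p, y in zip(points, labels) if y == -1):
            return False
    return True

# N inputs on a big circle: shattered
circle = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8))
          for k in range(8)]
print(shattered_by_convex_sets(circle))  # True

# an input inside the others' hull breaks shattering
print(shattered_by_convex_sets([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]))  # False
```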


## Fun Time

### Consider $\mathcal{H}$ containing both positive rays and negative rays on $\mathcal{X} = \mathbb{R}$. What is $m_{\mathcal{H}}(N)$ for $N \ge 2$?

1. $N$
2. $N + 1$
3. $2N$
4. $2^N$

Reference Answer: 3

Two dichotomies when the threshold is in each of the $N - 1$ ‘internal’ spots; two dichotomies for the all-$\circ$ and all-$\times$ cases. So $m_{\mathcal{H}}(N) = 2(N - 1) + 2 = 2N$.


Training versus Testing Break Point

## The Four Growth Functions

- positive rays: $m_{\mathcal{H}}(N) = N + 1$
- positive intervals: $m_{\mathcal{H}}(N) = \frac{1}{2}N^2 + \frac{1}{2}N + 1$
- convex sets: $m_{\mathcal{H}}(N) = 2^N$
- 2D perceptrons: $m_{\mathcal{H}}(N) < 2^N$ in some cases

wish:
$$\mathbb{P}\left[\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\right] \le 2 \cdot m_{\mathcal{H}}(N) \cdot \exp\left(-2\epsilon^2 N\right)$$

for 2D or general perceptrons, is $m_{\mathcal{H}}(N)$ polynomial?
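To see why a polynomial $m_{\mathcal{H}}(N)$ matters, the three closed forms above can be tabulated against the $2^N$ bound (a minimal sketch):

```python
def positive_rays(n):       return n + 1
def positive_intervals(n):  return n * (n + 1) // 2 + 1
def convex_sets(n):         return 2 ** n  # equals the 2^N bound

print(f"{'N':>4} {'rays':>8} {'intervals':>10} {'convex = 2^N':>14}")
for n in (1, 2, 5, 10, 20):
    print(f"{n:>4} {positive_rays(n):>8} "
          f"{positive_intervals(n):>10} {convex_sets(n):>14}")
# the polynomial growth functions fall far below 2^N as N grows,
# so exp(-2 eps^2 N) eventually dominates the bound
```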
