Lecture 16: Three Learning Principles

Academic year: 2022

Machine Learning Foundations
(機器學習基石)

Lecture 16: Three Learning Principles

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 0/25

Three Learning Principles

Roadmap

1. When Can Machines Learn?
2. Why Can Machines Learn?
3. How Can Machines Learn?
4. How Can Machines Learn Better?

Lecture 15: Validation
(crossly) reserve validation data to simulate the testing procedure for model selection

Lecture 16: Three Learning Principles
Occam's Razor
Sampling Bias
Data Snooping
Power of Three

Occam's Razor

An explanation of the data should be made as simple as possible, but no simpler.
—Albert Einstein? (1879-1955)

entia non sunt multiplicanda praeter necessitatem
(entities must not be multiplied beyond necessity)
—William of Occam (1287-1347)

'Occam's razor': trimming down unnecessary explanation

figure by Fred the Oyster (Own work) [CC-BY-SA-3.0], via Wikimedia Commons


Occam's Razor for Learning

The simplest model that fits the data is also the most plausible.

which one do you prefer? :-)

two questions:
1. What does it mean for a model to be simple?
2. How do we know that simpler is better?


Simple Model

simple hypothesis h: small Ω(h) = 'looks' simple, specified by few parameters

simple model H: small Ω(H) = not many, contains a small number of hypotheses

connection: |H| of size 2^ℓ ⟹ each h specified by ℓ = log₂|H| bits; hence small Ω(H) ⟹ small Ω(h)

simple: small hypothesis/model complexity
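The bit-counting connection can be sketched numerically: a hypothesis set with |H| = 2^ℓ members can be indexed by ℓ = log₂|H| bits, so a small model directly yields short hypothesis descriptions. A minimal sketch (the hypothesis set here is hypothetical, just bit-strings acting as names):

```python
import math

# a hypothetical model H with |H| = 2^ell hypotheses
ell = 3
H = [format(i, f"0{ell}b") for i in range(2 ** ell)]  # each h named by ell bits

assert len(H) == 2 ** ell           # small model: few hypotheses
bits_per_h = math.log2(len(H))      # description length of one hypothesis
assert bits_per_h == ell            # small Omega(H) => small Omega(h)
print(H, bits_per_h)
```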


Simple is Better

in addition to the math proof that you have seen, philosophically:

simple H
⟹ smaller m_H(N)
⟹ less 'likely' to fit data perfectly: chance ≈ m_H(N)/2^N
⟹ more significant when a fit does happen

direct action: linear first; always ask whether the data is over-modeled
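The chain above can be made concrete with growth functions from earlier lectures: 1-D decision stumps have m_H(N) = 2N, while a model that shatters all N points has m_H(N) = 2^N. Only for the simple model is a perfect fit on random labels rare, and hence significant. A small sketch under those assumptions:

```python
# chance that random (fair-coin) labels on N points are fit perfectly,
# upper-bounded by m_H(N) / 2^N for each model
N = 10

m_stump = 2 * N                 # decision stumps in R: m_H(N) = 2N
m_complex = 2 ** N              # a model that shatters N points: m_H(N) = 2^N

p_stump = m_stump / 2 ** N      # 20/1024: a perfect fit is rare, hence significant
p_complex = m_complex / 2 ** N  # 1.0: a perfect fit says nothing

assert p_stump == 20 / 1024
assert p_complex == 1.0
```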


Fun Time

Consider decision stumps on ℝ as the hypothesis set H. Recall that m_H(N) = 2N. Consider 10 different inputs x₁, x₂, . . . , x₁₀ coupled with labels y_n generated iid from a fair coin. What is the probability that the data D = {(x_n, y_n)}_{n=1}^{10} is separable by H?

1. 1/1024
2. 10/1024
3. 20/1024
4. 100/1024

Reference Answer: 3

Of all 1024 possible D, only 2N = 20 of them are separable by H.
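The reference answer can be verified by brute force: enumerate all 2^10 labelings of 10 sorted points and count those realizable by some stump h(x) = s · sign(x − θ). A sketch assuming distinct, sorted inputs:

```python
from itertools import product

N = 10
# all dichotomies realizable by s * sign(x - theta) on N sorted points:
# a single sign change at some position k, for each orientation s
stump_patterns = set()
for s in (+1, -1):
    for k in range(N + 1):          # threshold falls after position k
        stump_patterns.add(tuple(s * (1 if i >= k else -1) for i in range(N)))

separable = sum(1 for ys in product((-1, 1), repeat=N) if ys in stump_patterns)
print(separable, 2 ** N)            # 20 of the 1024 labelings are separable
assert separable == 2 * N           # matches m_H(N) = 2N
```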


Presidential Story

1948 US presidential election: Truman versus Dewey

a newspaper phone-polled how people voted, and set the title 'Dewey Defeats Truman' based on the polling

who is this? :-)


The Big Smile Came from . . .

Truman, and yes he won

suspects for the mistake:
editorial bug? —no
bad luck in polling (δ)? —no

hint: phones were expensive :-)


Sampling Bias

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

technical explanation: data from P₁(x, y) but tested under P₂ ≠ P₁: VC fails

philosophical explanation: study Math hard but get tested on English: no strong test guarantee

'minor' VC assumption: data and testing both iid from P


Sampling Bias in Learning

A True Personal Story

Netflix competition for movie recommender systems: 10% improvement = 1M US dollars

formed D_val; in my first shot, E_val(g) showed 13% improvement

why am I still teaching here? :-)

[figure: predicted rating matches movie factors (comedy content, action content, blockbuster?, Tom Cruise in it?) with viewer factors (likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?), adding contributions from each factor]

validation: random examples within D; test: 'last' user records 'after' D


Dealing with Sampling Bias

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

practical rule of thumb: match the test scenario as much as possible

e.g. if test = 'last' user records 'after' D:
• training: emphasize later examples (KDDCup 2011)
• validation: use 'late' user records

last puzzle: danger when learning 'credit card approval' from existing bank records?
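The rule of thumb above can be sketched: when the test set consists of the 'last' records, form the validation set from late records instead of a uniform random draw. The record layout here is hypothetical:

```python
import random

# hypothetical time-ordered records: (timestamp, features, label)
records = [(t, {"x": t % 7}, t % 2) for t in range(100)]  # sorted by time

# bias-prone choice: uniform random validation split ignores the test scenario
random.seed(0)
random_val = random.sample(records, 20)

# bias-matching choice: validate on the *latest* records, train on the rest
train, late_val = records[:80], records[80:]

assert [t for t, _, _ in late_val] == list(range(80, 100))
assert all(t < 80 for t, _, _ in train)
```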


Fun Time

If the data D is an unbiased sample from the underlying distribution P for binary classification, which of the following subsets of D is also an unbiased sample from P?

1. all the positive (y_n > 0) examples
2. half of the examples, randomly and uniformly picked from D without replacement
3. half of the examples with the smallest ‖x_n‖ values
4. the largest subset that is linearly separable

Reference Answer: 2

That's how we form the validation set, remember? :-)
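Choice 2 is exactly the validation-set recipe from Lecture 15: draw half of D uniformly without replacement, so every example has an equal chance of inclusion. A minimal sketch with toy data:

```python
import random

D = [(x, +1 if x % 3 == 0 else -1) for x in range(20)]  # toy labeled data

random.seed(1)
D_val = random.sample(D, len(D) // 2)       # uniform, without replacement
D_train = [ex for ex in D if ex not in D_val]

assert len(D_val) == 10 and len(D_train) == 10
assert set(D_val).isdisjoint(D_train)       # a true split of D
```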


Visual Data Snooping

Visualize X = ℝ²

full Φ₂: z = (1, x₁, x₂, x₁², x₁x₂, x₂²), d_VC = 6

or z = (1, x₁², x₂²), d_VC = 3, after visualizing?

or better z = (1, x₁² + x₂²), d_VC = 2?

or even better z = sign(0.6 − x₁² − x₂²)?

—careful about your brain's 'model complexity'

for VC-safety, Φ shall be decided without 'snooping' the data
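The danger can be illustrated: the last transform z = sign(0.6 − x₁² − x₂²) separates a circular data set with no trained parameters at all, but only because the constant 0.6 was read off by looking at (snooping) the data. A sketch with hypothetical circular data:

```python
import math

# hypothetical data: label +1 inside the circle x1^2 + x2^2 = 0.6, -1 outside
data = []
for r in (0.3, 0.5, 0.9, 1.1):
    for deg in range(0, 360, 45):
        x1 = r * math.cos(math.radians(deg))
        x2 = r * math.sin(math.radians(deg))
        data.append(((x1, x2), +1 if x1**2 + x2**2 < 0.6 else -1))

# the snooped transform: the radius 0.6 came from eyeballing the plot
def h(x1, x2):
    return 1 if 0.6 - x1**2 - x2**2 > 0 else -1

accuracy = sum(h(x1, x2) == y for (x1, x2), y in data) / len(data)
assert accuracy == 1.0   # perfect fit, but its "simplicity" is an illusion
```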


Data Snooping by Mere Shifting-Scaling

If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

8 years of currency trading data: first 6 years for training, last 2 years for testing

x = previous 20 days, y = 21st day

• snooping versus no snooping: superior profit possible
• snooping: shift-scale all values by training + testing statistics
• no snooping: shift-scale all values by training statistics only
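The shift-scale trap can be sketched: normalizing with statistics computed over training + testing leaks test information into the learning process; the safe version computes statistics on the training portion only and reuses them on the test portion. A minimal sketch with made-up numbers:

```python
# toy "price" series: a training portion, then a later test portion with a jump
train = [1.0, 2.0, 3.0, 4.0]
test = [8.0, 9.0]

def standardize(values, mean, scale):
    return [(v - mean) / scale for v in values]

# snooping: statistics from train + test (test info leaks into preprocessing)
all_vals = train + test
m_all = sum(all_vals) / len(all_vals)

# no snooping: statistics from train only, then applied unchanged to test
m_train = sum(train) / len(train)
span = max(train) - min(train)
test_clean = standardize(test, m_train, span)

assert m_train == 2.5 and span == 3.0
assert m_all != m_train   # the snooped statistics were influenced by the test set
```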


Data Snooping by Data Reusing

Research Scenario

benchmark data D
paper 1: propose H₁ that works well on D
paper 2: find room for improvement, propose H₂, and publish only if better than H₁ on D
paper 3: find room for improvement, propose H₃, and publish only if better than H₂ on D
. . .

if all papers were merged into one big paper by the same author: bad generalization due to d_VC(∪_m H_m)

step-wise: each later author snooped the data by reading earlier papers; bad generalization worsened by 'publish only if better'

if you torture the data long enough, it will confess :-)
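The 'torture the data' effect can be simulated: generate pure-noise labels once, then keep proposing random hypotheses and 'publishing' only those beating the current best on the same D. Apparent accuracy climbs well above chance even though nothing is learnable:

```python
import random

random.seed(7)
N = 20
y = [random.choice((-1, 1)) for _ in range(N)]   # pure noise: nothing to learn

best_acc = 0.0
for _ in range(500):                             # 500 "papers" reusing the same D
    h = [random.choice((-1, 1)) for _ in range(N)]
    acc = sum(a == b for a, b in zip(h, y)) / N
    best_acc = max(best_acc, acc)                # publish only if better

print(best_acc)       # far above the ~0.5 a single honest try would expect
assert best_acc >= 0.65   # selection on D manufactures apparent performance
```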

(67)

Three Learning Principles Data Snooping

Data Snooping by Data Reusing

Research Scenario

benchmark dataD

paper 1: proposeH

1

that works well onD

paper 2: find room for improvement, proposeH

2

—and

publish only if better

thanH

1

onD

paper 3: find room for improvement, proposeH

3

—and

publish only if better

thanH

2

onD

. . .

if all papers from the same author in

one big paper:

bad generalization due to dVC(∪

m

H

m

)

step-wise: later author

snooped

data by reading earlier papers, bad generalization worsen by

publish only if better

if you torture the data long enough, it will confess :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/25

(68)

Three Learning Principles Data Snooping

Data Snooping by Data Reusing

Research Scenario

benchmark dataD

paper 1: proposeH

1

that works well onD

paper 2: find room for improvement, proposeH

2

—and

publish only if better

thanH

1

onD

paper 3: find room for improvement, proposeH

3

—and

publish only if better

thanH

2

onD

. . .

if all papers from the same author in

one big paper:

bad generalization due to dVC(∪

m

H

m

)

step-wise: later author

snooped

data by reading earlier papers, bad generalization worsen by

publish only if better

if you torture the data long enough, it will confess :-)

(69)

Three Learning Principles Data Snooping

Data Snooping by Data Reusing

Research Scenario

benchmark dataD

paper 1: proposeH

1

that works well onD

paper 2: find room for improvement, proposeH

2

—and

publish only if better

thanH

1

onD

paper 3: find room for improvement, proposeH

3

—and

publish only if better

thanH

2

onD

. . .

if all papers from the same author in

one big paper:

bad generalization due to dVC(∪

m

H

m

)

step-wise: later author

snooped

data by reading earlier papers, bad generalization worsen by

publish only if better

if you torture the data long enough, it will confess :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/25

Three Learning Principles Data Snooping

Dealing with Data Snooping

truth—very hard to avoid, unless being extremely honest
• extremely honest: lock your test data in a safe
• less honest: reserve validation data and use it cautiously

be blind: avoid making modeling decisions by looking at the data
be suspicious: interpret research results (including your own) with a proper feeling of contamination

one secret to winning KDDCups:
careful balance between data-driven modeling (snooping) and validation (no-snooping)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/25
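The "lock your test data in a safe" advice can be made concrete with a one-time split (a minimal sketch; function and variable names are illustrative): partition the data once, up front, do all model selection on train/validation, and read the test part exactly once at the very end.

```python
import numpy as np

def locked_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Split once, up front; the test portion is meant to be read exactly once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    n_val = int(val_frac * len(X))
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

# toy data, just to show the shapes
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

train, val, test = locked_split(X, y)
# model selection uses train/val only; test stays "in the safe"
print(len(train[0]), len(val[0]), len(test[0]))  # 6 2 2
```

The key design choice is that the split depends only on a fixed seed, never on anything learned from the data, so no modeling decision can leak test information backwards.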

Three Learning Principles Data Snooping

Fun Time
Which of the following can result in unsatisfactory test performance in machine learning?
1 data snooping
2 overfitting
3 sampling bias
4 all of the above

Reference Answer: 4
A professional like you should be aware of those! :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 17/25

Three Learning Principles Power of Three

Three Related Fields

Data Mining: use (huge) data to find property that is interesting
—difficult to distinguish ML and DM in reality

Artificial Intelligence: compute something that shows intelligent behavior
—ML is one possible route to realize AI

Statistics: use data to make inference about an unknown process
—statistics contains many useful tools for ML

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 18/25

Three Learning Principles Power of Three

Three Theoretical Bounds

Hoeffding: P[BAD] ≤ 2 exp(−2ε²N)
• one hypothesis
• useful for verifying/testing

Multi-Bin Hoeffding: P[BAD] ≤ 2M exp(−2ε²N)
• M hypotheses
• useful for validation

VC: P[BAD] ≤ 4 m_H(2N) exp(. . .)
• all H
• useful for training

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 19/25
(89)

Three Learning Principles Power of Three

Three Theoretical Bounds

Power of Three

Hoeffding

P[BAD]

≤ 2 exp(−2

2

N)

one

hypothesis

useful for

verifying/testing

Multi-Bin Hoeffding

P[BAD]

≤ 2 M exp( −2

2

N)

• M

hypotheses

useful for

validation

VC

P[BAD]

≤ 4 m

H

(2N) exp(. . .)

all

H

useful for

training

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 19/25

(90)

Three Learning Principles Power of Three

Three Theoretical Bounds

Power of Three

Hoeffding

P[BAD]

≤ 2 exp(−2

2

N)

one

hypothesis

useful for

verifying/testing

Multi-Bin Hoeffding

P[BAD]

≤ 2 M exp( −2

2

N)

• M

hypotheses

useful for

validation

VC

P[BAD]

≤ 4 m

H

(2N) exp(. . .)

all

H

useful for

training

(91)

Three Learning Principles Power of Three

Three Linear Models
(each hypothesis computes a linear score s from inputs x_0, x_1, . . . , x_d)

PLA/pocket: h(x) = sign(s)
• plausible err = 0/1 (small flipping noise)
• minimize specially

linear regression: h(x) = s
• friendly err = squared (easy to minimize)
• minimize analytically

logistic regression: h(x) = θ(s)
• plausible err = CE (maximum likelihood)
• minimize iteratively

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 20/25
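The three hypotheses share the same linear score s = wᵀx and differ only in the output transform, which the following sketch makes explicit (the weight and input values are illustrative; x[0] = 1 plays the bias coordinate x_0):

```python
import numpy as np

def score(w, x):
    return np.dot(w, x)  # s = w^T x, shared by all three models

def pla_pocket(w, x):
    return np.sign(score(w, x))          # h(x) = sign(s), err = 0/1

def linear_regression(w, x):
    return score(w, x)                   # h(x) = s, err = squared

def logistic_regression(w, x):
    return 1.0 / (1.0 + np.exp(-score(w, x)))  # h(x) = theta(s), err = CE

w = np.array([0.5, -1.0, 2.0])  # illustrative weights
x = np.array([1.0, 0.3, 0.4])   # x[0] = 1 is the bias coordinate

print(pla_pocket(w, x), linear_regression(w, x), logistic_regression(w, x))
```

Here s = 0.5 − 0.3 + 0.8 = 1.0, so the three outputs are the sign, the raw score, and the logistic squashing of the very same number.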
