Machine Learning Foundations (機器學習基石)
Lecture 8: Noise and Error
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Noise and Error
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
  Lecture 7: The VC Dimension
  learning happens if finite d_VC, large N, and low E_in
  Lecture 8: Noise and Error
    Noise and Probabilistic Target
    Error Measure
    Algorithmic Error Measure
    Weighted Classification
3 How Can Machines Learn?
4 How Can Machines Learn Better?
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 1/25
Noise and Error / Noise and Probabilistic Target
Recap: The Learning Flow
unknown target function f: X → Y + noise (ideal credit approval formula)
  with x_1, x_2, ..., x_N drawn from unknown P on X
training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
learning algorithm A, using hypothesis set H (set of candidate formulas)
final hypothesis g ≈ f ('learned' formula to be used)

what if there is noise?
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 2/25
Noise and Error / Noise and Probabilistic Target
Noise
briefly introduced noise before

age: 23 years
gender: female
annual salary: NTD 1,000,000
year in residence: 1 year
year in job: 0.5 year
current debt: 200,000
credit? {no (−1), yes (+1)}

but more!
• noise in y: good customer, 'mislabeled' as bad?
• noise in y: same customers, different labels?
• noise in x: inaccurate customer information?

does the VC bound work under noise?
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 3/25
Noise and Error / Noise and Probabilistic Target
Probabilistic Marbles
one key of the VC bound: marbles!

[figure: a bin of marbles with a sample drawn from it]

'deterministic' marbles
• marble x ∼ P(x)
• deterministic color ⟦f(x) ≠ h(x)⟧

'probabilistic' (noisy) marbles
• marble x ∼ P(x)
• probabilistic color ⟦y ≠ h(x)⟧ with y ∼ P(y|x)

same nature: can estimate P[orange] if i.i.d.

VC holds for x i.i.d. ∼ P(x) and y i.i.d. ∼ P(y|x), that is, (x, y) i.i.d. ∼ P(x, y)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 4/25
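The 'same nature' claim can be checked numerically: under a noisy target distribution, the in-sample fraction of error 'marbles' still estimates the out-of-sample error probability. A minimal sketch, where the particular P(x), P(y|x), and h below are hypothetical choices, not from the slides:

```python
import random

random.seed(0)

def p_y_given_x(x):
    # hypothetical noisy target: y = +1 with probability 0.8 if x >= 0, else 0.2
    return 0.8 if x >= 0 else 0.2

def h(x):
    # a fixed hypothesis to evaluate
    return 1 if x >= 0 else -1

def draw_marble():
    # marble x ~ P(x); probabilistic color [[y != h(x)]] with y ~ P(y|x)
    x = random.uniform(-1, 1)
    y = 1 if random.random() < p_y_given_x(x) else -1
    return 1 if y != h(x) else 0  # 1 = 'orange' (error), 0 = 'green'

N = 100_000
in_sample_orange = sum(draw_marble() for _ in range(N)) / N
print(round(in_sample_orange, 2))  # close to 0.2, the flipping noise level here
```

Here h matches the ideal mini-target everywhere, so the only errors come from the 0.2 label-flipping noise, and the i.i.d. sample fraction concentrates around that value.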
Noise and Error / Noise and Probabilistic Target
Target Distribution P(y|x)
characterizes the behavior of a 'mini-target' on one x

• can be viewed as 'ideal mini-target' + noise, e.g.
  • P(◦|x) = 0.7, P(×|x) = 0.3
  • ideal mini-target f(x) = ◦
  • 'flipping' noise level = 0.3
• deterministic target f: special case of target distribution
  • P(y|x) = 1 for y = f(x)
  • P(y|x) = 0 for y ≠ f(x)

goal of learning: predict the ideal mini-target (w.r.t. P(y|x)) on often-seen inputs (w.r.t. P(x))
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 5/25
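The 'ideal mini-target + noise' view can be sketched in a few lines. The threshold mini-target below is a hypothetical example; the deterministic target is recovered as the zero-noise special case:

```python
def target_distribution(x, noise_level=0.3):
    # 'ideal mini-target' + flipping noise:
    # the ideal label keeps probability 1 - noise_level, the other label gets noise_level
    ideal = 1 if x >= 0 else -1   # a hypothetical ideal mini-target f(x)
    return {ideal: 1 - noise_level, -ideal: noise_level}

# noisy case, matching the slide's P(o|x) = 0.7, P(x|x) = 0.3 picture
print(target_distribution(0.5))                    # {1: 0.7, -1: 0.3}

# deterministic target f: special case with zero flipping noise
print(target_distribution(0.5, noise_level=0.0))   # {1: 1.0, -1: 0.0}
```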
Noise and Error / Noise and Probabilistic Target
The New Learning Flow
unknown target distribution P(y|x) containing f(x) + noise (ideal credit approval formula)
  with x_1, x_2, ..., x_N drawn from unknown P on X and y_1, y_2, ..., y_N from P(y|x)
training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
learning algorithm A, using hypothesis set H (set of candidate formulas)
final hypothesis g ≈ f ('learned' formula to be used)

VC still works, pocket algorithm explained :-)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 6/25
Noise and Error / Noise and Probabilistic Target
Fun Time
Let's revisit PLA/pocket. Which of the following claims is true?
1 In practice, we should check whether D is linearly separable before deciding to use PLA.
2 If we know that D is not linearly separable, then the target function f must not be a linear function.
3 If we know that D is linearly separable, then the target function f must be a linear function.
4 None of the above

Reference Answer: 4
1: checking whether D is linearly separable amounts to finding a separating w*, so there would be no need to run PLA afterwards. 2: what about noise? 3: what about 'sampling luck'? :-)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 7/25
Noise and Error / Error Measure
Error Measure
final hypothesis g ≈ f
• how well? previously, considered the out-of-sample measure
  E_out(g) = E_{x∼P} ⟦g(x) ≠ f(x)⟧
• more generally, an error measure E(g, f)
• naturally considered
  • out-of-sample: averaged over unknown x
  • pointwise: evaluated on one x
  • classification: ⟦prediction ≠ target⟧

classification error ⟦...⟧: often also called '0/1 error'
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 8/25
Noise and Error / Error Measure
Pointwise Error Measure
can often express E(g, f) as an average of err(g(x), f(x)), like
  E_out(g) = E_{x∼P} ⟦g(x) ≠ f(x)⟧, where err(g(x), f(x)) = ⟦g(x) ≠ f(x)⟧
err: called a pointwise error measure

in-sample:      E_in(g) = (1/N) Σ_{n=1}^{N} err(g(x_n), f(x_n))
out-of-sample:  E_out(g) = E_{x∼P} err(g(x), f(x))

will mainly consider pointwise err for simplicity
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 9/25
Noise and Error / Error Measure
Two Important Pointwise Error Measures
err(g(x), f(x)), with ỹ = g(x) and y = f(x)

0/1 error: err(ỹ, y) = ⟦ỹ ≠ y⟧
• correct or incorrect?
• often for classification

squared error: err(ỹ, y) = (ỹ − y)^2
• how far is ỹ from y?
• often for regression

how does err 'guide' learning?
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 10/25
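Both measures are one-liners; a small sketch contrasting them on the same predictions:

```python
def zero_one_error(y_tilde, y):
    # correct or incorrect? often for classification
    return int(y_tilde != y)

def squared_error(y_tilde, y):
    # how far is y_tilde from y? often for regression
    return (y_tilde - y) ** 2

print(zero_one_error(1, -1), zero_one_error(1, 1))                     # 1 0
print(round(squared_error(1.9, 2), 2), round(squared_error(1.9, 3), 2))  # 0.01 1.21
```

Note that 0/1 error treats every miss identically, while squared error grows with the distance between ỹ and y, which is exactly why they pick different ideal mini-targets on the next slide.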
Noise and Error / Error Measure
Ideal Mini-Target
interplay between noise and error: P(y|x) and err define the ideal mini-target f(x)

P(y = 1|x) = 0.2, P(y = 2|x) = 0.7, P(y = 3|x) = 0.1

0/1 error, err(ỹ, y) = ⟦ỹ ≠ y⟧:
  ỹ = 1    avg. err 0.8
  ỹ = 2    avg. err 0.3 (∗)
  ỹ = 3    avg. err 0.9
  ỹ = 1.9  avg. err 1.0 (really? :-))
  → f(x) = argmax_{y∈Y} P(y|x)

squared error, err(ỹ, y) = (ỹ − y)^2:
  ỹ = 1    avg. err 1.1
  ỹ = 2    avg. err 0.3
  ỹ = 3    avg. err 1.5
  ỹ = 1.9  avg. err 0.29 (∗)
  → f(x) = Σ_{y∈Y} y · P(y|x)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 11/25
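The avg. err table can be reproduced by averaging each err over P(y|x), confirming that 0/1 error is minimized at argmax_y P(y|x) = 2 while squared error is minimized at the weighted mean Σ_y y·P(y|x) = 1.9:

```python
# P(y|x) for the single x considered on this slide
P = {1: 0.2, 2: 0.7, 3: 0.1}

def avg_err(y_tilde, err):
    # average pointwise error of predicting y_tilde, over y ~ P(y|x)
    return sum(p * err(y_tilde, y) for y, p in P.items())

zero_one = lambda yt, y: int(yt != y)
squared  = lambda yt, y: (yt - y) ** 2

for yt in (1, 2, 3, 1.9):
    print(yt, round(avg_err(yt, zero_one), 2), round(avg_err(yt, squared), 2))
# matches the table: 0/1 picks yt = 2 (avg. err 0.3),
# squared picks yt = 1.9 (avg. err 0.29)
```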
Noise and Error / Error Measure
Learning Flow with Error Measure
unknown target distribution P(y|x) containing f(x) + noise (ideal credit approval formula)
  with x_1, x_2, ..., x_N drawn from unknown P on X and y_1, y_2, ..., y_N from P(y|x)
training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
learning algorithm A, using hypothesis set H (set of candidate formulas)
  and an error measure err
final hypothesis g ≈ f ('learned' formula to be used)

extended VC theory/'philosophy' works for most H and err
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 12/25
Noise and Error / Error Measure
Fun Time
Consider the following P(y|x) and err(ỹ, y) = |ỹ − y|. Which of the following is the ideal mini-target f(x)?
P(y = 1|x) = 0.10, P(y = 2|x) = 0.35, P(y = 3|x) = 0.15, P(y = 4|x) = 0.40

1  2.5 = average within Y = {1, 2, 3, 4}
2  2.85 = weighted mean from P(y|x)
3  3 = weighted median from P(y|x)
4  4 = argmax P(y|x)

Reference Answer: 3
For the 'absolute error', the weighted median provably minimizes the average err.
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 13/25
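A quick numeric check of the quiz answer: among the four candidate values, the weighted median 3 indeed gives the smallest average absolute error, and it can also be read off the cumulative distribution:

```python
# P(y|x) from the quiz
P = {1: 0.10, 2: 0.35, 3: 0.15, 4: 0.40}

def avg_abs_err(y_tilde):
    # average absolute error |y_tilde - y| over y ~ P(y|x)
    return sum(p * abs(y_tilde - y) for y, p in P.items())

candidates = [2.5, 2.85, 3, 4]
best = min(candidates, key=avg_abs_err)
print(best)  # 3

# weighted median: smallest y whose cumulative probability reaches 0.5
cum = 0.0
for y in sorted(P):
    cum += P[y]
    if cum >= 0.5:
        print(y)  # 3
        break
```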
Noise and Error / Algorithmic Error Measure
Choice of Error Measure
Fingerprint Verification
f = +1: you; f = −1: intruder
two types of error: false accept and false reject

          g = +1        g = −1
f = +1    no error      false reject
f = −1    false accept  no error

0/1 error penalizes both types equally
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 14/25
Noise and Error / Algorithmic Error Measure
Fingerprint Verification for Supermarket
f = +1: you; f = −1: intruder
two types of error: false accept and false reject

cost table:
          g = +1   g = −1
f = +1    0        10
f = −1    1        0

• supermarket: fingerprint for discount
• false reject: very unhappy customer, lose future business
• false accept: give away a minor discount, intruder left fingerprint :-)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/25
Noise and Error Algorithmic Error Measure
Fingerprint Verification for CIA

f: +1 you, −1 intruder
two types of error: false accept and false reject

0/1 error matrix:
            g = +1          g = −1
f = +1      no error        false reject
f = −1      false accept    no error

CIA cost matrix:
            g = +1    g = −1
f = +1      0         1
f = −1      1000      0

• CIA: fingerprint for entrance
• false accept: very serious consequences!
• false reject: unhappy employee, but so what? :-)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/25
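The supermarket and CIA cost matrices can be applied to the same predictions to see how the application changes the error; a minimal sketch, where the labels, predictions, and variable names are made up for illustration:

```python
# Hypothetical batch of verification attempts (data made up for illustration)
y = [+1, +1, -1, -1, -1]   # true identity: +1 you, -1 intruder
g = [+1, -1, +1, -1, -1]   # hypothesis output: one false reject, one false accept

# cost[(f, g)]: penalty when the truth is f and the hypothesis says g
supermarket = {(+1, +1): 0, (+1, -1): 10, (-1, +1): 1, (-1, -1): 0}
cia = {(+1, +1): 0, (+1, -1): 1, (-1, +1): 1000, (-1, -1): 0}

def avg_cost(ys, gs, cost):
    """Average penalty of predictions gs against labels ys under a cost matrix."""
    return sum(cost[(yn, gn)] for yn, gn in zip(ys, gs)) / len(ys)

print(avg_cost(y, g, supermarket))  # (10 + 1) / 5 = 2.2
print(avg_cost(y, g, cia))          # (1 + 1000) / 5 = 200.2
```

The same hypothesis looks mildly wrong for the supermarket but disastrous for the CIA, which is exactly why err is application-dependent.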
Noise and Error Algorithmic Error Measure
Take-home Message for Now

err is application/user-dependent

Algorithmic error measures \hat{err}:
• true: just err
• plausible:
  • 0/1: minimum 'flipping noise' (NP-hard to optimize, remember? :-))
  • squared: minimum Gaussian noise
• friendly: easy to optimize for A
  • closed-form solution
  • convex objective function

\hat{err}: more in next lectures
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 17/25
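The 'friendly' point can be made concrete with a tiny sketch: over constant predictions, squared error has a closed-form minimizer (the average of the labels), while 0/1 error has no closed form and here is minimized by checking candidates. The labels below are made up for illustration:

```python
# Made-up labels for a handful of examples
ys = [1.0, 1.0, -1.0, 1.0, -1.0]

# squared error is friendly: the best constant is the mean, in closed form
y_mean = sum(ys) / len(ys)

# 0/1 error: no closed form in general; minimize over {+1, -1} by exhaustive check
def zero_one(c):
    """Fraction of labels a constant prediction c gets wrong."""
    return sum(1 for y in ys if y != c) / len(ys)

best = min((+1.0, -1.0), key=zero_one)
print(y_mean, best)  # 0.2 and the majority label 1.0
```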
Noise and Error Algorithmic Error Measure
Learning Flow with Algorithmic Error Measure

unknown target distribution P(y|x) containing f(x) + noise (ideal credit approval formula)
training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
learning algorithm A
final hypothesis g ≈ f ('learned' formula to be used)
hypothesis set H (set of candidate formulas)
unknown P on X: x_1, x_2, ..., x_N, x; y_1, y_2, ..., y_N, y
error measure err, \hat{err}

err: application goal; \hat{err}: a key part of many A
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 18/25
Noise and Error Algorithmic Error Measure
Fun Time

Consider err below for CIA. What is E_in(g) when using this err?

            g = +1    g = −1
f = +1      0         1
f = −1      1000      0

1. \frac{1}{N} \sum_{n=1}^{N} [\![ y_n \neq g(x_n) ]\!]
2. \frac{1}{N} \Big( \sum_{y_n=+1} [\![ y_n \neq g(x_n) ]\!] + 1000 \sum_{y_n=-1} [\![ y_n \neq g(x_n) ]\!] \Big)
3. \frac{1}{N} \Big( \sum_{y_n=+1} [\![ y_n \neq g(x_n) ]\!] - 1000 \sum_{y_n=-1} [\![ y_n \neq g(x_n) ]\!] \Big)
4. \frac{1}{N} \Big( 1000 \sum_{y_n=+1} [\![ y_n \neq g(x_n) ]\!] + \sum_{y_n=-1} [\![ y_n \neq g(x_n) ]\!] \Big)

Reference Answer: 2

When y_n = −1, the false positive made on such (x_n, y_n) is penalized 1000 times more!
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 19/25
Noise and Error Weighted Classification
Weighted Classification

CIA Cost (Error, Loss, ...) Matrix:
            h(x) = +1    h(x) = −1
y = +1      0            1
y = −1      1000         0

out-of-sample:
E_{out}(h) = \mathbb{E}_{(x,y) \sim P} \begin{cases} 1 & \text{if } y = +1 \\ 1000 & \text{if } y = -1 \end{cases} \cdot [\![ y \neq h(x) ]\!]

in-sample:
E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} \begin{cases} 1 & \text{if } y_n = +1 \\ 1000 & \text{if } y_n = -1 \end{cases} \cdot [\![ y_n \neq h(x_n) ]\!]

weighted classification: different 'weight' for different (x, y)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 20/25
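The weighted in-sample error for the CIA matrix can be computed directly; a small sketch, where the labels and predictions are made up for illustration:

```python
def weighted_e_in(ys, preds):
    """Weighted in-sample error: weight 1 on examples with y_n = +1,
    weight 1000 on examples with y_n = -1 (the CIA cost matrix)."""
    total = sum((1 if yn == +1 else 1000) * (yn != pn)
                for yn, pn in zip(ys, preds))
    return total / len(ys)

ys    = [+1, +1, -1, -1]
preds = [-1, +1, +1, -1]   # one false reject, one false accept
print(weighted_e_in(ys, preds))  # (1 + 1000) / 4 = 250.25
```

A single false accept dominates the error, so an algorithm minimizing this weighted E_in is pushed hard toward rejecting intruders.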