Machine Learning Foundations (機器學習基石)

Lecture 8: Noise and Error

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Noise and Error

1 When Can Machines Learn?
2 Why Can Machines Learn?
  learning happens if finite d_VC, large N, and low E_in
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 1/25

Noise and Error Noise and Probabilistic Target

Recap: The Learning Flow

[learning-flow diagram: unknown target function f, training examples (x_1, y_1), …, (x_N, y_N), learning algorithm A with hypothesis set H, final hypothesis g]

what if there is noise?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 2/25

Noise

briefly introduced before: the credit-approval problem
(year in job: 0.5 year, current debt: 200,000) → credit? {no(−1), yes(+1)}

• noise in y: good customer, 'mislabeled' as bad?
• noise in y: same customers, different labels?
• noise in x: inaccurate customer information?

does the VC bound work under noise?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 3/25

Probabilistic Marbles

one key of the VC bound: the fraction of orange marbles in the sample (bottom) tracks the fraction in the bin (top)

deterministic bin:
• marble x ∼ P(x)
• deterministic color ⟦f(x) ≠ h(x)⟧

probabilistic (noisy) bin:
• marble x ∼ P(x)
• probabilistic color ⟦y ≠ h(x)⟧ with y ∼ P(y|x)

same nature: can estimate P[orange marble] if marbles are i.i.d.

VC holds for (x, y) i.i.d. ∼ P(x, y)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 4/25
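The 'probabilistic marbles' argument above can be simulated: with examples drawn i.i.d. from P(x, y), the in-sample fraction of orange marbles ⟦y ≠ h(x)⟧ estimates the out-of-sample error even though labels are noisy. A minimal sketch; the particular P(y|x), the 0.5 threshold, and the hypothesis h are invented for illustration, not from the slides.

```python
import random

# Hypothetical setup: x uniform on [0, 1],
# target distribution P(y = +1 | x) = 0.9 if x > 0.5 else 0.1,
# and a fixed hypothesis h(x) = sign(x - 0.5).
def sample_point(rng):
    x = rng.random()
    p_pos = 0.9 if x > 0.5 else 0.1          # P(y = +1 | x)
    y = +1 if rng.random() < p_pos else -1   # probabilistic color
    return x, y

def h(x):
    return +1 if x > 0.5 else -1

rng = random.Random(0)

# 'orange marble' = [y != h(x)]; its in-sample fraction estimates
# the out-of-sample error because (x, y) are drawn i.i.d. from P(x, y)
N = 100_000
e_in = sum(h(x) != y for x, y in (sample_point(rng) for _ in range(N))) / N
print(round(e_in, 3))  # close to the true E_out = 0.1 under this setup
```

Here the true error of h is exactly the flipping-noise level 0.1, and the i.i.d. sample fraction concentrates around it, as the VC argument requires.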

Target Distribution P(y|x)

characterizes the behavior of y on one x
• can be viewed as 'ideal mini-target' + noise
• deterministic target f is a special case:
  • P(y|x) = 1 for y = f(x)
  • P(y|x) = 0 for y ≠ f(x)

goal of learning:
predict the ideal mini-target (w.r.t. P(y|x)) on often-seen inputs (w.r.t. P(x))

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 5/25

The New Learning Flow

[learning-flow diagram: unknown target distribution P(y|x) containing f(x) + noise; training examples (x_1, y_1), …, (x_N, y_N) drawn i.i.d. from P(x, y); learning algorithm A with hypothesis set H, final hypothesis g]

VC still works, pocket algorithm explained :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 6/25

Fun Time

Which of the following statements is true?

1 In practice, we should try to compute if D is linearly separable before deciding to use PLA.
2 If we know that D is not linearly separable, then the target function f must not be a linear function.
3 If we know that D is linearly separable, then the target function f must be a linear function.
4 None of the above

Explanation: 1 after computing whether D is linearly separable, we would already know some separating w∗, and then there is no need to use PLA; 2 what about noise? 3 what about 'sampling luck'? :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 7/25

Noise and Error Error Measure

Error Measure

final hypothesis g ≈ f: how well?

• previously, considered the out-of-sample measure
  E_out(g) = E_{x∼P} ⟦g(x) ≠ f(x)⟧
• more generally, an error measure E(g, f); naturally considered:
  • out-of-sample: averaged over unknown x
  • pointwise: evaluated on one x
  • classification: ⟦prediction ≠ target⟧

classification error ⟦…⟧: often also called '0/1 error'

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 8/25

Pointwise Error Measure

can often express E(g, f) = averaged err(g(x), f(x)), like
E_out(g) = E_{x∼P} ⟦g(x) ≠ f(x)⟧

err: called pointwise error measure

• in-sample: E_in(g) = (1/N) Σ_{n=1}^{N} err(g(x_n), f(x_n))
• out-of-sample: E_out(g) = E_{x∼P} err(g(x), f(x))

will mainly consider pointwise err for simplicity

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 9/25
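The two averages above can be written directly in code: E_in is the sample mean of a pluggable pointwise err. A small sketch with a made-up toy target and hypothesis (not from the lecture):

```python
def e_in(g, f, xs, err):
    """In-sample error: average of err(g(x_n), f(x_n)) over the sample."""
    return sum(err(g(x), f(x)) for x in xs) / len(xs)

# two common pointwise error measures
err_01 = lambda y_hat, y: int(y_hat != y)   # 0/1 error
err_sq = lambda y_hat, y: (y_hat - y) ** 2  # squared error

f = lambda x: x % 3                       # toy 'target'
g = lambda x: 0 if x % 5 == 0 else x % 3  # toy hypothesis, wrong on some points

xs = list(range(1, 16))
print(e_in(g, f, xs, err_01))  # fraction of misclassified points
print(e_in(g, f, xs, err_sq))  # mean squared deviation
```

The same `e_in` skeleton serves both classification and regression; only the plugged-in err changes.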

Two Important Pointwise Error Measures

• 0/1 error: err(ỹ, y) = ⟦ỹ ≠ y⟧
  correct or incorrect? often for classification
• squared error: err(ỹ, y) = (ỹ − y)²
  how far is ỹ from y? often for regression

how does err 'guide' learning?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 10/25

Ideal Mini-Target

interplay between noise and error: P(y|x) and err define the ideal mini-target f(x)

P(y = 1|x) = 0.2, P(y = 2|x) = 0.7, P(y = 3|x) = 0.1

err(ỹ, y) = ⟦ỹ ≠ y⟧:
  ỹ = 1 → avg. err 0.8
  ỹ = 2 → avg. err 0.3 (∗)
  ỹ = 3 → avg. err 0.9
  ỹ = 1.9 → avg. err 1.0 (really? :-))
f(x) = argmax_{y∈Y} P(y|x)

err(ỹ, y) = (ỹ − y)²:
  ỹ = 1 → avg. err 1.1
  ỹ = 2 → avg. err 0.3
  ỹ = 3 → avg. err 1.5
  ỹ = 1.9 → avg. err 0.29 (∗)
f(x) = Σ_{y∈Y} y · P(y|x)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 11/25
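The table above is easy to verify numerically; a short sketch reproducing the average errors and the two minimizers (argmax of P(y|x) for 0/1 error, weighted mean for squared error):

```python
# Average err of a prediction y_tilde under
# P(y=1|x)=0.2, P(y=2|x)=0.7, P(y=3|x)=0.1, for 0/1 and squared error.
P = {1: 0.2, 2: 0.7, 3: 0.1}

def avg_err(y_tilde, err):
    return sum(p * err(y_tilde, y) for y, p in P.items())

err_01 = lambda yt, y: int(yt != y)
err_sq = lambda yt, y: (yt - y) ** 2

for yt in (1, 2, 3, 1.9):
    print(yt, round(avg_err(yt, err_01), 2), round(avg_err(yt, err_sq), 2))
# 0/1 error is minimized by argmax_y P(y|x) = 2 (avg. err 0.3);
# squared error by the weighted mean sum(y * P(y|x)) = 1.9 (avg. err 0.29)
```

Swapping the error measure changes which prediction is 'ideal' for the very same P(y|x), which is the point of the slide.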

Learning Flow with Error Measure

[learning-flow diagram as before, now with the error measure err feeding both the learning algorithm A and the evaluation of the final hypothesis g]

extended VC theory/'philosophy' works for most H and err

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 12/25

Fun Time

Consider the following P(y|x) and err(ỹ, y) = |ỹ − y|. Which of the following is the ideal mini-target f(x)?

P(y = 1|x) = 0.10, P(y = 2|x) = 0.35, P(y = 3|x) = 0.15, P(y = 4|x) = 0.40

1 2.5 = average within Y = {1, 2, 3, 4}
2 2.85 = weighted mean from P(y|x)
3 3 = weighted median from P(y|x)
4 4 = argmax P(y|x)

For the 'absolute error', the weighted median provably results in the minimum average err.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 13/25
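The claim can be checked directly; a small sketch computing the average absolute error of each candidate answer, plus the weighted median (smallest y whose cumulative probability reaches 1/2):

```python
# Distribution from the Fun Time question above.
P = {1: 0.10, 2: 0.35, 3: 0.15, 4: 0.40}

def avg_abs_err(y_tilde):
    # expected absolute error of predicting y_tilde under P(y|x)
    return sum(p * abs(y_tilde - y) for y, p in P.items())

candidates = [2.5, 2.85, 3, 4]
best = min(candidates, key=avg_abs_err)
print(best)  # 3, the weighted median

def weighted_median(P):
    cum = 0.0
    for y in sorted(P):
        cum += P[y]
        if cum >= 0.5:
            return y

print(weighted_median(P))  # 3
```

The weighted mean 2.85 loses to the weighted median 3 here (average err 0.965 vs 0.95), matching the stated answer.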

Noise and Error Algorithmic Error Measure

Choice of Error Measure

fingerprint verification: f = +1 legitimate user, f = −1 intruder

two types of error: false accept and false reject

        g = +1        g = −1
f = +1  no error      false reject
f = −1  false accept  no error

0/1 error penalizes both types equally

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 14/25

Fingerprint Verification for Supermarket

• supermarket: fingerprint for discount
• false reject: very unhappy customer, may lose future business
• false accept: give away a minor discount, intruder left fingerprint :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/25

Fingerprint Verification for CIA

• CIA: fingerprint for entrance
• false accept: very serious consequences!
• false reject: unhappy employee, but so what? :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/25

Take-home Message for Now

err is application/user-dependent; when the true err is hard to use, the algorithm works with an algorithmic error measure êrr:
• true: just use err itself
• plausible: an approximation that reflects err
• friendly: easy to optimize for A (e.g., a convex objective function)

êrr: more in next lectures

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 17/25

Learning Flow with Algorithmic Error Measure

[learning-flow diagram, now carrying both the application error measure err and the algorithmic error measure êrr]

err: application goal; êrr: a key part of many A

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 18/25

Fun Time

Following the CIA story, which of the following weighted in-sample errors E_in^w(g) penalizes a false accept (y = −1 but g(x) = +1) 1000 times more?

1 Σ_n ⟦y_n ≠ g(x_n)⟧
2 Σ_{n: y_n=+1} ⟦y_n ≠ g(x_n)⟧ + 1000 Σ_{n: y_n=−1} ⟦y_n ≠ g(x_n)⟧ (∗)
3 Σ_{n: y_n=+1} ⟦y_n ≠ g(x_n)⟧ − 1000 Σ_{n: y_n=−1} ⟦y_n ≠ g(x_n)⟧
4 1000 Σ_{n: y_n=+1} ⟦y_n ≠ g(x_n)⟧ + Σ_{n: y_n=−1} ⟦y_n ≠ g(x_n)⟧

In choice 2, when y_n = −1, the error err(g(x_n), y_n) is penalized 1000 times more!

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 19/25

Noise and Error Weighted Classification

Weighted Classification

(using the CIA cost above: weight 1 for y = +1, weight 1000 for y = −1)

E_out^w(h) = E_{(x,y)∼P} [ (1 if y = +1; 1000 if y = −1) · ⟦y ≠ h(x)⟧ ]

E_in^w(h) = (1/N) Σ_{n=1}^{N} (1 if y_n = +1; 1000 if y_n = −1) · ⟦y_n ≠ h(x_n)⟧

weighted classification: different 'weight' for different (x, y)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 20/25
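The weighted in-sample error is a one-line change from the plain 0/1 E_in. A minimal sketch with CIA-style weights (1 for y = +1, 1000 for y = −1); the toy hypothesis and data are made up for illustration:

```python
def e_in_weighted(h, data, w_pos=1, w_neg=1000):
    """Weighted 0/1 in-sample error: each mistake costs w_pos or w_neg
    depending on the example's true label."""
    total = 0
    for x, y in data:
        w = w_pos if y == +1 else w_neg
        total += w * int(h(x) != y)
    return total / len(data)

h = lambda x: +1  # toy hypothesis: always accept
data = [(0.2, +1), (0.7, +1), (0.9, -1), (0.4, +1)]  # one intruder

print(e_in_weighted(h, data))  # the single false accept costs 1000 -> 250.0
```

Under these weights, 'always accept' looks terrible even though it misclassifies only one of four examples, which is exactly the behavior the CIA cost is meant to induce.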
