Machine Learning Foundations (機器學習基石)
Lecture 8: Noise and Error
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Noise and Error
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
  Lecture 7: The VC Dimension
  learning happens if finite d_VC, large N, and low E_in
  Lecture 8: Noise and Error
    Noise and Probabilistic Target
    Error Measure
    Algorithmic Error Measure
    Weighted Classification
3 How Can Machines Learn?
4 How Can Machines Learn Better?
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 1/25
Noise and Error / Noise and Probabilistic Target
Recap: The Learning Flow
unknown target function f: X → Y + noise (ideal credit approval formula)
  with x_1, x_2, ..., x_N drawn from unknown P on X
training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
learning algorithm A, using hypothesis set H (set of candidate formulas)
final hypothesis g ≈ f ('learned' formula to be used)

what if there is noise?
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 2/25
Noise and Error / Noise and Probabilistic Target
Noise
briefly introduced noise before

age: 23 years
gender: female
annual salary: NTD 1,000,000
year in residence: 1 year
year in job: 0.5 year
current debt: 200,000
credit? {no (−1), yes (+1)}

but more!
• noise in y: good customer, 'mislabeled' as bad?
• noise in y: same customers, different labels?
• noise in x: inaccurate customer information?

does the VC bound work under noise?
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 3/25
Noise and Error / Noise and Probabilistic Target
Probabilistic Marbles
one key of the VC bound: marbles!

[figure: a bin of marbles with a sample drawn from it]

'deterministic' marbles
• marble x ∼ P(x)
• deterministic color ⟦f(x) ≠ h(x)⟧

'probabilistic' (noisy) marbles
• marble x ∼ P(x)
• probabilistic color ⟦y ≠ h(x)⟧ with y ∼ P(y|x)

same nature: can estimate P[orange] if i.i.d.

VC holds for x i.i.d. ∼ P(x) and y i.i.d. ∼ P(y|x), that is, (x, y) i.i.d. ∼ P(x, y)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 4/25
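The 'same nature' claim can be checked numerically: under a noisy target distribution, the in-sample fraction of error 'marbles' still estimates the out-of-sample error probability. A minimal sketch, where the particular P(x), P(y|x), and h below are hypothetical choices, not from the slides:

```python
import random

random.seed(0)

def p_y_given_x(x):
    # hypothetical noisy target: y = +1 with probability 0.8 if x >= 0, else 0.2
    return 0.8 if x >= 0 else 0.2

def h(x):
    # a fixed hypothesis to evaluate
    return 1 if x >= 0 else -1

def draw_marble():
    # marble x ~ P(x); probabilistic color [[y != h(x)]] with y ~ P(y|x)
    x = random.uniform(-1, 1)
    y = 1 if random.random() < p_y_given_x(x) else -1
    return 1 if y != h(x) else 0  # 1 = 'orange' (error), 0 = 'green'

N = 100_000
in_sample_orange = sum(draw_marble() for _ in range(N)) / N
print(round(in_sample_orange, 2))  # close to 0.2, the flipping noise level here
```

Here h matches the ideal mini-target everywhere, so the only errors come from the 0.2 label-flipping noise, and the i.i.d. sample fraction concentrates around that value.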
Noise and Error / Noise and Probabilistic Target
Target Distribution P(y|x)
characterizes the behavior of a 'mini-target' on one x

• can be viewed as 'ideal mini-target' + noise, e.g.
  • P(◦|x) = 0.7, P(×|x) = 0.3
  • ideal mini-target f(x) = ◦
  • 'flipping' noise level = 0.3
• deterministic target f: special case of target distribution
  • P(y|x) = 1 for y = f(x)
  • P(y|x) = 0 for y ≠ f(x)

goal of learning: predict the ideal mini-target (w.r.t. P(y|x)) on often-seen inputs (w.r.t. P(x))
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 5/25
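The 'ideal mini-target + noise' view can be sketched in a few lines. The threshold mini-target below is a hypothetical example; the deterministic target is recovered as the zero-noise special case:

```python
def target_distribution(x, noise_level=0.3):
    # 'ideal mini-target' + flipping noise:
    # the ideal label keeps probability 1 - noise_level, the other label gets noise_level
    ideal = 1 if x >= 0 else -1   # a hypothetical ideal mini-target f(x)
    return {ideal: 1 - noise_level, -ideal: noise_level}

# noisy case, matching the slide's P(o|x) = 0.7, P(x|x) = 0.3 picture
print(target_distribution(0.5))                    # {1: 0.7, -1: 0.3}

# deterministic target f: special case with zero flipping noise
print(target_distribution(0.5, noise_level=0.0))   # {1: 1.0, -1: 0.0}
```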
Noise and Error / Noise and Probabilistic Target
The New Learning Flow
unknown target distribution P(y|x) containing f(x) + noise (ideal credit approval formula)
  with x_1, x_2, ..., x_N drawn from unknown P on X and y_1, y_2, ..., y_N from P(y|x)
training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
learning algorithm A, using hypothesis set H (set of candidate formulas)
final hypothesis g ≈ f ('learned' formula to be used)

VC still works, pocket algorithm explained :-)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 6/25
Noise and Error / Noise and Probabilistic Target
Fun Time
Let's revisit PLA/pocket. Which of the following claims is true?
1 In practice, we should check whether D is linearly separable before deciding to use PLA.
2 If we know that D is not linearly separable, then the target function f must not be a linear function.
3 If we know that D is linearly separable, then the target function f must be a linear function.
4 None of the above

Reference Answer: 4
1: checking whether D is linearly separable amounts to finding a separating w*, so there would be no need to run PLA afterwards. 2: what about noise? 3: what about 'sampling luck'? :-)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 7/25
Noise and Error / Error Measure
Error Measure
final hypothesis g ≈ f
• how well? previously, considered the out-of-sample measure
  E_out(g) = E_{x∼P} ⟦g(x) ≠ f(x)⟧
• more generally, an error measure E(g, f)
• naturally considered
  • out-of-sample: averaged over unknown x
  • pointwise: evaluated on one x
  • classification: ⟦prediction ≠ target⟧

classification error ⟦...⟧: often also called '0/1 error'
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 8/25
Noise and Error / Error Measure
Pointwise Error Measure
can often express E(g, f) as an average of err(g(x), f(x)), like
  E_out(g) = E_{x∼P} ⟦g(x) ≠ f(x)⟧, where err(g(x), f(x)) = ⟦g(x) ≠ f(x)⟧
err: called a pointwise error measure

in-sample:      E_in(g) = (1/N) Σ_{n=1}^{N} err(g(x_n), f(x_n))
out-of-sample:  E_out(g) = E_{x∼P} err(g(x), f(x))

will mainly consider pointwise err for simplicity
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 9/25
Noise and Error / Error Measure
Two Important Pointwise Error Measures
err(g(x), f(x)), with ỹ = g(x) and y = f(x)

0/1 error: err(ỹ, y) = ⟦ỹ ≠ y⟧
• correct or incorrect?
• often for classification

squared error: err(ỹ, y) = (ỹ − y)^2
• how far is ỹ from y?
• often for regression

how does err 'guide' learning?
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 10/25
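Both measures are one-liners; a small sketch contrasting them on the same predictions:

```python
def zero_one_error(y_tilde, y):
    # correct or incorrect? often for classification
    return int(y_tilde != y)

def squared_error(y_tilde, y):
    # how far is y_tilde from y? often for regression
    return (y_tilde - y) ** 2

print(zero_one_error(1, -1), zero_one_error(1, 1))                     # 1 0
print(round(squared_error(1.9, 2), 2), round(squared_error(1.9, 3), 2))  # 0.01 1.21
```

Note that 0/1 error treats every miss identically, while squared error grows with the distance between ỹ and y, which is exactly why they pick different ideal mini-targets on the next slide.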
Noise and Error / Error Measure
Ideal Mini-Target
interplay between noise and error: P(y|x) and err define the ideal mini-target f(x)

P(y = 1|x) = 0.2, P(y = 2|x) = 0.7, P(y = 3|x) = 0.1

0/1 error, err(ỹ, y) = ⟦ỹ ≠ y⟧:
  ỹ = 1    avg. err 0.8
  ỹ = 2    avg. err 0.3 (∗)
  ỹ = 3    avg. err 0.9
  ỹ = 1.9  avg. err 1.0 (really? :-))
  → f(x) = argmax_{y∈Y} P(y|x)

squared error, err(ỹ, y) = (ỹ − y)^2:
  ỹ = 1    avg. err 1.1
  ỹ = 2    avg. err 0.3
  ỹ = 3    avg. err 1.5
  ỹ = 1.9  avg. err 0.29 (∗)
  → f(x) = Σ_{y∈Y} y · P(y|x)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 11/25
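The avg. err table can be reproduced by averaging each err over P(y|x), confirming that 0/1 error is minimized at argmax_y P(y|x) = 2 while squared error is minimized at the weighted mean Σ_y y·P(y|x) = 1.9:

```python
# P(y|x) for the single x considered on this slide
P = {1: 0.2, 2: 0.7, 3: 0.1}

def avg_err(y_tilde, err):
    # average pointwise error of predicting y_tilde, over y ~ P(y|x)
    return sum(p * err(y_tilde, y) for y, p in P.items())

zero_one = lambda yt, y: int(yt != y)
squared  = lambda yt, y: (yt - y) ** 2

for yt in (1, 2, 3, 1.9):
    print(yt, round(avg_err(yt, zero_one), 2), round(avg_err(yt, squared), 2))
# matches the table: 0/1 picks yt = 2 (avg. err 0.3),
# squared picks yt = 1.9 (avg. err 0.29)
```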
Noise and Error / Error Measure
Learning Flow with Error Measure
unknown target distribution P(y|x) containing f(x) + noise (ideal credit approval formula)
  with x_1, x_2, ..., x_N drawn from unknown P on X and y_1, y_2, ..., y_N from P(y|x)
training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
learning algorithm A, using hypothesis set H (set of candidate formulas)
  and an error measure err
final hypothesis g ≈ f ('learned' formula to be used)

extended VC theory/'philosophy' works for most H and err
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 12/25
Noise and Error / Error Measure
Fun Time
Consider the following P(y|x) and err(ỹ, y) = |ỹ − y|. Which of the following is the ideal mini-target f(x)?
P(y = 1|x) = 0.10, P(y = 2|x) = 0.35, P(y = 3|x) = 0.15, P(y = 4|x) = 0.40

1  2.5 = average within Y = {1, 2, 3, 4}
2  2.85 = weighted mean from P(y|x)
3  3 = weighted median from P(y|x)
4  4 = argmax P(y|x)

Reference Answer: 3
For the 'absolute error', the weighted median provably minimizes the average err.
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 13/25
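A quick numeric check of the quiz answer: among the four candidate values, the weighted median 3 indeed gives the smallest average absolute error, and it can also be read off the cumulative distribution:

```python
# P(y|x) from the quiz
P = {1: 0.10, 2: 0.35, 3: 0.15, 4: 0.40}

def avg_abs_err(y_tilde):
    # average absolute error |y_tilde - y| over y ~ P(y|x)
    return sum(p * abs(y_tilde - y) for y, p in P.items())

candidates = [2.5, 2.85, 3, 4]
best = min(candidates, key=avg_abs_err)
print(best)  # 3

# weighted median: smallest y whose cumulative probability reaches 0.5
cum = 0.0
for y in sorted(P):
    cum += P[y]
    if cum >= 0.5:
        print(y)  # 3
        break
```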
Noise and Error / Algorithmic Error Measure
Choice of Error Measure
Fingerprint Verification
f = +1: you; f = −1: intruder
two types of error: false accept and false reject

          g = +1        g = −1
f = +1    no error      false reject
f = −1    false accept  no error

0/1 error penalizes both types equally
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 14/25
Noise and Error / Algorithmic Error Measure
Fingerprint Verification for Supermarket
f = +1: you; f = −1: intruder
two types of error: false accept and false reject

cost table:
          g = +1   g = −1
f = +1    0        10
f = −1    1        0

• supermarket: fingerprint for discount
• false reject: very unhappy customer, lose future business
• false accept: give away a minor discount, intruder left fingerprint :-)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/25
Noise and Error Algorithmic Error Measure
Fingerprint Verification for CIA

f: +1 you, −1 intruder
two types of error: false accept and false reject

0/1 error matrix:
            g = +1          g = −1
f = +1      no error        false reject
f = −1      false accept    no error

CIA cost matrix:
            g = +1    g = −1
f = +1      0         1
f = −1      1000      0

• CIA: fingerprint for entrance
• false accept: very serious consequences!
• false reject: unhappy employee, but so what? :-)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/25
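The supermarket and CIA cost matrices can be applied to the same predictions to see how the application changes the error; a minimal sketch, where the labels, predictions, and variable names are made up for illustration:

```python
# Hypothetical batch of verification attempts (data made up for illustration)
y = [+1, +1, -1, -1, -1]   # true identity: +1 you, -1 intruder
g = [+1, -1, +1, -1, -1]   # hypothesis output: one false reject, one false accept

# cost[(f, g)]: penalty when the truth is f and the hypothesis says g
supermarket = {(+1, +1): 0, (+1, -1): 10, (-1, +1): 1, (-1, -1): 0}
cia = {(+1, +1): 0, (+1, -1): 1, (-1, +1): 1000, (-1, -1): 0}

def avg_cost(ys, gs, cost):
    """Average penalty of predictions gs against labels ys under a cost matrix."""
    return sum(cost[(yn, gn)] for yn, gn in zip(ys, gs)) / len(ys)

print(avg_cost(y, g, supermarket))  # (10 + 1) / 5 = 2.2
print(avg_cost(y, g, cia))          # (1 + 1000) / 5 = 200.2
```

The same hypothesis looks mildly wrong for the supermarket but disastrous for the CIA, which is exactly why err is application-dependent.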
Noise and Error Algorithmic Error Measure
Take-home Message for Now

err is application/user-dependent

Algorithmic error measures \hat{err}:
• true: just err
• plausible:
  • 0/1: minimum 'flipping noise' (NP-hard to optimize, remember? :-))
  • squared: minimum Gaussian noise
• friendly: easy to optimize for A
  • closed-form solution
  • convex objective function

\hat{err}: more in next lectures
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 17/25
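The 'friendly' point can be made concrete with a tiny sketch: over constant predictions, squared error has a closed-form minimizer (the average of the labels), while 0/1 error has no closed form and here is minimized by checking candidates. The labels below are made up for illustration:

```python
# Made-up labels for a handful of examples
ys = [1.0, 1.0, -1.0, 1.0, -1.0]

# squared error is friendly: the best constant is the mean, in closed form
y_mean = sum(ys) / len(ys)

# 0/1 error: no closed form in general; minimize over {+1, -1} by exhaustive check
def zero_one(c):
    """Fraction of labels a constant prediction c gets wrong."""
    return sum(1 for y in ys if y != c) / len(ys)

best = min((+1.0, -1.0), key=zero_one)
print(y_mean, best)  # 0.2 and the majority label 1.0
```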
Noise and Error Algorithmic Error Measure
Learning Flow with Algorithmic Error Measure

unknown target distribution P(y|x) containing f(x) + noise (ideal credit approval formula)
training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
learning algorithm A
final hypothesis g ≈ f ('learned' formula to be used)
hypothesis set H (set of candidate formulas)
unknown P on X: x_1, x_2, ..., x_N, x; y_1, y_2, ..., y_N, y
error measure err, \hat{err}

err: application goal; \hat{err}: a key part of many A
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 18/25
Noise and Error Algorithmic Error Measure
Fun Time

Consider err below for CIA. What is E_in(g) when using this err?

            g = +1    g = −1
f = +1      0         1
f = −1      1000      0

1. \frac{1}{N} \sum_{n=1}^{N} [\![ y_n \neq g(x_n) ]\!]
2. \frac{1}{N} \Big( \sum_{y_n=+1} [\![ y_n \neq g(x_n) ]\!] + 1000 \sum_{y_n=-1} [\![ y_n \neq g(x_n) ]\!] \Big)
3. \frac{1}{N} \Big( \sum_{y_n=+1} [\![ y_n \neq g(x_n) ]\!] - 1000 \sum_{y_n=-1} [\![ y_n \neq g(x_n) ]\!] \Big)
4. \frac{1}{N} \Big( 1000 \sum_{y_n=+1} [\![ y_n \neq g(x_n) ]\!] + \sum_{y_n=-1} [\![ y_n \neq g(x_n) ]\!] \Big)

Reference Answer: 2

When y_n = −1, the false positive made on such (x_n, y_n) is penalized 1000 times more!
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 19/25
Noise and Error Weighted Classification
Weighted Classification

CIA Cost (Error, Loss, ...) Matrix:
            h(x) = +1    h(x) = −1
y = +1      0            1
y = −1      1000         0

out-of-sample:
E_{out}(h) = \mathbb{E}_{(x,y) \sim P} \begin{cases} 1 & \text{if } y = +1 \\ 1000 & \text{if } y = -1 \end{cases} \cdot [\![ y \neq h(x) ]\!]

in-sample:
E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} \begin{cases} 1 & \text{if } y_n = +1 \\ 1000 & \text{if } y_n = -1 \end{cases} \cdot [\![ y_n \neq h(x_n) ]\!]

weighted classification: different 'weight' for different (x, y)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 20/25
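The weighted in-sample error for the CIA matrix can be computed directly; a small sketch, where the labels and predictions are made up for illustration:

```python
def weighted_e_in(ys, preds):
    """Weighted in-sample error: weight 1 on examples with y_n = +1,
    weight 1000 on examples with y_n = -1 (the CIA cost matrix)."""
    total = sum((1 if yn == +1 else 1000) * (yn != pn)
                for yn, pn in zip(ys, preds))
    return total / len(ys)

ys    = [+1, +1, -1, -1]
preds = [-1, +1, +1, -1]   # one false reject, one false accept
print(weighted_e_in(ys, preds))  # (1 + 1000) / 4 = 250.25
```

A single false accept dominates the error, so an algorithm minimizing this weighted E_in is pushed hard toward rejecting intruders.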