Cost-Sensitive Classification: Algorithms and Advances

(1)

Cost-Sensitive Classification:

Algorithms and Advances

Hsuan-Tien Lin htlin@csie.ntu.edu.tw

Department of Computer Science

& Information Engineering

National Taiwan University

Tutorial for ACML @ Canberra, Australia November 13, 2013


(2)

More about Me

Associate Professor, Dept. CSIE, National Taiwan University

• co-leader of KDDCup world champion teams at NTU: 2010–2013

• research on multi-label classification, ranking, active learning, etc.

• research on cost-sensitive classification:

2007–Present

• Secretary General, Taiwanese Association for Artificial Intelligence

• instructor of Mandarin-teaching MOOC of Machine Learning on NTU-Coursera:

2013.11–

https://www.coursera.org/course/ntumlone

(3)

Cost-Sensitive Binary Classification

Outline

Bayesian Perspective of Cost-Sensitive Binary Classification
Non-Bayesian Perspective of Cost-Sensitive Binary Classification
Cost-Sensitive Multiclass Classification
Bayesian Perspective of Cost-Sensitive Multiclass Classification
Cost-Sensitive Classification by Reweighting and Relabeling
Cost-Sensitive Classification by Binary Classification
Cost-Sensitive Classification by Regression
Cost-and-Error-Sensitive Classification with Bioinformatics Application
Cost-Sensitive Ordinal Ranking with Information Retrieval Application
Summary

(4)

Is This Your Fingerprint?

f(fingerprint) = +1 (you) or −1 (intruder)

• a binary classification problem: grouping “fingerprint pictures” into two different “categories”

C’mon, we know about binary classification all too well! :-)

(5)

Supervised Machine Learning

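[Figure: two parallel learning diagrams. Human learning: a parent supplies (picture, category) pairs; the kid's brain, choosing among possibilities, arrives at a good decision function. Machine learning: the truth f(x) + noise e(x) generates examples (picture x_n, category y_n); a learning algorithm, choosing from the learning model {h(x)}, arrives at a good decision function g(x) ≈ f(x).]

how to evaluate whether g(x) ≈ f(x)?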

(6)

Performance Evaluation

Fingerprint Verification

f(fingerprint) = +1 (you) or −1 (intruder)

example/figure borrowed from Amazon ML best-seller textbook

“Learning from Data” (Abu-Mostafa, Magdon-Ismail, , 2013)

two types of error: false accept and false reject

           g = +1         g = −1
f = +1     no error       false reject
f = −1     false accept   no error

           g = +1   g = −1
f = +1     0        1
f = −1     1        0

simplest choice: penalize both types equally and calculate the average penalty

(7)

Fingerprint Verification for Supermarket

f(fingerprint) = +1 (you) or −1 (intruder)

           g = +1   g = −1
f = +1     0        10
f = −1     1        0

• supermarket: fingerprint for discount
• false reject: very unhappy customer, lose future business
• false accept: give a minor discount, intruder left fingerprint :-)

(8)

Fingerprint Verification for CIA

f(fingerprint) = +1 (you) or −1 (intruder)

           g = +1   g = −1
f = +1     0        1
f = −1     1000     0

• CIA: fingerprint for entrance
• false accept: very serious consequences!
• false reject: unhappy employee, but so what? :-)

(9)

Regular Binary Classification

penalizes both types equally

           h(x) = +1   h(x) = −1
y = +1     0           1
y = −1     1           0

in-sample error for any hypothesis h:
$E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} \llbracket y_n \neq h(x_n) \rrbracket$, where $y_n = f(x_n) + \text{noise}$

out-of-sample error for any hypothesis h:
$E_{\text{out}}(h) = \mathbb{E}_{(x,y)} \llbracket y \neq h(x) \rrbracket$, where $y = f(x) + \text{noise}$

regular binary classification: well-studied in machine learning (ya, we know! :-)

(10)

Class-Weighted Cost-Sensitive Binary Classification

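Supermarket Cost (Error, Loss, ...) Matrix

           h(x) = +1   h(x) = −1
y = +1     0           10
y = −1     1           0

in-sample:
$E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} \begin{cases} 10 & \text{if } y_n = +1 \\ 1 & \text{if } y_n = -1 \end{cases} \cdot \llbracket y_n \neq h(x_n) \rrbracket$

out-of-sample:
$E_{\text{out}}(h) = \mathbb{E}_{(x,y)} \begin{cases} 10 & \text{if } y = +1 \\ 1 & \text{if } y = -1 \end{cases} \cdot \llbracket y \neq h(x) \rrbracket$

class-weighted cost-sensitive binary classification: different ‘weight’ for different y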
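The class-weighted in-sample error can be sketched in a few lines of plain Python. The function name `class_weighted_error` and the four-example toy data are illustrative assumptions, not part of the tutorial.

```python
def class_weighted_error(labels, predictions, w_pos, w_neg):
    """Average class-weighted cost: a mistake on y = +1 costs w_pos,
    a mistake on y = -1 costs w_neg, and a correct prediction costs 0."""
    total = 0.0
    for y, h in zip(labels, predictions):
        if y != h:
            total += w_pos if y == +1 else w_neg
    return total / len(labels)


# Toy data (assumed): one false reject and one false accept out of four.
labels = [+1, +1, -1, -1]
predictions = [+1, -1, +1, -1]

regular = class_weighted_error(labels, predictions, w_pos=1, w_neg=1)
supermarket = class_weighted_error(labels, predictions, w_pos=10, w_neg=1)
cia = class_weighted_error(labels, predictions, w_pos=1, w_neg=1000)
```

The same two mistakes are scored very differently by the three applications, which is the point of the class-weighted setup.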

(11)

Setup: Class-Weighted Cost-Sensitive Binary Classification

Given
N examples, each (input x_n, label y_n) ∈ X × {−1, +1}, and weights w_+, w_− representing the two entries of the cost matrix

           h(x) = +1   h(x) = −1
y = +1     0           w_+
y = −1     w_−         0

Goal
a classifier g(x) that pays a small cost $w_y \llbracket y \neq g(x) \rrbracket$ on future unseen example (x, y), i.e., achieves low E_out(g)

regular classification: w_+ = w_− (= 1)

(12)

Supermarket Revisited

f(fingerprint) = +1 (you) or −1 (intruder)

f                g = +1   g = −1
big customer     0        100
usual customer   0        10
intruder         1        0

• supermarket: fingerprint for discount
• big customer: really don’t want to lose her/his business
• usual customer: don’t want to lose business, but not so serious

(13)

Example-Weighted Cost-Sensitive Binary Classification

Supermarket Cost Vectors (Rows)

y          h(x) = +1   h(x) = −1
big        0           100
usual      0           10
intruder   1           0

in-sample:
$E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} w_n \cdot \llbracket y_n \neq h(x_n) \rrbracket$, where $w_n$ is the importance of example n

out-of-sample:
$E_{\text{out}}(h) = \mathbb{E}_{(x,y,w)}\, w \cdot \llbracket y \neq h(x) \rrbracket$

example-weighted cost-sensitive binary classification: different w for different (x, y) (seen this in AdaBoost? :-)

(14)

Setup: Example-Weighted Cost-Sensitive Binary Classification

Given
N examples, each (input x_n, label y_n) ∈ X × {−1, +1}, and weight w_n ∈ R^+

Goal
a classifier g(x) that pays a small cost $w \llbracket y \neq g(x) \rrbracket$ on future unseen example (x, y, w), i.e., achieves low E_out(g)

regular ⊂ class-weighted ⊂ example-weighted

(15)

Bayesian Perspective of Cost-Sensitive Binary Classification

Outline

Non-Bayesian Perspective of Cost-Sensitive Binary Classification
Cost-Sensitive Multiclass Classification

(16)

Key Idea: Conditional Probability Estimator

Goal (Class-Weighted Setup)
a classifier g(x) that pays a small cost $w_y \llbracket y \neq g(x) \rrbracket$ on future unseen example (x, y)

• expected error for predicting +1 on x: $w_- P(-1|x)$
• expected error for predicting −1 on x: $w_+ P(+1|x)$

if P(y|x) known:
Bayes optimal $g^*(x) = \mathrm{sign}\big(w_+ P(+1|x) - w_- P(-1|x)\big)$

if p(x) ≈ P(+1|x) well:
approximately good $g_p(x) = \mathrm{sign}\big(w_+ p(x) - w_- (1 - p(x))\big)$

how to get conditional probability estimator p? logistic regression, Naïve Bayes, ...

(17)

Approximate Bayes-Optimal Decision

if p(x) ≈ P(+1|x) well:
approximately good $g_p(x) = \mathrm{sign}\big(w_+ p(x) - w_- (1 - p(x))\big)$

that is (Elkan, 2001),
$g_p(x) = +1$ iff $w_+ p(x) - w_- (1 - p(x)) > 0$ iff $p(x) > \frac{w_-}{w_+ + w_-}$: the threshold is $\frac{1}{11}$ for supermarket and $\frac{1000}{1001}$ for CIA

Approximate Bayes-Optimal Decision (ABOD) Approach
1 use your favorite algorithm on {(x_n, y_n)} to get p(x) ≈ P(+1|x)
2 for each new input x, predict its class using $g_p(x) = \mathrm{sign}\big(p(x) - \frac{w_-}{w_+ + w_-}\big)$

‘simplest’ approach: probability estimate + threshold changing
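A minimal sketch of the threshold-changing step, assuming a probability estimate p(x) is already available from some estimator. The helper names `abod_threshold` and `abod_predict` are hypothetical.

```python
from fractions import Fraction


def abod_threshold(w_pos, w_neg):
    """Threshold on p(x) above which predicting +1 is cheaper: w_- / (w_+ + w_-)."""
    return Fraction(w_neg, w_pos + w_neg)


def abod_predict(p, w_pos, w_neg):
    """Sign of w_+ p(x) - w_- (1 - p(x)), given an estimate p of P(+1|x)."""
    return +1 if w_pos * p - w_neg * (1 - p) > 0 else -1


supermarket_threshold = abod_threshold(w_pos=10, w_neg=1)
cia_threshold = abod_threshold(w_pos=1, w_neg=1000)

# The same estimate p(x) = 0.2 leads to different decisions under different costs.
supermarket_decision = abod_predict(0.2, w_pos=10, w_neg=1)
regular_decision = abod_predict(0.2, w_pos=1, w_neg=1)
```

Training is untouched; only the comparison threshold moves with the cost matrix.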

(18)

ABOD on Artificial Data

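1 use your favorite algorithm on {(x_n, y_n)} to get p(x) ≈ P(+1|x)
2 for each new input x, predict its class using $g_p(x) = \mathrm{sign}\big(p(x) - \frac{w_-}{w_+ + w_-}\big)$

[Figure: logistic regression (LogReg) on artificial data, comparing the decision under the regular cost with the decision under the supermarket cost]

           g = +1   g = −1
y = +1     0        10
y = −1     1        0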

(19)

Pros and Cons of ABOD

Pros
• optimal: if good probability estimate: p(x) really close to P(+1|x)
• simple: training (probability estimate) unchanged, and prediction (threshold) changed only a little

Cons
• ‘difficult’: good probability estimate often more difficult than good binary classification
• ‘restricted’: only applicable to class-weighted setup, because it needs the ‘full picture’ of the cost matrix

approach for the example-weighted setup?

(20)

Non-Bayesian Perspective of Cost-Sensitive Binary Classification

Outline

Bayesian Perspective of Cost-Sensitive Binary Classification
Non-Bayesian Perspective of Cost-Sensitive Binary Classification
Cost-Sensitive Multiclass Classification

(21)

Key Idea: Example Weight = Copying

Goal
a classifier g(x) that pays a small cost $w \llbracket y \neq g(x) \rrbracket$ on future unseen example (x, y, w)

on one (x, y): wrong prediction charged by w (regular classification)
on w copies of (x, y): wrong prediction charged by 1 (regular classification)

how to copy? over-sampling

(22)

Example-Weighted Classification by Over-Sampling

copy each (x_n, y_n) for w_n times

original problem, evaluated with

y          h(x) = +1   h(x) = −1
big        0           100
usual      0           10
intruder   1           0

(x_1, −1, 1), (x_2, +1, 10), (x_3, +1, 100), (x_4, +1, 10), (x_5, −1, 1)

equivalent problem, evaluated with

y          h(x) = +1   h(x) = −1
big        0           1
usual      0           1
intruder   1           0

(x_1, −1); (x_2, +1), ..., (x_2, +1); (x_3, +1), ..., (x_3, +1); (x_4, +1), ..., (x_4, +1); (x_5, −1)

how to learn a good g for the equivalent problem (RHS)? SVM, NNet, ...
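The equivalence between the weighted problem and its over-sampled copy can be checked directly. This is a small sketch with assumed helper names (`oversample`, `weighted_cost`, `plain_errors`), integer weights, and string placeholders standing in for the inputs x_1, ..., x_5.

```python
def oversample(weighted_examples):
    """Replace each (x, y, w) by w plain copies of (x, y); w must be a non-negative integer."""
    copies = []
    for x, y, w in weighted_examples:
        copies.extend([(x, y)] * w)
    return copies


def weighted_cost(weighted_examples, h):
    """Total cost on the original problem: each mistake is charged w."""
    return sum(w for x, y, w in weighted_examples if h(x) != y)


def plain_errors(examples, h):
    """Total cost on the copied problem: each mistake is charged 1."""
    return sum(1 for x, y in examples if h(x) != y)


data = [("x1", -1, 1), ("x2", +1, 10), ("x3", +1, 100), ("x4", +1, 10), ("x5", -1, 1)]
copies = oversample(data)


def always_reject(x):
    return -1


original_cost = weighted_cost(data, always_reject)
copied_cost = plain_errors(copies, always_reject)
```

Any hypothesis is charged the same total on both sides, so a regular learner run on the copies minimizes the weighted cost of the original problem.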

(23)

Cost-Proportionate Example Weighting

Cost-Proportionate Example Weighting (CPEW) Approach

1 effectively transform {(x_n, y_n, w_n)} to {(x_m, y_m)} such that the number of ‘copies’ of (x_n, y_n) in {(x_m, y_m)} is proportional to w_n
  • over/under-sampling with normalized w_n (Elkan, 2001)
  • under-sampling by rejection (Zadrozny, 2003)
  • modify existing algorithms equivalently (Zadrozny, 2003)
2 use your favorite algorithm on {(x_m, y_m)} to get binary classifier g(x)
3 for each new input x, predict its class using g(x)

simple and general: very popular for cost-sensitive binary classification

(24)

CPEW by Modification

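1 effectively transform {(x_n, y_n, w_n)} to {(x_m, y_m)} such that the number of ‘copies’ of (x_n, y_n) in {(x_m, y_m)} is proportional to w_n
  • modify existing algorithms equivalently (Zadrozny, 2003)
2 use your favorite algorithm on {(x_m, y_m)} to get binary classifier g(x)

Regular Linear SVM
$\min_{\mathbf{w}, b} \; \frac{1}{2} \langle \mathbf{w}, \mathbf{w} \rangle + \sum_{n=1}^{N} C\, \xi_n$, with $\xi_n = \max\big(1 - y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b),\, 0\big)$

Modified Linear SVM
$\min_{\mathbf{w}, b} \; \frac{1}{2} \langle \mathbf{w}, \mathbf{w} \rangle + \sum_{n=1}^{N} C \cdot w_n \cdot \xi_n$, with $\xi_n = \max\big(1 - y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b),\, 0\big)$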
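To see why the modified objective is an "equivalent" form of copying, the sketch below evaluates both objectives for a fixed (w, b). It only evaluates the objective and does not solve the optimization; the function names and the toy data are assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hinge(w, b, x, y):
    """Slack xi_n = max(1 - y_n(<w, x_n> + b), 0)."""
    return max(1 - y * (dot(w, x) + b), 0.0)


def svm_objective(w, b, xs, ys, C, weights=None):
    """(1/2)<w, w> + sum_n C * w_n * xi_n; weights=None gives the regular SVM."""
    if weights is None:
        weights = [1] * len(xs)
    slack = sum(C * wn * hinge(w, b, x, y) for x, y, wn in zip(xs, ys, weights))
    return 0.5 * dot(w, w) + slack


xs = [[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0]]
ys = [+1, +1, -1]
example_weights = [10, 10, 1]
w, b = [1.0, 0.0], 0.0

regular = svm_objective(w, b, xs, ys, C=1.0)
modified = svm_objective(w, b, xs, ys, C=1.0, weights=example_weights)

# Copy each example w_n times and evaluate the regular objective on the copies.
xs_copied = [x for x, wn in zip(xs, example_weights) for _ in range(wn)]
ys_copied = [y for y, wn in zip(ys, example_weights) for _ in range(wn)]
copied = svm_objective(w, b, xs_copied, ys_copied, C=1.0)
```

Since the modified objective equals the regular objective on the copied data for every (w, b), minimizing one minimizes the other, without materializing any copies.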

(25)

CPEW by Modification on Artificial Data

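1 effectively transform {(x_n, y_n, w_n)} to {(x_m, y_m)} by modifying existing algorithms equivalently (Zadrozny, 2003)
2 use your favorite algorithm on {(x_m, y_m)} to get g(x)

[Figure: a classifier on artificial data before and after the modification, under the regular cost and under the supermarket cost]

           g = +1   g = −1
y = +1     0        10
y = −1     1        0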

(26)

CPEW by Rejection Sampling

COSTING Algorithm (Zadrozny, 2003)

1 effectively transform {(x_n, y_n, w_n)} to {(x_m, y_m)} by under-sampling with rejection (Zadrozny, 2003), such that the number of ‘copies’ of (x_n, y_n) in {(x_m, y_m)} is proportional to w_n
2 use your favorite algorithm on {(x_m, y_m)} to get binary classifier g(x)
3 repeat 1 and 2 to get multiple g and aggregate them
4 for each new input x, predict its class using aggregated g(x)

commonly used when your favorite algorithm is a black box rather than a white box
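The four steps above can be sketched as follows. This is a simplified illustration rather than the published algorithm in full: it assumes each example is kept with probability w_n / max_n w_n, aggregates by an unweighted majority vote, and takes the base learner as a user-supplied callable `train` (all names here are hypothetical).

```python
import random


def rejection_sample(weighted_examples, rng):
    """Keep (x, y) with probability w / w_max, so expected copies are proportional to w."""
    w_max = max(w for _, _, w in weighted_examples)
    return [(x, y) for x, y, w in weighted_examples if rng.random() < w / w_max]


def costing(weighted_examples, train, rounds=11, seed=0):
    """Train one black-box classifier per rejection sample and aggregate by majority vote."""
    rng = random.Random(seed)
    classifiers = [train(rejection_sample(weighted_examples, rng)) for _ in range(rounds)]

    def aggregated(x):
        votes = sum(g(x) for g in classifiers)  # each g(x) is +1 or -1
        return +1 if votes >= 0 else -1

    return aggregated


def train_majority(examples):
    """Toy black-box learner (assumed): always predicts the majority label of its sample."""
    balance = sum(y for _, y in examples)
    return lambda x: +1 if balance >= 0 else -1


# One big customer (weight 100) against twenty intruders (weight 1 each).
data = [("big", +1, 100)] + [("intruder%d" % k, -1, 1) for k in range(20)]
g = costing(data, train_majority)
one_sample = rejection_sample(data, random.Random(1))
```

Each sample is small, which keeps every training run cheap; the aggregation over several samples compensates for the examples that any single sample discards.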

(27)

Biased Personal Favorites

• CPEW by Modification if possible

• COSTING: fast training and stable performance

• ABOD if in the mood for Bayesian :-)

(28)

Outline

(29)

Which Digit Did You Write?

? → one (1), two (2), three (3), or four (4)

• a multiclass classification problem: grouping “pictures” into different “categories”

C’mon, we know about multiclass classification all too well! :-)

(30)

Performance Evaluation

(g(x) ≈ f(x)?)

?

• ZIP code recognition: 1: wrong; 2: right; 3: wrong; 4: wrong
• check value recognition: 1: one-dollar mistake; 2: no mistake; 3: one-dollar mistake; 4: two-dollar mistake
• evaluation by formation similarity: 1: not very similar; 2: very similar; 3: somewhat similar; 4: a silly prediction

different applications: evaluate mis-predictions differently

(31)

ZIP Code Recognition

?

1: wrong; 2: right; 3: wrong; 4: wrong

• regular multiclass classification: only right or wrong
• wrong cost: 1; right cost: 0
• prediction error of h on some (x, y): classification cost = $\llbracket y \neq h(x) \rrbracket$, as discussed in regular binary classification

regular multiclass classification:

well-studied, many good algorithms

(32)

Check Value Recognition

?

1: one-dollar mistake; 2: no mistake; 3: one-dollar mistake; 4: two-dollar mistake

• cost-sensitive multiclass classification: different costs for different mis-predictions
• e.g. prediction error of h on some (x, y): absolute cost = |y − h(x)|

next: cost-sensitive multiclass classification

(33)

What is the Status of the Patient?

?

H1N1-infected cold-infected healthy

• another classification problem: grouping “patients” into different “status”

are all mis-prediction costs equal?

(34)

Patient Status Prediction

error measure = society cost

actual \ predicted   H7N9   cold   healthy
H7N9                 0      1000   100000
cold                 100    0      3000
healthy              100    30     0

• H7N9 mis-predicted as healthy: very high cost
• cold mis-predicted as healthy: high cost
• cold correctly predicted as cold: no cost

human doctors consider costs of decision;
can computer-aided diagnosis do the same?

(35)

What is the Type of the Movie?

? → romance, fiction, or terror

error measure = non-satisfaction

customer 1, who hates romance but likes terror:
actual \ predicted   romance   fiction   terror
romance              0         5         100

customer 2, who likes terror and romance:
actual \ predicted   romance   fiction   terror
romance              0         5         3

different customers: evaluate mis-predictions differently

(36)

Cost-Sensitive Multiclass Classification Tasks

movie classification with non-satisfaction:

actual \ predicted    romance   fiction   terror
customer 1, romance   0         5         100
customer 2, romance   0         5         3

patient diagnosis with society cost:

actual \ predicted   H7N9   cold   healthy
H7N9                 0      1000   100000
cold                 100    0      3000
healthy              100    30     0

check digit recognition with absolute cost: C(y, h(x)) = |y − h(x)|

(37)

Cost Vector

cost vector c: a row of cost components

• customer 1 on a romance movie: c = (0, 5, 100)
• an H7N9 patient: c = (0, 1000, 100000)
• absolute cost for digit 2: c = (1, 0, 1, 2)
• “regular” classification cost for label 2: $c_c^{(2)} = (1, 0, 1, 1)$

regular classification:

special case of cost-sensitive classification
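A short sketch of how such cost vectors arise as rows of a cost matrix, using the absolute cost and the regular classification cost over labels 1..K. The function names are assumptions for illustration.

```python
def absolute_cost_matrix(K):
    """C(y, k) = |y - k| for labels 1..K; row y is the cost vector of an example with label y."""
    return [[abs(y - k) for k in range(1, K + 1)] for y in range(1, K + 1)]


def regular_cost_matrix(K):
    """C(y, k) = 1 if y != k else 0: regular classification as a special case."""
    return [[0 if y == k else 1 for k in range(1, K + 1)] for y in range(1, K + 1)]


# Cost vectors for an example whose label is digit 2 (row index 1).
absolute_row_for_2 = absolute_cost_matrix(4)[1]
regular_row_for_2 = regular_cost_matrix(4)[1]
```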

(38)

Setup: Matrix-Based Cost-Sensitive Multiclass Classification

Given
N examples, each (input x_n, label y_n) ∈ X × {1, 2, ..., K}, and cost matrix C ∈ R^{K×K}
(will assume $C(y, y) = 0 = \min_{1 \le k \le K} C(y, k)$)

Goal
a classifier g(x) that pays a small cost C(y, g(x)) on future unseen example (x, y)

extension of ‘class-weighted’ cost-sensitive binary classification

(39)

Setup: Vector-Based Cost-Sensitive Multiclass Classification

Given
N examples, each (input x_n, label y_n) ∈ X × {1, 2, ..., K}, and cost vector c_n ∈ R^K
(will assume $c_n[y_n] = 0 = \min_{1 \le k \le K} c_n[k]$)

Goal
a classifier g(x) that pays a small cost c[g(x)] on future unseen example (x, y, c)

• will assume $c[y] = 0 = c_{\min} = \min_{1 \le k \le K} c[k]$
• note: y not really needed in evaluation

extension of ‘example-weighted’ cost-sensitive binary classification

(40)

Which Age-Group?

?

infant (1) child (2) teen (3) adult (4)

• small mistake: classify a child as a teen; big mistake: classify an infant as an adult
• cost matrix C(y, g(x)) for embedding ‘order’:

$C = \begin{pmatrix} 0 & 1 & 4 & 5 \\ 1 & 0 & 1 & 3 \\ 3 & 1 & 0 & 2 \\ 5 & 4 & 1 & 0 \end{pmatrix}$

cost-sensitive classification can help solve many other problems, such as ordinal ranking

(41)

Bayesian Perspective of Cost-Sensitive Multiclass Classification

Outline

Cost-Sensitive Classification by Reweighting and Relabeling
Cost-Sensitive Classification by Binary Classification

(42)

Key Idea: Conditional Probability Estimator

Goal (Matrix Setup)
a classifier g(x) that pays a small cost C(y, g(x)) on future unseen example (x, y)

if P(y|x) known:
Bayes optimal $g^*(x) = \operatorname{argmin}_{1 \le k \le K} \sum_{y=1}^{K} P(y|x)\, C(y, k)$

if p(y, x) ≈ P(y|x) well:
approximately good $g_p(x) = \operatorname{argmin}_{1 \le k \le K} \sum_{y=1}^{K} p(y, x)\, C(y, k)$

how to get conditional probability estimator p? logistic regression, Naïve Bayes, ...

(43)

Approximate Bayes-Optimal Decision

if p(y, x) ≈ P(y|x) well (Domingos, 1999):
approximately good $g_p(x) = \operatorname{argmin}_{k \in \{1, 2, \ldots, K\}} \sum_{y=1}^{K} p(y, x)\, C(y, k)$

Approximate Bayes-Optimal Decision (ABOD) Approach
1 use your favorite algorithm on {(x_n, y_n)} to get p(y, x) ≈ P(y|x)
2 for each new input x, predict its class using g_p(x) above

a simple extension from binary classification: probability estimate + Bayes-optimal decision
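A minimal sketch of the decision step g_p(x), assuming the probability estimates p(y, x) for one input are already given as a list. The function name `abod_predict_multiclass` and the example probabilities are assumptions; the cost matrix is the patient-status matrix from the earlier slide (classes ordered H7N9, cold, healthy).

```python
def abod_predict_multiclass(probs, cost_matrix):
    """Return the label k in 1..K minimizing sum_y p(y, x) * C(y, k)."""
    K = len(probs)
    expected = [sum(probs[y] * cost_matrix[y][k] for y in range(K)) for k in range(K)]
    return min(range(K), key=expected.__getitem__) + 1


patient_cost = [
    [0, 1000, 100000],  # actual H7N9
    [100, 0, 3000],     # actual cold
    [100, 30, 0],       # actual healthy
]
regular_cost = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

# Assumed estimates for one patient: most likely healthy, small chance of H7N9.
probs = [0.01, 0.09, 0.90]

cost_sensitive_decision = abod_predict_multiclass(probs, patient_cost)
regular_decision = abod_predict_multiclass(probs, regular_cost)
```

Under the regular cost the most probable class wins; under the society cost the small chance of a very expensive mistake pulls the decision away from "healthy".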

(44)

ABOD on Artificial Data

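1 use your favorite algorithm on {(x_n, y_n)} to get p(y, x) ≈ P(y|x)
2 for each new input x, predict its class using g_p(x)

[Figure: logistic regression (LogReg) on artificial data, comparing the decision under the regular cost with the decision under the ‘rotate’ cost below]

y \ g   1   2   3   4
1       0   1   2   4
2       4   0   1   2
3       2   4   0   1
4       1   2   4   0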

(45)

Pros and Cons of ABOD

Pros

• optimal: if good probability estimate: p(y, x) really close to P(y|x)
• simple: with training (probability estimate) unchanged, and prediction (threshold) changed only a little

Cons
• ‘difficult’: good probability estimate often more difficult than good multiclass classification
• ‘restricted’: only applicable to class-weighted setup, because it needs the ‘full picture’ of the cost matrix
• ‘slow prediction’: need sophisticated calculation at prediction stage

can we use any multiclass classification algorithm for ABOD?

(46)

MetaCost Approach

Approximate Bayes-Optimal Decision (ABOD) Approach
1 use your favorite algorithm on {(x_n, y_n)} to get p(y, x) ≈ P(y|x)
2 for each new input x, predict its class using g_p(x)

MetaCost Approach (Domingos, 1999)
1 use your favorite multiclass classification algorithm on bootstrapped {(x_n, y_n)} and aggregate the classifiers to get p(y, x) ≈ P(y|x)
2 for each given input x_n, relabel it to y'_n using g_p(x_n)
3 run your favorite multiclass classification algorithm on relabeled {(x_n, y'_n)} to get final classifier g
4 for each new input x, predict its class using g(x)

pros: any multiclass classification algorithm can be used

(47)

MetaCost on Semi-Real Data

[Figure from (Domingos, 1999): three scatter plots of the cost of MetaCost against the cost of C4.5R, Undersampling, and Oversampling, with separate markers for multiclass and two-class data sets and the reference line y = x]

• some “random” cost with UCI data

• MetaCost+C4.5: cost-sensitive
• C4.5: regular

not surprisingly,

considering the cost properly does help

(48)

Cost-Sensitive Classification by Reweighting and Relabeling

Outline

Bayesian Perspective of Cost-Sensitive Multiclass Classification
Cost-Sensitive Classification by Reweighting and Relabeling
Cost-Sensitive Classification by Binary Classification
Cost-Sensitive Classification by Regression

(49)

Recall: Example-Weighting Useful for Binary

can example weighting be used for multiclass?

Yes! an elegant solution if using a cost matrix with special properties (Zhou, 2010):

$\frac{C(i, j)}{C(j, i)} = \frac{w_i}{w_j}$

what if using cost vectors without special properties?

(50)

Key Idea: Cost Transformation

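$\underbrace{(0,\ 1000)}_{\text{cost } c} = \underbrace{(1000,\ 0)}_{\#\text{ of copies}} \cdot \underbrace{\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}}_{\text{classification costs}}$

$\underbrace{(3,\ 2,\ 3,\ 4)}_{\text{cost } c} = \underbrace{(1,\ 2,\ 1,\ 0)}_{\text{mixture weights } q_\ell} \cdot \underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{\text{classification costs}}$

• split the cost-sensitive example: (x, 2) with c = (3, 2, 3, 4) is equivalent to a weighted mixture {(x, 1, 1), (x, 2, 2), (x, 3, 1)}

cost equivalence: $c[h(x)] = \sum_{\ell=1}^{K} q_\ell \llbracket \ell \neq h(x) \rrbracket$ for any h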
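The cost equivalence can be checked numerically for both examples on this slide. A small sketch; the function name `cost_from_mixture` is an assumption.

```python
def cost_from_mixture(q):
    """c[k] = sum over labels l != k of q[l]: the weighted classification error
    that a prediction k pays on the mixture {(x, l, q_l)}."""
    K = len(q)
    return [sum(q[l] for l in range(K) if l != k) for k in range(K)]


binary_cost = cost_from_mixture([1000, 0])
multiclass_cost = cost_from_mixture([1, 2, 1, 0])
```

Predicting label k is charged every mixture weight except q_k, which is exactly the product of the weight vector with the regular classification cost matrix.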

(51)

Meaning of Cost Equivalence

$c[h(x)] = \sum_{\ell=1}^{K} q_\ell \llbracket \ell \neq h(x) \rrbracket$

on one (x, y, c): wrong prediction charged by c[h(x)] (weighted classification)
on all (x, ℓ, q_ℓ): wrong prediction charged by total weighted classification error (weighted classification)

weighted classification ⟹ regular classification? same as binary (with CPEW) when q_ℓ ≥ 0

min_g expected LHS (original cost-sensitive problem)
= min_g expected RHS (a regular problem when q_ℓ ≥ 0)

(52)

Cost Transformation Methodology: Preliminary

1 split each training example (x_n, y_n, c_n) to a weighted mixture $\{(x_n, \ell, q_{n,\ell})\}_{\ell=1}^{K}$
2 apply regular/weighted classification algorithm on the weighted mixtures $\bigcup_{n=1}^{N} \{(x_n, \ell, q_{n,\ell})\}_{\ell=1}^{K}$

• by $c[g(x)] = \sum_{\ell=1}^{K} q_\ell \llbracket \ell \neq g(x) \rrbracket$ (cost equivalence), good g for new regular classification problem = good g for original cost-sensitive classification problem
• regular classification: needs q_{n,ℓ} ≥ 0

but what if q_{n,ℓ} negative?

(53)

Similar Cost Vectors

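$\underbrace{\begin{pmatrix} 1 & 0 & 1 & 2 \\ 3 & 2 & 3 & 4 \end{pmatrix}}_{\text{costs}} = \underbrace{\begin{pmatrix} 1/3 & 4/3 & 1/3 & -2/3 \\ 1 & 2 & 1 & 0 \end{pmatrix}}_{\text{mixture weights } q_\ell} \cdot \underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{\text{classification costs}}$

• negative q_ℓ: cannot split
• but ĉ = (1, 0, 1, 2) is similar to c = (3, 2, 3, 4): for any classifier g,
  $\hat{c}[g(x)] + \text{constant} = c[g(x)] = \sum_{\ell=1}^{K} q_\ell \llbracket \ell \neq g(x) \rrbracket$
• constant can be dropped during minimization

$\arg\min_g$ expected $\hat{c}[g(x)]$ (original cost-sensitive problem)
= $\arg\min_g$ expected RHS (a regular problem when $q_\ell \ge 0$)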
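The mixture weights for a given cost vector follow from the cost equivalence: since c[k] = Q − q_k with Q = Σ_ℓ q_ℓ, summing over k gives Q = (Σ_k c[k]) / (K − 1) and hence q_k = Q − c[k]. The sketch below applies this with exact fractions; the function names are assumptions.

```python
from fractions import Fraction


def mixture_weights(c):
    """Solve c[k] = sum_{l != k} q[l] for q: q[k] = sum(c) / (K - 1) - c[k]."""
    K = len(c)
    total = Fraction(sum(c), K - 1)
    return [total - ck for ck in c]


def is_splittable(c):
    """A cost vector can be split into a weighted mixture only if every q[k] >= 0."""
    return all(q >= 0 for q in mixture_weights(c))


q_hat = mixture_weights([1, 0, 1, 2])  # original cost vector: one weight is negative
q = mixture_weights([3, 2, 3, 4])      # the same costs shifted by the constant 2
```

Shifting every component of the cost vector by the same constant raises every weight by constant / (K − 1), which is how a non-splittable cost becomes splittable.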

(54)

Cost Transformation Methodology: Revised

1 shift each training cost ĉ_n to a similar and “splittable” c_n
2 split (x_n, y_n, c_n) to a weighted mixture $\{(x_n, \ell, q_{n,\ell})\}_{\ell=1}^{K}$
3 apply regular classification algorithm on the weighted mixtures $\bigcup_{n=1}^{N} \{(x_n, \ell, q_{n,\ell})\}_{\ell=1}^{K}$

• splittable: q_{n,ℓ} ≥ 0
• by cost equivalence after shifting: good g for new regular classification problem = good g for original cost-sensitive classification problem

but infinitely many similar and splittable c_n!

(55)

Uncertainty in Mixture

• a single example {(x, 2)}: certain that the desired label is 2
• a mixture {(x, 1, 1), (x, 2, 2), (x, 3, 1)} sharing the same x: uncertainty in the desired label (25%: 1, 50%: 2, 25%: 3)
• over-shifting adds unnecessary mixture uncertainty:

$\underbrace{\begin{pmatrix} 3 & 2 & 3 & 4 \\ 33 & 32 & 33 & 34 \end{pmatrix}}_{\text{costs}} = \underbrace{\begin{pmatrix} 1 & 2 & 1 & 0 \\ 11 & 12 & 11 & 10 \end{pmatrix}}_{\text{mixture weights}} \cdot \underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{\text{classification costs}}$

should choose a similar and splittable c with minimum mixture uncertainty
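One way to quantify the uncertainty in the two mixtures above is the entropy of the normalized weights. This sketch uses base-2 logarithms (bits), which is an assumption; the function name `mixture_entropy` is also illustrative.

```python
import math


def mixture_entropy(q):
    """Entropy (in bits) of the label distribution q / sum(q); zero weights contribute nothing."""
    total = sum(q)
    return -sum((v / total) * math.log2(v / total) for v in q if v > 0)


certain = mixture_entropy([0, 1, 0, 0])           # a single example with label 2
minimal_shift = mixture_entropy([1, 2, 1, 0])     # weights for c = (3, 2, 3, 4)
over_shifted = mixture_entropy([11, 12, 11, 10])  # weights for c = (33, 32, 33, 34)
```

The over-shifted mixture is close to the uniform distribution over four labels (2 bits), so almost all information about which label is preferred has been diluted.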

(56)

Cost Transformation Methodology: Final

Cost Transformation Methodology (Lin, 2010)

1 shift each training cost ĉ_n to a similar and splittable c_n with minimum “mixture uncertainty”
2 split (x_n, y_n, c_n) to a weighted mixture $\{(x_n, \ell, q_{n,\ell})\}_{\ell=1}^{K}$
3 apply regular classification algorithm on the weighted mixtures $\bigcup_{n=1}^{N} \{(x_n, \ell, q_{n,\ell})\}_{\ell=1}^{K}$

• mixture uncertainty: entropy of normalized (q_1, q_2, ..., q_K)
• a simple and unique optimal shifting exists for every ĉ

good g for new regular classification problem = good g for original cost-sensitive classification problem

(57)

Data Space Expansion Approach

Data Space Expansion (DSE) Approach (Abe, 2004)

1 for each (x_n, y_n, c_n) and ℓ, let $q_{n,\ell} = \max_{1 \le k \le K} c_n[k] - c_n[\ell]$
2 apply your favorite multiclass classification algorithm on the weighted mixtures $\bigcup_{n=1}^{N} \{(x_n, \ell, q_{n,\ell})\}_{\ell=1}^{K}$ to get g(x)
3 for each new input x, predict its class using g(x)

• detailed explanation provided by the cost transformation methodology discussed above (Lin, 2010)
• extension of Cost-Proportionate Example Weighting, but now with relabeling!

pros: any multiclass classification algorithm can be used
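Step 1 of DSE can be sketched as a data transformation that turns each (x_n, c_n) into weighted, relabeled examples. The function name `dse_expand` is an assumption, and dropping the zero-weight entries is a convenience of this sketch.

```python
def dse_expand(cost_examples):
    """Expand each (x, c) into weighted examples (x, label, q) with q = max(c) - c[label].
    Labels are 1-based; entries with zero weight are omitted."""
    expanded = []
    for x, c in cost_examples:
        c_max = max(c)
        for label, cost in enumerate(c, start=1):
            q = c_max - cost
            if q > 0:
                expanded.append((x, label, q))
    return expanded


mixture = dse_expand([("x", [3, 2, 3, 4])])
same_mixture = dse_expand([("x", [1, 0, 1, 2])])  # a shifted cost gives the same weights
```

Because q depends only on differences to the largest cost, similar cost vectors (equal up to a constant) yield the same weighted mixture.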

(58)

DSE versus MetaCost on Semi-Real Data

(Abe, 2004)

data set    MetaCost   DSE
annealing   206.8      127.1
solar       5317       110.9
kdd99       49.39      46.68
letter      129.6      114.0
splice      49.95      135.5
satellite   104.4      116.8

• C4.5 with COSTING for weighted classification

DSE comparable to MetaCost

(59)

Cons of DSE: Unavoidable (Minimum) Uncertainty

Original Cost-Sensitive Classification Problem: individual examples (labels 1, 2, 3, 4) with certainty + absolute cost

=

New Regular Classification Problem: mixtures with unavoidable uncertainty

• cost embedded as weight + label
• new problem usually harder than original one

need robust multiclass classification algorithm to deal with uncertainty

(60)

Cost-Sensitive Classification by Binary Classification

Outline

(61)

Key Idea: Design Robust Multiclass Algorithm

One-Versus-One: A Popular Classification Meta-Method

1 for a pair (i, j), take all examples (x_n, y_n) with y_n = i or j (original one-versus-one)
2 for a pair (i, j), from each weighted mixture {(x_n, ℓ, q_{n,ℓ})} with q_{n,i} > q_{n,j}, take (x_n, i) with weight q_{n,i} − q_{n,j}; vice versa (robust one-versus-one)
3 train a binary classifier $\hat{g}^{(i,j)}$ using those examples
4 repeat the previous two steps for all different (i, j)
5 predict using the votes from $\hat{g}^{(i,j)}$

• un-shifting inside the meta-method to remove uncertainty
• robust step makes it suitable for cost transformation methodology

cost-sensitive one-versus-one: cost transformation + robust one-versus-one

(62)

Cost-Sensitive One-Versus-One (CSOVO)

Cost-Sensitive One-Versus-One (Lin, 2010)

1 for a pair (i, j), transform all examples (x_n, y_n) to $\big(x_n, \operatorname{argmin}_{k \in \{i, j\}} c_n[k]\big)$ with weight $|c_n[i] - c_n[j]|$
2 train a binary classifier $\hat{g}^{(i,j)}$ using those examples
3 repeat the previous two steps for all different (i, j)
4 predict using the votes from $\hat{g}^{(i,j)}$

• comes with good theoretical guarantee: test cost of final classifier $\le 2 \sum_{i<j}$ test cost of $\hat{g}^{(i,j)}$
• simple, efficient, and takes original OVO as special case

physical meaning: each $\hat{g}^{(i,j)}$ answers the yes/no question “prefer i or j?”
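The pairwise transformation and the voting step can be sketched as below. The binary learner itself is left out; the constant pairwise classifiers in the usage part are stand-ins, and all names are assumptions for illustration.

```python
from itertools import combinations


def csovo_pair_dataset(cost_examples, i, j):
    """For the pair (i, j): relabel each x to the cheaper of i and j,
    weighted by |c[i] - c[j]|; examples with equal costs carry no weight and are dropped."""
    data = []
    for x, c in cost_examples:
        weight = abs(c[i - 1] - c[j - 1])
        if weight > 0:
            label = i if c[i - 1] < c[j - 1] else j
            data.append((x, label, weight))
    return data


def csovo_predict(x, pair_classifiers, K):
    """Each pairwise classifier votes for i or j; return the label with the most votes."""
    votes = [0] * (K + 1)
    for g in pair_classifiers.values():
        votes[g(x)] += 1
    return max(range(1, K + 1), key=votes.__getitem__)


examples = [("x", [3, 2, 3, 4])]
pair_12 = csovo_pair_dataset(examples, 1, 2)
pair_13 = csovo_pair_dataset(examples, 1, 3)
pair_24 = csovo_pair_dataset(examples, 2, 4)

# Stand-in pairwise classifiers that always prefer the cheaper label of each pair
# (ties broken toward the smaller label).
costs = examples[0][1]
classifiers = {
    (i, j): (lambda x, i=i, j=j: i if costs[i - 1] <= costs[j - 1] else j)
    for i, j in combinations(range(1, 5), 2)
}
prediction = csovo_predict("x", classifiers, K=4)
```

With the regular classification cost, every pair involving the true label gets weight 1 and all other pairs get weight 0, which recovers the original one-versus-one.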

(63)

CSOVO on Semi-Real Data

[Figure (Lin, 2010): average test random cost of CSOVO versus OVO on the data sets veh, vow, seg, dna, sat, usp]

• CSOVO-SVM: cost-sensitive
• OVO-SVM: regular

not surprisingly again, considering the cost properly does help