Cost-sensitive Classification: Techniques and Stories
Hsuan-Tien Lin htlin@csie.ntu.edu.tw
Professor
Dept. of CSIE, National Taiwan University
Machine Learning Summer School @ Taipei, Taiwan August 2, 2021
About Me
• co-author of textbook ‘Learning from Data: A Short Course’
• instructor of two Mandarin-teaching ML MOOCs on Coursera
goal: make ML more realistic
• weakly supervised learning: in ICML ’20, ICLR ’21, ...
• online/active learning: in ICML ’12, ICML ’14, AAAI ’15, EMNLP ’20, ...
• cost-sensitive classification: in ICML ’10, KDD ’12, IJCAI ’16, ...
• multi-label classification: in NeurIPS ’12, ICML ’14, AAAI ’18, ...
• large-scale data mining: e.g. co-led KDDCup world-champion NTU teams 2010–2013
More About Me
attendee: MLSS Taipei 2006
student workshop talk: Large-Margin Thresholded Ensembles for Ordinal Regression
Disclaimer about Cost-sensitive Classification
materials mostly from “old” tutorials
• Advances in Cost-sensitive Multiclass and Multilabel Classification. KDD 2019 Tutorial, Anchorage, Alaska, USA, August 2019.
• Cost-sensitive Classification: Algorithms and Advances. ACML 2013 Tutorial, Canberra, Australia, November 2013.
• core techniques somewhat mature, compared to 10 years ago
• new research still being inspired, e.g.
  • Classification with Rejection Based on Cost-sensitive Classification, Charoenphakdee et al., ICML ’21
  • Cost-Sensitive Robustness against Adversarial Examples, Zhang and Evans, ICLR ’19
• will show one application story in the end
Cost-Sensitive Multiclass Classification CSMC Motivation and Setup
Outline
1 Cost-Sensitive Multiclass Classification
   CSMC Motivation and Setup
   CSMC by Bayesian Perspective
   CSMC by (Weighted) Binary Classification
   CSMC by Regression
2 Cost-Sensitive Multilabel Classification
   CSML Motivation and Setup
   CSML by Bayesian Perspective
   CSML by (Weighted) Binary Classification
   CSML by Regression
3 A Story of Bacteria Classification with Doctor-Annotated Costs
4 Summary
Cost-Sensitive Multiclass Classification CSMC Motivation and Setup
Which Digit Did You Write?
?
one (1) two (2) three (3) four (4)
• a multiclass classification problem: grouping ‘pictures’ into different ‘categories’

C’mon, we know about multiclass classification all too well! :-)
Cost-Sensitive Multiclass Classification CSMC Motivation and Setup
Performance Evaluation
?
• ZIP code recognition:
1: wrong; 2: right; 3: wrong; 4: wrong
• check value recognition:
1: one-dollar mistake; 2: no mistake; 3: one-dollar mistake; 4: two-dollar mistake
different applications: evaluate mis-predictions differently
Cost-Sensitive Multiclass Classification CSMC Motivation and Setup
ZIP Code Recognition
?
1: wrong; 2: right; 3: wrong; 4: wrong
• regular multiclass classification: only right or wrong
• wrong cost: 1; right cost: 0
• prediction error of h on some (x, y): classification cost = ⟦y ≠ h(x)⟧

regular multiclass classification: well-studied, many good algorithms
Cost-Sensitive Multiclass Classification CSMC Motivation and Setup
Check Value Recognition
?
1: one-dollar mistake; 2: no mistake; 3: one-dollar mistake; 4: two-dollar mistake
• cost-sensitive multiclass classification: different costs for different mis-predictions
• e.g. prediction error of h on some (x, y): absolute cost = |y − h(x)|

next: more about cost-sensitive multiclass classification (CSMC)
Cost-Sensitive Multiclass Classification CSMC Motivation and Setup
What is the Status of the Patient?
(image by mcmurryjulie from Pixabay)?
bird flu cold healthy
(images by Clker-Free-Vector-Images from Pixabay)
• another classification problem: grouping ‘patients’ into different ‘statuses’
are all mis-prediction costs equal?
Cost-Sensitive Multiclass Classification CSMC Motivation and Setup
Patient Status Prediction
error measure = society cost

actual \ predicted   bird flu    cold    healthy
bird flu                 0       1000     100000
cold                   100          0       3000
healthy                100         30          0
• bird flu mis-predicted as healthy: very high cost
• cold mis-predicted as healthy: high cost
• cold correctly predicted as cold: no cost
human doctors consider costs of decision;
can computer-aided diagnosis do the same?
Cost-Sensitive Multiclass Classification CSMC Motivation and Setup
Setup: Class-Dependent Cost-Sensitive Classification
Given
N examples, each (input xn, label yn) ∈ X × {1, 2, . . . , K}
and cost matrix C ∈ R^{K×K} with C(y, y) = 0 = min_{1≤k≤K} C(y, k)

patient diagnosis with society cost
C = [   0   1000   100000
      100      0     3000
      100     30        0 ]

check digit recognition with absolute cost (cost function) C(y, k) = |y − k|

Goal
a classifier g(x) that pays a small cost C(y, g(x)) on future unseen example (x, y)

includes regular classification cost Cc, e.g. [ 0 1; 1 0 ], as a special case
Cost-Sensitive Multiclass Classification CSMC Motivation and Setup
Which Age-Group?
? infant (1) child (2) teen (3) adult (4)
(images by Tawny van Breda, Pro File, Mieczysław Samol, lisa runnels, vontoba from Pixabay)
• small mistake—classify child as teen; big mistake—classify infant as adult
• cost matrix C(y, g(x)) for embedding ‘order’:
C = [ 0 1 4 5
      1 0 1 3
      3 1 0 2
      5 4 1 0 ]

CSMC can help solve many other problems like ordinal ranking
Cost-Sensitive Multiclass Classification CSMC Motivation and Setup
Cost Vector
cost vector c: a row of cost components
• society cost for a bird flu patient: c = (0, 1000, 100000)
• absolute cost for digit 2: c = (1, 0, 1, 2)
• age-ranking cost for a teenager: c = (3, 1, 0, 2)
• ‘regular’ classification cost for label 2: c = (1, 0, 1, 1)
• movie recommendation
  • someone who loves romance movies but hates terror: c = (romance = 0, fiction = 5, terror = 100)
  • someone who loves romance movies but is fine with terror: c = (romance = 0, fiction = 5, terror = 3)

cost vector: representation of personal preference in many applications
Cost-Sensitive Multiclass Classification CSMC Motivation and Setup
Setup: Example-Dependent Cost-Sensitive Classification
Given
N examples, each (input xn, label yn) ∈ X × {1, 2, . . . , K} and cost vector cn ∈ R^K
—will assume cn[yn] = 0 = min_{1≤k≤K} cn[k]

Goal
a classifier g(x) that pays a small cost c[g(x)] on future unseen example (x, y, c)
• will assume c[y] = 0 = cmin = min_{1≤k≤K} c[k]
• note: y not really needed in evaluation

example-dependent ⊃ class-dependent ⊃ regular
Cost-Sensitive Multiclass Classification CSMC by Bayesian Perspective
Outline
1 Cost-Sensitive Multiclass Classification
   CSMC Motivation and Setup
   CSMC by Bayesian Perspective
   CSMC by (Weighted) Binary Classification
   CSMC by Regression
2 Cost-Sensitive Multilabel Classification
   CSML Motivation and Setup
   CSML by Bayesian Perspective
   CSML by (Weighted) Binary Classification
   CSML by Regression
3 A Story of Bacteria Classification with Doctor-Annotated Costs
4 Summary
Cost-Sensitive Multiclass Classification CSMC by Bayesian Perspective
Key Idea: Conditional Probability Estimator
Goal (Class-Dependent Setup)
a classifier g(x) that pays a small cost C(y, g(x)) on future unseen example (x, y)

if P(y|x) known:
Bayes optimal g*(x) = argmin_{1≤k≤K} Σ_{y=1}^{K} P(y|x) C(y, k)

if q(x, y) ≈ P(y|x) well:
approximately good gq(x) = argmin_{1≤k≤K} Σ_{y=1}^{K} q(x, y) C(y, k)

how to get conditional probability estimator q? logistic regression, Naïve Bayes, ...
Cost-Sensitive Multiclass Classification CSMC by Bayesian Perspective
Approximate Bayes-Optimal Decision
if q(x, y) ≈ P(y|x) well (Domingos, 1999):
approximately good gq(x) = argmin_{1≤k≤K} Σ_{y=1}^{K} q(x, y) C(y, k)

Approximate Bayes-Optimal Decision (ABOD) Approach
1 use your favorite algorithm on {(xn, yn)} to get q(x, y) ≈ P(y|x)
2 for each new input x, predict its class using gq(x) above

ABOD: probability estimate + Bayes-optimal decision
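a minimal ABOD sketch in Python (assuming scikit-learn as the probability estimator; abod_fit_predict is an illustrative name, and classes are assumed 0-indexed so that predict_proba columns align with the rows of C):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def abod_fit_predict(X_train, y_train, X_test, C):
        """C[y, k]: cost of predicting class k when the true class is y."""
        q = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        prob = q.predict_proba(X_test)           # row n: q(xn, y) for each class y
        expected_cost = prob @ C                 # column k: sum_y q(x, y) C(y, k)
        return np.argmin(expected_cost, axis=1)  # approximate Bayes-optimal gq(x)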
Cost-Sensitive Multiclass Classification CSMC by Bayesian Perspective
ABOD on Artificial Data
1 use your favorite algorithm on {(xn, yn)} to get q(x, y) ≈ P(y|x)
2 for each new input x, predict its class using gq(x) above

LogReg with the ‘rotate’ cost matrix (rows: true y, columns: g(x)):
        g(x)=1  g(x)=2  g(x)=3  g(x)=4
y = 1      0       1       2       4
y = 2      4       0       1       2
y = 3      2       4       0       1
y = 4      1       2       4       0

[figure: decision boundaries on artificial data under regular vs. rotate cost]
Cost-Sensitive Multiclass Classification CSMC by Bayesian Perspective
ABOD for Binary Classification
Given N examples, each (input xn, label yn) ∈ X × {−1, +1}
and weights w+, w− representing two entries of the cost matrix:

          g(x) = +1   g(x) = −1
y = +1        0          w+
y = −1       w−           0

if q(x) ≈ P(+1|x) well:
approximately good gq(x) = sign(w+ q(x) − w− (1 − q(x))), i.e. (Elkan, 2001),
gq(x) = +1 ⇐⇒ w+ q(x) − w− (1 − q(x)) > 0 ⇐⇒ q(x) > w− / (w+ + w−)

ABOD for binary classification:
probability estimate + threshold changing
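a tiny companion sketch of the shifted threshold (numpy only; the helper name is illustrative):

    import numpy as np

    def abod_binary_predict(q_pos, w_pos, w_neg):
        """q_pos: estimated P(+1|x) per example; w_pos / w_neg: cost of
        mis-predicting a true +1 / a true -1 example."""
        threshold = w_neg / (w_pos + w_neg)      # shifted decision threshold
        return np.where(q_pos > threshold, 1, -1)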
Cost-Sensitive Multiclass Classification CSMC by Bayesian Perspective
ABOD for Binary Classification on Artificial Data
1 use your favorite algorithm on {(xn, yn)} to get q(x) ≈ P(+1|x)
2 for each new input x, predict its class using gq(x) = sign(q(x) − w−/(w+ + w−))

LogReg with cost matrix:
          g(x) = +1   g(x) = −1
y = +1        0          10
y = −1        1           0

[figure: decision boundaries on artificial data under regular vs. positive-emphasis cost]
Cost-Sensitive Multiclass Classification CSMC by Bayesian Perspective
Pros and Cons of ABOD
Pros
• optimal if good probability estimate q
• prediction easily adapts to different C without modifying training (probability estimate)

Cons
• ‘difficult’: good probability estimate often more difficult than good multiclass classification
• ‘restricted’: only applicable to class-dependent setup—need ‘full picture’ of cost matrix
• ‘slower prediction’ (for multiclass): more calculation at prediction stage

can we use any multiclass classification algorithm for ABOD?
Cost-Sensitive Multiclass Classification CSMC by Bayesian Perspective
MetaCost Approach
Approximate Bayes-Optimal Decision (ABOD) Approach
1 use your favorite algorithm on {(xn, yn)} to get q(x, y) ≈ P(y|x)
2 for each new input x, predict its class using gq(x)

MetaCost Approach (Domingos, 1999)
1 use your favorite multiclass classification algorithm on bootstrapped {(xn, yn)} and aggregate the classifiers to get q(x, y) ≈ P(y|x)
2 for each given input xn, relabel it to yn′ using gq(xn)
3 run your favorite multiclass classification algorithm on relabeled {(xn, yn′)} to get final classifier g
4 for each new input x, predict its class using g(x)

pros: any multiclass classification algorithm can be used
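a minimal MetaCost sketch (assuming scikit-learn; bagged decision trees stand in for ‘your favorite algorithm’, and labels are assumed 0-indexed to match the rows of C):

    import numpy as np
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    def metacost_fit(X, y, C, n_boot=10):
        bag = BaggingClassifier(DecisionTreeClassifier(),
                                n_estimators=n_boot).fit(X, y)  # bootstrap + aggregate
        q = bag.predict_proba(X)                # q(x, y) on the training inputs
        y_relabeled = np.argmin(q @ C, axis=1)  # relabel each xn by gq(xn)
        return DecisionTreeClassifier().fit(X, y_relabeled)  # final classifier g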
Cost-Sensitive Multiclass Classification CSMC by Bayesian Perspective
MetaCost on Semi-Real Data
[figure: cost of MetaCost vs. multiclass and two-class baselines on two-class problems, panels: C4.5R, Undersampling, Oversampling; diagonal y = x shown]
(Domingos, 1999)
• some ‘artificial’ cost with UCI data
• MetaCost + C4.5: cost-sensitive
• C4.5: regular
not surprisingly,
considering the cost properly does help
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
Outline
1 Cost-Sensitive Multiclass Classification
   CSMC Motivation and Setup
   CSMC by Bayesian Perspective
   CSMC by (Weighted) Binary Classification
   CSMC by Regression
2 Cost-Sensitive Multilabel Classification
   CSML Motivation and Setup
   CSML by Bayesian Perspective
   CSML by (Weighted) Binary Classification
   CSML by Regression
3 A Story of Bacteria Classification with Doctor-Annotated Costs
4 Summary
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
Key Idea: Cost Transformation
(heuristic) relabeling useful in MetaCost: a more principled way?

Yes, by Connecting Cost Vector to Regular Costs!
  (1, 0, 1, 2)   --shift equivalence-->   (3, 2, 3, 4)
  c of interest                           shifted cost

  (3, 2, 3, 4) = 1 · (0, 1, 1, 1) + 2 · (1, 0, 1, 1) + 1 · (1, 1, 0, 1) + 0 · (1, 1, 1, 0)
                 mixture weights uℓ on the rows of the regular cost matrix

i.e. x with c = (1, 0, 1, 2) equivalent to
a weighted mixture {(x, y, u)} = {(x, 1, 1), (x, 2, 2), (x, 3, 1)}

cost equivalence (Lin, 2014): for any classifier h,
  c[h(x)] + constant = Σ_{ℓ=1}^{K} uℓ ⟦ℓ ≠ h(x)⟧
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
Meaning of Cost Equivalence
c[h(x)] + constant = Σ_{ℓ=1}^{K} uℓ ⟦ℓ ≠ h(x)⟧

on one (x, y, c): wrong prediction charged by c[h(x)]
on all relabeled data {(x, ℓ, uℓ)}: wrong prediction charged by total weighted classification error of relabeled data

min_h expected LHS (original CSMC problem)
= min_h expected RHS (weighted classification when uℓ ≥ 0)
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
Calculation of u

Smallest Non-Negative uℓ’s (Lin, 2014)
when constant = (K − 1) max_{1≤k≤K} c[k] − Σ_{k=1}^{K} c[k],
uℓ = max_{1≤k≤K} c[k] − c[ℓ]

e.g. c of interest (1, 0, 1, 2) → mixture weights uℓ (1, 2, 1, 0)

• largest c[ℓ]: uℓ = 0 (least preferred relabel)
• smallest c[ℓ]: uℓ = largest (original label & most preferred relabel)

ℓ’s and uℓ’s embed the cost
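the weight computation in one line of numpy (a sketch; mixture_weights is an illustrative name):

    import numpy as np

    def mixture_weights(c):
        c = np.asarray(c, dtype=float)
        return c.max() - c        # uℓ = max_k c[k] − c[ℓ], one weight per relabel ℓ

    print(mixture_weights([1, 0, 1, 2]))   # [1. 2. 1. 0.], as in the example above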
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
Data Space Expansion Approach
Data Space Expansion (DSE) Approach (Abe, 2004)
1 for each (xn, yn, cn) and ℓ, let un,ℓ = max_{1≤k≤K} cn[k] − cn[ℓ]
2 apply your favorite multiclass classification algorithm on the weighted mixtures ∪_{n=1}^{N} {(xn, ℓ, un,ℓ)}_{ℓ=1}^{K} to get g(x)
3 for each new input x, predict its class using g(x)

• by cost equivalence,
  good g for new (weighted) regular classification problem
  = good g for original cost-sensitive classification problem
• weighted regular classification: special case of CSMC,
  but more easily solvable by, e.g., sampling + regular classification (Zadrozny, 2003)

pros: any multiclass classification algorithm can be used (see the sketch below)
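a minimal DSE sketch (assuming scikit-learn; a weighted decision tree stands in for ‘your favorite algorithm’, and classes are 0-indexed):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def dse_fit(X, costs):
        """X: (N, d) inputs; costs: (N, K) cost vectors cn."""
        N, K = costs.shape
        u = costs.max(axis=1, keepdims=True) - costs  # un,ℓ for every n and ℓ
        X_exp = np.repeat(X, K, axis=0)               # each xn appears K times
        y_exp = np.tile(np.arange(K), N)              # relabels ℓ = 0, ..., K−1
        return DecisionTreeClassifier().fit(X_exp, y_exp, sample_weight=u.ravel())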
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
DSE versus MetaCost on Semi-Real Data
[figure: avg. test cost of DSE vs. MetaCost on UCI datasets ann, kdd, let, spl, sat]

(Abe, 2004) some ‘artificial’ cost with UCI data
• use sampling + C4.5 for weighted regular classification

DSE competitive to MetaCost
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
Cons of DSE: Unavoidable Noise
Original Cost-Sensitive Classification Problem
[figure: individual examples of classes 1–4 without noise] + absolute cost =
New Regular Classification Problem
[figure: mixtures with relabeling noise]

• cost embedded as weight + noisy labels
• new problem usually harder than original one

need robust multiclass classification algorithm to deal with noise
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
Key Idea: Design Robust Multiclass Algorithm
One-Versus-One: A Popular Classification Meta-Method
• for all different class pairs (i, j),
  1 take all examples (xn, yn)
    • with yn = i or j (original one-versus-one)
    • with un,i ≠ un,j, using the larger-u label and weight |un,i − un,j| (robust one-versus-one)
  2 train a binary classifier ĝ(i,j) using those examples
• return g(x) that predicts using the votes from ĝ(i,j)

• un-shifting inside the meta-method to remove noise
• robust step makes it suitable for DSE

cost-sensitive one-versus-one: DSE + robust one-versus-one
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
Cost-Sensitive One-Versus-One (CSOVO)
Cost-Sensitive One-Versus-One (Lin, 2014)
• for all different class pairs (i, j),
  1 robust one-versus-one + calculate from cn: take all examples (xn, yn)
    with cn[i] ≠ cn[j], using the smaller-c label and weight un(i,j) = |cn[i] − cn[j]|
  2 train a binary classifier ĝ(i,j) using those examples
• return g(x) that predicts using the votes from ĝ(i,j) (see the sketch below)

• comes with good theoretical guarantee:
  test cost of g ≤ 2 Σ_{i<j} test cost of ĝ(i,j)
• sibling to Weighted All-Pairs (WAP) approach: even tighter guarantee (Beygelzimer, 2005) with more sophisticated construction of un(i,j)

physical meaning: each ĝ(i,j) answers the yes/no question ‘prefer i or j?’
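a compact CSOVO sketch (assuming scikit-learn; the weighted binary learner and function names are illustrative):

    import numpy as np
    from itertools import combinations
    from sklearn.linear_model import LogisticRegression

    def csovo_fit(X, costs):
        models = {}
        for i, j in combinations(range(costs.shape[1]), 2):
            w = np.abs(costs[:, i] - costs[:, j])   # un(i,j)
            keep = w > 0                            # only examples with cn[i] ≠ cn[j]
            y_ij = np.where(costs[keep, i] < costs[keep, j], i, j)  # smaller-c label
            if len(np.unique(y_ij)) == 2:           # need both labels to train
                models[(i, j)] = LogisticRegression().fit(X[keep], y_ij,
                                                          sample_weight=w[keep])
        return models

    def csovo_predict(models, X, K):
        votes = np.zeros((len(X), K))
        for (i, j), clf in models.items():
            pred = clf.predict(X)                   # each ĝ(i,j) votes for i or j
            votes[:, i] += (pred == i)
            votes[:, j] += (pred == j)
        return votes.argmax(axis=1)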
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
CSOVO on Semi-Real Data
[figure: avg. test random cost of CSOVO vs. OVO on UCI datasets veh, vow, seg, dna, sat, usp]

(Lin, 2014) some ‘artificial’ cost with UCI data
• CSOVO-SVM: cost-sensitive
• OVO-SVM: regular

not surprisingly again, considering the cost properly does help
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
CSOVO for Ordinal Ranking
[figure: avg. test absolute cost of CSOVO vs. OVO on benchmark ordinal ranking datasets pyr, mac, bos, aba, ban, com, cal, cen]

(Lin, 2014) absolute cost with benchmark ordinal ranking data
• CSOVO-SVM: cost-sensitive
• OVO-SVM: regular

CSOVO significantly better for ordinal ranking
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
Cons of CSOVO: Many Binary Classifiers
K classes --CSOVO--> K(K−1)/2 binary classifiers

time-consuming in both
• training, especially with many different cn[i] and cn[j]
• prediction
—parallelization helps a bit, but generally not feasible for large K

CSOVO: a simple meta-method for medium K only
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
Key Idea: OVO ≡ Round-Robin Tournament
[figure: a round-robin tournament among classes 1–4 vs. a single-elimination tournament with games 1 vs 2 and 3 vs 4 feeding a final]

• prediction ≡ deciding tournament winner for each x
• (CS)OVO: K(K−1)/2 games for prediction (and hence training)
• single-elimination tournament (for K = 2^ℓ):
  • K − 1 games for prediction via bottom-up: real-world
  • log2 K games for prediction via top-down: computer-world :-)

next: single-elimination tournament for CSMC
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
Filter Tree (FT) Approach
Filter Tree (Beygelzimer, 2009) Training: from bottom to top

[figure: a two-level tree on an example with c[1] = 0, c[2] = 9, c[3] = 5, c[4] = 8; bottom classifiers labeled ĝ(1,2): (L, 9) and ĝ(3,4): (L, 3), top classifier ĝ(...): (R, 4)]

• ĝ(1,2) and ĝ(3,4) trained like CSOVO: smaller-c label and weight un(i,j) = |cn[i] − cn[j]|
• ĝ(...) trained with (kL, kR) filtered by sub-trees
  —smaller-c sub-tree direction and weight un(...) = |cn[kL] − cn[kR]|

FT: top classifiers aware of bottom-classifier mistakes
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
Pros and Cons of FT
Pros
• efficient: O(K) training, O(log K) prediction
• strong theoretical guarantee:
  small-regret binary classifiers =⇒ small-regret CSMC classifier

[figure: the tree with top game 1 2 vs 3 4 above games 1 vs 2 and 3 vs 4]

Cons
• ‘asymmetric’ to labels: non-trivial structural decision
• ‘hard’ sub-tree-dependent top-classification tasks

next: other reductions to (weighted) binary classification
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
Other Approaches via Weighted Binary Classification
FT: with regret bound (Beygelzimer, 2009)
  the lowest achievable cost within {1, 2} or {3, 4}?

Divide&Conquer Tree (TREE): without regret bound (Beygelzimer, 2009)
  the lowest ideal cost within {1, 2} or {3, 4}?

[figure: the same single-elimination tree: 1 2 vs 3 4 on top of 1 vs 2 and 3 vs 4]

Sensitive Err. Correcting Output Code (SECOC): with regret bound (Langford, 2005)
  c[1] + c[3] + c[4] greater than some θ?

training time: SECOC (O(T · K)) > FT (O(K)) ≈ TREE (O(K))
Cost-Sensitive Multiclass Classification CSMC by (Weighted) Binary Classification
Comparison of Reductions to Weighted Binary Classification
[figure: avg. test cost of CSOVO, FT, TREE, SECOC on UCI datasets zo., gl., ve., vo., ye., se., dn., pa., sa., us.]

(Lin, 2014) couple all meta-methods with SVM
• round-robin tournament (CSOVO)
• single-elimination tournament (FT, TREE)
• error-correcting-code (SECOC)

CSOVO often among the best; FT somewhat competitive
Cost-Sensitive Multiclass Classification CSMC by Regression
Outline
1 Cost-Sensitive Multiclass Classification
   CSMC Motivation and Setup
   CSMC by Bayesian Perspective
   CSMC by (Weighted) Binary Classification
   CSMC by Regression
2 Cost-Sensitive Multilabel Classification
   CSML Motivation and Setup
   CSML by Bayesian Perspective
   CSML by (Weighted) Binary Classification
   CSML by Regression
3 A Story of Bacteria Classification with Doctor-Annotated Costs
4 Summary
Cost-Sensitive Multiclass Classification CSMC by Regression
Key Idea: Cost Estimator
Goal
a classifier g(x) that pays a small cost c[g(x)] on future unseen example (x, y, c)

if every c[k] known:
optimal g*(x) = argmin_{1≤k≤K} c[k]

if rk(x) ≈ c[k] well:
approximately good gr(x) = argmin_{1≤k≤K} rk(x)

how to get cost estimator rk? regression
Cost-Sensitive Multiclass Classification CSMC by Regression
Cost Estimator by Per-class Regression
Given
N examples, each (input xn, label yn, cost cn) ∈ X × {1, 2, . . . , K} × R^K

input  cn[1]     input  cn[2]     . . .   input  cn[K]
x1     0         x1     2         . . .   x1     1
x2     1         x2     3         . . .   x2     5
· · ·
xN     6         xN     1         . . .   xN     0
  ↳ r1             ↳ r2                     ↳ rK

want: rk(x) ≈ c[k] for all future (x, y, c) and k
Cost-Sensitive Multiclass Classification CSMC by Regression
The Reduction-to-Regression Framework
cost-sensitive examples (xn, yn, cn)
  ⇒ regression examples (Xn,k, Yn,k), k = 1, · · · , K
  ⇒ regression algorithm
  ⇒ regressors rk(x), k ∈ 1, · · · , K
  ⇒ cost-sensitive classifier gr(x)

1 encode: transform cost-sensitive examples (xn, yn, cn) to regression examples (Xn,k, Yn,k) = (xn, cn[k])
2 learn: use your favorite algorithm on the regression examples to get estimators rk(x)
3 decode: for each new input x, predict its class using gr(x) = argmin_{1≤k≤K} rk(x)

the reduction-to-regression framework:
systematic & easy to implement
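a minimal sketch of the framework (assuming scikit-learn’s ridge regression as the per-class regressor; function names are illustrative):

    import numpy as np
    from sklearn.linear_model import Ridge

    def rtr_fit(X, costs):                    # one regressor rk per class
        return [Ridge().fit(X, costs[:, k]) for k in range(costs.shape[1])]

    def rtr_predict(regressors, X):
        est = np.column_stack([r.predict(X) for r in regressors])  # rk(x)
        return est.argmin(axis=1)             # gr(x) = argmin_k rk(x)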
Cost-Sensitive Multiclass Classification CSMC by Regression
Theoretical Guarantees (1/2)
gr(x) = argmin_{1≤k≤K} rk(x)

Theorem (Absolute Loss Bound)
For any set of cost estimators {rk}_{k=1}^{K} and for any example (x, y, c) with c[y] = 0,
  c[gr(x)] ≤ Σ_{k=1}^{K} |rk(x) − c[k]|.

low-cost classifier ⇐= accurate estimators
Cost-Sensitive Multiclass Classification CSMC by Regression
Theoretical Guarantees (2/2)
gr(x) = argmin_{1≤k≤K} rk(x)

Theorem (Squared Loss Bound)
For any set of cost estimators {rk}_{k=1}^{K} and for any example (x, y, c) with c[y] = 0,
  c[gr(x)] ≤ sqrt( 2 Σ_{k=1}^{K} (rk(x) − c[k])² ).

applies to common least-squares regression
Cost-Sensitive Multiclass Classification CSMC by Regression
A Pictorial Proof
c[gr(x)] ≤ Σ_{k=1}^{K} |rk(x) − c[k]|

• assume c ordered and not degenerate: y = 1; 0 = c[1] < c[2] ≤ · · · ≤ c[K]
• assume mis-prediction gr(x) = 2: r2(x) = min_{1≤k≤K} rk(x) ≤ r1(x)

[figure: number line showing c[1], r2(x), r1(x), c[2], c[3], r3(x), . . . , c[K], rK(x), with ∆1 between c[1] and r1(x) and ∆2 between r2(x) and c[2]]

then c[2] − c[1] = c[2] (since c[1] = 0) ≤ ∆1 + ∆2 ≤ Σ_{k=1}^{K} |rk(x) − c[k]|
Cost-Sensitive Multiclass Classification CSMC by Regression
An Even Closer Look
let ∆1 ≡ r1(x) − c[1] and ∆2 ≡ c[2] − r2(x)
1 ∆1 ≥ 0 and ∆2 ≥ 0: c[2] ≤ ∆1 + ∆2
2 ∆1 ≤ 0 and ∆2 ≥ 0: c[2] ≤ ∆2
3 ∆1 ≥ 0 and ∆2 ≤ 0: c[2] ≤ ∆1

[figure: the two number-line cases, with r2(x) and r1(x) on either side of c[2]]

c[2] ≤ max(∆1, 0) + max(∆2, 0) ≤ |∆1| + |∆2|
Cost-Sensitive Multiclass Classification CSMC by Regression
Tighter Bound with One-sided Loss
Define one-sided loss ξk ≡ max(∆k, 0)
with ∆k ≡ rk(x) − c[k] if c[k] = cmin = 0
     ∆k ≡ c[k] − rk(x) if c[k] ≠ cmin

Intuition
• c[k] = cmin: wish to have rk(x) ≤ c[k]
• c[k] ≠ cmin: wish to have rk(x) ≥ c[k]
—both wishes same as ∆k ≤ 0 ⇐⇒ ξk = 0

One-sided Loss Bound:
  c[gr(x)] ≤ Σ_{k=1}^{K} ξk ≤ Σ_{k=1}^{K} |∆k|
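a small numpy sketch of the one-sided loss for one example, assuming the cost vector satisfies min_k c[k] = 0 as in the setup:

    import numpy as np

    def one_sided_losses(r, c):
        """r: estimated costs rk(x); c: true cost vector with minimum 0."""
        delta = np.where(c == c.min(), r - c, c - r)  # ∆k by the definition above
        return np.maximum(delta, 0.0)                 # ξk = max(∆k, 0)

    print(one_sided_losses(np.array([0.2, 1.5, 0.8]),
                           np.array([0.0, 2.0, 1.0])))   # [0.2 0.5 0.2]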
Cost-Sensitive Multiclass Classification CSMC by Regression
The Improved Reduction Framework
cost-sensitive examples (xn, yn, cn)
  ⇒ regression examples (Xn,k, Yn,k, Zn,k), k = 1, · · · , K
  ⇒ one-sided regression algorithm
  ⇒ regressors rk(x), k ∈ 1, · · · , K
  ⇒ cost-sensitive classifier gr(x)

(Tu, 2010)
1 encode: transform cost-sensitive examples (xn, yn, cn) to one-sided regression examples (Xn,k, Yn,k, Zn,k) = (xn, cn[k], 2⟦cn[k] = 0⟧ − 1)
2 learn: use a one-sided regression algorithm to get estimators rk(x)
3 decode: for each new input x, predict its class using gr(x) = argmin_{1≤k≤K} rk(x)

the reduction-to-OSR framework:
need a good OSR algorithm
Cost-Sensitive Multiclass Classification CSMC by Regression
Regularized One-Sided Hyper-Linear Regression
Given
(Xn,k, Yn,k, Zn,k) = (xn, cn[k], 2⟦cn[k] = 0⟧ − 1)

Training Goal
all training ξn,k = max(Zn,k · (rk(Xn,k) − Yn,k), 0) small —will drop k below

min_{w,b} λ/2 ⟨w, w⟩ + Σ_{n=1}^{N} ξn
to get rk(x) = ⟨w, φ(x)⟩ + b
Cost-Sensitive Multiclass Classification CSMC by Regression
One-Sided Support Vector Regression
Regularized One-Sided Hyper-Linear Regression
  min_{w,b} λ/2 ⟨w, w⟩ + Σ_{n=1}^{N} ξn
  ξn = max(Zn · (rk(xn) − Yn), 0)

Standard Support Vector Regression
  min_{w,b} 1/(2C) ⟨w, w⟩ + Σ_{n=1}^{N} (ξn + ξn*)
  ξn = max(+1 · (rk(xn) − Yn − ε), 0)
  ξn* = max(−1 · (rk(xn) − Yn + ε), 0)

OSR-SVM = SVR + (ε ← 0) + (keep ξn or ξn* by Zn)
Cost-Sensitive Multiclass Classification CSMC by Regression
OSR-SVM on Semi-Real Data
[figure: avg. test cost of OSR vs. OVA on UCI datasets ir., wi., gl., ve., vo., se., dn., sa., us., le.]

(Tu, 2010) some ‘artificial’ cost with UCI data
• OSR: cost-sensitive SVM
• OVA: regular one-versus-all SVM

OSR often significantly better than OVA
Cost-Sensitive Multiclass Classification CSMC by Regression
OSR versus WAP on Semi-Real Data
[figure: avg. test cost of OSR vs. WAP on UCI datasets ir., wi., gl., ve., vo., se., dn., sa., us., le.]

(Tu, 2010) some ‘artificial’ cost with UCI data
• OSR (per-class): O(K) training, O(K) prediction
• WAP ≈ CSOVO (pairwise): O(K²) training, O(K²) prediction

OSR: faster, with competitive performance
Cost-Sensitive Multiclass Classification CSMC by Regression
From OSR-SVM to AOSR-DNN
OSR-SVM
  min_{w,b} λ/2 ⟨w, w⟩ + Σ_{n=1}^{N} ξn with rk(x) = ⟨w, φ(x)⟩ + b
  ξn = max(Zn · (rk(xn) − Yn), 0)

Appro. OSR-DNN
  min NNet regularizer + Σ_{n=1}^{N} δn with rk(x) = NNet(x)
  δn = ln(1 + exp(Zn · (rk(xn) − Yn)))

AOSR-DNN (Chung, 2016a) = Deep Learning + OSR +
smoother upper bound δn ≥ ξn, because ln(1 + exp(•)) ≥ max(•, 0)
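a numpy sketch of the smoothed per-example loss Σk δk (the softplus upper bound is differentiable, so it can be dropped into any deep learning framework; aosr_loss is an illustrative name):

    import numpy as np

    def aosr_loss(r, c):
        """r: network outputs rk(x); c: cost vector with minimum 0."""
        z = np.where(c == 0, 1.0, -1.0)              # Zk = 2⟦c[k] = 0⟧ − 1
        return np.logaddexp(0.0, z * (r - c)).sum()  # Σk ln(1 + exp(Zk (rk − c[k])))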
Cost-Sensitive Multiclass Classification CSMC by Regression
From AOSR-DNN to CSDNN
Cons of AOSR-DNN
c affects both classification and feature extraction in a DNN,
but effective cost-sensitive feature extraction is hard

idea 1: pre-training with c
• layer-wise pre-training with cost-sensitive autoencoders:
  loss = reconstruction + AOSR
• CSDNN (Chung, 2016a) = AOSR-DNN + cost-sensitive pre-training

idea 2: auxiliary cost-sensitive nodes
• auxiliary nodes to predict costs per layer:
  loss = AOSR for classification + AOSR for auxiliary
• applies to any deep learning model (Chung, 2020)

CSDNN: world’s first successful CSMC deep learning model
Cost-Sensitive Multiclass Classification CSMC by Regression
AOSR-DNN versus CSDNN
[figure: avg. test cost of AOSR-DNN vs. CSDNN on datasets m, b, s, c, mi, bi, si, ci]

(Chung, 2016a)
• AOSR-DNN: cost-sensitive training
• CSDNN: AOSR-DNN + cost-sensitive feature extraction

CSDNN wins, justifying cost-sensitive feature extraction
Cost-Sensitive Multiclass Classification CSMC by Regression
ABOD-DNN versus CSDNN
[figure: avg. test cost of ABOD-DNN vs. CSDNN on datasets m, b, s, c, mi, bi, si, ci]

(Chung, 2016a)
• ABOD-DNN: probability estimate + cost-sensitive prediction
• CSDNN: cost-sensitive training + cost-sensitive feature extraction

CSDNN still wins, hinting at the difficulty of probability estimation without cost-sensitive feature extraction
Cost-Sensitive Multilabel Classification CSML Motivation and Setup
Outline
1 Cost-Sensitive Multiclass Classification
   CSMC Motivation and Setup
   CSMC by Bayesian Perspective
   CSMC by (Weighted) Binary Classification
   CSMC by Regression
2 Cost-Sensitive Multilabel Classification
   CSML Motivation and Setup
   CSML by Bayesian Perspective
   CSML by (Weighted) Binary Classification
   CSML by Regression
3 A Story of Bacteria Classification with Doctor-Annotated Costs
4 Summary
Cost-Sensitive Multilabel Classification CSML Motivation and Setup
Which Fruit?
?
(image by Robert-Owen-Wahl from Pixabay)
apple orange strawberry kiwi
(images by Pexels, PublicDomainPictures, 192635, Rob van der Meijden from Pixabay)
multiclass classification:
classify input (picture) to one category (label), remember? :-)
Cost-Sensitive Multilabel Classification CSML Motivation and Setup
Which Fruits?
?:{apple, orange, kiwi}
(image by Michal Jarmoluk from Pixabay)
apple orange strawberry kiwi
(images by Pexels, PublicDomainPictures, 192635, Rob van der Meijden from Pixabay)
multilabel classification:
classify input to multiple (or no) categories
Cost-Sensitive Multilabel Classification CSML Motivation and Setup
Label Powerset: Multilabel Classification via Multiclass
(Tsoumakas, 2007)
multiclass w/ L = 4 classes: 4 possible outcomes {a, o, s, k}
multilabel w/ L = 4 classes: 2^4 = 16 possible outcomes
2^{a, o, s, k} ⇔ {φ, a, o, s, k, ao, as, ak, os, ok, sk, aos, aok, ask, osk, aosk}

• Label Powerset (LP): reduction to multiclass classification (see the sketch below)
• difficulties for large L:
  • computation: 2^L extended classes
  • sparsity: no or few examples for some combinations

LP: feasible only for small L
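a minimal Label Powerset sketch (assuming scikit-learn; labelsets are encoded as bitmasks over 0-indexed labels, and the function names are illustrative):

    from sklearn.tree import DecisionTreeClassifier

    def lp_fit(X, labelsets):                 # labelsets: list of sets of ints
        codes = [sum(1 << l for l in Y) for Y in labelsets]   # labelset → bitmask
        return DecisionTreeClassifier().fit(X, codes)         # one extended class per code

    def lp_predict(clf, X, L):
        return [{l for l in range(L) if code & (1 << l)}      # bitmask → labelset
                for code in clf.predict(X)]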
Cost-Sensitive Multilabel Classification CSML Motivation and Setup
What Tags?
?: {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, . . . etc.}

another multilabel classification problem:
tagging input with multiple categories
Cost-Sensitive Multilabel Classification CSML Motivation and Setup
Binary Relevance: Multilabel Classification via Yes/No
binary classification: {yes, no}
multilabel w/ L classes: L yes/no questions

machine learning (Y), data structure (N), data mining (Y), OOP (N), AI (Y), compiler (N), architecture (N), chemistry (N), textbook (Y), children book (N), etc.

• Binary Relevance (BR): reduction to multiple isolated binary classifications (see the sketch below)
• disadvantages:
  • isolation—hidden relations not exploited
    (e.g. ML and DM highly correlated, ML subset of AI, textbook & children book disjoint)
  • unbalanced—few yes, many no

BR: simple (& strong) benchmark with known disadvantages
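a minimal Binary Relevance sketch (assuming scikit-learn; Y is an (N, L) 0/1 indicator matrix, and the function names are illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def br_fit(X, Y):                         # one isolated yes/no model per label
        return [LogisticRegression(max_iter=1000).fit(X, Y[:, l])
                for l in range(Y.shape[1])]

    def br_predict(models, X):
        return np.column_stack([m.predict(X) for m in models])  # (N, L) 0/1 matrix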
Cost-Sensitive Multilabel Classification CSML Motivation and Setup
Multilabel Classification Setup
Given
N examples (input xn, labelset Yn) ∈ X × 2^{1,2,··· ,L}
• fruits: X = encoding(pictures), Yn ⊆ {1, 2, · · · , 4}
• tags: X = encoding(merchandise), Yn ⊆ {1, 2, · · · , L}

Goal
a multilabel classifier g(x) that closely predicts the labelset Y associated with some unseen input x (by exploiting hidden relations/combinations between labels)

multilabel classification:
hot and important, with many real-world applications
Cost-Sensitive Multilabel Classification CSML Motivation and Setup
From Labelset to Coding View
labelset        apple   orange   strawberry   binary code
Y1 = {o}        0 (N)   1 (Y)    0 (N)        y1 = [0, 1, 0]
Y2 = {a, o}     1 (Y)   1 (Y)    0 (N)        y2 = [1, 1, 0]
Y3 = {o, s}     0 (N)   1 (Y)    1 (Y)        y3 = [0, 1, 1]
Y4 = {}         0 (N)   0 (N)    0 (N)        y4 = [0, 0, 0]

(images by PublicDomainPictures, Narin Seandag, GiltonF, nihatyetkin from Pixabay)

subset Y of 2^{1,2,··· ,L} ⇐⇒ length-L binary code y