Cost-Sensitive Classification:
Algorithm and Application
Hsuan-Tien Lin htlin@csie.ntu.edu.tw
Department of Computer Science
& Information Engineering
National Taiwan University
AI in Data Science Forum, December 20, 2017
About Me
• Chief Data Scientist, Appier
• Professor, Dept. of CSIE, NTU
• Co-author of textbook “Learning from Data: A Short Course”
• Instructor of two NTU-Coursera Mandarin-teaching ML Massive Open Online Courses
goal: make machine learning more realistic
• multi-class cost-sensitive classification: in ICML ’10, BIBM ’11, KDD ’12, ACML ’14, IJCAI ’16, etc.
• multi-label classification: in ACML ’11, NIPS ’12, ICML ’14, AAAI ’18, etc.
• online/active learning: in ICML ’12, ACML ’12, ICML ’14, AAAI ’15, etc.
• large-scale data mining (w/ Profs. S.-D. Lin & C.-J. Lin & students):
KDDCup world champions of ’10, ’11 (×2), ’12, ’13 (×2)
Hsuan-Tien Lin (NTU CSIE) Cost-Sensitive Classification: Algorithm and Application 1/33
Cost-Sensitive Multiclass Classification
Which Digit Did You Write?
?
one (1) two (2) three (3) four (4)
• a multiclass classification problem
—grouping “pictures” into different “categories”
C’mon, we know about multiclass classification all too well! :-)
Cost-Sensitive Multiclass Classification
Performance Evaluation
(is g(x) ≈ f(x)?)
• ZIP code recognition:
1: wrong; 2: right; 3: wrong; 4: wrong
• check value recognition:
1: one-dollar mistake; 2: no mistake;
3: one-dollar mistake; 4: two-dollar mistake
different applications:
evaluate mis-predictions differently
Cost-Sensitive Multiclass Classification
ZIP Code Recognition
?
1: wrong; 2: right; 3: wrong; 4: wrong
• regular multiclass classification: only right or wrong
• wrong cost: 1; right cost: 0
• prediction error of h on some (x, y):
classification cost = ⟦y ≠ h(x)⟧
regular multiclass classification:
well-studied, many good algorithms
Cost-Sensitive Multiclass Classification
Check Value Recognition
?
1: one-dollar mistake; 2: no mistake;
3: one-dollar mistake; 4: two-dollar mistake
• cost-sensitive multiclass classification:
different costs for different mis-predictions
• e.g. prediction error of h on some (x, y ):
absolute cost = |y − h(x)|
cost-sensitive multiclass classification:
relatively newer, need more research
Cost-Sensitive Multiclass Classification
What is the Status of the Patient?
?
H7N9-infected cold-infected healthy
• another classification problem
—grouping “patients” into different “statuses”
are all mis-prediction costs equal?
Cost-Sensitive Multiclass Classification
Patient Status Prediction
error measure = society cost

actual \ predicted    H7N9    cold    healthy
H7N9                     0    1000     100000
cold                   100       0       3000
healthy                100      30          0
• H7N9 mis-predicted as healthy: very high cost
• cold mis-predicted as healthy: high cost
• cold correctly predicted as cold: no cost
human doctors consider costs of decision;
can computer-aided diagnosis do the same?
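The cost matrix above can be used directly in evaluation; a minimal Python sketch (class names as dictionary keys and the helper name are my own choices) of how society cost accumulates over predictions:

```python
# Hypothetical sketch: evaluating predictions under the society-cost matrix
# above; COST[actual][predicted] is copied from the slide's table.
COST = {
    "H7N9":    {"H7N9": 0,   "cold": 1000, "healthy": 100000},
    "cold":    {"H7N9": 100, "cold": 0,    "healthy": 3000},
    "healthy": {"H7N9": 100, "cold": 30,   "healthy": 0},
}

def total_cost(actuals, predictions):
    """Sum the society cost over (actual, predicted) pairs."""
    return sum(COST[a][p] for a, p in zip(actuals, predictions))

# one H7N9 patient mis-predicted as healthy dominates everything else
print(total_cost(["H7N9", "cold"], ["healthy", "cold"]))  # 100000
```

A computer-aided diagnosis system would then be tuned to keep this sum small, not merely the number of wrong labels.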
Cost-Sensitive Multiclass Classification
What is the Type of the Movie?
? (romance / fiction / terror)

customer 1, who hates romance but likes terror; error measure = non-satisfaction

actual \ predicted    romance    fiction    terror
romance                     0          5       100

customer 2, who likes terror and romance

actual \ predicted    romance    fiction    terror
romance                     0          5         3
different customers:
evaluate mis-predictions differently
Cost-Sensitive Multiclass Classification
Cost-Sensitive Multiclass Classification Tasks
movie classification with non-satisfaction

actual \ predicted       romance    fiction    terror
customer 1, romance            0          5       100
customer 2, romance            0          5         3

patient diagnosis with society cost

actual \ predicted    H7N9    cold    healthy
H7N9                     0    1000     100000
cold                   100       0       3000
healthy                100      30          0

check digit recognition with absolute cost C(y, h(x)) = |y − h(x)|
Cost-Sensitive Multiclass Classification
Cost Vector
cost vector c: a row of cost components
• customer 1 on a romance movie: c = (0, 5, 100)
• an H7N9 patient: c = (0, 1000, 100000)
• absolute cost for digit 2: c = (1, 0, 1, 2)
• “regular” classification cost for label 2: c = (1, 0, 1, 1)
regular classification:
special case of cost-sensitive classification
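The cost vectors above can be generated mechanically; a small Python sketch (function names are mine) showing that the regular 0/1 cost is just one particular cost vector:

```python
# Sketch: regular classification as a special case of cost vectors.
def regular_cost_vector(y, K):
    """0/1 cost vector for true label y (1-indexed) among K classes."""
    return [0 if k == y else 1 for k in range(1, K + 1)]

def absolute_cost_vector(y, K):
    """Absolute cost |y - k|, as in check value recognition."""
    return [abs(y - k) for k in range(1, K + 1)]

print(regular_cost_vector(2, 4))   # [1, 0, 1, 1]
print(absolute_cost_vector(2, 4))  # [1, 0, 1, 2]
```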
Cost-Sensitive Multiclass Classification
Setup: Vector-Based Cost-Sensitive Multiclass Classification

Given
N examples, each (input xn, label yn) ∈ X × {1, 2, . . . , K}, and cost vectors cn, each ∈ R^K
—will assume cn[yn] = 0 = min_{1≤k≤K} cn[k]

Goal
a classifier g(x) that pays a small cost c[g(x)] on a future unseen example (x, y, c)
• will assume c[y] = 0 = cmin = min_{1≤k≤K} c[k]
• note: y not really needed in evaluation

cost-sensitive classification:
can express any finite-loss supervised learning task
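A tiny sketch of evaluation in this setup (names are my own); note the label y is indeed never consulted:

```python
# Sketch: a cost-sensitive classifier is judged only through c[g(x)].
def average_test_cost(g, examples):
    """examples: list of (x, c) pairs; c is a cost list indexed by class - 1."""
    return sum(c[g(x) - 1] for x, c in examples) / len(examples)

g = lambda x: 1  # toy classifier that always predicts class 1
examples = [("a", [0, 5, 100]), ("b", [1000, 0, 3000])]
print(average_test_cost(g, examples))  # (0 + 1000) / 2 = 500.0
```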
Cost-Sensitive Multiclass Classification
Our Contribution
(Tu and Lin, ICML 2010)

                  binary                           multiclass
regular           well-studied                     well-studied
cost-sensitive    known (Zadrozny et al., 2003)    ongoing (our works, among others)

a theoretic and algorithmic study of cost-sensitive classification, which ...
• introduces a methodology to reduce cost-sensitive classification to regression
• provides strong theoretical support for the methodology
• leads to a promising algorithm with superior experimental results
will describe the methodology and an algorithm
Cost-Sensitive Classification by Regression
Key Idea: Cost Estimator
Goal
a classifier g(x) that pays a small cost c[g(x)] on future unseen example (x, y , c)
if every c[k] known: optimal g*(x) = argmin_{1≤k≤K} c[k]
if rk(x) ≈ c[k] well: approximately good gr(x) = argmin_{1≤k≤K} rk(x)
how to get cost estimator rk? regression
Cost-Sensitive Classification by Regression
Cost Estimator by Per-class Regression
Given
N examples, each (input xn, label yn, cost cn) ∈ X × {1, 2, . . . , K} × R^K
input    cn[1]        input    cn[2]        . . .    input    cn[K]
x1       0            x1       2                     x1       1
x2       1            x2       3                     x2       5
· · ·
xN       6            xN       1                     xN       0
(fit r1)              (fit r2)                       (fit rK)
want: rk(x)≈ c[k] for all future (x, y, c) and k
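The per-class split above is a one-line transformation; a Python sketch (names are mine) turning cost-sensitive examples into the K regression datasets:

```python
# Sketch: each (xn, yn, cn) yields one regression target cn[k] per class k.
def per_class_datasets(examples, K):
    """examples: list of (x, y, c); returns {k: [(x, c[k-1]), ...]}."""
    return {k: [(x, c[k - 1]) for x, y, c in examples]
            for k in range(1, K + 1)}

examples = [("x1", 1, [0, 2, 1]), ("x2", 1, [1, 3, 5]), ("xN", 3, [6, 1, 0])]
datasets = per_class_datasets(examples, K=3)
print(datasets[2])  # [('x1', 2), ('x2', 3), ('xN', 1)]
```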
Cost-Sensitive Classification by Regression
The Reduction Framework
cost-sensitive examples (xn, yn, cn)
⇒ regression examples (Xn,k, Yn,k), k = 1, · · · , K
⇒ regression algorithm
⇒ regressors rk(x), k ∈ {1, · · · , K}
⇒ cost-sensitive classifier gr(x)
1 transform cost-sensitive examples (xn, yn, cn) to regression examples (Xn,k, Yn,k) = (xn, cn[k])
2 use your favorite algorithm on the regression examples and get estimators rk(x)
3 for each new input x, predict its class using gr(x) = argmin_{1≤k≤K} rk(x)
the reduction-to-regression framework:
systematic & easy to implement
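The three steps can be sketched end to end; here a 1-nearest-neighbor regressor stands in for "your favorite regression algorithm" (my illustrative choice, not the paper's one-sided regression, which comes later):

```python
# Sketch of the reduction framework with a 1-NN regressor per class.
def fit_reduction(examples, K):
    """examples: list of (x, c) with scalar feature x and cost list c."""
    def make_rk(k):
        def rk(x):  # 1-NN regression on the k-th cost component
            nearest = min(examples, key=lambda e: abs(e[0] - x))
            return nearest[1][k - 1]
        return rk
    regressors = {k: make_rk(k) for k in range(1, K + 1)}

    def g(x):  # step 3: pick the class with the smallest estimated cost
        return min(range(1, K + 1), key=lambda k: regressors[k](x))
    return g

train = [(0.0, [0, 2, 9]), (1.0, [5, 0, 1]), (2.0, [9, 3, 0])]
g = fit_reduction(train, K=3)
print(g(0.1), g(1.9))  # 1 3
```

Any regression learner can be plugged into the same skeleton, which is what makes the reduction systematic.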
Cost-Sensitive Classification by Regression
Theoretical Guarantees (1/2)
gr(x) = argmin_{1≤k≤K} rk(x)

Theorem (Absolute Loss Bound)
For any set of cost estimators {rk}, k = 1, . . . , K, and for any example (x, y, c) with c[y] = 0,

c[gr(x)] ≤ Σ_{k=1}^{K} |rk(x) − c[k]|.
low-cost classifier⇐= accurate estimator
Cost-Sensitive Classification by Regression
Theoretical Guarantees (2/2)
gr(x) = argmin_{1≤k≤K} rk(x)

Theorem (Squared Loss Bound)
For any set of cost estimators {rk}, k = 1, . . . , K, and for any example (x, y, c) with c[y] = 0,

c[gr(x)] ≤ √(2 Σ_{k=1}^{K} (rk(x) − c[k])²).

applies to common least-squares regression
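A quick numeric check of both bounds on toy numbers of my own:

```python
# Toy check: the realized cost never exceeds either bound.
import math

def gr(r):  # argmin over classes, 1-indexed
    return min(range(1, len(r) + 1), key=lambda k: r[k - 1])

c = [0, 2, 5]         # true cost vector, c[y] = 0 with y = 1
r = [1.5, 1.0, 4.0]   # imperfect estimates; gr mis-picks class 2
k = gr(r)
cost = c[k - 1]
abs_bound = sum(abs(ri - ci) for ri, ci in zip(r, c))
sq_bound = math.sqrt(2 * sum((ri - ci) ** 2 for ri, ci in zip(r, c)))
print(cost, abs_bound)  # 2 3.5
assert cost <= abs_bound and cost <= sq_bound
```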
Cost-Sensitive Classification by Regression
A Pictorial Proof
c[gr(x)] ≤ Σ_{k=1}^{K} |rk(x) − c[k]|

• assume c ordered and not degenerate:
y = 1; 0 = c[1] < c[2] ≤ · · · ≤ c[K]
• assume mis-prediction gr(x) = 2:
r2(x) = min_{1≤k≤K} rk(x) ≤ r1(x)

[number line: costs c[1], c[2], c[3], . . . , c[K] with estimates r2(x) ≤ r1(x), r3(x), . . . , rK(x) marked; ∆1 is the gap above c[1], ∆2 the gap below c[2]]

c[gr(x)] = c[2] − c[1] (since c[1] = 0) ≤ ∆1 + ∆2 ≤ Σ_{k=1}^{K} |rk(x) − c[k]|
Cost-Sensitive Classification by Regression
An Even Closer Look
let ∆1 ≡ r1(x) − c[1] and ∆2 ≡ c[2] − r2(x)
1. ∆1 ≥ 0 and ∆2 ≥ 0: c[2] ≤ ∆1 + ∆2
2. ∆1 ≤ 0 and ∆2 ≥ 0: c[2] ≤ ∆2
3. ∆1 ≥ 0 and ∆2 ≤ 0: c[2] ≤ ∆1

c[2] ≤ max(∆1, 0) + max(∆2, 0) ≤ |∆1| + |∆2|

[number lines for the three cases, marking c[1], c[2] and the positions of r2(x) ≤ r1(x) with the gaps ∆1 and ∆2]
Cost-Sensitive Classification by Regression
Tighter Bound with One-sided Loss
Define one-sided loss ξk ≡ max(∆k, 0) with
∆k ≡ rk(x) − c[k] if c[k] = cmin
∆k ≡ c[k] − rk(x) if c[k] ≠ cmin

Intuition
• c[k] = cmin: wish to have rk(x) ≤ c[k]
• c[k] ≠ cmin: wish to have rk(x) ≥ c[k]
—both wishes same as ∆k ≤ 0 and hence ξk = 0

One-sided Loss Bound:
c[gr(x)] ≤ Σ_{k=1}^{K} ξk ≤ Σ_{k=1}^{K} |∆k|
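A small sketch (function name is mine) computing the one-sided losses; estimators that satisfy both "wishes" incur ξk = 0 on every slot:

```python
# Sketch of the one-sided loss: the sign of delta depends on whether
# the slot carries the minimum cost.
def one_sided_losses(r, c):
    """r: estimates rk(x); c: true cost vector; returns [xi_1, ..., xi_K]."""
    cmin = min(c)
    xis = []
    for rk, ck in zip(r, c):
        delta = (rk - ck) if ck == cmin else (ck - rk)
        xis.append(max(delta, 0.0))
    return xis

c = [0, 2, 5]
# both wishes satisfied: under-estimate the cmin slot, over-estimate the rest
print(one_sided_losses([-0.5, 3.0, 6.0], c))  # [0.0, 0.0, 0.0]
# wishes violated on every slot
print(one_sided_losses([1.5, 1.0, 4.0], c))   # [1.5, 1.0, 1.0]
```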
Cost-Sensitive Classification by Regression
The Improved Reduction Framework
cost-sensitive examples (xn, yn, cn)
⇒ regression examples (Xn,k, Yn,k, Zn,k), k = 1, · · · , K
⇒ one-sided regression algorithm
⇒ regressors rk(x), k ∈ {1, · · · , K}
⇒ cost-sensitive classifier gr(x)
1 transform cost-sensitive examples (xn, yn, cn) to regression examples (Xn,k, Yn,k, Zn,k)
2 use a one-sided regression algorithm to get estimators rk(x)
3 for each new input x, predict its class using gr(x) = argmin_{1≤k≤K} rk(x)
the reduction-to-OSR framework:
need a good OSR algorithm
Cost-Sensitive Classification by Regression
Regularized One-sided Hyper-linear Regression
Given
(xn,k, Yn,k, Zn,k) = (xn, cn[k], 2⟦cn[k] = cn[yn]⟧ − 1)

Training Goal
all training ξn,k = max(∆n,k, 0) small, with ∆n,k ≡ Zn,k · (rk(xn,k) − Yn,k)
—will drop k below

min_{w,b} λ/2 ⟨w, w⟩ + Σ_{n=1}^{N} ξn
to get rk(x) = ⟨w, φ(x)⟩ + b
Cost-Sensitive Classification by Regression
One-sided Support Vector Regression
Regularized One-sided Hyper-linear Regression
min_{w,b} λ/2 ⟨w, w⟩ + Σ_{n=1}^{N} ξn
ξn = max(Zn · (rk(xn) − Yn), 0)

Standard Support Vector Regression
min_{w,b} 1/(2C) ⟨w, w⟩ + Σ_{n=1}^{N} (ξn + ξn*)
ξn = max(+1 · (rk(xn) − Yn − ε), 0)
ξn* = max(−1 · (rk(xn) − Yn + ε), 0)

OSR-SVM = SVR + (ε → 0) + (keep ξn or ξn* by Zn)
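The objective above can be sketched with plain subgradient descent on a scalar-feature linear model; this is my illustrative stand-in, not the kernelized SVR-based OSR-SVM solver:

```python
# Sketch: minimize lambda/2 * w^2 + sum_n max(Z_n * (r(x_n) - Y_n), 0)
# for a linear r(x) = w*x + b, by subgradient descent.
def osr_fit(data, lam=0.01, lr=0.05, epochs=2000):
    """data: list of (x, Y, Z); Z = +1 if the cost is cmin, else -1."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        gw, gb = lam * w, 0.0              # gradient of the regularizer
        for x, Y, Z in data:
            if Z * ((w * x + b) - Y) > 0:  # slack xi_n is active
                gw += Z * x
                gb += Z
        w -= lr * gw
        b -= lr * gb
    return w, b

# one-sided targets: r should stay below Y where Z = +1, above Y where Z = -1
data = [(0.0, 0.0, +1), (1.0, 2.0, -1), (2.0, 5.0, -1)]
w, b = osr_fit(data)
one_sided = sum(max(Z * ((w * x + b) - Y), 0.0) for x, Y, Z in data)
print(round(one_sided, 2))  # near 0: the one-sided wishes are (almost) met
```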
Cost-Sensitive Classification by Regression
OSR versus Other Reductions
OSR: K regressors
How unlikely (costly) is it that the example belongs to class k?
Filter Tree (FT): K − 1 binary classifiers
Is the lowest cost within labels {1, 4} or {2, 3}?
Is the lowest cost within label {1} or {4}?
Is the lowest cost within label {2} or {3}?
Weighted All Pairs (WAP): K(K−1)/2 binary classifiers
is c[1] or c[4] lower?
Cost-Sensitive Classification by Regression
OSR-SVM on Semi-Real Data
[bar chart: average test cost of OSR vs OVA on ten semi-real datasets (ir., wi., gl., ve., vo., se., dn., sa., us., le.)] (Tu and Lin, ICML 2010)
• OSR: a cost-sensitive extension of OVA
• OVA: regular SVM
OSR often significantly better than OVA
Cost-Sensitive Classification by Regression
OSR versus FT on Semi-Real Data
[bar chart: average test cost of OSR vs FT on ten semi-real datasets (ir., wi., gl., ve., vo., se., dn., sa., us., le.)] (Tu and Lin, ICML 2010)
• OSR (per-class):
O(K) training, O(K) prediction
• FT (tournament):
O(K) training, O(log₂ K) prediction
FT faster, but OSR better performing
Cost-Sensitive Classification by Regression
OSR versus WAP on Semi-Real Data
[bar chart: average test cost of OSR vs WAP on ten semi-real datasets (ir., wi., gl., ve., vo., se., dn., sa., us., le.)] (Tu and Lin, ICML 2010)
• OSR (per-class):
O(K) training, O(K) prediction
• WAP (pairwise):
O(K²) training, O(K²) prediction
OSR faster and comparable performance
Cost-Sensitive Classification by Regression
Six Years after OSR-SVM
(Chung, Lin and Yang, IJCAI 2016)

OSR-SVM
min_{w,b} λ/2 ⟨w, w⟩ + Σ_{n=1}^{N} ξn
with rk(x) = ⟨w, φ(x)⟩ + b
ξn = max(Zn · (rk(xn) − Yn), 0)

Cost-Sensitive Deep NNet (CSDNN)
min NNet regularizer + Σ_{n=1}^{N} δn
with rk(x) = NNet(x)
δn = ln(1 + exp(Zn · (rk(xn) − Yn)))

• CSDNN: world’s first cost-sensitive deep model via a smoother upper bound—δn ≥ ξn because ln(1 + exp(v)) ≥ max(v, 0)
• δn used in both pretraining & training for better NNet feature extraction
concept of reduction-to-OSR still useful after 6 years
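A quick check (toy values are mine) of the smoothing step: softplus upper-bounds the one-sided hinge everywhere, so δn ≥ ξn:

```python
# Softplus vs one-sided hinge: ln(1 + exp(v)) >= max(v, 0) for all v.
import math

def xi(v):      # OSR one-sided hinge
    return max(v, 0.0)

def delta(v):   # CSDNN smooth surrogate (softplus)
    return math.log(1.0 + math.exp(v))

for v in [-3.0, -0.1, 0.0, 0.1, 3.0]:
    assert delta(v) >= xi(v)
print(round(delta(0.0), 4))  # 0.6931 (= ln 2), while xi(0.0) = 0.0
```

Minimizing the smooth δn therefore also pushes the non-smooth ξn down, while keeping the loss differentiable for backpropagation.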
Cost-and-Error-Sensitive Classification with Bioinformatics Application
A Real Medical Application: Classifying Bacteria
The Problem
• by human doctors: different treatments ⇐⇒ serious costs
• cost matrix averaged from two doctors:
actual \ predicted:
        Ab  Ecoli  HI  KP  LM  Nm  Psa  Spn  Sa  GBS
Ab       0    1    10   7   9   9   5    8    9   1
Ecoli    3    0    10   8  10  10   5   10   10   2
HI      10   10     0   3   2   2  10    1    2  10
KP       7    7     3   0   4   4   6    3    3   8
LM       8    8     2   4   0   5   8    2    1   8
Nm       3   10     9   8   6   0   8    3    6   7
Psa      7    8    10   9   9   7   0    8    9   5
Spn      6   10     7   7   4   4   9    0    4   7
Sa       7   10     6   5   1   3   9    2    0   7
GBS      2    5    10   9   8   6   5    6    8   0
is cost-sensitive classification realistic?
Cost-and-Error-Sensitive Classification with Bioinformatics Application
OSR versus OVO/CSOVO(WAP)/FT on Bacteria Data
(Jan et al., BIBM 2011)
[bar chart (RBF kernel) from the BIBM 2011 slides: test cost of OVO-SVM vs csOSR-SVM, csOVO-SVM, and csFT-SVM on the bacteria data; the cost-sensitive algorithms perform better than the regular algorithm]
OSR best: cost-sensitive classification is helpful
Cost-and-Error-Sensitive Classification with Bioinformatics Application
Soft Cost-sensitive Classification
The Problem
[scatter plot: Error (%) on one axis vs Cost on the other, comparing classifiers]
• cost-sensitive classifier: low cost but high error
• traditional classifier: low error but high cost
• how can we get the blue classifiers?: low error and low cost
cost-and-error-sensitive: more suitable for medical needs
Cost-and-Error-Sensitive Classification with Bioinformatics Application
Improved OSR for Cost and Error on Semi-Real Data
key idea (Jan et al., KDD 2012): consider a ‘modified’ cost that mixes the original cost and the ‘regular’ cost
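A sketch of the mixing idea (the blend weight alpha and all names are my own, not the paper's exact formulation):

```python
# Sketch: blend the original cost matrix with the regular 0/1 cost so the
# learner also cares about plain classification errors.
def soft_cost(C, alpha=0.5):
    """C: KxK cost matrix (rows = actual); returns (1-alpha)*C + alpha*(0/1)."""
    K = len(C)
    return [[(1 - alpha) * C[i][j] + alpha * (0 if i == j else 1)
             for j in range(K)] for i in range(K)]

C = [[0, 1000, 100000],
     [100, 0, 3000],
     [100, 30, 0]]
print(soft_cost(C, alpha=0.5)[0])  # [0.0, 500.5, 50000.5]
```

Running any cost-sensitive algorithm on the blended matrix trades a little cost for a lower error rate.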
[paired figures: Cost and Error of the improved OSR across semi-real datasets (iris, wine, glass, vehicle, vowel, segment, dna, satimage, usps, zoo, splice, ecoli, soybean); ≈ marks datasets where cost remains comparable]
improves other cost-sensitive classification algorithms, too
Cost-and-Error-Sensitive Classification with Bioinformatics Application
Conclusion
• reduction from cost-sensitive classification to regression:
via cost estimation
• one-sided regression with solid theoretical guarantee
• superior experimental results with OSR-SVM
• OSR for deep learning: OSR-SVM → CSDNN
• OSR for medical application: towards cost-and-error-sensitive
Thank you. Questions?