Cost-Sensitive Classification:
Algorithm and Application
Hsuan-Tien Lin htlin@csie.ntu.edu.tw
Department of Computer Science
& Information Engineering
National Taiwan University
AI in Data Science Forum, December 20, 2017
About Me
• Chief Data Scientist, Appier
• Professor, Dept. of CSIE, NTU
• Co-author of textbook “Learning from Data: A Short Course”
• Instructor of two NTU-Coursera Mandarin-teaching ML Massive Open Online Courses
goal: make machine learning more realistic
• multi-class cost-sensitive classification: in ICML ’10, BIBM ’11, KDD ’12, ACML ’14, IJCAI ’16, etc.
• multi-label classification: in ACML ’11, NIPS ’12, ICML ’14, AAAI ’18, etc.
• online/active learning: in ICML ’12, ACML ’12, ICML ’14, AAAI ’15, etc.
• large-scale data mining (w/ Profs. S.-D. Lin & C.-J. Lin & students):
KDDCup world champions of ’10, ’11 (×2), ’12, ’13 (×2)
Hsuan-Tien Lin (NTU CSIE) Cost-Sensitive Classification: Algorithm and Application 1/33
Cost-Sensitive Multiclass Classification
Which Digit Did You Write?
?
one (1) two (2) three (3) four (4)
• a multiclass classification problem
—grouping “pictures” into different “categories”
C’mon, we know about multiclass classification all too well! :-)
Cost-Sensitive Multiclass Classification
Performance Evaluation
(is g(x) ≈ f(x)?)
• ZIP code recognition:
1: wrong; 2: right; 3: wrong; 4: wrong
• check value recognition:
1: one-dollar mistake; 2: no mistake;
3: one-dollar mistake; 4: two-dollar mistake
different applications:
evaluate mis-predictions differently
Cost-Sensitive Multiclass Classification
ZIP Code Recognition
?
1: wrong; 2: right; 3: wrong; 4: wrong
• regular multiclass classification: only right or wrong
• wrong cost: 1; right cost: 0
• prediction error of h on some (x, y):
classification cost = ⟦y ≠ h(x)⟧
regular multiclass classification:
well-studied, many good algorithms
Cost-Sensitive Multiclass Classification
Check Value Recognition
?
1: one-dollar mistake; 2: no mistake;
3: one-dollar mistake; 4: two-dollar mistake
• cost-sensitive multiclass classification:
different costs for different mis-predictions
• e.g. prediction error of h on some (x, y ):
absolute cost = |y − h(x)|
cost-sensitive multiclass classification:
relatively newer, need more research
Cost-Sensitive Multiclass Classification
What is the Status of the Patient?
?
H7N9-infected cold-infected healthy
• another classification problem
—grouping “patients” into different “statuses”
are all mis-prediction costs equal?
Cost-Sensitive Multiclass Classification
Patient Status Prediction
error measure = society cost

actual \ predicted    H7N9    cold    healthy
H7N9                     0    1000     100000
cold                   100       0       3000
healthy                100      30          0
• H7N9 mis-predicted as healthy: very high cost
• cold mis-predicted as healthy: high cost
• cold correctly predicted as cold: no cost
human doctors consider costs of decision;
can computer-aided diagnosis do the same?
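The cost matrix above can be used directly in evaluation; a minimal Python sketch (class names as dictionary keys and the helper name are my own choices) of how society cost accumulates over predictions:

```python
# Hypothetical sketch: evaluating predictions under the society-cost matrix
# above; COST[actual][predicted] is copied from the slide's table.
COST = {
    "H7N9":    {"H7N9": 0,   "cold": 1000, "healthy": 100000},
    "cold":    {"H7N9": 100, "cold": 0,    "healthy": 3000},
    "healthy": {"H7N9": 100, "cold": 30,   "healthy": 0},
}

def total_cost(actuals, predictions):
    """Sum the society cost over (actual, predicted) pairs."""
    return sum(COST[a][p] for a, p in zip(actuals, predictions))

# one H7N9 patient mis-predicted as healthy dominates everything else
print(total_cost(["H7N9", "cold"], ["healthy", "cold"]))  # 100000
```

A computer-aided diagnosis system would then be tuned to keep this sum small, not merely the number of wrong labels.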
Cost-Sensitive Multiclass Classification
What is the Type of the Movie?
? (romance / fiction / terror)

customer 1, who hates romance but likes terror; error measure = non-satisfaction

actual \ predicted    romance    fiction    terror
romance                     0          5       100

customer 2, who likes terror and romance

actual \ predicted    romance    fiction    terror
romance                     0          5         3
different customers:
evaluate mis-predictions differently
Cost-Sensitive Multiclass Classification
Cost-Sensitive Multiclass Classification Tasks
movie classification with non-satisfaction

actual \ predicted       romance    fiction    terror
customer 1, romance            0          5       100
customer 2, romance            0          5         3

patient diagnosis with society cost

actual \ predicted    H7N9    cold    healthy
H7N9                     0    1000     100000
cold                   100       0       3000
healthy                100      30          0

check digit recognition with absolute cost C(y, h(x)) = |y − h(x)|
Cost-Sensitive Multiclass Classification
Cost Vector
cost vector c: a row of cost components
• customer 1 on a romance movie: c = (0, 5, 100)
• an H7N9 patient: c = (0, 1000, 100000)
• absolute cost for digit 2: c = (1, 0, 1, 2)
• “regular” classification cost for label 2: c = (1, 0, 1, 1)
regular classification:
special case of cost-sensitive classification
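The cost vectors above can be generated mechanically; a small Python sketch (function names are mine) showing that the regular 0/1 cost is just one particular cost vector:

```python
# Sketch: regular classification as a special case of cost vectors.
def regular_cost_vector(y, K):
    """0/1 cost vector for true label y (1-indexed) among K classes."""
    return [0 if k == y else 1 for k in range(1, K + 1)]

def absolute_cost_vector(y, K):
    """Absolute cost |y - k|, as in check value recognition."""
    return [abs(y - k) for k in range(1, K + 1)]

print(regular_cost_vector(2, 4))   # [1, 0, 1, 1]
print(absolute_cost_vector(2, 4))  # [1, 0, 1, 2]
```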
Cost-Sensitive Multiclass Classification
Setup: Vector-Based Cost-Sensitive Multiclass Classification

Given
N examples, each (input xn, label yn) ∈ X × {1, 2, . . . , K}, and cost vectors cn, each ∈ R^K
—will assume cn[yn] = 0 = min_{1≤k≤K} cn[k]

Goal
a classifier g(x) that pays a small cost c[g(x)] on a future unseen example (x, y, c)
• will assume c[y] = 0 = cmin = min_{1≤k≤K} c[k]
• note: y not really needed in evaluation

cost-sensitive classification:
can express any finite-loss supervised learning task
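A tiny sketch of evaluation in this setup (names are my own); note the label y is indeed never consulted:

```python
# Sketch: a cost-sensitive classifier is judged only through c[g(x)].
def average_test_cost(g, examples):
    """examples: list of (x, c) pairs; c is a cost list indexed by class - 1."""
    return sum(c[g(x) - 1] for x, c in examples) / len(examples)

g = lambda x: 1  # toy classifier that always predicts class 1
examples = [("a", [0, 5, 100]), ("b", [1000, 0, 3000])]
print(average_test_cost(g, examples))  # (0 + 1000) / 2 = 500.0
```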
Cost-Sensitive Multiclass Classification
Our Contribution
(Tu and Lin, ICML 2010)

                  binary                           multiclass
regular           well-studied                     well-studied
cost-sensitive    known (Zadrozny et al., 2003)    ongoing (our works, among others)

a theoretic and algorithmic study of cost-sensitive classification, which ...
• introduces a methodology to reduce cost-sensitive classification to regression
• provides strong theoretical support for the methodology
• leads to a promising algorithm with superior experimental results
will describe the methodology and an algorithm
Cost-Sensitive Classification by Regression
Key Idea: Cost Estimator
Goal
a classifier g(x) that pays a small cost c[g(x)] on future unseen example (x, y , c)
if every c[k] known: optimal g*(x) = argmin_{1≤k≤K} c[k]
if rk(x) ≈ c[k] well: approximately good gr(x) = argmin_{1≤k≤K} rk(x)
how to get cost estimator rk? regression
Cost-Sensitive Classification by Regression
Cost Estimator by Per-class Regression
Given
N examples, each (input xn, label yn, cost cn) ∈ X × {1, 2, . . . , K} × R^K
input    cn[1]        input    cn[2]        . . .    input    cn[K]
x1       0            x1       2                     x1       1
x2       1            x2       3                     x2       5
· · ·
xN       6            xN       1                     xN       0
(fit r1)              (fit r2)                       (fit rK)
want: rk(x)≈ c[k] for all future (x, y, c) and k
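The per-class split above is a one-line transformation; a Python sketch (names are mine) turning cost-sensitive examples into the K regression datasets:

```python
# Sketch: each (xn, yn, cn) yields one regression target cn[k] per class k.
def per_class_datasets(examples, K):
    """examples: list of (x, y, c); returns {k: [(x, c[k-1]), ...]}."""
    return {k: [(x, c[k - 1]) for x, y, c in examples]
            for k in range(1, K + 1)}

examples = [("x1", 1, [0, 2, 1]), ("x2", 1, [1, 3, 5]), ("xN", 3, [6, 1, 0])]
datasets = per_class_datasets(examples, K=3)
print(datasets[2])  # [('x1', 2), ('x2', 3), ('xN', 1)]
```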
Cost-Sensitive Classification by Regression
The Reduction Framework
cost-sensitive examples (xn, yn, cn)
⇒ regression examples (Xn,k, Yn,k), k = 1, · · · , K
⇒ regression algorithm
⇒ regressors rk(x), k ∈ {1, · · · , K}
⇒ cost-sensitive classifier gr(x)
1 transform cost-sensitive examples (xn, yn, cn) to regression examples (Xn,k, Yn,k) = (xn, cn[k])
2 use your favorite algorithm on the regression examples and get estimators rk(x)
3 for each new input x, predict its class using gr(x) = argmin_{1≤k≤K} rk(x)
the reduction-to-regression framework:
systematic & easy to implement
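The three steps can be sketched end to end; here a 1-nearest-neighbor regressor stands in for "your favorite regression algorithm" (my illustrative choice, not the paper's one-sided regression, which comes later):

```python
# Sketch of the reduction framework with a 1-NN regressor per class.
def fit_reduction(examples, K):
    """examples: list of (x, c) with scalar feature x and cost list c."""
    def make_rk(k):
        def rk(x):  # 1-NN regression on the k-th cost component
            nearest = min(examples, key=lambda e: abs(e[0] - x))
            return nearest[1][k - 1]
        return rk
    regressors = {k: make_rk(k) for k in range(1, K + 1)}

    def g(x):  # step 3: pick the class with the smallest estimated cost
        return min(range(1, K + 1), key=lambda k: regressors[k](x))
    return g

train = [(0.0, [0, 2, 9]), (1.0, [5, 0, 1]), (2.0, [9, 3, 0])]
g = fit_reduction(train, K=3)
print(g(0.1), g(1.9))  # 1 3
```

Any regression learner can be plugged into the same skeleton, which is what makes the reduction systematic.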
Cost-Sensitive Classification by Regression
Theoretical Guarantees (1/2)
gr(x) = argmin_{1≤k≤K} rk(x)

Theorem (Absolute Loss Bound)
For any set of cost estimators {rk}, k = 1, . . . , K, and for any example (x, y, c) with c[y] = 0,

c[gr(x)] ≤ Σ_{k=1}^{K} |rk(x) − c[k]|.
low-cost classifier⇐= accurate estimator
Cost-Sensitive Classification by Regression
Theoretical Guarantees (2/2)
gr(x) = argmin_{1≤k≤K} rk(x)

Theorem (Squared Loss Bound)
For any set of cost estimators {rk}, k = 1, . . . , K, and for any example (x, y, c) with c[y] = 0,

c[gr(x)] ≤ √(2 Σ_{k=1}^{K} (rk(x) − c[k])²).

applies to common least-squares regression
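A quick numeric check of both bounds on toy numbers of my own:

```python
# Toy check: the realized cost never exceeds either bound.
import math

def gr(r):  # argmin over classes, 1-indexed
    return min(range(1, len(r) + 1), key=lambda k: r[k - 1])

c = [0, 2, 5]         # true cost vector, c[y] = 0 with y = 1
r = [1.5, 1.0, 4.0]   # imperfect estimates; gr mis-picks class 2
k = gr(r)
cost = c[k - 1]
abs_bound = sum(abs(ri - ci) for ri, ci in zip(r, c))
sq_bound = math.sqrt(2 * sum((ri - ci) ** 2 for ri, ci in zip(r, c)))
print(cost, abs_bound)  # 2 3.5
assert cost <= abs_bound and cost <= sq_bound
```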
Cost-Sensitive Classification by Regression
A Pictorial Proof
c[gr(x)] ≤ Σ_{k=1}^{K} |rk(x) − c[k]|

• assume c ordered and not degenerate:
y = 1; 0 = c[1] < c[2] ≤ · · · ≤ c[K]
• assume mis-prediction gr(x) = 2:
r2(x) = min_{1≤k≤K} rk(x) ≤ r1(x)

[number line: costs c[1], c[2], c[3], . . . , c[K] with estimates r2(x) ≤ r1(x), r3(x), . . . , rK(x) marked; ∆1 is the gap above c[1], ∆2 the gap below c[2]]

c[gr(x)] = c[2] − c[1] (since c[1] = 0) ≤ ∆1 + ∆2 ≤ Σ_{k=1}^{K} |rk(x) − c[k]|
Cost-Sensitive Classification by Regression
An Even Closer Look
let ∆1 ≡ r1(x) − c[1] and ∆2 ≡ c[2] − r2(x)
1. ∆1 ≥ 0 and ∆2 ≥ 0: c[2] ≤ ∆1 + ∆2
2. ∆1 ≤ 0 and ∆2 ≥ 0: c[2] ≤ ∆2
3. ∆1 ≥ 0 and ∆2 ≤ 0: c[2] ≤ ∆1

c[2] ≤ max(∆1, 0) + max(∆2, 0) ≤ |∆1| + |∆2|

[number lines for the three cases, marking c[1], c[2] and the positions of r2(x) ≤ r1(x) with the gaps ∆1 and ∆2]
Cost-Sensitive Classification by Regression
Tighter Bound with One-sided Loss
Define one-sided loss ξk ≡ max(∆k, 0) with
∆k ≡ rk(x) − c[k] if c[k] = cmin
∆k ≡ c[k] − rk(x) if c[k] ≠ cmin

Intuition
• c[k] = cmin: wish to have rk(x) ≤ c[k]
• c[k] ≠ cmin: wish to have rk(x) ≥ c[k]
—both wishes same as ∆k ≤ 0 and hence ξk = 0

One-sided Loss Bound:
c[gr(x)] ≤ Σ_{k=1}^{K} ξk ≤ Σ_{k=1}^{K} |∆k|
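A small sketch (function name is mine) computing the one-sided losses; estimators that satisfy both "wishes" incur ξk = 0 on every slot:

```python
# Sketch of the one-sided loss: the sign of delta depends on whether
# the slot carries the minimum cost.
def one_sided_losses(r, c):
    """r: estimates rk(x); c: true cost vector; returns [xi_1, ..., xi_K]."""
    cmin = min(c)
    xis = []
    for rk, ck in zip(r, c):
        delta = (rk - ck) if ck == cmin else (ck - rk)
        xis.append(max(delta, 0.0))
    return xis

c = [0, 2, 5]
# both wishes satisfied: under-estimate the cmin slot, over-estimate the rest
print(one_sided_losses([-0.5, 3.0, 6.0], c))  # [0.0, 0.0, 0.0]
# wishes violated on every slot
print(one_sided_losses([1.5, 1.0, 4.0], c))   # [1.5, 1.0, 1.0]
```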
Cost-Sensitive Classification by Regression
The Improved Reduction Framework
cost-sensitive examples (xn, yn, cn)
⇒ regression examples (Xn,k, Yn,k, Zn,k), k = 1, · · · , K
⇒ one-sided regression algorithm
⇒ regressors rk(x), k ∈ {1, · · · , K}
⇒ cost-sensitive classifier gr(x)
1 transform cost-sensitive examples (xn, yn, cn) to regression examples (Xn,k, Yn,k, Zn,k)
2 use a one-sided regression algorithm to get estimators rk(x)
3 for each new input x, predict its class using gr(x) = argmin_{1≤k≤K} rk(x)
the reduction-to-OSR framework:
need a good OSR algorithm
Cost-Sensitive Classification by Regression
Regularized One-sided Hyper-linear Regression
Given
(xn,k, Yn,k, Zn,k) = (xn, cn[k], 2⟦cn[k] = cn[yn]⟧ − 1)

Training Goal
all training ξn,k = max(∆n,k, 0) small, with ∆n,k ≡ Zn,k · (rk(xn,k) − Yn,k)
—will drop k below

min_{w,b} λ/2 ⟨w, w⟩ + Σ_{n=1}^{N} ξn
to get rk(x) = ⟨w, φ(x)⟩ + b
Cost-Sensitive Classification by Regression
One-sided Support Vector Regression
Regularized One-sided Hyper-linear Regression
min_{w,b} λ/2 ⟨w, w⟩ + Σ_{n=1}^{N} ξn
ξn = max(Zn · (rk(xn) − Yn), 0)

Standard Support Vector Regression
min_{w,b} 1/(2C) ⟨w, w⟩ + Σ_{n=1}^{N} (ξn + ξn*)
ξn = max(+1 · (rk(xn) − Yn − ε), 0)
ξn* = max(−1 · (rk(xn) − Yn + ε), 0)

OSR-SVM = SVR + (ε → 0) + (keep ξn or ξn* by Zn)
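The objective above can be sketched with plain subgradient descent on a scalar-feature linear model; this is my illustrative stand-in, not the kernelized SVR-based OSR-SVM solver:

```python
# Sketch: minimize lambda/2 * w^2 + sum_n max(Z_n * (r(x_n) - Y_n), 0)
# for a linear r(x) = w*x + b, by subgradient descent.
def osr_fit(data, lam=0.01, lr=0.05, epochs=2000):
    """data: list of (x, Y, Z); Z = +1 if the cost is cmin, else -1."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        gw, gb = lam * w, 0.0              # gradient of the regularizer
        for x, Y, Z in data:
            if Z * ((w * x + b) - Y) > 0:  # slack xi_n is active
                gw += Z * x
                gb += Z
        w -= lr * gw
        b -= lr * gb
    return w, b

# one-sided targets: r should stay below Y where Z = +1, above Y where Z = -1
data = [(0.0, 0.0, +1), (1.0, 2.0, -1), (2.0, 5.0, -1)]
w, b = osr_fit(data)
one_sided = sum(max(Z * ((w * x + b) - Y), 0.0) for x, Y, Z in data)
print(round(one_sided, 2))  # near 0: the one-sided wishes are (almost) met
```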
Cost-Sensitive Classification by Regression
OSR versus Other Reductions
OSR: K regressors
How unlikely (costly) is it that the example belongs to class k?
Filter Tree (FT): K − 1 binary classifiers
Is the lowest cost within labels {1, 4} or {2, 3}?
Is the lowest cost within label {1} or {4}?
Is the lowest cost within label {2} or {3}?
Weighted All Pairs (WAP): K(K−1)/2 binary classifiers
is c[1] or c[4] lower?
Cost-Sensitive Classification by Regression
OSR-SVM on Semi-Real Data
[bar chart: average test cost of OSR vs OVA on ten semi-real datasets (ir., wi., gl., ve., vo., se., dn., sa., us., le.)] (Tu and Lin, ICML 2010)
• OSR: a cost-sensitive extension of OVA
• OVA: regular SVM
OSR often significantly better than OVA
Cost-Sensitive Classification by Regression
OSR versus FT on Semi-Real Data
[bar chart: average test cost of OSR vs FT on ten semi-real datasets (ir., wi., gl., ve., vo., se., dn., sa., us., le.)] (Tu and Lin, ICML 2010)
• OSR (per-class):
O(K) training, O(K) prediction
• FT (tournament):
O(K) training, O(log₂ K) prediction
FT faster, but OSR better performing
Cost-Sensitive Classification by Regression
OSR versus WAP on Semi-Real Data
[bar chart: average test cost of OSR vs WAP on ten semi-real datasets (ir., wi., gl., ve., vo., se., dn., sa., us., le.)] (Tu and Lin, ICML 2010)
• OSR (per-class):
O(K) training, O(K) prediction
• WAP (pairwise):
O(K²) training, O(K²) prediction
OSR faster and comparable performance
Cost-Sensitive Classification by Regression
Six Years after OSR-SVM
(Chung, Lin and Yang, IJCAI 2016)

OSR-SVM
min_{w,b} λ/2 ⟨w, w⟩ + Σ_{n=1}^{N} ξn
with rk(x) = ⟨w, φ(x)⟩ + b
ξn = max(Zn · (rk(xn) − Yn), 0)

Cost-Sensitive Deep NNet (CSDNN)
min NNet regularizer + Σ_{n=1}^{N} δn
with rk(x) = NNet(x)
δn = ln(1 + exp(Zn · (rk(xn) − Yn)))

• CSDNN: world’s first cost-sensitive deep model via a smoother upper bound—δn ≥ ξn because ln(1 + exp(v)) ≥ max(v, 0)
• δn used in both pretraining & training for better NNet feature extraction
concept of reduction-to-OSR still useful after 6 years
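A quick check (toy values are mine) of the smoothing step: softplus upper-bounds the one-sided hinge everywhere, so δn ≥ ξn:

```python
# Softplus vs one-sided hinge: ln(1 + exp(v)) >= max(v, 0) for all v.
import math

def xi(v):      # OSR one-sided hinge
    return max(v, 0.0)

def delta(v):   # CSDNN smooth surrogate (softplus)
    return math.log(1.0 + math.exp(v))

for v in [-3.0, -0.1, 0.0, 0.1, 3.0]:
    assert delta(v) >= xi(v)
print(round(delta(0.0), 4))  # 0.6931 (= ln 2), while xi(0.0) = 0.0
```

Minimizing the smooth δn therefore also pushes the non-smooth ξn down, while keeping the loss differentiable for backpropagation.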
Cost-and-Error-Sensitive Classification with Bioinformatics Application
A Real Medical Application: Classifying Bacteria
The Problem
• by human doctors: different treatments ⇐⇒ serious costs
• cost matrix averaged from two doctors:
actual \ predicted:
        Ab  Ecoli  HI  KP  LM  Nm  Psa  Spn  Sa  GBS
Ab       0    1    10   7   9   9   5    8    9   1
Ecoli    3    0    10   8  10  10   5   10   10   2
HI      10   10     0   3   2   2  10    1    2  10
KP       7    7     3   0   4   4   6    3    3   8
LM       8    8     2   4   0   5   8    2    1   8
Nm       3   10     9   8   6   0   8    3    6   7
Psa      7    8    10   9   9   7   0    8    9   5
Spn      6   10     7   7   4   4   9    0    4   7
Sa       7   10     6   5   1   3   9    2    0   7
GBS      2    5    10   9   8   6   5    6    8   0
is cost-sensitive classification realistic?
Cost-and-Error-Sensitive Classification with Bioinformatics Application
OSR versus OVO/CSOVO(WAP)/FT on Bacteria Data
(Jan et al., BIBM 2011)
[bar chart (RBF kernel) from the BIBM 2011 slides: test cost of OVO-SVM vs csOSR-SVM, csOVO-SVM, and csFT-SVM on the bacteria data; the cost-sensitive algorithms perform better than the regular algorithm]
OSR best: cost-sensitive classification is helpful
Cost-and-Error-Sensitive Classification with Bioinformatics Application
Soft Cost-sensitive Classification
The Problem
[scatter plot: Error (%) on one axis vs Cost on the other, comparing classifiers]
• cost-sensitive classifier: low cost but high error
• traditional classifier: low error but high cost
• how can we get the blue classifiers?: low error and low cost
cost-and-error-sensitive: more suitable for medical needs
Cost-and-Error-Sensitive Classification with Bioinformatics Application
Improved OSR for Cost and Error on Semi-Real Data
key idea (Jan et al., KDD 2012): consider a ‘modified’ cost that mixes the original cost and the ‘regular’ cost
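A sketch of the mixing idea (the blend weight alpha and all names are my own, not the paper's exact formulation):

```python
# Sketch: blend the original cost matrix with the regular 0/1 cost so the
# learner also cares about plain classification errors.
def soft_cost(C, alpha=0.5):
    """C: KxK cost matrix (rows = actual); returns (1-alpha)*C + alpha*(0/1)."""
    K = len(C)
    return [[(1 - alpha) * C[i][j] + alpha * (0 if i == j else 1)
             for j in range(K)] for i in range(K)]

C = [[0, 1000, 100000],
     [100, 0, 3000],
     [100, 30, 0]]
print(soft_cost(C, alpha=0.5)[0])  # [0.0, 500.5, 50000.5]
```

Running any cost-sensitive algorithm on the blended matrix trades a little cost for a lower error rate.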
[paired figures: Cost and Error of the improved OSR across semi-real datasets (iris, wine, glass, vehicle, vowel, segment, dna, satimage, usps, zoo, splice, ecoli, soybean); ≈ marks datasets where cost remains comparable]
improves other cost-sensitive classification algorithms, too
Cost-and-Error-Sensitive Classification with Bioinformatics Application
Conclusion
• reduction from cost-sensitive classification to regression:
via cost estimation
• one-sided regression with solid theoretical guarantee
• superior experimental results with OSR-SVM
• OSR for deep learning: OSR-SVM → CSDNN
• OSR for medical application: towards cost-and-error-sensitive
Thank you. Questions?