Cost-sensitive Multiclass Classification via Regression
Hsuan-Tien Lin
Dept. of CSIE, NTU
Talk at NTU CE, 10/15/2013
Joint work with Han-Hsing Tu; parts appeared in ICML ’10
Which Digit Did You Write?
[image: a handwritten digit]
one (1) two (2) three (3) four (4)
• a classification problem
—grouping “pictures” into different “categories”
How can machines learn to classify?
Supervised Machine Learning
[diagram: Parent ⇒ (picture, category) pairs ⇒ Kid's good decision function (brain), built from possibilities]
[diagram: Truth f(x) + noise e(x) ⇒ examples (picture xn, category yn) ⇒ learning algorithm ⇒ good decision function g(x) ≈ f(x), chosen from the learning model {gα(x)}]
challenge:
see only {(xn, yn)} without knowing f(x) or e(x)
⇒ generalize to unseen (x, y) w.r.t. f(x)
Mis-prediction Costs
(is g(x) ≈ f(x)?)
• ZIP code recognition:
1:wrong; 2:right; 3: wrong; 4: wrong
• check value recognition:
1: one-dollar mistake; 2: no mistake;
3: one-dollar mistake; 4: two-dollar mistake
different applications:
evaluate mis-predictions differently
ZIP Code Recognition
[image: a handwritten digit]
1:wrong; 2:right; 3: wrong; 4: wrong
• regular classification problem: only right or wrong
• wrong cost: 1; right cost: 0
• prediction error of g on some (x, y ):
classification cost = ⟦y ≠ g(x)⟧
regular classification:
well-studied, many good algorithms
Check Value Recognition
[image: a handwritten digit]
1:one-dollar mistake; 2:no mistake;
3:one-dollar mistake; 4: two-dollar mistake
• cost-sensitive classification problem:
different costs for different mis-predictions
• e.g. prediction error of g on some (x, y ):
absolute cost =|y − g(x)|
cost-sensitive classification:
new, need more research
What is the Status of the Patient?
[image: a patient]
H1N1-infected cold-infected healthy
• another classification problem
—grouping “patients” into different “status”
Are all mis-prediction costs equal?
Patient Status Prediction
error measure = society cost

actual \ predicted    H1N1    cold    healthy
H1N1                     0    1000     100000
cold                   100       0       3000
healthy                100      30          0
• H1N1 mis-predicted as healthy: very high cost
• cold mis-predicted as healthy: high cost
• cold correctly predicted as cold: no cost
human doctors consider costs of decision;
can computer-aided diagnosis do the same?
What is the Type of the Movie?
[image: a movie] — romance, fiction, or terror?
customer 1, who hates terror but likes romance (error measure = non-satisfaction)

actual \ predicted    romance    fiction    terror
romance                     0          5       100

customer 2, who likes terror and romance

actual \ predicted    romance    fiction    terror
romance                     0          5         3
different customers:
evaluate mis-predictions differently
Cost-sensitive Classification Tasks
movie classification with non-satisfaction
actual \ predicted     romance    fiction    terror
customer 1, romance          0          5       100
customer 2, romance          0          5         3
patient diagnosis with society cost
actual \ predicted    H1N1    cold    healthy
H1N1                     0    1000     100000
cold                   100       0       3000
healthy                100      30          0
check digit recognition with absolute cost C(y, g(x)) = |g(x) − y|
Cost Vector
cost vector c: a row of cost components
• customer 1 on a romance movie: c = (0,5,100)
• an H1N1 patient: c = (0,1000,100000)
• absolute cost for digit 2: c = (1,0,1,2)
• “regular” classification cost for label 2: c = (1, 0, 1, 1)
regular classification:
special case of cost-sensitive classification
Cost-sensitive Classification Setup
Given
N examples, each
(input xn, label yn, cost cn) ∈ X × {1, 2, . . . , K} × R^K
• K = 2: binary; K > 2: multiclass
• will assume cn[yn] = 0 = min_{1≤k≤K} cn[k]
Goal
a classifier g(x) that pays a small cost c[g(x)] on future unseen example (x, y, c)
• will assume c[y] = 0 = cmin = min_{1≤k≤K} c[k]
• note: y not really needed in evaluation
cost-sensitive classification:
can express any finite-loss supervised learning task
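To make the setup concrete, here is a minimal sketch (toy data and names of my own, not from the talk) of what the examples look like and how a classifier is charged its cost:

```python
import numpy as np

# Toy cost-sensitive data set: each example is (input x_n, label y_n, cost vector c_n).
# As in the setup, c_n[y_n] = 0 = min_k c_n[k]; labels are 1..K, stored 0-based here.
X = np.array([[0.1, 0.3], [0.9, 0.5], [0.4, 0.8]])
y = np.array([0, 1, 2])
C = np.array([[0.0, 1.0, 2.0],     # absolute cost around label 1
              [1.0, 0.0, 1.0],     # absolute cost around label 2
              [2.0, 1.0, 0.0]])    # absolute cost around label 3

def average_cost(g, X, C):
    """Average c[g(x)] over the examples; note y itself is not needed for evaluation."""
    predictions = np.array([g(x) for x in X])
    return np.mean(C[np.arange(len(X)), predictions])

# A (useless) constant classifier that always predicts the first class:
print(average_cost(lambda x: 0, X, C))  # -> (0 + 1 + 2) / 3 = 1.0
```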
Our Contribution
                   binary                     multiclass
regular            well-studied               well-studied
cost-sensitive     known (Zadrozny, 2003)     ongoing (our work, among others)
a theoretic and algorithmic study of cost-sensitive classification, which ...
• introduces a methodology to reduce cost-sensitive classification to regression
• provides strong theoretical support for the methodology
• leads to a promising algorithm with superior experimental results
will describe the methodology and an algorithm
Key Idea: Cost Estimator
Goal
a classifier g(x) that pays a small cost c[g(x)] on future unseen example (x, y , c)
if every c[k] known:
optimal g∗(x) = argmin_{1≤k≤K} c[k]
if rk(x) ≈ c[k] well:
approximately good gr(x) = argmin_{1≤k≤K} rk(x)
how to get the cost estimator rk? regression
Cost Estimator by Per-class Regression
Given
N examples, each (input xn, label yn, cost cn) ∈ X × {1, 2, . . . , K} × R^K
input  cn[1]      input  cn[2]      . . .     input  cn[K]
(x1, 0)           (x1, 2)                     (x1, 1)
(x2, 1)           (x2, 3)                     (x2, 5)
· · ·             · · ·                       · · ·
(xN, 6)           (xN, 1)                     (xN, 0)
  ↓ r1              ↓ r2                        ↓ rK
want: rk(x)≈ c[k] for all future (x, y, c) and k
The Reduction Framework
1 transform cost-sensitive examples (xn, yn, cn) to regression examples (Xn,k, Yn,k) = (xn, cn[k])
2 use your favorite algorithm on the regression examples and get regressors rk(x)
3 for each new input x, predict its class using gr(x) = argmin_{1≤k≤K} rk(x)
the reduction-to-regression framework:
systematic & easy to implement
cost-sensitive example (xn, yn, cn) ⇒ regression examples (Xn,k, Yn,k), k = 1, · · · , K ⇒ regression algorithm ⇒ regressors rk(x), k ∈ 1, · · · , K ⇒ cost-sensitive classifier gr(x)
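A minimal sketch of the three steps above (my own illustration; it plugs in plain least-squares regression as the "favorite algorithm", which the framework deliberately leaves open):

```python
import numpy as np

def train_reduction_to_regression(X, C):
    """Steps 1 + 2: for each class k, fit a least-squares regressor r_k(x) to c_n[k]."""
    N, K = C.shape
    Xb = np.hstack([X, np.ones((N, 1))])        # add a bias column
    # One weight vector per class: column k of W solves ||Xb w_k - C[:, k]||^2.
    W = np.linalg.lstsq(Xb, C, rcond=None)[0]   # shape (d + 1, K)
    return W

def predict(W, x):
    """Step 3: g_r(x) = argmin_k r_k(x)."""
    xb = np.append(x, 1.0)
    return int(np.argmin(xb @ W))

# Usage with the toy data from the earlier sketch:
X = np.array([[0.1, 0.3], [0.9, 0.5], [0.4, 0.8]])
C = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
W = train_reduction_to_regression(X, C)
print(predict(W, np.array([0.2, 0.3])))
```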
Theoretical Guarantees (1/2)
gr(x) = argmin_{1≤k≤K} rk(x)
Theorem (Absolute Loss Bound)
For any set of regressors (cost estimators) {rk}_{k=1}^K and for any example (x, y, c) with c[y] = 0,
c[gr(x)] ≤ Σ_{k=1}^K |rk(x) − c[k]|.
low-cost classifier⇐= accurate regressor
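As a quick toy check of the bound (numbers of my own, not from the talk): take K = 3, c = (0, 1, 3), and regressors with r1(x) = 0.6, r2(x) = 0.4, r3(x) = 2.5. Then gr(x) = 2 and c[gr(x)] = 1, while Σ_k |rk(x) − c[k]| = 0.6 + 0.6 + 0.5 = 1.7, so the bound holds; with the more accurate estimates r1(x) = 0.1, r2(x) = 1.2, r3(x) = 2.8, gr(x) = 1 and the cost drops to 0.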
Theoretical Guarantees (2/2)
gr(x) = argmin_{1≤k≤K} rk(x)
Theorem (Squared Loss Bound)
For any set of regressors (cost estimators) {rk}_{k=1}^K and for any example (x, y, c) with c[y] = 0,
c[gr(x)] ≤ √(2 Σ_{k=1}^K (rk(x) − c[k])²).
applies to common least-squares regression
A Pictorial Proof
c[gr(x)] ≤ Σ_{k=1}^K |rk(x) − c[k]|
• assume c ordered and not degenerate:
y = 1; 0 = c[1] < c[2] ≤ · · · ≤ c[K]
• assume mis-prediction gr(x) = 2:
r2(x) = min_{1≤k≤K} rk(x) ≤ r1(x)
[figure: a number line marking c[1] = 0, c[2], c[3], . . . , c[K] and the estimates r2(x), r1(x), r3(x), . . . , rK(x); ∆1 spans from c[1] up to r1(x) and ∆2 from r2(x) up to c[2]]

c[2] − c[1] (with c[1] = 0) ≤ ∆1 + ∆2 ≤ Σ_{k=1}^K |rk(x) − c[k]|
An Even Closer Look
let ∆1 ≡ r1(x) − c[1] and ∆2 ≡ c[2] − r2(x)
1 ∆1 ≥ 0 and ∆2 ≥ 0: c[2] ≤ ∆1 + ∆2
2 ∆1 ≤ 0 and ∆2 ≥ 0: c[2] ≤ ∆2
3 ∆1 ≥ 0 and ∆2 ≤ 0: c[2] ≤ ∆1
c[2] ≤ max(∆1, 0) + max(∆2, 0) ≤ |∆1| + |∆2|
[figure: number lines illustrating the three cases for r1(x) and r2(x) relative to c[1] and c[2]]
Tighter Bound with One-sided Loss
Define one-sided loss ξk ≡ max(∆k, 0) with
∆k ≡ rk(x) − c[k]   if c[k] = cmin
∆k ≡ c[k] − rk(x)   if c[k] ≠ cmin
Intuition
• c[k] = cmin: wish to have rk(x) ≤ c[k]
• c[k] ≠ cmin: wish to have rk(x) ≥ c[k]
— both wishes same as ∆k ≤ 0 and hence ξk = 0
One-sided Loss Bound:
c[gr(x)] ≤ Σ_{k=1}^K ξk ≤ Σ_{k=1}^K |∆k|
The Improved Reduction Framework
1 transform cost-sensitive examples (xn, yn, cn) to regression examples
(Xn,k, Yn,k, Zn,k) = (xn, cn[k], 2⟦cn[k] = cn[yn]⟧ − 1)
2 use a one-sided regression algorithm to get regressors rk(x)
3 for each new input x, predict its class using gr(x) = argmin_{1≤k≤K} rk(x)
the reduction-to-OSR framework:
need a good OSR algorithm
cost-sensitive example (xn, yn, cn) ⇒ regression examples (Xn,k, Yn,k, Zn,k), k = 1, · · · , K ⇒ one-sided regression algorithm ⇒ regressors rk(x), k ∈ 1, · · · , K ⇒ cost-sensitive classifier gr(x)
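A small sketch of step 1 above (my own illustration; the function name and layout are made up):

```python
import numpy as np

def to_one_sided_regression_examples(X, C):
    """Expand each (x_n, c_n) into K regression examples (x_n, c_n[k], Z_{n,k}), where
    Z_{n,k} = 2 * [[ c_n[k] == c_n[y_n] ]] - 1, i.e. +1 for the lowest-cost label(s)
    and -1 otherwise (using c_n[y_n] = min_k c_n[k] from the setup)."""
    N, K = C.shape
    Xs, Ys, Zs = [], [], []
    for n in range(N):
        c_min = C[n].min()
        for k in range(K):
            Xs.append(X[n])
            Ys.append(C[n, k])
            Zs.append(1.0 if C[n, k] == c_min else -1.0)
    return np.array(Xs), np.array(Ys), np.array(Zs)
```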
Regularized One-sided Hyper-linear Regression
Given
(Xn,k, Yn,k, Zn,k) = (xn, cn[k], 2⟦cn[k] = cn[yn]⟧ − 1)
Training Goal
all training ξn,k = max(Zn,k · (rk(Xn,k) − Yn,k), 0) small, where ∆n,k ≡ rk(Xn,k) − Yn,k
— will drop k below
min_{w,b}  λ/2 · ⟨w, w⟩ + Σ_{n=1}^N ξn    to get rk(X) = ⟨w, φ(X)⟩ + b
One-sided Support Vector Regression
Regularized One-sided Hyper-linear Regression
min_{w,b}  λ/2 · ⟨w, w⟩ + Σ_{n=1}^N ξn
ξn = max(Zn · (rk(Xn) − Yn), 0)
Standard Support Vector Regression
min_{w,b}  1/(2C) · ⟨w, w⟩ + Σ_{n=1}^N (ξn + ξ∗n)
ξn = max(+1 · (rk(Xn) − Yn − ε), 0)
ξ∗n = max(−1 · (rk(Xn) − Yn + ε), 0)
OSR-SVM = SVR + (ε → 0) + (keep ξn or ξ∗n by Zn)
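A bare-bones sketch of this training problem (my own simplification: a linear r(X) = ⟨w, X⟩ + b trained by plain (sub)gradient descent on the one-sided loss, rather than the kernelized SVR-style solver of OSR-SVM):

```python
import numpy as np

def train_one_sided_regressor(X, Y, Z, lam=0.01, lr=0.01, epochs=200):
    """Minimize  lam/2 * <w, w> + sum_n max(Z_n * (<w, x_n> + b - Y_n), 0)
    by (sub)gradient descent; returns (w, b)."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        pred = X @ w + b
        active = (Z * (pred - Y) > 0).astype(float)   # examples with positive one-sided loss
        grad_w = lam * w + X.T @ (active * Z)
        grad_b = np.sum(active * Z)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Training one such regressor per class on the (Xn,k, Yn,k, Zn,k) triples from the earlier sketch and predicting with gr(x) = argmin_k rk(x) gives a rough OSR-style classifier.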
OSR versus Other Reductions
OSR: K regressors
How unlikely (costly) does the example belong to class k?
Filter Tree (FT): K − 1 binary classifiers
Is the lowest cost within labels {1, 4} or {2, 3}?
Is the lowest cost within label {1} or {4}?
Is the lowest cost within label {2} or {3}?
Weighted All Pairs (WAP): K(K − 1)/2 binary classifiers
Is c[1] or c[4] lower?
Sensitive Error Correcting Output Code (SECOC): (T · K) binary classifiers
Is c[1] + c[3] + c[4] greater than some θ?
Experiment: OSR-SVM versus OVA-SVM
[plot: avg. test cost of OSR vs. OVA on ten data sets (ir., wi., gl., ve., vo., se., dn., sa., us., le.)]
• OSR: a cost-sensitive extension of OVA
• OVA: regular SVM
OSR often significantly better than OVA
Experiment: OSR versus FT
[plot: avg. test cost of OSR vs. FT on the same ten data sets]
• OSR (per-class):
O(K ) training, O(K ) prediction
• FT (tournament):
O(K) training, O(log₂ K) prediction
FT faster, but OSR better performed
Experiment: OSR versus WAP
[plot: avg. test cost of OSR vs. WAP on the same ten data sets]
• OSR (per-class):
O(K ) training, O(K ) prediction
• WAP (pairwise):
O(K²) training, O(K²) prediction
OSR faster and comparable performance
Experiment: OSR versus SECOC
[plot: avg. test cost of OSR vs. SECOC on the same ten data sets]
• OSR (per-class):
O(K ) training, O(K ) prediction
• SECOC (error-correcting):
O(K) training, O(K) prediction, but with big constants
OSR faster and much better performance
Conclusion
• reduction to regression:
a simple way of designing cost-sensitive classification algorithms
• theoretical guarantee:
absolute, squared, and one-sided bounds
• algorithmic use:
a novel and simple algorithm: OSR-SVM
• experimental performance of OSR-SVM:
leading in SVM family
more algorithm and application opportunities
Acknowledgments
• Profs. Chih-Jen Lin, Yuh-Jyh Lee, Shou-de Lin for suggestions
• Prof. Chuin-Shan Chen for talk invitation
• Computational Learning Lab @ NTU for discussions