Cost-sensitive Multiclass Classification Using One-versus-one Comparisons
Hsuan-Tien Lin
Assistant Professor
Dept. of Computer Science and Information Engineering National Taiwan University
Talk at Institute of Statistics, National Tsing Hua University, 12/12/2014
Based on the paper “Reduction from cost-sensitive multiclass classification to one-versus-one binary classification”, ACML 2014
Which Digit Did You Write?
[picture of a handwritten digit]
one (1) two (2) three (3) four (4)
a classification problem
— grouping "pictures" into different "categories"
How can machines learn to classify?
Learning from Data
(Abu-Mostafa, Magdon-Ismail and Lin, 2012)

truth f(x) + noise e(x)
    ↓
examples (picture xn, category yn)
    ↓
learning algorithm, using learning model {gα(x)}
    ↓
good decision function g(x) ≈ f(x)

challenge:
see only {(xn, yn)} without knowing f(x) or e(x)
=⇒ generalize to unseen (x, y) w.r.t. f(x)?
Mis-prediction Costs
(g(x) ≈ f(x)?)
ZIP code recognition:
1: wrong; 2: right; 3: wrong; 4: wrong
check value recognition:
1: one-dollar mistake; 2: no mistake; 3: one-dollar mistake; 4: two-dollar mistake
evaluation by formation similarity:
1: not very similar; 2: very similar; 3: somewhat similar; 4: a silly prediction
different applications evaluate mis-predictions differently
ZIP Code Recognition
[picture of a handwritten digit]
1: wrong; 2: right; 3: wrong; 4: wrong
regular classification problem: only right or wrong
wrong cost: 1; right cost: 0
prediction error of g on some (x, y):
classification cost = ⟦y ≠ g(x)⟧
regular classification: well-studied, many good algorithms
Check Value Recognition
[picture of a handwritten check value]
1: one-dollar mistake; 2: no mistake; 3: one-dollar mistake; 4: two-dollar mistake
cost-sensitive classification problem:
different costs for different mis-predictions
e.g. prediction error of g on some (x, y):
absolute cost = |y − g(x)|
cost-sensitive classification: new, needs more research
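As a quick illustration (a minimal Python sketch, not from the talk; the true value y = 2 is taken from the slides), the two error measures disagree exactly on prediction 4:

    def classification_cost(y, g_x):
        # regular classification: 1 if wrong, 0 if right
        return int(y != g_x)

    def absolute_cost(y, g_x):
        # check value recognition: cost grows with the size of the mistake
        return abs(y - g_x)

    y = 2                                  # the true value on the check
    for g_x in [1, 2, 3, 4]:
        print(g_x, classification_cost(y, g_x), absolute_cost(y, g_x))
    # classification costs: 1, 0, 1, 1; absolute costs: 1, 0, 1, 2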
What is the Status of the Patient?
[pictures of patient profiles]
H1N1-infected cold-infected healthy
another classification problem
— grouping "patients" into different "statuses"
Are all mis-prediction costs equal?
Patient Status Prediction
error measure = society cost
C =               predicted
    actual        H1N1    cold    healthy
    H1N1             0    1000     100000
    cold           100       0       3000
    healthy        100      30          0
H1N1 mis-predicted as healthy: very high cost
cold mis-predicted as healthy: high cost
cold correctly predicted as cold: no cost
human doctors consider the costs of their decisions;
can computer-aided diagnosis do the same?
Which Age-Group?
[pictures of four faces]
infant (1) child (2) teen (3) adult (4)
small mistake — classify a child as a teen;
big mistake — classify an infant as an adult
prediction error of g on some (x, y): C(y, g(x)), where

C = [ 0 1 4 5 ]
    [ 1 0 1 3 ]
    [ 3 1 0 2 ]
    [ 5 4 1 0 ]
C: cost matrix
Cost Matrix C
regular classification: C = classification cost Cc:

Cc = [ 0 1 1 1 ]
     [ 1 0 1 1 ]
     [ 1 1 0 1 ]
     [ 1 1 1 0 ]

cost-sensitive classification: C = anything other than Cc:

C = [ 0 1 4 5 ]
    [ 1 0 1 3 ]
    [ 3 1 0 2 ]
    [ 5 4 1 0 ]
regular classification:
special case of cost-sensitive classification
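A small numpy sketch (illustration only; the helper cost() and the 0-based index mapping are my own, not from the talk) of evaluating predictions through a cost matrix, with rows indexed by the true label y and columns by the prediction g(x):

    import numpy as np

    Cc = np.ones((4, 4)) - np.eye(4)       # regular classification cost
    C = np.array([[0, 1, 4, 5],
                  [1, 0, 1, 3],
                  [3, 1, 0, 2],
                  [5, 4, 1, 0]])           # the age-group cost matrix above

    def cost(Cmat, y, g_x):
        # labels 1..K are mapped to indices 0..K-1
        return Cmat[y - 1, g_x - 1]

    print(cost(C, 2, 3))                   # child predicted as teen: cost 1
    print(cost(C, 1, 4))                   # infant predicted as adult: cost 5
    print(cost(Cc, 1, 4))                  # regular classification: just "wrong", cost 1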
Cost-sensitive Classification Setup
Given
N examples, each (input xn, label yn) ∈ X × {1, 2, . . . , K}; cost matrix C ∈ R^{K×K}
K = 2: binary; K > 2: multiclass
will assume C(y, y) = min1≤k≤K C(y, k)
Goal
a classifier g(x) that pays a small cost C(y, g(x)) on future unseen examples (x, y)
cost-sensitive classification:
more realistic than regular one
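The setup translates directly into an evaluation routine; here is a hedged sketch (the function names and the (x, y) example format are assumptions of this illustration, not from the talk):

    import numpy as np

    def is_valid_cost_matrix(C):
        # the slide's assumption: C(y, y) = min over k of C(y, k)
        C = np.asarray(C)
        return bool(np.all(C.diagonal() == C.min(axis=1)))

    def average_test_cost(C, g, examples):
        # the goal: small average C(y, g(x)) over unseen examples,
        # with labels y in {1, ..., K} and g a classifier function
        return np.mean([C[y - 1, g(x) - 1] for x, y in examples])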
Our Contribution
                 binary                    multiclass
regular          well-studied              well-studied
cost-sensitive   known (Zadrozny, 2003)    ongoing (our work, among others)
a theoretical and algorithmic study of cost-sensitive classification, which ...
introduces a methodology for extending regular classification algorithms to cost-sensitive ones with any cost
provides strong theoretical support for the methodology
leads to some promising algorithms with superior experimental results
will describe the methodology and a concrete algorithm
Central Idea: Reduction
complex cost-sensitive problems (the iPod)
=⇒ reduction (the adapter)
simpler regular classification problems (the cassette player)
with well-known results on models, algorithms, and theories
If I have seen further it is by standing on the shoulders of Giants—I. Newton
Cost-Sensitive Binary Classification (1/2)
medical profile x → ?
medical profile x1 → H1N1 (1)
medical profile x2 → NOH1N1 (2)
predicting H1N1 as NOH1N1: serious consequences to public health
predicting NOH1N1 as H1N1: not good, but less serious

cost-sensitive C = [ 0 1000 ]    regular Cc = [ 0 1 ]
                   [ 1    0 ]                 [ 1 0 ]

how to change the entry from 1 to 1000?
Cost-Sensitive Binary Classification (2/2)
copy each case labeled H1N1 1000 times

original problem, evaluated with [ 0 1000 ]
                                 [ 1    0 ]
(x1, H1N1) (x2, NOH1N1) (x3, NOH1N1) (x4, NOH1N1) (x5, H1N1)

equivalent problem, evaluated with [ 0 1 ]
                                   [ 1 0 ]
(x1, H1N1), · · · , (x1, H1N1) (x2, NOH1N1) (x3, NOH1N1) (x4, NOH1N1) (x5, H1N1), · · · , (x5, H1N1)

mathematically:
[ 0 1000 ]   [ 1000 0 ]   [ 0 1 ]
[ 1    0 ] = [    0 1 ] · [ 1 0 ]
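A toy check of the equivalence (data and classifier made up for illustration): after copying each H1N1 case 1000 times, the plain 0/1 error count of a classifier matches its cost-sensitive cost.

    H1N1, NOH1N1 = 1, 2
    C = {(H1N1, H1N1): 0, (H1N1, NOH1N1): 1000,
         (NOH1N1, H1N1): 1, (NOH1N1, NOH1N1): 0}

    data = [("x1", H1N1), ("x2", NOH1N1), ("x3", NOH1N1),
            ("x4", NOH1N1), ("x5", H1N1)]
    g = lambda x: NOH1N1                   # a (bad) classifier: always NOH1N1

    cs_cost = sum(C[(y, g(x))] for x, y in data)
    copied = [(x, y) for x, y in data for _ in range(1000 if y == H1N1 else 1)]
    errors = sum(y != g(x) for x, y in copied)
    print(cs_cost, errors)                 # both 2000: the two problems agree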
Key Idea: Cost Transformation
[ 0 1000 ]   [ 1000 0 ]   [ 0 1 ]
[ 1    0 ] = [    0 1 ] · [ 1 0 ]
     C      (# of copies)    Cc

[ 0 1 1 1 ]   [ 1 0 0 0 ]   [ 0 1 1 1 ]
[ 3 2 3 4 ] = [ 1 2 1 0 ] · [ 1 0 1 1 ]
[ 1 1 0 1 ]   [ 0 0 1 0 ]   [ 1 1 0 1 ]
[ 1 1 1 0 ]   [ 0 0 0 1 ]   [ 1 1 1 0 ]
     C      (mixture weights Q)  (Cc, invertible)

split the cost-sensitive example:
(x, 2)
=⇒ a mixture of regular examples {(x, 1), (x, 2), (x, 2), (x, 3)}
or a weighted mixture {(x, 1, 1), (x, 2, 2), (x, 3, 1)}
why split?
Cost Equivalence by Splitting
[ 0 1 1 1 ]   [ 1 0 0 0 ]   [ 0 1 1 1 ]
[ 3 2 3 4 ] = [ 1 2 1 0 ] · [ 1 0 1 1 ]
[ 1 1 0 1 ]   [ 0 0 1 0 ]   [ 1 1 0 1 ]
[ 1 1 1 0 ]   [ 0 0 0 1 ]   [ 1 1 1 0 ]
     C      (mixture weights Q)     Cc

(x, 2) =⇒ a weighted mixture {(x, 1, 1), (x, 2, 2), (x, 3, 1)}
cost equivalence: for any classifier g,
C(y, g(x)) = Σℓ=1..K Q(y, ℓ) · Cc(ℓ, g(x))
min_g expected LHS (cost-sensitive) = min_g expected RHS (regular, when Q(y, ℓ) ≥ 0)
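A quick numerical check of the equivalence (illustration only, using numpy): with C = Q · Cc, the cost C(y, g(x)) equals the Q(y, ·)-weighted sum of classification costs.

    import numpy as np

    Cc = np.ones((4, 4)) - np.eye(4)
    Q = np.array([[1, 0, 0, 0],
                  [1, 2, 1, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]])
    C = Q @ Cc                             # rows (0,1,1,1), (3,2,3,4), ...

    y, g_x = 2, 3                          # the example (x, 2), predicted as 3
    lhs = C[y - 1, g_x - 1]
    rhs = sum(Q[y - 1, l] * Cc[l, g_x - 1] for l in range(4))
    print(lhs, rhs)                        # both 3.0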
Cost Transformation Methodology: Preliminary
1 split each training example (xn, yn) to a weighted mixture {(xn, ℓ, Q(yn, ℓ))}ℓ=1..K
2 apply a regular classification algorithm on the weighted mixtures ∪n=1..N {(xn, ℓ, Q(yn, ℓ))}ℓ=1..K
by cost equivalence,
good g for the new regular classification problem
= good g for the original cost-sensitive classification problem
regular classification needs Q(yn, ℓ) ≥ 0; but what if Q(yn, ℓ) is negative?
Similar Cost Vectors
[ 1 0 1 2 ]   [ 1/3 4/3 1/3 −2/3 ]   [ 0 1 1 1 ]
[ 3 2 3 4 ] = [  1   2   1    0  ] · [ 1 0 1 1 ]
                                     [ 1 1 0 1 ]
                                     [ 1 1 1 0 ]
   (costs)  (mixture weights Q(y, ℓ)) (classification costs)
negative Q(y, ℓ): cannot split
but ĉ = (1, 0, 1, 2) is similar to c = (3, 2, 3, 4): for any classifier g,
ĉ[g(x)] + constant = c[g(x)]
the constant can be dropped during minimization
=⇒ shifting the cost matrix by constant rows does not affect minimization
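Since Cc is invertible, the mixture weights can be recovered by solving a linear system; a small sketch (my own illustration) of why c = (1, 0, 1, 2) is not directly splittable while its shifted version is:

    import numpy as np

    Cc = np.ones((4, 4)) - np.eye(4)
    c_hat = np.array([1, 0, 1, 2])
    c = np.array([3, 2, 3, 4])             # = c_hat + 2 on every entry

    # solve q @ Cc = c, i.e. Cc.T @ q = c
    print(np.linalg.solve(Cc.T, c_hat))    # [ 0.33  1.33  0.33 -0.67]: negative!
    print(np.linalg.solve(Cc.T, c))        # [ 1.  2.  1.  0.]: splittable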
Cost Transformation Methodology: Revised
[ 0 1 1 1 ]   [ 0 0 0 0 ]   [ 1 0 0 0 ]   [ 0 1 1 1 ]
[ 1 0 1 2 ] + [ 2 2 2 2 ] = [ 1 2 1 0 ] · [ 1 0 1 1 ]
[ 1 1 0 1 ]   [ 0 0 0 0 ]   [ 0 0 1 0 ]   [ 1 1 0 1 ]
[ 1 1 1 0 ]   [ 0 0 0 0 ]   [ 0 0 0 1 ]   [ 1 1 1 0 ]
     C      (constant rows) (mixture weights Q)    Cc
1 shift each row of the original cost to a similar and "splittable" C(y, :), i.e., with Q(yn, ℓ) ≥ 0
2 split (xn, yn) to the weighted mixture {(xn, ℓ, Q(yn, ℓ))}ℓ=1..K
3 apply a regular classification algorithm on the weighted mixtures ∪n=1..N {(xn, ℓ, Q(yn, ℓ))}ℓ=1..K
good g for the new regular classification problem
= good g for the cost-sensitive classification problem
Uncertainty in Mixture
a single example {(x, 2)}
— certain that the desired label is 2
a mixture {(x, 1, 1), (x, 2, 2), (x, 3, 1)} sharing the same x
— uncertainty in the desired label (25%: 1, 50%: 2, 25%: 3)
over-shifting adds unnecessary mixture uncertainty:
[  3  2  3  4 ]   [  1  2  1  0 ]   [ 0 1 1 1 ]
[ 33 32 33 34 ] = [ 11 12 11 10 ] · [ 1 0 1 1 ]
                                    [ 1 1 0 1 ]
                                    [ 1 1 1 0 ]
     (costs)     (mixture weights)       Cc
should choose a similar and splittable ĉ with minimum mixture uncertainty
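Measuring the uncertainty as the entropy of the normalized weights (as the next slide defines), a short sketch (illustration only) shows that over-shifting indeed spreads the mixture out:

    import numpy as np

    def mixture_entropy(q):
        # entropy (in bits) of the normalized mixture weights
        p = np.asarray(q, dtype=float) / np.sum(q)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    print(mixture_entropy([1, 2, 1, 0]))        # 1.5 bits
    print(mixture_entropy([11, 12, 11, 10]))    # ~2.0 bits: more uncertainty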
Cost Transformation Methodology: Final
1 shift the original cost to a similar and splittable C with minimum "mixture uncertainty"
2 split (xn, yn) to a weighted mixture {(xn, ℓ, Q(yn, ℓ))}ℓ=1..K with C
3 apply a regular classification algorithm on the weighted mixtures ∪n=1..N {(xn, ℓ, Q(yn, ℓ))}ℓ=1..K
mixture uncertainty: entropy of each normalized Q(y, :)
a simple and unique optimal shifting exists for every C:
Q(y, k) = maxℓ C(y, ℓ) − C(y, k)
good g for the new regular classification problem
= good g for the cost-sensitive classification problem
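A compact sketch of steps 1-2 (my own rendering, with 1-based labels mapped to 0-based indices): compute Q via the optimal shifting Q(y, k) = maxℓ C(y, ℓ) − C(y, k), then split each example.

    import numpy as np

    def split_examples(C, data):
        # optimal shifting: Q(y, k) = max over l of C(y, l) - C(y, k)
        Q = C.max(axis=1, keepdims=True) - C   # nonnegative by construction
        mixtures = []
        for x, y in data:                      # labels y in {1, ..., K}
            for l in range(C.shape[1]):
                if Q[y - 1, l] > 0:            # drop zero-weight pieces
                    mixtures.append((x, l + 1, float(Q[y - 1, l])))
        return mixtures

    C = np.array([[0, 1, 1, 1],
                  [1, 0, 1, 2],
                  [1, 1, 0, 1],
                  [1, 1, 1, 0]])
    print(split_examples(C, [("x", 2)]))
    # [('x', 1, 1.0), ('x', 2, 2.0), ('x', 3, 1.0)], the mixture from the earlier slide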
Unavoidable (Minimum) Uncertainty
original cost-sensitive classification problem:
individual examples with certainty + absolute cost
=⇒ new regular classification problem:
mixtures with unavoidable uncertainty
the new problem is usually harder than the original one:
need a robust regular classification algorithm to deal with the uncertainty
From OVO to CSOVO
One-Versus-One: A Popular Classification Meta-Method
1 for a pair (i, j), take all examples (xn, yn) with yn = i or j
2 train a binary classifier g(i,j) using those examples
3 repeat the previous two steps for all different (i, j)
4 predict using the votes from g(i,j)
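For reference, a bare-bones OVO in Python (a sketch under the assumption that train_binary(X, y) is any regular binary learner returning a predict(x) function; not tied to any particular library):

    from itertools import combinations
    from collections import Counter

    def ovo_train(K, X, y, train_binary):
        classifiers = {}
        for i, j in combinations(range(1, K + 1), 2):
            # keep only the examples labeled i or j
            pairs = [(x, lab) for x, lab in zip(X, y) if lab in (i, j)]
            Xij, yij = zip(*pairs)
            classifiers[(i, j)] = train_binary(list(Xij), list(yij))
        return classifiers

    def ovo_predict(classifiers, x):
        votes = Counter(g(x) for g in classifiers.values())
        return votes.most_common(1)[0][0]  # the label with the most votes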
cost-sensitive multiclass classification
=⇒ (cost transformation) regular (weighted) multiclass classification
=⇒ (OVO decomposition) regular (weighted) binary classification
cost-sensitive one-versus-one:
cost transformation + one-versus-one
Cost-Sensitive One-Versus-One (CSOVO)
1 for a pair (i, j), transform all examples (xn, yn) to (xn, argmin k∈{i,j} C(yn, k)) with weight |C(yn, i) − C(yn, j)|
2 train a binary classifier g(i,j) using those examples
3 repeat the previous two steps for all different (i, j)
4 predict using the votes from g(i,j)
comes with a good theoretical guarantee:
test cost of the final classifier ≤ 2 Σi<j (test cost of g(i,j))
simple, efficient, and takes the original OVO as a special case
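A sketch of the four CSOVO steps (my own rendering; train_weighted_binary(X, y, w) stands for any weighted binary learner returning a predict(x) function, and C is a K×K numpy array):

    from itertools import combinations
    from collections import Counter

    def csovo_train(C, X, y, train_weighted_binary):
        K = C.shape[0]
        classifiers = {}
        for i, j in combinations(range(1, K + 1), 2):
            Xij, yij, wij = [], [], []
            for x, lab in zip(X, y):
                w = abs(C[lab - 1, i - 1] - C[lab - 1, j - 1])
                if w > 0:                  # zero-weight examples carry no signal
                    # relabel with the cheaper of the two classes
                    yij.append(i if C[lab - 1, i - 1] < C[lab - 1, j - 1] else j)
                    Xij.append(x)
                    wij.append(w)
            classifiers[(i, j)] = train_weighted_binary(Xij, yij, wij)
        return classifiers

    def csovo_predict(classifiers, x):
        votes = Counter(g(x) for g in classifiers.values())
        return votes.most_common(1)[0][0]  # vote among the pairwise classifiers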
CSOVO vs. OVO
[Figure: avg. test random cost of CSOVO vs. OVO on veh, vow, seg, dna, sat, usp]
OVO: popular regular classification meta-method, NOT cost-sensitive
couple both meta-methods with SVM
CSOVO often better: suited for cost-sensitive classification
CSOVO vs. WAP
[Figure: avg. test random cost of CSOVO vs. WAP on veh, vow, seg, dna, sat, usp]
a general cost-sensitive setup with "random" cost
WAP (Abe et al., 2004): related to CSOVO, but more complicated and slower
couple both meta-methods with SVM
CSOVO: simpler, faster, with similar performance
— a preferable choice
Conclusion
cost transformation methodology:
makes any (robust) regular classification algorithm cost-sensitive
theoretical guarantee: cost equivalence
algorithmic use: a novel and simple algorithm, CSOVO
experimental performance of CSOVO: superior
Thank you for your attention!