## Cost-sensitive Multiclass Classification Using One-versus-one Comparisons

Hsuan-Tien Lin

Assistant Professor

Dept. of Computer Science and Information Engineering National Taiwan University

Talk at Institute of Statistics, National Tsing Hua University, 12/12/2014

Based on the paper “Reduction from cost-sensitive multiclass classification to one-versus-one binary classification”, ACML 2014

## Which Digit Did You Write?

?

one (1) two (2) three (3) four (4)

a **classification problem**: grouping “pictures” into different “categories”

**How can machines learn to classify?**

H.-T. Lin (NTU CSIE) Cost-sensitive One-versus-one 12/12/2014 1 / 27

## Learning from Data

(Abu-Mostafa, Magdon-Ismail and Lin, 2012)

truth $f(x)$ + noise $e(x)$ → examples (picture $x_n$, category $y_n$) → learning algorithm → good decision function $g(x) \approx f(x)$

learning model $\{g_\alpha(x)\}$ → learning algorithm

challenge:

see only $\{(x_n, y_n)\}$ without knowing $f(x)$ or $e(x)$

⟹ **generalize to unseen $(x, y)$ w.r.t. $f(x)$**

## Mis-prediction Costs

($g(x) \approx f(x)$?)

?

ZIP code recognition:

1: wrong; 2: right; 3: wrong; 4: wrong

check value recognition:

1: one-dollar mistake; 2: no mistake; 3: one-dollar mistake; 4: **two-dollar mistake**

evaluation by formation similarity:

1: not very similar; 2: very similar; 3: somewhat similar; 4: a **silly prediction**

**different applications evaluate mis-predictions differently**


## ZIP Code Recognition

?

1: wrong; 2: right; 3: wrong; 4: wrong

**regular classification problem: only right or wrong**

wrong cost: 1; right cost: 0

prediction error of g on some (x , y ):

classification cost $= [\![\, y \neq g(x) \,]\!]$

regular classification: **well-studied, many good algorithms**

## Check Value Recognition

?

1: one-dollar mistake; 2: no mistake; 3: one-dollar mistake; 4: **two-dollar mistake**

**cost-sensitive classification problem:** different costs for different mis-predictions

e.g. prediction error of $g$ on some $(x, y)$:

absolute cost $= |y - g(x)|$

cost-sensitive classification: **new, need more research**


## What is the Status of the Patient?

?

H1N1-infected cold-infected healthy

another **classification problem**: grouping “patients” into different “status”

**Are all mis-prediction costs equal?**

## Patient Status Prediction

error measure = society cost

$C$ =

| actual \ predicted | H1N1 | cold | healthy |
|---|---|---|---|
| H1N1 | 0 | 1000 | **100000** |
| cold | 100 | 0 | 3000 |
| healthy | 100 | 30 | 0 |

H1N1 mis-predicted as healthy: **very high cost**

cold mis-predicted as healthy: high cost

cold correctly predicted as cold: no cost

human doctors consider costs of decision;

**can computer-aided diagnosis do the same?**


## Which Age-Group?

?

infant (1) child (2) teen (3) adult (4)

small mistake: classify a child as a teen;

big mistake: classify an infant as an adult

prediction error of $g$ on some $(x, y)$:

$$C(y, g(x)), \quad \text{where } C = \begin{pmatrix} 0 & 1 & 4 & 5 \\ 1 & 0 & 1 & 3 \\ 3 & 1 & 0 & 2 \\ 5 & 4 & 1 & 0 \end{pmatrix}$$

**C: cost matrix**

## Cost Matrix C

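regular classification: $C$ = classification cost $C_c$:

$$C_c = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}$$

cost-sensitive classification: $C$ = anything other than $C_c$, e.g.

$$C = \begin{pmatrix} 0 & 1 & 4 & 5 \\ 1 & 0 & 1 & 3 \\ 3 & 1 & 0 & 2 \\ 5 & 4 & 1 & 0 \end{pmatrix}$$

regular classification: **special case of cost-sensitive classification**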


## Cost-sensitive Classification Setup

Given

$N$ examples, each (input $x_n$, label $y_n$) $\in \mathcal{X} \times \{1, 2, \ldots, K\}$; cost matrix $C$ with rows $C(y, :) \in \mathbb{R}^K$

$K = 2$: binary; $K > 2$: multiclass

will assume $C(y, y) = \min_{1 \le k \le K} C(y, k)$, i.e., the correct prediction is never costlier than a wrong one

Goal

a classifier $g(x)$ that pays a small cost $C(y, g(x))$ on future **unseen** example $(x, y)$

cost-sensitive classification: **more realistic than regular one**
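As a rough illustration of this goal, the sketch below computes the average cost $C(y, g(x))$ of a set of predictions under a cost matrix. The function name `average_cost` and the toy labels are assumptions made for the example; the matrix is the age-group cost matrix from the earlier slide, with classes renumbered from 0.

```python
def average_cost(cost_matrix, true_labels, predicted_labels):
    """Mean of C(y, g(x)) over paired true and predicted labels (0-indexed)."""
    total = sum(cost_matrix[y][p] for y, p in zip(true_labels, predicted_labels))
    return total / len(true_labels)

# Age-group cost matrix from the slides: rows = actual, columns = predicted.
C_AGE = [
    [0, 1, 4, 5],
    [1, 0, 1, 3],
    [3, 1, 0, 2],
    [5, 4, 1, 0],
]
# Regular classification cost C_c: 0 on the diagonal, 1 elsewhere.
C_REGULAR = [[0 if i == j else 1 for j in range(4)] for i in range(4)]

y_true = [0, 1, 2, 3]  # infant, child, teen, adult (toy data)
y_pred = [3, 2, 2, 3]  # infant -> adult (big mistake), child -> teen (small)

print(average_cost(C_AGE, y_true, y_pred))      # cost-sensitive view
print(average_cost(C_REGULAR, y_true, y_pred))  # regular error rate
```

Both mistakes count equally under the regular cost, while the cost matrix makes the infant-as-adult mistake dominate the average.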

## Our Contribution

| | binary | multiclass |
|---|---|---|
| regular | well-studied | well-studied |
| cost-sensitive | known (Zadrozny, 2003) | ongoing (our work, among others) |

a theoretical and algorithmic study of cost-sensitive classification, which

introduces a methodology for extending regular classification algorithms to cost-sensitive ones with **any cost**

provides **strong theoretical support** for the methodology

leads to some promising algorithms with **superior experimental results**

will describe the methodology and a concrete algorithm


## Central Idea: Reduction

complex cost-sensitive problems (iPod)

→ reduction (adapter) →

simpler regular classification problems (cassette player), with well-known results on models, algorithms, and theories

**If I have seen further it is by standing on the shoulders of Giants** (I. Newton)

## Cost-Sensitive Binary Classification (1/2)

medical profile $x$: ?

medical profile $x_1$: H1N1 (1)

medical profile $x_2$: NOH1N1 (2)

predicting H1N1 as NOH1N1: serious consequences to public health

predicting NOH1N1 as H1N1: not good, but less serious

cost-sensitive $C = \begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix}$; regular $C_c = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$

**how to change the entry from 1 to 1000?**


## Cost-Sensitive Binary Classification (2/2)

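**copy each case labeled H1N1 1000 times**

| original problem | equivalent problem |
|---|---|
| evaluate w/ $\begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix}$ | evaluate w/ $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ |
| $(x_1, \text{H1N1})$ | $(x_1, \text{H1N1}), \cdots, (x_1, \text{H1N1})$ |
| $(x_2, \text{NOH1N1})$ | $(x_2, \text{NOH1N1})$ |
| $(x_3, \text{NOH1N1})$ | $(x_3, \text{NOH1N1})$ |
| $(x_4, \text{NOH1N1})$ | $(x_4, \text{NOH1N1})$ |
| $(x_5, \text{H1N1})$ | $(x_5, \text{H1N1}), \cdots, (x_5, \text{H1N1})$ |

mathematically:

$$\begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 1000 & 0 \\ 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$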
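The copying trick can be checked on a toy data set. The sketch below, with made-up predictions and hypothetical helper names, compares the cost-sensitive cost on the original examples with the regular error count on the copied examples.

```python
H1N1, NO_H1N1 = 0, 1

# Cost-sensitive binary costs from the slide: rows = actual, columns = predicted.
C = [[0, 1000],
     [1, 0]]
COPIES = {H1N1: 1000, NO_H1N1: 1}  # diagonal of the "# of copies" matrix

def cost_sensitive_total(examples, predict):
    """Total cost C(y, g(x)) on the original examples."""
    return sum(C[y][predict(x)] for x, y in examples)

def replicate(examples):
    """Copy each example COPIES[y] times to form the equivalent regular problem."""
    return [(x, y) for x, y in examples for _ in range(COPIES[y])]

def regular_errors(examples, predict):
    """Number of mistakes under the regular 0/1 cost."""
    return sum(1 for x, y in examples if predict(x) != y)

# Toy data: x is just an index; labels follow the slide's five examples.
data = [(1, H1N1), (2, NO_H1N1), (3, NO_H1N1), (4, NO_H1N1), (5, H1N1)]

# A hypothetical classifier that misses the H1N1 case x5 and mislabels x2.
PREDICTIONS = {1: H1N1, 2: H1N1, 3: NO_H1N1, 4: NO_H1N1, 5: NO_H1N1}

def predict(x):
    return PREDICTIONS[x]

print(cost_sensitive_total(data, predict))       # cost on the original problem
print(regular_errors(replicate(data), predict))  # errors on the copied problem
```

The two totals agree for any choice of `predict`. A learner that accepts per-example weights could use the weights 1000 and 1 directly instead of storing physical copies.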

## Key Idea: Cost Transformation

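$$\underbrace{\begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix}}_{C} = \underbrace{\begin{pmatrix} 1000 & 0 \\ 0 & 1 \end{pmatrix}}_{\#\text{ of copies}} \cdot \underbrace{\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}}_{C_c}$$

$$\underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 3 & 2 & 3 & 4 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{C} = \underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 2 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}}_{\text{mixture weights } Q} \cdot \underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{C_c,\ \text{invertible}}$$

**split the cost-sensitive example:**

$(x, 2)$ ⟹ a mixture of regular examples $\{(x, 1), (x, 2), (x, 2), (x, 3)\}$

or a weighted mixture $\{(x, 1, 1), (x, 2, 2), (x, 3, 1)\}$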

**why split?**
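Because $C_c$ is invertible, the mixture weights follow from $Q = C \cdot C_c^{-1}$. The sketch below avoids a general matrix inverse by using the closed form $C_c^{-1} = \frac{1}{K-1} J - I$ (with $J$ the all-ones matrix), which holds because $C_c = J - I$. The function name `mixture_weights` is an assumption made for illustration.

```python
from fractions import Fraction

def mixture_weights(cost_matrix):
    """Return Q with Q * C_c = C, where C_c is the 0/1 classification cost.

    Uses C_c^{-1} = J/(K-1) - I (J = all-ones matrix), which gives
    Q(y, k) = sum_l C(y, l) / (K - 1) - C(y, k).
    Expects a square K x K cost matrix.
    """
    K = len(cost_matrix)
    Q = []
    for row in cost_matrix:
        scaled_sum = Fraction(sum(row), K - 1)
        Q.append([scaled_sum - c for c in row])
    return Q

C = [
    [0, 1, 1, 1],
    [3, 2, 3, 4],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
for row in mixture_weights(C):
    print([str(q) for q in row])
```

Exact fractions are used so that fractional or negative weights, which can appear for other cost matrices, are shown without rounding.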


## Cost Equivalence by Splitting

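$$\underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 3 & 2 & 3 & 4 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{C} = \underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 2 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}}_{\text{mixture weights } Q} \cdot \underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{C_c}$$

$(x, 2)$ ⟹ a weighted mixture $\{(x, 1, 1), (x, 2, 2), (x, 3, 1)\}$

**cost equivalence: for any classifier $g$,**

$$C(y, g(x)) = \sum_{\ell=1}^{K} Q(y, \ell)\, C_c(\ell, g(x))$$

$\min_g$ expected LHS (cost-sensitive) $= \min_g$ expected RHS (regular when $Q(y, \ell) \ge 0$)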
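As a worked check of the cost equivalence, take $y = 2$ with $Q(2, :) = (1, 2, 1, 0)$ and $C(2, :) = (3, 2, 3, 4)$ from the matrices on this slide, and evaluate the right-hand side for two possible predictions:

```latex
\begin{align*}
g(x) = 2:&\quad \sum_{\ell=1}^{4} Q(2,\ell)\, C_c(\ell, 2)
  = 1 \cdot 1 + 2 \cdot 0 + 1 \cdot 1 + 0 \cdot 1 = 2 = C(2, 2), \\
g(x) = 4:&\quad \sum_{\ell=1}^{4} Q(2,\ell)\, C_c(\ell, 4)
  = 1 \cdot 1 + 2 \cdot 1 + 1 \cdot 1 + 0 \cdot 0 = 4 = C(2, 4).
\end{align*}
```

Each mixture component labeled $\ell$ charges its weight $Q(2, \ell)$ whenever the prediction differs from $\ell$, and the charges add up to the original cost.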

## Cost Transformation Methodology: Preliminary

1. split each training example $(x_n, y_n)$ to a weighted mixture $\{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$

2. apply regular classification algorithm on the weighted mixtures $\bigcup_{n=1}^{N} \{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$

by cost equivalence,

good $g$ for new regular classification problem = good $g$ for original cost-sensitive classification problem

regular classification: needs $Q(y_n, \ell) \ge 0$

**but what if $Q(y_n, \ell)$ negative?**


## Similar Cost Vectors

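$$\underbrace{\begin{pmatrix} 1 & 0 & 1 & 2 \\ 3 & 2 & 3 & 4 \end{pmatrix}}_{\text{costs}} = \underbrace{\begin{pmatrix} 1/3 & 4/3 & 1/3 & -2/3 \\ 1 & 2 & 1 & 0 \end{pmatrix}}_{\text{mixture weights } Q(y, \ell)} \cdot \underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{\text{classification costs}}$$

negative $Q(y, \ell)$: cannot split

but **$\hat{c} = (1, 0, 1, 2)$ is similar to $c = (3, 2, 3, 4)$:**

for any classifier $g$,

**$\hat{c}[g(x)] + \text{constant} = c[g(x)]$**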

constant can be dropped during minimization

shifting cost matrix by constant rows does not affect minimization
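To spell out why a constant shift is harmless, write the shift applied to example $n$ as $\Delta_n$ (a symbol introduced here for illustration), so that $c_n[k] = \hat{c}_n[k] + \Delta_n$ for every class $k$. Then, over a training set of $N$ examples:

```latex
\begin{align*}
\operatorname*{argmin}_{g} \sum_{n=1}^{N} c_n[g(x_n)]
  &= \operatorname*{argmin}_{g} \left( \sum_{n=1}^{N} \hat{c}_n[g(x_n)] + \sum_{n=1}^{N} \Delta_n \right) \\
  &= \operatorname*{argmin}_{g} \sum_{n=1}^{N} \hat{c}_n[g(x_n)],
\end{align*}
```

since the second sum does not depend on $g$. In the example on this slide, $\Delta = 2$ turns $(1, 0, 1, 2)$ into $(3, 2, 3, 4)$.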

## Cost Transformation Methodology: Revised

$$\underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 2 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{C} + \underbrace{\begin{pmatrix} 0 & 0 & 0 & 0 \\ 2 & 2 & 2 & 2 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}}_{\text{constant rows}} = \underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 2 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}}_{\text{mixture weights } Q} \cdot \underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{C_c}$$

1. shift each row of original cost to a similar and “splittable” $C(y, :)$, i.e., with $Q(y_n, \ell) \ge 0$

2. split $(x_n, y_n)$ to weighted mixture $\{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$

3. apply regular classification algorithm on the weighted mixtures $\bigcup_{n=1}^{N} \{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$

good $g$ for new regular classification problem = good $g$ for cost-sensitive classification problem

## Uncertainty in Mixture

a single example $\{(x, 2)\}$: certain that the desired label is 2

a mixture $\{(x, 1, 1), (x, 2, 2), (x, 3, 1)\}$ sharing the same $x$: uncertainty in the desired label (25%: 1, 50%: 2, 25%: 3)

over-shifting adds unnecessary mixture uncertainty:

$$\underbrace{\begin{pmatrix} 3 & 2 & 3 & 4 \\ 33 & 32 & 33 & 34 \end{pmatrix}}_{\text{costs}} = \underbrace{\begin{pmatrix} 1 & 2 & 1 & 0 \\ 11 & 12 & 11 & 10 \end{pmatrix}}_{\text{mixture weights}} \cdot \underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{C_c}$$

should choose a similar and splittable $c$ with **minimum mixture uncertainty**
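The effect of over-shifting can be quantified by the entropy of the normalized mixture weights. The sketch below is one way to compute it; measuring entropy in bits and the function name `mixture_entropy` are assumptions made for the example.

```python
import math

def mixture_entropy(weights):
    """Entropy (in bits) of the label distribution given by normalized weights."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]  # zero weights contribute 0
    return -sum(p * math.log2(p) for p in probs)

print(mixture_entropy([1, 2, 1, 0]))      # weights for costs (3, 2, 3, 4)
print(mixture_entropy([11, 12, 11, 10]))  # weights for costs (33, 32, 33, 34)
```

The first mixture has entropy 1.5 bits, while the over-shifted one is close to the 2-bit maximum for four labels, i.e., it is almost uniform and carries little information about the desired label.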

## Cost Transformation Methodology: Final

1. shift original cost to a similar and splittable $C$ with minimum “mixture uncertainty”

2. split $(x_n, y_n)$ to a weighted mixture $\{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$ with $C$

3. apply regular classification algorithm on the weighted mixtures $\bigcup_{n=1}^{N} \{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$

mixture uncertainty: entropy of each normalized $Q(y, :)$

a simple and unique optimal shifting exists for every $C$: $Q(y, k) = \max_{\ell} C(y, \ell) - C(y, k)$

good $g$ for new regular classification problem = good $g$ for cost-sensitive classification problem
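Steps 1 and 2 can be sketched in a few lines using the optimal-shift formula $Q(y, k) = \max_{\ell} C(y, \ell) - C(y, k)$. The function names and the toy examples are hypothetical, and classes are 0-indexed here; step 3 is left to whatever weighted regular classification algorithm is available.

```python
def optimal_mixture_weights(cost_row):
    """Q(y, k) = max_l C(y, l) - C(y, k) for one row C(y, :) of the cost matrix."""
    top = max(cost_row)
    return [top - c for c in cost_row]

def split_examples(examples, cost_matrix):
    """Turn each (x, y) into weighted regular examples (x, label, weight).

    Zero-weight entries are dropped because they contribute nothing
    to a weighted training objective.
    """
    mixture = []
    for x, y in examples:
        weights = optimal_mixture_weights(cost_matrix[y])
        for label, weight in enumerate(weights):
            if weight > 0:
                mixture.append((x, label, weight))
    return mixture

C = [
    [0, 1, 1, 1],
    [1, 0, 1, 2],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
print(optimal_mixture_weights(C[1]))
print(split_examples([("x1", 1), ("x2", 0)], C))
```

For the row $(1, 0, 1, 2)$ this reproduces the weights $(1, 2, 1, 0)$ from the slides, and a row of regular classification costs yields a single example with its original label.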

## Unavoidable (Minimum) Uncertainty

Original Cost-Sensitive Classification Problem (classes 1, 2, 3, 4): individual examples with certainty, + absolute cost

=

New Regular Classification Problem: mixtures with unavoidable uncertainty

new problem usually **harder than original one**

**need robust regular classification algorithm to deal with uncertainty**

## From OVO to CSOVO

One-Versus-One: A Popular Classification Meta-Method

1. for a pair $(i, j)$, take all examples $(x_n, y_n)$ such that $y_n = i$ or $j$

2. train a binary classifier $g^{(i,j)}$ using those examples

3. repeat the previous two steps for all different $(i, j)$

4. predict using the votes from $g^{(i,j)}$

cost-sensitive multiclass classification

⟹ (cost transformation) regular (weighted) multiclass classification

⟹ (OVO decomposition) regular (weighted) binary classification

**cost-sensitive one-versus-one: cost transformation + one-versus-one**
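The four OVO steps can be sketched as follows. The base learner `train_centroid` (nearest class mean on one-dimensional inputs) is a hypothetical stand-in chosen to keep the example self-contained; it is not the learner used in the talk, and any binary classification algorithm could be plugged in through `train_binary`.

```python
from collections import Counter
from itertools import combinations

def train_centroid(examples):
    """Hypothetical stand-in base learner: nearest class mean on 1-D inputs."""
    sums, counts = Counter(), Counter()
    for x, y in examples:
        sums[y] += x
        counts[y] += 1
    means = {y: sums[y] / counts[y] for y in counts}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))

def train_ovo(examples, classes, train_binary=train_centroid):
    """Steps 1-3: one binary classifier per unordered pair of classes."""
    classifiers = {}
    for i, j in combinations(classes, 2):
        subset = [(x, y) for x, y in examples if y in (i, j)]
        classifiers[(i, j)] = train_binary(subset)
    return classifiers

def predict_ovo(classifiers, x):
    """Step 4: each pairwise classifier votes; the most-voted class wins."""
    votes = Counter(g(x) for g in classifiers.values())
    return votes.most_common(1)[0][0]

# Toy 1-D data with four classes.
data = [(0.0, 1), (0.2, 1), (1.0, 2), (1.1, 2),
        (2.0, 3), (2.3, 3), (3.1, 4), (2.9, 4)]
ovo = train_ovo(data, classes=[1, 2, 3, 4])
print(len(ovo), predict_ovo(ovo, 1.05), predict_ovo(ovo, 3.0))
```

With $K$ classes this trains $K(K-1)/2$ binary classifiers. Ties in the vote are broken arbitrarily in this sketch.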


## Cost-Sensitive One-Versus-One (CSOVO)

1. for a pair $(i, j)$, transform all examples $(x_n, y_n)$ to $\left(x_n, \operatorname{argmin}_{k \in \{i, j\}} C(y_n, k)\right)$ with weight $\left|C(y_n, i) - C(y_n, j)\right|$

2. train a binary classifier $g^{(i,j)}$ using those examples

3. repeat the previous two steps for all different $(i, j)$

4. predict using the votes from $g^{(i,j)}$

comes with **good theoretical guarantee:**

test cost of final classifier $\le 2 \sum_{i<j}$ test cost of $g^{(i,j)}$

**simple, efficient, and** takes original OVO as **special case**
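The example transformation in step 1 can be sketched as below; training and voting then proceed as in ordinary OVO with a binary learner that accepts example weights. Function names and toy data are hypothetical, classes are 0-indexed, and dropping zero-weight examples is a choice made for this sketch.

```python
from itertools import combinations

def csovo_pair_examples(examples, cost_matrix, i, j):
    """Step 1 for one pair (i, j): weighted binary examples (x, label, weight).

    label  = whichever of i, j is cheaper under C(y, .)
    weight = |C(y, i) - C(y, j)|; zero-weight examples are dropped.
    """
    out = []
    for x, y in examples:
        ci, cj = cost_matrix[y][i], cost_matrix[y][j]
        weight = abs(ci - cj)
        if weight > 0:
            out.append((x, i if ci < cj else j, weight))
    return out

def csovo_training_sets(examples, cost_matrix):
    """Steps 1 and 3: one weighted binary training set per pair i < j."""
    K = len(cost_matrix)
    return {(i, j): csovo_pair_examples(examples, cost_matrix, i, j)
            for i, j in combinations(range(K), 2)}

# Age-group cost matrix from the slides and the regular 0/1 cost.
C_AGE = [[0, 1, 4, 5], [1, 0, 1, 3], [3, 1, 0, 2], [5, 4, 1, 0]]
C_REGULAR = [[0 if a == b else 1 for b in range(4)] for a in range(4)]
data = [("x1", 0), ("x2", 1), ("x3", 3)]

print(csovo_pair_examples(data, C_AGE, 2, 3))      # all three examples are used
print(csovo_pair_examples(data, C_REGULAR, 2, 3))  # only the example with y = 3
```

Under the regular cost, examples whose label is neither $i$ nor $j$ get weight 0 and the rest keep their own label with weight 1, which is exactly the OVO training set. Under a general cost matrix, every example can contribute to every pair.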

## CSOVO v.s. OVO

[Figure: avg. test random cost of CSOVO and OVO on data sets veh, vow, seg, dna, sat, usp]

OVO: popular regular classification meta-method, NOT cost-sensitive

couple both meta-methods with SVM

**CSOVO often better suited for cost-sensitive classification**

## CSOVO v.s. WAP

[Figure: avg. test random cost of CSOVO and WAP on data sets veh, vow, seg, dna, sat, usp]

a general cost-sensitive setup with “random” cost

WAP (Abe et al., 2004): related to CSOVO, but more complicated and slower

couple both meta-methods with SVM

**CSOVO simpler, faster, with similar performance: a preferable choice**

## Conclusion

**cost transformation methodology:** makes **any (robust) regular classification algorithm cost-sensitive**

theoretical guarantee: **cost equivalence**

algorithmic use: a **novel and simple algorithm CSOVO**

experimental performance of CSOVO: **superior**

**Thank you for your attention!**
