# Cost-sensitive Multiclass Classification Using One-versus-one Comparisons

## Full text

(1)

### Cost-sensitive Multiclass Classification Using One-versus-one Comparisons

Hsuan-Tien Lin

Dept. of CSIE, NTU

Talk at NTHU EE, 04/09/2010

(2)

### Which Digit Did You Write?

one (1) two (2) three (3) four (4)

a classification problem: grouping “pictures” into different “categories”

How can machines learn to classify?

(3)

### Supervised Machine Learning

Parent → (picture, category) pairs → Kid’s brain → good decision function

possibilities → Kid’s brain

Truth $f(x)$ + noise $e(x)$ → examples (picture $x_n$, category $y_n$) → learning algorithm → good decision function $g(x) \approx f(x)$

learning model $\{g_\alpha(x)\}$ → learning algorithm

challenge:

see only $\{(x_n, y_n)\}$ without knowing $f(x)$ or $e(x)$

$\Longrightarrow$? generalize to unseen $(x, y)$ w.r.t. $f(x)$

(4)

### Mis-prediction Costs

($g(x) \approx f(x)$?)

ZIP code recognition:

1: wrong; 2: right; 3: wrong; 4: wrong

check value recognition:

1: one-dollar mistake; 2: no mistake; 3: one-dollar mistake; 4: two-dollar mistake

evaluation by formation similarity:

1: not very similar; 2: very similar; 3: somewhat similar; 4: a silly prediction

different applications evaluate mis-predictions differently

(5)

### ZIP Code Recognition

1: wrong; 2: right; 3: wrong; 4: wrong

regular classification problem: only right or wrong

wrong cost: 1; right cost: 0

prediction error of $g$ on some $(x, y)$:

classification cost $= [\![ y \neq g(x) ]\!]$

regular classification: well-studied, many good algorithms

(6)

### Check Value Recognition

1: one-dollar mistake; 2: no mistake; 3: one-dollar mistake; 4: two-dollar mistake

cost-sensitive classification problem: different costs for different mis-predictions

e.g. prediction error of $g$ on some $(x, y)$:

absolute cost $= |y - g(x)|$

cost-sensitive classification: new, need more research

(7)

### Which Age-Group?

infant (1) child (2) teen (3) adult (4)

small mistake: classify a child as a teen;

big mistake: classify an infant as an adult

prediction error of $g$ on some $(x, y)$:

$$C(y, g(x)), \quad \text{where } C = \begin{pmatrix} 0 & 1 & 4 & 5 \\ 1 & 0 & 1 & 3 \\ 3 & 1 & 0 & 2 \\ 5 & 4 & 1 & 0 \end{pmatrix}$$

$C$: cost matrix
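As a small illustration of how a cost matrix is read, the sketch below looks up the cost of a prediction in the age-group matrix. It assumes that rows index the true class and columns index the prediction, with classes numbered 1 to 4; the helper names `cost` and `average_cost` are made up for this example.

```python
# Age-group cost matrix: rows are the true class y, columns are the
# prediction g(x) (assumed convention). Classes are numbered 1..4.
C = [
    [0, 1, 4, 5],  # infant
    [1, 0, 1, 3],  # child
    [3, 1, 0, 2],  # teen
    [5, 4, 1, 0],  # adult
]


def cost(y, prediction):
    """Return C(y, g(x)) for 1-based class labels."""
    return C[y - 1][prediction - 1]


def average_cost(labels, predictions):
    """Average mis-prediction cost over paired true labels and predictions."""
    return sum(cost(y, p) for y, p in zip(labels, predictions)) / len(labels)


small_mistake = cost(2, 3)  # a child classified as a teen
big_mistake = cost(1, 4)    # an infant classified as an adult
```

Under this convention the small mistake costs 1 and the big mistake costs 5, matching the two examples named above.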

(8)

### Cost Matrix C

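regular classification: $C$ = classification cost $C_c$:

$$C_c = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}$$

cost-sensitive classification: $C$ = anything other than $C_c$:

$$C = \begin{pmatrix} 0 & 1 & 4 & 5 \\ 1 & 0 & 1 & 3 \\ 3 & 1 & 0 & 2 \\ 5 & 4 & 1 & 0 \end{pmatrix}$$

regular classification: special case of cost-sensitive classification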

(9)

### Cost-Sensitive Binary Classification (1/2)

medical profile $x$ → ?

medical profile $x_1$: H1N1 (1)

medical profile $x_2$: NOH1N1 (2)

predicting H1N1 as NOH1N1: serious consequences to public health

predicting NOH1N1 as H1N1: not good, but less serious

cost-sensitive $C = \begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix}$; regular $C_c = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$

how to change the entry from 1 to 1000?

(10)

### Cost-Sensitive Binary Classification (2/2)

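copy each case labeled H1N1 1000 times

original problem, evaluate w/ $\begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix}$:

$(x_1, \text{H1N1})$, $(x_2, \text{NOH1N1})$, $(x_3, \text{NOH1N1})$, $(x_4, \text{NOH1N1})$, $(x_5, \text{H1N1})$

equivalent problem, evaluate w/ $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$:

$(x_1, \text{H1N1}), \cdots, (x_1, \text{H1N1})$, $(x_2, \text{NOH1N1})$, $(x_3, \text{NOH1N1})$, $(x_4, \text{NOH1N1})$, $(x_5, \text{H1N1}), \cdots, (x_5, \text{H1N1})$

mathematically:

$$\begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 1000 & 0 \\ 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$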



(11)

### Our Contribution

| | binary | multiclass |
|---|---|---|
| regular | well-studied | well-studied |
| cost-sensitive | known (Zadrozny, 2003) | ongoing (our work, among others) |

a theoretical and algorithmic study of cost-sensitive classification, which ...

introduces a methodology for extending regular classification algorithms to cost-sensitive ones with any cost

provides strong theoretical support for the methodology

leads to some promising algorithms with superior experimental results

will describe the methodology and a concrete algorithm

(12)

### Key Idea: Cost Transformation

0 1000

1 0



| {z }

C

=1000 0

0 1



| {z }

#of copies

·0 1 1 0



| {z }

Cc

0 1 1 1 3 2 3 4 1 1 0 1 1 1 1 0

| {z }

C

=

1 0 0 0 1 2 1 0 0 0 1 0 0 0 0 1

| {z }

mixture weightsQ

·

0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0

| {z }

Cc

split the cost-sensitive example:

(x , 2)

=⇒ a mixture of regular examples {(x, 1), (x, 2), (x, 2), (x, 3)}

or a weighted mixture {(x , 1, 1), (x , 2, 2), (x , 3, 1)}
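The factorization of the 4-class cost matrix into mixture weights times classification costs can be verified directly. The sketch below uses plain Python lists; `matmul` and `split` are helper names invented for this illustration, and dropping zero-weight entries in `split` is a presentation choice.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


# Mixture weights Q and regular classification costs C_c for K = 4 classes.
Q = [
    [1, 0, 0, 0],
    [1, 2, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
Cc = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]

C = matmul(Q, Cc)  # reproduces the cost matrix C of the factorization


def split(x, y):
    """Split the example (x, y) into weighted regular examples (x, label, weight)."""
    return [(x, label, w) for label, w in enumerate(Q[y - 1], start=1) if w != 0]


mixture = split("x", 2)
```

Here `mixture` is the weighted mixture for the example with label 2, read off row 2 of `Q`.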

(13)

### Cost Equivalence by Splitting

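$$\underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 3 & 2 & 3 & 4 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{C} = \underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 2 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}}_{\text{mixture weights } Q} \cdot \underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{C_c}$$

$(x, 2)$

$\Longrightarrow$ a weighted mixture $\{(x, 1, 1), (x, 2, 2), (x, 3, 1)\}$

cost equivalence: for any classifier $g$,

$$C(y, g(x)) = \sum_{\ell=1}^{K} Q(y, \ell) \, [\![ \ell \neq g(x) ]\!]$$

$\min_g$ expected LHS (original cost-sensitive problem)

$= \min_g$ expected RHS (a regular problem when $Q(y, \ell) \ge 0$)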
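As a worked instance of cost equivalence, take $y = 2$, whose row of mixture weights is $(1, 2, 1, 0)$ and whose cost row is $(3, 2, 3, 4)$. The two predictions evaluated below are chosen only for illustration:

```latex
\begin{aligned}
C(2, 4) &= Q(2,1)\,[\![ 1 \neq 4 ]\!] + Q(2,2)\,[\![ 2 \neq 4 ]\!]
         + Q(2,3)\,[\![ 3 \neq 4 ]\!] + Q(2,4)\,[\![ 4 \neq 4 ]\!] \\
        &= 1 \cdot 1 + 2 \cdot 1 + 1 \cdot 1 + 0 \cdot 0 = 4, \\
C(2, 2) &= 1 \cdot 1 + 2 \cdot 0 + 1 \cdot 1 + 0 \cdot 1 = 2.
\end{aligned}
```

Both values agree with the fourth and second entries of the cost row $(3, 2, 3, 4)$.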

(14)

### Cost Transformation Methodology: Preliminary

1. split each training example $(x_n, y_n)$ to a weighted mixture $\{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$
2. apply regular classification algorithm on the weighted mixtures $\bigcup_{n=1}^{N} \{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$

by cost equivalence,

good $g$ for new regular classification problem $=$ good $g$ for original cost-sensitive classification problem

regular classification: needs $Q(y_n, \ell) \ge 0$

but what if $Q(y_n, \ell)$ is negative?
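The splitting step of the preliminary methodology might look like the sketch below. The four training examples and the fixed predictions in `g` are hypothetical; `Q` and `C` are the 4-class mixture weights and the cost matrix they factor. The sketch builds the weighted mixture for the whole training set and compares the weighted 0/1 error of `g` on the mixture with its total cost on the original examples.

```python
Q = [[1, 0, 0, 0], [1, 2, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # mixture weights
C = [[0, 1, 1, 1], [3, 2, 3, 4], [1, 1, 0, 1], [1, 1, 1, 0]]  # C = Q * C_c


def split_training_set(examples):
    """Turn each (x, y) into weighted regular examples (x, label, weight)."""
    mixture = []
    for x, y in examples:
        for label, weight in enumerate(Q[y - 1], start=1):
            if weight < 0:
                raise ValueError("negative weight: this example cannot be split")
            if weight > 0:  # zero-weight entries carry no information
                mixture.append((x, label, weight))
    return mixture


def weighted_error(mixture, g):
    """Weighted 0/1 error of classifier g on the mixture."""
    return sum(w for x, label, w in mixture if g(x) != label)


def total_cost(examples, g):
    """Total cost-sensitive cost of g on the original examples."""
    return sum(C[y - 1][g(x) - 1] for x, y in examples)


# Hypothetical training set and a fixed, made-up classifier.
examples = [("a", 1), ("b", 2), ("c", 3), ("d", 2)]
predictions = {"a": 1, "b": 4, "c": 2, "d": 2}
g = predictions.get

mixture = split_training_set(examples)
```

Any regular learner that accepts per-example weights could then be trained on `mixture`. The `ValueError` branch marks the case raised in the closing question, where a negative weight blocks the split.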

(15)

### Similar Cost Vectors

1 0 1 2 3 2 3 4



| {z }

costs

=1/3 4/3 1/3 −2/3

1 2 1 0



| {z }

mixture weights Q(y , `)

·

0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0

| {z }

classification costs negative Q(y , `): cannot split

but ˆc = (1, 0, 1, 2) is similar to c = (3, 2, 3, 4):

for any classifier g,

ˆc[g(x )] + constant = c[g(x )] =XK

`=1Q(y , `)J` 6= g (x )K constant can be dropped during minimization

ming expected ˆC(y , g(x)) (original cost-sensitive problem)

= ming expected RHS (regular problem w/ Q ≥ 0)
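The mixture weights for a single cost vector can be computed in closed form. Because the classification-cost matrix has zeros on the diagonal and ones elsewhere, solving $q \cdot C_c = c$ gives $q_\ell = \frac{1}{K-1}\sum_k c_k - c_\ell$. This formula is derived here for illustration and does not appear in the talk, and the function names are made up. A minimal sketch using exact fractions:

```python
from fractions import Fraction


def mixture_weights(c):
    """Solve q * C_c = c, where C_c has 0 on the diagonal and 1 elsewhere."""
    K = len(c)
    scaled_sum = Fraction(sum(c), K - 1)
    return [scaled_sum - c_l for c_l in c]


def is_splittable(c):
    """A cost vector is splittable when all its mixture weights are nonnegative."""
    return all(q >= 0 for q in mixture_weights(c))


q_hat = mixture_weights([1, 0, 1, 2])  # contains a negative weight
q = mixture_weights([3, 2, 3, 4])      # the same costs shifted by +2
```

Shifting every cost by the same constant raises all weights equally, which is why the shifted vector becomes splittable while the original is not.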

(16)

### Cost Transformation Methodology: Revised

1. shift each row of original cost $\hat{C}$ to a similar and “splittable” $C(y, :)$
2. split $(x_n, y_n)$ to a weighted mixture $\{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$ with $C$
3. apply regular classification algorithm on the weighted mixtures $\bigcup_{n=1}^{N} \{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$

splittable: $Q(y_n, \ell) \ge 0$

by cost equivalence after shifting:

good $g$ for new regular classification problem $=$ good $g$ for original cost-sensitive classification problem

but infinitely many similar and splittable $C$!

(17)

### Uncertainty in Mixture

a single example $\{(x, 2)\}$: certain that the desired label is 2

a mixture $\{(x, 1, 1), (x, 2, 2), (x, 3, 1)\}$ sharing the same $x$: uncertainty in the desired label (25%: 1, 50%: 2, 25%: 3)

over-shifting adds unnecessary mixture uncertainty:

$$\underbrace{\begin{pmatrix} 3 & 2 & 3 & 4 \\ 33 & 32 & 33 & 34 \end{pmatrix}}_{\text{costs}} = \underbrace{\begin{pmatrix} 1 & 2 & 1 & 0 \\ 11 & 12 & 11 & 10 \end{pmatrix}}_{\text{mixture weights}} \cdot \underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{C_c}$$

should choose a similar and splittable $c$ with minimum mixture uncertainty
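To make "mixture uncertainty" concrete, the sketch below computes the entropy of the normalized mixture weights for the two rows in the over-shifting example. Measuring entropy in bits (base 2) is an assumption made here, and `mixture_entropy` is a name invented for this illustration.

```python
import math


def mixture_entropy(weights):
    """Entropy (in bits) of the label distribution given by normalized weights."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]  # zero weights contribute nothing
    return -sum(p * math.log2(p) for p in probs)


modest_shift = mixture_entropy([1, 2, 1, 0])    # weights for costs (3, 2, 3, 4)
over_shift = mixture_entropy([11, 12, 11, 10])  # weights for costs (33, 32, 33, 34)
```

The over-shifted weights are close to uniform over the four labels, so their entropy approaches the 2-bit maximum, while the modest shift keeps it at 1.5 bits.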

(18)

### Cost Transformation Methodology: Final

1. shift original cost $\hat{C}$ to a similar and splittable $C$ with minimum “mixture uncertainty”
2. split $(x_n, y_n)$ to a weighted mixture $\{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$ with $C$
3. apply regular classification algorithm on the weighted mixtures $\bigcup_{n=1}^{N} \{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$

mixture uncertainty: entropy of each normalized $Q(y, :)$

a simple and unique optimal shifting exists for every $\hat{C}$

good $g$ for new regular classification problem $=$ good $g$ for original cost-sensitive classification problem
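The slides state that an optimal shift exists but do not give its formula. As an illustration only, the sketch below computes the smallest constant shift of one cost row that makes every mixture weight nonnegative, which yields weights $q_\ell = \max_k \hat{c}_k - \hat{c}_\ell$. Treating this smallest shift as the minimum-uncertainty choice is an assumption, not something shown in the talk, and the function name is made up.

```python
def smallest_splittable_shift(c_hat):
    """Smallest constant shift of a cost row whose mixture weights are all >= 0.

    Returns (shift, shifted costs, mixture weights). Assumes the regular
    classification cost matrix (0 on the diagonal, 1 elsewhere).
    """
    K = len(c_hat)
    c_max = max(c_hat)
    shift = (K - 1) * c_max - sum(c_hat)     # may be negative if c_hat is over-shifted
    shifted = [c + shift for c in c_hat]
    weights = [c_max - c for c in c_hat]     # the largest cost gets weight 0
    return shift, shifted, weights


from_unsplittable = smallest_splittable_shift([1, 0, 1, 2])
from_over_shifted = smallest_splittable_shift([33, 32, 33, 34])
```

Both inputs land on the same shifted row and the same weights, since they differ only by a constant.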

(19)

### From OVO to CSOVO

One-Versus-One: A Popular Classification Meta-Method

1. for a pair $(i, j)$, take all examples $(x_n, y_n)$ such that $y_n = i$ or $j$
2. train a binary classifier $g^{(i,j)}$ using those examples
3. repeat the previous two steps for all different $(i, j)$
4. predict using the votes from $g^{(i,j)}$

cost-sensitive one-versus-one:

cost transformation + one-versus-one
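The four one-versus-one steps can be sketched as follows. The binary learner `train_nearest_mean` and the one-dimensional toy data are hypothetical stand-ins chosen to keep the example self-contained; any binary classification algorithm could take the learner's place.

```python
from collections import Counter
from itertools import combinations


def train_nearest_mean(subset):
    """Hypothetical binary learner: predict the label whose mean x is closest."""
    sums = {}
    for x, y in subset:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    means = {y: total / count for y, (total, count) in sums.items()}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


def train_ovo(examples, classes, train_binary):
    """Steps 1-3: one binary classifier per unordered pair of classes."""
    classifiers = {}
    for i, j in combinations(classes, 2):
        subset = [(x, y) for x, y in examples if y in (i, j)]
        classifiers[(i, j)] = train_binary(subset)
    return classifiers


def predict_ovo(classifiers, x):
    """Step 4: return the class with the most pairwise votes."""
    votes = Counter(g(x) for g in classifiers.values())
    return votes.most_common(1)[0][0]  # ties are broken arbitrarily


# Made-up 1-D data with three classes.
examples = [(0.0, 1), (0.2, 1), (1.0, 2), (1.2, 2), (2.0, 3), (2.2, 3)]
classifiers = train_ovo(examples, classes=[1, 2, 3], train_binary=train_nearest_mean)
```

With $K$ classes this trains $K(K-1)/2$ binary classifiers, three in this toy case.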

(20)

### Cost-Sensitive One-Versus-One (CSOVO)

1. for a pair $(i, j)$, transform all examples $(x_n, y_n)$ to $\left(x_n, \operatorname*{argmin}_{k \in \{i, j\}} C(y_n, k)\right)$ with weight $\left| C(y_n, i) - C(y_n, j) \right|$
2. train a binary classifier $g^{(i,j)}$ using those examples
3. repeat the previous two steps for all different $(i, j)$
4. predict using the votes from $g^{(i,j)}$

comes with good theoretical guarantee:

$$\text{test cost of final classifier} \le 2 \sum_{i<j} \text{test cost of } g^{(i,j)}$$

simple, efficient, and takes original OVO as special case
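The per-pair transformation in step 1 of CSOVO might be sketched as below. The cost matrix is the age-group one from earlier in the talk, the four examples are hypothetical, and two choices are assumptions of this sketch: examples whose two costs are equal (weight zero) are dropped, and the weight is taken as an absolute difference.

```python
def csovo_binary_examples(examples, C, i, j):
    """Relabel and weight every example for the binary problem on pair (i, j)."""
    transformed = []
    for x, y in examples:
        cost_i, cost_j = C[y - 1][i - 1], C[y - 1][j - 1]
        weight = abs(cost_i - cost_j)
        if weight == 0:
            continue  # both labels are equally costly, so the example is skipped
        label = i if cost_i < cost_j else j  # the cheaper of the two labels
        transformed.append((x, label, weight))
    return transformed


C = [[0, 1, 4, 5], [1, 0, 1, 3], [3, 1, 0, 2], [5, 4, 1, 0]]   # age-group costs
Cc = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]  # regular costs

examples = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]  # hypothetical data

cost_sensitive_pair = csovo_binary_examples(examples, C, 1, 2)
regular_pair = csovo_binary_examples(examples, Cc, 1, 2)
```

With the regular cost matrix, only examples labeled 1 or 2 survive, each with weight 1, so the transformation reduces to the plain one-versus-one step. With the age-group costs, examples of classes 3 and 4 also take part, relabeled as the cheaper of the two candidates.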

(21)

### CSOVO v.s. WAP

[Figure: avg. test random cost (axis from 0 to 200) of CSOVO and WAP on the data sets veh, vow, seg, dna, sat, usp]

a general cost-sensitive setup with “random” cost

WAP (Abe et al., 2004): related to CSOVO, but more complicated and slower

couple both meta-methods with SVM

CSOVO: simpler, faster, with similar performance; a preferable choice

(22)

### CSOVO v.s. OVO

[Figure: avg. test random cost (axis from 0 to 200) of CSOVO and OVO on the data sets veh, vow, seg, dna, sat, usp]

OVO: popular regular classification meta-method, NOT cost-sensitive

couple both meta-methods with SVM

CSOVO often better suited for cost-sensitive classification

(23)

### Conclusion

cost transformation methodology: makes any (robust) regular classification algorithm cost-sensitive

theoretical guarantee: cost equivalence

algorithmic use: a novel and simple algorithm CSOVO

experimental performance of CSOVO: superior

many more cost-sensitive algorithms can be designed similarly
