# Cost-sensitive Multiclass Classification Using One-versus-one Comparisons

### Cost-sensitive Multiclass Classification Using One-versus-one Comparisons

Hsuan-Tien Lin

Dept. of CSIE, NTU

Talk at NTHU EE, 04/09/2010


### Which Digit Did You Write?

one (1) two (2) three (3) four (4)

a classification problem: grouping “pictures” into different “categories”

How can machines learn to classify?

### Supervised Machine Learning

Human learning: Parent → (picture, category) pairs → Kid’s brain → good decision function, with the brain choosing among possibilities.

Machine learning: truth $f(x)$ + noise $e(x)$ → examples (picture $x_n$, category $y_n$) → learning algorithm → good decision function $g(x) \approx f(x)$, with the algorithm choosing from the learning model $\{g_\alpha(x)\}$.

challenge: see only $\{(x_n, y_n)\}$ without knowing $f(x)$ or $e(x)$

$\Longrightarrow$ how to generalize to unseen $(x, y)$ w.r.t. $f(x)$?

### Mis-prediction Costs

($g(x) \approx f(x)$?)

ZIP code recognition: 1: wrong; 2: right; 3: wrong; 4: wrong

check value recognition: 1: one-dollar mistake; 2: no mistake; 3: one-dollar mistake; 4: two-dollar mistake

evaluation by formation similarity: 1: not very similar; 2: very similar; 3: somewhat similar; 4: a silly prediction

different applications evaluate mis-predictions differently

### ZIP Code Recognition

1: wrong; 2: right; 3: wrong; 4: wrong

regular classification problem: only right or wrong

wrong cost: 1; right cost: 0

prediction error of $g$ on some $(x, y)$:

$$\text{classification cost} = [\![\, y \neq g(x) \,]\!]$$

regular classification: well-studied, many good algorithms

### Check Value Recognition

1: one-dollar mistake; 2: no mistake; 3: one-dollar mistake; 4: two-dollar mistake

cost-sensitive classification problem: different costs for different mis-predictions

e.g. prediction error of $g$ on some $(x, y)$:

$$\text{absolute cost} = |y - g(x)|$$

cost-sensitive classification: new, needs more research

### Which Age-Group?

infant (1) child (2) teen (3) adult (4)

small mistake: classify a child as a teen

big mistake: classify an infant as an adult

prediction error of $g$ on some $(x, y)$:

$$C(y, g(x)), \quad \text{where } C = \begin{pmatrix} 0 & 1 & 4 & 5 \\ 1 & 0 & 1 & 3 \\ 3 & 1 & 0 & 2 \\ 5 & 4 & 1 & 0 \end{pmatrix}$$

$C$: cost matrix
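The three error measures seen so far can be compared side by side. The sketch below is only an illustration: the function names are hypothetical, labels are 1-based as on the slides, and it assumes rows of $C$ index the true label $y$ and columns index the prediction $g(x)$.

```python
# Age-group cost matrix from the slide above.
# Assumption: C[y-1][k-1] is the cost of predicting k when the truth is y.
C = [
    [0, 1, 4, 5],
    [1, 0, 1, 3],
    [3, 1, 0, 2],
    [5, 4, 1, 0],
]

def classification_cost(y, pred):
    """Regular classification: 1 if wrong, 0 if right."""
    return int(y != pred)

def absolute_cost(y, pred):
    """Check-value style cost: size of the numeric mistake."""
    return abs(y - pred)

def matrix_cost(C, y, pred):
    """General cost-sensitive cost: look up C(y, g(x)) with 1-based labels."""
    return C[y - 1][pred - 1]

# child (2) predicted as teen (3) is a small mistake;
# infant (1) predicted as adult (4) is a big one.
small = matrix_cost(C, 2, 3)
big = matrix_cost(C, 1, 4)
```

Both mistakes count as 1 under `classification_cost`, which is why the regular setup cannot tell them apart.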

### Cost Matrix C

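regular classification: $C$ = classification cost $C_c$:

$$C_c = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}$$

cost-sensitive classification: $C$ = anything other than $C_c$:

$$C = \begin{pmatrix} 0 & 1 & 4 & 5 \\ 1 & 0 & 1 & 3 \\ 3 & 1 & 0 & 2 \\ 5 & 4 & 1 & 0 \end{pmatrix}$$

regular classification: special case of cost-sensitive classification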

### Cost-Sensitive Binary Classification (1/2)

medical profile $x$ → ?

medical profile $x_1$: H1N1 (1); medical profile $x_2$: NOH1N1 (2)

predicting H1N1 as NOH1N1: serious consequences to public health

predicting NOH1N1 as H1N1: not good, but less serious

cost-sensitive $C = \begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix}$; regular $C_c = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$

how to change the entry from 1 to 1000?

### Cost-Sensitive Binary Classification (2/2)

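copy each case labeled H1N1 1000 times

original problem, evaluate w/ $\begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix}$:

$(x_1, \text{H1N1})$, $(x_2, \text{NOH1N1})$, $(x_3, \text{NOH1N1})$, $(x_4, \text{NOH1N1})$, $(x_5, \text{H1N1})$

equivalent problem, evaluate w/ $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$:

$(x_1, \text{H1N1}), \cdots, (x_1, \text{H1N1})$, $(x_2, \text{NOH1N1})$, $(x_3, \text{NOH1N1})$, $(x_4, \text{NOH1N1})$, $(x_5, \text{H1N1}), \cdots, (x_5, \text{H1N1})$

mathematically:

$$\begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 1000 & 0 \\ 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$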
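The copying argument can be checked numerically. The following is a minimal sketch, assuming 0-based label codes and hypothetical helper names (`replicate`, `total_cost`); it shows that any fixed predictor has the same total cost on the original data under $C$ as on the replicated data under $C_c$.

```python
H1N1, NOH1N1 = 0, 1  # hypothetical 0-based label codes

# cost[true][predicted], matching the matrices on the slide
C = [[0, 1000], [1, 0]]   # cost-sensitive
Cc = [[0, 1], [1, 0]]     # regular classification cost

original = [("x1", H1N1), ("x2", NOH1N1), ("x3", NOH1N1),
            ("x4", NOH1N1), ("x5", H1N1)]

def replicate(examples, copies_per_label):
    """Repeat each example as many times as its label requires."""
    return [(x, y) for x, y in examples for _ in range(copies_per_label[y])]

def total_cost(examples, predict, cost):
    """Sum of cost[y][predict(x)] over all examples."""
    return sum(cost[y][predict(x)] for x, y in examples)

# copy each H1N1 case 1000 times, keep NOH1N1 cases as they are
replicated = replicate(original, {H1N1: 1000, NOH1N1: 1})

always_h1n1 = lambda x: H1N1
always_no = lambda x: NOH1N1
mixed = lambda x: H1N1 if x in ("x1", "x2") else NOH1N1
```

In practice one would usually pass the counts as example weights to a learner that supports them instead of physically copying rows; the copies only make the equivalence easy to see.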

### Our Contribution

| | binary | multiclass |
|---|---|---|
| regular | well-studied | well-studied |
| cost-sensitive | known (Zadrozny, 2003) | ongoing (our work, among others) |

a theoretical and algorithmic study of cost-sensitive classification, which ...

introduces a methodology for extending regular classification algorithms to cost-sensitive ones with any cost

provides strong theoretical support for the methodology

leads to some promising algorithms with superior experimental results

will describe the methodology and a concrete algorithm

### Key Idea: Cost Transformation

0 1000

1 0



| {z }

C

=1000 0

0 1



| {z }

#of copies

·0 1 1 0



| {z }

Cc

0 1 1 1 3 2 3 4 1 1 0 1 1 1 1 0

| {z }

C

=

1 0 0 0 1 2 1 0 0 0 1 0 0 0 0 1

| {z }

mixture weightsQ

·

0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0

| {z }

Cc

split the cost-sensitive example:

(x , 2)

=⇒ a mixture of regular examples {(x, 1), (x, 2), (x, 2), (x, 3)}

or a weighted mixture {(x , 1, 1), (x , 2, 2), (x , 3, 1)}

(13)
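The factorization and the split of $(x, 2)$ can be reproduced in a few lines. This is a sketch in plain Python with hypothetical helper names (`matmul`, `split`); labels are 1-based as on the slide.

```python
K = 4

# Regular classification cost: 0 on the diagonal, 1 elsewhere.
Cc = [[0 if i == j else 1 for j in range(K)] for i in range(K)]

# Mixture weights Q and cost matrix C, copied from the slide.
Q = [[1, 0, 0, 0],
     [1, 2, 1, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
C = [[0, 1, 1, 1],
     [3, 2, 3, 4],
     [1, 1, 0, 1],
     [1, 1, 1, 0]]

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def split(x, y, Q):
    """Weighted mixture for (x, y): one (x, label, weight) per nonzero weight."""
    return [(x, label + 1, w) for label, w in enumerate(Q[y - 1]) if w != 0]

product = matmul(Q, Cc)    # should reproduce C
mixture = split("x", 2, Q)  # weighted mixture for the example (x, 2)
```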

### Cost Equivalence by Splitting

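$$\underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 3 & 2 & 3 & 4 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{C} = \underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 2 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}}_{\text{mixture weights } Q} \cdot \underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{C_c}$$

$(x, 2)$ $\Longrightarrow$ a weighted mixture $\{(x, 1, 1), (x, 2, 2), (x, 3, 1)\}$

cost equivalence: for any classifier $g$,

$$C(y, g(x)) = \sum_{\ell=1}^{K} Q(y, \ell)\, [\![\, \ell \neq g(x) \,]\!]$$

$\min_g$ expected LHS (original cost-sensitive problem)

$= \min_g$ expected RHS (a regular problem when $Q(y, \ell) \geq 0$)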
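A short worked check of the identity may help. It relies only on the fact that the classification cost charges 1 for every label except the predicted one, and it uses the weights $Q(2, \cdot) = (1, 2, 1, 0)$ from the slide; the symbol $k$ for the predicted label $g(x)$ is introduced here for convenience.

```latex
\sum_{\ell=1}^{K} Q(y,\ell)\,[\![\,\ell \neq k\,]\!]
  \;=\; \sum_{\ell=1}^{K} Q(y,\ell) \;-\; Q(y,k)

% For y = 2: the weights sum to 1 + 2 + 1 + 0 = 4, so
C(2,k) \;=\; 4 - Q(2,k)
\quad\Longrightarrow\quad
\bigl(C(2,1),\,C(2,2),\,C(2,3),\,C(2,4)\bigr) = (3,\,2,\,3,\,4)
```

This matches the second row of $C$, so predicting any label costs the same under $C$ as the total weight of the mixture examples that disagree with it.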

### Cost Transformation Methodology: Preliminary

1 split each training example $(x_n, y_n)$ to a weighted mixture $\{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$

2 apply regular classification algorithm on the weighted mixtures $\bigcup_{n=1}^{N} \{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$

by cost equivalence, good $g$ for new regular classification problem = good $g$ for original cost-sensitive classification problem

regular classification: needs $Q(y_n, \ell) \geq 0$

but what if $Q(y_n, \ell)$ is negative?

### Similar Cost Vectors

1 0 1 2 3 2 3 4



| {z }

costs

=1/3 4/3 1/3 −2/3

1 2 1 0



| {z }

mixture weights Q(y , `)

·

0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0

| {z }

classification costs negative Q(y , `): cannot split

but ˆc = (1, 0, 1, 2) is similar to c = (3, 2, 3, 4):

for any classifier g,

ˆc[g(x )] + constant = c[g(x )] =XK

`=1Q(y , `)J` 6= g (x )K constant can be dropped during minimization

ming expected ˆC(y , g(x)) (original cost-sensitive problem)

= ming expected RHS (regular problem w/ Q ≥ 0)

(16)
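The weights on this slide can be recomputed from the cost vectors. The closed form below is not stated on the slides; it is derived here from $c[k] = \sum_\ell Q(\ell) - Q(k)$, which gives $\sum_\ell Q(\ell) = \sum_k c[k] / (K - 1)$. The function name `mixture_weights` is hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def mixture_weights(c):
    """Solve c = Q . Cc for one cost row c, with Cc = all-ones minus identity.

    Derivation (an assumption of this sketch, not quoted from the slides):
    c[k] = sum(Q) - Q[k], so sum(c) = (K - 1) * sum(Q).
    """
    K = len(c)
    total = Fraction(sum(c), K - 1)   # this is sum(Q)
    return [total - ck for ck in c]

q_hat = mixture_weights([1, 0, 1, 2])   # contains a negative weight
q = mixture_weights([3, 2, 3, 4])       # all weights nonnegative
```

Adding the same constant to every entry of a cost row raises every weight by the same amount, which is why a large enough shift always makes the row splittable.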

### Cost Transformation Methodology: Revised

1 shift each row of original cost $\hat{C}$ to a similar and “splittable” $C(y, :)$

2 split $(x_n, y_n)$ to a weighted mixture $\{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$ with $C$

3 apply regular classification algorithm on the weighted mixtures $\bigcup_{n=1}^{N} \{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$

splittable: $Q(y_n, \ell) \geq 0$

by cost equivalence after shifting: good $g$ for new regular classification problem = good $g$ for original cost-sensitive classification problem

but infinitely many similar and splittable $C$!

### Uncertainty in Mixture

a single example $\{(x, 2)\}$: certain that the desired label is 2

a mixture $\{(x, 1, 1), (x, 2, 2), (x, 3, 1)\}$ sharing the same $x$: uncertainty in the desired label (25%: 1, 50%: 2, 25%: 3)

over-shifting adds unnecessary mixture uncertainty:

$$\underbrace{\begin{pmatrix} 3 & 2 & 3 & 4 \\ 33 & 32 & 33 & 34 \end{pmatrix}}_{\text{costs}} = \underbrace{\begin{pmatrix} 1 & 2 & 1 & 0 \\ 11 & 12 & 11 & 10 \end{pmatrix}}_{\text{mixture weights}} \cdot \underbrace{\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}}_{C_c}$$

should choose a similar and splittable $c$ with minimum mixture uncertainty
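To make "mixture uncertainty" concrete, the sketch below measures it as the entropy (in bits) of the normalized weights, for the two weight rows on the slide. The function name `mixture_entropy` is hypothetical, and the choice of base-2 logarithm is an assumption of this example.

```python
import math

def mixture_entropy(weights):
    """Entropy in bits of the label distribution obtained by normalizing weights."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]  # zero weights contribute nothing
    return -sum(p * math.log2(p) for p in probs)

modest = mixture_entropy([1, 2, 1, 0])          # weights for costs (3, 2, 3, 4)
over_shifted = mixture_entropy([11, 12, 11, 10])  # weights for costs (33, 32, 33, 34)
certain = mixture_entropy([1, 0, 0, 0])          # a single regular example
```

The over-shifted row is close to uniform over the four labels, so it carries almost no information about which label is preferred.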

### Cost Transformation Methodology: Final

1 shift original cost $\hat{C}$ to a similar and splittable $C$ with minimum “mixture uncertainty”

2 split $(x_n, y_n)$ to a weighted mixture $\{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$ with $C$

3 apply regular classification algorithm on the weighted mixtures $\bigcup_{n=1}^{N} \{(x_n, \ell, Q(y_n, \ell))\}_{\ell=1}^{K}$

mixture uncertainty: entropy of each normalized $Q(y, :)$

a simple and unique optimal shifting exists for every $\hat{C}$

good $g$ for new regular classification problem = good $g$ for original cost-sensitive classification problem
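Steps 1 and 2 can be sketched as follows. The slides do not give the formula for the optimal shift, so this example assumes it is the smallest shift that keeps every weight nonnegative, which works out to $Q(y, \ell) = \max_k \hat{C}(y, k) - \hat{C}(y, \ell)$; treat that formula and the function names as assumptions of this sketch.

```python
def splittable_weights(c_hat):
    """Weights for the smallest shift of c_hat that leaves no negative weight.

    Assumption of this sketch: Q[l] = max(c_hat) - c_hat[l]. The implied
    shifted cost row is c_hat plus the constant (K - 1) * max(c_hat) - sum(c_hat).
    """
    top = max(c_hat)
    return [top - c for c in c_hat]

def transform(examples, C_hat):
    """Turn cost-sensitive examples (x, y) into weighted regular examples.

    Labels are 1-based; returns a list of (x, label, weight) with weight > 0.
    """
    weighted = []
    for x, y in examples:
        q = splittable_weights(C_hat[y - 1])
        weighted.extend((x, label + 1, w) for label, w in enumerate(q) if w > 0)
    return weighted

# Row 2 is the cost vector (1, 0, 1, 2) discussed earlier; the other rows
# are regular classification costs.
C_hat = [[0, 1, 1, 1],
         [1, 0, 1, 2],
         [1, 1, 0, 1],
         [1, 1, 1, 0]]

mixtures = transform([("a", 2), ("b", 1)], C_hat)
```

Step 3 would then hand `mixtures` to any regular learner that accepts per-example weights. For a regular cost row such as `(0, 1, 1, 1)`, the transform returns the original example with weight 1, so regular classification is recovered unchanged.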

### From OVO to CSOVO

One-Versus-One: A Popular Classification Meta-Method

1 for a pair $(i, j)$, take all examples $(x_n, y_n)$ with $y_n = i$ or $j$

2 train a binary classifier $g^{(i,j)}$ using those examples

3 repeat the previous two steps for all different $(i, j)$

4 predict using the votes from $g^{(i,j)}$

cost-sensitive one-versus-one: cost transformation + one-versus-one
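The data selection and voting steps of one-versus-one can be sketched without committing to a particular binary learner. In the example below the function names are hypothetical, the binary classifiers are hand-written stand-ins, and ties between labels are broken arbitrarily (by whichever label received a vote first), which is a choice of this sketch.

```python
from collections import Counter
from itertools import combinations

def ovo_subsets(examples, labels):
    """Steps 1 and 3: for every pair (i, j), keep the examples labeled i or j."""
    return {(i, j): [(x, y) for x, y in examples if y in (i, j)]
            for i, j in combinations(labels, 2)}

def ovo_predict(x, classifiers):
    """Step 4: every g[(i, j)] votes for i or j; the most-voted label wins."""
    votes = Counter(g(x) for g in classifiers.values())
    return votes.most_common(1)[0][0]

examples = [(0.0, 1), (1.0, 2), (2.0, 3)]
subsets = ovo_subsets(examples, [1, 2, 3])

# Hand-written stand-ins for trained binary classifiers (step 2).
classifiers = {
    (1, 2): lambda x: 2,
    (1, 3): lambda x: 3,
    (2, 3): lambda x: 2,
}
winner = ovo_predict(1.0, classifiers)
```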

### Cost-Sensitive One-Versus-One (CSOVO)

1 for a pair $(i, j)$, transform all examples $(x_n, y_n)$ to $\left(x_n, \operatorname{argmin}_{k \in \{i, j\}} C(y_n, k)\right)$ with weight $\left|C(y_n, i) - C(y_n, j)\right|$

2 train a binary classifier $g^{(i,j)}$ using those examples

3 repeat the previous two steps for all different $(i, j)$

4 predict using the votes from $g^{(i,j)}$

comes with good theoretical guarantee:

$$\text{test cost of final classifier} \leq 2 \sum_{i<j} \text{test cost of } g^{(i,j)}$$

simple, efficient, and takes original OVO as special case
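The four CSOVO steps can be sketched end to end. This is an illustration only: the binary learner is a hypothetical weighted nearest-mean rule on 1-D inputs, standing in for whatever weighted binary classifier is actually used (the experiments in the talk couple the meta-method with SVM). The weight is taken as the absolute cost difference, examples with zero weight are dropped, and every pair is assumed to end up with at least one example.

```python
from collections import Counter
from itertools import combinations

def csovo_pair_examples(examples, C, i, j):
    """Step 1: relabel each (x, y) with the cheaper of i and j, weighted by the cost gap."""
    out = []
    for x, y in examples:
        ci, cj = C[y - 1][i - 1], C[y - 1][j - 1]
        if ci != cj:  # zero weight means no preference between i and j
            out.append((x, i if ci < cj else j, abs(ci - cj)))
    return out

def train_weighted_mean(weighted):
    """Step 2 stand-in (hypothetical): predict the label whose weighted mean is nearest."""
    sums, totals = {}, {}
    for x, label, w in weighted:
        sums[label] = sums.get(label, 0.0) + w * x
        totals[label] = totals.get(label, 0.0) + w
    means = {label: sums[label] / totals[label] for label in sums}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def train_csovo(examples, C):
    """Step 3: one weighted binary classifier per pair of labels."""
    K = len(C)
    return {(i, j): train_weighted_mean(csovo_pair_examples(examples, C, i, j))
            for i, j in combinations(range(1, K + 1), 2)}

def csovo_predict(x, classifiers):
    """Step 4: majority vote over all pairwise classifiers."""
    votes = Counter(g(x) for g in classifiers.values())
    return votes.most_common(1)[0][0]

# With the regular cost Cc, step 1 keeps exactly the examples labeled i or j
# with weight 1, so the sketch behaves like plain OVO.
Cc = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
data = [(0.0, 1), (0.2, 1), (1.0, 2), (2.0, 3)]
model = train_csovo(data, Cc)
```

With a genuinely cost-sensitive matrix, examples from classes outside the pair also take part in training $g^{(i,j)}$, weighted by how much more one of the two predictions would cost.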

### CSOVO v.s. WAP

[Figure: avg. test random cost (axis from 0 to 200) of CSOVO and WAP on the data sets veh, vow, seg, dna, sat, usp]

a general cost-sensitive setup with “random” cost

WAP (Abe et al., 2004): related to CSOVO, but more complicated and slower

couple both meta-methods with SVM

CSOVO: simpler, faster, with similar performance; a preferable choice

### CSOVO v.s. OVO

[Figure: avg. test random cost (axis from 0 to 200) of CSOVO and OVO on the data sets veh, vow, seg, dna, sat, usp]

OVO: popular regular classification meta-method, NOT cost-sensitive

couple both meta-methods with SVM

CSOVO often better suited for cost-sensitive classification

### Conclusion

cost transformation methodology: makes any (robust) regular classification algorithm cost-sensitive

theoretical guarantee: cost equivalence

algorithmic use: a novel and simple algorithm CSOVO

experimental performance of CSOVO: superior

many more cost-sensitive algorithms can be designed similarly
