
### A Simple Cost-sensitive Multiclass Classification Algorithm Using One-versus-one Comparisons

Hsuan-Tien Lin

Abstract Many real-world applications require varying costs for different types of mis-classification errors. Such a cost-sensitive classification setup can be very different from the regular classification one, especially in the multiclass case. Thus, traditional meta-algorithms for regular multiclass classification, such as the popular one-versus-one approach, may not always work well under the cost-sensitive classification setup. In this paper, we extend the one-versus-one approach to the field of cost-sensitive classification. The extension is derived using a rigorous mathematical tool called the cost-transformation technique, and takes the original one-versus-one as a special case. Experimental results demonstrate that the proposed approach can achieve better performance in many cost-sensitive classification scenarios when compared with the original one-versus-one as well as existing cost-sensitive classification algorithms.

Keywords cost-sensitive classification, one-versus-one, meta-learning

1 Introduction

Many real-world applications of machine learning and data mining require evaluating the learned system with different costs for different types of mis-classification errors. For instance, a false-negative prediction for a spam classification system only takes the user an extra second to delete the email, while a false-positive prediction can mean a huge loss when the email actually carries important information. When recommending movies to a subscriber with preference "romance over action over horror", the cost of mis-predicting a romance movie as a horror one should be significantly higher than the cost of mis-predicting the movie as an action one. Such a need is also shared by applications like targeted marketing, information retrieval, medical decision making, object recognition and intrusion detection (Abe et al, 2004), and can be formalized as the cost-sensitive classification setup.

In fact, cost-sensitive classification can be used to express any finite-choice and bounded-loss supervised learning setup (Beygelzimer et al, 2005). Thus, it has been attracting much research attention in recent years (Domingos, 1999; Margineantu, 2001; Abe et al, 2004; Beygelzimer et al, 2005; Langford and Beygelzimer, 2005; Beygelzimer et al, 2007).

H.-T. Lin

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.

E-mail: htlin@csie.ntu.edu.tw

Abe et al (2004) grouped existing research on cost-sensitive classification into three categories: making a particular classifier cost-sensitive, making the prediction procedure cost-sensitive, and making the training procedure cost-sensitive. The third category contains mostly meta-algorithms that reweight training examples before feeding them into the underlying learning algorithm. Such a meta-algorithm can be used to make any existing algorithm cost-sensitive. While a promising meta-algorithm exists and is well understood for cost-sensitive binary classification (Zadrozny et al, 2003), the counterpart for multiclass classification remains an ongoing research issue (Abe et al, 2004; Langford and Beygelzimer, 2005; Zhou and Liu, 2006).

In this paper, we propose a general meta-algorithm that reduces cost-sensitive multiclass classification tasks to regular classification ones. The meta-algorithm is based on the cost-transformation technique, which converts one cost to another by not only reweighting the original training examples, but also relabeling them. We show that any cost can be transformed to the regular mis-classification one with the cost-transformation technique. As a consequence, general cost-sensitive classification and general regular classification tasks are equivalent in terms of hardness.

We further couple the meta-algorithm with another popular meta-algorithm in regular classification: the one-versus-one (OVO) decomposition from multiclass to binary. The resulting algorithm, which is called cost-sensitive one-versus-one (CSOVO), can perform cost-sensitive multiclass classification with any base binary classifier. Interestingly, CSOVO is algorithmically similar to an existing meta-algorithm for cost-sensitive classification: weighted all-pairs (WAP; Beygelzimer et al, 2005). Nevertheless, CSOVO is not only simpler but also more efficient than WAP. Our experimental results on real-world data sets demonstrate that CSOVO performs similarly to WAP, while both of them can be significantly better than OVO. Therefore, CSOVO is a preferable OVO-type cost-sensitive classification algorithm. Moreover, when compared with other meta-algorithms that reduce cost-sensitive classification to binary classification, namely the one-versus-all (Lin, 2008), error-correcting output code (Langford and Beygelzimer, 2005), tree (Beygelzimer et al, 2005), filter tree and all-pair filter tree (Beygelzimer et al, 2007) decompositions, we see that CSOVO can often achieve the best test performance. Those results further validate the usefulness of CSOVO.

The paper is organized as follows. In Section 2, we formalize the cost-sensitive classification setup. Then, we present the cost-transformation technique with its theoretical implications in Section 3, and derive our proposed CSOVO algorithm in Section 4. Finally, we compare CSOVO with other algorithms empirically in Section 5 and conclude in Section 6.

2 Problem Setup

We start by defining the setup that will be used in this paper.

Definition 1 (weighted classification) Assume that there is an unknown distribution D_w on X × Y × R^+, where the input space X ⊆ R^D and the label space Y = {1, 2, ..., K}. A weighted example is a tuple (x, y, w) ∈ X × Y × R^+, where the non-negative number w ∈ R^+ is called the weight. In the weighted classification setup, we are given a set of i.i.d. weighted training examples S_w = {(x_n, y_n, w_n)}_{n=1}^N ∼ D_w^N. Use

E(g, D) ≡ E_{(x,y,w)∼D} w · ⟦y ≠ g(x)⟧

to denote the expected weighted classification error of any classifier g : X → Y with respect to some distribution D. The goal of weighted classification is to use S_w to find a classifier ĝ such that E(ĝ, D_w) is small.

For K = 2, the setup is called (weighted) binary classification; for K > 2, the setup is called (weighted) multiclass classification. When the weights are constants (say, 1), weighted classification becomes a special case called regular classification, which has been widely and deeply studied for years (Beygelzimer et al, 2005). In general, weighted classification can be easily reduced to regular classification for both binary and multiclass cases using the famous COSTING reduction (Zadrozny et al, 2003). In addition, many of the existing regular classification algorithms can be easily extended to perform weighted classification. Thus, there are plenty of useful theoretical and algorithmic tools for both weighted classification and regular classification (Beygelzimer et al, 2005).

The main setup that we will study in this paper is cost-sensitive classification, which is more general than weighted classification.

Definition 2 (cost-sensitive classification) Assume that there is an unknown distribution D_c on X × Y × R^K. A cost-sensitive example is a tuple (x, y, c) ∈ X × Y × R^K, where c[k] denotes the cost to be paid when x is predicted as category k. In the cost-sensitive classification setup, we are given a set of i.i.d. cost-sensitive training examples S_c = {(x_n, y_n, c_n)}_{n=1}^N ∼ D_c^N. We shall reuse

E(g, D) ≡ E_{(x,y,c)∼D} c[g(x)]

to denote the expected cost of any classifier g : X → Y with respect to some distribution D. The goal of cost-sensitive classification is to use S_c to find a classifier ĝ such that E(ĝ, D_c) is small.

We make two remarks here. First, when looking at the definition of E, we see that the label y is actually not needed in evaluating the classifier g. We keep the label there to better illustrate the connection between cost-sensitive and regular/weighted classification. Naturally, we assume that D_c would only generate examples (x, y, c) such that c[y] = c_min = min_{1≤ℓ≤K} c[ℓ].

Secondly, let us define the classification cost vectors c_c^{(ℓ)}[k] ≡ ⟦ℓ ≠ k⟧. We see that weighted classification is a special case of cost-sensitive classification using c = w · c_c^{(y)} as the cost vector of (x, y, w), and regular classification is a special case of cost-sensitive classification using c = c_c^{(y)} as the cost vector.

While both regular and weighted classification have been widely studied, cost-sensitive classification is theoretically well understood only in the binary case (Zadrozny et al, 2003), in which weighted classification and cost-sensitive classification simply coincide. Next, we will introduce the cost-transformation technique, which allows us to tightly connect cost-sensitive classification with regular/weighted classification, and helps understand cost-sensitive classification better in the multiclass case.

3 Cost Transformation

Cost transformation is a tool that connects a cost vector to other cost vectors. In particular, we hope to link any cost vector c with the set of classification cost vectors C_c = {c_c^{(ℓ)}}_{ℓ=1}^K, because the link allows us to reduce cost-sensitive classification (which deals with c) to regular classification (which deals with C_c). We start introducing the cost-transformation technique by making two definitions about the relations between cost vectors. The first definition relates two cost vectors c̃ and c.

Definition 3 (similar cost vectors) A cost vector c̃ is similar to c by ∆ if and only if c̃[·] = c[·] + ∆ with some constant ∆.

For instance, (4, 3, 2, 3) is similar to (2, 1, 0, 1) by 2. We shall omit the "by ∆" part when it is clear from the context. Note that when c̃ is similar to c, using c̃ for evaluating a prediction g(x) is equivalent to using c plus a constant cost of ∆. The constant shifting from c to c̃ does not change the relative cost difference between the prediction g(x) and the best prediction y.

Next, we relate a cost vector c to a set of cost vectors C_b.

Definition 4 (decomposable cost vectors) A cost vector c is decomposable to a set of base cost vectors C_b = {c_b^{(t)}}_{t=1}^T if and only if there exist non-negative coefficients q[t] such that c[·] = Σ_{t=1}^T q[t] · c_b^{(t)}[·].

That is, a cost vector c is decomposable to C_b if we can split c into a conic combination of the base cost vectors c_b^{(t)}. Why is such a decomposition useful? Let us take a cost-sensitive example (x, y, c) and choose the classification cost C_c as C_b. If c is decomposable to C_c, then for any classifier g,

c[g(x)] = Σ_{ℓ=1}^K q[ℓ] · c_c^{(ℓ)}[g(x)] = Σ_{ℓ=1}^K q[ℓ] · ⟦ℓ ≠ g(x)⟧.

That is, if we randomly generate ℓ proportional to q[ℓ] and relabel the cost-sensitive example (x, y, c) to a regular one (x, ℓ, w = 1), then the cost that any classifier g needs to pay for its prediction on x is proportional to the expected classification error, where the expectation is taken with respect to the relabeling process. Thus, if a classifier g performs well for the "relabeled" (regular classification) task, it would also perform well for the original cost-sensitive classification task. The non-negativity of q[ℓ] ensures that q can be normalized to form a probability distribution.^1

Definition 4 is a key of the cost-transformation technique. It not only allows us to transform one cost vector c to an equivalent representation {(q[t], c_b^{(t)})} for some general C_b, but more specifically also lets us relabel a cost-sensitive example (x, y, c) to another (randomized) regular example (x, ℓ, 1). However, is every cost vector c decomposable to the classification cost C_c? The short answer is no. For instance, the cost vector c = (6, 3, 0, 3) is not decomposable to C_c, because c yields a unique linear decomposition over C_c with some negative coefficients:

c = −2 · c_c^{(1)} + 1 · c_c^{(2)} + 4 · c_c^{(3)} + 1 · c_c^{(4)}.

Although the cost vector (6, 3, 0, 3) itself is not decomposable to C_c, we can easily see that its similar cost vector, (12, 9, 6, 9), is decomposable to C_c. In fact, for any cost vector c, there are infinitely many similar cost vectors c̃ that are decomposable to C_c, as formalized below.

1 We take a minor assumption that not all q[ℓ] are zero. Otherwise c = 0 and the example (x, y, c) can simply be discarded.

Theorem 1 (decomposition via a similar vector; Lin 2008) Consider any cost vector c. Assume that c̃ is similar to c by ∆. Then, c̃ is decomposable to C_c if and only if

∆ ≥ (K − 1) c_max − Σ_{k=1}^K c[k],

where c_max = max_{1≤ℓ≤K} c[ℓ].

The proof (Lin, 2008) uses the fact that C_c is linearly independent while spanning R^K, and that the constant cost vector (∆, ∆, ..., ∆) with a positive ∆ is decomposable to C_c with q[ℓ] = ∆ / (K − 1).

From Theorem 1, there are infinitely many cost vectors c̃ that we can use. The next question is, which is most preferable? Recall that with a given q̃, the relabeling probability distribution is p̃[ℓ] = q̃[ℓ] / Σ_{k=1}^K q̃[k]. To reduce the variance with respect to the relabeling process, one possibility is to require the discrete probability distribution p̃[·] to be of the least entropy. That is, we want to solve the following optimization problem:

min_{p̃, q̃, ∆}  Σ_{ℓ=1}^K p̃[ℓ] log(1 / p̃[ℓ]),   (1)

subject to  c[·] = Σ_{ℓ=1}^K q̃[ℓ] · c_c^{(ℓ)}[·] − ∆,

            p̃[·] = q̃[·] / Σ_{k=1}^K q̃[k],

            ∆ ≥ (K − 1) c_max − Σ_{k=1}^K c[k].

Theorem 2 (decomposition with minimum entropy; Lin 2008) If not all c[ℓ] are equal, the unique optimal solution to (1) is

q̃[ℓ] = c_max − c[ℓ],   (2)

∆ = (K − 1) c_max − Σ_{k=1}^K c[k].   (3)

Note that the resulting ∆ is the smallest one that makes c̃ decomposable to C_c. The details of the proof can be found in the previous work (Lin, 2008). Using Theorem 2, we can then define the following distribution D_r(x, ℓ, w) from D_c(x, y, c).
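As a concrete illustration of (2) and (3), the following Python sketch (with our own helper names, not from the paper) computes q̃ and ∆ from a cost vector and numerically verifies the decomposition c[k] + ∆ = Σ_ℓ q̃[ℓ] · c_c^{(ℓ)}[k] on the running example c = (6, 3, 0, 3):

```python
def transform_cost(c):
    """Compute (q~, Delta) from a cost vector c, following equations (2) and (3)."""
    K, c_max = len(c), max(c)
    q = [c_max - c[l] for l in range(K)]   # (2): q~[l] = c_max - c[l]
    delta = (K - 1) * c_max - sum(c)       # (3): smallest Delta making c~ decomposable
    return q, delta

def classification_cost(l, K):
    """The classification cost vector c_c^{(l)}, with c_c^{(l)}[k] = [l != k]."""
    return [0 if k == l else 1 for k in range(K)]

c = [6, 3, 0, 3]                           # the running example from the text
q, delta = transform_cost(c)               # q = [0, 3, 6, 3], delta = 6

# the similar vector c + delta = (12, 9, 6, 9) is a conic combination of C_c
K = len(c)
recon = [sum(q[l] * classification_cost(l, K)[k] for l in range(K)) for k in range(K)]
assert recon == [c[k] + delta for k in range(K)]
```

Here the sketch confirms that only the shifted vector (12, 9, 6, 9), not (6, 3, 0, 3) itself, admits non-negative coefficients.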

D_r(x, ℓ, w) = ⟦w = 1⟧ · Λ_1^{−1} · ∫_{y,c} q̃[ℓ] · D_c(x, y, c),

where q̃[ℓ] is computed from c using (2) and

Λ_1 = ∫_{x,y,c} Σ_{ℓ=1}^K q̃[ℓ] · D_c(x, y, c)

is a normalization constant.^2 Then, we can derive the following theorem.

2 Even when all c[ℓ] are equal, equation (2) can still be used to get q̃[ℓ] = 0 for all ℓ, which means the example (x, y, c) can be dropped instead of relabeled.

Theorem 3 (cost transformation) For any classifier g,

E(g, D_c) = Λ_1 · E(g, D_r) − Λ_2,

where Λ_2 is a constant that can be computed by integrating over the ∆ term associated with each c ∼ D_c from (3).

Theorem 3 can then be used to further prove the following regret equivalence theorem.

Theorem 4 (regret equivalence) Consider D_c and its associated D_r. Let g* be the optimal classifier under D_c, and let g̃* be the optimal classifier under the associated D_r. Then, for any classifier g,

E(g, D_c) − E(g*, D_c) = Λ_1 · (E(g, D_r) − E(g̃*, D_r)).

That is, if a regular classification algorithm A_r can return some ĝ that is close to g̃* under D_r, the very same ĝ would be close to the optimal classifier g* for the original cost-sensitive classification task.

Theoretically, Theorem 4 indicates an equivalence in terms of hardness between general cost-sensitive classification tasks and general regular classification tasks. Then, we can reduce cost-sensitive classification to regular classification using Algorithm 1.

Algorithm 1 Reduction with Relabeling

1. Obtain N′ independent regular training examples S_r = {(x_n, ℓ_n, 1)}_{n=1}^{N′} from D_r:
   (a) Transform each (x_n, y_n, c_n) to (x_n, q̃_n) by (2).
   (b) Apply the COSTING reduction (Zadrozny et al, 2003) and accept the multi-labeled example (x_n, q̃_n) with probability proportional to Σ_{ℓ=1}^K q̃_n[ℓ].
   (c) For each (x_n, q̃_n) that survives COSTING, randomly assign its label ℓ_n with probability proportional to q̃_n[ℓ].
2. Use a regular classification algorithm A_r on S_r to obtain a classifier ĝ_r that ideally yields a small E(ĝ_r, D_r).
3. Return ĝ ≡ ĝ_r.
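Step 1 of Algorithm 1 can be sketched in Python as follows (our own naming; the acceptance normalizer below is a per-example simplification of COSTING, which would normally use one global bound on the example weights):

```python
import random

def relabel(S_c, rng):
    """Algorithm 1, step 1: turn cost-sensitive examples into relabeled regular ones."""
    S_r = []
    for x, y, c in S_c:
        K, c_max = len(c), max(c)
        q = [c_max - c[l] for l in range(K)]            # (2)
        total = sum(q)
        if total == 0:                                  # all costs equal: discard
            continue
        # COSTING-style rejection: accept with probability proportional to sum(q);
        # sum(q) <= (K - 1) * c_max, so the ratio below is a valid probability
        if rng.random() >= total / ((K - 1) * c_max):
            continue
        # relabel: draw l with probability q[l] / sum(q)
        l = rng.choices(range(K), weights=q)[0]
        S_r.append((x, l, 1))                           # surviving examples get weight 1
    return S_r
```

With the running example c_n = (0, 1, 1, 334), the three cheap labels are each drawn roughly a third of the time, which is exactly the relabeling noise discussed in the text.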

From Algorithm 1, any good regular classification algorithm A_r can be turned into a good cost-sensitive classification algorithm A_c, and trivially vice versa. That is, cost-sensitive classification is as hard as (or as easy as) regular classification. Note that the "hard" part of the argument is quite important, as illustrated below.

While the cost-transformation steps above are supported with theoretical guarantees from Theorems 3 and 4, they may not work well in practice. For instance, if we look at an example (x_n, y_n, c_n) with y_n = 1 and c_n = (0, 1, 1, 334), the resulting q̃_n = (334, 333, 333, 0). Because of the large value in c_n[4], the example looks almost like a uniform mixture of labels {1, 2, 3}, with only 0.334 of probability to keep its original label. In other words, for the purpose of encoding some large components in a cost vector, the relabeling process could pay a huge variance (like noise) and relabel (or mislabel) the example more often than not. Then, the regular classification algorithm A_r may receive some S_r that contains lots of misleading labels, making it hard for the algorithm to return a decent ĝ_r.

The observation above indicates that cost transformation can introduce noise to the learning process through relabeling. In other words, it reduces the original cost-sensitive classification task to a possibly more noisy regular classification task. The noise makes the learning process less stable, and hence the returned ĝ_r may not be good. One small improvement that aims at decreasing the relabeling variance is to use Algorithm 2, called training set expansion and weighting (TSEW), instead of relabeling. Note that the algorithm reduces cost-sensitive classification to weighted classification rather than regular classification.

Algorithm 2 Training Set Expansion and Weighting

1. Obtain NK training examples S_w = {(x_{nℓ}, y_{nℓ}, w_{nℓ})}:
   (a) Transform each (x_n, y_n, c_n) to (x_n, q̃_n) by (2).
   (b) For every ℓ, let (x_{nℓ}, y_{nℓ}, w_{nℓ}) = (x_n, ℓ, q̃_n[ℓ]).
   (c) Add (x_{nℓ}, y_{nℓ}, w_{nℓ}) to S_w.
2. Use a weighted classification algorithm A_w on S_w to obtain a classifier ĝ_w.
3. Return ĝ ≡ ĝ_w.
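Step 1 of TSEW is deterministic and can be sketched in a few lines of Python (our own naming; labels are 0-indexed here):

```python
def tsew_expand(S_c):
    """Algorithm 2, step 1: expand each cost-sensitive example into K weighted ones."""
    S_w = []
    for x, y, c in S_c:
        c_max = max(c)
        for l in range(len(c)):
            S_w.append((x, l, c_max - c[l]))   # weight w_{nl} = q~_n[l] by (2)
    return S_w
```

For the running example (x, 0, (0, 1, 1, 334)), the expansion keeps the correct label 0 with the largest weight 334, but also adds labels 1 and 2 with weights 333 each: the multi-labeling ambiguity that the text discusses.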

In Algorithm 2, it is not hard to show that D_r(x, ℓ, 1) ∝ w · D_w(x, ℓ, w) for some D_w, and S_w contains (dependent) examples generated from D_w. We can think of S_w, which trades independence for smaller variance, as a more stable version of S_r. The expanded training set S_w contains all possible ℓ, and hence always includes the correct label y_n along with the largest weight w_{n y_n} = q̃_n[y_n].

Note that the A_w in TSEW can also be performed by a regular classification algorithm A_r using the COSTING reduction (Zadrozny et al, 2003). Then, Algorithm 1 is simply a special (and possibly less stable) case of TSEW.

The TSEW algorithm is a good representative of our proposed cost-transformation technique. Note that TSEW is actually the same as the data space expansion (DSE) algorithm proposed by Abe et al (2004). Nevertheless, our derivation from the minimum-entropy perspective is novel, and our theoretical results on the out-of-sample cost E(g, D_c) are more general than the in-sample cost analysis by Abe et al (2004). Xia et al (2007) also proposed an algorithm similar to TSEW using LogitBoost as A_w based on a restricted version of Theorem 3. It should be noted that the results discussed in this section are partially influenced by the work of Abe et al (2004) but are independent from the work of Xia et al (2007).

From the experimental results in the literature, a direct use of TSEW (DSE) does not perform well in practice (Abe et al, 2004). A possible explanation is that although S_w does not contain relabeling noise, it still carries multi-labeling ambiguities. That is, the same input vector x_n can come with many different labels (with possibly different weights) in S_w. Thus, a common A_w can find S_w too difficult to digest (Xia et al, 2007). One could improve the basic TSEW algorithm by using (or designing) an A_w that is robust to multi-labeled training feature vectors. We shall present one such algorithm in the next section.

4 Cost-Sensitive One-Versus-One

In this section, we propose a novel cost-sensitive classification algorithm by coupling the cost-transformation technique with the popular and robust one-versus-one (OVO) algorithm for regular classification. Before we get into our proposed cost-sensitive one-versus-one (CSOVO) algorithm, we shall introduce the original OVO first.

4.1 Original One-versus-one

We shall present a weighted version of OVO here. As shown in Algorithm 3, OVO decomposes the multiclass classification task into K(K−1)/2 binary classification subtasks. Because of the O(K^2) growth in the number of subtasks, OVO is usually better suited when K is not too large (Hsu and Lin, 2002).

Algorithm 3 One-versus-one (Hsu and Lin, 2002)

1. For each i, j with 1 ≤ i < j ≤ K,
   (a) Take the original S_w = {(x_n, y_n, w_n)}_{n=1}^N and construct a binary training set S_b^{(i,j)} = {(x_n, y_n, w_n) : y_n = i or j}.
   (b) Use a weighted binary classification algorithm A_b on S_b^{(i,j)} to get a binary classifier ĝ_b^{(i,j)}.
2. Return ĝ(x) = argmax_{1≤ℓ≤K} Σ_{i<j} ⟦ĝ_b^{(i,j)}(x) = ℓ⟧.
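The prediction step (step 2 of Algorithm 3) can be sketched as follows (Python; `binary_predict` is a hypothetical dictionary mapping each pair (i, j) with i < j to a trained classifier ĝ_b^{(i,j)} that returns either i or j):

```python
from itertools import combinations

def ovo_predict(x, binary_predict, K):
    """Step 2 of OVO: return the label gathering the most pairwise preference votes."""
    votes = {l: 0 for l in range(1, K + 1)}
    for i, j in combinations(range(1, K + 1), 2):
        votes[binary_predict[(i, j)](x)] += 1   # each subtask casts one vote
    return max(votes, key=votes.get)            # ties broken by the smallest label
```

Any label that beats all others in the pairwise comparisons gathers K − 1 votes and wins the argmax.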

In short, each binary classification subtask consists of comparing examples from two categories only. That is, each ĝ_b^{(i,j)}(x) intends to predict whether x "prefers" category i or category j, and ĝ predicts with the preference votes gathered from those ĝ_b^{(i,j)}. The goal of A_b is to locate binary classifiers ĝ_b^{(i,j)} with a small E(ĝ_b^{(i,j)}, D_OVO^{(i,j)}), where

D_OVO^{(i,j)}(x, y, u) = ⟦u = ⟦y = i or j⟧⟧ ∫_w D_r(x, y, w).

In particular, it has been proved (Beygelzimer et al, 2005) that

E(ĝ, D_r) ≤ 2 Σ_{i<j} E(ĝ_b^{(i,j)}, D_OVO^{(i,j)}).

That is, if the E(ĝ_b^{(i,j)}, D_OVO^{(i,j)}) are all small, then E(ĝ, D_r) should also be small.

4.2 Cost-sensitive One-versus-one

By coupling OVO with the cost-transformation technique (TSEW in Algorithm 2), we can easily get a preliminary version of CSOVO in Algorithm 4.

One thing to notice in Algorithm 4 is that each training example (x_n, y_n) may be split into two examples (x_n, i, w_n^{(i)}) and (x_n, j, w_n^{(j)}) for each S_b^{(i,j)}. That is, the example is ambiguously presented to each binary classification subtask. We can take a simple trick to eliminate the ambiguity before training. In particular, we keep only the label (say, i) that comes with the larger weight, and adjust its weight to |w_n^{(i)} − w_n^{(j)}|. The trick follows the same principle as shifting a cost vector to a similar one. Then, we can eliminate one unnecessary example and remove the multi-labeling ambiguity in the binary classification subtask.

Algorithm 4 TSEW-OVO

1. For each i, j with 1 ≤ i < j ≤ K,
   (a) Transform each cost-sensitive example (x_n, y_n, c_n) to (x_n, q̃_n) by (2).
   (b) Use all the (x_n, q̃_n) to construct a binary classification training set

       S_b^{(i,j)} = {(x_n, i, w_n^{(i)})} ∪ {(x_n, j, w_n^{(j)})},

       where w_n^{(ℓ)} = q̃_n[ℓ].
   (c) Use a weighted binary classification algorithm A_b on S_b^{(i,j)} to get a binary classifier ĝ_b^{(i,j)}.
2. Return ĝ(x) = argmax_{1≤ℓ≤K} Σ_{i<j} ⟦ĝ_b^{(i,j)}(x) = ℓ⟧.

Recall that (x_n, i, w_n^{(i)}) would be of weight w_n^{(i)} = q̃_n[i] and (x_n, j, w_n^{(j)}) would be of weight w_n^{(j)} = q̃_n[j]. By the discussion above, the simplified S_b^{(i,j)} is

{(x_n, argmax_{ℓ=i or j} q̃_n[ℓ], |q̃_n[i] − q̃_n[j]|)} = {(x_n, argmin_{ℓ=i or j} c_n[ℓ], |c_n[i] − c_n[j]|)}.   (4)

Then, we get our proposed CSOVO algorithm.

Algorithm 5 Cost-sensitive One-versus-one

1. For each i, j with 1 ≤ i < j ≤ K,
   (a) Take the original S_c = {(x_n, y_n, c_n)}_{n=1}^N and construct S_b^{(i,j)} by (4).
   (b) Use a weighted binary classification algorithm A_b on S_b^{(i,j)} to get a binary classifier ĝ_b^{(i,j)}.
2. Return ĝ(x) = argmax_{1≤ℓ≤K} Σ_{i<j} ⟦ĝ_b^{(i,j)}(x) = ℓ⟧.
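Step 1(a) of CSOVO amounts to one comparison and one subtraction per example, as the following Python sketch shows (our own names; labels 0-indexed here, and A_b left abstract):

```python
def csovo_pair_set(S_c, i, j):
    """Construct S_b^{(i,j)} by (4): label = cheaper of {i, j}, weight = |c[i] - c[j]|."""
    S_b = []
    for x, y, c in S_c:
        label = i if c[i] <= c[j] else j   # argmin over {i, j}
        weight = abs(c[i] - c[j])
        if weight > 0:                     # zero-weight examples carry no signal
            S_b.append((x, label, weight))
    return S_b
```

For example, with c_n = (0, 1, 1, 334), the pair (0, 3) yields (x_n, 0, 334), while the pair (1, 2) yields nothing since c_n[1] = c_n[2].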

An intuitive explanation is that CSOVO asks each binary classifier ĝ_b^{(i,j)} to answer the question "is c[i] or c[j] smaller for this x?" We can easily see that CSOVO (Algorithm 5) takes OVO (Algorithm 3) as a special case when using only the weighted classification cost vectors (w · c_c^{(ℓ)}).

4.3 Theoretical Guarantee

Next, we analyze the theoretical guarantee of Algorithm 5. Note that each created example

(x_n, argmin_{ℓ=i or j} c_n[ℓ], |c_n[i] − c_n[j]|)

can be thought of as coming from a distribution

D_CSOVO^{(i,j)}(x, k, u) = ∫_{y,c} ⟦k = argmin_{ℓ=i or j} c[ℓ]⟧ ⟦u = |c[i] − c[j]|⟧ D_c(x, y, c).

We then get the following theorem:

Theorem 5 Consider any family of binary classifiers {g_b^{(i,j)} : X → {i, j}}_{1≤i<j≤K}. Let g(x) = argmax_{1≤ℓ≤K} Σ_{i<j} ⟦g_b^{(i,j)}(x) = ℓ⟧. Then,

E(g, D_c) − E_{(x,y,c)∼D_c} c_min ≤ 2 Σ_{i<j} E(g_b^{(i,j)}, D_CSOVO^{(i,j)}).   (5)

Proof For each (x, y, c) generated from D_c, if c[g(x)] = c[y] = c_min, its contribution to the left-hand side is 0, which is trivially no more than its contribution to the right-hand side.

Without loss of generality (by sorting the elements of the cost vector c and shuffling the labels y ∈ Y), consider an example (x, y, c) such that

c_min = c[1] ≤ c[2] ≤ ... ≤ c[K] = c_max.

From the results of Beygelzimer et al (2005), suppose g(x) = k; then for each 1 ≤ ℓ ≤ k − 1, there are at least ⌈ℓ/2⌉ pairs (i, j) with i ≤ ℓ < j such that

g_b^{(i,j)}(x) ≠ argmin_{ℓ'=i or j} c[ℓ'].

Therefore, the contribution of (x, y, c) to the summation on the right-hand side of (5) is no less than

Σ_{ℓ=1}^{k−1} (c[ℓ+1] − c[ℓ]) ⌈ℓ/2⌉ ≥ (1/2) Σ_{ℓ=1}^{k−1} ℓ (c[ℓ+1] − c[ℓ]) = (1/2) Σ_{ℓ=1}^{k−1} (c[k] − c[ℓ]) ≥ (1/2) (c[k] − c_min),

while its contribution to the left-hand side is (c[k] − c_min); the factor 2 in (5) closes the gap. The desired result can be proved by integrating over all of D_c. ⊓⊔

Thus, similar to the original OVO algorithm, if the E(ĝ_b^{(i,j)}, D_CSOVO^{(i,j)}) are all small, then the resulting E(ĝ, D_c) should also be small.

4.4 A Sibling Algorithm: Weighted All-pairs

Note that a theoretical proof similar to that of Theorem 5 was used by Beygelzimer et al (2005) to analyze another algorithm called weighted all-pairs (WAP). As illustrated in Algorithm 6, the WAP algorithm shares much of its algorithmic structure with CSOVO. In particular, we see that except for the difference between equations (4) and (6), WAP is exactly the same as CSOVO.

Algorithm 6 A Special Version of WAP (Beygelzimer et al, 2005)

Run CSOVO, while replacing (4) in step 1(a) with

S_b^{(i,j)} = {(x_n, argmin_{ℓ=i or j} c_n[ℓ], |v_n[i] − v_n[j]|)},   (6)

where v_n[i] = ∫_{c_min}^{c_n[i]} 1 / |{k : c_n[k] ≤ t}| dt.
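Because |{k : c_n[k] ≤ t}| is piecewise constant in t, the integral above can be computed exactly after one sort of the cost vector (a Python sketch with our own naming):

```python
def wap_v(c):
    """WAP weights: v[i] = integral of dt / |{k : c[k] <= t}| from min(c) to c[i]."""
    cs = sorted(c)
    v = []
    for ci in c:
        total = 0.0
        for m in range(len(cs) - 1):
            # on the interval [cs[m], cs[m+1]), exactly m + 1 costs are <= t
            seg = max(0.0, min(ci, cs[m + 1]) - cs[m])
            total += seg / (m + 1)
        v.append(total)
    return v

print(wap_v([0, 1, 1, 334]))   # [0.0, 1.0, 1.0, 112.0]
```

For c_n = (0, 1, 1, 334), the WAP weight between the first and the last label is |v_n[1] − v_n[4]| = 112, far smaller than the corresponding CSOVO weight |c_n[1] − c_n[4]| = 334.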

Define

D_WAP^{(i,j)}(x, k, u) = ∫_{y,c} ⟦k = argmin_{ℓ=i or j} c[ℓ]⟧ ⟦u = |v[i] − v[j]|⟧ D_c(x, y, c),

where v is computed from c using a definition similar to the one in Algorithm 6. The cost bound of WAP is then (Beygelzimer et al, 2005)

E(g, D_c) − E_{(x,y,c)∼D_c} c_min ≤ 2 Σ_{i<j} E(g_b^{(i,j)}, D_WAP^{(i,j)}).   (7)

Note that we can let v′_n[i] ≡ c_n[i] − c_min = ∫_{c_min}^{c_n[i]} 1 dt. Then, CSOVO equivalently uses |v′_n[i] − v′_n[j]| as the underlying example weight. It is not hard to see that for any given example (x, y, c), the associated

|v′_n[i] − v′_n[j]| ≥ |v_n[i] − v_n[j]|.

Thus, the WAP cost bound (7) is tighter than the CSOVO one (5) when using the same binary classifiers {ĝ_b^{(i,j)}}. In particular, while the right-hand sides of (5) and (7) look similar, the total weight that CSOVO takes in D_CSOVO^{(i,j)} is larger than the total weight that WAP takes in D_WAP^{(i,j)}. The difference allows WAP to have an O(K) regret transform (Beygelzimer et al, 2005) instead of the O(K^2) one of CSOVO (Theorem 5). Thus, for binary classifiers {ĝ_b^{(i,j)}} with the same error rate, it appears that WAP is better than CSOVO because of the tighter upper bound. However, CSOVO enjoys the advantage of efficiency and simplicity in implementation, because equation (6) requires a complete sorting of each cost vector (of size K) to compute, while (4) only needs a simple subtraction. In the next section, we shall study whether the tighter cost bound with the additional complexity (WAP) leads to better empirical performance than the other way around (CSOVO).

4.5 Another Sibling Algorithm: All-pair Filter Tree

Another sibling algorithm of CSOVO is called the all-pair filter tree (APFT; Beygelzimer et al, 2007). APFT designs an elimination-based tournament in order to find the label with the lowest cost. During training, if an example (x_n, y_n) as well as the classifiers in the lower levels of the tree allow labels {i, j} to meet in one game of the tournament, a weighted example

(x_n, argmin_{ℓ=i or j} c_n[ℓ], |c_n[i] − c_n[j]|)

is added to the training set S_b^{(i,j)} for learning a classifier ĝ_b^{(i,j)}. The goal of ĝ_b^{(i,j)} is to achieve a small E(ĝ_b^{(i,j)}, D_APFT^{(i,j)}), where

D_APFT^{(i,j)}(x, k, u) = ∫_{y,c} ⟦i, j attend the tournament⟧ ⟦k = argmin_{ℓ=i or j} c[ℓ]⟧ ⟦u = |c[i] − c[j]|⟧ D_c(x, y, c).

Note that the condition ⟦i, j attend the tournament⟧ depends on lower-level classifiers that "filter" the distribution for higher-level training. During prediction, the results from {ĝ_b^{(i,j)}} are decoded using the same tournament design rather than voting.

Let {g_b^{(i,j)}} be a set of binary classifiers and g_APFT be the resulting classifier after decoding the predictions of {g_b^{(i,j)}} from the tournament. It can be shown (Beygelzimer et al, 2007) that

E(g_APFT, D_c) − E_{(x,y,c)∼D_c} c_min ≤ Σ_{i<j} E(g_b^{(i,j)}, D_APFT^{(i,j)}).   (8)

Comparing CSOVO with APFT, we see one similarity: the weighted examples included in each S_b^{(i,j)}. Nevertheless, note that APFT uses fewer examples than CSOVO: the former uses only the pairs of labels that meet in the tournament, while the latter uses all possible pairs of labels. The difference allows a tighter cost bound for APFT by conditioning on the tournament results. Thus, when using binary classifiers of the same error rate, it appears that APFT is better than CSOVO because of the tighter upper bound. Furthermore, by restricting to a specific tournament, APFT results in an O(K) prediction scheme instead of the O(K^2) one that CSOVO needs to take. Nevertheless, APFT essentially breaks the symmetry between classes by restricting to a specific tournament, and the reduced number of examples in each S_b^{(i,j)} could degrade the practical learning performance. In the next section, we shall also study whether the tighter cost bound with the tournament restriction (APFT) leads to better empirical performance than the other way around (CSOVO).
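For concreteness, the tournament decoding at prediction time can be sketched as a single-elimination bracket over the labels (Python; the bracket layout is our own assumption, and `binary_predict[(i, j)]` is a hypothetical classifier returning i or j, as in OVO). Only K − 1 of the classifiers are consulted:

```python
def apft_predict(x, binary_predict, K):
    """Decode a prediction by running all labels through one elimination tournament."""
    alive = list(range(1, K + 1))
    while len(alive) > 1:
        winners = []
        for a in range(0, len(alive) - 1, 2):
            i, j = sorted((alive[a], alive[a + 1]))
            winners.append(binary_predict[(i, j)](x))   # the preferred label advances
        if len(alive) % 2 == 1:                         # an odd label gets a bye
            winners.append(alive[-1])
        alive = winners
    return alive[0]
```

Each round halves the number of surviving labels, so a total of K − 1 pairwise games decide the final prediction.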

5 Experiments

We will first compare CSOVO with the original OVO on various real-world data sets. Then, we will compare CSOVO with WAP (Beygelzimer et al, 2005) and APFT (Beygelzimer et al, 2007). All four algorithms are of OVO-type. That is, they obtain a multiclass classifier ĝ by calling a weighted binary classification algorithm A_b for K(K−1)/2 times. During prediction, CSOVO, OVO and WAP require gathering votes from K(K−1)/2 binary classifiers, while APFT determines the label by using K − 1 of those classifiers in the tournament. In addition, we will compare CSOVO with other existing algorithms that also reduce cost-sensitive classification to weighted binary classification.

Table 1 Classification data sets

data set     # examples   # categories (K)   # features
zoo                 101                  7           16
glass               214                  6            9
vehicle             846                  4           18
vowel               990                 11           10
yeast              1484                 10            8
segment            2310                  7           19
dna                3186                  3          180
pageblock          5473                  5           10
satimage           6435                  6           36
usps               9298                 10          256

We take the support vector machine (SVM) with the perceptron kernel (Lin and Li, 2008) as A_b in all the experiments and use LIBSVM (Chang and Lin, 2001) as our SVM solver. Note that SVM with the perceptron kernel is known as a strong classification algorithm (Lin and Li, 2008) and can be naturally adopted to perform weighted binary classification (Zadrozny et al, 2003). We will also take a weaker classifier, namely SVM with the linear kernel, in Subsection 5.6.

We use ten classification data sets: zoo, glass, vehicle, vowel, yeast, segment, dna, pageblock, satimage, usps (Table 1).^3 The first nine come from the UCI machine learning repository (Hettich et al, 1998) and the last one is from Hull (1994).

Note that the ten data sets were originally gathered as regular classification tasks. We shall first adopt the randomized proportional (RP) cost-generation procedure that was used by Beygelzimer et al (2005). In particular, we generate the cost vectors from a cost matrix C(y, k) that does not depend on x. The diagonal entries C(y, y) are set as 0 and each of the other entries C(y, k) is a random variable sampled uniformly from

[ 0, 2000 · |{n : y_n = k}| / |{n : y_n = y}| ].

Then, for a cost-sensitive example (x, y, c), we simply take c[k] = C(y, k). We acknowledge that the RP procedure may not fully reflect realistic application needs. Nevertheless, we still take the procedure as it is a longstanding benchmark for comparing general-purpose cost-sensitive classification algorithms. We will take another cost-generating procedure in Subsection 5.4.
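The RP procedure above can be sketched as follows. This is an illustrative re-implementation (not the authors' code); the helper name `rp_cost_matrix` is ours.

```python
import random

# Sketch of the randomized proportional (RP) cost-generation procedure
# (Beygelzimer et al, 2005): C(y, y) = 0, and C(y, k) for k != y is drawn
# uniformly from [0, 2000 * |{n : y_n = k}| / |{n : y_n = y}|].

def rp_cost_matrix(labels, K, rng=None):
    rng = rng or random.Random(0)
    counts = [sum(1 for y in labels if y == k) for k in range(K)]
    C = [[0.0] * K for _ in range(K)]
    for y in range(K):
        for k in range(K):
            if k != y:
                C[y][k] = rng.uniform(0.0, 2000.0 * counts[k] / counts[y])
    return C

# Each cost-sensitive example (x, y, c) then takes c[k] = C[y][k].
C = rp_cost_matrix([0, 0, 0, 1, 1, 2], K=3)
```

Note how the upper bound of C[y][k] grows when class k is frequent and class y is rare, which is exactly why highly unbalanced data sets such as yeast and pageblock can produce huge cost components.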

We randomly choose 75% of the examples in each data set for training and leave the other 25% of the examples as the test set. Then, each feature in the training set is linearly scaled to [−1, 1], and the features in the test set are scaled accordingly. The results reported are all averaged over 20 trials of different training/test splits, along with the standard error.
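The scaling step can be sketched as follows; a minimal re-implementation (the function names are ours), where the affine map fitted on the training set is reused unchanged on the test set.

```python
# Sketch: linearly scale each feature to [-1, 1] using the training set's
# per-feature min/max, then apply the same map to the test set.

def fit_scaler(train_rows):
    """Return (min, max) bounds per feature, computed on the training set."""
    cols = list(zip(*train_rows))
    return [(min(c), max(c)) for c in cols]

def apply_scaler(rows, bounds):
    """Map each value v to -1 + 2*(v - lo)/(hi - lo); constant features to 0."""
    scaled = []
    for row in rows:
        scaled.append([
            0.0 if hi == lo else -1.0 + 2.0 * (v - lo) / (hi - lo)
            for v, (lo, hi) in zip(row, bounds)
        ])
    return scaled

bounds = fit_scaler([[0.0, 10.0], [5.0, 30.0], [10.0, 20.0]])
print(apply_scaler([[5.0, 20.0]], bounds))  # [[0.0, 0.0]]
```

Fitting the bounds on the training set only (rather than on all data) keeps the test set untouched during preprocessing, which matches the evaluation protocol above.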

In the coming tables, those entries within one standard error of the lowest one are marked in bold.

SVM with the perceptron kernel takes a regularization parameter (Lin and Li, 2008), which is chosen within {2^{−17}, 2^{−15}, . . . , 2^{3}} with a 5-fold cross-validation (CV) procedure on only the training set (Hsu et al, 2003). For the OVO algorithm, the CV procedure selects the parameter that results in the smallest cross-validation classification error. For CSOVO and other cost-sensitive classification algorithms, the CV procedure selects the parameter that results in the smallest cross-validation cost. We then re-run each algorithm on the whole training set with the chosen parameter to get the classifier ĝ. Finally, we evaluate the average performance of ĝ with the test set.
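The parameter grid and the selection rule can be sketched as follows. The `cv_score` callback is a hypothetical stand-in for the actual 5-fold CV cost (or CV classification error, for OVO); only the grid itself comes from the text above.

```python
# Sketch: the regularization-parameter grid 2^{-17}, 2^{-15}, ..., 2^{3},
# and generic selection of the parameter with the smallest CV score.

PARAM_GRID = [2.0 ** e for e in range(-17, 4, 2)]   # 11 candidates

def select_parameter(cv_score, grid=PARAM_GRID):
    """Return the grid value minimizing the given cross-validation score."""
    return min(grid, key=cv_score)

# Toy stand-in: pretend the CV cost is minimized near 2^{-3}.
best = select_parameter(lambda c: abs(c - 2.0 ** -3))
print(best)  # 0.125
```

The only difference between OVO and the cost-sensitive algorithms here is which quantity `cv_score` measures: classification error for OVO, average cost for CSOVO and the others.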

^3 All data sets except zoo, glass, yeast and pageblock are actually downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

Table 2 test cost of CSOVO/OVO

data set    CSOVO          OVO
zoo         36.79±9.09     105.56±33.45
glass       210.06±15.88   492.99±40.77
vehicle     155.14±20.63   185.38±17.23
vowel       20.05±1.95     11.90±1.96
yeast       52.45±2.97     5823.21±1290.65
segment     25.27±2.25     25.15±2.11
dna         53.18±4.25     48.15±3.33
pageblock   24.98±4.94     501.57±74.98
satimage    66.57±4.77     94.07±5.49
usps        20.51±1.17     23.62±0.66

Table 3 test classification error of CSOVO/OVO

data set    CSOVO          OVO
zoo         0.131±0.015    0.060±0.010
glass       0.605±0.034    0.304±0.010
vehicle     0.283±0.020    0.185±0.005
vowel       0.059±0.010    0.011±0.002
yeast       0.767±0.007    0.398±0.006
segment     0.051±0.008    0.024±0.001
dna         0.115±0.017    0.043±0.002
pageblock   0.776±0.064    0.033±0.001
satimage    0.168±0.012    0.072±0.002
usps        0.077±0.030    0.023±0.000

5.1 CSOVO versus OVO

Table 2 compares the test cost of CSOVO and the original cost-insensitive OVO. We can see that on 7 out of the 10 data sets, CSOVO is significantly better than OVO, which justifies that it can be useful to include the cost information in the training process. The t-test results, which will be shown in Table 8, suggest the same finding. The big difference on yeast and pageblock arises because those data sets are highly unbalanced and hence the components in c can be huge. Then, not using (or discarding) the cost information (as OVO does) would intuitively lead to worse performance.

The only data set on which CSOVO is much worse than OVO is vowel. One may wonder why including the accurate cost information does not improve performance. The reason lies in Table 3, in which we compare the test classification error rates of CSOVO and OVO. Not surprisingly, OVO can always achieve a lower test error than CSOVO. In fact, on yeast and pageblock, in which some of the cost components are large, we clearly see that CSOVO is willing to trade a significant amount of classification accuracy for a lower cost. For vowel, note that OVO can achieve a very low test error (1.1%), which readily leads to a low test cost. Then the caveat of using CSOVO, or more generally the cost-transformation technique, arises. In particular, cost transformation reduces the original "easy" task for OVO to a more difficult one. The change in hardness degrades the learning performance, and thus CSOVO results in a relatively higher test cost.

Recall that CSOVO comes from coupling the cost-transformation technique with OVO, and we discussed in Section 3 that cost transformation inevitably introduces multi-labeling ambiguity into the learning process. The ambiguity acts like noise, and generally makes the learning tasks more difficult. On the vowel data set, OVO readily achieves low test error and hence low test cost, while CSOVO suffers from the difficult learning tasks and hence

Table 4 test cost of CSOVO/WAP/APFT

data set    CSOVO          WAP            APFT
zoo         36.79±9.09     49.09±16.67    56.92±15.41
glass       210.06±15.88   220.29±18.56   215.95±17.36
vehicle     155.14±20.63   148.63±19.74   158.60±20.35
vowel       20.05±1.95     19.36±1.81     27.59±2.94
yeast       52.45±2.97     52.71±3.58     63.53±8.11
segment     25.27±2.25     24.40±1.96     28.51±2.55
dna         53.18±4.25     51.13±4.37     53.37±5.49
pageblock   24.98±4.94     20.68±2.52     25.60±4.98
satimage    66.57±4.77     72.05±5.13     80.70±5.98
usps        20.51±1.17     21.04±1.19     29.75±1.75

Table 5 test cost bound of CSOVO/WAP/APFT

data set    CSOVO with (5)   WAP with (7)   APFT with (8)
zoo         465.53±124.48    147.56±30.32   143.54±40.97
glass       2138.52±260.90   811.30±69.02   471.66±50.41
vehicle     499.53±55.49     295.87±30.25   245.15±27.54
vowel       475.25±35.18     126.60±9.51    111.26±10.87
yeast       4820.96±301.43   736.31±26.70   401.92±55.94
segment     245.86±21.83     94.37±7.50     77.65±7.04
dna         95.77±6.76       71.94±5.77     69.18±8.35
pageblock   533.36±89.50     260.73±43.80   112.51±28.93
satimage    477.45±22.52     193.25±10.88   183.24±12.73
usps        415.36±23.40     112.22±6.08    98.34±5.78

gets a high test cost. Similar situations happen on segment, dna and usps, on which the test costs of CSOVO and OVO are quite close. That is, it is worth trading the cost information for easier learning tasks. On the other hand, when OVO cannot achieve a low test error (like on vehicle) or when the cost information is extremely important (like on pageblock), it is worth trading the easy learning tasks for knowing the accurate cost information, and thus CSOVO performs better.

5.2 CSOVO versus WAP and APFT

Next, we compare CSOVO with WAP and APFT in terms of the average test cost in Table 4. We see that CSOVO and WAP are comparable in performance, with WAP being slightly worse on satimage and CSOVO being slightly worse on pageblock. The similarity is natural because CSOVO and WAP differ only in the weights given to the underlying binary examples. On the other hand, Table 4 shows that APFT usually performs slightly worse than CSOVO and WAP. Thus, CSOVO and WAP should be better choices, unless the O(K) prediction time of APFT (and its shorter training time, because of the conditioning on the tournament) is needed. Again, the t-test results in Table 8 lead to the same conclusion.

We mentioned in Subsections 4.4 and 4.5 that the cost bounds of WAP and APFT are both tighter than that of CSOVO. To understand whether the cost bounds can explain the results in Table 4, we compute the bounds using the test set and list them in Table 5.

We see that both WAP and APFT can indeed reach lower cost bounds. Nevertheless, the bounds are quite loose when compared with the actual test cost values in Table 4. Because of the looseness, using WAP (or APFT) for its tighter bound does not lead to much gain in performance.

Table 6 comparison of meta-algorithms that reduce cost-sensitive to binary classification

                          CSOVO      WAP        APFT       FT/TREE      SECOC              CSOVA
# of binary classifiers   K(K−1)/2   K(K−1)/2   K(K−1)/2   K − 1        12 · 2^{⌈log₂K⌉}   K
prediction time           O(K^2)     O(K^2)     O(K)       O(log₂K)     O(K)               O(K)
theoretical guarantee     yes        yes        yes        yes          yes                partial

In summary, CSOVO performs better than APFT; CSOVO performs similarly to WAP but enjoys a simpler and more efficient implementation (see Subsection 4.4). Thus, CSOVO should be preferred over both WAP and APFT in practice.

5.3 CSOVO versus Others

Next, we compare CSOVO with four other existing algorithms, namely TREE (Beygelzimer et al, 2005), Filter Tree (FT; Beygelzimer et al, 2007), Sensitive Error Correcting Output Codes (SECOC; Langford and Beygelzimer, 2005), and Cost-sensitive One-versus-all (CSOVA; Lin, 2008). The algorithms cover commonly-used decompositions from multiclass classification to binary classification. Note that TREE, FT and SECOC, like CSOVO and WAP, come with a sound theoretical guarantee, which assures that a good binary classifier can be cast as a good cost-sensitive one. CSOVA, on the other hand, follows a heuristic step in its derivation and hence does not carry a strong theoretical guarantee. A quick comparison of the properties of the four meta-algorithms (as well as CSOVO, WAP and APFT) is shown in Table 6.

Table 7 compares the average test RP cost of CSOVO, TREE, FT, SECOC and CSOVA; Table 8 lists the paired t-test results with significance level 0.05. SECOC is the worst of the five, because it contains a thresholding (quantization) step that can lead to an inaccurate representation of the cost information.

FT performs slightly worse than CSOVO, which demonstrates that a full pairwise com- parison (CSOVO/WAP) can be more stable than an elimination-based tournament (FT).

TREE performs even worse than FT, which complies with the finding in the original FT paper (Beygelzimer et al, 2007) that a regret-based reduction (FT) can be more robust than an error-based reduction (TREE).

Interestingly, when comparing Table 7 with Table 4, we see that FT performs better than APFT. Such a result suggests that decomposing the decision of each game into pairwise classifiers (APFT) is not better than directly predicting the outcome of each game in the tournament (FT). This is possibly because the reduced number of examples in each S_b^{(i,j)} of APFT could degrade the practical learning performance (see Subsection 4.5), especially when using a highly nonlinear model like SVM with the perceptron kernel. We will discuss more about this issue in Subsection 5.6.

CSOVO and CSOVA are quite similar in performance on many data sets. Nevertheless, recall that CSOVO comes with a stronger theoretical guarantee than CSOVA. Thus, when K is relatively small (like on our data sets) and training K(K−1)/2 binary classifiers is affordable, CSOVO is the best meta-algorithm for reducing multiclass cost-sensitive classification to binary classification in terms of both accuracy and efficiency.

Table 7 test cost of meta-algorithms that reduce cost-sensitive to binary classification

data set    CSOVO          FT             TREE           SECOC           CSOVA
zoo         36.79±9.09     74.01±27.65    57.07±17.40    179.02±30.52    79.70±25.31
glass       210.06±15.88   212.76±19.38   264.02±24.49   347.77±37.87    231.77±18.51
vehicle     155.14±20.63   156.01±20.14   156.72±20.12   167.60±20.97    156.91±19.91
vowel       20.05±1.95     24.66±2.92     28.42±3.02     95.25±7.35      15.06±1.73
yeast       52.45±2.97     53.73±3.14     66.49±6.09     277.14±49.41    86.80±9.78
segment     25.27±2.25     27.06±2.32     27.76±2.26     68.08±4.02      25.50±2.23
dna         53.18±4.25     53.76±4.23     55.32±4.39     65.90±5.66      39.67±2.41
pageblock   24.98±4.94     29.93±6.16     23.00±3.28     249.28±67.31    42.99±8.17
satimage    66.57±4.77     74.01±4.57     78.13±4.31     102.81±5.41     77.26±4.73
usps        20.51±1.17     27.07±1.52     25.74±1.55     86.59±9.45      21.49±1.23

Table 8 t-test for comparing CSOVO with other meta-algorithms using RP cost

data set    OVO  WAP  APFT  FT  TREE  SECOC  CSOVA
zoo         ◦    ∼    ∼     ∼   ∼     ◦      ∼
glass       ◦    ∼    ∼     ∼   ◦     ◦      ∼
vehicle     ◦    ∼    ∼     ∼   ∼     ◦      ∼
vowel       ×    ∼    ◦     ◦   ◦     ◦      ×
yeast       ◦    ∼    ∼     ∼   ◦     ◦      ◦
segment     ∼    ∼    ◦     ◦   ◦     ◦      ∼
dna         ∼    ∼    ∼     ∼   ∼     ◦      ×
pageblock   ◦    ∼    ∼     ∼   ∼     ◦      ◦
satimage    ◦    ◦    ◦     ◦   ◦     ◦      ◦
usps        ◦    ∼    ◦     ◦   ◦     ◦      ∼

◦: CSOVO significantly better; ×: CSOVO significantly worse; ∼: similar

5.4 Comparison with Emphasizing Cost

Next, we take another cost-generating procedure to compare cost-sensitive classification algorithms. We consider a practical task in which one wishes to mark some of the classes as "important." Traditionally, the task is tackled with the weighted classification approach: putting a larger weight w on those classes. That is, consider a cost matrix C_c(y, k) where the y-th row of C_c is the classification cost vector c_c^{(y)}. The approach above corresponds to scaling up some rows of C_c by w, indicating that examples that come from those classes are more important.

Another way of saying that class ℓ is important is to prevent mis-classifying examples of other classes as ℓ. For instance, ℓ can be the "non-cancerous" class and it is certainly bad to predict patients carrying any kind of cancer as non-cancerous. Thus, we need a cost matrix that comes from scaling up some columns of C_c. Weighted classification cannot deal with such a cost matrix, but cost-sensitive classification can.

In our experiments, we generate costs by first picking a random ⌊K/2⌋ of the columns, and deciding whether to scale each of them by 100 with a fair coin flip. Thus, in expectation, ⌊K/2⌋/2 columns of C_c are scaled to form a cost matrix C. Then, for a cost-sensitive example (x, y, c), we take c[k] = C(y, k). The resulting cost C would be called the emphasizing cost.
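The emphasizing-cost procedure can be sketched as follows; an illustrative re-implementation (the helper name is ours), starting from the classification cost matrix C_c with 0 on the diagonal and 1 elsewhere.

```python
import random

# Sketch of the emphasizing-cost procedure: pick a random floor(K/2) of the
# columns of the classification cost matrix C_c, and scale each picked
# column by 100 with a fair coin flip.

def emphasizing_cost_matrix(K, rng=None):
    rng = rng or random.Random(0)
    # Classification cost: 0 on the diagonal, 1 elsewhere.
    C = [[0.0 if k == y else 1.0 for k in range(K)] for y in range(K)]
    for col in rng.sample(range(K), K // 2):
        if rng.random() < 0.5:            # fair coin flip
            for y in range(K):
                C[y][col] *= 100.0        # mis-predicting *as* col is costly
    return C

C = emphasizing_cost_matrix(6)
```

Scaling columns (rather than rows, as weighted classification does) makes predicting *into* an emphasized class expensive, which is exactly the "non-cancerous" scenario described above.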

Table 9 compares the average test emphasizing cost of CSOVO with WAP, APFT, FT, TREE, SECOC and CSOVA; Table 10 lists the paired t-test results with significance level 0.05. Similar to the findings when using the RP procedure, the leading algorithms in performance are CSOVO, WAP and CSOVA. They are mostly able to achieve decent performance under the emphasizing cost. In particular, the average cost is usually less than 1, indicating that the serious mis-classification costs (the emphasized ones with cost 100) have been carefully prevented. On the other hand, SECOC often cannot reach the same level of performance; APFT, FT and TREE are also sometimes worse. The results again verify that CSOVO is a superior choice for reducing cost-sensitive classification to binary classification.

Table 9 test emphasizing cost of meta-algorithms that reduce cost-sensitive to binary classification

data set    CSOVO      WAP        APFT       FT         TREE       SECOC      CSOVA
zoo         0.08±0.01  0.07±0.01  0.10±0.02  0.09±0.01  0.08±0.01  0.95±0.19  0.85±0.58
glass       0.57±0.09  0.59±0.09  0.75±0.19  0.83±0.27  0.98±0.28  1.16±0.20  1.15±0.32
vehicle     0.49±0.03  0.48±0.03  0.50±0.03  0.48±0.03  0.48±0.03  0.71±0.01  0.43±0.04
vowel       0.12±0.02  0.11±0.02  0.15±0.03  0.12±0.03  0.17±0.05  0.87±0.02  0.07±0.01
yeast       0.50±0.02  0.51±0.02  0.51±0.02  0.49±0.01  0.60±0.05  0.91±0.01  0.51±0.02
segment     0.19±0.04  0.16±0.03  0.17±0.03  0.17±0.03  0.23±0.06  0.83±0.03  0.14±0.02
dna         0.27±0.02  0.27±0.02  0.27±0.02  0.28±0.02  0.29±0.02  0.61±0.01  0.23±0.02
pageblock   0.23±0.04  0.27±0.06  0.24±0.04  0.24±0.04  0.28±0.06  0.83±0.04  0.58±0.15
satimage    0.24±0.02  0.29±0.02  0.25±0.02  0.37±0.12  0.35±0.08  0.81±0.01  0.28±0.03
usps        0.11±0.01  0.11±0.01  0.13±0.01  0.13±0.01  0.12±0.01  0.87±0.01  0.12±0.02

Table 10 t-test for comparing CSOVO with other meta-algorithms with emphasizing cost

data set    WAP  APFT  FT  TREE  SECOC  CSOVA
zoo         ∼    ∼     ∼   ∼     ◦      ∼
glass       ∼    ∼     ∼   ∼     ◦      ◦
vehicle     ∼    ∼     ∼   ∼     ◦      ∼
vowel       ∼    ◦     ∼   ∼     ◦      ×
yeast       ∼    ◦     ∼   ◦     ◦      ∼
segment     ∼    ∼     ∼   ∼     ◦      ∼
dna         ∼    ∼     ◦   ◦     ◦      ×
pageblock   ◦    ∼     ∼   ∼     ◦      ◦
satimage    ◦    ∼     ∼   ∼     ◦      ◦
usps        ∼    ◦     ◦   ◦     ◦      ∼

◦: CSOVO significantly better; ×: CSOVO significantly worse; ∼: similar

5.5 Comparison on Ordinal Ranking Data Sets

We mentioned in Section 1 that cost-sensitive classification can express any finite-choice and bounded-loss supervised learning setup (Beygelzimer et al, 2005). Next, we demonstrate the usefulness of cost-sensitive classification with one such setup: ordinal ranking. Ordinal ranking can be viewed as a special case of cost-sensitive classification (Lin, 2008) that takes cost vectors of some specific forms. For instance, many existing works on ordinal ranking (Chu and Keerthi, 2007; Li and Lin, 2007) focus on the absolute cost vectors c_a^{(y)}[k] = |y − k|. We adopt these cost vectors in our experiments and assign c = c_a^{(y)} for each cost-sensitive example (x, y, c). We then conduct experiments with eight benchmark ordinal ranking data sets: pyrimidines, machineCPU, boston, abalone, bank, computer, california, census, which were used by Chu and Keerthi (2007). Similar to their original procedure, we keep the same training/test split ratios and average the results over 20 trials.
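The absolute cost vector above is simple enough to write down directly; a minimal sketch (the function name is ours), assuming ranks are indexed 0, …, K − 1:

```python
# Sketch: the absolute cost vector c_a^{(y)}[k] = |y - k| used to cast
# ordinal ranking as cost-sensitive classification.

def absolute_cost_vector(y, K):
    """Cost of predicting rank k when the true rank is y."""
    return [abs(y - k) for k in range(K)]

print(absolute_cost_vector(2, 5))  # [2, 1, 0, 1, 2]
```

The V-shaped costs encode the ordinal structure: mis-predicting by two ranks costs twice as much as mis-predicting by one, which no plain classification cost (0/1) can express.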

Table 11 compares the test performance of all the cost-sensitive algorithms on the ordinal ranking data sets; Table 12 lists the t-test results. We see that CSOVO is usually significantly better than the other cost-sensitive classification algorithms. In addition, CSOVO