Reduction from Cost-sensitive Multiclass Classification to One-versus-one Binary Classification

Department of Computer Science and Information Engineering, National Taiwan University

Abstract

Many real-world applications require varying costs for different types of mis-classification errors. Such a cost-sensitive classification setup can be very different from the regular classification one, especially in the multiclass case. Thus, traditional meta-algorithms for regular multiclass classification, such as the popular one-versus-one approach, may not always work well under the cost-sensitive classification setup. In this paper, we extend the one-versus-one approach to the field of cost-sensitive classification. The extension is derived using a rigorous mathematical tool called the cost-transformation technique, and takes the original one-versus-one as a special case. Experimental results demonstrate that the proposed approach can achieve better performance in many cost-sensitive classification scenarios when compared with the original one-versus-one as well as existing cost-sensitive classification algorithms.

Keywords: cost-sensitive classification, one-versus-one, meta-learning

1. Introduction

Many real-world applications of machine learning and data mining require evaluating the learned system with different costs for different types of mis-classification errors. For instance, a false-negative prediction for a spam classification system only takes the user an extra second to delete the email, while a false-positive prediction can mean a huge loss when the email actually carries important information. When recommending movies to a subscriber with preference “romance over action over horror”, the cost of mis-predicting a romance movie as a horror one should be significantly higher than the cost of mis-predicting the movie as an action one. Such a need is also shared by applications like targeted marketing, information retrieval, medical decision making, object recognition and intrusion detection (Abe et al., 2004), and can be formalized as the cost-sensitive classification setup.

In fact, cost-sensitive classification can be used to express any finite-choice and bounded-loss supervised learning setup (Beygelzimer et al., 2005). Thus, it has been attracting much research attention in recent years (Domingos, 1999; Margineantu, 2001; Abe et al., 2004; Beygelzimer et al., 2005; Langford and Beygelzimer, 2005; Beygelzimer et al., 2007).

Abe et al. (2004) grouped existing research on cost-sensitive classification into three categories: making a particular classifier cost-sensitive, making the prediction procedure cost-sensitive, and making the training procedure cost-sensitive. The third category contains mostly meta-algorithms that reweight training examples before feeding them into the underlying learning algorithm. Such a meta-algorithm can be used to make any existing algorithm cost-sensitive. While a promising meta-algorithm exists and is well-understood for cost-sensitive binary classification (Zadrozny et al., 2003), the counterpart for multiclass classification remains an ongoing research issue (Abe et al., 2004; Langford and Beygelzimer, 2005; Zhou and Liu, 2006).

In this paper, we propose a general meta-algorithm that reduces cost-sensitive multiclass classification tasks to regular classification ones. The meta-algorithm is based on the cost-transformation technique, which converts one cost to another by not only reweighting the original training examples, but also relabeling them. We show that any cost can be transformed to the regular mis-classification one with the cost-transformation technique. As a consequence, general cost-sensitive classification and general regular classification tasks are equivalent in terms of hardness.

We further couple the meta-algorithm with another popular meta-algorithm in regular classification: the one-versus-one (OVO) decomposition from multiclass to binary. The resulting algorithm, which is called cost-sensitive one-versus-one (CSOVO), can perform cost-sensitive multiclass classification with any base binary classifier. Interestingly, CSOVO is algorithmically similar to an existing meta-algorithm for cost-sensitive classification: weighted all-pairs (WAP; Beygelzimer et al., 2005). Nevertheless, CSOVO is somewhat simpler to implement than WAP. Our experimental results on real-world data sets demonstrate that CSOVO performs similarly to (and sometimes even better than) WAP, while both of them can be significantly better than OVO. Therefore, CSOVO is a preferable OVO-type cost-sensitive classification algorithm. Moreover, when compared with other meta-algorithms that reduce cost-sensitive classification to binary classification (namely, the error-correcting output code (Langford and Beygelzimer, 2005), tree (Beygelzimer et al., 2005), filter tree and all-pair filter tree (Beygelzimer et al., 2007) decompositions), we see that CSOVO can often achieve the best test performance. Those results further validate the usefulness of CSOVO.

The paper is organized as follows. In Section 2, we formalize the cost-sensitive classification setup. Then, we present the cost-transformation technique with its theoretical implications in Section 3, and derive our proposed CSOVO algorithm in Section 4. Finally, we compare CSOVO with other algorithms empirically in Section 5 and conclude in Section 6.

2. Problem Setup

We start by defining the setup that will be used in this paper.

Definition 1 (weighted classification) Assume that there is an unknown distribution $D_w$ on $X \times Y \times \mathbb{R}^+$, where the input space $X \subseteq \mathbb{R}^D$ and the label space $Y = \{1, 2, \cdots, K\}$. A weighted example is a tuple $(x, y, w) \in X \times Y \times \mathbb{R}^+$, where the non-negative number $w \in \mathbb{R}^+$ is called the weight. In the weighted classification setup, we are given a set of i.i.d. weighted training examples $S_w = \{(x_n, y_n, w_n)\}_{n=1}^{N} \sim D_w^N$. Use

$$E(g, D) \equiv \mathbb{E}_{(x,y,w) \sim D} \; w \cdot \llbracket y \neq g(x) \rrbracket$$

to denote the expected weighted classification error of any classifier $g \colon X \to Y$ with respect to some distribution $D$. The goal of weighted classification is to use $S_w$ to find a classifier $\hat{g}$ such that $E(\hat{g}, D_w)$ is small.

For K = 2, the setup is called (weighted) binary classification; for K > 2, the setup is called (weighted) multiclass classification. When the weights are constants (say, 1), weighted classification becomes a special case called regular classification, which has been widely and deeply studied for years (Beygelzimer et al., 2005). In general, weighted classification can be easily reduced to regular classification for both binary and multiclass cases using the famous COSTING reduction (Zadrozny et al., 2003). In addition, many of the existing regular classification algorithms can be easily extended to perform weighted classification. Thus, there are plenty of useful theoretical and algorithmic tools for both weighted classification and regular classification (Beygelzimer et al., 2005).

The main setup that we will study in this paper is cost-sensitive classification, which is more general than weighted classification.

Definition 2 (cost-sensitive classification) Assume that there is an unknown distribution $D_c$ on $X \times Y \times \mathbb{R}^K$. A cost-sensitive example is a tuple $(x, y, c) \in X \times Y \times \mathbb{R}^K$, where $c[k]$ denotes the cost to be paid when $x$ is predicted as category $k$. In the cost-sensitive classification setup, we are given a set of i.i.d. cost-sensitive training examples $S_c = \{(x_n, y_n, c_n)\}_{n=1}^{N} \sim D_c^N$. We shall reuse

$$E(g, D) \equiv \mathbb{E}_{(x,y,c) \sim D} \; c[g(x)]$$

to denote the expected cost of any classifier $g \colon X \to Y$ with respect to some distribution $D$. The goal of cost-sensitive classification is to use $S_c$ to find a classifier $\hat{g}$ such that $E(\hat{g}, D_c)$ is small.

We make two remarks here. First, when looking at the definition of E, we see that the label y is actually not needed in evaluating the classifier g. We keep the label there to better illustrate the connection between cost-sensitive and regular/weighted classification.

Naturally, we assume that $D_c$ would only generate examples $(x, y, c)$ such that $c[y] = c_{\min} = \min_{1 \le \ell \le K} c[\ell]$.

Secondly, let us define the classification cost vector $c_c^{(\ell)}[k] \equiv \llbracket \ell \neq k \rrbracket$. We see that weighted classification is a special case of cost-sensitive classification using $c = w \cdot c_c^{(y)}$ as the cost vector of $(x, y, w)$, and regular classification is a special case of cost-sensitive classification using $c = c_c^{(y)}$ as the cost vector.
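The two special cases above can be made concrete with a short sketch. The function names and the list-based representation of cost vectors are illustrative assumptions, not part of the paper; labels are taken to be 1, ..., K as in Definition 1.

```python
def classification_cost(label, num_classes):
    """Classification cost vector: 0 at `label`, 1 elsewhere (labels are 1..K)."""
    return [0.0 if k == label else 1.0 for k in range(1, num_classes + 1)]


def weighted_cost(label, weight, num_classes):
    """Cost vector w * c_c^(y) that encodes a weighted example (x, y, w)."""
    return [weight * v for v in classification_cost(label, num_classes)]


def average_cost(predictions, cost_vectors):
    """Empirical counterpart of E(g, D): the mean of c[g(x)] over the examples."""
    total = sum(c[g - 1] for g, c in zip(predictions, cost_vectors))
    return total / len(cost_vectors)
```

With these cost vectors, `average_cost` reduces to the usual (weighted) error rate, which is the sense in which regular and weighted classification are special cases.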

While both regular and weighted classification have been widely studied, cost-sensitive classification is theoretically well-understood only in the binary case (Zadrozny et al., 2003), in which weighted classification and cost-sensitive classification simply coincide. Next, we will introduce the cost-transformation technique, which allows us to tightly connect cost-sensitive classification with regular/weighted classification, and helps understand cost-sensitive classification better in the multiclass case.

3. Cost Transformation

Cost-transformation is a tool that connects a cost vector to other cost vectors. In particular, we hope to link any cost vector $c$ with the classification cost vectors $C_c = \{c_c^{(\ell)}\}_{\ell=1}^{K}$, because the link allows us to reduce cost-sensitive classification (which deals with $c$) to regular classification (which deals with $C_c$). We start introducing the cost-transformation technique by making two definitions about the relations between cost vectors. The first definition relates two cost vectors $\tilde{c}$ and $c$.

Definition 3 (similar cost vectors) A cost vector $\tilde{c}$ is similar to $c$ by $\Delta$ if and only if $\tilde{c}[\cdot] = c[\cdot] + \Delta$ with some constant $\Delta$.

For instance, (4, 3, 2, 3) is similar to (2, 1, 0, 1) by 2. We shall omit the “by ∆” part when it is clear from the context. Note that when ˜c is similar to c, using ˜c for evaluating a prediction g(x) is equivalent to using c plus a constant cost of ∆. The constant shifting from c to ˜c does not change the relative cost difference between the prediction g(x) and the best prediction y.

Next, we relate a cost vector $c$ to a set of cost vectors $C_b$.

Definition 4 (decomposable cost vectors) A cost vector $c$ is decomposable to a set of base cost vectors $C_b = \{c_b^{(t)}\}_{t=1}^{T}$ if and only if there exist non-negative coefficients $q[t]$ such that $c[\cdot] = \sum_{t=1}^{T} q[t] \cdot c_b^{(t)}[\cdot]$.

That is, a cost vector $c$ is decomposable to $C_b$ if we can split $c$ into a conic combination of the base cost vectors $c_b^{(t)}$. Why is such a decomposition useful? Let us take a cost-sensitive example $(x, y, c)$ and choose the classification cost $C_c$ as $C_b$. If $c$ is decomposable to $C_c$, then for any classifier $g$,

$$c[g(x)] = \sum_{\ell=1}^{K} q[\ell] \cdot c_c^{(\ell)}[g(x)] = \sum_{\ell=1}^{K} q[\ell] \cdot \llbracket \ell \neq g(x) \rrbracket .$$

That is, if we randomly generate $\ell$ proportional to $q[\ell]$ and relabel the cost-sensitive example $(x, y, c)$ to a regular one $(x, \ell, w = 1)$, then the cost that any classifier $g$ needs to pay for its prediction on $x$ is proportional to the expected classification error, where the expectation is taken with respect to the relabeling process. Thus, if a classifier $g$ performs well for the “relabeled” (regular classification) task, it would also perform well for the original cost-sensitive classification task. The non-negativity of $q[\ell]$ ensures that $q$ can be normalized to form a probability distribution.¹
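The relabeling argument can be illustrated with a minimal sketch. The helper names, the particular coefficient vector `q` and the fixed random seed are assumptions chosen for illustration only.

```python
import random


def decomposed_cost(q, prediction):
    """Sum of q[l] over all labels l != prediction (labels are 1..K)."""
    return sum(w for label, w in enumerate(q, start=1) if label != prediction)


def relabel(q, rng):
    """Draw a label with probability proportional to q (q >= 0, not all zero)."""
    labels = list(range(1, len(q) + 1))
    return rng.choices(labels, weights=q, k=1)[0]


# Illustrative non-negative coefficients for K = 4 (an assumption).
q = [0.0, 3.0, 6.0, 3.0]
rng = random.Random(0)
draws = [relabel(q, rng) for _ in range(1000)]

# The cost paid for predicting k equals the total mass of the other labels,
# i.e. sum(q) times the probability that a relabeled example disagrees with k.
costs = [decomposed_cost(q, k) for k in range(1, 5)]
```

Here a label with a zero coefficient is never drawn, and the prediction with the largest coefficient pays the smallest cost.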

Definition 4 is a key part of the cost-transformation technique. It not only allows us to transform one cost vector $c$ to an equivalent representation $\{(q[t], c_b^{(t)})\}$ for some general $C_b$, but more specifically also lets us relabel a cost-sensitive example $(x, y, c)$ to another (randomized) regular example $(x, \ell, 1)$. However, is every cost vector $c$ decomposable to the classification cost $C_c$? The short answer is no. For instance, the cost vector $c = (6, 3, 0, 3)$ is not decomposable to $C_c$, because $c$ yields a unique linear decomposition of $C_c$ with some negative coefficients: $c = -2 \cdot c_c^{(1)} + 1 \cdot c_c^{(2)} + 4 \cdot c_c^{(3)} + 1 \cdot c_c^{(4)}$.

Although the cost vector (6, 3, 0, 3) itself is not decomposable to $C_c$, we can easily see that its similar cost vector, (12, 9, 6, 9), is decomposable to $C_c$. In fact, for any cost vector $c$, there is an infinite number of its similar cost vectors $\tilde{c}$ that are decomposable to $C_c$, as formalized below.
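The two numerical claims above can be checked with a small sketch. It relies on the observation that $c[k] = \sum_{\ell \neq k} q[\ell]$ implies $\sum_k c[k] = (K-1) \sum_\ell q[\ell]$; the function names are hypothetical.

```python
def coefficients_for(c):
    """Solve c[k] = sum_{l != k} q[l] for q.

    Summing over k gives sum(c) = (K - 1) * sum(q), hence
    q[k] = sum(c) / (K - 1) - c[k].
    """
    num_classes = len(c)
    total_q = sum(c) / (num_classes - 1)
    return [total_q - ck for ck in c]


def is_decomposable(c):
    """True when every coefficient of the unique linear decomposition is >= 0."""
    return all(qk >= 0 for qk in coefficients_for(c))
```

For (6, 3, 0, 3) this yields the coefficients (-2, 1, 4, 1), so the vector is not decomposable, whereas the similar vector (12, 9, 6, 9) yields (0, 3, 6, 3).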

Theorem 5 (decomposition via a similar vector) Consider any cost vector $c$. Assume that $\tilde{c}$ is similar to $c$ by $\Delta$. Then, $\tilde{c}$ is decomposable to $C_c$ if and only if

$$\Delta \ge (K-1)\, c_{\max} - \sum_{k=1}^{K} c[k] ,$$

where $c_{\max} = \max_{1 \le \ell \le K} c[\ell]$.

1. We take a minor assumption that not all $q[\ell]$ are zero. Otherwise $c = 0$ and the example $(x, y, c)$ can be simply discarded.

The proof is based on the fact that $C_c$ is linearly independent while spanning $\mathbb{R}^K$, and the constant cost vector $(\Delta, \Delta, \cdots, \Delta)$ with a positive $\Delta$ is decomposable to $C_c$ with $q[\ell] = \frac{\Delta}{K-1}$.

From Theorem 5, there are infinitely many cost vectors $\tilde{c}$ that we can use. The next question is, which one is preferable? Recall that with a given $\tilde{q}$, the relabeling probability distribution is $\tilde{p}[\ell] = \tilde{q}[\ell] \big/ \sum_{k=1}^{K} \tilde{q}[k]$. To reduce the variance with respect to the relabeling process, one possibility is to require the discrete probability distribution $\tilde{p}[\cdot]$ to be of the least entropy. That is, we want to solve the following optimization problem.

$$\begin{aligned}
\min_{\tilde{p}, \tilde{q}, \Delta} \quad & \sum_{\ell=1}^{K} \tilde{p}[\ell] \log \frac{1}{\tilde{p}[\ell]} , \\
\text{subject to} \quad & c[\cdot] = \sum_{\ell=1}^{K} \tilde{q}[\ell] \cdot c_c^{(\ell)}[\cdot] - \Delta, \qquad \tilde{p}[\cdot] = \tilde{q}[\cdot] \Big/ \sum_{k=1}^{K} \tilde{q}[k] ; \\
& \Delta \ge (K-1)\, c_{\max} - \sum_{k=1}^{K} c[k] .
\end{aligned} \tag{1}$$

Theorem 6 (decomposition with minimum-entropy) If not all $c[\ell]$ are equal, the unique optimal solution to (1) is

$$\tilde{q}[\ell] = c_{\max} - c[\ell] , \tag{2}$$

$$\Delta = (K-1)\, c_{\max} - \sum_{k=1}^{K} c[k] . \tag{3}$$
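As a numerical illustration of equations (2) and (3), the sketch below computes the minimum-entropy decomposition and compares its entropy with that of a larger shift. The function names are hypothetical, and the comparison is an example on one cost vector, not a proof.

```python
import math


def min_entropy_decomposition(c):
    """Return (q, delta) following equations (2) and (3)."""
    c_max = max(c)
    q = [c_max - ck for ck in c]
    delta = (len(c) - 1) * c_max - sum(c)
    return q, delta


def shifted_coefficients(c, delta):
    """Coefficients of the similar vector c + delta with respect to C_c."""
    num_classes = len(c)
    total_q = (sum(c) + num_classes * delta) / (num_classes - 1)
    return [total_q - (ck + delta) for ck in c]


def entropy(q):
    """Entropy of the relabeling distribution p = q / sum(q), with 0 log 0 = 0."""
    total = sum(q)
    return -sum((v / total) * math.log(v / total) for v in q if v > 0)
```

For c = (6, 3, 0, 3), the smallest feasible shift is 6 and gives coefficients (0, 3, 6, 3); a larger shift such as 9 gives (1, 4, 7, 4), whose relabeling distribution has higher entropy.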

The proof is listed in Appendix A. Note that the resulting $\Delta$ is the smallest one that makes $\tilde{c}$ decomposable to $C_c$. Using Theorem 6, we can then define the following distribution $D_r(x, \ell, w)$ from $D_c(x, y, c)$.

$$D_r(x, \ell, w) = \llbracket w = 1 \rrbracket \cdot \Lambda_1^{-1} \cdot \int_{y,c} \tilde{q}[\ell] \cdot D_c(x, y, c),$$

where $\tilde{q}[\ell]$ is computed from $c$ using (2) and

$$\Lambda_1 = \int_{x,y,c} \sum_{\ell=1}^{K} \tilde{q}[\ell] \cdot D_c(x, y, c)$$

is a normalization constant.² Then, we can derive the following theorem.

Theorem 7 (cost-transformation) For any classifier $g$, $E(g, D_c) = \Lambda_1 \cdot E(g, D_r) - \Lambda_2$, where $\Lambda_2$ is a constant that can be computed by integrating over the $\Delta$ term associated with each $c \sim D_c$ from (3).

2. Even when all $c[\ell]$ are equal, equation (2) can still be used to get $\tilde{q}[\ell] = 0$ for all $\ell$, which means the example $(x, y, c)$ can be dropped instead of relabeled.

Theorem 7 can then be used to further prove the following regret equivalence theorem.

Theorem 8 (regret equivalence) Consider $D_c$ and its associated $D_r$. If $g^*$ is the optimal classifier under $D_c$, and $\tilde{g}^*$ is the optimal classifier under the associated $D_r$, then for any classifier $g$,

$$E(g, D_c) - E(g^*, D_c) = \Lambda_1 \cdot \bigl( E(g, D_r) - E(\tilde{g}^*, D_r) \bigr) .$$

That is, if a regular classification algorithm $A_r$ can return some $\hat{g}$ that is close to $\tilde{g}^*$ under $D_r$, the very same $\hat{g}$ would be close to the optimal classifier $g^*$ for the original cost-sensitive classification task.
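The chain of reasoning behind Theorems 7 and 8 can be sketched as follows. This is a reconstruction from the definitions of $D_r$, $\Lambda_1$ and $\Lambda_2$ given in this section, under the assumption that $\Lambda_1 > 0$; it is not a reproduction of the original proof.

```latex
\begin{align*}
c[g(x)] &= \sum_{\ell=1}^{K} \tilde{q}[\ell] \, \llbracket \ell \neq g(x) \rrbracket - \Delta
  && \text{(first constraint of (1))} \\
E(g, D_c) &= \int_{x,y,c} \Bigl( \sum_{\ell=1}^{K} \tilde{q}[\ell] \, \llbracket \ell \neq g(x) \rrbracket \Bigr) D_c(x, y, c)
  \;-\; \int_{x,y,c} \Delta \, D_c(x, y, c) \\
 &= \Lambda_1 \cdot E(g, D_r) - \Lambda_2 .
\end{align*}
```

Because $\Lambda_1$ and $\Lambda_2$ do not depend on $g$, a classifier minimizes $E(\cdot, D_c)$ exactly when it minimizes $E(\cdot, D_r)$, so $E(g^*, D_c) = \Lambda_1 \cdot E(\tilde{g}^*, D_r) - \Lambda_2$. Subtracting this from the last line above cancels $\Lambda_2$ and gives the regret equivalence.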

Theoretically, Theorem 8 indicates an equivalence in terms of hardness between general cost-sensitive classification tasks and general regular classification tasks. Then, we can reduce cost-sensitive classification to (weighted) regular classification using Algorithm 1, called training set expansion and weighting (TSEW). The algorithm carries the information in $\tilde{q}[\ell]$ as weights of multi-labeled examples.

Algorithm 1: Training Set Expansion and Weighting

1. Obtain $NK$ training examples $S_w = \{(x_{n\ell}, y_{n\ell}, w_{n\ell})\}$:

(a) Transform each $(x_n, y_n, c_n)$ to $(x_n, \tilde{q}_n)$ by (2).

(b) For every $\ell$, let $(x_{n\ell}, y_{n\ell}, w_{n\ell}) = (x_n, \ell, \tilde{q}_n[\ell])$.

(c) Add $(x_{n\ell}, y_{n\ell}, w_{n\ell})$ to $S_w$.

2. Use a weighted classification algorithm $A_w$ on $S_w$ to obtain a classifier $\hat{g}_w$.

3. Return $\hat{g} \equiv \hat{g}_w$.
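Step 1 of Algorithm 1 can be sketched in a few lines. The function name and the tuple-based data layout are assumptions; the weighted learner $A_w$ of step 2 is left out because the paper does not fix it at this point.

```python
def tsew_expand(cost_examples):
    """Expand N cost-sensitive examples (x, y, c) into N*K weighted examples.

    Each (x, y, c) yields one example (x, label, c_max - c[label]) per label,
    following equation (2); labels are 1..K.
    """
    weighted = []
    for x, _y, c in cost_examples:
        c_max = max(c)
        for label, cost in enumerate(c, start=1):
            weighted.append((x, label, c_max - cost))
    return weighted
```

Note that the same input `x` appears with every label, which is the multi-labeling ambiguity discussed in the text.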

The TSEW algorithm is a good representative of our proposed cost-transformation technique. Note that TSEW is actually the same as the data space expansion (DSE) algorithm proposed by Abe et al. (2004). Nevertheless, our derivation from the minimum entropy perspective is novel, and our theoretical results on the out-of-sample cost $E(g, D_c)$ are more general than the in-sample cost analysis by Abe et al. (2004). Xia et al. (2007) also proposed an algorithm similar to TSEW using LogitBoost as $A_w$ based on a restricted version of Theorem 7. It should be noted that the results discussed in this section are partially influenced by the work of Abe et al. (2004) but are independent from the work of Xia et al. (2007).

From the experimental results in the literature, a direct use of TSEW (DSE) does not perform well in practice (Abe et al., 2004). A possible explanation is that $S_w$ carries multi-labeling ambiguities. That is, the same input vector $x_n$ can come with many different labels (with possibly different weights) in $S_w$. Thus, common $A_w$ can find $S_w$ too difficult to digest (Xia et al., 2007). One could improve the basic TSEW algorithm by using (or designing) an $A_w$ that is robust to multi-labeled training feature vectors. We shall present one such algorithm in the next section.

4. Cost-Sensitive One-Versus-One

In this section, we propose a novel cost-sensitive classification algorithm by coupling the cost-transformation technique with the popular and robust one-versus-one (OVO) algorithm for regular classification. Before we get into our proposed cost-sensitive one-versus-one (CSOVO) algorithm, we shall introduce the original OVO first.

4.1. Original One-versus-one

We shall present a weighted version of OVO here. As shown in Algorithm 2, OVO decomposes the multiclass classification task into $K(K-1)/2$ binary classification subtasks. Because of the $O(K^2)$ growth in the number of subtasks, OVO is usually more suited when $K$ is not too large (Hsu and Lin, 2002).

Algorithm 2: One-versus-one (Hsu and Lin, 2002)

1. For each $i, j$ that $1 \le i < j \le K$,

(a) Take the original $S_w = \{(x_n, y_n, w_n)\}_{n=1}^{N}$ and construct a binary training set $S_b^{(i,j)} = \{(x_n, y_n, w_n) : y_n = i \text{ or } j\}$.

(b) Use a weighted binary classification algorithm $A_b$ on $S_b^{(i,j)}$ to get a binary classifier $\hat{g}_b^{(i,j)}$.

2. Return $\hat{g}(x) = \operatorname{argmax}_{1 \le \ell \le K} \sum_{i<j} \llbracket \hat{g}_b^{(i,j)}(x) = \ell \rrbracket$.
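The data flow of Algorithm 2 can be sketched as below. The base learner $A_b$ is omitted; the function names are hypothetical, and breaking voting ties in favor of the smallest label is an assumption, since the algorithm as stated does not specify a tie-breaking rule.

```python
from itertools import combinations


def ovo_subsets(weighted_examples, num_classes):
    """Step 1(a): one binary training set per pair (i, j) with i < j."""
    subsets = {}
    for i, j in combinations(range(1, num_classes + 1), 2):
        subsets[(i, j)] = [(x, y, w) for x, y, w in weighted_examples if y in (i, j)]
    return subsets


def ovo_vote(pair_predictions, num_classes):
    """Step 2: return the label with the most votes for one input.

    pair_predictions maps each pair (i, j) to the label (i or j) predicted by
    the corresponding binary classifier. Ties go to the smallest label.
    """
    votes = [0] * (num_classes + 1)
    for predicted in pair_predictions.values():
        votes[predicted] += 1
    return max(range(1, num_classes + 1), key=lambda label: votes[label])
```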

In short, each binary classification subtask consists of comparing examples from two categories only. That is, each $\hat{g}_b^{(i,j)}(x)$ intends to predict whether $x$ “prefers” category $i$ or category $j$, and $\hat{g}$ predicts with the preference votes gathered from those $\hat{g}_b^{(i,j)}$. The goal of $A_b$ is to locate binary classifiers $\hat{g}_b^{(i,j)}$ with a small $E\bigl(\hat{g}_b^{(i,j)}, D_{\mathrm{OVO}}^{(i,j)}\bigr)$, where

$$D_{\mathrm{OVO}}^{(i,j)}(x, y, u) = \bigl\llbracket u = \llbracket y = i \text{ or } j \rrbracket \bigr\rrbracket \int_{w} D_r(x, y, w) .$$

In particular, it has been proved (Beygelzimer et al., 2005) that

$$E(\hat{g}, D_r) \le 2 \sum_{i<j} E\bigl(\hat{g}_b^{(i,j)}, D_{\mathrm{OVO}}^{(i,j)}\bigr) .$$

That is, if the $E\bigl(\hat{g}_b^{(i,j)}, D_{\mathrm{OVO}}^{(i,j)}\bigr)$ are all small, then $E(\hat{g}, D_r)$ should also be small.

4.2. Cost-sensitive One-versus-one

By coupling OVO with the cost-transformation technique (TSEW in Algorithm 1), we can easily get a preliminary version of CSOVO in Algorithm 3.

Algorithm 3: TSEW-OVO

1. For each $i, j$ that $1 \le i < j \le K$,

(a) Transform each cost-sensitive example $(x_n, y_n, c_n)$ to $(x_n, \tilde{q}_n)$ by (2).

(b) Use all the $(x_n, \tilde{q}_n)$ to construct a binary classification training set $S_b^{(i,j)} = \bigl\{(x_n, i, w_n^{(i)})\bigr\} \cup \bigl\{(x_n, j, w_n^{(j)})\bigr\}$, where $w_n^{(\ell)} = \tilde{q}_n[\ell]$.

(c) Use a weighted binary classification algorithm $A_b$ on $S_b^{(i,j)}$ to get a binary classifier $\hat{g}_b^{(i,j)}$.

2. Return $\hat{g}(x) = \operatorname{argmax}_{1 \le \ell \le K} \sum_{i<j} \llbracket \hat{g}_b^{(i,j)}(x) = \ell \rrbracket$.

One thing to notice in Algorithm 3 is that each training example $(x_n, y_n)$ may be split into two examples $(x_n, i, w_n^{(i)})$ and $(x_n, j, w_n^{(j)})$ for each $S_b^{(i,j)}$. That is, the example is ambiguously presented for each binary classification subtask. We can take a simple trick to eliminate the ambiguity before training. In particular, we keep only the label (say, $i$) that comes with a larger weight, and adjust its weight to $\bigl| w_n^{(i)} - w_n^{(j)} \bigr|$. The trick follows from the same principle as shifting the cost vectors to a similar one. Then, we can eliminate one unnecessary example and remove the multi-labeling ambiguity in the binary classification subtask.

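Recall that $(x_n, i, w_n^{(i)})$ would be of weight $w_n^{(i)} = \tilde{q}_n[i]$ and $(x_n, j, w_n^{(j)})$ would be of weight $w_n^{(j)} = \tilde{q}_n[j]$. By the discussion above, the simplified $S_b^{(i,j)}$ is

$$\Bigl\{ \Bigl( x_n, \operatorname*{argmax}_{\ell = i \text{ or } j} \tilde{q}_n[\ell], \bigl| \tilde{q}_n[i] - \tilde{q}_n[j] \bigr| \Bigr) \Bigr\} = \Bigl\{ \Bigl( x_n, \operatorname*{argmin}_{\ell = i \text{ or } j} c_n[\ell], \bigl| c_n[i] - c_n[j] \bigr| \Bigr) \Bigr\} . \tag{4}$$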

Then, we get our proposed CSOVO algorithm.

An intuitive explanation is that CSOVO asks each binary classifier $\hat{g}_b^{(i,j)}$ to answer the question “is $c[i]$ or $c[j]$ smaller for this $x$?” We can easily see that CSOVO (Algorithm 4) takes OVO (Algorithm 2) as a special case when using only the weighted classification cost vectors $(w \cdot c_c^{(\ell)})$.
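The construction of the CSOVO binary training sets by equation (4) can be sketched as follows. The function name and data layout are hypothetical, and assigning tied pairs to label `i` is an assumption that has no effect on training because such examples receive weight 0.

```python
from itertools import combinations


def csovo_subsets(cost_examples, num_classes):
    """Build S_b^(i,j) for every pair i < j from examples (x, y, c).

    Each example contributes (x, argmin of c over {i, j}, |c[i] - c[j]|).
    """
    subsets = {pair: [] for pair in combinations(range(1, num_classes + 1), 2)}
    for x, _y, c in cost_examples:
        for i, j in subsets:
            label = i if c[i - 1] <= c[j - 1] else j
            weight = abs(c[i - 1] - c[j - 1])
            subsets[(i, j)].append((x, label, weight))
    return subsets
```

With a weighted classification cost vector such as c = (2, 0, 2), which encodes label 2 with weight 2, the pairs containing label 2 receive the example with label 2 and weight 2, while the pair (1, 3) receives it with weight 0. This matches what OVO would feed to its binary learners.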

4.3. Theoretical Guarantee

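Algorithm 4: Cost-sensitive One-versus-one

1. For each $i, j$ that $1 \le i < j \le K$,

(a) Take the original $S_c = \{(x_n, y_n, c_n)\}_{n=1}^{N}$ and construct $S_b^{(i,j)}$ by (4).

(b) Use a weighted binary classification algorithm $A_b$ on $S_b^{(i,j)}$ to get a binary classifier $\hat{g}_b^{(i,j)}$.

2. Return $\hat{g}(x) = \operatorname{argmax}_{1 \le \ell \le K} \sum_{i<j} \llbracket \hat{g}_b^{(i,j)}(x) = \ell \rrbracket$.

Next, we analyze the theoretical guarantee of Algorithm 4. Note that each created example $\bigl( x_n, \operatorname{argmin}_{\ell = i \text{ or } j} c_n[\ell], \bigl| c_n[i] - c_n[j] \bigr| \bigr)$ can be thought of as coming from a distribution

$$D_{\mathrm{CSOVO}}^{(i,j)}(x, k, u) = \int_{y,c} \Bigl\llbracket k = \operatorname*{argmin}_{\ell = i \text{ or } j} c[\ell] \Bigr\rrbracket \Bigl\llbracket u = \bigl| c[i] - c[j] \bigr| \Bigr\rrbracket D_c(x, y, c) .$$

We then get the following theorem: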

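Theorem 9 Consider any family of binary classifiers $\bigl\{ g_b^{(i,j)} \colon X \to \{i, j\} \bigr\}_{1 \le i < j \le K}$. Let $g(x) = \operatorname{argmax}_{1 \le \ell \le K} \sum_{i<j} \llbracket g_b^{(i,j)}(x) = \ell \rrbracket$. Then,

$$E(g, D_c) - \mathbb{E}_{(x,y,c) \sim D_c} \, c_{\min} \le 2 \sum_{i<j} E\bigl( g_b^{(i,j)}, D_{\mathrm{CSOVO}}^{(i,j)} \bigr) . \tag{5}$$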

Proof For each $(x, y, c)$ generated from $D_c$, if $c[g(x)] = c[y] = c_{\min}$, its contribution on the left-hand side is 0, which is trivially no more than its contribution on the right-hand side.

Without loss of generality (by sorting the elements of the cost vector $c$ and shuffling the labels $y \in Y$), consider an example $(x, y, c)$ such that

$$c_{\min} = c[1] \le c[2] \le \ldots \le c[K] = c_{\max} .$$

From the results of Beygelzimer et al. (2005), suppose $g(x) = k$; then for each $1 \le \ell \le k-1$, there are at least $\lceil \ell/2 \rceil$ pairs $(i, j)$, where $i \le \ell < j$, and

$$g_b^{(i,j)}(x) \neq \operatorname*{argmin}_{\ell' = i \text{ or } j} c[\ell'] .$$

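Therefore, the contribution of $(x, y, c)$ on the right-hand side is no less than

$$\sum_{\ell=1}^{k-1} \bigl( c[\ell+1] - c[\ell] \bigr) \left\lceil \frac{\ell}{2} \right\rceil \ge \frac{1}{2} \sum_{\ell=1}^{k-1} \ell \, \bigl( c[\ell+1] - c[\ell] \bigr) = \frac{1}{2} \sum_{\ell=1}^{k-1} \bigl( c[k] - c[\ell] \bigr) \ge \frac{1}{2} \bigl( c[k] - c_{\min} \bigr) ,$$

and the left-hand-side contribution is $(c[k] - c_{\min})$. The desired result can be proved by integrating over all of $D_c$.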
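The middle equality in the chain above follows from writing each difference as a telescoping sum and exchanging the order of summation. The short derivation below is a sketch of that step, with $m$ introduced as an auxiliary index.

```latex
\begin{align*}
\sum_{\ell=1}^{k-1} \bigl( c[k] - c[\ell] \bigr)
 &= \sum_{\ell=1}^{k-1} \sum_{m=\ell}^{k-1} \bigl( c[m+1] - c[m] \bigr) \\
 &= \sum_{m=1}^{k-1} \sum_{\ell=1}^{m} \bigl( c[m+1] - c[m] \bigr)
  = \sum_{m=1}^{k-1} m \, \bigl( c[m+1] - c[m] \bigr) .
\end{align*}
```

The final inequality of the chain then holds because every term $c[k] - c[\ell]$ is non-negative under the sorted order, and the $\ell = 1$ term alone equals $c[k] - c_{\min}$.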

Thus, similar to the original OVO algorithm, if the $E\bigl( \hat{g}_b^{(i,j)}, D_{\mathrm{CSOVO}}^{(i,j)} \bigr)$ are all small, then the resulting $E(\hat{g}, D_c)$ should also be small.

4.4. A Sibling Algorithm: Weighted All-pairs

Note that a proof similar to that of Theorem 9 was made by Beygelzimer et al. (2005) to analyze another algorithm called weighted all-pairs (WAP). As illustrated in Algorithm 5, the WAP algorithm shares many similar algorithmic structures with CSOVO. In particular, we see that except for the difference between equations (4) and (6), WAP is exactly the same as CSOVO.

Algorithm 5: A Special Version of WAP (Beygelzimer et al., 2005)

Run CSOVO, while replacing (4) in step 1(a) with

$$S_b^{(i,j)} = \Bigl\{ \Bigl( x_n, \operatorname*{argmin}_{\ell = i \text{ or } j} c_n[\ell], \bigl| v_n[i] - v_n[j] \bigr| \Bigr) \Bigr\} , \tag{6}$$

where

$$v_n[i] = \int_{c_{\min}}^{c_n[i]} \frac{1}{\bigl| \{ k : c_n[k] \le t \} \bigr|} \, dt .$$

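Define

$$D_{\mathrm{WAP}}^{(i,j)}(x, k, u) = \int_{y,c} \Bigl\llbracket k = \operatorname*{argmin}_{\ell = i \text{ or } j} c[\ell] \Bigr\rrbracket \Bigl\llbracket u = \bigl| v[i] - v[j] \bigr| \Bigr\rrbracket D_c(x, y, c) ,$$

where $v$ is computed from $c$ using a definition similar to the one in Algorithm 5. The cost bound of WAP is then (Beygelzimer et al., 2005)

$$E(g, D_c) - \mathbb{E}_{(x,y,c) \sim D_c} \, c_{\min} \le 2 \sum_{i<j} E\bigl( g_b^{(i,j)}, D_{\mathrm{WAP}}^{(i,j)} \bigr) . \tag{7}$$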

Note that we can let $v'_n[i] \equiv c_n[i] - c_{\min} = \int_{c_{\min}}^{c_n[i]} (1) \, dt$. Then, CSOVO equivalently uses $\bigl| v'_n[i] - v'_n[j] \bigr|$ as the underlying example weight. It is not hard to see that for any given example $(x, y, c)$, the associated

$$\bigl| v'_n[i] - v'_n[j] \bigr| \ge \bigl| v_n[i] - v_n[j] \bigr| .$$

Thus, the WAP cost bound (7) is tighter than the CSOVO one (5) when using the same binary classifiers $\bigl\{ \hat{g}_b^{(i,j)} \bigr\}$. In particular, while the right-hand sides of (5) and (7) look similar, the total weight that CSOVO takes in $D_{\mathrm{CSOVO}}^{(i,j)}$ is larger than the total weight that WAP takes in $D_{\mathrm{WAP}}^{(i,j)}$. The difference allows WAP to have an $O(K)$ regret transform (Beygelzimer et al., 2005) instead of the $O(K^2)$ one of CSOVO (Theorem 9). Thus, for binary classifiers $\bigl\{ \hat{g}_b^{(i,j)} \bigr\}$ with the same error rate, it appears that WAP is better than CSOVO because of the tighter upper bound. However, CSOVO enjoys the advantage of efficiency and simplicity in implementation, because equation (6) would require a complete sorting of each cost vector (of size $K$) to compute while (4) only needs a simple subtraction. In the next section, we shall study whether the tighter cost bound with the additional complexity (WAP) leads to better empirical performance than the other way around (CSOVO).
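The difference in implementation effort can be seen in a small sketch of the WAP quantity from Algorithm 5. The function name is hypothetical; the code evaluates the integral piecewise, using the fact that between two consecutive sorted costs the number of costs not exceeding t is constant.

```python
def wap_values(c):
    """v[i] = integral from c_min to c[i] of dt / |{k : c[k] <= t}|."""
    order = sorted(range(len(c)), key=lambda k: c[k])
    v = [0.0] * len(c)
    running = 0.0
    for rank in range(1, len(c)):
        prev_k, cur_k = order[rank - 1], order[rank]
        # Between these two sorted costs exactly `rank` costs are <= t.
        running += (c[cur_k] - c[prev_k]) / rank
        v[cur_k] = running
    return v


c = [6, 3, 0, 3]
v = wap_values(c)
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
# Pairwise example weights: CSOVO uses |c[i] - c[j]|, WAP uses |v[i] - v[j]|.
csovo_weights = [abs(c[i] - c[j]) for i, j in pairs]
wap_weights = [abs(v[i] - v[j]) for i, j in pairs]
```

On this cost vector every CSOVO weight is at least as large as the corresponding WAP weight, in line with the inequality stated above, and the WAP computation needs the sort that CSOVO avoids.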

4.5. Another Sibling Algorithm: All-pair Filter Tree

Another sibling algorithm of CSOVO is called the all-pair filter tree (APFT; Beygelzimer et al., 2007). APFT designs an elimination-based tournament in order to find the label with the lowest cost. During training, if an example $(x_n, y_n)$ as well as the classifiers in the lower levels of the tree allow labels $\{i, j\}$ to meet in one game of the tournament, a weighted example $\bigl( x_n, \operatorname{argmin}_{\ell = i \text{ or } j} c_n[\ell], \bigl| c_n[i] - c_n[j] \bigr| \bigr)$ is added to the training set $S_b^{(i,j)}$ for learning a classifier $\hat{g}_b^{(i,j)}$. The goal of $\hat{g}_b^{(i,j)}$ is to achieve a small $E\bigl( \hat{g}_b^{(i,j)}, D_{\mathrm{APFT}}^{(i,j)} \bigr)$, where

$$D_{\mathrm{APFT}}^{(i,j)}(x, k, u) = \int_{y,c} \bigl\llbracket i, j \text{ attend the tournament} \bigr\rrbracket \Bigl\llbracket k = \operatorname*{argmin}_{\ell = i \text{ or } j} c[\ell] \Bigr\rrbracket \Bigl\llbracket u = \bigl| c[i] - c[j] \bigr| \Bigr\rrbracket D_c(x, y, c) .$$

Note that the condition $\llbracket i, j \text{ attend the tournament} \rrbracket$ depends on lower-level classifiers that “filter” the distribution for higher-level training. During prediction, the results from $\bigl\{ \hat{g}_b^{(i,j)} \bigr\}$ are decoded using the same tournament design rather than voting.

Let $\bigl\{ g_b^{(i,j)} \bigr\}$ be a set of binary classifiers and $g_{\mathrm{APFT}}$ be the resulting classifier after decoding the predictions of $\bigl\{ g_b^{(i,j)} \bigr\}$ from the tournament. It can be shown (Beygelzimer et al., 2007) that

$$E(g_{\mathrm{APFT}}, D_c) - \mathbb{E}_{(x,y,c) \sim D_c} \, c_{\min} \le \sum_{i<j} E\bigl( g_b^{(i,j)}, D_{\mathrm{APFT}}^{(i,j)} \bigr) . \tag{8}$$

Comparing CSOVO with APFT, we see one similarity: the weighted examples included in each $S_b^{(i,j)}$. Nevertheless, note that APFT uses fewer examples than CSOVO: the former uses only pairs of labels that are met in the tournament and the latter uses all possible pairs of labels. The difference allows for a tighter error bound for APFT by conditioning on the tournament results. Thus, when using binary classifiers of the same error rate, it appears that APFT is better than CSOVO because of the tighter upper bound. Furthermore, by restricting to a specific tournament, APFT results in an $O(K)$ predicting scheme instead of the $O(K^2)$ one that CSOVO needs to take. Nevertheless, APFT essentially breaks the symmetry between classes by restricting to a specific tournament, and the reduced number of examples in each $S_b^{(i,j)}$ could degrade the practical learning performance. In the next section, we shall also study whether the tighter cost bound with the tournament restriction (APFT) leads to better empirical performance than the other way around (CSOVO).
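The $O(K)$ prediction cost of a tournament can be illustrated with a sketch. The bracket used here (adjacent labels paired in each round, with a bye for an odd label out) is purely an assumption for illustration; the text does not specify the tournament design of Beygelzimer et al. (2007), and the function name is hypothetical.

```python
def tournament_predict(x, classifiers, num_classes):
    """Decode pairwise classifiers with a single-elimination tournament.

    classifiers maps each pair (i, j) with i < j to a callable that takes x
    and returns i or j. Returns the winning label and the number of games.
    """
    alive = list(range(1, num_classes + 1))
    games = 0
    while len(alive) > 1:
        next_round = []
        for pos in range(0, len(alive) - 1, 2):
            i, j = sorted((alive[pos], alive[pos + 1]))
            next_round.append(classifiers[(i, j)](x))
            games += 1
        if len(alive) % 2 == 1:
            next_round.append(alive[-1])  # odd label out gets a bye
        alive = next_round
    return alive[0], games
```

Every game eliminates exactly one label, so any single-elimination design calls K - 1 classifiers per prediction, compared with the K(K-1)/2 calls needed for voting.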

5. Experiments

We will first compare CSOVO with the original OVO on various real-world data sets. Then, we will compare CSOVO with WAP (Beygelzimer et al., 2005) and APFT (Beygelzimer et al., 2007). All four algorithms are of OVO-type. That is, they obtain a multiclass classifier $\hat{g}$ by calling a weighted binary classification algorithm $A_b$ for $K(K-1)/2$ times. During prediction, CSOVO, OVO and WAP require gathering votes from $K(K-1)/2$ binary classifiers, while APFT determines the label by using $K-1$ of those classifiers in the tournament. In addition, we will compare CSOVO with other existing algorithms that also reduce cost-sensitive classification to weighted binary classification.

We take the support vector machine (SVM) with the perceptron kernel (Lin and Li, 2008) as $A_b$ in all the experiments and use LIBSVM (Chang and Lin, 2001) as our SVM solver. Note that SVM with the perceptron kernel is known as a strong classification algorithm (Lin and Li, 2008) and can be naturally adapted to perform weighted binary classification (Zadrozny et al., 2003).

We use ten classification data sets: zoo, glass, vehicle, vowel, yeast, segment, dna, pageblock, satimage, usps.³ The first nine come from the UCI machine learning repository (Hettich et al., 1998) and the last one is from Hull (1994).

Note that the ten data sets were originally gathered as regular classification tasks. We shall first adopt the randomized proportional (RP) cost-generation procedure that was used by Beygelzimer et al. (2005). In particular, we generate the cost vectors from a cost matrix $C(y, k)$ that does not depend on $x$. The diagonal entries $C(y, y)$ are set as 0 and each of the other entries $C(y, k)$ is a random variable sampled uniformly from $\left[ 0, \; 2000 \, \frac{|\{n : y_n = k\}|}{|\{n : y_n = y\}|} \right]$. Then, for a cost-sensitive example $(x, y, c)$, we simply take $c[k] = C(y, k)$. We acknowledge that the RP procedure may not fully reflect realistic application needs. Nevertheless, we still take the procedure as it is a longstanding benchmark for comparing general-purpose cost-sensitive classification algorithms.
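The RP procedure described above can be sketched as follows. The function name, the dictionary representation of the cost matrix and the seeded generator are assumptions; the sketch also assumes that every class occurs at least once in `labels`.

```python
import random
from collections import Counter


def rp_cost_matrix(labels, num_classes, rng):
    """Sample a randomized proportional cost matrix C[(y, k)] for labels 1..K.

    Diagonal entries are 0; C[(y, k)] is uniform on
    [0, 2000 * count(k) / count(y)], where count(.) is the class frequency.
    """
    counts = Counter(labels)
    cost = {}
    for y in range(1, num_classes + 1):
        for k in range(1, num_classes + 1):
            if y == k:
                cost[(y, k)] = 0.0
            else:
                cost[(y, k)] = rng.uniform(0.0, 2000.0 * counts[k] / counts[y])
    return cost


labels = [1, 1, 1, 2, 3, 3]
C = rp_cost_matrix(labels, 3, random.Random(0))
# Cost vector of an example with label y: c[k] = C[(y, k)].
c_for_label_2 = [C[(2, k)] for k in range(1, 4)]
```

The upper bound grows when the true class y is rare and the predicted class k is frequent, so mis-classifying examples of rare classes tends to be expensive.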

We randomly choose 75% of the examples in each data set for training and leave the other 25% of the examples as the test set. Then, each feature in the training set is linearly scaled to [−1, 1], and the feature in the test set is scaled accordingly. The results reported are all averaged over 20 trials of different training/test splits, along with the standard error.

In the coming tables, those entries within one standard error of the lowest one are marked in bold.

SVM with the perceptron kernel takes a regularization parameter (Lin and Li, 2008), which is chosen within $\{2^{-17}, 2^{-15}, \ldots, 2^{3}\}$ with a 5-fold cross-validation (CV) procedure on only the training set (Hsu et al., 2003). For the OVO algorithm, the CV procedure selects the parameter that results in the smallest cross-validation classification error. For CSOVO and other cost-sensitive classification algorithms, the CV procedure selects the parameter that results in the smallest cross-validation cost. We then re-run each algorithm on the whole training set with the chosen parameter to get the classifier $\hat{g}$. Finally, we evaluate the average performance of $\hat{g}$ with the test set.

3. All data sets except zoo, glass, yeast and pageblock are actually downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

Table 1: test cost of CSOVO/OVO

| data set | CSOVO | OVO |
|---|---|---|
| zoo | 36.79±9.09 | 105.56±33.45 |
| glass | 210.06±15.88 | 492.99±40.77 |
| vehicle | 155.14±20.63 | 185.38±17.23 |
| vowel | 20.05±1.95 | 11.90±1.96 |
| yeast | 52.45±2.97 | 5823.21±1290.65 |
| segment | 25.27±2.25 | 25.15±2.11 |
| dna | 53.18±4.25 | 48.15±3.33 |
| pageblock | 24.98±4.94 | 501.57±74.98 |
| satimage | 66.57±4.77 | 94.07±5.49 |
| usps | 20.51±1.17 | 23.62±0.66 |

5.1. CSOVO versus OVO

Table 1 compares the test cost of CSOVO and the original cost-insensitive OVO. We can see that on 7 out of the 10 data sets, CSOVO is significantly better than OVO, which justifies that it can be useful to include the cost information into the training process. The t-test results, which will be shown in Table 4, suggest the same finding. The big difference on yeast and pageblock is because they are highly unbalanced and hence the components in c can be huge. Then, not using (or discarding) cost information (as OVO does) would intuitively lead to worse performance.

The only data set on which CSOVO is much worse than OVO is vowel. One may wonder why including the accurate cost information does not improve performance. We check and find that OVO achieves a very low test error (1.1%), which readily leads to a low test cost.

Then the caveat of using CSOVO, or more generally the cost-transformation technique, arises. In particular, cost transformation reduces the original “easy” task for OVO to a more difficult one. The change in hardness degrades the learning performance, and thus CSOVO results in relatively higher test cost.

Recall that CSOVO comes from coupling the cost-transformation technique with OVO, and we discussed in Section 3 that cost transformation inevitably introduces multi-labeling ambiguity into the learning process. The ambiguity acts like noise, and generally makes the learning tasks more difficult. On the vowel data set, OVO readily achieves low test error and hence low test cost, while CSOVO suffers from the difficult learning tasks and hence gets high test cost. Similar situations happen on segment, dna, and usps, on which the test costs of CSOVO and OVO are quite close. That is, it is worth trading the cost information for easier learning tasks. On the other hand, when OVO cannot achieve low test error (like on vehicle) or when the cost information is extremely important (like on pageblock), it is worth trading the easy learning tasks for knowing the accurate cost information, and thus CSOVO performs better.

5.2. CSOVO versus WAP and APFT

Next, we compare CSOVO with WAP and APFT in terms of the average test cost in Table 2. We see that CSOVO and WAP are comparable in performance, with WAP being


Table 2: test cost of CSOVO/WAP/APFT

data set     CSOVO            WAP              APFT
zoo          36.79±9.09       49.09±16.67      56.92±15.41
glass        210.06±15.88     220.29±18.56     215.95±17.36
vehicle      155.14±20.63     148.63±19.74     158.60±20.35
vowel        20.05±1.95       19.36±1.81       27.59±2.94
yeast        52.45±2.97       52.71±3.58       63.53±8.11
segment      25.27±2.25       24.40±1.96       28.51±2.55
dna          53.18±4.25       51.13±4.37       53.37±5.49
pageblock    24.98±4.94       20.68±2.52       25.60±4.98
satimage     66.57±4.77       72.05±5.13       80.70±5.98
usps         20.51±1.17       21.04±1.19       29.75±1.75

slightly worse on satimage and CSOVO being slightly worse on pageblock. The similarity is natural because CSOVO and WAP are only different in the weights given to the underlying binary examples. On the other hand, Table 2 shows that APFT usually performs slightly worse than CSOVO and WAP. Thus, CSOVO and WAP should be better choices, unless the O(K) prediction time (and the shorter O(K²) training time because of the conditioning on the tournament) of APFT is needed. Again, the t-test results in Table 4 lead to the same conclusion.

In summary, CSOVO performs better than APFT; CSOVO performs similarly to WAP but enjoys a simpler implementation (see Subsection 4.4). Thus, CSOVO should be preferred over both WAP and APFT in practice.
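The prediction-time trade-off mentioned above can be illustrated with a small sketch that counts binary comparisons. It is a generic illustration under stated assumptions, not the actual CSOVO, WAP, or APFT procedure: `duel` is a hypothetical stand-in for a trained binary classifier evaluated on one fixed input, and the tournament is a plain single-elimination bracket.

```python
from itertools import combinations
from typing import Callable, Tuple

# Hypothetical stand-in for a trained pairwise classifier on one fixed input:
# given two labels, it returns the label it prefers.
Duel = Callable[[int, int], int]


def predict_by_voting(K: int, duel: Duel) -> Tuple[int, int]:
    """Full pairwise comparison: every pair of labels is compared once."""
    votes = [0] * K
    calls = 0
    for i, j in combinations(range(K), 2):
        votes[duel(i, j)] += 1
        calls += 1
    winner = max(range(K), key=lambda k: votes[k])  # ties go to the smallest label
    return winner, calls


def predict_by_tournament(K: int, duel: Duel) -> Tuple[int, int]:
    """Single-elimination tournament: each comparison removes one label."""
    alive = list(range(K))
    calls = 0
    while len(alive) > 1:
        survivors = []
        for a, b in zip(alive[0::2], alive[1::2]):
            survivors.append(duel(a, b))
            calls += 1
        if len(alive) % 2 == 1:
            survivors.append(alive[-1])  # unpaired label advances without a comparison
        alive = survivors
    return alive[0], calls
```

With K labels, the voting scheme makes K(K−1)/2 comparisons per prediction, while the tournament makes K−1. The tournament is cheaper, but a single wrong comparison can eliminate the best label.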

5.3. CSOVO versus Others

Next, we compare CSOVO with three other existing algorithms, namely TREE (Beygelzimer et al., 2005), Filter Tree (FT; Beygelzimer et al., 2007), and Sensitive Error Correcting Output Codes (SECOC; Langford and Beygelzimer, 2005). The algorithms cover major types of decompositions from multiclass classification to binary classification.

Table 3 compares the average test RP cost of CSOVO, TREE, FT, and SECOC; Table 4 lists the paired t-test results with significance level 0.05. SECOC is the worst of the four, because it contains a thresholding (quantization) step that can lead to an inaccurate representation of the cost information.

FT performs slightly worse than CSOVO, which demonstrates that a full pairwise comparison (CSOVO/WAP) can be more stable than an elimination-based tournament (FT).

TREE performs even worse than FT, which complies with the finding in the original FT paper (Beygelzimer et al., 2007) that a regret-based reduction (FT) can be more robust than an error-based reduction (TREE). Overall, Table 3 and Table 4 suggest that CSOVO is the best meta-algorithm of the four.
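The paired comparison behind Table 4 can be sketched as below. This is a minimal illustration under stated assumptions, not the evaluation code of the paper: the function names are hypothetical, the per-run costs in the usage note are made up, and the caller supplies the two-sided critical value for the chosen significance level and the number of runs.

```python
import math
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic of the per-run differences a[i] - b[i].

    Raises ZeroDivisionError if all differences are identical.
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # unbiased sample variance
    return mean / math.sqrt(var / n)


def compare_costs(a: Sequence[float], b: Sequence[float], critical: float) -> str:
    """Classify algorithm A against B from paired test costs (lower is better)."""
    t = paired_t_statistic(a, b)
    if t < -critical:
        return "better"   # A has significantly lower cost
    if t > critical:
        return "worse"    # A has significantly higher cost
    return "similar"
```

For example, with five hypothetical paired runs (4 degrees of freedom) the two-sided critical value at level 0.05 is 2.776, so `compare_costs([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.1, 3.9, 5.2, 6.0], 2.776)` reports that the first algorithm is better.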

6. Conclusion

We presented the cost-transformation technique, which can transform any cost vector c to a similar one that is decomposable to the classification cost Cc with the minimum entropy.

The technique allowed us to design the TSEW algorithm, which can be generally applied to make any regular classification algorithm cost-sensitive. We coupled TSEW with the


Table 3: test cost of meta-algorithms that reduce cost-sensitive to binary classification

data set     CSOVO            FT               TREE             SECOC
zoo          36.79±9.09       74.01±27.65      57.07±17.40      179.02±30.52
glass        210.06±15.88     212.76±19.38     264.02±24.49     347.77±37.87
vehicle      155.14±20.63     156.01±20.14     156.72±20.12     167.60±20.97
vowel        20.05±1.95       24.66±2.92       28.42±3.02       95.25±7.35
yeast        52.45±2.97       53.73±3.14       66.49±6.09       277.14±49.41
segment      25.27±2.25       27.06±2.32       27.76±2.26       68.08±4.02
dna          53.18±4.25       53.76±4.23       55.32±4.39       65.90±5.66
pageblock    24.98±4.94       29.93±6.16       23.00±3.28       249.28±67.31
satimage     66.57±4.77       74.01±4.57       78.13±4.31       102.81±5.41
usps         20.51±1.17       27.07±1.52       25.74±1.55       86.59±9.45

Table 4: t-test for comparing CSOVO with other meta-algorithms using RP cost

data set     OVO   WAP   APFT  FT    TREE  SECOC
zoo          ◦     ∼     ∼     ∼     ∼     ◦
glass        ◦     ∼     ∼     ∼     ◦     ◦
vehicle      ◦     ∼     ∼     ∼     ∼     ◦
vowel        ×     ∼     ◦     ◦     ◦     ◦
yeast        ◦     ∼     ∼     ∼     ◦     ◦
segment      ∼     ∼     ◦     ◦     ◦     ◦
dna          ∼     ∼     ∼     ∼     ∼     ◦
pageblock    ◦     ∼     ∼     ∼     ∼     ◦
satimage     ◦     ◦     ◦     ◦     ◦     ◦
usps         ◦     ∼     ◦     ◦     ◦     ◦

◦: CSOVO significantly better; ×: CSOVO significantly worse; ∼: similar

popular OVO meta-algorithm, and obtained a novel CSOVO algorithm that can conquer cost-sensitive classification by reducing it to several binary classification tasks. Experimental results demonstrated that CSOVO can be significantly better than the original OVO for cost-sensitive classification, which justified the usefulness of CSOVO.

We also analyzed the theoretical guarantee of CSOVO, and discussed its similarity to the existing WAP algorithm. We conducted a thorough experimental study that compared CSOVO with not only WAP but also many major meta-algorithms that reduce cost-sensitive classification to binary classification. We empirically found that CSOVO is similar to WAP but performs better than other major meta-algorithms on many cost-sensitive classification data sets. The results make CSOVO a promising meta-algorithm from cost-sensitive to binary classification.

While CSOVO can perform well for cost-sensitive classification, it does not scale well with K, the number of classes. Applying the cost-transformation technique to design more efficient cost-sensitive classification algorithms will be an important future research direction.


References

Naoki Abe, Bianca Zadrozny, and John Langford. An iterative method for multi-class cost-sensitive learning. In KDD, pages 3–11. ACM, 2004.

Alina Beygelzimer, Varsha Dani, Tom Hayes, John Langford, and Bianca Zadrozny. Error limiting reductions between classification tasks. In ICML, pages 49–56. ACM, 2005.

Alina Beygelzimer, John Langford, and Pradeep Ravikumar. Multiclass classification with filter trees. Downloaded from http://hunch.net/~jl, 2007.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A Library for Support Vector Machines. National Taiwan University, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Pedro Domingos. MetaCost: A general method for making classifiers cost-sensitive. In KDD, pages 155–164. ACM, 1999.

Seth Hettich, Catherine L. Blake, and Christopher J. Merz. UCI repository of machine learning databases, 1998. Downloadable at http://www.ics.uci.edu/~mlearn/MLRepository.html.

Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.

Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification. Technical report, National Taiwan University, 2003.

Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.

John Langford and Alina Beygelzimer. Sensitive error correcting output codes. In COLT, pages 158–172. Springer-Verlag, 2005.

Hsuan-Tien Lin and Ling Li. Support vector machinery for infinite ensemble learning. Journal of Machine Learning Research, 9:285–312, 2008.

Dragos Dorin Margineantu. Methods for Cost-Sensitive Learning. PhD thesis, Oregon State University, 2001.

Fen Xia, Liang Zhou, Yanwu Yang, and Wensheng Zhang. Ordinal regression as multiclass classification. International Journal of Intelligent Control and Systems, 12(3):230–236, 2007.

Bianca Zadrozny, John Langford, and Naoki Abe. Cost sensitive learning by cost-proportionate example weighting. In ICDM, pages 435–442, 2003.

Zhi-Hua Zhou and Xu-Ying Liu. On multi-class cost-sensitive learning. In AAAI, pages 567–572, 2006.


Appendix A. Proof of Theorem 6

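Proof. If not all $c[\ell]$ are equal, not all $q[\ell]$ are equal. Now we substitute those $\tilde{p}$ in the objective function by the right-hand sides of the equality constraints. Then, the objective function becomes

$$
f(\Delta) = -\sum_{\ell=1}^{K} \frac{q[\ell] + \Delta}{\sum_{k=1}^{K} q[k] + K\Delta} \log \frac{q[\ell] + \Delta}{\sum_{k=1}^{K} q[k] + K\Delta}.
$$

The constraint on $\Delta$ ensures that all the $p \log p$ operations above are well defined.⁴ Now, let $\bar{q} \equiv \frac{1}{K} \sum_{k=1}^{K} q[k]$. We get

$$
\begin{aligned}
\frac{df}{d\Delta}
&= -\frac{1}{K(\bar{q} + \Delta)^2} \sum_{\ell=1}^{K} \bigl(-q[\ell] + \bar{q}\bigr) \cdot \left( \log \frac{q[\ell] + \Delta}{\bar{q} + \Delta} - \log K + 1 \right) \\
&= -\frac{1}{K(\bar{q} + \Delta)^2} \sum_{\ell=1}^{K} \Bigl( -\underbrace{(q[\ell] + \Delta)}_{a_\ell} + \underbrace{(\bar{q} + \Delta)}_{b_\ell} \Bigr) \cdot \log \frac{q[\ell] + \Delta}{\bar{q} + \Delta} \\
&= \frac{1}{K(\bar{q} + \Delta)^2} \sum_{\ell=1}^{K} \bigl(a_\ell - b_\ell\bigr) \cdot \bigl(\log a_\ell - \log b_\ell\bigr).
\end{aligned}
$$

When not all $q[\ell]$ are equal, there exists at least one $a_\ell$ that is not equal to $b_\ell$. Therefore, $\frac{df}{d\Delta}$ is strictly positive, and hence the unique minimum of $f(\Delta)$ happens when $\Delta$ is of the smallest possible value. That is, for the unique optimal solution,

$$
\Delta = \max_{1 \le \ell \le K} \bigl(-q[\ell]\bigr) = c_{\max} - \frac{1}{K-1} \sum_{k=1}^{K} c[k]; \qquad
\tilde{q}[\ell] = c_{\max} - c[\ell], \qquad
\tilde{p}[\ell] = \frac{c_{\max} - c[\ell]}{\sum_{k=1}^{K} \bigl(c_{\max} - c[k]\bigr)}.
$$

4. We take the convention that $0 \log 0 \equiv \lim_{\epsilon \to 0} \epsilon \log \epsilon = 0$.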
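As a numerical sanity check of the closed-form solution, the sketch below evaluates it on a small cost vector and confirms that the entropy grows as the shift moves above its smallest feasible value. It assumes, as the proof does, that q[ℓ] equals (1/(K−1)) Σ c[k] − c[ℓ] and that not all components of c are equal. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def q_from_cost(c: Sequence[float]) -> List[float]:
    """q[l] = (1 / (K - 1)) * sum_k c[k] - c[l], the unshifted decomposition."""
    K = len(c)
    s = sum(c) / (K - 1)
    return [s - cl for cl in c]


def entropy_at(c: Sequence[float], delta: float) -> float:
    """Entropy f(delta) of p[l] = (q[l] + delta) / (sum_k q[k] + K * delta)."""
    q = q_from_cost(c)
    total = sum(q) + len(c) * delta
    h = 0.0
    for ql in q:
        p = (ql + delta) / total
        if p > 0:  # convention: 0 log 0 = 0
            h -= p * math.log(p)
    return h


def min_entropy_transform(c: Sequence[float]) -> Tuple[float, List[float], List[float]]:
    """Closed-form optimum: smallest feasible delta, q_tilde, and p_tilde."""
    K = len(c)
    c_max = max(c)
    delta = c_max - sum(c) / (K - 1)
    q_tilde = [c_max - cl for cl in c]
    total = sum(q_tilde)  # positive whenever not all c[l] are equal
    p_tilde = [ql / total for ql in q_tilde]
    return delta, q_tilde, p_tilde
```

For c = (0, 1, 2), the sketch gives Δ = 0.5, q̃ = (2, 1, 0), and p̃ = (2/3, 1/3, 0), and `entropy_at` increases when evaluated at larger shifts such as 1.0 and 2.0.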