Thesis by

### Hsuan-Tien Lin

In Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

California Institute of Technology Pasadena, California

2008

(Defended May 12, 2008)

© 2008 Hsuan-Tien Lin. All Rights Reserved.

## Acknowledgements

First of all, I thank my advisor, Professor Yaser Abu-Mostafa, for his encouragement and guidance on my study, research, career, and life. It is truly a pleasure to be embraced by his wisdom and friendliness, and a privilege to work in the free and easy atmosphere that he creates for the learning systems group.

I also thank my thesis committee, Professors Yaser Abu-Mostafa, Jehoshua Bruck, Pietro Perona, and Christopher Umans, for reviewing the thesis and for stimulating many interesting research ideas during my presentations.

I have enjoyed numerous discussions with Dr. Ling Li and Dr. Amrit Pratap, my fellow members of the group. I thank them for their valuable input and feedback. I am particularly indebted to Dr. Ling Li, who not only collaborated in many projects with me throughout the years, but also gave me lots of valuable suggestions in research and in life. Furthermore, I want to thank Lucinda Acosta, who just makes everything in the group much simpler.

Special gratitude goes to Professor Chih-Jen Lin, who not only brought me into the fascinating area of machine learning eight years ago, but also continued to provide me very helpful suggestions in research and in career.

I thank Kai-Min Chung, Dr. John Langford, and the anonymous reviewers for their valuable comments on the earlier publications that led to this thesis. I am also grateful for the financial support from the Caltech Center for Neuromorphic Systems Engineering under the US NSF Cooperative Agreement EEC-9402726, the Caltech Bechtel Fellowship, and the Caltech Division of Engineering and Applied Science Fellowship.

Finally, I thank my friends, especially the members of the Association of Caltech Taiwanese, for making my life colorful. Most importantly, I thank my family for their endless love and support. I want to express my deepest gratitude to my parents, who faithfully believe in me; to my brother, who selflessly shares everything with me; to my soulmate Yung-Han Yang, who passionately cherishes our long-distance relationship with me.

## Abstract

We study the ordinal ranking problem in machine learning. The problem can be viewed as a classification problem with additional ordinal information or as a regression problem without actual numerical information. From the classification perspective, we formalize the concept of ordinal information by a cost-sensitive setup, and propose some novel cost-sensitive classification algorithms. The algorithms are derived from a systematic cost-transformation technique, which carries a strong theoretical guarantee. Experimental results show that the novel algorithms perform well both in a general cost-sensitive setup and in the specific ordinal ranking setup.

From the regression perspective, we propose the threshold ensemble model for ordinal ranking, which allows the machine to estimate a real-valued score (as in regression) before quantizing it to an ordinal rank. We study the generalization ability of threshold ensembles and derive novel large-margin bounds on their expected test performance. In addition, we improve an existing algorithm and propose a novel algorithm for constructing large-margin threshold ensembles. Our proposed algorithms are efficient in training and achieve decent out-of-sample performance when compared with the state-of-the-art algorithm on benchmark data sets.

We then study how ordinal ranking can be reduced to weighted binary classification. The reduction framework is simpler than the cost-sensitive classification approach and includes the threshold ensemble model as a special case. The framework allows us to derive strong theoretical results that tightly connect ordinal ranking with binary classification. We demonstrate the algorithmic and theoretical use of the reduction framework by extending SVM and AdaBoost, two of the most popular binary classification algorithms, to the area of ordinal ranking. Coupling SVM with the reduction framework results in a novel and faster algorithm for ordinal ranking with superior performance on real-world data sets, as well as a new bound on the expected test performance of generalized linear ordinal rankers. Coupling AdaBoost with the reduction framework leads to a novel algorithm that provably boosts the training accuracy of any cost-sensitive ordinal ranking algorithm, and in turn improves its test performance empirically.

The studies above suggest that the key to improving ordinal ranking is to improve binary classification. In the final part of the thesis, we include two projects that aim at understanding binary classification better in the context of ensemble learning. First, we discuss how AdaBoost is restricted to combining only a finite number of hypotheses, and we remove the restriction by formulating a framework of infinite ensemble learning based on SVM. The framework can output an infinite ensemble by embedding infinitely many hypotheses into an SVM kernel. Using the framework, we show that binary classification (and hence ordinal ranking) can be improved by going from a finite ensemble to an infinite one. Second, we discuss how AdaBoost carries the property of being resistant to overfitting. Then, we propose the SeedBoost algorithm, which uses the property as a machinery to prevent other learning algorithms from overfitting. Empirical results demonstrate that SeedBoost can indeed improve an overfitting algorithm on some data sets.

## Contents

Acknowledgements iii

Abstract v

List of Figures ix

List of Tables x

List of Selected Algorithms xii

1 Introduction 1

1.1 Supervised Learning . . . 1

1.2 Ordinal Ranking . . . 7

1.3 Overview . . . 10

2 Ordinal Ranking by Cost-Sensitive Classification 12

2.1 Cost-Sensitive Classification . . . 12

2.2 Cost-Transformation Technique . . . 14

2.3 Algorithms . . . 22

2.3.1 Cost-Sensitive One-Versus-All . . . 23

2.3.2 Cost-Sensitive One-Versus-One . . . 26

2.4 Experiments . . . 31

2.4.1 Comparison on Classification Data Sets . . . 31

2.4.2 Comparison on Ordinal Ranking Data Sets . . . 34

3 Ordinal Ranking by Threshold Regression 36

3.1 Large-Margin Bounds of Threshold Ensembles . . . 38

3.2 Boosting Algorithms for Threshold Ensembles . . . 46

3.2.1 RankBoost for Ordinal Ranking . . . 47

3.2.2 ORBoost with Left-Right Margins . . . 50

3.2.3 ORBoost with All Margins . . . 53

3.3 Experiments . . . 54

3.3.1 Artificial Data Set . . . 55

3.3.2 Benchmark Data Sets . . . 55

4 Ordinal Ranking by Extended Binary Classification 58

4.1 Reduction Framework . . . 59

4.2 Usefulness of Reduction Framework . . . 69

4.2.1 SVM for Ordinal Ranking . . . 70

4.2.2 AdaBoost for Ordinal Ranking . . . 74

4.3 Experiments . . . 79

4.3.1 SVM for Ordinal Ranking . . . 80

4.3.2 AdaBoost for Ordinal Ranking . . . 82

5 Studies on Binary Classification 87

5.1 SVM for Infinite Ensemble Learning . . . 87

5.1.1 SVM and Ensemble Learning . . . 88

5.1.2 Infinite Ensemble Learning . . . 91

5.1.3 Experiments . . . 96

5.2 AdaBoost with Seeding . . . 101

5.2.1 Algorithm . . . 102

5.2.2 Experiments . . . 103

6 Conclusion 106

Bibliography 108

## List of Figures

1.1 Illustration of the learning scenario . . . 3

3.1 Prediction procedure of a threshold ranker . . . 37

3.2 Margins of a correctly predicted example . . . 39

3.3 Decision boundaries produced by ORBoost-All on an artificial data set . . . 55

4.1 Reduction (top) and reverse reduction (bottom) . . . 64

4.2 Training time (including automatic parameter selection) of SVM-based ordinal ranking algorithms with the perceptron kernel . . . 82

4.3 Decision boundaries produced by AdaBoost.OR on an artificial data set . . . 84

## List of Tables

2.1 Classification data sets . . . 31

2.2 Test RP cost of CSOVO and WAP . . . 33

2.3 Test absolute cost of CSOVO and WAP . . . 33

2.4 Test RP cost of cost-sensitive classification algorithms . . . 34

2.5 Test absolute cost of cost-sensitive classification algorithms . . . 34

2.6 Ordinal ranking data sets . . . 35

2.7 Test absolute cost of cost-sensitive classification algorithms on ordinal ranking data sets . . . 35

3.1 Test absolute cost of algorithms for threshold ensembles . . . 56

3.2 Test classification cost of algorithms for threshold ensembles . . . 56

4.1 Instances of the reduction framework . . . 70

4.2 Test absolute cost of SVM-based ordinal ranking algorithms . . . 80

4.3 Test classification cost of SVM-based ordinal ranking algorithms . . . . 80

4.4 Test absolute cost of SVM-based ordinal ranking algorithms with the perceptron kernel . . . 82

4.5 Test absolute cost of all SVM-based algorithms . . . 83

4.6 Test classification cost of all SVM-based algorithms . . . 83

4.7 Training absolute cost of base and AdaBoost.OR algorithms . . . 86

4.8 Test absolute cost of base and AdaBoost.OR algorithms . . . 86

5.1 Binary classification data sets . . . 98

5.2 Test classification cost (%) of SVM-Stump and AdaBoost-Stump . . . 99

5.3 Test classification cost (%) of SVM-Perc and AdaBoost-Perc . . . 100

5.4 Test absolute cost of algorithms for threshold perceptron ensembles . . . 100

5.5 Test classification cost (%) of SeedBoost with SSVM-Perc . . . 105

5.6 Test classification cost (%) of SeedBoost with SVM-Perc . . . 105

5.7 Test classification cost (%) of SeedBoost with SSVM versus stand-alone SVM . . . 105

## List of Selected Algorithms

2.2 Cost transformation with relabeling . . . 20

2.3 TSEW: training set expansion and weighting . . . 21

2.5 Generalized one-versus-all . . . 25

2.6 CSOVA: Cost-sensitive one-versus-all . . . 26

2.8 CSOVO: Cost-sensitive one-versus-one . . . 28

3.3 RankBoost-OR: RankBoost for ordinal ranking . . . 47

3.4 ORBoost-LR: ORBoost with left-right margins . . . 52

4.1 Reduction to extended binary classification . . . 59

4.3 AdaBoost.OR: AdaBoost for ordinal ranking . . . 76

5.1 SVM-based framework for infinite ensemble learning . . . 92

5.3 SeedBoost: AdaBoost with seeding . . . 102

## Chapter 1

## Introduction

Machine learning, the study that allows computational systems to adaptively improve their performance with experience accumulated from the data observed, is becoming a major tool in many fields. Furthermore, the growing application needs in the Internet age keep supplementing machine learning research with new types of problems. This thesis is about one of them—the ordinal ranking problem. It belongs to a family of learning problems, called supervised learning, which will be introduced below.

### 1.1 Supervised Learning

In supervised learning problems, the machine is given a training set Z = {z_n}_{n=1}^{N}, which contains training examples z_n = (x_n, y_n). We assume that each feature vector x_n ∈ X ⊆ R^D, each label y_n ∈ Y, and each training example z_n is drawn independently from an unknown probability measure dF(x, y) on X × Y. We focus on the case where dF(y | x), the random process that generates y from x, is governed by

y = g∗(x) + ε_x.

Here g∗ : X → Y is a deterministic but unknown component called the target function, which denotes the best function that can predict y from x. The exact notion of “best” varies by application needs and will be formally defined later in this section. The other part of y, which cannot be perfectly explained by g∗(x), is represented by a random component ε_x.

With the given training set, the machine should return a decision function ˆg as the inference of the target function. The decision function is chosen from a learning model G = {g}, which is a collection of candidate functions g : X → Y. Briefly speaking, the task of supervised learning is to use the information in the training set Z to find some decision function ˆg ∈ G that is almost as good as g∗ under dF(x, y).

For instance, we may want to build a recognition system that transforms an image of a written digit to its intended meaning. We can first ask someone to write down N digits and represent their images by the feature vectors x_n. We then label the images by y_n ∈ {0, 1, . . . , 9} according to their meanings. The target function g∗ here encodes the process of our human-based recognition system, and ε_x represents the mistakes we may make in our brain. The task of this learning problem is to set up an automatic recognition system (decision function) ĝ that is almost as good as our own recognition system, even on the yet unseen images of written digits in the future.

The machine conquers the task with a learning algorithm A. Generally speaking, the algorithm takes the learning model G and the training set Z as inputs. It then returns a decision function ĝ ∈ G by minimizing a predefined objective function E(g, Z) over g ∈ G. The full scenario of learning is illustrated in Figure 1.1.

Let us take one step back and look at what we mean by g∗ being the “best” function to predict y from x. To evaluate the predicting ability of any g : X → Y, we define its out-of-sample cost

π(g, F) = ∫_{x,y} C(y, g(x)) dF(x, y).

Here C(y, k) is called the cost function, which quantifies the price to be paid when an example of label y is predicted as k. The value of π(g, F) reflects the expected test cost on the (mostly) unseen examples drawn from dF(x, y). Then, the “best”

[Figure 1.1: Illustration of the learning scenario. The unknown target function g∗ and the noise ε generate the training set Z; the learning algorithm A takes Z and the learning model G as inputs and produces the decision function ĝ.]

function g∗ should satisfy

π(g∗, F) ≤ π(g, F) for all g : X → Y.

One such g∗ can be defined by

g∗(x) ≡ argmin_{k∈Y} ∫_y C(y, k) dF(y | x). (1.1)

In this thesis, we assume that such a g∗ exists, with ties in the argmin broken arbitrarily, and denote π(g, F) by π(g) when F is clear from the context.
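Definition (1.1) is easy to instantiate when Y is finite and dF(y | x) is approximated by a discrete estimate. A minimal sketch, with a hypothetical probability vector P_y and cost matrix C that are not from the thesis:

```python
import numpy as np

def bayes_optimal_prediction(P_y, C):
    """Pick the label k minimizing the expected cost sum_y C(y, k) * P(y | x).

    P_y : array of shape (K,), estimated conditional probabilities P(y | x)
    C   : array of shape (K, K), C[y-1, k-1] = cost of predicting k when the label is y
    """
    expected_cost = P_y @ C                   # entry k-1 holds sum_y P(y|x) * C(y, k)
    return int(np.argmin(expected_cost)) + 1  # labels are 1, ..., K

# With the absolute cost C(y, k) = |y - k|, the minimizer is a conditional median.
K = 5
C = np.abs(np.subtract.outer(np.arange(1, K + 1), np.arange(1, K + 1)))
P_y = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
print(bayes_optimal_prediction(P_y, C))       # → 3
```

The choice of cost function shapes what “best” means: the classification cost yields the conditional mode, while the absolute cost yields a conditional median.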

Recall that the task of supervised learning is to find some ĝ ∈ G that is almost as good as g∗ under dF(x, y). Since π(g∗) is the lower bound, we desire π(ĝ) to be as small as possible. Note that A minimizes E(g, Z) to get ĝ, and hence ideally we want to set E(g, Z) = π(g). Nevertheless, because dF(x, y) is unknown, it is not possible to compute such an E(g, Z), nor to minimize it directly. A substitute quantity that

depends only on Z is called the in-sample cost

ν(g) = (1/N) Σ_{n=1}^{N} C(y_n, g(x_n)).

Note that ν(g) can also be defined as π(g, Z_u), where Z_u denotes the uniform distribution over the training set Z. Because ν(g) is an unbiased estimate of π(g) for any given single g, many learning algorithms take ν(g) as a major component of E(g, Z).
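As a toy illustration (with a hypothetical data set and ranker, not from the thesis), ν(g) is just the average cost over the training set:

```python
def in_sample_cost(g, X, y, C):
    """nu(g) = (1/N) * sum_n C(y_n, g(x_n))."""
    N = len(X)
    return sum(C(y_n, g(x_n)) for x_n, y_n in zip(X, y)) / N

# Absolute cost C(y, k) = |y - k| on a toy training set.
C = lambda y, k: abs(y - k)
X = [0.0, 1.0, 2.0, 3.0]
y = [1, 2, 2, 3]
g = lambda x: round(x) + 1                   # a hypothetical ranker
print(in_sample_cost(g, X, y, C))            # → 0.5
```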

A small ν(g), however, does not always imply a small π(g) (Abu-Mostafa 1989; Vapnik 1995). When the decision function ĝ comes with a small ν(ĝ) and a large π(ĝ), we say that ĝ (or the learning algorithm A) overfits the training set Z. For instance, consider a training set Z with x_n ≠ x_m for all n ≠ m, and C(y, k) = |y − k|. Assume that

ĝ(x) = y_n, for x ∈ {x_n}_{n=1}^{N}; ĝ(x) = Δ (some constant), otherwise.

Then, we see that ν(ĝ) = 0 (the smallest possible value) while π(ĝ) can be made as large as we want by varying the constant Δ. That is, there exists a decision function like ĝ that leads to serious overfitting. Preventing overfitting is one of the most important objectives when designing learning models and algorithms. Generally speaking, the objective can be achieved when the complexity of G (and hence of the chosen ĝ) is reasonably controlled (Abu-Mostafa 1989; Abu-Mostafa et al. 2004; Vapnik 1995).

One important type of supervised learning problem is (univariate) regression, which deals with the case when Y is a metric space isometric to R. For simplicity, we shall restrict ourselves to the case where Y = R. Although not strictly required, common regression algorithms usually not only work on some G that contains continuous functions, but also desire ĝ to be reasonably smooth as a control of its complexity (Hastie, Tibshirani and Friedman 2001). The metric information is thus important in determining the smoothness of the function.

For instance, a widely used cost function for regression is the squared cost C_s(y, k) = (y − k)². With this cost in mind, the ridge regression algorithm (Hastie, Tibshirani and Friedman 2001) works on a linear regression model

G = {g_{v,b} : g_{v,b}(x) = ⟨v, x⟩ + b},

with ĝ being the optimal solution of

min_{g∈G} E(g, Z), where E(g_{v,b}, Z) = (λ/2) ⟨v, v⟩ + (1/N) Σ_{n=1}^{N} C_s(y_n, g_{v,b}(x_n)).

The first part of E(g, Z) controls the smoothness of the chosen decision function, and the second part is ν(g_{v,b}).
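The ridge objective is quadratic in (v, b) and has a closed-form minimizer. A minimal sketch (a hypothetical implementation, not the thesis's code, with the bias b left unregularized as in the model above):

```python
import numpy as np

def ridge_regression(X, y, lam):
    """Minimize (lam/2)<v, v> + (1/N) * sum_n (y_n - <v, x_n> - b)^2; return (v, b)."""
    N, D = X.shape
    A = np.hstack([X, np.ones((N, 1))])      # augment x with a constant feature for b
    R = np.eye(D + 1)
    R[-1, -1] = 0.0                          # do not regularize the bias
    # Setting the gradient to zero gives (lam*N/2 * R + A^T A) w = A^T y.
    w = np.linalg.solve((lam * N / 2) * R + A.T @ A, A.T @ y)
    return w[:-1], w[-1]

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.1, 1.9, 3.1])
v, b = ridge_regression(X, y, lam=1e-6)
print(np.round(v, 2), np.round(b, 2))        # ≈ the least-squares line, since lam is tiny
```

Larger λ forces ⟨v, v⟩, and hence the slope of the decision function, to be smaller, which is exactly how the first term of E(g, Z) trades smoothness against the in-sample cost.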

Another important type of supervised learning problem is called classification, in which Y is a finite set Y_c = {1, 2, . . . , K}. Each label in Y_c represents a different category. For instance, the digit recognition system described earlier can be formulated as a classification problem. A function of the form X → Y_c is called a classifier. In the special case where |Y_c| = 2, the classification problem is called binary classification, and the classifier g is called a binary classifier.

To evaluate whether a classifier predicts the desired category correctly, a commonly used cost function is the classification cost C_c(y, k) = ⟦y ≠ k⟧.¹ In some classification problems, however, it may be desirable to treat different kinds of classification mistakes differently (Margineantu 2001). For instance, when designing a system to classify cells as {cancerous, noncancerous}, in terms of the possible loss of human life, the cost of classifying a cancerous cell as a noncancerous one should be significantly higher than the other way around. These classification problems would thus include cost functions other than C_c. We call them cost-sensitive classification problems to distinguish them from regular classification problems, which use only C_c.

¹⟦·⟧ = 1 when the inner condition is true, and 0 otherwise.

Note that in classification problems, for any given (x, y), the cost function C is evaluated only on pairs (y, k) with k ∈ Y_c. We can then represent the needed part of C by a cost vector c with respect to y, where c[k] = C(y, k). In this thesis, we take a more general setup of cost-sensitive classification and allow different cost functions to be used on different examples (Abe, Zadrozny and Langford 2004). In this setup, we assume that the vector c is drawn from some probability measure dF(c | x, y) on a collection C of possible cost functions. We call the tuple (x, y, c) a cost-sensitive example to distinguish it from a regular example (x, y). The learning algorithm now receives a cost-sensitive training set {(x_n, y_n, c_n)}_{n=1}^{N} to work with.²
Using the cost-sensitive examples (x, y, c), the out-of-sample cost becomes

π(g) = ∫_{x,y} ∫_c c[g(x)] dF(c | x, y) dF(x, y),

and the in-sample cost becomes

ν(g) = (1/N) Σ_{n=1}^{N} c_n[g(x_n)].

As can be seen from the updated definitions of π(g) and ν(g), our setup does not explicitly need the label y. We shall, however, keep the notation for clarity and assume that y = argmin_{k∈Y} c[k].

A special instance of cost-sensitive classification takes the cost vector c to be of the form c[k] = w · C_c(y, k) for every cost-sensitive example (x, y, c), with some w ≥ 0. We call this instance weighted classification, in which a cost-sensitive example (x, y, c) can be simplified to a weighted example (x, y, w). It is known that weighted classification problems can be readily solved by regular classification algorithms with rejection-based sampling (Zadrozny, Langford and Abe 2003).
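A minimal sketch of the rejection-based sampling idea, following Zadrozny, Langford and Abe (2003), with hypothetical data: each weighted example (x, y, w) is kept with probability proportional to its weight, which turns the weighted training set into an unweighted one that a regular classification algorithm can consume.

```python
import random

def rejection_sample(weighted_examples, seed=0):
    """Keep each (x, y, w) with probability w / w_max; the surviving (x, y) pairs
    form an unweighted sample whose distribution is reweighted by w."""
    rng = random.Random(seed)
    w_max = max(w for _, _, w in weighted_examples)
    return [(x, y) for x, y, w in weighted_examples if rng.random() < w / w_max]

weighted = [("x1", 1, 0.2), ("x2", 2, 1.0), ("x3", 1, 0.5), ("x4", 2, 1.0)]
sample = rejection_sample(weighted)
# Examples with w = w_max are always kept; lighter examples survive less often.
```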

²In many applications, the exact form of dF(c | x, y) is known by application needs. In all of our theoretical results, however, dF(c | x, y) can be either known or unknown, as long as the learning algorithm receives a cost-sensitive training set where c_n is drawn independently from dF(c | x_n, y_n).

### 1.2 Ordinal Ranking

Ordinal ranking is another type of supervised learning problem. It is similar to classification in the sense that Y is a finite set Y_r = {1, 2, . . . , K} = Y_c. Therefore, ordinal ranking is also called ordinal classification (Cardoso and da Costa 2007; Frank and Hall 2001). Nevertheless, in addition to representing nominal categories (as the usual classification labels do), the labels y ∈ Y_r also carry ordinal information.

That is, two different labels in Y_r can be compared by the usual “<” operation. We call those y the ranks to distinguish them from the usual classification labels, and we use a ranker r(x) to denote a function from X to Y_r. In an ordinal ranking problem, the decision function is denoted by r̂(x), and the target function by r∗(x).

Because ranks can be naturally used to represent human preferences, ordinal ranking lends itself to many applications in social science, psychology, and information retrieval. For instance, we may want to build a recommendation system that predicts how much a user likes a movie. We can first choose N movies and represent each movie by a feature vector x_n. We then ask the user to (see and) rate each movie from {one star, two stars, . . . , five stars}, depending on how much she or he likes the movie.

The set Y_r = {1, 2, . . . , 5} includes different levels of preference (numbers of stars), which are ordered by “<” to represent “worse than.” The task of this learning problem is to set up an automatic recommendation system (decision function) r̂ : X → Y_r that is almost as good as the user, even on the yet unseen movies in the future.

Ordinal ranking is also similar to regression, in the sense that ordinal information is similarly encoded in y ∈ R. Therefore, ordinal ranking is also popularly called ordinal regression (Chu and Ghahramani 2005; Chu and Keerthi 2007; Herbrich, Graepel and Obermayer 2000; Li and Lin 2007b; Lin and Li 2006; Shashua and Levin 2003; Xia, Tao, et al. 2007; Xia, Zhou, et al. 2007). Nevertheless, unlike real-valued regression labels, the discrete ranks y ∈ Y_r do not carry metric information.

For instance, we cannot say that a five-star movie is 2.5 times better than a two-star one. In other words, the rank serves as a qualitative indication rather than a quantitative outcome. The lack of metric information violates the assumption of many regression algorithms, and hence they may not perform well on ordinal ranking problems.

The ordinal information carried by the ranks introduces the following two properties, which are important for modeling ordinal ranking problems.

• Closeness in the rank space Y_r: The ordinal information suggests that the mislabeling cost depends on the “closeness” of the prediction. For example, predicting a two-star movie as a three-star one is less costly than predicting it as a five-star one. Hence, the cost vector c should be V-shaped with respect to y (Li and Lin 2007b), that is,

c[k−1] ≥ c[k], for 2 ≤ k ≤ y; c[k+1] ≥ c[k], for y ≤ k ≤ K−1. (1.2)

Briefly speaking, a V-shaped cost vector says that a ranker needs to pay more if its prediction on x is further away from y. We shall assume that every cost vector c generated from dF(c | x, y) is V-shaped with respect to y = argmin_{1≤k≤K} c[k]. With this assumption, ordinal ranking can be cast as a cost-sensitive classification problem with V-shaped cost vectors.

In some of our results, we need a stronger condition: the cost vectors should be convex (Li and Lin 2007b), that is,

c[k+1] − c[k] ≥ c[k] − c[k−1], for 2 ≤ k ≤ K−1. (1.3)

When using convex cost vectors, a ranker needs to pay increasingly more as its prediction on x moves further away from y. It is not hard to see that any convex cost vector c is V-shaped with respect to y = argmin_{1≤k≤K} c[k].

• Structure in the feature space X: Note that the classification cost vectors {c_c^{(ℓ)} : c_c^{(ℓ)}[k] = ⟦ℓ ≠ k⟧}_{ℓ=1}^{K}, which are associated with the classification cost function C_c, are also V-shaped. If those cost vectors (and hence C_c) are used, what distinguishes ordinal ranking from regular classification?

Note that the total order within Y_r and the target function r∗ introduce a total preorder in X (Herbrich, Graepel and Obermayer 2000). That is,

x ≼ x′ ⟺ r∗(x) ≤ r∗(x′).

The total preorder allows us to naturally group and compare vectors in the feature space X. For instance, a two-star movie is “worse than” a three-star one, which is in turn “worse than” a four-star one; movies of fewer than three stars are “worse than” movies of at least three stars.

It is the meaningfulness of this grouping and comparison that distinguishes ordinal ranking from regular classification, even when the classification cost vectors {c_c^{(ℓ)}}_{ℓ=1}^{K} are used. For instance, if apple = 1, banana = 2, grape = 3, orange = 4, strawberry = 5, we can intuitively see that comparing fruits {1, 2} with fruits {3, 4, 5} is not as meaningful as comparing “movies of fewer than three stars” with “movies of at least three stars.”
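Conditions (1.2) and (1.3) are simple to check for a given cost vector; a small sketch (0-indexed Python lists, 1-based ranks as in the text):

```python
def is_v_shaped(c, y):
    """Check condition (1.2): c[k-1] >= c[k] for 2 <= k <= y, and
    c[k+1] >= c[k] for y <= k <= K-1 (1-based ranks; the list c is 0-indexed)."""
    K = len(c)
    left = all(c[k - 2] >= c[k - 1] for k in range(2, y + 1))
    right = all(c[k] >= c[k - 1] for k in range(y, K))
    return left and right

def is_convex(c):
    """Check condition (1.3): consecutive cost differences are non-decreasing."""
    K = len(c)
    return all(c[k] - c[k - 1] >= c[k - 1] - c[k - 2] for k in range(2, K))

c = [4, 3, 2, 3]               # a cost vector that is V-shaped around rank 3
print(is_v_shaped(c, y=3))     # → True
print(is_convex(c))            # → True
```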

Ordinal ranking has been studied in detail from the statistics perspective by McCullagh (1980), who viewed ranks as contiguous intervals on an unobservable real-valued random variable. From the machine learning perspective, many ordinal ranking algorithms have been proposed in recent years. For instance, Herbrich, Graepel and Obermayer (2000) followed the view of McCullagh (1980) and designed an algorithm with support vector machines (Vapnik 1995). One key idea of Herbrich, Graepel and Obermayer (2000) is to compare training examples by their ranks in a pairwise manner. Har-Peled, Roth and Zimak (2003) proposed a constraint classification algorithm that also compares training examples in a pairwise manner. Another instance of the pairwise comparison approach is the RankBoost algorithm (Freund et al. 2003; Lin and Li 2006), which will be further described in Chapter 3. Nevertheless, because there are O(N²) pairwise comparisons out of N training examples, it is hard to apply those algorithms to large-scale ordinal ranking problems.

There are some other algorithms that do not lead to such a quadratic expansion, such as perceptron ranking (Crammer and Singer 2005, PRank), ordinal regression boosting (Lin and Li 2006, ORBoost, which will be further introduced in Chapter 3), support vector ordinal regression (Chu and Keerthi 2007, SVOR), and the data replication method (Cardoso and da Costa 2007). As we shall see in Chapter 4, these algorithms can be unified under a simple reduction framework (Li and Lin 2007b).

Still some other algorithms fall into neither of the approaches above, such as C4.5-ORD (Frank and Hall 2001), Gaussian process ordinal regression (Chu and Ghahramani 2005, GPOR), recursive feature extraction (Xia, Tao, et al. 2007), and Weighted-LogitBoost (Xia, Zhou, et al. 2007).

### 1.3 Overview

In Chapter 2, we study the ordinal ranking problem from a classification perspective. That is, we cast ordinal ranking as a cost-sensitive classification problem with V-shaped costs. We propose the cost-transformation technique to systematically extend regular classification algorithms to their cost-sensitive versions. The technique carries strong theoretical guarantees. Based on the technique, we derive cost-sensitive versions of two popular regular classification algorithms and test their performance on both cost-sensitive classification and ordinal ranking problems.

In Chapter 3, we study the ordinal ranking problem from a regression perspective. That is, we solve ordinal ranking by thresholding an estimate of a latent continuous variable. Learning models associated with this approach are called threshold models. We propose a novel instance of the threshold model, called the threshold ensemble model, and prove its theoretical properties. We not only extend RankBoost for constructing threshold ensemble rankers, but also propose a more efficient algorithm called ORBoost. The proposed algorithm is rooted in the famous adaptive boosting (AdaBoost) approach and inherits promising properties from its ancestor.

In Chapter 4, we show that ordinal ranking can be reduced to binary classification both theoretically and algorithmically; the reduction includes the threshold model as a special case. We derive theoretical foundations of the reduction framework and demonstrate a surprising equivalence: ordinal ranking is as hard (easy) as binary classification.

In addition to extending support vector machines (SVM) to ordinal ranking with the reduction framework, we also propose a novel algorithm called AdaBoost.OR, which efficiently constructs an ensemble of ordinal rankers as its decision function.

In Chapter 5, we include two concrete research projects that aim at understanding and improving binary classification. The results can in turn be coupled with the reduction framework to improve ordinal ranking. First, we propose a novel framework of infinite ensemble learning based on SVM. The framework is not limited by the finiteness restriction of existing ensemble learning algorithms. Using the framework, we show that binary classification (and hence ordinal ranking) can be improved by going from a finite ensemble to an infinite one. Second, we discuss how AdaBoost carries the property of being resistant to overfitting. We then propose the SeedBoost algorithm, which uses this property as a machinery to prevent other learning algorithms from overfitting.

Some of the results in Chapters 3, 4, and 5 were jointly developed by Dr. Ling Li and the author (Li and Lin 2007b; Lin and Li 2006, 2008). The results that should be credited to Dr. Ling Li will be properly acknowledged in the coming chapters. The results without such acknowledgment are the original contributions of the author.

## Chapter 2

## Ordinal Ranking by Cost-Sensitive Classification

As discussed in Section 1.2, ordinal ranking can be cast as a cost-sensitive classification problem with V-shaped cost vectors. In this chapter, we study the cost-sensitive classification problem in general and propose a systematic technique to transform it to a regular classification problem. We first derive the theoretical foundations of the technique. Then, we use the technique to extend two popular algorithms for regular classification, namely one-versus-one and one-versus-all, to their cost-sensitive versions. We empirically demonstrate the usefulness of the new cost-sensitive algorithms on general cost-sensitive classification problems as well as on ordinal ranking problems.

### 2.1 Cost-Sensitive Classification

Cost-sensitive classification fits the needs of many practical applications of machine learning and data mining, such as targeted marketing, fraud detection, and medical decision systems (Abe, Zadrozny and Langford 2004). Margineantu (2001) discussed three kinds of approaches for solving the problem: manipulating the training examples, modifying the learning algorithm, or manipulating the decision function. There is also a fourth kind: designing a new learning algorithm that solves the problem directly.

For manipulating the training examples, Domingos (1999) proposed the MetaCost algorithm, which takes any classification algorithm to estimate dF(y | x), uses the estimate to relabel the training examples, and then retrains a classifier with the relabeled examples. The algorithm, however, depends strongly on how well dF(y | x) is estimated, which is hard to guarantee theoretically. In addition, the algorithm needs to know the cost collection C in advance and only accepts some restricted forms of dF(c | x, y). These shortcomings make it difficult to use the algorithm in our more general cost-sensitive setup. Approaches that manipulate the decision function suffer from similar shortcomings (Abe, Zadrozny and Langford 2004; Margineantu 2001).

There are many cost-sensitive classification approaches that come from modifying some regular classification algorithm (see, for instance, Margineantu 2001, Subsection 2.3.2). These approaches are usually constructed by identifying where the classification cost vectors are used in E(g, Z) (or in some intermediate quantity within A), and then replacing them with cost-sensitive ones. Nevertheless, the modifications are usually ad hoc and heuristic-based. In other words, those approaches usually do not come with a strong theoretical guarantee, either.

Recently, some authors proposed new algorithms for solving cost-sensitive classification problems directly (Beygelzimer et al. 2005; Beygelzimer, Langford and Ravikumar 2007; Langford and Beygelzimer 2005). These algorithms come with stronger theoretical guarantees, but because of their novelty, they have not been as widely tested nor as successful as some popular algorithms for regular classification.

Our work takes the route of modifying existing algorithms, with the objective of being systematic as well as providing a strong theoretical guarantee. In addition, our proposed modifications are based on the cost-transformation technique, which is related to manipulating the training examples in a principled manner. Then, we can easily extend successful algorithms for regular classification to their cost-sensitive versions. Next, we illustrate the cost-transformation technique.

### 2.2 Cost-Transformation Technique

The key of the cost-transformation technique is to decompose a cost vector c into a conic combination of the classification cost vectors {c^{(ℓ)}_c}_{ℓ=1}^{K}, where

c^{(ℓ)}_c[k] = C_c(ℓ, k) = ⟦ℓ ≠ k⟧ .

For instance, consider the cost vector ˜c = (4, 3, 2, 3). We see that

˜c = 0 · (0, 1, 1, 1) + 1 · (1, 0, 1, 1) + 2 · (1, 1, 0, 1) + 1 · (1, 1, 1, 0) ,

where the four vectors on the right-hand side are c^{(1)}_c, c^{(2)}_c, c^{(3)}_c, and c^{(4)}_c, respectively.

Why is such a decomposition useful? If there is a cost-sensitive example (x, y, c), where c = Σ_{ℓ=1}^{K} ˜q[ℓ] · c^{(ℓ)}_c, then for any classifier g,

c[g(x)] = Σ_{ℓ=1}^{K} ˜q[ℓ] · c^{(ℓ)}_c[g(x)] = Σ_{ℓ=1}^{K} ˜q[ℓ] · ⟦ℓ ≠ g(x)⟧ .

That is, if we sample ℓ proportional to ˜q[ℓ] and replace the cost-sensitive example (x, y, c) by a regular one (x, ℓ), then the cost that g needs to pay for its prediction on x is proportional to the expected classification cost. Thus, if a classifier g performs well on the “relabeled” problem using the expected classification cost, it would also perform well on the original cost-sensitive problem. The nonnegativity of ˜q[ℓ] ensures that ˜q can be scaled to form a probability distribution dF(ℓ | ˜q).^{1}
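The identity c[g(x)] = Σ_{ℓ} ˜q[ℓ] · ⟦ℓ ≠ g(x)⟧ is easy to check numerically with the earlier example ˜c = (4, 3, 2, 3) and its coefficients ˜q = (0, 1, 2, 1). A minimal sketch (labels are 0-indexed in the code):

```python
# Illustrative vectors from the text: c~ = (4, 3, 2, 3) decomposes with
# conic coefficients q~ = (0, 1, 2, 1) over the classification cost vectors.
q_tilde = [0, 1, 2, 1]
K = len(q_tilde)

def expected_cost(pred):
    # sum over l of q~[l] * [l != pred]: the classification cost vectors
    # weighted by the conic coefficients
    return sum(q_tilde[l] for l in range(K) if l != pred)

c_tilde = [expected_cost(k) for k in range(K)]
print(c_tilde)  # -> [4, 3, 2, 3], recovering c~
```

Charging the classification cost under labels sampled proportional to ˜q therefore reproduces ˜c[g(x)] in expectation, up to the normalization constant of ˜q.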

Nevertheless, can every cost vector c be decomposed into a conic combination of {c^{(ℓ)}_c}_{ℓ=1}^{K}? The short answer is no. For instance, the cost vector c = (2, 1, 0, 1) cannot be decomposed into any conic combination of {c^{(ℓ)}_c}_{ℓ=1}^{4}, because c comes with a unique linear decomposition:

(2, 1, 0, 1) = −(2/3) · (0, 1, 1, 1) + (1/3) · (1, 0, 1, 1) + (4/3) · (1, 1, 0, 1) + (1/3) · (1, 1, 1, 0) .

^{1}We take the minor assumption that not all ˜q[ℓ] are zero; otherwise c = 0 and the example (x, y, c) can simply be dropped.

Thus, c cannot be represented by any conic combination of {c^{(ℓ)}_c}_{ℓ=1}^{4}. The unique existence of a linear combination is formalized in the following lemma.

Lemma 2.1. Any c ∈ R^{K} can be uniquely decomposed as c = Σ_{ℓ=1}^{K} q[ℓ] · c^{(ℓ)}_c, where q[ℓ] ∈ R for ℓ = 1, 2, . . . , K.

Proof. Note that q[ℓ] needs to satisfy the matrix equation

c^{T} = M q^{T} ,

where c^{T} = (c[1], c[2], . . . , c[K])^{T}, q^{T} = (q[1], q[2], . . . , q[K])^{T}, and M is the K × K matrix with 0 on the diagonal and 1 elsewhere, that is, M[ℓ, k] = ⟦ℓ ≠ k⟧. Because M is invertible with

M^{−1} = 1/(K−1) · N, where N has −(K−2) on the diagonal and 1 elsewhere,

the vector q can be uniquely computed as q^{T} = M^{−1} c^{T}. That is,

q[ℓ] = 1/(K−1) · ( Σ_{k=1}^{K} c[k] ) − c[ℓ] .
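The closed form at the end of the proof is straightforward to verify numerically; a sketch (function name ours, labels 0-indexed):

```python
import numpy as np

def linear_decomposition(c):
    """Coefficients q with c = sum_l q[l] * c_c^(l), via the closed form
    in Lemma 2.1: q[l] = (sum_k c[k]) / (K - 1) - c[l]."""
    c = np.asarray(c, dtype=float)
    return c.sum() / (len(c) - 1) - c

c = [2.0, 1.0, 0.0, 1.0]
q = linear_decomposition(c)
print(q)          # q[0] = -2/3 is negative, so no conic combination exists

# Cross-check: rebuild c from the classification cost vectors.
C = 1.0 - np.eye(4)   # row l of C is c_c^(l): zero on the diagonal, one elsewhere
print(q @ C)          # recovers the original cost vector c
```

The negative coefficient q[0] confirms that (2, 1, 0, 1) admits only a linear, not a conic, combination.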

Although ˜c = (4, 3, 2, 3) yields a conic decomposition and c = (2, 1, 0, 1) does not, the two cost vectors are not very different when used to evaluate the performance of a classifier g. Note that for every x, ˜c[g(x)] = c[g(x)] + 2. The constant shift from c to ˜c does not affect the relative cost difference between the prediction g(x) and the best prediction y. That is, using ˜c is equivalent to using c plus a constant cost of 2 on every example. We call a cost vector ˜c similar to c by ∆ when

˜c[·] = c[·] + ∆ ,

for some constant ∆. If we only use c additively, as we did in the definitions of π(g) and ν(g), using ˜c instead of c introduces only a constant shift.

Although we cannot decompose every c into a conic combination of {c^{(ℓ)}_c}_{ℓ=1}^{K}, there exist infinitely many cost vectors ˜c that allow a conic combination while being similar to c. To see this, note that

(K−1) · (∆, ∆, . . . , ∆) = ∆ Σ_{ℓ=1}^{K} c^{(ℓ)}_c .

Then, for c = Σ_{ℓ=1}^{K} q[ℓ] · c^{(ℓ)}_c,

˜c = c + (K−1) · (∆, ∆, . . . , ∆) = Σ_{ℓ=1}^{K} (q[ℓ] + ∆) · c^{(ℓ)}_c .

We can easily make ˜q[ℓ] = q[ℓ] + ∆ ≥ 0 by choosing ∆ ≥ max_{1≤ℓ≤K} (−q[ℓ]).

Lemma 2.2. Consider some c = Σ_{ℓ=1}^{K} q[ℓ] · c^{(ℓ)}_c. If ˜c is similar to c by (K−1) · ∆, then ˜c yields a conic combination of {c^{(ℓ)}_c}_{ℓ=1}^{K} if and only if ∆ ≥ max_{1≤ℓ≤K} (−q[ℓ]).

Proof. By Lemma 2.1, the decomposition ˜c = Σ_{ℓ=1}^{K} (q[ℓ] + ∆) · c^{(ℓ)}_c is unique. Then, it is not hard to see that ˜q[ℓ] = q[ℓ] + ∆ ≥ 0 for every ℓ = 1, 2, . . . , K if and only if ∆ ≥ max_{1≤ℓ≤K} (−q[ℓ]).

Thus, we can first transform each cost vector c to a similar one ˜c that yields a conic combination, get the vector ˜q, and randomly relabel (x, y, c) to (x, ℓ) with probability dF(ℓ | c) proportional to ˜q[ℓ]. The procedure above transforms the original cost-sensitive classification problem to an equivalent regular classification one. From Lemma 2.2, there are infinitely many ˜c that we can use. The next question is, which is preferable? Since the proposed procedure relabels with probability

dF(ℓ | c) = ˜p[ℓ] = ˜q[ℓ] / Σ_{k=1}^{K} ˜q[k] ,

we would naturally desire the discrete probability distribution ˜p[·] to be of the least entropy. That is, we want the distribution to come from the optimal solution of

min_{˜p, ∆}  Σ_{ℓ=1}^{K} ˜p[ℓ] log (1 / ˜p[ℓ]) ,  (2.1)

subject to  ∆ ≥ max_{1≤ℓ≤K} (−q[ℓ]) ,

˜p[ℓ] = ˜q[ℓ] / Σ_{k=1}^{K} ˜q[k] ,  ℓ = 1, 2, . . . , K,

˜q[ℓ] = q[ℓ] + ∆ ,  ℓ = 1, 2, . . . , K,

q[ℓ] = 1/(K−1) · ( Σ_{k=1}^{K} c[k] ) − c[ℓ] ,  ℓ = 1, 2, . . . , K.

Theorem 2.3. If not all c[ℓ] are equal, the unique optimal solution to (2.1) is

˜p[ℓ] = (c_max − c[ℓ]) / Σ_{k=1}^{K} (c_max − c[k])  and  ∆ = max_{1≤ℓ≤K} (−q[ℓ]) ,

where c_max = max_{1≤ℓ≤K} c[ℓ].

Proof. If not all c[ℓ] are equal, not all q[ℓ] are equal. Now we substitute each ˜p[ℓ] in the objective function by the right-hand sides of the equality constraints. Then, the objective function becomes

f(∆) = − Σ_{ℓ=1}^{K} [ (q[ℓ] + ∆) / (Σ_{k=1}^{K} q[k] + K∆) ] · log [ (q[ℓ] + ∆) / (Σ_{k=1}^{K} q[k] + K∆) ] .

The constraint on ∆ ensures that all the p log p operations above are well defined.^{2}

^{2}We take the convention that 0 log 0 ≡ lim_{ǫ→0} ǫ log ǫ = 0.

Now, let q̄ ≡ (1/K) Σ_{k=1}^{K} q[k]. We get

df/d∆ = − 1/(K(q̄+∆)²) Σ_{ℓ=1}^{K} (q̄ − q[ℓ]) · ( log [(q[ℓ]+∆)/(q̄+∆)] − log K + 1 )

= − 1/(K(q̄+∆)²) Σ_{ℓ=1}^{K} ( −(q[ℓ]+∆) + (q̄+∆) ) · log [(q[ℓ]+∆)/(q̄+∆)]

= 1/(K(q̄+∆)²) Σ_{ℓ=1}^{K} (a_ℓ − b_ℓ) · (log a_ℓ − log b_ℓ) ,

where a_ℓ ≡ q[ℓ] + ∆ and b_ℓ ≡ q̄ + ∆; the second equality holds because Σ_{ℓ=1}^{K} (q̄ − q[ℓ]) = 0, so the constant (−log K + 1) terms vanish. Each term (a_ℓ − b_ℓ)(log a_ℓ − log b_ℓ) is nonnegative, and when not all q[ℓ] are equal, there exists at least one a_ℓ that is not equal to b_ℓ. Therefore, df/d∆ is strictly positive, and hence the unique minimum of f(∆) happens when ∆ is of the smallest possible value. That is, for the unique optimal solution,

∆ = max_{1≤ℓ≤K} (−q[ℓ]) = c_max − 1/(K−1) Σ_{k=1}^{K} c[k] ;

˜q[ℓ] = c_max − c[ℓ] ,  ˜p[ℓ] = (c_max − c[ℓ]) / Σ_{k=1}^{K} (c_max − c[k]) .  (2.2)
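Equation (2.2) reduces the whole minimum-entropy optimization to two vector operations on c; a sketch (function name ours, labels 0-indexed):

```python
import numpy as np

def min_entropy_transform(c):
    """Equation (2.2): q~[l] = c_max - c[l] and p~[l] = q~[l] / sum_k q~[k].
    Returns (q~, p~); p~ is None when all costs are equal (drop the example)."""
    c = np.asarray(c, dtype=float)
    q_tilde = c.max() - c
    total = q_tilde.sum()
    return q_tilde, (q_tilde / total if total > 0 else None)

q_tilde, p_tilde = min_entropy_transform([0, 1, 1, 334])
print(q_tilde)   # q~ = (334, 333, 333, 0)
print(p_tilde)   # p~ = (0.334, 0.333, 0.333, 0)
```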

Using Theorem 2.3, we can define the following probability measure dF_c(x, ℓ) from dF(x, y, c):

dF_c(x, ℓ) ∝ ∫_{y,c} ˜q[ℓ] dF(x, y, c) ,

where ˜q[ℓ] is computed from c using (2.2).^{3} More precisely, let

Λ_1 = ∫_{x,y,c} Σ_{ℓ=1}^{K} ˜q[ℓ] dF(x, y, c) .  (2.3)

Note that from (2.2), Σ_{ℓ=1}^{K} ˜q[ℓ] > 0 if not all c[ℓ] are equal. Thus, we can generally assume that the integral results in a nonzero value. That is, Λ_1 > 0, and

dF_c(x, ℓ) = Λ_1^{−1} · ∫_{y,c} ˜q[ℓ] dF(x, y, c) .

^{3}Even when all c[ℓ] are equal, (2.2) can still be used to get ˜q[ℓ] = 0 for all ℓ, which means the example (x, y, c) can be dropped instead of relabeled.

Then, we can derive the following cost-transformation theorem.

Theorem 2.4. For any classifier g,

π(g, F) = Λ_1 · π(g, F_c) − Λ_2 ,

where Λ_2 = (K−1) · ∫_{x,y,c} ∆ · dF(x, y, c) and each ∆ in the integral is computed from c with (2.2).

Proof.

π(g, F) = ∫_x ( ∫_{y,c} c[g(x)] dF(y, c | x) ) dF(x)

= ∫_x ( ∫_{y,c} Σ_{ℓ=1}^{K} q[ℓ] · c^{(ℓ)}_c[g(x)] dF(y, c | x) ) dF(x)

= ∫_x ( ∫_{y,c} Σ_{ℓ=1}^{K} (˜q[ℓ] − ∆) · c^{(ℓ)}_c[g(x)] dF(y, c | x) ) dF(x)

= −Λ_2 + ∫_x ( ∫_{y,c} Σ_{ℓ=1}^{K} ˜q[ℓ] · c^{(ℓ)}_c[g(x)] dF(y, c | x) ) dF(x)

= −Λ_2 + ∫_x Σ_{ℓ=1}^{K} c^{(ℓ)}_c[g(x)] · ( ∫_{y,c} ˜q[ℓ] dF(y, c | x) ) dF(x)

= −Λ_2 + Λ_1 · ∫_{x,ℓ} c^{(ℓ)}_c[g(x)] dF_c(x, ℓ)

= −Λ_2 + Λ_1 · π(g, F_c) .

The fourth equality uses Σ_{ℓ=1}^{K} c^{(ℓ)}_c[g(x)] = K − 1, so the ∆ terms integrate to exactly Λ_2.

An immediate corollary of Theorem 2.4 is:

Corollary 2.5. If g^* is the target function under dF(x, y, c), and ˜g^* is the target function under dF_c(x, ℓ), then π(g^*, F) = π(˜g^*, F) and π(g^*, F_c) = π(˜g^*, F_c).

That is, if a regular classification algorithm A_c is able to return a decision function ˆg = ˜g^*, the decision function is as good as the target function g^* under the original dF(x, y, c). Furthermore, as formalized in the following regret-transformation theorem, if any classifier g is close to ˜g^* under dF_c(x, ℓ), it is also close to g^* under dF(x, y, c).

Theorem 2.6. Consider dF_c(x, ℓ) defined from dF(x, y, c) above. For any classifier g,

π(g, F) − π(g^*, F) = Λ_1 · ( π(g, F_c) − π(˜g^*, F_c) ) .

Thus, to deal with a cost-sensitive classification problem generated from dF(x, y, c), it seems that the learning algorithm A can take the following steps:

Algorithm 2.1 (Cost transformation with relabeling, preliminary).

1. Compute dF_c(x, ℓ) and obtain N independent training examples Z_c = {(x_n, ℓ_n)}_{n=1}^{N} from dF_c(x, ℓ).

2. Use a regular classification algorithm A_c on Z_c to obtain a decision function ˆg_c that ideally yields a small π(ˆg_c, F_c).

3. Return ˆg ≡ ˆg_c.

There is, however, a caveat in the algorithm above. Recall that dF(x, y, c) is unknown, and dF_c(x, ℓ) depends on dF(x, y, c). Thus, dF_c(x, ℓ) cannot actually be computed. Nevertheless, we know that the training set Z = {(x_n, y_n, c_n)}_{n=1}^{N} contains examples that are generated independently from dF(x, y, c). Then, the first step can be implemented (almost equivalently) as follows.

Algorithm 2.2 (Cost transformation with relabeling).

1. Obtain N′ independent training examples Z_c = {(x_n, ℓ_n)}_{n=1}^{N′} from dF_c(x, ℓ):

(a) Transform each (x_n, y_n, c_n) to (x_n, ˜q_n) by (2.2).

(b) Apply the rejection-based sampling technique (Zadrozny, Langford and Abe 2003) and accept (x_n, ˜q_n) with probability proportional to Σ_{ℓ=1}^{K} ˜q_n[ℓ].

(c) For each (x_n, ˜q_n) that survives rejection-based sampling, randomly assign its label ℓ_n with probability ˜p_n[ℓ] ∝ ˜q_n[ℓ].

2. Use a regular classification algorithm A_c on Z_c to obtain a decision function ˆg_c that ideally yields a small π(ˆg_c, F_c).

3. Return ˆg ≡ ˆg_c.

It is easy to check that the new training set Z_c contains N′ (usually less than N) independent examples from dF_c(x, ℓ).
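Step 1 of Algorithm 2.2 can be sketched as follows. The normalization of the acceptance probabilities by the largest Σ_{ℓ} ˜q_n[ℓ] in the training set is one of several valid choices for rejection-based sampling, and the function name is ours (labels 0-indexed):

```python
import numpy as np

def relabel(Z, rng):
    """A sketch of step 1 of Algorithm 2.2: compute q~ by (2.2), accept each
    example with probability proportional to sum_l q~[l] (normalized here by
    the largest such sum), then relabel with probability p~[l] ∝ q~[l]."""
    q_list = [np.max(c) - np.asarray(c, dtype=float) for (_, _, c) in Z]
    q_bound = max(q.sum() for q in q_list)   # assumes some cost vector is non-flat
    Zc = []
    for (x, _, _), q in zip(Z, q_list):
        total = q.sum()
        if total == 0 or rng.random() > total / q_bound:
            continue                          # dropped (flat costs) or rejected
        label = int(rng.choice(len(q), p=q / total))
        Zc.append((x, label))
    return Zc

rng = np.random.default_rng(0)
Z = [("x1", 0, (0, 1, 1, 334)), ("x2", 1, (1, 0, 2, 2))]
Zc = relabel(Z, rng)   # a random subset of relabeled examples (x, l)
```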

While the steps above are supported with theoretical guarantees from Theorems 2.4 and 2.6, they may not work well in practice. For instance, if we look at an example (x_n, y_n, c_n) with y_n = 1 and c_n = (0, 1, 1, 334), the resulting ˜q_n = (334, 333, 333, 0). Because of the large value in c_n[4], the example looks almost like a uniform mixture of labels {1, 2, 3}, with only 0.334 probability of keeping its original label. In other words, for the purpose of encoding some large components in a cost vector, the relabeling process could pay a huge variance and relabel (or mislabel) the example more often than not. Then, the regular classification algorithm A_c would receive some Z_c that contains lots of misleading labels, making it hard for the algorithm to return a decent ˆg_c.

One remedy to the difficulty above is to use the following algorithm, called training set expansion and weighting (TSEW), instead of relabeling:

Algorithm 2.3 (TSEW: training set expansion and weighting).

1. Obtain NK weighted training examples Z_w = {(x_{nℓ}, y_{nℓ}, w_{nℓ})}:

(a) Transform each (x_n, y_n, c_n) to (x_n, ˜q_n) by (2.2).

(b) For every 1 ≤ ℓ ≤ K, let (x_{nℓ}, y_{nℓ}, w_{nℓ}) = (x_n, ℓ, ˜q_n[ℓ]) and add (x_{nℓ}, y_{nℓ}, w_{nℓ}) to Z_w.

2. Use a weighted classification algorithm A_w on Z_w to obtain a decision function ˆg_w.

3. Return ˆg ≡ ˆg_w.

It is not hard to show that dF_c(x, ℓ) ∝ w · dF_w(x, ℓ, w), and Z_w contains (dependent) examples generated from dF_w(x, ℓ, w). We can think of Z_w, which trades independence for smaller variance, as a more stable version of Z_c. The expanded training set Z_w contains all possible ℓ, and hence always includes the correct label y_n (along with the largest weight ˜q_n[y_n]). The A_w in TSEW can also be performed by a regular classification algorithm A_c using the rejection-based sampling technique (Zadrozny, Langford and Abe 2003). Then, Algorithm 2.2 is simply a special (and less stable) case of TSEW.
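Step 1 of TSEW is a deterministic expansion; a minimal sketch (function name ours, labels 0-indexed):

```python
import numpy as np

def tsew_expand(Z):
    """Algorithm 2.3, step 1: expand each (x, y, c) into K weighted
    examples (x, l, q~[l]), with q~ computed from c by (2.2)."""
    Zw = []
    for x, _, c in Z:
        q_tilde = np.max(c) - np.asarray(c, dtype=float)
        for label, weight in enumerate(q_tilde):
            Zw.append((x, label, float(weight)))
    return Zw

Zw = tsew_expand([("x1", 0, (0, 1, 1, 334))])
print(Zw)  # -> [('x1', 0, 334.0), ('x1', 1, 333.0), ('x1', 2, 333.0), ('x1', 3, 0.0)]
```

The correct label always carries the largest weight, but the other labels stay in the training set with their own weights rather than being sampled.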

The TSEW algorithm is a basic instance of our proposed cost-transformation technique. It is the same as the data space expansion (DSE) algorithm (Abe, Zadrozny and Langford 2004). Nevertheless, our derivation from the minimum-entropy perspective is novel, and our theoretical results on the out-of-sample cost π(g) are more general than the in-sample cost analysis of Abe, Zadrozny and Langford (2004). Recently, Xia, Zhou, et al. (2007) also proposed an algorithm similar to TSEW using LogitBoost as A_w, based on a restricted version of Theorem 2.4. It should be noted that the results discussed in this section are partially influenced by the work of Abe, Zadrozny and Langford (2004) but are independent from the work of Xia, Zhou, et al. (2007).

From the experimental results, TSEW (DSE) does not perform well in practice (Abe, Zadrozny and Langford 2004). A possible reason is that common A_w still find Z_w too difficult (Xia, Zhou, et al. 2007), because a training feature vector x_n could be multilabeled in Z_w, which may confuse A_w. One could improve the basic TSEW algorithm by using (or designing) an A_w that is more robust to multilabeled training feature vectors, as discussed in the next section.

### 2.3 Algorithms

In this section, we propose two novel cost-sensitive classification algorithms by coupling the cost-transformation technique with popular algorithms for regular (weighted) classification.

### 2.3.1 Cost-Sensitive One-Versus-All

The one-versus-all (OVA) algorithm is a popular algorithm for weighted classification. It solves the weighted classification problem by decomposing it into several weighted binary classification problems, as shown below.

Algorithm 2.4 (One-versus-all, see, for instance, Hsu and Lin 2002).

1. For each 1 ≤ ℓ ≤ K,

(a) Take the original training set Z = {(x_n, y_n, w_n)}_{n=1}^{N} and construct a binary classification training set Z_b^{(ℓ)} = {(x_n, y_n^{(ℓ)}, w_n) : y_n^{(ℓ)} = ⟦y_n = ℓ⟧}_{n=1}^{N}.

(b) Use a weighted binary classification algorithm A_b on Z_b^{(ℓ)} to get a decision function ˆg_b^{(ℓ)}.

2. Return ˆg(x) = argmax_{1≤ℓ≤K} ˆg_b^{(ℓ)}(x).

Each ˆg_b^{(ℓ)}(x) intends to predict whether x belongs to category ℓ. Thus, if a feature vector x should be of category 1, and all ˆg_b^{(ℓ)} are mistake free, then ideally ˆg_b^{(1)}(x) = 1 and ˆg_b^{(ℓ)}(x) = 0 for ℓ ≠ 1, and hence ˆg(x) makes a correct prediction. Nevertheless, if some of the binary decision functions ˆg_b^{(ℓ)} make mistakes, the performance of OVA can be affected by ties in the argmax operation. In practice, the OVA algorithm usually allows the decision functions ˆg_b^{(ℓ)} to output a soft prediction (say, in [0, 1]) rather than a hard one in {0, 1}. The soft prediction represents the support (confidence) on whether x belongs to category ℓ, and ˆg(x) returns the prediction associated with the highest support.
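Algorithm 2.4 with soft predictions can be sketched as follows. The learner interface (fit / decision) and the toy stub are ours, standing in for any weighted binary classification algorithm A_b with soft outputs:

```python
import numpy as np

class OVA:
    """A sketch of one-versus-all (Algorithm 2.4). `make_learner()` must
    return an object with fit(X, y, w) and decision(X) -> soft scores."""
    def __init__(self, K, make_learner):
        self.K = K
        self.make_learner = make_learner

    def fit(self, X, y, w):
        self.machines = []
        for l in range(self.K):
            y_l = (y == l).astype(float)    # binary labels y^(l) = [y == l]
            m = self.make_learner()
            m.fit(X, y_l, w)
            self.machines.append(m)
        return self

    def predict(self, X):
        # one soft score per category; the highest support wins the argmax
        scores = np.column_stack([m.decision(X) for m in self.machines])
        return scores.argmax(axis=1)

class _PriorStub:
    """Toy learner for illustration: outputs the weighted positive fraction."""
    def fit(self, X, y, w):
        self.p = float(np.average(y, weights=w))
    def decision(self, X):
        return np.full(len(X), self.p)

X = np.zeros((4, 2))
y = np.array([0, 0, 1, 2])
ova = OVA(3, _PriorStub).fit(X, y, np.ones(4))
# ova.predict(X) prefers category 0, the weighted majority under this stub
```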

What would happen if we directly use OVA as A_w in TSEW? Recall that a cost-sensitive training example (x_1, 1, (0, 1, 1, 334)) in Z would introduce the following multilabeled examples in Z_w:

(x_1, 1, 334), (x_1, 2, 333), (x_1, 3, 333).

If we feed Z_w directly to OVA, the underlying binary classification algorithm A_b would use the following examples to get ˆg_b^{(1)}:

(x_1, 1, 334), (x_1, 0, 666).

That is, even though x_1 is of category 1, paradoxically we prefer A_b to return some ˆg_b^{(1)} that predicts x_1 as 0 rather than 1. The paradox is similar to what we encountered when sampling Z_c from Z in Algorithm 2.1: the relabeling process results in a misleading label more often than not. Thus, directly plugging OVA into the TSEW algorithm does not work.

Nevertheless, we can modify OVA and make it more robust when given multilabeled training examples. In fact, a variant of the OVA algorithm can readily be used for multilabeled classification in the literature (see, for instance, Joachims 2005, Section 2). For a training feature vector x_n that can be labeled either as 1 or 2, the OVA algorithm for a multilabeled training set Z would pair x_n with y_n^{(1)} = 1 when constructing Z_b^{(1)}, and with y_n^{(2)} = 1 when constructing Z_b^{(2)} as well. That is, when Z contains both (and only) (x_n, 1) and (x_n, 2), the feature vector x_n “supports” both categories 1 and 2, while it does not support categories 3, 4, . . ., K.

The support perspective can also be understood with the cost vectors and the cost-transformation technique. Note that the expanded training set Z_w contains both (x_n, 1) and (x_n, 2) (with weights 1) if and only if the original cost-sensitive training example (x_n, y_n, c_n) comes with a cost vector c_n that is similar to (0, 0, 1, 1, . . . , 1).

Thus, x_n supports both categories 1 and 2 because no cost needs to be paid when the prediction falls in them. Note that ˜q_n = (1, 1, 0, 0, . . . , 0) in this case. Equivalently speaking, we can say that x_n supports those categories ℓ with ˜q_n[ℓ] = 1 and does not support those ℓ with ˜q_n[ℓ] = 0.

From the observation above, we can define s[ℓ] = ˜q[ℓ] / ˜q_max as the support for category ℓ, where ˜q_max = max_{1≤ℓ≤K} ˜q[ℓ]. Thus, s[ℓ] ∈ [0, 1], and from (2.2),

s[ℓ] = 0 when c[ℓ] = c_max ;  s[ℓ] = 1 when c[ℓ] = c_min = min_{1≤k≤K} c[k] .

With the definition above, we propose the generalized OVA algorithm, which takes the original OVA algorithm as a special case.

Algorithm 2.5 (Generalized one-versus-all).

1. For each 1 ≤ ℓ ≤ K, use A_b to learn a binary classifier ˆg_b^{(ℓ)}(x) with the hope that

∫_{x,y,c} ˜q_max · ( s[ℓ] − ˆg_b^{(ℓ)}(x) )² dF(x, y, c)  (2.4)

is small.

2. Return ˆg(x) = argmax_{1≤ℓ≤K} ˆg_b^{(ℓ)}(x).

How can A_b learn such a binary classifier? Equation (2.4) is deliberately formulated as a learning problem. Then, for each training example (x_n, y_n, c_n) obtained from dF(x, y, c), we can compute a new training example (x_n, y_n, s_n[ℓ], (˜q_max)_n) to provide information for solving the learning problem. Assume that we keep the convention x_n^{(ℓ)} = x_n and y_n^{(ℓ)} = ⟦y_n = ℓ⟧ in Algorithm 2.4 and try to approximately deal with (2.4) by casting it as a weighted binary classification problem. One simple method to obtain the weight w_n^{(ℓ)} from (x_n, y_n, s_n[ℓ], (˜q_max)_n) is^{4}

w_n^{(ℓ)} = (˜q_max)_n · s_n[ℓ] when y_n = ℓ ;  w_n^{(ℓ)} = (˜q_max)_n · (1 − s_n[ℓ]) when y_n ≠ ℓ .  (2.5)

^{4}There is more than one method to do the transformation, and it is not theoretically clear which of them should be preferred. We choose the simple approach in (2.5) because of its promising performance in practice.

By replacing step 1 of generalized OVA with the weighted binary classification problem, we get the cost-sensitive one-versus-all (CSOVA) algorithm, as formalized below.

Algorithm 2.6 (CSOVA: Cost-sensitive one-versus-all).

1. For each 1 ≤ ℓ ≤ K,

(a) Take the original training set Z = {(x_n, y_n, c_n)}_{n=1}^{N} and construct a binary classification training set Z_b^{(ℓ)} = {(x_n, y_n^{(ℓ)}, w_n^{(ℓ)})} from (2.5).

(b) Use a weighted binary classification algorithm A_b on Z_b^{(ℓ)} to get a decision function ˆg_b^{(ℓ)}.

2. Return ˆg(x) = argmax_{1≤ℓ≤K} ˆg_b^{(ℓ)}(x).

We can easily see that Algorithm 2.4 is a special case of Algorithm 2.6 when all c_n are classification cost vectors.
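Step 1(a) of CSOVA, combining the support s[ℓ] with the weights in (2.5), can be sketched as follows (function name ours, labels 0-indexed):

```python
import numpy as np

def csova_binary_sets(Z, K):
    """A sketch of CSOVA (Algorithm 2.6, step 1(a)): per-category weighted
    binary training sets, with the weights computed by (2.5)."""
    sets = {l: [] for l in range(K)}
    for x, y, c in Z:
        q_tilde = np.max(c) - np.asarray(c, dtype=float)
        q_max = float(q_tilde.max())
        if q_max == 0:                      # all costs equal: drop the example
            continue
        s = q_tilde / q_max                 # support s[l] in [0, 1]
        for l in range(K):
            y_l = 1 if y == l else 0        # y^(l) = [y == l]
            w = q_max * s[l] if y == l else q_max * (1 - s[l])
            sets[l].append((x, y_l, float(w)))
    return sets

sets = csova_binary_sets([("x1", 0, (0, 1, 1, 334))], K=4)
print(sets[0])  # -> [('x1', 1, 334.0)]
```

For the text's running example, the machine for the cheapest category sees x1 as a heavy positive example, while the machine for the most expensive category sees it as an equally heavy negative one.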

### 2.3.2 Cost-Sensitive One-Versus-One

The one-versus-one (OVO) algorithm is another popular algorithm for weighted classification. It is suitable for practical use when K is not too large (Hsu and Lin 2002).

Similar to the OVA algorithm, it also solves the weighted classification problem by decomposing it into several weighted binary classification problems. Unlike OVA, however, each binary classification problem consists of comparing examples from two categories only.

Algorithm 2.7 (One-versus-one, see, for instance, Hsu and Lin 2002).

1. For each i, j with 1 ≤ i < j ≤ K,

(a) Take the original training set Z = {(x_n, y_n, w_n)}_{n=1}^{N} and construct a binary classification training set Z_b^{(i,j)} = {(x_n, y_n, w_n) : y_n = i or y_n = j}.

(b) Use a weighted binary classification algorithm A_b on Z_b^{(i,j)} to get a decision function ˆg_b^{(i,j)}.

2. Return ˆg(x) = argmax_{1≤ℓ≤K} Σ_{i<j} ⟦ˆg_b^{(i,j)}(x) = ℓ⟧.

In short, each ˆg_b^{(i,j)}(x) intends to predict whether x “prefers” category i or category j, and ˆg predicts with the preference votes gathered from those ˆg_b^{(i,j)}. The goal of A_b is to locate decision functions ˆg_b^{(i,j)} with a small π(ˆg_b^{(i,j)}, F^{(i,j)}), where dF^{(i,j)}(x, y) = dF(x, y | y = i or j), because it can be proved that (Beygelzimer et al. 2005)

π(ˆg) ≤ 2 Σ_{i<j} Prob[y = i or j] · π(ˆg_b^{(i,j)}, F^{(i,j)}) .
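The voting step of Algorithm 2.7 can be sketched as follows; the pairwise machines here are toy stand-ins for the decision functions ˆg_b^{(i,j)} (labels 0-indexed):

```python
import numpy as np
from itertools import combinations

def ovo_vote(X, machines, K):
    """A sketch of step 2 of Algorithm 2.7: each pairwise machine returns a
    label in {i, j} per row of X, and the category gathering the most
    preference votes wins. `machines[(i, j)]` is any callable X -> labels."""
    votes = np.zeros((len(X), K), dtype=int)
    for (i, j), g in machines.items():
        pred = np.asarray(g(X))
        for l in (i, j):
            votes[:, l] += (pred == l)
    return votes.argmax(axis=1)

# Toy pairwise machines over K = 3 that always prefer the larger index.
machines = {(i, j): (lambda X, j=j: np.full(len(X), j))
            for i, j in combinations(range(3), 2)}
X = np.zeros((2, 1))
print(ovo_vote(X, machines, K=3))  # category 2 wins both of its pairs
```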

Let us see if we can use OVO as A_w in TSEW for cost-sensitive classification problems. Again, consider a cost-sensitive training example (x_1, 1, (0, 1, 1, 334)) in Z. Recall that it would introduce the following multilabeled examples in Z_w:

(x_1, 1, 334), (x_1, 2, 333), (x_1, 3, 333).

If we directly use OVO as A_w in TSEW, the underlying binary classification algorithm A_b would use the following two examples in Z_b^{(1,2)} to get ˆg_b^{(1,2)}:

(x_1, 1, 334), (x_1, 2, 333).

Note that these weighted examples can be equivalently generated by labeling x_1 as 1 with probability 334/667 and as 2 with probability 333/667. Because both probabilities are close to 1/2, the labels are almost as if decided by tossing a fair coin. Therefore, the binary classification algorithm A_b may be confused by the two examples.

For any classifier g_b^{(i,j)}: X → {i, j} and the example (x_1, y_1, c_1) above, we see that the classifier needs to pay a constant cost of 333 first, regardless of its prediction. Now, we can again use the technique of shifting costs by a constant, as we did in constructing similar cost vectors. Then, the two examples (x_1, 1, 334) and (x_1, 2, 333) are the same as one single example (x_1, 1, 1). The shifting not only simplifies Z_b^{(i,j)} by eliminating one unnecessary example, but also removes the random-relabeling ambiguity that caused the confusion discussed above.
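The shifting can be carried out mechanically for every pair (i, j): subtracting min(c[i], c[j]) from both pair costs leaves a single example labeled with the cheaper category and weighted by |c[i] − c[j]|. A sketch of this reading (function name ours, labels 0-indexed):

```python
def pairwise_example(x, c, i, j):
    """Shift the pair costs (c[i], c[j]) by their minimum, leaving one
    weighted binary example: label = cheaper category, weight = |c[i] - c[j]|.
    Returns None when the pair is cost-indifferent."""
    weight = abs(c[i] - c[j])
    if weight == 0:
        return None
    label = i if c[i] < c[j] else j
    return (x, label, weight)

# The text's example: c = (0, 1, 1, 334), pair (1, 2) in 1-based labels is
# indices (0, 1) here; (x1, 1, 334) and (x1, 2, 333) collapse to weight 1.
print(pairwise_example("x1", (0, 1, 1, 334), 0, 1))  # -> ('x1', 0, 1)
```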