Reduction from Cost-sensitive Ordinal Ranking to Weighted Binary Classification

Hsuan-Tien Lin$^1$ and Ling Li$^2$

$^1$Department of Computer Science, National Taiwan University.

$^2$Department of Computer Science, California Institute of Technology.

Keywords: cost-sensitive, ordinal ranking, binary classification, reduction

Abstract

We present a reduction framework from ordinal ranking to binary classification. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranker from the binary classifier. Based on the framework, we show that a weighted 0/1 loss of the binary classifier upper-bounds the mislabeling cost of the ranker, both error-wise and regret-wise. Our framework allows not only to design good ordinal ranking algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal ranking from known bounds for binary classification. In addition, our framework unifies many existing ordinal ranking algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms. In addition, the newly designed algorithms lead to better cost-sensitive ordinal ranking performance as well as improved listwise ranking performance.


1 Introduction

We work on a supervised learning problem called ordinal ranking, which is also referred to as ordinal regression (Chu and Keerthi, 2007) or ordinal classification (Frank and Hall, 2001). For instance, the rating that a customer gives on a movie might be one of do-not-bother, only-if-you-must, good, very-good, and run-to-see.

Those ratings are called the ranks, which can be represented by ordinal class labels like $\{1, 2, 3, 4, 5\}$. The ordinal ranking problem is closely related to multiclass classification and metric regression. However, it is different from the former because of the ordinal information encoded in the ranks, and is different from the latter because of the lack of the metric distance between the ranks. Since rank is a natural representation of human preferences, the problem lends itself to many applications in social science and information retrieval (Liu, 2009).

Many ordinal ranking algorithms have been proposed from a machine learning perspective in recent years. For instance, Herbrich et al. (2000) designed an approach with support vector machines based on comparing training examples in a pairwise manner. Har-Peled et al. (2003) proposed a constraint classification approach that works with any binary classifiers based on the pairwise comparison framework. Nevertheless, such a pairwise comparison perspective may not be suitable for large-scale learning because the size of the associated optimization problem can be large. In particular, for an ordinal ranking problem with $N$ examples, if at least two of the ranks are supported by $\Omega(N)$ examples (which is quite common in practice), the size of the pairwise learning problem is quadratic in $N$.

There are some other approaches that do not lead to such a quadratic expansion. For instance, Crammer and Singer (2005) generalized the online perceptron algorithm with multiple thresholds to do ordinal ranking. In their approach, a perceptron maps an input vector to a latent potential value, which is then thresholded to obtain a rank. Shashua and Levin (2003) proposed new support vector machine (SVM) formulations to handle multiple thresholds, and some other SVM formulations were studied by Rajaram et al. (2003); Chu and Keerthi (2007); Cardoso and da Costa (2007). All these algorithms share a common property: they are modified from well-known binary classification approaches. Still, some other approaches fall into neither of the perspectives above, such as Gaussian process ordinal regression (Chu and Ghahramani, 2005).

Since binary classification is much better studied than ordinal ranking, a general framework to systematically reduce the latter to the former can introduce two immediate benefits. First, well-tuned binary classification approaches can be readily transformed into good ordinal ranking algorithms, which saves immense efforts in design and implementation. Second, new generalization bounds for ordinal ranking can be easily derived from known bounds for binary classification, which saves tremendous efforts in theoretical analysis.

In this paper, we propose such a reduction framework. The framework is based on extended binary examples, which are extracted from the original ordinal ranking examples. The binary classifier trained from the extended binary examples can then be used to construct a ranker. We prove that the mislabeling cost of the ranker is bounded by a weighted 0/1 loss of the binary classifier. Furthermore, we prove that the mislabeling regret of the ranker is bounded by the regret of the binary classifier as well. Hence, binary classifiers that generalize well could introduce rankers that generalize well. The advantages of the framework in algorithmic design and in theoretical analysis are both demonstrated in the paper. In addition, we show that our framework provides a unified view for many existing ordinal ranking algorithms. The experiments on some benchmark data sets validate the usefulness of the framework in practice, both for improving cost-sensitive ordinal ranking performance and for helping improve other ranking criteria.

The paper is organized as follows. We introduce the ordinal ranking problem in Section 2. Some related works are discussed in Section 3. We illustrate our reduction framework in Section 4. The algorithmic and theoretical usefulness of the framework is shown in Section 5. Finally, we present experimental results in Section 6, and conclude in Section 7.

A short version of the paper appeared in the 2006 Conference on Neural Information Processing Systems (Li and Lin, 2007b). The paper was then enriched by the more general cost-sensitive setting as well as the deeper theoretical results that were revealed in the 2009 Preference Learning Workshop (Lin and Li, 2009). For completeness, selected results from an earlier conference work (Lin and Li, 2006) are included in Subsection 5.2. The works above are also parts of the first author's Ph.D. thesis (Lin, 2008). In addition to the results that have been published in the conferences, we pointed out some important properties of ordinal ranking in Section 2, added a detailed literature discussion in Section 3, showed deeper theoretical results on the equivalence between ordinal ranking and binary classification in Section 4, clarified the differences among the SVM-based ordinal ranking algorithms in Section 5, and strengthened the experimental results to emphasize the usefulness of cost-sensitive ordinal ranking in Section 6.

2 Ordinal Ranking Setup

The ordinal ranking problem aims at predicting the rank $y$ of some input vector $x$, where $x$ is in an input space $\mathcal{X} \subseteq \mathbb{R}^D$ and $y$ is in a label space $\mathcal{Y} = \{1, 2, \ldots, K\}$. A function $r \colon \mathcal{X} \to \mathcal{Y}$ is called an ordinal ranker, or a ranker in short. We shall adopt the cost-sensitive setting (Abe et al., 2004; Lin, 2008), in which a cost vector $c \in \mathbb{R}^K$ is generated with $(x, y)$ from some fixed but unknown distribution $P(x, y, c)$ on $\mathcal{X} \times \mathcal{Y} \times \mathbb{R}^K$. The $k$-th element $c[k]$ of the cost vector represents the penalty when predicting the input vector $x$ as rank $k$. We naturally assume that $c[k] \geq 0$ and $c[y] = 0$. Thus, $y = \operatorname{argmin}_{1 \leq k \leq K} c[k]$. An ordinal ranking problem comes with a given training set $S = \{(x_n, y_n, c_n)\}_{n=1}^{N}$, whose elements are drawn i.i.d. from $P(x, y, c)$. The goal of the problem is to find a ranker $r$ such that its expected test cost

$$E(r) \equiv \mathop{\mathbb{E}}_{(x, y, c) \sim P} \, c[r(x)] \qquad$$

is small.
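To make the setup concrete, the expected test cost can be estimated on a finite sample. The sketch below is our own illustration, not part of the original formulation; the names `ranker`, `xs`, and `cs` are hypothetical.

```python
def average_cost(ranker, xs, cs):
    """Estimate E(r) on a finite test sample.

    ranker: callable mapping an input vector to a rank in {1, ..., K}
    xs:     sequence of input vectors
    cs:     sequence of cost vectors; cs[n][k-1] is the cost of predicting rank k
    """
    total = 0.0
    for x, c in zip(xs, cs):
        k = ranker(x)          # predicted rank in {1, ..., K}
        total += c[k - 1]      # pay the cost associated with that prediction
    return total / len(xs)
```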

The setting above looks similar to that of a cost-sensitive multiclass classification problem (Abe et al., 2004), in the sense that the label space $\mathcal{Y}$ is a finite set. Therefore, ordinal ranking is also called ordinal classification (Frank and Hall, 2001; Cardoso and da Costa, 2007). Nevertheless, in addition to representing the nominal categories (as the usual classification labels), now those $y \in \mathcal{Y}$ also carry the ordinal information. That is, two different labels in $\mathcal{Y}$ can be compared by the usual "$<$" operation. Thus, those $y \in \mathcal{Y}$ are called the ranks to distinguish them from the usual classification labels.

Ordinal ranking is also similar to regression (for which $y \in \mathbb{R}$ instead of $\mathcal{Y}$), because the real values in $\mathbb{R}$ can be ordered by the usual "$<$" operation. Therefore, ordinal ranking is also popularly called ordinal regression (Herbrich et al., 2000; Shashua and Levin, 2003; Chu and Ghahramani, 2005; Chu and Keerthi, 2007; Xia et al., 2007). Nevertheless, unlike the real-valued regression labels $y \in \mathbb{R}$, the discrete ranks $y \in \mathcal{Y}$ do not carry metric information. For instance, we cannot say that a five-star movie is $2.5$ times better than a two-star one; we also cannot compute the exact distance (difference) between a five-star movie and a one-star movie. In other words, the rank serves as a qualitative indication rather than a quantitative outcome. Thus, any monotonic transform of the label space should not alter the ranking performance. Nevertheless, many regression algorithms depend on the assumed metric information and can be highly affected by monotonic transforms of the label space (which are equivalent to change-of-metric operations). Thus, those regression algorithms may not always perform well on ordinal ranking problems.

The ordinal information carried by the ranks introduces the following two properties, which are important for modeling ordinal ranking problems.

• Closeness in the rank space $\mathcal{Y}$: The ordinal information suggests that the mislabeling cost depend on the "closeness" of the prediction. For example, predicting a two-star movie as a three-star one is less costly than predicting it as a five-star one. Hence, the cost vector $c$ should be V-shaped with respect to $y$ (Li and Lin, 2007b), that is,

$$c[k-1] \geq c[k] \ \text{ for } 2 \leq k \leq y; \qquad c[k+1] \geq c[k] \ \text{ for } y \leq k \leq K-1. \qquad (1)$$

Briefly speaking, a V-shaped cost vector says that a ranker needs to pay more if its prediction on $x$ is further away from $y$. We shall assume that every cost vector generated from $P(c \mid x, y)$ is V-shaped with respect to $y = \operatorname{argmin}_{1 \leq k \leq K} c[k]$. In other words, one can decompose $P(y, c \mid x) = P(y \mid c) \, P(c \mid x)$, where $P(c \mid x)$ is always V-shaped and $P(y \mid c)$ satisfies $y = \operatorname{argmin}_{1 \leq k \leq K} c[k]$.

In some of our results, we need a stronger condition: the cost vectors should be convex (Li and Lin, 2007b), which is defined by the condition¹

$$c[k+1] - c[k] \geq c[k] - c[k-1] \ \text{ for } 2 \leq k \leq K-1. \qquad (2)$$

¹When connecting the points $(k, c[k])$ from a convex cost vector by line segments, it is not difficult to prove that the resulting curve is convex for $k \in [1, K]$.

When using convex cost vectors, a ranker needs to pay increasingly more if its prediction on $x$ is further away from $y$. Provably, any convex cost vector is V-shaped with respect to $y = \operatorname{argmin}_{1 \leq k \leq K} c[k]$.

The V-shaped and convex cost vectors are general choices that can be used to represent the ordinal nature of $\mathcal{Y}$. One popular cost vector that has been frequently used for ordinal ranking is the absolute cost vector, which accompanies $(x, y)$ as

$$c^{(y)}[k] = |y - k|, \quad 1 \leq y \leq K.$$

Because the absolute cost vectors come with the median function as its population minimizer (Dembczyński et al., 2008), it appears to be a natural choice for ordinal ranking, similar to how the traditional 0/1 loss is the most natural choice for classification. Nevertheless, our work aims at studying more flexible possibilities (costs) beyond the natural choice, similar to the more flexible weighted loss beyond the 0/1 one in weighted classification (Zadrozny et al., 2003). As we shall show later in Section 6, the flexible costs can be used to embed the desired structural information in $\mathcal{Y}$ for better test performance. (A small code sketch illustrating these cost vectors appears right after the two properties.)

• Comparability in the input space $\mathcal{X}$: Note that the classification cost vectors

$$c^{(\ell)}[k] = [\![\, \ell \neq k \,]\!], \quad 1 \leq \ell \leq K,$$

which check whether the predicted rank $k$ is exactly the same as the desired rank $\ell$, are also V-shaped.² If those cost vectors are used, an immediate question is: what distinguishes ordinal ranking and common multiclass classification?

Let $r^*$ denote the optimal ranker with respect to $P(x, y, c)$. Note that $r^*$ introduces a total preorder $\preceq$ in $\mathcal{X}$ (Herbrich et al., 2000). That is,

$$x \preceq x' \iff r^*(x) \leq r^*(x').$$

The total preorder allows us to naturally group and compare vectors in the input space $\mathcal{X}$. For instance, a two-star movie is "worse than" a three-star one, which is in turn "worse than" a four-star one; movies of less than three stars are "worse than" movies of at least three stars.

²$[\![\,\cdot\,]\!]$ is $1$ if the inner condition is true, and $0$ otherwise.

The simplicity of the grouping and the comparison distinguishes ordinal ranking from multiclass classification. For instance, when classifying movies, it is difficult to group {action movies, romantic movies} and compare them with {comic movies, thriller movies}, but "movies of less than three stars" can be naturally compared with "movies of at least three stars."

The comparability property connects ordinal ranking to monotonic classification (Sill, 1998; Kotłowski and Słowiński, 2009), which is also referred to as ordinal classification with monotonicity constraints and is an important problem on its own. Monotonic classification models the ordinal ranking problem by assuming that an explicit order in the input space (such as the value-order of one particular feature) can be used to directly (and monotonically) infer about the order of the ranks in the output space ($y \leq y'$). In other words, monotonic classification allows putting thresholds on the explicit order to perform ordinal ranking. The comparability property shows that there is an order (total preorder) introduced by the ranks. Nevertheless, the order is not always "explicit" in general ordinal ranking problems. Therefore, many of the existing ordinal ranking approaches, such as the thresholded model that will be discussed in Section 5, seek the implicit order by transforming the input vectors before respecting the monotonic nature between the implicit order and the order of the ranks.
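As a concrete illustration of the cost vectors above (a sketch of ours, not from the paper, assuming 1-indexed ranks stored in 0-indexed arrays), the snippet below builds the absolute cost vector for a given rank and checks the V-shaped condition (1) and the convexity condition (2) numerically.

```python
import numpy as np

def absolute_cost(y, K):
    """Absolute cost vector c^(y)[k] = |y - k| for k = 1, ..., K."""
    return np.abs(y - np.arange(1, K + 1)).astype(float)

def is_v_shaped(c, y):
    """Check condition (1): non-increasing before rank y, non-decreasing after."""
    c = np.asarray(c, dtype=float)
    left = all(c[k - 2] >= c[k - 1] for k in range(2, y + 1))    # 2 <= k <= y
    right = all(c[k] >= c[k - 1] for k in range(y, len(c)))      # y <= k <= K-1
    return left and right

def is_convex(c):
    """Check condition (2): consecutive cost differences are non-decreasing."""
    d = np.diff(np.asarray(c, dtype=float))
    return all(d[1:] >= d[:-1])

c = absolute_cost(y=2, K=5)          # [1, 0, 1, 2, 3]
print(c, is_v_shaped(c, y=2), is_convex(c))
```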

In Table 1, we summarize four different learning problems in terms of their comparability and closeness properties.

Table 1: properties of different learning problems

                                closeness
  comparability   weak (classification cost vectors)   strong (other V-shaped cost vectors)
  yes             degenerate ordinal ranking           usual ordinal ranking
  no              multiclass classification            special cases of cost-sensitive classification

As discussed, usual ordinal ranking problems come with strong closeness in $\mathcal{Y}$ (which is represented by V-shaped cost vectors) and simple comparability in $\mathcal{X}$. The classification cost vectors can be viewed as degenerate V-shaped cost vectors, and hence introduce degenerate ordinal ranking problems.

Multiclass classification problems, on the other hand, do not allow examples of different classes to be naturally grouped and compared. If we want to use cost vectors other than the classification ones, we move to special cases of cost-sensitive classification. For instance, when trying to recognize digits $\{0, 1, \ldots, 9\}$ for written checks, a possible cost is the absolute one (to represent monetary differences) rather than simply right or wrong (the classification cost). The absolute cost is V-shaped and convex. Nevertheless, the digits intuitively cannot be grouped and compared, and hence the recognition problem belongs to cost-sensitive multiclass classification rather than ordinal ranking (Lin, 2008).

From the discussions above, a good ordinal ranking algorithm should appropriately use the comparability property. In Section 4, we will show how the property serves as a key to derive our proposed reduction framework.

3 Related Literature

The analysis of ordinal data has been studied in statistics by defining a suitable link function that models the underlying probability for generating the ordinal labels (Anderson, 1984). For instance, one popular model is the cumulative link model (Agresti, 2002) that will be discussed in Section 5. Similar models can be traced back to the work of McCullagh (1980). The many earlier works in statistics, which usually focus on the effectiveness and efficiency of the modeling, influence the ordinal ranking studies in machine learning (Herbrich et al., 2000), including our work. Another related area that studies the analysis of ordinal data is operations research, especially in the subarea of multi-criteria decision analysis (Greco et al., 2000; Figueira et al., 2005), which contains many works that focus on reasonable decision making with ordinal preference scales. Our work tackles ordinal ranking problems from the machine learning perspective (improving the test performance), and is hence different from the works that take the perspective of statistics or operations research.

In machine learning (and information retrieval), there are three major families of ranking algorithms: pointwise, pairwise, and listwise (Liu, 2009). The ordinal ranking setup presented in Section 2 belongs to pointwise ranking. Next, we discuss some representative algorithms in each family and relate them to the ordinal ranking setup. Then, we compare the proposed reduction framework with other reduction-based approaches for ranking.

3.1 Families of Ranking Algorithms

Pointwise ranking. Pointwise ranking aims at predicting the relevance of some input vector $x$ using either real-valued scores or ordinal-valued ranks. It does not directly use the comparison nature of ranking.

The ordinal ranking algorithms studied in this paper focus on computing ordinal-valued ranks for pointwise ranking. For obtaining real-valued scores, a fundamental tool is traditional least-squares regression (Hastie et al., 2001). As discussed in Section 2, however, when the given examples come with ordinal labels, the ordinal ranking algorithms studied in this paper can be more useful than traditional regression by taking the metric-less nature of the labels into account.

Pairwise ranking. Pairwise ranking aims at predicting the relative order between two input vectors $x$ and $x'$ and thus captures the local comparison nature of ranking. It is arguably one of the most widely used techniques in the ranking family and is usually cast as a binary classification problem of predicting whether $x$ is preferred over $x'$. During training, such a problem translates to comparing all pairs of $(x_n, x_m)$ based on their corresponding labels. One representative pairwise ranking algorithm is RankSVM (Herbrich et al., 2000), which trains an underlying support vector machine using those pairs. RankSVM was initially proposed for data sets that come with ordinal labels, but is also commonly applied to data sets that come with real-valued labels.

Note that even when all the labels take ordinal values, as long as two of the classes contain $\Omega(N)$ examples, there are $\Omega(N^2)$ pairs. Such a quadratic number of pairs makes it difficult to scale up general pairwise ranking algorithms, except in special cases like the linear support vector machine (Joachims, 2006) or RankBoost (Herbrich et al., 2000; Lin and Li, 2006). Thus, when the training set is large and contains ordinal labels, the ordinal ranking algorithms studied in this paper may serve as a useful alternative over pairwise ranking ones.

Listwise ranking. Listwise ranking aims at ordering a whole finite set of input vectors $S' = \{x'_m\}_{m=1}^{M}$. In particular, the (listwise) ranker tries to minimize the inconsistency between the predicted permutation and the ground-truth permutation of $S'$ (Liu, 2009). Listwise ranking captures the global comparison nature of ranking. One representative listwise ranking algorithm is ListNet (Cao et al., 2007), which is based on an underlying neural network model along with an estimated distribution of all possible permutations (rankers). Nevertheless, there are $M!$ permutations for a given $S'$. Thus, listwise ranking can be computationally even more expensive than pairwise ranking.

Many listwise ranking algorithms try to alleviate the computational burden by keeping some internal pointwise rankers. For instance, ListNet uses the underlying neural network to score each instance (Cao et al., 2007) for the purpose of permutation. The use of internal pointwise rankers for listwise ranking further justifies the importance of better understanding pointwise ranking, including the ordinal ranking algorithms studied in this paper.

3.2 Reduction Approaches for Ranking

Because ranking is a relatively new and diverse problem in machine learning, many existing ranking approaches try to reduce the ranking problem to other learning problems. Next, we discuss some existing reduction-based approaches that are related to the framework proposed in this paper.

From pairwise ranking to binary classification. Balcan et al. (2007) propose a robust reduction from bipartite (i.e., ordinal with two outcomes) pairwise ranking to binary classification. The training part of the reduction works like usual pairwise ranking: learning a binary classifier on whether $x$ is preferred over $x'$. The prediction part of the reduction asks the underlying binary classifier to vote for each example in the test set in order to rank those examples. The reduction is simple but yields solid theoretical guarantees. In particular, for ranking $M$ test examples, the reduction uses $\Theta(M^2)$ calls to the binary classifier and transforms a binary classification regret of $r$ to a bipartite ranking regret (measured by the so-called AUC criterion) of at most $2r$.

Ailon and Mohri (2008) improve the reduction of Balcan et al. (2007) and propose a more efficient reduction from general pairwise ranking to binary classification. The prediction part of the reduction operates by taking the underlying binary classifier as the comparison function of the popular QuickSort algorithm. In the special bipartite ranking case, for ranking $M$ examples, the reduction uses $O(M \log M)$ calls to the binary classifier on average and transforms a binary classification regret of $r$ to a bipartite ranking regret of at most $2r$.

From listwise ranking to regression (pointwise ranking). The Subset Ranking (Cossock and Zhang, 2008) algorithm can be viewed as a reduction from listwise ranking to regression. In particular, Cossock and Zhang (2008) prove that regression with various cost functions can be used to approximate a Bayes optimal listwise ranker. In other words, low-regret regressors can be cast as low-regret listwise rankers.

From listwise ranking to ordinal (pointwise) ranking. McRank (Li et al., 2008) is a reduction from listwise ranking to ordinal ranking with the classification cost. The main theoretical justification of the reduction shows that a scaled classification cost of an ordinal ranker can upper bound the regret of the associated listwise ranker. That is, low-error ordinal rankers can be cast as low-regret listwise rankers. Li et al. (2008) empirically verified that McRank can perform better than the Subset Ranking algorithm (Cossock and Zhang, 2008).

From ordinal ranking to binary classification. The proposed framework in this paper and the associated shorter version (Li and Lin, 2007b) is a reduction from ordinal ranking to binary classification. We will show that the reduction is both error-preserving and regret-preserving. That is, low-error binary classifiers can be cast as low-error ordinal rankers, and low-regret binary classifiers can be cast as low-regret ordinal rankers.

The data replication method, which was independently proposed by Cardoso and da Costa (2007), is a similar but more restricted case of the reduction framework. The data replication method essentially considers the absolute cost. In addition, the focus of the data replication method (Cardoso and da Costa, 2007) is on explaining the training procedure of the reduction. The proposed framework in this paper is more general than the data replication method in terms of the cost considered, as well as the deeper theoretical analysis on both the training and the test performance of the reduction.

Table 2: comparison of general reductions from ranking to binary classification

  reduction                  size of transformed set    # calls to binary classifiers    evaluation criterion
                             during training            during prediction
  the proposed framework     O(KN)                      O(KM)                            ranking cost
  (Balcan et al., 2007)      O(N^2)                     O(M^2)                           AUC
  (Ailon and Mohri, 2008)    O(N^2)                     O(M log M)                       AUC

The proposed reduction framework for pointwise ranking and existing reductions in pairwise ranking (Balcan et al., 2007; Ailon and Mohri, 2008) take very different views on the ranking problem and consider different evaluation criteria. As a consequence, when learning $N$ examples and ranking (predicting on) $M$ instances with $K$ ordinal scales, the proposed framework results in a transformed training set of size $O(KN)$ and a prediction procedure with time complexity $O(KM)$. Both the size of the training set and the time complexity of the prediction procedure are more efficient than the state-of-the-art reduction from pairwise ranking to binary classification (Ailon and Mohri, 2008), as shown in Table 2.

Note that the work of Li et al. (2008) revealed an opportunity to use the discrete nature of ordinal-valued labels to improve the listwise ranking performance over Subset Ranking when using a heuristic ordinal ranking algorithm. The proposed framework is a more rigorous study on ordinal ranking that can be coupled with McRank to yield a reduction from listwise ranking to binary classification, which allows state-of-the-art binary classification algorithms to be efficiently used for listwise ranking. We will demonstrate the use of this opportunity in Subsection 6.4.

4 Reduction Framework

We will first introduce the details of our proposed reduction framework. Then, we will demonstrate its theoretical guarantees. Consider, for instance, that we want to know how good a movie $x$ is. Using the comparability property of ordinal ranking, we can then ask the associated question "is the rank of $x$ greater than $k$?"

For a given $k$, such a question is exactly a binary classification problem, and the rank of $x$ can be determined by asking multiple questions for $k = 1, 2, \ldots$, until $(K-1)$. The questions are the core of the Dominance-based Rough Set Approach in operations research for reasoning from ordinal data (Słowiński et al., 2007). From the machine learning perspective, Frank and Hall (2001) proposed to solve each binary classification problem independently and combine the binary outputs to a rank. Although their approach is simple, the generalization performance using the combination step cannot be easily analyzed.

The proposed reduction framework works differently. First, a simpler step is used to convert the binary outputs to a rank, and generalization analysis can immediately follow. Moreover, all the binary classification problems are solved jointly to obtain a single binary classifier.

Assume that $g(x, k)$ is the single binary classifier that provides answers to all the associated questions above. Consistent answers would be $g(x, k) = +1$ ("yes") for $k = 1$ until $(\ell - 1)$ for some $\ell$, and $-1$ ("no") afterward. Then, a reasonable ranker based on the binary answers is $r_g(x) = \ell$, the position at which the answers switch from "yes" to "no". Equivalently,

$$r_g(x) \equiv 1 + \sum_{k=1}^{K-1} [\![\, g(x, k) > 0 \,]\!]. \qquad (3)$$

The binary classifier $g$ that only produces consistent answers would be called rank-monotonic.³
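Equation (3) translates directly into code. A minimal sketch of ours, assuming `g` is a callable that returns $+1$ or $-1$ for a pair $(x, k)$:

```python
def ranker_from_binary(g, K):
    """Construct r_g from a binary classifier g(x, k) in {-1, +1}, following (3)."""
    def r_g(x):
        # one plus the number of "yes" answers to "is the rank of x greater than k?"
        return 1 + sum(1 for k in range(1, K) if g(x, k) > 0)
    return r_g
```

For a rank-monotonic `g`, the count equals the position at which the answers switch from $+1$ to $-1$, matching the construction above.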

For any ordinal example $(x, y, c)$, we can define the extended binary examples $\left(X^{(k)}, Y^{(k)}\right)$ with weights $W^{(k)}$ as

$$X^{(k)} = (x, k), \qquad Y^{(k)} = 2\,[\![\, k < y \,]\!] - 1, \qquad W^{(k)} = (K-1) \cdot \bigl|\, c[k] - c[k+1] \,\bigr|. \qquad (4)$$

The extended input vector $X^{(k)}$ represents the associated question "is the rank of $x$ greater than $k$?"; the binary label $Y^{(k)}$ represents the desired answer to the question; the weight $W^{(k)}$ represents the importance of the question and will be used in the coming theoretical analysis. Here $X^{(k)}$ stands for an abstract pair, and we will discuss its practical encoding in Section 5. If $g\left(X^{(k)}\right) \equiv g(x, k)$ makes no errors on all the associated questions, $r_g(x)$ equals $y$ by (3). That is, $c[r_g(x)] = 0$. In the following theorem, we further connect $c[r_g(x)]$ to the amount of error that $g$ makes.

³Although (3) can be flexibly applied even when $g$ is not rank-monotonic, a rank-monotonic $g$ is usually desired in order to introduce a good ranker $r_g$.
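A direct transcription of (4) (our sketch; the cost vector `c` is assumed to be a length-$K$ array in which `c[k-1]` holds the cost of predicting rank $k$):

```python
def extend_example(x, y, c):
    """Turn one ordinal example (x, y, c) into K-1 weighted binary examples by (4)."""
    K = len(c)
    extended = []
    for k in range(1, K):
        X_k = (x, k)                              # abstract pair (x, k)
        Y_k = +1 if k < y else -1                 # 2 * [k < y] - 1
        W_k = (K - 1) * abs(c[k - 1] - c[k])      # (K-1) * |c[k] - c[k+1]|
        extended.append((X_k, Y_k, W_k))
    return extended
```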


Theorem 1 (Per-example cost bound). For any ordinal example $(x, y, c)$, where $c$ is V-shaped and $c[y] = 0$, consider its associated extended binary examples $\left(X^{(k)}, Y^{(k)}, W^{(k)}\right)$ in (4). Assume that the ranker $r_g$ is constructed from a binary classifier $g$ using (3). If $g\left(X^{(k)}\right)$ is rank-monotonic or if $c$ is convex, then

$$c[r_g(x)] \;\leq\; \frac{1}{K-1} \sum_{k=1}^{K-1} W^{(k)} \, [\![\, Y^{(k)} \neq g\left(X^{(k)}\right) \,]\!]. \qquad (5)$$

Proof. Because $g$ is rank-monotonic, $g\left(X^{(k)}\right) = +1$ for $k < r_g(x)$ and $g\left(X^{(k)}\right) = -1$ for $k \geq r_g(x)$. Thus, the cost that the ranker $r_g$ needs to pay is

$$c[r_g(x)] = \sum_{k=r_g(x)}^{K-1} \bigl( c[k] - c[k+1] \bigr) + c[K] = \sum_{k=1}^{K-1} \bigl( c[k] - c[k+1] \bigr) \, [\![\, g\left(X^{(k)}\right) < 0 \,]\!] + c[K]. \qquad (6)$$

Because the cost vector $c$ is V-shaped, $Y^{(k)}$ equals the sign of $\bigl( c[k] - c[k+1] \bigr)$ if the latter is not zero. Continuing from (6) with $c[y] = 0$,

$$\begin{aligned}
(K-1)\, c[r_g(x)]
&= \sum_{k=1}^{y-1} W^{(k)} Y^{(k)} \, [\![\, g\left(X^{(k)}\right) < 0 \,]\!] + (K-1)\, c[K] + \sum_{k=y}^{K-1} W^{(k)} Y^{(k)} \, \Bigl( 1 - [\![\, g\left(X^{(k)}\right) > 0 \,]\!] \Bigr) \\
&= \sum_{k=1}^{y-1} W^{(k)} \, [\![\, Y^{(k)} \neq g\left(X^{(k)}\right) \,]\!] + (K-1)\, c[y] + \sum_{k=y}^{K-1} W^{(k)} \, [\![\, Y^{(k)} \neq g\left(X^{(k)}\right) \,]\!] \\
&= \sum_{k=1}^{K-1} W^{(k)} \, [\![\, Y^{(k)} \neq g\left(X^{(k)}\right) \,]\!]. \qquad (7)
\end{aligned}$$

When $g$ is not rank-monotonic but the cost vector is convex, equation (7) becomes an inequality that could be alternatively proved by replacing (6) with

$$\sum_{k=r_g(x)}^{K-1} \bigl( c[k] - c[k+1] \bigr) \;\leq\; \sum_{k=1}^{K-1} \bigl( c[k] - c[k+1] \bigr) \, [\![\, g\left(X^{(k)}\right) < 0 \,]\!].$$

The inequality above holds because $\bigl( c[k] - c[k+1] \bigr)$ is decreasing due to the convexity, and there are exactly $\bigl( r_g(x) - 1 \bigr)$ zeros and $\bigl( K - r_g(x) \bigr)$ ones in the values of $[\![\, g\left(X^{(k)}\right) < 0 \,]\!]$ according to (3).


We call (5) the per-example cost bound, which says that if $g$ makes only a small amount of error on the extended binary examples $\left(X^{(k)}, Y^{(k)}, W^{(k)}\right)$, then $r_g$ is guaranteed to only pay a small amount of cost on the original example $(x, y, c)$. The bound allows us to derive the following reduction method, which is composed of three stages: preprocessing, training, and prediction.

Algorithm 1 (Reduction to extended binary classification).

1. Preprocessing: For each original training example $(x_n, y_n, c_n) \in S$ and for each $k = 1, 2, \ldots, K-1$, generate an extended training example $\left(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)}\right)$ and include it in $S_E$, where

$$X_n^{(k)} = (x_n, k), \qquad Y_n^{(k)} = 2\,[\![\, k < y_n \,]\!] - 1, \qquad W_n^{(k)} = (K-1) \cdot \bigl|\, c_n[k] - c_n[k+1] \,\bigr|.$$

2. Training: Use a binary classification algorithm on $S_E$ and get a binary classifier $g$ on a concrete encoding (to be discussed in Section 5) of $\mathcal{X} \times \{1, 2, \ldots, K-1\}$. Let $g(x, k) \equiv g\left(X^{(k)}\right)$.

3. Prediction: For any $x \in \mathcal{X}$, estimate its rank with (3).
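To make Algorithm 1 concrete, the following end-to-end sketch is our own illustration under two assumptions that are not prescribed by the paper: the abstract pair $X^{(k)}$ is encoded by appending a one-hot indicator of $k$ to $x$ (one simple choice among the encodings discussed in Section 5), and scikit-learn's LogisticRegression stands in for the core binary classification algorithm through its sample_weight argument. Any weighted binary classifier could be substituted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in weighted binary learner

def encode(x, k, K):
    """One simple concrete encoding of the abstract pair X^(k) = (x, k)."""
    onehot = np.zeros(K - 1)
    onehot[k - 1] = 1.0
    return np.concatenate([x, onehot])

def reduce_and_train(xs, ys, cs):
    """Algorithm 1, stages 1-2: build S_E and train a single binary classifier.

    xs: array of shape (N, D); ys: array of shape (N,); cs: array of shape (N, K).
    """
    K = cs.shape[1]
    XE, YE, WE = [], [], []
    for x, y, c in zip(xs, ys, cs):                    # stage 1: preprocessing
        for k in range(1, K):
            XE.append(encode(x, k, K))
            YE.append(+1 if k < y else -1)
            WE.append((K - 1) * abs(c[k - 1] - c[k]))
    clf = LogisticRegression()
    clf.fit(np.array(XE), np.array(YE), sample_weight=np.array(WE))   # stage 2: training

    def rank(x):                                       # stage 3: prediction via (3)
        answers = [clf.predict(encode(x, k, K).reshape(1, -1))[0] for k in range(1, K)]
        return 1 + sum(1 for a in answers if a > 0)
    return rank
```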

4.1 Cost Bound of the Reduction Framework

Consider the following probability distribution $P_b\left(X^{(k)}, Y^{(k)}, W^{(k)}\right)$ that generates the extended binary examples.

1. Draw a tuple $(x, y, c)$ independently from $P(x, y, c)$ and draw $k$ uniformly from the set $\{1, 2, \ldots, K-1\}$.

2. Generate $\left(X^{(k)}, Y^{(k)}, W^{(k)}\right)$ by (4).

The extended training set $S_E$ contains examples that are equivalent (in terms of expectation) to examples drawn independently from $P_b\left(X^{(k)}, Y^{(k)}, W^{(k)}\right)$. For any given binary classifier $g$, define its out-of-sample error with respect to $P_b$ as

$$E_b(g) \equiv \mathop{\mathbb{E}}_{(X, Y, W) \sim P_b} W \, [\![\, Y \neq g(X) \,]\!].$$

Using the definitions above, we can prove the first theoretical guarantee of the reduction framework.


Theorem 2 (Cost bound of the reduction framework). Consider a ranker $r_g$ that is constructed from a binary classifier $g$ using (3). Assume that $c$ is V-shaped and $c[y] = 0$ for every tuple $(x, y, c)$ generated from $P(c \mid x, y)$. If $g(x, k)$ is rank-monotonic or if every cost vector is convex, then $E(r_g) \leq E_b(g)$.

Proof. From (5),

$$c[r_g(x)] \;\leq\; \frac{1}{K-1} \sum_{k=1}^{K-1} W^{(k)} \, [\![\, Y^{(k)} \neq g\left(X^{(k)}\right) \,]\!].$$

Take the expectation over $P$ on both sides and use $u$ to denote the uniform sampling:

$$\begin{aligned}
E(r_g) &\leq \mathop{\mathbb{E}}_{(x,y,c) \sim P} \, \frac{1}{K-1} \sum_{k=1}^{K-1} W^{(k)} \, [\![\, Y^{(k)} \neq g\left(X^{(k)}\right) \,]\!] \\
&= \mathop{\mathbb{E}}_{(x,y,c) \sim P} \; \mathop{\mathbb{E}}_{k \sim u\{1, \ldots, K-1\}} W^{(k)} \, [\![\, Y^{(k)} \neq g\left(X^{(k)}\right) \,]\!] \\
&= \mathop{\mathbb{E}}_{(X,Y,W) \sim P_b} W \, [\![\, Y \neq g(X) \,]\!] \\
&= E_b(g).
\end{aligned}$$
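The cost bound can also be sanity-checked numerically. The sketch below is our own illustration on synthetic data with absolute costs: it builds a rank-monotonic $g$ by thresholding a noisy score (the coefficients and thresholds are arbitrary choices of ours) and verifies that the empirical estimate of $E(r_g)$ does not exceed that of $E_b(g)$, as Theorem 1 guarantees for each example.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 5, 1000

# synthetic ordinal data: a latent score determines the true rank; absolute costs
xs = rng.normal(size=(N, 3))
scores = xs @ np.array([1.0, -0.5, 0.25])
ys = np.clip(np.digitize(scores, [-1.0, -0.3, 0.3, 1.0]) + 1, 1, K)
cs = np.abs(ys[:, None] - np.arange(1, K + 1)[None, :]).astype(float)

# a deliberately imperfect but rank-monotonic g: threshold a noisy score
noisy = scores + rng.normal(scale=0.5, size=N)
thresholds = np.array([-1.0, -0.3, 0.3, 1.0])               # theta_1 <= ... <= theta_{K-1}
g = np.where(noisy[:, None] > thresholds[None, :], 1, -1)   # g(x, k), shape (N, K-1)

ranks = 1 + (g > 0).sum(axis=1)                             # r_g(x) by (3)
E_rg = cs[np.arange(N), ranks - 1].mean()                   # empirical E(r_g)

Y = np.where(np.arange(1, K)[None, :] < ys[:, None], 1, -1) # Y^(k)
W = (K - 1) * np.abs(cs[:, :-1] - cs[:, 1:])                # W^(k)
E_b = (W * (Y != g)).sum(axis=1).mean() / (K - 1)           # empirical E_b(g)

assert E_rg <= E_b + 1e-12
print(E_rg, E_b)
```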

4.2 Regret Bound of the Reduction Framework

Theorem 2 indicates that if there exists a decent binary classifier $g$, we can obtain a decent ranker $r_g$. Nevertheless, it does not guarantee how good $r_g$ is in comparison with other rankers. In particular, if we consider the optimal binary classifier $g^*$ under $P_b(X, Y, W)$, and the optimal ranker $r^*$ under $P(x, y, c)$, does a small regret $E_b(g) - E_b(g^*)$ in binary classification translate to a small regret $E(r_g) - E(r^*)$ in ordinal ranking? Furthermore, is $E(r_{g^*})$ close to $E(r^*)$? Next, we introduce the reverse reduction technique, which helps to answer the questions above.

The reverse reduction technique works on the binary classification problems generated by the reduction method. It goes through the preprocessing and the prediction stages of the reduction method in the opposite direction. In the preprocessing stage, instead of starting with ordinal examples $(x_n, y_n, c_n)$, reverse reduction deals with weighted binary examples $\left(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)}\right)$. It first combines each set of binary examples sharing the same $x_n$ to an ordinal example by

$$y_n = 1 + \sum_{k=1}^{K-1} [\![\, Y_n^{(k)} > 0 \,]\!], \qquad c_n[k] = \sum_{\ell=1}^{K-1} \frac{W_n^{(\ell)}}{K-1} \, [\![\, y_n \leq \ell < k \ \text{ or } \ k \leq \ell < y_n \,]\!]. \qquad (8)$$

Figure 1: reduction (top) and reverse reduction (bottom).

It is easy to verify that (8) is the exact inverse transform of (4) on the training examples under the assumption that $c[y] = 0$. These ordinal examples are then given to an ordinal ranking algorithm to obtain a ranker $r$. In the prediction stage, reverse reduction works by decomposing the prediction $r(x)$ to $K-1$ binary predictions, each as if coming from a binary classifier

$$g_r\left(X^{(k)}\right) = 2\,[\![\, r(x) > k \,]\!] - 1. \qquad (9)$$
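The reverse transforms are equally mechanical. A sketch of ours that applies (8) to the weighted binary examples sharing one $x_n$ and applies (9) to a ranker's prediction:

```python
import numpy as np

def combine_to_ordinal(Y, W):
    """Inverse transform (8): binary labels/weights of one x_n back to (y_n, c_n)."""
    K = len(Y) + 1                    # Y and W have length K-1
    y = 1 + sum(1 for Y_k in Y if Y_k > 0)
    c = np.zeros(K)
    for k in range(1, K + 1):
        for l in range(1, K):         # l runs over 1, ..., K-1
            if (y <= l < k) or (k <= l < y):
                c[k - 1] += W[l - 1] / (K - 1)
    return y, c

def decompose_prediction(r_of_x, K):
    """Equation (9): view a ranker's prediction as K-1 binary answers g_r(X^(k))."""
    return [1 if r_of_x > k else -1 for k in range(1, K)]
```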

Then, a lemma on the out-of-sample cost of $g_r$ immediately follows (Lin and Li, 2009).

Lemma 1. With the definitions of $P(x, y, c)$ and $P_b\left(X^{(k)}, Y^{(k)}, W^{(k)}\right)$ in Theorem 2, for every ordinal ranker $r$, $E(r) = E_b(g_r)$.

Proof. Because $g_r$ is rank-monotonic by construction, the same proof for the first part of Theorem 2 leads to the desired result.

The stages of reduction and reverse reduction are illustrated in Figure 1. Next, we show how the reverse reduction technique allows us to draw a strong theoretical connection between ordinal ranking and binary classification. By the definition of $r^*$ and $g^*$, for any ranker $r$ and any binary classifier $g$,

$$E(r) \geq E(r^*), \qquad E_b(g) \geq E_b(g^*). \qquad (10)$$

Then, the reverse reduction technique yields a simple proof of the regret bound.

Theorem 3 (Regret bound of the reduction framework). If $g(x, k)$ is rank-monotonic, or if every cost vector is convex, then

$$E(r_g) - E(r^*) \;\leq\; E_b(g) - E_b(g^*). \qquad (11)$$

Proof.

$$\begin{aligned}
E(r_g) - E(r^*) &\leq E_b(g) - E(r^*) && \text{(from Theorem 2)} \\
&= E_b(g) - E_b(g_{r^*}) && \text{(from Lemma 1)} \\
&\leq E_b(g) - E_b(g^*) && \text{(from Equation 10).}
\end{aligned}$$

The cost bound (Theorem 2) and the regret bound (Theorem 3) provide different guarantees for the reduction method. The former describes how the ordinal ranking cost is upper bounded by the binary classification error in an absolute sense, and the latter describes the upper bound in a relative sense.

4.3 Equivalence between Ordinal Ranking and Binary Classification

The results above suggest that ordinal ranking can be reduced to binary classification without any loss of optimality. That is, ordinal ranking is "no harder than" binary classification. Intuitively, binary classification is also "no harder than" ordinal ranking, because the former is a special case of the latter with $K = 2$. Next, we formalize the notion of hardness with the probably approximately correct (PAC) setup in computational learning theory (Kearns and Vazirani, 1994) and prove that ordinal ranking and binary classification are indeed equivalent in hardness. We use the following definition of PAC in our coming theorems (Valiant, 1984; Kearns and Vazirani, 1994).


Definition 1. In cost-sensitive classification, a learning model $\mathcal{G}$ is efficiently PAC-learnable (using the same representation class) if there exists a (possibly randomized) learning algorithm $\mathcal{A}$ satisfying the following property: for every distribution $P(x, y, c)$ being considered, where

$$c[g^*(x)] = c[y] = \min_{1 \leq k \leq K} c[k] = 0, \quad \text{with some } g^* \in \mathcal{G},$$

for all $0 < \epsilon$ and $0 < \delta < \frac{1}{2}$, if $\mathcal{A}$ is given access to an oracle generating examples $(x, y, c)$ from $P(x, y, c)$, as well as inputs $\epsilon$ and $\delta$, then $\mathcal{A}$ outputs $g \in \mathcal{G}$ such that $E(g) \leq \epsilon$ with probability at least $1 - \delta$, as well as with time polynomial in $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$.

Briefly speaking, the definition assumes that the target function $g^*$ is within the learning model $\mathcal{G}$ and is of cost $0$ (the minimum cost). In other words, it is the noiseless setup of learning. We shall only focus on this case while pointing out that similar results can also be proved for the noisy setup (Lin, 2008).

Theorem 4 (Equivalence theorem of the reduction framework). Consider a learning model $\mathcal{R}$ for ordinal ranking, its associated learning model $\mathcal{G} = \{g_r \colon r \in \mathcal{R}\}$ for binary classification, and distributions $P(x, y, c)$ such that all cost vectors are V-shaped. Then, $\mathcal{R}$ is efficiently PAC-learnable if and only if $\mathcal{G}$ is efficiently PAC-learnable.

Proof. If $\mathcal{G}$ is efficiently PAC-learnable using algorithm $\mathcal{A}_{\mathcal{G}}$, we can convert $\mathcal{A}_{\mathcal{G}}$ to an efficient algorithm $\mathcal{A}_{\mathcal{R}}$ for ordinal ranking as follows.

1. Transform the oracle that generates $(x, y, c)$ from $P(x, y, c)$ to an oracle that generates $\left(X^{(k)}, Y^{(k)}, W^{(k)}\right)$ by picking $k$ uniformly and applying (4).

2. Run $\mathcal{A}_{\mathcal{G}}$ with the transformed oracle until it outputs some $g\left(X^{(k)}\right)$.

3. Return $r_g$.

It is not hard to see that $\mathcal{A}_{\mathcal{R}}$ is as efficient as $\mathcal{A}_{\mathcal{G}}$, and the cost guarantee comes from Theorem 2 using the fact that the $g_r$ are all rank-monotonic.
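Step 1 of this construction is just an on-the-fly version of the preprocessing stage in Algorithm 1. A sketch of ours, where `ordinal_oracle` is a hypothetical callable that returns one $(x, y, c)$ tuple per call:

```python
import random

def make_binary_oracle(ordinal_oracle, K):
    """Wrap an oracle for P(x, y, c) into an oracle for P_b(X, Y, W), as in step 1."""
    def binary_oracle():
        x, y, c = ordinal_oracle()              # draw (x, y, c) from P
        k = random.randint(1, K - 1)            # draw k uniformly from {1, ..., K-1}
        X = (x, k)
        Y = +1 if k < y else -1
        W = (K - 1) * abs(c[k - 1] - c[k])      # cost vector indexed from rank 1
        return X, Y, W
    return binary_oracle
```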

Now we consider the other direction. If $\mathcal{R}$ is efficiently PAC-learnable using algorithm $\mathcal{A}_{\mathcal{R}}$, we can convert $\mathcal{A}_{\mathcal{R}}$ to an efficient algorithm $\mathcal{A}_{\mathcal{G}}$ for binary classification.
