
## Reduction from Cost-sensitive Ordinal Ranking to Weighted Binary Classification

**Hsuan-Tien Lin**¹ and **Ling Li**²

¹Department of Computer Science, National Taiwan University.

²Department of Computer Science, California Institute of Technology.

**Keywords: cost-sensitive, ordinal ranking, binary classification, reduction**

**Abstract**

We present a reduction framework from ordinal ranking to binary classification. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranker from the binary classifier. Based on the framework, we show that a weighted 0/1 loss of the binary classifier upper-bounds the mislabeling cost of the ranker, both error-wise and regret-wise. Our framework allows not only the design of good ordinal ranking algorithms based on well-tuned binary classification approaches, but also the derivation of new generalization bounds for ordinal ranking from known bounds for binary classification. In addition, our framework unifies many existing ordinal ranking algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms. In addition, the newly designed algorithms lead to better cost-sensitive ordinal ranking performance as well as improved listwise ranking performance.

**1** **Introduction**

We work on a supervised learning problem called *ordinal ranking*, which is also referred to as ordinal regression (Chu and Keerthi, 2007) or ordinal classification (Frank and Hall, 2001). For instance, the rating that a customer gives on a movie might be one of *do-not-bother*, *only-if-you-must*, *good*, *very-good*, and *run-to-see*.

Those ratings are called the *ranks*, which can be represented by ordinal class labels like $\{1, 2, 3, 4, 5\}$. The ordinal ranking problem is closely related to multiclass classification and metric regression. It differs from the former because of the ordinal information encoded in the ranks, and from the latter because of the lack of metric distance between the ranks. Since rank is a natural representation of human preferences, the problem lends itself to many applications in social science and information retrieval (Liu, 2009).

Many ordinal ranking algorithms have been proposed from a machine learning perspective in recent years. For instance, Herbrich et al. (2000) designed an approach with support vector machines based on comparing training examples in a pairwise manner.

Har-Peled et al. (2003) proposed a constraint classification approach that works with any binary classifier based on the pairwise comparison framework. Nevertheless, such a pairwise comparison perspective may not be suitable for large-scale learning because the size of the associated optimization problem can be large. In particular, for an ordinal ranking problem with $N$ examples, if at least two of the ranks are supported by $\Omega(N)$ examples (which is quite common in practice), the size of the pairwise learning problem is quadratic in $N$.

There are some other approaches that do not lead to such a quadratic expansion. For instance, Crammer and Singer (2005) generalized the online perceptron algorithm with multiple thresholds to do ordinal ranking. In their approach, a perceptron maps an input vector to a latent potential value, which is then thresholded to obtain a rank. Shashua and Levin (2003) proposed new support vector machine (SVM) formulations to handle multiple thresholds, and some other SVM formulations were studied by Rajaram et al.

(2003); Chu and Keerthi (2007); Cardoso and da Costa (2007). All these algorithms share a common property: they are modified from well-known binary classification approaches. Still other approaches fall into neither of the perspectives above, such as Gaussian process ordinal regression (Chu and Ghahramani, 2005).

Since binary classification is much better studied than ordinal ranking, a general framework to systematically reduce the latter to the former can introduce two immediate benefits. First, well-tuned binary classification approaches can be readily transformed into good ordinal ranking algorithms, which saves immense effort in design and implementation. Second, new generalization bounds for ordinal ranking can be easily derived from known bounds for binary classification, which saves tremendous effort in theoretical analysis.

In this paper, we propose such a reduction framework. The framework is based on extended binary examples, which are extracted from the original ordinal ranking examples. The binary classifier trained from the extended binary examples can then be used to construct a ranker. We prove that the mislabeling cost of the ranker is bounded by a weighted 0/1 loss of the binary classifier. Furthermore, we prove that the mislabeling regret of the ranker is bounded by the regret of the binary classifier as well. Hence, binary classifiers that generalize well could introduce rankers that generalize well. The advantages of the framework in algorithmic design and in theoretical analysis are both demonstrated in the paper. In addition, we show that our framework provides a unified view for many existing ordinal ranking algorithms. The experiments on some benchmark data sets validate the usefulness of the framework in practice, both for improving cost-sensitive ordinal ranking performance and for helping improve other ranking criteria.

The paper is organized as follows. We introduce the ordinal ranking problem in Section 2. Some related works are discussed in Section 3. We illustrate our reduction framework in Section 4. The algorithmic and theoretical usefulness of the framework is shown in Section 5. Finally, we present experimental results in Section 6, and conclude in Section 7.

A short version of the paper appeared in the 2006 Conference on Neural Information Processing Systems (Li and Lin, 2007b). The paper was then enriched by the more general cost-sensitive setting as well as the deeper theoretical results that were revealed in the 2009 Preference Learning Workshop (Lin and Li, 2009). For completeness, selected results from an earlier conference work (Lin and Li, 2006) are included in Subsection 5.2. The works above are also parts of the first author's Ph.D. thesis (Lin, 2008). In addition to the results that have been published in the conferences, we point out some important properties of ordinal ranking in Section 2, add a detailed literature discussion in Section 3, show deeper theoretical results on the equivalence between ordinal ranking and binary classification in Section 4, clarify the differences of different SVM-based ordinal ranking algorithms in Section 5, and strengthen the experimental results to emphasize the usefulness of cost-sensitive ordinal ranking in Section 6.

**2** **Ordinal Ranking Setup**

The ordinal ranking problem aims at predicting the rank $y$ of some input vector $x$, where $x$ is in an input space $\mathcal{X} \subseteq \mathbb{R}^D$ and $y$ is in a label space $\mathcal{Y} = \{1, 2, \ldots, K\}$. A function $r \colon \mathcal{X} \to \mathcal{Y}$ is called an *ordinal ranker*, or a *ranker* in short. We shall adopt the cost-sensitive setting (Abe et al., 2004; Lin, 2008), in which a cost vector $c \in \mathbb{R}^K$ is generated with $(x, y)$ from some fixed but unknown distribution $P(x, y, c)$ on $\mathcal{X} \times \mathcal{Y} \times \mathbb{R}^K$. The $k$-th element $c[k]$ of the cost vector represents the penalty when predicting the input vector $x$ as rank $k$. We naturally assume that $c[k] \ge 0$ and $c[y] = 0$. Thus, $y = \operatorname{argmin}_{1 \le k \le K} c[k]$. An ordinal ranking problem comes with a given training set $S = \{(x_n, y_n, c_n)\}_{n=1}^{N}$, whose elements are drawn i.i.d. from $P(x, y, c)$. The goal of the problem is to find a ranker $r$ such that its expected test cost

$$E(r) \equiv \mathop{\mathbb{E}}_{(x, y, c) \sim P} c[r(x)]$$

is small.
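The cost-sensitive setup above can be sketched in a few lines of code. The sketch below is only an illustration under our own naming (`absolute_cost`, `expected_test_cost`, and the toy distribution are ours, not from the paper); it estimates $E(r)$ by Monte Carlo over samples from a toy $P(x, y, c)$:

```python
import numpy as np

K = 5  # number of ranks

def absolute_cost(y, K=K):
    """Cost vector c with c[k] = |y - k|, stored 0-indexed for ranks 1..K."""
    return np.abs(y - np.arange(1, K + 1))

def expected_test_cost(ranker, samples):
    """Monte Carlo estimate of E(r) = E_{(x,y,c)~P} c[r(x)]."""
    return np.mean([c[ranker(x) - 1] for x, y, c in samples])

# A toy distribution: x is a scalar score, the true rank thresholds it.
rng = np.random.default_rng(0)
def draw(n):
    xs = rng.uniform(0, K, size=n)
    ys = np.clip(np.ceil(xs).astype(int), 1, K)
    return [(x, y, absolute_cost(y)) for x, y in zip(xs, ys)]

perfect = lambda x: int(np.clip(np.ceil(x), 1, K))
print(expected_test_cost(perfect, draw(1000)))  # 0.0 for the perfect ranker
```

A ranker that deviates from the rank-generating thresholds would instead pay a positive expected cost under the same estimator.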

The setting above looks similar to that of a cost-sensitive multiclass classification problem (Abe et al., 2004), in the sense that the label space $\mathcal{Y}$ is a finite set. Therefore, ordinal ranking is also called ordinal classification (Frank and Hall, 2001; Cardoso and da Costa, 2007). Nevertheless, in addition to representing the nominal categories (as the usual classification labels), those $y \in \mathcal{Y}$ also carry the ordinal information. That is, two different labels in $\mathcal{Y}$ can be compared by the usual “$<$” operation. Thus, those $y \in \mathcal{Y}$ are called the *ranks* to distinguish them from the usual classification labels.

Ordinal ranking is also similar to regression (for which $y \in \mathbb{R}$ instead of $\mathcal{Y}$), because the real values in $\mathbb{R}$ can be ordered by the usual “$<$” operation. Therefore, ordinal ranking is also popularly called ordinal regression (Herbrich et al., 2000; Shashua and Levin, 2003; Chu and Ghahramani, 2005; Chu and Keerthi, 2007; Xia et al., 2007). Nevertheless, unlike the real-valued regression labels $y \in \mathbb{R}$, the discrete ranks $y \in \mathcal{Y}$ do not carry metric information. For instance, we cannot say that a five-star movie is $2.5$ times better than a two-star one; we also cannot compute the exact distance (difference) between a five-star movie and a one-star movie. In other words, the rank serves as a qualitative indication rather than a quantitative outcome. Thus, any monotonic transform of the label space should not alter the ranking performance. Nevertheless, many regression algorithms depend on the assumed metric information and can be highly affected by monotonic transforms of the label space (which are equivalent to change-of-metric operations). Thus, those regression algorithms may not always perform well on ordinal ranking problems.

The ordinal information carried by the ranks introduces the following two properties, which are important for modeling ordinal ranking problems.

**Closeness in the rank space $\mathcal{Y}$:** The ordinal information suggests that the mislabeling cost depends on the “closeness” of the prediction. For example, predicting a two-star movie as a three-star one is less costly than predicting it as a five-star one. Hence, the cost vector $c$ *should be V-shaped with respect to $y$* (Li and Lin, 2007b), that is,

$$\begin{cases} c[k-1] \ge c[k], & \text{for } 2 \le k \le y; \\ c[k+1] \ge c[k], & \text{for } y \le k \le K-1. \end{cases} \qquad (1)$$

Briefly speaking, a V-shaped cost vector says that a ranker needs to pay more if its prediction on $x$ is further away from $y$. We shall assume that every cost vector $c$ generated from $P(c \mid x, y)$ is V-shaped with respect to $y = \operatorname{argmin}_{1 \le k \le K} c[k]$. In other words, one can decompose $P(y, c \mid x) = P(y \mid c)\, P(c \mid x)$, where $c \sim P(c \mid x)$ is always V-shaped and $P(y \mid c)$ satisfies $y = \operatorname{argmin}_{1 \le k \le K} c[k]$.

In some of our results, we need a stronger condition: the cost vectors should be *convex* (Li and Lin, 2007b), which is defined by the condition¹

$$c[k+1] - c[k] \ge c[k] - c[k-1], \quad \text{for } 2 \le k \le K-1. \qquad (2)$$

¹When connecting the points $(k, c[k])$ from a convex cost vector $c$ by line segments, it is not difficult to prove that the resulting curve is convex for $k \in [1, K]$.

When using convex cost vectors, a ranker needs to pay increasingly more if its prediction on $x$ is further away from $y$. Provably, any convex cost vector $c$ is V-shaped with respect to $y = \operatorname{argmin}_{1 \le k \le K} c[k]$.

The V-shaped and convex cost vectors are general choices that can be used to represent the ordinal nature of $\mathcal{Y}$. One popular cost vector that has been frequently used for ordinal ranking is the absolute cost vector, which accompanies $(x, y)$ as

$$c^{(y)}[k] = |y - k|, \quad 1 \le k \le K.$$

Because the absolute cost vectors come with the $\mathrm{median}$ function as their population minimizer (Dembczyński et al., 2008), the absolute cost appears to be a natural choice for ordinal ranking, similar to how the traditional 0/1 loss is the most natural choice for classification. Nevertheless, our work aims at studying more flexible possibilities (costs) beyond the natural choice, similar to the more flexible weighted loss beyond the 0/1 one in weighted classification (Zadrozny et al., 2003). As we shall show later in Section 6, the flexible costs can be used to embed the desired structural information in $\mathcal{Y}$ for better test performance.
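The V-shaped condition (1) and the convex condition (2) are straightforward to check numerically. The sketch below (helper names are ours, not the paper's) verifies both conditions for the absolute cost vector at every rank:

```python
import numpy as np

def is_v_shaped(c, y):
    """Check condition (1): c non-increasing up to rank y, non-decreasing after.
    c is 0-indexed, storing c[1..K]."""
    c = np.asarray(c, dtype=float)
    left = all(c[k - 2] >= c[k - 1] for k in range(2, y + 1))   # c[k-1] >= c[k], 2 <= k <= y
    right = all(c[k] >= c[k - 1] for k in range(y, len(c)))     # c[k+1] >= c[k], y <= k <= K-1
    return left and right

def is_convex(c):
    """Check condition (2): consecutive differences of c are non-decreasing."""
    c = np.asarray(c, dtype=float)
    return all(c[k] - c[k - 1] >= c[k - 1] - c[k - 2] for k in range(2, len(c)))

K = 5
for y in range(1, K + 1):
    abs_cost = np.abs(y - np.arange(1, K + 1))   # absolute cost vector c^(y)
    assert is_v_shaped(abs_cost, y) and is_convex(abs_cost)
print("absolute cost is V-shaped and convex for every y")
```

As the text notes, convexity implies the V-shape, but not conversely: for example, `[3, 1, 0, 1, 1]` is V-shaped around rank 3 yet fails condition (2).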

**Comparability in the input space $\mathcal{X}$:** Note that the classification cost vectors

$$c^{(\ell)}[k] = [\![\, \ell \ne k \,]\!], \quad 1 \le \ell \le K,$$

which check whether the predicted rank $k$ is exactly the same as the desired rank $\ell$, are also V-shaped.² If those cost vectors are used, an immediate question is: What distinguishes ordinal ranking and common multiclass classification?

Let $r^*$ denote the optimal ranker with respect to $P(x, y, c)$. Note that $r^*$ introduces a total preorder in $\mathcal{X}$ (Herbrich et al., 2000). That is,

$$x \preceq x' \iff r^*(x) \le r^*(x').$$

The total preorder allows us to naturally group and compare vectors in the input space $\mathcal{X}$. For instance, a two-star movie is “worse than” a three-star one, which is in turn “worse than” a four-star one; movies of less than three stars are “worse than” movies of at least three stars.

²$[\![\, \cdot \,]\!]$ is $1$ if the inner condition is true, and $0$ otherwise.

The simplicity of the grouping and the comparison distinguishes ordinal ranking from multiclass classification. For instance, when classifying movies, it is difficult to group {action movies, romantic movies} and compare with {comic movies, thriller movies}, but “movies of less than three stars” can be naturally compared with “movies of at least three stars.”

The comparability property connects ordinal ranking to monotonic classification (Sill, 1998; Kotłowski and Słowiński, 2009), which is also referred to as ordinal classification with the monotonicity constraints and is an important problem on its own. Monotonic classification models the ordinal ranking problem by assuming that an explicit order in the input space (such as the value-order of one particular feature) can be used to directly (and monotonically) infer the order of the ranks in the output space ($y \le y'$). In other words, monotonic classification allows putting thresholds on the explicit order to perform ordinal ranking. The comparability property shows that there is an order (total preorder) introduced by the ranks. Nevertheless, the order is not always “explicit” in general ordinal ranking problems. Therefore, many of the existing ordinal ranking approaches, such as the thresholded model that will be discussed in Section 5, seek the implicit order through transforming the input vectors before respecting the monotonic nature between the implicit order and the order of the ranks.

In Table 1, we summarize four different learning problems in terms of their comparability and closeness properties.

Table 1: properties of different learning problems

| comparability | closeness: weak (classification cost vectors) | closeness: strong (other V-shaped cost vectors) |
|---|---|---|
| yes | degenerate ordinal ranking | usual ordinal ranking |
| no | multiclass classification | special cases of cost-sensitive classification |

As discussed, usual ordinal ranking problems come with strong closeness in $\mathcal{Y}$ (which is represented by V-shaped cost vectors) and simple comparability in $\mathcal{X}$. The classification cost vectors can be viewed as degenerate V-shaped cost vectors, and hence introduce degenerate ordinal ranking problems.

Multiclass classification problems, on the other hand, do not allow examples of different classes to be naturally grouped and compared. If we want to use cost vectors other than the classification ones, we move to special cases of cost-sensitive classification. For instance, when trying to recognize digits $\{0, 1, \ldots, 9\}$ for written checks, a possible cost is the absolute one (to represent monetary differences) rather than simply right or wrong (the classification cost). The absolute cost is V-shaped and convex. Nevertheless, the digits intuitively cannot be grouped and compared, and hence the recognition problem belongs to cost-sensitive multiclass classification rather than ordinal ranking (Lin, 2008).

From the discussions above, a good ordinal ranking algorithm should appropriately use the comparability property. In Section 4, we will show how the property serves as a key to derive our proposed reduction framework.

**3** **Related Literature**

The analysis of ordinal data has been studied in statistics by defining a suitable link function that models the underlying probability for generating the ordinal labels (Anderson, 1984). For instance, one popular model is the cumulative link model (Agresti, 2002) that will be discussed in Section 5. Similar models can be traced back to the work of McCullagh (1980). The many earlier works in statistics, which usually focus on the effectiveness and efficiency of the modeling, influence the ordinal ranking studies in machine learning (Herbrich et al., 2000), including our work. Another related area that studies the analysis of ordinal data is operations research, especially the subarea of multi-criteria decision analysis (Greco et al., 2000; Figueira et al., 2005), which contains many works that focus on reasonable decision making with ordinal preference scales. Our work tackles ordinal ranking problems from the machine learning perspective—improving the test performance—and is hence different from the works that take the perspective of statistics or operations research.

In machine learning (and information retrieval), there are three major families of ranking algorithms: pointwise, pairwise and listwise (Liu, 2009). The ordinal ranking setup presented in Section 2 belongs to pointwise ranking. Next, we discuss some representative algorithms in each family and relate them to the ordinal ranking setup. Then, we compare the proposed reduction framework with other reduction-based approaches for ranking.

**3.1** **Families of Ranking Algorithms**

**Pointwise ranking.** Pointwise ranking aims at predicting the relevance of some input vector $x$ using either real-valued scores or ordinal-valued ranks. It does not directly use the comparison nature of ranking.

The ordinal ranking algorithms studied in this paper focus on computing ordinal-valued ranks for pointwise ranking. For obtaining real-valued scores, a fundamental tool is traditional least-squares regression (Hastie et al., 2001). As discussed in Section 2, however, when the given examples come with ordinal labels, the ordinal ranking algorithms studied in this paper can be more useful than traditional regression by taking the metric-less nature of labels into account.

**Pairwise ranking.** Pairwise ranking aims at predicting the relative order between two input vectors $x$ and $x'$ and thus captures the local comparison nature of ranking. It is arguably one of the most widely used techniques in the ranking family and is usually cast as a binary classification problem of predicting whether $x$ is preferred over $x'$. During training, such a problem translates to comparing all pairs of $(x_n, x_m)$ based on their corresponding labels. One representative pairwise ranking algorithm is RankSVM (Herbrich et al., 2000), which trains an underlying support vector machine using those pairs. RankSVM was initially proposed for data sets that come with ordinal labels, but is also commonly applied to data sets that come with real-valued labels.

Note that even when all the labels take ordinal values, as long as two of the classes contain $\Omega(N)$ examples, there are $\Omega(N^2)$ pairs. Such a quadratic number of pairs makes it difficult to scale up general pairwise ranking algorithms, except in special cases like the linear support vector machine (Joachims, 2006) or RankBoost (Herbrich et al., 2000; Lin and Li, 2006). Thus, when the training set is large and contains ordinal labels, the ordinal ranking algorithms studied in this paper may serve as a useful alternative over pairwise ranking ones.

**Listwise ranking.** Listwise ranking aims at ordering a whole finite set of input vectors $S' = \{x'_m\}_{m=1}^{M}$. In particular, the (listwise) ranker tries to minimize the inconsistency between the predicted permutation and the ground-truth permutation of $S'$ (Liu, 2009). Listwise ranking captures the global comparison nature of ranking. One representative listwise ranking algorithm is ListNet (Cao et al., 2007), which is based on an underlying neural network model along with an estimated distribution of all possible permutations (rankers). Nevertheless, there are $M!$ permutations for a given $S'$. Thus, listwise ranking can be computationally even more expensive than pairwise ranking.

Many listwise ranking algorithms try to alleviate the computational burden by keeping some internal pointwise rankers. For instance, ListNet uses the underlying neural network to score each instance (Cao et al., 2007) for the purpose of permutation. The use of internal pointwise rankers for listwise ranking further justifies the importance of better understanding pointwise ranking, including the ordinal ranking algorithms studied in this paper.

**3.2** **Reduction Approaches for Ranking**

Because ranking is a relatively new and diverse problem in machine learning, many existing ranking approaches try to reduce the ranking problem to other learning problems. Next, we discuss some existing reduction-based approaches that are related to the framework proposed in this paper.

**From pairwise ranking to binary classification.** Balcan et al. (2007) propose a robust reduction from bipartite (i.e., ordinal with two outcomes) pairwise ranking to binary classification. The training part of the reduction works like usual pairwise ranking: learning a binary classifier on whether $x$ is preferred over $x'$. The prediction part of the reduction asks the underlying binary classifier to vote for each example in the test set in order to rank those examples. The reduction is simple but yields solid theoretical guarantees. In particular, for ranking $M$ test examples, the reduction uses $\Theta(M^2)$ calls to the binary classifier and transforms a binary classification regret of $r$ to a bipartite ranking regret (measured by the so-called AUC criterion) of at most $2r$.

Ailon and Mohri (2008) improve the reduction of Balcan et al. (2007) and propose a more efficient reduction from general pairwise ranking to binary classification. The prediction part of the reduction operates by taking the underlying binary classifier as the comparison function of the popular QuickSort algorithm. In the special bipartite ranking case, for ranking $M$ examples, the reduction uses $O(M \log M)$ calls to the binary classifier on average and transforms a binary classification regret of $r$ to a bipartite ranking regret of at most $2r$.
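The prediction step of such a pairwise reduction amounts to sorting with the binary classifier as the comparator. The sketch below is our own simplified illustration (names are ours; Python's built-in comparison sort stands in for QuickSort, and likewise makes $O(M \log M)$ comparator calls):

```python
import functools

def rank_by_classifier(items, prefers):
    """Order `items` using a binary preference classifier as a sort comparator.

    `prefers(a, b)` returns True when a should be ranked ahead of b; a
    comparison-based sort makes O(M log M) calls to the classifier.
    """
    cmp = lambda a, b: -1 if prefers(a, b) else 1
    return sorted(items, key=functools.cmp_to_key(cmp))

# Toy classifier: prefer larger scores.
print(rank_by_classifier([3, 1, 4, 1, 5], lambda a, b: a > b))  # [5, 4, 3, 1, 1]
```

With a noisy (intransitive) classifier the output is no longer a true sort, which is exactly the situation the regret analysis of Ailon and Mohri (2008) addresses.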

**From listwise ranking to regression (pointwise ranking).** The Subset Ranking (Cossock and Zhang, 2008) algorithm can be viewed as a reduction from listwise ranking to regression. In particular, Cossock and Zhang (2008) prove that regression with various cost functions can be used to approximate a Bayes-optimal listwise ranker. In other words, low-regret regressors can be cast as low-regret listwise rankers.

**From listwise ranking to ordinal (pointwise) ranking.** McRank (Li et al., 2008) is a reduction from listwise ranking to ordinal ranking with the classification cost. The main theoretical justification of the reduction shows that a scaled classification cost of an ordinal ranker can upper-bound the regret of the associated listwise ranker. That is, low-error ordinal rankers can be cast as low-regret listwise rankers. Li et al. (2008) empirically verified that McRank can perform better than the Subset Ranking algorithm (Cossock and Zhang, 2008).

**From ordinal ranking to binary classification.** The proposed framework in this paper and the associated shorter version (Li and Lin, 2007b) is a reduction from ordinal ranking to binary classification. We will show that the reduction is both error-preserving and regret-preserving. That is, low-error binary classifiers can be cast as low-error ordinal rankers; low-regret binary classifiers can be cast as low-regret ordinal rankers.

The data replication method, which was independently proposed by Cardoso and da Costa (2007), is a similar but more restricted case of the reduction framework. The data replication method essentially considers the absolute cost. In addition, the focus of the data replication method (Cardoso and da Costa, 2007) is on explaining the training procedure of the reduction. The proposed framework in this paper is more general than the data replication method in terms of the cost considered as well as the deeper theoretical analysis on both the training and the test performance of the reduction.

Table 2: comparison of general reductions from ranking to binary classification

| reduction | size of transformed set during training | # calls to binary classifiers during prediction | evaluation criterion |
|---|---|---|---|
| the proposed framework | $O(KN)$ | $O(KM)$ | ranking cost |
| (Balcan et al., 2007) | $O(N^2)$ | $O(M^2)$ | AUC |
| (Ailon and Mohri, 2008) | $O(N^2)$ | $O(M \log M)$ | AUC |

The proposed reduction framework for pointwise ranking and the existing reductions in pairwise ranking (Balcan et al., 2007; Ailon and Mohri, 2008) take very different views on the ranking problem and consider different evaluation criteria. As a consequence, when learning from $N$ examples and ranking (predicting on) $M$ instances with $K$ ordinal scales, the proposed framework results in a transformed training set of size $O(KN)$ and a prediction procedure with time complexity $O(KM)$. Both the size of the training set and the time complexity of the prediction procedure are more efficient than those of the state-of-the-art reduction from pairwise ranking to binary classification (Ailon and Mohri, 2008), as shown in Table 2.

Note that the work of Li et al. (2008) revealed an opportunity to use the discrete nature of ordinal-valued labels to improve the listwise ranking performance over Subset Ranking when using a heuristic ordinal ranking algorithm. The proposed framework is a more rigorous study on ordinal ranking that can be coupled with McRank to yield a reduction from listwise ranking to binary classification, which allows state-of-the-art binary classification algorithms to be efficiently used for listwise ranking. We will demonstrate the use of this opportunity in Subsection 6.4.

**4** **Reduction Framework**

We will first introduce the details of our proposed reduction framework. Then, we will demonstrate its theoretical guarantees. Consider, for instance, that we want to know how good a movie $x$ is. Using the comparability property of ordinal ranking, we can then ask the associated question “is the rank of $x$ greater than $k$?”

For a given $k$, such a question is exactly a binary classification problem, and the rank of $x$ can be determined by asking multiple questions for $k = 1, 2, \ldots$, until $(K-1)$. The questions are the core of the Dominance-based Rough Set Approach in operations research for reasoning from ordinal data (Słowiński et al., 2007). From the machine learning perspective, Frank and Hall (2001) proposed to solve each binary classification problem independently and combine the binary outputs to a rank. Although their approach is simple, the generalization performance using the combination step cannot be easily analyzed.

The proposed reduction framework works differently. First, a simpler step is used to convert the binary outputs to a rank, and generalization analysis can immediately follow. Moreover, all the binary classification problems are solved jointly to obtain a single binary classifier.

Assume that $g(x, k)$ is the single binary classifier that provides answers to all the associated questions above. *Consistent* answers would be $g(x, k) = +1$ (“yes”) for $k = 1$ until $(\ell - 1)$ for some $\ell$, and $-1$ (“no”) afterward. Then, a reasonable ranker based on the binary answers is $r_g(x) = \ell$. Equivalently,

$$r_g(x) \equiv 1 + \sum_{k=1}^{K-1} [\![\, g(x, k) > 0 \,]\!]. \qquad (3)$$

The binary classifier $g$ that only produces consistent answers would be called *rank-monotonic*.³
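Equation (3) translates directly into code. The sketch below (function names are ours) constructs $r_g$ from any binary classifier $g(x, k)$ and tries it with a simple rank-monotonic $g$ on scalar inputs:

```python
def make_ranker(g, K):
    """Construct r_g from a binary classifier g via equation (3):
    r_g(x) = 1 + sum_{k=1}^{K-1} [g(x, k) > 0]."""
    return lambda x: 1 + sum(1 for k in range(1, K) if g(x, k) > 0)

# A rank-monotonic g for a scalar x, answering "is the rank of x greater than k?"
K = 5
g = lambda x, k: +1 if x > k else -1
r = make_ranker(g, K)
print([r(x) for x in [0.5, 1.5, 3.2, 4.9]])  # [1, 2, 4, 5]
```

Note that (3) remains well-defined even when $g$ is not rank-monotonic; it simply counts the positive answers.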

For any ordinal example $(x, y, c)$, we can define the extended binary examples $\bigl( X^{(k)}, Y^{(k)} \bigr)$ with weights $W^{(k)}$ as

$$X^{(k)} = (x, k), \quad Y^{(k)} = 2\,[\![\, k < y \,]\!] - 1, \quad W^{(k)} = (K-1) \cdot \bigl|\, c[k] - c[k+1] \,\bigr|. \qquad (4)$$

The extended input vector $X^{(k)}$ represents the associated question “is the rank of $x$ greater than $k$?”; the binary label $Y^{(k)}$ represents the desired answer to the question; the weight $W^{(k)}$ represents the importance of the question and will be used in the coming theoretical analysis. Here $X^{(k)}$ stands for an abstract pair and we will discuss its practical encoding in Section 5. If $g\bigl( X^{(k)} \bigr) \equiv g(x, k)$ makes no errors on all the associated questions, $r_g(x)$ equals $y$ by (3). That is, $c[r_g(x)] = 0$. In the following theorem, we further connect $c[r_g(x)]$ to the amount of error that $g$ makes.
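The extraction in (4) can be sketched in a few lines (the function name is ours; cost vectors are stored 0-indexed, so `c[k-1]` holds $c[k]$):

```python
def extend_example(x, y, c):
    """Extract the (K-1) extended binary examples of equation (4)
    from one ordinal example (x, y, c)."""
    K = len(c)
    out = []
    for k in range(1, K):
        X = (x, k)                        # the question "is the rank of x > k?"
        Y = 2 * int(k < y) - 1            # the desired answer, in {-1, +1}
        W = (K - 1) * abs(c[k - 1] - c[k])  # the importance of the question
        out.append((X, Y, W))
    return out

# Example with K = 4 and the absolute cost of y = 2:
c = [abs(2 - k) for k in range(1, 5)]     # [1, 0, 1, 2]
print(extend_example(0.7, 2, c))
# [((0.7, 1), 1, 3), ((0.7, 2), -1, 3), ((0.7, 3), -1, 3)]
```

For the absolute cost all weights are equal, which recovers the unweighted reduction of the conference version; general V-shaped costs make some questions more important than others.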

³Although (3) can be flexibly applied even when $g$ is not rank-monotonic, a rank-monotonic $g$ is usually desired in order to introduce a good ranker $r_g$.

**Theorem 1 (Per-example cost bound).** *For any ordinal example $(x, y, c)$, where $c$ is V-shaped and $c[y] = 0$, consider its associated extended binary examples $\bigl( X^{(k)}, Y^{(k)}, W^{(k)} \bigr)$ in (4). Assume that the ranker $r_g$ is constructed from a binary classifier $g$ using (3). If $g\bigl( X^{(k)} \bigr)$ is rank-monotonic or if $c$ is convex, then*

$$c[r_g(x)] \le \frac{1}{K-1} \sum_{k=1}^{K-1} W^{(k)} \,[\![\, Y^{(k)} \ne g\bigl( X^{(k)} \bigr) \,]\!]. \qquad (5)$$

*Proof.* Because $g$ is rank-monotonic, $g\bigl( X^{(k)} \bigr) = +1$ for $k < r_g(x)$ and $g\bigl( X^{(k)} \bigr) = -1$ for $k \ge r_g(x)$. Thus, the cost that the ranker $r_g$ needs to pay is

$$c[r_g(x)] = \sum_{k=r_g(x)}^{K-1} \bigl( c[k] - c[k+1] \bigr) + c[K] = \sum_{k=1}^{K-1} \bigl( c[k] - c[k+1] \bigr) \,[\![\, g\bigl( X^{(k)} \bigr) < 0 \,]\!] + c[K]. \qquad (6)$$

Because the cost vector $c$ is V-shaped, $Y^{(k)}$ equals the sign of $\bigl( c[k] - c[k+1] \bigr)$ if the latter is not zero. Continuing from (6) with $c[y] = 0$,

$$\begin{aligned}
(K-1)\, c[r_g(x)] &= \sum_{k=1}^{y-1} W^{(k)} Y^{(k)} \,[\![\, g\bigl( X^{(k)} \bigr) < 0 \,]\!] + (K-1)\, c[K] + \sum_{k=y}^{K-1} W^{(k)} Y^{(k)} \Bigl( 1 - [\![\, g\bigl( X^{(k)} \bigr) > 0 \,]\!] \Bigr) \\
&= \sum_{k=1}^{y-1} W^{(k)} \,[\![\, Y^{(k)} \ne g\bigl( X^{(k)} \bigr) \,]\!] + (K-1)\, c[y] + \sum_{k=y}^{K-1} W^{(k)} \,[\![\, Y^{(k)} \ne g\bigl( X^{(k)} \bigr) \,]\!] \\
&= \sum_{k=1}^{K-1} W^{(k)} \,[\![\, Y^{(k)} \ne g\bigl( X^{(k)} \bigr) \,]\!]. \qquad (7)
\end{aligned}$$

When $g$ is not rank-monotonic but the cost vector $c$ is convex, equation (7) becomes an inequality that could be alternatively proved by replacing (6) with

$$\sum_{k=r_g(x)}^{K-1} \bigl( c[k] - c[k+1] \bigr) \le \sum_{k=1}^{K-1} \bigl( c[k] - c[k+1] \bigr) \,[\![\, g\bigl( X^{(k)} \bigr) < 0 \,]\!].$$

The inequality above holds because $\bigl( c[k] - c[k+1] \bigr)$ is decreasing due to the convexity, and there are exactly $\bigl( r_g(x) - 1 \bigr)$ zeros and $\bigl( K - r_g(x) \bigr)$ ones in the values of $[\![\, g\bigl( X^{(k)} \bigr) < 0 \,]\!]$ according to (3). $\square$
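As a sanity check, the per-example bound (5) can be verified numerically: for a convex cost vector, the theorem promises the bound for *arbitrary* (not necessarily rank-monotonic) binary answers. The sketch below (all names are ours) exhausts every answer pattern for random convex V-shaped costs:

```python
import itertools
import random

def ranker_from_answers(answers):
    """Equation (3): r_g = 1 + #{k : g(x,k) > 0}; answers[k-1] in {-1, +1}."""
    return 1 + sum(1 for a in answers if a > 0)

def bound_holds(c, y, answers):
    """Check c[r_g(x)] <= (1/(K-1)) * sum_k W_k [Y_k != g_k], from (4) and (5)."""
    K = len(c)
    r = ranker_from_answers(answers)
    rhs = 0.0
    for k in range(1, K):
        Y = 2 * int(k < y) - 1
        W = (K - 1) * abs(c[k - 1] - c[k])
        rhs += W * int(Y != answers[k - 1])
    return c[r - 1] <= rhs / (K - 1) + 1e-9

random.seed(0)
K = 5
for _ in range(200):
    y = random.randint(1, K)
    c = [abs(y - k) ** 1.5 for k in range(1, K + 1)]   # convex, V-shaped, c[y] = 0
    for answers in itertools.product([-1, +1], repeat=K - 1):
        assert bound_holds(c, y, answers)
print("per-example bound (5) verified")
```

For rank-monotonic answer patterns the check passes with equality, matching the equality chain in (7).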

We call (5) the per-example cost bound, which says that if $g$ makes only a small amount of error on the extended binary examples $\bigl( X^{(k)}, Y^{(k)}, W^{(k)} \bigr)$, then $r_g$ is guaranteed to only pay a small amount of cost on the original example $(x, y, c)$. The bound allows us to derive the following reduction method, which is composed of three stages: preprocessing, training, and prediction.

**Algorithm 1 (Reduction to extended binary classification).**

*1. Preprocessing: For each original training example* $(x_n, y_n, c_n) \in S$ *and for each* $k = 1, 2, \ldots, K-1$*, generate an extended training example* $\big(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)}\big)$ *and include it in* $S_E$*, where*

$$X_n^{(k)} = (x_n, k), \qquad Y_n^{(k)} = 2\,\llbracket k < y_n \rrbracket - 1, \qquad W_n^{(k)} = (K-1) \cdot \big|\, c_n[k] - c_n[k+1] \,\big|.$$

*2. Training: Use a binary classification algorithm on* $S_E$ *and get a binary classifier* $g$ *on a concrete encoding (to be discussed in Section 5) of* $\mathcal{X} \times \{1, 2, \ldots, K-1\}$*. Let* $g(x, k) \equiv g\big(X^{(k)}\big)$*.*

*3. Prediction: For any* $x \in \mathcal{X}$*, estimate its rank with (3).*
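The three stages above can be sketched in code. The sketch below is our own minimal illustration (the function names are hypothetical): it implements the preprocessing transform (4) and the prediction rule (3), $r_g(x) = 1 + \sum_{k=1}^{K-1} \llbracket g(x, k) > 0 \rrbracket$; step 2 can be filled in by any weighted binary classification algorithm.

```python
from typing import Callable, List, Sequence, Tuple

def extend(x, y: int, c: Sequence[float], K: int) -> List[Tuple[tuple, int, float]]:
    """Step 1 (preprocessing): turn one ordinal example (x, y, c) into
    K-1 weighted binary examples (X^(k), Y^(k), W^(k)) following (4).
    c is 0-indexed here, so c[k-1] stands for the paper's c[k]."""
    extended = []
    for k in range(1, K):
        X = (x, k)                           # X^(k) = (x, k)
        Y = 1 if k < y else -1               # Y^(k) = 2[[k < y]] - 1
        W = (K - 1) * abs(c[k - 1] - c[k])   # W^(k) = (K-1)|c[k] - c[k+1]|
        extended.append((X, Y, W))
    return extended

def rank(g: Callable[[object, int], float], x, K: int) -> int:
    """Step 3 (prediction): r_g(x) = 1 + sum_k [[g(x, k) > 0]], rule (3)."""
    return 1 + sum(1 for k in range(1, K) if g(x, k) > 0)
```

For instance, with $K = 5$, $y = 3$, and the V-shaped cost vector $c = (2, 1, 0, 1, 2)$, the extended labels are $(+1, +1, -1, -1)$, and a rank-monotonic $g$ that is positive exactly for $k < 3$ yields rank $3$.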

**4.1** **Cost Bound of the Reduction Framework**

Consider the following probability distribution $P_b\big(X^{(k)}, Y^{(k)}, W^{(k)}\big)$ that generates the extended binary examples.

1. Draw a tuple $(x, y, c)$ independently from $P(x, y, c)$ and draw $k$ uniformly from the set $\{1, 2, \ldots, K-1\}$.

2. Generate $\big(X^{(k)}, Y^{(k)}, W^{(k)}\big)$ by (4).

The extended training set $S_E$ contains examples that are equivalent (in terms of expectation) to examples drawn independently from $P_b\big(X^{(k)}, Y^{(k)}, W^{(k)}\big)$. For any given binary classifier $g$, define its out-of-sample error with respect to $P_b$ as

$$E_b(g) \;\equiv\; \mathop{\mathbb{E}}_{(X, Y, W) \sim P_b} W \,\llbracket Y \neq g(X) \rrbracket.$$

Using the definitions above, we can prove the first theoretical guarantee of the reduction framework.

**Theorem 2 (Cost bound of the reduction framework). Consider a ranker** $r_g$ *that is constructed from a binary classifier* $g$ *using (3). Assume that* $c$ *is V-shaped and* $c[y] = 0$ *for every tuple* $(x, y, c)$ *generated from* $P(c \mid x, y)$*. If* $g(x, k)$ *is rank-monotonic or if every cost vector* $c$ *is convex, then* $E(r_g) \le E_b(g)$*.*

*Proof.* From (5),

$$c\big[r_g(x)\big] \;\le\; \frac{1}{K-1} \sum_{k=1}^{K-1} W^{(k)} \Big\llbracket Y^{(k)} \neq g\big(X^{(k)}\big) \Big\rrbracket.$$

Take the expectation over $P$ on both sides, using $\sim_u$ to denote uniform sampling:

$$\begin{aligned}
E(r_g) &\le \mathop{\mathbb{E}}_{(x, y, c) \sim P} \frac{1}{K-1} \sum_{k=1}^{K-1} W^{(k)} \Big\llbracket Y^{(k)} \neq g\big(X^{(k)}\big) \Big\rrbracket\\
&= \mathop{\mathbb{E}}_{(x, y, c) \sim P}\ \mathop{\mathbb{E}}_{k \sim_u \{1, \ldots, K-1\}} W^{(k)} \Big\llbracket Y^{(k)} \neq g\big(X^{(k)}\big) \Big\rrbracket\\
&= \mathop{\mathbb{E}}_{(X, Y, W) \sim P_b} W \,\llbracket Y \neq g(X) \rrbracket\\
&= E_b(g).
\end{aligned}$$
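As a complement to the proof, the per-example bound (5) can be checked numerically: with convex cost vectors (here the absolute costs $c[k] = |k - y|$), the bound must hold even for a binary classifier that is not rank-monotonic. This is our own illustrative sketch with hypothetical names, not part of the original framework.

```python
import random

def check_cost_bound(trials: int = 1000, K: int = 6) -> bool:
    """Sanity check of the per-example cost bound (5): for convex cost
    vectors c[k] = |k - y| and an arbitrary {-1,+1}-valued classifier g,
    the averaged weighted binary error upper-bounds the ranker's cost."""
    rng = random.Random(0)
    for _ in range(trials):
        y = rng.randint(1, K)
        c = [abs(k - y) for k in range(1, K + 1)]            # convex, c[y] = 0
        gvals = [rng.choice([-1, 1]) for _ in range(K - 1)]  # g(X^(k)), arbitrary
        r = 1 + sum(1 for v in gvals if v > 0)               # rule (3)
        lhs = c[r - 1]                                       # c[r_g(x)]
        rhs = 0.0
        for k in range(1, K):
            Y = 1 if k < y else -1                           # Y^(k) from (4)
            W = (K - 1) * abs(c[k - 1] - c[k])               # W^(k) from (4)
            rhs += W * (Y != gvals[k - 1])
        rhs /= (K - 1)
        assert lhs <= rhs + 1e-9, "bound (5) violated"
    return True
```

Running the check over many random draws never triggers the assertion, consistent with the convex case of Theorem 1.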

**4.2** **Regret Bound of the Reduction Framework**

Theorem 2 indicates that if there exists a decent binary classifier $g$, we can obtain a decent ranker $r_g$. Nevertheless, it does not guarantee how good $r_g$ is in comparison with other rankers. In particular, if we consider the optimal binary classifier $g^*$ under $P_b(X, Y, W)$ and the optimal ranker $r^*$ under $P(x, y, c)$, does a small regret $E_b(g) - E_b(g^*)$ in binary classification translate to a small regret $E(r_g) - E(r^*)$ in ordinal ranking? Furthermore, is $E(r_g)$ close to $E(r^*)$? *Next, we introduce the reverse reduction technique, which helps to answer the questions above.*

The reverse reduction technique works on the binary classification problems generated by the reduction method. It goes through the preprocessing and the prediction stages of the reduction method in the opposite direction. In the preprocessing stage, instead of starting with ordinal examples $(x_n, y_n, c_n)$, reverse reduction deals with weighted binary examples $\big(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)}\big)$. It first combines each set of binary examples sharing the same $x_n$ to an ordinal example by

$$y_n = 1 + \sum_{k=1}^{K-1} \Big\llbracket Y_n^{(k)} > 0 \Big\rrbracket, \qquad
c_n[k] = \sum_{\ell=1}^{K-1} \frac{W_n^{(\ell)}}{K-1}\, \big\llbracket\, y_n \le \ell < k \ \text{ or } \ k \le \ell < y_n \,\big\rrbracket. \tag{8}$$

Figure 1: reduction (top) and reverse reduction (bottom). [Diagram: the top row maps an ordinal example $(x_n, y_n, c_n)$ to weighted binary examples $\big(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)}\big)$, $k = 1, \ldots, K-1$, feeds them to a core binary classification algorithm, and turns the related binary classifiers $g\big(X^{(k)}\big)$ into the ordinal ranker $r_g(x)$; the bottom row combines weighted binary examples into an ordinal example, feeds it to a core ordinal ranking algorithm, and decomposes the resulting ordinal ranker $r(x)$ into the related binary classifiers $g_r\big(X^{(k)}\big)$.]

It is easy to verify that (8) is the exact inverse transform of (4) on the training examples under the assumption that $c[y] = 0$. These ordinal examples are then given to an ordinal ranking algorithm to obtain a ranker $r$. In the prediction stage, reverse reduction works by decomposing the prediction $r(x)$ into $K-1$ binary predictions, each as if coming from a binary classifier

$$g_r\big(X^{(k)}\big) = 2\,\big\llbracket r(x) > k \big\rrbracket - 1. \tag{9}$$
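That (8) inverts (4) under $c[y] = 0$ can be verified mechanically. Below is a minimal sketch of ours (hypothetical helper names) that applies (4) and then (8), recovering the original rank and cost vector.

```python
from typing import List, Sequence, Tuple

def to_binary(x, y: int, c: Sequence[float], K: int) -> List[Tuple[tuple, int, float]]:
    """Forward transform (4): one ordinal example -> K-1 weighted binary examples.
    c is 0-indexed, so c[k-1] stands for the paper's c[k]."""
    return [((x, k),
             1 if k < y else -1,                 # Y^(k)
             (K - 1) * abs(c[k - 1] - c[k]))     # W^(k)
            for k in range(1, K)]

def to_ordinal(binary_examples, K: int):
    """Reverse transform (8): recover (y, c) from the K-1 binary examples."""
    Y = [ex[1] for ex in binary_examples]
    W = [ex[2] for ex in binary_examples]
    y = 1 + sum(1 for v in Y if v > 0)           # y = 1 + sum_k [[Y^(k) > 0]]
    c = []
    for k in range(1, K + 1):
        # sum W^(l)/(K-1) over l with y <= l < k or k <= l < y
        lo, hi = min(y, k), max(y, k)
        c.append(sum(W[l - 1] for l in range(lo, hi)) / (K - 1))
    return y, c
```

Round-tripping a V-shaped cost vector with $c[y] = 0$ through these two transforms returns it unchanged, as claimed above.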

Then, a lemma on the out-of-sample cost of $g_r$ immediately follows (Lin and Li, 2009).

**Lemma 1. With the definitions of** $P(x, y, c)$ *and* $P_b\big(X^{(k)}, Y^{(k)}, W^{(k)}\big)$ *in Theorem 2, for every ordinal ranker* $r$*,* $E(r) = E_b(g_r)$*.*

*Proof.* Because $g_r$ is rank-monotonic by construction, the same proof for the first part of Theorem 2 leads to the desired result.
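Lemma 1 in fact holds per example: for the rank-monotonic $g_r$ of (9), the weighted binary error on the extended examples of $(x, y, c)$ telescopes to exactly $c[r(x)]$ whenever $c$ is V-shaped with $c[y] = 0$. A minimal sketch of ours under those assumptions:

```python
def weighted_error(r: int, y: int, c, K: int) -> float:
    """Average weighted 0/1 error of g_r (rule (9)) on the K-1 extended
    binary examples of an ordinal example (x, y, c); per the per-example
    form of Lemma 1 this equals c[r]. c is 0-indexed: c[k-1] is c[k]."""
    total = 0.0
    for k in range(1, K):
        Y = 1 if k < y else -1               # Y^(k) from (4)
        W = (K - 1) * abs(c[k - 1] - c[k])   # W^(k) from (4)
        g = 1 if r > k else -1               # g_r(X^(k)) = 2[[r(x) > k]] - 1
        total += W * (Y != g)
    return total / (K - 1)
```

The errors of $g_r$ occur exactly for $k$ between $\min(r, y)$ and $\max(r, y) - 1$, so the weights telescope to $(K-1)\,\big|c[r] - c[y]\big| = (K-1)\,c[r]$.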

The stages of reduction and reverse reduction are illustrated in Figure 1. Next, we show how the reverse reduction technique allows us to draw a strong theoretical connection between ordinal ranking and binary classification. By the definitions of $r^*$ and $g^*$, for any ranker $r$ and any binary classifier $g$,

$$E(r) \ge E(r^*), \qquad E_b(g) \ge E_b(g^*). \tag{10}$$

Then, the reverse reduction technique yields a simple proof of the regret bound.

**Theorem 3 (Regret bound of the reduction framework). If** $g(x, k)$ *is rank-monotonic, or if every cost vector* $c$ *is convex, then*

$$E(r_g) - E(r^*) \;\le\; E_b(g) - E_b(g^*). \tag{11}$$

*Proof.*

$$\begin{aligned}
E(r_g) - E(r^*) &\le E_b(g) - E(r^*) && \text{(from Theorem 2)}\\
&= E_b(g) - E_b(g_{r^*}) && \text{(from Lemma 1)}\\
&\le E_b(g) - E_b(g^*) && \text{(from equation (10)).}
\end{aligned}$$

The cost bound (Theorem 2) and the regret bound (Theorem 3) provide different guarantees for the reduction method. The former describes how the ordinal ranking cost is upper bounded by the binary classification error in an absolute sense, and the latter describes the upper bound in a relative sense.

**4.3** **Equivalence between Ordinal Ranking and Binary Classification**

The results above suggest that ordinal ranking can be reduced to binary classification without any loss of optimality. That is, ordinal ranking is “no harder than” binary classification. Intuitively, binary classification is also “no harder than” ordinal ranking, because the former is a special case of the latter with $K = 2$. Next, we formalize the notion of hardness with the *probably approximately correct (PAC)* setup in computational learning theory (Kearns and Vazirani, 1994) and prove that ordinal ranking and binary classification are indeed equivalent in hardness. We use the following definition of PAC in our coming theorems (Valiant, 1984; Kearns and Vazirani, 1994).

**Definition 1. In cost-sensitive classification, a learning model** $G$ *is efficiently PAC-learnable (using the same representation class) if there exists a (possibly randomized) learning algorithm* $A$ *satisfying the following property: for every distribution* $P(x, y, c)$ *being considered, where*

$$c\big[g^*(x)\big] = c[y] = c_{\min} = 0,$$

*with some* $g^* \in G$*; for all* $0 < \epsilon$ *and* $0 < \delta < \frac{1}{2}$*, if* $A$ *is given access to an oracle generating examples* $(x, y, c)$ *from* $P(x, y, c)$*, as well as inputs* $\epsilon$ *and* $\delta$*, then* $A$ *outputs* $g \in G$ *such that* $E(g) \le \epsilon$ *with probability at least* $1 - \delta$*, with running time polynomial in* $\frac{1}{\epsilon}$ *and* $\frac{1}{\delta}$*.*

Briefly speaking, the definition assumes that the target function $g^*$ is within the learning model $G$ and is of cost $0$ (the minimum cost). In other words, it is the noiseless setup of learning. We shall focus only on this case while pointing out that similar results can also be proved for the noisy setup (Lin, 2008).

**Theorem 4 (Equivalence theorem of the reduction framework). Consider a learning** *model* $R$ *for ordinal ranking, its associated learning model* $G = \{g_r : r \in R\}$ *for binary classification, and distributions* $P(x, y, c)$ *such that all cost vectors* $c$ *are V-shaped. Then,* $R$ *is efficiently PAC-learnable if and only if* $G$ *is efficiently PAC-learnable.*

*Proof.* If $G$ is efficiently PAC-learnable using algorithm $A_G$, we can convert $A_G$ to an efficient algorithm $A_R$ for ordinal ranking as follows.

1. Transform the oracle that generates $(x, y, c)$ from $P(x, y, c)$ to an oracle that generates $\big(X^{(k)}, Y^{(k)}, W^{(k)}\big)$ by picking $k$ uniformly and applying (4).

2. Run $A_G$ with the transformed oracle until it outputs some $g\big(X^{(k)}\big)$.

3. Return $r_g$.

It is not hard to see that $A_R$ is as efficient as $A_G$, and the cost guarantee comes from Theorem 2 using the fact that the classifiers $g_r$ are all rank-monotonic.

Now we consider the other direction. If $R$ is efficiently PAC-learnable using algorithm $A_R$, we can convert $A_R$ to an efficient algorithm $A_G$ for binary classification.