Reduction from Cost-sensitive Ordinal Ranking to Weighted Binary Classification
Hsuan-Tien Lin¹ and Ling Li²
¹ Department of Computer Science, National Taiwan University.
² Department of Computer Science, California Institute of Technology.
Keywords: cost-sensitive, ordinal ranking, binary classification, reduction
Abstract
We present a reduction framework from ordinal ranking to binary classification. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranker from the binary classifier. Based on the framework, we show that a weighted 0/1 loss of the binary classifier upper-bounds the mislabeling cost of the ranker, both error-wise and regret-wise. Our framework allows us not only to design good ordinal ranking algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal ranking from known bounds for binary classification. In addition, our framework unifies many existing ordinal ranking algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms. In addition, the newly designed algorithms lead to better cost-sensitive ordinal ranking performance as well as improved listwise ranking performance.
1 Introduction
We work on a supervised learning problem called ordinal ranking, which is also referred to as ordinal regression (Chu and Keerthi, 2007) or ordinal classification (Frank and Hall, 2001). For instance, the rating that a customer gives to a movie might be one of do-not-bother, only-if-you-must, good, very-good, and run-to-see. Those ratings are called the ranks, which can be represented by ordinal class labels like $\{1, 2, 3, 4, 5\}$. The ordinal ranking problem is closely related to multi-class classification and metric regression. It differs from the former because of the ordinal information encoded in the ranks, and from the latter because of the lack of a metric distance between the ranks. Since rank is a natural representation of human preferences, the problem lends itself to many applications in social science and information retrieval (Liu, 2009).
Many ordinal ranking algorithms have been proposed from a machine learning perspective in recent years. For instance, Herbrich et al. (2000) designed an approach with support vector machines based on comparing training examples in a pairwise manner. Har-Peled et al. (2003) proposed a constraint classification approach that works with any binary classifiers based on the pairwise comparison framework. Nevertheless, such a pairwise comparison perspective may not be suitable for large-scale learning because the size of the associated optimization problem can be large. In particular, for an ordinal ranking problem with $N$ examples, if at least two of the ranks are supported by $\Omega(N)$ examples (which is quite common in practice), the size of the pairwise learning problem is quadratic in $N$.
There are some other approaches that do not lead to such a quadratic expansion. For instance, Crammer and Singer (2005) generalized the online perceptron algorithm with multiple thresholds to do ordinal ranking. In their approach, a perceptron maps an input vector to a latent potential value, which is then thresholded to obtain a rank. Shashua and Levin (2003) proposed new support vector machine (SVM) formulations to handle multiple thresholds, and some other SVM formulations were studied by Rajaram et al. (2003); Chu and Keerthi (2007); Cardoso and da Costa (2007). All these algorithms share a common property: they are modified from well-known binary classification approaches. Still other approaches fall into neither of the perspectives above, such as Gaussian process ordinal regression (Chu and Ghahramani, 2005).
Since binary classification is much better studied than ordinal ranking, a general framework to systematically reduce the latter to the former can introduce two immediate benefits. First, well-tuned binary classification approaches can be readily transformed into good ordinal ranking algorithms, which saves immense efforts in design and implementation. Second, new generalization bounds for ordinal ranking can be easily derived from known bounds for binary classification, which saves tremendous efforts in theoretical analysis.
In this paper, we propose such a reduction framework. The framework is based on extended binary examples, which are extracted from the original ordinal ranking examples. The binary classifier trained from the extended binary examples can then be used to construct a ranker. We prove that the mislabeling cost of the ranker is bounded by a weighted 0/1 loss of the binary classifier. Furthermore, we prove that the mislabeling regret of the ranker is bounded by the regret of the binary classifier as well. Hence, binary classifiers that generalize well could introduce rankers that generalize well. The advantages of the framework in algorithmic design and in theoretical analysis are both demonstrated in the paper. In addition, we show that our framework provides a unified view for many existing ordinal ranking algorithms. The experiments on some benchmark data sets validate the usefulness of the framework in practice, both for improving cost-sensitive ordinal ranking performance and for helping improve other ranking criteria.
The paper is organized as follows. We introduce the ordinal ranking problem in Section 2. Some related works are discussed in Section 3. We illustrate our reduction framework in Section 4. The algorithmic and theoretical usefulness of the framework is shown in Section 5. Finally, we present experimental results in Section 6, and conclude in Section 7.
A short version of the paper appeared in the 2006 Conference on Neural Information Processing Systems (Li and Lin, 2007b). The paper was then enriched by the more general cost-sensitive setting as well as the deeper theoretical results that were revealed in the 2009 Preference Learning Workshop (Lin and Li, 2009). For completeness, selected results from an earlier conference work (Lin and Li, 2006) are included in Subsection 5.2. The works above are also parts of the first author’s Ph.D. thesis (Lin, 2008). In addition to the results that have been published in the conferences, we pointed out some important properties of ordinal ranking in Section 2, added a detailed literature discussion in Section 3, showed deeper theoretical results on the equivalence between ordinal ranking and binary classification in Section 4, clarified the differences of different SVM-based ordinal ranking algorithms in Section 5 and strengthened the experimental results to emphasize the usefulness of cost-sensitive ordinal ranking in Section 6.
2 Ordinal Ranking Setup
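The ordinal ranking problem aims at predicting the rank $y$ of some input vector $\mathbf{x}$, where $\mathbf{x}$ is in an input space $\mathcal{X} \subseteq \mathbb{R}^D$ and $y$ is in a label space $\mathcal{Y} = \{1, 2, \ldots, K\}$. A function $r \colon \mathcal{X} \to \mathcal{Y}$ is called an ordinal ranker, or a ranker in short. We shall adopt the cost-sensitive setting (Abe et al., 2004; Lin, 2008), in which a cost vector $\mathbf{c} \in \mathbb{R}^K$ is generated with $(\mathbf{x}, y)$ from some fixed but unknown distribution $P(\mathbf{x}, y, \mathbf{c})$ on $\mathcal{X} \times \mathcal{Y} \times \mathbb{R}^K$. The $k$-th element $\mathbf{c}[k]$ of the cost vector represents the penalty when predicting the input vector $\mathbf{x}$ as rank $k$. We naturally assume that $\mathbf{c}[k] \ge 0$ and $\mathbf{c}[y] = 0$. Thus, $y = \operatorname{argmin}_{1 \le k \le K} \mathbf{c}[k]$. An ordinal ranking problem comes with a given training set $S = \{(\mathbf{x}_n, y_n, \mathbf{c}_n)\}_{n=1}^{N}$, whose elements are drawn i.i.d. from $P(\mathbf{x}, y, \mathbf{c})$. The goal of the problem is to find a ranker $r$ such that its expected test cost
$$E(r) \equiv \mathcal{E}_{(\mathbf{x}, y, \mathbf{c}) \sim P}\; \mathbf{c}[r(\mathbf{x})]$$
is small.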
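As a small illustration of the quantity being minimized, the following sketch computes the empirical counterpart of the expected test cost on a finite sample. The function name `empirical_cost`, the toy data, and the constant ranker `always_two` are made up for illustration; costs are stored in a Python list so that the cost of rank k sits at index k - 1.

```python
def empirical_cost(ranker, examples):
    """Average cost c[r(x)] over examples given as (x, y, c) tuples.

    Ranks run from 1 to K, so the cost of predicting rank k is c[k - 1].
    """
    total = 0.0
    for x, y, c in examples:
        total += c[ranker(x) - 1]
    return total / len(examples)


# Toy data (hypothetical): K = 3 ranks with absolute costs |y - k|.
examples = [
    (0.1, 1, [0, 1, 2]),
    (0.5, 2, [1, 0, 1]),
    (0.9, 3, [2, 1, 0]),
]


def always_two(x):
    """A trivial ranker that predicts rank 2 for every input."""
    return 2


print(empirical_cost(always_two, examples))
```

The label y is carried along only for completeness; the cost vector alone determines the penalty, which matches the assumption that y is the argmin of the cost vector.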
The setting above looks similar to that of a cost-sensitive multiclass classification problem (Abe et al., 2004), in the sense that the label space $\mathcal{Y}$ is a finite set. Therefore, ordinal ranking is also called ordinal classification (Frank and Hall, 2001; Cardoso and da Costa, 2007). Nevertheless, in addition to representing the nominal categories (as the usual classification labels), the labels $y \in \mathcal{Y}$ also carry the ordinal information. That is, two different labels in $\mathcal{Y}$ can be compared by the usual “<” operation. Thus, those $y \in \mathcal{Y}$ are called the ranks to distinguish them from the usual classification labels.
Ordinal ranking is also similar to regression (for which $y \in \mathbb{R}$ instead of $\mathcal{Y}$), because the real values in $\mathbb{R}$ can be ordered by the usual “<” operation. Therefore, ordinal ranking is also popularly called ordinal regression (Herbrich et al., 2000; Shashua and Levin, 2003; Chu and Ghahramani, 2005; Chu and Keerthi, 2007; Xia et al., 2007). Nevertheless, unlike the real-valued regression labels $y \in \mathbb{R}$, the discrete ranks $y \in \mathcal{Y}$ do not carry metric information. For instance, we cannot say that a five-star movie is 2.5 times better than a two-star one; we also cannot compute the exact distance (difference) between a five-star movie and a one-star movie. In other words, the rank serves as a qualitative indication rather than a quantitative outcome. Thus, any monotonic transform of the label space should not alter the ranking performance. However, many regression algorithms depend on the assumed metric information and can be highly affected by monotonic transforms of the label space (which are equivalent to change-of-metric operations). Thus, those regression algorithms may not always perform well on ordinal ranking problems.
The ordinal information carried by the ranks introduces the following two properties, which are important for modeling ordinal ranking problems.
Closeness in the rank space $\mathcal{Y}$: The ordinal information suggests that the mislabeling cost depends on the “closeness” of the prediction. For example, predicting a two-star movie as a three-star one is less costly than predicting it as a five-star one. Hence, the cost vector should be V-shaped with respect to $y$ (Li and Lin, 2007b), that is,
$$\begin{cases} \mathbf{c}[k-1] \ge \mathbf{c}[k], & \text{for } 2 \le k \le y; \\ \mathbf{c}[k+1] \ge \mathbf{c}[k], & \text{for } y \le k \le K-1. \end{cases} \tag{1}$$
Briefly speaking, a V-shaped cost vector says that a ranker needs to pay more if its prediction on $\mathbf{x}$ is further away from $y$. We shall assume that every cost vector $\mathbf{c}$ generated from $P(\mathbf{c} \mid \mathbf{x}, y)$ is V-shaped with respect to $y = \operatorname{argmin}_{1 \le k \le K} \mathbf{c}[k]$. In other words, one can decompose $P(y, \mathbf{c} \mid \mathbf{x}) = P(y \mid \mathbf{c})\, P(\mathbf{c} \mid \mathbf{x})$, where $P(\mathbf{c} \mid \mathbf{x})$ is always V-shaped and $P(y \mid \mathbf{c})$ satisfies $y = \operatorname{argmin}_{1 \le k \le K} \mathbf{c}[k]$.
In some of our results, we need a stronger condition: the cost vectors should be convex (Li and Lin, 2007b), which is defined by the condition¹
$$\mathbf{c}[k+1] - \mathbf{c}[k] \ge \mathbf{c}[k] - \mathbf{c}[k-1], \quad \text{for } 2 \le k \le K-1. \tag{2}$$
When using convex cost vectors, a ranker needs to pay increasingly more if its prediction on $\mathbf{x}$ is further away from $y$. Provably, any convex cost vector $\mathbf{c}$ is V-shaped with respect to $y = \operatorname{argmin}_{1 \le k \le K} \mathbf{c}[k]$.
¹ When connecting the points $(k, \mathbf{c}[k])$ from a convex cost vector by line segments, it is not difficult to prove that the resulting curve is convex for $k \in [1, K]$.
The V-shaped and convex cost vectors are general choices that can be used to represent the ordinal nature of $\mathcal{Y}$. One popular cost vector that has been frequently used for ordinal ranking is the absolute cost vector, which accompanies $(\mathbf{x}, y)$ as
$$\mathbf{c}^{(y)}[k] = |y - k|, \quad 1 \le y \le K.$$
Because the absolute cost vectors come with the median function as their population minimizer (Dembczyński et al., 2008), they appear to be a natural choice for ordinal ranking, similar to how the traditional 0/1 loss is the most natural choice for classification. Nevertheless, our work aims at studying more flexible possibilities (costs) beyond the natural choice, similar to the more flexible weighted loss beyond the 0/1 one in weighted classification (Zadrozny et al., 2003). As we shall show later in Section 6, the flexible costs can be used to embed the desired structural information in $\mathcal{Y}$ for better test performance.
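The V-shaped condition (1) and the convexity condition (2) are easy to check mechanically. The sketch below is one possible way to do so; the helper names (`is_v_shaped`, `is_convex`, `absolute_cost`, `classification_cost`) are hypothetical, and the cost of rank k is assumed to be stored at list index k - 1.

```python
def is_v_shaped(c, y):
    """Condition (1): costs do not increase toward rank y, nor decrease after it."""
    K = len(c)
    # c[k-1] >= c[k] for 2 <= k <= y (1-indexed ranks)
    left = all(c[k - 2] >= c[k - 1] for k in range(2, y + 1))
    # c[k+1] >= c[k] for y <= k <= K-1 (1-indexed ranks)
    right = all(c[k] >= c[k - 1] for k in range(y, K))
    return left and right


def is_convex(c):
    """Condition (2): successive cost differences are non-decreasing."""
    K = len(c)
    return all(c[k] - c[k - 1] >= c[k - 1] - c[k - 2] for k in range(2, K))


def absolute_cost(y, K):
    """Absolute cost vector |y - k| for k = 1..K."""
    return [abs(y - k) for k in range(1, K + 1)]


def classification_cost(y, K):
    """0/1 classification cost vector: 0 at rank y, 1 elsewhere."""
    return [0 if k == y else 1 for k in range(1, K + 1)]


c_abs = absolute_cost(2, 5)        # [1, 0, 1, 2, 3]
c_cls = classification_cost(2, 5)  # [1, 0, 1, 1, 1]
print(is_v_shaped(c_abs, 2), is_convex(c_abs))
print(is_v_shaped(c_cls, 2), is_convex(c_cls))
```

On this particular instance the absolute cost passes both checks, while the classification cost is V-shaped but fails the convexity check, which illustrates that convexity is the strictly stronger requirement.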
Comparability in the input space $\mathcal{X}$: Note that the classification cost vectors
$$\mathbf{c}^{(\ell)}[k] = [\![ \ell \ne k ]\!], \quad 1 \le \ell \le K,$$
which check whether the predicted rank $k$ is exactly the same as the desired rank $\ell$, are also V-shaped.² If those cost vectors are used, an immediate question is: What distinguishes ordinal ranking from common multiclass classification?
² $[\![ \cdot ]\!]$ is 1 if the inner condition is true, and 0 otherwise.
Let $r_*$ denote the optimal ranker with respect to $P(\mathbf{x}, y, \mathbf{c})$. Note that $r_*$ introduces a total preorder in $\mathcal{X}$ (Herbrich et al., 2000). That is,
$$\mathbf{x} \preceq \mathbf{x}' \iff r_*(\mathbf{x}) \le r_*(\mathbf{x}').$$
The total preorder allows us to naturally group and compare vectors in the input space $\mathcal{X}$. For instance, a two-star movie is “worse than” a three-star one, which is in turn “worse than” a four-star one; movies of less than three stars are “worse than” movies of at least three stars.
The simplicity of the grouping and the comparison distinguishes ordinal ranking from multiclass classification. For instance, when classifying movies, it is difficult to group {action movies, romantic movies} and compare them with {comic movies, thriller movies}, but “movies of less than three stars” can be naturally compared with “movies of at least three stars.”
The comparability property connects ordinal ranking to monotonic classification (Sill, 1998; Kotłowski and Słowiński, 2009), which is also referred to as ordinal classification with the monotonicity constraints and is an important problem on its own. Monotonic classification models the ordinal ranking problem by assuming that an explicit order in the input space (such as the value-order of one particular feature) can be used to directly (and monotonically) infer the order of the ranks in the output space ($y \le y'$). In other words, monotonic classification allows putting thresholds on the explicit order to perform ordinal ranking. The comparability property shows that there is an order (total preorder) introduced by the ranks. Nevertheless, the order is not always “explicit” in general ordinal ranking problems. Therefore, many of the existing ordinal ranking approaches, such as the thresholded model that will be discussed in Section 5, seek the implicit order through transforming the input vectors before respecting the monotonic nature between the implicit order and the order of the ranks.
In Table 1, we summarize four different learning problems in terms of their comparability and closeness properties.
Table 1: properties of different learning problems
comparability    weak closeness                        strong closeness
                 (classification cost vectors)         (other V-shaped cost vectors)
yes              degenerate ordinal ranking            usual ordinal ranking
no               multiclass classification             special cases of cost-sensitive classification
As discussed, usual ordinal ranking problems come with strong closeness in $\mathcal{Y}$ (which is represented by V-shaped cost vectors) and simple comparability in $\mathcal{X}$. The classification cost vectors can be viewed as degenerate V-shaped cost vectors, and hence introduce degenerate ordinal ranking problems.
Multiclass classification problems, on the other hand, do not allow examples of different classes to be naturally grouped and compared. If we want to use cost vectors other than the classification ones, we move to special cases of cost-sensitive classification. For instance, when trying to recognize digits $\{0, 1, \ldots, 9\}$ for written checks, a possible cost is the absolute one (to represent monetary differences) rather than simply right or wrong (the classification cost). The absolute cost is V-shaped and convex. Nevertheless, the digits intuitively cannot be grouped and compared, and hence the recognition problem belongs to cost-sensitive multiclass classification rather than ordinal ranking (Lin, 2008).
From the discussions above, a good ordinal ranking algorithm should appropriately use the comparability property. In Section 4, we will show how the property serves as a key to derive our proposed reduction framework.
3 Related Literature
The analysis of ordinal data has been studied in statistics by defining a suitable link function that models the underlying probability for generating the ordinal labels (Anderson, 1984). For instance, one popular model is the cumulative link model (Agresti, 2002) that will be discussed in Section 5. Similar models can be traced back to the work of McCullagh (1980). The many earlier works in statistics, which usually focus on the effectiveness and efficiency of the modeling, influence the ordinal ranking studies in machine learning (Herbrich et al., 2000), including our work. Another related area that studies the analysis of ordinal data is operations research, especially the subarea of multi-criteria decision analysis (Greco et al., 2000; Figueira et al., 2005), which contains many works that focus on reasonable decision making with ordinal preference scales. Our work tackles ordinal ranking problems from the machine learning perspective (improving the test performance) and is hence different from the works that take the perspective of statistics or operations research.
In machine learning (and information retrieval), there are three major families of ranking algorithms: pointwise, pairwise and listwise (Liu, 2009). The ordinal ranking setup presented in Section 2 belongs to pointwise ranking. Next, we discuss some representative algorithms in each family and relate them to the ordinal ranking setup. Then, we compare the proposed reduction framework with other reduction-based approaches for ranking.
3.1 Families of Ranking Algorithms
Pointwise ranking. Pointwise ranking aims at predicting the relevance of some input vector $\mathbf{x}$ using either real-valued scores or ordinal-valued ranks. It does not directly use the comparison nature of ranking.
The ordinal ranking algorithms studied in this paper focus on computing ordinal-valued ranks for pointwise ranking. For obtaining real-valued scores, a fundamental tool is traditional least-squares regression (Hastie et al., 2001). As discussed in Section 2, however, when the given examples come with ordinal labels, the ordinal ranking algorithms studied in this paper can be more useful than traditional regression by taking the metric-less nature of labels into account.
Pairwise ranking. Pairwise ranking aims at predicting the relative order between two input vectors $\mathbf{x}$ and $\mathbf{x}'$ and thus captures the local comparison nature of ranking. It is arguably one of the most widely used techniques in the ranking family and is usually cast as a binary classification problem of predicting whether $\mathbf{x}$ is preferred over $\mathbf{x}'$. During training, such a problem translates to comparing all pairs of $(\mathbf{x}_n, \mathbf{x}_m)$ based on their corresponding labels. One representative pairwise ranking algorithm is RankSVM (Herbrich et al., 2000), which trains an underlying support vector machine using those pairs. RankSVM was initially proposed for data sets that come with ordinal labels, but is also commonly applied to data sets that come with real-valued labels.
Note that even when all the labels take ordinal values, as long as two of the classes contain $\Omega(N)$ examples, there are $\Omega(N^2)$ pairs. Such a quadratic number of pairs makes it difficult to scale up general pairwise ranking algorithms, except in special cases like the linear support vector machine (Joachims, 2006) or RankBoost (Herbrich et al., 2000; Lin and Li, 2006). Thus, when the training set is large and contains ordinal labels, the ordinal ranking algorithms studied in this paper may serve as a useful alternative to pairwise ranking ones.
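To make the quadratic growth concrete, the following sketch counts the pairs whose labels differ and compares the count with the N(K - 1) extended examples that the reduction of Section 4 generates. The function names are hypothetical, and restricting attention to pairs with different labels is an assumption made here for illustration.

```python
from collections import Counter


def num_comparable_pairs(labels):
    """Number of unordered pairs (n, m) whose ordinal labels differ."""
    N = len(labels)
    counts = Counter(labels)
    same_label_pairs = sum(v * (v - 1) // 2 for v in counts.values())
    return N * (N - 1) // 2 - same_label_pairs


def num_extended_examples(labels, K):
    """Size of the extended set: K - 1 binary examples per original example."""
    return len(labels) * (K - 1)


# Two ranks, each supported by half of N = 100 examples.
labels = [1] * 50 + [2] * 50
print(num_comparable_pairs(labels))      # 50 * 50 pairs
print(num_extended_examples(labels, 2))  # 100 * (2 - 1) extended examples
```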
Listwise ranking. Listwise ranking aims at ordering a whole finite set of input vectors $S' = \{\mathbf{x}'_m\}_{m=1}^{M}$. In particular, the (listwise) ranker tries to minimize the inconsistency between the predicted permutation and the ground truth permutation of $S'$ (Liu, 2009). Listwise ranking captures the global comparison nature of ranking. One representative listwise ranking algorithm is ListNet (Cao et al., 2007), which is based on an underlying neural network model along with an estimated distribution of all possible permutations (rankers). Nevertheless, there are $M!$ permutations for a given $S'$. Thus, listwise ranking can be computationally even more expensive than pairwise ranking.
Many listwise ranking algorithms try to alleviate the computational burden by keeping some internal pointwise rankers. For instance, ListNet uses the underlying neural network to score each instance (Cao et al., 2007) for the purpose of permutation. The use of internal pointwise rankers for listwise ranking further justifies the importance of better understanding pointwise ranking, including the ordinal ranking algorithms studied in this paper.
3.2 Reduction Approaches for Ranking
Because ranking is a relatively new and diverse problem in machine learning, many existing ranking approaches try to reduce the ranking problem to other learning problems. Next, we discuss some existing reduction-based approaches that are related to the framework proposed in this paper.
From pairwise ranking to binary classification. Balcan et al. (2007) propose a robust reduction from bipartite (i.e., ordinal with two outcomes) pairwise ranking to binary classification. The training part of the reduction works like usual pairwise ranking: learning a binary classifier on whether $\mathbf{x}$ is preferred over $\mathbf{x}'$. The prediction part of the reduction asks the underlying binary classifier to vote for each example in the test set in order to rank those examples. The reduction is simple but yields solid theoretical guarantees. In particular, for ranking $M$ test examples, the reduction uses $\Omega(M^2)$ calls to the binary classifier and transforms a binary classification regret of $r$ to a bipartite ranking regret (measured by the so-called AUC criterion) of at most $2r$.
Ailon and Mohri (2008) improve the reduction of Balcan et al. (2007) and propose a more efficient reduction from general pairwise ranking to binary classification. The prediction part of the reduction operates by taking the underlying binary classifier as the comparison function of the popular QuickSort algorithm. In the special bipartite ranking case, for ranking $M$ examples, the reduction uses $O(M \log M)$ calls to the binary classifier on average and transforms a binary classification regret of $r$ to a bipartite ranking regret of at most $2r$.
From listwise ranking to regression (pointwise ranking). The Subset Ranking (Cossock and Zhang, 2008) algorithm can be viewed as a reduction from listwise ranking to regression. In particular, Cossock and Zhang (2008) prove that regression with various cost functions can be used to approximate a Bayes optimal listwise ranker. In other words, low-regret regressors can be cast as low-regret listwise rankers.
From listwise ranking to ordinal (pointwise) ranking. McRank (Li et al., 2008) is a reduction from listwise ranking to ordinal ranking with the classification cost. The main theoretical justification of the reduction shows that a scaled classification cost of an ordinal ranker can upper bound the regret of the associated listwise ranker. That is, low-error ordinal rankers can be cast as low-regret listwise rankers. Li et al. (2008) empirically verified that McRank can perform better than the Subset Ranking algorithm (Cossock and Zhang, 2008).
From ordinal ranking to binary classification. The framework proposed in this paper and in the associated shorter version (Li and Lin, 2007b) is a reduction from ordinal ranking to binary classification. We will show that the reduction is both error and regret preserving. That is, low-error binary classifiers can be cast as low-error ordinal rankers; low-regret binary classifiers can be cast as low-regret ordinal rankers.
The data replication method, which was independently proposed by Cardoso and da Costa (2007), is a similar but more restricted case of the reduction framework. The data replication method essentially considers the absolute cost. In addition, the focus of the data replication method (Cardoso and da Costa, 2007) is on explaining the training procedure of the reduction. The proposed framework in this paper is more general than the data replication method in terms of the cost considered as well as the deeper theoretical analysis on both the training and the test performance of the reduction.
Table 2: comparison of general reductions from ranking to binary classification
reduction                  size of transformed set    # calls to binary classifiers    evaluation criterion
                           during training            during prediction
the proposed framework     O(KN)                      O(KM)                            ranking cost
(Balcan et al., 2007)      O(N^2)                     O(M^2)                           AUC
(Ailon and Mohri, 2008)    O(N^2)                     O(M log M)                       AUC
The proposed reduction framework for pointwise ranking and existing reductions in pairwise ranking (Balcan et al., 2007; Ailon and Mohri, 2008) take very different views on the ranking problem and consider different evaluation criteria. As a consequence, when learning $N$ examples and ranking (predicting on) $M$ instances with $K$ ordinal scales, the proposed framework results in a transformed training set of size $O(KN)$ and a prediction procedure with time complexity $O(KM)$. Both the size of the training set and the time complexity of the prediction procedure are smaller than those of the state-of-the-art reduction from pairwise ranking to binary classification (Ailon and Mohri, 2008), as shown in Table 2.
Note that the work of Li et al. (2008) revealed an opportunity to use the discrete nature of ordinal-valued labels to improve the listwise ranking performance over Subset Ranking when using a heuristic ordinal ranking algorithm. The proposed framework is a more rigorous study on ordinal ranking that can be coupled with McRank to yield a reduction from listwise ranking to binary classification, which allows state-of-the-art binary classification algorithms to be efficiently used for listwise ranking. We will demonstrate the use of this opportunity in Subsection 6.4.
4 Reduction Framework
We will first introduce the details of our proposed reduction framework. Then, we will demonstrate its theoretical guarantees. Consider, for instance, that we want to know how good a movie $\mathbf{x}$ is. Using the comparability property of ordinal ranking, we can then ask the associated question “is the rank of $\mathbf{x}$ greater than $k$?”
For a given $k$, such a question is exactly a binary classification problem, and the rank of $\mathbf{x}$ can be determined by asking multiple questions for $k = 1, 2, \ldots, K-1$. The questions are the core of the Dominance-based Rough Set Approach in operations research for reasoning from ordinal data (Słowiński et al., 2007). From the machine learning perspective, Frank and Hall (2001) proposed to solve each binary classification problem independently and combine the binary outputs to a rank. Although their approach is simple, the generalization performance using the combination step cannot be easily analyzed.
The proposed reduction framework works differently. First, a simpler step is used to convert the binary outputs to a rank, and generalization analysis can immediately follow. Moreover, all the binary classification problems are solved jointly to obtain a single binary classifier.
Assume that $g(\mathbf{x}, k)$ is the single binary classifier that provides answers to all the associated questions above. Consistent answers would be $g(\mathbf{x}, k) = +1$ (“yes”) for $k = 1$ until $(\ell - 1)$ for some $\ell$, and $-1$ (“no”) afterward. Then, a reasonable ranker based on the binary answers is $r_g(\mathbf{x}) = \ell = 1 + \max\{k \colon g(\mathbf{x}, k) = +1\}$, where the maximum is taken as $0$ when no answer is $+1$. Equivalently,
$$r_g(\mathbf{x}) \equiv 1 + \sum_{k=1}^{K-1} [\![ g(\mathbf{x}, k) > 0 ]\!]. \tag{3}$$
The binary classifier $g$ that only produces consistent answers would be called rank-monotonic.³
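The construction in (3) amounts to counting the “yes” answers. A minimal sketch, assuming that the binary classifier is available as a Python callable `g(x, k)` returning +1 or -1; both example classifiers below are hypothetical and serve only to show a rank-monotonic case and an inconsistent one.

```python
def rank_from_binary(g, x, K):
    """Equation (3): one plus the number of k in 1..K-1 with g(x, k) > 0."""
    return 1 + sum(1 for k in range(1, K) if g(x, k) > 0)


def g_threshold(x, k, thresholds=(0.25, 0.5, 0.75)):
    """Hypothetical rank-monotonic classifier: compares a 1-D score with ordered thresholds."""
    return 1 if x > thresholds[k - 1] else -1


def g_inconsistent(x, k):
    """Hypothetical classifier whose answers (+1, -1, +1) are not rank-monotonic."""
    return (+1, -1, +1)[k - 1]


print(rank_from_binary(g_threshold, 0.6, 4))     # two "yes" answers, so rank 3
print(rank_from_binary(g_inconsistent, 0.6, 4))  # (3) still applies: rank 3
```

Because (3) only counts positive answers, it remains well defined for inconsistent answers, although the resulting rank is then harder to interpret.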
³ Although (3) can be flexibly applied even when $g$ is not rank-monotonic, a rank-monotonic $g$ is usually desired in order to introduce a good ranker $r_g$.
For any ordinal example $(\mathbf{x}, y, \mathbf{c})$, we can define the extended binary examples $(X^{(k)}, Y^{(k)})$ with weights $W^{(k)}$ as
$$X^{(k)} = (\mathbf{x}, k), \quad Y^{(k)} = 2\,[\![ k < y ]\!] - 1, \quad W^{(k)} = (K-1)\,\bigl|\mathbf{c}[k] - \mathbf{c}[k+1]\bigr|. \tag{4}$$
The extended input vector $X^{(k)}$ represents the associated question “is the rank of $\mathbf{x}$ greater than $k$?”; the binary label $Y^{(k)}$ represents the desired answer to the question; the weight $W^{(k)}$ represents the importance of the question and will be used in the coming theoretical analysis. Here $X^{(k)}$ stands for an abstract pair and we will discuss its practical encoding in Section 5. If $g(X^{(k)}) \equiv g(\mathbf{x}, k)$ makes no errors on all the associated questions, $r_g(\mathbf{x})$ equals $y$ by (3). That is, $\mathbf{c}[r_g(\mathbf{x})] = 0$. In the following theorem, we further connect $\mathbf{c}[r_g(\mathbf{x})]$ to the amount of error that $g$ makes.
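The extraction in (4) can be sketched as follows for a single ordinal example. The function name `extended_examples` and the cost vector used below are illustrative assumptions; the extended input is kept as the abstract pair (x, k), since its concrete encoding is deferred to Section 5.

```python
def extended_examples(x, y, c):
    """Equation (4): build (X, Y, W) for k = 1..K-1; c[k - 1] is the cost of rank k."""
    K = len(c)
    out = []
    for k in range(1, K):
        X = (x, k)                          # abstract pair "is rank(x) > k?"
        Y = 1 if k < y else -1              # desired answer to the question
        W = (K - 1) * abs(c[k - 1] - c[k])  # importance of the question
        out.append((X, Y, W))
    return out


# Hypothetical example: true rank 3 out of K = 5 with squared costs (y - k)^2.
for X, Y, W in extended_examples("movie", 3, [4, 1, 0, 1, 4]):
    print(X, Y, W)
```

Questions whose neighboring costs differ more receive larger weights, so mistakes on them contribute more to the weighted 0/1 loss.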
Theorem 1 (Per-example cost bound). For any ordinal example $(\mathbf{x}, y, \mathbf{c})$, where $\mathbf{c}$ is V-shaped and $\mathbf{c}[y] = 0$, consider its associated extended binary examples $(X^{(k)}, Y^{(k)}, W^{(k)})$ in (4). Assume that the ranker $r_g$ is constructed from a binary classifier $g$ using (3). If $g(X^{(k)})$ is rank-monotonic or if $\mathbf{c}$ is convex, then
$$\mathbf{c}[r_g(\mathbf{x})] \le \frac{1}{K-1} \sum_{k=1}^{K-1} W^{(k)} [\![ Y^{(k)} \ne g(X^{(k)}) ]\!]. \tag{5}$$
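Proof. Because $g$ is rank-monotonic, $g(X^{(k)}) = +1$ for $k < r_g(\mathbf{x})$ and $g(X^{(k)}) = -1$ for $k \ge r_g(\mathbf{x})$. Thus, the cost that the ranker $r_g$ needs to pay is
$$\mathbf{c}[r_g(\mathbf{x})] = \sum_{k=r_g(\mathbf{x})}^{K-1} \bigl(\mathbf{c}[k] - \mathbf{c}[k+1]\bigr) + \mathbf{c}[K] = \sum_{k=1}^{K-1} \bigl(\mathbf{c}[k] - \mathbf{c}[k+1]\bigr) [\![ g(X^{(k)}) < 0 ]\!] + \mathbf{c}[K]. \tag{6}$$
Because the cost vector $\mathbf{c}$ is V-shaped, $Y^{(k)}$ equals the sign of $(\mathbf{c}[k] - \mathbf{c}[k+1])$ if the latter is not zero. Continuing from (6) with $\mathbf{c}[y] = 0$,
$$\begin{aligned} (K-1)\,\mathbf{c}[r_g(\mathbf{x})] &= \sum_{k=1}^{y-1} W^{(k)} Y^{(k)} [\![ g(X^{(k)}) < 0 ]\!] + (K-1)\,\mathbf{c}[K] + \sum_{k=y}^{K-1} W^{(k)} Y^{(k)} \bigl(1 - [\![ g(X^{(k)}) > 0 ]\!]\bigr) \\ &= \sum_{k=1}^{y-1} W^{(k)} [\![ Y^{(k)} \ne g(X^{(k)}) ]\!] + (K-1)\,\mathbf{c}[y] + \sum_{k=y}^{K-1} W^{(k)} [\![ Y^{(k)} \ne g(X^{(k)}) ]\!] \\ &= \sum_{k=1}^{K-1} W^{(k)} [\![ Y^{(k)} \ne g(X^{(k)}) ]\!]. \end{aligned} \tag{7}$$
When $g$ is not rank-monotonic but the cost vector $\mathbf{c}$ is convex, equation (7) becomes an inequality that could be alternatively proved by replacing (6) with
$$\sum_{k=r_g(\mathbf{x})}^{K-1} \bigl(\mathbf{c}[k] - \mathbf{c}[k+1]\bigr) \le \sum_{k=1}^{K-1} \bigl(\mathbf{c}[k] - \mathbf{c}[k+1]\bigr) [\![ g(X^{(k)}) < 0 ]\!].$$
The inequality above holds because $(\mathbf{c}[k] - \mathbf{c}[k+1])$ is decreasing due to the convexity, and there are exactly $(r_g(\mathbf{x}) - 1)$ zeros and $(K - r_g(\mathbf{x}))$ ones in the values of $[\![ g(X^{(k)}) < 0 ]\!]$ according to (3).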
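The statement of the per-example cost bound can be checked numerically on small cases. The sketch below enumerates every possible vector of binary answers for a given ordinal example and verifies the bound; all function names are hypothetical, and the two cost vectors are chosen here only for illustration.

```python
from itertools import product


def rank_from_answers(answers):
    """Equation (3) applied to a tuple of answers g(x, 1), ..., g(x, K-1)."""
    return 1 + sum(1 for a in answers if a > 0)


def bound_rhs(answers, y, c):
    """Right-hand side of (5): weighted 0/1 loss divided by K - 1."""
    K = len(c)
    total = 0.0
    for k in range(1, K):
        Y = 1 if k < y else -1
        W = (K - 1) * abs(c[k - 1] - c[k])
        if answers[k - 1] != Y:
            total += W
    return total / (K - 1)


def is_rank_monotonic(answers):
    """True when no -1 answer is followed by a +1 answer."""
    return all(not (a < 0 and b > 0) for a, b in zip(answers, answers[1:]))


def check(y, c):
    """Return (bound holds for all answers, equality holds for rank-monotonic answers)."""
    K = len(c)
    holds, tight = True, True
    for answers in product((+1, -1), repeat=K - 1):
        cost = c[rank_from_answers(answers) - 1]
        rhs = bound_rhs(answers, y, c)
        holds = holds and cost <= rhs + 1e-12
        if is_rank_monotonic(answers):
            tight = tight and abs(cost - rhs) < 1e-12
    return holds, tight


print(check(2, [1, 0, 1, 2]))  # convex (absolute) cost
print(check(2, [1, 0, 1, 1]))  # V-shaped but non-convex (classification) cost
```

On these two instances, the convex cost satisfies the bound for every answer vector, whereas the classification cost satisfies it (with equality) only for rank-monotonic answers: the inconsistent answers (+1, -1, +1) give rank 3 with cost 1 but a weighted loss of 0. This is consistent with the two alternative conditions in the theorem.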
We call (5) the per-example cost bound, which says that if $g$ makes only a small amount of error on the extended binary examples $(X^{(k)}, Y^{(k)}, W^{(k)})$, then $r_g$ is guaranteed to only pay a small amount of cost on the original example $(\mathbf{x}, y, \mathbf{c})$. The bound allows us to derive the following reduction method, which is composed of three stages: preprocessing, training, and prediction.
Algorithm 1 (Reduction to extended binary classification).
1. Preprocessing: For each original training example $(\mathbf{x}_n, y_n, \mathbf{c}_n) \in S$ and for each $k = 1, 2, \ldots, K-1$, generate an extended training example $(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)})$ and include it in $S_E$, where
$$X_n^{(k)} = (\mathbf{x}_n, k), \quad Y_n^{(k)} = 2\,[\![ k < y_n ]\!] - 1, \quad W_n^{(k)} = (K-1)\,\bigl|\mathbf{c}_n[k] - \mathbf{c}_n[k+1]\bigr|.$$
2. Training: Use a binary classification algorithm on $S_E$ and get a binary classifier $g$ on a concrete encoding (to be discussed in Section 5) of $\mathcal{X} \times \{1, 2, \ldots, K-1\}$. Let $g(\mathbf{x}, k) \equiv g(X^{(k)})$.
3. Prediction: For any $\mathbf{x} \in \mathcal{X}$, estimate its rank with (3).
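The three stages above can be sketched end to end as follows. This is only an illustration under strong assumptions: the inputs are one-dimensional, and the binary learner is a hypothetical weighted threshold learner that treats k as a lookup key, which is not the encoding discussed in Section 5. Any weighted binary classification algorithm could take its place.

```python
def preprocess(S, K):
    """Stage 1: build the extended set S_E from S = [(x, y, c), ...]."""
    SE = []
    for x, y, c in S:
        for k in range(1, K):
            Y = 1 if k < y else -1
            W = (K - 1) * abs(c[k - 1] - c[k])
            SE.append(((x, k), Y, W))
    return SE


def train_thresholds(SE, K):
    """Stage 2 (hypothetical learner): for each k, pick the threshold t_k that
    minimizes the weighted 0/1 loss of g(x, k) = +1 if x > t_k else -1."""
    thresholds = {}
    for k in range(1, K):
        rows = [(X[0], Y, W) for X, Y, W in SE if X[1] == k]
        candidates = [float("-inf")] + sorted({x for x, _, _ in rows})

        def loss(t):
            return sum(W for x, Y, W in rows if (1 if x > t else -1) != Y)

        thresholds[k] = min(candidates, key=loss)

    def g(x, k):
        return 1 if x > thresholds[k] else -1

    return g


def predict(g, x, K):
    """Stage 3: equation (3)."""
    return 1 + sum(1 for k in range(1, K) if g(x, k) > 0)


# Toy training set (made up): K = 3 ranks with absolute costs.
K = 3
S = [
    (0.1, 1, [0, 1, 2]),
    (0.2, 1, [0, 1, 2]),
    (0.5, 2, [1, 0, 1]),
    (0.6, 2, [1, 0, 1]),
    (0.9, 3, [2, 1, 0]),
]
SE = preprocess(S, K)
g = train_thresholds(SE, K)
print([predict(g, x, K) for x, _, _ in S])
```

On this separable toy set the learned thresholds reproduce the training ranks. The design point worth noting is that the ranker never sees ranks directly; it only aggregates answers to the K - 1 binary questions.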
4.1 Cost Bound of the Reduction Framework
Consider the following probability distribution $P_b(X^{(k)}, Y^{(k)}, W^{(k)})$ that generates the extended binary examples.
1. Draw a tuple $(\mathbf{x}, y, \mathbf{c})$ independently from $P(\mathbf{x}, y, \mathbf{c})$ and draw $k$ uniformly from the set $\{1, 2, \ldots, K-1\}$.
2. Generate $(X^{(k)}, Y^{(k)}, W^{(k)})$ by (4).
The extended training set $S_E$ contains examples that are equivalent (in terms of expectation) to examples drawn independently from $P_b(X^{(k)}, Y^{(k)}, W^{(k)})$. For any given binary classifier $g$, define its out-of-sample error with respect to $P_b$ as
$$E_b(g) \equiv \mathcal{E}_{(X, Y, W) \sim P_b}\; W\, [\![ Y \ne g(X) ]\!].$$
Using the definitions above, we can prove the first theoretical guarantee of the reduction framework.
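Theorem 2 (Cost bound of the reduction framework). Consider a ranker $r_g$ that is constructed from a binary classifier $g$ using (3). Assume that $\mathbf{c}$ is V-shaped and $\mathbf{c}[y] = 0$ for every tuple $(\mathbf{x}, y, \mathbf{c})$ generated from $P(\mathbf{c} \mid \mathbf{x}, y)$. If $g(\mathbf{x}, k)$ is rank-monotonic or if every cost vector $\mathbf{c}$ is convex, then $E(r_g) \le E_b(g)$.
Proof. From (5),
$$\mathbf{c}[r_g(\mathbf{x})] \le \frac{1}{K-1} \sum_{k=1}^{K-1} W^{(k)} [\![ Y^{(k)} \ne g(X^{(k)}) ]\!].$$
Take the expectation over $P$ on both sides and use $u$ to mean the uniform sampling,
$$\begin{aligned} E(r_g) &\le \mathcal{E}_{(\mathbf{x}, y, \mathbf{c}) \sim P}\; \frac{1}{K-1} \sum_{k=1}^{K-1} W^{(k)} [\![ Y^{(k)} \ne g(X^{(k)}) ]\!] \\ &= \mathcal{E}_{(\mathbf{x}, y, \mathbf{c}) \sim P}\; \mathcal{E}_{k \sim u\{1, \ldots, K-1\}}\; W^{(k)} [\![ Y^{(k)} \ne g(X^{(k)}) ]\!] \\ &= \mathcal{E}_{(X, Y, W) \sim P_b}\; W\, [\![ Y \ne g(X) ]\!] \\ &= E_b(g). \end{aligned}$$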
4.2 Regret Bound of the Reduction Framework
Theorem 2 indicates that if there exists a decent binary classifier $g$, we can obtain a decent ranker $r_g$. Nevertheless, it does not guarantee how good $r_g$ is in comparison with other rankers. In particular, if we consider the optimal binary classifier $g_*$ under $P_b(X, Y, W)$, and the optimal ranker $r_*$ under $P(x, y, \mathbf{c})$, does a small regret $E_b(g) - E_b(g_*)$ in binary classification translate to a small regret $E(r_g) - E(r_*)$ in ordinal ranking? Furthermore, is $E(r_{g_*})$ close to $E(r_*)$? Next, we introduce the reverse reduction technique, which helps to answer the questions above.
The reverse reduction technique works on the binary classification problems generated by the reduction method. It goes through the preprocessing and the prediction stages of the reduction method in the opposite direction. In the preprocessing stage, instead of starting with ordinal examples $(x_n, y_n, \mathbf{c}_n)$, reverse reduction deals with weighted binary examples $(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)})$. It first combines each set of binary examples sharing the same $x_n$ to an ordinal example by
\[
\begin{cases}
y_n = 1 + \displaystyle\sum_{k=1}^{K-1} \llbracket Y_n^{(k)} > 0 \rrbracket, \\[2ex]
\mathbf{c}_n[k] = \displaystyle\sum_{\ell=1}^{K-1} \frac{W_n^{(\ell)}}{K-1} \,\llbracket y_n \le \ell < k \ \text{or}\ k \le \ell < y_n \rrbracket .
\end{cases}
\tag{8}
\]
Figure 1: reduction (top) and reverse reduction (bottom). Top: ordinal example $(x_n, y_n, \mathbf{c}_n)$ → weighted binary examples $(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)})$, $k = 1, \ldots, K-1$ → core binary classification algorithm → related binary classifiers $g(X^{(k)})$, $k = 1, \ldots, K-1$ → ordinal ranker $r_g(x)$. Bottom: weighted binary examples $(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)})$, $k = 1, \ldots, K-1$ → ordinal example $(x_n, y_n, \mathbf{c}_n)$ → core ordinal ranking algorithm → ordinal ranker $r(x)$ → related binary classifiers $g_r(X^{(k)})$, $k = 1, \ldots, K-1$.
It is easy to verify that (8) is the exact inverse transform of (4) on the training examples under the assumption that $\mathbf{c}[y] = 0$. These ordinal examples are then given to an ordinal ranking algorithm to obtain a ranker $r$. In the prediction stage, reverse reduction works by decomposing the prediction $r(x)$ to $K-1$ binary predictions, each as if coming from a binary classifier
\[
g_r(X^{(k)}) = 2\llbracket r(x) > k \rrbracket - 1. \tag{9}
\]
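The inverse relationship between (4) and (8) can be illustrated with a short round trip. This is a sketch under assumptions rather than the authors' implementation: the helper names `forward`, `reverse`, and `binary_from_ranker` are hypothetical, cost vectors are 0-based lists while ranks and $k$ are 1-based, and the input cost vector is assumed to be V-shaped with $\mathbf{c}[y] = 0$.

```python
def forward(x, y, c):
    """Transform (4): one ordinal example -> K-1 weighted binary examples."""
    K = len(c)
    return [((x, k), 1 if k < y else -1, (K - 1) * abs(c[k - 1] - c[k]))
            for k in range(1, K)]


def reverse(extended):
    """Transform (8): K-1 weighted binary examples sharing one x -> ordinal example."""
    K = len(extended) + 1
    x = extended[0][0][0]
    y = 1 + sum(1 for (_, Y, _) in extended if Y > 0)
    w = {k: W / (K - 1) for ((_, k), _, W) in extended}   # W^(l) / (K-1)
    c = [sum(w[l] for l in range(1, K) if (y <= l < k) or (k <= l < y))
         for k in range(1, K + 1)]
    return x, y, c


def binary_from_ranker(r, x, k):
    """Decomposition (9): the k-th binary prediction implied by a ranker r."""
    return 1 if r(x) > k else -1
```

For the V-shaped cost vector $(5, 1, 0, 3, 4)$ with $y = 3$, applying `forward` and then `reverse` recovers the same rank and the same cost vector.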
Then, a lemma on the out-of-sample cost of $g_r$ immediately follows (Lin and Li, 2009).
Lemma 1. With the definitions of $P(x, y, \mathbf{c})$ and $P_b(X^{(k)}, Y^{(k)}, W^{(k)})$ in Theorem 2, for every ordinal ranker $r$, $E(r) = E_b(g_r)$.
Proof. Because $g_r$ is rank-monotonic by construction, the same proof for the first part of Theorem 2 leads to the desired result.
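Lemma 1 can be illustrated on a small finite sample, treating the sample as the distribution $P$. The sketch below is hypothetical throughout: the function names, the toy sample, and the toy ranker are assumptions made for illustration, cost vectors are 0-based lists that are V-shaped with $\mathbf{c}[y] = 0$, and exact fractions are used so that the two quantities can be compared for equality.

```python
from fractions import Fraction


def ranking_cost(r, sample):
    """E(r): average cost c[r(x)] over the sample."""
    return sum(Fraction(c[r(x) - 1]) for x, y, c in sample) / len(sample)


def binary_error_of_g_r(r, sample):
    """E_b(g_r): average of W [[Y != g_r(X)]] with k uniform on {1, ..., K-1}."""
    total, count = Fraction(0), 0
    for x, y, c in sample:
        K = len(c)
        for k in range(1, K):
            Y = 1 if k < y else -1
            W = (K - 1) * abs(c[k - 1] - c[k])
            g = 1 if r(x) > k else -1          # decomposition (9)
            total += W * (Y != g)
            count += 1
    return total / count


# Toy data: (x, y, c) with K = 4; each c is V-shaped with c[y] = 0.
sample = [(0.2, 1, [0, 1, 4, 9]), (1.7, 3, [3, 1, 0, 2]), (2.9, 4, [6, 3, 1, 0])]
toy_ranker = lambda x: min(4, 1 + int(x))     # predicts ranks 1, 2, 3 on the sample
```

On this sample both `ranking_cost(toy_ranker, sample)` and `binary_error_of_g_r(toy_ranker, sample)` evaluate to $2/3$.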
The stages of reduction and reverse reduction are illustrated in Figure 1. Next, we show how the reverse reduction technique allows us to draw a strong theoretical connection between ordinal ranking and binary classification. By the definition of $r_*$ and $g_*$, for any ranker $r$ and any binary classifier $g$,
\[
E(r) \ge E(r_*), \qquad E_b(g) \ge E_b(g_*). \tag{10}
\]
Then, the reverse reduction technique yields a simple proof of the regret bound.
Theorem 3 (Regret bound of the reduction framework). If $g(x, k)$ is rank-monotonic, or if every cost vector $\mathbf{c}$ is convex, then
\[
E(r_g) - E(r_*) \le E_b(g) - E_b(g_*). \tag{11}
\]
Proof.
\begin{align*}
E(r_g) - E(r_*) &\le E_b(g) - E(r_*) && \text{(from Theorem 2)} \\
&= E_b(g) - E_b(g_{r_*}) && \text{(from Lemma 1)} \\
&\le E_b(g) - E_b(g_*) && \text{(from (10))}.
\end{align*}
The cost bound (Theorem 2) and the regret bound (Theorem 3) provide different guarantees for the reduction method. The former describes how the ordinal ranking cost is upper bounded by the binary classification error in an absolute sense, and the latter describes the upper bound in a relative sense.
4.3 Equivalence between Ordinal Ranking and Binary Classification
The results above suggest that ordinal ranking can be reduced to binary classification without any loss of optimality. That is, ordinal ranking is “no harder than” binary classification. Intuitively, binary classification is also “no harder than” ordinal ranking, because the former is a special case of the latter with $K = 2$. Next, we formalize the notion of hardness with the probably approximately correct (PAC) setup in computational learning theory (Kearns and Vazirani, 1994) and prove that ordinal ranking and binary classification are indeed equivalent in hardness. We use the following definition of PAC in our coming theorems (Valiant, 1984; Kearns and Vazirani, 1994).
Definition 1. In cost-sensitive classification, a learning model $\mathcal{G}$ is efficiently PAC-learnable (using the same representation class) if there exists a (possibly randomized) learning algorithm $\mathcal{A}$ satisfying the following property: for every distribution $P(x, y, \mathbf{c})$ being considered, where
\[
\mathbf{c}[g_*(x)] = \mathbf{c}[y] = \mathbf{c}_{\min} = 0
\]
with some $g_* \in \mathcal{G}$; for all $0 < \epsilon$ and $0 < \delta < \frac{1}{2}$, if $\mathcal{A}$ is given access to an oracle generating examples $(x, y, \mathbf{c})$ from $P(x, y, \mathbf{c})$, as well as inputs $\epsilon$ and $\delta$, then $\mathcal{A}$ outputs $g \in \mathcal{G}$ such that $E(g) \le \epsilon$ with probability at least $1 - \delta$ as well as with time polynomial in $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$.
Briefly speaking, the definition assumes that the target function $g_*$ is within the learning model $\mathcal{G}$ and is of cost $0$ (the minimum cost). In other words, it is the noiseless setup of learning. We shall only focus on this case while pointing out that similar results can also be proved for the noisy setup (Lin, 2008).
Theorem 4 (Equivalence theorem of the reduction framework). Consider a learning model $\mathcal{R}$ for ordinal ranking, its associated learning model $\mathcal{G} = \{g_r : r \in \mathcal{R}\}$ for binary classification, and distributions $P(x, y, \mathbf{c})$ such that all cost vectors $\mathbf{c}$ are V-shaped. Then, $\mathcal{R}$ is efficiently PAC-learnable if and only if $\mathcal{G}$ is efficiently PAC-learnable.
Proof. If $\mathcal{G}$ is efficiently PAC-learnable using algorithm $\mathcal{A}_{\mathcal{G}}$, we can convert $\mathcal{A}_{\mathcal{G}}$ to an efficient algorithm $\mathcal{A}_{\mathcal{R}}$ for ordinal ranking as follows.
1. Transform the oracle that generates $(x, y, \mathbf{c})$ from $P(x, y, \mathbf{c})$ to an oracle that generates $(X^{(k)}, Y^{(k)}, W^{(k)})$ by picking $k$ uniformly and applying (4).
2. Run $\mathcal{A}_{\mathcal{G}}$ with the transformed oracle until it outputs some $g(X^{(k)})$.
3. Return $r_g$.
It is not hard to see that $\mathcal{A}_{\mathcal{R}}$ is as efficient as $\mathcal{A}_{\mathcal{G}}$, and the cost guarantee comes from Theorem 2 using the fact that $g_r$ are all rank-monotonic.
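The conversion from $\mathcal{A}_{\mathcal{G}}$ to $\mathcal{A}_{\mathcal{R}}$ amounts to wrapping the example oracle. The following sketch shows one way this could look; every name and interface in it is an assumption made for illustration: an oracle is a zero-argument function returning one example, `A_G(oracle, eps, delta)` is a hypothetical binary learner returning a function on $X = (x, k)$, cost vectors are 0-based lists, and the returned ranker assumes that (3) counts positive binary predictions.

```python
import random


def transform_oracle(ordinal_oracle, K, rng):
    """Step 1: wrap an oracle for (x, y, c) into an oracle for (X, Y, W)."""
    def binary_oracle():
        x, y, c = ordinal_oracle()
        k = rng.randint(1, K - 1)              # uniform on {1, ..., K-1}, inclusive
        Y = 1 if k < y else -1                 # Y = 2[[k < y]] - 1
        W = (K - 1) * abs(c[k - 1] - c[k])     # W = (K-1)|c[k] - c[k+1]|
        return (x, k), Y, W
    return binary_oracle


def make_ranking_learner(A_G, K, rng):
    """Build A_R from a (hypothetical) binary learner A_G(oracle, eps, delta)."""
    def A_R(ordinal_oracle, eps, delta):
        g = A_G(transform_oracle(ordinal_oracle, K, rng), eps, delta)        # step 2
        return lambda x: 1 + sum(1 for k in range(1, K) if g((x, k)) > 0)    # step 3
    return A_R
```

Each call to the transformed oracle makes exactly one call to the original oracle plus a constant amount of extra work, which is the sense in which $\mathcal{A}_{\mathcal{R}}$ is as efficient as $\mathcal{A}_{\mathcal{G}}$.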
Now we consider the other direction. If $\mathcal{R}$ is efficiently PAC-learnable using algorithm $\mathcal{A}_{\mathcal{R}}$, we can convert $\mathcal{A}_{\mathcal{R}}$ to an efficient algorithm $\mathcal{A}_{\mathcal{G}}$ for binary classification.