Hsuan-Tien Lin^{1} and Ling Li^{2}
1 Department of Computer Science and Information Engineering National Taiwan University
htlin@csie.ntu.edu.tw
2 Department of Computer Science California Institute of Technology
ling@caltech.edu
Abstract. We analyze the relationship between ordinal ranking and binary classification with a new technique called reverse reduction. In particular, we prove that the regret can be transformed between ordinal ranking and binary classification. The proof allows us to establish a general equivalence between the two in terms of hardness. Furthermore, we use the technique to design a novel boosting approach that improves any cost-sensitive base ordinal ranking algorithm. The approach extends the well-known AdaBoost to the area of ordinal ranking, and inherits many of its good properties. Experimental results demonstrate that our proposed approach can achieve decent training and test performance even when the base algorithm produces only simple decision stumps.
1 Introduction
We work on a supervised learning task called ordinal ranking, which is also referred to as ordinal regression [1] or ordinal classification [2]. The task, which aims at predicting the ranks (i.e., ordinal class labels) of future inputs, is closely related to multiclass classification and metric regression. It differs from the former because of the ordinal information encoded in the ranks, and from the latter because the metric distance between the ranks is not explicitly defined. Since rank is a natural representation of human preferences, the task lends itself to many applications in social science and information retrieval.
Many ordinal ranking algorithms have been proposed from a machine learning perspective in recent years. For instance, Herbrich et al. [3] designed an approach with support vector machines based on comparing training examples in a pairwise manner. Nevertheless, such a pairwise comparison perspective may not be suitable for large-scale learning because the size of the associated optimization problem is quadratic in the number of training examples.
There are some other approaches that do not lead to such a quadratic expansion, such as perceptron ranking [4, PRank] and support vector ordinal regression [1, SVOR]. Li and Lin [5] proposed a reduction method that unified these approaches using the extended binary classification (EBC) perspective, and showed that any binary classification algorithm can be cast as an ordinal ranking one with EBC. Still other approaches fall into neither of the perspectives above, such as Gaussian process ordinal regression [6].
Given the wide availability of ordinal ranking algorithms, a natural question is whether their performance could be further improved by a generic method. In binary classification, there are many boosting algorithms that serve the purpose.
They usually work by combining the hypotheses that are produced from a base binary classification algorithm [7], including the well-known adaptive boosting (AdaBoost) approach [8]. There are also some boosting-related approaches for ordinal ranking. For example, Freund et al. [9] introduced the RankBoost approach based on the pairwise comparison perspective. Lin and Li [10] proposed ordinal regression boosting (ORBoost), which is a special instance of the EBC perspective. However, both approaches take a base binary classification algorithm rather than a base ordinal ranking one. In other words, they cannot be directly used to improve the performance of existing ordinal ranking algorithms such as PRank or SVOR.
In this paper, we propose a novel boosting approach for ordinal ranking.
The approach improves the performance of any cost-sensitive ordinal ranking algorithm, including PRank and SVOR. Our approach directly extends the original AdaBoost, and inherits many of its good properties. The approach is designed with a technique called reverse reduction, which complements the reduction method of Li and Lin [5]. The technique not only helps in designing our proposed approach, but also reveals strong theoretical connections between ordinal ranking and binary classification.
The paper is organized as follows. In Section 2, we introduce the basic setup as well as the reduction method of Li and Lin [5]. Then, we discuss the reverse reduction technique and its theoretical implications in Section 3. We use the technique to design and analyze the proposed AdaBoost.OR approach in Section 4. Finally, we show the experimental results in Section 5 and conclude in Section 6.
2 Background
We shall first define the ordinal ranking problem. Then, we introduce the reduction method of Li and Lin [5] and the consequent error bound.
2.1 Problem Setup
In the ordinal ranking problem, the task is to predict the rank y of some input x ∈ X ⊆ R^{D}, where y belongs to the set K = {1, 2, ..., K} of consecutive integers.
We shall adopt the cost-sensitive setting, in which a cost vector c ∈ R^{K} is generated with (x, y) from some fixed but unknown distribution P on X × K × R^{K}. The k-th element c[k] of the cost vector represents the penalty when predicting the input vector x as rank k. We naturally assume that c[y] = 0 and c[k] ≥ 0 for all k ∈ K. An ordinal ranking problem comes with a given training set S = {(x_n, y_n, c_n)}_{n=1}^{N}, whose elements are drawn i.i.d. from P. The goal of the problem is to find an ordinal ranker r : X → K such that its generalization error π(r) ≡ E_{(x,y,c)∼P} c[r(x)] is small.
Note that if we replace "rank" with "label", the setup above is the same as the cost-sensitive K-class classification problem [11]. The rank, however, carries extra ordinal information, which suggests that the mislabeling penalty depends on the "closeness" of the prediction. Hence, the cost vector c should be V-shaped with respect to y [5], i.e.,

    c[k−1] ≥ c[k], for 2 ≤ k ≤ y ;    c[k+1] ≥ c[k], for y ≤ k ≤ K−1 .
We shall assume that every cost vector c generated from P is V-shaped with respect to its associated y.
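The V-shaped condition is easy to check programmatically. Below is a minimal sketch; the function name and the 0-indexed cost representation are our own conventions, not from the paper:

```python
def is_v_shaped(c, y):
    """Check the V-shaped condition for a cost vector.

    `c` is a 0-indexed list, so c[k-1] stands for the paper's c[k];
    `y` is the true rank in {1, ..., K}.
    """
    K = len(c)
    # costs are non-increasing while approaching y from the left ...
    left = all(c[k - 2] >= c[k - 1] for k in range(2, y + 1))
    # ... and non-decreasing while moving away from y to the right
    right = all(c[k] >= c[k - 1] for k in range(y, K))
    return left and right
```

For instance, the cost vector [1, 0, 1, 2] is V-shaped with respect to y = 2, while [0, 2, 1, 3] is not V-shaped with respect to y = 1.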
2.2 Reduction Method and Error Bound
Li and Lin [5] proposed a reduction method from ordinal ranking to binary classification. The reduction method consists of two stages: training and prediction.
During the training stage, each ordinal example is extended to K−1 weighted binary examples. Then, the binary examples are used to train a set of K−1 closely related binary classifiers, or equivalently, one joint binary classifier g(x, k). During the prediction stage, the ordinal ranker r_g(x) is constructed from g(x, k) by a counting method:^{3}

    r_g(x) ≡ 1 + Σ_{k=1}^{K−1} ⟦g(x, k) > 0⟧ .    (1)
Although Li and Lin [5] dealt with a more restricted cost-sensitive setting, the error bound theorem [5, Theorem 3], which is one of their key results, can be easily extended to our setting. The extension is based on the following distribution P_b that generates weighted binary examples (x, k, z, w):

1. Draw a tuple (x, y, c) from P, and draw k uniformly within {1, 2, ..., K−1}.
2. Let

    z = 2 · ⟦k < y⟧ − 1 ,    w = (K−1) · |c[k+1] − c[k]| .    (2)

With distribution P_b, the generalization error of any binary classifier g is

    π_b(g) ≡ E_{(x,k,z,w)∼P_b} w · ⟦z ≠ g(x, k)⟧ .

Then, we can obtain the extended error bound theorem with a proof similar to the one from Li and Lin [5].
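As a concrete illustration, a per-example version of transform (2) and the counting rule (1) can be sketched as follows. The function names and the 0-indexed cost list are our own conventions; the weight uses |c[k+1] − c[k]|, which is nonnegative for V-shaped costs:

```python
def ordinal_to_binary(x, y, c, K):
    """Transform (2): one ordinal example (x, y, c) into K-1 weighted
    binary examples (x, k, z, w). `c` is 0-indexed: c[k-1] is c[k]."""
    examples = []
    for k in range(1, K):
        z = 1 if k < y else -1                 # z = 2*[[k < y]] - 1
        w = (K - 1) * abs(c[k] - c[k - 1])     # w = (K-1)*|c[k+1] - c[k]|
        examples.append((x, k, z, w))
    return examples

def rank_by_counting(g, x, K):
    """Counting rule (1): r_g(x) = 1 + number of k with g(x, k) > 0."""
    return 1 + sum(1 for k in range(1, K) if g(x, k) > 0)
```

For the cost vector c = [1, 0, 1, 2] with y = 2 and K = 4, the transform yields binary labels (+1, −1, −1), each with weight 3.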
Then, we can obtain the extended error bound theorem with a proof similar to the one from Li and Lin [5].
^{3} ⟦·⟧ is 1 if the inner condition is true, and 0 otherwise.
Theorem 1. If g(x, k) is rank-monotonic, i.e.,

    g(x, k−1) ≥ g(x, k), for 2 ≤ k ≤ K−1 ,

or if every cost vector c is convex, i.e.,

    c[k+1] − c[k] ≥ c[k] − c[k−1], for 2 ≤ k ≤ K−2 ,

then π(r_g) ≤ π_b(g).
Proof. The details can be found in the work of Lin [12].
3 Reverse Reduction Technique
Theorem 1 indicates that if there exists a decent binary classifier g, we can obtain a "good" ordinal ranker r_g. Nevertheless, it does not guarantee how good r_g is in comparison with other ordinal rankers. If we denote by g_∗ the optimal binary classifier under P_b, and by r_∗ the optimal ordinal ranker under P, does a small regret π_b(g) − π_b(g_∗) in binary classification translate to a small regret π(r_g) − π(r_∗) in ordinal ranking? In particular, is π(r_{g_∗}) close to π(r_∗)?
Next, we introduce the reverse reduction technique, which helps to answer the questions above.
3.1 Reverse Reduction
The reverse reduction technique works on the binary classification problems generated by the reduction method described in Section 2. We can use the technique not only to understand more about the theoretical nature of ordinal ranking, but also to design better ordinal ranking algorithms. Reverse reduction goes through each stage of the reduction method in the opposite direction. In the training stage, instead of starting with the ordinal examples (x_n, y_n, c_n), reverse reduction deals with the weighted binary examples (x_n, k, z_nk, w_nk). It first combines each set of binary examples sharing the same x_n into an ordinal example by
    y_n = 1 + Σ_{k=1}^{K−1} ⟦z_nk > 0⟧ ;    c_n[k] = Σ_{ℓ=1}^{K−1} (w_nℓ / (K−1)) · ⟦y_n ≤ ℓ < k or k ≤ ℓ < y_n⟧ .    (3)
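A sketch of transform (3), under assumed conventions (binary examples as tuples (x, k, z, w) and a 0-indexed cost list):

```python
def binary_to_ordinal(bin_examples, K):
    """Inverse transform (3): the K-1 weighted binary examples that share
    the same x are combined back into one ordinal example (x, y, c)."""
    x = bin_examples[0][0]
    # y = 1 + number of positive binary labels
    y = 1 + sum(1 for (_, _k, z, _w) in bin_examples if z > 0)
    # c[k] accumulates the weights of the binary subproblems between k and y
    c = []
    for k in range(1, K + 1):
        c.append(sum(w / (K - 1) for (_, l, _z, w) in bin_examples
                     if y <= l < k or k <= l < y))
    return x, y, c
```

For instance, the binary examples (x, 1, +1, 3), (x, 2, −1, 3), (x, 3, −1, 3) with K = 4 combine back to y = 2 and c = [1, 0, 1, 2].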
It is easy to verify that (3) is the exact inverse transform of (2) on the training examples. These ordinal examples are then given to an ordinal ranking algorithm to obtain an ordinal ranker r. In the prediction stage, reverse reduction works by decomposing the prediction r(x) into K−1 binary predictions, each as if coming from a joint binary classifier
    g_r(x, k) = 2 · ⟦k < r(x)⟧ − 1 .    (4)

Then, a lemma on the generalization ability of g_r immediately follows.
[Figure 1 depicts the two pipelines. Top (reduction): ordinal examples (x_n, y_n, c_n) ⇒ weighted binary examples (x_n, k, z_nk, w_nk), k = 1, ..., K−1 ⇒ core binary classification algorithm ⇒ related binary classifiers g(x, k), k = 1, ..., K−1 ⇒ ordinal ranker r_g(x). Bottom (reverse reduction): weighted binary examples (x_n, k, z_nk, w_nk), k = 1, ..., K−1 ⇒ ordinal examples (x_n, y_n, c_n) ⇒ core ordinal ranking algorithm ⇒ ordinal ranker r(x) ⇒ related binary classifiers g_r(x, k), k = 1, ..., K−1.]

Fig. 1. Top: Reduction; Bottom: Reverse Reduction

Lemma 1. For every ordinal ranker r, π(r) = π_b(g_r).
Proof. Because every cost vector c from P is V-shaped,

    π(r) = E_{(x,y,c)∼P} Σ_{k : ⟦k<y⟧ ≠ ⟦k<r(x)⟧} |c[k+1] − c[k]|
         = E_{(x,k,z,w)∼P_b} w · ⟦z ≠ g_r(x, k)⟧
         = π_b(g_r) .
The steps of reduction and reverse reduction are illustrated in Figure 1. Note that if the reduction block is plugged into the reverse reduction block, we recover the underlying binary classification algorithm (and vice versa). This observation may suggest that reverse reduction is trivial and useless. Nevertheless, as we will show next, reverse reduction perfectly complements the reduction method, and allows us to draw a strong theoretical connection between ordinal ranking and binary classification. In addition, reverse reduction is useful in designing boosting methods for ordinal ranking, as will be demonstrated in Section 4.
3.2 Regret Bound via Reverse Reduction
Without loss of generality, we use the following definitions for the optimal ordinal ranker r_∗ and the optimal binary classifier g_∗, with ties arbitrarily broken and sign(0) assumed to be +1:

    r_∗(x) ≡ argmin_ℓ E_{c∼P(·|x)} c[ℓ] ,    g_∗(x, k) ≡ sign( E_{(w,z)∼P_b(·|x,k)} (w · z) ) .
It is not hard to prove that r_∗ and g_∗ are optimal, i.e., for any ordinal ranker r and any binary classifier g,

    π(r) ≥ π(r_∗) ,    π_b(g) ≥ π_b(g_∗) .    (5)

With the definitions of r_∗ and g_∗, the reverse reduction technique allows a simple proof of the following regret bound.
Theorem 2. If g(x, k) is rank-monotonic, or if every cost vector c is convex, then

    π(r_g) − π(r_∗) ≤ π_b(g) − π_b(g_∗) .

Proof.

    π(r_g) − π(r_∗) ≤ π_b(g) − π(r_∗)        (from Theorem 1)
                    = π_b(g) − π_b(g_{r_∗})  (from Lemma 1)
                    ≤ π_b(g) − π_b(g_∗)      (from (5))
An immediate implication of the regret bound is as follows. If there exists one optimal binary classifier g_+ that is rank-monotonic, both the right-hand side and the left-hand side are 0. That is, every optimal binary classifier under P_b leads to an optimal ordinal ranker under P. In other words, locating an optimal ordinal ranker is "no harder than" locating an optimal binary classifier. On the other hand, binary classification is also "no harder than" ordinal ranking, because the former is a special case of the latter with K = 2. Therefore, if there is a rank-monotonic g_+, ordinal ranking is equivalent to binary classification in hardness.^{4} In the following theorem, we show a general sufficient condition for the equivalence.
Theorem 3. Assume that the effective cost

    c_x[k] = E_{c∼P(·|x)} c[k] − min_ℓ E_{c∼P(·|x)} c[ℓ]

is V-shaped with respect to y_x = argmin_ℓ c_x[ℓ] = r_∗(x) on every point x ∈ X. Let g_+(x, k) ≡ 2 · ⟦k < y_x⟧ − 1. Then g_+ is rank-monotonic and optimal for P_b.
^{4} Note that the equivalence in hardness here is qualitative and considers neither the number of independent examples N needed nor the number of classes K.
Proof. By construction, g_+ is rank-monotonic. The key is to show that g_∗(x, k) = sign(c_x[k+1] − c_x[k]). Then, because c_x is V-shaped, g_+(x, k) = g_∗(x, k) for all (x, k) except when c_x[k+1] − c_x[k] = 0. Therefore, π_b(g_+) = π_b(g_∗) and g_+ is optimal for P_b.
Note that if every cost vector c is convex, the effective cost c_x would also be convex, and hence V-shaped. Thus, the convexity of c is also a (weaker) sufficient condition for the equivalence in hardness between ordinal ranking and binary classification.
As can be seen from the definition of r_∗, the effective cost c_x conveys sufficient information for determining the optimal prediction at x. Because ordinal ranking predictions should take "closeness" into account (see Section 2), it is reasonable to assume that c_x is V-shaped. Hence, in general (with such a minor assumption), ordinal ranking is equivalent to binary classification in terms of hardness.
4 AdaBoost for Ordinal Ranking
We now use reduction and reverse reduction to design a novel boosting approach for ordinal ranking. We shall first introduce the ideas behind the approach. In the training stage, we apply the reduction technique, and take AdaBoost as the core binary classification algorithm. AdaBoost would then train a base binary classifier ĝ_t with weighted binary examples in its t-th iteration. We use the reverse reduction technique to replace ĝ_t with g_{r_t}, and let the approach train a base ordinal ranker r_t with cost-sensitive ordinal examples instead.
After the training steps above, our approach returns an ensemble of ordinal rankers H = {(r_t, v_t)}_{t=1}^{T}, where v_t ≥ 0 is the weight associated with the ordinal ranker r_t. In the prediction stage, we first apply the reverse reduction technique in (4) to cast each ordinal ranker r_t as a joint binary classifier ĝ_t = g_{r_t}. The weighted votes from all the binary classifiers in the ensemble are gathered to form binary predictions. Then, the reduction technique comes into play, and constructs an ordinal prediction from the binary ones by (1). Combining the steps above, we get the following prediction rule for an ordinal ranking ensemble H:
    r_H(x) ≡ 1 + Σ_{k=1}^{K−1} ⟦ Σ_{t=1}^{T} v_t · ⟦k < r_t(x)⟧ ≥ (1/2) Σ_{t=1}^{T} v_t ⟧ .    (6)
The steps of going back and forth between reduction and reverse reduction may seem complicated. Nevertheless, we can simplify many of them with careful derivations, which are illustrated below.
4.1 Prediction Steps
We shall start with the prediction steps, and derive a simplified form of (6) as follows.
Theorem 4 (prediction with the weighted median). For any ordinal ranking ensemble H = {(r_t, v_t)}_{t=1}^{T}, assume that v_t ≥ 0 and Σ_{t=1}^{T} v_t = 1. Then,

    r_H(x) = min{ k : Σ_{t=1}^{T} v_t · ⟦k ≥ r_t(x)⟧ > 1/2 } .    (7)

Proof. Let k^{∗} = min{ k : Σ_{t=1}^{T} v_t · ⟦k ≥ r_t(x)⟧ > 1/2 }. Then,

    Σ_{t=1}^{T} v_t · ⟦k ≥ r_t(x)⟧ > 1/2  if and only if  k^{∗} ≤ k .

That is, Σ_{t=1}^{T} v_t · ⟦k < r_t(x)⟧ ≥ 1/2 if and only if k < k^{∗}. Thus, r_H(x) = 1 + (k^{∗} − 1) = k^{∗}.
Therefore, the prediction rule (6) that goes back and forth between reduction and reverse reduction can be equivalently performed by computing the simple and intuitive statistic in (7): the weighted median. Note that the rule in (7) is not specific to our approach. It can be applied to ordinal ranking ensembles produced by any ensemble learning approach, such as bagging [13].
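The equivalence of (6) and (7) is also easy to check in code. A sketch, with an ensemble represented as a hypothetical list of (ranker, weight) pairs:

```python
def rank_by_voting(ensemble, x, K):
    """Rule (6): gather weighted binary votes [[k < r_t(x)]], then count."""
    total = sum(v for _r, v in ensemble)
    return 1 + sum(
        1 for k in range(1, K)
        if sum(v for r, v in ensemble if k < r(x)) >= total / 2)

def rank_by_weighted_median(ensemble, x, K):
    """Rule (7): the smallest k such that the rankers predicting at most k
    carry more than half of the total weight."""
    total = sum(v for _r, v in ensemble)
    for k in range(1, K + 1):
        if sum(v for r, v in ensemble if r(x) <= k) > total / 2:
            return k
    return K
```

With three rankers predicting 2, 3, 3 under weights 0.2, 0.4, 0.4, both rules output rank 3.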
4.2 Training Steps
We now look at the training steps. The steps of the original AdaBoost are listed in Algorithm 1. After plugging AdaBoost into reduction and a base ordinal ranking algorithm into reverse reduction, we can equivalently obtain Algorithm 2: AdaBoost.OR. The equivalence is based on maintaining the following invariance in each iteration.
Lemma 2. Substitute the indices m in Algorithm 1 with (n, k). That is, x̂_m = (x_n, k), ẑ_m = z_nk, and ŵ_m = w_nk. Take ĝ_t(x, k) = g_{r_t}(x, k), and assume that in Algorithms 1 and 2,

    c_n^{(τ)}[k] = Σ_{ℓ=1}^{K−1} (ŵ_nℓ^{(τ)} / (K−1)) · ⟦y_n ≤ ℓ < k or k ≤ ℓ < y_n⟧    (8)

is satisfied for τ = t with ŵ_nℓ^{(τ)} ≥ 0. Then, equation (8) is satisfied for τ = t+1 with ŵ_nℓ^{(τ)} ≥ 0.
Proof. Because (8) is satisfied for τ = t and ŵ_nℓ^{(t)} ≥ 0, the cost vector c_n^{(t)} is V-shaped with respect to y_n and c_n^{(t)}[y_n] = 0. Thus,

    Σ_{n=1}^{N} ( c_n^{(t)}[1] + c_n^{(t)}[K] ) = (1/(K−1)) Σ_{n=1}^{N} Σ_{k=1}^{K−1} ŵ_nk^{(t)} .
Algorithm 1 AdaBoost [8]

Input: examples {(x̂_m, ẑ_m, ŵ_m)}_{m=1}^{M}.
Initialize ŵ_m^{(1)} = ŵ_m for all m.
For t = 1 to T:
1. Obtain ĝ_t from the base binary classification algorithm.
2. Compute the weighted training error ε̂_t:

    ε̂_t = ( Σ_{m=1}^{M} ŵ_m^{(t)} · ⟦ẑ_m ≠ ĝ_t(x̂_m)⟧ ) / ( Σ_{m=1}^{M} ŵ_m^{(t)} ) .

   If ε̂_t > 1/2, set T = t−1 and abort loop.
3. Let v̂_t = (1/2) log((1 − ε̂_t)/ε̂_t).
4. Let Λ̂_t = exp(2v̂_t) − 1, and update

    ŵ_m^{(t+1)} = ŵ_m^{(t)} + { 0,                if ẑ_m = ĝ_t(x̂_m) ;
                                Λ̂_t · ŵ_m^{(t)}, if ẑ_m ≠ ĝ_t(x̂_m) .
Algorithm 2 AdaBoost.OR

Input: examples {(x_n, y_n, c_n)}_{n=1}^{N}.
Initialize c_n^{(1)}[k] = c_n[k] for all n, k.
For t = 1 to T:
1. Obtain r_t from the base ordinal ranking algorithm.
2. Compute the weighted training error ε_t:

    ε_t = ( Σ_{n=1}^{N} c_n^{(t)}[r_t(x_n)] ) / ( Σ_{n=1}^{N} ( c_n^{(t)}[1] + c_n^{(t)}[K] ) ) .

   If ε_t > 1/2, set T = t−1 and abort loop.
3. Let v_t = (1/2) log((1 − ε_t)/ε_t).
4. Let Λ_t = exp(2v_t) − 1. If r_t(x_n) ≥ y_n, then

    c_n^{(t+1)}[k] = c_n^{(t)}[k] + { 0,                          if k ≤ y_n ;
                                     Λ_t · c_n^{(t)}[r_t(x_n)],  if k > r_t(x_n) ;
                                     Λ_t · c_n^{(t)}[k],         otherwise .

   Else, switch > to < and ≤ to ≥ (and vice versa).
In addition, since ĝ_t(x, k) = g_{r_t}(x, k), by a proof similar to that of Lemma 1,

    Σ_{n=1}^{N} c_n^{(t)}[r_t(x_n)] = (1/(K−1)) Σ_{n=1}^{N} Σ_{k=1}^{K−1} ŵ_nk^{(t)} · ⟦z_nk ≠ ĝ_t(x_n, k)⟧ .

Therefore, ε̂_t = ε_t, v̂_t = v_t, and Λ̂_t = Λ_t.
Because ĝ_t(x_n, k) ≠ z_nk if and only if r_t(x_n) ≤ k < y_n or y_n ≤ k < r_t(x_n),

    ŵ_nk^{(t+1)} = { ŵ_nk^{(t)} + Λ̂_t · ŵ_nk^{(t)}, if r_t(x_n) ≤ k < y_n or y_n ≤ k < r_t(x_n) ;
                     ŵ_nk^{(t)},                      otherwise .    (9)

It is easy to check that the ŵ_nk^{(t+1)} are nonnegative. Furthermore, we see that the update rule in Algorithm 2 is equivalent to combining (9) and (8) with τ = t + 1.
Thus, equation (8) is satisfied for τ = t + 1.
Then, by mathematical induction from τ = 1 up to T with Lemma 2, plugging AdaBoost into reduction and a base ordinal ranking algorithm into reverse reduction is equivalent to running AdaBoost.OR with the base algorithm. AdaBoost.OR takes AdaBoost as a special case with K = 2, and inherits many of its good properties, as discussed below.
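Putting Algorithm 2 together, a minimal sketch of an AdaBoost.OR training loop might look as follows. The interfaces are our own assumptions: `base_learner(X, Y, costs)` returns a ranker mapping x to {1, ..., K}, costs are kept 0-indexed, and the zero-error guard is an implementation detail not in the paper:

```python
import math

def adaboost_or(X, Y, C, K, base_learner, T):
    """Sketch of AdaBoost.OR: returns an ensemble [(r_t, v_t), ...]."""
    N = len(X)
    costs = [list(cn) for cn in C]                 # c^{(1)}_n = c_n
    ensemble = []
    for _ in range(T):
        r = base_learner(X, Y, costs)              # step 1
        eps = (sum(costs[n][r(X[n]) - 1] for n in range(N)) /
               sum(costs[n][0] + costs[n][K - 1] for n in range(N)))
        if eps > 0.5:                              # step 2: abort loop
            break
        if eps == 0.0:                             # guard: perfect ranker
            ensemble.append((r, 1.0))
            break
        v = 0.5 * math.log((1 - eps) / eps)        # step 3
        lam = math.exp(2 * v) - 1                  # step 4: Lambda_t
        new_costs = []
        for n in range(N):
            p, y, cn = r(X[n]), Y[n], costs[n]
            row = []
            for k in range(1, K + 1):
                ck = cn[k - 1]
                if (y < k <= p) or (p <= k < y):
                    ck += lam * cn[k - 1]          # between prediction and truth
                elif (p >= y and k > p) or (p < y and k < p):
                    ck += lam * cn[p - 1]          # beyond the prediction
                row.append(ck)
            new_costs.append(row)
        costs = new_costs
        ensemble.append((r, v))
    return ensemble

def best_constant(X, Y, costs):
    """A trivial base learner for illustration: the best constant ranker."""
    K = len(costs[0])
    k = min(range(1, K + 1), key=lambda k: sum(c[k - 1] for c in costs))
    return lambda x, k=k: k
```

On a three-example toy set with the absolute cost, the first round picks the constant ranker r ≡ 2 with ε_1 = 1/3 and v_1 = (1/2) log 2.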
4.3 Properties
AdaBoost.OR can take any cost-sensitive base algorithm that produces individual ordinal rankers r_t with errors ε_t ≤ 1/2. In binary classification, the 1/2 error bound can be naturally achieved by a constant classifier or a fair coin flip. For ordinal ranking, is 1/2 still easy to achieve? The short answer is yes. In the following theorem, we demonstrate that there always exists a constant ordinal ranker that satisfies the error bound.
Theorem 5. Define constant ordinal rankers r̃_k by r̃_k(x) ≡ k for all x. For any set {c_n}_{n=1}^{N}, there exists a constant ranker r̃_k with k ∈ K such that

    ε̃_k = ( Σ_{n=1}^{N} c_n[r̃_k(x_n)] ) / ( Σ_{n=1}^{N} ( c_n[1] + c_n[K] ) ) ≤ 1/2 .

Proof. Either r̃_1 or r̃_K achieves error ≤ 1/2 because by definition ε̃_1 + ε̃_K = 1.
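The complementary relation ε̃_1 + ε̃_K = 1 underlying the proof is easy to verify numerically (a sketch with 0-indexed cost lists, our own convention):

```python
def constant_ranker_error(costs, k, K):
    """Normalized training error of the constant ranker that always
    predicts rank k; costs[n][k-1] stands for the paper's c_n[k]."""
    return (sum(c[k - 1] for c in costs) /
            sum(c[0] + c[K - 1] for c in costs))
```

For any set of cost vectors, the errors of the two extreme constant rankers sum to 1, so at least one of them is at most 1/2.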
Therefore, even the simplest deterministic ordinal rankers can always achieve the desired error bound.^{5} If the base ordinal ranking algorithm produces better ordinal rankers, the following theorem bounds the normalized training cost of the final ensemble H.
Theorem 6. Suppose the base ordinal ranking algorithm produces ordinal rankers with errors ε_1, ..., ε_T, where each ε_t ≤ 1/2. Let γ_t = 1/2 − ε_t. The final ensemble r_H satisfies the following error bound:

    ( Σ_{n=1}^{N} c_n[r_H(x_n)] ) / ( Σ_{n=1}^{N} ( c_n[1] + c_n[K] ) ) ≤ Π_{t=1}^{T} sqrt(1 − 4γ_t^2) ≤ exp( −2 Σ_{t=1}^{T} γ_t^2 ) .
Proof. Similar to the proof of Lemma 2, the left-hand side of the bound equals

    ( Σ_{n=1}^{N} Σ_{k=1}^{K−1} w_nk · ⟦z_nk ≠ g_{Ĥ}(x_n, k)⟧ ) / ( Σ_{n=1}^{N} Σ_{k=1}^{K−1} w_nk ) ,

where Ĥ is a binary classification ensemble {(ĝ_t, v_t)}_{t=1}^{T} with ĝ_t = g_{r_t}. Then, the bound is a simple consequence of the well-known AdaBoost bound [8].
Theorem 6 indicates that if the base algorithm always produces an ordinal ranker with ε_t ≤ 1/2 − γ for some γ > 0, the training cost of H decreases exponentially with T. That is, AdaBoost.OR can rapidly improve the training performance of such a base algorithm.

^{5} Similarly, the error bound can be achieved by a randomized ordinal ranker that returns either 1 or K with equal probability.

Fig. 2. Decision boundaries produced by AdaBoost.OR on an artificial data set: (a) T = 1; (b) T = 10; (c) T = 100; (d) T = 1000
We can also extend the generalization error bounds of AdaBoost to AdaBoost.OR, including the T-dependent bound [8] and the margin-based ones [14]. The steps for proving these bounds are similar to those for the SVOR bound derived by Li and Lin [5].
5 Experiments
We now demonstrate the validity of AdaBoost.OR. We will first illustrate its behavior on an artificial data set. Then, we test its training and test performance on benchmark data sets.
5.1 Artificial Data
We generate 500 input vectors x_n ∈ [0, 1] × [0, 1] uniformly, and rank them with K = {1, 2, 3, 4} based on three quadratic boundaries. Then, we apply AdaBoost.OR on these examples with the absolute cost, i.e., c[k] ≡ |y − k| with respect to the associated y.
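The absolute cost can be generated as follows (a one-liner sketch; the list is 0-indexed, so the element at position k−1 stands for the paper's c[k]):

```python
def absolute_cost(y, K):
    """Absolute cost vector for true rank y: the paper's c[k] = |y - k|,
    stored as a 0-indexed list of length K."""
    return [abs(y - k) for k in range(1, K + 1)]
```

For example, absolute_cost(2, 4) gives [1, 0, 1, 2], which is V-shaped with respect to y = 2.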
We use a simple base algorithm called ORStump, which solves the following optimization problem efficiently with dynamic programming:

    min_{θ,d,q} Σ_{n=1}^{N} c_n[r(x_n; θ, d, q)]    subject to θ_1 ≤ θ_2 ≤ ... ≤ θ_{K−1} ,

where r(x; θ, d, q) ≡ max{ k : q · (x)_d < θ_k }. The ordinal ranking decision stump r(·; θ, d, q) is a natural extension of the binary decision stump [15]. Note that the set of all possible ordinal ranking decision stumps includes the constant ordinal rankers. Therefore, ORStump can always achieve ε_t ≤ 1/2.
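For small data sets, such a stump can also be found by brute force instead of dynamic programming. The sketch below uses the standard threshold form r(x) = 1 + Σ_k ⟦q · (x)_d > θ_k⟧ with monotone thresholds, which we assume matches the ranker above up to convention; all names and the 0-indexed cost lists are our own:

```python
from itertools import combinations_with_replacement

def or_stump(X, C, K):
    """Brute-force ordinal ranking decision stump: search over the chosen
    dimension d, direction q, and monotone threshold tuples for the lowest
    total training cost (exponential in K; the paper's DP is efficient)."""
    N, D = len(X), len(X[0])
    best_cost, best_r = float("inf"), None
    for d in range(D):
        for q in (+1, -1):
            vals = sorted({q * x[d] for x in X})
            # candidate thresholds: below, between, and above the data
            cand = ([vals[0] - 1.0] +
                    [(a + b) / 2 for a, b in zip(vals, vals[1:])] +
                    [vals[-1] + 1.0])
            # combinations_with_replacement yields sorted (monotone) tuples
            for thetas in combinations_with_replacement(cand, K - 1):
                def r(x, d=d, q=q, th=thetas):
                    return 1 + sum(1 for t in th if q * x[d] > t)
                cost = sum(C[n][r(X[n]) - 1] for n in range(N))
                if cost < best_cost:
                    best_cost, best_r = cost, r
    return best_r
```

On three one-dimensional points at 0, 1, 2 with true ranks 1, 2, 3 and the absolute cost, the learned stump separates the ranks perfectly.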
The decision boundaries generated by AdaBoost.OR with ORStump using T = 1, 10, 100, and 1000 are shown in Figure 2. The case of T = 1 is the same as applying ORStump directly on the artificial set, and we can see that the resulting decision boundary cannot capture the full characteristics of the data. As T gets larger, however, AdaBoost.OR is able to boost ORStump to form more sophisticated boundaries that better approximate the underlying quadratic curves.
5.2 Benchmark Data
Next, we run AdaBoost.OR on eight benchmark data sets^{6} with the absolute cost. We keep the splits provided by Chu and Keerthi [1], and average the results over 20 trials. Thus, our results can be fairly compared with their benchmark SVOR ones.
We couple AdaBoost.OR with three base algorithms: ORStump, PRank [4], and a reduction-based formulation of SVOR [1] proposed by Li and Lin [5].
For the PRank algorithm, we adopt the SiPrank variant, and make it cost-sensitive by presenting random examples (x_n, y_n) with probability proportional to max_{k∈K} c_n[k]. In addition, we apply the pocket technique with ratchet [16] for 2000 epochs to get a decent training cost minimizer. For SVOR, we follow the same setting as Li and Lin [5] except for the choice of parameter. In particular, we set the parameter C in the t-th iteration as the smallest number within {2^{−20}, 2^{−18}, ..., 2^{20}} that makes ε_t ≤ 0.3. This setup guarantees that SVOR produces a decent training cost minimizer without overfitting the training examples too much.
We run AdaBoost.OR for T = 1000, 100, and 10 iterations for ORStump, PRank, and SVOR, respectively. This setup is intended to compensate for the computational complexity of each individual base algorithm. Nevertheless, a more sophisticated choice of T should further improve the performance of AdaBoost.OR.
For each algorithm, the average training cost as well as its standard error is reported in Table 1; the average test cost and its standard error are reported in Table 2. For each pair of single and AdaBoost.OR (abbreviated AB.OR) entries, we mark those within one standard error of the lowest in bold. We also list the benchmark SVOR results from Chu and Keerthi [1] in Table 2, and mark the entries better than the benchmark ones with †.
From the tables, we see that AdaBoost.OR almost always improves both the training and test performance of the base algorithm significantly, especially for ORStump and SVOR. It is harder for AdaBoost.OR to improve the performance of PRank, because PRank sometimes cannot produce a good r_t in terms of minimizing the training cost, even with the pocket technique.
Note that a single-shot execution of the SVOR base algorithm is much worse than the benchmark SVOR in terms of test cost. The difference can be explained by looking at their parameter selection procedures. The SVOR base algorithm chooses its parameter by the training cost alone, to guarantee ε_t ≤ 1/2, which is the condition that allows AdaBoost.OR to work (see Theorem 6). On the other hand, the benchmark SVOR goes through a complete parameter selection procedure using cross-validation, which justifies its good test performance but is quite time-consuming. In contrast, AdaBoost.OR (especially with ORStump) is
^{6} They are pyrimidines, machineCPU, boston, abalone, bank, computer, california, and census [1].
Table 1. Average cost of ordinal ranking algorithms on the training set

data |  ORStump single |  ORStump AB.OR |  PRank single |  PRank AB.OR |  SVOR single |  SVOR AB.OR
pyr. |  1.76 ± 0.02    |  0.02 ± 0.01   |  0.46 ± 0.03  |  0.27 ± 0.05 |  2.44 ± 0.04 |  0.18 ± 0.02
mac. |  1.12 ± 0.02    |  0.12 ± 0.01   |  0.88 ± 0.01  |  0.86 ± 0.01 |  2.49 ± 0.03 |  0.36 ± 0.02
bos. |  1.05 ± 0.01    |  0.00 ± 0.00   |  0.85 ± 0.01  |  0.83 ± 0.01 |  2.43 ± 0.01 |  0.29 ± 0.02
aba. |  1.53 ± 0.01    |  1.05 ± 0.01   |  1.44 ± 0.01  |  1.44 ± 0.01 |  2.63 ± 0.01 |  0.39 ± 0.01
ban. |  1.98 ± 0.01    |  1.14 ± 0.00   |  1.51 ± 0.00  |  1.47 ± 0.00 |  1.63 ± 0.07 |  0.18 ± 0.02
com. |  1.18 ± 0.00    |  0.50 ± 0.00   |  0.66 ± 0.00  |  0.66 ± 0.00 |  2.51 ± 0.01 |  0.35 ± 0.01
cal. |  1.62 ± 0.00    |  0.88 ± 0.00   |  1.21 ± 0.00  |  1.21 ± 0.00 |  2.61 ± 0.01 |  0.52 ± 0.01
cen. |  1.83 ± 0.00    |  1.11 ± 0.00   |  1.58 ± 0.01  |  1.56 ± 0.01 |  2.51 ± 0.00 |  0.43 ± 0.02

(results that are as significant as the best one of each pair are marked in bold)
Table 2. Average cost of ordinal ranking algorithms on the test set

data |  ORStump single |  ORStump AB.OR |  PRank single |  PRank AB.OR |  SVOR single |  SVOR AB.OR  | benchmark
pyr. |  1.91 ± 0.09    |  1.24 ± 0.05†  |  1.57 ± 0.07  |  1.42 ± 0.07 |  2.63 ± 0.10 |  1.36 ± 0.05 | 1.294
mac. |  1.29 ± 0.04    |  0.84 ± 0.02†  |  0.97 ± 0.01  |  0.93 ± 0.02†|  2.62 ± 0.04 |  0.93 ± 0.03†| 0.990
bos. |  1.17 ± 0.01    |  0.89 ± 0.01   |  0.91 ± 0.01  |  0.89 ± 0.01 |  2.46 ± 0.03 |  0.80 ± 0.01 | 0.747
aba. |  1.59 ± 0.00    |  1.48 ± 0.00   |  1.48 ± 0.01  |  1.48 ± 0.01 |  2.65 ± 0.02 |  1.53 ± 0.01 | 1.361
ban. |  2.00 ± 0.00    |  1.53 ± 0.00   |  1.54 ± 0.00  |  1.50 ± 0.00 |  1.68 ± 0.07 |  1.48 ± 0.00 | 1.393
com. |  1.20 ± 0.00    |  0.63 ± 0.00   |  0.66 ± 0.00  |  0.66 ± 0.00 |  2.51 ± 0.01 |  0.67 ± 0.01 | 0.596
cal. |  1.64 ± 0.00    |  1.00 ± 0.00†  |  1.21 ± 0.00  |  1.21 ± 0.00 |  2.61 ± 0.01 |  1.10 ± 0.00 | 1.008
cen. |  1.85 ± 0.00    |  1.25 ± 0.00   |  1.60 ± 0.01  |  1.58 ± 0.00 |  2.51 ± 0.01 |  1.25 ± 0.01 | 1.205

(results that are better than the benchmark one are marked with †)
(results that are as significant as the best one of each pair are marked in bold)
faster in training and can achieve decent performance without resorting to the cross-validation steps. The efficiency, along with the comparable performance, can make AdaBoost.OR a promising alternative for some application needs.
6 Conclusion
We presented the reverse reduction technique between ordinal ranking and binary classification. The technique complemented the reduction method of Li and Lin [5], and allowed us to derive a novel regret bound for ordinal ranking. Furthermore, we used the technique to prove that ordinal ranking is generally equivalent to binary classification in hardness.
We also used reduction and reverse reduction to design a novel boosting approach, AdaBoost.OR, to improve the performance of any cost-sensitive base ordinal ranking algorithm. We showed the parallel between AdaBoost.OR and AdaBoost in algorithmic steps and in theoretical properties. Experimental results validated that AdaBoost.OR indeed improves both the training and test performance of existing ordinal ranking algorithms.
Acknowledgment
We thank Yaser Abu-Mostafa, Amrit Pratap, and the anonymous reviewers for valuable suggestions. Hsuan-Tien Lin was partly sponsored by the Caltech Division of Engineering and Applied Science Fellowship when initiating this project, and the continuing work was supported by the National Science Council of Taiwan via the grant NSC 98-2218-E-002-019.
References
[1] Wei Chu and S. Sathiya Keerthi. New approaches to support vector ordinal regression. In Luc De Raedt and Stefan Wrobel, editors, Proceedings of ICML 2005, pages 145–152. ACM, 2005.
[2] Eibe Frank and Mark Hall. A simple approach to ordinal classification. In Luc De Raedt and Peter Flach, editors, Proceedings of ECML 2001, volume 2167 of Lecture Notes in Artificial Intelligence, pages 145–156. Springer-Verlag, 2001.
[3] Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. In Alexander J. Smola, Peter J. Bartlett, Bernhard Schölkopf, and Dale Schuurmans, editors, Advances in Large Margin Classifiers, pages 115–132. MIT Press, 2000.
[4] Koby Crammer and Yoram Singer. Online ranking by projecting. Neural Computation, 17:145–175, 2005.
[5] Ling Li and Hsuan-Tien Lin. Ordinal regression by extended binary classification. In Bernhard Schölkopf, John C. Platt, and Thomas Hofmann, editors, Proceedings of NIPS 2006, volume 19, pages 865–872. MIT Press, 2007.
[6] Wei Chu and Zoubin Ghahramani. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019–1041, 2005.
[7] Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In S. Mendelson and A. J. Smola, editors, Advanced Lectures on Machine Learning, volume 2600 of Lecture Notes in Computer Science, pages 118–183. Springer-Verlag, 2003.
[8] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[9] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
[10] Hsuan-Tien Lin and Ling Li. Large-margin thresholded ensembles for ordinal regression: Theory and practice. In José L. Balcázar, Philip M. Long, and Frank Stephan, editors, Proceedings of ALT 2006, volume 4264 of Lecture Notes in Artificial Intelligence, pages 319–333. Springer-Verlag, 2006.
[11] Naoki Abe, Bianca Zadrozny, and John Langford. An iterative method for multiclass cost-sensitive learning. In Won Kim, Ron Kohavi, Johannes Gehrke, and William DuMouchel, editors, Proceedings of KDD 2004, pages 3–11. ACM, 2004.
[12] Hsuan-Tien Lin. From Ordinal Ranking to Binary Classification. PhD thesis, California Institute of Technology, 2008.
[13] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[14] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
[15] Robert C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1):63–91, 1993.
[16] Stephen I. Gallant. Perceptron-based learning algorithms. IEEE Transactions on Neural Networks, 1(2):179–191, 1990.