**Ordinal Regression by Extended Binary Classification**

**Ling Li**
Learning Systems Group
California Institute of Technology

ling@caltech.edu

**Hsuan-Tien Lin**
Learning Systems Group
California Institute of Technology

htlin@caltech.edu

**Abstract**

We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows us not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework.

**1** **Introduction**

We work on a type of supervised learning problem called *ranking* or *ordinal regression*, where examples are labeled by an ordinal scale called the *rank*. For instance, the rating that a customer gives to a movie might be one of do-not-bother, only-if-you-must, good, very-good, and run-to-see. The ratings have a natural order, which distinguishes ordinal regression from general multiclass classification.

Recently, many algorithms for ordinal regression have been proposed from a machine learning perspective. For instance, Crammer and Singer [1] generalized the online perceptron algorithm with multiple thresholds to do ordinal regression. In their approach, a perceptron maps an input vector to a latent potential value, which is then thresholded to obtain a rank. Shashua and Levin [2] proposed new support vector machine (SVM) formulations to handle multiple thresholds. Some other formulations were studied by Rajaram et al. [3] and Chu and Keerthi [4]. All these algorithms share a common property: they are modified from well-known binary classification approaches.

Since binary classification is much better studied than ordinal regression, a general framework to systematically reduce the latter to the former can introduce two immediate benefits. First, well-tuned binary classification approaches can be readily transformed into good ordinal regression algorithms, which saves immense efforts in design and implementation. Second, new generalization bounds for ordinal regression can be easily derived from known bounds for binary classification, which saves tremendous efforts in theoretical analysis.

In this paper, we propose such a reduction framework. The framework is based on *extended examples*, which are extracted from the original examples and a given mislabeling cost matrix. The binary classifier trained from the extended examples can then be used to construct a ranking rule. We prove that the mislabeling cost of the ranking rule is bounded by a weighted 0/1 loss of the binary classifier. Hence, binary classifiers that generalize well could introduce ranking rules that generalize well. The advantages of the framework in algorithmic design and in theoretical analysis are both demonstrated in the paper. In addition, we show that our framework provides a unified view for many existing ordinal regression algorithms. The experiments on some benchmark data sets validate the usefulness of our framework in practice.

The paper is organized as follows. In Section 2, we introduce our reduction framework. A unified view of some existing algorithms based on the framework is discussed in Section 3. Theoretical guarantees on the reduction, including derivations of new generalization bounds for ordinal regression, are provided in Section 4. We present experimental results of several new algorithms in Section 5, and conclude in Section 6.

**2** **The reduction framework**

In an ordinal regression problem, an example (x, y) is composed of an input vector x ∈ X and an ordinal label (i.e., rank) y ∈ Y = {1, 2, . . . , K}. Each example is assumed to be drawn i.i.d. from some unknown distribution P (x, y) on X × Y. The generalization error of a ranking rule r : X → Y is then defined as

$$C(r, P) \stackrel{\text{def}}{=} \mathbb{E}_{(x,y) \sim P} \, C_{y, r(x)},$$

where $C$ is a $K \times K$ cost matrix with $C_{y,k}$ being the cost of predicting an example $(x, y)$ as rank $k$. Naturally we assume $C_{y,y} = 0$ and $C_{y,k} > 0$ for $k \neq y$. Given a training set $S = \{(x_n, y_n)\}_{n=1}^{N}$ containing $N$ examples, the goal is to find a ranking rule $r$ that generalizes well, i.e., is associated with a small $C(r, P)$.

The setting above looks similar to that of a multiclass classification problem, except that the ranks are ordered. The ordinal information can be interpreted in several ways. In statistics, the information is assumed to reflect a stochastic ordering on the conditional distributions $P(y \leq k \mid x)$ [5]. Another interpretation is that the mislabeling cost depends on the "closeness" of the prediction. Consider an example $(x, 4)$ with $r_1(x) = 3$ and $r_2(x) = 1$. The rule $r_2$ should pay more for the erroneous prediction than the rule $r_1$. Thus, we generally want each row of $C$ to be *V-shaped*. That is, $C_{y,k-1} \geq C_{y,k}$ if $k \leq y$ and $C_{y,k} \leq C_{y,k+1}$ if $k \geq y$.

A simple $C$ with V-shaped rows is the *classification cost matrix*, with entries $C_{y,k} = [\![ y \neq k ]\!]$.^{1} The classification cost is widely used in multiclass classification. However, because the cost is invariant for all kinds of mislabelings, the ordinal information is not taken into account. The *absolute cost matrix*, which is defined by $C_{y,k} = |y - k|$, is a popular choice that better reflects the ordering preference. Its rows are not only V-shaped, but also *convex*. That is, $C_{y,k+1} - C_{y,k} \geq C_{y,k} - C_{y,k-1}$ for $1 < k < K$. The convex rows encode a stronger preference for making the prediction "close."

In this paper, we shall always assume that the ordinal regression problem under study comes with a cost matrix of V-shaped rows, and discuss how to reduce the ordinal regression problem to a binary classification problem. Some of the results may require the rows to be convex.
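As a small illustration of these definitions, the following Python sketch builds the classification and absolute cost matrices and checks the V-shape and convexity conditions stated above. The function names are our own, and ranks 1..K in the text correspond to indices 0..K-1 in the code.

```python
def classification_cost(K):
    """C[y][k] = 1 if y != k else 0 (0-based indices stand for ranks 1..K)."""
    return [[int(y != k) for k in range(K)] for y in range(K)]


def absolute_cost(K):
    """C[y][k] = |y - k|."""
    return [[abs(y - k) for k in range(K)] for y in range(K)]


def is_v_shaped(C):
    """Each row is non-increasing up to the diagonal and non-decreasing after it."""
    K = len(C)
    for y in range(K):
        row = C[y]
        left_ok = all(row[k - 1] >= row[k] for k in range(1, y + 1))
        right_ok = all(row[k] <= row[k + 1] for k in range(y, K - 1))
        if not (left_ok and right_ok):
            return False
    return True


def has_convex_rows(C):
    """Each row satisfies C[y][k+1] - C[y][k] >= C[y][k] - C[y][k-1] for interior k."""
    K = len(C)
    return all(
        row[k + 1] - row[k] >= row[k] - row[k - 1]
        for row in C
        for k in range(1, K - 1)
    )
```

Under these checks, both matrices are V-shaped, but only the absolute cost matrix has convex rows once K is at least 3, which is why some results below need the stronger convexity assumption.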

**2.1** **Reducing ordinal regression to binary classification**

The ordinal information allows ranks to be compared. Consider, for instance, that we want to know how good a movie $x$ is. An associated question would be: "is the rank of $x$ greater than $k$?" For a fixed $k$, such a question is exactly a binary classification problem, and the rank of $x$ can be determined by asking multiple questions for $k = 1, 2, \ldots, K - 1$. Frank and Hall [6] proposed to solve each binary classification problem independently and combine the binary outputs to a rank. Although their approach is simple, the generalization performance using the combination step cannot be easily analyzed.

Our framework works differently. First, all the binary classification problems are solved jointly to obtain a single binary classifier. Second, a simpler step is used to convert the binary outputs to a rank, and generalization analysis can immediately follow.

^{1} The Boolean test $[\![ \cdot ]\!]$ is 1 if the inner condition is true, and 0 otherwise.

Assume that $f_b(x, k)$ is a binary classifier for all the associated questions above. Consistent answers would be $f_b(x, k) = 1$ ("yes") for $k = 1$ until $(y' - 1)$ for some $y'$, and 0 ("no") afterwards. Then, a reasonable ranking rule based on the binary answers is $r(x) = y' = 1 + \max\{k : f_b(x, k) = 1\}$, with the maximum taken as 0 when no answer is "yes." Equivalently,

$$r(x) \stackrel{\text{def}}{=} 1 + \sum_{k=1}^{K-1} f_b(x, k).$$

Although the definition can be flexibly applied even when $f_b$ is not consistent, a consistent $f_b$ is usually desired in order to introduce a good ranking rule $r$.

Furthermore, the ordinal information can help to model the relative confidence in the binary outputs. That is, when $k$ is farther from the rank of $x$, the answer $f_b(x, k)$ should be more confident. The confidence can be modeled by a real-valued function $f : X \times \{1, 2, \ldots, K - 1\} \to \mathbb{R}$, with $f_b(x, k) = [\![ f(x, k) > 0 ]\!]$ and the confidence encoded in the magnitude of $f$. Accordingly,

$$r(x) \stackrel{\text{def}}{=} 1 + \sum_{k=1}^{K-1} [\![ f(x, k) > 0 ]\!]. \tag{1}$$

The ordinal information would naturally require $f$ to be *rank-monotonic*, i.e., $f(x, 1) \geq f(x, 2) \geq \cdots \geq f(x, K - 1)$ for every $x$. Note that a rank-monotonic function $f$ introduces consistent answers $f_b$. Again, although the construction (1) can be applied to cases where $f$ is not rank-monotonic, a rank-monotonic $f$ is usually desired.
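The construction in (1) can be sketched in a few lines of Python. The helper names below are our own; `scores` stands for the list of confidence values f(x, 1), ..., f(x, K-1) of a single input.

```python
def rank_from_scores(scores):
    """Equation (1): r(x) = 1 + number of k with f(x, k) > 0."""
    return 1 + sum(1 for s in scores if s > 0)


def is_rank_monotonic(scores):
    """True if f(x, 1) >= f(x, 2) >= ... >= f(x, K-1)."""
    return all(a >= b for a, b in zip(scores, scores[1:]))


# A rank-monotonic example with K = 5: two positive scores give rank 3.
monotonic = [2.0, 0.5, -0.3, -1.2]
# The rule still applies to scores that are not rank-monotonic.
non_monotonic = [1.0, -0.5, 0.2, -1.0]
```

For the rank-monotonic list, the rank equals one plus the index of the last positive score; for the second list the rule still returns 3, but the binary answers are no longer consistent.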

When $f$ is rank-monotonic, we have $f(x, k) > 0$ for $k < r(x)$, and $f(x, k) \leq 0$ for $k \geq r(x)$. Thus the cost of the ranking rule $r$ on an example $(x, y)$ is

$$C_{y,r(x)} = \sum_{k=r(x)}^{K-1} (C_{y,k} - C_{y,k+1}) + C_{y,K} = \sum_{k=1}^{K-1} (C_{y,k} - C_{y,k+1}) [\![ f(x, k) \leq 0 ]\!] + C_{y,K}. \tag{2}$$

Define the extended examples $(x^{(k)}, y^{(k)})$ with weights $w_{y,k}$ as

$$x^{(k)} = (x, k), \quad y^{(k)} = 2 [\![ k < y ]\!] - 1, \quad w_{y,k} = |C_{y,k} - C_{y,k+1}|. \tag{3}$$

Because row $y$ in $C$ is V-shaped, the binary variable $y^{(k)}$ equals the sign of $(C_{y,k} - C_{y,k+1})$ if the latter is not zero. Continuing from (2),

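$$\begin{aligned}
C_{y,r(x)} &= \sum_{k=1}^{y-1} w_{y,k} \cdot y^{(k)} [\![ f(x^{(k)}) \leq 0 ]\!] + \sum_{k=y}^{K-1} w_{y,k} \cdot y^{(k)} \left( 1 - [\![ f(x^{(k)}) > 0 ]\!] \right) + C_{y,K} \\
&= \sum_{k=1}^{y-1} w_{y,k} [\![ y^{(k)} f(x^{(k)}) \leq 0 ]\!] + C_{y,y} + \sum_{k=y}^{K-1} w_{y,k} [\![ y^{(k)} f(x^{(k)}) < 0 ]\!] \\
&\leq \sum_{k=1}^{K-1} w_{y,k} [\![ y^{(k)} f(x^{(k)}) \leq 0 ]\!].
\end{aligned} \tag{4}$$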

Inequality (4) shows that the cost of r on example (x, y) is bounded by a weighted 0/1 loss of f on
the extended examples. It becomes an equality if the degenerate case f (x^{(k)}) = 0 does not happen.
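The extraction step (3) and the bound (4) can be checked numerically on a single example. The Python sketch below is only an illustration with helper names of our own; it uses 1-based ranks and takes `scores` to be the list f(x, 1), ..., f(x, K-1).

```python
def extended_examples(y, C):
    """Equation (3) for one example of rank y (1-based) under cost matrix C.

    Returns (k, y_k, w_k) for k = 1..K-1, where y_k = +1 if k < y else -1
    and w_k = |C_{y,k} - C_{y,k+1}|.
    """
    K = len(C)
    row = C[y - 1]
    return [(k, 1 if k < y else -1, abs(row[k - 1] - row[k])) for k in range(1, K)]


def weighted_01_loss(y, scores, C):
    """Right-hand side of (4): sum over k of w_k * [y_k * f(x, k) <= 0]."""
    return sum(w for (k, y_k, w) in extended_examples(y, C) if y_k * scores[k - 1] <= 0)


def rank_cost(y, scores, C):
    """Left-hand side of (4): cost C_{y, r(x)} with r(x) built by rule (1)."""
    r = 1 + sum(1 for s in scores if s > 0)
    return C[y - 1][r - 1]


# Absolute cost with K = 5, true rank 4, rank-monotonic scores predicting rank 3.
C_abs = [[abs(a - b) for b in range(5)] for a in range(5)]
cost = rank_cost(4, [1.5, 0.7, -0.2, -0.9], C_abs)
bound = weighted_01_loss(4, [1.5, 0.7, -0.2, -0.9], C_abs)
```

In this example only the extended example with k = 3 is misclassified, so the weighted 0/1 loss and the ranking cost are both 1, matching the equality case of (4).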

When f is not rank-monotonic but row y of C is convex, the inequality (4) could be alternatively proved from

$$\sum_{k=r(x)}^{K-1} (C_{y,k} - C_{y,k+1}) \leq \sum_{k=1}^{K-1} (C_{y,k} - C_{y,k+1}) [\![ f(x^{(k)}) \leq 0 ]\!].$$

The inequality above holds because $(C_{y,k} - C_{y,k+1})$ is non-increasing in $k$ due to the convexity, and there are exactly $(r(x) - 1)$ zeros and $(K - r(x))$ ones in the values of $[\![ f(x^{(k)}) \leq 0 ]\!]$ in (1).

Altogether, our reduction framework consists of the following steps: we first use (3) to transform all training examples $(x_n, y_n)$ to extended examples $(x_n^{(k)}, y_n^{(k)})$ with weights $w_{y_n,k}$ (also denoted as $w_n^{(k)}$). All the extended examples would then be jointly learned by a binary classifier $f$ with confidence outputs, aiming at a low weighted 0/1 loss. Finally, a ranking rule $r$ is constructed from $f$ using (1). The cost bound in (4) leads to the following theorem.

**Theorem 1 (reduction)** *An ordinal regression problem with a V-shaped cost matrix $C$ can be reduced to a binary classification problem with the extended examples in (3) and the ranking rule $r$ in (1). If $f$ is rank-monotonic or every row of $C$ is convex, for any example $(x, y)$ and its extended examples $(x^{(k)}, y^{(k)})$, the weighted sum of the 0/1 loss of $f(x^{(k)})$ bounds the cost of $r(x)$.*

**2.2** **Thresholded model**

From Theorem 1 and the illustrations above, a rank-monotonic f is preferred for our framework. A popular approach to obtain such a function f is to use a thresholded model [1, 4, 5, 7]:

$$f(x, k) = g(x) - \theta_k.$$

As long as the threshold vector $\theta$ is *ordered*, i.e., $\theta_1 \leq \theta_2 \leq \cdots \leq \theta_{K-1}$, the function $f$ is rank-monotonic. The question is then, "when can a binary classification algorithm return ordered thresholds?" A mild but sufficient condition is shown as follows.

**Theorem 2 (ordered thresholds)** *If every row of the cost matrix is convex, and the binary classification algorithm minimizes the loss*

$$\Lambda(g) + \sum_{n=1}^{N} \sum_{k=1}^{K-1} w_n^{(k)} \cdot \ell\left( y_n^{(k)} (g(x_n) - \theta_k) \right), \tag{5}$$

*where $\ell(\rho)$ is non-increasing in $\rho$, there exists an optimal solution $(g^*, \theta^*)$ such that $\theta^*$ is ordered.*

PROOF For an optimal solution $(g, \theta)$, assume that $\theta_k > \theta_{k+1}$ for some $k$. We shall prove that switching $\theta_k$ and $\theta_{k+1}$ would not increase the objective value of (5). First, consider an example with $y_n = k + 1$. Since $y_n^{(k)} = 1$ and $y_n^{(k+1)} = -1$, switching the thresholds changes the objective value by

$$w_n^{(k)} \left[ \ell(g(x_n) - \theta_{k+1}) - \ell(g(x_n) - \theta_k) \right] + w_n^{(k+1)} \left[ \ell(\theta_k - g(x_n)) - \ell(\theta_{k+1} - g(x_n)) \right]. \tag{6}$$

Because $\ell(\rho)$ is non-increasing, the change is non-positive.

For an example with $y_n < k + 1$, we have $y_n^{(k)} = y_n^{(k+1)} = -1$. The change in the objective is

$$\left( w_n^{(k)} - w_n^{(k+1)} \right) \left[ \ell(\theta_{k+1} - g(x_n)) - \ell(\theta_k - g(x_n)) \right].$$

Note that row $y_n$ of the cost matrix being convex leads to $w_n^{(k)} \leq w_n^{(k+1)}$ if $y_n < k + 1$. Since $\ell(\rho)$ is non-increasing, the change above is also non-positive. The case for examples with $y_n > k + 1$ is similar and the change there is also non-positive.

Thus, by switching adjacent pairs of strictly decreasing thresholds, we can actually obtain a solution $(g^*, \theta^*)$ with a smaller or equal objective value in (5), and $g^* = g$. The optimality of $(g, \theta)$ shows that $(g^*, \theta^*)$ is also optimal. □

Note that if $\ell(\rho)$ is strictly decreasing for $\rho < 0$, and there are training examples for every rank, the change (6) is strictly negative. Thus, the optimal $\theta^*$ for any $g^*$ is always ordered.
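The swapping argument can be illustrated numerically. The Python sketch below evaluates the data term of (5) for a thresholded model, assuming the hinge loss as one non-increasing choice of the loss and the absolute cost matrix (whose rows are convex). The function names and the toy values are our own and are not taken from the paper.

```python
def hinge(rho):
    """One non-increasing loss: max(0, 1 - rho)."""
    return max(0.0, 1.0 - rho)


def extended_loss(g_values, ys, thetas, C, loss=hinge):
    """Data term of (5): sum_n sum_k w_n^(k) * loss(y_n^(k) * (g(x_n) - theta_k)).

    g_values[n] is g(x_n), ys[n] is the 1-based rank y_n, and thetas[k-1] is theta_k.
    The regularizer Lambda(g) is omitted because it does not depend on theta.
    """
    K = len(C)
    total = 0.0
    for g_x, y in zip(g_values, ys):
        row = C[y - 1]
        for k in range(1, K):
            w = abs(row[k - 1] - row[k])
            y_k = 1 if k < y else -1
            total += w * loss(y_k * (g_x - thetas[k - 1]))
    return total


# Toy setting with K = 4 ranks and one example per rank (hypothetical values).
C_abs = [[abs(a - b) for b in range(4)] for a in range(4)]
g_values = [-2.0, -0.5, 0.4, 1.8]
ys = [1, 2, 3, 4]
unordered = [0.5, -1.0, 1.0]
ordered = sorted(unordered)
loss_unordered = extended_loss(g_values, ys, unordered, C_abs)
loss_ordered = extended_loss(g_values, ys, ordered, C_abs)
```

Sorting the thresholds amounts to a sequence of the adjacent swaps used in the proof, so `loss_ordered` should not exceed `loss_unordered` for this g.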

**3** **Algorithms based on the framework**

So far the reduction works only by assuming that $x^{(k)} = (x, k)$ is a pair understandable by $f$. Actually, any lossless encoding from $(x, k)$ to a vector can be used to encode the pair. With proper choices of the cost matrix, the encoding scheme of $(x, k)$, and the binary learning algorithm, many existing ordinal regression algorithms can be unified in our framework. In this section, we will briefly discuss some of them. It happens that a simple encoding scheme for $(x, k)$ via a coding matrix $E$ of $(K - 1)$ rows works for all these algorithms. To form $x^{(k)}$, the vector $e_k$, which denotes the $k$-th row of $E$, is appended after $x$. We will mostly work with $E = \gamma I_{K-1}$, where $\gamma$ is a positive scalar and $I_{K-1}$ is the $(K - 1) \times (K - 1)$ identity matrix.
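A minimal Python sketch of this encoding is given below. The function names are our own. The second helper relies on an observation that follows from the encoding rather than a statement in the paper: a linear function on the encoded vector with weights (u, -θ/γ) reproduces the thresholded form ⟨u, x⟩ - θ_k.

```python
def encode(x, k, K, gamma=1.0):
    """Form x^(k) by appending the k-th row of E = gamma * I_{K-1} to x (k is 1-based)."""
    e_k = [gamma if i == k else 0.0 for i in range(1, K)]
    return list(x) + e_k


def linear_score(u, theta, x, k, gamma=1.0):
    """Inner product of (u, -theta / gamma) with x^(k); equals <u, x> - theta_k."""
    K = len(theta) + 1
    weights = list(u) + [-t / gamma for t in theta]
    return sum(wi * zi for wi, zi in zip(weights, encode(x, k, K, gamma)))


# With K = 4 and gamma = 1, the pair (x, 2) becomes x followed by (0, 1, 0).
z = encode([0.5, 2.0], 2, 4)
```

The encoding is lossless because k can be read back from the position of the nonzero appended coordinate.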

**3.1** **Perceptron-based algorithms**

The perceptron ranking (PRank) algorithm proposed by Crammer and Singer [1] is an online ordinal regression algorithm that employs the thresholded model with $f(x, k) = \langle u, x \rangle - \theta_k$. Whenever a training example is not predicted correctly, the current $u$ and $\theta$ are updated in a way similar to the perceptron learning rule [8]. The algorithm was proved to keep an ordered $\theta$, and a mistake bound was also proposed [1].

With the simple encoding scheme $E = I_{K-1}$, we can see that $f(x, k) = \langle (u, -\theta), x^{(k)} \rangle$. Thus, when the absolute cost matrix is taken and a modified perceptron learning rule^{2} is used as the underlying binary classification algorithm, the PRank algorithm is a specific instance of our framework.

The orderliness of the thresholds is guaranteed by Theorem 2, and the mistake bound is a direct application of the well-known perceptron mistake bound (see for example Freund and Schapire [8]).

Our framework not only simplifies the derivation of the mistake bound, but also allows the use of other perceptron algorithms, such as a batch-mode algorithm rather than an online one.
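To make the batch-mode variant concrete, here is a hedged Python sketch that runs a plain perceptron over the extended examples with E = I and the absolute cost (so every weight is 1). It is not the PRank update itself, which treats the K - 1 extended examples of one input together; the function names and the toy data are our own assumptions.

```python
def encode(x, k, K):
    """x^(k): append the k-th row of the identity I_{K-1} to x (k is 1-based)."""
    return list(x) + [1.0 if i == k else 0.0 for i in range(1, K)]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def train_perceptron(X, y, K, max_epochs=1000):
    """Plain perceptron on extended examples; v plays the role of (u, -theta)."""
    v = [0.0] * (len(X[0]) + K - 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, y_n in zip(X, y):
            for k in range(1, K):
                z = encode(x_n, k, K)
                y_k = 1 if k < y_n else -1
                if y_k * dot(v, z) <= 0:
                    # Standard perceptron update on the misclassified extended example.
                    v = [vi + y_k * zi for vi, zi in zip(v, z)]
                    mistakes += 1
        if mistakes == 0:
            break
    return v


def predict_rank(v, x, K):
    """Ranking rule (1) applied to f(x, k) = <v, x^(k)>."""
    return 1 + sum(1 for k in range(1, K) if dot(v, encode(x, k, K)) > 0)


# Hypothetical one-dimensional data with K = 3 ranks, separable by two thresholds.
X = [[-2.0], [-1.5], [0.0], [0.3], [2.0], [2.5]]
y = [1, 1, 2, 2, 3, 3]
v = train_perceptron(X, y, K=3)
predictions = [predict_rank(v, x, 3) for x in X]
```

Because this toy set is linearly separable in the extended space, the perceptron stops after finitely many updates and the constructed ranking rule reproduces the training ranks.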

**3.2** **SVM-based algorithms**

SVM [9] can be thought of as a generalized perceptron with a kernel that computes the inner product on transformed input vectors $\phi(x)$. For the extended examples $(x, k)$, we can suitably define the extended kernel as the original kernel plus the inner product between the extensions,

$$\mathcal{K}\left( (x, k), (x', k') \right) = \langle \phi(x), \phi(x') \rangle + \langle e_k, e_{k'} \rangle.$$

Then, several SVM-based approaches for ordinal regression are special instances of our framework.

For example, the approach of Rajaram et al. [3] is equivalent to using the classification cost matrix, the coding matrix $E$ defined with $e_{k,i} = \gamma \cdot [\![ k \leq i ]\!]$ for some $\gamma > 0$, and the hard-margin SVM.
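For the choice E = γI, the extended kernel defined above reduces to the base kernel plus γ² when the two indices agree and plus 0 otherwise. The Python sketch below shows this with a linear base kernel; the function names are our own, and any kernel function on the original inputs could be passed in instead.

```python
def linear_kernel(x, x2):
    """Base kernel <x, x'> on the original inputs."""
    return sum(a * b for a, b in zip(x, x2))


def extended_kernel(x, k, x2, k2, base_kernel=linear_kernel, gamma=1.0):
    """K((x, k), (x', k')) = base_kernel(x, x') + <e_k, e_k'> with E = gamma * I."""
    extension = gamma * gamma if k == k2 else 0.0
    return base_kernel(x, x2) + extension


same_k = extended_kernel([1.0, 2.0], 1, [3.0, -1.0], 1)       # 1.0 + 1.0
different_k = extended_kernel([1.0, 2.0], 1, [3.0, -1.0], 2)  # 1.0 + 0.0
```

A kernel of this form can be supplied to an SVM solver that accepts precomputed or user-defined kernels, so the solver itself needs no modification.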

When $E = \gamma I_{K-1}$ and the traditional soft-margin SVM are used in our framework, the binary classifier $f(x, k)$ has the form $\langle u, \phi(x) \rangle - \theta_k - b$, and can be obtained by solving

$$\min_{u, \theta, b} \ \|u\|^2 + \|\theta\|^2 / \gamma^2 + \kappa \sum_{n=1}^{N} \sum_{k=1}^{K-1} w_n^{(k)} \max\left\{ 0, 1 - y_n^{(k)} \left( \langle u, \phi(x_n) \rangle - \theta_k - b \right) \right\}. \tag{7}$$
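To see why (7) is an ordinary weighted soft-margin SVM on the extended examples, the following sketch spells out the change of variables. It assumes the encoding $x^{(k)} = (\phi(x), \gamma e_k)$ with $E = \gamma I_{K-1}$; the symbols $v$ and $b'$ are our own auxiliary names for the weights on the appended coordinates and the SVM bias.

```latex
\begin{align*}
&\min_{u, v, b'} \; \|u\|^2 + \|v\|^2
   + \kappa \sum_{n=1}^{N} \sum_{k=1}^{K-1} w_n^{(k)}
     \max\Bigl\{ 0,\; 1 - y_n^{(k)} \bigl( \langle (u, v), (\phi(x_n), \gamma e_k) \rangle + b' \bigr) \Bigr\}, \\
&\langle (u, v), (\phi(x_n), \gamma e_k) \rangle + b'
   = \langle u, \phi(x_n) \rangle + \gamma v_k + b'
   = \langle u, \phi(x_n) \rangle - \theta_k - b
   \quad \text{with } \theta_k = -\gamma v_k,\; b = -b', \\
&\|v\|^2 = \sum_{k=1}^{K-1} \theta_k^2 / \gamma^2 = \|\theta\|^2 / \gamma^2 .
\end{align*}
```

Substituting the last two lines into the first recovers (7). The factor $1/\gamma^2$ also shows that the regularization on $\theta$ weakens as $\gamma$ grows.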

The explicit (SVOR-EXP) and implicit (SVOR-IMC) approaches of Chu and Keerthi [4] can be regarded as instances of our framework with a modified soft-margin SVM formulation (since they excluded the term $\|\theta\|^2 / \gamma^2$ and added some constraints on $\theta$). Thus, many of their results can be alternatively explained with our reduction framework. For example, their proof for ordered $\theta$ of SVOR-IMC is implied by Theorem 2. In addition, they found that SVOR-EXP performed better in terms of the classification cost, and SVOR-IMC performed better in terms of the absolute cost. This finding can also be explained by reduction: SVOR-EXP is an instance of our framework using the classification cost and SVOR-IMC comes from using the absolute cost.

Note that Chu and Keerthi put much effort into designing and implementing suitable optimizers for their modified formulation. If the unmodified soft-margin SVM (7) is directly used in our framework with the absolute cost, we obtain a new support vector ordinal regression formulation.^{3} From Theorem 2, the thresholds $\theta$ would be ordered. The dual of (7) can be easily solved with state-of-the-art SVM optimizers, and the formulations of Chu and Keerthi can be approximated by setting $\gamma$ to a large value. As we shall see in Section 5, even a simple setting of $\gamma = 1$ performs similarly to the approaches of Chu and Keerthi in practice.

**4** **Generalization bounds**

With the extended examples, new generalization bounds can be derived for ordinal regression problems with any cost matrix. A simple result that comes immediately from (4) is:

^{2} To precisely replicate the PRank algorithm, the $(K - 1)$ extended examples sprouted from the same example should be considered altogether in updating the perceptron weight vector.

^{3} The formulation was only briefly mentioned in a footnote, but not studied, by Chu and Keerthi [4].

**Theorem 3 (reduction of generalization error)** *Let $c_y = C_{y,1} + C_{y,K}$ and $c = \max_y c_y$. If $f$ is rank-monotonic or every row of $C$ is convex, there exists a distribution $\hat{P}$ on $(X, Y)$, where $X$ contains the encoding of $(x, k)$ and $Y$ is a binary label, such that*

$$\mathbb{E}_{(x,y) \sim P} \, C_{y,r(x)} \leq c \cdot \mathbb{E}_{(X,Y) \sim \hat{P}} \, [\![ Y f(X) \leq 0 ]\!].$$

PROOF We prove by constructing $\hat{P}$. Given the conditions, following (4), we have

$$C_{y,r(x)} \leq \sum_{k=1}^{K-1} w_{y,k} [\![ y^{(k)} f(x^{(k)}) \leq 0 ]\!] = c_y \cdot \mathbb{E}_{k \sim P_k} \, [\![ y^{(k)} f(x^{(k)}) \leq 0 ]\!],$$

where $P_k(k \mid y) = w_{y,k} / c_y$ is a probability distribution because $c_y = \sum_{k=1}^{K-1} w_{y,k}$. Equivalently, we can define a distribution $\hat{P}(x^{(k)}, y^{(k)})$ that generates $(x^{(k)}, y^{(k)})$ by drawing the tuple $(x, y, k)$ from $P(x, y)$ and $P_k(k \mid y)$. Then, the generalization error of $r$ is

$$\mathbb{E}_{(x,y) \sim P} \, C_{y,r(x)} \leq \mathbb{E}_{(x,y) \sim P} \, c_y \cdot \mathbb{E}_{k \sim P_k} \, [\![ y^{(k)} f(x^{(k)}) \leq 0 ]\!] \leq c \cdot \mathbb{E}_{(x^{(k)}, y^{(k)}) \sim \hat{P}} \, [\![ y^{(k)} f(x^{(k)}) \leq 0 ]\!]. \tag{8}$$
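The normalization used in the proof can be verified directly: for a V-shaped row, the weights in (3) sum to c_y = C_{y,1} + C_{y,K}, so dividing by c_y gives a probability distribution over k. The Python sketch below checks this; the helper names are our own, and ranks are 1-based as in the text.

```python
def extended_weights(y, C):
    """w_{y,k} = |C_{y,k} - C_{y,k+1}| for k = 1..K-1 (y is a 1-based rank)."""
    row = C[y - 1]
    return [abs(row[k - 1] - row[k]) for k in range(1, len(C))]


def c_y(y, C):
    """c_y = C_{y,1} + C_{y,K}."""
    return C[y - 1][0] + C[y - 1][-1]


def p_k(y, C):
    """P_k(k | y) = w_{y,k} / c_y as a list over k = 1..K-1."""
    total = c_y(y, C)
    return [w / total for w in extended_weights(y, C)]


K = 5
C_abs = [[abs(a - b) for b in range(K)] for a in range(K)]
C_cls = [[int(a != b) for b in range(K)] for a in range(K)]
uniform = p_k(2, C_abs)       # absolute cost: every k is equally likely
concentrated = p_k(2, C_cls)  # classification cost: mass only on k = y - 1 and k = y
```

With the absolute cost every question k receives the same probability, while the classification cost concentrates the mass on the questions adjacent to the true rank.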

Theorem 3 shows that, if the binary classifier $f$ generalizes well when examples are sampled from $\hat{P}$, the constructed ranking rule would also generalize well. The terms $y^{(k)} f(x^{(k)})$, which are exactly the margins of the associated binary classifier $f_b(x, k)$, would be analogously called the margins for ordinal regression, and are expected to be positive and large for correct and confident predictions.

Herbrich et al. [5] derived a large-margin bound for an SVM-based thresholded model using pairwise comparisons between examples. However, the bound is complicated because $O(N^2)$ pairs are taken into consideration, and the bound is restricted because it is only applicable to hard-margin cases, i.e., for all $n$, the margins $y_n^{(k)} f(x_n^{(k)}) \geq \Delta > 0$. Another large-margin bound was derived by Shashua and Levin [2]. However, the bound is not data-dependent, and hence does not fully explain the generalization performance of large-margin ranking rules in reality (for more discussions on data-dependent bounds, see the work of, for example, Bartlett and Shawe-Taylor [10]).

Next, we show how a novel data-dependent bound for SVM-based ordinal regression approaches can be derived from our reduction framework. Our bound includes only $O(KN)$ extended examples, and applies to both hard-margin and soft-margin cases, i.e., the margins $y^{(k)} f(x^{(k)})$ can be negative.

Similar techniques can be used to derive generalization bounds when AdaBoost is the underlying classifier (see the work of Lin and Li [7] for one such bound).

**Theorem 4 (data-dependent bound for support vector ordinal regression)** *Assume that*

$$f(x, k) \in \left\{ f : (x, k) \mapsto \langle u, \phi(x) \rangle - \theta_k, \ \|u\|^2 + \|\theta\|^2 \leq 1, \ \|\phi(x)\|^2 + 1 \leq R^2 \right\}.$$

*If $\theta$ is ordered or every row of $C$ is convex, for any margin criterion $\Delta$, with probability at least $1 - \delta$, every ranking rule $r$ based on $f$ has generalization error no more than*

$$\frac{\beta}{N} \cdot \sum_{n=1}^{N} \sum_{k=1}^{K-1} w_n^{(k)} [\![ y_n^{(k)} f(x_n^{(k)}) \leq \Delta ]\!] + O\left( \frac{\log N}{\sqrt{N}}, \frac{R}{\Delta}, \sqrt{\log \frac{1}{\delta}} \right), \quad \text{where } \beta = \frac{\max_y c_y}{\min_y c_y}.$$

PROOF Consider the extended training set $\hat{S} = \left\{ (x_n^{(k)}, y_n^{(k)}) \right\}$, which contains $N(K - 1)$ elements. Each element is a possible outcome from the distribution $\hat{P}$ constructed in Theorem 3. Note, however, that these elements are not all independent. Thus, we cannot directly use the whole extended set as i.i.d. outcomes from $\hat{P}$. Nevertheless, some subsets of $\hat{S}$ do contain i.i.d. outcomes from $\hat{P}$. One way to extract such a subset is to choose independent $k_n$ from $P_k(k \mid y_n)$ for each $(x_n, y_n)$. The subset would be named $T = \left\{ (x_n^{(k_n)}, y_n^{(k_n)}) \right\}_{n=1}^{N}$.
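The extraction of such a subset can be sketched as follows. This Python snippet only draws the indices k_n; the function name and the fixed seed are our own choices for illustration, and `random.choices` normalizes the weights internally, which plays the role of dividing by c_y.

```python
import random


def sample_extended_indices(ys, C, seed=0):
    """Draw one k_n ~ P_k(k | y_n) independently for each 1-based training rank y_n."""
    rng = random.Random(seed)
    K = len(C)
    candidates = list(range(1, K))
    ks = []
    for y in ys:
        row = C[y - 1]
        # Unnormalized probabilities w_{y,k} = |C_{y,k} - C_{y,k+1}|.
        weights = [abs(row[k - 1] - row[k]) for k in candidates]
        ks.append(rng.choices(candidates, weights=weights, k=1)[0])
    return ks


C_abs = [[abs(a - b) for b in range(5)] for a in range(5)]
k_n = sample_extended_indices([1, 2, 3, 4, 5], C_abs)
```

Each original example contributes exactly one extended example to T, so T has N elements rather than N(K - 1).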

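Bartlett and Shawe-Taylor [10] showed that with probability at least $(1 - \delta/2)$ over the choice of $N$ i.i.d. outcomes from $\hat{P}$, which is the case of $T$,

$$\mathbb{E}_{(x^{(k)}, y^{(k)}) \sim \hat{P}} \, [\![ y^{(k)} f(x^{(k)}) \leq 0 ]\!] \leq \frac{1}{N} \sum_{n=1}^{N} [\![ y_n^{(k_n)} f(x_n^{(k_n)}) \leq \Delta ]\!] + O\left( \frac{\log N}{\sqrt{N}}, \frac{R}{\Delta}, \sqrt{\log \frac{1}{\delta}} \right). \tag{9}$$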

Table 1: Test error with absolute cost

| data set | Reduction based on C4.5 | Reduction based on boost-stump | Reduction based on SVM-perceptr. | SVOR-IMC with perceptron kernel | SVOR-IMC with Gaussian kernel [4] |
|---|---|---|---|---|---|
| pyrimidines | 1.565 ± 0.072 | 1.360 ± 0.054 | 1.304 ± 0.040 | 1.315 ± 0.039 | 1.294 ± 0.046 |
| machine | 0.987 ± 0.024 | 0.875 ± 0.017 | 0.842 ± 0.022 | 0.814 ± 0.019 | 0.990 ± 0.026 |
| boston | 0.950 ± 0.016 | 0.846 ± 0.015 | 0.732 ± 0.013 | 0.729 ± 0.013 | 0.747 ± 0.011 |
| abalone | 1.560 ± 0.006 | 1.458 ± 0.005 | 1.383 ± 0.004 | 1.386 ± 0.005 | 1.361 ± 0.003 |
| bank | 1.700 ± 0.005 | 1.481 ± 0.002 | 1.404 ± 0.002 | 1.404 ± 0.002 | 1.393 ± 0.002 |
| computer | 0.701 ± 0.003 | 0.604 ± 0.002 | 0.565 ± 0.002 | 0.565 ± 0.002 | 0.596 ± 0.002 |
| california | 0.974 ± 0.004 | 0.991 ± 0.003 | 0.940 ± 0.001 | 0.939 ± 0.001 | 1.008 ± 0.001 |
| census | 1.263 ± 0.003 | 1.210 ± 0.001 | 1.143 ± 0.002 | 1.143 ± 0.002 | 1.205 ± 0.002 |

Let $b_n = [\![ y_n^{(k_n)} f(x_n^{(k_n)}) \leq \Delta ]\!]$ be a Boolean random variable introduced by $k_n \sim P_k(k \mid y_n)$. The variable has mean $c_{y_n}^{-1} \cdot \sum_{k=1}^{K-1} w_n^{(k)} [\![ y_n^{(k)} f(x_n^{(k)}) \leq \Delta ]\!]$. An extended Chernoff bound shows that when each $b_n$ is chosen independently, with probability at least $(1 - \delta/2)$ over the choice of $b_n$,

$$\frac{1}{N} \sum_{n=1}^{N} b_n \leq \frac{1}{N} \sum_{n=1}^{N} \frac{1}{c_{y_n}} \sum_{k=1}^{K-1} w_n^{(k)} [\![ y_n^{(k)} f(x_n^{(k)}) \leq \Delta ]\!] + O\left( \frac{1}{\sqrt{N}}, \sqrt{\log \frac{1}{\delta}} \right). \tag{10}$$

The desired result can be obtained by combining (8), (9), and (10) with a union bound. □
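The combination step can be written out as the following chain, a sketch that assumes the constant factor $c = \max_y c_y$ is absorbed into the $O(\cdot)$ term and that both (9) and (10) hold, which by the union bound happens with probability at least $1 - \delta$. We abbreviate the margin indicator as $m_n^{(k)}$, a shorthand of our own.

```latex
\begin{align*}
\mathbb{E}_{(x,y) \sim P} \, C_{y,r(x)}
  &\leq c \cdot \mathbb{E}_{(x^{(k)}, y^{(k)}) \sim \hat{P}} \,
        [\![ y^{(k)} f(x^{(k)}) \leq 0 ]\!]
     && \text{by (8)} \\
  &\leq \frac{c}{N} \sum_{n=1}^{N} b_n + O(\cdot)
     && \text{by (9)} \\
  &\leq \frac{c}{N} \sum_{n=1}^{N} \frac{1}{c_{y_n}} \sum_{k=1}^{K-1} w_n^{(k)} m_n^{(k)} + O(\cdot)
     && \text{by (10)} \\
  &\leq \frac{\max_y c_y}{\min_y c_y} \cdot \frac{1}{N}
        \sum_{n=1}^{N} \sum_{k=1}^{K-1} w_n^{(k)} m_n^{(k)} + O(\cdot)
   = \frac{\beta}{N} \sum_{n=1}^{N} \sum_{k=1}^{K-1} w_n^{(k)} m_n^{(k)} + O(\cdot),
\end{align*}
% where m_n^{(k)} = [\![ y_n^{(k)} f(x_n^{(k)}) \leq \Delta ]\!]
% and the last inequality uses 1 / c_{y_n} \leq 1 / \min_y c_y.
```

The ratio $\beta$ therefore arises only from replacing each $c_{y_n}$ by its smallest possible value; for the absolute cost with $K = 5$, for instance, $c_y$ is 4 for every $y$ and $\beta = 1$.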

**5** **Experiments**

We performed experiments with eight benchmark data sets that were used by Chu and Keerthi [4].

The data sets were produced by quantizing some metric regression data sets with K = 10. We used the same training/test ratio and also averaged the results over 20 trials. Thus, with the absolute cost matrix, we can fairly compare our results with those of SVOR-IMC [4].

We tested our framework with $E = \gamma I_{K-1}$ and three different binary classification algorithms. The first binary algorithm is Quinlan's C4.5 [11]. The second is AdaBoost-stump which uses AdaBoost to aggregate 500 decision stumps. The third one is SVM with the perceptron kernel [12], with a simple setting of $\gamma = 1$. Note that the Gaussian kernel was used by Chu and Keerthi [4]. We used the perceptron kernel instead to gain the advantage of faster parameter selection. The parameter $\kappa$ of the soft-margin SVM was determined by a 5-fold cross validation procedure with $\log_2 \kappa = -17, -15, \ldots, 3$, and LIBSVM [13] was adopted as the solver. For a fair comparison, we also implemented SVOR-IMC with the perceptron kernel and the same parameter selection procedure in LIBSVM.
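The parameter grid described above can be written down explicitly. In the Python sketch below, only the grid values come from the text; the fold splitter is a hypothetical helper of our own, since the paper does not say how the five folds were formed.

```python
# Candidate values of kappa: log2(kappa) = -17, -15, ..., 3.
log2_kappas = list(range(-17, 4, 2))
kappas = [2.0 ** e for e in log2_kappas]


def k_fold_indices(n, folds=5):
    """Hypothetical splitter: fold f validates on indices f, f + folds, f + 2 * folds, ..."""
    splits = []
    for f in range(folds):
        val = list(range(f, n, folds))
        val_set = set(val)
        train = [i for i in range(n) if i not in val_set]
        splits.append((train, val))
    return splits
```

Each candidate κ would be scored by its average validation cost over the five splits, and the best one retained for the final training run.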

We list the mean and the standard error of all test results in Table 1, with entries within one standard error of the lowest one marked in bold. With our reduction framework, all three binary learning algorithms could be better than SVOR-IMC with the Gaussian kernel on some of the data sets, which demonstrates that they achieve decent out-of-sample performance. Among the three algorithms, SVM-perceptron is significantly better than the other two.

Within the three SVM-based approaches, the two with the perceptron kernel are better than SVOR-IMC with the Gaussian kernel in test performance. Our direct reduction to the standard SVM performs similarly to SVOR-IMC with the same perceptron kernel, but is much easier to implement. In addition, our direct reduction is significantly faster than SVOR-IMC in training, which is illustrated in Figure 1 using the four largest data sets.^{4} The main cause of the time difference is the speedup heuristics. While, to the best of our knowledge, not much has been done to improve the original SVOR-IMC algorithm, plenty of heuristics, such as shrinking and advanced working set selection in LIBSVM, can be seamlessly adopted by our direct reduction. This difference demonstrates another advantage of our reduction framework: improvements to binary classification approaches can be immediately inherited by reduction-based ordinal regression algorithms.

[Figure 1: Training time (including automatic parameter selection) of the SVM-based approaches with the perceptron kernel. Bar chart of average training time in hours (vertical axis, 0 to 6) for the reduction and SVOR-IMC on the bank, computer, california, and census data sets.]

^{4} The results are averaged CPU time gathered on a 1.7G Dual Intel Xeon machine with 1GB of memory.

**6** **Conclusion**

We presented a reduction framework from ordinal regression to binary classification based on extended examples. The framework has the flexibility to work with any reasonable cost matrix and any binary classifier. We demonstrated the algorithmic advantages of the framework in designing new ordinal regression algorithms and explaining existing algorithms. We also showed that the framework can be used to derive new generalization bounds for ordinal regression. Furthermore, the usefulness of the framework was empirically validated by comparing three new algorithms constructed from our framework with the state-of-the-art SVOR-IMC algorithm.

**Acknowledgments**

We wish to thank Yaser S. Abu-Mostafa, Amrit Pratap, John Langford, and the anonymous reviewers for valuable discussions and comments. Ling Li was supported by the Caltech SISL Graduate Fellowship, and Hsuan-Tien Lin was supported by the Caltech EAS Division Fellowship.

**References**

[1] K. Crammer and Y. Singer. Pranking with ranking. In T. G. Dietterich, S. Becker, and Z. Ghahramani, eds., *Advances in Neural Information Processing Systems 14*, vol. 1, pp. 641–647. MIT Press, 2002.

[2] A. Shashua and A. Levin. Ranking with large margin principle: Two approaches. In S. Becker, S. Thrun, and K. Obermayer, eds., *Advances in Neural Information Processing Systems 15*, pp. 961–968. MIT Press, 2003.

[3] S. Rajaram, A. Garg, X. S. Zhou, and T. S. Huang. Classification approach towards ranking and sorting problems. In N. Lavrač, D. Gamberger, H. Blockeel, and L. Todorovski, eds., *Machine Learning: ECML 2003*, vol. 2837 of *Lecture Notes in Artificial Intelligence*, pp. 301–312. Springer-Verlag, 2003.

[4] W. Chu and S. S. Keerthi. New approaches to support vector ordinal regression. In L. D. Raedt and S. Wrobel, eds., *ICML 2005: Proceedings of the 22nd International Conference on Machine Learning*, pp. 145–152. Omnipress, 2005.

[5] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, eds., *Advances in Large Margin Classifiers*, chapter 7, pp. 115–132. MIT Press, 2000.

[6] E. Frank and M. Hall. A simple approach to ordinal classification. In L. D. Raedt and P. Flach, eds., *Machine Learning: ECML 2001*, vol. 2167 of *Lecture Notes in Artificial Intelligence*, pp. 145–156. Springer-Verlag, 2001.

[7] H.-T. Lin and L. Li. Large-margin thresholded ensembles for ordinal regression: Theory and practice. In J. L. Balcázar, P. M. Long, and F. Stephan, eds., *Algorithmic Learning Theory: ALT 2006*, vol. 4264 of *Lecture Notes in Artificial Intelligence*, pp. 319–333. Springer-Verlag, 2006.

[8] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. *Machine Learning*, 37(3):277–296, 1999.

[9] V. N. Vapnik. *The Nature of Statistical Learning Theory*. Springer-Verlag, 2nd edition, 1999.

[10] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, eds., *Advances in Kernel Methods: Support Vector Learning*, chapter 4, pp. 43–54. MIT Press, 1998.

[11] J. R. Quinlan. Induction of decision trees. *Machine Learning*, 1(1):81–106, 1986.

[12] H.-T. Lin and L. Li. Novel distance-based SVM kernels for infinite ensemble learning. In *Proceedings of the 12th International Conference on Neural Information Processing*, pp. 761–766, 2005.

[13] C.-C. Chang and C.-J. Lin. *LIBSVM: A library for support vector machines*, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.