
Large-Margin Thresholded Ensembles for Ordinal Regression: Theory and Practice

Hsuan-Tien Lin and Ling Li

Learning Systems Group, California Institute of Technology, USA
htlin@caltech.edu, ling@caltech.edu

Abstract. We propose a thresholded ensemble model for ordinal regression problems. The model consists of a weighted ensemble of confidence functions and an ordered vector of thresholds. We derive novel large-margin bounds of common error functions, such as the classification error and the absolute error. In addition to some existing algorithms, we also study two novel boosting approaches for constructing thresholded ensembles. Both approaches are not only simpler than existing algorithms, but also have a stronger connection to the large-margin bounds. In addition, they have performance comparable to SVM-based algorithms, but enjoy the benefit of faster training. Experimental results on benchmark datasets demonstrate the usefulness of our boosting approaches.

1 Introduction

Ordinal regression resides between multiclass classification and metric regression in the area of supervised learning. It has many applications in social science and information retrieval for matching human preferences. In an ordinal regression problem, examples are labeled with a set of K ≥ 2 discrete ranks, which, unlike general class labels, also carry ordering preferences. However, ordinal regression is not exactly the same as common metric regression, because the label set is of finite size and the metric distance between ranks is undefined.

Several approaches for ordinal regression have been proposed in recent years from a machine learning perspective. For example, Herbrich et al. [1] designed an algorithm with support vector machines (SVM). Other SVM formulations were first studied by Shashua and Levin [2], and some improved ones were later proposed by Chu and Keerthi [3]. Crammer and Singer [4] generalized the perceptron learning rule for ordinal regression in an online setting. These approaches are all extended from well-known binary classification algorithms [5]. In addition, they share a common property in predicting: the discrete rank comes from thresholding a continuous potential value, which represents an ordering preference.

Ideally, examples with higher ranks should have higher potential values.

In the special case of K = 2, ordinal regression is similar to binary classification [6]. If we interpret the similarity from the other side, the confidence function for a binary classifier can be naturally used as an ordering preference. For example, Freund et al. [7] proposed a boosting algorithm, RankBoost, that constructs an ensemble of those confidence functions to form a better ordering preference.

However, RankBoost was not specifically designed for ordinal regression. Hence, some effort is needed to apply RankBoost to ordinal regression.

In this work, we combine the ideas of thresholding and ensemble learning to propose a thresholded ensemble model for ordinal regression. In our model, potential values are computed from an ensemble of confidence functions, and then thresholded to obtain rank labels. It is well known that ensembles are useful and powerful in approximating complex functions for classification and metric regression [8]. Our model shall inherit the same advantages for ordinal regression.

Furthermore, we define margins for the thresholded ensemble model, and derive novel large-margin bounds of its out-of-sample error. The results indicate that large-margin thresholded ensembles could generalize well.

Algorithms for constructing thresholded ensembles are also studied. We not only combine RankBoost with a thresholding algorithm, but also propose two simpler boosting formulations, named ordinal regression boosting (ORBoost).

ORBoost formulations have stronger connections with the large-margin bounds that we derive, and are direct generalizations of the famous AdaBoost algorithm [9]. Experimental results demonstrate that ORBoost formulations share some good properties with AdaBoost. They usually outperform RankBoost, and have comparable performance to SVM-based algorithms.

This paper is organized as follows. Section 2 introduces ordinal regression, as well as the thresholded ensemble model. Large-margin bounds for thresholded ensembles are derived in Sect. 3. Then, an extended RankBoost algorithm and two ORBoost formulations, which construct thresholded ensembles, are discussed in Sect. 4. We show the experimental results in Sect. 5, and conclude in Sect. 6.

2 Thresholded Ensemble Model for Ordinal Regression

In an ordinal regression problem, we are given a set of training examples S = {(xn, yn)}, n = 1, . . . , N, where each input vector xn ∈ R^D is associated with an ordinal label (i.e., rank) yn. We assume that yn belongs to the set {1, 2, . . . , K}. The goal is to find an ordinal regression rule G(x) that predicts the rank y of an unseen input vector x. For the theoretical setting, we shall assume that all input-rank pairs are drawn i.i.d. from some unknown distribution D.

The setting above looks similar to that of a multiclass classification problem.

Hence, a general classification error,¹

EC(G, D) = E(x,y)∼D ⟦G(x) ≠ y⟧,

can be used to measure the performance of G. However, the classification error does not consider the ordering preference of the ranks. One naive interpretation of the ordering preference is as follows: for an example (x, y) with y = 4, if G1(x) = 3 and G2(x) = 1, then G1 is preferred over G2 on that example. A common practice to encode such a preference is to use the absolute error:

EA(G, D) = E(x,y)∼D |G(x) − y|.

¹ ⟦·⟧ = 1 when the inner condition is true, and 0 otherwise.

Next, we propose the thresholded ensemble model for ordinal regression. As the name suggests, the model has two components: a vector of thresholds, and an ensemble of confidence functions.

Thresholded models are widely used for ordinal regression [3, 4]. The thresholds can be thought of as estimated scales that reflect the discrete nature of ordinal regression. The ordinal regression rule, denoted as GH,θ, is illustrated in Fig. 1.

Here H(x) computes the potential value of x, and θ is a (K − 1) dimensional ordered vector that contains the thresholds (θ1 ≤ θ2 ≤ · · · ≤ θK−1). We shall denote GH,θ as Gθ when H is clear from the context. Then, if we let θ0 = −∞ and θK = ∞, the ordinal regression rule is

Gθ(x) = min {k : H(x) ≤ θk} = max {k : H(x) > θk−1} = 1 + Σ_{k=1}^{K−1} ⟦H(x) > θk⟧.

In the thresholded ensemble model, we take an ensemble of confidence functions to compute the potentials. That is,

H(x) = HT(x) = Σ_{t=1}^{T} αt ht(x),   αt ∈ R.

We shall assume that each confidence function ht comes from a hypothesis set H and has an output range of [−1, 1]. A special case of the confidence function, which only outputs −1 or 1, is called a binary classifier. Each confidence function reflects a possibly imperfect ordering preference. The ensemble linearly combines the ordering preferences with α. Note that we allow αt to be any real value, which means that it is possible to reverse the ordering preference of ht in the ensemble when necessary.

Ensemble models in general have been successfully used for classification and metric regression [8]. They not only introduce more stable predictions through the linear combination, but also provide sufficient power for approximating complex functions. These properties shall be inherited by the thresholded ensemble model for ordinal regression.
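To make the prediction rule concrete, here is a minimal sketch (not from the paper; the function and variable names such as `predict_rank`, `hypotheses`, and `thetas` are assumptions of this illustration) that computes the potential H(x) as the weighted sum of confidence functions and converts it to a rank by counting the thresholds it exceeds.

```python
import numpy as np

def predict_rank(x, hypotheses, alphas, thetas):
    """Thresholded ensemble rule: G(x) = 1 + #{k : H(x) > theta_k}.

    hypotheses : list of callables h_t(x) with outputs in [-1, 1]
    alphas     : list/array of real weights alpha_t
    thetas     : ordered array of K-1 thresholds (theta_1 <= ... <= theta_{K-1})
    """
    # Potential value H(x) = sum_t alpha_t * h_t(x)
    H = sum(a * h(x) for a, h in zip(alphas, hypotheses))
    # Rank = 1 + number of thresholds strictly below H(x)
    return 1 + int(np.sum(H > np.asarray(thetas)))

# Tiny usage example with two decision-stump-like confidence functions
if __name__ == "__main__":
    hs = [lambda x: 1.0 if x[0] > 0.5 else -1.0,
          lambda x: 1.0 if x[1] > 0.5 else -1.0]
    print(predict_rank(np.array([0.7, 0.2]), hs, [0.8, 0.5], [-1.0, 0.0, 1.0]))
```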

3 Large-Margin Bounds for Thresholded Ensembles

Margin is an important concept in structural risk minimization [10]. Many large-margin error bounds were proposed based on the intuition that large margins lead to good generalization. They are typically of the form

E1(G, D) ≤ E2(G, Su, ∆) + complexity term.

Here E1(G, D) is the generalization error of interest, such as EA(G, D). Su denotes the uniform distribution on the set S, and E2(G, Su, ∆) represents some training error with margin ∆, which will be further explained in this section.


Fig. 1. The thresholded model and the margins of a correctly-predicted example (the H(x) axis is split by the thresholds θ1, θ2, θ3 into the ranks Gθ(x) = 1, 2, 3, 4, with margins ρ1, ρ2, ρ3)

For ordinal regression, Herbrich et al. [1] derived a large-margin bound for a thresholded ordinal regression rule G. Unfortunately, the bound is quite restricted since it requires that E2(G, Su, ∆) = 0. In addition, the bound uses a definition of margin that has O(N²) terms, which makes it more complicated to design algorithms that relate to the bound. Another bound was derived by Shashua and Levin [2]. That bound is based on a margin definition of only O(KN) terms, and is applicable to the thresholded ensemble model. However, it is loose when T, the size of the ensemble, is large, because its complexity term grows with T.

In this section, we derive novel large-margin bounds of different error functions for the thresholded ensemble model. The bounds are extended from the results of Schapire et al. [11]. Our bounds are based on a margin definition of O(KN) terms. Similar to the results of Schapire et al., our bounds do not require E2(G, Su, ∆) = 0, and their complexity terms do not grow with T.

3.1 Margins

The margins with respect to a thresholded model are illustrated in Fig. 1. Intuitively, we expect the potential value H(x) to be in the correct interval (θy−1, θy], and we want H(x) to be far from the boundaries (thresholds):

Definition 1. Consider a given thresholded ensemble Gθ(x).

1. The margin of an example (x, y) with respect to θk is defined as

ρk(x, y) = H(x) − θk  if y > k,   and   ρk(x, y) = θk − H(x)  if y ≤ k.

2. The normalized margin ρ̄k(x, y) is defined as

ρ̄k(x, y) = ρk(x, y) / ( Σ_{t=1}^{T} |αt| + Σ_{k=1}^{K−1} |θk| ).

Definition 1 is similar to the definition by Shashua and Levin [2], which is analogous to the definition of margins in binary classification. A negative ρk(x, y) would indicate an incorrect prediction.

For each example (x, y), we can obtain (K − 1) margins from Definition 1.

However, two of them are the most important. The first one is ρy−1(x, y), which is the margin to the left (lower) boundary of the correct interval. The other is ρy(x, y), which is the margin to the right (upper) boundary. We will give them special names: the left-margin ρL(x, y) and the right-margin ρR(x, y). Note that by definition, ρL(x, 1) = ρR(x, K) = ∞.
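As an illustration of Definition 1 and the left/right margins (a sketch of my own, not code from the paper; the helper names are hypothetical), the margins and normalized margins can be computed as follows, with ρL(x, 1) and ρR(x, K) treated as +∞.

```python
import numpy as np

def margin(H_x, thetas, y, k):
    """rho_k(x, y) from Definition 1: H(x) - theta_k if y > k, else theta_k - H(x)."""
    return H_x - thetas[k - 1] if y > k else thetas[k - 1] - H_x

def normalized_margin(H_x, thetas, alphas, y, k):
    """rho_k divided by (sum_t |alpha_t| + sum_k |theta_k|)."""
    norm = np.sum(np.abs(alphas)) + np.sum(np.abs(thetas))
    return margin(H_x, thetas, y, k) / norm

def left_right_margins(H_x, thetas, y):
    """Left margin rho_{y-1} and right margin rho_y (infinite at the ends)."""
    K = len(thetas) + 1
    left = np.inf if y == 1 else margin(H_x, thetas, y, y - 1)
    right = np.inf if y == K else margin(H_x, thetas, y, y)
    return left, right
```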

∆-classification error: Next, we take a closer look at the error functions for thresholded ensemble models. If we make the minor assumption that the degenerate cases ρ̄R(x, y) = 0 occur with an infinitesimal probability,

EC(Gθ, D) = E(x,y)∼D ⟦Gθ(x) ≠ y⟧ = E(x,y)∼D ⟦ρ̄L(x, y) ≤ 0 or ρ̄R(x, y) ≤ 0⟧.

The definition could be generalized by expecting both margins to be larger than ∆. That is, define the ∆-classification error as

EC(Gθ, D, ∆) = E(x,y)∼D ⟦ρ̄L(x, y) ≤ ∆ or ρ̄R(x, y) ≤ ∆⟧.

Then, EC(Gθ, D) is just a special case with ∆ = 0.

∆-boundary error: The "or" operation of EC(Gθ, D, ∆) is not easy to handle in the proof of the coming bounds. An alternative choice is the ∆-boundary error:

EB(Gθ, D, ∆) = E(x,y)∼D
    ⟦ρ̄R(x, y) ≤ ∆⟧,                                if y = 1;
    ⟦ρ̄L(x, y) ≤ ∆⟧,                                if y = K;
    (1/2) · (⟦ρ̄L(x, y) ≤ ∆⟧ + ⟦ρ̄R(x, y) ≤ ∆⟧),     otherwise.

The ∆-boundary error and the ∆-classification error are equivalent up to a constant. That is, for any (Gθ, D, ∆),

(1/2) EC(Gθ, D, ∆) ≤ EB(Gθ, D, ∆) ≤ EC(Gθ, D, ∆).   (1)

∆-absolute error: We can analogously define the ∆-absolute error as

EA(Gθ, D, ∆) = E(x,y)∼D Σ_{k=1}^{K−1} ⟦ρ̄k(x, y) ≤ ∆⟧.

Then, if we assume that the degenerate cases ρk(x, y) = 0 happen with an infinitesimal probability, EA(Gθ, D) is just a special case with ∆ = 0.
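For concreteness, here is a hedged sketch (not taken from the paper) of the empirical ∆-errors on a training set, computed from a matrix of normalized margins; the array layout (one row of K−1 normalized margins per example) is an assumption of this illustration.

```python
import numpy as np

def empirical_delta_errors(norm_margins, labels, K, delta=0.0):
    """Empirical Delta-classification, Delta-boundary, and Delta-absolute errors.

    norm_margins : array of shape (N, K-1) with normalized margins rho_bar_k
    labels       : array of shape (N,) with ranks in {1, ..., K}
    """
    N = len(labels)
    e_class = e_bound = e_abs = 0.0
    for rho, y in zip(norm_margins, labels):
        left = np.inf if y == 1 else rho[y - 2]    # rho_bar_{y-1}
        right = np.inf if y == K else rho[y - 1]   # rho_bar_y
        e_class += float(left <= delta or right <= delta)
        if y == 1:
            e_bound += float(right <= delta)
        elif y == K:
            e_bound += float(left <= delta)
        else:
            e_bound += 0.5 * (float(left <= delta) + float(right <= delta))
        e_abs += float(np.sum(rho <= delta))       # counts all K-1 margins
    return e_class / N, e_bound / N, e_abs / N
```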

3.2 Large-Margin Bounds

An important observation for deriving our bounds is that EB and EA can be written with respect to an additional sampling of k. For example,

EA(Gθ, D, ∆) = (K − 1) · E(x,y)∼D, k∼{1,...,K−1}u ⟦ρ̄k(x, y) ≤ ∆⟧.


Equivalently, we can define a distribution D̂ from D and {1, . . . , K − 1}u to generate the tuple (x, y, k). Then EA(Gθ, D) is simply the portion of nonpositive ρ̄k(x, y) under D̂. Consider an extended training set Ŝ = {(xn, yn, k)} with N(K − 1) elements. Each element is a possible outcome from D̂. Note, however, that these elements are not all independent. For example, (xn, yn, 1) and (xn, yn, 2) are dependent. Thus, we cannot directly use the whole Ŝ as a set of i.i.d. outcomes from D̂.

Fortunately, some subsets of Ŝ contain independent outcomes from D̂. One way to extract such a subset is to choose one kn from {1, . . . , K − 1}u for each example (xn, yn) independently. The resulting subset is named T = {(xn, yn, kn)}, n = 1, . . . , N. Then, we can obtain a large-margin bound of the absolute error:

Theorem 1. Consider a set H that contains only binary classifiers, is negation-complete,² and has VC-dimension d. Let δ > 0, and N > d + K − 1 = d̂. Then, with probability at least 1 − δ over the random choice of the training set S, every thresholded ensemble Gθ(x), where the associated H is constructed with h ∈ H, satisfies the following bound for all ∆ > 0:

EA(Gθ, D) ≤ EA(Gθ, Su, ∆) + O( (√K / √N) · ( d̂ log²(N/d̂) / ∆² + log(1/δ) )^{1/2} ).

Proof. The key is to reduce the ordinal regression problem to a binary classification problem, which consists of training examples derived from (xn, yn, kn) ∈ T:

(Xn, Yn) = ( (xn, 1kn), +1 ) if yn > kn;   (Xn, Yn) = ( (xn, 1kn), −1 ) if yn ≤ kn,   (2)

where 1m is a vector of length (K − 1) with a single 1 at the m-th dimension and 0 elsewhere. The test examples are constructed similarly with (x, y, k) ∼ D̂. Then, large-margin bounds for the ordinal regression problem can be inferred from those for the binary classification problem, as shown in Appendix A. ⊓⊔
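A minimal sketch of the reduction in (2) (my own illustration; the array names are assumptions, and the binary learner itself is left abstract): each ordinal example is expanded into K−1 binary examples whose extra coordinates encode the threshold index. This builds the full extended set Ŝ; the proof of Theorem 1 samples only one kn per example.

```python
import numpy as np

def reduce_to_binary(X, y, K):
    """Expand ordinal examples into binary ones as in (2).

    X : array of shape (N, D); y : array of ranks in {1, ..., K}.
    Returns (Xb, Yb) where each row of Xb is (x_n, 1_k) and Yb is in {+1, -1}.
    """
    N, D = X.shape
    rows, labels = [], []
    for n in range(N):
        for k in range(1, K):
            one_hot = np.zeros(K - 1)
            one_hot[k - 1] = 1.0               # the vector 1_k
            rows.append(np.concatenate([X[n], one_hot]))
            labels.append(+1 if y[n] > k else -1)
    return np.array(rows), np.array(labels)
```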

Similarly, if we look at the boundary error,

EB(Gθ, D, ∆) = E(x,y)∼D, k∼By ⟦ρ̄k(x, y) ≤ ∆⟧,

for some distribution By on {L, R}. Then, a similar proof leads to

Theorem 2. Under the same conditions as in Theorem 1,

EB(Gθ, D) ≤ EB(Gθ, Su, ∆) + O( (1/√N) · ( d̂ log²(N/d̂) / ∆² + log(1/δ) )^{1/2} ).

Then, a large-margin bound of the classification error can immediately be derived by applying (1).

² h ∈ H ⟺ (−h) ∈ H, where (−h)(x) = −h(x) for all x.


Corollary 1. Under the same conditions as in Theorem 1,

EC(Gθ, D) ≤ 2 EC(Gθ, Su, ∆) + O( (1/√N) · ( d̂ log²(N/d̂) / ∆² + log(1/δ) )^{1/2} ).

Similar bounds can be derived with another large-margin theorem [11, Theorem 4] when H contains confidence functions rather than binary classifiers.

These bounds provide motivations for building algorithms with margin-related formulations.

4 Boosting Algorithms for Thresholded Ensembles

The bounds in the previous section are applicable to thresholded ensembles generated by any algorithm. One possible algorithm, for example, is an SVM-based approach [3] with special kernels [12]. In this section, we focus on another branch of approaches: boosting. Boosting approaches can iteratively grow the ensemble H(x), and have been successful in classification and metric regression [8].

Our study includes an extension to the RankBoost algorithm [7] and two novel formulations that we propose.

4.1 RankBoost for Ordinal Regression

RankBoost [7] constructs a weighted ensemble of confidence functions based on the following large-margin concept: for each pair (i, j) such that yi > yj, the difference between their potential values, Ht(xi) − Ht(xj), is desired to be positive and large. Thus, in the t-th iteration, the algorithm chooses (ht, αt) to approximately minimize

Σ_{yi>yj} exp( −Ht−1(xi) − αt ht(xi) + Ht−1(xj) + αt ht(xj) ).   (3)
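For concreteness, here is a small sketch (my own, not the authors' implementation) of the pairwise exponential loss (3) evaluated for a candidate weight αt, given the current potentials Ht−1(xn). The O(N²) double loop mirrors the number of pairwise terms in (3).

```python
import numpy as np

def rankboost_pair_loss(H_prev, h_vals, alpha, y):
    """Pairwise exponential loss (3) for a candidate weight alpha.

    H_prev : array (N,) of current potentials H_{t-1}(x_n)
    h_vals : array (N,) of h_t(x_n) values in [-1, 1]
    y      : array (N,) of ranks
    """
    H_new = H_prev + alpha * h_vals
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                loss += np.exp(-(H_new[i] - H_new[j]))
    return loss
```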

Our efforts in extending RankBoost for ordinal regression are discussed as follows:

Computing αt: Two approaches can be used to determine αt in RankBoost [7]:

1. Obtain the optimal αt by numerical search (for confidence functions) or by an analytical solution (for binary classifiers).

2. Minimize an upper bound of (3).

If ht(xn) is monotonic with respect to yn, the optimal αt obtained from approach 1 is ∞, and one single ht would dominate the ensemble. This situation not only makes the ensemble less stable, but also limits its power. For example, if the (yn, ht(xn)) pairs for four examples are (1, −1), (2, 0), (3, 1), and (4, 1), ranks 3 and 4 on the last two examples cannot be distinguished by ht. We have frequently observed such a degenerate situation, called partial matching, in real-world experiments, even when ht is as simple as a decision stump. Thus, we shall use approach 2 in our experiments. Note, however, that when partial matching happens, the magnitude of αt from approach 2 can still be relatively large, and may cause numerical difficulties.

Obtaining θ: After RankBoost computes a potential function H(x), a reasonable way to obtain the thresholds based on the training examples is

θ = argmin_ϑ EA(Gϑ, Su).   (4)

The combination of RankBoost and the absolute error criterion (4) is called RankBoost-AE. The optimal range of each ϑk can be efficiently determined by dynamic programming. For simplicity and stability, we assign θk to be the middle value of the optimal range. The algorithm that aims at EC instead of EA can be derived similarly.
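The following sketch is a deliberately simplified stand-in for criterion (4): the paper determines the optimal ranges by dynamic programming, whereas this illustration (my own simplification, not the authors' algorithm) scans candidate cut points between sorted potentials for each θk independently and ignores the ordering constraint.

```python
import numpy as np

def pick_thresholds(H_vals, y, K):
    """Choose each theta_k to minimize the count of examples on the wrong
    side of theta_k (a per-threshold surrogate for criterion (4)).

    Ignores the ordering constraint that the paper's dynamic-programming
    procedure handles; intended only as an illustrative baseline.
    """
    order = np.argsort(H_vals)
    H_sorted, y_sorted = H_vals[order], y[order]
    # Candidate cut points: midpoints between consecutive sorted potentials,
    # plus one candidate below and one above all potentials.
    cuts = (H_sorted[:-1] + H_sorted[1:]) / 2.0
    cuts = np.concatenate(([H_sorted[0] - 1.0], cuts, [H_sorted[-1] + 1.0]))
    thetas = []
    for k in range(1, K):
        # An example is misranked at k if (y > k and H <= theta) or (y <= k and H > theta).
        errors = [np.sum((y_sorted > k) & (H_sorted <= c)) +
                  np.sum((y_sorted <= k) & (H_sorted > c)) for c in cuts]
        thetas.append(cuts[int(np.argmin(errors))])
    return np.array(thetas)
```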

4.2 Ordinal Regression Boosting with Left-Right Margins

The idea of ordinal regression boosting comes from the definition of margins in Sect. 3. As indicated by our bounds, we want the margins to be as large as possible. To achieve this goal, our algorithms, similar to AdaBoost, work on minimizing the exponential margin loss.

First, we introduce a simple formulation called ordinal regression boosting with left-right margins (ORBoost-LR), which tries to minimize

Σ_{n=1}^{N} [ e^{−ρL(xn,yn)} + e^{−ρR(xn,yn)} ].   (5)

The formulation can be thought of as maximizing the soft-min of the left- and right-margins. Similar to RankBoost, the minimization is performed in an iterative manner. In each iteration, a confidence function ht is chosen, its weight αt is computed, and the vector θ is updated. If we plug the margin definition into (5), we can see that the iteration steps should be designed to approximately minimize

Σ_{n=1}^{N} [ ϕn e^{αt ht(xn) − θ_{yn}} + ϕn^{−1} e^{θ_{yn−1} − αt ht(xn)} ],   (6)

where ϕn = e^{Ht−1(xn)}. Next, we discuss these three steps in detail.

Choosing ht: Mason et al. [13] explained AdaBoost as a gradient descent technique in function space. We derive ORBoost-LR using the same technique.

We first choose a confidence function ht that is close to the negative gradient:

ht = argmin_{h∈H} Σ_{n=1}^{N} h(xn) · ( ϕn e^{−θ_{yn}} − ϕn^{−1} e^{θ_{yn−1}} ).

This step can be performed with the help of another learning algorithm, called the base learner.


Computing αt: Similar to RankBoost, we minimize an upper bound of (6), which is based on a piecewise-linear approximation of e^x for x ∈ [−1, 0] and x ∈ [0, 1]. The bound can be written as W+ e^α + W− e^{−α}, with

W+ = Σ_{ht(xn)>0} ht(xn) ϕn e^{−θ_{yn}} − Σ_{ht(xn)<0} ht(xn) ϕn^{−1} e^{θ_{yn−1}},

W− = Σ_{ht(xn)>0} ht(xn) ϕn^{−1} e^{θ_{yn−1}} − Σ_{ht(xn)<0} ht(xn) ϕn e^{−θ_{yn}}.

Then, the optimal αt for the bound can be computed as αt = (1/2) log(W− / W+).

Note that the upper bound is equal to (6) if ht(xn) ∈ {−1, 0, 1}. Thus, when ht is a binary classifier, the optimal αt can be determined exactly. Another remark here is that αt is finite under some mild conditions which make both W+ and W− positive. Thus, unlike RankBoost, ORBoost-LR rarely sets αt to ∞.
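A hedged sketch of this step (my own reading of the W+ and W− formulas above, with assumed array names; not the authors' code):

```python
import numpy as np

def orboost_lr_alpha(h_vals, phi, thetas, y):
    """Compute W+, W-, and alpha_t = 0.5 * log(W- / W+) for ORBoost-LR.

    h_vals : array (N,) of h_t(x_n) in [-1, 1]
    phi    : array (N,) of phi_n = exp(H_{t-1}(x_n))
    thetas : array (K-1,) of current thresholds
    y      : array (N,) of ranks in {1, ..., K}
    """
    K = len(thetas) + 1
    # Right-margin factor phi_n * exp(-theta_{y_n}); theta_K = +inf makes it 0.
    right = np.where(y < K, np.exp(-thetas[np.minimum(y, K - 1) - 1]), 0.0) * phi
    # Left-margin factor phi_n^{-1} * exp(theta_{y_n - 1}); theta_0 = -inf makes it 0.
    left = np.where(y > 1, np.exp(thetas[np.maximum(y - 1, 1) - 1]), 0.0) / phi
    pos, neg = h_vals > 0, h_vals < 0
    w_plus = np.sum(h_vals[pos] * right[pos]) - np.sum(h_vals[neg] * left[neg])
    w_minus = np.sum(h_vals[pos] * left[pos]) - np.sum(h_vals[neg] * right[neg])
    return w_plus, w_minus, 0.5 * np.log(w_minus / w_plus)
```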

Updating θ: Note that when the pair (ht, αt) is fixed, (6) can be reorganized as Σ_{k=1}^{K−1} ( Wk,+ e^{θk} + Wk,− e^{−θk} ). Then, each θk can be computed analytically, uniquely, and independently. However, when each θk is updated independently, the thresholds may not be ordered. Hence, we propose to add an additional ordering constraint to (6). That is, we choose θ by solving

min_ϑ Σ_{k=1}^{K−1} ( Wk,+ e^{ϑk} + Wk,− e^{−ϑk} )   (7)
s.t. ϑ1 ≤ ϑ2 ≤ · · · ≤ ϑK−1.

An efficient algorithm for solving (7) can be obtained by a simple modification of the pool adjacent violators (PAV) algorithm for isotonic regression [14].
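A rough sketch of such a PAV-style update (my own reading; the paper's exact modification of PAV may differ in details): compute the unconstrained minimizer θk = (1/2) log(Wk,−/Wk,+) for each threshold and pool adjacent violating blocks, using the summed W's within each pooled block.

```python
import numpy as np

def ordered_thresholds(W_plus, W_minus):
    """Minimize sum_k (W_{k,+} e^{v_k} + W_{k,-} e^{-v_k}) subject to
    v_1 <= ... <= v_{K-1}, via PAV-style pooling of adjacent violators.

    Unconstrained per-k solution: v_k = 0.5 * log(W_{k,-} / W_{k,+});
    a pooled block uses the sums of its W's.  Sketch only.
    """
    blocks = [[wp, wm, [k]] for k, (wp, wm) in enumerate(zip(W_plus, W_minus))]
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks) - 1):
            v_i = 0.5 * np.log(blocks[i][1] / blocks[i][0])
            v_j = 0.5 * np.log(blocks[i + 1][1] / blocks[i + 1][0])
            if v_i > v_j:                      # adjacent violator: pool the blocks
                wp = blocks[i][0] + blocks[i + 1][0]
                wm = blocks[i][1] + blocks[i + 1][1]
                ks = blocks[i][2] + blocks[i + 1][2]
                blocks[i:i + 2] = [[wp, wm, ks]]
                merged = True
                break
    thetas = np.empty(len(W_plus))
    for wp, wm, ks in blocks:
        thetas[ks] = 0.5 * np.log(wm / wp)
    return thetas
```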

Combination of the steps: ORBoost-LR combines the three steps above sequentially in each iteration. Note that after ht is determined, αt and θ can be either jointly optimized or cyclically updated. However, we found that joint or cyclic optimization does not always improve performance, and could sometimes cause ORBoost-LR to overfit. Thus, we only execute each step once in each iteration.

4.3 Ordinal Regression Boosting with All Margins

ORBoost with all margins (ORBoost-All) operates on

Σ_{n=1}^{N} Σ_{k=1}^{K−1} e^{−ρk(xn,yn)}   (8)

instead of (6). The derivations for the three steps are almost the same as for ORBoost-LR. We shall just make some remarks.

Updating θ: When using (8) to update the thresholds, we have proved that each θk can be updated uniquely and independently, while still remaining ordered [5].

Thus, we do not need to implement the PAV algorithm for ORBoost-All.
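Under the assumption (my own reading of (8), not spelled out in the text) that (8) reorganizes per threshold into Wk,+ e^{θk} + Wk,− e^{−θk} with Wk,+ = Σ_{yn>k} e^{−H(xn)} and Wk,− = Σ_{yn≤k} e^{H(xn)}, the independent update has the closed form sketched below.

```python
import numpy as np

def orboost_all_update_thetas(H_vals, y, K):
    """Closed-form threshold update for ORBoost-All.

    For each k, (8) contributes W_{k,+} e^{theta_k} + W_{k,-} e^{-theta_k} with
    W_{k,+} = sum_{y_n > k} exp(-H(x_n)) and W_{k,-} = sum_{y_n <= k} exp(H(x_n)),
    minimized at theta_k = 0.5 * log(W_{k,-} / W_{k,+}).
    Assumes every rank is represented so both sums are nonzero.
    """
    thetas = np.empty(K - 1)
    for k in range(1, K):
        w_plus = np.sum(np.exp(-H_vals[y > k]))   # coefficient of e^{+theta_k}
        w_minus = np.sum(np.exp(H_vals[y <= k]))  # coefficient of e^{-theta_k}
        thetas[k - 1] = 0.5 * np.log(w_minus / w_plus)
    return thetas
```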


Relationship between algorithm and theory: A simple relation is that for any ∆, e^{−A·ρ̄k(xn,yn)} is an upper bound of e^{−A∆} · ⟦ρ̄k(xn, yn) ≤ ∆⟧. If we take A to be the normalization term of ρ̄k, we can see that

– ORBoost-All works on minimizing an upper bound of EA(Gθ, Su, ∆);

– ORBoost-LR works on minimizing an upper bound of EB(Gθ, Su, ∆), or (1/2) EC(Gθ, Su, ∆).

ORBoost-All not only minimizes an upper bound, but provably also minimizes the term EA(Gθ, Su, ∆) exponentially fast with a sufficiently strong choice of ht. The proof relies on an extension of the training error theorem of AdaBoost [11, Theorem 5]. A similar proof can be used for ORBoost-LR.

Connection to other algorithms: ORBoost approaches are direct generalizations of AdaBoost from the gradient descent optimization point of view. In the special case of K = 2, both ORBoost approaches are almost the same as AdaBoost with an additional term θ1. Note that the term θ1 can be thought of as the coefficient of a constant classifier. Interestingly, Rudin et al. [6] proved the connection between RankBoost and AdaBoost when a constant classifier is included in the ensemble. Thus, when K = 2, RankBoost-AE, ORBoost-LR, and ORBoost-All all share some similarity with AdaBoost.

ORBoost formulations also have connections with SVM-based algorithms. In particular, ORBoost-LR is a counterpart of SVM with explicit constraints (SVM-EXC), and ORBoost-All is related to SVM with implicit constraints (SVM-IMC) [3]. These connections follow closely from the links between AdaBoost and SVM [12, 15].

5 Experiments

In this section, we compare the three boosting formulations for constructing the thresholded ensemble model. We also compare these formulations with SVM-based algorithms.

Two sets of confidence functions are used in the experiments. The first one is the set of perceptrons {sign(w^T x + b) : w ∈ R^D, b ∈ R}. The RCD-bias algorithm is known to work well with AdaBoost [16], and is adopted as our base learner.

The second set is {tanh(w^T x + b) : w^T w + b² = γ²}, which contains normalized sigmoid functions. Note that sigmoid functions smooth the output of perceptrons, and the smoothness is controlled by the parameter γ. We use a naive base learner for normalized sigmoid functions as follows: RCD-bias is first performed to get a perceptron. Then, the weights and bias of the perceptron are normalized, and the outputs are smoothed. Throughout the experiments we use γ = 4, which was picked with a few experimental runs on some datasets.
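A small sketch of the output stage of this base learner (my own illustration; `w` and `b` are assumed to come from a perceptron learner such as RCD-bias, which is not reproduced here):

```python
import numpy as np

def normalized_sigmoid(w, b, gamma=4.0):
    """Rescale a perceptron (w, b) so that w'w + b'^2 = gamma^2, and return
    the smoothed confidence function tanh(w'.x + b') with outputs in (-1, 1)."""
    scale = gamma / np.sqrt(np.dot(w, w) + b ** 2)
    w_s, b_s = scale * w, scale * b
    return lambda x: np.tanh(np.dot(w_s, x) + b_s)
```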

5.1 Artificial Dataset

We first verify that the idea of the thresholded ensemble model works with an artificial 2-D dataset (Fig. 2(a)).


Fig. 2. An artificial 2-D dataset and the learned boundaries with ORBoost-All: (a) the target, (b) with perceptron, (c) with sigmoid

Figure 2(b) depicts the separating boundaries of the thresholded ensemble of 200 perceptrons constructed by ORBoost-All. By combining perceptrons, ORBoost-All works reasonably well in approximating the nonlinear boundaries. A similar plot can be obtained with ORBoost-LR.

RankBoost-AE cannot perform well on this dataset because it runs into numerical difficulties (see Subsect. 4.1) after only 5 iterations.

If we use a thresholded ensemble of 200 normalized sigmoid functions, it is observed that ORBoost-All, ORBoost-LR, and RankBoost-AE perform similarly.

The result of ORBoost-All (Fig. 2(c)) shows that the separating boundaries are much smoother because each sigmoid function is smooth. As we shall discuss later, the smoothness can be important for some ordinal regression problems.

5.2 Benchmark Datasets

Next, we perform experiments with eight benchmark datasets³ that were used by Chu and Keerthi [3]. The datasets are quantized from some metric regression datasets. We use the same K = 10, the same "training/test" partition ratio, and also average the results over 20 trials. Thus, we can compare RankBoost and ORBoost fairly with the SVM-based results of Chu and Keerthi [3].

The results on the abalone dataset with T up to 2000 are given in Fig. 3. The training errors are shown in the top plots, while the test errors are shown in the bottom plots. Based on these results, we have several remarks:

RankBoost vs. ORBoost: RankBoost-AE can usually decrease both the training classification and the training absolute errors faster than the ORBoost algorithms. However, this property often leads to consistently worse test errors than both ORBoost-LR and ORBoost-All. An explanation is that although the RankBoost ensemble orders the training examples well, the current estimate of θ is not used to decide (ht, αt). Thus, the two components (HT, θ) of the thresholded ensemble model are not jointly considered, and the greediness in constructing only HT results in overfitting. In contrast, ORBoost-LR and ORBoost-All take into consideration the current θ in choosing (ht, αt) and the current HT in updating θ. Hence, a better pair (HT, θ) could be obtained.

³ pyrimidines, machineCPU, boston, abalone, bank, computer, california, and census.


Fig. 3. Errors on the abalone dataset over 20 runs: training classification error, training absolute error, test classification error, and test absolute error as functions of T, for ORBoost-LR, ORBoost-All, and RankBoost-AE with perceptron and sigmoid confidence functions

ORBoost-LR vs. ORBoost-All: Both ORBoost formulations inherit a good property from AdaBoost: they are not very vulnerable to overfitting. ORBoost-LR is better on test classification errors, while ORBoost-All is better on test absolute errors. This is partially justified by our discussion in Subsect. 4.3 that the two formulations minimize different margin-related upper bounds. A similar observation was made by Chu and Keerthi [3] on the SVM-EXC and SVM-IMC algorithms.

Note, however, that ORBoost-LR with perceptrons minimizes the training classification error more slowly than ORBoost-All on this dataset, because the additional ordering constraint on θ in ORBoost-LR slows down the convergence.

Perceptron vs. sigmoid: Formulations with sigmoid functions have consistently higher training error, which is due to the naive choice of base learner and the approximation of αt. However, the best test performance is also achieved with sigmoid functions. One possible reason is that the abalone dataset is quantized from a metric regression dataset, and hence contains some properties such as smoothness of the boundaries. If we only use binary classifiers like perceptrons, as depicted in Fig. 2(b), the boundaries would not be as smooth, and more errors may happen. Thus, for ordinal regression datasets that are quantized from metric regression datasets, smooth confidence functions may be more useful than discrete binary classifiers.


Table 1. Test classification error of ordinal regression algorithms

data   RankBoost-AE                ORBoost-LR                  ORBoost-All                 SVM-EXC [3]
set    perceptron    sigmoid       perceptron    sigmoid       perceptron    sigmoid
pyr.   0.758±0.015   0.767±0.020   0.731±0.019   0.731±0.018   0.744±0.019   0.735±0.017   0.752±0.014
mac.   0.717±0.022   0.669±0.011   0.610±0.009   0.633±0.011   0.605±0.010   0.625±0.014   0.661±0.012
bos.   0.603±0.006   0.578±0.008   0.580±0.006   0.549±0.007   0.579±0.006   0.558±0.006   0.569±0.006
aba.   0.759±0.001   0.765±0.002   0.740±0.002   0.716±0.002   0.749±0.002   0.731±0.002   0.736±0.002
ban.   0.805±0.001   0.822±0.001   0.767±0.001   0.777±0.002   0.771±0.001   0.776±0.001   0.744±0.001
com.   0.598±0.002   0.616±0.001   0.498±0.001   0.491±0.001   0.499±0.001   0.505±0.001   0.462±0.001
cal.   0.741±0.001   0.690±0.001   0.628±0.001   0.605±0.001   0.626±0.001   0.618±0.001   0.640±0.001
cen.   0.808±0.001   0.780±0.001   0.718±0.001   0.694±0.001   0.722±0.001   0.701±0.001   0.699±0.000

(results that are within one standard error of the best are marked in bold)

Table 2. Test absolute error of ordinal regression algorithms

data   RankBoost-AE                ORBoost-LR                  ORBoost-All                 SVM-IMC [3]
set    perceptron    sigmoid       perceptron    sigmoid       perceptron    sigmoid
pyr.   1.619±0.078   1.590±0.077   1.340±0.049   1.402±0.052   1.360±0.046   1.398±0.052   1.294±0.046
mac.   1.573±0.191   1.282±0.034   0.897±0.019   0.985±0.018   0.889±0.019   0.969±0.025   0.990±0.026
bos.   0.842±0.014   0.829±0.014   0.788±0.013   0.758±0.015   0.791±0.013   0.777±0.015   0.747±0.011
aba.   1.517±0.005   1.738±0.008   1.442±0.004   1.537±0.007   1.432±0.003   1.403±0.004   1.361±0.003
ban.   1.867±0.004   2.183±0.007   1.507±0.002   1.656±0.005   1.490±0.002   1.539±0.002   1.393±0.002
com.   0.841±0.003   0.945±0.004   0.631±0.002   0.634±0.003   0.626±0.002   0.634±0.002   0.596±0.002
cal.   1.528±0.006   1.251±0.004   1.042±0.004   0.956±0.002   0.977±0.002   0.942±0.002   1.008±0.001
cen.   2.008±0.006   1.796±0.005   1.305±0.003   1.262±0.003   1.265±0.002   1.198±0.002   1.205±0.002

(results that are within one standard error of the best are marked in bold)


We list the means and standard errors of all test results with T = 2000 in Tables 1 and 2. Consistent with the results on the abalone dataset, RankBoost-AE almost always performs the worst; ORBoost-LR is better on classification errors, and ORBoost-All is slightly better on absolute errors. When compared with SVM-EXC on classification errors and SVM-IMC on absolute errors [3], both ORBoost formulations have errors similar to those of the SVM-based algorithms.

Note, however, that ORBoost formulations with perceptrons or sigmoid functions are much faster. On the census dataset, which contains 6000 training examples, it takes about an hour for ORBoost to finish one trial, but the SVM-based approaches, which include a time-consuming automatic parameter selection step, need more than four days. With comparable performance and significantly less computational cost, ORBoost could be a useful tool for large datasets.

6 Conclusion

We proposed a thresholded ensemble model for ordinal regression, and defined margins for the model. Novel large-margin bounds of common error functions were proved. We studied three algorithms for obtaining thresholded ensembles.

The first algorithm, RankBoost-AE, combines RankBoost and a thresholding algorithm. In addition, we designed two new boosting approaches, ORBoost-LR and ORBoost-All, which have close connections with the large-margin bounds.

ORBoost formulations are direct extensions of AdaBoost, and inherit its advantage of being less vulnerable to overfitting.

Experimental results demonstrated that ORBoost formulations have superior performance over RankBoost-AE. In addition, they are comparable to SVM-based algorithms in terms of test error, but enjoy the advantage of faster training. These properties make ORBoost formulations favorable over SVM-based algorithms on large datasets.

ORBoost formulations can be equipped with any base learner for confidence functions. In this work, we studied perceptrons and normalized sigmoid functions. Future work could explore other confidence functions for ORBoost, or extend other boosting approaches to perform ordinal regression.

Acknowledgment

We thank Yaser S. Abu-Mostafa, Amrit Pratap, and the anonymous reviewers for helpful comments. Hsuan-Tien Lin is supported by the Caltech Division of Engineering and Applied Science Fellowship.

References

1. Herbrich, R., Graepel, T., Obermayer, K.: Large margin rank boundaries for ordinal regression. In: Advances in Large Margin Classifiers. MIT Press (2000) 115–132

2. Shashua, A., Levin, A.: Ranking with large margin principle: Two approaches. In: Advances in Neural Information Processing Systems 15, MIT Press (2003) 961–968

3. Chu, W., Keerthi, S.S.: New approaches to support vector ordinal regression. In: Proceedings of ICML 2005, Omnipress (2005) 145–152

4. Crammer, K., Singer, Y.: Online ranking by projecting. Neural Computation 17 (2005) 145–175

5. Li, L., Lin, H.T.: Ordinal regression by extended binary classification. Under review (2007)

6. Rudin, C., Cortes, C., Mohri, M., Schapire, R.E.: Margin-based ranking meets boosting in the middle. In: Learning Theory: COLT 2005, Springer-Verlag (2005) 63–78

7. Freund, Y., Iyer, R., Schapire, R.E., Singer, Y.: An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research 4 (2003) 933–969

8. Meir, R., Rätsch, G.: An introduction to boosting and leveraging. In: Advanced Lectures on Machine Learning. Springer-Verlag (2003) 118–183

9. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Machine Learning: ICML 1996, Morgan Kaufmann (1996) 148–156

10. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag (1995)

11. Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics 26 (1998) 1651–1686

12. Lin, H.T., Li, L.: Infinite ensemble learning with support vector machines. In: Machine Learning: ECML 2005, Springer-Verlag (2005) 242–254

13. Mason, L., Baxter, J., Bartlett, P., Frean, M.: Functional gradient techniques for combining hypotheses. In: Advances in Large Margin Classifiers. MIT Press (2000) 221–246

14. Robertson, T., Wright, F.T., Dykstra, R.L.: Order Restricted Statistical Inference. John Wiley & Sons (1988)

15. Rätsch, G., Mika, S., Schölkopf, B., Müller, K.R.: Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 1184–1199

16. Li, L.: Perceptron learning with random coordinate descent. Technical Report CaltechCSTR:2005.006, California Institute of Technology (2005)

A Proof of Theorem 1

As shown in (2), we first construct a transformed binary problem. The problem is then modeled by an ensemble function F(X) defined on a base space

F = H ∪ {sk : k = 1, . . . , K − 1}.

Here sk(X) = −sign(X_{D+k} − 0.5) is a decision stump on dimension (D + k). It is not hard to show that the VC-dimension of F is no more than d̂ = d + K − 1.

Without loss of generality, we normalize Gθ(x) such that Σ_{t=1}^{T} |αt| + Σ_{k=1}^{K−1} |θk| = 1. Then, consider the associated ensemble function

F(X) = Σ_{t=1}^{T} αt ht(X) + Σ_{k=1}^{K−1} θk sk(X).

An important property of the transform is that for every (X, Y) derived from the tuple (x, y, k), Y F(X) = ρ̄k(x, y).

Because T contains N i.i.d. outcomes from D̂, the large-margin theorem [11, Theorem 2] states that with probability at least 1 − δ/2 over the choice of T,

E(x,y,k)∼D̂ ⟦Y F(X) ≤ 0⟧ ≤ (1/N) Σ_{n=1}^{N} ⟦Yn F(Xn) ≤ ∆⟧ + O( (1/√N) · ( d̂ log²(N/d̂) / ∆² + log(1/δ) )^{1/2} ).   (9)

Since Y F(X) = ρ̄k(x, y), the left-hand side is (1/(K−1)) · EA(Gθ, D).

Let bn = ⟦Yn F(Xn) ≤ ∆⟧ = ⟦ρ̄kn(xn, yn) ≤ ∆⟧, which is a Boolean random variable. An extended Chernoff bound shows that when each bn is chosen independently, with probability at least 1 − δ/2 over the choice of bn,

(1/N) Σ_{n=1}^{N} bn ≤ (1/N) Σ_{n=1}^{N} E_{kn∼{1,...,K−1}u} bn + O( (1/√N) · ( log(1/δ) )^{1/2} ).   (10)

The desired result can be obtained by combining (9) and (10) with a union bound and E_{kn∼{1,...,K−1}u} bn = (1/(K−1)) · EA(Gθ, Su, ∆). ⊓⊔
